Move to Parquet for storage. #29

@polymboinker

Description

Feature request: move the live data pipeline from CSV to Parquet

The current process_live.py flow is not viable for large datasets because it reads and filters huge CSV files into
memory. In my local checkout, goldsky/orderFilled.csv is about 299 GB and processed/trades.csv is about 238 GB, which
can exceed available RAM even with the proposed “lazy” script.

Request:

  • Replace CSV storage for live pipeline data with Parquet.
  • Store raw Goldsky events as partitioned Parquet, e.g. goldsky/order_filled/date=YYYY-MM-DD/*.parquet.
  • Store processed trades as partitioned Parquet, e.g. processed/trades/date=YYYY-MM-DD/*.parquet.
  • Migrate markets.csv and missing_markets.csv to Parquet as well.
  • Add a one-time migration script for existing CSV data (see the sketch after this list).
  • Make process_live.py process bounded Parquet chunks/partitions instead of collecting the whole dataset.
  • Add configurable threading, ideally via a --threads flag, supporting up to 40 threads.
  • Preserve resumability with state files instead of scanning/tailing huge CSVs.
  • Keep automatic missing-market discovery/fetching.
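
A minimal migration sketch, assuming the raw CSV has an epoch-seconds timestamp column to partition on; the column name and chunk naming here are placeholders, not a finished script. It streams the CSV in bounded record batches, so memory stays flat regardless of source size:

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.csv as pacsv
import pyarrow.dataset as ds

SRC = "goldsky/orderFilled.csv"   # ~299 GB source
DEST = "goldsky/order_filled"     # -> goldsky/order_filled/date=YYYY-MM-DD/*.parquet

reader = pacsv.open_csv(SRC)      # streaming reader, bounded memory
for i, batch in enumerate(reader):
    table = pa.Table.from_batches([batch])
    # Derive the partition key; assumes an epoch-seconds 'timestamp' column.
    dates = pc.strftime(table["timestamp"].cast(pa.timestamp("s")), format="%Y-%m-%d")
    table = table.append_column("date", dates)
    ds.write_dataset(
        table,
        DEST,
        format="parquet",
        partitioning=ds.partitioning(pa.schema([("date", pa.string())]), flavor="hive"),
        basename_template=f"chunk-{i}-part-{{i}}.parquet",  # unique file names per batch
        existing_data_behavior="overwrite_or_ignore",
    )
```

Writing date partitions up front is what makes the partition pruning described below possible.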

Why:

Parquet should be materially smaller than raw CSV for this data because it stores typed columns, compresses column by
column, and avoids CSV's per-row text overhead. It also enables partition pruning, so incremental processing can scan
only recent partitions instead of a 299 GB source file.
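
A sketch of what incremental processing could look like on that layout, assuming the hive partitioning above; the state-file path, cutoff date, and process_partition body are hypothetical. Only fragments under date=... directories matching the filter are ever opened, a thread pool covers the --threads request, and a state file records finished partitions so restarts skip completed work:

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import pyarrow.dataset as ds

STATE = Path("processed/.trades_state.json")  # hypothetical state file
done = set(json.loads(STATE.read_text())) if STATE.exists() else set()

dataset = ds.dataset("goldsky/order_filled", format="parquet", partitioning="hive")
# Pruned scan: only partitions matching the filter are opened.
fragments = dataset.get_fragments(filter=ds.field("date") >= "2024-01-01")  # example cutoff

def process_partition(fragment):
    table = fragment.to_table()  # one bounded partition in memory at a time
    ...  # transform and write processed/trades/date=.../*.parquet

todo = [f for f in fragments if f.path not in done]
with ThreadPoolExecutor(max_workers=40) as pool:  # cf. the --threads request
    list(pool.map(process_partition, todo))

# A real script would checkpoint per partition rather than once at the end.
STATE.write_text(json.dumps(sorted(done | {f.path for f in todo})))
```

Threads are a reasonable fit here since Arrow's Parquet reads release the GIL, so the I/O- and decode-heavy work actually parallelizes.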

At the time of writing, the converted Parquet structure comes to roughly 37 GB from the 299 GB CSV when fully up to date, with improved processing performance to boot.
