Move to Parquet for storage. #29

@polymboinker

Description

Feature request: move the live data pipeline from CSV to Parquet

The current process_live.py flow is not viable for large datasets because it reads and filters huge CSV files into
memory. In my local checkout, goldsky/orderFilled.csv is about 299 GB and processed/trades.csv is about 238 GB, which
can exceed available RAM even with the proposed “lazy” script.

Request:

  • Replace CSV storage for live pipeline data with Parquet.
  • Store raw Goldsky events as partitioned Parquet, e.g. goldsky/order_filled/date=YYYY-MM-DD/*.parquet.
  • Store processed trades as partitioned Parquet, e.g. processed/trades/date=YYYY-MM-DD/*.parquet.
  • Migrate markets.csv and missing_markets.csv to Parquet as well.
  • Add a one-time migration script for existing CSV data (see the sketch after this list).
  • Make process_live.py process bounded Parquet chunks/partitions instead of collecting the whole dataset.
  • Add configurable threading, ideally via a --threads flag, supporting up to 40 threads.
  • Preserve resumability with state files instead of scanning/tailing huge CSVs.
  • Keep automatic missing-market discovery/fetching.
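
A minimal migration sketch, assuming the raw CSV has an epoch-seconds timestamp column to partition on; the column name and chunk naming here are placeholders, not a finished script. It streams the CSV in bounded record batches, so memory stays flat regardless of source size:

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.csv as pacsv
import pyarrow.dataset as ds

SRC = "goldsky/orderFilled.csv"   # ~299 GB source
DEST = "goldsky/order_filled"     # -> goldsky/order_filled/date=YYYY-MM-DD/*.parquet

reader = pacsv.open_csv(SRC)      # streaming reader, bounded memory
for i, batch in enumerate(reader):
    table = pa.Table.from_batches([batch])
    # Derive the partition key; assumes an epoch-seconds 'timestamp' column.
    dates = pc.strftime(table["timestamp"].cast(pa.timestamp("s")), format="%Y-%m-%d")
    table = table.append_column("date", dates)
    ds.write_dataset(
        table,
        DEST,
        format="parquet",
        partitioning=ds.partitioning(pa.schema([("date", pa.string())]), flavor="hive"),
        basename_template=f"chunk-{i}-part-{{i}}.parquet",  # unique file names per batch
        existing_data_behavior="overwrite_or_ignore",
    )
```

Writing date partitions up front is what makes the partition pruning described below possible.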

Why:

Parquet should be materially smaller than raw CSV for this data because it stores typed columns, compresses column by
column, and avoids CSV's per-row text overhead. It also enables partition pruning, so incremental processing can scan
only recent partitions instead of a 299 GB source file.
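
A sketch of what incremental processing could look like on that layout, assuming the hive partitioning above; the state-file path, cutoff date, and process_partition body are hypothetical. Only fragments under date=... directories matching the filter are ever opened, a thread pool covers the --threads request, and a state file records finished partitions so restarts skip completed work:

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import pyarrow.dataset as ds

STATE = Path("processed/.trades_state.json")  # hypothetical state file
done = set(json.loads(STATE.read_text())) if STATE.exists() else set()

dataset = ds.dataset("goldsky/order_filled", format="parquet", partitioning="hive")
# Pruned scan: only partitions matching the filter are opened.
fragments = dataset.get_fragments(filter=ds.field("date") >= "2024-01-01")  # example cutoff

def process_partition(fragment):
    table = fragment.to_table()  # one bounded partition in memory at a time
    ...  # transform and write processed/trades/date=.../*.parquet

todo = [f for f in fragments if f.path not in done]
with ThreadPoolExecutor(max_workers=40) as pool:  # cf. the --threads request
    list(pool.map(process_partition, todo))

# A real script would checkpoint per partition rather than once at the end.
STATE.write_text(json.dumps(sorted(done | {f.path for f in todo})))
```

Threads are a reasonable fit here since Arrow's Parquet reads release the GIL, so the I/O- and decode-heavy work actually parallelizes.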

At the time of writing, the converted Parquet structure comes to roughly 37 GB from the 299 GB CSV when fully up to date, with improved processing performance to boot.
