Feature request: move the live data pipeline from CSV to Parquet
The current process_live.py flow is not viable for large datasets because it reads and filters huge CSV files into
memory. In my local checkout, goldsky/orderFilled.csv is about 299 GB and processed/trades.csv is about 238 GB, which
can exceed available RAM even with the proposed “lazy” script.
Request:
- Replace CSV storage for live pipeline data with Parquet.
- Store raw Goldsky events as partitioned Parquet, e.g. goldsky/order_filled/date=YYYY-MM-DD/*.parquet.
- Store processed trades as partitioned Parquet, e.g. processed/trades/date=YYYY-MM-DD/*.parquet.
- Migrate markets.csv and missing_markets.csv to Parquet as well.
- Add a one-time migration script for existing CSV data (a migration sketch follows this list).
- Make process_live.py process bounded Parquet chunks/partitions instead of collecting the whole dataset.
- Add configurable threading, ideally via --threads, with support for up to 40 threads.
- Preserve resumability with state files instead of scanning/tailing huge CSVs (a state-file sketch is at the end of this request).
- Keep automatic missing-market discovery/fetching.
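For concreteness, here is a minimal sketch of the partitioned layout and the one-time migration from the items above, using pyarrow's streaming CSV reader so memory stays bounded. The "timestamp" column (taken to be epoch seconds) and the exact paths are assumptions from my local checkout, not a definitive implementation:

```python
# Hypothetical one-time migration: stream goldsky/orderFilled.csv into
# hive-partitioned Parquet under goldsky/order_filled/date=YYYY-MM-DD/.
# The "timestamp" column (assumed epoch seconds) and paths are assumptions.
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.csv as pacsv
import pyarrow.dataset as ds

reader = pacsv.open_csv("goldsky/orderFilled.csv")  # streaming, bounded memory

partitioning = ds.partitioning(pa.schema([("date", pa.string())]), flavor="hive")

for i, batch in enumerate(reader):
    table = pa.Table.from_batches([batch])
    # Derive the partition key from the event timestamp.
    ts = table["timestamp"].cast(pa.timestamp("s"))
    table = table.append_column("date", pc.strftime(ts, format="%Y-%m-%d"))
    ds.write_dataset(
        table,
        base_dir="goldsky/order_filled",
        format="parquet",
        partitioning=partitioning,
        existing_data_behavior="overwrite_or_ignore",
        # Unique basenames per chunk so successive writes don't clobber each other.
        basename_template=f"chunk-{i}-part-{{i}}.parquet",
    )
```

In practice you would probably buffer a day's worth of batches before writing (or compact partitions afterwards) so each date= directory does not end up with thousands of tiny files.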
Why:
Parquet should be materially smaller than raw CSV for this data because it stores typed columns, compresses column-by-
column, and avoids repeating text-heavy CSV overhead. It also enables partition pruning so incremental processing can
scan only recent partitions instead of a 299 GB source file.
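A sketch of what that incremental scan could look like, assuming the hive layout above and a string-typed date partition key (the cutoff value is illustrative):

```python
# Hypothetical incremental scan: only date= partitions past the cutoff are read,
# one record batch at a time, so memory stays bounded regardless of history size.
import pyarrow.dataset as ds

dataset = ds.dataset("goldsky/order_filled", format="parquet", partitioning="hive")

rows = 0
for batch in dataset.to_batches(filter=ds.field("date") >= "2024-06-01"):
    rows += batch.num_rows  # stand-in for the real per-batch trade processing
print(f"scanned {rows} rows from recent partitions only")
```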
At the time of writing, the converted, up-to-date Parquet structure comes to roughly 37 GB, down from 299 GB of CSV, with improved processing performance to boot.
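Finally, for the resumability item, a minimal sketch of a state file that replaces tailing huge CSVs; the path and key name are hypothetical:

```python
# Hypothetical resume state: record the last fully processed date partition in a
# small JSON file instead of scanning/tailing a multi-hundred-GB CSV.
import json
import os
from pathlib import Path

STATE_PATH = Path("processed/state.json")  # assumed location

def load_last_date() -> str:
    """Return the last fully processed partition date, or an epoch default."""
    if STATE_PATH.exists():
        return json.loads(STATE_PATH.read_text())["last_date"]
    return "1970-01-01"

def save_last_date(date: str) -> None:
    """Atomically persist progress so a crash never leaves a torn state file."""
    tmp = STATE_PATH.with_suffix(".tmp")
    tmp.write_text(json.dumps({"last_date": date}))
    os.replace(tmp, STATE_PATH)
```

process_live.py would then only scan partitions with date greater than load_last_date() and call save_last_date() after each partition completes.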