Unified Pipeline Usage

Single Command for Complete Data Processing

Instead of running three separate commands (download, ingest, enrich), you can use the unified quantmini pipeline run command:

Full Pipeline (Download → Parquet → Enrich)

# Process stocks daily data for a date range
uv run quantmini pipeline run \
  --data-type stocks_daily \
  --start-date 2025-10-07 \
  --end-date 2025-10-10 \
  --skip-convert

# Process options minute data
uv run quantmini pipeline run \
  --data-type options_minute \
  --start-date 2025-10-07 \
  --end-date 2025-10-07 \
  --skip-convert

Skip Certain Steps

# Skip download/ingest (only enrich existing data)
uv run quantmini pipeline run \
  --data-type stocks_daily \
  --start-date 2024-01-01 \
  --end-date 2024-01-10 \
  --skip-ingest \
  --skip-convert

# Skip enrichment (only download and convert to Parquet)
uv run quantmini pipeline run \
  --data-type options_daily \
  --start-date 2025-01-01 \
  --end-date 2025-01-31 \
  --skip-enrich \
  --skip-convert

Pipeline Stages (Medallion Architecture)

The pipeline follows the Medallion Architecture pattern with three data quality layers:

Ingest (Landing → Bronze Layer)
- Downloads from Polygon.io S3 to landing layer
- Converts to validated Parquet format (bronze layer)
- Validates schema and data types
- Output: $DATA_ROOT/bronze/{data_type}/
Enrich (Bronze → Silver Layer)
- Reads validated Parquet from bronze layer
- Computes technical indicators
- Adds alpha factors using DuckDB
- Output: $DATA_ROOT/silver/{data_type}/
Convert (Silver → Gold Layer)
- Converts enriched silver data to Qlib binary format
- Creates ML-ready datasets in gold layer
- Output: $DATA_ROOT/gold/qlib/{data_type}/

Medallion Architecture Layers

Landing: Raw CSV.GZ files organized by source (landing/polygon-s3/)
Bronze: Validated Parquet files with schema enforcement (bronze/)
Silver: Feature-enriched Parquet with calculated indicators (silver/)
Gold: ML-ready binary format optimized for backtesting (gold/qlib/)

Data Access Limitations:

Stocks: 5-year access (2020-10-18 to present)
Options: 2-year access (2023-10-18 to present)

(Where $DATA_ROOT is configured via environment variable or config/pipeline_config.yaml)

Benefits

✅ Single command for entire workflow ✅ Automatic incremental processing (skips already processed dates) ✅ Trading day filtering (skips weekends/holidays automatically) ✅ Memory-optimized streaming for large datasets ✅ Atomic operations with proper error handling

Daily Update Example

# Update all data types for the latest trading day
TODAY=$(date +%Y-%m-%d)

for dtype in stocks_daily options_daily stocks_minute options_minute; do
  uv run quantmini pipeline run \
    --data-type $dtype \
    --start-date $TODAY \
    --end-date $TODAY \
    --skip-convert
done

Historical Backfill Example

# Backfill missing data (automatically skips existing dates with incremental mode)
uv run quantmini pipeline run \
  --data-type stocks_daily \
  --start-date 2024-01-01 \
  --end-date 2025-10-03 \
  --skip-convert

The pipeline automatically:

Filters to trading days only
Skips weekends and market holidays
Uses incremental mode (skips already processed dates)
Handles errors gracefully with detailed logging

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unified Pipeline Usage

Single Command for Complete Data Processing

Full Pipeline (Download → Parquet → Enrich)

Skip Certain Steps

Pipeline Stages (Medallion Architecture)

Medallion Architecture Layers

Benefits

Daily Update Example

Historical Backfill Example

FilesExpand file tree

PIPELINE_USAGE.md

Latest commit

History

PIPELINE_USAGE.md

File metadata and controls

Unified Pipeline Usage

Single Command for Complete Data Processing

Full Pipeline (Download → Parquet → Enrich)

Skip Certain Steps

Pipeline Stages (Medallion Architecture)

Medallion Architecture Layers

Benefits

Daily Update Example

Historical Backfill Example