Last Updated: October 21, 2025
Project: QuantMini - High-Performance Financial Data Pipeline
Architecture: Medallion Architecture (Bronze → Silver → Gold)
QuantMini is a production-grade financial data pipeline implementing Medallion Architecture with Qlib integration for quantitative trading research. The system processes market data from Polygon.io through multiple quality layers, culminating in ML-ready binary formats for backtesting.
Key Features:
- Medallion Architecture data lake (Landing → Bronze → Silver → Gold)
- Polygon.io REST API integration (direct downloads, no S3)
- High-performance async downloaders (HTTP/2, 100+ concurrent requests)
- 8+ years of news data, 5+ years of market data
- Qlib binary format for ML backtesting
- Complete automation with incremental updates
Landing Layer Bronze Layer Silver Layer Gold Layer
(Raw Sources) (Validated) (Enriched) (ML-Ready)
↓ ↓ ↓ ↓
Polygon.io → Validated Parquet → Feature-Enriched → Qlib Binary
REST API (Schema Check) (Alpha158) (Backtesting)
↓ ↓ ↓ ↓
landing/ bronze/{type}/ silver/{type}/ gold/qlib/
- Landing: Polygon REST API → Raw JSON responses
- Bronze: Validated Parquet files (schema-checked, ZSTD compressed)
- Silver: Parquet + Alpha158 features (158 technical indicators)
- Gold: Qlib binary format (columnar storage for ML)
IMPORTANT: The project has ONLY two entry point scripts for all data operations:
- scripts/daily_update_parallel.sh - Daily/incremental updates
  - Processes recent data (default: yesterday, configurable with --days-back)
  - Runs all layers in parallel for maximum performance
  - Use for: daily automation, backfilling recent data
- scripts/historical_data_load.sh - Historical data downloads
  - Downloads large historical datasets (multi-year)
  - Optimized for bulk downloads with aggressive parallelization
  - Use for: initial setup, downloading historical fundamentals/short data
All other scripts are internal components called by these two entry points. Do not run individual scripts directly unless debugging.
Primary Data Root: /Volumes/990EVOPLUS/quantlake/ (External SSD)
Config File: config/paths.yaml (ONLY source of truth for all paths)
Legacy Path: /Volumes/sandisk/quantmini-lake/ (deprecated)
- Configured in config/pipeline_config.yaml
- Must use external drive (500GB+ required)
Directory Structure:
/Volumes/990EVOPLUS/quantlake/
├── landing/ # Raw API responses (ephemeral)
├── bronze/ # Validated Parquet (~100GB + minute data)
│ ├── stocks_daily/ # Daily OHLCV data
│ ├── stocks_minute/ # Minute OHLCV data (34GB, partitioned)
│ ├── options_daily/ # Daily options aggregates
│ ├── options_minute/ # Minute options data (17GB, partitioned)
│   ├── news/            # News articles (12GB, 739K files, 8+ years)
│ ├── fundamentals/ # Financial statements, ratios
│ ├── corporate_actions/ # Dividends, splits, IPOs
│ └── reference_data/ # Tickers, relationships
├── silver/ # Feature-enriched Parquet
│ ├── stocks_daily/ # + Alpha158 features
│ └── options_daily/
└── gold/ # ML-ready binary
└── qlib/ # Microsoft Qlib format
├── instruments/
├── calendars/
└── features/
Date-First Hive Partitioning (NOT ticker-first):
bronze/news/year=2025/month=09/ticker=AAPL.parquet
            ^^^^^^^^^^^^^^^^^^ Date first (enables partition pruning)
Why date-first:
- Efficient time-range queries (common use case)
- Partition pruning reduces I/O by 90%+
- Better compression ratios
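Engines like Polars and DuckDB derive this pruning automatically from the Hive partition keys embedded in each path. A minimal stdlib sketch of the idea (the file paths below are hypothetical examples, not actual bronze-layer contents):

```python
from datetime import date

# Hypothetical partitioned file listing; in practice this comes from disk.
paths = [
    "bronze/stocks_daily/year=2023/month=12/ticker=AAPL.parquet",
    "bronze/stocks_daily/year=2024/month=01/ticker=AAPL.parquet",
    "bronze/stocks_daily/year=2024/month=06/ticker=MSFT.parquet",
    "bronze/stocks_daily/year=2025/month=02/ticker=AAPL.parquet",
]

def partition_month(path: str) -> tuple[int, int]:
    """Parse the year=YYYY/month=MM Hive keys out of a partition path."""
    kv = {}
    for part in path.split("/"):
        if "=" in part:
            key, value = part.split("=", 1)
            kv[key] = value
    return int(kv["year"]), int(kv["month"])

def prune(paths: list[str], start: date, end: date) -> list[str]:
    """Keep only files whose partition falls inside the requested range.

    Note: no file is opened -- pruning uses path metadata alone,
    which is why date-first layout saves so much I/O.
    """
    lo, hi = (start.year, start.month), (end.year, end.month)
    return [p for p in paths if lo <= partition_month(p) <= hi]

kept = prune(paths, date(2024, 1, 1), date(2024, 12, 31))
```

With date-first layout, a one-year query touches only the matching `year=`/`month=` directories; ticker-first layout would force a scan inside every ticker directory instead.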
Authentication:
- API Key: Stored in config/credentials.yaml
- No S3 credentials needed (migrated from S3 to REST API)
Key Downloaders (src/download/):
- polygon_rest_client.py - Base HTTP/2 async client
- news.py - News articles downloader (8+ years available)
- bars.py - OHLCV data downloader
- fundamentals.py - Income statements, balance sheets, cash flow, short data
- corporate_actions.py - Dividends, splits, IPOs, ticker changes
- reference_data.py - Ticker metadata, relationships
Polygon API Endpoints Covered:
- Dividends - /v3/reference/dividends - Cash dividends, payment dates
- Stock Splits - /v3/reference/splits - Forward and reverse splits
- IPOs - /vX/reference/ipos - Initial public offerings
- Ticker Events - /vX/reference/tickers/{id}/events - Symbol changes, rebranding
- Short Interest - /stocks/v1/short-interest - Bi-weekly short interest (2 year max)
- Short Volume - /stocks/v1/short-volume - Daily short volume (all history)
API Optimizations:
- HTTP/2 multiplexing (100+ concurrent requests)
- Automatic retries with exponential backoff
- Rate limiting compliance
- Cursor-based pagination
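The retry behavior can be sketched with stdlib asyncio alone. This is an illustrative stand-in, not the client's actual code: `fetch` below substitutes for the real httpx call, and the delay constants are hypothetical:

```python
import asyncio
import random

async def fetch_with_retry(fetch, url, max_retries=5, base_delay=0.5):
    """Retry a flaky async call with exponential backoff plus jitter.

    Delays double each attempt (0.5s, 1s, 2s, ...); the final failure
    is re-raised so callers can log and rerun.
    """
    for attempt in range(max_retries):
        try:
            return await fetch(url)
        except Exception:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            await asyncio.sleep(delay)

# Demo with a stub endpoint that fails twice, then succeeds.
calls = {"n": 0}

async def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return {"status": "OK", "url": url}

result = asyncio.run(
    fetch_with_retry(flaky_fetch, "/v3/reference/dividends", base_delay=0.01)
)
```

The jitter term spreads out retries so 100 concurrent requests hitting the same rate limit do not all retry in lockstep.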
| Data Type | Start Date | Coverage | Size | Tickers/Contracts |
|---|---|---|---|---|
| Stocks Daily | 2020-01-01 | 5+ years | ~200GB | 11,994 |
| Options Daily | 2023-01-01 | 2+ years | ~500GB | 1,388,382 |
| Stocks Minute | 2020-01-01 | 5+ years | ~5TB | 11,994 |
| Options Minute | 2023-01-01 | 2+ years | ~10TB | 1,388,382 |
| News | 2017-04-10 | 8+ years | ~50GB | Millions |
| Fundamentals | 2015-01-01 | 10+ years | ~20GB | 11,994 |
Total Bronze Layer: ~770GB (excluding minute data)
Command Pattern:
uv run python -m src.cli.main data ingest \
-t <data_type> \
-s <start_date> \
-e <end_date> \
  [--incremental]

Data Types:
- stocks_daily - Daily OHLCV bars for stocks
- options_daily - Daily OHLCV bars for options
- stocks_minute - Minute bars for stocks (large)
- options_minute - Minute bars for options (very large)
Key Flags:
- --incremental - Skip existing dates (smart deduplication)
- --max-concurrent N - Control concurrency (default: 50)
Examples:
# Initial batch load (1 year)
uv run python -m src.cli.main data ingest -t stocks_daily -s 2024-01-01 -e 2025-10-18
# Incremental daily update
uv run python -m src.cli.main data ingest -t stocks_daily \
-s $(date -v-7d +%Y-%m-%d) -e $(date +%Y-%m-%d) --incremental
# Backfill gap
uv run python -m src.cli.main data ingest -t stocks_daily \
  -s 2024-06-01 -e 2024-06-30 --incremental

News:
# Download 1 year of news
uv run python scripts/download/download_news_1year.py --start-date 2024-01-01
# Download full 8-year history
uv run python scripts/download/download_news_1year.py --start-date 2017-04-10

Fundamentals:
uv run python scripts/download/download_fundamentals.py \
--tickers-file tickers_cs.txt \
  --include-short-data

Corporate Actions (Dividends, Splits, IPOs):
# Download dividends, splits, and IPOs
quantmini polygon corporate-actions \
--start-date 2024-01-01 \
--end-date 2025-10-21 \
--include-ipos \
--output-dir $BRONZE_DIR/corporate_actions
# Download ticker changes/rebranding
quantmini polygon ticker-events AAPL,MSFT,GOOGL \
  --output-dir $BRONZE_DIR/corporate_actions

Short Interest & Short Volume:
# Downloads both short interest AND short volume
quantmini polygon short-data AAPL,MSFT,GOOGL \
--settlement-date-gte 2024-01-01 \
--date-gte 2024-01-01 \
  --output-dir $BRONZE_DIR/fundamentals

Bulk Download:
# Download all data types at once
bash scripts/bulk_download_all_data.sh

Strategy A: Full Historical (Recommended for production)
# 5+ years of stocks
uv run python -m src.cli.main data ingest -t stocks_daily -s 2020-01-01 -e 2025-10-18
# Duration: 2-4 hours
# Size: ~200GB

Strategy B: Recent Data Only (Fast start)
# 1 year of stocks
uv run python -m src.cli.main data ingest -t stocks_daily -s 2024-01-01 -e 2025-10-18
# Duration: 30-60 minutes
# Size: ~40GB

Daily Update Workflow:
# Update last 7 days (handles weekends/holidays)
uv run python -m src.cli.main data ingest -t stocks_daily \
-s $(date -v-7d +%Y-%m-%d) -e $(date +%Y-%m-%d) --incremental
uv run python -m src.cli.main data ingest -t options_daily \
-s $(date -v-7d +%Y-%m-%d) -e $(date +%Y-%m-%d) --incremental
# Download yesterday's news
uv run python scripts/download/download_news_1year.py \
  --start-date $(date -v-1d +%Y-%m-%d)

How --incremental works:
- Scans existing Parquet files for date coverage
- Skips dates already present in bronze layer
- Only downloads missing dates
- Prevents duplicate data and wasted API calls
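A minimal sketch of that deduplication step (the existing-date set is hard-coded here for illustration; the real implementation derives it by scanning partition paths in the bronze layer):

```python
from datetime import date, timedelta

def missing_dates(start: date, end: date, existing: set[date]) -> list[date]:
    """Return the requested dates not already present in the bronze layer.

    Set difference makes reruns idempotent: already-downloaded dates
    produce no API calls.
    """
    requested = set()
    d = start
    while d <= end:
        requested.add(d)
        d += timedelta(days=1)
    return sorted(requested - existing)

# Hypothetical coverage: June 3 and June 5 were never downloaded.
existing = {date(2024, 6, 1), date(2024, 6, 2), date(2024, 6, 4)}
todo = missing_dates(date(2024, 6, 1), date(2024, 6, 5), existing)
```

Because only the `todo` dates are downloaded, rerunning the same command after a partial failure simply picks up where it left off.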
Date Range Backfill:
# Backfill specific month
uv run python -m src.cli.main data ingest -t stocks_daily \
  -s 2024-06-01 -e 2024-06-30 --incremental

Monthly Backfill (Large Gaps):
# Backfill year 2024, month by month
# Note: end dates like 2024-02-31 do not exist; short months (e.g. February)
# may need the real month end instead of day 31.
for month in {01..12}; do
  uv run python -m src.cli.main data ingest -t stocks_daily \
    -s 2024-${month}-01 -e 2024-${month}-31 --incremental
  echo "Completed month: 2024-${month}"
  sleep 10  # Rate limiting
done

Location: scripts/transformation/
Add Features:
# Generate Alpha158 features
uv run python scripts/transformation/transform_add_features.py

What it does:
- Reads bronze Parquet files
- Calculates 158 technical indicators (Alpha158)
- Writes to silver layer with same partitioning
- Preserves date-first structure
Location: scripts/conversion/
Convert to Qlib Binary:
# Convert silver Parquet to Qlib format
uv run python scripts/conversion/convert_to_qlib_binary.py

Output Structure:
gold/qlib/
├── instruments/
│ └── all.txt # List of tickers
├── calendars/
│ └── day.txt # Trading dates
└── features/
├── {ticker}/
│ ├── open.bin # Binary column files
│ ├── high.bin
│ ├── low.bin
│ ├── close.bin
│ └── volume.bin
└── ...
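Each `.bin` file above is a flat little-endian float32 array. The sketch below assumes the layout used by Qlib's dump_bin tooling, where the first element records the series' start offset into the trading calendar; verify against your Qlib version before relying on it:

```python
import os
import struct
import tempfile

def write_qlib_bin(path: str, start_index: int, values: list[float]) -> None:
    """Write one feature column: [start_offset, v0, v1, ...] as float32."""
    data = [float(start_index)] + [float(v) for v in values]
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{len(data)}f", *data))

def read_qlib_bin(path: str) -> tuple[int, list[float]]:
    """Read the column back; returns (calendar start offset, values)."""
    with open(path, "rb") as f:
        raw = f.read()
    data = struct.unpack(f"<{len(raw) // 4}f", raw)
    return int(data[0]), list(data[1:])

# Round-trip a hypothetical 'close' column in a temp directory.
tmp = os.path.join(tempfile.mkdtemp(), "close.bin")
write_qlib_bin(tmp, 5, [189.25, 190.5, 188.0])
start, values = read_qlib_bin(tmp)
```

Storing one flat array per (ticker, feature) is what makes Qlib reads fast: loading `close` for one ticker is a single sequential file read with no parsing.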
Daily Automation:
# Setup cron jobs
bash scripts/automation/setup_cron_jobs.sh
# Daily pipeline (6 AM)
bash scripts/automation/orchestrate_daily_pipeline.sh

Weekly Automation:
# Comprehensive weekly update
bash scripts/automation/orchestrate_weekly_pipeline.sh

File: src/conversion/qlib_binary_writer.py
Issue: Original implementation assumed dict input, but received DataFrame.
Fix (October 18, 2025):
def _write_bin(self, symbol_df: Union[pl.DataFrame, Dict], code: str, _calendar):
    """
    Write binary data for a single symbol.

    CRITICAL: Handles both DataFrame AND dict input (legacy compatibility)
    """
    # Handle DataFrame input (current standard)
    if isinstance(symbol_df, pl.DataFrame):
        symbol_df = symbol_df.to_dict(as_series=False)
    # ... rest of implementation

Key Points:
- Always check input type (DataFrame vs dict)
- Qlib format requires dict with lists
- Polars DataFrames are primary data structure in pipeline
Wrong (ticker-first):
bronze/stocks_daily/ticker=AAPL/year=2025/month=09/data.parquet
Correct (date-first):
bronze/stocks_daily/year=2025/month=09/ticker=AAPL.parquet
Why:
- Time-range queries are most common use case
- Partition pruning eliminates 90%+ of files
- Better compression (similar dates compress better)
- DuckDB/Polars can skip entire year/month directories
File: src/download/polygon_rest_client.py
Key Optimizations:
async with PolygonRESTClient(
api_key=credentials['api_key'],
max_concurrent=50, # 50 concurrent requests
max_connections=100 # 100 HTTP/2 connections
) as client:
    # HTTP/2 multiplexing enables 50+ parallel requests per connection

Performance:
- 50+ concurrent requests via HTTP/2
- Automatic retries with exponential backoff
- Connection pooling and keepalive
- Result: 10-20x faster than sequential
Implementation (in CLI command):
- Scan existing Parquet files in bronze layer
- Extract date coverage from Hive partitions
- Build set of existing dates
- Filter requested date range to missing dates only
- Download only missing dates
Benefits:
- No wasted API calls
- Idempotent (safe to rerun)
- Handles failures gracefully (just rerun)
Why Polars:
- 10-100x faster than Pandas
- Lazy evaluation (query optimization)
- Better memory management
- Native Parquet support
- Arrow-compatible
Key Pattern:
# Read Parquet (lazy)
df = pl.scan_parquet('bronze/stocks_daily/**/*.parquet')
# Filter (lazy, not executed yet)
df = df.filter(pl.col('date').is_between('2024-01-01', '2024-12-31'))
# Collect (execute optimized query)
result = df.collect()

# Day 1: Install and configure
git clone https://github.com/nittygritty-zzy/quantmini.git
cd quantmini
uv sync
cp config/credentials.yaml.example config/credentials.yaml
# Edit config/credentials.yaml with Polygon API key
# Day 1: Initial batch load (1 year)
uv run python -m src.cli.main data ingest -t stocks_daily -s 2024-01-01 -e 2025-10-18
uv run python -m src.cli.main data ingest -t options_daily -s 2024-01-01 -e 2025-10-18
uv run python scripts/download/download_news_1year.py --start-date 2024-01-01
# Day 2: Feature engineering
uv run python scripts/transformation/transform_add_features.py
# Day 3: Qlib conversion
uv run python scripts/conversion/convert_to_qlib_binary.py
# Day 4: Setup automation
bash scripts/automation/setup_cron_jobs.sh

# Runs via cron at 6 AM daily
bash scripts/automation/orchestrate_daily_pipeline.sh
# What it does:
# 1. Download yesterday's data (stocks, options, news)
# 2. Run feature engineering on new data
# 3. Update Qlib binary format
# 4. Validate data quality

# Step 1: Detect gaps
uv run python scripts/validation/detect_data_gaps.py \
--data-type stocks_daily \
--start-date 2020-01-01 \
--end-date 2025-10-18
# Step 2: Backfill year by year
for year in {2020..2024}; do
uv run python -m src.cli.main data ingest -t stocks_daily \
-s ${year}-01-01 -e ${year}-12-31 --incremental
echo "Completed year: ${year}"
done
# Step 3: Verify completeness
uv run python scripts/validation/validate_data_completeness.py

Location: scripts/tests/run_complete_pipeline.py
What it does:
- Downloads test data (5 tickers, 1 month)
- Processes through all layers (Bronze → Silver → Gold)
- Validates each layer
- Generates summary report
Usage:
# Clean test directory
rm -rf /Users/zheyuanzhao/workspace/quantmini/test_pipeline/*
# Run complete pipeline test
uv run python scripts/tests/run_complete_pipeline.py
# Check results
cat /Users/zheyuanzhao/workspace/quantmini/test_pipeline/PIPELINE_SUMMARY.md

Recent Test Results (September 2025):
- Bronze: 1,107 news articles (0.48 MB)
- Silver: 144 enriched records (0.01 MB)
- Gold: 15 binary files + 2 metadata (0.00 MB)
- API Success Rate: 100% (23/23 requests)
config/pipeline_config.yaml:
- data_root: Primary data storage location
- partition_strategy: Date-first Hive partitioning
- compression: ZSTD (best compression ratio)
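A hypothetical sketch of what those keys might look like in config/pipeline_config.yaml; the key names and values below are illustrative, not the project's actual schema:

```yaml
# Illustrative pipeline_config.yaml fragment (assumed structure)
data_root: /Volumes/990EVOPLUS/quantlake
partition_strategy: date_first   # year=YYYY/month=MM/ticker=SYMBOL.parquet
compression: zstd                # best ratio for columnar market data
```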
config/credentials.yaml (NOT in git):
polygon:
api_key: "YOUR_POLYGON_API_KEY"
  # NO S3 credentials needed (migrated to REST API)

config/system_profile.yaml:
- System-specific optimizations
- Memory limits
- Concurrency settings
export DATA_ROOT=/Volumes/990EVOPLUS/quantlake
export POLYGON_API_KEY=your_api_key_here

Why uv:
- 10-100x faster than pip
- Better dependency resolution
- UV_LINK_MODE=copy for external drives
Install dependencies:
# Standard install
uv sync
# External drive (copy mode)
export UV_LINK_MODE=copy
uv sync

Core:
- polars - DataFrame library (10-100x faster than pandas)
- httpx - HTTP/2 async client
- pyarrow - Parquet I/O
- duckdb - Query engine
ML/Qlib:
- qlib - Microsoft quantitative investment framework
- numpy - Numerical computing
- scipy - Scientific computing
CLI:
- click - Command-line interface
- rich - Terminal formatting
- tqdm - Progress bars
Date-First Structure:
{data_type}/year={YYYY}/month={MM}/ticker={SYMBOL}.parquet
Examples:
- stocks_daily/year=2025/month=09/ticker=AAPL.parquet
- options_daily/year=2025/month=09/ticker=AAPL250117C00100000.parquet
- news/year=2025/month=09/ticker=AAPL.parquet
Structure:
gold/qlib/features/{ticker}/{feature}.bin
Examples:
- gold/qlib/features/aapl/open.bin
- gold/qlib/features/aapl/close.bin
- docs/guides/data-ingestion-strategies.md
  - Complete guide to initial load, incremental, backfill
  - 500+ lines, most comprehensive guide
- docs/guides/batch-downloader.md
  - Polygon REST API downloader guide
  - Performance optimization tips
-
  - Query bronze layer with DataLoader
  - DuckDB integration examples
-
  - Complete scripts reference
  - Organized by Medallion Architecture layer
- Concise but complete
- Code examples for every feature
- Cross-references to related docs
- All code samples tested
- Updated October 18, 2025
- ❌ Use ticker-first partitioning (use date-first)
- ❌ Use pandas (use polars for 10-100x speedup)
- ❌ Use S3 flat files (use REST API)
- ❌ Use pip (use uv package manager)
- ❌ Forget --incremental flag for daily updates
- ❌ Skip validation after backfill
- ✅ Use date-first Hive partitioning
- ✅ Use --incremental flag for daily updates
- ✅ Run validation scripts after backfill
- ✅ Check today's date before time-sensitive commands
- ✅ Use uv for all Python commands
- ✅ Monitor background processes with BashOutput
Data Root (NEVER change without migration):
/Volumes/990EVOPLUS/quantlake/
Config Files (MUST exist):
config/credentials.yaml
config/pipeline_config.yaml
config/system_profile.yaml
Test Directory (safe to delete):
/Users/zheyuanzhao/workspace/quantmini/test_pipeline/
- ✅ Qlib Binary Writer Fix - Handle DataFrame input
- ✅ Test Pipeline Scripts - Complete Bronze → Silver → Gold
- ✅ Documentation Update - Data ingestion strategies guide
- ✅ Scripts Organization - Medallion Architecture aligned
- ✅ September Test - Validated 1-month pipeline (1,107 articles)
- ✅ Migrated from S3 to REST API (complete)
- ✅ Bronze layer (770GB, excluding minute data)
- 🔄 Silver layer (partial, needs feature engineering)
- 🔄 Gold layer (partial, needs Qlib conversion)
Multiple data ingestion processes running:
- stocks_minute (2020-2025)
- options_minute (2020-2025)
- stocks_daily (2020-2025)
- options_daily (2020-2025)
Check status:
# Monitor background process
uv run python scripts/validation/check_download_status.py

# Daily update
uv run python -m src.cli.main data ingest -t stocks_daily \
-s $(date -v-7d +%Y-%m-%d) -e $(date +%Y-%m-%d) --incremental
# Backfill gap
uv run python -m src.cli.main data ingest -t stocks_daily \
-s 2024-06-01 -e 2024-06-30 --incremental
# Feature engineering
uv run python scripts/transformation/transform_add_features.py
# Qlib conversion
uv run python scripts/conversion/convert_to_qlib_binary.py
# Validate
uv run python scripts/validation/validate_duckdb_access.py
# Test pipeline
uv run python scripts/tests/run_complete_pipeline.py

Rate limiting:
uv run python -m src.cli.main data ingest -t stocks_daily \
  -s 2024-01-01 -e 2024-12-31 --max-concurrent 10

Disk space:
df -h /Volumes/990EVOPLUS/quantlake
du -sh /Volumes/990EVOPLUS/quantlake/*

Check gaps:
uv run python scripts/validation/detect_data_gaps.py --data-type stocks_daily

End of Project Memory
Maintained by: Claude Code Assistant
Last Review: October 18, 2025