QuantMini Project Memory

Last Updated: October 21, 2025
Project: QuantMini - High-Performance Financial Data Pipeline
Architecture: Medallion Architecture (Bronze → Silver → Gold)

Project Overview

QuantMini is a production-grade financial data pipeline implementing Medallion Architecture with Qlib integration for quantitative trading research. The system processes market data from Polygon.io through multiple quality layers, culminating in ML-ready binary formats for backtesting.

Key Features:

Medallion Architecture data lake (Landing → Bronze → Silver → Gold)
Polygon.io REST API integration (direct downloads, no S3)
High-performance async downloaders (HTTP/2, 100+ concurrent requests)
8+ years of news data, 5+ years of market data
Qlib binary format for ML backtesting
Complete automation with incremental updates

Architecture

Medallion Architecture Layers

Landing Layer          Bronze Layer         Silver Layer          Gold Layer
(Raw Sources)         (Validated)          (Enriched)            (ML-Ready)
      ↓                    ↓                    ↓                     ↓
Polygon.io         →  Validated Parquet  →  Feature-Enriched  →  Qlib Binary
  REST API             (Schema Check)        (Alpha158)           (Backtesting)
      ↓                    ↓                    ↓                     ↓
landing/              bronze/{type}/      silver/{type}/        gold/qlib/

Data Flow

Landing: Polygon REST API → Raw JSON responses
Bronze: Validated Parquet files (schema-checked, ZSTD compressed)
Silver: Parquet + Alpha158 features (158 technical indicators)
Gold: Qlib binary format (columnar storage for ML)

Pipeline Entry Points

IMPORTANT: The project has ONLY two entry point scripts for all data operations:

scripts/daily_update_parallel.sh - Daily/incremental updates
- Processes recent data (default: yesterday, configurable with --days-back)
- Runs all layers in parallel for maximum performance
- Use for: daily automation, backfilling recent data
scripts/historical_data_load.sh - Historical data downloads
- Downloads large historical datasets (multi-year)
- Optimized for bulk downloads with aggressive parallelization
- Use for: initial setup, downloading historical fundamentals/short data

All other scripts are internal components called by these two entry points. Do not run individual scripts directly unless debugging.

Critical Technical Details

Data Storage

Primary Data Root: /Volumes/990EVOPLUS/quantlake/ (External SSD)
Config File: config/paths.yaml (ONLY source of truth for all paths)
Legacy Path: /Volumes/sandisk/quantmini-lake/ (deprecated)

Configured in config/pipeline_config.yaml
Must use external drive (500GB+ required)

Directory Structure:

/Volumes/990EVOPLUS/quantlake/
├── landing/           # Raw API responses (ephemeral)
├── bronze/            # Validated Parquet (~100GB + minute data)
│   ├── stocks_daily/      # Daily OHLCV data
│   ├── stocks_minute/     # Minute OHLCV data (34GB, partitioned)
│   ├── options_daily/     # Daily options aggregates
│   ├── options_minute/    # Minute options data (17GB, partitioned)
│   ├── news/              # News articles (12GB, 739K files, 10 years)
│   ├── fundamentals/      # Financial statements, ratios
│   ├── corporate_actions/ # Dividends, splits, IPOs
│   └── reference_data/    # Tickers, relationships
├── silver/            # Feature-enriched Parquet
│   ├── stocks_daily/  # + Alpha158 features
│   └── options_daily/
└── gold/              # ML-ready binary
    └── qlib/          # Microsoft Qlib format
        ├── instruments/
        ├── calendars/
        └── features/

Partitioning Strategy

Date-First Hive Partitioning (NOT ticker-first):

bronze/news/news/year=2025/month=09/ticker=AAPL.parquet
                 ^^^^^^^^^^^^^^^^^^^^^^ Date first (enables partition pruning)

Why date-first:

Efficient time-range queries (common use case)
Partition pruning reduces I/O by 90%+
Better compression ratios
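
A minimal sketch of the date-first layout, assuming nothing beyond the path pattern shown above. The helper names `partition_path` and `month_prefixes` are hypothetical and not part of the project. The second helper shows why pruning works: a time-range query only needs the listed year/month directories, whatever the ticker count.

```python
from datetime import date
from pathlib import PurePosixPath


def partition_path(data_type: str, day: date, ticker: str) -> PurePosixPath:
    """Build a date-first Hive-style path: year and month come before the ticker."""
    return PurePosixPath(
        data_type,
        f"year={day.year:04d}",
        f"month={day.month:02d}",
        f"ticker={ticker}.parquet",
    )


def month_prefixes(data_type: str, start: date, end: date) -> list[str]:
    """List the year=/month= directories that a [start, end] query has to touch."""
    prefixes = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        prefixes.append(f"{data_type}/year={year:04d}/month={month:02d}")
        # Roll over to January of the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return prefixes


print(partition_path("stocks_daily", date(2025, 9, 15), "AAPL"))
print(month_prefixes("stocks_daily", date(2024, 11, 1), date(2025, 1, 31)))
```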

Data Sources

Polygon.io REST API (Primary)

Authentication:

API Key: Stored in config/credentials.yaml
No S3 credentials needed (migrated from S3 to REST API)

Key Downloaders (src/download/):

polygon_rest_client.py - Base HTTP/2 async client
news.py - News articles downloader (8+ years available)
bars.py - OHLCV data downloader
fundamentals.py - Income statements, balance sheets, cash flow, short data
corporate_actions.py - Dividends, splits, IPOs, ticker changes
reference_data.py - Ticker metadata, relationships

Polygon API Endpoints Covered:

Dividends - /v3/reference/dividends - Cash dividends, payment dates
Stock Splits - /v3/reference/splits - Forward and reverse splits
IPOs - /vX/reference/ipos - Initial public offerings
Ticker Events - /vX/reference/tickers/{id}/events - Symbol changes, rebranding
Short Interest - /stocks/v1/short-interest - Bi-weekly short interest (2 year max)
Short Volume - /stocks/v1/short-volume - Daily short volume (all history)

API Optimizations:

HTTP/2 multiplexing (100+ concurrent requests)
Automatic retries with exponential backoff
Rate limiting compliance
Cursor-based pagination
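
Cursor-based pagination can be sketched generically as below. This is an illustration only: the `results` and `next_cursor` field names and the `paginate` helper are assumptions made for the example, not the Polygon response schema or the project's client code, and the HTTP call is replaced by a lookup in a fake page table.

```python
from typing import Callable, Iterator, Optional


def paginate(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield every result, following the cursor until the API stops returning one."""
    cursor: Optional[str] = None
    while True:
        page = fetch_page(cursor)
        yield from page.get("results", [])
        cursor = page.get("next_cursor")
        if not cursor:
            break


# Stand-in for the HTTP call: three fake pages keyed by the cursor that requests them.
FAKE_PAGES = {
    None: {"results": [{"id": 1}, {"id": 2}], "next_cursor": "a"},
    "a": {"results": [{"id": 3}], "next_cursor": "b"},
    "b": {"results": [{"id": 4}]},  # no cursor: last page
}

all_ids = [item["id"] for item in paginate(FAKE_PAGES.__getitem__)]
print(all_ids)
```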

Data Coverage (as of October 18, 2025)

Data Type	Start Date	Coverage	Size	Tickers/Contracts
Stocks Daily	2020-01-01	5+ years	~200GB	11,994
Options Daily	2023-01-01	2+ years	~500GB	1,388,382
Stocks Minute	2020-01-01	5+ years	~5TB	11,994
Options Minute	2023-01-01	2+ years	~10TB	1,388,382
News	2017-04-10	8+ years	~50GB	Millions
Fundamentals	2015-01-01	10+ years	~20GB	11,994

Total Bronze Layer: ~770GB (excluding minute data)

CLI Commands

Data Ingestion (Primary Interface)

Command Pattern:

uv run python -m src.cli.main data ingest \
  -t <data_type> \
  -s <start_date> \
  -e <end_date> \
  [--incremental]

Data Types:

stocks_daily - Daily OHLCV bars for stocks
options_daily - Daily OHLCV bars for options
stocks_minute - Minute bars for stocks (large)
options_minute - Minute bars for options (very large)

Key Flags:

--incremental - Skip existing dates (smart deduplication)
--max-concurrent N - Control concurrency (default: 50)

Examples:

# Initial batch load (2024-01-01 through 2025-10-18)
uv run python -m src.cli.main data ingest -t stocks_daily -s 2024-01-01 -e 2025-10-18

# Incremental daily update
uv run python -m src.cli.main data ingest -t stocks_daily \
  -s $(date -v-7d +%Y-%m-%d) -e $(date +%Y-%m-%d) --incremental

# Backfill gap
uv run python -m src.cli.main data ingest -t stocks_daily \
  -s 2024-06-01 -e 2024-06-30 --incremental
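
Note that `date -v-7d` is BSD/macOS syntax and is not accepted by GNU date. A portable way to compute the same window, sketched in Python (the `update_window` helper is hypothetical, not part of the CLI):

```python
from datetime import date, timedelta


def update_window(today: date, days_back: int = 7) -> tuple[str, str]:
    """Return (start, end) as YYYY-MM-DD strings suitable for the -s and -e flags."""
    start = today - timedelta(days=days_back)
    return start.isoformat(), today.isoformat()


start, end = update_window(date(2025, 10, 21))
print(f"-s {start} -e {end}")
```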

Download Scripts (Alternative Data)

News:

# Download 1 year of news
uv run python scripts/download/download_news_1year.py --start-date 2024-01-01

# Download full 8-year history
uv run python scripts/download/download_news_1year.py --start-date 2017-04-10

Fundamentals:

uv run python scripts/download/download_fundamentals.py \
  --tickers-file tickers_cs.txt \
  --include-short-data

Corporate Actions (Dividends, Splits, IPOs):

# Download dividends, splits, and IPOs
quantmini polygon corporate-actions \
  --start-date 2024-01-01 \
  --end-date 2025-10-21 \
  --include-ipos \
  --output-dir $BRONZE_DIR/corporate_actions

# Download ticker changes/rebranding
quantmini polygon ticker-events AAPL,MSFT,GOOGL \
  --output-dir $BRONZE_DIR/corporate_actions

Short Interest & Short Volume:

# Downloads both short interest AND short volume
quantmini polygon short-data AAPL,MSFT,GOOGL \
  --settlement-date-gte 2024-01-01 \
  --date-gte 2024-01-01 \
  --output-dir $BRONZE_DIR/fundamentals

Bulk Download:

# Download all data types at once
bash scripts/bulk_download_all_data.sh

Data Ingestion Strategies

1. Initial Batch Load (First Time)

Strategy A: Full Historical (Recommended for production)

# 5+ years of stocks
uv run python -m src.cli.main data ingest -t stocks_daily -s 2020-01-01 -e 2025-10-18

# Duration: 2-4 hours
# Size: ~200GB

Strategy B: Recent Data Only (Fast start)

# Stocks since 2024-01-01
uv run python -m src.cli.main data ingest -t stocks_daily -s 2024-01-01 -e 2025-10-18

# Duration: 30-60 minutes
# Size: ~40GB

2. Incremental Updates (Daily Maintenance)

Daily Update Workflow:

# Update last 7 days (handles weekends/holidays)
uv run python -m src.cli.main data ingest -t stocks_daily \
  -s $(date -v-7d +%Y-%m-%d) -e $(date +%Y-%m-%d) --incremental

uv run python -m src.cli.main data ingest -t options_daily \
  -s $(date -v-7d +%Y-%m-%d) -e $(date +%Y-%m-%d) --incremental

# Download yesterday's news
uv run python scripts/download/download_news_1year.py \
  --start-date $(date -v-1d +%Y-%m-%d)

How --incremental works:

Scans existing Parquet files for date coverage
Skips dates already present in bronze layer
Only downloads missing dates
Prevents duplicate data and wasted API calls
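
The core of that behaviour is a set difference between the requested dates and the dates already on disk. A minimal sketch, assuming the existing dates have already been collected into a set (the `missing_dates` name is hypothetical and this is not the CLI's actual code):

```python
from datetime import date, timedelta


def missing_dates(start: date, end: date, existing: set[date]) -> list[date]:
    """Dates in [start, end] that are not already present, in ascending order."""
    span = (end - start).days
    requested = (start + timedelta(days=i) for i in range(span + 1))
    return [d for d in requested if d not in existing]


existing = {date(2024, 6, 3), date(2024, 6, 4)}
todo = missing_dates(date(2024, 6, 3), date(2024, 6, 6), existing)
print(todo)
```

Rerunning with the same inputs after the missing dates land returns an empty list, which is what makes the flag safe to repeat.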

3. Backfill (Fill Data Gaps)

Date Range Backfill:

# Backfill specific month
uv run python -m src.cli.main data ingest -t stocks_daily \
  -s 2024-06-01 -e 2024-06-30 --incremental

Monthly Backfill (Large Gaps):

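# Backfill year 2024, month by month
for month in 01 02 03 04 05 06 07 08 09 10 11 12; do
  # Last calendar day of the month (BSD/macOS date syntax, as used elsewhere here)
  end=$(date -j -v+1m -v-1d -f %Y-%m-%d "2024-${month}-01" +%Y-%m-%d)

  uv run python -m src.cli.main data ingest -t stocks_daily \
    -s "2024-${month}-01" -e "${end}" --incremental

  echo "Completed month: 2024-${month}"
  sleep 10  # Rate limiting
done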
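
Month lengths vary (and February depends on the year), so end dates are safer computed than hard-coded. A small sketch using the standard library; `month_ranges` is a hypothetical helper, not an existing project script:

```python
import calendar
from datetime import date


def month_ranges(year: int) -> list[tuple[date, date]]:
    """First and last calendar day of each month in the given year."""
    ranges = []
    for month in range(1, 13):
        # monthrange returns (weekday of the 1st, number of days in the month).
        last_day = calendar.monthrange(year, month)[1]
        ranges.append((date(year, month, 1), date(year, month, last_day)))
    return ranges


for first, last in month_ranges(2024):
    print(f"-s {first.isoformat()} -e {last.isoformat()}")
```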

Pipeline Scripts

Transformation (Bronze → Silver)

Location: scripts/transformation/

Add Features:

# Generate Alpha158 features
uv run python scripts/transformation/transform_add_features.py

What it does:

Reads bronze Parquet files
Calculates 158 technical indicators (Alpha158)
Writes to silver layer with same partitioning
Preserves date-first structure
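
To give a feel for what the indicators look like, here is a sketch of three Alpha158-style features in plain Python, following the expressions Qlib uses for them (KMID = (close - open) / open, ROC = Ref(close, d) / close, MA = Mean(close, d) / close). It is not the code in transform_add_features.py; the function names are hypothetical, and leaving warm-up bars as None is a simplification that may differ from how the pipeline or Qlib handles short windows.

```python
from typing import List, Optional


def kmid(open_: float, close: float) -> float:
    """KMID: candle body relative to the open."""
    return (close - open_) / open_


def roc(closes: List[float], window: int) -> List[Optional[float]]:
    """ROC{window}: close from `window` bars ago divided by the current close."""
    return [
        closes[i - window] / closes[i] if i >= window else None
        for i in range(len(closes))
    ]


def ma(closes: List[float], window: int) -> List[Optional[float]]:
    """MA{window}: trailing mean of the last `window` closes over the current close."""
    out: List[Optional[float]] = []
    for i in range(len(closes)):
        if i + 1 < window:
            out.append(None)  # not enough history yet
        else:
            window_mean = sum(closes[i + 1 - window : i + 1]) / window
            out.append(window_mean / closes[i])
    return out


closes = [10.0, 11.0, 12.0, 12.0]
print(kmid(100.0, 102.0), roc(closes, 2), ma(closes, 2))
```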

Conversion (Silver → Gold)

Location: scripts/conversion/

Convert to Qlib Binary:

# Convert silver Parquet to Qlib format
uv run python scripts/conversion/convert_to_qlib_binary.py

Output Structure:

gold/qlib/
├── instruments/
│   └── all.txt              # List of tickers
├── calendars/
│   └── day.txt              # Trading dates
└── features/
    ├── {ticker}/
    │   ├── open.bin         # Binary column files
    │   ├── high.bin
    │   ├── low.bin
    │   ├── close.bin
    │   └── volume.bin
    └── ...
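
A sketch of what one binary column file can contain, assuming the layout produced by Qlib's own dump_bin.py script: a flat array of little-endian float32 values whose first element is the row's starting offset into the calendar. Whether qlib_binary_writer.py uses exactly this layout and these file names is not confirmed here; the helper names are hypothetical.

```python
import struct
import tempfile
from pathlib import Path


def write_feature_bin(path: Path, start_index: int, values: list[float]) -> None:
    """Write [start_index, v0, v1, ...] as little-endian float32."""
    payload = [float(start_index), *values]
    path.write_bytes(struct.pack(f"<{len(payload)}f", *payload))


def read_feature_bin(path: Path) -> tuple[int, list[float]]:
    """Read the file back into (calendar start index, values)."""
    raw = path.read_bytes()
    floats = struct.unpack(f"<{len(raw) // 4}f", raw)
    return int(floats[0]), list(floats[1:])


with tempfile.TemporaryDirectory() as tmp:
    bin_path = Path(tmp) / "close.bin"
    # Series starts at calendar position 3 and holds three closes.
    write_feature_bin(bin_path, 3, [1.5, 2.25, 4.0])
    file_size = bin_path.stat().st_size
    start_index, values = read_feature_bin(bin_path)

print(file_size, start_index, values)
```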

Automation

Daily Automation:

# Setup cron jobs
bash scripts/automation/setup_cron_jobs.sh

# Daily pipeline (6 AM)
bash scripts/automation/orchestrate_daily_pipeline.sh

Weekly Automation:

# Comprehensive weekly update
bash scripts/automation/orchestrate_weekly_pipeline.sh

Key Technical Fixes & Learnings

1. Qlib Binary Writer (CRITICAL)

File: src/conversion/qlib_binary_writer.py

Issue: Original implementation assumed dict input, but received DataFrame.

Fix (October 18, 2025):

from typing import Dict, Union

import polars as pl

def _write_bin(self, symbol_df: Union[pl.DataFrame, Dict], code: str, _calendar):
    """
    Write binary data for a single symbol.

    CRITICAL: Handles both DataFrame AND dict input (legacy compatibility)
    """
    # Handle DataFrame input (current standard): convert to a dict of plain lists
    if isinstance(symbol_df, pl.DataFrame):
        symbol_df = symbol_df.to_dict(as_series=False)

    # ... rest of implementation

Key Points:

Always check input type (DataFrame vs dict)
Qlib format requires dict with lists
Polars DataFrames are primary data structure in pipeline

2. Date-First Partitioning (CRITICAL)

Wrong (ticker-first):

bronze/stocks_daily/ticker=AAPL/year=2025/month=09/data.parquet

Correct (date-first):

bronze/stocks_daily/year=2025/month=09/ticker=AAPL.parquet

Why:

Time-range queries are most common use case
Partition pruning eliminates 90%+ of files
Better compression (similar dates compress better)
DuckDB/Polars can skip entire year/month directories

3. Async HTTP/2 Client (Performance)

File: src/download/polygon_rest_client.py

Key Optimizations:

async with PolygonRESTClient(
    api_key=credentials['api_key'],
    max_concurrent=50,        # 50 concurrent requests
    max_connections=100       # 100 HTTP/2 connections
) as client:
    # HTTP/2 multiplexing enables 50+ parallel requests per connection
    ...  # issue download calls through `client` here

Performance:

50+ concurrent requests via HTTP/2
Automatic retries with exponential backoff
Connection pooling and keepalive
Result: 10-20x faster than sequential

4. Incremental Update Logic

Implementation (in CLI command):

Scan existing Parquet files in bronze layer
Extract date coverage from Hive partitions
Build set of existing dates
Filter requested date range to missing dates only
Download only missing dates

Benefits:

No wasted API calls
Idempotent (safe to rerun)
Handles failures gracefully (just rerun)
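
The partition-scanning step can be sketched as a regular expression over file paths. This assumes the `year=/month=/ticker=` layout described in this file and recovers month-level coverage only; day-level coverage would additionally require reading the date column inside each file. `covered_months` is a hypothetical helper, not the CLI implementation.

```python
import re

PARTITION_RE = re.compile(r"year=(\d{4})/month=(\d{2})/ticker=([^/]+)\.parquet$")


def covered_months(paths: list[str]) -> dict[str, set[tuple[int, int]]]:
    """Map each ticker to the set of (year, month) partitions that already exist."""
    coverage: dict[str, set[tuple[int, int]]] = {}
    for path in paths:
        match = PARTITION_RE.search(path)
        if match is None:
            continue  # not a partition file (e.g. a stray temp file)
        year, month, ticker = match.groups()
        coverage.setdefault(ticker, set()).add((int(year), int(month)))
    return coverage


paths = [
    "bronze/stocks_daily/year=2025/month=08/ticker=AAPL.parquet",
    "bronze/stocks_daily/year=2025/month=09/ticker=AAPL.parquet",
    "bronze/stocks_daily/year=2025/month=09/ticker=BRK.B.parquet",
    "bronze/stocks_daily/_tmp/partial.parquet",
]
coverage = covered_months(paths)
print(coverage)
```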

5. Polars > Pandas

Why Polars:

10-100x faster than Pandas
Lazy evaluation (query optimization)
Better memory management
Native Parquet support
Arrow-compatible

Key Pattern:

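from datetime import date

import polars as pl

# Read Parquet (lazy)
df = pl.scan_parquet('bronze/stocks_daily/**/*.parquet')

# Filter (lazy, not executed yet). Bounds are date objects because Polars parses
# bare strings in is_between as column names; assumes `date` is a Date column.
df = df.filter(pl.col('date').is_between(date(2024, 1, 1), date(2024, 12, 31)))

# Collect (execute optimized query)
result = df.collect()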

Common Workflows

New Project Setup (Day 1-4)

# Day 1: Install and configure
git clone https://github.com/nittygritty-zzy/quantmini.git
cd quantmini
uv sync
cp config/credentials.yaml.example config/credentials.yaml
# Edit config/credentials.yaml with Polygon API key

# Day 1: Initial batch load (from 2024-01-01)
uv run python -m src.cli.main data ingest -t stocks_daily -s 2024-01-01 -e 2025-10-18
uv run python -m src.cli.main data ingest -t options_daily -s 2024-01-01 -e 2025-10-18
uv run python scripts/download/download_news_1year.py --start-date 2024-01-01

# Day 2: Feature engineering
uv run python scripts/transformation/transform_add_features.py

# Day 3: Qlib conversion
uv run python scripts/conversion/convert_to_qlib_binary.py

# Day 4: Setup automation
bash scripts/automation/setup_cron_jobs.sh

Daily Maintenance (Automated)

# Runs via cron at 6 AM daily
bash scripts/automation/orchestrate_daily_pipeline.sh

# What it does:
# 1. Download yesterday's data (stocks, options, news)
# 2. Run feature engineering on new data
# 3. Update Qlib binary format
# 4. Validate data quality

Historical Backfill (Fill Gaps)

# Step 1: Detect gaps
uv run python scripts/validation/detect_data_gaps.py \
  --data-type stocks_daily \
  --start-date 2020-01-01 \
  --end-date 2025-10-18

# Step 2: Backfill year by year
for year in {2020..2024}; do
  uv run python -m src.cli.main data ingest -t stocks_daily \
    -s ${year}-01-01 -e ${year}-12-31 --incremental

  echo "Completed year: ${year}"
done

# Step 3: Verify completeness
uv run python scripts/validation/validate_data_completeness.py
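
Once missing dates are known, grouping them into contiguous ranges keeps the number of backfill invocations small. A sketch under the assumption that gaps arrive as a plain list of dates; `group_into_ranges` is a hypothetical helper and is not what detect_data_gaps.py is known to do.

```python
from datetime import date, timedelta


def group_into_ranges(days: list[date]) -> list[tuple[date, date]]:
    """Collapse dates into (start, end) runs of consecutive calendar days."""
    ranges: list[tuple[date, date]] = []
    for day in sorted(set(days)):
        if ranges and day - ranges[-1][1] == timedelta(days=1):
            # Extends the current run by one day.
            ranges[-1] = (ranges[-1][0], day)
        else:
            ranges.append((day, day))
    return ranges


gaps = [date(2024, 6, 4), date(2024, 6, 3), date(2024, 6, 5), date(2024, 6, 10)]
for start, end in group_into_ranges(gaps):
    print(f"-s {start.isoformat()} -e {end.isoformat()} --incremental")
```

Weekends and holidays split runs here; merging across non-trading days would need the trading calendar.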

Testing

Complete Pipeline Test

Location: scripts/tests/run_complete_pipeline.py

What it does:

Downloads test data (5 tickers, 1 month)
Processes through all layers (Bronze → Silver → Gold)
Validates each layer
Generates summary report

Usage:

# Clean test directory
rm -rf /Users/zheyuanzhao/workspace/quantmini/test_pipeline/*

# Run complete pipeline test
uv run python scripts/tests/run_complete_pipeline.py

# Check results
cat /Users/zheyuanzhao/workspace/quantmini/test_pipeline/PIPELINE_SUMMARY.md

Recent Test Results (September 2025):

Bronze: 1,107 news articles (0.48 MB)
Silver: 144 enriched records (0.01 MB)
Gold: 15 binary files + 2 metadata (0.00 MB)
API Success Rate: 100% (23/23 requests)

Configuration Files

Primary Config Files

config/pipeline_config.yaml:

data_root: Primary data storage location
partition_strategy: Date-first Hive partitioning
compression: ZSTD (best compression ratio)

config/credentials.yaml (NOT in git):

polygon:
  api_key: "YOUR_POLYGON_API_KEY"
  # NO S3 credentials needed (migrated to REST API)

config/system_profile.yaml:

System-specific optimizations
Memory limits
Concurrency settings

Environment Variables

export DATA_ROOT=/Volumes/990EVOPLUS/quantlake
export POLYGON_API_KEY=your_api_key_here

Dependencies & Environment

Package Manager: uv (NOT pip)

Why uv:

10-100x faster than pip
Better dependency resolution
UV_LINK_MODE=copy for external drives

Install dependencies:

# Standard install
uv sync

# External drive (copy mode)
export UV_LINK_MODE=copy
uv sync

Key Dependencies

Core:

polars - DataFrame library (10-100x faster than pandas)
httpx - HTTP/2 async client
pyarrow - Parquet I/O
duckdb - Query engine

ML/Qlib:

qlib - Microsoft quantitative investment framework
numpy - Numerical computing
scipy - Scientific computing

CLI:

click - Command-line interface
rich - Terminal formatting
tqdm - Progress bars

File Naming Conventions

Parquet Files (Bronze/Silver)

Date-First Structure:

{data_type}/year={YYYY}/month={MM}/ticker={SYMBOL}.parquet

Examples:

stocks_daily/year=2025/month=09/ticker=AAPL.parquet
options_daily/year=2025/month=09/ticker=AAPL250117C00100000.parquet
news/year=2025/month=09/ticker=AAPL.parquet
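
The options file name above embeds an OCC-style contract symbol: underlying root, expiration as YYMMDD, C or P, then the strike price times 1000 padded to eight digits. A sketch of a parser, assuming purely alphabetic roots (adjusted contracts whose roots contain digits are not handled); `parse_option_ticker` and `OptionContract` are hypothetical names.

```python
import re
from datetime import date
from typing import NamedTuple


class OptionContract(NamedTuple):
    underlying: str
    expiration: date
    option_type: str  # "C" (call) or "P" (put)
    strike: float


OCC_RE = re.compile(r"^([A-Z]{1,6})(\d{2})(\d{2})(\d{2})([CP])(\d{8})$")


def parse_option_ticker(ticker: str) -> OptionContract:
    """Split an OCC-style symbol into its components."""
    match = OCC_RE.match(ticker)
    if match is None:
        raise ValueError(f"not an OCC-style option symbol: {ticker!r}")
    root, yy, mm, dd, kind, strike = match.groups()
    expiration = date(2000 + int(yy), int(mm), int(dd))
    return OptionContract(root, expiration, kind, int(strike) / 1000)


contract = parse_option_ticker("AAPL250117C00100000")
print(contract)
```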

Binary Files (Gold/Qlib)

Structure:

gold/qlib/features/{ticker}/{feature}.bin

Examples:

gold/qlib/features/aapl/open.bin
gold/qlib/features/aapl/close.bin

Documentation

Key Documentation Files

docs/guides/data-ingestion-strategies.md
- Complete guide to initial load, incremental, backfill
- 500+ lines, most comprehensive guide
docs/guides/batch-downloader.md
- Polygon REST API downloader guide
- Performance optimization tips
docs/guides/data-loader.md
- Query bronze layer with DataLoader
- DuckDB integration examples
scripts/README.md
- Complete scripts reference
- Organized by Medallion Architecture layer

Documentation Standards

Concise but complete
Code examples for every feature
Cross-references to related docs
All code samples tested
Updated October 18, 2025

Important Notes

DO NOT

❌ Use ticker-first partitioning (use date-first)
❌ Use pandas (use polars for 10-100x speedup)
❌ Use S3 flat files (use REST API)
❌ Use pip (use uv package manager)
❌ Forget --incremental flag for daily updates
❌ Skip validation after backfill

ALWAYS

✅ Use date-first Hive partitioning
✅ Use --incremental flag for daily updates
✅ Run validation scripts after backfill
✅ Check today's date before time-sensitive commands
✅ Use uv for all Python commands
✅ Monitor background processes with BashOutput

Critical Paths

Data Root (NEVER change without migration):

/Volumes/990EVOPLUS/quantlake/

Config Files (MUST exist):

config/credentials.yaml
config/pipeline_config.yaml
config/system_profile.yaml

Test Directory (safe to delete):

/Users/zheyuanzhao/workspace/quantmini/test_pipeline/

Recent Changes (October 18, 2025)

Completed

✅ Qlib Binary Writer Fix - Handle DataFrame input
✅ Test Pipeline Scripts - Complete Bronze → Silver → Gold
✅ Documentation Update - Data ingestion strategies guide
✅ Scripts Organization - Medallion Architecture aligned
✅ September Test - Validated 1-month pipeline (1,107 articles)

Data Migration Status

✅ Migrated from S3 to REST API (complete)
✅ Bronze layer (770GB, excluding minute data)
🔄 Silver layer (partial, needs feature engineering)
🔄 Gold layer (partial, needs Qlib conversion)

Active Background Processes

Multiple data ingestion processes running:

stocks_minute (2020-2025)
options_minute (2020-2025)
stocks_daily (2020-2025)
options_daily (2020-2025)

Check status:

# Monitor background process
uv run python scripts/validation/check_download_status.py

Quick Reference

Most Common Commands

# Daily update
uv run python -m src.cli.main data ingest -t stocks_daily \
  -s $(date -v-7d +%Y-%m-%d) -e $(date +%Y-%m-%d) --incremental

# Backfill gap
uv run python -m src.cli.main data ingest -t stocks_daily \
  -s 2024-06-01 -e 2024-06-30 --incremental

# Feature engineering
uv run python scripts/transformation/transform_add_features.py

# Qlib conversion
uv run python scripts/conversion/convert_to_qlib_binary.py

# Validate
uv run python scripts/validation/validate_duckdb_access.py

# Test pipeline
uv run python scripts/tests/run_complete_pipeline.py

Troubleshooting

Rate limiting:

uv run python -m src.cli.main data ingest -t stocks_daily \
  -s 2024-01-01 -e 2024-12-31 --max-concurrent 10

Disk space:

df -h /Volumes/990EVOPLUS/quantlake
du -sh /Volumes/990EVOPLUS/quantlake/*

Check gaps:

uv run python scripts/validation/detect_data_gaps.py --data-type stocks_daily

End of Project Memory
Maintained by: Claude Code Assistant
Last Review: October 18, 2025
