LLAA178/qlib-gpu-model

qlib-gpu-model

GPU-first quant deep learning starter built with PyTorch and Qlib-style data pipelines.

This repo is for learning and engineering practice:

  • sequence model training on CUDA, MPS, or CPU
  • mixed precision, torch.compile, and DDP-ready training
  • inference latency benchmarking
  • simple signal evaluation with backtest and walk-forward scripts
  • a concrete CUDA optimization entry via Triton layer norm
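
As a rough illustration of the mixed-precision path, a single AMP-style training step might look like the sketch below. The `train_step` helper and the tiny stand-in model are hypothetical, not the repo's actual `train.py`:

```python
import torch
from torch import nn

def train_step(model, batch, target, optimizer, device="cpu", amp_dtype=None):
    """One training step with optional autocast mixed precision.

    amp_dtype: torch.bfloat16 / torch.float16, or None for full precision.
    bf16 needs no loss scaling; fp16 on CUDA would normally add a GradScaler.
    """
    model.train()
    optimizer.zero_grad(set_to_none=True)
    device_type = "cuda" if str(device).startswith("cuda") else "cpu"
    with torch.autocast(device_type=device_type, dtype=amp_dtype,
                        enabled=amp_dtype is not None):
        loss = nn.functional.mse_loss(model(batch), target)
    loss.backward()
    optimizer.step()
    return loss.item()

# Hypothetical stand-in for the repo's temporal transformer.
model = nn.Sequential(nn.Flatten(), nn.Linear(32 * 8, 1))
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
x = torch.randn(64, 32, 8)   # (batch, seq_len, num_features)
y = torch.randn(64, 1)
loss = train_step(model, x, y, opt, amp_dtype=torch.bfloat16)
```

On CUDA the same loop would typically also wrap the model in torch.compile and, for multi-GPU runs, DistributedDataParallel.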

This repo is not a production trading system:

  • no order management or live execution
  • no portfolio/risk engine
  • no claim of durable alpha from the included results

What Is Included

flowchart LR
    A[Open Data or Qlib Data] --> B[Sequence Windows]
    B --> C[Temporal Transformer]
    C --> D[AMP / compile / DDP Training]
    C --> E[Inference Benchmark]
    C --> F[Backtest / Walk-Forward]
    D --> G[Triton Optimization Entry]

Core modules:

  • train.py: training loop with AMP, torch.compile, and DDP hooks
  • infer.py: latency and throughput benchmark
  • backtest.py: simple cross-sectional long-short evaluation
  • walk_forward.py: rolling train/test evaluation
  • open_data.py: Yahoo Finance to parquet dataset builder
  • render_figures.py: regenerate README figures
  • triton_layer_norm.py: optional Triton layer norm path on CUDA

Results Snapshot

These figures are diagnostics for a learning repo, not production claims.

Strict walk-forward evaluation across rolling folds:

Walk-Forward Overview

Latest cross-sectional selection snapshot:

Latest Picks

Single-holdout backtest view:

Holdout Backtest Overview

Current strict walk-forward baseline on liquid50:

  • mean fold annualized return: 8.70%
  • mean fold Sharpe: 0.61
  • overall annualized return: 4.58%
  • overall Sharpe: 0.57
  • overall max drawdown: -5.62%
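
For reference, summary numbers like these can be reproduced from a daily return series in a few lines. This is a generic sketch assuming 252 trading periods per year, not the repo's exact metric code:

```python
import numpy as np

def summarize(daily_returns, periods_per_year=252):
    """Annualized return, Sharpe ratio, and max drawdown from daily returns."""
    r = np.asarray(daily_returns, dtype=float)
    ann_return = (1.0 + r).prod() ** (periods_per_year / len(r)) - 1.0
    sharpe = np.sqrt(periods_per_year) * r.mean() / r.std(ddof=1)
    equity = (1.0 + r).cumprod()                          # compounded equity curve
    max_dd = (equity / np.maximum.accumulate(equity) - 1.0).min()
    return ann_return, sharpe, max_dd
```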

Interpretation:

  • trust the strict walk-forward numbers over the single-holdout view, since each fold is evaluated on data the model never trained on
  • the current baseline is useful as an engineering benchmark, not as a strong strategy

Quick Start

1. Setup

git clone <repo-url>
cd qlib-gpu-model
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -U pip
python3 -m pip install -e .
python3 -m pip install -e '.[viz]'

Optional extras:

python3 -m pip install -e '.[qlib]'
python3 -m pip install -e '.[triton]'
python3 -m pip install -e '.[profiling]'

Reproducible CPU baseline:

  • recommended interpreter: Python 3.11
  • pinned baseline package set: requirements-lock.txt
  • GitHub Actions runs a no-network smoke path on every push and PR

2. Smoke Train

python3 -m qlib_gpu_model.train \
  --data-source synthetic \
  --device cpu \
  --seq-len 32 \
  --num-features 8 \
  --batch-size 64 \
  --epochs 1 \
  --amp-dtype fp32 \
  --use-compile false \
  --out-dir outputs/smoke

3. Benchmark Inference

python3 -m qlib_gpu_model.infer \
  --checkpoint outputs/smoke/best.pt \
  --device cpu \
  --batch-size 32 \
  --iters 50 \
  --warmup 5
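
Conceptually, the benchmark separates warmup iterations from timed ones and synchronizes on CUDA so kernel queueing does not hide latency. A minimal version of that loop (hypothetical, not the repo's `infer.py`):

```python
import time
import torch
from torch import nn

@torch.no_grad()
def benchmark(model, batch, iters=50, warmup=5, device="cpu"):
    """Median per-batch latency (ms) and throughput (samples/s)."""
    model.eval()
    sync = torch.cuda.synchronize if str(device).startswith("cuda") else (lambda: None)
    for _ in range(warmup):   # warmup: caches and lazy init settle before timing
        model(batch)
    sync()
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        model(batch)
        sync()                # wait for queued CUDA kernels before stopping the clock
        times.append(time.perf_counter() - t0)
    latency_s = sorted(times)[len(times) // 2]   # median is robust to stragglers
    return latency_s * 1e3, batch.shape[0] / latency_s

lat_ms, tput = benchmark(nn.Linear(8, 1), torch.randn(32, 8), iters=10, warmup=2)
```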

4. Try the Included CUDA-Trained Checkpoint

The repo includes one checked-in artifact for demonstration:

  • outputs/liquid50_rank5_cuda/best.pt

It can be loaded on CPU for validation:

python3 -m qlib_gpu_model.infer \
  --checkpoint outputs/liquid50_rank5_cuda/best.pt \
  --device cpu \
  --batch-size 32 \
  --iters 50 \
  --warmup 5

Open Data Workflow

Build a public parquet dataset:

python3 -m qlib_gpu_model.open_data \
  --tickers-file data/universes/us_liquid_50.txt \
  --start 2018-01-01 \
  --end 2025-12-31 \
  --out-path data/open_us_liquid50.parquet

Train a 5-day rank target baseline:

python3 -m qlib_gpu_model.train \
  --data-source parquet \
  --parquet-path data/open_us_liquid50.parquet \
  --target-col label_5d_rank \
  --device auto \
  --seq-len 32 \
  --batch-size 512 \
  --epochs 3 \
  --amp-dtype bf16 \
  --use-compile true \
  --out-dir outputs/liquid50_rank5_cuda
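
The label_5d / label_5d_rank columns referenced above pair a 5-day forward return with its cross-sectional rank. One plausible construction in pandas (hypothetical; the repo's `open_data.py` may differ):

```python
import pandas as pd

def add_labels(df, horizon=5):
    """Add a forward-return label and its cross-sectional rank.

    df: long-format frame with columns [date, ticker, close].
    """
    df = df.sort_values(["ticker", "date"]).copy()
    # Forward return over `horizon` days, per ticker.
    fwd = df.groupby("ticker")["close"].shift(-horizon) / df["close"] - 1.0
    df[f"label_{horizon}d"] = fwd
    # Percentile rank within each date; rank targets are robust to
    # cross-sectional outliers compared with raw returns.
    df[f"label_{horizon}d_rank"] = (
        df.groupby("date")[f"label_{horizon}d"].rank(pct=True)
    )
    return df
```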

Evaluate with a simple cost-aware backtest:

python3 -m qlib_gpu_model.backtest \
  --checkpoint outputs/liquid50_rank5_cuda/best.pt \
  --parquet-path data/open_us_liquid50.parquet \
  --target-col label_5d \
  --device cpu \
  --split valid \
  --top-quantile 0.2 \
  --transaction-cost-bps 10 \
  --rebalance-every 5 \
  --period-days 5
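
The evaluation above is cross-sectional: at each rebalance, go long the top score quantile, short the bottom, and subtract turnover costs. A stripped-down version of that logic (hypothetical; the repo's `backtest.py` may differ in weighting and cost accounting):

```python
import pandas as pd

def long_short_returns(preds, top_quantile=0.2, cost_bps=10.0):
    """Equal-weight long-short return per rebalance date, net of costs.

    preds: frame with columns [date, ticker, score, fwd_return], one row per
    stock per rebalance date; fwd_return is the realized holding-period return.
    """
    out = []
    for _, g in preds.groupby("date"):
        hi = g["score"].quantile(1.0 - top_quantile)
        lo = g["score"].quantile(top_quantile)
        long_ret = g.loc[g["score"] >= hi, "fwd_return"].mean()
        short_ret = g.loc[g["score"] <= lo, "fwd_return"].mean()
        # Assume one full turnover of both legs per rebalance, charged in bps.
        out.append(long_ret - short_ret - 2.0 * cost_bps / 1e4)
    return pd.Series(out)
```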

Run a stricter walk-forward evaluation:

python3 -m qlib_gpu_model.walk_forward \
  --parquet-path data/open_us_liquid50.parquet \
  --train-target-col label_5d_rank \
  --eval-target-col label_5d \
  --device cuda \
  --train-months 24 \
  --test-months 6 \
  --step-months 6 \
  --seq-len 32 \
  --batch-size 1024 \
  --epochs 1 \
  --num-workers 2 \
  --sample-every 5 \
  --purge-days 5 \
  --amp-dtype bf16 \
  --use-compile false \
  --top-quantile 0.2 \
  --transaction-cost-bps 10 \
  --rebalance-every 5 \
  --period-days 5 \
  --out-dir outputs/liquid50_rank5_walk_forward_cuda_strict
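
The --train-months/--test-months/--step-months/--purge-days flags define a rolling fold schedule with a purge gap between train and test, so overlapping multi-day labels cannot leak across the split. A sketch of that schedule (hypothetical helper, not `walk_forward.py` itself):

```python
import pandas as pd

def rolling_folds(start, end, train_months=24, test_months=6,
                  step_months=6, purge_days=5):
    """Yield (train_start, train_end, test_start, test_end) rolling windows."""
    folds = []
    cursor, end = pd.Timestamp(start), pd.Timestamp(end)
    while True:
        train_end = cursor + pd.DateOffset(months=train_months)
        # Purge gap: drop `purge_days` so labels that span the boundary
        # never appear in both train and test.
        test_start = train_end + pd.Timedelta(days=purge_days)
        test_end = test_start + pd.DateOffset(months=test_months)
        if test_end > end:
            break
        folds.append((cursor, train_end, test_start, test_end))
        cursor += pd.DateOffset(months=step_months)
    return folds
```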

Regenerate README figures:

python3 -m qlib_gpu_model.render_figures \
  --backtest-daily outputs/liquid50_rank5_cuda/backtest_5d_cost10.daily.parquet \
  --predictions outputs/liquid50_rank5_cuda/predictions_5d.parquet \
  --walk-forward-daily outputs/liquid50_rank5_walk_forward_cuda_strict/daily_returns.parquet \
  --walk-forward-summary outputs/liquid50_rank5_walk_forward_cuda_strict/summary.json \
  --walk-forward-folds outputs/liquid50_rank5_walk_forward_cuda_strict/fold_metrics.parquet \
  --top-quantile 0.2 \
  --out-dir assets/readme

Device Notes

  • --device auto resolves in order: cuda -> mps -> cpu
  • macOS is fine for debugging, plotting, and small smoke runs
  • serious throughput, DDP, Triton benchmarking, and profiling should be done on Linux + NVIDIA GPU
  • torch.compile is skipped automatically on unsupported devices
  • AMP falls back automatically when the requested dtype is unsupported
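
The documented resolution order can be mirrored in a few lines (a sketch; the repo's actual resolver may differ):

```python
import torch

def resolve_device(requested="auto"):
    """Resolve a device string in the order cuda -> mps -> cpu."""
    if requested != "auto":
        return requested            # explicit request wins, e.g. "cuda:0"
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"                # Apple Silicon fallback
    return "cpu"
```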

Useful Scripts

  • scripts/download_liquid50.sh
  • scripts/train_rank5d_cuda.sh
  • scripts/backtest_rank5d_cuda.sh
  • scripts/walk_forward_rank5d_cuda_strict.sh
  • scripts/profile_cuda.sh
  • scripts/render_readme_figures.sh

Repository Layout

  • src/qlib_gpu_model/: library code
  • scripts/: runnable entry scripts
  • data/universes/: public ticker lists
  • assets/readme/: README images
  • outputs/liquid50_rank5_cuda/best.pt: included demo checkpoint
  • requirements-lock.txt: pinned CPU/open-data baseline
  • .github/workflows/smoke.yml: train/infer/backtest smoke CI

Limitations

  • the included data pipeline is simple and public-data based
  • the backtest is intentionally lightweight and omits many real trading constraints
  • current strategy metrics are not strong enough to market as alpha
  • this repo is better positioned as a GPU quant modeling demo than a production system

Next Improvements

  • move preprocessing bottlenecks to cuDF or Triton/CUDA kernels
  • broaden the CI smoke tests beyond the current no-network train/infer/backtest path
  • extend environment pinning beyond the CPU/open-data baseline
  • export ONNX or TensorRT baselines for inference comparison
  • expand the universe and factor set before taking signal quality seriously
