GPU-first quant deep learning starter built with PyTorch and Qlib-style data pipelines.
This repo is for learning and engineering practice:
- sequence model training on `cuda`, `mps`, or `cpu`
- mixed precision, `torch.compile`, and DDP-ready training
- inference latency benchmarking
- simple signal evaluation with backtest and walk-forward scripts
- a concrete CUDA optimization entry via Triton layer norm
This repo is not a production trading system:
- no order management or live execution
- no portfolio/risk engine
- no claim of durable alpha from the included results
```mermaid
flowchart LR
  A[Open Data or Qlib Data] --> B[Sequence Windows]
  B --> C[Temporal Transformer]
  C --> D[AMP / compile / DDP Training]
  C --> E[Inference Benchmark]
  C --> F[Backtest / Walk-Forward]
  D --> G[Triton Optimization Entry]
```
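The Sequence Windows stage in the diagram is essentially a sliding-window transform over each ticker's feature history; a minimal numpy sketch, with shapes assumed rather than taken from the repo:

```python
import numpy as np

def make_windows(features: np.ndarray, seq_len: int) -> np.ndarray:
    """Turn a (T, F) feature history into (T - seq_len + 1, seq_len, F) windows."""
    T, F = features.shape
    n = T - seq_len + 1
    # Each row of idx is one window's time indices: [t, t+1, ..., t+seq_len-1].
    idx = np.arange(seq_len)[None, :] + np.arange(n)[:, None]  # (n, seq_len)
    return features[idx]  # fancy indexing materializes each window

T, F, seq_len = 100, 8, 32
x = np.random.randn(T, F)
windows = make_windows(x, seq_len)
print(windows.shape)  # (69, 32, 8)
```

In practice this is done per ticker so windows never span two different symbols.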
Core modules:
- `train.py`: training loop with AMP, `torch.compile`, and DDP hooks
- `infer.py`: latency and throughput benchmark
- `backtest.py`: simple cross-sectional long-short evaluation
- `walk_forward.py`: rolling train/test evaluation
- `open_data.py`: Yahoo Finance to parquet dataset builder
- `render_figures.py`: regenerate README figures
- `triton_layer_norm.py`: optional Triton layer norm path on CUDA
These figures are diagnostics for a learning repo, not production claims.
Strict walk-forward evaluation across rolling folds:
Latest cross-sectional selection snapshot:
Single-holdout backtest view:
Current strict walk-forward baseline on liquid50:
- mean fold annualized return: 8.70%
- mean fold Sharpe: 0.61
- overall annualized return: 4.58%
- overall Sharpe: 0.57
- overall max drawdown: -5.62%
Interpretation:
- the strict walk-forward setup is the more trustworthy of the two evaluations
- the current baseline is useful as an engineering benchmark, not a strong strategy
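The summary statistics quoted above follow standard definitions; a minimal sketch of how annualized return, Sharpe, and max drawdown are conventionally computed from a daily return series (these are the textbook formulas, not necessarily the repo's exact code):

```python
import numpy as np

def summarize(daily: np.ndarray, periods_per_year: int = 252) -> dict:
    equity = np.cumprod(1.0 + daily)                 # compounded equity curve
    years = len(daily) / periods_per_year
    ann_return = equity[-1] ** (1.0 / years) - 1.0   # geometric annualization
    sharpe = daily.mean() / daily.std() * np.sqrt(periods_per_year)
    peak = np.maximum.accumulate(equity)
    max_dd = (equity / peak - 1.0).min()             # most negative dip from a running peak
    return {"ann_return": ann_return, "sharpe": sharpe, "max_dd": max_dd}

rng = np.random.default_rng(0)
stats = summarize(rng.normal(0.0003, 0.01, size=504))  # two synthetic years
```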
```bash
git clone <repo-url>
cd qlib-gpu-model
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -U pip
python3 -m pip install -e .
python3 -m pip install -e '.[viz]'
```

Optional extras:

```bash
python3 -m pip install -e '.[qlib]'
python3 -m pip install -e '.[triton]'
python3 -m pip install -e '.[profiling]'
```

Reproducible CPU baseline:
- recommended interpreter: Python 3.11
- pinned baseline package set: `requirements-lock.txt`
- GitHub Actions runs a no-network smoke path on every push and PR
```bash
python3 -m qlib_gpu_model.train \
  --data-source synthetic \
  --device cpu \
  --seq-len 32 \
  --num-features 8 \
  --batch-size 64 \
  --epochs 1 \
  --amp-dtype fp32 \
  --use-compile false \
  --out-dir outputs/smoke
```

```bash
python3 -m qlib_gpu_model.infer \
  --checkpoint outputs/smoke/best.pt \
  --device cpu \
  --batch-size 32 \
  --iters 50 \
  --warmup 5
```

The repo includes one checked-in artifact for demonstration: `outputs/liquid50_rank5_cuda/best.pt`

It can be loaded on CPU for validation:

```bash
python3 -m qlib_gpu_model.infer \
  --checkpoint outputs/liquid50_rank5_cuda/best.pt \
  --device cpu \
  --batch-size 32 \
  --iters 50 \
  --warmup 5
```

Build a public parquet dataset:

```bash
python3 -m qlib_gpu_model.open_data \
  --tickers-file data/universes/us_liquid_50.txt \
  --start 2018-01-01 \
  --end 2025-12-31 \
  --out-path data/open_us_liquid50.parquet
```

Train a 5-day rank target baseline:
```bash
python3 -m qlib_gpu_model.train \
  --data-source parquet \
  --parquet-path data/open_us_liquid50.parquet \
  --target-col label_5d_rank \
  --device auto \
  --seq-len 32 \
  --batch-size 512 \
  --epochs 3 \
  --amp-dtype bf16 \
  --use-compile true \
  --out-dir outputs/liquid50_rank5_cuda
```

Evaluate with a simple cost-aware backtest:
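Conceptually, the cross-sectional evaluation ranks names by predicted score each rebalance, goes long the top quantile against the bottom, and charges turnover costs in bps. A toy sketch of one rebalance period under those assumptions (the cost treatment here is a simplification, not the repo's exact accounting):

```python
import numpy as np

def period_return(scores, realized, top_quantile=0.2, cost_bps=10.0):
    """Equal-weight long top quantile, short bottom quantile, minus round-trip cost."""
    n = len(scores)
    k = max(1, int(n * top_quantile))
    order = np.argsort(scores)            # ascending by predicted score
    short_ret = realized[order[:k]].mean()
    long_ret = realized[order[-k:]].mean()
    gross = 0.5 * (long_ret - short_ret)  # dollar-neutral, half capital per leg
    cost = 2 * cost_bps / 1e4             # assume full turnover on both legs
    return gross - cost

rng = np.random.default_rng(1)
scores = rng.normal(size=50)
realized = 0.02 * scores + rng.normal(0, 0.02, size=50)  # weakly predictive signal
r = period_return(scores, realized)
```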
```bash
python3 -m qlib_gpu_model.backtest \
  --checkpoint outputs/liquid50_rank5_cuda/best.pt \
  --parquet-path data/open_us_liquid50.parquet \
  --target-col label_5d \
  --device cpu \
  --split valid \
  --top-quantile 0.2 \
  --transaction-cost-bps 10 \
  --rebalance-every 5 \
  --period-days 5
```

Run a stricter walk-forward evaluation:
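The rolling schedule used below (24-month train, 6-month test, 6-month step, with a purge gap so labels cannot leak across the boundary) can be sketched in pure Python; the month arithmetic is illustrative:

```python
from datetime import date, timedelta

def add_months(d: date, months: int) -> date:
    y, m = divmod(d.year * 12 + d.month - 1 + months, 12)
    return date(y, m + 1, 1)

def rolling_folds(start, end, train_months=24, test_months=6,
                  step_months=6, purge_days=5):
    """Yield (train_start, train_end, test_start, test_end) with a purge gap."""
    t0 = start
    while True:
        train_end = add_months(t0, train_months)
        test_start = train_end + timedelta(days=purge_days)  # drop overlapping labels
        test_end = add_months(train_end, test_months)
        if test_end > end:
            break
        yield t0, train_end, test_start, test_end
        t0 = add_months(t0, step_months)

folds = list(rolling_folds(date(2018, 1, 1), date(2025, 12, 31)))
```

With the 2018-2025 window this yields eleven folds, matching the spirit (if not necessarily the exact fold count) of the strict run.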
```bash
python3 -m qlib_gpu_model.walk_forward \
  --parquet-path data/open_us_liquid50.parquet \
  --train-target-col label_5d_rank \
  --eval-target-col label_5d \
  --device cuda \
  --train-months 24 \
  --test-months 6 \
  --step-months 6 \
  --seq-len 32 \
  --batch-size 1024 \
  --epochs 1 \
  --num-workers 2 \
  --sample-every 5 \
  --purge-days 5 \
  --amp-dtype bf16 \
  --use-compile false \
  --top-quantile 0.2 \
  --transaction-cost-bps 10 \
  --rebalance-every 5 \
  --period-days 5 \
  --out-dir outputs/liquid50_rank5_walk_forward_cuda_strict
```

Regenerate README figures:
```bash
python3 -m qlib_gpu_model.render_figures \
  --backtest-daily outputs/liquid50_rank5_cuda/backtest_5d_cost10.daily.parquet \
  --predictions outputs/liquid50_rank5_cuda/predictions_5d.parquet \
  --walk-forward-daily outputs/liquid50_rank5_walk_forward_cuda_strict/daily_returns.parquet \
  --walk-forward-summary outputs/liquid50_rank5_walk_forward_cuda_strict/summary.json \
  --walk-forward-folds outputs/liquid50_rank5_walk_forward_cuda_strict/fold_metrics.parquet \
  --top-quantile 0.2 \
  --out-dir assets/readme
```

- `--device auto` resolves in order: `cuda` -> `mps` -> `cpu`
- macOS is fine for debugging, plotting, and small smoke runs
- serious throughput, DDP, Triton benchmarking, and profiling should be done on Linux + NVIDIA GPU
- `torch.compile` is skipped automatically on unsupported devices
- AMP falls back automatically when the requested dtype is unsupported
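The cuda -> mps -> cpu resolution order amounts to a small fallback chain; a sketch with the availability checks factored out as parameters (the function name is illustrative, not the repo's actual helper):

```python
def resolve_device(requested: str, cuda_ok: bool, mps_ok: bool) -> str:
    """Mirror the documented fallback order: cuda -> mps -> cpu."""
    if requested != "auto":
        return requested  # an explicit --device always wins
    if cuda_ok:
        return "cuda"
    if mps_ok:
        return "mps"
    return "cpu"

# In real code the flags would come from torch, e.g.
# torch.cuda.is_available() and torch.backends.mps.is_available().
print(resolve_device("auto", cuda_ok=False, mps_ok=True))  # mps
```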
- `scripts/download_liquid50.sh`
- `scripts/train_rank5d_cuda.sh`
- `scripts/backtest_rank5d_cuda.sh`
- `scripts/walk_forward_rank5d_cuda_strict.sh`
- `scripts/profile_cuda.sh`
- `scripts/render_readme_figures.sh`
- `src/qlib_gpu_model/`: library code
- `scripts/`: runnable entry scripts
- `data/universes/`: public ticker lists
- `assets/readme/`: README images
- `outputs/liquid50_rank5_cuda/best.pt`: included demo checkpoint
- `requirements-lock.txt`: pinned CPU/open-data baseline
- `.github/workflows/smoke.yml`: train/infer/backtest smoke CI
- the included data pipeline is simple and public-data based
- the backtest is intentionally lightweight and omits many real trading constraints
- current strategy metrics are not strong enough to market as alpha
- this repo is better positioned as a GPU quant modeling demo than a production system
- move preprocessing bottlenecks to cuDF or Triton/CUDA kernels
- add CI smoke tests for train, infer, and backtest
- add environment pinning for reproducibility
- export ONNX or TensorRT baselines for inference comparison
- expand the universe and factor set before taking signal quality seriously


