LLAA178/qlib-gpu-model

qlib-gpu-model

GPU-first quant deep learning starter built with PyTorch and Qlib-style data pipelines.

This repo is for learning and engineering practice:

  • sequence model training on CUDA, MPS, or CPU
  • mixed precision, torch.compile, and DDP-ready training
  • inference latency benchmarking
  • simple signal evaluation with backtest and walk-forward scripts
  • a concrete CUDA optimization entry via Triton layer norm
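
As a rough illustration of the mixed-precision path, a single AMP-style training step might look like the sketch below. The `train_step` helper and the tiny stand-in model are hypothetical, not the repo's actual `train.py`:

```python
import torch
from torch import nn

def train_step(model, batch, target, optimizer, device="cpu", amp_dtype=None):
    """One training step with optional autocast mixed precision.

    amp_dtype: torch.bfloat16 / torch.float16, or None for full precision.
    bf16 needs no loss scaling; fp16 on CUDA would normally add a GradScaler.
    """
    model.train()
    optimizer.zero_grad(set_to_none=True)
    device_type = "cuda" if str(device).startswith("cuda") else "cpu"
    with torch.autocast(device_type=device_type, dtype=amp_dtype,
                        enabled=amp_dtype is not None):
        loss = nn.functional.mse_loss(model(batch), target)
    loss.backward()
    optimizer.step()
    return loss.item()

# Hypothetical stand-in for the repo's temporal transformer.
model = nn.Sequential(nn.Flatten(), nn.Linear(32 * 8, 1))
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
x = torch.randn(64, 32, 8)   # (batch, seq_len, num_features)
y = torch.randn(64, 1)
loss = train_step(model, x, y, opt, amp_dtype=torch.bfloat16)
```

On CUDA the same loop would typically also wrap the model in torch.compile and, for multi-GPU runs, DistributedDataParallel.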

This repo is not a production trading system:

  • no order management or live execution
  • no portfolio/risk engine
  • no claim of durable alpha from the included results

What Is Included

flowchart LR
    A[Open Data or Qlib Data] --> B[Sequence Windows]
    B --> C[Temporal Transformer]
    C --> D[AMP / compile / DDP Training]
    C --> E[Inference Benchmark]
    C --> F[Backtest / Walk-Forward]
    D --> G[Triton Optimization Entry]

Core modules:

  • train.py: training loop with AMP, torch.compile, and DDP hooks
  • infer.py: latency and throughput benchmark
  • backtest.py: simple cross-sectional long-short evaluation
  • walk_forward.py: rolling train/test evaluation
  • open_data.py: Yahoo Finance to parquet dataset builder
  • render_figures.py: regenerate README figures
  • triton_layer_norm.py: optional Triton layer norm path on CUDA

Results Snapshot

These figures are diagnostics for a learning repo, not production claims.

Strict walk-forward evaluation across rolling folds:

Walk-Forward Overview

Latest cross-sectional selection snapshot:

Latest Picks

Single-holdout backtest view:

Holdout Backtest Overview

Current strict walk-forward baseline on liquid50:

  • mean fold annualized return: 8.70%
  • mean fold Sharpe: 0.61
  • overall annualized return: 4.58%
  • overall Sharpe: 0.57
  • overall max drawdown: -5.62%
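
For reference, summary numbers like these can be reproduced from a daily return series in a few lines. This is a generic sketch assuming 252 trading periods per year, not the repo's exact metric code:

```python
import numpy as np

def summarize(daily_returns, periods_per_year=252):
    """Annualized return, Sharpe ratio, and max drawdown from daily returns."""
    r = np.asarray(daily_returns, dtype=float)
    ann_return = (1.0 + r).prod() ** (periods_per_year / len(r)) - 1.0
    sharpe = np.sqrt(periods_per_year) * r.mean() / r.std(ddof=1)
    equity = (1.0 + r).cumprod()                          # compounded equity curve
    max_dd = (equity / np.maximum.accumulate(equity) - 1.0).min()
    return ann_return, sharpe, max_dd
```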

Interpretation:

  • trust the strict walk-forward numbers over the single-holdout view, since each fold is evaluated on data the model never trained on
  • the current baseline is useful as an engineering benchmark, not as a strong strategy

Quick Start

1. Setup

git clone <repo-url>
cd qlib-gpu-model
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -U pip
python3 -m pip install -e .
python3 -m pip install -e '.[viz]'

Optional extras:

python3 -m pip install -e '.[qlib]'
python3 -m pip install -e '.[triton]'
python3 -m pip install -e '.[profiling]'

Reproducible CPU baseline:

  • recommended interpreter: Python 3.11
  • pinned baseline package set: requirements-lock.txt
  • GitHub Actions runs a no-network smoke path on every push and PR

2. Smoke Train

python3 -m qlib_gpu_model.train \
  --data-source synthetic \
  --device cpu \
  --seq-len 32 \
  --num-features 8 \
  --batch-size 64 \
  --epochs 1 \
  --amp-dtype fp32 \
  --use-compile false \
  --out-dir outputs/smoke

3. Benchmark Inference

python3 -m qlib_gpu_model.infer \
  --checkpoint outputs/smoke/best.pt \
  --device cpu \
  --batch-size 32 \
  --iters 50 \
  --warmup 5
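
Conceptually, the benchmark separates warmup iterations from timed ones and synchronizes on CUDA so kernel queueing does not hide latency. A minimal version of that loop (hypothetical, not the repo's `infer.py`):

```python
import time
import torch
from torch import nn

@torch.no_grad()
def benchmark(model, batch, iters=50, warmup=5, device="cpu"):
    """Median per-batch latency (ms) and throughput (samples/s)."""
    model.eval()
    sync = torch.cuda.synchronize if str(device).startswith("cuda") else (lambda: None)
    for _ in range(warmup):   # warmup: caches and lazy init settle before timing
        model(batch)
    sync()
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        model(batch)
        sync()                # wait for queued CUDA kernels before stopping the clock
        times.append(time.perf_counter() - t0)
    latency_s = sorted(times)[len(times) // 2]   # median is robust to stragglers
    return latency_s * 1e3, batch.shape[0] / latency_s

lat_ms, tput = benchmark(nn.Linear(8, 1), torch.randn(32, 8), iters=10, warmup=2)
```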

4. Try the Included CUDA-Trained Checkpoint

The repo includes one checked-in artifact for demonstration:

  • outputs/liquid50_rank5_cuda/best.pt

It can be loaded on CPU for validation:

python3 -m qlib_gpu_model.infer \
  --checkpoint outputs/liquid50_rank5_cuda/best.pt \
  --device cpu \
  --batch-size 32 \
  --iters 50 \
  --warmup 5

Open Data Workflow

Build a public parquet dataset:

python3 -m qlib_gpu_model.open_data \
  --tickers-file data/universes/us_liquid_50.txt \
  --start 2018-01-01 \
  --end 2025-12-31 \
  --out-path data/open_us_liquid50.parquet

Train a 5-day rank target baseline:

python3 -m qlib_gpu_model.train \
  --data-source parquet \
  --parquet-path data/open_us_liquid50.parquet \
  --target-col label_5d_rank \
  --device auto \
  --seq-len 32 \
  --batch-size 512 \
  --epochs 3 \
  --amp-dtype bf16 \
  --use-compile true \
  --out-dir outputs/liquid50_rank5_cuda
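
The label_5d / label_5d_rank columns referenced above pair a 5-day forward return with its cross-sectional rank. One plausible construction in pandas (hypothetical; the repo's `open_data.py` may differ):

```python
import pandas as pd

def add_labels(df, horizon=5):
    """Add a forward-return label and its cross-sectional rank.

    df: long-format frame with columns [date, ticker, close].
    """
    df = df.sort_values(["ticker", "date"]).copy()
    # Forward return over `horizon` days, per ticker.
    fwd = df.groupby("ticker")["close"].shift(-horizon) / df["close"] - 1.0
    df[f"label_{horizon}d"] = fwd
    # Percentile rank within each date; rank targets are robust to
    # cross-sectional outliers compared with raw returns.
    df[f"label_{horizon}d_rank"] = (
        df.groupby("date")[f"label_{horizon}d"].rank(pct=True)
    )
    return df
```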

Evaluate with a simple cost-aware backtest:

python3 -m qlib_gpu_model.backtest \
  --checkpoint outputs/liquid50_rank5_cuda/best.pt \
  --parquet-path data/open_us_liquid50.parquet \
  --target-col label_5d \
  --device cpu \
  --split valid \
  --top-quantile 0.2 \
  --transaction-cost-bps 10 \
  --rebalance-every 5 \
  --period-days 5
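
The evaluation above is cross-sectional: at each rebalance, go long the top score quantile, short the bottom, and subtract turnover costs. A stripped-down version of that logic (hypothetical; the repo's `backtest.py` may differ in weighting and cost accounting):

```python
import pandas as pd

def long_short_returns(preds, top_quantile=0.2, cost_bps=10.0):
    """Equal-weight long-short return per rebalance date, net of costs.

    preds: frame with columns [date, ticker, score, fwd_return], one row per
    stock per rebalance date; fwd_return is the realized holding-period return.
    """
    out = []
    for _, g in preds.groupby("date"):
        hi = g["score"].quantile(1.0 - top_quantile)
        lo = g["score"].quantile(top_quantile)
        long_ret = g.loc[g["score"] >= hi, "fwd_return"].mean()
        short_ret = g.loc[g["score"] <= lo, "fwd_return"].mean()
        # Assume one full turnover of both legs per rebalance, charged in bps.
        out.append(long_ret - short_ret - 2.0 * cost_bps / 1e4)
    return pd.Series(out)
```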

Run a stricter walk-forward evaluation:

python3 -m qlib_gpu_model.walk_forward \
  --parquet-path data/open_us_liquid50.parquet \
  --train-target-col label_5d_rank \
  --eval-target-col label_5d \
  --device cuda \
  --train-months 24 \
  --test-months 6 \
  --step-months 6 \
  --seq-len 32 \
  --batch-size 1024 \
  --epochs 1 \
  --num-workers 2 \
  --sample-every 5 \
  --purge-days 5 \
  --amp-dtype bf16 \
  --use-compile false \
  --top-quantile 0.2 \
  --transaction-cost-bps 10 \
  --rebalance-every 5 \
  --period-days 5 \
  --out-dir outputs/liquid50_rank5_walk_forward_cuda_strict
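
The --train-months/--test-months/--step-months/--purge-days flags define a rolling fold schedule with a purge gap between train and test, so overlapping multi-day labels cannot leak across the split. A sketch of that schedule (hypothetical helper, not `walk_forward.py` itself):

```python
import pandas as pd

def rolling_folds(start, end, train_months=24, test_months=6,
                  step_months=6, purge_days=5):
    """Yield (train_start, train_end, test_start, test_end) rolling windows."""
    folds = []
    cursor, end = pd.Timestamp(start), pd.Timestamp(end)
    while True:
        train_end = cursor + pd.DateOffset(months=train_months)
        # Purge gap: drop `purge_days` so labels that span the boundary
        # never appear in both train and test.
        test_start = train_end + pd.Timedelta(days=purge_days)
        test_end = test_start + pd.DateOffset(months=test_months)
        if test_end > end:
            break
        folds.append((cursor, train_end, test_start, test_end))
        cursor += pd.DateOffset(months=step_months)
    return folds
```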

Regenerate README figures:

python3 -m qlib_gpu_model.render_figures \
  --backtest-daily outputs/liquid50_rank5_cuda/backtest_5d_cost10.daily.parquet \
  --predictions outputs/liquid50_rank5_cuda/predictions_5d.parquet \
  --walk-forward-daily outputs/liquid50_rank5_walk_forward_cuda_strict/daily_returns.parquet \
  --walk-forward-summary outputs/liquid50_rank5_walk_forward_cuda_strict/summary.json \
  --walk-forward-folds outputs/liquid50_rank5_walk_forward_cuda_strict/fold_metrics.parquet \
  --top-quantile 0.2 \
  --out-dir assets/readme

Device Notes

  • --device auto resolves in order: cuda -> mps -> cpu
  • macOS is fine for debugging, plotting, and small smoke runs
  • serious throughput, DDP, Triton benchmarking, and profiling should be done on Linux + NVIDIA GPU
  • torch.compile is skipped automatically on unsupported devices
  • AMP falls back automatically when the requested dtype is unsupported
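
The documented resolution order can be mirrored in a few lines (a sketch; the repo's actual resolver may differ):

```python
import torch

def resolve_device(requested="auto"):
    """Resolve a device string in the order cuda -> mps -> cpu."""
    if requested != "auto":
        return requested            # explicit request wins, e.g. "cuda:0"
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"                # Apple Silicon fallback
    return "cpu"
```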

Useful Scripts

  • scripts/download_liquid50.sh
  • scripts/train_rank5d_cuda.sh
  • scripts/backtest_rank5d_cuda.sh
  • scripts/walk_forward_rank5d_cuda_strict.sh
  • scripts/profile_cuda.sh
  • scripts/render_readme_figures.sh

Repository Layout

  • src/qlib_gpu_model/: library code
  • scripts/: runnable entry scripts
  • data/universes/: public ticker lists
  • assets/readme/: README images
  • outputs/liquid50_rank5_cuda/best.pt: included demo checkpoint
  • requirements-lock.txt: pinned CPU/open-data baseline
  • .github/workflows/smoke.yml: train/infer/backtest smoke CI

Limitations

  • the included data pipeline is simple and public-data based
  • the backtest is intentionally lightweight and omits many real trading constraints
  • current strategy metrics are not strong enough to market as alpha
  • this repo is better positioned as a GPU quant modeling demo than a production system

Next Improvements

  • move preprocessing bottlenecks to cuDF or Triton/CUDA kernels
  • broaden the CI smoke tests beyond the current no-network train/infer/backtest path
  • extend environment pinning beyond the CPU/open-data baseline
  • export ONNX or TensorRT baselines for inference comparison
  • expand the universe and factor set before taking signal quality seriously
