Surogate

⚡ FP8/FP4 Training, Fine-tuning and RL at the speed of light



Surogate Trainer

Surogate Trainer is built for developers and enterprises that need fast experimentation — whether running on-premise or in the cloud.

⚡ The Surogate trainer outperforms existing training frameworks by a large margin in single-GPU, multi-GPU, and GPU+CPU configurations.

✨ The native CPU offloading feature achieves better performance and lower VRAM usage than QLoRA, letting you fine-tune models at native bf16 precision and rendering QLoRA obsolete.

Highlights

  • 🔧 Pre-training + Fine-tuning: full fine-tuning, LoRA
  • 🔧 BF16, FP8 and NVFP4 Reinforcement Learning: advanced GRPO training and evaluation with custom, deterministic environments
  • 🔧 RL Environments: predictable environments for RL training
  • 🖥️...🖥️ Native multi-GPU training with multi-threaded backend
  • 🖥️...🖥️ Native multi-Node DDP training with Ray
  • ⚡ Native C++/CUDA engine for near–Speed-Of-Light (SOL) throughput
  • 🔥 Python DSL with AOT auto-differentiation for adding new model architectures
  • ⚖️ Smart CPU Offloading for weights, gradients, activations, quants
  • 📜 Pre-built training recipes:
    • 💎 BF16: Baseline recipe using bfloat16 for all GEMMs, designed for maximum numerical accuracy. No quantization is applied.
    • 🔥 FP8: Native FP8 training delivering extreme performance, with E4M3 used for activations and weights and E5M2 for gradients. Per-tensor delayed scaling keeps training stable.
    • 🔥 NVFP4: Native CUTLASS FP4 E2M1 training with two-level block scaling for extreme performance and memory efficiency on Blackwell GPUs (SM100+: B200, B300, RTX 50xx series). Uses stochastic rounding and random Hadamard transforms for numerical stability.
  • ⚡ BnB/FP8/NVFP4 QLoRA: support for a variety of QLoRA configurations, including online quantization (FP8, NVFP4, BnB) and loading pre-quantized weights (FP8, NVFP4); see the config sketch after this list
  • 👌 Optimizers: AdamW 8-bit, NorMuon
  • 🖥️ Runs on all NVIDIA GPUs: sm80, sm86, sm89, sm90, sm100, sm103, sm120, sm121
  • 🧪 Mixed-precision training: Mix different dtypes for GEMMs, model, gradients and LoRA recipes to create your own flavor.
  • 🛡️ Designed for reliability: deterministic configs, explicit recipes, and a clear C++ core
  • 🧬 Adaptive Training: built-in automated training monitoring with automatic phase detection, multi-criteria early stopping (convergence, compute-efficiency, divergence, plateau), auto LR management, MoE imbalance detection, Chinchilla token budgeting and dynamic epoch adjustment
  • 🎨 Dedicated MoE Features: Expert Parallelism, Least-Loaded EP load-balancing, MoE training metrics, Imbalance detection
  • 🥞 Stacked LoRA training: train a LoRA adapter on top of another LoRA adapter, skipping the offline merge into the base model.
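
As a taste of how these options compose, here is a minimal FP8 QLoRA config sketch built only from the keys shown in the Quickstart below; exact keys may differ between releases, and FP8 support is hardware-dependent:

# Sketch: FP8 QLoRA fine-tune; keys mirror the Quickstart example below
model: Qwen/Qwen3-0.6B          # any supported model from the table below
output_dir: ./output

per_device_train_batch_size: 2
gradient_accumulation_steps: 4
sequence_len: 2048
learning_rate: 2e-4

lora: true
lora_rank: 16
qlora_fp8: true                 # online FP8 quantization (hardware-dependent)

datasets:
  - path: "mlabonne/FineTome-100k"
    type: auto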

🧠 Supported Models:

We support the following models; please create a PR if you need one that is not listed.

Model              | Architecture                                             | Sizes
Qwen3              | Qwen3ForCausalLM                                         | 0.6B, 1.7B, 4B, 8B, 14B, 35B
Qwen3VL            | Qwen3VLForConditionalGeneration                          | 2B, 4B, 8B, 32B
Qwen3 MoE          | Qwen3MoeForCausalLM                                      | 30B-A3B, 235B-A22B
Qwen3.5            | Qwen3_5ForCausalLM, Qwen3_5ForConditionalGeneration      | 0.8B, 2B, 4B, 9B, 27B
Qwen3.5 MoE        | Qwen3MoeForCausalLM, Qwen3_5MoeForConditionalGeneration  | 35B-A3B, 122B-A10B, 397B-A17B
Nemotron Nano v3   | NemotronHForCausalLM                                     | 30B-A3B
Nemotron Super v3  | NemotronHForCausalLM                                     | 120B-A12B
Nemotron Cascade 2 | NemotronHForCausalLM                                     | 30B-A3B
GPT-OSS            | GptOssForCausalLM                                        | 20B, 120B
Llama 3.1          | LlamaForCausalLM                                         | 8B, 70B, 405B
Llama 3.2          | LlamaForCausalLM                                         | 1B, 3B

🚀 Quickstart

You interact with the Surogate high-performance training engine at the framework level via its CLI.

Run the Surogate Training Engine:

Option A: Run using Docker (recommended)

Surogate provides three Docker images covering the supported CUDA versions. Currently only the x86-64 architecture is supported.

CUDA   | Image                                       | Recommended NVIDIA driver | Minimum NVIDIA driver
12.8.1 | ghcr.io/invergent-ai/surogate:latest-cu128  | >= 570.124.06             | >= 525
12.9.1 | ghcr.io/invergent-ai/surogate:latest-cu129  | >= 575.57.08              | >= 525
13.1   | ghcr.io/invergent-ai/surogate:latest-cu130  | >= 590.48.01              | >= 580

docker run --gpus=all -v /my/local/config.yaml:/home/surogate/config.yaml -v /my/local/output_dir:<OUTPUT_DIR_FROM_CONFIG_YAML> <IMAGE> sft config.yaml
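
For example, assuming config.yaml sets output_dir to /home/surogate/output (an illustrative choice, not a requirement) and using the CUDA 12.8 image:

docker run --gpus=all \
  -v $PWD/config.yaml:/home/surogate/config.yaml \
  -v $PWD/output:/home/surogate/output \
  ghcr.io/invergent-ai/surogate:latest-cu128 sft config.yaml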

Option B: Install via script

curl -LsSf https://surogate.ai/install.sh | sh

Option C: Build from source (dev / contributors)

You need CUDA 12.8/12.9/13.x installed on your machine, plus the NCCL development libraries (libnccl-dev) matching your CUDA version.

# ...clone repo...
uv pip install -e .

Quickstart (SFT)

  1. Create a config (example):
model: Qwen/Qwen3-0.6B
output_dir: ./output

# training
per_device_train_batch_size: 2
gradient_accumulation_steps: 4
sequence_len: 2048
learning_rate: 2e-4

# LoRA / QLoRA
lora: true
lora_rank: 16
# qlora_fp8: true  # optional, hardware-dependent
# qlora_fp4: true  # Blackwell+
# qlora_bnb: true  # Any GPU, lowest

datasets:
  - path: "mlabonne/FineTome-100k"
    type: auto
  2. Run:
surogate sft config.yaml
  3. Outputs:
  • checkpoints, logs and artifacts are written under output_dir
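
Since datasets is a YAML list, multiple entries can presumably be combined in a single run; a sketch reusing only the keys shown above (the second path is a hypothetical placeholder):

datasets:
  - path: "mlabonne/FineTome-100k"
    type: auto
  - path: "your-org/your-dataset"   # hypothetical placeholder, not a real dataset
    type: auto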

Hardware / Requirements

  • NVIDIA GPU + recent driver
  • CUDA 12.8, 12.9, 13, NCCL, cuDNN
  • Linux x86_64

Supported NVIDIA GPUs:

  • SM80: A100, A30
  • SM86: A2, A16, A10, A40, RTX 3050, RTX 3060, RTX 3070, RTX 3080, RTX 3090, A2000, A3000, A4000, A5000, A6000
  • SM89: L4, L40, L40S, RTX 4050, RTX 4060, RTX 4070, RTX 4080, RTX 4090, RTX 2000 Ada, RTX 4000 SFF Ada, RTX 4000 Ada, RTX 4500 Ada, RTX 5000 Ada, RTX 6000 Ada
  • SM90: H100, H200, GH200
  • SM100: B200, GB200
  • SM103: B300, GB300
  • SM120: RTX PRO 6000/5000/4000/2500/2000 Blackwell, RTX 5050, RTX 5060, RTX 5070, RTX 5080, RTX 5090
  • SM121: DGX Spark

Documentation / Examples


Contributing

We welcome contributions across the entire ecosystem! If you are submitting a PR to the core framework, please ensure you include a clear description, steps to test locally, and relevant examples.

If you’re adding kernels/recipes or touching build/tooling, please keep changes minimal and include:

  • a short description of the change,
  • how to reproduce/validate locally (make test where applicable),
  • and any GPU/arch assumptions.

License

Apache 2.0 — see LICENSE.