lna-lab/Homemade-10000TPS-Project

10,000+ TPS with MiniMax-M2.7 NVFP4 on Blackwell

11,401 tok/s from a 456B-parameter model on six workstation-class GPUs. NVFP4 quantization doesn't just save memory; it unlocks architectural freedom.

[Figure: Peak Throughput Comparison]

TL;DR

| Configuration | GPUs | Peak Throughput | vs FP8 |
|---|---|---|---|
| FP8 TP=4 | 4x RTX PRO 6000 | 2,096 tok/s | baseline |
| NVFP4 TP=4 | 4x RTX PRO 6000 | 2,496 tok/s | 1.19x |
| NVFP4 TP=2 DP=2 | 4x RTX PRO 6000 | 3,995 tok/s | 1.91x |
| NVFP4 TP=2 DP=3 | 6x RTX PRO 6000 | 11,401 tok/s | 5.44x |
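
The "vs FP8" column is each configuration's peak divided by the FP8 TP=4 peak. A minimal sketch that reproduces it from the numbers in the table (the variable names are illustrative, not taken from the repository):

```python
# Peak throughput in tok/s, copied from the TL;DR table.
peaks = {
    "FP8 TP=4": 2096,
    "NVFP4 TP=4": 2496,
    "NVFP4 TP=2 DP=2": 3995,
    "NVFP4 TP=2 DP=3": 11401,
}

# Every ratio is taken against the FP8 TP=4 peak.
baseline = peaks["FP8 TP=4"]
speedups = {name: round(tps / baseline, 2) for name, tps in peaks.items()}

for name, ratio in speedups.items():
    print(f"{name:<16} {ratio:.2f}x")
```

Note that the 6-GPU row uses 1.5x the hardware of the baseline, so 5.44x is a system-level ratio rather than a same-hardware one.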

The Insight

FP8 (229 GB) needs all 4 GPUs just to hold the model. There's no room for data parallelism.

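NVFP4 (126 GB) takes a little over half the memory. With TP=4 that is 31.5 GB of weights per GPU instead of 57.25 GB for FP8; with TP=2 it is about 63 GB per GPU, which still fits on a 96 GB card. The freed memory becomes available for: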

  1. More KV cache — 810K tokens vs 142K tokens
  2. Data parallelism — Run multiple model instances on the same hardware

This is the real value of quantization: not just smaller models, but architectural degrees of freedom.
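
The memory argument can be checked with a few lines of arithmetic. This is a rough sketch, assuming weights are sharded evenly across the TP group and ignoring activations, CUDA context, and whatever memory fraction vLLM is configured to use; the helper names are made up for illustration:

```python
GPU_MEMORY_GB = 96  # per-GPU memory of the RTX PRO 6000 (see Hardware)

def weights_per_gpu(model_gb, tp):
    """Weight memory per GPU, assuming an even split across `tp` GPUs."""
    return model_gb / tp

def fits(model_gb, tp):
    """True if the weight shard alone is smaller than one GPU's memory."""
    return weights_per_gpu(model_gb, tp) < GPU_MEMORY_GB

def max_instances(num_gpus, tp):
    """Number of independent TP groups that can be placed on `num_gpus` GPUs."""
    return num_gpus // tp

for name, model_gb in (("FP8", 229), ("NVFP4", 126)):
    for tp in (2, 4):
        per_gpu = weights_per_gpu(model_gb, tp)
        status = "fits" if fits(model_gb, tp) else "does not fit"
        print(f"{name:<5} TP={tp}: {per_gpu:6.2f} GB weights/GPU, "
              f"{GPU_MEMORY_GB - per_gpu:6.2f} GB left ({status})")

print("NVFP4 TP=2 instances on 6 GPUs:", max_instances(6, 2))
```

Under these assumptions FP8 at TP=2 would need 114.5 GB per GPU, which is why only the NVFP4 checkpoint can be split into 2-GPU instances.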

[Figure: Memory Layout]

[Figure: Throughput Scaling]

[Figure: Throughput Comparison]

Detailed Results

FP8 TP=4 (4 GPUs, 229 GB model)

c=  1:    116 tok/s    c= 32:  1,406 tok/s
c=  4:    328 tok/s    c= 64:  2,126 tok/s
c=  8:    536 tok/s    c=128:  2,096 tok/s  <- reported peak (c=64 reads slightly higher)
c= 16:    850 tok/s

NVFP4 TP=4 (4 GPUs, 126 GB model) — Same hardware, 19% faster

c=  1:    104 tok/s    c= 32:  1,635 tok/s
c=  4:    315 tok/s    c= 64:  2,472 tok/s
c=  8:    543 tok/s    c=128:  2,496 tok/s  <- peak (+19%)
c= 16:    892 tok/s

NVFP4 TP=2 DP=2 (4 GPUs, 2 instances) — Same hardware, 91% faster

c=  1:     85 tok/s    c= 32:  1,441 tok/s
c=  4:    292 tok/s    c= 64:  2,222 tok/s
c=  8:    518 tok/s    c=128:  3,995 tok/s  <- peak (+91%)
c= 16:    843 tok/s

NVFP4 TP=2 DP=3 (6 GPUs, 3 instances) — 10K+ breakthrough

c=  1:     84 tok/s    c=128:  3,963 tok/s
c=  4:    283 tok/s    c=256:  6,305 tok/s
c=  8:    485 tok/s    c=384:  7,859 tok/s
c= 16:    799 tok/s    c=512:  9,020 tok/s
c= 32:  1,312 tok/s    c=640: 10,217 tok/s  <- 10K breached!
c= 64:  2,406 tok/s    c=896: 11,401 tok/s  <- peak (5.44x)
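
Aggregate throughput rises with load, but each individual request gets a smaller share. A small sketch over the DP=3 sweep above, assuming `c` is the number of concurrent requests (the README does not define it explicitly):

```python
# NVFP4 TP=2 DP=3 sweep: concurrency -> aggregate tok/s (from the table above).
dp3_sweep = {
    1: 84, 4: 283, 8: 485, 16: 799, 32: 1312, 64: 2406,
    128: 3963, 256: 6305, 384: 7859, 512: 9020, 640: 10217, 896: 11401,
}

def per_stream(sweep):
    """Average tok/s seen by one request at each concurrency level."""
    return {c: tps / c for c, tps in sweep.items()}

def first_above(sweep, threshold):
    """Lowest concurrency whose aggregate throughput exceeds `threshold`."""
    for c in sorted(sweep):
        if sweep[c] > threshold:
            return c
    return None

rates = per_stream(dp3_sweep)
for c in sorted(dp3_sweep):
    print(f"c={c:>3}: {dp3_sweep[c]:>6,} tok/s total, {rates[c]:5.1f} tok/s per request")

print("10K tok/s first exceeded at c =", first_above(dp3_sweep, 10_000))
```

The 4-GPU sweeps stop at c=128 while the DP=3 sweep runs to c=896, so peak-to-peak comparisons mix different load levels. At c=128 the DP=2 and DP=3 results are nearly identical (3,995 vs 3,963 tok/s).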

Hardware

| Component | Spec |
|---|---|
| GPU | 7x NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96 GB, SM120) |
| CPU | AMD (24 threads) |
| RAM | 1 TB |
| Storage | Intel Optane P5800X (models), Micron P4800X (code) |
| OS | Ubuntu 24.04, Linux 6.17 |
OS Ubuntu 24.04, Linux 6.17

Software

| Component | Version |
|---|---|
| vLLM | 0.19.0 |
| CUDA | 12.x |
| Model | MiniMaxAI/MiniMax-M2.7 |
| NVFP4 Quantization | lukealonso/MiniMax-M2.7-NVFP4 |

Quick Start

# Install vLLM
pip install vllm==0.19.0

# Download the model
huggingface-cli download lukealonso/MiniMax-M2.7-NVFP4

# Launch (choose your configuration)
bash scripts/launch_tp4.sh   # TP=4: best single-user latency
bash scripts/launch_dp2.sh   # DP=2: 1.91x throughput on 4 GPUs
bash scripts/launch_dp3.sh   # DP=3: 10K+ tok/s on 6 GPUs

# Benchmark
pip install aiohttp
python benchmarks/bench.py --url http://localhost:8040 --model minimax-m27-nvfp4
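
The contents of `benchmarks/bench.py` are not shown here. As a rough idea of how one point of a concurrency sweep can be measured, the sketch below caps in-flight requests with a semaphore and divides generated tokens by wall-clock time. `run_sweep_point` and `fake_request` are hypothetical names; `fake_request` stands in for a real HTTP call (for example one made with aiohttp) so the example runs without a server:

```python
import asyncio
import time

async def run_sweep_point(send_request, concurrency, total_requests):
    """Issue `total_requests` requests with at most `concurrency` in flight.

    `send_request` is an async callable returning the number of generated
    tokens for one request. Returns aggregate tokens per second.
    """
    semaphore = asyncio.Semaphore(concurrency)

    async def one_request():
        async with semaphore:
            return await send_request()

    start = time.perf_counter()
    token_counts = await asyncio.gather(
        *(one_request() for _ in range(total_requests))
    )
    elapsed = time.perf_counter() - start
    return sum(token_counts) / elapsed

async def fake_request():
    """Placeholder for a server call: pretends 100 tokens took 10 ms."""
    await asyncio.sleep(0.01)
    return 100

if __name__ == "__main__":
    for c in (1, 4, 8):
        tps = asyncio.run(run_sweep_point(fake_request, c, total_requests=32))
        print(f"c={c:>3}: {tps:,.0f} tok/s")
```

To measure a real server, `fake_request` would be replaced by a coroutine that posts a prompt to the endpoint and returns the completion token count from the response.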

SM120 Notes

On Blackwell SM120 GPUs (RTX PRO 6000), some CUTLASS MoE tactics fail during FlashInfer autotuning:

[Autotuner]: Skipping tactic ... Failed to initialize cutlass TMA WS grouped gemm

This is expected — the autotuner falls back to working tactics automatically. Performance is not significantly affected, but may improve further when voipmonitor's B12X MoE patches are merged.

PCIe topologies without NVLink require:

  • NCCL_P2P_DISABLE=1
  • --disable-custom-all-reduce
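
The launch scripts themselves are not reproduced in this README. The sketch below shows one way the two settings above could be combined with a TP=2 layout: it only builds the environment and command line for one `vllm serve` process per GPU pair and prints them. The port numbers, the GPU-to-instance assignment, and the idea of running separate processes are assumptions for illustration, not necessarily what `scripts/launch_dp3.sh` does:

```python
import shlex

MODEL = "lukealonso/MiniMax-M2.7-NVFP4"
SERVED_NAME = "minimax-m27-nvfp4"  # model name used in the benchmark command

def build_launch(gpu_ids, port):
    """Return (env, argv) for one instance pinned to `gpu_ids`."""
    env = {
        "CUDA_VISIBLE_DEVICES": ",".join(str(g) for g in gpu_ids),
        "NCCL_P2P_DISABLE": "1",  # PCIe-only topology, no NVLink
    }
    argv = [
        "vllm", "serve", MODEL,
        "--served-model-name", SERVED_NAME,
        "--tensor-parallel-size", str(len(gpu_ids)),
        "--port", str(port),
        "--disable-custom-all-reduce",
    ]
    return env, argv

def plan_instances(num_gpus, tp, base_port):
    """Split GPUs 0..num_gpus-1 into consecutive groups of `tp`."""
    plans = []
    for i in range(num_gpus // tp):
        gpu_ids = list(range(i * tp, (i + 1) * tp))
        plans.append(build_launch(gpu_ids, base_port + i))
    return plans

# Hypothetical ports 8041-8043 for three TP=2 instances on six GPUs.
for env, argv in plan_instances(num_gpus=6, tp=2, base_port=8041):
    prefix = " ".join(f"{k}={v}" for k, v in env.items())
    print(prefix, shlex.join(argv))
```

With separate processes like this, something still has to spread requests across the instances (a reverse proxy or a client-side round robin); that part is left out here.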

Why This Matters

The conventional wisdom is that quantization trades quality for speed. But with MoE models like MiniMax-M2.7, NVFP4 quantization does something more interesting:

  • 456B parameters, but only 8 experts active per token — the quality impact of FP4 on sparse activations is minimal
  • Half the memory footprint — not just "fits on fewer GPUs" but "enables parallelism strategies that FP8 physically cannot"
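  • Scaling with DP: each additional GPU pair adds another independent instance, and aggregate throughput keeps climbing as long as there are enough concurrent requests to keep every instance busy (DP=3 passes 10K tok/s at c=640)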

The bottleneck shifts from "can I fit the model?" to "how many instances can I run?" — and that's a much better problem to have.

License

MIT

Acknowledgments

  • MiniMaxAI for MiniMax-M2.7
  • lukealonso for the NVFP4 quantization
  • vllm-project for the inference engine
  • voipmonitor for SM120 Blackwell pioneering work
  • Lna-Lab (Yuki + Ken) for the benchmarks and the embarrassing trinity_turbo incident (twice)
