10,000+ TPS with MiniMax-M2.7 NVFP4 on Blackwell

11,401 tok/s from a 456B-parameter model on 6 workstation-class GPUs. NVFP4 quantization doesn't just save memory; it frees enough of it to change how the model can be deployed.

TL;DR

Configuration	GPUs	Peak Throughput	vs FP8
FP8 TP=4	4x RTX PRO 6000	2,096 tok/s	baseline
NVFP4 TP=4	4x RTX PRO 6000	2,496 tok/s	1.19x
NVFP4 TP=2 DP=2	4x RTX PRO 6000	3,995 tok/s	1.91x
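NVFP4 TP=2 DP=3	6x RTX PRO 6000	11,401 tok/s	5.44x

Peak values are read at different concurrency levels: c=128 for the 4-GPU runs (the highest level tested there) and c=896 for the 6-GPU run.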

The Insight

FP8 (229 GB) needs all 4 GPUs just to hold the model. There's no room for data parallelism.

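NVFP4 (126 GB) needs about half the memory. At TP=4 that is 31.5 GB of weights per GPU instead of 57.25 GB for FP8; at TP=2 it is 63 GB per GPU, so a full copy of the model fits on just two 96 GB GPUs. The freed memory becomes available for: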

More KV cache — 810K tokens vs 142K tokens
Data parallelism — Run multiple model instances on the same hardware

This is the real value of quantization: not just smaller models, but architectural degrees of freedom.
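
The fit arithmetic can be sketched in a few lines of Python. The checkpoint sizes (229 GB, 126 GB) and the 96 GB per-GPU capacity come from this README; the helper names are hypothetical, and restricting TP to powers of two is an assumption. The calculation counts weights only and ignores KV cache, activations, and runtime overhead, so it is a rough bound on what fits, not a capacity planner.

```python
GPU_MEMORY_GB = 96.0  # RTX PRO 6000 Blackwell, from the Hardware table


def weights_per_gpu(model_gb: float, tp: int) -> float:
    """Weight footprint per GPU when the model is sharded across `tp` GPUs."""
    return model_gb / tp


def min_tp(model_gb: float, gpu_gb: float = GPU_MEMORY_GB) -> int:
    """Smallest power-of-two TP degree whose weight shard fits on one GPU.

    Assumption: only power-of-two TP degrees are considered.
    """
    tp = 1
    while weights_per_gpu(model_gb, tp) >= gpu_gb:
        tp *= 2
    return tp


def max_replicas(num_gpus: int, tp: int) -> int:
    """Number of independent data-parallel replicas that fit on `num_gpus`."""
    return num_gpus // tp


for name, size_gb in [("FP8", 229.0), ("NVFP4", 126.0)]:
    tp = min_tp(size_gb)
    print(
        f"{name}: min TP={tp}, "
        f"{weights_per_gpu(size_gb, tp):.2f} GB of weights per GPU, "
        f"{max_replicas(4, tp)} replica(s) on 4 GPUs, "
        f"{max_replicas(6, tp)} on 6 GPUs"
    )
```

Under these assumptions FP8 needs TP=4 and leaves room for a single replica, while NVFP4 fits at TP=2 and allows two replicas on 4 GPUs or three on 6.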

Throughput Scaling

Detailed Results

FP8 TP=4 (4 GPUs, 229 GB model)

c=  1:    116 tok/s    c= 32:  1,406 tok/s
c=  4:    328 tok/s    c= 64:  2,126 tok/s
c=  8:    536 tok/s    c=128:  2,096 tok/s  <- peak
c= 16:    850 tok/s

NVFP4 TP=4 (4 GPUs, 126 GB model) — Same hardware, 19% faster

c=  1:    104 tok/s    c= 32:  1,635 tok/s
c=  4:    315 tok/s    c= 64:  2,472 tok/s
c=  8:    543 tok/s    c=128:  2,496 tok/s  <- peak (+19%)
c= 16:    892 tok/s

NVFP4 TP=2 DP=2 (4 GPUs, 2 instances) — Same hardware, 91% faster

c=  1:     85 tok/s    c= 32:  1,441 tok/s
c=  4:    292 tok/s    c= 64:  2,222 tok/s
c=  8:    518 tok/s    c=128:  3,995 tok/s  <- peak (+91%)
c= 16:    843 tok/s

NVFP4 TP=2 DP=3 (6 GPUs, 3 instances) — 10K+ breakthrough

c=  1:     84 tok/s    c=128:  3,963 tok/s
c=  4:    283 tok/s    c=256:  6,305 tok/s
c=  8:    485 tok/s    c=384:  7,859 tok/s
c= 16:    799 tok/s    c=512:  9,020 tok/s
c= 32:  1,312 tok/s    c=640: 10,217 tok/s  <- 10K breached!
c= 64:  2,406 tok/s    c=896: 11,401 tok/s  <- peak (5.44x)
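
Because the last configuration uses six GPUs while the baseline uses four, it can help to normalize the peaks per GPU. The sketch below does that using only the peak figures reported above; the function names are hypothetical.

```python
# Peak throughput (tok/s) and GPU count, copied from the results in this README.
RESULTS = {
    "FP8 TP=4": (2096, 4),
    "NVFP4 TP=4": (2496, 4),
    "NVFP4 TP=2 DP=2": (3995, 4),
    "NVFP4 TP=2 DP=3": (11401, 6),
}
BASELINE = "FP8 TP=4"


def per_gpu(name: str) -> float:
    """Peak throughput divided by the number of GPUs used."""
    tok_s, gpus = RESULTS[name]
    return tok_s / gpus


def speedup(name: str) -> float:
    """Aggregate peak throughput relative to the FP8 baseline."""
    return RESULTS[name][0] / RESULTS[BASELINE][0]


def per_gpu_speedup(name: str) -> float:
    """Per-GPU peak throughput relative to the FP8 baseline."""
    return per_gpu(name) / per_gpu(BASELINE)


for name in RESULTS:
    print(
        f"{name:18s} {per_gpu(name):8.1f} tok/s per GPU  "
        f"{speedup(name):.2f}x total  {per_gpu_speedup(name):.2f}x per GPU"
    )
```

Normalized this way, the 6-GPU configuration comes out at roughly 3.6x the FP8 baseline per GPU, keeping in mind that the two peaks were measured at different concurrency levels (c=128 and c=896).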

Hardware

Component	Spec
GPU	7x NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96 GB, SM120)
CPU	AMD (24 threads)
RAM	1 TB
Storage	Intel Optane P5800X (models), Micron P4800X (code)
OS	Ubuntu 24.04, Linux 6.17

Software

Component	Version
vLLM	0.19.0
CUDA	12.x
Model	MiniMaxAI/MiniMax-M2.7
NVFP4 Quantization	lukealonso/MiniMax-M2.7-NVFP4

Quick Start

# Install vLLM
pip install vllm==0.19.0

# Download the model
huggingface-cli download lukealonso/MiniMax-M2.7-NVFP4

# Launch (choose your configuration)
bash scripts/launch_tp4.sh   # TP=4: best single-user latency
bash scripts/launch_dp2.sh   # DP=2: 1.91x throughput on 4 GPUs
bash scripts/launch_dp3.sh   # DP=3: 10K+ tok/s on 6 GPUs

# Benchmark
pip install aiohttp
python benchmarks/bench.py --url http://localhost:8040 --model minimax-m27-nvfp4
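
The contents of benchmarks/bench.py are not reproduced here. As a rough illustration of what a concurrency sweep like the one in Detailed Results measures, the sketch below keeps up to `c` requests in flight and divides generated tokens by wall-clock time. `run_level`, `sweep`, and the `send` callback are hypothetical names; a real client would implement `send` as an HTTP request to the server and return the number of generated tokens, whereas here a fake backend stands in so the example runs without a server.

```python
import asyncio
import time
from typing import Awaitable, Callable

# Hypothetical callback: issues one request and returns the generated-token count.
SendFn = Callable[[], Awaitable[int]]


async def run_level(send: SendFn, concurrency: int, total_requests: int) -> float:
    """Run `total_requests` requests with at most `concurrency` in flight.

    Returns aggregate throughput in generated tokens per second.
    """
    limiter = asyncio.Semaphore(concurrency)

    async def one_request() -> int:
        async with limiter:
            return await send()

    start = time.perf_counter()
    token_counts = await asyncio.gather(
        *(one_request() for _ in range(total_requests))
    )
    elapsed = time.perf_counter() - start
    return sum(token_counts) / elapsed


async def fake_send() -> int:
    """Stand-in for a real request: 10 ms of latency, 16 tokens generated."""
    await asyncio.sleep(0.01)
    return 16


async def sweep() -> dict[int, float]:
    """Measure throughput at a few concurrency levels using the fake backend."""
    results = {}
    for c in (1, 4, 8):
        results[c] = await run_level(fake_send, concurrency=c, total_requests=16)
    return results


if __name__ == "__main__":
    for c, tok_s in asyncio.run(sweep()).items():
        print(f"c={c:3d}: {tok_s:8.0f} tok/s")
```

With the fake backend, throughput rises with concurrency simply because more requests overlap; against a real server the curve flattens once the GPUs saturate, which is the peak the tables report.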

SM120 Notes

On Blackwell SM120 GPUs (RTX PRO 6000), some CUTLASS MoE tactics fail during FlashInfer autotuning:

[Autotuner]: Skipping tactic ... Failed to initialize cutlass TMA WS grouped gemm

This is expected — the autotuner falls back to working tactics automatically. Performance is not significantly affected, but may improve further when voipmonitor's B12X MoE patches are merged.

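PCIe topologies without NVLink require disabling NCCL peer-to-peer transfers (environment variable) and vLLM's custom all-reduce kernel (server flag):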

NCCL_P2P_DISABLE=1
--disable-custom-all-reduce
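
The launch scripts under scripts/ are not reproduced here. As a hedged sketch of how the two settings above might be combined with the TP=2 DP=3 layout, the following builds a `vllm serve` command line and environment without starting anything. It assumes vLLM's built-in data parallelism (`--data-parallel-size`); the actual scripts may instead start separate instances. The helper name `build_launch` is hypothetical, and the port and served model name are taken from the Quick Start benchmark command.

```python
import os
import shlex


def build_launch(
    model: str,
    tp: int,
    dp: int,
    port: int = 8040,
    served_name: str = "minimax-m27-nvfp4",
) -> tuple[list[str], dict[str, str]]:
    """Return (argv, env) for a PCIe-only vLLM launch; nothing is executed here."""
    argv = [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tp),
        "--data-parallel-size", str(dp),  # assumption: built-in DP, not separate processes
        "--served-model-name", served_name,
        "--port", str(port),
        "--disable-custom-all-reduce",  # from the SM120 notes
    ]
    env = dict(os.environ)
    env["NCCL_P2P_DISABLE"] = "1"  # from the SM120 notes
    return argv, env


argv, env = build_launch("lukealonso/MiniMax-M2.7-NVFP4", tp=2, dp=3)
print("NCCL_P2P_DISABLE=" + env["NCCL_P2P_DISABLE"], shlex.join(argv))
# To actually start the server: subprocess.run(argv, env=env, check=True)
```

Keeping the command construction separate from execution makes it easy to print and inspect the exact invocation before committing GPU memory to it.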

Why This Matters

The conventional wisdom is that quantization trades quality for speed. But with MoE models like MiniMax-M2.7, NVFP4 quantization does something more interesting:

456B parameters, but only 8 experts active per token — the quality impact of FP4 on sparse activations is minimal
Half the memory footprint — not just "fits on fewer GPUs" but "enables parallelism strategies that FP8 physically cannot"
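Scaling with DP: each additional GPU pair adds another independent instance, provided concurrency is high enough to keep every instance busy (at c=128, DP=2 and DP=3 both deliver about 4,000 tok/s; the gain from the third instance appears at c=256 and above)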

The bottleneck shifts from "can I fit the model?" to "how many instances can I run?" — and that's a much better problem to have.

License

MIT

Acknowledgments

MiniMaxAI for MiniMax-M2.7
lukealonso for the NVFP4 quantization
vllm-project for the inference engine
voipmonitor for SM120 Blackwell pioneering work
Lna-Lab (Yuki + Ken) for the benchmarks and the embarrassing trinity_turbo incident (twice)
