11,401 tok/s from a 456B-parameter model on 6 workstation-class GPUs. NVFP4 quantization doesn't just save memory; it unlocks architectural freedom.
| Configuration | GPUs | Peak Throughput | vs FP8 |
|---|---|---|---|
| FP8 TP=4 | 4x RTX PRO 6000 | 2,096 tok/s | baseline |
| NVFP4 TP=4 | 4x RTX PRO 6000 | 2,496 tok/s | 1.19x |
| NVFP4 TP=2 DP=2 | 4x RTX PRO 6000 | 3,995 tok/s | 1.91x |
| NVFP4 TP=2 DP=3 | 6x RTX PRO 6000 | 11,401 tok/s | 5.44x |
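
The "vs FP8" column is each peak divided by the FP8 TP=4 peak. A minimal sketch that recomputes it from the table values (the function and variable names are illustrative, not part of this repo):

```python
# Peak throughput in tok/s, copied from the table above.
PEAK_TOK_S = {
    "FP8 TP=4": 2096,
    "NVFP4 TP=4": 2496,
    "NVFP4 TP=2 DP=2": 3995,
    "NVFP4 TP=2 DP=3": 11401,
}


def speedup_vs_baseline(peaks, baseline="FP8 TP=4"):
    """Divide every peak by the baseline peak and round to two decimals."""
    base = peaks[baseline]
    return {name: round(value / base, 2) for name, value in peaks.items()}


if __name__ == "__main__":
    for name, ratio in speedup_vs_baseline(PEAK_TOK_S).items():
        print(f"{name:<18} {ratio:.2f}x")
```

Note that the last row uses 6 GPUs against a 4-GPU baseline, so its ratio reflects both the quantization and the extra hardware.
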
FP8 (229 GB) needs all 4 GPUs just to hold the model. There's no room for data parallelism.
NVFP4 (126 GB) needs roughly half the memory: 31.5 GB of weights per GPU at TP=4 (versus 57.25 GB for FP8), or 63 GB per GPU at TP=2, which still fits on a 96 GB card. The memory this frees becomes available for:
- More KV cache — 810K tokens vs 142K tokens
- Data parallelism — Run multiple model instances on the same hardware
This is the real value of quantization: not just smaller models, but architectural degrees of freedom.
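
The memory argument can be checked with a rough sketch. It assumes weights shard evenly across the tensor-parallel group and ignores activations, CUDA context, and framework overhead, so the "free" figure is an upper bound on what KV cache could use. The `plan` helper is hypothetical, not part of this repo.

```python
GPU_MEMORY_GB = 96  # RTX PRO 6000 Blackwell, per the hardware table


def plan(model_gb, num_gpus, tp, gpu_gb=GPU_MEMORY_GB):
    """Estimate per-GPU weight footprint and how many replicas fit.

    Assumption: weights split evenly over `tp` GPUs; overheads are ignored.
    """
    weights_per_gpu = model_gb / tp
    fits = weights_per_gpu < gpu_gb
    return {
        "weights_per_gpu_gb": weights_per_gpu,
        "free_per_gpu_gb": gpu_gb - weights_per_gpu,
        "fits": fits,
        "max_replicas": num_gpus // tp if fits else 0,
    }


if __name__ == "__main__":
    for label, model_gb, gpus, tp in [
        ("FP8   TP=4, 4 GPUs", 229, 4, 4),
        ("FP8   TP=2, 4 GPUs", 229, 4, 2),
        ("NVFP4 TP=4, 4 GPUs", 126, 4, 4),
        ("NVFP4 TP=2, 4 GPUs", 126, 4, 2),
        ("NVFP4 TP=2, 6 GPUs", 126, 6, 2),
    ]:
        print(label, plan(model_gb, gpus, tp))
```

Under these assumptions FP8 at TP=2 would need 114.5 GB per GPU and does not fit, while NVFP4 at TP=2 needs 63 GB and leaves room for two replicas on 4 GPUs or three on 6.
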
FP8 TP=4 (4 GPUs, 229 GB model)
c= 1: 116 tok/s c= 32: 1,406 tok/s
c= 4: 328 tok/s c= 64: 2,126 tok/s
c= 8: 536 tok/s c=128: 2,096 tok/s <- peak
c= 16: 850 tok/s
NVFP4 TP=4 (4 GPUs, 126 GB model) — Same hardware, 19% faster
c= 1: 104 tok/s c= 32: 1,635 tok/s
c= 4: 315 tok/s c= 64: 2,472 tok/s
c= 8: 543 tok/s c=128: 2,496 tok/s <- peak (+19%)
c= 16: 892 tok/s
NVFP4 TP=2 DP=2 (4 GPUs, 2 instances) — Same hardware, 91% faster
c= 1: 85 tok/s c= 32: 1,441 tok/s
c= 4: 292 tok/s c= 64: 2,222 tok/s
c= 8: 518 tok/s c=128: 3,995 tok/s <- peak (+91%)
c= 16: 843 tok/s
NVFP4 TP=2 DP=3 (6 GPUs, 3 instances) — 10K+ breakthrough
c= 1: 84 tok/s c=128: 3,963 tok/s
c= 4: 283 tok/s c=256: 6,305 tok/s
c= 8: 485 tok/s c=384: 7,859 tok/s
c= 16: 799 tok/s c=512: 9,020 tok/s
c= 32: 1,312 tok/s c=640: 10,217 tok/s <- 10K breached!
c= 64: 2,406 tok/s c=896: 11,401 tok/s <- peak (5.44x)
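
Aggregate throughput rises with concurrency while the rate each client sees falls. A small sketch over the DP=3 sweep above makes that trade-off explicit. Dividing by `c` assumes every stream gets an equal share, which is an approximation; the helper names are illustrative.

```python
# (concurrency, aggregate tok/s) for NVFP4 TP=2 DP=3, copied from the sweep above.
DP3_SWEEP = [
    (1, 84), (4, 283), (8, 485), (16, 799), (32, 1312), (64, 2406),
    (128, 3963), (256, 6305), (384, 7859), (512, 9020), (640, 10217),
    (896, 11401),
]


def per_stream(sweep):
    """Aggregate tok/s divided by concurrency: the average rate per client."""
    return [(c, total / c) for c, total in sweep]


def first_crossing(sweep, threshold):
    """Smallest measured concurrency whose aggregate throughput reaches threshold."""
    for c, total in sorted(sweep):
        if total >= threshold:
            return c
    return None


if __name__ == "__main__":
    for c, rate in per_stream(DP3_SWEEP):
        print(f"c={c:>3}: {rate:6.1f} tok/s per stream")
    print("10K first reached at c =", first_crossing(DP3_SWEEP, 10_000))
```

On these numbers a single client sees 84 tok/s at c=1 and about 12.7 tok/s at c=896, so the peak configuration suits batch-style workloads more than interactive ones.
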
| Component | Spec |
|---|---|
| GPU | 7x NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96 GB, SM120) |
| CPU | AMD (24 threads) |
| RAM | 1 TB |
| Storage | Intel Optane P5800X (models), Micron P4800X (code) |
| OS | Ubuntu 24.04, Linux 6.17 |
| Component | Version |
|---|---|
| vLLM | 0.19.0 |
| CUDA | 12.x |
| Model | MiniMaxAI/MiniMax-M2.7 |
| NVFP4 Quantization | lukealonso/MiniMax-M2.7-NVFP4 |
# Install vLLM
pip install vllm==0.19.0
# Download the model
huggingface-cli download lukealonso/MiniMax-M2.7-NVFP4
# Launch (choose your configuration)
bash scripts/launch_tp4.sh # TP=4: best single-user latency
bash scripts/launch_dp2.sh # DP=2: 1.91x throughput on 4 GPUs
bash scripts/launch_dp3.sh # DP=3: 10K+ tok/s on 6 GPUs
# Benchmark
pip install aiohttp
python benchmarks/bench.py --url http://localhost:8040 --model minimax-m27-nvfp4

On Blackwell SM120 GPUs (RTX PRO 6000), some CUTLASS MoE tactics fail during FlashInfer autotuning:
[Autotuner]: Skipping tactic ... Failed to initialize cutlass TMA WS grouped gemm
This is expected — the autotuner falls back to working tactics automatically. Performance is not significantly affected, but may improve further when voipmonitor's B12X MoE patches are merged.
PCIe topologies without NVLink require:
NCCL_P2P_DISABLE=1
--disable-custom-all-reduce
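
The launch scripts are not reproduced here, so the following is only a sketch of how both settings might be applied when starting one TP=2 instance per GPU pair. The GPU-to-instance layout, the port scheme (starting from the 8040 used in the benchmark command), and the `build_launch` helper are assumptions; vLLM also has built-in data parallelism, which the real scripts may use instead.

```python
import os

MODEL = "lukealonso/MiniMax-M2.7-NVFP4"


def build_launch(instance, tp=2, base_port=8040, base_env=None):
    """Return (env, argv) for one vLLM instance pinned to its own GPU group.

    Hypothetical layout: instance i uses GPUs [i*tp, (i+1)*tp) and
    listens on port base_port + i.
    """
    env = dict(os.environ if base_env is None else base_env)
    gpus = range(instance * tp, (instance + 1) * tp)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpus)
    env["NCCL_P2P_DISABLE"] = "1"  # needed on PCIe-only topologies, per the note above
    argv = [
        "vllm", "serve", MODEL,
        "--tensor-parallel-size", str(tp),
        "--disable-custom-all-reduce",
        "--port", str(base_port + instance),
    ]
    return env, argv


if __name__ == "__main__":
    # Print the commands rather than starting servers.
    for i in range(3):
        env, argv = build_launch(i, base_env={})
        print(f"CUDA_VISIBLE_DEVICES={env['CUDA_VISIBLE_DEVICES']}",
              f"NCCL_P2P_DISABLE={env['NCCL_P2P_DISABLE']}", " ".join(argv))
```

To actually start an instance, the pair could be passed to `subprocess.Popen(argv, env=env)`.
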
The conventional wisdom is that quantization trades quality for speed. But with MoE models like MiniMax-M2.7, NVFP4 quantization does something more interesting:
- 456B parameters, but only 8 experts active per token — the quality impact of FP4 on sparse activations is minimal
- Half the memory footprint — not just "fits on fewer GPUs" but "enables parallelism strategies that FP8 physically cannot"
- Linear scaling with DP — each additional GPU pair adds another full-speed instance
The bottleneck shifts from "can I fit the model?" to "how many instances can I run?" — and that's a much better problem to have.
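
One way to look at the scaling claim is to normalize peak throughput by GPU count. A minimal sketch using the figures from the results table (the `per_gpu` helper is illustrative):

```python
# (name, GPUs, peak tok/s) from the results table.
CONFIGS = [
    ("FP8 TP=4", 4, 2096),
    ("NVFP4 TP=4", 4, 2496),
    ("NVFP4 TP=2 DP=2", 4, 3995),
    ("NVFP4 TP=2 DP=3", 6, 11401),
]


def per_gpu(configs):
    """Peak tok/s divided by the number of GPUs in each configuration."""
    return {name: peak / gpus for name, gpus, peak in configs}


if __name__ == "__main__":
    for name, value in per_gpu(CONFIGS).items():
        print(f"{name:<18} {value:8.1f} tok/s per GPU")
```

The DP=3 peak was measured at c=896 and the DP=2 peak at c=128, so the per-GPU figures are not a like-for-like comparison; at c=128 the two configurations report 3,963 and 3,995 tok/s respectively.
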
MIT
- MiniMaxAI for MiniMax-M2.7
- lukealonso for the NVFP4 quantization
- vllm-project for the inference engine
- voipmonitor for SM120 Blackwell pioneering work
- Lna-Lab (Yuki + Ken) for the benchmarks and the embarrassing trinity_turbo incident (twice)


