PyTorch, Triton, flash-attention, BitsAndBytes — pre-built wheels and reproducible build scripts for NVIDIA DGX Spark (GB10, sm_121, Blackwell, CUDA 13.0, Python 3.12, ARM64).
Can't `pip install torch` on your DGX Spark? You're in the right place.
DGX Spark ships with GB10 — a Blackwell GPU with compute capability sm_121. Most ML frameworks don't officially support this architecture yet:
- PyTorch: Official wheels max out at `sm_120`, emit warnings on `sm_121`
- Triton: bundled `ptxas` may fail on `sm_121a`; use CUDA 13.0 `ptxas` via env override
- flash-attention: No `sm_121` kernels, compilation fails
- vLLM: Requires Docker or source builds for Blackwell
- TransformerEngine: MXFP8 broken on this arch
This repo provides build scripts, pre-built wheels, compatibility info, and benchmarks so you can run a full LLM stack on your DGX Spark without fighting the toolchain.
git clone https://github.com/ogulcanaydogan/dgx-spark-llm-stack.git
cd dgx-spark-llm-stack
./install.sh
This downloads pre-built wheels from GitHub Releases and installs the full stack. After installation, it runs verification automatically.
| Library | Version | sm_121 Status | Notes |
|---|---|---|---|
| PyTorch | 2.9.1 | Official max sm_120; our wheel targets sm_121 | |
| Triton | 3.5.1 | Set `TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas` on DGX Spark | |
| flash-attention | 2.7+ | ❌ Not supported | Use SDPA fallback (see docs) |
| BitsAndBytes | 0.49+ | ✅ Works | FP4/NF4 quantization tested |
| vLLM | 0.8+ | Build from source or use NGC container | |
| llama.cpp | Latest | ✅ Works well | Best option for inference |
| TensorRT-LLM | 0.9+ | Legacy 1.1.0rc1 fails with an SM90-only assertion; stable 1.2.0 passes (validated 2026-04-03) | |
| TransformerEngine | - | ❌ Broken | MXFP8 training unsupported |
| Unsloth | Latest | ✅ Works | Recommended for fine-tuning |
| transformers | 4.48+ | ✅ Works | Standard HF stack |
| PEFT / LoRA | Latest | ✅ Works | QLoRA with BitsAndBytes OK |
| TRL | Latest | ✅ Works | SFT, DPO, ORPO all work |
Full details: COMPATIBILITY.md
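The Triton row above relies on the `TRITON_PTXAS_PATH` environment variable. As a minimal sketch, assuming CUDA 13.0 is installed under `/usr/local/cuda`, the override could also be applied from Python before Triton is imported. The helper name `configure_triton_ptxas` is hypothetical and not part of this repo.

```python
import os
from pathlib import Path

# Path taken from the compatibility table; adjust if CUDA lives elsewhere.
DEFAULT_PTXAS = "/usr/local/cuda/bin/ptxas"


def configure_triton_ptxas(ptxas_path=DEFAULT_PTXAS, env=None):
    """Point Triton at a system ptxas if one exists; return the path used or None.

    Hypothetical helper for illustration; not part of this repo.
    """
    env = os.environ if env is None else env
    # Respect a value the user has already exported.
    if env.get("TRITON_PTXAS_PATH"):
        return env["TRITON_PTXAS_PATH"]
    if Path(ptxas_path).is_file():
        env["TRITON_PTXAS_PATH"] = ptxas_path
        return ptxas_path
    return None


if __name__ == "__main__":
    chosen = configure_triton_ptxas()
    print(f"TRITON_PTXAS_PATH = {chosen}")
    # Import torch / triton only after this point so the override is visible.
```

Exporting the variable in the shell, as the table suggests, achieves the same thing; the Python form is mainly useful in notebooks where the shell environment is not under your control.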
Check GitHub Releases for pre-built wheels:
- `torch-2.9.1+cu130-cp312-cp312-linux_aarch64.whl`
- `bitsandbytes-0.49.0+cu130-cp312-cp312-linux_aarch64.whl`
- `SHA256SUMS` (checksum manifest for release artifacts)
These are built on DGX Spark with CUDA 13.0, Python 3.12, GCC 13.3.
install.sh verifies release wheel checksums before installation.
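If you download the wheels by hand, you can repeat the checksum step yourself. The sketch below is not the logic in `install.sh`; it assumes `SHA256SUMS` uses the common `sha256sum` layout (`<hex digest>  <file name>` per line) and that the manifest sits in the same directory as the wheels. The function names are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large wheels do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_manifest(text):
    """Parse `sha256sum`-style lines ("<hex>  <name>") into {name: hex}."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        digest, name = line.split(maxsplit=1)
        # sha256sum marks binary mode with a leading "*" on the file name.
        entries[name.lstrip("*")] = digest.lower()
    return entries


def verify_wheels(directory, manifest_name="SHA256SUMS"):
    """Return {file name: True/False} for every manifest entry present on disk."""
    directory = Path(directory)
    manifest = parse_manifest((directory / manifest_name).read_text())
    results = {}
    for name, expected in manifest.items():
        candidate = directory / name
        if candidate.is_file():
            results[name] = sha256_of(candidate) == expected
    return results
```

Calling `verify_wheels("./wheels")` on a download directory returns a mapping you can inspect before running `pip install`; any `False` entry means the file does not match the manifest.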
If you prefer to build everything yourself:
# Set up environment
source configs/env.sh
# Build all components (~6 hours total)
./build/build_all.sh
# Or build individually
./build/build_pytorch.sh # ~4 hours
./build/build_triton.sh # ~30 min
./build/build_flash_attn.sh # ~20 min
./build/build_bitsandbytes.sh # ~10 min
Deterministic DGX Spark container flow using the NGC PyTorch base image and a multi-stage vLLM source build.
Build image:
docker build \
--build-arg VLLM_REF=v0.18.0 \
-f docker/vllm/Dockerfile \
-t dgx-spark-vllm:0.18.0 \
.
Note: the first build can take significantly longer because xformers is compiled from source on ARM64.
Run OpenAI-compatible server:
docker run --rm \
--gpus all \
--ipc=host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-p 8000:8000 \
-e VLLM_MODEL=Qwen/Qwen2.5-0.5B-Instruct \
-e VLLM_USE_V1=1 \
-e VLLM_ATTENTION_BACKEND=FLASH_ATTN \
dgx-spark-vllm:0.18.0
Smoke test (build + `/health` + `/v1/models`):
./scripts/smoke_vllm_container.sh
Optional smoke build overrides:
XFORMERS_DISABLE_FLASH_ATTN=1 XFORMERS_BUILD_JOBS=2 ./scripts/smoke_vllm_container.sh
Key runtime env vars:
| Variable | Default | Purpose |
|---|---|---|
| `VLLM_MODEL` | `Qwen/Qwen2.5-0.5B-Instruct` | Hugging Face model id |
| `VLLM_HOST` | `0.0.0.0` | Bind host |
| `VLLM_PORT` | `8000` | API port inside container |
| `VLLM_DTYPE` | `bfloat16` | vLLM dtype |
| `VLLM_MAX_MODEL_LEN` | `4096` | Max model length |
| `VLLM_GPU_MEMORY_UTILIZATION` | `0.9` | GPU memory target ratio |
| `VLLM_USE_V1` | `1` | Use vLLM V1 engine path |
| `VLLM_ATTENTION_BACKEND` | `FLASH_ATTN` | Attention backend for V1 |
| `VLLM_ENABLE_CUSTOM_OPS` | `0` | Keep custom C++ ops disabled by default on DGX Spark |
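Once the container from the run command above is up, the same two endpoints the smoke test names (`/health` and `/v1/models`) can be probed from Python. This is a rough sketch using only the standard library; it assumes the server is published on `localhost:8000` as in the `docker run` example, and `probe_vllm` is an illustrative name rather than a script shipped in this repo.

```python
import json
import urllib.request


def probe_vllm(base_url="http://localhost:8000", timeout=5.0):
    """Return (healthy, model_ids) for an OpenAI-compatible server."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
        healthy = resp.status == 200
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        payload = json.load(resp)
    # The OpenAI-style listing wraps models in a "data" array of {"id": ...}.
    model_ids = [entry["id"] for entry in payload.get("data", [])]
    return healthy, model_ids


if __name__ == "__main__":
    try:
        ok, models = probe_vllm()
        print("healthy:", ok, "models:", models)
    except OSError as exc:  # URLError and timeouts both subclass OSError
        print("server not reachable:", exc)
```

If the listed model id does not match `VLLM_MODEL`, the container was most likely started with a different environment than you expected.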
python scripts/verify_install.py
Sample output:
GPU: NVIDIA GB10 (128 GB) — Compute Capability: 12.1
CUDA: 13.0 — Driver: 570.x
PyTorch: 2.9.1+cu130 — CUDA available: ✓
Libraries: transformers ✓ | peft ✓ | trl ✓ | bitsandbytes ✓
MatMul test (4096×4096): PASSED — 2.3 TFLOPS
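To put the MatMul line in context, the arithmetic behind a TFLOPS figure can be sketched as follows. This assumes the conventional count of 2·n³ floating-point operations for an n×n by n×n product; `verify_install.py` may measure or count differently, so treat this as a back-of-the-envelope check rather than a description of that script.

```python
def matmul_flops(n):
    """Conventional FLOP count for an n x n by n x n product: n^3 multiplies plus n^3 adds."""
    return 2 * n ** 3


def tflops(n, seconds):
    """Throughput in TFLOPS for one n x n matmul that took `seconds`."""
    return matmul_flops(n) / seconds / 1e12


def seconds_at(n, rate_tflops):
    """Wall time implied by a reported rate."""
    return matmul_flops(n) / (rate_tflops * 1e12)


if __name__ == "__main__":
    n = 4096
    print(f"FLOPs per matmul: {matmul_flops(n):,}")
    print(f"time at 2.3 TFLOPS: {seconds_at(n, 2.3) * 1e3:.1f} ms")
```

Under that assumption, a 4096×4096 product is about 1.37 × 10¹¹ FLOPs, so a 2.3 TFLOPS reading corresponds to roughly 60 ms per multiplication.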
python scripts/benchmark_inference.py # Token generation speed
python scripts/benchmark_training.py # Fine-tuning throughput
python scripts/evaluate_perplexity.py # FP16/NF4/FP4 quality (perplexity)
Phase 3 benchmark results (Qwen 7B/14B/32B/72B inference, FP16/NF4/FP4 quality, LoRA/QLoRA training):
- docs/benchmarks.md
- `artifacts/benchmarks/phase3-baseline-2026-03-13.json`
- `artifacts/benchmarks/inference-extended-32b-72b-2026-03-13.json`
- `artifacts/benchmarks/inference-fp4-2026-03-13.json`
- `artifacts/benchmarks/quality-ppl-fp16-nf4-fp4-2026-03-13.json`
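For readers comparing the FP16/NF4/FP4 quality numbers, perplexity is the exponential of the mean per-token negative log-likelihood. The sketch below shows only that definition and a relative-change helper; it is not the code in `evaluate_perplexity.py`, and the sample values are made up for illustration.

```python
import math


def perplexity(token_nlls):
    """exp of the mean per-token negative log-likelihood (natural log)."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))


def relative_increase(baseline_ppl, quantized_ppl):
    """Fractional perplexity increase of a quantized model over its baseline."""
    return (quantized_ppl - baseline_ppl) / baseline_ppl


if __name__ == "__main__":
    # Made-up NLL values purely for illustration; not measurements from this repo.
    fp16 = perplexity([2.0, 1.5, 2.5])
    nf4 = perplexity([2.1, 1.6, 2.6])
    print(f"fp16={fp16:.3f} nf4={nf4:.3f} delta={relative_increase(fp16, nf4):+.1%}")
```

Lower is better; a small relative increase after quantization suggests the quantized model tracks the FP16 baseline closely on the evaluation text.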
- Quick Start Guide — Get running in 5 minutes
- Training Guide — Fine-tune LLMs on DGX Spark
- Troubleshooting — Known issues and solutions
- Reproducible Builds — Deterministic wheel build and release flow
- Benchmarks — Phase 3 baseline numbers and methodology
- Ollama Integration Guide — Model import, quantization, and API usage on DGX Spark
- llama.cpp Build Guide — sm_121 CUDA build and GGUF inference on DGX Spark
- NGC Container Recipe — Pinned NGC PyTorch workflow for DGX Spark LLM workloads
- Docker Compose vLLM Stack — vLLM + OpenAI-compatible API via Docker Compose
- Example Notebooks — Inference, fine-tuning, and evaluation notebooks with Spark smoke flow
- FP8 Workaround — TransformerEngine FP8 fail + BF16 fallback validation on DGX Spark
- Continuous Batching (vLLM) — Serial vs concurrent OpenAI API smoke on DGX Spark
- Power/Thermal Profiling — Operational power and temperature profile during continuous batching smoke
- Multi-GPU Guide — Distributed preflight and torchrun cluster recipe for DGX Spark environments
- Speculative Decoding — Baseline vs assistant-model benchmark artifact on DGX Spark
- KV Cache Optimization — vLLM max-model-len and memory-utilization sweep with recommended profile
- TensorRT-LLM Attention Sinks — Deterministic fail (legacy) + pass (stable) validation on `sm_121`
- Visibility Strategy — Community distribution, KPI tracking, and messaging guardrails
- Community Launch Pack — Copy-ready X/Reddit/HN/forum posts with CTA and link set
- Day 1 Launch Log — X + GitHub announcement execution record and KPI deltas
- Day 2 Launch Log — Reddit `r/LocalLLaMA` post draft, baseline metrics, and follow-up checklist
- Day 3 Launch Log — Reddit `r/nvidia` post draft, baseline metrics, and follow-up checklist
- Day 4 Launch Log — Show HN post draft, baseline metrics, and follow-up checklist
- Day 5 Launch Log — NVIDIA forum + Hugging Face reply template, baseline metrics, and follow-up checklist
- Day 6 Launch Log — NVIDIA forum follow-up reply template, baseline metrics, and follow-up checklist
- Day 7 Launch Log — Hugging Face discussion follow-up and week-1 wrap checklist
- Launch Operations — URL evidence gate and day-by-day launch closing checklist
This project is under active development. Current roadmap status:
| Phase | Focus | Status |
|---|---|---|
| 1. Foundation | Repo structure, build scripts, docs, compat matrix | ✅ Done |
| 2. Pre-built Wheels | Compile and publish wheels to GitHub Releases | ✅ Done |
| 3. Benchmarks | Inference tok/s, training throughput, model-specific tables | ✅ Done |
| 4. Community | vLLM Dockerfile, Ollama, llama.cpp guide, NGC recipe | ✅ Done |
| 5. Upstream | PyTorch sm_121 PR, Triton fix, flash-attention issue | ✅ Done |
| 6. Advanced | Multi-GPU, TensorRT-LLM, FP8 workaround | ✅ Done |
Full details with task checklists: ROADMAP.md
Read the contribution guide first: CONTRIBUTING.md.
Contributions are welcome, especially roadmap-aligned docs, build fixes, benchmarks, and integration improvements.
| Component | Spec |
|---|---|
| GPU | NVIDIA GB10 (Blackwell, sm_121) |
| VRAM | 128 GB unified memory |
| CPU | 20-core ARM64 (Grace) |
| RAM | 121 GB |
| Storage | 1.9 TB NVMe |
| CUDA | 13.0 |
| GCC | 13.3 |
| CMake | 3.28 |
| Python | 3.12 |
- Emre Yüz and his pytorch-gb10 repo for pioneering PyTorch on GB10
- NVIDIA for DGX Spark developer documentation