DGX Spark LLM Stack

PyTorch, Triton, flash-attention, BitsAndBytes — pre-built wheels and reproducible build scripts for NVIDIA DGX Spark (GB10, sm_121, Blackwell, CUDA 13.0, Python 3.12, ARM64).

Can't pip install torch on your DGX Spark? You're in the right place.

The Problem

DGX Spark ships with GB10, a Blackwell GPU with compute capability 12.1 (build target sm_121). Most ML frameworks don't officially support this architecture yet:

PyTorch: Official wheels max out at sm_120, emit warnings on sm_121
Triton: Bundled ptxas may fail on sm_121a; point `TRITON_PTXAS_PATH` at the CUDA 13.0 ptxas instead
flash-attention: No sm_121 kernels, compilation fails
vLLM: Requires Docker or source builds for Blackwell
TransformerEngine: MXFP8 broken on this arch
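
To see which of these cases applies on a given machine, one option is to compare the GPU's compute capability with the architectures the installed PyTorch build was compiled for. The sketch below is an illustration, not part of this repo: the helper names are hypothetical, and the live query assumes a PyTorch build that can see the GPU.

```python
def capability_to_arch(major: int, minor: int) -> str:
    """Turn a compute capability such as (12, 1) into an arch name such as 'sm_121'."""
    return f"sm_{major}{minor}"


def is_natively_supported(capability, arch_list) -> bool:
    """True if the wheel ships kernels compiled for exactly this architecture."""
    return capability_to_arch(*capability) in set(arch_list)


def report() -> None:
    """Query the local GPU and PyTorch build (requires torch with CUDA)."""
    import torch  # imported lazily so the helpers above work without it

    capability = torch.cuda.get_device_capability(0)
    arch_list = torch.cuda.get_arch_list()
    arch = capability_to_arch(*capability)
    if is_natively_supported(capability, arch_list):
        print(f"{arch}: compiled into this PyTorch build")
    else:
        print(f"{arch}: not in {arch_list}; expect warnings or fallbacks")


if __name__ == "__main__":
    # Example with a hypothetical arch list that stops at sm_120.
    print(is_natively_supported((12, 1), ["sm_90", "sm_100", "sm_120"]))
```

Calling `report()` on the DGX Spark itself shows whether the active wheel was built for sm_121 or only for older targets.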

This repo provides build scripts, pre-built wheels, compatibility info, and benchmarks so you can run a full LLM stack on your DGX Spark without fighting the toolchain.

Quick Start

git clone https://github.com/ogulcanaydogan/dgx-spark-llm-stack.git
cd dgx-spark-llm-stack
./install.sh

This downloads pre-built wheels from GitHub Releases and installs the full stack. After installation, it runs verification automatically.

Compatibility Matrix

Library	Version	sm_121 Status	Notes
PyTorch	2.9.1	⚠️ Warning, works	Official max sm_120; our wheel targets sm_121
Triton	3.5.1	⚠️ Works with env fix	Set `TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas` on DGX Spark
flash-attention	2.7+	❌ Not supported	Use SDPA fallback (see docs)
BitsAndBytes	0.49+	✅ Works	FP4/NF4 quantization tested
vLLM	0.8+	⚠️ Docker only	Build from source or use NGC container
llama.cpp	Latest	✅ Works well	Best option for inference
TensorRT-LLM	0.9+	⚠️ Partial	Legacy `1.1.0rc1` fails with an SM90-only assertion; stable `1.2.0` passes (validated 2026-04-03)
TransformerEngine	-	❌ Broken	MXFP8 training unsupported
Unsloth	Latest	✅ Works	Recommended for fine-tuning
transformers	4.48+	✅ Works	Standard HF stack
PEFT / LoRA	Latest	✅ Works	QLoRA with BitsAndBytes OK
TRL	Latest	✅ Works	SFT, DPO, ORPO all work

Full details: COMPATIBILITY.md
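
The Triton row depends on `TRITON_PTXAS_PATH` being set before Triton compiles its first kernel. A minimal sketch of doing that from Python follows; the helper name is hypothetical, the default path is the one from the table, and an existing user setting is left untouched.

```python
import os
from pathlib import Path

# Path taken from the compatibility matrix; adjust if CUDA lives elsewhere.
DEFAULT_PTXAS = "/usr/local/cuda/bin/ptxas"


def configure_triton_ptxas(ptxas_path=DEFAULT_PTXAS, env=None):
    """Point Triton at a system ptxas unless the user already chose one.

    Returns the path that ends up in TRITON_PTXAS_PATH, or None if the
    candidate binary does not exist and nothing was set.
    """
    env = os.environ if env is None else env
    if "TRITON_PTXAS_PATH" in env:
        return env["TRITON_PTXAS_PATH"]  # respect an explicit override
    if not Path(ptxas_path).is_file():
        return None  # nothing to point at; leave the environment alone
    env["TRITON_PTXAS_PATH"] = ptxas_path
    return ptxas_path


if __name__ == "__main__":
    # Call this before importing triton or torch.compile-ing anything.
    print(configure_triton_ptxas())
```

Setting the variable in the shell profile has the same effect; the Python form is convenient in notebooks and launch scripts where the shell environment is not under your control.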

Pre-built Wheels

Check GitHub Releases for pre-built wheels:

torch-2.9.1+cu130-cp312-cp312-linux_aarch64.whl
bitsandbytes-0.49.0+cu130-cp312-cp312-linux_aarch64.whl
SHA256SUMS (checksum manifest for release artifacts)

These are built on DGX Spark with CUDA 13.0, Python 3.12, GCC 13.3. install.sh verifies release wheel checksums before installation.
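
If you download wheels by hand rather than through install.sh, the same kind of check can be done manually. The sketch below is not the logic inside install.sh; it assumes `SHA256SUMS` uses the common `sha256sum` layout (`<hex digest>  <filename>` per line), which the README does not state, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20) -> str:
    """Hash a file in chunks so multi-GB wheels do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_manifest(text: str) -> dict:
    """Parse 'digest  filename' lines (assumed format) into {filename: digest}."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        digest, name = line.split(None, 1)
        entries[name.lstrip("*")] = digest.lower()  # '*' marks binary mode
    return entries


def verify_release(directory, manifest_name="SHA256SUMS") -> dict:
    """Return {filename: True/False} for every file listed in the manifest."""
    directory = Path(directory)
    expected = parse_manifest((directory / manifest_name).read_text())
    return {
        name: (directory / name).is_file()
        and sha256_of(directory / name) == digest
        for name, digest in expected.items()
    }


if __name__ == "__main__":
    for name, ok in verify_release(".").items():
        print(f"{'OK  ' if ok else 'FAIL'} {name}")
```

A missing file is reported as a failure rather than raising, so one pass lists every problem in the download directory.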

Build from Source

If you prefer to build everything yourself:

# Set up environment
source configs/env.sh

# Build all components (~6 hours total)
./build/build_all.sh

# Or build individually
./build/build_pytorch.sh      # ~4 hours
./build/build_triton.sh       # ~30 min
./build/build_flash_attn.sh   # ~20 min
./build/build_bitsandbytes.sh # ~10 min

vLLM Container (Phase 4)

A deterministic container flow for DGX Spark: an NGC PyTorch base image plus a multi-stage vLLM source build.

Build image:

docker build \
  --build-arg VLLM_REF=v0.18.0 \
  -f docker/vllm/Dockerfile \
  -t dgx-spark-vllm:0.18.0 \
  .

Note: the first build can take significantly longer because xformers is compiled from source on ARM64.

Run OpenAI-compatible server:

docker run --rm \
  --gpus all \
  --ipc=host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -p 8000:8000 \
  -e VLLM_MODEL=Qwen/Qwen2.5-0.5B-Instruct \
  -e VLLM_USE_V1=1 \
  -e VLLM_ATTENTION_BACKEND=FLASH_ATTN \
  dgx-spark-vllm:0.18.0

Smoke test (build + /health + /v1/models):

./scripts/smoke_vllm_container.sh
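
The HTTP half of such a smoke test can also be run from Python against an already running container. This is a sketch, not a port of `smoke_vllm_container.sh`: it assumes the server listens on `localhost:8000` as in the run command above and that `/v1/models` returns the usual OpenAI-style `{"data": [{"id": ...}]}` listing. The function names are hypothetical.

```python
import json
import urllib.request

# The server is local, so skip any HTTP proxy configured in the environment.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def fetch(url: str, timeout: float = 5.0):
    """GET a URL and return (status code, decoded body)."""
    with _OPENER.open(url, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8")


def smoke_check(base_url: str = "http://localhost:8000", timeout: float = 5.0):
    """Check /health, then return the model ids listed by /v1/models."""
    status, _ = fetch(f"{base_url}/health", timeout)
    if status != 200:
        raise RuntimeError(f"/health returned {status}")
    status, body = fetch(f"{base_url}/v1/models", timeout)
    models = [entry["id"] for entry in json.loads(body).get("data", [])]
    if not models:
        raise RuntimeError("/v1/models listed no models")
    return models


if __name__ == "__main__":
    print("serving:", ", ".join(smoke_check()))
```

Connection errors and non-2xx responses surface as `urllib.error` exceptions, which is usually what you want in a CI step.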

Optional smoke build overrides:

XFORMERS_DISABLE_FLASH_ATTN=1 XFORMERS_BUILD_JOBS=2 ./scripts/smoke_vllm_container.sh

Key runtime env vars:

Variable	Default	Purpose
`VLLM_MODEL`	`Qwen/Qwen2.5-0.5B-Instruct`	Hugging Face model id
`VLLM_HOST`	`0.0.0.0`	Bind host
`VLLM_PORT`	`8000`	API port inside container
`VLLM_DTYPE`	`bfloat16`	vLLM dtype
`VLLM_MAX_MODEL_LEN`	`4096`	Max model length
`VLLM_GPU_MEMORY_UTILIZATION`	`0.9`	GPU memory target ratio
`VLLM_USE_V1`	`1`	Use vLLM V1 engine path
`VLLM_ATTENTION_BACKEND`	`FLASH_ATTN`	Attention backend for V1
`VLLM_ENABLE_CUSTOM_OPS`	`0`	Keep custom C++ ops disabled by default on DGX Spark
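
When several of these variables change between runs, assembling the `docker run` command programmatically avoids copy-paste mistakes. The helper below is a hypothetical convenience, not something shipped in the repo; it reuses only the flags from the run command above and maps the container side of `-p` to `VLLM_PORT` when that variable is overridden.

```python
import shlex

# Fixed flags copied from the documented run command.
BASE_ARGS = [
    "docker", "run", "--rm",
    "--gpus", "all",
    "--ipc=host",
    "--ulimit", "memlock=-1",
    "--ulimit", "stack=67108864",
]


def docker_run_command(image: str, overrides: dict, host_port: int = 8000):
    """Build the argv list for `docker run` with VLLM_* overrides as -e flags."""
    for name in overrides:
        if not name.startswith("VLLM_"):
            raise ValueError(f"unexpected variable: {name}")
    # VLLM_PORT is the API port inside the container (default 8000).
    container_port = int(overrides.get("VLLM_PORT", 8000))
    cmd = list(BASE_ARGS)
    cmd += ["-p", f"{host_port}:{container_port}"]
    for name in sorted(overrides):  # sorted for a stable, diffable command
        cmd += ["-e", f"{name}={overrides[name]}"]
    cmd.append(image)
    return cmd


if __name__ == "__main__":
    cmd = docker_run_command(
        "dgx-spark-vllm:0.18.0",
        {"VLLM_MAX_MODEL_LEN": "8192", "VLLM_GPU_MEMORY_UTILIZATION": "0.85"},
    )
    print(shlex.join(cmd))  # pass `cmd` to subprocess.run to launch
```

Variables left out of `overrides` keep the defaults from the table, since those are applied inside the container.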

Verification

python scripts/verify_install.py

Sample output:

GPU: NVIDIA GB10 (128 GB) — Compute Capability: 12.1
CUDA: 13.0 — Driver: 580.x
PyTorch: 2.9.1+cu130 — CUDA available: ✓
Libraries: transformers ✓ | peft ✓ | trl ✓ | bitsandbytes ✓
MatMul test (4096×4096): PASSED — 2.3 TFLOPS
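
For reading the last line: an n×n by n×n matrix multiply costs about 2n³ floating-point operations, so 4096×4096 is roughly 137 GFLOP, and 2.3 TFLOPS corresponds to about 60 ms per multiply. The sketch below shows that arithmetic plus one way to time it. It is not the code in `verify_install.py`; the dtype (PyTorch's default float32 here), warm-up, and iteration count are assumptions, and the function names are hypothetical.

```python
import time


def matmul_flops(n: int) -> int:
    """Approximate FLOPs for an n x n @ n x n multiply (n^3 multiply-adds)."""
    return 2 * n ** 3


def tflops(n: int, seconds: float) -> float:
    """Throughput in TFLOPS given the time for one n x n multiply."""
    return matmul_flops(n) / seconds / 1e12


def time_matmul_cuda(n: int = 4096, iters: int = 10) -> float:
    """Average seconds per multiply on the GPU (requires torch with CUDA)."""
    import torch  # imported lazily so the arithmetic helpers work without it

    a = torch.randn(n, n, device="cuda")
    b = torch.randn(n, n, device="cuda")
    torch.matmul(a, b)          # warm-up: triggers kernel selection
    torch.cuda.synchronize()    # make sure the warm-up has finished
    start = time.perf_counter()
    for _ in range(iters):
        torch.matmul(a, b)
    torch.cuda.synchronize()    # kernels are async; wait before stopping the clock
    return (time.perf_counter() - start) / iters


if __name__ == "__main__":
    print(f"{matmul_flops(4096) / 1e9:.1f} GFLOP per multiply")
    print(f"{tflops(4096, 0.0598):.2f} TFLOPS at 59.8 ms per multiply")
```

Without the `synchronize()` calls the timer would measure only kernel launch overhead and report an unrealistically high figure.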

Benchmarks

python scripts/benchmark_inference.py   # Token generation speed
python scripts/benchmark_training.py    # Fine-tuning throughput
python scripts/evaluate_perplexity.py   # FP16/NF4/FP4 quality (perplexity)

Phase 3 benchmark results (Qwen 7B/14B/32B/72B inference, FP16/NF4/FP4 quality, LoRA/QLoRA training):

docs/benchmarks.md
artifacts/benchmarks/phase3-baseline-2026-03-13.json
artifacts/benchmarks/inference-extended-32b-72b-2026-03-13.json
artifacts/benchmarks/inference-fp4-2026-03-13.json
artifacts/benchmarks/quality-ppl-fp16-nf4-fp4-2026-03-13.json
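
For interpreting the quality artifacts: perplexity is the exponential of the mean negative log-likelihood per token, so lower is better and a quantized model is usually judged by how far it drifts from the FP16 baseline. The sketch below illustrates the standard formula only. It is not taken from `evaluate_perplexity.py`, the function names are hypothetical, and the numbers in the example are made up for illustration.

```python
import math


def perplexity(token_logprobs) -> float:
    """exp(-mean(log p)) over natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


def relative_degradation(baseline_ppl: float, quantized_ppl: float) -> float:
    """Fractional perplexity increase of a quantized model over its baseline."""
    return (quantized_ppl - baseline_ppl) / baseline_ppl


if __name__ == "__main__":
    # A model that assigns probability 0.25 to every token has perplexity 4.
    print(perplexity([math.log(0.25)] * 8))
    # Illustrative values only, not results from this repo.
    print(f"{relative_degradation(10.0, 10.5):.1%} worse than baseline")
```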

Documentation

Quick Start Guide — Get running in 5 minutes
Training Guide — Fine-tune LLMs on DGX Spark
Troubleshooting — Known issues and solutions
Reproducible Builds — Deterministic wheel build and release flow
Benchmarks — Phase 3 baseline numbers and methodology
Ollama Integration Guide — Model import, quantization, and API usage on DGX Spark
llama.cpp Build Guide — sm_121 CUDA build and GGUF inference on DGX Spark
NGC Container Recipe — Pinned NGC PyTorch workflow for DGX Spark LLM workloads
Docker Compose vLLM Stack — vLLM + OpenAI-compatible API via Docker Compose
Example Notebooks — Inference, fine-tuning, and evaluation notebooks with Spark smoke flow
FP8 Workaround — TransformerEngine FP8 fail + BF16 fallback validation on DGX Spark
Continuous Batching (vLLM) — Serial vs concurrent OpenAI API smoke on DGX Spark
Power/Thermal Profiling — Operational power and temperature profile during continuous batching smoke
Multi-GPU Guide — Distributed preflight and torchrun cluster recipe for DGX Spark environments
Speculative Decoding — Baseline vs assistant-model benchmark artifact on DGX Spark
KV Cache Optimization — vLLM max-model-len and memory-utilization sweep with recommended profile
TensorRT-LLM Attention Sinks — Deterministic fail (legacy) + pass (stable) validation on sm_121
Visibility Strategy — Community distribution, KPI tracking, and messaging guardrails
Community Launch Pack — Copy-ready X/Reddit/HN/forum posts with CTA and link set
Day 1 Launch Log — X + GitHub announcement execution record and KPI deltas
Day 2 Launch Log — Reddit r/LocalLLaMA post draft, baseline metrics, and follow-up checklist
Day 3 Launch Log — Reddit r/nvidia post draft, baseline metrics, and follow-up checklist
Day 4 Launch Log — Show HN post draft, baseline metrics, and follow-up checklist
Day 5 Launch Log — NVIDIA forum + Hugging Face reply template, baseline metrics, and follow-up checklist
Day 6 Launch Log — NVIDIA forum follow-up reply template, baseline metrics, and follow-up checklist
Day 7 Launch Log — Hugging Face discussion follow-up and week-1 wrap checklist
Launch Operations — URL evidence gate and day-by-day launch closing checklist

Roadmap

This project is under active development. All six roadmap phases are currently marked done:

Phase	Focus	Status
1. Foundation	Repo structure, build scripts, docs, compat matrix	✅ Done
2. Pre-built Wheels	Compile and publish wheels to GitHub Releases	✅ Done
3. Benchmarks	Inference tok/s, training throughput, model-specific tables	✅ Done
4. Community	vLLM Dockerfile, Ollama, llama.cpp guide, NGC recipe	✅ Done
5. Upstream	PyTorch sm_121 PR, Triton fix, flash-attention issue	✅ Done
6. Advanced	Multi-GPU, TensorRT-LLM, FP8 workaround	✅ Done

Full details with task checklists: ROADMAP.md

Contributing

Read the contribution guide first: CONTRIBUTING.md.

Contributions are welcome, especially roadmap-aligned docs, build fixes, benchmarks, and integration improvements.

System Specs (DGX Spark)

Component	Spec
GPU	NVIDIA GB10 (Blackwell, sm_121)
VRAM	128 GB unified memory
CPU	20-core ARM64 (Grace)
RAM	121 GB
Storage	1.9 TB NVMe
CUDA	13.0
GCC	13.3
CMake	3.28
Python	3.12

Acknowledgments

Emre Yüz and his pytorch-gb10 repo for pioneering PyTorch on GB10
NVIDIA for DGX Spark developer documentation

License

Apache License 2.0
