This guide details the environment setup and execution for local inference on the NVIDIA DGX Spark, specifically optimized for "Vibe-Coding."
Reference: NVIDIA Spark Nemotron Instructions
Verify installed versions:
git --version
cmake --version
nvcc --version
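If any of these are missing, they can usually be installed from the standard package repositories; a minimal sketch, assuming an Ubuntu-based DGX OS image (nvcc ships with the CUDA toolkit, which is typically preinstalled on DGX Spark):
# Assumption: Ubuntu-based DGX OS with apt available
sudo apt update
sudo apt install -y git cmake build-essential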
Manage Python environments with uv:
# Install
curl -LsSf https://astral.sh/uv/install.sh | sh
# Update
uv self update
Sync the project environment and activate it:
uv sync
source .venv/bin/activate
hf version
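If the hf command is not found, it can be installed as a uv tool; a minimal sketch, assuming the hf CLI shipped with the huggingface_hub package is sufficient for these downloads:
# Assumption: the hf entry point from huggingface_hub is all that is needed here
uv tool install huggingface_hub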
Build llama.cpp targeting the DGX Spark architecture (sm_121). See the llama.cpp build docs for details.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
mkdir build && cd build
# Configure for CUDA architectures 121
cmake .. -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="121" -DLLAMA_CURL=OFF
make -j
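As a quick sanity check that the build succeeded (a sketch; run from the build directory created above):
# Print the version/commit of the freshly built server binary
./bin/llama-server --version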
We recommend Unsloth GGUF quants for best performance.
hf download unsloth/GLM-4.7-Flash-GGUF \
GLM-4.7-Flash-UD-Q8_K_XL.gguf \
--local-dir ~/models/GLM-4.7-Flash-UD-Q8_K_XL
Or the BF16 version:
hf download unsloth/GLM-4.7-Flash-GGUF \
--include "BF16/GLM-4.7-Flash-BF16-*.gguf" \
--local-dir ~/models/GLM-4.7-Flash-BF16
Qwen3-Coder-Next (Q8_0 / UD-Q8_K_XL):
hf download unsloth/Qwen3-Coder-Next-GGUF \
--include "Q8_0/Qwen3-Coder-Next-Q8_0-*.gguf" \
--local-dir ~/models/Qwen3-Coder-Next-Q8_0
or
hf download unsloth/Qwen3-Coder-Next-GGUF --include "UD-Q8_K_XL/Qwen3-Coder-Next-UD-Q8_K_XL-*.gguf" --local-dir ~/models/Qwen3-Coder-Next-UD-Q8_K_XL
Devstral-2-123B-Instruct (UD-Q4_K_XL):
hf download unsloth/Devstral-2-123B-Instruct-2512-GGUF \
--include "UD-Q4_K_XL/Devstral-2-123B-Instruct-2512-UD-Q4_K_XL-*.gguf" \
--local-dir ~/models/Devstral-2-123B-Instruct-2512-UD-Q4_K_XL
Devstral-Small-2-24B-Instruct (UD-Q8_K_XL / BF16):
hf download unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF \
Devstral-Small-2-24B-Instruct-2512-UD-Q8_K_XL.gguf \
--local-dir ~/models/Devstral-Small-2-24B-Instruct-UD-Q8_K_XL
hf download unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF \
Devstral-Small-2-24B-Instruct-2512-BF16.gguf \
--local-dir ~/models/Devstral-Small-2-24B-Instruct-2512-BF16
Nemotron-3-Nano-30B-A3B (UD-Q8_K_XL):
hf download unsloth/Nemotron-3-Nano-30B-A3B-GGUF \
Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL.gguf \
--local-dir ~/models/Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL
GPT-OSS-120B (F16):
hf download unsloth/gpt-oss-120b-GGUF \
gpt-oss-120b-F16.gguf \
--local-dir ~/models/gpt-oss-120b-F16
Qwen3-Coder-Next (Q8_0):
screen -dmS qwen3-coder-next ./llama.cpp/build/bin/llama-server \
--model ~/models/Qwen3-Coder-Next-Q8_0/Q8_0/Qwen3-Coder-Next-Q8_0-00001-of-00003.gguf \
--alias "Qwen3-Coder-Next-Q8_0" \
--fit on \
--temp 1.0 \
--top-p 0.95 \
--top-k 40 \
--min-p 0.01 \
--port 8080 \
--host 0.0.0.0 \
--threads -4 \
--jinja \
--kv-unified \
--flash-attn on \
--ctx-size 0
Qwen3-Coder-Next (UD-Q8_K_XL):
./llama.cpp/build/bin/llama-server --model ~/models/Qwen3-Coder-Next-UD-Q8_K_XL/UD-Q8_K_XL/Qwen3-Coder-Next-UD-Q8_K_XL-00001-of-00003.gguf \
--alias "Qwen3-Coder-Next-UD-Q8_K_XL" \
--fit on \
--temp 1.0 \
--top-p 0.95 \
--top-k 40 \
--min-p 0.01 \
--port 8080 \
--host 0.0.0.0 \
--threads -4 \
--jinja \
--ctx-size 262144
GLM-4.7-Flash (UD-Q8_K_XL):
screen -dmS glm-47 ./llama.cpp/build/bin/llama-server \
--model ~/models/GLM-4.7-Flash-UD-Q8_K_XL/GLM-4.7-Flash-UD-Q8_K_XL.gguf \
--alias "GLM-4.7-Flash-Q8_K_XL" \
--fit on \
--temp 1.0 \
--top-p 0.95 \
--min-p 0.01 \
--port 8080 \
--host 0.0.0.0 \
--threads -4 \
--jinja \
--ctx-size 0
Or launch it via the helper script:
screen -dmS glm47 bash launch_glm4.7_flash.sh
Max context window: 202752.
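The contents of launch_glm4.7_flash.sh are not shown here; a minimal sketch that simply wraps the GLM-4.7-Flash command above (the filename and paths are assumptions, adjust to your layout):
#!/usr/bin/env bash
# Hypothetical launch_glm4.7_flash.sh: wraps the GLM-4.7-Flash llama-server invocation above
exec ./llama.cpp/build/bin/llama-server \
--model ~/models/GLM-4.7-Flash-UD-Q8_K_XL/GLM-4.7-Flash-UD-Q8_K_XL.gguf \
--alias "GLM-4.7-Flash-Q8_K_XL" \
--fit on \
--temp 1.0 \
--top-p 0.95 \
--min-p 0.01 \
--port 8080 \
--host 0.0.0.0 \
--threads -4 \
--jinja \
--ctx-size 0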
Devstral-Small-2-24B-Instruct (UD-Q8_K_XL):
screen -dmS devstral ./llama.cpp/build/bin/llama-server \
--model ~/models/Devstral-Small-2-24B-Instruct-UD-Q8_K_XL/Devstral-Small-2-24B-Instruct-2512-UD-Q8_K_XL.gguf \
--threads -2 \
--ctx-size 65536 \
--n-gpu-layers 99 \
--seed 3407 \
--prio 2 \
--temp 0.15 \
--jinja \
--port 8080 \
--host 0.0.0.0
Note: this is a dense model and is slower than the MoE alternatives.
Nemotron-3-Nano-30B-A3B (UD-Q8_K_XL):
screen -dmS nemotron ./llama.cpp/build/bin/llama-server \
--model ~/models/Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL/Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL.gguf \
--threads -8 \
--ctx-size 262144 \
--n-gpu-layers 99 \
--jinja \
--fit on \
--temp 0.6 \
--top-p 0.95 \
--port 8080 \
--host 0.0.0.0
Tool calling: --temp 0.6 --top-p 0.95. Context: 262144 or 1M.
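Since the Nemotron launch above enables --jinja, tool calling can be exercised through the OpenAI-compatible endpoint; a minimal sketch (the get_weather tool and the model name string are illustrative assumptions):
# Hypothetical tool-calling request against the local llama-server endpoint
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "Nemotron-3-Nano-30B-A3B",
  "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }]
}'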
GPT-OSS-120B (F16):
screen -dmS gptoss ./llama.cpp/build/bin/llama-server \
--model ~/models/gpt-oss-120b-F16/gpt-oss-120b-F16.gguf \
--host 0.0.0.0 \
--port 8080 \
--n-gpu-layers 99 \
--ctx-size 0 \
--threads 8 \
--jinja \
-ub 2048 \
-b 2048 \
--chat-template-kwargs '{"reasoning_effort": "high"}' \
--temp 1.0 \
--top-p 1.0 \
--min-p 0.0 \
--top-k 0
Attach to a screen session:
screen -r glm-47 # GLM-4.7-Flash
screen -r devstral # Devstral-Small-2-24B
screen -r nemotron # Nemotron-3-Nano-30B
screen -r gptoss # GPT-OSS-120B
Detach from a screen session: Ctrl+A then D
List all screens: screen -ls
- Port change: update --port (default: 8080)
- Web UI: http://localhost:8080
- Benchmark: ~42 tokens/sec (GLM-4.7-Flash Q8)
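To confirm a server is reachable before wiring up clients, the OpenAI-compatible API can be queried directly; a quick sketch, assuming the default port 8080 and the Qwen3-Coder-Next alias from above:
# List the model(s) the running llama-server exposes
curl http://localhost:8080/v1/models
# Minimal chat completion round-trip (the alias must match the running server)
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "Qwen3-Coder-Next-Q8_0", "messages": [{"role": "user", "content": "Hello"}]}'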
See SETUP_WHISPER.md for detailed instructions.
Whisper TUI Interface
Unlock the ultimate workflow.
mkdir vibe_coding_with_mistral_vibe
cd vibe_coding_with_mistral_vibe
Refer to the official release:
uv tool install mistral-vibe
uv tool upgrade mistral-vibe
Launch vibe, choose your theme, and leave the API key blank or as whitespace (inference is handled locally).
Edit ~/.vibe/config.toml and add:
[[providers]]
name = "llamacpp"
api_base = "http://127.0.0.1:8080/v1"
api_key_env_var = ""
api_style = "openai"
backend = "generic"
reasoning_field_name = "reasoning_content"
[[models]]
name = "Qwen3-Coder-Next-Q8_0"
provider = "llamacpp"
temperature = 1.0
input_price = 0.0
output_price = 0.0
[[models]]
name = "GLM-4.7-Flash-Q8_K_XL"
provider = "llamacpp"
temperature = 1.0
input_price = 0.0
output_price = 0.0
[[models]]
name = "Nemotron-3-Nano-30B-A3B"
provider = "llamacpp"
temperature = 0.6
input_price = 0.0
output_price = 0.0
- Run /reload in the vibe interface
- Type /model and select your model with Enter
- Hit ESC when finished
Run vibe CLI on different machines using DGX Spark's GPU.
- Tailscale: Both DGX Spark and local machine on same Tailscale network
- Host Binding: ensure llama-server uses --host 0.0.0.0
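Before editing the config, it can help to verify connectivity from the local machine to the DGX Spark over Tailscale; a quick sketch, reusing the example Tailscale IP from the config below (replace with your own):
# Assumption: 100.114.54.60 is the DGX Spark's Tailscale IP
curl http://100.114.54.60:8080/v1/models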
Edit ~/.vibe/config.toml on remote machine:
[[providers]]
name = "dgx-remote-llamacpp"
api_base = "http://100.114.54.60:8080/v1" # Replace with DGX Tailscale IP
api_key_env_var = ""
api_style = "openai"
backend = "generic"
[[models]]
name = "Qwen3-Coder-Next-UD-Q8_K_XL"
provider = "dgx-remote-llamacpp"
temperature = 1.0
input_price = 0.0
output_price = 0.0
[[models]]
name = "GLM-4.7-Flash-Q8_K_XL"
provider = "dgx-remote-llamacpp"
temperature = 1.0
input_price = 0.0
output_price = 0.0
[[models]]
name = "Nemotron-3-Nano-30B-A3B"
provider = "dgx-remote-llamacpp"
temperature = 0.6
input_price = 0.0
output_price = 0.0
- Run vibe locally
- Execute /reload
- Select the remote model with /model
# GLM-4.7-Flash server
screen -dmS glm47
./launch_glm4.7_flash.sh
# Whisper server
screen -dmS whisper-server
./start_whisper_server.sh
# Gradio app (new terminal)
uv run python whisper_app.py
Find screen names:
screen -ls
Kill a specific screen:
screen -S glm47 -X quit
screen -S whisper-server -X quit
Upcoming: support via vLLM will be added in a future update.
