This guide details the environment setup and execution for local inference on the NVIDIA DGX Spark, specifically optimized for "Vibe-Coding."
Reference: NVIDIA Spark Nemotron Instructions
Verify installed versions:
git --version
cmake --version
nvcc --version
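If any of these are missing, they can usually be installed from the standard package repositories; a minimal sketch, assuming an Ubuntu-based DGX OS image (nvcc ships with the CUDA toolkit, which is typically preinstalled on DGX Spark):
# Assumption: Ubuntu-based DGX OS with apt available
sudo apt update
sudo apt install -y git cmake build-essential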
Manage Python environments with uv:
# Install
curl -LsSf https://astral.sh/uv/install.sh | sh
# Update
uv self update
Sync the project environment and activate it:
uv sync
source .venv/bin/activate
hf version
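If the hf command is not found, it can be installed as a uv tool; a minimal sketch, assuming the hf CLI shipped with the huggingface_hub package is sufficient for these downloads:
# Assumption: the hf entry point from huggingface_hub is all that is needed here
uv tool install huggingface_hub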
Build llama.cpp targeting the DGX Spark architecture (sm_121). See the llama.cpp build docs for details.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
mkdir build && cd build
# Configure for CUDA architectures 121
cmake .. -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="121" -DLLAMA_CURL=OFF
make -j
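As a quick sanity check that the build succeeded (a sketch; run from the build directory created above):
# Print the version/commit of the freshly built server binary
./bin/llama-server --version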
We recommend Unsloth GGUF quants for best performance.
hf download unsloth/GLM-4.7-Flash-GGUF \
GLM-4.7-Flash-UD-Q8_K_XL.gguf \
--local-dir ~/models/GLM-4.7-Flash-UD-Q8_K_XL
Or the BF16 version:
hf download unsloth/GLM-4.7-Flash-GGUF \
--include "BF16/GLM-4.7-Flash-BF16-*.gguf" \
--local-dir ~/models/GLM-4.7-Flash-BF16
Qwen3-Coder-Next (Q8_0 / UD-Q8_K_XL):
hf download unsloth/Qwen3-Coder-Next-GGUF \
--include "Q8_0/Qwen3-Coder-Next-Q8_0-*.gguf" \
--local-dir ~/models/Qwen3-Coder-Next-Q8_0
or
hf download unsloth/Qwen3-Coder-Next-GGUF --include "UD-Q8_K_XL/Qwen3-Coder-Next-UD-Q8_K_XL-*.gguf" --local-dir ~/models/Qwen3-Coder-Next-UD-Q8_K_XL
Devstral-2-123B-Instruct (UD-Q4_K_XL):
hf download unsloth/Devstral-2-123B-Instruct-2512-GGUF \
--include "UD-Q4_K_XL/Devstral-2-123B-Instruct-2512-UD-Q4_K_XL-*.gguf" \
--local-dir ~/models/Devstral-2-123B-Instruct-2512-UD-Q4_K_XL
Devstral-Small-2-24B-Instruct (UD-Q8_K_XL / BF16):
hf download unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF \
Devstral-Small-2-24B-Instruct-2512-UD-Q8_K_XL.gguf \
--local-dir ~/models/Devstral-Small-2-24B-Instruct-UD-Q8_K_XL
hf download unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF \
Devstral-Small-2-24B-Instruct-2512-BF16.gguf \
--local-dir ~/models/Devstral-Small-2-24B-Instruct-2512-BF16
Nemotron-3-Nano-30B-A3B (UD-Q8_K_XL):
hf download unsloth/Nemotron-3-Nano-30B-A3B-GGUF \
Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL.gguf \
--local-dir ~/models/Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL
GPT-OSS-120B (F16):
hf download unsloth/gpt-oss-120b-GGUF \
gpt-oss-120b-F16.gguf \
--local-dir ~/models/gpt-oss-120b-F16
Qwen3-Coder-Next (Q8_0):
screen -dmS qwen3-coder-next ./llama.cpp/build/bin/llama-server \
--model ~/models/Qwen3-Coder-Next-Q8_0/Q8_0/Qwen3-Coder-Next-Q8_0-00001-of-00003.gguf \
--alias "Qwen3-Coder-Next-Q8_0" \
--fit on \
--temp 1.0 \
--top-p 0.95 \
--top-k 40 \
--min-p 0.01 \
--port 8080 \
--host 0.0.0.0 \
--threads -4 \
--jinja \
--kv-unified \
--flash-attn on \
--ctx-size 0
Qwen3-Coder-Next (UD-Q8_K_XL):
./llama.cpp/build/bin/llama-server --model ~/models/Qwen3-Coder-Next-UD-Q8_K_XL/UD-Q8_K_XL/Qwen3-Coder-Next-UD-Q8_K_XL-00001-of-00003.gguf \
--alias "Qwen3-Coder-Next-UD-Q8_K_XL" \
--fit on \
--temp 1.0 \
--top-p 0.95 \
--top-k 40 \
--min-p 0.01 \
--port 8080 \
--host 0.0.0.0 \
--threads -4 \
--jinja \
--ctx-size 262144
GLM-4.7-Flash (UD-Q8_K_XL):
screen -dmS glm-47 ./llama.cpp/build/bin/llama-server \
--model ~/models/GLM-4.7-Flash-UD-Q8_K_XL/GLM-4.7-Flash-UD-Q8_K_XL.gguf \
--alias "GLM-4.7-Flash-Q8_K_XL" \
--fit on \
--temp 1.0 \
--top-p 0.95 \
--min-p 0.01 \
--port 8080 \
--host 0.0.0.0 \
--threads -4 \
--jinja \
--ctx-size 0
Or launch it via the helper script:
screen -dmS glm47 bash launch_glm4.7_flash.sh
Max context window: 202752.
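The contents of launch_glm4.7_flash.sh are not shown here; a minimal sketch that simply wraps the GLM-4.7-Flash command above (the filename and paths are assumptions, adjust to your layout):
#!/usr/bin/env bash
# Hypothetical launch_glm4.7_flash.sh: wraps the GLM-4.7-Flash llama-server invocation above
exec ./llama.cpp/build/bin/llama-server \
--model ~/models/GLM-4.7-Flash-UD-Q8_K_XL/GLM-4.7-Flash-UD-Q8_K_XL.gguf \
--alias "GLM-4.7-Flash-Q8_K_XL" \
--fit on \
--temp 1.0 \
--top-p 0.95 \
--min-p 0.01 \
--port 8080 \
--host 0.0.0.0 \
--threads -4 \
--jinja \
--ctx-size 0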
Devstral-Small-2-24B-Instruct (UD-Q8_K_XL):
screen -dmS devstral ./llama.cpp/build/bin/llama-server \
--model ~/models/Devstral-Small-2-24B-Instruct-UD-Q8_K_XL/Devstral-Small-2-24B-Instruct-2512-UD-Q8_K_XL.gguf \
--threads -2 \
--ctx-size 65536 \
--n-gpu-layers 99 \
--seed 3407 \
--prio 2 \
--temp 0.15 \
--jinja \
--port 8080 \
--host 0.0.0.0
Note: this is a dense model and is slower than the MoE alternatives.
Nemotron-3-Nano-30B-A3B (UD-Q8_K_XL):
screen -dmS nemotron ./llama.cpp/build/bin/llama-server \
--model ~/models/Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL/Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL.gguf \
--threads -8 \
--ctx-size 262144 \
--n-gpu-layers 99 \
--jinja \
--fit on \
--temp 0.6 \
--top-p 0.95 \
--port 8080 \
--host 0.0.0.0
Tool calling: --temp 0.6 --top-p 0.95. Context: 262144 or 1M.
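Since the Nemotron launch above enables --jinja, tool calling can be exercised through the OpenAI-compatible endpoint; a minimal sketch (the get_weather tool and the model name string are illustrative assumptions):
# Hypothetical tool-calling request against the local llama-server endpoint
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "Nemotron-3-Nano-30B-A3B",
  "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }]
}'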
GPT-OSS-120B (F16):
screen -dmS gptoss ./llama.cpp/build/bin/llama-server \
--model ~/models/gpt-oss-120b-F16/gpt-oss-120b-F16.gguf \
--host 0.0.0.0 \
--port 8080 \
--n-gpu-layers 99 \
--ctx-size 0 \
--threads 8 \
--jinja \
-ub 2048 \
-b 2048 \
--chat-template-kwargs '{"reasoning_effort": "high"}' \
--temp 1.0 \
--top-p 1.0 \
--min-p 0.0 \
--top-k 0
Attach to a screen session:
screen -r glm-47 # GLM-4.7-Flash
screen -r devstral # Devstral-Small-2-24B
screen -r nemotron # Nemotron-3-Nano-30B
screen -r gptoss # GPT-OSS-120B
Detach from a screen session: Ctrl+A then D
List all screens: screen -ls
- Port change: update --port (default: 8080)
- Web UI: http://localhost:8080
- Benchmark: ~42 tokens/sec (GLM-4.7-Flash Q8)
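To confirm a server is reachable before wiring up clients, the OpenAI-compatible API can be queried directly; a quick sketch, assuming the default port 8080 and the Qwen3-Coder-Next alias from above:
# List the model(s) the running llama-server exposes
curl http://localhost:8080/v1/models
# Minimal chat completion round-trip (the alias must match the running server)
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "Qwen3-Coder-Next-Q8_0", "messages": [{"role": "user", "content": "Hello"}]}'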
See SETUP_WHISPER.md for detailed instructions.
Whisper TUI Interface
Unlock the ultimate workflow.
mkdir vibe_coding_with_mistral_vibe
cd vibe_coding_with_mistral_vibe
Refer to the official release:
uv tool install mistral-vibe
uv tool upgrade mistral-vibe
Launch vibe, choose your theme, and leave the API key blank or as whitespace (inference is handled locally).
Edit ~/.vibe/config.toml and add:
[[providers]]
name = "llamacpp"
api_base = "http://127.0.0.1:8080/v1"
api_key_env_var = ""
api_style = "openai"
backend = "generic"
reasoning_field_name = "reasoning_content"
[[models]]
name = "Qwen3-Coder-Next-Q8_0"
provider = "llamacpp"
temperature = 1.0
input_price = 0.0
output_price = 0.0
[[models]]
name = "GLM-4.7-Flash-Q8_K_XL"
provider = "llamacpp"
temperature = 1.0
input_price = 0.0
output_price = 0.0
[[models]]
name = "Nemotron-3-Nano-30B-A3B"
provider = "llamacpp"
temperature = 0.6
input_price = 0.0
output_price = 0.0
- Run /reload in the vibe interface
- Type /model and select your model with Enter
- Hit ESC when finished
Run vibe CLI on different machines using DGX Spark's GPU.
- Tailscale: Both DGX Spark and local machine on same Tailscale network
- Host Binding: ensure llama-server uses --host 0.0.0.0
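Before editing the config, it can help to verify connectivity from the local machine to the DGX Spark over Tailscale; a quick sketch, reusing the example Tailscale IP from the config below (replace with your own):
# Assumption: 100.114.54.60 is the DGX Spark's Tailscale IP
curl http://100.114.54.60:8080/v1/models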
Edit ~/.vibe/config.toml on remote machine:
[[providers]]
name = "dgx-remote-llamacpp"
api_base = "http://100.114.54.60:8080/v1" # Replace with DGX Tailscale IP
api_key_env_var = ""
api_style = "openai"
backend = "generic"
[[models]]
name = "Qwen3-Coder-Next-UD-Q8_K_XL"
provider = "dgx-remote-llamacpp"
temperature = 1.0
input_price = 0.0
output_price = 0.0
[[models]]
name = "GLM-4.7-Flash-Q8_K_XL"
provider = "dgx-remote-llamacpp"
temperature = 1.0
input_price = 0.0
output_price = 0.0
[[models]]
name = "Nemotron-3-Nano-30B-A3B"
provider = "dgx-remote-llamacpp"
temperature = 0.6
input_price = 0.0
output_price = 0.0
- Run vibe locally
- Execute /reload
- Select the remote model with /model
# GLM-4.7-Flash server
screen -dmS glm47
./launch_glm4.7_flash.sh
# Whisper server
screen -dmS whisper-server
./start_whisper_server.sh
# Gradio app (new terminal)
uv run python whisper_app.py
Find screen names:
screen -ls
Kill a specific screen:
screen -S glm47 -X quit
screen -S whisper-server -X quit
Upcoming: support via vLLM will be added in a future update.
