From f2a41b00478d5e99626dca129abf4e6148d9a9c1 Mon Sep 17 00:00:00 2001 From: Claude Date: Thu, 14 May 2026 09:21:06 +0000 Subject: [PATCH 1/3] Productize SGLang serving: vendor recipe + CI + runbooks (INFR-73) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Lands the in-repo work for the "Productize SGLang serving" epic (INFR-73), covering child tickets INFR-74 through INFR-81. Cross-repo work (lucibridge code in infernode-os/infernode, eval harness in IOL) stays out of this commit; their entry-points and contracts are documented in runbooks/. Per-ticket summary: INFR-74 (Investigate NGC for Orin sm_87): no code. Findings posted to the Jira ticket: NGC's SGLang line is CUDA-13 / JP7-only (datacenter + Thor). Fork-and-vendor remains the right path for Orin; NGC is the recommended base for Thor. INFR-76 (Vendor dusty-nv recipe): copy of dusty-nv/jetson-containers/packages/llm/sglang verbatim into sglang/orin/ (Dockerfile.upstream, build.sh, install.sh, test.py) with attribution in sglang/LICENSE-UPSTREAM.md. Standalone build path lives in sglang/orin/Dockerfile (diverged: drops chained transformers install, adds tokenizer bake step). INFR-77 (Pin SGLang >=0.5.x for gpt-oss): sglang/orin/config.py pinned to 0.5.3 (first 0.5.x line with srt/models/gpt_oss.py, predates upstream's CUDA-13 transition at 0.5.11). Fallback ladder documented in the config.py docstring; on-target smoke build on Hephaestus is the verification gate. INFR-75 (GitHub-hosted ubuntu-24.04-arm CI): .github/workflows/build-sglang.yml. Native aarch64 build on GitHub's Arm64 runners (Azure Cobalt 100, SBSA), push to ghcr.io/infernode-os/serving-sglang with variant-tagged images. Pins all third-party actions by commit SHA. Note: the self-hosted-Hephaestus plan in the original ticket description has been superseded; the Jira description has been updated via API. 
INFR-78 (Llama-3 tokenizer + chat-template fix): sglang/orin/bake-tokenizers.sh pulls non-gated mirrors of the Llama-3.1 and Llama-3 tokenizer dirs into /opt/tokenizers/ at image build time (~60 MB total). Documented launch flag --tokenizer-path /opt/tokenizers/llama-3.1 in the runbook. INFR-79 (lucibridge per-tool routing): code change lives in infernode-os/infernode (out of scope here). What's in this repo: runbooks/lucibridge-routing.md — the routing config schema, the per-category default table, env-var bridging, observability spec, and test plan. The infernode-side PR will consume this as the contract. INFR-80 (Hephaestus deploy runbook): runbooks/hephaestus-deploy.md. Pull + pre-flight + launch + healthcheck + systemd unit + memory budget + troubleshooting + serve-llm.sh integration + clean shutdown. Respects the Hephaestus disk policy (Docker on root, working data on /mnt/orin-ssd via bind mounts). INFR-81 (Thor sm_103 matrix build): sglang/thor/ (Dockerfile + README) wraps NGC nvcr.io/nvidia/sglang:25.10-py3. The build workflow matrix-builds Thor alongside Orin; Thor variant is skipped on PRs (needs NGC_API_KEY secret which forks don't have). Not in this commit (genuinely out of scope or blocked): * IOL-26 (virgil-agent eval against SGLang) — lives in IOL repo; runs after a working SGLang endpoint exists on Hephaestus. * The on-target smoke build of the pinned 0.5.3 image on Hephaestus (acceptance gate for INFR-77, requires Jetson hardware). * The actual lucibridge code change in infernode-os/infernode (consumes the runbook schema; tracked under INFR-79). 
--- .github/workflows/.gitkeep | 1 - .github/workflows/build-sglang.yml | 153 +++++++++++ runbooks/.gitkeep | 1 - runbooks/hephaestus-deploy.md | 392 +++++++++++++++++++++++++++++ runbooks/lucibridge-routing.md | 199 +++++++++++++++ sglang/.gitkeep | 1 - sglang/LICENSE-UPSTREAM.md | 46 ++++ sglang/README.md | 84 +++++++ sglang/orin/Dockerfile | 61 +++++ sglang/orin/Dockerfile.upstream | 41 +++ sglang/orin/README.md | 110 ++++++++ sglang/orin/bake-tokenizers.sh | 73 ++++++ sglang/orin/build.sh | 80 ++++++ sglang/orin/config.py | 64 +++++ sglang/orin/install.sh | 44 ++++ sglang/orin/test.py | 27 ++ sglang/thor/Dockerfile | 23 ++ sglang/thor/README.md | 60 +++++ sglang/thor/test.py | 27 ++ 19 files changed, 1484 insertions(+), 3 deletions(-) delete mode 100644 .github/workflows/.gitkeep create mode 100644 .github/workflows/build-sglang.yml delete mode 100644 runbooks/.gitkeep create mode 100644 runbooks/hephaestus-deploy.md create mode 100644 runbooks/lucibridge-routing.md delete mode 100644 sglang/.gitkeep create mode 100644 sglang/LICENSE-UPSTREAM.md create mode 100644 sglang/README.md create mode 100644 sglang/orin/Dockerfile create mode 100644 sglang/orin/Dockerfile.upstream create mode 100644 sglang/orin/README.md create mode 100755 sglang/orin/bake-tokenizers.sh create mode 100755 sglang/orin/build.sh create mode 100644 sglang/orin/config.py create mode 100755 sglang/orin/install.sh create mode 100755 sglang/orin/test.py create mode 100644 sglang/thor/Dockerfile create mode 100644 sglang/thor/README.md create mode 100755 sglang/thor/test.py diff --git a/.github/workflows/.gitkeep b/.github/workflows/.gitkeep deleted file mode 100644 index 4214d49..0000000 --- a/.github/workflows/.gitkeep +++ /dev/null @@ -1 +0,0 @@ -# Placeholder — populate when sglang fork / runbooks / CI lands. 
diff --git a/.github/workflows/build-sglang.yml b/.github/workflows/build-sglang.yml new file mode 100644 index 0000000..2742397 --- /dev/null +++ b/.github/workflows/build-sglang.yml @@ -0,0 +1,153 @@ +name: build-sglang + +# CI for the SGLang container variants. Builds on GitHub-hosted +# ubuntu-24.04-arm runners (Azure Cobalt 100 SBSA, native aarch64, no QEMU) +# and pushes to GHCR. See INFR-75 and INFR-81 for design. + +on: + push: + branches: [main] + paths: + - 'sglang/**' + - '.github/workflows/build-sglang.yml' + tags: ['v*'] + pull_request: + paths: + - 'sglang/**' + - '.github/workflows/build-sglang.yml' + workflow_dispatch: + inputs: + orin_base_image: + description: 'BASE_IMAGE arg for the Orin build' + required: false + default: 'dustynv/pytorch:2.6-r36.4.0-cu126-22.04' + sglang_version: + description: 'SGLANG_VERSION override (Orin only)' + required: false + default: '' + +permissions: + contents: read + packages: write + +env: + REGISTRY: ghcr.io + IMAGE_NAME: ${{ github.repository_owner }}/serving-sglang + # Default Orin base, overridable via workflow_dispatch. The dustynv + # PyTorch base ships a torch built with USE_DISTRIBUTED=1 and sm_87 + # device code; that's the combination the spike (INFR-68) found was + # the only one that lets SGLang's import chain succeed on Orin/JP6. + ORIN_BASE_IMAGE_DEFAULT: 'dustynv/pytorch:2.6-r36.4.0-cu126-22.04' + +jobs: + build: + name: build-${{ matrix.variant }} + runs-on: ubuntu-24.04-arm + # Thor needs NGC auth and only runs on push to main / tags / manual + # dispatch. PRs from forks have no access to NGC_API_KEY and would + # fail at the login step. 
+ strategy: + fail-fast: false + matrix: + # The matrix context is not available in a job-level `if` (only + # github, needs, vars and inputs are), so the Thor entry is left + # out of the matrix itself on pull_request events. + include: ${{ fromJSON(github.event_name == 'pull_request' && '[{"variant":"orin","cuda_arch":"8.7"}]' || '[{"variant":"orin","cuda_arch":"8.7"},{"variant":"thor","cuda_arch":"10.3a"}]') }} + + steps: + - name: Checkout + uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2 + + - name: Show build host + run: | + set -x + uname -a + cat /etc/os-release + docker version + docker buildx version + + - name: Set up Docker Buildx + uses: docker/setup-buildx-action@b5ca514318bd6ebac0fb2aedd5d36ec1b5c232a2 # v3.10.0 + + - name: Log in to GHCR + uses: docker/login-action@74a5d142397b4f367a81961eba4e8cd7edddf772 # v3.4.0 + with: + registry: ${{ env.REGISTRY }} + username: ${{ github.actor }} + password: ${{ secrets.GITHUB_TOKEN }} + + - name: Log in to NGC (Thor only) + if: matrix.variant == 'thor' + uses: docker/login-action@74a5d142397b4f367a81961eba4e8cd7edddf772 # v3.4.0 + with: + registry: nvcr.io + username: $oauthtoken + password: ${{ secrets.NGC_API_KEY }} + + - name: Compute tags + metadata + id: meta + uses: docker/metadata-action@902fa8ec7d6ecbf8d84d538b9b233a880e428804 # v5.7.0 + with: + images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }} + tags: | + type=raw,value=${{ matrix.variant }}-latest,enable={{is_default_branch}} + type=sha,prefix=${{ matrix.variant }}-,format=short + type=ref,event=tag,prefix=${{ matrix.variant }}- + labels: | + org.opencontainers.image.title=serving-sglang-${{ matrix.variant }} + org.opencontainers.image.description=SGLang for Jetson ${{ matrix.variant == 'orin' && 'Orin AGX (sm_87)' || 'Thor (sm_103)' }} + org.opencontainers.image.source=${{ github.server_url }}/${{ github.repository }} + org.opencontainers.image.licenses=MIT + + - name: Decide whether to push + id: push_decide + run: | + if [ "${{ github.event_name }}" = "pull_request" ]; then + echo "push=false" >> "$GITHUB_OUTPUT" + else + echo "push=true" >> "$GITHUB_OUTPUT" + fi + + - name: Build Orin variant + if: matrix.variant == 'orin' + uses: 
docker/build-push-action@471d1dc4e07e5cdedd4c2171150001c434f0b7a4 # v6.15.0 + with: + context: sglang/orin + file: sglang/orin/Dockerfile + platforms: linux/arm64 + push: ${{ steps.push_decide.outputs.push }} + tags: ${{ steps.meta.outputs.tags }} + labels: ${{ steps.meta.outputs.labels }} + build-args: | + BASE_IMAGE=${{ inputs.orin_base_image || env.ORIN_BASE_IMAGE_DEFAULT }} + SGLANG_VERSION=${{ inputs.sglang_version || '0.5.3' }} + SGLANG_VERSION_SPEC=${{ inputs.sglang_version || '0.5.3' }} + IS_SBSA=0 + FORCE_BUILD=off + cache-from: type=gha,scope=sglang-orin + cache-to: type=gha,scope=sglang-orin,mode=max + + - name: Build Thor variant + if: matrix.variant == 'thor' + uses: docker/build-push-action@471d1dc4e07e5cdedd4c2171150001c434f0b7a4 # v6.15.0 + with: + context: sglang/thor + file: sglang/thor/Dockerfile + platforms: linux/arm64 + push: ${{ steps.push_decide.outputs.push }} + tags: ${{ steps.meta.outputs.tags }} + labels: ${{ steps.meta.outputs.labels }} + build-args: | + NGC_SGLANG_TAG=25.10-py3 + cache-from: type=gha,scope=sglang-thor + cache-to: type=gha,scope=sglang-thor,mode=max + + - name: Summarise + if: always() + run: | + echo "Variant: ${{ matrix.variant }}" + echo "Pushed: ${{ steps.push_decide.outputs.push }}" + echo "Tags:" + echo "${{ steps.meta.outputs.tags }}" diff --git a/runbooks/.gitkeep b/runbooks/.gitkeep deleted file mode 100644 index 4214d49..0000000 --- a/runbooks/.gitkeep +++ /dev/null @@ -1 +0,0 @@ -# Placeholder — populate when sglang fork / runbooks / CI lands. diff --git a/runbooks/hephaestus-deploy.md b/runbooks/hephaestus-deploy.md new file mode 100644 index 0000000..568a8be --- /dev/null +++ b/runbooks/hephaestus-deploy.md @@ -0,0 +1,392 @@ +# Hephaestus deploy runbook — SGLang on Jetson Orin AGX + +End-to-end runbook for deploying the GHCR-built SGLang container on +**Hephaestus** (Jetson Orin AGX, JetPack 6.x, 64 GiB unified memory) +and wiring it into the existing `serve-llm.sh` / `lucibridge` +operational model. 
+ +**Owning epic:** INFR-73. **Acceptance gate:** a new contributor can +bring SGLang up on Hephaestus in under 15 minutes following this +document. + +--- + +## 0. Prereqs and disk policy + +**Hephaestus disk policy** (load-bearing — do NOT violate): + +| Mount | Purpose | Allowed contents | +|---|---|---| +| `/` (root partition, deliberately constrained) | Emulates a production single-disk InferNode node | OS · TAK · NERVA via Docker · Ollama binary | +| `/mnt/orin-ssd` (916 GiB) | Dev indulgence | SGLang container · HF cache · build artefacts · spike logs | + +The Docker daemon's storage root must NOT be migrated from `/` to +`/mnt/orin-ssd` — that would put TAK/NERVA images on the dev disk and +break the production emulation. Instead, the **SGLang container is +pulled and run with bind-mounts** that put its working data on +orin-ssd. See §3. + +Prereqs to verify before starting: + +```sh +# JetPack and CUDA +cat /etc/nv_tegra_release | head -1 # expect R36, REVISION: 4.x +nvidia-smi # expect Orin / CUDA 12.6+ +docker info | grep -i 'storage driver' # expect overlay2 on root + +# Disk space +df -h / /mnt/orin-ssd # root <90% used; orin-ssd <90% used + +# Ollama still up (we coexist, not replace) +curl -fsS http://127.0.0.1:11434/api/version +``` + +--- + +## 1. Pull the container + +GHCR images are public; no docker login required. + +```sh +# Pin a specific build by short SHA from CI; never use :orin-latest in +# production runs — the SHA is what gets recorded in incident timelines. +IMAGE='ghcr.io/infernode-os/serving-sglang:orin-' + +docker pull "$IMAGE" +docker images "$IMAGE" --format '{{.Repository}}:{{.Tag}} {{.Size}}' +# Expect: ~12–14 GB (CUDA + cuDNN + PyTorch wheels + SGLang + sgl-kernel). +# Pull goes onto root partition Docker storage. 
Verify residual headroom: +df -h / +``` + +If the root partition is tight, `docker image prune` old SGLang images +**only** (`docker images "ghcr.io/infernode-os/serving-sglang" --quiet | tail -n +3`) +— do not prune TAK/NERVA images. + +--- + +## 2. Pre-flight checks + +```sh +# CUDA visibility inside the container +docker run --rm --runtime nvidia --gpus all "$IMAGE" \ + python3 -c "import torch; print('cuda', torch.cuda.is_available(), torch.cuda.get_device_name(0))" +# Expected: "cuda True Orin" (or "NVIDIA Jetson AGX Orin") + +# SGLang import + version +docker run --rm --runtime nvidia --gpus all "$IMAGE" \ + python3 /opt/sglang/test.py +# Expected last line: "SGLang OK" + +# gpt-oss model class present (INFR-77 acceptance) +docker run --rm --runtime nvidia --gpus all "$IMAGE" \ + python3 -c "import sglang.srt.models.gpt_oss; print('gpt-oss arch present')" +``` + +If any of those fail, **stop**. Don't proceed to §3. The image is +broken; either pull a different tag or rebuild from `sglang/orin/` +(see `sglang/orin/README.md`). + +--- + +## 3. Launch (one-shot, for testing) + +The launch invocation that produced the bake-off in +`docs/SGLANG-ADOPTION-NOTES.md`, adapted for the GHCR image. Bind-mounts +keep the HF cache on `/mnt/orin-ssd` (disk policy §0). 
+ +```sh +MODEL=TinyLlama/TinyLlama-1.1B-Chat-v1.0 # smoke target +HF_CACHE=/mnt/orin-ssd/huggingface +LOGS=/mnt/orin-ssd/pdfinn/scratch/sglang-logs +mkdir -p "$HF_CACHE" "$LOGS" + +docker run --rm -it \ + --runtime nvidia --gpus all \ + --network host \ + --shm-size 8g \ + -v "$HF_CACHE":/root/.cache/huggingface \ + -e TORCHDYNAMO_DISABLE=1 \ + -e TORCH_COMPILE_DISABLE=1 \ + -e HF_HOME=/root/.cache/huggingface \ + "$IMAGE" \ + python3 -m sglang.launch_server \ + --model-path "$MODEL" \ + --tokenizer-path /opt/tokenizers/llama-3.1 \ + --host 127.0.0.1 --port 30000 \ + --attention-backend triton \ + --mem-fraction-static 0.5 \ + --disable-cuda-graph \ + --log-level info \ + 2>&1 | tee "$LOGS/sglang-$(date +%s).log" +``` + +Flag notes (all carried forward from spike findings): + +* `--runtime nvidia --gpus all` — required for Tegra GPU passthrough. +* `--network host` — simplest port binding; firewall lives on the host. + Skip `--publish` to avoid double NAT through Docker. +* `--shm-size 8g` — SGLang's worker pool uses shared memory; default + 64 MB causes silent stalls under concurrency. +* `--tokenizer-path /opt/tokenizers/llama-3.1` — fixes Llama-3 special + tokens (INFR-78). Omit only for non-Llama models; for those, set + `--tokenizer-path` to the appropriate baked directory or to an HF + repo id. +* `--attention-backend triton` — most conservative Jetson path + (flashinfer also works but Triton was the proven default in the spike). +* `--mem-fraction-static 0.5` — leaves 32 GiB on the unified-memory + budget for OS + Ollama + headroom. Tune per workload (see §6). +* `--disable-cuda-graph` — CUDA graph capture is finicky on Jetson; + disable for stability. Re-enable later if perf needs it. +* `TORCHDYNAMO_DISABLE=1 / TORCH_COMPILE_DISABLE=1` — bypass + `torch.compile`; the in-image Triton + torch combination doesn't + reliably JIT-compile and we don't need compile for serving. + +--- + +## 4. 
Healthcheck sequence + +Run this against a freshly-launched server, in order: + +```sh +BASE=http://127.0.0.1:30000 + +# 1. liveness — must return 200 immediately +curl -fsS "$BASE/health" + +# 2. model list — must include the --model-path you launched with +curl -fsS "$BASE/v1/models" | python3 -m json.tool + +# 3. smoke chat completion — should complete in <2s for TinyLlama +curl -fsS "$BASE/v1/chat/completions" -H 'content-type: application/json' -d '{ + "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0", + "messages": [{"role":"user","content":"Capital of Belgium? One word."}], + "max_tokens": 8, + "temperature": 0 +}' +# Expect: "Brussels" or close. If you see garbage tokens, the tokenizer +# path is wrong — see §7 troubleshooting. +``` + +--- + +## 5. systemd unit (production) + +For a deploy that survives reboot, drop the unit below in +`/etc/systemd/system/serving-sglang.service`. Mirrors the pattern in +IOL's `docs/HEADLESS-LLM-DAEMON.md` for `serve-llm.service`. + +```ini +[Unit] +Description=InferNode SGLang serving (Jetson Orin) +After=docker.service ollama.service network-online.target +Wants=docker.service network-online.target + +[Service] +Type=exec +Restart=on-failure +RestartSec=10 +TimeoutStopSec=60 + +# Resolved env file holds IMAGE pin, model path, port, etc. 
+EnvironmentFile=/etc/serving-sglang.env + +ExecStartPre=-/usr/bin/docker stop serving-sglang +ExecStartPre=-/usr/bin/docker rm serving-sglang + +ExecStart=/usr/bin/docker run --name serving-sglang --rm \ + --runtime nvidia --gpus all \ + --network host --shm-size 8g \ + -v ${HF_CACHE}:/root/.cache/huggingface \ + -e TORCHDYNAMO_DISABLE=1 -e TORCH_COMPILE_DISABLE=1 \ + -e HF_HOME=/root/.cache/huggingface \ + ${IMAGE} \ + python3 -m sglang.launch_server \ + --model-path ${MODEL_PATH} \ + --tokenizer-path ${TOKENIZER_PATH} \ + --host 127.0.0.1 --port ${PORT} \ + --attention-backend triton \ + --mem-fraction-static ${MEM_FRACTION_STATIC} \ + --disable-cuda-graph \ + --log-level info + +ExecStop=/usr/bin/docker stop serving-sglang + +[Install] +WantedBy=multi-user.target +``` + +Companion `/etc/serving-sglang.env` (chmod 0644, no secrets): + +```sh +IMAGE=ghcr.io/infernode-os/serving-sglang:orin- +MODEL_PATH=openai/gpt-oss-20b +TOKENIZER_PATH=/opt/tokenizers/llama-3.1 +PORT=30000 +MEM_FRACTION_STATIC=0.5 +HF_CACHE=/mnt/orin-ssd/huggingface +``` + +Activate: + +```sh +sudo systemctl daemon-reload +sudo systemctl enable --now serving-sglang.service +sudo systemctl status serving-sglang.service --no-pager +journalctl -u serving-sglang.service -f +``` + +`After=ollama.service` lets Ollama come up first (port 11434), then +SGLang on 30000. Both coexist on 64 GiB unified memory; see §6. + +--- + +## 6. Memory budgeting on 64 GiB unified + +Jetson Orin's 64 GiB is shared between CPU and GPU. 
 Concrete budget +for the v1 cut, drawn from the spike's measured working set: + +| Component | Resident (typical) | Notes | +|---|---|---| +| OS + drivers | ~3 GiB | baseline | +| Ollama + one resident model (Devstral GGUF Q4) | ~14 GiB | `OLLAMA_KEEP_ALIVE` default | +| SGLang model weights (gpt-oss-20b, MXFP4 / AWQ) | ~13 GiB | `--quantization` dependent | +| SGLang KV cache | ~9 GiB at 4096 ctx, 16 running | `--mem-fraction-static 0.5` cap | +| Headroom | ~25 GiB | TAK / NERVA / NN bursts | + +Tunables: + +* **`--mem-fraction-static`** = fraction of GPU memory SGLang + pre-allocates for static use (model weights + KV cache pool). `0.5` + is the spike default (32 GiB cap). Raise to `0.6` if Ollama is + dropped from the box; drop to `0.4` if TAK/NERVA models grow. +* **`--max-running-requests`** = concurrency cap. `16` was the spike + default; tail latency blew out at N=16 due to starvation. For + Veltro's expected fan-out (≤8 concurrent), set `--max-running-requests 8`. +* **`--max-total-tokens`** = global token budget across the batch. Keep + proportional to `mem-fraction-static * gpu_mem`. + +After any tune, re-run the §4 healthchecks and validate p95 latency +using the bake-off shape in `docs/SGLANG-ADOPTION-NOTES.md`. + +--- + +## 7. 
Troubleshooting + +| Symptom | Likely cause | Fix | +|---|---|---| +| `CUDA unknown error` at container start | Tegra driver shim not on `LD_LIBRARY_PATH` inside container | Verify `--runtime nvidia` is set; the runtime injects `/usr/lib/aarch64-linux-gnu/tegra` automatically | +| Garbage output, off-topic answers (Belgium → "It seems like you are referring to…") | GGUF tokenizer used for Llama-3 family (INFR-78 known issue) | Set `--tokenizer-path /opt/tokenizers/llama-3.1` | +| `stop` strings never match, completions run to `max_tokens` | Same as above — special tokens not registered | Same as above | +| OOM during model load | Other process holds GPU memory | `nvidia-smi` to find culprit; drop Ollama's resident model with `curl -X POST :11434/api/generate -d '{"model":"...","keep_alive":0}'` | +| Server starts but `/health` 503s | Worker pool stalled on small shm | Confirm `--shm-size 8g` is on the `docker run` line | +| `import sglang` fails inside container | Stale image / partial install | Re-pull pinned tag; re-run §2 pre-flight | +| Per-request latency >2× expected | `torch.compile` accidentally enabled | Confirm `TORCHDYNAMO_DISABLE=1 TORCH_COMPILE_DISABLE=1` | +| Build/push CI green but pull on Hephaestus 404s | GHCR visibility set wrong on first publish | One-time: in repo Settings → Packages, set the package public | + +--- + +## 8. `serve-llm.sh` integration + +`serve-llm.sh` (in `infernode-os/infernode`) is the dev-side launcher +that the ZeroTier-mounted user interacts with. Today it talks to +Ollama at `http://127.0.0.1:11434/v1`. With SGLang available the +launcher gains a sibling backend. + +### Single-backend mode (Ollama-only — current default) + +No change. SGLang need not be installed. 
+ +### Dual-backend mode (Ollama + SGLang — new) + +Set these in the env of `serve-llm.service` (or before invoking +`serve-llm.sh` interactively): + +```sh +export LLM_BACKEND_DEFAULT=http://127.0.0.1:11434/v1 +export LLM_BACKEND_SGLANG=http://127.0.0.1:30000/v1 +``` + +`lucibridge` (see §9) picks per-tool which URL to dispatch to. If +`LLM_BACKEND_SGLANG` is unset, lucibridge falls back to +`LLM_BACKEND_DEFAULT` for every request — backward compatible. + +### Switching modes + +```sh +# Ollama-only +sudo systemctl stop serving-sglang +sudo systemctl mask serving-sglang # prevent restart on reboot + +# Re-enable SGLang +sudo systemctl unmask serving-sglang +sudo systemctl start serving-sglang +``` + +--- + +## 9. lucibridge per-tool routing (cross-ref INFR-79) + +See `runbooks/lucibridge-routing.md` for the routing config schema and +the per-tool / per-capability mapping. Headline: + +* `tool_category in {limbo_authoring}` → `LLM_BACKEND_DEFAULT` + (Devstral via Ollama, current production) +* `tool_category in {dispatch, tool_call, memory, task}` → + `LLM_BACKEND_SGLANG` (gpt-oss via SGLang, post-INFR-77) +* unset / unknown category → `LLM_BACKEND_DEFAULT` (fallback) + +The routing change lives in `infernode-os/infernode`'s `lucibridge` +module; this runbook is the operational doc that documents what the +deployed config looks like and how to flip between modes. + +--- + +## 10. Stopping cleanly + +```sh +# Graceful — gives SGLang ~30s to drain in-flight requests +sudo systemctl stop serving-sglang +# or, ad-hoc: +docker stop serving-sglang + +# If the daemon is stuck (>60s), escalate +docker kill --signal=KILL serving-sglang +``` + +Expected drain time is sub-second when idle, up to ~30s under N≥16 +concurrent. If `docker stop` takes longer than 60s, it usually means +a downstream client is holding a streaming request open; the bridge +should be killed first (`systemctl stop serve-llm`). + +--- + +## 11. 
Verifying end-to-end with `serve-llm` + lucibridge + +```sh +# Bring everything up +sudo systemctl start ollama serving-sglang serve-llm +sleep 5 + +# Healthcheck the bridge endpoint +curl -fsS http://127.0.0.1:8080/health # serve-llm + +# Run a Veltro-shaped probe: a tool-call turn that routes to SGLang +# (gpt-oss) and a Limbo-authoring turn that routes to Ollama (Devstral). +# See runbooks/lucibridge-routing.md for the probe payload. +``` + +A passing run looks like: SGLang's journal shows one `POST +/v1/chat/completions` per dispatched tool call; Ollama's logs show one +generate for the Limbo authoring turn; bridge logs show the routing +decision for each. + +--- + +## References + +* `docs/SGLANG-ADOPTION-NOTES.md` — spike findings, bake-off numbers, the original launch flags +* `sglang/orin/README.md` — container build / version-pin details +* `runbooks/lucibridge-routing.md` — routing config schema (INFR-79) +* `infernode-os/infernode:docs/HEADLESS-LLM-DAEMON.md` — `serve-llm.service` operational template +* Tickets: INFR-73 (epic), INFR-77 (gpt-oss unblock), INFR-78 (tokenizer), INFR-79 (routing), INFR-80 (this runbook) diff --git a/runbooks/lucibridge-routing.md b/runbooks/lucibridge-routing.md new file mode 100644 index 0000000..6d58573 --- /dev/null +++ b/runbooks/lucibridge-routing.md @@ -0,0 +1,199 @@ +# lucibridge — per-tool routing for multi-backend serving + +**Owning ticket:** INFR-79. **Cross-repo dependency:** the bridge +implementation lives in +[`infernode-os/infernode`](https://github.com/infernode-os/infernode); +this file is the **operational schema and config** that lives in the +serving repo so deploy decisions stay with the deploy artefacts. The +agentlib changes in infernode that consume this schema are tracked on +INFR-79 itself. + +--- + +## Why this exists + +Per V4-PLAN's production strategy: **gpt-oss is for dispatch / tool +calls**, **devstral-limbo is for Limbo authoring**. 
After INFR-77 +unblocks gpt-oss on SGLang and the spike (INFR-68) measured ~3× +concurrent throughput vs Ollama, the box has two viable backends: + +| Backend | URL | Model strengths | +|---|---|---| +| Ollama | `http://127.0.0.1:11434/v1` | single-user latency, Devstral GGUF, Limbo authoring | +| SGLang | `http://127.0.0.1:30000/v1` | concurrent fan-out, gpt-oss-20b, xgrammar tool-call grounding | + +A single fixed `LLM_BACKEND_URL` env var no longer captures the +intent. `lucibridge` needs to **route per request**, and the routing +key is the tool-category (which the agentlib already attaches to each +dispatch). + +--- + +## Routing table + +The canonical route table for v1 (Hephaestus, Veltro-on-SGLang). All +URLs are relative to the configured backend prefix; `model` is the +model selector accepted by both backends' `/v1/chat/completions`. + +| Tool category | Backend | Model | Notes | +|---|---|---|---| +| `limbo_authoring` | Ollama | `devstral-limbo-v3` (or v4 once daedalus lands) | Single-user fluency; no concurrent fan-out | +| `dispatch` | SGLang | `openai/gpt-oss-20b` | Fast tool-call dispatch, xgrammar-friendly | +| `tool_call` | SGLang | `openai/gpt-oss-20b` | Per-tool args, may be grammar-constrained | +| `memory` | SGLang | `openai/gpt-oss-20b` | Fan-out heavy, benefits from RadixAttention | +| `task` | SGLang | `openai/gpt-oss-20b` | Same as above | +| `chat` / unset / unknown | Ollama | `LLM_DEFAULT_MODEL` | Fallback — backward compatible with v0 | + +Every row is overridable by config (see below); the table is the +**default** when no override is supplied. Unknown categories must +fall back to Ollama — the dev path stays unchanged. 
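The default table above can be sketched as a small lookup. This is a minimal illustration in Python, not the bridge's actual code (which lives in `infernode-os/infernode`). The names `Route`, `ROUTES`, `FALLBACK`, and `resolve_route` are hypothetical, `devstral-limbo-v3` stands in for `LLM_DEFAULT_MODEL`, and treating `chat` as a fallback case is an assumption drawn from the last table row.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Route:
    """Hypothetical container for one routing decision."""
    backend: str
    model: str


# Stand-in for LLM_DEFAULT_MODEL; the value is an assumption for this sketch.
DEFAULT_MODEL = "devstral-limbo-v3"
FALLBACK = Route("ollama", DEFAULT_MODEL)

# Default table transcribed from the runbook (v1, Hephaestus).
ROUTES = {
    "limbo_authoring": Route("ollama", "devstral-limbo-v3"),
    "dispatch": Route("sglang", "openai/gpt-oss-20b"),
    "tool_call": Route("sglang", "openai/gpt-oss-20b"),
    "memory": Route("sglang", "openai/gpt-oss-20b"),
    "task": Route("sglang", "openai/gpt-oss-20b"),
}


def resolve_route(category: Optional[str]) -> Tuple[Route, bool]:
    """Return (route, fallback_used) for a tool category.

    `chat`, unset, and unknown categories all land on the Ollama fallback.
    """
    route = ROUTES.get(category) if category else None
    if route is None:
        return FALLBACK, True
    return route, False


if __name__ == "__main__":
    for cat in ("tool_call", "limbo_authoring", "frobnicate", None):
        route, fell_back = resolve_route(cat)
        print(cat, "->", route.backend, route.model, "(fallback)" if fell_back else "")
```

A config override would replace entries in `ROUTES`; the fallback branch is what keeps unknown categories on Ollama.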
 + +--- + +## Config schema + +Stored at `/etc/lucibridge/routing.json` on Hephaestus, owned by +`root:root`, mode `0644` (no secrets): + +```json +{ + "$schema": "infernode.lucibridge.routing/v1", + "backends": { + "ollama": { "base_url": "http://127.0.0.1:11434/v1" }, + "sglang": { "base_url": "http://127.0.0.1:30000/v1" } + }, + "default_backend": "ollama", + "default_model": "devstral-limbo-v3", + "routes": [ + { "category": "limbo_authoring", "backend": "ollama", "model": "devstral-limbo-v3" }, + { "category": "dispatch", "backend": "sglang", "model": "openai/gpt-oss-20b" }, + { "category": "tool_call", "backend": "sglang", "model": "openai/gpt-oss-20b", + "extra": { "grammar_backend": "xgrammar" } }, + { "category": "memory", "backend": "sglang", "model": "openai/gpt-oss-20b" }, + { "category": "task", "backend": "sglang", "model": "openai/gpt-oss-20b" } + ], + "fallback": { + "on_backend_unreachable": "default_backend", + "on_unknown_category": "default_backend", + "log_decisions": true + } +} +``` + +### Field reference + +| Field | Type | Meaning | +|---|---|---| +| `backends` | map\<name, {base_url}\> | named pool of OpenAI-compatible endpoints | +| `default_backend` | string | backend used when no route matches; also the fallback target on unreachable backend | +| `default_model` | string | model passed to `default_backend` when none specified | +| `routes` | list\<{category, backend, model, extra?}\> | first-match per-category route; ordering matters only for diagnostics | +| `routes[].extra` | map | passed through as extra fields on the upstream `/v1/chat/completions` body (e.g. 
`grammar_backend`, `lora_name` for INFR-77 multi-LoRA) | +| `fallback.on_backend_unreachable` | enum | `default_backend` \| `fail` | +| `fallback.on_unknown_category` | enum | `default_backend` \| `fail` | +| `fallback.log_decisions` | bool | emit a structured log line per routing decision (recommended on) | + +### Backwards compatibility + +Bridges from before INFR-79 spoke a single `LLM_BACKEND_URL` env var. +The post-INFR-79 bridge MUST still honour that env var as a +configuration fallback: if `/etc/lucibridge/routing.json` is missing, +the bridge constructs an implicit config with one backend (the +`LLM_BACKEND_URL`) and the default model from `LLM_DEFAULT_MODEL`, and +routes every request to it. v0 deployments don't need to change. + +--- + +## env-var bridging from `serve-llm.sh` + +The serve-llm launcher exports the env vars that the bridge resolves +into the implicit / explicit config. For Hephaestus dual-backend mode: + +```sh +# /etc/serve-llm.env (sourced by serve-llm.service) +LLM_BACKEND_DEFAULT=http://127.0.0.1:11434/v1 +LLM_BACKEND_SGLANG=http://127.0.0.1:30000/v1 +LLM_DEFAULT_MODEL=devstral-limbo-v3 +LUCIBRIDGE_ROUTING_CONFIG=/etc/lucibridge/routing.json +``` + +The bridge uses `LUCIBRIDGE_ROUTING_CONFIG` if set, falling back to +the implicit single-backend mode (using `LLM_BACKEND_DEFAULT`) when +the file is absent. This matches the §8 mode-switching pattern in +`runbooks/hephaestus-deploy.md`. + +--- + +## Observability — what to log + +Every routing decision emits one structured log line at INFO. 
Format +(JSON; one line per request): + +```json +{ + "ts": "2026-05-14T03:14:15Z", + "event": "lucibridge.route", + "request_id": "req_…", + "category": "tool_call", + "matched_route_index": 2, + "backend": "sglang", + "backend_url": "http://127.0.0.1:30000/v1", + "model": "openai/gpt-oss-20b", + "fallback_used": false, + "fallback_reason": null +} +``` + +`fallback_used: true` with `fallback_reason: "unknown_category"` or +`"backend_unreachable"` is the signal for routing problems. Alert on +sustained `fallback_used: true` (>5% of decisions over a 5-minute +window). + +--- + +## Test plan (lives in `infernode-os/infernode:agentlib_test/`) + +Once the bridge code change is implemented, the tests that must exist: + +1. **Single-backend v0 compat:** routing.json absent, `LLM_BACKEND_URL` + set → every request goes to that URL. No regression vs pre-INFR-79. +2. **Per-category routing:** routing.json present, two backends mocked + → a `limbo_authoring` request hits Ollama mock; a `tool_call` + request hits SGLang mock; verify the URL + model dispatched. +3. **Fallback on unreachable backend:** SGLang mock returns 503 → + request falls back to default backend, log line shows `fallback_used: true, fallback_reason: "backend_unreachable"`. +4. **Fallback on unknown category:** `category: "frobnicate"` → + default backend used, fallback log emitted. +5. **`extra` passthrough:** route with `extra: {grammar_backend: "xgrammar"}` + → the outgoing body includes that field. +6. **Decision-log emission:** with `log_decisions: true`, every + request produces exactly one structured log line. + +These are pre-existing `lucibridge_test` shape; the diff is the new +fixture file and the new test functions. 
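Test 1 (v0 compatibility) hinges on how the bridge builds its configuration when `routing.json` is absent. A rough sketch of that behaviour follows, in Python for illustration only. `load_routing_config` and the backend name `implicit` are hypothetical, and reading `LLM_BACKEND_URL` from a passed-in mapping (rather than directly from the process environment) is a choice made here to keep the example easy to test.

```python
import json
import os
from typing import Mapping, Optional


def load_routing_config(path: Optional[str], env: Mapping[str, str]) -> dict:
    """Load routing.json if it exists, else build an implicit v0 config."""
    if path and os.path.isfile(path):
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)

    # v0 compatibility: a single backend taken from LLM_BACKEND_URL and no
    # routes, so every request resolves to the default backend.
    return {
        "backends": {"implicit": {"base_url": env["LLM_BACKEND_URL"]}},
        "default_backend": "implicit",
        "default_model": env.get("LLM_DEFAULT_MODEL", ""),
        "routes": [],
    }


if __name__ == "__main__":
    cfg = load_routing_config(
        os.environ.get("LUCIBRIDGE_ROUTING_CONFIG"),
        {
            "LLM_BACKEND_URL": "http://127.0.0.1:11434/v1",
            "LLM_DEFAULT_MODEL": "devstral-limbo-v3",
        },
    )
    print(json.dumps(cfg, indent=2))
```

With an empty `routes` list, a resolver that falls through to `default_backend` sends every category to the single configured URL, which is the behaviour test 1 asserts.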
+ +--- + +## Acceptance for INFR-79 + +| Criterion | Where verified | +|---|---| +| A configured Veltro session routes `limbo_authoring` to Ollama and `tool_call` to SGLang, both succeed | `infernode-os/infernode` agentlib_test suite + manual run on Hephaestus | +| Fallback documented and tested | tests 3 + 4 above | +| Existing single-URL configs still work unchanged | test 1 above | +| No regression in existing `lucibridge_test` suite | CI run on `infernode-os/infernode` | + +--- + +## Open issues + +* **Per-tool LoRA selection** — gpt-oss-20b will serve multiple + adapters (`gpt-oss-limbo-v3` plus future v4) layered on one resident + base. The route's `extra` field should carry `lora_name` once SGLang + multi-LoRA is wired up. Tracked as a follow-up under INFR-77's + multi-LoRA validation. +* **Bridge config hot-reload** — v1 reloads on `SIGHUP`; document + exact behaviour once the agentlib PR lands. +* **Cross-host routing** — current schema assumes Hephaestus-local + backends. If we add a second Jetson, `backends[].base_url` already + takes any URL; document the firewall/zerotier story before exposing. diff --git a/sglang/.gitkeep b/sglang/.gitkeep deleted file mode 100644 index 4214d49..0000000 --- a/sglang/.gitkeep +++ /dev/null @@ -1 +0,0 @@ -# Placeholder — populate when sglang fork / runbooks / CI lands. diff --git a/sglang/LICENSE-UPSTREAM.md b/sglang/LICENSE-UPSTREAM.md new file mode 100644 index 0000000..3adbebc --- /dev/null +++ b/sglang/LICENSE-UPSTREAM.md @@ -0,0 +1,46 @@ +# Upstream attribution — `dusty-nv/jetson-containers` + +The `sglang/orin/` recipe in this subtree is vendored from +[`dusty-nv/jetson-containers`](https://github.com/dusty-nv/jetson-containers), +specifically the path `packages/llm/sglang/`. 
+ +* **Vendored on:** 2026-05-14 +* **Upstream commit at vendoring:** `6ec74990dc4b84f3cbba86c2def7f232db9d0eaf` +* **Upstream license:** MIT +* **Maintainer:** Dustin Franklin (NVIDIA DevRel) and contributors + +The license text below is the upstream `LICENSE.md` reproduced +verbatim, as required by the MIT terms. This repository is also +MIT-licensed (see `/LICENSE`); the two are compatible. + +When re-syncing from upstream, update the commit SHA above and diff the +verbatim files (`Dockerfile`, `build.sh`, `install.sh`, `test.py`) against +the upstream snapshot at the new SHA. `orin/config.py` is intentionally +divergent (pinned for Orin/JP6/CUDA 12.6 instead of upstream's CUDA-13 +pin) and should not be overwritten by a sync. + +--- + +## Upstream license (MIT) + +``` +Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. + +Permission is hereby granted, free of charge, to any person obtaining a +copy of this software and associated documentation files (the "Software"), +to deal in the Software without restriction, including without limitation +the rights to use, copy, modify, merge, publish, distribute, sublicense, +and/or sell copies of the Software, and to permit persons to whom the +Software is furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included +in all copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS +OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF +MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. +IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY +CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, +TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE +SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
+``` diff --git a/sglang/README.md b/sglang/README.md new file mode 100644 index 0000000..e6005d3 --- /dev/null +++ b/sglang/README.md @@ -0,0 +1,84 @@ +# sglang/ — vendored Jetson SGLang recipe + +This subtree holds container recipes for running SGLang on NVIDIA Jetson +hardware, vendored from +[`dusty-nv/jetson-containers`](https://github.com/dusty-nv/jetson-containers) +(MIT-licensed, NVIDIA-DevRel-maintained) and adapted for InferNode's +production needs. + +## Layout + +``` +sglang/ +├── README.md this file +├── LICENSE-UPSTREAM.md attribution for vendored dusty-nv recipe +├── orin/ Jetson Orin AGX (sm_87, JetPack 6.x, CUDA 12.6) +│ ├── README.md +│ ├── config.py OUR pinned SGLang version +│ ├── Dockerfile standalone build (modified from upstream) +│ ├── build.sh vendored from upstream +│ ├── install.sh vendored from upstream +│ └── test.py vendored from upstream +└── thor/ Jetson Thor (sm_103, JetPack 7.x, CUDA 13.x) + ├── README.md + └── Dockerfile thin wrapper over NGC's nvcr.io/nvidia/sglang +``` + +## Why two variants + +The two hardware generations need different base-image strategies: + +* **Orin (`sm_87`)** has no NGC SGLang image (per INFR-74 investigation — + NGC's SGLang line is CUDA-13-based, JP7-targeted). The fork-and-build + path is the only one available. The Orin recipe is a full vendor of + the dusty-nv jetson-containers SGLang package with our pinned version. + +* **Thor (`sm_103`)** has NVIDIA's official NGC SGLang container + (`nvcr.io/nvidia/sglang:25.10-py3` and later). This is a cleaner upstream + than Orin's community-maintained chain, so we use it as a base and add + only the InferNode-specific bits we need (tokenizers, entrypoint + conveniences). + +## Vendoring decision + +Straight copy with attribution, not git submodule or subtree-merge. Reasons: + +1. We need to **diverge** from upstream's pinned version: upstream pins + SGLang 0.5.11 with the explicit annotation "Compatible with CUDA 13 + (Spark and Thor)".
JetPack 6.x ships CUDA 12.6 — that pin doesn't + work for Orin. We need our own version pin (see `orin/config.py`). +2. The upstream recipe is **small** (≈ 130 LOC across 4 files plus + README + test). Submodule overhead exceeds the merge-back cost. +3. We may want to apply Orin-specific patches (e.g. the chat-template + / tokenizer fixes per INFR-78) without coordinating with upstream. + +If upstream evolves in ways we care about, the re-sync is a manual +diff-and-merge against `LICENSE-UPSTREAM.md`'s recorded commit SHA. +The vendored files at the time of copy are an exact snapshot — the +diff is therefore easy to compute. + +## Upstream source + +* Repo: https://github.com/dusty-nv/jetson-containers +* Path: `packages/llm/sglang/` +* Vendored from: `master` branch as of 2026-05-14 +* Upstream commit at vendoring: see `LICENSE-UPSTREAM.md` +* Upstream license: MIT (compatible with this repo's MIT) + +## What changed vs upstream + +| File | Status | +|---|---| +| `orin/config.py` | **Modified** — pinned to a 0.5.x release compatible with CUDA 12.6 (see `orin/README.md`) | +| `orin/Dockerfile.upstream` | Verbatim — kept for diff against upstream re-syncs | +| `orin/Dockerfile` | **Modified** — standalone build (drops chained `/tmp/transformers/install.sh`, adds tokenizer bake step per INFR-78) | +| `orin/build.sh` | Verbatim | +| `orin/install.sh` | Verbatim | +| `orin/test.py` | Verbatim | +| `orin/bake-tokenizers.sh` | InferNode-authored — pulls non-gated Llama-3 tokenizer dirs (INFR-78) | +| `thor/Dockerfile` | InferNode-authored — wraps NGC `nvcr.io/nvidia/sglang` | +| `thor/test.py` | Verbatim copy of `orin/test.py` (docker-context boundary) | + +When you re-sync from upstream, diff the `orin/` non-config files +against the upstream snapshot at that point; `config.py` is intentionally +divergent and should not be auto-overwritten.
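
The re-sync check described above can be scripted. A minimal sketch, assuming a local checkout of upstream at the commit being compared; the `drift` helper and the `VERBATIM` mapping are hypothetical names for this example and are not part of this repo or of jetson-containers:

```python
import difflib
from pathlib import Path

# Local file name under sglang/orin/ -> file name in upstream's
# packages/llm/sglang/ directory. config.py is left out on purpose:
# it is intentionally divergent.
VERBATIM = {
    "Dockerfile.upstream": "Dockerfile",
    "build.sh": "build.sh",
    "install.sh": "install.sh",
    "test.py": "test.py",
}


def drift(local_dir, upstream_dir):
    """Return {local_name: unified diff text} for vendored files that differ."""
    changed = {}
    for local_name, upstream_name in VERBATIM.items():
        ours = (Path(local_dir) / local_name).read_text().splitlines(keepends=True)
        theirs = (Path(upstream_dir) / upstream_name).read_text().splitlines(keepends=True)
        diff = "".join(
            difflib.unified_diff(
                ours, theirs, f"orin/{local_name}", f"upstream/{upstream_name}"
            )
        )
        if diff:
            changed[local_name] = diff
    return changed
```

An empty result from `drift("sglang/orin", "<upstream checkout>/packages/llm/sglang")` would mean the verbatim files still match the snapshot; any entries returned are the candidates for a manual merge.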
diff --git a/sglang/orin/Dockerfile b/sglang/orin/Dockerfile new file mode 100644 index 0000000..8e120eb --- /dev/null +++ b/sglang/orin/Dockerfile @@ -0,0 +1,61 @@ +# SGLang for Jetson Orin AGX (sm_87, JetPack 6.x, CUDA 12.6) +# +# Based on dusty-nv/jetson-containers:packages/llm/sglang/Dockerfile +# (see ../LICENSE-UPSTREAM.md). Kept verbatim at ./Dockerfile.upstream +# for diff-against-upstream. This file is the standalone production +# build with the following deltas: +# +# 1. `transformers` install moved to a `pip install` (upstream chains +# its own /tmp/transformers/install.sh from an earlier stage). +# 2. Tokenizer + chat-template bake step added at /opt/tokenizers/ +# (INFR-78). +# 3. Smoke-test script copied to /opt/sglang/test.py. + +ARG BASE_IMAGE +FROM ${BASE_IMAGE} + +ARG SGLANG_VERSION \ + SGLANG_VERSION_SPEC \ + IS_SBSA \ + FORCE_BUILD=off \ + TMP=/tmp/sglang \ + TOKENIZER_DIR=/opt/tokenizers + +RUN apt-get update -y && \ + apt-get install -y --no-install-recommends \ + libnuma-dev \ + libsndfile1 \ + libsndfile1-dev \ + libprotobuf-dev \ + libsm6 \ + libxext6 \ + libgl1 && \ + apt-get clean && \ + rm -rf /var/lib/apt/lists/* + +# Vendored SGLang install (pip-first; falls back to source build on failure). +COPY build.sh install.sh $TMP/ +RUN $TMP/install.sh || $TMP/build.sh || touch $TMP/.build.failed +RUN if [ -f $TMP/.build.failed ]; then \ + echo "SGLANG ${SGLANG_VERSION} build failed!"; \ + exit 1; \ + fi + +# Standalone replacement for upstream's chained /tmp/transformers/install.sh. +# SGLang declares its own transformers pin via setup.py; this just ensures the +# transitive install actually completed. +RUN python3 -c "import transformers; print('transformers', transformers.__version__)" \ + || pip install --no-cache-dir 'transformers>=4.45,<5' + +# Bake Llama-3 family tokenizer + chat templates so --tokenizer-path can +# be set at launch time without a runtime HuggingFace pull. Adds ~60 MB +# (two families * ~30 MB) to the final image. 
(INFR-78.) +COPY bake-tokenizers.sh $TMP/ +RUN TOKENIZER_DIR=${TOKENIZER_DIR} $TMP/bake-tokenizers.sh + +# Smoke test, callable as `docker run --rm python3 /opt/sglang/test.py`. +COPY test.py /opt/sglang/test.py + +LABEL org.opencontainers.image.source="https://github.com/infernode-os/serving" +LABEL org.opencontainers.image.description="SGLang for Jetson Orin AGX (sm_87) — InferNode fork of dusty-nv/jetson-containers" +LABEL org.opencontainers.image.licenses="MIT" diff --git a/sglang/orin/Dockerfile.upstream b/sglang/orin/Dockerfile.upstream new file mode 100644 index 0000000..65452e9 --- /dev/null +++ b/sglang/orin/Dockerfile.upstream @@ -0,0 +1,41 @@ +#--- +# name: sglang +# group: llm +# config: config.py +# depends: [sgl-kernel, torch-memory-saver] +# buildkit_device: nvidia.com/gpu=all +# requires: '>=36' +# test: test.py +# notes: https://github.com/sgl-project/sglang +#--- +ARG BASE_IMAGE +FROM ${BASE_IMAGE} + +ARG SGLANG_VERSION \ + SGLANG_VERSION_SPEC \ + IS_SBSA \ + FORCE_BUILD=off \ + TMP=/tmp/sglang + +RUN apt-get update -y && \ + apt-get install -y --no-install-recommends \ + libnuma-dev \ + libsndfile1 \ + libsndfile1-dev \ + libprotobuf-dev \ + libsm6 \ + libxext6 \ + libgl1 && \ + apt-get clean && \ + rm -rf /var/lib/apt/lists/* + +COPY build.sh install.sh $TMP/ +RUN $TMP/install.sh || $TMP/build.sh || touch $TMP/.build.failed + +# this retains the stage above for debugging on build failure +RUN if [ -f $TMP/.build.failed ]; then \ + echo "SGLANG ${SGLANG_VERSION} build failed!"; \ + exit 1; \ + fi + +RUN /tmp/transformers/install.sh diff --git a/sglang/orin/README.md b/sglang/orin/README.md new file mode 100644 index 0000000..350c764 --- /dev/null +++ b/sglang/orin/README.md @@ -0,0 +1,110 @@ +# sglang/orin/ — Jetson Orin AGX (sm_87) + +Container recipe for SGLang on Jetson Orin AGX. Vendored from +`dusty-nv/jetson-containers:packages/llm/sglang/` (see +`../LICENSE-UPSTREAM.md`) with version pin and Orin-specific notes +diverged from upstream. 
+ +## Target + +| Property | Value | +|---|---| +| GPU | Ampere Tegra (`sm_87`) | +| JetPack | 6.x (R36.4 series) | +| CUDA | 12.6 | +| cuDNN | 9.3 | +| L4T base | `r36.4.0` | +| Python | 3.10 | +| PyTorch | 2.5–2.6 (`USE_DISTRIBUTED=1`, `TORCH_CUDA_ARCH_LIST=8.7`) | + +## Pinned version + +`SGLANG_VERSION=0.5.3` — see the docstring at the top of `config.py` +for the rationale and fallback ladder. The first 0.5.x line with +`srt/models/gpt_oss.py`. On-target smoke build on Hephaestus is the +gate (INFR-77). + +If the pinned version fails to build, fall back in this order: +0.5.2 → 0.5.1 → 0.5.0, then 0.4.5+. Document the working pin back in +`config.py`'s `package = [ … ]` and update this README's "Pinned +version" line. + +## Build (CI, GitHub-hosted) + +CI builds the container on `ubuntu-24.04-arm` (Graviton SBSA, native +aarch64 — no QEMU). See `.github/workflows/build-sglang.yml`. The +build cross-compiles CUDA kernels for `sm_87` via +`TORCH_CUDA_ARCH_LIST=8.7` at `nvcc` invocation; the output image +runs on Jetson Orin AGX (Tegra). + +## Build (manual on Hephaestus) + +When the CI image is unavailable or you're iterating on the recipe: + +```sh +cd ~/serving/sglang/orin +docker build \ + --build-arg BASE_IMAGE=dustynv/pytorch:2.6-r36.4.0-cu126-22.04 \ + --build-arg SGLANG_VERSION=0.5.3 \ + --build-arg SGLANG_VERSION_SPEC=0.5.3 \ + --build-arg IS_SBSA=0 \ + -t serving-sglang:orin-local . +``` + +(The exact `BASE_IMAGE` tag depends on what dustynv has published at +the time. Use `dustynv/pytorch` rather than `dustynv/sglang` to avoid +inheriting their stale 0.4.1 install; we re-install our pinned 0.5.x +fresh via `install.sh` / `build.sh`.) 
+ +## Files + +| File | Origin | Purpose | +|---|---|---| +| `config.py` | **modified** | jetson-containers package config; pinned to our Orin-compatible version | +| `Dockerfile` | **modified** | standalone build with tokenizer bake (INFR-78) | +| `Dockerfile.upstream` | verbatim from upstream | reference copy for diff against upstream re-syncs | +| `build.sh` | verbatim from upstream | source-build fallback if `pip install` fails | +| `install.sh` | verbatim from upstream | `pip install sglang[all]~=$SGLANG_VERSION` first-try path | +| `bake-tokenizers.sh` | InferNode-authored | downloads Llama-3 / Llama-3.1 tokenizer dirs into `/opt/tokenizers/` (INFR-78) | +| `test.py` | verbatim from upstream | smoke test (`import sglang`, print version + CUDA device) | + +## What this does NOT include + +* **A `BASE_IMAGE`**. The Dockerfile expects one to be passed at build + time (matches the upstream jetson-containers pattern, which chains + base images via its framework). The CI workflow supplies a + Jetson-rooted base; for manual builds see the command above. +* **Tokenizers beyond the Llama-3 family**. Only the Llama-3 and + Llama-3.1 tokenizer dirs are baked into `/opt/tokenizers/` (INFR-78); any + other tokenizer dir must be mounted and passed via `--tokenizer-path`. +* **Entrypoint scripts**. Launch arguments live with the runbook + (`runbooks/hephaestus-deploy.md`) rather than baked into the image, + so the same image serves different model paths without rebuild. + +## Verifying a build + +After a successful image build, run the upstream smoke test inside the +container: + +```sh +docker run --rm --gpus all --runtime nvidia serving-sglang:orin-local \ + python3 /opt/sglang/test.py +``` + +Expected output: + +``` +testing SGLang...
+✅ Memory cleared +SGLang version: 0.5.3 +CUDA available: True +CUDA device: Orin (or NVIDIA Jetson AGX Orin) +SGLang OK +``` + +`gpt-oss` arch verification (per INFR-77 acceptance): + +```sh +docker run --rm --gpus all --runtime nvidia serving-sglang:orin-local \ + python3 -c "import sglang.srt.models.gpt_oss as m; print('gpt_oss arch module:', m.__file__)" +``` diff --git a/sglang/orin/bake-tokenizers.sh b/sglang/orin/bake-tokenizers.sh new file mode 100755 index 0000000..bf7b788 --- /dev/null +++ b/sglang/orin/bake-tokenizers.sh @@ -0,0 +1,73 @@ +#!/usr/bin/env bash +# +# Bake a curated set of non-gated tokenizer + chat-template directories +# into the SGLang container at /opt/tokenizers//. +# +# Why: SGLang's GGUF tokenizer doesn't register Llama-3's special tokens +# (<|eot_id|>, <|begin_of_text|>, <|start_header_id|>) properly, so stops +# don't match cleanly and chat-template framing is wrong. The fix is to +# point --tokenizer-path at a real HuggingFace tokenizer dir at launch +# time (see runbooks/hephaestus-deploy.md). Baking the tokenizers into +# the image means the launch command is fully offline-capable. +# +# Each family pulled is the tokenizer files only (~30 MB per family); +# weights are NOT downloaded. We pick non-gated mirrors so no HF login +# is required at build time. +# +# Owning ticket: INFR-78. + +set -euo pipefail + +DEST="${TOKENIZER_DIR:-/opt/tokenizers}" +mkdir -p "$DEST" + +# family repo (non-gated mirror) alias dir under $DEST +declare -A FAMILIES=( + [llama-3.1]="unsloth/Meta-Llama-3.1-8B-Instruct" + [llama-3]="NousResearch/Meta-Llama-3-8B-Instruct" +) + +# Tokenizer-only files. No weights, no model.safetensors. 
+PATTERNS=( + "tokenizer.json" + "tokenizer_config.json" + "special_tokens_map.json" + "chat_template.json" + "generation_config.json" +) + +python3 - "$DEST" "${!FAMILIES[@]}" <<'PY' "${FAMILIES[@]}" +import os, sys +from huggingface_hub import snapshot_download + +dest_root = sys.argv[1] +families = sys.argv[2:] +# Args come in two halves: aliases then repos (same order). +n = len(families) // 2 +aliases, repos = families[:n], families[n:] + +patterns = [ + "tokenizer.json", + "tokenizer_config.json", + "special_tokens_map.json", + "chat_template.json", + "generation_config.json", +] + +for alias, repo in zip(aliases, repos): + target = os.path.join(dest_root, alias) + os.makedirs(target, exist_ok=True) + print(f"[bake-tokenizers] {repo} -> {target}") + snapshot_download( + repo_id=repo, + local_dir=target, + local_dir_use_symlinks=False, + allow_patterns=patterns, + ) + +print("[bake-tokenizers] done") +PY + +# Strip HuggingFace cache metadata; we only want the flat tokenizer dirs. +find "$DEST" -name '.huggingface' -prune -exec rm -rf {} + 2>/dev/null || true +du -sh "$DEST"/* 2>/dev/null || true diff --git a/sglang/orin/build.sh b/sglang/orin/build.sh new file mode 100755 index 0000000..be213e4 --- /dev/null +++ b/sglang/orin/build.sh @@ -0,0 +1,80 @@ +#!/usr/bin/env bash +set -x + +# Ensure required variables are set +: "${SGLANG_VERSION:?SGLANG_VERSION must be set}" +: "${PIP_WHEEL_DIR:?PIP_WHEEL_DIR must be set}" + +# --- PRE-INSTALL DEPS --- +# Install build dependencies first. uv is a very fast installer. +uv pip install --no-cache-dir ninja setuptools wheel numpy uv scikit-build-core compressed-tensors decord2 + +# --- CLONE SGLANG REPO --- +REPO_URL="https://github.com/sgl-project/sglang" +REPO_DIR="/opt/sglang" + +echo "Building SGLang ${SGLANG_VERSION}" + +if [ ! 
-d "${REPO_DIR}" ]; then + if git clone --recursive --depth 1 --branch "v${SGLANG_VERSION}" \ + "${REPO_URL}" "${REPO_DIR}"; then + echo "Cloned SGLang v${SGLANG_VERSION}" + else + echo "Tagged branch v${SGLANG_VERSION} not found; cloning default branch" + git clone --recursive --depth 1 "${REPO_URL}" "${REPO_DIR}" + fi +else + echo "Directory ${REPO_DIR} already exists, skipping clone." +fi +cd "${REPO_DIR}" || exit 1 + + +# --- PATCH 1: RELAX PYTORCH VERSION REQUIREMENTS --- +cd "${REPO_DIR}/python" || exit 1 +sed -i 's/==/>=/g' pyproject.toml + +echo "Patched ${REPO_DIR}/python/pyproject.toml to relax version constraints" +cat pyproject.toml + +# --- CONFIGURE PARALLEL BUILD --- +if [[ -z "${IS_SBSA:-}" || "${IS_SBSA}" == "0" || "${IS_SBSA,,}" == "false" ]]; then + export CORES=6 # Automatically use all available cores +else + export CORES=6 # GH200 or other specific hardware +fi +export CMAKE_BUILD_PARALLEL_LEVEL="${CORES}" +export MAX_JOBS="${CORES}" + +# --- BUILD SGLANG WHEEL (THE RIGHT WAY) --- +echo "🚀 Building sglang wheel ONLY with MAX_JOBS=${CORES}" + +# Use '--no-deps' to build ONLY the sglang wheel and ignore its dependencies. +# We will install dependencies later when we install the built wheel. +uv build --wheel \ + --no-build-isolation \ + . \ + --out-dir "${PIP_WHEEL_DIR}" + +# --- INSTALL THE BUILT WHEEL AND ITS DEPENDENCIES --- +echo "✅ sglang wheel built successfully." +echo "📦 Installing the sglang wheel from ${PIP_WHEEL_DIR} and its dependencies from PyPI..." + +# Now, when we install the local wheel, pip will fetch its dependencies +# (like torch, transformers, etc.) from the online package index (PyPI). +# We use 'uv' here because it's extremely fast. +uv pip install -v --find-links="${PIP_WHEEL_DIR}" "sglang[all]" + +# Your original script installed 'gemlite' here, so we keep it. +uv pip install gemlite orjson pybase64 + +echo "🎉 SGLang and all dependencies installed successfully!" 
+ +cd / || exit 1 + +# Try uploading; ignore failure +if [ -x "$(command -v twine)" ]; then + twine upload --verbose "${PIP_WHEEL_DIR}/sglang"*.whl \ + || echo "Failed to upload wheel to ${TWINE_REPOSITORY_URL:-}" +else + echo "twine not installed, skipping upload." +fi diff --git a/sglang/orin/config.py b/sglang/orin/config.py new file mode 100644 index 0000000..4595cf0 --- /dev/null +++ b/sglang/orin/config.py @@ -0,0 +1,64 @@ +""" +SGLang package configuration — Jetson Orin AGX variant (sm_87, JetPack 6.x, CUDA 12.6) + +Vendored from dusty-nv/jetson-containers:packages/llm/sglang/config.py +(see ../LICENSE-UPSTREAM.md for attribution), with the version pin +diverged from upstream's CUDA-13-tied 0.5.11 to a JP6/CUDA-12.6- +compatible 0.5.x release that ships gpt_oss.py in srt/models/. + +Pin rationale (INFR-74 + INFR-77): + - Upstream's current default (0.5.11) is annotated "Compatible with + CUDA 13 (Spark and Thor)" — won't build for JetPack 6 / CUDA 12.6. + - Dustynv's last Orin-targeted published tag (r36.4.0) shipped + 0.4.1.post7 (Feb 2025), which predates gpt-oss support + (Aug 2025) and has no srt/models/gpt_oss.py. + - 0.5.3 is the initial pick: first 0.5.x line with gpt-oss model + class, predates the upstream's CUDA-13 transition (which landed + around 0.5.11). On-target smoke build on Hephaestus is the gate. + - Fallback ladder if 0.5.3 doesn't build: 0.5.2 → 0.5.1 → 0.5.0, + then 0.4.5+. Document the working pin back here when verified. 
+""" +from jetson_containers import CUDA_VERSION, IS_SBSA, update_dependencies +from packaging.version import Version + + +def sglang(version, version_spec=None, requires=None, depends=None, default=False): + pkg = package.copy() + + if requires: + pkg['requires'] = requires + + if not version_spec: + version_spec = version + + if depends: + pkg['depends'] = update_dependencies(pkg['depends'], depends) + + pkg['name'] = f'sglang:{version}' + + pkg['build_args'] = { + 'SGLANG_VERSION': version, + 'SGLANG_VERSION_SPEC': version_spec, + 'IS_SBSA': IS_SBSA + } + + builder = pkg.copy() + + builder['name'] = f'sglang:{version}-builder' + builder['build_args'] = {**pkg['build_args'], **{'FORCE_BUILD': 'on'}} + + if default: + pkg['alias'] = 'sglang' + builder['alias'] = 'sglang:builder' + + return pkg, builder + + +package = [ + sglang( + '0.5.3', + '0.5.3', + depends=['flashinfer', 'sgl-kernel:0.5.3', 'torchao:0.17.0'], + default=True, + ), # Orin/JP6/CUDA 12.6 pin — first 0.5.x with gpt_oss.py; verify via on-target smoke build (INFR-77) +] diff --git a/sglang/orin/install.sh b/sglang/orin/install.sh new file mode 100755 index 0000000..fd35dff --- /dev/null +++ b/sglang/orin/install.sh @@ -0,0 +1,44 @@ +#!/usr/bin/env bash +set -ex + +uv pip install \ + compressed-tensors \ + datasets \ + decord2 \ + fastapi \ + hf_transfer \ + huggingface_hub \ + interegular \ + "llguidance>=0.7.11,<0.8.0" \ + modelscope \ + ninja \ + orjson \ + packaging \ + partial_json_parser \ + pillow \ + "prometheus-client>=0.20.0" \ + psutil \ + pydantic \ + nvidia-ml-py \ + python-multipart \ + "pyzmq>=25.1.2" \ + "soundfile>=0.13.1" \ + "torchao>=0.9.0" \ + uvicorn \ + uvloop \ + "blobfile>=3.0.0" \ + "anthropic" \ + "msgspec" \ + orjson \ + litellm \ + pybase64 \ + fastapi \ + outlines + +if [ "$FORCE_BUILD" == "on" ]; then + echo "Forcing build of sglang ${SGLANG_VERSION}" + exit 1 +fi + +uv pip install sgl-kernel "sglang[all]~=${SGLANG_VERSION}" || \ +uv pip install sgl-kernel 
"sglang[all]~=${SGLANG_VERSION_SPEC}" diff --git a/sglang/orin/test.py b/sglang/orin/test.py new file mode 100755 index 0000000..5ac148c --- /dev/null +++ b/sglang/orin/test.py @@ -0,0 +1,27 @@ +#!/usr/bin/env python3 +import gc +import torch + +def clear_memory(): + """Free GPU and CPU memory before running SGLang.""" + try: + if torch.cuda.is_available(): + torch.cuda.empty_cache() + torch.cuda.ipc_collect() + torch.cuda.synchronize() + gc.collect() + print("✅ Memory cleared") + except Exception as e: + print(f"⚠️ Memory cleanup failed: {e}") + +print('testing SGLang...') +clear_memory() # <-- Clean before anything else + +import sglang as sgl + +print(f"SGLang version: {sgl.__version__}") +print(f"CUDA available: {torch.cuda.is_available()}") +if torch.cuda.is_available(): + print(f"CUDA device: {torch.cuda.get_device_name(0)}") + +print('SGLang OK\n') diff --git a/sglang/thor/Dockerfile b/sglang/thor/Dockerfile new file mode 100644 index 0000000..c494afd --- /dev/null +++ b/sglang/thor/Dockerfile @@ -0,0 +1,23 @@ +# Jetson Thor (sm_103, JetPack 7, CUDA 13) SGLang container. +# +# This is a thin wrapper over NVIDIA's official NGC SGLang image. The +# heavy lifting (PyTorch with sm_103 device code, sgl-kernel, flashinfer, +# xgrammar) is all done by NVIDIA in the base image. We add only what +# InferNode needs on top. +# +# Build: see sglang/thor/README.md or .github/workflows/build-sglang.yml. + +ARG NGC_SGLANG_TAG=25.10-py3 +FROM nvcr.io/nvidia/sglang:${NGC_SGLANG_TAG} + +LABEL org.opencontainers.image.source="https://github.com/infernode-os/serving" +LABEL org.opencontainers.image.description="SGLang for Jetson Thor (sm_103) — InferNode overlay on NGC base" +LABEL org.opencontainers.image.licenses="MIT" + +# Smoke test mirrored from sglang/orin/test.py so the same verification +# script works against either variant. (Duplicated rather than symlinked +# because docker build context can't reach outside this directory.) 
+COPY test.py /opt/sglang/test.py + +# No CMD/ENTRYPOINT override — preserve NGC's defaults so callers +# pass the launch args explicitly (see runbooks/hephaestus-deploy.md). diff --git a/sglang/thor/README.md b/sglang/thor/README.md new file mode 100644 index 0000000..7d17fb2 --- /dev/null +++ b/sglang/thor/README.md @@ -0,0 +1,60 @@ +# sglang/thor/ — Jetson Thor (sm_103) + +Container recipe for SGLang on Jetson Thor. Forward-looking — we +don't have a Thor box yet, but the recipe stays parallel to Orin's so +that the deploy story doesn't need a rewrite when one arrives. + +## Target + +| Property | Value | +|---|---| +| GPU | Blackwell Tegra (`sm_103`) | +| JetPack | 7.x | +| CUDA | 13.0+ | +| Base | `nvcr.io/nvidia/sglang:25.10-py3` (NGC) | + +## Why this is shorter than `orin/` + +NVIDIA ships an **official** NGC SGLang container line starting at +`25.10-py3` (October 2025) that explicitly targets Jetson Thor. See +the INFR-74 investigation comment for the full enumeration. The Thor +recipe therefore doesn't fork dusty-nv/jetson-containers — it just +pulls NGC's image, version-pins, and layers any InferNode-specific +conveniences on top. + +This is a deliberate asymmetry with `orin/`: NGC's SGLang line is +CUDA-13-based and JP7-targeted, so it can't run on Orin/JP6 — but +that's exactly what we want for Thor. + +## Pinned base + +`nvcr.io/nvidia/sglang:25.10-py3` — first NGC release with Jetson +Thor support. Bump as newer NGC tags ship (`25.11-py3`, `26.02-py3`, +etc.) once we have a Thor box and can verify each. + +## Known issues to track + +Per the [NGC SGLang 25.10 release notes](https://docs.nvidia.com/deeplearning/frameworks/sglang-release-notes/rel-25-10.html): + +* **`gpt-oss` family models cannot run on DGX Spark and Jetson Thor + due to an OpenAI Triton issue.** This blocks the V4-PLAN gpt-oss + prize on Thor specifically. Track NGC release notes for resolution; + Orin's recipe (with a JP6/CUDA-12.6 Triton stack) is the workaround + in the meantime. 
+ +## Build + +CI matrix-builds both variants on `ubuntu-24.04-arm`; the Thor variant +is a `FROM nvcr.io/nvidia/sglang:25.10-py3` + thin overlay, so the +build is mostly a re-tag with our overlay scripts. See +`.github/workflows/build-sglang.yml`. + +NGC auth is required for the pull (free NGC account works; the CI +workflow expects `NGC_API_KEY` as a repo secret). + +## Manual build on a Thor host (when one exists) + +```sh +echo "$NGC_API_KEY" | docker login nvcr.io -u '$oauthtoken' --password-stdin +docker build -t serving-sglang:thor-local sglang/thor/ +``` diff --git a/sglang/thor/test.py b/sglang/thor/test.py new file mode 100755 index 0000000..5ac148c --- /dev/null +++ b/sglang/thor/test.py @@ -0,0 +1,27 @@ +#!/usr/bin/env python3 +import gc +import torch + +def clear_memory(): + """Free GPU and CPU memory before running SGLang.""" + try: + if torch.cuda.is_available(): + torch.cuda.empty_cache() + torch.cuda.ipc_collect() + torch.cuda.synchronize() + gc.collect() + print("✅ Memory cleared") + except Exception as e: + print(f"⚠️ Memory cleanup failed: {e}") + +print('testing SGLang...') +clear_memory() # <-- Clean before anything else + +import sglang as sgl + +print(f"SGLang version: {sgl.__version__}") +print(f"CUDA available: {torch.cuda.is_available()}") +if torch.cuda.is_available(): + print(f"CUDA device: {torch.cuda.get_device_name(0)}") + +print('SGLang OK\n') From d54cb2511f3942cb78ab416bc64e36c78e241ab7 Mon Sep 17 00:00:00 2001 From: Claude Date: Thu, 14 May 2026 09:29:03 +0000 Subject: [PATCH 2/3] ci(sglang): drop NGC_API_KEY as a hard requirement NGC SGLang containers are anonymously pullable from nvcr.io for the default tags we care about. Make the NGC login step conditional on the secret being set (forward-compat with any future gated variant) and remove the PR-skip that was only there because of the bogus auth assumption. Thor variant now builds on every event, same as Orin. 
--- .github/workflows/build-sglang.yml | 20 +++++++++++++++----- sglang/thor/README.md | 13 +++++++------ 2 files changed, 22 insertions(+), 11 deletions(-) diff --git a/.github/workflows/build-sglang.yml b/.github/workflows/build-sglang.yml index 2742397..6d3c7a0 100644 --- a/.github/workflows/build-sglang.yml +++ b/.github/workflows/build-sglang.yml @@ -43,10 +43,6 @@ jobs: build: name: build-${{ matrix.variant }} runs-on: ubuntu-24.04-arm - # Thor needs NGC auth and only runs on push to main / tags / manual - # dispatch. PRs from forks have no access to NGC_API_KEY and would - # fail at the login step. - if: matrix.variant != 'thor' || github.event_name != 'pull_request' strategy: fail-fast: false matrix: @@ -78,8 +74,22 @@ jobs: username: ${{ github.actor }} password: ${{ secrets.GITHUB_TOKEN }} - - name: Log in to NGC (Thor only) + # NGC SGLang containers are anonymously pullable from nvcr.io — + # NGC auth is only needed for gated content (which the Thor base + # isn't). This step runs only when NGC_API_KEY is set, so it stays + # forward-compatible with any future gated variant. + - name: Check NGC auth if: matrix.variant == 'thor' + id: ngc + run: | + if [ -n "${{ secrets.NGC_API_KEY }}" ]; then + echo "use_auth=true" >> "$GITHUB_OUTPUT" + else + echo "use_auth=false" >> "$GITHUB_OUTPUT" + fi + + - name: Log in to NGC (Thor only, if secret set) + if: matrix.variant == 'thor' && steps.ngc.outputs.use_auth == 'true' uses: docker/login-action@74a5d142397b4f367a81961eba4e8cd7edddf772 # v3.4.0 with: registry: nvcr.io diff --git a/sglang/thor/README.md b/sglang/thor/README.md index 7d17fb2..5214c35 100644 --- a/sglang/thor/README.md +++ b/sglang/thor/README.md @@ -45,16 +45,17 @@ Per the [NGC SGLang 25.10 release notes](https://docs.nvidia.com/deeplearning/fr ## Build CI matrix-builds both variants on `ubuntu-24.04-arm`; the Thor variant -is a `FROM nvcr.io/nvidia/sglang:25.10-py3` + thin overlay, so the -build is mostly a re-tag with our overlay scripts. 
See +is `FROM nvcr.io/nvidia/sglang:25.10-py3` + thin overlay, so the build +is mostly a re-tag with our overlay scripts. See `.github/workflows/build-sglang.yml`. -NGC auth is required for the pull (free NGC account works; the CI -workflow expects `NGC_API_KEY` as a repo secret). +NGC SGLang containers are anonymously pullable from `nvcr.io` — no +auth required for the default tag. If NVIDIA ever gates a tag we want, +set the `NGC_API_KEY` repo secret and the workflow's optional login +step will fire automatically. -## Manual build on a Thor host (when one exists) +## Manual build ```sh -echo "$NGC_API_KEY" | docker login nvcr.io -u '$oauthtoken' --password-stdin docker build -t serving-sglang:thor-local sglang/thor/ ``` From 3bc873c6c3ddd98439281f6508aae90bb6aaa56e Mon Sep 17 00:00:00 2001 From: Claude Date: Thu, 14 May 2026 09:34:05 +0000 Subject: [PATCH 3/3] ci(sglang): point Orin BASE_IMAGE at a tag that actually exists MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit First CI run failed with: dustynv/pytorch:2.6-r36.4.0-cu126-22.04: not found dustynv moved the JP6 publishing line to cu128 / Ubuntu 24.04 a while back; the cu126-22.04 / Python 3.10 variant the spike used is no longer maintained. Switch the workflow default and the orin README's manual-build example to 2.6-r36.4.0-cu128-24.04. In-container Python 3.12 is fine — the spike's host-Python-alignment constraint only mattered for its hand-extracted-onto-host setup, not for Docker. CUDA 12.8 runtime is forward-compatible with JP6.x's CUDA 12.6 driver per NVIDIA's same-major compat policy. 
---
 .github/workflows/build-sglang.yml |  8 ++++++--
 sglang/orin/README.md              | 15 ++++++++-------
 2 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/.github/workflows/build-sglang.yml b/.github/workflows/build-sglang.yml
index 6d3c7a0..3d7f020 100644
--- a/.github/workflows/build-sglang.yml
+++ b/.github/workflows/build-sglang.yml
@@ -20,7 +20,7 @@ on:
       orin_base_image:
         description: 'BASE_IMAGE arg for the Orin build'
         required: false
-        default: 'dustynv/pytorch:2.6-r36.4.0-cu126-22.04'
+        default: 'dustynv/pytorch:2.6-r36.4.0-cu128-24.04'
       sglang_version:
         description: 'SGLANG_VERSION override (Orin only)'
         required: false
@@ -37,7 +37,11 @@ env:
   # PyTorch base ships a torch built with USE_DISTRIBUTED=1 and sm_87
   # device code; that's the combination the spike (INFR-68) found was
   # the only one that lets SGLang's import chain succeed on Orin/JP6.
-  ORIN_BASE_IMAGE_DEFAULT: 'dustynv/pytorch:2.6-r36.4.0-cu126-22.04'
+  # cu128-24.04 is the current published JP6 line on Docker Hub (the
+  # earlier cu126-22.04 / Python 3.10 variant the spike used has been
+  # superseded; Python 3.12 in-container is fine — only the spike's
+  # hand-extracted-onto-host setup needed host Python alignment).
+  ORIN_BASE_IMAGE_DEFAULT: 'dustynv/pytorch:2.6-r36.4.0-cu128-24.04'
 
 jobs:
   build:
diff --git a/sglang/orin/README.md b/sglang/orin/README.md
index 350c764..ddb0ac9 100644
--- a/sglang/orin/README.md
+++ b/sglang/orin/README.md
@@ -11,11 +11,11 @@ diverged from upstream.
 |---|---|
 | GPU | Ampere Tegra (`sm_87`) |
 | JetPack | 6.x (R36.4 series) |
-| CUDA | 12.6 |
-| cuDNN | 9.3 |
+| CUDA (container) | 12.8 (forward-compat with JP6.x's 12.6 driver) |
+| cuDNN | 9.x |
 | L4T base | `r36.4.0` |
-| Python | 3.10 |
-| PyTorch | 2.5–2.6 (`USE_DISTRIBUTED=1`, `TORCH_CUDA_ARCH_LIST=8.7`) |
+| Python (container) | 3.12 |
+| PyTorch | 2.6 (`USE_DISTRIBUTED=1`, `TORCH_CUDA_ARCH_LIST=8.7`) |
 
 ## Pinned version
 
@@ -44,7 +44,7 @@ When the CI image is unavailable or you're iterating on the recipe:
 ```sh
 cd ~/serving/sglang/orin
 docker build \
-  --build-arg BASE_IMAGE=dustynv/pytorch:2.6-r36.4.0-cu126-22.04 \
+  --build-arg BASE_IMAGE=dustynv/pytorch:2.6-r36.4.0-cu128-24.04 \
   --build-arg SGLANG_VERSION=0.5.3 \
   --build-arg SGLANG_VERSION_SPEC=0.5.3 \
   --build-arg IS_SBSA=0 \
@@ -53,8 +53,9 @@ docker build \
 
 (The exact `BASE_IMAGE` tag depends on what dustynv has published at
 the time. Use `dustynv/pytorch` rather than `dustynv/sglang` to avoid
-inheriting their stale 0.4.1 install; we re-install our pinned 0.5.x
-fresh via `install.sh` / `build.sh`.)
+inheriting their stale SGLang install; we re-install our pinned 0.5.x
+fresh via `install.sh` / `build.sh`. Check
+ for current options.)
 
 ## Files