From f2a41b00478d5e99626dca129abf4e6148d9a9c1 Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Thu, 14 May 2026 09:21:06 +0000
Subject: [PATCH 1/3] Productize SGLang serving: vendor recipe + CI + runbooks
 (INFR-73)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Lands the in-repo work for the "Productize SGLang serving" epic
(INFR-73), covering child tickets INFR-74 through INFR-81. Cross-repo
work (lucibridge code in infernode-os/infernode, eval harness in IOL)
stays out of this commit; their entry-points and contracts are
documented in runbooks/.

Per-ticket summary:

INFR-74 (Investigate NGC for Orin sm_87): no code. Findings posted to
the Jira ticket — NGC's SGLang line is CUDA-13 / JP7-only (datacenter
+ Thor). Fork-and-vendor remains the right path for Orin; NGC is the
recommended base for Thor.

INFR-76 (Vendor dusty-nv recipe): copy of
dusty-nv/jetson-containers/packages/llm/sglang verbatim into
sglang/orin/ (Dockerfile.upstream, build.sh, install.sh, test.py)
with attribution in sglang/LICENSE-UPSTREAM.md. Standalone build path
lives in sglang/orin/Dockerfile (diverged: drops chained
transformers install, adds tokenizer bake step).

INFR-77 (Pin SGLang >=0.5.x for gpt-oss): sglang/orin/config.py pinned
to 0.5.3 (first 0.5.x line with srt/models/gpt_oss.py, predates
upstream's CUDA-13 transition at 0.5.11). Fallback ladder documented
in the config.py docstring; on-target smoke build on Hephaestus is the
verification gate.

INFR-75 (GitHub-hosted ubuntu-24.04-arm CI):
.github/workflows/build-sglang.yml. Native aarch64 build on GitHub's
arm64 hosts, push to ghcr.io/infernode-os/serving-sglang with variant-tagged
images. Pins all third-party actions by commit SHA. Note: the
self-hosted-Hephaestus plan in the original ticket description has
been superseded; the Jira description has been updated via API.

INFR-78 (Llama-3 tokenizer + chat-template fix):
sglang/orin/bake-tokenizers.sh pulls non-gated mirrors of the
Llama-3.1 and Llama-3 tokenizer dirs into /opt/tokenizers/ at image
build time (~60 MB total). Documented launch flag
--tokenizer-path /opt/tokenizers/llama-3.1 in the runbook.

INFR-79 (lucibridge per-tool routing): code change lives in
infernode-os/infernode (out of scope here). What's in this repo:
runbooks/lucibridge-routing.md — the routing config schema, the
per-category default table, env-var bridging, observability spec, and
test plan. The infernode-side PR will consume this as the contract.

INFR-80 (Hephaestus deploy runbook): runbooks/hephaestus-deploy.md.
Pull + pre-flight + launch + healthcheck + systemd unit + memory
budget + troubleshooting + serve-llm.sh integration + clean shutdown.
Respects the Hephaestus disk policy (Docker on root, working data on
/mnt/orin-ssd via bind mounts).

INFR-81 (Thor sm_103 matrix build): sglang/thor/ (Dockerfile, README,
test.py) wraps NGC nvcr.io/nvidia/sglang:25.10-py3. The build workflow
matrix-builds Thor alongside Orin; Thor variant is skipped on PRs
(needs NGC_API_KEY secret which forks don't have).

Not in this commit (genuinely out of scope or blocked):

* IOL-26 (virgil-agent eval against SGLang) — lives in IOL repo; runs
  after a working SGLang endpoint exists on Hephaestus.
* The on-target smoke build of the pinned 0.5.3 image on Hephaestus
  (acceptance gate for INFR-77, requires Jetson hardware).
* The actual lucibridge code change in infernode-os/infernode
  (consumes the runbook schema; tracked under INFR-79).
---
 .github/workflows/.gitkeep         |   1 -
 .github/workflows/build-sglang.yml | 153 +++++++++++
 runbooks/.gitkeep                  |   1 -
 runbooks/hephaestus-deploy.md      | 392 +++++++++++++++++++++++++++++
 runbooks/lucibridge-routing.md     | 199 +++++++++++++++
 sglang/.gitkeep                    |   1 -
 sglang/LICENSE-UPSTREAM.md         |  46 ++++
 sglang/README.md                   |  84 +++++++
 sglang/orin/Dockerfile             |  61 +++++
 sglang/orin/Dockerfile.upstream    |  41 +++
 sglang/orin/README.md              | 110 ++++++++
 sglang/orin/bake-tokenizers.sh     |  73 ++++++
 sglang/orin/build.sh               |  80 ++++++
 sglang/orin/config.py              |  64 +++++
 sglang/orin/install.sh             |  44 ++++
 sglang/orin/test.py                |  27 ++
 sglang/thor/Dockerfile             |  23 ++
 sglang/thor/README.md              |  60 +++++
 sglang/thor/test.py                |  27 ++
 19 files changed, 1484 insertions(+), 3 deletions(-)
 delete mode 100644 .github/workflows/.gitkeep
 create mode 100644 .github/workflows/build-sglang.yml
 delete mode 100644 runbooks/.gitkeep
 create mode 100644 runbooks/hephaestus-deploy.md
 create mode 100644 runbooks/lucibridge-routing.md
 delete mode 100644 sglang/.gitkeep
 create mode 100644 sglang/LICENSE-UPSTREAM.md
 create mode 100644 sglang/README.md
 create mode 100644 sglang/orin/Dockerfile
 create mode 100644 sglang/orin/Dockerfile.upstream
 create mode 100644 sglang/orin/README.md
 create mode 100755 sglang/orin/bake-tokenizers.sh
 create mode 100755 sglang/orin/build.sh
 create mode 100644 sglang/orin/config.py
 create mode 100755 sglang/orin/install.sh
 create mode 100755 sglang/orin/test.py
 create mode 100644 sglang/thor/Dockerfile
 create mode 100644 sglang/thor/README.md
 create mode 100755 sglang/thor/test.py
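
Reviewer note (ignored by git am; not part of the commit): a minimal
Python sketch of the per-category resolution that
runbooks/lucibridge-routing.md specifies, assuming the routing.json
shape shown there. The function name `resolve_route` and the returned
dict are hypothetical; the real bridge lives in infernode-os/infernode
and may differ. Treating an unset category the same as an unknown one
is also an assumption.

```python
"""Hypothetical sketch of lucibridge route resolution (illustration only)."""
import json

# Trimmed copy of the routing.json example from the runbook.
ROUTING = json.loads("""
{
  "backends": {
    "ollama": {"base_url": "http://127.0.0.1:11434/v1"},
    "sglang": {"base_url": "http://127.0.0.1:30000/v1"}
  },
  "default_backend": "ollama",
  "default_model": "devstral-limbo-v3",
  "routes": [
    {"category": "limbo_authoring", "backend": "ollama", "model": "devstral-limbo-v3"},
    {"category": "dispatch", "backend": "sglang", "model": "openai/gpt-oss-20b"},
    {"category": "tool_call", "backend": "sglang", "model": "openai/gpt-oss-20b",
     "extra": {"grammar_backend": "xgrammar"}}
  ],
  "fallback": {"on_unknown_category": "default_backend"}
}
""")


def resolve_route(config, category):
    """Return a decision dict shaped like the runbook's lucibridge.route log line."""
    for index, route in enumerate(config["routes"]):
        if route["category"] == category:  # first matching route wins
            backend = route["backend"]
            return {
                "category": category,
                "matched_route_index": index,
                "backend": backend,
                "backend_url": config["backends"][backend]["base_url"],
                "model": route["model"],
                "extra": route.get("extra", {}),
                "fallback_used": False,
                "fallback_reason": None,
            }
    # No route matched: honour fallback.on_unknown_category.
    if config["fallback"]["on_unknown_category"] == "fail":
        raise LookupError(f"no route for category {category!r}")
    backend = config["default_backend"]
    return {
        "category": category,
        "matched_route_index": None,
        "backend": backend,
        "backend_url": config["backends"][backend]["base_url"],
        "model": config["default_model"],
        "extra": {},
        "fallback_used": True,
        "fallback_reason": "unknown_category",
    }
```

Calling `resolve_route(ROUTING, "tool_call")` selects the SGLang backend
and carries the `extra` map through; an unlisted category falls back to
the default backend with `fallback_used` set, which is the signal the
runbook's alerting rule keys on.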

diff --git a/.github/workflows/.gitkeep b/.github/workflows/.gitkeep
deleted file mode 100644
index 4214d49..0000000
--- a/.github/workflows/.gitkeep
+++ /dev/null
@@ -1 +0,0 @@
-# Placeholder — populate when sglang fork / runbooks / CI lands.
diff --git a/.github/workflows/build-sglang.yml b/.github/workflows/build-sglang.yml
new file mode 100644
index 0000000..2742397
--- /dev/null
+++ b/.github/workflows/build-sglang.yml
@@ -0,0 +1,153 @@
+name: build-sglang
+
+# CI for the SGLang container variants. Builds on GitHub-hosted
+# ubuntu-24.04-arm runners (arm64 hosts, native aarch64, no QEMU)
+# and pushes to GHCR. See INFR-75 and INFR-81 for design.
+
+on:
+  push:
+    branches: [main]
+    paths:
+      - 'sglang/**'
+      - '.github/workflows/build-sglang.yml'
+    tags: ['v*']
+  pull_request:
+    paths:
+      - 'sglang/**'
+      - '.github/workflows/build-sglang.yml'
+  workflow_dispatch:
+    inputs:
+      orin_base_image:
+        description: 'BASE_IMAGE arg for the Orin build'
+        required: false
+        default: 'dustynv/pytorch:2.6-r36.4.0-cu126-22.04'
+      sglang_version:
+        description: 'SGLANG_VERSION override (Orin only)'
+        required: false
+        default: ''
+
+permissions:
+  contents: read
+  packages: write
+
+env:
+  REGISTRY: ghcr.io
+  IMAGE_NAME: ${{ github.repository_owner }}/serving-sglang
+  # Default Orin base — overridable via workflow_dispatch. The dustynv
+  # PyTorch base ships a torch built with USE_DISTRIBUTED=1 and sm_87
+  # device code; that's the combination the spike (INFR-68) found was
+  # the only one that lets SGLang's import chain succeed on Orin/JP6.
+  ORIN_BASE_IMAGE_DEFAULT: 'dustynv/pytorch:2.6-r36.4.0-cu126-22.04'
+
+jobs:
+  build:
+    name: build-${{ matrix.variant }}
+    runs-on: ubuntu-24.04-arm
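+    # Thor needs NGC auth and only runs on push to main / tags / manual
+    # dispatch. PRs from forks have no access to NGC_API_KEY and would
+    # fail at the login step. The `matrix` context is not available in
+    # a job-level `if`, so the Thor entry is dropped from the matrix
+    # itself on pull_request events.
+    strategy:
+      fail-fast: false
+      matrix:
+        include: >-
+          ${{ fromJSON(github.event_name == 'pull_request'
+          && '[{"variant":"orin","cuda_arch":"8.7"}]'
+          || '[{"variant":"orin","cuda_arch":"8.7"},{"variant":"thor","cuda_arch":"10.3a"}]') }}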
+
+    steps:
+      - name: Checkout
+        uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
+
+      - name: Show build host
+        run: |
+          set -x
+          uname -a
+          cat /etc/os-release
+          docker version
+          docker buildx version
+
+      - name: Set up Docker Buildx
+        uses: docker/setup-buildx-action@b5ca514318bd6ebac0fb2aedd5d36ec1b5c232a2 # v3.10.0
+
+      - name: Log in to GHCR
+        uses: docker/login-action@74a5d142397b4f367a81961eba4e8cd7edddf772 # v3.4.0
+        with:
+          registry: ${{ env.REGISTRY }}
+          username: ${{ github.actor }}
+          password: ${{ secrets.GITHUB_TOKEN }}
+
+      - name: Log in to NGC (Thor only)
+        if: matrix.variant == 'thor'
+        uses: docker/login-action@74a5d142397b4f367a81961eba4e8cd7edddf772 # v3.4.0
+        with:
+          registry: nvcr.io
+          username: $oauthtoken
+          password: ${{ secrets.NGC_API_KEY }}
+
+      - name: Compute tags + metadata
+        id: meta
+        uses: docker/metadata-action@902fa8ec7d6ecbf8d84d538b9b233a880e428804 # v5.7.0
+        with:
+          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
+          tags: |
+            type=raw,value=${{ matrix.variant }}-latest,enable={{is_default_branch}}
+            type=sha,prefix=${{ matrix.variant }}-,format=short
+            type=ref,event=tag,prefix=${{ matrix.variant }}-
+          labels: |
+            org.opencontainers.image.title=serving-sglang-${{ matrix.variant }}
+            org.opencontainers.image.description=SGLang for Jetson ${{ matrix.variant == 'orin' && 'Orin AGX (sm_87)' || 'Thor (sm_103)' }}
+            org.opencontainers.image.source=${{ github.server_url }}/${{ github.repository }}
+            org.opencontainers.image.licenses=MIT
+
+      - name: Decide whether to push
+        id: push_decide
+        run: |
+          if [ "${{ github.event_name }}" = "pull_request" ]; then
+            echo "push=false" >> "$GITHUB_OUTPUT"
+          else
+            echo "push=true" >> "$GITHUB_OUTPUT"
+          fi
+
+      - name: Build Orin variant
+        if: matrix.variant == 'orin'
+        uses: docker/build-push-action@471d1dc4e07e5cdedd4c2171150001c434f0b7a4 # v6.15.0
+        with:
+          context: sglang/orin
+          file: sglang/orin/Dockerfile
+          platforms: linux/arm64
+          push: ${{ steps.push_decide.outputs.push }}
+          tags: ${{ steps.meta.outputs.tags }}
+          labels: ${{ steps.meta.outputs.labels }}
+          build-args: |
+            BASE_IMAGE=${{ inputs.orin_base_image || env.ORIN_BASE_IMAGE_DEFAULT }}
+            SGLANG_VERSION=${{ inputs.sglang_version || '0.5.3' }}
+            SGLANG_VERSION_SPEC=${{ inputs.sglang_version || '0.5.3' }}
+            IS_SBSA=0
+            FORCE_BUILD=off
+          cache-from: type=gha,scope=sglang-orin
+          cache-to: type=gha,scope=sglang-orin,mode=max
+
+      - name: Build Thor variant
+        if: matrix.variant == 'thor'
+        uses: docker/build-push-action@471d1dc4e07e5cdedd4c2171150001c434f0b7a4 # v6.15.0
+        with:
+          context: sglang/thor
+          file: sglang/thor/Dockerfile
+          platforms: linux/arm64
+          push: ${{ steps.push_decide.outputs.push }}
+          tags: ${{ steps.meta.outputs.tags }}
+          labels: ${{ steps.meta.outputs.labels }}
+          build-args: |
+            NGC_SGLANG_TAG=25.10-py3
+          cache-from: type=gha,scope=sglang-thor
+          cache-to: type=gha,scope=sglang-thor,mode=max
+
+      - name: Summarise
+        if: always()
+        run: |
+          echo "Variant: ${{ matrix.variant }}"
+          echo "Pushed: ${{ steps.push_decide.outputs.push }}"
+          echo "Tags:"
+          echo "${{ steps.meta.outputs.tags }}"
diff --git a/runbooks/.gitkeep b/runbooks/.gitkeep
deleted file mode 100644
index 4214d49..0000000
--- a/runbooks/.gitkeep
+++ /dev/null
@@ -1 +0,0 @@
-# Placeholder — populate when sglang fork / runbooks / CI lands.
diff --git a/runbooks/hephaestus-deploy.md b/runbooks/hephaestus-deploy.md
new file mode 100644
index 0000000..568a8be
--- /dev/null
+++ b/runbooks/hephaestus-deploy.md
@@ -0,0 +1,392 @@
+# Hephaestus deploy runbook — SGLang on Jetson Orin AGX
+
+End-to-end runbook for deploying the GHCR-built SGLang container on
+**Hephaestus** (Jetson Orin AGX, JetPack 6.x, 64 GiB unified memory)
+and wiring it into the existing `serve-llm.sh` / `lucibridge`
+operational model.
+
+**Owning epic:** INFR-73. **Acceptance gate:** a new contributor can
+bring SGLang up on Hephaestus in under 15 minutes following this
+document.
+
+---
+
+## 0. Prereqs and disk policy
+
+**Hephaestus disk policy** (load-bearing — do NOT violate):
+
+| Mount | Purpose | Allowed contents |
+|---|---|---|
+| `/` (root partition, deliberately constrained) | Emulates a production single-disk InferNode node | OS · TAK · NERVA via Docker · Ollama binary |
+| `/mnt/orin-ssd` (916 GiB) | Dev indulgence | SGLang container · HF cache · build artefacts · spike logs |
+
+The Docker daemon's storage root must NOT be migrated from `/` to
+`/mnt/orin-ssd` — that would put TAK/NERVA images on the dev disk and
+break the production emulation. Instead, the **SGLang container is
+pulled and run with bind-mounts** that put its working data on
+orin-ssd. See §3.
+
+Prereqs to verify before starting:
+
+```sh
+# JetPack and CUDA
+cat /etc/nv_tegra_release | head -1            # expect R36, REVISION: 4.x
+nvidia-smi                                      # expect Orin / CUDA 12.6+
+docker info | grep -i 'storage driver'          # expect overlay2 on root
+
+# Disk space
+df -h / /mnt/orin-ssd                           # root <90% used; orin-ssd <90% used
+
+# Ollama still up (we coexist, not replace)
+curl -fsS http://127.0.0.1:11434/api/version
+```
+
+---
+
+## 1. Pull the container
+
+GHCR images are public; no docker login required.
+
+```sh
+# Pin a specific build by short SHA from CI; never use :orin-latest in
+# production runs — the SHA is what gets recorded in incident timelines.
+IMAGE='ghcr.io/infernode-os/serving-sglang:orin-<short-sha>'
+
+docker pull "$IMAGE"
+docker images "$IMAGE" --format '{{.Repository}}:{{.Tag}} {{.Size}}'
+# Expect: ~12–14 GB (CUDA + cuDNN + PyTorch wheels + SGLang + sgl-kernel).
+# Pull goes onto root partition Docker storage. Verify residual headroom:
+df -h /
+```
+
+If the root partition is tight, remove old SGLang images **only**
+(`docker rmi $(docker images "ghcr.io/infernode-os/serving-sglang" --quiet | tail -n +3)`,
+which keeps the two newest); do not prune TAK/NERVA images.
+
+---
+
+## 2. Pre-flight checks
+
+```sh
+# CUDA visibility inside the container
+docker run --rm --runtime nvidia --gpus all "$IMAGE" \
+  python3 -c "import torch; print('cuda', torch.cuda.is_available(), torch.cuda.get_device_name(0))"
+# Expected: "cuda True Orin" (or "NVIDIA Jetson AGX Orin")
+
+# SGLang import + version
+docker run --rm --runtime nvidia --gpus all "$IMAGE" \
+  python3 /opt/sglang/test.py
+# Expected last line: "SGLang OK"
+
+# gpt-oss model class present (INFR-77 acceptance)
+docker run --rm --runtime nvidia --gpus all "$IMAGE" \
+  python3 -c "import sglang.srt.models.gpt_oss; print('gpt-oss arch present')"
+```
+
+If any of those fail, **stop**. Don't proceed to §3. The image is
+broken; either pull a different tag or rebuild from `sglang/orin/`
+(see `sglang/orin/README.md`).
+
+---
+
+## 3. Launch (one-shot, for testing)
+
+This is the launch invocation that produced the bake-off in
+`docs/SGLANG-ADOPTION-NOTES.md`, adapted for the GHCR image. Bind-mounts
+keep the HF cache on `/mnt/orin-ssd` (disk policy §0).
+
+```sh
+MODEL=TinyLlama/TinyLlama-1.1B-Chat-v1.0       # smoke target
+HF_CACHE=/mnt/orin-ssd/huggingface
+LOGS=/mnt/orin-ssd/pdfinn/scratch/sglang-logs
+mkdir -p "$HF_CACHE" "$LOGS"
+
+docker run --rm -it \
+  --runtime nvidia --gpus all \
+  --network host \
+  --shm-size 8g \
+  -v "$HF_CACHE":/root/.cache/huggingface \
+  -e TORCHDYNAMO_DISABLE=1 \
+  -e TORCH_COMPILE_DISABLE=1 \
+  -e HF_HOME=/root/.cache/huggingface \
+  "$IMAGE" \
+  python3 -m sglang.launch_server \
+    --model-path "$MODEL" \
+    --tokenizer-path "$MODEL" \
+    --host 127.0.0.1 --port 30000 \
+    --attention-backend triton \
+    --mem-fraction-static 0.5 \
+    --disable-cuda-graph \
+    --log-level info \
+  2>&1 | tee "$LOGS/sglang-$(date +%s).log"
+```
+
+Flag notes (all carried forward from spike findings):
+
+* `--runtime nvidia --gpus all` — required for Tegra GPU passthrough.
+* `--network host` — simplest port binding; firewall lives on the host.
+  Skip `--publish` to avoid double NAT through Docker.
+* `--shm-size 8g` — SGLang's worker pool uses shared memory; default
+  64 MB causes silent stalls under concurrency.
+* `--tokenizer-path /opt/tokenizers/llama-3.1`: fixes Llama-3 special
+  tokens (INFR-78); use it for Llama-3 / Llama-3.1 models only. For any
+  other model, including Llama-2-vocabulary models such as TinyLlama,
+  set `--tokenizer-path` to a matching baked directory or HF repo id.
+* `--attention-backend triton` — most conservative Jetson path
+  (flashinfer also works but Triton was the proven default in the spike).
+* `--mem-fraction-static 0.5` — leaves 32 GiB on the unified-memory
+  budget for OS + Ollama + headroom. Tune per workload (see §6).
+* `--disable-cuda-graph` — CUDA graph capture is finicky on Jetson;
+  disable for stability. Re-enable later if perf needs it.
+* `TORCHDYNAMO_DISABLE=1 / TORCH_COMPILE_DISABLE=1` — bypass
+  `torch.compile`; the in-image Triton + torch combination doesn't
+  reliably JIT-compile and we don't need compile for serving.
+
+---
+
+## 4. Healthcheck sequence
+
+Run this against a freshly-launched server, in order:
+
+```sh
+BASE=http://127.0.0.1:30000
+
+# 1. liveness — must return 200 immediately
+curl -fsS "$BASE/health"
+
+# 2. model list — must include the --model-path you launched with
+curl -fsS "$BASE/v1/models" | python3 -m json.tool
+
+# 3. smoke chat completion — should complete in <2s for TinyLlama
+curl -fsS "$BASE/v1/chat/completions" -H 'content-type: application/json' -d '{
+  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+  "messages": [{"role":"user","content":"Capital of Belgium? One word."}],
+  "max_tokens": 8,
+  "temperature": 0
+}'
+# Expect: "Brussels" or close. If you see garbage tokens, the tokenizer
+# path is wrong — see §7 troubleshooting.
+```
+
+---
+
+## 5. systemd unit (production)
+
+For a deploy that survives reboot, drop the unit below in
+`/etc/systemd/system/serving-sglang.service`. Mirrors the pattern in
+IOL's `docs/HEADLESS-LLM-DAEMON.md` for `serve-llm.service`.
+
+```ini
+[Unit]
+Description=InferNode SGLang serving (Jetson Orin)
+After=docker.service ollama.service network-online.target
+Wants=docker.service network-online.target
+
+[Service]
+Type=exec
+Restart=on-failure
+RestartSec=10
+TimeoutStopSec=60
+
+# Resolved env file holds IMAGE pin, model path, port, etc.
+EnvironmentFile=/etc/serving-sglang.env
+
+ExecStartPre=-/usr/bin/docker stop serving-sglang
+ExecStartPre=-/usr/bin/docker rm   serving-sglang
+
+ExecStart=/usr/bin/docker run --name serving-sglang --rm \
+  --runtime nvidia --gpus all \
+  --network host --shm-size 8g \
+  -v ${HF_CACHE}:/root/.cache/huggingface \
+  -e TORCHDYNAMO_DISABLE=1 -e TORCH_COMPILE_DISABLE=1 \
+  -e HF_HOME=/root/.cache/huggingface \
+  ${IMAGE} \
+  python3 -m sglang.launch_server \
+    --model-path ${MODEL_PATH} \
+    --tokenizer-path ${TOKENIZER_PATH} \
+    --host 127.0.0.1 --port ${PORT} \
+    --attention-backend triton \
+    --mem-fraction-static ${MEM_FRACTION_STATIC} \
+    --disable-cuda-graph \
+    --log-level info
+
+ExecStop=/usr/bin/docker stop -t 30 serving-sglang
+
+[Install]
+WantedBy=multi-user.target
+```
+
+Companion `/etc/serving-sglang.env` (chmod 0644, no secrets):
+
+```sh
+IMAGE=ghcr.io/infernode-os/serving-sglang:orin-<short-sha>
+MODEL_PATH=openai/gpt-oss-20b
+TOKENIZER_PATH=openai/gpt-oss-20b
+PORT=30000
+MEM_FRACTION_STATIC=0.5
+HF_CACHE=/mnt/orin-ssd/huggingface
+```
+
+Activate:
+
+```sh
+sudo systemctl daemon-reload
+sudo systemctl enable --now serving-sglang.service
+sudo systemctl status serving-sglang.service --no-pager
+journalctl -u serving-sglang.service -f
+```
+
+`After=ollama.service` lets Ollama come up first (port 11434), then
+SGLang on 30000. Both coexist on 64 GiB unified memory; see §6.
+
+---
+
+## 6. Memory budgeting on 64 GiB unified
+
+Jetson Orin's 64 GiB is shared between CPU and GPU. Concrete budget
+for the v1 cut, drawn from the spike's measured working set:
+
+| Component | Resident (typical) | Notes |
+|---|---|---|
+| OS + drivers | ~3 GiB | baseline |
+| Ollama + one resident model (Devstral GGUF Q4) | ~14 GiB | `OLLAMA_KEEP_ALIVE` default |
+| SGLang model weights (gpt-oss-20b, MXFP4 / AWQ) | ~13 GiB | `--quantization` dependent |
+| SGLang KV cache | ~9 GiB at 4096 ctx, 16 running | `--mem-fraction-static 0.5` cap |
+| Headroom | ~25 GiB | TAK / NERVA / NN bursts |
+
+Tunables:
+
+* **`--mem-fraction-static`** = fraction of GPU memory SGLang
+  reserves for model weights + the KV cache pool. `0.5` is the spike
+  default (32 GiB cap). Raise to `0.6` if Ollama is dropped from the
+  box; drop to `0.4` if TAK/NERVA models grow.
+* **`--max-running-requests`** = concurrency cap. `16` was the spike
+  default; tail latency blew out at N=16 due to starvation. For
+  Veltro's expected fan-out (≤8 concurrent), set `--max-running-requests 8`.
+* **`--max-total-tokens`** = global token budget across batch. Keep
+  proportional to `mem-fraction-static * gpu_mem`.
+
+After any tune, re-run §4 healthchecks and validate p95 latency with
+the bake-off workload shape from `docs/SGLANG-ADOPTION-NOTES.md`.
+
+---
+
+## 7. Troubleshooting
+
+| Symptom | Likely cause | Fix |
+|---|---|---|
+| `CUDA unknown error` at container start | Tegra driver shim not on `LD_LIBRARY_PATH` inside container | Verify `--runtime nvidia` is set; the runtime injects `/usr/lib/aarch64-linux-gnu/tegra` automatically |
+| Garbage output, off-topic answers (Belgium → "It seems like you are referring to…") | GGUF tokenizer used for Llama-3 family (INFR-78 known issue) | Set `--tokenizer-path /opt/tokenizers/llama-3.1` |
+| `stop` strings never match, completions run to `max_tokens` | Same as above — special tokens not registered | Same as above |
+| OOM during model load | Other process holds GPU memory | `nvidia-smi` to find culprit; drop Ollama's resident model with `curl -X POST http://127.0.0.1:11434/api/generate -d '{"model":"...","keep_alive":0}'` |
+| Server starts but `/health` 503s | Worker pool stalled on small shm | Confirm `--shm-size 8g` is on the `docker run` line |
+| `import sglang` fails inside container | Stale image / partial install | Re-pull pinned tag; re-run §2 pre-flight |
+| Per-request latency >2× expected | `torch.compile` accidentally enabled | Confirm `TORCHDYNAMO_DISABLE=1 TORCH_COMPILE_DISABLE=1` |
+| Build/push CI green but pull on Hephaestus 404s | GHCR visibility set wrong on first publish | One-time: in repo Settings → Packages, set the package public |
+
+---
+
+## 8. `serve-llm.sh` integration
+
+`serve-llm.sh` (in `infernode-os/infernode`) is the dev-side launcher
+that the ZeroTier-mounted user interacts with. Today it talks to
+Ollama at `http://127.0.0.1:11434/v1`. With SGLang available the
+launcher gains a sibling backend.
+
+### Single-backend mode (Ollama-only — current default)
+
+No change. SGLang need not be installed.
+
+### Dual-backend mode (Ollama + SGLang — new)
+
+Set these in the env of `serve-llm.service` (or before invoking
+`serve-llm.sh` interactively):
+
+```sh
+export LLM_BACKEND_DEFAULT=http://127.0.0.1:11434/v1
+export LLM_BACKEND_SGLANG=http://127.0.0.1:30000/v1
+```
+
+`lucibridge` (see §9) picks per-tool which URL to dispatch to. If
+`LLM_BACKEND_SGLANG` is unset, lucibridge falls back to
+`LLM_BACKEND_DEFAULT` for every request — backward compatible.
+
+### Switching modes
+
+```sh
+# Ollama-only
+sudo systemctl stop serving-sglang
+sudo systemctl mask serving-sglang   # prevent restart on reboot
+
+# Re-enable SGLang
+sudo systemctl unmask serving-sglang
+sudo systemctl start serving-sglang
+```
+
+---
+
+## 9. lucibridge per-tool routing (cross-ref INFR-79)
+
+See `runbooks/lucibridge-routing.md` for the routing config schema and
+the per-tool / per-capability mapping. Headline:
+
+* `tool_category in {limbo_authoring}` → `LLM_BACKEND_DEFAULT`
+  (Devstral via Ollama, current production)
+* `tool_category in {dispatch, tool_call, memory, task}` →
+  `LLM_BACKEND_SGLANG` (gpt-oss via SGLang, post-INFR-77)
+* unset / unknown category → `LLM_BACKEND_DEFAULT` (fallback)
+
+The routing change lives in `infernode-os/infernode`'s `lucibridge`
+module; this runbook documents what the deployed config looks like
+and how to flip between modes.
+
+---
+
+## 10. Stopping cleanly
+
+```sh
+# Graceful — gives SGLang ~30s to drain in-flight requests
+sudo systemctl stop serving-sglang
+# or, ad-hoc:
+docker stop -t 30 serving-sglang
+
+# If the daemon is stuck (>60s), escalate
+docker kill --signal=KILL serving-sglang
+```
+
+Expected drain time is sub-second when idle, up to ~30s under N≥16
+concurrent. If `docker stop` takes longer than 60s, it usually means
+a downstream client is holding a streaming request open; the bridge
+should be killed first (`systemctl stop serve-llm`).
+
+---
+
+## 11. Verifying end-to-end with `serve-llm` + lucibridge
+
+```sh
+# Bring everything up
+sudo systemctl start ollama serving-sglang serve-llm
+sleep 5
+
+# Healthcheck the bridge endpoint
+curl -fsS http://127.0.0.1:8080/health    # serve-llm
+
+# Run a Veltro-shaped probe: a tool-call turn that routes to SGLang
+# (gpt-oss) and a Limbo-authoring turn that routes to Ollama (Devstral).
+# See runbooks/lucibridge-routing.md for the probe payload.
+```
+
+A passing run looks like: SGLang's journal shows one `POST
+/v1/chat/completions` per dispatched tool call; Ollama's logs show one
+generate for the Limbo authoring turn; bridge logs show the routing
+decision for each.
+
+---
+
+## References
+
+* `docs/SGLANG-ADOPTION-NOTES.md` — spike findings, bake-off numbers, the original launch flags
+* `sglang/orin/README.md` — container build / version-pin details
+* `runbooks/lucibridge-routing.md` — routing config schema (INFR-79)
+* `infernode-os/infernode:docs/HEADLESS-LLM-DAEMON.md` — `serve-llm.service` operational template
+* Tickets: INFR-73 (epic), INFR-77 (gpt-oss unblock), INFR-78 (tokenizer), INFR-79 (routing), INFR-80 (this runbook)
diff --git a/runbooks/lucibridge-routing.md b/runbooks/lucibridge-routing.md
new file mode 100644
index 0000000..6d58573
--- /dev/null
+++ b/runbooks/lucibridge-routing.md
@@ -0,0 +1,199 @@
+# lucibridge — per-tool routing for multi-backend serving
+
+**Owning ticket:** INFR-79. **Cross-repo dependency:** the bridge
+implementation lives in
+[`infernode-os/infernode`](https://github.com/infernode-os/infernode);
+this file is the **operational schema and config** that lives in the
+serving repo so deploy decisions stay with the deploy artefacts. The
+agentlib changes in infernode that consume this schema are tracked on
+INFR-79 itself.
+
+---
+
+## Why this exists
+
+Per V4-PLAN's production strategy: **gpt-oss is for dispatch / tool
+calls**, **devstral-limbo is for Limbo authoring**. Now that INFR-77
+unblocks gpt-oss on SGLang and the spike (INFR-68) has measured ~3×
+concurrent throughput vs Ollama, the box has two viable backends:
+
+| Backend | URL | Model strengths |
+|---|---|---|
+| Ollama | `http://127.0.0.1:11434/v1` | single-user latency, Devstral GGUF, Limbo authoring |
+| SGLang | `http://127.0.0.1:30000/v1` | concurrent fan-out, gpt-oss-20b, xgrammar tool-call grounding |
+
+A single fixed `LLM_BACKEND_URL` env var no longer captures the
+intent. `lucibridge` needs to **route per request**, and the routing
+key is the tool-category (which the agentlib already attaches to each
+dispatch).
+
+---
+
+## Routing table
+
+The canonical route table for v1 (Hephaestus, Veltro-on-SGLang). All
+URLs are relative to the configured backend prefix; `model` is the
+model selector accepted by both backends' `/v1/chat/completions`.
+
+| Tool category | Backend | Model | Notes |
+|---|---|---|---|
+| `limbo_authoring` | Ollama | `devstral-limbo-v3` (or v4 once daedalus lands) | Single-user fluency; no concurrent fan-out |
+| `dispatch` | SGLang | `openai/gpt-oss-20b` | Fast tool-call dispatch, xgrammar-friendly |
+| `tool_call` | SGLang | `openai/gpt-oss-20b` | Per-tool args, may be grammar-constrained |
+| `memory` | SGLang | `openai/gpt-oss-20b` | Fan-out heavy, benefits from RadixAttention |
+| `task` | SGLang | `openai/gpt-oss-20b` | Same as above |
+| `chat` / unset / unknown | Ollama | `LLM_DEFAULT_MODEL` | Fallback — backward compatible with v0 |
+
+Every row is overridable by config (see below); the table is the
+**default** when no override is supplied. Unknown categories must
+fall back to Ollama — the dev path stays unchanged.
+
+---
+
+## Config schema
+
+Stored at `/etc/lucibridge/routing.json` on Hephaestus, owned by
+`root:root`, mode `0644` (no secrets):
+
+```json
+{
+  "$schema": "infernode.lucibridge.routing/v1",
+  "backends": {
+    "ollama":  { "base_url": "http://127.0.0.1:11434/v1" },
+    "sglang":  { "base_url": "http://127.0.0.1:30000/v1" }
+  },
+  "default_backend": "ollama",
+  "default_model":   "devstral-limbo-v3",
+  "routes": [
+    { "category": "limbo_authoring", "backend": "ollama",  "model": "devstral-limbo-v3" },
+    { "category": "dispatch",        "backend": "sglang",  "model": "openai/gpt-oss-20b" },
+    { "category": "tool_call",       "backend": "sglang",  "model": "openai/gpt-oss-20b",
+      "extra": { "grammar_backend": "xgrammar" } },
+    { "category": "memory",          "backend": "sglang",  "model": "openai/gpt-oss-20b" },
+    { "category": "task",            "backend": "sglang",  "model": "openai/gpt-oss-20b" }
+  ],
+  "fallback": {
+    "on_backend_unreachable": "default_backend",
+    "on_unknown_category":    "default_backend",
+    "log_decisions":          true
+  }
+}
+```
+
+### Field reference
+
+| Field | Type | Meaning |
+|---|---|---|
+| `backends` | map\<name, {base_url}\> | named pool of OpenAI-compatible endpoints |
+| `default_backend` | string | backend used when no route matches; also the fallback target on unreachable backend |
+| `default_model` | string | model passed to `default_backend` when none specified |
+| `routes` | list\<{category, backend, model, extra?}\> | one route per category; if a category is listed twice the first match wins, and ordering otherwise matters only for diagnostics (`matched_route_index`) |
+| `routes[].extra` | map | passed through as extra fields on the upstream `/v1/chat/completions` body (e.g. `grammar_backend`, `lora_name` for INFR-77 multi-LoRA) |
+| `fallback.on_backend_unreachable` | enum | `default_backend` \| `fail` |
+| `fallback.on_unknown_category` | enum | `default_backend` \| `fail` |
+| `fallback.log_decisions` | bool | emit a structured log line per routing decision (recommended on) |
+
+### Backwards compatibility
+
+Bridges from before INFR-79 spoke a single `LLM_BACKEND_URL` env var.
+The post-INFR-79 bridge MUST still honour that env var as a
+configuration fallback: if `/etc/lucibridge/routing.json` is missing,
+the bridge constructs an implicit config with one backend (the
+`LLM_BACKEND_URL`) and the default model from `LLM_DEFAULT_MODEL`, and
+routes every request to it. v0 deployments don't need to change.
+
+---
+
+## env-var bridging from `serve-llm.sh`
+
+The serve-llm launcher exports the env vars that the bridge resolves
+into the implicit / explicit config. For Hephaestus dual-backend mode:
+
+```sh
+# /etc/serve-llm.env  (sourced by serve-llm.service)
+LLM_BACKEND_DEFAULT=http://127.0.0.1:11434/v1
+LLM_BACKEND_SGLANG=http://127.0.0.1:30000/v1
+LLM_DEFAULT_MODEL=devstral-limbo-v3
+LUCIBRIDGE_ROUTING_CONFIG=/etc/lucibridge/routing.json
+```
+
+The bridge loads the file named by `LUCIBRIDGE_ROUTING_CONFIG` when it
+is set and the file exists; otherwise it falls back to the implicit
+single-backend mode (using `LLM_BACKEND_DEFAULT`). This matches the
+§8 mode-switching pattern in `runbooks/hephaestus-deploy.md`.
+
+---
+
+## Observability — what to log
+
+Every routing decision emits one structured log line at INFO. Format
+(JSON; one line per request):
+
+```json
+{
+  "ts": "2026-05-14T03:14:15Z",
+  "event": "lucibridge.route",
+  "request_id": "req_…",
+  "category": "tool_call",
+  "matched_route_index": 2,
+  "backend": "sglang",
+  "backend_url": "http://127.0.0.1:30000/v1",
+  "model": "openai/gpt-oss-20b",
+  "fallback_used": false,
+  "fallback_reason": null
+}
+```
+
+`fallback_used: true` with `fallback_reason: "unknown_category"` or
+`"backend_unreachable"` is the signal for routing problems. Alert on
+sustained `fallback_used: true` (>5% of decisions over a 5-minute
+window).
+
+---
+
+## Test plan (lives in `infernode-os/infernode:agentlib_test/`)
+
+Once the bridge code change is implemented, the tests that must exist:
+
+1. **Single-backend v0 compat:** routing.json absent, `LLM_BACKEND_URL`
+   set → every request goes to that URL. No regression vs pre-INFR-79.
+2. **Per-category routing:** routing.json present, two backends mocked
+   → a `limbo_authoring` request hits Ollama mock; a `tool_call`
+   request hits SGLang mock; verify the URL + model dispatched.
+3. **Fallback on unreachable backend:** SGLang mock returns 503 →
+   request falls back to default backend, log line shows `fallback_used: true, fallback_reason: "backend_unreachable"`.
+4. **Fallback on unknown category:** `category: "frobnicate"` →
+   default backend used, fallback log emitted.
+5. **`extra` passthrough:** route with `extra: {grammar_backend: "xgrammar"}`
+   → the outgoing body includes that field.
+6. **Decision-log emission:** with `log_decisions: true`, every
+   request produces exactly one structured log line.
+
+These follow the existing `lucibridge_test` shape; the diff is only the
+new fixture file and the new test functions.
+
+---
+
+## Acceptance for INFR-79
+
+| Criterion | Where verified |
+|---|---|
+| A configured Veltro session routes `limbo_authoring` to Ollama and `tool_call` to SGLang, both succeed | `infernode-os/infernode` agentlib_test suite + manual run on Hephaestus |
+| Fallback documented and tested | tests 3 + 4 above |
+| Existing single-URL configs still work unchanged | test 1 above |
+| No regression in existing `lucibridge_test` suite | CI run on `infernode-os/infernode` |
+
+---
+
+## Open issues
+
+* **Per-tool LoRA selection** — gpt-oss-20b will serve multiple
+  adapters (`gpt-oss-limbo-v3` plus future v4) layered on one resident
+  base. The route's `extra` field should carry `lora_name` once SGLang
+  multi-LoRA is wired up. Tracked as a follow-up under INFR-77's
+  multi-LoRA validation.
+* **Bridge config hot-reload** — v1 reloads on `SIGHUP`; document
+  exact behaviour once the agentlib PR lands.
+* **Cross-host routing** — current schema assumes Hephaestus-local
+  backends. If we add a second Jetson, `backends[].base_url` already
+  takes any URL; document the firewall/zerotier story before exposing.
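A minimal Python sketch of the routing rules in the runbook above, written as a note outside the hunk. It is not the lucibridge implementation. It assumes `backends[]` entries carry a `name` next to `base_url`, that a fallback request uses `default_model`, and that the caller supplies a reachability probe. `resolve`, `log_line` and `fallback_rate_exceeded` are hypothetical helper names.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional

# Example config in the documented schema; `backends[].name` is an assumed field.
EXAMPLE_CONFIG = {
    "backends": [
        {"name": "ollama", "base_url": "http://127.0.0.1:11434/v1"},
        {"name": "sglang", "base_url": "http://127.0.0.1:30000/v1"},
    ],
    "default_backend": "ollama",
    "default_model": "devstral-limbo-v3",
    "routes": [
        {"category": "limbo_authoring", "backend": "ollama",
         "model": "devstral-limbo-v3"},
        {"category": "tool_call", "backend": "sglang",
         "model": "openai/gpt-oss-20b",
         "extra": {"grammar_backend": "xgrammar"}},
    ],
    "fallback": {
        "on_backend_unreachable": "default_backend",
        "on_unknown_category": "default_backend",
        "log_decisions": True,
    },
}


@dataclass
class Decision:
    category: str
    backend: str
    model: str
    extra: dict
    matched_route_index: Optional[int]
    fallback_used: bool = False
    fallback_reason: Optional[str] = None


def resolve(cfg: dict, category: str,
            is_reachable: Callable[[str], bool]) -> Decision:
    """Pick the first route for `category`, applying the fallback policy."""
    fallback = cfg["fallback"]

    def use_default(index, reason, policy_key):
        # Policy value "fail" rejects the request instead of falling back.
        if fallback[policy_key] == "fail":
            raise RuntimeError(f"routing failed: {reason}")
        return Decision(category, cfg["default_backend"],
                        cfg["default_model"], {}, index, True, reason)

    for index, route in enumerate(cfg["routes"]):
        if route["category"] != category:
            continue
        if is_reachable(route["backend"]):
            return Decision(category, route["backend"], route["model"],
                            route.get("extra", {}), index)
        return use_default(index, "backend_unreachable",
                           "on_backend_unreachable")
    return use_default(None, "unknown_category", "on_unknown_category")


def log_line(cfg: dict, decision: Decision, request_id: str, ts: str) -> str:
    """Render one decision as a single JSON line with the documented fields."""
    urls = {b["name"]: b["base_url"] for b in cfg["backends"]}
    return json.dumps({
        "ts": ts,
        "event": "lucibridge.route",
        "request_id": request_id,
        "category": decision.category,
        "matched_route_index": decision.matched_route_index,
        "backend": decision.backend,
        "backend_url": urls[decision.backend],
        "model": decision.model,
        "fallback_used": decision.fallback_used,
        "fallback_reason": decision.fallback_reason,
    })


def fallback_rate_exceeded(lines, threshold=0.05) -> bool:
    """True if more than `threshold` of one window's decisions used fallback."""
    records = [json.loads(line) for line in lines]
    if not records:
        return False
    used = sum(1 for record in records if record["fallback_used"])
    return used / len(records) > threshold
```

Passing the probe in as a callable keeps the decision logic testable against mocked backends, which is the shape the test plan asks for. Selecting the log lines that belong to one 5-minute window is left to the caller.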
diff --git a/sglang/.gitkeep b/sglang/.gitkeep
deleted file mode 100644
index 4214d49..0000000
--- a/sglang/.gitkeep
+++ /dev/null
@@ -1 +0,0 @@
-# Placeholder — populate when sglang fork / runbooks / CI lands.
diff --git a/sglang/LICENSE-UPSTREAM.md b/sglang/LICENSE-UPSTREAM.md
new file mode 100644
index 0000000..3adbebc
--- /dev/null
+++ b/sglang/LICENSE-UPSTREAM.md
@@ -0,0 +1,46 @@
+# Upstream attribution — `dusty-nv/jetson-containers`
+
+The `sglang/orin/` recipe in this subtree is vendored from
+[`dusty-nv/jetson-containers`](https://github.com/dusty-nv/jetson-containers),
+specifically the path `packages/llm/sglang/`.
+
+* **Vendored on:** 2026-05-14
+* **Upstream commit at vendoring:** `6ec74990dc4b84f3cbba86c2def7f232db9d0eaf`
+* **Upstream license:** MIT
+* **Maintainer:** Dustin Franklin (NVIDIA DevRel) and contributors
+
+The license text below is the upstream `LICENSE.md` reproduced
+verbatim, as required by the MIT terms. This repository is also
+MIT-licensed (see `/LICENSE`); the two are compatible.
+
+When re-syncing from upstream, update the commit SHA above and diff the
+verbatim files (`Dockerfile.upstream`, `build.sh`, `install.sh`, `test.py`) against
+the upstream snapshot at the new SHA. `orin/config.py` is intentionally
+divergent (pinned for Orin/JP6/CUDA 12.6 instead of upstream's CUDA-13
+pin) and should not be overwritten by a sync.
+
+---
+
+## Upstream license (MIT)
+
+```
+Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+
+Permission is hereby granted, free of charge, to any person obtaining a
+copy of this software and associated documentation files (the "Software"),
+to deal in the Software without restriction, including without limitation
+the rights to use, copy, modify, merge, publish, distribute, sublicense,
+and/or sell copies of the Software, and to permit persons to whom the
+Software is furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included
+in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
+OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
+IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
+CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
+TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
+SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+```
diff --git a/sglang/README.md b/sglang/README.md
new file mode 100644
index 0000000..e6005d3
--- /dev/null
+++ b/sglang/README.md
@@ -0,0 +1,84 @@
+# sglang/ — vendored Jetson SGLang recipe
+
+This subtree holds container recipes for running SGLang on NVIDIA Jetson
+hardware, vendored from
+[`dusty-nv/jetson-containers`](https://github.com/dusty-nv/jetson-containers)
+(MIT-licensed, NVIDIA-DevRel-maintained) and adapted for InferNode's
+production needs.
+
+## Layout
+
+```
+sglang/
+├── README.md                  this file
+├── LICENSE-UPSTREAM.md        attribution for vendored dusty-nv recipe
+├── orin/                      Jetson Orin AGX (sm_87, JetPack 6.x, CUDA 12.6)
+│   ├── README.md
+│   ├── config.py              OUR pinned SGLang version
+│   ├── Dockerfile             modified standalone build (verbatim copy: Dockerfile.upstream)
+│   ├── build.sh               vendored from upstream
+│   ├── install.sh             vendored from upstream
+│   └── test.py                vendored from upstream
+└── thor/                      Jetson Thor (sm_110, JetPack 7.x, CUDA 13.x)
+    ├── README.md
+    └── Dockerfile             thin wrapper over NGC's nvcr.io/nvidia/sglang
+```
+
+## Why two variants
+
+The two hardware generations need different base-image strategies:
+
+* **Orin (`sm_87`)** has no NGC SGLang image (per INFR-74 investigation —
+  NGC's SGLang line is CUDA-13-based, JP7-targeted). The fork-and-build
+  path is the only one available. The Orin recipe is a full vendor of
+  the dusty-nv jetson-containers SGLang package with our pinned version.
+
+* **Thor (`sm_110`)** has NVIDIA's official NGC SGLang container
+  (`nvcr.io/nvidia/sglang:25.10-py3` and later). That is a cleaner
+  upstream than Orin's community-maintained chain, so we use it as a
+  base and add only the InferNode-specific bits we need (tokenizers,
+  entrypoint conveniences).
+
+## Vendoring decision
+
+Straight copy with attribution, not git submodule or subtree-merge. Reasons:
+
+1. We need to **diverge** from upstream's pinned version: upstream pins
+   SGLang 0.5.11 with the explicit annotation "Compatible with CUDA 13
+   (Spark and Thor)". JetPack 6.x ships CUDA 12.6 — that pin doesn't
+   work for Orin. We need our own version pin (see `orin/config.py`).
+2. The upstream recipe is **small** (≈ 130 LOC across 4 files plus
+   README + test). Submodule overhead exceeds the merge-back cost.
+3. We may want to apply Orin-specific patches (e.g. the chat-template
+   / tokenizer fixes per INFR-78) without coordinating with upstream.
+
+If upstream evolves in ways we care about, the re-sync is a manual
+diff-and-merge against `LICENSE-UPSTREAM.md`'s recorded commit SHA.
+The vendored files at the time of copy are an exact snapshot — the
+diff is therefore easy to compute.
+
+## Upstream source
+
+* Repo: <https://github.com/dusty-nv/jetson-containers>
+* Path: `packages/llm/sglang/`
+* Vendored from: `master` branch as of 2026-05-14
+* Upstream commit at vendoring: see `LICENSE-UPSTREAM.md`
+* Upstream license: MIT (compatible with this repo's MIT)
+
+## What changed vs upstream
+
+| File | Status |
+|---|---|
+| `orin/config.py` | **Modified** — pinned to a 0.5.x release compatible with CUDA 12.6 (see `orin/README.md`) |
+| `orin/Dockerfile.upstream` | Verbatim — kept for diff against upstream re-syncs |
+| `orin/Dockerfile` | **Modified** — standalone build (drops chained `/tmp/transformers/install.sh`, adds tokenizer bake step per INFR-78) |
+| `orin/build.sh` | Verbatim |
+| `orin/install.sh` | Verbatim |
+| `orin/test.py` | Verbatim |
+| `orin/bake-tokenizers.sh` | InferNode-authored — pulls non-gated Llama-3 tokenizer dirs (INFR-78) |
+| `thor/Dockerfile` | InferNode-authored — wraps NGC `nvcr.io/nvidia/sglang` |
+| `thor/test.py` | Verbatim copy of `orin/test.py` (docker-context boundary) |
+
+When you re-sync from upstream, diff the `orin/` non-config files
+against the upstream snapshot at that point; `config.py` is intentionally
+divergent and should not be auto-overwritten.
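The re-sync check described in the README above can be sketched in Python as a plain file comparison. This is an illustration, not part of the commit. It assumes a local checkout of upstream's `packages/llm/sglang/` at the recorded SHA, and `drifted_files` is a hypothetical helper name. The vendored `Dockerfile.upstream` is compared against upstream's `Dockerfile`, and `config.py` is deliberately left out.

```python
import filecmp
from pathlib import Path

# Vendored file name -> upstream file name. config.py is intentionally
# divergent and therefore not listed.
VERBATIM_FILES = {
    "Dockerfile.upstream": "Dockerfile",
    "build.sh": "build.sh",
    "install.sh": "install.sh",
    "test.py": "test.py",
}


def drifted_files(vendored_dir, upstream_dir):
    """Return the vendored files that are missing or differ from upstream."""
    drifted = []
    for local_name, upstream_name in VERBATIM_FILES.items():
        local = Path(vendored_dir) / local_name
        upstream = Path(upstream_dir) / upstream_name
        if not (local.is_file() and upstream.is_file()):
            drifted.append(local_name)
        elif not filecmp.cmp(str(local), str(upstream), shallow=False):
            # shallow=False forces a byte-for-byte content comparison.
            drifted.append(local_name)
    return sorted(drifted)
```

An empty result means the verbatim files still match the snapshot. Any name returned is a candidate for a manual diff-and-merge.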
diff --git a/sglang/orin/Dockerfile b/sglang/orin/Dockerfile
new file mode 100644
index 0000000..8e120eb
--- /dev/null
+++ b/sglang/orin/Dockerfile
@@ -0,0 +1,61 @@
+# SGLang for Jetson Orin AGX (sm_87, JetPack 6.x, CUDA 12.6)
+#
+# Based on dusty-nv/jetson-containers:packages/llm/sglang/Dockerfile
+# (see ../LICENSE-UPSTREAM.md). Kept verbatim at ./Dockerfile.upstream
+# for diff-against-upstream. This file is the standalone production
+# build with the following deltas:
+#
+#   1. `transformers` install moved to a `pip install` (upstream chains
+#      its own /tmp/transformers/install.sh from an earlier stage).
+#   2. Tokenizer + chat-template bake step added at /opt/tokenizers/
+#      (INFR-78).
+#   3. Smoke-test script copied to /opt/sglang/test.py.
+
+ARG BASE_IMAGE
+FROM ${BASE_IMAGE}
+
+ARG SGLANG_VERSION \
+    SGLANG_VERSION_SPEC \
+    IS_SBSA \
+    FORCE_BUILD=off \
+    TMP=/tmp/sglang \
+    TOKENIZER_DIR=/opt/tokenizers
+
+RUN apt-get update -y && \
+    apt-get install -y --no-install-recommends \
+        libnuma-dev \
+        libsndfile1 \
+        libsndfile1-dev \
+        libprotobuf-dev \
+        libsm6 \
+        libxext6 \
+        libgl1 && \
+    apt-get clean && \
+    rm -rf /var/lib/apt/lists/*
+
+# Vendored SGLang install (pip-first; falls back to source build on failure).
+COPY build.sh install.sh $TMP/
+RUN $TMP/install.sh || $TMP/build.sh || touch $TMP/.build.failed
+RUN if [ -f $TMP/.build.failed ]; then \
+      echo "SGLANG ${SGLANG_VERSION} build failed!"; \
+      exit 1; \
+    fi
+
+# Standalone replacement for upstream's chained /tmp/transformers/install.sh.
+# SGLang declares its own transformers pin in python/pyproject.toml; this just
+# ensures the transitive install actually completed.
+RUN python3 -c "import transformers; print('transformers', transformers.__version__)" \
+    || pip install --no-cache-dir 'transformers>=4.45,<5'
+
+# Bake Llama-3 family tokenizer + chat templates so --tokenizer-path can
+# be set at launch time without a runtime HuggingFace pull. Adds ~60 MB
+# (two families * ~30 MB) to the final image. (INFR-78.)
+COPY bake-tokenizers.sh $TMP/
+RUN TOKENIZER_DIR=${TOKENIZER_DIR} $TMP/bake-tokenizers.sh
+
+# Smoke test, callable as `docker run --rm <image> python3 /opt/sglang/test.py`.
+COPY test.py /opt/sglang/test.py
+
+LABEL org.opencontainers.image.source="https://github.com/infernode-os/serving"
+LABEL org.opencontainers.image.description="SGLang for Jetson Orin AGX (sm_87) — InferNode fork of dusty-nv/jetson-containers"
+LABEL org.opencontainers.image.licenses="MIT"
diff --git a/sglang/orin/Dockerfile.upstream b/sglang/orin/Dockerfile.upstream
new file mode 100644
index 0000000..65452e9
--- /dev/null
+++ b/sglang/orin/Dockerfile.upstream
@@ -0,0 +1,41 @@
+#---
+# name: sglang
+# group: llm
+# config: config.py
+# depends: [sgl-kernel, torch-memory-saver]
+# buildkit_device: nvidia.com/gpu=all
+# requires: '>=36'
+# test: test.py
+# notes: https://github.com/sgl-project/sglang
+#---
+ARG BASE_IMAGE
+FROM ${BASE_IMAGE}
+
+ARG SGLANG_VERSION \
+    SGLANG_VERSION_SPEC \
+    IS_SBSA \
+    FORCE_BUILD=off \
+    TMP=/tmp/sglang
+
+RUN apt-get update -y && \
+    apt-get install -y --no-install-recommends \
+        libnuma-dev \
+        libsndfile1 \
+        libsndfile1-dev \
+        libprotobuf-dev \
+        libsm6 \
+        libxext6 \
+        libgl1 && \
+    apt-get clean && \
+    rm -rf /var/lib/apt/lists/*
+
+COPY build.sh install.sh $TMP/
+RUN $TMP/install.sh || $TMP/build.sh || touch $TMP/.build.failed
+
+# this retains the stage above for debugging on build failure
+RUN if [ -f $TMP/.build.failed ]; then \
+      echo "SGLANG ${SGLANG_VERSION} build failed!"; \
+      exit 1; \
+    fi
+
+RUN /tmp/transformers/install.sh
diff --git a/sglang/orin/README.md b/sglang/orin/README.md
new file mode 100644
index 0000000..350c764
--- /dev/null
+++ b/sglang/orin/README.md
@@ -0,0 +1,110 @@
+# sglang/orin/ — Jetson Orin AGX (sm_87)
+
+Container recipe for SGLang on Jetson Orin AGX. Vendored from
+`dusty-nv/jetson-containers:packages/llm/sglang/` (see
+`../LICENSE-UPSTREAM.md`) with version pin and Orin-specific notes
+diverged from upstream.
+
+## Target
+
+| Property | Value |
+|---|---|
+| GPU | Ampere Tegra (`sm_87`) |
+| JetPack | 6.x (R36.4 series) |
+| CUDA | 12.6 |
+| cuDNN | 9.3 |
+| L4T base | `r36.4.0` |
+| Python | 3.10 |
+| PyTorch | 2.5–2.6 (`USE_DISTRIBUTED=1`, `TORCH_CUDA_ARCH_LIST=8.7`) |
+
+## Pinned version
+
+`SGLANG_VERSION=0.5.3`; see the docstring at the top of `config.py`
+for the rationale and fallback ladder. 0.5.3 is the first 0.5.x line
+with `srt/models/gpt_oss.py`. An on-target smoke build on Hephaestus is
+the gate (INFR-77).
+
+If the pinned version fails to build, fall back in this order:
+0.5.2 → 0.5.1 → 0.5.0, then 0.4.5+. Document the working pin back in
+`config.py`'s `package = [ … ]` and update this README's "Pinned
+version" line.
+
+## Build (CI, GitHub-hosted)
+
+CI builds the container on `ubuntu-24.04-arm` (Graviton SBSA, native
+aarch64 — no QEMU). See `.github/workflows/build-sglang.yml`. The
+build cross-compiles CUDA kernels for `sm_87` via
+`TORCH_CUDA_ARCH_LIST=8.7` at `nvcc` invocation; the output image
+runs on Jetson Orin AGX (Tegra).
+
+## Build (manual on Hephaestus)
+
+When the CI image is unavailable or you're iterating on the recipe:
+
+```sh
+cd ~/serving/sglang/orin
+docker build \
+  --build-arg BASE_IMAGE=dustynv/pytorch:2.6-r36.4.0-cu126-22.04 \
+  --build-arg SGLANG_VERSION=0.5.3 \
+  --build-arg SGLANG_VERSION_SPEC=0.5.3 \
+  --build-arg IS_SBSA=0 \
+  -t serving-sglang:orin-local .
+```
+
+(The exact `BASE_IMAGE` tag depends on what dustynv has published at
+the time. Use `dustynv/pytorch` rather than `dustynv/sglang` to avoid
+inheriting their stale 0.4.1 install; we re-install our pinned 0.5.x
+fresh via `install.sh` / `build.sh`.)
+
+## Files
+
+| File | Origin | Purpose |
+|---|---|---|
+| `config.py` | **modified** | jetson-containers package config; pinned to our Orin-compatible version |
+| `Dockerfile` | **modified** | standalone build with tokenizer bake (INFR-78) |
+| `Dockerfile.upstream` | verbatim from upstream | reference copy for diff against upstream re-syncs |
+| `build.sh` | verbatim from upstream | source-build fallback if `pip install` fails |
+| `install.sh` | verbatim from upstream | `uv pip install "sglang[all]~=$SGLANG_VERSION"` first-try path |
+| `bake-tokenizers.sh` | InferNode-authored | downloads Llama-3 / Llama-3.1 tokenizer dirs into `/opt/tokenizers/` (INFR-78) |
+| `test.py` | verbatim from upstream | smoke test (`import sglang`, print version + CUDA device) |
+
+## What this does NOT include
+
+* **A `BASE_IMAGE`**. The Dockerfile expects one to be passed at build
+  time (matches the upstream jetson-containers pattern, which chains
+  base images via its framework). The CI workflow supplies a
+  Jetson-rooted base; for manual builds see the command above.
+* **Automatic tokenizer selection**. `bake-tokenizers.sh` bakes the
+  Llama-3 tokenizer dirs into `/opt/tokenizers/` (INFR-78), but
+  `--tokenizer-path` must still be set at launch time.
+* **Entrypoint scripts**. Launch arguments live with the runbook
+  (`runbooks/hephaestus-deploy.md`) rather than baked into the image,
+  so the same image serves different model paths without rebuild.
+
+## Verifying a build
+
+After a successful image build, run the upstream smoke test inside the
+container:
+
+```sh
+docker run --rm --gpus all --runtime nvidia serving-sglang:orin-local \
+  python3 /opt/sglang/test.py
+```
+
+Expected output:
+
+```
+testing SGLang...
+✅ Memory cleared
+SGLang version: 0.5.3
+CUDA available: True
+CUDA device: Orin (or NVIDIA Jetson AGX Orin)
+SGLang OK
+```
+
+`gpt-oss` arch verification (per INFR-77 acceptance):
+
+```sh
+docker run --rm --gpus all --runtime nvidia serving-sglang:orin-local \
+  python3 -c "import sglang.srt.models.gpt_oss as m; print('gpt_oss arch module:', m.__file__)"
+```
diff --git a/sglang/orin/bake-tokenizers.sh b/sglang/orin/bake-tokenizers.sh
new file mode 100755
index 0000000..bf7b788
--- /dev/null
+++ b/sglang/orin/bake-tokenizers.sh
@@ -0,0 +1,73 @@
+#!/usr/bin/env bash
+#
+# Bake a curated set of non-gated tokenizer + chat-template directories
+# into the SGLang container at /opt/tokenizers/<family>/.
+#
+# Why: SGLang's GGUF tokenizer doesn't register Llama-3's special tokens
+# (<|eot_id|>, <|begin_of_text|>, <|start_header_id|>) properly, so stops
+# don't match cleanly and chat-template framing is wrong. The fix is to
+# point --tokenizer-path at a real HuggingFace tokenizer dir at launch
+# time (see runbooks/hephaestus-deploy.md). Baking the tokenizers into
+# the image means the launch command is fully offline-capable.
+#
+# Each family pulled is the tokenizer files only (~30 MB per family);
+# weights are NOT downloaded. We pick non-gated mirrors so no HF login
+# is required at build time.
+#
+# Owning ticket: INFR-78.
+
+set -euo pipefail
+
+DEST="${TOKENIZER_DIR:-/opt/tokenizers}"
+mkdir -p "$DEST"
+
+# family       repo (non-gated mirror)                     alias dir under $DEST
+declare -A FAMILIES=(
+  [llama-3.1]="unsloth/Meta-Llama-3.1-8B-Instruct"
+  [llama-3]="NousResearch/Meta-Llama-3-8B-Instruct"
+)
+
+# Tokenizer-only files, no weights. Reference only: the Python block below defines the list it actually uses.
+PATTERNS=(
+  "tokenizer.json"
+  "tokenizer_config.json"
+  "special_tokens_map.json"
+  "chat_template.json"
+  "generation_config.json"
+)
+
+python3 - "$DEST" "${!FAMILIES[@]}" <<'PY' "${FAMILIES[@]}"
+import os, sys
+from huggingface_hub import snapshot_download
+
+dest_root = sys.argv[1]
+families = sys.argv[2:]
+# Args come in two halves: aliases then repos (same order).
+n = len(families) // 2
+aliases, repos = families[:n], families[n:]
+
+patterns = [
+    "tokenizer.json",
+    "tokenizer_config.json",
+    "special_tokens_map.json",
+    "chat_template.json",
+    "generation_config.json",
+]
+
+for alias, repo in zip(aliases, repos):
+    target = os.path.join(dest_root, alias)
+    os.makedirs(target, exist_ok=True)
+    print(f"[bake-tokenizers] {repo} -> {target}")
+    snapshot_download(
+        repo_id=repo,
+        local_dir=target,
+        local_dir_use_symlinks=False,
+        allow_patterns=patterns,
+    )
+
+print("[bake-tokenizers] done")
+PY
+
+# Strip HuggingFace download metadata (`.huggingface` or `.cache`, depending on the hub version); keep only the flat tokenizer dirs.
+find "$DEST" \( -name '.huggingface' -o -name '.cache' \) -prune -exec rm -rf {} + 2>/dev/null || true
+du -sh "$DEST"/* 2>/dev/null || true
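A build-time check that the bake step produced usable directories could look like the Python sketch below. It is not part of the commit. It assumes the two family directories named in the script above, and it treats `tokenizer.json` and `tokenizer_config.json` as the minimum required files, which is an assumption about what a launch with `--tokenizer-path` needs. `missing_tokenizer_files` is a hypothetical helper name.

```python
from pathlib import Path

# Assumed minimum; the other patterns in the script are treated as optional.
REQUIRED_FILES = ("tokenizer.json", "tokenizer_config.json")


def missing_tokenizer_files(root, families=("llama-3", "llama-3.1")):
    """Map each family to the required files absent from <root>/<family>/."""
    missing = {}
    for family in families:
        family_dir = Path(root) / family
        absent = [name for name in REQUIRED_FILES
                  if not (family_dir / name).is_file()]
        if absent:
            missing[family] = absent
    return missing
```

Calling `missing_tokenizer_files("/opt/tokenizers")` after the bake step and failing the build on a non-empty result would surface a renamed or newly gated mirror at build time instead of at launch.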
diff --git a/sglang/orin/build.sh b/sglang/orin/build.sh
new file mode 100755
index 0000000..be213e4
--- /dev/null
+++ b/sglang/orin/build.sh
@@ -0,0 +1,80 @@
+#!/usr/bin/env bash
+set -x
+
+# Ensure required variables are set
+: "${SGLANG_VERSION:?SGLANG_VERSION must be set}"
+: "${PIP_WHEEL_DIR:?PIP_WHEEL_DIR must be set}"
+
+# --- PRE-INSTALL DEPS ---
+# Install build dependencies first. uv is a very fast installer.
+uv pip install --no-cache-dir ninja setuptools wheel numpy uv scikit-build-core compressed-tensors decord2
+
+# --- CLONE SGLANG REPO ---
+REPO_URL="https://github.com/sgl-project/sglang"
+REPO_DIR="/opt/sglang"
+
+echo "Building SGLang ${SGLANG_VERSION}"
+
+if [ ! -d "${REPO_DIR}" ]; then
+  if git clone --recursive --depth 1 --branch "v${SGLANG_VERSION}" \
+      "${REPO_URL}" "${REPO_DIR}"; then
+    echo "Cloned SGLang v${SGLANG_VERSION}"
+  else
+    echo "Tagged branch v${SGLANG_VERSION} not found; cloning default branch"
+    git clone --recursive --depth 1 "${REPO_URL}" "${REPO_DIR}"
+  fi
+else
+  echo "Directory ${REPO_DIR} already exists, skipping clone."
+fi
+cd "${REPO_DIR}" || exit 1
+
+
+# --- PATCH 1: RELAX PYTORCH VERSION REQUIREMENTS ---
+cd "${REPO_DIR}/python" || exit 1
+sed -i 's/==/>=/g' pyproject.toml
+
+echo "Patched ${REPO_DIR}/python/pyproject.toml to relax version constraints"
+cat pyproject.toml
+
+# --- CONFIGURE PARALLEL BUILD ---
+if [[ -z "${IS_SBSA:-}" || "${IS_SBSA}" == "0" || "${IS_SBSA,,}" == "false" ]]; then
+  export CORES=6 # Automatically use all available cores
+else
+  export CORES=6  # GH200 or other specific hardware
+fi
+export CMAKE_BUILD_PARALLEL_LEVEL="${CORES}"
+export MAX_JOBS="${CORES}"
+
+# --- BUILD SGLANG WHEEL (THE RIGHT WAY) ---
+echo "🚀 Building sglang wheel ONLY with MAX_JOBS=${CORES}"
+
+# Use '--no-deps' to build ONLY the sglang wheel and ignore its dependencies.
+# We will install dependencies later when we install the built wheel.
+uv build --wheel \
+    --no-build-isolation \
+    . \
+    --out-dir "${PIP_WHEEL_DIR}"
+
+# --- INSTALL THE BUILT WHEEL AND ITS DEPENDENCIES ---
+echo "✅ sglang wheel built successfully."
+echo "📦 Installing the sglang wheel from ${PIP_WHEEL_DIR} and its dependencies from PyPI..."
+
+# Now, when we install the local wheel, pip will fetch its dependencies
+# (like torch, transformers, etc.) from the online package index (PyPI).
+# We use 'uv' here because it's extremely fast.
+uv pip install -v --find-links="${PIP_WHEEL_DIR}" "sglang[all]"
+
+# Your original script installed 'gemlite' here, so we keep it.
+uv pip install gemlite orjson pybase64
+
+echo "🎉 SGLang and all dependencies installed successfully!"
+
+cd / || exit 1
+
+# Try uploading; ignore failure
+if [ -x "$(command -v twine)" ]; then
+    twine upload --verbose "${PIP_WHEEL_DIR}/sglang"*.whl \
+      || echo "Failed to upload wheel to ${TWINE_REPOSITORY_URL:-<unset>}"
+else
+    echo "twine not installed, skipping upload."
+fi
diff --git a/sglang/orin/config.py b/sglang/orin/config.py
new file mode 100644
index 0000000..4595cf0
--- /dev/null
+++ b/sglang/orin/config.py
@@ -0,0 +1,64 @@
+"""
+SGLang package configuration — Jetson Orin AGX variant (sm_87, JetPack 6.x, CUDA 12.6)
+
+Vendored from dusty-nv/jetson-containers:packages/llm/sglang/config.py
+(see ../LICENSE-UPSTREAM.md for attribution), with the version pin
+diverged from upstream's CUDA-13-tied 0.5.11 to a JP6/CUDA-12.6-
+compatible 0.5.x release that ships gpt_oss.py in srt/models/.
+
+Pin rationale (INFR-74 + INFR-77):
+  - Upstream's current default (0.5.11) is annotated "Compatible with
+    CUDA 13 (Spark and Thor)" — won't build for JetPack 6 / CUDA 12.6.
+  - Dustynv's last Orin-targeted published tag (r36.4.0) shipped
+    0.4.1.post7 (Feb 2025), which predates gpt-oss support
+    (Aug 2025) and has no srt/models/gpt_oss.py.
+  - 0.5.3 is the initial pick: first 0.5.x line with gpt-oss model
+    class, predates the upstream's CUDA-13 transition (which landed
+    around 0.5.11). On-target smoke build on Hephaestus is the gate.
+  - Fallback ladder if 0.5.3 doesn't build: 0.5.2 → 0.5.1 → 0.5.0,
+    then 0.4.5+. Document the working pin back here when verified.
+"""
+from jetson_containers import CUDA_VERSION, IS_SBSA, update_dependencies
+from packaging.version import Version
+
+
+def sglang(version, version_spec=None, requires=None, depends=None, default=False):
+    pkg = package.copy()
+
+    if requires:
+        pkg['requires'] = requires
+
+    if not version_spec:
+        version_spec = version
+
+    if depends:
+        pkg['depends'] = update_dependencies(pkg['depends'], depends)
+
+    pkg['name'] = f'sglang:{version}'
+
+    pkg['build_args'] = {
+        'SGLANG_VERSION': version,
+        'SGLANG_VERSION_SPEC': version_spec,
+        'IS_SBSA': IS_SBSA
+    }
+
+    builder = pkg.copy()
+
+    builder['name'] = f'sglang:{version}-builder'
+    builder['build_args'] = {**pkg['build_args'], **{'FORCE_BUILD': 'on'}}
+
+    if default:
+        pkg['alias'] = 'sglang'
+        builder['alias'] = 'sglang:builder'
+
+    return pkg, builder
+
+
+package = [
+    sglang(
+        '0.5.3',
+        '0.5.3',
+        depends=['flashinfer', 'sgl-kernel:0.5.3', 'torchao:0.17.0'],
+        default=True,
+    ),  # Orin/JP6/CUDA 12.6 pin — first 0.5.x with gpt_oss.py; verify via on-target smoke build (INFR-77)
+]
diff --git a/sglang/orin/install.sh b/sglang/orin/install.sh
new file mode 100755
index 0000000..fd35dff
--- /dev/null
+++ b/sglang/orin/install.sh
@@ -0,0 +1,44 @@
+#!/usr/bin/env bash
+set -ex
+
+uv pip install \
+  compressed-tensors \
+  datasets \
+  decord2 \
+  fastapi \
+  hf_transfer \
+  huggingface_hub \
+  interegular \
+  "llguidance>=0.7.11,<0.8.0" \
+  modelscope \
+  ninja \
+  orjson \
+  packaging \
+  partial_json_parser \
+  pillow \
+  "prometheus-client>=0.20.0" \
+  psutil \
+  pydantic \
+  nvidia-ml-py \
+  python-multipart \
+  "pyzmq>=25.1.2" \
+  "soundfile>=0.13.1" \
+  "torchao>=0.9.0" \
+  uvicorn \
+  uvloop \
+  "blobfile>=3.0.0" \
+  "anthropic" \
+  "msgspec" \
+  orjson \
+  litellm \
+  pybase64 \
+  fastapi \
+  outlines
+
+if [ "$FORCE_BUILD" == "on" ]; then
+	echo "Forcing build of sglang ${SGLANG_VERSION}"
+	exit 1
+fi
+
+uv pip install sgl-kernel "sglang[all]~=${SGLANG_VERSION}" || \
+uv pip install sgl-kernel "sglang[all]~=${SGLANG_VERSION_SPEC}"
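The install line above uses the compatible-release operator `~=`, which for a three-part version accepts any later release in the same minor series. The Python sketch below illustrates that rule for plain numeric versions only (no `.post` or `rc` suffixes). `compatible_release` is a hypothetical helper written for this note, not a packaging API. Which version a resolver actually installs also depends on what the index offers for the platform.

```python
def parse(version):
    """Split a plain dotted version such as '0.5.3' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def compatible_release(candidate, pin):
    """True if `candidate` satisfies `~=pin` under the PEP 440 rule.

    `~=X.Y.Z` means `>= X.Y.Z` and `== X.Y.*`: every component of the pin
    except the last must match, and the candidate must not be older.
    """
    cand, base = parse(candidate), parse(pin)
    if len(base) < 2:
        raise ValueError("~= needs at least two version components")
    prefix = base[:-1]
    return cand[:len(prefix)] == prefix and cand >= base


if __name__ == "__main__":
    for version in ("0.5.2", "0.5.3", "0.5.11", "0.6.0"):
        print(version, compatible_release(version, "0.5.3"))
```

Under this rule `~=0.5.3` admits 0.5.11, the release the config docstring ties to CUDA 13, so the first-try path is not an exact pin. Holding the install to exactly 0.5.3 would need the `==` operator, and changing that would be a deliberate divergence from the verbatim upstream script.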
diff --git a/sglang/orin/test.py b/sglang/orin/test.py
new file mode 100755
index 0000000..5ac148c
--- /dev/null
+++ b/sglang/orin/test.py
@@ -0,0 +1,27 @@
+#!/usr/bin/env python3
+import gc
+import torch
+
+def clear_memory():
+    """Free GPU and CPU memory before running SGLang."""
+    try:
+        if torch.cuda.is_available():
+            torch.cuda.empty_cache()
+            torch.cuda.ipc_collect()
+            torch.cuda.synchronize()
+        gc.collect()
+        print("✅ Memory cleared")
+    except Exception as e:
+        print(f"⚠️ Memory cleanup failed: {e}")
+
+print('testing SGLang...')
+clear_memory()  # <-- Clean before anything else
+
+import sglang as sgl
+
+print(f"SGLang version: {sgl.__version__}")
+print(f"CUDA available: {torch.cuda.is_available()}")
+if torch.cuda.is_available():
+    print(f"CUDA device: {torch.cuda.get_device_name(0)}")
+
+print('SGLang OK\n')
diff --git a/sglang/thor/Dockerfile b/sglang/thor/Dockerfile
new file mode 100644
index 0000000..c494afd
--- /dev/null
+++ b/sglang/thor/Dockerfile
@@ -0,0 +1,23 @@
+# Jetson Thor (sm_110, JetPack 7, CUDA 13) SGLang container.
+#
+# This is a thin wrapper over NVIDIA's official NGC SGLang image. The
+# heavy lifting (PyTorch with sm_110 device code, sgl-kernel, flashinfer,
+# xgrammar) is all done by NVIDIA in the base image. We add only what
+# InferNode needs on top.
+#
+# Build: see sglang/thor/README.md or .github/workflows/build-sglang.yml.
+
+ARG NGC_SGLANG_TAG=25.10-py3
+FROM nvcr.io/nvidia/sglang:${NGC_SGLANG_TAG}
+
+LABEL org.opencontainers.image.source="https://github.com/infernode-os/serving"
+LABEL org.opencontainers.image.description="SGLang for Jetson Thor (sm_110) — InferNode overlay on NGC base"
+LABEL org.opencontainers.image.licenses="MIT"
+
+# Smoke test mirrored from sglang/orin/test.py so the same verification
+# script works against either variant. (Duplicated rather than symlinked
+# because docker build context can't reach outside this directory.)
+COPY test.py /opt/sglang/test.py
+
+# No CMD/ENTRYPOINT override — preserve NGC's defaults so callers
+# pass the launch args explicitly (see runbooks/hephaestus-deploy.md).
diff --git a/sglang/thor/README.md b/sglang/thor/README.md
new file mode 100644
index 0000000..7d17fb2
--- /dev/null
+++ b/sglang/thor/README.md
@@ -0,0 +1,60 @@
+# sglang/thor/ — Jetson Thor (sm_110)
+
+Container recipe for SGLang on Jetson Thor. Forward-looking — we
+don't have a Thor box yet, but the recipe stays parallel to Orin's so
+that the deploy story doesn't need a rewrite when one arrives.
+
+## Target
+
+| Property | Value |
+|---|---|
+| GPU | Blackwell Tegra (`sm_110`) |
+| JetPack | 7.x |
+| CUDA | 13.0+ |
+| Base | `nvcr.io/nvidia/sglang:25.10-py3` (NGC) |
+
+## Why this is shorter than `orin/`
+
+NVIDIA ships an **official** NGC SGLang container line starting at
+`25.10-py3` (October 2025) that explicitly targets Jetson Thor. See
+the INFR-74 investigation comment for the full enumeration. The Thor
+recipe therefore doesn't fork dusty-nv/jetson-containers — it just
+pulls NGC's image, version-pins, and layers any InferNode-specific
+conveniences on top.
+
+This is a deliberate asymmetry with `orin/`: NGC's SGLang line is
+CUDA-13-based and JP7-targeted, so it can't run on Orin/JP6 — but
+that's exactly what we want for Thor.
+
+## Pinned base
+
+`nvcr.io/nvidia/sglang:25.10-py3` — first NGC release with Jetson
+Thor support. Bump as newer NGC tags ship (`25.11-py3`, `26.02-py3`,
+etc.) once we have a Thor box and can verify each.
+
+## Known issues to track
+
+Per the [NGC SGLang 25.10 release notes](https://docs.nvidia.com/deeplearning/frameworks/sglang-release-notes/rel-25-10.html):
+
+* **`gpt-oss` family models cannot run on DGX Spark and Jetson Thor
+  due to an OpenAI Triton issue.** This blocks the V4-PLAN gpt-oss
+  prize on Thor specifically. Track NGC release notes for resolution;
+  Orin's recipe (with a JP6/CUDA-12.6 Triton stack) is the workaround
+  in the meantime.
+
+## Build
+
+CI matrix-builds both variants on `ubuntu-24.04-arm`; the Thor variant
+is a `FROM nvcr.io/nvidia/sglang:25.10-py3` + thin overlay, so the
+build is mostly a re-tag with our overlay scripts. See
+`.github/workflows/build-sglang.yml`.
+
+NGC auth is required for the pull (free NGC account works; the CI
+workflow expects `NGC_API_KEY` as a repo secret).
+
+## Manual build on a Thor host (when one exists)
+
+```sh
+echo "$NGC_API_KEY" | docker login nvcr.io -u '$oauthtoken' --password-stdin
+docker build -t serving-sglang:thor-local sglang/thor/
+```
diff --git a/sglang/thor/test.py b/sglang/thor/test.py
new file mode 100755
index 0000000..5ac148c
--- /dev/null
+++ b/sglang/thor/test.py
@@ -0,0 +1,27 @@
+#!/usr/bin/env python3
+import gc
+import torch
+
+def clear_memory():
+    """Free GPU and CPU memory before running SGLang."""
+    try:
+        if torch.cuda.is_available():
+            torch.cuda.empty_cache()
+            torch.cuda.ipc_collect()
+            torch.cuda.synchronize()
+        gc.collect()
+        print("✅ Memory cleared")
+    except Exception as e:
+        print(f"⚠️ Memory cleanup failed: {e}")
+
+print('testing SGLang...')
+clear_memory()  # <-- Clean before anything else
+
+import sglang as sgl
+
+print(f"SGLang version: {sgl.__version__}")
+print(f"CUDA available: {torch.cuda.is_available()}")
+if torch.cuda.is_available():
+    print(f"CUDA device: {torch.cuda.get_device_name(0)}")
+
+print('SGLang OK\n')

From d54cb2511f3942cb78ab416bc64e36c78e241ab7 Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Thu, 14 May 2026 09:29:03 +0000
Subject: [PATCH 2/3] ci(sglang): drop NGC_API_KEY as a hard requirement

NGC SGLang containers are anonymously pullable from nvcr.io for the
default tags we care about. Make the NGC login step conditional on the
secret being set (forward-compat with any future gated variant) and
remove the PR-skip, which existed only because of the mistaken
assumption that NGC auth was required. The Thor variant now builds on
every event, same as Orin.
---
 .github/workflows/build-sglang.yml | 20 +++++++++++++++-----
 sglang/thor/README.md              | 13 +++++++------
 2 files changed, 22 insertions(+), 11 deletions(-)

diff --git a/.github/workflows/build-sglang.yml b/.github/workflows/build-sglang.yml
index 2742397..6d3c7a0 100644
--- a/.github/workflows/build-sglang.yml
+++ b/.github/workflows/build-sglang.yml
@@ -43,10 +43,6 @@ jobs:
   build:
     name: build-${{ matrix.variant }}
     runs-on: ubuntu-24.04-arm
-    # Thor needs NGC auth and only runs on push to main / tags / manual
-    # dispatch. PRs from forks have no access to NGC_API_KEY and would
-    # fail at the login step.
-    if: matrix.variant != 'thor' || github.event_name != 'pull_request'
     strategy:
       fail-fast: false
       matrix:
@@ -78,8 +74,22 @@ jobs:
           username: ${{ github.actor }}
           password: ${{ secrets.GITHUB_TOKEN }}
 
-      - name: Log in to NGC (Thor only)
+      # NGC SGLang containers are anonymously pullable from nvcr.io —
+      # NGC auth is only needed for gated content (which the Thor base
+      # isn't). This step runs only when NGC_API_KEY is set, so it stays
+      # forward-compatible with any future gated variant.
+      - name: Check NGC auth
         if: matrix.variant == 'thor'
+        id: ngc
+        run: |
+          if [ -n "${{ secrets.NGC_API_KEY }}" ]; then
+            echo "use_auth=true"  >> "$GITHUB_OUTPUT"
+          else
+            echo "use_auth=false" >> "$GITHUB_OUTPUT"
+          fi
+
+      - name: Log in to NGC (Thor only, if secret set)
+        if: matrix.variant == 'thor' && steps.ngc.outputs.use_auth == 'true'
         uses: docker/login-action@74a5d142397b4f367a81961eba4e8cd7edddf772 # v3.4.0
         with:
           registry: nvcr.io
diff --git a/sglang/thor/README.md b/sglang/thor/README.md
index 7d17fb2..5214c35 100644
--- a/sglang/thor/README.md
+++ b/sglang/thor/README.md
@@ -45,16 +45,17 @@ Per the [NGC SGLang 25.10 release notes](https://docs.nvidia.com/deeplearning/fr
 ## Build
 
 CI matrix-builds both variants on `ubuntu-24.04-arm`; the Thor variant
-is a `FROM nvcr.io/nvidia/sglang:25.10-py3` + thin overlay, so the
-build is mostly a re-tag with our overlay scripts. See
+is `FROM nvcr.io/nvidia/sglang:25.10-py3` + thin overlay, so the build
+is mostly a re-tag with our overlay scripts. See
 `.github/workflows/build-sglang.yml`.
 
-NGC auth is required for the pull (free NGC account works; the CI
-workflow expects `NGC_API_KEY` as a repo secret).
+NGC SGLang containers are anonymously pullable from `nvcr.io` — no
+auth required for the default tag. If NVIDIA ever gates a tag we want,
+set the `NGC_API_KEY` repo secret and the workflow's optional login
+step will fire automatically.
 
-## Manual build on a Thor host (when one exists)
+## Manual build
 
 ```sh
-echo "$NGC_API_KEY" | docker login nvcr.io -u '$oauthtoken' --password-stdin
 docker build -t serving-sglang:thor-local sglang/thor/
 ```

From 3bc873c6c3ddd98439281f6508aae90bb6aaa56e Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Thu, 14 May 2026 09:34:05 +0000
Subject: [PATCH 3/3] ci(sglang): point Orin BASE_IMAGE at a tag that actually
 exists
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

First CI run failed with:
  dustynv/pytorch:2.6-r36.4.0-cu126-22.04: not found

dustynv moved the JP6 publishing line to cu128 / Ubuntu 24.04 a while
back; the cu126-22.04 / Python 3.10 variant the spike used is no
longer maintained. Switch the workflow default and the orin README's
manual-build example to 2.6-r36.4.0-cu128-24.04.

In-container Python 3.12 is fine: the spike's host-Python-alignment
constraint only mattered for its hand-extracted-onto-host setup, not
for Docker. The CUDA 12.8 runtime runs on JP6.x's CUDA 12.6 driver
under NVIDIA's minor-version compatibility within the 12.x line.
---
 .github/workflows/build-sglang.yml |  8 ++++++--
 sglang/orin/README.md              | 15 ++++++++-------
 2 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/.github/workflows/build-sglang.yml b/.github/workflows/build-sglang.yml
index 6d3c7a0..3d7f020 100644
--- a/.github/workflows/build-sglang.yml
+++ b/.github/workflows/build-sglang.yml
@@ -20,7 +20,7 @@ on:
       orin_base_image:
         description: 'BASE_IMAGE arg for the Orin build'
         required: false
-        default: 'dustynv/pytorch:2.6-r36.4.0-cu126-22.04'
+        default: 'dustynv/pytorch:2.6-r36.4.0-cu128-24.04'
       sglang_version:
         description: 'SGLANG_VERSION override (Orin only)'
         required: false
@@ -37,7 +37,11 @@ env:
   # PyTorch base ships a torch built with USE_DISTRIBUTED=1 and sm_87
   # device code; that's the combination the spike (INFR-68) found was
   # the only one that lets SGLang's import chain succeed on Orin/JP6.
-  ORIN_BASE_IMAGE_DEFAULT: 'dustynv/pytorch:2.6-r36.4.0-cu126-22.04'
+  # cu128-24.04 is the current published JP6 line on Docker Hub (the
+  # earlier cu126-22.04 / Python 3.10 variant the spike used has been
+  # superseded; Python 3.12 in-container is fine — only the spike's
+  # hand-extracted-onto-host setup needed host Python alignment).
+  ORIN_BASE_IMAGE_DEFAULT: 'dustynv/pytorch:2.6-r36.4.0-cu128-24.04'
 
 jobs:
   build:
diff --git a/sglang/orin/README.md b/sglang/orin/README.md
index 350c764..ddb0ac9 100644
--- a/sglang/orin/README.md
+++ b/sglang/orin/README.md
@@ -11,11 +11,11 @@ diverged from upstream.
 |---|---|
 | GPU | Ampere Tegra (`sm_87`) |
 | JetPack | 6.x (R36.4 series) |
-| CUDA | 12.6 |
-| cuDNN | 9.3 |
+| CUDA (container) | 12.8 (runs on JP6.x's 12.6 driver via CUDA minor-version compatibility) |
+| cuDNN | 9.x |
 | L4T base | `r36.4.0` |
-| Python | 3.10 |
-| PyTorch | 2.5–2.6 (`USE_DISTRIBUTED=1`, `TORCH_CUDA_ARCH_LIST=8.7`) |
+| Python (container) | 3.12 |
+| PyTorch | 2.6 (`USE_DISTRIBUTED=1`, `TORCH_CUDA_ARCH_LIST=8.7`) |
 
 ## Pinned version
 
@@ -44,7 +44,7 @@ When the CI image is unavailable or you're iterating on the recipe:
 ```sh
 cd ~/serving/sglang/orin
 docker build \
-  --build-arg BASE_IMAGE=dustynv/pytorch:2.6-r36.4.0-cu126-22.04 \
+  --build-arg BASE_IMAGE=dustynv/pytorch:2.6-r36.4.0-cu128-24.04 \
   --build-arg SGLANG_VERSION=0.5.3 \
   --build-arg SGLANG_VERSION_SPEC=0.5.3 \
   --build-arg IS_SBSA=0 \
@@ -53,8 +53,9 @@ docker build \
 
 (The exact `BASE_IMAGE` tag depends on what dustynv has published at
 the time. Use `dustynv/pytorch` rather than `dustynv/sglang` to avoid
-inheriting their stale 0.4.1 install; we re-install our pinned 0.5.x
-fresh via `install.sh` / `build.sh`.)
+inheriting their stale SGLang install; we re-install our pinned 0.5.x
+fresh via `install.sh` / `build.sh`. Check
+<https://hub.docker.com/r/dustynv/pytorch/tags> for current options.)
 
 ## Files