diff --git a/CLAUDE.md b/CLAUDE.md index 7cdc63b..1a80cec 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -6,7 +6,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co Container builds, deployment recipes, and routing configuration for the LLM serving stack behind InferNode (`llmsrv` / `lucibridge` / `serve-llm.sh`). It does **not** contain the OS runtime (that lives in `infernode-os/infernode`) or the training pipeline (that lives in `pdfinn/infernode-os-llm`). Lifecycle here tracks JetPack releases / SGLang versions / hardware generations, not model training cycles. -The repo is intentionally **public** (changed 2026-05-14) so CI can use the free `ubuntu-24.04-arm` GitHub-hosted runner minutes. Do not commit secrets or credential-shaped files. +The repo is intentionally **public** (changed 2026-05-14) so CI can use the free `ubuntu-24.04-arm` GitHub-hosted runner minutes. Do not commit secrets, credential-shaped files, internal hostnames, or names of confidential workloads that may colocate with this stack on a dev box. Operational policy that's specific to a particular deployment lives in private docs, not here. Work is tracked under the **INFR Jira project's "Productize SGLang serving" epic** (parent: INFR-68). When a file references `INFR-NN`, that's a Jira ticket — see ticket commentary for context Claude can't recover from the tree. @@ -29,16 +29,16 @@ CI is `.github/workflows/build-sglang.yml`, runs on `ubuntu-24.04-arm` (Graviton `workflow_dispatch` exposes `orin_base_image` as an override input — use when exploring an alternate dustynv tag (e.g. for INFR-92's upgrade work). -### Manual Orin build (on Hephaestus dev daemon, or any aarch64 host) +### Manual Orin build (any aarch64 host) ```sh cd sglang/orin -docker --host unix:///run/docker-dev.sock build \ +docker build \ --build-arg BASE_IMAGE=dustynv/sglang:r36.4.0 \ -t serving-sglang:orin-local . ``` -(Drop `--host` on a field-deployment Orin AGX. 
The dev-daemon socket is Hephaestus-specific — see `runbooks/hephaestus-deploy.md` §0.1.) +If your build host runs a dedicated experimental Docker daemon on a non-default socket, prepend `--host unix:///run/.sock` — that's a per-environment concern, not part of the build itself. ### Manual Thor build @@ -51,45 +51,20 @@ docker build -t serving-sglang:thor-local sglang/thor/ The on-hardware validator launches `sglang.launch_server` with TinyLlama, asserts `/health`, asserts the `KV Cache is allocated` startup line, and exercises `/v1/chat/completions`. **Run it after every published-image pull** — CI's build-time guards can only check metadata (no GPU on the GitHub runner): ```sh -docker --host unix:///run/docker-dev.sock run --rm \ +HF_CACHE="${HF_CACHE:-/var/lib/huggingface}" +docker run --rm \ --runtime nvidia --gpus all \ - -v /mnt/orin-ssd/huggingface:/root/.cache/huggingface \ + -v "$HF_CACHE":/root/.cache/huggingface \ -e HF_HOME=/root/.cache/huggingface \ serving-sglang:orin-local \ /opt/sglang/validate-on-hardware.sh ``` -Expected last line: `All on-hardware checks passed.`. Serving smoke takes ~60s cold, ~15s warm. Deploy + healthcheck sequence is in `runbooks/hephaestus-deploy.md`. - -## Hephaestus dual-purpose model (do NOT violate — INFR-90) - -Hephaestus serves two purposes that pull in opposite directions: - -1. **Field-parity reference** — TAK / NERVA / Ollama must run under a Docker daemon whose storage lives on the device's only disk (root partition), exactly as they would on a no-SSD field unit. -2. **Development environment** — experimental work (SGLang, future Jetson experiments) needs the 916 GiB `/mnt/orin-ssd` and would crush the root partition if forced through the production daemon. 
- -A single Docker daemon has one storage root, so the only honest answer is **two daemons**: - -| Daemon | Socket | Storage | Used by | -|---|---|---|---| -| `docker.service` (production) | `/run/docker.sock` | root partition (with the Docker 29 quirk that its image content lives at `/var/lib/containerd` on root, regardless of where `/var/lib/docker` symlinks to) | TAK · NERVA · Ollama · field-parity workloads | -| `docker-dev.service` (experimental) | `/run/docker-dev.sock` | `/mnt/orin-ssd/docker-dev` (own legacy snapshotter, `containerd-snapshotter: false`) | SGLang · any other Jetson experiment | - -**Rules for Claude when working on this host:** - -- For SGLang or any experimental work, use the dev daemon: `docker --host unix:///run/docker-dev.sock` (or set up `alias dev-docker=...`). -- **Never** pull SGLang or other experimental images on the production daemon — its 12-14 GiB image plus working set would crush root. -- **Never** migrate TAK / NERVA / Ollama to the dev daemon — they belong on production for field parity. -- The runbook §0.1 has the full one-time setup if the dev daemon ever needs to be re-created. -- Pre-flight any plan that involves `docker pull`/`docker run` by checking *which* daemon you're hitting. The default `docker` CLI uses the production daemon — easy to forget on Hephaestus. - -### Watch out: `/var/lib/docker` symlink is misleading - -`/var/lib/docker` is symlinked to `/mnt/orin-ssd/docker/docker` on this host, but `/var/lib/containerd` (the actual image content store under Docker 29's containerd integration) lives on root. So the symlink only redirects the daemon's *metadata*, not its *image content*. A pull on the production daemon will land on root regardless of what the symlink suggests. This was discovered the hard way (root went from 12 GiB → 877 MB free during a base-image pull); the dev daemon avoids the trap by disabling the containerd snapshotter entirely. +Expected last line: `All on-hardware checks passed.`. 
Serving smoke takes ~60s cold, ~15s warm. Deploy + healthcheck sequence is in `runbooks/deploy.md`. ## Operational coexistence with Ollama -SGLang **coexists** with Ollama on Hephaestus, it does not replace it. `lucibridge` routes per-request based on tool-category: +SGLang **coexists** with Ollama on the deploy host; it does not replace it. `lucibridge` routes per-request based on tool-category: - `limbo_authoring` → Ollama (Devstral, single-user fluency) - `dispatch` / `tool_call` / `memory` / `task` → SGLang (gpt-oss, concurrent fan-out, xgrammar) @@ -109,14 +84,14 @@ A pre-INFR-79 single-`LLM_BACKEND_URL` config must keep working unchanged. The b | `sglang/orin/{Dockerfile.upstream,install.sh,build.sh,config.py}` | Leftovers from the deprecated framework-vendoring path; not driven by the build. Slated for removal after one release cycle | | `sglang/LICENSE-UPSTREAM.md` | Records the dusty-nv commit SHA at vendoring; consult before re-syncing | | `docs/SGLANG-ADOPTION-NOTES.md` | Spike findings, measured bake-off (SGLang ~78 tok/s @ N=8 vs Ollama's ~23 tok/s plateau), the canonical working recipe | -| `runbooks/hephaestus-deploy.md` | End-to-end Hephaestus deploy; acceptance gate is "new contributor brings SGLang up in <15 min" | +| `runbooks/deploy.md` | End-to-end deploy guide for a clean Orin AGX; acceptance gate is "new contributor brings SGLang up in <15 min" | | `runbooks/lucibridge-routing.md` | Routing config schema + per-tool / per-category mapping | ## Conventions worth knowing -- **Launch arguments live in the runbook, not the image.** No `CMD` / `ENTRYPOINT` is baked into the production Dockerfiles so the same image can serve different model paths without rebuild. When changing launch flags, update `runbooks/hephaestus-deploy.md` §3 and §5 (systemd unit) together. 
+- **Launch arguments live in the runbook, not the image.** No `CMD` / `ENTRYPOINT` is baked into the production Dockerfiles so the same image can serve different model paths without rebuild. When changing launch flags, update `runbooks/deploy.md` §3 and §5 (systemd unit) together. - **Tokenizer path is mandatory for Llama-3 family, not for the TinyLlama smoke.** SGLang 0.4's GGUF tokenizer doesn't register Llama-3 special tokens; `--tokenizer-path /opt/tokenizers/llama-3.1` (baked by `bake-tokenizers.sh`) is the fix. TinyLlama uses Llama-2's tokenizer and doesn't need the override. See INFR-78. - **`TORCHDYNAMO_DISABLE=1 TORCH_COMPILE_DISABLE=1` and `--disable-cuda-graph` are required on Jetson** — `torch.compile` is broken at-import in this Triton 3.2 + Torch 2.5 combo (it pulls in `torch._inductor.runtime.hints` which calls `fields(AttrsDescriptor)` on a non-dataclass), and CUDA-graph capture is unreliable on Tegra. The two env vars are baked into the image; don't remove without re-running the validator. - **The dataclasses backport must stay removed.** dustynv ships a Python-2-era `dataclasses.py` in `/usr/local/lib/python3.10/dist-packages/` that shadows the stdlib. The Dockerfile moves it to `_disabled_backports/` and asserts the stdlib is now what resolves. If a future base-image bump re-introduces it, the same pattern works. -- **CI guards are metadata-only; `validate-on-hardware.sh` is the real gate.** GitHub-hosted runners have no GPU, so the in-Dockerfile guards can only check torch+CUDA metadata, triton import, and the sglang version pin. Real correctness lives in the on-hardware validator's serving smoke. Always run the validator after pulling a new tag on Hephaestus. +- **CI guards are metadata-only; `validate-on-hardware.sh` is the real gate.** GitHub-hosted runners have no GPU, so the in-Dockerfile guards can only check torch+CUDA metadata, triton import, and the sglang version pin. 
Real correctness lives in the on-hardware validator's serving smoke. Always run the validator after pulling a new tag onto the deploy hardware. - **NGC pulls are anonymous by default.** The `NGC_API_KEY` step in CI is forward-compatible scaffolding for any future gated tag — don't make it a hard requirement. diff --git a/README.md b/README.md index 9259ab7..d73982b 100644 --- a/README.md +++ b/README.md @@ -48,7 +48,7 @@ serving/ │ ├── orin/ pinned config for Orin (sm_87) │ └── thor/ pinned config for Thor (sm_103) — when we have a Thor box ├── runbooks/ -│ ├── hephaestus-deploy.md +│ ├── deploy.md │ └── ... └── .github/workflows/ └── build-sglang.yml GitHub-hosted ubuntu-24.04-arm runner @@ -66,11 +66,11 @@ SBSA). Native aarch64 — no QEMU — and `nvcc` cross-compiles for sm_87 (Orin) and sm_103 (Thor) via `TORCH_CUDA_ARCH_LIST` at build time. The output is a Jetson-Tegra-targeted image; GitHub's hosted runners have no Jetson hardware, so end-to-end smoke testing (CUDA -paths actually executing) happens manually on Hephaestus after each -successful CI build. **Hephaestus must never be configured as a -GitHub self-hosted runner**, and the remote CI must not hold any -secrets that link back to the device — the public CI surface stays -strictly isolated from the dev box. +paths actually executing) happens manually on Jetson hardware after +each successful CI build. **The on-hardware validation host must +never be configured as a GitHub self-hosted runner**, and the remote +CI must not hold any secrets that link back to it — the public CI +surface stays strictly isolated from on-hardware validation. The `sglang/` subtree is intended to vendor the canonical [`dusty-nv/jetson-containers`](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/sglang) @@ -99,32 +99,16 @@ the model registry). 
See `docs/SGLANG-ADOPTION-NOTES.md` for: - The spike attempts that didn't work and why (PyTorch wheel `USE_DISTRIBUTED` vs `sm_87` constraint) -- The working recipe (extract dustynv container via crane onto - orin-ssd, host Python 3.10 with patched `LD_LIBRARY_PATH`) +- The working recipe (extract dustynv container via crane onto a + fast local disk, host Python 3.10 with patched `LD_LIBRARY_PATH`) - Measured bake-off results — SGLang vs Ollama on Llama-3.1-8B (TL;DR: SGLang scales to ~78 tok/s at N=8 concurrent vs Ollama's ~23 tok/s plateau; tied at single-user) -- Operations: where things live on Hephaestus, start/stop, verify +- Operations: on-host layout, start/stop, verify - Known gaps: SGLang 0.4.1's GGUF tokenizer doesn't recognize Llama 3 special tokens (needs HF tokenizer dir); no `gpt_oss.py` in 0.4.1's model registry (needs SGLang 0.5.x bump) -## Hephaestus disk policy (important for the build path) - -Hephaestus is the Jetson Orin AGX dev box. Its root partition is -**deliberately constrained** to emulate a production single-disk -node (OS + TAK + NERVA via Docker + Ollama binary). The 916 GB -`/mnt/orin-ssd` is the dev indulgence; serving-spike artifacts live -there. - -**Do not migrate Docker / containerd state from root to orin-ssd.** -That would move TAK/NERVA images onto the dev-only disk and break -the production emulation. The Jetson-container build needs to either -fit in root partition's residual space, or use a daemonless build -path (we did extraction-only via -[`crane`](https://github.com/google/go-containerregistry) on the -spike; expect to reuse). 
- ## Tracking Work in this repo is tracked under the **INFR Jira project's diff --git a/docs/SGLANG-ADOPTION-NOTES.md b/docs/SGLANG-ADOPTION-NOTES.md index 2866498..2cc5a2e 100644 --- a/docs/SGLANG-ADOPTION-NOTES.md +++ b/docs/SGLANG-ADOPTION-NOTES.md @@ -4,11 +4,11 @@ **Date:** 2026-05-13 (initial) — 2026-05-14 (working recipe + bake-off) **Owning ticket:** [INFR-68](https://nervsystems-team.atlassian.net/browse/INFR-68) — Spike: SGLang on Jetson Orin for LoRA-native multi-model serving -**Status:** **Measured + working recipe documented.** SGLang server runs on Hephaestus; bake-off complete (SGLang scales ~3× under concurrent load vs Ollama). Productization tracked under the "Productize SGLang serving" epic in INFR. +**Status:** **Measured + working recipe documented.** SGLang server runs on a Jetson Orin AGX dev box; bake-off complete (SGLang scales ~3× under concurrent load vs Ollama). Productization tracked under the "Productize SGLang serving" epic in INFR. This document captures what would have to change in IOL's training and deployment pipelines if InferNode swaps (or doubles up) Ollama with -SGLang as the Hephaestus inference backend. It is **deliberately +SGLang as the on-device inference backend. It is **deliberately written before any experiment**, so the first reality-check round of the spike will probably contradict some of it. Treat as a working hypothesis, not a recipe. 
@@ -30,7 +30,7 @@ Current pipeline (per `docs/V3-DEPLOY-RUNBOOK.md`): PEFT LoRA checkpoint → merge_to_gguf.py (merge + convert + Q4_K_M quantize) → GGUF + Modelfile - → rsync to Hephaestus + → rsync to deploy host → ollama create -f Modelfile → /v1/chat/completions ``` @@ -39,7 +39,7 @@ With SGLang the destination accepts the PEFT checkpoint **directly**: ``` PEFT LoRA checkpoint - → rsync to Hephaestus (just the adapter dir; tens of MB, not 13 GB) + → rsync to the deploy host (just the adapter dir; tens of MB, not 13 GB) → sglang.launch_server --model-path --lora-paths → /v1/chat/completions ``` @@ -167,7 +167,7 @@ to change there. The post-training step in the deploy runbook If we keep both stacks, this becomes a Make target fan-out: - `make deploy-ollama` — existing GGUF pipeline (current) -- `make deploy-sglang` — rsync adapter dir to Hephaestus + restart SGLang +- `make deploy-sglang` — rsync adapter dir to the deploy host + restart SGLang No new Make scripts to author per the project's existing "describe inline, don't wrap" preference; this is one rsync. @@ -220,7 +220,7 @@ SGLang's `/v1/chat/completions` is OpenAI-compatible, so the eval is literally: ```sh -make eval-baseline MODEL=devstral-limbo-v3 BASE_URL=http://hephaestus.lan:30000/v1 +make eval-baseline MODEL=devstral-limbo-v3 BASE_URL=http://:30000/v1 ``` `tools/virgil-agent/scenarios/*.yaml` should also work as-is. 
@@ -228,8 +228,8 @@ make eval-baseline MODEL=devstral-limbo-v3 BASE_URL=http://hephaestus.lan:30000/ Bake-off harness for the spike: ```sh -make eval-baseline MODEL=devstral-limbo-v3 BASE_URL=http://hephaestus.lan:11434/v1 # Ollama -make eval-baseline MODEL=devstral-limbo-v3 BASE_URL=http://hephaestus.lan:30000/v1 # SGLang +make eval-baseline MODEL=devstral-limbo-v3 BASE_URL=http://:11434/v1 # Ollama +make eval-baseline MODEL=devstral-limbo-v3 BASE_URL=http://:30000/v1 # SGLang make eval-grounding INPUT=eval/runs/.gated.jsonl make eval-grounding INPUT=eval/runs/.gated.jsonl ``` @@ -276,13 +276,13 @@ real numbers. ## What this does not address - **Cost of running two stacks.** If both Ollama and SGLang stay - resident on Hephaestus we lose the multi-LoRA memory win. Either + resident on the deploy host we lose the multi-LoRA memory win. Either pick one or split Veltro's two production roles across them (which the V4-PLAN routing already implies). - **Veltro Limbo-authoring path.** Compile-gate is unchanged; this is a serve-time discussion. - **Build/CI.** None of this touches the IOL CI surface. The compile - gate keeps running on Docker x86; the spike happens on Hephaestus. + gate keeps running on Docker x86; the spike happens on the Jetson dev box. - **Veltro security / namespace model.** Untouched. --- @@ -290,7 +290,7 @@ real numbers. ## TL;DR for the spike runner 1. Validate AWQ-INT4 quantization runs on `sm_87` with whatever JetPack - PyTorch/Triton combo Hephaestus is on **first**. Stop if it doesn't. + PyTorch/Triton combo the dev box is on **first**. Stop if it doesn't. 2. Validate SGLang loads the existing gpt-oss-limbo-v3 PEFT adapter on its MoE base **second**. If it doesn't, decide between repack / retrain-attn-only / merge-and-quantize before continuing. @@ -302,7 +302,7 @@ real numbers. ## Spike attempt — 2026-05-13/14 (measured) -First real attempt on Hephaestus. Result: **parked at the PyTorch +First real attempt on the Jetson dev box. 
Result: **parked at the PyTorch layer; no prebuilt wheel exists that satisfies all of SGLang's runtime requirements on Jetson Orin AGX (`sm_87`).** This is a genuine spike-finding — not in the pre-spike risk list above — so promoted to @@ -310,17 +310,17 @@ its own section. ### What worked -- Hephaestus baseline: JetPack 6.2 (L4T R36.4.7) / CUDA 12.6 / cuDNN - 9.3.0 / Orin `sm_87` / 64 GB unified memory / 916 GB orin-ssd. -- Disk reclamation on orin-ssd: ~129 GB freed (HF xet cache, conda pkg +- Dev-box baseline: JetPack 6.2 (L4T R36.4.7) / CUDA 12.6 / cuDNN + 9.3.0 / Orin `sm_87` / 64 GB unified memory / 916 GB working SSD. +- Disk reclamation on the working SSD: ~129 GB freed (HF xet cache, conda pkg cache, two F5-TTS conda envs, three unused HF model caches) without touching anything load-bearing (Devstral, gpt-oss-20b, all daedalus - artifacts, NERVA TAK models, swapfile, IOL repo). orin-ssd 97% → 82%. -- Conda env at `/mnt/orin-ssd/pdfinn/conda-envs/sglang-spike` + artifacts, other resident workloads, swapfile, IOL repo). 97% → 82%. +- Conda env at `${WORK}/conda-envs/sglang-spike` (Python 3.10.20). cuDNN-9-aligned PyTorch installed and verified (NVIDIA's `torch-2.5.0a0+nv24.08`, then upstream `torch-2.6.0+cu126`). - **`daedalus-v1` structural validation**: the v4 Devstral PEFT - adapter at `/mnt/orin-ssd/pdfinn/daedalus-v1/checkpoint-stripped/` + adapter at `${WORK}/daedalus-v1/checkpoint-stripped/` is canonical-shape — peft 0.18.1 LoRA, r=32, α=64, all 7 target modules (q/k/v/o + gate/up/down), 560 tensors covering all 40 layers, 184.8M trainable params. **Would load into SGLang's LoRA adapter @@ -351,7 +351,7 @@ Available prebuilt PyTorch wheels on aarch64+CUDA: In other words: SGLang on Jetson Orin AGX requires a custom PyTorch source build with **both** `USE_DISTRIBUTED=1` **and** -`TORCH_CUDA_ARCH_LIST=8.7`. The existing `/mnt/orin-ssd/pytorch-build/` +`TORCH_CUDA_ARCH_LIST=8.7`. 
The existing `${WORK}/pytorch-build/` (7.1 GB, cp311 + torch 2.1.0a0) appears to be an earlier attempt at this same exercise. @@ -394,7 +394,7 @@ apply: ### What this means for INFR-68 The spike's preregistered goal "does SGLang serve a model on -Hephaestus" got refined into "does SGLang's `import` chain pass on a +Jetson Orin" got refined into "does SGLang's `import` chain pass on a working CUDA-on-`sm_87` PyTorch". Answer: not with anything off the shelf today. Reopen criterion sharpens to: @@ -406,7 +406,7 @@ system library. Once that wheel exists, the remaining install steps in §"Required workarounds" above carry the runtime to `import sglang` in well under an hour. -The existing `/mnt/orin-ssd/pytorch-build/` is the historical seed for +The existing `${WORK}/pytorch-build/` is the historical seed for this work — keep it (the user wisely declined to purge during the spike). A clean restart would target cp310 + torch 2.6 + cu126 + sm_87. @@ -423,9 +423,9 @@ spike). A clean restart would target cp310 + torch 2.6 + cu126 - `daedalus-v1` itself. We confirmed the artifact is well-formed; it will deploy on anything that loads a PEFT adapter. -### Environment artifacts on Hephaestus (preserved) +### Environment artifacts (preserved on the dev box) -- Conda env: `/mnt/orin-ssd/pdfinn/conda-envs/sglang-spike` — SGLang +- Conda env: `${WORK}/conda-envs/sglang-spike` — SGLang stack installed minus the PyTorch issue. Reusable for the next attempt by just `pip install --force-reinstall` a working torch. - Patched files (vs upstream sglang 0.5.11 wheel): @@ -439,7 +439,7 @@ spike). A clean restart would target cp310 + torch 2.6 + cu126 ## Spike attempt 2 — 2026-05-14 (working) -**Result: SGLang is live on Hephaestus**, serving an OpenAI-compatible +**Result: SGLang is live on the Jetson Orin dev box**, serving an OpenAI-compatible `/v1/chat/completions` endpoint on port 30000 with TinyLlama-1.1B as the smoke target. 
End-to-end inference proven: `"2+2=" → "The answer to the question is 4"` in 8 tokens. @@ -462,49 +462,51 @@ dustynv/sglang:r36.4.0 7.8 GB (Ubuntu 22.04 / Python 3.10 / CUDA 12.6 / sm_8 ``` Important: the **`-24.04` variants of the dustynv images use Python -3.12**, which does *not* match Hephaestus's host Python 3.10. Use the +3.12**, which does *not* match the dev box's host Python 3.10. Use the plain `r36.4.0` tag (Ubuntu 22.04) to align with the host. -### Honouring the Hephaestus disk policy +### Daemonless extraction (why) -Per `[[hephaestus-disk-policy]]`: root partition emulates a -production single-disk node (OS + TAK/NERVA via Docker + Ollama). -`/mnt/orin-ssd` is the dev indulgence. So **we do NOT relocate -Docker's containerd storage** — that would migrate TAK/NERVA images -off the production-emulating disk. Instead, we extract the dustynv -container as files onto orin-ssd, never invoking Docker's daemon. +We extract the dustynv container as files onto the working SSD via +`crane`, never invoking Docker's daemon. This keeps the container +artifacts off whatever disk the host's Docker daemon stores images +on, which matters when the dev box's root partition is intentionally +constrained for unrelated reasons. ### The recipe (reproducible) +`${WORK}` below is whatever working directory you keep this stack +under (e.g. a fast local SSD with ample free space). + ```sh # 1. 
crane (daemonless OCI puller, single static binary) -mkdir -p /mnt/orin-ssd/pdfinn/bin /mnt/orin-ssd/pdfinn/scratch -cd /mnt/orin-ssd/pdfinn/scratch +mkdir -p ${WORK}/bin ${WORK}/scratch +cd ${WORK}/scratch LATEST=$(curl -sS "https://api.github.com/repos/google/go-containerregistry/releases/latest" \ | python3 -c "import sys,json; print(json.load(sys.stdin)['tag_name'])") curl -sSL -o crane.tar.gz \ "https://github.com/google/go-containerregistry/releases/download/${LATEST}/go-containerregistry_Linux_arm64.tar.gz" -tar -xzf crane.tar.gz -C /mnt/orin-ssd/pdfinn/bin crane -chmod +x /mnt/orin-ssd/pdfinn/bin/crane +tar -xzf crane.tar.gz -C ${WORK}/bin crane +chmod +x ${WORK}/bin/crane # 2. Pull the Jetson SGLang container as an OCI layout (no Docker daemon) -/mnt/orin-ssd/pdfinn/bin/crane pull --format=oci \ +${WORK}/bin/crane pull --format=oci \ dustynv/sglang:r36.4.0 \ - /mnt/orin-ssd/pdfinn/scratch/sglang-r36.4.0-oci + ${WORK}/scratch/sglang-r36.4.0-oci -# 3. Extract all layers in manifest order to a merged rootfs on orin-ssd -cd /mnt/orin-ssd/pdfinn/scratch/sglang-r36.4.0-oci +# 3. Extract all layers in manifest order to a merged rootfs on the working SSD +cd ${WORK}/scratch/sglang-r36.4.0-oci MANIFEST=$(python3 -c "import json; print(json.load(open('index.json'))['manifests'][0]['digest'].split(':',1)[1])") -mkdir -p /mnt/orin-ssd/pdfinn/scratch/sglang-rootfs +mkdir -p ${WORK}/scratch/sglang-rootfs python3 -c "import json; m=json.load(open('blobs/sha256/$MANIFEST')); \ [print(l['digest'].split(':',1)[1]) for l in m['layers']]" | \ while read d; do tar --no-same-owner --warning=no-unknown-keyword -xzf "blobs/sha256/$d" \ - -C /mnt/orin-ssd/pdfinn/scratch/sglang-rootfs 2>/dev/null || true + -C ${WORK}/scratch/sglang-rootfs 2>/dev/null || true done # 4. 
Remove the stdlib-shadowing dataclasses backport (Python-2-era cruft) -DP=/mnt/orin-ssd/pdfinn/scratch/sglang-rootfs/usr/local/lib/python3.10/dist-packages +DP=${WORK}/scratch/sglang-rootfs/usr/local/lib/python3.10/dist-packages mkdir -p $DP/_disabled_backports mv $DP/dataclasses.py $DP/_disabled_backports/ mv $DP/__pycache__/dataclasses.* $DP/_disabled_backports/ 2>/dev/null || true @@ -517,7 +519,7 @@ so the host PYTHONPATH must include both the dist-packages and the extracted source tree: ```sh -ROOTFS=/mnt/orin-ssd/pdfinn/scratch/sglang-rootfs +ROOTFS=${WORK}/scratch/sglang-rootfs export PYTHONNOUSERSITE=1 export PYTHONPATH=$ROOTFS/workspace/sglang/python:$ROOTFS/usr/local/lib/python3.10/dist-packages # Critical: Tegra driver shim FIRST in LD_LIBRARY_PATH, else "CUDA unknown error" @@ -526,7 +528,7 @@ export LD_LIBRARY_PATH=/usr/lib/aarch64-linux-gnu/tegra:/usr/local/cuda-12.6/lib # torch._inductor at import. We don't need compile for serving. export TORCHDYNAMO_DISABLE=1 export TORCH_COMPILE_DISABLE=1 -export HF_HOME=/mnt/orin-ssd/huggingface +export HF_HOME=${WORK}/huggingface ``` ### Launch command (working) @@ -592,17 +594,17 @@ This shifts the **immediate next achievable target** from "gpt-oss LoRA unblock" to "Devstral + daedalus-v1 LoRA bake-off" — which is served by 0.4.1.post7's AWQ/GPTQ/bf16 paths without issue. -### Environment artifacts on Hephaestus (preserved) +### Environment artifacts (preserved on the dev box) -- `/mnt/orin-ssd/pdfinn/scratch/sglang-rootfs/` — extracted container +- `${WORK}/scratch/sglang-rootfs/` — extracted container rootfs (18 GB). The "installation." **Keep.** -- `/mnt/orin-ssd/pdfinn/bin/crane` — daemonless OCI puller, kept for +- `${WORK}/bin/crane` — daemonless OCI puller, kept for future container extractions on this Jetson. **Keep.** -- `/mnt/orin-ssd/pdfinn/scratch/sglang-logs/` — proof-of-life and +- `${WORK}/scratch/sglang-logs/` — proof-of-life and bake-off run logs. Small. 
**Keep.** -- `/mnt/orin-ssd/pdfinn/scratch/sglang-r36.4.0-oci/` — original OCI +- `${WORK}/scratch/sglang-r36.4.0-oci/` — original OCI layout (7.3 GB). **Removed** after `sglang-rootfs/` verified working. -- `/mnt/orin-ssd/pdfinn/conda-envs/sglang-spike/` — attempt-1 env that +- `${WORK}/conda-envs/sglang-spike/` — attempt-1 env that didn't lead anywhere. **Removed.** --- @@ -613,9 +615,9 @@ served by 0.4.1.post7's AWQ/GPTQ/bf16 paths without issue. throughput under concurrent load. At single-request latency they are roughly tied (Ollama slightly faster).** -Both stacks ran simultaneously on Hephaestus, serving the **same** +Both stacks ran simultaneously on the dev box, serving the **same** Llama-3.1-8B Q4_K_M GGUF blob (Ollama's own model store, path -`/mnt/orin-ssd/ollama/models/blobs/sha256-667b0c1932…`). SGLang loaded +`/blobs/sha256-667b0c1932…`). SGLang loaded the blob via `--load-format gguf --quantization gguf`. Total resident memory across both servers ≈ 18 GiB on the 64 GiB unified Jetson — they coexist comfortably. @@ -645,7 +647,7 @@ curl -sS -X POST http://127.0.0.1:11434/api/generate \ "stream":false,"keep_alive":"30m"}' # Launch SGLang against the SAME GGUF blob -BLOB=/mnt/orin-ssd/ollama/models/blobs/sha256-667b0c1932bc6ffc593ed1d03f895bf2dc8dc6df21db3042284a6f4416b06a29 +BLOB=/blobs/sha256-667b0c1932bc6ffc593ed1d03f895bf2dc8dc6df21db3042284a6f4416b06a29 # (env vars per Operations section above) python3 -m sglang.launch_server \ --model-path "$BLOB" \ @@ -791,6 +793,6 @@ After the run: - SGLang server: stopped (`kill -TERM` via port-30000 ownership). - Ollama: `keep_alive=0` POST dropped Llama-3.1-8B; daemon left up, no models loaded. -- Extracted SGLang rootfs at `/mnt/orin-ssd/pdfinn/scratch/sglang-rootfs/` +- Extracted SGLang rootfs at `${WORK}/scratch/sglang-rootfs/` preserved for the next session — just relaunch per the Operations section above. 
diff --git a/runbooks/hephaestus-deploy.md b/runbooks/deploy.md similarity index 63% rename from runbooks/hephaestus-deploy.md rename to runbooks/deploy.md index b767c1d..f863d20 100644 --- a/runbooks/hephaestus-deploy.md +++ b/runbooks/deploy.md @@ -1,30 +1,22 @@ -# Hephaestus deploy runbook — SGLang on Jetson Orin AGX +# Deploy runbook — SGLang on Jetson Orin AGX End-to-end runbook for deploying the GHCR-built SGLang container on -**Hephaestus** (Jetson Orin AGX, JetPack 6.x, 64 GiB unified memory) -and wiring it into the existing `serve-llm.sh` / `lucibridge` -operational model. +a Jetson Orin AGX (JetPack 6.x, 64 GiB unified memory) and wiring it +into the `serve-llm.sh` / `lucibridge` operational model. **Owning epic:** INFR-73. **Acceptance gate:** a new contributor can -bring SGLang up on Hephaestus in under 15 minutes following this +bring SGLang up on a clean Orin AGX in under 15 minutes following this document. --- ## 0. Prereqs -This runbook covers two deployment shapes: +This runbook covers the **field-deployment shape**: a vanilla Jetson +Orin AGX with a single disk and a single Docker daemon (the standard +system one). -* **Field deployment** — a vanilla Jetson Orin AGX with a single disk - and a single Docker daemon (the standard system one). This is the - intended end-user path. Most readers should follow this. -* **Hephaestus (dual-purpose dev box)** — both a field-parity reference - *and* a development environment. Has a 916 GiB `/mnt/orin-ssd` in - addition to the root partition, and runs a **second** Docker daemon - (`docker-dev.service`) on the SSD specifically for experimental work - (SGLang, anything else that would otherwise crush root). See §0.1. 
- -### Hardware + driver prereqs (both shapes) +### Hardware + driver prereqs ```sh cat /etc/nv_tegra_release | head -1 # expect R36, REVISION: 4.x @@ -32,167 +24,28 @@ nvidia-smi # expect Orin / CUDA 12.6+ docker info | grep -iE 'server version|runtimes' # expect 24+ and nvidia runtime registered ``` -The `nvidia` runtime must be registered with whichever Docker daemon +The `nvidia` runtime must be registered with the Docker daemon that will run SGLang. Verify with `docker info --format '{{.Runtimes}}'` — output should include `nvidia`. If missing, install `nvidia-container-toolkit` and re-add `"runtimes": {"nvidia": ...}` -to that daemon's `daemon.json`. +to the daemon's `daemon.json`. -### Field deployment (single-disk Orin AGX) +### Disk space -The standard system Docker daemon's storage on `/var/lib/docker`. The -~12-14 GiB SGLang image + runtime data all live there. Verify space: +The standard system Docker daemon stores images at `/var/lib/docker`. +The ~12-14 GiB SGLang image + runtime data all live there. Verify +space: ```sh df -h / # need ≥20 GiB free for image + working set ``` -Skip §0.1 entirely; jump to §1. - -### Hephaestus disk policy (load-bearing — do NOT violate on this host) - -Hephaestus serves two purposes simultaneously: **(a)** field-parity -reference for TAK / NERVA / Ollama (which must stay on root partition, -exactly as they would on a no-SSD field unit), and **(b)** development -environment for experimental work. Because a single Docker daemon has -exactly one storage root, the only way to satisfy both is **two -daemons**. 
- -| Daemon | Socket | Storage | Used by | -|---|---|---|---| -| `docker.service` (production) | `/run/docker.sock` | root partition (`/var/lib/docker` symlinked to `/mnt/orin-ssd/docker/docker`, but containerd content store at `/var/lib/containerd` lives on root — both halves end up landing pulled images on root via the shared system containerd) | TAK · NERVA · Ollama · anything mirroring field | -| `docker-dev.service` (experimental, see INFR-90) | `/run/docker-dev.sock` | `/mnt/orin-ssd/docker-dev` (own legacy snapshotter, `containerd-snapshotter: false`, completely off the shared containerd path) | SGLang · any other Jetson experiment | - -**Do not** migrate TAK / NERVA / Ollama to the dev daemon — they belong -on production for field parity. **Do not** install SGLang on the -production daemon — its 12-14 GiB image plus working set would crush the -intentionally-constrained root partition. Use `docker-dev.service` for -all of §1-§7 below. - -See §0.1 for one-time dev-daemon setup if it isn't running yet. - ---- - -## 0.1 Dev-daemon setup (Hephaestus only — one-time, INFR-90) - -Skip this entire section on a field-deployment Orin AGX. - -If `systemctl is-active docker-dev` returns `active` and -`DOCKER_HOST=unix:///run/docker-dev.sock docker info` succeeds, the -dev daemon is already up — skip ahead to §1. - -Otherwise, set it up: - -1. 
Create `/etc/docker/daemon-dev.json`: - - ```json - { - "data-root": "/mnt/orin-ssd/docker-dev", - "exec-root": "/var/run/docker-dev", - "pidfile": "/run/docker-dev.pid", - "hosts": ["unix:///run/docker-dev.sock"], - "bridge": "docker1", - "default-address-pools": [{ "base": "172.31.0.0/16", "size": 24 }], - "features": { "containerd-snapshotter": false }, - "runtimes": { - "nvidia": { "args": [], "path": "nvidia-container-runtime" } - }, - "log-driver": "json-file", - "log-opts": { "max-size": "100m", "max-file": "3" } - } - ``` - - `containerd-snapshotter: false` is critical — it puts image content - under `data-root` instead of the shared `/var/lib/containerd` on - root. Without this, image pulls on the dev daemon would still land - on root via the shared containerd, defeating the whole point. - -2. Create `/etc/systemd/system/docker-dev.service`: - - ```ini - [Unit] - Description=Docker Application Container Engine (dev / SSD-rooted) - Documentation=https://docs.docker.com - RequiresMountsFor=/mnt/orin-ssd - After=network-online.target nss-lookup.target containerd.service docker.service - Wants=network-online.target containerd.service - StartLimitBurst=3 - StartLimitIntervalSec=60 - - [Service] - Type=notify - ExecStart=/usr/bin/dockerd --config-file=/etc/docker/daemon-dev.json - ExecReload=/bin/kill -s HUP $MAINPID - TimeoutStartSec=0 - TimeoutStopSec=120s - RestartSec=2 - Restart=always - LimitNOFILE=infinity - LimitNPROC=infinity - LimitCORE=infinity - TasksMax=infinity - Delegate=yes - KillMode=process - OOMScoreAdjust=-500 - - [Install] - WantedBy=multi-user.target - ``` - -3. 
Add a drop-in for the bridge — Docker refuses to start with a - non-default bridge name unless that bridge already exists in the - kernel, and kernel bridges don't persist across reboots: - - ```sh - sudo mkdir -p /etc/systemd/system/docker-dev.service.d - sudo tee /etc/systemd/system/docker-dev.service.d/bridge.conf <<'EOF' - [Service] - ExecStartPre=/bin/sh -c 'ip link show docker1 >/dev/null 2>&1 || ip link add docker1 type bridge' - ExecStartPre=/bin/sh -c 'ip link set docker1 up' - EOF - ``` - -4. Enable + start: - - ```sh - sudo systemctl daemon-reload - sudo systemctl enable --now docker-dev.service - ``` - -5. Add a shell helper so you don't have to remember the socket path: - - ```sh - echo "alias dev-docker='docker --host unix:///run/docker-dev.sock'" >> ~/.bashrc - source ~/.bashrc - ``` - -6. Verify: - - ```sh - dev-docker info --format '{{.ServerVersion}} | {{.DockerRootDir}} | runtimes: {{.Runtimes}}' - # Expect: 29.x | /mnt/orin-ssd/docker-dev | runtimes: ... nvidia ... - ``` - -The bridge subnet `172.31.0.0/16` was picked to avoid colliding with -prod's `docker0` (172.17), TAK (172.18), NERVA (172.19), TBL4 -(172.20-21), ZeroTier (10.243), and LAN (192.168.1). If you have other -networks on the host, audit with `ip route` and pick a free /16. - --- ## 1. Pull the container GHCR images are public; no docker login required. -**On Hephaestus**, prefix every `docker` command in this section -through to §7 with `dev-docker` (or `docker --host -unix:///run/docker-dev.sock`) so it hits the dev daemon, not the -production one. **On a field-deployment Orin AGX**, use plain -`docker`. - -For convenience the snippets below use plain `docker`; mentally -substitute `dev-docker` if you're on Hephaestus. - ```sh # For end-user trials and dev: use :orin-latest (always points at the # tip of main). 
@@ -205,15 +58,14 @@ IMAGE='ghcr.io/infernode-os/serving-sglang:orin-latest' docker pull "$IMAGE" docker images "$IMAGE" --format '{{.Repository}}:{{.Tag}} {{.Size}}' # Expect: ~12–14 GB (CUDA + cuDNN + PyTorch wheels + SGLang + sgl-kernel). -# Verify residual headroom — on field deploy this is on root partition; -# on Hephaestus this is on /mnt/orin-ssd via the dev daemon. + +# Verify residual headroom on the disk that backs Docker's data-root. df -h / -df -h /mnt/orin-ssd # Hephaestus only ``` -If the root partition is tight, `docker image prune` old SGLang images -**only** (`docker images "ghcr.io/infernode-os/serving-sglang" --quiet | tail -n +3`) -— do not prune TAK/NERVA images. +If the disk is tight, `docker image prune` old SGLang images **only** +(`docker images "ghcr.io/infernode-os/serving-sglang" --quiet | tail -n +3`) +— do not prune unrelated workload images that share the daemon. --- @@ -225,8 +77,7 @@ TinyLlama serve through `/v1/chat/completions`, asserting the `KV Cache is allocated` startup line appears. ```sh -HF_CACHE=/mnt/orin-ssd/huggingface # Hephaestus dev daemon -# HF_CACHE=/var/lib/huggingface # field-deployment Orin AGX +HF_CACHE="${HF_CACHE:-/var/lib/huggingface}" mkdir -p "$HF_CACHE" docker run --rm \ @@ -255,16 +106,15 @@ either pull a different tag or rebuild from `sglang/orin/` (see ## 3. Launch (one-shot, for testing) The launch invocation that produced the bake-off in -`docs/SGLANG-ADOPTION-NOTES.md`, adapted for the GHCR image. Bind-mounts -keep the HF cache on `/mnt/orin-ssd` (disk policy §0). +`docs/SGLANG-ADOPTION-NOTES.md`, adapted for the GHCR image. 
**TinyLlama smoke** — the same payload the validator runs, but interactive so you can watch the logs and curl by hand: ```sh MODEL=TinyLlama/TinyLlama-1.1B-Chat-v1.0 # smoke target — uses Llama-2 tokenizer; no --tokenizer-path needed -HF_CACHE=/mnt/orin-ssd/huggingface -LOGS=/mnt/orin-ssd/pdfinn/scratch/sglang-logs +HF_CACHE="${HF_CACHE:-/var/lib/huggingface}" +LOGS="${LOGS:-/var/log/sglang}" mkdir -p "$HF_CACHE" "$LOGS" docker run --rm -it \ @@ -284,7 +134,7 @@ docker run --rm -it \ 2>&1 | tee "$LOGS/sglang-$(date +%s).log" ``` -**Llama-3 family launch** — for any Llama-3.x model (e.g. `unsloth/Meta-Llama-3.1-8B-Instruct` or a Llama-3-GGUF blob from Ollama's store), add the tokenizer override: +**Llama-3 family launch** — for any Llama-3.x model (e.g. `unsloth/Meta-Llama-3.1-8B-Instruct`), add the tokenizer override: ``` --tokenizer-path /opt/tokenizers/llama-3.1 \ @@ -392,7 +242,7 @@ MODEL_PATH=unsloth/Meta-Llama-3.1-8B-Instruct TOKENIZER_PATH=/opt/tokenizers/llama-3.1 PORT=30000 MEM_FRACTION_STATIC=0.5 -HF_CACHE=/mnt/orin-ssd/huggingface +HF_CACHE=/var/lib/huggingface ``` Activate: @@ -420,14 +270,14 @@ for the v1 cut, drawn from the spike's measured working set: | Ollama + one resident model (Devstral GGUF Q4) | ~14 GiB | `OLLAMA_KEEP_ALIVE` default | | SGLang model weights (Llama-3.1-8B Q4 GGUF, v1) | ~5 GiB | bake-off shape; INFR-92 swaps in gpt-oss-20b at ~13 GiB | | SGLang KV cache | ~9 GiB at 4096 ctx, 16 running | `--mem-fraction-static 0.5` cap; the validator's TinyLlama smoke logs `K=8.8 GB V=8.8 GB` | -| Headroom | ~25–33 GiB | TAK / NERVA / NN bursts | +| Headroom | ~25–33 GiB | other resident workloads, bursts | Tunables: * **`--mem-fraction-static`** = fraction of GPU memory SGLang pre-allocates for weights + activations. `0.5` is the spike default (32 GiB cap). Raise to `0.6` if Ollama is dropped from the box; - drop to `0.4` if TAK/NERVA models grow. + drop to `0.4` if other resident models grow. * **`--max-running-requests`** = concurrency cap. 
`16` was the spike default; tail latency blew out at N=16 due to starvation. For Veltro's expected fan-out (≤8 concurrent), set `--max-running-requests 8`. @@ -450,16 +300,16 @@ After any tune, re-run §4 healthchecks and validate p95 latency in | Server starts but `/health` 503s | Worker pool stalled on small shm | Confirm `--shm-size 8g` is on the `docker run` line | | `import sglang` fails inside container | Stale image / partial install | Re-pull pinned tag; re-run §2 pre-flight | | Per-request latency >2× expected | `torch.compile` accidentally enabled | Confirm `TORCHDYNAMO_DISABLE=1 TORCH_COMPILE_DISABLE=1` | -| Build/push CI green but pull on Hephaestus 404s | GHCR visibility set wrong on first publish | One-time: in repo Settings → Packages, set the package public | +| Build/push CI green but `docker pull` 404s | GHCR visibility set wrong on first publish | One-time: in repo Settings → Packages, set the package public | --- ## 8. `serve-llm.sh` integration `serve-llm.sh` (in `infernode-os/infernode`) is the dev-side launcher -that the ZeroTier-mounted user interacts with. Today it talks to -Ollama at `http://127.0.0.1:11434/v1`. With SGLang available the -launcher gains a sibling backend. +that talks to local LLM backends. Today it talks to Ollama at +`http://127.0.0.1:11434/v1`. With SGLang available the launcher +gains a sibling backend. ### Single-backend mode (Ollama-only — current default) diff --git a/runbooks/lucibridge-routing.md b/runbooks/lucibridge-routing.md index 6d58573..1669b55 100644 --- a/runbooks/lucibridge-routing.md +++ b/runbooks/lucibridge-routing.md @@ -31,7 +31,7 @@ dispatch). ## Routing table -The canonical route table for v1 (Hephaestus, Veltro-on-SGLang). All +The canonical route table for v1 (single-host, Veltro-on-SGLang). All URLs are relative to the configured backend prefix; `model` is the model selector accepted by both backends' `/v1/chat/completions`. @@ -52,7 +52,7 @@ fall back to Ollama — the dev path stays unchanged. 
## Config schema -Stored at `/etc/lucibridge/routing.json` on Hephaestus, owned by +Stored at `/etc/lucibridge/routing.json` on the deploy host, owned by `root:root`, mode `0644` (no secrets): ```json @@ -107,7 +107,7 @@ routes every request to it. v0 deployments don't need to change. ## env-var bridging from `serve-llm.sh` The serve-llm launcher exports the env vars that the bridge resolves -into the implicit / explicit config. For Hephaestus dual-backend mode: +into the implicit / explicit config. For dual-backend mode: ```sh # /etc/serve-llm.env (sourced by serve-llm.service) @@ -120,7 +120,7 @@ LUCIBRIDGE_ROUTING_CONFIG=/etc/lucibridge/routing.json The bridge uses `LUCIBRIDGE_ROUTING_CONFIG` if set, falling back to the implicit single-backend mode (using `LLM_BACKEND_DEFAULT`) when the file is absent. This matches the §8 mode-switching pattern in -`runbooks/hephaestus-deploy.md`. +`runbooks/deploy.md`. --- @@ -178,7 +178,7 @@ fixture file and the new test functions. | Criterion | Where verified | |---|---| -| A configured Veltro session routes `limbo_authoring` to Ollama and `tool_call` to SGLang, both succeed | `infernode-os/infernode` agentlib_test suite + manual run on Hephaestus | +| A configured Veltro session routes `limbo_authoring` to Ollama and `tool_call` to SGLang, both succeed | `infernode-os/infernode` agentlib_test suite + manual run on the deploy host | | Fallback documented and tested | tests 3 + 4 above | | Existing single-URL configs still work unchanged | test 1 above | | No regression in existing `lucibridge_test` suite | CI run on `infernode-os/infernode` | @@ -194,6 +194,6 @@ fixture file and the new test functions. multi-LoRA validation. * **Bridge config hot-reload** — v1 reloads on `SIGHUP`; document exact behaviour once the agentlib PR lands. -* **Cross-host routing** — current schema assumes Hephaestus-local +* **Cross-host routing** — current schema assumes single-host local backends. 
If we add a second Jetson, `backends[].base_url` already takes any URL; document the firewall/zerotier story before exposing. diff --git a/sglang/orin/Dockerfile b/sglang/orin/Dockerfile index 777eb40..813d747 100644 --- a/sglang/orin/Dockerfile +++ b/sglang/orin/Dockerfile @@ -62,7 +62,7 @@ RUN DP=/usr/local/lib/python3.10/dist-packages \ # Build-time guards: CI runners have no GPU, so this is metadata-only. # Real on-hardware verification lives in validate-on-hardware.sh and runs -# post-pull on Hephaestus / a field Orin. +# post-pull on the Jetson hardware. COPY validate-on-hardware.sh /opt/sglang/validate-on-hardware.sh COPY test.py /opt/sglang/test.py RUN chmod +x /opt/sglang/validate-on-hardware.sh \ diff --git a/sglang/orin/README.md b/sglang/orin/README.md index 4f62594..7cc85e7 100644 --- a/sglang/orin/README.md +++ b/sglang/orin/README.md @@ -42,16 +42,16 @@ That's the whole delta. The base image is otherwise untouched. CI builds on `ubuntu-24.04-arm` (Graviton SBSA, native aarch64 — no QEMU). See `.github/workflows/build-sglang.yml`. The Dockerfile carries three build-time guards that fail the build if torch loses CUDA, if Triton or SGLang fail to import, or if the SGLang version pin drifts. -## Build (manual on Hephaestus) +## Build (manual, any aarch64 host) ```sh cd ~/serving/sglang/orin -docker --host unix:///run/docker-dev.sock build \ +docker build \ --build-arg BASE_IMAGE=dustynv/sglang:r36.4.0 \ -t serving-sglang:orin-local . ``` -(Hephaestus-specific: use the dev daemon socket via `--host` per the dual-daemon policy in `runbooks/hephaestus-deploy.md` §0.1. On a field-deployment Orin AGX, drop the `--host` flag.) +If your build host runs a dedicated experimental Docker daemon on a non-default socket, prepend `--host unix:///run/.sock` — that's a per-environment concern, not part of the build. 
## Files @@ -69,18 +69,18 @@ The "unused" files are retained for now under MIT attribution (`../LICENSE-UPSTR ## What this does NOT include -- **Entrypoint scripts.** Launch arguments live with the runbook (`runbooks/hephaestus-deploy.md`) rather than baked into the image, so the same image serves different model paths without rebuild. +- **Entrypoint scripts.** Launch arguments live with the runbook (`runbooks/deploy.md`) rather than baked into the image, so the same image serves different model paths without rebuild. ## Verifying a build (on hardware) The validator runs all seven checks including a real serving smoke (launches `sglang.launch_server` with TinyLlama, asserts `/health`, asserts the `KV Cache is allocated` startup line, exercises `/v1/chat/completions`): ```sh -# On Hephaestus dev daemon (mount the host HF cache so TinyLlama isn't -# redownloaded each run): -docker --host unix:///run/docker-dev.sock run --rm \ +# Mount the host HF cache so TinyLlama isn't redownloaded each run: +HF_CACHE="${HF_CACHE:-/var/lib/huggingface}" +docker run --rm \ --runtime nvidia --gpus all \ - -v /mnt/orin-ssd/huggingface:/root/.cache/huggingface \ + -v "$HF_CACHE":/root/.cache/huggingface \ -e HF_HOME=/root/.cache/huggingface \ ghcr.io/infernode-os/serving-sglang:orin-latest \ /opt/sglang/validate-on-hardware.sh @@ -100,7 +100,7 @@ On success the final line is `All on-hardware checks passed.`. 
Expected serving- `gpt-oss` arch verification — deferred to INFR-92, expected to fail on v1: ```sh -docker --host unix:///run/docker-dev.sock run --rm --runtime nvidia --gpus all \ +docker run --rm --runtime nvidia --gpus all \ ghcr.io/infernode-os/serving-sglang:orin-latest \ python3 -c "import sglang.srt.models.gpt_oss" || echo "(expected for v1; INFR-92)" ``` diff --git a/sglang/orin/bake-tokenizers.sh b/sglang/orin/bake-tokenizers.sh index bf7b788..03cd6c8 100755 --- a/sglang/orin/bake-tokenizers.sh +++ b/sglang/orin/bake-tokenizers.sh @@ -7,7 +7,7 @@ # (<|eot_id|>, <|begin_of_text|>, <|start_header_id|>) properly, so stops # don't match cleanly and chat-template framing is wrong. The fix is to # point --tokenizer-path at a real HuggingFace tokenizer dir at launch -# time (see runbooks/hephaestus-deploy.md). Baking the tokenizers into +# time (see runbooks/deploy.md). Baking the tokenizers into # the image means the launch command is fully offline-capable. # # Each family pulled is the tokenizer files only (~30 MB per family); diff --git a/sglang/orin/config.py b/sglang/orin/config.py index 4595cf0..585738b 100644 --- a/sglang/orin/config.py +++ b/sglang/orin/config.py @@ -14,7 +14,7 @@ (Aug 2025) and has no srt/models/gpt_oss.py. - 0.5.3 is the initial pick: first 0.5.x line with gpt-oss model class, predates the upstream's CUDA-13 transition (which landed - around 0.5.11). On-target smoke build on Hephaestus is the gate. + around 0.5.11). On-target smoke build on Jetson hardware is the gate. - Fallback ladder if 0.5.3 doesn't build: 0.5.2 → 0.5.1 → 0.5.0, then 0.4.5+. Document the working pin back here when verified. 
""" diff --git a/sglang/orin/validate-on-hardware.sh b/sglang/orin/validate-on-hardware.sh index 443f82f..4241d09 100755 --- a/sglang/orin/validate-on-hardware.sh +++ b/sglang/orin/validate-on-hardware.sh @@ -9,21 +9,17 @@ # untested-upstream regressions before they reach a deploy — the kind of # breakage that previously slipped past the import-only checks. # -# Usage (on Hephaestus dev daemon, with an HF cache mount): -# docker --host unix:///run/docker-dev.sock run --rm \ -# --runtime nvidia --gpus all \ -# -v /mnt/orin-ssd/huggingface:/root/.cache/huggingface \ -# -e HF_HOME=/root/.cache/huggingface \ -# ghcr.io/infernode-os/serving-sglang:orin-latest \ -# /opt/sglang/validate-on-hardware.sh -# -# Usage (field-deployment Orin AGX, single docker daemon): +# Usage: +# HF_CACHE="${HF_CACHE:-/var/lib/huggingface}" # docker run --rm --runtime nvidia --gpus all \ -# -v /var/lib/huggingface:/root/.cache/huggingface \ +# -v "$HF_CACHE":/root/.cache/huggingface \ # -e HF_HOME=/root/.cache/huggingface \ # ghcr.io/infernode-os/serving-sglang:orin-latest \ # /opt/sglang/validate-on-hardware.sh # +# If your build/test host runs a non-default Docker daemon, prepend +# `--host unix:///run/.sock` to the docker invocation. +# # The HF mount is recommended (re-uses ~2 GB TinyLlama between runs). If # absent, the smoke will pull TinyLlama fresh into the container each time. @@ -97,7 +93,7 @@ step "7. Serving smoke (TinyLlama → /v1/chat/completions, KV-cache verified)" # HTTP path. Verifies: server starts, KV cache is allocated, /health 200s, # /v1/chat/completions returns a sensible completion. # -# Flags mirror runbooks/hephaestus-deploy.md §3 — keeps the smoke aligned +# Flags mirror runbooks/deploy.md §3 — keeps the smoke aligned # with the documented production launch shape. 
rm -f "$SMOKE_LOG" diff --git a/sglang/thor/Dockerfile b/sglang/thor/Dockerfile index c494afd..45704cc 100644 --- a/sglang/thor/Dockerfile +++ b/sglang/thor/Dockerfile @@ -20,4 +20,4 @@ LABEL org.opencontainers.image.licenses="MIT" COPY test.py /opt/sglang/test.py # No CMD/ENTRYPOINT override — preserve NGC's defaults so callers -# pass the launch args explicitly (see runbooks/hephaestus-deploy.md). +# pass the launch args explicitly (see runbooks/deploy.md).