
W-TITO bullet 1 + W2 Miles shim + smg response-flatten Cargo patch #3

Closed
DavidBellamy wants to merge 15 commits into llm360-main from add-completion-token-ids-response

Conversation

@DavidBellamy
Collaborator

See PR body below

…uest

Adds a [patch.crates-io] override pointing openai-protocol at a fork
that adds serde(flatten) to ChatCompletionRequest, matching the existing
pattern on CompletionRequest. This allows engine-specific parameters
like return_routed_experts to pass through the gateway to workers.

Upstream PR: lightseekorg/smg#1119
Remove this patch once the upstream crate publishes with the fix.
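
For reference, the override has roughly this shape (the fork URL and branch name here are illustrative placeholders, not the exact values in this commit):

```toml
# Cargo.toml: illustrative sketch of a [patch.crates-io] override that
# swaps openai-protocol for a fork carrying the serde(flatten) change.
# Fork URL and branch name are placeholders.
[patch.crates-io]
openai-protocol = { git = "https://github.com/DavidBellamy/smg", branch = "add-flatten-chat-completion-request" }
```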
…ions

Adds completion_token_ids: Optional[List[int]] to ChatCompletionResponseChoice
and return_completion_token_ids: bool flag to ChatCompletionRequest.
Guarded by the flag; no behavior change when false.
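
As a sketch, the two additions look roughly like this, assuming pydantic-style models as in sglang's protocol module (surrounding fields elided; exact class layout may differ):

```python
# Illustrative sketch, not the exact protocol module source.
from typing import List, Optional
from pydantic import BaseModel

class ChatCompletionRequest(BaseModel):
    # ... existing fields ...
    return_completion_token_ids: bool = False  # defaults off: no behavior change

class ChatCompletionResponseChoice(BaseModel):
    # ... existing fields ...
    completion_token_ids: Optional[List[int]] = None  # populated only when requested
```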

Required for token-in-token-out (TITO) on the agentic /v1/chat/completions
path used by Harbor external agents during RL rollouts. Without this,
the agent loop would have to re-tokenize string content each turn, which
drifts from the training forward pass and corrupts the training signal.
Updates sgl-model-gateway's [patch.crates-io] override from the
request-side flatten branch (already merged upstream via lightseekorg/smg#1130)
to the response-side flatten branch (DavidBellamy/smg:add-flatten-chat-completion-response).

The response-side branch is based off upstream/main, which now carries
the merged sgl-project#1130 request-side flatten, so a single branch
covers both request and response flatten without needing two separate
patch entries.

Needed so routed_experts and completion_token_ids survive the round-trip
through sgl-model-gateway without being dropped by serde's default
unknown-field handling on response structs.
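
The flatten in question is the usual serde catch-all pattern; a minimal sketch (struct fields other than the catch-all are illustrative):

```rust
// Sketch of the serde(flatten) catch-all on a response struct; this is
// the pattern, not the exact openai-protocol definition.
use std::collections::HashMap;
use serde::{Deserialize, Serialize};
use serde_json::Value;

#[derive(Serialize, Deserialize)]
struct ChatCompletionResponseChoice {
    index: u32,
    // ... other published fields ...
    // Unknown fields (e.g. completion_token_ids, routed_experts) land
    // here on deserialize instead of being silently dropped.
    #[serde(flatten)]
    extra: HashMap<String, Value>,
}
```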
Adds two axum routes on the gateway that mirror the Miles rollout
orchestrator's existing worker-management API:
  - POST /add_worker?url=... (query param, not JSON body)
  - GET  /list_workers       (returns {"urls": [...]} shape)

The native /workers routes remain the long-term API. The shim lets
Miles switch its base_url from the Miles router to sgl-model-gateway
without client-side changes in W5.

POST /add_worker constructs a WorkerConfigRequest from just the URL and
delegates to worker_service.create_worker(); GET /list_workers reads
the worker registry directly and flattens to URL list form.
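
A minimal sketch of the shim, assuming axum 0.7-style handlers; AppState and its methods stand in for the gateway's real worker service and registry:

```rust
use axum::{extract::{Query, State}, routing::{get, post}, Json, Router};
use serde::Deserialize;
use serde_json::{json, Value};

// Stand-in for the gateway's shared state.
#[derive(Clone, Default)]
struct AppState;

impl AppState {
    async fn create_worker_from_url(&self, _url: &str) {
        // Real gateway: build a WorkerConfigRequest from the URL and
        // delegate to worker_service.create_worker().
    }
    fn worker_urls(&self) -> Vec<String> {
        // Real gateway: read the worker registry directly.
        Vec::new()
    }
}

#[derive(Deserialize)]
struct AddWorkerParams { url: String }

// POST /add_worker?url=... (query param, not JSON body)
async fn add_worker(State(state): State<AppState>, Query(p): Query<AddWorkerParams>) {
    state.create_worker_from_url(&p.url).await;
}

// GET /list_workers (returns the {"urls": [...]} shape Miles expects)
async fn list_workers(State(state): State<AppState>) -> Json<Value> {
    Json(json!({ "urls": state.worker_urls() }))
}

fn shim_routes() -> Router {
    Router::new()
        .route("/add_worker", post(add_worker))
        .route("/list_workers", get(list_workers))
        .with_state(AppState::default())
}
```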
Adds five unit tests in TestCompletionTokenIds asserting:
  - ChatCompletionRequest accepts return_completion_token_ids=True.
  - The flag defaults to False (no behavior change for existing clients).
  - ChatCompletionResponseChoice accepts completion_token_ids as a list.
  - _serialize drops completion_token_ids when None (mirrors the
    existing hidden_states / prompt_token_ids handling).
  - _serialize keeps completion_token_ids when populated.

Verified against the actual protocol module (runs standalone; no GPU).
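
As a sketch, the two _serialize assertions look roughly like this (_serialize and the field names come from this PR; the ChatMessage constructor arguments are assumptions):

```python
def test_serialize_drops_completion_token_ids_when_none():
    choice = ChatCompletionResponseChoice(
        index=0,
        message=ChatMessage(role="assistant", content="hi"),
        completion_token_ids=None,
    )
    assert "completion_token_ids" not in _serialize(choice)

def test_serialize_keeps_completion_token_ids_when_populated():
    choice = ChatCompletionResponseChoice(
        index=0,
        message=ChatMessage(role="assistant", content="hi"),
        completion_token_ids=[1, 2, 3],
    )
    assert _serialize(choice)["completion_token_ids"] == [1, 2, 3]
```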
WorkerConfigRequest does not implement Default on the current smg branch,
so the prior `..Default::default()` spread fails to compile on a proper
Linux build env. Mac-side cargo check missed this (stale resolution).

Switch to `serde_json::from_value(json!({"url": ...}))` which relies on
the struct's `#[serde(default)]` field annotations to fill in defaults.
More resilient to upstream field additions; fails loud with a 400 if the
shape ever changes.
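
A minimal sketch of that construction, with an illustrative (not exact) field set for WorkerConfigRequest:

```rust
use serde::Deserialize;
use serde_json::json;

#[derive(Deserialize)]
struct WorkerConfigRequest {
    url: String,
    #[serde(default)]
    worker_type: Option<String>,
    #[serde(default)]
    bootstrap_port: Option<u16>,
}

fn build_request(url: &str) -> Result<WorkerConfigRequest, serde_json::Error> {
    // No Default impl needed: serde fills the remaining fields from their
    // #[serde(default)] annotations, and this fails loud (mapped to a 400
    // in the handler) if the struct shape ever changes incompatibly.
    serde_json::from_value(json!({ "url": url }))
}
```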
…let 5)

Adds a tokenizer_sha256 field to /model_info so agent-side consistency
checks (harbor.utils.tokenizer_identity.assert_tokenizer_matches in the
Harbor tito/agentic-path-infra branch) can abort when the trainer and
rollout-worker tokenizers have drifted.

The hash is SHA256 of the HF fast tokenizer's canonical JSON form
(backend_tokenizer.to_str()), which is stable across loading methods
(local path, HF Hub, programmatic construction) and changes iff
tokenization output could change. Computed once and cached; returns
None for slow tokenizers (no stable canonical form).
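
A sketch of the computation, assuming an HF fast tokenizer; the function name and caching mechanism here are illustrative:

```python
import hashlib
from functools import lru_cache

@lru_cache(maxsize=1)
def tokenizer_sha256(tokenizer) -> str | None:
    backend = getattr(tokenizer, "backend_tokenizer", None)
    if backend is None:
        return None  # slow tokenizer: no stable canonical form
    # to_str() yields the canonical JSON of the underlying tokenizers-rs
    # object, stable across local-path / Hub / programmatic loading.
    return hashlib.sha256(backend.to_str().encode("utf-8")).hexdigest()
```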

A silent tokenizer mismatch between trainer and rollout is a known
failure mode for TITO (PRD W-TITO bullet 5, risk R11); this patch is
the worker side of that consistency check.
…e (tito_debug)

Temporary DEBUG-level logging (target: smg::tito_debug) to localize where
completion_token_ids is dropped on the /v1/chat/completions pass-through.
Enable with RUST_LOG=smg::tito_debug=debug. Revert once localized.
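
The logging amounts to targeted tracing calls at each hop; a sketch (message content illustrative):

```rust
use serde_json::Value;

// Enable with RUST_LOG=smg::tito_debug=debug.
fn log_tito_passthrough(stage: &str, body: &Value) {
    tracing::debug!(
        target: "smg::tito_debug",
        "{}: completion_token_ids present = {}",
        stage,
        body.get("completion_token_ids").is_some()
    );
}
```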
…=1.0.0 pin)

The prior [patch.crates-io] override never applied: crates.io pins
openai-protocol at 1.0.0, our smg fork is at 1.7.0, so cargo silently
fell back to the frozen crates.io version without the serde(flatten)
catch-all. See docs/gateway-smg-version-drift.md in LLM360/RL360#76
for the full localization writeup.

Switching to a direct git dependency on the fork branch so cargo
takes our smg at its declared version and the flatten actually lands
in the built binary. Will drive gateway source updates for the
upstream smg API drift in follow-up commits.
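
The shape of the change, roughly (the fork and branch are as described above, but treat the exact line as illustrative):

```toml
# Before: a [patch.crates-io] entry that cargo silently ignored because
# the =1.0.0 registry pin could not unify with the fork's 1.7.0.
# After: a direct git dependency, so cargo takes the fork at its own
# declared version and the flatten lands in the built binary.
[dependencies]
openai-protocol = { git = "https://github.com/DavidBellamy/smg", branch = "add-flatten-chat-completion-response" }
```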
… fork)

Gateway byte-streams responses (see http/router.rs:584 — res.bytes() ->
Body::from(body), no ChatCompletionResponse deserialization), so the
response-side serde(flatten) I added in DavidBellamy/smg:
add-flatten-chat-completion-response is NOT needed by the gateway. It
only adds value for gRPC paths and for any future deserialize-response
callers.
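
A sketch of that byte-stream pass-through, assuming a reqwest upstream call inside an axum handler (names are illustrative; cf. http/router.rs:584):

```rust
use axum::{body::Body, response::Response};

async fn forward(upstream: reqwest::Response) -> Response {
    let status = upstream.status().as_u16();
    // Bytes are forwarded verbatim: no ChatCompletionResponse deserialize,
    // so unknown response fields survive regardless of serde annotations.
    let bytes = upstream.bytes().await.unwrap_or_default();
    Response::builder()
        .status(status)
        .body(Body::from(bytes))
        .unwrap()
}
```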

The ONLY flatten the gateway needs is request-side on ChatCompletionRequest,
which upstream smg main already carries via sgl-project#1130 (merged). Pointing at
upstream directly lets the request-side flatten apply and skips the
struct-literal maintenance burden in smg builders that my response-side
additions introduced.

Upstream smg PR lightseekorg/smg#1160 (response flatten) remains open
as an independent improvement; it is not on this gateway's critical
path anymore.
The prior direct-git dep only affected sgl-model-gateway's top-level
dependency; transitive deps (smg-grpc-client, reasoning-parser,
llm-tokenizer, tool-parser) still pulled openai-protocol 1.0.0 from
crates.io, so the workspace had TWO incompatible openai-protocol
versions linked simultaneously, giving 111 type errors.

Restoring =1.0.0 and patching crates.io to a new fork branch
(DavidBellamy/smg:gateway-patch-1.0.0) that carries upstream main's
code but declares itself version 1.0.0 makes the patch apply to every
openai-protocol consumer in the workspace. One version in the graph,
and it's the one that has the ChatCompletionRequest flatten.
Expands the workspace patch so every smg-family transitive dep (smg-auth,
smg-mcp, smg-mesh, smg-wasm, smg-grpc-client, reasoning-parser,
llm-tokenizer, tool-parser) resolves to the same fork branch
(gateway-patch-1.0.0), not crates.io. Prevents old-shape smg crates from
colliding with the new openai-protocol shape.
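
The expanded patch has roughly this shape (repository URL illustrative; branch name as described):

```toml
[patch.crates-io]
openai-protocol = { git = "https://github.com/DavidBellamy/smg", branch = "gateway-patch-1.0.0" }
smg-auth        = { git = "https://github.com/DavidBellamy/smg", branch = "gateway-patch-1.0.0" }
smg-grpc-client = { git = "https://github.com/DavidBellamy/smg", branch = "gateway-patch-1.0.0" }
# ... likewise for smg-mcp, smg-mesh, smg-wasm, reasoning-parser,
# llm-tokenizer, and tool-parser.
```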
…mg 1.7.0)

Replaces the 9-crate smg workspace patch with a single minimal patch:
openai-protocol 1.0.0 source verbatim + serde(flatten) on
ChatCompletionRequest. Everything else stays as published crates.io
1.0.0, so gateway source compiles without the 168-error smg API drift
cascade that full upstream main triggered.

Scope: fixes request-side unknown-field round-trip
(return_completion_token_ids, return_routed_experts, input_ids,
extra_body). Response side stays byte-streamed; no response flatten
needed at the gateway.
…gfault

Adds threading.Lock around register/batch_register and engine init to
prevent concurrent ibv_reg_mr on GPU memory from multiple DP worker
threads. nvidia-peermem's page callback segfaults when handling concurrent
GPU memory registration from multiple IB VF contexts on M2's SR-IOV
devices. Sequential registration works on all 8 HCAs.

Reproducer: /mnt/weka/shrd/k2pta/rl360/ibv_reg_mr_stress.py (segfaults
without this patch, passes with it).
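
A sketch of the serialization, with illustrative names standing in for the real registration entry points:

```python
import threading

_reg_lock = threading.Lock()  # one lock shared by all DP worker threads

def _do_ibv_reg_mr(buf):
    ...  # stand-in for the real ibv_reg_mr-backed registration call

def register(buf):
    # Serialize GPU memory registration so nvidia-peermem never handles
    # concurrent ibv_reg_mr calls from multiple IB VF contexts.
    with _reg_lock:
        _do_ibv_reg_mr(buf)

def batch_register(bufs):
    with _reg_lock:
        for buf in bufs:
            _do_ibv_reg_mr(buf)
```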

Tracked in #1 and LLM360/posttraining_knowledge#109.
…r shim

The W2 Miles-API compatibility shim previously accepted only `?url=` and
registered every worker as Regular. After miles#990 relaxed the
client-side assert on /add_worker, PD-disagg rollouts that use Miles
forward worker_type=prefill|decode (and bootstrap_port for prefill) as
query params on the same endpoint. This patch reads those params and
plumbs them into the JSON passed to serde_json::from_value::<WorkerConfigRequest>
so the gateway's existing PD routing layers (routers/http/pd_router.rs
etc.) receive properly-typed workers.

Matches the shape that src/service_discovery.rs:433 already sends for
PD pods discovered in-cluster: worker_type is a flat Option<String>
alongside bootstrap_port: Option<u16>, not nested. Explicit 'regular'
and a missing worker_type both preserve the original regular-only
behavior.
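
A sketch of the extended params and the flat JSON they produce (struct fields mirror the description above; exact names in the gateway may differ):

```rust
use serde::Deserialize;
use serde_json::{json, Value};

#[derive(Deserialize)]
struct AddWorkerParams {
    url: String,
    worker_type: Option<String>,  // "prefill" | "decode" | "regular" | absent
    bootstrap_port: Option<u16>,  // only meaningful for prefill workers
}

fn to_worker_config_json(p: &AddWorkerParams) -> Value {
    // Flat shape, matching what service_discovery.rs sends for in-cluster
    // PD pods: worker_type alongside bootstrap_port, not nested.
    json!({
        "url": p.url,
        "worker_type": p.worker_type,
        "bootstrap_port": p.bootstrap_port,
    })
}
```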

Unblocks PD disaggregation in the Miles agentic rollout loop without
switching away from sgl-model-gateway. Keeps cache_aware + serde(flatten)
response passthrough intact (unlike the alternative of swapping in
sglang_router.launch_router per-variant).
@DavidBellamy
Collaborator Author

Superseded by 5 focused replacement PRs that split this PR's content by concern.

The sgl-model-gateway/Cargo.toml [patch.crates-io] override (pointing openai-protocol at a personal fork) is intentionally not in any replacement PR — it was dead in the production build path (the binary that runs in production is built from a different source), so it's dropped.

Closing as superseded.

