
feat(http_server): expose tokenizer SHA256 on /get_model_info for parity verification #15

Open

DavidBellamy wants to merge 19 commits into main from feat/tokenizer-sha256-on-model-info-llm360-fork

Conversation

@DavidBellamy
Collaborator

Summary

Add a `tokenizer_sha256` field to the `/get_model_info` endpoint that returns a deterministic hash of the active tokenizer's canonical JSON form. This lets clients verify that two SGLang instances (or an SGLang instance and a separate trainer/embedding service) are using bit-identical tokenizers before relying on cross-process token-id assumptions.

Why

When token IDs cross process boundaries (e.g. an inference worker emits token IDs that another component uses as input to a separate process), the two sides must use bit-identical tokenizers — including merges, special tokens, and byte fallbacks. A subtle mismatch silently corrupts downstream logic in ways that are hard to diagnose because the IDs still look plausible.

Exposing a tokenizer hash on the existing model-info endpoint gives clients a one-call way to do this consistency check at startup.

Changes (`python/sglang/srt/entrypoints/http_server.py`)

  • New `_compute_tokenizer_sha256()` helper that returns a SHA256 of `tokenizer.backend_tokenizer.to_str()` for HF fast tokenizers, or `None` if the active tokenizer doesn't expose that interface. Cached after first call.
  • `/get_model_info` payload includes `tokenizer_sha256` next to the existing `tokenizer_path` field.

Behavior

  • Field is optional (`None` when the tokenizer doesn't support `to_str()`); existing clients ignore unknown fields.
  • One-time hash, cached.
  • No new dependencies (stdlib `hashlib`).

Provenance

One of five focused PRs that supersede #3.

mickqian and others added 19 commits April 4, 2026 23:37
…alistic perf and auto-discover ut (sgl-project#22086)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
…gl-project#21649)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
