feat: add trtllm snapshot entrypoint and update backend support docs#9280
feat: add trtllm snapshot entrypoint and update backend support docs#9280
Conversation
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
WalkthroughThis PR adds TensorRT-LLM support to Dynamo Snapshot by introducing an async ChangesTensorRT-LLM Snapshot Support
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/reference/feature-matrix.md`:
- Line 31: The Dynamo Snapshot row currently marks vLLM and SGLang with ✅ which
conflicts with the legend and Snapshot guide that treat them as limited/preview;
update the badges in the Dynamo Snapshot table row so vLLM and SGLang use the
preview/limited badge (🚧) instead of ✅ and ensure the Snapshot Docs link and
any adjacent wording remain unchanged; target the table row containing "Dynamo
Snapshot" and the entries for "vLLM" and "SGLang" in feature-matrix.md.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: bfce0974-488d-4f41-8e92-f814c3e3b5f9
📒 Files selected for processing (3)
components/src/dynamo/trtllm/snapshot.pydocs/kubernetes/snapshot.mddocs/reference/feature-matrix.md
| | **Tool Calling** | ✅ | ✅ | ✅ | [Tool Calling Doc][tools] | | ||
| | **Speculative Decoding** | 🚧 | ✅ | ✅ | Backend READMEs | | ||
| | **Dynamo Snapshot** | ✅ | | ✅ | [Snapshot Docs][snapshot] | | ||
| | **Dynamo Snapshot** | ✅ | 🚧 | ✅ | [Snapshot Docs][snapshot] | |
There was a problem hiding this comment.
Align Snapshot support badges with the “limited preview” definition.
Line 31 marks vLLM and SGLang as ✅, but this conflicts with the legend (🚧 includes limited/preview) and the Snapshot guide wording that both are limited preview. Please make these labels consistent across docs.
Suggested doc fix
-| **Dynamo Snapshot** | ✅ | 🚧 | ✅ | [Snapshot Docs][snapshot] |
+| **Dynamo Snapshot** | 🚧 | 🚧 | 🚧 | [Snapshot Docs][snapshot] |📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| | **Dynamo Snapshot** | ✅ | 🚧 | ✅ | [Snapshot Docs][snapshot] | | |
| | **Dynamo Snapshot** | 🚧 | 🚧 | 🚧 | [Snapshot Docs][snapshot] | |
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 31-31: Reference links and images should use a label that is defined
Missing link or image reference definition: "snapshot"
(MD052, reference-links-images)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/reference/feature-matrix.md` at line 31, The Dynamo Snapshot row
currently marks vLLM and SGLang with ✅ which conflicts with the legend and
Snapshot guide that treat them as limited/preview; update the badges in the
Dynamo Snapshot table row so vLLM and SGLang use the preview/limited badge (🚧)
instead of ✅ and ensure the Snapshot Docs link and any adjacent wording remain
unchanged; target the table row containing "Dynamo Snapshot" and the entries for
"vLLM" and "SGLang" in feature-matrix.md.
|
|
||
| snapshot_controller = EngineSnapshotController( | ||
| engine=engine, | ||
| quiesce_controller=TRTLLMEngineQuiesceController(engine), |
There was a problem hiding this comment.
fyi we needed to put in some pretty patches into TRTLLM itself to get this to work (currently it is a bit of a no-op for the actual MPI workers).
| logger = logging.getLogger(__name__) | ||
|
|
||
|
|
||
| async def prepare_snapshot_engine( |
There was a problem hiding this comment.
Adding this entrypoint without wiring it into dynamo.trtllm.main.worker leaves DYN_SNAPSHOT_CONTROL_DIR ignored by TRT-LLM and never checkpoints the engine before runtime creation. Fix: call it before create_runtime, reload restore identity, and pass the restored engine into init_worker/init_llm_worker for reuse.
| Must be called BEFORE runtime creation so the engine can be checkpointed | ||
| without active NATS/etcd connections. | ||
|
|
||
| Weight quiesce (GMS) is intentionally excluded from the snapshot scope. |
There was a problem hiding this comment.
might want to double check that weights are actually being staged into CPU memory if not quiescing them via GMS.
Overview:
Prepare for TRTLLM Snapshot support with placeholder snapshot file.
Details:
setup_trtllm_enginecallable is a placeholder for the engine-creation refactor (functionality doesn't exist yet)Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit
New Features
Documentation