feat: add trtllm snapshot entrypoint and update backend support docs #9280

Open
hhzhang16 wants to merge 1 commit into main from hannahz/dep-923-add-trtllmsnapshotpy-entrypoint

Conversation

Contributor

@hhzhang16 hhzhang16 commented May 7, 2026

Overview:

Prepare for TRTLLM Snapshot support with placeholder snapshot file.

Details:

  • Adds `components/src/dynamo/trtllm/snapshot.py` with `prepare_snapshot_engine()`, mirroring the vLLM/SGLang pattern. It quiesces the KV cache only (no GMS/weights). The `setup_trtllm_engine` callable is a placeholder until the engine-creation refactor lands (that functionality doesn't exist yet).
  • Updated docs to reflect SGLang snapshot support and TRT-LLM as in-progress

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • New Features

    • Added TensorRT-LLM backend support for Dynamo Snapshot checkpoint and restore capabilities.
  • Documentation

    • Updated Kubernetes snapshot documentation to include TensorRT-LLM backend support status.
    • Updated feature matrix to reflect TensorRT-LLM snapshot functionality availability.

Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
@hhzhang16 hhzhang16 requested review from a team as code owners May 7, 2026 20:27

copy-pr-bot Bot commented May 7, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the feat, documentation, and backend::trtllm labels May 7, 2026
Contributor

coderabbitai Bot commented May 7, 2026

Walkthrough

This PR adds TensorRT-LLM support to Dynamo Snapshot by introducing an async prepare_snapshot_engine integration function, updating backend documentation to reflect in-progress TensorRT-LLM support alongside existing vLLM and SGLang coverage, and marking TensorRT-LLM as work-in-progress in the feature matrix.

Changes

TensorRT-LLM Snapshot Support

| Layer / File(s) | Summary |
|---|---|
| **Core Integration**: `components/src/dynamo/trtllm/snapshot.py` | New async `prepare_snapshot_engine` function conditionally enables snapshot mode from environment config, builds TRT-LLM engine via injected factory, creates `EngineSnapshotController` with KV-cache-only quiesce scope, awaits restore, and exits on timeout. |
| **Backend Support Documentation**: `docs/kubernetes/snapshot.md` | Prerequisites, Limitations, and Planned Features sections updated to document vLLM/SGLang limited preview support and TensorRT-LLM in-progress status. |
| **Feature Matrix Update**: `docs/reference/feature-matrix.md` | Quick Comparison table marks Dynamo Snapshot as 🚧 for TensorRT-LLM while preserving ✅ for SGLang and vLLM. |
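
The control flow summarized for `prepare_snapshot_engine` can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: `FakeSnapshotController`, `wait_for_restore`, and `prepare_snapshot_engine_sketch` are hypothetical stand-ins, and treating `DYN_SNAPSHOT_CONTROL_DIR` as the switch that enables snapshot mode is an assumption.

```python
import asyncio
import os
from dataclasses import dataclass
from typing import Any, Awaitable, Callable, Mapping, Optional


@dataclass
class FakeSnapshotController:
    """Hypothetical stand-in for EngineSnapshotController (not the real API)."""

    engine: Any
    restore_event: asyncio.Event

    async def wait_for_restore(self, timeout_s: float) -> bool:
        # Return True if the restore signal arrives before the timeout.
        try:
            await asyncio.wait_for(self.restore_event.wait(), timeout=timeout_s)
            return True
        except asyncio.TimeoutError:
            return False


async def prepare_snapshot_engine_sketch(
    setup_engine: Callable[[], Awaitable[Any]],
    restore_event: asyncio.Event,
    timeout_s: float = 1.0,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[FakeSnapshotController]:
    env = os.environ if env is None else env
    # Assumption: snapshot mode is enabled by this environment variable.
    if not env.get("DYN_SNAPSHOT_CONTROL_DIR"):
        return None  # snapshot mode off, caller builds the engine normally
    engine = await setup_engine()  # injected engine factory
    controller = FakeSnapshotController(engine, restore_event)
    if not await controller.wait_for_restore(timeout_s):
        raise SystemExit(1)  # exit on restore timeout
    return controller
```

Injecting the engine factory keeps the sketch independent of how the TRT-LLM engine is actually created, which matches the PR's note that `setup_trtllm_engine` is still a placeholder.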

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately summarizes the main changes: adding a TRT-LLM snapshot entrypoint and updating backend support documentation. |
| Description check | ✅ Passed | The description covers the Overview, Details, and Related Issues sections from the template, though Related Issues placeholder is not filled with actual issue number. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |



@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/reference/feature-matrix.md`:
- Line 31: The Dynamo Snapshot row currently marks vLLM and SGLang with ✅ which
conflicts with the legend and Snapshot guide that treat them as limited/preview;
update the badges in the Dynamo Snapshot table row so vLLM and SGLang use the
preview/limited badge (🚧) instead of ✅ and ensure the Snapshot Docs link and
any adjacent wording remain unchanged; target the table row containing "Dynamo
Snapshot" and the entries for "vLLM" and "SGLang" in feature-matrix.md.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: bfce0974-488d-4f41-8e92-f814c3e3b5f9

📥 Commits

Reviewing files that changed from the base of the PR and between 2cefc4a and 81dd2a1.

📒 Files selected for processing (3)
  • components/src/dynamo/trtllm/snapshot.py
  • docs/kubernetes/snapshot.md
  • docs/reference/feature-matrix.md

| **Tool Calling** | ✅ | ✅ | ✅ | [Tool Calling Doc][tools] |
| **Speculative Decoding** | 🚧 | ✅ | ✅ | Backend READMEs |
-| **Dynamo Snapshot** | ✅ | | ✅ | [Snapshot Docs][snapshot] |
+| **Dynamo Snapshot** | ✅ | 🚧 | ✅ | [Snapshot Docs][snapshot] |

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Align Snapshot support badges with the “limited preview” definition.

Line 31 marks vLLM and SGLang as ✅, but this conflicts with the legend (🚧 includes limited/preview) and the Snapshot guide wording that both are limited preview. Please make these labels consistent across docs.

Suggested doc fix
-| **Dynamo Snapshot** | ✅ | 🚧 | ✅ | [Snapshot Docs][snapshot] |
+| **Dynamo Snapshot** | 🚧 | 🚧 | 🚧 | [Snapshot Docs][snapshot] |
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-| **Dynamo Snapshot** | ✅ | 🚧 | ✅ | [Snapshot Docs][snapshot] |
+| **Dynamo Snapshot** | 🚧 | 🚧 | 🚧 | [Snapshot Docs][snapshot] |
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 31-31: Reference links and images should use a label that is defined
Missing link or image reference definition: "snapshot"

(MD052, reference-links-images)
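
For context, the kind of check behind an MD052 warning can be approximated in a few lines of Python. This is a simplified sketch, not markdownlint's implementation; `undefined_reference_labels` is a hypothetical helper, and it only handles full reference links of the form `[text][label]`.

```python
import re
from typing import Set


def undefined_reference_labels(markdown: str) -> Set[str]:
    """Return reference labels that are used but never defined (case-insensitive)."""
    # Definitions look like: [label]: destination
    defined = {
        m.group(1).lower()
        for m in re.finditer(r"^ {0,3}\[([^\]]+)\]:[ \t]*\S+", markdown, re.MULTILINE)
    }
    # Full reference links look like: [text][label]
    used = {
        m.group(1).lower()
        for m in re.finditer(r"\[[^\]]+\]\[([^\]]+)\]", markdown)
    }
    return used - defined


row = "| **Dynamo Snapshot** | 🚧 | 🚧 | 🚧 | [Snapshot Docs][snapshot] |"
print(undefined_reference_labels(row))
```

Run against the single table row, the helper reports `snapshot` as undefined because no `[snapshot]: ...` definition appears in that text; whether the full `feature-matrix.md` defines the label elsewhere would need to be checked against the file itself.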

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/reference/feature-matrix.md` at line 31, The Dynamo Snapshot row
currently marks vLLM and SGLang with ✅ which conflicts with the legend and
Snapshot guide that treat them as limited/preview; update the badges in the
Dynamo Snapshot table row so vLLM and SGLang use the preview/limited badge (🚧)
instead of ✅ and ensure the Snapshot Docs link and any adjacent wording remain
unchanged; target the table row containing "Dynamo Snapshot" and the entries for
"vLLM" and "SGLang" in feature-matrix.md.


snapshot_controller = EngineSnapshotController(
    engine=engine,
    quiesce_controller=TRTLLMEngineQuiesceController(engine),

fyi we needed to put in some pretty patches into TRTLLM itself to get this to work (currently it is a bit of a no-op for the actual MPI workers).

logger = logging.getLogger(__name__)


async def prepare_snapshot_engine(

Adding this entrypoint without wiring it into dynamo.trtllm.main.worker leaves DYN_SNAPSHOT_CONTROL_DIR ignored by TRT-LLM and never checkpoints the engine before runtime creation. Fix: call it before create_runtime, reload restore identity, and pass the restored engine into init_worker/init_llm_worker for reuse.
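
The ordering this comment asks for could look roughly like the sketch below. It is a hedged illustration only: every function here is a hypothetical stub standing in for `prepare_snapshot_engine`, `create_runtime`, and `init_worker`, whose real signatures are not shown in this conversation, and the "reload restore identity" step is left as a comment rather than modeled.

```python
import asyncio
from typing import Any, List, Optional


async def prepare_snapshot_engine_stub(enabled: bool, calls: List[str]) -> Optional[Any]:
    # Stub: returns a pre-built engine only when snapshot mode is enabled.
    calls.append("prepare_snapshot_engine")
    return object() if enabled else None


def create_runtime_stub(calls: List[str]) -> str:
    # Stub: in the real worker this would open NATS/etcd connections.
    calls.append("create_runtime")
    return "runtime"


async def init_worker_stub(runtime: str, engine: Optional[Any], calls: List[str]) -> str:
    # Stub: reuse the restored engine if one was passed in, else build a new one.
    calls.append("init_worker")
    return "reused" if engine is not None else "built"


async def worker_sketch(enabled: bool, calls: List[str]) -> str:
    # 1. Checkpoint/restore first, while no runtime connections exist.
    engine = await prepare_snapshot_engine_stub(enabled, calls)
    # (The reviewer also asks to reload the restore identity here; not modeled.)
    # 2. Only then create the runtime.
    runtime = create_runtime_stub(calls)
    # 3. Hand the restored engine to worker init so it is not rebuilt.
    return await init_worker_stub(runtime, engine, calls)
```

Passing the engine through as an optional argument keeps the non-snapshot path unchanged: when snapshot mode is off, worker init receives `None` and builds the engine as before.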

Must be called BEFORE runtime creation so the engine can be checkpointed
without active NATS/etcd connections.

Weight quiesce (GMS) is intentionally excluded from the snapshot scope.

might want to double check that weights are actually being staged into CPU memory if not quiescing them via GMS.

Labels

backend::trtllm, documentation, feat, size/M


3 participants