Dynamic Inference Headers with Prediction Trie Integration #1483
Merged. rapids-bot merged 41 commits into NVIDIA:develop from dnandakumar-nv:dynamic-inference-headers on Jan 26, 2026.
Commits:
- adc3f61 Set and clear DynamoPrefixContext for workflow KV optimization (dnandakumar-nv)
- ffd3d28 Add tests for DynamoPrefixContext integration in Runner (dnandakumar-nv)
- c35228a Merge branch 'develop' into dynamic-inference-headers (dnandakumar-nv)
- 360288d Add design doc for prediction trie inference routing (dnandakumar-nv)
- 80df59f docs: add prediction trie implementation plan (dnandakumar-nv)
- 64bfc21 feat(profiler): add prediction trie data models (dnandakumar-nv)
- ee00b66 feat(profiler): add MetricsAccumulator for prediction trie (dnandakumar-nv)
- 325afac feat(profiler): add PredictionTrieBuilder (dnandakumar-nv)
- 932f674 feat(profiler): add PredictionTrieLookup (dnandakumar-nv)
- d62fc1e feat(profiler): add prediction trie serialization (dnandakumar-nv)
- f916957 feat(llm): add LLMCallTracker for runtime prediction lookups (dnandakumar-nv)
- 381f0af feat(profiler): integrate prediction trie generation (dnandakumar-nv)
- ee166c8 feat(llm): add prediction header injection to Dynamo client (dnandakumar-nv)
- 7b8931c feat(llm): add prediction_trie_path config to DynamoModelConfig (dnandakumar-nv)
- 52fb243 test(profiler): add end-to-end prediction trie test (dnandakumar-nv)
- 6d36b20 docs: add runtime prediction trie integration design (dnandakumar-nv)
- 1e5d370 docs: add runtime prediction trie implementation plan (dnandakumar-nv)
- 6137fb7 feat(context): add function_path_stack ContextVar to ContextState (dnandakumar-nv)
- b91d1f2 feat(context): update push_active_function to track function path stack (dnandakumar-nv)
- d4f02f2 feat(step_manager): increment LLM call tracker on LLM_START events (dnandakumar-nv)
- fa7830d Add dynamic prediction hook for runtime trie lookups (dnandakumar-nv)
- bd40ae0 fix(test): include prediction_trie_path in dynamo field names test (dnandakumar-nv)
- 2ec7d7c Add prediction_lookup parameter to create_httpx_client_with_dynamo_hooks (dnandakumar-nv)
- 3a7e42e Add end-to-end integration test for runtime prediction trie (dnandakumar-nv)
- 912245a docs: add prediction trie example config design (dnandakumar-nv)
- 3c6f65d feat(examples): add prediction trie example configs and docs (dnandakumar-nv)
- 4f09ca3 Refactor header injection with dynamic prediction logic (dnandakumar-nv)
- a1ab80c Refactor Dynamo prefix handling to centralize logic. (dnandakumar-nv)
- 2fc6a09 Refactor DynamoPrefixContext for depth-aware prefix handling (dnandakumar-nv)
- 37ad4a4 Refactor DynamoPrefixContext for depth-aware prefix handling (dnandakumar-nv)
- e4f7893 Merge branch 'develop' into dynamic-inference-headers (dnandakumar-nv)
- 4cbd4e6 Refactor DynamoPrefixContext for depth-aware prefix handling (dnandakumar-nv)
- a611564 Merge remote-tracking branch 'origin/dynamic-inference-headers' into … (dnandakumar-nv)
- fa79197 Remove DynamoPrefixContext handling in Runner class (dnandakumar-nv)
- 5bec15d Merge remote-tracking branch 'upstream/develop' into dynamic-inferenc… (dnandakumar-nv)
- 2cba23c Add Apache 2.0 license headers to source and test files (dnandakumar-nv)
- 7d2c087 Add "Trie(s)" to accepted vocabulary list (dnandakumar-nv)
- 727f564 Update README and test files for clarity and consistency (dnandakumar-nv)
- 74c191d Fix formatting of `job_id` in README_PREDICTION_TRIE.md (dnandakumar-nv)
- c412dc8 Add Apache 2.0 license headers to test files (dnandakumar-nv)
- 22327b9 Refactor imports for PredictionTrieLookup across modules (dnandakumar-nv)
172 changes: 172 additions & 0 deletions
examples/dynamo_integration/react_benchmark_agent/README_PREDICTION_TRIE.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,172 @@ | ||
| <!-- | ||
| Copyright (c) 2025-2026, NVIDIA CORPORATION | ||
|
|
||
| Licensed under the Apache License, Version 2.0 (the "License"); | ||
| you may not use this file except in compliance with the License. | ||
| You may obtain a copy of the License at | ||
|
|
||
| http://www.apache.org/licenses/LICENSE-2.0 | ||
|
|
||
| Unless required by applicable law or agreed to in writing, software | ||
| distributed under the License is distributed on an "AS IS" BASIS, | ||
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| See the License for the specific language governing permissions and | ||
| limitations under the License. | ||
| --> | ||
| <!-- path-check-skip-file --> | ||
|
|
||
| # Prediction Trie Optimization for Dynamo | ||
|
|
||
| Use profiled execution data to inject accurate per-call prediction headers instead of static guesses. | ||
|
|
||
| ## Overview | ||
|
|
||
| The prediction trie enables **dynamic header injection** for Dynamo's KV-aware routing. Instead of using static values like `prefix_total_requests=10` for every call, the trie provides accurate predictions based on: | ||
| - **Function path**: Where in the agent hierarchy the call originates (e.g., `["react_workflow", "react_agent"]`) | ||
| - **Call index**: Which LLM call this is within the current function (1st, 2nd, 3rd, etc.) | ||
|
|
||
| This allows Dynamo's Thompson Sampling router to make better worker assignment decisions. | ||
|
||
|
|
||
| ## Quick Start | ||
|
|
||
| ### Phase 1: Build the Prediction Trie | ||
|
|
||
| Run profiling to collect execution data and build the trie: | ||
|
|
||
| ```bash | ||
| nat eval --config_file configs/profile_rethinking_full_test.yml | ||
| ``` | ||
|
|
||
| **Output location:** | ||
| ``` | ||
| outputs/dynamo_evals/rethinking_full_test_for_profiling/<job_id>/prediction_trie.json | ||
| ``` | ||
|
||
|
|
||
| ### Phase 2: Run with Predictions | ||
|
|
||
| 1. **Update the trie path** in `configs/run_with_prediction_trie.yml`: | ||
| ```yaml | ||
| prediction_trie_path: ./examples/dynamo_integration/react_benchmark_agent/outputs/dynamo_evals/rethinking_full_test_for_profiling/<YOUR_JOB_ID>/prediction_trie.json | ||
| ``` | ||
|
|
||
| 2. **Run with dynamic predictions:** | ||
| ```bash | ||
| nat eval --config_file configs/run_with_prediction_trie.yml | ||
| ``` | ||
|
|
||
| ## How It Works | ||
|
|
||
| ### During Profiling (Phase 1) | ||
|
|
||
| The profiler collects data for each LLM call: | ||
| - Function path at time of call | ||
| - Call index within the parent function | ||
| - Output tokens generated | ||
| - Time until the next LLM call | ||
| - Remaining LLM calls in the workflow | ||
|
|
||
| This data is aggregated into a trie structure with statistical summaries (mean, p50, p90, etc.) at each node. | ||
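The aggregation step can be pictured with a small sketch. This is not the PR's `PredictionTrieBuilder` or `MetricsAccumulator`; the `TrieNode` layout, the `record` and `summarize` helpers, and the nearest-rank percentile method are all assumptions chosen for illustration, and only output tokens are tracked to keep it short.

```python
from dataclasses import dataclass, field
from statistics import fmean


@dataclass
class TrieNode:
    """One function in the call hierarchy (hypothetical layout)."""
    children: dict = field(default_factory=dict)  # function name -> TrieNode
    samples: dict = field(default_factory=dict)   # call index -> output-token counts


def record(root, path, call_index, output_tokens):
    """Walk (and create) the nodes along `path`, then store one sample."""
    node = root
    for name in path:
        node = node.children.setdefault(name, TrieNode())
    node.samples.setdefault(call_index, []).append(output_tokens)


def summarize(values):
    """Mean plus nearest-rank p50 and p90 of a non-empty list of numbers."""
    ordered = sorted(values)

    def percentile(q):
        # Nearest-rank: smallest value with at least q% of samples at or below it.
        rank = max(1, (q * len(ordered) + 99) // 100)
        return ordered[rank - 1]

    return {"mean": fmean(ordered), "p50": percentile(50), "p90": percentile(90)}
```

Each profiled LLM call would contribute one `record(...)` call, and `summarize(...)` would run once per node and call index when the trie is written out.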
|
|
||
| ### During Execution (Phase 2) | ||
|
|
||
| For each LLM request: | ||
| 1. Read the current function path from context | ||
| 2. Read the call index from the LLM call tracker | ||
| 3. Look up the prediction in the trie | ||
| 4. Inject headers into the HTTP request | ||
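Steps 1 and 2 can be sketched with Python's `contextvars`. The commit list mentions a `function_path_stack` ContextVar and an `LLMCallTracker`, but their real interfaces are not shown here, so `active_function`, `current_path`, and `next_call_index` below are hypothetical stand-ins. The sketch also keys the counter on the full path, which is a simplification.

```python
import contextvars
from collections import Counter
from contextlib import contextmanager

# Hypothetical stand-ins for the function path stack and the LLM call tracker.
_function_path = contextvars.ContextVar("function_path", default=())
_call_counts = contextvars.ContextVar("call_counts", default=None)


@contextmanager
def active_function(name):
    """Push `name` onto the function path for the duration of the block."""
    token = _function_path.set(_function_path.get() + (name,))
    try:
        yield
    finally:
        _function_path.reset(token)


def current_path():
    """Return the current function path as a tuple of names."""
    return _function_path.get()


def next_call_index():
    """Return (path, 1-based index of this LLM call within that path)."""
    counts = _call_counts.get()
    if counts is None:
        counts = Counter()
        _call_counts.set(counts)
    path = current_path()
    counts[path] += 1
    return path, counts[path]
```

The returned pair is what a trie lookup would take as its key before headers are built.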
|
|
||
| ### Fallback Chain | ||
|
|
||
| The trie lookup tries progressively more general matches, in this order: | ||
| 1. Exact path + exact call index (most specific) | ||
| 2. Exact path + any call index | ||
| 3. Partial path + exact call index | ||
| 4. Root aggregated stats (most general) | ||
|
|
||
| This ensures predictions are always available, even for novel execution paths. | ||
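The fallback order above can be sketched as follows. This is an illustration rather than the PR's `PredictionTrieLookup`: the `Node` fields are invented, and "partial path" is assumed to mean the deepest ancestor on the matched prefix that has data for the call index.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """Hypothetical trie node holding precomputed predictions."""
    aggregate: Optional[dict] = None                   # stats over all call indices
    by_call_index: dict = field(default_factory=dict)  # call index -> prediction
    children: dict = field(default_factory=dict)       # function name -> Node


def lookup(root, path, call_index):
    # Walk down as far as the path matches, remembering every node visited.
    visited = [root]
    node = root
    for name in path:
        node = node.children.get(name)
        if node is None:
            break
        visited.append(node)

    exact = visited[-1] if len(visited) == len(path) + 1 else None
    if exact is not None:
        if call_index in exact.by_call_index:  # 1. exact path + exact call index
            return exact.by_call_index[call_index]
        if exact.aggregate is not None:        # 2. exact path + any call index
            return exact.aggregate

    # 3. partial path + exact call index, deepest matching ancestor first
    ancestors = visited[:-1] if exact is not None else visited
    for ancestor in reversed(ancestors):
        if call_index in ancestor.by_call_index:
            return ancestor.by_call_index[call_index]

    return root.aggregate                      # 4. root aggregated stats
```

As long as the root carries aggregated stats, the function returns a prediction for any path, including one never seen during profiling.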
|
|
||
| ## Headers Injected | ||
|
|
||
| | Header | Source | Description | | ||
| |--------|--------|-------------| | ||
| | `x-nat-remaining-llm-calls` | `prediction.remaining_calls.mean` | Expected remaining LLM calls in workflow | | ||
| | `x-nat-interarrival-ms` | `prediction.interarrival_ms.mean` | Expected milliseconds until next call | | ||
| | `x-nat-expected-output-tokens` | `prediction.output_tokens.p90` | Expected output tokens (90th percentile) | | ||
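A minimal sketch of the mapping in the table above. The header names and source fields come from the table; the nested-dictionary shape of `prediction`, the helper name `build_prediction_headers`, and the choice to round each value to an integer string are assumptions, since the actual value format is not stated here.

```python
def build_prediction_headers(prediction):
    """Map a prediction (assumed nested dict) to the three HTTP headers."""
    return {
        "x-nat-remaining-llm-calls": str(round(prediction["remaining_calls"]["mean"])),
        "x-nat-interarrival-ms": str(round(prediction["interarrival_ms"]["mean"])),
        # The table specifies p90 here rather than the mean.
        "x-nat-expected-output-tokens": str(round(prediction["output_tokens"]["p90"])),
    }
```

A dictionary like this could be merged into the outgoing request headers, for example from a request hook on the HTTP client.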
|
|
||
| ## Comparing Results | ||
|
|
||
| To measure the impact of the prediction trie versus static headers: | ||
|
|
||
| 1. **Run with static headers** (baseline): | ||
| ```bash | ||
| nat eval --config_file configs/eval_config_rethinking_full_test.yml | ||
| ``` | ||
|
|
||
| 2. **Run with prediction trie**: | ||
| ```bash | ||
| nat eval --config_file configs/run_with_prediction_trie.yml | ||
| ``` | ||
|
|
||
| 3. **Compare metrics**: | ||
| - `avg_llm_latency`: Lower is better | ||
| - `avg_workflow_runtime`: Lower is better | ||
| - Look for improvements in KV cache hit rates in Dynamo logs | ||
|
|
||
| ## Configuration Reference | ||
|
|
||
| ### Profiler Configuration (Phase 1) | ||
|
|
||
| Enable trie building in the profiler section: | ||
|
|
||
| ```yaml | ||
| profiler: | ||
|   prediction_trie: | ||
|     enable: true | ||
|     output_filename: prediction_trie.json # default | ||
| ``` | ||
|
|
||
| ### LLM Configuration (Phase 2) | ||
|
|
||
| Add the trie path to your Dynamo LLM config: | ||
|
|
||
| ```yaml | ||
| llms: | ||
|   dynamo_llm: | ||
|     _type: dynamo | ||
|     prefix_template: "react-benchmark-{uuid}" | ||
|
|
||
|     # Static fallbacks (used if trie lookup fails) | ||
|     prefix_total_requests: 10 | ||
|     prefix_osl: MEDIUM | ||
|     prefix_iat: MEDIUM | ||
|
|
||
|     # Dynamic predictions from profiled data | ||
|     prediction_trie_path: /path/to/prediction_trie.json | ||
| ``` | ||
|
|
||
| ## Troubleshooting | ||
|
|
||
| ### "Prediction trie file not found" | ||
|
|
||
| The trie file doesn't exist at the configured path. Check: | ||
| - Did Phase 1 profiling complete successfully? | ||
| - Is the `job_id` in the path correct? | ||
| - If the path is relative, does it resolve from the directory where you run the command? | ||
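The checks above can be automated with a short script. `check_trie_path` is a hypothetical helper, not part of the toolkit, and it assumes the trie file is plain JSON, as the `.json` extension suggests.

```python
import json
from pathlib import Path


def check_trie_path(configured_path, cwd=None):
    """Resolve `configured_path` as it would resolve from `cwd`
    (default: the current directory) and report what is found there."""
    base = Path(cwd) if cwd is not None else Path.cwd()
    # Joining with an absolute path discards `base`, so both cases work.
    resolved = (base / configured_path).resolve()
    if not resolved.is_file():
        return resolved, "missing"
    try:
        json.loads(resolved.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return resolved, "not valid JSON"
    return resolved, "ok"
```

Printing the resolved path usually makes a wrong working directory or a mistyped `job_id` obvious.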
|
|
||
| ### "No prediction found for path" | ||
|
|
||
| This is normal. No exact match exists for the current path, so the lookup falls back to a more general prediction. | ||
|
|
||
| ### Headers not being injected | ||
|
|
||
| Ensure: | ||
| - `prefix_template` is set (required for Dynamo hooks) | ||
| - `prediction_trie_path` points to a valid trie file | ||
| - You're using the `dynamo` LLM type | ||
|
||
|
|
||
| ## Files | ||
|
|
||
| | File | Purpose | | ||
| |------|---------| | ||
| | `configs/profile_rethinking_full_test.yml` | Phase 1: Profile and build trie | | ||
| | `configs/run_with_prediction_trie.yml` | Phase 2: Run with dynamic predictions | | ||