extension/llm/runner: Engine/Session C++ core + token-step primitives #19991
mergennachin wants to merge 6 commits into
[ghstack-poisoned]
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19991
Note: Links to docs will display an error until the docs builds have been completed.
❌ 3 New Failures, 2 Unrelated Failures, 2 Unclassified Failures
As of commit 6edbb17 with merge base eeb0646:
NEW FAILURES - The following jobs have failed:
UNCLASSIFIED FAILURES - DrCI could not classify the following jobs because the workflow did not run on the merge base. The failures may be pre-existing on trunk or introduced by this PR:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Pull request overview
This PR introduces a model-agnostic C++ Engine/Session interface for LLM serving and adds token-step primitives to TextLLMRunner (seek/prefill_tokens/position/decode_one), enabling prefix-KV reuse and session-oriented decoding over a single loaded Program.
Changes:
- Add LLMEngine/LLMSession interfaces plus shared structs (SamplingConfig, DecodeResult, LLMServingCapacity).
- Extend TextLLMRunner with KV cursor control (seek, position), pre-tokenized prefill (prefill_tokens), and single-token decode (decode_one) using shared logit-processor logic.
- Add TextLLMEngine/TextLLMSession adapter layer and new gtests covering the new primitives.
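To make the prefix-KV reuse idea concrete, here is a minimal sketch against a toy mock. Only the operation names (seek, prefill_tokens, position, decode_one) come from the PR description; `MockRunner`, `prefill_with_reuse`, and every signature below are hypothetical stand-ins, not the TextLLMRunner API. The mock "model" simply returns the last token plus one.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a runner with a KV cache. The real signatures
// are not shown in this thread; everything here is illustrative only.
class MockRunner {
 public:
  // Move the KV cursor back to `pos`, dropping cached tokens after it.
  // Returns false if pos is outside [0, position()].
  bool seek(int64_t pos) {
    if (pos < 0 || pos > position()) {
      return false;
    }
    cache_.resize(static_cast<std::size_t>(pos));
    return true;
  }

  // Append already-tokenized input at the current cursor.
  void prefill_tokens(const std::vector<uint64_t>& tokens) {
    cache_.insert(cache_.end(), tokens.begin(), tokens.end());
    prefilled_ += tokens.size();
  }

  int64_t position() const { return static_cast<int64_t>(cache_.size()); }

  // Toy "model": the next token is the last cached token plus one.
  uint64_t decode_one() {
    const uint64_t next = cache_.empty() ? 0 : cache_.back() + 1;
    cache_.push_back(next);
    return next;
  }

  // Total number of tokens that went through prefill_tokens().
  std::size_t prefilled() const { return prefilled_; }
  const std::vector<uint64_t>& cache() const { return cache_; }

 private:
  std::vector<uint64_t> cache_;
  std::size_t prefilled_ = 0;
};

// Reuse the longest common prefix between the cached tokens and a new
// prompt: seek to the end of the shared prefix, then prefill only the rest.
// Returns the number of tokens that were reused.
inline std::size_t prefill_with_reuse(
    MockRunner& runner, const std::vector<uint64_t>& prompt) {
  const std::vector<uint64_t>& cached = runner.cache();
  std::size_t common = 0;
  while (common < cached.size() && common < prompt.size() &&
         cached[common] == prompt[common]) {
    ++common;
  }
  const bool ok = runner.seek(static_cast<int64_t>(common));
  assert(ok);
  (void)ok;
  const std::vector<uint64_t> suffix(
      prompt.begin() + static_cast<std::ptrdiff_t>(common), prompt.end());
  runner.prefill_tokens(suffix);
  return common;
}
```

In this sketch a second prompt that shares three leading tokens with the cache pays for only one newly prefilled token. A real runner would also need to handle the case where the whole prompt is already cached, since a forward pass needs at least one input token.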
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| extension/llm/runner/text_token_generator.h | Factors logit-processor application into a shared helper and adds an EOS helper for single-step decode. |
| extension/llm/runner/text_llm_runner.h | Adds session primitives (seek, prefill_tokens, position, decode_one) and tracks previous decode token. |
| extension/llm/runner/text_llm_runner.cpp | Implements the new primitives, adds KV-capacity checks, and aligns logit processing across decode paths. |
| extension/llm/runner/test/test_text_llm_runner.cpp | Adds unit tests for new primitives (seek/prefill_tokens/decode_one) and serving-capacity default. |
| extension/llm/runner/targets.bzl | Exports the new public header llm_session.h. |
| extension/llm/runner/llm_session.h | Adds the new model-agnostic Engine/Session API definitions. |
| extension/llm/runner/llm_runner_helper.h | Adds shared-Program runner construction + TextLLMEngine declaration. |
| extension/llm/runner/llm_runner_helper.cpp | Implements shared-Program runner creation and TextLLMEngine / TextLLMSession adapter. |
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
extension/llm/runner/text_llm_runner.cpp:255
- If the first generated token returned by prefill (cur_token) is an EOS token and ignore_eos is false, generate() currently still enters TextTokenGenerator::generate(), which forwards the EOS and can emit additional tokens. This diverges from the new session decode_one() semantics (which stop at EOS without forwarding it) and is likely incorrect for one-shot generation as well.
// start the main loop
prompt_tokens.push_back(cur_token);
// Set ignore_eos based on config
text_token_generator_->set_ignore_eos(config.ignore_eos);
// Generate max_new_tokens - 1 because prefill already generated 1 token.
auto generate_result = text_token_generator_->generate(
prompt_tokens, pos_, max_new_tokens - 1, resolved_temp, wrapped_callback);
if (!generate_result.ok()) {
return generate_result.error();
}
int64_t num_generated_tokens = generate_result.get();
pos_ += num_generated_tokens;
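One way to picture the guard this comment asks for is a simplified loop that checks the prefill token for EOS before any further decode step. This is a sketch under assumptions: `generate_sketch`, its parameters, and the `step` callback are hypothetical and do not correspond to TextTokenGenerator or any ExecuTorch signature.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_set>
#include <vector>

// Hypothetical, simplified one-shot generation loop. `step` stands in for
// one forward pass plus sampling; it is not an ExecuTorch API.
inline std::vector<uint64_t> generate_sketch(
    uint64_t first_token,
    int32_t max_new_tokens,
    bool ignore_eos,
    const std::unordered_set<uint64_t>& eos_ids,
    const std::function<uint64_t(uint64_t)>& step) {
  std::vector<uint64_t> out;
  if (max_new_tokens <= 0) {
    return out;
  }
  out.push_back(first_token);
  // Guard: if prefill already produced EOS, stop before the decode loop so
  // the EOS token is never fed back into the model.
  if (!ignore_eos && eos_ids.count(first_token) > 0) {
    return out;
  }
  uint64_t cur = first_token;
  while (out.size() < static_cast<std::size_t>(max_new_tokens)) {
    cur = step(cur);
    out.push_back(cur);
    if (!ignore_eos && eos_ids.count(cur) > 0) {
      break;  // EOS is recorded but not forwarded to another step.
    }
  }
  return out;
}
```

With the guard in place, the one-shot path and a per-token session loop agree on the first token: neither runs another step after an EOS unless ignore_eos is set.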
  virtual ::executorch::runtime::Result<uint64_t> prefill(
      std::vector<uint64_t>& prompt_tokens,
-     int64_t& start_pos);
+     int64_t& start_pos,
+     float temperature = 0.0f);

  /**
   * Helper method to prefill a chunk of tokens.
   * @param prompt_tokens The chunk of text prompt tokens to process.
   * @param start_pos The starting position in KV cache of the input in the LLM
   * Module.
+  * @param temperature Sampling temperature for the token produced by this
+  * chunk. Defaults to greedy (0.0).
   * @return The next token of the LLM Module after prefilling this chunk.
   */
  virtual ::executorch::runtime::Result<uint64_t> prefill_chunk(
      std::vector<uint64_t>& prompt_tokens,
-     int64_t& start_pos);
+     int64_t& start_pos,
+     float temperature = 0.0f);
runtime.cxx_library(
    name = "runner_lib" + aten_suffix,
    exported_headers = [
        "text_llm_runner.h",
        "text_llm_session.h",
        "llm_runner_helper.h",
        "llm_session.h",
        "constants.h",
    ],
    srcs = [
@claude review
Claude finished @kirklandsign's task in 2m 11s
Code Review: Engine/Session C++ core + token-step primitives
Overall this is a well-structured PR. The LLMEngine/LLMSession abstraction cleanly separates the serving surface from the model-specific runner, and the token-step primitives (
Issues
1.
pos_ < metadata_.at(kMaxContextLen)
After a successful
2. The
/// Returns InvalidArgument if pos is out of [0, position()].
/// Returns NotSupported for sliding-window models or non-KV-cache models.
3. The comment says:
This is now accurate (the Copilot review flagged an earlier version that mentioned
4. Each
This isn't a correctness bug, but worth noting for the follow-up stacked PRs.
5. This dual-flag stop is intentional (one for the session decode loop, one for the generate loop), but if
So no bug, but the dual-stop mechanism is subtle enough to warrant a brief inline comment on
6.
const float temp = (temperature < 0.0f)
        ? (temperature_ == -1.0f ? 0.0f : temperature_)
        : temperature;
When
7. The Copilot reviewer's concern about this file being exported was incorrect. It's in
Design Observations (Not Issues)
CI Failures
The lintrunner failure (
Summary
This is a solid foundation for the serving stack. The main actionable item is improving the
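The temperature ternary quoted under item 6 can be read as a two-level fallback. The sketch below restates it as a free function so the cases are easy to check. The function name, the parameter names, and the reading of the sentinels (a negative request means "use the runner default", and a default of -1.0 means "unset, so greedy") are assumptions drawn only from the quoted expression.

```cpp
#include <cassert>

// Mirrors the quoted ternary as a standalone function. The names and the
// interpretation of the sentinel values are assumptions, not the PR's code.
inline float resolve_temperature(float requested, float runner_default) {
  return (requested < 0.0f)
      // No explicit per-call value: fall back to the runner default, and
      // treat a default of -1.0 as "unset", which resolves to greedy (0.0).
      ? (runner_default == -1.0f ? 0.0f : runner_default)
      // An explicit per-call value, including 0.0 for greedy, wins.
      : requested;
}
```

Under this reading, an explicit 0.0 is honored as greedy rather than being mistaken for "not set", which is why the sentinel is a negative value.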
Add the model-agnostic LLMEngine/LLMSession interfaces (llm_session.h) with
SamplingConfig, DecodeResult and LLMServingCapacity; the TextLLMRunner
token-step primitives the session layer is built on (seek, prefill_tokens,
position, decode_one); and TextLLMEngine/TextLLMSession over a single loaded
Program. decode_one() shares generate()'s logit processors via
TextTokenGenerator::apply_logit_processors so the two decode paths cannot
diverge. serving_capacity() reports a conservative single physical session
(physical weight sharing is backend-dependent).
Also add utf8_complete_prefix_len and stop_safe_prefix_len (util.h): byte-level
BPE tokenizers can emit a token that is only part of a multi-byte UTF-8
character, so a streaming consumer must forward only the complete-character
prefix of accumulated pieces and hold the trailing bytes until the rest arrives;
stop_safe_prefix_len additionally holds back the longest possible partial-stop
tail so a stop string straddling pieces is still caught. The C++ workers built on
this core use both to stream UTF-8-safe text with stop sequences. Covered by
gtests in test_text_llm_runner.cpp and test_util.cpp.
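
The two streaming helpers described above can be sketched as follows. Only the
function names come from this commit message; the signatures and bodies are
guesses for illustration and are not the code in util.h. The stop helper here
only holds back a partial stop tail; detecting a complete stop string is
assumed to happen elsewhere.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Length of the longest prefix of `s` that ends on a complete UTF-8
// character. Assumed signature; invalid input is passed through unchanged.
inline std::size_t utf8_complete_prefix_len(std::string_view s) {
  const std::size_t n = s.size();
  std::size_t i = n;
  std::size_t back = 0;
  // Walk back at most 4 bytes to find the lead byte of the last character.
  while (i > 0 && back < 4) {
    --i;
    ++back;
    const unsigned char c = static_cast<unsigned char>(s[i]);
    if ((c & 0xC0) == 0x80) {
      continue;  // continuation byte (10xxxxxx), keep looking
    }
    std::size_t need = 1;
    if ((c & 0xE0) == 0xC0) {
      need = 2;
    } else if ((c & 0xF0) == 0xE0) {
      need = 3;
    } else if ((c & 0xF8) == 0xF0) {
      need = 4;
    }
    // If the last character is missing bytes, cut before its lead byte.
    return (n - i < need) ? i : n;
  }
  return n;
}

// Length of the prefix of `text` that can be emitted without leaking the
// start of a stop string: hold back the longest suffix of `text` that is a
// proper prefix of any stop. Assumed signature.
inline std::size_t stop_safe_prefix_len(
    std::string_view text, const std::vector<std::string>& stops) {
  std::size_t hold = 0;
  for (const std::string& stop : stops) {
    if (stop.empty()) {
      continue;
    }
    const std::string_view sv(stop);
    // Try the longest candidate first; only improve on the current hold.
    for (std::size_t k = std::min(sv.size() - 1, text.size()); k > hold; --k) {
      if (text.substr(text.size() - k) == sv.substr(0, k)) {
        hold = k;
        break;
      }
    }
  }
  return text.size() - hold;
}
```

A streaming consumer could apply the stop check first and then trim the result
to a character boundary, for example
utf8_complete_prefix_len(text.substr(0, stop_safe_prefix_len(text, stops))).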
First of six stacked commits: C++ core -> server foundations -> worker-based
HTTP server -> pi docs -> Qwen worker -> Qwen CUDA V2 (per-session state).
Part of #20001