Skip to content

extension/llm/runner: Engine/Session C++ core + token-step primitives#19991

Open
mergennachin wants to merge 6 commits into
mainfrom
gh/mergennachin/2/head
Open

extension/llm/runner: Engine/Session C++ core + token-step primitives#19991
mergennachin wants to merge 6 commits into
mainfrom
gh/mergennachin/2/head

Conversation

@mergennachin

@mergennachin mergennachin commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Add the model-agnostic LLMEngine/LLMSession interfaces (llm_session.h) with
SamplingConfig, DecodeResult and LLMServingCapacity; the TextLLMRunner
token-step primitives the session layer is built on (seek, prefill_tokens,
position, decode_one); and TextLLMEngine/TextLLMSession over a single loaded
Program. decode_one() shares generate()'s logit processors via
TextTokenGenerator::apply_logit_processors so the two decode paths cannot
diverge. serving_capacity() reports a conservative single physical session
(physical weight sharing is backend-dependent).

Also add utf8_complete_prefix_len and stop_safe_prefix_len (util.h): byte-level
BPE tokenizers can emit a token that is only part of a multi-byte UTF-8
character, so a streaming consumer must forward only the complete-character
prefix of accumulated pieces and hold the trailing bytes until the rest arrives;
stop_safe_prefix_len additionally holds back the longest possible partial-stop
tail so a stop string straddling pieces is still caught. The C++ workers built on
this core use both to stream UTF-8-safe text with stop sequences. Covered by
gtests in test_text_llm_runner.cpp and test_util.cpp.

First of six stacked commits: C++ core -> server foundations -> worker-based
HTTP server -> pi docs -> Qwen worker -> Qwen CUDA V2 (per-session state).

Part of #20001

[ghstack-poisoned]
Copilot AI review requested due to automatic review settings June 3, 2026 22:32
@pytorch-bot

pytorch-bot Bot commented Jun 3, 2026

Copy link
Copy Markdown

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19991

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 2 Unrelated Failures, 2 Unclassified Failures

As of commit 6edbb17 with merge base eeb0646 (image):

NEW FAILURES - The following jobs have failed:

UNCLASSIFIED FAILURES - DrCI could not classify the following jobs because the workflow did not run on the merge base. The failures may be pre-existing on trunk or introduced by this PR:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@github-actions

github-actions Bot commented Jun 3, 2026

Copy link
Copy Markdown

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a model-agnostic C++ Engine/Session interface for LLM serving and adds token-step primitives to TextLLMRunner (seek/prefill_tokens/position/decode_one), enabling prefix-KV reuse and session-oriented decoding over a single loaded Program.

Changes:

  • Add LLMEngine / LLMSession interfaces plus shared structs (SamplingConfig, DecodeResult, LLMServingCapacity).
  • Extend TextLLMRunner with KV cursor control (seek, position), pre-tokenized prefill (prefill_tokens), and single-token decode (decode_one) using shared logit-processor logic.
  • Add TextLLMEngine/TextLLMSession adapter layer and new gtests covering the new primitives.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
extension/llm/runner/text_token_generator.h Factors logit-processor application into a shared helper and adds an EOS helper for single-step decode.
extension/llm/runner/text_llm_runner.h Adds session primitives (seek, prefill_tokens, position, decode_one) and tracks previous decode token.
extension/llm/runner/text_llm_runner.cpp Implements the new primitives, adds KV-capacity checks, and aligns logit processing across decode paths.
extension/llm/runner/test/test_text_llm_runner.cpp Adds unit tests for new primitives (seek/prefill_tokens/decode_one) and serving-capacity default.
extension/llm/runner/targets.bzl Exports the new public header llm_session.h.
extension/llm/runner/llm_session.h Adds the new model-agnostic Engine/Session API definitions.
extension/llm/runner/llm_runner_helper.h Adds shared-Program runner construction + TextLLMEngine declaration.
extension/llm/runner/llm_runner_helper.cpp Implements shared-Program runner creation and TextLLMEngine / TextLLMSession adapter.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread extension/llm/runner/text_llm_runner.cpp
Comment thread extension/llm/runner/text_llm_runner.cpp
Comment thread extension/llm/runner/text_llm_runner.h Outdated
Comment thread extension/llm/runner/llm_session.h Outdated
Comment thread extension/llm/runner/llm_runner_helper.cpp Outdated
[ghstack-poisoned]
[ghstack-poisoned]

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

extension/llm/runner/text_llm_runner.cpp:255

  • If the first generated token returned by prefill (cur_token) is an EOS token and ignore_eos is false, generate() currently still enters TextTokenGenerator::generate(), which forwards the EOS and can emit additional tokens. This diverges from the new session decode_one() semantics (which stop at EOS without forwarding it) and is likely incorrect for one-shot generation as well.
  // start the main loop
  prompt_tokens.push_back(cur_token);

  // Set ignore_eos based on config
  text_token_generator_->set_ignore_eos(config.ignore_eos);

  // Generate max_new_tokens - 1 because prefill already generated 1 token.
  auto generate_result = text_token_generator_->generate(
      prompt_tokens, pos_, max_new_tokens - 1, resolved_temp, wrapped_callback);

  if (!generate_result.ok()) {
    return generate_result.error();
  }
  int64_t num_generated_tokens = generate_result.get();

  pos_ += num_generated_tokens;

Comment on lines 39 to +56
virtual ::executorch::runtime::Result<uint64_t> prefill(
std::vector<uint64_t>& prompt_tokens,
int64_t& start_pos);
int64_t& start_pos,
float temperature = 0.0f);

/**
* Helper method to prefill a chunk of tokens.
* @param prompt_tokens The chunk of text prompt tokens to process.
* @param start_pos The starting position in KV cache of the input in the LLM
* Module.
* @param temperature Sampling temperature for the token produced by this
* chunk. Defaults to greedy (0.0).
* @return The next token of the LLM Module after prefilling this chunk.
*/
virtual ::executorch::runtime::Result<uint64_t> prefill_chunk(
std::vector<uint64_t>& prompt_tokens,
int64_t& start_pos);
int64_t& start_pos,
float temperature = 0.0f);
Comment on lines 112 to 121
runtime.cxx_library(
name = "runner_lib" + aten_suffix,
exported_headers = [
"text_llm_runner.h",
"text_llm_session.h",
"llm_runner_helper.h",
"llm_session.h",
"constants.h",
],
srcs = [
[ghstack-poisoned]
[ghstack-poisoned]
@mergennachin mergennachin marked this pull request as ready for review June 5, 2026 18:55
Copilot AI review requested due to automatic review settings June 5, 2026 18:55

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@kirklandsign

Copy link
Copy Markdown
Contributor

@claude review

@claude

claude Bot commented Jun 5, 2026

Copy link
Copy Markdown

Claude finished @kirklandsign's task in 2m 11s —— View job


Code Review: Engine/Session C++ core + token-step primitives

  • Read all changed files
  • Analyze API design (llm_session.h)
  • Review token-step primitives (text_llm_runner.cpp/h)
  • Review adapter layer (llm_runner_helper.cpp/h)
  • Review tests
  • Review utility additions
  • Post review feedback

Overall this is a well-structured PR. The LLMEngine/LLMSession abstraction cleanly separates the serving surface from the model-specific runner, and the token-step primitives (seek, prefill_tokens, decode_one) have thorough guards and consistent semantics with the existing generate() path. The test coverage is solid. Below are the issues I found, ordered by severity.


Issues

1. decode_one() capacity check is off-by-one — allows stepping to position max_context_len but then the next step correctly rejects

text_llm_runner.cpp:529:

pos_ < metadata_.at(kMaxContextLen)

After a successful decode_one(), pos_ increments to the context length (e.g. 128), and then the next decode_one() fails. This is intentional (the test DecodeOneRejectsWhenContextFull validates this), but it means the last successful forward pass writes a KV entry at position max_context_len - 1, which is the last valid slot. Just documenting that this behavior is correct and by design — the pending token at that point can still be emitted (no forward needed for the EOS short-circuit), but cannot be forwarded. This is fine.

2. seek() header doc should mention NotSupported error — (text_llm_runner.h:179)

The LLMSession::seek() doc in llm_session.h:96 says "sliding-window KV may reject a seek past its window" but doesn't name the error code. Callers handling seek failure programmatically (e.g. falling back to reset+prefill) need to match on both InvalidArgument and NotSupported. Consider adding the error code to the doc:

/// Returns InvalidArgument if pos is out of [0, position()].
/// Returns NotSupported for sliding-window models or non-KV-cache models.

Fix this →

3. SamplingConfig doc references non-existent capabilities() API — (llm_session.h:29-30)

The comment says:

An adapter applies the fields it supports and rejects non-default values of the rest rather than silently ignoring them (today only temperature is plumbed).

This is now accurate (the Copilot review flagged an earlier version that mentioned capabilities()). The current code correctly rejects unsupported params — no action needed here.

4. TextLLMEngine::create_session() re-loads the tokenizer from disk on every session — (llm_runner_helper.cpp:503)

Each create_session() call does load_tokenizer(tokenizer_path_), loading and parsing the tokenizer file again. For a high-throughput server with many sessions, this adds latency per session creation. Consider either:

  • Caching a shared tokenizer (tokenizers are typically stateless/thread-safe for encode/decode)
  • Or documenting this as acceptable for the "conservative single session" v1

This isn't a correctness bug, but worth noting for the follow-up stacked PRs.

5. stop() sets both stop_requested_ and text_token_generator_->stop() — (text_llm_runner.cpp:376-377)

This dual-flag stop is intentional (one for the session decode loop, one for the generate loop), but if stop() is called while generate() is running, both flags fire. On the next reset(), stop_requested_ is cleared (line 389) but text_token_generator_'s should_stop_ is only cleared at the start of its generate() loop (line 129 in text_token_generator.h). This means:

  • If you call stop()reset() → then use the session path (prefill_tokens/decode_one), it works fine.
  • If you call stop()reset() → then call generate(), it also works fine because generate() calls text_token_generator_->generate() which resets should_stop_ at the top of the loop.

So no bug, but the dual-stop mechanism is subtle enough to warrant a brief inline comment on stop() noting why both are needed.

6. prefill_tokens temperature resolution subtlety — (text_llm_runner.cpp:466-468)

const float temp = (temperature < 0.0f)
    ? (temperature_ == -1.0f ? 0.0f : temperature_)
    : temperature;

When temperature == 0.0f (explicit greedy from the caller), it's correctly passed through. When temperature == -1.0f (sentinel for "use default"), it falls into the < 0 branch. This is correct. Just noting it for clarity — the -1.0f sentinel is consistent throughout.

7. targets.bzl correctly places text_llm_session.h in private headers, not exported_headers — (line 123)

The Copilot reviewer's concern about this file being exported was incorrect. It's in headers (Buck private), not exported_headers, so external code cannot depend on it through the build system. The detail namespace naming reinforces this.


Design Observations (Not Issues)

  • Logit processor sharing via apply_logit_processors() (text_token_generator.h:62): Smart approach — extracting it as a public method on TextTokenGenerator lets decode_one() call it without duplicating the loop, ensuring the session path can never diverge from generate(). The test DecodeOneAppliesLogitProcessors validates this.

  • Cooperative stop semantics: The distinction between is_eos (literal EOS token) and is_terminal (loop-ending signal) in DecodeResult is clean and will serve the HTTP server layer well for different "finish_reason" values.

  • utf8_complete_prefix_len and stop_safe_prefix_len in util.h: Well-tested streaming utilities that will be needed for the HTTP server in later stacked PRs. Good to land them with their tests now.


CI Failures

The lintrunner failure (Lint for extension/llm/runner/util.h) is likely related to the inline functions in the header. The link-check and binary-size failures appear infrastructure-related (docker exec exit code 127 = command not found). The WebGPU failures are unclassified and likely pre-existing.


Summary

This is a solid foundation for the serving stack. The main actionable item is improving the seek() error documentation so callers can handle NotSupported vs InvalidArgument programmatically. The tokenizer-per-session pattern is fine for v1 but should be revisited when session creation becomes a hot path.

[ghstack-poisoned]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants