extension/llm/runner: Engine/Session C++ core + token-step primitives by mergennachin · Pull Request #19991 · pytorch/executorch

mergennachin · 2026-06-03T22:32:38Z

Add the model-agnostic LLMEngine/LLMSession interfaces (llm_session.h) with
SamplingConfig, DecodeResult and LLMServingCapacity; the TextLLMRunner
token-step primitives the session layer is built on (seek, prefill_tokens,
position, decode_one); and TextLLMEngine/TextLLMSession over a single loaded
Program. decode_one() shares generate()'s logit processors via
TextTokenGenerator::apply_logit_processors so the two decode paths cannot
diverge. serving_capacity() reports a conservative single physical session
(physical weight sharing is backend-dependent).

Also add utf8_complete_prefix_len and stop_safe_prefix_len (util.h): byte-level
BPE tokenizers can emit a token that is only part of a multi-byte UTF-8
character, so a streaming consumer must forward only the complete-character
prefix of accumulated pieces and hold the trailing bytes until the rest arrives;
stop_safe_prefix_len additionally holds back the longest possible partial-stop
tail so a stop string straddling pieces is still caught. The C++ workers built on
this core use both to stream UTF-8-safe text with stop sequences. Covered by
gtests in test_text_llm_runner.cpp and test_util.cpp.
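
The hold-back rule described above can be illustrated with a minimal sketch. This is not the util.h code from this PR: the `sketch_` function names, their signatures, and the choice to forward malformed input unchanged are all assumptions made for illustration. It also assumes the caller has already searched the accumulated text for a complete stop string, so only partial matches at the tail remain to be handled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical sketch, not the PR's implementation.
// Returns the length of the longest prefix of `s` that does not end in the
// middle of a multi-byte UTF-8 character.
inline size_t sketch_utf8_complete_prefix_len(std::string_view s) {
  const size_t n = s.size();
  size_t i = n;
  // A UTF-8 character is at most 4 bytes, so look back at most 4 bytes.
  for (size_t back = 0; i > 0 && back < 4; ++back) {
    --i;
    const unsigned char c = static_cast<unsigned char>(s[i]);
    if ((c & 0xC0) == 0x80) {
      continue;  // continuation byte: keep looking for the lead byte
    }
    size_t need = 1;  // ASCII
    if ((c & 0xE0) == 0xC0) {
      need = 2;
    } else if ((c & 0xF0) == 0xE0) {
      need = 3;
    } else if ((c & 0xF8) == 0xF0) {
      need = 4;
    } else if ((c & 0x80) != 0) {
      return n;  // invalid lead byte: assumption, forward it unchanged
    }
    // Hold the trailing bytes back only if the last character is incomplete.
    return (n - i >= need) ? n : i;
  }
  return n;  // empty input, or only continuation bytes (malformed)
}

// Hypothetical sketch. Additionally holds back the longest tail of `text`
// that is a proper prefix of any stop string, so a stop string split across
// pieces can still be matched once the next piece arrives.
inline size_t sketch_stop_safe_prefix_len(
    std::string_view text, const std::vector<std::string>& stops) {
  size_t hold = 0;
  for (const std::string& stop : stops) {
    if (stop.empty()) {
      continue;
    }
    const std::string_view sv(stop);
    const size_t max_k = std::min(text.size(), sv.size() - 1);
    for (size_t k = max_k; k > hold; --k) {
      if (text.substr(text.size() - k) == sv.substr(0, k)) {
        hold = k;
        break;
      }
    }
  }
  return std::min(sketch_utf8_complete_prefix_len(text), text.size() - hold);
}
```

A streaming consumer would append each new piece to a buffer, emit the first `sketch_stop_safe_prefix_len(buffer, stops)` bytes, and keep the remainder for the next piece.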

First of six stacked commits: C++ core -> server foundations -> worker-based
HTTP server -> pi docs -> Qwen worker -> Qwen CUDA V2 (per-session state).

Part of #20001

[ghstack-poisoned]

mergennachin · 2026-06-03T22:32:39Z

Stack from ghstack (oldest at bottom):

pytorch-bot · 2026-06-03T22:32:41Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19991

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 2 Unrelated Failures, 2 Unclassified Failures

As of commit 6edbb17 with merge base eeb0646:

NEW FAILURES - The following jobs have failed:

Lint / link-check / lint-urls (gh)
  Process completed with exit code 1.
pull / test-arm-cortex-m-size-test (bare_metal) / linux-job (gh)
  RuntimeError: Command docker exec -t 8d38d5701cd1806af8736a37af96f99def7592dbb78a6a0f3ee59a2bf0066c95 /exec failed with exit code 127
pull / test-binary-size-linux-gcc / linux-job (gh)
  RuntimeError: Command docker exec -t c3e69706b42489b30f74166027f250eb79cd10c7e3449540ef2fe09ded7d1d6c /exec failed with exit code 127

UNCLASSIFIED FAILURES - DrCI could not classify the following jobs because the workflow did not run on the merge base. The failures may be pre-existing on trunk or introduced by this PR:

Test WebGPU Backend / test-webgpu / test-backend-linux (webgpu, models) / linux-job (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
  RuntimeError: Command docker exec -t 3b3c5b0c61785bac148c231caad189d8b10d9883eed90361c789bd1f18ad2e83 /exec failed with exit code 1
Test WebGPU Backend / test-webgpu / test-backend-linux (webgpu, operators) / linux-job (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
  RuntimeError: Command docker exec -t cb5b9a1ed35c0b3cac8a8b3f4969c751f0b3a2aa06351dd7ccd2795b8a351069 /exec failed with exit code 1

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

pull / test-arm-cortex-m-size-test (zephyr-preset) / linux-job (gh) (detected as infra flaky with no log or failing log classifier)
pull / test-binary-size-linux / linux-job (gh) (detected as infra flaky with no log or failing log classifier)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-06-03T22:33:24Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Copilot

Pull request overview

This PR introduces a model-agnostic C++ Engine/Session interface for LLM serving and adds token-step primitives to TextLLMRunner (seek/prefill_tokens/position/decode_one), enabling prefix-KV reuse and session-oriented decoding over a single loaded Program.

Changes:

Add LLMEngine / LLMSession interfaces plus shared structs (SamplingConfig, DecodeResult, LLMServingCapacity).
Extend TextLLMRunner with KV cursor control (seek, position), pre-tokenized prefill (prefill_tokens), and single-token decode (decode_one) using shared logit-processor logic.
Add TextLLMEngine/TextLLMSession adapter layer and new gtests covering the new primitives.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

File	Description
extension/llm/runner/text_token_generator.h	Factors logit-processor application into a shared helper and adds an EOS helper for single-step decode.
extension/llm/runner/text_llm_runner.h	Adds session primitives (`seek`, `prefill_tokens`, `position`, `decode_one`) and tracks previous decode token.
extension/llm/runner/text_llm_runner.cpp	Implements the new primitives, adds KV-capacity checks, and aligns logit processing across decode paths.
extension/llm/runner/test/test_text_llm_runner.cpp	Adds unit tests for new primitives (seek/prefill_tokens/decode_one) and serving-capacity default.
extension/llm/runner/targets.bzl	Exports the new public header `llm_session.h`.
extension/llm/runner/llm_session.h	Adds the new model-agnostic Engine/Session API definitions.
extension/llm/runner/llm_runner_helper.h	Adds shared-Program runner construction + `TextLLMEngine` declaration.
extension/llm/runner/llm_runner_helper.cpp	Implements shared-Program runner creation and `TextLLMEngine` / `TextLLMSession` adapter.


Copilot

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

extension/llm/runner/text_llm_runner.cpp:255

If the first generated token returned by prefill (cur_token) is an EOS token and ignore_eos is false, generate() currently still enters TextTokenGenerator::generate(), which forwards the EOS and can emit additional tokens. This diverges from the new session decode_one() semantics (which stop at EOS without forwarding it) and is likely incorrect for one-shot generation as well.

  // start the main loop
  prompt_tokens.push_back(cur_token);

  // Set ignore_eos based on config
  text_token_generator_->set_ignore_eos(config.ignore_eos);

  // Generate max_new_tokens - 1 because prefill already generated 1 token.
  auto generate_result = text_token_generator_->generate(
      prompt_tokens, pos_, max_new_tokens - 1, resolved_temp, wrapped_callback);

  if (!generate_result.ok()) {
    return generate_result.error();
  }
  int64_t num_generated_tokens = generate_result.get();

  pos_ += num_generated_tokens;

  virtual ::executorch::runtime::Result<uint64_t> prefill(
      std::vector<uint64_t>& prompt_tokens,
-      int64_t& start_pos);
+      int64_t& start_pos,
+      float temperature = 0.0f);

  /**
   * Helper method to prefill a chunk of tokens.
   * @param prompt_tokens The chunk of text prompt tokens to process.
   * @param start_pos The starting position in KV cache of the input in the LLM
   * Module.
+   * @param temperature Sampling temperature for the token produced by this
+   * chunk. Defaults to greedy (0.0).
   * @return The next token of the LLM Module after prefilling this chunk.
   */
  virtual ::executorch::runtime::Result<uint64_t> prefill_chunk(
      std::vector<uint64_t>& prompt_tokens,
-      int64_t& start_pos);
+      int64_t& start_pos,
+      float temperature = 0.0f);


        runtime.cxx_library(
            name = "runner_lib" + aten_suffix,
            exported_headers = [
                "text_llm_runner.h",
+                "text_llm_session.h",
                "llm_runner_helper.h",
+                "llm_session.h",
                "constants.h",
            ],
            srcs = [



Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

kirklandsign · 2026-06-05T21:20:57Z

@claude review

claude · 2026-06-05T21:21:36Z

Claude finished @kirklandsign's task in 2m 11s

Code Review: Engine/Session C++ core + token-step primitives

Read all changed files
Analyze API design (llm_session.h)
Review token-step primitives (text_llm_runner.cpp/h)
Review adapter layer (llm_runner_helper.cpp/h)
Review tests
Review utility additions
Post review feedback

Overall this is a well-structured PR. The LLMEngine/LLMSession abstraction cleanly separates the serving surface from the model-specific runner, and the token-step primitives (seek, prefill_tokens, decode_one) have thorough guards and consistent semantics with the existing generate() path. The test coverage is solid. Below are the issues I found, ordered by severity.

Issues

1. decode_one() capacity check looks off-by-one but is correct by design: pos_ may reach max_context_len, and the next step is then rejected

text_llm_runner.cpp:529:

pos_ < metadata_.at(kMaxContextLen)

With this guard, the last successful decode_one() runs its forward pass at pos_ == max_context_len - 1, writing the KV entry into the last valid slot, and then increments pos_ to the context length (e.g. 128). The next decode_one() fails, which is intentional (the test DecodeOneRejectsWhenContextFull validates this). The pending token at that point can still be emitted (the EOS short-circuit needs no forward pass), but it cannot be forwarded. This behavior is correct and by design; no change is needed.
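
The boundary can be checked with a toy model of the guard. The sketch below is only an illustration: `ToyCursor` and `count_successful_steps` are hypothetical names, and the real runner does far more per step than bump a counter.

```cpp
#include <cassert>
#include <cstdint>

// Toy stand-in for the runner's KV cursor; all names are hypothetical.
struct ToyCursor {
  int64_t pos = 0;
  int64_t max_context_len = 0;

  // Same shape as the guard quoted above: a forward pass is allowed only
  // while pos < max_context_len, and it writes the KV entry at index `pos`.
  bool step(int64_t* written_slot) {
    if (!(pos < max_context_len)) {
      return false;  // context full: reject without touching the cache
    }
    *written_slot = pos;
    ++pos;
    return true;
  }
};

// Steps until the guard rejects. Returns the number of successful steps and
// reports the last KV slot written (-1 if none).
inline int64_t count_successful_steps(
    int64_t max_context_len, int64_t* last_slot) {
  ToyCursor cursor;
  cursor.max_context_len = max_context_len;
  int64_t steps = 0;
  int64_t slot = -1;
  while (cursor.step(&slot)) {
    ++steps;
  }
  *last_slot = slot;
  return steps;
}
```

With a context length of 4, the toy performs exactly 4 steps and the last write lands in slot 3, so no slot beyond the cache is ever touched.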

2. seek() header doc should mention NotSupported error — (text_llm_runner.h:179)

The LLMSession::seek() doc in llm_session.h:96 says "sliding-window KV may reject a seek past its window" but doesn't name the error code. Callers handling seek failure programmatically (e.g. falling back to reset+prefill) need to match on both InvalidArgument and NotSupported. Consider adding the error code to the doc:

/// Returns InvalidArgument if pos is out of [0, position()].
/// Returns NotSupported for sliding-window models or non-KV-cache models.
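
To show why naming both codes matters to a caller, here is a minimal sketch of the reset+prefill fallback mentioned above. It uses a toy session and a toy error enum rather than the real LLMSession or ExecuTorch error types; every name in it is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the real error codes and session type.
enum class ToyError { Ok, InvalidArgument, NotSupported };

struct ToySession {
  int64_t pos = 0;
  bool supports_seek = true;  // false models a sliding-window or non-KV model

  ToyError seek(int64_t target) {
    if (!supports_seek) {
      return ToyError::NotSupported;
    }
    if (target < 0 || target > pos) {
      return ToyError::InvalidArgument;  // outside [0, position()]
    }
    pos = target;
    return ToyError::Ok;
  }
  void reset() { pos = 0; }
  void prefill(int64_t num_tokens) { pos += num_tokens; }
};

// Tries to rewind to a shared prefix of `prefix_len` tokens. If the seek is
// rejected for either reason, rebuilds the prefix from scratch instead.
// Returns how many tokens had to be prefilled again.
inline int64_t rewind_or_rebuild(ToySession& session, int64_t prefix_len) {
  const ToyError err = session.seek(prefix_len);
  if (err == ToyError::Ok) {
    return 0;  // the cached prefix was reused
  }
  // Both InvalidArgument and NotSupported take the same fallback path.
  session.reset();
  session.prefill(prefix_len);
  return prefix_len;
}
```

A caller that matched only on InvalidArgument would treat a sliding-window model's NotSupported as a hard failure instead of falling back.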

3. SamplingConfig doc no longer references the non-existent capabilities() API (llm_session.h:29-30)

The comment says:

An adapter applies the fields it supports and rejects non-default values of the rest rather than silently ignoring them (today only temperature is plumbed).

This is now accurate (the Copilot review flagged an earlier version that mentioned capabilities()). The current code correctly rejects unsupported params — no action needed here.

4. TextLLMEngine::create_session() re-loads the tokenizer from disk on every session — (llm_runner_helper.cpp:503)

Each create_session() call does load_tokenizer(tokenizer_path_), loading and parsing the tokenizer file again. For a high-throughput server with many sessions, this adds latency per session creation. Consider either:

Caching a shared tokenizer (tokenizers are typically stateless/thread-safe for encode/decode)
Or documenting this as acceptable for the "conservative single session" v1

This isn't a correctness bug, but worth noting for the follow-up stacked PRs.
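
The first option could look roughly like the sketch below. It is an assumption-heavy illustration: `ToyTokenizer` and `ToyEngine` are hypothetical, the load is faked, and sharing is only safe if the real tokenizer is thread-safe for encode/decode, which the review suggests is typical but does not confirm.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical stand-in for a loaded tokenizer.
struct ToyTokenizer {
  std::string path;
};

// Hypothetical engine that loads the tokenizer once and hands the same
// immutable instance to every session.
class ToyEngine {
 public:
  explicit ToyEngine(std::string tokenizer_path)
      : tokenizer_path_(std::move(tokenizer_path)) {}

  std::shared_ptr<const ToyTokenizer> tokenizer() {
    std::lock_guard<std::mutex> lock(mu_);
    if (!tokenizer_) {
      // Stand-in for the expensive load-and-parse step.
      tokenizer_ = std::make_shared<ToyTokenizer>(ToyTokenizer{tokenizer_path_});
      ++load_count_;
    }
    return tokenizer_;
  }

  int load_count() const {
    std::lock_guard<std::mutex> lock(mu_);
    return load_count_;
  }

 private:
  std::string tokenizer_path_;
  mutable std::mutex mu_;
  std::shared_ptr<const ToyTokenizer> tokenizer_;
  int load_count_ = 0;
};
```

Each session would hold the returned `shared_ptr`, so the tokenizer outlives the engine call that created it and is parsed only once.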

5. stop() sets both stop_requested_ and text_token_generator_->stop() — (text_llm_runner.cpp:376-377)

This dual-flag stop is intentional (one for the session decode loop, one for the generate loop), but if stop() is called while generate() is running, both flags fire. On the next reset(), stop_requested_ is cleared (line 389) but text_token_generator_'s should_stop_ is only cleared at the start of its generate() loop (line 129 in text_token_generator.h). This means:

If you call stop() → reset() → then use the session path (prefill_tokens/decode_one), it works fine.
If you call stop() → reset() → then call generate(), it also works fine because generate() calls text_token_generator_->generate() which resets should_stop_ at the top of the loop.

So no bug, but the dual-stop mechanism is subtle enough to warrant a brief inline comment on stop() noting why both are needed.
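
The two call sequences can be traced with a toy model of the two flags. This sketch only mirrors the behavior as described in this review; `ToyGenerator` and `ToyRunner` are hypothetical, and the toy `decode_one()` merely reports whether a step would proceed.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for the token generator and its own stop flag.
struct ToyGenerator {
  std::atomic<bool> should_stop{false};

  void stop() { should_stop = true; }

  // Clears the flag at the top of the loop, as the review describes.
  int generate(int max_tokens) {
    should_stop = false;
    int produced = 0;
    while (produced < max_tokens && !should_stop) {
      ++produced;
    }
    return produced;
  }
};

// Hypothetical stand-in for the runner with its session-path stop flag.
struct ToyRunner {
  std::atomic<bool> stop_requested{false};
  ToyGenerator generator;

  // Sets both flags: one is read by the session path, one by generate().
  void stop() {
    stop_requested = true;
    generator.stop();
  }

  // Clears only the session-path flag.
  void reset() { stop_requested = false; }

  // Session path: reports whether a decode step would proceed.
  bool decode_one() { return !stop_requested; }

  int generate(int max_tokens) { return generator.generate(max_tokens); }
};
```

After `stop()` then `reset()`, the generator's flag is still set, yet both paths work: `decode_one()` reads only `stop_requested`, and `generate()` clears the stale flag before its loop.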

6. prefill_tokens temperature resolution subtlety — (text_llm_runner.cpp:466-468)

const float temp = (temperature < 0.0f)
    ? (temperature_ == -1.0f ? 0.0f : temperature_)
    : temperature;

When temperature == 0.0f (explicit greedy from the caller), it's correctly passed through. When temperature == -1.0f (sentinel for "use default"), it falls into the < 0 branch. This is correct. Just noting it for clarity — the -1.0f sentinel is consistent throughout.
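
The three cases can be checked by lifting the quoted expression into a free function. The function name and parameter names below are hypothetical; the expression itself is the one quoted above.

```cpp
#include <cassert>

// Hypothetical wrapper around the quoted expression.
// request_temp < 0 means "use the runner default"; a runner default of
// exactly -1.0f means "unset", which resolves to greedy (0.0f).
inline float resolve_temperature(float request_temp, float runner_default) {
  return (request_temp < 0.0f)
      ? (runner_default == -1.0f ? 0.0f : runner_default)
      : request_temp;
}
```

Note that, as written, any negative request value takes the default branch, not only the -1.0f sentinel.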

7. targets.bzl correctly places text_llm_session.h in private headers, not exported_headers — (line 123)

The Copilot reviewer's concern about this file being exported was incorrect. It's in headers (Buck private), not exported_headers, so external code cannot depend on it through the build system. The detail namespace naming reinforces this.

Design Observations (Not Issues)

Logit processor sharing via apply_logit_processors() (text_token_generator.h:62): Smart approach — extracting it as a public method on TextTokenGenerator lets decode_one() call it without duplicating the loop, ensuring the session path can never diverge from generate(). The test DecodeOneAppliesLogitProcessors validates this.
Cooperative stop semantics: The distinction between is_eos (literal EOS token) and is_terminal (loop-ending signal) in DecodeResult is clean and will serve the HTTP server layer well for different "finish_reason" values.
utf8_complete_prefix_len and stop_safe_prefix_len in util.h: Well-tested streaming utilities that will be needed for the HTTP server in later stacked PRs. Good to land them with their tests now.

CI Failures

The lintrunner failure (Lint for extension/llm/runner/util.h) is likely related to the inline functions in the header. The link-check failure (exit code 1) and the cortex-m and binary-size failures (docker exec exit code 127, i.e. command not found) appear infrastructure-related. The WebGPU failures are unclassified because those jobs did not run on the merge base; they are likely pre-existing.

Summary

This is a solid foundation for the serving stack. The main actionable item is improving the seek() error documentation so callers can handle NotSupported vs InvalidArgument programmatically. The tokenizer-per-session pattern is fine for v1 but should be revisited when session creation becomes a hot path.


[INITIAL] Update

8a8a627

[ghstack-poisoned]

Copilot AI review requested due to automatic review settings June 3, 2026 22:32

mergennachin requested a review from larryliu0820 as a code owner June 3, 2026 22:32

meta-cla Bot added the CLA Signed label Jun 3, 2026

Copilot started reviewing on behalf of mergennachin June 3, 2026 22:32

This was referenced Jun 3, 2026

extension/llm/server: worker-based OpenAI-compatible HTTP server #19994

Open

extension/llm/server: serving foundations (schemas, errors, templating, tools) #19993

Open

extension/llm/runner: Python bindings for the Engine/Session API #19992

Closed

mergennachin requested review from digantdesai, kirklandsign and psiddh June 3, 2026 22:33

mergennachin requested review from Gasoonjia and GregoryComer June 3, 2026 22:33

Copilot AI reviewed Jun 3, 2026

[UPDATE] Update

dff4856

[ghstack-poisoned]

mergennachin mentioned this pull request Jun 3, 2026

extension/llm/server: document pi integration #19999

Open

[UPDATE] Update

0756e5b

[ghstack-poisoned]

Copilot AI review requested due to automatic review settings June 4, 2026 18:48

mergennachin mentioned this pull request Jun 4, 2026

examples/models/qwen3_5_moe: CUDA Engine/Session adapter + OpenAI serving #20043

Open

Copilot started reviewing on behalf of mergennachin June 4, 2026 18:48

mergennachin marked this pull request as draft June 4, 2026 18:50

Copilot AI reviewed Jun 4, 2026

mergennachin added 2 commits June 4, 2026 15:14

[UPDATE] Update

170f01d

[ghstack-poisoned]

[UPDATE] Update

4648639

[ghstack-poisoned]

mergennachin marked this pull request as ready for review June 5, 2026 18:55

Copilot AI review requested due to automatic review settings June 5, 2026 18:55

Copilot started reviewing on behalf of mergennachin June 5, 2026 18:55

Copilot AI reviewed Jun 5, 2026

[UPDATE] Update

6edbb17

[ghstack-poisoned]

mergennachin mentioned this pull request Jun 8, 2026

Qwen3.5-MoE CUDA V2 foundation: one model, many isolated sessions #20117

Open

mergennachin wants to merge 6 commits into main from gh/mergennachin/2/head

3 participants
