extension/llm/server: worker-based OpenAI-compatible HTTP server #19994

Open
mergennachin wants to merge 9 commits into gh/mergennachin/4/head from gh/mergennachin/5/head


mergennachin (Contributor) commented Jun 3, 2026

Wire the foundations into a FastAPI app: /v1/chat/completions (streaming and
non-streaming), /v1/models, /health. Request validation rejects parameters the
server can't honor (top_p != 1, seed, n > 1, frequency/presence penalties,
top_k, logit_bias, logprobs, response_format other than text, non-positive
max_tokens, tool_choice set to "required" or to a specific function) instead of
silently ignoring them; stop sequences are applied before tool parsing; token
usage is reported.
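
To make the reject-instead-of-ignore rule concrete, here is a minimal sketch of such a check as a plain function over the parsed request body. The function name `unsupported_params`, the error strings, and the treatment of edge cases (an explicit `null` counts as "not set", a penalty of `0` and `logprobs: false` are allowed) are assumptions for illustration, not the PR's actual code.

```python
from typing import Any


def unsupported_params(body: dict[str, Any]) -> list[str]:
    """Return one message per request parameter the server cannot honor.

    Hypothetical helper: an empty list means the request is acceptable.
    """
    errors: list[str] = []

    def given(name: str) -> bool:
        # Treat a missing key and an explicit null the same way (assumption).
        return body.get(name) is not None

    if given("top_p") and body["top_p"] != 1:
        errors.append("top_p: only 1 is supported")
    if given("n") and body["n"] > 1:
        errors.append("n: only 1 is supported")
    for name in ("seed", "top_k", "logit_bias"):
        if given(name):
            errors.append(f"{name}: not supported")
    for name in ("frequency_penalty", "presence_penalty"):
        # Assumption: a zero penalty is a no-op and therefore allowed.
        if given(name) and body[name] != 0:
            errors.append(f"{name}: not supported")
    if body.get("logprobs"):
        errors.append("logprobs: not supported")
    if given("response_format") and body["response_format"].get("type") != "text":
        errors.append("response_format: only type 'text' is supported")
    if given("max_tokens") and body["max_tokens"] <= 0:
        errors.append("max_tokens: must be positive")
    tool_choice = body.get("tool_choice")
    # "required" or an object naming a specific function cannot be enforced.
    if tool_choice == "required" or isinstance(tool_choice, dict):
        errors.append("tool_choice: only 'auto' or 'none' is supported")
    return errors
```

A handler would presumably turn a non-empty result into a client-error response before any work reaches the worker, so that a caller never gets output that quietly disregards a requested setting.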

The Python process is control plane only: it loads no model and imports no
runtime pybind. Model execution runs in a separate C++ worker process
(cpp/text_llm_worker.cpp, over TextLLMEngine/TextLLMSession) that the control
plane spawns and drives over a small JSONL protocol (worker_client.py). The
protocol and the decode loop (reset, encode, context clamp, prefill, decode,
UTF-8 assembly, stop handling, stats, finish_reason) live in a shared header,
cpp/worker_loop.h, so model-specific workers reuse them; text_llm_worker only
constructs the engine/session and runs the loop.

runner_pool is a pool of worker processes (one in-flight request per worker)
with a blocking->async streaming bridge. V1 is single-slot; concurrent requests
queue. There is no prefix cache and no Python-side KV state; cancellation is
best-effort (the control plane stops consuming, the worker finishes the
in-flight request).

Hermetic tests (a FakeRunner worker handle) cover the contract, templating,
sampling params, tool calls, the pool, and the worker protocol; conformance/ is
a black-box suite runnable against any live OpenAI server. READMEs document the
flags and scope.
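
The control-plane/worker split and the blocking->async bridge described above can be sketched roughly as follows. Everything here is illustrative: the worker is a tiny Python stand-in for the C++ process, and the message fields (`prompt`, `type`, `text`, `finish_reason`) and the function names are assumptions, not the protocol defined in worker_client.py or cpp/worker_loop.h.

```python
import asyncio
import json
import subprocess
import sys
import threading

# Hypothetical stand-in for the C++ worker: one JSON request per stdin line,
# one JSON event per stdout line. Field names are illustrative only.
FAKE_WORKER = r'''
import json, sys
for line in sys.stdin:
    req = json.loads(line)
    for piece in req["prompt"].split():
        print(json.dumps({"type": "token", "text": piece}), flush=True)
    print(json.dumps({"type": "done", "finish_reason": "stop"}), flush=True)
'''


def blocking_events(proc, request):
    """Send one request, then yield events until the worker reports "done"."""
    proc.stdin.write(json.dumps(request) + "\n")
    proc.stdin.flush()
    for line in proc.stdout:
        event = json.loads(line)
        yield event
        if event["type"] == "done":
            return


async def stream_events(proc, request):
    """Bridge the blocking reader into an async generator via a helper thread."""
    loop = asyncio.get_running_loop()
    queue = asyncio.Queue()
    sentinel = object()

    def pump():
        try:
            for event in blocking_events(proc, request):
                # asyncio.Queue is not thread-safe; hop back onto the loop.
                loop.call_soon_threadsafe(queue.put_nowait, event)
        finally:
            loop.call_soon_threadsafe(queue.put_nowait, sentinel)

    threading.Thread(target=pump, daemon=True).start()
    while (item := await queue.get()) is not sentinel:
        yield item


async def run_demo():
    proc = subprocess.Popen(
        [sys.executable, "-c", FAKE_WORKER],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
    try:
        return [e async for e in stream_events(proc, {"prompt": "hello from worker"})]
    finally:
        proc.stdin.close()   # EOF on stdin lets the stand-in worker exit
        proc.stdout.close()
        proc.wait()
```

Calling `asyncio.run(run_demo())` collects three `token` events followed by one `done` event. If the consumer abandons the async generator early, the pump thread keeps draining the worker's output until `done`, which mirrors the best-effort cancellation described above: the control plane stops consuming while the worker finishes the in-flight request.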

Depends on the serving foundations.

Part of #20001

pytorch-bot (Bot) commented Jun 3, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19994

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 1 Unrelated Failure, 1 Unclassified Failure

As of commit 2dae19c with merge base eeb0646 (image):

NEW FAILURES - The following jobs have failed:

UNCLASSIFIED FAILURE - DrCI could not classify the following job because the workflow did not run on the merge base. The failure may be pre-existing on trunk or introduced by this PR:

  • periodic (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

mergennachin added a commit that referenced this pull request Jun 3, 2026
Wire the foundations into a FastAPI app: /v1/chat/completions (streaming and
non-streaming), /v1/models, /health. Request validation rejects parameters the
server can't honor (top_p != 1, seed, n > 1, frequency/presence penalties,
top_k, logit_bias, logprobs, response_format other than text, non-positive
max_tokens, tool_choice = required / specific function) instead of silently
ignoring them; stop sequences are applied before tool parsing; client
cancellation calls runner.stop(); usage is reported. runner_pool admits physical
sessions per the engine's serving_capacity() (single-slot on XNNPACK, with
concurrent requests queueing on the resident session) and routes by prefix
affinity. Hermetic tests (FakeRunner via dependency injection) cover the
contract, templating, sampling params, tool calls and the pool; conformance/ is
a black-box suite runnable against any live OpenAI server. READMEs document the
flags and scope.
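
The single-slot behaviour mentioned in this commit message, where concurrent requests queue for the one resident session, can be pictured with a small asyncio sketch. `WorkerPool` and its methods are hypothetical names; `serving_capacity()` and prefix-affinity routing are not modelled here.

```python
import asyncio


class WorkerPool:
    """Hypothetical pool: each worker serves one in-flight request at a time."""

    def __init__(self, workers):
        self._idle = asyncio.Queue()
        for worker in workers:
            self._idle.put_nowait(worker)

    async def run(self, job):
        # Waits here (i.e. the request queues) while every worker is busy.
        worker = await self._idle.get()
        try:
            return await job(worker)
        finally:
            self._idle.put_nowait(worker)


async def demo():
    pool = WorkerPool(["worker-0"])  # single slot, as on the XNNPACK backend
    order = []

    async def job(name, worker):
        order.append(f"{name}:start")
        await asyncio.sleep(0.01)    # stands in for prefill + decode
        order.append(f"{name}:end")
        return worker

    await asyncio.gather(
        pool.run(lambda w: job("a", w)),
        pool.run(lambda w: job("b", w)),
    )
    return order
```

With one worker in the pool, `asyncio.run(demo())` returns `["a:start", "a:end", "b:start", "b:end"]`: the second request does not start until the first has released the slot. Passing more workers to the pool would let that many requests run side by side.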

Last of four stacked commits; depends on the bindings and serving foundations.


ghstack-source-id: acef8e6
ghstack-comment-id: 4617263008
Pull-Request: #19994
mergennachin marked this pull request as ready for review June 5, 2026 18:59
mergennachin changed the title from "extension/llm/server: OpenAI-compatible HTTP server" to "extension/llm/server: worker-based OpenAI-compatible HTTP server" Jun 8, 2026
Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
