Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
58 changes: 58 additions & 0 deletions extension/llm/server/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -32,3 +32,61 @@ lives inside the worker/session, not the control plane. Unsupported params (incl
`logprobs`, and `tool_choice="required"`) are rejected with a structured 400
rather than silently ignored. See `python/README.md` to run it and
`spec/README.md` for the exact contract.

## Use from pi (or any OpenAI-compatible harness)

Point pi at the server to use ExecuTorch as a local backend for tool-use
workflows. Launch the server:

```bash
python -m executorch.extension.llm.server.python.server \
--model-path <model.pte> \
--tokenizer-path <tokenizer.model-or-json> \
--hf-tokenizer <hf-model-or-local-dir> \
--model-id <model-id> \
--host 127.0.0.1 \
--port 8000
```

Useful optional flags (full reference in `python/README.md`):

- `--no-think` — default `enable_thinking=false` for templates that support it
(e.g. Qwen3-style).
- `--max-context N` — reject over-long prompts cleanly; use the export-time
context length.
- `--allow-chatml-fallback` — approximate ChatML when the model has no HF
`chat_template`; experimentation only, not recommended for reliable tool use.

Point pi at the server via `~/.pi/agent/models.json`:

```json
{ "providers": { "executorch": {
"baseUrl": "http://127.0.0.1:8000/v1", "api": "openai-completions",
"apiKey": "x", "models": [ { "id": "<model-id>" } ] } } }
```

Other OpenAI-compatible clients use their own schema — generically: base URL
`http://127.0.0.1:8000/v1`, the model id you passed to `--model-id`, and a dummy
API key if one is required.

Supported contract for pi:

- Endpoint `POST /v1/chat/completions`; streaming supported.
- Tool calls: the model's Hermes-style `<tool_call>...</tool_call>` output is
parsed and returned as OpenAI `tool_calls`. This generic server uses Hermes by
default; a model-specific server may select the Qwen XML format.
- `tool_choice`: only `"auto"`, `"none"`, or unset.
- Rejected with a structured 400 (`unsupported_parameter`), not silently
ignored: `tool_choice="required"` or specific-function forcing,
`response_format` JSON/constrained output, `logprobs`, `top_p` other than
`1.0`, and `seed`.

Reliability guidance:

- Use the model's real HF `chat_template` (`--hf-tokenizer`) for tool use, kept
aligned with the exported tokenizer/model.
- If tool calls come back as plain text, confirm the model is emitting the
configured tool-call format's markers (Hermes for the generic server) and that
`tools` were included in the request.
- If a request fails with `unsupported_parameter`, remove or disable that
OpenAI knob in your pi/client config.
Loading