diff --git a/extension/llm/server/README.md b/extension/llm/server/README.md index f90da90675a..7382a0a50df 100644 --- a/extension/llm/server/README.md +++ b/extension/llm/server/README.md @@ -32,3 +32,61 @@ lives inside the worker/session, not the control plane. Unsupported params (incl `logprobs`, and `tool_choice="required"`) are rejected with a structured 400 rather than silently ignored. See `python/README.md` to run it and `spec/README.md` for the exact contract. + +## Use from pi (or any OpenAI-compatible harness) + +Point pi at the server to use ExecuTorch as a local backend for tool-use +workflows. Launch the server: + +```bash +python -m executorch.extension.llm.server.python.server \ + --model-path \ + --tokenizer-path \ + --hf-tokenizer \ + --model-id \ + --host 127.0.0.1 \ + --port 8000 +``` + +Useful optional flags (full reference in `python/README.md`): + +- `--no-think` — default `enable_thinking=false` for templates that support it + (e.g. Qwen3-style). +- `--max-context N` — reject over-long prompts cleanly; use the export-time + context length. +- `--allow-chatml-fallback` — approximate ChatML when the model has no HF + `chat_template`; experimentation only, not recommended for reliable tool use. + +Point pi at the server via `~/.pi/agent/models.json`: + +```json +{ "providers": { "executorch": { + "baseUrl": "http://127.0.0.1:8000/v1", "api": "openai-completions", + "apiKey": "x", "models": [ { "id": "" } ] } } } +``` + +Other OpenAI-compatible clients use their own schema — generically: base URL +`http://127.0.0.1:8000/v1`, the model id you passed to `--model-id`, and a dummy +API key if one is required. + +Supported contract for pi: + +- Endpoint `POST /v1/chat/completions`; streaming supported. +- Tool calls: the model's Hermes-style `...` output is + parsed and returned as OpenAI `tool_calls`. This generic server uses Hermes by + default; a model-specific server may select the Qwen XML format. +- `tool_choice`: only `"auto"`, `"none"`, or unset. +- Rejected with a structured 400 (`unsupported_parameter`), not silently + ignored: `tool_choice="required"` or specific-function forcing, + `response_format` JSON/constrained output, `logprobs`, `top_p` other than + `1.0`, and `seed`. + +Reliability guidance: + +- Use the model's real HF `chat_template` (`--hf-tokenizer`) for tool use, kept + aligned with the exported tokenizer/model. +- If tool calls come back as plain text, confirm the model is emitting the + configured tool-call format's markers (Hermes for the generic server) and that + `tools` were included in the request. +- If a request fails with `unsupported_parameter`, remove or disable that + OpenAI knob in your pi/client config.