llama.cpp provides a fully OpenAI-compatible API, making it a drop-in replacement for cloud providers. The middleware can switch between zen (cloud) and llama.cpp (local) by changing a single configuration value.
llama-server exposes all standard OpenAI endpoints:
| Endpoint | Description | Location |
|---|---|---|
| `GET /v1/models` | List available models | `server.cpp:171` |
| `POST /v1/chat/completions` | Chat completions | `server.cpp:177` |
| `POST /v1/completions` | Text completions | `server.cpp:175` |
| `POST /v1/embeddings` | Text embeddings | `server.cpp:184` |
No special flags are required: OpenAI compatibility is enabled by default.
| Parameter | Description | Source |
|---|---|---|
| `model` | Model identifier | `server-context.cpp:9894` |
| `messages` | Array of message objects | `server-common.cpp:884` |
| `stream` | Enable SSE streaming | `server-common.cpp:841` |
| `temperature` | Sampling temperature | Mapped from server params |
| `max_tokens` | Maximum tokens to generate | Mapped from server params |
| `top_p` | Nucleus sampling | Mapped from server params |
| `tools` | Tool definitions | `server-common.cpp:839` |
| `tool_choice` | Tool selection mode | `server-common.cpp:842` |
| `parallel_tool_calls` | Enable multiple tool calls | `server-common.cpp:960` |
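The parameters in the table above can be combined into a single request body. The sketch below builds such a payload, assuming a local llama-server on port 8080; `"local-model"` is a placeholder identifier, not a real model name.

```python
import json

# Request body exercising the parameters above; values are illustrative.
payload = {
    "model": "local-model",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    "stream": False,
    "temperature": 0.7,
    "max_tokens": 128,
    "top_p": 0.95,
}

# With a running server, this would be POSTed to
# http://localhost:8080/v1/chat/completions with Content-Type: application/json.
print(json.dumps(payload, indent=2))
```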
Standard OpenAI response:
```json
{
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Hello!",
      "tool_calls": [...]
    },
    "finish_reason": "stop"
  }],
  "created": 1234567890,
  "id": "chatcmpl-xxx",
  "model": "gpt-3.5-turbo",
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 5,
    "total_tokens": 15
  }
}
```

Implementation location: `tools/server/server-task.cpp:613-707`
llama.cpp emits standard Server-Sent Events:
Format implementation: tools/server/server-common.cpp:1462
```
# SSE format
data: {"choices": [{"delta": {...}}]}

# Done marker
data: [DONE]
```

Done marker location: `tools/server/server-context.cpp:3005`
Each chunk contains:
| Field | Description |
|---|---|
| `choices[0].delta` | Incremental content update |
| `choices[0].finish_reason` | `null` until complete |
| `object` | Always `"chat.completion.chunk"` |
| `id` | Same across all chunks |
| `created` | Timestamp |
Chunk generation: tools/server/server-task.cpp:728-796
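Consuming the stream then amounts to joining the `delta.content` fragments until the `[DONE]` marker. A minimal sketch, using hard-coded sample lines in place of a live SSE connection:

```python
import json

# Sample lines mimicking the SSE payloads described above.
sample_events = [
    'data: {"choices": [{"delta": {"role": "assistant"}, "finish_reason": null}]}',
    'data: {"choices": [{"delta": {"content": "Hel"}, "finish_reason": null}]}',
    'data: {"choices": [{"delta": {"content": "lo!"}, "finish_reason": "stop"}]}',
    "data: [DONE]",
]

def accumulate(lines):
    """Join delta.content fields until the [DONE] marker."""
    parts = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alives / comments
        body = line[len("data: "):]
        if body == "[DONE]":
            break
        chunk = json.loads(body)
        delta = chunk["choices"][0].get("delta", {})
        if "content" in delta:
            parts.append(delta["content"])
    return "".join(parts)

print(accumulate(sample_events))  # → Hello!
```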
When `include_usage` is true, an additional final chunk is sent:
```json
{
  "choices": [],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 5,
    "total_tokens": 15
  }
}
```

Location: `tools/server/server-task.cpp:768-783`
| Parameter | Description |
|---|---|
| `tools` | Array of tool definitions |
| `tool_choice` | `"auto"`, `"required"`, or a specific tool |
| `parallel_tool_calls` | Enable multiple tool calls simultaneously |
Requires: the `--jinja` flag for native tool calling templates
Models with native function calling support:
- Llama 3.1 / 3.3 / 3.2
- Functionary v3.1 / v3.2
- Hermes 2/3
- Qwen 2.5 & Qwen 2.5 Coder
- Mistral Nemo
- Firefunction v2
- Command R7B
- DeepSeek R1
Documentation: docs/function-calling.md:10-18
If the model's template is not recognized, a generic tool call format is used.
Documentation: docs/function-calling.md:20-22
```json
{
  "choices": [{
    "finish_reason": "tool_calls",
    "index": 0,
    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "name": "python",
        "arguments": "{\"code\":\"...\"}"
      }]
    }
  }],
  "usage": {...}
}
```

Implementation: `tools/server/server-task.cpp:667`
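On the middleware side, extracting the calls from this shape is straightforward. A sketch, assuming the top-level `name`/`arguments` layout shown above (with `arguments` as a JSON-encoded string); the `python` tool is illustrative:

```python
import json

# Example response following the tool-call shape shown above.
response = {
    "choices": [{
        "finish_reason": "tool_calls",
        "message": {
            "role": "assistant",
            "content": None,
            "tool_calls": [
                {"name": "python", "arguments": "{\"code\": \"print('hi')\"}"}
            ],
        },
    }]
}

def extract_tool_calls(resp):
    """Return (name, parsed_arguments) pairs, or [] when no tools were called."""
    choice = resp["choices"][0]
    if choice.get("finish_reason") != "tool_calls":
        return []
    calls = choice["message"].get("tool_calls") or []
    # arguments arrive as a JSON string and must be decoded
    return [(c["name"], json.loads(c["arguments"])) for c in calls]

for name, args in extract_tool_calls(response):
    print(name, args["code"])  # → python print('hi')
```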
```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "tools": [{
      "type": "function",
      "function": {
        "name": "python",
        "description": "Runs code in an ipython interpreter...",
        "parameters": {
          "type": "object",
          "properties": {
            "code": {"type": "string", "description": "Code to run"}
          },
          "required": ["code"]
        }
      }
    }],
    "messages": [{
      "role": "user",
      "content": "Print a hello world message with python."
    }]
  }'
```

Documentation: `docs/function-calling.md:335-362`
Status: llama.cpp does NOT natively support a `reasoning_content` field in responses.
The `reasoning_content` field is specific to GLM 4.7 and some reasoning models (e.g., DeepSeek R1).
For middleware integration:
- GLM 4.7 (zen): Extract and log `reasoning_content` from `message.reasoning_content`
- llama.cpp: Check if the model is reasoning-capable (e.g., DeepSeek R1), handle accordingly
- Other models: No `reasoning_content` field - ignore
```python
from typing import Any, Dict

def has_reasoning(message: Dict[str, Any]) -> bool:
    """
    Check if a message contains reasoning content.

    GLM 4.7: message.reasoning_content
    DeepSeek R1: May have <think> tags in content
    Other: No reasoning
    """
    if "reasoning_content" in message:
        return True
    # DeepSeek R1-style reasoning embedded in the content itself
    return "<think>" in (message.get("content") or "")
```

llama.cpp provides standard usage statistics:
```json
{
  "usage": {
    "prompt_tokens": 50,
    "completion_tokens": 10,
    "total_tokens": 60
  }
}
```

Note: There is no `cached_tokens` field like zen (GLM 4.7) has.
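Because `cached_tokens` only appears in zen (GLM 4.7) responses, a middleware that logs usage uniformly can normalize both shapes. A sketch of one way to do this, defaulting the missing field to 0 (the helper name is hypothetical):

```python
# Normalize usage stats across backends: prompt_tokens_details.cached_tokens
# exists only in zen (GLM 4.7) responses, so it defaults to 0 for llama.cpp.
def normalize_usage(usage: dict) -> dict:
    details = usage.get("prompt_tokens_details") or {}
    return {
        "prompt_tokens": usage.get("prompt_tokens", 0),
        "completion_tokens": usage.get("completion_tokens", 0),
        "total_tokens": usage.get("total_tokens", 0),
        "cached_tokens": details.get("cached_tokens", 0),
    }

# llama.cpp response: no cached token info, cached_tokens falls back to 0
print(normalize_usage({"prompt_tokens": 50, "completion_tokens": 10, "total_tokens": 60}))
```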
```python
import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="sk-no-key-required"
)

completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are ChatGPT..."},
        {"role": "user", "content": "Write a limerick..."}
    ]
)

print(completion.choices[0].message)
```

Documentation: `tools/server/README.md:1189-1206`
```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [
      {"role": "system", "content": "You are ChatGPT, an AI assistant."},
      {"role": "user", "content": "Write a limerick about python exceptions"}
    ]
  }'
```

Documentation: `tools/server/README.md:1211-1227`
```shell
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{
    "input": "hello",
    "model": "GPT-4",
    "encoding_format": "float"
  }'
```

Documentation: `tools/server/README.md:1276-1284`
Switch backends via `BACKEND_URL`:

zen (cloud):

```
BACKEND_URL=https://opencode.ai/zen/v1
```

llama.cpp (local):

```
BACKEND_URL=http://localhost:8080/v1
```

API keys:
- zen: No API key for free models
- llama.cpp: Use `"Bearer no-key"` or `"Bearer sk-no-key-required"`
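The switch can be reduced to reading one environment variable. A minimal sketch, assuming the `BACKEND_URL` convention above; the `backend_config` helper and `API_KEY` variable are illustrative, not part of any existing codebase:

```python
import os

def backend_config() -> dict:
    """Resolve backend settings from BACKEND_URL (defaults to local llama.cpp)."""
    base_url = os.environ.get("BACKEND_URL", "http://localhost:8080/v1")
    # llama.cpp ignores the key, but OpenAI-style clients require a non-empty one
    if "localhost" in base_url:
        api_key = "sk-no-key-required"
    else:
        api_key = os.environ.get("API_KEY", "")
    return {"base_url": base_url, "api_key": api_key}

cfg = backend_config()
print(cfg["base_url"])
```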
Pass-through: Both zen and llama.cpp accept the standard OpenAI format, so no transformation is needed between client and backend.
Common fields:
- `choices[0].message.content`
- `choices[0].message.tool_calls`
- `choices[0].finish_reason`
- `usage.prompt_tokens`
- `usage.completion_tokens`
- `usage.total_tokens`
GLM 4.7 specific:
- `choices[0].message.reasoning_content`
- `usage.prompt_tokens_details.cached_tokens`
Both use identical SSE format:
- `data: {...}\n\n` for chunks
- `data: [DONE]\n\n` for completion
No parsing differences needed.
zen:
- Fetch from `https://opencode.ai/zen/v1/models`
- Returns `{"object": "list", "data": [{"id": "...", ...}]}`

llama.cpp:
- Fetch from `http://localhost:8080/v1/models`
- Returns `[{"id": "...", ...}]` (array, not wrapped)
Parsing difference:
```python
import httpx

async def fetch_models(base_url: str) -> list[str]:
    """Fetch models from the backend, handling both response shapes."""
    async with httpx.AsyncClient() as client:
        response = await client.get(f"{base_url}/models")
        data = response.json()
    # zen: {"data": [...]}
    if isinstance(data, dict) and "data" in data:
        return [m["id"] for m in data["data"]]
    # llama.cpp: [...]
    return [m["id"] for m in data]
```

```shell
# Start llama-server on port 8080
llama-server \
  --model path/to/model.gguf \
  --port 8080 \
  --host 0.0.0.0
```

```shell
# Enable native tool calling templates with --jinja
llama-server \
  --model qwen2.5-coder-7b-instruct.gguf \
  --jinja \
  --port 8080
```

```shell
# Performance-oriented configuration
llama-server \
  --model qwen2.5-coder-32b-instruct.gguf \
  --jinja \
  --flash-attn on \
  --cont-batching \
  --parallel 8 \
  --n-gpu-layers -1 \
  --port 8080
```

Test suites:
- `tools/server/tests/unit/test_chat_completion.py` - Chat completions
- `tools/server/tests/unit/test_tool_call.py` - Tool calling
- `tools/server/tests/unit/test_infill.py` - Fill-in-the-middle
```shell
# Test chat completions
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "Hello"}]}'

# Test models endpoint
curl http://localhost:8080/v1/models

# Test health
curl http://localhost:8080/health
```

| File | Purpose |
|---|---|
| `tools/server/README.md` | Server documentation and examples |
| `tools/server/server.cpp` | Main server implementation |
| `tools/server/server-common.cpp` | Request parsing and shared logic |
| `tools/server/server-task.cpp` | Response generation |
| `tools/server/server-context.cpp` | Context management |
| `docs/function-calling.md` | Tool calling documentation |
- Drop-in replacement: Same API as OpenAI/zen
- No code changes: Switch via `BACKEND_URL` configuration
- Full feature parity: Chat, tools, streaming, embeddings
- Local privacy: No data leaves your machine
- Cost control: No per-token API costs
- Model flexibility: Run any GGUF model
- No `reasoning_content` field: This is GLM 4.7 specific
- No `cached_tokens`: llama.cpp doesn't track cache hits
- Hardware requirements: Needs sufficient RAM/VRAM for model size
- Model file management: Must download GGUF files manually
- Model auto-detection: Scan local directory for GGUF files
- State management: Integrate KV cache save/load
- Multi-model routing: Route different requests to different models
- Speculative decoding: Use draft models for faster generation
- Grammar constraints: Enforce JSON/tool call formats via GBNF