
llama.cpp Integration Guide

Overview

llama.cpp provides a fully OpenAI-compatible API, making it a drop-in replacement for cloud providers. The middleware can switch between zen (cloud) and llama.cpp (local) by changing a single configuration value.


OpenAI-Compatible Endpoints

llama-server exposes all standard OpenAI endpoints:

| Endpoint | Description | Location |
|----------|-------------|----------|
| GET /v1/models | List available models | server.cpp:171 |
| POST /v1/chat/completions | Chat completions | server.cpp:177 |
| POST /v1/completions | Text completions | server.cpp:175 |
| POST /v1/embeddings | Text embeddings | server.cpp:184 |

No special flags are required; OpenAI compatibility is enabled by default.


Request/Response Format

Supported Request Parameters

| Parameter | Description | Source |
|-----------|-------------|--------|
| model | Model identifier | server-context.cpp:9894 |
| messages | Array of message objects | server-common.cpp:884 |
| stream | Enable SSE streaming | server-common.cpp:841 |
| temperature | Sampling temperature | Mapped from server params |
| max_tokens | Maximum tokens to generate | Mapped from server params |
| top_p | Nucleus sampling | Mapped from server params |
| tools | Tool definitions | server-common.cpp:839 |
| tool_choice | Tool selection mode | server-common.cpp:842 |
| parallel_tool_calls | Enable multiple tool calls | server-common.cpp:960 |

Response Format

Standard OpenAI response:

{
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Hello!",
      "tool_calls": [...]
    },
    "finish_reason": "stop"
  }],
  "created": 1234567890,
  "id": "chatcmpl-xxx",
  "model": "gpt-3.5-turbo",
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 5,
    "total_tokens": 15
  }
}

Implementation location: tools/server/server-task.cpp:613-707


Streaming Support

SSE Format

llama.cpp emits standard Server-Sent Events:

Format implementation: tools/server/server-common.cpp:1462

# SSE format
data: {"choices": [{"delta": {...}}]}

# Done marker
data: [DONE]

Done marker location: tools/server/server-context.cpp:3005

Streaming Chunk Structure

Each chunk contains:

| Field | Description |
|-------|-------------|
| choices[0].delta | Incremental content update |
| choices[0].finish_reason | null until complete |
| object | Always "chat.completion.chunk" |
| id | Same across all chunks |
| created | Timestamp |

Chunk generation: tools/server/server-task.cpp:728-796
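
For illustration, here is a minimal consumer sketch using the openai Python SDK (the same client setup as the SDK example later in this guide). The base URL and model name are placeholders, and stream_options is the standard OpenAI mechanism for requesting the final usage chunk described in the next subsection:

import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/v1",  # placeholder: the llama-server address
    api_key="sk-no-key-required",
)

stream = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
    stream_options={"include_usage": True},  # request the final usage chunk
)

for chunk in stream:
    # The final usage chunk arrives with an empty choices array
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
    if chunk.usage:
        print(f"\n[{chunk.usage.total_tokens} tokens]")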

Final Usage Chunk

When the request sets stream_options.include_usage to true, an additional final chunk is sent:

{
  "choices": [],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 5,
    "total_tokens": 15
  }
}

Location: tools/server/server-task.cpp:768-783


Tool Calling

Supported Parameters

| Parameter | Description |
|-----------|-------------|
| tools | Array of tool definitions |
| tool_choice | "auto", "required", or a specific tool |
| parallel_tool_calls | Enable multiple simultaneous tool calls |

Requires: --jinja flag for native tool calling templates

Native Tool Calling Models

Models with native function calling support:

  • Llama 3.1 / 3.2 / 3.3
  • Functionary v3.1 / v3.2
  • Hermes 2/3
  • Qwen 2.5 & Qwen 2.5 Coder
  • Mistral Nemo
  • Firefunction v2
  • Command R7B
  • DeepSeek R1

Documentation: docs/function-calling.md:10-18

Generic Tool Calling

If the model's chat template is not recognized, a generic tool call format is used.

Documentation: docs/function-calling.md:20-22

Tool Call Response Format

{
  "choices": [{
    "finish_reason": "tool_calls",
    "index": 0,
    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "name": "python",
        "arguments": "{\"code\":\"...\"}"
      }]
    }
  }],
  "usage": {...}
}

Implementation: tools/server/server-task.cpp:667
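
On the middleware side, extracting tool calls from such a response is a small parsing step. Below is a sketch against the shape shown above; note that arguments is a JSON-encoded string, and the nested "function" fallback is a defensive assumption rather than something this guide specifies:

import json

def extract_tool_calls(response: dict) -> list[tuple[str, dict]]:
    """Return (name, parsed_arguments) pairs from a completed response."""
    choice = response["choices"][0]
    if choice.get("finish_reason") != "tool_calls":
        return []
    calls = []
    for call in choice["message"].get("tool_calls") or []:
        # Defensive assumption: some templates nest the fields under "function"
        fn = call.get("function", call)
        # arguments is a JSON string and must be parsed before dispatch
        calls.append((fn["name"], json.loads(fn["arguments"])))
    return calls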

Example Tool Call Request

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "tools": [{
      "type": "function",
      "function": {
        "name": "python",
        "description": "Runs code in an ipython interpreter...",
        "parameters": {
          "type": "object",
          "properties": {
            "code": {"type": "string", "description": "Code to run"}
          },
          "required": ["code"]
        }
      }
    }],
    "messages": [{
      "role": "user",
      "content": "Print a hello world message with python."
    }]
  }'

Documentation: docs/function-calling.md:335-362


Reasoning Content

Status: llama.cpp does NOT natively support a reasoning_content field in responses.

The reasoning_content field is specific to GLM 4.7 and some reasoning models (e.g., DeepSeek R1).

Middleware Handling

For middleware integration:

  1. GLM 4.7 (zen): Extract and log reasoning_content from message.reasoning_content
  2. llama.cpp: Check if model is reasoning-capable (e.g., DeepSeek R1), handle accordingly
  3. Other models: No reasoning_content field - ignore

Reasoning Detection Logic

from typing import Any, Dict

def has_reasoning(message: Dict[str, Any]) -> bool:
    """
    Check if a message contains reasoning content.

    GLM 4.7: message.reasoning_content
    DeepSeek R1: may embed <think> tags in content
    Other models: no reasoning
    """
    if "reasoning_content" in message:
        return True
    content = message.get("content") or ""
    return isinstance(content, str) and "<think>" in content

Usage Statistics

llama.cpp provides standard usage statistics:

{
  "usage": {
    "prompt_tokens": 50,
    "completion_tokens": 10,
    "total_tokens": 60
  }
}

Note: Unlike zen (GLM 4.7), llama.cpp does not report a cached_tokens field.


Example Usage

Using OpenAI Python SDK

import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="sk-no-key-required"
)

completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are ChatGPT..."},
        {"role": "user", "content": "Write a limerick..."}
    ]
)

print(completion.choices[0].message)

Documentation: tools/server/README.md:1189-1206

Using curl

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [
      {"role": "system", "content": "You are ChatGPT, an AI assistant."},
      {"role": "user", "content": "Write a limerick about python exceptions"}
    ]
  }'

Documentation: tools/server/README.md:1211-1227

Embeddings

curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{
    "input": "hello",
    "model": "GPT-4",
    "encoding_format": "float"
  }'

Documentation: tools/server/README.md:1276-1284


Integration Points for Middleware

1. Base URL Configuration

Switch backends via BACKEND_URL:

zen (cloud):

BACKEND_URL=https://opencode.ai/zen/v1

llama.cpp (local):

BACKEND_URL=http://localhost:8080/v1

2. Authentication

  • zen: No API key required for free models
  • llama.cpp: Use "Bearer no-key" or "Bearer sk-no-key-required"

3. Request Format

Pass-through: Both zen and llama.cpp accept standard OpenAI format

No transformation needed between client and backend.
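
A minimal pass-through sketch, assuming httpx and the BACKEND_URL variable from section 1; error handling and streaming are omitted:

import os
import httpx

BACKEND_URL = os.environ.get("BACKEND_URL", "http://localhost:8080/v1")

async def forward_chat(request_body: dict) -> dict:
    """Forward an OpenAI-format chat request unchanged to the active backend."""
    async with httpx.AsyncClient(timeout=None) as client:
        response = await client.post(
            f"{BACKEND_URL}/chat/completions",
            json=request_body,  # passed through without transformation
        )
        response.raise_for_status()
        return response.json()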

4. Response Parsing

Common fields:

  • choices[0].message.content
  • choices[0].message.tool_calls
  • choices[0].finish_reason
  • usage.prompt_tokens
  • usage.completion_tokens
  • usage.total_tokens

GLM 4.7 specific:

  • choices[0].message.reasoning_content
  • usage.prompt_tokens_details.cached_tokens
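
A sketch that reads both field sets in one place, treating the GLM 4.7 extras as optional so the same code works against llama.cpp:

from typing import Any, Dict

def parse_response(response: Dict[str, Any]) -> Dict[str, Any]:
    """Extract the common fields, plus the GLM 4.7 extras when present."""
    choice = response["choices"][0]
    message = choice["message"]
    usage = response.get("usage", {})
    details = usage.get("prompt_tokens_details") or {}
    return {
        "content": message.get("content"),
        "tool_calls": message.get("tool_calls"),
        "finish_reason": choice.get("finish_reason"),
        "prompt_tokens": usage.get("prompt_tokens"),
        "completion_tokens": usage.get("completion_tokens"),
        "total_tokens": usage.get("total_tokens"),
        # GLM 4.7 only; these stay None against llama.cpp
        "reasoning_content": message.get("reasoning_content"),
        "cached_tokens": details.get("cached_tokens"),
    }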

5. Streaming

Both use identical SSE format:

  • data: {...}\n\n for chunks
  • data: [DONE]\n\n for completion

No parsing differences needed.
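
For middleware that relays the raw stream instead of using an SDK, the shared parsing loop looks the same for both backends. A sketch assuming httpx; the request body must set "stream": true:

import json
import httpx

async def relay_sse(url: str, body: dict):
    """Yield parsed chunk objects from either backend's SSE stream."""
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream("POST", url, json=body) as response:
            async for line in response.aiter_lines():
                if not line.startswith("data: "):
                    continue  # skip blank keep-alive lines
                payload = line[len("data: "):]
                if payload == "[DONE]":
                    return
                yield json.loads(payload)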

6. Model Discovery

zen:

  • Fetch from https://opencode.ai/zen/v1/models
  • Returns {"object": "list", "data": [{"id": "...", ...}]}

llama.cpp:

  • Fetch from http://localhost:8080/v1/models
  • Returns [{"id": "...", ...}] (array, not wrapped)

Parsing difference:

import httpx

async def fetch_models(base_url: str) -> list[str]:
    """Fetch the available model IDs from either backend."""
    async with httpx.AsyncClient() as client:
        response = await client.get(f"{base_url}/models")
        data = response.json()

        # zen: {"object": "list", "data": [...]}
        if isinstance(data, dict) and "data" in data:
            return [m["id"] for m in data["data"]]

        # llama.cpp: bare array [...]
        return [m["id"] for m in data]

Starting llama-server

Basic Usage

# Start llama-server on port 8080
llama-server \
  --model path/to/model.gguf \
  --port 8080 \
  --host 0.0.0.0

With Tool Calling

# --jinja enables native tool calling templates
llama-server \
  --model qwen2.5-coder-7b-instruct.gguf \
  --jinja \
  --port 8080

With All Optimizations

llama-server \
  --model qwen2.5-coder-32b-instruct.gguf \
  --jinja \
  --flash-attn on \
  --cont-batching \
  --parallel 8 \
  --n-gpu-layers -1 \
  --port 8080

Testing

Test Suite Locations

  • tools/server/tests/unit/test_chat_completion.py - Chat completions
  • tools/server/tests/unit/test_tool_call.py - Tool calling
  • tools/server/tests/unit/test_infill.py - Fill-in-the-middle

Manual Testing

# Test chat completions
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "Hello"}]}'

# Test models endpoint
curl http://localhost:8080/v1/models

# Test health
curl http://localhost:8080/health

Key Files

| File | Purpose |
|------|---------|
| tools/server/README.md | Server documentation and examples |
| tools/server/server.cpp | Main server implementation |
| tools/server/server-common.cpp | Request parsing and shared logic |
| tools/server/server-task.cpp | Response generation |
| tools/server/server-context.cpp | Context management |
| docs/function-calling.md | Tool calling documentation |

Advantages for Middleware

  1. Drop-in replacement: Same API as OpenAI/zen
  2. No code changes: Switch via BACKEND_URL configuration
  3. Full feature parity: Chat, tools, streaming, embeddings
  4. Local privacy: No data leaves your machine
  5. Cost control: No per-token API costs
  6. Model flexibility: Run any GGUF model

Limitations

  1. No reasoning_content field: This is GLM 4.7 specific
  2. No cached_tokens: llama.cpp doesn't track cache hits
  3. Hardware requirements: Needs sufficient RAM/VRAM for model size
  4. Model file management: Must download GGUF files manually

Future Enhancements

  1. Model auto-detection: Scan local directory for GGUF files
  2. State management: Integrate KV cache save/load
  3. Multi-model routing: Route different requests to different models
  4. Speculative decoding: Use draft models for faster generation
  5. Grammar constraints: Enforce JSON/tool call formats via GBNF