This guide explains how to use the OpenAI-compatible API endpoints in Osaurus.
The `/models` endpoint returns a list of available models that are currently downloaded and ready to use.

```bash
curl http://localhost:8080/models
```

Example response:
```json
{
  "object": "list",
  "data": [
    {
      "id": "llama-3.2-3b-instruct",
      "object": "model",
      "created": 1738193123,
      "owned_by": "osaurus"
    },
    {
      "id": "qwen2.5-7b-instruct",
      "object": "model",
      "created": 1738193123,
      "owned_by": "osaurus"
    }
  ]
}
```

Generate chat completions using the specified model.
```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2-3b-instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello, how are you?"}
    ],
    "session_id": "my-session-1",
    "temperature": 0.7,
    "max_tokens": 150
  }'
```

The optional `session_id` field reuses the KV cache across turns for lower latency; see the session reuse notes below.

Example response:
```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1738193123,
  "model": "llama-3.2-3b-instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "I'm doing well, thank you for asking! How can I help you today?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 20,
    "completion_tokens": 15,
    "total_tokens": 35
  }
}
```

To stream tokens as they are generated, set `"stream": true`:

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2-3b-instruct",
    "messages": [
      {"role": "user", "content": "Tell me a short story"}
    ],
    "stream": true,
    "temperature": 0.8,
    "max_tokens": 200
  }'
```

Streaming responses use Server-Sent Events (SSE) format:
```text
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1738193123,"model":"llama-3.2-3b-instruct","choices":[{"index":0,"delta":{"content":"Once"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1738193123,"model":"llama-3.2-3b-instruct","choices":[{"index":0,"delta":{"content":" upon"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1738193123,"model":"llama-3.2-3b-instruct","choices":[{"index":0,"delta":{"content":" a"},"finish_reason":null}]}
data: [DONE]
```
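If you are not using an SDK, you can consume the stream by reading `data:` lines directly. A minimal sketch using the `requests` library (the endpoint and payload come from the example above; error handling is omitted):

```python
import json
import requests

# Stream a completion and print tokens as they arrive
with requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "llama-3.2-3b-instruct",
        "messages": [{"role": "user", "content": "Tell me a short story"}],
        "stream": True,
    },
    stream=True,
) as resp:
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue  # skip blank separator lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        print(delta.get("content") or "", end="", flush=True)
```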
Osaurus implements OpenAI‑compatible function calling via the `tools` array and the optional `tool_choice` field in the request. The server injects tool‑calling instructions into the prompt and parses assistant outputs for a top‑level `tool_calls` object, tolerating minor formatting (e.g., code fences).

Supported tool type: `function`.

Request with tools (non‑streaming):
```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2-3b-instruct",
    "messages": [
      {"role": "user", "content": "Weather in SF?"}
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get weather by city name",
          "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"]
          }
        }
      }
    ],
    "tool_choice": "auto"
  }'
```

Example non‑streaming response (simplified):
```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1738193123,
  "model": "llama-3.2-3b-instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "",
        "tool_calls": [
          {
            "id": "call_1",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"city\":\"SF\"}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls"
    }
  ]
}
```

Streaming with tool calls: Osaurus emits OpenAI‑style deltas. First a role delta, then for each tool call: an id/type delta, a function name delta, and one or more argument deltas (chunked). The final chunk has `finish_reason: "tool_calls"`, followed by `[DONE]`.
```text
data: {"id":"chatcmpl-xyz","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant"}}]}
data: {"id":"chatcmpl-xyz","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"id":"call_1","type":"function"}]}}]}
data: {"id":"chatcmpl-xyz","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"name":"get_weather"}}]}}]}
data: {"id":"chatcmpl-xyz","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"{\"city\":\"SF\"}"}}]}}]}
data: {"id":"chatcmpl-xyz","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"tool_calls"}]}
data: [DONE]
```
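Because the name and arguments arrive in separate chunks, clients typically accumulate deltas by tool-call `index` until the `tool_calls` finish reason arrives. A sketch with the OpenAI Python client (the `tools` array matches the request example above):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="osaurus")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

stream = client.chat.completions.create(
    model="llama-3.2-3b-instruct",
    messages=[{"role": "user", "content": "Weather in SF?"}],
    tools=tools,
    stream=True,
)

# Accumulate per-index deltas into complete tool calls
calls = {}
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    for tc in delta.tool_calls or []:
        slot = calls.setdefault(tc.index, {"id": None, "name": None, "arguments": ""})
        if tc.id:
            slot["id"] = tc.id
        if tc.function and tc.function.name:
            slot["name"] = tc.function.name
        if tc.function and tc.function.arguments:
            slot["arguments"] += tc.function.arguments

print(calls)  # e.g. {0: {'id': 'call_1', 'name': 'get_weather', 'arguments': '{"city":"SF"}'}}
```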
Tool execution loop: After receiving tool calls, execute them client‑side and continue the conversation by sending the tool results as `role: "tool"` messages with the corresponding `tool_call_id`.
```python
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="osaurus")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="llama-3.2-3b-instruct",
    messages=[{"role": "user", "content": "Weather in SF?"}],
    tools=tools,
    tool_choice="auto",
)

# Execute each requested tool and collect the results as tool messages
tool_calls = resp.choices[0].message.tool_calls or []
tool_messages = []
for call in tool_calls:
    args = json.loads(call.function.arguments)
    # Execute your function here; this result is a placeholder
    result = {"tempC": 18, "conditions": "Foggy"}
    tool_messages.append(
        {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)}
    )

# Continue the conversation with the tool results
followup = client.chat.completions.create(
    model="llama-3.2-3b-instruct",
    messages=[
        {"role": "user", "content": "Weather in SF?"},
        {"role": "assistant", "content": "", "tool_calls": tool_calls},
        *tool_messages,
    ],
)
print(f"Answer: {followup.choices[0].message.content}")
```

Notes and limitations:
- Only `function` tools are supported.
- The assistant must return `arguments` as a JSON‑escaped string; the server also tolerates a nested `parameters` object and normalizes it.
- The parser accepts common wrappers such as code fences and an `assistant:` prefix.
- `tool_choice` supports `"auto"`, `"none"`, and a specific function target object, as shown below.
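For the target-object form, the request field would look like this (assuming Osaurus accepts the same OpenAI-style shape for forcing a specific function):

```json
"tool_choice": {"type": "function", "function": {"name": "get_weather"}}
```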
Provide a `session_id` to reuse the same model chat session's KV cache across requests. Reuse applies when:

- The `model` matches, and
- The session is not concurrently in use, and
- The session has not expired from the internal LRU window.
Example follow-up turn using the same `session_id`:

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2-3b-instruct",
    "session_id": "my-session-1",
    "messages": [
      {"role": "user", "content": "And one more detail, please."}
    ]
  }'
```

Keep `session_id` stable per conversation and per model.
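Since `session_id` is not a standard OpenAI parameter, SDK users can pass it through the request body. A minimal sketch with the OpenAI Python client, using its `extra_body` passthrough:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="osaurus")

# extra_body merges additional fields into the JSON request body
resp = client.chat.completions.create(
    model="llama-3.2-3b-instruct",
    messages=[{"role": "user", "content": "And one more detail, please."}],
    extra_body={"session_id": "my-session-1"},
)
print(resp.choices[0].message.content)
```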
Osaurus defers chat templating to MLX `ChatSession`, which uses the model's configuration to format prompts. System messages are combined and passed as instructions; user content is supplied as the prompt to `respond`/`streamResponse`.
Models are automatically named based on their display names in ModelManager. The API converts the model names to lowercase and replaces spaces with hyphens. For example:
| Downloaded Model | API Model Name |
|---|---|
| Llama 3.2 3B Instruct | llama-3.2-3b-instruct |
| Llama 3.2 1B Instruct | llama-3.2-1b-instruct |
| Qwen 2.5 7B Instruct | qwen-2.5-7b-instruct |
| Qwen 2.5 3B Instruct | qwen-2.5-3b-instruct |
| Gemma 2 9B Instruct | gemma-2-9b-instruct |
| Gemma 2 2B Instruct | gemma-2-2b-instruct |
| DeepSeek-R1 Distill 1.5B | deepseek-r1-distill-1.5b |
| OpenELM 3B (GPT-style) | openelm-3b-(gpt-style) |
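The mapping can be reproduced with a one-line transform (a sketch of the documented rule, not the actual ModelManager code):

```python
def api_model_name(display_name: str) -> str:
    # Documented rule: lowercase the display name and replace spaces with hyphens
    return display_name.lower().replace(" ", "-")

assert api_model_name("Llama 3.2 3B Instruct") == "llama-3.2-3b-instruct"
assert api_model_name("DeepSeek-R1 Distill 1.5B") == "deepseek-r1-distill-1.5b"
```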
You can use the official OpenAI Python library with Osaurus:
```python
from openai import OpenAI

# Point to your local Osaurus server
client = OpenAI(
    base_url="http://localhost:8080/v1",  # Use /v1 for OpenAI client compatibility
    api_key="not-needed"  # Osaurus doesn't require authentication
)

# List available models
models = client.models.list()
for model in models.data:
    print(model.id)

# Create a chat completion
response = client.chat.completions.create(
    model="llama-3.2-3b-instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ],
    temperature=0.7,
    max_tokens=100
)
print(response.choices[0].message.content)

# Stream a response
stream = client.chat.completions.create(
    model="llama-3.2-3b-instruct",
    messages=[
        {"role": "user", "content": "Write a haiku about coding"}
    ],
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
```
- **Model Availability**: Only models that have been downloaded through the Osaurus UI will be available via the API.
- **Performance**: The first request to a model may take longer as the model needs to be loaded into memory.
- **Memory Usage**: Models are cached in memory after loading. Use the ModelManager UI to manage which models are downloaded.
- **GPU Acceleration**: MLX automatically uses Apple Silicon GPU acceleration when available.
- **Context Length**: Each model has different context length limitations. Refer to the model documentation for specifics.