                                        ┌→ mlx-lm.server (localhost:8080)
Claude Code ───→ mallex proxy ──────────┤       serves local MLX model
(Anthropic       classifies intent,     └→ Anthropic API
 Messages API)   routes by effort,            Claude Sonnet / Opus
                 trims prompts for
                 model size
```
1. **Classifies intent** — uses your local model to classify each request as low, medium, or high effort
2. **Routes by effort** — sends simple tasks to local MLX, complex tasks to the Claude API (configurable per tier)
3. **Translates requests** from Anthropic Messages API → OpenAI Chat Completions (for the local model path)
4. **Trims prompts** — Claude Code sends ~24K chars of system prompt overhead; mallex trims this to fit the model's practical context budget
5. **Injects tool definitions** as XML in the system prompt so the local model can use tools (read_file, write_file, edit_file, bash, glob, grep)
6. **Translates responses** back from OpenAI format → Anthropic format (including streaming)
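The two translation steps above (requests in, responses out) can be sketched roughly like this. This is a simplified illustration, not mallex's actual code: tool-call blocks, images, and streaming deltas are omitted, and only the basic field mapping between the two APIs is shown.

```python
def anthropic_to_openai(body: dict) -> dict:
    """Convert an Anthropic Messages API request into an OpenAI
    Chat Completions request (simplified sketch)."""
    messages = []
    # Anthropic carries the system prompt as a top-level field;
    # OpenAI expects it as the first chat message.
    if "system" in body:
        messages.append({"role": "system", "content": body["system"]})
    for msg in body["messages"]:
        content = msg["content"]
        # Anthropic content may be a list of typed blocks; flatten the text ones.
        if isinstance(content, list):
            content = "".join(b["text"] for b in content if b.get("type") == "text")
        messages.append({"role": msg["role"], "content": content})
    return {
        "model": body["model"],
        "messages": messages,
        "max_tokens": body.get("max_tokens", 1024),
    }


def openai_to_anthropic(resp: dict) -> dict:
    """Convert a (non-streaming) OpenAI Chat Completions response back
    into Anthropic Messages format (simplified sketch)."""
    choice = resp["choices"][0]
    stop = "end_turn" if choice["finish_reason"] == "stop" else choice["finish_reason"]
    return {
        "type": "message",
        "role": "assistant",
        "content": [{"type": "text", "text": choice["message"]["content"]}],
        "stop_reason": stop,
    }
```

The real proxy also has to map streaming chunks (OpenAI's `delta` events onto Anthropic's `content_block_delta` events), which this sketch leaves out.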
## Intent-Based Routing
mallex classifies every request by complexity and routes it to the right model. This is inspired by [NVIDIA's LLM Router](https://build.nvidia.com/nvidia/llm-router) pattern.
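A minimal version of the pattern might look like the sketch below. The prompt wording, tier names, and fallback behavior here are hypothetical, not mallex's actual configuration; the point is the shape: a cheap classification call to the local model, then a table lookup from effort tier to backend.

```python
# Effort tier → backend. "medium" is shown going local, but the document
# notes routing is configurable per tier.
EFFORT_ROUTES = {
    "low": "local-mlx",
    "medium": "local-mlx",
    "high": "anthropic",
}

# Hypothetical classifier prompt; the real one will differ.
CLASSIFIER_PROMPT = (
    "Classify the effort required by the following request as exactly one "
    "of: low, medium, high. Reply with the single word only.\n\n"
    "Request:\n{request}"
)

def route(request_text: str, classify) -> str:
    """`classify` sends a prompt to the local MLX model and returns its raw
    completion. Unparseable answers fall back to the Claude API, the safe
    (more capable) default."""
    answer = classify(CLASSIFIER_PROMPT.format(request=request_text))
    return EFFORT_ROUTES.get(answer.strip().lower(), "anthropic")
```

Failing open to the Claude API means a confused classifier costs money rather than quality, which is the usual trade-off in this kind of router.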