
Commit 5fbeea5

Merge pull request #4 from DaveZheng/feat/intent-based-routing

feat: intent-based routing with effort tiers and escalation

2 parents e252483 + 5ecc713 · commit 5fbeea5

10 files changed · 1135 additions & 15 deletions

README.md

Lines changed: 83 additions & 13 deletions

````diff
@@ -34,25 +34,79 @@ mallex
 # Start proxy only (for use with an existing Claude Code session)
 mallex proxy
 
+# Re-configure intent-based routing
+mallex --setup
+
 # Stop the background mlx-lm.server
 mallex server stop
 ```
 
-On first run, mallex detects your hardware and recommends a model. You can accept the recommendation or provide a custom model ID.
+On first run, mallex detects your hardware, recommends a model, and walks you through intent-based routing setup.
 
 ## How It Works
 
 ```
-Claude Code → mallex proxy (localhost:3456) → mlx-lm.server (localhost:8080)
-  Anthropic     translates request/response    OpenAI Chat Completions
-  Messages API  trims prompts for model size   serves local MLX model
+                                   ┌→ mlx-lm.server (localhost:8080)
+Claude Code → mallex proxy ────────┤      local MLX model
+  Anthropic     classifies intent  └→ Anthropic API
+  Messages API  routes by effort       Claude Sonnet / Opus
 ```
+
+1. **Classifies intent** — uses your local model to classify each request as low, medium, or high effort
+2. **Routes by effort** — sends simple tasks to local MLX, complex tasks to the Claude API (configurable per tier)
+3. **Translates requests** from Anthropic Messages API → OpenAI Chat Completions (for the local model path)
+4. **Trims prompts** — Claude Code sends ~24K chars of system prompt overhead; mallex trims this to fit the model's practical context budget
+5. **Injects tool definitions** as XML in the system prompt so the local model can use tools (read_file, write_file, edit_file, bash, glob, grep)
+6. **Translates responses** back from OpenAI format → Anthropic format (including streaming)
+
+## Intent-Based Routing
+
+mallex classifies every request by complexity and routes it to the right model. This is inspired by [NVIDIA's LLM Router](https://build.nvidia.com/nvidia/llm-router) pattern.
+
+### Effort tiers
+
+| Tier | Default (8-32GB) | Default (64GB+ with Qwen3-Coder-Next) |
+|------|-------------------|---------------------------------------|
+| **Low** — chit chat, simple edits | Local MLX | Local MLX |
+| **Medium** — single features, debugging | Claude Sonnet 4.5 | Local MLX (benchmarks near Sonnet) |
+| **High** — architecture, multi-file refactors | Claude Opus 4.6 | Claude Opus 4.6 |
+
+Defaults are recommendations based on your local model's capability. You can override any tier during setup.
+
+### Intent categories
+
+Each request is classified by the local model into one of four categories, which map to tiers automatically:
+
+| Category | Description | Tier |
+|----------|-------------|------|
+| `chit_chat` | Casual conversation, explanations, Q&A | Low |
+| `simple_code` | Single-file edits, renames, fixing imports/typos | Low |
+| `hard_question` | Multi-file refactors, architecture, planning, complex debugging | High |
+| `try_again` | Previous answer was wrong/incomplete — escalates one tier up | Escalates |
+
+### Escalation
+
+When you say "that's wrong" or "try again", mallex escalates to the next tier:
+
+```
+Local MLX (Low) → Claude Sonnet 4.5 (Medium) → Claude Opus 4.6 (High)
+```
+
+If your local model handles medium (64GB+ setups), escalation goes:
+
+```
+Local MLX (Low) → Local MLX (Medium) → Claude Opus 4.6 (High)
+```
+
+### Setup
+
+On first run, mallex walks you through routing configuration. To reconfigure later:
+
+```bash
+mallex --setup
 ```
 
-1. **Starts mlx-lm.server** with your chosen model if not already running
-2. **Translates requests** from Anthropic Messages API → OpenAI Chat Completions
-3. **Trims prompts** — Claude Code sends ~24K chars of system prompt overhead; mallex trims this to fit the model's practical context budget
-4. **Injects tool definitions** as XML in the system prompt so the local model can use tools (read_file, write_file, edit_file, bash, glob, grep)
-5. **Translates responses** back from OpenAI format → Anthropic format (including streaming)
+You only need a Claude API key if any tier is configured to use Claude. If no key is provided, Claude tiers fall back to local MLX.
 
 ## Prompt Trimming
 
@@ -152,17 +206,33 @@ Config is stored at `~/.mallex/config.json`:
   "serverPort": 8080,
   "proxyPort": 3456,
   "idleTimeoutMinutes": 15,
-  "onExitServer": "ask"
+  "onExitServer": "ask",
+  "routing": {
+    "rules": {
+      "chit_chat": { "tier": 1 },
+      "simple_code": { "tier": 1 },
+      "hard_question": { "tier": 3 },
+      "try_again": { "tier": 1 }
+    },
+    "tiers": {
+      "1": { "target": "local" },
+      "2": { "target": "claude", "claudeModel": "claude-sonnet-4-5-20250929" },
+      "3": { "target": "claude", "claudeModel": "claude-opus-4-6" }
+    },
+    "claudeApiKey": "sk-ant-..."
+  }
 }
 ```
 
 ## Recommended Models
 
 | Hardware | Recommended Model | Notes |
 |---|---|---|
-| 16GB RAM | Qwen2.5-Coder-7B-Instruct-4bit | Best quality/speed for limited RAM |
-| 32GB RAM | Qwen2.5-Coder-14B-Instruct-4bit | Good balance |
-| 64GB+ RAM | Qwen2.5-Coder-32B-Instruct-4bit | Best local coding model |
+| 8GB RAM | Qwen2.5-Coder-7B-Instruct-4bit | Basic — pair with Claude for medium/high tasks |
+| 16GB RAM | Qwen2.5-Coder-14B-Instruct-4bit | Good for simple tasks |
+| 32GB RAM | Qwen3-Coder-30B-A3B-Instruct-4bit | Handles most code tasks locally |
+| 64GB RAM | Qwen3-Coder-Next-Instruct-4bit | Benchmarks near Sonnet — handles medium tasks locally |
+| 128GB+ RAM | Qwen3-Coder-Next-Instruct-8bit | Best local quality |
 
 ## Debug
````

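The escalation rule in the README ("escalates one tier up", capped at the top tier) can be sketched as a small helper. This is an illustrative sketch, not code from the mallex repo; `escalateTier` is a hypothetical name:

```typescript
type Tier = 1 | 2 | 3;

// Hypothetical sketch of the escalation rule: a `try_again` intent bumps the
// previous request's tier by one, capping at the highest tier (3).
function escalateTier(previous: Tier): Tier {
  return Math.min(previous + 1, 3) as Tier;
}

// Low → Medium → High, then stays at High.
console.log(escalateTier(1)); // 2
console.log(escalateTier(2)); // 3
console.log(escalateTier(3)); // 3
```

Whether tiers 1 and 2 map to the same local model (the 64GB+ case) or to different backends, the tier number still increments the same way; the tier table decides what each number means.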
src/claude-client.test.ts

Lines changed: 40 additions & 0 deletions

```typescript
import { describe, it } from "node:test";
import assert from "node:assert";
import { ClaudeApiError } from "./claude-client.js";

describe("ClaudeApiError", () => {
  it("preserves message and status", () => {
    const err = new ClaudeApiError("something went wrong", 500);
    assert.strictEqual(err.message, "something went wrong");
    assert.strictEqual(err.status, 500);
    assert.strictEqual(err.name, "ClaudeApiError");
  });

  it("classifies 401 as auth error", () => {
    const err = new ClaudeApiError("unauthorized", 401);
    assert.strictEqual(err.isAuthError, true);
    assert.strictEqual(err.isRateLimited, false);
    assert.strictEqual(err.isOverloaded, false);
  });

  it("classifies 429 as rate limited", () => {
    const err = new ClaudeApiError("rate limited", 429);
    assert.strictEqual(err.isAuthError, false);
    assert.strictEqual(err.isRateLimited, true);
    assert.strictEqual(err.isOverloaded, false);
  });

  it("classifies 529 as overloaded", () => {
    const err = new ClaudeApiError("overloaded", 529);
    assert.strictEqual(err.isAuthError, false);
    assert.strictEqual(err.isRateLimited, false);
    assert.strictEqual(err.isOverloaded, true);
  });

  it("classifies 500 as none of the special categories", () => {
    const err = new ClaudeApiError("internal server error", 500);
    assert.strictEqual(err.isAuthError, false);
    assert.strictEqual(err.isRateLimited, false);
    assert.strictEqual(err.isOverloaded, false);
  });
});
```

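The error categories exercised by these tests are what a routing proxy would branch on. A hypothetical sketch of fallback logic built on them — the error class is restated so the example is self-contained, and `completeWithFallback` with its callbacks are stand-ins, not mallex APIs:

```typescript
// Restated from the client module so this sketch runs on its own.
class ClaudeApiError extends Error {
  constructor(message: string, readonly status: number) {
    super(message);
    this.name = "ClaudeApiError";
  }
  get isRateLimited(): boolean { return this.status === 429; }
  get isOverloaded(): boolean { return this.status === 529; }
}

// Hypothetical policy: prefer Claude, but degrade gracefully to the local
// model on retryable capacity errors (429/529). Anything else, such as a
// 401 auth error, is surfaced to the caller unchanged.
async function completeWithFallback(
  callClaude: () => Promise<string>,
  callLocal: () => Promise<string>,
): Promise<string> {
  try {
    return await callClaude();
  } catch (err) {
    if (err instanceof ClaudeApiError && (err.isRateLimited || err.isOverloaded)) {
      return callLocal();
    }
    throw err;
  }
}
```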
src/claude-client.ts

Lines changed: 136 additions & 0 deletions

```typescript
/**
 * Anthropic API client using node:https (zero external dependencies).
 * Used by the routing proxy to forward requests that are too complex for local MLX.
 */

import https from "node:https";
import type { ServerResponse } from "node:http";

const ANTHROPIC_HOST = "api.anthropic.com";
const ANTHROPIC_PATH = "/v1/messages";
const ANTHROPIC_VERSION = "2023-06-01";

export class ClaudeApiError extends Error {
  readonly status: number;

  constructor(message: string, status: number) {
    super(message);
    this.name = "ClaudeApiError";
    this.status = status;
  }

  get isAuthError(): boolean {
    return this.status === 401;
  }

  get isRateLimited(): boolean {
    return this.status === 429;
  }

  get isOverloaded(): boolean {
    return this.status === 529;
  }
}

/**
 * Send a non-streaming chat completion request to the Anthropic Messages API.
 * Returns the raw JSON response body string.
 */
export function claudeCompletion(
  anthropicReq: object,
  options: { apiKey: string },
): Promise<string> {
  return new Promise((resolve, reject) => {
    const body = JSON.stringify(anthropicReq);

    const req = https.request(
      {
        hostname: ANTHROPIC_HOST,
        path: ANTHROPIC_PATH,
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          "x-api-key": options.apiKey,
          "anthropic-version": ANTHROPIC_VERSION,
          "Content-Length": Buffer.byteLength(body),
        },
      },
      (res) => {
        const chunks: Buffer[] = [];
        res.on("data", (chunk: Buffer) => chunks.push(chunk));
        res.on("end", () => {
          const responseBody = Buffer.concat(chunks).toString("utf-8");
          const status = res.statusCode ?? 0;
          if (status < 200 || status >= 300) {
            reject(new ClaudeApiError(responseBody, status));
            return;
          }
          resolve(responseBody);
        });
      },
    );

    req.on("error", (err) => {
      reject(new Error(`Anthropic API network error: ${err.message}`));
    });

    req.write(body);
    req.end();
  });
}

/**
 * Send a streaming chat completion request to the Anthropic Messages API.
 * Pipes the SSE response directly to the provided ServerResponse.
 */
export function claudeCompletionStream(
  anthropicReq: object,
  options: { apiKey: string },
  clientRes: ServerResponse,
): Promise<void> {
  return new Promise((resolve, reject) => {
    const body = JSON.stringify({ ...anthropicReq, stream: true });

    const req = https.request(
      {
        hostname: ANTHROPIC_HOST,
        path: ANTHROPIC_PATH,
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          "x-api-key": options.apiKey,
          "anthropic-version": ANTHROPIC_VERSION,
          "Content-Length": Buffer.byteLength(body),
        },
      },
      (apiRes) => {
        const status = apiRes.statusCode ?? 0;
        if (status < 200 || status >= 300) {
          const chunks: Buffer[] = [];
          apiRes.on("data", (chunk: Buffer) => chunks.push(chunk));
          apiRes.on("end", () => {
            const responseBody = Buffer.concat(chunks).toString("utf-8");
            reject(new ClaudeApiError(responseBody, status));
          });
          return;
        }

        clientRes.writeHead(200, {
          "Content-Type": "text/event-stream",
          "Cache-Control": "no-cache",
          "Connection": "keep-alive",
        });

        apiRes.pipe(clientRes);
        apiRes.on("end", () => resolve());
      },
    );

    req.on("error", (err) => {
      reject(new Error(`Anthropic API network error: ${err.message}`));
    });

    req.write(body);
    req.end();
  });
}
```

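`claudeCompletion` resolves with the raw JSON body, so callers parse it themselves. A minimal sketch of pulling the assistant text out of a Messages API response body; the sample object and the `extractText` helper are illustrative, not part of this diff:

```typescript
// Illustrative response body in the Anthropic Messages API shape:
// content is an array of blocks, text blocks carry a `text` field.
const rawBody = JSON.stringify({
  id: "msg_01",
  role: "assistant",
  content: [{ type: "text", text: "Hello from Claude" }],
  stop_reason: "end_turn",
});

interface AnthropicMessage {
  content: Array<{ type: string; text?: string }>;
}

// Concatenate every text block in the response's content array.
function extractText(body: string): string {
  const msg = JSON.parse(body) as AnthropicMessage;
  return msg.content
    .filter((block) => block.type === "text")
    .map((block) => block.text ?? "")
    .join("");
}

console.log(extractText(rawBody)); // "Hello from Claude"
```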
src/config.ts

Lines changed: 41 additions & 0 deletions

```diff
@@ -2,12 +2,53 @@ import fs from "node:fs";
 import path from "node:path";
 import os from "node:os";
 
+export type IntentCategory = "chit_chat" | "simple_code" | "hard_question" | "try_again";
+export type ModelTierNumber = 1 | 2 | 3;
+
+export interface RoutingRule {
+  tier: ModelTierNumber;
+}
+
+export interface TierModel {
+  target: "local" | "claude";
+  claudeModel?: string;
+}
+
+export interface RoutingConfig {
+  rules: Record<IntentCategory, RoutingRule>;
+  tiers: Record<ModelTierNumber, TierModel>;
+  claudeApiKey?: string;
+}
+
+export const DEFAULT_ROUTING_RULES: Record<IntentCategory, RoutingRule> = {
+  chit_chat: { tier: 1 },
+  simple_code: { tier: 1 },
+  hard_question: { tier: 3 },
+  try_again: { tier: 1 },
+};
+
+/**
+ * Returns default tier→model mapping based on local model capability.
+ * Qwen3-Coder-Next benchmarks near Sonnet, so medium defaults to local for those users.
+ */
+export function defaultTierModels(localModel: string): Record<ModelTierNumber, TierModel> {
+  const isPowerful = localModel.toLowerCase().includes("qwen3-coder-next");
+  return {
+    1: { target: "local" },
+    2: isPowerful
+      ? { target: "local" }
+      : { target: "claude", claudeModel: "claude-sonnet-4-5-20250929" },
+    3: { target: "claude", claudeModel: "claude-opus-4-6" },
+  };
+}
+
 export interface MallexConfig {
   model: string;
   serverPort: number;
   proxyPort: number;
   idleTimeoutMinutes: number;
   onExitServer: "ask" | "stop" | "keep";
+  routing?: RoutingConfig;
 }
 
 export const DEFAULT_CONFIG: MallexConfig = {
```

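With these shapes in place, resolving where a request goes is a two-step lookup: category → rule tier → tier model. A self-contained sketch mirroring the defaults in the diff above (shown with the 64GB+ variant where tier 2 stays local); `resolveTarget` is a hypothetical helper, not part of the diff:

```typescript
type IntentCategory = "chit_chat" | "simple_code" | "hard_question" | "try_again";
type ModelTierNumber = 1 | 2 | 3;

interface TierModel {
  target: "local" | "claude";
  claudeModel?: string;
}

// Default rules, mirroring DEFAULT_ROUTING_RULES in the diff above.
const rules: Record<IntentCategory, { tier: ModelTierNumber }> = {
  chit_chat: { tier: 1 },
  simple_code: { tier: 1 },
  hard_question: { tier: 3 },
  try_again: { tier: 1 },
};

// Tier table for the 64GB+ case, where Qwen3-Coder-Next handles medium locally.
const tiers: Record<ModelTierNumber, TierModel> = {
  1: { target: "local" },
  2: { target: "local" },
  3: { target: "claude", claudeModel: "claude-opus-4-6" },
};

// Look up which backend a classified request should be sent to.
function resolveTarget(category: IntentCategory): TierModel {
  return tiers[rules[category].tier];
}

console.log(resolveTarget("simple_code").target);   // "local"
console.log(resolveTarget("hard_question").target); // "claude"
```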