
Commit 5fbeea5

Merge pull request #4 from DaveZheng/feat/intent-based-routing

feat: intent-based routing with effort tiers and escalation

2 parents e252483 + 5ecc713 · commit 5fbeea5

10 files changed · 1135 additions & 15 deletions

README.md

Lines changed: 83 additions & 13 deletions

````diff
@@ -34,25 +34,79 @@ mallex
 # Start proxy only (for use with an existing Claude Code session)
 mallex proxy
 
+# Re-configure intent-based routing
+mallex --setup
+
 # Stop the background mlx-lm.server
 mallex server stop
 ```
 
-On first run, mallex detects your hardware and recommends a model. You can accept the recommendation or provide a custom model ID.
+On first run, mallex detects your hardware, recommends a model, and walks you through intent-based routing setup.
 
 ## How It Works
 
 ```
-Claude Code → mallex proxy (localhost:3456) → mlx-lm.server (localhost:8080)
-  Anthropic     translates request/response    OpenAI Chat Completions
-  Messages API  trims prompts for model size   serves local MLX model
+                                   ┌→ mlx-lm.server (localhost:8080)
+Claude Code → mallex proxy ────────┤      local MLX model
+  Anthropic     classifies intent  └→ Anthropic API
+  Messages API  routes by effort       Claude Sonnet / Opus
 ```
+
+1. **Classifies intent** — uses your local model to classify each request as low, medium, or high effort
+2. **Routes by effort** — sends simple tasks to local MLX, complex tasks to the Claude API (configurable per tier)
+3. **Translates requests** from Anthropic Messages API → OpenAI Chat Completions (for the local model path)
+4. **Trims prompts** — Claude Code sends ~24K chars of system prompt overhead; mallex trims this to fit the model's practical context budget
+5. **Injects tool definitions** as XML in the system prompt so the local model can use tools (read_file, write_file, edit_file, bash, glob, grep)
+6. **Translates responses** back from OpenAI format → Anthropic format (including streaming)
+
+## Intent-Based Routing
+
+mallex classifies every request by complexity and routes it to the right model. This is inspired by [NVIDIA's LLM Router](https://build.nvidia.com/nvidia/llm-router) pattern.
+
+### Effort tiers
+
+| Tier | Default (8-32GB) | Default (64GB+ with Qwen3-Coder-Next) |
+|------|-------------------|---------------------------------------|
+| **Low** — chit chat, simple edits | Local MLX | Local MLX |
+| **Medium** — single features, debugging | Claude Sonnet 4.5 | Local MLX (benchmarks near Sonnet) |
+| **High** — architecture, multi-file refactors | Claude Opus 4.6 | Claude Opus 4.6 |
+
+Defaults are recommendations based on your local model's capability. You can override any tier during setup.
+
+### Intent categories
+
+Each request is classified by the local model into one of four categories, which map to tiers automatically:
+
+| Category | Description | Tier |
+|----------|-------------|------|
+| `chit_chat` | Casual conversation, explanations, Q&A | Low |
+| `simple_code` | Single-file edits, renames, fixing imports/typos | Low |
+| `hard_question` | Multi-file refactors, architecture, planning, complex debugging | High |
+| `try_again` | Previous answer was wrong/incomplete — escalates one tier up | Escalates |
+
+### Escalation
+
+When you say "that's wrong" or "try again", mallex escalates to the next tier:
+
+```
+Local MLX (Low) → Claude Sonnet 4.5 (Medium) → Claude Opus 4.6 (High)
+```
+
+If your local model handles medium (64GB+ setups), escalation goes:
+
+```
+Local MLX (Low) → Local MLX (Medium) → Claude Opus 4.6 (High)
+```
+
+### Setup
+
+On first run, mallex walks you through routing configuration. To reconfigure later:
+
+```bash
+mallex --setup
 ```
 
-1. **Starts mlx-lm.server** with your chosen model if not already running
-2. **Translates requests** from Anthropic Messages API → OpenAI Chat Completions
-3. **Trims prompts** — Claude Code sends ~24K chars of system prompt overhead; mallex trims this to fit the model's practical context budget
-4. **Injects tool definitions** as XML in the system prompt so the local model can use tools (read_file, write_file, edit_file, bash, glob, grep)
-5. **Translates responses** back from OpenAI format → Anthropic format (including streaming)
+You only need a Claude API key if any tier is configured to use Claude. If no key is provided, Claude tiers fall back to local MLX.
 
 ## Prompt Trimming
 
@@ -152,17 +206,33 @@ Config is stored at `~/.mallex/config.json`:
   "serverPort": 8080,
   "proxyPort": 3456,
   "idleTimeoutMinutes": 15,
-  "onExitServer": "ask"
+  "onExitServer": "ask",
+  "routing": {
+    "rules": {
+      "chit_chat": { "tier": 1 },
+      "simple_code": { "tier": 1 },
+      "hard_question": { "tier": 3 },
+      "try_again": { "tier": 1 }
+    },
+    "tiers": {
+      "1": { "target": "local" },
+      "2": { "target": "claude", "claudeModel": "claude-sonnet-4-5-20250929" },
+      "3": { "target": "claude", "claudeModel": "claude-opus-4-6" }
+    },
+    "claudeApiKey": "sk-ant-..."
+  }
 }
 ```
 
 ## Recommended Models
 
 | Hardware | Recommended Model | Notes |
 |---|---|---|
-| 16GB RAM | Qwen2.5-Coder-7B-Instruct-4bit | Best quality/speed for limited RAM |
-| 32GB RAM | Qwen2.5-Coder-14B-Instruct-4bit | Good balance |
-| 64GB+ RAM | Qwen2.5-Coder-32B-Instruct-4bit | Best local coding model |
+| 8GB RAM | Qwen2.5-Coder-7B-Instruct-4bit | Basic — pair with Claude for medium/high tasks |
+| 16GB RAM | Qwen2.5-Coder-14B-Instruct-4bit | Good for simple tasks |
+| 32GB RAM | Qwen3-Coder-30B-A3B-Instruct-4bit | Handles most code tasks locally |
+| 64GB RAM | Qwen3-Coder-Next-Instruct-4bit | Benchmarks near Sonnet — handles medium tasks locally |
+| 128GB+ RAM | Qwen3-Coder-Next-Instruct-8bit | Best local quality |
 
 ## Debug
````

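The escalation rule in the README ("escalates one tier up", capped at the top tier) can be sketched as a small helper. This is an illustrative sketch, not code from the mallex repo; `escalateTier` is a hypothetical name:

```typescript
type Tier = 1 | 2 | 3;

// Hypothetical sketch of the escalation rule: a `try_again` intent bumps the
// previous request's tier by one, capping at the highest tier (3).
function escalateTier(previous: Tier): Tier {
  return Math.min(previous + 1, 3) as Tier;
}

// Low → Medium → High, then stays at High.
console.log(escalateTier(1)); // 2
console.log(escalateTier(2)); // 3
console.log(escalateTier(3)); // 3
```

Whether tiers 1 and 2 map to the same local model (the 64GB+ case) or to different backends, the tier number still increments the same way; the tier table decides what each number means.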
src/claude-client.test.ts

Lines changed: 40 additions & 0 deletions

```typescript
import { describe, it } from "node:test";
import assert from "node:assert";
import { ClaudeApiError } from "./claude-client.js";

describe("ClaudeApiError", () => {
  it("preserves message and status", () => {
    const err = new ClaudeApiError("something went wrong", 500);
    assert.strictEqual(err.message, "something went wrong");
    assert.strictEqual(err.status, 500);
    assert.strictEqual(err.name, "ClaudeApiError");
  });

  it("classifies 401 as auth error", () => {
    const err = new ClaudeApiError("unauthorized", 401);
    assert.strictEqual(err.isAuthError, true);
    assert.strictEqual(err.isRateLimited, false);
    assert.strictEqual(err.isOverloaded, false);
  });

  it("classifies 429 as rate limited", () => {
    const err = new ClaudeApiError("rate limited", 429);
    assert.strictEqual(err.isAuthError, false);
    assert.strictEqual(err.isRateLimited, true);
    assert.strictEqual(err.isOverloaded, false);
  });

  it("classifies 529 as overloaded", () => {
    const err = new ClaudeApiError("overloaded", 529);
    assert.strictEqual(err.isAuthError, false);
    assert.strictEqual(err.isRateLimited, false);
    assert.strictEqual(err.isOverloaded, true);
  });

  it("classifies 500 as none of the special categories", () => {
    const err = new ClaudeApiError("internal server error", 500);
    assert.strictEqual(err.isAuthError, false);
    assert.strictEqual(err.isRateLimited, false);
    assert.strictEqual(err.isOverloaded, false);
  });
});
```

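The error categories exercised by these tests are what a routing proxy would branch on. A hypothetical sketch of fallback logic built on them — the error class is restated so the example is self-contained, and `completeWithFallback` with its callbacks are stand-ins, not mallex APIs:

```typescript
// Restated from the client module so this sketch runs on its own.
class ClaudeApiError extends Error {
  constructor(message: string, readonly status: number) {
    super(message);
    this.name = "ClaudeApiError";
  }
  get isRateLimited(): boolean { return this.status === 429; }
  get isOverloaded(): boolean { return this.status === 529; }
}

// Hypothetical policy: prefer Claude, but degrade gracefully to the local
// model on retryable capacity errors (429/529). Anything else, such as a
// 401 auth error, is surfaced to the caller unchanged.
async function completeWithFallback(
  callClaude: () => Promise<string>,
  callLocal: () => Promise<string>,
): Promise<string> {
  try {
    return await callClaude();
  } catch (err) {
    if (err instanceof ClaudeApiError && (err.isRateLimited || err.isOverloaded)) {
      return callLocal();
    }
    throw err;
  }
}
```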
src/claude-client.ts

Lines changed: 136 additions & 0 deletions

```typescript
/**
 * Anthropic API client using node:https (zero external dependencies).
 * Used by the routing proxy to forward requests that are too complex for local MLX.
 */

import https from "node:https";
import type { ServerResponse } from "node:http";

const ANTHROPIC_HOST = "api.anthropic.com";
const ANTHROPIC_PATH = "/v1/messages";
const ANTHROPIC_VERSION = "2023-06-01";

export class ClaudeApiError extends Error {
  readonly status: number;

  constructor(message: string, status: number) {
    super(message);
    this.name = "ClaudeApiError";
    this.status = status;
  }

  get isAuthError(): boolean {
    return this.status === 401;
  }

  get isRateLimited(): boolean {
    return this.status === 429;
  }

  get isOverloaded(): boolean {
    return this.status === 529;
  }
}

/**
 * Send a non-streaming chat completion request to the Anthropic Messages API.
 * Returns the raw JSON response body string.
 */
export function claudeCompletion(
  anthropicReq: object,
  options: { apiKey: string },
): Promise<string> {
  return new Promise((resolve, reject) => {
    const body = JSON.stringify(anthropicReq);

    const req = https.request(
      {
        hostname: ANTHROPIC_HOST,
        path: ANTHROPIC_PATH,
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          "x-api-key": options.apiKey,
          "anthropic-version": ANTHROPIC_VERSION,
          "Content-Length": Buffer.byteLength(body),
        },
      },
      (res) => {
        const chunks: Buffer[] = [];
        res.on("data", (chunk: Buffer) => chunks.push(chunk));
        res.on("end", () => {
          const responseBody = Buffer.concat(chunks).toString("utf-8");
          const status = res.statusCode ?? 0;
          if (status < 200 || status >= 300) {
            reject(new ClaudeApiError(responseBody, status));
            return;
          }
          resolve(responseBody);
        });
      },
    );

    req.on("error", (err) => {
      reject(new Error(`Anthropic API network error: ${err.message}`));
    });

    req.write(body);
    req.end();
  });
}

/**
 * Send a streaming chat completion request to the Anthropic Messages API.
 * Pipes the SSE response directly to the provided ServerResponse.
 */
export function claudeCompletionStream(
  anthropicReq: object,
  options: { apiKey: string },
  clientRes: ServerResponse,
): Promise<void> {
  return new Promise((resolve, reject) => {
    const body = JSON.stringify({ ...anthropicReq, stream: true });

    const req = https.request(
      {
        hostname: ANTHROPIC_HOST,
        path: ANTHROPIC_PATH,
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          "x-api-key": options.apiKey,
          "anthropic-version": ANTHROPIC_VERSION,
          "Content-Length": Buffer.byteLength(body),
        },
      },
      (apiRes) => {
        const status = apiRes.statusCode ?? 0;
        if (status < 200 || status >= 300) {
          const chunks: Buffer[] = [];
          apiRes.on("data", (chunk: Buffer) => chunks.push(chunk));
          apiRes.on("end", () => {
            const responseBody = Buffer.concat(chunks).toString("utf-8");
            reject(new ClaudeApiError(responseBody, status));
          });
          return;
        }

        clientRes.writeHead(200, {
          "Content-Type": "text/event-stream",
          "Cache-Control": "no-cache",
          "Connection": "keep-alive",
        });

        apiRes.pipe(clientRes);
        apiRes.on("end", () => resolve());
      },
    );

    req.on("error", (err) => {
      reject(new Error(`Anthropic API network error: ${err.message}`));
    });

    req.write(body);
    req.end();
  });
}
```

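`claudeCompletion` resolves with the raw JSON body, so callers parse it themselves. A minimal sketch of pulling the assistant text out of a Messages API response body; the sample object and the `extractText` helper are illustrative, not part of this diff:

```typescript
// Illustrative response body in the Anthropic Messages API shape:
// content is an array of blocks, text blocks carry a `text` field.
const rawBody = JSON.stringify({
  id: "msg_01",
  role: "assistant",
  content: [{ type: "text", text: "Hello from Claude" }],
  stop_reason: "end_turn",
});

interface AnthropicMessage {
  content: Array<{ type: string; text?: string }>;
}

// Concatenate every text block in the response's content array.
function extractText(body: string): string {
  const msg = JSON.parse(body) as AnthropicMessage;
  return msg.content
    .filter((block) => block.type === "text")
    .map((block) => block.text ?? "")
    .join("");
}

console.log(extractText(rawBody)); // "Hello from Claude"
```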
src/config.ts

Lines changed: 41 additions & 0 deletions

```diff
@@ -2,12 +2,53 @@ import fs from "node:fs";
 import path from "node:path";
 import os from "node:os";
 
+export type IntentCategory = "chit_chat" | "simple_code" | "hard_question" | "try_again";
+export type ModelTierNumber = 1 | 2 | 3;
+
+export interface RoutingRule {
+  tier: ModelTierNumber;
+}
+
+export interface TierModel {
+  target: "local" | "claude";
+  claudeModel?: string;
+}
+
+export interface RoutingConfig {
+  rules: Record<IntentCategory, RoutingRule>;
+  tiers: Record<ModelTierNumber, TierModel>;
+  claudeApiKey?: string;
+}
+
+export const DEFAULT_ROUTING_RULES: Record<IntentCategory, RoutingRule> = {
+  chit_chat: { tier: 1 },
+  simple_code: { tier: 1 },
+  hard_question: { tier: 3 },
+  try_again: { tier: 1 },
+};
+
+/**
+ * Returns default tier→model mapping based on local model capability.
+ * Qwen3-Coder-Next benchmarks near Sonnet, so medium defaults to local for those users.
+ */
+export function defaultTierModels(localModel: string): Record<ModelTierNumber, TierModel> {
+  const isPowerful = localModel.toLowerCase().includes("qwen3-coder-next");
+  return {
+    1: { target: "local" },
+    2: isPowerful
+      ? { target: "local" }
+      : { target: "claude", claudeModel: "claude-sonnet-4-5-20250929" },
+    3: { target: "claude", claudeModel: "claude-opus-4-6" },
+  };
+}
+
 export interface MallexConfig {
   model: string;
   serverPort: number;
   proxyPort: number;
   idleTimeoutMinutes: number;
   onExitServer: "ask" | "stop" | "keep";
+  routing?: RoutingConfig;
 }
 
 export const DEFAULT_CONFIG: MallexConfig = {
```

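With these shapes in place, resolving where a request goes is a two-step lookup: category → rule tier → tier model. A self-contained sketch mirroring the defaults in the diff above (shown with the 64GB+ variant where tier 2 stays local); `resolveTarget` is a hypothetical helper, not part of the diff:

```typescript
type IntentCategory = "chit_chat" | "simple_code" | "hard_question" | "try_again";
type ModelTierNumber = 1 | 2 | 3;

interface TierModel {
  target: "local" | "claude";
  claudeModel?: string;
}

// Default rules, mirroring DEFAULT_ROUTING_RULES in the diff above.
const rules: Record<IntentCategory, { tier: ModelTierNumber }> = {
  chit_chat: { tier: 1 },
  simple_code: { tier: 1 },
  hard_question: { tier: 3 },
  try_again: { tier: 1 },
};

// Tier table for the 64GB+ case, where Qwen3-Coder-Next handles medium locally.
const tiers: Record<ModelTierNumber, TierModel> = {
  1: { target: "local" },
  2: { target: "local" },
  3: { target: "claude", claudeModel: "claude-opus-4-6" },
};

// Look up which backend a classified request should be sent to.
function resolveTarget(category: IntentCategory): TierModel {
  return tiers[rules[category].tier];
}

console.log(resolveTarget("simple_code").target);   // "local"
console.log(resolveTarget("hard_question").target); // "claude"
```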