Local Model Gateway is a lightweight management layer for local AI workstations: one endpoint, one scheduler, and one dashboard for coordinating model access across local agents and LLM apps.
It exposes a stable OpenAI-compatible API and a Streamable HTTP MCP endpoint, then puts all GPU-bound work behind one SQLite-backed scheduler. Agents can ask for Qwen, MiniMax, a LoRA job, or another local runtime through the same gateway without racing each other, double-loading models, or bypassing load/unload decisions.
Watch a 58-second demo of multiple local clients sharing one endpoint while the gateway queues requests, swaps model residency, and keeps active work uninterrupted:
local-model-gateway-cli-demo.mp4
Local models are easy to start and hard to coordinate.
Most agent stacks assume "one base URL equals one model." On a real workstation, that breaks down quickly:
- a 32B model and a 230B MoE model cannot both casually live in memory
- OpenAI-compatible requests and MCP tool calls can hit the same GPU through different paths
- manual model start/stop scripts work until an agent retries, hangs, or swaps tasks mid-session
- loading every MCP tool schema into an agent prompt wastes context before the user asks for any tool
Local Model Gateway gives those moving parts one shared scheduler and a browser dashboard.
- Run several local agents through one OpenAI-compatible base URL without letting them fight over the same GPU.
- Point local LLM apps, scripts, and CLIs at one model endpoint while the gateway handles queueing and runtime swaps.
- Swap between different local model runtimes only when active work drains, so long requests are not interrupted mid-stream.
- Keep large MCP tool catalogs out of an agent prompt until a tool is actually searched, described, or called.
Local Model Gateway is:
- a local model gateway for OpenAI-compatible clients
- a shared scheduler for GPU-bound local model work
- a runtime residency manager for loading, unloading, and swapping local models
- a browser dashboard for seeing loaded models, active work, queued requests, recent completed work, prefill, and transfer telemetry
- an MCP runtime surface for agents that need status, model discovery, setup snippets, and cancellation tools
It is not:
- a replacement for llama.cpp, Ollama, LM Studio, or vLLM
- a model downloader or model marketplace
- a hosted inference service
- a framework-specific integration layer
The intent is to sit in front of local runtimes and make a workstation usable by multiple local agents and LLM apps at the same time.
| Capability | What it does |
|---|---|
| OpenAI-compatible API | Drop-in /v1/chat/completions, /v1/responses, and /v1/models routes for local agents. |
| MCP endpoint | Runtime control and setup tools over Streamable HTTP at /mcp. |
| Durable GPU queue | SQLite-backed priority/FIFO work admission across URL and MCP entrypoints. |
| Runtime residency | Starts, health-checks, unloads, and swaps managed local runtimes on demand. |
| Stop command cancellation | Exact user commands such as "stop" or "cancel generation" cancel the matching in-flight OpenAI work instead of starting another GPU request. |
| Browser dashboard | Read-only /dashboard view for loaded models, load progress, prefill, bandwidth, active work, queued requests, and recent completed work. |
| Launch adapters | macOS launchd and generic shell helpers now, with systemd/Docker planned. |
| Lazy MCP broker | Keeps downstream MCP catalogs out of the prompt until a tool is actually searched or described. |
| Discovery manifest | /.well-known/local-model-gateway.json for clients that want model, timeout, and endpoint hints. |
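The stop-command row in the table above describes exact-match cancellation. A minimal sketch of what "exact" could mean in practice follows. The phrase set contains only the two examples named in the table, and the normalization (trimming, lowercasing, collapsing whitespace) is an assumption for illustration, not the gateway's documented behavior.

```python
# Hypothetical phrase list: only "stop" and "cancel generation" appear in the table above.
STOP_PHRASES = {"stop", "cancel generation"}


def is_stop_command(message: str) -> bool:
    """Return True only when the whole message is a stop phrase.

    Normalization here (strip, lowercase, collapse whitespace) is an assumption.
    """
    normalized = " ".join(message.strip().lower().split())
    return normalized in STOP_PHRASES
```

Matching the whole message, rather than searching for a substring, keeps an ordinary prompt that merely mentions the word "stop" from being treated as a cancellation.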
flowchart LR
A["Agent or app"] -->|"OpenAI-compatible HTTP"| G["Local Model Gateway"]
A -->|"MCP Streamable HTTP"| G
G --> Q["SQLite gpu_work_items queue"]
Q --> C["GpuCoordinator"]
C --> R1["Managed runtime: qwen3-32b"]
C --> R2["Managed runtime: minimax-m2.7"]
C --> R3["Runtime adapter: llama-cli one-shot"]
G --> B["Lazy MCP broker"]
B --> D["Downstream MCP servers"]
The important design constraint: no managed local GPU request should bypass the
coordinator. OpenAI URL requests, MCP submit_job requests, and one-shot
llama-cli runtime adapter work all queue through the same admission policy.
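The single-admission-path idea can be sketched with Python's built-in `sqlite3` module. Only the table name `gpu_work_items` comes from the diagram above; the column names, the `source` labels, and the `lora-job` model name are assumptions made for illustration, and the real schema may differ.

```python
import sqlite3


def open_queue() -> sqlite3.Connection:
    # In-memory database for illustration; column names are assumptions.
    db = sqlite3.connect(":memory:")
    db.execute(
        """CREATE TABLE gpu_work_items (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               source TEXT NOT NULL,
               model TEXT NOT NULL,
               priority INTEGER NOT NULL DEFAULT 0,
               state TEXT NOT NULL DEFAULT 'queued'
           )"""
    )
    return db


def submit(db: sqlite3.Connection, source: str, model: str, priority: int = 0) -> int:
    """Every entrypoint (URL, MCP, one-shot adapter) enqueues through this one function."""
    cur = db.execute(
        "INSERT INTO gpu_work_items (source, model, priority) VALUES (?, ?, ?)",
        (source, model, priority),
    )
    db.commit()
    return cur.lastrowid


def claim_next(db: sqlite3.Connection):
    """Claim the highest-priority queued item; ties resolve in insertion order."""
    row = db.execute(
        "SELECT id, source, model FROM gpu_work_items "
        "WHERE state = 'queued' ORDER BY priority DESC, id ASC LIMIT 1"
    ).fetchone()
    if row is not None:
        db.execute("UPDATE gpu_work_items SET state = 'active' WHERE id = ?", (row[0],))
        db.commit()
    return row
```

Because all three kinds of work land in the same table, the ordering rule is applied once, regardless of which entrypoint the request arrived through.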
Current status:
- macOS is the best-supported workstation path today.
- Source install is the recommended install path until the CLI is published to npm.
- Fresh configs are safe by default: heavyweight runtime presets are generated disabled until you wire them to working local service scripts.
- Linux systemd and Docker examples exist as stubs, but they are not the primary tested path yet.
Source install:
git clone https://github.com/vstep1/local-model-gateway.git
cd local-model-gateway
npm install
npm run build

Create a local config and run the first diagnostic:
npx local-model-gateway init
npx local-model-gateway doctor
npx local-model-gateway doctor --json
npx local-model-gateway doctor --fix-plan

Agent-led installs should read AGENTS.md and the
agent install runbook before changing ports, services,
or runtime config. doctor --json is the machine-readable preflight surface;
doctor --fix-plan prints read-only remediation steps and never mutates the
machine.
Start the gateway:
npx local-model-gateway start

Default endpoints:
| Endpoint | URL |
|---|---|
| OpenAI-compatible API | http://127.0.0.1:8787/v1 |
| MCP Streamable HTTP | http://127.0.0.1:8787/mcp |
| Status | http://127.0.0.1:8787/status |
| Dashboard | http://127.0.0.1:8787/dashboard |
| Discovery manifest | http://127.0.0.1:8787/.well-known/local-model-gateway.json |
Smoke check:
curl http://127.0.0.1:8787/health
curl http://127.0.0.1:8787/status
open http://127.0.0.1:8787/dashboard
curl http://127.0.0.1:8787/v1/models
curl http://127.0.0.1:8787/.well-known/local-model-gateway.json

Fresh configs do not enable heavyweight runtimes automatically. qwen3-32b and
minimax-m2.7 presets are generated disabled until you point them at local
runtime service scripts that work on your machine.
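For scripted checks, the same smoke test can be approximated with the Python standard library. This sketch assumes the default address listed in the endpoint table. It reports HTTP status codes only and makes no assumption about the shape of any response body.

```python
import urllib.error
import urllib.request

BASE = "http://127.0.0.1:8787"  # default address from the endpoint table
PATHS = ["/health", "/status", "/v1/models", "/.well-known/local-model-gateway.json"]


def smoke_urls(base: str = BASE) -> list[str]:
    """Build the list of URLs covered by the curl smoke check."""
    return [base.rstrip("/") + path for path in PATHS]


def probe(url: str, timeout: float = 3.0) -> str:
    """Return the HTTP status code as text, or a short error description."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return str(resp.status)
    except urllib.error.HTTPError as exc:  # server answered with an error status
        return str(exc.code)
    except (urllib.error.URLError, OSError) as exc:  # nothing listening, timeout, etc.
        return f"unreachable ({exc})"
```

Calling `probe(url)` for each entry of `smoke_urls()` gives a quick view of which endpoints respond while the gateway is running.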
Point any OpenAI-compatible client at:
base_url: http://127.0.0.1:8787/v1
api_key: local-model-gateway
model: one of GET /v1/models
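The three settings above map onto a plain HTTP request. The sketch below builds one with the Python standard library. Sending the key as a Bearer token follows the usual OpenAI client convention; whether the gateway inspects it is not stated here. The 600-second timeout is likewise an assumption, chosen because a request may wait in the queue or behind a model swap.

```python
import json
import urllib.request


def build_chat_request(
    prompt: str,
    model: str = "qwen3-32b",
    base_url: str = "http://127.0.0.1:8787/v1",
    api_key: str = "local-model-gateway",
) -> urllib.request.Request:
    """Build a POST to /chat/completions without sending it."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # conventional; assumed, not confirmed
        },
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 600.0) -> dict:
    """Send the request and decode the JSON response (timeout is an assumption)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

With the gateway running and the model enabled, `send(build_chat_request("Say hello from the local gateway."))` would return the decoded completion payload.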
Example:
curl http://127.0.0.1:8787/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "qwen3-32b",
"messages": [
{ "role": "user", "content": "Say hello from the local gateway." }
]
}'

Connect MCP clients to:
url: http://127.0.0.1:8787/mcp
transport: streamable-http
Useful setup and runtime tools:
| Tool | Purpose |
|---|---|
| gateway_status | Read queue, runtime, and persisted config status. |
| list_runtime_models | List managed local runtime aliases and context metadata. |
| recommend_local_profile | Pick a local runtime for a target prompt budget. |
| generate_client_config | Emit generic OpenAI, generic MCP, or Hermes snippets. |
| validate_client_config | Check that a client points at loopback protocol endpoints. |
| runtime_status | Inspect active work, loaded model, queued work, and errors. |
| cancel_runtime_request | Cancel queued or active runtime work by work item id. |
| force_unload_runtime | Admin escape hatch for a stuck runtime. |
Managed runtimes are local OpenAI-compatible servers controlled by service
scripts. A runtime definition includes a base URL, health URL, start/stop
commands, timeouts, context metadata, and maxConcurrency.
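To make the shape of a runtime definition concrete, here is a sketch as a Python dataclass. Every field name, default value, port, and script path is hypothetical. Only the list of ingredients (base URL, health URL, start/stop commands, timeouts, context metadata, and a concurrency limit) comes from the paragraph above, and the real config file may name and structure them differently.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class RuntimeDefinition:
    # All field names below are illustrative, not the gateway's config schema.
    alias: str
    base_url: str
    health_url: str
    start_command: Tuple[str, ...]
    stop_command: Tuple[str, ...]
    start_timeout_s: float = 120.0
    request_timeout_s: float = 600.0
    context_tokens: Optional[int] = None
    max_concurrency: int = 1
    enabled: bool = False  # mirrors presets being generated disabled

    def __post_init__(self) -> None:
        if self.max_concurrency < 1:
            raise ValueError("max_concurrency must be at least 1")


# Hypothetical example values: the port and script path are placeholders.
example = RuntimeDefinition(
    alias="qwen3-32b",
    base_url="http://127.0.0.1:9001/v1",
    health_url="http://127.0.0.1:9001/health",
    start_command=("./scripts/qwen3-32b.sh", "start"),
    stop_command=("./scripts/qwen3-32b.sh", "stop"),
    context_tokens=32768,
)
```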
Queue rules:
- higher priority wins
- same priority is FIFO
- active work is never interrupted unless explicitly cancelled
- same-model work can share a loaded runtime up to maxConcurrency
- different-model work waits until the active model drains, then swaps
- one-shot LoRA runtime adapter work unloads resident service runtimes before execution
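The residency rules above can be read as a small decision function. This is a sketch of that logic, not the coordinator's actual code. The function name and return labels are hypothetical, the "load" case for an empty workstation is an assumption, and the one-shot LoRA rule is left out for brevity.

```python
from typing import Optional


def decide(
    requested_model: str,
    loaded_model: Optional[str],
    active_count: int,
    max_concurrency: int,
) -> str:
    """Return one of "load", "run", "wait", or "swap" for the next queued item."""
    if loaded_model is None:
        return "load"  # assumption: nothing resident, so start the runtime
    if requested_model == loaded_model:
        # Same-model work shares the loaded runtime up to its concurrency limit.
        return "run" if active_count < max_concurrency else "wait"
    # Different-model work never interrupts active work; it swaps only after a drain.
    return "swap" if active_count == 0 else "wait"
```

For example, a request for minimax-m2.7 while qwen3-32b still has one active request yields "wait"; once that request finishes, the same call yields "swap".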
Built-in presets:
| Alias | Default role | Default state |
|---|---|---|
| qwen3-32b | Clean local Qwen3 32B OpenAI-compatible runtime. | Disabled until configured. |
| minimax-m2.7 | Large local MiniMax M2.7 runtime for long-context agent work. | Disabled until configured. |
See Runtime Config and Coordinator Architecture.
Runtime adapter examples:
The broker is separate from GPU scheduling. It solves a different local-agent problem: MCP tool catalogs can be huge, and agents should not have to load every downstream schema before they need a tool.
The broker starts without opening every downstream MCP. It serves search from a local catalog snapshot when available and connects downstream servers only for targeted refreshes or selected tool calls.
The broker exposes a tiny surface:
- search_tools
- describe_tool
- call_tool
- list_servers
search_tools returns compact matches only. describe_tool reveals the exact
schema for one selected tool. call_tool is policy-gated. No broker response
contains a risk level field.
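The split between compact search results and on-demand schemas can be illustrated with a toy in-memory catalog. The two tools, their server names, and the substring matching are hypothetical stand-ins. The real broker works from a catalog snapshot and downstream MCP servers, and its result fields may differ.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical catalog snapshot keyed by (server, tool name).
CATALOG: Dict[Tuple[str, str], Dict[str, Any]] = {
    ("files", "read_file"): {
        "summary": "Read a text file from disk.",
        "schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
    ("web", "fetch_page"): {
        "summary": "Fetch a web page by URL.",
        "schema": {
            "type": "object",
            "properties": {"url": {"type": "string"}},
            "required": ["url"],
        },
    },
}


def search_tools(query: str, limit: int = 5) -> List[Dict[str, str]]:
    """Return compact matches only: no input schema is included."""
    q = query.lower()
    hits = []
    for (server, name), entry in CATALOG.items():
        if q in name.lower() or q in entry["summary"].lower():
            hits.append({"server": server, "name": name, "summary": entry["summary"]})
    return hits[:limit]


def describe_tool(server: str, name: str) -> Dict[str, Any]:
    """Reveal the full schema for exactly one selected tool."""
    return CATALOG[(server, name)]["schema"]
```

An agent pays the context cost of a full schema only for the one tool it selects, which is the point of keeping the search results compact.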
See Broker Policy.
npx local-model-gateway init
npx local-model-gateway doctor
npx local-model-gateway doctor --json
npx local-model-gateway doctor --fix-plan
npx local-model-gateway start
npx local-model-gateway service install
npx local-model-gateway recipes list
npx local-model-gateway recipes show generic-openai
npx local-model-gateway recipes show generic-mcp
npx local-model-gateway recipes show hermes

init writes local-model-gateway.config.yaml. doctor is the first debugging
surface and reports actionable fixes for missing dependencies, occupied ports,
bad service scripts, sibling gateways, direct llama.cpp bypasses, gateway
endpoints, and invalid config.
| Package | Contents |
|---|---|
| @local-model-gateway/core | SQLite queue, GPU coordinator, config types, runtime state machine, model registry. |
| @local-model-gateway/gateway | OpenAI-compatible routes, MCP runtime tools, discovery manifest, server bootstrap. |
| @local-model-gateway/runtime-adapters | launchd and generic shell adapter helpers. |
| @local-model-gateway/lazy-mcp-broker | Read-first lazy MCP broker for downstream MCP catalogs. |
| local-model-gateway | CLI for init, doctor, recipes, service install, and start. |
- Binds to 127.0.0.1 by default.
- Refuses non-loopback binds unless auth is configured.
- Rejects unmanaged loopback OpenAI upstreams by default, so local GPU work cannot silently bypass the coordinator.
- Redacts likely secrets and local user paths from status/log surfaces.
- Ignores runtime state, local config, model files, logs, generated dist, and SQLite databases.
- Runs a public safety scan in CI before packaging.
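The bind rule in the list above can be sketched with Python's `ipaddress` module. The function names are hypothetical, and treating the hostname `localhost` as loopback is an assumption. The gateway's own check may be stricter or resolve hostnames differently.

```python
import ipaddress


def is_loopback_host(host: str) -> bool:
    """True for loopback IP literals such as 127.0.0.1 or ::1."""
    if host == "localhost":  # assumption: treat the conventional name as loopback
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal; fail closed rather than resolving DNS here.
        return False


def check_bind(host: str, auth_configured: bool) -> None:
    """Raise unless the bind is loopback-only or auth has been configured."""
    if not is_loopback_host(host) and not auth_configured:
        raise PermissionError(f"refusing to bind {host} without auth configured")
```

Failing closed on unrecognized hostnames means a typo in the bind address results in a refusal rather than an accidental public listener.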
npm run typecheck
npm test
npm run build
npm run scan:public
npm run pack:dry-run
npm run ci

CI currently runs install, typecheck, tests, build, audit, public scan, and npm package dry run.
Near-term:
- dashboard controls for cancel, unload, and start runtime actions
- richer doctor checks for actual runtime service readiness
- model/runtime templates for common local workstation setups
- npm publish workflow for the CLI package
- stronger demo coverage for multi-agent local workstation workflows
Later:
- runtime auto-detection recipes for common local model servers
- Linux systemd and Docker runtime adapters beyond the current stubs
- stronger broker policies for metered/write tools with confirmations
- portable service installer templates beyond launchd
MIT. See LICENSE.