Local Model Gateway

Local Model Gateway is a lightweight management layer for local AI workstations.

One endpoint, scheduler, and dashboard for coordinating model access across local agents and LLM apps.

It exposes a stable OpenAI-compatible API and a Streamable HTTP MCP endpoint, then puts all GPU-bound work behind one SQLite-backed scheduler. Agents can ask for Qwen, MiniMax, a LoRA job, or another local runtime through the same gateway without racing each other, double-loading models, or bypassing load/unload decisions.

Demo

Watch a 58-second demo of multiple local clients sharing one endpoint while the gateway queues requests, swaps model residency, and keeps active work uninterrupted:

local-model-gateway-cli-demo.mp4

Why This Exists

Local models are easy to start and hard to coordinate.

Most agent stacks assume "one base URL equals one model." On a real workstation, that breaks down quickly:

a 32B model and a 230B MoE model cannot both casually live in memory
OpenAI-compatible requests and MCP tool calls can hit the same GPU through different paths
manual model start/stop scripts work until an agent retries, hangs, or swaps tasks mid-session
loading every MCP tool schema into an agent prompt wastes context before the user asks for any tool

Local Model Gateway gives those moving parts one shared scheduler and a browser dashboard.

Use Cases

Run several local agents through one OpenAI-compatible base URL without letting them fight over the same GPU.
Point local LLM apps, scripts, and CLIs at one model endpoint while the gateway handles queueing and runtime swaps.
Swap between different local model runtimes only when active work drains, so long requests are not interrupted mid-stream.
Keep large MCP tool catalogs out of an agent prompt until a tool is actually searched, described, or called.

What It Is

Local Model Gateway is:

a local model gateway for OpenAI-compatible clients
a shared scheduler for GPU-bound local model work
a runtime residency manager for loading, unloading, and swapping local models
a browser dashboard for seeing loaded models, active work, queued requests, recent completed work, prefill, and transfer telemetry
an MCP runtime surface for agents that need status, model discovery, setup snippets, and cancellation tools

It is not:

a replacement for llama.cpp, Ollama, LM Studio, or vLLM
a model downloader or model marketplace
a hosted inference service
a framework-specific integration layer

The intent is to sit in front of local runtimes and make a workstation usable by multiple local agents and LLM apps at the same time.

What You Get

Capability	What it does
OpenAI-compatible API	Drop-in `/v1/chat/completions`, `/v1/responses`, and `/v1/models` routes for local agents.
MCP endpoint	Runtime control and setup tools over Streamable HTTP at `/mcp`.
Durable GPU queue	SQLite-backed priority/FIFO work admission across URL and MCP entrypoints.
Runtime residency	Starts, health-checks, unloads, and swaps managed local runtimes on demand.
Stop command cancellation	Exact user commands like `stop` or `cancel generation` cancel matching in-flight OpenAI work instead of starting another GPU request.
Browser dashboard	Read-only `/dashboard` view for loaded models, load progress, prefill, bandwidth, active work, queued requests, and recent completed work.
Launch adapters	macOS launchd and generic shell helpers now, with systemd/Docker planned.
Lazy MCP broker	Keeps downstream MCP catalogs out of the prompt until a tool is actually searched or described.
Discovery manifest	`/.well-known/local-model-gateway.json` for clients that want model, timeout, and endpoint hints.
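
The stop command cancellation row above can be read as an exact-match check on the incoming user message. The sketch below is one way to picture it, not the gateway's actual matcher: `isStopCommand` is a hypothetical name, the command list holds only the two examples named in this README, and trimming plus lowercasing before comparison is an assumption.

```typescript
// Only these two commands are named in the README; the real list may differ.
const STOP_COMMANDS: string[] = ["stop", "cancel generation"];

// Returns true only when the whole message is a stop command.
// Assumption: surrounding whitespace and letter case are ignored.
function isStopCommand(userMessage: string): boolean {
  const normalized = userMessage.trim().toLowerCase();
  return STOP_COMMANDS.indexOf(normalized) !== -1;
}
```

Under these assumptions, `"stop"` matches but `"please stop the server"` does not, so ordinary prompts that merely mention stopping still reach the model.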

Architecture

flowchart LR
  A["Agent or app"] -->|"OpenAI-compatible HTTP"| G["Local Model Gateway"]
  A -->|"MCP Streamable HTTP"| G
  G --> Q["SQLite gpu_work_items queue"]
  Q --> C["GpuCoordinator"]
  C --> R1["Managed runtime: qwen3-32b"]
  C --> R2["Managed runtime: minimax-m2.7"]
  C --> R3["Runtime adapter: llama-cli one-shot"]
  G --> B["Lazy MCP broker"]
  B --> D["Downstream MCP servers"]

The important design constraint: no managed local GPU request should bypass the coordinator. OpenAI URL requests, MCP submit_job requests, and one-shot llama-cli runtime adapter work all queue through the same admission policy.

Quick Start

Current status:

macOS is the best-supported workstation path today.
Source install is the recommended install path until the CLI is published to npm.
Fresh configs are safe by default: heavyweight runtime presets are generated disabled until you wire them to working local service scripts.
Linux systemd and Docker examples exist as stubs, but they are not the primary tested path yet.

Source install:

git clone https://github.com/vstep1/local-model-gateway.git
cd local-model-gateway
npm install
npm run build

Create a local config and run the first diagnostic:

npx local-model-gateway init
npx local-model-gateway doctor
npx local-model-gateway doctor --json
npx local-model-gateway doctor --fix-plan

Agent-led installs should read AGENTS.md and the agent install runbook before changing ports, services, or runtime config. doctor --json is the machine-readable preflight surface; doctor --fix-plan prints read-only remediation steps and never mutates the machine.

Start the gateway:

npx local-model-gateway start

Default endpoints:

Endpoint	URL
OpenAI-compatible API	`http://127.0.0.1:8787/v1`
MCP Streamable HTTP	`http://127.0.0.1:8787/mcp`
Status	`http://127.0.0.1:8787/status`
Dashboard	`http://127.0.0.1:8787/dashboard`
Discovery manifest	`http://127.0.0.1:8787/.well-known/local-model-gateway.json`

Smoke check:

curl http://127.0.0.1:8787/health
curl http://127.0.0.1:8787/status
open http://127.0.0.1:8787/dashboard
curl http://127.0.0.1:8787/v1/models
curl http://127.0.0.1:8787/.well-known/local-model-gateway.json

Fresh configs do not enable heavyweight runtimes automatically. qwen3-32b and minimax-m2.7 presets are generated disabled until you point them at local runtime service scripts that work on your machine.

Protocols

OpenAI-Compatible

Point any OpenAI-compatible client at:

base_url: http://127.0.0.1:8787/v1
api_key: local-model-gateway
model: one of GET /v1/models

Example:

curl http://127.0.0.1:8787/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen3-32b",
    "messages": [
      { "role": "user", "content": "Say hello from the local gateway." }
    ]
  }'
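
The same request can be assembled from TypeScript. This is a minimal sketch, assuming the default base URL and the placeholder API key shown above; `buildChatRequest` is a hypothetical helper, not part of the project. It sends the key as a Bearer token because that is what OpenAI-compatible clients normally do; the curl example above shows it is not required on loopback.

```typescript
// Hypothetical helper; only the URL, API key, and model alias come from the README.
const BASE_URL = "http://127.0.0.1:8787/v1";
const API_KEY = "local-model-gateway";

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ChatRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Build the HTTP request without sending it, so it can be inspected or logged.
function buildChatRequest(model: string, messages: ChatMessage[]): ChatRequest {
  return {
    url: `${BASE_URL}/chat/completions`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${API_KEY}`,
      },
      body: JSON.stringify({ model, messages }),
    },
  };
}
```

On Node 18 or later, the result can be passed straight to `fetch(url, init)`. The model should be one of the aliases returned by GET /v1/models.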

MCP

Connect MCP clients to:

url: http://127.0.0.1:8787/mcp
transport: streamable-http

Useful setup and runtime tools:

Tool	Purpose
`gateway_status`	Read queue, runtime, and persisted config status.
`list_runtime_models`	List managed local runtime aliases and context metadata.
`recommend_local_profile`	Pick a local runtime for a target prompt budget.
`generate_client_config`	Emit generic OpenAI, generic MCP, or Hermes snippets.
`validate_client_config`	Check that a client points at loopback protocol endpoints.
`runtime_status`	Inspect active work, loaded model, queued work, and errors.
`cancel_runtime_request`	Cancel queued or active runtime work by work item id.
`force_unload_runtime`	Admin escape hatch for a stuck runtime.
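
Most clients will reach these tools through an MCP SDK, but the wire format is plain JSON-RPC 2.0 over HTTP POST. The sketch below builds a `tools/call` request for `gateway_status`. It assumes the tool takes no arguments (this README does not document its input schema), and `buildToolCall` and `mcpHeaders` are hypothetical helper names. A real client must complete the MCP `initialize` handshake before calling tools.

```typescript
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

// JSON-RPC body for calling one MCP tool by name.
function buildToolCall(
  id: number,
  toolName: string,
  args: Record<string, unknown> = {},
): ToolCallRequest {
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name: toolName, arguments: args } };
}

// Streamable HTTP clients accept both a JSON reply and an SSE stream.
// The session header is only sent if the server issued one during initialize.
function mcpHeaders(sessionId?: string): Record<string, string> {
  const result: Record<string, string> = {
    "Content-Type": "application/json",
    Accept: "application/json, text/event-stream",
  };
  if (sessionId !== undefined) result["Mcp-Session-Id"] = sessionId;
  return result;
}
```

The body from `buildToolCall(1, "gateway_status")` would be posted to `http://127.0.0.1:8787/mcp` with the headers from `mcpHeaders()`.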

Runtime Coordination

Managed runtimes are local OpenAI-compatible servers controlled by service scripts. A runtime definition includes a base URL, health URL, start/stop commands, timeouts, context metadata, and maxConcurrency.
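
As a rough picture of what a runtime definition carries, the sketch below models the fields listed above as a TypeScript type with a small sanity check. Apart from `maxConcurrency`, every field name, the port, the script paths, and the numeric values are illustrative assumptions; the authoritative schema is the one described in Runtime Config.

```typescript
// Illustrative shape only; field names other than maxConcurrency are assumptions.
interface RuntimeDefinition {
  alias: string;
  enabled: boolean;
  baseUrl: string;
  healthUrl: string;
  startCommand: string;
  stopCommand: string;
  startTimeoutMs: number;
  contextTokens: number;
  maxConcurrency: number;
}

// Return a list of human-readable problems; an empty list means the definition looks sane.
function checkRuntime(rt: RuntimeDefinition): string[] {
  const problems: string[] = [];
  const isHttpUrl = (value: string): boolean => /^https?:\/\/\S+$/.test(value);
  if (!isHttpUrl(rt.baseUrl)) problems.push("baseUrl must be an http(s) URL");
  if (!isHttpUrl(rt.healthUrl)) problems.push("healthUrl must be an http(s) URL");
  if (rt.enabled && (rt.startCommand.trim() === "" || rt.stopCommand.trim() === "")) {
    problems.push("enabled runtimes need start and stop commands");
  }
  if (!Number.isInteger(rt.maxConcurrency) || rt.maxConcurrency < 1) {
    problems.push("maxConcurrency must be a positive integer");
  }
  return problems;
}

const exampleRuntime: RuntimeDefinition = {
  alias: "qwen3-32b",
  enabled: false, // presets are generated disabled until configured
  baseUrl: "http://127.0.0.1:8080/v1", // hypothetical port
  healthUrl: "http://127.0.0.1:8080/health",
  startCommand: "./scripts/start-qwen.sh", // hypothetical script path
  stopCommand: "./scripts/stop-qwen.sh",
  startTimeoutMs: 120000, // hypothetical value
  contextTokens: 32768, // hypothetical value
  maxConcurrency: 1,
};
```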

Queue rules:

higher priority wins
same priority is FIFO
active work is never interrupted unless explicitly cancelled
same-model work can share a loaded runtime up to maxConcurrency
different-model work waits until the active model drains, then swaps
one-shot LoRA runtime adapter work unloads resident service runtimes before execution
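
The rules above can be sketched as a small admission function. This is a simplified model, not the GpuCoordinator itself: the type names are hypothetical, a larger `priority` number is assumed to mean higher priority, only the item at the head of the queue is considered, and the one-shot adapter rule is left out.

```typescript
interface WorkItem {
  id: number; // assumed to increase with enqueue order
  model: string;
  priority: number; // assumption: larger number means higher priority
}

interface RuntimeState {
  loadedModel: string | null;
  activeCount: number;
  maxConcurrency: number;
}

type Decision =
  | { action: "run"; item: WorkItem }
  | { action: "load-then-run"; item: WorkItem }
  | { action: "wait" };

// Higher priority first; ties are broken by enqueue order (FIFO).
function nextInLine(queue: WorkItem[]): WorkItem | undefined {
  return [...queue].sort((a, b) => b.priority - a.priority || a.id - b.id)[0];
}

function decide(queue: WorkItem[], rt: RuntimeState): Decision {
  const item = nextInLine(queue);
  if (item === undefined) return { action: "wait" };
  if (rt.loadedModel === item.model) {
    // Same-model work shares the loaded runtime up to maxConcurrency.
    return rt.activeCount < rt.maxConcurrency ? { action: "run", item } : { action: "wait" };
  }
  // Different model, or nothing loaded: load or swap only once active work has drained.
  return rt.activeCount === 0 ? { action: "load-then-run", item } : { action: "wait" };
}
```

Note that `decide` never returns an action that interrupts active work: while another model is busy, the only outcome is `wait`.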

Built-in presets:

Alias	Default role	Default state
`qwen3-32b`	Clean local Qwen3 32B OpenAI-compatible runtime.	Disabled until configured.
`minimax-m2.7`	Large local MiniMax M2.7 runtime for long-context agent work.	Disabled until configured.

See Runtime Config and Coordinator Architecture.

Lazy MCP Broker

The broker is separate from GPU scheduling. It solves a different local-agent problem: MCP tool catalogs can be huge, and agents should not have to load every downstream schema before they need a tool.

The broker starts without opening every downstream MCP. It serves search from a local catalog snapshot when available and connects downstream servers only for targeted refreshes or selected tool calls.

The broker exposes a tiny surface:

search_tools
describe_tool
call_tool
list_servers

search_tools returns compact matches only. describe_tool reveals the exact schema for one selected tool. call_tool is policy-gated. No broker response contains a risk level field.
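
The split between search_tools and describe_tool can be illustrated with an in-memory catalog snapshot. This is a sketch of the idea under stated assumptions, not the broker's code: `CatalogSnapshot`, the entry fields, and the substring matching are all hypothetical.

```typescript
interface ToolEntry {
  server: string;
  name: string;
  summary: string;
  inputSchema: Record<string, unknown>;
}

interface ToolMatch {
  server: string;
  name: string;
  summary: string;
}

class CatalogSnapshot {
  private readonly entries: ToolEntry[];

  constructor(entries: ToolEntry[]) {
    this.entries = entries;
  }

  // Compact matches only: no schemas, so results stay cheap to put in a prompt.
  searchTools(query: string, limit: number = 5): ToolMatch[] {
    const q = query.toLowerCase();
    return this.entries
      .filter((t) => `${t.name} ${t.summary}`.toLowerCase().indexOf(q) !== -1)
      .slice(0, limit)
      .map((t) => ({ server: t.server, name: t.name, summary: t.summary }));
  }

  // Full schema for exactly one selected tool.
  describeTool(server: string, toolName: string): ToolEntry | undefined {
    return this.entries.find((t) => t.server === server && t.name === toolName);
  }
}
```

An agent would search first, then describe only the tool it intends to call, so the prompt carries one schema rather than the whole downstream catalog.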

See Broker Policy.

CLI

npx local-model-gateway init
npx local-model-gateway doctor
npx local-model-gateway doctor --json
npx local-model-gateway doctor --fix-plan
npx local-model-gateway start
npx local-model-gateway service install
npx local-model-gateway recipes list
npx local-model-gateway recipes show generic-openai
npx local-model-gateway recipes show generic-mcp
npx local-model-gateway recipes show hermes

init writes local-model-gateway.config.yaml. doctor is the first debugging surface and reports actionable fixes for missing dependencies, occupied ports, bad service scripts, sibling gateways, direct llama.cpp bypasses, gateway endpoints, and invalid config.

Packages

Package	Contents
`@local-model-gateway/core`	SQLite queue, GPU coordinator, config types, runtime state machine, model registry.
`@local-model-gateway/gateway`	OpenAI-compatible routes, MCP runtime tools, discovery manifest, server bootstrap.
`@local-model-gateway/runtime-adapters`	launchd and generic shell adapter helpers.
`@local-model-gateway/lazy-mcp-broker`	Read-first lazy MCP broker for downstream MCP catalogs.
`local-model-gateway`	CLI for init, doctor, recipes, service install, and start.

Safety Defaults

Binds to 127.0.0.1 by default.
Refuses non-loopback binds unless auth is configured.
Rejects unmanaged loopback OpenAI upstreams by default, so local GPU work cannot silently bypass the coordinator.
Redacts likely secrets and local user paths from status/log surfaces.
Excludes runtime state, local config, model files, logs, generated dist, and SQLite databases from version control via .gitignore.
Runs a public safety scan in CI before packaging.
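
The first two defaults amount to a bind-time guard. A minimal sketch, assuming hypothetical function names and a simple notion of "loopback" (the gateway's real check may be stricter or broader):

```typescript
// Treats localhost, ::1, and 127.x.x.x as loopback.
// The pattern does not validate that each octet is at most 255.
function isLoopbackHost(host: string): boolean {
  const h = host.toLowerCase();
  return h === "localhost" || h === "::1" || /^127\.\d{1,3}\.\d{1,3}\.\d{1,3}$/.test(h);
}

// Allow loopback binds always; allow other binds only when auth is configured.
function checkBind(host: string, authConfigured: boolean): { ok: boolean; reason?: string } {
  if (isLoopbackHost(host) || authConfigured) return { ok: true };
  return { ok: false, reason: `refusing to bind ${host} without auth configured` };
}
```

With this guard, binding `0.0.0.0` fails unless auth is set up, while the default `127.0.0.1` works out of the box.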

Development

npm run typecheck
npm test
npm run build
npm run scan:public
npm run pack:dry-run
npm run ci

CI currently runs install, typecheck, tests, build, audit, public scan, and npm package dry run.

Roadmap

Near-term:

dashboard controls for cancel, unload, and start runtime actions
richer doctor checks for actual runtime service readiness
model/runtime templates for common local workstation setups
npm publish workflow for the CLI package
stronger demo coverage for multi-agent local workstation workflows

Later:

runtime auto-detection recipes for common local model servers
Linux systemd and Docker runtime adapters beyond the current stubs
stronger broker policies for metered/write tools with confirmations
portable service installer templates beyond launchd

License

MIT. See LICENSE.
