Commit ab71175

feat(cli): add multi-model inference CLI and codex URL fixes

- Add --model-alias flag to 'inference set' for multi-model config (e.g. --model-alias gpt=openai/gpt-4 --model-alias claude=anthropic/claude-sonnet-4-20250514)
- Add gateway_inference_set_multi() handler in run.rs
- Update inference get/print to display multi-model entries
- Import InferenceModelEntry proto type in CLI
- Fix build_backend_url to always strip /v1 prefix for codex paths
- Add /v1/codex/* inference pattern for openai_responses protocol
- Fix backend tests to use /v1 endpoint suffix

Signed-off-by: Lyle Hopkins <lyle@cosmicnetworks.com>

1 parent 1d605b7 commit ab71175

File tree

8 files changed, +322 -60 lines changed


architecture/inference-routing.md

Lines changed: 59 additions & 14 deletions
@@ -21,8 +21,9 @@ sequenceDiagram
     Agent->>Proxy: CONNECT inference.local:443
     Proxy->>Proxy: TLS terminate (MITM)
     Proxy->>Proxy: Parse HTTP, detect pattern
-    Proxy->>Router: proxy_with_candidates()
-    Router->>Router: Select route by protocol
+    Proxy->>Proxy: Extract model hint from body
+    Proxy->>Router: proxy_with_candidates(model_hint)
+    Router->>Router: Select route by alias or protocol
     Router->>Router: Rewrite auth + model
     Router->>Backend: HTTPS request
     Backend->>Router: Response headers + body stream
@@ -41,15 +42,16 @@ File: `crates/openshell-core/src/inference.rs`

 `InferenceProviderProfile` is the single source of truth for provider-specific inference knowledge: default endpoint, supported protocols, credential key lookup order, auth header style, and default headers.

-Three profiles are defined:
+Four profiles are defined:

 | Provider | Default Base URL | Protocols | Auth | Default Headers |
 |----------|-----------------|-----------|------|-----------------|
 | `openai` | `https://api.openai.com/v1` | `openai_chat_completions`, `openai_completions`, `openai_responses`, `model_discovery` | `Authorization: Bearer` | (none) |
 | `anthropic` | `https://api.anthropic.com/v1` | `anthropic_messages`, `model_discovery` | `x-api-key` | `anthropic-version: 2023-06-01` |
 | `nvidia` | `https://integrate.api.nvidia.com/v1` | `openai_chat_completions`, `openai_completions`, `openai_responses`, `model_discovery` | `Authorization: Bearer` | (none) |
+| `ollama` | `http://host.openshell.internal:11434` | `ollama_chat`, `ollama_model_discovery`, `openai_chat_completions`, `openai_completions`, `model_discovery` | `Authorization: Bearer` | (none) |

-Each profile also defines `credential_key_names` (e.g. `["OPENAI_API_KEY"]`) and `base_url_config_keys` (e.g. `["OPENAI_BASE_URL"]`) used by the gateway to resolve credentials and endpoint overrides from provider records.
+Each profile also defines `credential_key_names` (e.g. `["OPENAI_API_KEY"]`) and `base_url_config_keys` (e.g. `["OPENAI_BASE_URL"]`) used by the gateway to resolve credentials and endpoint overrides from provider records. The Ollama profile uses `OLLAMA_API_KEY` for credentials and checks both `OLLAMA_BASE_URL` and `OLLAMA_HOST` for endpoint overrides. Its default endpoint uses `host.openshell.internal` so sandboxes can reach an Ollama instance running on the gateway host.

 Unknown provider types return `None` from `profile_for()` and default to `Bearer` auth with no default headers via `auth_for_provider_type()`.
@@ -70,7 +72,19 @@ The gateway implements the `Inference` gRPC service defined in `proto/inference.
 5. Builds a managed route spec that stores only `provider_name` and `model_id`. The spec intentionally leaves `base_url`, `api_key`, and `protocols` empty -- these are resolved dynamically at bundle time from the provider record.
 6. Upserts the route with name `inference.local`. Version starts at 1 and increments monotonically on each update.

-`GetClusterInference` returns `provider_name`, `model_id`, and `version` for the managed route. Returns `NOT_FOUND` if cluster inference is not configured.
+`GetClusterInference` returns `provider_name`, `model_id`, `version`, and any configured `models` entries for the managed route. Returns `NOT_FOUND` if cluster inference is not configured.
+
+### Multi-model routes
+
+`upsert_multi_model_route()` configures multiple provider/model pairs on a single route, each identified by a short alias:
+
+1. Validates that each `InferenceModelEntry` has a non-empty `alias`, `provider_name`, and `model_id`.
+2. Checks that aliases are unique (case-insensitive).
+3. Verifies each provider exists and is inference-capable.
+4. Optionally probes each endpoint (skipped with `--no-verify`).
+5. Stores the full `models` vector in the route config. The first entry's provider/model are also written to the legacy single-model fields for backward compatibility.
+
+At bundle time, each `InferenceModelEntry` is resolved into a separate `ResolvedRoute` whose `name` is set to the alias. The router's alias-first selection (see Route Selection) then matches the agent's `model` field against these names.

 ### Bundle delivery
@@ -92,11 +106,15 @@ File: `proto/inference.proto`

 Key messages:

-- `SetClusterInferenceRequest` -- `provider_name` + `model_id` + optional `no_verify` override, with verification enabled by default
-- `SetClusterInferenceResponse` -- `provider_name` + `model_id` + `version`
+- `InferenceModelEntry` -- `alias` + `provider_name` + `model_id` (a single alias-to-provider mapping)
+- `SetClusterInferenceRequest` -- `provider_name` + `model_id` + optional `no_verify` override + `repeated InferenceModelEntry models`, with verification enabled by default
+- `SetClusterInferenceResponse` -- `provider_name` + `model_id` + `version` + `repeated InferenceModelEntry models`
+- `GetClusterInferenceResponse` -- `provider_name` + `model_id` + `version` + `repeated InferenceModelEntry models`
 - `GetInferenceBundleResponse` -- `repeated ResolvedRoute routes` + `revision` + `generated_at_ms`
 - `ResolvedRoute` -- `name`, `base_url`, `protocols`, `api_key`, `model_id`, `provider_type`

+When `models` is non-empty in a set request, the gateway uses `upsert_multi_model_route()` and ignores the legacy `provider_name`/`model_id` fields. When `models` is empty, the legacy single-model path is used.
+
 ## Data Plane (Sandbox)

 Files:
@@ -117,7 +135,7 @@ When a `CONNECT inference.local:443` arrives:
 1. Proxy responds `200 Connection Established`.
 2. `handle_inference_interception()` TLS-terminates the client connection using the sandbox CA (MITM).
 3. Raw HTTP requests are parsed from the TLS tunnel using `try_parse_http_request()` (supports Content-Length and chunked transfer encoding).
-4. Each parsed request is passed to `route_inference_request()`.
+4. Each parsed request is passed to `route_inference_request()`. Before routing, the proxy extracts a `model_hint` from the JSON request body's `model` field (if present). This hint is passed to the router for alias-based route selection.
 5. The tunnel supports HTTP keep-alive: multiple requests can be processed sequentially.
 6. Buffer starts at 64 KiB (`INITIAL_INFERENCE_BUF`) and grows up to 10 MiB (`MAX_INFERENCE_BUF`). Requests exceeding the max get `413 Payload Too Large`.

@@ -133,10 +151,16 @@ Supported built-in patterns:
 | `POST` | `/v1/completions` | `openai_completions` | `completion` |
 | `POST` | `/v1/responses` | `openai_responses` | `responses` |
 | `POST` | `/v1/messages` | `anthropic_messages` | `messages` |
+| `POST` | `/v1/codex/*` | `openai_responses` | `codex_responses` |
 | `GET` | `/v1/models` | `model_discovery` | `models_list` |
 | `GET` | `/v1/models/*` | `model_discovery` | `models_get` |
+| `POST` | `/api/chat` | `ollama_chat` | `ollama_chat` |
+| `GET` | `/api/tags` | `ollama_model_discovery` | `ollama_tags` |
+| `POST` | `/api/show` | `ollama_model_discovery` | `ollama_show` |
+
+Query strings are stripped before matching. Path matching is exact for most patterns; `/v1/models/*` and `/v1/codex/*` match any sub-path (e.g. `/v1/models/gpt-4.1`, `/v1/codex/responses`). Absolute-form URIs (e.g. `https://inference.local/v1/chat/completions`) are normalized to path-only form by `normalize_inference_path()` before detection.

-Query strings are stripped before matching. Path matching is exact for most patterns; `/v1/models/*` matches any sub-path (e.g. `/v1/models/gpt-4.1`). Absolute-form URIs (e.g. `https://inference.local/v1/chat/completions`) are normalized to path-only form by `normalize_inference_path()` before detection.
+Ollama patterns use `/api/` paths (no `/v1/` prefix), matching Ollama's native API. This allows agents to use the Ollama client library directly against `inference.local`.

 If no pattern matches, the proxy returns `403 Forbidden` with `{"error": "connection not allowed by policy"}`.
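The matching rules in this hunk (query stripping, absolute-form normalization, exact vs. `/*` wildcard patterns) can be sketched in a few lines. This is an illustrative std-only sketch, not the proxy's real implementation; the function names mirror the doc but their signatures are assumptions.

```rust
// Sketch of the pattern-matching rules described above (assumed signatures).
fn normalize_inference_path(target: &str) -> String {
    // Absolute-form URI -> path-only form.
    let path = if let Some(rest) = target.strip_prefix("https://") {
        rest.find('/').map(|i| &rest[i..]).unwrap_or("/")
    } else if let Some(rest) = target.strip_prefix("http://") {
        rest.find('/').map(|i| &rest[i..]).unwrap_or("/")
    } else {
        target
    };
    // Strip the query string before matching.
    path.split('?').next().unwrap_or(path).to_string()
}

fn matches_pattern(path: &str, pattern: &str) -> bool {
    match pattern.strip_suffix("/*") {
        // Wildcard patterns like `/v1/codex/*` match any non-empty sub-path.
        Some(prefix) => path
            .strip_prefix(prefix)
            .map_or(false, |rest| rest.starts_with('/') && rest.len() > 1),
        // All other patterns match exactly.
        None => path == pattern,
    }
}

fn main() {
    assert_eq!(
        normalize_inference_path("https://inference.local/v1/chat/completions?stream=true"),
        "/v1/chat/completions"
    );
    assert!(matches_pattern("/v1/codex/responses", "/v1/codex/*"));
    assert!(matches_pattern("/v1/models/gpt-4.1", "/v1/models/*"));
    assert!(!matches_pattern("/v1/modelsfoo", "/v1/models/*"));
}
```

Note how the wildcard arm requires a `/`-separated, non-empty remainder, so `/v1/models` alone still needs its own exact-match pattern.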

@@ -161,7 +185,16 @@ Files:

 ### Route selection

-`proxy_with_candidates()` finds the first route whose `protocols` list contains the detected source protocol (normalized to lowercase). If no route matches, returns `RouterError::NoCompatibleRoute`.
+`select_route()` picks the best route from the candidate list using a two-phase strategy:
+
+1. **Alias match (preferred)**: If a `model_hint` is provided (extracted from the request body's `model` field), select the first candidate whose `name` equals the hint AND whose `protocols` list contains the detected source protocol.
+2. **Protocol fallback**: If no alias matches, fall back to the first candidate whose `protocols` list contains the source protocol.
+
+This enables multi-route configurations where the agent selects a backend by setting the `model` field to an alias name (e.g. `"model": "my-gpt"` routes to the aliased provider). If the `model` field is absent, not a known alias, or parsing fails, routing falls back to protocol-based selection.
+
+If no route matches either phase, returns `RouterError::NoCompatibleRoute`.
+
+`proxy_with_candidates()` and `proxy_with_candidates_streaming()` both accept an optional `model_hint: Option<&str>` parameter, passed through from the sandbox proxy.

 ### Request rewriting
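The two-phase selection this hunk describes can be sketched as a small pure function. The `Route` struct and `Option` return type here are illustrative assumptions (the real router returns `RouterError::NoCompatibleRoute` on failure and uses its own route types):

```rust
// Minimal sketch of alias-first, protocol-fallback route selection.
#[derive(Debug)]
struct Route {
    name: String,           // alias in multi-model mode
    protocols: Vec<String>, // lowercase protocol names
}

fn select_route<'a>(
    candidates: &'a [Route],
    source_protocol: &str,
    model_hint: Option<&str>,
) -> Option<&'a Route> {
    let proto = source_protocol.to_ascii_lowercase();
    let supports = |r: &Route| r.protocols.iter().any(|p| p == &proto);
    // Phase 1: alias match -- name equals the hint AND protocol is supported.
    if let Some(hint) = model_hint {
        if let Some(r) = candidates.iter().find(|r| r.name == hint && supports(r)) {
            return Some(r);
        }
    }
    // Phase 2: protocol fallback -- first candidate supporting the protocol.
    candidates.iter().find(|r| supports(r))
}

fn main() {
    let routes = vec![
        Route { name: "my-gpt".into(), protocols: vec!["openai_chat_completions".into()] },
        Route { name: "my-claude".into(), protocols: vec!["anthropic_messages".into()] },
    ];
    // Alias match wins when the model hint names a route.
    assert_eq!(select_route(&routes, "anthropic_messages", Some("my-claude")).unwrap().name, "my-claude");
    // An unknown hint falls back to protocol-based selection.
    assert_eq!(select_route(&routes, "openai_chat_completions", Some("gpt-4o")).unwrap().name, "my-gpt");
    // No compatible route at all.
    assert!(select_route(&routes, "ollama_chat", None).is_none());
}
```

The alias phase checks the protocol too, so a hint can never route a request to a backend that does not speak the detected protocol.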

@@ -171,7 +204,7 @@ Files:
 2. **Header stripping**: Removes `authorization`, `x-api-key`, `host`, and any header names that will be set from route defaults.
 3. **Default headers**: Applies route-level default headers (e.g. `anthropic-version: 2023-06-01`) unless the client already sent them.
 4. **Model rewrite**: Parses the request body as JSON and replaces the `model` field with the route's configured model. Non-JSON bodies are forwarded unchanged.
-5. **URL construction**: `build_backend_url()` appends the request path to the route endpoint. If the endpoint already ends with `/v1` and the request path starts with `/v1/`, the duplicate prefix is deduplicated.
+5. **URL construction**: `build_backend_url()` appends the request path to the route endpoint. If the request path is exactly `/v1` or starts with `/v1/`, the `/v1` prefix is always stripped before appending. This handles both `/v1`-suffixed endpoints (e.g. `api.openai.com/v1`) and non-versioned endpoints (e.g. `chatgpt.com/backend-api` for Codex) uniformly.

 ### Header sanitization
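The new always-strip-`/v1` rule from step 5 can be sketched as follows. This is a hedged sketch of the rule as described, not the actual `build_backend_url()` from the router crate:

```rust
// Sketch of the `/v1`-stripping URL construction described in step 5.
fn build_backend_url(endpoint: &str, request_path: &str) -> String {
    let base = endpoint.trim_end_matches('/');
    // Strip the `/v1` prefix unconditionally, keeping the leading slash
    // of whatever follows it.
    let tail = if request_path == "/v1" {
        ""
    } else if request_path.starts_with("/v1/") {
        &request_path["/v1".len()..]
    } else {
        request_path
    };
    format!("{base}{tail}")
}

fn main() {
    // `/v1`-suffixed endpoint: the stripped path re-joins under the endpoint's own `/v1`.
    assert_eq!(
        build_backend_url("https://api.openai.com/v1", "/v1/chat/completions"),
        "https://api.openai.com/v1/chat/completions"
    );
    // Non-versioned endpoint (Codex-style): no spurious `/v1` segment appears.
    assert_eq!(
        build_backend_url("https://chatgpt.com/backend-api", "/v1/codex/responses"),
        "https://chatgpt.com/backend-api/codex/responses"
    );
}
```

The design point of the fix: with unconditional stripping, whether `/v1` appears in the backend URL is decided solely by the configured endpoint, never by the incoming request path.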

@@ -297,12 +330,24 @@ The system route is stored as a separate `InferenceRoute` record in the gateway

 Cluster inference commands:

-- `openshell inference set --provider <name> --model <id>` -- configures user-facing cluster inference
+- `openshell inference set --provider <name> --model <id>` -- configures user-facing cluster inference (single model)
+- `openshell inference set --model-alias ALIAS=PROVIDER/MODEL [--model-alias ...]` -- configures multi-model cluster inference
 - `openshell inference set --system --provider <name> --model <id>` -- configures system inference
 - `openshell inference get` -- displays both user and system inference configuration
 - `openshell inference get --system` -- displays only the system inference configuration

-The `--provider` flag references a provider record name (not a provider type). The provider must already exist in the cluster and have a supported inference type (`openai`, `anthropic`, or `nvidia`).
+The `--provider` flag references a provider record name (not a provider type). The provider must already exist in the cluster and have a supported inference type (`openai`, `anthropic`, `nvidia`, or `ollama`).
+
+`--model-alias` can be repeated to configure multiple providers simultaneously. It conflicts with `--provider` and `--model` -- the two modes are mutually exclusive. Example:
+
+```bash
+openshell inference set \
+  --model-alias my-gpt=openai-dev/gpt-4o \
+  --model-alias my-claude=anthropic-dev/claude-sonnet-4-20250514 \
+  --model-alias my-llama=ollama-local/llama3
+```
+
+Agents select a backend by setting the `model` field in their inference request to the alias name (e.g. `"model": "my-gpt"`).

 Inference writes verify by default. `--no-verify` is the explicit opt-out for endpoints that are not up yet.
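The `ALIAS=PROVIDER/MODEL` syntax above splits on the first `=` and the first `/`. A hypothetical parser sketch (the CLI's actual parsing in `run.rs` may differ; the function name and error strings here are illustrative):

```rust
// Hypothetical sketch of parsing one --model-alias value.
fn parse_model_alias(spec: &str) -> Result<(String, String, String), String> {
    // ALIAS comes before the first `=`.
    let (alias, rest) = spec
        .split_once('=')
        .ok_or_else(|| format!("expected ALIAS=PROVIDER/MODEL, got `{spec}`"))?;
    // PROVIDER comes before the first `/`; the remainder is the model id.
    let (provider, model) = rest
        .split_once('/')
        .ok_or_else(|| format!("expected PROVIDER/MODEL after `=`, got `{rest}`"))?;
    if alias.is_empty() || provider.is_empty() || model.is_empty() {
        return Err(format!("alias, provider, and model must all be non-empty in `{spec}`"));
    }
    Ok((alias.to_string(), provider.to_string(), model.to_string()))
}

fn main() {
    assert_eq!(
        parse_model_alias("my-gpt=openai-dev/gpt-4o").unwrap(),
        ("my-gpt".into(), "openai-dev".into(), "gpt-4o".into())
    );
    // Missing `=` or missing `/` are both rejected.
    assert!(parse_model_alias("missing-provider").is_err());
    assert!(parse_model_alias("a=b").is_err());
}
```

Splitting on the *first* `/` keeps the provider name unambiguous while allowing any remaining characters in the model id.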

crates/openshell-cli/src/main.rs

Lines changed: 41 additions & 11 deletions
@@ -6,7 +6,7 @@
 use clap::{CommandFactory, Parser, Subcommand, ValueEnum, ValueHint};
 use clap_complete::engine::ArgValueCompleter;
 use clap_complete::env::CompleteEnv;
-use miette::Result;
+use miette::{Result, miette};
 use owo_colors::OwoColorize;
 use std::io::Write;

@@ -286,6 +286,7 @@ const GATEWAY_EXAMPLES: &str = "\x1b[1mALIAS\x1b[0m

 const INFERENCE_EXAMPLES: &str = "\x1b[1mEXAMPLES\x1b[0m
 $ openshell inference set --provider openai --model gpt-4
+$ openshell inference set --model-alias gpt=openai/gpt-4 --model-alias claude=anthropic/claude-sonnet-4-20250514
 $ openshell inference get
 $ openshell inference update --model gpt-4-turbo
 ";
@@ -918,15 +919,26 @@ enum GatewayCommands {
 #[derive(Subcommand, Debug)]
 enum InferenceCommands {
     /// Set gateway-level inference provider and model.
+    ///
+    /// Use --provider/--model for single-model mode, or --model-alias for
+    /// multi-model mode (multiple providers routed by alias).
     #[command(help_template = LEAF_HELP_TEMPLATE, next_help_heading = "FLAGS")]
     Set {
-        /// Provider name.
-        #[arg(long, add = ArgValueCompleter::new(completers::complete_provider_names))]
-        provider: String,
+        /// Provider name (single-model mode).
+        #[arg(long, required_unless_present = "model_alias", add = ArgValueCompleter::new(completers::complete_provider_names))]
+        provider: Option<String>,

-        /// Model identifier to force for generation calls.
-        #[arg(long)]
-        model: String,
+        /// Model identifier to force for generation calls (single-model mode).
+        #[arg(long, required_unless_present = "model_alias")]
+        model: Option<String>,
+
+        /// Add a model alias in the form ALIAS=PROVIDER/MODEL.
+        /// Can be repeated to configure multiple providers simultaneously.
+        /// Not supported with --system.
+        ///
+        /// Example: --model-alias my-gpt=openai-dev/gpt-4o --model-alias my-claude=anthropic-dev/claude-sonnet-4-20250514
+        #[arg(long, conflicts_with_all = ["provider", "model", "system"])]
+        model_alias: Vec<String>,

         /// Configure the system inference route instead of the user-facing
         /// route. System inference is used by platform functions (e.g. the
@@ -2024,14 +2036,32 @@ async fn main() -> Result<()> {
             InferenceCommands::Set {
                 provider,
                 model,
+                model_alias,
                 system,
                 no_verify,
             } => {
                 let route_name = if system { "sandbox-system" } else { "" };
-                run::gateway_inference_set(
-                    endpoint, &provider, &model, route_name, no_verify, &tls,
-                )
-                .await?;
+                if !model_alias.is_empty() {
+                    run::gateway_inference_set_multi(
+                        endpoint,
+                        &model_alias,
+                        route_name,
+                        no_verify,
+                        &tls,
+                    )
+                    .await?;
+                } else {
+                    let provider = provider.as_deref().ok_or_else(|| {
+                        miette!("--provider is required in single-model mode")
+                    })?;
+                    let model = model
+                        .as_deref()
+                        .ok_or_else(|| miette!("--model is required in single-model mode"))?;
+                    run::gateway_inference_set(
+                        endpoint, provider, model, route_name, no_verify, &tls,
+                    )
+                    .await?;
+                }
             }
             InferenceCommands::Update {
                 provider,
