Optional routing for Codex CLI on Windows. The default path is the normal Codex CLI. LiteLLM is an opt-in proxy used when you want Gemini API dispatch or a shared gateway to local Qwen.
The local Python wrapper cleans prompts, compresses noisy logs, estimates tokens, applies budgets, and selects one of these LiteLLM aliases:
codex-lightfor simple, low-cost and frequent taskscodex-defaultfor normal coding workcodex-longfor long-context reads, log review, and synthesiscodex-deepfor difficult debugging, security, and architecture decisionscodex-no-openaifor Gemini + local Qwen routing when OpenAI quota is low or exhaustedcodex-cheapandcodex-strongas backward-compatible aliasescodex-hf-cheapfor simple Hugging Face / open-model tasks whenHF_TOKENis setcodex-hf-fastfor larger Hugging Face / multi-provider tasks whenHF_TOKENis set
Gemini and local Qwen are configured through LiteLLM model groups when the proxy is active. Without the proxy, the wrapper keeps the standard Codex path and can still call local Qwen directly through Ollama for selected local tasks.
Provider API keys are never committed or written to a configuration file.
GEMINI_API_KEY is only needed for Gemini dispatch through LiteLLM. Local Qwen
only needs Ollama running with qwen2.5-coder:3b installed. The local proxy
master key is generated per session and written only to
%TEMP%\codex-litellm-proxy.key so local test scripts can authenticate; it is
removed when the proxy stops.
When OpenAI quota is low or exhausted, use the codex-no-openai alias. It routes
through Gemini first and local Qwen second, without OpenAI entries in the model
group:
codex --model codex-no-openaiFor one-shot wrapper calls, either force the provider:
python .\scripts\python\codex_cost_router.py run --dry-run `
--provider no-openai `
"Refactor this Python API without using OpenAI quota"or set a temporary session mode:
$env:CODEX_ROUTER_OPENAI_MODE = 'avoid'
python .\scripts\python\codex_cost_router.py run --dry-run `
"Refactor this Python API without using OpenAI quota"For a durable default, set avoid_openai: true in
codex-routing-policy.yaml.
Hugging Face can be used in two optional places.
First, Hugging Face Inference Providers can sit behind LiteLLM as another provider pool. The local config still includes two optional aliases:
codex-hf-cheap -> huggingface/groq/openai/gpt-oss-120b
codex-hf-fast -> huggingface/together/openai/gpt-oss-120bSet HF_TOKEN in the shell before starting the router. A fine-grained token
with Inference Providers permission is enough.
$env:HF_TOKEN = 'hf_...'
.\scripts\python\codex-cost-routing.cmdThe wrapper can prefer Hugging Face explicitly:
python .\scripts\python\codex_cost_router.py run --dry-run `
--provider huggingface `
"Benchmark this open model routing task"--provider auto routes Hugging Face or multi-provider prompts to the HF aliases
only when HF_TOKEN is present. Otherwise it keeps the OpenAI-backed aliases.
LiteLLM also uses HUGGINGFACE_API_KEY while resolving some Inference Provider
mappings. The local web session exports the submitted HF_TOKEN under both
names for the LiteLLM subprocess. If you start LiteLLM manually, set both names
to the same token:
$env:HF_TOKEN = 'hf_...'
$env:HUGGINGFACE_API_KEY = $env:HF_TOKENThe local LiteLLM config includes codex-qwen-local as a final fallback for
the main Codex aliases. It uses Ollama's OpenAI-compatible endpoint with the
Qwen2.5 Coder 3B model:
.\scripts\python\Start-CodexQwenOllama.ps1The script starts Ollama if needed and pulls:
qwen2.5-coder:3b
LiteLLM then reaches it through http://127.0.0.1:11434/v1. No provider API key
is required for this local fallback.
For the fastest local path, call Ollama directly instead of going through LiteLLM. This is the preferred path for quick code cleanup, small explanations, and local smoke checks:
.\scripts\python\Invoke-QwenLocal.ps1 "Nettoie ce code Python et explique le changement: def f(x): return x+1"Measure local throughput with the same OpenAI-compatible Ollama endpoint:
.\scripts\python\Measure-QwenLocalSpeed.ps1Use direct Ollama when raw local speed matters. Use LiteLLM when you need Codex-facing profiles, fallback order, provider routing, quotas, or a single OpenAI-compatible gateway across local and remote models.
Second, Hugging Face can be added as an optional Codex-facing layer. Running
enable now installs two managed profiles:
cost-routing -> Codex -> local LiteLLM proxy
cost-routing-hf -> Codex -> Hugging Face router
The Hugging Face profile uses the OpenAI-compatible endpoint at
https://router.huggingface.co/v1 with openai/gpt-oss-120b:fastest by
default. This is useful when you want a direct open-model path between Codex and
the normal LiteLLM workflow, without starting the local LiteLLM proxy.
.\scripts\python\Manage-CodexCostRouting.ps1 -CodexProvider HuggingFaceFor one-shot wrapper calls, use:
python .\scripts\python\codex_cost_router.py run --dry-run `
--codex-provider huggingface `
"Use the optional Hugging Face layer for this Codex task"Use cost-routing for the normal local proxy path and cost-routing-hf only
when you explicitly want Hugging Face between Codex and the rest of the routing
setup.
The stable path uses LiteLLM's native router on 127.0.0.1:4000. An optional
front router can be started on 127.0.0.1:4001 when you want experimental
risk-adjusted dispatching:
powershell -NoProfile -ExecutionPolicy Bypass -File `
$env:USERPROFILE\.codex\litellm-proxy\start-litellm-proxy.ps1 `
-EnableRiskRouterThe custom router first applies hard limits, then computes a soft risk score on the remaining candidates:
soft_risk_i =
wc * norm(response_cost_i)
+ wttft * norm(ttft_i)
+ wlat * norm(total_latency_i)
+ wtps * norm(-tokens_per_second_i)
+ we * error_rate_i
+ wtok * token_pressure_i
+ wq * norm(queue_depth_i)
Pr(i) = softmax(-soft_risk_i / tau)
Hard limits keep unavailable deployments out of traffic before scoring. Gemini
429 free-tier quota responses are cooled down for five minutes, 401/403
auth failures for one hour, and model 404 failures for thirty minutes. Live
in-flight request counts also act as hard limits so a busy local Qwen worker is
not overloaded.
The soft score uses EWMA metrics for TTFT, total latency, tokens/second, response cost, error rate, token pressure, and queue depth. Streaming requests update real Time To First Token and generation throughput; non-streaming requests approximate TTFT as total latency because the buffered upstream response does not expose the first generated token.
Codex-facing aliases can retry remaining candidates on retryable provider
failures. Direct probes such as gemini-flash-direct remain strict so API-key,
model-name, and quota problems are visible instead of hidden by fallback.
Useful diagnostics:
powershell -NoProfile -ExecutionPolicy Bypass -File `
$env:USERPROFILE\.codex\litellm-proxy\measure-risk-adjusted-dispatch.ps1 `
-Iterations 5
powershell -NoProfile -ExecutionPolicy Bypass -File `
$env:USERPROFILE\.codex\litellm-proxy\measure-risk-adjusted-streaming.ps1 `
-Iterations 3
Invoke-RestMethod http://127.0.0.1:4001/dispatch/metrics
Invoke-RestMethod http://127.0.0.1:4001/dispatch/state
Invoke-RestMethod -Method Post http://127.0.0.1:4001/dispatch/resetKey metrics:
ewma_ttft_ms: streaming Time To First Token.ewma_total_latency_ms: full request duration.ewma_tokens_per_second: generation speed after TTFT.ewma_queue_depthandin_flight: current load pressure.hard_limited_remaining_sandhard_limit_reason: temporary exclusions.soft_score_weights: active scoring weights.
The Python wrapper also includes an experimental lightweight Markov adaptive
router. It is disabled by default and starts in shadow mode when enabled, so
the default routing behavior is not changed until shadow_mode: false is set.
The Markov router scores each Codex-facing fallback provider from recent
history. It estimates provider state as healthy, warming, overloaded,
failing, or cooldown using TTFT, total latency, token pressure, estimated
cost, quality score, error rate, and command failures. It only changes the
fallback order when the expected improvement is above min_confidence_delta.
Enable shadow-mode observations in codex-routing-policy.yaml:
adaptive_router:
enabled: true
shadow_mode: true
min_confidence_delta: 0.15
cost_guard_enabled: true
max_cost_multiplier: 2.0
critical_risk_threshold: 0.65After enough shadow-mode data, live switching can be tested by setting:
adaptive_router:
enabled: true
shadow_mode: falseThe cost guard blocks a non-critical switch to a provider that is more than
max_cost_multiplier times more expensive than the current provider. It still
allows the switch when the baseline provider risk is critical, because avoiding
an imminent failure can be more important than the marginal request cost.
Each run logs the adaptive decision under adaptive_router in:
%USERPROFILE%\.codex\logs\cost_router.jsonl
Important fields:
baseline_provider: provider selected by the normal policy.suggested_provider: provider suggested by the Markov model.would_switch: whether the model found enough expected gain.applied: whether live switching actually changed the fallback order.cost_guard_blocked: whether a costly non-critical switch was blocked.health: per-provider state, risk, performance, and cost observations.
Controlled synthetic benchmark results from the local test harness:
Scenarios: 8
Switches applied: 4
Latency gain: +35.25%
TTFT gain: +32.11%
Cost reduction: +6.29%
Error-rate gain: +39.39%
Quality delta: +3.12 points
For lower-is-better metrics, the benchmark uses:
gain_pct = (baseline_value - markov_value) / baseline_value * 100
For quality, it uses:
quality_delta = markov_quality - baseline_quality
These numbers are controlled synthetic results, not production claims. The recommended production path is shadow mode first, then a small canary, then a progressive rollout if p50 TTFT, p95 latency, error rate, cost per 1k requests, and quality proxy stay stable or improve.
Default routing is controlled by codex-routing-policy.yaml. Precedence is:
- CLI flags such as
--providerand--codex-provider - environment variables such as
CODEX_ROUTER_PROVIDER codex-routing-policy.yaml- built-in safe defaults
Default policy:
default_provider: auto
default_codex_provider: auto
open_models_only: false
avoid_openai: false
max_cost_usd: 0.0
task_provider_rules:
simple: auto
medium: auto
complex: openai
fallback_order:
- litellm
- standardfallback_order is used for real Codex execution. If the selected Codex-facing
provider is not ready or exits with a non-zero code, the wrapper tries the next
provider in the policy order. A dry-run prints the planned order without calling
Codex.
Open PowerShell in the repository:
cd C:\Users\user\Documents\ScriptingStart Codex with the standard path:
.\scripts\python\codex-cost-routing.cmdStart the LiteLLM proxy only when you want Gemini/API dispatch or a proxy gateway to Qwen:
.\scripts\python\Manage-CodexCostRouting.ps1 -Action Start -CodexProvider LiteLLM
codex --profile cost-routing
.\scripts\python\Manage-CodexCostRouting.ps1 -Action StopInstall or update the local LiteLLM OSS proxy to the currently pinned stable
PyPI release. As of 2026-06-29 this repository pins litellm==1.90.0:
.\scripts\python\Install-CodexLocalLiteLLMAssets.ps1
.\scripts\python\Manage-CodexCostRouting.ps1 -Action Update -LiteLLMVersion 1.90.0
.\scripts\python\Manage-CodexCostRouting.ps1 -Action StatusUse -LiteLLMVersion latest only when you intentionally want to move beyond
the pinned stable version after checking PyPI/release notes.
You can also update while starting the proxy:
.\scripts\python\Manage-CodexCostRouting.ps1 -Action Start -CodexProvider LiteLLM -UpdateLiteLLM -LiteLLMVersion 1.90.0The launcher automatically bypasses restrictive PowerShell execution policies for this command only. With no arguments, it opens the normal Codex CLI path: no LiteLLM proxy, no API-key prompt, no temporary Codex profile.
When -CodexProvider LiteLLM is selected, the script:
- installs the official LiteLLM OSS proxy in
C:\tmp\litellm-osswhen needed; - can update the local LiteLLM package with
-Action Updateor-UpdateLiteLLM; - asks only for optional session keys that are not already set;
- requires at least
OPENAI_API_KEY,GEMINI_API_KEY, or local Qwen on Ollama; - creates a random local
LITELLM_API_KEYfor the proxy session and writes only that local proxy key to%TEMP%\codex-litellm-proxy.key; - starts the LiteLLM proxy in the background;
- enables the optional Codex
cost-routingprofile; - opens Codex with that profile;
- stops LiteLLM and restores the previous configuration when Codex closes.
If you prefer entering keys in a local page for one work session, start:
.\scripts\python\Start-CodexKeySessionWeb.ps1Then open http://127.0.0.1:8787/, paste OPENAI_API_KEY,
GEMINI_API_KEY, or HF_TOKEN, and submit the form. Qwen is not managed as an
API provider here: it is only an optional local Ollama fallback. Run
Start-CodexQwenOllama.ps1 and keep the local Qwen checkbox enabled; no Qwen API
base or API key is accepted by the page. The page starts the LiteLLM proxy on
http://127.0.0.1:4000/v1 with submitted values only in the proxy process
environment. The keys are not written to disk and the web server suppresses
request logging.
To launch the optional Hugging Face-facing profile instead of the local LiteLLM proxy:
.\scripts\python\Manage-CodexCostRouting.ps1 -CodexProvider HuggingFaceStop and restore the Codex configuration after an interrupted session:
.\scripts\python\codex-cost-routing.cmd Stop.\scripts\python\codex-cost-routing.cmd Status
.\scripts\python\Manage-CodexCostRouting.ps1 -Action Status
python .\scripts\python\codex_cost_router.py doctorStatus prints the local LiteLLM version from C:\tmp\litellm-oss when it is
installed.
If a browser opened on http://localhost:4000/health shows Unauthorized,
that is expected: the local proxy is protected by LITELLM_API_KEY. The local
test script reads the temporary proxy key automatically after Start.
Validate the local proxy aliases without making a paid/model call:
.\scripts\python\Test-CodexLiteLLMDispatch.ps1Run a real minimal provider call after entering the relevant key in the local web page:
.\scripts\python\Test-CodexLiteLLMDispatch.ps1 -Model codex-hf-cheap -Call
.\scripts\python\Test-CodexLiteLLMDispatch.ps1 -Model codex-qwen-local -Call
.\scripts\python\Test-CodexLiteLLMDispatch.ps1 -Model codex-default -CallThe test prints a compact JSON result and never prints provider tokens.
Check that the local checkout matches GitHub before changing routing files:
git fetch origin
git status -sb
git rev-list --left-right --count main...origin/mainThe last command should print 0 0. After edits, run the local validation gate:
python -m pytest .\scripts\python\tests\test_codex_cost_router.py .\scripts\python\tests\test_codex_key_session_web.py
python -m ruff check .\scripts\python\codex_cost_router.py .\scripts\python\codex_key_session_web.py .\scripts\python\tests\test_codex_cost_router.py .\scripts\python\tests\test_codex_key_session_web.py
.\scripts\python\Manage-CodexCostRouting.ps1 -Action Status
python .\scripts\python\codex_cost_router.py run --dry-run --codex-provider standard "Verification Codex standard"For local Qwen, verify Ollama and measure throughput:
.\scripts\python\Start-CodexQwenOllama.ps1
.\scripts\python\Measure-QwenLocalSpeed.ps1
python .\scripts\python\codex_cost_router.py run --provider qwen --max-output-tokens 20 "Reponds exactement: OK qwen local"For the optional proxy path, start LiteLLM only for Gemini/API dispatch or a single proxy gateway to local Qwen, then stop it when the session is done:
.\scripts\python\Manage-CodexCostRouting.ps1 -Action Start -CodexProvider LiteLLM
.\scripts\python\Test-CodexLiteLLMDispatch.ps1 -Model codex-qwen-local -Call
.\scripts\python\Manage-CodexCostRouting.ps1 -Action StopThe normal final state for daily Codex work is LiteLLM OSS : arrete and
Codex profile : standard.
Use the Python wrapper when prompt cleanup and dynamic model routing are needed:
python .\scripts\python\codex_cost_router.py run --dry-run "Document this API"
python .\scripts\python\codex_cost_router.py run "Refactor this Python API and add tests"Optional budgets and forced routing:
python .\scripts\python\codex_cost_router.py run `
--force-model codex-deep `
--provider openai `
--max-input-tokens 8000 `
--max-output-tokens 3000 `
"Review this production migration for security issues"python .\scripts\python\codex_cost_router.py history --limit 20
python .\scripts\python\codex_cost_router.py statsRouting metadata is written to:
%USERPROFILE%\.codex\logs\cost_router.jsonl
Prompts and API keys are not logged.
Manage-CodexCostRouting.ps1: automatic run, persistent start, status, and stop workflow.codex-cost-routing.cmd: simple Windows launcher.codex_cost_router.py: prompt optimization and one-shot routing.Invoke-QwenLocal.ps1: direct Ollama local Qwen call for fast small tasks.Measure-QwenLocalSpeed.ps1: repeatable local token/s benchmark.codex_key_session_web.py: local-only web form for session keys.Start-CodexKeySessionWeb.ps1: PowerShell launcher for the local key page.Test-CodexLiteLLMDispatch.ps1: local proxy alias and optional call test.start-litellm-proxy.ps1: local LiteLLM starter with native strategy selection and optional risk-router startup.risk_adjusted_router.py: optional TTFT/tokens-per-second/in-flight-aware risk-adjusted front router.measure-risk-adjusted-dispatch.ps1: non-streaming risk-router benchmark.measure-risk-adjusted-streaming.ps1: streaming TTFT and throughput probe.codex-routing-policy.yaml: editable routing policy and fallback order.litellm-cost-routing.yaml: local LiteLLM OSS OpenAI/Gemini model groups, context-window fallbacks, cooldowns, and compatibility aliases.
- LiteLLM runs locally and self-hosted.
- The configuration does not enable LiteLLM Cloud or Enterprise features.
- Actual provider billing remains authoritative.
- The router remains optional: stopping it restores the previous Codex
config.tomlbyte-for-byte.