Problem
The proxy already does raw body passthrough to preserve upstream prompt caching, but routing is static (weight-based). Two providers serving the same model may have very different cache hit rates, but traffic doesn't favor the one with the warmer cache.
Proposal
Track per-provider cache hit/miss signals from responses (e.g., Anthropic's x-cache-hit header or the cache_read_input_tokens field in token usage) and use them to influence routing:
- Record cache hit rates per (provider, model) pair
- When multiple providers can serve a request, boost routing weight for the provider with higher cache hit rate
- This directly reduces input token costs and latency
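The tracking-plus-boost idea above could look something like the following sketch. This is illustrative only: the class, parameter names, and the EWMA/boost scheme are assumptions, not the proxy's actual design. It smooths a per-(provider, model) hit rate from each response's cached-token ratio, then scales that pair's base routing weight.

```python
import random
from collections import defaultdict

class CacheAwareRouter:
    """Hypothetical sketch: boost routing weight for providers with
    warmer prompt caches. Names and scheme are illustrative only."""

    def __init__(self, base_weights, alpha=0.2, boost=2.0):
        self.base_weights = dict(base_weights)  # {(provider, model): weight}
        self.alpha = alpha                      # EWMA smoothing factor
        self.boost = boost                      # max multiplier at 100% hit rate
        self.hit_rate = defaultdict(float)      # smoothed hit rate per pair

    def record(self, provider, model, cache_read_tokens, input_tokens):
        # Score a response by the fraction of input tokens served from cache
        # (e.g., cache_read_input_tokens / total input tokens).
        hit = cache_read_tokens / input_tokens if input_tokens else 0.0
        key = (provider, model)
        self.hit_rate[key] = (1 - self.alpha) * self.hit_rate[key] + self.alpha * hit

    def effective_weight(self, provider, model):
        # Scale the static base weight by 1..boost as hit rate rises,
        # so a cold provider keeps its configured weight.
        key = (provider, model)
        return self.base_weights.get(key, 0.0) * (
            1 + (self.boost - 1) * self.hit_rate[key]
        )

    def pick(self, model, providers):
        # Weighted random choice among candidate providers for this model.
        weights = [self.effective_weight(p, model) for p in providers]
        return random.choices(providers, weights=weights, k=1)[0]
```

An EWMA (rather than a lifetime average) lets the router react when a provider's cache goes cold, and capping the multiplier at `boost` keeps a hot cache from starving the other providers entirely.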
Expected Benefit
- Lower latency on cache hits (no re-processing of cached prefix)
- Reduced token costs (cache reads are cheaper)
- Smarter traffic distribution without manual weight tuning
Related
Part of stability & performance exploration vs direct Claude Code CLI → LLM gateway connections.