feat: ai-cache plugin #13308
Pull request overview
This PR adds a new ai-cache plugin to APISIX to cache LLM responses in Redis, supporting an L1 exact-match cache and an L2 semantic (embedding/vector-search) cache, plus associated Prometheus metrics, docs, and test coverage.
Changes:
- Implement `ai-cache` plugin with exact + semantic caching, embedding providers (OpenAI / Azure OpenAI), and cache scoping controls.
- Add Prometheus exporter metrics for ai-cache hits/misses and embedding latency/failures, with E2E tests.
- Add user documentation and register the plugin in docs/config and example config.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| apisix/plugins/ai-cache.lua | Core plugin logic (cache lookup in access phase, cache write in log phase, response headers). |
| apisix/plugins/ai-cache/schema.lua | Plugin schema, defaults, and encrypted fields. |
| apisix/plugins/ai-cache/exact.lua | L1 Redis exact-match cache implementation. |
| apisix/plugins/ai-cache/semantic.lua | L2 Redis Stack (RediSearch) vector index/search/store implementation. |
| apisix/plugins/ai-cache/embeddings/openai.lua | OpenAI embeddings client. |
| apisix/plugins/ai-cache/embeddings/azure_openai.lua | Azure OpenAI embeddings client. |
| apisix/plugins/prometheus/exporter.lua | Adds ai-cache metrics (hits/misses, embedding latency/failures). |
| t/plugin/ai-cache.t | Functional tests for schema, L1/L2 behavior, bypass, non-2xx, streaming, drivers. |
| t/plugin/ai-cache-scope.t | Tests cache key scoping via vars and consumer isolation. |
| t/plugin/prometheus-ai-cache.t | Prometheus metric assertions for ai-cache behavior. |
| docs/en/latest/plugins/ai-cache.md | New plugin documentation and configuration examples. |
| docs/en/latest/config.json | Adds ai-cache to docs navigation. |
| conf/config.yaml.example | Adds plugin to example plugin list with priority. |
| apisix/cli/config.lua | Registers plugin in CLI plugin list. |
```lua
if is_stream then
    core.response.set_header("Content-Type", "text/event-stream")
else
    core.response.set_header("Content-Type", "application/json")
end
return core.response.exit(200, proto.build_deny_response({
    stream = is_stream,
    text = cached_text,
}))
end
```
Cache hits are returned using `proto.build_deny_response(...)`. That helper is intended for policy denials and produces protocol-specific error/deny shapes for some protocols (e.g. openai-embeddings returns an error object), which can make cached responses invalid if this plugin is ever applied to non-chat endpoints. It would be safer to either (1) explicitly restrict ai-cache to chat-style protocols only, or (2) add a protocol method dedicated to building a successful cached response (and include required fields like model/usage where applicable).
`proto.build_deny_response` could be reused here to simulate the LLM response, and could be renamed to `proto.build_response_from_text` to prevent confusion. Added a TODO comment to be addressed in a later PR, to keep this one focused on the ai-cache implementation.
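For context, a dedicated success-shaped builder along the lines suggested here might look roughly like the sketch below. The function name and the OpenAI-style field layout are assumptions for illustration, not what `ai-protocols` actually implements.

```lua
local core = require("apisix.core")

-- Hypothetical sketch only: build a success-shaped body from cached text,
-- as an alternative to reusing build_deny_response. The OpenAI-style field
-- layout is an assumption; the real ai-protocols shapes may differ.
local function build_response_from_text(opts)
    if opts.stream then
        -- one SSE chunk carrying the cached text, followed by the stream terminator
        return "data: " .. core.json.encode({
            object = "chat.completion.chunk",
            choices = { { index = 0, delta = { content = opts.text } } },
        }) .. "\n\ndata: [DONE]\n\n"
    end

    return core.json.encode({
        object = "chat.completion",
        choices = { {
            index = 0,
            message = { role = "assistant", content = opts.text },
            finish_reason = "stop",
        } },
    })
end
```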
Pull request overview
Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.
Description
Implements the `ai-cache` plugin, a two-layer response cache for LLM/AI traffic that reduces upstream cost and latency by reusing prior completions for matching prompts.

Architecture:
The plugin sits in the request flow alongside the existing `ai-proxy`/`ai-protocols` stack. On every chat-completion request it:

- L1 (exact): hashes the prompt together with its scope (derived from the consumer and configured `ctx.var` keys) and looks up the result in Redis as a JSON blob keyed by `ai-cache:l1:<scope>:<prompt_hash>`. Hit ⇒ short-circuit with the cached response.
- L2 (semantic): on an L1 miss, computes an embedding for the prompt and runs a vector search against a RediSearch index (`ai-cache-idx-<dim>`, dimension-segregated so different embedding models coexist safely), returning the nearest candidate above `semantic.similarity_threshold`. Hit ⇒ also writes the entry into L1 so the next exact match skips the embedding cost entirely.
- Miss: lets the request through to the upstream, captures the response in `body_filter`, and (in `log`) writes it back into both L1 and L2.

Layers can be enabled independently via the `layers` array (`["exact"]`, `["semantic"]`, or both); a minimal lookup sketch follows below.
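To make the L1 flow concrete, here is a small Lua sketch of the exact-match lookup. It only illustrates the key layout described above; the helper name, the `conf.redis_*` fields, and the choice of hash function are illustrative assumptions rather than the plugin's actual code.

```lua
local core  = require("apisix.core")
local redis = require("resty.redis")

-- Illustrative only: look up a cached response under the key layout
-- ai-cache:l1:<scope>:<prompt_hash> described in this PR.
local function l1_lookup(conf, scope, prompt)
    local key = "ai-cache:l1:" .. scope .. ":" .. ngx.md5(prompt)

    local red = redis:new()
    red:set_timeout(conf.redis_timeout or 1000)
    local ok, err = red:connect(conf.redis_host, conf.redis_port or 6379)
    if not ok then
        return nil, "failed to connect to redis: " .. err
    end

    local blob, err = red:get(key)
    if not blob or blob == ngx.null then
        return nil, err              -- miss: fall through to the semantic layer
    end
    return core.json.decode(blob)    -- hit: decoded cached response
end
```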
Scope / cache key: `cache_key.include_consumer` and `cache_key.include_vars` produce a stable scope hash so caches don't leak across consumers or per-tenant variables.
Bypass: `bypass_on` lets operators skip caching for specific request headers (e.g. `Cache-Control: no-cache`, debug headers).
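For illustration, the scoping and bypass knobs above might combine like this, expressed as a Lua table. The field names follow the description, but the value shapes (how `include_vars` and `bypass_on` entries are spelled) are assumptions; the schema in `apisix/plugins/ai-cache/schema.lua` is authoritative.

```lua
-- Illustrative ai-cache configuration; value shapes are assumptions.
local conf = {
    layers = { "exact", "semantic" },              -- enable L1 and L2 independently
    cache_key = {
        include_consumer = true,                   -- keep caches isolated per consumer
        include_vars = { "http_x_tenant_id" },     -- extra ctx.var keys folded into the scope hash
    },
    bypass_on = {
        { header = "Cache-Control", value = "no-cache" },  -- skip caching when this header matches
    },
}
```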
Observability: emits `X-AI-Cache-Status` (HIT-L1/HIT-L2/MISS/BYPASS), `X-AI-Cache-Similarity` (L2 only), and `X-AI-Cache-Age` response headers. Header names are configurable.
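A rough sketch of how these headers could be emitted; the `conf.headers.*` layout used for the configurable names is an assumption.

```lua
local core = require("apisix.core")

-- Illustrative only: emit the cache-status headers described above,
-- honouring configurable header names (conf.headers.* is an assumed layout).
local function set_cache_headers(conf, status, similarity, age)
    local names = conf.headers or {}
    core.response.set_header(names.status or "X-AI-Cache-Status", status)  -- HIT-L1 / HIT-L2 / MISS / BYPASS
    if similarity then
        core.response.set_header(names.similarity or "X-AI-Cache-Similarity", similarity)  -- L2 hits only
    end
    if age then
        core.response.set_header(names.age or "X-AI-Cache-Age", age)
    end
end
```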
Safety:

- Switching embedding models (e.g. `text-embedding-3-small` 1536-dim → `text-embedding-3-large` 3072-dim) does not corrupt or wedge the index, since indexes are segregated by dimension; old entries simply expire via `semantic.ttl`.
- Sensitive fields (`semantic.embedding.api_key`, `redis_password`) are encrypted via APISIX's `encrypt_fields`.
- `max_cache_body_size` (default 1 MiB) prevents the plugin from caching unbounded payloads.
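As a small illustration of the dimension segregation mentioned above (the helper name is hypothetical):

```lua
-- Hypothetical helper: each embedding dimension gets its own RediSearch index,
-- so switching models writes to a fresh index instead of corrupting the old one.
local function index_name(dim)
    return "ai-cache-idx-" .. tostring(dim)
end

-- index_name(1536) --> "ai-cache-idx-1536"  (e.g. text-embedding-3-small)
-- index_name(3072) --> "ai-cache-idx-3072"  (e.g. text-embedding-3-large)
```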
Compatibility: zero changes to existing plugins; ai-cache piggybacks on `ai-proxy`/`ai-protocols` and does not interfere when the route lacks an AI plugin chain.

Which issue(s) this PR fixes:
Fixes #13290
Checklist