feat: ai-cache plugin #13308

Open
janiussyafiq wants to merge 38 commits into apache:master from janiussyafiq:feat/ai-cache

Conversation

Contributor

janiussyafiq commented Apr 27, 2026

Description

Implements the ai-cache plugin, a two-layer response cache for LLM/AI traffic that reduces upstream cost and latency by reusing prior completions for matching prompts.

Architecture:

The plugin sits in the request flow alongside the existing ai-proxy / ai-protocols stack. On every chat-completion request it:

  1. L1 (exact) — hashes the prompt (SHA-256) plus an optional scope (consumer, configured ctx.var keys) and looks up the result in Redis as a JSON blob keyed by ai-cache:l1:<scope>:<prompt_hash>. Hit ⇒ short-circuit with the cached response (a key-construction sketch follows this list).
  2. L2 (semantic) — embeds the prompt via OpenAI / Azure OpenAI, runs a KNN search against a RediSearch HNSW vector index (ai-cache-idx-<dim>, dimension-segregated so different embedding models coexist safely), and returns the nearest candidate above semantic.similarity_threshold. Hit ⇒ also writes the entry into L1 so the next exact match skips the embedding cost entirely.
  3. MISS — proxies to the upstream, captures the response body in body_filter, and (in log) writes back into both L1 and L2.
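
For illustration, here is a minimal Lua sketch of how the L1 lookup in step 1 could be wired up with lua-resty-redis and lua-resty-string. The helper names, conf fields, and exact key layout are assumptions for the sketch, not the plugin's actual code.

```lua
-- Minimal sketch, assuming lua-resty-redis and lua-resty-string; helper names,
-- conf fields (redis_host, redis_port, redis_timeout) and the exact key layout
-- are illustrative only.
local resty_sha256 = require("resty.sha256")
local str          = require("resty.string")
local redis        = require("resty.redis")

local function sha256_hex(s)
    local digest = resty_sha256:new()
    digest:update(s)
    return str.to_hex(digest:final())
end

-- scope is derived from the consumer name and any configured ctx.var values,
-- so identical prompts from different tenants map to different keys
local function l1_lookup(conf, scope, prompt)
    local key = "ai-cache:l1:" .. sha256_hex(scope) .. ":" .. sha256_hex(prompt)

    local red = redis:new()
    red:set_timeout(conf.redis_timeout or 1000)
    local ok, err = red:connect(conf.redis_host, conf.redis_port or 6379)
    if not ok then
        return nil, "failed to connect to redis: " .. err
    end

    local blob, get_err = red:get(key)        -- JSON blob written on a previous MISS
    red:set_keepalive(10000, 100)
    if not blob or blob == ngx.null then
        return nil, get_err                   -- MISS: fall through to L2 / upstream
    end
    return blob                               -- HIT-L1: short-circuit with the cached body
end
```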

Layers can be enabled independently via the layers array (["exact"], ["semantic"], or both).

Scope / cache key: cache_key.include_consumer and cache_key.include_vars produce a stable scope hash so caches don't leak across consumers or per-tenant variables.

Bypass: bypass_on lets operators skip caching for specific request headers (e.g. Cache-Control: no-cache, debug headers).
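
Putting the layers, cache_key, and bypass_on knobs together, a route-level configuration might look roughly like the following Lua table. Field names follow this description, but the authoritative shape is defined in apisix/plugins/ai-cache/schema.lua, and the bypass_on entry format in particular is assumed.

```lua
-- Illustrative configuration only; see schema.lua for the real field shapes.
local ai_cache_conf = {
    layers = { "exact", "semantic" },          -- enable L1 and L2 independently
    cache_key = {
        include_consumer = true,               -- isolate entries per consumer
        include_vars = { "http_x_tenant_id" }, -- hypothetical ctx.var key
    },
    bypass_on = {
        { header = "Cache-Control", value = "no-cache" },
    },
    semantic = {
        similarity_threshold = 0.92,
        ttl = 3600,
        embedding = {
            provider = "openai",
            model = "text-embedding-3-small",
            api_key = "sk-...",                -- stored encrypted via encrypt_fields
        },
    },
}
```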

Observability: emits X-AI-Cache-Status (HIT-L1 / HIT-L2 / MISS / BYPASS), X-AI-Cache-Similarity (L2 only), and X-AI-Cache-Age response headers. Header names are configurable.
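
A minimal sketch of how that header emission could look; the conf.response_headers table used for the configurable names is an assumption.

```lua
-- Sketch of the header emission described above; actual code may differ.
local core = require("apisix.core")

local function set_cache_headers(conf, status, similarity, age)
    local names = conf.response_headers or {}
    -- status is one of HIT-L1 / HIT-L2 / MISS / BYPASS
    core.response.set_header(names.status or "X-AI-Cache-Status", status)
    if similarity then   -- only meaningful on semantic (L2) hits
        core.response.set_header(names.similarity or "X-AI-Cache-Similarity",
                                 string.format("%.4f", similarity))
    end
    if age then          -- seconds since the entry was cached
        core.response.set_header(names.age or "X-AI-Cache-Age", tostring(age))
    end
end
```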

Safety:

  • L2 index name and key prefix are namespaced by embedding dimension, so swapping models (e.g. text-embedding-3-small 1536-dim → text-embedding-3-large 3072-dim) does not corrupt or wedge the index — old entries simply expire via semantic.ttl. A naming sketch follows this list.
  • API keys (semantic.embedding.api_key, redis_password) are encrypted via APISIX's encrypt_fields.
  • max_cache_body_size (default 1 MiB) prevents the plugin from caching unbounded payloads.
  • L1 returns cached completions through the protocol's response builder so streaming and non-streaming clients receive a well-formed response in the format they expect.
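
A sketch of how the dimension-segregated naming and the L2 KNN lookup could fit together; the vector field name (@embedding), key prefix, and FT.SEARCH arguments are assumptions based on this description, not the PR's exact code.

```lua
-- Dimension-segregated naming: different embedding models never share an index.
local function index_name(dim)
    return "ai-cache-idx-" .. dim        -- e.g. ai-cache-idx-1536 vs ai-cache-idx-3072
end

local function key_prefix(dim)
    return "ai-cache:l2:" .. dim .. ":"  -- old-dimension keys simply age out via semantic.ttl
end

-- RediSearch KNN query: fetch the single nearest stored prompt embedding and
-- let the caller compare its score against semantic.similarity_threshold.
local function knn_query(k)
    return string.format("*=>[KNN %d @embedding $vec AS distance]", k)
end

-- The resulting raw command, roughly:
--   FT.SEARCH ai-cache-idx-1536 "*=>[KNN 1 @embedding $vec AS distance]"
--             PARAMS 2 vec <float32 blob> SORTBY distance DIALECT 2
```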

Compatibility: zero changes to existing plugins; ai-cache piggybacks on ai-proxy / ai-protocols and does not interfere when the route lacks an AI plugin chain.

Which issue(s) this PR fixes:

Fixes #13290

Checklist

  • I have explained the need for this PR and the problem it solves
  • I have explained the changes or the new features added to this PR
  • I have added tests corresponding to this change
  • I have updated the documentation to reflect this change
  • I have verified that this change is backward compatible (If not, please discuss on the APISIX mailing list first)

dosubot (bot) added the size:XXL (This PR changes 1000+ lines, ignoring generated files) and enhancement (New feature or request) labels on Apr 27, 2026

Copilot AI left a comment

Pull request overview

This PR adds a new ai-cache plugin to APISIX to cache LLM responses in Redis, supporting an L1 exact-match cache and an L2 semantic (embedding/vector-search) cache, plus associated Prometheus metrics, docs, and test coverage.

Changes:

  • Implement ai-cache plugin with exact + semantic caching, embedding providers (OpenAI / Azure OpenAI), and cache scoping controls.
  • Add Prometheus exporter metrics for ai-cache hits/misses and embedding latency/failures, with E2E tests.
  • Add user documentation and register the plugin in docs/config and example config.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 8 comments.

| File | Description |
| ---- | ----------- |
| apisix/plugins/ai-cache.lua | Core plugin logic (cache lookup in access phase, cache write in log phase, response headers). |
| apisix/plugins/ai-cache/schema.lua | Plugin schema, defaults, and encrypted fields. |
| apisix/plugins/ai-cache/exact.lua | L1 Redis exact-match cache implementation. |
| apisix/plugins/ai-cache/semantic.lua | L2 Redis Stack (RediSearch) vector index/search/store implementation. |
| apisix/plugins/ai-cache/embeddings/openai.lua | OpenAI embeddings client. |
| apisix/plugins/ai-cache/embeddings/azure_openai.lua | Azure OpenAI embeddings client. |
| apisix/plugins/prometheus/exporter.lua | Adds ai-cache metrics (hits/misses, embedding latency/failures). |
| t/plugin/ai-cache.t | Functional tests for schema, L1/L2 behavior, bypass, non-2xx, streaming, drivers. |
| t/plugin/ai-cache-scope.t | Tests cache key scoping via vars and consumer isolation. |
| t/plugin/prometheus-ai-cache.t | Prometheus metric assertions for ai-cache behavior. |
| docs/en/latest/plugins/ai-cache.md | New plugin documentation and configuration examples. |
| docs/en/latest/config.json | Adds ai-cache to docs navigation. |
| conf/config.yaml.example | Adds plugin to example plugin list with priority. |
| apisix/cli/config.lua | Registers plugin in CLI plugin list. |


Comment thread docs/en/latest/plugins/ai-cache.md
Comment thread t/plugin/ai-cache-scope.t Outdated
Comment thread apisix/plugins/ai-cache.lua
Comment on lines +120 to +129
    if is_stream then
        core.response.set_header("Content-Type", "text/event-stream")
    else
        core.response.set_header("Content-Type", "application/json")
    end
    return core.response.exit(200, proto.build_deny_response({
        stream = is_stream,
        text = cached_text,
    }))
end

Copilot AI Apr 30, 2026


Cache hits are returned using proto.build_deny_response(...). That helper is intended for policy denials and produces protocol-specific error/deny shapes for some protocols (e.g. openai-embeddings returns an error object), which can make cached responses invalid if this plugin is ever applied to non-chat endpoints. It would be safer to either (1) explicitly restrict ai-cache to chat-style protocols only, or (2) add a protocol method dedicated to building a successful cached response (and include required fields like model/usage where applicable).

Contributor Author


proto.build_deny_response can be reused here to simulate the LLM response, and could be renamed to proto.build_response_from_text to prevent confusion. Added a TODO to address this in a later PR so this one stays focused on the ai-cache implementation.
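
For reference, a hypothetical shape such a proto.build_response_from_text helper could take; field values such as id, model, and usage would need to come from the cached entry, and the streaming branch is deliberately simplified.

```lua
-- Hypothetical sketch only; real SSE replies would use chat.completion.chunk
-- deltas, and metadata should be preserved from the original upstream response.
local core = require("apisix.core")

local function build_response_from_text(opts)
    local body = {
        id = opts.id or "chatcmpl-cached",
        object = "chat.completion",
        model = opts.model,
        choices = {
            {
                index = 0,
                message = { role = "assistant", content = opts.text },
                finish_reason = "stop",
            },
        },
        usage = opts.usage,
    }
    if not opts.stream then
        return core.json.encode(body)
    end
    -- streaming clients expect SSE frames terminated by a [DONE] sentinel
    return "data: " .. core.json.encode(body) .. "\n\n" .. "data: [DONE]\n\n"
end
```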

Comment thread apisix/plugins/ai-cache/semantic.lua
Comment thread apisix/plugins/prometheus/exporter.lua
Comment thread t/plugin/ai-cache.t Outdated
Comment thread t/plugin/prometheus-ai-cache.t Outdated

Copilot AI left a comment


Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.



Comment thread apisix/plugins/ai-cache.lua
Comment thread apisix/plugins/ai-cache.lua
Comment thread apisix/plugins/ai-cache/schema.lua
Comment thread apisix/plugins/ai-cache/semantic.lua

Labels

enhancement (New feature or request), size:XXL (This PR changes 1000+ lines, ignoring generated files)


Development

Successfully merging this pull request may close these issues.

feat: add ai-cache plugin for LLM semantic caching

2 participants