Flemma supports prompt caching on every provider it integrates with. Each provider implements caching differently, but the shared `cache_retention` parameter provides a consistent interface: set it once and it is translated for whichever provider you use.
| | Anthropic | OpenAI | Vertex AI | Moonshot |
|---|---|---|---|---|
| Default | `"short"` (5 min TTL) | `"short"` (`in_memory`) | Automatic | Automatic |
| Min. tokens | 1,024–4,096 | 1,024 | 1,024–2,048 | — |
| Read discount | 90% (0.1x) | 50% (0.5x) | 90% (0.1x) | 80% (0.2x) |
| Write cost | 1.25x–2.0x | Free | Free | Free |
| Control | `cache_retention` | `cache_retention` | None (implicit) | `prompt_cache_key` |
When caching is active, the usage bar includes cache percentage and token counts. Costs are adjusted to reflect each provider's discount on cached input.
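As a starting point, here is a minimal sketch of setting the parameter globally. The exact placement of `cache_retention` in the setup table is an assumption (shown under `parameters`, by analogy with the provider-scoped Moonshot example at the end of this section); check your Flemma version's setup documentation for the authoritative schema.

```lua
-- Assumed key placement: cache_retention under parameters.
require("flemma").setup({
  parameters = {
    cache_retention = "long", -- "short" (default) | "long" | "none"
  },
})
```

A per-buffer frontmatter override would follow the same pattern as the Moonshot example, e.g. `flemma.opt.cache_retention = "short"` (the key name is likewise assumed).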
Flemma automatically adds cache breakpoints to Anthropic API requests, letting the provider reuse previously processed prefixes at a fraction of the cost[^1]. Three breakpoints are placed: the tool definitions, the system prompt, and the last user message. Tools are sorted alphabetically so the prefix stays stable across requests.
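For orientation, the sketch below shows roughly what those breakpoints look like in the request body. `cache_control` blocks of type `"ephemeral"` (optionally with a `ttl` of `"5m"` or `"1h"`) are the Anthropic Messages API's caching mechanism; the table here is illustrative, not Flemma's actual serialization code.

```lua
-- Illustrative Anthropic request body with the three breakpoints.
local body = {
  model = "claude-sonnet-4-5",
  max_tokens = 1024,
  tools = {
    -- Tools sorted alphabetically; the breakpoint on the last tool
    -- caches the entire tool block.
    { name = "grep", input_schema = { type = "object" } },
    { name = "read_file", input_schema = { type = "object" },
      cache_control = { type = "ephemeral" } },               -- breakpoint 1
  },
  system = {
    { type = "text", text = "You are a helpful assistant.",
      cache_control = { type = "ephemeral" } },               -- breakpoint 2
  },
  messages = {
    { role = "user", content = {
        { type = "text", text = "Summarize the buffer.",
          cache_control = { type = "ephemeral", ttl = "5m" } }, -- breakpoint 3
      } },                                                     -- ("long" would use ttl = "1h")
  },
}
```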
The `cache_retention` parameter controls the caching strategy[^2]:
| Value | TTL | Write cost | Read cost | Description |
|---|---|---|---|---|
| `"short"` | 5 min | 1.25x | 0.1x | Default. Good for active chat. |
| `"long"` | 1 hour | 2.0x | 0.1x | Better for long-running tasks. |
| `"none"` | — | — | — | Disable caching entirely. |
When caching is active, the usage bar includes cache percentage and read/write token counts. Costs are adjusted accordingly – cache reads are 90% cheaper than regular input tokens.
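A quick back-of-the-envelope calculation shows why the 1.25x write premium amortizes almost immediately. The sketch below compares total input cost (in token-equivalents) for five requests that reuse the same 10,000-token prefix; the numbers are illustrative only.

```lua
-- Token-equivalent input cost of n requests over the same 10k-token prefix:
-- one cache write, then (n - 1) cache reads.
local function cost(n, write_mult, read_mult)
  local prefix = 10000
  return prefix * write_mult + (n - 1) * prefix * read_mult
end

print(cost(5, 1.0, 1.0))  -- no caching:      50,000 token-equivalents
print(cost(5, 1.25, 0.1)) -- "short" caching: 16,500
print(cost(5, 2.0, 0.1))  -- "long" caching:  24,000
```

Caching already wins on the second request: 12,500 + 1,000 token-equivalents versus 20,000 without caching.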
> [!NOTE]
> Anthropic requires a minimum number of tokens in the cached prefix before caching activates[^3]. The thresholds vary by model: 4,096 tokens for Opus 4.6, Opus 4.5, and Haiku 4.5; 1,024 tokens for Sonnet 4.6, Sonnet 4.5, Opus 4.1, Opus 4, and Sonnet 4. If your conversation is below this threshold, the API returns zero cache tokens and charges the standard input rate. This is expected – caching benefits grow with longer conversations and system prompts.
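To predict whether a given buffer will cache at all, the thresholds above can be encoded as a lookup. This is a hypothetical helper (model identifiers abbreviated to a few indicative examples), not part of Flemma's API:

```lua
-- Minimum cacheable prefix length per model (values from the note above).
local min_cacheable = {
  ["claude-opus-4-5"] = 4096,
  ["claude-haiku-4-5"] = 4096,
  ["claude-sonnet-4-5"] = 1024,
  ["claude-opus-4-1"] = 1024,
}

local function will_cache(model, prefix_tokens)
  local threshold = min_cacheable[model]
  return threshold ~= nil and prefix_tokens >= threshold
end

print(will_cache("claude-sonnet-4-5", 800))  -- false: below the 1,024 floor
print(will_cache("claude-sonnet-4-5", 2000)) -- true
```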
Flemma sends prompt caching hints to the OpenAI Responses API using the `cache_retention` parameter[^4]. When caching is active, Flemma sends the buffer's file path as `prompt_cache_key` and a retention policy as `prompt_cache_retention`. When a cache hit occurs, the usage bar includes the cache percentage and read token count. Costs are adjusted to reflect the 50% discount on cached input[^5].
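Concretely, the hints travel as two request-body fields of the Responses API, `prompt_cache_key` and `prompt_cache_retention`. The payload below is illustrative (the model name and key value are examples); the mapping from `cache_retention` values to retention policies is described in the table that follows.

```lua
-- Illustrative Responses API request with Flemma's caching hints.
local body = {
  model = "gpt-5.1",                                    -- example model
  input = "...",                                        -- conversation elided
  prompt_cache_key = "/home/user/notes/session.chat",   -- the buffer's file path
  prompt_cache_retention = "24h",                       -- "long"; "short" uses in-memory retention
}
```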
The `cache_retention` parameter controls the caching strategy:
| Value | TTL | Write cost | Read cost | Description |
|---|---|---|---|---|
| `"short"` | 5–10 min | Free (not reported) | 0.5x | Default. `in_memory` retention; good for active chat. |
| `"long"` | Up to 24 h | Free (not reported) | 0.5x | Extended retention for long sessions. |
| `"none"` | — | — | — | No caching hints sent. |
> [!NOTE]
> Unlike Anthropic, OpenAI does not report cache write tokens in the API response. Writes happen automatically and are free, so the usage bar only shows cache reads.
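The read counts surface in the usage object of each response: the Responses API reports cached tokens under `input_tokens_details.cached_tokens`. A minimal sketch of deriving the cache percentage shown in the usage bar (illustrative math, not Flemma's exact code):

```lua
-- usage as returned by the Responses API (illustrative values).
local usage = {
  input_tokens = 4200,
  input_tokens_details = { cached_tokens = 3968 },
  output_tokens = 310,
}

local cached = usage.input_tokens_details.cached_tokens
local pct = 100 * cached / usage.input_tokens
print(string.format("cache: %.0f%% (%d tokens)", pct, cached)) -- cache: 94% (3968 tokens)
```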
> [!IMPORTANT]
> OpenAI caching is best-effort and not guaranteed. Even when the prompt meets all requirements, the API may return zero cached tokens. Key conditions:
>
> - Minimum 1,024 tokens in the prompt prefix[^4]. Shorter prompts are never cached.
> - Prefix must be byte-identical between requests. Any change to tools, system prompt, or earlier messages invalidates the cache from that point forward.
> - Cache propagation takes time. The first request populates the cache; subsequent requests can hit it. Sending requests in rapid succession (within a few seconds) may miss the cache because the entry hasn't propagated yet. Wait at least 5–10 seconds between requests for the best chance of a hit.
> - 128-token granularity. Only the first 1,024 tokens plus whole 128-token increments are cacheable; tokens beyond the last 128-token boundary are always processed fresh (see the sketch after this list).
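The granularity rule is easiest to see with a worked example. This hypothetical helper computes the largest cacheable prefix for a given prompt length under the 1,024-token floor and 128-token step:

```lua
-- Largest cacheable prefix under OpenAI's 1,024 + n*128 rule.
local function cacheable_prefix(prompt_tokens)
  if prompt_tokens < 1024 then
    return 0 -- below the minimum: nothing is cached
  end
  return 1024 + math.floor((prompt_tokens - 1024) / 128) * 128
end

print(cacheable_prefix(1000)) -- 0
print(cacheable_prefix(1500)) -- 1408 (= 1024 + 3*128); the last 92 tokens stay fresh
print(cacheable_prefix(4096)) -- 4096 (exactly on a boundary)
```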
Gemini 2.5+ models support implicit context caching[^6]. When consecutive requests share a common input prefix, the Vertex AI serving infrastructure automatically caches and reuses it – no configuration or request changes are needed. When a cache hit occurs, the usage bar includes the cache percentage and read token count. Costs are adjusted to reflect the 90% discount on cached input[^7].
| Metric | Value | Description |
|---|---|---|
| Read cost | 0.1x (10%) | Cached input tokens cost 10% of the normal input rate. |
| Write cost | — | No additional charge; caching is automatic. |
| Min. tokens | 1,024 / 2,048 | 1,024 for Flash models, 2,048 for Pro models. |
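On a hit, the reused tokens are reported in the response's usage metadata as `cachedContentTokenCount`, and the adjusted cost follows from the 0.1x read rate. A sketch with a made-up per-token price:

```lua
-- Adjusted input cost from Vertex usage metadata (illustrative price).
local usage = {
  promptTokenCount = 6000,
  cachedContentTokenCount = 4096, -- reported on implicit cache hits
}

local price_per_token = 0.30 / 1e6 -- hypothetical $/token input rate
local fresh = usage.promptTokenCount - usage.cachedContentTokenCount
local total = fresh * price_per_token
    + usage.cachedContentTokenCount * price_per_token * 0.1

print(string.format("$%.6f", total)) -- $0.000694 instead of $0.001800 uncached
```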
> [!IMPORTANT]
> Vertex AI implicit caching is automatic and best-effort – cache hits are not guaranteed. Key conditions:
>
> - Minimum token thresholds vary by model: 1,024 tokens for Flash, 2,048 tokens for Pro[^6]. Shorter prompts are never cached.
> - Prefix must be identical between requests. Changing tools, system instructions, or earlier conversation turns invalidates the cache from that point forward.
> - Only Gemini 2.5+ models support implicit caching. Older Gemini models (2.0, 1.5) do not report cached tokens.
> - Cache propagation takes time. As with OpenAI, the first request populates the cache, and immediate follow-up requests may not see a hit. Allow a few seconds between requests.
> - No user control. There is no TTL parameter or opt-out – caching is managed entirely by Google's infrastructure.
Google also offers an explicit Context Caching API[^8] that creates named cache resources with configurable TTLs via a separate endpoint. Explicit caching requires a different workflow (create a cache, then reference it) and is not yet supported by Flemma.
Moonshot uses automatic prompt caching with no separate write fee. The API reports `cached_tokens` on cache hits, and cached input is billed at a reduced rate (approximately an 80% discount). There is no minimum token threshold or TTL to manage – caching is handled entirely by Moonshot's infrastructure.
To improve cache hit rates across requests, you can set a stable `prompt_cache_key` via provider parameters:
require("flemma").setup({
parameters = {
moonshot = {
prompt_cache_key = "my-project-key",
},
},
})Or per-buffer in frontmatter:
flemma.opt.moonshot = { prompt_cache_key = "my-project-key" }When a cache hit occurs, the usage bar includes the cached token count. Caching is available on kimi-k2 family models; moonshot-v1-* models do not report cached tokens.
[^1]: https://platform.claude.com/docs/en/docs/build-with-claude/prompt-caching
[^2]: https://platform.claude.com/docs/en/docs/build-with-claude/prompt-caching#pricing
[^3]: https://platform.claude.com/docs/en/docs/build-with-claude/prompt-caching#cache-limitations
[^6]: https://developers.googleblog.com/en/gemini-2-5-models-now-support-implicit-caching/
[^7]: https://docs.cloud.google.com/vertex-ai/generative-ai/docs/context-cache/context-cache-overview