| 1 | +--- |
| 2 | +title: "Prompt caching" |
| 3 | +sidebarTitle: "Prompt caching" |
| 4 | +description: "Cache the stable prefix of your agent's prompt with Anthropic prompt caching to cut token cost and latency on every turn." |
| 5 | +--- |
| 6 | + |
| 7 | +import RcBanner from "/snippets/ai-chat-rc-banner.mdx"; |
| 8 | + |
| 9 | +<RcBanner /> |
| 10 | + |
| 11 | +**Prompt caching lets a provider reuse the unchanged prefix of your prompt across requests, billing it at a fraction of the input price and skipping re-processing.** With Anthropic, cache reads cost ~10% of the base input token price, so a long, stable system prompt or a growing conversation history pays the cache-write price once (1.25× base input for the default 5-minute cache) and reads cheaply on every turn after.
| 12 | + |
| 13 | +Caching is a **byte-exact prefix match**: any change in the prefix invalidates everything after it. A multi-turn agent is the ideal case — the system prompt, tools, and earlier turns are identical turn over turn, so the cacheable prefix only grows. `chat.agent` is built to keep that prefix stable across turns, suspends, and resumes; this page shows how to place the cache breakpoints and verify they're hitting. |
| 14 | + |
| 15 | +Caching is provider-specific. This guide covers Anthropic (`@ai-sdk/anthropic`), where you opt in per breakpoint with `providerOptions.anthropic.cacheControl`. Other providers cache differently, and most cache automatically — see [Other providers](#other-providers). |
| 16 | + |
| 17 | +## What you cache, and where |
| 18 | + |
| 19 | +A request renders as `tools` → `system` → `messages`. There are three prefix regions worth caching:
| 20 | + |
| 21 | +| Region | How to cache it | Stability | |
| 22 | +| --- | --- | --- | |
| 23 | +| System prompt (+ tools) | `cacheControl` / `systemProviderOptions` on `chat.toStreamTextOptions()`, or `providerOptions` on `chat.prompt.set()` | Set once, never changes — the highest-value target | |
| 24 | +| Conversation history | `prepareMessages` adds a breakpoint to the last message | Grows append-only across turns | |
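| 25 | +| Tool definitions | No separate breakpoint: tools render before the system block, so the system breakpoint covers them | Stable only while the tool set is unchanged between turns; tools render at position 0, so changing them invalidates everything after |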
| 26 | + |
| 27 | +`chat.agent` preserves `providerOptions` through message persistence and rehydration, so a breakpoint you place survives a suspend/resume or a page refresh. The recommended way to place message breakpoints is `prepareMessages` (below) rather than baking `cacheControl` into stored messages — `prepareMessages` runs on every prompt-assembly path, including after compaction, so the breakpoint is always in the right place. |
| 28 | + |
| 29 | +## Cache the system prompt |
| 30 | + |
| 31 | +The system prompt (your `chat.prompt` text plus any skills preamble) is usually the largest stable block, so it's the first thing to cache. `chat.toStreamTextOptions()` returns `system` as a plain string by default; opt into caching and it returns a structured system message carrying the cache breakpoint instead. |
| 32 | + |
| 33 | +Three ways to opt in, depending on where you'd rather express it. |
| 34 | + |
| 35 | +**`cacheControl` at the `streamText` call site** — the Anthropic-flavored one-liner: |
| 36 | + |
| 37 | +```ts /trigger/chat.ts |
| 38 | +import { chat } from "@trigger.dev/sdk/ai"; |
| 39 | +import { streamText } from "ai"; |
| 40 | +import { anthropic } from "@ai-sdk/anthropic"; |
| 41 | + |
| 42 | +export const myChat = chat.agent({ |
| 43 | + id: "my-chat", |
| 44 | + onChatStart: async () => { |
| 45 | + chat.prompt.set(SYSTEM_PROMPT); // a large, stable instruction block |
| 46 | + }, |
| 47 | + run: async ({ messages, signal }) => { |
| 48 | + return streamText({ |
| 49 | + model: anthropic("claude-sonnet-4-5"), |
| 50 | + // Caches the system block with a 5-minute breakpoint. |
| 51 | + ...chat.toStreamTextOptions({ cacheControl: { type: "ephemeral" } }), |
| 52 | + messages, |
| 53 | + abortSignal: signal, |
| 54 | + }); |
| 55 | + }, |
| 56 | +}); |
| 57 | +``` |
| 58 | + |
| 59 | +**`systemProviderOptions`** is the provider-agnostic form — pass the raw `providerOptions` so it composes with any provider: |
| 60 | + |
| 61 | +```ts /trigger/chat.ts |
| 62 | +return streamText({ |
| 63 | + model: anthropic("claude-sonnet-4-5"), |
| 64 | + ...chat.toStreamTextOptions({ |
| 65 | + systemProviderOptions: { anthropic: { cacheControl: { type: "ephemeral" } } }, |
| 66 | + }), |
| 67 | + messages, |
| 68 | + abortSignal: signal, |
| 69 | +}); |
| 70 | +``` |
| 71 | + |
| 72 | +**`providerOptions` on `chat.prompt.set()`** co-locates the intent with where the prompt is defined. It carries through to `toStreamTextOptions()` with no call-site change: |
| 73 | + |
| 74 | +```ts /trigger/chat.ts |
| 75 | +onChatStart: async () => { |
| 76 | + chat.prompt.set(SYSTEM_PROMPT, { |
| 77 | + providerOptions: { anthropic: { cacheControl: { type: "ephemeral" } } }, |
| 78 | + }); |
| 79 | +}, |
| 80 | +run: async ({ messages, signal }) => { |
| 81 | + return streamText({ |
| 82 | + model: anthropic("claude-sonnet-4-5"), |
| 83 | + ...chat.toStreamTextOptions(), // already cached |
| 84 | + messages, |
| 85 | + abortSignal: signal, |
| 86 | + }); |
| 87 | +}, |
| 88 | +``` |
| 89 | + |
| 90 | +If more than one is set, the most specific option wins: `systemProviderOptions` overrides `cacheControl`, and both call-site options override `chat.prompt.set`'s `providerOptions`. There's no deep merge; the winning option replaces the rest outright.
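The precedence rule can be sketched as a small resolver. This is an illustration of the behavior described above, not the SDK's actual source: `resolveSystemProviderOptions`, `SystemCacheInputs`, and `promptProviderOptions` are hypothetical names introduced for the example.

```ts
// Illustrative only: mirrors the documented precedence, not the SDK's implementation.
type ProviderOptions = Record<string, Record<string, unknown>>;
type CacheControl = { type: "ephemeral"; ttl?: "5m" | "1h" };

interface SystemCacheInputs {
  systemProviderOptions?: ProviderOptions; // call site, raw provider form
  cacheControl?: CacheControl; // call site, Anthropic shorthand
  promptProviderOptions?: ProviderOptions; // hypothetical name for what chat.prompt.set() stored
}

function resolveSystemProviderOptions(
  inputs: SystemCacheInputs,
): ProviderOptions | undefined {
  // 1. The raw call-site form wins outright (no deep merge).
  if (inputs.systemProviderOptions) return inputs.systemProviderOptions;
  // 2. Otherwise the Anthropic shorthand expands to the raw form.
  if (inputs.cacheControl) {
    return { anthropic: { cacheControl: inputs.cacheControl } };
  }
  // 3. Otherwise fall back to whatever was set alongside the prompt.
  return inputs.promptProviderOptions;
}
```

Because the winner replaces the rest, a Bedrock `systemProviderOptions` at the call site would drop an Anthropic `cacheControl` set on the prompt rather than combining with it.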
| 91 | + |
| 92 | +<Note> |
| 93 | + Use the 1-hour cache for prefixes that sit idle longer than 5 minutes between turns: `cacheControl: { type: "ephemeral", ttl: "1h" }`. Cache writes cost more (2× the base input price, vs 1.25× for the 5-minute cache), so it pays off only when reads span the longer window.
| 94 | +</Note> |
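To make the TTL trade-off concrete, here is a rough cost sketch. It assumes the multipliers quoted on this page (reads at about 0.1× base input, 5-minute writes at 1.25×, 1-hour writes at 2×) and only models a fixed-size prefix. `prefixCost` is a hypothetical helper, not part of any SDK, and real prices should be checked against the provider's pricing page.

```ts
// Relative cost of a cached prefix, in units of "base input price × prefix tokens".
// Multipliers are assumptions taken from the figures quoted in this guide.
const READ_MULTIPLIER = 0.1;
const WRITE_MULTIPLIER = { "5m": 1.25, "1h": 2 } as const;

type Ttl = keyof typeof WRITE_MULTIPLIER;

// writes: how many times the prefix is written (or rewritten) to the cache
// reads:  how many turns read the prefix back from a warm cache
function prefixCost(ttl: Ttl, writes: number, reads: number): number {
  return writes * WRITE_MULTIPLIER[ttl] + reads * READ_MULTIPLIER;
}

// Ten turns, each separated by a 10-minute idle gap.
// 5-minute cache: the entry expires between turns, so every turn rewrites it.
const fiveMinute = prefixCost("5m", 10, 0);
// 1-hour cache: one write, then nine warm reads.
const oneHour = prefixCost("1h", 1, 9);

console.log({ fiveMinute, oneHour, uncached: 10 });
```

Under these assumptions the 5-minute cache is worse than no caching at all when every gap exceeds the TTL, while the 1-hour cache costs well under a third of the uncached run. With gaps shorter than 5 minutes the comparison flips, since the cheaper write then wins.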
| 95 | + |
| 96 | +## Cache the conversation history |
| 97 | + |
| 98 | +Place a breakpoint on the last message and the entire conversation prefix up to that point is cached, so the next turn reads it back instead of re-processing it. Do this in [`prepareMessages`](/ai-chat/reference#chatagentoptions) — it transforms model messages once, and `chat.agent` applies it on every path that builds a prompt (each turn, and both compaction rebuild paths), so the breakpoint always lands on the real last message. |
| 99 | + |
| 100 | +```ts /trigger/chat.ts |
| 101 | +export const myChat = chat.agent({ |
| 102 | + id: "my-chat", |
| 103 | + prepareMessages: async ({ messages }) => { |
| 104 | + if (messages.length === 0) return messages; |
| 105 | + const last = messages[messages.length - 1]; |
| 106 | + return [ |
| 107 | + ...messages.slice(0, -1), |
| 108 | + { |
| 109 | + ...last, |
| 110 | + providerOptions: { |
| 111 | + ...last.providerOptions, |
| 112 | + anthropic: { ...last.providerOptions?.anthropic, cacheControl: { type: "ephemeral" } },
| 113 | + }, |
| 114 | + }, |
| 115 | + ]; |
| 116 | + }, |
| 117 | + run: async ({ messages, signal }) => { |
| 118 | + return streamText({ |
| 119 | + model: anthropic("claude-sonnet-4-5"), |
| 120 | + ...chat.toStreamTextOptions({ cacheControl: { type: "ephemeral" } }), |
| 121 | + messages, |
| 122 | + abortSignal: signal, |
| 123 | + }); |
| 124 | + }, |
| 125 | +}); |
| 126 | +``` |
| 127 | + |
| 128 | +The system breakpoint and the conversation breakpoint compose: the system block is cached once for the life of the chat, and each turn extends the cached message prefix. |
| 129 | + |
| 130 | +<Note> |
| 131 | + Anthropic allows **at most 4** cache breakpoints per request, and a prefix must be at least ~1024 tokens (model-dependent) to cache at all — shorter prefixes silently don't cache. One system breakpoint plus one rolling message breakpoint is the typical setup and leaves headroom. |
| 132 | +</Note> |
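A small guard can catch a breakpoint budget overrun before the request is sent. The sketch below is an assumption-laden illustration: `MessageLike`, `countMessageBreakpoints`, and `withinBreakpointBudget` are hypothetical names, and it only inspects message-level `providerOptions` (breakpoints placed on individual content parts would need an extra pass).

```ts
// Minimal structural type: just the field this check reads.
type MessageLike = {
  role: string;
  providerOptions?: Record<string, Record<string, unknown>>;
};

// Limit stated in the note above.
const MAX_BREAKPOINTS = 4;

// Counts messages that carry an Anthropic cacheControl marker.
function countMessageBreakpoints(messages: MessageLike[]): number {
  return messages.filter(
    (m) => m.providerOptions?.anthropic?.cacheControl !== undefined,
  ).length;
}

// systemBreakpoints defaults to 1, matching the "one system + one rolling" setup.
function withinBreakpointBudget(
  messages: MessageLike[],
  systemBreakpoints = 1,
): boolean {
  return countMessageBreakpoints(messages) + systemBreakpoints <= MAX_BREAKPOINTS;
}
```

If `cacheControl` is only ever added in `prepareMessages`, the count stays at one; the check is mainly useful when breakpoints might also be baked into stored messages.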
| 133 | + |
| 134 | +## Caching and compaction |
| 135 | + |
| 136 | +Compaction rewrites the conversation prefix — it replaces earlier turns with a summary — so it necessarily invalidates the cached message prefix at that point. That's a one-time reset, not a regression: because `prepareMessages` also runs on the compaction rebuild and result paths, the new (shorter) prefix gets a fresh breakpoint and re-warms on the next turn. Your system-prompt cache is unaffected — compaction never touches the system block. See [Compaction](/ai-chat/compaction) for how the summary is produced. |
| 137 | + |
| 138 | +## Other providers |
| 139 | + |
| 140 | +Caching is provider-specific, and most providers don't use per-block breakpoints at all: |
| 141 | + |
| 142 | +- **OpenAI** and **Google Gemini** cache automatically. OpenAI caches prompt prefixes of 1024 tokens or more; Gemini 2.5 caches implicitly (1024 tokens on Flash, 2048 on Pro). Neither needs a breakpoint, so the system-caching options above are a no-op for them. `chat.agent` already gives automatic caching exactly what it needs: a byte-stable prefix that only grows across turns. Keep the system prompt frozen and the prefix over the model's minimum, and reads happen on their own. (OpenAI's optional `providerOptions.openai.promptCacheKey` improves hit-routing across requests; it's a top-level option, not a system-block breakpoint.)
| 143 | + |
| 144 | +- **Anthropic** and **Amazon Bedrock** take an explicit breakpoint on the system block — Anthropic via `cacheControl`, Bedrock via `cachePoint`. Both go through the provider-agnostic `systemProviderOptions`: |
| 145 | + |
| 146 | +```ts /trigger/chat.ts |
| 147 | +// Amazon Bedrock |
| 148 | +return streamText({ |
| 149 | + ...chat.toStreamTextOptions({ |
| 150 | + systemProviderOptions: { bedrock: { cachePoint: { type: "default" } } }, |
| 151 | + }), |
| 152 | + messages, |
| 153 | +}); |
| 154 | +``` |
| 155 | + |
| 156 | +The `cacheControl` shorthand is Anthropic-only; `systemProviderOptions` (and `chat.prompt.set`'s `providerOptions`) is the form to reach for on any other breakpoint-based provider. |
| 157 | + |
| 158 | +Usage reporting is normalized. Providers report cache tokens under different names (`cachedPromptTokens`, `cachedContentTokenCount`, `cacheReadInputTokens`), but the AI SDK maps them into the same `inputTokenDetails.cacheReadTokens` / `cacheWriteTokens` that `previousTurnUsage` and `totalUsage` carry and the dashboard shows — so the [verify step](#verify-caching-is-working) is the same regardless of provider. |
| 159 | + |
| 160 | +## Verify caching is working |
| 161 | + |
| 162 | +The turn's usage carries cache token counts. `chat.agent` accumulates them across turns and hands them to `run` as `previousTurnUsage` (last turn) and `totalUsage` (whole chat), both `LanguageModelUsage`: |
| 163 | + |
| 164 | +```ts /trigger/chat.ts |
| 165 | +run: async ({ messages, signal, previousTurnUsage }) => { |
| 166 | + // After turn 1, cacheReadTokens should be > 0 on a stable prefix. |
| 167 | + console.log("cache read", previousTurnUsage?.inputTokenDetails?.cacheReadTokens); |
| 168 | + console.log("cache write", previousTurnUsage?.inputTokenDetails?.cacheWriteTokens); |
| 169 | + |
| 170 | + return streamText({ |
| 171 | + model: anthropic("claude-sonnet-4-5"), |
| 172 | + ...chat.toStreamTextOptions({ cacheControl: { type: "ephemeral" } }), |
| 173 | + messages, |
| 174 | + abortSignal: signal, |
| 175 | + }); |
| 176 | +}, |
| 177 | +``` |
| 178 | + |
| 179 | +The first turn writes the cache (`cacheWriteTokens > 0`, `cacheReadTokens` is 0). Every turn after, on an unchanged prefix, reads it (`cacheReadTokens > 0`). The dashboard surfaces the same numbers on the AI span as **Cache write** and **Cache read**, so you can confirm hits per run without logging. |
| 180 | + |
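| 181 | +If `cacheReadTokens` stays 0 across turns even though the prefix should be identical, either something is silently changing the bytes or the prefix is below the model's minimum cacheable length. The warning below lists the usual causes.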
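The three outcomes described above can be turned into a tiny classifier for logs or alerts. This is a sketch: `UsageLike` is a local structural type that mirrors only the `inputTokenDetails` fields named on this page, and `classifyCache` is a hypothetical helper rather than an SDK function.

```ts
// Only the fields this guide refers to; the real usage object carries more.
type UsageLike = {
  inputTokenDetails?: {
    cacheReadTokens?: number;
    cacheWriteTokens?: number;
  };
};

type CacheState = "hit" | "write-only" | "uncached";

function classifyCache(usage: UsageLike | undefined): CacheState {
  const read = usage?.inputTokenDetails?.cacheReadTokens ?? 0;
  const write = usage?.inputTokenDetails?.cacheWriteTokens ?? 0;
  if (read > 0) return "hit"; // a cached prefix was read back
  if (write > 0) return "write-only"; // expected on turn 1, suspicious afterwards
  return "uncached"; // nothing cached: prefix too short or no breakpoint
}
```

Passing `previousTurnUsage` from `run` into a helper like this makes a repeated `"write-only"` result after the first turn easy to spot, which is the signature of a prefix that keeps changing.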
| 182 | + |
| 183 | +<Warning> |
| 184 | + Anything that changes the prefix between turns silently kills the cache. Keep the system prompt **byte-stable** — never interpolate a timestamp, request ID, or per-turn value into `chat.prompt`. Don't change the **model** or the **tool set** mid-conversation (tools render at position 0, so adding one invalidates everything after). Inject dynamic per-turn context as a late message via [pending messages](/ai-chat/pending-messages) or [background injection](/ai-chat/background-injection), not into the cached prefix. |
| 185 | +</Warning> |
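One way to catch an unstable prefix during development is to fingerprint the system prompt each turn and compare it with the previous one. The sketch below uses Node's built-in `node:crypto`; `fingerprint` and `makeDriftDetector` are hypothetical helpers, and where the previous fingerprint is stored between turns is left as an assumption.

```ts
import { createHash } from "node:crypto";

// SHA-256 over the exact UTF-8 bytes: any byte-level change yields a new digest.
function fingerprint(prefix: string): string {
  return createHash("sha256").update(prefix, "utf8").digest("hex");
}

// Returns a function that reports true when the prefix differs from the
// one seen on the previous call. The first call never reports drift.
function makeDriftDetector(): (prefix: string) => boolean {
  let previous: string | undefined;
  return (prefix: string): boolean => {
    const current = fingerprint(prefix);
    const drifted = previous !== undefined && previous !== current;
    previous = current;
    return drifted;
  };
}

const hasDrifted = makeDriftDetector();
hasDrifted("You are a helpful agent."); // first turn: baseline only
if (hasDrifted("You are a helpful agent. Now: 12:01")) {
  console.warn("System prompt changed between turns; the cache will miss.");
}
```

A hash is used instead of a string comparison so the fingerprint can be logged or persisted cheaply without storing the full prompt.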
| 186 | + |
| 187 | +## Next steps |
| 188 | + |
| 189 | +<CardGroup cols={2}> |
| 190 | + <Card title="Compaction" icon="compress" href="/ai-chat/compaction"> |
| 191 | + Keep long conversations within token limits — and re-warm the cache after. |
| 192 | + </Card> |
| 193 | + <Card title="Fast starts" icon="bolt" href="/ai-chat/fast-starts"> |
| 194 | + Cut cold-start latency so a cached prefix is the only thing between a message and a reply. |
| 195 | + </Card> |
| 196 | + <Card title="chat.agent reference" icon="book" href="/ai-chat/reference#chatagentoptions"> |
| 197 | + Full option surface, including `prepareMessages` and `toStreamTextOptions`. |
| 198 | + </Card> |
| 199 | + <Card title="Building agents: backend" icon="server" href="/ai-chat/backend"> |
| 200 | + The three ways to build a chat backend and when to reach for each. |
| 201 | + </Card> |
| 202 | +</CardGroup> |