From 5171e992310119e93c3a232b4babf9d336ac8512 Mon Sep 17 00:00:00 2001
From: Eric Allam
Date: Mon, 15 Jun 2026 13:59:35 +0100
Subject: [PATCH 1/3] docs(ai-chat): add prompt caching guide

---
 docs/ai-chat/prompt-caching.mdx | 202 ++++++++++++++++++++++++++++++++
 docs/docs.json                  |   1 +
 2 files changed, 203 insertions(+)
 create mode 100644 docs/ai-chat/prompt-caching.mdx

diff --git a/docs/ai-chat/prompt-caching.mdx b/docs/ai-chat/prompt-caching.mdx
new file mode 100644
index 0000000000..0fe5688ffb
--- /dev/null
+++ b/docs/ai-chat/prompt-caching.mdx
@@ -0,0 +1,202 @@
+---
+title: "Prompt caching"
+sidebarTitle: "Prompt caching"
+description: "Cache the stable prefix of your agent's prompt with Anthropic prompt caching to cut token cost and latency on every turn."
+---
+
+import RcBanner from "/snippets/ai-chat-rc-banner.mdx";
+
+
+
+**Prompt caching lets a provider reuse the unchanged prefix of your prompt across requests, billing it at a fraction of the input price and skipping re-processing.** With Anthropic, cache reads cost ~10% of base input tokens, so a long, stable system prompt or a growing conversation history pays full price once and reads cheaply on every turn after.
+
+Caching is a **byte-exact prefix match**: any change in the prefix invalidates everything after it. A multi-turn agent is the ideal case — the system prompt, tools, and earlier turns are identical turn over turn, so the cacheable prefix only grows. `chat.agent` is built to keep that prefix stable across turns, suspends, and resumes; this page shows how to place the cache breakpoints and verify they're hitting.
+
+Caching is provider-specific. This guide covers Anthropic (`@ai-sdk/anthropic`), where you opt in per breakpoint with `providerOptions.anthropic.cacheControl`. Other providers cache differently, and most cache automatically — see [Other providers](#other-providers).
+
+## What you cache, and where
+
+A request renders as `tools` → `system` → `messages`. There are three prefix regions worth caching, in order:
+
+| Region | How to cache it | Stability |
+| --- | --- | --- |
+| System prompt (+ tools) | `cacheControl` / `systemProviderOptions` on `chat.toStreamTextOptions()`, or `providerOptions` on `chat.prompt.set()` | Set once, never changes — the highest-value target |
+| Conversation history | `prepareMessages` adds a breakpoint to the last message | Grows append-only across turns |
+| Tool definitions | Stable as long as your tool set doesn't change between turns | Render at position 0 — changing them invalidates everything |
+
+`chat.agent` preserves `providerOptions` through message persistence and rehydration, so a breakpoint you place survives a suspend/resume or a page refresh. The recommended way to place message breakpoints is `prepareMessages` (below) rather than baking `cacheControl` into stored messages — `prepareMessages` runs on every prompt-assembly path, including after compaction, so the breakpoint is always in the right place.
+
+## Cache the system prompt
+
+The system prompt (your `chat.prompt` text plus any skills preamble) is usually the largest stable block, so it's the first thing to cache. `chat.toStreamTextOptions()` returns `system` as a plain string by default; opt into caching and it returns a structured system message carrying the cache breakpoint instead.
+
+Three ways to opt in, depending on where you'd rather express it.
+
+**`cacheControl` at the `streamText` call site** — the Anthropic-flavored one-liner:
+
+```ts /trigger/chat.ts
+import { chat } from "@trigger.dev/sdk/ai";
+import { streamText } from "ai";
+import { anthropic } from "@ai-sdk/anthropic";
+
+export const myChat = chat.agent({
+  id: "my-chat",
+  onChatStart: async () => {
+    chat.prompt.set(SYSTEM_PROMPT); // a large, stable instruction block
+  },
+  run: async ({ messages, signal }) => {
+    return streamText({
+      model: anthropic("claude-sonnet-4-5"),
+      // Caches the system block with a 5-minute breakpoint.
+      ...chat.toStreamTextOptions({ cacheControl: { type: "ephemeral" } }),
+      messages,
+      abortSignal: signal,
+    });
+  },
+});
+```
+
+**`systemProviderOptions`** is the provider-agnostic form — pass the raw `providerOptions` so it composes with any provider:
+
+```ts /trigger/chat.ts
+return streamText({
+  model: anthropic("claude-sonnet-4-5"),
+  ...chat.toStreamTextOptions({
+    systemProviderOptions: { anthropic: { cacheControl: { type: "ephemeral" } } },
+  }),
+  messages,
+  abortSignal: signal,
+});
+```
+
+**`providerOptions` on `chat.prompt.set()`** co-locates the intent with where the prompt is defined. It carries through to `toStreamTextOptions()` with no call-site change:
+
+```ts /trigger/chat.ts
+onChatStart: async () => {
+  chat.prompt.set(SYSTEM_PROMPT, {
+    providerOptions: { anthropic: { cacheControl: { type: "ephemeral" } } },
+  });
+},
+run: async ({ messages, signal }) => {
+  return streamText({
+    model: anthropic("claude-sonnet-4-5"),
+    ...chat.toStreamTextOptions(), // already cached
+    messages,
+    abortSignal: signal,
+  });
+},
+```
+
+If more than one is set, the call-site option wins: `systemProviderOptions` overrides `cacheControl`, and both override `chat.prompt.set`'s `providerOptions`. There's no deep merge — the most specific option replaces the rest.
+
+
+  Use the 1-hour cache for prefixes that sit idle longer than 5 minutes between turns: `cacheControl: { type: "ephemeral", ttl: "1h" }`. Writes cost more (2× vs 1.25×), so it pays off only when reads span the longer window.
+
+
+## Cache the conversation history
+
+Place a breakpoint on the last message and the entire conversation prefix up to that point is cached, so the next turn reads it back instead of re-processing it. Do this in [`prepareMessages`](/ai-chat/reference#chatagentoptions) — it transforms model messages once, and `chat.agent` applies it on every path that builds a prompt (each turn, and both compaction rebuild paths), so the breakpoint always lands on the real last message.
+
+```ts /trigger/chat.ts
+export const myChat = chat.agent({
+  id: "my-chat",
+  prepareMessages: async ({ messages }) => {
+    if (messages.length === 0) return messages;
+    const last = messages[messages.length - 1];
+    return [
+      ...messages.slice(0, -1),
+      {
+        ...last,
+        providerOptions: {
+          ...last.providerOptions,
+          anthropic: { cacheControl: { type: "ephemeral" } },
+        },
+      },
+    ];
+  },
+  run: async ({ messages, signal }) => {
+    return streamText({
+      model: anthropic("claude-sonnet-4-5"),
+      ...chat.toStreamTextOptions({ cacheControl: { type: "ephemeral" } }),
+      messages,
+      abortSignal: signal,
+    });
+  },
+});
+```
+
+The system breakpoint and the conversation breakpoint compose: the system block is cached once for the life of the chat, and each turn extends the cached message prefix.
+
+
+  Anthropic allows **at most 4** cache breakpoints per request, and a prefix must be at least ~1024 tokens (model-dependent) to cache at all — shorter prefixes silently don't cache. One system breakpoint plus one rolling message breakpoint is the typical setup and leaves headroom.
+
+
+## Caching and compaction
+
+Compaction rewrites the conversation prefix — it replaces earlier turns with a summary — so it necessarily invalidates the cached message prefix at that point. That's a one-time reset, not a regression: because `prepareMessages` also runs on the compaction rebuild and result paths, the new (shorter) prefix gets a fresh breakpoint and re-warms on the next turn. Your system-prompt cache is unaffected — compaction never touches the system block. See [Compaction](/ai-chat/compaction) for how the summary is produced.
+
+## Other providers
+
+Caching is provider-specific, and most providers don't use per-block breakpoints at all:
+
+- **OpenAI** and **Google Gemini** cache automatically. OpenAI caches any prompt prefix over 1024 tokens; Gemini 2.5 caches implicitly (1024 tokens on Flash, 2048 on Pro). Neither needs a breakpoint, so the system-caching options above are a no-op for them — `chat.agent` already gives automatic caching exactly what it needs: a byte-stable prefix that only grows across turns. Keep the system prompt frozen and the prefix over the model's minimum and reads happen on their own. (OpenAI's optional `providerOptions.openai.promptCacheKey` improves hit-routing across requests; it's a top-level option, not a system-block breakpoint.)
+
+- **Anthropic** and **Amazon Bedrock** take an explicit breakpoint on the system block — Anthropic via `cacheControl`, Bedrock via `cachePoint`. Both go through the provider-agnostic `systemProviderOptions`:
+
+```ts /trigger/chat.ts
+// Amazon Bedrock
+return streamText({
+  ...chat.toStreamTextOptions({
+    systemProviderOptions: { bedrock: { cachePoint: { type: "default" } } },
+  }),
+  messages,
+});
+```
+
+The `cacheControl` shorthand is Anthropic-only; `systemProviderOptions` (and `chat.prompt.set`'s `providerOptions`) is the form to reach for on any other breakpoint-based provider.
+
+Usage reporting is normalized. Providers report cache tokens under different names (`cachedPromptTokens`, `cachedContentTokenCount`, `cacheReadInputTokens`), but the AI SDK maps them into the same `inputTokenDetails.cacheReadTokens` / `cacheWriteTokens` that `previousTurnUsage` and `totalUsage` carry and the dashboard shows — so the [verify step](#verify-caching-is-working) is the same regardless of provider.
+
+## Verify caching is working
+
+The turn's usage carries cache token counts. `chat.agent` accumulates them across turns and hands them to `run` as `previousTurnUsage` (last turn) and `totalUsage` (whole chat), both `LanguageModelUsage`:
+
+```ts /trigger/chat.ts
+run: async ({ messages, signal, previousTurnUsage }) => {
+  // After turn 1, cacheReadTokens should be > 0 on a stable prefix.
+  console.log("cache read", previousTurnUsage?.inputTokenDetails?.cacheReadTokens);
+  console.log("cache write", previousTurnUsage?.inputTokenDetails?.cacheWriteTokens);
+
+  return streamText({
+    model: anthropic("claude-sonnet-4-5"),
+    ...chat.toStreamTextOptions({ cacheControl: { type: "ephemeral" } }),
+    messages,
+    abortSignal: signal,
+  });
+},
+```
+
+The first turn writes the cache (`cacheWriteTokens > 0`, `cacheReadTokens` is 0). Every turn after, on an unchanged prefix, reads it (`cacheReadTokens > 0`). The dashboard surfaces the same numbers on the AI span as **Cache write** and **Cache read**, so you can confirm hits per run without logging.
+
+If `cacheReadTokens` stays 0 across turns with an identical prefix, a silent invalidator is shifting the bytes — see below.
+
+
+  Anything that changes the prefix between turns silently kills the cache. Keep the system prompt **byte-stable** — never interpolate a timestamp, request ID, or per-turn value into `chat.prompt`. Don't change the **model** or the **tool set** mid-conversation (tools render at position 0, so adding one invalidates everything after). Inject dynamic per-turn context as a late message via [pending messages](/ai-chat/pending-messages) or [background injection](/ai-chat/background-injection), not into the cached prefix.
+
+
+## Next steps
+
+
+
+    Keep long conversations within token limits — and re-warm the cache after.
+
+
+    Cut cold-start latency so a cached prefix is the only thing between a message and a reply.
+
+
+    Full option surface, including `prepareMessages` and `toStreamTextOptions`.
+
+
+    The three ways to build a chat backend and when to reach for each.
+
+
diff --git a/docs/docs.json b/docs/docs.json
index 367b99a914..a926dcdadf 100644
--- a/docs/docs.json
+++ b/docs/docs.json
@@ -123,6 +123,7 @@
   "ai/prompts",
   "ai-chat/fast-starts",
   "ai-chat/compaction",
+  "ai-chat/prompt-caching",
   "ai-chat/pending-messages",
   "ai-chat/background-injection",
   "ai-chat/actions",

From 4fa31fccb486c65e20ec5fb4ce3ce707f980407a Mon Sep 17 00:00:00 2001
From: Eric Allam
Date: Mon, 15 Jun 2026 15:23:35 +0100
Subject: [PATCH 2/3] docs(ai-chat): note prompt caching needs AI SDK v6+

---
 docs/ai-chat/prompt-caching.mdx | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/docs/ai-chat/prompt-caching.mdx b/docs/ai-chat/prompt-caching.mdx
index 0fe5688ffb..ee3f1b70c5 100644
--- a/docs/ai-chat/prompt-caching.mdx
+++ b/docs/ai-chat/prompt-caching.mdx
@@ -30,6 +30,10 @@ A request renders as `tools` → `system` → `messages`. There are three prefix

 The system prompt (your `chat.prompt` text plus any skills preamble) is usually the largest stable block, so it's the first thing to cache. `chat.toStreamTextOptions()` returns `system` as a plain string by default; opt into caching and it returns a structured system message carrying the cache breakpoint instead.

+
+  System-prompt caching needs AI SDK v6 or later, where the `system` parameter accepts a structured message. On AI SDK v5 `system` is a plain string, so these options won't apply a breakpoint to the system block — cache the conversation via `prepareMessages` instead.
+
+
 Three ways to opt in, depending on where you'd rather express it.

 **`cacheControl` at the `streamText` call site** — the Anthropic-flavored one-liner:

From a9dec8cdb47d85c38a2e1eaf5d9fda79656546d2 Mon Sep 17 00:00:00 2001
From: Eric Allam
Date: Mon, 15 Jun 2026 16:12:25 +0100
Subject: [PATCH 3/3] docs(ai-chat): drop inaccurate provider cache-token field names

---
 docs/ai-chat/prompt-caching.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/ai-chat/prompt-caching.mdx b/docs/ai-chat/prompt-caching.mdx
index ee3f1b70c5..9b383d9bef 100644
--- a/docs/ai-chat/prompt-caching.mdx
+++ b/docs/ai-chat/prompt-caching.mdx
@@ -159,7 +159,7 @@ return streamText({

 The `cacheControl` shorthand is Anthropic-only; `systemProviderOptions` (and `chat.prompt.set`'s `providerOptions`) is the form to reach for on any other breakpoint-based provider.

-Usage reporting is normalized. Providers report cache tokens under different names (`cachedPromptTokens`, `cachedContentTokenCount`, `cacheReadInputTokens`), but the AI SDK maps them into the same `inputTokenDetails.cacheReadTokens` / `cacheWriteTokens` that `previousTurnUsage` and `totalUsage` carry and the dashboard shows — so the [verify step](#verify-caching-is-working) is the same regardless of provider.
+Usage reporting is normalized. Each provider reports cache tokens under its own provider-specific field, but the AI SDK maps them into the same `inputTokenDetails.cacheReadTokens` / `cacheWriteTokens` that `previousTurnUsage` and `totalUsage` carry and the dashboard shows — so the [verify step](#verify-caching-is-working) is the same regardless of provider.

 ## Verify caching is working