[Bug] DashScope native API: cache_control placed at message level instead of content block level #1363

@wbt-ice

Description

The explicit cache control implementation in both DashScopeChatModel and OpenAIChatModel places cache_control at the message level, but the DashScope official documentation requires it to be placed inside the content block (within the content array). This applies to both the DashScope native protocol and the OpenAI-compatible protocol.

Current Behavior

When GenerateOptions.cacheControl(true) is enabled, both DashScopeChatFormatter.applyCacheControl() and OpenAIBaseFormatter.applyCacheControl() set cache_control on the message object directly (message level).

DashScope formatter:

public void applyCacheControl(List<DashScopeMessage> messages) {
    for (DashScopeMessage msg : messages) {
        if ("system".equals(msg.getRole()) && msg.getCacheControl() == null) {
            msg.setCacheControl(EPHEMERAL_CACHE_CONTROL);  // message level
        }
    }
    DashScopeMessage lastMsg = messages.get(messages.size() - 1);
    if (lastMsg.getCacheControl() == null) {
        lastMsg.setCacheControl(EPHEMERAL_CACHE_CONTROL);  // message level
    }
}

OpenAI formatter has the same logic in OpenAIBaseFormatter.applyCacheControl().

This produces the following JSON for both protocols:

{
  "role": "system",
  "content": "You are a helpful assistant.",
  "cache_control": {"type": "ephemeral"}
}

Expected Behavior

Per the official documentation, cache_control must be placed inside a content block, and content must be in array format:

{
  "role": "system",
  "content": [
    {
      "type": "text",
      "text": "You are a helpful assistant.",
      "cache_control": {"type": "ephemeral"}
    }
  ]
}

The documentation states:

The content field must be changed to array form, and a cache_control field added.

This format requirement applies to both OpenAI-compatible and DashScope native protocols when calling DashScope models.

Issues Identified

  1. cache_control is placed at the wrong level — should be inside content blocks, not at message level. This affects both DashScopeChatFormatter and OpenAIBaseFormatter.
  2. Content part DTOs lack a cache_control field — both DashScopeContentPart and OpenAIMessage content parts have no way to carry cache_control at the content block level.
  3. Multimodal messages are not handled — when content is already a list of content parts, the cache_control still goes to message level and won't be recognized by the API (see the example after this list).
  4. No guard for the 4-marker limit — the documentation states a maximum of 4 cache_control markers per request. If there are multiple system messages (e.g., injected by SkillHook or LongTermMemoryHook), the limit may be exceeded silently.
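
For example, for a message whose content is already an array of parts, the marker would need to sit inside the last content block rather than on the message (the field values below are illustrative):

{
  "role": "user",
  "content": [
    {"type": "text", "text": "Here is the retrieved reference material: ..."},
    {
      "type": "text",
      "text": "Answer using the material above.",
      "cache_control": {"type": "ephemeral"}
    }
  ]
}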

Suggested Fix

DashScope native protocol

  1. Add a cache_control field (Map<String, String>) to DashScopeContentPart.
  2. Modify DashScopeChatFormatter.applyCacheControl() to (see the sketch after this list):
    • Convert string content to array format (List<DashScopeContentPart>) for target messages.
    • Set cache_control on the last content block within each target message.
  3. Apply the same fix to DashScopeMultiAgentFormatter.applyCacheControl().
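
A minimal sketch of the block-level approach for the DashScope formatter follows. It assumes DashScopeContentPart gains a cacheControl field plus a text factory method, and that DashScopeMessage content can hold either a String or a List<DashScopeContentPart>; the accessor names are illustrative, not the actual API.

private void markLastContentBlock(DashScopeMessage msg) {
    Object content = msg.getContent();
    List<DashScopeContentPart> parts;
    if (content instanceof String) {
        // String content must first be converted to the array form required by the API.
        parts = new ArrayList<>();
        parts.add(DashScopeContentPart.text((String) content));  // hypothetical factory method
        msg.setContent(parts);
    } else {
        parts = (List<DashScopeContentPart>) content;
    }
    if (!parts.isEmpty()) {
        DashScopeContentPart last = parts.get(parts.size() - 1);
        if (last.getCacheControl() == null) {
            // cache_control goes on the last content block, not on the message object.
            last.setCacheControl(EPHEMERAL_CACHE_CONTROL);
        }
    }
}

applyCacheControl() would then call this helper for each target message (system messages and the last message) instead of calling msg.setCacheControl(...).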

OpenAI-compatible protocol

  1. Modify OpenAIBaseFormatter.applyCacheControl() with the same content-block-level approach.
  2. Add cache_control support to OpenAI content part DTOs.

Common

  1. Add a guard to ensure no more than 4 cache_control markers per request (see the sketch after this list).
  2. Keep the existing message-level cacheControl fields for backward compatibility (manual metadata marking).
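
A possible shape for the guard, sketched at message granularity for brevity (after the block-level fix it would inspect content parts instead); the method name and trimming policy are illustrative:

private static final int MAX_CACHE_MARKERS = 4;

private void enforceMarkerLimit(List<DashScopeMessage> messages) {
    List<DashScopeMessage> marked = new ArrayList<>();
    for (DashScopeMessage msg : messages) {
        if (msg.getCacheControl() != null) {
            marked.add(msg);
        }
    }
    // Clear markers from the middle first, keeping the earliest marker (the stable
    // system prompt) and the latest marker (the full prefix) within the limit of 4.
    int excess = marked.size() - MAX_CACHE_MARKERS;
    for (int i = 1; i < marked.size() - 1 && excess > 0; i++) {
        marked.get(i).setCacheControl(null);
        excess--;
    }
}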

Affected Classes

  • DashScopeChatFormatter
  • DashScopeMultiAgentFormatter
  • DashScopeContentPart
  • DashScopeMessage
  • OpenAIBaseFormatter
  • OpenAIMessage

Discussion: Should applyCacheControl() auto-mark messages?

The current applyCacheControl() strategy is "all system messages + the last message". Given DashScope's prefix-matching caching mechanism, the strategy itself is sound in theory:

  • Marking system messages creates layered prefix cache blocks (A, AB, ABC…). Even if later messages change, the shorter prefix (e.g., just the stable system prompt) can still be hit — a reasonable tiered caching approach.
  • Marking the last message caches the entire messages array as a complete prefix, which aligns with the official "continuous multi-turn dialog" pattern.

However, there are two practical concerns:

1. The 4-marker limit

The API enforces a hard limit of 4 cache_control markers per request. If more than 4 markers are present, only the last 4 take effect. In AgentScope, multiple hooks can dynamically inject system messages (e.g., SkillHook, LongTermMemoryHook, RAGHook), making the number of system messages unpredictable at the formatter level. When the total marker count exceeds 4, the earliest system messages — which are typically the most stable and most valuable to cache — will lose their markers and fall out of the cache.

2. Dynamic content defeats prefix caching

The framework cannot distinguish between stable system messages (e.g., the user's own system prompt) and dynamic ones (e.g., RAG-retrieved knowledge, long-term memory summaries). In AgentScope's hook architecture, hooks like GenericRAGHook and StaticLongTermMemoryHook inject system messages whose content changes on every request. Marking these with cache_control means each request creates a new cache block (at 125% of standard input cost) that will likely never be hit — the prefix changes every time.

Only the user knows which parts of their messages are stable and worth caching. A blanket "mark all system messages" strategy applied at the formatter level cannot make this distinction.

Suggestion

Consider making cache control user-driven rather than automatic:

  • The existing MessageMetadataKeys.CACHE_CONTROL mechanism already allows users to mark individual Msg objects for caching via metadata, which flows through applyCacheControlFromMetadata() (illustrated by the sketch after this list).
  • The automatic applyCacheControl() strategy could be removed or made opt-in, letting users who understand their caching needs and cost tolerance decide which messages to mark.
  • If keeping an automatic strategy, add a guard to enforce the 4-marker limit and prioritize stable prefixes (first system message + last message).
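
For illustration, user-driven marking might look roughly like this; the Msg construction and metadata accessor shown are hypothetical and only indicate where MessageMetadataKeys.CACHE_CONTROL would be set:

// Hypothetical usage sketch: the exact Msg construction and metadata accessor may differ.
Msg stableSystemPrompt = buildSystemPrompt();            // the user's own, stable system prompt
stableSystemPrompt.getMetadata()
        .put(MessageMetadataKeys.CACHE_CONTROL, true);   // explicit opt-in to caching
// applyCacheControlFromMetadata() then translates this marker into a
// content-block-level cache_control entry when the request is formatted.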
