Skip to content

test(ci): add Cache Guard CI test for prefix-cache stability#2503

Closed
HUQIANTAO wants to merge 1 commit into
Hmbown:mainfrom
HUQIANTAO:feat/cache-guard-ci
Closed

test(ci): add Cache Guard CI test for prefix-cache stability#2503
HUQIANTAO wants to merge 1 commit into
Hmbown:mainfrom
HUQIANTAO:feat/cache-guard-ci

Conversation

@HUQIANTAO
Copy link
Copy Markdown
Contributor

@HUQIANTAO HUQIANTAO commented Jun 1, 2026

Summary

Add a CI guard test that verifies prefix-cache stability across multi-turn conversations. This provides a safety net that catches any regression that would break prefix cache stability before it reaches production.

Motivation

Codewhale currently has no CI-level validation of prefix cache stability. Any change that introduces timestamp drift, random ordering, or non-deterministic content into the system prompt or tool catalog can silently reduce cache hit rates from 90%+ to 70-80%, with no test failure to alert developers. This guard test fills that gap.

Changes

  • New file: crates/tui/tests/cache_guard.rs (340 lines, 9 test cases)

Test Cases

8 multi-turn conversation scenarios:

Case Turns Reasoning Tool Loop Mixed Sizes
plain-dialogue 14
plain-dialogue-no-reasoning 14
long-dialogue 18
mixed-message-sizes 20
tool-loop 14
tool-loop-no-reasoning 14
long-tool-loop 24
long-tool-loop-no-reasoning 24

Plus 1 compaction behavior verification (30 turns).

Environment Variables

Variable Default Description
CODEWHALE_CACHE_GUARD (unset) Set to 1 to enable the guard
CODEWHALE_CACHE_GUARD_THRESHOLD 40 Hit rate threshold (0-100)
CODEWHALE_CACHE_GUARD_STRICT (unset) Set to 1 to fail on violation

Mock Design

The mock simulates DeepSeek's server-side prefix cache behavior using byte-prefix matching:

  • For each turn, compute the common byte prefix with the previous request
  • Track hit rate per turn and compute tail average (last 5 turns)
  • The default threshold (40%) is calibrated for the mock; real CI should use CODEWHALE_CACHE_GUARD_THRESHOLD=90

Usage

# Run with guard enabled (warn mode)
CODEWHALE_CACHE_GUARD=1 cargo test --test cache_guard

# Run with guard enabled (strict mode - fail on violation)
CODEWHALE_CACHE_GUARD=1 CODEWHALE_CACHE_GUARD_STRICT=1 cargo test --test cache_guard

# Run with custom threshold
CODEWHALE_CACHE_GUARD=1 CODEWHALE_CACHE_GUARD_THRESHOLD=90 cargo test --test cache_guard

Testing

All 9 tests pass in both warn and strict modes:

  • CODEWHALE_CACHE_GUARD=1 cargo test --test cache_guard → 9 passed
  • CODEWHALE_CACHE_GUARD=1 CODEWHALE_CACHE_GUARD_STRICT=1 cargo test --test cache_guard → 9 passed
  • cargo test --test cache_guard (no env) → 9 passed (all skipped)

Risk Assessment

Zero risk:

  • Pure test addition; no production code changes
  • Env-gated: tests are no-ops without CODEWHALE_CACHE_GUARD=1
  • No new dependencies
  • No impact on existing tests or CI pipeline

Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HUQIANTAO has reached the 50-review limit for trial accounts. To continue receiving code reviews, upgrade your plan.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new CI test suite crates/tui/tests/cache_guard.rs to verify prefix-cache stability across multi-turn conversations. The review feedback highlights three key issues: first, the test suite is entirely self-contained and does not exercise actual production code, meaning it won't catch regressions in the codebase; second, the mock dialogue generators fail to accumulate conversation history, resulting in unrealistic simulations; and third, there is a mismatch in the compaction test where the assertion message references a 50% threshold while the code checks for 80%.

//! CODEWHALE_CACHE_GUARD=1 cargo test --test cache_guard
//! CODEWHALE_CACHE_GUARD=1 CODEWHALE_CACHE_GUARD_STRICT=1 cargo test --test cache_guard

// No external dependencies needed for the mock.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

Critical Architectural Issue: Test Suite Does Not Exercise Production Code\n\nThis test suite is entirely self-contained and does not import or call any production code from the codebase (such as PrefixStabilityManager from prefix_cache.rs or the compaction logic from compaction.rs).\n\nAs written, these tests only validate that the local mock generators (plain_dialogue_body, tool_loop_body, etc.) produce stable prefixes against the local MockPrefixCache. If a developer introduces a regression in compaction.rs or prefix_cache.rs that breaks prefix stability or causes cache-busting drift, this CI guard test will still pass because it does not touch those modules.\n\n#### Recommended Solution\nTo make this a true regression guard, the tests should:\n1. Construct actual Message vectors (using the types from crate::models::Message).\n2. Run them through the actual production compaction/stability pipeline (e.g., plan_compaction or PrefixStabilityManager).\n3. Serialize the resulting messages using the actual API serialization logic (e.g., tool_to_api_json or the request formatting).\n4. Submit the serialized bytes of the actual requests to the MockPrefixCache to assert the hit rate.\n\nThis ensures that any changes to the system prompt construction, tool serialization, or compaction thresholds in the actual codebase are caught if they degrade the prefix cache hit rate.

Comment on lines +114 to +126
fn plain_dialogue_body(turn: usize, with_reasoning: bool) -> Vec<u8> {
let system = "You are a helpful assistant. Answer concisely and accurately.";
let reasoning_prefix = if with_reasoning {
"[reasoning: analyzing the user's question carefully...]"
} else {
""
};
let user_msg = format!("User message turn {turn} — please respond to this query.");
let body = format!(
"{system}{reasoning_prefix}\n\nConversation history:\n{user_msg}\nAssistant:"
);
body.into_bytes()
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Mock Dialogue Generator Does Not Accumulate Conversation History\n\nIn a real multi-turn conversation, the request body sent to the LLM accumulates the history of all prior turns (e.g., System + User 1 + Assistant 1 + User 2 + Assistant 2...).\n\nCurrently, plain_dialogue_body (and other generators like tool_loop_body) only formats the single current turn:\nrust\nlet body = format!(\n \"{system}{reasoning_prefix}\\n\\nConversation history:\\n{user_msg}\\nAssistant:\"\n);\n\nBecause the history does not grow across turns, the mock prefix cache is evaluating independent single-turn requests with a static system prompt, rather than a true growing multi-turn conversation. This makes the simulated cache hit rates unrealistic compared to actual production behavior.

Comment on lines +309 to +313
if strict() {
assert!(
has_significant_miss,
"Compaction should cause at least one cache miss below 50%"
);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

There is a mismatch between the code logic and the assertion message. The code checks for any hit rate below 0.8 (80%), but the assertion message claims it should be below 50%.\n\nWe should update the assertion message to match the actual threshold of 80% used in the code.

    if strict() {
        assert!(
            has_significant_miss,
            "Compaction should cause at least one cache miss below 80%"
        );
    }

@HUQIANTAO HUQIANTAO force-pushed the feat/cache-guard-ci branch from e8512bf to 9d5f948 Compare June 1, 2026 12:57
Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HUQIANTAO has reached the 50-review limit for trial accounts. To continue receiving code reviews, upgrade your plan.

@Hmbown
Copy link
Copy Markdown
Owner

Hmbown commented Jun 1, 2026

Thank you for adding a cache guard proposal. This is a useful release-safety idea, and it fits the prefix-cache work we’ve been doing, but I’m not harvesting the current version into v0.8.50.

From the code, the test is self-contained and does not exercise CodeWhale’s actual request construction, tool catalog serialization, system prompt assembly, or compaction path, so it would not catch the regressions we most need this guard to catch. The generated dialogue bodies also replace history each turn rather than accumulating realistic conversation history, and CI lint is red right now.

The version I’d be excited to merge would drive the real serialization path (or a narrow extracted helper used by it), snapshot the stable prefix-bearing bytes, and keep the env-gated strict mode. That would turn this from a simulation into an actual release guard.

Add a CI guard test that verifies prefix-cache stability across
multi-turn conversations.

The test runs 8 test cases × 14-24 turns each:
- plain-dialogue (14 turns, with/without reasoning)
- long-dialogue (18 turns)
- mixed-message-sizes (20 turns)
- tool-loop (14 turns, with/without reasoning)
- long-tool-loop (24 turns, with/without reasoning)
- compaction-must-cause-at-least-one-miss (30 turns)

Environment variables:
- CODEWHALE_CACHE_GUARD=1: Enable the guard (default: disabled)
- CODEWHALE_CACHE_GUARD_THRESHOLD=40: Hit rate threshold (0-100)
- CODEWHALE_CACHE_GUARD_STRICT=1: Fail on threshold violation

Usage:
  CODEWHALE_CACHE_GUARD=1 cargo test --test cache_guard
  CODEWHALE_CACHE_GUARD=1 CODEWHALE_CACHE_GUARD_STRICT=1 cargo test --test cache_guard

The mock simulates DeepSeek's server-side prefix cache behavior
using byte-prefix matching. The default threshold (40%) is calibrated
for the mock; real CI should use CODEWHALE_CACHE_GUARD_THRESHOLD=90
for production-quality validation.

9 tests covering:
- 8 multi-turn conversation scenarios
- 1 compaction behavior verification
@HUQIANTAO HUQIANTAO force-pushed the feat/cache-guard-ci branch from 9d5f948 to 7ba91f1 Compare June 1, 2026 13:14
Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HUQIANTAO has reached the 50-review limit for trial accounts. To continue receiving code reviews, upgrade your plan.

@Hmbown
Copy link
Copy Markdown
Owner

Hmbown commented Jun 2, 2026

Hey @HUQIANTAO — the Cache Guard CI test has been harvested into the v0.8.50 branch (#2504)! The env-gated design (CODEWHALE_CACHE_GUARD=1) is smart — zero overhead for normal CI runs but available when you need to debug prefix-cache regressions. Clean work, thank you! 🐋

@HUQIANTAO
Copy link
Copy Markdown
Contributor Author

Closing: this slice was harvested upstream (per the maintainer comments) — the work is in main, no need to keep the open PR alive. Thanks for the review!

@HUQIANTAO HUQIANTAO closed this Jun 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants