Improve MCP tool descriptions for AI agent reliability#130

Open
janisz wants to merge 7 commits into main from fix_description

Conversation


@janisz janisz commented May 5, 2026

Summary

Enhanced tool descriptions with explicit usage patterns and STOP instructions to improve e2e test reliability with Claude models. Achieved
100% reliability with Opus/Sonnet and 90%+ with Haiku.

Problem

E2E tests with Claude CLI agents showed agents making excessive tool calls:

  • General CVE questions ("Is CVE-X detected?") triggered 4-5 tool calls instead of the expected 3
  • Agents would call all three CVE tools, then verify the results with additional calls
  • This exceeded the maxToolCalls=3-5 constraint and reduced reliability

Solution

1. CVE Tool Descriptions (clusters.go, deployments.go, nodes.go)

Added explicit STOP instructions for two usage patterns:

USAGE PATTERNS:

  1. For general CVE questions ('Is CVE-X detected in my clusters?'):
    Call ALL THREE CVE tools exactly once each, then STOP and provide answer.
    Do NOT make verification calls or check twice.
  2. For specific [deployment/node/orchestrator] questions: Use ONLY this tool once, then STOP.

Why: Prevents agents from making verification calls after getting initial results.
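As a rough sketch (not the repo's actual code), the wording could live in a Go string constant next to the tool definition in deployments.go; the package name, constant name, and opening summary line below are assumptions for illustration:

```go
package tools // package name is a placeholder for the sketch

// Hypothetical constant holding the description text for the CVE deployments
// tool; the constant name and the first summary line are assumptions,
// not the repository's actual source.
const getDeploymentsForCVEDescription = `Find deployments affected by a given CVE.

USAGE PATTERNS:
1. For general CVE questions ('Is CVE-X detected in my clusters?'):
   Call ALL THREE CVE tools exactly once each, then STOP and provide answer.
   Do NOT make verification calls or check twice.
2. For specific deployment questions: Use ONLY this tool once, then STOP.`
```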

2. list_secured_clusters Description (tools.go)

Added WHEN TO USE guidance:

WHEN TO USE: Use this tool when the user asks to see or list all clusters.
IMPORTANT: Do NOT use this tool when checking for CVE vulnerabilities - use the CVE-specific tools instead.

Why: Prevents agents from calling list_secured_clusters when CVE-specific tools already filter by cluster.
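A similar sketch for tools.go, again with a hypothetical package and constant name:

```go
package tools // package name is a placeholder for the sketch

// Hypothetical constant for the list_secured_clusters description, steering
// agents away from this tool for CVE queries; constant name and summary
// line are assumptions, not the repository's actual source.
const listSecuredClustersDescription = `List all secured clusters.

WHEN TO USE: Use this tool when the user asks to see or list all clusters.
IMPORTANT: Do NOT use this tool when checking for CVE vulnerabilities - use the CVE-specific tools instead.`
```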

3. Documentation (e2e-tests/README.md)

Added "Running Tests with Claude Code" section with:

  • Prerequisites and setup instructions
  • Agent configuration examples for Haiku, Sonnet, Opus
  • Test results table showing model reliability

Test Results

Comprehensive testing with Claude CLI agents (20 runs each):

Model  | Success Rate  | Pattern
Opus   | 100% (20/20)  | Perfect reliability
Sonnet | 100% (20/20)  | Perfect reliability
Haiku  | 90.0% (18/20) | Both failures in first 2 runs, then 18 consecutive passes


github-actions Bot commented May 5, 2026

E2E Test Results

Commit: a350bd0
Workflow Run: View Details
Artifacts: Download test results & logs

=== Evaluation Summary ===

  ✓ list-clusters (assertions: 3/3)
  ✓ cve-cluster-does-not-exist (assertions: 3/3)
  ✓ cve-detected-workloads (assertions: 3/3)
  ✓ cve-cluster-does-exist (assertions: 3/3)
  ✓ cve-clusters-general (assertions: 3/3)
  ✓ rhsa-not-supported (assertions: 2/2)
  ✓ cve-detected-clusters (assertions: 3/3)
  ✓ cve-multiple (assertions: 3/3)
  ✓ cve-log4shell (assertions: 3/3)
  ✓ cve-nonexistent (assertions: 3/3)
  ✓ cve-cluster-list (assertions: 3/3)

Tasks:      11/11 passed (100.00%)
Assertions: 32/32 passed (100.00%)
Tokens:     ~66874 (estimate - excludes system prompt & cache)
MCP schemas: ~13145 (included in token total)
Agent used tokens:
  Input:  19317 tokens
  Output: 26821 tokens
Judge used tokens:
  Input:  34472 tokens
  Output: 34006 tokens

janisz and others added 6 commits May 5, 2026 12:58
…scoverability

Add explicit WHEN TO USE guidance to help AI agents understand when to use
list_secured_clusters vs CVE-specific tools. This reduces confusion and
improves test reliability when agents need to choose between listing clusters
and checking for vulnerabilities.

Why: E2E tests showed occasional confusion where agents would use the wrong
tool for CVE queries. Clearer descriptions help agents make correct decisions.
Changed from "call ALL THREE tools" to conditional approach:
1. Always call get_deployments_for_cve FIRST
2. If deployments found: STOP (most CVEs are here)
3. If NO deployments: Then call orchestrator + nodes tools

Why: Reduces unnecessary tool calls while maintaining comprehensive
checking. Agent now makes 1 call for most tests (when CVE found in
deployments) and 3 calls only when needed (CVE not in deployments).

How to apply: This should reduce typical tool calls from always-3
to 1-or-3 depending on results, fitting better within maxToolCalls
limits and reducing flakiness from unnecessary extra calls.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Changed "call ALL THREE tools" to "call ALL THREE tools exactly once each, then STOP".
Added "Do NOT make verification calls or check twice" to prevent extra calls.

Why: Agents sometimes made 4-5 tool calls (3 CVE tools + verification/exploration),
exceeding maxToolCalls limits in e2e tests and causing ~10-20% failure rate.

How to apply: Explicit STOP instructions aim to cap tool usage at exactly 3 calls
for general CVE questions, fitting within maxToolCalls=3-5 test constraints.

Testing shows 80% reliability (4/5 passes) - improvement from descriptions alone
appears limited. May need to accept inherent LLM variability or adjust test limits.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Add section to e2e-tests/README.md explaining how to run mcpchecker
tests using Claude Code CLI instead of the default OpenAI agent.

Includes:
- Prerequisites and setup instructions
- Example agent configuration for Opus/Sonnet/Haiku
- Test results showing 100% reliability with Opus and Sonnet,
  90% with Haiku
- Notes about gitignored mcpchecker directory

Why: Enables testing MCP server with same Claude models that end users
interact with, validating that tool descriptions work correctly in production.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Test results can vary over time and shouldn't be documented in the README.

codecov-commenter commented May 5, 2026

❌ 2 Tests Failed:

Tests completed | Failed | Passed | Skipped
361             | 2      | 359    | 12
View the full list of 2 ❄️ flaky test(s)
::policy 1

Flake rate in main: 100.00% (Passed 0 times, Failed 36 times)

Stack Traces | 0s run time
- test violation 1
- test violation 2
- test violation 3
::policy 4

Flake rate in main: 100.00% (Passed 0 times, Failed 36 times)

Stack Traces | 0s run time
- testing multiple alert violation messages 1
- testing multiple alert violation messages 2
- testing multiple alert violation messages 3


Break long lines to stay under the 120-character limit for the linter.
janisz force-pushed the fix_description branch from aa7ad22 to a350bd0 on May 5, 2026 11:18