Improve MCP tool descriptions for AI agent reliability #130
Open
Conversation
E2E Test Results (commit a350bd0)
…scoverability

Add explicit WHEN TO USE guidance to help AI agents understand when to use list_secured_clusters vs CVE-specific tools. This reduces confusion and improves test reliability when agents need to choose between listing clusters and checking for vulnerabilities.

Why: E2E tests showed occasional confusion where agents would use the wrong tool for CVE queries. Clearer descriptions help agents make correct decisions.
Changed from "call ALL THREE tools" to a conditional approach:
1. Always call get_deployments_for_cve FIRST.
2. If deployments are found: STOP (most CVEs are here).
3. If NO deployments are found: then call the orchestrator + nodes tools.

Why: Reduces unnecessary tool calls while maintaining comprehensive checking. The agent now makes 1 call for most tests (when the CVE is found in deployments) and 3 calls only when needed (CVE not in deployments).

How to apply: This should reduce typical tool calls from always-3 to 1-or-3 depending on results, fitting better within maxToolCalls limits and reducing flakiness from unnecessary extra calls.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This reverts commit 97c8274.
Changed "call ALL THREE tools" to "call ALL THREE tools exactly once each, then STOP". Added "Do NOT make verification calls or check twice" to prevent extra calls.

Why: Agents sometimes made 4-5 tool calls (3 CVE tools + verification/exploration), exceeding maxToolCalls limits in e2e tests and causing a ~10-20% failure rate.

How to apply: Explicit STOP instructions aim to cap tool usage at exactly 3 calls for general CVE questions, fitting within maxToolCalls=3-5 test constraints. Testing shows 80% reliability (4/5 passes); the improvement from descriptions alone appears limited. We may need to accept inherent LLM variability or adjust test limits.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add a section to e2e-tests/README.md explaining how to run mcpchecker tests using the Claude Code CLI instead of the default OpenAI agent.

Includes:
- Prerequisites and setup instructions
- Example agent configuration for Opus/Sonnet/Haiku
- Test results showing 100% reliability with Opus and Sonnet, 90% with Haiku
- Notes about the gitignored mcpchecker directory

Why: Enables testing the MCP server with the same Claude models that end users interact with, validating that tool descriptions work correctly in production.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Test results can vary over time and shouldn't be documented in README.
❌ 2 Tests Failed (2 ❄️ flaky tests)
Break long lines to stay under 120 character limit for linter.
Summary
Enhanced tool descriptions with explicit usage patterns and STOP instructions to improve e2e test reliability with Claude models. Achieved 100% reliability with Opus/Sonnet and 90%+ with Haiku.
Problem
E2e tests with Claude CLI agents showed agents making excessive tool calls, which exceeded maxToolCalls=3-5 constraints and reduced reliability.

Solution
1. CVE Tool Descriptions (clusters.go, deployments.go, nodes.go)
Added explicit STOP instructions for two usage patterns:
USAGE PATTERNS:
Call ALL THREE CVE tools exactly once each, then STOP and provide answer.
Do NOT make verification calls or check twice.
Why: Prevents agents from making verification calls after getting initial results.
2. list_secured_clusters Description (tools.go)
Added WHEN TO USE guidance:
WHEN TO USE: Use this tool when the user asks to see or list all clusters.
IMPORTANT: Do NOT use this tool when checking for CVE vulnerabilities - use the CVE-specific tools instead.
Why: Prevents agents from calling list_secured_clusters when CVE-specific tools already filter by cluster.

3. Documentation (e2e-tests/README.md)
Added a "Running Tests with Claude Code" section covering prerequisites, example agent configuration for Opus/Sonnet/Haiku, and notes on the gitignored mcpchecker directory.
Test Results
Comprehensive testing with Claude CLI agents (20 runs each):