Improve MCP tool descriptions for AI agent reliability#130

Open
janisz wants to merge 7 commits into main from fix_description

Conversation


@janisz janisz commented May 5, 2026

Summary

Enhanced tool descriptions with explicit usage patterns and STOP instructions to improve e2e test reliability with Claude models. Achieved
100% reliability with Opus/Sonnet and 90%+ with Haiku.

Problem

E2E tests with Claude CLI agents showed agents making excessive tool calls:

  • General CVE questions ("Is CVE-X detected?") triggered 4-5 tool calls instead of the expected 3
  • Agents would call all three CVE tools, then verify the results with additional calls
  • This exceeded the maxToolCalls=3-5 constraint and reduced reliability

Solution

1. CVE Tool Descriptions (clusters.go, deployments.go, nodes.go)

Added explicit STOP instructions for two usage patterns:

USAGE PATTERNS:

  1. For general CVE questions ('Is CVE-X detected in my clusters?'):
    Call ALL THREE CVE tools exactly once each, then STOP and provide answer.
    Do NOT make verification calls or check twice.
  2. For specific [deployment/node/orchestrator] questions: Use ONLY this tool once, then STOP.

Why: Prevents agents from making verification calls after getting initial results.
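As a rough sketch (not the repo's actual code), the wording could live in a Go string constant next to the tool definition in deployments.go; the package name, constant name, and opening summary line below are assumptions for illustration:

```go
package tools // package name is a placeholder for the sketch

// Hypothetical constant holding the description text for the CVE deployments
// tool; the constant name and the first summary line are assumptions,
// not the repository's actual source.
const getDeploymentsForCVEDescription = `Find deployments affected by a given CVE.

USAGE PATTERNS:
1. For general CVE questions ('Is CVE-X detected in my clusters?'):
   Call ALL THREE CVE tools exactly once each, then STOP and provide answer.
   Do NOT make verification calls or check twice.
2. For specific deployment questions: Use ONLY this tool once, then STOP.`
```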

2. list_secured_clusters Description (tools.go)

Added WHEN TO USE guidance:

WHEN TO USE: Use this tool when the user asks to see or list all clusters.
IMPORTANT: Do NOT use this tool when checking for CVE vulnerabilities - use the CVE-specific tools instead.

Why: Prevents agents from calling list_secured_clusters when CVE-specific tools already filter by cluster.
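A similar sketch for tools.go, again with a hypothetical package and constant name:

```go
package tools // package name is a placeholder for the sketch

// Hypothetical constant for the list_secured_clusters description, steering
// agents away from this tool for CVE queries; constant name and summary
// line are assumptions, not the repository's actual source.
const listSecuredClustersDescription = `List all secured clusters.

WHEN TO USE: Use this tool when the user asks to see or list all clusters.
IMPORTANT: Do NOT use this tool when checking for CVE vulnerabilities - use the CVE-specific tools instead.`
```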

3. Documentation (e2e-tests/README.md)

Added "Running Tests with Claude Code" section with:

  • Prerequisites and setup instructions
  • Agent configuration examples for Haiku, Sonnet, Opus
  • Test results table showing model reliability

Test Results

Comprehensive testing with Claude CLI agents (20 runs each):

Model  | Success Rate  | Pattern
Opus   | 100% (20/20)  | Perfect reliability
Sonnet | 100% (20/20)  | Perfect reliability
Haiku  | 90.0% (18/20) | Both failures in first 2 runs, then 18 consecutive passes


github-actions Bot commented May 5, 2026

E2E Test Results

Commit: a350bd0
Workflow Run: View Details
Artifacts: Download test results & logs

=== Evaluation Summary ===

  ✓ list-clusters (assertions: 3/3)
  ✓ cve-cluster-does-not-exist (assertions: 3/3)
  ✓ cve-detected-workloads (assertions: 3/3)
  ✓ cve-cluster-does-exist (assertions: 3/3)
  ✓ cve-clusters-general (assertions: 3/3)
  ✓ rhsa-not-supported (assertions: 2/2)
  ✓ cve-detected-clusters (assertions: 3/3)
  ✓ cve-multiple (assertions: 3/3)
  ✓ cve-log4shell (assertions: 3/3)
  ✓ cve-nonexistent (assertions: 3/3)
  ✓ cve-cluster-list (assertions: 3/3)

Tasks:      11/11 passed (100.00%)
Assertions: 32/32 passed (100.00%)
Tokens:     ~66874 (estimate - excludes system prompt & cache)
MCP schemas: ~13145 (included in token total)
Agent used tokens:
  Input:  19317 tokens
  Output: 26821 tokens
Judge used tokens:
  Input:  34472 tokens
  Output: 34006 tokens

janisz and others added 6 commits May 5, 2026 12:58
…scoverability

Add explicit WHEN TO USE guidance to help AI agents understand when to use
list_secured_clusters vs CVE-specific tools. This reduces confusion and
improves test reliability when agents need to choose between listing clusters
and checking for vulnerabilities.

Why: E2E tests showed occasional confusion where agents would use the wrong
tool for CVE queries. Clearer descriptions help agents make correct decisions.
Changed from "call ALL THREE tools" to conditional approach:
1. Always call get_deployments_for_cve FIRST
2. If deployments found: STOP (most CVEs are here)
3. If NO deployments: Then call orchestrator + nodes tools

Why: Reduces unnecessary tool calls while maintaining comprehensive
checking. Agent now makes 1 call for most tests (when CVE found in
deployments) and 3 calls only when needed (CVE not in deployments).

How to apply: This should reduce typical tool calls from always-3
to 1-or-3 depending on results, fitting better within maxToolCalls
limits and reducing flakiness from unnecessary extra calls.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Changed "call ALL THREE tools" to "call ALL THREE tools exactly once each, then STOP".
Added "Do NOT make verification calls or check twice" to prevent extra calls.

Why: Agents sometimes made 4-5 tool calls (3 CVE tools + verification/exploration),
exceeding maxToolCalls limits in e2e tests and causing ~10-20% failure rate.

How to apply: Explicit STOP instructions aim to cap tool usage at exactly 3 calls
for general CVE questions, fitting within maxToolCalls=3-5 test constraints.

Testing shows 80% reliability (4/5 passes) - improvement from descriptions alone
appears limited. May need to accept inherent LLM variability or adjust test limits.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Add section to e2e-tests/README.md explaining how to run mcpchecker
tests using Claude Code CLI instead of the default OpenAI agent.

Includes:
- Prerequisites and setup instructions
- Example agent configuration for Opus/Sonnet/Haiku
- Test results showing 100% reliability with Opus and Sonnet,
  90% with Haiku
- Notes about gitignored mcpchecker directory

Why: Enables testing MCP server with same Claude models that end users
interact with, validating that tool descriptions work correctly in production.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Test results can vary over time and shouldn't be documented in the README.

codecov-commenter commented May 5, 2026

❌ 2 Tests Failed:

Tests completed | Failed | Passed | Skipped
361             | 2      | 359    | 12
View the full list of 2 ❄️ flaky test(s)
::policy 1

Flake rate in main: 100.00% (Passed 0 times, Failed 36 times)

Stack Traces | 0s run time
- test violation 1
- test violation 2
- test violation 3
::policy 4

Flake rate in main: 100.00% (Passed 0 times, Failed 36 times)

Stack Traces | 0s run time
- testing multiple alert violation messages 1
- testing multiple alert violation messages 2
- testing multiple alert violation messages 3


Break long lines to stay under the 120-character limit for the linter.
janisz force-pushed the fix_description branch from aa7ad22 to a350bd0 on May 5, 2026 11:18