docs: improve test agent autonomy and orchestration guidance #11561

NeOMakinG · 2026-01-02T11:22:33Z

Summary

Improves the test agent documentation to reduce the need for "babysitting" during testing sessions and adds comprehensive guidance on using subagents for orchestration during complex multi-feature testing.

Problem

During the release v1.993.0 testing session, the test agent required intermediate guidance and direction to complete all requested sub-tasks. The agent would stop mid-execution to ask for permission or clarification on obvious next steps, rather than completing the entire scope autonomously.

Solution

This PR makes three additions to .claude/commands/test-agent.md:

1. Autonomy and Task Completion Principles

New Sections:

Complete All Sub-Tasks Before Reporting: Clear expectations for executing ALL requested tasks before reporting
Understanding Scope: How to parse multi-part requests and plan execution
Making Autonomous Decisions: What to decide independently vs when to ask
Execution Discipline: Using TodoWrite for tracking without mid-stream reporting

Anti-Patterns to Avoid:

🚫 Asking permission for obvious next steps
🚫 Mid-stream progress reports
🚫 Stopping when encountering issues
🚫 Asking what to test or requiring clarification on standard procedures
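
The principles and anti-patterns above boil down to "run every sub-task, record problems instead of stopping, and report once at the end". A minimal Python sketch of that control flow follows. It is an illustration only and not part of test-agent.md; `run_all_then_report`, `check_swap`, and `check_send` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TaskResult:
    name: str
    passed: bool
    detail: str = ""


def run_all_then_report(tasks: dict[str, Callable[[], None]]) -> str:
    """Run every sub-task, keep going on failure, and build one final report."""
    results: list[TaskResult] = []
    for name, task in tasks.items():
        try:
            task()
            results.append(TaskResult(name, True))
        except Exception as exc:
            # A failing sub-task is recorded, not treated as a reason to stop.
            results.append(TaskResult(name, False, str(exc)))

    lines = []
    for r in results:
        status = "PASS" if r.passed else "FAIL"
        suffix = f": {r.detail}" if r.detail else ""
        lines.append(f"{status} {r.name}{suffix}")
    passed = sum(1 for r in results if r.passed)
    lines.append(f"{passed}/{len(results)} sub-tasks passed")
    return "\n".join(lines)


# Hypothetical sub-tasks standing in for real test steps.
def check_swap() -> None:
    pass


def check_send() -> None:
    raise AssertionError("balance did not update")


report = run_all_then_report({"swap": check_swap, "send": check_send, "ui": check_swap})
print(report)
```

The single `report` string is produced only after the loop finishes, which mirrors the "no mid-stream progress reports" rule.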

2. Orchestration with Subagents

New Section: Guidance on when and how to use subagents for orchestration:

When to Use Subagents:

Testing release PRs with multiple features
Comprehensive feature validation across multiple areas
Long-running test sessions that may hit context limits
Complex testing requiring specialized expertise

Orchestration Pattern:

Parse the full testing scope from user request
Break down into logical testing domains (e.g., swaps, sends, UI, performance)
Launch specialized subagents for each domain using Task tool
Monitor subagent progress and results
Aggregate findings into comprehensive report
Post final report when all subagents complete
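
The six steps above can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the agent's actual mechanism: `run_subagent` is a hypothetical placeholder for launching a subagent through the Task tool, and the "domain: item" scope format is invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_domains(scope: list[str]) -> dict[str, list[str]]:
    """Group 'domain: item' scope entries by their domain prefix."""
    domains: dict[str, list[str]] = {}
    for entry in scope:
        domain, _, item = entry.partition(":")
        domains.setdefault(domain.strip(), []).append(item.strip())
    return domains


def run_subagent(domain: str, items: list[str]) -> dict:
    # Hypothetical placeholder: a real orchestrator would launch a subagent here.
    return {"domain": domain, "checked": len(items), "issues": []}


def orchestrate(scope: list[str]) -> str:
    domains = split_into_domains(scope)  # parse and break down
    with ThreadPoolExecutor(max_workers=max(1, len(domains))) as pool:
        # Launch one worker per domain in parallel.
        futures = {d: pool.submit(run_subagent, d, items) for d, items in domains.items()}
        # Wait for every subagent before aggregating anything.
        results = [futures[d].result() for d in domains]

    total = sum(r["checked"] for r in results)
    lines = [f"{len(results)} domains, {total} checks"]
    for r in results:
        lines.append(f"- {r['domain']}: {r['checked']} checked, {len(r['issues'])} issues")
    return "\n".join(lines)  # one final report


final_report = orchestrate(["swaps: slippage", "swaps: quote refresh", "sends: TRX transfer"])
print(final_report)
```

All domains are submitted before any result is read, so the launches run in parallel rather than sequentially, and the report is assembled only once every result is in.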

Benefits:

Parallel test execution for faster results
Specialized expertise per domain (frontend, security, performance)
Better context management (each subagent has fresh context)
Clearer separation of concerns
Easier to debug individual test domains

3. Comprehensive Examples

Example 1: Direct Execution (Single Feature)

Testing swap slippage UI
Shows focused, single-feature testing approach

Example 2: Orchestration (Release PR Testing)

Testing release v1.993.0 with 5 PRs
Demonstrates domain breakdown (Tron, assets, HyperEVM, Thor/Maya, Ledger)
Shows parallel subagent launches with specific prompts
Demonstrates result aggregation and comprehensive reporting

Updated Anti-Patterns:

Direct execution anti-patterns (for single-feature testing)
Orchestration anti-patterns (for multi-PR release testing)

Key Principles Established

Direct Execution: Simple, focused tests on single features - execute directly
Orchestration: Complex, multi-domain release testing - use parallel subagents
No Mid-Stream Check-Ins: Complete entire scope before reporting
Autonomous Decision-Making: Choose test parameters independently
TodoWrite for Tracking: Track progress internally, not for user updates

Benefits

Reduced Babysitting: Test agent operates autonomously without requiring intermediate direction
Better Scalability: Subagent orchestration enables parallel testing of complex releases
Clearer Patterns: Examples show both direct execution and orchestration approaches
Improved Context Management: Subagents help avoid hitting context limits during long test sessions
Higher Quality Testing: Specialized subagents for different domains (frontend, security, etc.)

Testing

N/A - Documentation-only change. Future testing sessions will validate the improved guidance.

Summary by CodeRabbit

Documentation
- Updated test agent documentation with comprehensive guidance on autonomous task execution and decision-making workflows.
- Added detailed examples and orchestration patterns for improved clarity on agent capabilities.
- Included CLI command reference for easier agent usage.


Adds comprehensive guidance for autonomous testing without requiring "babysitting":

**New Sections Added**:
- Autonomy and Task Completion Principles
  - Complete all sub-tasks before reporting
  - Understanding scope and planning
  - Making autonomous decisions
  - Execution discipline with TodoWrite tracking
- Anti-Patterns to Avoid
  - Don't ask permission for obvious next steps
  - Don't provide mid-stream progress reports
  - Don't stop when encountering issues
  - Don't ask what to test or require clarification on standard procedures
- Example: Autonomous Testing Session
  - Real-world example based on HyperEVM testing
  - Shows correct vs incorrect autonomous execution
  - Demonstrates TodoWrite usage for task tracking

**Updated Workflow**:
- Emphasizes parsing full scope before starting
- TodoWrite for tracking progress (not reporting)
- Execute ALL tests before reporting
- Report only once when entire scope is complete

**Context**: Based on release v1.993.0 testing session where guidance was needed to complete all requested sub-tasks. This update ensures future test sessions run autonomously without requiring intermediate direction.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Adds comprehensive guidance on using subagents for orchestration during complex multi-feature testing, especially for release PRs.

**New Section: Orchestration with Subagents**
- When to use subagents vs direct execution
- Orchestration pattern (parse, break down, launch, monitor, aggregate, report)
- Benefits of subagent approach (parallel execution, specialized expertise, better context)
- Example orchestration for multi-PR release testing
- Guidance on crafting subagent task prompts

**Updated Examples**
- Example 1: Direct execution for single-feature testing (swap slippage UI)
- Example 2: Orchestration approach for release PR testing (v1.993.0)
  - Shows domain breakdown (Tron, assets, HyperEVM, Thor/Maya, Ledger)
  - Demonstrates parallel subagent launches
  - Shows aggregation and comprehensive reporting

**Updated Anti-Patterns**
- Direct execution anti-patterns (asking permission, mid-stream reports)
- Orchestration anti-patterns (sequential launches, partial reporting)

**Key Principle**:
- Direct execution: Simple, focused tests on single features
- Orchestration: Complex, multi-domain release testing with parallel subagents

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

coderabbitai · 2026-01-02T11:22:48Z

📝 Walkthrough

Extensive documentation updates to the test agent guidance file, adding new sections on autonomy principles, scope understanding, autonomous decision-making, subagent orchestration patterns, execution discipline, anti-pattern examples, and refined workflow instructions for comprehensive test agent behavior.

Changes

Cohort / File(s)	Summary
Test Agent Documentation `.claude/commands/test-agent.md`	Added comprehensive guidance sections: Autonomy and Task Completion Principles with explicit Do/Don't guidelines; Understanding Scope workflow (parsing, task lists, planning, dependencies); Making Autonomous Decisions guidelines; Execution Discipline with subagent orchestration patterns; Anti-Patterns section with Bad/Good examples; extended Available Tools section; updated Test Scenario Bank and running workflow to emphasize a parse-plan-execute-validate-report cadence; large example Autonomous Testing Session with Direct Execution and Orchestration scenarios; new How to Use This Agent CLI commands section.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Hops with glee through test commands new,
Autonomy and scope — now crystal true,
Subagents coordinating with finesse,
No more mid-stream chaos, just success!
This rabbit grins: orchestration's blessed! ✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The pull request title 'docs: improve test agent autonomy and orchestration guidance' accurately reflects the main changes: extensive additions to test agent documentation focusing on autonomy principles and orchestration with subagents.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.


coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

.claude/commands/test-agent.md (1)
109-109: Add language specifiers to fenced code blocks for consistency.

Several code blocks lack language specifiers. While readability isn't materially affected, adding specifiers improves Markdown linting compliance. For pseudo-code examples, use a generic language like plaintext or bash as appropriate.
🔎 Example fixes
-Example Orchestration:
-```
+Example Orchestration:
+```plaintext
-[Launching subagent 1: Tron Testing]
-Prompt: "Test Tron TX parsing fixes...
+[Launching subagent 1: Tron Testing]
+```plaintext
+Prompt: "Test Tron TX parsing fixes...
+```
Also applies to: 472-472, 498-498, 515-515, 537-537, 546-546
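
To locate the affected fences without running the linter, something like the Python sketch below could be used. It is a simplification and not markdownlint's MD040 implementation: it only handles backtick fences, ignores tilde fences and fence-length matching, and `unlabeled_fences` is a hypothetical helper name.

```python
FENCE = "`" * 3  # built at runtime to avoid a literal fence inside this example


def unlabeled_fences(markdown: str) -> list[int]:
    """Return 1-based line numbers of opening fences that have no language."""
    hits: list[int] = []
    in_block = False
    for number, line in enumerate(markdown.splitlines(), start=1):
        stripped = line.strip()
        if not stripped.startswith(FENCE):
            continue
        if in_block:
            in_block = False  # this fence closes the current block
        else:
            in_block = True
            # Nothing after the backticks means no language specifier.
            if stripped.lstrip("`").strip() == "":
                hits.append(number)
    return hits


sample = "\n".join(["Intro", FENCE, "plain text", FENCE, FENCE + "bash", "ls", FENCE])
print(unlabeled_fences(sample))
```

Each reported line number is an opening fence where a specifier such as plaintext or bash could be appended.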

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 3ec9533 and 5c6de4e.

📒 Files selected for processing (1)

.claude/commands/test-agent.md

🧰 Additional context used

🧠 Learnings (3)

📓 Common learnings

Learnt from: NeOMakinG
Repo: shapeshift/web PR: 10323
File: src/pages/RFOX/components/Stake/components/StakeSummary.tsx:112-114
Timestamp: 2025-08-22T13:00:44.879Z
Learning: NeOMakinG prefers to keep PR changes minimal and focused on the core objectives, avoiding cosmetic or defensive code improvements that aren't directly related to the PR scope, even when they would improve robustness.

Learnt from: NeOMakinG
Repo: shapeshift/web PR: 10128
File: .cursor/rules/error-handling.mdc:266-274
Timestamp: 2025-07-29T10:35:22.059Z
Learning: NeOMakinG prefers less nitpicky suggestions on documentation and best practices files, finding overly detailed suggestions on minor implementation details (like console.error vs logger.error) too granular for cursor rules documentation.

Learnt from: NeOMakinG
Repo: shapeshift/web PR: 10380
File: src/pages/Dashboard/components/AccountList/AccountTable.tsx:60-0
Timestamp: 2025-09-02T08:34:08.157Z
Learning: NeOMakinG prefers code review comments to focus only on actual PR changes, not pre-existing code issues, unless there are critical security or correctness concerns directly related to the new functionality.

Learnt from: NeOMakinG
Repo: shapeshift/web PR: 10234
File: src/components/MultiHopTrade/hooks/useGetTradeQuotes/hooks/useTrackTradeQuotes.ts:42-86
Timestamp: 2025-08-08T11:41:22.794Z
Learning: NeOMakinG prefers not to include refactors in move-only PRs; such suggestions should be deferred to follow-up issues instead of being applied within the same PR.

Learnt from: NeOMakinG
Repo: shapeshift/web PR: 10380
File: src/components/Table/Table.theme.ts:177-180
Timestamp: 2025-09-02T12:38:46.940Z
Learning: NeOMakinG prefers to defer technical debt and CSS correctness issues (like improper hover selectors) to follow-up PRs when the current PR is already large and focused on major feature implementation, even when the issues are valid from a usability/technical perspective.

📚 Learning: 2025-11-24T21:20:04.979Z

Learnt from: CR
Repo: shapeshift/web PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-11-24T21:20:04.979Z
Learning: Applies to **/*.test.{ts,tsx,js,jsx} : Write tests for critical business logic

Applied to files:

.claude/commands/test-agent.md

📚 Learning: 2025-11-24T21:20:04.979Z

Learnt from: CR
Repo: shapeshift/web PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-11-24T21:20:04.979Z
Learning: Applies to **/*.test.{ts,tsx,js,jsx} : Use descriptive test names that explain behavior

Applied to files:

.claude/commands/test-agent.md

🪛 markdownlint-cli2 (0.18.1)

.claude/commands/test-agent.md

109-109: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

472-472: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

498-498: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

515-515: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

537-537: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

546-546: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🔇 Additional comments (3)

.claude/commands/test-agent.md (3)

18-175: Excellent clarity on autonomous operation and orchestration patterns.

The new sections on autonomy principles, subagent orchestration, and anti-patterns are well-structured and provide concrete, actionable guidance. The Do/Don't patterns and task breakdown examples effectively communicate expectations for agent behavior without ambiguity.

198-208: Workflow steps align well with autonomy principles.

The updated workflow section reinforces the "parse full scope" → "execute all" → "report once" model introduced in the autonomy section. The progression is logical and the emphasis on completing entire scope before reporting is consistent.

441-552: Examples effectively illustrate autonomous and orchestrated testing workflows.

The Direct Execution and Orchestration examples are concrete, realistic, and clearly demonstrate the expected behavior patterns. The subagent prompts and result aggregation format provide practical templates for implementation.

0xApotheosis

All looks sane. Gave the Claude skill a run as a sanity check and it works as expected. Get in!

NeOMakinG and others added 2 commits January 2, 2026 12:19

NeOMakinG requested a review from a team as a code owner January 2, 2026 11:22

coderabbitai bot reviewed Jan 2, 2026

Merge branch 'develop' into improve-test-agent-autonomy

ce21a13

0xApotheosis approved these changes Jan 5, 2026

0xApotheosis merged commit ac9065a into develop Jan 5, 2026
4 checks passed

0xApotheosis deleted the improve-test-agent-autonomy branch January 5, 2026 08:40

gomesalexandre mentioned this pull request Jan 5, 2026

chore: release v1.995.0 #11570

Merged
