A 4-stage evaluation framework for testing Claude Code plugin component triggering. Validates whether skills, agents, commands, hooks, and MCP servers correctly activate when expected.
Claude Code plugins contain multiple component types (skills, agents, commands) that trigger based on user prompts. Testing these triggers manually is time-consuming and error-prone. This framework automates the entire evaluation process:
- Discovers all components in your plugin
- Generates test scenarios (positive and negative cases)
- Executes scenarios against the Claude Agent SDK
- Evaluates whether the correct component triggered
| Feature | Description |
|---|---|
| 4-Stage Pipeline | Analysis → Generation → Execution → Evaluation |
| Multi-Component | Skills, agents, commands, hooks, and MCP servers |
| Programmatic Detection | 100% confidence detection by parsing tool captures |
| Semantic Testing | Synonym and paraphrase variations to test robustness |
| Resume Capability | Checkpoint after each stage, resume interrupted runs |
| Cost Estimation | Token and USD estimates before execution |
| Batch API Support | 50% cost savings on large runs via Anthropic Batches API |
| Multiple Formats | JSON, YAML, JUnit XML, TAP output |
- Node.js >= 20.0.0
- An Anthropic API key
```bash
# Clone the repository
git clone https://github.com/sjnims/cc-plugin-eval.git
cd cc-plugin-eval

# Install dependencies
npm install

# Build
npm run build

# Create .env file with your API key
echo "ANTHROPIC_API_KEY=sk-ant-your-key-here" > .env
```

```bash
# See cost estimate without running (recommended first)
npx cc-plugin-eval run -p ./path/to/your/plugin --dry-run

# Run full evaluation
npx cc-plugin-eval run -p ./path/to/your/plugin
```

```mermaid
flowchart LR
    subgraph Input
        P[Plugin Directory]
    end
    subgraph Pipeline
        S1[**Stage 1: Analysis**<br/>Parse plugin structure,<br/>extract triggers]
        S2[**Stage 2: Generation**<br/>Create test scenarios<br/>positive & negative]
        S3[**Stage 3: Execution**<br/>Run scenarios via<br/>Agent SDK]
        S4[**Stage 4: Evaluation**<br/>Detect triggers,<br/>calculate metrics]
    end
    P --> S1 --> S2 --> S3 --> S4
    S1 --> O1[analysis.json]
    S2 --> O2[scenarios.json]
    S3 --> O3[transcripts/]
    S4 --> O4[evaluation.json]
```
| Stage | Purpose | Method | Output |
|---|---|---|---|
| 1. Analysis | Parse plugin structure, extract trigger phrases | Deterministic parsing | analysis.json |
| 2. Generation | Create test scenarios | LLM for skills/agents, deterministic for commands | scenarios.json |
| 3. Execution | Run scenarios against Claude Agent SDK | Tool capture hooks | transcripts/ |
| 4. Evaluation | Detect triggers, calculate metrics | Programmatic first, LLM judge for quality | evaluation.json |
For each component, the framework generates multiple scenario types to thoroughly test triggering:
| Type | Description | Example |
|---|---|---|
| `direct` | Exact trigger phrase | "create a skill" |
| `paraphrased` | Same intent, different words | "add a new skill to my plugin" |
| `edge_case` | Unusual but valid | "skill plz" |
| `negative` | Should NOT trigger | "tell me about database skills" |
| `semantic` | Synonym variations | "generate a skill" vs "create a skill" |
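For reference, a Stage 2 entry in `scenarios.json` for the skill example above might look roughly like this. This is an illustrative sketch only: apart from the scenario types in the table, the field names are assumptions, and the authoritative schema is whatever your own run writes to `scenarios.json`.

```json
{
  "scenario_id": "skill-create-direct-001",
  "type": "direct",
  "prompt": "create a skill",
  "should_trigger": true
}
```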
```bash
# Run complete evaluation
cc-plugin-eval run -p ./plugin

# With options
cc-plugin-eval run -p ./plugin \
  --config custom-config.yaml \
  --verbose \
  --samples 3
```

```bash
# Stage 1: Analysis only
cc-plugin-eval analyze -p ./plugin

# Stages 1-2: Analysis + Generation
cc-plugin-eval generate -p ./plugin

# Stages 1-3: Analysis + Generation + Execution
cc-plugin-eval execute -p ./plugin
```

```bash
# Resume an interrupted run
cc-plugin-eval resume -r <run-id>

# List previous runs
cc-plugin-eval list -p ./plugin

# Generate report from existing results
cc-plugin-eval report -r <run-id> --output junit-xml
```

| Option | Description |
|---|---|
| `-p, --plugin <path>` | Plugin directory path |
| `-c, --config <path>` | Config file (default: `config.yaml`) |
| `--dry-run` | Generate scenarios without execution |
| `--estimate` | Show cost estimate before execution |
| `--verbose` | Enable debug output |
| `--fast` | Only run previously failed scenarios |
| `--no-batch` | Force synchronous (non-batch) execution |
| `--rewind` | Undo file changes after each scenario |
| `--semantic` | Enable semantic variation testing |
| `--samples <n>` | Multi-sample judgment count |
| `--reps <n>` | Repetitions per scenario |
| `--output <format>` | Output format: `json`, `yaml`, `junit-xml`, `tap` |
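For example, combining several of these flags in one run:

```bash
# Estimate cost, include semantic variations, and emit a JUnit XML report
cc-plugin-eval run -p ./plugin --estimate --semantic --output junit-xml
```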
In addition to the CLI, cc-plugin-eval exports a programmatic API for integration into build systems, test frameworks, and custom tooling.
```bash
npm install cc-plugin-eval
```

```typescript
import {
  runAnalysis,
  runGeneration,
  runExecution,
  runEvaluation,
  loadConfigWithOverrides,
  consoleProgress,
} from "cc-plugin-eval";
import type {
  EvalConfig,
  AnalysisOutput,
  TestScenario,
} from "cc-plugin-eval/types";

// Load configuration
const config = loadConfigWithOverrides("config.yaml", {
  plugin: "./path/to/plugin",
});

// Stage 1: Analyze plugin structure
const analysis = await runAnalysis(config);

// Stage 2: Generate test scenarios
const generation = await runGeneration(analysis, config);

// Stage 3: Execute scenarios (captures tool interactions)
const execution = await runExecution(
  analysis,
  generation.scenarios,
  config,
  consoleProgress, // or provide custom progress callbacks
);

// Stage 4: Evaluate results
const evaluation = await runEvaluation(
  analysis.plugin_name,
  generation.scenarios,
  execution.results,
  config,
  consoleProgress,
);

console.log(`Accuracy: ${(evaluation.metrics.accuracy * 100).toFixed(1)}%`);
```

| Export | Description |
|---|---|
| `runAnalysis` | Stage 1: Parse plugin structure and extract triggers |
| `runGeneration` | Stage 2: Generate test scenarios for components |
| `runExecution` | Stage 3: Execute scenarios and capture tool interactions |
| `runEvaluation` | Stage 4: Evaluate results and calculate metrics |
| `loadConfigWithOverrides` | Load configuration with CLI-style overrides |
| `consoleProgress` | Default progress reporter (console output) |
Import types via the `cc-plugin-eval/types` subpath:

```typescript
import type {
  EvalConfig,
  AnalysisOutput,
  TestScenario,
  ExecutionResult,
  EvaluationResult,
  EvalMetrics,
} from "cc-plugin-eval/types";
```

Configuration is managed via `config.yaml`. Here's a quick reference:
```yaml
scope:
  skills: true          # Evaluate skill components
  agents: true          # Evaluate agent components
  commands: true        # Evaluate command components
  hooks: false          # Evaluate hook components
  mcp_servers: false    # Evaluate MCP server components
```

```yaml
generation:
  model: "claude-sonnet-4-5-20250929"
  scenarios_per_component: 5    # Test scenarios per component
  diversity: 0.7                # 0.0-1.0, higher = more unique scenarios
  semantic_variations: true     # Generate synonym variations
```

```yaml
execution:
  model: "claude-sonnet-4-20250514"
  max_turns: 5            # Conversation turns per scenario
  timeout_ms: 60000       # Timeout per scenario (1 min)
  max_budget_usd: 10.0    # Stop if cost exceeds this
  disallowed_tools:       # Safety: block file operations
    - Write
    - Edit
    - Bash
```

```yaml
evaluation:
  model: "claude-sonnet-4-5-20250929"
  detection_mode: "programmatic_first"    # Or "llm_only"
  num_samples: 1                          # Multi-sample judgment
```

See the full `config.yaml` for all options, including:

- `tuning`: Fine-tune timeouts, retry behavior, and token estimates
- `conflict_detection`: Detect when multiple components trigger for the same prompt
- `batch_threshold`: Use Anthropic Batches API for cost savings (50% discount)
- `sanitization`: PII redaction with ReDoS-safe custom patterns
By default, scenarios testing the same component share a session with /clear between them. This reduces subprocess overhead by ~80%:
| Mode | Overhead per Scenario | 100 Scenarios |
|---|---|---|
| Batched (default) | ~1-2s after first | ~2-3 minutes |
| Isolated | ~5-8s each | ~8-13 minutes |
The /clear command resets conversation history between scenarios while reusing the subprocess and loaded plugin.
Switch to isolated mode when you need complete separation between scenarios:
- Testing plugins that modify filesystem state
- Debugging cross-contamination issues between scenarios
- When using `rewind_file_changes: true` (automatically uses isolated mode)
To use isolated mode:
```yaml
execution:
  session_strategy: "isolated"
```

Or via the deprecated (but still supported) option:

```yaml
execution:
  session_isolation: true
```

After a run, results are saved to:
```text
results/
└── {plugin-name}/
    └── {run-id}/
        ├── state.json               # Pipeline state (for resume)
        ├── analysis.json            # Stage 1: Parsed components
        ├── scenarios.json           # Stage 2: Generated test cases
        ├── execution-metadata.json  # Stage 3: Execution stats
        ├── evaluation.json          # Stage 4: Results & metrics
        └── transcripts/
            └── {scenario-id}.json   # Individual execution transcripts
```

```json
{
  "results": [
    {
      "scenario_id": "skill-create-direct-001",
      "triggered": true,
      "confidence": 100,
      "quality_score": 9.2,
      "detection_source": "programmatic",
      "has_conflict": false
    }
  ],
  "metrics": {
    "total_scenarios": 25,
    "accuracy": 0.92,
    "trigger_rate": 0.88,
    "avg_quality": 8.7,
    "conflict_count": 1
  }
}
```

Programmatic detection is primary for maximum accuracy:
- During execution, tool capture hooks record all tool invocations
- Tool captures are parsed to detect `Skill`, `Task`, and `SlashCommand` calls
- MCP tools detected via pattern: `mcp__<server>__<tool>`
- Hooks detected via `SDKHookResponseMessage` events
- Confidence is 100% for programmatic detection (see the sketch below)
LLM judge is secondary, used for:
- Quality assessment (0-10 score)
- Edge cases where programmatic detection is ambiguous
- Multi-sample consensus when configured
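As a rough sketch of the programmatic step (not the project's actual implementation), classifying a captured tool name might look like this:

```typescript
// Sketch only: map a captured tool name to the component type it indicates.
// The real detector also inspects SDKHookResponseMessage events for hook activity.
type TriggeredComponent = "skill" | "agent" | "command" | "mcp_tool" | "other";

function classifyToolCall(toolName: string): TriggeredComponent {
  if (toolName === "Skill") return "skill";              // Skill invocations
  if (toolName === "Task") return "agent";               // Subagents run via the Task tool
  if (toolName === "SlashCommand") return "command";     // Slash command invocations
  if (/^mcp__.+__.+$/.test(toolName)) return "mcp_tool"; // mcp__<server>__<tool>
  return "other";
}

// e.g. classifyToolCall("mcp__github__create_issue") === "mcp_tool"
```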
```bash
npm install        # Install dependencies
npm run build      # Build TypeScript
npm test           # Run tests
npm run lint       # Lint code
npm run typecheck  # Type check
```

See CONTRIBUTING.md for detailed development setup, code style, testing requirements, and pull request guidelines.
- Phase 1: Skills, agents, commands evaluation
- Phase 2: Hooks evaluation (PR #58)
- Phase 3: MCP servers evaluation (PR #63)
- Phase 4: Cross-plugin conflict detection
- Phase 5: Marketplace evaluation
Default: `execution.permission_bypass: true` enables automated evaluation by automatically approving all tool invocations. This is required for unattended runs but has security implications:

- ✅ Required for CI/CD and automated evaluation
- ⚠️ Plugins can perform any action permitted by allowed tools
- 🔒 Use `disallowed_tools` to restrict dangerous operations (default: `[Write, Edit, Bash]`)
- 🔒 For untrusted plugins, set `permission_bypass: false` for manual review (disables automation)

Security Note: With permission bypass enabled, use strict `disallowed_tools` and run in sandboxed environments when evaluating untrusted plugins.
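For example, a locked-down profile for evaluating an untrusted plugin, using only the `execution` options shown above:

```yaml
execution:
  permission_bypass: false   # Require manual review of every tool invocation
  disallowed_tools:          # Keep file and shell access blocked
    - Write
    - Edit
    - Bash
```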
Default: `output.sanitization.enabled: false` for backwards compatibility. Enable sanitization for PII-sensitive environments:

```yaml
output:
  sanitize_transcripts: true   # Redact saved files
  sanitize_logs: true          # Redact console output
  sanitization:
    enabled: true
    custom_patterns:           # Optional domain-specific patterns
      - pattern: "INTERNAL-\\w+"
        replacement: "[REDACTED_ID]"
```

Built-in redaction: API keys, JWT tokens, emails, phone numbers, SSNs, credit card numbers.
Enterprise use cases: Enable when handling PII or complying with GDPR, HIPAA, SOC 2, or similar regulations.
The default `disallowed_tools: [Write, Edit, Bash]` prevents file modifications and shell commands. Modify with caution:

- Enable `Write`/`Edit` only if testing file-modifying plugins
- Enable `Bash` only if testing shell-executing plugins
- Use `rewind_file_changes: true` to restore files after each scenario (sketched below)
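For example, when evaluating a plugin that legitimately edits files, the relevant settings might look like the sketch below (assuming `rewind_file_changes` sits under `execution` alongside the other run-time options):

```yaml
execution:
  disallowed_tools:
    - Bash                    # Keep shell execution blocked
  rewind_file_changes: true   # Restore files after each scenario (forces isolated sessions)
```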
- API keys: Loaded from environment variables (`.env`), never stored in config
- Budget limits: Set `execution.max_budget_usd` to cap API spending (see the sketch below)
- Timeout limits: Set `execution.timeout_ms` to prevent runaway executions
- Plugin loading: Only local plugins supported (`plugin.path`), no remote loading
- ReDoS protection: Custom sanitization patterns validated for Regular Expression Denial of Service vulnerabilities
For production/enterprise environments with compliance requirements, see the comprehensive security guide in SECURITY.md, including:
- Threat model and risk assessment
- Sandbox and isolation recommendations
- Compliance checklist (GDPR, HIPAA, SOC 2)
- Container isolation patterns
See CONTRIBUTING.md for development setup, code style, and pull request guidelines.
This project follows the Contributor Covenant code of conduct.
Steve Nims (@sjnims)
- Anthropic for Claude, the Anthropic SDK, and the Claude Agent SDK
- Bloom for architectural inspiration
- Boris Cherny for Claude Code
- Zod for runtime type validation
- Commander.js for CLI framework
- Vitest for testing
- Monster Energy for fuel
- deadmau5 for the beats