A 4-stage evaluation framework for testing Claude Code plugin component triggering. Validates that skills, agents, and commands activate correctly, using programmatic detection and LLM judgment.
Updated Jan 20, 2026 - TypeScript
Benchmark harness for A/B testing Claude Code plugins against OOLONG long-context reasoning tasks. Compares truncation against RLM-RS recursive chunking strategies. Features Claude Code hooks integration, SQLite persistence, and scoring aligned with the OOLONG paper methodology.
🚀 Automate the evaluation of Claude Code plugin components to ensure accurate triggering of skills, agents, commands, and hooks.