Welcome to agent-catalog-eval, the CLI that grades your coding agents so you don't have to! Think of it as a rigorous (but fair) professor for your AI assistants. We evaluate coding-agent skills against a catalog of test cases to see if they're actually learning or just hallucinating their way through the semester.
You provide the homework (a directory of cases with a `prompt.md`, `before/` and `after/` snapshots, an `eval.yaml`, and a judge rubric), and we do the grading! This CLI unleashes your chosen agent (Cursor, OpenCode, or Claude Code) on every case, compares the resulting workspace against your `after/` snapshot using an LLM judge, and hands out the pass/fail grades.
We extracted this runner from an internal skills repository so you can run the same harness against your own skill catalogs without the dreaded copy-paste. DRY, baby!
Ready to test some bots? Let's get this installed!
```bash
# For the commitment-phobes (one-off)
npx agoda-agent-catalog-eval --help

# For the long haul (project install)
pnpm add -D agoda-agent-catalog-eval
```

The published binary is `agent-catalog-eval`. Easy peasy!
```bash
agent-catalog-eval                         # Run all cases in your current directory
agent-catalog-eval tests/e2e               # Run cases hiding in ./tests/e2e
agent-catalog-eval ./skills --filter ioc   # Only run cases with "ioc" in the name (for when you're feeling specific)
```

`cases-dir` is a positional argument, much like `vitest path/to/tests` or `jest src`. It defaults to your current working directory (`process.cwd()`). Any folder inside `cases-dir` that has an `eval.yaml` is officially a test case. (Don't worry, we automatically ignore the boring stuff like `node_modules`, `src`, `dist`, `.git`, and `output`.)
Here's how you structure your agent's pop quiz:
```
my-skill-eval/
├── eval.yaml    # The syllabus: skill_path, threshold, judge_rubric
├── prompt.md    # The exam question: what you tell the agent
├── before/      # The blank canvas: initial workspace state
└── after/       # The answer key: ground-truth desired state
```
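The exam question is plain Markdown handed straight to the agent. A minimal, entirely made-up `prompt.md` might read:

```markdown
Refactor this workspace to use constructor injection instead of manual
`new` calls, following the skill you've been given. Leave the tests alone.
```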
Your `eval.yaml` should look a little something like this:
```yaml
skill_path: skills/my-skill/SKILL.md  # Where the skill lives (resolved against --repo-root)
threshold: 70                         # The passing grade (0-100). No participation trophies here!
judge_rubric: |
  Score 100 if X. Penalize for Y.
  ...
```

Because we know you love to customize:
| Option | What it does |
|---|---|
| `[cases-dir]` | Where the tests live. Default: cwd. |
| `--agent <name>` | Who's taking the test? `cursor`, `opencode`, or `claude-code`. Default: `opencode` (because CI loves it). |
| `--dry-run` | Just looking! List discovered cases but don't actually run anything. |
| `--filter <pattern>` | Substring match on test name. |
| `--worker-model <name>` | The brains of the operation. Default: `claude-opus-4-7`. |
| `--judge-model <name>` | The strict grader. Default: `gemini-3.1-flash`. |
| `--timeout <seconds>` | Pencils down! Hard timeout per agent. Default: 420 seconds. |
| `--collect` | Send us a postcard (POST telemetry summary) after the run. |
| `--metrics-url <url>` | Where to send the postcard. Default: `$METRICS_URL` or our built-in fallback. |
| `--header KEY=VALUE` | Extra headers for OpenAI calls and the metrics POST. Repeatable. BYOH (Bring Your Own Headers). |
| `--project <name>` | Override CI project name (we auto-detect by default). |
| `--repo-root <path>` | Where the repo starts (for resolving `skill_path`). Default: nearest `.git` ancestor. |
| `--output-dir <path>` | Where the magic (and mess) happens. Default: `<cases-dir>/output`. |
| `--base-url <url>` | OpenAI-compatible base URL. Default: `$OPENAI_BASE_URL` or `https://api.openai.com/v1`. |
| `--otel-endpoint <url>` | Send OpenTelemetry traces from each opencode run to this OTLP endpoint. Off when omitted. |
| `--otel-protocol <proto>` | OTLP protocol: `grpc` or `http/protobuf`. Default: `grpc`. |
| `--otel-service-name <name>` | `service.name` attribute on emitted spans. Default: `agoda-agent-catalog-eval`. |
| `--help`, `-h` | When all else fails, ask for help! |
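Mixing and matching is fine. For instance, an illustrative run (the flag values below are examples, not recommendations):

```bash
agent-catalog-eval tests/e2e \
  --agent claude-code \
  --filter ioc \
  --timeout 600 \
  --output-dir /tmp/agent-eval-output
```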
We default to `opencode` instead of `cursor`. Why? Because `opencode` is headless, OpenAI-compatible, and plays incredibly well with CI pipelines. `cursor`, on the other hand, needs a local install and is strictly for dev environments.
Want to switch it up? Just pass `--agent cursor` or `--agent claude-code` and you're good to go!
| Variable | What it's for |
|---|---|
| `OPENAI_API_KEY` | Your golden ticket to the OpenAI-compatible gateway. Required (unless you're just doing a `--dry-run`). |
| `OPENAI_BASE_URL` | Override the default base URL. |
| `METRICS_URL` | Override the default telemetry URL. |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | Standard OTEL var. Used when `--otel-endpoint` is not passed. |
| `OTEL_EXPORTER_OTLP_PROTOCOL` | Standard OTEL var. Used when `--otel-protocol` is not passed. |
| `OTEL_SERVICE_NAME` | Standard OTEL var. Used when `--otel-service-name` is not passed. |
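A typical local setup, where the key and gateway URL are placeholders for your own:

```bash
export OPENAI_API_KEY="sk-..."                              # required for real runs
export OPENAI_BASE_URL="https://my-gateway.example.com/v1"  # optional gateway override
agent-catalog-eval tests/e2e
```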
We're pretty smart about figuring out where we're running. CI context (project / pipeline / commit / branch) is auto-detected from the first matching environment variable:
| Provider | Project | Pipeline | Commit | Branch |
|---|---|---|---|---|
| GitLab | `CI_PROJECT_PATH` | `CI_PIPELINE_ID` | `CI_COMMIT_SHA` | `CI_COMMIT_BRANCH` |
| GitHub Actions | `GITHUB_REPOSITORY` | `GITHUB_RUN_ID` | `GITHUB_SHA` | `GITHUB_REF_NAME` |
| TeamCity | `TEAMCITY_BUILDCONF_NAME` | `BUILD_NUMBER` | `BUILD_VCS_NUMBER` | `TEAMCITY_BUILD_BRANCH` |
| AppVeyor | `APPVEYOR_PROJECT_SLUG` | `APPVEYOR_BUILD_ID` | `APPVEYOR_REPO_COMMIT` | `APPVEYOR_REPO_BRANCH` |
| (none) | `unknown` | `local` | `unknown` | `unknown` |

Want to be the boss? Override any field with `--project` (more overrides coming soon!).
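So on a laptop with none of those variables set, you can still label runs yourself (the project name below is made up):

```bash
agent-catalog-eval tests/e2e --project my-team/my-skills
```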
| Code | What it means |
|---|---|
| `0` | Success! All cases passed (or you ran `--dry-run`, or we found absolutely nothing to do). |
| `1` | Uh oh. At least one case failed, or you typed something wrong. Better luck next time! |
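That means the CLI slots straight into a CI gate with zero extra plumbing. A hypothetical script step:

```bash
# The exit code does the gating: any failed case fails the step
if agent-catalog-eval tests/e2e --agent opencode; then
  echo "All cases passed"
else
  echo "At least one case failed (or the invocation was bad)" >&2
  exit 1
fi
```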
If you pass the `--collect` flag, we'll POST a lovely `application/json` summary to your `--metrics-url`.
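We won't pin down the exact payload shape here, but if you just want to eyeball it locally, any HTTP listener will do. A quick sketch with BSD-style `nc` (the port and URL are arbitrary):

```bash
# Terminal 1: a throwaway listener that prints whatever arrives
# (BSD nc syntax; GNU netcat wants `nc -l -p 8080`)
nc -l 8080

# Terminal 2: point the postcard at it
agent-catalog-eval tests/e2e --collect --metrics-url http://localhost:8080
```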
Want to see what your tests are actually doing, both the agent runs and the judge LLM calls? Wire up an OTLP endpoint and we'll ship traces from both:
```bash
agent-catalog-eval tests/e2e \
  --agent opencode \
  --otel-endpoint http://localhost:4317 \
  --otel-protocol grpc
```

When `--otel-endpoint` is set, the runner emits two flavours of spans:
1. Judge (LLM) spans, emitted from this CLI

The judge call to OpenAI is auto-instrumented with `@arizeai/openinference-instrumentation-openai` using the OpenInference semantic conventions. That means Arize (and any other OpenInference-aware backend) renders these as proper LLM spans, with input/output messages, model, token counts, and cost, without any extra tagging from you.

Each test is wrapped in an `eval.test` parent span with these attributes, so all the LLM activity for one test is grouped together in one trace:
| Attribute | Value |
|---|---|
| `agoda.eval.test_name` | The test case name (e.g. `csharp-ioc/refactor-manual-di`) |
| `agoda.eval.skill_path` | Path to the `SKILL.md` being evaluated |
| `agoda.eval.threshold` | Pass/fail score threshold for the test |
| `agoda.eval.agent` | `opencode` / `cursor` / `claude-code` |
| `agoda.eval.worker_model` | The worker model name |
| `agoda.eval.judge_model` | The judge model name |
| `agoda.eval.category` | The eval category, when set |
| `agoda.eval.score` | Final score from the judge (set when the test finishes) |
| `agoda.eval.passed` | Whether the score met the threshold (set when the test finishes) |
2. OpenCode subprocess spans, emitted from `opencode`

For `opencode` runs, the runner also:

- Adds the `@devtheops/opencode-plugin-otel` plugin to the per-test `opencode.json`. (You'll need it on the box where opencode runs; the plugin is loaded from npm.)
- Sets `OPENCODE_ENABLE_TELEMETRY=1` and the `OPENCODE_OTLP_*` env vars on the spawned process, plus the standard `OTEL_EXPORTER_OTLP_*` and `OTEL_SERVICE_NAME` vars so any other OTEL-aware tool also picks them up.
- Packs the same per-test attributes into `OTEL_RESOURCE_ATTRIBUTES` so each opencode span knows which test, skill, agent, project, pipeline, commit, and branch it came from; see the sketch below for the format.
- Injects the W3C `TRACEPARENT` env var so plugins that honour it can stitch their spans under the parent `eval.test` span.
> The judge spans are emitted regardless of `--agent`. The opencode-specific bits only fire for `--agent opencode`; `cursor` and `claude-code` have their own telemetry stories.
For a throwaway local collector that just prints whatever it receives:
```yaml
# otel-collector.yaml
receivers:
  otlp:
    protocols:
      grpc: { endpoint: 0.0.0.0:4317 }
      http: { endpoint: 0.0.0.0:4318 }
processors:
  batch:
exporters:
  debug:
    verbosity: detailed
service:
  pipelines:
    traces: { receivers: [otlp], processors: [batch], exporters: [debug] }
    metrics: { receivers: [otlp], processors: [batch], exporters: [debug] }
    logs: { receivers: [otlp], processors: [batch], exporters: [debug] }
```

Run the collector, then:
```bash
agent-catalog-eval tests/e2e --otel-endpoint http://localhost:4317
```

An internal skills repository is our reference consumer. Once this package hits the shelves, it'll run something like this:
```bash
npx agoda-agent-catalog-eval tests/e2e \
  --agent opencode \
  --collect \
  --header x-custom-auth=my-token
```

When your brilliant code gets merged to main, our `changeset.yml` workflow will automatically open/merge a release PR and publish it to npm with `access: public` and provenance enabled. Magic!
Remember, in the world of AI coding agents, there are two types of people: those who test their agents, and those who trust them blindly. With agent-catalog-eval, you can trust and verify!
Happy evaluating, and may your agents always score 100!