Skip to content

aws-samples/sample-agent-skill-eval

Agent Skill Eval — Is That Skill Any Good?

Python 3.10+ License: MIT-0 Tests: 621

An evaluation framework for AI Agent Skills (agentskills.io standard). Measures safety, quality, reliability, and cost efficiency.


Installation

There are two ways to use this project — pick the one that fits your workflow.

Option A: CLI Tool (for developers)

Install it, run commands yourself, integrate into CI.

# Install from source
git clone https://github.com/aws-samples/sample-agent-skill-eval.git
cd sample-agent-skill-eval
pip install -e .

# Now use it
skill-eval audit /path/to/skill
skill-eval report /path/to/skill

When to choose this: You want to evaluate skills from the command line, integrate into CI/CD pipelines, or run batch evaluations.

Option B: Agent Skill (for AI agents)

Let your AI agent discover and use skill-eval automatically.

# Cross-client standard (works with any agent that supports agentskills.io)
cp -r sample-agent-skill-eval ~/.agents/skills/skill-eval

# For Claude Code specifically
cp -r sample-agent-skill-eval ~/.claude/skills/skill-eval

Your agent discovers it via SKILL.md and knows how to run skill-eval commands when you ask things like "audit this skill" or "is this skill safe?"

Note: The CLI tool must be installed first (pip install -e .). The Agent Skill is a wrapper that teaches the agent how to invoke the CLI.

When to choose this: You want your AI coding agent to evaluate skills as part of its workflow — e.g., "check this skill before I install it."


What It Measures

Dimension What How
Safety Secrets, injection surfaces, unsafe installs, over-privileged permissions Static analysis (no agent needed)
Quality Functional correctness — does the skill actually help? With-skill vs without-skill comparison
Reliability Trigger precision — does the skill activate when it should? Relevant + irrelevant query testing
Cost Efficiency Is the quality gain worth the token cost? Pareto classification

Plus: regression detection against versioned baselines, and lifecycle tracking.


Quick Start

# Is this skill safe?
skill-eval audit /path/to/skill
# Score: 92/100 (Grade: A) — 0 criticals, 2 warnings

# Generate eval scaffolds for a new skill
skill-eval init /path/to/skill
# Created evals/evals.json (3 template cases)
# Created evals/eval_queries.json (6 template queries)

# Full evaluation with unified grade
skill-eval report /path/to/skill
# Unified Score: 88/100 (Grade: B)
# Audit: 92 (×0.40) | Functional: 85 (×0.40) | Trigger: 90 (×0.20)

# Check for regressions after changes
skill-eval snapshot /path/to/skill
skill-eval regression /path/to/skill
# No regressions detected (baseline: 92, current: 94)

All Commands

Command What it does Needs Claude CLI?
audit Security & structure scan — score + grade No
init Generate template evals.json + eval_queries.json No
snapshot Save current audit as versioned baseline No
regression Compare current audit against baseline No
lifecycle Track skill versions and detect changes No
functional Run eval cases with/without skill, grade assertions Yes
trigger Test skill activation for relevant/irrelevant queries Yes
compare Side-by-side comparison of two skills Yes
report Unified report: audit + functional + trigger — weighted grade Yes

Scoring

Audit starts at 100. Deductions: critical −25, warning −10, info −2.

Grades: A (90+), B (80+), C (70+), D (60+), F (<60).

Unified report weights: audit 40%, functional 40%, trigger 20%.


Scan Scope

By default, skill-eval audit only scans skill-standard directories: root-level files (SKILL.md, etc.), scripts/, and agents/. This matches the agentskills.io definition of what constitutes a skill's content.

Directories like tests/, examples/, references/, and docs/ are excluded because they may contain documentation or test fixtures that describe security anti-patterns without actually being vulnerable.

Use --include-all to scan the entire directory tree:

# Default: scan skill content only
skill-eval audit /path/to/skill

# Full: scan everything (useful for full repo security review)
skill-eval audit /path/to/skill --include-all

Self-Eval: Why This Matters

Running skill-eval on itself demonstrates the difference:

# Default scan — only skill content (SKILL.md, scripts/)
skill-eval audit .
# Score: 96/100 (Grade: A) — 0 criticals, 0 warnings, 2 infos

# Full scan — includes test fixtures with intentional anti-patterns
skill-eval audit . --include-all
# Score: 0/100 (Grade: F) — 60+ criticals from tests/fixtures/
# That's by design: you need bad examples to test a security scanner

Configuration (.skilleval.yaml)

Customize audit behavior per-skill or per-project by placing a .skilleval.yaml (or .skilleval.yml) in your skill directory or any parent directory.

# .skilleval.yaml
audit:
  # Ignore specific finding codes
  ignore:
    - STR-008    # Directory name ≠ skill name is fine
    - STR-017    # README alongside SKILL.md is intentional

  # Override severity levels
  severity_overrides:
    SEC-002: WARNING    # Downgrade external URL to warning
    STR-011: CRITICAL   # Upgrade short description to critical

  # Whitelist internal domains (won't trigger SEC-002)
  safe_domains:
    - api.internal.company.com
    - wiki.team.io

  # Minimum passing score for CI (exit 1 if below)
  min_score: 70

  # Custom regex rules
  custom_rules:
    - code: CUSTOM-001
      pattern: "TODO|FIXME|HACK"
      severity: INFO
      message: "Found TODO/FIXME/HACK comment"

CLI flags override config values: --min-score 80 takes precedence over the YAML setting. Config and CLI --ignore / --allowlist values are merged.

Note: Requires pyyaml — install with pip install -e ".[config]" or pip install pyyaml. Without it, config files are silently skipped.


Examples

Example What you'll learn
Data Analysis Full lifecycle walkthrough — init, audit, functional, trigger, report
Lifecycle Demo Three-version evolution (F→A→D) showing audit, functional, and regression detection
Golden Dataset Ground-truth skills for validating the evaluation framework itself
F to A Improvement Fix a failing skill step by step
Self-Eval Default vs --include-all scan scope on this repo
Real Skill Audits Interpret audit reports for production skills
Golden Eval Templates Write effective eval cases and trigger queries

Relationship to Anthropic skill-creator

These are complementary tools:

  • skill-creator helps you create skills (scaffolding, templates, best practices)
  • skill-eval helps you evaluate skills (security, quality, reliability, cost)

Workflow: create with skill-creator → evaluate with skill-eval → iterate → deploy.


CI/CD

# .github/workflows/skill-eval.yml
jobs:
  evaluate:
    uses: aws-samples/sample-agent-skill-eval/.github/workflows/skill-eval.yml@main
    with:
      skill_path: "path/to/your-skill"
      run_functional: true
      run_trigger: true

Exit codes: 0 passed, 1 warnings/regressions/failures, 2 critical/error.


Contributing

See CONTRIBUTING.md for development setup, testing, and PR workflow.

Links

License

MIT-0

About

No description, website, or topics provided.

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages