Agent Skill Eval — Is That Skill Any Good?

An evaluation framework for AI Agent Skills (agentskills.io standard). Measures safety, quality, reliability, and cost efficiency.

Installation

There are two ways to use this project — pick the one that fits your workflow.

Option A: CLI Tool (for developers)

Install it, run commands yourself, integrate into CI.

# Install from source
git clone https://github.com/aws-samples/sample-agent-skill-eval.git
cd sample-agent-skill-eval
pip install -e .

# Now use it
skill-eval audit /path/to/skill
skill-eval report /path/to/skill

When to choose this: You want to evaluate skills from the command line, integrate into CI/CD pipelines, or run batch evaluations.

Option B: Agent Skill (for AI agents)

Let your AI agent discover and use skill-eval automatically.

# Cross-client standard (works with any agent that supports agentskills.io)
cp -r sample-agent-skill-eval ~/.agents/skills/skill-eval

# For Claude Code specifically
cp -r sample-agent-skill-eval ~/.claude/skills/skill-eval

Your agent discovers it via SKILL.md and knows how to run skill-eval commands when you ask things like "audit this skill" or "is this skill safe?"

Note: The CLI tool must be installed first (pip install -e .). The Agent Skill is a wrapper that teaches the agent how to invoke the CLI.

When to choose this: You want your AI coding agent to evaluate skills as part of its workflow — e.g., "check this skill before I install it."

What It Measures

Dimension	What	How
Safety	Secrets, injection surfaces, unsafe installs, over-privileged permissions	Static analysis (no agent needed)
Quality	Functional correctness — does the skill actually help?	With-skill vs without-skill comparison
Reliability	Trigger precision — does the skill activate when it should?	Relevant + irrelevant query testing
Cost Efficiency	Is the quality gain worth the token cost?	Pareto classification

Plus: regression detection against versioned baselines, and lifecycle tracking.

Quick Start

# Is this skill safe?
skill-eval audit /path/to/skill
# Score: 92/100 (Grade: A) — 0 criticals, 2 warnings

# Generate eval scaffolds for a new skill
skill-eval init /path/to/skill
# Created evals/evals.json (3 template cases)
# Created evals/eval_queries.json (6 template queries)

# Full evaluation with unified grade
skill-eval report /path/to/skill
# Unified Score: 88/100 (Grade: B)
# Audit: 92 (×0.40) | Functional: 85 (×0.40) | Trigger: 90 (×0.20)

# Check for regressions after changes
skill-eval snapshot /path/to/skill
skill-eval regression /path/to/skill
# No regressions detected (baseline: 92, current: 94)

All Commands

Command	What it does	Needs Claude CLI?
`audit`	Security & structure scan — score + grade	No
`init`	Generate template `evals.json` + `eval_queries.json`	No
`snapshot`	Save current audit as versioned baseline	No
`regression`	Compare current audit against baseline	No
`lifecycle`	Track skill versions and detect changes	No
`functional`	Run eval cases with/without skill, grade assertions	Yes
`trigger`	Test skill activation for relevant/irrelevant queries	Yes
`compare`	Side-by-side comparison of two skills	Yes
`report`	Unified report: audit + functional + trigger — weighted grade	Yes

Scoring

Audit starts at 100. Deductions: critical −25, warning −10, info −2.

Grades: A (90+), B (80+), C (70+), D (60+), F (<60).

Unified report weights: audit 40%, functional 40%, trigger 20%.

Scan Scope

By default, skill-eval audit only scans skill-standard directories: root-level files (SKILL.md, etc.), scripts/, and agents/. This matches the agentskills.io definition of what constitutes a skill's content.

Directories like tests/, examples/, references/, and docs/ are excluded because they may contain documentation or test fixtures that describe security anti-patterns without actually being vulnerable.

Use --include-all to scan the entire directory tree:

# Default: scan skill content only
skill-eval audit /path/to/skill

# Full: scan everything (useful for full repo security review)
skill-eval audit /path/to/skill --include-all

Self-Eval: Why This Matters

Running skill-eval on itself demonstrates the difference:

# Default scan — only skill content (SKILL.md, scripts/)
skill-eval audit .
# Score: 96/100 (Grade: A) — 0 criticals, 0 warnings, 2 infos

# Full scan — includes test fixtures with intentional anti-patterns
skill-eval audit . --include-all
# Score: 0/100 (Grade: F) — 60+ criticals from tests/fixtures/
# That's by design: you need bad examples to test a security scanner

Configuration (`.skilleval.yaml`)

Customize audit behavior per-skill or per-project by placing a .skilleval.yaml (or .skilleval.yml) in your skill directory or any parent directory.

# .skilleval.yaml
audit:
  # Ignore specific finding codes
  ignore:
    - STR-008    # Directory name ≠ skill name is fine
    - STR-017    # README alongside SKILL.md is intentional

  # Override severity levels
  severity_overrides:
    SEC-002: WARNING    # Downgrade external URL to warning
    STR-011: CRITICAL   # Upgrade short description to critical

  # Whitelist internal domains (won't trigger SEC-002)
  safe_domains:
    - api.internal.company.com
    - wiki.team.io

  # Minimum passing score for CI (exit 1 if below)
  min_score: 70

  # Custom regex rules
  custom_rules:
    - code: CUSTOM-001
      pattern: "TODO|FIXME|HACK"
      severity: INFO
      message: "Found TODO/FIXME/HACK comment"

CLI flags override config values: --min-score 80 takes precedence over the YAML setting. Config and CLI --ignore / --allowlist values are merged.

Note: Requires pyyaml — install with pip install -e ".[config]" or pip install pyyaml. Without it, config files are silently skipped.

Examples

Example	What you'll learn
Data Analysis	Full lifecycle walkthrough — init, audit, functional, trigger, report
Lifecycle Demo	Three-version evolution (F→A→D) showing audit, functional, and regression detection
Golden Dataset	Ground-truth skills for validating the evaluation framework itself
F to A Improvement	Fix a failing skill step by step
Self-Eval	Default vs `--include-all` scan scope on this repo
Real Skill Audits	Interpret audit reports for production skills
Golden Eval Templates	Write effective eval cases and trigger queries

Relationship to Anthropic skill-creator

These are complementary tools:

skill-creator helps you create skills (scaffolding, templates, best practices)
skill-eval helps you evaluate skills (security, quality, reliability, cost)

Workflow: create with skill-creator → evaluate with skill-eval → iterate → deploy.

CI/CD

# .github/workflows/skill-eval.yml
jobs:
  evaluate:
    uses: aws-samples/sample-agent-skill-eval/.github/workflows/skill-eval.yml@main
    with:
      skill_path: "path/to/your-skill"
      run_functional: true
      run_trigger: true

Exit codes: 0 passed, 1 warnings/regressions/failures, 2 critical/error.

Contributing

See CONTRIBUTING.md for development setup, testing, and PR workflow.

Links

agentskills.io — Agent Skills specification
Anthropic Skills — official skill collection
OWASP Top 10 for LLMs
ClawHub — Agent Skills marketplace

License

MIT-0

Name		Name	Last commit message	Last commit date
Latest commit History 32 Commits
.github/workflows		.github/workflows
docs		docs
evals		evals
examples		examples
references		references
skill_eval		skill_eval
tests		tests
.gitignore		.gitignore
AGENTS.md		AGENTS.md
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
SKILL.md		SKILL.md
demo.sh		demo.sh
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Agent Skill Eval — Is That Skill Any Good?

Installation

Option A: CLI Tool (for developers)

Option B: Agent Skill (for AI agents)

What It Measures

Quick Start

All Commands

Scoring

Scan Scope

Self-Eval: Why This Matters

Configuration (`.skilleval.yaml`)

Examples

Relationship to Anthropic skill-creator

CI/CD

Contributing

Links

License

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors 1

Languages

Folders and files

Latest commit

History

Repository files navigation

Agent Skill Eval — Is That Skill Any Good?

Installation

Option A: CLI Tool (for developers)

Option B: Agent Skill (for AI agents)

What It Measures

Quick Start

All Commands

Scoring

Scan Scope

Self-Eval: Why This Matters

Configuration (.skilleval.yaml)

Examples

Relationship to Anthropic skill-creator

CI/CD

Contributing

Links

License

About

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors 1

Languages

Configuration (`.skilleval.yaml`)

Packages