Releases: skill-bench/skill-eval-action

v1.1.0

20 Mar 17:43
68841d3

Changelog

v1.1.0

  • Auto-fix YAML plain scalars containing `: ` (colon-space), so natural-language criteria no longer cause parse errors
  • Support grading.rubric eval format with id, description, weight, pass_if fields (normalized to flat criteria)
  • Validate all eval cases (required fields, types) before making any API calls
  • Support .yml extension alongside .yaml for eval files
  • Add scripts/test_validation.py for local testing without API calls
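The colon-space auto-fix above can be sketched roughly as follows. This is an illustrative helper only (the function name and regex are assumptions, not the action's actual code): it quotes any unquoted plain-scalar value that contains a second `: `, which YAML would otherwise try to parse as a nested mapping.

```python
import re

def quote_colon_scalars(yaml_text: str) -> str:
    """Wrap plain-scalar values containing ': ' in double quotes so a YAML
    parser does not mis-read them as nested mappings.
    Illustrative sketch; the action's real implementation may differ."""
    fixed = []
    for line in yaml_text.splitlines():
        # Match 'key: value' where the value itself contains another ': '
        # and is not already quoted or a comment.
        m = re.match(r'^(\s*-?\s*\w+:\s)([^"\'#\n]*:\s.*)$', line)
        if m:
            value = m.group(2).replace('"', '\\"')
            fixed.append(f'{m.group(1)}"{value}"')
        else:
            fixed.append(line)
    return "\n".join(fixed)
```

Lines whose values contain no inner `: `, or that are already quoted, pass through unchanged.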

v1.0.0

  • Initial release
  • Discover and execute eval YAML test cases via claude -p
  • Grade responses against criteria via separate claude -p call
  • Post results as PR comment (upsert with HTML marker)
  • Upload interactive eval-viewer HTML as artifact
  • Configurable pass threshold with step failure
  • GitHub Actions step summary with results table
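The "upsert with HTML marker" pattern from the v1.0.0 notes can be sketched as below. The marker string and function are hypothetical stand-ins: the idea is that an invisible HTML comment embedded in the PR comment lets later runs find and update their own comment instead of posting a new one.

```python
# Hypothetical marker; the action's actual marker string may differ.
MARKER = "<!-- skill-eval-results -->"

def upsert_target(existing_comments, results_markdown):
    """Decide whether to update an existing results comment or create a new one.

    existing_comments: list of dicts like {"id": int, "body": str}
    Returns (comment_id_to_update or None, new_comment_body).
    """
    body = f"{MARKER}\n{results_markdown}"
    for comment in existing_comments:
        # A previous run's comment is identified by the hidden marker.
        if MARKER in comment["body"]:
            return comment["id"], body
    return None, body
```

The caller would then issue an update request when an id is returned, or a create request otherwise.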

v1.0.0

17 Mar 16:19
47efd5b

Full Changelog: v0.11.0...v1.0.0

v0.12.0

17 Mar 15:58
47efd5b

Update org references, add logo

v0.11.0

17 Mar 10:06
adaf781

Add max-retries and retry-delay to README inputs table

v0.10.0

17 Mar 09:51
4c1c5d9

Add retry logic (3 attempts with backoff) for execute and grade API calls. Rename viewer HTML with timestamp.
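The retry behavior described here (3 attempts for the execute and grade calls) can be sketched as a small wrapper. The release note does not specify the backoff shape, so the exponential delay below is an assumption; the injectable `sleep` is purely for testability.

```python
import time

def with_retries(call, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Invoke `call`, retrying on any exception up to `attempts` times.

    Assumes exponential backoff (base_delay, 2*base_delay, ...); the
    action's actual delay schedule is not documented in the release note.
    """
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))
```

A failing API call is retried twice more before the step fails outright.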

v0.9.0

17 Mar 09:18
e6ec573

Artifact names now include timestamp: YYYYMMDDTHHmmss-skill-name
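The `YYYYMMDDTHHmmss-skill-name` scheme maps directly onto a `strftime` format string. A minimal sketch (the function name is hypothetical; whether the action uses UTC or local time is an assumption):

```python
from datetime import datetime, timezone

def artifact_name(skill_name, now=None):
    """Build a timestamped artifact name, e.g. 20250317T091800-my-skill.

    Assumes UTC when no timestamp is supplied; the release note does not
    say which timezone the action uses.
    """
    now = now or datetime.now(timezone.utc)
    return f"{now.strftime('%Y%m%dT%H%M%S')}-{skill_name}"
```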

v0.8.0

17 Mar 08:44
7eb8756

Capture tokens/cost from claude CLI, benchmark tab uses summary data, remove feedback tab

v0.7.0

17 Mar 08:28
b8530f6

Fix viewer HTML: transform timing fields and outputs to match template format

v0.6.0

17 Mar 08:13
8fa8347

Fix viewer HTML data embedding - use correct variable name and runs[] structure

v0.5.0

17 Mar 07:23
5a97fc7
