Releases: skill-bench/skill-eval-action
v1.1.0
Changelog
v1.1.0
- Auto-fix YAML plain scalars containing `: ` (colon-space) — no more parse errors for natural-language criteria
- Support `grading.rubric` eval format with `id`, `description`, `weight`, `pass_if` fields (normalized to flat criteria)
- Validate all eval cases (required fields, types) before making any API calls
- Support `.yml` extension alongside `.yaml` for eval files
- Add `scripts/test_validation.py` for local testing without API calls
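The colon-space auto-fix can be pictured as a small pre-processing pass over the raw YAML before parsing. This is a minimal sketch of that idea, not the action's actual code: the function name and the quoting heuristic are illustrative, and values that themselves contain double quotes are not handled here.

```python
import re

def quote_colon_scalars(yaml_text: str) -> str:
    """Quote unquoted 'key: value' lines whose value contains ': ',
    which would otherwise trigger a YAML parse error (illustrative heuristic)."""
    fixed = []
    for line in yaml_text.splitlines():
        m = re.match(r'^(\s*(?:- )?[\w.-]+): (.*)$', line)
        if m and ": " in m.group(2) and not m.group(2).startswith(('"', "'")):
            # Wrap the value in double quotes so the inner colon is literal text.
            fixed.append(f'{m.group(1)}: "{m.group(2)}"')
        else:
            fixed.append(line)
    return "\n".join(fixed)
```

A natural-language criterion like `criteria: mentions X: the main point` would be rewritten to `criteria: "mentions X: the main point"`, while lines without an inner colon-space pass through unchanged.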
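The rubric normalization could look something like the sketch below, assuming the four field names from the release notes; the exact shape of the flattened criteria is an assumption.

```python
def normalize_rubric(grading: dict) -> list[dict]:
    """Flatten a grading.rubric list into flat criteria entries.
    Field names come from the v1.1.0 notes; defaults are assumptions."""
    criteria = []
    for item in grading.get("rubric", []):
        criteria.append({
            "id": item["id"],                      # required identifier
            "text": item["description"],           # human-readable criterion
            "weight": item.get("weight", 1.0),     # assumed default weight
            "pass_if": item.get("pass_if"),        # optional pass condition
        })
    return criteria
```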
v1.0.0
- Initial release
- Discover and execute eval YAML test cases via `claude -p`
- Grade responses against criteria via separate `claude -p` call
- Post results as PR comment (upsert with HTML marker)
- Upload interactive eval-viewer HTML as artifact
- Configurable pass threshold with step failure
- GitHub Actions step summary with results table
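The "upsert with HTML marker" pattern means the action tags its PR comment with a hidden HTML comment, then updates that comment on later runs instead of posting duplicates. A minimal sketch of the decision logic, with an assumed marker string and a simplified comment shape:

```python
MARKER = "<!-- skill-eval-results -->"  # hypothetical marker text

def upsert_comment(existing_comments: list[dict], body: str) -> tuple[str, dict]:
    """Return ('update', comment) if a previous results comment is found
    via the HTML marker, else ('create', new comment payload)."""
    tagged = MARKER + "\n" + body
    for c in existing_comments:
        if MARKER in c.get("body", ""):
            # Reuse the existing comment's id; replace its body.
            return "update", {**c, "body": tagged}
    return "create", {"body": tagged}
```

In a real workflow the `existing_comments` list would come from the GitHub REST API's issue-comments endpoint, and the returned action would map to an update or create call.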
v1.0.0
Full Changelog: v0.11.0...v1.0.0
v0.12.0
Update org references, add logo
v0.11.0
Add max-retries and retry-delay to README inputs table
v0.10.0
Add retry logic (3 attempts with backoff) for execute and grade API calls. Rename viewer HTML with timestamp.
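The retry behavior described above (three attempts with backoff around each execute and grade call) can be sketched as a small wrapper; the linear backoff schedule and default numbers here are illustrative, not the action's exact parameters.

```python
import time

def with_retries(call, attempts: int = 3, delay: float = 2.0):
    """Run `call`, retrying on any exception with a growing delay.
    Mirrors the '3 attempts with backoff' idea; numbers are assumptions."""
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == attempts:
                raise  # exhausted all attempts: surface the last error
            time.sleep(delay * attempt)  # linear backoff between attempts
```

The v0.11.0 `max-retries` and `retry-delay` inputs would map onto the `attempts` and `delay` parameters of a wrapper like this.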
v0.9.0
Artifact names now include timestamp: `YYYYMMDDTHHmmss-skill-name`
v0.8.0
Capture tokens/cost from claude CLI, benchmark tab uses summary data, remove feedback tab
v0.7.0
Fix viewer HTML: transform timing fields and outputs to match template format
v0.6.0
Fix viewer HTML data embedding - use correct variable name and runs[] structure