Autoresearch-inspired automatic optimization for AI Agent Skills (SKILL.md)
Skill Forge automatically iterates on and optimizes your AI Agent Skill files using a three-layer evaluation system: static checks, benchmark tests, and LLM-as-Judge scoring. Just as Karpathy's autoresearch lets AI agents optimize training code overnight, Skill Forge lets AI agents optimize your Skill instructions.
Original Skill → AI Agent Rewrites → 3-Layer Evaluation → Score ↑ Keep / ↓ Rollback → Show Diff → Next Round
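The keep-or-rollback loop in the flow above can be sketched in a few lines of Python. This is a minimal illustration of the idea, not Skill Forge's actual implementation: `optimize`, `rewrite`, and `evaluate` are hypothetical names, with `rewrite` standing in for the LLM agent and `evaluate` for the combined evaluation score.

```python
from typing import Callable, List, Tuple


def optimize(
    skill_text: str,
    rewrite: Callable[[str], str],     # hypothetical: agent proposing a new version
    evaluate: Callable[[str], float],  # hypothetical: combined evaluation score
    max_rounds: int = 20,
) -> Tuple[str, List[Tuple[int, float, bool]]]:
    """Greedy loop: keep a rewrite only when it scores higher than the best so far."""
    best_text = skill_text
    best_score = evaluate(best_text)
    history: List[Tuple[int, float, bool]] = []

    for round_no in range(1, max_rounds + 1):
        candidate = rewrite(best_text)
        score = evaluate(candidate)
        kept = score > best_score
        if kept:
            # Score went up: the candidate becomes the new baseline.
            best_text, best_score = candidate, score
        # Otherwise "rollback" simply means the next round starts from best_text again.
        history.append((round_no, score, kept))

    return best_text, history
```

Because a candidate is accepted only on a strict improvement, the score of the kept version never decreases from round to round; the recorded history is what a per-round diff or rollback would be built on.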
pip install -e .

# Optimize a Skill (semi-auto mode, default)
skill-forge optimize --skill /path/to/SKILL.md --benchmark /path/to/benchmark.yaml
# Full-auto mode
skill-forge optimize --skill /path/to/SKILL.md --benchmark /path/to/benchmark.yaml --mode auto --max-rounds 20
# Batch optimize multiple Skills at once
skill-forge batch --config configs/benchmarks.yaml
# Evaluate only
skill-forge evaluate --skill /path/to/SKILL.md --benchmark /path/to/benchmark.yaml
# View history
skill-forge history --skill /path/to/SKILL.md
# Show diff for a specific round
skill-forge diff --skill /path/to/SKILL.md --round 1
# Rollback to a specific round
skill-forge rollback --skill /path/to/SKILL.md --round 3

Use the skill-forge-eval SKILL to guide a full optimization cycle:
Phase 1: Prepare → Confirm Skill path, create/find Benchmark & Program YAML
Phase 2: Optimize → Run skill-forge optimize, monitor each round
Phase 3: Backup → Auto-backup original as SKILL.md.bak.vYYYYMMDD
Phase 4: Evaluate → L1 static check + L2 Benchmark + LLM-as-Judge
Phase 5: Patch → Auto-patch missing items, re-verify L2
Phase 6: Replace → Replace original if all conditions met
Trigger: "optimize {skill-name}" or "run skill-forge-eval"
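The backup naming in Phase 3 (`SKILL.md.bak.vYYYYMMDD`) can be reproduced with the standard library. The sketch below is an assumption about how such a backup could be made, not the workflow's actual code; `backup_path` and `backup_skill` are hypothetical helper names.

```python
import shutil
from datetime import date
from pathlib import Path
from typing import Optional


def backup_path(skill_path: Path, today: Optional[date] = None) -> Path:
    """Return the sibling path '<name>.bak.vYYYYMMDD' for a Skill file."""
    today = today or date.today()
    return skill_path.with_name(f"{skill_path.name}.bak.v{today:%Y%m%d}")


def backup_skill(skill_path: Path, today: Optional[date] = None) -> Path:
    """Copy the Skill file to its dated backup path and return that path."""
    target = backup_path(skill_path, today)
    shutil.copy2(skill_path, target)  # copy2 also preserves file timestamps
    return target
```

For example, `backup_path(Path("/path/to/SKILL.md"), date(2025, 1, 2))` yields `/path/to/SKILL.md.bak.v20250102`. A date-only suffix means a second backup on the same day would overwrite the first.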
skill-forge/
├── skill_forge/              # Core Python package
│   ├── cli.py                # CLI entry point
│   ├── core/
│   │   ├── agent.py          # LLM agent for rewriting
│   │   ├── batch.py          # Batch optimization runner
│   │   ├── benchmark.py      # Benchmark YAML parser
│   │   ├── evaluator.py      # L1/L2 evaluation engine
│   │   ├── optimizer.py      # Main optimization loop
│   │   └── storage.py        # Optimization history storage
│   └── utils/
│       ├── diff.py           # Diff generation
│       └── scoring.py        # Scoring utilities
├── skills/                   # Open-source Skill files
│   └── skill-forge-eval/     # Full optimization workflow SKILL
├── tests/                    # Unit & integration tests
└── configs/                  # Local configs (not committed)

| Layer | Method | Purpose |
|---|---|---|
| L1 | Static analysis | YAML frontmatter, heading structure, code block balance |
| L2 | Benchmark tests | Keyword presence checks against expected content |
| LLM-as-Judge | LLM comparison | Content integrity, accuracy, readability, format, usability |
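To make the first two layers concrete, here is a rough sketch of what L1-style static checks and L2-style keyword checks could look like. The exact rules in `evaluator.py` are not documented here, so the function names, the returned keys, and the case format (a list of keyword lists) are all assumptions for illustration.

```python
import re
from typing import Dict, List

FENCE = "`" * 3  # Markdown code-fence marker


def static_checks(skill_text: str) -> Dict[str, bool]:
    """L1-style checks: frontmatter present, headings present, fences balanced."""
    lines = skill_text.splitlines()
    has_frontmatter = (
        len(lines) > 1
        and lines[0].strip() == "---"
        and any(line.strip() == "---" for line in lines[1:])
    )

    in_fence = False
    fence_count = 0
    heading_count = 0
    for line in lines:
        if line.lstrip().startswith(FENCE):
            fence_count += 1
            in_fence = not in_fence
        elif not in_fence and re.match(r"#{1,6} \S", line):
            # Only count headings outside code blocks ('# comment' in code is not one).
            heading_count += 1

    return {
        "frontmatter": has_frontmatter,
        "headings": heading_count > 0,
        "code_fences_balanced": fence_count % 2 == 0,
    }


def keyword_pass_rate(skill_text: str, cases: List[List[str]]) -> float:
    """L2-style check: a case passes when all of its keywords appear in the text."""
    if not cases:
        return 0.0
    lowered = skill_text.lower()
    passed = sum(all(k.lower() in lowered for k in case) for case in cases)
    return passed / len(cases)
```

Checks like these are cheap and deterministic, which is why they make sense as gates before the more expensive LLM-as-Judge comparison. Keyword presence only shows that expected content survived a rewrite, not that it is still correct.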

| Skill | Cases | Status |
|---|---|---|
| wechat-docx-export | 4 | ✅ 100% |
| dev-workflow | 10 | ✅ 100% |
| skill-forge-eval | 7 | ✅ 100% |

| Skill | Original | Optimized | Compression | L2 Pass |
|---|---|---|---|---|
| wechat-docx-export | 352 lines | 226 lines | 35.8% | ✅ 100% |
| dev-workflow | 2843 lines | 1072 lines | 62.3% | ✅ 100% |
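The Compression column appears to be the share of lines removed relative to the original. That definition is inferred from the figures in the table rather than stated by the project, and `compression_pct` is a hypothetical helper name, but it reproduces both rows:

```python
def compression_pct(original_lines: int, optimized_lines: int) -> float:
    """Percentage of lines removed relative to the original, rounded to 0.1."""
    if original_lines <= 0:
        raise ValueError("original_lines must be positive")
    removed = original_lines - optimized_lines
    return round(removed / original_lines * 100, 1)


print(compression_pct(352, 226))    # 35.8
print(compression_pct(2843, 1072))  # 62.3
```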

License: MIT