
atyou2happy/skill-forge


Skill Forge

Chinese documentation


Autoresearch-inspired automatic optimization for AI Agent Skills (SKILL.md)

Skill Forge automatically iterates on and optimizes your AI Agent Skill files using a three-layer evaluation system: static checks, benchmark tests, and LLM-as-Judge scoring. Just as Karpathy's autoresearch lets AI agents optimize training code overnight, Skill Forge lets AI agents optimize your Skill instructions.

How It Works

Original Skill → AI Agent Rewrites → 3-Layer Evaluation → Score ↑ Keep / ↓ Rollback → Show Diff → Next Round
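
The keep-or-rollback loop above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code in skill_forge/core/optimizer.py: the `optimize` function and the `rewrite` and `evaluate` callables are hypothetical names, and a rewrite is assumed to be kept only when its score is strictly higher than the best so far.

```python
from typing import Callable, List, Tuple


def optimize(
    skill_text: str,
    rewrite: Callable[[str], str],     # hypothetical: the agent's rewrite step
    evaluate: Callable[[str], float],  # hypothetical: combined evaluation score
    max_rounds: int = 20,
) -> Tuple[str, float, List[Tuple[int, float, str]]]:
    """Keep a rewrite only when it scores higher than the best so far."""
    best_text = skill_text
    best_score = evaluate(best_text)
    history = [(0, best_score, "baseline")]
    for round_no in range(1, max_rounds + 1):
        candidate = rewrite(best_text)
        score = evaluate(candidate)
        if score > best_score:
            # Score went up: the candidate becomes the new baseline.
            best_text, best_score = candidate, score
            history.append((round_no, score, "keep"))
        else:
            # Score did not improve: discard the candidate (rollback).
            history.append((round_no, score, "rollback"))
    return best_text, best_score, history


# Toy stand-ins: "rewriting" collapses double spaces, shorter text scores higher.
def toy_rewrite(text: str) -> str:
    return text.replace("  ", " ")


def toy_evaluate(text: str) -> float:
    return -len(text)


final_text, final_score, rounds = optimize("a    b", toy_rewrite, toy_evaluate, max_rounds=3)
```

In the toy run the first two rounds shorten the text and are kept; the third changes nothing, so it is recorded as a rollback.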

Installation

pip install -e .

Usage

# Optimize a Skill (semi-auto mode, default)
skill-forge optimize --skill /path/to/SKILL.md --benchmark /path/to/benchmark.yaml

# Full-auto mode
skill-forge optimize --skill /path/to/SKILL.md --benchmark /path/to/benchmark.yaml --mode auto --max-rounds 20

# Batch optimize multiple Skills at once
skill-forge batch --config configs/benchmarks.yaml

# Evaluate only
skill-forge evaluate --skill /path/to/SKILL.md --benchmark /path/to/benchmark.yaml

# View history
skill-forge history --skill /path/to/SKILL.md

# Show diff for a specific round
skill-forge diff --skill /path/to/SKILL.md --round 1

# Rollback to a specific round
skill-forge rollback --skill /path/to/SKILL.md --round 3

Complete Optimization Workflow

Use the skill-forge-eval SKILL to guide a full optimization cycle:

Phase 1: Prepare   → Confirm Skill path, create/find Benchmark & Program YAML
Phase 2: Optimize  → Run skill-forge optimize, monitor each round
Phase 3: Backup    → Auto-backup original as SKILL.md.bak.vYYYYMMDD
Phase 4: Evaluate  → L1 static check + L2 Benchmark + LLM-as-Judge
Phase 5: Patch     → Auto-patch missing items, re-verify L2
Phase 6: Replace   → Replace original if all conditions met

Trigger: "optimize {skill-name}" or "run skill-forge-eval"
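
As an illustration of the Phase 3 naming scheme (SKILL.md.bak.vYYYYMMDD), the sketch below builds a backup path and copies the file. The helper names `backup_path` and `backup_skill` are hypothetical, and placing the backup next to the original file is an assumption; the skill-forge-eval SKILL may do this differently.

```python
import shutil
from datetime import date
from pathlib import Path
from typing import Optional


def backup_path(skill_path: str, today: Optional[date] = None) -> Path:
    """Return e.g. SKILL.md.bak.v20250131 next to the original file."""
    today = today or date.today()
    original = Path(skill_path)
    return original.with_name(f"{original.name}.bak.v{today:%Y%m%d}")


def backup_skill(skill_path: str, today: Optional[date] = None) -> Path:
    """Copy the Skill file to its dated backup path and return that path."""
    target = backup_path(skill_path, today)
    shutil.copy2(skill_path, target)  # copy2 also preserves timestamps
    return target
```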

Project Structure

skill-forge/
├── skill_forge/              # Core Python package
│   ├── cli.py                # CLI entry point
│   ├── core/
│   │   ├── agent.py          # LLM agent for rewriting
│   │   ├── batch.py          # Batch optimization runner
│   │   ├── benchmark.py      # Benchmark YAML parser
│   │   ├── evaluator.py      # L1/L2 evaluation engine
│   │   ├── optimizer.py      # Main optimization loop
│   │   └── storage.py        # Optimization history storage
│   └── utils/
│       ├── diff.py           # Diff generation
│       └── scoring.py        # Scoring utilities
├── skills/                   # Open-source Skill files
│   └── skill-forge-eval/     # Full optimization workflow SKILL
├── tests/                    # Unit & integration tests
└── configs/                  # Local configs (not committed)

Evaluation System

| Layer | Method | Purpose |
|-------|--------|---------|
| L1 | Static analysis | YAML frontmatter, heading structure, code block balance |
| L2 | Benchmark tests | Keyword presence checks against expected content |
| LLM-as-Judge | LLM comparison | Content integrity, accuracy, readability, format, usability |
Benchmarks

| Skill | Cases | Status |
|-------|-------|--------|
| wechat-docx-export | 4 | ✅ 100% |
| dev-workflow | 10 | ✅ 100% |
| skill-forge-eval | 7 | ✅ 100% |
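
The pass percentages above come from L2 keyword presence checks. A rough sketch of how such a pass rate could be computed follows; the case structure (a `name` plus a list of `keywords`), the case-insensitive matching, and the function name `l2_pass_rate` are all assumptions for illustration and do not reflect the project's actual benchmark YAML schema.

```python
from typing import Dict, List, Tuple


def l2_pass_rate(skill_text: str, cases: List[dict]) -> Tuple[float, Dict[str, List[str]]]:
    """Return (fraction of cases passed, missing keywords per case)."""
    haystack = skill_text.lower()
    missing_by_case = {}
    for case in cases:
        # A case passes only if every expected keyword appears in the Skill.
        missing = [kw for kw in case["keywords"] if kw.lower() not in haystack]
        missing_by_case[case["name"]] = missing
    passed = sum(1 for missing in missing_by_case.values() if not missing)
    rate = passed / len(cases) if cases else 0.0
    return rate, missing_by_case


# Hypothetical cases, written as plain dicts instead of parsed YAML.
demo_cases = [
    {"name": "install", "keywords": ["pip install"]},
    {"name": "export", "keywords": ["docx", "pdf"]},
]
demo_rate, demo_missing = l2_pass_rate("Run pip install -e . then export to DOCX.", demo_cases)
```

Here the second case fails because "pdf" never appears in the text, so half of the cases pass.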

Optimization Results

| Skill | Original | Optimized | Compression | L2 Pass |
|-------|----------|-----------|-------------|---------|
| wechat-docx-export | 352 lines | 226 lines | 35.8% | ✅ 100% |
| dev-workflow | 2843 lines | 1072 lines | 62.3% | ✅ 100% |
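
The Compression column is consistent with (original - optimized) / original, rounded to one decimal place. That formula is inferred from the numbers rather than stated by the project; the short check below reproduces both figures under that assumption.

```python
def compression_percent(original_lines: int, optimized_lines: int) -> float:
    """Share of lines removed, as a percentage rounded to one decimal."""
    removed = original_lines - optimized_lines
    return round(removed / original_lines * 100, 1)


wechat = compression_percent(352, 226)         # 126 of 352 lines removed
dev_workflow = compression_percent(2843, 1072)  # 1771 of 2843 lines removed
```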

License

MIT

