Skill Forge

Autoresearch-inspired automatic optimization for AI Agent Skills (SKILL.md)

Skill Forge automatically iterates on and optimizes your AI Agent Skill files using a three-layer evaluation system: static checks, benchmark tests, and LLM-as-Judge scoring. Just as Karpathy's autoresearch lets AI agents optimize training code overnight, Skill Forge lets AI agents optimize your Skill instructions.

How It Works

Original Skill → AI Agent Rewrites → 3-Layer Evaluation → Score ↑ Keep / ↓ Rollback → Show Diff → Next Round
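
The pipeline above amounts to a keep-or-rollback loop. The following is a minimal sketch of that idea, not the code in skill_forge/core/optimizer.py: the function name `optimize_skill` and the `rewrite` and `evaluate` callables are hypothetical stand-ins for the agent and the three-layer evaluation, and treating a tied score as a rollback is an assumption.

```python
from typing import Callable, List, Tuple

# One history entry: (round number, candidate score, whether it was kept).
Round = Tuple[int, float, bool]


def optimize_skill(
    skill_text: str,
    rewrite: Callable[[str], str],     # hypothetical: the agent's rewrite step
    evaluate: Callable[[str], float],  # hypothetical: combined evaluation score
    max_rounds: int = 5,
) -> Tuple[str, List[Round]]:
    """Keep a rewrite only when its score beats the best score so far."""
    best_text = skill_text
    best_score = evaluate(best_text)
    history: List[Round] = []
    for round_no in range(1, max_rounds + 1):
        candidate = rewrite(best_text)
        score = evaluate(candidate)
        kept = score > best_score  # assumption: a tie counts as a rollback
        if kept:
            best_text, best_score = candidate, score
        # On rollback, best_text is left untouched for the next round.
        history.append((round_no, score, kept))
    return best_text, history


def toy_rewrite(text: str) -> str:
    """Toy rewrite: trim one character, or pad once the text is short."""
    return text[:-1] if len(text) > 3 else text + "xx"


def toy_score(text: str) -> float:
    """Toy score: the closer the text is to 3 characters, the higher."""
    return -abs(len(text) - 3)


best, rounds = optimize_skill("abcde", toy_rewrite, toy_score, max_rounds=3)
```

In the toy run, the first two rewrites improve the score and are kept, while the third makes it worse and is discarded, so `best` stays at the round 2 text.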

Installation

pip install -e .

Usage

# Optimize a Skill (semi-auto mode, default)
skill-forge optimize --skill /path/to/SKILL.md --benchmark /path/to/benchmark.yaml

# Full-auto mode
skill-forge optimize --skill /path/to/SKILL.md --benchmark /path/to/benchmark.yaml --mode auto --max-rounds 20

# Batch optimize multiple Skills at once
skill-forge batch --config configs/benchmarks.yaml

# Evaluate only
skill-forge evaluate --skill /path/to/SKILL.md --benchmark /path/to/benchmark.yaml

# View history
skill-forge history --skill /path/to/SKILL.md

# Show diff for a specific round
skill-forge diff --skill /path/to/SKILL.md --round 1

# Rollback to a specific round
skill-forge rollback --skill /path/to/SKILL.md --round 3

Complete Optimization Workflow

Use the skill-forge-eval SKILL to guide a full optimization cycle:

Phase 1: Prepare   → Confirm Skill path, create/find Benchmark & Program YAML
Phase 2: Optimize  → Run skill-forge optimize, monitor each round
Phase 3: Backup    → Auto-backup original as SKILL.md.bak.vYYYYMMDD
Phase 4: Evaluate  → L1 static check + L2 Benchmark + LLM-as-Judge
Phase 5: Patch     → Auto-patch missing items, re-verify L2
Phase 6: Replace   → Replace original if all conditions met

Trigger: "optimize {skill-name}" or "run skill-forge-eval"
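
As an illustration of the Phase 3 naming scheme, here is a small sketch that builds a SKILL.md.bak.vYYYYMMDD path and copies the original to it. The helper names `backup_name` and `backup_skill` are hypothetical and are not taken from the skill-forge-eval workflow itself.

```python
import shutil
from datetime import date
from pathlib import Path
from typing import Optional


def backup_name(skill_path: Path, today: Optional[date] = None) -> Path:
    """Return the sibling path <name>.bak.vYYYYMMDD for a Skill file."""
    today = today or date.today()
    return skill_path.with_name(f"{skill_path.name}.bak.v{today:%Y%m%d}")


def backup_skill(skill_path: Path, today: Optional[date] = None) -> Path:
    """Copy the Skill file to its dated backup path and return that path."""
    target = backup_name(skill_path, today)
    shutil.copy2(skill_path, target)  # copy2 also preserves file metadata
    return target
```

Keeping the name computation separate from the copy makes the naming rule easy to check without touching the filesystem.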

Project Structure

skill-forge/
├── skill_forge/              # Core Python package
│   ├── cli.py                # CLI entry point
│   ├── core/
│   │   ├── agent.py          # LLM agent for rewriting
│   │   ├── batch.py          # Batch optimization runner
│   │   ├── benchmark.py      # Benchmark YAML parser
│   │   ├── evaluator.py      # L1/L2 evaluation engine
│   │   ├── optimizer.py      # Main optimization loop
│   │   └── storage.py        # Optimization history storage
│   └── utils/
│       ├── diff.py           # Diff generation
│       └── scoring.py        # Scoring utilities
├── skills/                   # Open-source Skill files
│   └── skill-forge-eval/     # Full optimization workflow SKILL
├── tests/                    # Unit & integration tests
└── configs/                  # Local configs (not committed)

Evaluation System

Layer	Method	Purpose
L1	Static analysis	YAML frontmatter, heading structure, code block balance
L2	Benchmark tests	Keyword presence checks against expected content
LLM-as-Judge	LLM comparison	Content integrity, accuracy, readability, format, usability
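
To make the L1 and L2 rows concrete, the sketch below shows one way such checks could look. It is not the logic in skill_forge/core/evaluator.py: the function names, the exact rules (frontmatter delimited by `---`, at least one Markdown heading outside code blocks, an even number of code fences), and the case-insensitive keyword match are all assumptions.

```python
import re
from typing import Dict, List

FENCE = "`" * 3  # Markdown code fence marker


def l1_static_check(text: str) -> Dict[str, bool]:
    """Static checks: frontmatter present, a heading exists, fences balanced."""
    lines = text.splitlines()
    has_frontmatter = (
        bool(lines)
        and lines[0].strip() == "---"
        and any(line.strip() == "---" for line in lines[1:])
    )
    in_fence = False
    has_heading = False
    for line in lines:
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # each fence line opens or closes a block
        elif not in_fence and re.match(r"#{1,6} \S", line):
            has_heading = True  # headings inside code blocks are ignored
    return {
        "frontmatter": has_frontmatter,
        "code_blocks_balanced": not in_fence,
        "has_heading": has_heading,
    }


def l2_keyword_check(text: str, expected_keywords: List[str]) -> float:
    """Return the fraction of expected keywords found in the Skill text."""
    if not expected_keywords:
        return 1.0
    lowered = text.lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in lowered)
    return hits / len(expected_keywords)


sample = (
    "---\nname: demo\n---\n# Demo\n\nRun the export step.\n"
    + FENCE + "\n# not a heading\n" + FENCE + "\n"
)
l1_report = l1_static_check(sample)
l2_score = l2_keyword_check(sample, ["export", "rollback"])
```

A benchmark case would then pass or fail by comparing the L2 fraction against a threshold; where that threshold sits is left open here.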

Benchmarks

Skill	Cases	Status
wechat-docx-export	4	✅ 100%
dev-workflow	10	✅ 100%
skill-forge-eval	7	✅ 100%

Optimization Results

Skill	Original	Optimized	Compression	L2 Pass
wechat-docx-export	352 lines	226 lines	35.8%	✅ 100%
dev-workflow	2843 lines	1072 lines	62.3%	✅ 100%
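
The Compression column appears to be the share of lines removed relative to the original. The sketch below assumes that formula, which reproduces both rows of the table; the function name `compression_pct` is hypothetical.

```python
def compression_pct(original_lines: int, optimized_lines: int) -> float:
    """Percentage of lines removed, rounded to one decimal place."""
    if original_lines <= 0:
        raise ValueError("original_lines must be positive")
    removed = original_lines - optimized_lines
    return round(100 * removed / original_lines, 1)


wechat = compression_pct(352, 226)          # wechat-docx-export row
dev_workflow = compression_pct(2843, 1072)  # dev-workflow row
```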

License

MIT
