4 changes: 4 additions & 0 deletions .gitignore
@@ -18,6 +18,10 @@ Thumbs.db
store/
maelstrom/maelstrom/store/

# Python
__pycache__/
*.pyc

# Profiling
profile.json

155 changes: 155 additions & 0 deletions gepa/README.md
@@ -0,0 +1,155 @@
# GEPA - Genetic Evolution for Prompt Artifacts

Skill evaluation harness and evolutionary optimizer for Claude Code agent skills (`.claude/agents/*.md`) and DST fault configurations (`src/buggify/config.rs`).

## Quick Start

```bash
# Phase 1: Scoring
python -m gepa.harness --list-tasks # Available ground truth tasks
python -m gepa.harness --score-baseline # Score all skills
python -m gepa.harness --skill rust-dev --offline # Score one skill
python -m gepa.harness --detail rust-dev paper_review # Detailed breakdown
python -m gepa.harness --capture-reviews # Capture via Claude CLI

# Phase 2: Skill Evolution
python -m gepa.harness --evolve rust-dev --mock # Test GA machinery (free)
python -m gepa.harness --evolve rust-dev --live # Real evolution (~$0.05/eval)
python -m gepa.harness --evolve rust-dev --live --budget 10.0 --generations 15

# Phase 2: DST Config Optimization
python -m gepa.harness --dst-optimize --mock # Test DST GA (free)
python -m gepa.harness --dst-optimize --test executor_dst_test --seeds 10
python -m gepa.harness --dst-optimize --generations 15 --population 12
```

## Architecture

```
gepa/
__init__.py # Package exports
__main__.py # python -m gepa entry point
harness.py # CLI: argparse + orchestration

# Phase 1: Scoring
scorer.py # Keyword matching + composite scoring
candidate.py # SkillCandidate: markdown parsing by section
evaluator.py # Offline + live evaluation modes

# Phase 2: Skill Evolution
mutations.py # 5 text mutation operators
skill_fitness.py # Mock, Offline, Live fitness evaluators + CostTracker
evolution.py # SkillEvolutionEngine: GA loop for skill markdown

# Phase 2: DST Config Optimization
dst_candidate.py # DstCandidate: 32 tunable fault parameters
dst_evaluator.py # Mock + cargo test fitness evaluators
dst_optimizer.py # DstEvolutionEngine: GA loop for fault configs

# Data
ground_truth/
schema.md # Ground truth format documentation
paper_review.json # Expert paper review findings
pr13_review.json # PR #13 review findings (ACL DRYRUN/LOG)
pr14_review.json # PR #14 review findings (ACL categories)
reviews/ # Cached review texts (git-ignored)
results/ # Skill evolution output (git-ignored)
dst_results/ # DST optimization output (git-ignored)
```

## Phase 1: Scoring

Each review is scored on four components:

| Component | Weight | Description |
|-----------|--------|-------------|
| Weighted Recall | 0.50 | Found required issues, weighted by severity |
| Precision | 0.25 | True positives / (TP + false positives) |
| Calibration | 0.15 | Correct severity classification |
| Coverage Bonus | 0.10 | Optional findings discovered |

Scoring uses keyword co-occurrence (deterministic, no API calls):
- Each finding has a list of keywords
- A finding is "matched" if >= 50% of keywords appear in the review
- False positives (known-correct things incorrectly flagged) reduce precision
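The matching rule above can be sketched in a few lines. This is a minimal illustration; the function name, signature, and data shapes are assumptions, not the harness's actual API:

```python
def match_finding(keywords: list[str], review_text: str) -> bool:
    """A finding counts as matched when >= 50% of its keywords
    appear (case-insensitively) in the review text."""
    text = review_text.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return hits >= len(keywords) / 2
```

Because matching is plain substring co-occurrence, scoring is deterministic and needs no API calls.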

## Phase 2: Skill Evolution

Evolves skill markdown files using genetic algorithms:

- **Crossover**: Section-level uniform (50/50 per section from each parent)
- **Mutation**: 5 text operators (sentence shuffle/drop/duplicate, keyword inject, section swap)
- **Fitness**: Composite score from ground truth evaluation
- **Budget**: CostTracker enforces spending limits for live evaluations
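Section-level uniform crossover can be sketched as follows (an illustrative sketch over plain strings; the real engine operates on parsed skill sections):

```python
import random

def crossover_sections(parent_a: list[str], parent_b: list[str]) -> list[str]:
    """Uniform crossover: each aligned section is drawn from
    either parent with probability 0.5."""
    assert len(parent_a) == len(parent_b), "parents must have aligned sections"
    return [a if random.random() < 0.5 else b
            for a, b in zip(parent_a, parent_b)]
```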

### Evaluator Modes

| Mode | Cost | Usage |
|------|------|-------|
| `--mock` | Free | Tests GA machinery with synthetic fitness |
| `--live` | ~$0.05/task | Calls Claude CLI, scores against ground truth |
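A budget guard for live evaluations might look like the sketch below. Only the class name `CostTracker` comes from this README; the methods and fields are hypothetical:

```python
class CostTracker:
    """Illustrative spending limit for live Claude CLI evaluations."""

    def __init__(self, budget_usd: float):
        self.budget = budget_usd
        self.spent = 0.0

    def charge(self, cost_usd: float) -> bool:
        """Record a cost; refuse (return False) if it would exceed the budget."""
        if self.spent + cost_usd > self.budget:
            return False
        self.spent += cost_usd
        return True
```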

### Output

Results saved to `gepa/results/{skill_name}/`:
- `gen_NNN.json` — Population data per generation
- `best_skill.md` — Best evolved skill markdown
- `evolution_history.json` — Full history with fitness curves

## Phase 2: DST Config Optimization

Evolves buggify fault injection probabilities to find configurations that surface the most invariant violations:

- **Parameters**: 32 floats (31 fault probabilities + global_multiplier)
- **Crossover**: Uniform per parameter
- **Mutation**: `current += choice([-1, 0, 1]) * step`, clamped to bounds
- **Initialization**: calm + moderate + chaos presets + random variants
- **Fitness**: 0.5 * violations_found + 0.3 * fault_coverage + 0.2 * (1 - crash_rate)
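The mutation step and fitness formula above can be written out directly. A minimal sketch; the function names are illustrative, and the weights and clamping come from the bullets above:

```python
import random

def mutate_param(value: float, step: float, lo: float, hi: float) -> float:
    """Step mutation: move by -step, 0, or +step, clamped to [lo, hi]."""
    return min(hi, max(lo, value + random.choice([-1, 0, 1]) * step))

def dst_fitness(violations_found: float, fault_coverage: float,
                crash_rate: float) -> float:
    """Composite fitness with the weights listed above."""
    return 0.5 * violations_found + 0.3 * fault_coverage + 0.2 * (1 - crash_rate)
```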

### Rust Integration

The `BUGGIFY_CONFIG` env var overrides `FaultConfig::moderate()` defaults:

```bash
# Run DST tests with custom fault config
BUGGIFY_CONFIG="global_multiplier=2.0,network.packet_drop=0.05" \
cargo test --release --test executor_dst_test
```

Format: comma-separated `key=value` pairs. See `src/buggify/faults.rs` for all fault IDs.
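On the Python side, rendering a candidate config into this format is a one-liner. A sketch, assuming a flat dict of fault parameters whose keys match the fault IDs in `src/buggify/faults.rs`:

```python
def to_buggify_env(config: dict[str, float]) -> str:
    """Render a fault config as the comma-separated key=value
    string that the BUGGIFY_CONFIG env var expects."""
    return ",".join(f"{key}={value}" for key, value in config.items())
```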

### Output

Results saved to `gepa/dst_results/`:
- `gen_NNN.json` — Population data per generation
- `best_config.env` — Best config in `BUGGIFY_CONFIG` format
- `evolution_history.json` — Full history

## Ground Truth

Ground truth files capture expert knowledge from real reviews:
- **paper_review.json**: 13 findings from a 6-expert paper review
- **pr13_review.json**: 6 findings from PR #13 (ACL DRYRUN/LOG)
- **pr14_review.json**: 4 findings from PR #14 (ACL categories)

See `ground_truth/schema.md` for the full format specification.

## Testing

```bash
# All Phase 1 tests
python3 -m unittest gepa.test_scorer gepa.test_candidate gepa.test_evaluator -v

# Phase 2: Mutation + Evolution tests
python3 -m unittest gepa.test_mutations gepa.test_evolution -v

# Phase 2: DST optimizer tests
python3 -m unittest gepa.test_dst_candidate gepa.test_dst_optimizer -v

# Rust-side buggify config parsing
cargo test --release -p redis_sim config::tests
```

## Dependencies

None. Python 3.10+ stdlib only (matching the `evolve/` pattern).
24 changes: 24 additions & 0 deletions gepa/__init__.py
@@ -0,0 +1,24 @@
"""
GEPA - Genetic Evolution for Prompt Artifacts

Skill evaluation harness and DST configuration optimizer.
Uses ground truth from expert reviews to score and evolve
Claude Code skill files (.claude/agents/*.md).
"""

from .scorer import Scorer
from .candidate import SkillCandidate
from .evaluator import OfflineEvaluator
from .evolution import SkillEvolutionEngine
from .dst_candidate import DstCandidate
from .dst_optimizer import DstEvolutionEngine

__version__ = "0.2.0"
__all__ = [
"Scorer",
"SkillCandidate",
"OfflineEvaluator",
"SkillEvolutionEngine",
"DstCandidate",
"DstEvolutionEngine",
]
4 changes: 4 additions & 0 deletions gepa/__main__.py
@@ -0,0 +1,4 @@
"""Allow running as: python -m gepa"""
from .harness import main

main()
177 changes: 177 additions & 0 deletions gepa/candidate.py
@@ -0,0 +1,177 @@
"""
SkillCandidate - represents a skill markdown file as a mutable artifact.

Parses skill markdown into sections by markdown headings (##, ###, ...),
enabling section-level mutation and crossover for GEPA evolution.
"""

import re
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List, Optional


@dataclass
class SkillSection:
"""A single ## section within a skill markdown file."""
heading: str # The ## heading text (without ##)
content: str # Everything between this heading and the next
level: int = 2 # Heading level (## = 2, ### = 3, etc.)

def to_markdown(self) -> str:
prefix = "#" * self.level
return f"{prefix} {self.heading}\n\n{self.content}"

def word_count(self) -> int:
return len(self.content.split())


@dataclass
class SkillCandidate:
"""
A skill markdown file parsed into sections for mutation.

Mirrors evolve/candidate.py pattern: serializable, identifiable,
and fitness-trackable.
"""
name: str # Skill name (e.g., "rust-dev")
frontmatter: str # YAML frontmatter (---\n...\n---)
preamble: str # Content before first ## heading
sections: List[SkillSection] = field(default_factory=list)
fitness: Optional[float] = None
generation: int = 0
parent_ids: List[int] = field(default_factory=list)
_id: int = field(default_factory=lambda: SkillCandidate._next_id())

    # Class-level ID counter. Deliberately unannotated so the dataclass
    # machinery treats it as a plain class attribute, not an instance field.
    _id_counter = 0

@classmethod
def _next_id(cls) -> int:
cls._id_counter += 1
return cls._id_counter

@classmethod
def reset_id_counter(cls):
"""Reset ID counter (useful for testing)."""
cls._id_counter = 0

@classmethod
def from_file(cls, path: Path) -> "SkillCandidate":
"""Parse a skill markdown file into sections."""
assert path.exists(), f"Skill file not found: {path}"
text = path.read_text()
name = path.stem
return cls.from_text(name, text)

@classmethod
def from_text(cls, name: str, text: str) -> "SkillCandidate":
"""Parse skill markdown text into sections."""
assert isinstance(text, str) and len(text) > 0, "text must be non-empty"

frontmatter = ""
body = text

# Extract YAML frontmatter
fm_match = re.match(r'^---\s*\n(.*?)\n---\s*\n', text, re.DOTALL)
if fm_match:
frontmatter = fm_match.group(0)
body = text[fm_match.end():]

# Split on ## headings (capture the heading line)
parts = re.split(r'^(#{1,6})\s+(.+)$', body, flags=re.MULTILINE)

preamble = parts[0].strip()
sections = []

# parts[0] = preamble, then groups of 3: (hashes, heading, content)
i = 1
while i < len(parts) - 2:
hashes = parts[i]
heading = parts[i + 1].strip()
content = parts[i + 2].strip() if i + 2 < len(parts) else ""
level = len(hashes)
sections.append(SkillSection(
heading=heading,
content=content,
level=level,
))
i += 3

return cls(
name=name,
frontmatter=frontmatter,
preamble=preamble,
sections=sections,
)

def to_markdown(self) -> str:
"""Reconstruct the full markdown from sections."""
parts = []
if self.frontmatter:
parts.append(self.frontmatter.rstrip())
if self.preamble:
parts.append(self.preamble)
for section in self.sections:
parts.append(section.to_markdown())
return "\n\n".join(parts) + "\n"

def save(self, path: Path) -> None:
"""Save skill markdown to file."""
path.write_text(self.to_markdown())

def section_names(self) -> List[str]:
"""List all section headings."""
return [s.heading for s in self.sections]

def get_section(self, heading: str) -> Optional[SkillSection]:
"""Find a section by heading (case-insensitive)."""
heading_lower = heading.lower()
for s in self.sections:
if s.heading.lower() == heading_lower:
return s
return None

def replace_section(self, heading: str, new_content: str) -> bool:
"""Replace a section's content by heading. Returns True if found."""
section = self.get_section(heading)
if section is None:
return False
section.content = new_content
return True

def word_count(self) -> int:
"""Total word count across all sections."""
total = len(self.preamble.split())
total += sum(s.word_count() for s in self.sections)
return total

def to_dict(self) -> Dict:
"""Serialize to dict for JSON storage."""
return {
"id": self._id,
"name": self.name,
"fitness": self.fitness,
"generation": self.generation,
"parent_ids": self.parent_ids,
"section_count": len(self.sections),
"word_count": self.word_count(),
"sections": [s.heading for s in self.sections],
}

@classmethod
def from_dict(cls, data: Dict, agents_dir: Path) -> "SkillCandidate":
"""Load from dict by re-reading the skill file."""
name = data["name"]
path = agents_dir / f"{name}.md"
candidate = cls.from_file(path)
candidate.fitness = data.get("fitness")
candidate.generation = data.get("generation", 0)
candidate.parent_ids = data.get("parent_ids", [])
return candidate

def __repr__(self) -> str:
fitness_str = f"{self.fitness:.3f}" if self.fitness is not None else "?"
return (
f"SkillCandidate(name={self.name!r}, fitness={fitness_str}, "
f"sections={len(self.sections)}, words={self.word_count()})"
)