4 changes: 4 additions & 0 deletions .gitignore
@@ -18,6 +18,10 @@ Thumbs.db
store/
maelstrom/maelstrom/store/

# Python
__pycache__/
*.pyc

# Profiling
profile.json

155 changes: 155 additions & 0 deletions gepa/README.md
@@ -0,0 +1,155 @@
# GEPA - Genetic Evolution for Prompt Artifacts

Skill evaluation harness and evolutionary optimizer for Claude Code agent skills (`.claude/agents/*.md`) and DST fault configurations (`src/buggify/config.rs`).

## Quick Start

```bash
# Phase 1: Scoring
python -m gepa.harness --list-tasks # Available ground truth tasks
python -m gepa.harness --score-baseline # Score all skills
python -m gepa.harness --skill rust-dev --offline # Score one skill
python -m gepa.harness --detail rust-dev paper_review # Detailed breakdown
python -m gepa.harness --capture-reviews # Capture via Claude CLI

# Phase 2: Skill Evolution
python -m gepa.harness --evolve rust-dev --mock # Test GA machinery (free)
python -m gepa.harness --evolve rust-dev --live # Real evolution (~$0.05/eval)
python -m gepa.harness --evolve rust-dev --live --budget 10.0 --generations 15

# Phase 2: DST Config Optimization
python -m gepa.harness --dst-optimize --mock # Test DST GA (free)
python -m gepa.harness --dst-optimize --test executor_dst_test --seeds 10
python -m gepa.harness --dst-optimize --generations 15 --population 12
```

## Architecture

```
gepa/
__init__.py # Package exports
__main__.py # python -m gepa entry point
harness.py # CLI: argparse + orchestration

# Phase 1: Scoring
scorer.py # Keyword matching + composite scoring
candidate.py # SkillCandidate: markdown parsing by section
evaluator.py # Offline + live evaluation modes

# Phase 2: Skill Evolution
mutations.py # 5 text mutation operators
skill_fitness.py # Mock, Offline, Live fitness evaluators + CostTracker
evolution.py # SkillEvolutionEngine: GA loop for skill markdown

# Phase 2: DST Config Optimization
dst_candidate.py # DstCandidate: 32 tunable fault parameters
dst_evaluator.py # Mock + cargo test fitness evaluators
dst_optimizer.py # DstEvolutionEngine: GA loop for fault configs

# Data
ground_truth/
schema.md # Ground truth format documentation
paper_review.json # Expert paper review findings
pr13_review.json # PR #13 review findings (ACL DRYRUN/LOG)
pr14_review.json # PR #14 review findings (ACL categories)
reviews/ # Cached review texts (git-ignored)
results/ # Skill evolution output (git-ignored)
dst_results/ # DST optimization output (git-ignored)
```

## Phase 1: Scoring

Each review is scored on four components:

| Component | Weight | Description |
|-----------|--------|-------------|
| Weighted Recall | 0.50 | Found required issues, weighted by severity |
| Precision | 0.25 | True positives / (TP + false positives) |
| Calibration | 0.15 | Correct severity classification |
| Coverage Bonus | 0.10 | Optional findings discovered |

Scoring uses keyword co-occurrence (deterministic, no API calls):
- Each finding has a list of keywords
- A finding is "matched" if >= 50% of keywords appear in the review
- False positives (known-correct things incorrectly flagged) reduce precision
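The matching rule above can be sketched in a few lines. This is a minimal illustration; the function name, signature, and data shapes are assumptions, not the harness's actual API:

```python
def match_finding(keywords: list[str], review_text: str) -> bool:
    """A finding counts as matched when >= 50% of its keywords
    appear (case-insensitively) in the review text."""
    text = review_text.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return hits >= len(keywords) / 2
```

Because matching is plain substring co-occurrence, scoring is deterministic and needs no API calls.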

## Phase 2: Skill Evolution

Evolves skill markdown files using genetic algorithms:

- **Crossover**: Section-level uniform (50/50 per section from each parent)
- **Mutation**: 5 text operators (sentence shuffle/drop/duplicate, keyword inject, section swap)
- **Fitness**: Composite score from ground truth evaluation
- **Budget**: CostTracker enforces spending limits for live evaluations
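Section-level uniform crossover can be sketched as follows (an illustrative sketch over plain strings; the real engine operates on parsed skill sections):

```python
import random

def crossover_sections(parent_a: list[str], parent_b: list[str]) -> list[str]:
    """Uniform crossover: each aligned section is drawn from
    either parent with probability 0.5."""
    assert len(parent_a) == len(parent_b), "parents must have aligned sections"
    return [a if random.random() < 0.5 else b
            for a, b in zip(parent_a, parent_b)]
```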

### Evaluator Modes

| Mode | Cost | Usage |
|------|------|-------|
| `--mock` | Free | Tests GA machinery with synthetic fitness |
| `--live` | ~$0.05/task | Calls Claude CLI, scores against ground truth |
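A budget guard for live evaluations might look like the sketch below. Only the class name `CostTracker` comes from this README; the methods and fields are hypothetical:

```python
class CostTracker:
    """Illustrative spending limit for live Claude CLI evaluations."""

    def __init__(self, budget_usd: float):
        self.budget = budget_usd
        self.spent = 0.0

    def charge(self, cost_usd: float) -> bool:
        """Record a cost; refuse (return False) if it would exceed the budget."""
        if self.spent + cost_usd > self.budget:
            return False
        self.spent += cost_usd
        return True
```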

### Output

Results saved to `gepa/results/{skill_name}/`:
- `gen_NNN.json` — Population data per generation
- `best_skill.md` — Best evolved skill markdown
- `evolution_history.json` — Full history with fitness curves

## Phase 2: DST Config Optimization

Evolves buggify fault injection probabilities to find configurations that surface the most invariant violations:

- **Parameters**: 32 floats (31 fault probabilities + global_multiplier)
- **Crossover**: Uniform per parameter
- **Mutation**: `current += choice([-1, 0, 1]) * step`, clamped to bounds
- **Initialization**: calm + moderate + chaos presets + random variants
- **Fitness**: 0.5 * violations_found + 0.3 * fault_coverage + 0.2 * (1 - crash_rate)
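The mutation step and fitness formula above can be written out directly. A minimal sketch; the function names are illustrative, and the weights and clamping come from the bullets above:

```python
import random

def mutate_param(value: float, step: float, lo: float, hi: float) -> float:
    """Step mutation: move by -step, 0, or +step, clamped to [lo, hi]."""
    return min(hi, max(lo, value + random.choice([-1, 0, 1]) * step))

def dst_fitness(violations_found: float, fault_coverage: float,
                crash_rate: float) -> float:
    """Composite fitness with the weights listed above."""
    return 0.5 * violations_found + 0.3 * fault_coverage + 0.2 * (1 - crash_rate)
```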

### Rust Integration

The `BUGGIFY_CONFIG` env var overrides `FaultConfig::moderate()` defaults:

```bash
# Run DST tests with custom fault config
BUGGIFY_CONFIG="global_multiplier=2.0,network.packet_drop=0.05" \
cargo test --release --test executor_dst_test
```

Format: comma-separated `key=value` pairs. See `src/buggify/faults.rs` for all fault IDs.
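On the Python side, rendering a candidate config into this format is a one-liner. A sketch, assuming a flat dict of fault parameters whose keys match the fault IDs in `src/buggify/faults.rs`:

```python
def to_buggify_env(config: dict[str, float]) -> str:
    """Render a fault config as the comma-separated key=value
    string that the BUGGIFY_CONFIG env var expects."""
    return ",".join(f"{key}={value}" for key, value in config.items())
```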

### Output

Results saved to `gepa/dst_results/`:
- `gen_NNN.json` — Population data per generation
- `best_config.env` — Best config in `BUGGIFY_CONFIG` format
- `evolution_history.json` — Full history

## Ground Truth

Ground truth files capture expert knowledge from real reviews:
- **paper_review.json**: 13 findings from a 6-expert paper review
- **pr13_review.json**: 6 findings from PR #13 (ACL DRYRUN/LOG)
- **pr14_review.json**: 4 findings from PR #14 (ACL categories)

See `ground_truth/schema.md` for the full format specification.

## Testing

```bash
# All Phase 1 tests
python3 -m unittest gepa.test_scorer gepa.test_candidate gepa.test_evaluator -v

# Phase 2: Mutation + Evolution tests
python3 -m unittest gepa.test_mutations gepa.test_evolution -v

# Phase 2: DST optimizer tests
python3 -m unittest gepa.test_dst_candidate gepa.test_dst_optimizer -v

# Rust-side buggify config parsing
cargo test --release -p redis_sim config::tests
```

## Dependencies

None. Python 3.10+ stdlib only (matching the `evolve/` pattern).
24 changes: 24 additions & 0 deletions gepa/__init__.py
@@ -0,0 +1,24 @@
"""
GEPA - Genetic Evolution for Prompt Artifacts

Skill evaluation harness and DST configuration optimizer.
Uses ground truth from expert reviews to score and evolve
Claude Code skill files (.claude/agents/*.md).
"""

from .scorer import Scorer
from .candidate import SkillCandidate
from .evaluator import OfflineEvaluator
from .evolution import SkillEvolutionEngine
from .dst_candidate import DstCandidate
from .dst_optimizer import DstEvolutionEngine

__version__ = "0.2.0"
__all__ = [
"Scorer",
"SkillCandidate",
"OfflineEvaluator",
"SkillEvolutionEngine",
"DstCandidate",
"DstEvolutionEngine",
]
4 changes: 4 additions & 0 deletions gepa/__main__.py
@@ -0,0 +1,4 @@
"""Allow running as: python -m gepa"""
from .harness import main

main()
177 changes: 177 additions & 0 deletions gepa/candidate.py
@@ -0,0 +1,177 @@
"""
SkillCandidate - represents a skill markdown file as a mutable artifact.

Parses skill markdown into sections by markdown headings (##, ###, ...),
enabling section-level mutation and crossover for GEPA evolution.
"""

import re
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List, Optional


@dataclass
class SkillSection:
"""A single ## section within a skill markdown file."""
heading: str # The ## heading text (without ##)
content: str # Everything between this heading and the next
level: int = 2 # Heading level (## = 2, ### = 3, etc.)

def to_markdown(self) -> str:
prefix = "#" * self.level
return f"{prefix} {self.heading}\n\n{self.content}"

def word_count(self) -> int:
return len(self.content.split())


@dataclass
class SkillCandidate:
"""
A skill markdown file parsed into sections for mutation.

Mirrors evolve/candidate.py pattern: serializable, identifiable,
and fitness-trackable.
"""
name: str # Skill name (e.g., "rust-dev")
frontmatter: str # YAML frontmatter (---\n...\n---)
preamble: str # Content before first ## heading
sections: List[SkillSection] = field(default_factory=list)
fitness: Optional[float] = None
generation: int = 0
parent_ids: List[int] = field(default_factory=list)
_id: int = field(default_factory=lambda: SkillCandidate._next_id())

    # Class-level ID counter. Deliberately unannotated so the dataclass
    # machinery treats it as a plain class attribute, not an instance field.
    _id_counter = 0

@classmethod
def _next_id(cls) -> int:
cls._id_counter += 1
return cls._id_counter

@classmethod
def reset_id_counter(cls):
"""Reset ID counter (useful for testing)."""
cls._id_counter = 0

@classmethod
def from_file(cls, path: Path) -> "SkillCandidate":
"""Parse a skill markdown file into sections."""
assert path.exists(), f"Skill file not found: {path}"
text = path.read_text()
name = path.stem
return cls.from_text(name, text)

@classmethod
def from_text(cls, name: str, text: str) -> "SkillCandidate":
"""Parse skill markdown text into sections."""
assert isinstance(text, str) and len(text) > 0, "text must be non-empty"

frontmatter = ""
body = text

# Extract YAML frontmatter
fm_match = re.match(r'^---\s*\n(.*?)\n---\s*\n', text, re.DOTALL)
if fm_match:
frontmatter = fm_match.group(0)
body = text[fm_match.end():]

# Split on ## headings (capture the heading line)
parts = re.split(r'^(#{1,6})\s+(.+)$', body, flags=re.MULTILINE)

preamble = parts[0].strip()
sections = []

# parts[0] = preamble, then groups of 3: (hashes, heading, content)
i = 1
while i < len(parts) - 2:
hashes = parts[i]
heading = parts[i + 1].strip()
content = parts[i + 2].strip() if i + 2 < len(parts) else ""
level = len(hashes)
sections.append(SkillSection(
heading=heading,
content=content,
level=level,
))
i += 3

return cls(
name=name,
frontmatter=frontmatter,
preamble=preamble,
sections=sections,
)

def to_markdown(self) -> str:
"""Reconstruct the full markdown from sections."""
parts = []
if self.frontmatter:
parts.append(self.frontmatter.rstrip())
if self.preamble:
parts.append(self.preamble)
for section in self.sections:
parts.append(section.to_markdown())
return "\n\n".join(parts) + "\n"

def save(self, path: Path) -> None:
"""Save skill markdown to file."""
path.write_text(self.to_markdown())

def section_names(self) -> List[str]:
"""List all section headings."""
return [s.heading for s in self.sections]

def get_section(self, heading: str) -> Optional[SkillSection]:
"""Find a section by heading (case-insensitive)."""
heading_lower = heading.lower()
for s in self.sections:
if s.heading.lower() == heading_lower:
return s
return None

def replace_section(self, heading: str, new_content: str) -> bool:
"""Replace a section's content by heading. Returns True if found."""
section = self.get_section(heading)
if section is None:
return False
section.content = new_content
return True

def word_count(self) -> int:
"""Total word count across all sections."""
total = len(self.preamble.split())
total += sum(s.word_count() for s in self.sections)
return total

def to_dict(self) -> Dict:
"""Serialize to dict for JSON storage."""
return {
"id": self._id,
"name": self.name,
"fitness": self.fitness,
"generation": self.generation,
"parent_ids": self.parent_ids,
"section_count": len(self.sections),
"word_count": self.word_count(),
"sections": [s.heading for s in self.sections],
}

@classmethod
def from_dict(cls, data: Dict, agents_dir: Path) -> "SkillCandidate":
"""Load from dict by re-reading the skill file."""
name = data["name"]
path = agents_dir / f"{name}.md"
candidate = cls.from_file(path)
candidate.fitness = data.get("fitness")
candidate.generation = data.get("generation", 0)
candidate.parent_ids = data.get("parent_ids", [])
return candidate

def __repr__(self) -> str:
fitness_str = f"{self.fitness:.3f}" if self.fitness is not None else "?"
return (
f"SkillCandidate(name={self.name!r}, fitness={fitness_str}, "
f"sections={len(self.sections)}, words={self.word_count()})"
)