267 changes: 267 additions & 0 deletions .test/CUSTOMIZATION_GUIDE.md
@@ -0,0 +1,267 @@
# Customization Guide: Steering Skill Optimization with `--focus`

## 1. Overview

The `--focus` flag lets you steer skill optimization with natural language. It uses an LLM to make targeted adjustments to `manifest.yaml` and `ground_truth.yaml` before GEPA runs, so the optimizer prioritizes what matters to you.

### Quick Start

```bash
# Single focus area
uv run python .test/scripts/optimize.py databricks-asset-bundles \
  --focus "prefix all catalogs with customer_ prefix"

# Multiple focus areas
uv run python .test/scripts/optimize.py databricks-asset-bundles \
  --focus "prefix all catalogs with customer_ prefix" \
  --focus "always use serverless compute"

# Focus areas from a file
uv run python .test/scripts/optimize.py databricks-asset-bundles \
  --focus-file my_focus_areas.txt

# Dry run to see what would change
uv run python .test/scripts/optimize.py databricks-asset-bundles \
  --focus "prefix all catalogs with customer_ prefix" --dry-run

# Combined with presets
uv run python .test/scripts/optimize.py databricks-asset-bundles \
  --focus "use DLT for all pipeline examples" --preset quick
```

### What Happens

1. The LLM reads SKILL.md, `manifest.yaml`, and `ground_truth.yaml`
2. It adds `[FOCUS]`-prefixed guidelines to the manifest
3. It adjusts relevant existing test cases (expectations, patterns, guidelines)
4. It generates 2-3 new test cases targeting the focus area
5. GEPA optimization then runs with these enhanced evaluation criteria

---

## 2. How Each `ground_truth.yaml` Field Impacts Optimization

### `outputs.response` - Reference Answer

**What it is:** The ideal response the judge compares agent output against.

**How it steers optimization:** The quality judge uses this as the gold standard. If the reference response includes specific patterns (e.g., parameterized catalogs), the optimizer learns to produce those patterns.

**Example focus prompt:** `"All examples should use variable substitution for catalog names"`

**Before:**
```yaml
outputs:
  response: |
    catalog: my_catalog
```

**After:**
```yaml
outputs:
  response: |
    catalog: ${var.catalog_prefix}_catalog
```

### `expectations.expected_facts` - Substring Assertions

**What it is:** Exact substrings that must appear in the response. Checked deterministically (case-insensitive).

**How it steers optimization:** Failed facts tell the optimizer exactly what content is missing. Adding facts about your focus area forces the optimizer to include that content.

**Example focus prompt:** `"All catalog values must use the ${var.catalog_prefix} variable"`

**Before:**
```yaml
expected_facts:
  - "Defines variables with default values"
```

**After:**
```yaml
expected_facts:
  - "Defines variables with default values"
  - "All catalog values use ${var.catalog_prefix} variable"
```
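The deterministic side of this check is simple to picture. A minimal sketch (the function name is hypothetical; the real `assertions.py` implementation may differ) of a case-insensitive substring test that splits facts into passed and missing lists:

```python
def check_expected_facts(
    response: str, expected_facts: list[str]
) -> tuple[list[str], list[str]]:
    """Split expected facts into (passed, missing) using a
    case-insensitive substring match against the response."""
    lowered = response.lower()
    passed = [f for f in expected_facts if f.lower() in lowered]
    missing = [f for f in expected_facts if f.lower() not in lowered]
    return passed, missing
```

The `missing` list is exactly the kind of structured feedback the optimizer can act on: each entry names content the skill failed to produce.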

### `expectations.expected_patterns` - Regex Patterns

**What it is:** Regular expressions checked with `re.findall(pattern, response, re.IGNORECASE)`. Each has a `min_count` (minimum number of matches required) and a `description`.

**How it steers optimization:** Pattern failures are binary and precise. Adding patterns for your focus area creates hard requirements the optimizer must satisfy.

**Example focus prompt:** `"Prefix all catalogs with a configurable prefix variable"`

**Before:**
```yaml
expected_patterns:
  - pattern: 'catalog:'
    min_count: 2
    description: Defines catalog variable
```

**After:**
```yaml
expected_patterns:
  - pattern: 'catalog:'
    min_count: 2
    description: Defines catalog variable
  - pattern: '\$\{var\.catalog_prefix\}'
    min_count: 1
    description: Uses catalog prefix variable
```
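The pattern check described above reduces to one `re.findall` call. A minimal sketch (function name hypothetical) mirroring the documented `re.findall(pattern, response, re.IGNORECASE)` semantics with a `min_count` threshold:

```python
import re

def check_pattern(response: str, pattern: str, min_count: int = 1) -> bool:
    """True when the regex matches at least min_count times,
    case-insensitively, in the response."""
    return len(re.findall(pattern, response, re.IGNORECASE)) >= min_count
```

Because the result is binary per pattern, a failure points the optimizer at one precise, machine-checkable requirement rather than a fuzzy quality signal.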

### `expectations.guidelines` - LLM Judge Rules

**What it is:** Natural-language evaluation criteria passed to the LLM judge. The judge scores how well the response follows each guideline.

**How it steers optimization:** Guidelines are the most flexible steering mechanism. They feed the guideline-adherence judge, which contributes both to the quality composite (20% of the total score) and through the dedicated guideline-adherence weight (10%).

**Example focus prompt:** `"Must parameterize catalog names with a prefix variable"`

**Before:**
```yaml
guidelines:
  - "Must define variables at root level with defaults"
```

**After:**
```yaml
guidelines:
  - "Must define variables at root level with defaults"
  - "Must parameterize catalog names with a prefix variable"
```

### `metadata.tags` - Categorization

**What it is:** Tags for organizing and filtering test cases. No direct impact on optimization scoring.

**How it steers optimization:** Tags help identify which test cases were generated or adjusted by focus. Focus-generated test cases get tags matching the focus area.

---

## 3. How Each `manifest.yaml` Field Impacts Optimization

### `scorers.default_guidelines` - Global Guidelines

**What it is:** Guidelines applied to ALL test cases. The quality judge merges them with any per-test-case guidelines.

**How it steers optimization:** Adding `[FOCUS]` guidelines here affects every evaluation, not just specific test cases. This is the broadest way to steer optimization.

**What `--focus` does:** Prepends `[FOCUS]` to each new guideline and appends to the list. Duplicates are skipped.

**Before:**
```yaml
default_guidelines:
  - "Response must address the user's request completely"
  - "YAML examples must be valid and properly indented"
```

**After:**
```yaml
default_guidelines:
  - "Response must address the user's request completely"
  - "YAML examples must be valid and properly indented"
  - "[FOCUS] All catalog references must use a configurable prefix variable"
  - "[FOCUS] Variable substitution syntax ${var.prefix} must be demonstrated"
```
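The prepend-and-dedupe behavior can be sketched in a few lines (a sketch only; function name hypothetical, and the actual focus tool may normalize guidelines differently):

```python
def add_focus_guidelines(existing: list[str], new_guidelines: list[str]) -> list[str]:
    """Append [FOCUS]-prefixed guidelines to the list, skipping duplicates."""
    result = list(existing)
    for g in new_guidelines:
        # Tag the guideline unless it is already tagged
        tagged = g if g.startswith("[FOCUS]") else f"[FOCUS] {g}"
        if tagged not in result:
            result.append(tagged)
    return result
```

Re-running the same focus prompt is therefore idempotent: already-present `[FOCUS]` guidelines are not duplicated.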

### `quality_gates` - Pass/Fail Thresholds

**What it is:** Minimum score thresholds for each scorer. If a score falls below the gate, the test case fails.

**How it steers optimization:** Higher thresholds make the optimizer work harder to satisfy that criterion. `--focus` can only make thresholds stricter (higher), never looser.

**Before:**
```yaml
quality_gates:
  pattern_adherence: 0.9
  execution_success: 0.8
```

**After (if focus demands stricter pattern checking):**
```yaml
quality_gates:
  pattern_adherence: 0.95
  execution_success: 0.8
```
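The stricter-only rule amounts to a `max` merge over thresholds. A minimal sketch (function name hypothetical) of how proposed gates could be applied without ever loosening an existing one:

```python
def merge_quality_gates(
    current: dict[str, float], proposed: dict[str, float]
) -> dict[str, float]:
    """Apply proposed thresholds only where they are stricter (higher)."""
    merged = dict(current)
    for name, value in proposed.items():
        # max() keeps the existing gate when the proposal is looser
        merged[name] = max(merged.get(name, value), value)
    return merged
```

This guarantees a focus prompt cannot accidentally weaken an existing quality bar.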

---

## 4. Prompting Examples

### Scenario: Customer wants all catalogs prefixed

```bash
--focus "When creating DABs, prefix all catalogs and schemas with a customer-specific prefix using variables"
```

**What changes:**
- **manifest.yaml**: Adds `[FOCUS] All catalog/schema references must use ${var.prefix}_catalog pattern`
- **ground_truth.yaml**: Existing multi-env test cases get new `expected_patterns` for `${var.prefix}` syntax; 2-3 new test cases about prefix configuration

### Scenario: Customer wants DLT examples in DABs

```bash
--focus "Include Delta Live Tables (DLT) pipeline examples in all DABs configurations"
```

**What changes:**
- **manifest.yaml**: Adds `[FOCUS] DABs examples should include DLT pipeline resource definitions`
- **ground_truth.yaml**: Existing pipeline test cases get DLT-specific patterns; new test cases cover DLT pipeline YAML configuration

### Scenario: Customer wants stricter SQL validation

```bash
--focus "All SQL examples must use parameterized queries, never string interpolation"
```

**What changes:**
- **manifest.yaml**: Adds `[FOCUS] SQL examples must use parameterized queries with bind variables`
- **quality_gates**: `pattern_adherence` may increase (e.g., 0.9 -> 0.95)
- **ground_truth.yaml**: SQL-related test cases get patterns checking for parameterized syntax

---

## 5. Reviewing and Rolling Back Changes

### Identifying Focus-Generated Content

- **Guidelines**: Look for the `[FOCUS]` prefix in `manifest.yaml` `default_guidelines`
- **Test cases**: Check `metadata.source: generated_from_focus` in `ground_truth.yaml`
- **Adjusted responses**: Check `metadata._focus_original_response` for the pre-focus original

### Rolling Back

**Remove focus guidelines from manifest:**
```bash
# Edit manifest.yaml, delete lines starting with "[FOCUS]"
grep -v "^\s*- \"\[FOCUS\]" .test/skills/<skill>/manifest.yaml > tmp && mv tmp .test/skills/<skill>/manifest.yaml
```

**Remove focus-generated test cases:**
```python
# Drop test cases tagged as focus-generated
import yaml

with open(".test/skills/<skill>/ground_truth.yaml") as f:
    data = yaml.safe_load(f)

data["test_cases"] = [
    tc for tc in data["test_cases"]
    if tc.get("metadata", {}).get("source") != "generated_from_focus"
]

with open(".test/skills/<skill>/ground_truth.yaml", "w") as f:
    yaml.dump(data, f, default_flow_style=False, sort_keys=False)
```

**Restore original responses (for adjusted test cases):**
```python
for tc in data["test_cases"]:
    original = tc.get("metadata", {}).pop("_focus_original_response", None)
    if original:
        tc["outputs"]["response"] = original
```

**Or use git:**
```bash
git checkout -- .test/skills/<skill>/manifest.yaml .test/skills/<skill>/ground_truth.yaml
```
83 changes: 79 additions & 4 deletions .test/README.md
@@ -187,7 +187,7 @@ uv run python .test/scripts/optimize.py <skill_name> [options]
|------|---------|---------|---------|
| `--gen-model` | `GEPA_GEN_LM` | `databricks/databricks-claude-sonnet-4-6` | Generates responses in proxy evaluator |
| `--reflection-lm` | `GEPA_REFLECTION_LM` | `databricks/databricks-claude-opus-4-6` | GEPA's reflection/mutation model |
| `--judge-model` | `GEPA_JUDGE_LM` | `databricks/databricks-claude-sonnet-4-6` | MLflow quality/regression judges |
| `--judge-model` | `GEPA_JUDGE_LM` | `databricks/databricks-claude-sonnet-4-6` | MLflow quality judge |

Proxy evaluator models use [litellm provider prefixes](https://docs.litellm.ai/docs/providers): `databricks/`, `openai/`, `anthropic/`.

@@ -277,8 +277,8 @@ test_cases:
| Field | Required | Description |
|-------|----------|-------------|
| `inputs.prompt` | Yes | The user question |
| `expectations.expected_facts` | Yes | Facts the response must contain (checked by quality judge) |
| `expectations.expected_patterns` | No | Regex patterns checked deterministically |
| `expectations.expected_facts` | Yes | Facts the response must contain (checked by quality judge + deterministic substring match) |
| `expectations.expected_patterns` | No | Regex patterns checked deterministically (feeds `fact_coverage`/`pattern_adherence` scores) |
| `expectations.guidelines` | No | Soft rules for the quality judge |
| `expectations.trace_expectations` | No | Agent behavioral validation (only with `--agent-eval`) |
| `outputs.response` | No | Reference answer for judge comparison |
@@ -311,6 +311,80 @@ Without `--tool-modules`, all skills are included regardless. Available modules:

---

## Evaluation & Scoring

### SkillBench evaluator (default)

Each candidate skill is evaluated per-task using a WITH vs WITHOUT comparison:

1. **Generate WITH-skill response** — LLM generates with SKILL.md in context
2. **Generate WITHOUT-skill response** — LLM generates without skill (cached)
3. **Three focused judges** — each returns categorical `"excellent"` / `"acceptable"` / `"poor"` verdicts:
- **Correctness judge** (WITH + WITHOUT) — facts, API references, code syntax accuracy
- **Completeness judge** (WITH + WITHOUT) — all parts addressed, expected info present
- **Guideline adherence judge** (WITH only) — Databricks-specific patterns and practices
- **Regression judge** (conditional) — fires only when effectiveness delta < -0.05
4. **Deterministic assertions** (0 LLM calls) — `assertions.py` checks `expected_facts` (substring match) and `expected_patterns` (regex match) against both responses

**Cost per task:** 5 LLM calls (correctness×2 + completeness×2 + guideline_adherence×1). WITHOUT calls are cached, so subsequent iterations cost only 3 calls.

**Scoring weights:**

| Component | Weight | Source |
|-----------|--------|--------|
| Effectiveness delta | 30% | Mean of (correctness_delta + completeness_delta) |
| Quality composite | 20% | Mean of (correctness + completeness + guideline_adherence) WITH scores |
| Fact/pattern coverage | 15% | Deterministic assertions from `assertions.py` |
| Guideline adherence | 10% | Dedicated weight for Databricks patterns |
| Token efficiency | 10% | Smaller candidates score higher |
| Structure | 5% | Syntax validation (Python, SQL, no hallucinated APIs) |
| Regression penalty | -10% | Explicit penalty when regression_judge detects harm |

**Categorical-to-float conversion:** `excellent=1.0`, `acceptable=0.6`, `poor=0.0`. The nonlinear scale incentivizes GEPA to push from "acceptable" to "excellent" (0.4 gap).
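Putting the verdict map and the weight table together, the per-task score could be sketched as follows (a sketch only: the function name, argument shapes, and exact composition are assumptions, not the evaluator's actual code):

```python
VERDICT_SCORES = {"excellent": 1.0, "acceptable": 0.6, "poor": 0.0}

def task_score(
    with_verdicts: dict[str, str],      # correctness, completeness, guideline_adherence
    without_verdicts: dict[str, str],   # correctness, completeness (no skill in context)
    coverage: float,                    # deterministic fact/pattern coverage, 0..1
    token_efficiency: float,            # 0..1, higher for smaller candidates
    structure: float,                   # 0..1 syntax-validation score
    regressed: bool = False,
) -> float:
    """Weighted combination mirroring the scoring table above."""
    w = {k: VERDICT_SCORES[v] for k, v in with_verdicts.items()}
    wo = {k: VERDICT_SCORES[v] for k, v in without_verdicts.items()}
    # Effectiveness delta: mean of per-dimension WITH-minus-WITHOUT deltas
    effectiveness = ((w["correctness"] - wo["correctness"])
                     + (w["completeness"] - wo["completeness"])) / 2
    # Quality composite: mean of the three WITH-side judge scores
    quality = (w["correctness"] + w["completeness"] + w["guideline_adherence"]) / 3
    score = (0.30 * effectiveness + 0.20 * quality + 0.15 * coverage
             + 0.10 * w["guideline_adherence"] + 0.10 * token_efficiency
             + 0.05 * structure)
    if regressed:
        score -= 0.10  # explicit regression penalty
    return score
```

Note how the 0.4 gap between `acceptable` and `excellent` propagates into the effectiveness delta, making "acceptable → excellent" mutations clearly visible to GEPA.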

### How GEPA uses evaluation feedback

GEPA's reflection LM reads `side_info` rendered as markdown headers. Key fields:

- **`Judge_correctness_with`** / **`Judge_correctness_without`** — per-dimension accuracy feedback with categorical verdicts
- **`Judge_completeness_with`** / **`Judge_completeness_without`** — per-dimension coverage feedback
- **`Judge_guideline_adherence`** — pattern compliance feedback (WITH only)
- **`Judge_effectiveness`** — per-dimension deltas (`correctness_delta`, `completeness_delta`, `overall_delta`)
- **`Regression_Analysis`** — specific "what to fix" guidance (only when regression detected)
- **`Missing_Facts`** / **`Missing_Patterns`** — exact list of what the skill should add (from assertions)
- **`Passed_Facts`** / **`Passed_Patterns`** — what the skill already covers
- **`scores`** — feeds GEPA's multi-objective Pareto frontier (`correctness_with`, `completeness_with`, `guideline_adherence`, `quality_composite`, etc.)

This gives GEPA three independent, actionable signals. A mutation that fixes correctness but doesn't help completeness shows clear movement on one dimension, guiding the next mutation.

### Why three judges (not one, not five)?

The previous single quality judge collapsed 5 criteria into one 0.0–1.0 score. When a mutation improved correctness but hurt completeness, the score barely moved — GEPA couldn't distinguish which dimension improved. Three judges cover the core evaluation dimensions without excessive cost:

1. **Correctness** → fix errors (API syntax, deprecated patterns)
2. **Completeness** → add missing content
3. **Guideline adherence** → align with Databricks patterns + `--focus` areas

Deterministic assertions in `assertions.py` remain for precise, structured `Missing_Facts` lists at zero LLM cost.

### Agent evaluator (`--agent-eval`)

Runs a real Claude Code agent and adds tool-call scoring:

| Component | Weight |
|-----------|--------|
| Content quality | 20% |
| Skill effectiveness | 20% |
| Tool call correctness | 20% |
| Behavioral compliance | 15% |
| Execution success | 10% |
| Tool call efficiency | 10% |
| Token efficiency | 5% |

The agent evaluator also uses `assertions.py` for structured `Missing_Facts`/`Missing_Patterns` feedback. Tool-call judges use MLflow's `ToolCallCorrectness`/`ToolCallEfficiency` when available, falling back to deterministic trace scorers.

---

## Project Structure

```
@@ -325,8 +399,9 @@ Without `--tool-modules`, all skills are included regardless. Available modules:
│ ├── runner.py # Multi-pass GEPA orchestrator
│ ├── skillbench_evaluator.py # Fast proxy evaluator (WITH vs WITHOUT)
│ ├── agent_evaluator.py # Real Claude Code agent evaluator
│ ├── assertions.py # Deterministic fact/pattern assertions (zero LLM cost)
│ ├── assessment_fetcher.py # MLflow assessment injection
│ ├── judges.py # MLflow judge factories + fallback chain
│ ├── judges.py # MLflow quality judge factory + fallback chain
│ ├── config.py # Presets, model registration
│ ├── splitter.py # Train/val dataset splitting
│ ├── tools.py # MCP tool description extraction