CODEMODE APPROACH - BEHIND THE SCENES #8

fswair · 2026-03-16T00:18:09Z

fswair
Mar 16, 2026
Maintainer

CodeModeGenerator Detailed Flow

This document explains how the current CodeModeGenerator pipeline works end-to-end, with a focus on:

How ground-truth data is created
How feedback-guided exploration works round by round
How spec generation/refinement works
How YAML mode vs model-output mode (use_model_spec) are handled

Primary implementation: src/vowel/codemode.py

1. High-Level Purpose

CodeModeGenerator is a two-stage eval-spec generation system:

Explore the function behavior by executing LLM-written snippets against the real implementation.
Generate an eval spec from those verified results.

The key design principle is:

Do not guess expected outputs.
Execute code and use empirical outputs as ground-truth.

2. Main Data Models

Defined in src/vowel/codemode.py:

ExplorationSnippet: normal (expected-success) snippet
ErrorSnippet: expected-exception snippet
ExplorationPlan: model output for exploration (snippets + error_snippets)
SnippetResult: execution result record for each snippet
CodeModeResult: final pipeline output (exploration_results, yaml_spec, summary, refinement_rounds)

Mode-specific spec output models:

EvalsSource (string YAML payload)
EvalsBundle (structured object, from src/vowel/utils.py)
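
As a rough orientation, the models above could be sketched as plain dataclasses. This is only an illustration: the class names and the CodeModeResult attribute names come from the lists above, but every other field name (code, description, expected_exception, error_type, duration_ms, and so on) is an assumption, and the real definitions in src/vowel/codemode.py may use a different base class and different fields.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ExplorationSnippet:
    """Snippet expected to succeed (field names are assumed)."""
    code: str
    description: str = ""


@dataclass
class ErrorSnippet:
    """Snippet expected to raise (field names are assumed)."""
    code: str
    expected_exception: str = "Exception"


@dataclass
class ExplorationPlan:
    """Model output for one exploration round."""
    snippets: list[ExplorationSnippet] = field(default_factory=list)
    error_snippets: list[ErrorSnippet] = field(default_factory=list)


@dataclass
class SnippetResult:
    """Execution record for a single snippet (field names are assumed)."""
    code: str
    success: bool
    output: Any = None
    error_type: Optional[str] = None
    error_message: Optional[str] = None
    duration_ms: float = 0.0


@dataclass
class CodeModeResult:
    """Final pipeline output; attribute names follow the list above."""
    exploration_results: list[SnippetResult]
    yaml_spec: str
    summary: Any = None
    refinement_rounds: int = 0


plan = ExplorationPlan(snippets=[ExplorationSnippet(code="f(1)")])
record = SnippetResult(code="f(1)", success=True, output=2)
```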

3. Initialization and Agent Setup

CodeModeGenerator.__init__ configures:

spec_model, exploration_model
executor (resolved via resolve_executors)
min_snippets for exploration quality floor
use_model_spec switch

Two lazy agents are used:

explorer_agent: always returns ExplorationPlan
spec_agent: returns either:
- EvalsSource when use_model_spec=False (default)
- EvalsBundle when use_model_spec=True

4. Ground-Truth Creation (Core Reliability Mechanism)

Ground truth is created during exploration execution, not during spec generation.

How it works

LLM proposes snippets.
Each snippet is executed in an executor session initialized with the real function code.
For each snippet, the pipeline records:
- success/failure
- output (for success)
- exception type/message (for failure)
- duration
These SnippetResult records become the source of truth for spec generation.

This is why the expected values can be trusted: they are measured from runtime behavior rather than predicted by the model.
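
The recording step can be illustrated with a minimal sketch. It assumes an in-process exec/eval namespace as a stand-in for the real executor session, and the run_snippet helper, the SnippetResult field names, and the safe_div example function are all hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class SnippetResult:
    # Field names are assumed for illustration.
    code: str
    success: bool
    output: Any = None
    error_type: Optional[str] = None
    error_message: Optional[str] = None
    duration_ms: float = 0.0


def run_snippet(function_code: str, snippet: str) -> SnippetResult:
    """Evaluate one expression snippet against the real function code."""
    namespace: dict[str, Any] = {}
    exec(function_code, namespace)  # load the real implementation
    start = time.perf_counter()
    try:
        output = eval(snippet, namespace)  # the output is measured, not guessed
    except Exception as exc:
        elapsed = (time.perf_counter() - start) * 1000
        return SnippetResult(snippet, False, error_type=type(exc).__name__,
                             error_message=str(exc), duration_ms=elapsed)
    elapsed = (time.perf_counter() - start) * 1000
    return SnippetResult(snippet, True, output=output, duration_ms=elapsed)


# Hypothetical function under test.
FUNCTION_CODE = """
def safe_div(a, b):
    return a / b
"""

results = [run_snippet(FUNCTION_CODE, s)
           for s in ("safe_div(6, 3)", "safe_div(1, 0)")]
for r in results:
    print(r.success, r.output, r.error_type)
```

Both the successful output and the ZeroDivisionError end up as records, so a later spec can assert on either without the model having to predict them.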

5. Feedback-Guided Exploration (Round-by-Round)

Exploration is iterative (exploration_rounds=2 by default).

Round 1: Static Exploration

Method: _get_exploration_plan

Input to model:

function name
function code
description

Output:

broad first-pass normal snippets
first-pass error snippets

Execution:

all snippets are run via _execute_plan
results are captured into SnippetResult

Round 2: Targeted Exploration (Feedback-Guided)

Methods:

_build_cluster_summary
_get_targeted_exploration_plan

Round 2 prompt receives:

full prior round results
deterministic cluster summary generated from real outputs/errors

Cluster summary includes:

success clusters by output type
error clusters by exception type + message prefix
already-tried code snippets (explicitly to prevent repetition)

Goal:

discover new behavior classes not covered in Round 1

Early-stop safeguards:

if Round 2 returns no new snippets, stop
new behavior count is tracked by _count_new_behaviors
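
One way to picture the cluster summary and the new-behavior count is the sketch below. It is not the code behind _build_cluster_summary or _count_new_behaviors: the dict-based result records, the 40-character message prefix, and the definition of a "behavior" as a distinct cluster key are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Any


def ok(code: str, output: Any) -> dict:
    """Build a hypothetical success record."""
    return {"code": code, "success": True, "output": output}


def failed(code: str, error_type: str, message: str) -> dict:
    """Build a hypothetical failure record."""
    return {"code": code, "success": False,
            "error_type": error_type, "error_message": message}


def build_cluster_summary(results: list[dict], prefix_len: int = 40) -> dict:
    """Group results by output type and by (exception type, message prefix)."""
    success = defaultdict(list)
    errors = defaultdict(list)
    for r in results:
        if r["success"]:
            success[type(r["output"]).__name__].append(r["code"])
        else:
            key = (r["error_type"], r["error_message"][:prefix_len])
            errors[key].append(r["code"])
    return {
        "success_clusters": dict(success),
        "error_clusters": dict(errors),
        "tried": [r["code"] for r in results],  # lets the prompt forbid repeats
    }


def behavior_keys(results: list[dict]) -> set:
    """Return one key per distinct success or error cluster."""
    summary = build_cluster_summary(results)
    return ({("ok", k) for k in summary["success_clusters"]}
            | {("err", k) for k in summary["error_clusters"]})


def count_new_behaviors(previous: list[dict], new: list[dict]) -> int:
    """Count clusters that appear in `new` but not in `previous`."""
    return len(behavior_keys(new) - behavior_keys(previous))


round1 = [ok("f(1)", 2), failed("f('a')", "TypeError", "unsupported operand")]
round2 = [ok("f(3)", 6), ok("f(2.5)", 5.0)]
summary = build_cluster_summary(round1)
print(summary["success_clusters"])          # {'int': ['f(1)']}
print(count_new_behaviors(round1, round2))  # 1
```

In this toy run only the float output is a new behavior class; the second int result falls into a cluster Round 1 already covered.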

6. Spec Generation (Phase 2)

Method: generate_spec

Inputs:

target function
full exploration results
optional failure_context from prior failed attempts

Prompt includes:

verified successful results
verified error results
explicit constraints (coverage, raises count, input access rules)

Output mode branch

A) YAML mode (default): `use_model_spec=False`

model returns EvalsSource (string YAML)
pipeline then performs:
- YAML tag sanitization (!!... stripping)
- syntax parse check (yaml.safe_load)
- validate_and_fix_spec
- validate_expected_values against executor
- inject_missing_error_cases

Return type: str
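
The first step of that chain, tag sanitization, might look roughly like the sketch below. The regex and the strip_yaml_tags name are assumptions, not the pipeline's actual code, and this simple version would also strip a `!!` sequence that appears inside a quoted string.

```python
import re

# Matches "!!"-style YAML tags such as "!!python/tuple" or "!!str",
# including any spaces or tabs that follow the tag.
_TAG_RE = re.compile(r"!![^\s\[\]{},]+[ \t]*")


def strip_yaml_tags(source: str) -> str:
    """Remove explicit '!!' tags so the text parses as plain YAML."""
    return _TAG_RE.sub("", source)


raw = "expected: !!python/tuple [1, 2]\nname: !!str demo\n"
cleaned = strip_yaml_tags(raw)
print(cleaned)
```

After stripping, the text can be passed to a safe parser as a syntax check before the later validation steps run.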

B) Structured mode: `use_model_spec=True`

model returns EvalsBundle
generate_spec returns the bundle directly
the YAML sanitization/fix chain is not applied in this branch at the generate_spec level

Return type: EvalsBundle

7. Refinement Loop (Phase 2-4)

Method: generate

After exploration, the pipeline makes up to max_refinement_rounds + 1 attempts (when run_evals=True).

Per attempt:

Generate spec (str or EvalsBundle).
Normalize for downstream use:
- bundle -> bundle.to_yaml() for YAML materialization
- string -> direct yaml_spec
Run evals in validation mode (ignore_duration()):
- bundle path: RunEvals.from_bundle(bundle)
- yaml path: RunEvals.from_source(yaml_spec)
If coverage >= min_coverage, stop.
Else build failure_context from summary and retry.

If generation or eval execution raises an exception:

error message is fed back as failure_context
attempt count advances
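
The control flow of the loop can be sketched as follows. The refine function, its callable parameters, and the default values 2 and 0.9 are placeholders chosen for illustration; only the max_refinement_rounds + 1 attempt budget, the min_coverage stop condition, and the failure_context feedback are taken from the description above.

```python
from typing import Callable, Optional


def refine(
    generate_spec: Callable[[Optional[str]], str],
    run_evals: Callable[[str], float],
    max_refinement_rounds: int = 2,   # placeholder default
    min_coverage: float = 0.9,        # placeholder default
) -> tuple[Optional[str], int]:
    """Return (last generated spec, number of attempts used)."""
    failure_context: Optional[str] = None
    spec: Optional[str] = None
    attempts = 0
    for attempt in range(max_refinement_rounds + 1):
        attempts = attempt + 1
        try:
            spec = generate_spec(failure_context)
            coverage = run_evals(spec)
        except Exception as exc:
            # The error message becomes the context for the next attempt.
            failure_context = f"attempt {attempts} raised: {exc}"
            continue
        if coverage >= min_coverage:
            break
        failure_context = f"coverage {coverage:.2f} below {min_coverage:.2f}"
    return spec, attempts


# Hypothetical stand-ins for the spec agent and the eval runner.
def fake_generate(context: Optional[str]) -> str:
    return "spec-v1" if context is None else "spec-v2"


def fake_run(spec: str) -> float:
    return 0.5 if spec == "spec-v1" else 1.0


final_spec, used = refine(fake_generate, fake_run)
print(final_spec, used)  # spec-v2 2
```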

8. Duration Injection and Finalization (Phase 5)

After refinement loop:

If enabled, inject per-case duration thresholds via _inject_durations.
A final summary run is executed, again with ignore_duration(), so that the newly injected thresholds cannot cause a circular failure.
Optionally save YAML to {func.name}_evals.yml.
Return CodeModeResult.

Note: the output contract currently always includes the final yaml_spec in CodeModeResult, even when the generation path was bundle-first.
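
How the thresholds are derived is not described here, so the following is purely a hypothetical sketch of per-case duration injection. It assumes each case receives a budget equal to its measured duration times a safety margin, with a minimum floor; the inject_durations signature, the duration_ms key, and the margin and floor values are all invented for illustration and may not match _inject_durations.

```python
import math


def inject_durations(
    cases: dict[str, dict],
    measured_ms: dict[str, float],
    margin: float = 3.0,     # assumed safety factor
    floor_ms: float = 10.0,  # assumed minimum budget
) -> dict[str, dict]:
    """Return a copy of `cases` with a duration budget per measured case."""
    out: dict[str, dict] = {}
    for name, case in cases.items():
        updated = dict(case)  # leave the input untouched
        if name in measured_ms:
            budget = math.ceil(measured_ms[name] * margin)
            updated["duration_ms"] = max(floor_ms, budget)
        out[name] = updated
    return out


cases = {"fast_case": {"input": 1}, "slow_case": {"input": 2}}
measured = {"fast_case": 0.4, "slow_case": 12.2}
with_budgets = inject_durations(cases, measured)
print(with_budgets)
```

Because the budgets come from earlier measurements, validating them in the same run would only test timing noise, which is consistent with the final summary run ignoring durations.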

9. Observability and Telemetry

logfire spans/records cover:

pipeline start/end
exploration rounds
snippet execution
spec generation attempts
refinement decisions
duration injection

This makes failure diagnosis and mode comparison practical.

10. Why This Design Works

Strengths:

Ground-truth expected values are runtime-measured, not hallucinated.
Exploration is feedback-guided, not one-shot static probing.
Refinement loop allows automatic recovery from partial failures.
Dual output mode supports both YAML-native and structured object pathways.

11. Current Practical Trade-off

In practice, the YAML-native path is often more robust on smaller models because:

it has mature sanitization/fix steps
failures are corrected in-text across retries

The structured path (EvalsBundle) is cleaner architecturally, but depends more heavily on model capability and schema adherence.

12. Round-by-Round Timeline (Compact)

Round 1 exploration prompt -> snippets/error_snippets
Execute all snippets -> SnippetResult[] ground-truth
Build cluster summary from Round 1 outputs/errors
Round 2 targeted exploration -> new snippets
Execute Round 2 snippets -> expanded ground-truth
Generate spec attempt 1 (YAML or bundle)
Run evals for coverage validation
If needed, regenerate with failure context
Inject durations into final YAML
Final summary run + optional file save + CodeModeResult

13. Key Knobs

exploration_rounds (inside explore): exploration depth
min_snippets: minimum normal exploration breadth
max_refinement_rounds: retry budget
min_coverage: success threshold
inject_durations: performance constraint injection toggle
use_model_spec: EvalsSource vs EvalsBundle output mode

14. File References

Core pipeline: src/vowel/codemode.py
Eval runner API (from_source, from_bundle): src/vowel/runner.py
