CodeModeGenerator Detailed Flow
This document explains how the current CodeModeGenerator pipeline works end-to-end, with a focus on … and on how the two spec output modes (use_model_spec) are handled.

Primary implementation: src/vowel/codemode.py
1. High-Level Purpose
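CodeModeGenerator is a two-stage eval-spec generation system. An exploration stage executes code snippets against the target function and records the results. A spec-generation stage then turns those results into an eval spec.

The key design principle is that ground truth comes from execution. Expected values are measured from runtime behavior during exploration; they are not produced by the model during spec generation.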
2. Main Data Models
Defined in src/vowel/codemode.py:
- ExplorationSnippet: normal (expected-success) snippet
- ErrorSnippet: expected-exception snippet
- ExplorationPlan: model output for exploration (snippets + error_snippets)
- SnippetResult: execution result record for each snippet
- CodeModeResult: final pipeline output (exploration_results, yaml_spec, summary, refinement_rounds)

Mode-specific spec output models:

- EvalsSource (string YAML payload)
- EvalsBundle (structured object, from src/vowel/utils.py)

3. Initialization and Agent Setup
CodeModeGenerator.__init__ configures:

- spec_model, exploration_model
- executor (resolved via resolve_executors)
- min_snippets for exploration quality floor
- use_model_spec switch

Two lazy agents are used:

- explorer_agent: always returns ExplorationPlan
- spec_agent: returns either:
  - EvalsSource when use_model_spec=False (default)
  - EvalsBundle when use_model_spec=True

4. Ground-Truth Creation (Core Reliability Mechanism)
Ground-truth is created in exploration execution, not in spec generation.
How it works
SnippetResult records become the source of truth for spec generation. This is why expected values can be trusted: they are measured from runtime behavior.
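The following minimal sketch shows the execution step. It assumes that each snippet is a single Python expression evaluated in-process with eval. The Snippet class, the SnippetResult fields shown here and the execute_plan helper are hypothetical simplifications, not the actual models in src/vowel/codemode.py. The real pipeline runs snippets through its configured executor.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


# Hypothetical, simplified stand-ins for ExplorationSnippet/ErrorSnippet and SnippetResult.
@dataclass
class Snippet:
    code: str                   # a single expression that calls the target function
    expect_error: bool = False  # True for ErrorSnippet-style probes


@dataclass
class SnippetResult:
    code: str
    success: bool
    output: Any = None                # measured return value: the ground truth
    error_type: Optional[str] = None  # measured exception type, if one was raised
    matched_expectation: bool = True  # did the outcome agree with expect_error?


def execute_plan(func: Callable, snippets: List[Snippet]) -> List[SnippetResult]:
    """Run every snippet and record what actually happened (hypothetical helper)."""
    results = []
    for snip in snippets:
        namespace = {func.__name__: func}  # expose the target function by name
        try:
            # Sketch only: a real executor would isolate and time-limit this call.
            value = eval(snip.code, namespace)
            results.append(SnippetResult(snip.code, True, output=value,
                                         matched_expectation=not snip.expect_error))
        except Exception as exc:
            results.append(SnippetResult(snip.code, False, error_type=type(exc).__name__,
                                         matched_expectation=snip.expect_error))
    return results


# Usage: the expected values come from execution, not from a model's guess.
def safe_div(a, b):
    return a / b


for r in execute_plan(safe_div, [Snippet("safe_div(6, 3)"),
                                 Snippet("safe_div(1, 0)", expect_error=True)]):
    print(r.code, r.success, r.output, r.error_type)
```

In this sketch, each result records an observed outcome. An error probe that unexpectedly succeeds therefore shows up as matched_expectation=False instead of silently becoming a wrong expected value.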
5. Feedback-Guided Exploration (Round-by-Round)
Exploration is iterative (exploration_rounds=2 by default).

Round 1: Static Exploration
Method: _get_exploration_plan

Input to model:
Output:
Execution:

- _execute_plan
- SnippetResult

Round 2: Targeted Exploration (Feedback-Guided)
Methods:

- _build_cluster_summary
- _get_targeted_exploration_plan

Round 2 prompt receives:
Cluster summary includes:
Goal:
Early-stop safeguards:
- _count_new_behaviors

6. Spec Generation (Phase 2)
Method: generate_spec

Inputs:

- failure_context from prior failed attempts

Prompt includes:
Output mode branch
A) YAML mode (default):
- use_model_spec=False
- EvalsSource (string YAML)
- … (… !!... stripping)
- … (yaml.safe_load)
- validate_and_fix_spec
- validate_expected_values against executor
- inject_missing_error_cases

Return type: str

B) Structured mode:
- use_model_spec=True
- EvalsBundle
- generate_spec
- … generate_spec level

Return type: EvalsBundle

7. Refinement Loop (Phase 2-4)
Method: generate

After exploration, the pipeline makes up to max_refinement_rounds + 1 attempts (when run_evals=True).

Per attempt:

- … (str or EvalsBundle).
- bundle.to_yaml() for YAML materialization
- yaml_spec
- … ignore_duration()):
  - RunEvals.from_bundle(bundle)
  - RunEvals.from_source(yaml_spec)
- … min_coverage, stop.
- … failure_context from summary and retry.

If generation or eval execution raises an exception:

- failure_context

8. Duration Injection and Finalization (Phase 5)
After the refinement loop:

- … _inject_durations.
- … (… .ignore_duration() to avoid circular failure).
- … {func.name}_evals.yml.
- … CodeModeResult.

Note: the output contract currently always includes the final yaml_spec in CodeModeResult, even when the generation path was bundle-first.

9. Observability and Telemetry
logfire spans/records cover:

This makes failure diagnosis and mode comparison practical.
10. Why This Design Works
Strengths:
11. Current Practical Trade-off
In practice, the YAML-native path is often more robust on smaller models because:

The structured path (EvalsBundle) is architecturally cleaner, but it depends more heavily on model capability and schema adherence.

12. Round-by-Round Timeline (Compact)
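Sections 5 to 8 describe several phases. The sketch below condenses them into one control-flow outline. Every name in it is hypothetical. The callables stand in for:

- the exploration agent plus executor
- generate_spec
- RunEvals, run with ignore_duration()

Behaviors are assumed to be hashable signatures. Duration injection is omitted. The sketch shows only the shape of the loop:

- feedback-guided exploration with an early stop
- up to max_refinement_rounds + 1 spec attempts, which stop once min_coverage is met and otherwise retry with a failure_context

```python
from dataclasses import dataclass
from typing import Callable, Hashable, List, Optional, Set


@dataclass
class PipelineResult:  # hypothetical stand-in for CodeModeResult
    exploration_results: List[Hashable]
    yaml_spec: str
    attempts: int
    coverage: float


def count_new_behaviors(seen: Set[Hashable], batch: List[Hashable]) -> int:
    """Hypothetical analogue of _count_new_behaviors: distinct behaviors not seen before."""
    new = {b for b in batch if b not in seen}
    seen.update(batch)
    return len(new)


def run_pipeline(
    explore_round: Callable[[int, List[Hashable]], List[Hashable]],
    generate_spec: Callable[[List[Hashable], Optional[str]], str],
    run_evals: Callable[[str], float],
    exploration_rounds: int = 2,
    max_refinement_rounds: int = 2,
    min_coverage: float = 1.0,
) -> PipelineResult:
    # Phase 1: exploration. Later rounds receive earlier results (feedback-guided).
    results: List[Hashable] = []
    seen: Set[Hashable] = set()
    for rnd in range(exploration_rounds):
        batch = explore_round(rnd, results)
        new = count_new_behaviors(seen, batch)
        results.extend(batch)
        if rnd > 0 and new == 0:
            break  # early stop: a targeted round surfaced no new behavior

    # Phases 2-4: generate, evaluate, and refine with failure context.
    failure_context: Optional[str] = None
    spec, coverage, attempts = "", 0.0, 0
    for attempt in range(max_refinement_rounds + 1):
        attempts = attempt + 1
        try:
            spec = generate_spec(results, failure_context)
            coverage = run_evals(spec)
        except Exception as exc:
            failure_context = f"attempt {attempts} raised {exc!r}"
            continue
        if coverage >= min_coverage:
            break
        failure_context = f"attempt {attempts}: coverage {coverage:.0%} < {min_coverage:.0%}"
    return PipelineResult(results, spec, attempts, coverage)


# Usage with toy stand-ins: the first spec reaches 50% coverage, the retry reaches 100%.
def explore_round(rnd, so_far):
    return [("ok", 2.0), ("raises", "ZeroDivisionError")] if rnd == 0 else [("ok", 2.0)]


def generate_spec(ground_truth, failure_context):
    return "spec-v2" if failure_context else "spec-v1"


def run_evals(spec):
    return 1.0 if spec == "spec-v2" else 0.5


result = run_pipeline(explore_round, generate_spec, run_evals)
print(result.attempts, result.coverage, result.yaml_spec)  # 2 1.0 spec-v2
```

The sketch puts evaluation behind a single callable. This lets either loading path (RunEvals.from_source or RunEvals.from_bundle) be plugged in without changing the retry logic.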
… SnippetResult[] ground-truth … CodeModeResult

13. Key Knobs
- exploration_rounds (inside explore): exploration depth
- min_snippets: minimum normal exploration breadth
- max_refinement_rounds: retry budget
- min_coverage: success threshold
- inject_durations: performance constraint injection toggle
- use_model_spec: EvalsSource vs EvalsBundle output mode

14. File References
- RunEvals (from_source, from_bundle): src/vowel/runner.py