AWP Refinement Mode

Task-local iterative refinement of a completed run's deliverable. SGD on y (the output), not θ (the policy).

See also:

  • Parent: docs/README.md
  • Orthogonal axis: outer-loop.md moves θ (prompt artifacts); refinement moves y (the seed run's deliverable). Both reuse the same loss function.
  • Gradient sources: critique.md (defects, R35 fixpoint guard), runtime.md (last 3 gate rejections from the completion gate chain), evaluation.md (score deltas)
  • Never active inside awp run: entered only via awp refine
  • Guard rule: R36 (empty gradient), authoritative in spec/versions/1.0/validation-rules.md §12, catalogued in validation.md
  • Autonomy mapping: compliance.md; refinement sits outside the 7 layers of layer-model.md
  • Engine context: ORCHESTRATION_ENGINES.md; each iteration is a standalone delegation-loop run with budget halved vs. the seed

awp refine <seed_run_dir> reads a completed run, extracts a deterministic "gradient" from its critique defects, gate-chain rejections, and eval deltas, and drives an iterative loop that re-runs the same task with the prior deliverable as starting state until the loss stops decreasing. The winning iteration is promoted to <seed>/BEST/. Orthogonal to awp optimize — they move different parameters (see §5 "Complementarity") and never interact.

Where this sits in AWP:

  • Layer: 6 — Orchestration (see CLAUDE.md §Two Orchestration Engines). Refinement is an outer wrapper around the existing DelegationLoopRunner; it does not introduce a new engine.
  • Autonomy: composes with any A2+ workflow (requires manager + critique + gate chain). Does not change the autonomy spectrum A0–A4.
  • Agent contract: honors R17 — every iteration's manager and workers return {self.name: {"confidence": ..., ...}}; refinement never reaches inside the contract.
  • Budget envelope: every iteration runs under a halved copy of the seed's envelope (see §6.2). The envelope itself (CLAUDE.md §Key Protocols) is unchanged.
  • Validation: introduces exactly one new normative rule — R36 (§1.2 below).

1. Mental Model

AWP treats every full run as a forward pass through a system of agents. CLAUDE.md §4 — Loss & Backprop: the ML Lens formalizes this: the E2E run is the forward pass, the gate chain + critique + eval produce a scalar loss (computed by compute_run_loss), and the 5-Why-by-Layer protocol is the backprop that turns a symptom into a gradient you can edit against.

That framing has two natural axes:

[Figure: Two axes of SGD in AWP]

1.1 Why two axes

A policy update (awp optimize) teaches the system to do better next time on any similar task — it moves θ, the six versioned prompt artifacts held by the ArtifactRegistry (worker_pitfalls, manager_planning_preamble, experiment_context_hint_template, pattern_library, tool_description_templates, critique_rubric — see awp.outer_loop.defaults). A refinement (awp refine) leaves θ alone and spends extra inference compute right now on this specific deliverable until a stricter loss signal says stop. One is meta-learning; the other is extra forward passes. They compose cleanly because they edit disjoint state: optimize writes to the artifact registry (SQLite under ~/.awp/outer_loop.db by default, override via AWP_OUTER_LOOP_DB), refine writes to a specific seed's refinement_sessions/ and BEST/.

Concretely: if your critique keeps flagging the same kind of defect across many runs, use optimize — the policy is wrong. If this particular run came out imperfect but the policy is fine, use refine — the instance needs more compute, not a retrained manager.

1.2 What "gradient" means here

The refinement gradient is not a numerical vector. It is a structured, deterministic summary of what the prior run got wrong, extracted from on-disk artifacts without any LLM call:

[Figure: The refinement gradient, three signal sources merge]

| Source | Signal | Produced by |
| --- | --- | --- |
| iterations/<k>/critique.json (per-worker) | defect descriptions + severity | Critique engine (packages/awp-runtime/src/awp/runtime/critique/engine.py) |
| logs/<run_id>/events.jsonl where category == "gate" and fields.triggered | gate name + rejection reason | Completion gate chain (L0 → critique → deliverable_presence → placeholder → file → deliverable → structural_integrity → eval) |
| run_completion.json.evaluation.per_metric vs. .thresholds | per-metric gap (threshold − observed) | Evaluation layer |

All three are concatenated into a RefinementGradient Pydantic object, serialized to <workspace_k>/gradient_input.json, and rendered into a deterministic text prefix that is prepended to the manager's first PLAN user message on iteration k via the new manager_prompt_prefix parameter on AgentWorkflow (threaded through DelegationLoopRunner which injects it on iteration == 1 only). Subsequent manager calls within the same iteration do NOT see the prefix — the refinement intent is by then persisted in plan + state.

R36 (normative): if the gradient is empty (no defects, no rejections, no metric gaps), refinement aborts before iteration 1 with "nothing to refine". Full text in spec/versions/1.0/validation-rules.md. R36 sits alongside R17 (agent output contract), R33 (deterministic phase purity), R34 (L0 output contract), and R35 (repair fixpoint guard) as runtime-enforced rules (not static validator rules).
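To make the shape of the gradient and the R36 check concrete, here is a minimal sketch. It uses a plain dataclass rather than the Pydantic model the text describes, and the names GradientSketch, is_empty, and render_prefix, as well as the field names, are assumptions for illustration only; the real RefinementGradient may differ.

```python
from dataclasses import dataclass, field


@dataclass
class GradientSketch:
    """Illustrative stand-in for RefinementGradient (hypothetical field names)."""
    defects: list[dict] = field(default_factory=list)
    rejections: list[dict] = field(default_factory=list)
    eval_deltas: dict[str, float] = field(default_factory=dict)

    def is_empty(self) -> bool:
        # R36: nothing to refine when all three sources are empty.
        return not (self.defects or self.rejections or self.eval_deltas)


def render_prefix(g: GradientSketch) -> str:
    """Render a deterministic text prefix; no LLM call, metrics in sorted order."""
    lines = ["Refinement gradient (preserve what works; fix what the gradient identifies):"]
    for d in g.defects:
        lines.append(f"- defect [{d['severity']}]: {d['description']}")
    for r in g.rejections:
        lines.append(f"- gate {r['gate']} rejected: {r['reason']}")
    for metric in sorted(g.eval_deltas):
        lines.append(f"- metric {metric}: gap {g.eval_deltas[metric]:.3f}")
    return "\n".join(lines)
```

Because the rendering depends only on the extracted data, the same on-disk artifacts always yield the same prefix, which is what makes gradient_input.json usable as an audit trail.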

1.3 What "one step" does

An iteration is one full AgentWorkflow run with:

  • Task = the seed's original task text (unchanged).
  • Starting state = hard-linked contents of the prior iteration's FINAL/ at <workspace_k>/input/ — the manager sees the prior deliverable on disk and is told (via the prefix) to iterate on it, not rewrite from scratch.
  • Budget = halved from the seed's observed consumption (ceil for counts, floor for bytes/seconds, wall-time halved from observed-not-cap, 60 s floor). Depth unchanged — it is structural, not a cost.
  • Gradient prefix = 10–40 lines of deterministic text listing defects, rejected gates, and metric gaps with an explicit "preserve what works; fix what the gradient identifies" directive.
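The starting-state step can be sketched with the standard library. This is an assumption-laden illustration, not the code in seed.py: hard_link_tree is a hypothetical helper, and it assumes source and destination live on the same filesystem (a requirement of hard links) and that the destination is not nested inside the source.

```python
import os
from pathlib import Path


def hard_link_tree(src: Path, dst: Path) -> int:
    """Mirror src into dst, hard-linking every file; returns the number of files linked."""
    dst.mkdir(parents=True, exist_ok=True)
    linked = 0
    for path in sorted(src.rglob("*")):
        target = dst / path.relative_to(src)
        if path.is_dir():
            target.mkdir(parents=True, exist_ok=True)
        elif path.is_file():
            target.parent.mkdir(parents=True, exist_ok=True)
            os.link(path, target)  # same inode, no data copied
            linked += 1
    return linked
```

Hard links make seeding cheap, but both names point at the same bytes, so a tool that edits a file in place under input/ would also change the prior iteration's FINAL/. Writing a new file and renaming it over the old one avoids that.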

After the iteration completes, the refinement loop computes loss via awp.outer_loop.loss.compute_run_loss (reused unchanged — the same scalar that drives awp optimize) and updates the best-so-far pointer.

1.4 Convergence and stopping

The loop is a best-so-far tracker that stops on the first of the following conditions:

[Figure: Loss trajectory and convergence guards]

| Stop | Why |
| --- | --- |
| max_iterations | k == N (--iterations, clamped to [1, 10], default 3). |
| regression | Loss rose vs. the previous iteration twice in a row; extra compute is making it worse. |
| plateau | Loss has stopped decreasing between iterations; more compute will not help (see §4). |
| wall_time_exhausted | Cumulative wall time ≥ 2× the seed's observed wall time; spending more than twice the original run is a poor trade. |
| empty_gradient | R36 at iter 1, or the gradient becomes empty at iter k>1. |
| no_prior_deliverable | The prior iteration produced nothing to seed from. |
| error:<ExcName> | An iteration crashed; the session is still finalized (see §4). |

The session is always persisted, even on abort. You can always re-read <seed>/refinement_sessions/<ts>.json and know exactly what happened.
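A few of these guards can be sketched as a pure function of the loss history. This is a hedged illustration: stop_reason is a hypothetical name, the order in which the checks run is an assumption, and it treats the seed loss as the baseline before iteration 1. The plateau check is left out because its threshold is not given in this section.

```python
from typing import Optional


def stop_reason(losses: list[float], max_iterations: int,
                elapsed_s: float, seed_wall_s: float) -> Optional[str]:
    """Return a stop reason after the latest iteration, or None to continue.

    losses[0] is the seed loss; losses[k] is the loss of iteration k.
    """
    k = len(losses) - 1
    # regression: loss rose versus the previous entry twice in a row
    if k >= 2 and losses[k] > losses[k - 1] > losses[k - 2]:
        return "regression"
    # wall-time guard: cumulative time reached twice the seed's observed time
    if elapsed_s >= 2 * seed_wall_s:
        return "wall_time_exhausted"
    if k >= max_iterations:
        return "max_iterations"
    return None
```

For example, `stop_reason([0.5, 0.6, 0.7], 3, 10.0, 100.0)` reports a regression after two consecutive rises, while a single rise followed by a drop lets the loop continue.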


2. Architecture: the seed → BEST flow

[Figure: Seed → iteration chain → BEST pointer]

Full on-disk tree:

<seed_run_dir>/                          <-- completed AWP run, returned by awp run / UI
├── run_completion.json                  <-- parsed for task + final_budget
├── iterations/<k>/critique.json         <-- mined for defects
├── FINAL/                               <-- starting deliverable (promoted from output/ if missing)
├── logs/<run_id>/events.jsonl           <-- mined for gate rejections
│                                            (may live under ../../logs/<run_id>/)
│
├── refinement_sessions/                 <-- NEW, written by awp refine
│   └── refine_20260419T182438Z.json     <-- one per session, always written
│
└── BEST/                                <-- NEW, only updated on improvement
    ├── manifest.json                    <-- best_run_id, best_loss, seed_loss, session_id
    └── <deliverable files>              <-- hard-links from the winning iteration

/tmp/awp-experiments/refine_<ts>/        <-- iteration workspaces (independent experiments)
├── iter_1/
│   ├── input/                           <-- hard-linked from <seed>/FINAL/
│   ├── gradient_input.json              <-- R36 audit trail
│   ├── workspace/runs/<run_id>/         <-- the actual agent run
│   │   ├── run_completion.json          <-- parent_run_id == <seed>.run_id
│   │   └── iterations/…
│   └── output/<run_id>/                 <-- deliverables; promoted to FINAL/ on demand
├── iter_2/
│   ├── input/                           <-- hard-linked from iter_1/…/FINAL/
│   └── …                                <-- parent_run_id == iter_1.run_id
└── …

Every iteration is a first-class experiment in the experiment DB, linked upstream via parent_run_id (threaded through AgentWorkflow and persisted in run_completion.json — see commit 3fcde7b). The UI renders the chain without any schema change; see RefinementSessionsList for the panel that surfaces it next to a seed run's history entry.


3. Mechanism in code

[Figure: RefinementLoop.run call graph]

One page of call-graph (text form):

awp refine <seed> --iterations N
    │
    ▼
awp.cli.cmd_refine(args)
    │
    ├── guard: <seed>/run_completion.json + FINAL/ exist   → else exit 2
    │
    ▼
awp.refinement.loop.RefinementLoop(seed_run_dir=...).run(iterations=N)
    │
    ├── extract_gradient(<seed>)                            — R36 check at iter 0
    │       │
    │       ├── _extract_defects_from_iterations(…)         — reads iterations/*/critique.json
    │       ├── _extract_last_rejections(events.jsonl)      — supports real + synthetic schemas
    │       └── _extract_eval_deltas(run_completion.json)   — threshold − observed
    │
    ├── _read_seed_context()                                — parses final_budget (nested or flat)
    │
    ├── for k in 1..N:                    ┌──── try/except/finally: sidecar ALWAYS written ──┐
    │   │                                 │                                                   │
    │   ├── _ensure_final_dir(prior_run)  │  — fallback-promotes output/<run_id>/ → FINAL/   │
    │   ├── prepare_iteration_workspace   │  — hard-link FINAL → workspace_k/input/          │
    │   ├── extract_gradient(prior_run)   │  — re-extract for iter k>1                       │
    │   ├── write gradient_input.json     │                                                   │
    │   ├── render_refinement_prefix()    │                                                   │
    │   ├── budget_for_iteration(…)       │  — halve with ceil/floor, 60 s wall-time floor   │
    │   ├── default_workflow_factory(…)   │  — spawn AgentWorkflow with prefix + inputs       │
    │   ├── compute_run_loss(run_dir)     │                                                   │
    │   ├── update best_iter / best_loss  │                                                   │
    │   └── check stop conditions         │                                                   │
    │                                     │                                                   │
    │   on any exception: stop_reason = "error:<ExcName>"; finally block still runs           │
    │                                                                                         │
    │   finally:                                                                              │
    │     write_session_sidecar(<seed>, session)                                              │
    │     if best_iter > 0: write_best_pointer(<seed>, winning_run, …)                        │
    │                                                                                         │
    └── return RefinementResult ──────────┘

Key contracts enforced by the code:

  • Every session is observable. try/finally guarantees the sidecar is written whether the loop ran to completion, hit a stop condition, or raised mid-iteration.
  • Every iteration has a starting deliverable. _ensure_final_dir walks up from the iteration's run dir, finds the workspace-level output/<run_id>/, and hard-links it into <run_dir>/FINAL/. This bridges the gap between the runtime's stricter _write_canonical_final_output (which requires declared deliverables to exist) and refinement's unconditional need for some seed deliverable.
  • BEST never regresses. write_best_pointer only overwrites an incumbent BEST/ if the new session's best_loss is strictly lower than the stored one.
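The "BEST never regresses" contract reduces to a strict comparison against the stored manifest. A minimal sketch, assuming the manifest carries the best_loss key shown in the on-disk tree; should_replace_best is a hypothetical helper, not the signature of write_best_pointer:

```python
import json
from pathlib import Path


def should_replace_best(best_dir: Path, new_loss: float) -> bool:
    """True when BEST/ has no manifest yet or the stored best_loss is strictly higher."""
    manifest = best_dir / "manifest.json"
    if not manifest.exists():
        return True
    stored = json.loads(manifest.read_text(encoding="utf-8")).get("best_loss")
    # A tie keeps the incumbent: only a strictly lower loss wins.
    return stored is None or new_loss < stored
```

Keeping the incumbent on ties means a second session that merely matches the first never churns the files under BEST/.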

4. Reading a session

<seed>/refinement_sessions/<session_id>.json:

{
  "session_id": "refine_20260419T182438Z",
  "seed_run_id": "2026-04-19_16-05-53_cbf28fd0",
  "started_at":  "2026-04-19T18:24:38Z",
  "completed_at": "2026-04-19T18:26:50Z",
  "stop_reason": "no_prior_deliverable",
  "best_iter":   1,
  "iterations": [
    {"k": 1, "run_id": "2026-04-19_18-24-38_1ba2c7ed",
     "loss": 0.3500, "status": "partial"}
  ]
}

What to read out of this:

  • best_iter: 0 means the seed still wins — no iteration beat the baseline. Open the run_dirs of the iterations to see why they failed to improve (critique, gates, trace).
  • stop_reason: regression — the gradient was real but the policy can't act on it at its current compute budget. Consider a larger --iterations with a stronger --model, or move to awp optimize if the same regression appears across tasks.
  • stop_reason: plateau — the system has converged. No more compute will help; accept the current BEST.
  • stop_reason: empty_gradient_midloop — an iteration produced a "perfect" run (no defects, no gate rejects, eval satisfied). Rare. Usually accompanies best_iter > 0.
  • stop_reason: error:<ExcName> — a crash happened mid-loop. The session captures every iteration that did complete; check logs for the traceback.

BEST manifest (<seed>/BEST/manifest.json) names the winning iteration and records the delta against the seed. Running awp refine again with a better model or more iterations overwrites BEST only if it finds a strictly lower loss — so the canonical deliverable monotonically improves session over session.
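Reading a sidecar programmatically is a few lines of JSON handling. The sketch below relies only on the keys shown in the example session above; summarize_session is a hypothetical helper name, not part of AWP.

```python
import json


def summarize_session(raw: str) -> str:
    """Condense a refinement session sidecar into a one-line verdict."""
    session = json.loads(raw)
    iterations = session.get("iterations", [])
    best_iter = session.get("best_iter", 0)
    if best_iter == 0:
        verdict = "seed still wins"
    else:
        best = next(it for it in iterations if it["k"] == best_iter)
        verdict = f"iteration {best_iter} wins (loss {best['loss']:.4f}, run {best['run_id']})"
    return f"{session['session_id']}: stop={session['stop_reason']}; {verdict}"
```

Fed the example sidecar, it reports that iteration 1 won with loss 0.3500 and that the session stopped with no_prior_deliverable.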


5. Complementarity with awp optimize

Both commands reduce the same scalar loss (compute_run_loss), but they act on disjoint state:

| Dimension | awp optimize | awp refine |
| --- | --- | --- |
| Parameter | θ: 6 versioned prompt artifacts | y: one task's deliverable |
| Scope | a task suite (generalization) | one seed run (instance) |
| Loop | SGD with rollback on mean-loss regression | best-so-far with regression/plateau guards |
| Persistence | artifact registry + epochs table | refinement_sessions/ + BEST/ on the seed |
| Triggered by | "our runs are systematically worse than they should be" | "this specific run is imperfect" |
| Uses | TextGrad LLM-as-optimizer | raw LLM passes with a deterministic gradient prefix |

Running one does not affect the other's state. You can (and should) mix them: awp optimize trains the policy weekly, awp refine polishes any individual result that matters.


6. Reference

6.1 CLI

awp refine <seed_run_dir> [--iterations N] [--model M] [--worker-model M]

Exit codes:

| Code | Meaning |
| --- | --- |
| 0 | At least one iteration improved loss; BEST/ updated. |
| 0 | Empty gradient: nothing to refine (prints "nothing to refine"). |
| 1 | No iteration improved loss; seed still wins. Session still written. |
| 2 | Setup failure: seed missing, unreadable, or no FINAL/. |

6.2 Budget halving

| Field | Rule |
| --- | --- |
| max_loops, max_total_workers, max_tool_calls | ceil(seed × 0.5), floored at 1 |
| max_total_tokens | seed // 2, floored at 1 |
| max_wall_time | int(observed_wall_time × 0.5), floored at 60 s |
| max_depth | inherited unchanged |
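The rules in the table translate directly into arithmetic. A minimal sketch, assuming a flat budget dict keyed by the field names above; halve_budget is a hypothetical name (the real entry point is budget_for_iteration, whose signature is not shown here), and which seed figure the caller passes in, cap or observed consumption, is left open.

```python
import math


def halve_budget(seed: dict, observed_wall_time_s: float) -> dict:
    """Apply the halving rules from the table to a flat seed budget dict."""
    out = {}
    # Counts round up, so a seed value of 1 stays at 1.
    for key in ("max_loops", "max_total_workers", "max_tool_calls"):
        out[key] = max(1, math.ceil(seed[key] * 0.5))
    # Tokens round down, with a floor of 1.
    out["max_total_tokens"] = max(1, seed["max_total_tokens"] // 2)
    # Wall time is halved from the observed value and floored at 60 seconds.
    out["max_wall_time"] = max(60, int(observed_wall_time_s * 0.5))
    # Depth is structural, so it is inherited unchanged.
    out["max_depth"] = seed["max_depth"]
    return out
```

With a seed of 5 loops, 3 workers, 1 tool call, 100001 tokens and 90 s of observed wall time, the result is 3 loops, 2 workers, 1 tool call, 50000 tokens and a 60 s wall-time cap.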

6.3 Event & artifact schemas parsed

Gradient extraction is defensive. It parses both the real runtime format and the unit-test synthetic format, so it lands cleanly across the fleet:

  • Critique defects (preferred): aggregated from iterations/<k>/critique.json.critiques[].defects[], schema {category, location, description, severity}. De-duplicated by (description[:120], severity).
  • Critique defects (fallback): run_completion.json.critique.defects[], schema {summary, severity}. Used by synthetic unit tests.
  • Gate rejections (real): events.jsonl where category == "gate" and fields.triggered is True; reads fields.gate and fields.reason.
  • Gate rejections (synthetic): events.jsonl where type == "gate.reject"; reads gate and reason.
  • Eval deltas: run_completion.json.evaluation.thresholds minus .per_metric (threshold − observed, as in §1.2), keeping only positive gaps.
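The two gate-rejection schemas and the eval-gap rule can be handled by one tolerant parser each. The sketch below is illustrative: extract_rejections and eval_gaps are hypothetical names, the default of three rejections follows the "last 3 gate rejections" note at the top of this page, and silently skipping unparseable lines is an assumption.

```python
import json


def extract_rejections(lines: list[str], last_n: int = 3) -> list[dict]:
    """Collect gate rejections from events.jsonl lines in either schema."""
    found = []
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip blank, truncated, or corrupt lines
        if not isinstance(event, dict):
            continue
        fields = event.get("fields")
        if not isinstance(fields, dict):
            fields = {}
        if event.get("category") == "gate" and fields.get("triggered") is True:
            # real runtime schema
            found.append({"gate": fields.get("gate"), "reason": fields.get("reason")})
        elif event.get("type") == "gate.reject":
            # synthetic unit-test schema
            found.append({"gate": event.get("gate"), "reason": event.get("reason")})
    return found[-last_n:] if last_n > 0 else []


def eval_gaps(per_metric: dict, thresholds: dict) -> dict:
    """Per-metric gap (threshold - observed), keeping only positive gaps."""
    gaps = {}
    for name, threshold in thresholds.items():
        if name in per_metric and threshold - per_metric[name] > 0:
            gaps[name] = threshold - per_metric[name]
    return gaps
```

A metric that already meets its threshold produces no entry, which is what allows a fully satisfied run to yield an empty gradient.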

6.4 events.jsonl path resolution

The runtime writes events to <workspace>/logs/<run_id>/events.jsonl, not to the run directory itself. gradient._resolve_events_path walks up the parent chain of the run directory looking for a sibling logs/<run_id>/events.jsonl. Falls back to a colocated events.jsonl (which is what synthetic fixtures use).
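The lookup order described above can be sketched with pathlib. This is an illustration of the walk-up strategy only; resolve_events_path is a hypothetical public-style name, and starting the search at the run directory itself before moving to its parents is an assumption.

```python
from pathlib import Path
from typing import Optional


def resolve_events_path(run_dir: Path, run_id: str) -> Optional[Path]:
    """Find events.jsonl for a run: walk up looking for logs/<run_id>/, then fall back."""
    for base in (run_dir, *run_dir.parents):
        candidate = base / "logs" / run_id / "events.jsonl"
        if candidate.is_file():
            return candidate
    # Fallback used by synthetic fixtures: events.jsonl next to the run's files.
    colocated = run_dir / "events.jsonl"
    return colocated if colocated.is_file() else None
```

Returning None rather than raising lets the caller treat a missing event log as "no gate rejections" and still build a gradient from the other two sources.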

6.5 Seed budget parsing

run_completion.json.final_budget uses nested dicts on real runs:

{
  "loops":      {"used": N, "max": M},
  "workers":    {"spawned": N, "max": M},
  "tokens":     {"consumed": N, "max": M},
  "tool_calls": {"used": N, "max": M},
  "wall_time":  {"elapsed_s": N, "max_s": M}
}

RefinementLoop._read_seed_context supports both this shape and the flat legacy {max_loops, max_total_tokens, …} shape used by unit-test fixtures.
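Supporting both shapes amounts to trying the nested path first and falling back to the flat key. A minimal sketch, assuming the flat legacy shape uses the field names from §6.2; read_budget_caps is a hypothetical helper, not the body of _read_seed_context.

```python
def read_budget_caps(final_budget: dict) -> dict:
    """Extract budget caps from either the nested or the flat final_budget shape."""
    # flat key -> (nested section, key holding the cap inside that section)
    nested_paths = {
        "max_loops": ("loops", "max"),
        "max_total_workers": ("workers", "max"),
        "max_total_tokens": ("tokens", "max"),
        "max_tool_calls": ("tool_calls", "max"),
        "max_wall_time": ("wall_time", "max_s"),
    }
    caps = {}
    for flat_key, (section, cap_key) in nested_paths.items():
        value = final_budget.get(section)
        if isinstance(value, dict) and cap_key in value:
            caps[flat_key] = value[cap_key]      # nested shape from real runs
        elif flat_key in final_budget:
            caps[flat_key] = final_budget[flat_key]  # flat legacy shape
    return caps
```

Fields that appear in neither shape are simply absent from the result, so the caller decides whether a missing cap is an error or takes a default.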

6.6 Model tiering (low / mid / high)

Refinement can run each iteration with a different {manager, worker} model pair — a coarse-to-fine annealing schedule across N iterations. The schedule is a pure function of (k, N):

  • N == 1 → high
  • N == 2 → [low, high]
  • N ≥ 3 → thirds-proportional: low_end = ceil(N / 3), high_start = N − floor(N / 3) + 1

Full mapping table (N ∈ [1, 10]):

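| N | Mapping | low | mid | high |
| --- | --- | --- | --- | --- |
| 1 | H | 0 | 0 | 1 |
| 2 | L, H | 1 | 0 | 1 |
| 3 | L, M, H | 1 | 1 | 1 |
| 4 | L, L, M, H | 2 | 1 | 1 |
| 5 | L, L, M, M, H | 2 | 2 | 1 |
| 6 | L, L, M, M, H, H | 2 | 2 | 2 |
| 7 | L, L, L, M, M, H, H | 3 | 2 | 2 |
| 8 | L, L, L, M, M, M, H, H | 3 | 3 | 2 |
| 9 | L, L, L, M, M, M, H, H, H | 3 | 3 | 3 |
| 10 | L, L, L, L, M, M, M, H, H, H | 4 | 3 | 3 |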
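The schedule rules reproduce every row of the table when written as a pure function of (k, N). The sketch below follows the stated formulas directly; tier_for_iteration and schedule are hypothetical names and may not match what tiers.py exposes.

```python
import math


def tier_for_iteration(k: int, n: int) -> str:
    """Map iteration k (1-based) of n to 'low', 'mid', or 'high'."""
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n == 1:
        return "high"
    if n == 2:
        return "low" if k == 1 else "high"
    low_end = math.ceil(n / 3)        # last iteration of the low tier
    high_start = n - n // 3 + 1       # first iteration of the high tier
    if k <= low_end:
        return "low"
    if k >= high_start:
        return "high"
    return "mid"


def schedule(n: int) -> list[str]:
    """Full tier sequence for a session of n iterations."""
    return [tier_for_iteration(k, n) for k in range(1, n + 1)]
```

For instance, `schedule(4)` gives low, low, mid, high, and `schedule(10)` gives four low, three mid, and three high iterations, matching the table.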

Why low → high and not the reverse: residual loss shrinks per iteration, so precision matters most late. Weak early models produce the defects the gradient extractor (§1.2) needs; a strong early model may silently paper over them and starve the loop of signal. Budget halving cooperates — the weakest tier has the largest headroom; the strongest operates on the smallest residual.

Fallback semantics (empty tier field). For a tier T resolving at iteration k:

resolved_manager = T.manager if T.manager else seed.manager
resolved_worker  = T.worker  if T.worker  else seed.worker

Empty string and None are treated identically. All three tiers empty → tiering is a no-op (TierPlan.is_trivial() == True); every iteration resolves to the seed pair, and the loop behaves identically to the legacy single-model path.

Invariants (enforced by unit tests in packages/awp-runtime/tests/refinement/test_tiers.py):

  1. For N ≥ 3, every tier has ≥ 1 iteration.
  2. The mapping is monotonic: once a later-tier iteration has fired, no earlier-tier iteration follows.
  3. count(low) ≥ count(mid) ≥ count(high) for every N ≥ 3.

API body shape (POST /api/experiments/<run_id>/refine):

{
  "iterations": 3,
  "tier_low":  {"manager": "", "worker": ""},
  "tier_mid":  {"manager": "", "worker": ""},
  "tier_high": {"manager": "", "worker": ""}
}

When any tier_* field is present, the legacy model / worker_model fields are ignored and the seed run's parsed models become the fallback baseline. Mixed bodies produce the warning "refinement.mixed_body: tier_* set; ignoring legacy model/worker_model" but still return 202.

CLI:

awp refine <seed_run_dir> \
  --tier-low  "deepseek/deepseek-chat-v3.1:deepseek/deepseek-chat-v3.1" \
  --tier-mid  "openai/gpt-5-mini:deepseek/deepseek-chat-v3.1" \
  --tier-high "anthropic/claude-opus-4:anthropic/claude-sonnet-4"

Each flag takes a manager:worker pair (split on the first colon; either side may be empty → falls back to seed).
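Parsing a flag value and applying the seed fallback can be sketched together. parse_tier_pair and resolve_pair are hypothetical helpers, and treating a value with no colon as "manager only" is an assumption; only the first-colon split and the empty-side fallback come from the text.

```python
def parse_tier_pair(value: str) -> dict:
    """Split 'manager:worker' on the first colon; either side may be empty."""
    manager, _, worker = value.partition(":")
    return {"manager": manager.strip(), "worker": worker.strip()}


def resolve_pair(tier: dict, seed: dict) -> tuple:
    """Empty string and None both fall back to the seed's model."""
    manager = tier.get("manager") or seed["manager"]
    worker = tier.get("worker") or seed["worker"]
    return manager, worker
```

Because the split happens on the first colon, slashes in model identifiers are safe, but a manager identifier that itself contains a colon would be cut at that point.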

Observability — session sidecars gain three per-iteration fields (tier, model_manager, model_worker) and one session-level flag (tier_plan_used). Legacy sidecars without these keys still render correctly; consumers must treat them as optional.

Store + UI — the workflow store carries defaults in refinement_tier_low/mid/high; the Settings panel's Optimizers → "Model Tiers" block edits those defaults; the RefineModal toggle "Use tiered models across iterations" opts a single session in and pre-fills the three pairs from the store.

Out of scope for this mechanism: the outer loop (awp optimize) is not tiered; its manager_model / TextGrad-optimizer-model remains a single value. Revisiting this is deferred until refinement tiers have production runs behind them.


7. Files

| Purpose | Path |
| --- | --- |
| Orchestrator | packages/awp-runtime/src/awp/refinement/loop.py |
| Gradient extractor | packages/awp-runtime/src/awp/refinement/gradient.py |
| Workspace seeding | packages/awp-runtime/src/awp/refinement/seed.py |
| Budget scaling | packages/awp-runtime/src/awp/refinement/budget.py |
| Session + BEST writers | packages/awp-runtime/src/awp/refinement/session.py |
| Model tiering (TierPlan) | packages/awp-runtime/src/awp/refinement/tiers.py |
| CLI | packages/awp-core/src/awp/cli.py::cmd_refine |
| Backend API | packages/awp-ui/server/api/routes.py (POST /api/experiments/<run_id>/refine, GET /api/experiments/<run_id>/refinement_sessions) |
| Frontend | packages/awp-ui/frontend/src/components/Refinement/ (RefineModal, RefinementSessionsList) + hook on RunHistory |
| E2E | packages/awp-runtime/tests/e2e/test_e2e_refinement.py |
| Unit tests | packages/awp-runtime/tests/refinement/ |
| Normative rule | spec/versions/1.0/validation-rules.md § R36 |

8. Further reading

  • CLAUDE.md §4 — Loss & Backprop: the ML Lens — the ML framing refinement implements at the task level.
  • docs/outer-loop.md — the θ-axis sibling.
  • spec/versions/1.0/validation-rules.md § R36 — normative R36 text.

9. Task-attached refinement (Plan 4)

When awp refine --target <experiment_id>:<task_id> is invoked, the CLI loads the task's BEST run (the run marked as best for that task in the experiment hierarchy — see spec/versions/1.0/experiment-task-hierarchy-design.md) and launches the refinement loop as described in §6. Refinement sessions are persisted under <experiment>/tasks/<task_id>/refinements/ as session_<timestamp>.json, with one session.json file serving as the latest-session pointer. Winning iterations are hard-linked into <experiment>/tasks/<task_id>/BEST/, overwriting the prior best only if loss is strictly lower (see packages/awp-runtime/src/awp/refinement/session.py).

The target-attached mode allows a refinement loop to be tied to a specific task within an experiment hierarchy, enabling downstream continuation tasks to read the refined output as their --primary input (see spec/versions/1.0/experiment-task-hierarchy-design.md §R37). Implementation lives in packages/awp-core/src/awp/experiment/cli_handlers.py::refine_task_aware.

See also

  • docs/continuation.md — y-axis carry-over across tasks (user-feedback gradient). Related but distinct from refinement's auto-gradient within one run.
  • docs/outer-loop.md — θ-axis optimisation. Now per-experiment (decision β).
  • packages/awp-runtime/src/awp/refinement/loop.py: the RefinementLoop implementation.
  • After Plan 4: use awp refine --target <exp>:<task> to attach sessions under <task>/refinements/session_<ts>/.