Task-local iterative refinement of a completed run's deliverable. SGD on y (the output), not θ (the policy).
See also: Parent: docs/README.md · Orthogonal axis: outer-loop.md moves θ (prompt artifacts), refinement moves y (the seed run's deliverable); both reuse the same loss function · Gradient sources: critique.md (defects, R35 fixpoint guard), runtime.md (last 3 gate rejections from the completion gate chain), evaluation.md (score deltas) · Never active inside `awp run`: entered only via `awp refine` · Guard rule: R36 (empty gradient), authoritative in spec/versions/1.0/validation-rules.md §12, catalogued in validation.md · Autonomy mapping: compliance.md (refinement sits outside the 7 layers of layer-model.md) · Engine context: ORCHESTRATION_ENGINES.md (each iteration is a standalone delegation-loop run with budget halved vs. the seed)
awp refine <seed_run_dir> reads a completed run, extracts a
deterministic "gradient" from its critique
defects, gate-chain rejections, and
eval deltas, and drives an iterative
loop that re-runs the same task with the prior deliverable as
starting state until the loss
stops decreasing. The winning iteration is promoted to
<seed>/BEST/. Orthogonal to awp optimize —
they move different parameters (see §5 "Complementarity") and never
interact.
Where this sits in AWP:
- Layer: 6, Orchestration (see CLAUDE.md §Two Orchestration Engines). Refinement is an outer wrapper around the existing `DelegationLoopRunner`; it does not introduce a new engine.
- Autonomy: composes with any A2+ workflow (requires manager + critique + gate chain). Does not change the autonomy spectrum A0–A4.
- Agent contract: honors R17. Every iteration's manager and workers return `{self.name: {"confidence": ..., ...}}`; refinement never reaches inside the contract.
- Budget envelope: every iteration runs under a halved copy of the seed's envelope (see §6.2). The envelope itself (CLAUDE.md §Key Protocols) is unchanged.
- Validation: introduces exactly one new normative rule, R36 (§1.2 below).
AWP treats every full run as a forward pass through a system of
agents. CLAUDE.md §4 — Loss & Backprop: the ML Lens
formalizes this: the E2E run is the forward pass, the
gate chain + critique + eval produce a
scalar loss (computed by
compute_run_loss),
and the 5-Why-by-Layer protocol
is the backprop that turns a symptom into a gradient you can edit
against.
That framing has two natural axes:
A policy update (awp optimize) teaches the
system to do better next time on any similar task — it moves
θ, the six versioned prompt artifacts held by the
ArtifactRegistry
(worker_pitfalls, manager_planning_preamble,
experiment_context_hint_template, pattern_library,
tool_description_templates, critique_rubric — see
awp.outer_loop.defaults).
A refinement (awp refine) leaves θ alone and spends extra
inference compute right now on this specific deliverable
until a stricter loss signal says stop. One is meta-learning; the
other is extra forward passes. They compose cleanly because they
edit disjoint state: optimize writes to the artifact registry
(SQLite under ~/.awp/outer_loop.db by default, override via
AWP_OUTER_LOOP_DB), refine writes to a specific seed's
refinement_sessions/ and BEST/.
Concretely: if your critique keeps flagging the same kind of
defect across many runs, use optimize — the policy is wrong. If
this particular run came out imperfect but the policy is fine,
use refine — the instance needs more compute, not a retrained
manager.
The refinement gradient is not a numerical vector. It is a structured, deterministic summary of what the prior run got wrong, extracted from on-disk artifacts without any LLM call:
| Source | Signal | Produced by |
|---|---|---|
| `iterations/<k>/critique.json` (per-worker) | defect descriptions + severity | Critique engine (`packages/awp-runtime/src/awp/runtime/critique/engine.py`) |
| `logs/<run_id>/events.jsonl` where `category=="gate"` ∧ `fields.triggered` | gate name + rejection reason | Completion gate chain (L0 → critique → deliverable_presence → placeholder → file → deliverable → structural_integrity → eval) |
| `run_completion.json.evaluation.per_metric` vs. `.thresholds` | per-metric gap (threshold − observed) | Evaluation layer |
All three are concatenated into a
RefinementGradient
Pydantic object, serialized to <workspace_k>/gradient_input.json,
and rendered into a deterministic text prefix that is prepended to
the manager's first PLAN user message on iteration k via the
new manager_prompt_prefix parameter on
AgentWorkflow
(threaded through
DelegationLoopRunner
which injects it on iteration == 1 only). Subsequent manager calls
within the same iteration do NOT see the prefix — the refinement
intent is by then persisted in plan + state.
R36 (normative): if the gradient is empty (no defects, no
rejections, no metric gaps), refinement aborts before iteration 1
with "nothing to refine". Full text in
spec/versions/1.0/validation-rules.md.
R36 sits alongside R17 (agent output contract),
R33 (deterministic phase purity),
R34 (L0 output contract),
and R35 (repair fixpoint guard)
as runtime-enforced rules (not static validator rules).
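
The shape of the gradient and the R36 check can be sketched as follows. This is a minimal illustration, not the actual `RefinementGradient` class: it uses stdlib dataclasses instead of Pydantic to stay dependency-free, and the field names (`defects`, `gate_rejections`, `eval_gaps`) and the function name `render_prefix_sketch` are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class RefinementGradient:
    """Illustrative stand-in; field names are assumptions, not the real schema."""
    defects: list[dict] = field(default_factory=list)          # {"description", "severity"}
    gate_rejections: list[dict] = field(default_factory=list)  # {"gate", "reason"}
    eval_gaps: dict[str, float] = field(default_factory=dict)  # metric -> threshold - observed

    def is_empty(self) -> bool:
        # R36: nothing to act on when all three sources are empty.
        return not (self.defects or self.gate_rejections or self.eval_gaps)


def render_prefix_sketch(g: RefinementGradient) -> str:
    """Render a text prefix with no LLM call; sorting keeps the output stable."""
    if g.is_empty():
        raise ValueError("nothing to refine")  # R36: abort before iteration 1
    lines = ["Preserve what works; fix what the gradient identifies."]
    for d in sorted(g.defects, key=lambda d: (d["severity"], d["description"])):
        lines.append(f"- defect [{d['severity']}]: {d['description']}")
    for r in g.gate_rejections:  # kept in log order
        lines.append(f"- gate {r['gate']} rejected: {r['reason']}")
    for metric, gap in sorted(g.eval_gaps.items()):
        lines.append(f"- metric {metric}: {gap:.3f} below threshold")
    return "\n".join(lines)
```

Because the rendering depends only on the extracted data, the same seed run always yields the same prefix, which is what makes `gradient_input.json` usable as an audit trail.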
An iteration is one full AgentWorkflow run with:
- Task = the seed's original task text (unchanged).
- Starting state = hard-linked contents of the prior iteration's `FINAL/` at `<workspace_k>/input/`. The manager sees the prior deliverable on disk and is told (via the prefix) to iterate on it, not rewrite from scratch.
- Budget = halved from the seed's observed consumption (`ceil` for counts, floor for bytes/seconds, wall-time halved from observed-not-cap, 60 s floor). Depth unchanged: it is structural, not a cost.
- Gradient prefix = 10–40 lines of deterministic text listing defects, rejected gates, and metric gaps with an explicit "preserve what works; fix what the gradient identifies" directive.
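
Seeding the starting state by hard link can be sketched like this. The function name `seed_input_from_final` is hypothetical; the real logic lives in the workspace-seeding module and may differ in details such as symlink or collision handling.

```python
import os
from pathlib import Path


def seed_input_from_final(final_dir: Path, input_dir: Path) -> int:
    """Hard-link every file under final_dir into input_dir, keeping relative paths.

    Returns the number of files linked. Hypothetical helper for illustration.
    """
    linked = 0
    for src in sorted(final_dir.rglob("*")):
        if not src.is_file():
            continue  # directories are recreated on demand below
        dst = input_dir / src.relative_to(final_dir)
        dst.parent.mkdir(parents=True, exist_ok=True)
        os.link(src, dst)  # same inode, no byte copy; needs the same filesystem
        linked += 1
    return linked
```

Hard links make the prior deliverable visible in the new workspace without duplicating bytes, at the cost of requiring both directories to sit on one filesystem.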
After the iteration completes, the refinement loop computes loss
via awp.outer_loop.loss.compute_run_loss
(reused unchanged — the same scalar that drives
awp optimize) and updates the best-so-far
pointer.
The loop is a best-so-far tracker that stops on the first of these conditions:
| Stop | Why |
|---|---|
| `max_iterations` | k == N (`--iterations`, clamped [1, 10], default 3). |
| `regression` | Loss rose vs. previous iteration twice in a row: extra compute is making it worse. |
| `plateau` | Loss stopped decreasing between iterations; more compute no longer helps. |
| `wall_time_exhausted` | Cumulative wall time ≥ 2× seed observed wall time: spending more than twice the original run is a poor trade. |
| `empty_gradient` | R36 at iter 1, or gradient becomes empty at iter k>1. |
| `no_prior_deliverable` | Prior iteration produced nothing to seed from. |
| `error:<ExcName>` | An iteration crashed; the session is still finalized (see §4). |
The session is always persisted, even on abort. You can always
re-read <seed>/refinement_sessions/<ts>.json and know exactly
what happened.
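
The loss-driven stop conditions can be sketched as a pure function over the loss history. This is an illustration under stated assumptions: `plateau_eps` is a hypothetical threshold (the source does not give one), the order of the checks is a guess, and the seed loss is treated as the baseline for iteration 1.

```python
from typing import Optional


def clamp_iterations(n: int) -> int:
    """--iterations is clamped to [1, 10]."""
    return max(1, min(10, n))


def stop_reason_sketch(
    losses: list[float],
    max_iterations: int,
    elapsed_s: float,
    seed_wall_time_s: float,
    plateau_eps: float = 1e-3,  # hypothetical value, not taken from the source
) -> Optional[str]:
    """Return a stop reason after the latest iteration, or None to continue.

    losses[0] is the seed loss; losses[k] is the loss of iteration k.
    """
    k = len(losses) - 1
    # Two consecutive increases: extra compute is making things worse.
    if k >= 2 and losses[-1] > losses[-2] > losses[-3]:
        return "regression"
    if k >= 1 and abs(losses[-1] - losses[-2]) < plateau_eps:
        return "plateau"
    if elapsed_s >= 2 * seed_wall_time_s:
        return "wall_time_exhausted"
    if k >= max_iterations:
        return "max_iterations"
    return None


def best_so_far(losses: list[float]) -> int:
    """Index of the lowest loss; 0 means the seed still wins. Ties keep the earlier index."""
    return min(range(len(losses)), key=lambda i: losses[i])
```

Keeping the decision a pure function of recorded losses and timings means a session sidecar contains everything needed to replay why the loop stopped.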
Full on-disk tree:
<seed_run_dir>/ <-- completed AWP run, returned by awp run / UI
├── run_completion.json <-- parsed for task + final_budget
├── iterations/<k>/critique.json <-- mined for defects
├── FINAL/ <-- starting deliverable (promoted from output/ if missing)
├── logs/<run_id>/events.jsonl <-- mined for gate rejections
│ (may live under ../../logs/<run_id>/)
│
├── refinement_sessions/ <-- NEW, written by awp refine
│ └── refine_20260419T182438Z.json <-- one per session, always written
│
└── BEST/ <-- NEW, only updated on improvement
├── manifest.json <-- best_run_id, best_loss, seed_loss, session_id
└── <deliverable files> <-- hard-links from the winning iteration
/tmp/awp-experiments/refine_<ts>/ <-- iteration workspaces (independent experiments)
├── iter_1/
│ ├── input/ <-- hard-linked from <seed>/FINAL/
│ ├── gradient_input.json <-- R36 audit trail
│ ├── workspace/runs/<run_id>/ <-- the actual agent run
│ │ ├── run_completion.json <-- parent_run_id == <seed>.run_id
│ │ └── iterations/…
│ └── output/<run_id>/ <-- deliverables; promoted to FINAL/ on demand
├── iter_2/
│ ├── input/ <-- hard-linked from iter_1/…/FINAL/
│ └── … <-- parent_run_id == iter_1.run_id
└── …
Every iteration is a first-class experiment in the experiment
DB, linked upstream via parent_run_id (threaded through
AgentWorkflow
and persisted in run_completion.json — see commit 3fcde7b).
The UI renders the chain without any schema
change; see
RefinementSessionsList
for the panel that surfaces it next to a seed run's history entry.
One page of call-graph (text form):
awp refine <seed> --iterations N
│
▼
awp.cli.cmd_refine(args)
│
├── guard: <seed>/run_completion.json + FINAL/ exist → else exit 2
│
▼
awp.refinement.loop.RefinementLoop(seed_run_dir=...).run(iterations=N)
│
├── extract_gradient(<seed>) — R36 check at iter 0
│ │
│ ├── _extract_defects_from_iterations(…) — reads iterations/*/critique.json
│ ├── _extract_last_rejections(events.jsonl) — supports real + synthetic schemas
│ └── _extract_eval_deltas(run_completion.json) — threshold − observed
│
├── _read_seed_context() — parses final_budget (nested or flat)
│
├── for k in 1..N: ┌──── try/except/finally: sidecar ALWAYS written ──┐
│ │ │ │
│ ├── _ensure_final_dir(prior_run) │ — fallback-promotes output/<run_id>/ → FINAL/ │
│ ├── prepare_iteration_workspace │ — hard-link FINAL → workspace_k/input/ │
│ ├── extract_gradient(prior_run) │ — re-extract for iter k>1 │
│ ├── write gradient_input.json │ │
│ ├── render_refinement_prefix() │ │
│ ├── budget_for_iteration(…) │ — halve with ceil/floor, 60 s wall-time floor │
│ ├── default_workflow_factory(…) │ — spawn AgentWorkflow with prefix + inputs │
│ ├── compute_run_loss(run_dir) │ │
│ ├── update best_iter / best_loss │ │
│ └── check stop conditions │ │
│ │ │
│ on any exception: stop_reason = "error:<ExcName>"; finally block still runs │
│ │
│ finally: │
│ write_session_sidecar(<seed>, session) │
│ if best_iter > 0: write_best_pointer(<seed>, winning_run, …) │
│ │
└── return RefinementResult ──────────┘
Key contracts enforced by the code:
- Every session is observable. `try/finally` guarantees the sidecar is written whether the loop ran to completion, hit a stop condition, or raised mid-iteration.
- Every iteration has a starting deliverable. `_ensure_final_dir` walks up from the iteration's run dir, finds the workspace-level `output/<run_id>/`, and hard-links it into `<run_dir>/FINAL/`. This bridges the gap between the runtime's stricter `_write_canonical_final_output` (which requires declared deliverables to exist) and refinement's unconditional need for some seed deliverable.
- BEST never regresses. `write_best_pointer` only overwrites an incumbent `BEST/` if the new session's `best_loss` is strictly lower than the stored one.
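
The "every session is observable" contract comes down to a `try/except/finally` shape. The sketch below is a simplified stand-in: `run_session_sketch` and its callable-based iteration list are hypothetical, and the real loop also checks stop conditions and updates `BEST/` inside the same structure.

```python
import json
from pathlib import Path


def run_session_sketch(seed_dir: Path, session_id: str, iterations) -> dict:
    """Run zero-argument callables in order; always persist the session sidecar.

    Each callable stands in for one refinement iteration and returns its loss.
    """
    session = {"session_id": session_id, "stop_reason": None, "iterations": []}
    try:
        for k, run_iteration in enumerate(iterations, start=1):
            loss = run_iteration()
            session["iterations"].append({"k": k, "loss": loss})
        session["stop_reason"] = "max_iterations"
    except Exception as exc:
        # Record the crash; iterations that already completed stay in the session.
        session["stop_reason"] = f"error:{type(exc).__name__}"
    finally:
        # Runs on success, on a stop condition, and on a crash alike.
        out_dir = seed_dir / "refinement_sessions"
        out_dir.mkdir(parents=True, exist_ok=True)
        (out_dir / f"{session_id}.json").write_text(json.dumps(session, indent=2))
    return session
```

Writing the sidecar in `finally` rather than at the end of the happy path is what lets a crashed session still be inspected afterwards.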
<seed>/refinement_sessions/<session_id>.json:
{
"session_id": "refine_20260419T182438Z",
"seed_run_id": "2026-04-19_16-05-53_cbf28fd0",
"started_at": "2026-04-19T18:24:38Z",
"completed_at": "2026-04-19T18:26:50Z",
"stop_reason": "no_prior_deliverable",
"best_iter": 1,
"iterations": [
{"k": 1, "run_id": "2026-04-19_18-24-38_1ba2c7ed",
"loss": 0.3500, "status": "partial"}
]
}

What to read out of this:

- `best_iter: 0` means the seed still wins: no iteration beat the baseline. Open the run_dirs of the iterations to see why they failed to improve (critique, gates, trace).
- `stop_reason: regression`: the gradient was real but the policy can't act on it at its current compute budget. Consider a larger `--iterations` with a stronger `--model`, or move to `awp optimize` if the same regression appears across tasks.
- `stop_reason: plateau`: the system has converged. No more compute will help; accept the current BEST.
- `stop_reason: empty_gradient_midloop`: an iteration produced a "perfect" run (no defects, no gate rejects, eval satisfied). Rare. Usually accompanies `best_iter > 0`.
- `stop_reason: error:<ExcName>`: a crash happened mid-loop. The session captures every iteration that did complete; check logs for the traceback.

BEST manifest (<seed>/BEST/manifest.json) names the winning
iteration and records the delta against the seed. Running awp refine
again with a better model or more iterations overwrites BEST only
if it finds a strictly lower loss — so the canonical deliverable
monotonically improves session over session.
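
The strictly-lower rule for `BEST/` can be sketched as below. The manifest keys follow the on-disk tree above (`best_run_id`, `best_loss`, `seed_loss`, `session_id`); the function name `update_best_manifest` is hypothetical and the sketch leaves out hard-linking the deliverable files.

```python
import json
from pathlib import Path


def update_best_manifest(
    best_dir: Path, run_id: str, loss: float, seed_loss: float, session_id: str
) -> bool:
    """Write BEST/manifest.json only if loss is strictly lower than the incumbent.

    Returns True when the manifest was written, False when the incumbent stays.
    """
    manifest_path = best_dir / "manifest.json"
    if manifest_path.exists():
        incumbent = json.loads(manifest_path.read_text())
        if loss >= incumbent["best_loss"]:
            return False  # ties do not overwrite: strictly lower only
    best_dir.mkdir(parents=True, exist_ok=True)
    manifest_path.write_text(json.dumps({
        "best_run_id": run_id,
        "best_loss": loss,
        "seed_loss": seed_loss,
        "session_id": session_id,
    }, indent=2))
    return True
```

Rejecting ties keeps the first run that reached a given loss as the canonical deliverable, so repeated sessions cannot churn `BEST/` without a real improvement.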
Both commands reduce the same scalar loss (compute_run_loss), but
they act on disjoint state:
| Dimension | `awp optimize` | `awp refine` |
|---|---|---|
| Parameter | θ: 6 versioned prompt artifacts | y: one task's deliverable |
| Scope | a task suite (generalization) | one seed run (instance) |
| Loop | SGD with rollback on mean-loss regression | best-so-far with regression/plateau guards |
| Persistence | artifact registry + `epochs` table | `refinement_sessions/` + `BEST/` on the seed |
| Triggered by | "our runs are systematically worse than they should be" | "this specific run is imperfect" |
| Uses | TextGrad LLM-as-optimizer | raw LLM passes with a deterministic gradient prefix |
Running one does not affect the other's state. You can (and
should) mix them: awp optimize trains the policy weekly, awp refine
polishes any individual result that matters.
awp refine <seed_run_dir> [--iterations N] [--model M] [--worker-model M]
Exit codes:
| Code | Meaning |
|---|---|
| `0` | At least one iteration improved loss; `BEST/` updated. |
| `0` | Empty gradient: nothing to refine (prints "nothing to refine"). |
| `1` | No iteration improved loss; seed still wins. Session still written. |
| `2` | Setup failure: seed missing, unreadable, no `FINAL/`. |
| Field | Rule |
|---|---|
| `max_loops`, `max_total_workers`, `max_tool_calls` | `ceil(seed × 0.5)`, floored at 1 |
| `max_total_tokens` | `seed // 2`, floored at 1 |
| `max_wall_time` | `int(observed_wall_time × 0.5)`, floored at 60 s |
| `max_depth` | inherited unchanged |
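
The table translates directly into a small function. This is a sketch, not the contents of `budget.py`: the name `halve_budget` and the flat input dict (including the `observed_wall_time` key) are assumptions for the example.

```python
import math


def halve_budget(seed: dict) -> dict:
    """Apply the per-field halving rules to a flat seed budget dict.

    `seed` is assumed to carry the field names from the table plus
    `observed_wall_time` in seconds (hypothetical input shape).
    """
    out = {}
    # Counts round up so a seed value of 1 never becomes 0.
    for name in ("max_loops", "max_total_workers", "max_tool_calls"):
        out[name] = max(1, math.ceil(seed[name] * 0.5))
    # Tokens round down, with a floor of 1.
    out["max_total_tokens"] = max(1, seed["max_total_tokens"] // 2)
    # Wall time is halved from what the seed actually used, never below 60 s.
    out["max_wall_time"] = max(60, int(seed["observed_wall_time"] * 0.5))
    # Depth is structural, not a cost, so it is inherited unchanged.
    out["max_depth"] = seed["max_depth"]
    return out
```

For instance, a seed with 5 loops and 90 s of observed wall time yields 3 loops and the 60 s floor rather than 45 s.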
Gradient extraction is defensive: it parses both the real runtime format and the unit-test synthetic format, so it lands cleanly across the fleet:
- Critique defects (preferred): aggregated from `iterations/<k>/critique.json.critiques[].defects[]`, schema `{category, location, description, severity}`. De-duplicated by `(description[:120], severity)`.
- Critique defects (fallback): `run_completion.json.critique.defects[]`, schema `{summary, severity}`. Used by synthetic unit tests.
- Gate rejections (real): `events.jsonl` where `category == "gate"` and `fields.triggered is True`; reads `fields.gate` and `fields.reason`.
- Gate rejections (synthetic): `events.jsonl` where `type == "gate.reject"`; reads `gate` and `reason`.
- Eval deltas: `run_completion.json.evaluation.thresholds` minus `.per_metric`, keeping only positive gaps (threshold − observed).
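
A sketch of the two-schema gate parsing and the eval-gap computation follows. The function names are hypothetical, skipping malformed lines is an assumption about what "defensive" means here, and the `last_n=3` default mirrors the "last 3 gate rejections" mentioned in the header.

```python
import json


def extract_gate_rejections(jsonl_text: str, last_n: int = 3) -> list[dict]:
    """Collect gate rejections from events.jsonl text in either schema."""
    rejections = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # assumption: malformed lines are skipped, not fatal
        if not isinstance(event, dict):
            continue
        fields = event.get("fields")
        if not isinstance(fields, dict):
            fields = {}
        if event.get("category") == "gate" and fields.get("triggered") is True:
            # Real runtime schema.
            rejections.append({"gate": fields.get("gate"), "reason": fields.get("reason")})
        elif event.get("type") == "gate.reject":
            # Synthetic unit-test schema.
            rejections.append({"gate": event.get("gate"), "reason": event.get("reason")})
    return rejections[-last_n:]


def eval_gaps(per_metric: dict, thresholds: dict) -> dict:
    """Return threshold - observed for every metric that falls short."""
    return {
        metric: thresholds[metric] - per_metric[metric]
        for metric in thresholds
        if metric in per_metric and thresholds[metric] > per_metric[metric]
    }
```

Metrics that meet or exceed their threshold drop out entirely, so a fully satisfied evaluation contributes nothing to the gradient.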
The runtime writes events to <workspace>/logs/<run_id>/events.jsonl,
not to the run directory itself. gradient._resolve_events_path
walks up the parent chain of the run directory looking for a
sibling logs/<run_id>/events.jsonl. It falls back to a colocated
events.jsonl (which is what synthetic fixtures use).
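
The lookup can be sketched with `pathlib`. This is not the body of `gradient._resolve_events_path`; it assumes the run directory's name equals the `run_id`, and the function name is hypothetical.

```python
from pathlib import Path
from typing import Optional


def resolve_events_path_sketch(run_dir: Path) -> Optional[Path]:
    """Find events.jsonl for a run: ancestor logs/<run_id>/ first, then colocated.

    Assumes run_dir.name is the run_id (illustrative assumption).
    """
    run_id = run_dir.name
    # Walk up the parent chain looking for a sibling logs/<run_id>/events.jsonl.
    for ancestor in run_dir.parents:
        candidate = ancestor / "logs" / run_id / "events.jsonl"
        if candidate.is_file():
            return candidate
    # Synthetic fixtures keep events.jsonl inside the run directory itself.
    colocated = run_dir / "events.jsonl"
    return colocated if colocated.is_file() else None
```

With a layout of `<workspace>/runs/<run_id>/` and `<workspace>/logs/<run_id>/events.jsonl`, the match is found two levels up, which lines up with the `../../logs/<run_id>/` note in the on-disk tree.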
run_completion.json.final_budget uses nested dicts on real runs:
{
"loops": {"used": N, "max": M},
"workers": {"spawned": N, "max": M},
"tokens": {"consumed": N, "max": M},
"tool_calls": {"used": N, "max": M},
"wall_time": {"elapsed_s": N, "max_s": M}
}
RefinementLoop._read_seed_context supports both this shape and
the flat legacy {max_loops, max_total_tokens, …} shape used by
unit-test fixtures.
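
Supporting both shapes amounts to a small normalization step. The sketch below is an assumption about how the nested keys map onto the flat names (for example `loops.max` to `max_loops`); the real `_read_seed_context` may map them differently and also reads the consumed values.

```python
def read_budget_caps(final_budget: dict) -> dict:
    """Normalize a nested or flat final_budget into flat cap fields.

    The nested-to-flat key mapping is an illustrative assumption.
    """
    nested_to_flat = {
        "loops": ("max", "max_loops"),
        "workers": ("max", "max_total_workers"),
        "tokens": ("max", "max_total_tokens"),
        "tool_calls": ("max", "max_tool_calls"),
        "wall_time": ("max_s", "max_wall_time"),
    }
    caps = {}
    for key, (inner, flat_name) in nested_to_flat.items():
        section = final_budget.get(key)
        if isinstance(section, dict) and inner in section:
            caps[flat_name] = section[inner]          # nested shape (real runs)
        elif flat_name in final_budget:
            caps[flat_name] = final_budget[flat_name]  # flat legacy shape (fixtures)
    return caps
```

Fields missing from both shapes are simply absent from the result, leaving the caller to decide on defaults.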
Refinement can run each iteration with a different {manager, worker}
model pair — a coarse-to-fine annealing schedule across N iterations.
The schedule is a pure function of (k, N):
- `N == 1` → high
- `N == 2` → [low, high]
- `N ≥ 3` → thirds-proportional: `low_end = ceil(N / 3)`, `high_start = N − floor(N / 3) + 1`
Full mapping table (N ∈ [1, 10]):
| N | Mapping | low | mid | high |
|---|---|---|---|---|
| 1 | H | 0 | 0 | 1 |
| 2 | L, H | 1 | 0 | 1 |
| 3 | L, M, H | 1 | 1 | 1 |
| 4 | L, L, M, H | 2 | 1 | 1 |
| 5 | L, L, M, M, H | 2 | 2 | 1 |
| 6 | L, L, M, M, H, H | 2 | 2 | 2 |
| 7 | L, L, L, M, M, H, H | 3 | 2 | 2 |
| 8 | L, L, L, M, M, M, H, H | 3 | 3 | 2 |
| 9 | L, L, L, M, M, M, H, H, H | 3 | 3 | 3 |
| 10 | L, L, L, L, M, M, M, H, H, H | 4 | 3 | 3 |
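
The schedule rules reproduce the table above when written out as code. This is a sketch from the stated formulas, not the contents of `tiers.py`; the function names are hypothetical, and iterations with `low_end < k < high_start` are taken to be mid.

```python
def tier_for_iteration(k: int, n: int) -> str:
    """Map iteration k (1-based) of n to "low", "mid", or "high"."""
    if not 1 <= k <= n:
        raise ValueError(f"k must be in [1, {n}], got {k}")
    if n == 1:
        return "high"
    if n == 2:
        return "low" if k == 1 else "high"
    low_end = (n + 2) // 3        # integer form of ceil(n / 3)
    high_start = n - n // 3 + 1   # n - floor(n / 3) + 1
    if k <= low_end:
        return "low"
    if k >= high_start:
        return "high"
    return "mid"


def tier_schedule(n: int) -> list[str]:
    """Full schedule for n iterations."""
    return [tier_for_iteration(k, n) for k in range(1, n + 1)]
```

For example, `tier_schedule(4)` gives `["low", "low", "mid", "high"]`, matching the N = 4 row.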
Why low → high and not the reverse: residual loss shrinks per iteration, so precision matters most late. Weak early models produce the defects the gradient extractor (§1.2) needs; a strong early model may silently paper over them and starve the loop of signal. Budget halving cooperates — the weakest tier has the largest headroom; the strongest operates on the smallest residual.
Fallback semantics (empty tier field). For a tier T resolving
at iteration k:
resolved_manager = T.manager if T.manager else seed.manager
resolved_worker = T.worker if T.worker else seed.worker
Empty string and None are treated identically. All three tiers empty
→ tiering is a no-op (TierPlan.is_trivial() == True); every
iteration resolves to the seed pair, and the loop behaves identically
to the legacy single-model path.
Invariants (enforced by unit tests in
packages/awp-runtime/tests/refinement/test_tiers.py):
- For `N ≥ 3`, every tier has ≥ 1 iteration.
- The mapping is monotonic: once a later-tier iteration has fired, no earlier-tier iteration follows.
- `count(low) ≥ count(mid) ≥ count(high)` for every `N ≥ 3`.
API body shape (POST /api/experiments/<run_id>/refine):
{
"iterations": 3,
"tier_low": {"manager": "…", "worker": "…"},
"tier_mid": {"manager": "…", "worker": "…"},
"tier_high": {"manager": "…", "worker": "…"}
}

When any tier_* is present, the legacy model / worker_model
fields are ignored and the seed run's parsed models become the
fallback baseline. Mixed bodies produce a
refinement.mixed_body: tier_* set; ignoring legacy model/worker_model
warning but still return 202.
CLI:
awp refine <seed_run_dir> \
--tier-low "deepseek/deepseek-chat-v3.1:deepseek/deepseek-chat-v3.1" \
--tier-mid "openai/gpt-5-mini:deepseek/deepseek-chat-v3.1" \
--tier-high "anthropic/claude-opus-4:anthropic/claude-sonnet-4"
Each flag takes a manager:worker pair (split on the first colon;
either side may be empty → falls back to seed).
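
Parsing a `manager:worker` flag and applying the seed fallback can be sketched as follows. The helper names are hypothetical; the split-on-first-colon rule and the empty-means-seed rule come from the text above.

```python
from typing import Optional


def parse_tier_pair(value: str) -> tuple[str, str]:
    """Split "manager:worker" on the first colon; a missing side becomes ""."""
    manager, _, worker = value.partition(":")
    return manager, worker


def resolve_pair(
    tier: Optional[tuple[Optional[str], Optional[str]]],
    seed: tuple[str, str],
) -> tuple[str, str]:
    """Fill empty tier fields from the seed pair; "" and None behave the same."""
    tier_manager, tier_worker = tier if tier else (None, None)
    manager = tier_manager if tier_manager else seed[0]
    worker = tier_worker if tier_worker else seed[1]
    return manager, worker
```

A flag value such as `":some/worker-model"` therefore keeps the seed's manager and overrides only the worker.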
Observability — session sidecars gain three per-iteration fields
(tier, model_manager, model_worker) and one session-level flag
(tier_plan_used). Legacy sidecars without these keys still render
correctly; consumers must treat them as optional.
Store + UI — the workflow store carries defaults in
refinement_tier_low/mid/high; the Settings panel's Optimizers →
"Model Tiers" block edits those defaults; the RefineModal toggle
"Use tiered models across iterations" opts a single session in and
pre-fills the three pairs from the store.
Out of scope for this mechanism: the outer loop
(awp optimize) is not tiered; its manager_model /
TextGrad-optimizer-model remains a single value. Revisiting this is
deferred until refinement tiers have production runs behind them.
| Purpose | Path |
|---|---|
| Orchestrator | packages/awp-runtime/src/awp/refinement/loop.py |
| Gradient extractor | packages/awp-runtime/src/awp/refinement/gradient.py |
| Workspace seeding | packages/awp-runtime/src/awp/refinement/seed.py |
| Budget scaling | packages/awp-runtime/src/awp/refinement/budget.py |
| Session + BEST writers | packages/awp-runtime/src/awp/refinement/session.py |
| Model tiering (`TierPlan`) | packages/awp-runtime/src/awp/refinement/tiers.py |
| CLI | packages/awp-core/src/awp/cli.py::cmd_refine |
| Backend API | packages/awp-ui/server/api/routes.py (POST /api/experiments/<run_id>/refine, GET /api/experiments/<run_id>/refinement_sessions) |
| Frontend | packages/awp-ui/frontend/src/components/Refinement/ (RefineModal, RefinementSessionsList) + hook on RunHistory |
| E2E | packages/awp-runtime/tests/e2e/test_e2e_refinement.py |
| Unit tests | packages/awp-runtime/tests/refinement/ |
| Normative rule | spec/versions/1.0/validation-rules.md § R36 |
- CLAUDE.md §4 (Loss & Backprop: the ML Lens): the ML framing refinement implements at the task level.
- docs/outer-loop.md: the θ-axis sibling.
- spec/versions/1.0/validation-rules.md § R36: normative R36 text.
When awp refine --target <experiment_id>:<task_id> is invoked, the
CLI loads the task's BEST run (the run marked as best for that task in
the experiment hierarchy — see spec/versions/1.0/experiment-task-hierarchy-design.md)
and launches the refinement loop as described in §6. Refinement
sessions are persisted under <experiment>/tasks/<task_id>/refinements/
as session_<timestamp>.json, with one session.json file serving as
the latest-session pointer. Winning iterations are hard-linked into
<experiment>/tasks/<task_id>/BEST/, overwriting the prior best only
if loss is strictly lower (see packages/awp-runtime/src/awp/refinement/session.py).
The target-attached mode allows a refinement loop to be tied to a
specific task within an experiment hierarchy, enabling downstream
continuation tasks to read the refined output as their --primary
input (see spec/versions/1.0/experiment-task-hierarchy-design.md §R37).
Implementation lives in packages/awp-core/src/awp/experiment/cli_handlers.py::refine_task_aware.
- docs/continuation.md: y-axis carry-over across tasks (user-feedback gradient). Related but distinct from refinement's auto-gradient within one run.
- docs/outer-loop.md: θ-axis optimisation. Now per-experiment (decision β).
- packages/awp-runtime/src/awp/refinement/loop.py: `RefinementLoop` implementation.
- After Plan 4: use `awp refine --target <exp>:<task>` to attach sessions under `<task>/refinements/session_<ts>/`.