Status: implemented in awp-runtime ≥ 1.0.40
Source: packages/awp-runtime/src/awp/runtime/delegation_loop_runner.py, packages/awp-runtime/src/awp/runtime/dynamic_tool_factory.py
See also — Parent: docs/README.md, ORCHESTRATION_ENGINES.md · Inner-loop mechanisms: manager-intelligence.md (strategy + decision journal), critique.md (per-worker repair inside the loop), runtime-tool-generation.md (capability accumulation) · Bounded by: runtime.md (budget envelope, completion gate chain) · Contrast with outer axes: outer-loop.md (moves θ — prompt artifacts), refinement.md (moves y — a seed run's deliverable) — iterative optimization here is the inside-one-run feedback loop · Autonomy mapping: compliance.md
Most multi-agent workflows run once: define a pipeline, feed in data, get a result. This works for well-understood problems where you know the steps in advance. But for complex, open-ended problems — research synthesis, multi-constraint optimization, deep data analysis — a single pass is rarely sufficient. The right approach depends on what you learn along the way.
AWP's delegation loop implements iterative optimization: a closed feedback loop where the manager observes progress, adjusts strategy, delegates refined subtasks, and accumulates capabilities (tools and skills) across iterations. Each round feeds what worked — and what didn't — back into the next. The system grows its own capabilities as it works.
This is not gradient descent. There is no mathematical gradient, no differentiable loss surface, no convergence guarantee. What AWP does is closer to how a skilled human iterates: try an approach, observe the results, learn from them, build better tools, refine the strategy, and repeat — with hard budget limits that guarantee termination.
This document describes the four mechanisms that make iterative optimization work: the feedback loop, capability accumulation, stall detection, and budget enforcement.
The delegation loop's feedback mechanism is the rolling summary — a structured record of what happened in previous iterations that feeds into the manager's next decision.
The rolling summary (ROLLING_SUMMARY.md) is rebuilt after every iteration by update_rolling_summary() and contains:
| Section | Content |
|---|---|
| Progress | Current iteration number and confidence score |
| Confidence Trend | Last 6 iterations shown as Iter N: 0.XX → Iter N+1: 0.YY → ... |
| Recent Iterations (Detail) | Last window iterations (default 3) with full confidence and key findings |
| Older Iterations (Summary) | Earlier iterations condensed to one line each: iteration number + confidence |
The window size is configured via config.history.full_results_window (default: 3). Recent iterations are shown in reverse chronological order (newest first) so the manager sees the most relevant information first.
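The layout in the table above can be sketched as follows. This is a minimal illustration, not the actual update_rolling_summary() implementation: the IterationRecord type, the render_rolling_summary() name, and the exact heading text are assumptions made for the example. Only the window default (3), the six-entry trend, and the newest-first ordering come from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class IterationRecord:
    """Hypothetical per-iteration record; the real runtime's type may differ."""
    iteration: int
    confidence: float
    key_findings: list[str] = field(default_factory=list)


def render_rolling_summary(history: list[IterationRecord],
                           window: int = 3, trend_len: int = 6) -> str:
    """Render a rolling summary: progress, trend, recent detail, older one-liners."""
    if not history:
        return "## Progress\nNo iterations recorded yet.\n"
    latest = history[-1]
    out = [
        "## Progress",
        f"Iteration {latest.iteration}, confidence {latest.confidence:.2f}",
        "",
        "## Confidence Trend",
        " → ".join(f"Iter {r.iteration}: {r.confidence:.2f}"
                   for r in history[-trend_len:]),
        "",
        "## Recent Iterations (Detail)",
    ]
    # Newest first, so the most relevant iteration is read first.
    for rec in reversed(history[-window:]):
        out.append(f"### Iteration {rec.iteration} (confidence {rec.confidence:.2f})")
        out.extend(f"- {finding}" for finding in rec.key_findings)
    older = history[:-window]
    if older:
        out += ["", "## Older Iterations (Summary)"]
        out.extend(f"- Iteration {rec.iteration}: confidence {rec.confidence:.2f}"
                   for rec in older)
    return "\n".join(out) + "\n"
```

With five recorded iterations and the default window, iterations 5, 4, and 3 appear in full detail (in that order) and iterations 1 and 2 collapse to one line each.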
The manager can choose among five decision types, not just delegate or complete. For example, PLAN creates an explicit task graph before delegating, and DIAGNOSE generates hypotheses about why progress stalled and runs lightweight diagnostic workers instead of retrying blindly. See Manager Intelligence for details on these decision types.
The rolling summary is deliberately limited to confidence scores and key findings. It does not include:
- Raw worker outputs (too large for context windows)
- Tool creation details (tracked separately in the tool registry)
- Skill contents (tracked separately in the skill catalog)
- Budget consumption details (tracked by the budget system)
This separation keeps the manager's context focused. The skill catalog is injected separately into the manager prompt via _build_skill_catalog_section(), listing available skills by name and one-line description.
Each iteration can expand the system's capabilities in two ways: tools and skills. Unlike the rolling summary (which is ephemeral context), accumulated tools and skills persist for the entire run and are available to all subsequent iterations.
When a worker needs a capability that doesn't exist, it generates a tool. The DynamicToolFactory validates and registers it through a multi-stage pipeline:
| Stage | Code | What it checks |
|---|---|---|
| Cache lookup | B5 | SHA-256 hash of FQN + code + schema; if hit, reuse existing tool |
| FQN validation | DT1-DT2 | Fully qualified name format, reserved namespace checks |
| AST validation | DT4 | Syntax check via ast.parse(), namespace-aware import policy (NC1–NC3), dangerous call detection |
| Schema-signature check | B2 | Handler function kwargs must match declared schema parameters |
| Placeholder rejection | DT9 | Detects and rejects base64 PNGs, minimal PDFs, and other dummy outputs |
| Dry-run probe | B3 | Runs tool with synthetic inputs in sandboxed venv (5s timeout) |
If validation fails at any stage, the error is classified as "repairable" or "terminal." Repairable errors trigger an inline LLM repair loop that fixes the code within the same worker iteration — the manager never sees the failure. See Runtime Tool Generation Pipeline for the full pipeline.
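To make the AST validation and schema-signature stages more concrete, here is a simplified sketch built on the standard ast module. It is not the DynamicToolFactory code. The deny lists, the validate_tool_source() name, and the handler entry-point name are assumptions for illustration, and the flat deny list stands in for the namespace-aware import policy, which is more involved.

```python
import ast

# Hypothetical policy values, chosen only for this example.
DENIED_IMPORTS = {"subprocess", "ctypes"}
DENIED_CALLS = {"eval", "exec", "compile", "__import__"}


def validate_tool_source(code: str, schema_params: set[str],
                         handler_name: str = "handler") -> list[str]:
    """Return a list of validation errors; an empty list means the code passed."""
    try:
        tree = ast.parse(code)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg} (line {exc.lineno})"]

    errors: list[str] = []
    handler = None
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in DENIED_IMPORTS:
                    errors.append(f"denied import: {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] in DENIED_IMPORTS:
                errors.append(f"denied import: {node.module}")
        elif isinstance(node, ast.Call):
            if isinstance(node.func, ast.Name) and node.func.id in DENIED_CALLS:
                errors.append(f"dangerous call: {node.func.id}()")
        elif isinstance(node, ast.FunctionDef) and node.name == handler_name:
            handler = node

    if handler is None:
        errors.append(f"missing entry point: {handler_name}()")
    else:
        # Schema-signature check: handler arguments must match the schema.
        declared = {a.arg for a in handler.args.args + handler.args.kwonlyargs}
        if declared != schema_params:
            errors.append(
                f"signature mismatch: handler takes {sorted(declared)}, "
                f"schema declares {sorted(schema_params)}"
            )
    return errors
```

In this sketch a syntax error short-circuits the remaining checks, while import, call, and signature problems are collected together so a repair step could address them in one pass.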
Once registered, a tool is:
- Cached by content hash — the same code is never validated twice
- Available to all subsequent workers in the same run
- Tracked with metrics — attempts, successes, cache hits, validation failures, repair attempts/successes
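The content-hash cache can be illustrated with a short sketch. The source states only that the key is a SHA-256 hash of FQN, code, and schema. How the three parts are combined is not specified, so the length-prefixed encoding, the canonical JSON serialization, and the ToolCache class below are assumptions.

```python
import hashlib
import json
from typing import Any, Callable


def tool_cache_key(fqn: str, code: str, schema: dict[str, Any]) -> str:
    """SHA-256 over FQN + code + schema (the combining scheme is an assumption)."""
    digest = hashlib.sha256()
    canonical_schema = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    for part in (fqn, code, canonical_schema):
        data = part.encode("utf-8")
        # Length prefix keeps ("ab", "c") and ("a", "bc") from colliding.
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()


class ToolCache:
    """Hypothetical cache: validate once per unique (fqn, code, schema)."""

    def __init__(self) -> None:
        self._tools: dict[str, Any] = {}
        self.hits = 0
        self.misses = 0

    def get_or_register(self, fqn: str, code: str, schema: dict[str, Any],
                        validate: Callable[[str, str, dict[str, Any]], Any]) -> Any:
        key = tool_cache_key(fqn, code, schema)
        if key in self._tools:
            self.hits += 1
            return self._tools[key]
        self.misses += 1
        tool = validate(fqn, code, schema)  # expensive pipeline runs only on a miss
        self._tools[key] = tool
        return tool
```

Serializing the schema with sorted keys means two schemas that differ only in key order map to the same cache entry.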
Skills are reusable domain knowledge saved as Markdown files. Unlike tools (executable code), skills are structured knowledge: approaches, patterns, procedures, and findings that help future workers avoid re-deriving what earlier workers already learned.
How skills flow through the system:
- Manager creates skills: When delegating, the manager can include skills in a worker's DelegationEnvelope. These can be inline markdown or references to previously persisted skills.
- Workers produce skills: Worker results can include a skills_created array. Each skill with ≥30 words is automatically persisted to workspace/skills/{name}.md by _persist_worker_result_skills().
- Skills are cataloged: _load_skill_catalog() reads all *.md files from the skills directory, extracts a one-line description from the ## Purpose heading (or first non-heading line), and builds a name→description index.
- Catalog injected into manager prompt: _build_skill_catalog_section() renders the catalog as a markdown section listing available skills by name. The manager can reference skills by short name in future delegation envelopes; the runtime resolves them to full content via _resolve_skills().
- Skills can be updated: If a worker produces a skill with the same # Skill: Name heading as an existing one, it overwrites the previous version (latest wins).
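The persist-and-catalog steps above can be sketched in a few lines. The function names here (persist_skill, describe_skill, load_skill_catalog) are hypothetical stand-ins, not the runtime's private helpers. The 30-word threshold, the ## Purpose lookup with a first-non-heading-line fallback, and the latest-wins overwrite are taken from the list above. Keying the file by a caller-supplied name is a simplification.

```python
from pathlib import Path


def persist_skill(skills_dir: Path, name: str, content: str,
                  min_words: int = 30) -> bool:
    """Write a skill to {skills_dir}/{name}.md if it is long enough."""
    if len(content.split()) < min_words:
        return False
    skills_dir.mkdir(parents=True, exist_ok=True)
    # Writing to the same path overwrites the earlier version (latest wins).
    (skills_dir / f"{name}.md").write_text(content, encoding="utf-8")
    return True


def describe_skill(markdown: str) -> str:
    """One-line description: first line under '## Purpose', else first body line."""
    lines = [line.strip() for line in markdown.splitlines()]
    for index, line in enumerate(lines):
        if line.lower() == "## purpose":
            for body in lines[index + 1:]:
                if body.startswith("#"):
                    break  # reached the next heading without finding text
                if body:
                    return body
    for line in lines:
        if line and not line.startswith("#"):
            return line
    return ""


def load_skill_catalog(skills_dir: Path) -> dict[str, str]:
    """Build a name -> description index from every *.md file in the directory."""
    return {
        path.stem: describe_skill(path.read_text(encoding="utf-8"))
        for path in sorted(skills_dir.glob("*.md"))
    }
```

The resulting dictionary is the kind of name-to-description index that a prompt section could render, one skill per line.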
The compounding effect: Early iterations establish foundational skills. Later iterations reference them by name instead of re-deriving the knowledge. This reduces redundant work and lets the system build on itself — the same pattern as a human team building institutional knowledge over the course of a project.
Iterative loops can get stuck. The manager might delegate the same kind of work repeatedly, workers might produce near-identical results, or confidence might oscillate without improving. AWP uses a StallDetector with two independent signal channels and a strategy-switching recovery mechanism.
| Channel | Signal | Threshold | What it detects |
|---|---|---|---|
| Confidence delta | abs(recent[-1] - recent[0]) over sliding window | < min_confidence_delta (default 0.05) | Progress has stopped: confidence is flat |
| Output similarity | SequenceMatcher ratio on first 2000 chars of worker output | > 0.85 | Workers are producing the same thing: no new information |
Additionally, an oscillation detector triggers when confidence variance is < 0.01 and mean confidence is < 0.7 over an extended window — catching loops that bounce between low values without making progress.
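The three checks described above reduce to small pure functions. The sketch below uses the thresholds stated in this section (0.05, 0.85, 2000 characters, variance 0.01, mean 0.7). The function names, the window sizes, and the choice of population variance are assumptions, since the source does not specify them.

```python
from difflib import SequenceMatcher
from statistics import mean, pvariance


def confidence_stalled(confidences: list[float], window: int = 3,
                       min_confidence_delta: float = 0.05) -> bool:
    """Channel 1: confidence barely moved across the sliding window."""
    if len(confidences) < window:
        return False  # insufficient history
    recent = confidences[-window:]
    return abs(recent[-1] - recent[0]) < min_confidence_delta


def outputs_stalled(previous: str, current: str, threshold: float = 0.85,
                    max_chars: int = 2000) -> bool:
    """Channel 2: consecutive worker outputs are nearly identical."""
    ratio = SequenceMatcher(None, previous[:max_chars], current[:max_chars]).ratio()
    return ratio > threshold


def oscillating(confidences: list[float], window: int = 6,
                max_variance: float = 0.01, max_mean: float = 0.7) -> bool:
    """Extra check: low, low-variance confidence over a longer window."""
    if len(confidences) < window:
        return False
    recent = confidences[-window:]
    return pvariance(recent) < max_variance and mean(recent) < max_mean
```

A run whose confidence alternates between 0.30 and 0.35 trips the oscillation check, because both its variance and its mean are low, while a run alternating between 0.1 and 0.9 does not.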
The record() method returns one of four signals:
| Signal | Condition | Effect |
|---|---|---|
| ok | Neither channel stalled, or insufficient history (< window iterations) | Continue normally |
| warn | One channel stalled | Log warning, increment warning counter |
| switch_strategy | Stall detected AND unused strategies remain in the pool | Rotate to next meta-strategy |
| stop | Both channels stalled AND all strategies exhausted (or strategy switching disabled) | Terminate loop |
Both channels must agree before the loop stops. A single channel stalling triggers a warning first, giving the system a chance to recover. This prevents premature termination from temporary plateaus.
When stall is detected, the manager rotates through configured meta-strategies before stopping. The default strategy pool (configured in StallDetectionConfig.strategy_switching):
- decompose_finer: Break the current subtask into smaller pieces
- simplify: Reduce scope or constraints
- reframe: Approach the problem from a different angle
- escalate: Flag for human intervention or higher-level manager
Each strategy switch resets the warning counter, giving the new strategy a fresh window to make progress. Only after all strategies are exhausted does the stall detector return stop.
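The signal logic and strategy rotation can be modeled as a small state machine. This sketch is not the StallDetector class. StrategyRotator and its decide() method are hypothetical, and it assumes that "stall detected" for switch_strategy means both channels have stalled, which the source does not state explicitly.

```python
from collections import deque

DEFAULT_STRATEGIES = ("decompose_finer", "simplify", "reframe", "escalate")


class StrategyRotator:
    """Hypothetical model of the ok / warn / switch_strategy / stop decision."""

    def __init__(self, strategies=DEFAULT_STRATEGIES, enabled: bool = True) -> None:
        self._unused = deque(strategies)
        self.enabled = enabled
        self.current = None
        self.warnings = 0

    def decide(self, confidence_stalled: bool, output_stalled: bool) -> str:
        stalled_channels = int(confidence_stalled) + int(output_stalled)
        if stalled_channels == 0:
            return "ok"
        if stalled_channels == 1:
            self.warnings += 1
            return "warn"
        # Both channels agree (assumed trigger for a strategy switch).
        if self.enabled and self._unused:
            self.current = self._unused.popleft()
            self.warnings = 0  # fresh window for the new strategy
            return "switch_strategy"
        return "stop"
```

With the default pool, four consecutive full stalls each yield switch_strategy and the fifth yields stop. With switching disabled, the first full stall stops the loop.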
Separate from stall detection, _check_convergence() ends the loop when progress has plateaued. It uses two heuristics:
| Heuristic | Condition | Interpretation |
|---|---|---|
| (a) Confidence plateau | abs(last - prev) < 0.05 AND last < 0.95 | Confidence has stopped improving but hasn't reached high quality |
| (b) Identical findings | Last 3 iterations all DELEGATE with identical key_findings tuple | The loop is producing the same analysis repeatedly |
Both heuristics are gated by a minimum iteration floor: max(5, pending_subtask_count + 3). This prevents false convergence during multi-phase tasks where confidence legitimately plateaus between phases while subtasks are still pending.
When convergence fires, the loop returns a partial_complete result rather than complete — signaling that the result is usable but the system stopped before reaching full confidence.
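The two heuristics and the iteration floor can be written down directly. The sketch below follows the conditions in the table. The IterationOutcome type, the return values, and the assumption that the floor is compared against the number of completed iterations are illustrative choices, not the actual _check_convergence() signature.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IterationOutcome:
    """Hypothetical per-iteration record used only for this example."""
    decision: str
    confidence: float
    key_findings: tuple[str, ...]


def check_convergence(history: list[IterationOutcome],
                      pending_subtask_count: int) -> Optional[str]:
    """Return the name of the heuristic that fired, or None to keep looping."""
    floor = max(5, pending_subtask_count + 3)
    if len(history) < floor:
        return None  # too early: plateaus between phases are expected

    last, prev = history[-1], history[-2]
    # (a) Confidence plateau below the high-quality ceiling.
    if abs(last.confidence - prev.confidence) < 0.05 and last.confidence < 0.95:
        return "confidence_plateau"

    # (b) Three DELEGATE iterations in a row with identical findings.
    tail = history[-3:]
    if (all(outcome.decision == "DELEGATE" for outcome in tail)
            and len({outcome.key_findings for outcome in tail}) == 1):
        return "identical_findings"
    return None
```

Note how the floor interacts with pending work: with four subtasks still pending, the floor is 7, so a five-iteration history never triggers convergence regardless of how flat confidence looks.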
Every iterative loop must terminate. AWP enforces this with a multi-dimensional budget system where no single limit can be circumvented — the manager cannot override the safety envelope.
| Field | Default | What it limits |
|---|---|---|
| max_loops | 100 | Total manager iterations |
| max_total_workers | 500 | Total worker spawns across all iterations |
| max_total_tokens | 10,000,000 | Total LLM tokens consumed (input + output) |
| max_wall_time | 3600 (1 hour) | Wall clock time in seconds |
| max_tool_calls | 1500 | Total tool invocations across all workers |
| max_depth | 4 | Recursion depth for sub-manager delegation |
The loop's can_continue() check evaluates all dimensions on every iteration. If any single limit is breached, the loop terminates with a graceful partial result — it does not crash or hang.
The BudgetSnapshot class tracks consumption in real time:
- Loops: incremented on each manager iteration
- Workers: incremented on each worker spawn
- Tokens: accumulated from LLM response metadata
- Wall time: computed from time.monotonic() at loop start
- Tool calls: incremented on each tool invocation
budget_fraction_remaining returns the minimum fraction across all dimensions — this single number tells the manager how much runway is left. The manager receives this in its prompt context and can adjust its strategy accordingly (e.g., switching to synthesis mode when budget is low).
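The minimum-across-dimensions idea is easy to see in code. The sketch below is not the BudgetSnapshot class. BudgetLimits and BudgetSketch are hypothetical names, the methods take an explicit now argument so the example is deterministic, and max_depth is left out because it is a structural limit rather than a consumable quantity (an assumption). The defaults are the ones in the table above.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BudgetLimits:
    max_loops: int = 100
    max_total_workers: int = 500
    max_total_tokens: int = 10_000_000
    max_wall_time: float = 3600.0
    max_tool_calls: int = 1500


@dataclass
class BudgetSketch:
    """Hypothetical consumption tracker with one counter per dimension."""
    limits: BudgetLimits = field(default_factory=BudgetLimits)
    loops: int = 0
    workers: int = 0
    tokens: int = 0
    tool_calls: int = 0
    started_at: float = field(default_factory=time.monotonic)

    def fraction_remaining(self, now: Optional[float] = None) -> float:
        """Smallest remaining fraction across all dimensions, clamped at 0."""
        now = time.monotonic() if now is None else now
        used = (
            self.loops / self.limits.max_loops,
            self.workers / self.limits.max_total_workers,
            self.tokens / self.limits.max_total_tokens,
            (now - self.started_at) / self.limits.max_wall_time,
            self.tool_calls / self.limits.max_tool_calls,
        )
        return max(0.0, min(1.0 - fraction for fraction in used))

    def can_continue(self, now: Optional[float] = None) -> bool:
        """False as soon as any single dimension is exhausted."""
        return self.fraction_remaining(now) > 0.0
```

For example, a run that has used half its loops but 90% of its tokens reports roughly 0.1 remaining: the scarcest dimension determines the runway.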
When the manager promotes a worker to a sub-manager (recursion), the parent pre-reserves budget for the child via allocate_child(fraction=0.3). This:
- Prevents the child from consuming the parent's entire budget
- Limits recursion depth via max_depth
- Caps concurrent sub-managers via max_concurrent_submanagers (default 3)
- Caps total sub-managers per run via max_total_submanagers_per_run (default 6)
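A minimal sketch of the pre-reservation idea follows. Only the allocate_child(fraction=0.3) call and the max_depth default of 4 come from the text above. The ReservableBudget class, its fields, and the choice to deduct the child's share from the parent up front are assumptions. The concurrent and per-run sub-manager caps are not modeled.

```python
from dataclasses import dataclass


@dataclass
class ReservableBudget:
    """Hypothetical budget that carves out a fixed share for each child."""
    tokens: int
    workers: int
    tool_calls: int
    depth: int = 0
    max_depth: int = 4

    def allocate_child(self, fraction: float = 0.3) -> "ReservableBudget":
        if not 0.0 < fraction < 1.0:
            raise ValueError("fraction must be strictly between 0 and 1")
        if self.depth + 1 > self.max_depth:
            raise RuntimeError("max_depth reached: cannot promote another sub-manager")
        child = ReservableBudget(
            tokens=round(self.tokens * fraction),
            workers=round(self.workers * fraction),
            tool_calls=round(self.tool_calls * fraction),
            depth=self.depth + 1,
            max_depth=self.max_depth,
        )
        # Reserve up front so the child can never spend the parent's remainder.
        self.tokens -= child.tokens
        self.workers -= child.workers
        self.tool_calls -= child.tool_calls
        return child
```

Because each child receives a fraction of what its parent still holds, budgets shrink geometrically with depth, and the depth check stops the chain even if some budget remains.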
A typical iterative optimization run looks like this:
Iteration 1: Manager receives the task with no history. Delegates initial subtasks to workers. Workers produce results with low confidence (0.15–0.25) and possibly create first tools. Skills are empty.
Iteration 2–3: Manager reads rolling summary showing low confidence. May switch to PLAN mode to decompose the problem. Workers use tools from iteration 1. First skills are created and persisted. Confidence climbs (0.30–0.50), possibly with dips as new approaches are tried.
Iteration 4–6: Skill catalog grows. Workers reference earlier skills by name instead of re-deriving knowledge. New tools are cache-hit (not rebuilt). Manager adjusts strategy based on what's working. Confidence reaches 0.50–0.75.
Iteration 7+: Capabilities compound. Stall detector may fire if progress plateaus — manager switches strategy (decompose finer, simplify, reframe). If confidence reaches 0.95+, manager may decide COMPLETE. If budget runs low, manager switches to synthesis mode.
Termination: The loop ends when the manager decides COMPLETE, when budget is exhausted, when convergence is detected, or when stall detection stops the loop after all strategies are exhausted. The result includes all accumulated artifacts, tools, and skills.
Key property: Progress is not guaranteed to be monotonic. Confidence can dip when the manager tries a new approach or when workers explore a harder subtask. This is expected and healthy — the stall detector only fires when confidence is flat and output similarity is high (i.e., the system is stuck, not exploring).
- Runtime Tool Generation Pipeline — Full pipeline for dynamic tool creation (B1–B6, AST validation, repair loop)
- Manager Intelligence — Task decomposition, hypothesis-driven debugging, strategy switching, decision journal
- Critique Loop — Reflective analysis of worker outputs (defect detection, targeted repair)
- Orchestration — Delegation loop configuration, budget fields, stall detection config
- Evaluation — Quality scoring and threshold-based retry/repair