
feat: retry Iron Loop executor on API overload (529) with configurable backoff#7

Open
davidbijl wants to merge 1 commit into robotijn:main from davidbijl:feat/overload-retry-529


@davidbijl

Fixes #6.

Summary

  • Layer 1 — executor agent (agents/iron-loop/iron-loop-executor.md): new API Overload (529) Handling section instructs the executor to distinguish pre-write overloads (safe to retry) from mid-write overloads (human gate required) and write the appropriate .status file.
  • Layer 2 — state layer (background.js, actions.js, state.js): overload-retry and overload-partial are added to the status enum. cleanupStaleInProgress skips plans in either overload state. startAgent resumes an in-progress overload-retry plan instead of picking a new todo plan, and blocks with a human-gate error when an overload-partial plan exists. getAgentStatus surfaces both overload states when no lock is held.
  • Layer 3 — dashboard (menu-screens.js): AGENT section shows ⏳ retry in Xm — <plan> for scheduled retries and ⚠ partial write — review: <plan> for mid-write overloads.
  • Config (settings.js): new retry category with overloadIntervalSeconds (default 600 s / 10 min).
  • Tests (tests/overload-retry.test.js): 9 unit tests covering all three layers (icon enum, writeStatus fields, cleanup skip logic, startAgent resume/block paths, dashboard labels, config schema).
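
The Layer 2 behaviour described above can be sketched roughly as follows. This is an illustration, not the code in this PR: `decideStart`, `isCleanupCandidate`, and the plan object shape are hypothetical names chosen for the example, and checking for a partial write before a retry is an assumption based on the "human gate" wording. The real cleanup presumably applies staleness criteria as well; only the skip rule is modelled here.

```javascript
// Hypothetical sketch: function names and data shapes are assumptions.
const OVERLOAD_RETRY = 'overload-retry';
const OVERLOAD_PARTIAL = 'overload-partial';

/**
 * Decide what a start request should do.
 * @param {{name: string, status: string}[]} inProgress plans under plans/in-progress/
 * @param {string[]} todo names of plans waiting to be picked up
 */
function decideStart(inProgress, todo) {
  // A mid-write overload needs a human to inspect the partial write first.
  const partial = inProgress.find((p) => p.status === OVERLOAD_PARTIAL);
  if (partial) {
    return { action: 'blocked', plan: partial.name };
  }
  // A pre-write overload is safe to resume before any new todo plan is picked.
  const retry = inProgress.find((p) => p.status === OVERLOAD_RETRY);
  if (retry) {
    return { action: 'resume', plan: retry.name };
  }
  if (todo.length > 0) {
    return { action: 'start', plan: todo[0] };
  }
  return { action: 'idle', plan: null };
}

/** Cleanup leaves overload plans alone so they can be resumed or reviewed. */
function isCleanupCandidate(plan) {
  return plan.status !== OVERLOAD_RETRY && plan.status !== OVERLOAD_PARTIAL;
}
```

With this shape, a plan in overload-retry takes precedence over the todo queue, and any overload-partial plan stops the agent from starting at all until someone has reviewed it.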

Answers to the open questions in #6

  1. Preferred layer for retry logic: the executor agent writes the status (Layer 1) and exits; the state layer drives resume/block on the next startAgent call (Layer 2). No ScheduleWakeup dependency — the operator restarts via the menu when ready, or the executor can call ScheduleWakeup if it's available in its context (the agent instructions mention it as optional).
  2. Step-level resume vs full restart: full restart from the beginning of the current plan. The plan's completed [x] checkboxes are on disk, so the executor can fast-forward past already-done steps. No separate step-marker mechanism is needed for a first pass.
  3. ScheduleWakeup availability: treated as optional in the agent instructions — if available, use it; if not, exit cleanly. The dashboard indicator and the menu's Start Agent button serve as the manual resume path.
  4. Scope: all three layers are included, but the changes are minimal and additive — no existing behaviour is modified except cleanupStaleInProgress (now skips overload plans) and startAgent (now checks in-progress before picking todo).
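
To illustrate answer 2, here is a minimal sketch of how a restarted run could locate the first unfinished step from the checkboxes already on disk. It assumes plan steps are written as Markdown task-list items (`- [x]` / `- [ ]`); `firstPendingStep` is a hypothetical helper, not something this PR adds, since the executor agent does this by reading the plan itself.

```javascript
// Hypothetical sketch: assumes plan steps are Markdown task-list items.
function firstPendingStep(planText) {
  const steps = [];
  for (const line of planText.split(/\r?\n/)) {
    // Matches "- [x] text" or "* [ ] text", with optional indentation.
    const m = /^\s*[-*]\s+\[( |x|X)\]\s+(.*)$/.exec(line);
    if (m) {
      steps.push({ done: m[1] !== ' ', text: m[2].trim() });
    }
  }
  const index = steps.findIndex((s) => !s.done);
  // null means every step is already checked off.
  return index === -1 ? null : { index, text: steps[index].text };
}

const plan = [
  '# Example plan',
  '- [x] Write failing test',
  '- [x] Implement handler',
  '- [ ] Update docs',
  '- [ ] Run full suite',
].join('\n');

console.log(firstPendingStep(plan)); // { index: 2, text: 'Update docs' }
```

Because progress is derived from the plan file alone, a full restart and a step-level resume end up at the same step as long as checkboxes are ticked only after a step has completed.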

Test plan

  • Run node --test tests/overload-retry.test.js — 9 tests, 0 failures
  • Run node --test tests/*.test.js — existing suite passes (the 1 pre-existing failure in update.test.js is unrelated to this PR and was failing on main before these changes)
  • Manually: simulate an overload-retry status file in a plan under plans/in-progress/, open /ctoc:menu, confirm the dashboard AGENT section shows ⏳ retry in Xm
  • Manually: simulate an overload-partial status file, confirm ⚠ partial write — review
  • Manually: click Start Agent with an overload-retry plan present, confirm the executor resumes that plan rather than picking a new todo plan
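
For the manual checks, the sketch below shows one way a scheduled retry and its dashboard label could be derived. It assumes the status record carries an ISO 8601 `retry_at` field next to `status`, and that the remaining time is rounded up to whole minutes; the real file layout and rounding may differ. `scheduleRetry` and `retryLabel` are hypothetical names used only for this example.

```javascript
// Hypothetical sketch: field layout and rounding are assumptions.
const DEFAULT_OVERLOAD_INTERVAL_SECONDS = 600; // default stated in this PR (10 min)

/** Build a status record for a retry scheduled intervalSeconds from nowMs. */
function scheduleRetry(nowMs, intervalSeconds = DEFAULT_OVERLOAD_INTERVAL_SECONDS) {
  return {
    status: 'overload-retry',
    retry_at: new Date(nowMs + intervalSeconds * 1000).toISOString(),
  };
}

/** Render the "retry in Xm" label, rounding the remaining time up to minutes. */
function retryLabel(statusRecord, nowMs) {
  const remainingMs = Date.parse(statusRecord.retry_at) - nowMs;
  const minutes = Math.max(0, Math.ceil(remainingMs / 60000));
  return `⏳ retry in ${minutes}m`;
}

const now = Date.UTC(2025, 0, 1, 12, 0, 0);
const record = scheduleRetry(now);
console.log(record.retry_at);         // 2025-01-01T12:10:00.000Z
console.log(retryLabel(record, now)); // ⏳ retry in 10m
```

Clamping at zero keeps the label from showing a negative value once the retry time has passed and the operator has not yet restarted the agent.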

🤖 Generated with Claude Code

…e backoff

Implements three-layer recovery for HTTP 529 (API overloaded) errors during
Iron Loop executor runs, resolving issue robotijn#6.

Layer 1 — iron-loop-executor.md: adds explicit instructions for the executor
agent to distinguish pre-write (safe to retry) from mid-write (human review
required) overload events and write the appropriate status to the plan's
.status file.

Layer 2 — state layer:
- background.js: adds overload-retry and overload-partial to the status enum,
  preserves retry_at timestamp in writeStatus, adds markOverloadRetry() and
  markOverloadPartial() helpers.
- actions.js: cleanupStaleInProgress now skips overload plans; startAgent
  resumes an overload-retry plan in-progress instead of picking a new todo
  plan, and blocks with a human-gate error when an overload-partial plan exists.
- state.js: getAgentStatus surfaces overload-retry / overload-partial from
  in-progress plan status files when no lock is held.

Layer 3 — menu-screens.js: dashboard AGENT section shows ⏳ retry in Xm for
scheduled retries and ⚠ partial write — review for mid-write overloads.

Config — settings.js: adds retry.overloadIntervalSeconds (default 600s / 10 min).

Tests — tests/overload-retry.test.js: 9 unit tests covering all three layers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


Development

Successfully merging this pull request may close these issues.

feat: Retry Iron Loop executor on API overload (529) with configurable backoff
