feat: automatic ACTOR retry on Monitor failure #92
Conversation
When Monitor returns valid=false, the orchestrator now has a proper state transition back to ACTOR phase. This fixes the deadlock where Monitor detects issues but workflow-gate blocks edits during MONITOR.

- Add monitor_failed() for sequential execution mode
- Add wave_monitor_failed() for parallel/wave execution mode
- Both persist feedback to .map/<branch>/monitor_feedback*.md
- Both enforce max_retries with escalation status
- Register as CLI commands: monitor_failed, wave_monitor_failed
- Update map-efficient.md to use the orchestrator instead of pseudocode
- Add DevOpsConf 2026 case study link to README.md
- Minor formatting cleanup in ralph-iteration-logger.py, diagnostics.py, map_step_runner.py
Pull request overview
Adds orchestrator-driven retry handling when Monitor returns valid=false, so the workflow can safely transition back into an editing-allowed phase (ACTOR) and persist Monitor feedback for the next retry, avoiding Monitor-phase deadlocks with workflow-gate.py.
Changes:
- Add `monitor_failed()` and `wave_monitor_failed()` commands to `map_orchestrator.py` (both template + `.map` copy) to increment retry counters, reset phases, and write `monitor_feedback*.md`.
- Expose the new commands via the orchestrator CLI `main()` command choices/handlers.
- Update `map-efficient.md` (both template + `.claude` copy) to reference the new orchestrator commands instead of pseudocode retry logic.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| src/mapify_cli/templates/map/scripts/map_orchestrator.py | Adds retry handlers + CLI wiring for sequential and wave modes. |
| .map/scripts/map_orchestrator.py | Mirrors the orchestrator retry handlers + CLI wiring for local workflow execution. |
| src/mapify_cli/templates/commands/map-efficient.md | Documents using monitor_failed / wave_monitor_failed in the Monitor failure loop. |
| .claude/commands/map-efficient.md | Mirrors the map-efficient documentation update for Claude command usage. |
.map/scripts/map_orchestrator.py
Outdated
```python
if state.retry_count > state.max_retries:
    state.save(state_file)
    return {
        "status": "max_retries",
        "retry_count": state.retry_count,
```
The max-retry check is off by one: with max_retries=5 this escalates on the 6th failure (because it uses >). Also, in the max_retries branch the state’s phase isn’t reset to an editing phase, so workflow-gate can remain stuck in MONITOR even though you’re escalating. Consider using >= and explicitly resetting current_step_id/current_step_phase/pending_steps (or a dedicated escalation phase) before returning.
```python
state.pending_steps = step_order[actor_idx:]  # ["2.3", "2.4"] (+ TDD steps if any)
state.current_step_id = "2.3"
state.current_step_phase = "ACTOR"
```
Re-queueing from step 2.3 leaves any prior 2.3 entries in completed_steps. This can inflate check_circuit_breaker()’s tool_count (it uses len(completed_steps)) and may trip the breaker early after a few retries. Consider removing 2.3 (and 2.4 if present) from completed_steps when re-queuing, or making completed_steps a set/unique list.
```python
# When re-queuing from 2.3/2.4, ensure these steps are no longer marked completed.
# This prevents len(completed_steps)-based logic (e.g., circuit breakers) from
# counting multiple retries of the same phase as additional completed tools.
state.completed_steps = [
    step_id for step_id in state.completed_steps
    if step_id not in ("2.3", "2.4")
]
```
.map/scripts/map_orchestrator.py
Outdated
```python
        print(json.dumps(result, indent=2))

    elif args.command == "monitor_failed":
        feedback = args.task_or_step or ""
```
CLI parsing for monitor_failed only uses args.task_or_step, so unquoted multi-word feedback will be truncated and the remainder discarded into extra_args. For consistency with wave_monitor_failed, consider joining extra_args into the feedback text as well.
```diff
- feedback = args.task_or_step or ""
+ parts = []
+ if args.task_or_step:
+     parts.append(args.task_or_step)
+ if getattr(args, "extra_args", None):
+     parts.extend(args.extra_args)
+ feedback = " ".join(parts) if parts else ""
```
```text
if monitor_output["valid"] == false:
    # Increment retry counter (also triggered when test gate fails above)
    if retry_count < 5:
        # Go back to Phase: ACTOR with Monitor feedback
        # Actor will fix issues and re-apply code

        # === STUCK RECOVERY (at retry 3) ===
        # At retry 3, intercept with intermediate recovery before retries 4-5.
        # This gives Actor better context to break out of a stuck loop.
        if retry_count == 3:
            # Step 1: Check if research-agent already ran for this subtask
            findings_file = f".map/{branch}/findings_{branch}.md"
            if findings_file exists and has content for this subtask:
                # Reuse existing findings (Edge Case 12: skip re-invocation)
                recovery_context = read(findings_file)
            else:
                # Invoke research-agent for alternative approaches
                Task(
                    subagent_type="research-agent",
                    description="Stuck recovery: find alternative approach",
                    prompt=f"""Subtask {subtask_id} failed 3 monitor retries.

# Use orchestrator to handle retry: requeues ACTOR+MONITOR, increments retry_count,
# switches phase so workflow-gate allows edits, persists feedback for Actor.
RETRY_RESULT=$(python3 .map/scripts/map_orchestrator.py monitor_failed "MONITOR_FEEDBACK_TEXT")
# RETRY_RESULT.status is "retrying" or "max_retries"
```
This snippet is fenced as bash but mixes non-bash pseudocode (e.g., if monitor_output["valid"] == false: and later if RETRY_STATUS == "max_retries":). If this is intended to be runnable shell, please convert the conditionals to valid bash and show how MONITOR_FEEDBACK_TEXT is sourced/escaped; otherwise consider changing the fence/language label to reflect that it’s pseudocode.
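For reference, a consistent-bash rendering of the status check could look like the sketch below. The JSON is inlined here so the snippet is self-contained; in the playbook it would come from the orchestrator call shown in the comment:

```shell
# In the playbook, RETRY_RESULT would come from:
#   RETRY_RESULT=$(python3 .map/scripts/map_orchestrator.py monitor_failed "$FEEDBACK")
RETRY_RESULT='{"status": "max_retries", "retry_count": 5}'

RETRY_STATUS=$(echo "$RETRY_RESULT" | python3 -c "import sys,json; print(json.load(sys.stdin).get('status',''))")

# Valid bash conditional instead of the Python-style `if RETRY_STATUS == ...:`
if [ "$RETRY_STATUS" = "max_retries" ]; then
    echo "retry limit reached: escalate to user"
else
    echo "retrying"
fi
```

How `$FEEDBACK` is sourced and escaped is still up to the playbook author; the point is only that the conditional is valid shell.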
```python
if current_retries > state.max_retries:
    state.save(state_file)
    return {
        "status": "max_retries",
        "subtask_id": subtask_id,
        "retry_count": current_retries,
        "max_retries": state.max_retries,
```
Same off-by-one issue here: current_retries > max_retries escalates on the (max_retries+1)th failure. Also, the max_retries branch doesn’t reset any phase to an editing phase. Given workflow-gate checks wave subtask_phases against {ACTOR, APPLY, TEST_WRITER} (not step IDs), consider also setting current_step_phase/current_subtask_id (or updating the gate/representation) so retries actually unblock edits.
```text
if monitor_output["valid"] == false:
    # Increment retry counter (also triggered when test gate fails above)
    if retry_count < 5:
        # Go back to Phase: ACTOR with Monitor feedback
        # Actor will fix issues and re-apply code

        # === STUCK RECOVERY (at retry 3) ===
        # At retry 3, intercept with intermediate recovery before retries 4-5.
        # This gives Actor better context to break out of a stuck loop.
        if retry_count == 3:
            # Step 1: Check if research-agent already ran for this subtask
            findings_file = f".map/{branch}/findings_{branch}.md"
            if findings_file exists and has content for this subtask:
                # Reuse existing findings (Edge Case 12: skip re-invocation)
                recovery_context = read(findings_file)
            else:
                # Invoke research-agent for alternative approaches
                Task(
                    subagent_type="research-agent",
                    description="Stuck recovery: find alternative approach",
                    prompt=f"""Subtask {subtask_id} failed 3 monitor retries.

# Use orchestrator to handle retry: requeues ACTOR+MONITOR, increments retry_count,
# switches phase so workflow-gate allows edits, persists feedback for Actor.
RETRY_RESULT=$(python3 .map/scripts/map_orchestrator.py monitor_failed "MONITOR_FEEDBACK_TEXT")
# RETRY_RESULT.status is "retrying" or "max_retries"
```
This snippet is fenced as bash but mixes non-bash pseudocode (e.g., if monitor_output["valid"] == false: and later if RETRY_STATUS == "max_retries":). If this is intended to be runnable shell, please convert the conditionals to valid bash and show how MONITOR_FEEDBACK_TEXT is sourced/escaped; otherwise consider changing the fence/language label to reflect that it’s pseudocode.
```python
if state.retry_count > state.max_retries:
    state.save(state_file)
    return {
        "status": "max_retries",
        "retry_count": state.retry_count,
```
The max-retry check is off by one: with max_retries=5 this escalates on the 6th failure (because it uses >). Also, in the max_retries branch the state’s phase isn’t reset to an editing phase, so workflow-gate can remain stuck in MONITOR even though you’re escalating. Consider using >= and explicitly resetting current_step_id/current_step_phase/pending_steps (or a dedicated escalation phase) before returning.
```python
# Requeue ACTOR (2.3) and MONITOR (2.4)
step_order = _get_step_order(state.tdd_mode)
actor_idx = step_order.index("2.3")
state.pending_steps = step_order[actor_idx:]  # ["2.3", "2.4"] (+ TDD steps if any)
state.current_step_id = "2.3"
state.current_step_phase = "ACTOR"
```
Re-queueing from step 2.3 leaves any prior 2.3 entries in completed_steps. This can inflate check_circuit_breaker()’s tool_count (it uses len(completed_steps)) and may trip the breaker early after a few retries. Consider removing 2.3 (and 2.4 if present) from completed_steps when re-queuing, or making completed_steps a set/unique list.
```python
        print(json.dumps(result, indent=2))

    elif args.command == "monitor_failed":
        feedback = args.task_or_step or ""
```
CLI parsing for monitor_failed only uses args.task_or_step, so unquoted multi-word feedback will be truncated and the remainder discarded into extra_args. For consistency with wave_monitor_failed, consider joining extra_args into the feedback text as well.
```diff
- feedback = args.task_or_step or ""
+ parts = []
+ if args.task_or_step:
+     parts.append(args.task_or_step)
+ if args.extra_args:
+     parts.extend(args.extra_args)
+ feedback = " ".join(parts)
```
```python
def monitor_failed(branch: str, feedback: str = "") -> dict:
    """Handle Monitor valid=false: requeue ACTOR+MONITOR, increment retry_count.

    Called by map-efficient.md when Monitor returns valid=false.
    Switches phase back to ACTOR so workflow-gate allows edits.
```
No unit tests were added for these new retry handlers. The repo already has pytest coverage for other orchestrator commands (see tests/test_map_orchestrator.py), so adding tests for monitor_failed()/wave_monitor_failed() (retry increment, requeue/reset behavior, and max_retries semantics) would help prevent regressions.
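Sketched against a minimal local stub — the real tests would import the repo's `map_orchestrator.monitor_failed()` and exercise `step_state.json` on disk; the stub here only mirrors the retry/requeue semantics described in this PR:

```python
# Minimal stand-in for the retry semantics under test; the real suite would
# call map_orchestrator.monitor_failed() instead of this illustrative stub.
def monitor_failed_stub(state: dict) -> dict:
    state["retry_count"] += 1
    if state["retry_count"] >= state["max_retries"]:
        return {"status": "max_retries", "retry_count": state["retry_count"]}
    # Requeue ACTOR (2.3) + MONITOR (2.4) and unlock edits for the retry.
    state["pending_steps"] = ["2.3", "2.4"]
    state["current_step_id"] = "2.3"
    state["current_step_phase"] = "ACTOR"
    return {"status": "retrying", "retry_count": state["retry_count"]}

def test_retry_increments_and_requeues():
    state = {"retry_count": 0, "max_retries": 5,
             "current_step_phase": "MONITOR", "pending_steps": []}
    result = monitor_failed_stub(state)
    assert result["status"] == "retrying"
    assert state["current_step_phase"] == "ACTOR"
    assert state["pending_steps"] == ["2.3", "2.4"]

def test_escalates_at_limit():
    state = {"retry_count": 4, "max_retries": 5,
             "current_step_phase": "MONITOR", "pending_steps": []}
    assert monitor_failed_stub(state)["status"] == "max_retries"
```

The same three behaviors (retry increment, requeue/reset, max_retries semantics) would apply to `wave_monitor_failed()` with per-subtask counters.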
Issues found by Monitor/Predictor/Evaluator review:
- Fix CLI feedback truncation: join all positional args for multi-word
feedback (was silently dropping words after the first)
- Fix TDD mode: always requeue only ["2.3", "2.4"] on retry, never
re-run TDD pre-steps (2.25/2.26) that were already completed
- Fix feedback overwrite: use retry-numbered filenames
(monitor_feedback_retry{N}.md) per error-patterns.md lesson
- Extract _write_feedback_file() helper to DRY the feedback persistence
- Update Phase D wave retry pseudocode to use wave_monitor_failed
- Clarify Actor feedback file path comment (explicit, not "automatic")
- Add 17 unit tests: TestMonitorFailed (10) + TestWaveMonitorFailed (7)
… edits

Review findings (11 items):
- Extract _check_retry_limit() helper to DRY up monitor_failed/wave_monitor_failed
- Add phase guard to monitor_failed() rejecting non-MONITOR phase calls
- Add --feedback CLI flag for symmetric argument passing
- Fix SUBTASK_ID placeholder, feedback filename comment, Continue label text
- Document serial/wave retry counter invariant in StepState
- Update ARCHITECTURE.md CLI command list

New feature: reopen_for_fixes()
- Transitions COMPLETE → ACTOR so workflow gate unlocks for review fixes
- Solves: /map-review finds issues but edits blocked in COMPLETE phase
- map-review.md updated with "Workflow Gate Unlock" section

Tests: 13 new tests (6 reopen_for_fixes + 4 review findings + 3 integration)
TEST_WRITER prompt now includes explicit anti-patterns:
- Every test must verify semantic behavior, not just single branch execution
- Tests must assert multiple consequences (side effects, return values, state)
- Prefer scenario-based tests over single-field unit checks
- Aim for 60%+ semantic tests

TEST_FAIL_GATE now includes quality gate:
- Classifies tests as semantic vs trivial
- Rejects if >40% tests are trivial single-branch checks
- Sends feedback to TEST_WRITER to merge trivial tests into richer scenarios

Addresses feedback: 3/9 tests were trivial branch checks in production use.
TEST_WRITER could add temporal comments like "currently FAILS" or "expected to FAIL until fix is applied" that become stale after the Green phase. Actor in code_only mode cannot clean them up.

- Add rule to TEST_WRITER prohibiting temporal state-marking comments (map-tdd.md, map-efficient.md, actor.md)
- Add TDD Refactor cleanup step after Actor returns, before Monitor, to scan and clean any stale Red-phase markers in test files
Pull request overview
Copilot reviewed 16 out of 16 changed files in this pull request and generated 10 comments.
```python
feedback_file = _write_feedback_file(
    branch,
    f"monitor_feedback_{subtask_id}_retry{current_retries}.md",
    f"Monitor Feedback for {subtask_id} (retry {current_retries})",
    feedback,
```
wave_monitor_failed() interpolates subtask_id into the feedback filename. Since subtask_id is user-controlled via the CLI, a value containing path separators / .. could lead to path traversal and writing outside .map/<branch>/. Validate/sanitize subtask_id (e.g., enforce ^ST-\d+$ and/or strip path separators) before using it in a path component.
```python
        ),
    },
)
if escalation is not None:
```
In the max_retries early-return path, monitor_failed() saves state but leaves current_step_phase as MONITOR. That reintroduces the original deadlock (workflow-gate blocks edits during MONITOR) at exactly the point where the user needs to decide next actions. Consider still resetting the phase/pending steps to an editing phase (e.g., ACTOR + requeue 2.3/2.4) while returning status="max_retries", or otherwise explicitly unlocking edits as part of the escalation path.
```python
if escalation is not None:
    # Even on max-retries escalation, move state back to an editing phase
    # so that workflow-gate does not keep the branch locked in MONITOR.
    state.pending_steps = ["2.3", "2.4"]
    state.current_step_id = "2.3"
    state.current_step_phase = "ACTOR"
```
```python
f"Monitor retry limit reached ({state.max_retries} attempts). "
"Escalate to user."
```
The escalation message says "retry limit reached (N attempts)", but escalation happens when current_retries > max_retries (i.e., after exceeding the limit). Consider adjusting the wording to "retry limit exceeded" (or include the current attempt number) to avoid off-by-one confusion in user-facing output.
```diff
- f"Monitor retry limit reached ({state.max_retries} attempts). "
- "Escalate to user."
+ f"Monitor retry limit exceeded: "
+ f"{state.retry_count} attempts (max {state.max_retries}). "
+ "Escalating to user."
```
```python
feedback_file = _write_feedback_file(
    branch,
    f"monitor_feedback_{subtask_id}_retry{current_retries}.md",
    f"Monitor Feedback for {subtask_id} (retry {current_retries})",
    feedback,
```
wave_monitor_failed() interpolates subtask_id into the feedback filename. Since subtask_id is user-controlled via the CLI, a value containing path separators / .. could lead to path traversal and writing outside .map/<branch>/. Validate/sanitize subtask_id (e.g., enforce ^ST-\d+$ and/or strip path separators) before using it in a path component.
```python
    """
    state_file = Path(f".map/{branch}/step_state.json")
    state = StepState.load(state_file)
```
wave_monitor_failed() increments retry counts and mutates subtask_phases without checking that wave execution is configured, that subtask_id exists in the current wave, or that the subtask is currently in MONITOR (2.4). As written, calling this command at the wrong time (or with an unknown subtask_id) can silently create/overwrite state and corrupt wave progress. Add guards (e.g., require execution_waves configured, subtask_id in current wave, and current phase == "2.4") and return a structured error otherwise.
```python
# Guard: wave execution must be configured
execution_waves = getattr(state, "execution_waves", None)
if not execution_waves:
    return {
        "status": "error",
        "error": "wave_execution_not_configured",
        "message": (
            "Monitor failure handling requires execution_waves to be "
            "configured in step_state.json."
        ),
        "subtask_id": subtask_id,
    }
# Guard: current wave index must be valid
current_wave_index = getattr(state, "current_wave_index", None)
if not isinstance(current_wave_index, int) or not (
    0 <= current_wave_index < len(execution_waves)
):
    return {
        "status": "error",
        "error": "invalid_wave_state",
        "message": (
            "Current wave index is not valid; cannot process monitor failure "
            "for subtask."
        ),
        "subtask_id": subtask_id,
    }
current_wave = execution_waves[current_wave_index]
# Derive the list of subtasks for the current wave in a tolerant way
if isinstance(current_wave, dict):
    wave_subtasks = current_wave.get("subtasks")
else:
    wave_subtasks = current_wave
if not wave_subtasks or subtask_id not in wave_subtasks:
    return {
        "status": "error",
        "error": "unknown_subtask_for_current_wave",
        "message": (
            f"Subtask {subtask_id} is not part of the current execution wave; "
            "cannot mark monitor as failed."
        ),
        "subtask_id": subtask_id,
    }
# Guard: subtask must currently be in MONITOR phase (2.4)
current_phase = state.subtask_phases.get(subtask_id)
if current_phase != "2.4":
    return {
        "status": "error",
        "error": "invalid_phase_for_monitor_failed",
        "message": (
            f"Cannot mark monitor failed for {subtask_id} because its current "
            f"phase is {current_phase!r}, expected '2.4' (MONITOR)."
        ),
        "subtask_id": subtask_id,
        "current_phase": current_phase,
        "expected_phase": "2.4",
    }
```
```text
if monitor_output["valid"] == false:
    # Increment retry counter (also triggered when test gate fails above)
    if retry_count < 5:
        # Go back to Phase: ACTOR with Monitor feedback
        # Actor will fix issues and re-apply code

        # === STUCK RECOVERY (at retry 3) ===
        # At retry 3, intercept with intermediate recovery before retries 4-5.
        # This gives Actor better context to break out of a stuck loop.
        if retry_count == 3:
            # Step 1: Check if research-agent already ran for this subtask
            findings_file = f".map/{branch}/findings_{branch}.md"
            if findings_file exists and has content for this subtask:
                # Reuse existing findings (Edge Case 12: skip re-invocation)
                recovery_context = read(findings_file)
            else:
                # Invoke research-agent for alternative approaches
                Task(
                    subagent_type="research-agent",
                    description="Stuck recovery: find alternative approach",
                    prompt=f"""Subtask {subtask_id} failed 3 monitor retries.

# Use orchestrator to handle retry: requeues ACTOR+MONITOR, increments retry_count,
# switches phase so workflow-gate allows edits, persists feedback for Actor.
RETRY_RESULT=$(python3 .map/scripts/map_orchestrator.py monitor_failed --feedback "MONITOR_FEEDBACK_TEXT")
# RETRY_RESULT.status is "retrying" or "max_retries"
# RETRY_RESULT.retry_count shows current attempt number
# RETRY_RESULT.feedback_file points to .map/<branch>/monitor_feedback_retry{N}.md

RETRY_STATUS=$(echo "$RETRY_RESULT" | python3 -c "import sys,json; print(json.load(sys.stdin).get('status',''))")
RETRY_COUNT=$(echo "$RETRY_RESULT" | python3 -c "import sys,json; print(json.load(sys.stdin).get('retry_count',0))")

if RETRY_STATUS == "max_retries":
    # Escalate to user (retry limit reached after 5 attempts)
    AskUserQuestion(questions=[{"question": "Monitor retry limit reached (5 attempts). How to proceed?", "header": "Retry limit", "options": [{"label": "Continue", "description": "Continue with more retries (manually edit step_state.json retry_count)"}, {"label": "Skip", "description": "Skip this subtask and move to next"}, {"label": "Abort", "description": "Stop workflow"}], "multiSelect": false}])
```
This retry-handling snippet mixes bash-style variable assignments (RETRY_RESULT=$(...)) with Python-style conditionals (if RETRY_STATUS == "max_retries":). Because this prompt is meant to be followed verbatim, the mixed syntax is likely to confuse users/agents and can lead to incorrect command execution. Consider rewriting this block in one consistent pseudo-language (or make it valid bash with [ ... ] conditionals) and keep it consistently fenced.
```text
# After Monitor returns valid=true, run deterministic test gate:
TEST_GATE=$(python3 .map/scripts/map_step_runner.py run_test_gate)
# If tests fail, treat as Monitor valid=false — feed output back to Actor
if echo "$TEST_GATE" | python3 -c "import sys,json; d=json.load(sys.stdin); sys.exit(0 if d.get('passed') else 1)" 2>/dev/null; then
    # Tests passed — snapshot code state for artifact verification
    SNAPSHOT=$(python3 .map/scripts/map_step_runner.py snapshot_code_state)
    # Append git ref to review artifact header (if code-review file exists)
fi

# After Monitor returns:
if monitor_output["valid"] == false:
    # Increment retry counter (also triggered when test gate fails above)
    if retry_count < 5:
        # Go back to Phase: ACTOR with Monitor feedback
        # Actor will fix issues and re-apply code

        # === STUCK RECOVERY (at retry 3) ===
        # At retry 3, intercept with intermediate recovery before retries 4-5.
        # This gives Actor better context to break out of a stuck loop.
        if retry_count == 3:
            # Step 1: Check if research-agent already ran for this subtask
            findings_file = f".map/{branch}/findings_{branch}.md"
            if findings_file exists and has content for this subtask:
                # Reuse existing findings (Edge Case 12: skip re-invocation)
                recovery_context = read(findings_file)
            else:
                # Invoke research-agent for alternative approaches
                Task(
                    subagent_type="research-agent",
                    description="Stuck recovery: find alternative approach",
                    prompt=f"""Subtask {subtask_id} failed 3 monitor retries.

# Use orchestrator to handle retry: requeues ACTOR+MONITOR, increments retry_count,
# switches phase so workflow-gate allows edits, persists feedback for Actor.
RETRY_RESULT=$(python3 .map/scripts/map_orchestrator.py monitor_failed --feedback "MONITOR_FEEDBACK_TEXT")
# RETRY_RESULT.status is "retrying" or "max_retries"
# RETRY_RESULT.retry_count shows current attempt number
# RETRY_RESULT.feedback_file points to .map/<branch>/monitor_feedback_retry{N}.md

RETRY_STATUS=$(echo "$RETRY_RESULT" | python3 -c "import sys,json; print(json.load(sys.stdin).get('status',''))")
RETRY_COUNT=$(echo "$RETRY_RESULT" | python3 -c "import sys,json; print(json.load(sys.stdin).get('retry_count',0))")

if RETRY_STATUS == "max_retries":
    # Escalate to user (retry limit reached after 5 attempts)
    AskUserQuestion(questions=[{"question": "Monitor retry limit reached (5 attempts). How to proceed?", "header": "Retry limit", "options": [{"label": "Continue", "description": "Continue with more retries (manually edit step_state.json retry_count)"}, {"label": "Skip", "description": "Skip this subtask and move to next"}, {"label": "Abort", "description": "Stop workflow"}], "multiSelect": false}])
```
This retry-handling block mixes bash-style constructs (VAR=$(...), if echo ...; then) with Python-style conditionals (if RETRY_STATUS == "max_retries":) and pseudo-DSL like AskUserQuestion(...). Because map-efficient.md is an operational playbook prompt, the mixed syntax is likely to confuse users/agents. Consider rewriting the block in a single consistent pseudo-language (or make it valid bash throughout) and fence it accordingly.
```python
    """
    state_file = Path(f".map/{branch}/step_state.json")
    state = StepState.load(state_file)
```
wave_monitor_failed() increments retry counts and mutates subtask_phases without checking that wave execution is configured, that subtask_id exists in the current wave, or that the subtask is currently in MONITOR (2.4). As written, calling this command at the wrong time (or with an unknown subtask_id) can silently create/overwrite state and corrupt wave progress. Add guards (e.g., require execution_waves configured, subtask_id in current wave, and current phase == "2.4") and return a structured error otherwise.
```python
# Ensure wave-based execution is configured and we have a current wave
execution_waves = getattr(state, "execution_waves", None)
current_wave_index = getattr(state, "current_wave_index", None)
if not execution_waves or current_wave_index is None or not (
    0 <= current_wave_index < len(execution_waves)
):
    return {
        "status": "error",
        "error": "wave_execution_not_configured",
        "subtask_id": subtask_id,
        "message": (
            "Wave-based execution is not properly configured; "
            "cannot process monitor failure for this subtask."
        ),
    }
# Verify the subtask belongs to the current wave
current_wave = execution_waves[current_wave_index]
if isinstance(current_wave, dict):
    current_wave_subtasks = list(current_wave.keys())
elif isinstance(current_wave, list):
    current_wave_subtasks = current_wave
else:
    current_wave_subtasks = []
if subtask_id not in current_wave_subtasks:
    return {
        "status": "error",
        "error": "subtask_not_in_current_wave",
        "subtask_id": subtask_id,
        "message": (
            f"Subtask {subtask_id} is not part of the current wave; "
            "cannot process monitor failure."
        ),
    }
# Verify the subtask is currently in MONITOR phase ("2.4")
current_phase = state.subtask_phases.get(subtask_id)
if current_phase != "2.4":
    return {
        "status": "error",
        "error": "invalid_subtask_phase",
        "subtask_id": subtask_id,
        "expected_phase": "2.4",
        "actual_phase": current_phase,
        "message": (
            "wave_monitor_failed can only be called when the subtask "
            "is in MONITOR phase (2.4)."
        ),
    }
```
```python
        ),
    },
)
if escalation is not None:
```
In the max_retries early-return path, monitor_failed() saves state but leaves current_step_phase as MONITOR. That reintroduces the original deadlock (workflow-gate blocks edits during MONITOR) at exactly the point where the user needs to decide next actions. Consider still resetting the phase/pending steps to an editing phase (e.g., ACTOR + requeue 2.3/2.4) while returning status="max_retries", or otherwise explicitly unlocking edits as part of the escalation path.
```python
if escalation is not None:
    # On max retries, we still want to unlock edits and move the workflow
    # out of MONITOR, otherwise the workflow-gate would deadlock exactly
    # when the user needs to intervene. Reset to an editing phase and
    # persist that state while returning status="max_retries".
    state.pending_steps = ["2.3", "2.4"]
    state.current_step_id = "2.3"
    state.current_step_phase = "ACTOR"
```
```python
f"Monitor retry limit reached ({state.max_retries} attempts). "
"Escalate to user."
```
The escalation message says "retry limit reached (N attempts)", but escalation happens when current_retries > max_retries (i.e., after exceeding the limit). Consider adjusting the wording to "retry limit exceeded" (or include the current attempt number) to avoid off-by-one confusion in user-facing output.
```diff
- f"Monitor retry limit reached ({state.max_retries} attempts). "
- "Escalate to user."
+ f"Monitor retry limit exceeded after {state.retry_count} attempts "
+ f"(max {state.max_retries}). Escalate to user."
```
…ce, map-debate) Remove unused skill definitions and their trigger rules from both dev config (.claude/) and shipped templates (src/mapify_cli/templates/).
…metadata Clean up remaining mentions after command/skill deletion. CHANGELOG entries preserved as historical record.
Summary
- Add `monitor_failed()` and `wave_monitor_failed()` to `map_orchestrator.py` — when Monitor returns valid=false, the orchestrator automatically switches phase back to ACTOR, increments retry count, and persists feedback to a file
- `workflow-gate.py` blocks edits during MONITOR phase (the deadlock this fixes)
- Update `map-efficient.md` to use the new orchestrator commands instead of pseudocode for retry logic

Design
Monitor remains strictly read-only. The orchestrator (not Monitor) handles state transitions.
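That split can be sketched as follows (illustrative names, not the repo's actual functions: the Monitor only reports a verdict, and only the orchestrator mutates workflow state):

```python
def monitor(diff: str) -> dict:
    """Read-only check: returns a verdict, never touches workflow state."""
    valid = "TODO" not in diff  # stand-in for the real validation logic
    return {"valid": valid, "feedback": "" if valid else "unresolved TODO in diff"}

def orchestrate(state: dict, diff: str) -> dict:
    """Only the orchestrator transitions phases and counts retries."""
    verdict = monitor(diff)
    if not verdict["valid"]:
        state["retry_count"] += 1
        state["phase"] = "ACTOR"  # unlock edits for the retry
        return {"status": "retrying", "feedback": verdict["feedback"]}
    state["phase"] = "COMPLETE"
    return {"status": "done"}

state = {"phase": "MONITOR", "retry_count": 0}
print(orchestrate(state, "x = 1  # TODO: fix"))  # status: retrying
print(state["phase"])  # ACTOR
```

Keeping the mutation on one side means the Monitor can be retried or swapped out without risking state corruption.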
CLI interface
Test plan
- `monitor_failed()` sets phase to ACTOR, increments retry_count, requeues steps
- `wave_monitor_failed()` resets subtask phase, increments per-subtask retry count