feat: automatic ACTOR retry on Monitor failure by azalio · Pull Request #92 · azalio/map-framework

azalio · 2026-03-27T12:35:12Z

Summary

Add monitor_failed() and wave_monitor_failed() to map_orchestrator.py — when Monitor returns valid=false, the orchestrator automatically switches phase back to ACTOR, increments retry count, and persists feedback to a file
This fixes the deadlock where Monitor detects issues but workflow-gate.py blocks edits during MONITOR phase
Update map-efficient.md to use the new orchestrator commands instead of pseudocode for retry logic

Design

ACTOR (writes code) → MONITOR (finds bug, valid=false)
                           ↓
                      orchestrator calls monitor_failed()
                           ↓ phase → ACTOR, retry_count++
                      ACTOR (reads monitor_feedback.md, fixes) → MONITOR (re-checks)
                           ↓ valid=true → next phase

Monitor remains strictly read-only. The orchestrator (not Monitor) handles state transitions.

CLI interface

# Sequential mode
python3 .map/scripts/map_orchestrator.py monitor_failed "feedback text"

# Wave/parallel mode
python3 .map/scripts/map_orchestrator.py wave_monitor_failed ST-001 "feedback text"

Test plan

Unit test: monitor_failed() sets phase to ACTOR, increments retry_count, requeues steps
Unit test: wave_monitor_failed() resets subtask phase, increments per-subtask retry count
Unit test: max_retries returns escalation status
Template sync tests pass (24/24)
Full test suite passes (818/818)

When Monitor returns valid=false, the orchestrator now has a proper state transition back to ACTOR phase. This fixes the deadlock where Monitor detects issues but workflow-gate blocks edits during MONITOR. - Add monitor_failed() for sequential execution mode - Add wave_monitor_failed() for parallel/wave execution mode - Both persist feedback to .map/<branch>/monitor_feedback*.md - Both enforce max_retries with escalation status - Register as CLI commands: monitor_failed, wave_monitor_failed - Update map-efficient.md to use orchestrator instead of pseudocode

- Add DevOpsConf 2026 case study link to README.md - Minor formatting cleanup in ralph-iteration-logger.py, diagnostics.py, map_step_runner.py

Copilot

Pull request overview

Adds orchestrator-driven retry handling when Monitor returns valid=false, so the workflow can safely transition back into an editing-allowed phase (ACTOR) and persist Monitor feedback for the next retry, avoiding Monitor-phase deadlocks with workflow-gate.py.

Changes:

Add monitor_failed() and wave_monitor_failed() commands to map_orchestrator.py (both template + .map copy) to increment retry counters, reset phases, and write monitor_feedback*.md.
Expose the new commands via the orchestrator CLI main() command choices/handlers.
Update map-efficient.md (both template + .claude copy) to reference the new orchestrator commands instead of pseudocode retry logic.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 11 comments.

File	Description
`src/mapify_cli/templates/map/scripts/map_orchestrator.py`	Adds retry handlers + CLI wiring for sequential and wave modes.
`.map/scripts/map_orchestrator.py`	Mirrors the orchestrator retry handlers + CLI wiring for local workflow execution.
`src/mapify_cli/templates/commands/map-efficient.md`	Documents using `monitor_failed` / `wave_monitor_failed` in the Monitor failure loop.
`.claude/commands/map-efficient.md`	Mirrors the `map-efficient` documentation update for Claude command usage.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-27T12:42:34Z

.map/scripts/map_orchestrator.py

+    if state.retry_count > state.max_retries:
+        state.save(state_file)
+        return {
+            "status": "max_retries",
+            "retry_count": state.retry_count,


The max-retry check is off by one: with max_retries=5 this escalates on the 6th failure (because it uses >). Also, in the max_retries branch the state’s phase isn’t reset to an editing phase, so workflow-gate can remain stuck in MONITOR even though you’re escalating. Consider using >= and explicitly resetting current_step_id/current_step_phase/pending_steps (or a dedicated escalation phase) before returning.

Copilot · 2026-03-27T12:42:34Z

.map/scripts/map_orchestrator.py

+    state.pending_steps = step_order[actor_idx:]  # ["2.3", "2.4"] (+ TDD steps if any)
+    state.current_step_id = "2.3"
+    state.current_step_phase = "ACTOR"
+


Re-queueing from step 2.3 leaves any prior 2.3 entries in completed_steps. This can inflate check_circuit_breaker()’s tool_count (it uses len(completed_steps)) and may trip the breaker early after a few retries. Consider removing 2.3 (and 2.4 if present) from completed_steps when re-queuing, or making completed_steps a set/unique list.

Suggested change

# When re-queuing from 2.3/2.4, ensure these steps are no longer marked completed.

# This prevents len(completed_steps)-based logic (e.g., circuit breakers) from

# counting multiple retries of the same phase as additional completed tools.

state.completed_steps = [

step_id for step_id in state.completed_steps

if step_id not in ("2.3", "2.4")

]

Copilot · 2026-03-27T12:42:35Z

.map/scripts/map_orchestrator.py

            print(json.dumps(result, indent=2))

+        elif args.command == "monitor_failed":
+            feedback = args.task_or_step or ""


CLI parsing for monitor_failed only uses args.task_or_step, so unquoted multi-word feedback will be truncated and the remainder discarded into extra_args. For consistency with wave_monitor_failed, consider joining extra_args into the feedback text as well.

Suggested change

feedback = args.task_or_step or ""

parts = []

if args.task_or_step:

parts.append(args.task_or_step)

if getattr(args, "extra_args", None):

parts.extend(args.extra_args)

feedback = " ".join(parts) if parts else ""

Copilot · 2026-03-27T12:42:35Z

.claude/commands/map-efficient.md

 if monitor_output["valid"] == false:
-    # Increment retry counter (also triggered when test gate fails above)
-    if retry_count < 5:
-        # Go back to Phase: ACTOR with Monitor feedback
-        # Actor will fix issues and re-apply code
-
-        # === STUCK RECOVERY (at retry 3) ===
-        # At retry 3, intercept with intermediate recovery before retries 4-5.
-        # This gives Actor better context to break out of a stuck loop.
-        if retry_count == 3:
-            # Step 1: Check if research-agent already ran for this subtask
-            findings_file = f".map/{branch}/findings_{branch}.md"
-            if findings_file exists and has content for this subtask:
-                # Reuse existing findings (Edge Case 12: skip re-invocation)
-                recovery_context = read(findings_file)
-            else:
-                # Invoke research-agent for alternative approaches
-                Task(
-                    subagent_type="research-agent",
-                    description="Stuck recovery: find alternative approach",
-                    prompt=f"""Subtask {subtask_id} failed 3 monitor retries.
+    # Use orchestrator to handle retry: requeues ACTOR+MONITOR, increments retry_count,
+    # switches phase so workflow-gate allows edits, persists feedback for Actor.
+    RETRY_RESULT=$(python3 .map/scripts/map_orchestrator.py monitor_failed "MONITOR_FEEDBACK_TEXT")
+    # RETRY_RESULT.status is "retrying" or "max_retries"


This snippet is fenced as bash but mixes non-bash pseudocode (e.g., if monitor_output["valid"] == false: and later if RETRY_STATUS == "max_retries":). If this is intended to be runnable shell, please convert the conditionals to valid bash and show how MONITOR_FEEDBACK_TEXT is sourced/escaped; otherwise consider changing the fence/language label to reflect that it’s pseudocode.

Copilot · 2026-03-27T12:42:35Z

src/mapify_cli/templates/map/scripts/map_orchestrator.py

+    if current_retries > state.max_retries:
+        state.save(state_file)
+        return {
+            "status": "max_retries",
+            "subtask_id": subtask_id,
+            "retry_count": current_retries,
+            "max_retries": state.max_retries,


Same off-by-one issue here: current_retries > max_retries escalates on the (max_retries+1)th failure. Also, the max_retries branch doesn’t reset any phase to an editing phase. Given workflow-gate checks wave subtask_phases against {ACTOR, APPLY, TEST_WRITER} (not step IDs), consider also setting current_step_phase/current_subtask_id (or updating the gate/representation) so retries actually unblock edits.

Copilot · 2026-03-27T12:42:36Z

src/mapify_cli/templates/commands/map-efficient.md

 if monitor_output["valid"] == false:
-    # Increment retry counter (also triggered when test gate fails above)
-    if retry_count < 5:
-        # Go back to Phase: ACTOR with Monitor feedback
-        # Actor will fix issues and re-apply code
-
-        # === STUCK RECOVERY (at retry 3) ===
-        # At retry 3, intercept with intermediate recovery before retries 4-5.
-        # This gives Actor better context to break out of a stuck loop.
-        if retry_count == 3:
-            # Step 1: Check if research-agent already ran for this subtask
-            findings_file = f".map/{branch}/findings_{branch}.md"
-            if findings_file exists and has content for this subtask:
-                # Reuse existing findings (Edge Case 12: skip re-invocation)
-                recovery_context = read(findings_file)
-            else:
-                # Invoke research-agent for alternative approaches
-                Task(
-                    subagent_type="research-agent",
-                    description="Stuck recovery: find alternative approach",
-                    prompt=f"""Subtask {subtask_id} failed 3 monitor retries.
+    # Use orchestrator to handle retry: requeues ACTOR+MONITOR, increments retry_count,
+    # switches phase so workflow-gate allows edits, persists feedback for Actor.
+    RETRY_RESULT=$(python3 .map/scripts/map_orchestrator.py monitor_failed "MONITOR_FEEDBACK_TEXT")
+    # RETRY_RESULT.status is "retrying" or "max_retries"


This snippet is fenced as bash but mixes non-bash pseudocode (e.g., if monitor_output["valid"] == false: and later if RETRY_STATUS == "max_retries":). If this is intended to be runnable shell, please convert the conditionals to valid bash and show how MONITOR_FEEDBACK_TEXT is sourced/escaped; otherwise consider changing the fence/language label to reflect that it’s pseudocode.

Copilot · 2026-03-27T12:42:36Z

src/mapify_cli/templates/map/scripts/map_orchestrator.py

+    if state.retry_count > state.max_retries:
+        state.save(state_file)
+        return {
+            "status": "max_retries",
+            "retry_count": state.retry_count,


The max-retry check is off by one: with max_retries=5 this escalates on the 6th failure (because it uses >). Also, in the max_retries branch the state’s phase isn’t reset to an editing phase, so workflow-gate can remain stuck in MONITOR even though you’re escalating. Consider using >= and explicitly resetting current_step_id/current_step_phase/pending_steps (or a dedicated escalation phase) before returning.

Copilot · 2026-03-27T12:42:36Z

src/mapify_cli/templates/map/scripts/map_orchestrator.py

+    # Requeue ACTOR (2.3) and MONITOR (2.4)
+    step_order = _get_step_order(state.tdd_mode)
+    actor_idx = step_order.index("2.3")
+    state.pending_steps = step_order[actor_idx:]  # ["2.3", "2.4"] (+ TDD steps if any)
+    state.current_step_id = "2.3"
+    state.current_step_phase = "ACTOR"


Re-queueing from step 2.3 leaves any prior 2.3 entries in completed_steps. This can inflate check_circuit_breaker()’s tool_count (it uses len(completed_steps)) and may trip the breaker early after a few retries. Consider removing 2.3 (and 2.4 if present) from completed_steps when re-queuing, or making completed_steps a set/unique list.

Copilot · 2026-03-27T12:42:36Z

src/mapify_cli/templates/map/scripts/map_orchestrator.py

            print(json.dumps(result, indent=2))

+        elif args.command == "monitor_failed":
+            feedback = args.task_or_step or ""


CLI parsing for monitor_failed only uses args.task_or_step, so unquoted multi-word feedback will be truncated and the remainder discarded into extra_args. For consistency with wave_monitor_failed, consider joining extra_args into the feedback text as well.

Suggested change

feedback = args.task_or_step or ""

parts = []

if args.task_or_step:

parts.append(args.task_or_step)

if args.extra_args:

parts.extend(args.extra_args)

feedback = " ".join(parts)

Copilot · 2026-03-27T12:42:36Z

src/mapify_cli/templates/map/scripts/map_orchestrator.py

+def monitor_failed(branch: str, feedback: str = "") -> dict:
+    """Handle Monitor valid=false: requeue ACTOR+MONITOR, increment retry_count.
+
+    Called by map-efficient.md when Monitor returns valid=false.
+    Switches phase back to ACTOR so workflow-gate allows edits.


No unit tests were added for these new retry handlers. The repo already has pytest coverage for other orchestrator commands (see tests/test_map_orchestrator.py), so adding tests for monitor_failed()/wave_monitor_failed() (retry increment, requeue/reset behavior, and max_retries semantics) would help prevent regressions.

Issues found by Monitor/Predictor/Evaluator review: - Fix CLI feedback truncation: join all positional args for multi-word feedback (was silently dropping words after the first) - Fix TDD mode: always requeue only ["2.3", "2.4"] on retry, never re-run TDD pre-steps (2.25/2.26) that were already completed - Fix feedback overwrite: use retry-numbered filenames (monitor_feedback_retry{N}.md) per error-patterns.md lesson - Extract _write_feedback_file() helper to DRY the feedback persistence - Update Phase D wave retry pseudocode to use wave_monitor_failed - Clarify Actor feedback file path comment (explicit, not "automatic") - Add 17 unit tests: TestMonitorFailed (10) + TestWaveMonitorFailed (7)

… edits Review findings (11 items): - Extract _check_retry_limit() helper to DRY up monitor_failed/wave_monitor_failed - Add phase guard to monitor_failed() rejecting non-MONITOR phase calls - Add --feedback CLI flag for symmetric argument passing - Fix SUBTASK_ID placeholder, feedback filename comment, Continue label text - Document serial/wave retry counter invariant in StepState - Update ARCHITECTURE.md CLI command list New feature: reopen_for_fixes() - Transitions COMPLETE → ACTOR so workflow gate unlocks for review fixes - Solves: /map-review finds issues but edits blocked in COMPLETE phase - map-review.md updated with "Workflow Gate Unlock" section Tests: 13 new tests (6 reopen_for_fixes + 4 review findings + 3 integration)

TEST_WRITER prompt now includes explicit anti-patterns: - Every test must verify semantic behavior, not just single branch execution - Tests must assert multiple consequences (side effects, return values, state) - Prefer scenario-based tests over single-field unit checks - Aim for 60%+ semantic tests TEST_FAIL_GATE now includes quality gate: - Classifies tests as semantic vs trivial - Rejects if >40% tests are trivial single-branch checks - Sends feedback to TEST_WRITER to merge trivial tests into richer scenarios Addresses feedback: 3/9 tests were trivial branch checks in production use.

TEST_WRITER could add temporal comments like "currently FAILS" or "expected to FAIL until fix is applied" that become stale after the Green phase. Actor in code_only mode cannot clean them up. - Add rule to TEST_WRITER prohibiting temporal state-marking comments (map-tdd.md, map-efficient.md, actor.md) - Add TDD Refactor cleanup step after Actor returns, before Monitor, to scan and clean any stale Red-phase markers in test files

Copilot

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 10 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-30T15:29:58Z

src/mapify_cli/templates/map/scripts/map_orchestrator.py

+    feedback_file = _write_feedback_file(
+        branch,
+        f"monitor_feedback_{subtask_id}_retry{current_retries}.md",
+        f"Monitor Feedback for {subtask_id} (retry {current_retries})",
+        feedback,


wave_monitor_failed() interpolates subtask_id into the feedback filename. Since subtask_id is user-controlled via the CLI, a value containing path separators / .. could lead to path traversal and writing outside .map/<branch>/. Validate/sanitize subtask_id (e.g., enforce ^ST-\d+$ and/or strip path separators) before using it in a path component.

Copilot · 2026-03-30T15:29:58Z

src/mapify_cli/templates/map/scripts/map_orchestrator.py

+            ),
+        },
+    )
+    if escalation is not None:


In the max_retries early-return path, monitor_failed() saves state but leaves current_step_phase as MONITOR. That reintroduces the original deadlock (workflow-gate blocks edits during MONITOR) at exactly the point where the user needs to decide next actions. Consider still resetting the phase/pending steps to an editing phase (e.g., ACTOR + requeue 2.3/2.4) while returning status="max_retries", or otherwise explicitly unlocking edits as part of the escalation path.

Suggested change

if escalation is not None:

if escalation is not None:

# Even on max-retries escalation, move state back to an editing phase

# so that workflow-gate does not keep the branch locked in MONITOR.

state.pending_steps = ["2.3", "2.4"]

state.current_step_id = "2.3"

state.current_step_phase = "ACTOR"

Copilot · 2026-03-30T15:29:58Z

src/mapify_cli/templates/map/scripts/map_orchestrator.py

+                f"Monitor retry limit reached ({state.max_retries} attempts). "
+                "Escalate to user."


The escalation message says "retry limit reached (N attempts)", but escalation happens when current_retries > max_retries (i.e., after exceeding the limit). Consider adjusting the wording to "retry limit exceeded" (or include the current attempt number) to avoid off-by-one confusion in user-facing output.

Suggested change

f"Monitor retry limit reached ({state.max_retries} attempts). "

"Escalate to user."

f"Monitor retry limit exceeded: "

f"{state.retry_count} attempts (max {state.max_retries}). "

"Escalating to user."

Copilot · 2026-03-30T15:29:59Z

.map/scripts/map_orchestrator.py

+    feedback_file = _write_feedback_file(
+        branch,
+        f"monitor_feedback_{subtask_id}_retry{current_retries}.md",
+        f"Monitor Feedback for {subtask_id} (retry {current_retries})",
+        feedback,


wave_monitor_failed() interpolates subtask_id into the feedback filename. Since subtask_id is user-controlled via the CLI, a value containing path separators / .. could lead to path traversal and writing outside .map/<branch>/. Validate/sanitize subtask_id (e.g., enforce ^ST-\d+$ and/or strip path separators) before using it in a path component.

Copilot · 2026-03-30T15:29:59Z

.map/scripts/map_orchestrator.py

+    """
+    state_file = Path(f".map/{branch}/step_state.json")
+    state = StepState.load(state_file)
+


wave_monitor_failed() increments retry counts and mutates subtask_phases without checking that wave execution is configured, that subtask_id exists in the current wave, or that the subtask is currently in MONITOR (2.4). As written, calling this command at the wrong time (or with an unknown subtask_id) can silently create/overwrite state and corrupt wave progress. Add guards (e.g., require execution_waves configured, subtask_id in current wave, and current phase == "2.4") and return a structured error otherwise.

Suggested change

# Guard: wave execution must be configured

execution_waves = getattr(state, "execution_waves", None)

if not execution_waves:

return {

"status": "error",

"error": "wave_execution_not_configured",

"message": (

"Monitor failure handling requires execution_waves to be "

"configured in step_state.json."

),

"subtask_id": subtask_id,

}

# Guard: current wave index must be valid

current_wave_index = getattr(state, "current_wave_index", None)

if not isinstance(current_wave_index, int) or not (

0 <= current_wave_index < len(execution_waves)

):

return {

"status": "error",

"error": "invalid_wave_state",

"message": (

"Current wave index is not valid; cannot process monitor failure "

"for subtask."

),

"subtask_id": subtask_id,

}

current_wave = execution_waves[current_wave_index]

# Derive the list of subtasks for the current wave in a tolerant way

if isinstance(current_wave, dict):

wave_subtasks = current_wave.get("subtasks")

else:

wave_subtasks = current_wave

if not wave_subtasks or subtask_id not in wave_subtasks:

return {

"status": "error",

"error": "unknown_subtask_for_current_wave",

"message": (

f"Subtask {subtask_id} is not part of the current execution wave; "

"cannot mark monitor as failed."

),

"subtask_id": subtask_id,

}

# Guard: subtask must currently be in MONITOR phase (2.4)

current_phase = state.subtask_phases.get(subtask_id)

if current_phase != "2.4":

return {

"status": "error",

"error": "invalid_phase_for_monitor_failed",

"message": (

f"Cannot mark monitor failed for {subtask_id} because its current "

f"phase is {current_phase!r}, expected '2.4' (MONITOR)."

),

"subtask_id": subtask_id,

"current_phase": current_phase,

"expected_phase": "2.4",

}

Copilot · 2026-03-30T15:29:59Z

.claude/commands/map-efficient.md

 if monitor_output["valid"] == false:
-    # Increment retry counter (also triggered when test gate fails above)
-    if retry_count < 5:
-        # Go back to Phase: ACTOR with Monitor feedback
-        # Actor will fix issues and re-apply code
-
-        # === STUCK RECOVERY (at retry 3) ===
-        # At retry 3, intercept with intermediate recovery before retries 4-5.
-        # This gives Actor better context to break out of a stuck loop.
-        if retry_count == 3:
-            # Step 1: Check if research-agent already ran for this subtask
-            findings_file = f".map/{branch}/findings_{branch}.md"
-            if findings_file exists and has content for this subtask:
-                # Reuse existing findings (Edge Case 12: skip re-invocation)
-                recovery_context = read(findings_file)
-            else:
-                # Invoke research-agent for alternative approaches
-                Task(
-                    subagent_type="research-agent",
-                    description="Stuck recovery: find alternative approach",
-                    prompt=f"""Subtask {subtask_id} failed 3 monitor retries.
+    # Use orchestrator to handle retry: requeues ACTOR+MONITOR, increments retry_count,
+    # switches phase so workflow-gate allows edits, persists feedback for Actor.
+    RETRY_RESULT=$(python3 .map/scripts/map_orchestrator.py monitor_failed --feedback "MONITOR_FEEDBACK_TEXT")
+    # RETRY_RESULT.status is "retrying" or "max_retries"
+    # RETRY_RESULT.retry_count shows current attempt number
+    # RETRY_RESULT.feedback_file points to .map/<branch>/monitor_feedback_retry{N}.md
+
+    RETRY_STATUS=$(echo "$RETRY_RESULT" | python3 -c "import sys,json; print(json.load(sys.stdin).get('status',''))")
+    RETRY_COUNT=$(echo "$RETRY_RESULT" | python3 -c "import sys,json; print(json.load(sys.stdin).get('retry_count',0))")
+
+    if RETRY_STATUS == "max_retries":
+        # Escalate to user (retry limit reached after 5 attempts)
+        AskUserQuestion(questions=[{"question": "Monitor retry limit reached (5 attempts). How to proceed?", "header": "Retry limit", "options": [{"label": "Continue", "description": "Continue with more retries (manually edit step_state.json retry_count)"}, {"label": "Skip", "description": "Skip this subtask and move to next"}, {"label": "Abort", "description": "Stop workflow"}], "multiSelect": false}])


This retry-handling snippet mixes bash-style variable assignments (RETRY_RESULT=$(...)) with Python-style conditionals (if RETRY_STATUS == "max_retries":). Because this prompt is meant to be followed verbatim, the mixed syntax is likely to confuse users/agents and can lead to incorrect command execution. Consider rewriting this block in one consistent pseudo-language (or make it valid bash with [ ... ] conditionals) and keep it consistently fenced.

Copilot · 2026-03-30T15:29:59Z

src/mapify_cli/templates/commands/map-efficient.md

 # After Monitor returns valid=true, run deterministic test gate:
 TEST_GATE=$(python3 .map/scripts/map_step_runner.py run_test_gate)
 # If tests fail, treat as Monitor valid=false — feed output back to Actor
 if echo "$TEST_GATE" | python3 -c "import sys,json; d=json.load(sys.stdin); sys.exit(0 if d.get('passed') else 1)" 2>/dev/null; then
    # Tests passed — snapshot code state for artifact verification
    SNAPSHOT=$(python3 .map/scripts/map_step_runner.py snapshot_code_state)
    # Append git ref to review artifact header (if code-review file exists)
 fi

 # After Monitor returns:
 if monitor_output["valid"] == false:
-    # Increment retry counter (also triggered when test gate fails above)
-    if retry_count < 5:
-        # Go back to Phase: ACTOR with Monitor feedback
-        # Actor will fix issues and re-apply code
-
-        # === STUCK RECOVERY (at retry 3) ===
-        # At retry 3, intercept with intermediate recovery before retries 4-5.
-        # This gives Actor better context to break out of a stuck loop.
-        if retry_count == 3:
-            # Step 1: Check if research-agent already ran for this subtask
-            findings_file = f".map/{branch}/findings_{branch}.md"
-            if findings_file exists and has content for this subtask:
-                # Reuse existing findings (Edge Case 12: skip re-invocation)
-                recovery_context = read(findings_file)
-            else:
-                # Invoke research-agent for alternative approaches
-                Task(
-                    subagent_type="research-agent",
-                    description="Stuck recovery: find alternative approach",
-                    prompt=f"""Subtask {subtask_id} failed 3 monitor retries.
+    # Use orchestrator to handle retry: requeues ACTOR+MONITOR, increments retry_count,
+    # switches phase so workflow-gate allows edits, persists feedback for Actor.
+    RETRY_RESULT=$(python3 .map/scripts/map_orchestrator.py monitor_failed --feedback "MONITOR_FEEDBACK_TEXT")
+    # RETRY_RESULT.status is "retrying" or "max_retries"
+    # RETRY_RESULT.retry_count shows current attempt number
+    # RETRY_RESULT.feedback_file points to .map/<branch>/monitor_feedback_retry{N}.md
+
+    RETRY_STATUS=$(echo "$RETRY_RESULT" | python3 -c "import sys,json; print(json.load(sys.stdin).get('status',''))")
+    RETRY_COUNT=$(echo "$RETRY_RESULT" | python3 -c "import sys,json; print(json.load(sys.stdin).get('retry_count',0))")
+
+    if RETRY_STATUS == "max_retries":
+        # Escalate to user (retry limit reached after 5 attempts)
+        AskUserQuestion(questions=[{"question": "Monitor retry limit reached (5 attempts). How to proceed?", "header": "Retry limit", "options": [{"label": "Continue", "description": "Continue with more retries (manually edit step_state.json retry_count)"}, {"label": "Skip", "description": "Skip this subtask and move to next"}, {"label": "Abort", "description": "Stop workflow"}], "multiSelect": false}])


This retry-handling block mixes bash-style constructs (VAR=$(...), if echo ...; then) with Python-style conditionals (if RETRY_STATUS == "max_retries":) and pseudo-DSL like AskUserQuestion(...). Because map-efficient.md is an operational playbook prompt, the mixed syntax is likely to confuse users/agents. Consider rewriting the block in a single consistent pseudo-language (or make it valid bash throughout) and fence it accordingly.

Copilot · 2026-03-30T15:30:00Z

src/mapify_cli/templates/map/scripts/map_orchestrator.py

+    """
+    state_file = Path(f".map/{branch}/step_state.json")
+    state = StepState.load(state_file)
+


wave_monitor_failed() increments retry counts and mutates subtask_phases without checking that wave execution is configured, that subtask_id exists in the current wave, or that the subtask is currently in MONITOR (2.4). As written, calling this command at the wrong time (or with an unknown subtask_id) can silently create/overwrite state and corrupt wave progress. Add guards (e.g., require execution_waves configured, subtask_id in current wave, and current phase == "2.4") and return a structured error otherwise.

Suggested change

# Ensure wave-based execution is configured and we have a current wave

execution_waves = getattr(state, "execution_waves", None)

current_wave_index = getattr(state, "current_wave_index", None)

if not execution_waves or current_wave_index is None or not (

0 <= current_wave_index < len(execution_waves)

):

return {

"status": "error",

"error": "wave_execution_not_configured",

"subtask_id": subtask_id,

"message": (

"Wave-based execution is not properly configured; "

"cannot process monitor failure for this subtask."

),

}

# Verify the subtask belongs to the current wave

current_wave = execution_waves[current_wave_index]

if isinstance(current_wave, dict):

current_wave_subtasks = list(current_wave.keys())

elif isinstance(current_wave, list):

current_wave_subtasks = current_wave

else:

current_wave_subtasks = []

if subtask_id not in current_wave_subtasks:

return {

"status": "error",

"error": "subtask_not_in_current_wave",

"subtask_id": subtask_id,

"message": (

f"Subtask {subtask_id} is not part of the current wave; "

"cannot process monitor failure."

),

}

# Verify the subtask is currently in MONITOR phase ("2.4")

current_phase = state.subtask_phases.get(subtask_id)

if current_phase != "2.4":

return {

"status": "error",

"error": "invalid_subtask_phase",

"subtask_id": subtask_id,

"expected_phase": "2.4",

"actual_phase": current_phase,

"message": (

"wave_monitor_failed can only be called when the subtask "

"is in MONITOR phase (2.4)."

),

}

Copilot · 2026-03-30T15:30:00Z

.map/scripts/map_orchestrator.py

+            ),
+        },
+    )
+    if escalation is not None:


In the max_retries early-return path, monitor_failed() saves state but leaves current_step_phase as MONITOR. That reintroduces the original deadlock (workflow-gate blocks edits during MONITOR) at exactly the point where the user needs to decide next actions. Consider still resetting the phase/pending steps to an editing phase (e.g., ACTOR + requeue 2.3/2.4) while returning status="max_retries", or otherwise explicitly unlocking edits as part of the escalation path.

Suggested change

if escalation is not None:

if escalation is not None:

# On max retries, we still want to unlock edits and move the workflow

# out of MONITOR, otherwise the workflow-gate would deadlock exactly

# when the user needs to intervene. Reset to an editing phase and

# persist that state while returning status="max_retries".

state.pending_steps = ["2.3", "2.4"]

state.current_step_id = "2.3"

state.current_step_phase = "ACTOR"

Copilot · 2026-03-30T15:30:00Z

.map/scripts/map_orchestrator.py

+                f"Monitor retry limit reached ({state.max_retries} attempts). "
+                "Escalate to user."


The escalation message says "retry limit reached (N attempts)", but escalation happens when current_retries > max_retries (i.e., after exceeding the limit). Consider adjusting the wording to "retry limit exceeded" (or include the current attempt number) to avoid off-by-one confusion in user-facing output.

Suggested change

f"Monitor retry limit reached ({state.max_retries} attempts). "

"Escalate to user."

f"Monitor retry limit exceeded after {state.retry_count} attempts "

f"(max {state.max_retries}). Escalate to user."

…ce, map-debate) Remove unused skill definitions and their trigger rules from both dev config (.claude/) and shipped templates (src/mapify_cli/templates/).

…metadata Clean up remaining mentions after command/skill deletion. CHANGELOG entries preserved as historical record.

Copilot AI review requested due to automatic review settings March 27, 2026 12:35

Copilot started reviewing on behalf of azalio March 27, 2026 12:35 View session

chore: formatting cleanup and add DevOpsConf case study link

c3826b4

- Add DevOpsConf 2026 case study link to README.md - Minor formatting cleanup in ralph-iteration-logger.py, diagnostics.py, map_step_runner.py

Copilot AI reviewed Mar 27, 2026

View reviewed changes

azalio added 4 commits March 27, 2026 15:50

azalio requested a review from Copilot March 30, 2026 15:21

Copilot started reviewing on behalf of azalio March 30, 2026 15:21 View session

Copilot AI reviewed Mar 30, 2026

View reviewed changes

azalio added 4 commits March 30, 2026 23:04

chore: remove deprecated skills (map-workflows-guide, map-cli-referen…

c471017

…ce, map-debate) Remove unused skill definitions and their trigger rules from both dev config (.claude/) and shipped templates (src/mapify_cli/templates/).

fix: remove map-debate from expected command templates in tests

ab0b1ab

chore: remove all map-debate references from docs, tests, and plugin …

f460a94

…metadata Clean up remaining mentions after command/skill deletion. CHANGELOG entries preserved as historical record.

chore: ignore .map/scripts/.map/ in gitignore

3c95941

azalio merged commit 41c5e39 into main Mar 31, 2026
6 checks passed

+    # When re-queuing from 2.3/2.4, ensure these steps are no longer marked completed.
+    # This prevents len(completed_steps)-based logic (e.g., circuit breakers) from
+    # counting multiple retries of the same phase as additional completed tools.
+    state.completed_steps = [
+        step_id for step_id in state.completed_steps
+        if step_id not in ("2.3", "2.4")
+    ]

-            feedback = args.task_or_step or ""
+            parts = []
+            if args.task_or_step:
+                parts.append(args.task_or_step)
+            if getattr(args, "extra_args", None):
+                parts.extend(args.extra_args)
+            feedback = " ".join(parts) if parts else ""

		f"Monitor retry limit reached ({state.max_retries} attempts). "
		"Escalate to user."

-                f"Monitor retry limit reached ({state.max_retries} attempts). "
-                "Escalate to user."
+                f"Monitor retry limit exceeded: "
+                f"{state.retry_count} attempts (max {state.max_retries}). "
+                "Escalating to user."

+    # Guard: wave execution must be configured
+    execution_waves = getattr(state, "execution_waves", None)
+    if not execution_waves:
+        return {
+            "status": "error",
+            "error": "wave_execution_not_configured",
+            "message": (
+                "Monitor failure handling requires execution_waves to be "
+                "configured in step_state.json."
+            ),
+            "subtask_id": subtask_id,
+        }
+    # Guard: current wave index must be valid
+    current_wave_index = getattr(state, "current_wave_index", None)
+    if not isinstance(current_wave_index, int) or not (
+<= current_wave_index < len(execution_waves)
+    ):
+        return {
+            "status": "error",
+            "error": "invalid_wave_state",
+            "message": (
+                "Current wave index is not valid; cannot process monitor failure "
+                "for subtask."
+            ),
+            "subtask_id": subtask_id,
+        }
+    current_wave = execution_waves[current_wave_index]
+    # Derive the list of subtasks for the current wave in a tolerant way
+    if isinstance(current_wave, dict):
+        wave_subtasks = current_wave.get("subtasks")
+    else:
+        wave_subtasks = current_wave
+    if not wave_subtasks or subtask_id not in wave_subtasks:
+        return {
+            "status": "error",
+            "error": "unknown_subtask_for_current_wave",
+            "message": (
+                f"Subtask {subtask_id} is not part of the current execution wave; "
+                "cannot mark monitor as failed."
+            ),
+            "subtask_id": subtask_id,
+        }
+    # Guard: subtask must currently be in MONITOR phase (2.4)
+    current_phase = state.subtask_phases.get(subtask_id)
+    if current_phase != "2.4":
+        return {
+            "status": "error",
+            "error": "invalid_phase_for_monitor_failed",
+            "message": (
+                f"Cannot mark monitor failed for {subtask_id} because its current "
+                f"phase is {current_phase!r}, expected '2.4' (MONITOR)."
+            ),
+            "subtask_id": subtask_id,
+            "current_phase": current_phase,
+            "expected_phase": "2.4",
+        }

-    if escalation is not None:
+    if escalation is not None:
+        # On max retries, we still want to unlock edits and move the workflow
+        # out of MONITOR, otherwise the workflow-gate would deadlock exactly
+        # when the user needs to intervene. Reset to an editing phase and
+        # persist that state while returning status="max_retries".
+        state.pending_steps = ["2.3", "2.4"]
+        state.current_step_id = "2.3"
+        state.current_step_phase = "ACTOR"

Conversation

azalio commented Mar 27, 2026

Summary

Design

CLI interface

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels