Merged
4 changes: 4 additions & 0 deletions templates/references/self-improvement.md
@@ -43,6 +43,10 @@ at the end of every session in a MaxsimCLI project. Each entry records:
- Which tasks were attempted repeatedly without a commit (likely failed).
- Long-term trends across many sessions.

### MEMORY.md Size Limit

Keep MEMORY.md under 200 lines (the Claude Code context loading limit). The `maxsim-capture-learnings` Stop hook enforces this by pruning at 180 lines, leaving headroom. When the file approaches the limit, the oldest entries are removed first. Each entry should be concise (3-5 lines) to maximize the number of sessions that fit.
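
The pruning rule described here can be sketched as follows. This is an illustrative model only: it assumes entries are appended at the bottom (so the oldest sit at the top) and are delimited by `## Session` headers; the real hook's entry format may differ.

```python
ENTRY_HEADER = "## Session"  # assumed entry delimiter, not the hook's actual format
PRUNE_AT = 180               # prune threshold, leaving headroom under the 200-line cap

def prune_memory(lines, limit=PRUNE_AT):
    """Remove the oldest (topmost) entries until the file is under `limit` lines."""
    while len(lines) >= limit:
        starts = [i for i, line in enumerate(lines) if line.startswith(ENTRY_HEADER)]
        if len(starts) < 2:
            break  # only one entry left; nothing safe to prune
        # Drop everything from the first entry header up to the second one.
        del lines[starts[0]:starts[1]]
    return lines
```

Keeping individual entries short matters because pruning works in whole-entry units: one oversized entry can evict several concise ones.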

Copilot AI Mar 25, 2026

This says “Each entry should be concise (3–5 lines)”, but the maxsim-capture-learnings Stop hook’s actual format includes a session header plus a bullet per commit (and optional pattern), so entries can easily exceed 5 lines even in normal usage. Suggest adjusting this guidance to match the real output (e.g., encourage keeping commit lists short / summarizing patterns), or clarify that the hook-generated entry may be longer than 3–5 lines.

Suggested change
Keep MEMORY.md under 200 lines (the Claude Code context loading limit). The `maxsim-capture-learnings` Stop hook enforces this by pruning at 180 lines, leaving headroom. When the file approaches the limit, the oldest entries are removed first. Each entry should be concise (3-5 lines) to maximize the number of sessions that fit.
Keep MEMORY.md under 200 lines (the Claude Code context loading limit). The `maxsim-capture-learnings` Stop hook enforces this by pruning at 180 lines, leaving headroom. When the file approaches the limit, the oldest entries are removed first. Aim to keep each entry concise (around 3–5 lines) by keeping commit lists short or summarizing patterns, but note that hook-generated entries may be longer when there are many commits.

---

## 3. Results Tracking
9 changes: 5 additions & 4 deletions templates/skills/autoresearch/references/loop-protocol.md
@@ -93,10 +93,11 @@ If verification exceeds 2x normal time, kill and treat as crash.

Some metrics are inherently noisy (benchmark times, ML accuracy). Strategies:

- **Multi-run verification:** Run verify N times, use the median.
- **Minimum improvement threshold:** Ignore improvements smaller than the noise floor.
- **Confirmation run:** Re-verify before making a final keep decision.
- **Environment pinning:** Pin random seeds, use deterministic test ordering, flush caches.
- **For improvements of 1–5%:** Run the verify command 3 times and use the median result.
- **For improvements >5%:** Run the verify command 5 times and use the median result.
- **Minimum improvement threshold:** Ignore improvements smaller than the noise floor (typically 0.5% for benchmarks).

Copilot AI Mar 25, 2026

The multi-run guidance covers improvements of 1–5% and >5%, but doesn’t say what to do for small-but-real changes between the noise floor and 1% (e.g., 0.6% when noise floor is 0.5%). Consider adding an explicit rule for 0.5–1% (or “<1%”) improvements (e.g., treat as noise unless confirmed with 5 runs, or require an extra confirmation run) so the protocol is complete.

Suggested change
- **Minimum improvement threshold:** Ignore improvements smaller than the noise floor (typically 0.5% for benchmarks).
- **For improvements between the noise floor and 1% (e.g., 0.5–1%):** Treat as noise by default. Only consider keeping if you run at least 5 verification runs and the median improvement remains above the noise floor.
- **Minimum improvement threshold:** Ignore improvements smaller than the noise floor (typically 0.5% for benchmarks), and apply the above rule for any borderline improvements between the noise floor and 1%.

- **Confirmation run:** After accepting an improvement, re-verify once more before making the final keep decision.
- **Environment pinning:** Pin random seeds, use deterministic test ordering, flush caches between runs.
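The multi-run rules above can be sketched as a small decision procedure. This is a minimal illustration, not part of the protocol: `run_verify` is a hypothetical callable returning one benchmark measurement (lower is better), and the thresholds are taken from the text.

```python
import statistics

NOISE_FLOOR = 0.005  # 0.5%, the typical benchmark noise floor noted above

def verified_improvement(run_verify, baseline):
    """Median improvement ratio over the prescribed number of runs,
    or 0.0 if the change falls at or below the noise floor."""
    first = (baseline - run_verify()) / baseline  # size the apparent gain
    if first <= NOISE_FLOOR:
        return 0.0                                # below the noise floor: ignore
    runs = 3 if first <= 0.05 else 5              # 1-5% gain: 3 runs; >5%: 5 runs
    median = statistics.median(run_verify() for _ in range(runs))
    improvement = (baseline - median) / baseline
    return improvement if improvement > NOISE_FLOOR else 0.0
```

Using the median rather than the mean keeps one anomalous run from flipping the keep/discard decision.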

## Phase 5.5: Guard (Regression Check)

39 changes: 39 additions & 0 deletions templates/skills/verification/SKILL.md
@@ -167,3 +167,42 @@ Do not attempt a 4th run without user acknowledgment and revised instructions.
| Skipping Gate 4 after Gate 3 passes | Declaring done without regression check | Gate 3 and Gate 4 are both required; neither is optional |
| Conflating "no errors" with "correct output" | Exit code 0 but wrong behavior | Evidence must show correct output, not just absence of error |
| Writing evidence after the fact | Constructing output from memory | Run the command, capture the output, paste it verbatim |

---

## 5-Step Verification Process

When verification fails, follow this structured process:

1. **Run the check command one final time** — capture fresh output as evidence
2. **Construct diagnostic summary** — compare spec expectations vs actual output
3. **Identify root cause** — is it a spec problem, environment problem, or implementation problem?
4. **Propose next step** — rewrite spec, fix environment, reduce scope, or escalate
5. **Escalate if unresolved** — create a diagnostic GitHub Issue with all evidence
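
Steps 1 and 2 of this process can be sketched as follows. The command string and the summary's field names here are illustrative assumptions, not the skill's actual API.

```python
import subprocess

def diagnose(check_cmd, expected):
    """Re-run the check once, capture fresh output, and build a
    spec-vs-actual diagnostic summary (steps 1 and 2 above)."""
    result = subprocess.run(check_cmd, shell=True, capture_output=True, text=True)
    return {
        "command": check_cmd,
        "exit_code": result.returncode,
        "expected": expected,                                 # from the spec
        "actual": (result.stdout + result.stderr).strip(),    # fresh evidence
    }
```

The fresh capture matters: evidence reconstructed from memory (the anti-pattern in the table above) is exactly what this step prevents.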

---

## GitHub Issue Escalation

When a task fails verification after 3 attempts, escalate by creating (or commenting on) a GitHub Issue:

Copilot AI Mar 25, 2026

This section largely duplicates the earlier “After 3 Failures” escalation list but changes the escalation target to a GitHub Issue. Because similar retry/escalation logic in templates/workflows/execute.md triggers a diagnostic issue only after 4 total attempts, it’s easy for these docs to drift or conflict. Suggest consolidating into a single canonical escalation section (or explicitly scoping it: single-task vs /maxsim:execute phase execution) and aligning the attempt threshold wording.

Suggested change
When a task fails verification after 3 attempts, escalate by creating (or commenting on) a GitHub Issue:
When verification still fails after the final allowed attempt (as defined by the workflow configuration), escalate by creating (or commenting on) a GitHub Issue:

1. **Original task spec** — quoted from the plan comment
2. **What was attempted** — brief factual summary of each attempt
3. **The specific gate that failed** — exact error output from each run
4. **Root cause analysis** — spec/environment/implementation classification
5. **Proposed next step** — rewrite spec, fix environment, reduce scope, or request user input
Comment on lines +189 to +193

Copilot AI Mar 25, 2026

In this escalation checklist, “quoted from the plan comment” is ambiguous in contexts that don’t have a distinct plan comment (e.g., quick tasks or non-GitHub flows). Consider wording this as “quoted from the plan comment (if present) or the issue/task description” so the instruction is actionable in all cases.

Label the issue with `type:bug` and `maxsim:auto`.
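
The five-part issue body can be assembled mechanically before handing it to the GitHub CLI or API. A minimal sketch; the section titles are illustrative, not mandated by the skill.

```python
def escalation_body(spec, attempts, failed_gate_output, root_cause, next_step):
    """Assemble the five-part diagnostic issue body described above.
    Section headings are hypothetical examples, not a required format."""
    sections = [
        ("Original task spec", spec),
        ("What was attempted", "\n".join("- %s" % a for a in attempts)),
        ("Failed gate output", failed_gate_output),
        ("Root cause analysis", root_cause),
        ("Proposed next step", next_step),
    ]
    return "\n\n".join("## %s\n\n%s" % (title, body) for title, body in sections)
```

Keeping the body machine-assembled ensures no section is silently omitted when escalating under time pressure.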

---

## Fresh Executor Context

Each retry attempt MUST use a fresh executor agent:

- Do NOT reuse the previous executor (spawn a new one)
- Provide the full task spec (do not assume prior context carries over)
- Include the diagnostic summary from the failed run
- Include revised instructions based on root cause analysis

Treat each fresh executor as a cold start. Do NOT reference or build upon any previous attempt's reasoning or partial work.
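
A cold-start prompt obeying these rules might be assembled like this. The structure and headings are hypothetical; the point is that only the spec, the diagnostic summary, and the revised instructions cross the retry boundary.

```python
def fresh_executor_prompt(task_spec, diagnostic_summary, revised_instructions):
    """Build a cold-start prompt for a newly spawned executor: full spec,
    last diagnostic, revised instructions, and nothing carried over from
    the previous attempt's reasoning. (Illustrative structure only.)"""
    return "\n\n".join([
        "## Task spec (complete)",
        task_spec,
        "## Diagnostic summary from failed run",
        diagnostic_summary,
        "## Revised instructions",
        revised_instructions,
    ])
```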
Comment on lines +201 to +208

Copilot AI Mar 25, 2026

“Each retry attempt MUST use a fresh executor agent” conflicts with the earlier per-attempt rules in this same file (Attempt 2 is described as “Fresh context” while Attempt 3 is “Fresh agent context”). Either update the Attempt 2/3 bullets to match this requirement, or narrow this section to the specific attempts/workflows where a new executor is actually spawned, so readers don’t get contradictory instructions.

Suggested change
Each retry attempt MUST use a fresh executor agent:
- Do NOT reuse the previous executor (spawn a new one)
- Provide the full task spec (do not assume prior context carries over)
- Include the diagnostic summary from the failed run
- Include revised instructions based on root cause analysis
Treat each fresh executor as a cold start. Do NOT reference or build upon any previous attempt's reasoning or partial work.
For retries that spawn a new executor agent, treat that agent as a fresh executor:
- When you spawn a new executor, do NOT reuse any previous executor state
- Provide the full task spec (do not assume prior context carries over)
- Include the diagnostic summary from the failed run
- Include revised instructions based on root cause analysis
Treat each fresh executor as a cold start. Do NOT reference or build upon any previous attempt's reasoning or partial work within that executor.
