feat(process-all): reliable state-write + batch-state-repair-sweep + resumption + ceiling tune #335
Merged
mitwilli-create merged 1 commit on May 29, 2026
…resumption + ceiling tune

Closes 6+ session recurring ask about Process All not draining the queue. Spec: data/spec-process-all-drain-and-visibility-2026-05-29.md

Five compounding root causes addressed:

1. NEW lib/process-all-state.mjs: atomic per-phase state-write helper that auto-fills the 11 required fields under jobs[jobId]. Closes the "type field never set when invoked from CLI" bug that made CLI-invoked runs invisible to dashboard SSE filters. Atomic tmp+rename per the state-write-without-disk-write bug class.
2. NEW lib/batch-state-repair-sweep.mjs: idempotent sweep that re-marks pipeline.md URLs as [x] when their batch is terminal-errored or terminal-expired. Closes the loop from PR #308 for batches that predate the markPipelineUrl fix (the 6 stuck batches with errored=179 each that re-queued the same URLs on every Process All run).
3. NEW resume-state at data/process-all-resume-state.json: SIGTERM/SIGINT handlers write resume-state before exit; the next invocation reads it (if within the 24h TTL) and skips already-completed phases. Cleared on success.
4. Tuned MAX_ROUNDS 10 → 30 (clamp [1, 100]) + new PROCESS_ALL_MAX_WALLCLOCK_MS (90 min default) + 2-consecutive-100%-error early-abort. Wall-clock cap is secondary to PER_RUN_CAP_PROCESS_ALL_USD=$1000; the dollar cap is the primary governor.
5. SSE last_run_complete toast now populates correctly because (a) the type field is reliably set on every write, and (b) the schema is guaranteed to have all 11 fields the SSE consumer expects.

Plus 3 new test files (25 cases total, all green) + 2 new AGENTS.md bug-class entries (process-orchestrator-without-resumable-state + state-file-without-schema-enforcement) + env-var table updates + Session Notes entry.

Pre-existing telemetry bug fixed alongside: the pipelineBefore/pipelineAfter regex assumed the URL came immediately after the [ ] marker, but the actual pipeline.md format puts company name + prose first. Replaced with the correct .startsWith('- [ ]') pattern used elsewhere in the orchestrator.

Kill switches: PROCESS_ALL_DISABLE_RESUME=true + PROCESS_ALL_DISABLE_REPAIR_SWEEP=true for emergency rollback. No code revert needed.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
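For readers unfamiliar with the tmp+rename pattern named in item 1, here is a minimal sketch of what such a helper might look like. It is not the code in lib/process-all-state.mjs: the function name `writeJobStateAtomic`, the three-field `REQUIRED_DEFAULTS` object (the PR enforces 11 fields, which are not listed in this description), and the temp-file naming are all assumptions made for illustration.

```javascript
import fs from 'node:fs';
import os from 'node:os';
import path from 'node:path';

// Hypothetical stand-in for the 11 required fields, which this PR text does not enumerate.
const REQUIRED_DEFAULTS = { type: 'process-all', phase: 'unknown', updatedAt: null };

// Merge a patch into jobs[jobId], fill any missing required field, then write atomically.
function writeJobStateAtomic(stateFile, jobId, patch) {
  let state = { jobs: {} };
  try {
    state = JSON.parse(fs.readFileSync(stateFile, 'utf8'));
  } catch (err) {
    // A missing file is expected on first write; anything else (e.g. corrupt JSON) is rethrown.
    if (err.code !== 'ENOENT') throw err;
  }
  state.jobs ??= {};
  state.jobs[jobId] = {
    ...REQUIRED_DEFAULTS,
    ...state.jobs[jobId],
    ...patch,
    updatedAt: new Date().toISOString(),
  };

  // Write to a sibling temp file, then rename over the target, so a reader
  // never observes a half-written state file.
  const tmp = `${stateFile}.${process.pid}.tmp`;
  fs.writeFileSync(tmp, JSON.stringify(state, null, 2));
  fs.renameSync(tmp, stateFile);
  return state.jobs[jobId];
}

// Usage: two per-phase writes into a throwaway directory.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'process-all-'));
const stateFile = path.join(dir, 'state.json');
writeJobStateAtomic(stateFile, 'job-1', { phase: 'scan' });
const record = writeJobStateAtomic(stateFile, 'job-1', { phase: 'batch' });
console.log(record.type, record.phase); // process-all batch
```

Because defaults are spread first, a caller that forgets `type` (the CLI path described above) still ends up with a tagged record, while an explicit value in the patch wins.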
c0d1e25 to
bea16b9
Summary
Closes 6+ session recurring ask about Process All not draining (last raised 2026-05-28 sess:b9b06508 as a /system-maintainer command-message).

Spec: `data/spec-process-all-drain-and-visibility-2026-05-29.md`. Handover: `data/handover-process-all-drain-and-visibility-2026-05-29.md`.

Diagnostic refinement vs. handover framing
The handover doc framed this as "state-write is broken (fields undefined)". The actual diagnostic was more nuanced:
- `{jobs: {jobId: {field}}}`: fields ARE on the per-job records.
- `type: 'process-all'` was never set (only set by the server's spawn path), so SSE filters at `dashboard-server.mjs:2768` excluded them.
- temperature+ idempotent pipeline-mark + sidebar wiring #308's `markPipelineUrl()` fix → URLs still `[ ]` → re-batched every run → never drain.
- `MAX_ROUNDS=10` was too low given realized batch sizes 73-179: 1000 URLs theoretical, much less in practice with error-rate noise.

The spec's "what to ship" remains correct; only the framing of the state-write bug needed refinement.
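Two of the findings above can be illustrated with a small sketch. The state object, the sample pipeline lines, and both function names are hypothetical; the actual filter at `dashboard-server.mjs:2768` is not shown in this PR description, so an exact-match check on `type` is assumed.

```javascript
// Hypothetical state in the {jobs: {jobId: {...fields}}} shape described above.
const state = {
  jobs: {
    'server-run': { type: 'process-all', phase: 'done' },
    'cli-run': { phase: 'done' }, // type never set: the pre-fix CLI path
  },
};

// Assumed SSE-style filter: only jobs tagged process-all are surfaced.
function visibleProcessAllJobs(currentState) {
  return Object.entries(currentState.jobs ?? {})
    .filter(([, job]) => job.type === 'process-all')
    .map(([jobId]) => jobId);
}

// Count still-pending pipeline entries by their checkbox prefix, not by URL position.
function countPending(pipelineMd) {
  return pipelineMd.split('\n').filter((line) => line.startsWith('- [ ]')).length;
}

// Made-up pipeline.md lines: company name and prose come before the URL.
const samplePipeline = [
  '- [ ] Acme Corp, platform role https://example.com/jobs/1',
  '- [x] Beta LLC, data role https://example.com/jobs/2',
  '- [ ] Gamma Inc, infra role https://example.com/jobs/3',
].join('\n');

console.log(visibleProcessAllJobs(state)); // [ 'server-run' ]
console.log(countPending(samplePipeline)); // 2
```

The untagged `cli-run` record has every other field but is still dropped by the filter, which is why the runs looked invisible rather than broken.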
What shipped (9 files, +1348/-12)
- `lib/process-all-state.mjs` (NEW): `writeProcessState` (atomic tmp+rename, 11 required fields) + `writeResumeState` / `readResumeState` / `shouldResume` / `clearResumeState`
- `lib/batch-state-repair-sweep.mjs` (NEW)
- `scripts/process-all-pipeline.mjs`
- `scripts/repair-pipeline-process-state.mjs` (NEW)
- `tests/process-all-state-write-invariant.test.mjs` (NEW)
- `tests/batch-state-repair-sweep.test.mjs` (NEW)
- `tests/process-all-resume-state.test.mjs` (NEW)
- `AGENTS.md`
- `CLAUDE.md`

Pre-fix snapshot
Test plan
- `node --test tests/process-all-state-write-invariant.test.mjs`: 7/7 green
- `node --test tests/batch-state-repair-sweep.test.mjs`: 8/8 green
- `node --test tests/process-all-resume-state.test.mjs`: 9/9 green
- `node scripts/process-all-pipeline.mjs --dry-run`: pipelineBefore=419, type tagged, all 11 fields populated under jobs[jobId]
- `node scripts/repair-pipeline-process-state.mjs --dry-run`: reconstructs 579 URLs across 5 recent batches (400 succeeded, 179 errored)
- `node test-all.mjs`: 77 pass, 1 fail (pre-existing tracker score-format issue unrelated; confirmed via `git stash && node verify-pipeline.mjs` on clean main)
- `node scripts/repair-pipeline-process-state.mjs` to seed SSE
- `/deploy-verify` for the canonical 9-phase ship

Rollback
- `launchctl setenv PROCESS_ALL_DISABLE_RESUME true`: skip resume detection
- `launchctl setenv PROCESS_ALL_DISABLE_REPAIR_SWEEP true`: skip dead-batch sweep
- `launchctl setenv PROCESS_ALL_MAX_BATCH_ROUNDS 10`: revert to legacy cap
- `git revert c0d1e25 && git push origin main`

🤖 Generated with Claude Code
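As a rough illustration of how the rollback variables above might be consumed, the sketch below reads them with the defaults and clamp stated in this PR (30 rounds, clamp [1, 100], 90 min wall-clock). The function names `clampInt` and `readProcessAllConfig`, the strict `'true'` comparison, and the wall-clock bounds are assumptions; the orchestrator's real parsing is not shown here.

```javascript
// Parse an integer env value, fall back when unset or invalid, then clamp to [min, max].
function clampInt(raw, fallback, min, max) {
  const n = Number.parseInt(raw ?? '', 10);
  if (Number.isNaN(n)) return fallback;
  return Math.min(max, Math.max(min, n));
}

// Hypothetical config reader; env defaults to the real process environment.
function readProcessAllConfig(env = process.env) {
  return {
    maxRounds: clampInt(env.PROCESS_ALL_MAX_BATCH_ROUNDS, 30, 1, 100),
    // Bounds for the wall-clock cap are an assumption; only the 90 min default is from the PR.
    maxWallclockMs: clampInt(env.PROCESS_ALL_MAX_WALLCLOCK_MS, 90 * 60 * 1000, 1, Number.MAX_SAFE_INTEGER),
    // Kill switches: a feature is on unless its variable is exactly 'true'.
    resumeEnabled: env.PROCESS_ALL_DISABLE_RESUME !== 'true',
    repairSweepEnabled: env.PROCESS_ALL_DISABLE_REPAIR_SWEEP !== 'true',
  };
}

// Usage: simulate the full environment-only rollback.
const rolledBack = readProcessAllConfig({
  PROCESS_ALL_DISABLE_RESUME: 'true',
  PROCESS_ALL_DISABLE_REPAIR_SWEEP: 'true',
  PROCESS_ALL_MAX_BATCH_ROUNDS: '10',
});
console.log(rolledBack.maxRounds, rolledBack.resumeEnabled, rolledBack.repairSweepEnabled);
// 10 false false
```

Clamping on read means an out-of-range override such as 500 degrades to the ceiling of 100 instead of disabling the cap entirely.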