
feat(process-all): reliable state-write + batch-state-repair-sweep + resumption + ceiling tune#335

Merged: mitwilli-create merged 1 commit into main from feat/process-all-drain-2026-05-29-claude-d9560376 on May 29, 2026


@mitwilli-create
Owner

Summary

Closes a recurring ask, raised across 6+ sessions, that Process All does not drain the queue (last raised 2026-05-28 in sess:b9b06508 as a /system-maintainer command-message). Spec: data/spec-process-all-drain-and-visibility-2026-05-29.md. Handover: data/handover-process-all-drain-and-visibility-2026-05-29.md.

Diagnostic refinement vs. handover framing

The handover doc framed this as "state-write is broken (fields undefined)". The actual diagnostic was more nuanced:

  • State-write IS working for server-spawned runs. The "Gap A" validator checked TOP-LEVEL fields, but the canonical schema is {jobs: {jobId: {field}}} — fields ARE on the per-job records.
  • CLI-invoked runs were invisible because type: 'process-all' was never set (only set by the server's spawn path), so SSE filters at dashboard-server.mjs:2768 excluded them.
  • 6 stuck batches with errored=179 each (canonical 179 vendor-deprecation pattern) predate the markPipelineUrl() fix in PR #308 (fix(dedupe + batch): cross-surface dupe defense + drop deprecated temperature + idempotent pipeline-mark + sidebar wiring) → URLs still [ ] → re-batched every run → never drain.
  • MAX_ROUNDS=10 was too low given realized batch sizes 73-179 — 1000 URLs theoretical, much less in practice with error-rate noise.
  • No wall-clock cap + no resume = killed runs lost work; next click re-triaged from scratch.

The spec's "what to ship" remains correct; only the framing of the state-write bug needed refinement.
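
To make the schema point concrete, here is a minimal sketch of a validator that looks under jobs[jobId] instead of at the top level. The field names in REQUIRED_FIELDS are hypothetical placeholders; this PR does not enumerate the real 11 fields, and findMissingJobFields is not a function from the codebase.

```javascript
// Hypothetical subset of required fields; the PR does not list the real 11.
const REQUIRED_FIELDS = ['type', 'status', 'startedAt'];

// Returns "jobId.field" for every required field missing from a per-job record.
function findMissingJobFields(state, requiredFields = REQUIRED_FIELDS) {
  const jobs =
    state && typeof state.jobs === 'object' && state.jobs !== null ? state.jobs : {};
  const missing = [];
  for (const [jobId, record] of Object.entries(jobs)) {
    for (const field of requiredFields) {
      if (record == null || record[field] === undefined) {
        missing.push(`${jobId}.${field}`);
      }
    }
  }
  return missing;
}

const state = {
  jobs: {
    'job-1': { type: 'process-all', status: 'done', startedAt: '2026-05-29T20:00:00Z' },
    // Shape of a CLI-invoked run before this PR: no type tag.
    'job-2': { status: 'running', startedAt: '2026-05-29T20:30:00Z' },
  },
};

console.log(findMissingJobFields(state)); // [ 'job-2.type' ]
```

A validator that read state.status or state.type directly would report every field as undefined for this same object, which matches the "fields undefined" framing in the handover.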

What shipped (9 files, +1348/-12)

| File | Change |
| --- | --- |
| lib/process-all-state.mjs (NEW) | writeProcessState (atomic tmp+rename, 11 required fields) + writeResumeState/readResumeState/shouldResume/clearResumeState |
| lib/batch-state-repair-sweep.mjs (NEW) | Idempotent dead-batch URL sweeper |
| scripts/process-all-pipeline.mjs | Atomic saveState; type tagging; repair sweep on start; resume detection; SIGTERM/SIGINT handlers; MAX_ROUNDS 10→30 (clamp [1,100]); wall-clock cap; error-rate early abort; kill switches; pre-existing telemetry-regex bug fixed |
| scripts/repair-pipeline-process-state.mjs (NEW) | One-time deterministic backfill, no LLM |
| tests/process-all-state-write-invariant.test.mjs (NEW) | 7 cases |
| tests/batch-state-repair-sweep.test.mjs (NEW) | 8 cases |
| tests/process-all-resume-state.test.mjs (NEW) | 9 cases |
| AGENTS.md | 4 env-var rows + 2 bug-class entries |
| CLAUDE.md | Session Notes entry |
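
For readers unfamiliar with the "atomic tmp+rename" pattern named in the first row, a minimal sketch follows. It is not the body of writeProcessState; writeJsonAtomic, the temp-file naming, and the throwaway directory are assumptions made for illustration.

```javascript
import { mkdtempSync, readFileSync, renameSync, writeFileSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Write the JSON to a temp file next to the target, then rename it over the target.
// A rename within one filesystem swaps the file in a single step, so a reader sees
// either the old contents or the new contents, never a half-written file.
function writeJsonAtomic(targetPath, data) {
  const tmpPath = `${targetPath}.${process.pid}.tmp`;
  writeFileSync(tmpPath, JSON.stringify(data, null, 2));
  renameSync(tmpPath, targetPath);
}

// Usage against a throwaway directory (illustrative paths only).
const dir = mkdtempSync(join(tmpdir(), 'process-all-'));
const statePath = join(dir, 'pipeline-process-state.json');
writeJsonAtomic(statePath, { jobs: { 'job-1': { type: 'process-all' } } });

console.log(JSON.parse(readFileSync(statePath, 'utf8')).jobs['job-1'].type); // process-all
```

The temp file lives in the same directory as the target on purpose: a rename across filesystems is not atomic. A stricter version might also fsync before renaming; that is left out here to keep the sketch short.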

Pre-fix snapshot

pipeline-pending: 303 (now 419 after sibling LinkedIn canonicalization landed mid-build)
applications.md rows: 250

Recent batches:
  msgbatch_01MmnS5GbuWFqxfwQFHJcoad: processing=0 succeeded=0 errored=179
  msgbatch_016sttTnvRgioiukcDbewzfJ: processing=0 succeeded=176 errored=0
  msgbatch_01YTtxzUMhpRYHAt9RmFhS1j: processing=0 succeeded=74 errored=0
  msgbatch_013eEQvCJfPasoXcbUhuJxnh: processing=0 succeeded=77 errored=0
  msgbatch_01Cp46z3ixpxgwSrT5J8Xgtu: processing=0 succeeded=73 errored=0

pipeline-process-state.json: had {jobs: {...}} populated with prior runs;
                              schema is correct, but SSE filter required
                              type:'process-all' tag that CLI runs never set.
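
The first batch in the snapshot (processing=0, succeeded=0, errored=179) is the shape the repair sweep targets. Below is a minimal sketch of such a sweep, not the contents of lib/batch-state-repair-sweep.mjs. The dead-batch rule, both function names, and the sample pipeline.md lines are assumptions.

```javascript
// Assumed rule: a batch is dead when nothing is processing and nothing succeeded.
// The PR says "terminal-errored or terminal-expired" without spelling out the test.
function isDeadBatch({ processing, succeeded, errored }) {
  return processing === 0 && succeeded === 0 && errored > 0;
}

// Re-mark "- [ ]" lines as "- [x]" when one of their tokens is a dead-batch URL.
// Lines already marked "- [x]" are skipped, so a second pass changes nothing.
function sweepDeadUrls(pipelineText, deadUrls) {
  let marked = 0;
  const lines = pipelineText.split('\n').map((line) => {
    if (!line.startsWith('- [ ]')) return line;
    const hit = line.split(/\s+/).some((token) => deadUrls.has(token));
    if (!hit) return line;
    marked += 1;
    return '- [x]' + line.slice('- [ ]'.length);
  });
  return { text: lines.join('\n'), marked };
}

const pipeline = [
  '- [ ] Acme, staff engineer role https://example.com/jobs/1',
  '- [ ] Globex, platform role https://example.com/jobs/2',
].join('\n');
const dead = new Set(['https://example.com/jobs/1']);

const first = sweepDeadUrls(pipeline, dead);
const second = sweepDeadUrls(first.text, dead);

console.log(isDeadBatch({ processing: 0, succeeded: 0, errored: 179 })); // true
console.log(first.marked, second.marked); // 1 0
```

Matching whole whitespace-separated tokens rather than substrings keeps a URL ending in /jobs/1 from also matching /jobs/10.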

Test plan

  • node --test tests/process-all-state-write-invariant.test.mjs — 7/7 green
  • node --test tests/batch-state-repair-sweep.test.mjs — 8/8 green
  • node --test tests/process-all-resume-state.test.mjs — 9/9 green
  • Dry-run node scripts/process-all-pipeline.mjs --dry-run — pipelineBefore=419, type tagged, all 11 fields populated under jobs[jobId]
  • Dry-run node scripts/repair-pipeline-process-state.mjs --dry-run — reconstructs 579 URLs across 5 recent batches (400 succeeded, 179 errored)
  • node test-all.mjs — 77 pass, 1 fail (pre-existing tracker score-format issue unrelated; confirmed via git stash && node verify-pipeline.mjs on clean main)
  • Concurrent-instance collision handled per protocol (paused, surfaced coordination prompt, re-applied edits cleanly after sibling confirmed completion)
  • Post-merge: node scripts/repair-pipeline-process-state.mjs to seed SSE
  • Post-merge: trigger Process All from dashboard; observe sidebar count decrementing in real-time + first-run sweep marking the 6 stuck batches (~890 URLs)
  • Post-merge: kill mid-flight (Ctrl-C); next click resumes from last completed round
  • /deploy-verify for the canonical 9-phase ship
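
The kill-and-resume check above depends on resume detection honoring the 24h TTL described in the commit message. A minimal sketch of that decision follows. It is not the real shouldResume from lib/process-all-state.mjs; the savedAt and lastCompletedRound field names and the options object are hypothetical.

```javascript
const RESUME_TTL_MS = 24 * 60 * 60 * 1000; // 24h TTL, per the commit message

// Decide whether a saved resume-state should be honored.
// `savedAt` is assumed to be an epoch-millisecond timestamp.
function shouldResume(resumeState, { now = Date.now(), disabled = false } = {}) {
  if (disabled || !resumeState) return false;
  if (typeof resumeState.savedAt !== 'number') return false;
  const age = now - resumeState.savedAt;
  return age >= 0 && age < RESUME_TTL_MS;
}

const now = Date.parse('2026-05-29T20:00:00Z');
const fresh = { savedAt: now - 60 * 60 * 1000, lastCompletedRound: 4 }; // 1h old
const stale = { savedAt: now - 25 * 60 * 60 * 1000, lastCompletedRound: 4 }; // 25h old

console.log(shouldResume(fresh, { now })); // true
console.log(shouldResume(stale, { now })); // false
console.log(shouldResume(fresh, { now, disabled: true })); // false
```

Passing `now` in explicitly keeps the TTL logic testable without mocking the clock. The `disabled` option stands in for PROCESS_ALL_DISABLE_RESUME.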

Rollback

  • Env-flag-only reverts (no code revert needed):
    • launchctl setenv PROCESS_ALL_DISABLE_RESUME true — skip resume detection
    • launchctl setenv PROCESS_ALL_DISABLE_REPAIR_SWEEP true — skip dead-batch sweep
    • launchctl setenv PROCESS_ALL_MAX_BATCH_ROUNDS 10 — revert to legacy cap
  • Hard revert: git revert 0fe68a9 && git push origin main (0fe68a9 is the commit merged into main; the pre-merge head c0d1e25 was replaced by a force-push)
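
As a sketch of how the orchestrator might read these flags, assuming only the literal string "true" enables a kill switch and using the [1,100] clamp and default of 30 stated elsewhere in this PR (the helper names are hypothetical):

```javascript
// Boolean kill switch: only the literal string "true" turns it on (assumed convention).
function flagEnabled(env, name) {
  return env[name] === 'true';
}

// Integer setting: fall back to a default when unset or unparsable, then clamp.
function clampedInt(env, name, { fallback, min, max }) {
  const parsed = Number.parseInt(env[name] ?? '', 10);
  const value = Number.isNaN(parsed) ? fallback : parsed;
  return Math.min(max, Math.max(min, value));
}

function readProcessAllConfig(env = process.env) {
  return {
    resumeDisabled: flagEnabled(env, 'PROCESS_ALL_DISABLE_RESUME'),
    repairSweepDisabled: flagEnabled(env, 'PROCESS_ALL_DISABLE_REPAIR_SWEEP'),
    maxRounds: clampedInt(env, 'PROCESS_ALL_MAX_BATCH_ROUNDS', {
      fallback: 30,
      min: 1,
      max: 100,
    }),
  };
}

console.log(readProcessAllConfig({}).maxRounds); // 30

const legacy = readProcessAllConfig({
  PROCESS_ALL_MAX_BATCH_ROUNDS: '10',
  PROCESS_ALL_DISABLE_RESUME: 'true',
});
console.log(legacy.maxRounds, legacy.resumeDisabled); // 10 true

console.log(readProcessAllConfig({ PROCESS_ALL_MAX_BATCH_ROUNDS: '500' }).maxRounds); // 100
```

Taking `env` as a parameter lets the rollback values be exercised in a test without touching launchctl or the real process environment.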

🤖 Generated with Claude Code

…resumption + ceiling tune

Closes 6+ session recurring ask about Process All not draining the queue.
Spec: data/spec-process-all-drain-and-visibility-2026-05-29.md

Five compounding root causes addressed:

1. NEW lib/process-all-state.mjs — atomic per-phase state-write helper
   that auto-fills the 11 required fields under jobs[jobId]. Closes the
   "type field never set when invoked from CLI" bug that made CLI-invoked
   runs invisible to dashboard SSE filters. Atomic tmp+rename per the
   state-write-without-disk-write bug class.

2. NEW lib/batch-state-repair-sweep.mjs — idempotent sweep that re-marks
   pipeline.md URLs as [x] when their batch is terminal-errored or
   terminal-expired. Closes the loop from PR #308 for batches that
   predate the markPipelineUrl fix (the 6 stuck batches with errored=179
   each that re-queued the same URLs on every Process All run).

3. NEW resume-state at data/process-all-resume-state.json — SIGTERM/SIGINT
   handlers write resume-state before exit; next invocation reads it (if
   < 24h TTL) + skips already-completed phases. Cleared on success.

4. Tuned MAX_ROUNDS 10 → 30 (clamp [1, 100]) + new PROCESS_ALL_MAX_WALLCLOCK_MS
   (90 min default) + 2-consecutive-100%-error early-abort. Wall-clock cap
   is secondary to PER_RUN_CAP_PROCESS_ALL_USD=$1000 — dollar cap is the
   primary governor.

5. SSE last_run_complete toast now populates correctly because (a) the
   type field is reliably set on every write, and (b) the schema is
   guaranteed to have all 11 fields the SSE consumer expects.
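
The secondary governors from item 4 can be sketched as one stop check run before each round. This is an illustration under assumptions, not the orchestrator's code: stopReason, its argument shape, and the per-round { total, errored } records are hypothetical. The dollar cap, which is the primary governor, is left out.

```javascript
// Return why the round loop should stop, or null to keep going.
// Thresholds follow the commit message: 90 min wall clock and
// 2 consecutive rounds in which every request errored.
function stopReason({
  round,
  maxRounds,
  startedAt,
  now,
  recentRounds,
  maxWallclockMs = 90 * 60 * 1000,
}) {
  if (round >= maxRounds) return 'max-rounds';
  if (now - startedAt >= maxWallclockMs) return 'wall-clock';
  const allErrored = (r) => r.total > 0 && r.errored === r.total;
  const lastTwo = recentRounds.slice(-2);
  if (lastTwo.length === 2 && lastTwo.every(allErrored)) return 'error-rate';
  return null;
}

const base = { round: 2, maxRounds: 30, startedAt: 0, now: 10 * 60 * 1000 };
const bad = { total: 179, errored: 179 };
const good = { total: 176, errored: 0 };

console.log(stopReason({ ...base, recentRounds: [bad, bad] })); // error-rate
console.log(stopReason({ ...base, recentRounds: [good, bad] })); // null
console.log(stopReason({ ...base, now: 91 * 60 * 1000, recentRounds: [] })); // wall-clock
console.log(stopReason({ ...base, round: 30, recentRounds: [] })); // max-rounds
```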

Plus 3 new test files (24 cases total: 7 + 8 + 9, all green) + 2 new AGENTS.md
bug-class entries (process-orchestrator-without-resumable-state +
state-file-without-schema-enforcement) + env-var table updates +
Session Notes entry.

Pre-existing telemetry bug fixed alongside: pipelineBefore/pipelineAfter
regex assumed URL came immediately after the [ ] marker, but actual
pipeline.md format puts company name + prose first. Replaced with the
correct .startsWith('- [ ]') pattern used elsewhere in the orchestrator.
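
A small illustration of why the two approaches count differently. The regex below is a stand-in for the old pattern, which is not shown in this PR, and the sample lines are invented to match the described "company name + prose first" format.

```javascript
const pipeline = [
  '- [ ] Acme, staff engineer role https://example.com/jobs/1',
  '- [x] Globex, platform role https://example.com/jobs/2',
  '- [ ] https://example.com/jobs/3',
  '## Notes',
].join('\n');

// Stand-in for the old pattern: expects the URL right after the marker.
const urlFirst = /^- \[ \] https?:\/\//;

const lines = pipeline.split('\n');
const byRegex = lines.filter((line) => urlFirst.test(line)).length;
const byPrefix = lines.filter((line) => line.startsWith('- [ ]')).length;

console.log(byRegex, byPrefix); // 1 2
```

The URL-first pattern misses the Acme line because prose sits between the marker and the URL, so pipelineBefore and pipelineAfter would undercount pending entries.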

Kill switches: PROCESS_ALL_DISABLE_RESUME=true + PROCESS_ALL_DISABLE_REPAIR_SWEEP=true
for emergency rollback. No code revert needed.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@mitwilli-create force-pushed the feat/process-all-drain-2026-05-29-claude-d9560376 branch from c0d1e25 to bea16b9 on May 29, 2026, 20:36
@mitwilli-create merged commit 0fe68a9 into main on May 29, 2026
8 checks passed