feat(process-all): reliable state-write + batch-state-repair-sweep + resumption + ceiling tune by mitwilli-create · Pull Request #335 · mitwilli-create/career-ops

mitwilli-create · 2026-05-29T20:32:00Z

Summary

Closes a recurring ask, raised across 6+ sessions, that Process All does not drain the queue (last raised 2026-05-28 in sess:b9b06508 as a /system-maintainer command-message). Spec: data/spec-process-all-drain-and-visibility-2026-05-29.md. Handover: data/handover-process-all-drain-and-visibility-2026-05-29.md.

Diagnostic refinement vs. handover framing

The handover doc framed this as "state-write is broken (fields undefined)". The actual diagnostic was more nuanced:

- State-write IS working for server-spawned runs. The "Gap A" validator checked TOP-LEVEL fields, but the canonical schema is {jobs: {jobId: {field}}}, so the fields ARE present on the per-job records.
- CLI-invoked runs were invisible because type: 'process-all' was set only by the server's spawn path, so the SSE filters at dashboard-server.mjs:2768 excluded them.
- 6 stuck batches with errored=179 each (the canonical 179 vendor-deprecation pattern) predate the markPipelineUrl() fix in PR #308 (fix(dedupe + batch): cross-surface dupe defense + drop deprecated temperature + idempotent pipeline-mark + sidebar wiring). Their URLs were still marked [ ], so they were re-batched on every run and the queue never drained.
- MAX_ROUNDS=10 was too low given realized batch sizes of 73-179: 1000 URLs in theory, far fewer in practice once error-rate noise is factored in.
- There was no wall-clock cap and no resume, so killed runs lost their work and the next click re-triaged from scratch.

The spec's "what to ship" remains correct; only the framing of the state-write bug needed refinement.
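The gap between the top-level check and the per-job schema, and the effect of the missing type tag, can be illustrated with a minimal sketch. The field names in `REQUIRED_FIELDS` and all three function names are hypothetical; the PR states that 11 fields are required per job but does not list them here, and the real filter at dashboard-server.mjs:2768 may differ.

```javascript
// Hypothetical subset of required fields. The PR says 11 fields are required
// per job but does not list them here, so these names are placeholders.
const REQUIRED_FIELDS = ['type', 'status', 'startedAt'];

// Top-level check (the shape of the mistaken "Gap A" validation).
function missingTopLevel(state, required = REQUIRED_FIELDS) {
  return required.filter((field) => state[field] === undefined);
}

// Per-job check against the canonical {jobs: {jobId: {field}}} schema.
function missingPerJob(state, required = REQUIRED_FIELDS) {
  const report = {};
  for (const [jobId, record] of Object.entries(state.jobs ?? {})) {
    const missing = required.filter((field) => record?.[field] === undefined);
    if (missing.length > 0) report[jobId] = missing;
  }
  return report;
}

// Illustrative SSE-style filter: only records tagged 'process-all' are visible.
function visibleProcessAllJobs(state) {
  return Object.entries(state.jobs ?? {})
    .filter(([, record]) => record?.type === 'process-all')
    .map(([jobId]) => jobId);
}
```

With a state file holding one server-spawned record and one CLI record that lacks `type`, the top-level check reports every field as missing even though the data is present, while the per-job check flags only the untagged CLI record, which is also the one the filter hides.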

What shipped (9 files, +1348/-12)

| File | Change |
|---|---|
| `lib/process-all-state.mjs` (NEW) | `writeProcessState` (atomic tmp+rename, 11 required fields) + `writeResumeState`/`readResumeState`/`shouldResume`/`clearResumeState` |
| `lib/batch-state-repair-sweep.mjs` (NEW) | Idempotent dead-batch URL sweeper |
| `scripts/process-all-pipeline.mjs` | Atomic saveState; type tagging; repair sweep on start; resume detection; SIGTERM/SIGINT handlers; MAX_ROUNDS 10→30 (clamp [1,100]); wall-clock cap; error-rate early abort; kill switches; pre-existing telemetry-regex bug fixed |
| `scripts/repair-pipeline-process-state.mjs` (NEW) | One-time deterministic backfill, no LLM |
| `tests/process-all-state-write-invariant.test.mjs` (NEW) | 7 cases |
| `tests/batch-state-repair-sweep.test.mjs` (NEW) | 8 cases |
| `tests/process-all-resume-state.test.mjs` (NEW) | 9 cases |
| `AGENTS.md` | 4 env-var rows + 2 bug-class entries |
| `CLAUDE.md` | Session Notes entry |
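For readers unfamiliar with the "atomic tmp+rename" pattern named in the table, here is a minimal sketch of the general technique. It is not the code in `lib/process-all-state.mjs`: the function name `atomicWriteJson` and the temp-file naming scheme are assumptions made for illustration.

```javascript
// Sketch of an atomic JSON write: write a temp file beside the target, then
// rename it over the target. Dynamic import() keeps the snippet runnable as
// either ESM or CommonJS; the repo itself uses .mjs modules.
async function atomicWriteJson(targetPath, data) {
  const fs = await import('node:fs/promises');
  const path = await import('node:path');

  // The temp file lives in the target's directory so rename() never crosses
  // a filesystem boundary. The naming scheme is hypothetical.
  const tmpPath = path.join(
    path.dirname(targetPath),
    `.${path.basename(targetPath)}.${process.pid}.${Date.now()}.tmp`
  );

  try {
    await fs.writeFile(tmpPath, JSON.stringify(data, null, 2) + '\n', 'utf8');
    await fs.rename(tmpPath, targetPath); // atomic replace on POSIX filesystems
  } catch (err) {
    await fs.rm(tmpPath, { force: true }); // best-effort cleanup of the temp file
    throw err;
  }
}
```

The point of the pattern is that a reader of the target path sees either the previous complete file or the new complete file, never a half-written one, even if the writer is killed mid-write.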

Pre-fix snapshot

pipeline-pending: 303 (now 419 after sibling LinkedIn canonicalization landed mid-build)
applications.md rows: 250

Recent batches:
  msgbatch_01MmnS5GbuWFqxfwQFHJcoad: processing=0 succeeded=0 errored=179
  msgbatch_016sttTnvRgioiukcDbewzfJ: processing=0 succeeded=176 errored=0
  msgbatch_01YTtxzUMhpRYHAt9RmFhS1j: processing=0 succeeded=74 errored=0
  msgbatch_013eEQvCJfPasoXcbUhuJxnh: processing=0 succeeded=77 errored=0
  msgbatch_01Cp46z3ixpxgwSrT5J8Xgtu: processing=0 succeeded=73 errored=0

pipeline-process-state.json: had {jobs: {...}} populated with prior runs;
                              schema is correct, but SSE filter required
                              type:'process-all' tag that CLI runs never set.
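The fully errored batch in the snapshot is the kind of dead batch the repair sweep targets. A minimal sketch of an idempotent sweep follows, assuming pipeline.md entries are lines beginning with `- [ ]` or `- [x]` that contain the job URL somewhere after the company name. The function name, the URL-extraction regex, and the sample lines are illustrative assumptions, not the contents of `lib/batch-state-repair-sweep.mjs`.

```javascript
// Re-mark pending pipeline entries as done when their URL belongs to a dead
// (terminal-errored or terminal-expired) batch. Returns the updated text and
// the number of lines changed.
function sweepDeadBatchUrls(pipelineMarkdown, deadUrls) {
  const dead = new Set(deadUrls);
  let marked = 0;

  const lines = pipelineMarkdown.split('\n').map((line) => {
    // Only unchecked entries are candidates; checked lines pass through.
    if (!line.startsWith('- [ ]')) return line;

    // The URL may follow a company name and prose, so search the whole line.
    const urls = line.match(/https?:\/\/\S+/g) ?? [];
    if (!urls.some((url) => dead.has(url))) return line;

    marked += 1;
    return '- [x]' + line.slice('- [ ]'.length);
  });

  return { markdown: lines.join('\n'), marked };
}
```

Because a swept line no longer starts with `- [ ]`, running the sweep a second time with the same URL list changes nothing, which is what makes it safe to run at the start of every invocation.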

Test plan

Rollback

Env-flag-only reverts (no code revert needed):
- launchctl setenv PROCESS_ALL_DISABLE_RESUME true — skip resume detection
- launchctl setenv PROCESS_ALL_DISABLE_REPAIR_SWEEP true — skip dead-batch sweep
- launchctl setenv PROCESS_ALL_MAX_BATCH_ROUNDS 10 — revert to legacy cap
Hard revert: git revert c0d1e25 && git push origin main
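As a rough illustration of how these flags could be consumed, the sketch below reads the three variables and applies the [1, 100] clamp with the new default of 30. The helper names and the rule that only the literal string `true` enables a kill switch are assumptions; the actual parsing in `scripts/process-all-pipeline.mjs` may differ.

```javascript
// Parse an integer env value and clamp it into [min, max]; fall back to the
// default when the variable is unset or not a number.
function clampIntEnv(raw, fallback, min, max) {
  const parsed = Number.parseInt(raw ?? '', 10);
  if (Number.isNaN(parsed)) return fallback;
  return Math.min(max, Math.max(min, parsed));
}

// Hypothetical config reader. Treating only the string 'true' as enabled is
// an assumption made for this sketch.
function readProcessAllConfig(env = process.env) {
  return {
    maxRounds: clampIntEnv(env.PROCESS_ALL_MAX_BATCH_ROUNDS, 30, 1, 100),
    disableResume: env.PROCESS_ALL_DISABLE_RESUME === 'true',
    disableRepairSweep: env.PROCESS_ALL_DISABLE_REPAIR_SWEEP === 'true',
  };
}
```

Under these assumptions, setting `PROCESS_ALL_MAX_BATCH_ROUNDS` to 10 restores the legacy cap, while out-of-range values such as 0 or 500 are pulled back to 1 and 100 respectively.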

🤖 Generated with Claude Code

…resumption + ceiling tune

Closes 6+ session recurring ask about Process All not draining the queue. Spec: data/spec-process-all-drain-and-visibility-2026-05-29.md

Five compounding root causes addressed:

1. NEW lib/process-all-state.mjs: atomic per-phase state-write helper that auto-fills the 11 required fields under jobs[jobId]. Closes the "type field never set when invoked from CLI" bug that made CLI-invoked runs invisible to dashboard SSE filters. Atomic tmp+rename per the state-write-without-disk-write bug class.
2. NEW lib/batch-state-repair-sweep.mjs: idempotent sweep that re-marks pipeline.md URLs as [x] when their batch is terminal-errored or terminal-expired. Closes the loop from PR #308 for batches that predate the markPipelineUrl fix (the 6 stuck batches with errored=179 each that re-queued the same URLs on every Process All run).
3. NEW resume-state at data/process-all-resume-state.json: SIGTERM/SIGINT handlers write resume-state before exit; next invocation reads it (if < 24h TTL) + skips already-completed phases. Cleared on success.
4. Tuned MAX_ROUNDS 10 → 30 (clamp [1, 100]) + new PROCESS_ALL_MAX_WALLCLOCK_MS (90 min default) + 2-consecutive-100%-error early-abort. Wall-clock cap is secondary to PER_RUN_CAP_PROCESS_ALL_USD=$1000; the dollar cap is the primary governor.
5. SSE last_run_complete toast now populates correctly because (a) the type field is reliably set on every write, and (b) the schema is guaranteed to have all 11 fields the SSE consumer expects.

Plus 3 new test files (25 cases total, all green) + 2 new AGENTS.md bug-class entries (process-orchestrator-without-resumable-state + state-file-without-schema-enforcement) + env-var table updates + Session Notes entry.

Pre-existing telemetry bug fixed alongside: pipelineBefore/pipelineAfter regex assumed URL came immediately after the [ ] marker, but actual pipeline.md format puts company name + prose first. Replaced with the correct .startsWith('- [ ]') pattern used elsewhere in the orchestrator.

Kill switches: PROCESS_ALL_DISABLE_RESUME=true + PROCESS_ALL_DISABLE_REPAIR_SWEEP=true for emergency rollback. No code revert needed.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
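The 24h resume TTL and the 2-consecutive-100%-error early abort described in the commit message can be sketched as two small pure functions. The function names, the `savedAt` field (epoch milliseconds), and the per-round record shape are assumptions chosen to mirror the batch counters shown in the pre-fix snapshot; the shipped `shouldResume` helper may have a different signature.

```javascript
const RESUME_TTL_MS = 24 * 60 * 60 * 1000; // 24h TTL from the commit message

// Resume only when a saved state exists, resume is not disabled, and the
// state is younger than the TTL. `savedAt` is a hypothetical field name.
function isResumeStateFresh(resumeState, now = Date.now(), disabled = false) {
  if (disabled || !resumeState || typeof resumeState.savedAt !== 'number') {
    return false;
  }
  const ageMs = now - resumeState.savedAt;
  return ageMs >= 0 && ageMs < RESUME_TTL_MS;
}

// Abort when each of the last `threshold` rounds errored on every request.
// Each round is assumed to look like { processing, succeeded, errored }.
function hitConsecutiveTotalFailures(rounds, threshold = 2) {
  if (rounds.length < threshold) return false;
  return rounds
    .slice(-threshold)
    .every((r) => r.errored > 0 && r.succeeded === 0 && r.processing === 0);
}
```

A single fully errored round does not trigger the abort under this sketch; it takes two in a row, so one bad batch between healthy ones does not end the run.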

github-actions Bot added the 🔴 core-architecture label May 29, 2026

mitwilli-create force-pushed the feat/process-all-drain-2026-05-29-claude-d9560376 branch from c0d1e25 to bea16b9 Compare May 29, 2026 20:36

mitwilli-create merged commit 0fe68a9 into main May 29, 2026
8 checks passed
