
fix(flapping-plists): sweep 4 flapping launchd plists — TCC wrapper drop + dashboard-dependency inline + exit-1 misclassification #336

Merged
mitwilli-create merged 1 commit into main from fix/flapping-plists-sweep-2026-05-29-claude-1404 on May 29, 2026


@mitwilli-create (Owner)

Summary

Closes the R3 residual from the 2026-05-29 chain handover: this PR fixes 4 of the 6 flapping launchd plists in a single sweep. The remaining two (buttons-smoke, scan-email-poll) need investigation and are not addressed here.

Three failure modes, four fixes:

| Plist | Pre-fix exit | Root cause | Fix |
|---|---|---|---|
| network-database-build | 126 (since 2026-05-23; ~7 days of silent flapping) | macOS Tahoe TCC blocks /bin/bash from reading scripts under ~/Documents/ | Drop the bash + cron-run.sh wrapper and invoke node directly. Matches the bug-intake-mapper pattern. |
| phase-B-prime-daily | 1 (since 2026-05-25; 5 consecutive days) | _CONTACTS_DATA is not populated when the 03:30 PT job fires before the morning rebuild | Inline node scripts/build-dashboard.mjs at the start of main(). Override via SKIP_DASHBOARD_PRELOAD=1. |
| pipeline-health | 1 (false-positive flaps when 2+ sessions run concurrently) | Hardcoded pids.length > 1 orchestrator threshold | Env-configurable PIPELINE_HEALTH_MAX_ORCHESTRATORS (default 1) + always process.exit(0). Per the top-of-file comment, the JSON file IS the signal. |
| health-column-liveness | 1 (false-positive flaps when coverage < 90%) | Exit 1 for a valid data-coverage signal, misclassified as a crash | process.exit(0) when unhealthy; keep exit 2 for true FATAL. The JSON at data/health-column-coverage.json is the signal. |
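The phase-B-prime-daily fix is an ordering guarantee: rebuild the dashboard before anything reads _CONTACTS_DATA, unless SKIP_DASHBOARD_PRELOAD=1. Below is a minimal sketch of that step, not the merged code. The helper names (shouldPreloadDashboard, preloadDashboard) and the injected `run` callback are hypothetical, and it assumes the job's working directory is the repo root so the relative script path resolves.

```javascript
// Hypothetical sketch of the phase-B-prime-daily preload step.
// The dashboard build runs as a child process with the same node binary;
// `run` is injected so the logic can be exercised without building anything.
function shouldPreloadDashboard(env) {
  // SKIP_DASHBOARD_PRELOAD=1 opts out; unset or any other value preloads.
  return env.SKIP_DASHBOARD_PRELOAD !== '1';
}

function preloadDashboard({ env = process.env, run }) {
  if (!shouldPreloadDashboard(env)) return 'skipped';
  // The rebuild is idempotent, so running it on every invocation is safe even
  // when the morning rebuild has already populated _CONTACTS_DATA.
  // Assumes the working directory is the repo root.
  run(process.execPath, ['scripts/build-dashboard.mjs']);
  return 'built';
}

async function main() {
  // Dynamic import keeps this sketch loadable as either CommonJS or ESM.
  const { execFileSync } = await import('node:child_process');
  preloadDashboard({
    // execFileSync throws on a non-zero exit, so in this sketch a failed build
    // stops the job instead of letting phase-B run on missing contact data.
    run: (cmd, args) => execFileSync(cmd, args, { stdio: 'inherit' }),
  });
  // ...the existing phase-B-prime work that reads _CONTACTS_DATA follows here.
}
```

Injecting the runner is a testing convenience; failing hard on a broken build is a design choice of this sketch, since the PR does not say how a failed preload should be handled.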

AGENTS.md additions

Two new bug-class entries:

  • ### Bug class: launchd-bash-wrapper-tahoe-tcc-block — full doc of why bash hits TCC and node doesn't, with safe-pattern XML.
  • ### Bug class: launchd-exit-1-misclassified-as-flapping-on-data-signals — generalizable pattern: when a script writes its real signal to a JSON file AND returns an exit code, treat the file as the signal. Reserve non-zero for true script failure. Companion pattern: env-configurable thresholds replace hardcoded "expected 0 or 1" assertions.
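The launchd-bash-wrapper-tahoe-tcc-block entry suggests a mechanical check for other plists with the same shape. Below is a hypothetical lint (isTccRiskyWrapper and the shell list are assumptions). It encodes only the rule observed in this PR, that a shell handed a script under ~/Documents/ gets blocked while a direct node invocation does not. It does not model Apple's actual TCC decision logic.

```javascript
// Hypothetical lint for the launchd-bash-wrapper-tahoe-tcc-block bug class.
// Encodes the PR's observation, not TCC's real rules.
const SHELLS = new Set(['/bin/bash', '/bin/sh', '/bin/zsh']);

function isTccRiskyWrapper(programArguments, home) {
  const [interpreter, ...rest] = programArguments;
  if (!SHELLS.has(interpreter)) return false;
  const protectedDirs = [`${home}/Documents/`, '~/Documents/'];
  // Flag if any argument (a script path or a -c command string) points into
  // the protected tree.
  return rest.some((arg) => protectedDirs.some((dir) => arg.includes(dir)));
}

// Illustrative paths only, not the repo's real layout.
const wrapped = ['/bin/bash', '/Users/me/Documents/career-ops/cron-run.sh'];
const direct = ['/opt/homebrew/bin/node', '/Users/me/Documents/career-ops/scripts/job.mjs'];
console.log(isTccRiskyWrapper(wrapped, '/Users/me')); // true
console.log(isTccRiskyWrapper(direct, '/Users/me')); // false
```

One way to feed it is to convert each plist to JSON with `plutil -convert json -o - <file>` and pass the parsed ProgramArguments array.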

Smoke tests (all pass)

plutil -lint:           4/4 plists OK
node --check:           3/3 .mjs files OK
pipeline-health exit:   0 (was 1; JSON signal preserved)
health-column exit:     0 (was 1; JSON signal preserved)
PIPELINE_HEALTH_MAX_ORCHESTRATORS env knob: honored

Test plan

  • Merge the PR
  • Manually trigger each fixed plist to confirm flapping resolved:
    • launchctl kickstart -k gui/$(id -u)/com.mitchell.career-ops.network-database-build → exit 0 expected
    • launchctl kickstart -k gui/$(id -u)/com.mitchell.career-ops.phase-B-prime-daily → exit 0 expected (costs ~$2.50 if run during the workday)
    • launchctl kickstart -k gui/$(id -u)/com.mitchell.career-ops.pipeline-health → exit 0 expected
    • launchctl kickstart -k gui/$(id -u)/com.mitchell.career-ops.health-column-liveness → exit 0 expected
  • After 24h, re-run node scripts/agents/system-maintainer.mjs --health → should report 2 flapping plists instead of 6
  • (Optional) Run launchctl setenv PIPELINE_HEALTH_MAX_ORCHESTRATORS 2 if multiple concurrent orchestrator sessions are routine
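The manual checks in the test plan can be scripted. Below is a hypothetical helper, not part of the PR. It builds the same `gui/<uid>/<label>` service targets the commands use and reads each job's last exit status from `launchctl print`. The "last exit code = N" line format is an assumption about recent macOS output. Nothing runs unless the first argument is `kickstart` or `check` on macOS.

```javascript
// Hypothetical helper for the test plan: build launchctl argv arrays and
// parse each job's last exit status. Output format of `launchctl print`
// ("last exit code = N") is an assumption.
const LABEL_PREFIX = 'com.mitchell.career-ops.';

function serviceTarget(uid, job) {
  return `gui/${uid}/${LABEL_PREFIX}${job}`;
}

function kickstartArgs(uid, job) {
  // -k kills a running instance before restarting the job.
  return ['kickstart', '-k', serviceTarget(uid, job)];
}

function printArgs(uid, job) {
  return ['print', serviceTarget(uid, job)];
}

function parseLastExitCode(printOutput) {
  // Returns null for "(never exited)" or when the line is absent.
  const match = /last exit code = (-?\d+)/.exec(printOutput);
  return match ? Number(match[1]) : null;
}

async function main(mode, jobs) {
  const { execFileSync } = await import('node:child_process');
  const uid = process.getuid();
  for (const job of jobs) {
    if (mode === 'kickstart') {
      execFileSync('launchctl', kickstartArgs(uid, job), { stdio: 'inherit' });
    } else {
      // kickstart only starts the job, so run `check` after the job has finished.
      const out = execFileSync('launchctl', printArgs(uid, job), { encoding: 'utf8' });
      console.log(`${job}: last exit code ${parseLastExitCode(out) ?? 'unknown'}`);
    }
  }
}

const [mode, ...jobs] = process.argv.slice(2);
if (process.platform === 'darwin' && (mode === 'kickstart' || mode === 'check')) {
  main(mode, jobs).catch((err) => {
    console.error(err.message);
    process.exitCode = 1;
  });
}
```

Usage (filename hypothetical): `node check-plists.mjs kickstart network-database-build pipeline-health`, then later `node check-plists.mjs check network-database-build pipeline-health`. Kickstarting phase-B-prime-daily runs the paid job, so list it only when that cost is intended.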

Out of scope (follow-ups)

  • buttons-smoke: 1 of 14 assertions fails ("batch-runner dry-run reports expected queue size"). This needs code investigation; it is not a crash. File a separate PR.
  • scan-email-poll: produces no .err or .out logs at all, so the cause is likely a plist misconfiguration or a job stub that never starts. Investigation needed.

🤖 Generated with Claude Code

…per drop + dashboard-dependency inline + exit-1 misclassification

Four plists were flapping for distinct reasons. Fixing them in one
sweep brings the flapping launchd count from 6 to 2 (buttons-smoke and
scan-email-poll need follow-up investigation, not addressed here).

1. network-database-build (exit 126 daily since 2026-05-23)
   - Drop /bin/bash + cron-run.sh wrapper. macOS Tahoe TCC blocks bash
     from reading scripts under ~/Documents/ even when bash is allowed.
     Node invocation works because /Users/.../node lives outside the
     protected tree. Matches bug-intake-mapper / pipeline-health /
     health-column-liveness pattern.

2. phase-B-prime-daily (exit 1 daily since 2026-05-25)
   - Inline `node scripts/build-dashboard.mjs` at start of main() so
     _CONTACTS_DATA is populated regardless of when the morning
     rebuild fires. Idempotent. Override via SKIP_DASHBOARD_PRELOAD=1.

3. pipeline-health-check.mjs (false-positive flapping on 2+ concurrent
   Claude sessions — the new normal)
   - Env-configurable PIPELINE_HEALTH_MAX_ORCHESTRATORS (default 1).
   - process.exit(0) always. The JSON file IS the signal per the
     top-of-file comment intent.

4. health-column-liveness.mjs (false-positive flapping when
   data-coverage < 90% — a valid signal, wrong protocol)
   - process.exit(0) on unhealthy. Keep exit 2 for true FATAL.
   - The JSON at data/health-column-coverage.json is the signal.
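Items 3 and 4 share one protocol: the JSON file carries the health verdict, and the exit code reports only whether the script itself worked. Below is a minimal sketch of that split. The function names, outcome strings, and the percent units for coverage are assumptions. It follows item 4 (exit 2 reserved for FATAL); per item 3, pipeline-health-check.mjs exits 0 unconditionally.

```javascript
// Hypothetical sketch of the "JSON file is the signal" exit protocol.
const EXIT_FATAL = 2;

function readNonNegativeIntEnv(env, name, fallback) {
  const raw = env[name];
  // Unset, empty, or malformed values fall back instead of crashing the job.
  return /^\d+$/.test(raw ?? '') ? Number(raw) : fallback;
}

function orchestratorStatus(pids, env) {
  // A default of 1 preserves the old hardcoded `pids.length > 1` check.
  const max = readNonNegativeIntEnv(env, 'PIPELINE_HEALTH_MAX_ORCHESTRATORS', 1);
  return { count: pids.length, max, healthy: pids.length <= max };
}

function coverageOutcome(coveragePercent, thresholdPercent = 90) {
  // Low coverage is a valid data signal, not a script failure.
  return coveragePercent < thresholdPercent ? 'unhealthy' : 'healthy';
}

function exitCodeFor(outcome) {
  // 'healthy' and 'unhealthy' both go to the JSON file; only 'fatal'
  // (the script itself failed) reaches launchd as a non-zero exit.
  return outcome === 'fatal' ? EXIT_FATAL : 0;
}
```

In a job, the result object would be written to its JSON file (data/health-column-coverage.json for health-column-liveness) before exiting with `exitCodeFor(outcome)`, so launchd's flapping view and the data view stop disagreeing.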

AGENTS.md adds two bug-class entries:
- launchd-bash-wrapper-tahoe-tcc-block
- launchd-exit-1-misclassified-as-flapping-on-data-signals

Smoke-tested:
- plutil -lint OK on all 4 plists
- node --check OK on all 3 .mjs files
- pipeline-health exits 0 with JSON signal preserved
- health-column-liveness exits 0 with JSON signal preserved
- PIPELINE_HEALTH_MAX_ORCHESTRATORS env knob honored

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@mitwilli-create marked this pull request as ready for review May 29, 2026 22:11
@mitwilli-create merged commit 3a917c7 into main May 29, 2026
9 checks passed
@mitwilli-create deleted the fix/flapping-plists-sweep-2026-05-29-claude-1404 branch May 29, 2026 22:11