fix: process hygiene - MCP orphan watchdog, live-PID update locks, PATH-hijack-proof registration by RaghavChamadiya · Pull Request #385 · repowise-dev/repowise

RaghavChamadiya · 2026-06-05T15:02:55Z

Problem

An audit of a real machine found three structural process-hygiene issues, all pointing at one repo:

Orphaned MCP stdio servers (every Windows user). repowise mcp --transport stdio has no parent-death handling, and Windows never kills children when the parent dies — so every crashed / force-quit / abnormally-ended agent session leaks its MCP server pair, each holding wiki.db handles that contend with later repowise update runs (WAL writer locks).
Wedged update locks (any OS, rapid commits). read_update_lock treated a lock as stale only after 30 min of wall clock and never checked whether the owning PID was alive. A SIGKILLed/crashed update — the paths atexit cannot cover — blocked further updates for the full window, and its leftover lock also suppressed the augment hook's stale-wiki warnings.
PATH hijack of MCP registrations (multi-install users). All registrations wrote the bare command "repowise", resolved via PATH at session start — observed in the wild: a stale conda shadow install serving a repo indexed by the venv install.

Fix

New repowise.core.procutils — stdlib-only, fail-open process probes (no psutil, no subprocess on hot paths):

pid_alive — kernel32 OpenProcess + WaitForSingleObject on Windows (os.kill(pid, 0) on Windows terminates the target, so it is never used as a probe there), os.kill(pid, 0) on POSIX
process_create_token — process identity by creation time, to defeat PID reuse
ancestor_chain — single Toolhelp32 snapshot on Windows, /proc on Linux, one-shot ps on macOS

1. Parent-death watchdog (mcp_server/_watchdog.py, stdio only). A plain getppid() watch would fix nothing: console-script installs run as a launcher chain (client -> repowise.exe shim -> venv python -> real python on Windows), and the immediate parent is a shim that waits on us and survives the client. The watchdog walks the ancestor chain at startup, skips launcher shims (python*, repowise*, uv*) and shell wrappers (cmd, sh, pwsh, ...), and watches every recorded ancestor up to and including the first non-launcher — the client — identity-checked by creation token. It never watches above the client (a terminal may die while the client legitimately lives), fails open on any probe uncertainty, and can be disabled with REPOWISE_MCP_NO_WATCHDOG=1. 5s-poll daemon thread; os._exit(0) on client death.

2. Live-PID lock staleness. Locks now record the owner's pid_create_token; a lock whose owner is dead or recycled is stale immediately. Applied to both lock implementations — cli/helpers.py and its workspace mirror in core/workspace/update.py (formats kept in sync) — and the augment hook's _read_in_flight_marker now delegates to the canonical reader. Legacy token-less locks remain honored (liveness + wall clock), and the 30-min ceiling still reaps hung-but-alive updates.

3. Scope-split registration. Per-user configs (~/.claude/settings.json, claude_desktop_config.json) now pin the absolute path of the install that ran init — resolved from the running interpreter's scripts dir, never PATH — and are refreshed on every re-registration, so a moved venv self-heals on the next init/update. Repo-shared files (.mcp.json, .codex/config.toml) deliberately keep the bare name: they may be committed, and one contributor's absolute path would break every other checkout. Transient locations (uvx cache, temp dirs) fall back to the bare name.

Tests

tests/unit/test_procutils.py — real probes against self and reaped children
tests/unit/cli/test_update_lock.py — staleness matrix (dead / recycled / legacy / unknown / wall clock) for both lock implementations
tests/unit/cli/test_mcp_command_resolution.py — per-user absolute vs repo-shared bare; transient-location fallbacks; moved-venv self-heal
tests/unit/server/test_mcp_watchdog.py — launcher/client classification, watch-set boundaries, fail-open behavior, and an end-to-end orphan test: a client process spawns a watchdogged server and exits; the server self-terminates within seconds
Two existing augment-staleness tests fabricated locks with fake PIDs to mean "update running"; they now use a live PID, plus a new regression test pinning that a crashed update's lock no longer suppresses stale warnings

Full unit + provider suites pass (3,932 tests). Note for review: the Windows code paths were exercised locally end-to-end; Linux /proc and macOS ps paths are covered by the same tests and need a CI run to confirm — everything fails open (unknown => alive => no action), so the worst case is the previous behavior, never a wrongly-killed server or wrongly-broken lock.

Not in this PR (follow-ups)

Hook commands (repowise-augment, repowise-rewrite, post-commit command -v repowise) stay bare: their installers are has-hook-already idempotent, so an absolute path would be pinned forever and break when the venv moves — needs an update-in-place re-registration story first.
Doctor process audit (enumerate live repowise mcp/update processes, flag shadow-install servers and stale registrations) — separate PR as planned; one logical MCP server is 2–3 OS processes (shim chain), so the audit should count sets.

…TH-hijack-proof registration Three structural fixes for leaked/wedged repowise processes, observed on real machines and confirmed structural (hit every Windows user, and the lock issue hits anyone with rapid commits on any OS): 1. Orphaned MCP stdio servers: `repowise mcp --transport stdio` never exited when the MCP client died abnormally; on Windows children are never killed with their parent, so every crashed/force-quit session leaked a server pair holding wiki.db handles. New parent-death watchdog walks the ancestor chain past launcher shims (repowise.exe → venv python → real python; uv/uvx; cmd/sh wrappers) to the actual client and exits when it dies. A plain getppid() watch would see only the shim, which survives the client. Opt out: REPOWISE_MCP_NO_WATCHDOG=1. 2. Wedged update locks: read_update_lock treated locks as stale only after 30 min wall clock; a SIGKILLed/crashed update (the paths atexit cannot cover) blocked further updates for the whole window. Locks now record the owner's process-creation token and are stale immediately when the owning PID is dead or recycled. Applied to both lock implementations (cli/helpers + the workspace mirror in core/workspace/update) and to the augment hook's in-flight marker, so a crashed update can't suppress stale-wiki warnings either. Legacy token-less locks remain honored; the wall-clock ceiling still reaps hung-but-alive updates. 3. PATH hijack of MCP registrations: bare repowise commands resolve via PATH at session start, so shadow installs (conda, old pip, pipx) hijack the server against the indexed repo. Per-user configs (~/.claude/settings.json, claude_desktop_config.json) now pin the absolute path of the install that ran init, refreshed on every re-registration so a moved venv self-heals. Repo-shared files (.mcp.json, .codex/config.toml) deliberately keep the bare name — they may be committed and must not carry one machine's paths. Transient locations (uvx cache, temp dirs) fall back to bare. All probes live in new stdlib-only repowise.core.procutils: kernel32 via ctypes on Windows (os.kill(pid, 0) on Windows TERMINATES the target — never used as a probe), /proc on Linux, one-shot ps on macOS; fail-open (unknown ⇒ alive) so uncertainty can never kill a live server or break a live lock. No new dependencies; no subprocess calls on hot paths. Tests: procutils probes, lock staleness matrix (dead/recycled/legacy/ unknown), per-user-absolute vs repo-shared-bare registration, watchdog unit + end-to-end orphan test (client dies ⇒ server self-terminates). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

repowise-bot · 2026-06-05T15:03:13Z

✅ Health: 7.0 (unchanged)
_{1 file moved · 5 hotspots · 5 hidden couplings · 5 with fix history · 2 dead-code findings}

🚨 Change risk: high _{(riskier than 81% of this repo's commits · raw 9.6/10)}
This change's risk is driven by:

large diff (many lines added)
scattered, high-entropy change

🩹 Review priority _{(files here with the most recent bug-fix history — defects cluster, so review these first)}

.../cli/helpers.py — fixed 6× in the last ~6 months
.../cli/mcp_config.py — fixed 4× in the last ~6 months
.../mcp_server/_server.py — fixed 3× in the last ~6 months
.../cli/test_augment_staleness.py — fixed 2× in the last ~6 months
.../commands/augment_cmd.py — fixed 2× in the last ~6 months

File	Score	Δ	Why
`.../workspace/update.py`	5.1 → 4.3	▼ -0.8	🔻 introduced nested complexity, complex method · ✅ resolved dry violation

💡 .../workspace/update.py: Flatten the control flow. Pull early-return guards to the top, extract the deepest branch into a helper, and consider replacing nested conditionals with a strategy table or dispatch dict.

🔥 Hotspots touched (5)

.../workspace/update.py — 4 commits/90d, 8 dependents · primary owner: Raghav Chamadiya (100%)
.../mcp_server/_server.py — 9 commits/90d, 5 dependents · primary owner: Raghav Chamadiya (95%)
.../cli/test_augment_staleness.py — 3 commits/90d, 0 dependents · primary owner: Raghav Chamadiya (100%)

2 more

.../cli/helpers.py — 20 commits/90d, 40 dependents · primary owner: Raghav Chamadiya (72%)
.../commands/augment_cmd.py — 8 commits/90d, 5 dependents · primary owner: Raghav Chamadiya (100%)

🔗 Hidden coupling (3 files)

.../cli/helpers.py co-changes with these files (not in this PR):
- .../cli/test_helpers.py (9× — 🟢 routine)
- .../commands/update_cmd.py (8× — 🟢 routine)
- README.md (5× — 🟢 routine)
.../commands/augment_cmd.py co-changes with README.md (5× — 🟢 routine) — not in this PR.
.../cli/mcp_config.py co-changes with .../cli/main.py (5× — 🟢 routine) — not in this PR.

💀 Dead code (2 findings)

💀 .../cli/helpers.py ensure_db (confidence 1.00)
💀 .../cli/helpers.py is_interactive_session (confidence 1.00)

📊 Full report · ⭐ Star Repowise · 📥 Install bot · Last updated 2026-06-05 15:03 UTC
_{Silence on a single PR with [skip repowise] in the title · Per-repo toggle on repowise.dev/settings?tab=bot}

RaghavChamadiya requested a review from swati510 as a code owner June 5, 2026 15:02

swati510 approved these changes Jun 5, 2026

View reviewed changes

RaghavChamadiya merged commit 1b9c72d into main Jun 5, 2026
5 checks passed

RaghavChamadiya deleted the fix/process-hygiene branch June 5, 2026 15:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: process hygiene - MCP orphan watchdog, live-PID update locks, PATH-hijack-proof registration#385

fix: process hygiene - MCP orphan watchdog, live-PID update locks, PATH-hijack-proof registration#385
RaghavChamadiya merged 1 commit into
mainfrom
fix/process-hygiene

RaghavChamadiya commented Jun 5, 2026

Uh oh!

repowise-bot Bot commented Jun 5, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

RaghavChamadiya commented Jun 5, 2026

Problem

Fix

Tests

Not in this PR (follow-ups)

Uh oh!

repowise-bot Bot commented Jun 5, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants