fix: process hygiene - MCP orphan watchdog, live-PID update locks, PATH-hijack-proof registration #385
Merged
…TH-hijack-proof registration

Three structural fixes for leaked/wedged repowise processes, observed on real machines and confirmed structural (hit every Windows user, and the lock issue hits anyone with rapid commits on any OS):

1. Orphaned MCP stdio servers: `repowise mcp --transport stdio` never exited when the MCP client died abnormally; on Windows children are never killed with their parent, so every crashed/force-quit session leaked a server pair holding wiki.db handles. New parent-death watchdog walks the ancestor chain past launcher shims (repowise.exe → venv python → real python; uv/uvx; cmd/sh wrappers) to the actual client and exits when it dies. A plain getppid() watch would see only the shim, which survives the client. Opt out: REPOWISE_MCP_NO_WATCHDOG=1.

2. Wedged update locks: read_update_lock treated locks as stale only after 30 min wall clock; a SIGKILLed/crashed update (the paths atexit cannot cover) blocked further updates for the whole window. Locks now record the owner's process-creation token and are stale immediately when the owning PID is dead or recycled. Applied to both lock implementations (cli/helpers + the workspace mirror in core/workspace/update) and to the augment hook's in-flight marker, so a crashed update can't suppress stale-wiki warnings either. Legacy token-less locks remain honored; the wall-clock ceiling still reaps hung-but-alive updates.

3. PATH hijack of MCP registrations: bare repowise commands resolve via PATH at session start, so shadow installs (conda, old pip, pipx) hijack the server against the indexed repo. Per-user configs (~/.claude/settings.json, claude_desktop_config.json) now pin the absolute path of the install that ran init, refreshed on every re-registration so a moved venv self-heals. Repo-shared files (.mcp.json, .codex/config.toml) deliberately keep the bare name: they may be committed and must not carry one machine's paths. Transient locations (uvx cache, temp dirs) fall back to bare.

All probes live in new stdlib-only repowise.core.procutils: kernel32 via ctypes on Windows (os.kill(pid, 0) on Windows TERMINATES the target, so it is never used as a probe), /proc on Linux, one-shot ps on macOS; fail-open (unknown ⇒ alive) so uncertainty can never kill a live server or break a live lock. No new dependencies; no subprocess calls on hot paths.

Tests: procutils probes, lock staleness matrix (dead/recycled/legacy/unknown), per-user-absolute vs repo-shared-bare registration, watchdog unit + end-to-end orphan test (client dies ⇒ server self-terminates).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
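The lock-staleness rules in item 2 of the commit message can be sketched as a small decision function. This is a minimal illustration under stated assumptions, not the code in this PR: the `UpdateLock` fields, the `lock_is_stale` name, and the convention that the injected probes return `None` for "unknown" are all hypothetical. Only the 30-minute ceiling, the dead/recycled/legacy/unknown cases, and the fail-open rule come from the description.

```python
from __future__ import annotations

import time
from dataclasses import dataclass
from typing import Callable, Optional

MAX_LOCK_AGE_S = 30 * 60  # the 30-minute wall-clock ceiling from the description


@dataclass
class UpdateLock:
    # Hypothetical lock record; the real on-disk format is not shown in the PR.
    pid: int
    started_at: float                   # epoch seconds when the lock was written
    create_token: Optional[str] = None  # None models a legacy token-less lock


def lock_is_stale(
    lock: UpdateLock,
    pid_alive: Callable[[int], Optional[bool]],      # None means "could not tell"
    create_token: Callable[[int], Optional[str]],    # None means "could not tell"
    now: Optional[float] = None,
) -> bool:
    now = time.time() if now is None else now

    # The wall-clock ceiling still reaps hung-but-alive updates.
    if now - lock.started_at > MAX_LOCK_AGE_S:
        return True

    alive = pid_alive(lock.pid)
    if alive is False:
        return True    # owner is definitely dead: stale immediately
    if alive is None:
        return False   # fail open: unknown => alive => keep the lock

    if lock.create_token is None:
        return False   # legacy lock: liveness + wall clock only

    current = create_token(lock.pid)
    if current is None:
        return False   # fail open on probe uncertainty

    # Same PID but a different creation token means the PID was recycled.
    return current != lock.create_token
```

Passing the probes in as callables keeps the decision logic testable without touching real processes, which is one way to cover a dead / recycled / legacy / unknown matrix like the one listed under Tests.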
✅ Health: 7.0 (unchanged) 🚨 Change risk: high (riskier than 81% of this repo's commits · raw 9.6/10)
🩹 Review priority (files here with the most recent bug-fix history — defects cluster, so review these first)
🔥 Hotspots touched (5)
🔗 Hidden coupling (3 files)
💀 Dead code (2 findings)
Last updated 2026-06-05 15:03 UTC
swati510 approved these changes on Jun 5, 2026
Problem

An audit of a real machine found three structural process-hygiene issues, all pointing at one repo:

1. `repowise mcp --transport stdio` has no parent-death handling, and Windows never kills children when the parent dies, so every crashed / force-quit / abnormally-ended agent session leaks its MCP server pair, each holding wiki.db handles that contend with later `repowise update` runs (WAL writer locks).
2. `read_update_lock` treated a lock as stale only after 30 min of wall clock and never checked whether the owning PID was alive. A SIGKILLed/crashed update (the paths `atexit` cannot cover) blocked further updates for the full window, and its leftover lock also suppressed the augment hook's stale-wiki warnings.
3. MCP registrations used the bare command `"repowise"`, resolved via PATH at session start. Observed in the wild: a stale conda shadow install serving a repo indexed by the venv install.

Fix

New `repowise.core.procutils`: stdlib-only, fail-open process probes (no psutil, no subprocess on hot paths):

- `pid_alive`: kernel32 `OpenProcess` + `WaitForSingleObject` on Windows (`os.kill(pid, 0)` on Windows terminates the target, so it is never used as a probe there), `os.kill(pid, 0)` on POSIX
- `process_create_token`: process identity by creation time, to defeat PID reuse
- `ancestor_chain`: single Toolhelp32 snapshot on Windows, `/proc` on Linux, one-shot `ps` on macOS

1. Parent-death watchdog (`mcp_server/_watchdog.py`, stdio only). A plain `getppid()` watch would fix nothing: console-script installs run as a launcher chain (`client -> repowise.exe shim -> venv python -> real python` on Windows), and the immediate parent is a shim that waits on us and survives the client. The watchdog walks the ancestor chain at startup, skips launcher shims (`python*`, `repowise*`, `uv*`) and shell wrappers (`cmd`, `sh`, `pwsh`, ...), and watches every recorded ancestor up to and including the first non-launcher (the client), identity-checked by creation token. It never watches above the client (a terminal may die while the client legitimately lives), fails open on any probe uncertainty, and can be disabled with `REPOWISE_MCP_NO_WATCHDOG=1`. It runs as a daemon thread polling every 5 s and calls `os._exit(0)` on client death.

2. Live-PID lock staleness. Locks now record the owner's `pid_create_token`; a lock whose owner is dead or recycled is stale immediately. Applied to both lock implementations, `cli/helpers.py` and its workspace mirror in `core/workspace/update.py` (formats kept in sync), and the augment hook's `_read_in_flight_marker` now delegates to the canonical reader. Legacy token-less locks remain honored (liveness + wall clock), and the 30-min ceiling still reaps hung-but-alive updates.

3. Scope-split registration. Per-user configs (`~/.claude/settings.json`, `claude_desktop_config.json`) now pin the absolute path of the install that ran `init` (resolved from the running interpreter's scripts dir, never PATH) and are refreshed on every re-registration, so a moved venv self-heals on the next `init`/`update`. Repo-shared files (`.mcp.json`, `.codex/config.toml`) deliberately keep the bare name: they may be committed, and one contributor's absolute path would break every other checkout. Transient locations (uvx cache, temp dirs) fall back to the bare name.

Tests

- `tests/unit/test_procutils.py`: real probes against self and reaped children
- `tests/unit/cli/test_update_lock.py`: staleness matrix (dead / recycled / legacy / unknown / wall clock) for both lock implementations
- `tests/unit/cli/test_mcp_command_resolution.py`: per-user absolute vs repo-shared bare; transient-location fallbacks; moved-venv self-heal
- `tests/unit/server/test_mcp_watchdog.py`: launcher/client classification, watch-set boundaries, fail-open behavior, and an end-to-end orphan test: a client process spawns a watchdogged server and exits; the server self-terminates within seconds

Full unit + provider suites pass (3,932 tests). Note for review: the Windows code paths were exercised locally end-to-end; Linux `/proc` and macOS `ps` paths are covered by the same tests and need a CI run to confirm. Everything fails open (unknown => alive => no action), so the worst case is the previous behavior, never a wrongly-killed server or wrongly-broken lock.

Not in this PR (follow-ups)

- Hook commands (`repowise-augment`, `repowise-rewrite`, post-commit `command -v repowise`) stay bare: their installers are has-hook-already idempotent, so an absolute path would be pinned forever and break when the venv moves. That needs an update-in-place re-registration story first.
- Process audit ([…] `repowise mcp`/`update` processes, flag shadow-install servers and stale registrations): separate PR as planned; one logical MCP server is 2–3 OS processes (shim chain), so the audit should count sets.
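Since one logical server spans a shim chain, the watch-set rule from the watchdog description (skip launchers and shells, stop at the first non-launcher, never look above it) may be easier to review as code. The following is a rough sketch under assumptions, not the contents of `mcp_server/_watchdog.py`: the `Ancestor` record, `is_launcher`, `select_watch_set`, the exact name lists, and the choice to return an empty set when no client is found are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import List

# Name patterns taken from the description (python*, repowise*, uv*; cmd, sh, pwsh, ...).
# The description's list is open-ended, so this exact set is an assumption.
LAUNCHER_PREFIXES = ("python", "repowise", "uv")
SHELL_NAMES = {"cmd", "sh", "pwsh"}


@dataclass(frozen=True)
class Ancestor:
    # Hypothetical record for one entry of the ancestor chain.
    pid: int
    name: str          # executable name or path, e.g. "repowise.exe"
    create_token: str  # creation-time identity, guards against PID reuse


def _stem(name: str) -> str:
    # Strip any directory part and a trailing ".exe", case-insensitively.
    base = name.replace("\\", "/").rsplit("/", 1)[-1].lower()
    return base[:-4] if base.endswith(".exe") else base


def is_launcher(name: str) -> bool:
    stem = _stem(name)
    return stem in SHELL_NAMES or stem.startswith(LAUNCHER_PREFIXES)


def select_watch_set(chain: List[Ancestor]) -> List[Ancestor]:
    """Pick the ancestors to watch; `chain` is ordered nearest parent first.

    Watches every ancestor up to and including the first non-launcher
    (the client) and nothing above it.
    """
    watched: List[Ancestor] = []
    for ancestor in chain:
        watched.append(ancestor)
        if not is_launcher(ancestor.name):
            return watched  # reached the client: stop here
    # Assumption: if no client can be identified, fail open and watch nothing.
    return []
```

For a chain such as `repowise.exe -> python.exe -> python3.11 -> editor.exe -> WindowsTerminal.exe` (names made up for illustration), this selects the first four entries and leaves the terminal unwatched, matching the rule that a terminal may die while the client legitimately lives.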