fix(runtimed): evict re-keyed rooms by Arc pointer, not stale key#1836
Merged
rgbkrk merged 2 commits intonteract:mainfrom Apr 16, 2026
Merged
fix(runtimed): evict re-keyed rooms by Arc pointer, not stale key#1836rgbkrk merged 2 commits intonteract:mainfrom
rgbkrk merged 2 commits intonteract:mainfrom
Conversation
When rekey_ephemeral_room moves a room from UUID → file path key, the eviction timer scheduled under the old UUID key would find no matching room (the UUID was removed from the map) and skip eviction entirely. The room then lives forever with its kernel, runtime agent socket, file watcher, and autosave debouncer — leaking fds. Fix: look up the room in the eviction timer by Arc pointer instead of the captured notebook_id. The Arc is stable across re-keys since rekey_ephemeral_room moves the same Arc to the new key. Evidence: 237 rooms created, 163 eviction timers skipped due to stale key after re-key, 144 zombie rooms with 0 peers and idle kernels accumulating 695 fds against a 1024 soft limit.
Verifies the exact failure mode that caused zombie rooms: a room is created under a UUID key, re-keyed to a file path on save, then the eviction timer fires with the stale UUID. The test confirms the Arc::ptr_eq lookup finds and removes the room under its new key. Addresses review feedback on PR nteract#1836.
rgbkrk
approved these changes
Apr 16, 2026
4 tasks
8 tasks
rgbkrk
added a commit
that referenced
this pull request
Apr 16, 2026
Task 6.2 replaced the post-save rekey dance with inline path_index/ room.path updates in the SaveNotebook handler. The function had been #[allow(dead_code)] since then. Its companion test (added in PR #1836 to guard against stale-UUID eviction) becomes structurally obsolete because UUID keys never change in the new model. Test count: 323 → 322
rgbkrk
added a commit
that referenced
this pull request
Apr 16, 2026
Task 6.2 replaced the post-save rekey dance with inline path_index/ room.path updates in the SaveNotebook handler. The function had been #[allow(dead_code)] since then. Its companion test (added in PR #1836 to guard against stale-UUID eviction) becomes structurally obsolete because UUID keys never change in the new model. Test count: 323 → 322
rgbkrk
added a commit
that referenced
this pull request
Apr 16, 2026
Task 6.2 replaced the post-save rekey dance with inline path_index/ room.path updates in the SaveNotebook handler. The function had been #[allow(dead_code)] since then. Its companion test (added in PR #1836 to guard against stale-UUID eviction) becomes structurally obsolete because UUID keys never change in the new model. Test count: 323 → 322
rgbkrk
added a commit
that referenced
this pull request
Apr 16, 2026
* docs(plan): uuid-first notebook identity implementation plan
* feat(runtimed): add PathIndex for canonical-path → uuid lookup
* feat(runtimed): add NotebookRoom.id UUID field (unwired)
* refactor(runtimed): NotebookRoom.notebook_path -> path: Option<PathBuf>
Rename the field from `notebook_path: RwLock<PathBuf>` to
`path: RwLock<Option<PathBuf>>`. Untitled rooms (UUID notebook_id)
initialize with `None`; file-backed rooms initialize with `Some(path)`.
- `new_fresh` and `load_or_create`: compute `path` via `is_untitled_notebook`
- Trust verification: file-backed rooms use `verify_trust_from_file`,
untitled rooms use snapshot path (unchanged)
- `find_room_by_notebook_path`: compare `Option` with `as_deref()`
- `rekey_ephemeral_room`: write `Some(new_path)` on promotion
- `save_notebook_to_disk`: return `Unrecoverable` if `target_path` is None
and room has no `path` (untitled notebook with no save target)
- `process_markdown_assets`, asset resolution, runtime agent spawn,
auto-launch, clone-to-disk: all updated to handle `Option<PathBuf>`
- `test_room_with_path` (test helper): `path: RwLock::new(Some(...))`
- Added tests: `untitled_room_has_path_none`, `file_backed_room_has_path_some`
* refactor(runtimed): clean up Task 4.1 clippy lints
Convert map_or(false/true, ...) to is_some_and/is_none_or. Restructure
trust verification to match on &path, eliminating the implicit-invariant
expect() on path.as_deref().
Review feedback from Task 4.1 code review.
* docs: refresh stale docstrings after notebook_path -> path rename
- SaveNotebook request docstring: untitled rooms (no path) now error
- Restore orphaned 'Get or create a room' docstring to its function
- Test helper doc comment: notebook_path -> path
* refactor(runtimed): key rooms map by Uuid, add path_index
Switch NotebookRooms from HashMap<String, ...> to HashMap<Uuid, ...>.
Add a PathIndex secondary index (path → UUID) threaded through the
Daemon struct for O(1) path lookups. Replace the old linear scan in
find_room_by_notebook_path with the O(1) find_room_by_path using
path_index.
- NotebookRoom::new_fresh now takes (uuid, path: Option<PathBuf>, ...)
directly instead of a notebook_id string; callers build both explicitly.
- get_or_create_room is now async and takes &NotebookRooms + path_index.
- rekey_ephemeral_room keeps the UUID key stable and only updates the
room's path field + path_index; it no longer re-keys the rooms map.
- Eviction loop captures path_index and cleans up path_index entry on
room removal.
- All daemon.rs handlers (handle_open_notebook, handle_create_notebook,
NotebookSync, RuntimeAgent, InspectNotebook, ShutdownNotebook, ListRooms)
updated to use UUID keys.
- New tests: find_room_by_path_returns_room_after_index_insert,
find_room_by_path_returns_none_when_not_indexed.
- All existing tests updated to pass uuid + path: Option<PathBuf>.
- 320 lib tests + tokio_mutex_lint + all lint checks pass.
* fix(runtimed): address Task 5.1 code review concerns
- C1: NotebookSync handshake with a path now consults find_room_by_path
before minting a new UUID. Prevents zombie rooms on reconnect.
- I1: eviction uses path_index.remove_by_uuid(uuid) instead of reading
room.path via try_read(). Robust against concurrent path writers.
- I2: swallowed PathAlreadyOpen in get_or_create_room is now logged at
error level since it indicates caller invariant violation.
- I3: test_eviction_finds_rekeyed_room_by_arc_pointer now exercises
path_index cleanup via remove_by_uuid.
- I4: corrected TODO comment in rekey_ephemeral_room to point at
Task 6.2 as the closure point (not Phase 7).
* feat(protocol): add PathChanged broadcast and SaveError response
Introduces NotebookBroadcast::PathChanged (sent when a room's .ipynb
path changes) and NotebookResponse::SaveError { error: SaveErrorKind }
with PathAlreadyOpen and Io variants.
These are scaffolding for Task 6.2's SaveNotebook rewrite — no behavior
changes this commit. RoomRenamed and NotebookSaved.new_notebook_id
remain in place until Task 7.3.
* feat(runtimed): rewrite SaveNotebook without room rekeying
The new handler does the transition untitled→file-backed inline:
- Claim path in path_index (fail fast with SaveErrorKind::PathAlreadyOpen
on collision)
- Update room.path
- Stop .automerge persistence; delete the stale file
- Spawn .ipynb file watcher + autosave debouncer
- Clear ephemeral markers
- Broadcast NotebookBroadcast::PathChanged
Also handles save-as rename: updates path_index and room.path when the
written path differs from the existing room.path.
rekey_ephemeral_room is marked #[allow(dead_code)] — Task 7.1 deletes
it along with its companion tests. This closes the interloper-detection
gap that Task 5.1 deferred.
Updated test_rekey_ephemeral_room_starts_autosave to drive through the
new inline transition logic instead of calling rekey_ephemeral_room
directly. Added two new unit tests:
- saving_untitled_notebook_updates_path_index_and_keeps_uuid
- saving_to_already_open_path_returns_path_already_open_error
* fix(runtimed): address Task 6.2 code review concerns
- Extract promote_untitled_to_file_backed and rename_file_backed_room
helpers. Handler and tests both call them (fixes parallel-to-production
test smell).
- Delete the ineffective persist_tx.send(None) call. The .automerge
file may still be resurrected by subsequent AutomergeSync frames;
documented with TODO(followup) to track the deeper fix.
- Combine dual room.path reads into one critical section.
- Add warn log when canonicalize falls back to the raw path.
- Swap std::fs::remove_file to tokio::fs::remove_file in the
transition helper (async throughout).
* refactor(runtimed): delete rekey_ephemeral_room
Task 6.2 replaced the post-save rekey dance with inline path_index/
room.path updates in the SaveNotebook handler. The function had been
#[allow(dead_code)] since then. Its companion test (added in PR #1836
to guard against stale-UUID eviction) becomes structurally obsolete
because UUID keys never change in the new model.
Test count: 323 → 322
* refactor(runtimed): delete redirect_map and RedirectEntry
The post-save rekey block that populated redirect_map was deleted in
Task 6.2, so the map was never written to. Lookups always miss. Remove
the field, the RedirectEntry struct, both lookup sites in
handle_open_notebook and handle_create_notebook, and the constructor
initialization.
* refactor(protocol)!: delete RoomRenamed broadcast and new_notebook_id
Both were part of the pre-UUID-first model where notebook IDs could
change after save. With UUID-first identity, UUIDs are stable for a
room's lifetime. RoomRenamed is replaced by PathChanged (different
semantics: path mutation, not identity change). new_notebook_id is
dropped entirely — callers just use the UUID they already have.
Breaking change: wire protocol consumers must update to handle
PathChanged instead of RoomRenamed, and stop reading new_notebook_id
from SaveNotebook responses.
* refactor(runt-mcp)!: drop looks_like_path, split notebook param
open_notebook now takes explicit `path` XOR `notebook_id` parameters
instead of a single heuristic-driven `notebook` field. The
looks_like_path helper is deleted.
save_notebook uses a UUID parse check (not a path heuristic) to detect
ephemeral notebooks that require an explicit path before saving.
save_notebook response shape is finalized: `path` and `notebook_id`
only (no previous_notebook_id since UUIDs are stable after Task 5/6/7).
Adds unit tests for ephemeral UUID detection and save response shape.
Breaking change for MCP agents still using the old `notebook` convenience
parameter on open_notebook — migrate to `path` or `notebook_id`.
* fix(runt-mcp): drop UUID-parse ephemeral heuristic, handle SaveError
The Task 8.1 subagent added an ephemeral-detection check that parses
notebook_id as UUID — but in the UUID-first model every notebook_id is
a UUID, so the check always fires and breaks file-backed save-in-place.
Delete the client-side check. The daemon already returns SaveError with
a clear message when it can't determine a save target. Also add a proper
match arm for NotebookResponse::SaveError so PathAlreadyOpen and Io
errors surface cleanly to the agent instead of falling through to
'Unexpected response'.
* chore: lint, typecheck, and fix unused variable after UUID-first refactor
Remove two dead `override_arc` bindings in async_session.rs that were
left over from the UUID-first refactor but never used in the async block.
All lint, clippy, typecheck, and lib tests pass clean.
* fix(runtimed): return room UUID in NotebookConnectionInfo, not path
Task 5.1 minted a UUID as the room key but handle_open_notebook (and
handle_create_notebook with a caller-provided hint) continued to send
the canonical path string as NotebookConnectionInfo.notebook_id. MCP
agents and TypeScript clients received a path where they expected a
UUID.
Smoke-tested via nteract-dev: create_notebook → save_notebook(path) →
open_notebook(path) all return the same UUID, and a second save to the
same path correctly surfaces SaveErrorKind::PathAlreadyOpen with the
conflicting session's UUID.
* docs: sweep stale references to deleted UUID-first concepts
Updates contributing/, CLAUDE.md, .claude/rules/, .claude/skills/, and
source comments to describe the post-refactor reality: UUID-stable room
keys, path_index secondary map, PathChanged broadcast, and PathAlreadyOpen
on save collision. Historical references in the plan document itself are
left intact.
* feat(notebook)!: handle PathChanged broadcast and SaveError in Tauri app
Frontend:
- Add PathChanged variant to DaemonBroadcast type.
- useAutomergeNotebook subscribes to path_changed and invokes the new
apply_path_changed Tauri command to keep the window's local path +
title in sync when a peer (MCP agent, sibling window) saves or renames.
Rust (Tauri notebook crate):
- apply_path_changed command: updates WindowNotebookContext.path and
window title from a PathChanged payload.
- save_notebook and save_notebook_as now match on SaveError and surface
PathAlreadyOpen as a user-facing message: 'Cannot save: {path} is
already open in another window. Close that window first...'
- format_save_error helper shared between both commands.
Breaking change for clients that relied on RoomRenamed — that variant is
gone; use PathChanged (which does NOT change the UUID) for path updates.
* fix(runtimed): pass room UUID to runtime agent subprocess, not path
Two RuntimeAgentHandle::spawn call sites (auto_launch_kernel and the
LaunchKernel handler) were passing room.path (or a path-derived string)
as the agent's notebook_id argument. When the agent reconnected via its
RuntimeAgent handshake, the daemon parsed that value as a UUID to look
up the room — a path like /home/ubuntu/foo.ipynb fails Uuid::parse_str,
so the lookup returned None and the agent timed out after 30s waiting
for the handshake response.
Both sites now pass room.id.to_string(), the canonical stable UUID key
for the rooms map.
Caught in staging.
* fix(runtimed): claim path_index before writing to disk on save
Critical bug found during live QA: when room B attempted to save to a
path already claimed by room A, the handler wrote room B's contents to
disk FIRST, then checked path_index. The empty write overwrote room A's
file; room A's file watcher saw the 'external' change and deleted all
its CRDT cells to mirror disk; autosave then persisted the wipe.
Fix: claim path_index BEFORE save_notebook_to_disk. On collision, return
SaveError::PathAlreadyOpen immediately with the file untouched. On I/O
failure, roll back the claim.
Introduces:
- canonical_target_path() — canonicalize a path that may not yet exist
(tokio::fs::canonicalize requires the file to exist).
- normalize_save_target() — extension + relative-path validation.
- try_claim_path() — small wrapper returning SaveErrorKind directly.
- finalize_untitled_promotion() — the non-claim half of the transition.
Deletes rename_file_backed_room (inlined into the handler now that the
handler owns the claim order).
Regression test: path_collision_does_not_overwrite_existing_file asserts
the target file is byte-identical after a claim collision.
* refactor(runtimed): avoid unwrap in SaveNotebook pre-claim logic
Collapse the 'needs_new_claim' + 'target_for_claim' pair into a single
'pre_claim: Option<PathBuf>' that encodes both 'does a claim need to
happen?' and 'here's the path to claim'. Eliminates the `.unwrap()`
that CI flagged as clippy::unwrap_used.
* fix(runtimed-py): surface SaveError from daemon save requests
Codex review P1: Session.save() / AsyncSession.save() returned
'Unexpected response' when the daemon sent NotebookResponse::SaveError.
Now maps SaveErrorKind::PathAlreadyOpen to a user-facing message that
includes the conflicting UUID + path, and SaveErrorKind::Io to its
message directly.
Also syncs the MCP tool-cache + mcpb manifests after Task 8.1's
open_notebook description change (CI was failing on stale caches).
* chore: add diagnostics for path-related operations
Hunting the stray '{cwd}/{uuid}.ipynb' bug. Adds targeted logs at
every site where a path flows through the save/open machinery:
- Daemon: flag UUID-shaped OpenNotebook paths at entry (warn)
- Daemon: save_notebook_to_disk logs caller-supplied path + room state
- Tauri: initialize_notebook_sync_open warns on bare-UUID paths
- Tauri: save_notebook / save_notebook_as log invocation
- Tauri: all context.path mutations log the before/after value + window
- Tauri: apply_path_changed logs the broadcast payload
Temporary — remove once the root cause is identified and fixed.
* fix(runt-mcp-proxy): preserve uuid notebook identity on rejoin
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
rekey_ephemeral_room(UUID → file path) were never evicted, leaking fds indefinitelyArcpointer comparison instead of the capturednotebook_idkey, which becomes stale after re-keyingRoot cause
When
rekey_ephemeral_roommoves a room from UUID → file path key, the eviction timer scheduled under the old UUID key fires 30s later and looks uprooms_guard.get(&uuid_key). Since the UUID was removed from the map (replaced by the path key), the lookup returnsNoneand eviction is skipped. No new eviction timer is ever scheduled under the new path key. The room lives forever with its kernel, runtime agent socket, file watcher, and autosave debouncer.Evidence from production daemon (7h uptime)
Fix
The eviction timer closure now iterates the rooms map to find the entry matching the captured
Arc<NotebookRoom>by pointer. TheArcis stable across re-keying sincerekey_ephemeral_roommoves the sameArcto the new key. This is O(n) in room count but runs at most once per room disconnect — negligible vs the existing O(doc_size) teardown work.Test plan