Skip to content

fix(runtimed): evict re-keyed rooms by Arc pointer, not stale key#1836

Merged
rgbkrk merged 2 commits intonteract:mainfrom
quillaid:fix/room-eviction-rekey
Apr 16, 2026
Merged

fix(runtimed): evict re-keyed rooms by Arc pointer, not stale key#1836
rgbkrk merged 2 commits intonteract:mainfrom
quillaid:fix/room-eviction-rekey

Conversation

@quillaid
Copy link
Copy Markdown
Collaborator

Summary

  • Fixes room eviction leak where rooms re-keyed by rekey_ephemeral_room (UUID → file path) were never evicted, leaking fds indefinitely
  • The eviction timer now finds the room in the rooms map by Arc pointer comparison instead of the captured notebook_id key, which becomes stale after re-keying

Root cause

When rekey_ephemeral_room moves a room from UUID → file path key, the eviction timer scheduled under the old UUID key fires 30s later and looks up rooms_guard.get(&uuid_key). Since the UUID was removed from the map (replaced by the path key), the lookup returns None and eviction is skipped. No new eviction timer is ever scheduled under the new path key. The room lives forever with its kernel, runtime agent socket, file watcher, and autosave debouncer.

Evidence from production daemon (7h uptime)

Metric Count
Rooms created 237
Eviction timers scheduled 248
Eviction skipped (stale key after re-key) 163
Actually evicted 81
Cancelled (peers reconnected) 4
Zombie rooms (0 peers, idle kernel) 144
Daemon fds 695/1024 (68%)

Fix

The eviction timer closure now iterates the rooms map to find the entry matching the captured Arc<NotebookRoom> by pointer. The Arc is stable across re-keying since rekey_ephemeral_room moves the same Arc to the new key. This is O(n) in room count but runs at most once per room disconnect — negligible vs the existing O(doc_size) teardown work.

Test plan

  • All runtimed tests pass (25/25)
  • Lint clean, tokio mutex lint passes
  • Deploy locally, verify zombie rooms are evicted after 30s idle timeout
  • Monitor fd count — should stabilize instead of monotonically increasing

When rekey_ephemeral_room moves a room from UUID → file path key, the
eviction timer scheduled under the old UUID key would find no matching
room (the UUID was removed from the map) and skip eviction entirely.
The room then lives forever with its kernel, runtime agent socket,
file watcher, and autosave debouncer — leaking fds.

Fix: look up the room in the eviction timer by Arc pointer instead of
the captured notebook_id. The Arc is stable across re-keys since
rekey_ephemeral_room moves the same Arc to the new key.

Evidence: 237 rooms created, 163 eviction timers skipped due to stale
key after re-key, 144 zombie rooms with 0 peers and idle kernels
accumulating 695 fds against a 1024 soft limit.
@rgbkrk rgbkrk marked this pull request as draft April 16, 2026 06:16
Verifies the exact failure mode that caused zombie rooms: a room is
created under a UUID key, re-keyed to a file path on save, then the
eviction timer fires with the stale UUID. The test confirms the
Arc::ptr_eq lookup finds and removes the room under its new key.

Addresses review feedback on PR nteract#1836.
@rgbkrk rgbkrk marked this pull request as ready for review April 16, 2026 12:51
@rgbkrk rgbkrk merged commit 8badf16 into nteract:main Apr 16, 2026
21 of 22 checks passed
@rgbkrk rgbkrk mentioned this pull request Apr 16, 2026
8 tasks
rgbkrk added a commit that referenced this pull request Apr 16, 2026
Task 6.2 replaced the post-save rekey dance with inline path_index/
room.path updates in the SaveNotebook handler. The function had been
#[allow(dead_code)] since then. Its companion test (added in PR #1836
to guard against stale-UUID eviction) becomes structurally obsolete
because UUID keys never change in the new model.

Test count: 323 → 322
rgbkrk added a commit that referenced this pull request Apr 16, 2026
Task 6.2 replaced the post-save rekey dance with inline path_index/
room.path updates in the SaveNotebook handler. The function had been
#[allow(dead_code)] since then. Its companion test (added in PR #1836
to guard against stale-UUID eviction) becomes structurally obsolete
because UUID keys never change in the new model.

Test count: 323 → 322
rgbkrk added a commit that referenced this pull request Apr 16, 2026
Task 6.2 replaced the post-save rekey dance with inline path_index/
room.path updates in the SaveNotebook handler. The function had been
#[allow(dead_code)] since then. Its companion test (added in PR #1836
to guard against stale-UUID eviction) becomes structurally obsolete
because UUID keys never change in the new model.

Test count: 323 → 322
rgbkrk added a commit that referenced this pull request Apr 16, 2026
* docs(plan): uuid-first notebook identity implementation plan

* feat(runtimed): add PathIndex for canonical-path → uuid lookup

* feat(runtimed): add NotebookRoom.id UUID field (unwired)

* refactor(runtimed): NotebookRoom.notebook_path -> path: Option<PathBuf>

Rename the field from `notebook_path: RwLock<PathBuf>` to
`path: RwLock<Option<PathBuf>>`. Untitled rooms (UUID notebook_id)
initialize with `None`; file-backed rooms initialize with `Some(path)`.

- `new_fresh` and `load_or_create`: compute `path` via `is_untitled_notebook`
- Trust verification: file-backed rooms use `verify_trust_from_file`,
  untitled rooms use snapshot path (unchanged)
- `find_room_by_notebook_path`: compare `Option` with `as_deref()`
- `rekey_ephemeral_room`: write `Some(new_path)` on promotion
- `save_notebook_to_disk`: return `Unrecoverable` if `target_path` is None
  and room has no `path` (untitled notebook with no save target)
- `process_markdown_assets`, asset resolution, runtime agent spawn,
  auto-launch, clone-to-disk: all updated to handle `Option<PathBuf>`
- `test_room_with_path` (test helper): `path: RwLock::new(Some(...))`
- Added tests: `untitled_room_has_path_none`, `file_backed_room_has_path_some`

* refactor(runtimed): clean up Task 4.1 clippy lints

Convert map_or(false/true, ...) to is_some_and/is_none_or. Restructure
trust verification to match on &path, eliminating the implicit-invariant
expect() on path.as_deref().

Review feedback from Task 4.1 code review.

* docs: refresh stale docstrings after notebook_path -> path rename

- SaveNotebook request docstring: untitled rooms (no path) now error
- Restore orphaned 'Get or create a room' docstring to its function
- Test helper doc comment: notebook_path -> path

* refactor(runtimed): key rooms map by Uuid, add path_index

Switch NotebookRooms from HashMap<String, ...> to HashMap<Uuid, ...>.
Add a PathIndex secondary index (path → UUID) threaded through the
Daemon struct for O(1) path lookups. Replace the old linear scan in
find_room_by_notebook_path with the O(1) find_room_by_path using
path_index.

- NotebookRoom::new_fresh now takes (uuid, path: Option<PathBuf>, ...)
  directly instead of a notebook_id string; callers build both explicitly.
- get_or_create_room is now async and takes &NotebookRooms + path_index.
- rekey_ephemeral_room keeps the UUID key stable and only updates the
  room's path field + path_index; it no longer re-keys the rooms map.
- Eviction loop captures path_index and cleans up path_index entry on
  room removal.
- All daemon.rs handlers (handle_open_notebook, handle_create_notebook,
  NotebookSync, RuntimeAgent, InspectNotebook, ShutdownNotebook, ListRooms)
  updated to use UUID keys.
- New tests: find_room_by_path_returns_room_after_index_insert,
  find_room_by_path_returns_none_when_not_indexed.
- All existing tests updated to pass uuid + path: Option<PathBuf>.
- 320 lib tests + tokio_mutex_lint + all lint checks pass.

* fix(runtimed): address Task 5.1 code review concerns

- C1: NotebookSync handshake with a path now consults find_room_by_path
  before minting a new UUID. Prevents zombie rooms on reconnect.
- I1: eviction uses path_index.remove_by_uuid(uuid) instead of reading
  room.path via try_read(). Robust against concurrent path writers.
- I2: swallowed PathAlreadyOpen in get_or_create_room is now logged at
  error level since it indicates caller invariant violation.
- I3: test_eviction_finds_rekeyed_room_by_arc_pointer now exercises
  path_index cleanup via remove_by_uuid.
- I4: corrected TODO comment in rekey_ephemeral_room to point at
  Task 6.2 as the closure point (not Phase 7).

* feat(protocol): add PathChanged broadcast and SaveError response

Introduces NotebookBroadcast::PathChanged (sent when a room's .ipynb
path changes) and NotebookResponse::SaveError { error: SaveErrorKind }
with PathAlreadyOpen and Io variants.

These are scaffolding for Task 6.2's SaveNotebook rewrite — no behavior
changes this commit. RoomRenamed and NotebookSaved.new_notebook_id
remain in place until Task 7.3.

* feat(runtimed): rewrite SaveNotebook without room rekeying

The new handler does the transition untitled→file-backed inline:
- Claim path in path_index (fail fast with SaveErrorKind::PathAlreadyOpen
  on collision)
- Update room.path
- Stop .automerge persistence; delete the stale file
- Spawn .ipynb file watcher + autosave debouncer
- Clear ephemeral markers
- Broadcast NotebookBroadcast::PathChanged

Also handles save-as rename: updates path_index and room.path when the
written path differs from the existing room.path.

rekey_ephemeral_room is marked #[allow(dead_code)] — Task 7.1 deletes
it along with its companion tests. This closes the interloper-detection
gap that Task 5.1 deferred.

Updated test_rekey_ephemeral_room_starts_autosave to drive through the
new inline transition logic instead of calling rekey_ephemeral_room
directly. Added two new unit tests:
- saving_untitled_notebook_updates_path_index_and_keeps_uuid
- saving_to_already_open_path_returns_path_already_open_error

* fix(runtimed): address Task 6.2 code review concerns

- Extract promote_untitled_to_file_backed and rename_file_backed_room
  helpers. Handler and tests both call them (fixes parallel-to-production
  test smell).
- Delete the ineffective persist_tx.send(None) call. The .automerge
  file may still be resurrected by subsequent AutomergeSync frames;
  documented with TODO(followup) to track the deeper fix.
- Combine dual room.path reads into one critical section.
- Add warn log when canonicalize falls back to the raw path.
- Swap std::fs::remove_file to tokio::fs::remove_file in the
  transition helper (async throughout).

* refactor(runtimed): delete rekey_ephemeral_room

Task 6.2 replaced the post-save rekey dance with inline path_index/
room.path updates in the SaveNotebook handler. The function had been
#[allow(dead_code)] since then. Its companion test (added in PR #1836
to guard against stale-UUID eviction) becomes structurally obsolete
because UUID keys never change in the new model.

Test count: 323 → 322

* refactor(runtimed): delete redirect_map and RedirectEntry

The post-save rekey block that populated redirect_map was deleted in
Task 6.2, so the map was never written to. Lookups always miss. Remove
the field, the RedirectEntry struct, both lookup sites in
handle_open_notebook and handle_create_notebook, and the constructor
initialization.

* refactor(protocol)!: delete RoomRenamed broadcast and new_notebook_id

Both were part of the pre-UUID-first model where notebook IDs could
change after save. With UUID-first identity, UUIDs are stable for a
room's lifetime. RoomRenamed is replaced by PathChanged (different
semantics: path mutation, not identity change). new_notebook_id is
dropped entirely — callers just use the UUID they already have.

Breaking change: wire protocol consumers must update to handle
PathChanged instead of RoomRenamed, and stop reading new_notebook_id
from SaveNotebook responses.

* refactor(runt-mcp)!: drop looks_like_path, split notebook param

open_notebook now takes explicit `path` XOR `notebook_id` parameters
instead of a single heuristic-driven `notebook` field. The
looks_like_path helper is deleted.

save_notebook uses a UUID parse check (not a path heuristic) to detect
ephemeral notebooks that require an explicit path before saving.

save_notebook response shape is finalized: `path` and `notebook_id`
only (no previous_notebook_id since UUIDs are stable after Task 5/6/7).

Adds unit tests for ephemeral UUID detection and save response shape.

Breaking change for MCP agents still using the old `notebook` convenience
parameter on open_notebook — migrate to `path` or `notebook_id`.

* fix(runt-mcp): drop UUID-parse ephemeral heuristic, handle SaveError

The Task 8.1 subagent added an ephemeral-detection check that parses
notebook_id as UUID — but in the UUID-first model every notebook_id is
a UUID, so the check always fires and breaks file-backed save-in-place.

Delete the client-side check. The daemon already returns SaveError with
a clear message when it can't determine a save target. Also add a proper
match arm for NotebookResponse::SaveError so PathAlreadyOpen and Io
errors surface cleanly to the agent instead of falling through to
'Unexpected response'.

* chore: lint, typecheck, and fix unused variable after UUID-first refactor

Remove two dead `override_arc` bindings in async_session.rs that were
left over from the UUID-first refactor but never used in the async block.
All lint, clippy, typecheck, and lib tests pass clean.

* fix(runtimed): return room UUID in NotebookConnectionInfo, not path

Task 5.1 minted a UUID as the room key but handle_open_notebook (and
handle_create_notebook with a caller-provided hint) continued to send
the canonical path string as NotebookConnectionInfo.notebook_id. MCP
agents and TypeScript clients received a path where they expected a
UUID.

Smoke-tested via nteract-dev: create_notebook → save_notebook(path) →
open_notebook(path) all return the same UUID, and a second save to the
same path correctly surfaces SaveErrorKind::PathAlreadyOpen with the
conflicting session's UUID.

* docs: sweep stale references to deleted UUID-first concepts

Updates contributing/, CLAUDE.md, .claude/rules/, .claude/skills/, and
source comments to describe the post-refactor reality: UUID-stable room
keys, path_index secondary map, PathChanged broadcast, and PathAlreadyOpen
on save collision. Historical references in the plan document itself are
left intact.

* feat(notebook)!: handle PathChanged broadcast and SaveError in Tauri app

Frontend:
- Add PathChanged variant to DaemonBroadcast type.
- useAutomergeNotebook subscribes to path_changed and invokes the new
  apply_path_changed Tauri command to keep the window's local path +
  title in sync when a peer (MCP agent, sibling window) saves or renames.

Rust (Tauri notebook crate):
- apply_path_changed command: updates WindowNotebookContext.path and
  window title from a PathChanged payload.
- save_notebook and save_notebook_as now match on SaveError and surface
  PathAlreadyOpen as a user-facing message: 'Cannot save: {path} is
  already open in another window. Close that window first...'
- format_save_error helper shared between both commands.

Breaking change for clients that relied on RoomRenamed — that variant is
gone; use PathChanged (which does NOT change the UUID) for path updates.

* fix(runtimed): pass room UUID to runtime agent subprocess, not path

Two RuntimeAgentHandle::spawn call sites (auto_launch_kernel and the
LaunchKernel handler) were passing room.path (or a path-derived string)
as the agent's notebook_id argument. When the agent reconnected via its
RuntimeAgent handshake, the daemon parsed that value as a UUID to look
up the room — a path like /home/ubuntu/foo.ipynb fails Uuid::parse_str,
so the lookup returned None and the agent timed out after 30s waiting
for the handshake response.

Both sites now pass room.id.to_string(), the canonical stable UUID key
for the rooms map.

Caught in staging.

* fix(runtimed): claim path_index before writing to disk on save

Critical bug found during live QA: when room B attempted to save to a
path already claimed by room A, the handler wrote room B's contents to
disk FIRST, then checked path_index. The empty write overwrote room A's
file; room A's file watcher saw the 'external' change and deleted all
its CRDT cells to mirror disk; autosave then persisted the wipe.

Fix: claim path_index BEFORE save_notebook_to_disk. On collision, return
SaveError::PathAlreadyOpen immediately with the file untouched. On I/O
failure, roll back the claim.

Introduces:
- canonical_target_path() — canonicalize a path that may not yet exist
  (tokio::fs::canonicalize requires the file to exist).
- normalize_save_target() — extension + relative-path validation.
- try_claim_path() — small wrapper returning SaveErrorKind directly.
- finalize_untitled_promotion() — the non-claim half of the transition.

Deletes rename_file_backed_room (inlined into the handler now that the
handler owns the claim order).

Regression test: path_collision_does_not_overwrite_existing_file asserts
the target file is byte-identical after a claim collision.

* refactor(runtimed): avoid unwrap in SaveNotebook pre-claim logic

Collapse the 'needs_new_claim' + 'target_for_claim' pair into a single
'pre_claim: Option<PathBuf>' that encodes both 'does a claim need to
happen?' and 'here's the path to claim'. Eliminates the `.unwrap()`
that CI flagged as clippy::unwrap_used.

* fix(runtimed-py): surface SaveError from daemon save requests

Codex review P1: Session.save() / AsyncSession.save() returned
'Unexpected response' when the daemon sent NotebookResponse::SaveError.
Now maps SaveErrorKind::PathAlreadyOpen to a user-facing message that
includes the conflicting UUID + path, and SaveErrorKind::Io to its
message directly.

Also syncs the MCP tool-cache + mcpb manifests after Task 8.1's
open_notebook description change (CI was failing on stale caches).

* chore: add diagnostics for path-related operations

Hunting the stray '{cwd}/{uuid}.ipynb' bug. Adds targeted logs at
every site where a path flows through the save/open machinery:

- Daemon: flag UUID-shaped OpenNotebook paths at entry (warn)
- Daemon: save_notebook_to_disk logs caller-supplied path + room state
- Tauri: initialize_notebook_sync_open warns on bare-UUID paths
- Tauri: save_notebook / save_notebook_as log invocation
- Tauri: all context.path mutations log the before/after value + window
- Tauri: apply_path_changed logs the broadcast payload

Temporary — remove once the root cause is identified and fixed.

* fix(runt-mcp-proxy): preserve uuid notebook identity on rejoin
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants