fix: self-heal manifest-unreferenced branch forks (stop wedged branches)#231
fix: self-heal manifest-unreferenced branch forks (stop wedged branches)#231ragnorc wants to merge 11 commits into
Conversation
The global Arc<RwLock<Omnigraph>> that once serialized every server write was removed — the server holds the engine as a lockless Arc<Omnigraph> and write methods are &self, so the per-(table_key, branch) write queues are now the actual write-serialization mechanism (in-process only). Correct comments that still claimed the global lock is 'still in place' / 'today', or framed the queues as MR-686 scaffolding: write_queue.rs module doc, exec/merge.rs, db/omnigraph/schema_apply.rs, db/manifest/recovery.rs, and the bench_concurrent_http.rs example (which also wrongly stated mutate_as is &mut self). workload.rs is left as-is — its 'previous global RwLock' wording is accurate history.
An interrupted first-write fork (create_branch succeeded, the manifest publish did not) leaves a fully-formed Lance branch ref the manifest never references. The branch stays a valid manifest branch, so cleanup's reconciler never reclaims it, and today the next write to that table wedges with 'incomplete prior delete; run cleanup'. Forge that exact residue (a live 'feature' branch + a directly-created 'feature' ref on the Person table the manifest doesn't reference) and assert the next load AND mutate self-heal. Deterministic and local — no S3 or timing, since the forge IS the post-crash state. Adds a shared node_table_uri helper. This commit is RED: it reproduces the bug and fails against the unfixed engine with the predicted symptom. The fix follows in the next commit.
The first write to a table on a branch lazily forks it via Lance create_branch, a durable two-phase op that advances Lance state BEFORE the atomic manifest publish. If the writer dies or its request future is cancelled between the fork and the publish, the branch ref is fully formed but the manifest never references it. The next write re-enters the fork path, create_branch collides, and the engine wedged with 'orphaned table state ... incomplete prior delete; run cleanup' — which cleanup could not even fix, because the branch is still a live manifest branch. This hit load, mutate, ingest, and the merge fork path (one shared engine chokepoint), so a routine deploy restart or client disconnect could wedge a branch. Fix: treat the per-table fork ref as derived state of the manifest. fork_branch_ from_state returns a typed ForkOutcome instead of a human 'incomplete prior delete' error; on RefAlreadyExists the db layer reclaims the manifest- unreferenced fork (force_delete_branch + re-fork, exactly once) and proceeds. A live committed fork is still routed to a retryable conflict before the fork path, so concurrent first-writes stay correct. Reclaim is only safe if no in-process writer can be mid-fork, so the write entry points (load, mutate) acquire the per-(table, branch) write queues for all touched tables up front — before the fork, held through the publish — when forking a non-main branch. commit_all accepts these pre-held guards instead of re-acquiring (the queue is non-re-entrant). The merge fork path already holds the queue and self-heals through the shared wrapper. Cross-process in-flight forks remain the documented one-winner-CAS gap. Mechanical prep folded in: mutation IR lowering is hoisted so the touched-table set is known before execution; commit_all gains the held_guards parameter. Flips recreate_over_orphaned_fork_before_cleanup_is_actionable to assert self-heal; fork_collision_with_live_concurrent_fork_is_retryable still holds. Docs: writes.md cancelled-future note, invariants.md cross-process known gap.
reconcile_orphaned_branches keyed orphans on the branch NAME (absent from the manifest), so it only reclaimed forks from a fully-deleted branch. A fork left on a still-live branch by an interrupted first-write was never reclaimed — the backstop the handoff expected cleanup to provide did not cover that case. Broaden it to a per-table authority test: a Lance branch B on table T is an orphan iff B is not a live manifest branch (delete-leftover) OR the manifest's branch-B snapshot does not place T on B (interrupted first-write). Per-branch snapshots are resolved once and cached across tables. Legitimately-forked tables, main, and internal/system branches are never reclaimed; children are dropped before parents to avoid Lance's referenced-parent RefConflict. The commit-graph half stays whole-branch (per-table doesn't apply there). This is the guaranteed-convergence backstop to the write-path self-heal: it reclaims any fork the write path never revisits, and is what Lance's own create_branch docstring asks embedders to provide for zombie/orphan refs.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 37f5af587a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| crate::storage_layer::ForkOutcome::RefAlreadyExists => { | ||
| table_ops::reclaim_orphaned_fork_and_refork( |
There was a problem hiding this comment.
Re-read branch authority before reclaiming forks
This reclaim trusts the caller's proof that the manifest does not place the table on active_branch, but the first-write path obtains that proof with snapshot_for_branch, which can return the coordinator's cached snapshot when the handle is currently bound to that branch. During the branch-merge target-branch swap (or any stale branch-bound handle), another writer may already have published a legitimate fork; create_branch then reports RefAlreadyExists and this path force-deletes/re-forks that valid table branch before commit_all notices the manifest/head mismatch, leaving the manifest pointing at a deleted fork. Re-open a fresh branch snapshot immediately before reclaiming, or treat RefAlreadyExists as retryable unless the fresh manifest proves the ref is unreferenced.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in fedda78. reclaim_orphaned_fork_and_refork now re-derives its precondition from a FRESH manifest read (fresh_snapshot_for_branch, which bypasses the coordinator cache) immediately before force-deleting, and refuses with a retryable conflict if the table is legitimately on the branch. Correct regardless of caller snapshot staleness (including the branch-merge target swap). Good catch.
The fork reclaim force-deletes a Lance branch ref, gated on the caller's proof that the manifest does not place the table on the branch. But the first-write path obtains that proof via snapshot_for_branch, which returns the coordinator's CACHED snapshot when the handle is bound to the branch (an embedded handle on the branch, or branch_merge's target swap). If that snapshot is stale and a concurrent writer already published a legitimate fork, the reclaim would force-delete it and re-fork from source, stranding the manifest at a version the recreated ref no longer has. Make the destructive primitive own its safety precondition: re-derive it from a FRESH manifest read (fresh_snapshot_for_branch, which bypasses the cache) immediately before force-deleting. If fresh authority shows the table is on the branch, refuse with a retryable conflict instead of destroying a valid fork. Correct for any caller regardless of snapshot staleness. Also stop branching on Lance's exact RefConflict prose (loosened match; typed-variant is the durable follow-up). Addresses PR review (Codex P1, Greptile P2).
A node delete cascades to every edge table touching that node (execute_delete_ node), forking those edge tables during execution. But touched_table_keys derived the up-front fork-queue set from the IR ops alone (just node:Type), so a branch delete that forks node + cascade edges held only the node queue — commit_all then saw cascade-edge keys it had no guard for. The touched set is a pure function of (IR ops + catalog), so compute the COMPLETE set: op types plus, for delete-node ops, the cascade edges derived the same way the executor derives them (from_type/to_type match). Pre-computed now equals actual by construction. Also promote commit_all's held-guard coverage check out of debug_assert into an all-builds check that fails the write with a typed manifest_internal error: a load-bearing serialization invariant must fail loudly+safely in release, not silently proceed unguarded if a future execution path ever touches a table outside the pre-computed set. Adds branch_cascade_delete_forks_node_and_edges_under_held_queues, which drives the cascade path on a branch (the gap the existing insert/load tests missed). Addresses PR review (Cursor medium, Greptile P2).
The broadened per-table reconciler force_delete'd an orphan candidate on a LIVE branch without holding the per-(table, branch) write queue. An in-process first-write fork in its fork->publish window holds that queue and has not yet advanced the manifest, so it looks exactly like an origin-2 orphan — concurrent cleanup could delete the ref the writer still holds and is about to publish. (The old branch-name-based reconciler did not have this race: a deleted branch cannot have a live first-write.) Bring the reconciler under the same invariant the write-path reclaim already obeys: never force_delete a fork ref without holding the (table, branch) write queue AND confirming, under it, from a fresh read, that the ref is still manifest-unreferenced. Acquire one key at a time (no lock-order inversion vs multi-table acquire_many writers); if the writer published meanwhile, the fresh re-check sees the table on the branch and skips. Cross-process writers remain the documented one-winner-CAS gap. Addresses PR review (Cursor high).
fork_branch_from_state mapped ANY create_branch failure to RefAlreadyExists, routing transient I/O / version / Lance-internal errors into the destructive reclaim path and masking the real error as a retryable conflict. Branch on the actual fact instead: on create_branch failure, check whether the ref exists (list_branches). Only a genuinely pre-existing ref — a fully-formed manifest-unreferenced fork — is a reclaim candidate; any other failure propagates with fidelity. We deliberately do NOT force-delete on a not-found-ref failure: it is indistinguishable from a transient error on a fresh create, and force-deleting there is the overreach the fresh-authority guard already removed. A phase-1-only Lance zombie (rarer; create_branch interrupted mid its two internal phases) surfaces as the propagated error for manual reclaim. Addresses PR review (Cursor medium).
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 3300109. Configure here.
…ive branch The reconcile pre-delete re-check treated ANY fresh_snapshot error as 'still an orphan' and proceeded to force_delete. A transient manifest read failure on a LIVE branch could therefore destroy a fork the manifest still considers legitimate — inconsistent with the write-path reclaim (aborts on the same error) and the candidate scan (skips on snapshot failure). Distinguish the two origins under the queue: a branch absent from the manifest authority (origin 1) is a confirmed orphan and is deleted without a fresh read (no live writer can hold a deleted branch's queue); a LIVE branch (origin 2) gets the fresh re-check and, on a transient read error, is SKIPPED — never destroyed on ambiguity — converging on a later cleanup. Same don't-destroy-on- ambiguous-error principle as the create_branch failure classification. Addresses PR review (Cursor medium).
| let still_orphan = if !live_branches.contains(&branch) { | ||
| // Origin 1: the branch is absent from the manifest authority | ||
| // entirely — a confirmed orphan. No live writer can hold this | ||
| // branch's queue (you cannot first-write to a branch the | ||
| // manifest does not have), so no fresh re-check is needed. | ||
| true |
There was a problem hiding this comment.
Stale
live_branches makes origin-1 still_orphan = true unsafe
live_branches is a snapshot taken once at the top of reconcile_orphaned_branches (from the coordinator's cached view). The comment "you cannot first-write to a branch the manifest does not have" is only true at that moment. If a branch is created after live_branches is captured, it is absent from the set even though the manifest now has it and an in-process writer may already be holding the (table, branch) write queue through its fork+publish. The reconciler blocks on the queue, the writer publishes (the fork is now legitimate), the reconciler acquires the queue — and then !live_branches.contains(&branch) is still true (stale), so it treats the legitimately-published fork as an origin-1 orphan and calls force_delete_branch. The result is the manifest referencing a non-existent Lance branch ref, corrupting subsequent reads on that (table, branch) pair until repaired.
The origin-2 path correctly defends against this by calling fresh_snapshot_for_branch under the queue. Origin-1 needs the same treatment: if the fresh read returns "branch not found", it is confirmed absent (still an orphan); if the read succeeds, re-check whether the table is placed on it and skip if so; on transient error, skip (matching the origin-2 safe-skip behaviour).

What & why
A first write to a table on a branch lazily forks it via Lance
create_branch— a durable, two-phase op that advances Lance state before the atomic manifest publish — so a crash, deploy restart, or cancelled request future between the fork and the publish left a fully-formed branch ref the manifest never referenced, wedging every subsequent write to that table on that branch with"orphaned table state … incomplete prior delete; run cleanup"(andcleanupcouldn't even fix it, since the branch was still live). This makes the per-table fork derived state of the manifest: the fork now returns a typedForkOutcomeand self-heals a manifest-unreferenced leftover (reclaim + re-fork under the write queue) on the next write across load/mutate/merge, withcleanup's reconciler broadened to a per-table authority test as the guaranteed backstop.Backing issue / RFC
Checklist
chore:commitwrites.rs), reconcile coverage (maintenance.rs), failpoints flip + preserved retryable testwrites.md,invariants.md) updatedNotes for reviewers
Reclaim is safe because branch first-writes now acquire the per-
(table, branch)write queues up front (held through publish), so no in-process writer can be mid-fork; cross-process in-flight forks remain the already-documented one-winner-CAS gap (noted ininvariants.md). Commits are file-granular and ordered red→green (3edbe2eis the RED test,6e3204dthe fix) since interactive hunk staging wasn't available. Only steady-state cost is added serialization of concurrent first-writes to the same(table, branch)— which already conflict at commit; main-branch and already-forked writes are unaffected.Note
Medium Risk
Changes destructive fork reclaim and cleanup deletion paths on the write and maintenance hot paths; mitigations are fresh manifest checks and write-queue serialization, with cross-process races still documented as retry-only CAS gaps.
Overview
Fixes a durability wedge where Lance
create_branchcould finish before the manifest publish, leaving a branch ref the manifest never referenced and blocking later writes with “run cleanup” (often unhelpful while the branch stayed live).Write path:
fork_branch_from_statenow returnsForkOutcome(CreatedvsRefAlreadyExists, confirmed vialist_branchesso transient errors don’t trigger reclaim). On collision,reclaim_orphaned_fork_and_reforkre-checks a fresh branch snapshot,force_delete_branch, then re-forks.mutate_asandload_aspre-acquire per-(table, branch)queues for all touched tables when a fork is needed and passheld_guardsintocommit_allso the non-re-entrant queue isn’t double-acquired;touched_table_keysincludes edge tables from cascading node deletes.Cleanup backstop:
reconcile_orphaned_branchestreats orphans per table (deleted branch vs live branch without that table on it), acquires the write queue, re-validates withfresh_snapshot_for_branch(skips delete on read errors), then force-deletes.Tests and dev docs (
invariants.md,writes.md) cover self-heal, origin-2 reconcile, and the in-process-only reclaim safety model.Reviewed by Cursor Bugbot for commit d0fb12b. Bugbot is set up for automated code reviews on this repo. Configure here.
Greptile Summary
This PR fixes a durability defect where an interrupted first-write branch fork — Lance
create_branchcompleting but the manifest publish not — left a manifest-unreferenced branch ref that wedged all subsequent writes to that table on that branch with "run cleanup" (andcleanupcouldn't fix it since the branch was still live). The fix makes the fork operation self-healing:fork_branch_from_statereturns a typedForkOutcome, and onRefAlreadyExiststhe engine reclaims the stale ref under the write queue after a fresh manifest re-validation, then re-forks.table_ops.rs,loader/mod.rs,mutation.rs): first-writes onto non-main branches now acquire per-(table, branch)write queues before the fork and hold them through the manifest publish;reclaim_orphaned_fork_and_reforkself-heals manifest-unreferenced residues with a fresh-manifest safety check before anyforce_delete_branch.optimize.rs): broadened to a per-table authority test (origin-1: whole branch gone; origin-2: live branch but table not placed on it); origin-2 forks are re-validated under the write queue viafresh_snapshot_for_branchbefore delete.writes.rs,maintenance.rs,failpoints.rs): deterministic forged-orphan regression test (load + mutate paths), cascade-delete queue-coverage test, and the recreate-over-orphan failpoint test flipped from expect-error to expect-success.Confidence Score: 4/5
The write-path self-heal and origin-2 reconciler path are carefully guarded with fresh manifest checks under the queue; one gap in the origin-1 reconciler re-validation can delete a legitimately-published fork when cleanup runs concurrently with branch creation.
The write-path changes and origin-2 reconciler path are sound — both re-validate under the write queue using a fresh manifest read before any destructive operation. The origin-1 re-validation in
reconcile_orphaned_branches(lines 734–739) skips the fresh check, relying on a stalelive_branchessnapshot taken at function start. A branch created concurrently during the sweep will not be in that snapshot; if its first-write fork is published before the reconciler reaches that (table, branch) pair, the reconciler acquires the queue and — becauselive_branchesstill does not contain the branch — callsforce_delete_branchon a legitimately-published ref, leaving the manifest referencing a non-existent Lance branch ref.crates/omnigraph/src/db/omnigraph/optimize.rs — the origin-1
still_orphan = trueblock (lines 734–739)Important Files Changed
still_orphan = trueuses a stalelive_branchessnapshot and can delete a legitimately-published fork on a recently-created branch.reclaim_orphaned_fork_and_reforkwith self-validation via fresh manifest read before destructive ops; correct handling of double-collision and RefConflict error.touched_table_keys(including cascade-delete edge tables) before execution for up-front queue acquisition; theneeds_fork = falsegap for stale snapshot is the documented one-winner-CAS limit.held_guardspass-through tocommit_all; held-guard coverage check is now an all-builds hard failure rather than a debug-only assertion.ForkOutcome;list_branchesdisambiguation prevents routing transient I/O errors into the destructive reclaim path.touchedfromnode_rows/edge_rowskeys.Sequence Diagram
sequenceDiagram participant W as Writer participant Q as WriteQueue participant L as Lance (TableStore) participant M as Manifest Note over W,M: First write to table T on branch B (fork path) W->>Q: acquire_many([(T, B), ...]) ← up-front Q-->>W: guards held W->>L: fork_branch_from_state(T, B) alt Created L-->>W: ForkOutcome::Created(ds) else RefAlreadyExists (interrupted prior fork) L-->>W: ForkOutcome::RefAlreadyExists W->>M: fresh_snapshot_for_branch(B) M-->>W: snapshot (no T on B → confirmed orphan) W->>L: force_delete_branch(T, B) L-->>W: ok W->>L: fork_branch_from_state(T, B) again L-->>W: ForkOutcome::Created(ds) end W->>M: commit_all (guards reused, not re-acquired) M-->>W: published W->>Q: release guards Note over W,M: cleanup reconciler — per-table authority participant C as Cleanup C->>M: live_branches snapshot C->>L: list_branches(T) C->>Q: acquire(T, B) ← waits for any live writer alt origin-2 (live branch, T not placed) C->>M: fresh_snapshot_for_branch(B) ← re-validate M-->>C: T placed on B → skip (not orphan) endComments Outside Diff (1)
crates/omnigraph/src/exec/mutation.rs, line 456-469 (link)needs_fork = falseleaves a fork-without-queue gap on cross-process branch recreateIf
needs_forkisfalseat snapshot time (all touched tables appear already-forked), no guards are pre-acquired. If a foreign-process then deletes and recreates the branch between this snapshot fetch andexecute_named_mutation, some tables become unforked;open_owned_dataset_for_branch_writere-detects this and callsdb.fork_dataset_from_entry_state, which may reachreclaim_orphaned_fork_and_refork— without the per-(table, branch)queue held. The invariant comment onreclaim_orphaned_fork_and_reforkstates callers "MUST already hold" the queue.This is acknowledged in
invariants.mdas the one-winner-CAS gap and leads to retries rather than corruption, so it's not a blocking concern. Worth calling out explicitly since the comment onreclaim_orphaned_fork_and_reforkreads as an unconditional precondition but actually has this cross-process carve-out.Reviews (5): Last reviewed commit: "Merge remote-tracking branch 'origin/mai..." | Re-trigger Greptile