Skip to content

fix: self-heal manifest-unreferenced branch forks (stop wedged branches)#231

Open
ragnorc wants to merge 11 commits into
mainfrom
ragnorc/medan
Open

fix: self-heal manifest-unreferenced branch forks (stop wedged branches)#231
ragnorc wants to merge 11 commits into
mainfrom
ragnorc/medan

Conversation

@ragnorc

@ragnorc ragnorc commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

What & why

A first write to a table on a branch lazily forks it via Lance create_branch — a durable, two-phase op that advances Lance state before the atomic manifest publish — so a crash, deploy restart, or cancelled request future between the fork and the publish left a fully-formed branch ref the manifest never referenced, wedging every subsequent write to that table on that branch with "orphaned table state … incomplete prior delete; run cleanup" (and cleanup couldn't even fix it, since the branch was still live). This makes the per-table fork derived state of the manifest: the fork now returns a typed ForkOutcome and self-heals a manifest-unreferenced leftover (reclaim + re-fork under the write queue) on the next write across load/mutate/merge, with cleanup's reconciler broadened to a per-table authority test as the guaranteed backstop.

Backing issue / RFC

  • Maintainer fix for a reproduced durability defect (interrupted branch fork wedges writes). No public issue/RFC; per the template note, maintainer internal process applies. Deterministic local repro added as the lead test commit.

Checklist

  • Change is focused (one logical change) — the fork-reclaim fix; a small related stale-comment cleanup rides in its own chore: commit
  • Tests added/updated for behavior changes — red→green regression (writes.rs), reconcile coverage (maintenance.rs), failpoints flip + preserved retryable test
  • Public docs updated if user-facing surface changed (or N/A) — N/A user-facing; dev docs (writes.md, invariants.md) updated
  • Reviewed against docs/dev/invariants.md — no Hard Invariant weakened; uses the authority-derived reconciler pattern the invariants prefer

Notes for reviewers

Reclaim is safe because branch first-writes now acquire the per-(table, branch) write queues up front (held through publish), so no in-process writer can be mid-fork; cross-process in-flight forks remain the already-documented one-winner-CAS gap (noted in invariants.md). Commits are file-granular and ordered red→green (3edbe2e is the RED test, 6e3204d the fix) since interactive hunk staging wasn't available. Only steady-state cost is added serialization of concurrent first-writes to the same (table, branch) — which already conflict at commit; main-branch and already-forked writes are unaffected.


Note

Medium Risk
Changes destructive fork reclaim and cleanup deletion paths on the write and maintenance hot paths; mitigations are fresh manifest checks and write-queue serialization, with cross-process races still documented as retry-only CAS gaps.

Overview
Fixes a durability wedge where Lance create_branch could finish before the manifest publish, leaving a branch ref the manifest never referenced and blocking later writes with “run cleanup” (often unhelpful while the branch stayed live).

Write path: fork_branch_from_state now returns ForkOutcome (Created vs RefAlreadyExists, confirmed via list_branches so transient errors don’t trigger reclaim). On collision, reclaim_orphaned_fork_and_refork re-checks a fresh branch snapshot, force_delete_branch, then re-forks. mutate_as and load_as pre-acquire per-(table, branch) queues for all touched tables when a fork is needed and pass held_guards into commit_all so the non-re-entrant queue isn’t double-acquired; touched_table_keys includes edge tables from cascading node deletes.

Cleanup backstop: reconcile_orphaned_branches treats orphans per table (deleted branch vs live branch without that table on it), acquires the write queue, re-validates with fresh_snapshot_for_branch (skips delete on read errors), then force-deletes.

Tests and dev docs (invariants.md, writes.md) cover self-heal, origin-2 reconcile, and the in-process-only reclaim safety model.

Reviewed by Cursor Bugbot for commit d0fb12b. Bugbot is set up for automated code reviews on this repo. Configure here.

Greptile Summary

This PR fixes a durability defect where an interrupted first-write branch fork — Lance create_branch completing but the manifest publish not — left a manifest-unreferenced branch ref that wedged all subsequent writes to that table on that branch with "run cleanup" (and cleanup couldn't fix it since the branch was still live). The fix makes the fork operation self-healing: fork_branch_from_state returns a typed ForkOutcome, and on RefAlreadyExists the engine reclaims the stale ref under the write queue after a fresh manifest re-validation, then re-forks.

  • Write path (table_ops.rs, loader/mod.rs, mutation.rs): first-writes onto non-main branches now acquire per-(table, branch) write queues before the fork and hold them through the manifest publish; reclaim_orphaned_fork_and_refork self-heals manifest-unreferenced residues with a fresh-manifest safety check before any force_delete_branch.
  • Cleanup reconciler (optimize.rs): broadened to a per-table authority test (origin-1: whole branch gone; origin-2: live branch but table not placed on it); origin-2 forks are re-validated under the write queue via fresh_snapshot_for_branch before delete.
  • Tests (writes.rs, maintenance.rs, failpoints.rs): deterministic forged-orphan regression test (load + mutate paths), cascade-delete queue-coverage test, and the recreate-over-orphan failpoint test flipped from expect-error to expect-success.

Confidence Score: 4/5

The write-path self-heal and origin-2 reconciler path are carefully guarded with fresh manifest checks under the queue; one gap in the origin-1 reconciler re-validation can delete a legitimately-published fork when cleanup runs concurrently with branch creation.

The write-path changes and origin-2 reconciler path are sound — both re-validate under the write queue using a fresh manifest read before any destructive operation. The origin-1 re-validation in reconcile_orphaned_branches (lines 734–739) skips the fresh check, relying on a stale live_branches snapshot taken at function start. A branch created concurrently during the sweep will not be in that snapshot; if its first-write fork is published before the reconciler reaches that (table, branch) pair, the reconciler acquires the queue and — because live_branches still does not contain the branch — calls force_delete_branch on a legitimately-published ref, leaving the manifest referencing a non-existent Lance branch ref.

crates/omnigraph/src/db/omnigraph/optimize.rs — the origin-1 still_orphan = true block (lines 734–739)

Important Files Changed

Filename Overview
crates/omnigraph/src/db/omnigraph/optimize.rs Broadens reconciler to per-table authority (origin-1 + origin-2); origin-2 re-validation under queue is correct, but origin-1 still_orphan = true uses a stale live_branches snapshot and can delete a legitimately-published fork on a recently-created branch.
crates/omnigraph/src/db/omnigraph/table_ops.rs Adds reclaim_orphaned_fork_and_refork with self-validation via fresh manifest read before destructive ops; correct handling of double-collision and RefConflict error.
crates/omnigraph/src/exec/mutation.rs Hoists IR lowering to compute touched_table_keys (including cascade-delete edge tables) before execution for up-front queue acquisition; the needs_fork = false gap for stale snapshot is the documented one-winner-CAS limit.
crates/omnigraph/src/exec/staging.rs Adds held_guards pass-through to commit_all; held-guard coverage check is now an all-builds hard failure rather than a debug-only assertion.
crates/omnigraph/src/table_store.rs Replaces blunt orphan-error with typed ForkOutcome; list_branches disambiguation prevents routing transient I/O errors into the destructive reclaim path.
crates/omnigraph/src/loader/mod.rs Mirrors mutation path's up-front fork-queue acquisition for loads onto non-main branches; correctly derives touched from node_rows/edge_rows keys.
crates/omnigraph/tests/writes.rs Adds regression test for forged manifest-unreferenced fork (both load + mutate paths) and cascade-delete queue coverage; covers the primary bug scenario.
crates/omnigraph/tests/maintenance.rs Adds reconciler test distinguishing origin-2 orphan reclaim from legitimate fork on the same live branch.
crates/omnigraph/tests/failpoints.rs Flips the recreate-over-orphan test from expect_err to expect (red→green); adds correctness assertion that the self-healed fork starts fresh from main, not from the stale orphan.

Sequence Diagram

sequenceDiagram
    participant W as Writer
    participant Q as WriteQueue
    participant L as Lance (TableStore)
    participant M as Manifest

    Note over W,M: First write to table T on branch B (fork path)
    W->>Q: acquire_many([(T, B), ...]) ← up-front
    Q-->>W: guards held

    W->>L: fork_branch_from_state(T, B)
    alt Created
        L-->>W: ForkOutcome::Created(ds)
    else RefAlreadyExists (interrupted prior fork)
        L-->>W: ForkOutcome::RefAlreadyExists
        W->>M: fresh_snapshot_for_branch(B)
        M-->>W: snapshot (no T on B → confirmed orphan)
        W->>L: force_delete_branch(T, B)
        L-->>W: ok
        W->>L: fork_branch_from_state(T, B) again
        L-->>W: ForkOutcome::Created(ds)
    end

    W->>M: commit_all (guards reused, not re-acquired)
    M-->>W: published
    W->>Q: release guards

    Note over W,M: cleanup reconciler — per-table authority
    participant C as Cleanup
    C->>M: live_branches snapshot
    C->>L: list_branches(T)
    C->>Q: acquire(T, B)   ← waits for any live writer
    alt origin-2 (live branch, T not placed)
        C->>M: fresh_snapshot_for_branch(B) ← re-validate
        M-->>C: T placed on B → skip (not orphan)
    end
Loading

Comments Outside Diff (1)

  1. crates/omnigraph/src/exec/mutation.rs, line 456-469 (link)

    P2 needs_fork = false leaves a fork-without-queue gap on cross-process branch recreate

    If needs_fork is false at snapshot time (all touched tables appear already-forked), no guards are pre-acquired. If a foreign-process then deletes and recreates the branch between this snapshot fetch and execute_named_mutation, some tables become unforked; open_owned_dataset_for_branch_write re-detects this and calls db.fork_dataset_from_entry_state, which may reach reclaim_orphaned_fork_and_refork — without the per-(table, branch) queue held. The invariant comment on reclaim_orphaned_fork_and_refork states callers "MUST already hold" the queue.

    This is acknowledged in invariants.md as the one-winner-CAS gap and leads to retries rather than corruption, so it's not a blocking concern. Worth calling out explicitly since the comment on reclaim_orphaned_fork_and_refork reads as an unconditional precondition but actually has this cross-process carve-out.

    Fix in Claude Code

Fix All in Claude Code

Reviews (5): Last reviewed commit: "Merge remote-tracking branch 'origin/mai..." | Re-trigger Greptile

Greptile also left 1 inline comment on this PR.

ragnorc added 4 commits June 14, 2026 19:11
The global Arc<RwLock<Omnigraph>> that once serialized every server write was
removed — the server holds the engine as a lockless Arc<Omnigraph> and write
methods are &self, so the per-(table_key, branch) write queues are now the
actual write-serialization mechanism (in-process only).

Correct comments that still claimed the global lock is 'still in place' /
'today', or framed the queues as MR-686 scaffolding: write_queue.rs module doc,
exec/merge.rs, db/omnigraph/schema_apply.rs, db/manifest/recovery.rs, and the
bench_concurrent_http.rs example (which also wrongly stated mutate_as is
&mut self). workload.rs is left as-is — its 'previous global RwLock' wording is
accurate history.
An interrupted first-write fork (create_branch succeeded, the manifest publish
did not) leaves a fully-formed Lance branch ref the manifest never references.
The branch stays a valid manifest branch, so cleanup's reconciler never
reclaims it, and today the next write to that table wedges with 'incomplete
prior delete; run cleanup'.

Forge that exact residue (a live 'feature' branch + a directly-created
'feature' ref on the Person table the manifest doesn't reference) and assert
the next load AND mutate self-heal. Deterministic and local — no S3 or timing,
since the forge IS the post-crash state. Adds a shared node_table_uri helper.

This commit is RED: it reproduces the bug and fails against the unfixed engine
with the predicted symptom. The fix follows in the next commit.
The first write to a table on a branch lazily forks it via Lance create_branch,
a durable two-phase op that advances Lance state BEFORE the atomic manifest
publish. If the writer dies or its request future is cancelled between the fork
and the publish, the branch ref is fully formed but the manifest never
references it. The next write re-enters the fork path, create_branch collides,
and the engine wedged with 'orphaned table state ... incomplete prior delete;
run cleanup' — which cleanup could not even fix, because the branch is still a
live manifest branch. This hit load, mutate, ingest, and the merge fork path
(one shared engine chokepoint), so a routine deploy restart or client
disconnect could wedge a branch.

Fix: treat the per-table fork ref as derived state of the manifest. fork_branch_
from_state returns a typed ForkOutcome instead of a human 'incomplete prior
delete' error; on RefAlreadyExists the db layer reclaims the manifest-
unreferenced fork (force_delete_branch + re-fork, exactly once) and proceeds.
A live committed fork is still routed to a retryable conflict before the fork
path, so concurrent first-writes stay correct.

Reclaim is only safe if no in-process writer can be mid-fork, so the write
entry points (load, mutate) acquire the per-(table, branch) write queues for
all touched tables up front — before the fork, held through the publish — when
forking a non-main branch. commit_all accepts these pre-held guards instead of
re-acquiring (the queue is non-re-entrant). The merge fork path already holds
the queue and self-heals through the shared wrapper. Cross-process in-flight
forks remain the documented one-winner-CAS gap.

Mechanical prep folded in: mutation IR lowering is hoisted so the touched-table
set is known before execution; commit_all gains the held_guards parameter.

Flips recreate_over_orphaned_fork_before_cleanup_is_actionable to assert
self-heal; fork_collision_with_live_concurrent_fork_is_retryable still holds.
Docs: writes.md cancelled-future note, invariants.md cross-process known gap.
reconcile_orphaned_branches keyed orphans on the branch NAME (absent from the
manifest), so it only reclaimed forks from a fully-deleted branch. A fork left
on a still-live branch by an interrupted first-write was never reclaimed — the
backstop the handoff expected cleanup to provide did not cover that case.

Broaden it to a per-table authority test: a Lance branch B on table T is an
orphan iff B is not a live manifest branch (delete-leftover) OR the manifest's
branch-B snapshot does not place T on B (interrupted first-write). Per-branch
snapshots are resolved once and cached across tables. Legitimately-forked
tables, main, and internal/system branches are never reclaimed; children are
dropped before parents to avoid Lance's referenced-parent RefConflict. The
commit-graph half stays whole-branch (per-table doesn't apply there).

This is the guaranteed-convergence backstop to the write-path self-heal: it
reclaims any fork the write path never revisits, and is what Lance's own
create_branch docstring asks embedders to provide for zombie/orphan refs.
@ragnorc ragnorc requested a review from aaltshuler as a code owner June 14, 2026 17:26
Comment thread crates/omnigraph/src/exec/mutation.rs

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 37f5af587a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +1509 to +1510
crate::storage_layer::ForkOutcome::RefAlreadyExists => {
table_ops::reclaim_orphaned_fork_and_refork(

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Re-read branch authority before reclaiming forks

This reclaim trusts the caller's proof that the manifest does not place the table on active_branch, but the first-write path obtains that proof with snapshot_for_branch, which can return the coordinator's cached snapshot when the handle is currently bound to that branch. During the branch-merge target-branch swap (or any stale branch-bound handle), another writer may already have published a legitimate fork; create_branch then reports RefAlreadyExists and this path force-deletes/re-forks that valid table branch before commit_all notices the manifest/head mismatch, leaving the manifest pointing at a deleted fork. Re-open a fresh branch snapshot immediately before reclaiming, or treat RefAlreadyExists as retryable unless the fresh manifest proves the ref is unreferenced.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in fedda78. reclaim_orphaned_fork_and_refork now re-derives its precondition from a FRESH manifest read (fresh_snapshot_for_branch, which bypasses the coordinator cache) immediately before force-deleting, and refuses with a retryable conflict if the table is legitimately on the branch. Correct regardless of caller snapshot staleness (including the branch-merge target swap). Good catch.

Comment thread crates/omnigraph/src/exec/staging.rs Outdated
Comment thread crates/omnigraph/src/db/omnigraph/table_ops.rs Outdated
ragnorc added 2 commits June 14, 2026 19:49
The fork reclaim force-deletes a Lance branch ref, gated on the caller's proof
that the manifest does not place the table on the branch. But the first-write
path obtains that proof via snapshot_for_branch, which returns the coordinator's
CACHED snapshot when the handle is bound to the branch (an embedded handle on
the branch, or branch_merge's target swap). If that snapshot is stale and a
concurrent writer already published a legitimate fork, the reclaim would
force-delete it and re-fork from source, stranding the manifest at a version the
recreated ref no longer has.

Make the destructive primitive own its safety precondition: re-derive it from a
FRESH manifest read (fresh_snapshot_for_branch, which bypasses the cache)
immediately before force-deleting. If fresh authority shows the table is on the
branch, refuse with a retryable conflict instead of destroying a valid fork.
Correct for any caller regardless of snapshot staleness. Also stop branching on
Lance's exact RefConflict prose (loosened match; typed-variant is the durable
follow-up). Addresses PR review (Codex P1, Greptile P2).
A node delete cascades to every edge table touching that node (execute_delete_
node), forking those edge tables during execution. But touched_table_keys
derived the up-front fork-queue set from the IR ops alone (just node:Type), so a
branch delete that forks node + cascade edges held only the node queue —
commit_all then saw cascade-edge keys it had no guard for.

The touched set is a pure function of (IR ops + catalog), so compute the
COMPLETE set: op types plus, for delete-node ops, the cascade edges derived the
same way the executor derives them (from_type/to_type match). Pre-computed now
equals actual by construction.

Also promote commit_all's held-guard coverage check out of debug_assert into an
all-builds check that fails the write with a typed manifest_internal error: a
load-bearing serialization invariant must fail loudly+safely in release, not
silently proceed unguarded if a future execution path ever touches a table
outside the pre-computed set.

Adds branch_cascade_delete_forks_node_and_edges_under_held_queues, which drives
the cascade path on a branch (the gap the existing insert/load tests missed).
Addresses PR review (Cursor medium, Greptile P2).
Comment thread crates/omnigraph/src/db/omnigraph/optimize.rs
Comment thread crates/omnigraph/src/table_store.rs Outdated
ragnorc and others added 3 commits June 14, 2026 20:14
The broadened per-table reconciler force_delete'd an orphan candidate on a LIVE
branch without holding the per-(table, branch) write queue. An in-process
first-write fork in its fork->publish window holds that queue and has not yet
advanced the manifest, so it looks exactly like an origin-2 orphan — concurrent
cleanup could delete the ref the writer still holds and is about to publish.
(The old branch-name-based reconciler did not have this race: a deleted branch
cannot have a live first-write.)

Bring the reconciler under the same invariant the write-path reclaim already
obeys: never force_delete a fork ref without holding the (table, branch) write
queue AND confirming, under it, from a fresh read, that the ref is still
manifest-unreferenced. Acquire one key at a time (no lock-order inversion vs
multi-table acquire_many writers); if the writer published meanwhile, the fresh
re-check sees the table on the branch and skips. Cross-process writers remain
the documented one-winner-CAS gap. Addresses PR review (Cursor high).
fork_branch_from_state mapped ANY create_branch failure to RefAlreadyExists,
routing transient I/O / version / Lance-internal errors into the destructive
reclaim path and masking the real error as a retryable conflict.

Branch on the actual fact instead: on create_branch failure, check whether the
ref exists (list_branches). Only a genuinely pre-existing ref — a fully-formed
manifest-unreferenced fork — is a reclaim candidate; any other failure
propagates with fidelity. We deliberately do NOT force-delete on a not-found-ref
failure: it is indistinguishable from a transient error on a fresh create, and
force-deleting there is the overreach the fresh-authority guard already removed.
A phase-1-only Lance zombie (rarer; create_branch interrupted mid its two
internal phases) surfaces as the propagated error for manual reclaim.
Addresses PR review (Cursor medium).

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 3300109. Configure here.

Comment thread crates/omnigraph/src/db/omnigraph/optimize.rs
Comment thread crates/omnigraph/src/db/omnigraph/optimize.rs Outdated
ragnorc added 2 commits June 14, 2026 20:50
…ive branch

The reconcile pre-delete re-check treated ANY fresh_snapshot error as 'still an
orphan' and proceeded to force_delete. A transient manifest read failure on a
LIVE branch could therefore destroy a fork the manifest still considers
legitimate — inconsistent with the write-path reclaim (aborts on the same error)
and the candidate scan (skips on snapshot failure).

Distinguish the two origins under the queue: a branch absent from the manifest
authority (origin 1) is a confirmed orphan and is deleted without a fresh read
(no live writer can hold a deleted branch's queue); a LIVE branch (origin 2)
gets the fresh re-check and, on a transient read error, is SKIPPED — never
destroyed on ambiguity — converging on a later cleanup. Same don't-destroy-on-
ambiguous-error principle as the create_branch failure classification.
Addresses PR review (Cursor medium).
Comment on lines +734 to +739
let still_orphan = if !live_branches.contains(&branch) {
// Origin 1: the branch is absent from the manifest authority
// entirely — a confirmed orphan. No live writer can hold this
// branch's queue (you cannot first-write to a branch the
// manifest does not have), so no fresh re-check is needed.
true

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Stale live_branches makes origin-1 still_orphan = true unsafe

live_branches is a snapshot taken once at the top of reconcile_orphaned_branches (from the coordinator's cached view). The comment "you cannot first-write to a branch the manifest does not have" is only true at that moment. If a branch is created after live_branches is captured, it is absent from the set even though the manifest now has it and an in-process writer may already be holding the (table, branch) write queue through its fork+publish. The reconciler blocks on the queue, the writer publishes (the fork is now legitimate), the reconciler acquires the queue — and then !live_branches.contains(&branch) is still true (stale), so it treats the legitimately-published fork as an origin-1 orphan and calls force_delete_branch. The result is the manifest referencing a non-existent Lance branch ref, corrupting subsequent reads on that (table, branch) pair until repaired.

The origin-2 path correctly defends against this by calling fresh_snapshot_for_branch under the queue. Origin-1 needs the same treatment: if the fresh read returns "branch not found", it is confirmed absent (still an orphan); if the read succeeds, re-check whether the table is placed on it and skip if so; on transient error, skip (matching the origin-2 safe-skip behaviour).

Fix in Claude Code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant