Fix equality delete load notification deadlock #2630

Open
fallintoplace wants to merge 1 commit into apache:main from fallintoplace:fix-equality-delete-notify-deadlock

Conversation

@fallintoplace commented Jun 12, 2026

Contributor

Summary

  • Reuse an existing equality-delete load notifier instead of overwriting it with a new one.
  • Keep the loader path scoped to the existing start-load state machine.
  • Add a regression test that polls a waiter before the load is inserted, then verifies it is notified when the predicate is loaded.

Root cause

try_start_eq_del_load marked a delete file as Loading with one Notify, but insert_equality_delete replaced the map entry with a different Notify. Any reader that observed the first notifier could wait forever because only the second notifier was signaled.
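
The lost notification described above can be modeled with the standard library alone. This is a minimal sketch, not the crate's actual code: `Notifier`, `EqDelState`, `State`, `try_start_load`, and `insert_overwriting` are hypothetical stand-ins, and the notifier only records whether it was signaled instead of waking async tasks.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for the async notifier: it only records a signal.
#[derive(Default)]
struct Notifier {
    signaled: AtomicBool,
}

/// Hypothetical per-file state; `Loaded` holds a placeholder predicate.
enum EqDelState {
    Loading(Arc<Notifier>),
    Loaded(String),
}

#[derive(Default)]
struct State {
    equality_deletes: HashMap<String, EqDelState>,
}

impl State {
    /// Marks `path` as Loading and returns the notifier stored in the map.
    fn try_start_load(&mut self, path: &str) -> Option<Arc<Notifier>> {
        if self.equality_deletes.contains_key(path) {
            return None;
        }
        let notifier = Arc::new(Notifier::default());
        self.equality_deletes
            .insert(path.to_string(), EqDelState::Loading(notifier.clone()));
        Some(notifier)
    }

    /// Buggy shape: replaces the entry with a brand-new notifier.
    fn insert_overwriting(&mut self, path: &str) -> Arc<Notifier> {
        let notifier = Arc::new(Notifier::default());
        self.equality_deletes
            .insert(path.to_string(), EqDelState::Loading(notifier.clone()));
        notifier
    }
}

/// Returns (first notifier signaled?, replacement notifier signaled?).
fn demo_lost_notification() -> (bool, bool) {
    let path = "eq-del.parquet"; // hypothetical file name
    let mut state = State::default();

    // A reader arriving now would capture this first notifier.
    let first = state.try_start_load(path).expect("no load in progress");

    // The insert step swaps in a second notifier and signals only that one.
    let second = state.insert_overwriting(path);
    state
        .equality_deletes
        .insert(path.to_string(), EqDelState::Loaded("predicate".to_string()));
    second.signaled.store(true, Ordering::SeqCst);

    (
        first.signaled.load(Ordering::SeqCst),
        second.signaled.load(Ordering::SeqCst),
    )
}
```

Calling `demo_lost_notification()` shows the first notifier is never signaled, which is the condition under which a reader holding it would wait indefinitely.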

Tests

  • cargo fmt --check
  • CARGO_REGISTRIES_CRATES_IO_PROTOCOL=sparse cargo test -p iceberg test_equality_delete_waiter_is_notified_after_load_started --locked
  • CARGO_REGISTRIES_CRATES_IO_PROTOCOL=sparse cargo test -p iceberg arrow::delete_filter::tests --locked
  • CARGO_REGISTRIES_CRATES_IO_PROTOCOL=sparse cargo test -p iceberg arrow::caching_delete_file_loader::tests --locked

@fallintoplace force-pushed the fix-equality-delete-notify-deadlock branch from b34805a to fd32b7a on June 12, 2026 22:41

@viirya left a comment

Member

Nice catch — this is a real deadlock. try_start_eq_del_load creates notifier A, stores Loading(A), and returns A; the caller then calls insert_equality_delete, which (on main) creates a second notifier B and overwrites the entry with Loading(B). A reader that called get_equality_delete_predicate_for_delete_file_path in the window between those two steps captures A and awaits A.notified(), but the load only ever calls B.notify_waiters() — so that reader waits forever.

The fix has insert_equality_delete reuse the existing Loading notifier instead of minting a new one, which closes the window. Reusing the start-load notifier (rather than, say, having try_start_eq_del_load return it for the caller to thread through) keeps the call site simple and is the minimal correct change. The _ => { mint a new notify } fallback preserves behavior for callers that reach insert_equality_delete without a prior Loading entry.
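
The reuse-or-mint decision might look roughly like the sketch below. It is an illustration under assumptions, not the PR's diff: `Notifier`, `EqDelState`, and `notifier_for_insert` are hypothetical names, and the map is a plain `HashMap` with no locking.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical stand-in for the async notifier; only its identity matters here.
struct Notifier;

/// Hypothetical per-file state.
#[allow(dead_code)]
enum EqDelState {
    Loading(Arc<Notifier>),
    Loaded,
}

/// Returns the notifier the insert path should signal once the load completes.
/// An existing `Loading` notifier is reused so earlier waiters are not orphaned;
/// otherwise a new one is created and stored (the fallback case).
fn notifier_for_insert(map: &mut HashMap<String, EqDelState>, path: &str) -> Arc<Notifier> {
    let existing = match map.get(path) {
        Some(EqDelState::Loading(notifier)) => Some(Arc::clone(notifier)),
        _ => None,
    };
    if let Some(notifier) = existing {
        return notifier;
    }
    let fresh = Arc::new(Notifier);
    map.insert(path.to_string(), EqDelState::Loading(Arc::clone(&fresh)));
    fresh
}
```

With this shape, `Arc::ptr_eq` between the notifier handed out at start-load time and the one returned here holds whenever a `Loading` entry already exists, so a single `notify_waiters`-style call reaches every reader that captured it.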

The regression test is well-targeted: it polls a waiter to Pending before the load is inserted, then asserts the waiter is notified — exactly the race that was broken. LGTM.
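
The poll-to-Pending-then-notify pattern can be sketched without an async runtime. Everything here is an assumption for illustration: `Notifier`, `Notified`, `CountingWaker`, and `waiter_is_woken_after_load` are hypothetical, and the notifier uses a sticky flag, which is simpler than the semantics of a real async notifier.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical notifier: a sticky "notified" flag plus the parked wakers.
#[derive(Default)]
struct Notifier {
    inner: Mutex<(bool, Vec<Waker>)>,
}

impl Notifier {
    /// Sets the flag and wakes every waiter that was parked while Pending.
    fn notify_waiters(&self) {
        let mut inner = self.inner.lock().unwrap();
        inner.0 = true;
        for waker in inner.1.drain(..) {
            waker.wake();
        }
    }
}

/// Future that resolves once the shared notifier has been signaled.
struct Notified(Arc<Notifier>);

impl Future for Notified {
    type Output = ();

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        let mut inner = self.0.inner.lock().unwrap();
        if inner.0 {
            Poll::Ready(())
        } else {
            inner.1.push(cx.waker().clone());
            Poll::Pending
        }
    }
}

/// Waker that counts how many times it was woken.
struct CountingWaker(AtomicUsize);

impl Wake for CountingWaker {
    fn wake(self: Arc<Self>) {
        self.0.fetch_add(1, Ordering::SeqCst);
    }
}

/// Returns (pending before notify?, wake count after notify, ready after notify?).
fn waiter_is_woken_after_load() -> (bool, usize, bool) {
    let notifier = Arc::new(Notifier::default());
    let counter = Arc::new(CountingWaker(AtomicUsize::new(0)));
    let waker = Waker::from(Arc::clone(&counter));
    let mut cx = Context::from_waker(&waker);

    // Poll the waiter first, so it is parked before the load finishes.
    let mut waiter = Box::pin(Notified(Arc::clone(&notifier)));
    let pending_first = waiter.as_mut().poll(&mut cx).is_pending();

    // The load completes and signals the same notifier the waiter holds.
    notifier.notify_waiters();
    let wakes = counter.0.load(Ordering::SeqCst);
    let ready_after = waiter.as_mut().poll(&mut cx).is_ready();

    (pending_first, wakes, ready_after)
}
```

Polling before the notification is what makes the check meaningful: a waiter created only after the load finished would never exercise the window in which the notifier could be swapped out.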

2 participants