Fix equality delete load notification deadlock#2630
Conversation
b34805a to
fd32b7a
Compare
viirya
left a comment
There was a problem hiding this comment.
Nice catch — this is a real deadlock. try_start_eq_del_load creates notifier A, stores Loading(A), and returns A; the caller then calls insert_equality_delete, which (on main) creates a second notifier B and overwrites the entry with Loading(B). A reader that called get_equality_delete_predicate_for_delete_file_path in the window between those two steps captures A and awaits A.notified(), but the load only ever calls B.notify_waiters() — so that reader waits forever.
The fix has insert_equality_delete reuse the existing Loading notifier instead of minting a new one, which closes the window. Reusing the start-load notifier (rather than, say, having try_start_eq_del_load return it for the caller to thread through) keeps the call site simple and is the minimal correct change. The _ => { mint a new notify } fallback preserves behavior for callers that reach insert_equality_delete without a prior Loading entry.
The regression test is well-targeted: it polls a waiter to Pending before the load is inserted, then asserts the waiter is notified — exactly the race that was broken. LGTM.
Summary
Root cause
try_start_eq_del_loadmarked a delete file asLoadingwith oneNotify, butinsert_equality_deletereplaced the map entry with a differentNotify. Any reader that observed the first notifier could wait forever because only the second notifier was signaled.Tests
cargo fmt --checkCARGO_REGISTRIES_CRATES_IO_PROTOCOL=sparse cargo test -p iceberg test_equality_delete_waiter_is_notified_after_load_started --lockedCARGO_REGISTRIES_CRATES_IO_PROTOCOL=sparse cargo test -p iceberg arrow::delete_filter::tests --lockedCARGO_REGISTRIES_CRATES_IO_PROTOCOL=sparse cargo test -p iceberg arrow::caching_delete_file_loader::tests --locked