test(quorum-store): repro F5 — persist reject-branch skips notify_subscribers (Galxe/gravity-audit#707)#744
keanji-x wants to merge 1 commit into
…ribers

Adds a focused cargo unit test reproducing gravity-audit F5: in BatchStore::persist, the reject branch (batch expiration <= last_certified_time, so save() bails and persist_inner returns None) is filtered out by the `filter_map`, so notify_subscribers is never called for it. A concurrent in-flight batch fetcher that subscribed via subscribe() (the recovery / get_batch path) is therefore never woken and hangs until its request_batch times out -- even when the batch is, in fact, locally available in the quorum-store DB.

The test drives the public BatchReader::get_batch API with a mock network sender whose request_batch future never resolves, forcing the only viable delivery path to be the persist-subscription notification. It then makes the batch locally available (save_fetched_batch_to_db) and issues the redundant, rejected persist(). It asserts the fetcher IS served (correct behavior).

Evidence:
- FAILS on current code: the fetcher is never notified and times out (panic at the timeout arm).
- PASSES when persist() is changed to call notify_subscribers for every persist_request regardless of the save() outcome (the original, now-commented-out loop). Verified locally, then reverted -- this PR adds test code only, no production changes.

Note on the second half of F5 (GC `expiration - 60s buffer` inequality): populate_cache_and_gc_expired_batches and the 60s expiration buffer do not exist in this gravity fork (the bootstrap GC in BatchStore::new is commented out), so that half is not reproducible here and is not claimed.

Refs Galxe/gravity-audit#707

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
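The reject branch described in the commit message can be illustrated with a small, synchronous toy model. This is only a sketch under stated assumptions: `ToyStore`, `Batch`, `persist_skipping_rejected`, `persist_notifying_all`, and `subscriber_is_served` are hypothetical names invented for illustration, and the real `BatchStore` is asynchronous and uses different types. Only the reject rule (`expiration <= last_certified_time`), the `filter_map` shape, and the values 500 and 1000 are taken from the PR text.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};
use std::time::Duration;

/// Hypothetical stand-in for a persisted batch; not the real aptos-core type.
#[derive(Clone, Debug, PartialEq)]
struct Batch {
    digest: u64,
    expiration: u64,
}

/// Toy store with just enough state to model the save / notify interaction.
struct ToyStore {
    last_certified_time: u64,
    subscribers: HashMap<u64, Vec<Sender<Batch>>>,
}

impl ToyStore {
    fn new(last_certified_time: u64) -> Self {
        ToyStore { last_certified_time, subscribers: HashMap::new() }
    }

    /// Register interest in a digest; the receiver fires when notified.
    fn subscribe(&mut self, digest: u64) -> Receiver<Batch> {
        let (tx, rx) = channel();
        self.subscribers.entry(digest).or_default().push(tx);
        rx
    }

    /// Reject rule from the PR text: already-expired batches are refused.
    fn save(&self, batch: &Batch) -> Option<Batch> {
        if batch.expiration <= self.last_certified_time {
            None
        } else {
            Some(batch.clone())
        }
    }

    /// Wake every waiter registered for this batch's digest.
    fn notify_subscribers(&mut self, batch: &Batch) {
        if let Some(waiters) = self.subscribers.remove(&batch.digest) {
            for tx in waiters {
                let _ = tx.send(batch.clone()); // the receiver may already be gone
            }
        }
    }

    /// Shape described as current: rejected batches are dropped by
    /// `filter_map` before the notification loop ever sees them.
    fn persist_skipping_rejected(&mut self, requests: Vec<Batch>) {
        let saved: Vec<Batch> = requests.iter().filter_map(|b| self.save(b)).collect();
        for b in &saved {
            self.notify_subscribers(b);
        }
    }

    /// Shape described as original: notify for every request,
    /// whatever `save()` returned.
    fn persist_notifying_all(&mut self, requests: Vec<Batch>) {
        for b in &requests {
            let _ = self.save(b);
            self.notify_subscribers(b);
        }
    }
}

/// Returns true if a subscriber waiting on an already-expired batch is woken
/// within a short timeout.
fn subscriber_is_served(notify_all: bool) -> bool {
    let mut store = ToyStore::new(1000);
    let rx = store.subscribe(7);
    let rejected = Batch { digest: 7, expiration: 500 };
    if notify_all {
        store.persist_notifying_all(vec![rejected]);
    } else {
        store.persist_skipping_rejected(vec![rejected]);
    }
    rx.recv_timeout(Duration::from_millis(50)).is_ok()
}
```

In this model, `subscriber_is_served(false)` times out because the rejected batch never reaches the notification loop, while `subscriber_is_served(true)` is woken immediately. The model says nothing about how long a real fetcher would wait, since that depends on the production timeout and retry paths.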
I looked through the block sync / recovery payload path again, and I think the impact here is probably narrower than the PR description suggests.

The missed-notify behavior itself is real:

However, in the real block sync / recovery path, payload fetching has a few other guards/retry paths:
So I agree this is a missed notification edge case, and notifying after the batch becomes locally readable may be a reasonable hardening. But I don’t think the current test proves a persistent recovery hang in the production flow. The test uses a network sender whose `request_batch` future never resolves.

Because of that, I would treat this as a narrow concurrency edge case / possible latency or retry issue, not a high-impact recovery bug, unless we can reproduce a real block-sync/recovery path where the missed notification causes an actual stall beyond the normal request retry/timeout behavior.
What
Adds a focused cargo unit test reproducing gravity-audit F5 in the quorum-store recovery path. Test-only; no production code is modified.
The bug
In `aptos-core/consensus/src/quorum_store/batch_store.rs`, `BatchStore::persist`:

When `save()` rejects a batch because `value.expiration() <= last_certified_time`, `persist_inner` returns `None`, the `filter_map` drops it, and `notify_subscribers` is never called for it. A concurrent in-flight batch fetcher that subscribed via `subscribe()` (the recovery / `get_batch` path) is therefore never woken and hangs until its `request_batch` times out, even when the batch is, in fact, locally available in the quorum-store DB (e.g. just written by `save_fetched_batch_to_db`).

The original (now-commented-out) loop called `notify_subscribers(persist_request)` for every request regardless of the `save()` outcome; that is the correct behavior the test asserts.

The test
`test_persist_reject_branch_notifies_subscriber_707` drives the public `BatchReader::get_batch` API with a mock network sender whose `request_batch` future never resolves, so the only viable delivery path is the persist-subscription notification. It then:

- makes the batch locally available via `save_fetched_batch_to_db`,
- issues the redundant, rejected `persist()` (expiration `500` ≤ `last_certified_time 1000`),
- asserts that the fetcher is served (the correct behavior).

Evidence
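- FAILS on current code: the fetcher is never notified and times out (panic at the timeout arm).
- PASSES when `persist()` is changed to call `notify_subscribers` for every `persist_request` regardless of `save()` outcome (the original loop). Verified locally, then reverted; this PR adds test code only.

Run: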
Honesty / scope
- The second half of F5 (GC `expiration - 60s buffer` vs `block_timestamp <= expiration`) is not reproducible in this fork: `populate_cache_and_gc_expired_batches` and the 60s buffer do not exist here (the bootstrap GC in `BatchStore::new` is commented out). That half is not claimed.
- `test_get_local_batch` already fails on clean `origin/main` (independent of this change).

Refs Galxe/gravity-audit#707
🤖 Generated with Claude Code