test(quorum-store): repro F5 — persist reject-branch skips notify_subscribers (Galxe/gravity-audit#707) by keanji-x · Pull Request #744 · Galxe/gravity-sdk

keanji-x · 2026-06-10T14:01:48Z

What

Adds a focused cargo unit test reproducing gravity-audit F5 in the quorum-store recovery path. Test-only; no production code is modified.

The bug

In aptos-core/consensus/src/quorum_store/batch_store.rs, BatchStore::persist:

let res = persist_requests
    .into_par_iter()
    .filter_map(|req| {
        let signed_info_opt = self.persist_inner(req.clone());
        signed_info_opt.map(|si| (req, si))   // reject branch (None) is DROPPED here
    })
    .collect::<Vec<_>>();
let mut signed_infos = vec![];
for (req, si) in res {
    self.notify_subscribers(req);             // never reached for rejected reqs
    signed_infos.push(si);
}

When save() rejects a batch because value.expiration() <= last_certified_time, persist_inner returns None, the filter_map drops it, and notify_subscribers is never called for it. A concurrent in-flight batch fetcher that subscribed via subscribe() (the recovery / get_batch path) is therefore never woken and hangs until its request_batch times out — even when the batch is, in fact, locally available in the quorum-store DB (e.g. just written by save_fetched_batch_to_db).

The original (now-commented-out) loop called notify_subscribers(persist_request) for every request regardless of the save() outcome — that is the correct behavior the test asserts.

The test

test_persist_reject_branch_notifies_subscriber_707 drives the public BatchReader::get_batch API with a mock network sender whose request_batch future never resolves, so the only viable delivery path is the persist-subscription notification. It then:

starts the in-flight fetcher (which subscribes internally),
makes the batch locally available via save_fetched_batch_to_db,
issues the redundant, rejected persist() (expiration 500 ≤ last_certified_time 1000),
asserts the fetcher is served.

Evidence

FAILS on current code — the fetcher is never notified and times out:

thread '...test_persist_reject_branch_notifies_subscriber_707' panicked at .../batch_store_test.rs:449:19:
F5 repro (Galxe/gravity-audit#707): in-flight fetcher was NEVER notified after persist()
rejected the expired-vs-last_certified_time batch (notify_subscribers skipped on the reject
branch), even though the batch is locally available -- the fetcher hung and timed out
test ... FAILED
test result: FAILED. 0 passed; 1 failed; ...

PASSES when persist() is changed to call notify_subscribers for every persist_request regardless of save() outcome (the original loop). Verified locally, then reverted — this PR adds test code only.

Run:

export CARGO_TARGET_DIR=.../target
RUSTFLAGS="--cfg tokio_unstable" cargo test -p aptos-consensus \
  quorum_store::tests::batch_store_test::test_persist_reject_branch_notifies_subscriber_707

Honesty / scope

The GC-inequality half of F5 (expiration - 60s buffer vs block_timestamp <= expiration) is not reproducible in this fork: populate_cache_and_gc_expired_batches and the 60s buffer do not exist here — the bootstrap GC in BatchStore::new is commented out. That half is not claimed.
Pre-existing unrelated failure: test_get_local_batch already fails on clean origin/main (independent of this change).

Refs Galxe/gravity-audit#707

🤖 Generated with Claude Code

…ribers Adds a focused cargo unit test reproducing gravity-audit F5: in BatchStore::persist, the reject branch (batch expiration <= last_certified_time, so save() bails and persist_inner returns None) is filtered out by the `filter_map`, so notify_subscribers is never called for it. A concurrent in-flight batch fetcher that subscribed via subscribe() (the recovery / get_batch path) is therefore never woken and hangs until its request_batch times out -- even when the batch is, in fact, locally available in the quorum-store DB. The test drives the public BatchReader::get_batch API with a mock network sender whose request_batch future never resolves, forcing the only viable delivery path to be the persist-subscription notification. It then makes the batch locally available (save_fetched_batch_to_db) and issues the redundant, rejected persist(). It asserts the fetcher IS served (correct behavior). Evidence: - FAILS on current code: the fetcher is never notified and times out (panic at the timeout arm). - PASSES when persist() is changed to call notify_subscribers for every persist_request regardless of the save() outcome (the original, now-commented-out loop). Verified locally, then reverted -- this PR adds test code only, no production changes. Note on the second half of F5 (GC `expiration - 60s buffer` inequality): populate_cache_and_gc_expired_batches and the 60s expiration buffer do not exist in this gravity fork (the bootstrap GC in BatchStore::new is commented out), so that half is not reproducible here and is not claimed. Refs Galxe/gravity-audit#707 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Lchangliang · 2026-06-11T07:21:07Z

I looked through the block sync / recovery payload path again, and I think the impact here is probably narrower than the PR description suggests.

The missed-notify behavior itself is real: persist() only calls notify_subscribers() for requests where persist_inner() returns Some(SignedBatchInfo). If save() rejects the batch because expiration <= last_certified_time, the request is filtered out and subscribers are not notified.

However, in the real block sync / recovery path, payload fetching has a few other guards/retry paths:

During block sync, fetched blocks call payload_manager.prefetch_payload_data(payload, block.timestamp_usecs()).
In recovery, send_for_execution(recovery=true) calls payload_manager.get_transactions(block).await in a retry loop. On error, it calls prefetch_payload_data() again and retries.
QuorumStorePayloadManager::request_transactions() only calls batch_reader.get_batch() when block_timestamp <= batch_info.expiration(). If the batch is expired relative to the block being executed, it is skipped before reaching BatchStore.
BatchReaderImpl::get_batch() always checks local cache/DB first. If the batch is already in DB, it returns directly without relying on notify_subscribers().
If a waiter subscribed before the batch was written to DB and the later persist() rejects without notifying, that waiter may miss this notification. But the network request path has retry/RPC timeout behavior; once it returns an error, the recovery loop re-prefetches, and the next get_batch() should re-check local DB and find the batch.

So I agree this is a missed notification edge case, and notifying after the batch becomes locally readable may be a reasonable hardening. But I don’t think the current test proves a persistent recovery hang in the production flow. The test uses a network sender whose request_batch future never resolves, which makes notify_subscribers() the only possible completion path. That is useful for isolating the internal behavior, but it amplifies the impact compared with the real path where request timeout/retry and local DB re-checks exist.

Because of that, I would treat this as a narrow concurrency edge case / possible latency or retry issue, not a high-impact recovery bug, unless we can reproduce a real block-sync/recovery path where the missed notification causes an actual stall beyond the normal request retry/timeout behavior.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test(quorum-store): repro F5 — persist reject-branch skips notify_subscribers (Galxe/gravity-audit#707)#744

test(quorum-store): repro F5 — persist reject-branch skips notify_subscribers (Galxe/gravity-audit#707)#744
keanji-x wants to merge 1 commit into
mainfrom
test/unit-qs-recovery-notify-gc-707

keanji-x commented Jun 10, 2026

Uh oh!

Lchangliang commented Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

keanji-x commented Jun 10, 2026

What

The bug

The test

Evidence

Honesty / scope

Uh oh!

Lchangliang commented Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants