Skip to content

test(quorum-store): repro F5 — persist reject-branch skips notify_subscribers (Galxe/gravity-audit#707)#744

Open
keanji-x wants to merge 1 commit into
mainfrom
test/unit-qs-recovery-notify-gc-707
Open

test(quorum-store): repro F5 — persist reject-branch skips notify_subscribers (Galxe/gravity-audit#707)#744
keanji-x wants to merge 1 commit into
mainfrom
test/unit-qs-recovery-notify-gc-707

Conversation

@keanji-x

Copy link
Copy Markdown
Contributor

What

Adds a focused cargo unit test reproducing gravity-audit F5 in the quorum-store recovery path. Test-only; no production code is modified.

The bug

In aptos-core/consensus/src/quorum_store/batch_store.rs, BatchStore::persist:

let res = persist_requests
    .into_par_iter()
    .filter_map(|req| {
        let signed_info_opt = self.persist_inner(req.clone());
        signed_info_opt.map(|si| (req, si))   // reject branch (None) is DROPPED here
    })
    .collect::<Vec<_>>();
let mut signed_infos = vec![];
for (req, si) in res {
    self.notify_subscribers(req);             // never reached for rejected reqs
    signed_infos.push(si);
}

When save() rejects a batch because value.expiration() <= last_certified_time, persist_inner returns None, the filter_map drops it, and notify_subscribers is never called for it. A concurrent in-flight batch fetcher that subscribed via subscribe() (the recovery / get_batch path) is therefore never woken and hangs until its request_batch times out — even when the batch is, in fact, locally available in the quorum-store DB (e.g. just written by save_fetched_batch_to_db).

The original (now-commented-out) loop called notify_subscribers(persist_request) for every request regardless of the save() outcome — that is the correct behavior the test asserts.

The test

test_persist_reject_branch_notifies_subscriber_707 drives the public BatchReader::get_batch API with a mock network sender whose request_batch future never resolves, so the only viable delivery path is the persist-subscription notification. It then:

  1. starts the in-flight fetcher (which subscribes internally),
  2. makes the batch locally available via save_fetched_batch_to_db,
  3. issues the redundant, rejected persist() (expiration 500last_certified_time 1000),
  4. asserts the fetcher is served.

Evidence

  • FAILS on current code — the fetcher is never notified and times out:
    thread '...test_persist_reject_branch_notifies_subscriber_707' panicked at .../batch_store_test.rs:449:19:
    F5 repro (Galxe/gravity-audit#707): in-flight fetcher was NEVER notified after persist()
    rejected the expired-vs-last_certified_time batch (notify_subscribers skipped on the reject
    branch), even though the batch is locally available -- the fetcher hung and timed out
    test ... FAILED
    test result: FAILED. 0 passed; 1 failed; ...
    
  • PASSES when persist() is changed to call notify_subscribers for every persist_request regardless of save() outcome (the original loop). Verified locally, then reverted — this PR adds test code only.

Run:

export CARGO_TARGET_DIR=.../target
RUSTFLAGS="--cfg tokio_unstable" cargo test -p aptos-consensus \
  quorum_store::tests::batch_store_test::test_persist_reject_branch_notifies_subscriber_707

Honesty / scope

  • The GC-inequality half of F5 (expiration - 60s buffer vs block_timestamp <= expiration) is not reproducible in this fork: populate_cache_and_gc_expired_batches and the 60s buffer do not exist here — the bootstrap GC in BatchStore::new is commented out. That half is not claimed.
  • Pre-existing unrelated failure: test_get_local_batch already fails on clean origin/main (independent of this change).

Refs Galxe/gravity-audit#707

🤖 Generated with Claude Code

…ribers

Adds a focused cargo unit test reproducing gravity-audit F5: in
BatchStore::persist, the reject branch (batch expiration <=
last_certified_time, so save() bails and persist_inner returns None) is
filtered out by the `filter_map`, so notify_subscribers is never called
for it. A concurrent in-flight batch fetcher that subscribed via
subscribe() (the recovery / get_batch path) is therefore never woken and
hangs until its request_batch times out -- even when the batch is, in
fact, locally available in the quorum-store DB.

The test drives the public BatchReader::get_batch API with a mock network
sender whose request_batch future never resolves, forcing the only viable
delivery path to be the persist-subscription notification. It then makes
the batch locally available (save_fetched_batch_to_db) and issues the
redundant, rejected persist(). It asserts the fetcher IS served (correct
behavior).

Evidence:
- FAILS on current code: the fetcher is never notified and times out
  (panic at the timeout arm).
- PASSES when persist() is changed to call notify_subscribers for every
  persist_request regardless of the save() outcome (the original,
  now-commented-out loop). Verified locally, then reverted -- this PR adds
  test code only, no production changes.

Note on the second half of F5 (GC `expiration - 60s buffer` inequality):
populate_cache_and_gc_expired_batches and the 60s expiration buffer do
not exist in this gravity fork (the bootstrap GC in BatchStore::new is
commented out), so that half is not reproducible here and is not claimed.

Refs Galxe/gravity-audit#707

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@Lchangliang

Copy link
Copy Markdown
Contributor

I looked through the block sync / recovery payload path again, and I think the impact here is probably narrower than the PR description suggests.

The missed-notify behavior itself is real: persist() only calls notify_subscribers() for requests where persist_inner() returns Some(SignedBatchInfo). If save() rejects the batch because expiration <= last_certified_time, the request is filtered out and subscribers are not notified.

However, in the real block sync / recovery path, payload fetching has a few other guards/retry paths:

  • During block sync, fetched blocks call payload_manager.prefetch_payload_data(payload, block.timestamp_usecs()).
  • In recovery, send_for_execution(recovery=true) calls payload_manager.get_transactions(block).await in a retry loop. On error, it calls prefetch_payload_data() again and retries.
  • QuorumStorePayloadManager::request_transactions() only calls batch_reader.get_batch() when block_timestamp <= batch_info.expiration(). If the batch is expired relative to the block being executed, it is skipped before reaching BatchStore.
  • BatchReaderImpl::get_batch() always checks local cache/DB first. If the batch is already in DB, it returns directly without relying on notify_subscribers().
  • If a waiter subscribed before the batch was written to DB and the later persist() rejects without notifying, that waiter may miss this notification. But the network request path has retry/RPC timeout behavior; once it returns an error, the recovery loop re-prefetches, and the next get_batch() should re-check local DB and find the batch.

So I agree this is a missed notification edge case, and notifying after the batch becomes locally readable may be a reasonable hardening. But I don’t think the current test proves a persistent recovery hang in the production flow. The test uses a network sender whose request_batch future never resolves, which makes notify_subscribers() the only possible completion path. That is useful for isolating the internal behavior, but it amplifies the impact compared with the real path where request timeout/retry and local DB re-checks exist.

Because of that, I would treat this as a narrow concurrency edge case / possible latency or retry issue, not a high-impact recovery bug, unless we can reproduce a real block-sync/recovery path where the missed notification causes an actual stall beyond the normal request retry/timeout behavior.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants