Skip to content

test(consensus): reproduce fast_forward_sync num_blocks u64 underflow (audit #705)#742

Open
keanji-x wants to merge 1 commit into
mainfrom
test/unit-ffsync-numblocks-underflow-705
Open

test(consensus): reproduce fast_forward_sync num_blocks u64 underflow (audit #705)#742
keanji-x wants to merge 1 commit into
mainfrom
test/unit-ffsync-numblocks-underflow-705

Conversation

@keanji-x

Copy link
Copy Markdown
Contributor

Summary

Reproduction (test-only PR) for gravity-audit#705 — F3 in galxe/RESTART_RECOVERY_FINDINGS.md: a u64 underflow of num_blocks in BlockStore::fast_forward_sync.

The call site at aptos-core/consensus/src/block_storage/sync_manager.rs:509-514:

let num_blocks = highest_quorum_cert.certified_block().round()
    - highest_commit_cert.ledger_info().ledger_info().round()
    + 1;
// although unlikely, we might wrap num_blocks around on a 32-bit machine
assert!(num_blocks < std::usize::MAX as u64);

highest_commit_cert is effective_commit_cert = max(local_hcc, remote_hcc) by round (the gravity-specific change at :358). When the local commit root is ahead of the incoming SyncInfo's hqc certified round (a stale / forked SyncInfo whose hqc block the node does not have), hqc_round - commit_round + 1 underflows to ~u64::MAX. The guard only considered 32-bit overflow; on 64-bit usize::MAX == u64::MAX, so the wrapped value still satisfies < usize::MAX and the assert passes, then the absurd count is handed to retrieve_blocks_in_range(.., num_blocks, ..):

  • debug builds: panic with attempt to subtract with overflow
  • release builds: wraps and attempts a giant fetch → sync stall.

What this PR adds

Test code only — no production-code edits. A #[cfg(test)] mod ffsync_numblocks_underflow_705_tests appended to sync_manager.rs that builds the exact same QuorumCert / WrappedLedgerInfo objects the production path operates on (using gen_test_certificate + into_wrapped_ledger_info), with commit_round = 100 > hqc_round = 90, and evaluates the identical num_blocks expression.

  • ffsync_num_blocks_underflows_when_commit_ahead_of_hqcFAILS on current code (overflow panic at the production expression). Would PASS once a checked_sub + early-return guard (commit_round >= hqc_round) is added.
  • ffsync_num_blocks_ok_when_hqc_ahead_of_commit — passes; guards against an over-broad fix breaking the happy path.

Evidence

$ RUSTFLAGS="--cfg tokio_unstable" cargo test -p aptos-consensus ffsync_numblocks_underflow_705 -- --nocapture --test-threads=1

running 2 tests
test ...::ffsync_num_blocks_ok_when_hqc_ahead_of_commit ... ok
test ...::ffsync_num_blocks_underflows_when_commit_ahead_of_hqc ...
thread '...::ffsync_num_blocks_underflows_when_commit_ahead_of_hqc' panicked at
  aptos-core/consensus/src/block_storage/sync_manager.rs:1446:26:
attempt to subtract with overflow
FAILED

test result: FAILED. 1 passed; 1 failed; 0 ignored; 0 measured; 272 filtered out

sync_manager.rs:1446 is the test's verbatim copy of the production num_blocks expression (line 509). The crate's test profile has overflow checks enabled, so the underflow surfaces as a panic.

Suggested fix (not in this PR)

Use checked_sub with an early return when the node is already ≥ the target (commit round ≥ hqc round).

Refs Galxe/gravity-audit#705

🤖 Generated with Claude Code

…#705)

F3 of galxe/RESTART_RECOVERY_FINDINGS.md. `BlockStore::fast_forward_sync`
(aptos-core/consensus/src/block_storage/sync_manager.rs:509-514) computes:

    let num_blocks = highest_quorum_cert.certified_block().round()
        - highest_commit_cert.ledger_info().ledger_info().round() + 1;
    assert!(num_blocks < std::usize::MAX as u64);  // does NOT catch underflow on 64-bit

`highest_commit_cert` is `effective_commit_cert = max(local_hcc, remote_hcc)`
by round (sync_to_highest_quorum_cert, :358). When the local commit root is
AHEAD of the incoming SyncInfo's hqc certified round, `hqc_round -
commit_round + 1` underflows to ~u64::MAX. The guard only handled 32-bit
overflow; on 64-bit usize::MAX == u64::MAX so the wrapped value passes the
assert and an absurd count is handed to retrieve_blocks_in_range (debug:
panic; release: wrap + giant fetch -> sync stall).

Adds a #[cfg(test)] module (test code only, no production-code edits) that
builds the exact same QuorumCert / WrappedLedgerInfo objects with
commit_round (100) > hqc_round (90) and evaluates the identical num_blocks
expression.

Evidence (cargo test -p aptos-consensus, profile has overflow checks on):

  test ...::ffsync_num_blocks_underflows_when_commit_ahead_of_hqc ...
  thread '...' panicked at sync_manager.rs:1446:26:
  attempt to subtract with overflow
  FAILED
  test ...::ffsync_num_blocks_ok_when_hqc_ahead_of_commit ... ok

The underflow test FAILS on current code (overflow panic at the production
expression); it would PASS once a checked_sub + early-return guard
(commit_round >= hqc_round) is added. The happy-path test guards against an
over-broad fix.

Refs Galxe/gravity-audit#705

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@Lchangliang

Copy link
Copy Markdown
Contributor

Thanks for adding the repro test. I double checked the actual add_certs path, and I don’t think this test proves the issue is reachable through the real sync flow as written.

The test constructs the arithmetic case directly: hqc_round = 90, commit_round = 100, then evaluates the same num_blocks = hqc_round - commit_round + 1 expression. That does demonstrate the unchecked expression can underflow for those isolated inputs.

However, in the real path there are earlier guards/invariants before fast_forward_sync is called:

  • sync_up only calls add_certs after has_newer_certificates() and SyncInfo::verify().
  • SyncInfo::verify() ensures the remote cert ordering: remote HQC >= remote HOC >= remote HCC.
  • The sync_to_highest_commit_cert path already checks sync_from_cert.round < ledger_info.round before calling fast_forward_sync, so that path should not call it with a commit cert ahead of the target.
  • The only suspicious branch is sync_to_highest_quorum_cert, where we use effective_commit_cert = max(local_hcc, remote_hcc). But that branch first checks need_sync_to_highest_quorum_cert(remote_hqc), which requires ordered_root.round < remote_hqc.certified_round && !block_exists(remote_hqc.id).

Given the normal local invariant ordered_root.round >= local_hcc.round, if local_hcc.round = 100 and remote_hqc.round = 90, need_sync_to_highest_quorum_cert() should return false before we ever call fast_forward_sync. So the constructed local_hcc > remote_hqc case appears to be filtered out by the real add_certs path.

Because of that, I don’t think this unit test is sufficient as a repro for #705. To make this actionable, the test should drive the real add_certs / sync_to_highest_quorum_cert path and show that, after SyncInfo::verify() and need_sync_to_highest_quorum_cert(), we can still reach fast_forward_sync(remote_hqc, effective_commit_cert) with effective_commit_cert.round > remote_hqc.certified_round. Without that, this looks like an artificial arithmetic case rather than a reachable bug.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants