fix: lock-free dedup in mempool broadcast + pfn_stress run.sh bugs#712
Open
keanji-x wants to merge 6 commits into
Open
fix: lock-free dedup in mempool broadcast + pfn_stress run.sh bugs#712keanji-x wants to merge 6 commits into
keanji-x wants to merge 6 commits into
Conversation
`Mempool::read_timeline` constructed a filter closure that locked `Arc<Mutex<TxnCache>>` per pool item, then re-locked the same Mutex in the consume loop. Under sustained load (per-FN pool ~100K, default `num_sender_buckets=4`, `shared_mempool_tick_interval_ms=10`) this peaks at ~40M lock acquires/sec on a single Mutex, saturating tokio worker threads and starving consensus message handling. PFN nodes fall behind, FastForwardSync churn climbs, bench-submitted txs stop flowing to the validator within ~10 min of 8K target TPS. Take a snapshot of the visited set once under a single lock acquire and let the filter run lock-free against the snapshot. Worst-case staleness is one extra broadcast of an in-flight tx, which TxnCache already documents as acceptable. Insert-side semantics are unchanged. Validated on 5-node pfn_chain stress (target=pfn3, 8K TPS, 30 min): bench-submitted txs reach 940K-1.2M (vs. ~447K without the fix) and PFN nodes hold consensus sync; a deeper gravity-reth txpool issue takes over at the new ceiling and is tracked separately. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two independent bugs in `regression/pfn_chain_stress/run.sh`, both
made running the suite repeatedly impossible:
1. Wait-for-block loop exited on first iteration. The curl-piped-to-sed
inside `bn=$(...)` returned exit 7 ("couldn't connect") while a node
was still coming up; combined with `set -euo pipefail` this aborted
run.sh before bench launch. Swallow with `|| true` so the loop
tolerates the connection-refused window.
2. `gravity_node:pfn-stress` image was only built when absent. After the
first run cached the image, every subsequent invocation re-built the
binary on the host, staged it into `docker/gravity_node/bin/`, then
ran the *old* image whose `COPY` had baked in the original binary at
image-build time. Cargo-built fixes silently failed to reach the
cluster. Always run `docker build`; BuildKit's COPY layer is keyed
on the binary's hash so the no-op case is fast (~1 s).
Both reproduced in this checkout running PR Galxe#709 against main.
The previous commit replaced the per-tick caller of `is_contains` with `snapshot()`. The method is still useful for tests asserting membership, but `TxnCache` is not part of the crate's public API (`mod mempool;` is private), so rustc's `dead_code` lint correctly flags it as unused in non-test builds — and CI's `-D warnings` escalates that to an error. Gating with `#[cfg(test)]` keeps the test usages compiling without exposing dead code in lib builds.
ByteYue
previously approved these changes
May 14, 2026
nekomoto911
previously approved these changes
May 14, 2026
Two conflicts on merge of origin/main into the PR branch: aptos-core/mempool/src/core_mempool/mempool.rs — accept main. PR Galxe#722 (per-entry TTL + bucket-sharded snapshot) restructured read_timeline so the txn_cache Mutex is now locked exactly once per call (around the dispatch loop, not per pool item). This subsumes PR Galxe#712's snapshot-once optimization, which addressed the same per-pool-item lock contention via the old HashSet-based TxnCache. The TxnCache type no longer has the HashSet shape (snapshot()/ is_contains() methods don't exist), so the original patch can't be forward-ported as-is; the underlying problem is already solved. regression/pfn_chain_stress/run.sh — keep PR Galxe#712's always-build fix (the stale-image bug is real), but switch from the deleted Dockerfile.host-binary to main's consolidated Dockerfile with --target runtime-host-binary. Verified: RUSTFLAGS="--cfg tokio_unstable" cargo check -p aptos-mempool clean; bash -n run.sh clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two unrelated bugs that surfaced together while validating PR #709's pfn_chain stress suite. Splitting into two commits for clarity; happy to split into two PRs if reviewers prefer.
1.
fix(mempool): snapshot visited set once per read_timeline call(8ec83f9)Mempool::read_timelineused a closure that lockedArc<Mutex<TxnCache>>per pool item to dedup against the visited set, then re-locked the same Mutex in the consume loop. Under sustained load that single Mutex is hit at roughlypool_size × num_sender_buckets × (1000 / shared_mempool_tick_interval_ms). With our PFN defaults (pool ≈ 100K, buckets=4, tick=10ms) that's ~40M acquire/release pairs per second on one Mutex, enough to saturate tokio workers and starve consensus message handling.The fix takes a single snapshot of the visited set per
read_timelinecall and lets the filter run lock-free against it. Worst-case staleness is one extra broadcast of an in-flight tx —TxnCachealready documents this as acceptable (gossip layer dedupes downstream). Insert-side semantics unchanged.Measured impact (5-node pfn_chain stress, target=pfn3, 8K TPS, 30 min, two runs):
A deeper gravity-reth txpool issue (NO_NONCE_GAPS promotion in
update_canon) takes over at the new ceiling, plus a validator-side mempool→quorum-store bottleneck — both tracked separately. This PR does not claim to make stress sustainable; it removes one of several blocking issues so the others become measurable.2.
fix(regression): pfn_chain run.sh wait-loop pipefail + stale-image gate(7ee31ad)Two independent bugs in the stress harness introduced by PR #709, both reproducible from a clean checkout:
A. wait-for-block loop exited on first iteration. The
bn=\$(curl ... | sed ...)returned curl exit 7 ("couldn't connect") while a node was still coming up; combined withset -euo pipefail, this aborted run.sh before bench launch. Swallowed with|| true.B.
gravity_node:pfn-stressimage cached forever. Only built when absent. After the first run, every subsequent invocation re-built the binary on the host, staged it intodocker/gravity_node/bin/, then ran the old image whoseCOPYhad baked in the original binary at image-build time. Every fix-and-retest cycle silently ran the old code. Always rundocker build; BuildKit's COPY layer is keyed on binary hash so no-op is fast (~1s).I hit (B) while iterating on PR #709 review feedback — every cluster I brought up was running the binary from the very first run two days earlier.
Test plan
cargo check -p aptos-mempoolcleancargo build --bin gravity_node --profile quick-releasecleanBENCH_DURATION_SECS=1800 ./run.sh --clean pfn3completes; bench Progress crosses 900K (was ~447K); no new ERROR/panic on any node🤖 Generated with Claude Code