fix: lock-free dedup in mempool broadcast + pfn_stress run.sh bugs by keanji-x · Pull Request #712 · Galxe/gravity-sdk

keanji-x · 2026-05-13T14:33:05Z

Summary

Two unrelated bugs that surfaced together while validating PR #709's pfn_chain stress suite. Splitting into two commits for clarity; happy to split into two PRs if reviewers prefer.

1. `fix(mempool): snapshot visited set once per read_timeline call` (`8ec83f9`)

`Mempool::read_timeline` used a closure that locked `Arc<Mutex<TxnCache>>` once per pool item to dedup against the visited set, then re-locked the same Mutex in the consume loop. Under sustained load that single Mutex is acquired roughly `pool_size × num_sender_buckets × (1000 / shared_mempool_tick_interval_ms)` times per second. With our PFN defaults (pool ≈ 100K, buckets=4, tick=10ms) that's ~40M acquire/release pairs per second on one Mutex, enough to saturate tokio workers and starve consensus message handling.
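As a quick check of that estimate, plugging the quoted PFN defaults into the formula gives the figure below. This assumes the worst case, where every pool item is scanned for every bucket on every tick:

$$
100{,}000 \times 4 \times \frac{1000}{10} = 4 \times 10^{7} \ \text{lock acquisitions per second}
$$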

The fix takes a single snapshot of the visited set per read_timeline call and lets the filter run lock-free against it. Worst-case staleness is one extra broadcast of an in-flight tx — TxnCache already documents this as acceptable (gossip layer dedupes downstream). Insert-side semantics unchanged.
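The difference between the two locking patterns can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: the `TxnCache` fields, the `u64` transaction key, and both `filter_*` function names are assumptions made for the example; only the `snapshot()` method name comes from the commit messages, and its signature here is a guess.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for the visited set. The real `TxnCache`
/// carries more state; a `u64` key is assumed for illustration.
#[derive(Default)]
pub struct TxnCache {
    visited: HashSet<u64>,
}

impl TxnCache {
    pub fn insert(&mut self, id: u64) {
        self.visited.insert(id);
    }

    /// Clone the visited set so callers can query it without holding the lock.
    pub fn snapshot(&self) -> HashSet<u64> {
        self.visited.clone()
    }
}

/// Old pattern: the filter closure takes the lock once per pool item.
pub fn filter_per_item_lock(pool: &[u64], cache: &Arc<Mutex<TxnCache>>) -> Vec<u64> {
    pool.iter()
        .copied()
        .filter(|id| !cache.lock().unwrap().visited.contains(id))
        .collect()
}

/// New pattern: one lock acquire per call, then a lock-free filter.
pub fn filter_with_snapshot(pool: &[u64], cache: &Arc<Mutex<TxnCache>>) -> Vec<u64> {
    // The MutexGuard is a temporary and is dropped at the end of this statement.
    let visited = cache.lock().unwrap().snapshot();
    pool.iter()
        .copied()
        .filter(|id| !visited.contains(id))
        .collect()
}
```

Both functions return the same result when nothing mutates the cache concurrently. With concurrent inserts, the snapshot version can miss an id added after the clone, which corresponds to the "one extra broadcast" staleness described above. In this sketch the snapshot costs one clone of the set per call, traded against per-item lock traffic.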

Measured impact (5-node pfn_chain stress, target=pfn3, 8K TPS, 30 min, two runs):

| | Bench-submitted txs reached | Peak observed chain TPS |
|---|---|---|
| Without patch | ~447K (death by mempool mutex contention ~10 min in) | ~5 |
| With patch (run 1) | 1.2M | 2,273 |
| With patch (run 2) | 939K | 134 |

A deeper gravity-reth txpool issue (NO_NONCE_GAPS promotion in update_canon) takes over at the new ceiling, plus a validator-side mempool→quorum-store bottleneck — both tracked separately. This PR does not claim to make stress sustainable; it removes one of several blocking issues so the others become measurable.

2. `fix(regression): pfn_chain run.sh wait-loop pipefail + stale-image gate` (`7ee31ad`)

Two independent bugs in the stress harness introduced by PR #709, both reproducible from a clean checkout:

A. wait-for-block loop exited on first iteration. The `bn=$(curl ... | sed ...)` assignment returned curl exit 7 ("couldn't connect") while a node was still coming up; combined with `set -euo pipefail`, this aborted `run.sh` before bench launch. Swallowed with `|| true`.

B. `gravity_node:pfn-stress` image cached forever. The image was only built when absent. After the first run, every subsequent invocation re-built the binary on the host, staged it into `docker/gravity_node/bin/`, then ran the old image whose `COPY` had baked in the original binary at image-build time. Every fix-and-retest cycle silently ran the old code. The fix is to always run `docker build`; BuildKit's `COPY` layer is keyed on the binary's hash, so the no-op case is fast (~1s).

I hit (B) while iterating on PR #709 review feedback — every cluster I brought up was running the binary from the very first run two days earlier.

Test plan

- `cargo check -p aptos-mempool` clean
- `cargo build --bin gravity_node --profile quick-release` clean
- `BENCH_DURATION_SECS=1800 ./run.sh --clean pfn3` completes; bench Progress crosses 900K (was ~447K); no new ERROR/panic on any node
- Result reproducible across two independent 30-min stress runs (variance ~30%, both well above pre-patch ceiling)

🤖 Generated with Claude Code

`Mempool::read_timeline` constructed a filter closure that locked `Arc<Mutex<TxnCache>>` per pool item, then re-locked the same Mutex in the consume loop. Under sustained load (per-FN pool ~100K, default `num_sender_buckets=4`, `shared_mempool_tick_interval_ms=10`) this peaks at ~40M lock acquires/sec on a single Mutex, saturating tokio worker threads and starving consensus message handling. PFN nodes fall behind, FastForwardSync churn climbs, bench-submitted txs stop flowing to the validator within ~10 min of 8K target TPS.

Take a snapshot of the visited set once under a single lock acquire and let the filter run lock-free against the snapshot. Worst-case staleness is one extra broadcast of an in-flight tx, which TxnCache already documents as acceptable. Insert-side semantics are unchanged.

Validated on 5-node pfn_chain stress (target=pfn3, 8K TPS, 30 min): bench-submitted txs reach 940K-1.2M (vs. ~447K without the fix) and PFN nodes hold consensus sync; a deeper gravity-reth txpool issue takes over at the new ceiling and is tracked separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two independent bugs in `regression/pfn_chain_stress/run.sh`, both of which made running the suite repeatedly impossible:

1. Wait-for-block loop exited on first iteration. The curl-piped-to-sed inside `bn=$(...)` returned exit 7 ("couldn't connect") while a node was still coming up; combined with `set -euo pipefail` this aborted run.sh before bench launch. Swallow with `|| true` so the loop tolerates the connection-refused window.
2. `gravity_node:pfn-stress` image was only built when absent. After the first run cached the image, every subsequent invocation re-built the binary on the host, staged it into `docker/gravity_node/bin/`, then ran the *old* image whose `COPY` had baked in the original binary at image-build time. Cargo-built fixes silently failed to reach the cluster. Always run `docker build`; BuildKit's COPY layer is keyed on the binary's hash so the no-op case is fast (~1 s).

Both reproduced in this checkout running PR Galxe#709 against main.

The previous commit replaced the per-tick caller of `is_contains` with `snapshot()`. The method is still useful for tests asserting membership, but `TxnCache` is not part of the crate's public API (`mod mempool;` is private), so rustc's `dead_code` lint correctly flags it as unused in non-test builds — and CI's `-D warnings` escalates that to an error. Gating with `#[cfg(test)]` keeps the test usages compiling without exposing dead code in lib builds.
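The gating described in this commit message might look roughly like the sketch below. The struct layout, the `u64` key, and the method signatures are assumptions for illustration; only the names `TxnCache`, `snapshot()`, and `is_contains()` appear in the PR.

```rust
use std::collections::HashSet;

/// Hypothetical shape of the cache; a `u64` key is assumed for illustration.
pub struct TxnCache {
    visited: HashSet<u64>,
}

impl TxnCache {
    pub fn new() -> Self {
        TxnCache { visited: HashSet::new() }
    }

    pub fn insert(&mut self, id: u64) {
        self.visited.insert(id);
    }

    /// Used by the broadcast path in non-test builds.
    pub fn snapshot(&self) -> HashSet<u64> {
        self.visited.clone()
    }

    /// Compiled only under `cargo test`, so lib builds never see an
    /// unused method and the `dead_code` lint has nothing to flag.
    #[cfg(test)]
    pub fn is_contains(&self, id: &u64) -> bool {
        self.visited.contains(id)
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn membership_is_visible_in_tests() {
        let mut cache = TxnCache::new();
        cache.insert(7);
        assert!(cache.is_contains(&7));
        assert!(!cache.is_contains(&8));
    }
}
```

Calling `is_contains` from non-test code would then fail to compile, which makes the test-only intent explicit.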

Two conflicts on merge of origin/main into the PR branch:

- `aptos-core/mempool/src/core_mempool/mempool.rs`: accept main. PR Galxe#722 (per-entry TTL + bucket-sharded snapshot) restructured `read_timeline` so the `txn_cache` Mutex is now locked exactly once per call (around the dispatch loop, not per pool item). This subsumes PR Galxe#712's snapshot-once optimization, which addressed the same per-pool-item lock contention via the old HashSet-based TxnCache. The TxnCache type no longer has the HashSet shape (the `snapshot()`/`is_contains()` methods don't exist), so the original patch can't be forward-ported as-is; the underlying problem is already solved.
- `regression/pfn_chain_stress/run.sh`: keep PR Galxe#712's always-build fix (the stale-image bug is real), but switch from the deleted `Dockerfile.host-binary` to main's consolidated Dockerfile with `--target runtime-host-binary`.

Verified: `RUSTFLAGS="--cfg tokio_unstable" cargo check -p aptos-mempool` clean; `bash -n run.sh` clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

keanji-x and others added 3 commits May 13, 2026 22:32

ByteYue previously approved these changes May 14, 2026


Merge branch 'main' into fix/mempool-broadcast-snapshot-and-pfn-stress

bb726a9

nekomoto911 previously approved these changes May 14, 2026


nekomoto911 and others added 2 commits May 15, 2026 13:50

Merge branch 'main' into fix/mempool-broadcast-snapshot-and-pfn-stress

bbd50c0

keanji-x dismissed stale reviews from nekomoto911 and ByteYue via 780ce36 May 25, 2026 06:34

keanji-x wants to merge 6 commits into Galxe:main from keanji-x:fix/mempool-broadcast-snapshot-and-pfn-stress
