Skip to content

fix: lock-free dedup in mempool broadcast + pfn_stress run.sh bugs#712

Open
keanji-x wants to merge 6 commits into
Galxe:mainfrom
keanji-x:fix/mempool-broadcast-snapshot-and-pfn-stress
Open

fix: lock-free dedup in mempool broadcast + pfn_stress run.sh bugs#712
keanji-x wants to merge 6 commits into
Galxe:mainfrom
keanji-x:fix/mempool-broadcast-snapshot-and-pfn-stress

Conversation

@keanji-x

Copy link
Copy Markdown
Contributor

Summary

Two unrelated bugs that surfaced together while validating PR #709's pfn_chain stress suite. Splitting into two commits for clarity; happy to split into two PRs if reviewers prefer.

1. fix(mempool): snapshot visited set once per read_timeline call (8ec83f9)

Mempool::read_timeline used a closure that locked Arc<Mutex<TxnCache>> per pool item to dedup against the visited set, then re-locked the same Mutex in the consume loop. Under sustained load that single Mutex is hit at roughly pool_size × num_sender_buckets × (1000 / shared_mempool_tick_interval_ms). With our PFN defaults (pool ≈ 100K, buckets=4, tick=10ms) that's ~40M acquire/release pairs per second on one Mutex, enough to saturate tokio workers and starve consensus message handling.

The fix takes a single snapshot of the visited set per read_timeline call and lets the filter run lock-free against it. Worst-case staleness is one extra broadcast of an in-flight tx — TxnCache already documents this as acceptable (gossip layer dedupes downstream). Insert-side semantics unchanged.

Measured impact (5-node pfn_chain stress, target=pfn3, 8K TPS, 30 min, two runs):

Bench-submitted txs reached Peak observed chain TPS
Without patch ~447K (death by mempool mutex contention ~10 min in) ~5
With patch (run 1) 1.2M 2,273
With patch (run 2) 939K 134

A deeper gravity-reth txpool issue (NO_NONCE_GAPS promotion in update_canon) takes over at the new ceiling, plus a validator-side mempool→quorum-store bottleneck — both tracked separately. This PR does not claim to make stress sustainable; it removes one of several blocking issues so the others become measurable.

2. fix(regression): pfn_chain run.sh wait-loop pipefail + stale-image gate (7ee31ad)

Two independent bugs in the stress harness introduced by PR #709, both reproducible from a clean checkout:

A. wait-for-block loop exited on first iteration. The bn=\$(curl ... | sed ...) returned curl exit 7 ("couldn't connect") while a node was still coming up; combined with set -euo pipefail, this aborted run.sh before bench launch. Swallowed with || true.

B. gravity_node:pfn-stress image cached forever. Only built when absent. After the first run, every subsequent invocation re-built the binary on the host, staged it into docker/gravity_node/bin/, then ran the old image whose COPY had baked in the original binary at image-build time. Every fix-and-retest cycle silently ran the old code. Always run docker build; BuildKit's COPY layer is keyed on binary hash so no-op is fast (~1s).

I hit (B) while iterating on PR #709 review feedback — every cluster I brought up was running the binary from the very first run two days earlier.

Test plan

  • cargo check -p aptos-mempool clean
  • cargo build --bin gravity_node --profile quick-release clean
  • BENCH_DURATION_SECS=1800 ./run.sh --clean pfn3 completes; bench Progress crosses 900K (was ~447K); no new ERROR/panic on any node
  • Run reproducible across two independent 30-min stress runs (variance ~30%, both well above pre-patch ceiling)

🤖 Generated with Claude Code

keanji-x and others added 3 commits May 13, 2026 22:32
`Mempool::read_timeline` constructed a filter closure that locked
`Arc<Mutex<TxnCache>>` per pool item, then re-locked the same Mutex in
the consume loop. Under sustained load (per-FN pool ~100K, default
`num_sender_buckets=4`, `shared_mempool_tick_interval_ms=10`) this
peaks at ~40M lock acquires/sec on a single Mutex, saturating tokio
worker threads and starving consensus message handling. PFN nodes
fall behind, FastForwardSync churn climbs, bench-submitted txs stop
flowing to the validator within ~10 min of 8K target TPS.

Take a snapshot of the visited set once under a single lock acquire
and let the filter run lock-free against the snapshot. Worst-case
staleness is one extra broadcast of an in-flight tx, which TxnCache
already documents as acceptable. Insert-side semantics are unchanged.

Validated on 5-node pfn_chain stress (target=pfn3, 8K TPS, 30 min):
bench-submitted txs reach 940K-1.2M (vs. ~447K without the fix) and
PFN nodes hold consensus sync; a deeper gravity-reth txpool issue
takes over at the new ceiling and is tracked separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two independent bugs in `regression/pfn_chain_stress/run.sh`, both
made running the suite repeatedly impossible:

1. Wait-for-block loop exited on first iteration. The curl-piped-to-sed
   inside `bn=$(...)` returned exit 7 ("couldn't connect") while a node
   was still coming up; combined with `set -euo pipefail` this aborted
   run.sh before bench launch. Swallow with `|| true` so the loop
   tolerates the connection-refused window.

2. `gravity_node:pfn-stress` image was only built when absent. After the
   first run cached the image, every subsequent invocation re-built the
   binary on the host, staged it into `docker/gravity_node/bin/`, then
   ran the *old* image whose `COPY` had baked in the original binary at
   image-build time. Cargo-built fixes silently failed to reach the
   cluster. Always run `docker build`; BuildKit's COPY layer is keyed
   on the binary's hash so the no-op case is fast (~1 s).

Both reproduced in this checkout running PR Galxe#709 against main.
The previous commit replaced the per-tick caller of `is_contains`
with `snapshot()`. The method is still useful for tests asserting
membership, but `TxnCache` is not part of the crate's public API
(`mod mempool;` is private), so rustc's `dead_code` lint correctly
flags it as unused in non-test builds — and CI's `-D warnings`
escalates that to an error.

Gating with `#[cfg(test)]` keeps the test usages compiling without
exposing dead code in lib builds.
ByteYue
ByteYue previously approved these changes May 14, 2026
nekomoto911
nekomoto911 previously approved these changes May 14, 2026
nekomoto911 and others added 2 commits May 15, 2026 13:50
Two conflicts on merge of origin/main into the PR branch:

aptos-core/mempool/src/core_mempool/mempool.rs — accept main.
  PR Galxe#722 (per-entry TTL + bucket-sharded snapshot) restructured
  read_timeline so the txn_cache Mutex is now locked exactly once per
  call (around the dispatch loop, not per pool item). This subsumes
  PR Galxe#712's snapshot-once optimization, which addressed the same
  per-pool-item lock contention via the old HashSet-based TxnCache.
  The TxnCache type no longer has the HashSet shape (snapshot()/
  is_contains() methods don't exist), so the original patch can't be
  forward-ported as-is; the underlying problem is already solved.

regression/pfn_chain_stress/run.sh — keep PR Galxe#712's always-build
  fix (the stale-image bug is real), but switch from the deleted
  Dockerfile.host-binary to main's consolidated Dockerfile with
  --target runtime-host-binary.

Verified: RUSTFLAGS="--cfg tokio_unstable" cargo check -p aptos-mempool
clean; bash -n run.sh clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@keanji-x keanji-x dismissed stale reviews from nekomoto911 and ByteYue via 780ce36 May 25, 2026 06:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants