
Optimize parallel TxSet building logic#5144

Open
dmkozh wants to merge 9 commits into stellar:master from dmkozh:txset_build_opt

Conversation


@dmkozh dmkozh commented Feb 18, 2026

Description

This contains a number of optimizations that speed up TxSet building logic 3-6 times depending on the benchmark scenario. There are (almost) no changes in the output TxSets (at least our tests don't detect any), with the exception of treating footprint key hash collisions as conflicts (this should be rare enough to not be relevant in practice). The PR is broken down into commits with individual optimizations and their impact.

Summary of the changes:

  • Optimize the low-conflict scenarios by using O(conflict txs) operations instead of O(clusters). This may or may not become problematic in the future; we can revisit it if it does.
  • Sort footprint keys instead of doing repeated map lookups for conflict detection
  • Sort transactions once instead of using SurgePricingPriorityQueue. This change is probably the sketchiest one, but the impact is pretty significant, and the amount of duplication is minimal and inconsequential (parallel tx set building is special enough anyway and might further diverge from mempool logic in the future)
  • Lots of data structure optimizations (reduce allocations/copies, avoid using shared ptrs, get rid of several maps etc.)

The initial optimizations have been iterated on by Claude, but I have cleaned these up and reviewed the changes thoroughly.

Checklist

  • Reviewed the contributing document
  • Rebased on top of master (no merge commits)
  • Ran clang-format v8.0.0 (via make format or the Visual Studio extension)
  • Compiles
  • Ran all tests
  • If change impacts performance, include supporting evidence per the performance document

Copilot AI review requested due to automatic review settings February 18, 2026 02:19
Add 3 new benchmark scenarios to improve coverage:
- Low conflict (0.1 conflicts/tx, 5 RO, 1 RW)
- Medium conflict (0.5 conflicts/tx, 2 RO, 2 RW)
- Dense RW-only (5 conflicts/tx, 0 RO, 3 RW)

Also add structured output header for easier comparison.
Add mAllConflictTxs BitSet member to Stage that accumulates the union
of all successfully-added transactions' conflict sets. getConflictingClusters
checks mAllConflictTxs.get(tx.mId) first - if false, avoids the O(C)
cluster scan entirely.

Benchmark: worst case (conflicts=1,1000,1) improved ~49% (1148ms -> 585ms).
Other scenarios within noise.
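The fast path described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: std::vector<bool> stands in for stellar-core's BitSet, and the Stage/Tx types and the mayConflict/onTxAdded names are simplified stand-ins (only mAllConflictTxs, mConflictTxs, and mId come from the PR):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for stellar-core's BitSet, indexed by tx id.
struct TxBitSet
{
    std::vector<bool> mBits;
    explicit TxBitSet(size_t n) : mBits(n, false) {}
    void set(size_t i) { mBits[i] = true; }
    bool get(size_t i) const { return mBits[i]; }
    void inplaceUnion(TxBitSet const& o)
    {
        for (size_t i = 0; i < mBits.size(); ++i)
        {
            mBits[i] = mBits[i] || o.mBits[i];
        }
    }
};

struct Tx
{
    size_t mId;
    TxBitSet mConflictTxs; // ids of txs this tx conflicts with
};

struct Stage
{
    // Union of the conflict sets of every tx added so far.
    TxBitSet mAllConflictTxs;
    explicit Stage(size_t n) : mAllConflictTxs(n) {}

    // Fast path: if no already-added tx conflicts with `tx`, the
    // O(clusters) scan can be skipped entirely.
    bool mayConflict(Tx const& tx) const
    {
        return mAllConflictTxs.get(tx.mId);
    }

    void onTxAdded(Tx const& tx)
    {
        mAllConflictTxs.inplaceUnion(tx.mConflictTxs);
    }
};
```

The single bit test replaces a scan over all clusters whenever the tx is not in any added tx's conflict set, which is the common case in low-conflict workloads.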
Add mTxToCluster vector mapping tx id -> cluster pointer, enabling
getConflictingClusters to iterate tx.mConflictTxs set bits and look up
which cluster each conflicting tx belongs to.  This is O(K) where K is
the number of conflict tx ids, vs the previous O(C) cluster scan.

The mapping is updated after each successful tryAdd by walking the new
merged cluster's txIds bitset.  For the full bin-packing path we save
the new cluster pointer before binPacking() sorts the vector.

Benchmark across all scenarios shows 5-16% improvement beyond Step 2.
Best improvement: conflicts=20,40,1 now at 480ms (was 675ms baseline, -29%).
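The O(K) lookup can be sketched as below. This is an illustrative simplification (Cluster is reduced to its tx-id list; getConflictingClusters and the mapping update are paraphrased, with mTxToCluster flattened into a plain vector of raw pointers):

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

struct Cluster
{
    std::vector<size_t> mTxIds;
};

// Map from tx id to the cluster containing it (nullptr if the tx has
// not been added to any cluster yet).
using TxToCluster = std::vector<Cluster const*>;

// O(K) lookup: walk only the K conflicting tx ids instead of scanning
// all C clusters.
std::set<Cluster const*>
getConflictingClusters(std::vector<size_t> const& conflictTxIds,
                       TxToCluster const& txToCluster)
{
    std::set<Cluster const*> res;
    for (size_t txId : conflictTxIds)
    {
        if (Cluster const* c = txToCluster[txId])
        {
            res.insert(c);
        }
    }
    return res;
}

// After a successful tryAdd, refresh the mapping by walking the merged
// cluster's tx ids.
void
updateMapping(Cluster const& merged, TxToCluster& txToCluster)
{
    for (size_t txId : merged.mTxIds)
    {
        txToCluster[txId] = &merged;
    }
}
```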
Replace createNewClusters (which copied all N shared_ptrs) with in-place
mutation of mClusters.  Conflicting clusters are saved for rollback,
removed from the vector via compaction scan, and the new merged cluster
is appended.  On failure, rollback restores the original state.

For the common no-conflict case this reduces shared_ptr operations from
O(N) copies to a single push_back.

Benchmark: 20-35% improvement across all scenarios vs Step 3.
Combined from baseline: worst case -65% (1148ms -> 401ms),
zero-conflict -32.5% (635ms -> 428ms).
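The in-place compaction described above can be sketched as follows; mergeInPlace is a hypothetical name and Cluster is reduced to a tag, but the shape matches the commit message: one scan moves conflicting clusters into a saved list (for rollback) and keeps the rest, then the merged cluster is appended:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <set>
#include <vector>

struct Cluster
{
    int id;
};

// Remove the conflicting clusters in place via a single compaction
// scan, saving them for potential rollback, then append the merged
// cluster. In the common no-conflict case the only mutation is the
// final push_back, vs. copying all N shared_ptrs before this change.
void
mergeInPlace(std::vector<std::unique_ptr<Cluster>>& clusters,
             std::set<Cluster const*> const& conflicting,
             std::unique_ptr<Cluster> merged,
             std::vector<std::unique_ptr<Cluster>>& saved)
{
    size_t out = 0;
    for (size_t i = 0; i < clusters.size(); ++i)
    {
        if (conflicting.count(clusters[i].get()) != 0)
        {
            saved.push_back(std::move(clusters[i]));
        }
        else
        {
            if (out != i)
            {
                clusters[out] = std::move(clusters[i]);
            }
            ++out;
        }
    }
    clusters.resize(out);
    clusters.push_back(std::move(merged));
}
```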
- Change txFrames and sorobanCfg from by-value to by-reference capture
  in the thread lambda, eliminating O(N) shared_ptr copies per thread.

Benchmark: modest improvement on conflict-heavy scenarios
(conflicts=1,1000,1: 401->381ms, conflicts=10,10,10: 576->524ms).
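The capture change amounts to the difference sketched below (a toy stand-in: in the real code the lambda body runs on a worker thread and iterates the actual txFrames vector; sumFees and the long payload are illustrative):

```cpp
#include <cassert>
#include <memory>
#include <vector>

long
sumFees(std::vector<std::shared_ptr<long>> const& txFrames)
{
    // By-reference capture: no O(N) shared_ptr copies (each of which
    // is an atomic refcount increment) when the lambda is created.
    // Safe because txFrames outlives the worker, which is joined
    // before the data goes out of scope.
    auto work = [&txFrames]() {
        long total = 0;
        for (auto const& tx : txFrames)
        {
            total += *tx;
        }
        return total;
    };
    // A by-value capture ([txFrames]) would copy every element here.
    return work();
}
```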
Replace UnorderedMap<LedgerKey,...> footprint maps with a flat vector of
(keyHash, txId, isRW) entries sorted by precomputed hash. This
avoids ~400K hash map node allocations and multiple rehashings, providing
much better cache behavior through sequential memory access.

The sort-based approach:
1. Collects all footprint entries into a single flat vector with
   precomputed LedgerKey hashes (one allocation).
2. Sorts by hash for cache-friendly grouping.
3. Linear scan finds groups sharing the same key hash.
4. Conflict marking (RW-RW and RO-RW) is done in the same scan,
   eliminating the separate conflict marking phase.

Profile improvement (no-conflict scenario):
  footprintMaps: 240ms -> 70ms (hashSort)
  conflictMark:   27ms ->  1ms (conflictScan)
  Combined:      267ms -> 71ms  (3.8x faster)

Benchmark improvement from previous step:
  conflicts=0,0,0:     452.6 -> 179.4ms  (-60%)
  conflicts=0.1,5,1:   424.0 -> 196.3ms  (-54%)
  conflicts=1,1000,1:  381.4 -> 170.5ms  (-55%)
  conflicts=10,10,10:  524.4 -> 294.1ms  (-44%)
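The sort-based approach above can be sketched as a self-contained function. The entry layout (keyHash, txId, isRW) follows the commit message; findConflicts and the pairwise output format are illustrative simplifications of the actual conflict-bitset marking:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

struct FootprintEntry
{
    uint64_t keyHash; // precomputed LedgerKey hash
    size_t txId;
    bool isRW;
};

// Sort all footprint entries by key hash, then a single linear scan
// over each same-hash group marks RW-RW and RO-RW pairs as conflicts.
// Distinct keys colliding on the 64-bit hash are conservatively
// treated as conflicts (exceedingly rare in practice).
std::set<std::pair<size_t, size_t>>
findConflicts(std::vector<FootprintEntry>& entries)
{
    std::sort(entries.begin(), entries.end(),
              [](auto const& a, auto const& b) {
                  return a.keyHash < b.keyHash;
              });
    std::set<std::pair<size_t, size_t>> conflicts;
    for (size_t i = 0; i < entries.size();)
    {
        // Find the group [i, j) sharing the same key hash.
        size_t j = i;
        while (j < entries.size() && entries[j].keyHash == entries[i].keyHash)
        {
            ++j;
        }
        // Within the group, any pair of distinct txs where at least
        // one side writes conflicts.
        for (size_t a = i; a < j; ++a)
        {
            for (size_t b = a + 1; b < j; ++b)
            {
                if ((entries[a].isRW || entries[b].isRW) &&
                    entries[a].txId != entries[b].txId)
                {
                    conflicts.emplace(
                        std::min(entries[a].txId, entries[b].txId),
                        std::max(entries[a].txId, entries[b].txId));
                }
            }
        }
        i = j;
    }
    return conflicts;
}
```

One flat allocation plus a sort replaces per-key hash-map nodes, which is where the cache-behavior win comes from.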
Worker threads receive read-only references instead of expensive deep-copied std::set trees. Each thread does a simple linear scan over the sorted order, checking anyGreater() to skip oversized txs.

It's not great that a bit of logic is duplicated with SurgePricingPriorityQueue now, but on the other hand the parallel tx set nomination is different enough from the mempool prioritization and going forward it might diverge even more, so this duplication doesn't seem too problematic to me (and it's just a few lines in the end).

Benchmark (10000 txs, 128 clusters, 1-4 stages, 5 iterations):
  conflicts=0,0,0:     179.4 -> 194.8ms (noise; +8.6%)
  conflicts=0.1,5,1:   196.3 -> 150.0ms (-23.6%)
  conflicts=0.5,2,2:   192.9 -> 142.2ms (-26.3%)
  conflicts=1,1000,1:  170.5 -> 139.0ms (-18.5%)
  conflicts=5,0,3:     176.8 -> 156.4ms (-11.5%)
  conflicts=10,40,1:   231.3 -> 194.0ms (-16.1%)
  conflicts=20,40,1:   253.8 -> 195.3ms (-23.0%)
  conflicts=10,10,10:  294.1 -> 250.9ms (-14.7%)
  conflicts=50,50,5:   265.0 -> 199.6ms (-24.7%)

8/9 scenarios improved, up to 26% faster. Eliminates O(N log N)
std::set insertions and 4 expensive tree deep-copies per build.
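The sort-once replacement for the priority queue can be sketched as below (a toy version: Tx carries only a fee bid and an op count standing in for the full resource vector, and greedySelect is a hypothetical name; the oversized-tx skip corresponds to the anyGreater() check mentioned above):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Tx
{
    int64_t feeBid;
    uint32_t numOps; // stands in for the full resource vector
};

// Sort indices once by fee-per-op (cross-multiplied to avoid
// division), then do a single greedy linear scan that skips txs
// exceeding the remaining capacity. Replaces repeated
// SurgePricingPriorityQueue operations for parallel tx set building.
std::vector<size_t>
greedySelect(std::vector<Tx> txs, uint32_t opsLimit)
{
    std::vector<size_t> order(txs.size());
    for (size_t i = 0; i < order.size(); ++i)
    {
        order[i] = i;
    }
    std::sort(order.begin(), order.end(), [&](size_t a, size_t b) {
        // Higher fee per op first.
        return txs[a].feeBid * txs[b].numOps > txs[b].feeBid * txs[a].numOps;
    });
    std::vector<size_t> selected;
    uint32_t opsLeft = opsLimit;
    for (size_t i : order)
    {
        if (txs[i].numOps > opsLeft) // anyGreater(): skip oversized tx
        {
            continue;
        }
        opsLeft -= txs[i].numOps;
        selected.push_back(i);
    }
    return selected;
}
```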
- get rid of shared_ptr
- get rid of unnecessary maps

Up to 40% improvement on some benchmarks (180->110ms for no conflicts scenario).
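The map-to-vector part of this commit follows a general pattern worth spelling out: when keys are dense tx ids in [0, N), a plain vector replaces a hash map with no node allocations, no hashing, and contiguous memory. A generic sketch (buildDense and the sentinel value are illustrative, not the PR's code):

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Convert a map keyed by dense tx ids into a flat vector; -1 marks
// ids with no entry.
std::vector<int>
buildDense(size_t numTxs, std::unordered_map<size_t, int> const& sparse)
{
    std::vector<int> dense(numTxs, -1);
    for (auto const& [txId, value] : sparse)
    {
        dense[txId] = value;
    }
    return dense;
}
```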

Copilot AI left a comment


Pull request overview

This PR optimizes the parallel transaction set building logic for Soroban transactions, achieving 3-6x performance improvements through several key optimizations. The changes modernize the conflict detection algorithm, eliminate redundant data structures, and reduce allocations while maintaining correctness (with a minor acceptable change to treat hash collisions as conflicts).

Changes:

  • Replaced map-based conflict detection with a hash-based sorting approach, trading O(K) lookups for O(N log N) sorting with better cache locality
  • Refactored TxComparator (formerly nested in SurgePricingPriorityQueue) into a standalone TxFeeComparator class and replaced priority queue-based iteration with direct sorting
  • Optimized Stage class with in-place cluster management, fast-path conflict detection using a global conflict mask, and O(K) conflict lookups via a tx-to-cluster mapping
  • Changed data structures from shared_ptr to unique_ptr and replaced maps with vectors for better performance
  • Enhanced validation error reporting with structured TxSetValidationResult enum values
  • Added comprehensive benchmark scenarios to validate optimizations

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Summary per file:

  • src/herder/test/TxSetTests.cpp: Added descriptive output and additional benchmark scenarios covering various conflict patterns (sparse, heavy, RO/RW mixes)
  • src/herder/TxSetFrame.cpp: Enhanced error reporting with structured validation results and formatted error messages
  • src/herder/SurgePricingUtils.h: Extracted TxComparator as public TxFeeComparator class for reuse in parallel tx set building
  • src/herder/SurgePricingUtils.cpp: Moved TxComparator implementation to TxFeeComparator at namespace scope
  • src/herder/ParallelTxSetBuilder.cpp: Core optimizations: hash-based conflict detection, in-place cluster management, O(K) conflict lookups, direct sorting instead of priority queue, and reduced allocations
  • Builds/VisualStudio/stellar-core.vcxproj.filters: Added new source files and reorganized TracyClient.cpp location
  • Builds/VisualStudio/stellar-core.vcxproj: Whitespace fixes (tabs to spaces) and added new source files

Comment on lines +341 to 359
rollbackClusters(Cluster const* mergedCluster,
                 std::vector<std::unique_ptr<Cluster const>>& savedClusters)
{
    // Find and swap-pop the merged cluster.
    for (size_t i = 0; i < mClusters.size(); ++i)
    {
        if (mClusters[i].get() == mergedCluster)
        {
            mClusters[i] = std::move(mClusters.back());
            mClusters.pop_back();
            break;
        }
    }
    // Restore the saved conflicting clusters.
    for (auto& saved : savedClusters)
    {
        mClusters.push_back(std::move(saved));
    }
}

Copilot AI Feb 18, 2026


The rollbackClusters function does not update the mTxToCluster mapping when rolling back a failed merge. This could leave stale pointers in mTxToCluster pointing to the removed merged cluster, or fail to restore pointers to the original saved clusters. The mapping should be updated during rollback to maintain consistency. Consider clearing the mapping entries for transactions in the merged cluster and restoring them for the saved clusters.

// to the complexity of the straightforward approach of iterating over all
// the transaction pairs.
// With 64-bit hashes and typical
// footprint sizes, collisions are exceedingly rare .

Copilot AI Feb 18, 2026


There is a missing space after 'rare' in the comment. The sentence should read "collisions are exceedingly rare." with proper punctuation.

Suggested change
// footprint sizes, collisions are exceedingly rare .
// footprint sizes, collisions are exceedingly rare.

continue;
}

// Collect all entries matching.

Copilot AI Feb 18, 2026


The comment is incomplete. It says "Collect all entries matching." but doesn't complete the thought. Consider clarifying what the entries are matching, such as "Collect all entries matching the current key hash."

Suggested change
// Collect all entries matching.
// Collect all entries with the matching keyHash into RO/RW lists.
