[feat] Hybrid layout for HashJoin/Sort #119
| << ", spill enabled: " << spillEnabled() | ||
| << ", maxHashTableSize = " << maxHashTableBucketCount_; | ||
| << ", maxHashTableSize = " << maxHashTableBucketCount_ | ||
| << ", hybrid mode " << (hybridJoin_ ? "enabled" : "disabled"); |
| if (hybridJoin_) { | ||
| BOLT_CHECK_LE( | ||
| driverId_, | ||
| 255, |
Why is the limit hardcoded to 255?
Currently we store the rowId as a BIGINT (64 bits), where the top 8 bits represent the driverId and the remaining 56 bits represent the rowId within each driver. So the largest driverId it supports is 255 (at most 256 drivers). Maybe we can make this configurable.
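For readers following the thread, the 8/56-bit split described above could look roughly like the sketch below. All names here (`packRowId`, `driverIdOf`, `rowIdOf`, the constants) are hypothetical and chosen for illustration; the PR's actual `HybridRowId` layout may differ.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical constants for illustration; not taken from the PR.
constexpr int kDriverIdBits = 8;
constexpr int kRowIdBits = 64 - kDriverIdBits; // 56 bits for the per-driver rowId
constexpr uint64_t kMaxDriverId = (1ULL << kDriverIdBits) - 1; // 255
constexpr uint64_t kRowIdMask = (1ULL << kRowIdBits) - 1;

// Packs driverId into the top 8 bits and rowId into the low 56 bits.
inline uint64_t packRowId(uint64_t driverId, uint64_t rowId) {
  assert(driverId <= kMaxDriverId);
  assert(rowId <= kRowIdMask);
  return (driverId << kRowIdBits) | rowId;
}

// Recovers the driverId from the top 8 bits.
inline uint64_t driverIdOf(uint64_t packed) {
  return packed >> kRowIdBits;
}

// Recovers the per-driver rowId from the low 56 bits.
inline uint64_t rowIdOf(uint64_t packed) {
  return packed & kRowIdMask;
}
```

Making `kDriverIdBits` a config, as suggested above, would trade rowId range for driver count: each extra driver bit halves the number of rows addressable per driver.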
| const T* rawValues = flatChild->rawValues(); | ||
| const uint64_t* rawNulls = flatChild->rawNulls(); | ||
| constexpr vector_size_t kPrefetchDist = 16; |
Does the magic number 16 suit all architectures, e.g. x86 and ARM?
I'm not sure. I tuned it on an x86_64 machine; k = 8–32 performs similarly.
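To make the tuning question concrete, here is a minimal sketch of a gather loop with a compile-time prefetch distance, so that different values can be benchmarked per architecture. `gatherWithPrefetch` is a hypothetical helper, not the PR's code, and it assumes a GCC- or Clang-compatible compiler for `__builtin_prefetch`; on other compilers the prefetch is simply skipped.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper for illustration. Copies values[rowIds[i]] into out[i]
// and prefetches the element that will be needed kPrefetchDist iterations later.
template <typename T, std::size_t kPrefetchDist = 16>
void gatherWithPrefetch(
    const T* values,
    const std::vector<uint32_t>& rowIds,
    T* out) {
  assert(rowIds.empty() || (values != nullptr && out != nullptr));
  const std::size_t n = rowIds.size();
  for (std::size_t i = 0; i < n; ++i) {
    if (i + kPrefetchDist < n) {
#if defined(__GNUC__) || defined(__clang__)
      // Hint only: correctness does not depend on the prefetch.
      __builtin_prefetch(&values[rowIds[i + kPrefetchDist]]);
#endif
    }
    out[i] = values[rowIds[i]];
  }
}
```

Because the distance is a template parameter, a benchmark can instantiate `gatherWithPrefetch<int64_t, 8>` and `gatherWithPrefetch<int64_t, 32>` side by side on x86 and ARM before settling on a default.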
| const auto rid = rowIdPtr[idx].rowId_; | ||
| if (rawNulls != nullptr && bits::isBitNull(rawNulls, rid)) { | ||
| bits::setNull(nulls, resultIndex, true); |
isBitNull and setNull are not very efficient. How about setting 4 or 8 bits at a time?
bits::setNull(nulls, index, byte, mask) // add a new interface?
Then we could use only bitwise OR/AND and eliminate this if branch.
Yeah, we can set the null bits for 4 or 8 rows at a time by modifying a uint8_t. But we also use that condition check to skip the value copy when the row is null, so I'm not sure we can really eliminate the if branch.
Yeah, it is not easy to eliminate the if branch.
We could use a gather instruction to fetch the scattered null bytes, then movemask them into a single 8-bit result.
But the gather instruction shows no obvious advantage. It is OK to keep the if branch.
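A scalar version of the "8 bits at a time" idea discussed in this thread might look like the sketch below. It is an assumption-laden illustration, not the PR's code: `isNotNull` and `gatherColumn` are hypothetical helpers, and the sketch assumes a bitmap convention where a set bit means "not null". The per-row branch is kept, as agreed above, so that null rows skip the value copy; only the result bitmap write is batched.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed convention for this sketch: bit i set => row i is NOT null.
inline bool isNotNull(const uint64_t* bits, uint64_t idx) {
  return ((bits[idx / 64] >> (idx % 64)) & 1ULL) != 0;
}

// Gathers srcValues[rowIds[i]] into outValues[i] and builds the result null
// bitmap one byte (8 rows) at a time instead of one bit at a time.
// outNulls must have room for (numRows + 7) / 8 bytes.
template <typename T>
void gatherColumn(
    const T* srcValues,
    const uint64_t* srcNulls,
    const uint64_t* rowIds,
    std::size_t numRows,
    T* outValues,
    uint8_t* outNulls) {
  assert(numRows == 0 || (srcValues && srcNulls && rowIds && outValues && outNulls));
  for (std::size_t i = 0; i < numRows; i += 8) {
    const std::size_t lanes = std::min<std::size_t>(8, numRows - i);
    uint8_t byte = 0;
    // Collect up to 8 scattered source bits with OR/shift only.
    for (std::size_t k = 0; k < lanes; ++k) {
      const unsigned bit = isNotNull(srcNulls, rowIds[i + k]) ? 1u : 0u;
      byte = static_cast<uint8_t>(byte | (bit << k));
    }
    outNulls[i / 8] = byte; // one store replaces up to eight setNull calls
    // The branch stays: it avoids copying values for null rows.
    for (std::size_t k = 0; k < lanes; ++k) {
      if (byte & (1u << k)) {
        outValues[i + k] = srcValues[rowIds[i + k]];
      }
    }
  }
}
```

Whether this beats the per-bit path in practice would need measuring; the thread's observation that a gather instruction showed no obvious advantage suggests the gain may be small.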
@ruochenj123 I have completed the code review. The design and implementation look fine to me, and I have left some minor comments.
What problem does this PR solve?
Issue Number: #11
Type of Change
Description
Currently, HashJoin and Sort store their data in a row-based RowContainer, which incurs non-trivial layout conversion overhead. This PR introduces a hybrid storage design that keeps payload columns in their original columnar format and stores only the keys in RowContainer, reducing that conversion overhead.
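As a rough illustration of the idea in this description, the sketch below sorts only small row-wise key entries and then gathers the columnar payload once, in sorted order. `KeyEntry`, `PayloadColumns`, and `hybridSort` are hypothetical, heavily simplified stand-ins; they are not the PR's `HybridContainer` or `RowContainer` APIs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a row-wise key entry plus a payload reference.
struct KeyEntry {
  int64_t key;    // sort key, stored row-wise
  uint32_t rowId; // index of the matching row in the columnar payload
};

// Hypothetical stand-in for payload kept in its original columnar layout.
struct PayloadColumns {
  std::vector<int64_t> amount;
  std::vector<std::string> name;
};

// Sorts the key entries, then materializes the payload column by column.
// The payload is never converted to a row layout.
PayloadColumns hybridSort(
    std::vector<KeyEntry>& keys,
    const PayloadColumns& payload) {
  std::stable_sort(
      keys.begin(), keys.end(), [](const KeyEntry& a, const KeyEntry& b) {
        return a.key < b.key;
      });
  PayloadColumns out;
  out.amount.reserve(keys.size());
  out.name.reserve(keys.size());
  for (const auto& entry : keys) {
    assert(entry.rowId < payload.amount.size());
    out.amount.push_back(payload.amount[entry.rowId]);
  }
  for (const auto& entry : keys) {
    assert(entry.rowId < payload.name.size());
    out.name.push_back(payload.name[entry.rowId]);
  }
  return out;
}
```

The gather step is where the cache and TLB behavior discussed in the PR matters, since payload rows are read in key order rather than storage order.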
Main Changes
HybridContainer
- Separates key storage (RowContainer) from payload storage (kept as RowVectorPtr).
- Uses HybridRowId to reference payload rows.
- HybridRowId encodes {containerId, rowId} to support multi-driver parallel execution in HashJoin.
Multi-driver support in HashJoin
- allContainers_ map enables cross-container payload extraction during the probe phase.
Extraction optimizations
- coalesceBatches() flattens multiple payload batches into a single contiguous batch to reduce TLB misses during extraction.
- sortByContainerId() reorders rows by containerId before extraction to improve cache locality in multi-container scenarios.
- isSingleContainer() provides a fast path that skips sorting overhead in the single-driver scenario.
Configuration options
- hybrid_join_enabled / hybrid_sort_enabled to opt in to hybrid execution.
- hybrid_join_reorder_enabled to control row reordering (disabled in tests to preserve deterministic output).
- hybrid_join_scattered_mode_enabled / hybrid_sort_scattered_mode_enabled to control the reconstruction method. scattered_mode requires less memory but may add reconstruction overhead due to cache/TLB misses.
Performance Impact
[] No Impact: This change does not affect the critical path (e.g., build system, doc, error handling).
Positive Impact: I have run benchmarks.
Benchmark Results
Production Workload Results
Sort + Window Queries
Queries with Sort followed by Window functions (RANK, ROW_NUMBER, LEAD/LAG).
Summary: Sort speedups 1.09x–2.81x; Total speedups 1.03x–1.47x.
Dynamic Partition Insert Queries
Queries writing to partitioned tables with Sort for partition ordering.
Summary: Sort speedups 1.50x–2.19x (excluding Q5 regression); Total speedups 1.03x–1.16x.
Suboptimal Join Order Queries
Queries with suboptimal join orders placing larger tables on the build side.
Summary: Join speedups 1.04x–1.53x; Total speedups 1.01x–1.07x.
Negative Impact: Explained below (e.g., trade-off for correctness).
Release Note
Checklist (For Author)
Breaking Changes
No
Yes (Description: ...)