[feat] Hybrid layout for HashJoin/Sort#119

Merged
markjin1990 merged 16 commits into bytedance:main from ruochenj123:hybrid-design on Apr 8, 2026

Conversation

@ruochenj123
Contributor

@ruochenj123 ruochenj123 commented Jan 14, 2026

What problem does this PR solve?

Issue Number: #11

Type of Change

  • 🐛 Bug fix (non-breaking change which fixes an issue)
  • ✨ New feature (non-breaking change which adds functionality)
  • 🚀 Performance improvement (optimization)
  • ⚠️ Breaking change (fix or feature that would cause existing functionality to change)
  • 🔨 Refactoring (no logic changes)
  • 🔧 Build/CI or Infrastructure changes
  • 📝 Documentation only

Description

Currently, HashJoin and Sort store their data in a row-based RowContainer, which incurs non-trivial layout conversion overhead. This PR introduces a hybrid storage design that stores only the keys in RowContainer and keeps payload columns in their original columnar format, which reduces that overhead.

Main Changes

  1. HybridContainer

    • Separates key storage (in RowContainer) from payload storage (kept as RowVectorPtr).
    • Introduces encoded HybridRowId to reference payload rows.
    • HybridRowId encodes {containerId, rowId} to support multi-driver parallel execution in HashJoin.
  2. Multi-driver support in HashJoin

    • Each driver builds its own HybridContainer.
    • After table merge, the allContainers_ map enables cross-container payload extraction during the probe phase.
  3. Extraction optimizations

    • coalesceBatches() flattens multiple payload batches into a single contiguous batch to reduce TLB misses during extraction.
    • sortByContainerId() reorders rows by containerId before extraction to improve cache locality in multi-container scenarios.
    • Prefetching during extraction to hide memory latency and reduce data loading time.
    • isSingleContainer() provides a fast path that skips sorting overhead in the single-driver scenario.
  4. Configuration options

    • hybrid_join_enabled / hybrid_sort_enabled to opt in to hybrid execution.
    • hybrid_join_reorder_enabled to control row reordering (disabled in tests to preserve deterministic output).
    • hybrid_join_scattered_mode_enabled / hybrid_sort_scattered_mode_enabled to select the reconstruction method. Scattered mode requires less memory but may slow down reconstruction because of cache/TLB misses.
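
To illustrate the reordering idea behind sortByContainerId() and the isSingleContainer() fast path, here is a minimal sketch. It is not the code from this PR: RowRef, PayloadColumn, and extractPayload are hypothetical names, and each container's payload is simplified to a single int64 column. Note that reordering changes the output row order, which is consistent with the description's point that it is disabled in tests.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical reference to one payload row: which container, which row.
struct RowRef {
  uint8_t containerId;
  uint64_t rowId;
};

// Hypothetical stand-in for one driver's payload: a single int64 column.
using PayloadColumn = std::vector<int64_t>;

// True when every reference points into the same container (or refs is empty).
bool isSingleContainer(const std::vector<RowRef>& refs) {
  return std::all_of(refs.begin(), refs.end(), [&](const RowRef& r) {
    return r.containerId == refs.front().containerId;
  });
}

// Gathers payload values for the given references. When reordering is
// enabled and more than one container is involved, references are first
// grouped by containerId so that each container is read in one run.
std::vector<int64_t> extractPayload(
    std::vector<RowRef> refs, // taken by value because it may be reordered
    const std::vector<PayloadColumn>& containers,
    bool reorderEnabled) {
  if (reorderEnabled && !isSingleContainer(refs)) {
    std::stable_sort(
        refs.begin(), refs.end(), [](const RowRef& a, const RowRef& b) {
          return a.containerId < b.containerId;
        });
  }
  std::vector<int64_t> out;
  out.reserve(refs.size());
  for (const auto& r : refs) {
    assert(r.containerId < containers.size());
    assert(r.rowId < containers[r.containerId].size());
    out.push_back(containers[r.containerId][r.rowId]);
  }
  return out;
}
```

With reorderEnabled set to false the output follows the input order, which is the deterministic behavior tests typically want.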

Performance Impact

  • No Impact: This change does not affect the critical path (e.g., build system, doc, error handling).

  • Positive Impact: I have run benchmarks.

    Benchmark Results

    Production Workload Results

    Sort + Window Queries

    Queries with Sort followed by Window functions (RANK, ROW_NUMBER, LEAD/LAG).

    | Metric      | Q1    | Q2    | Q3    | Q4    | Q5    |
    |-------------|-------|-------|-------|-------|-------|
    | Total Impr. | 1.04x | 1.22x | 1.08x | 1.03x | 1.47x |
    | Sort Impr.  | 1.41x | 1.09x | 1.69x | 1.60x | 2.81x |

    Summary: Sort speedups 1.09x–2.81x; Total speedups 1.03x–1.47x.


    Dynamic Partition Insert Queries

    Queries writing to partitioned tables with Sort for partition ordering.

    | Metric      | Q1    | Q2    | Q3    | Q4    | Q5    |
    |-------------|-------|-------|-------|-------|-------|
    | Total Impr. | 1.14x | 1.15x | 1.05x | 1.16x | 1.03x |
    | Sort Impr.  | 1.50x | 2.08x | 2.19x | 2.18x | 0.93x |

    Summary: Sort speedups 1.50x–2.19x (excluding Q5 regression); Total speedups 1.03x–1.16x.


    Suboptimal Join Order Queries

    Queries with suboptimal join orders placing larger tables on the build side.

    | Metric      | Q1    | Q2    | Q3    | Q4    | Q5    |
    |-------------|-------|-------|-------|-------|-------|
    | Total Impr. | 1.01x | 1.07x | 1.06x | 1.02x | 1.05x |
    | Join Impr.  | 1.19x | 1.53x | 1.04x | 1.05x | 0.97x |

    Summary: Join speedups 1.04x–1.53x (excluding Q5 regression); Total speedups 1.01x–1.07x.

  • Negative Impact: Explained below (e.g., trade-off for correctness).

Release Note

Release Note:
- Add a hybrid execution model for HashJoin and Sort.
- Add configuration options for the extraction methods of the hybrid model.
- Experiments on production workloads show up to roughly 1.5x end-to-end improvement.

Checklist (For Author)

  • I have added/updated unit tests (ctest).
  • I have verified the code with local build (Release/Debug).
  • I have run clang-format / linters.
  • (Optional) I have run Sanitizers (ASAN/TSAN) locally for complex C++ changes.
  • No need to test or manual test.

Breaking Changes

  • No

  • Yes (Description: ...)


@CLAassistant

CLAassistant commented Jan 14, 2026

CLA assistant check
All committers have signed the CLA.

@ruochenj123 ruochenj123 changed the title Hybrid layout design for HashJoin/Sort [WIP] Hybrid layout design for HashJoin/Sort Jan 14, 2026
@markjin1990 markjin1990 added the performance performance improvement needed label Jan 14, 2026
@markjin1990 markjin1990 requested a review from fzhedu January 14, 2026 21:35
@ruochenj123 ruochenj123 force-pushed the hybrid-design branch 2 times, most recently from 03120aa to dfe7dce Compare January 15, 2026 15:11
@markjin1990 markjin1990 changed the title [WIP] Hybrid layout design for HashJoin/Sort Hybrid layout design for HashJoin/Sort Jan 20, 2026
Comment thread bolt/exec/HashBuild.cpp Outdated
  << ", spill enabled: " << spillEnabled()
- << ", maxHashTableSize = " << maxHashTableBucketCount_;
+ << ", maxHashTableSize = " << maxHashTableBucketCount_
+ << ", hybrid mode " << (hybridJoin_ ? "enabled" : "disbaled");
Collaborator

Typo: "disbaled" should be "disabled".

Comment thread bolt/exec/HashBuild.cpp
if (hybridJoin_) {
BOLT_CHECK_LE(
driverId_,
255,
Collaborator

Why is the limit hardcoded to 255?

Contributor Author

Currently we store the rowId as a BIGINT (64 bits), where the top 8 bits represent the driverId and the remaining 56 bits represent the rowId within each driver. So the largest driverId it can represent is 255. Maybe we can make this configurable.
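
For readers following the thread, the 8-bit/56-bit split described above can be sketched as below. This is an illustration only: encodeRowId, decodeDriverId, decodeRowId and the constants are hypothetical names, and the sketch uses uint64_t rather than a signed BIGINT to keep the shifts well defined.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout, per the comment above: top 8 bits = driverId,
// low 56 bits = per-driver rowId.
constexpr int kRowIdBits = 56;
constexpr uint64_t kRowIdMask = (uint64_t{1} << kRowIdBits) - 1;
constexpr uint64_t kMaxDriverId = 255; // largest value that fits in 8 bits

// Packs a driverId and a per-driver rowId into one 64-bit value.
inline uint64_t encodeRowId(uint64_t driverId, uint64_t rowId) {
  assert(driverId <= kMaxDriverId);
  assert(rowId <= kRowIdMask);
  return (driverId << kRowIdBits) | rowId;
}

// Recovers the driverId from the top 8 bits.
inline uint64_t decodeDriverId(uint64_t encoded) {
  return encoded >> kRowIdBits;
}

// Recovers the per-driver rowId from the low 56 bits.
inline uint64_t decodeRowId(uint64_t encoded) {
  return encoded & kRowIdMask;
}
```

Making the split configurable, as suggested, would amount to turning kRowIdBits into a runtime parameter, at the cost of a smaller per-driver row range for every extra driver bit.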

Comment thread bolt/exec/RowContainer.h
const T* rawValues = flatChild->rawValues();
const uint64_t* rawNulls = flatChild->rawNulls();

constexpr vector_size_t kPrefetchDist = 16;
Collaborator

Does the magic number 16 suit all architectures, e.g. x86 and ARM?

Contributor Author

I'm not sure. I tuned it on an x86_64 machine, where k = 8–32 performed similarly.
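
As context for the prefetch-distance discussion, the pattern in question looks roughly like the sketch below. It is a simplified illustration, not the PR's code: gatherWithPrefetch is a hypothetical name, the payload is a plain int64 vector, and __builtin_prefetch is the GCC/Clang builtin (guarded so the code still compiles elsewhere). The value 16 is taken from the snippet under review.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Prefetch distance in rows, as in the snippet under review.
constexpr std::size_t kPrefetchDist = 16;

// Gathers values[rowIds[i]] into a dense output. While copying row i, it
// asks the CPU to start loading the value needed kPrefetchDist rows later,
// so the memory latency overlaps with the copies in between.
std::vector<int64_t> gatherWithPrefetch(
    const std::vector<int64_t>& values,
    const std::vector<uint32_t>& rowIds) {
  std::vector<int64_t> out(rowIds.size());
  for (std::size_t i = 0; i < rowIds.size(); ++i) {
    if (i + kPrefetchDist < rowIds.size()) {
#if defined(__GNUC__) || defined(__clang__)
      // Arguments: address, 0 = read access, 0 = low temporal locality.
      __builtin_prefetch(&values[rowIds[i + kPrefetchDist]], 0, 0);
#endif
    }
    out[i] = values[rowIds[i]];
  }
  return out;
}
```

The best distance depends on memory latency and on how much work each iteration does, so it would need to be re-measured on ARM rather than assumed from the x86_64 tuning.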

Comment thread bolt/exec/HashBuild.cpp Outdated
Comment thread bolt/core/QueryConfig.h Outdated
Comment thread bolt/exec/RowContainer.h Outdated
Comment thread bolt/exec/HashBuild.cpp Outdated
Comment thread bolt/exec/SortBuffer.cpp
Comment thread bolt/exec/SortBuffer.cpp Outdated
Comment thread bolt/exec/SortBuffer.cpp Outdated
Comment thread bolt/exec/OperatorUtils.cpp
Comment thread bolt/exec/RowContainer.cpp Outdated
@markjin1990 markjin1990 changed the title Hybrid layout design for HashJoin/Sort [feat] Hybrid layout design for HashJoin/Sort Mar 31, 2026
Comment thread .gitignore Outdated
Comment thread .gitignore Outdated
@markjin1990 markjin1990 added the enhancement New feature or request label Apr 1, 2026
@markjin1990 markjin1990 changed the title [feat] Hybrid layout design for HashJoin/Sort [feat] Hybrid layout for HashJoin/Sort Apr 1, 2026
Comment thread bolt/exec/RowContainer.h

const auto rid = rowIdPtr[idx].rowId_;
if (rawNulls != nullptr && bits::isBitNull(rawNulls, rid)) {
bits::setNull(nulls, resultIndex, true);
Collaborator

isBitNull and setNull are not very efficient here. How about setting 4 or 8 bits at a time?

bits::setNull(nulls, index, byte, mask)  // add a new interface?

We could then use only bitwise OR/AND and eliminate this if branch.

Contributor Author

Yeah, we can set the nulls for 4 or 8 bits (rows) at a time by modifying a uint8_t. But the if branch also lets us skip the value copy when the row is null, so I'm not sure we can really eliminate it.

Collaborator

Yeah, it is not easy to eliminate the if branch.
We could use a gather instruction to fetch the scattered null bytes and then movemask them into an 8-bit result.
But the gather instruction shows no obvious advantage, so it is OK to keep the if branch.
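
For reference, the "set 8 bits at a time" idea from this thread could look roughly like the scalar sketch below. It is not part of the PR and does not use the bits:: helpers: validBit and gatherValidity are hypothetical names, and the sketch assumes a bitmap in which a set bit means the row is valid and a cleared bit means null. It handles only the null bitmap; the value copy, which the existing if branch also guards, is left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed convention for this sketch: bit = 1 means valid, bit = 0 means null.
inline uint8_t validBit(const uint8_t* bitmap, std::size_t row) {
  return (bitmap[row >> 3] >> (row & 7)) & 1;
}

// Builds the result validity bitmap for the gathered rows. Full groups of
// eight output rows are packed into one byte and stored with a single
// write; the source bits are combined with shifts and ORs, without a
// per-row branch.
void gatherValidity(
    const uint8_t* srcBitmap,
    const std::vector<uint32_t>& rowIds,
    std::vector<uint8_t>& dstBitmap) {
  const std::size_t n = rowIds.size();
  dstBitmap.assign((n + 7) / 8, 0);
  std::size_t i = 0;
  for (; i + 8 <= n; i += 8) {
    uint8_t packed = 0;
    for (int b = 0; b < 8; ++b) {
      packed |= static_cast<uint8_t>(validBit(srcBitmap, rowIds[i + b]) << b);
    }
    dstBitmap[i >> 3] = packed;
  }
  // Tail: fewer than eight rows remain, so set their bits one by one.
  for (; i < n; ++i) {
    dstBitmap[i >> 3] |=
        static_cast<uint8_t>(validBit(srcBitmap, rowIds[i]) << (i & 7));
  }
}
```

Whether this beats the branchy version depends on null density and on how the value copy is handled, which matches the conclusion above that keeping the if branch is acceptable.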

@kexianda
Collaborator

kexianda commented Apr 2, 2026

@ruochenj123 I have completed the code review. The design and implementation look fine to me, and I have left some minor comments.

@markjin1990 markjin1990 added this pull request to the merge queue Apr 8, 2026
Merged via the queue into bytedance:main with commit 537fbdf Apr 8, 2026
7 checks passed

Labels

enhancement (New feature or request), performance (performance improvement needed)

5 participants