Conversation

Contributor

@gengliqi gengliqi commented Jun 6, 2025

What problem does this PR solve?

Issue Number: ref #9060

Problem Summary:

What is changed and how it works?

Support right outer/semi/anti join in hash join v2
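For readers unfamiliar with right semi/anti joins on a hash table built from the right side, here is a minimal sketch of one common approach: probing only marks matched build rows, and a scan over the build side afterwards produces the output. All type and function names below (`RightJoinKind`, `BuildRow`, `rightSemiOrAntiJoin`) are hypothetical and are not taken from this PR; the real implementation may work differently.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical toy types; none of these names come from the PR.
enum class RightJoinKind { RightOuter, RightSemi, RightAnti };

struct BuildRow
{
    int key;
    std::string payload;
    bool matched = false; // set during probe, read during the post-probe scan
};

// Returns the payloads of the build (right) rows selected by a right semi
// or right anti join against the probe (left) keys.
std::vector<std::string> rightSemiOrAntiJoin(
    std::vector<BuildRow> build_rows,
    const std::vector<int> & probe_keys,
    RightJoinKind kind)
{
    assert(kind != RightJoinKind::RightOuter); // outer join output is not modeled here

    // Build phase: index build rows by key.
    std::unordered_multimap<int, std::size_t> table;
    for (std::size_t i = 0; i < build_rows.size(); ++i)
        table.emplace(build_rows[i].key, i);

    // Probe phase: only mark matches, emit nothing yet.
    for (int key : probe_keys)
    {
        auto [first, last] = table.equal_range(key);
        for (auto it = first; it != last; ++it)
            build_rows[it->second].matched = true;
    }

    // Scan phase: walk the build side once after probing finishes.
    const bool want_matched = (kind == RightJoinKind::RightSemi);
    std::vector<std::string> result;
    for (const auto & row : build_rows)
        if (row.matched == want_matched)
            result.push_back(row.payload);
    return result;
}
```

In this sketch each build row is emitted at most once no matter how many probe rows match it, which is the property that makes a post-probe scan a natural fit for semi and anti joins.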

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

None

gengliqi added 30 commits March 6, 2025 21:44
Signed-off-by: gengliqi <gengliqiii@gmail.com>
gengliqi added 9 commits June 7, 2025 16:04
… right semi/anti join
Contributor

ti-chi-bot bot commented Aug 8, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: guo-shaoge

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the needs-1-more-lgtm (Indicates a PR needs 1 more LGTM.) and approved labels Aug 8, 2025
Contributor

ti-chi-bot bot commented Aug 8, 2025

[LGTM Timeline notifier]

Timeline:

  • 2025-08-08 02:23:16 UTC: ☑️ agreed by guo-shaoge.

return {};
}

bool HashJoin::needProbeScanBuildSide() const
Contributor

Maybe name it as needScanBuildSideAfterProbe?

Contributor Author

Done
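As an illustration of what a predicate with the suggested name might express, here is a small sketch. The `JoinKind` enum and the exact set of kinds that return true are assumptions based on the PR title (right outer/semi/anti); the actual member function in the PR may check different conditions.

```cpp
// Hypothetical join kinds; the real enum in the code base may differ.
enum class JoinKind { Inner, LeftOuter, RightOuter, RightSemi, RightAnti };

// Assumed semantics: true when build-side rows must be visited once more
// after all probe input has been consumed.
constexpr bool needScanBuildSideAfterProbe(JoinKind kind)
{
    return kind == JoinKind::RightOuter
        || kind == JoinKind::RightSemi
        || kind == JoinKind::RightAnti;
}
```

The name reads as a question about the phase that follows probing, which is arguably clearer than needProbeScanBuildSide.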

RowPtrs row_ptrs;

IColumn::Selector right_semi_selector;
BlockSelective right_semi_offsets;
Contributor

Maybe rename to right_semi_selective?

Contributor Author

Done

    if (need_row_data)
        break;
}
for (auto [column_index, _] : join->row_layout.other_column_indexes)
Contributor

For right semi and right anti join, only the columns in other_column_for_other_condition are saved in row data, so why check other_column_indexes instead of other_column_for_other_condition?

Contributor Author

Nice catch. It should check other_column_for_other_condition here. Fixed.

break;
}

need_other_block_data = (kind == RightSemi || kind == RightAnti)
Contributor

It seems that (kind == RightSemi || kind == RightAnti) is always true here?

Contributor Author

Yes. These conditions are removed.

ColumnPtr null_map_holder;
ConstNullMapPtr null_map{};
extractNestedColumnsAndNullMap(key_columns, null_map_holder, null_map);
resetHashJoinKeyGetter(join->method, join_key_getter, key_columns, join->row_layout);
Contributor

Why do we need to do all of this for a sample block?

Contributor Author

For initializing the join_key_getter

join->initOutputBlock(wd.scan_result_block);
for (size_t i = 0; i < output_columns; ++i)
{
    auto & src_column = non_joined_non_full_block->getByPosition(i);
Contributor

It looks like scan_result_block is always empty in this code branch, so why not just swap non_joined_non_full_block with scan_result_block?

Contributor Author

Fixed.
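The idea behind the suggestion can be sketched as follows: when the destination holds no rows, swapping hands over the column buffers without copying any elements. `ToyBlock` and `moveRowsInto` are hypothetical stand-ins and not the Block API used in the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using ToyBlock = std::vector<std::vector<int>>; // hypothetical block: a list of columns

// Hand the rows in `src` over to `dst`. When `dst` holds no rows, a swap
// transfers the buffers without copying elements; otherwise append and clear.
void moveRowsInto(ToyBlock & dst, ToyBlock & src)
{
    bool dst_empty = true;
    for (const auto & column : dst)
        dst_empty = dst_empty && column.empty();

    if (dst_empty)
    {
        std::swap(dst, src);
        return;
    }

    assert(dst.size() == src.size()); // both blocks must have the same column count
    for (std::size_t i = 0; i < dst.size(); ++i)
    {
        dst[i].insert(dst[i].end(), src[i].begin(), src[i].end());
        src[i].clear();
    }
}
```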

{
    auto & src_column = non_joined_non_full_block->getByPosition(i);
    auto & des_column = wd.scan_result_block.getByPosition(i);
    des_column.column->assumeMutable()->insertRangeFrom(*src_column.column, 0, rows);
Contributor

It looks to me that the main purpose of this while loop is to merge non_joined_non_full_block into a big enough block and return it. If so, maybe we can use vstackBlocks for this, since vstackBlocks can merge blocks in batch and is more memory friendly.

Contributor Author

Good idea. Changed to use vstackBlocks.
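To illustrate why merging in batch can be more memory friendly than appending one block at a time, here is a toy sketch that computes the total row count first and reserves each output column once. `ToyColumn`, `ToyBlock`, and `stackBlocksVertically` are hypothetical; this is not the signature or implementation of vstackBlocks.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a block: a list of equally long integer columns.
using ToyColumn = std::vector<int>;
using ToyBlock = std::vector<ToyColumn>;

// Concatenate blocks row-wise, sizing each output column exactly once.
ToyBlock stackBlocksVertically(const std::vector<ToyBlock> & blocks)
{
    if (blocks.empty())
        return {};

    const std::size_t column_count = blocks.front().size();
    std::size_t total_rows = 0;
    for (const auto & block : blocks)
    {
        assert(block.size() == column_count); // all blocks share one schema
        total_rows += block.empty() ? 0 : block.front().size();
    }

    ToyBlock result(column_count);
    for (std::size_t c = 0; c < column_count; ++c)
    {
        result[c].reserve(total_rows); // one allocation per column
        for (const auto & block : blocks)
            result[c].insert(result[c].end(), block[c].begin(), block[c].end());
    }
    return result;
}
```

Appending block by block can trigger repeated reallocations as a column grows, while the batched form knows the final size up front.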

if constexpr (need_row_data)
    scan_block_rows += wd.insert_batch.size();
else
    scan_block_rows += wd.selective_offsets.size();
Contributor

It looks to me that wd.insert_batch.size() and wd.selective_offsets.size() can differ, since selective_offsets is always cleared in L281 but insert_batch is not always flushed. Is there a potential issue here if scan_result_block needs data from both insert_batch and row_data?

Contributor Author

insert_batch is always flushed by flushInsertBatch each time scanImpl is called. BTW, these four lines of code have been deleted because they were unused.

Comment on lines 152 to 159
size_t scan_size = 0;
RowContainer * container = wd.current_container;
size_t index = wd.current_container_index;
wd.selective_offsets.clear();
wd.selective_offsets.reserve(max_block_size);
constexpr size_t insert_batch_max_size = 256;
wd.insert_batch.clear();
wd.insert_batch.reserve(insert_batch_max_size);
Contributor

It looks to me that this code can be moved after the while loop in L164?

Contributor Author

Done

return getRowPtrFlag(ptr) & 0x10;
}

inline void setRowPtrNullFlag(RowPtr ptr)
Contributor

What is this null flag used for?

Contributor Author

@gengliqi gengliqi Nov 20, 2025

It is not used here. It can be used for a future right outer semi/anti join.
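For context, per-row flags like this are often kept as bits in a small header byte of each stored row. The sketch below assumes such a layout: the flag byte position, the `ROW_USED_FLAG` bit, and the helper bodies are all assumptions, with only the 0x10 mask and the names getRowPtrFlag and setRowPtrNullFlag taken from the snippet above. Code that sets flags from several probe threads would likely need atomic operations, which this sketch omits.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical layout: the first byte of every stored row holds flag bits.
using RowPtr = std::uint8_t *;

constexpr std::uint8_t ROW_USED_FLAG = 0x01; // assumed bit: row was matched during probe
constexpr std::uint8_t ROW_NULL_FLAG = 0x10; // mirrors the 0x10 mask in the snippet

inline std::uint8_t getRowPtrFlag(RowPtr ptr)
{
    assert(ptr != nullptr);
    return *ptr;
}

inline bool hasRowPtrUsedFlag(RowPtr ptr)
{
    return (getRowPtrFlag(ptr) & ROW_USED_FLAG) != 0;
}

inline bool hasRowPtrNullFlag(RowPtr ptr)
{
    return (getRowPtrFlag(ptr) & ROW_NULL_FLAG) != 0;
}

inline void setRowPtrUsedFlag(RowPtr ptr)
{
    *ptr = static_cast<std::uint8_t>(getRowPtrFlag(ptr) | ROW_USED_FLAG);
}

inline void setRowPtrNullFlag(RowPtr ptr)
{
    *ptr = static_cast<std::uint8_t>(getRowPtrFlag(ptr) | ROW_NULL_FLAG);
}
```

Because the two flags occupy different bits, setting one leaves the other untouched.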

Labels

  • approved
  • needs-1-more-lgtm: Indicates a PR needs 1 more LGTM.
  • release-note-none: Denotes a PR that doesn't merit a release note.
  • size/XXL: Denotes a PR that changes 1000+ lines, ignoring generated files.

3 participants