-
Notifications
You must be signed in to change notification settings - Fork 412
Support right outer/semi/anti join in hash join v2 #10227
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: gengliqi <gengliqiii@gmail.com>
Signed-off-by: gengliqi <gengliqiii@gmail.com>
Signed-off-by: gengliqi <gengliqiii@gmail.com>
Signed-off-by: gengliqi <gengliqiii@gmail.com>
Signed-off-by: gengliqi <gengliqiii@gmail.com>
Signed-off-by: gengliqi <gengliqiii@gmail.com>
Signed-off-by: gengliqi <gengliqiii@gmail.com>
Signed-off-by: gengliqi <gengliqiii@gmail.com>
Signed-off-by: gengliqi <gengliqiii@gmail.com>
Signed-off-by: gengliqi <gengliqiii@gmail.com>
Signed-off-by: gengliqi <gengliqiii@gmail.com>
Signed-off-by: gengliqi <gengliqiii@gmail.com>
Signed-off-by: gengliqi <gengliqiii@gmail.com>
Signed-off-by: gengliqi <gengliqiii@gmail.com>
… right semi/anti join Signed-off-by: gengliqi <gengliqiii@gmail.com>
Signed-off-by: gengliqi <gengliqiii@gmail.com>
Signed-off-by: gengliqi <gengliqiii@gmail.com>
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: guo-shaoge The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
[LGTM Timeline notifier]Timeline:
|
Signed-off-by: gengliqi <gengliqiii@gmail.com>
| return {}; | ||
| } | ||
|
|
||
| bool HashJoin::needProbeScanBuildSide() const |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe name it as needScanBuildSideAfterProbe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| RowPtrs row_ptrs; | ||
|
|
||
| IColumn::Selector right_semi_selector; | ||
| BlockSelective right_semi_offsets; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe rename to right_semi_selective?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| if (need_row_data) | ||
| break; | ||
| } | ||
| for (auto [column_index, _] : join->row_layout.other_column_indexes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for rightsemi and rightanti join, only column in other_column_for_other_conditioin is saved in row data, so why check other_columns_index instead of other_column_for_other_conditioin?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch. Should check other_column_for_other_conditioin here. Fixed.
| break; | ||
| } | ||
|
|
||
| need_other_block_data = (kind == RightSemi || kind == RightAnti) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems (kind == RightSemi || kind == RightAnti) is always true here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. These conditions are removed.
| ColumnPtr null_map_holder; | ||
| ConstNullMapPtr null_map{}; | ||
| extractNestedColumnsAndNullMap(key_columns, null_map_holder, null_map); | ||
| resetHashJoinKeyGetter(join->method, join_key_getter, key_columns, join->row_layout); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why need do all these for a sample block?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For initializing the join_key_getter
| join->initOutputBlock(wd.scan_result_block); | ||
| for (size_t i = 0; i < output_columns; ++i) | ||
| { | ||
| auto & src_column = non_joined_non_full_block->getByPosition(i); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like in this code branch, scan_result_block is always empty, why not just swap non_joined_non_full_block with scan_result_block?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
| { | ||
| auto & src_column = non_joined_non_full_block->getByPosition(i); | ||
| auto & des_column = wd.scan_result_block.getByPosition(i); | ||
| des_column.column->assumeMutable()->insertRangeFrom(*src_column.column, 0, rows); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it looks to me the main purpose of this while loop is merge non_joined_non_full_block into a big enough block and return it? if yes, then maybe we can use vstackBlocks to do this? Since vstackBlocks can merge blocks in batch, and is more memory friendly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea. Changed to use vstackBlocks.
| if constexpr (need_row_data) | ||
| scan_block_rows += wd.insert_batch.size(); | ||
| else | ||
| scan_block_rows += wd.selective_offsets.size(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it looks to me that wd.insert_batch.size() and wd.selective_offsets.size() can be different since selective_offsets is alreays cleared in L281, but insert_batch is not always flushed, is there any potential issues here if scan_result_block need data from both insert_batch and row_data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
insert_batch is always flushed by flushInsertBatch each time scanImpl is called. BTW, these four lines of code has been deleted due to no use.
| size_t scan_size = 0; | ||
| RowContainer * container = wd.current_container; | ||
| size_t index = wd.current_container_index; | ||
| wd.selective_offsets.clear(); | ||
| wd.selective_offsets.reserve(max_block_size); | ||
| constexpr size_t insert_batch_max_size = 256; | ||
| wd.insert_batch.clear(); | ||
| wd.insert_batch.reserve(insert_batch_max_size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it looks to me that these code can be moved after the while loop in L164?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| return getRowPtrFlag(ptr) & 0x10; | ||
| } | ||
|
|
||
| inline void setRowPtrNullFlag(RowPtr ptr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is this nullflag used for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No use here. Can be used for future right outer semi/anti join.
Signed-off-by: gengliqi <gengliqiii@gmail.com>
What problem does this PR solve?
Issue Number: ref #9060
Problem Summary:
What is changed and how it works?
Check List
Tests
Side effects
Documentation
Release note