Is your feature request related to a problem or challenge?
Suppose we have A semijoin B, and the optimizer chooses to implement that as as a RightSemi join. That means we build B and stream A, doing lookups against the B hash table. The current implementation does the following:
- For a given A value, we lookup all the matching B values, walking the hash chain and checking for equality.
- For each match, produce a
(probe_idx, build_idx) pair.
- Apply the join filter if there's a non-equijoin filter
- Remove duplicates from the list of all
probe_idx values
- Materialize the output
RecordBatch using the distinct probe_idx values
In part we do this because we share code with other join implementations, but some of this work is redundant for RightSemi joins (and RightAnti joins as well):
- We can stop once we hit the first matching B value, rather than walking the rest of the chain
- We never need
build_idx values
- We don't need to remove duplicates from the
probe_idx array, since we never produced them in the first place
The win is biggest for semijoins where many build values match each probe value, but it should still be a win for lower fan-out semi-joins.
We could go further and specialize the hash table for semijoins: we don't need to keep duplicate build-side values in the hash table in the first place, the visited bitmap, and so on. But just optimizing the join algorithm itself should get us most of the win.
Describe the solution you'd like
No response
Describe alternatives you've considered
No response
Additional context
No response
Is your feature request related to a problem or challenge?
Suppose we have A semijoin B, and the optimizer chooses to implement that as as a
RightSemijoin. That means we build B and stream A, doing lookups against the B hash table. The current implementation does the following:(probe_idx, build_idx)pair.probe_idxvaluesRecordBatchusing the distinctprobe_idxvaluesIn part we do this because we share code with other join implementations, but some of this work is redundant for
RightSemijoins (andRightAntijoins as well):build_idxvaluesprobe_idxarray, since we never produced them in the first placeThe win is biggest for semijoins where many build values match each probe value, but it should still be a win for lower fan-out semi-joins.
We could go further and specialize the hash table for semijoins: we don't need to keep duplicate build-side values in the hash table in the first place, the
visitedbitmap, and so on. But just optimizing the join algorithm itself should get us most of the win.Describe the solution you'd like
No response
Describe alternatives you've considered
No response
Additional context
No response