Skip to content

Optimize RightSemi, RightAnti joins with existence-probes #22854

@neilconway

Description

@neilconway

Is your feature request related to a problem or challenge?

Suppose we have A semijoin B, and the optimizer chooses to implement that as as a RightSemi join. That means we build B and stream A, doing lookups against the B hash table. The current implementation does the following:

  1. For a given A value, we lookup all the matching B values, walking the hash chain and checking for equality.
  2. For each match, produce a (probe_idx, build_idx) pair.
  3. Apply the join filter if there's a non-equijoin filter
  4. Remove duplicates from the list of all probe_idx values
  5. Materialize the output RecordBatch using the distinct probe_idx values

In part we do this because we share code with other join implementations, but some of this work is redundant for RightSemi joins (and RightAnti joins as well):

  1. We can stop once we hit the first matching B value, rather than walking the rest of the chain
  2. We never need build_idx values
  3. We don't need to remove duplicates from the probe_idx array, since we never produced them in the first place

The win is biggest for semijoins where many build values match each probe value, but it should still be a win for lower fan-out semi-joins.

We could go further and specialize the hash table for semijoins: we don't need to keep duplicate build-side values in the hash table in the first place, the visited bitmap, and so on. But just optimizing the join algorithm itself should get us most of the win.

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

Metadata

Metadata

Assignees

Labels

enhancementNew feature or request
No fields configured for Feature.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions