[Data] Propagate Schema in _shuffle_block Empty Block Case to ColumnNotFound Error in Chained Left Joins #61507
rayhhome wants to merge 6 commits into ray-project:master
Conversation
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Pull request overview
Fixes a Ray Data hash-shuffle join edge case where a first empty-row block prevents schema broadcast to aggregators, producing columnless empty tables that later trigger ColumnNotFoundError in chained left joins.
Changes:
- Update `_shuffle_block()` to broadcast schema-carrying empty tables to all aggregators when the first received block is empty and `send_empty_blocks=True`.
- Add a deterministic regression test that forces the “empty block arrives first” ordering to reproduce the chained left-join failure.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `python/ray/data/_internal/execution/operators/hash_shuffle.py` | Ensures schema propagation even when the first shuffled block has `num_rows == 0` and schema broadcast is required. |
| `python/ray/data/tests/test_join.py` | Adds a regression test that deterministically reproduces and validates the fix for chained left-outer joins with empty intermediate blocks. |
Comments suppressed due to low confidence (1)
python/ray/data/_internal/execution/operators/hash_shuffle.py:295
- In the `block.num_rows == 0 and send_empty_blocks` branch, this broadcasts by calling `_create_empty_table(block.schema)` and `ray.put(...)` once per partition. For large `num_partitions` this creates many identical objects in the object store and duplicates logic that already exists in the non-empty path. Consider reusing the existing partition loop by only early-returning when `send_empty_blocks` is false, or at least `ray.put()` a single schema-carrying empty table once and reuse the same `ObjectRef` for all `aggregator.submit` calls.
```python
if block.num_rows == 0:
    if send_empty_blocks:
        # NOTE: Perform the schema broadcast explicitly.
        # Every aggregator gets a (0, N) schema-carrying placeholder even when
        # the triggering block itself is empty.
        pending = []
        for partition_id in range(pool.num_partitions):
            aggregator = pool.get_aggregator_for_partition(partition_id)
            partition_ref = ray.put(_create_empty_table(block.schema))
            pending.append(
                aggregator.submit.remote(input_index, partition_id, partition_ref)
            )
        # Wait for all schema-carrier blocks to be accepted before returning.
        # This mirrors the synchronization in the normal (non-empty) path and
        # ensures aggregators hold the schema before finalize() is called.
        while pending:
            _, pending = ray.wait(pending, num_returns=len(pending), timeout=1)
```
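The reviewer's deduplication suggestion can be sketched without Ray: serialize the schema-carrying empty table into the object store once, then reuse the same reference for every partition's submit. `FakeStore`, `broadcast_empty`, and the tuple payload below are illustrative stand-ins, not Ray APIs:

```python
class FakeStore:
    """Illustrative stand-in for Ray's object store (not a Ray API)."""

    def __init__(self):
        self.objects = {}
        self.put_count = 0

    def put(self, obj):
        self.put_count += 1
        ref = f"ref-{self.put_count}"
        self.objects[ref] = obj
        return ref


def broadcast_empty(store, schema, num_partitions):
    # Put the schema-carrying empty table ONCE and reuse the same ref,
    # instead of one put per partition as in the snippet above.
    empty_table_ref = store.put(("empty-table", schema))
    return [(pid, empty_table_ref) for pid in range(num_partitions)]


store = FakeStore()
submissions = broadcast_empty(store, schema=["id", "value"], num_partitions=8)
print(store.put_count)    # 1
print(len(submissions))   # 8
```

With `ray.put` the same idea applies: a single `ObjectRef` can safely be passed to many `aggregator.submit.remote(...)` calls.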
… of schema Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Code Review
This pull request provides a well-reasoned fix for a ColumnNotFoundError that occurs during chained left joins with empty intermediate blocks. The approach of explicitly broadcasting a schema-carrying empty block to all aggregators is a solid solution to ensure schema propagation. The included regression test is excellent, as it deterministically reproduces the issue and verifies the fix. I have one suggestion to simplify the waiting logic by using ray.get() instead of a while loop with ray.wait(), which would make the code more concise and robust in handling task failures. Overall, this is a high-quality contribution.
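The suggested simplification, sketched as a diff against the waiting loop in the broadcast path (hedged: `ray.get` on a list of refs blocks until all tasks complete and re-raises any task failure, which the polling loop does not):

```diff
-        while pending:
-            _, pending = ray.wait(pending, num_returns=len(pending), timeout=1)
+        # Blocks until every submit is accepted and surfaces aggregator
+        # task failures as exceptions instead of polling indefinitely.
+        ray.get(pending)
```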
Note: Security Review did not run due to the size of the PR.
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
```python
# The first block must be the empty one so the bug fires.
joined_1 = ray.data.from_arrow([empty_block] + data_blocks)
```
I'm not sure this is true unless `ctx.execution_options.preserve_order = True`.
I traced through `from_arrow` and I think the `preserve_order` variable is not involved in its operations. I see it referenced only in `MapOperator`, `UnionOperator`, and `OutputSplitter`, but not in `InputDataBuffer`, which I believe is the only operator involved here.
```python
), f"Missing columns. Got: {result.columns.tolist()}"

# Verify per-row correctness
for _, row in result.iterrows():
```
I think it's a bit hard to read how you're verifying the output. What if you did something like `expected_table = pd.DataFrame({...})` and then asserted `result == expected_table`? I don't know the syntax off the top of my head, but that way it's easier to read the expected output.
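A sketch of the whole-frame comparison the reviewer is gesturing at, using `pandas.testing.assert_frame_equal`; the column names and values are hypothetical placeholders, and `result` is stubbed with a copy to keep the sketch self-contained:

```python
import pandas as pd

# Hypothetical expected output of the chained left join; the real test
# would build this from the known input rows.
expected = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "left_value": [10, 20, 30],
        "right_value": [None, 200.0, 300.0],
    }
)

# Stand-in for `joined.to_pandas()` in the actual test.
result = expected.copy()

# Sort by key and reset the index so row order cannot cause a spurious
# failure, then compare the entire frame in one readable assertion.
pd.testing.assert_frame_equal(
    result.sort_values("id").reset_index(drop=True),
    expected.sort_values("id").reset_index(drop=True),
)
```

This replaces the per-row loop with one assertion whose expected value is visible at a glance.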
Description
Per #60013, chained left joins fail with `ColumnNotFoundError` when the first join produces empty intermediate blocks. #60520 attempted a fix, but a refined reproduction script shows it does not resolve the underlying issue. This PR proposes a targeted fix and a deterministic regression test.

Root cause
Using the example in #60013, when the streaming executor feeds the second join's input, the first block delivered can have zero rows. The bug is then triggered through the following sequence:

1. `_do_add_input_inner` sees that this is the first block for input sequence 0 (or 1), so it submits a `_shuffle_block` task with `send_empty_blocks=True` and immediately sets `self._has_schemas_broadcasted[input_index] = True`;
2. the zero-row block in the `_shuffle_block` worker task triggers an early return of `(empty_metadata, {})`, so no `aggregator.submit()` calls are ever made and the schema never reaches any aggregator;
3. all subsequent blocks are shuffled with `send_empty_blocks=False`, so aggregators that receive no non-empty data are never contacted at all, leaving their bucket queues empty;
4. `drain_queue()` returns `[]` for those partitions, so `_combine([])` builds an `ArrowBlockBuilder` with no `add_block()` calls and produces an empty table with no columns;
5. when `JoiningAggregation.finalize()` calls `pa.Table.join()` on this columnless table, it raises the `ColumnNotFoundError` as observed.

Why #60520 does not fix this issue
#60520 modifies `ArrowBlockBuilder.build()` to use a stored `self._schema` when `len(tables) == 0`. However, `self._schema` is only populated inside `add_block()` calls. When `partition_shards` is `[]` in `_combine(...)`, `self._schema` remains `None`.

This fix
In `_shuffle_block`, when `block.num_rows == 0` and `send_empty_blocks=True`, explicitly broadcast schema-carrying empty tables to every aggregator before returning. This mirrors the broadcast logic for non-empty blocks and ensures every aggregator holds at least one schema-carrying block and thus finalizes correctly.

Related issues
Fixes #60013 and follows up on #60520.
Additional information
The original reproduction script in #60013 occasionally misses the error due to the uncertain order of blocks fed to the second join. To force the bug, add the following lines to the reproduction script:
The augmented script forces the order of blocks so that the first block going into the second join is always empty.
The new test case in `test_join.py` places the empty block first in the list fed to `from_arrow`, preserving block order and ensuring that the second join always sees the empty block first. The bug fires reliably on every run before the fix.