[python] Split row chunks that overflow the 2GB per-column limit on read #8243
Open
TheR1sing3un wants to merge 1 commit into
Reading a table with a very large STRING/BYTES column could crash with "Cannot convert ChunkedArray to Array". A 65536-row chunk can exceed the 2GB per-column limit of pyarrow.string()/binary() (32-bit offsets), in which case pyarrow.array() returns a ChunkedArray that a single RecordBatch cannot hold. Convert the row-to-batch helpers into generators that detect the overflow (a column coming back as a ChunkedArray) and recursively split the rows in half, so every emitted RecordBatch keeps each column under the limit. A single row that still overflows raises a clear error instead of recursing forever. Both the serial and parallel read paths are updated accordingly.
79bbb69 to 2bd46d7
Purpose
Reading a table with a very large STRING/BYTES column can crash with
`TypeError: Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array`.

A row chunk (`chunk_size = 65536`) can exceed the 2GB per-column limit of `pyarrow.string()`/`pyarrow.binary()`, which use 32-bit offsets. When that happens, `pyarrow.array()` returns a `ChunkedArray`, and a single `RecordBatch` cannot hold a `ChunkedArray`, so `RecordBatch.from_pydict` fails on the non-Arrow-native (row-based) read path.
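
To make the limit concrete, here is a rough back-of-the-envelope sketch in plain Python. It assumes the offsets are signed 32-bit integers, so the value data of one array tops out at 2^31 - 1 bytes. The helper name `chunk_overflows` is made up for illustration and is not part of pypaimon or pyarrow.

```python
# Back-of-the-envelope check of the 32-bit offset limit described above.
# Assumption: offsets are signed 32-bit, so the largest offset is 2**31 - 1.
INT32_MAX = 2**31 - 1
CHUNK_SIZE = 65536  # rows per chunk, as stated in this PR


def chunk_overflows(avg_value_bytes: int, rows: int = CHUNK_SIZE) -> bool:
    """Hypothetical helper: True if `rows` values of `avg_value_bytes`
    bytes each would not fit in one 32-bit-offset array."""
    return rows * avg_value_bytes > INT32_MAX


# Largest average value size (in bytes) that still fits in one full chunk.
MAX_AVG_VALUE_BYTES = INT32_MAX // CHUNK_SIZE

print(MAX_AVG_VALUE_BYTES)                         # 32767
print(chunk_overflows(MAX_AVG_VALUE_BYTES))        # False
print(chunk_overflows(MAX_AVG_VALUE_BYTES + 1))    # True
```

Under these assumptions, a full 65536-row chunk overflows once its values average about 32 KiB each.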
Reproduced on current `master` with a single ~2.1GB string column.

This PR turns the row-to-batch helpers into generators that build each column array, detect the overflow (a column coming back as a `ChunkedArray`), and recursively split the rows in half so every emitted `RecordBatch` keeps each column under the 2GB limit. A single row that still overflows raises a clear `ValueError` instead of recursing forever. Both the serial (`_arrow_batch_generator`) and parallel (`_read_one_split_to_batches`) read paths are updated, and the small-chunk common case still emits exactly one batch.
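
The split-in-half strategy can be sketched roughly as follows. This is not the code from the PR: `split_rows_to_batches`, `build_column`, and `is_chunked` are hypothetical names, and the toy builder with a 10-character limit stands in for `pyarrow.array()` and the real 2GB limit so the snippet runs without pyarrow. In the real read path the predicate would presumably be an `isinstance` check against `pyarrow.ChunkedArray`.

```python
from typing import Any, Callable, Dict, Iterator, List, Sequence


def split_rows_to_batches(
    rows: List[Sequence[Any]],
    names: List[str],
    build_column: Callable[[List[Any]], Any],
    is_chunked: Callable[[Any], bool],
) -> Iterator[Dict[str, Any]]:
    """Yield one column dict per batch, halving the rows on overflow."""
    if not rows:
        return
    columns = {
        name: build_column([row[i] for row in rows])
        for i, name in enumerate(names)
    }
    if not any(is_chunked(col) for col in columns.values()):
        yield columns  # common case: the whole chunk fits in one batch
        return
    if len(rows) == 1:
        # Cannot split further, so fail loudly instead of recursing forever.
        raise ValueError("a single row exceeds the per-column size limit")
    mid = len(rows) // 2
    yield from split_rows_to_batches(rows[:mid], names, build_column, is_chunked)
    yield from split_rows_to_batches(rows[mid:], names, build_column, is_chunked)


# --- Toy stand-ins (assumptions, not pyarrow) ---
class FakeChunked(list):
    """Marks a column the builder could not fit into a single array."""


TOY_LIMIT = 10  # characters per column, standing in for the 2GB limit


def toy_build(values: List[Any]) -> list:
    total = sum(len(v) for v in values if isinstance(v, str))
    return FakeChunked(values) if total > TOY_LIMIT else list(values)


def toy_is_chunked(col: Any) -> bool:
    return isinstance(col, FakeChunked)


rows = [(i, "x" * 4) for i in range(6)]
batches = list(split_rows_to_batches(rows, ["id", "payload"], toy_build, toy_is_chunked))
print([b["id"] for b in batches])  # [[0], [1, 2], [3], [4, 5]]
```

Because the two halves are yielded in order, row order is preserved across the emitted batches, and a chunk that fits is still emitted as exactly one batch.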
Tests
Added `paimon-python/pypaimon/tests/table_read_chunked_overflow_test.py`, which patches `pyarrow.array` to simulate auto-chunking past a small threshold (so no real 2GB allocation is needed) and asserts:
- … `Array` batches with data/order preserved, both with and without the `_row_kind` column;
- … `ValueError`;
- … `convert_rows_to_arrow_batches` helper splits the same way.
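
The patching idea can be illustrated with `unittest.mock`. This is only a sketch, not the test file from the PR: `fake_pa` is a stand-in namespace used in place of the real `pyarrow` module so the snippet runs without pyarrow, and `SimulatedChunked`, `make_chunking_array`, and the 16-byte threshold are invented for illustration.

```python
import types
from unittest import mock

# Stand-in for the `pyarrow` module (assumption: lets the sketch run anywhere).
fake_pa = types.SimpleNamespace(array=lambda values: list(values))

THRESHOLD_BYTES = 16  # hypothetical tiny limit standing in for 2GB


class SimulatedChunked:
    """Marks a column that the patched builder 'auto-chunked'."""

    def __init__(self, chunks):
        self.chunks = chunks


def _nbytes(value) -> int:
    if isinstance(value, str):
        return len(value.encode("utf-8"))
    if isinstance(value, bytes):
        return len(value)
    return 0


def make_chunking_array(real_array, threshold=THRESHOLD_BYTES):
    """Wrap `real_array` so oversized inputs come back 'chunked'."""

    def patched(values):
        values = list(values)
        if sum(_nbytes(v) for v in values) > threshold:
            return SimulatedChunked([real_array(values)])
        return real_array(values)

    return patched


def build(values):
    # Code under test looks up `array` at call time, so the patch is seen.
    return fake_pa.array(values)


with mock.patch.object(fake_pa, "array", make_chunking_array(fake_pa.array)):
    small = build(["ab", "cd"])            # 4 bytes: under the threshold
    large = build(["x" * 10, "y" * 10])    # 20 bytes: over the threshold

restored = build(["x" * 10, "y" * 10])     # patch removed on context exit
```

Keeping the threshold tiny means the overflow branch is exercised with a few short strings, which is the same reason the PR's test avoids a real 2GB allocation.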