Skip to content

[python] Split row chunks that overflow the 2GB per-column limit on read#8243

Open
TheR1sing3un wants to merge 1 commit into
apache:masterfrom
TheR1sing3un:python-split-2gb-column-chunks
Open

[python] Split row chunks that overflow the 2GB per-column limit on read#8243
TheR1sing3un wants to merge 1 commit into
apache:masterfrom
TheR1sing3un:python-split-2gb-column-chunks

Conversation

@TheR1sing3un

Copy link
Copy Markdown
Member

Purpose

Reading a table with a very large STRING/BYTES column can crash with
TypeError: Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array.

A row chunk (chunk_size = 65536) can exceed the 2GB per-column limit of
pyarrow.string() / pyarrow.binary(), which use 32-bit offsets. When that
happens pyarrow.array() returns a ChunkedArray, and a single RecordBatch
cannot hold a ChunkedArray, so RecordBatch.from_pydict fails on the
non-Arrow-native (row-based) read path.

Reproduced on current master with a single ~2.1GB string column:

import pyarrow as pa
n, blob = 2100, "a" * (1024 * 1024)   # 2100 MiB > 2048 MiB
rows = [(i, blob) for i in range(n)]
schema = pa.schema([("id", pa.int64()), ("payload", pa.string())])
pydict = {name: list(col) for name, col in zip(schema.names, zip(*rows))}
pa.RecordBatch.from_pydict(pydict, schema=schema)
# TypeError: Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array

This PR turns the row-to-batch helpers into generators that build each column
array, detect the overflow (a column coming back as a ChunkedArray), and
recursively split the rows in half so every emitted RecordBatch keeps each
column under the 2GB limit. A single row that still overflows raises a clear
ValueError instead of recursing forever. Both the serial
(_arrow_batch_generator) and parallel (_read_one_split_to_batches) read
paths are updated, and the small-chunk common case still emits exactly one
batch.

Tests

Added paimon-python/pypaimon/tests/table_read_chunked_overflow_test.py, which
patches pyarrow.array to simulate auto-chunking past a small threshold (so no
real 2GB allocation is needed) and asserts:

  • an oversized chunk is split into multiple single-Array batches with
    data/order preserved, both with and without the _row_kind column;
  • a below-threshold chunk still produces a single batch;
  • a single row that still overflows raises ValueError;
  • the static convert_rows_to_arrow_batches helper splits the same way.

Reading a table with a very large STRING/BYTES column could crash with
"Cannot convert ChunkedArray to Array". A 65536-row chunk can exceed the
2GB per-column limit of pyarrow.string()/binary() (32-bit offsets), in
which case pyarrow.array() returns a ChunkedArray that a single
RecordBatch cannot hold.

Convert the row-to-batch helpers into generators that detect the overflow
(a column coming back as a ChunkedArray) and recursively split the rows in
half, so every emitted RecordBatch keeps each column under the limit. A
single row that still overflows raises a clear error instead of recursing
forever. Both the serial and parallel read paths are updated accordingly.
@TheR1sing3un TheR1sing3un force-pushed the python-split-2gb-column-chunks branch from 79bbb69 to 2bd46d7 Compare June 15, 2026 12:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant