[python] Split row chunks that overflow the 2GB per-column limit on read by TheR1sing3un · Pull Request #8243 · apache/paimon

TheR1sing3un · 2026-06-15T11:02:19Z

Purpose

Reading a table with a very large STRING/BYTES column can crash with
TypeError: Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array.

A row chunk (chunk_size = 65536) can exceed the 2GB per-column limit of
pyarrow.string() / pyarrow.binary(), which use 32-bit offsets. When that
happens pyarrow.array() returns a ChunkedArray, and a single RecordBatch
cannot hold a ChunkedArray, so RecordBatch.from_pydict fails on the
non-Arrow-native (row-based) read path.

Reproduced on current master with a single ~2.1GB string column:

import pyarrow as pa
n, blob = 2100, "a" * (1024 * 1024)   # 2100 MiB > 2048 MiB
rows = [(i, blob) for i in range(n)]
schema = pa.schema([("id", pa.int64()), ("payload", pa.string())])
pydict = {name: list(col) for name, col in zip(schema.names, zip(*rows))}
pa.RecordBatch.from_pydict(pydict, schema=schema)
# TypeError: Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array

This PR turns the row-to-batch helpers into generators that build each column
array, detect the overflow (a column coming back as a ChunkedArray), and
recursively split the rows in half so every emitted RecordBatch keeps each
column under the 2GB limit. A single row that still overflows raises a clear
ValueError instead of recursing forever. Both the serial
(_arrow_batch_generator) and parallel (_read_one_split_to_batches) read
paths are updated, and the small-chunk common case still emits exactly one
batch.

Tests

Added paimon-python/pypaimon/tests/table_read_chunked_overflow_test.py, which
patches pyarrow.array to simulate auto-chunking past a small threshold (so no
real 2GB allocation is needed) and asserts:

an oversized chunk is split into multiple single-Array batches with
data/order preserved, both with and without the _row_kind column;
a below-threshold chunk still produces a single batch;
a single row that still overflows raises ValueError;
the static convert_rows_to_arrow_batches helper splits the same way.

Reading a table with a very large STRING/BYTES column could crash with "Cannot convert ChunkedArray to Array". A 65536-row chunk can exceed the 2GB per-column limit of pyarrow.string()/binary() (32-bit offsets), in which case pyarrow.array() returns a ChunkedArray that a single RecordBatch cannot hold. Convert the row-to-batch helpers into generators that detect the overflow (a column coming back as a ChunkedArray) and recursively split the rows in half, so every emitted RecordBatch keeps each column under the limit. A single row that still overflows raises a clear error instead of recursing forever. Both the serial and parallel read paths are updated accordingly.

TheR1sing3un force-pushed the python-split-2gb-column-chunks branch from 79bbb69 to 2bd46d7 Compare June 15, 2026 12:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[python] Split row chunks that overflow the 2GB per-column limit on read#8243

[python] Split row chunks that overflow the 2GB per-column limit on read#8243
TheR1sing3un wants to merge 1 commit into
apache:masterfrom
TheR1sing3un:python-split-2gb-column-chunks

TheR1sing3un commented Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

TheR1sing3un commented Jun 15, 2026

Purpose

Tests

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant