fix: preserve dictionary encoding in to_arrow_batch_reader#3595

Open
avison9 wants to merge 1 commit into apache:main from avison9:fix/strict-metrics-not-equal-not-in

@avison9 avison9 commented Jul 2, 2026

Rationale for this change

DataScan.to_arrow_batch_reader(dictionary_columns=("col",)) silently dropped dictionary
encoding — callers got plain string arrays instead of dictionary<values=string, indices=int32>,
with no error or warning. The eager equivalent to_arrow(dictionary_columns=...) worked
correctly. This fixes the streaming path to match.

Before:

result = table.scan().to_arrow_batch_reader(dictionary_columns=("label",)).read_all()
print(result.schema.field("label").type)
# string  ← dictionary encoding silently lost

After:

result = table.scan().to_arrow_batch_reader(dictionary_columns=("label",)).read_all()
print(result.schema.field("label").type)
# dictionary<values=string, indices=int32>  ✓

Root cause: _to_arrow_batch_reader_via_file_scan_tasks built target_schema via
schema_to_pyarrow (which returns plain types), then called .cast(target_schema) on the
RecordBatchReader. ArrowScan.to_record_batches correctly yielded dictionary-encoded
batches, but the cast converted them back to plain types. The fix rebuilds target_schema
with pa.dictionary(pa.int32(), field.type) for each column in dictionary_columns before
the cast, so the encoding is preserved end-to-end.

Are these changes tested?

Yes. Added test_to_arrow_batch_reader_preserves_dictionary_columns in
tests/io/test_pyarrow.py — writes a two-column Parquet file and asserts that
_to_arrow_batch_reader_via_file_scan_tasks returns a dictionary-encoded label column
while leaving id as plain int32.

Are there any user-facing changes?

Yes. to_arrow_batch_reader(dictionary_columns=...) now returns dictionary-encoded
arrays for the requested columns, matching the documented behaviour and the existing to_arrow
path. Please apply the changelog label to this PR: the project does not require a changelog file
or newsfragment (there is no CONTRIBUTING.md), and release notes are generated from the titles of
labeled, merged PRs.
