fix: preserve dictionary encoding in to_arrow_batch_reader by avison9 · Pull Request #3595 · apache/iceberg-python

avison9 · 2026-07-02T14:43:21Z

Rationale for this change

DataScan.to_arrow_batch_reader(dictionary_columns=("col",)) silently dropped dictionary
encoding — callers got plain string arrays instead of dictionary<values=string, indices=int32>,
with no error or warning. The eager equivalent to_arrow(dictionary_columns=...) worked
correctly. This fixes the streaming path to match.

Before:

result = table.scan().to_arrow_batch_reader(dictionary_columns=("label",)).read_all()
print(result.schema.field("label").type)
# string  ← dictionary encoding silently lost

After:

result = table.scan().to_arrow_batch_reader(dictionary_columns=("label",)).read_all()
print(result.schema.field("label").type)
# dictionary<values=string, indices=int32>  ✓

Root cause: _to_arrow_batch_reader_via_file_scan_tasks built target_schema via
schema_to_pyarrow (which returns plain types), then called .cast(target_schema) on the
RecordBatchReader. ArrowScan.to_record_batches correctly yielded dictionary-encoded
batches, but the cast converted them back to plain types. The fix rebuilds target_schema
with pa.dictionary(pa.int32(), field.type) for each column in dictionary_columns before
the cast, so the encoding is preserved end-to-end.

Are these changes tested?

Yes. Added test_to_arrow_batch_reader_preserves_dictionary_columns in
tests/io/test_pyarrow.py — writes a two-column Parquet file and asserts that
_to_arrow_batch_reader_via_file_scan_tasks returns a dictionary-encoded label column
while leaving id as plain int32.

Are there any user-facing changes?

Yes — to_arrow_batch_reader(dictionary_columns=...) now correctly returns dictionary-encoded
arrays for the requested columns, matching the documented behaviour and the existing to_arrow
path. Please apply the changelog label to this PR — the project has no changelog file/newsfragment
requirement (no CONTRIBUTING.md); release notes are generated from labeled, merged PR titles.

fix: preserve dictionary encoding in to_arrow_batch_reader

41aef66

avison9 wants to merge 1 commit into apache:main from avison9:fix/strict-metrics-not-equal-not-in

2 participants
