fix(arrow): reconcile evolved nested structs by field id, not position#2647

Open
jordepic wants to merge 1 commit into
apache:mainfrom
jordepic:fix/2617-nested-struct-field-id

Conversation

@jordepic

What changes are included in this PR?

When a table's struct (or nested list/map) column has gained fields through schema evolution, reading data files written under the older schema fails with an Arrow cast error such as Cast error: Casting from Utf8 to Struct(...). The record-batch transformer reconciles a file's nested children to the table schema by position within the struct rather than by Iceberg field id, so once a nested struct gains a field, the children no longer line up and a mismatched cast is attempted (e.g. casting a string child into a struct slot). The files themselves are valid and are read correctly by Iceberg-Java/Spark.

For example, if a struct evolves from (a, c) to (a, b, c), reading an old file that contains only a and c attempts to cast c to the type of b.
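
The a, c to a, b, c case can be sketched without Arrow at all. The snippet below is a toy model, not code from this PR: Child, pair_by_position, pair_by_id, old_file and evolved_table are hypothetical names, and the field ids and type names (int, string, struct) are assumptions chosen to reproduce the Utf8-to-Struct collision described above.

```rust
/// Hypothetical toy model of a struct child: Iceberg field id, name, type name.
#[derive(Debug, Clone, PartialEq)]
struct Child {
    id: i32,
    name: &'static str,
    ty: &'static str,
}

/// Positional pairing: the i-th file child is paired with the i-th table child.
fn pair_by_position<'a>(file: &'a [Child], table: &'a [Child]) -> Vec<(&'a Child, &'a Child)> {
    file.iter().zip(table.iter()).collect()
}

/// Field-id pairing: each table child looks up the file child with the same id.
/// `None` means the file predates the field, so it would be filled with NULLs.
fn pair_by_id<'a>(file: &'a [Child], table: &'a [Child]) -> Vec<(Option<&'a Child>, &'a Child)> {
    table
        .iter()
        .map(|t| (file.iter().find(|f| f.id == t.id), t))
        .collect()
}

/// Struct as written in the old data file: only a and c (ids are assumed).
fn old_file() -> Vec<Child> {
    vec![
        Child { id: 1, name: "a", ty: "int" },
        Child { id: 2, name: "c", ty: "string" },
    ]
}

/// Struct in the current table schema: b was added later, so it has a newer id.
fn evolved_table() -> Vec<Child> {
    vec![
        Child { id: 1, name: "a", ty: "int" },
        Child { id: 3, name: "b", ty: "struct" },
        Child { id: 2, name: "c", ty: "string" },
    ]
}
```

With these inputs, pair_by_position pairs the file's c (a string) with the table's b (a struct), which is the mismatched cast, while pair_by_id pairs c with c and reports b as absent from the file.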

This change replaces the flat cast with promote_array_to_target, which walks the target type and matches nested struct children by PARQUET:field_id, fills fields absent from the file with typed NULLs, and recurses through list/large-list/map. Primitives still use cast for valid Iceberg promotions. This mirrors iceberg-java's by-field-id nested readers.
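
As a rough illustration of that recursion, here is a minimal sketch over a toy type and value model rather than real Arrow arrays. It is not the implementation in this PR: Ty, Field, Value and promote are hypothetical names, map and large-list are omitted, int-to-long stands in for the primitive promotions, and an untyped Null stands in for the typed NULL array a real reader would build.

```rust
/// Hypothetical stand-in for an Arrow data type; struct children carry a field id.
#[derive(Debug, Clone, PartialEq)]
enum Ty {
    Int,
    Long,
    Utf8,
    List(Box<Ty>),
    Struct(Vec<Field>),
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    id: i32,
    ty: Ty,
}

/// Hypothetical stand-in for one value of an Arrow array.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Int(i32),
    Long(i64),
    Utf8(String),
    List(Vec<Value>),
    /// Children in the same order as the accompanying `Ty::Struct`.
    Struct(Vec<Value>),
}

/// Reshape `value` (laid out as `source`) into the layout of `target`.
fn promote(value: &Value, source: &Ty, target: &Ty) -> Result<Value, String> {
    match (value, source, target) {
        (Value::Null, _, _) => Ok(Value::Null),
        // Structs: walk the target children and look each one up by field id.
        (Value::Struct(children), Ty::Struct(src), Ty::Struct(dst)) => {
            let mut out = Vec::with_capacity(dst.len());
            for want in dst {
                match src.iter().position(|f| f.id == want.id) {
                    Some(i) => {
                        let child = children
                            .get(i)
                            .ok_or_else(|| "struct value shorter than its type".to_string())?;
                        out.push(promote(child, &src[i].ty, &want.ty)?);
                    }
                    // Field added after the file was written: fill with NULL.
                    None => out.push(Value::Null),
                }
            }
            Ok(Value::Struct(out))
        }
        // Lists: recurse into the element type.
        (Value::List(items), Ty::List(src), Ty::List(dst)) => {
            let items: Result<Vec<_>, _> = items.iter().map(|v| promote(v, src, dst)).collect();
            Ok(Value::List(items?))
        }
        // Primitives: one widening promotion, standing in for the flat cast.
        (Value::Int(v), Ty::Int, Ty::Long) => Ok(Value::Long(i64::from(*v))),
        (v, s, t) if s == t => Ok(v.clone()),
        (_, s, t) => Err(format!("cannot promote {:?} to {:?}", s, t)),
    }
}
```

Calling promote on a two-child struct value (ids 1 and 2) against a three-child target whose middle child has a new id yields a three-child struct with NULL in the middle slot, and the original children stay attached to their own ids instead of shifting.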

Are these changes tested?

Yes. Unit tests are included to verify that nested fields present in the table schema but absent from the data file are reconciled correctly.

@jordepic jordepic force-pushed the fix/2617-nested-struct-field-id branch from 8bcdd6d to e33ca27 Compare June 14, 2026 21:20
When a struct gains a field (schema evolution) after a data file was written,
the file's struct has fewer children than the table schema. RecordBatchTransformer
promoted such columns with arrow_cast::cast, which matches struct/list/map children
positionally -- so every child after the gap shifts and types collide, e.g.
'Casting from Utf8 to Struct' when a scalar lands on an added list<struct>.

Replace the flat cast with cast_schema_to_target, which walks the target type and
matches nested struct children by PARQUET:field_id, filling fields absent from the
file with typed NULLs and recursing through list/large-list/map. Primitives still use
cast for valid Iceberg promotions. Mirrors iceberg-java's by-field-id nested readers.

Closes apache#2617
@jordepic jordepic force-pushed the fix/2617-nested-struct-field-id branch from e33ca27 to 5d42f1a Compare June 14, 2026 21:47

Development

Successfully merging this pull request may close these issues.

Type casting error when reading files persisted with old schema for complex type