fix(arrow): reconcile evolved nested structs by field id, not position #2647
Open
jordepic wants to merge 1 commit into
When a struct gains a field (schema evolution) after a data file was written, the file's struct has fewer children than the table schema. RecordBatchTransformer promoted such columns with arrow_cast::cast, which matches struct/list/map children positionally -- so every child after the gap shifts and types collide, e.g. 'Casting from Utf8 to Struct' when a scalar lands on an added list<struct>.

Replace the flat cast with cast_schema_to_target, which walks the target type and matches nested struct children by PARQUET:field_id, filling fields absent from the file with typed NULLs and recursing through list/large-list/map. Primitives still use cast for valid Iceberg promotions. Mirrors iceberg-java's by-field-id nested readers.

Closes apache#2617
What changes are included in this PR?
When a table's struct (or nested list/map) column has gained fields through schema evolution, reading data files written under the older schema fails with an Arrow cast error such as Cast error: Casting from Utf8 to Struct(...). The record-batch transformer reconciles a file's nested children with the table schema by position within the struct rather than by Iceberg field id. Once a nested struct gains a field, the children no longer line up and a mismatched cast is attempted (e.g. casting a string child into a struct slot). The files themselves are valid and readable by Iceberg-Java/Spark.
For example, if a struct evolves from (a, c) to (a, b, c), reading an old file that contains only a and c tries to cast c to the type of b.
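To make the a, c to a, b, c case concrete, here is a minimal std-only sketch of the two matching strategies. The `Child` type, the field ids (1, 2, 3), and the type names are all hypothetical stand-ins chosen for illustration; this is not the arrow-rs `Field` type or the transformer's actual code.

```rust
/// Hypothetical, simplified stand-in for a struct child: an Iceberg field id,
/// a name, and a type name. This is NOT the arrow-rs `Field` type.
#[derive(Debug, Clone, PartialEq)]
struct Child {
    id: i32,
    name: &'static str,
    ty: &'static str,
}

/// Pair each target child with the file child at the same index.
/// This models the positional behaviour described above.
fn match_by_position<'a>(
    file: &'a [Child],
    target: &'a [Child],
) -> Vec<(&'a Child, Option<&'a Child>)> {
    target.iter().enumerate().map(|(i, t)| (t, file.get(i))).collect()
}

/// Pair each target child with the file child that carries the same field id.
/// Target children with no match in the file get `None`.
fn match_by_field_id<'a>(
    file: &'a [Child],
    target: &'a [Child],
) -> Vec<(&'a Child, Option<&'a Child>)> {
    target
        .iter()
        .map(|t| (t, file.iter().find(|f| f.id == t.id)))
        .collect()
}

/// Assumed example schemas: the old file has (a, c); the table now has (a, b, c).
/// The added field `b` receives a fresh id (3) even though it sits between a and c.
fn example_schemas() -> (Vec<Child>, Vec<Child>) {
    let file = vec![
        Child { id: 1, name: "a", ty: "int" },
        Child { id: 2, name: "c", ty: "string" },
    ];
    let target = vec![
        Child { id: 1, name: "a", ty: "int" },
        Child { id: 3, name: "b", ty: "struct" },
        Child { id: 2, name: "c", ty: "string" },
    ];
    (file, target)
}
```

With these inputs, positional matching pairs the target's struct-typed `b` with the file's string-typed `c`, which is the shape of the Utf8-to-Struct error. Matching by field id leaves `b` unmatched and pairs `c` with `c`.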
This change fixes the bug by replacing the flat cast with promote_array_to_target, which walks the target type and matches nested struct children by PARQUET:field_id, filling fields absent from the file with typed NULLs and recursing through list/large-list/map. Primitives still use cast for valid Iceberg promotions. This mirrors iceberg-java's by-field-id nested readers.
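The recursive walk described here can be sketched on a toy value model. Everything below (`Type`, `Field`, `Value`, `reconcile`, `field`) is a hypothetical, std-only illustration and not the PR's implementation or the arrow-rs API. It works on single values rather than Arrow arrays, uses an untyped `Null` where the real code would build a null array of the target type, and omits large-list and map.

```rust
/// Toy type model (assumption): a few primitives plus list and struct.
#[derive(Debug, Clone, PartialEq)]
enum Type {
    Int,
    Long,
    Utf8,
    List(Box<Field>),
    Struct(Vec<Field>),
}

/// Toy field carrying an Iceberg-style field id.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    id: i32,
    name: String,
    ty: Type,
}

/// Toy value; `Struct` children follow the order of the type's fields.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Int(i32),
    Long(i64),
    Utf8(String),
    List(Vec<Value>),
    Struct(Vec<Value>),
}

/// Convenience constructor for the sketch.
fn field(id: i32, name: &str, ty: Type) -> Field {
    Field { id, name: name.to_string(), ty }
}

/// Walk the target type: match struct children by field id, fill missing
/// children with nulls, recurse through lists, and promote primitives.
fn reconcile(value: &Value, from: &Type, to: &Type) -> Result<Value, String> {
    match (value, from, to) {
        (Value::Null, _, _) => Ok(Value::Null),
        (Value::Struct(children), Type::Struct(src), Type::Struct(dst)) => {
            let mut out = Vec::with_capacity(dst.len());
            for target in dst {
                match src.iter().position(|f| f.id == target.id) {
                    // Present in the file: recurse into the matching child.
                    Some(i) => {
                        let child = children
                            .get(i)
                            .ok_or_else(|| format!("missing child {i}"))?;
                        out.push(reconcile(child, &src[i].ty, &target.ty)?);
                    }
                    // Absent from the file: the field was added later.
                    None => out.push(Value::Null),
                }
            }
            Ok(Value::Struct(out))
        }
        (Value::List(items), Type::List(src), Type::List(dst)) => items
            .iter()
            .map(|v| reconcile(v, &src.ty, &dst.ty))
            .collect::<Result<Vec<_>, _>>()
            .map(Value::List),
        // int -> long is a valid Iceberg type promotion.
        (Value::Int(v), Type::Int, Type::Long) => Ok(Value::Long(i64::from(*v))),
        (v, f, t) if f == t => Ok(v.clone()),
        (_, f, t) => Err(format!("cannot promote {f:?} to {t:?}")),
    }
}
```

In this sketch, a row `Struct([Int(7), Utf8("hi")])` written as (a: int, c: string) and read against (a: long, b: list<struct>, c: string) reconciles to `Struct([Long(7), Null, Utf8("hi")])`: `a` is promoted, the added `b` is null, and `c` stays attached to its own id.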
Are these changes tested?
Yes, unit tests are included to ensure that nested fields are now properly reconciled when present in the schema but not in the data file itself.