[python] Support schema evolution of nested struct sub-fields by TheR1sing3un · Pull Request #8187 · apache/paimon

TheR1sing3un · 2026-06-10T02:25:45Z

Purpose

Follow-up to #8126, which made read-time schema evolution align top-level columns by field id. This extends the same id-based alignment to sub-fields inside a ROW (including a ROW nested in an ARRAY/MAP).

Before this PR, nested sub-field evolution didn't work: adding a sub-field silently created a top-level column, and rename/drop/update-type failed, because only the last name in the path was matched.

Now a dotted path like mv.value is resolved recursively, so for a column mv ROW<version BIGINT, value STRING>:

add a sub-field → old rows read NULL for it;
rename a sub-field → data follows the field id, not the name;
drop a sub-field → its old data is not revived;
update type of a sub-field → cast at read time.

# mv ROW<version BIGINT, value STRING>
catalog.alter_table("db.t", [SchemaChange.rename_column(["mv", "value"], "val")])
# files written under mv.value are read back as mv.val with the same data

How it works:

Nested sub-fields get globally-unique field ids at create time; highestFieldId is computed recursively so nested and top-level ids never collide.
Schema changes (add / rename / drop / update-type / update-nullability / update-comment) recurse along the field-name path, transparently through ARRAY/MAP wrappers.
update column type is validated against the cast-support rules.
The read path aligns nested sub-fields by id — reorder, pad missing with NULL, follow renames, cast changed types — recursing into struct / array / map.
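The recursive highestFieldId computation described above can be sketched as follows. This is a minimal illustration, not the PR's code: the `Field`, `Row`, `Array`, and `Map` classes and both function names are hypothetical stand-ins for the real pypaimon schema types.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Field:
    id: int
    name: str
    type: object  # an atomic type name (str) or a Row / Array / Map below


@dataclass
class Row:
    fields: List[Field]


@dataclass
class Array:
    element: object


@dataclass
class Map:
    key: object
    value: object


def highest_in_type(data_type) -> int:
    """Highest field id found inside a type, or -1 if it holds no fields."""
    if isinstance(data_type, Row):
        return highest_field_id(data_type.fields)
    if isinstance(data_type, Array):
        return highest_in_type(data_type.element)
    if isinstance(data_type, Map):
        return max(highest_in_type(data_type.key), highest_in_type(data_type.value))
    return -1


def highest_field_id(fields: List[Field]) -> int:
    """Highest id over top-level fields and every nested sub-field."""
    highest = -1
    for f in fields:
        highest = max(highest, f.id, highest_in_type(f.type))
    return highest


# id BIGINT, mv ROW<version BIGINT, value STRING>, arr ARRAY<ROW<c INT>>
schema = [
    Field(0, "id", "BIGINT"),
    Field(1, "mv", Row([Field(2, "version", "BIGINT"), Field(3, "value", "STRING")])),
    Field(4, "arr", Array(Row([Field(5, "c", "INT")]))),
]
next_id = highest_field_id(schema) + 1  # id for the next added (sub-)field
```

Because the maximum is taken over nested ids as well, a newly added sub-field or top-level column allocated from `next_id` cannot reuse an id that already exists anywhere in the schema.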

Tests

New cases in schema_evolution_nested_read_test.py:

nested add / rename / drop / update-type read round-trips, on append-only and primary-key tables;
sub-fields of ARRAY<ROW> and MAP<.,ROW>;
the nested field-id model (global uniqueness, recursive highestFieldId, duplicate detection);
the type-cast support rules.

Read-time schema evolution previously aligned only top-level columns by field id; sub-fields inside a ROW (and a ROW nested in an ARRAY/MAP) could not evolve: adding one silently created a top-level column, and rename/drop/update-type raised because the schema manager only handled the last path element.

- Assign globally-unique ids to nested sub-fields at create time and compute highestFieldId recursively, so nested ids never collide with top-level ones.
- Recurse schema changes along the dotted field-name path (transparently through ARRAY/MAP wrappers) for add/rename/drop/update-type/update-nullability/update-comment, allocating new ids from the persisted highestFieldId.
- Validate update-column-type against the cast-support rules.
- Align nested sub-fields by field id at read time: reorder, pad missing with NULL, follow renames, and cast changed types, recursing into struct/array/map.

Add tests covering nested add/rename/drop/update-type round-trips (append-only and primary-key), ARRAY<ROW>/MAP<.,ROW> sub-fields, the id model, and the cast rules.
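As a rough illustration of the id-based alignment in this commit, the sketch below uses plain dicts in place of Arrow structs. The field layout, the `CASTS` table, and `align_row` are all hypothetical; the real read path operates on Arrow batches.

```python
# Hypothetical cast table: (file type, read type) -> conversion function.
CASTS = {("INT", "BIGINT"): int, ("BIGINT", "STRING"): str}


def align_row(value, file_fields, read_fields):
    """Reshape a struct value written under file_fields into read_fields.

    Fields are matched by id, never by name, so renames follow the data,
    dropped fields disappear, and newly added fields read as None.
    """
    if value is None:
        return None
    file_by_id = {f["id"]: f for f in file_fields}
    out = {}
    for target in read_fields:  # output order follows the read schema
        source = file_by_id.get(target["id"])
        if source is None:
            out[target["name"]] = None  # added after the file was written
            continue
        v = value.get(source["name"])
        if "fields" in target and "fields" in source:
            v = align_row(v, source["fields"], target["fields"])  # nested ROW
        elif v is not None and source["type"] != target["type"]:
            v = CASTS[(source["type"], target["type"])](v)
        out[target["name"]] = v
    return out


file_schema = [{"id": 1, "name": "mv", "type": "ROW", "fields": [
    {"id": 2, "name": "version", "type": "BIGINT"},
    {"id": 3, "name": "value", "type": "STRING"},
    {"id": 4, "name": "old", "type": "INT"},       # dropped later
]}]
read_schema = [{"id": 1, "name": "mv", "type": "ROW", "fields": [
    {"id": 3, "name": "val", "type": "STRING"},    # renamed and reordered
    {"id": 2, "name": "version", "type": "STRING"},  # type changed
    {"id": 5, "name": "extra", "type": "INT"},     # added later
]}]

aligned = align_row({"mv": {"version": 7, "value": "a", "old": 1}},
                    file_schema, read_schema)
```

Here `aligned` holds the old row in the latest shape: `val` carries the data written as `value`, `version` is cast to a string, `extra` is padded with None, and `old` is gone.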

Nested-leaf projection on append-only reads pushed the leaf path down by the LATEST name, bypassing the per-file field-id normalization: after a sub-field rename the old file's leaf read NULL, and after a sub-field type change old and new batches carried different types and failed to concatenate. Mirror the merge path instead: widen the projection to the full top-level columns so the field-id normalization applies (rename follows the id, missing sub-fields pad NULL, types are cast), then extract the requested leaf paths back to the user's flat schema - batch-level via NestedLeafBatchReader, or row-level via OuterProjectionRecordReader when a post-read filter is involved. Add regression tests projecting a renamed and a type-changed sub-field across old and new files.
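The widen-then-extract approach can be sketched in a few lines. This is an assumption-laden illustration over dict rows; `widen_projection`, `extract_leaf`, and `project_leaves` are hypothetical names and do not reflect the actual NestedLeafBatchReader or OuterProjectionRecordReader code.

```python
def widen_projection(leaf_paths):
    """Top-level columns needed to serve the requested leaf paths, in order."""
    columns = []
    for path in leaf_paths:
        if path[0] not in columns:
            columns.append(path[0])
    return columns


def extract_leaf(row, path):
    """Walk a dict row along path; a NULL parent yields a NULL leaf."""
    current = row
    for name in path:
        if current is None:
            return None
        current = current.get(name)
    return current


def project_leaves(normalized_rows, leaf_paths):
    """Flatten rows (already aligned to the latest schema) to the leaf paths."""
    return [tuple(extract_leaf(row, p) for p in leaf_paths)
            for row in normalized_rows]


leaf_paths = [["mv", "val"], ["mv", "version"], ["id"]]
# Rows as they look after per-file field-id normalization of full columns.
rows = [
    {"id": 1, "mv": {"val": "a", "version": 7}},
    {"id": 2, "mv": None},
]
flat = project_leaves(rows, leaf_paths)
```

The point of the ordering is that extraction runs on rows whose names and types already match the latest schema, so a leaf addressed by its latest name is found in old files too.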

update_column_type from ROW/ARRAY/MAP to STRING passes validation (the cast rules allow constructed types to character strings), but reading an old file failed with ArrowNotImplementedError because struct/list/map cannot be cast to utf8 directly. Render the string form during per-file alignment instead, matching the engine's cast rules: ROW as '{v1, v2}', ARRAY as '[e1, e2]', MAP as '{k1 -> v1, k2 -> v2}', with sub-values rendered recursively, NULL sub-values as the literal 'null', and NULL containers staying NULL. Add round-trip tests for ROW/ARRAY/MAP to STRING, NULL semantics, and a nested sub-field changed to STRING.
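The string forms listed above can be illustrated with a small recursive renderer. This is a sketch under simplifying assumptions: a ROW is modeled as a Python tuple, an ARRAY as a list, a MAP as a dict, and leaf values are formatted with `str()`, which will not match the engine's formatting for every type (booleans, for example). The function names are hypothetical.

```python
def render_as_string(value):
    """Render a ROW / ARRAY / MAP value as text; a NULL container stays None."""
    if value is None:
        return None
    return _render(value)


def _render(value):
    """Render a (sub-)value; NULL sub-values become the literal 'null'."""
    if value is None:
        return "null"
    if isinstance(value, tuple):  # ROW -> {v1, v2}
        return "{" + ", ".join(_render(v) for v in value) + "}"
    if isinstance(value, list):   # ARRAY -> [e1, e2]
        return "[" + ", ".join(_render(v) for v in value) + "]"
    if isinstance(value, dict):   # MAP -> {k1 -> v1, k2 -> v2}
        return "{" + ", ".join(
            f"{_render(k)} -> {_render(v)}" for k, v in value.items()
        ) + "}"
    return str(value)


row_text = render_as_string((1, "a"))
array_text = render_as_string([1, None, 3])
map_text = render_as_string({"k1": (1, None), "k2": None})
```

Splitting the entry point from the recursive helper keeps the two NULL rules apart: only the outermost container can stay NULL, while anything nested is rendered as text.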

…kens

Two follow-ups on the nested schema-evolution path:

- update_column_type from VECTOR (or MULTISET) to STRING passed validation but old files failed on read: there is no string rendering for them. Narrow the cast rule so only ROW/ARRAY/MAP - the constructed types the read path can render - are accepted as string sources.
- The nested path walker consumed the ARRAY/MAP wrapper token by position without checking it, so an invalid path like ['arr', 'wrong', 'c'] was accepted and mutated the schema exactly like ['arr', 'element', 'c']. Require 'element' for arrays and 'value' for maps before descending.

Add tests for the rejected vector alter (the column still reads), the narrowed cast rules, and wrong wrapper tokens on ARRAY<ROW> / MAP<.,ROW>.
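The wrapper-token check can be sketched as a small path walker. The dict-based type encoding and `resolve_field` are hypothetical; the sketch only shows the rule that an ARRAY must be entered through 'element' and a MAP through 'value'.

```python
WRAPPER_TOKEN = {"ARRAY": "element", "MAP": "value"}


def resolve_field(fields, path):
    """Return the last named field reached by path, validating wrapper tokens."""
    if not path:
        raise ValueError("empty field path")
    field = next((f for f in fields if f["name"] == path[0]), None)
    if field is None:
        raise ValueError(f"unknown field {path[0]!r}")
    rest = list(path[1:])
    data_type = field["type"]
    # Unwrap ARRAY / MAP layers, checking each token by name, not by position.
    while rest and isinstance(data_type, dict) and data_type["kind"] in WRAPPER_TOKEN:
        expected = WRAPPER_TOKEN[data_type["kind"]]
        token = rest.pop(0)
        if token != expected:
            raise ValueError(
                f"expected {expected!r} to enter {data_type['kind']}, got {token!r}"
            )
        data_type = data_type[expected]
    if not rest:
        return field
    if not (isinstance(data_type, dict) and data_type["kind"] == "ROW"):
        raise ValueError(f"cannot descend into non-ROW field {field['name']!r}")
    return resolve_field(data_type["fields"], rest)


row_c = {"kind": "ROW", "fields": [{"id": 2, "name": "c", "type": "INT"}]}
row_d = {"kind": "ROW", "fields": [{"id": 4, "name": "d", "type": "INT"}]}
schema = [
    {"id": 1, "name": "arr", "type": {"kind": "ARRAY", "element": row_c}},
    {"id": 3, "name": "m", "type": {"kind": "MAP", "key": "STRING", "value": row_d}},
]

found = resolve_field(schema, ["arr", "element", "c"])
```

With this check, `['arr', 'wrong', 'c']` raises instead of silently resolving to the same sub-field as `['arr', 'element', 'c']`.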

…sliced arrays, gate null-to-not-null

Self-review findings on the nested schema-evolution path:

- update_column_type between same-root constructed types (e.g. ROW<a INT> -> ROW<a BIGINT, c STRING>) was accepted: the replacement carried caller-supplied nested ids that corrupt the id model and old rows read all-NULL; a VECTOR length change was accepted but unreadable. Reject non-identical constructed-to-constructed casts - reshaping goes through sub-field / 'element' / 'value' paths, which keep working.
- The list/map rebuilds in the alignment and string-rendering paths read offsets/raw buffers directly, which errors on a sliced ListArray and silently misaligns rows on a sliced MapArray; re-materialize sliced inputs first.
- Converting a nullable column to NOT NULL was silently accepted; it is now rejected by default and opt-in via 'alter-column-null-to-not-null.disabled' = 'false'.

Also add an end-to-end test for the array 'element' type promotion path.

JingsongLi

Inline review comments.

…action

When a read projects a nested struct sub-field (e.g. mv.latest_version), the read widens the projection to the full top-level column so per-file field-id normalization applies, then extracts the leaf afterwards. The leaf path is absent from the widened read fields, so SplitRead.__init__ dropped any predicate referencing it (predicate_for_reader=None) and the filter was silently lost -- every row was returned.

Re-evaluate the dropped predicate after extraction, where the flat columns match the predicate fields: RawFileSplitRead (append-only / PK raw-convertible) wraps the extracted batches with FilterRecordBatchReader; MergeFileSplitRead (PK non raw-convertible) filters the extracted rows with FilterRecordReader, rewriting indices into the flat output. The predicate is trimmed to the projected columns first, so a filter on a non-projected column keeps the existing drop semantics instead of referencing a missing column.
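A minimal sketch of re-applying the predicate after leaf extraction is shown below. It models the predicate as a list of (column, test) conjuncts over flat tuples; this representation and the names `trim_predicate` and `filter_extracted` are hypothetical and much simpler than pypaimon's predicate classes.

```python
def trim_predicate(conjuncts, projected_columns):
    """Keep only the conjuncts whose column is present in the flat output."""
    projected = set(projected_columns)
    return [(col, test) for col, test in conjuncts if col in projected]


def filter_extracted(rows, columns, conjuncts):
    """Filter flat rows produced by leaf extraction.

    Column names are rewritten to positions in the flat output, so the
    predicate is evaluated against the extracted leaves rather than the
    widened top-level columns.
    """
    kept = trim_predicate(conjuncts, columns)
    position = {name: i for i, name in enumerate(columns)}
    return [
        row for row in rows
        if all(test(row[position[col]]) for col, test in kept)
    ]


columns = ["mv.latest_version"]
rows = [(1,), (5,), (None,), (9,)]
conjuncts = [
    ("mv.latest_version", lambda v: v is not None and v > 3),
    ("not_projected", lambda v: False),  # trimmed: column is not in the output
]
matching = filter_extracted(rows, columns, conjuncts)
```

Trimming first means a conjunct on a column outside the projection is simply not evaluated here, which matches the drop semantics the commit message describes.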

UpdateColumnType carries its own target nullability, but with keep_nullability false the handler applied it without the null-to-not-null guard that UpdateColumnNullability enforces. A type change such as BIGINT NOT NULL on a nullable column therefore succeeded under the default table options, while Java SchemaManager#updateColumnType rejects it unless alter-column-null-to-not-null.disabled=false. Thread disable_null_to_not_null into _handle_update_column_type and call _assert_nullability_change when keep_nullability is false.
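The guard described here can be sketched as follows. The function names echo the ones mentioned in the commit message, but the signatures, the dict-based field, and the error text are assumptions made for illustration.

```python
def assert_nullability_change(old_nullable, new_nullable, field_name,
                              disable_null_to_not_null=True):
    """Reject nullable -> NOT NULL unless the table option allows it."""
    if old_nullable and not new_nullable and disable_null_to_not_null:
        raise ValueError(
            f"Cannot update column {field_name!r} from nullable to NOT NULL; "
            "set 'alter-column-null-to-not-null.disabled' = 'false' to allow it"
        )


def handle_update_column_type(field, new_type, new_nullable,
                              keep_nullability=False,
                              disable_null_to_not_null=True):
    """Return a copy of field with the new type, applying the same guard."""
    if keep_nullability:
        new_nullable = field["nullable"]
    else:
        # The type change carries its own nullability, so it must pass the
        # same check as a plain update-nullability change.
        assert_nullability_change(field["nullable"], new_nullable,
                                  field["name"], disable_null_to_not_null)
    return {**field, "type": new_type, "nullable": new_nullable}


column = {"name": "v", "type": "INT", "nullable": True}
widened = handle_update_column_type(column, "BIGINT", True)
```

Under the default option, changing `v` to `BIGINT NOT NULL` raises, while passing `disable_null_to_not_null=False` or `keep_nullability=True` lets the type change through.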

supports_cast only encodes the logical cast specification (mirroring Java DataTypeCasts), so casts with no runtime implementation -- e.g. TIMESTAMP -> DECIMAL, BOOLEAN -> DECIMAL, TIME -> TIMESTAMP -- were accepted at alter time and then failed at read with ArrowNotImplementedError. Java additionally checks CastExecutors.resolve(...) != null. Add can_execute_cast as the executable-cast counterpart: leaf casts defer to PyArrow's cast-kernel availability (probed once and cached), constructed -> string is rendered by the read path, and other constructed conversions are rejected. update column type now requires both supports_cast and can_execute_cast.

JingsongLi

Follow-up inline review comment.

can_execute_cast probed an empty array, which only resolves the PyArrow cast kernel. Some kernels still reject the target type parameters on non-empty input: INT -> DECIMAL(10, 2) has a kernel but needs precision >= 12 to hold an int's range at scale 2 (BIGINT needs >= 21), so the empty probe accepted it, the alter succeeded, and reading an old file failed with ArrowInvalid "Precision is not great enough". Probe a single null row instead. It triggers PyArrow's static type-parameter validation (decimal precision sufficiency) while a null value avoids per-value parse/overflow errors that safe=False -- matching the read path -- tolerates.
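The precision figures quoted above follow from simple digit arithmetic, worked through below. This is a sketch of the rule as the commit message states it (integer digits plus scale), not a call into PyArrow; both helper names are hypothetical.

```python
def max_decimal_digits(int_bits):
    """Decimal digits in the largest value of a signed integer type."""
    return len(str(2 ** (int_bits - 1) - 1))


def min_decimal_precision(int_bits, scale):
    """Smallest DECIMAL precision that holds every such integer at scale."""
    return max_decimal_digits(int_bits) + scale


def decimal_can_hold_int(int_bits, precision, scale):
    return precision >= min_decimal_precision(int_bits, scale)


# INT is 32-bit: 2147483647 has 10 digits, plus 2 fractional digits.
int_needs = min_decimal_precision(32, 2)
# BIGINT is 64-bit: 9223372036854775807 has 19 digits, plus 2.
bigint_needs = min_decimal_precision(64, 2)
```

So DECIMAL(10, 2) leaves only 8 integer digits, which is too few for a 10-digit INT, and that is the check an empty-array probe never reached.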

AtomicType.to_dict() encodes nullability as a " NOT NULL" string suffix. The atomic-type parser stripped it only on the space-split branch (plain types like BIGINT), so a parameterized type took the paren branch and kept the suffix inside AtomicType.type -- e.g. parsing "DECIMAL(12, 2) NOT NULL" yielded type="DECIMAL(12, 2) NOT NULL", which doubled to "... NOT NULL NOT NULL" on the next serialization. update_column_type to a non-null parameterized type (a valid widening) then failed during alter once can_execute_cast materialized the target. Strip the trailing NOT NULL / NULL suffix up front so nullability lives only in `nullable`, fixing the round-trip for every parse_data_type / from_dict caller.
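The suffix handling can be illustrated with a small parse/serialize pair. This is a simplified sketch: the real parser builds AtomicType objects, whereas `parse_atomic` and `to_type_string` here are hypothetical helpers that work on plain strings.

```python
NOT_NULL = " NOT NULL"
NULL = " NULL"


def parse_atomic(type_string):
    """Split a type string into (type name, nullable).

    The nullability suffix is stripped up front, before any branch looks
    at parentheses, so parameterized types are handled like plain ones.
    """
    text = type_string.strip()
    upper = text.upper()
    if upper.endswith(NOT_NULL):
        return text[: -len(NOT_NULL)].rstrip(), False
    if upper.endswith(NULL):
        return text[: -len(NULL)].rstrip(), True
    return text, True


def to_type_string(type_name, nullable):
    """Serialize back, encoding nullability as a ' NOT NULL' suffix."""
    return type_name if nullable else type_name + NOT_NULL


parsed = parse_atomic("DECIMAL(12, 2) NOT NULL")
round_tripped = to_type_string(*parse_atomic(to_type_string(*parsed)))
```

Since the type name never retains the suffix, serializing and parsing repeatedly is stable and cannot accumulate `NOT NULL NOT NULL`.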

TheR1sing3un marked this pull request as ready for review June 10, 2026 02:25

JingsongLi reviewed Jun 10, 2026

Comment thread paimon-python/pypaimon/read/reader/data_file_batch_reader.py

TheR1sing3un requested a review from JingsongLi June 10, 2026 07:47

JingsongLi reviewed Jun 11, 2026

Comment thread paimon-python/pypaimon/casting/data_type_casts.py Outdated

TheR1sing3un requested a review from JingsongLi June 11, 2026 03:00

JingsongLi reviewed Jun 11, 2026

Comment thread paimon-python/pypaimon/read/reader/data_file_batch_reader.py

JingsongLi reviewed Jun 11, 2026

Comment thread paimon-python/pypaimon/schema/schema_manager.py Outdated

TheR1sing3un added 2 commits June 11, 2026 13:59

TheR1sing3un requested a review from JingsongLi June 11, 2026 08:16

JingsongLi reviewed Jun 14, 2026

Comment thread paimon-python/pypaimon/schema/schema_manager.py

Comment thread paimon-python/pypaimon/casting/data_type_casts.py

Comment thread paimon-python/pypaimon/read/split_read.py

TheR1sing3un added 3 commits June 14, 2026 13:29

JingsongLi reviewed Jun 14, 2026

Comment thread paimon-python/pypaimon/casting/data_type_casts.py Outdated

TheR1sing3un force-pushed the python-nested-schema-evolution branch from 166807f to 58d30b2 Compare June 14, 2026 10:30

JingsongLi reviewed Jun 14, 2026

Comment thread paimon-python/pypaimon/schema/schema_manager.py

TheR1sing3un requested a review from JingsongLi June 14, 2026 13:22

TheR1sing3un wants to merge 10 commits into apache:master from TheR1sing3un:python-nested-schema-evolution

2 participants
