[python] Support schema evolution of nested struct sub-fields#8187

Open
TheR1sing3un wants to merge 10 commits into
apache:masterfrom
TheR1sing3un:python-nested-schema-evolution
@TheR1sing3un TheR1sing3un commented Jun 10, 2026

Member

Purpose

Follow-up to #8126, which made read-time schema evolution align top-level columns by field id. This extends the same id-based alignment to sub-fields inside a ROW (including a ROW nested in an ARRAY/MAP).

Before this PR, nested sub-field evolution didn't work: adding a sub-field silently created a top-level column, and rename/drop/update-type failed, because only the last name in the path was matched.

Now a dotted path like mv.value is resolved recursively, so for a column mv ROW<version BIGINT, value STRING>:

  • add a sub-field → old rows read NULL for it;
  • rename a sub-field → data follows the field id, not the name;
  • drop a sub-field → its old data is not revived;
  • update type of a sub-field → cast at read time.

# mv ROW<version BIGINT, value STRING>
catalog.alter_table("db.t", [SchemaChange.rename_column(["mv", "value"], "val")])
# files written under mv.value are read back as mv.val with the same data

How it works:

  • Nested sub-fields get globally-unique field ids at create time; highestFieldId is computed recursively so nested and top-level ids never collide.
  • Schema changes (add / rename / drop / update-type / update-nullability / update-comment) recurse along the field-name path, transparently through ARRAY/MAP wrappers.
  • update column type is validated against the cast-support rules.
  • The read path aligns nested sub-fields by id — reorder, pad missing with NULL, follow renames, cast changed types — recursing into struct / array / map.
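
The id model in the first bullet can be sketched as follows. This is a minimal illustration, not the PR's code: the `Field` class and the helper names `collect_ids` and `highest_field_id` are assumptions made for this example, and only the idea (walk nested sub-fields, reject duplicates, take the maximum) comes from the description above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Field:
    # Hypothetical stand-in for a schema field; not the pypaimon class.
    id: int
    name: str
    children: List["Field"] = field(default_factory=list)


def collect_ids(fields: List[Field]) -> List[int]:
    """Return every field id, descending into nested sub-fields."""
    ids = []
    for f in fields:
        ids.append(f.id)
        ids.extend(collect_ids(f.children))
    return ids


def highest_field_id(fields: List[Field]) -> int:
    """Highest id across top-level and nested fields; -1 for an empty schema."""
    ids = collect_ids(fields)
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate field id in schema")
    return max(ids, default=-1)


# id INT, mv ROW<version BIGINT, value STRING>
schema = [Field(0, "id"), Field(1, "mv", [Field(2, "version"), Field(3, "value")])]
next_id = highest_field_id(schema) + 1  # id for the next added (sub-)field
```

Because the maximum is taken over the whole tree, a sub-field added later to `mv` receives an id that no top-level column can already hold.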

Tests

New cases in schema_evolution_nested_read_test.py:

  • nested add / rename / drop / update-type read round-trips, on append-only and primary-key tables;
  • sub-fields of ARRAY<ROW> and MAP<.,ROW>;
  • the nested field-id model (global uniqueness, recursive highestFieldId, duplicate detection);
  • the type-cast support rules.

Read-time schema evolution previously aligned only top-level columns by field
id; sub-fields inside a ROW (and a ROW nested in an ARRAY/MAP) could not evolve:
adding one silently created a top-level column, and rename/drop/update-type
raised because the schema manager only handled the last path element.

- Assign globally-unique ids to nested sub-fields at create time and compute
  highestFieldId recursively, so nested ids never collide with top-level ones.
- Recurse schema changes along the dotted field-name path (transparently
  through ARRAY/MAP wrappers) for add/rename/drop/update-type/update-nullability/
  update-comment, allocating new ids from the persisted highestFieldId.
- Validate update-column-type against the cast-support rules.
- Align nested sub-fields by field id at read time: reorder, pad missing with
  NULL, follow renames, and cast changed types, recursing into struct/array/map.

Add tests covering nested add/rename/drop/update-type round-trips (append-only
and primary-key), ARRAY<ROW>/MAP<.,ROW> sub-fields, the id model, and the cast
rules.
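
A rough sketch of the read-time alignment described above, using plain dicts instead of Arrow batches. The `Field` class, the `CASTS` table, and `align` are hypothetical names for this illustration only; the real read path works on columnar data and recurses through ARRAY/MAP as well.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Field:
    # Hypothetical stand-in for a schema field; not the pypaimon class.
    id: int
    name: str
    type: str
    children: List["Field"] = field(default_factory=list)


# Assumed toy cast table; the real cast rules are far richer.
CASTS = {"BIGINT": int, "STRING": str}


def align(row: Dict[str, Any], file_fields: List[Field],
          table_fields: List[Field]) -> Dict[str, Optional[Any]]:
    """Rebuild a row written under file_fields in the shape of table_fields."""
    by_id = {f.id: f for f in file_fields}
    out = {}
    for target in table_fields:
        source = by_id.get(target.id)  # match by id, never by name
        value = None if source is None else row.get(source.name)
        if value is None:
            out[target.name] = None  # added sub-field or NULL value
        elif target.children:
            out[target.name] = align(value, source.children, target.children)
        elif source.type != target.type:
            out[target.name] = CASTS[target.type](value)  # changed type
        else:
            out[target.name] = value
    return out


# File schema: mv ROW<version BIGINT, value STRING, old STRING>
old_schema = [Field(0, "mv", "ROW", [
    Field(1, "version", "BIGINT"), Field(2, "value", "STRING"),
    Field(3, "old", "STRING")])]
# Table schema now: value renamed to val, version changed to STRING,
# old dropped, note added.
new_schema = [Field(0, "mv", "ROW", [
    Field(2, "val", "STRING"), Field(1, "version", "STRING"),
    Field(4, "note", "STRING")])]

aligned = align({"mv": {"version": 7, "value": "a", "old": "x"}},
                old_schema, new_schema)
```

Iterating over the table schema rather than the file schema is what gives the reorder and drop behaviour for free: a field that no longer exists in the table is never looked up.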
@TheR1sing3un TheR1sing3un marked this pull request as ready for review June 10, 2026 02:25
Comment thread paimon-python/pypaimon/read/reader/data_file_batch_reader.py
Nested-leaf projection on append-only reads pushed the leaf path down by
the LATEST name, bypassing the per-file field-id normalization: after a
sub-field rename the old file's leaf read NULL, and after a sub-field type
change old and new batches carried different types and failed to
concatenate.

Mirror the merge path instead: widen the projection to the full top-level
columns so the field-id normalization applies (rename follows the id,
missing sub-fields pad NULL, types are cast), then extract the requested
leaf paths back to the user's flat schema - batch-level via
NestedLeafBatchReader, or row-level via OuterProjectionRecordReader when a
post-read filter is involved.

Add regression tests projecting a renamed and a type-changed sub-field
across old and new files.
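
The widen-then-extract idea can be sketched with plain dicts. `widen` and `extract` are hypothetical helper names, and keying the flat output by the dotted path is an assumption of this example, not necessarily how NestedLeafBatchReader names its columns.

```python
from typing import Any, Dict, List


def widen(leaf_paths: List[str]) -> List[str]:
    """Top-level columns that must be read in full for the requested leaves."""
    columns = []
    for path in leaf_paths:
        top = path.split(".")[0]
        if top not in columns:
            columns.append(top)
    return columns


def extract(row: Dict[str, Any], leaf_paths: List[str]) -> Dict[str, Any]:
    """Pull the requested leaves out of an already id-normalized row."""
    out = {}
    for path in leaf_paths:
        value = row
        for part in path.split("."):
            # A NULL struct yields NULL for every leaf below it.
            value = None if value is None else value.get(part)
        out[path] = value
    return out


requested = ["mv.val", "id"]
read_columns = widen(requested)
flat = extract({"id": 1, "mv": {"val": "a", "version": 2}}, requested)
```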
@TheR1sing3un TheR1sing3un requested a review from JingsongLi June 10, 2026 07:47
Comment thread paimon-python/pypaimon/casting/data_type_casts.py Outdated
update_column_type from ROW/ARRAY/MAP to STRING passes validation (the
cast rules allow constructed types to character strings), but reading an
old file failed with ArrowNotImplementedError because struct/list/map
cannot be cast to utf8 directly.

Render the string form during per-file alignment instead, matching the
engine's cast rules: ROW as '{v1, v2}', ARRAY as '[e1, e2]', MAP as
'{k1 -> v1, k2 -> v2}', with sub-values rendered recursively, NULL
sub-values as the literal 'null', and NULL containers staying NULL.

Add round-trip tests for ROW/ARRAY/MAP to STRING, NULL semantics, and a
nested sub-field changed to STRING.
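
The rendering rules above can be illustrated on plain Python values. This sketch assumes a tuple stands for a ROW, a list for an ARRAY, and a dict for a MAP, and it formats leaves with `str()`, which will not match the engine for every leaf type (booleans, for example); `render` is a hypothetical name.

```python
from typing import Any, Optional


def _render(value: Any) -> str:
    """Render a value that sits inside a container."""
    if value is None:
        return "null"  # NULL sub-value becomes the literal 'null'
    if isinstance(value, tuple):  # ROW
        return "{" + ", ".join(_render(v) for v in value) + "}"
    if isinstance(value, list):  # ARRAY
        return "[" + ", ".join(_render(v) for v in value) + "]"
    if isinstance(value, dict):  # MAP
        pairs = (f"{_render(k)} -> {_render(v)}" for k, v in value.items())
        return "{" + ", ".join(pairs) + "}"
    return str(value)  # assumption: leaves use Python's str()


def render(value: Any) -> Optional[str]:
    """Render a top-level column value; a NULL container stays NULL."""
    return None if value is None else _render(value)
```

For example, `render((1, "a"))` gives `'{1, a}'`, `render([1, None, 3])` gives `'[1, null, 3]'`, and `render(None)` stays `None`.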
@TheR1sing3un TheR1sing3un requested a review from JingsongLi June 11, 2026 03:00
Comment thread paimon-python/pypaimon/read/reader/data_file_batch_reader.py
Comment thread paimon-python/pypaimon/schema/schema_manager.py Outdated
…kens

Two follow-ups on the nested schema-evolution path:

- update_column_type from VECTOR (or MULTISET) to STRING passed validation
  but old files failed on read: there is no string rendering for them.
  Narrow the cast rule so only ROW/ARRAY/MAP - the constructed types the
  read path can render - are accepted as string sources.

- The nested path walker consumed the ARRAY/MAP wrapper token by position
  without checking it, so an invalid path like ['arr', 'wrong', 'c'] was
  accepted and mutated the schema exactly like ['arr', 'element', 'c'].
  Require 'element' for arrays and 'value' for maps before descending.

Add tests for the rejected vector alter (the column still reads), the
narrowed cast rules, and wrong wrapper tokens on ARRAY<ROW> / MAP<.,ROW>.
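
A minimal sketch of the stricter path walk, assuming a toy `Type` class (hypothetical, not the pypaimon data-type classes). Only the rule itself, 'element' for arrays and 'value' for maps, is taken from the commit message.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Type:
    # Hypothetical toy type model for this example.
    kind: str  # "ROW", "ARRAY", "MAP", or a leaf type name
    fields: Dict[str, "Type"] = field(default_factory=dict)  # ROW sub-fields
    inner: Optional["Type"] = None  # ARRAY element type or MAP value type


def resolve(root: Type, path: List[str]) -> Type:
    """Walk a field-name path, checking each wrapper token before descending."""
    current = root
    for token in path:
        if current.kind == "ROW":
            if token not in current.fields:
                raise ValueError(f"unknown sub-field: {token}")
            current = current.fields[token]
        elif current.kind == "ARRAY":
            if token != "element":
                raise ValueError(f"expected 'element', got {token!r}")
            current = current.inner
        elif current.kind == "MAP":
            if token != "value":
                raise ValueError(f"expected 'value', got {token!r}")
            current = current.inner
        else:
            raise ValueError(f"cannot descend into {current.kind} at {token!r}")
    return current


# arr ARRAY<ROW<c INT>>
table = Type("ROW", fields={
    "arr": Type("ARRAY", inner=Type("ROW", fields={"c": Type("INT")}))})
leaf = resolve(table, ["arr", "element", "c"])
```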
…sliced arrays, gate null-to-not-null

Self-review findings on the nested schema-evolution path:

- update_column_type between same-root constructed types (e.g. ROW<a INT>
  -> ROW<a BIGINT, c STRING>) was accepted, but the replacement carried
  caller-supplied nested ids that corrupted the id model, so old rows read
  as all-NULL; a VECTOR length change was likewise accepted but unreadable.
  Reject non-identical constructed-to-constructed casts. Reshaping goes
  through sub-field / 'element' / 'value' paths, which keep working.

- The list/map rebuilds in the alignment and string-rendering paths read
  offsets/raw buffers directly, which errors on a sliced ListArray and
  silently misaligns rows on a sliced MapArray; re-materialize sliced
  inputs first.

- Converting a nullable column to NOT NULL was silently accepted; it is
  now rejected by default and can be enabled by setting
  'alter-column-null-to-not-null.disabled' = 'false'.

Also add an end-to-end test for the array 'element' type promotion path.
@TheR1sing3un TheR1sing3un requested a review from JingsongLi June 11, 2026 08:16

@JingsongLi JingsongLi left a comment

Contributor

Inline review comments.

Comment thread paimon-python/pypaimon/schema/schema_manager.py
Comment thread paimon-python/pypaimon/casting/data_type_casts.py
Comment thread paimon-python/pypaimon/read/split_read.py
…action

When a read projects a nested struct sub-field (e.g. mv.latest_version), the
read widens the projection to the full top-level column so per-file field-id
normalization applies, then extracts the leaf afterwards. The leaf path is
absent from the widened read fields, so SplitRead.__init__ dropped any
predicate referencing it (predicate_for_reader=None) and the filter was
silently lost -- every row was returned.

Re-evaluate the dropped predicate after extraction, where the flat columns
match the predicate fields: RawFileSplitRead (append-only / PK raw-convertible)
wraps the extracted batches with FilterRecordBatchReader; MergeFileSplitRead
(PK non raw-convertible) filters the extracted rows with FilterRecordReader,
rewriting indices into the flat output. The predicate is trimmed to the
projected columns first, so a filter on a non-projected column keeps the
existing drop semantics instead of referencing a missing column.
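
The trim-then-re-evaluate step can be sketched as below. The predicate is modelled as a list of `(column, operator, literal)` conjuncts, which is an assumption of this example; the real predicate tree and the FilterRecordBatchReader / FilterRecordReader classes are not reproduced here.

```python
import operator
from typing import Any, Dict, List, Set, Tuple

Conjunct = Tuple[str, str, Any]  # (flat column name, operator, literal)

OPS = {"=": operator.eq, ">": operator.gt, "<": operator.lt}


def trim(conjuncts: List[Conjunct], projected: Set[str]) -> List[Conjunct]:
    """Keep only conjuncts whose column is part of the flat output."""
    return [c for c in conjuncts if c[0] in projected]


def filter_after_extraction(rows: List[Dict[str, Any]],
                            conjuncts: List[Conjunct],
                            projected: Set[str]) -> List[Dict[str, Any]]:
    """Apply the trimmed predicate to rows that were already flattened."""
    kept = trim(conjuncts, projected)
    return [
        row for row in rows
        # A NULL value never satisfies a comparison.
        if all(row[col] is not None and OPS[op](row[col], lit)
               for col, op, lit in kept)
    ]


rows = [{"mv.latest_version": 1}, {"mv.latest_version": 5},
        {"mv.latest_version": None}]
predicate = [("mv.latest_version", ">", 2), ("other", "=", 0)]
result = filter_after_extraction(rows, predicate, {"mv.latest_version"})
```

The conjunct on `other` is dropped by `trim` because that column is not projected, which mirrors the "existing drop semantics" mentioned above.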

UpdateColumnType carries its own target nullability, but when keep_nullability
is false the handler applied it without the null-to-not-null guard that
UpdateColumnNullability enforces. A type change such as BIGINT NOT NULL on a
nullable column therefore succeeded under the default table options, while Java
SchemaManager#updateColumnType rejects it unless
alter-column-null-to-not-null.disabled=false.

Thread disable_null_to_not_null into _handle_update_column_type and call
_assert_nullability_change when keep_nullability is false.
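
The guard amounts to a single check. The function below is a sketch with an assumed signature; the real `_assert_nullability_change` in schema_manager.py may take different arguments.

```python
def assert_nullability_change(old_nullable: bool, new_nullable: bool,
                              field_name: str,
                              disable_null_to_not_null: bool = True) -> None:
    """Reject nullable -> NOT NULL unless the table option allows it.

    disable_null_to_not_null mirrors the table option
    'alter-column-null-to-not-null.disabled' (true by default).
    """
    if disable_null_to_not_null and old_nullable and not new_nullable:
        raise ValueError(
            f"Cannot change nullable column {field_name} to NOT NULL")


# Relaxing a constraint is always allowed.
assert_nullability_change(False, True, "v")
# Tightening is allowed only when the option is set to 'false'.
assert_nullability_change(True, False, "v", disable_null_to_not_null=False)
```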

supports_cast only encodes the logical cast specification (mirroring Java
DataTypeCasts), so casts with no runtime implementation -- e.g. TIMESTAMP ->
DECIMAL, BOOLEAN -> DECIMAL, TIME -> TIMESTAMP -- were accepted at alter time
and then failed at read with ArrowNotImplementedError. Java additionally checks
CastExecutors.resolve(...) != null.

Add can_execute_cast as the executable-cast counterpart: leaf casts defer to
PyArrow's cast-kernel availability (probed once and cached), constructed ->
string is rendered by the read path, and other constructed conversions are
rejected. update column type now requires both supports_cast and
can_execute_cast.

@JingsongLi JingsongLi left a comment

Contributor

Follow-up inline review comment.

Comment thread paimon-python/pypaimon/casting/data_type_casts.py Outdated
can_execute_cast probed an empty array, which only resolves the PyArrow cast
kernel. Some kernels still reject the target type parameters on non-empty input:
INT -> DECIMAL(10, 2) has a kernel but needs precision >= 12 to hold an int's
range at scale 2 (BIGINT needs >= 21), so the empty probe accepted it, the alter
succeeded, and reading an old file failed with ArrowInvalid "Precision is not
great enough".

Probe a single null row instead. This triggers PyArrow's static type-parameter
validation (decimal precision sufficiency), while the null value avoids the
per-value parse/overflow errors that the read path tolerates by casting with
safe=False.
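
The precision numbers quoted above follow from simple arithmetic: the target needs as many integer digits as the source's largest value, plus the scale. The helpers below are a sketch with hypothetical names; they reproduce the arithmetic only and do not call PyArrow.

```python
# Decimal digits needed for the largest value of each signed integer type.
INTEGER_DIGITS = {
    "TINYINT": len(str(2**7 - 1)),    # 127 -> 3
    "SMALLINT": len(str(2**15 - 1)),  # 32767 -> 5
    "INT": len(str(2**31 - 1)),       # 2147483647 -> 10
    "BIGINT": len(str(2**63 - 1)),    # 9223372036854775807 -> 19
}


def min_decimal_precision(source: str, scale: int) -> int:
    """Smallest DECIMAL precision that holds every value of source at scale."""
    return INTEGER_DIGITS[source] + scale


def decimal_can_hold(source: str, precision: int, scale: int) -> bool:
    """True when DECIMAL(precision, scale) can hold the full integer range."""
    return precision >= min_decimal_precision(source, scale)
```

With these, `min_decimal_precision("INT", 2)` is 12 and `min_decimal_precision("BIGINT", 2)` is 21, so `DECIMAL(10, 2)` is too narrow for an INT source.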
@TheR1sing3un TheR1sing3un force-pushed the python-nested-schema-evolution branch from 166807f to 58d30b2 Compare June 14, 2026 10:30
Comment thread paimon-python/pypaimon/schema/schema_manager.py
AtomicType.to_dict() encodes nullability as a " NOT NULL" string suffix. The
atomic-type parser stripped it only on the space-split branch (plain types like
BIGINT), so a parameterized type took the paren branch and kept the suffix
inside AtomicType.type -- e.g. parsing "DECIMAL(12, 2) NOT NULL" yielded
type="DECIMAL(12, 2) NOT NULL", which doubled to "... NOT NULL NOT NULL" on the
next serialization. update_column_type to a non-null parameterized type (a valid
widening) then failed during alter once can_execute_cast materialized the target.

Strip the trailing NOT NULL / NULL suffix up front so nullability lives only in
`nullable`, fixing the round-trip for every parse_data_type / from_dict caller.
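
The fix can be pictured as splitting the suffix off before any other parsing. `split_nullability` and `to_type_string` are hypothetical helper names for this sketch; the actual parser in pypaimon is structured differently.

```python
from typing import Tuple

NOT_NULL = " NOT NULL"
NULL = " NULL"


def split_nullability(type_string: str) -> Tuple[str, bool]:
    """Return (base type, nullable) with any trailing suffix removed."""
    text = type_string.strip()
    upper = text.upper()
    # Check NOT NULL first: it also ends with " NULL".
    if upper.endswith(NOT_NULL):
        return text[:-len(NOT_NULL)].rstrip(), False
    if upper.endswith(NULL):
        return text[:-len(NULL)].rstrip(), True
    return text, True


def to_type_string(base: str, nullable: bool) -> str:
    """Serialize back, adding the suffix only for non-null types."""
    return base if nullable else base + NOT_NULL


base, nullable = split_nullability("DECIMAL(12, 2) NOT NULL")
round_trip = to_type_string(base, nullable)
```

Since the base type never carries the suffix, serializing and parsing any number of times yields the same string.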
@TheR1sing3un TheR1sing3un requested a review from JingsongLi June 14, 2026 13:22
2 participants