Accept Arrow null (Iceberg v3 unknown) physical types in Parquet schema validation and add test#12
Accept Arrow null (Iceberg v3 unknown) physical types in Parquet schema validation and add test#12manuzhang wants to merge 7 commits into
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8676ef1e19
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (arrow_type->id() == ::arrow::Type::NA) { | ||
| return {}; |
There was a problem hiding this comment.
Handle null physical fields before descending into nested projections
When the Arrow null physical field corresponds to an expected nested type (for example a v3 unknown column later read as a list, map, or struct), this early success only bypasses type validation; the callers still see field.type()->is_nested() and descend into parquet_field.children. A null physical field has no children, so lists/maps are rejected during projection and some structs can later be projected without selecting the null column, which defeats the intended “compatible with any projected Iceberg type” behavior for nested fields. The null case needs to be represented as a null projected field (or otherwise handled before nested descent), not just accepted here.
Useful? React with 👍 / 👎.
Add the unknown primitive type model, JSON and Arrow schema round-trips, visitor coverage, and schema evolution handling. Also project Parquet Arrow null physical fields as null projections before nested descent. Co-authored-by: Codex <codex@openai.com>
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 06a8959bd0
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Project expected unknown fields as nulls when physical Avro or Parquet files still contain concrete columns. Map Iceberg unknown to Avro null and keep null projection paths readable by the Avro decoders. Co-authored-by: Codex <codex@openai.com>
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 54a5dc8f87
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Keep Avro and Parquet null physical fields from being coerced into null projections for required Iceberg fields, including required nested list elements. This preserves required-column semantics while still allowing optional unknown/null projections to materialize nulls. Co-authored-by: Codex <codex@openai.com>
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b3f9ce333b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Validate unknown fields while parsing schema JSON so the null-only unknown type is only accepted in optional field positions, including list elements and map values. Map keys are rejected because they are required by construction. Co-authored-by: Codex <codex@openai.com>
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 200922726d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (source_type.type_id() == TypeId::kUnknown && expected_type.is_primitive()) { | ||
| return {}; |
There was a problem hiding this comment.
Allow unknown-to-nested evolution in schema projection
ValidateSchemaEvolution now treats unknown as compatible only when the expected type is primitive, so Project still rejects valid evolutions where a source field is unknown and the current schema expects an optional nested type (struct/list/map). In that case ProjectNested hits the nested type-id equality check and returns Cannot read ... from unknown instead of producing a null projection, which blocks reading v3 data files where unknown was later refined to a nested type.
Useful? React with 👍 / 👎.
Allow schema projection to materialize nulls when optional nested fields are read from source unknown fields. Required nested targets still fail to preserve non-null semantics. Co-authored-by: Codex <codex@openai.com>
Project expected unknown fields as null through nested schema projections. Keep Parquet shape columns selected for nested null projections and allow Avro direct encoding of null-only unknown leaves. Co-authored-by: Codex <codex@openai.com>
Motivation
unknownvalues as nulls in physical Parquet files, so schema-evolution validation must treat Arrow null physical types as compatible with any projected Iceberg type.Description
ValidateParquetSchemaEvolutioninparquet_schema_util.ccto allow Arrownull(::arrow::Type::NA) physical types and return success for them.ProjectNestedto keep helper functions organized.ValidateParquetSchemaEvolutiondeclaration inparquet_schema_util_internal.hso it is available where needed.ValidateSchemaEvolutionAllowsNullPhysicalTypeinparquet_schema_test.ccthat verifies a Parquet field with::arrow::null()is accepted.Testing
ParquetSchemaProjectionTestsuite including the newValidateSchemaEvolutionAllowsNullPhysicalTypetest, and all tests passed.Codex Task