Accept Arrow null (Iceberg v3 unknown) physical types in Parquet schema validation and add test #12

Open: manuzhang wants to merge 7 commits into main from codex/support-unknown-v3-type
@manuzhang
Owner

Motivation

  • Iceberg v3 represents unknown values as nulls in physical Parquet files, so schema-evolution validation must treat Arrow null physical types as compatible with any projected Iceberg type.

Description

  • Updated ValidateParquetSchemaEvolution in parquet_schema_util.cc to allow Arrow null (::arrow::Type::NA) physical types and return success for them.
  • Adjusted anonymous namespace boundaries and added a forward declaration for ProjectNested to keep helper functions organized.
  • Exported the ValidateParquetSchemaEvolution declaration in parquet_schema_util_internal.h so it is available where needed.
  • Added unit test ValidateSchemaEvolutionAllowsNullPhysicalType in parquet_schema_test.cc that verifies a Parquet field with ::arrow::null() is accepted.
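The early-accept rule described above can be sketched in isolation. This is a minimal standalone model, not the code in parquet_schema_util.cc: PhysicalType, ExpectedType, Status, and ValidateSchemaEvolutionSketch are hypothetical stand-ins for ::arrow::Type, the Iceberg type id, and the project's status type, and the promotion table only covers the int->long and float->double cases named in the Testing section.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins; the real code uses ::arrow::Type and Iceberg types.
enum class PhysicalType { kNA, kInt32, kInt64, kFloat, kDouble, kString };
enum class ExpectedType { kInt, kLong, kFloat, kDouble, kString };

// Hypothetical status type: an empty error string means success.
struct Status {
  std::string error;
  bool ok() const { return error.empty(); }
};

// Accept a null physical type for any expected primitive type, then fall
// through to ordinary promotion checks for everything else.
Status ValidateSchemaEvolutionSketch(ExpectedType expected, PhysicalType physical) {
  if (physical == PhysicalType::kNA) {
    return {};  // Every value is null, so there is nothing to convert.
  }
  bool compatible = false;
  switch (expected) {
    case ExpectedType::kInt:
      compatible = physical == PhysicalType::kInt32;
      break;
    case ExpectedType::kLong:  // int -> long promotion
      compatible = physical == PhysicalType::kInt32 || physical == PhysicalType::kInt64;
      break;
    case ExpectedType::kFloat:
      compatible = physical == PhysicalType::kFloat;
      break;
    case ExpectedType::kDouble:  // float -> double promotion
      compatible = physical == PhysicalType::kFloat || physical == PhysicalType::kDouble;
      break;
    case ExpectedType::kString:
      compatible = physical == PhysicalType::kString;
      break;
  }
  if (compatible) return {};
  return {"incompatible physical type"};
}
```

In this model a null physical column validates against a string, an int, or any other expected primitive, while a string column read as an int still fails.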

Testing

  • Ran the ParquetSchemaProjectionTest suite including the new ValidateSchemaEvolutionAllowsNullPhysicalType test, and all tests passed.
  • Existing projection tests (e.g. int->long, float->double, incompatible types) were executed and remained successful.

Codex Task

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8676ef1e19

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +69 to +70
if (arrow_type->id() == ::arrow::Type::NA) {
  return {};

P2: Handle null physical fields before descending into nested projections

When the Arrow null physical field corresponds to an expected nested type (for example, a v3 unknown column later read as a list, map, or struct), this early success only bypasses type validation. The callers still see field.type()->is_nested() and descend into parquet_field.children. A null physical field has no children, so lists and maps are rejected during projection, and some structs can later be projected without selecting the null column. That defeats the intended "compatible with any projected Iceberg type" behavior for nested fields. The null case needs to be represented as a null projected field (or otherwise handled before nested descent), not just accepted here.
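One way to picture the suggested ordering is the sketch below, which checks for the null physical type before any nested descent. It is an illustrative model only: ExpectedField, PhysicalField, Projection, and ProjectFieldSketch are hypothetical names, children are matched by position rather than by field id, and the real projection code in the PR may be structured differently.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified model of an expected Iceberg field.
struct ExpectedField {
  std::string name;
  bool nested = false;
  std::vector<ExpectedField> children;
};

// Hypothetical, simplified model of a physical Parquet field.
struct PhysicalField {
  std::string name;
  bool is_null_type = false;  // models ::arrow::Type::NA
  std::vector<PhysicalField> children;
};

struct Projection {
  enum class Kind { kProjected, kNull, kError };
  Kind kind = Kind::kProjected;
  std::vector<Projection> children;
};

Projection ProjectFieldSketch(const ExpectedField& expected, const PhysicalField& physical) {
  // Handle the null physical type first: it has no children to descend into.
  if (physical.is_null_type) {
    return {Projection::Kind::kNull, {}};
  }
  if (!expected.nested) {
    return {Projection::Kind::kProjected, {}};
  }
  // Simplification: children are matched by position, not by field id.
  if (physical.children.size() != expected.children.size()) {
    return {Projection::Kind::kError, {}};
  }
  Projection result{Projection::Kind::kProjected, {}};
  for (std::size_t i = 0; i < expected.children.size(); ++i) {
    result.children.push_back(ProjectFieldSketch(expected.children[i], physical.children[i]));
  }
  return result;
}
```

With the null check placed first, an expected list backed by a childless null column yields a null projection instead of failing on the missing children.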

Add the unknown primitive type model, JSON and Arrow schema round-trips, visitor coverage, and schema evolution handling. Also project Parquet Arrow null physical fields as null projections before nested descent.

Co-authored-by: Codex <codex@openai.com>
@manuzhang
Owner Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 06a8959bd0

Comment thread src/iceberg/parquet/parquet_schema_util.cc Outdated
Comment thread src/iceberg/avro/avro_schema_util.cc Outdated
Project expected unknown fields as nulls when physical Avro or Parquet files still contain concrete columns. Map Iceberg unknown to Avro null and keep null projection paths readable by the Avro decoders.

Co-authored-by: Codex <codex@openai.com>
@manuzhang
Owner Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 54a5dc8f87

Comment thread src/iceberg/parquet/parquet_schema_util.cc
Comment thread src/iceberg/avro/avro_schema_util.cc
Keep Avro and Parquet null physical fields from being coerced into null projections for required Iceberg fields, including required nested list elements. This preserves required-column semantics while still allowing optional unknown/null projections to materialize nulls.

Co-authored-by: Codex <codex@openai.com>
@manuzhang
Owner Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b3f9ce333b

Comment thread src/iceberg/json_serde.cc
Validate unknown fields while parsing schema JSON so the null-only unknown type is only accepted in optional field positions, including list elements and map values. Map keys are rejected because they are required by construction.

Co-authored-by: Codex <codex@openai.com>
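The position rule in the commit message above can be illustrated with a small recursive check. This is a sketch under assumptions, not the code in json_serde.cc: Kind, Field, Type, MakeType, and UnknownOnlyInOptionalPositions are hypothetical names, and list elements, map keys, and map values are all modeled as named fields of their parent type.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

enum class Kind { kUnknown, kInt, kString, kStruct, kList, kMap };

struct Type;

// A slot that holds a type: a struct field, list element, map key, or map value.
struct Field {
  std::string name;
  bool optional;
  std::shared_ptr<Type> type;
};

struct Type {
  Kind kind;
  // Struct fields; {element} for a list; {key, value} for a map.
  std::vector<Field> fields;
};

// Hypothetical helper for building type trees in examples.
std::shared_ptr<Type> MakeType(Kind kind, std::vector<Field> fields = {}) {
  return std::make_shared<Type>(Type{kind, std::move(fields)});
}

// Returns true when every unknown type sits in an optional slot.
// Map keys are modeled as required slots, so an unknown key is rejected.
bool UnknownOnlyInOptionalPositions(const Type& type) {
  for (const Field& field : type.fields) {
    if (field.type->kind == Kind::kUnknown && !field.optional) {
      return false;
    }
    if (!UnknownOnlyInOptionalPositions(*field.type)) {
      return false;
    }
  }
  return true;
}
```

Because a map key is always a required slot in this model, no separate key check is needed: the same optionality test rejects it.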
@manuzhang
Owner Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 200922726d

Comment on lines +52 to +53
if (source_type.type_id() == TypeId::kUnknown && expected_type.is_primitive()) {
  return {};

P1: Allow unknown-to-nested evolution in schema projection

ValidateSchemaEvolution now treats unknown as compatible only when the expected type is primitive, so Project still rejects valid evolutions where a source field is unknown and the current schema expects an optional nested type (struct/list/map). In that case ProjectNested hits the nested type-id equality check and returns "Cannot read ... from unknown" instead of producing a null projection. This blocks reading v3 data files where unknown was later refined to a nested type.
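A possible shape for the requested behavior is sketched below. It is a standalone illustration with hypothetical names (TypeId here is a local enum, and ProjectionKind and ProjectFromSourceSketch are invented for the example). It assumes that an unknown source field yields a null projection only when the expected field is optional; that required-field rule is an assumption of this sketch, not something this review comment states.

```cpp
#include <cassert>

// Local stand-in for the Iceberg type id; only a few kinds are modeled.
enum class TypeId { kUnknown, kInt, kStruct, kList, kMap };

enum class ProjectionKind { kProjected, kNull, kError };

// Decide how to project one field from its source type to its expected type.
ProjectionKind ProjectFromSourceSketch(TypeId source, TypeId expected,
                                       bool expected_optional) {
  if (source == TypeId::kUnknown) {
    // Unknown holds only nulls, so the same rule covers primitive and nested
    // targets. Assumption: a required target cannot be filled with nulls.
    return expected_optional ? ProjectionKind::kNull : ProjectionKind::kError;
  }
  // Simplification: all other cases require identical type ids.
  return source == expected ? ProjectionKind::kProjected : ProjectionKind::kError;
}
```

The point of the sketch is that the unknown check runs before any type-id equality test, so an optional struct, list, or map target never reaches the comparison that produces the error.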

manuzhang and others added 2 commits May 20, 2026 16:26
Allow schema projection to materialize nulls when optional nested fields are read from source unknown fields. Required nested targets still fail to preserve non-null semantics.

Co-authored-by: Codex <codex@openai.com>
Project expected unknown fields as null through nested schema projections. Keep Parquet shape columns selected for nested null projections and allow Avro direct encoding of null-only unknown leaves.

Co-authored-by: Codex <codex@openai.com>