
feat(datafusion): support PARTITIONED BY for identity-partitioned external tables#2575

Open
huan233usc wants to merge 4 commits into
apache:mainfrom
huan233usc:feat/datafusion-external-table-partitioned-by

@huan233usc huan233usc commented Jun 3, 2026

Contributor

Which issue does this PR close?

What changes are included in this PR?

CREATE EXTERNAL TABLE ... STORED AS ICEBERG (via IcebergTableProviderFactory) previously rejected any PARTITIONED BY clause outright.

DataFusion's PARTITIONED BY grammar only accepts plain column names — it cannot express Iceberg transforms such as bucket(16, id) or days(ts) (unlike Spark's native DSv2 grammar). Given that constraint, this PR:

  • Stops rejecting table_partition_cols in check_cmd.
  • Adds validate_partition_columns, run after the table is loaded:
    • If the table's default partition spec uses any non-identity transform, returns a clear FeatureUnsupported error naming the offending field/transform.
    • Otherwise validates that the declared columns exactly match the identity partition columns in order (consistent with PartitionSpec::is_compatible_with and Java's PartitionSpec.compatibleWith, where field order is significant).
  • Omitting PARTITIONED BY keeps the previous behavior: any table — including non-identity partitioned ones — can still be registered for read-only access.
  • A TODO is left to support non-identity transforms once DataFusion's grammar can express them.
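
The rules above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: `Transform` and `PartitionField` are hypothetical, simplified stand-ins for the iceberg-rust types, the function takes the spec fields directly instead of a `Table`, and plain `String` errors stand in for the `FeatureUnsupported` error kind.

```rust
use std::fmt;

/// Hypothetical, simplified stand-in for an Iceberg partition transform.
#[derive(Debug, Clone, PartialEq)]
enum Transform {
    Identity,
    Bucket(u32),
}

impl fmt::Display for Transform {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Transform::Identity => write!(f, "identity"),
            Transform::Bucket(n) => write!(f, "bucket[{}]", n),
        }
    }
}

/// Hypothetical stand-in for one field of the table's default partition spec.
#[derive(Debug, Clone)]
struct PartitionField {
    source_column: String,
    transform: Transform,
}

/// Sketch of the validation rules; error strings are illustrative only.
fn validate_partition_columns(
    spec: &[PartitionField],
    declared: &[String],
) -> Result<(), String> {
    // No PARTITIONED BY clause: keep the previous behavior, accept any table.
    if declared.is_empty() {
        return Ok(());
    }

    // The clause cannot express transforms, so reject the first
    // non-identity field and name it in the error.
    if let Some(field) = spec.iter().find(|f| f.transform != Transform::Identity) {
        return Err(format!(
            "unsupported: partition field `{}` uses non-identity transform `{}`",
            field.source_column, field.transform
        ));
    }

    // Identity-only spec: declared columns must match exactly, in order.
    let expected: Vec<&str> = spec.iter().map(|f| f.source_column.as_str()).collect();
    let actual: Vec<&str> = declared.iter().map(String::as_str).collect();
    if expected != actual {
        return Err(format!(
            "PARTITIONED BY columns {:?} do not match table partition columns {:?}",
            actual, expected
        ));
    }
    Ok(())
}
```

Under these assumptions, a spec with identity fields `region, event_date` accepts `PARTITIONED BY (region, event_date)` and rejects both `(event_date, region)` and `(region)`, while a `bucket[4]` spec is rejected only when a clause is present.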

Example

CREATE EXTERNAL TABLE my_iceberg_table
STORED AS ICEBERG LOCATION '/path/to/metadata.json'
PARTITIONED BY (event_date);

Are these changes tested?

Yes. Added unit tests in table_provider_factory.rs plus two metadata fixtures (bucket-partitioned and multi-identity-partitioned):

  • single identity column match / mismatch
  • multiple identity columns match / wrong order / subset (count mismatch)
  • non-identity (bucket[4]) transform rejected with a clear error
  • non-identity partitioned table still registers when PARTITIONED BY is omitted

cargo test -p iceberg-datafusion and cargo clippy -p iceberg-datafusion --all-targets pass.

…ernal tables

`CREATE EXTERNAL TABLE ... STORED AS ICEBERG` previously rejected any
`PARTITIONED BY` clause. Since DataFusion's grammar only accepts plain
column names (it cannot express transforms such as `bucket[N]` or `day`),
allow the clause for identity-partitioned tables and validate that the
declared columns match the table's default partition spec, in order.

Tables partitioned with non-identity transforms can still be registered by
omitting the clause; specifying it returns a clear error pointing at the
offending transform.

Closes apache#2050
/// non-identity transforms, can still be registered for read-only access without declaring
/// its partitioning.
fn validate_partition_columns(table: &Table, declared_partition_cols: &[String]) -> Result<()> {
if declared_partition_cols.is_empty() {

Contributor Author

The behavior here is open for discussion.

Alternatively, we could skip validating the declared columns against the partition spec. The pro is that it unblocks users creating an external table that is partitioned (potentially in a way DataFusion does not support); the con is that the SQL would not be strictly accurate.

@huan233usc

Contributor Author

Hi @CTTY, can I get some feedback and thoughts from you when you have a chance? Thanks

@CTTY CTTY self-requested a review June 11, 2026 00:07

CTTY commented Jun 11, 2026

Collaborator

Hi @huan233usc, thanks for the contribution. Throwing errors on partition transforms looks good to me. However, I'm not sure this is a problem we want to solve at this point.

Currently we don't support CREATE EXTERNAL TABLE/register_table, and I think we should tackle #2021 first before coming to this. wdyt?

@huan233usc

Contributor Author

Hi @CTTY

Makes sense, thanks for the context.

Based on my observation, from a user perspective, the ideal priority would probably be:

  1. CREATE TABLE
    -- this isn't really doable with stock DataFusion today unless we make
  2. CREATE EXTERNAL TABLE ... LOCATION ... where LOCATION points to a storage path (creating a new table there)
    -- iiuc, this is what "Support CREATE EXTERNAL TABLE backed by a Catalog with DataFusion" (#2021) is mainly about.
  3. CREATE EXTERNAL TABLE ... LOCATION ... where LOCATION points to an existing metadata JSON / snapshot (read-only registration)
    -- this is what this PR handles; stepping back a bit, this PR seems somewhat redundant/unnecessary.

Let me know if there is anything I can help with on #2021.
Thanks
