feat(datafusion): support PARTITIONED BY for identity-partitioned external tables#2575
feat(datafusion): support PARTITIONED BY for identity-partitioned external tables#2575huan233usc wants to merge 4 commits into
Conversation
…ernal tables `CREATE EXTERNAL TABLE ... STORED AS ICEBERG` previously rejected any `PARTITIONED BY` clause. Since DataFusion's grammar only accepts plain column names (it cannot express transforms such as `bucket[N]` or `day`), allow the clause for identity-partitioned tables and validate that the declared columns match the table's default partition spec, in order. Tables partitioned with non-identity transforms can still be registered by omitting the clause; specifying it returns a clear error pointing at the offending transform. Closes apache#2050
| /// non-identity transforms, can still be registered for read-only access without declaring | ||
| /// its partitioning. | ||
| fn validate_partition_columns(table: &Table, declared_partition_cols: &[String]) -> Result<()> { | ||
| if declared_partition_cols.is_empty() { |
There was a problem hiding this comment.
The behavior here is open for discussion.
We could choose ignore validation partition spec, pros is it will unblock user creating an external table that is partitioned(potentially with the case data fusion not supported), cons is the sql is not strictly accurate.
|
Hi @CTTY, can I get some feedback and thoughts from you when you have a chance? Thanks |
|
Hi @huan233usc , thanks for the contribution. Throwing errors on partition transforms looks good to me. However, I'm not sure if this is a problem that we want to solve at this point. Currently we don't support |
Hi @CTTY Makes sense, thanks for the context. Based on my observation, from a user perspective, the ideal priority would probably be:
Let me know if anything I could help with #2021? |
Which issue does this PR close?
What changes are included in this PR?
CREATE EXTERNAL TABLE ... STORED AS ICEBERG(viaIcebergTableProviderFactory) previously rejected anyPARTITIONED BYclause outright.DataFusion's
PARTITIONED BYgrammar only accepts plain column names — it cannot express Iceberg transforms such asbucket(16, id)ordays(ts)(unlike Spark's native DSv2 grammar). Given that constraint, this PR:table_partition_colsincheck_cmd.validate_partition_columns, run after the table is loaded:FeatureUnsupportederror naming the offending field/transform.PartitionSpec::is_compatible_withand Java'sPartitionSpec.compatibleWith, where field order is significant).PARTITIONED BYkeeps the previous behavior: any table — including non-identity partitioned ones — can still be registered for read-only access.TODOis left to support non-identity transforms once DataFusion's grammar can express them.Example
Are these changes tested?
Yes. Added unit tests in
table_provider_factory.rsplus two metadata fixtures (bucket-partitioned and multi-identity-partitioned):bucket[4]) transform rejected with a clear errorPARTITIONED BYis omittedcargo test -p iceberg-datafusionandcargo clippy -p iceberg-datafusion --all-targetspass.