variant: Integrate datafusion-variant into Datafusion#22908
variant: Integrate datafusion-variant into Datafusion#22908friendlymatthew wants to merge 3 commits into
Conversation
Introduces a new `datafusion-functions-variant` crate containing scalar UDFs that operate on the Parquet Variant logical type (cast_to_variant, json_to_variant, variant_to_json, variant_get and typed variants, variant_get_field, variant_contains, is_variant_null, variant_normalize, variant_pretty, variant_object_keys, and list/object construct/insert/delete). The crate follows the functions-nested convention and is wired into core via a default-on `variant_expressions` feature, so `SessionContext::new()` registers the UDFs out of the box. Sources, unit tests, and 18 sqllogictest files are ported from the datafusion-variant project; three cross-type expectations (e.g. asking for bool out of an int) that are currently broken in parquet-variant-compute are marked TODO / `#[ignore]`.
alamb
left a comment
There was a problem hiding this comment.
The high level idea here is great. I have a comment on how I think the dependencies should go, but otherwise 👌
Thank you @friendlymatthew
| @@ -42,6 +42,7 @@ workspace = true | |||
| nested_expressions = ["datafusion-functions-nested"] | |||
| # This feature is deprecated. Use the `nested_expressions` feature instead. | |||
| array_expressions = ["nested_expressions"] | |||
| variant_expressions = ["datafusion-functions-variant"] | |||
There was a problem hiding this comment.
see my comments on datafusion-json #21353 (comment)
While appealing from a "no extra config" point of view, I think it is extremely important NOT to make these functions a new dependency of the core datafusion crate -- instead I think we should follow the model of datafusion-functions-spark -- that the dependency goes the other way (datafusion-functions-variatn --> datafusion)
| statement ok | ||
| CREATE TABLE placeholder_zero_insert(x BIGINT NULL); | ||
| CREATE TABLE json_data (id INT, json_str TEXT) AS VALUES |
Drops the optional datafusion-functions-variant dependency from `datafusion` and the `variant_expressions` feature it gated. Following the `datafusion-functions-spark` model, `datafusion-functions-variant` now depends on `datafusion` behind a new `core` feature and exposes a `SessionStateBuilderVariant::with_variant_features` extension trait. The sqllogictest harness pulls in the variant crate with `core` and calls `with_variant_features()` so the existing `variant_*.slt` files keep working. This keeps the core crate's compile time and dep surface smaller, matching the structure used for Spark functions.
03abce1 to
4ea3549
Compare
|
Thank you for opening this pull request! Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch). Details |
CI flagged 17 src files in datafusion-functions-variant missing the
Apache license header and one typo ("somethign" -> "something") in
variant_list_construct.rs.
Which issue does this PR close?
Rationale for this change
This commit brings in a set of UDFs for interacting with Variant data.
datafusion-variantwas a standalone project to quickly develop a set of UDFs around the Variant data type. The core functionality behind the udfs are pretty set, and would be a good time to open it up to the broader Datafusion community