Skip to content

perf: preserve dictionary encoding for lower/upper to avoid materializing low-cardinality columns#22905

Open
lyne7-sc wants to merge 2 commits into
apache:mainfrom
lyne7-sc:perf/dictionary_scalar
Open

perf: preserve dictionary encoding for lower/upper to avoid materializing low-cardinality columns#22905
lyne7-sc wants to merge 2 commits into
apache:mainfrom
lyne7-sc:perf/dictionary_scalar

Conversation

@lyne7-sc

Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Rationale for this change

When a Dictionary(K, Utf8) column is passed to a string scalar function, the type-coercion layer currently materializes it to flat Utf8/Utf8View before the function runs, so the operation is applied to every row instead of just the unique dictionary values, and the dictionary encoding is lost on the output. See #19458 for the underlying coercion behavior and #20935 for the string-function-specific impact.

This is wasteful for low-cardinality columns and inflates Arrow IPC/Flight message sizes downstream.

What changes are included in this PR?

  • Add EncodingPreservation { None, Dictionary } and opt-in constructors on Coercion (new_exact_preserving_encoding, with_encoding_preservation).
  • In get_valid_types, when a Coercible arg requests Dictionary preservation, run coercion against the dictionary's value type and re-wrap the result as Dictionary(K, V'), so the function receives a DictionaryArray.
  • Make lower/upper opt in and handle dictionary inputs (array + scalar) by converting only the dictionary values and re-wrapping with the original keys.

Are these changes tested?

Yes

Are there any user-facing changes?

Yes. upper/lower now return Dictionary(...) for dictionary inputs instead of the previously materialized Utf8View. New public API (EncodingPreservation, new Coercion constructors) is added — please add the api change label.

@github-actions github-actions Bot added logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation labels Jun 11, 2026
@lyne7-sc

Copy link
Copy Markdown
Contributor Author

Hi @Jefffrey, following up on #19458. I prototyped one approach and wanted to check it with you before going further. Would you be interested in taking a look?

@github-actions

Copy link
Copy Markdown

Thank you for opening this pull request!

Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

Details
     Cloning apache/main
    Building datafusion-expr v54.0.0 (current)
       Built [  25.339s] (current)
     Parsing datafusion-expr v54.0.0 (current)
      Parsed [   0.069s] (current)
    Building datafusion-expr v54.0.0 (baseline)
       Built [  25.425s] (baseline)
     Parsing datafusion-expr v54.0.0 (baseline)
      Parsed [   0.068s] (baseline)
    Checking datafusion-expr v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   1.578s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [  53.617s] datafusion-expr
    Building datafusion-expr-common v54.0.0 (current)
       Built [  18.149s] (current)
     Parsing datafusion-expr-common v54.0.0 (current)
      Parsed [   0.017s] (current)
    Building datafusion-expr-common v54.0.0 (baseline)
       Built [  17.931s] (baseline)
     Parsing datafusion-expr-common v54.0.0 (baseline)
      Parsed [   0.017s] (baseline)
    Checking datafusion-expr-common v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.264s] 223 checks: 222 pass, 1 fail, 0 warn, 30 skip

--- failure enum_struct_variant_field_added: pub enum struct variant field added ---

Description:
An enum's exhaustive struct variant has a new field, which has to be included when constructing or matching on this variant.
        ref: https://doc.rust-lang.org/reference/attributes/type_system.html#the-non_exhaustive-attribute
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/enum_struct_variant_field_added.ron

Failed in:
  field encoding_preservation of variant Coercion::Exact in /home/runner/work/datafusion/datafusion/datafusion/expr-common/src/signature.rs:1058
  field encoding_preservation of variant Coercion::Implicit in /home/runner/work/datafusion/datafusion/datafusion/expr-common/src/signature.rs:1068

     Summary semver requires new major version: 1 major and 0 minor checks failed
    Finished [  37.073s] datafusion-expr-common
    Building datafusion-functions v54.0.0 (current)
       Built [  28.765s] (current)
     Parsing datafusion-functions v54.0.0 (current)
      Parsed [   0.076s] (current)
    Building datafusion-functions v54.0.0 (baseline)
       Built [  28.261s] (baseline)
     Parsing datafusion-functions v54.0.0 (baseline)
      Parsed [   0.077s] (baseline)
    Checking datafusion-functions v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.453s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [  58.788s] datafusion-functions
    Building datafusion-sqllogictest v54.0.0 (current)
       Built [ 160.186s] (current)
     Parsing datafusion-sqllogictest v54.0.0 (current)
      Parsed [   0.022s] (current)
    Building datafusion-sqllogictest v54.0.0 (baseline)
       Built [ 159.304s] (baseline)
     Parsing datafusion-sqllogictest v54.0.0 (baseline)
      Parsed [   0.022s] (baseline)
    Checking datafusion-sqllogictest v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.101s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [ 322.225s] datafusion-sqllogictest

@github-actions github-actions Bot added the auto detected api change Auto detected API change label Jun 11, 2026

@Jefffrey Jefffrey left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just an initial comment, hope to fully review it soon

Comment on lines +1072 to +1080
/// Controls whether a [`Coercion`] preserves an argument's physical encoding
/// (e.g. dictionary) instead of materializing it to the coerced value type.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Hash)]
pub enum EncodingPreservation {
/// Do not request preservation of a physical encoding.
None,
/// Preserve dictionary encoding and coerce only the dictionary values.
Dictionary,
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something to consider is if an enum is best suited for this 🤔

For example, if we want to also include run encoded arrays as well (since they are similar to dictionaries), would this mean two new variants, one just for run arrays and one for dictionary + run arrays?

  • I don't know if arrow is planning to include any more types of encodings like this at the moment

So I was also thinking perhaps a bitflag approach (or just a struct with boolean flags) could be another approach:

struct Encoding {
    preserve_dictionary: bool,
    preserve_run: bool,
}

But I don't know how common a use case would be to implement preservation only for dictionaries and not run arrays (or vice versa) so maybe thats overengineering 🤔

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto detected api change Auto detected API change functions Changes to functions implementation logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants