perf: preserve dictionary encoding for lower/upper to avoid materializing low-cardinality columns#22905
perf: preserve dictionary encoding for lower/upper to avoid materializing low-cardinality columns#22905lyne7-sc wants to merge 2 commits into
Conversation
|
Thank you for opening this pull request! Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch). Details |
Jefffrey
left a comment
There was a problem hiding this comment.
just an initial comment, hope to fully review it soon
| /// Controls whether a [`Coercion`] preserves an argument's physical encoding | ||
| /// (e.g. dictionary) instead of materializing it to the coerced value type. | ||
| #[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Hash)] | ||
| pub enum EncodingPreservation { | ||
| /// Do not request preservation of a physical encoding. | ||
| None, | ||
| /// Preserve dictionary encoding and coerce only the dictionary values. | ||
| Dictionary, | ||
| } |
There was a problem hiding this comment.
Something to consider is if an enum is best suited for this 🤔
For example, if we want to also include run encoded arrays as well (since they are similar to dictionaries), would this mean two new variants, one just for run arrays and one for dictionary + run arrays?
- I don't know if arrow is planning to include any more types of encodings like this at the moment
So I was also thinking perhaps a bitflag approach (or just a struct with boolean flags) could be another approach:
struct Encoding {
preserve_dictionary: bool,
preserve_run: bool,
}But I don't know how common a use case would be to implement preservation only for dictionaries and not run arrays (or vice versa) so maybe thats overengineering 🤔
Which issue does this PR close?
CoercionAPI #19458Rationale for this change
When a
Dictionary(K, Utf8)column is passed to a string scalar function, the type-coercion layer currently materializes it to flat Utf8/Utf8View before the function runs, so the operation is applied to every row instead of just the unique dictionary values, and the dictionary encoding is lost on the output. See #19458 for the underlying coercion behavior and #20935 for the string-function-specific impact.This is wasteful for low-cardinality columns and inflates Arrow IPC/Flight message sizes downstream.
What changes are included in this PR?
EncodingPreservation { None, Dictionary }and opt-in constructors onCoercion(new_exact_preserving_encoding,with_encoding_preservation).get_valid_types, when aCoerciblearg requestsDictionarypreservation, run coercion against the dictionary's value type and re-wrap the result asDictionary(K, V'), so the function receives aDictionaryArray.lower/upperopt in and handle dictionary inputs (array + scalar) by converting only the dictionary values and re-wrapping with the original keys.Are these changes tested?
Yes
Are there any user-facing changes?
Yes. upper/lower now return Dictionary(...) for dictionary inputs instead of the previously materialized Utf8View. New public API (EncodingPreservation, new Coercion constructors) is added — please add the api change label.