string_agg: introduce deferred copying for GroupsAccumulator with mixed eager/deferred paths#21469
Draft
kosiew wants to merge 13 commits intoapache:mainfrom
Draft
string_agg: introduce deferred copying for GroupsAccumulator with mixed eager/deferred paths#21469kosiew wants to merge 13 commits intoapache:mainfrom
kosiew wants to merge 13 commits intoapache:mainfrom
Conversation
Optimize StringAggGroupsAccumulator to retain input and state batches with metadata instead of building a Vec<Option<String>> on every update. Assemble concatenated strings lazily in evaluate() and state(). Adjust size() to reflect retained arrays and metadata. Support EmitTo::First(n) by emitting the required prefix and renumbering retained groups. Include note for future mixed-batch compaction work.
Remove unnecessary &mut self from append_rows. Consolidate repeated string-append loop into a typed private helper using ArrayAccessor. Eliminate redundant runtime null checks in favor of non-null entry invariant with debug_assert!. Simplify retain_after_emit into a single filter-and-renumber pass. Trim local ceremony in evaluate() and state() for clarity.
Consolidate string-like array routing through a single StringInputArray abstraction to improve maintainability. Rename the slot appender to append_group_value for better readability of the lazy-assembly path.
Update append_rows_typed and append_batch_values_typed to accept array references instead of values. Modify call sites in StringInputArray to pass references, improving memory efficiency and consistency across function calls.
Adjust string_agg to implement a hybrid accumulator, offering eager updates for lightweight workloads and switching to deferred row tracking for larger batches. This change enhances performance while maintaining efficiency. Included mixed-mode regression tests to cover various batch scenarios and ensure correctness.
Contributor
Author
|
Benchmark (#21437) |
51ac58a to
baa8054
Compare
Eliminate repeated match arms in string_agg.rs by introducing a local dispatch macro. This enhances clarity and readability, allowing each method to focus on intent while simplifying maintenance for future changes. The refactor preserves existing static dispatch behavior, ensuring that all targeted tests continue to pass.
Rework the string_agg accumulator to initiate in eager mode, reducing unnecessary allocations. Restore an eager append helper for the hot path and enhance promotion logic to use lightweight size estimates from Arrow buffers. This allows short payloads to remain on the eager path while enabling deferred copying for larger batches. Add regression tests to ensure short payloads do not promote and mixed eager/deferred batches operate correctly.
80352ce to
ca5d685
Compare
… todo - Revised comment to indicate a future task for compacting mixed batches in the StringAggGroupsAccumulator implementation.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
string_aggGroupsAccumulator#21156Rationale for this change
The current
StringAggGroupsAccumulatoreagerly copies string data duringupdate_batch, which can lead to unnecessary memory duplication and increased CPU overhead—especially for large payloads and high-cardinality group sets.This PR introduces a hybrid approach that selectively defers copying by retaining references to input batches and materializing results only during
evaluate(). This aligns with the proposal in #21156 while balancing implementation complexity and performance.Key motivations:
What changes are included in this PR?
1. Hybrid accumulation strategy
Introduces eager and deferred execution paths
Uses heuristics to decide when to switch:
DEFER_GROUP_THRESHOLD(number of groups)DEFER_PAYLOAD_LEN_THRESHOLD(average string size)2. Deferred storage model
Adds
DeferredRowsstructure:ArrayRefs (Arc-backed)(group_idx, row_idx)pairs instead of copying strings3. Lazy materialization
evaluate()4. Efficient batch handling
Introduces
StringInputArrayabstraction to unify:Utf8LargeUtf8Utf8View5. Partial emit support
Handles
EmitTo::First(n)correctly:6. Memory accounting improvements
size()now includes:7. Refactoring and utilities
Extracted helpers:
append_batch_typedappend_batch_values_typedappend_rows_typedAdded
clear_state()for accurate memory reset8. Tests
Added coverage for:
Are these changes tested?
Yes.
New tests include:
groups_mixed_eager_and_deferred_batchesgroups_short_payloads_do_not_promote_to_deferredThese tests verify:
Are there any user-facing changes?
No direct user-facing API changes.
However, users may observe:
No breaking changes are introduced.
LLM-generated code disclosure
This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.