perf(arrow-ipc): Add writer benchmarks for dictionaries#10122
Conversation
|
CC: @alamb @Rich-T-kid |
|
taking a look 👀 |
Rich-T-kid
left a comment
There was a problem hiding this comment.
@JakeDern the PR looks good to me. I have some comments but they arent blocking. 🚀 nice work
| for i in 0..n { | ||
| let mut builder = StringDictionaryBuilder::<UInt32Type>::new(); | ||
| for r in 0..num_rows { | ||
| builder.append_value(format!("batch {i} value {}", r % (num_rows / 2))); |
There was a problem hiding this comment.
For an 8k batch size there will be 4k unique values, and for 64k there will be 32k unique values. I don't think this alone is the right way to benchmark dictionaries. Dictionaries are focused on low cardinality, so it makes more sense to parameterize benchmarks by target cardinality. For example, (5%, 10%, 25%, 50%) unique values relative to batch size.
The implementation should also grow with the number of unique values, since that's the point of the IPC format in delta mode:
A dictionary batch with isDelta set indicates that its vector should be concatenated with those of any previous batches with the same id.
Varying cardinality helps detect whether the encoder is doing O(N) work (proportional to total rows) when it should be doing O(K) work (proportional to unique values).
There was a problem hiding this comment.
Thank you for the thoughtful review!
Varying cardinality helps detect whether the encoder is doing O(N) work (proportional to total rows) when it should be doing O(K) work (proportional to unique values).
I would clarify that we are doing O(K + N) work as there is still per row overhead in encoding the dictionary keys as well as per unique value cost in encoding the dictionary values.
I agree that benchmarks varying the amount of unique values could yield useful information, but these simple benchmarks can still answer whether we've meaningfully changed the amount of work required to emit dictionary batches.
I think until those additional parameters are needed to demonstrate something the current set cannot, I favor the simplicity of omitting them.
Just my .02 of course, happy to take more feedback on this.
| for i in 0..n { | ||
| // 3/4 of the rows reuse values shared by every batch, the other 1/4 | ||
| // introduce values unique to this batch which extends the dictionary. | ||
| for r in 0..num_rows { |
There was a problem hiding this comment.
Similar comment to before.
|
a possible minor/medium optimization for the uncompressed case came to mind & I remembered the current benchmarks dont cover this. but with |
I think this benchmark is covering that case: arrow-rs/arrow-ipc/benches/ipc_writer.rs Lines 29 to 41 in c4a831a The default |
|
yea your right, source. |
Which issue does this PR close?
Rationale for this change
This PR adds writer benchmarks for dictionaries so that we can measure the performance impact of code changes on those code paths.
What changes are included in this PR?
Three new benchmarks:
Are these changes tested?
Yes, just benchmarks included which I ran locally.
Are there any user-facing changes?
No.