perf(arrow-ipc): Add writer benchmarks for dictionaries by JakeDern · Pull Request #10122 · apache/arrow-rs

JakeDern · 2026-06-11T16:51:08Z

Which issue does this PR close?

Closes arrow-ipc: Extend writer benchmarks to include dictionaries #10119

Rationale for this change

This PR adds writer benchmarks for dictionaries so that we can measure the performance impact of code changes on those code paths.

What changes are included in this PR?

Three new benchmarks:

StreamWriter benchmark for dictionaries
StreamWriter benchmark for delta dictionaries
FileWriter benchmark for delta dictionaries

Are these changes tested?

Yes. Only benchmarks are included, and I ran them locally.

Are there any user-facing changes?

No.

JakeDern · 2026-06-11T16:57:35Z

CC: @alamb @Rich-T-kid

Rich-T-kid · 2026-06-11T18:11:35Z

taking a look 👀

Rich-T-kid

@JakeDern the PR looks good to me. I have some comments, but they aren't blocking. 🚀 Nice work.

Rich-T-kid · 2026-06-11T18:35:06Z

+    for i in 0..n {
+        let mut builder = StringDictionaryBuilder::<UInt32Type>::new();
+        for r in 0..num_rows {
+            builder.append_value(format!("batch {i} value {}", r % (num_rows / 2)));


For an 8k batch size there will be 4k unique values, and for 64k there will be 32k unique values. I don't think this alone is the right way to benchmark dictionaries. Dictionaries are focused on low cardinality, so it makes more sense to parameterize benchmarks by target cardinality. For example, (5%, 10%, 25%, 50%) unique values relative to batch size.
The implementation should also grow with the number of unique values, since that's the point of the IPC format in delta mode:

A dictionary batch with isDelta set indicates that its vector should be concatenated with those of any previous batches with the same id.

Varying cardinality helps detect whether the encoder is doing O(N) work (proportional to total rows) when it should be doing O(K) work (proportional to unique values).
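As a rough illustration of the suggestion above, the value generator could take a target cardinality instead of hard-coding `num_rows / 2`. This is a std-only sketch: `batch_values` and `cardinality` are hypothetical helper names that are not part of the PR, and the percentage-based parameter is an assumption about how such a benchmark might be parameterized.

```rust
use std::collections::HashSet;

/// Builds the string values for one batch so that roughly `unique_pct`
/// percent of the rows are distinct. Hypothetical helper, not from the PR.
fn batch_values(batch: usize, num_rows: usize, unique_pct: usize) -> Vec<String> {
    // Number of distinct values in this batch; at least 1 to avoid `% 0`.
    let unique = (num_rows * unique_pct / 100).max(1);
    (0..num_rows)
        .map(|r| format!("batch {batch} value {}", r % unique))
        .collect()
}

/// Counts distinct values, i.e. the dictionary size K for the batch.
fn cardinality(values: &[String]) -> usize {
    values.iter().collect::<HashSet<_>>().len()
}
```

In a benchmark, one could loop over percentages such as 5, 10, 25 and 50, and feed each generated value to `StringDictionaryBuilder::append_value` as the diff above already does.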

Thank you for the thoughtful review!

Varying cardinality helps detect whether the encoder is doing O(N) work (proportional to total rows) when it should be doing O(K) work (proportional to unique values).

I would clarify that we are doing O(K + N) work: there is still per-row overhead in encoding the dictionary keys, as well as a per-unique-value cost in encoding the dictionary values.
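The O(K + N) point can be seen in a toy dictionary encoder. This is a minimal std-only sketch, not the arrow-ipc writer: `dictionary_encode` is a hypothetical function that emits one key per row (the N term) and one stored value per distinct string (the K term).

```rust
use std::collections::HashMap;

/// Dictionary-encodes `rows`: returns one key per row (N keys) and one
/// entry per distinct value (K values), in first-seen order.
/// Illustrative only; this is not how arrow-ipc is implemented.
fn dictionary_encode(rows: &[&str]) -> (Vec<u32>, Vec<String>) {
    let mut index: HashMap<&str, u32> = HashMap::new();
    let mut values: Vec<String> = Vec::new();
    let mut keys: Vec<u32> = Vec::with_capacity(rows.len());
    for &row in rows {
        // Per-row work: look up (or assign) the key for this row.
        let key = *index.entry(row).or_insert_with(|| {
            // Per-unique-value work: store the value once.
            values.push(row.to_string());
            (values.len() - 1) as u32
        });
        keys.push(key);
    }
    (keys, values)
}
```

Both output vectors have to be written, so the cost depends on the row count as well as on the number of unique values.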

I agree that benchmarks varying the number of unique values could yield useful information, but these simple benchmarks can still tell us whether we've meaningfully changed the amount of work required to emit dictionary batches.

Until those additional parameters are needed to demonstrate something the current set cannot, I favor the simplicity of omitting them.

Just my .02 of course, happy to take more feedback on this.

makes sense to me.

Rich-T-kid · 2026-06-11T18:44:54Z

+    for i in 0..n {
+        // 3/4 of the rows reuse values shared by every batch, the other 1/4
+        // introduce values unique to this batch which extends the dictionary.
+        for r in 0..num_rows {


Similar comment to before.
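For context, the layout described in the diff comment (3/4 shared rows, 1/4 rows unique to the batch) and the delta concatenation rule quoted earlier can be sketched with std-only code. The helper names and the exact value strings are assumptions made for illustration; the PR's actual generator is only partially visible in the diff.

```rust
use std::collections::HashSet;

/// Values for one batch: the first 3/4 of the rows use values shared by
/// every batch, the last 1/4 use values unique to this batch.
/// Hypothetical layout; the PR's real generator may differ.
fn delta_batch_values(batch: usize, num_rows: usize) -> Vec<String> {
    let shared = num_rows * 3 / 4;
    (0..num_rows)
        .map(|r| {
            if r < shared {
                format!("shared value {r}")
            } else {
                format!("batch {batch} value {r}")
            }
        })
        .collect()
}

/// Appends values not yet in `dictionary` and returns how many were added,
/// i.e. the number of entries a delta dictionary batch would carry.
fn extend_dictionary(
    dictionary: &mut Vec<String>,
    seen: &mut HashSet<String>,
    values: &[String],
) -> usize {
    let before = dictionary.len();
    for v in values {
        if seen.insert(v.clone()) {
            dictionary.push(v.clone());
        }
    }
    dictionary.len() - before
}
```

With this layout, the first batch contributes every value, while each later batch only adds its unique quarter, which is the growth pattern a delta dictionary benchmark is meant to exercise.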

Rich-T-kid · 2026-06-12T03:27:19Z

A possible minor-to-medium optimization for the uncompressed case came to mind, and I remembered that the current benchmarks don't cover it.
@JakeDern what do you think about adding a case identical to

group.bench_function("StreamWriter/write_10/zstd", |b| {
        let batch = create_batch(8192, true);
        let mut buffer = Vec::with_capacity(2 * 1024 * 1024);
        b.iter(move || {
            buffer.clear();
            let options = IpcWriteOptions::default()
                .try_with_compression(Some(CompressionType::ZSTD))
                .unwrap();
            let mut writer =
                StreamWriter::try_new_with_options(&mut buffer, batch.schema().as_ref(), options)
                    .unwrap();
            for _ in 0..10 {
                writer.write(&batch).unwrap();
            }
            writer.finish().unwrap();
        })
    });

but with try_with_compression set to None? It may make sense to refactor the StreamWriter benchmarks to iterate over the possible compression types.
I would add it myself, but that could cause merge conflicts with this change. If not, that's fine; I can submit a PR.

JakeDern · 2026-06-12T14:34:16Z

I think this benchmark is covering that case:

arrow-rs/arrow-ipc/benches/ipc_writer.rs

Lines 29 to 41 in c4a831a

group.bench_function("StreamWriter/write_10", |b| {
        let batch = create_batch(8192, true);
        let mut buffer = Vec::with_capacity(2 * 1024 * 1024);
        b.iter(move || {
            buffer.clear();
            let mut writer = StreamWriter::try_new(&mut buffer, batch.schema().as_ref()).unwrap();
            for _ in 0..10 {
                writer.write(&batch).unwrap();
            }
            writer.finish().unwrap();
        })
    });

The default StreamWriter::try_new uses the default IpcWriteOptions which sets batch_compression_type: None. Let me know if I'm misunderstanding though.

Rich-T-kid · 2026-06-12T14:38:49Z

Yeah, you're right (source).

Add benches

71e5f8f

github-actions Bot added the arrow Changes to the arrow crate label Jun 11, 2026

JakeDern marked this pull request as ready for review June 11, 2026 16:52

Rich-T-kid approved these changes Jun 11, 2026

JakeDern mentioned this pull request Jun 11, 2026

perf(arrow-ipc): Avoid copies and write dictionary batches directly to writers when possible #10128

Draft

JakeDern wants to merge 1 commit into apache:main from JakeDern:ipc-writer-dict-benches