feat: Reduce reallocations in RxStreamOrderer #3003

Draft
larseggert wants to merge 3 commits into mozilla:main from larseggert:feat-inbound_frame-prealloc

Conversation

@larseggert
Collaborator

Let's see if this helps performance.

@codecov

codecov bot commented Sep 22, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.36%. Comparing base (b3d8f0d) to head (cc5c529).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3003      +/-   ##
==========================================
- Coverage   95.66%   93.36%   -2.31%     
==========================================
  Files         123      123              
  Lines       35702    35712      +10     
  Branches    35702    35712      +10     
==========================================
- Hits        34156    33342     -814     
- Misses       1506     1528      +22     
- Partials       40      842     +802     
| Components | Coverage Δ |
| --- | --- |
| neqo-common | 97.31% <ø> (-0.88%) ⬇️ |
| neqo-crypto | 83.31% <ø> (-7.17%) ⬇️ |
| neqo-http3 | 93.32% <ø> (-1.81%) ⬇️ |
| neqo-qpack | 94.14% <ø> (-2.09%) ⬇️ |
| neqo-transport | 94.44% <100.00%> (-2.14%) ⬇️ |
| neqo-udp | 80.48% <ø> (-10.74%) ⬇️ |
| mtu | 85.76% <ø> (-1.74%) ⬇️ |

@github-actions
Contributor

github-actions bot commented Sep 22, 2025

🐰 Bencher Report

Branch: feat-inbound_frame-prealloc
Testbed: On-prem

| Benchmark | Latency (ms) | Result Δ% | Baseline (ms) | Upper Boundary (ms) | Limit % |
| --- | --- | --- | --- | --- | --- |
| google vs. neqo (cubic, paced) | 278.12 | -0.08% | 278.34 | 282.73 | 98.37% |
| msquic vs. neqo (cubic, paced) | 224.72 | +12.76% | 199.30 | 236.94 | 94.84% |
| neqo vs. google (cubic, paced) | 756.26 | -0.45% | 759.69 | 774.82 | 97.61% |
| neqo vs. msquic (cubic, paced) | 156.46 | -0.83% | 157.78 | 160.59 | 97.43% |
| neqo vs. neqo (cubic) | 94.69 | +3.42% | 91.56 | 96.88 | 97.74% |
| neqo vs. neqo (cubic, paced) | 94.16 | +1.35% | 92.90 | 98.09 | 95.99% |
| neqo vs. neqo (reno) | 93.24 | +1.86% | 91.54 | 96.70 | 96.43% |
| neqo vs. neqo (reno, paced) | 95.04 | +2.42% | 92.79 | 97.78 | 97.19% |
| neqo vs. quiche (cubic, paced) | 191.75 | -0.97% | 193.64 | 196.97 | 97.35% |
| neqo vs. s2n (cubic, paced) | 221.72 | +0.26% | 221.14 | 224.10 | 98.94% |
| quiche vs. neqo (cubic, paced) | 157.33 | +2.74% | 153.14 | 158.50 | 99.26% |
| s2n vs. neqo (cubic, paced) | 173.28 | -0.28% | 173.77 | 178.00 | 97.35% |
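The Result Δ% and Limit % figures in this report appear to follow from the other three numbers in each row. The sketch below assumes Δ% is the change relative to the baseline and Limit % is the result as a fraction of the upper boundary; the function names are hypothetical and not part of Bencher, and the reported values match only to within rounding of the displayed inputs.

```rust
/// Hypothetical helper: relative change of `result` versus `baseline`, in percent.
fn delta_percent(result_ms: f64, baseline_ms: f64) -> f64 {
    (result_ms - baseline_ms) / baseline_ms * 100.0
}

/// Hypothetical helper: how much of the upper boundary the result uses, in percent.
fn limit_percent(result_ms: f64, upper_boundary_ms: f64) -> f64 {
    result_ms / upper_boundary_ms * 100.0
}

fn main() {
    // Rows taken from the report: (name, result, baseline, upper boundary).
    let rows = [
        ("google vs. neqo (cubic, paced)", 278.12, 278.34, 282.73),
        ("neqo vs. neqo (cubic)", 94.69, 91.56, 96.88),
    ];
    for (name, result, baseline, upper) in rows {
        println!(
            "{name}: delta {:+.2}%, limit {:.2}%",
            delta_percent(result, baseline),
            limit_percent(result, upper)
        );
    }
}
```

A row stays under its threshold as long as the limit value is below 100%.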
🐰 View full continuous benchmarking report in Bencher

@github-actions
Contributor

Benchmark results

Performance differences relative to b3d8f0d.

1-conn/1-100mb-resp/mtu-1504 (aka. Download)/client: No change in performance detected.
       time:   [200.11 ms 200.47 ms 200.96 ms]
       thrpt:  [497.61 MiB/s 498.82 MiB/s 499.72 MiB/s]
change:
       time:   [−0.0850% +0.1485% +0.4403%] (p = 0.28 > 0.05)
       thrpt:  [−0.4384% −0.1483% +0.0851%]

Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high severe

1-conn/10_000-parallel-1b-resp/mtu-1504 (aka. RPS)/client: No change in performance detected.
       time:   [299.66 ms 301.36 ms 303.08 ms]
       thrpt:  [32.994 Kelem/s 33.183 Kelem/s 33.371 Kelem/s]
change:
       time:   [−0.3116% +0.4839% +1.2104%] (p = 0.21 > 0.05)
       thrpt:  [−1.1959% −0.4816% +0.3126%]

Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) low mild
1 (1.00%) high mild

1-conn/1-1b-resp/mtu-1504 (aka. HPS)/client: No change in performance detected.
       time:   [28.416 ms 28.512 ms 28.630 ms]
       thrpt:  [34.928   B/s 35.073   B/s 35.191   B/s]
change:
       time:   [−0.3750% +0.0906% +0.5737%] (p = 0.71 > 0.05)
       thrpt:  [−0.5704% −0.0905% +0.3764%]

Found 23 outliers among 100 measurements (23.00%)
11 (11.00%) low severe
1 (1.00%) high mild
11 (11.00%) high severe

1-conn/1-100mb-req/mtu-1504 (aka. Upload)/client: 💚 Performance has improved.
       time:   [202.29 ms 202.62 ms 203.01 ms]
       thrpt:  [492.58 MiB/s 493.52 MiB/s 494.33 MiB/s]
change:
       time:   [−3.7342% −3.4882% −3.2606%] (p = 0.00 < 0.05)
       thrpt:  [+3.3705% +3.6142% +3.8791%]

Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high severe

decode 4096 bytes, mask ff: No change in performance detected.
       time:   [11.613 µs 11.651 µs 11.694 µs]
       change: [−0.8027% −0.1804% +0.3326%] (p = 0.57 > 0.05)

Found 18 outliers among 100 measurements (18.00%)
2 (2.00%) low severe
6 (6.00%) low mild
3 (3.00%) high mild
7 (7.00%) high severe

decode 1048576 bytes, mask ff: No change in performance detected.
       time:   [3.0185 ms 3.0278 ms 3.0387 ms]
       change: [−0.8118% −0.1895% +0.3544%] (p = 0.54 > 0.05)

Found 8 outliers among 100 measurements (8.00%)
8 (8.00%) high severe

decode 4096 bytes, mask 7f: No change in performance detected.
       time:   [19.948 µs 19.998 µs 20.056 µs]
       change: [−0.3160% +0.1413% +0.5881%] (p = 0.57 > 0.05)

Found 17 outliers among 100 measurements (17.00%)
1 (1.00%) low severe
1 (1.00%) high mild
15 (15.00%) high severe

decode 1048576 bytes, mask 7f: No change in performance detected.
       time:   [5.0328 ms 5.0426 ms 5.0540 ms]
       change: [−1.2024% −0.5035% +0.0438%] (p = 0.12 > 0.05)

Found 11 outliers among 100 measurements (11.00%)
1 (1.00%) low mild
10 (10.00%) high severe

decode 4096 bytes, mask 3f: No change in performance detected.
       time:   [8.2789 µs 8.3159 µs 8.3580 µs]
       change: [+0.0483% +0.5635% +1.1728%] (p = 0.06 > 0.05)

Found 25 outliers among 100 measurements (25.00%)
2 (2.00%) low severe
8 (8.00%) low mild
3 (3.00%) high mild
12 (12.00%) high severe

decode 1048576 bytes, mask 3f: No change in performance detected.
       time:   [1.5881 ms 1.5949 ms 1.6035 ms]
       change: [−2.0388% −0.4004% +0.7757%] (p = 0.67 > 0.05)

Found 11 outliers among 100 measurements (11.00%)
3 (3.00%) high mild
8 (8.00%) high severe

1-streams/each-1000-bytes/wallclock-time: Change within noise threshold.
       time:   [589.89 µs 591.70 µs 593.79 µs]
       change: [−1.1506% −0.6516% −0.1492%] (p = 0.01 < 0.05)

Found 5 outliers among 100 measurements (5.00%)
5 (5.00%) high severe

1-streams/each-1000-bytes/simulated-time: No change in performance detected.
       time:   [118.79 ms 118.98 ms 119.17 ms]
       thrpt:  [8.1944 KiB/s 8.2076 KiB/s 8.2211 KiB/s]
change:
       time:   [−0.2819% −0.0113% +0.2549%] (p = 0.93 > 0.05)
       thrpt:  [−0.2543% +0.0113% +0.2827%]

Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) low mild

1000-streams/each-1-bytes/wallclock-time: Change within noise threshold.
       time:   [14.032 ms 14.061 ms 14.092 ms]
       change: [−0.9199% −0.6459% −0.3527%] (p = 0.00 < 0.05)

Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild

1000-streams/each-1-bytes/simulated-time: No change in performance detected.
       time:   [14.985 s 15.000 s 15.014 s]
       thrpt:  [66.605 B/s 66.668 B/s 66.732 B/s]
change:
       time:   [−0.1109% +0.0187% +0.1448%] (p = 0.78 > 0.05)
       thrpt:  [−0.1446% −0.0187% +0.1110%]

Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) low mild
1 (1.00%) high mild

1000-streams/each-1000-bytes/wallclock-time: No change in performance detected.
       time:   [50.776 ms 50.939 ms 51.101 ms]
       change: [−0.1426% +0.5306% +1.1027%] (p = 0.10 > 0.05)

1000-streams/each-1000-bytes/simulated-time: No change in performance detected.
       time:   [18.741 s 18.912 s 19.085 s]
       thrpt:  [51.170 KiB/s 51.638 KiB/s 52.110 KiB/s]
change:
       time:   [−0.9411% +0.3276% +1.5536%] (p = 0.62 > 0.05)
       thrpt:  [−1.5298% −0.3265% +0.9501%]

Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild

coalesce_acked_from_zero 1+1 entries: No change in performance detected.
       time:   [88.191 ns 88.538 ns 88.895 ns]
       change: [−0.1959% +0.3684% +1.1270%] (p = 0.28 > 0.05)

Found 10 outliers among 100 measurements (10.00%)
6 (6.00%) high mild
4 (4.00%) high severe

coalesce_acked_from_zero 3+1 entries: No change in performance detected.
       time:   [106.03 ns 106.55 ns 107.30 ns]
       change: [−0.2849% +0.3210% +1.0704%] (p = 0.39 > 0.05)

Found 9 outliers among 100 measurements (9.00%)
9 (9.00%) high severe

coalesce_acked_from_zero 10+1 entries: No change in performance detected.
       time:   [105.58 ns 106.07 ns 106.64 ns]
       change: [−0.4911% −0.0224% +0.4220%] (p = 0.92 > 0.05)

Found 15 outliers among 100 measurements (15.00%)
4 (4.00%) low severe
3 (3.00%) low mild
1 (1.00%) high mild
7 (7.00%) high severe

coalesce_acked_from_zero 1000+1 entries: No change in performance detected.
       time:   [88.723 ns 91.556 ns 98.100 ns]
       change: [−1.4101% +2.7728% +10.065%] (p = 0.60 > 0.05)

Found 14 outliers among 100 measurements (14.00%)
6 (6.00%) high mild
8 (8.00%) high severe

RxStreamOrderer::inbound_frame(): 💚 Performance has improved.
       time:   [102.63 ms 102.79 ms 103.07 ms]
       change: [−7.6514% −7.3735% −7.0581%] (p = 0.00 < 0.05)

Found 10 outliers among 100 measurements (10.00%)
6 (6.00%) low mild
2 (2.00%) high mild
2 (2.00%) high severe

sent::Packets::take_ranges: No change in performance detected.
       time:   [4.5298 µs 4.6645 µs 4.8068 µs]
       change: [−2.0069% +1.3391% +5.2122%] (p = 0.52 > 0.05)

Found 4 outliers among 100 measurements (4.00%)
3 (3.00%) high mild
1 (1.00%) high severe

transfer/pacing-false/varying-seeds/wallclock-time/run: Change within noise threshold.
       time:   [26.951 ms 27.000 ms 27.052 ms]
       change: [+0.9397% +1.2158% +1.4901%] (p = 0.00 < 0.05)

Found 4 outliers among 100 measurements (4.00%)
1 (1.00%) low mild
2 (2.00%) high mild
1 (1.00%) high severe

transfer/pacing-false/varying-seeds/simulated-time/run: No change in performance detected.
       time:   [25.129 s 25.166 s 25.204 s]
       thrpt:  [162.52 KiB/s 162.76 KiB/s 163.00 KiB/s]
change:
       time:   [−0.2319% −0.0321% +0.1692%] (p = 0.76 > 0.05)
       thrpt:  [−0.1689% +0.0321% +0.2325%]

Found 3 outliers among 100 measurements (3.00%)
1 (1.00%) low mild
2 (2.00%) high mild

transfer/pacing-true/varying-seeds/wallclock-time/run: Change within noise threshold.
       time:   [27.309 ms 27.379 ms 27.452 ms]
       change: [+0.4884% +0.8607% +1.2416%] (p = 0.00 < 0.05)

Found 4 outliers among 100 measurements (4.00%)
4 (4.00%) high mild

transfer/pacing-true/varying-seeds/simulated-time/run: Change within noise threshold.
       time:   [24.987 s 25.033 s 25.079 s]
       thrpt:  [163.32 KiB/s 163.62 KiB/s 163.92 KiB/s]
change:
       time:   [+0.0800% +0.3075% +0.5500%] (p = 0.01 < 0.05)
       thrpt:  [−0.5470% −0.3065% −0.0799%]

transfer/pacing-false/same-seed/wallclock-time/run: Change within noise threshold.
       time:   [26.419 ms 26.434 ms 26.449 ms]
       change: [+0.9941% +1.1013% +1.2043%] (p = 0.00 < 0.05)

Found 3 outliers among 100 measurements (3.00%)
3 (3.00%) high mild

transfer/pacing-false/same-seed/simulated-time/run: No change in performance detected.
       time:   [25.152 s 25.152 s 25.152 s]
       thrpt:  [162.85 KiB/s 162.85 KiB/s 162.85 KiB/s]
change:
       time:   [+0.0000% +0.0000% +0.0000%] (p = NaN > 0.05)
       thrpt:  [+0.0000% +0.0000% +0.0000%]

transfer/pacing-true/same-seed/wallclock-time/run: Change within noise threshold.
       time:   [28.142 ms 28.160 ms 28.179 ms]
       change: [+0.0679% +0.1823% +0.2954%] (p = 0.00 < 0.05)

Found 4 outliers among 100 measurements (4.00%)
4 (4.00%) high mild

transfer/pacing-true/same-seed/simulated-time/run: No change in performance detected.
       time:   [25.588 s 25.588 s 25.588 s]
       thrpt:  [160.07 KiB/s 160.07 KiB/s 160.07 KiB/s]
change:
       time:   [+0.0000% +0.0000% +0.0000%] (p = NaN > 0.05)
       thrpt:  [+0.0000% +0.0000% +0.0000%]

Download data for profiler.firefox.com or download performance comparison data.

@larseggert
Collaborator Author

Hm. Transfer test regression, but one bench shows an improvement. Time to look at flamegraphs...

@larseggert
Collaborator Author

Hm. Simplifying inbound_frame drastically doesn't seem to make things slower.

@larseggert
Collaborator Author

So what we save in memcpy we spend on malloc now :-)
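One way to read that remark is as the usual trade-off between growing a buffer on demand and allocating it up front. The sketch below is not the neqo code: `count_reallocations` is a hypothetical helper that appends fixed-size chunks to a `Vec<u8>` and counts capacity changes as a proxy for reallocations. Growing from an empty vector reallocates (and typically copies) several times, while `Vec::with_capacity` pays for one larger allocation at the start.

```rust
/// Hypothetical helper, not part of neqo: appends `chunks` chunks of
/// `chunk_len` bytes to `buf` and counts how often the capacity changes.
/// Each capacity change stands for a (re)allocation that may copy the data.
fn count_reallocations(mut buf: Vec<u8>, chunks: usize, chunk_len: usize) -> usize {
    let chunk = vec![0u8; chunk_len];
    let mut reallocations = 0;
    let mut capacity = buf.capacity();
    for _ in 0..chunks {
        buf.extend_from_slice(&chunk);
        if buf.capacity() != capacity {
            reallocations += 1;
            capacity = buf.capacity();
        }
    }
    reallocations
}

fn main() {
    // Illustrative sizes only; they are not taken from the PR.
    let (chunks, chunk_len) = (1_000, 1_200);

    // Grown on demand: capacity is increased step by step.
    let grown = count_reallocations(Vec::new(), chunks, chunk_len);

    // Preallocated: one up-front allocation, no growth afterwards.
    let prealloc =
        count_reallocations(Vec::with_capacity(chunks * chunk_len), chunks, chunk_len);

    println!("grown on demand: {grown} capacity changes");
    println!("preallocated:    {prealloc} capacity changes");
}
```

Whether the up-front allocation is cheaper than the avoided copies depends on the workload, which is why a flamegraph is the natural next step.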

@codspeed-hq

codspeed-hq bot commented Nov 12, 2025

CodSpeed Performance Report

Merging #3003 will improve performance by 17.7%

Comparing larseggert:feat-inbound_frame-prealloc (102a412) with main (b9c32c7)

Summary

⚡ 1 improvement
✅ 22 untouched

Benchmarks breakdown

| Mode | Benchmark | BASE | HEAD | Change |
| --- | --- | --- | --- | --- |
| Simulation | client | 852.3 ms | 724.1 ms | +17.7% |
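The +17.7% figure is consistent with a speedup ratio (BASE divided by HEAD, minus one) rather than with the reduction in run time relative to BASE. The sketch below shows both readings; the function names are hypothetical, and that CodSpeed computes the value this way is an inference from the two numbers in the table, not something the report states.

```rust
/// Hypothetical helper: speedup of head over base, as (base / head - 1) in percent.
fn speedup_percent(base_ms: f64, head_ms: f64) -> f64 {
    (base_ms / head_ms - 1.0) * 100.0
}

/// Hypothetical helper: run-time reduction relative to base, in percent.
fn time_reduction_percent(base_ms: f64, head_ms: f64) -> f64 {
    (base_ms - head_ms) / base_ms * 100.0
}

fn main() {
    let (base_ms, head_ms) = (852.3, 724.1);
    // Prints 17.7, matching the reported change.
    println!("speedup:        {:.1}%", speedup_percent(base_ms, head_ms));
    // Prints 15.0, the same improvement expressed as time saved.
    println!("time reduction: {:.1}%", time_reduction_percent(base_ms, head_ms));
}
```

Both numbers describe the same measurement, so the percentage is only comparable to the criterion results above once the convention is known.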

@github-actions
Contributor

Failed Interop Tests

QUIC Interop Runner, client vs. server, differences relative to b9c32c7.

neqo-latest as client

neqo-latest as server

All results

Succeeded Interop Tests

QUIC Interop Runner, client vs. server

neqo-latest as client

neqo-latest as server

Unsupported Interop Tests

QUIC Interop Runner, client vs. server

neqo-latest as client

neqo-latest as server

@github-actions
Contributor

Client/server transfer results

Performance differences relative to b9c32c7.

Transfer of 33554432 bytes over loopback, min. 100 runs. All unit-less numbers are in milliseconds.

| Client vs. server (params) | Mean ± σ | Min | Max | MiB/s ± σ | Δ main (ms) | Δ main (%) |
| --- | --- | --- | --- | --- | --- | --- |
| google vs. google | 455.4 ± 4.4 | 450.0 | 466.7 | 70.3 ± 7.3 | | |
| google vs. neqo (cubic, paced) | 278.1 ± 4.5 | 268.8 | 286.8 | 115.1 ± 7.1 | 1.2 | 0.4% |
| msquic vs. msquic | 187.8 ± 63.8 | 143.9 | 407.4 | 170.4 ± 0.5 | | |
| msquic vs. neqo (cubic, paced) | 224.7 ± 60.4 | 159.9 | 394.8 | 142.4 ± 0.5 | 10.1 | 4.7% |
| neqo vs. google (cubic, paced) | 756.3 ± 4.2 | 750.2 | 770.4 | 42.3 ± 7.6 | -0.3 | -0.0% |
| neqo vs. msquic (cubic, paced) | 156.5 ± 4.7 | 149.4 | 173.0 | 204.5 ± 6.8 | -0.9 | -0.6% |
| neqo vs. neqo (cubic) | 94.7 ± 4.8 | 85.2 | 107.6 | 337.9 ± 6.7 | 1.3 | 1.4% |
| neqo vs. neqo (cubic, paced) | 94.2 ± 4.3 | 85.9 | 103.4 | 339.9 ± 7.4 | 0.1 | 0.1% |
| neqo vs. neqo (reno) | 93.2 ± 4.6 | 85.7 | 102.7 | 343.2 ± 7.0 | -0.9 | -1.0% |
| neqo vs. neqo (reno, paced) | 95.0 ± 4.3 | 88.2 | 105.4 | 336.7 ± 7.4 | 0.1 | 0.1% |
| neqo vs. quiche (cubic, paced) | 191.7 ± 4.2 | 186.1 | 203.0 | 166.9 ± 7.6 | 💚 -3.1 | -1.6% |
| neqo vs. s2n (cubic, paced) | 221.7 ± 4.6 | 213.5 | 234.4 | 144.3 ± 7.0 | 💔 1.4 | 0.6% |
| quiche vs. neqo (cubic, paced) | 157.3 ± 4.8 | 145.7 | 170.4 | 203.4 ± 6.7 | 0.3 | 0.2% |
| quiche vs. quiche | 145.3 ± 4.3 | 138.9 | 159.7 | 220.2 ± 7.4 | | |
| s2n vs. neqo (cubic, paced) | 173.3 ± 4.8 | 162.7 | 180.7 | 184.7 ± 6.7 | 0.8 | 0.4% |
| s2n vs. s2n | 251.9 ± 28.5 | 232.0 | 350.0 | 127.0 ± 1.1 | | |
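The MiB/s figures can be cross-checked against the mean times: 33554432 bytes is exactly 32 MiB, so throughput is 32 MiB divided by the mean transfer time. The sketch below is a rough check under that assumption; `throughput_mib_per_s` is a hypothetical helper, and the report may derive its throughput values per run rather than from the mean.

```rust
/// Bytes transferred per run, as stated in the report (32 MiB).
const TRANSFER_BYTES: f64 = 33_554_432.0;
const BYTES_PER_MIB: f64 = 1024.0 * 1024.0;

/// Hypothetical helper: throughput in MiB/s for a transfer that took `mean_ms`.
fn throughput_mib_per_s(mean_ms: f64) -> f64 {
    (TRANSFER_BYTES / BYTES_PER_MIB) / (mean_ms / 1000.0)
}

fn main() {
    // Mean times taken from the table: (pairing, mean in ms).
    let rows = [("neqo vs. neqo (cubic)", 94.7), ("google vs. google", 455.4)];
    for (name, mean_ms) in rows {
        println!("{name}: {:.1} MiB/s", throughput_mib_per_s(mean_ms));
    }
}
```

For these two rows the computed values agree with the reported 337.9 and 70.3 MiB/s.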

Download data for profiler.firefox.com or download performance comparison data.
