feat(http3): optimize stream sending to avoid blocked streams #3015

larseggert wants to merge 5 commits into mozilla:main from fix-261
Conversation
…ed streams

Implements optimizations to address wasteful attempts to send data on flow-control blocked streams and fixes iteration patterns that could lose streams during re-addition.

Key improvements:

- Add a `blocked_streams` HashSet to track flow-control blocked streams separately
- Modify `stream_has_pending_data()` to skip adding blocked streams to the pending queue
- Replace `mem::take()` with single-stream iteration to handle re-additions properly
- Add a `check_blocked_streams()` method to automatically move unblocked streams back
- Use `HashSet::retain()` for zero-allocation in-place stream filtering
- Clean up both the pending and blocked sets when streams are removed

Benefits:

- Reduces CPU usage by eliminating repeated attempts on blocked streams
- Improves responsiveness by automatically detecting when streams become unblocked

Closes mozilla#261
Pull Request Overview
This PR optimizes HTTP/3 stream sending by introducing flow-control aware stream management to avoid repeatedly attempting to send data on blocked streams.
- Adds a separate `blocked_streams` HashSet to track flow-control blocked streams
- Implements a `check_blocked_streams()` method to automatically detect when blocked streams become unblocked
- Refactors stream iteration from `mem::take()` to single-stream processing to handle re-additions properly
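The bookkeeping described above can be sketched as follows. This is a minimal illustration with hypothetical names mirroring the PR description (`SendTracker`, `mark_blocked`, `mark_unblocked`), not the actual neqo types:

```rust
use std::collections::HashSet;

/// Hypothetical simplification of the two-set tracking described in the PR.
#[derive(Default)]
struct SendTracker {
    streams_with_pending_data: HashSet<u64>,
    blocked_streams: HashSet<u64>,
}

impl SendTracker {
    /// A stream with data to send goes to the pending set, unless it is
    /// known to be flow-control blocked.
    fn stream_has_pending_data(&mut self, stream_id: u64) {
        if !self.blocked_streams.contains(&stream_id) {
            self.streams_with_pending_data.insert(stream_id);
        }
    }

    /// When sending hits a flow-control limit, move the stream aside so
    /// later send passes skip it.
    fn mark_blocked(&mut self, stream_id: u64) {
        self.streams_with_pending_data.remove(&stream_id);
        self.blocked_streams.insert(stream_id);
    }

    /// When flow control opens up, move the stream back to pending.
    fn mark_unblocked(&mut self, stream_id: u64) {
        if self.blocked_streams.remove(&stream_id) {
            self.streams_with_pending_data.insert(stream_id);
        }
    }

    /// Both sets are cleaned up when a stream is removed.
    fn remove_stream(&mut self, stream_id: u64) {
        self.streams_with_pending_data.remove(&stream_id);
        self.blocked_streams.remove(&stream_id);
    }
}

fn main() {
    let mut t = SendTracker::default();
    t.stream_has_pending_data(0);
    t.mark_blocked(0);
    // A blocked stream is skipped when it reports pending data again.
    t.stream_has_pending_data(0);
    assert!(!t.streams_with_pending_data.contains(&0));
    t.mark_unblocked(0);
    assert!(t.streams_with_pending_data.contains(&0));
    t.remove_stream(0);
    assert!(t.streams_with_pending_data.is_empty());
}
```

The key property is that a blocked stream never re-enters the pending set until something explicitly unblocks it, which is what eliminates the repeated failed send attempts.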
Codecov Report

❌ Patch coverage is …

Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #3015      +/-   ##
==========================================
- Coverage   93.41%   93.34%    -0.07%
==========================================
  Files         124      124
  Lines       36234    36350      +116
==========================================
+ Hits        33847    33930       +83
- Misses       1540     1575       +35
+ Partials      847      845        -2
…²) performance

## Summary

Implements optimizations to eliminate wasteful attempts to send data on flow-control blocked streams and addresses performance issues identified during code review.

## Key Improvements

### 1. Blocked Stream Tracking

- Add `blocked_streams` HashSet to track flow-control blocked streams separately
- Modify `stream_has_pending_data()` to skip adding blocked streams to the pending queue
- Add `check_blocked_streams()` method to automatically move unblocked streams back to pending
- Clean up both pending and blocked sets when streams are removed

### 2. Performance Optimizations

- Use the `mem::take()` pattern in `send_non_control_streams()` to avoid O(n²) iteration
- Implement batching in `check_blocked_streams()` to avoid expensive per-stream calls during iteration
- Properly handle streams that get re-added during processing

## Benefits

- **Reduces CPU usage** by eliminating repeated attempts on blocked streams
- **Improves responsiveness** by automatically detecting when streams become unblocked
- **Fixes algorithmic performance issues** in stream-processing hot paths
- **Maintains backward compatibility**, with all existing tests passing

Closes mozilla#261
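The `mem::take()` pattern mentioned above can be sketched as follows. This is a hypothetical simplification, not the actual neqo code: taking ownership of the pending set means the loop iterates over a fixed snapshot, so streams re-added during processing land in the (now empty) live set and are picked up on the next pass rather than being lost or reprocessed:

```rust
use std::collections::HashSet;

/// Drain the pending set once. `try_send` is a stand-in for the real
/// per-stream send logic; it returns true when the stream was fully sent.
fn send_pending(pending: &mut HashSet<u64>, mut try_send: impl FnMut(u64) -> bool) {
    // `take` leaves an empty set behind; any re-additions during the loop
    // go into `pending`, not into the snapshot we iterate over.
    let snapshot = std::mem::take(pending);
    for stream_id in snapshot {
        if !try_send(stream_id) {
            // Not fully sent: keep the stream pending for the next pass.
            pending.insert(stream_id);
        }
    }
}

fn main() {
    let mut pending: HashSet<u64> = [1, 2, 3].into_iter().collect();
    // Pretend only stream 2 cannot make progress this pass.
    send_pending(&mut pending, |id| id != 2);
    assert_eq!(pending.len(), 1);
    assert!(pending.contains(&2));
}
```

Compared with repeatedly scanning a set that is mutated mid-iteration, each stream is visited at most once per pass, which is what removes the O(n²) behavior.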
Pull Request Overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.
```rust
let mut unblocked = Vec::new();
#[expect(
    clippy::iter_over_hash_type,
    reason = "OK to loop over streams in an undefined order."
)]
for &stream_id in &self.blocked_streams {
    match conn.stream_avail_send_space(stream_id) {
        Ok(0) => {
            // Still blocked, do nothing.
        }
        Ok(_) => {
            // Unblocked, collect for removal and move to pending data if needed.
            unblocked.push(stream_id);
        }
        Err(_) => {
            // Stream no longer exists, collect for removal.
            unblocked.push(stream_id);
        }
    }
}
// Remove all unblocked streams from blocked_streams.
for stream_id in unblocked {
    self.blocked_streams.remove(&stream_id);
    if let Some(stream) = self.send_streams.get(&stream_id) {
        if stream.has_data_to_send() {
            self.streams_with_pending_data.insert(stream_id);
        }
    }
}
```
The method iterates over all blocked streams on every call, which could be expensive with many blocked streams. Consider optimizing by only checking blocked streams when flow control updates are received, or implementing a more efficient unblocking mechanism.
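The event-driven alternative the reviewer suggests could look roughly like this. This is a hypothetical sketch (`on_flow_control_update` is an invented hook, not a neqo API): instead of scanning every blocked stream on each send pass, react once per flow-control update for the specific stream it concerns:

```rust
use std::collections::HashSet;

/// Hypothetical simplification of the blocked/pending bookkeeping.
struct Tracker {
    blocked: HashSet<u64>,
    pending: HashSet<u64>,
}

impl Tracker {
    /// Invented hook: called when the peer raises the send limit for
    /// `stream_id`. This is O(1) per update, instead of O(blocked)
    /// work on every send pass.
    fn on_flow_control_update(&mut self, stream_id: u64) {
        if self.blocked.remove(&stream_id) {
            self.pending.insert(stream_id);
        }
    }
}

fn main() {
    let mut t = Tracker {
        blocked: [7].into_iter().collect(),
        pending: HashSet::new(),
    };
    t.on_flow_control_update(7);
    assert!(t.pending.contains(&7));
    assert!(t.blocked.is_empty());
}
```

The trade-off is that the unblocking logic must be wired into the flow-control event path, which couples two previously independent code paths; the polling approach in the PR keeps them separate at the cost of the per-call scan.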
mxinden left a comment:
This increases the amount of state tracking, i.e., it does not come for free, both in terms of complexity and potentially in terms of compute. Unless we have a benchmark showing that this improves performance, i.e., that the additional tracking is cheaper than the continuous attempts to send on a blocked stream, I don't think we should merge this.
Pull Request Overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
```rust
for stream_id in unblocked {
    self.blocked_streams.remove(&stream_id);
    if let Some(stream) = self.send_streams.get(&stream_id) {
        if stream.has_data_to_send() {
```
This creates a potential infinite loop. When mark_stream_for_sending() is called, it adds the stream to streams_with_pending_data, which will cause it to be processed again in send_streams_with_pending_data(). If the stream becomes blocked again, it will be moved back to blocked_streams, then check_blocked_streams() will move it back to pending, creating a cycle.
Suggested change:

```diff
-if stream.has_data_to_send() {
+if stream.has_data_to_send() && !self.streams_with_pending_data.contains(&stream_id) {
```
```rust
match conn.stream_avail_send_space(stream_id) {
    Ok(0) => {
        // No space available, stream is likely blocked.
        self.blocked_streams.insert(stream_id);
    }
    Ok(_) => {
        // Space available, stream can continue sending.
        self.streams_with_pending_data.insert(stream_id);
    }
```
This logic will re-add a stream that was just processed back to streams_with_pending_data in the same iteration, potentially causing it to be processed again immediately. This could lead to inefficient repeated processing of the same stream within a single send_streams_with_pending_data() call.
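One way to avoid reprocessing a stream within the same pass, sketched here under stated assumptions (`avail` stands in for `stream_avail_send_space`; this is not the PR's actual fix), is to classify streams into fresh sets and merge them into the live sets only after the loop finishes:

```rust
use std::collections::HashSet;

/// Classify a snapshot of stream IDs into blocked and pending sets.
/// `avail` mimics `stream_avail_send_space`: Ok(0) means no send space,
/// Ok(n) means n bytes available, Err means the stream no longer exists.
fn classify(
    snapshot: impl IntoIterator<Item = u64>,
    avail: impl Fn(u64) -> Result<usize, ()>,
) -> (HashSet<u64>, HashSet<u64>) {
    let mut blocked = HashSet::new();
    let mut pending = HashSet::new();
    for id in snapshot {
        match avail(id) {
            Ok(0) => {
                // No send space: park the stream as blocked.
                blocked.insert(id);
            }
            Ok(_) => {
                // Space left: keep the stream pending for the *next* pass,
                // not for re-processing within this one.
                pending.insert(id);
            }
            Err(()) => {
                // Stream is gone: drop it from both sets.
            }
        }
    }
    (blocked, pending)
}

fn main() {
    let (blocked, pending) = classify([1, 2, 3], |id| match id {
        1 => Ok(0),
        2 => Ok(100),
        _ => Err(()),
    });
    assert!(blocked.contains(&1));
    assert!(pending.contains(&2));
    assert_eq!(blocked.len() + pending.len(), 2);
}
```

Because the loop only writes to the local sets, a stream classified as pending cannot be picked up again by the same iteration, which addresses the repeated-processing concern above.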
| Branch | fix-261 |
| Testbed | On-prem |
All benchmark results (values in nanoseconds):

| Benchmark | Result (Δ%) | Baseline | Upper Boundary (Limit %) |
|---|---|---|---|
| 1-conn/1-100mb-req/mtu-1504 (aka. Upload)/client | 208,660,000.00 ns (+0.92%) | 206,766,840.58 ns | 216,469,067.65 ns (96.39%) |
| 1-conn/1-100mb-resp/mtu-1504 (aka. Download)/client | 204,780,000.00 ns (+1.93%) | 200,912,115.94 ns | 211,182,472.97 ns (96.97%) |
| 1-conn/1-1b-resp/mtu-1504 (aka. HPS)/client | 38,595,000.00 ns (+21.90%) | 31,661,324.64 ns | 42,922,891.79 ns (89.92%) |
| 1-conn/10_000-parallel-1b-resp/mtu-1504 (aka. RPS)/client | 290,440,000.00 ns (-0.32%) | 291,370,840.58 ns | 303,634,212.32 ns (95.65%) |
| 1-streams/each-1000-bytes/simulated-time | 118,970,000.00 ns (+0.25%) | 118,674,550.72 ns | 120,712,253.92 ns (98.56%) |
| 1-streams/each-1000-bytes/wallclock-time | 586,310.00 ns (-1.07%) | 592,642.00 ns | 615,476.24 ns (95.26%) |
| 1000-streams/each-1-bytes/simulated-time | 2,333,300,000.00 ns (-82.70%) | 13,490,980,579.71 ns | 23,068,540,082.26 ns (10.11%) |
| 1000-streams/each-1-bytes/wallclock-time | 12,511,000.00 ns (-9.61%) | 13,841,753.62 ns | 15,084,963.56 ns (82.94%) |
| 1000-streams/each-1000-bytes/simulated-time | 16,476,000,000.00 ns (-11.70%) | 18,659,620,289.86 ns | 20,618,474,175.45 ns (79.91%) |
| 1000-streams/each-1000-bytes/wallclock-time | 49,854,000.00 ns (-1.61%) | 50,669,866.67 ns | 57,072,130.62 ns (87.35%) |
| RxStreamOrderer::inbound_frame() | 110,340,000.00 ns (+0.59%) | 109,696,956.52 ns | 111,590,965.33 ns (98.88%) |
| coalesce_acked_from_zero 1+1 entries | 89.64 ns (+0.87%) | 88.87 ns | 90.13 ns (99.45%) |
| coalesce_acked_from_zero 10+1 entries | 105.74 ns (-0.32%) | 106.07 ns | 107.22 ns (98.62%) |
| coalesce_acked_from_zero 1000+1 entries | 92.08 ns (+1.94%) | 90.32 ns | 94.94 ns (96.98%) |
| coalesce_acked_from_zero 3+1 entries | 106.54 ns (-0.05%) | 106.59 ns | 107.68 ns (98.95%) |
| decode 1048576 bytes, mask 3f | 1,760,900.00 ns (+7.70%) | 1,634,959.71 ns | 1,809,989.32 ns (97.29%) |
| decode 1048576 bytes, mask 7f | 5,048,300.00 ns (-0.36%) | 5,066,487.54 ns | 5,113,021.54 ns (98.73%) |
| decode 1048576 bytes, mask ff | 3,003,000.00 ns (-0.87%) | 3,029,363.77 ns | 3,053,866.08 ns (98.33%) |
| decode 4096 bytes, mask 3f | 6,252.20 ns (-15.02%) | 7,357.02 ns | 10,374.04 ns (60.27%) |
| decode 4096 bytes, mask 7f | 19,633.00 ns (-0.86%) | 19,802.48 ns | 20,464.83 ns (95.94%) |
| decode 4096 bytes, mask ff | 11,341.00 ns (-0.21%) | 11,365.41 ns | 12,521.98 ns (90.57%) |
| sent::Packets::take_ranges | 4,560.80 ns (-3.38%) | 4,720.32 ns | 4,959.89 ns (91.95%) |
| transfer/pacing-false/same-seed/simulated-time/run | 25,234,000,000.00 ns (-0.69%) | 25,410,201,166.18 ns | 26,029,155,392.72 ns (96.95%) |
| transfer/pacing-false/same-seed/wallclock-time/run | 24,777,000.00 ns (-3.53%) | 25,684,667.64 ns | 27,106,786.37 ns (91.41%) |
| transfer/pacing-false/varying-seeds/simulated-time/run | 25,190,000,000.00 ns (+0.06%) | 25,175,405,247.81 ns | 25,224,533,916.06 ns (99.86%) |
| transfer/pacing-false/varying-seeds/wallclock-time/run | 25,040,000.00 ns (-2.83%) | 25,768,239.07 ns | 27,412,870.51 ns (91.34%) |
| transfer/pacing-true/same-seed/simulated-time/run | 25,301,000,000.00 ns (-1.08%) | 25,577,483,965.01 ns | 25,883,944,847.15 ns (97.75%) |
| transfer/pacing-true/same-seed/wallclock-time/run | 26,401,000.00 ns (-2.33%) | 27,030,889.21 ns | 28,602,554.35 ns (92.30%) |
| transfer/pacing-true/varying-seeds/simulated-time/run | 25,016,000,000.00 ns (+0.08%) | 24,995,355,685.13 ns | 25,044,029,996.46 ns (99.89%) |
| transfer/pacing-true/varying-seeds/wallclock-time/run | 25,322,000.00 ns (-3.61%) | 26,269,545.19 ns | 27,994,757.32 ns (90.45%) |
Signed-off-by: Lars Eggert <lars@eggert.org>
CodSpeed Performance Report: merging #3015 will improve performance by 11.16%.
Benchmarks breakdown
Failed Interop Tests

QUIC Interop Runner, client vs. server, differences relative to b9c32c7.

- neqo-latest as client
- neqo-latest as server

Succeeded Interop Tests

QUIC Interop Runner, client vs. server.

- neqo-latest as client
- neqo-latest as server

Unsupported Interop Tests

QUIC Interop Runner, client vs. server.

- neqo-latest as client
- neqo-latest as server
Benchmark results

Performance differences relative to b9c32c7.

1-conn/1-100mb-resp/mtu-1504 (aka. Download)/client: 💔 Performance has regressed. time: [204.47 ms 204.78 ms 205.09 ms]
thrpt: [487.58 MiB/s 488.32 MiB/s 489.06 MiB/s]
change:
time: [+1.6604% +1.8996% +2.1422%] (p = 0.00 < 0.05)
thrpt: [-2.0972% -1.8642% -1.6333%]
1-conn/10_000-parallel-1b-resp/mtu-1504 (aka. RPS)/client: 💔 Performance has regressed. time: [288.57 ms 290.44 ms 292.34 ms]
thrpt: [34.207 Kelem/s 34.431 Kelem/s 34.654 Kelem/s]
change:
time: [+1.0774% +1.9952% +2.9143%] (p = 0.00 < 0.05)
thrpt: [-2.8318% -1.9561% -1.0659%]
1-conn/1-1b-resp/mtu-1504 (aka. HPS)/client: No change in performance detected. time: [38.443 ms 38.595 ms 38.770 ms]
thrpt: [25.793 B/s 25.910 B/s 26.013 B/s]
change:
time: [-0.7411% -0.1602% +0.4553%] (p = 0.60 > 0.05)
thrpt: [-0.4532% +0.1604% +0.7467%]
1-conn/1-100mb-req/mtu-1504 (aka. Upload)/client: Change within noise threshold. time: [208.37 ms 208.66 ms 208.99 ms]
thrpt: [478.50 MiB/s 479.25 MiB/s 479.91 MiB/s]
change:
time: [-0.7270% -0.5355% -0.3313%] (p = 0.00 < 0.05)
thrpt: [+0.3324% +0.5384% +0.7323%]
decode 4096 bytes, mask ff: No change in performance detected. time: [11.308 µs 11.341 µs 11.380 µs]
change: [-1.0832% -0.4440% +0.0548%] (p = 0.14 > 0.05)
decode 1048576 bytes, mask ff: No change in performance detected. time: [2.9936 ms 3.0030 ms 3.0141 ms]
change: [-0.3209% +0.1246% +0.5597%] (p = 0.61 > 0.05)
decode 4096 bytes, mask 7f: No change in performance detected. time: [19.582 µs 19.633 µs 19.691 µs]
change: [-0.1588% +0.2884% +0.8085%] (p = 0.25 > 0.05)
decode 1048576 bytes, mask 7f: No change in performance detected. time: [5.0337 ms 5.0483 ms 5.0657 ms]
change: [-0.2913% +0.1011% +0.5009%] (p = 0.63 > 0.05)
decode 4096 bytes, mask 3f: No change in performance detected. time: [6.2161 µs 6.2522 µs 6.2960 µs]
change: [-0.0048% +0.4681% +1.0177%] (p = 0.08 > 0.05)
decode 1048576 bytes, mask 3f: Change within noise threshold. time: [1.7581 ms 1.7609 ms 1.7664 ms]
change: [-1.9950% -1.0072% -0.2316%] (p = 0.02 < 0.05)
1-streams/each-1000-bytes/wallclock-time: No change in performance detected. time: [585.12 µs 586.31 µs 587.71 µs]
change: [-0.4693% +0.0600% +0.5569%] (p = 0.82 > 0.05)
1000-streams/each-1-bytes/wallclock-time: 💚 Performance has improved. time: [12.476 ms 12.511 ms 12.546 ms]
change: [-2.5823% -2.1537% -1.7459%] (p = 0.00 < 0.05)
1000-streams/each-1000-bytes/wallclock-time: 💚 Performance has improved. time: [49.733 ms 49.854 ms 49.977 ms]
change: [-1.9723% -1.4463% -1.0104%] (p = 0.00 < 0.05)
1000-streams/each-1000-bytes/simulated-time: No change in performance detected. time: [16.214 s 16.476 s 16.737 s]
thrpt: [58.346 KiB/s 59.271 KiB/s 60.229 KiB/s]
change:
time: [-1.8031% +0.3896% +2.6943%] (p = 0.72 > 0.05)
thrpt: [-2.6236% -0.3881% +1.8362%]
coalesce_acked_from_zero 1+1 entries: No change in performance detected. time: [89.258 ns 89.640 ns 90.018 ns]
change: [-0.1880% +0.2282% +0.6918%] (p = 0.32 > 0.05)
coalesce_acked_from_zero 3+1 entries: No change in performance detected. time: [106.18 ns 106.54 ns 106.93 ns]
change: [-0.5614% +0.0503% +0.6192%] (p = 0.87 > 0.05)
coalesce_acked_from_zero 10+1 entries: No change in performance detected. time: [105.41 ns 105.74 ns 106.15 ns]
change: [-1.1771% -0.3829% +0.3972%] (p = 0.35 > 0.05)
coalesce_acked_from_zero 1000+1 entries: No change in performance detected. time: [91.929 ns 92.075 ns 92.238 ns]
change: [-0.0523% +0.6052% +1.3694%] (p = 0.09 > 0.05)
RxStreamOrderer::inbound_frame(): Change within noise threshold. time: [110.14 ms 110.34 ms 110.66 ms]
change: [+0.5859% +0.7739% +1.0999%] (p = 0.00 < 0.05)
sent::Packets::take_ranges: No change in performance detected. time: [4.4608 µs 4.5608 µs 4.6527 µs]
change: [-2.3333% +0.9104% +4.5284%] (p = 0.62 > 0.05)
transfer/pacing-false/varying-seeds/wallclock-time/run: Change within noise threshold. time: [24.988 ms 25.040 ms 25.101 ms]
change: [+0.9575% +1.2453% +1.5561%] (p = 0.00 < 0.05)
transfer/pacing-false/varying-seeds/simulated-time/run: No change in performance detected. time: [25.155 s 25.190 s 25.225 s]
thrpt: [162.38 KiB/s 162.61 KiB/s 162.83 KiB/s]
change:
time: [-0.1498% +0.0425% +0.2387%] (p = 0.67 > 0.05)
thrpt: [-0.2381% -0.0424% +0.1500%]
transfer/pacing-true/varying-seeds/wallclock-time/run: Change within noise threshold. time: [25.262 ms 25.322 ms 25.385 ms]
change: [-1.0386% -0.7009% -0.3595%] (p = 0.00 < 0.05)
transfer/pacing-true/varying-seeds/simulated-time/run: No change in performance detected. time: [24.980 s 25.016 s 25.052 s]
thrpt: [163.50 KiB/s 163.74 KiB/s 163.97 KiB/s]
change:
time: [-0.0809% +0.1287% +0.3384%] (p = 0.21 > 0.05)
thrpt: [-0.3373% -0.1285% +0.0810%]
transfer/pacing-false/same-seed/wallclock-time/run: Change within noise threshold. time: [24.746 ms 24.777 ms 24.823 ms]
change: [+1.5039% +1.7092% +1.9363%] (p = 0.00 < 0.05)
transfer/pacing-false/same-seed/simulated-time/run: No change in performance detected. time: [25.234 s 25.234 s 25.234 s]
thrpt: [162.32 KiB/s 162.32 KiB/s 162.32 KiB/s]
change:
time: [+0.0000% +0.0000% +0.0000%] (p = NaN > 0.05)
thrpt: [+0.0000% +0.0000% +0.0000%]
transfer/pacing-true/same-seed/wallclock-time/run: No change in performance detected. time: [26.386 ms 26.401 ms 26.417 ms]
change: [-0.0512% +0.1611% +0.3147%] (p = 0.08 > 0.05)
transfer/pacing-true/same-seed/simulated-time/run: No change in performance detected. time: [25.301 s 25.301 s 25.301 s]
thrpt: [161.89 KiB/s 161.89 KiB/s 161.89 KiB/s]
change:
time: [+0.0000% +0.0000% +0.0000%] (p = NaN > 0.05)
thrpt: [+0.0000% +0.0000% +0.0000%]
Client/server transfer results

Performance differences relative to b9c32c7. Transfer of 33554432 bytes over loopback, min. 100 runs. All unit-less numbers are in milliseconds.