TCP CRR workload #1906
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9443fca431
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
        no_delay: self.config.no_delay,
        backlog: self.config.backlog,
    };
    rr::run_server(params, self.metric_labels, self.shutdown, "tcp_crr").await?;
Reuse server flow slots for CRR
For sustained tcp_crr runs this delegates the blackhole to the RR server path, but each CRR transaction creates a new server-side connection while rr::server_thread_main assigns a fresh monotonically increasing token on every accept and FlowMap explicitly never reuses removed slots (lading/src/neper/rr.rs:753-761, lading/src/neper/flow.rs:31-50). As a result the server's backing Vec<Option<Flow<_>>> grows by one slot per request/response cycle even after connections close, so a high-rate CRR experiment will steadily consume memory until lading itself becomes the bottleneck or OOMs.
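One way to bound the growth described above is to recycle freed slots through a free list. The following is a minimal sketch under stated assumptions: `SlotMap`, `insert`, `remove`, and `slots_allocated` are hypothetical names for illustration and are not lading's actual `FlowMap` API.

```rust
/// Hypothetical slot map that recycles the indices of removed entries.
struct SlotMap<T> {
    slots: Vec<Option<T>>,
    free: Vec<usize>,
}

impl<T> SlotMap<T> {
    fn new() -> Self {
        Self { slots: Vec::new(), free: Vec::new() }
    }

    /// Stores `value` and returns its token, preferring a recycled slot.
    fn insert(&mut self, value: T) -> usize {
        if let Some(idx) = self.free.pop() {
            self.slots[idx] = Some(value);
            idx
        } else {
            self.slots.push(Some(value));
            self.slots.len() - 1
        }
    }

    /// Removes the entry at `idx` and marks the slot as reusable.
    fn remove(&mut self, idx: usize) -> Option<T> {
        let value = self.slots.get_mut(idx)?.take();
        if value.is_some() {
            self.free.push(idx);
        }
        value
    }

    /// Number of slots ever allocated in the backing vector.
    fn slots_allocated(&self) -> usize {
        self.slots.len()
    }
}
```

With this shape, a long run of insert/remove cycles keeps the backing vector at the peak number of concurrent flows. The tradeoff is that a recycled token can alias an event queued for the previous connection, so a real design would likely need a generation counter or similar guard.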
💡 Codex Review
Reviewed commit: 048d10ffb0
    if self.inner.len() < idx {
        self.inner.resize_with(idx + 1, || None);
    }
    self.inner[idx] = Some(flow);
    if self.inner[idx].is_none() {
Grow the vector before indexing the first slot
With any positive flow count, FlowMap::new only reserves capacity and leaves inner.len() == 0; on the first insert idx is 0, this condition is false, and the next line indexes self.inner[0], panicking before either tcp_rr or tcp_crr can establish its first flow. The resize check needs to include the idx == len case.
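The off-by-one can be illustrated in isolation. This is a minimal sketch, assuming a simplified container; `FlowSlots`, `with_capacity`, and `get` are hypothetical stand-ins and not the real `FlowMap` definition.

```rust
/// Hypothetical stand-in for a Vec-backed flow table.
struct FlowSlots<T> {
    inner: Vec<Option<T>>,
}

impl<T> FlowSlots<T> {
    /// Reserves capacity only; the length stays at zero.
    fn with_capacity(n: usize) -> Self {
        Self { inner: Vec::with_capacity(n) }
    }

    fn insert(&mut self, idx: usize, flow: T) {
        // `<=` covers idx == len, including the first insert into an empty Vec.
        if self.inner.len() <= idx {
            self.inner.resize_with(idx + 1, || None);
        }
        self.inner[idx] = Some(flow);
    }

    fn get(&self, idx: usize) -> Option<&T> {
        self.inner.get(idx)?.as_ref()
    }
}
```

With `<` instead of `<=`, the first `insert(0, ..)` would skip the resize and index an empty vector.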
    /// Total number of TCP flows the client should open. Sent to the client
    /// over the control connection during startup.
    pub(crate) flows: u16,
    pub(crate) flows: u32,
Keep the control handshake width consistent
Changing flows to u32 makes the server write four big-endian bytes via params.flows.to_be_bytes(), but the client still reads HANDSHAKE_LEN = 2 into a u16. For normal configurations such as the default flows: 1, the generator reads the high two bytes as 0 and then rejects threads > flows, so the new/default RR and CRR workloads fail to start.
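A width mismatch like this is easier to avoid when both sides derive the handshake length from the same type. The sketch below assumes a simplified handshake that carries only the flow count; `write_flows` and `read_flows` are hypothetical helpers, and `HANDSHAKE_LEN` here is redefined for illustration rather than taken from the PR.

```rust
use std::io::{self, Read, Write};

/// Handshake width derived from the field type, shared by both sides.
const HANDSHAKE_LEN: usize = std::mem::size_of::<u32>();

/// Server side: send the flow count as big-endian bytes.
fn write_flows<W: Write>(w: &mut W, flows: u32) -> io::Result<()> {
    w.write_all(&flows.to_be_bytes())
}

/// Client side: read exactly `HANDSHAKE_LEN` bytes and decode them.
fn read_flows<R: Read>(r: &mut R) -> io::Result<u32> {
    let mut buf = [0u8; HANDSHAKE_LEN];
    r.read_exact(&mut buf)?;
    Ok(u32::from_be_bytes(buf))
}
```

For `flows: 1` the wire bytes are `00 00 00 01`, so a reader that consumes only the first two bytes as a `u16` decodes 0, which matches the failure described above.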
💡 Codex Review
Reviewed commit: cf47a59ecf
    Err(e) => {
        trace!("reconnect to {addr} failed: {e}");
        metrics.connections_failed.add(1);
        let _ = flows.remove(token);
Keep CRR flows alive after transient reconnect errors
In CRR mode, immediate reconnect errors such as EADDRNOTAVAIL during ephemeral-port exhaustion are expected transient conditions, and the module docs explicitly call out that sustained CRR runs can hit this limit. Removing the flow here permanently reduces the configured active flow count; after enough transient failures the generator drains to zero flows and silently stops producing CRR load instead of retrying when ports become available again.
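The suggestion amounts to classifying reconnect errors and scheduling a retry for the transient ones. A minimal sketch, assuming a fixed backoff; `Next`, `is_transient`, and `on_reconnect_error` are hypothetical names, and which error kinds count as transient is an assumption rather than something the PR specifies.

```rust
use std::io;
use std::time::{Duration, Instant};

/// What the event loop should do with a flow after a failed reconnect.
#[derive(Debug, PartialEq)]
enum Next {
    RetryAt(Instant),
    Remove,
}

/// Assumed set of transient errors (EADDRNOTAVAIL, EADDRINUSE, ECONNREFUSED).
fn is_transient(err: &io::Error) -> bool {
    matches!(
        err.kind(),
        io::ErrorKind::AddrNotAvailable
            | io::ErrorKind::AddrInUse
            | io::ErrorKind::ConnectionRefused
    )
}

/// Keeps the flow alive on transient errors and drops it otherwise.
fn on_reconnect_error(err: &io::Error, now: Instant, backoff: Duration) -> Next {
    if is_transient(err) {
        Next::RetryAt(now + backoff)
    } else {
        Next::Remove
    }
}
```

The event loop would keep the flow's slot and attempt the connect again once `RetryAt` has passed, so the configured flow count is preserved across port exhaustion.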
We partially mitigate this in 69e43c0 by increasing the local port range for all sockets created by the generator.
This is not a complete solution, but addressing this edge case would add more code than is needed right now. We can evaluate later whether users actually want to create enough flows for this to occur.
Force-pushed from cf47a59 to 4ca89fa.
💡 Codex Review
Reviewed commit: 4ca89fa859
💡 Codex Review
Reviewed commit: ac7aa5c284
…lls may fail with EADDRNOTAVAIL due to port exhaustion. We reduce the chances of this by increasing the local port range for sockets created by the generator.
💡 Codex Review
Reviewed commit: 301c265c80
    ///
    /// # Panics
    ///
    /// Panics if `addr` cannot be parsed as an IP address.
Return an error for invalid tcp_crr addresses
For any tcp_crr generator config where addr is malformed, this expect panics the whole lading process before spin can return its Result, so callers cannot surface a normal configuration error or shut down cleanly. This new generator should parse the address into an error variant instead of aborting on user-provided config input.
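Surfacing the parse failure as a value could look roughly like this. It is a minimal sketch: `ConfigError` and `parse_addr` are hypothetical names, and the real generator's error type may be shaped differently.

```rust
use std::fmt;
use std::net::{AddrParseError, IpAddr};

/// Hypothetical configuration error for the generator.
#[derive(Debug)]
enum ConfigError {
    InvalidAddr { input: String, source: AddrParseError },
}

impl fmt::Display for ConfigError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ConfigError::InvalidAddr { input, source } => {
                write!(f, "invalid addr {input:?}: {source}")
            }
        }
    }
}

impl std::error::Error for ConfigError {}

/// Parses user-provided config input without panicking.
fn parse_addr(input: &str) -> Result<IpAddr, ConfigError> {
    input.parse::<IpAddr>().map_err(|source| ConfigError::InvalidAddr {
        input: input.to_owned(),
        source,
    })
}
```

A caller can then propagate the error with `?` and let the process report a configuration problem and shut down cleanly.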
💡 Codex Review
Reviewed commit: 21fc427004
| @@ -0,0 +1,131 @@ | |||
| //! TCP connect/request/response (`tcp_crr`) generator — the client side. | |||
Replace non-ASCII documentation characters
/workspace/lading/AGENTS.md requires documentation to be US-ASCII only ("no Unicode characters in code or documentation"), but this added doc comment uses a Unicode em dash; the same pattern appears in several other new comments in this change, so the repository validation can reject the patch until these are replaced with ASCII punctuation.
To avoid EADDRNOTAVAIL failures caused by race conditions when creating new connections, we assign a dedicated port to each flow. This prevents port collisions when multiple threads attempt to create a connection at the same time. To ensure that the port has been freed before it is reused in the ClientAction::Reconnect pathway, we force an RST packet to be sent so that the generator's kernel frees the port as soon as the TCP connection is dropped. The next connection for the same flow then does not fail with EADDRINUSE. To accomplish this, we enable the SO_LINGER socket option with a linger time of 0, which forces an abortive close on close(fd). At the application-protocol level, by the time the client initiates the abortive close the server has already transitioned into the CLOSESTREAM state, where it drops the connection from its end anyway.
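The abortive-close trick can be sketched with a direct `setsockopt` call. This is an illustration and not the PR's code: `set_abortive_close` is a hypothetical helper, the constants are the Linux values for x86-64 and aarch64, and the `unsafe extern` block assumes Rust 1.82 or newer. A crate such as `socket2` provides a portable `set_linger` that would normally be preferable.

```rust
use std::io;
use std::net::TcpStream;
use std::os::fd::AsRawFd;
use std::os::raw::{c_int, c_void};

// Linux values on x86-64 and aarch64; other platforms use different numbers.
const SOL_SOCKET: c_int = 1;
const SO_LINGER: c_int = 13;

/// Mirrors `struct linger` from <sys/socket.h>.
#[repr(C)]
struct Linger {
    l_onoff: c_int,
    l_linger: c_int,
}

unsafe extern "C" {
    fn setsockopt(fd: c_int, level: c_int, name: c_int, value: *const c_void, len: u32) -> c_int;
}

/// Enables SO_LINGER with a zero timeout so that closing the socket sends
/// an RST instead of performing the normal FIN handshake.
fn set_abortive_close(stream: &TcpStream) -> io::Result<()> {
    let linger = Linger { l_onoff: 1, l_linger: 0 };
    // SAFETY: the pointer and length describe a live, correctly sized struct.
    let rc = unsafe {
        setsockopt(
            stream.as_raw_fd(),
            SOL_SOCKET,
            SO_LINGER,
            (&linger as *const Linger).cast::<c_void>(),
            std::mem::size_of::<Linger>() as u32,
        )
    };
    if rc == 0 { Ok(()) } else { Err(io::Error::last_os_error()) }
}
```

After this call, dropping the `TcpStream` resets the connection, so the local port does not linger in TIME_WAIT and the peer sees a connection reset on its next read.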
Pinning ports to flows changed the behavior on the BPF side because the 5-tuple became less variable. As a result, flow tracking in BPF behaved differently: it incurred less of the overhead that comes from managing many new flows.
Force-pushed from 132cadd to d4e7bf4.
Force-pushed from 96db1de to d4e7bf4.
💡 Codex Review
Reviewed commit: d4e7bf424e
        // fall through
    }
    Ok(Some(e)) => {
        warn!("connect failed: {e}");
Avoid warning on every failed CRR reconnect
In tcp_crr mode, transient connect failures can happen repeatedly under the exact high-churn conditions this workload is meant to exercise (for example backlog pressure, refused connects, or ephemeral-port exhaustion). Emitting a warn! from the reconnect path for every failed nonblocking connect completion can make logging dominate the event loop and distort the experiment; this should stay at trace/debug level or be rate-limited while the counter records the failures.
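If rate limiting were wanted, a small time-based gate in front of the log call would be enough. This is a minimal sketch; `LogLimiter` and `check` are hypothetical names, and the caller is assumed to pass in the current time.

```rust
use std::time::{Duration, Instant};

/// Allows at most one log line per `interval` and counts what it suppressed.
struct LogLimiter {
    interval: Duration,
    last: Option<Instant>,
    suppressed: u64,
}

impl LogLimiter {
    fn new(interval: Duration) -> Self {
        Self { interval, last: None, suppressed: 0 }
    }

    /// Returns `Some(n)` when the caller should log now, where `n` is the
    /// number of events suppressed since the previous log line.
    fn check(&mut self, now: Instant) -> Option<u64> {
        match self.last {
            Some(prev) if now.duration_since(prev) < self.interval => {
                self.suppressed += 1;
                None
            }
            _ => {
                self.last = Some(now);
                Some(std::mem::take(&mut self.suppressed))
            }
        }
    }
}
```

The failure counter would still be incremented on every failed connect; only the log line is gated.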
We are not stress testing the receiver. Connect should not fail excessively.
What does this PR do?
This PR adds a new workload, tcp_crr. It is similar to tcp_rr, except that the client opens a new connection for every request/response transaction.
The workload stresses the TCP connection-creation code path.
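The connect/request/response pattern can be sketched with blocking sockets from the standard library. This is only an illustration of the pattern and not the PR's implementation; `run_crr` and its parameters are hypothetical.

```rust
use std::io::{self, Read, Write};
use std::net::{SocketAddr, TcpStream};

/// Runs `transactions` connect/request/response cycles against `addr` and
/// returns how many completed.
fn run_crr(
    addr: SocketAddr,
    transactions: usize,
    request: &[u8],
    response_len: usize,
) -> io::Result<usize> {
    let mut completed = 0;
    let mut response = vec![0u8; response_len];
    for _ in 0..transactions {
        // Unlike an RR workload, every cycle pays for a fresh TCP handshake.
        let mut stream = TcpStream::connect(addr)?;
        stream.write_all(request)?;
        stream.read_exact(&mut response)?;
        completed += 1;
        // `stream` is dropped here, which closes the connection.
    }
    Ok(completed)
}
```

An RR loop would hoist the `connect` call out of the loop and reuse the same stream, which is why the CRR variant exercises connection setup and teardown much more heavily.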
Motivation
Related issues
Additional Notes
PR stack
[1] #1905
[2] #1906 <-- This