TCP CRR workload#1906

Open
usamasaqib wants to merge 13 commits into
usama.saqib/rr-refactorfrom
usamasaqib/tcp-crr

Conversation

@usamasaqib

@usamasaqib usamasaqib commented Jun 2, 2026

Contributor

What does this PR do?

This PR adds a new workload, tcp_crr. It is similar to tcp_rr, except that the client opens a new connection for every request/response exchange.

The workload stresses the TCP connection-creation code path.
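
As a rough illustration of the difference from tcp_rr, a connect/request/response loop can be sketched with blocking std sockets. This is not the generator's implementation; `crr_client`, the one-byte payload and the blocking I/O are assumptions made for the example.

```rust
use std::io::{self, Read, Write};
use std::net::{SocketAddr, TcpStream};

/// Run `transactions` connect/request/response cycles against `addr`.
/// Each cycle opens a new TCP connection, sends one request byte, waits for
/// a one-byte response, and closes the connection by dropping the stream.
/// Returns the number of completed transactions.
fn crr_client(addr: SocketAddr, transactions: u32) -> io::Result<u32> {
    let mut completed = 0;
    for _ in 0..transactions {
        // New connection per transaction: this is what separates CRR from RR,
        // where one long-lived connection carries every request/response.
        let mut stream = TcpStream::connect(addr)?;
        stream.write_all(&[0x01])?;
        let mut response = [0u8; 1];
        stream.read_exact(&mut response)?;
        completed += 1;
        // `stream` is dropped here, closing the connection.
    }
    Ok(completed)
}
```

Every iteration pays for a full TCP handshake and teardown, which is the cost the workload is meant to exercise.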

Motivation

Related issues

Additional Notes

PR stack
[1] #1905
[2] #1906 <-- This

@usamasaqib usamasaqib changed the base branch from main to usama.saqib/rr-refactor June 2, 2026 10:08
@usamasaqib usamasaqib changed the title Usamasaqib/tcp crr TCP CRR workload Jun 2, 2026
@usamasaqib usamasaqib marked this pull request as ready for review June 2, 2026 10:14
@usamasaqib usamasaqib requested a review from a team as a code owner June 2, 2026 10:14

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9443fca431

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

no_delay: self.config.no_delay,
backlog: self.config.backlog,
};
rr::run_server(params, self.metric_labels, self.shutdown, "tcp_crr").await?;

P1 Badge Reuse server flow slots for CRR

For sustained tcp_crr runs this delegates the blackhole to the RR server path, but each CRR transaction creates a new server-side connection while rr::server_thread_main assigns a fresh monotonically increasing token on every accept and FlowMap explicitly never reuses removed slots (lading/src/neper/rr.rs:753-761, lading/src/neper/flow.rs:31-50). As a result the server's backing Vec<Option<Flow<_>>> grows by one slot per request/response cycle even after connections close, so a high-rate CRR experiment will steadily consume memory until lading itself becomes the bottleneck or OOMs.
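
One common way to bound this growth is a free list of released slots. The sketch below is a hypothetical `SlotMap`, not the repository's `FlowMap`; all names in it are assumptions.

```rust
/// Hypothetical slab-style map that reuses freed slots through a free list.
struct SlotMap<T> {
    slots: Vec<Option<T>>,
    free: Vec<usize>,
}

impl<T> SlotMap<T> {
    fn new() -> Self {
        Self { slots: Vec::new(), free: Vec::new() }
    }

    /// Store `value` and return its token, preferring a previously freed slot.
    fn insert(&mut self, value: T) -> usize {
        if let Some(idx) = self.free.pop() {
            self.slots[idx] = Some(value);
            idx
        } else {
            self.slots.push(Some(value));
            self.slots.len() - 1
        }
    }

    /// Remove the value at `token`, making the slot available for reuse.
    fn remove(&mut self, token: usize) -> Option<T> {
        let value = self.slots.get_mut(token)?.take();
        if value.is_some() {
            self.free.push(token);
        }
        value
    }

    /// Number of slots allocated so far, occupied or not.
    fn slots_allocated(&self) -> usize {
        self.slots.len()
    }
}
```

With this shape, a sequence of insert/remove cycles keeps the backing vector at the peak number of concurrent flows. One trade-off to weigh: if tokens are also registered with an event loop, a reused token could receive an event intended for the previous occupant, so reuse may need a generation counter or similar guard.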

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 048d10ffb0

Comment thread lading/src/neper/flow.rs Outdated
Comment on lines +58 to +61
if self.inner.len() < idx {
self.inner.resize_with(idx + 1, || None);
}
self.inner[idx] = Some(flow);
if self.inner[idx].is_none() {

P1 Badge Grow the vector before indexing the first slot

With any positive flow count, FlowMap::new only reserves capacity and leaves inner.len() == 0; on the first insert idx is 0, this condition is false, and the next line indexes self.inner[0], panicking before either tcp_rr or tcp_crr can establish its first flow. The resize check needs to include the idx == len case.
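
The off-by-one can be seen in a reduced model. The `FlowMap` below is a hypothetical stand-in that keeps only the growth logic; the field and method names mirror the snippet under review, but everything else is an assumption.

```rust
/// Hypothetical stand-in for the `FlowMap` under review; only the growth
/// logic is modeled here.
struct FlowMap<T> {
    inner: Vec<Option<T>>,
}

impl<T> FlowMap<T> {
    /// Reserve capacity only; `inner.len()` stays 0, as the review describes.
    fn new(expected: usize) -> Self {
        Self { inner: Vec::with_capacity(expected) }
    }

    /// Insert `flow` at `idx`, growing the vector when `idx` is out of bounds.
    fn insert(&mut self, idx: usize, flow: T) {
        // `<=` rather than `<`: when idx == len the slot does not exist yet.
        if self.inner.len() <= idx {
            self.inner.resize_with(idx + 1, || None);
        }
        self.inner[idx] = Some(flow);
    }

    /// Look up the flow stored at `idx`, if any.
    fn get(&self, idx: usize) -> Option<&T> {
        self.inner.get(idx)?.as_ref()
    }
}
```

With the `<` comparison, the first `insert(0, ..)` would skip the resize and index an empty vector.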

Comment thread lading/src/neper/rr.rs Outdated
/// Total number of TCP flows the client should open. Sent to the client
/// over the control connection during startup.
pub(crate) flows: u16,
pub(crate) flows: u32,

P1 Badge Keep the control handshake width consistent

Changing flows to u32 makes the server write four big-endian bytes via params.flows.to_be_bytes(), but the client still reads HANDSHAKE_LEN = 2 into a u16. For normal configurations such as the default flows: 1, the generator reads the high two bytes as 0 and then rejects threads > flows, so the new/default RR and CRR workloads fail to start.
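
A minimal sketch of one way to keep both sides in step is to derive the handshake length from the wire type, so widening the type changes the reader and writer together. The `FlowCount` alias and both helper functions are hypothetical; only the `HANDSHAKE_LEN` name and the big-endian encoding come from the review.

```rust
use std::io::{self, Read, Write};

/// Wire type for the flow count; both sides derive their width from it.
type FlowCount = u32;

/// Number of handshake bytes, computed from the wire type rather than
/// hard-coded.
const HANDSHAKE_LEN: usize = std::mem::size_of::<FlowCount>();

/// Server side: write the flow count as big-endian bytes.
fn write_handshake<W: Write>(w: &mut W, flows: FlowCount) -> io::Result<()> {
    w.write_all(&flows.to_be_bytes())
}

/// Client side: read exactly `HANDSHAKE_LEN` bytes and decode them.
fn read_handshake<R: Read>(r: &mut R) -> io::Result<FlowCount> {
    let mut buf = [0u8; HANDSHAKE_LEN];
    r.read_exact(&mut buf)?;
    Ok(FlowCount::from_be_bytes(buf))
}
```

For `flows = 1` the writer emits the bytes `00 00 00 01`; a reader that consumes only the first two bytes as a `u16` decodes 0, which is the failure described above.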

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cf47a59ecf

Comment thread lading/src/neper/rr.rs Outdated
Err(e) => {
trace!("reconnect to {addr} failed: {e}");
metrics.connections_failed.add(1);
let _ = flows.remove(token);

P2 Badge Keep CRR flows alive after transient reconnect errors

In CRR mode, immediate reconnect errors such as EADDRNOTAVAIL during ephemeral-port exhaustion are expected transient conditions, and the module docs explicitly call out that sustained CRR runs can hit this limit. Removing the flow here permanently reduces the configured active flow count; after enough transient failures the generator drains to zero flows and silently stops producing CRR load instead of retrying when ports become available again.
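
For reference, the retry behavior suggested here could be sketched as a deferred-retry queue: a failed flow is parked with a deadline instead of being removed. `RetryQueue`, its fixed delay and the `usize` tokens are assumptions for illustration, not code from this PR.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Hypothetical queue of flows waiting to retry a failed reconnect.
struct RetryQueue {
    delay: Duration,
    /// (retry deadline, flow token), ordered by deadline because the delay
    /// is fixed and callers pass non-decreasing `now` values.
    pending: VecDeque<(Instant, usize)>,
}

impl RetryQueue {
    fn new(delay: Duration) -> Self {
        Self { delay, pending: VecDeque::new() }
    }

    /// Record that the flow identified by `token` failed to reconnect at `now`.
    fn defer(&mut self, token: usize, now: Instant) {
        self.pending.push_back((now + self.delay, token));
    }

    /// Pop every token whose retry deadline has passed.
    fn due(&mut self, now: Instant) -> Vec<usize> {
        let mut ready = Vec::new();
        while let Some(&(deadline, token)) = self.pending.front() {
            if deadline > now {
                break;
            }
            self.pending.pop_front();
            ready.push(token);
        }
        ready
    }
}
```

An event loop would call `due` once per iteration and attempt a reconnect for each returned token, so the configured flow count is preserved across transient failures.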

@usamasaqib usamasaqib Jun 2, 2026

Contributor Author

We partially mitigate this by increasing the local port range for all sockets created by the generator in 69e43c0.

This is not a complete solution, but addressing this edge case adds more code than is needed right now. We can evaluate later whether users actually want to create enough flows for this to occur.

@usamasaqib usamasaqib force-pushed the usamasaqib/tcp-crr branch from cf47a59 to 4ca89fa Compare June 2, 2026 11:47

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4ca89fa859

Comment thread lading/src/neper/flow.rs Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ac7aa5c284

Comment thread lading/src/neper/rr.rs

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 301c265c80

///
/// # Panics
///
/// Panics if `addr` cannot be parsed as an IP address.

P2 Badge Return an error for invalid tcp_crr addresses

For any tcp_crr generator config where addr is malformed, this expect panics the whole lading process before spin can return its Result, so callers cannot surface a normal configuration error or shut down cleanly. This new generator should parse the address into an error variant instead of aborting on user-provided config input.
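
A minimal sketch of the suggested shape, assuming the address is an IP literal as the doc comment states. The `Error` enum, its `InvalidAddr` variant and `parse_addr` are hypothetical; the generator's real error type may look different.

```rust
use std::fmt;
use std::net::{AddrParseError, IpAddr};

/// Hypothetical error type for the generator; real variant names may differ.
#[derive(Debug)]
enum Error {
    /// The configured `addr` could not be parsed as an IP address.
    InvalidAddr { addr: String, source: AddrParseError },
}

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Error::InvalidAddr { addr, source } => {
                write!(f, "invalid address {addr:?}: {source}")
            }
        }
    }
}

impl std::error::Error for Error {
    fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
        match self {
            Error::InvalidAddr { source, .. } => Some(source),
        }
    }
}

/// Parse user-provided config input, returning an error instead of panicking.
fn parse_addr(addr: &str) -> Result<IpAddr, Error> {
    addr.parse::<IpAddr>().map_err(|source| Error::InvalidAddr {
        addr: addr.to_string(),
        source,
    })
}
```

The caller can then propagate the failure with `?`, so a malformed address surfaces as an ordinary configuration error.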

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 21fc427004

@@ -0,0 +1,131 @@
//! TCP connect/request/response (`tcp_crr`) generator — the client side.

P1 Badge Replace non-ASCII documentation characters

/workspace/lading/AGENTS.md requires documentation to be US-ASCII only ("no Unicode characters in code or documentation"), but this added doc comment uses a Unicode em dash; the same pattern appears in several other new comments in this change, so the repository validation can reject the patch until these are replaced with ASCII punctuation.
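
How the repository enforces this rule is not shown in the thread. As an assumption, a check of this kind can be as small as the hypothetical helper below, which reports the lines that contain any non-ASCII character.

```rust
/// Return the 1-based line numbers in `text` that contain non-ASCII
/// characters.
fn non_ascii_lines(text: &str) -> Vec<usize> {
    text.lines()
        .enumerate()
        .filter(|(_, line)| !line.is_ascii())
        .map(|(idx, _)| idx + 1)
        .collect()
}
```

Run over the new source files, it would flag the doc comment quoted above and pass once the dash is replaced with ASCII punctuation such as `-` or `:`.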

@usamasaqib usamasaqib marked this pull request as draft June 3, 2026 06:44
To avoid EADDRNOTAVAIL failures caused by race conditions when creating
new connections, we assign a dedicated port to each flow. This prevents
port collisions when multiple threads attempt to create a connection at
the same time. To ensure that the port has been freed before it is
reused in the ClientAction::Reconnect pathway, we force an RST packet to
be sent so that the generator's kernel frees the port when the TCP
connection is dropped. This way the next connection for the same flow
will not fail with EADDRINUSE. To accomplish this we enable the
SO_LINGER socket option with a timeout of 0, which forces an abortive
close on close(fd). At the application protocol level, by the time the
client initiates the abortive close the server has already transitioned
into the CLOSESTREAM state, where it will drop the connection from its
end anyway.

Pinning ports to flows was changing the behavior on the BPF side, since
the 5-tuple was becoming less variable. This meant that the effect of
flow tracking in BPF was different, because there was less overhead
from managing new flows.
@usamasaqib usamasaqib force-pushed the usamasaqib/tcp-crr branch from 132cadd to d4e7bf4 Compare June 4, 2026 11:36
@usamasaqib usamasaqib force-pushed the usamasaqib/tcp-crr branch from 96db1de to d4e7bf4 Compare June 8, 2026 09:50
@usamasaqib usamasaqib marked this pull request as ready for review June 8, 2026 09:50

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d4e7bf424e

Comment thread lading/src/neper/rr.rs
// fall through
}
Ok(Some(e)) => {
warn!("connect failed: {e}");

P2 Badge Avoid warning on every failed CRR reconnect

In tcp_crr mode, transient connect failures can happen repeatedly under the exact high-churn conditions this workload is meant to exercise (for example backlog pressure, refused connects, or ephemeral-port exhaustion). Emitting a warn! from the reconnect path for every failed nonblocking connect completion can make logging dominate the event loop and distort the experiment; this should stay at trace/debug level or be rate-limited while the counter records the failures.
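
The rate-limiting option mentioned here can be sketched without any logging crate: count every failure, but only signal the caller to log once per interval. `FailureLog` and its fields are hypothetical names; the caller would wrap its existing `warn!` in the `Some` branch.

```rust
use std::time::{Duration, Instant};

/// Counts every failure but only permits a log line once per `interval`.
struct FailureLog {
    interval: Duration,
    last_logged: Option<Instant>,
    /// Total failures recorded, logged or not.
    total: u64,
    /// Failures recorded since the last permitted log line.
    suppressed: u64,
}

impl FailureLog {
    fn new(interval: Duration) -> Self {
        Self { interval, last_logged: None, total: 0, suppressed: 0 }
    }

    /// Record one failure at `now`. Returns `Some(n)` when the caller should
    /// log, where `n` is how many failures were suppressed since the last
    /// log line; returns `None` when the failure should only be counted.
    fn record(&mut self, now: Instant) -> Option<u64> {
        self.total += 1;
        let should_log = match self.last_logged {
            None => true,
            Some(last) => now.duration_since(last) >= self.interval,
        };
        if should_log {
            self.last_logged = Some(now);
            Some(std::mem::take(&mut self.suppressed))
        } else {
            self.suppressed += 1;
            None
        }
    }
}
```

The failure counter stays exact while the log volume is bounded by the interval, whatever the connect failure rate turns out to be.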

Contributor Author

We are not stress testing the receiver. Connect should not fail excessively.

1 participant