netvsp: [EQE 135] fix sub-channel panic on EndpointRequiresQueueRestart#3558

Open
erfrimod wants to merge 2 commits into microsoft:main from erfrimod:erfrimod/fix-eqe-135-on-subchannel

@erfrimod
Contributor

Sub-channel workers panic on assert_eq!(self.channel_idx, 0) after self-transitioning into WaitingForCoordinator on EndpointRequiresQueueRestart. The coordinator restores only the Primary worker to Ready, so a sub-channel worker stays in WaitingForCoordinator until the next process pass trips the primary-only assert. This happens when GDMA_EQE_HWC_RESET_REQUEST (EQE 135) arrives on a sub-channel. Netvsp needs to be robust to the EQE arriving on sub-channels, including on multiple channels simultaneously.

  • Sub-channels stay in Ready and do not enter WaitingForCoordinator.
  • Coordinator message channel capacity is increased to max_queues so signals from the Primary and the sub-channels don't collide. (Each worker can have at most one in-flight message at a time.)
  • The Restart coordinator message now contains channel_idx so the coordinator can tell which channel sent it.
  • Coordinator loop is restructured to drain messages before restarting.
    • Any Primary Update or StartTimer messages are handled before the Primary worker is restarted.
  • Added a unit test showing that multiple messages are handled without panics.
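The worker-side changes in the list above can be sketched roughly as follows. This is a simplified illustration, not the netvsp code: `Worker`, `request_restart`, and `coordinator_channel` are hypothetical names, a bounded `std::sync::mpsc` channel stands in for the real coordinator channel, and it assumes the Primary (channel 0) still enters WaitingForCoordinator while sub-channels stay in Ready.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Hypothetical, simplified stand-in for the worker state named in the PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum WorkerState {
    Ready,
    WaitingForCoordinator,
}

/// Only the Restart variant is modeled; it carries the sender's channel index.
#[derive(Debug, PartialEq, Eq)]
enum CoordinatorMessage {
    Restart { channel_idx: u16 },
}

/// Hypothetical worker: channel_idx 0 is the Primary, all others are sub-channels.
struct Worker {
    channel_idx: u16,
    state: WorkerState,
}

/// Bounded channel sized to max_queues, so each worker's single in-flight
/// message always has a slot.
fn coordinator_channel(
    max_queues: usize,
) -> (SyncSender<CoordinatorMessage>, Receiver<CoordinatorMessage>) {
    sync_channel(max_queues)
}

impl Worker {
    /// Signal the coordinator that the endpoint needs a queue restart.
    /// Assumption: only the Primary parks itself in WaitingForCoordinator;
    /// a sub-channel stays in Ready so it never hits primary-only code
    /// while waiting.
    fn request_restart(
        &mut self,
        send: &SyncSender<CoordinatorMessage>,
    ) -> Result<(), TrySendError<CoordinatorMessage>> {
        send.try_send(CoordinatorMessage::Restart {
            channel_idx: self.channel_idx,
        })?;
        if self.channel_idx == 0 {
            self.state = WorkerState::WaitingForCoordinator;
        }
        Ok(())
    }
}
```

With a capacity of max_queues, every worker can post its one in-flight Restart even if all channels see the EQE at the same time, which is the collision the capacity change is meant to avoid.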

TBD: I want to test this in the lab.

@erfrimod erfrimod requested a review from a team as a code owner May 23, 2026 01:01
Copilot AI review requested due to automatic review settings May 23, 2026 01:01
Contributor

Copilot AI left a comment

Pull request overview

This PR updates netvsp’s restart coordination so that EndpointRequiresQueueRestart (EQE 135 / GDMA_EQE_HWC_RESET_REQUEST) can safely arrive on sub-channels (and multiple channels concurrently) without triggering primary-only assertions or worker-state panics.

Changes:

  • Extend CoordinatorMessage::Restart to include channel_idx, and increase the coordinator message channel capacity to max_queues to avoid collisions.
  • Restructure the coordinator loop to drain queued messages before running a restart cycle, and to treat sub-channel restart requests as coalescing signals.
  • Add a unit test that injects per-queue TxError::TryRestart on sub-channels and validates coalesced restart behavior without panics.
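The drain-then-restart behavior described above might look something like the sketch below. It is an illustration under stated assumptions rather than the PR's implementation: the `Coordinator` struct, its `handled` and `restart_cycles` fields, and `process_once` are hypothetical, the real code is async, and `std::sync::mpsc` replaces the actual channel type. Only the names `drain_pending_messages`, `handle_coordinator_message`, and the `restart` flag echo the diff excerpts in this PR.

```rust
use std::sync::mpsc::Receiver;

/// Simplified message set; Update and StartTimer carry no payload here.
#[derive(Debug, Clone, PartialEq, Eq)]
enum CoordinatorMessage {
    Restart { channel_idx: u16 },
    Update,
    StartTimer,
}

/// Hypothetical coordinator that records what it did, for illustration only.
#[derive(Default)]
struct Coordinator {
    /// Set when at least one Restart was seen since the last restart cycle.
    restart: bool,
    /// Non-restart messages, in the order they were handled.
    handled: Vec<CoordinatorMessage>,
    /// Number of restart cycles actually run.
    restart_cycles: u32,
}

impl Coordinator {
    fn handle_coordinator_message(&mut self, msg: CoordinatorMessage) {
        match msg {
            // Any number of Restart requests coalesce into one flag.
            CoordinatorMessage::Restart { .. } => self.restart = true,
            // Everything else is handled immediately, before any restart.
            other => self.handled.push(other),
        }
    }

    /// Empty the queue without blocking; stops on empty or disconnected.
    fn drain_pending_messages(&mut self, recv: &Receiver<CoordinatorMessage>) {
        while let Ok(msg) = recv.try_recv() {
            self.handle_coordinator_message(msg);
        }
    }

    /// One loop iteration: drain first, then run at most one restart cycle.
    fn process_once(&mut self, recv: &Receiver<CoordinatorMessage>) {
        self.drain_pending_messages(recv);
        if std::mem::take(&mut self.restart) {
            self.restart_cycles += 1;
        }
    }
}
```

In this sketch, three queued Restart messages interleaved with an Update and a StartTimer produce a single restart cycle, and both non-restart messages are handled before it runs.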

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| vm/devices/net/netvsp/src/lib.rs | Adds channel_idx to restart messages, increases coordinator channel capacity, and refactors coordinator/worker restart handling to tolerate sub-channel-triggered restarts. |
| vm/devices/net/netvsp/src/test.rs | Adds per-queue restart injection in the test endpoint and a new async test covering sub-channel restart coalescing and message-handling robustness. |

Comment thread vm/devices/net/netvsp/src/lib.rs Outdated
Comment on lines +4238 to +4247
// Drain any messages that arrived prior to the stop.
while let Ok(Some(msg)) = self.recv.try_next() {
    match msg {
        CoordinatorMessage::Restart { .. } => {
            // Discarding any additional Restart messages.
        }
        // Ensure any non-restart message from the Primary is
        // handled prior to restarting the workers.
        other => {
            self.handle_primary_message(other, state).await;
…sure a primary channel that sent restart is moved back to Ready
Copilot AI review requested due to automatic review settings May 23, 2026 01:19
@erfrimod erfrimod force-pushed the erfrimod/fix-eqe-135-on-subchannel branch from 6245add to cb6fec2 Compare May 23, 2026 01:19
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comment on lines +4218 to +4225
/// Called at the top of each iteration of [`Self::process`] loop.
/// Coalesce Sub-channel restart requests into `self.restart = true`.
/// Any Primary message landing concurrently with a `Restart` is
/// handled prior to running the restart cycle.
async fn drain_pending_messages(&mut self, state: &mut CoordinatorState) {
    while let Ok(Some(msg)) = self.recv.try_next() {
        self.handle_coordinator_message(msg, state).await;
    }
@github-actions

2 participants