netvsp: [EQE 135] fix sub-channel panic on EndpointRequiresQueueRestart#3558
Open
erfrimod wants to merge 2 commits into
Open
netvsp: [EQE 135] fix sub-channel panic on EndpointRequiresQueueRestart#3558erfrimod wants to merge 2 commits into
erfrimod wants to merge 2 commits into
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
This PR updates netvsp’s restart coordination so that EndpointRequiresQueueRestart (EQE 135 / GDMA_EQE_HWC_RESET_REQUEST) can safely arrive on sub-channels (and multiple channels concurrently) without triggering primary-only assertions or worker-state panics.
Changes:
- Extend
CoordinatorMessage::Restartto includechannel_idx, and increase the coordinator message channel capacity tomax_queuesto avoid collisions. - Restructure the coordinator loop to drain queued messages before running a restart cycle, and to treat sub-channel restart requests as coalescing signals.
- Add a unit test that injects per-queue
TxError::TryRestarton sub-channels and validates coalesced restart behavior without panics.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
vm/devices/net/netvsp/src/lib.rs |
Adds channel_idx to restart messages, increases coordinator channel capacity, and refactors coordinator/worker restart handling to tolerate sub-channel-triggered restarts. |
vm/devices/net/netvsp/src/test.rs |
Adds per-queue restart injection in the test endpoint and a new async test covering sub-channel restart coalescing and message-handling robustness. |
Comment on lines
+4238
to
+4247
| // Drain any messages that arrived prior to the stop. | ||
| while let Ok(Some(msg)) = self.recv.try_next() { | ||
| match msg { | ||
| CoordinatorMessage::Restart { .. } => { | ||
| // Discarding any additional Restart messages. | ||
| } | ||
| // Ensure any non-restart message from the Primary is | ||
| // handled prior to restarting the workers. | ||
| other => { | ||
| self.handle_primary_message(other, state).await; |
…sure a primary channel that sent restart is moved back to Ready
6245add to
cb6fec2
Compare
Comment on lines
+4218
to
+4225
| /// Called at the top of each iteration of [`Self::process`] loop. | ||
| /// Coalesce Sub-channel restart requests into `self.restart = true`. | ||
| /// Any Primary message landing concurrently with a `Restart` is | ||
| /// handled prior to running the restart cycle. | ||
| async fn drain_pending_messages(&mut self, state: &mut CoordinatorState) { | ||
| while let Ok(Some(msg)) = self.recv.try_next() { | ||
| self.handle_coordinator_message(msg, state).await; | ||
| } |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Sub-channel workers are panicking on
assert_eq!(self.channel_idx, 0)after self-transitioning intoWaitingForCoordinatoronEndpointRequiresQueueRestart, because the coordinator only restores the Primary worker to Ready and the subchannel worker stayed inWaitingForCoordinatoruntil the next process pass tripped the primary-only assert. This is a result ofGDMA_EQE_HWC_RESET_REQUEST(EQE 135) arriving on a subchannel. Netvsp needs to be robust to the EQE arriving on subchannels as well as multiple channels simultaneously.WaitingForCoordinator.max_queuesso signals from Primary and Subchannels don't collide. (Each worker can have at most one in-flight message at a time.)channel_idxso we can differentiate.UpdateorStartTimermessages are handled prior to its worker being restarted.TBD: I want to test this in the lab.