patches-7.0: replace cdnsp-sky1 shutdown SError mitigations with host teardown#37
Open
npmccallum wants to merge 1 commit into
Open
patches-7.0: replace cdnsp-sky1 shutdown SError mitigations with host teardown#37npmccallum wants to merge 1 commit into
npmccallum wants to merge 1 commit into
Conversation
… teardown
Remove 0061-DPTSW-24991 and 0062-DPTSW-25423 and replace them with
0061-usb-cdns3-cdnsp-sky1-tear-down-host-on-shutdown.patch.
cdnsp_sky1_shutdown() gates the USB clocks while the child xHCI host and
its devices are still live above the controller. If anything touches the
controller's registers after the clocks are gated, the access raises a
fatal asynchronous SError that hangs shutdown/reboot. Three distinct paths
can do this:
1. the gadget (cdnsp_gadget_pullup) - addressed by 0061 (release the
gadget driver);
2. the xHCI IRQ handler (xhci_irq reading USBSTS) - addressed by 0062
(disable_irq + clear HCD_FLAG_HW_ACCESSIBLE);
3. a host device's async error recovery - not covered by either. The
panic we captured (via netconsole) came from usb-storage ->
xhci_urb_dequeue: an attached mass-storage device's SCSI
error-recovery thread resetting its port and dequeuing URBs touches
xHCI MMIO after the clocks are gated. 0061 (gadget) and 0062 (IRQ)
do not stop this thread, so the hang still reproduces with both.
Instead of a third targeted mitigation, route ->shutdown() through the
existing ->remove() path, which calls of_platform_depopulate() to remove
the xHCI host first - disconnecting its devices and stopping all of their
work (gadget, IRQ, and async recovery) - before the resets are asserted
and the clocks gated. One patch supersedes both mitigations and covers the
host async-recovery path they miss.
Tested: 13/13 consecutive `systemctl reboot` with an attached composite USB
device (HID + mass-storage), zero SErrors on every teardown (captured via
netconsole), versus a deterministic hang before; an 8-reboot stress run
was also clean.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Remove 0061-DPTSW-24991 and 0062-DPTSW-25423 and replace them with 0061-usb-cdns3-cdnsp-sky1-tear-down-host-on-shutdown.patch.
cdnsp_sky1_shutdown() gates the USB clocks while the child xHCI host and its devices are still live above the controller. If anything touches the controller's registers after the clocks are gated, the access raises a fatal asynchronous SError that hangs shutdown/reboot. Three distinct paths can do this:
Instead of a third targeted mitigation, route ->shutdown() through the existing ->remove() path, which calls of_platform_depopulate() to remove the xHCI host first - disconnecting its devices and stopping all of their work (gadget, IRQ, and async recovery) - before the resets are asserted and the clocks gated. One patch supersedes both mitigations and covers the host async-recovery path they miss.
Reusing ->remove() for ->shutdown() is an established pattern for USB
dual-role / glue drivers in mainline, for example:
(drivers/usb/dwc3/dwc3-qcom.c)
(drivers/usb/dwc3/dwc3-xilinx.c)
(drivers/usb/chipidea/ci_hdrc_imx.c)