Backend: Add BackendWithFallback to retry IO#215
Backend: Add BackendWithFallback to retry IO#215riley-dixon wants to merge 26 commits intodevelopfrom
Conversation
0234797 to
8b8ccfa
Compare
8b8ccfa to
4025ee3
Compare
There was a problem hiding this comment.
Pull request overview
This PR refactors the Backend IO pathway to centralize common logic (stats updates + fallback retry orchestration) and introduces BackendWithFallback so a backend (e.g., Fastpath) can automatically retry failed IO through an alternate backend (e.g., Fallback).
Changes:
- Added
BackendWithFallbackand refactoredBackendto route IO through a new_io_impl()hook, withio()acting as the common entry point. - Wired Fastpath to optionally register and use a fallback backend when
_io_impl()throws and the fallback backend accepts the request. - Added/updated unit + integration tests and build plumbing for the new backend behavior.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
src/amd_detail/backend.h |
Introduces _io_impl(), stats update hooks, and new BackendWithFallback interface. |
src/amd_detail/backend.cpp |
Implements the shared Backend::io() flow and fallback retry logic for BackendWithFallback::io(). |
src/amd_detail/backend/fastpath.h / src/amd_detail/backend/fastpath.cpp |
Migrates Fastpath to BackendWithFallback, moves IO submission into _io_impl(), adds stats hook implementations. |
src/amd_detail/backend/fallback.h / src/amd_detail/backend/fallback.cpp |
Adds stats hook implementations and _io_impl(); adjusts IO helper structure for chunking. |
src/amd_detail/state.cpp |
Creates/registers a shared Fallback backend instance and registers it as Fastpath’s fallback when enabled. |
src/amd_detail/CMakeLists.txt |
Adds new backend.cpp to the library build. |
test/amd_detail/backend.cpp |
New unit tests for fallback retry behavior at the Backend layer. |
test/amd_detail/fastpath.cpp |
Adds Fastpath integration tests covering retry + rejection behaviors via fallback. |
test/amd_detail/mbackend.h |
Extends backend mocks to support the new hooks (_io_impl, stats updates) and adds a BackendWithFallback mock. |
test/amd_detail/CMakeLists.txt |
Adds new backend test file to the test build. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
4025ee3 to
756a18f
Compare
|
@riley-dixon I've opened a new pull request, #227, to work on those changes. Once the pull request is ready, I'll request review from you. |
One challenge here was ensuring that the right stats counter gets incremented when an IO is retried, or trying to prevent an exception being raised by the stats module from causing the IO to be retried. Parts of RetryableBackend may want to be brought up to Backend, like the update_stats functions.
Adds tests for the behaviour of the retry mechanism, as well as any default behaviour from the RetryableBackend.
There was a problem hiding this comment.
Pull request overview
This PR introduces an automatic IO retry mechanism by adding a BackendWithFallback interface and refactoring backend IO submission so failures from one backend (e.g., Fastpath) can be retried via a registered fallback backend (e.g., Fallback).
Changes:
- Refactor
BackendIO flow by introducing_io_impl()and centralizing stats updates inBackend::io(). - Add
BackendWithFallbackand wire Fastpath to optionally retry failed IO using a registered fallback backend. - Add unit and integration tests covering fallback retry behavior.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
src/amd_detail/backend.h |
Defines new Backend::io() entry point, _io_impl(), stats hooks, and BackendWithFallback. |
src/amd_detail/backend.cpp |
Implements shared IO/stats behavior and fallback retry logic. |
src/amd_detail/backend/fastpath.h |
Converts Fastpath to BackendWithFallback and moves IO into _io_impl(). |
src/amd_detail/backend/fastpath.cpp |
Implements _io_impl() plus Fastpath stats update hooks. |
src/amd_detail/backend/fallback.h |
Adds stats hooks and _io_impl() plumbing for the fallback backend. |
src/amd_detail/backend/fallback.cpp |
Routes IO through _io_impl() and ensures stats update behavior remains correct. |
src/amd_detail/state.cpp |
Creates a shared fallback backend instance and registers it with Fastpath. |
src/amd_detail/CMakeLists.txt |
Adds new production source backend.cpp to the build. |
test/amd_detail/mbackend.h |
Updates mocks for new backend interface and adds MBackendWithFallback. |
test/amd_detail/backend.cpp |
Adds unit tests validating fallback eligibility and retry behavior. |
test/amd_detail/fastpath.cpp |
Adds integration-level tests for Fastpath retrying via Fallback. |
test/amd_detail/CMakeLists.txt |
Adds new test source backend.cpp to the test build. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
This change passes in the original parameters of the IO request for `is_retryable()` to process. By default, we now check that the fallback IO engine will accept the IO request prior to submitting it. This is useful in cases where the Fallback backend is available, but for some reason the request is still invalid. It also removes a hack that was used for testing this scenario with "optionally" mocking `is_retryable()`.
Note: retryable_io() -> _io_impl() Formatter moved the order of some functions around. Previously, a "RetryableBackend" would use io() to wrap the fallback mechanism before calling retryable_io() to actually perform the request. Now, every Backend will use _io_impl() which will be responsible for handling the IO request, leaving io() as the public front-end.
"Retryable" was considered to be a misnomer in this context. Instead of retrying the IO with the same backend, we were resubmitting the IO to a different Backend. While "BackendWithFallback" is not exactly a great name, its something we can change later.
This was originally written to BackendWithFallback. However, this could be further generalized to all Backends. _io_impl() is then only concerned with issuing the IO, letting the io() method do any additional processing.
517341b to
4f131e6
Compare
test/amd_detail/fastpath.cpp
Outdated
| } | ||
| }; | ||
|
|
||
| // The Fastpath can throw a few different kinds of derived std::runtime_errors. |
There was a problem hiding this comment.
I think this test needs to be broken up so that each test focuses on specific functionality.
I think what you could do is instantiate a fastpath backend instance and then register a mock backend as the fallback backend. then you could have a number of tests where each test ensures that Fastpath::is_retryable does the right thing for different return values from Hip::hipAmdFileRead/Hip::hipAmdFileWrite (-ENODEVICE, ....).
There was a problem hiding this comment.
I can definitely add some new unit tests specifically for Fallback::is_fallback_eligible. I think though I would still like to keep the integration tests if possible. Yes, it touches on a lot of different areas but that is the point, to make sure the fallback mechanism behaves as intended starting from io submission. The future system tests will expand on this by not mocking out all Hip & Sys calls.
There was a problem hiding this comment.
This test introduces implementation details of fallback into fastpath tests. Instead of registering the fallback backend as the fallback, could you register a mock backend as the fallback?
There should also be unit tests to ensure that is_fallback_eligible returns as expected for different inputs.
There was a problem hiding this comment.
is_fallback_eligible is not directly accessible since we discussed moving its visibility to be protected. Other unit tests have already been created to cover different scenarios.
could you register a mock backend as the fallback?
There are already unit tests that do this. The integration test is specific to test that the Fastpath->Fallback IO path works as intended with as much real code as possible. At this point I'll remove the integration test, but we should get the system test in so that we are covering this execution path.
Its been recommended to leave stats as-is and stats is assumed to not throw exceptions.
Backend::io() and Fallback::io() did the same thing. Just explicitly mark in Fallback to use the base implementation.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
cd709ec to
eba18a5
Compare
The two checks are registering a nullptr, and registering itself (leads to an infinite loop).
This changes how is_fallback_eligible() gets used. is_fallback_eligible() should only be concerned with its own instance, and not if the fallback backend will accept the IO request. This also means that implementations of is_fallback_eligible() do not need to remember to call the fallback Backend::score() method.
These tests use a mocked fallback Backend which simplifies setting up the expectations.
The unit test looks largely similar to the integration test, with
the only difference being what gets mocked:
(mcfg.fallback() vs. m_fallback.score())
Because this test does not care how the fallback backend rejects
the IO, lets just shift it to an integration test and avoid
duplicating the kind-of-ugly exception-assertion logic.
A future system test which disables the Fallback mechanism
will cover this failure path with the real Fallback backend.
eba18a5 to
d1cf110
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Motivation
Modifies Backend & adds a new BackendWithFallback interface to re-attempt an IO request that fails with a different Backend. For example, if a Fastpath IO request fails, the Fallback path can automatically retry the IO.
Technical Details
BackendWithFallbackbackends can optionally register another Backend as eligible to retry failed IO requests._io_impl()has been added which handles only the logic required to submit the IO request to the kernel/driver.io()now acts as the entry point and can handle coordinating other tasks such as updating stats or the auto fallback mechanism. (Part of the rationale for this change is so the auto fallback mechanism only retries the IO if the IO failed and not something else.)Test Plan
Added unit tests and Fastpath integration level tests. System level tests will be added separately as it is a non-trivial addition.
Reviewer Note
When reviewing the first couple of commits, you may wish to be aware of certain changes made later on.
AIHIPFILE-100