Skip to content

Backend: Add BackendWithFallback to retry IO#215

Open
riley-dixon wants to merge 26 commits intodevelopfrom
rildixon/io-retry-compat
Open

Backend: Add BackendWithFallback to retry IO#215
riley-dixon wants to merge 26 commits intodevelopfrom
rildixon/io-retry-compat

Conversation

@riley-dixon
Copy link
Collaborator

@riley-dixon riley-dixon commented Mar 11, 2026

Motivation

Modifies Backend & adds a new BackendWithFallback interface to re-attempt an IO request that fails with a different Backend. For example, if a Fastpath IO request fails, the Fallback path can automatically retry the IO.

Technical Details

  • Derived BackendWithFallback backends can optionally register another Backend as eligible to retry failed IO requests.
  • hipFile will resubmit the failed IO request if a fallback backend is registered and would otherwise accept the original IO request.
  • Backend has been reworked for how IO requests are processed. A new virtual function _io_impl() has been added which handles only the logic required to submit the IO request to the kernel/driver. io() now acts as the entry point and can handle coordinating other tasks such as updating stats or the auto fallback mechanism. (Part of the rationale for this change is so the auto fallback mechanism only retries the IO if the IO failed and not something else.)

Test Plan

Added unit tests and Fastpath integration level tests. System level tests will be added separately as it is a non-trivial addition.

Reviewer Note

When reviewing the first couple of commits, you may wish to be aware of certain changes made later on.

AIHIPFILE-100

@riley-dixon riley-dixon self-assigned this Mar 11, 2026
@riley-dixon riley-dixon force-pushed the rildixon/io-retry-compat branch 4 times, most recently from 0234797 to 8b8ccfa Compare March 18, 2026 19:06
@riley-dixon riley-dixon changed the title Backend: Add RetryableBackend to retry IO Backend: Add BackendWithFallback to retry IO Mar 18, 2026
@riley-dixon riley-dixon force-pushed the rildixon/io-retry-compat branch from 8b8ccfa to 4025ee3 Compare March 18, 2026 19:29
@riley-dixon riley-dixon requested a review from Copilot March 18, 2026 19:36
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors the Backend IO pathway to centralize common logic (stats updates + fallback retry orchestration) and introduces BackendWithFallback so a backend (e.g., Fastpath) can automatically retry failed IO through an alternate backend (e.g., Fallback).

Changes:

  • Added BackendWithFallback and refactored Backend to route IO through a new _io_impl() hook, with io() acting as the common entry point.
  • Wired Fastpath to optionally register and use a fallback backend when _io_impl() throws and the fallback backend accepts the request.
  • Added/updated unit + integration tests and build plumbing for the new backend behavior.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
src/amd_detail/backend.h Introduces _io_impl(), stats update hooks, and new BackendWithFallback interface.
src/amd_detail/backend.cpp Implements the shared Backend::io() flow and fallback retry logic for BackendWithFallback::io().
src/amd_detail/backend/fastpath.h / src/amd_detail/backend/fastpath.cpp Migrates Fastpath to BackendWithFallback, moves IO submission into _io_impl(), adds stats hook implementations.
src/amd_detail/backend/fallback.h / src/amd_detail/backend/fallback.cpp Adds stats hook implementations and _io_impl(); adjusts IO helper structure for chunking.
src/amd_detail/state.cpp Creates/registers a shared Fallback backend instance and registers it as Fastpath’s fallback when enabled.
src/amd_detail/CMakeLists.txt Adds new backend.cpp to the library build.
test/amd_detail/backend.cpp New unit tests for fallback retry behavior at the Backend layer.
test/amd_detail/fastpath.cpp Adds Fastpath integration tests covering retry + rejection behaviors via fallback.
test/amd_detail/mbackend.h Extends backend mocks to support the new hooks (_io_impl, stats updates) and adds a BackendWithFallback mock.
test/amd_detail/CMakeLists.txt Adds new backend test file to the test build.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

@riley-dixon riley-dixon force-pushed the rildixon/io-retry-compat branch from 4025ee3 to 756a18f Compare March 18, 2026 21:39
@riley-dixon
Copy link
Collaborator Author

@copilot

Copy link

Copilot AI commented Mar 18, 2026

@riley-dixon I've opened a new pull request, #227, to work on those changes. Once the pull request is ready, I'll request review from you.

@riley-dixon riley-dixon requested a review from Copilot March 18, 2026 21:43
One challenge here was ensuring that the right stats counter gets
incremented when an IO is retried, or trying to prevent an exception
being raised by the stats module from causing the IO to be retried.

Parts of RetryableBackend may want to be brought up to Backend,
like the update_stats functions.
Adds tests for the behaviour of the retry mechanism, as well as
any default behaviour from the RetryableBackend.
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces an automatic IO retry mechanism by adding a BackendWithFallback interface and refactoring backend IO submission so failures from one backend (e.g., Fastpath) can be retried via a registered fallback backend (e.g., Fallback).

Changes:

  • Refactor Backend IO flow by introducing _io_impl() and centralizing stats updates in Backend::io().
  • Add BackendWithFallback and wire Fastpath to optionally retry failed IO using a registered fallback backend.
  • Add unit and integration tests covering fallback retry behavior.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
src/amd_detail/backend.h Defines new Backend::io() entry point, _io_impl(), stats hooks, and BackendWithFallback.
src/amd_detail/backend.cpp Implements shared IO/stats behavior and fallback retry logic.
src/amd_detail/backend/fastpath.h Converts Fastpath to BackendWithFallback and moves IO into _io_impl().
src/amd_detail/backend/fastpath.cpp Implements _io_impl() plus Fastpath stats update hooks.
src/amd_detail/backend/fallback.h Adds stats hooks and _io_impl() plumbing for the fallback backend.
src/amd_detail/backend/fallback.cpp Routes IO through _io_impl() and ensures stats update behavior remains correct.
src/amd_detail/state.cpp Creates a shared fallback backend instance and registers it with Fastpath.
src/amd_detail/CMakeLists.txt Adds new production source backend.cpp to the build.
test/amd_detail/mbackend.h Updates mocks for new backend interface and adds MBackendWithFallback.
test/amd_detail/backend.cpp Adds unit tests validating fallback eligibility and retry behavior.
test/amd_detail/fastpath.cpp Adds integration-level tests for Fastpath retrying via Fallback.
test/amd_detail/CMakeLists.txt Adds new test source backend.cpp to the test build.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

This change passes in the original parameters of the IO request
for `is_retryable()` to process. By default, we now check that the
fallback IO engine will accept the IO request prior to submitting
it. This is useful in cases where the Fallback backend is available,
but for some reason the request is still invalid.

It also removes a hack that was used for testing this scenario
with "optionally" mocking `is_retryable()`.
Note:
retryable_io() -> _io_impl()
Formatter moved the order of some functions around.

Previously, a "RetryableBackend" would use io() to wrap the fallback
mechanism before calling retryable_io() to actually perform the
request.

Now, every Backend will use _io_impl() which will be responsible
for handling the IO request, leaving io() as the public front-end.
"Retryable" was considered to be a misnomer in this context. Instead of
retrying the IO with the same backend, we were resubmitting the IO to
a different Backend.

While "BackendWithFallback" is not exactly a great name, its something
we can change later.
This was originally written to BackendWithFallback. However,
this could be further generalized to all Backends. _io_impl()
is then only concerned with issuing the IO, letting the io()
method do any additional processing.
@riley-dixon riley-dixon force-pushed the rildixon/io-retry-compat branch 2 times, most recently from 517341b to 4f131e6 Compare March 18, 2026 22:21
@riley-dixon riley-dixon marked this pull request as ready for review March 18, 2026 22:21
}
};

// The Fastpath can throw a few different kinds of derived std::runtime_errors.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this test needs to be broken up so that each test focuses on specific functionality.

I think what you could do is instantiate a fastpath backend instance and then register a mock backend as the fallback backend. then you could have a number of tests where each test ensures that Fastpath::is_retryable does the right thing for different return values from Hip::hipAmdFileRead/Hip::hipAmdFileWrite (-ENODEVICE, ....).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can definitely add some new unit tests specifically for Fallback::is_fallback_eligible. I think though I would still like to keep the integration tests if possible. Yes, it touches on a lot of different areas but that is the point, to make sure the fallback mechanism behaves as intended starting from io submission. The future system tests will expand on this by not mocking out all Hip & Sys calls.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test introduces implementation details of fallback into fastpath tests. Instead of registering the fallback backend as the fallback, could you register a mock backend as the fallback?

There should also be unit tests to ensure that is_fallback_eligible returns as expected for different inputs.

Copy link
Collaborator Author

@riley-dixon riley-dixon Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is_fallback_eligible is not directly accessible since we discussed moving its visibility to be protected. Other unit tests have already been created to cover different scenarios.

could you register a mock backend as the fallback?

There are already unit tests that do this. The integration test is specific to test that the Fastpath->Fallback IO path works as intended with as much real code as possible. At this point I'll remove the integration test, but we should get the system test in so that we are covering this execution path.

Its been recommended to leave stats as-is and stats is assumed
to not throw exceptions.
Backend::io() and Fallback::io() did the same thing. Just explicitly
mark in Fallback to use the base implementation.
Copilot AI review requested due to automatic review settings March 25, 2026 18:58
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot AI review requested due to automatic review settings March 26, 2026 19:24
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot AI review requested due to automatic review settings March 26, 2026 19:36
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@riley-dixon riley-dixon force-pushed the rildixon/io-retry-compat branch from cd709ec to eba18a5 Compare March 26, 2026 20:27
The two checks are registering a nullptr, and registering itself
(leads to an infinite loop).
This changes how is_fallback_eligible() gets used.

is_fallback_eligible() should only be concerned with its own instance,
and not if the fallback backend will accept the IO request. This also
means that implementations of is_fallback_eligible() do not need to
remember to call the fallback Backend::score() method.
These tests use a mocked fallback Backend which simplifies setting
up the expectations.
The unit test looks largely similar to the integration test, with
the only difference being what gets mocked:
    (mcfg.fallback() vs. m_fallback.score())

Because this test does not care how the fallback backend rejects
the IO, lets just shift it to an integration test and avoid
duplicating the kind-of-ugly exception-assertion logic.

A future system test which disables the Fallback mechanism
will cover this failure path with the real Fallback backend.
Copilot AI review requested due to automatic review settings March 26, 2026 20:49
@riley-dixon riley-dixon force-pushed the rildixon/io-retry-compat branch from eba18a5 to d1cf110 Compare March 26, 2026 20:49
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants