Skip to content

Rildixon/io retry compat system test#234

Draft
riley-dixon wants to merge 16 commits intodevelopfrom
rildixon/io-retry-compat-system-test
Draft

Rildixon/io retry compat system test#234
riley-dixon wants to merge 16 commits intodevelopfrom
rildixon/io-retry-compat-system-test

Conversation

@riley-dixon
Copy link
Collaborator

@riley-dixon riley-dixon commented Mar 23, 2026

GTest's framework allows for a mock to call the real implementation of a method. We can utilize this in our system tests to very selectively choose what gets mocked. In this case, we want to mock the call from KFD/HSA to induce a failure to test the fallback mechanism.

Relies on #215

AIHIPFILE-155

One challenge here was ensuring that the right stats counter gets
incremented when an IO is retried, or trying to prevent an exception
being raised by the stats module from causing the IO to be retried.

Parts of RetryableBackend may want to be brought up to Backend,
like the update_stats functions.
Adds tests for the behaviour of the retry mechanism, as well as
any default behaviour from the RetryableBackend.
This change passes in the original parameters of the IO request
for `is_retryable()` to process. By default, we now check that the
fallback IO engine will accept the IO request prior to submitting
it. This is useful in cases where the Fallback backend is available,
but for some reason the request is still invalid.

It also removes a hack that was used for testing this scenario
with "optionally" mocking `is_retryable()`.
Note:
retryable_io() -> _io_impl()
Formatter moved the order of some functions around.

Previously, a "RetryableBackend" would use io() to wrap the fallback
mechanism before calling retryable_io() to actually perform the
request.

Now, every Backend will use _io_impl() which will be responsible
for handling the IO request, leaving io() as the public front-end.
"Retryable" was considered to be a misnomer in this context. Instead of
retrying the IO with the same backend, we were resubmitting the IO to
a different Backend.

While "BackendWithFallback" is not exactly a great name, its something
we can change later.
This was originally written to BackendWithFallback. However,
this could be further generalized to all Backends. _io_impl()
is then only concerned with issuing the IO, letting the io()
method do any additional processing.
A negative integer represents a system error. Internally, all of the
backends wrap this error code into an exception which gets propagated
up to the hipFile API layer.

Removing the comment that io() may return a negative error code.
Technically this is not enforced as we leave the return type ssize_t,
but this is to avoid type-casting elsewhere.
Some commonly used methods/global variables where defined in multiple
locations. This was fine as long as unit tests did not try to include
test-common.h, or system tests including hipfile-test.h.

This consolidates these definitions into test-common.h.

Tested by compiling and seeing which unit tests fail to compile
due to missing definitions. System tests were unaffected due to not
including hipfile-test.h at all.
Since MOCK_PASSTHROUGH is variadic in terms of the function it calls,
this macro can likely be used in other test modules.
@riley-dixon riley-dixon self-assigned this Mar 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant