prov/efa: Improve error message for ibv_create_ah failure #12136

Merged
jiaxiyan merged 1 commit into ofiwg:main from jiaxiyan:ah_err_msg
Apr 16, 2026

Conversation

@jiaxiyan
Contributor

For the create AH (address handle) command, EINVAL can currently be returned for several reasons: the PD or GID does not exist, or the GID is in another AZ and cross-AZ communication is not allowed. List all of these possible causes in the error message until rdma-core returns more specific error codes.

@jiaxiyan jiaxiyan requested a review from a team April 13, 2026 20:41
@shijin-aws
Contributor

Can you confirm that this error path is tested in the unit tests? I remember adding it, but it is worth checking.

shijin-aws previously approved these changes Apr 13, 2026
@jiaxiyan
Contributor Author

> Can you confirm that this error path is tested in the unit tests? I remember adding it, but it is worth checking.

Yes, we have test_efa_base_ep_enable_ah_alloc_failure and test_efa_rdm_ep_enable_ah_alloc_failure.
I also tested this change manually:

FI_LOG_LEVEL=WARN ./fi_rdm_pingpong -p efa -f efa -E
libfabric:679175:1776112406::efa:av:efa_ah_alloc():147<warn> ibv_create_ah failed with EINVAL. Local GID: fe80::8e0:5fff:fe34:b7b3, remote GID: fe80::8e0:5fff:fe34:b7b3. Possible causes: 1) Remote GID is in a different availability zone (cross-AZ communication is not enabled). 2) Remote GID is invalid. 3) Protection domain 0x650e9d688f20 is invalid.
libfabric:679175:1776112406::efa:ep_ctrl:efa_rdm_ep_ctrl():1494<warn> EFA RDM endpoint cannot create ah for its own address
fi_enable(): common/shared.c:1529, ret=-22 (Invalid argument)
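
For readers who want to see how a message like the warning above can be assembled, here is a minimal sketch in C. It is not the code from prov/efa/src/efa_ah.c: the helper name `format_create_ah_error`, its parameters, and the non-EINVAL branch are hypothetical, and it assumes the 16-byte GIDs can be printed in IPv6 notation with `inet_ntop`, as the log output suggests.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of libfabric): build a warning for a failed
 * ibv_create_ah call. For EINVAL it lists every known cause, because the
 * error code alone does not say which one applies.
 * Returns the snprintf result (length of the full message).
 */
static int format_create_ah_error(char *buf, size_t len, int err,
                                  const unsigned char local_gid[16],
                                  const unsigned char remote_gid[16],
                                  const void *pd)
{
    char local_str[INET6_ADDRSTRLEN];
    char remote_str[INET6_ADDRSTRLEN];

    /* Assumption: a GID is 16 bytes and prints like an IPv6 address. */
    if (!inet_ntop(AF_INET6, local_gid, local_str, sizeof(local_str)))
        snprintf(local_str, sizeof(local_str), "unknown");
    if (!inet_ntop(AF_INET6, remote_gid, remote_str, sizeof(remote_str)))
        snprintf(remote_str, sizeof(remote_str), "unknown");

    /* Any other errno is specific enough to report on its own. */
    if (err != EINVAL)
        return snprintf(buf, len,
                        "ibv_create_ah failed: %s. "
                        "Local GID: %s, remote GID: %s.",
                        strerror(err), local_str, remote_str);

    /* EINVAL is ambiguous, so enumerate the possible causes. */
    return snprintf(buf, len,
                    "ibv_create_ah failed with EINVAL. "
                    "Local GID: %s, remote GID: %s. Possible causes: "
                    "1) Remote GID is in a different availability zone "
                    "(cross-AZ communication is not enabled). "
                    "2) Remote GID is invalid. "
                    "3) Protection domain %p is invalid.",
                    local_str, remote_str, (void *)pd);
}
```

A caller would pass the saved errno, both GIDs, and the PD pointer, then hand the resulting buffer to its logging macro. Keeping the cause list in one place makes it easy to drop once rdma-core reports more specific error codes.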

@jiaxiyan jiaxiyan requested a review from a-szegel April 14, 2026 20:05
Comment thread prov/efa/src/efa_ah.c
For the create AH command, EINVAL can currently be returned because the
PD or GID does not exist, or because the GID is in another AZ and
cross-AZ communication is not allowed.
List all of these causes until rdma-core returns more specific error codes.

Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
@jiaxiyan
Contributor Author

bot:aws:retest

@jiaxiyan jiaxiyan merged commit 93326a1 into ofiwg:main Apr 16, 2026
22 checks passed

4 participants