test: add Azure health check test script for basic validation #69
Reviewer's Guide
Adds an Ansible-managed Azure HPC health check wrapper test script and wires it into the role so clusters can validate basic Azure HPC health check functionality, including Docker/image prerequisites, SKU support handling, exit codes, and health.log content validation.
Sequence diagram for running the Azure HPC health check test script
sequenceDiagram
actor Admin
participant AnsibleRole as ansible_role_hpc_azure
participant VM as Azure_HPC_VM
participant TestScript as test_azure_health_checks_sh
participant HealthChecks as run_health_checks_sh
participant Container as NVIDIA_health_check_container
participant Log as health_log
Admin->>AnsibleRole: apply role
AnsibleRole->>VM: copy test_azure_health_checks_sh template
AnsibleRole->>VM: clone and configure azurehpc_health_checks_repo
Admin->>TestScript: execute test_azure_health_checks_sh
TestScript->>TestScript: detect_or_read_VM_SKU
alt SKU_not_supported
TestScript-->>Admin: print SKU_not_supported message
else SKU_supported
TestScript->>HealthChecks: invoke run_health_checks_sh with SKU_conf
HealthChecks->>Container: start health check container
Container->>Container: run CUDA_GPU_IB_Ethernet_tests
Container-->>HealthChecks: return exit_code
HealthChecks->>Log: write detailed results
HealthChecks-->>TestScript: exit with code
TestScript->>TestScript: check exit_code
alt exit_code_142
TestScript-->>Admin: print timeout message
else exit_code_0_or_other
TestScript->>Log: verify file_exists
TestScript->>Log: scan for FAIL FAULT ERROR
alt errors_found
TestScript-->>Admin: print failure summary and error excerpts
else no_errors
TestScript-->>Admin: print health checks passed
end
end
end
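The SKU branch in the sequence diagram above can be sketched as a small lookup. This is a minimal sketch, not the script from this PR: the helper name `sku_conf_for` and the naming rule (lowercase the SKU, drop the `standard_` prefix, look for `<name>.conf`) are assumptions inferred from the `Standard_nd40rs_v2` and `conf/nd40rs_v2.conf` pairing shown in the PR description.

```bash
# Hypothetical helper: map a VM SKU to its health-check conf file.
# $1 = directory holding the *.conf files, $2 = VM SKU name.
# Prints the conf path and returns 0 if the file exists, returns 1 otherwise.
sku_conf_for() {
    local conf_dir="$1" sku conf
    # Normalize the SKU, e.g. Standard_nd40rs_v2 -> nd40rs_v2 (assumed rule).
    sku=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
    sku="${sku#standard_}"
    conf="${conf_dir}/${sku}.conf"
    [ -f "$conf" ] || return 1
    printf '%s\n' "$conf"
}
```

A caller could then branch on the return status: run the health checks with the printed conf path on success, or print the "not supported" message on failure.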
Flow diagram for Azure HPC health check test script logic
flowchart TD
S[Start test_azure_health_checks_sh] --> DSK[Detect or read VM SKU]
DSK --> CHKSKU{Is SKU supported by
azurehpc_health_checks?}
CHKSKU -- No --> MSGNS[Print SKU not supported message]
MSGNS --> END[End]
CHKSKU -- Yes --> RUNHC[Run run_health_checks_sh
with SKU-specific conf]
RUNHC --> EC{Exit code from
run_health_checks_sh}
EC -- 142 --> MSGTO[Print timeout message]
MSGTO --> END
EC -- other --> EXLOG[Check health_log exists]
EXLOG --> CHKLOG{Does health_log exist?}
CHKLOG -- No --> MSGNF[Report missing health_log
and mark test failed]
MSGNF --> END
CHKLOG -- Yes --> SCAN[Scan health_log for
FAIL FAULT ERROR]
SCAN --> ERR{Any errors found
in health_log?}
ERR -- Yes --> MSGFAIL[Print failure summary
and error excerpts]
MSGFAIL --> END
ERR -- No --> MSGPASS[Print Azure HPC health
checks completed successfully]
MSGPASS --> END
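The exit-code and health.log branches of the flow above can be sketched as one function. This is an illustration under stated assumptions, not the code in this PR: the name `evaluate_health_run` is hypothetical, and returning non-zero on a timeout is an assumption (the diagram only says a timeout message is printed).

```bash
# Hypothetical helper: decide pass/fail from the exit code and the log.
# $1 = exit code of run-health-checks.sh, $2 = path to health.log
evaluate_health_run() {
    local rc="$1" log_file="$2"
    if [ "$rc" -eq 142 ]; then
        # Treating a timeout as a failure is an assumption of this sketch.
        echo "[FAIL] Health checks timed out (exit code 142)"
        return 1
    fi
    if [ ! -f "$log_file" ]; then
        echo "[FAIL] health.log not found: $log_file"
        return 1
    fi
    # Case-sensitive match on the three keywords named in the flow diagram.
    if grep -Eq 'FAIL|FAULT|ERROR' "$log_file"; then
        echo "[FAIL] FAIL/FAULT/ERROR found in health.log"
        echo "--- Error excerpts from health.log ---"
        grep -E 'FAIL|FAULT|ERROR' "$log_file"
        echo "--- End of excerpts ---"
        return 1
    fi
    echo "[PASS] Azure HPC health checks completed successfully"
}
```

The match is deliberately case-sensitive in this sketch: a case-insensitive search would also pick up passing lines such as `check_gpu_xid: GPU XID error check passed`.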
File-Level Changes
Hey - I've left some high level feedback:
- The usage/help text and examples still reference `run-health-checks.sh` as the executable rather than this wrapper script name (`test-azure-health-checks.sh`), which could confuse users; consider updating the usage block to match how this script is actually invoked.
- In `setup_checks`, the Docker image check uses `docker images | grep -q "mcr.microsoft.com/aznhc/aznhc-nv:latest"`, which is prone to false positives; it would be more robust to use `docker image inspect` or `docker images mcr.microsoft.com/aznhc/aznhc-nv:latest -q` and verify the result is non-empty.
- The script assumes both `timeout` and `systemctl` are available and functional; if this role can run on systems without them (e.g., non-systemd or minimal environments), consider adding explicit checks or fallbacks before using those commands to avoid immediate failures.
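The last two suggestions can be sketched as two small helpers. This is a minimal sketch rather than the change made in this PR; the function names `image_present` and `require_cmds` are hypothetical. It relies only on `docker image inspect`, which exits non-zero when the image reference is not present locally, and on the shell built-in `command -v`.

```bash
# Succeeds only if the exact image reference exists locally.
image_present() {
    docker image inspect "$1" >/dev/null 2>&1
}

# Hypothetical helper: report every missing command, fail if any is missing.
require_cmds() {
    local missing=0 cmd
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "[FAIL] required command not found: $cmd"
            missing=1
        fi
    done
    return "$missing"
}
```

A setup step could call `require_cmds docker timeout systemctl` first and then `image_present "mcr.microsoft.com/aznhc/aznhc-nv:latest"`, stopping on the first non-zero status.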
    VERBOSE=0
    TIMEOUT=""
    AZNHC_HOME="{{ __hpc_azure_tests_dir }}/azurehpc-health-checks"
    TRIGGER_SCRIPT="${AZNHC_HOME}/run-health-checks.sh"
Will this file exist?
@spetrosi yes, this file exists. I install azurehpc-health-checks under /opt/hpc/azure/tests/, not under /opt/hpc/azure/tools/.
ls /opt/hpc/azure/tests/azurehpc-health-checks/run-health-checks.sh
/opt/hpc/azure/tests/azurehpc-health-checks/run-health-checks.sh
@spetrosi run-health-checks.sh is from azurehpc-health-checks itself, refer to https://github.com/Azure/azurehpc-health-checks/blob/main/run-health-checks.sh, thank you so much.
    fail() {
        echo "[FAIL] $1"
        FAILED=$((FAILED + 1))
    }
When a test fails, echo the failure message and then just exit the script with a non-zero exit status.
I note that further down the typical caller is "fail .... ; exit 1", so let's fail the whole test on the first failure rather than trying to run more things and failing because of cascading failures.
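A fail-fast version of the helper, as suggested in this comment, might look like the following. This is only a sketch of the idea; the exit status 1 is an assumption taken from the "fail .... ; exit 1" pattern the reviewer mentions.

```bash
# Print the failure message and stop the whole test immediately.
fail() {
    echo "[FAIL] $1"
    exit 1
}
```

With this shape, callers no longer need a trailing `exit 1`, and the `FAILED` counter becomes unnecessary.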
@dgchinner Updated, thank you so much.
    if [[ $FAILED -eq 0 ]]; then
        log "Overall: SUCCESS"
        exit 0
    else
        log "Overall: FAILURE"
        exit $ret
    fi
So if this skips tests, there is no exit status communicated? i.e. the test runner will consider skipped tests as a pass? Also, if fail() exits the test, there is no need for the summary boilerplate code - to get here the test must have passed....
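One way to make a skipped test visible to a test runner is a dedicated exit status. The sketch below is an assumption, not what this PR does: the helper name `skip` is hypothetical, and the value 77 is borrowed from the Automake test harness convention for skipped tests, so the runner used for this role may expect something different.

```bash
# Assumed convention: 77 signals "skipped" (as in the Automake test harness).
SKIP_RC=77

# Print the skip reason and exit with a status distinct from pass (0) and fail (1).
skip() {
    echo "[SKIP] $1"
    exit "$SKIP_RC"
}
```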
@dgchinner updated, thank you so much for detailed comment
    # Run the health checks and capture output
    ret=0
    output=$(mktemp)
    "${cmd[@]}" 2>&1 | tee "$output" || ret=$?
This sets the value of ret to the exit status of the 'tee' command in the pipeline, not the command that is being run. This should probably be written as something like:
"${cmd[@]}" 2>&1 | tee "$output"
ret="${PIPESTATUS[0]}"
To capture the return value of the cmd being run rather than the tee process.
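The same idea can be wrapped in a function so the command's own status is what the caller sees. This is a bash-only sketch (it depends on the `PIPESTATUS` array), and the name `run_and_capture` is hypothetical.

```bash
# Run a command, mirror its combined output into a file via tee,
# and return the command's exit status instead of tee's.
# $1 = output file, remaining arguments = command to run
run_and_capture() {
    local output="$1"
    shift
    "$@" 2>&1 | tee "$output"
    # PIPESTATUS[0] is the status of the first command in the pipeline above.
    return "${PIPESTATUS[0]}"
}
```

A caller would then use `ret=0; run_and_capture "$output" "${cmd[@]}" || ret=$?` to pick up the status of the health check command.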
@dgchinner updated, thank you so much for review.
07bfd23 to
e1194ea
@dgchinner the main issue is that when I execute the health check test on Standard_nd40rs_v2, not all of the test cases in https://github.com/Azure/azurehpc-health-checks/blob/main/conf/nd40rs_v2.conf PASS. Do we need to analyze these failures?
Running health checks for Standard_nd40rs_v2 SKU...
==========
Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
SUCCESS: nhc: Health check passed: check_gpu_ecc: ECC checks passed
[PASS] Azure HPC health checks completed successfully
Checking: health.log file exists
--- Error excerpts from health.log ---
Add a test script for azurehpc-health-checks validation.

Validate azurehpc-health-checks basic setup, and mainly call run-health-checks.sh to do SKU health verification. Refer to <https://github.com/Azure/azurehpc-health-checks>

Test result scenario log output examples:
- SKU not support:
  The vm SKU 'standard_nc4as_t4_v3' is currently not supported by Azure health checks.
- Test timeout scenario
  Health checks completed with exit code: 142
- SKU supported scenario
  Standard_nd40rs_v2 could run with detailed test result

JIRA: RHELHPC-117
Signed-off-by: Xuemin Li <xuli@redhat.com>
e1194ea to
58c7019
The IP network device name is likely to be "ib0", not "mlx5_ib0". This may be a result of not having the persistent naming monitor running, or may just be a hard coded naming assumption in the health check scripts. I don't think this is a blocker, but it needs investigation as to why the "ethernet" device name either doesn't exist or why the health check is trying to use a name that doesn't exist.
That's from some kind of GPU bandwidth test. I'm not sure what is being tested, exactly, but I'd suggest that it is likely a spurious threshold failure as the number reported (9.71) is only just less than the failure threshold (10). For the moment, I don't see this as being a problem as the performance is very close to what is expected.
Hence I don't think either of these issues should block merging the test scripts - the tests are indicating potential problems with the functionality we have installed/configured, not problems with the test scripts or the health monitoring infrastructure it is exercising.
Code and commit messages look good, I think this can be merged.
Enhancement:
Add a test script for azurehpc-health-checks validation. Validate azurehpc-health-checks basic setup, and mainly call run-health-checks.sh to do SKU health verification. Refer to https://github.com/Azure/azurehpc-health-checks
Test result scenario log output examples:
Reason:
Result:
Test 1:
No custom conf file specified, detecting VM SKU...
The vm SKU 'standard_nc4as_t4_v3' is currently not supported by Azure health checks.
Test 2:
Running health checks for Standard_nd40rs_v2 SKU...
Running health checks using /opt/hpc/azure/tests/azurehpc-health-checks/conf/nd40rs_v2.conf and outputting to /opt/hpc/azure/tests/azurehpc-health-checks/health.log
==========
== CUDA ==
CUDA Version 12.4.1
Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
SUCCESS: nhc: Health check passed: check_gpu_ecc: ECC checks passed
SUCCESS: nhc: Health check passed: check_gpu_count: Expected 8 and found 8
SUCCESS: nhc: Health check passed: check_gpu_xid: GPU XID error check passed.
SUCCESS: nhc: Health check passed: check_nvsmi_healthmon: nvidia-smi completed successfully
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 0 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 1 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 2 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 3 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 4 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 5 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 6 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 7 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nccl_allreduce: NCCL all reduce bandwidth test passed, 128.579 GB/s
SUCCESS: nhc: Health check passed: check_ib_bw_non_gdr: IB write bandwidth non gdr test passed for IB=mlx5_0, IB BW=103.59 Gbps
SUCCESS: nhc: Health check passed: check_ib_link_flapping: No IB link flapping found
Health checks completed with exit code: 0.
[PASS] Azure HPC health checks completed successfully
[2026-02-07 10:51:19] Validating health.log...
Checking: health.log file exists
[PASS] health.log file exists
Checking: health.log for errors
[FAIL] FAIL/FAULT/ERROR found in health.log
--- Error excerpts from health.log ---
ERROR: nhc: Health check failed: check_hw_ib: No IB port mlx5_ib0:1 is ACTIVE (LinkUp 100 Gb/sec).
ERROR: nhc: Health check failed: check_hw_eth: Ethernet device ib0 not detected.
SUCCESS: nhc: Health check passed: check_gpu_xid: GPU XID error check passed.
ERROR: nhc: Health check failed: check_gpu_bw: H2D test on GPU 0 failed. Bandwidth 9.71 is less than 10. FaultCode: NHC2020
SUCCESS: nhc: Health check passed: check_gpu_xid: GPU XID error check passed.
SUCCESS: nhc: Health check passed: check_gpu_xid: GPU XID error check passed.
--- End of excerpts ---
[2026-02-07 10:51:19] ========================================
[2026-02-07 10:51:19] Test Summary
[2026-02-07 10:51:19] ========================================
[2026-02-07 10:51:19] Passed: 6
[2026-02-07 10:51:19] Failed: 1
[2026-02-07 10:51:19] Skipped: 0
[2026-02-07 10:51:19] ========================================
[2026-02-07 10:51:19] Overall: FAILURE
JIRA: RHELHPC-117
Summary by Sourcery
Add an Azure HPC health checks wrapper script and integrate it into the HPC Azure test tasks for basic health validation.
Enhancements:
Tests: