test: add Azure health check test script for basic validation #69

lixuemin2016 · 2026-02-08T09:20:59Z

Enhancement:
Add a test script that validates the basic azurehpc-health-checks setup and then calls run-health-checks.sh to verify the health of the VM SKU. See https://github.com/Azure/azurehpc-health-checks

Test result scenario log output examples:

SKU not supported: The vm SKU 'standard_nc4as_t4_v3' is currently not supported by Azure health checks.
Test timeout: Health checks completed with exit code: 142
SKU supported: Standard_nd40rs_v2 runs and produces detailed test results

Reason:

Result:

Test 1:
No custom conf file specified, detecting VM SKU...
The vm SKU 'standard_nc4as_t4_v3' is currently not supported by Azure health checks.

Test 2:
Running health checks for Standard_nd40rs_v2 SKU...
Running health checks using /opt/hpc/azure/tests/azurehpc-health-checks/conf/nd40rs_v2.conf and outputting to /opt/hpc/azure/tests/azurehpc-health-checks/health.log

==========
== CUDA ==

CUDA Version 12.4.1

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

SUCCESS: nhc: Health check passed: check_gpu_ecc: ECC checks passed
SUCCESS: nhc: Health check passed: check_gpu_count: Expected 8 and found 8
SUCCESS: nhc: Health check passed: check_gpu_xid: GPU XID error check passed.
SUCCESS: nhc: Health check passed: check_nvsmi_healthmon: nvidia-smi completed successfully
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 0 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 1 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 2 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 3 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 4 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 5 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 6 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 7 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nccl_allreduce: NCCL all reduce bandwidth test passed, 128.579 GB/s
SUCCESS: nhc: Health check passed: check_ib_bw_non_gdr: IB write bandwidth non gdr test passed for IB=mlx5_0, IB BW=103.59 Gbps
SUCCESS: nhc: Health check passed: check_ib_link_flapping: No IB link flapping found
Health checks completed with exit code: 0.

[PASS] Azure HPC health checks completed successfully
[2026-02-07 10:51:19] Validating health.log...

Checking: health.log file exists
[PASS] health.log file exists
Checking: health.log for errors
[FAIL] FAIL/FAULT/ERROR found in health.log

--- Error excerpts from health.log ---
ERROR: nhc: Health check failed: check_hw_ib: No IB port mlx5_ib0:1 is ACTIVE (LinkUp 100 Gb/sec).
ERROR: nhc: Health check failed: check_hw_eth: Ethernet device ib0 not detected.
SUCCESS: nhc: Health check passed: check_gpu_xid: GPU XID error check passed.
ERROR: nhc: Health check failed: check_gpu_bw: H2D test on GPU 0 failed. Bandwidth 9.71 is less than 10. FaultCode: NHC2020
SUCCESS: nhc: Health check passed: check_gpu_xid: GPU XID error check passed.
SUCCESS: nhc: Health check passed: check_gpu_xid: GPU XID error check passed.
--- End of excerpts ---
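
The excerpt above also lists `SUCCESS:` lines, presumably because the scan matches "error" case-insensitively and the check_gpu_xid message contains "XID error check". A minimal sketch of a stricter scan follows. It assumes failing lines start with `ERROR:` as in this log; the `FAIL:` and `FAULT:` prefixes and the function name `scan_health_log` are illustrative assumptions, not the script's actual code.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the lines that report a failure.
# Returns 0 if at least one such line exists, non-zero otherwise.
# Assumption: failure lines begin with ERROR:, FAIL: or FAULT:.
scan_health_log() {
    local log_file="$1"
    # Anchor on the line prefix so that
    # "SUCCESS: ... GPU XID error check passed." is not reported.
    grep -E '^(ERROR|FAIL|FAULT):' "$log_file"
}
```

Anchoring on the prefix trades recall for precision: a failure reported in some other line format would be missed, so the pattern would need checking against real health.log files.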

[2026-02-07 10:51:19] ========================================
[2026-02-07 10:51:19] Test Summary
[2026-02-07 10:51:19] ========================================
[2026-02-07 10:51:19] Passed: 6
[2026-02-07 10:51:19] Failed: 1
[2026-02-07 10:51:19] Skipped: 0
[2026-02-07 10:51:19] ========================================
[2026-02-07 10:51:19] Overall: FAILURE

JIRA: RHELHPC-117

Summary by Sourcery

Add an Azure HPC health checks wrapper script and integrate it into the HPC Azure test tasks for basic health validation.

Enhancements:

Introduce a wrapper script around azurehpc-health-checks to orchestrate health check execution, logging, and result summarization.

Tests:

Install and wire up a test script that runs azurehpc-health-checks with pre-flight environment checks, timeout handling, SKU support detection, and post-run log validation.

sourcery-ai · 2026-02-08T09:21:05Z

Reviewer's Guide

Adds an Ansible-managed Azure HPC health check wrapper test script and wires it into the role so clusters can validate basic Azure HPC health check functionality, including Docker/image prerequisites, SKU support handling, exit codes, and health.log content validation.

Sequence diagram for running the Azure HPC health check test script

sequenceDiagram
  actor Admin
  participant AnsibleRole as ansible_role_hpc_azure
  participant VM as Azure_HPC_VM
  participant TestScript as test_azure_health_checks_sh
  participant HealthChecks as run_health_checks_sh
  participant Container as NVIDIA_health_check_container
  participant Log as health_log

  Admin->>AnsibleRole: apply role
  AnsibleRole->>VM: copy test_azure_health_checks_sh template
  AnsibleRole->>VM: clone and configure azurehpc_health_checks_repo

  Admin->>TestScript: execute test_azure_health_checks_sh
  TestScript->>TestScript: detect_or_read_VM_SKU
  alt SKU_not_supported
    TestScript-->>Admin: print SKU_not_supported message
  else SKU_supported
    TestScript->>HealthChecks: invoke run_health_checks_sh with SKU_conf
    HealthChecks->>Container: start health check container
    Container->>Container: run CUDA_GPU_IB_Ethernet_tests
    Container-->>HealthChecks: return exit_code
    HealthChecks->>Log: write detailed results
    HealthChecks-->>TestScript: exit with code

    TestScript->>TestScript: check exit_code
    alt exit_code_142
      TestScript-->>Admin: print timeout message
    else exit_code_0_or_other
      TestScript->>Log: verify file_exists
      TestScript->>Log: scan for FAIL FAULT ERROR
      alt errors_found
        TestScript-->>Admin: print failure summary and error excerpts
      else no_errors
        TestScript-->>Admin: print health checks passed
      end
    end
  end

Flow diagram for Azure HPC health check test script logic

flowchart TD
  S[Start test_azure_health_checks_sh] --> DSK[Detect or read VM SKU]
  DSK --> CHKSKU{Is SKU supported by azurehpc_health_checks?}

  CHKSKU -- No --> MSGNS[Print SKU not supported message]
  MSGNS --> END[End]

  CHKSKU -- Yes --> RUNHC[Run run_health_checks_sh with SKU-specific conf]
  RUNHC --> EC{Exit code from run_health_checks_sh}

  EC -- 142 --> MSGTO[Print timeout message]
  MSGTO --> END

  EC -- other --> EXLOG[Check health_log exists]
  EXLOG --> CHKLOG{Does health_log exist?}

  CHKLOG -- No --> MSGNF[Report missing health_log and mark test failed]
  MSGNF --> END

  CHKLOG -- Yes --> SCAN[Scan health_log for FAIL FAULT ERROR]
  SCAN --> ERR{Any errors found in health_log?}

  ERR -- Yes --> MSGFAIL[Print failure summary and error excerpts]
  MSGFAIL --> END

  ERR -- No --> MSGPASS[Print Azure HPC health checks completed successfully]
  MSGPASS --> END
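
The decision order in the flow diagram can be sketched as a single shell function. This is an illustration only: the function name `classify_result`, its arguments, and the result labels are hypothetical, and the unsupported-SKU test simply looks for the message text quoted in the PR description.

```shell
#!/usr/bin/env bash
# Hypothetical classifier mirroring the flow diagram; all names are illustrative.
# Args: exit code of run-health-checks.sh, file holding its captured output,
#       path to health.log.
classify_result() {
    local exit_code="$1" output_file="$2" health_log="$3"

    if grep -q 'is currently not supported by Azure health checks' "$output_file"; then
        echo "unsupported-sku"
    elif [ "$exit_code" -eq 142 ]; then
        echo "timeout"
    elif [ ! -f "$health_log" ]; then
        echo "missing-log"
    elif grep -Eq 'FAIL|FAULT|ERROR' "$health_log"; then
        echo "errors-found"
    else
        echo "passed"
    fi
}
```

A caller could map each label to a message and exit status, for example treating everything except "passed" and "unsupported-sku" as a failure.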

File-Level Changes

Change: Wire installation of the Azure HPC health check wrapper script into the Ansible role.
Files: `tasks/main.yml`
Details:
- Add a conditional Ansible task to template a new test script into the Azure tests directory when health checks are not skipped.
- Set proper ownership and executable permissions on the generated test script.

Change: Introduce a Bash wrapper script template that orchestrates Azure HPC health checks, validates prerequisites, interprets outcomes, and summarizes results.
Files: `templates/test-azure-health-checks.sh.j2`
Details:
- Define configurable script options for verbosity and optional timeout handling through getopts.
- Implement setup validations for the azurehpc-health-checks directory, the run-health-checks.sh trigger script, Docker service status, and required Azure NHC container image presence.
- Execute run-health-checks.sh (optionally under timeout), capture output, and special-case unsupported SKU messaging versus exit codes.
- Inspect generated health.log for FAIL/FAULT/ERROR patterns and print error excerpts when present.
- Track and report pass/fail/skip counters and overall SUCCESS/FAILURE with appropriate exit codes to integrate into automated testing.
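
For readers unfamiliar with getopts, the option handling described above might look roughly like the sketch below. The `VERBOSE` and `TIMEOUT` variables appear in the diff, but the `-v` and `-t` flag letters, the usage text, and the function name `parse_options` are assumptions made for illustration and may not match the script's real interface.

```shell
#!/usr/bin/env bash
# Hypothetical option parsing; the -v and -t flags are assumptions.
VERBOSE=0
TIMEOUT=""

parse_options() {
    local opt
    OPTIND=1    # reset so the function can be called more than once
    while getopts "vt:" opt; do
        case "$opt" in
            v) VERBOSE=1 ;;
            t) TIMEOUT="$OPTARG" ;;
            *) echo "usage: [-v] [-t seconds]" >&2; return 2 ;;
        esac
    done
}
```

Called as `parse_options "$@"` at the top of a script, this leaves `TIMEOUT` empty unless `-t` is given, which matches the "optional timeout" wording above.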

sourcery-ai

Hey - I've left some high level feedback:

The usage/help text and examples still reference run-health-checks.sh as the executable rather than this wrapper script name (test-azure-health-checks.sh), which could confuse users; consider updating the usage block to match how this script is actually invoked.
In setup_checks, the Docker image check uses docker images | grep -q "mcr.microsoft.com/aznhc/aznhc-nv:latest", which is prone to false positives; it would be more robust to use docker image inspect or docker images mcr.microsoft.com/aznhc/aznhc-nv:latest -q and verify the result is non-empty.
The script assumes both timeout and systemctl are available and functional; if this role can run on systems without them (e.g., non-systemd or minimal environments), consider adding explicit checks or fallbacks before using those commands to avoid immediate failures.
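
The last two suggestions could be addressed along the following lines. This is a sketch under stated assumptions, not the script's code: the function names `require_commands` and `image_present` are hypothetical, while `command -v` and `docker image inspect` are standard shell and Docker CLI features.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helpers; function names are illustrative.

# Return 0 if every named command is on PATH; otherwise report the
# first missing one on stderr and return 1.
require_commands() {
    local cmd
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "[FAIL] required command not found: $cmd" >&2
            return 1
        fi
    done
}

# Return 0 only if the exact image reference exists locally.
# Unlike "docker images | grep", this cannot match a substring of
# another repository name or tag.
image_present() {
    docker image inspect "$1" >/dev/null 2>&1
}
```

A caller might run `require_commands timeout systemctl docker && image_present mcr.microsoft.com/aznhc/aznhc-nv:latest` before starting the health checks.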

spetrosi · 2026-02-09T09:05:52Z

templates/test-azure-health-checks.sh.j2

+VERBOSE=0
+TIMEOUT=""
+AZNHC_HOME="{{ __hpc_azure_tests_dir }}/azurehpc-health-checks"
+TRIGGER_SCRIPT="${AZNHC_HOME}/run-health-checks.sh"


Will this file exist?

@spetrosi yes, this file exists. I install azurehpc-health-checks under /opt/hpc/azure/tests/, not under /opt/hpc/azure/tools/.

ls /opt/hpc/azure/tests/azurehpc-health-checks/run-health-checks.sh
/opt/hpc/azure/tests/azurehpc-health-checks/run-health-checks.sh

@spetrosi run-health-checks.sh comes from azurehpc-health-checks itself, see https://github.com/Azure/azurehpc-health-checks/blob/main/run-health-checks.sh. Thank you so much.

dgchinner · 2026-02-10T05:17:37Z

templates/test-azure-health-checks.sh.j2

+fail() {
+    echo "[FAIL] $1"
+    FAILED=$((FAILED + 1))
+}
+


When a test fails, echo the failure message and then exit the script with a non-zero exit status.
I note that further down the typical call is "fail .... ; exit 1", so let's fail the whole test on the first failure rather than running more checks that then fail because of cascading failures.
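
A minimal sketch of the fail-fast helper being requested here, assuming an exit status of 1 is acceptable for every failure (the updated script may differ):

```shell
#!/usr/bin/env bash
# Sketch of a fail-fast helper: report the failure, then stop the script.
# No counter is kept, because nothing runs after the first failure.
fail() {
    echo "[FAIL] $1"
    exit 1
}
```

With this shape, call sites shrink from `fail "..."; exit 1` to `fail "..."`, and any code reached after the checks can assume that every check passed.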

@dgchinner Updated, thank you so much.

dgchinner · 2026-02-10T05:22:56Z

templates/test-azure-health-checks.sh.j2

+    if [[ $FAILED -eq 0 ]]; then
+        log "Overall: SUCCESS"
+        exit 0
+    else
+        log "Overall: FAILURE"
+        exit $ret
+    fi


So if this skips tests, there is no exit status communicated? i.e. the test runner will consider skipped tests as a pass? Also, if fail() exits the test, there is no need for the summary boilerplate code - to get here the test must have passed....

@dgchinner updated, thank you so much for detailed comment

dgchinner · 2026-02-10T05:31:01Z

templates/test-azure-health-checks.sh.j2

+    # Run the health checks and capture output
+    ret=0
+    output=$(mktemp)
+    "${cmd[@]}" 2>&1 | tee "$output" || ret=$?


This sets the value of ret to the exit status of the 'tee' command in the pipeline, not the command that is being run. This should probably be written as something like:

"${cmd[@]}" 2>&1 | tee "$output"
ret="${PIPESTATUS[0]}"

To capture the return value of the cmd being run rather than the tee process.
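
The same idea as a self-contained, bash-specific sketch. The function name `run_and_capture` is hypothetical; `PIPESTATUS` is a bash array holding the exit status of each command in the most recent pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run a command, mirror its output into a file,
# and return the command's own exit status rather than tee's.
run_and_capture() {
    local output_file="$1"
    shift
    "$@" 2>&1 | tee "$output_file"
    # PIPESTATUS is overwritten by the next command, so it is expanded
    # here, on the line immediately after the pipeline.
    return "${PIPESTATUS[0]}"
}
```

An alternative is `set -o pipefail`, which makes the pipeline itself return the last non-zero status, but that changes behaviour for every pipeline in the script.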

@dgchinner updated, thank you so much for review.

lixuemin2016 · 2026-02-10T13:11:52Z

@dgchinner the main issue is that when I run the health check test on Standard_nd40rs_v2, not all of the test cases in https://github.com/Azure/azurehpc-health-checks/blob/main/conf/nd40rs_v2.conf pass. Do we need to analyze these failures?
For example:

1. The IB port mlx5_ib0:1 failure, which appears to be related to the device name.
2. "Bandwidth 9.71 is less than 10. FaultCode: NHC2020": is it acceptable? Thank you so much.

Running health checks for Standard_nd40rs_v2 SKU...
Running health checks using /opt/hpc/azure/tests/azurehpc-health-checks/conf/nd40rs_v2.conf and outputting to /opt/hpc/azure/tests/azurehpc-health-checks/health.log

==========
== CUDA ==
CUDA Version 12.4.1

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

SUCCESS: nhc: Health check passed: check_gpu_ecc: ECC checks passed
SUCCESS: nhc: Health check passed: check_gpu_count: Expected 8 and found 8
SUCCESS: nhc: Health check passed: check_gpu_xid: GPU XID error check passed.
SUCCESS: nhc: Health check passed: check_nvsmi_healthmon: nvidia-smi completed successfully
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 0 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 1 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 2 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 3 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 4 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 5 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 6 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 7 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nccl_allreduce: NCCL all reduce bandwidth test passed, 128.579 GB/s
SUCCESS: nhc: Health check passed: check_ib_bw_non_gdr: IB write bandwidth non gdr test passed for IB=mlx5_0, IB BW=103.59 Gbps
SUCCESS: nhc: Health check passed: check_ib_link_flapping: No IB link flapping found
Health checks completed with exit code: 0.

[PASS] Azure HPC health checks completed successfully
[2026-02-07 10:51:19] Validating health.log...

Checking: health.log file exists
[PASS] health.log file exists
Checking: health.log for errors
[FAIL] FAIL/FAULT/ERROR found in health.log

--- Error excerpts from health.log ---
ERROR: nhc: Health check failed: check_hw_ib: No IB port mlx5_ib0:1 is ACTIVE (LinkUp 100 Gb/sec).
ERROR: nhc: Health check failed: check_hw_eth: Ethernet device ib0 not detected.
ERROR: nhc: Health check failed: check_gpu_bw: H2D test on GPU 0 failed. Bandwidth 9.71 is less than 10. FaultCode: NHC2020
--- End of excerpts ---

Add a test script for azurehpc-health-checks validation. Validate azurehpc-health-checks basic setup, and mainly call run-health-checks.sh to do SKU health verification. Refer to <https://github.com/Azure/azurehpc-health-checks>

Test result scenario log output examples:
- SKU not support: The vm SKU 'standard_nc4as_t4_v3' is currently not supported by Azure health checks.
- Test timeout scenario: Health checks completed with exit code: 142
- SKU supported scenario: Standard_nd40rs_v2 could run with detailed test result

JIRA: RHELHPC-117
Signed-off-by: Xuemin Li <xuli@redhat.com>

dgchinner · 2026-02-10T22:15:23Z

@dgchinner the main issue is that when I execute the health check test on Standard_nd40rs_v2, the test result cannot be PASS for all the test cases of https://github.com/Azure/azurehpc-health-checks/blob/main/conf/nd40rs_v2.conf. Do we need to analyze these failures? e.g.
1. IB port mlx5_ib0:1 related to name

The IP network device name is likely to be "ib0", not "mlx5_ib0". This may be a result of not having the persistent naming monitor running, or may just be a hard coded naming assumption in the health check scripts. I don't think this is a blocker, but it needs investigation as to why the "ethernet" device name either doesn't exist or why the health check is trying to use a name that doesn't exist.

2. "Bandwidth 9.71 is less than 10. FaultCode: NHC2020" , is it acceptable? Thank you so much.

That's from some kind of GPU bandwidth test. I'm not sure what is being tested, exactly, but I'd suggest that it is likely a spurious threshold failure as the number reported (9.71) is only just less than the failure threshold (10). For the moment, I don't see this as being a problem as the performance is very close to what is expected.

Hence I don't think either of these issues should block merging the test scripts - the tests are indicating potential problems with the functionality we have installed/configured, not problems with the test scripts or the health monitoring infrastructure it is exercising.

dgchinner · 2026-02-10T22:17:46Z

Code and commit messages look good, I think this can be merged.

lixuemin2016 requested review from richm and spetrosi as code owners February 8, 2026 09:21

sourcery-ai bot reviewed Feb 8, 2026

spetrosi reviewed Feb 9, 2026

dgchinner reviewed Feb 10, 2026

lixuemin2016 force-pushed the testahc branch from 07bfd23 to e1194ea Compare February 10, 2026 13:01

lixuemin2016 force-pushed the testahc branch from e1194ea to 58c7019 Compare February 10, 2026 13:21
