lixuemin2016 (Collaborator) commented Feb 8, 2026

Enhancement:
Add a test script that validates azurehpc-health-checks. It checks the basic azurehpc-health-checks setup, then calls run-health-checks.sh to verify the health of the VM SKU. See https://github.com/Azure/azurehpc-health-checks

Example log output for each test scenario:

  • SKU not supported: The vm SKU 'standard_nc4as_t4_v3' is currently not supported by Azure health checks.
  • Test timeout: Health checks completed with exit code: 142
  • SKU supported: Standard_nd40rs_v2 runs and produces detailed test results

Reason:

Result:

Test 1:
No custom conf file specified, detecting VM SKU...
The vm SKU 'standard_nc4as_t4_v3' is currently not supported by Azure health checks.

Test 2:
Running health checks for Standard_nd40rs_v2 SKU...
Running health checks using /opt/hpc/azure/tests/azurehpc-health-checks/conf/nd40rs_v2.conf and outputting to /opt/hpc/azure/tests/azurehpc-health-checks/health.log

==========
== CUDA ==

CUDA Version 12.4.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

SUCCESS: nhc: Health check passed: check_gpu_ecc: ECC checks passed
SUCCESS: nhc: Health check passed: check_gpu_count: Expected 8 and found 8
SUCCESS: nhc: Health check passed: check_gpu_xid: GPU XID error check passed.
SUCCESS: nhc: Health check passed: check_nvsmi_healthmon: nvidia-smi completed successfully
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 0 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 1 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 2 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 3 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 4 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 5 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 6 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 7 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nccl_allreduce: NCCL all reduce bandwidth test passed, 128.579 GB/s
SUCCESS: nhc: Health check passed: check_ib_bw_non_gdr: IB write bandwidth non gdr test passed for IB=mlx5_0, IB BW=103.59 Gbps
SUCCESS: nhc: Health check passed: check_ib_link_flapping: No IB link flapping found
Health checks completed with exit code: 0.

[PASS] Azure HPC health checks completed successfully
[2026-02-07 10:51:19] Validating health.log...

Checking: health.log file exists
[PASS] health.log file exists
Checking: health.log for errors
[FAIL] FAIL/FAULT/ERROR found in health.log

--- Error excerpts from health.log ---
ERROR: nhc: Health check failed: check_hw_ib: No IB port mlx5_ib0:1 is ACTIVE (LinkUp 100 Gb/sec).
ERROR: nhc: Health check failed: check_hw_eth: Ethernet device ib0 not detected.
SUCCESS: nhc: Health check passed: check_gpu_xid: GPU XID error check passed.
ERROR: nhc: Health check failed: check_gpu_bw: H2D test on GPU 0 failed. Bandwidth 9.71 is less than 10. FaultCode: NHC2020
SUCCESS: nhc: Health check passed: check_gpu_xid: GPU XID error check passed.
SUCCESS: nhc: Health check passed: check_gpu_xid: GPU XID error check passed.
--- End of excerpts ---

[2026-02-07 10:51:19] ========================================
[2026-02-07 10:51:19] Test Summary
[2026-02-07 10:51:19] ========================================
[2026-02-07 10:51:19] Passed: 6
[2026-02-07 10:51:19] Failed: 1
[2026-02-07 10:51:19] Skipped: 0
[2026-02-07 10:51:19] ========================================
[2026-02-07 10:51:19] Overall: FAILURE
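
The error excerpts above include `SUCCESS` lines for check_gpu_xid, presumably because the scan matches the word "error" anywhere on a line regardless of case. A minimal sketch of a stricter scan, assuming failing checks always start the line with the status word as they do in this log. The helper name `scan_health_log` is hypothetical and not part of the PR:

```bash
#!/usr/bin/env bash
# Hypothetical helper: print only lines whose leading status word is
# ERROR, FAIL or FAULT. Returns non-zero when no such line exists.
# Assumption: NHC writes the status word at the start of each line.
scan_health_log() {
    local log_file="$1"
    grep -E '^(ERROR|FAIL|FAULT)[:[:space:]]' "$log_file"
}
```

Called as `scan_health_log health.log`, this would print the three `ERROR:` lines from the log above and skip the `SUCCESS` lines that merely contain the word "error".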

JIRA: RHELHPC-117

Summary by Sourcery

Add an Azure HPC health checks wrapper script and integrate it into the HPC Azure test tasks for basic health validation.

Enhancements:

  • Introduce a wrapper script around azurehpc-health-checks to orchestrate health check execution, logging, and result summarization.

Tests:

  • Install and wire up a test script that runs azurehpc-health-checks with pre-flight environment checks, timeout handling, SKU support detection, and post-run log validation.

sourcery-ai bot commented Feb 8, 2026

Reviewer's Guide

Adds an Ansible-managed Azure HPC health check wrapper test script and wires it into the role so clusters can validate basic Azure HPC health check functionality, including Docker/image prerequisites, SKU support handling, exit codes, and health.log content validation.

Sequence diagram for running the Azure HPC health check test script

sequenceDiagram
  actor Admin
  participant AnsibleRole as ansible_role_hpc_azure
  participant VM as Azure_HPC_VM
  participant TestScript as test_azure_health_checks_sh
  participant HealthChecks as run_health_checks_sh
  participant Container as NVIDIA_health_check_container
  participant Log as health_log

  Admin->>AnsibleRole: apply role
  AnsibleRole->>VM: copy test_azure_health_checks_sh template
  AnsibleRole->>VM: clone and configure azurehpc_health_checks_repo

  Admin->>TestScript: execute test_azure_health_checks_sh
  TestScript->>TestScript: detect_or_read_VM_SKU
  alt SKU_not_supported
    TestScript-->>Admin: print SKU_not_supported message
  else SKU_supported
    TestScript->>HealthChecks: invoke run_health_checks_sh with SKU_conf
    HealthChecks->>Container: start health check container
    Container->>Container: run CUDA_GPU_IB_Ethernet_tests
    Container-->>HealthChecks: return exit_code
    HealthChecks->>Log: write detailed results
    HealthChecks-->>TestScript: exit with code

    TestScript->>TestScript: check exit_code
    alt exit_code_142
      TestScript-->>Admin: print timeout message
    else exit_code_0_or_other
      TestScript->>Log: verify file_exists
      TestScript->>Log: scan for FAIL FAULT ERROR
      alt errors_found
        TestScript-->>Admin: print failure summary and error excerpts
      else no_errors
        TestScript-->>Admin: print health checks passed
      end
    end
  end

Flow diagram for Azure HPC health check test script logic

flowchart TD
  S[Start test_azure_health_checks_sh] --> DSK[Detect or read VM SKU]
  DSK --> CHKSKU{Is SKU supported by
azurehpc_health_checks?}

  CHKSKU -- No --> MSGNS[Print SKU not supported message]
  MSGNS --> END[End]

  CHKSKU -- Yes --> RUNHC[Run run_health_checks_sh
with SKU-specific conf]
  RUNHC --> EC{Exit code from
run_health_checks_sh}

  EC -- 142 --> MSGTO[Print timeout message]
  MSGTO --> END

  EC -- other --> EXLOG[Check health_log exists]
  EXLOG --> CHKLOG{Does health_log exist?}

  CHKLOG -- No --> MSGNF[Report missing health_log
and mark test failed]
  MSGNF --> END

  CHKLOG -- Yes --> SCAN[Scan health_log for
FAIL FAULT ERROR]
  SCAN --> ERR{Any errors found
in health_log?}

  ERR -- Yes --> MSGFAIL[Print failure summary
and error excerpts]
  MSGFAIL --> END

  ERR -- No --> MSGPASS[Print Azure HPC health
checks completed successfully]
  MSGPASS --> END
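
The decision flow in the diagram can be sketched as a single Bash function. This is an illustration under stated assumptions, not the script from the PR: the function name `classify_result` and its verdict strings are hypothetical, while the exit code 142 and the unsupported-SKU message text are taken from the PR description.

```bash
#!/usr/bin/env bash
# classify_result EXIT_CODE RUN_OUTPUT_FILE HEALTH_LOG
# Hypothetical helper that maps the outcome of run-health-checks.sh
# to one verdict string, following the flowchart top to bottom.
classify_result() {
    local exit_code="$1" run_output="$2" log_file="$3"
    if grep -q 'is currently not supported by Azure health checks' "$run_output" 2>/dev/null; then
        echo "sku-unsupported"
    elif [ "$exit_code" -eq 142 ]; then
        echo "timeout"
    elif [ ! -f "$log_file" ]; then
        echo "missing-log"
    elif grep -Eq 'FAIL|FAULT|ERROR' "$log_file"; then
        echo "errors-found"
    else
        echo "passed"
    fi
}
```

Keeping the classification separate from the reporting makes each branch easy to exercise with a fake output file and a fake health.log.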

File-Level Changes

Change: Wire installation of the Azure HPC health check wrapper script into the Ansible role.
Details:
  • Add a conditional Ansible task to template a new test script into the Azure tests directory when health checks are not skipped
  • Set proper ownership and executable permissions on the generated test script
Files: tasks/main.yml

Change: Introduce a Bash wrapper script template that orchestrates Azure HPC health checks, validates prerequisites, interprets outcomes, and summarizes results.
Details:
  • Define configurable script options for verbosity and optional timeout handling through getopts
  • Implement setup validations for the azurehpc-health-checks directory, the run-health-checks.sh trigger script, Docker service status, and required Azure NHC container image presence
  • Execute run-health-checks.sh (optionally under timeout), capture output, and special-case unsupported SKU messaging versus exit codes
  • Inspect generated health.log for FAIL/FAULT/ERROR patterns and print error excerpts when present
  • Track and report pass/fail/skip counters and overall SUCCESS/FAILURE with appropriate exit codes to integrate into automated testing
Files: templates/test-azure-health-checks.sh.j2
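
For the getopts-based option handling mentioned above, a minimal sketch might look like the following. The option letters `-v` and `-t` and the function name `parse_args` are assumptions for illustration; only the `VERBOSE` and `TIMEOUT` variables appear in the PR.

```bash
#!/usr/bin/env bash
# Defaults, as in the reviewed template.
VERBOSE=0
TIMEOUT=""

# Hypothetical parser: -v turns on verbose output, -t SECONDS sets a timeout.
# Returns 2 on an unknown option or a missing argument.
parse_args() {
    local opt
    OPTIND=1
    while getopts ":vt:" opt; do
        case "$opt" in
            v) VERBOSE=1 ;;
            t) TIMEOUT="$OPTARG" ;;
            :) echo "Option -$OPTARG requires an argument" >&2; return 2 ;;
            \?) echo "Unknown option: -$OPTARG" >&2; return 2 ;;
        esac
    done
}
```

A caller would typically run `parse_args "$@"` near the top of the script and exit with a usage message when it returns non-zero.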

@sourcery-ai sourcery-ai bot left a comment

Hey - I've left some high level feedback:

  • The usage/help text and examples still reference run-health-checks.sh as the executable rather than this wrapper script name (test-azure-health-checks.sh), which could confuse users; consider updating the usage block to match how this script is actually invoked.
  • In setup_checks, the Docker image check uses docker images | grep -q "mcr.microsoft.com/aznhc/aznhc-nv:latest", which is prone to false positives; it would be more robust to use docker image inspect or docker images mcr.microsoft.com/aznhc/aznhc-nv:latest -q and verify the result is non-empty.
  • The script assumes both timeout and systemctl are available and functional; if this role can run on systems without them (e.g., non-systemd or minimal environments), consider adding explicit checks or fallbacks before using those commands to avoid immediate failures.
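
The last two points could be addressed along these lines. This is a sketch, not the change that was merged: the helper names `require_cmds` and `image_present` are hypothetical, and the image name is the one quoted in the comment above.

```bash
#!/usr/bin/env bash
# Image name as quoted in the review comment.
AZNHC_IMAGE="mcr.microsoft.com/aznhc/aznhc-nv:latest"

# Hypothetical helper: succeed only if every named command can be found.
require_cmds() {
    local cmd
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "Missing required command: $cmd" >&2
            return 1
        fi
    done
}

# Hypothetical helper: look the image up by its exact reference instead of
# grepping the full `docker images` listing.
image_present() {
    docker image inspect "$1" >/dev/null 2>&1
}
```

A pre-flight step could then call `require_cmds docker timeout systemctl` followed by `image_present "$AZNHC_IMAGE"` and stop early with a clear message when either fails.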

VERBOSE=0
TIMEOUT=""
AZNHC_HOME="{{ __hpc_azure_tests_dir }}/azurehpc-health-checks"
TRIGGER_SCRIPT="${AZNHC_HOME}/run-health-checks.sh"
Collaborator

Will this file exist?

Collaborator Author

@spetrosi Yes, this file exists. azurehpc-health-checks is installed under /opt/hpc/azure/tests/, not under /opt/hpc/azure/tools/.

 ls /opt/hpc/azure/tests/azurehpc-health-checks/run-health-checks.sh 
/opt/hpc/azure/tests/azurehpc-health-checks/run-health-checks.sh

Collaborator Author

@spetrosi run-health-checks.sh ships with azurehpc-health-checks itself; see https://github.com/Azure/azurehpc-health-checks/blob/main/run-health-checks.sh. Thank you so much.

Comment on lines 36 to 38
fail() {
    echo "[FAIL] $1"
    FAILED=$((FAILED + 1))
}

Contributor

When a test fails, echo the failure message and then just exit the script with a non-zero exit status.
I note that further down the typical caller is "fail .... ; exit 1", so let's fail the whole test on the first failure rather than trying to run more things and failing because of cascading failures.
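
A minimal sketch of the fail-fast helper this comment asks for, assuming an exit status of 1 is what the test runner expects for a failure:

```bash
#!/usr/bin/env bash
# Report the failure and stop the whole script immediately,
# so later checks cannot produce cascading failures.
fail() {
    echo "[FAIL] $1"
    exit 1
}
```

With this shape, callers no longer need the trailing `; exit 1`, and the `FAILED` counter becomes unnecessary.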

Collaborator Author

@dgchinner Updated, thank you so much.

Comment on lines 268 to 274
if [[ $FAILED -eq 0 ]]; then
    log "Overall: SUCCESS"
    exit 0
else
    log "Overall: FAILURE"
    exit $ret
fi
Contributor

So if this skips tests, there is no exit status communicated? I.e., the test runner will consider skipped tests as a pass? Also, if fail() exits the test, there is no need for the summary boilerplate code: to get here the test must have passed.
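
One way to make a skip visible to the caller is a dedicated exit status. The sketch below assumes the convention used by Automake's test harness, where 77 means "skipped"; whether the runner for this role honors that value is an assumption, and the helper name `skip` is hypothetical.

```bash
#!/usr/bin/env bash
# Assumed convention: 0 = pass, 1 = fail, 77 = skipped.
EXIT_SKIP=77

# Report why the test was skipped and exit with the skip status.
skip() {
    echo "[SKIP] $1"
    exit "$EXIT_SKIP"
}
```

The unsupported-SKU branch could then call `skip "VM SKU is not supported by Azure health checks"` instead of falling through to a zero exit status.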

Collaborator Author

@dgchinner Updated, thank you so much for the detailed comment.

# Run the health checks and capture output
ret=0
output=$(mktemp)
"${cmd[@]}" 2>&1 | tee "$output" || ret=$?
Contributor

This sets the value of ret to the exit status of the 'tee' command in the pipeline, not the command that is being run. This should probably be written as something like:

"${cmd[@]}" 2>&1 | tee "$output"
ret="${PIPESTATUS[0]}"

To capture the return value of the cmd being run rather than the tee process.
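
The suggestion above can be wrapped in a small function so the behavior is easy to check in isolation. The function name `run_and_capture` is hypothetical; `PIPESTATUS` is a Bash array, so this sketch requires Bash rather than a plain POSIX shell.

```bash
#!/usr/bin/env bash
# run_and_capture OUTPUT_FILE CMD [ARGS...]
# Runs CMD, mirrors its combined stdout/stderr into OUTPUT_FILE via tee,
# and returns the exit status of CMD itself rather than that of tee.
run_and_capture() {
    local output="$1"
    shift
    "$@" 2>&1 | tee "$output"
    return "${PIPESTATUS[0]}"
}
```

For example, `run_and_capture "$output" "${cmd[@]}" || ret=$?` leaves the command's own status in `ret`. An alternative is `set -o pipefail`, which makes the pipeline report the status of the rightmost command that exited non-zero.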

Collaborator Author

@dgchinner Updated, thank you so much for the review.

@lixuemin2016
Collaborator Author

@dgchinner The main issue is that when I run the health check test on Standard_nd40rs_v2, not all test cases in https://github.com/Azure/azurehpc-health-checks/blob/main/conf/nd40rs_v2.conf pass. Do we need to analyze these failures?
For example:

  1. IB port mlx5_ib0:1: this looks related to the device name.
  2. "Bandwidth 9.71 is less than 10. FaultCode: NHC2020": is this acceptable? Thank you so much.

Running health checks for Standard_nd40rs_v2 SKU...
Running health checks using /opt/hpc/azure/tests/azurehpc-health-checks/conf/nd40rs_v2.conf and outputting to /opt/hpc/azure/tests/azurehpc-health-checks/health.log

==========
== CUDA ==
CUDA Version 12.4.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

SUCCESS: nhc: Health check passed: check_gpu_ecc: ECC checks passed
SUCCESS: nhc: Health check passed: check_gpu_count: Expected 8 and found 8
SUCCESS: nhc: Health check passed: check_gpu_xid: GPU XID error check passed.
SUCCESS: nhc: Health check passed: check_nvsmi_healthmon: nvidia-smi completed successfully
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 0 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 1 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 2 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 3 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 4 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 5 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 6 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 7 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nccl_allreduce: NCCL all reduce bandwidth test passed, 128.579 GB/s
SUCCESS: nhc: Health check passed: check_ib_bw_non_gdr: IB write bandwidth non gdr test passed for IB=mlx5_0, IB BW=103.59 Gbps
SUCCESS: nhc: Health check passed: check_ib_link_flapping: No IB link flapping found
Health checks completed with exit code: 0.

[PASS] Azure HPC health checks completed successfully
[2026-02-07 10:51:19] Validating health.log...

Checking: health.log file exists
[PASS] health.log file exists
Checking: health.log for errors
[FAIL] FAIL/FAULT/ERROR found in health.log

--- Error excerpts from health.log ---
ERROR: nhc: Health check failed: check_hw_ib: No IB port mlx5_ib0:1 is ACTIVE (LinkUp 100 Gb/sec).
ERROR: nhc: Health check failed: check_hw_eth: Ethernet device ib0 not detected.
ERROR: nhc: Health check failed: check_gpu_bw: H2D test on GPU 0 failed. Bandwidth 9.71 is less than 10. FaultCode: NHC2020
--- End of excerpts ---

Add a test script for azurehpc-health-checks validation.
Validate azurehpc-health-checks basic setup, and mainly call
run-health-checks.sh to do SKU health verification.
Refer to <https://github.com/Azure/azurehpc-health-checks>

Test result scenario log output examples:
- SKU not supported:
  The vm SKU 'standard_nc4as_t4_v3' is currently not supported
  by Azure health checks.
- Test timeout scenario
  Health checks completed with exit code: 142
- SKU supported scenario
  Standard_nd40rs_v2 could run with detailed test result

JIRA: RHELHPC-117

Signed-off-by: Xuemin Li <xuli@redhat.com>
@dgchinner
Contributor

@dgchinner The main issue is that when I run the health check test on Standard_nd40rs_v2, not all test cases in https://github.com/Azure/azurehpc-health-checks/blob/main/conf/nd40rs_v2.conf pass. Do we need to analyze these failures? For example:

1. IB port mlx5_ib0:1: this looks related to the device name.

The IP network device name is likely to be "ib0", not "mlx5_ib0". This may be a result of not having the persistent naming monitor running, or may just be a hard coded naming assumption in the health check scripts. I don't think this is a blocker, but it needs investigation as to why the "ethernet" device name either doesn't exist or why the health check is trying to use a name that doesn't exist.
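
To start that investigation, the names present on the VM can be listed and compared with the names the conf file expects. A minimal sketch, assuming the standard Linux sysfs layout (`/sys/class/infiniband` for RDMA devices, `/sys/class/net` for network interfaces); the helper name `list_ib_names` is hypothetical:

```bash
#!/usr/bin/env bash
# list_ib_names [SYSFS_ROOT]
# Prints the RDMA device names and network interface names found in sysfs.
# The optional argument overrides the sysfs root, which helps with testing.
list_ib_names() {
    local sysfs="${1:-/sys}"
    local dev
    for dev in "$sysfs"/class/infiniband/*; do
        [ -e "$dev" ] && echo "rdma: ${dev##*/}"
    done
    for dev in "$sysfs"/class/net/*; do
        [ -e "$dev" ] && echo "net: ${dev##*/}"
    done
    return 0
}
```

If the output shows `rdma: mlx5_0` and `net: ib0` but nothing named `mlx5_ib0`, that would point at a naming assumption in the check_hw_ib line of the conf file rather than at missing hardware.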

2. "Bandwidth 9.71 is less than 10. FaultCode: NHC2020": is this acceptable? Thank you so much.

That's from some kind of GPU bandwidth test. I'm not sure what is being tested, exactly, but I'd suggest that it is likely a spurious threshold failure as the number reported (9.71) is only just less than the failure threshold (10). For the moment, I don't see this as being a problem as the performance is very close to what is expected.
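
To put "only just less than" into numbers, the relative shortfall can be computed from the two values in the log line. The helper name `shortfall_pct` is hypothetical:

```bash
#!/usr/bin/env bash
# shortfall_pct MEASURED THRESHOLD
# Prints how far MEASURED falls below THRESHOLD, as a percentage of THRESHOLD.
shortfall_pct() {
    awk -v m="$1" -v t="$2" 'BEGIN { printf "%.1f\n", (t - m) / t * 100 }'
}

shortfall_pct 9.71 10   # prints 2.9
```

So the measured H2D bandwidth is about 2.9% below the configured threshold.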

Hence I don't think either of these issues should block merging the test scripts - the tests are indicating potential problems with the functionality we have installed/configured, not problems with the test scripts or the health monitoring infrastructure it is exercising.

@dgchinner
Contributor

Code and commit messages look good, I think this can be merged.
