feat(cluster_healthcheck): add cluster health validation role #39

Open
stevefulme1 wants to merge 2 commits into redhat-cop:main from stevefulme1:feat/cluster-healthcheck-role

Conversation

@stevefulme1
Contributor

Summary

Adds a new cluster_healthcheck role that validates the health of an OpenShift cluster for virtualization migration readiness. The role performs comprehensive checks across six categories and generates an HTML summary report with pass/fail/warning status and actionable recommendations.

Health checks included

  • OCP Node Health - Node Ready status, MemoryPressure/DiskPressure/PIDPressure conditions, allocatable vs capacity ratios, kubevirt.io/schedulable label verification
  • KubeVirt Health - HyperConverged CR conditions (Available/Degraded), virt-operator/controller/handler/api pods, CDI operator and deployment health
  • MTV Health - ForkliftController CR status, MTV operator pods, Provider readiness, failed migration Plans
  • Storage Health - StorageClass enumeration and default verification, CSI driver discovery, PV capacity, pending PVC detection
  • Network Health - Multus pods, NetworkAttachmentDefinitions, OVN-Kubernetes/OpenShiftSDN health, migration network configuration
  • Post-Migration VM - VirtualMachineInstance running state, guest agent reporting, network interface IPs, optional SSH connectivity
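
As an illustration of the first category, the node checks could be sketched roughly as follows. This is a minimal sketch, not the role's actual tasks: the `example_*` variable names are hypothetical, and only the Ready and pressure conditions are covered.

```yaml
# Hypothetical sketch: the example_* variable names are not taken from the role.
- name: ocp_node_health | Gather cluster nodes
  kubernetes.core.k8s_info:
    api_version: v1
    kind: Node
  register: example_nodes

# A node counts as not ready when it has no Ready condition with status "True".
- name: ocp_node_health | Collect nodes that are not Ready
  ansible.builtin.set_fact:
    example_nodes_not_ready: >-
      {{ example_nodes_not_ready | default([]) + [item.metadata.name] }}
  loop: "{{ example_nodes.resources }}"
  loop_control:
    label: "{{ item.metadata.name }}"
  when: >-
    (item.status | default({})).conditions | default([])
    | selectattr('type', 'equalto', 'Ready')
    | selectattr('status', 'equalto', 'True')
    | list | length == 0

# Any of the three pressure conditions set to "True" flags the node.
- name: ocp_node_health | Collect nodes reporting a pressure condition
  ansible.builtin.set_fact:
    example_nodes_under_pressure: >-
      {{ example_nodes_under_pressure | default([]) + [item.metadata.name] }}
  loop: "{{ example_nodes.resources }}"
  loop_control:
    label: "{{ item.metadata.name }}"
  when: >-
    (item.status | default({})).conditions | default([])
    | selectattr('type', 'in', ['MemoryPressure', 'DiskPressure', 'PIDPressure'])
    | selectattr('status', 'equalto', 'True')
    | list | length > 0
```

Both facts stay undefined when no node matches, so a consumer would need `| default([])` when reading them.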

Files added

roles/cluster_healthcheck/
├── defaults/main.yml
├── meta/main.yml
├── README.md
├── tasks/
│   ├── main.yml
│   ├── ocp_node_health.yml
│   ├── kubevirt_health.yml
│   ├── mtv_health.yml
│   ├── storage_health.yml
│   ├── network_health.yml
│   ├── post_migration_vm.yml
│   └── report.yml
├── templates/
│   └── cluster_healthcheck_report.html.j2
├── tests/
│   ├── inventory
│   └── test.yml
└── vars/main.yml
playbooks/cluster_healthcheck.yml

Design decisions

  • Follows existing validate_migration role patterns (task naming, k8s_info usage, variable prefixing)
  • All variables prefixed with cluster_healthcheck_ per collection convention
  • Private/internal variables use __cluster_healthcheck_ double-underscore prefix
  • Uses FQCNs throughout (kubernetes.core.k8s_info, ansible.builtin.*)
  • Check list is configurable via cluster_healthcheck_checks default
  • Post-migration VM checks are opt-in via cluster_healthcheck_post_migration_vms
  • HTML report includes per-category breakdown with recommendations
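
To make the configurable check list and the opt-in VM checks concrete, here is a rough sketch of how the defaults and the dispatch in `tasks/main.yml` might fit together. The variable names come from this PR, but the values, the loop variable name, and the dispatch logic are assumptions for illustration only.

```yaml
# defaults/main.yml (illustrative values, assumed rather than copied from the PR)
cluster_healthcheck_checks:
  - ocp_node_health
  - kubevirt_health
  - mtv_health
  - storage_health
  - network_health
cluster_healthcheck_post_migration_vms: []   # empty list keeps the VM checks off
cluster_healthcheck_kubevirt_namespace: openshift-cnv
cluster_healthcheck_mtv_namespace: openshift-mtv
---
# tasks/main.yml (hypothetical dispatch; the loop variable name is an assumption)
- name: Run the selected health checks
  ansible.builtin.include_tasks:
    file: "{{ __cluster_healthcheck_check }}.yml"
  loop: "{{ cluster_healthcheck_checks }}"
  loop_control:
    loop_var: __cluster_healthcheck_check

- name: Run post-migration VM checks only when VMs are listed
  ansible.builtin.include_tasks:
    file: post_migration_vm.yml
  when: cluster_healthcheck_post_migration_vms | length > 0
```

Naming each check after its task file keeps the dispatch to a single loop, at the cost of tying the public variable values to file names.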

Testing

  • ansible-lint --profile production passes with 0 errors on the role (playbook FQCN resolution matches existing collection behavior)

Contributor

@sabre1041 sabre1041 left a comment

Review the issues that are being reported.

Also, please review the conflicted files.

kind: Pod
namespace: "{{ cluster_healthcheck_kubevirt_namespace }}"
label_selectors:
- "app=cdi-operator"
Contributor

This label does not match what is deployed

kind: Pod
namespace: "{{ cluster_healthcheck_kubevirt_namespace }}"
label_selectors:
- "app=cdi-deployment"
Contributor

This label does not match what is deployed


- name: mtv_health | Evaluate Provider readiness
  ansible.builtin.set_fact:
    __cluster_healthcheck_providers_not_ready: >-
Contributor

This is not reporting correctly. Both providers are Ready in my testing environment
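
One way to evaluate readiness from the Provider's `status.conditions` is sketched below. It assumes the Forklift API group `forklift.konveyor.io/v1beta1` and a `Ready` condition whose status is the string `"True"`; the registered variable `example_providers` is hypothetical, and this is not necessarily how the role implements it.

```yaml
# Hypothetical sketch: example_providers is an illustrative name.
- name: mtv_health | Gather Providers
  kubernetes.core.k8s_info:
    api_version: forklift.konveyor.io/v1beta1
    kind: Provider
    namespace: "{{ cluster_healthcheck_mtv_namespace }}"
  register: example_providers

# Start from an empty list so the fact is defined even when every Provider is Ready.
- name: mtv_health | Initialise the list of Providers that are not ready
  ansible.builtin.set_fact:
    __cluster_healthcheck_providers_not_ready: []

# Condition statuses are strings, so compare against 'True' rather than a boolean.
- name: mtv_health | Evaluate Provider readiness
  ansible.builtin.set_fact:
    __cluster_healthcheck_providers_not_ready: >-
      {{ __cluster_healthcheck_providers_not_ready + [item.metadata.name] }}
  loop: "{{ example_providers.resources }}"
  loop_control:
    label: "{{ item.metadata.name }}"
  when: >-
    (item.status | default({})).conditions | default([])
    | selectattr('type', 'equalto', 'Ready')
    | selectattr('status', 'equalto', 'True')
    | list | length == 0
```

Comparing the condition status against a boolean instead of the string `'True'` is a common cause of every Provider being reported as not ready.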

| selectattr('status.phase', 'equalto', 'Running')
| list | length) }}

- name: network_health | Check migration network configuration
Contributor

This should only be checked if one has been defined in the HyperConverged CR

kubernetes.core.k8s_info:
  api_version: k8s.cni.cncf.io/v1
  kind: NetworkAttachmentDefinition
  namespace: "{{ cluster_healthcheck_mtv_namespace }}"
Contributor

This should be checked in the openshift-cnv namespace.
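
A conditional lookup along these lines might look like the sketch below. It assumes the migration network is the one named in `spec.liveMigrationConfig.network` of the HyperConverged CR and that `cluster_healthcheck_kubevirt_namespace` resolves to `openshift-cnv`; the `example_*` names are hypothetical.

```yaml
# Hypothetical sketch: example_* names are illustrative, and the field path
# spec.liveMigrationConfig.network is an assumption about where the network is set.
- name: network_health | Read the HyperConverged CR
  kubernetes.core.k8s_info:
    api_version: hco.kubevirt.io/v1beta1
    kind: HyperConverged
    namespace: "{{ cluster_healthcheck_kubevirt_namespace }}"
  register: example_hco

# Falls back to an empty string when the CR or any intermediate key is absent.
- name: network_health | Extract the configured migration network, if any
  ansible.builtin.set_fact:
    example_migration_network: >-
      {{ ((example_hco.resources[0].spec | default({})).liveMigrationConfig
          | default({})).network | default('') }}
  when: example_hco.resources | length > 0

# Skipped entirely when no migration network has been configured.
- name: network_health | Look up the migration NetworkAttachmentDefinition
  kubernetes.core.k8s_info:
    api_version: k8s.cni.cncf.io/v1
    kind: NetworkAttachmentDefinition
    name: "{{ example_migration_network }}"
    namespace: "{{ cluster_healthcheck_kubevirt_namespace }}"
  register: example_migration_nad
  when: example_migration_network | default('') | length > 0
```

With this shape, a cluster without a dedicated migration network produces a skipped task rather than a failed check.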

Adds a cluster_healthcheck role that validates OpenShift cluster health
for virtualization migration readiness across six categories: OCP nodes,
KubeVirt, MTV, storage, network, and post-migration VMs.

Generates an HTML summary report with pass/fail/warning status.

Review feedback addressed:
- Fix CDI pod labels to use app.kubernetes.io/component selectors
- Fix Provider readiness to correctly detect Ready condition status
- Make migration network check conditional on HyperConverged CR config
- Check migration NAD in openshift-cnv namespace, not openshift-mtv
- Drop unrelated scaffolding file changes (CODE_OF_CONDUCT, etc.)
@stevefulme1 stevefulme1 force-pushed the feat/cluster-healthcheck-role branch from d4928cd to 51d077e Compare May 22, 2026 12:39
@sabre1041
Contributor

@stevefulme1 looking better. Still seeing misalignment on the CDI components. Here are the labels that are applied to the CDI pods:

cdi-apiserver-76789fd699-7ft5j                         1/1     Running   0          6h      app.kubernetes.io/component=storage,app.kubernetes.io/managed-by=cdi-operator,app.kubernetes.io/part-of=hyperconverged-cluster,app.kubernetes.io/version=4.21.8,app=containerized-data-importer,cdi.kubevirt.io=cdi-apiserver,np.kubevirt.io/allow-access-cluster-services=true,operator.cdi.kubevirt.io/createVersion=4.21.3,pod-template-hash=76789fd699
cdi-deployment-78b4485977-ft8xh                        1/1     Running   0          6h      app.kubernetes.io/component=storage,app.kubernetes.io/managed-by=cdi-operator,app.kubernetes.io/part-of=hyperconverged-cluster,app.kubernetes.io/version=4.21.8,app=containerized-data-importer,cdi.kubevirt.io=cdi-deployment,np.kubevirt.io/allow-access-cluster-services=true,operator.cdi.kubevirt.io/createVersion=4.21.3,pod-template-hash=78b4485977,prometheus.cdi.kubevirt.io=true
cdi-operator-f9dcf5cf6-bfpjw                           1/1     Running   0          6h1m    app.kubernetes.io/component=storage,app.kubernetes.io/managed-by=olm,app.kubernetes.io/part-of=hyperconverged-cluster,app.kubernetes.io/version=4.21.8,cdi.kubevirt.io=cdi-operator,name=cdi-operator,np.kubevirt.io/allow-access-cluster-services=true,operator.cdi.kubevirt.io=,pod-template-hash=f9dcf5cf6,prometheus.cdi.kubevirt.io=true
cdi-uploadproxy-7dcfc947d7-jd5kb                       1/1     Running   0          6h      app.kubernetes.io/component=storage,app.kubernetes.io/managed-by=cdi-operator,app.kubernetes.io/part-of=hyperconverged-cluster,app.kubernetes.io/version=4.21.8,app=containerized-data-importer,cdi.kubevirt.io=cdi-uploadproxy,np.kubevirt.io/allow-access-cluster-services=true,operator.cdi.kubevirt.io/createVersion=4.21.3,pod-template-hash=7dcfc947d7

…sing components

- Change CDI label selectors from app.kubernetes.io/component to
  cdi.kubevirt.io which matches actual pod labels on OCP 4.21+
- Add cdi-apiserver and cdi-uploadproxy pod health checks (were missing)
- Add CDI API Server and CDI Upload Proxy to the health report details
@stevefulme1
Contributor Author

Fixed the CDI label selectors and added the missing components:

  • Changed CDI label selectors from app.kubernetes.io/component to cdi.kubevirt.io to match actual pod labels on OCP 4.21+
  • Added health checks for cdi-apiserver and cdi-uploadproxy pods (were missing from the original implementation)
  • Added both to the health report details section

All four CDI pods are now covered:

  • cdi.kubevirt.io=cdi-operator
  • cdi.kubevirt.io=cdi-deployment
  • cdi.kubevirt.io=cdi-apiserver
  • cdi.kubevirt.io=cdi-uploadproxy
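
For reference, the four selectors can be exercised with a single looped lookup. The sketch below is an illustration under assumptions, not the tasks in this PR: the `example_*` variable names are hypothetical, and "healthy" is reduced to having at least one pod in the Running phase.

```yaml
# Hypothetical sketch: example_* names are illustrative only.
- name: kubevirt_health | Gather CDI pods by cdi.kubevirt.io label
  kubernetes.core.k8s_info:
    api_version: v1
    kind: Pod
    namespace: "{{ cluster_healthcheck_kubevirt_namespace }}"
    label_selectors:
      - "cdi.kubevirt.io={{ item }}"
  loop:
    - cdi-operator
    - cdi-deployment
    - cdi-apiserver
    - cdi-uploadproxy
  register: example_cdi_pods

# item.item is the component name from the loop above; item.resources holds its pods.
- name: kubevirt_health | List CDI components without a Running pod
  ansible.builtin.set_fact:
    example_cdi_unhealthy: >-
      {{ example_cdi_unhealthy | default([]) + [item.item] }}
  loop: "{{ example_cdi_pods.results }}"
  loop_control:
    label: "{{ item.item }}"
  when: >-
    item.resources
    | selectattr('status.phase', 'equalto', 'Running')
    | list | length == 0
```

A component with no matching pods at all is also reported, which would surface a selector mismatch as a failure instead of a silent pass.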

@sabre1041 sabre1041 changed the title feat(cluster_healthcheck): add cluster health validation role feat(cluster_healthcheck): add cluster health validation role [skip ci] May 27, 2026
@sabre1041 sabre1041 changed the title feat(cluster_healthcheck): add cluster health validation role [skip ci] feat(cluster_healthcheck): add cluster health validation role May 27, 2026