3 changes: 3 additions & 0 deletions charts/openstack-hypervisor-operator/alerts/eviction.yaml
@@ -11,6 +11,7 @@ groups:
labels:
severity: warning
type: hypervisor_operator
playbook: docs/compute/kvm/playbooks/evictionfailed
annotations:
summary: "Eviction {{ $labels.name }} has failed"
description: "The eviction {{ $labels.name }} for hypervisor {{ $labels.hypervisor }} has reached a terminal failure state. Manual intervention is required — check if the hypervisor exists in OpenStack."
@@ -24,6 +25,7 @@ groups:
labels:
severity: warning
type: hypervisor_operator
playbook: docs/compute/kvm/playbooks/evictionmigrationfailing
annotations:
summary: "Eviction {{ $labels.name }} has failing instance migrations for over 1 hour"
description: "The eviction {{ $labels.name }} has had MigratingInstance=Failed for more than 1 hour while still running. Instances may be in ERROR state, blocking eviction progress."
@@ -37,6 +39,7 @@ groups:
labels:
severity: warning
type: hypervisor_operator
playbook: docs/compute/kvm/playbooks/evictionoutstandingram
annotations:
summary: "Eviction {{ $labels.name }} has outstanding RAM for over 6 hours"
description: "The eviction {{ $labels.name }} has had {{ $value }}MB of outstanding RAM for more than 6 hours. Check for stuck live-migrations or instances that cannot be moved."
7 changes: 7 additions & 0 deletions charts/openstack-hypervisor-operator/alerts/operator.yaml
@@ -11,6 +11,7 @@ groups:
labels:
severity: warning
type: hypervisor_operator
playbook: docs/compute/kvm/playbooks/hypervisoronboardingstuck
annotations:
summary: "Hypervisor {{ $labels.name }} onboarding stuck for over 1 hour"
description: "The hypervisor {{ $labels.name }} in zone {{ $labels.zone }} has been onboarding for more than 1 hour. Check nova registration, test VM status, or trait/aggregate sync."
@@ -22,6 +23,7 @@ groups:
labels:
severity: warning
type: hypervisor_operator
playbook: docs/compute/kvm/playbooks/hypervisorevictionstuck
annotations:
summary: "Hypervisor {{ $labels.name }} eviction running for over 4 hours"
description: "The hypervisor {{ $labels.name }} in zone {{ $labels.zone }} has had an active eviction for more than 4 hours. Check for stuck live-migrations or failed VMs."
@@ -35,6 +37,7 @@ groups:
labels:
severity: info
type: hypervisor_operator
playbook: docs/compute/kvm/playbooks/hypervisorevictedtoolong
annotations:
summary: "Hypervisor {{ $labels.name }} has been evicted for over 7 days"
description: "The hypervisor {{ $labels.name }} in zone {{ $labels.zone }} has been evicted for more than 7 days without being offboarded. Consider re-enabling or decommissioning."
@@ -50,6 +53,7 @@ groups:
labels:
severity: warning
type: hypervisor_operator
playbook: docs/compute/kvm/playbooks/hypervisortraitsyncfailed
annotations:
summary: "Hypervisor {{ $labels.name }} trait sync has been failing"
description: "The hypervisor {{ $labels.name }} in zone {{ $labels.zone }} has had TraitsUpdated=False for more than 30 minutes outside of onboarding. Check OpenStack Placement API connectivity."
@@ -65,6 +69,7 @@ groups:
labels:
severity: warning
type: hypervisor_operator
playbook: docs/compute/kvm/playbooks/hypervisoraggregatesyncfailed
annotations:
summary: "Hypervisor {{ $labels.name }} aggregate sync has been failing"
description: "The hypervisor {{ $labels.name }} in zone {{ $labels.zone }} has had AggregatesUpdated=False for more than 30 minutes outside of onboarding and eviction. Check OpenStack Nova API connectivity."
@@ -78,6 +83,7 @@ groups:
labels:
severity: warning
type: hypervisor_operator
playbook: docs/compute/kvm/playbooks/hypervisorreconcileerrors
annotations:
summary: "Hypervisor operator controller {{ $labels.controller }} has persistent reconcile errors"
description: "The controller {{ $labels.controller }} has been producing sustained reconciliation errors for more than 15 minutes."
@@ -89,6 +95,7 @@ groups:
labels:
severity: critical
type: hypervisor_operator
playbook: docs/compute/kvm/playbooks/hypervisoroperatordown
annotations:
summary: "Hypervisor operator is down"
description: "The hypervisor operator metrics endpoint has been unreachable for more than 5 minutes."
@@ -9,3 +9,18 @@ rules:
- /metrics
verbs:
- get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: {{ include "openstack-hypervisor-operator.fullname" . }}-metrics-reader-binding
labels:
{{- include "openstack-hypervisor-operator.labels" . | nindent 4 }}
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: '{{ include "openstack-hypervisor-operator.fullname" . }}-metrics-reader'
subjects:
- kind: ServiceAccount
name: '{{ include "openstack-hypervisor-operator.serviceAccountName" . }}'
namespace: '{{ .Release.Namespace }}'
Comment on lines +24 to +26
⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify token source and RBAC subject alignment for metrics scraping.

# 1) Find how ServiceMonitor authenticates.
rg -n -C3 'kind:\s*ServiceMonitor|bearerTokenFile|authorization|bearerTokenSecret' charts/openstack-hypervisor-operator

# 2) Find who is granted nonResourceURL /metrics access.
rg -n -C3 'kind:\s*ClusterRoleBinding|metrics-reader|nonResourceURLs|/metrics|subjects:' charts/openstack-hypervisor-operator

# 3) Check whether metrics endpoint auth/authz is enabled in manager/proxy wiring.
rg -n -C3 'metrics-bind-address|metrics-secure|metrics-require-rbac|kube-rbac-proxy|WithAuthenticationAndAuthorization'

Repository: cobaltcore-dev/openstack-hypervisor-operator

Length of output: 20446


🏁 Script executed:

# Check if .Values.serviceMonitor.scrapeServiceAccount exists in values
cat charts/openstack-hypervisor-operator/values.yaml

# Look for any serviceAccount references in values
rg -n 'serviceAccount|scrape' charts/openstack-hypervisor-operator/values.yaml

# Check the serviceAccountName helper function
rg -n 'serviceAccountName' charts/openstack-hypervisor-operator/templates

Repository: cobaltcore-dev/openstack-hypervisor-operator

Length of output: 3296


🏁 Script executed:

# Read the helpers file to see the serviceAccountName implementation
cat charts/openstack-hypervisor-operator/templates/_helpers.tpl | head -70

Repository: cobaltcore-dev/openstack-hypervisor-operator

Length of output: 2317


🏁 Script executed:

# Check what the actual Prometheus service account name would be
# Prometheus operator typically uses standard naming conventions
rg -n 'prometheus|monitoring' charts/openstack-hypervisor-operator/values.yaml

# Check if there are any defaults or documented configurations for Prometheus
rg -n 'prometheus' charts/openstack-hypervisor-operator/templates/servicemonitor.yaml

# Verify if there's any deployment or documentation about Prometheus namespace/SA
fd -type f -name "README*" -o -name "*.md" | head -20

Repository: cobaltcore-dev/openstack-hypervisor-operator

Length of output: 295


🏁 Script executed:

# Fix the fd command
find . -type f \( -name "README*" -o -name "*.md" \) 2>/dev/null | head -5

# Check if there's any mention of what service account Prometheus uses
rg -n 'prometheus' charts/openstack-hypervisor-operator/templates/servicemonitor.yaml -A5 -B5

# Look at the complete servicemonitor.yaml file
cat charts/openstack-hypervisor-operator/templates/servicemonitor.yaml

Repository: cobaltcore-dev/openstack-hypervisor-operator

Length of output: 1635


Update metrics RBAC binding to target the Prometheus scraper's ServiceAccount instead of the operator's.

The metrics endpoint has authentication and authorization enforced (via --metrics-secure=true and the WithAuthenticationAndAuthorization filter in cmd/main.go), but the metrics-reader role is currently bound to the operator ServiceAccount. The ServiceMonitor scrapes using the bearer token from Prometheus's own ServiceAccount (line 33 of servicemonitor.yaml), creating a token-to-RBAC mismatch that will cause 403 errors.

To fix this, the binding must target the Prometheus scraper's ServiceAccount. However, the suggested configuration path (.Values.serviceMonitor.scrapeServiceAccount) does not exist in the values.yaml. Define this configuration in the values schema, then update the binding subjects to reference it:

 subjects:
 - kind: ServiceAccount
-  name: '{{ include "openstack-hypervisor-operator.serviceAccountName" . }}'
-  namespace: '{{ .Release.Namespace }}'
+  name: '{{ .Values.serviceMonitor.scrapeServiceAccount.name }}'
+  namespace: '{{ .Values.serviceMonitor.scrapeServiceAccount.namespace | default .Release.Namespace }}'

Additionally, ensure the Prometheus ServiceAccount exists and is correctly referenced by the ServiceMonitor.
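As a sketch of the missing values schema (the `serviceMonitor.scrapeServiceAccount` key does not exist in the chart today, and the name/namespace defaults below are illustrative, matching a typical kube-prometheus-stack deployment rather than anything in this repository), the values.yaml addition could look like:

```yaml
serviceMonitor:
  scrapeServiceAccount:
    # ServiceAccount the Prometheus pods run as; its bearer token is
    # what the ServiceMonitor presents to the metrics endpoint.
    name: prometheus-k8s
    # Namespace where Prometheus is deployed; the binding template can
    # fall back to .Release.Namespace when this is left empty.
    namespace: monitoring
```

With such a key in place, the ClusterRoleBinding subject can reference it directly, so the token Prometheus presents matches the identity granted `/metrics` access.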

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@charts/openstack-hypervisor-operator/templates/metrics-reader-rbac.yaml`
around lines 24 - 26, the ClusterRoleBinding currently binds the "metrics-reader" ClusterRole
to the operator ServiceAccount; change it to bind to the Prometheus scraper
ServiceAccount by adding a new values key (e.g.
serviceMonitor.scrapeServiceAccount) to the chart's values/schema and update the
ClusterRoleBinding template (the subject referencing include
"openstack-hypervisor-operator.serviceAccountName") to use that new value
instead; also ensure the ServiceMonitor template (servicemonitor.yaml)
references the same serviceMonitor.scrapeServiceAccount and that the Prometheus
ServiceAccount resource exists or is documented as required for the deployment.

48 changes: 48 additions & 0 deletions charts/openstack-hypervisor-operator/templates/servicemonitor.yaml
@@ -0,0 +1,48 @@
# SPDX-FileCopyrightText: 2025 SAP SE or an SAP affiliate company and cobaltcore-dev contributors
# SPDX-License-Identifier: Apache-2.0

{{- if .Values.serviceMonitor.create }}
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: {{ include "openstack-hypervisor-operator.fullname" . }}-metrics
labels:
control-plane: controller-manager
{{- include "openstack-hypervisor-operator.labels" . | nindent 4 }}
{{- with .Values.serviceMonitor.labels }}
{{- toYaml . | nindent 4 }}
{{- end }}
{{- with .Values.serviceMonitor.annotations }}
annotations:
{{- toYaml . | nindent 4 }}
{{- end }}
spec:
jobLabel: control-plane
selector:
matchLabels:
control-plane: controller-manager
{{- include "openstack-hypervisor-operator.selectorLabels" . | nindent 6 }}
namespaceSelector:
matchNames:
- {{ .Release.Namespace }}
endpoints:
- port: https
scheme: https
tlsConfig:
insecureSkipVerify: true
bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
Comment on lines +31 to +33
🛠️ Refactor suggestion | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify whether insecure TLS is currently hardcoded and if a value toggle already exists.
rg -n -C3 'insecureSkipVerify|tlsConfig|serviceMonitor\..*insecure|caFile|serverName' charts/openstack-hypervisor-operator

Repository: cobaltcore-dev/openstack-hypervisor-operator

Length of output: 897


🏁 Script executed:

find charts/openstack-hypervisor-operator -name "values*.yaml" -o -name "values*.yml" | head -20

Repository: cobaltcore-dev/openstack-hypervisor-operator

Length of output: 136


🏁 Script executed:

cat -n charts/openstack-hypervisor-operator/templates/servicemonitor.yaml

Repository: cobaltcore-dev/openstack-hypervisor-operator

Length of output: 1941


🏁 Script executed:

rg -A 10 "serviceMonitor:" charts/openstack-hypervisor-operator/values* | head -50

Repository: cobaltcore-dev/openstack-hypervisor-operator

Length of output: 272


Make TLS verification configurable with secure defaults.

Hardcoding insecureSkipVerify: true at line 32 disables certificate validation and creates a security risk. This should be made configurable through values.yaml, defaulting to false so that certificate verification is enforced, in line with the existing pattern used for other ServiceMonitor settings such as interval and scrapeTimeout.

Consider also adding optional support for serverName and caFile to accommodate custom certificate chains and hostname verification in air-gapped or self-signed certificate environments.
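One way the template could render a configurable tlsConfig (the `.Values.serviceMonitor.tls.*` keys below are assumptions that would also need to be added to values.yaml; they are not part of the chart yet):

```yaml
      tlsConfig:
        # Defaults to false so certificate verification is enforced unless
        # the operator of the chart explicitly opts out.
        insecureSkipVerify: {{ .Values.serviceMonitor.tls.insecureSkipVerify | default false }}
        {{- with .Values.serviceMonitor.tls.serverName }}
        # Expected hostname in the serving certificate, for environments
        # where the service DNS name differs from the cert's SAN.
        serverName: {{ . }}
        {{- end }}
        {{- with .Values.serviceMonitor.tls.caFile }}
        # CA bundle path inside the Prometheus pod, for self-signed chains.
        caFile: {{ . }}
        {{- end }}
```

The `bearerTokenFile` line would stay as-is, so default rendering preserves current behavior except for enabling verification.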

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@charts/openstack-hypervisor-operator/templates/servicemonitor.yaml` around
lines 31 - 33, replace the hardcoded tlsConfig.insecureSkipVerify with a
values-driven setting (e.g. .Values.serviceMonitor.tls.insecureSkipVerify)
defaulting to false in values.yaml and update the ServiceMonitor template to
render tlsConfig.insecureSkipVerify from that value; also add optional
.Values.serviceMonitor.tls.serverName and .Values.serviceMonitor.tls.caFile keys
and conditionally include tlsConfig.serverName and tlsConfig.caFile only when
set, keeping bearerTokenFile unchanged so existing behavior is preserved.

{{- with .Values.serviceMonitor.interval }}
interval: {{ . }}
{{- end }}
{{- with .Values.serviceMonitor.scrapeTimeout }}
scrapeTimeout: {{ . }}
{{- end }}
{{- with .Values.serviceMonitor.metricRelabelings }}
metricRelabelings:
{{- toYaml . | nindent 8 }}
{{- end }}
{{- with .Values.serviceMonitor.relabelings }}
relabelings:
{{- toYaml . | nindent 8 }}
{{- end }}
{{- end }}
8 changes: 8 additions & 0 deletions charts/openstack-hypervisor-operator/values.yaml
@@ -45,6 +45,14 @@ metricsService:
protocol: TCP
targetPort: 8443
type: ClusterIP
serviceMonitor:
create: true
labels: {}
annotations: {}
interval: ""
scrapeTimeout: ""
metricRelabelings: []
relabelings: []
secret:
servicePassword: ""
serviceAccount: