diff --git a/calico-cloud/operations/monitor/prometheus/configure-prometheus.mdx b/calico-cloud/operations/monitor/prometheus/configure-prometheus.mdx
index de68da887c..29c4d7a155 100644
--- a/calico-cloud/operations/monitor/prometheus/configure-prometheus.mdx
+++ b/calico-cloud/operations/monitor/prometheus/configure-prometheus.mdx
@@ -31,7 +31,7 @@ As an example, the range query in this Manifest is 10 seconds.
 apiVersion: monitoring.coreos.com/v1
 kind: PrometheusRule
 metadata:
-  name: calico-prometheus-dp-rate
+  name: calico
   namespace: tigera-prometheus
   labels:
     role: tigera-prometheus-rules
@@ -56,7 +56,7 @@ To update this alerting rule, to say, execute the query with a range of
 apiVersion: monitoring.coreos.com/v1
 kind: PrometheusRule
 metadata:
-  name: calico-prometheus-dp-rate
+  name: calico
   namespace: tigera-prometheus
   labels:
     role: tigera-prometheus-rules
diff --git a/calico-enterprise/operations/license-options.mdx b/calico-enterprise/operations/license-options.mdx
index 8482c1a2d5..ec2a638846 100644
--- a/calico-enterprise/operations/license-options.mdx
+++ b/calico-enterprise/operations/license-options.mdx
@@ -30,7 +30,7 @@ These metrics are scraped by the built-in Prometheus instance via the `tigera-op
 $[prodname] installs PrometheusRule resources with alerting rules for license expiration. You can view them with:
 
 ```bash
-kubectl -n tigera-prometheus get prometheusrule calico-prometheus-dp-rate -o yaml
+kubectl -n tigera-prometheus get prometheusrule calico -o yaml
 ```
 
 The built-in rules include:
diff --git a/calico-enterprise/operations/monitor/metrics/operator-metrics.mdx b/calico-enterprise/operations/monitor/metrics/operator-metrics.mdx
new file mode 100644
index 0000000000..645a2952b7
--- /dev/null
+++ b/calico-enterprise/operations/monitor/metrics/operator-metrics.mdx
@@ -0,0 +1,133 @@
+---
+description: Monitor the Calico Enterprise operator with Prometheus metrics for component health, TLS certificate expiry, and license status.
+---
+
+# Operator metrics
+
+## Big picture
+
+Use Prometheus to monitor the $[prodname] operator.
+
+## Value
+
+The $[prodname] operator exposes Prometheus metrics that give you visibility into the overall health of your $[prodname] installation. These metrics let you set up alerts for degraded components, expiring TLS certificates, and license issues before they impact operations.
+
+## Before you begin
+
+Operator metrics are enabled by default. The `tigera-operator` deployment ships with the following environment variables already set:
+
+- `METRICS_ENABLED=true`
+- `METRICS_SCHEME=https`
+
+Metrics are served on port **9484** over HTTPS and scraped by the built-in Prometheus instance via the `tigera-operator-metrics` ServiceMonitor.
+
+## Metrics reference
+
+### Component status
+
+The `tigera_operator_component_status` metric reports the status of each $[prodname] component as managed by the operator. This mirrors the information available through `kubectl get tigerastatus`.
+
+| Metric | Labels | Description |
+|---|---|---|
+| `tigera_operator_component_status` | `component`, `condition` | Status of a component. Value is `1` (true) or `0` (false). |
+
+**Labels:**
+
+- `component` — The $[prodname] component (e.g. `calico`, `apiserver`, `monitor`, `log-storage`, `manager`)
+- `condition` — One of:
+  - `available` — The component is running and healthy
+  - `degraded` — The component is in an error state
+  - `progressing` — The component is being updated or is starting up
+
+**Example queries:**
+
+- Find all degraded components:
+
+  ```
+  tigera_operator_component_status{condition="degraded"} == 1
+  ```
+
+- Check if a specific component is available:
+
+  ```
+  tigera_operator_component_status{component="calico", condition="available"}
+  ```
+
+### TLS certificate expiry
+
+The `tigera_operator_tls_certificate_expiry_timestamp_seconds` metric reports the expiry time of each TLS certificate managed by the operator.
+
+| Metric | Labels | Description |
+|---|---|---|
+| `tigera_operator_tls_certificate_expiry_timestamp_seconds` | `name`, `namespace`, `issuer` | Unix timestamp when the certificate expires. |
+
+**Labels:**
+
+- `name` — The Secret name containing the certificate
+- `namespace` — The namespace of the Secret
+- `issuer` — The certificate issuer (e.g. `tigera-operator-signer`)
+
+**Example queries:**
+
+- Certificates expiring within 30 days:
+
+  ```
+  tigera_operator_tls_certificate_expiry_timestamp_seconds - time() < 30 * 24 * 3600
+  ```
+
+- Certificates expiring within 7 days:
+
+  ```
+  tigera_operator_tls_certificate_expiry_timestamp_seconds - time() < 7 * 24 * 3600
+  ```
+
+### License
+
+| Metric | Labels | Description |
+|---|---|---|
+| `tigera_operator_license_valid` | — | Whether the license is valid (`1`) or not (`0`). |
+| `tigera_operator_license_expiry_timestamp_seconds` | — | Unix timestamp when the license expires. |
+
+**Example queries:**
+
+- License expires within 30 days:
+
+  ```
+  tigera_operator_license_expiry_timestamp_seconds - time() < 30 * 24 * 3600
+  ```
+
+- License is invalid:
+
+  ```
+  tigera_operator_license_valid == 0
+  ```
+
+## Built-in alerts
+
+$[prodname] installs a PrometheusRule resource named `calico` with alerting rules that use these metrics. You can view it with:
+
+```bash
+kubectl -n tigera-prometheus get prometheusrule calico -o yaml
+```
+
+The built-in rules include:
+
+| Alert | Condition | Severity |
+|---|---|---|
+| `DeniedPacketsRate` | Denied packets rate > 50/s | info |
+| `TLSCertExpiringWarning` | Certificate expires in < 30 days | warning |
+| `TLSCertExpiringCritical` | Certificate expires in < 7 days | critical |
+| `LicenseExpiringWarning` | License expires in < 30 days | warning |
+| `LicenseExpiringCritical` | License expires in < 7 days or is invalid | critical |
+| `ComponentDegradedWarning` | Component degraded for > 15m | warning |
+| `ComponentDegradedCritical` | Component degraded for > 30m | critical |
+| `ComponentProgressingWarning` | Component progressing for > 15m | warning |
+| `ComponentProgressingCritical` | Component progressing for > 30m | critical |
+
+To route these alerts, see [Configure Alertmanager](../prometheus/alertmanager.mdx).
+
+## Additional resources
+
+- [License expiration and renewal](../../license-options.mdx)
+- [Configure Prometheus](../prometheus/configure-prometheus.mdx)
+- [BYO Prometheus](../prometheus/byo-prometheus.mdx)
diff --git a/calico-enterprise/operations/monitor/prometheus/configure-prometheus.mdx b/calico-enterprise/operations/monitor/prometheus/configure-prometheus.mdx
index 4cf8aadd5c..8f8c09a65f 100644
--- a/calico-enterprise/operations/monitor/prometheus/configure-prometheus.mdx
+++ b/calico-enterprise/operations/monitor/prometheus/configure-prometheus.mdx
@@ -31,7 +31,7 @@ As an example, the range query in this Manifest is 10 seconds.
 apiVersion: monitoring.coreos.com/v1
 kind: PrometheusRule
 metadata:
-  name: calico-prometheus-dp-rate
+  name: calico
   namespace: tigera-prometheus
   labels:
     role: tigera-prometheus-rules
@@ -56,7 +56,7 @@ To update this alerting rule, to say, execute the query with a range of
 apiVersion: monitoring.coreos.com/v1
 kind: PrometheusRule
 metadata:
-  name: calico-prometheus-dp-rate
+  name: calico
   namespace: tigera-prometheus
   labels:
     role: tigera-prometheus-rules
diff --git a/sidebars-calico-enterprise.js b/sidebars-calico-enterprise.js
index 3ddc5a675a..187af41307 100644
--- a/sidebars-calico-enterprise.js
+++ b/sidebars-calico-enterprise.js
@@ -588,6 +588,7 @@ module.exports = {
           label: 'Metrics',
           link: { type: 'doc', id: 'operations/monitor/metrics/index' },
           items: [
+            'operations/monitor/metrics/operator-metrics',
             'operations/monitor/metrics/recommended-metrics',
             'operations/monitor/metrics/bgp-metrics',
             'operations/monitor/metrics/policy-metrics',