Problem
When a CRL endpoint is unavailable (e.g. returns 404), the pki.crl health check reports as failing, which causes the overall /health endpoint to return a non-healthy status. In Kubernetes environments using liveness/readiness probes on /health, this causes the pod to be killed and restarted — but a restart cannot resolve an external CRL endpoint being down.
Example warning seen in such a scenario:
time="2026-04-07T10:49:19Z" level=warning msg="Health check status is not UP, failing components:\n - pki.crl: [{CN=TEST Zorg CSP Private Root CA G1,O=CIBG,C=NL http://www.uzi-register-test.nl/cdp/test_zorg_csp_private_root_ca_g1.crl 0001-01-01 00:00:00 +0000 UTC} ...]"
Root cause
External resources (CRL endpoints from CAs) are outside the operator's control. Including them in the top-level health check creates a feedback loop: the node becomes unhealthy, gets restarted, and will never recover on its own even though the application itself is perfectly functional.
Proposed solution
Separate the health check into two layers, similar to the Spring Boot Actuator pattern:
/health — aggregates only components where a restart could help (i.e. internal, self-recoverable issues). This is what Kubernetes liveness/readiness probes should target.
/health/<component> — exposes the full detail per component, including external dependencies like CRL availability. Operators can wire these into their monitoring/alerting stack.
This way, a CA's CRL endpoint being temporarily unavailable remains observable through monitoring, but does not cause the pod to be crash-looped indefinitely.
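The two-layer split could be sketched as follows. This is a minimal illustration, not the node's actual implementation: the `Component`, `CheckResult`, and `aggregate` names are hypothetical, and the real health checks return richer structures than shown here. The key idea is that only components marked as internal contribute to the /health aggregate, while /health/<component> exposes every component's detail.

```go
package main

import (
	"fmt"
	"net/http"
)

// CheckResult is the outcome of a single health check (illustrative type).
type CheckResult struct {
	Status  string // "UP" or "DOWN"
	Details string
}

// Component wraps a named health check. Internal marks checks where a
// restart could help; only those feed the top-level /health aggregate.
type Component struct {
	Name     string
	Internal bool
	Check    func() CheckResult
}

// aggregate returns "UP" only if every internal component is up.
// External dependencies (e.g. pki.crl) are skipped on purpose, so an
// unreachable CRL endpoint cannot fail the liveness/readiness probe.
func aggregate(components []Component) string {
	for _, c := range components {
		if !c.Internal {
			continue
		}
		if c.Check().Status != "UP" {
			return "DOWN"
		}
	}
	return "UP"
}

func main() {
	components := []Component{
		{Name: "store", Internal: true, Check: func() CheckResult {
			return CheckResult{Status: "UP"}
		}},
		// CRL endpoint down: visible via /health/pki.crl, ignored by /health.
		{Name: "pki.crl", Internal: false, Check: func() CheckResult {
			return CheckResult{Status: "DOWN", Details: "CRL fetch returned 404"}
		}},
	}

	// /health: the liveness/readiness probe target, internal components only.
	http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		status := aggregate(components)
		if status != "UP" {
			w.WriteHeader(http.StatusServiceUnavailable)
		}
		fmt.Fprint(w, status)
	})

	// /health/<component>: per-component detail, for monitoring/alerting.
	http.HandleFunc("/health/", func(w http.ResponseWriter, r *http.Request) {
		name := r.URL.Path[len("/health/"):]
		for _, c := range components {
			if c.Name == name {
				res := c.Check()
				fmt.Fprintf(w, "%s: %s", res.Status, res.Details)
				return
			}
		}
		http.NotFound(w, r)
	})

	// Server startup omitted; the aggregate alone demonstrates the point:
	fmt.Println(aggregate(components)) // "UP" despite pki.crl being down
}
```

With this layering, Kubernetes probes point at /health, while an alerting rule watching /health/pki.crl fires when the CRL endpoint misbehaves, without triggering a restart.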
Trade-off
If a future release adds a new component, operators must explicitly add its /health/<component> endpoint to their monitoring setup; it won't be picked up automatically. This is acceptable, since the alternative (one external failure killing the entire pod) is worse.