Enhancement proposal for multi-cluster alerts management #1921
sradco wants to merge 1 commit into openshift:master
@coleenquadros @jacobbaungard @simonpasquier @jgbernalp @jan--f @moadz I would appreciate your review of this proposal for a multi-cluster alerting management UI.
Hi @jacobbaungard, please review this enhancement proposal.
simonpasquier left a comment
I did a quick pass, but the proposal would benefit from being split into separate parts because it's nearly impossible to review in its current state. I'd recommend focusing on one part at a time, such as the visualization of spoke/hub alerts in the console.
I'd also expect some input from the ACM observability folks about the recommended approach for alert silencing.
- **No ARC-applied labels**: The `ALERTS` metric is produced by Prometheus rule evaluation, before ARCs are applied. It lacks `openshift_io_alert_rule_id`, `openshift_io_alert_rule_component`, and `openshift_io_alert_rule_layer`.
- **No silence awareness**: Silenced alerts still appear as `alertstate="firing"` in the `ALERTS` metric because Prometheus does not know about Alertmanager silences.
- **`managed_cluster` is stripped**: The metrics-collector strips the `managed_cluster` label during federation. Only the `cluster` label (added by MCOA addon write relabel configs) is available on hub Thanos.
- **No disabled alert awareness**: ARC-dropped alerts never fire, so they are absent from `ALERTS`, but there is no way to distinguish "never fired" from "disabled by ARC."

Hmm, a RelabelConfig doesn't change the alerting rules evaluated by Prometheus, only the alerts sent to Alertmanager.
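
To make the distinction in this comment concrete, here is a minimal Python sketch. It assumes a simplified model in which rule evaluation produces the `ALERTS` series and relabeling is applied only on the path to Alertmanager. The function names, the `drop_names`/`add_labels` parameters, and the label value `example` are hypothetical. This is not the Prometheus or AlertRelabelConfig API.

```python
from typing import Optional

# Hypothetical, simplified model; not the Prometheus or AlertRelabelConfig API.
Alert = dict[str, str]


def evaluate_rules(firing: list[Alert]) -> list[Alert]:
    """Rule evaluation: every firing alert becomes an ALERTS series, unmodified."""
    return [{**alert, "alertstate": "firing"} for alert in firing]


def relabel_for_alertmanager(
    alert: Alert, drop_names: set[str], add_labels: dict[str, str]
) -> Optional[Alert]:
    """Relabeling on the notification path: may drop the alert or add labels."""
    if alert["alertname"] in drop_names:
        return None  # dropped here, so it never reaches Alertmanager
    return {**alert, **add_labels}


firing = [{"alertname": "KubePodCrashLooping"}, {"alertname": "Watchdog"}]

# What rule evaluation records in the ALERTS metric.
alerts_metric = evaluate_rules(firing)

# What is actually sent to Alertmanager after relabeling.
sent = [
    out
    for alert in firing
    if (
        out := relabel_for_alertmanager(
            alert,
            drop_names={"Watchdog"},
            add_labels={"openshift_io_alert_rule_component": "example"},
        )
    )
    is not None
]
```

In this model the dropped alert is still present in `alerts_metric`, and none of the series there carry the added label; only the alerts in `sent` are affected by relabeling.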

The controller periodically polls each spoke Alertmanager (`GET /api/v2/silences` via ManagedClusterProxy) and reconciles the state on hub AM:

- **Create**: when a new active silence is found on a spoke, the controller creates a replica on hub AM. The replica includes all original matchers plus an additional `managed_cluster=<cluster-name>` matcher to scope it to that spoke's alerts. A label or annotation `sync.source=<cluster-name>/<silence-id>` is added to the hub silence comment for traceability and to prevent conflicts with user-created hub silences.
- **Update**: if a spoke silence's `endsAt` is extended or matchers change, the controller expires the old hub replica and creates a new one.

Note that a request to update a silence means "expire the silence", then "create a new silence". Similarly, deleting a silence means expiring it.
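
The Create and Update steps quoted above, combined with the expire-then-create semantics from this comment, can be sketched as follows. This is only an illustration under stated assumptions: `to_hub_replica` and `InMemoryHub` are hypothetical helpers, the hub is modeled as an in-memory dictionary rather than a real Alertmanager, and the matcher fields (`name`, `value`, `isRegex`, `isEqual`) are assumed to follow the Alertmanager v2 silence schema.

```python
import copy
import itertools


def to_hub_replica(spoke_silence: dict, cluster: str) -> dict:
    """Build a hub-side copy of a spoke silence, scoped to one cluster (hypothetical helper)."""
    replica = copy.deepcopy(spoke_silence)
    source_id = replica.pop("id")  # the hub assigns its own ID
    replica.pop("status", None)
    replica["matchers"].append(
        {"name": "managed_cluster", "value": cluster, "isRegex": False, "isEqual": True}
    )
    # Record the origin in the comment for traceability.
    tag = f"[sync.source={cluster}/{source_id}]"
    replica["comment"] = f'{replica.get("comment", "")} {tag}'.strip()
    return replica


class InMemoryHub:
    """Stand-in for hub AM; an update is modeled as expire + create."""

    def __init__(self) -> None:
        self._ids = itertools.count(1)
        self.silences: dict[str, dict] = {}

    def create(self, silence: dict) -> str:
        sid = f"hub-{next(self._ids)}"
        self.silences[sid] = {"state": "active", "silence": silence}
        return sid

    def expire(self, sid: str) -> None:
        self.silences[sid]["state"] = "expired"

    def update(self, sid: str, silence: dict) -> str:
        self.expire(sid)  # the old silence is expired, not modified in place
        return self.create(silence)  # the replacement gets a new ID


spoke = {
    "id": "abc123",
    "matchers": [
        {"name": "alertname", "value": "KubePodCrashLooping", "isRegex": False, "isEqual": True}
    ],
    "endsAt": "2030-01-01T00:00:00Z",
    "comment": "maintenance",
}
hub = InMemoryHub()
first = hub.create(to_hub_replica(spoke, "spoke-1"))

spoke["endsAt"] = "2030-01-02T00:00:00Z"  # the spoke silence is extended
second = hub.update(first, to_hub_replica(spoke, "spoke-1"))
```

Because the replacement silence gets a new ID, a sync controller built this way would have to track the mapping from spoke silence to the current hub ID, which is what the `sync.source` tag is meant to support.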

For MVP, the UI focuses on the real-time alerts page (hub AM). Historical alert views are a future enhancement that depends on the `alerts_effective_*` metric being deployed and federated.

### Silence Sync Controller

I have concerns about the whole approach to silences. Within a single Alertmanager cluster, silences are replicated using an approach that favors availability over consistency (conflict-free replicated data types under the hood). This means we have no real guarantee that two Alertmanager instances in the same spoke have a consistent state for silences.
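
The consistency concern can be illustrated with a deliberately simplified sketch. It uses a last-writer-wins map as a stand-in for the replicated silence state; this is an assumption made for illustration and is not Alertmanager's actual data structure or gossip protocol. `SilenceEntry`, `merge`, and the integer timestamps are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SilenceEntry:
    silence_id: str
    state: str       # "active" or "expired"
    updated_at: int  # logical timestamp, for illustration only


def merge(
    local: dict[str, SilenceEntry], remote: dict[str, SilenceEntry]
) -> dict[str, SilenceEntry]:
    """Last-writer-wins merge: for each silence, keep the most recently updated entry."""
    merged = dict(local)
    for sid, entry in remote.items():
        current = merged.get(sid)
        if current is None or entry.updated_at > current.updated_at:
            merged[sid] = entry
    return merged


# Two instances of the same spoke Alertmanager cluster start in agreement.
am0 = {"s1": SilenceEntry("s1", "active", 1)}
am1 = {"s1": SilenceEntry("s1", "active", 1)}

# A user expires s1 on am0; am1 has not received the update yet.
am0["s1"] = SilenceEntry("s1", "expired", 2)
diverged = am0["s1"].state != am1["s1"].state

# After the instances exchange state, they converge.
am1 = merge(am1, am0)
am0 = merge(am0, am1)
```

Until the merge happens, a controller polling `am1` would still see the silence as active, so the result of a sync depends on which instance answers the request.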

I removed the silences from the proposal for now.
They aren't needed for the MVP as long as there is a way to silence alerts in the hub, which AFAIK there is.

Long term, I do think we should consider creating a CRD to manage the silences, but it's not in the scope of this enhancement.

- Hub AM can serve as a centralized notification hub for spoke alerts. Users can configure receivers (Slack, PagerDuty, email) on hub AM and route notifications by `managed_cluster` label, enabling fleet-wide notification management from a single configuration point instead of configuring receivers on each spoke individually.
- The hub AM config Secret uses `skip-creation-if-exist: "true"`, so user customizations are preserved across operator reconciliation.
- Future UI improvements could include managing hub AM receivers and routes from the console, multi-cluster routing by cluster labels (region, team), notifications by impact group and component, and team-scoped subscriptions honoring RBAC.
- The silence sync controller is essential for notification consistency: spoke silences must be replicated to hub AM so that both spoke-local and hub-centralized notifications are suppressed for silenced alerts.

This assumes that users configure both Alertmanagers. For context, we added the ability to disable Alertmanager in the spoke clusters at ACM's request a long time ago, so that alert notifications would be managed only at the hub level.
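
For the hub-only case mentioned here, the routing by `managed_cluster` described in the quoted proposal text can be sketched in a few lines. This is a minimal sketch of label-based receiver selection, not Alertmanager configuration syntax; the cluster names, receiver names, and the `pick_receiver` helper are all hypothetical.

```python
# Hypothetical routing table: (required labels, receiver name).
ROUTES: list[tuple[dict[str, str], str]] = [
    ({"managed_cluster": "prod-east"}, "pagerduty-prod"),
    ({"managed_cluster": "dev-1"}, "slack-dev"),
]
DEFAULT_RECEIVER = "email-default"


def pick_receiver(labels: dict[str, str]) -> str:
    """Return the receiver of the first route whose labels all match, else the default."""
    for required, receiver in ROUTES:
        if all(labels.get(name) == value for name, value in required.items()):
            return receiver
    return DEFAULT_RECEIVER


prod_receiver = pick_receiver(
    {"alertname": "KubePodCrashLooping", "managed_cluster": "prod-east"}
)
other_receiver = pick_receiver(
    {"alertname": "KubePodCrashLooping", "managed_cluster": "staging-2"}
)
```

With Alertmanager disabled on the spokes, a table like this on the hub would be the only place where notification routing is decided.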
Signed-off-by: Shirly Radco <sradco@redhat.com>
This PR includes the enhancement proposal for a new Multi-Cluster Alerts Management UI.