Enhancement proposal for multi-cluster alerts management#1921

Open
sradco wants to merge 1 commit into openshift:master from
sradco:multi_cluster_alert_managment_enhancment

Conversation

@sradco

@sradco sradco commented Jan 12, 2026

This PR includes the enhancement proposal for a new Multi-Cluster Alerts Management UI.

@openshift-ci openshift-ci Bot requested review from jan--f and moadz January 12, 2026 19:39
@openshift-ci
Contributor

openshift-ci Bot commented Jan 12, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jan--f for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sradco
Author

sradco commented Jan 12, 2026

@coleenquadros @jacobbaungard @simonpasquier @jgbernalp @jan--f @moadz I would appreciate your review of this proposal for multi-cluster alerting management UI.

@sradco sradco force-pushed the multi_cluster_alert_managment_enhancment branch 3 times, most recently from 7e23e3d to 4749f64 Compare January 18, 2026 10:44
@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci Bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 16, 2026
@openshift-bot

Stale enhancement proposals rot after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Rotten proposals close after an additional 7d of inactivity.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci Bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 23, 2026
@sradco
Author

sradco commented Feb 25, 2026

/remove-lifecycle rotten

@openshift-ci openshift-ci Bot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Feb 25, 2026
@sradco sradco force-pushed the multi_cluster_alert_managment_enhancment branch 2 times, most recently from 90d1528 to 53326b6 Compare March 4, 2026 15:23
@sradco
Author

sradco commented Mar 4, 2026

Hi @jacobbaungard, please review this enhancement proposal.
It is built on top of #1822 and #1917.

@sradco sradco force-pushed the multi_cluster_alert_managment_enhancment branch 2 times, most recently from 4e9725c to 61e639b Compare March 11, 2026 13:56
Contributor

@simonpasquier simonpasquier left a comment


I did a quick pass, but the proposal would benefit from being split into separate parts because it's nearly impossible to review in its current state. I'd recommend focusing on one part at a time, such as the visualization of spoke/hub alerts in the console.
I'd also expect some input from the ACM observability folks on the recommended approach for alert silencing.

Comment thread enhancements/monitoring/multi-cluster-alerts-ui-management.md Outdated
- **No ARC-applied labels**: The `ALERTS` metric is produced by Prometheus rule evaluation, before ARCs are applied. It lacks `openshift_io_alert_rule_id`, `openshift_io_alert_rule_component`, and `openshift_io_alert_rule_layer`.
- **No silence awareness**: Silenced alerts still appear as `alertstate="firing"` in the `ALERTS` metric — Prometheus does not know about Alertmanager silences.
- **`managed_cluster` is stripped**: The metrics-collector strips the `managed_cluster` label during federation. Only the `cluster` label (added by MCOA addon write relabel configs) is available on hub Thanos.
- **No disabled alert awareness**: ARC-dropped alerts never fire, so they are absent from `ALERTS`, but there is no way to distinguish "never fired" from "disabled by ARC."
Contributor


Hmm, RelabelConfigs don't change the alerting rules evaluated by Prometheus, only the alerts sent to Alertmanager.
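
To make the distinction in this comment concrete, here is a minimal sketch that models rule evaluation and alert relabeling as two separate stages. It is not Prometheus code: the function names, the alert names, and the simplified "drop by alertname" rule are all assumptions made for illustration.

```python
# Simplified model, not Prometheus code: rule evaluation and alert
# relabeling are two separate stages.

def alerts_series(firing):
    """Rule evaluation: every firing alert yields an ALERTS series."""
    return [{**labels, "__name__": "ALERTS", "alertstate": "firing"}
            for labels in firing]


def sent_to_alertmanager(firing, dropped_alertnames):
    """Hypothetical stand-in for a 'drop' alert relabel rule keyed on alertname."""
    return [labels for labels in firing
            if labels["alertname"] not in dropped_alertnames]


# Hypothetical alerts produced by rule evaluation.
firing = [
    {"alertname": "KubePodCrashLooping", "severity": "warning"},
    {"alertname": "Watchdog", "severity": "none"},
]

series = alerts_series(firing)
sent = sent_to_alertmanager(firing, dropped_alertnames={"Watchdog"})

print(len(series))                     # every firing alert keeps its ALERTS series
print([a["alertname"] for a in sent])  # only alerts that were not dropped
```

Under this model a dropped alert still fires and still appears in `ALERTS`; it is only withheld from Alertmanager, which is why "ARC-dropped alerts never fire" does not hold.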

Comment thread enhancements/monitoring/multi-cluster-alerts-ui-management.md Outdated
The controller periodically polls each spoke Alertmanager (`GET /api/v2/silences` via ManagedClusterProxy) and reconciles the state on hub AM:

- **Create**: when a new active silence is found on a spoke, the controller creates a replica on hub AM. The replica includes all original matchers plus an additional `managed_cluster=<cluster-name>` matcher to scope it to that spoke's alerts. A label or annotation `sync.source=<cluster-name>/<silence-id>` is added to the hub silence comment for traceability and to prevent conflicts with user-created hub silences.
- **Update**: if a spoke silence's `endsAt` is extended or matchers change, the controller expires the old hub replica and creates a new one.
Contributor


Note that a request to update a silence means "expire the silence", then "create a new silence". Similarly, deleting a silence means expiring it.
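
The reconciliation described in the quoted text, combined with the expire-then-create semantics noted here, could be sketched roughly as follows. This only plans actions and performs no HTTP calls. The silence dictionaries follow the field names of the Alertmanager v2 API (`matchers`, `endsAt`, `comment`, `status.state`), but `plan_sync`, `replica_of`, the `createdBy` value, and the way the `sync.source` marker is embedded in the comment are assumptions, not part of the proposal.

```python
def sync_marker(cluster, silence_id):
    # Marker format taken from the proposal text; placement in the comment is assumed.
    return f"sync.source={cluster}/{silence_id}"


def replica_of(cluster, spoke_silence):
    """Build the hub replica body: original matchers plus a managed_cluster matcher."""
    scoped = {"name": "managed_cluster", "value": cluster,
              "isRegex": False, "isEqual": True}
    return {
        "matchers": list(spoke_silence["matchers"]) + [scoped],
        "startsAt": spoke_silence["startsAt"],
        "endsAt": spoke_silence["endsAt"],
        "createdBy": "silence-sync",  # hypothetical controller identity
        "comment": f'{spoke_silence["comment"]} '
                   f'[{sync_marker(cluster, spoke_silence["id"])}]',
    }


def plan_sync(cluster, spoke_silences, hub_replicas):
    """Return ordered ("expire", hub_id) and ("create", body) actions.

    hub_replicas is assumed to hold only non-expired hub silences.
    """
    prefix = f"sync.source={cluster}/"
    active = {s["id"]: s for s in spoke_silences
              if s["status"]["state"] == "active"}
    by_source = {}  # spoke silence id -> hub replica created for it
    for hub in hub_replicas:
        marker = hub["comment"].rsplit("[", 1)[-1].rstrip("]")
        if marker.startswith(prefix):
            by_source[marker[len(prefix):]] = hub

    actions = []
    for sid, spoke in active.items():
        want, have = replica_of(cluster, spoke), by_source.get(sid)
        if have is None:
            actions.append(("create", want))
        elif (have["matchers"], have["endsAt"]) != (want["matchers"], want["endsAt"]):
            # An update is an expire followed by a create.
            actions.append(("expire", have["id"]))
            actions.append(("create", want))
    for sid, hub in by_source.items():
        if sid not in active:  # spoke silence expired or deleted
            actions.append(("expire", hub["id"]))
    return actions


# Hypothetical state: one spoke silence was extended, another has disappeared.
m = {"name": "alertname", "value": "HighLatency", "isRegex": False, "isEqual": True}
mc = {"name": "managed_cluster", "value": "spoke-1", "isRegex": False, "isEqual": True}
spoke = [{"id": "a", "matchers": [m], "startsAt": "2026-03-10T00:00:00Z",
          "endsAt": "2026-03-12T00:00:00Z", "comment": "maintenance",
          "status": {"state": "active"}}]
hub = [
    {"id": "h1", "matchers": [m, mc], "endsAt": "2026-03-11T00:00:00Z",
     "comment": "maintenance [sync.source=spoke-1/a]"},
    {"id": "h2", "matchers": [m, mc], "endsAt": "2026-03-11T00:00:00Z",
     "comment": "old [sync.source=spoke-1/gone]"},
]
plan = plan_sync("spoke-1", spoke, hub)
print([kind for kind, _ in plan])
```

In a real controller, each "expire" would map to `DELETE /api/v2/silence/{silenceID}` and each "create" to `POST /api/v2/silences` on the hub Alertmanager.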


For MVP, the UI focuses on the real-time alerts page (hub AM). Historical alert views are a future enhancement that depends on the `alerts_effective_*` metric being deployed and federated.

### Silence Sync Controller
Contributor


I have concerns about the whole approach to silences. Within a single Alertmanager cluster, silences are replicated using an approach that favors availability over consistency (conflict-free replicated data types under the hood). This means we have no real guarantee that two Alertmanager instances in the same spoke hold a consistent view of silences at any given moment.
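
The consistency concern can be illustrated with a toy last-writer-wins merge, where each replica keeps whichever copy of a silence has the later update time. This is a deliberately simplified sketch: the `merge` function, the integer timestamps, and the replica names are assumptions, and Alertmanager's actual gossip protocol and data structures are more involved.

```python
def merge(local, incoming):
    """Keep, per silence id, whichever copy has the later updatedAt."""
    merged = dict(local)
    for sid, silence in incoming.items():
        if sid not in merged or merged[sid]["updatedAt"] < silence["updatedAt"]:
            merged[sid] = silence
    return merged


# Two hypothetical replicas in the same spoke start with the same state.
am0 = {"s1": {"updatedAt": 1, "state": "active"}}
am1 = dict(am0)

# A client expires s1 on am0; am1 has not received the gossip message yet.
am0 = merge(am0, {"s1": {"updatedAt": 2, "state": "expired"}})
diverged = am0["s1"]["state"] != am1["s1"]["state"]

# Once gossip is delivered, both replicas converge on the newer copy.
am1 = merge(am1, am0)
converged = am0 == am1

print(diverged, converged)
```

A controller that polls a single replica during the divergence window may read a silence state that the other replica does not yet share.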

Author


I removed silences from the proposal for now.
They are not required for the MVP as long as there is a way to silence alerts on the hub, which, as far as I know, there is.

Author


Long term, I do think we should consider creating a CRD to manage silences, but it's not in the scope of this enhancement.

Comment thread enhancements/monitoring/multi-cluster-alerts-ui-management.md Outdated
- Hub AM can serve as a centralized notification hub for spoke alerts. Users can configure receivers (Slack, PagerDuty, email) on hub AM and route notifications by `managed_cluster` label — enabling fleet-wide notification management from a single configuration point instead of configuring receivers on each spoke individually.
- The hub AM config Secret uses `skip-creation-if-exist: "true"`, so user customizations are preserved across operator reconciliation.
- Future UI improvements could include managing hub AM receivers and routes from the console, multi‑cluster routing by cluster labels (region, team), notifications by impact group and component, and team‑scoped subscriptions honoring RBAC.
- The silence sync controller is essential for notification consistency: spoke silences must be replicated to hub AM so that both spoke-local and hub-centralized notifications are suppressed for silenced alerts.
Contributor


This assumes that users configure both Alertmanagers. For context, we added the option to disable Alertmanager in spoke clusters a long time ago, at ACM's request, so that alert notifications would be managed only at the hub level.
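
For the hub-only setup described here, routing by the `managed_cluster` label mentioned in the quoted text could be sketched as a first-match lookup. This is a simplified illustration and not Alertmanager's routing tree: `pick_receiver`, the cluster names, and the receiver names are all hypothetical.

```python
def pick_receiver(alert_labels, routes, default_receiver):
    """Return the receiver of the first route whose matchers all equal the alert's labels."""
    for matchers, receiver in routes:
        if all(alert_labels.get(name) == value for name, value in matchers.items()):
            return receiver
    return default_receiver


# Hypothetical hub routes, ordered from most to least specific.
routes = [
    ({"managed_cluster": "prod-eu-1", "severity": "critical"}, "pagerduty-eu"),
    ({"managed_cluster": "prod-eu-1"}, "slack-eu"),
]

critical = pick_receiver(
    {"alertname": "HighLatency", "managed_cluster": "prod-eu-1", "severity": "critical"},
    routes, "default")
warning = pick_receiver(
    {"alertname": "HighLatency", "managed_cluster": "prod-eu-1", "severity": "warning"},
    routes, "default")
other = pick_receiver(
    {"alertname": "HighLatency", "managed_cluster": "dev-us-2", "severity": "critical"},
    routes, "default")

print(critical, warning, other)
```

With spoke Alertmanagers disabled, a single table like this on the hub would be the only place where notification routing is decided.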

@sradco sradco force-pushed the multi_cluster_alert_managment_enhancment branch from 61e639b to 35ff2f6 Compare March 18, 2026 11:40
@sradco sradco force-pushed the multi_cluster_alert_managment_enhancment branch 9 times, most recently from c08a4bc to 58f29ef Compare April 5, 2026 16:21
@sradco sradco force-pushed the multi_cluster_alert_managment_enhancment branch 10 times, most recently from d3e9ce1 to 66653f1 Compare April 12, 2026 11:32
@sradco sradco force-pushed the multi_cluster_alert_managment_enhancment branch from 66653f1 to 0587bdb Compare April 15, 2026 11:56
Signed-off-by: Shirly Radco <sradco@redhat.com>
@sradco sradco force-pushed the multi_cluster_alert_managment_enhancment branch from 0587bdb to 5128955 Compare April 18, 2026 13:35
@openshift-ci
Contributor

openshift-ci Bot commented Apr 18, 2026

@sradco: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


3 participants