Skip to content

Conversation

@b-wu26
Copy link

@b-wu26 b-wu26 commented Dec 11, 2025

While using cortex, my team recently ran into an issue with the alert managers where a corrupt config was synced and extended across the entire alert manager ring. This led to our entire fleet being stuck in a crash loop back off and prompted a need to limit the blast radius of potential issues to the replication factor and also improve the api tolerance for apis that ran into extreme issues during the event.

What this PR does:

  • creates a feature flag which disables the replica set extension capabilities called: DisableReplicaSetExtension
  • makes the getReceivers v2 api a quorum call since most other read APIs are quorum based anyways,
  • make the put silence more resilient and retry on valid 4xx and 5xx errors (timeouts, failed connections, etc.)

Which issue(s) this PR fixes:
Fixes #

Checklist

  • [x ] Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant