
Improve sustained alert aggregation #21

Merged: Royal-lobster merged 1 commit into main from feat/quieter-sustained-aggregation on May 15, 2026
Conversation

Royal-lobster (Member) commented May 15, 2026

Summary

  • add rate-aware early handoff from ramp to sustained mode
  • add periodCount metadata and show per-period plus total counts in sustained formatter output
  • change the default sustained digest interval to 15 minutes and document the new aggregation knobs
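
To illustrate the second and third bullets, here is a minimal sketch of how a sustained digest line might show both the per-period and the total count. Only `periodCount` and the 15-minute default come from this PR; the `SustainedDigest` shape, the `formatSustained` function, and the output wording are hypothetical and not the project's actual formatter.

```typescript
// Hypothetical digest shape; only `periodCount` is named in the PR summary.
interface SustainedDigest {
  totalCount: number       // events seen since the incident began
  periodCount: number      // events seen since the previous digest
  digestIntervalMs: number // time between sustained digests
}

// Default stated in the PR summary: 15 minutes between sustained digests.
const DEFAULT_DIGEST_INTERVAL_MS = 15 * 60 * 1000

// Assumed output format, for illustration only.
function formatSustained(digest: SustainedDigest): string {
  const minutes = Math.round(digest.digestIntervalMs / 60000)
  return `+${digest.periodCount} in the last ${minutes} min (${digest.totalCount} total)`
}

const digestLine = formatSustained({
  totalCount: 1042,
  periodCount: 42,
  digestIntervalMs: DEFAULT_DIGEST_INTERVAL_MS,
})
// digestLine === "+42 in the last 15 min (1042 total)"
```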

Validation

  • pnpm test
  • pnpm typecheck
  • pnpm build
  • pnpm lint

Closes #23

@Royal-lobster Royal-lobster merged commit 3633fee into main May 15, 2026
3 checks passed
@github-actions github-actions Bot mentioned this pull request May 15, 2026

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces rate-aware early handoff from the ramp phase to the sustained phase, making alerting quieter during high-volume incidents. It also increases the default sustained update interval to 15 minutes and adds a periodCount field to track deltas between updates. Review feedback highlights a logic error in the early handoff implementation that could cause alert spamming and suggests ensuring the transition to sustained mode is one-way. Additionally, an optimization for the sliding window rate calculation was recommended to prevent performance degradation during alert storms.

Comment thread src/core/aggregator.ts
Comment on lines +106 to 124
    + const shouldEnterSustainedByRate =
    +   state.hasSentRampAlert && currentRate >= this.config.rampExitRatePerSecond
    +
      // Phase: sustained (count exceeds rampThreshold)
    - if (state.count > this.config.rampThreshold) {
    + if (state.count > this.config.rampThreshold || shouldEnterSustainedByRate) {
        state.phase = 'sustained'
        state.everEnteredSustained = true
        result.phase = 'sustained'

    -   if (now - state.lastAlertedAt >= this.config.digestIntervalMs) {
    +   const sustainedByRate = shouldEnterSustainedByRate && state.count <= this.config.rampThreshold
    +   if (sustainedByRate || now - state.lastAlertedAt >= this.config.digestIntervalMs) {
          result.shouldSend = true
    -     result.suppressedSince = state.count - state.lastAlertedCount
    +     result.periodCount = state.count - state.lastAlertedCount
    +     result.suppressedSince = result.periodCount
          state.lastAlertedAt = now
          state.lastAlertedCount = state.count
        }

        return result
      }

critical

The rate-aware early handoff sends an alert on every subsequent event once the rate threshold is crossed, until count exceeds rampThreshold. shouldEnterSustainedByRate (and therefore sustainedByRate) stays true for every call in that range, which bypasses the digestIntervalMs check.

In addition, if the rate drops below the threshold before count reaches rampThreshold, the aggregator can regress from the 'sustained' phase back to 'ramp'. The transition to sustained mode should be one-way and should trigger only a single immediate "handoff" alert.

Per the design document, also avoid early returns inside the phase-handling blocks so that context blocks (service name and timestamp) are handled consistently after the phase logic.

    const wasSustained = state.phase === 'sustained'
    const isRateThresholdHit = state.hasSentRampAlert && currentRate >= this.config.rampExitRatePerSecond

    if (state.count > this.config.rampThreshold || isRateThresholdHit || wasSustained) {
      state.phase = 'sustained'
      state.everEnteredSustained = true
      result.phase = 'sustained'

      const isEarlyExit = !wasSustained && isRateThresholdHit && state.count <= this.config.rampThreshold
      if (isEarlyExit || now - state.lastAlertedAt >= this.config.digestIntervalMs) {
        result.shouldSend = true
        result.periodCount = state.count - state.lastAlertedCount
        result.suppressedSince = result.periodCount
        state.lastAlertedAt = now
        state.lastAlertedCount = state.count
      }
    }
References
  1. Ensure that context blocks, such as service name and timestamp, are consistently present across all alert phases (onset, ramp, sustained, resolution) by moving the logic for adding the context block outside and after the switch statement that handles different phases.
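
The behavior the review asks for can be checked with a small standalone simulation. This is a sketch under stated assumptions, not the project's aggregator: the `step` function, the `Config`, `State`, and `Result` shapes, and the sample numbers are hypothetical, ramp-phase alerting is omitted, and the current rate is passed in directly rather than computed from a window. Only the field names and the branch logic follow the suggestion above.

```typescript
type Phase = 'ramp' | 'sustained'

interface Config {
  rampThreshold: number
  rampExitRatePerSecond: number
  digestIntervalMs: number
}

interface State {
  phase: Phase
  count: number
  hasSentRampAlert: boolean
  lastAlertedAt: number
  lastAlertedCount: number
}

interface Result {
  phase: Phase
  shouldSend: boolean
  periodCount: number
}

// Simplified stand-in for the sustained branch (hypothetical helper).
function step(state: State, config: Config, now: number, currentRate: number): Result {
  state.count += 1
  const result: Result = { phase: state.phase, shouldSend: false, periodCount: 0 }

  const wasSustained = state.phase === 'sustained'
  const isRateThresholdHit =
    state.hasSentRampAlert && currentRate >= config.rampExitRatePerSecond

  // `wasSustained` makes the transition one-way: once sustained, always sustained.
  if (state.count > config.rampThreshold || isRateThresholdHit || wasSustained) {
    state.phase = 'sustained'
    result.phase = 'sustained'

    // The immediate handoff alert fires only on the ramp -> sustained transition.
    const isEarlyExit =
      !wasSustained && isRateThresholdHit && state.count <= config.rampThreshold
    if (isEarlyExit || now - state.lastAlertedAt >= config.digestIntervalMs) {
      result.shouldSend = true
      result.periodCount = state.count - state.lastAlertedCount
      state.lastAlertedAt = now
      state.lastAlertedCount = state.count
    }
  }
  return result
}

const demoConfig: Config = {
  rampThreshold: 100,
  rampExitRatePerSecond: 5,
  digestIntervalMs: 15 * 60 * 1000,
}
const demoState: State = {
  phase: 'ramp',
  count: 10,
  hasSentRampAlert: true,
  lastAlertedAt: 0,
  lastAlertedCount: 10,
}

// Five events 100 ms apart, all above the rate threshold and below rampThreshold.
const burst = [0, 1, 2, 3, 4].map((i) => step(demoState, demoConfig, 1000 + i * 100, 8))
const handoffAlerts = burst.filter((r) => r.shouldSend).length // 1, not 5

// A later event with a rate of zero stays in the sustained phase.
const afterRateDrop = step(demoState, demoConfig, 2000, 0)
```

With the merged logic, the same burst would set `shouldSend` on all five events, because the rate condition alone satisfies the send check on every call.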

Comment thread src/core/aggregator.ts
      const windowCutoff = now - this.config.rampExitRateWindowMs
      state.rateWindow = state.rateWindow.filter((t) => t > windowCutoff)
    - const currentRate = state.rateWindow.length / (RATE_WINDOW_MS / 1000)
    + const currentRate = state.rateWindow.length / (this.config.rampExitRateWindowMs / 1000)

medium

Calling Array.prototype.filter on state.rateWindow (line 78) on every process call can degrade performance significantly during high-volume alert storms. For example, a storm of 10k events/sec would mean filtering and reallocating a 600k-element array 10k times per second (figures that imply a 60-second window). Consider a more efficient sliding-window approach, such as a queue with shift() or a bucketed counter, so the rate calculation has better time and memory complexity.
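
One way to avoid the per-event filter is a queue that advances a head index instead of reallocating. The sketch below is an assumption-laden illustration, not the project's code: the `SlidingWindowRate` class and its `record` method are hypothetical, and it assumes timestamps arrive in non-decreasing order. It keeps the same cutoff rule as the diff (entries with `t > windowCutoff` are retained).

```typescript
// Hypothetical helper: amortized O(1) per event instead of O(n) filtering.
class SlidingWindowRate {
  private timestamps: number[] = []
  private head = 0 // index of the oldest timestamp still inside the window
  private readonly windowMs: number

  constructor(windowMs: number) {
    this.windowMs = windowMs
  }

  // Records one event at `now` (ms) and returns the current rate in events/sec.
  record(now: number): number {
    this.timestamps.push(now)
    const cutoff = now - this.windowMs

    // Skip expired entries; each timestamp is passed over at most once.
    while (this.head < this.timestamps.length && this.timestamps[this.head]! <= cutoff) {
      this.head += 1
    }

    // Compact occasionally so the expired prefix does not grow without bound.
    if (this.head > 1024 && this.head * 2 > this.timestamps.length) {
      this.timestamps = this.timestamps.slice(this.head)
      this.head = 0
    }

    return (this.timestamps.length - this.head) / (this.windowMs / 1000)
  }
}

const rateWindow = new SlidingWindowRate(1000)
const observedRates = [0, 500, 1000, 2500].map((t) => rateWindow.record(t))
// observedRates is [1, 2, 2, 1]
```

This still stores one timestamp per event in the window. A bucketed counter (for example, one count per second) would bound memory by the number of buckets, at the cost of coarser rate resolution.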
