Conversation
📝 WalkthroughWalkthroughThe PR adds comprehensive monitoring to the committed resource syncer by introducing new Prometheus alerts, refactoring the Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
Test Coverage ReportTest Coverage 📊: 68.4% |
There was a problem hiding this comment.
🧹 Nitpick comments (7)
internal/scheduling/reservations/commitments/reservation_manager.go (1)
66-68: Consider returning nil on early errors before any mutations.Currently, error paths (lines 77, 93) return
nilwhile later errors (lines 125, 163, etc.) return a partially populatedresult. This inconsistency could confuse callers. Consider either always returningnilon error, or always returningresultto provide partial information about what was done before the failure.Looking at the callers in
api_change_commitments.goandsyncer.go, they only check forerr != niland don't use the result on error, so this is not a functional issue.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/scheduling/reservations/commitments/reservation_manager.go` around lines 66 - 68, The function initializes a local variable result of type *ApplyResult and currently returns inconsistent values on error; update the error returns to be consistent by returning (nil, err) on all error paths (including the early error branches around where result is first created and the later ones at lines like 125 and 163), so callers (api_change_commitments.go and syncer.go) that only test err != nil will see a nil result on failure; ensure every return that currently returns a non-nil result with an error instead returns nil for the *ApplyResult and the same error.internal/scheduling/reservations/commitments/syncer.go (1)
295-304: Consider recording metrics unconditionally for observability.Currently metrics are only recorded when
> 0. While this avoids incrementing counters by zero, it means Prometheus won't see the metric series at all if no changes occur in a sync run. This could causerate()calculations in alerts to behave unexpectedly (returning no data vs. returning 0).Since Prometheus counters with
Add(0)are essentially no-ops and the series are already pre-initialized in the monitor, recording unconditionally would ensure consistent metric availability.Optional: Record metrics unconditionally
// Record reservation change metrics if s.monitor != nil { - if totalCreated > 0 { - s.monitor.RecordReservationsCreated(totalCreated) - } - if totalDeleted > 0 { - s.monitor.RecordReservationsDeleted(totalDeleted) - } - if totalRepaired > 0 { - s.monitor.RecordReservationsRepaired(totalRepaired) - } + s.monitor.RecordReservationsCreated(totalCreated) + s.monitor.RecordReservationsDeleted(totalDeleted) + s.monitor.RecordReservationsRepaired(totalRepaired) }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/scheduling/reservations/commitments/syncer.go` around lines 295 - 304, The current conditional guards around s.monitor.RecordReservationsCreated, RecordReservationsDeleted, and RecordReservationsRepaired prevent metrics from being emitted when counts are zero; change the logic to call these monitor methods unconditionally (still guarded by s.monitor != nil) so that each sync run records totals even when 0, i.e., remove the > 0 checks and always invoke s.monitor.RecordReservationsCreated(totalCreated), s.monitor.RecordReservationsDeleted(totalDeleted), and s.monitor.RecordReservationsRepaired(totalRepaired) when s.monitor is non-nil.helm/bundles/cortex-nova/alerts/nova.alerts.yaml (4)
557-577: Same division by zero risk applies here.This alert also divides by
rate(cortex_committed_resource_syncer_commitments_processed_total...)which could be zero. Apply the sameand > 0guard on the denominator.Suggested fix
- alert: CortexNovaCommittedResourceSyncerLocalChangeRateHigh expr: | ( rate(cortex_committed_resource_syncer_reservations_created_total{service="cortex-nova-metrics"}[1h]) + rate(cortex_committed_resource_syncer_reservations_deleted_total{service="cortex-nova-metrics"}[1h]) + rate(cortex_committed_resource_syncer_reservations_repaired_total{service="cortex-nova-metrics"}[1h]) ) / rate(cortex_committed_resource_syncer_commitments_processed_total{service="cortex-nova-metrics"}[1h]) > 0.01 + and rate(cortex_committed_resource_syncer_commitments_processed_total{service="cortex-nova-metrics"}[1h]) > 0🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@helm/bundles/cortex-nova/alerts/nova.alerts.yaml` around lines 557 - 577, The alert CortexNovaCommittedResourceSyncerLocalChangeRateHigh divides by rate(cortex_committed_resource_syncer_commitments_processed_total{service="cortex-nova-metrics"}[1h]) which can be zero; update the expr to guard the denominator by including an additional clause that ensures rate(cortex_committed_resource_syncer_commitments_processed_total{service="cortex-nova-metrics"}[1h]) > 0 (e.g., multiply the numerator/denominator expression by and the denominator > 0 or use a conditional with and > 0) so the division cannot occur when the processed rate is zero; modify the expression in the CortexNovaCommittedResourceSyncerLocalChangeRateHigh alert accordingly, referencing the metrics cortex_committed_resource_syncer_reservations_created_total, _deleted_total, _repaired_total and cortex_committed_resource_syncer_commitments_processed_total.
518-536: Division by zero risk in rate calculations.When
cortex_committed_resource_syncer_commitments_totalhas zero rate (e.g., syncer hasn't run or no commitments exist), the division will produceNaNor+Inf, which won't trigger the> 0.05condition but could cause confusing metric behavior.Consider using a
andclause or> 0filter on the denominator to ensure clean results.Suggested fix to avoid division by zero
- alert: CortexNovaCommittedResourceSyncerUnitMismatchRateHigh expr: | - rate(cortex_committed_resource_syncer_commitments_skipped_total{service="cortex-nova-metrics", reason="unit_mismatch"}[1h]) - / rate(cortex_committed_resource_syncer_commitments_total{service="cortex-nova-metrics"}[1h]) > 0.05 + ( + rate(cortex_committed_resource_syncer_commitments_skipped_total{service="cortex-nova-metrics", reason="unit_mismatch"}[1h]) + / rate(cortex_committed_resource_syncer_commitments_total{service="cortex-nova-metrics"}[1h]) + ) > 0.05 + and rate(cortex_committed_resource_syncer_commitments_total{service="cortex-nova-metrics"}[1h]) > 0🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@helm/bundles/cortex-nova/alerts/nova.alerts.yaml` around lines 518 - 536, The alert CortexNovaCommittedResourceSyncerUnitMismatchRateHigh risks division-by-zero; update its expr to ensure the denominator (rate(cortex_committed_resource_syncer_commitments_total{service="cortex-nova-metrics"}[1h])) is positive before dividing the skipped rate by it — e.g., compute the skipped rate and denominator rate separately using the same selector and time window (metrics: cortex_committed_resource_syncer_commitments_skipped_total and cortex_committed_resource_syncer_commitments_total) and add a guard like "and rate(...denominator...)[1h] > 0" or filter the denominator with "> 0" so the division only occurs when the denominator rate is nonzero.
538-555: Threshold of > 0 may be too sensitive.This alert fires when any commitment is skipped due to unknown flavor group. While this might be intentional for catching configuration issues early, it could be noisy if there's even one stale commitment referencing a recently-removed flavor group.
Consider whether a small threshold (e.g.,
> 0.01for 1%) or absolute count would be more appropriate for your operational needs.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@helm/bundles/cortex-nova/alerts/nova.alerts.yaml` around lines 538 - 555, The alert CortexNovaCommittedResourceSyncerUnknownFlavorGroupRateHigh is currently triggered for any non-zero rate (expr uses > 0), which is too sensitive and may be noisy; update the PromQL in this alert (the expr block in CortexNovaCommittedResourceSyncerUnknownFlavorGroupRateHigh) to use a more practical threshold (e.g., replace > 0 with > 0.01 for 1%) or switch to an absolute-count check (e.g., compare increase(...) or sum(...) over [1h] > N) depending on preference, and ensure the summary/description reflect the chosen threshold change.
579-597: Division by zero risk and consider threshold.Similar to the previous alerts:
- Division by zero when no commitments are processed
- Threshold
> 0fires on any repair - consider if this is the desired sensitivitySuggested fix
- alert: CortexNovaCommittedResourceSyncerRepairRateHigh expr: | rate(cortex_committed_resource_syncer_reservations_repaired_total{service="cortex-nova-metrics"}[1h]) / rate(cortex_committed_resource_syncer_commitments_processed_total{service="cortex-nova-metrics"}[1h]) > 0 + and rate(cortex_committed_resource_syncer_commitments_processed_total{service="cortex-nova-metrics"}[1h]) > 0🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@helm/bundles/cortex-nova/alerts/nova.alerts.yaml` around lines 579 - 597, The alert CortexNovaCommittedResourceSyncerRepairRateHigh risks division by zero and is overly sensitive; update its expr to guard the denominator and make the threshold configurable: use the metric names cortex_committed_resource_syncer_reservations_repaired_total and cortex_committed_resource_syncer_commitments_processed_total and change the expression to only evaluate when the processed rate > 0 (e.g. add "and rate(cortex_committed_resource_syncer_commitments_processed_total{service=\"cortex-nova-metrics\"}[1h]) > 0" or use a bool operator) and replace the hard > 0 threshold with a configurable small threshold (or a templated alert severity threshold) so the alert only fires for a meaningful repair rate instead of any single repair.internal/scheduling/reservations/commitments/syncer_monitor.go (1)
112-115: Consider removing unused parameter or adding documentation.
RecordUnitMismatchaccepts a_ stringparameter that is ignored. If this is for backward compatibility with an older API that passed flavor group, consider adding a brief comment explaining why the parameter exists but is unused. Alternatively, if no callers pass meaningful values, the parameter could be removed.Looking at
syncer.goline 140, the caller passesflavorGroupNamewhich gets ignored.Option 1: Add documentation
-// RecordUnitMismatch records a unit mismatch skip (convenience method). +// RecordUnitMismatch records a unit mismatch skip (convenience method). +// The flavorGroup parameter is accepted for API compatibility but ignored; +// unit mismatches are tracked as a single aggregate counter. func (m *SyncerMonitor) RecordUnitMismatch(_ string) {Option 2: Remove unused parameter
-func (m *SyncerMonitor) RecordUnitMismatch(_ string) { +func (m *SyncerMonitor) RecordUnitMismatch() {And update caller in
syncer.go:- s.monitor.RecordUnitMismatch(flavorGroupName) + s.monitor.RecordUnitMismatch()🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/scheduling/reservations/commitments/syncer_monitor.go` around lines 112 - 115, RecordUnitMismatch on type SyncerMonitor accepts an unused string parameter; either remove the parameter or document why it exists. Fix option A: remove the parameter from SyncerMonitor.RecordUnitMismatch and update all call sites (e.g., the call in syncer.go that currently passes flavorGroupName) to call RecordUnitMismatch() with no args. Fix option B: keep the signature for compatibility and add a brief comment on the RecordUnitMismatch method explaining the parameter is intentionally ignored (e.g., retained for backward compatibility when callers passed flavor group) so future readers know it’s deliberate.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@helm/bundles/cortex-nova/alerts/nova.alerts.yaml`:
- Around line 557-577: The alert
CortexNovaCommittedResourceSyncerLocalChangeRateHigh divides by
rate(cortex_committed_resource_syncer_commitments_processed_total{service="cortex-nova-metrics"}[1h])
which can be zero; update the expr to guard the denominator by including an
additional clause that ensures
rate(cortex_committed_resource_syncer_commitments_processed_total{service="cortex-nova-metrics"}[1h])
> 0 (e.g., multiply the numerator/denominator expression by and the denominator
> 0 or use a conditional with and > 0) so the division cannot occur when the
processed rate is zero; modify the expression in the
CortexNovaCommittedResourceSyncerLocalChangeRateHigh alert accordingly,
referencing the metrics
cortex_committed_resource_syncer_reservations_created_total, _deleted_total,
_repaired_total and
cortex_committed_resource_syncer_commitments_processed_total.
- Around line 518-536: The alert
CortexNovaCommittedResourceSyncerUnitMismatchRateHigh risks division-by-zero;
update its expr to ensure the denominator
(rate(cortex_committed_resource_syncer_commitments_total{service="cortex-nova-metrics"}[1h]))
is positive before dividing the skipped rate by it — e.g., compute the skipped
rate and denominator rate separately using the same selector and time window
(metrics: cortex_committed_resource_syncer_commitments_skipped_total and
cortex_committed_resource_syncer_commitments_total) and add a guard like "and
rate(...denominator...)[1h] > 0" or filter the denominator with "> 0" so the
division only occurs when the denominator rate is nonzero.
- Around line 538-555: The alert
CortexNovaCommittedResourceSyncerUnknownFlavorGroupRateHigh is currently
triggered for any non-zero rate (expr uses > 0), which is too sensitive and may
be noisy; update the PromQL in this alert (the expr block in
CortexNovaCommittedResourceSyncerUnknownFlavorGroupRateHigh) to use a more
practical threshold (e.g., replace > 0 with > 0.01 for 1%) or switch to an
absolute-count check (e.g., compare increase(...) or sum(...) over [1h] > N)
depending on preference, and ensure the summary/description reflect the chosen
threshold change.
- Around line 579-597: The alert CortexNovaCommittedResourceSyncerRepairRateHigh
risks division by zero and is overly sensitive; update its expr to guard the
denominator and make the threshold configurable: use the metric names
cortex_committed_resource_syncer_reservations_repaired_total and
cortex_committed_resource_syncer_commitments_processed_total and change the
expression to only evaluate when the processed rate > 0 (e.g. add "and
rate(cortex_committed_resource_syncer_commitments_processed_total{service=\"cortex-nova-metrics\"}[1h])
> 0" or use a bool operator) and replace the hard > 0 threshold with a
configurable small threshold (or a templated alert severity threshold) so the
alert only fires for a meaningful repair rate instead of any single repair.
In `@internal/scheduling/reservations/commitments/reservation_manager.go`:
- Around line 66-68: The function initializes a local variable result of type
*ApplyResult and currently returns inconsistent values on error; update the
error returns to be consistent by returning (nil, err) on all error paths
(including the early error branches around where result is first created and the
later ones at lines like 125 and 163), so callers (api_change_commitments.go and
syncer.go) that only test err != nil will see a nil result on failure; ensure
every return that currently returns a non-nil result with an error instead
returns nil for the *ApplyResult and the same error.
In `@internal/scheduling/reservations/commitments/syncer_monitor.go`:
- Around line 112-115: RecordUnitMismatch on type SyncerMonitor accepts an
unused string parameter; either remove the parameter or document why it exists.
Fix option A: remove the parameter from SyncerMonitor.RecordUnitMismatch and
update all call sites (e.g., the call in syncer.go that currently passes
flavorGroupName) to call RecordUnitMismatch() with no args. Fix option B: keep
the signature for compatibility and add a brief comment on the
RecordUnitMismatch method explaining the parameter is intentionally ignored
(e.g., retained for backward compatibility when callers passed flavor group) so
future readers know it’s deliberate.
In `@internal/scheduling/reservations/commitments/syncer.go`:
- Around line 295-304: The current conditional guards around
s.monitor.RecordReservationsCreated, RecordReservationsDeleted, and
RecordReservationsRepaired prevent metrics from being emitted when counts are
zero; change the logic to call these monitor methods unconditionally (still
guarded by s.monitor != nil) so that each sync run records totals even when 0,
i.e., remove the > 0 checks and always invoke
s.monitor.RecordReservationsCreated(totalCreated),
s.monitor.RecordReservationsDeleted(totalDeleted), and
s.monitor.RecordReservationsRepaired(totalRepaired) when s.monitor is non-nil.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: bb3ceac4-83d8-468a-829d-66eb7c73d9f4
📒 Files selected for processing (6)
helm/bundles/cortex-nova/alerts/nova.alerts.yamlinternal/scheduling/reservations/commitments/api_change_commitments.gointernal/scheduling/reservations/commitments/reservation_manager.gointernal/scheduling/reservations/commitments/reservation_manager_test.gointernal/scheduling/reservations/commitments/syncer.gointernal/scheduling/reservations/commitments/syncer_monitor.go
No description provided.