Problem
We're exposing 5 Prometheus metrics from the controller, but only 1 of them (BootstrapCompleted) has test coverage. This makes it harder to catch regressions when metrics behavior changes.
Current Metrics Status
- ✅ BootstrapCompleted - Has tests
- ❌ RulesTotal - No tests
- ❌ TaintOperations - No tests
- ❌ Failures - No tests
- ❌ EvaluationDuration - No tests (and not even used in code yet)
What Needs to Be Done
1. Add tests for RulesTotal metric
This gauge tracks the total number of active rules. We should test:
- Value increases when a rule is added to the cache
- Value decreases when a rule is removed from the cache
2. Add tests for TaintOperations metric
This counter tracks taint add/remove operations. We should test:
- Counter increments when adding a taint (with labels: rule, operation="add")
- Counter increments when removing a taint (with labels: rule, operation="remove")
3. Add tests for Failures metric
This counter tracks operational failures. We should test:
- Counter increments on evaluation errors (label: reason="EvaluationError")
- Counter increments on taint operation failures (labels: reason="AddTaintError" or reason="RemoveTaintError")
4. Implement and test EvaluationDuration metric
This histogram is defined but never used. We should:
- Add instrumentation to the evaluateRuleForNode() function to record evaluation duration
- Add tests to verify the histogram records evaluations
Acceptance Criteria
make test