
[Bug]: Skip events emit "no remediation target supports this rule" rows the PRD said to filter #36

@cigan1

Summary

The activity-feed skip filter shipped in #35 passes through kind: "skip" events with reason "no remediation target supports this rule" whenever the workload is in config/remediation-targets.json but the specific rule is not listed in that target's supported_rules. The PRD for the feature (/tmp/co-activity-prd.md, "v1 must-do" item 1) called for that reason to be aggregated as a count in engine_status rather than emitted as a per-event row, because it is noise: the operator can't act on it without editing the targets file. In the live verification CronJob run, 6 of 14 skip events (43%) carried this reason.
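The PRD behavior described above (actionable skips as rows, non-remediable skips folded into one count) can be sketched as a simple partition. This is a minimal illustration, not the project's code: SkippedReason is a hypothetical stand-in for plan.SkippedReason, partitionSkips is a hypothetical helper, and the isRemediable predicate is assumed to behave like classifier.IsRemediable. The namespace and the in-list skip in main are synthetic.

```go
package main

import "fmt"

// SkippedReason is a hypothetical stand-in for plan.SkippedReason,
// reduced to the fields this sketch needs.
type SkippedReason struct {
	RuleID    string
	Namespace string
	Workload  string
	Reason    string
}

// partitionSkips (hypothetical) keeps skips the operator could act on as
// per-event rows and folds the rest into a single count, the value the PRD
// wants reported in engine_status.
func partitionSkips(skips []SkippedReason, isRemediable func(rule, ns, workload string) bool) ([]SkippedReason, int) {
	rows := make([]SkippedReason, 0, len(skips))
	notRemediable := 0
	for _, s := range skips {
		if isRemediable(s.RuleID, s.Namespace, s.Workload) {
			rows = append(rows, s)
		} else {
			notRemediable++
		}
	}
	return rows, notRemediable
}

func main() {
	// Supported rules taken from the nightlamp-api example in this issue.
	supported := map[string]bool{
		"cpu-hpa-low-request-sensitive":   true,
		"runtime-modernization-candidate": true,
	}
	// Assumed predicate: this toy version ignores namespace and workload.
	isRemediable := func(rule, ns, workload string) bool { return supported[rule] }

	const ns, wl = "demo", "Deployment/nightlamp-api" // "demo" is a hypothetical namespace
	skips := []SkippedReason{
		{RuleID: "cpu-request-over-provisioned", Namespace: ns, Workload: wl, Reason: "no remediation target supports this rule"},
		{RuleID: "memory-request-over-provisioned", Namespace: ns, Workload: wl, Reason: "no remediation target supports this rule"},
		{RuleID: "cpu-hpa-low-request-sensitive", Namespace: ns, Workload: wl, Reason: `confidence "low" below minimum "high"`},
	}

	rows, n := partitionSkips(skips, isRemediable)
	fmt.Println(len(rows), n) // prints: 1 2
}
```

The count is the only trace the two unsupported-rule skips leave, which matches the PRD's intent that the operator sees them in aggregate rather than row by row.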

Reproduction

```bash
# After the verification CronJob run that landed shortly after deploy:
curl -s 'http://127.0.0.1:18088/api/remediations/history?cluster_id=default&limit=50' \
  | jq '.events[] | select(.kind=="skip") | .reason' \
  | sort | uniq -c | sort -rn
# Expected: only "confidence ... below minimum ..." and similar actionable reasons.
# Actual:
#   8 "confidence \"low\" below minimum \"high\""
#   6 "no remediation target supports this rule"
```

The 6 noisy entries are workloads like Deployment/nightlamp-api (in the targets allowlist with supported_rules: [cpu-hpa-low-request-sensitive, runtime-modernization-candidate]) being skipped for cpu-request-over-provisioned and memory-request-over-provisioned. Neither rule is in supported_rules, so the planner correctly skips them, but per the PRD the operator shouldn't see those skips.
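For re-running the check above from a Go tool instead of a shell, a minimal sketch of the same tally follows. It assumes only what the jq filter relies on: the history response is a JSON object with an events array whose items have kind and reason fields. historyResponse, reasonCount and tallySkipReasons are hypothetical names, and the sample body in main is synthetic.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// historyResponse models only the fields the jq filter reads
// (.events[].kind and .events[].reason); any other fields are ignored.
type historyResponse struct {
	Events []struct {
		Kind   string `json:"kind"`
		Reason string `json:"reason"`
	} `json:"events"`
}

type reasonCount struct {
	Reason string
	Count  int
}

// tallySkipReasons counts reasons of kind=="skip" events, most frequent
// first (ties broken alphabetically), like `sort | uniq -c | sort -rn`.
func tallySkipReasons(body []byte) ([]reasonCount, error) {
	var resp historyResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	counts := map[string]int{}
	for _, e := range resp.Events {
		if e.Kind == "skip" {
			counts[e.Reason]++
		}
	}
	out := make([]reasonCount, 0, len(counts))
	for r, c := range counts {
		out = append(out, reasonCount{r, c})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Count != out[j].Count {
			return out[i].Count > out[j].Count
		}
		return out[i].Reason < out[j].Reason
	})
	return out, nil
}

func main() {
	// Synthetic sample; "other" is a placeholder for any non-skip kind.
	body := []byte(`{"events":[
		{"kind":"skip","reason":"no remediation target supports this rule"},
		{"kind":"skip","reason":"confidence \"low\" below minimum \"high\""},
		{"kind":"skip","reason":"confidence \"low\" below minimum \"high\""},
		{"kind":"other","reason":"ignored"}]}`)
	counts, err := tallySkipReasons(body)
	if err != nil {
		panic(err)
	}
	for _, rc := range counts {
		fmt.Printf("%4d %s\n", rc.Count, rc.Reason)
	}
}
```

Unlike the jq pipeline, which prints JSON-encoded strings, this prints the decoded reason text, so the inner quotes appear unescaped.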

Root cause

cmd/cluster-optimizer/main.go skipperEvents() filters only by cls.TargetFor(ns, workload): it accepts any skip for a workload that has a target row, regardless of whether the rule is in supported_rules. The PRD-intended filter is cls.IsRemediable(ruleID, ns, workload), the same check internal/plan/plan.go already uses to decide when to emit the "no remediation target supports this rule" skip.
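To make the difference between the two checks concrete, here is a simplified model. It is an assumption-laden sketch, not the real classifier package: Target, Classifier and NewClassifier are hypothetical stand-ins, TargetFor returning (Target, bool) is an assumed signature, and "demo" is a hypothetical namespace. The point it shows is that TargetFor only answers "is this workload in the allowlist?", while IsRemediable also requires the rule to be listed in supported_rules.

```go
package main

import "fmt"

// Target is a hypothetical model of one row in config/remediation-targets.json.
type Target struct {
	Namespace      string
	Workload       string
	SupportedRules []string
}

// Classifier is a simplified stand-in for classifier.Classifier; the real
// type's fields and lookup logic may differ.
type Classifier struct {
	targets map[string]Target // keyed by namespace + "/" + workload
}

func NewClassifier(targets ...Target) *Classifier {
	c := &Classifier{targets: make(map[string]Target, len(targets))}
	for _, t := range targets {
		c.targets[t.Namespace+"/"+t.Workload] = t
	}
	return c
}

// TargetFor reports only whether the workload has a target row.
func (c *Classifier) TargetFor(ns, workload string) (Target, bool) {
	t, ok := c.targets[ns+"/"+workload]
	return t, ok
}

// IsRemediable additionally requires ruleID to appear in SupportedRules.
func (c *Classifier) IsRemediable(ruleID, ns, workload string) bool {
	t, ok := c.TargetFor(ns, workload)
	if !ok {
		return false
	}
	for _, r := range t.SupportedRules {
		if r == ruleID {
			return true
		}
	}
	return false
}

func main() {
	cls := NewClassifier(Target{
		Namespace:      "demo", // hypothetical namespace
		Workload:       "Deployment/nightlamp-api",
		SupportedRules: []string{"cpu-hpa-low-request-sensitive", "runtime-modernization-candidate"},
	})

	_, hasTarget := cls.TargetFor("demo", "Deployment/nightlamp-api")
	fmt.Println(hasTarget) // true: the current filter lets the skip through
	fmt.Println(cls.IsRemediable("cpu-request-over-provisioned", "demo", "Deployment/nightlamp-api")) // false: the proposed filter drops it
}
```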

Fix

Swap the filter in skipperEvents:

```go
// cmd/cluster-optimizer/main.go
func skipperEvents(skipped []plan.SkippedReason, ts time.Time, cls *classifier.Classifier) []store.RemediationEvent {
    events := make([]store.RemediationEvent, 0, len(skipped))
    // Without a classifier no skip can be shown to be remediable, so emit none.
    if cls == nil {
        return events
    }
    for _, skip := range skipped {
        // Only surface skips for (workload, rule) pairs the operator could
        // have remediated had the other gates (confidence, persistence,
        // safe trim) passed. Skips whose target doesn't list the rule
        // belong in a per-run count (engine_status), not a per-event row.
        if !cls.IsRemediable(skip.RuleID, skip.Namespace, skip.Workload) {
            continue
        }
        // Event fields elided; their construction is unchanged by this fix.
        events = append(events, store.RemediationEvent{ /* ... */ })
    }
    return events
}
```

Impact

  • Severity: MEDIUM. The feature works, but the feed is noisier than designed: ~43% of skip events in the verification run were non-actionable.
  • No data loss, no crash, no incorrect remediation behavior.
  • The operator can ignore these rows, but on every run they push actionable skip rows further down the feed.

Recommended regression test

Add a Go unit test in cmd/cluster-optimizer/main_test.go (create the file if it doesn't exist) that constructs a classifier with one target (rules=[memory-request-below-usage]), passes synthetic SkippedReasons for that workload with both an in-list rule and an out-of-list rule, and asserts that only the in-list one survives skipperEvents().
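The assertion that test would make can be sketched as a standalone program. Everything here is a hypothetical stand-in so the example runs on its own: SkippedReason, RemediationEvent and Classifier replace plan.SkippedReason, store.RemediationEvent and classifier.Classifier, the ts parameter is dropped, and the "demo" namespace and "Deployment/example" workload are made up. In main_test.go the body of checkOnlyInListRuleSurvives would move into a func TestXxx(t *testing.T) that builds the real classifier and calls t.Fatalf on failure.

```go
package main

import "fmt"

// Hypothetical stand-ins, reduced to the fields this check needs.
type SkippedReason struct{ RuleID, Namespace, Workload, Reason string }
type RemediationEvent struct{ RuleID, Workload, Reason string }

type Classifier struct {
	supported map[string]map[string]bool // "ns/workload" -> set of supported rule IDs
}

// IsRemediable is false for unknown workloads: indexing a missing key yields
// a nil inner map, and reading from a nil map returns false.
func (c *Classifier) IsRemediable(ruleID, ns, workload string) bool {
	return c.supported[ns+"/"+workload][ruleID]
}

// skipperEvents mirrors the proposed fix: keep only remediable (rule, workload) pairs.
func skipperEvents(skipped []SkippedReason, cls *Classifier) []RemediationEvent {
	events := make([]RemediationEvent, 0, len(skipped))
	if cls == nil {
		return events
	}
	for _, s := range skipped {
		if !cls.IsRemediable(s.RuleID, s.Namespace, s.Workload) {
			continue
		}
		events = append(events, RemediationEvent{RuleID: s.RuleID, Workload: s.Workload, Reason: s.Reason})
	}
	return events
}

// checkOnlyInListRuleSurvives is the assertion the recommended test would make.
func checkOnlyInListRuleSurvives() error {
	cls := &Classifier{supported: map[string]map[string]bool{
		"demo/Deployment/example": {"memory-request-below-usage": true},
	}}
	skipped := []SkippedReason{
		{RuleID: "memory-request-below-usage", Namespace: "demo", Workload: "Deployment/example", Reason: `confidence "low" below minimum "high"`},
		{RuleID: "cpu-request-over-provisioned", Namespace: "demo", Workload: "Deployment/example", Reason: "no remediation target supports this rule"},
	}
	got := skipperEvents(skipped, cls)
	if len(got) != 1 || got[0].RuleID != "memory-request-below-usage" {
		return fmt.Errorf("want only memory-request-below-usage, got %+v", got)
	}
	return nil
}

func main() {
	if err := checkOnlyInListRuleSurvives(); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("ok")
}
```

Using one in-list and one out-of-list rule on the same workload isolates the bug: the current TargetFor-based filter would keep both events, while the IsRemediable-based filter keeps only the first.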

Environment

  • PR: Add in-dashboard halt toggle and richer activity feed #35 (merge commit 197e256)
  • Deployed image: ghcr.io/gipsychef/cluster-optimizer:197e25621fa8b8173424b1e34d4ad1cda1b68ba7
  • Verification job: cluster-optimizer/cluster-optimizer-deploy-verify (2026-05-24T02:29:26Z)
  • Discovered during post-deploy live-cluster QA per /Users/cigan/skills/qa-agent

🤖 Generated with Claude Code
