Improve status manager pod diagnostics #4645

Open
caseydavenport wants to merge 13 commits into tigera:master from caseydavenport:casey-status-diagnostics


@caseydavenport

Description

This refactors the status manager's pod health checking to give users more actionable information when things go wrong. Previously, the status manager would only report the first pod error it found, silently skip workloads it couldn't find, and had no way to tell you whether a failing pod was from the current rollout or a previous one.

Changes:

  • Report all pod issues, deduplicated and capped. Instead of stopping at the first broken pod, the status manager now collects all issues, deduplicates them by root cause, and reports up to 3 distinct reasons per workload (with a count of how many pods share each issue).
  • Detect not-found workloads. If the status manager is told to watch a DaemonSet/Deployment/StatefulSet/CronJob that doesn't exist, it now reports that as degraded instead of silently skipping it.
  • Surface pending/unschedulable pod reasons. Pods stuck in Pending now show the scheduler's reason (e.g., "0/3 nodes are available: 3 Insufficient memory") instead of just a generic "not yet scheduled" message at the workload level.
  • Distinguish old vs new revision pods during rollouts. During a rolling update, failures in new-revision pods are prioritized and old-revision pod failures are annotated with "(old revision)" so you can tell whether the new version is the problem or old pods are just winding down.
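To illustrate the reporting behavior described in the list above, here is a minimal sketch of deduplicating pod issues by reason, putting current-revision failures first, and capping the output at 3 reasons. This is not the code from this PR: the `podIssue` type, the `maxReasons` constant, the output format, and this standalone `summarizeIssues` signature are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// podIssue is a hypothetical record of one pod's problem.
type podIssue struct {
	Pod         string
	Reason      string
	OldRevision bool // true if the pod belongs to a previous rollout revision
}

// maxReasons is an assumed cap on distinct reasons reported per workload.
const maxReasons = 3

// summarizeIssues groups issues by (reason, revision), lists current-revision
// groups before old-revision ones, and returns at most maxReasons lines.
func summarizeIssues(issues []podIssue) []string {
	type key struct {
		reason string
		old    bool
	}
	type group struct {
		key   key
		count int
	}

	groups := map[key]*group{}
	var order []*group // groups in first-seen order
	for _, is := range issues {
		k := key{is.Reason, is.OldRevision}
		g, ok := groups[k]
		if !ok {
			g = &group{key: k}
			groups[k] = g
			order = append(order, g)
		}
		g.count++
	}

	// Stable sort so that first-seen order is kept within each revision.
	sort.SliceStable(order, func(i, j int) bool {
		return !order[i].key.old && order[j].key.old
	})
	if len(order) > maxReasons {
		order = order[:maxReasons]
	}

	out := make([]string, 0, len(order))
	for _, g := range order {
		suffix := ""
		if g.key.old {
			suffix = " (old revision)"
		}
		out = append(out, fmt.Sprintf("%s%s: %d pod(s)", g.key.reason, suffix, g.count))
	}
	return out
}

func main() {
	issues := []podIssue{
		{"node-a", "ImagePullBackOff", false},
		{"node-b", "CrashLoopBackOff", true},
		{"node-c", "ImagePullBackOff", false},
		{"node-d", "0/3 nodes are available: 3 Insufficient memory", false},
		{"node-e", "OOMKilled", true},
		{"node-f", "CrashLoopBackOff", true},
	}
	for _, line := range summarizeIssues(issues) {
		fmt.Println(line)
	}
	// Prints:
	// ImagePullBackOff: 2 pod(s)
	// 0/3 nodes are available: 3 Insufficient memory: 1 pod(s)
	// CrashLoopBackOff (old revision): 2 pod(s)
}
```

In this sketch the fourth distinct reason (`OOMKilled` on an old-revision pod) is dropped by the cap. How the real implementation orders reasons within a revision is not stated in this description, so first-seen order is used here as a placeholder.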

Builds on #4644, which added readiness probe detection and crash loop termination context.

None

…Issues

Wire diagnosePods and summarizeIssues into syncState, replacing the
old podsFailing/containerErrorMessage functions. Each workload type
now reports not-found as a degraded condition instead of silently
continuing. DaemonSets and Deployments pass revision info so
diagnosePods can distinguish old-revision pods from current ones.
