Improve status manager pod diagnostics by caseydavenport · Pull Request #4645 · tigera/operator

caseydavenport · 2026-04-07T20:37:03Z

Description

This refactors the status manager's pod health checking to give users more actionable information when things go wrong. Previously, the status manager would only report the first pod error it found, silently skip workloads it couldn't find, and had no way to tell you whether a failing pod was from the current rollout or a previous one.

Changes:

- Report all pod issues, deduplicated and capped. Instead of stopping at the first broken pod, the status manager now collects all issues, deduplicates them by root cause, and reports up to 3 distinct reasons per workload (with a count of how many pods share each issue).
- Detect not-found workloads. If the status manager is told to watch a DaemonSet/Deployment/StatefulSet/CronJob that doesn't exist, it now reports that as degraded instead of silently skipping it.
- Surface pending/unschedulable pod reasons. Pods stuck in Pending now show the scheduler's reason (e.g., "0/3 nodes are available: 3 Insufficient memory") instead of just a generic "not yet scheduled" message at the workload level.
- Distinguish old vs new revision pods during rollouts. During a rolling update, failures in new-revision pods are prioritized and old-revision pod failures are annotated with "(old revision)" so you can tell whether the new version is the problem or old pods are just winding down.

Builds on #4644, which added readiness probe detection and crash loop termination context.
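For context on the #4644 behavior this builds on, here is a hedged sketch of turning a container status into a message that carries crash-loop termination context and flags readiness failures. The `containerStatus` and `terminated` types are hypothetical, flattened stand-ins for fields of `corev1.ContainerStatus` (`State.Waiting.Reason`, `State.Running`, `Ready`, `LastTerminationState.Terminated`); the function name and message wording are illustrative only and are not taken from either PR.

```go
package main

import "fmt"

// Hypothetical, flattened stand-ins for a few corev1.ContainerStatus fields.
type terminated struct {
	Reason   string // e.g. "OOMKilled", "Error"
	ExitCode int32
}

type containerStatus struct {
	Name           string
	Ready          bool
	Running        bool        // stand-in for State.Running != nil
	WaitingReason  string      // stand-in for State.Waiting.Reason; "" if not waiting
	LastTerminated *terminated // stand-in for LastTerminationState.Terminated
}

// reasonCrashLoopBackOff is the waiting reason the kubelet reports for a crash-looping container.
const reasonCrashLoopBackOff = "CrashLoopBackOff"

// containerIssue describes what is wrong with a container, or returns "" if
// nothing in this simplified status looks wrong.
func containerIssue(cs containerStatus) string {
	switch {
	case cs.WaitingReason == reasonCrashLoopBackOff && cs.LastTerminated != nil:
		// Add why the previous run ended; the waiting reason alone says little.
		t := cs.LastTerminated
		return fmt.Sprintf("container %s is in %s (last exit: %s, code %d)",
			cs.Name, reasonCrashLoopBackOff, t.Reason, t.ExitCode)
	case cs.WaitingReason != "":
		return fmt.Sprintf("container %s is waiting: %s", cs.Name, cs.WaitingReason)
	case cs.Running && !cs.Ready:
		// Assumption: running but not ready usually means a failing readiness probe.
		return fmt.Sprintf("container %s is running but not ready (readiness probe failing)", cs.Name)
	}
	return ""
}
```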

Release note: None

…Issues

Wire diagnosePods and summarizeIssues into syncState, replacing the old podsFailing/containerErrorMessage functions. Each workload type now reports not-found as a degraded condition instead of silently continuing. DaemonSets and Deployments pass revision info so diagnosePods can distinguish old-revision pods from current ones.
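A diagnosePods-style check that separates old-revision pods from current ones, and surfaces the scheduler's reason for a Pending pod, might look like the sketch below. It relies on two real labels that Kubernetes controllers put on pods: `pod-template-hash` (Deployment pods, via their ReplicaSet) and `controller-revision-hash` (DaemonSet pods). Everything else is hypothetical: the flattened `pod` type, `isOldRevision`, `pendingReason`, and the assumption that the caller already knows the current revision's hash. The PR's actual revision helpers may work differently.

```go
package main

import "strings"

// Labels that Kubernetes controllers stamp on the pods they create.
const (
	deploymentHashLabel = "pod-template-hash"        // Deployment pods (set via their ReplicaSet)
	daemonSetHashLabel  = "controller-revision-hash" // DaemonSet pods
)

// pod is a hypothetical, flattened view of the corev1.Pod fields used here.
type pod struct {
	Labels map[string]string
	Phase  string // "Pending", "Running", ...
	// UnschedulableMessage holds the PodScheduled condition's message when that
	// condition is False, e.g. "0/3 nodes are available: 3 Insufficient memory.".
	UnschedulableMessage string
}

// isOldRevision reports whether p came from a revision other than currentHash.
// Pods missing the label are treated as current so their problems are never downplayed.
func isOldRevision(p pod, labelKey, currentHash string) bool {
	h, ok := p.Labels[labelKey]
	return ok && h != currentHash
}

// pendingReason returns a message for a Pending pod, preferring the scheduler's
// explanation, or "" if the pod is not Pending.
func pendingReason(p pod) string {
	if p.Phase != "Pending" {
		return ""
	}
	if msg := strings.TrimSpace(p.UnschedulableMessage); msg != "" {
		return "Pending: " + msg
	}
	return "Pending"
}
```

Treating unlabeled pods as current is a deliberately conservative choice: a pod the sketch cannot classify is reported at full priority rather than being marked as winding down.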


caseydavenport added 8 commits April 7, 2026 13:17

Cherry-pick status manager readiness and crash context improvements

4e4f0e5

Add podIssue types for structured pod diagnostics

c5b360a

Add summarizeIssues for dedup, prioritization, and capping

900420e

Add diagnosePods to replace podsFailing with structured diagnosis

7ed2656

Add revision detection helpers for Deployments and DaemonSets

2b12262

Add integration test for rollout-aware status reporting

528915d

Fix formatting

90f17e2

caseydavenport requested a review from a team as a code owner April 7, 2026 20:37

caseydavenport added release-note-not-required docs-not-required labels Apr 7, 2026

marvin-tigera added this to the v1.43.0 milestone Apr 7, 2026

caseydavenport mentioned this pull request Apr 7, 2026

Add TigeraStatus warnings for ignored resources and override correlation #4649

Open

Merge master into casey-status-diagnostics

a81cd02

caseydavenport commented Apr 7, 2026


caseydavenport added 4 commits April 7, 2026 15:51

Address review comments: add doc comments, define waiting reason constants, use metav1.GetControllerOf

ba775a1
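The review follow-up switched to `metav1.GetControllerOf` from `k8s.io/apimachinery/pkg/apis/meta/v1`, which returns the owner reference whose `Controller` field is true, or nil if there is none. The stdlib-only sketch below mirrors that lookup on a simplified `ownerReference` type so it runs without the Kubernetes modules, then uses it for a hypothetical check of whether a pod is controlled by a given ReplicaSet, as a Deployment revision check might. `controllerOf` and `ownedByReplicaSet` are illustrative names, not operator code.

```go
package main

// ownerReference is a hypothetical, simplified stand-in for metav1.OwnerReference.
type ownerReference struct {
	Kind       string
	Name       string
	Controller *bool // only one owner reference should have this set to true
}

// controllerOf mirrors the behavior of metav1.GetControllerOf: it returns a copy
// of the owner reference marked as the controller, or nil if there is none.
func controllerOf(refs []ownerReference) *ownerReference {
	for i := range refs {
		if refs[i].Controller != nil && *refs[i].Controller {
			r := refs[i]
			return &r
		}
	}
	return nil
}

// ownedByReplicaSet reports whether the controlling owner is the named ReplicaSet.
func ownedByReplicaSet(refs []ownerReference, rsName string) bool {
	c := controllerOf(refs)
	return c != nil && c.Kind == "ReplicaSet" && c.Name == rsName
}
```

In operator code you would call `metav1.GetControllerOf(pod)` directly rather than reimplementing it; the point of using it is that only the controller reference counts, not every owner.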

Add integration tests for TigeraStatus message content

be87fcd

Merge remote-tracking branch 'origin/master' into casey-status-diagnostics

e0d3bd4

Merge master, run gen-versions and gen-files

6c4e576

caseydavenport wants to merge 13 commits into tigera:master from caseydavenport:casey-status-diagnostics


2 participants
