
operator - showing continuous panic errors related to memory and appears to be causing pods to lose network access until restarted #4703

@JohnPolansky

Description

We have been using Calico very successfully for 5+ years now. However, in the last couple of weeks we started to notice that some of our Kubernetes pods were experiencing network failures: they would effectively be offline for ingress traffic for 2-3 minutes, then start working for 1-2 minutes, then go down again.
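
A minimal sketch of how this flapping can be observed from outside the cluster (the endpoint below is a placeholder, not our actual service):

    # Probe the service every 5 seconds and log timestamped results to
    # capture the offline-for-minutes / online-for-minutes cycling.
    while true; do
      if curl -s -m 2 -o /dev/null https://my-service.example.com/healthz; then
        echo "$(date -u +%H:%M:%S) OK"
      else
        echo "$(date -u +%H:%M:%S) FAIL"
      fi
      sleep 5
    done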

After a lot of troubleshooting, we found that the tigera-operator was continuously spamming:
"panic\":\"runtime error: invalid memory address or nil pointer dereference\"

Immediately after that message there is another error:
"error\":\"panic: runtime error: invalid memory address or nil pointer dereference [recovered]\"

Since the panic is marked [recovered], we aren't sure whether this indicates a real problem, but if not, why does the operator continuously spam these messages until we restart it? After a restart it seems to work fine for 1-3 days and then starts panicking again.
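
For reference, a sketch of how the panic spam can be surfaced, assuming the default namespace and deployment name from an operator-based install (adjust if yours differ):

    # Count recovered panics in the operator log over the last hour.
    # "tigera-operator" is the default namespace/deployment for the operator.
    kubectl logs -n tigera-operator deployment/tigera-operator --since=1h \
      | grep -c "invalid memory address or nil pointer dereference"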

What we did find is that if we restart only the impacted pods, they continue to experience network issues. However, if we FIRST restart the tigera-operator, all the panic messages stop, and if we then restart the impacted pods they work normally until the panics start again.
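
Sketched as commands, the workaround sequence looks roughly like this (the app namespace and deployment names are placeholders for our impacted workloads):

    # 1. Restart the operator FIRST -- this is what stops the panic spam.
    kubectl rollout restart -n tigera-operator deployment/tigera-operator
    kubectl rollout status -n tigera-operator deployment/tigera-operator

    # 2. Only then restart the impacted workloads; restarting them before
    #    the operator leaves them with the same network issues.
    kubectl rollout restart -n my-app-namespace deployment/my-app
    kubectl rollout status -n my-app-namespace deployment/my-app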

At this point we are pretty stumped. We upgraded to the latest versions:

  • image: quay.io/tigera/operator:v1.40.8
  • image: quay.io/calico/apiserver:v3.31.5
  • image: quay.io/calico/kube-controllers:v3.31.5
  • image: quay.io/calico/node:v3.31.5
  • image: quay.io/calico/pod2daemon-flexvol:v3.31.5
  • image: quay.io/calico/typha:v3.31.5
  • AWS EKS 1.35.0 / BottleRocket 1.57.0 / containerd 2.1.6+bottlerocket
  • vpc-cni: v1.21.1-eksbuild.7 (ACTIVE)
  • coredns: v1.13.2-eksbuild.4 (ACTIVE)
  • kube-proxy: v1.35.3-eksbuild.2 (ACTIVE)
  • aws-ebs-csi-driver: v1.58.0-eksbuild.1 (ACTIVE)

I did see another issue from Oct 2025 that looks very similar to the panic I'm getting, but that ticket appears to have been closed without resolution.

At this point this is having a serious impact on our clusters. We have 6 separate EKS clusters, all on similar versions, and 4 out of 5 are showing the issue. The same restart-operator-then-restart-pods sequence seems to fix things for a few days each time. We are pretty desperate to find a solution, so we would appreciate any help you can offer.

One other oddity: our production cluster has never experienced the same network failures to date, even though all clusters run the same versions. And even though Prod never shows the symptoms, I can see this panic error in its logs going back to Jan 2026, which makes me second-guess the cause; however, the restarts do appear to solve it.

Thanks for any help.

operator-panic-errors.txt
