We have been using Calico very successfully for 5+ years. However, in the last couple of weeks we started to notice that some of our Kubernetes pods were getting network failures: they would effectively be offline for ingress traffic for 2-3 minutes, then start working for 1-2 minutes, then go down again.
After a lot of troubleshooting we found that the tigera-operator was continuously spamming:
"panic\":\"runtime error: invalid memory address or nil pointer dereference\"
However, immediately after that message there is another error:
"error\":\"panic: runtime error: invalid memory address or nil pointer dereference [recovered]\"
We aren't sure whether this indicates there is no real issue, but if so, why does it continuously spam these messages until we restart the operator? After a restart it seems to work fine for 1-3 days and then starts panicking again.
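For reference, this is roughly how we spot the spam. The filter pattern is shown below against a sample log line; the commented command at the end is what we run in-cluster, assuming the default install (deployment `tigera-operator` in the `tigera-operator` namespace — adjust if yours differs):

```shell
# Pattern used to count the recovered panics in the operator logs.
pattern='invalid memory address or nil pointer dereference'

# Demo against a sample log line in the same shape as the spam:
printf '{"error":"panic: runtime error: %s [recovered]"}\n' "$pattern" \
  | grep -c "$pattern"
# → 1

# In-cluster equivalent (default tigera-operator install assumed):
#   kubectl logs -n tigera-operator deployment/tigera-operator --since=1h \
#     | grep -c "$pattern"
```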
What we did find is that if we restart our impacted pods, they continue to experience network issues. However, if we FIRST restart the tigera-operator, all the panic messages stop, and if we then restart the impacted pods they start working normally until the panics start again.
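The workaround sequence above, as a sketch. The operator namespace/deployment names assume the default install, and `my-namespace`/`my-app` are placeholders for whichever workload is impacted:

```shell
# 1. Restart the operator FIRST and wait for it to become ready.
kubectl rollout restart -n tigera-operator deployment/tigera-operator
kubectl rollout status  -n tigera-operator deployment/tigera-operator --timeout=120s

# 2. Only then restart the impacted workloads (placeholder names).
kubectl rollout restart -n my-namespace deployment/my-app
kubectl rollout status  -n my-namespace deployment/my-app --timeout=120s
```

Doing it in the other order (pods first) does not help, as noted above.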
At this point we are pretty stumped. We upgraded to the latest versions:
- image: quay.io/tigera/operator:v1.40.8
- image: quay.io/calico/apiserver:v3.31.5
- image: quay.io/calico/kube-controllers:v3.31.5
- image: quay.io/calico/node:v3.31.5
- image: quay.io/calico/pod2daemon-flexvol:v3.31.5
- image: quay.io/calico/typha:v3.31.5
- AWS EKS 1.35.0 / BottleRocket 1.57.0 / containerd 2.1.6+bottlerocket
- | vpc-cni | ACTIVE | v1.21.1-eksbuild.7 |
- | coredns | ACTIVE | v1.13.2-eksbuild.4 |
- | kube-proxy | ACTIVE | v1.35.3-eksbuild.2 |
- | aws-ebs-csi-driver | ACTIVE | v1.58.0-eksbuild.1 |
I did see another post from Oct 2025 that looks very similar to the panic I'm getting, but that ticket doesn't appear to have reached a resolution.
At this point this is having a serious impact on our clusters. We have 6 separate EKS clusters, all on similar versions, and 4 out of 5 are showing the issue, and the same restart-operator-then-restart-pods sequence seems to fix things for a few days. We are pretty desperate to find a solution, so we'd appreciate any help you can offer.
One other oddity: our production cluster has never experienced the same network failures to date, even though all clusters run the same versions. And yet I can see this same panic error in Prod's logs going back to Jan 2026, which makes me second-guess the cause; the restarts do appear to solve it, though.
Thanks for any help.
operator-panic-errors.txt