Add troubleshooting guide for non-cluster hosts and VMs setup #2350
base: main
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,111 @@ | ||
| --- | ||
| description: Troubleshoot non-cluster hosts and VMs setup | ||
| --- | ||
|
|
||
| # Troubleshoot non-cluster hosts and VMs setup | ||
|
|
||
| This document provides guidance for troubleshooting Calico running on hosts and VMs outside of a cluster. | ||
|
Collaborator
Can you be more specific here? Right now, it's just 'troubleshooting', but most of the sections deal with certificates. The title and sidebar location would suggest it's troubleshooting related to installation. Better to provide details here so readers know whether this is for them.
||
|
|
||
| ## Useful commands | ||
|
Collaborator
Perhaps better to group this commands section here: Troubleshooting commands?
||
|
|
||
| These commands can help you collect logs and monitor system activities during troubleshooting. | ||
|
|
||
| ### On non-cluster hosts or VMs | ||
|
|
||
| ```bash | ||
| journalctl -xue calico-node.service -f | ||
| journalctl -xue calico-fluent-bit.service -f | ||
| ``` | ||
|
|
||
| ### On the cluster side | ||
|
|
||
| ```bash | ||
| kubectl logs -n calico-system -l k8s-app=calico-typha-noncluster-host | ||
| kubectl logs -n tigera-manager -l k8s-app=tigera-manager -c tigera-voltron | ||
| ``` | ||
|
|
||
| You can monitor CertificateSigningRequests (CSRs) by running: | ||
|
|
||
| ```bash | ||
| kubectl get certificatesigningrequest -w | ||
| ``` | ||
|
|
||
| Monitoring CSRs is useful for debugging certificates used for Calico Node and Typha mutual TLS (mTLS) communication. The automatic CSR approval and signing flow can fail in several ways. For example: | ||
|
Check failure on line 33 in calico-enterprise/getting-started/bare-metal/troubleshoot.mdx
|
||
|
|
||
| - The CSR might not be created or submitted correctly. | ||
| - The Tigera Operator CSR controller might not process it. | ||
| - The Tigera Operator signer might reject the request due to invalid fields or missing permissions. | ||
|
|
||
| When such a failure occurs, the CSR status object contains detailed conditions and error messages that help identify the root cause. | ||
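The conditions and error messages stored in the CSR status can be read directly with `kubectl`. The following is a minimal sketch, assuming `kubectl` access to the cluster; the helper name `csr_conditions` is hypothetical and not part of any Calico tooling.

```bash
# Hypothetical helper: print the type, reason, and message of each status
# condition recorded on a CertificateSigningRequest.
csr_conditions() {
  local csr_name="$1"
  kubectl get certificatesigningrequest "$csr_name" \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.reason}{"\t"}{.message}{"\n"}{end}'
}

# Example usage (take the name from `kubectl get certificatesigningrequest`):
# csr_conditions <csr-name>
```

`kubectl describe certificatesigningrequest <csr-name>` shows the same conditions in a longer, human-readable form.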
|
|
||
| ## Common problems | ||
|
|
||
| ### No internet connection after installing the Calico Node package | ||
|
Collaborator
"after installing the Calico Node package" Does that mean after installing Calico? Or something different?
||
|
|
||
| By default, $[prodname] blocks all traffic to and from host interfaces. You can use a profile with host endpoints to modify the default behavior. Apply the built-in profile `projectcalico-default-allow`, which allows all ingress and egress traffic. Host endpoints that use this profile have *allow-all* behavior instead of *deny-all* when no network policy is applied. | ||
hjiawei marked this conversation as resolved.
Collaborator
This feels less like a troubleshooting step and more like expected behavior. If this is a default status, and most users need to enable this connection, then perhaps it would be better placed with the installation guide as an important post-installation step. What do you think?
||
|
|
||
| Example `HostEndpoint` with the `projectcalico-default-allow` profile: | ||
|
|
||
| ```yaml | ||
| apiVersion: projectcalico.org/v3 | ||
| kind: HostEndpoint | ||
| metadata: | ||
| name: <endpoint-name> | ||
| spec: | ||
| interfaceName: <interface-name> | ||
| node: <node-hostname> | ||
| expectedIPs: ["<list-of-expected-ips>"] | ||
| profiles: | ||
| - projectcalico-default-allow | ||
| ``` | ||
|
|
||
| ### Certificate signed by unknown authority | ||
|
Collaborator
It's good for troubleshooting sections to lead with the symptom, rather than the cause. What behavior will a user experience if the cert is signed by an unknown authority?
||
|
|
||
| If the certificate presented by the Kubernetes API server or Tigera Manager endpoint is not signed by a trusted Certificate Authority (CA), add the correct CA certificate to the system trust store. Alternatively, for the Calico fluent-bit log forwarder, you can temporarily disable TLS verification by setting: | ||
|
|
||
| ```conf | ||
| [OUTPUT] | ||
| ... | ||
| tls.verify Off | ||
| ... | ||
| ``` | ||
|
|
||
| in the configuration file `/etc/calico/calico-fluent-bit/calico-fluent-bit.conf`. | ||
|
|
||
| :::note | ||
|
|
||
| Disable TLS verification only for testing or troubleshooting. | ||
|
|
||
| ::: | ||
|
Comment on lines +75 to +79
Collaborator
Makes sense, but it would be helpful to state explicitly why we turned off TLS in the preceding section. What's the next step for troubleshooting after disabling TLS here? Do we do this only if the correct CA cert didn't work?
||
|
|
||
| ### No object can be associated with CSR error | ||
|
|
||
| If a CSR is denied with the following error: | ||
|
|
||
| ```text | ||
| invalid: no object can be associated with CSR node-certs-noncluster-host:<hostname> | ||
| ``` | ||
|
|
||
| verify the following: | ||
|
|
||
| * A corresponding host endpoint resource exists for the non-cluster host or VM. | ||
| * The `spec.node` field in the host endpoint resource matches the non-cluster host name exactly. | ||
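Both checks above can be combined into one lookup. The following is a minimal sketch, assuming the `projectcalico.org/v3` API is reachable through `kubectl`; the helper name `hostendpoints_for_node` is hypothetical.

```bash
# Hypothetical helper: list the names of HostEndpoint resources whose
# spec.node equals the given host name. The comparison is an exact,
# case-sensitive string match.
hostendpoints_for_node() {
  local node_name="$1"
  kubectl get hostendpoints.projectcalico.org \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.node}{"\n"}{end}' |
    awk -F '\t' -v node="$node_name" '($2 "") == (node "") { print $1 }'
}

# Example usage, passing the host name shown in the denied CSR:
# hostendpoints_for_node <hostname>
```

Empty output means that no host endpoint has a `spec.node` value equal to the given name, which is consistent with the error shown above.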
|
|
||
| ### Peer certificate does not have required CN | ||
|
Collaborator
Symptom?
||
|
|
||
| If the non-cluster host fails to connect to the dedicated Typha deployment, check that the certificate Common Name (CN) values are consistent on both sides. | ||
|
|
||
| On the non-cluster host or VM under the `/etc/calico/calico-node` folder: | ||
|
|
||
| * In `calico-node.conf`, verify the `TyphaCN` value matches the remote Typha server certificate CN, or | ||
| * In `calico-node.env`, verify the `FELIX_TYPHACN` value matches the remote Typha server certificate CN. | ||
|
|
||
| On the cluster side (`calico-system/calico-typha-noncluster-host` deployment): | ||
|
|
||
| * The `TYPHA_CLIENTCN` environment variable must match the CN used in the non-cluster node certificate. | ||
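To compare the CN values on both sides, you can print the subject of the node certificate and the value configured on the Typha deployment. The following is a minimal sketch, assuming `openssl` is installed on the host and `kubectl` access on the cluster side; the helper names are hypothetical, and the certificate path is the one named elsewhere in this guide.

```bash
# Certificate path as named in this guide; adjust if your installation differs.
NODE_CERT="/etc/calico/calico-node/calico-node.crt"

# Hypothetical helper: print the subject (which includes the CN) of a certificate file.
cert_subject() {
  openssl x509 -in "$1" -noout -subject
}

# Hypothetical helper: print the TYPHA_CLIENTCN value set directly on the
# dedicated Typha deployment. Prints nothing if the variable is not set
# as a literal value.
typha_client_cn() {
  kubectl get deployment calico-typha-noncluster-host -n calico-system \
    -o jsonpath='{.spec.template.spec.containers[*].env[?(@.name=="TYPHA_CLIENTCN")].value}'
}

# Example usage:
# cert_subject "$NODE_CERT"    # on the non-cluster host or VM
# typha_client_cn              # on the cluster side
```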
|
|
||
| ### Certificate is not renewed or updated | ||
|
Collaborator
Symptom?
||
|
|
||
| The `calico-noncluster-host-init` process, which runs before the main `calico-node` service, is responsible for renewing certificates that are expired or near expiry. Certificates are renewed automatically within 90 days of expiry. | ||
|
Collaborator
This seems to say that the noncluster host may be created, but the main calico-node service isn't going to renew certificates until near the expiry. Is that right? That doing the certs isn't part of the noncluster host connection process? If that's the case, should we not have the certificate renewal as part of the main installation process, or post-install section?
||
|
|
||
| If you need to force immediate renewal, manually delete the existing certificate (`calico-node.crt`) and private key (`calico-node.key`) under the `/etc/calico/calico-node` folder and restart the service. | ||
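The manual renewal steps can be sketched as follows. This is a minimal sketch, not a supported procedure: it assumes `openssl` and `systemctl` are available, that the commands run as root, and that the service to restart is `calico-node.service`. The helper names are hypothetical.

```bash
CERT_DIR="/etc/calico/calico-node"   # folder named in this guide
RENEWAL_WINDOW_DAYS=90               # renewal window stated in this guide

# Convert a number of days to seconds for `openssl x509 -checkend`.
days_to_seconds() {
  echo $(( $1 * 86400 ))
}

# Hypothetical helper: succeeds only if the certificate stays valid beyond
# the renewal window. `openssl x509 -checkend N` exits non-zero when the
# certificate expires within N seconds.
cert_valid_beyond_window() {
  openssl x509 -in "$CERT_DIR/calico-node.crt" -noout \
    -checkend "$(days_to_seconds "$RENEWAL_WINDOW_DAYS")" >/dev/null
}

# Hypothetical helper: remove the certificate and private key, then restart
# the service (assumed to be calico-node.service) to trigger renewal.
force_cert_renewal() {
  rm -f "$CERT_DIR/calico-node.crt" "$CERT_DIR/calico-node.key"
  systemctl restart calico-node.service
}

# Example usage (run as root on the non-cluster host or VM):
# openssl x509 -in "$CERT_DIR/calico-node.crt" -noout -enddate
# force_cert_renewal
```

Printing the expiry date first with `-enddate` helps confirm whether the certificate is actually outside the automatic renewal window before you delete anything.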
The article is located in the installation section. Is this the right place?
I wonder whether it's better placed in the troubleshooting section: https://docs.tigera.io/calico-enterprise/latest/operations/troubleshoot/.