A copy‑paste runbook that avoids the issues you hit (stale IPs in manifests, broken kube-apiserver.yaml, etcd pointing to old LAN IP, DiskPressure evictions, Calico API timeouts).
API DNS name: fm.vpn → always resolves to the master’s current IP (e.g., WireGuard).
K8s: v1.29.x • CRI: containerd • CNI: Calico
Service CIDR: 10.96.0.0/12 • Pod CIDR: 192.168.0.0/16
- Prerequisites
- Node Roles & Notation
- Quickstart (TL;DR)
- Prepare All Nodes
- Initialize the Master
- Fixes You Must Apply (Prevents Prior Failures)
- Install Calico
- Join Workers
- Post-Install Checks
- Adding More Workers Later
- Troubleshooting (Common Symptoms → Fix)
- Appendix A: kubeadm-init.yaml (good defaults)
- Appendix B: Good kube-apiserver probes & volumes
- Appendix C: Etcd static pod settings (no stale IPs)
- Ubuntu 22.04+ on all nodes.
- Passwordless sudo or root access.
- fm.vpn DNS/hosts record that points to the master’s current IP (WireGuard recommended).
- Outbound internet for image pulls (or a mirror/registry configured).
- [MASTER] = run on control plane node only.
- [WORKER] = run on worker node(s) only.
- [ALL NODES] = run on every node (master + all workers).
Tip: copy blocks exactly; comments show intent. Don’t prefix variable assignments with sudo (it won’t set the variable in your shell).
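A minimal sketch of why this tip matters (the IP value is illustrative): an assignment prefixed with sudo runs in sudo’s short-lived child process and never reaches your shell.

```shell
# Wrong: `sudo WG_IP=...` sets the variable only inside sudo's child
# process. Right: assign in your own shell, then use it.
WG_IP=10.8.0.25              # illustrative value; persists in this shell
echo "advertise-address=${WG_IP}"
```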
- Do Prepare All Nodes on every node.
- On MASTER: Initialize the Master (uses fm.vpn, sets sane eviction thresholds).
- Apply Fixes You Must Apply (prevents your previous breakages).
- Install Calico.
- On each WORKER: Join Workers with the kubeadm join command.
- Run Post-Install Checks.
# [ALL NODES] kernel & sysctl
cat <<'EOF' | sudo tee /etc/modules-load.d/k8s.conf
br_netfilter
overlay
EOF
sudo modprobe br_netfilter overlay
cat <<'EOF' | sudo tee /etc/sysctl.d/99-k8s.conf
net.ipv4.ip_forward=1
net.bridge.bridge-nf-call-iptables=1
net.bridge.bridge-nf-call-ip6tables=1
vm.overcommit_memory=1
EOF
sudo sysctl --system
# [ALL NODES] disable swap
sudo swapoff -a
sudo sed -i '/ swap / s/^\(.*\)$/#\1/' /etc/fstab
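The sed above comments out every fstab line containing ` swap `. A dry run on a sample line (hypothetical swapfile entry) shows the effect without touching /etc/fstab:

```shell
# Dry run of the fstab edit; the real command rewrites the file
# in place via -i.
echo '/swap.img none swap sw 0 0' | sed '/ swap / s/^\(.*\)$/#\1/'
# → #/swap.img none swap sw 0 0
```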
# [ALL NODES] make fm.vpn resolve to the master
echo "10.8.0.25 fm.vpn" | sudo tee -a /etc/hosts   # replace 10.8.0.25 with your master's current IP
getent hosts fm.vpn
# [ALL NODES] containerd + kube packages
sudo apt-get update
sudo apt-get install -y containerd apt-transport-https curl gpg
sudo mkdir -p /etc/containerd
containerd config default | sudo tee /etc/containerd/config.toml >/dev/null
sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
sudo systemctl enable --now containerd
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.29/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-1-29.gpg
echo "deb [signed-by=/etc/apt/keyrings/kubernetes-1-29.gpg] https://pkgs.k8s.io/core:/stable:/v1.29/deb/ /" | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt-get update
sudo apt-get install -y kubelet kubeadm kubectl
sudo apt-mark hold kubelet kubeadm kubectl

# [MASTER] your WireGuard (or master) IP (no sudo here)
WG_IP=$(ip -4 addr show wg0 | awk '/inet /{print $2}' | cut -d/ -f1)
echo "$WG_IP" # expected: 10.8.0.25 or your master’s VPN IP

Create config (or use examples/kubeadm-init.yaml and replace <<MASTER_WG_IP>>):
# [MASTER]
# Note: the EOF delimiter must be UNQUOTED so ${WG_IP} expands into the file
cat > kubeadm-init.yaml <<EOF
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
clusterName: fm-cluster
kubernetesVersion: v1.29.6
controlPlaneEndpoint: "fm.vpn:6443"
networking:
  serviceSubnet: 10.96.0.0/12
  podSubnet: 192.168.0.0/16
apiServer:
  certSANs:
    - fm.vpn
  extraArgs:
    advertise-address: "${WG_IP}"
    bind-address: "0.0.0.0"
    etcd-servers: "https://127.0.0.1:2379"
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: systemd
evictionHard:
  "imagefs.available": "5%"
  "nodefs.available": "5%"
  "nodefs.inodesFree": "5%"
evictionMinimumReclaim:
  "imagefs.available": "2Gi"
  "nodefs.available": "2Gi"
EOF
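A quick self-contained reproduction of the heredoc pitfall: a quoted delimiter (<<'EOF') writes the literal string ${WG_IP} into the file, while an unquoted one expands it. File path and IP here are illustrative.

```shell
# With an unquoted delimiter, the variable expands into the file.
WG_IP=10.8.0.25
cat > /tmp/expanded.txt <<EOF
advertise-address: "${WG_IP}"
EOF
cat /tmp/expanded.txt   # the literal IP, not ${WG_IP}
```

If your generated kubeadm-init.yaml still contains the string `${WG_IP}`, the delimiter was quoted.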
sudo kubeadm init --config kubeadm-init.yaml
# kubectl for the admin user
mkdir -p $HOME/.kube
sudo cp -f /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown "$(id -u)":"$(id -g)" $HOME/.kube/config

# [MASTER]
for f in /etc/kubernetes/{admin.conf,controller-manager.conf,scheduler.conf}; do
sudo sed -i 's#server: https://[^:]*:6443#server: https://fm.vpn:6443#g' "$f"
done

Edit /etc/kubernetes/manifests/kube-apiserver.yaml:
- Keep one --advertise-address=<<MASTER_WG_IP>> (the literal IP, e.g. 10.8.0.25) and --bind-address=0.0.0.0.
- Use 127.0.0.1 as the liveness/readiness probe host.
- Ensure volumes indentation is correct (see Appendix B).
Save the file — kubelet will restart the static pod.
Edit /etc/kubernetes/manifests/etcd.yaml and replace any 192.168.1.67 with 127.0.0.1 and/or your WG_IP as shown in Appendix C.
Save — kubelet restarts etcd.
# [MASTER]
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.27.3/manifests/calico.yaml

On MASTER, get a fresh join command:
# [MASTER]
kubeadm token create --print-join-command

On each WORKER:
# [WORKER]
getent hosts fm.vpn
sudo kubeadm join fm.vpn:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>

# [MASTER]
curl -sk https://fm.vpn:6443/healthz && echo
kubectl get nodes -o wide
kubectl -n kube-system get pods -owide
kubectl -n calico-system get pods -owide
# test scheduling to a worker
kubectl run test-pod --image=nginx --restart=Never --image-pull-policy=IfNotPresent
kubectl wait --for=condition=Ready pod/test-pod --timeout=90s
kubectl delete pod test-pod

Note: the master is tainted NoSchedule by default. If you want workloads on the master (not recommended):
kubectl taint nodes master node-role.kubernetes.io/control-plane:NoSchedule-
Repeat Prepare All Nodes on the new worker, ensure fm.vpn resolves, then run a fresh kubeadm join from the master.
A) kubectl ... connection refused to fm.vpn:6443
- kube-apiserver.yaml malformed or probes point to the old IP → fix per Appendix B.
- etcd.yaml still uses the old IP → fix per Appendix C.
- Control-plane kubeconfigs use the old address → re-run the kubeconfig sed loop above.
B) Pods Evicted: ephemeral-storage / Node DiskPressure=True
# [AFFECTED NODE]
sudo crictl rmi --prune   # remove unused container images
sudo journalctl --vacuum-time=3d
sudo rm -rf /var/log/*-???????? /var/log/*.gz 2>/dev/null || true
# then verify:
kubectl describe node <node> | sed -n '/Conditions:/,$p'
# if taint persists after pressure clears:
kubectl taint node <node> node.kubernetes.io/disk-pressure:NoSchedule- || true

C) Calico errors to https://10.96.0.1:443
That’s the in-cluster API Service. Fix apiserver/etcd first. Then:
kubectl -n kube-system get pods -owide | egrep 'kube-proxy|coredns'

D) Master IP changed later
- Update DNS/hosts so fm.vpn → new IP on all nodes.
- Ensure the apiserver cert has SAN fm.vpn:
fm.vpn:openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -text | grep -A1 "Subject Alternative Name"
- If missing, regenerate the cert with the updated SANs (renewal alone reuses the old SANs):
sudo mv /etc/kubernetes/pki/apiserver.{crt,key} /tmp/
sudo kubeadm init phase certs apiserver --config /root/kubeadm-init.yaml
# restart the apiserver container so it picks up the new cert
sudo crictl ps --name kube-apiserver -q | xargs -r sudo crictl stop
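For a self-contained demo of the SAN check itself: generate a throwaway certificate with fm.vpn as a SAN (paths under /tmp are illustrative) and inspect it the same way as above. Requires OpenSSL 1.1.1+ for -addext.

```shell
# Create a throwaway self-signed cert carrying fm.vpn as a SAN ...
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/k.pem -out /tmp/c.pem \
  -days 1 -subj "/CN=kube-apiserver" -addext "subjectAltName=DNS:fm.vpn"
# ... then inspect it; expect a line containing DNS:fm.vpn
openssl x509 -in /tmp/c.pem -noout -text | grep -A1 "Subject Alternative Name"
```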
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
clusterName: fm-cluster
kubernetesVersion: v1.29.6
controlPlaneEndpoint: "fm.vpn:6443"
networking:
  serviceSubnet: 10.96.0.0/12
  podSubnet: 192.168.0.0/16
apiServer:
  certSANs:
    - fm.vpn
  extraArgs:
    advertise-address: "<<MASTER_WG_IP>>"
    bind-address: "0.0.0.0"
    etcd-servers: "https://127.0.0.1:2379"
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: systemd
evictionHard:
  "imagefs.available": "5%"
  "nodefs.available": "5%"
  "nodefs.inodesFree": "5%"
evictionMinimumReclaim:
  "imagefs.available": "2Gi"
  "nodefs.available": "2Gi"

Probes (inside containers[0])
livenessProbe:
  httpGet:
    host: 127.0.0.1
    path: /livez
    port: 6443
    scheme: HTTPS
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 15
  failureThreshold: 8
readinessProbe:
  httpGet:
    host: 127.0.0.1
    path: /readyz
    port: 6443
    scheme: HTTPS
  periodSeconds: 1
  failureThreshold: 3

Volumes (at spec.volumes)
volumes:
- name: ca-certs
  hostPath:
    path: /etc/ssl/certs
    type: DirectoryOrCreate
- name: etc-ca-certificates
  hostPath:
    path: /etc/ca-certificates
- name: etc-pki
  hostPath:
    path: /etc/pki
- name: k8s-certs
  hostPath:
    path: /etc/kubernetes/pki
- name: usr-local-share-ca-certificates
  hostPath:
    path: /usr/local/share/ca-certificates
- name: usr-share-ca-certificates
  hostPath:
    path: /usr/share/ca-certificates

Edit /etc/kubernetes/manifests/etcd.yaml and ensure:
# Clients: always keep localhost; add WG IP only if you truly need remote client access
- --listen-client-urls=https://127.0.0.1:2379,https://<<MASTER_WG_IP>>:2379
- --advertise-client-urls=https://127.0.0.1:2379
# Peers: for single-node etcd, these may still reference the WG IP but must NOT reference old LAN IPs
- --listen-peer-urls=https://<<MASTER_WG_IP>>:2380
- --initial-advertise-peer-urls=https://<<MASTER_WG_IP>>:2380
- --initial-cluster=master=https://<<MASTER_WG_IP>>:2380
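To avoid typos when filling in <<MASTER_WG_IP>>, you can render the peer-URL flag lines from the shell before pasting them into etcd.yaml (the IP below is illustrative):

```shell
# Render the etcd peer flags for your WG IP; paste the output into
# /etc/kubernetes/manifests/etcd.yaml in place of the placeholders.
WG_IP=10.8.0.25
printf -- '- --listen-peer-urls=https://%s:2380\n' "$WG_IP"
printf -- '- --initial-advertise-peer-urls=https://%s:2380\n' "$WG_IP"
printf -- '- --initial-cluster=master=https://%s:2380\n' "$WG_IP"
```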