Kubernetes 1 Master + N Workers (containerd + Calico)

A copy‑paste runbook that avoids the issues you hit (stale IPs in manifests, broken kube-apiserver.yaml, etcd pointing to old LAN IP, DiskPressure evictions, Calico API timeouts).
API DNS name: fm.vpn → always resolves to the master’s current IP (e.g., WireGuard).
K8s: v1.29.x • CRI: containerd • CNI: Calico
Service CIDR: 10.96.0.0/12 • Pod CIDR: 192.168.0.0/16


Table of Contents

  1. Prerequisites
  2. Node Roles & Notation
  3. Quickstart (TL;DR)
  4. Prepare All Nodes
  5. Initialize the Master
  6. Fixes You Must Apply (Prevents Prior Failures)
  7. Install Calico
  8. Join Workers
  9. Post-Install Checks
  10. Adding More Workers Later
  11. Troubleshooting (Common Symptoms → Fix)
  12. Appendix A: kubeadm-init.yaml (good defaults)
  13. Appendix B: Good kube-apiserver probes & volumes
  14. Appendix C: Etcd static pod settings (no stale IPs)

Prerequisites

  • Ubuntu 22.04+ on all nodes.
  • Passwordless sudo or root access.
  • fm.vpn DNS/hosts record that points to the master’s current IP (WireGuard recommended).
  • Outbound internet for image pulls (or a mirror/registry configured).

Node Roles & Notation

  • [MASTER] = run on control plane node only.
  • [WORKER] = run on worker node(s) only.
  • [ALL NODES] = run on every node (master + all workers).

Tip: copy blocks exactly; comments show intent. Don’t prefix variable assignments with sudo (it won’t set the variable).
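A quick illustration of the sudo caveat (the IP is the example master IP from this runbook):

```shell
# Assign without sudo, then use sudo only on the commands that need root.
WG_IP=10.8.0.25      # right: plain assignment in the current shell
echo "$WG_IP"        # prints: 10.8.0.25
# sudo WG_IP=...     # wrong: sudo treats this as an env var for a (missing) command
```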


Quickstart (TL;DR)

  1. Do Prepare All Nodes on every node.
  2. On MASTER: Initialize the Master (uses fm.vpn, sets sane eviction thresholds).
  3. Apply Fixes You Must Apply (prevents your previous breakages).
  4. Install Calico.
  5. On each WORKER: Join Workers with the kubeadm join command.
  6. Run Post-Install Checks.

Prepare All Nodes

# [ALL NODES] kernel & sysctl
cat <<'EOF' | sudo tee /etc/modules-load.d/k8s.conf
br_netfilter
overlay
EOF
sudo modprobe br_netfilter overlay

cat <<'EOF' | sudo tee /etc/sysctl.d/99-k8s.conf
net.ipv4.ip_forward=1
net.bridge.bridge-nf-call-iptables=1
net.bridge.bridge-nf-call-ip6tables=1
vm.overcommit_memory=1
EOF
sudo sysctl --system

# [ALL NODES] disable swap
sudo swapoff -a
sudo sed -i '/ swap / s/^\(.*\)$/#\1/' /etc/fstab

# [ALL NODES] make fm.vpn resolve to the master
echo "10.8.0.25 fm.vpn" | sudo tee -a /etc/hosts
getent hosts fm.vpn

# [ALL NODES] containerd + kube packages
sudo apt-get update
sudo apt-get install -y containerd apt-transport-https curl gpg

sudo mkdir -p /etc/containerd
containerd config default | sudo tee /etc/containerd/config.toml >/dev/null
sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
sudo systemctl enable --now containerd

sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.29/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-1-29.gpg
echo "deb [signed-by=/etc/apt/keyrings/kubernetes-1-29.gpg] https://pkgs.k8s.io/core:/stable:/v1.29/deb/ /" | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt-get update
sudo apt-get install -y kubelet kubeadm kubectl
sudo apt-mark hold kubelet kubeadm kubectl
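Before moving on, a preflight sketch like the following can confirm the steps above took effect on each node. The helper and its labels are my own, not a kubeadm tool:

```shell
# [ALL NODES] optional preflight; prints OK/FAIL per check, never aborts.
check() {  # check <description> <command...>
  desc=$1; shift
  if "$@" >/dev/null 2>&1; then
    echo "OK   $desc"
  else
    echo "FAIL $desc"
  fi
}
check "br_netfilter loaded"  grep -qw br_netfilter /proc/modules
check "overlay loaded"       grep -qw overlay /proc/modules
check "ip_forward enabled"   grep -qx 1 /proc/sys/net/ipv4/ip_forward
check "swap disabled"        sh -c 'test "$(tail -n +2 /proc/swaps | wc -l)" -eq 0'
check "fm.vpn resolves"      getent hosts fm.vpn
check "containerd active"    systemctl is-active --quiet containerd
```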

Initialize the Master

# [MASTER] your WireGuard (or master) IP (no sudo here)
WG_IP=$(ip -4 addr show wg0 | awk '/inet /{print $2}' | cut -d/ -f1)
echo "$WG_IP"   # expected: 10.8.0.25 or your master’s VPN IP

Create config (or use examples/kubeadm-init.yaml and replace <<MASTER_WG_IP>>):

# [MASTER] note: the heredoc delimiter is unquoted so that ${WG_IP} expands
cat > kubeadm-init.yaml <<EOF
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
clusterName: fm-cluster
kubernetesVersion: v1.29.6
controlPlaneEndpoint: "fm.vpn:6443"
networking:
  serviceSubnet: 10.96.0.0/12
  podSubnet: 192.168.0.0/16
apiServer:
  certSANs:
    - fm.vpn
  extraArgs:
    advertise-address: "${WG_IP}"
    bind-address: "0.0.0.0"
    etcd-servers: "https://127.0.0.1:2379"
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: systemd
evictionHard:
  "imagefs.available": "5%"
  "nodefs.available": "5%"
  "nodefs.inodesFree": "5%"
evictionMinimumReclaim:
  "imagefs.available": "2Gi"
  "nodefs.available": "2Gi"
EOF

sudo kubeadm init --config kubeadm-init.yaml

# kubectl for the admin user
mkdir -p $HOME/.kube
sudo cp -f /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown "$(id -u)":"$(id -g)" $HOME/.kube/config

Fixes You Must Apply (Prevents Prior Failures)

1) Control-plane kubeconfigs must use NAME, not old IP

# [MASTER]
for f in /etc/kubernetes/{admin.conf,controller-manager.conf,scheduler.conf}; do
  sudo sed -i 's#server: https://[^:]*:6443#server: https://fm.vpn:6443#g' "$f"
done
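To convince yourself the sed pattern above only rewrites the host part of the server URL, here is a dry run against a scratch file (192.168.1.67 is an example stale IP, not necessarily yours):

```shell
# Dry-run sketch of the kubeconfig rewrite on a scratch copy.
tmp=$(mktemp)
printf '    server: https://192.168.1.67:6443\n' > "$tmp"
sed -i 's#server: https://[^:]*:6443#server: https://fm.vpn:6443#g' "$tmp"
cat "$tmp"   # the line now reads: server: https://fm.vpn:6443
rm -f "$tmp"
```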

2) kube-apiserver.yaml must be valid & point probes to localhost

Edit /etc/kubernetes/manifests/kube-apiserver.yaml:

  • Keep one --advertise-address=${WG_IP} and --bind-address=0.0.0.0.
  • Use 127.0.0.1 for liveness/readiness probe host.
  • Ensure volumes indentation is correct (see Appendix B).

Save the file — kubelet will restart the static pod.
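Duplicate flags were one way this file broke before. A small helper (hypothetical, not part of kubeadm) can flag any `--flag=` that appears more than once in a static-pod manifest:

```shell
# Prints any --flag= occurring more than once in the given manifest.
dup_flags() {
  grep -o -- '--[a-z-]*=' "$1" | sort | uniq -d
}
# usage on the master:
#   sudo bash -c "$(declare -f dup_flags); dup_flags /etc/kubernetes/manifests/kube-apiserver.yaml"
```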

3) etcd.yaml must not reference the old LAN IP

Edit /etc/kubernetes/manifests/etcd.yaml and replace any 192.168.1.67 with 127.0.0.1 and/or your WG_IP as shown in Appendix C.

Save — kubelet restarts etcd.


Install Calico

# [MASTER]
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.27.3/manifests/calico.yaml

Join Workers

On MASTER, get a fresh join command:

# [MASTER]
kubeadm token create --print-join-command

On each WORKER:

# [WORKER]
getent hosts fm.vpn
sudo kubeadm join fm.vpn:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>

Post-Install Checks

# [MASTER]
curl -sk https://fm.vpn:6443/healthz && echo
kubectl get nodes -o wide
kubectl -n kube-system get pods -o wide
# note: the calico.yaml manifest installs Calico into kube-system; a
# calico-system namespace only exists with the Tigera operator install

# test scheduling to a worker
kubectl run test-pod --image=nginx --restart=Never --image-pull-policy=IfNotPresent
kubectl wait --for=condition=Ready pod/test-pod --timeout=90s
kubectl delete pod test-pod

Note: control-plane nodes are tainted NoSchedule by default. If you want workloads on the master (not recommended):
kubectl taint nodes <master-node> node-role.kubernetes.io/control-plane:NoSchedule-


Adding More Workers Later

Repeat Prepare All Nodes on the new worker, ensure fm.vpn resolves, then run a fresh kubeadm join from the master.
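If you do this often, the token-create and join steps can be folded into one helper. This is a hypothetical sketch, not a kubeadm feature; it assumes passwordless SSH to the worker as a sudo-capable user:

```shell
# [MASTER] mint a fresh join command and run it on the worker over SSH.
join_worker() {  # join_worker <worker-host>
  cmd=$(kubeadm token create --print-join-command) || return 1
  ssh "$1" "sudo $cmd"
}
# usage: join_worker worker3.fm.vpn   (placeholder hostname; adjust)
```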


Troubleshooting (Common Symptoms → Fix)

A) kubectl ... connection refused to fm.vpn:6443

  • kube-apiserver.yaml malformed or probes point to old IP → fix per Appendix B.
  • etcd.yaml still uses old IP → fix per Appendix C.
  • Control-plane kubeconfigs use old address → re-run the kubeconfig sed loop above.

B) Pods Evicted: ephemeral-storage / Node DiskPressure=True

# [AFFECTED NODE]
sudo ctr -n k8s.io images prune
sudo journalctl --vacuum-time=3d
sudo rm -rf /var/log/*-???????? /var/log/*.gz 2>/dev/null || true
# then verify:
kubectl describe node <node> | sed -n '/Conditions:/,$p'
# if taint persists after pressure clears:
kubectl taint node <node> node.kubernetes.io/disk-pressure:NoSchedule- || true

C) Calico errors to https://10.96.0.1:443

That’s the in-cluster API Service address. Fix apiserver/etcd first, then check:

kubectl -n kube-system get pods -o wide | grep -E 'kube-proxy|coredns'

D) Master IP changed later

  • Update DNS/hosts so fm.vpn → new IP on all nodes.
  • Ensure apiserver cert has SAN fm.vpn:
    openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -text | grep -A1 "Subject Alternative Name"
  • If missing, one reliable fix is to delete and regenerate the cert from the updated config (config path is an example):
    sudo rm /etc/kubernetes/pki/apiserver.{crt,key}
    sudo kubeadm init phase certs apiserver --config kubeadm-init.yaml
    sudo systemctl restart kubelet

Appendix A: kubeadm-init.yaml (good defaults)

apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
clusterName: fm-cluster
kubernetesVersion: v1.29.6
controlPlaneEndpoint: "fm.vpn:6443"
networking:
  serviceSubnet: 10.96.0.0/12
  podSubnet: 192.168.0.0/16
apiServer:
  certSANs:
    - fm.vpn
  extraArgs:
    advertise-address: "<<MASTER_WG_IP>>"   # replace with the master’s WG IP
    bind-address: "0.0.0.0"
    etcd-servers: "https://127.0.0.1:2379"
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: systemd
evictionHard:
  "imagefs.available": "5%"
  "nodefs.available": "5%"
  "nodefs.inodesFree": "5%"
evictionMinimumReclaim:
  "imagefs.available": "2Gi"
  "nodefs.available": "2Gi"

Appendix B: Good kube-apiserver probes & volumes

Probes (inside containers[0])

    livenessProbe:
      httpGet:
        host: 127.0.0.1
        path: /livez
        port: 6443
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
      failureThreshold: 8
    readinessProbe:
      httpGet:
        host: 127.0.0.1
        path: /readyz
        port: 6443
        scheme: HTTPS
      periodSeconds: 1
      failureThreshold: 3

Volumes (at spec.volumes)

  volumes:
  - name: ca-certs
    hostPath:
      path: /etc/ssl/certs
      type: DirectoryOrCreate
  - name: etc-ca-certificates
    hostPath:
      path: /etc/ca-certificates
  - name: etc-pki
    hostPath:
      path: /etc/pki
  - name: k8s-certs
    hostPath:
      path: /etc/kubernetes/pki
  - name: usr-local-share-ca-certificates
    hostPath:
      path: /usr/local/share/ca-certificates
  - name: usr-share-ca-certificates
    hostPath:
      path: /usr/share/ca-certificates

Appendix C: Etcd static pod settings (no stale IPs)

Edit /etc/kubernetes/manifests/etcd.yaml and ensure:

# Clients: always keep localhost; add WG IP only if you truly need remote client access
- --listen-client-urls=https://127.0.0.1:2379,https://<<MASTER_WG_IP>>:2379
- --advertise-client-urls=https://127.0.0.1:2379

# Peers: for single-node etcd, these may still reference the WG IP but must NOT reference old LAN IPs
- --listen-peer-urls=https://<<MASTER_WG_IP>>:2380
- --initial-advertise-peer-urls=https://<<MASTER_WG_IP>>:2380
- --initial-cluster=master=https://<<MASTER_WG_IP>>:2380
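The same replacement can be scripted; both IPs below are the examples used in this runbook, so adjust them to your environment. The helper is my own sketch (unescaped dots in the sed pattern act as wildcards, which is acceptable for a one-off):

```shell
# Replace a stale IP in a manifest and report the result.
fix_ip() {  # fix_ip <file> <old-ip> <new-ip>
  sed -i "s/$2/$3/g" "$1"
  if grep -q "$2" "$1"; then
    echo "stale IPs remain in $1"
  else
    echo "no stale IPs in $1"
  fi
}
# usage on the master (kubelet restarts etcd once the manifest changes):
#   sudo bash -c "$(declare -f fix_ip); fix_ip /etc/kubernetes/manifests/etcd.yaml 192.168.1.67 10.8.0.25"
```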
