
I'm running a local three-node Kubernetes cluster on my LAN. The three nodes are my router, my media server, and my PC. The PC runs Pop!_OS, and I'm having trouble getting it set up.
I initialized the cluster with kubeadm, joined the nodes together, and I'm using the Calico CNI.
The first problem is that I can't permanently disable swap on Pop!_OS: I can run sudo swapoff -a, but swap comes back after a reboot. Normally this is as simple as removing the swap entry from /etc/fstab, but in this case there is no swap entry to remove.
My first question is: how do I permanently disable swap on Pop!_OS?
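For reference, here is how I've been checking where the swap is coming from (a diagnostic sketch using standard util-linux tools; the output below is what I see after a fresh boot, before running swapoff):

```shell
# Show active swap and where it comes from (device or file)
swapon --show

# Same information straight from the kernel
cat /proc/swaps
```

Since /etc/fstab has no swap entry, I suspect something else is activating it; systemctl --type swap --all might reveal a generated systemd swap unit, but I'm not sure how to interpret that output.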
Right now the kubelet sits in a crash loop until swap is disabled and the service is restarted:
sudo swapoff -a
sudo systemctl restart kubelet
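As an aside, I've read that the kubelet can be configured to tolerate swap instead of crash-looping, via failSwapOn in its config file. I haven't tried this yet; a sketch, assuming the kubeadm default config path of /var/lib/kubelet/config.yaml:

```yaml
# /var/lib/kubelet/config.yaml (kubeadm default location)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
failSwapOn: false
```

followed by sudo systemctl restart kubelet. I'd rather disable swap properly, though, since running Kubernetes with swap enabled seems to be discouraged.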
That gets the system, Calico, and Nvidia pods running, for example:
-> % kubectl get pods -A -o wide | grep pop-os
calico-system calico-node-spxz4 1/1 Running 4 (2d12h ago) 5d9h 10.0.0.235 pop-os <none> <none>
calico-system csi-node-driver-cvw7l 2/2 Running 8 (2d12h ago) 5d9h 192.168.179.103 pop-os <none> <none>
default gpu-feature-discovery-8rx9w 1/1 Running 4 (2d12h ago) 5d9h 192.168.179.105 pop-os <none> <none>
default gpu-operator-1711318735-node-feature-discovery-worker-mc5xv 1/1 Running 5 (2d12h ago) 5d9h 192.168.179.99 pop-os <none> <none>
default nvidia-container-toolkit-daemonset-tmjt9 1/1 Running 4 (2d12h ago) 5d9h 192.168.179.106 pop-os <none> <none>
default nvidia-cuda-validator-ndcr4 0/1 Completed 0 19m <none> pop-os <none> <none>
default nvidia-dcgm-exporter-th8w4 1/1 Running 4 (25h ago) 5d9h 192.168.179.101 pop-os <none> <none>
default nvidia-device-plugin-daemonset-66576 1/1 Running 4 (2d12h ago) 5d9h 192.168.179.102 pop-os <none> <none>
default nvidia-operator-validator-sl5kc 1/1 Running 4 (2d12h ago) 5d9h 192.168.179.100 pop-os <none> <none>
kube-system kube-proxy-mjncv 1/1 Running 5 (2d12h ago) 5d9h 10.0.0.235 pop-os <none> <none>
...but the GPU metrics exporter that I want to schedule on this node does not run:
-> % kubectl describe pod nvidia-exporter-pc-5b78bdcd6d-vmtq4
Name: nvidia-exporter-pc-5b78bdcd6d-vmtq4
Namespace: default
Priority: 0
Runtime Class Name: nvidia
Service Account: default
Node: <none>
Labels: app=nvidia-exporter-pc
pod-template-hash=5b78bdcd6d
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/nvidia-exporter-pc-5b78bdcd6d
Containers:
nvidia-exporter-pc:
Image: utkuozdemir/nvidia_gpu_exporter:1.2.0
Port: 9835/TCP
Host Port: 0/TCP
Environment:
NVIDIA_VISIBLE_DEVICES: all
Mounts:
/dev/nvidia0 from nvidia0 (rw)
/dev/nvidiactl from nvidiactl (rw)
/usr/bin/nvidia-smi from nvidia-smi (rw)
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so from nvidia-lib (rw)
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 from nvidia-lib-1 (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-khzrv (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
nvidiactl:
Type: HostPath (bare host directory volume)
Path: /dev/nvidiactl
HostPathType:
nvidia0:
Type: HostPath (bare host directory volume)
Path: /dev/nvidia0
HostPathType:
nvidia-smi:
Type: HostPath (bare host directory volume)
Path: /usr/bin/nvidia-smi
HostPathType:
nvidia-lib:
Type: HostPath (bare host directory volume)
Path: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so
HostPathType:
nvidia-lib-1:
Type: HostPath (bare host directory volume)
Path: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
HostPathType:
kube-api-access-khzrv:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: kubernetes.io/hostname=pop-os
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 30s default-scheduler 0/3 nodes are available: 1 node(s) were unschedulable, 2 node(s) didn't match Pod's node affinity/selector. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
I suspected this was caused by a warning that shows up in the node description:
Warning InvalidDiskCapacity 24m kubelet invalid capacity 0 on image filesystem
...but research suggests this is usually a harmless warning that resolves quickly on its own and doesn't block scheduling. The full node description follows:
Name: pop-os
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
feature.node.kubernetes.io/cpu-cpuid.ADX=true
feature.node.kubernetes.io/cpu-cpuid.AESNI=true
feature.node.kubernetes.io/cpu-cpuid.AVX=true
feature.node.kubernetes.io/cpu-cpuid.AVX2=true
feature.node.kubernetes.io/cpu-cpuid.AVX512BITALG=true
feature.node.kubernetes.io/cpu-cpuid.AVX512BW=true
feature.node.kubernetes.io/cpu-cpuid.AVX512CD=true
feature.node.kubernetes.io/cpu-cpuid.AVX512DQ=true
feature.node.kubernetes.io/cpu-cpuid.AVX512F=true
feature.node.kubernetes.io/cpu-cpuid.AVX512IFMA=true
feature.node.kubernetes.io/cpu-cpuid.AVX512VBMI=true
feature.node.kubernetes.io/cpu-cpuid.AVX512VBMI2=true
feature.node.kubernetes.io/cpu-cpuid.AVX512VL=true
feature.node.kubernetes.io/cpu-cpuid.AVX512VNNI=true
feature.node.kubernetes.io/cpu-cpuid.AVX512VPOPCNTDQ=true
feature.node.kubernetes.io/cpu-cpuid.CMPXCHG8=true
feature.node.kubernetes.io/cpu-cpuid.FLUSH_L1D=true
feature.node.kubernetes.io/cpu-cpuid.FMA3=true
feature.node.kubernetes.io/cpu-cpuid.FSRM=true
feature.node.kubernetes.io/cpu-cpuid.FXSR=true
feature.node.kubernetes.io/cpu-cpuid.FXSROPT=true
feature.node.kubernetes.io/cpu-cpuid.GFNI=true
feature.node.kubernetes.io/cpu-cpuid.IA32_ARCH_CAP=true
feature.node.kubernetes.io/cpu-cpuid.IBPB=true
feature.node.kubernetes.io/cpu-cpuid.LAHF=true
feature.node.kubernetes.io/cpu-cpuid.MD_CLEAR=true
feature.node.kubernetes.io/cpu-cpuid.MOVBE=true
feature.node.kubernetes.io/cpu-cpuid.MPX=true
feature.node.kubernetes.io/cpu-cpuid.OSXSAVE=true
feature.node.kubernetes.io/cpu-cpuid.PSFD=true
feature.node.kubernetes.io/cpu-cpuid.SHA=true
feature.node.kubernetes.io/cpu-cpuid.SPEC_CTRL_SSBD=true
feature.node.kubernetes.io/cpu-cpuid.STIBP=true
feature.node.kubernetes.io/cpu-cpuid.SYSCALL=true
feature.node.kubernetes.io/cpu-cpuid.SYSEE=true
feature.node.kubernetes.io/cpu-cpuid.VAES=true
feature.node.kubernetes.io/cpu-cpuid.VMX=true
feature.node.kubernetes.io/cpu-cpuid.VPCLMULQDQ=true
feature.node.kubernetes.io/cpu-cpuid.X87=true
feature.node.kubernetes.io/cpu-cpuid.XGETBV1=true
feature.node.kubernetes.io/cpu-cpuid.XSAVE=true
feature.node.kubernetes.io/cpu-cpuid.XSAVEC=true
feature.node.kubernetes.io/cpu-cpuid.XSAVEOPT=true
feature.node.kubernetes.io/cpu-cpuid.XSAVES=true
feature.node.kubernetes.io/cpu-cstate.enabled=true
feature.node.kubernetes.io/cpu-hardware_multithreading=true
feature.node.kubernetes.io/cpu-model.family=6
feature.node.kubernetes.io/cpu-model.id=167
feature.node.kubernetes.io/cpu-model.vendor_id=Intel
feature.node.kubernetes.io/cpu-pstate.scaling_governor=powersave
feature.node.kubernetes.io/cpu-pstate.status=active
feature.node.kubernetes.io/cpu-pstate.turbo=true
feature.node.kubernetes.io/kernel-config.NO_HZ=true
feature.node.kubernetes.io/kernel-config.NO_HZ_IDLE=true
feature.node.kubernetes.io/kernel-version.full=6.8.0-76060800daily20240311-generic
feature.node.kubernetes.io/kernel-version.major=6
feature.node.kubernetes.io/kernel-version.minor=8
feature.node.kubernetes.io/kernel-version.revision=0
feature.node.kubernetes.io/pci-10de.present=true
feature.node.kubernetes.io/pci-8086.present=true
feature.node.kubernetes.io/storage-nonrotationaldisk=true
feature.node.kubernetes.io/system-os_release.ID=pop
feature.node.kubernetes.io/system-os_release.VERSION_ID=22.04
feature.node.kubernetes.io/system-os_release.VERSION_ID.major=22
feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=04
feature.node.kubernetes.io/usb-ef_043e_9a39.present=true
feature.node.kubernetes.io/usb-ef_046d_081b.present=true
kubernetes.io/arch=amd64
kubernetes.io/hostname=pop-os
kubernetes.io/os=linux
nvidia.com/cuda.driver.major=550
nvidia.com/cuda.driver.minor=67
nvidia.com/cuda.driver.rev=
nvidia.com/cuda.runtime.major=12
nvidia.com/cuda.runtime.minor=4
nvidia.com/gfd.timestamp=1711817271
nvidia.com/gpu-driver-upgrade-state=pod-restart-required
nvidia.com/gpu.compute.major=8
nvidia.com/gpu.compute.minor=6
nvidia.com/gpu.count=1
nvidia.com/gpu.deploy.container-toolkit=true
nvidia.com/gpu.deploy.dcgm=true
nvidia.com/gpu.deploy.dcgm-exporter=true
nvidia.com/gpu.deploy.device-plugin=true
nvidia.com/gpu.deploy.driver=pre-installed
nvidia.com/gpu.deploy.gpu-feature-discovery=true
nvidia.com/gpu.deploy.node-status-exporter=true
nvidia.com/gpu.deploy.operator-validator=true
nvidia.com/gpu.family=ampere
nvidia.com/gpu.machine=MS-7D09
nvidia.com/gpu.memory=10240
nvidia.com/gpu.present=true
nvidia.com/gpu.product=NVIDIA-GeForce-RTX-3080
nvidia.com/gpu.replicas=1
nvidia.com/mig.capable=false
nvidia.com/mig.strategy=single
Annotations: csi.volume.kubernetes.io/nodeid: {"csi.tigera.io":"pop-os"}
kubeadm.alpha.kubernetes.io/cri-socket: unix:///var/run/containerd/containerd.sock
nfd.node.kubernetes.io/feature-labels:
cpu-cpuid.ADX,cpu-cpuid.AESNI,cpu-cpuid.AVX,cpu-cpuid.AVX2,cpu-cpuid.AVX512BITALG,cpu-cpuid.AVX512BW,cpu-cpuid.AVX512CD,cpu-cpuid.AVX512DQ...
nfd.node.kubernetes.io/worker.version: v0.14.2
node.alpha.kubernetes.io/ttl: 0
nvidia.com/gpu-driver-upgrade-enabled: true
projectcalico.org/IPv4Address: 10.0.0.235/24
projectcalico.org/IPv4VXLANTunnelAddr: 192.168.179.64
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Mon, 25 Mar 2024 00:16:19 -0700
Taints: node.kubernetes.io/unschedulable:NoSchedule
Unschedulable: true
Lease:
HolderIdentity: pop-os
AcquireTime: <unset>
RenewTime: Sat, 30 Mar 2024 10:12:10 -0700
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Sat, 30 Mar 2024 09:47:37 -0700 Sat, 30 Mar 2024 09:47:37 -0700 CalicoIsUp Calico is running on this node
MemoryPressure False Sat, 30 Mar 2024 10:11:30 -0700 Sat, 30 Mar 2024 09:47:34 -0700 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Sat, 30 Mar 2024 10:11:30 -0700 Sat, 30 Mar 2024 09:47:34 -0700 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Sat, 30 Mar 2024 10:11:30 -0700 Sat, 30 Mar 2024 09:47:34 -0700 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Sat, 30 Mar 2024 10:11:30 -0700 Sat, 30 Mar 2024 09:47:34 -0700 KubeletReady kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: 10.0.0.235
Hostname: pop-os
Capacity:
cpu: 16
ephemeral-storage: 238222068Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 32777432Ki
nvidia.com/gpu: 1
pods: 110
Allocatable:
cpu: 16
ephemeral-storage: 219545457506
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 32675032Ki
nvidia.com/gpu: 1
pods: 110
System Info:
Machine ID: 709fbcb158d7dd28973351156441d28c
System UUID: 43a79490-8c6e-4b1e-ac81-d8bbc1049bdf
Boot ID: 5af7fdf9-5d8b-44bd-934a-46d2f0c379e1
Kernel Version: 6.8.0-76060800daily20240311-generic
OS Image: Pop!_OS 22.04 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.7.2
Kubelet Version: v1.28.8
Kube-Proxy Version: v1.28.8
PodCIDR: 192.168.2.0/24
PodCIDRs: 192.168.2.0/24
Non-terminated Pods: (9 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
calico-system calico-node-spxz4 0 (0%) 0 (0%) 0 (0%) 0 (0%) 5d9h
calico-system csi-node-driver-cvw7l 0 (0%) 0 (0%) 0 (0%) 0 (0%) 5d9h
default gpu-feature-discovery-8rx9w 0 (0%) 0 (0%) 0 (0%) 0 (0%) 5d9h
default gpu-operator-1711318735-node-feature-discovery-worker-mc5xv 0 (0%) 0 (0%) 0 (0%) 0 (0%) 5d9h
default nvidia-container-toolkit-daemonset-tmjt9 0 (0%) 0 (0%) 0 (0%) 0 (0%) 5d9h
default nvidia-dcgm-exporter-th8w4 0 (0%) 0 (0%) 0 (0%) 0 (0%) 5d9h
default nvidia-device-plugin-daemonset-66576 0 (0%) 0 (0%) 0 (0%) 0 (0%) 5d9h
default nvidia-operator-validator-sl5kc 0 (0%) 0 (0%) 0 (0%) 0 (0%) 5d9h
kube-system kube-proxy-mjncv 0 (0%) 0 (0%) 0 (0%) 0 (0%) 5d9h
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 0 (0%) 0 (0%)
memory 0 (0%) 0 (0%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
nvidia.com/gpu 0 0
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 24m kube-proxy
Normal NodeNotSchedulable 24m kubelet Node pop-os status is now: NodeNotSchedulable
Normal NodeReady 24m (x2 over 24m) kubelet Node pop-os status is now: NodeReady
Normal NodeHasSufficientMemory 24m (x3 over 24m) kubelet Node pop-os status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 24m (x3 over 24m) kubelet Node pop-os status is now: NodeHasNoDiskPressure
Warning InvalidDiskCapacity 24m kubelet invalid capacity 0 on image filesystem
Warning Rebooted 24m (x2 over 24m) kubelet Node pop-os has been rebooted, boot id: 5af7fdf9-5d8b-44bd-934a-46d2f0c379e1
Normal NodeAllocatableEnforced 24m kubelet Updated Node Allocatable limit across pods
Normal Starting 24m kubelet Starting kubelet.
Normal NodeHasSufficientPID 24m (x3 over 24m) kubelet Node pop-os status is now: NodeHasSufficientPID
Normal Starting 16m kubelet Starting kubelet.
Warning InvalidDiskCapacity 16m kubelet invalid capacity 0 on image filesystem
Normal NodeAllocatableEnforced 16m kubelet Updated Node Allocatable limit across pods
Normal NodeHasSufficientMemory 16m kubelet Node pop-os status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 16m kubelet Node pop-os status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 16m kubelet Node pop-os status is now: NodeHasSufficientPID
Normal NodeNotSchedulable 16m kubelet Node pop-os status is now: NodeNotSchedulable
How do I fix the unschedulable taint and get the GPU exporter (and other) pods running on the PC? I'm a Kubernetes newbie, so this is a bit over my head.