Kubernetes performance with a 100GbE network is very poor

2024-6-28

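We use ConnectX-5 100GbE Ethernet cards in our servers, which are connected to each other through a Mellanox switch. We use the weavenet CNI plugin on our Kubernetes cluster. When we run a test with the iperf tool using the following commands, we get a 100Gbps connection speed between the hosts.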

# server host
host1 $ iperf -s -P8
# client host
host2 $ iperf -c <host_ip> -P8
Result: 98.8 Gbps transfer speed

Also, when we run the same test with the same tool and commands using two docker containers on the same hosts, we get the same result.

# server host
host1 $ docker run -it -p 5001:5001 ubuntu:latest-with-iperf iperf -s -P8
# client host
host2 $ docker run -it -p 5001:5001 ubuntu:latest-with-iperf iperf -c <host_ip> -P8
Result: 98.8 Gbps transfer speed

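But when we create two different deployments on the same hosts (host1, host2) with the same image and run the same test through the service IP (we created a k8s service using the following yaml), which redirects the traffic to the server pod, we get only 2Gbps. We also ran the same test using the pod's cluster IP and the service's cluster domain, but the results are the same.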

kubectl create deployment iperf-server --image=ubuntu:latest-with-iperf  # after that we add affinity(host1) and container port sections to the yaml
kubectl create deployment iperf-client --image=ubuntu:latest-with-iperf  # after that we add affinity(host2) and container port sections to the yaml

kind: Service
apiVersion: v1
metadata:
  name: iperf-server
  namespace: default
spec:
  ports:
    - name: iperf
      protocol: TCP
      port: 5001
      targetPort: 5001
  selector:
    name: iperf-server
  clusterIP: 10.104.10.230
  type: ClusterIP
  sessionAffinity: None
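
The comments on the two `kubectl create deployment` commands above mention adding affinity and container port sections, but the edited manifest is not shown. A minimal sketch of what the server deployment might look like follows. The node name `host1`, the use of `nodeAffinity` on the `kubernetes.io/hostname` label, and the container command are assumptions. The pod label `name: iperf-server` is chosen here so that it matches the Service selector; `kubectl create deployment` by default labels pods with `app=<deployment name>` instead.

```yaml
# Sketch only: affinity, labels and command are assumptions, not the poster's actual manifest.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: iperf-server
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      name: iperf-server          # must match the pod template labels below
  template:
    metadata:
      labels:
        name: iperf-server        # matches the Service selector (name: iperf-server)
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.io/hostname
                    operator: In
                    values:
                      - host1     # assumed node name; pins the server pod to host1
      containers:
        - name: iperf
          image: ubuntu:latest-with-iperf
          command: ["iperf", "-s"]   # assumed command; the host test above runs iperf -s -P8
          ports:
            - containerPort: 5001    # same port the Service targets
              protocol: TCP
```

The client deployment would presumably be the same apart from its name, its labels, and `host2` in the affinity rule.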

TL;DR: the scenarios we tested:

host1 (ubuntu 20.04, mellanox driver installed) <--------> host2 (ubuntu 20.04, mellanox driver installed) = 98.8 Gbps
container1-on-host1 <--------> container2-on-host2 = 98.8 Gbps
Pod1-on-host1 <--------> Pod2-on-host2 (using cluster ip) = 2Gbps
Pod1-on-host1 <--------> Pod2-on-host2 (using service cluster ip) = 2Gbps
Pod1-on-host1 <--------> Pod2-on-host2 (using service cluster domain) = 2Gbps

We need to get 100Gbps speed for pod-to-pod communication. So what could be causing this issue?

Update 1:

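When I check htop inside the pod during the iperf test, there are 112 CPU cores and none of them is under heavy load.
When I add the hostNetwork: true key to the deployments, the pods can reach up to 100Gbps bandwidth.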
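
For reference, a sketch of where the `hostNetwork: true` key sits in a deployment's pod template. With it set, the pod uses the node's network namespace directly instead of the CNI overlay, which is consistent with the overlay being the bottleneck. The `dnsPolicy` line and the container name are assumptions added for completeness, not something mentioned above.

```yaml
# Fragment of a Deployment; only the pod template's spec is shown.
spec:
  template:
    spec:
      hostNetwork: true                    # pod shares the node's network namespace, bypassing the CNI overlay
      dnsPolicy: ClusterFirstWithHostNet   # assumption: keeps cluster DNS working for host-network pods
      containers:
        - name: iperf                      # hypothetical container name
          image: ubuntu:latest-with-iperf
```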

Answer 1

We solved this by disabling encryption on weavenet, though it was rebooting the servers that did the trick. Thanks to this article.
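
The answer does not show how encryption was configured. As far as I know, Weave Net on Kubernetes turns encryption on when the `WEAVE_PASSWORD` environment variable is set on the `weave` container of the `weave-net` DaemonSet, so disabling it would mean removing an entry like the one sketched below. The Secret name and key are assumptions taken from the usual documentation example and may differ in a given cluster.

```yaml
# Fragment of the weave-net DaemonSet (namespace kube-system) with encryption enabled.
# Removing the WEAVE_PASSWORD entry should disable encryption once the weave pods restart.
spec:
  template:
    spec:
      containers:
        - name: weave
          env:
            - name: WEAVE_PASSWORD       # its presence enables encryption between peers
              valueFrom:
                secretKeyRef:
                  name: weave-passwd     # assumed Secret name
                  key: weave-passwd      # assumed key inside the Secret
```

After the change, the output of `weave status` on a node should indicate whether encryption is still enabled.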
