Tuning Linux and Nginx to handle 10k connections on a 10Gbps server

I just bought a new 10Gbps server with 8 CPU cores, 64GB RAM, and a 1TB NVMe drive.

OS: CentOS 7.9, kernel 3.10.0-1160.36.2.el7.x86_64 (also tried kernel-ml 5.13).
SELinux is disabled.
firewalld and irqbalance are stopped.
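For completeness, here is one way to verify those states (a sketch assuming standard CentOS tooling; sestatus comes from policycoreutils):

# Should print "disabled"
sestatus

# Should print "inactive" for both units
systemctl is-active firewalld irqbalance

# Show the running kernel version
uname -r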

I ran a network test with iperf3 and confirmed a throughput of about 9.5 Gbps.
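Something along these lines, with several parallel streams to saturate the link (the address 203.0.113.10 and the stream count are illustrative, not from the original test):

# On the server under test
iperf3 -s

# On a remote client: 8 parallel streams for 30 seconds
iperf3 -c 203.0.113.10 -P 8 -t 30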

Then I ran another test, downloading static files from the server on 10 x 1Gbps client servers. The server easily pushed nearly the full 10Gbps to the 10 clients.
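A minimal sketch of such a client-side download loop, with a hypothetical file path (/test/1GB.bin), as each 1Gbps client might run it:

# Repeatedly fetch a large static file, discarding the body
while true; do
    curl -s -o /dev/null http://203.0.113.10/test/1GB.bin
done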

So we put the server into production, serving static file downloads to clients with Nginx. It delivered stable performance until it reached about 2,000 connections, and then performance started to degrade noticeably. I found that throughput drops as the connection count rises: serving more than 4,000 connections yields only 2Gbps!

(Figure 1: traffic and HTTP connections)

The most confusing part is that the CPU is almost idle, RAM is mostly free, and I/O usage is low thanks to the NVMe drive and the large RAM. Yet when the server has thousands of connections, every service (HTTP, FTP, SSH) slows down; even yum update takes a very long time to respond. It looks like network or packet congestion, or some kind of limit in the kernel or the NIC.
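A few commands can help confirm whether softirq/packet processing is the bottleneck (a diagnostic sketch, not from the original post; mpstat is in the sysstat package):

# Per-CPU utilization, including %soft (time spent in softirqs)
mpstat -P ALL 1

# Softirq counters per CPU; watch whether NET_RX/NET_TX pile up on one core
watch -n1 cat /proc/softirqs

# Per-CPU counters of packets dropped because the kernel backlog was full
cat /proc/net/softnet_stat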

(Screenshots: top output and load average)

I have tried most of the usual tuning tricks:

ifconfig eth0 txqueuelen 20000
ifconfig eth0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        ether 00:16:3e:c2:f5:21  txqueuelen 20000  (Ethernet)
        RX packets 26012067560  bytes 1665662731749 (1.5 TiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 30684216747  bytes 79033055227212 (71.8 TiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
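On newer systems the same queue length can be set with the ip tool, which is the modern replacement for the ifconfig command above:

# Equivalent of "ifconfig eth0 txqueuelen 20000"
ip link set dev eth0 txqueuelen 20000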

tc -s -d qdisc show dev eth0

qdisc mq 1: root
    Sent 7733649086021 bytes 1012203012 pkt (dropped 0, overlimits 0 requeues 169567)
    backlog 4107556b 2803p requeues 169567
qdisc pfifo_fast 0: parent 1:8 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
    Sent 2503685906926 bytes 1714686297 pkt (dropped 0, overlimits 0 requeues 1447)
    backlog 4107556b 2803p requeues 1447
qdisc pfifo_fast 0: parent 1:7 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
    Sent 532876060762 bytes 366663805 pkt (dropped 0, overlimits 0 requeues 7790)
    backlog 0b 0p requeues 7790
qdisc pfifo_fast 0: parent 1:6 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
    Sent 563510390106 bytes 387948990 pkt (dropped 0, overlimits 0 requeues 9694)
    backlog 0b 0p requeues 9694
qdisc pfifo_fast 0: parent 1:5 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
    Sent 563033712946 bytes 387564038 pkt (dropped 0, overlimits 0 requeues 10259)
    backlog 0b 0p requeues 10259
qdisc pfifo_fast 0: parent 1:4 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
    Sent 562982455659 bytes 387451904 pkt (dropped 0, overlimits 0 requeues 10706)
    backlog 0b 0p requeues 10706
qdisc pfifo_fast 0: parent 1:3 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
    Sent 559557988260 bytes 385263948 pkt (dropped 0, overlimits 0 requeues 9983)
    backlog 0b 0p requeues 9983
qdisc pfifo_fast 0: parent 1:2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
    Sent 528903326344 bytes 364105031 pkt (dropped 0, overlimits 0 requeues 7718)
    backlog 0b 0p requeues 7718
qdisc pfifo_fast 0: parent 1:1 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
    Sent 1919099245018 bytes 1313486295 pkt (dropped 0, overlimits 0 requeues 111970)
    backlog 0b 0p requeues 111970
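If the default pfifo_fast queues are suspected, one experiment (my suggestion, not something the poster tried) is to switch the root qdisc to fq, which is generally recommended for high-throughput TCP servers; note that a single fq root on a multiqueue NIC is a blunt test, since it serializes the hardware queues:

# Replace the mq/pfifo_fast setup with a single fq root
tc qdisc replace dev eth0 root fq

# Verify the change and watch the drop/requeue counters
tc -s -d qdisc show dev eth0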

ethtool -k eth0

Features for eth0:
rx-checksumming: on [fixed]
tx-checksumming: on
        tx-checksum-ipv4: off [fixed]
        tx-checksum-ip-generic: on
        tx-checksum-ipv6: off [fixed]
        tx-checksum-fcoe-crc: off [fixed]
        tx-checksum-sctp: off [fixed]
scatter-gather: on
        tx-scatter-gather: on
        tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: off
        tx-tcp-segmentation: off
        tx-tcp-ecn-segmentation: off
        tx-tcp6-segmentation: off
        tx-tcp-mangleid-segmentation: off
udp-fragmentation-offload: on
generic-segmentation-offload: off
generic-receive-offload: off
large-receive-offload: off [fixed]
rx-vlan-offload: off [fixed]
tx-vlan-offload: off [fixed]
ntuple-filters: off [fixed]
receive-hashing: off [fixed]
highdma: on [fixed]
rx-vlan-filter: on [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: off [fixed]
tx-ipip-segmentation: off [fixed]
tx-sit-segmentation: off [fixed]
tx-udp_tnl-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
busy-poll: off [fixed]
tx-gre-csum-segmentation: off [fixed]
tx-udp_tnl-csum-segmentation: off [fixed]
tx-gso-partial: off [fixed]
tx-sctp-segmentation: off [fixed]
rx-gro-hw: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off [fixed]
rx-udp_tunnel-port-offload: off [fixed]
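Notably, tcp-segmentation-offload, generic-segmentation-offload, and generic-receive-offload are all off in this output, which pushes a lot of per-packet work onto the CPU's softirq path at 10Gbps. The 00:16:3e MAC prefix suggests a Xen/virtualized guest, where offload support depends on the host, but if the driver allows it, re-enabling them is worth a try:

# Try re-enabling segmentation and receive offloads
ethtool -K eth0 tso on gso on gro on

# Confirm which ones actually took effect
ethtool -k eth0 | grep -E 'segmentation-offload|receive-offload'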

sysctl -p

vm.max_map_count = 1048575
net.ipv4.tcp_timestamps = 0
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.default.rp_filter = 1
net.ipv4.tcp_ecn = 0
net.ipv4.tcp_sack = 1
net.ipv4.tcp_syncookies = 0
net.ipv4.conf.all.log_martians = 1
vm.swappiness = 10
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65536
net.core.netdev_max_backlog = 250000
fs.file-max = 100000
net.ipv4.ip_local_port_range = 13000 65000
net.ipv4.udp_rmem_min = 8192
net.ipv4.udp_wmem_min = 8192
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.all.accept_source_route = 0
net.ipv4.ip_forward = 0
net.ipv6.conf.all.forwarding = 0
net.ipv4.tcp_slow_start_after_idle = 0
net.core.rmem_max = 2147483647
net.core.rmem_default = 2147483647
net.core.wmem_max = 2147483647
net.core.wmem_default = 2147483647
net.core.optmem_max = 2147483647
net.ipv4.tcp_rmem = 4096 87380 2147483647
net.ipv4.tcp_wmem = 4096 65536 2147483647
net.ipv4.tcp_low_latency = 1
net.ipv4.tcp_adv_win_scale = 1
net.ipv4.tcp_keepalive_time = 60
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 5
net.ipv4.tcp_max_tw_buckets = 2000000
net.ipv4.tcp_fin_timeout = 10
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_keepalive_intvl = 15
net.ipv4.tcp_keepalive_probes = 5
net.netfilter.nf_conntrack_max = 655360
net.netfilter.nf_conntrack_tcp_timeout_established = 10800
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
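To make settings like these persistent, they would normally live in a drop-in file under /etc/sysctl.d/ and be reloaded with sysctl (a standard sketch; the file name is illustrative):

# Persist selected tuning in a drop-in file
cat > /etc/sysctl.d/90-tuning.conf <<'EOF'
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65536
EOF

# Reload all sysctl configuration files
sysctl --system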

ulimit -a

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 256680
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 100000
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 100000
cpu time               (seconds, -t) unlimited
max user processes              (-u) 100000
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
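Raising the open-files limit persistently is usually done via limits.conf for login sessions, plus a systemd override for the nginx unit, since systemd services ignore limits.conf (a sketch using the poster's value of 100000):

# /etc/security/limits.conf (login sessions):
#   nginx  soft  nofile  100000
#   nginx  hard  nofile  100000

# For systemd-managed nginx, an override is needed instead:
mkdir -p /etc/systemd/system/nginx.service.d
cat > /etc/systemd/system/nginx.service.d/limits.conf <<'EOF'
[Service]
LimitNOFILE=100000
EOF
systemctl daemon-reload && systemctl restart nginx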

nginx.conf

worker_processes        auto;
worker_rlimit_nofile    100000;

thread_pool default threads=256     max_queue=65536;

events {
    worker_connections  65536;
    worker_aio_requests 65536;
    multi_accept on;
    accept_mutex on;
    use epoll;
}

http {
    server_tokens off;
    server_names_hash_max_size      4096;
    server_names_hash_bucket_size   128;

    tcp_nopush     on;
    tcp_nodelay     on;
    client_body_timeout 12;
    client_header_timeout 12;
    keepalive_timeout 15;
    keepalive_requests 1000;
    send_timeout 10;

    aio                         threads=default;
    sendfile                    on;
    sendfile_max_chunk          512k;
    open_file_cache             max=100000  inactive=10m;
    open_file_cache_valid       10m;
    open_file_cache_min_uses    10;
    open_file_cache_errors      on;

    gzip  off;
}
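After configuration changes like these, the syntax can be checked and the workers reloaded gracefully without dropping active downloads:

# Validate the configuration, then reload
nginx -t && nginx -s reload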

So the question is: how do I serve 10k connections downloading static files at 10Gbps? Is this a Linux problem, an Nginx problem, or a hardware limit?

Answer 1

Brandon has already answered this. Turn irqbalance back on. Run numad and let it do the tuning. Stop hand-tuning unless you have a specific workload that requires it. Where are the results of wrk tests at 2,000-10,000 requests run before deployment? This problem should never have appeared in production; it clearly could have been identified through testing. Real-world usage often uncovers uncommon bugs, but many or most configuration and application errors can be found and fixed during testing. There is plenty of documentation on IRQ affinity. I doubt your use case can do better than the built-in tuning tools.
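For reference, a wrk invocation in the spirit of what this answer asks for (the URL, thread count, and connection count are illustrative):

# 8 threads, 10,000 concurrent connections, 60 seconds, with latency stats
wrk -t8 -c10000 -d60s --latency http://203.0.113.10/test/1GB.bin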

Answer 2

The output of top shows that your kernel is being drowned in soft interrupts from all the incoming connections. Connections are arriving so fast that the hardware interrupts raised by the NIC queue up soft interrupts faster than the kernel can process them. That is why your CPU, RAM, and I/O usage look so low: the system keeps getting interrupted by incoming connections. What you need here is a load balancer.
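Two quick checks that would support or refute this diagnosis: whether all of the NIC's interrupts land on a single CPU, and whether the NIC exposes multiple RX/TX queues that could spread the load (a sketch; available queue counts depend on the driver):

# See which CPUs service the NIC's interrupts
grep eth0 /proc/interrupts

# Show how many combined queues the NIC supports vs. currently uses
ethtool -l eth0

# Spread packet processing across all 8 cores, if the driver supports it
ethtool -L eth0 combined 8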
