
Estou solicitando conselhos para diagnosticar um R710 rodando xen no Debian Stretch entrando em pânico uma vez por dia.
O servidor funcionará bem por aproximadamente 24 horas e, eventualmente, lançará um Kernel Panic, geralmente em torno de bxn2. (Colocaremos pânico total abaixo) O sistema não responde mais e geralmente requer uma reinicialização. Após a reinicialização, o sistema está bem, mas eventualmente entrará em pânico.
Eu tenho um servidor idêntico executando um servidor que não trava quando tem uma ou duas VMs, mas trava quando todas as VMs são executadas no sistema. Eu também tenho outro R710 com um H700 que está rodando ~10 vms sem problemas.
Também estou tendo problemas para recriar o pânico. A certa altura, consegui travar o segundo servidor de maneira confiável, carregando a CPU ao máximo e executando IO alto. (sha1sum /dev/zero e dd).
As especificações do Dell R710 são as seguintes:
- 2x CPU de 4 núcleos
- 72 GB de RAM
- Versão de firmware integrado SAS 6/iR: 00.25.47.00.06.22.03.00
- SAS6 passando para 1 disco de sistema operacional de 1 TB, 2 discos de 2 TB no array mdadm (03:00.0 Controlador de armazenamento SCSI: LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS (rev 08))
- Placa Broadcom NetExtreeme 2. (01:00.0 Controlador Ethernet: Broadcom Limited NetXtreme II BCM5709 Gigabit Ethernet (rev 20))
- Todos os BIOS e firmware atualizados com o controlador de ciclo de vida.
Ele está executando o Debian Stretch com os seguintes detalhes:
- GNU/Linux 9 \n\l
- Linux xen01 4.9.0-3-amd64 #1 SMP Debian 4.9.30-2+deb9u3 (06/08/2017) x86_64 GNU/Linux
- Ele está executando o xen-hypervisor-4.8-amd64 4.8.1-1+deb9u1 com 22 máquinas virtuais.
- ii firmware-bnx2 20161130-3 todos firmware binário para Broadcom NetXtremeII
- mptsas versão 3.04.20
- mpt3sas versão 13.100.00.00
- ii bridge-utils 1.5-13 amd64 Utilitários para configurar a ponte Ethernet do Linux
Até agora tentei o seguinte (sozinho e em diferentes combinações):
- Desative o MSIX para bnx2. modprobe bnx2 desabilitar_msi=1
- Desative o MSIX para mpt3sas. modprobe mpt3sas msix_disable=1
- Adicionado intremap=off ao kernel e desativado Intel Virtualization via BIOS para evitar erratas do chipset Intel 55x0.https://support.citrix.com/article/CTX136517
- Reduzido vm.dirty_background_ratio=5 e vm.dirty_ratio=10 para contornar a falha do servidor Linux com "INFO: tarefa bloqueada por mais de 120 segundos"
- Defina nic rx mais alto e aumente net.core.netdev_max_backlog=30000 conformehttps://unix.stackexchange.com/questions/37727/solving-ethernet-watchdog-timer-deadlocks
Um exemplo de pânico.
Aug 18 14:45:16 xen02 kernel: [54277.859415] ------------[ cut here ]------------
Aug 18 14:45:16 xen02 kernel: [54277.859451] WARNING: CPU: 0 PID: 0 at /build/linux-me40Ry/linux-4.9.30/net/sched/sch_generic.c:316 dev_watchdog+0x22d/0x230
Aug 18 14:45:16 xen02 kernel: [54277.859456] NETDEV WATCHDOG: eno2 (bnx2): transmit queue 5 timed out
Aug 18 14:45:16 xen02 kernel: [54277.859457] Modules linked in: ipmi_si xt_tcpudp xt_physdev br_netfilter iptable_filter xen_netback xen_blkback mpt3sas raid_class mptctl bridge stp llc dell_rbu xen_gntdev xen_evtchn xenfs xen_privcmd ipmi_devintf iTCO_wdt iTCO_vendor_support evdev joydev mgag200 ttm drm_kms_helper intel_powerclamp coretemp drm i2c_algo_bit serio_raw dcdbas sg pcspkr acpi_power_meter ipmi_msghandler wmi button shpchp i7core_edac lpc_ich mfd_core edac_core ip_tables x_tables autofs4 ext4 crc16 jbd2 fscrypto ecb glue_helper lrw gf128mul ablk_helper cryptd aes_x86_64 mbcache raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid0 multipath linear dm_mod raid1 md_mod sd_mod uas usb_storage sr_mod cdrom ata_generic hid_generic usbhid hid crc32c_intel psmouse
Aug 18 14:45:16 xen02 kernel: [54277.859517] ehci_pci uhci_hcd mptsas ehci_hcd ata_piix scsi_transport_sas mptscsih libata mptbase usbcore usb_common scsi_mod bnx2 [last unloaded: ipmi_si]
Aug 18 14:45:16 xen02 kernel: [54277.859533] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.9.0-3-amd64 #1 Debian 4.9.30-2+deb9u3
Aug 18 14:45:16 xen02 kernel: [54277.859534] Hardware name: Dell Inc. PowerEdge R710/0YDJK3, BIOS 6.4.0 07/23/2013
Aug 18 14:45:16 xen02 kernel: [54277.859537] 0000000000000000 ffffffff81328574 ffff8811f5a03e20 0000000000000000
Aug 18 14:45:16 xen02 kernel: [54277.859540] ffffffff81076ebe 0000000000000005 ffff8811f5a03e78 ffff8811dee04000
Aug 18 14:45:16 xen02 kernel: [54277.859542] 0000000000000000 ffff8811e4b9c940 0000000000000008 ffffffff81076f3f
Aug 18 14:45:16 xen02 kernel: [54277.859544] Call Trace:
Aug 18 14:45:16 xen02 kernel: [54277.859547] <IRQ>
Aug 18 14:45:16 xen02 kernel: [54277.859553] [<ffffffff81328574>] ? dump_stack+0x5c/0x78
Aug 18 14:45:16 xen02 kernel: [54277.859558] [<ffffffff81076ebe>] ? __warn+0xbe/0xe0
Aug 18 14:45:16 xen02 kernel: [54277.859560] [<ffffffff81076f3f>] ? warn_slowpath_fmt+0x5f/0x80
Aug 18 14:45:16 xen02 kernel: [54277.859563] [<ffffffff8152a98d>] ? dev_watchdog+0x22d/0x230
Aug 18 14:45:16 xen02 kernel: [54277.859564] [<ffffffff8152a760>] ? qdisc_rcu_free+0x40/0x40
Aug 18 14:45:16 xen02 kernel: [54277.859570] [<ffffffff810e3e90>] ? call_timer_fn+0x30/0x110
Aug 18 14:45:16 xen02 kernel: [54277.859571] [<ffffffff810e43ce>] ? run_timer_softirq+0x1ce/0x420
Aug 18 14:45:16 xen02 kernel: [54277.859575] [<ffffffff810d0f91>] ? handle_irq_event_percpu+0x51/0x70
Aug 18 14:45:16 xen02 kernel: [54277.859576] [<ffffffff810d4dc7>] ? handle_percpu_irq+0x37/0x50
Aug 18 14:45:16 xen02 kernel: [54277.859581] [<ffffffff81608d95>] ? __do_softirq+0x105/0x290
Aug 18 14:45:16 xen02 kernel: [54277.859583] [<ffffffff8107cf6e>] ? irq_exit+0xae/0xb0
Aug 18 14:45:16 xen02 kernel: [54277.859587] [<ffffffff814052e1>] ? xen_evtchn_do_upcall+0x31/0x40
Aug 18 14:45:16 xen02 kernel: [54277.859588] [<ffffffff8160724e>] ? xen_do_hypervisor_callback+0x1e/0x40
Aug 18 14:45:16 xen02 kernel: [54277.859589] <EOI>
Aug 18 14:45:16 xen02 kernel: [54277.859592] [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
Aug 18 14:45:16 xen02 kernel: [54277.859594] [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
Aug 18 14:45:16 xen02 kernel: [54277.859597] [<ffffffff8101b30c>] ? xen_safe_halt+0xc/0x20
Aug 18 14:45:16 xen02 kernel: [54277.859600] [<ffffffff8160584a>] ? default_idle+0x1a/0xd0
Aug 18 14:45:16 xen02 kernel: [54277.859603] [<ffffffff810b957a>] ? cpu_startup_entry+0x1ca/0x240
Aug 18 14:45:16 xen02 kernel: [54277.859608] [<ffffffff81d38f57>] ? start_kernel+0x443/0x463
Aug 18 14:45:16 xen02 kernel: [54277.859611] [<ffffffff81d3e098>] ? xen_start_kernel+0x526/0x530
Aug 18 14:45:16 xen02 kernel: [54277.859613] ---[ end trace 213eed970c44d2fa ]---
Aug 18 14:45:16 xen02 kernel: [54277.859619] bnx2 0000:01:00.1 eno2: <--- start FTQ dump --->
Aug 18 14:45:16 xen02 kernel: [54277.859658] bnx2 0000:01:00.1 eno2: RV2P_PFTQ_CTL 00010000
Aug 18 14:45:16 xen02 kernel: [54277.859682] bnx2 0000:01:00.1 eno2: RV2P_TFTQ_CTL 00020000
Aug 18 14:45:16 xen02 kernel: [54277.859707] bnx2 0000:01:00.1 eno2: RV2P_MFTQ_CTL 00004000
Aug 18 14:45:16 xen02 kernel: [54277.859730] bnx2 0000:01:00.1 eno2: TBDR_FTQ_CTL 00004000
Aug 18 14:45:16 xen02 kernel: [54277.859753] bnx2 0000:01:00.1 eno2: TDMA_FTQ_CTL 00010002
Aug 18 14:45:16 xen02 kernel: [54277.859776] bnx2 0000:01:00.1 eno2: TXP_FTQ_CTL 00010000
Aug 18 14:45:16 xen02 kernel: [54277.859799] bnx2 0000:01:00.1 eno2: TXP_FTQ_CTL 00010000
Aug 18 14:45:16 xen02 kernel: [54277.859822] bnx2 0000:01:00.1 eno2: TPAT_FTQ_CTL 00010000
Aug 18 14:45:16 xen02 kernel: [54277.859845] bnx2 0000:01:00.1 eno2: RXP_CFTQ_CTL 00008000
Aug 18 14:45:16 xen02 kernel: [54277.859868] bnx2 0000:01:00.1 eno2: RXP_FTQ_CTL 00100000
Aug 18 14:45:16 xen02 kernel: [54277.859891] bnx2 0000:01:00.1 eno2: COM_COMXQ_FTQ_CTL 00010000
Aug 18 14:45:16 xen02 kernel: [54277.859916] bnx2 0000:01:00.1 eno2: COM_COMTQ_FTQ_CTL 00020000
Aug 18 14:45:16 xen02 kernel: [54277.859941] bnx2 0000:01:00.1 eno2: COM_COMQ_FTQ_CTL 00010000
Aug 18 14:45:16 xen02 kernel: [54277.859965] bnx2 0000:01:00.1 eno2: CP_CPQ_FTQ_CTL 00004000
Aug 18 14:45:16 xen02 kernel: [54277.859988] bnx2 0000:01:00.1 eno2: CPU states:
Aug 18 14:45:16 xen02 kernel: [54277.860017] bnx2 0000:01:00.1 eno2: 045000 mode b84c state 80001000 evt_mask 500 pc 8001284 pc 8001284 instr 1440fffc
Aug 18 14:45:16 xen02 kernel: [54277.860063] bnx2 0000:01:00.1 eno2: 085000 mode b84c state 80001000 evt_mask 500 pc 8000a4c pc 8000a5c instr 1440fffc
Aug 18 14:45:16 xen02 kernel: [54277.860108] bnx2 0000:01:00.1 eno2: 0c5000 mode b84c state 80001000 evt_mask 500 pc 8004c10 pc 8004c14 instr 32050003
Aug 18 14:45:16 xen02 kernel: [54277.860154] bnx2 0000:01:00.1 eno2: 105000 mode b8cc state 80000000 evt_mask 500 pc 8000a98 pc 8000aa4 instr 3c020800
Aug 18 14:45:16 xen02 kernel: [54277.860199] bnx2 0000:01:00.1 eno2: 145000 mode b880 state 80000000 evt_mask 500 pc 800ae38 pc 800ae40 instr 24130001
Aug 18 14:45:16 xen02 kernel: [54277.860245] bnx2 0000:01:00.1 eno2: 185000 mode b8cc state 80000000 evt_mask 500 pc 8000c6c pc 8000c6c instr 1180000b
Aug 18 14:45:16 xen02 kernel: [54277.860285] bnx2 0000:01:00.1 eno2: <--- end FTQ dump --->
Aug 18 14:45:16 xen02 kernel: [54277.860308] bnx2 0000:01:00.1 eno2: <--- start TBDC dump --->
Aug 18 14:45:16 xen02 kernel: [54277.860332] bnx2 0000:01:00.1 eno2: TBDC free cnt: 32
Aug 18 14:45:16 xen02 kernel: [54277.860353] bnx2 0000:01:00.1 eno2: LINE CID BIDX CMD VALIDS
Aug 18 14:45:16 xen02 kernel: [54277.860382] bnx2 0000:01:00.1 eno2: 00 001100 d618 00 [0]
Aug 18 14:45:16 xen02 kernel: [54277.860411] bnx2 0000:01:00.1 eno2: 01 001300 61b8 00 [0]
Aug 18 14:45:16 xen02 kernel: [54277.860440] bnx2 0000:01:00.1 eno2: 02 001280 63c8 00 [0]
Aug 18 14:45:16 xen02 kernel: [54277.860469] bnx2 0000:01:00.1 eno2: 03 000800 79c8 00 [0]
Aug 18 14:45:16 xen02 kernel: [54277.860498] bnx2 0000:01:00.1 eno2: 04 000800 40f8 00 [0]
Aug 18 14:45:16 xen02 kernel: [54277.860526] bnx2 0000:01:00.1 eno2: 05 16fd80 9ef8 bf [0]
Aug 18 14:45:16 xen02 kernel: [54277.860555] bnx2 0000:01:00.1 eno2: 06 1b5f80 f7c8 7f [0]
Aug 18 14:45:16 xen02 kernel: [54277.860584] bnx2 0000:01:00.1 eno2: 07 1bef80 fbd8 7f [0]
Aug 18 14:45:16 xen02 kernel: [54277.860612] bnx2 0000:01:00.1 eno2: 08 1bcd80 f5f8 7c [0]
Aug 18 14:45:16 xen02 kernel: [54277.860641] bnx2 0000:01:00.1 eno2: 09 1fff80 f9f8 96 [0]
Aug 18 14:45:16 xen02 kernel: [54277.860669] bnx2 0000:01:00.1 eno2: 0a 077f00 e7b8 7f [0]
Aug 18 14:45:16 xen02 kernel: [54277.860698] bnx2 0000:01:00.1 eno2: 0b 1dff80 f9f8 e7 [0]
Aug 18 14:45:16 xen02 kernel: [54277.860727] bnx2 0000:01:00.1 eno2: 0c 1f9c00 7a78 f0 [0]
Aug 18 14:45:16 xen02 kernel: [54277.860756] bnx2 0000:01:00.1 eno2: 0d 0ff680 fdf8 ff [0]
Aug 18 14:45:16 xen02 kernel: [54277.860784] bnx2 0000:01:00.1 eno2: 0e 067980 ffe8 f7 [0]
Aug 18 14:45:16 xen02 kernel: [54277.860813] bnx2 0000:01:00.1 eno2: 0f 0ef300 fb78 7e [0]
Aug 18 14:45:16 xen02 kernel: [54277.860842] bnx2 0000:01:00.1 eno2: 10 1be600 dff8 df [0]
Aug 18 14:45:16 xen02 kernel: [54277.860870] bnx2 0000:01:00.1 eno2: 11 1fff80 faf8 bf [0]
Aug 18 14:45:16 xen02 kernel: [54277.860899] bnx2 0000:01:00.1 eno2: 12 05fd80 7ef8 ff [0]
Aug 18 14:45:16 xen02 kernel: [54277.860928] bnx2 0000:01:00.1 eno2: 13 1fba00 d6f0 ff [0]
Aug 18 14:45:16 xen02 kernel: [54277.860957] bnx2 0000:01:00.1 eno2: 14 1fed80 7fd8 db [0]
Aug 18 14:45:16 xen02 kernel: [54277.860985] bnx2 0000:01:00.1 eno2: 15 17cf80 73b0 dd [0]
Aug 18 14:45:16 xen02 kernel: [54277.861014] bnx2 0000:01:00.1 eno2: 16 1ff700 eff8 1b [0]
Aug 18 14:45:16 xen02 kernel: [54277.861042] bnx2 0000:01:00.1 eno2: 17 1dfd80 eeb8 7f [0]
Aug 18 14:45:16 xen02 kernel: [54277.861071] bnx2 0000:01:00.1 eno2: 18 1bd780 fff8 ff [0]
Aug 18 14:45:16 xen02 kernel: [54277.861099] bnx2 0000:01:00.1 eno2: 19 17fb80 fef0 df [0]
Aug 18 14:45:16 xen02 kernel: [54277.861128] bnx2 0000:01:00.1 eno2: 1a 1ffe80 6a70 df [0]
Aug 18 14:45:16 xen02 kernel: [54277.861157] bnx2 0000:01:00.1 eno2: 1b 1efe80 dfe8 ff [0]
Aug 18 14:45:16 xen02 kernel: [54277.861186] bnx2 0000:01:00.1 eno2: 1c 0f7f80 dfb0 7f [0]
Aug 18 14:45:16 xen02 kernel: [54277.861214] bnx2 0000:01:00.1 eno2: 1d 1f7f80 fad8 fb [0]
Aug 18 14:45:16 xen02 kernel: [54277.861243] bnx2 0000:01:00.1 eno2: 1e 1fff80 fbd8 d7 [0]
Aug 18 14:45:16 xen02 kernel: [54277.861272] bnx2 0000:01:00.1 eno2: 1f 0bbf80 ffd8 bb [0]
Aug 18 14:45:16 xen02 kernel: [54277.861297] bnx2 0000:01:00.1 eno2: <--- end TBDC dump --->
Aug 18 14:45:16 xen02 kernel: [54277.861327] bnx2 0000:01:00.1 eno2: DEBUG: intr_sem[0] PCI_CMD[00100406]
Aug 18 14:45:16 xen02 kernel: [54277.861358] bnx2 0000:01:00.1 eno2: DEBUG: PCI_PM[19002008] PCI_MISC_CFG[92000088]
Aug 18 14:45:16 xen02 kernel: [54277.861851] bnx2 0000:01:00.1 eno2: DEBUG: EMAC_TX_STATUS[00000008] EMAC_RX_STATUS[00000000]
Aug 18 14:45:16 xen02 kernel: [54277.862318] bnx2 0000:01:00.1 eno2: DEBUG: RPM_MGMT_PKT_CTRL[40000088]
Aug 18 14:45:16 xen02 kernel: [54277.862770] bnx2 0000:01:00.1 eno2: DEBUG: HC_STATS_INTERRUPT_STATUS[01df0020]
Aug 18 14:45:16 xen02 kernel: [54277.863211] bnx2 0000:01:00.1 eno2: DEBUG: PBA[00000000]
Aug 18 14:45:16 xen02 kernel: [54277.863653] bnx2 0000:01:00.1 eno2: <--- start MCP states dump --->
Aug 18 14:45:16 xen02 kernel: [54277.864102] bnx2 0000:01:00.1 eno2: DEBUG: MCP_STATE_P0[0003610e] MCP_STATE_P1[0003610e]
Aug 18 14:45:16 xen02 kernel: [54277.864570] bnx2 0000:01:00.1 eno2: DEBUG: MCP mode[0000b880] state[80008000] evt_mask[00000500]
Aug 18 14:45:16 xen02 kernel: [54277.865039] bnx2 0000:01:00.1 eno2: DEBUG: pc[080009b8] pc[0800d240] instr[1440002c]
Aug 18 14:45:16 xen02 kernel: [54277.865515] bnx2 0000:01:00.1 eno2: DEBUG: shmem states:
Aug 18 14:45:16 xen02 kernel: [54277.865993] bnx2 0000:01:00.1 eno2: DEBUG: drv_mb[01030003] fw_mb[00000003] link_status[0000006e]
Aug 18 14:45:16 xen02 kernel: [54277.866494] drv_pulse_mb[00004ed3]
Aug 18 14:45:16 xen02 kernel: [54277.866498] bnx2 0000:01:00.1 eno2: DEBUG: dev_info_signature[44564903] reset_type[01005254]
Aug 18 14:45:16 xen02 kernel: [54277.867006] condition[0003610e]
Aug 18 14:45:16 xen02 kernel: [54277.867012] bnx2 0000:01:00.1 eno2: DEBUG: 000001c0: 01005254 42530000 0003610e 00000000
Aug 18 14:45:16 xen02 kernel: [54277.867565] bnx2 0000:01:00.1 eno2: DEBUG: 000003cc: 44444444 44444444 44444444 00000a28
Aug 18 14:45:16 xen02 kernel: [54277.868094] bnx2 0000:01:00.1 eno2: DEBUG: 000003dc: 0004ffff 00000000 00000000 00000000
Aug 18 14:45:16 xen02 kernel: [54277.868638] bnx2 0000:01:00.1 eno2: DEBUG: 000003ec: 00000000 00000000 00000000 00000000
Aug 18 14:45:16 xen02 kernel: [54277.869161] bnx2 0000:01:00.1 eno2: DEBUG: 0x3fc[0000ffff]
Aug 18 14:45:16 xen02 kernel: [54277.869686] bnx2 0000:01:00.1 eno2: <--- end MCP states dump --->
Aug 18 14:45:16 xen02 kernel: [54277.952626] bnx2 0000:01:00.1 eno2: NIC Copper Link is Down
Aug 18 14:45:16 xen02 kernel: [54277.953376] br-eno2: port 1(eno2) entered disabled state
Aug 18 14:45:19 xen02 kernel: [54281.121380] bnx2 0000:01:00.1 eno2: NIC Copper Link is Up, 1000 Mbps full duplex
Aug 18 14:45:19 xen02 kernel: [54281.121395] , receive & transmit flow control ON
Aug 18 14:45:19 xen02 kernel: [54281.121506] br-eno2: port 1(eno2) entered blocking state
Aug 18 14:45:19 xen02 kernel: [54281.121518] br-eno2: port 1(eno2) entered forwarding state
Aug 18 14:45:21 xen02 kernel: [54282.291106] bnx2 0000:01:00.1 eno2: NIC Copper Link is Down
Aug 18 14:45:21 xen02 kernel: [54282.292209] br-eno2: port 1(eno2) entered disabled state
Aug 18 14:45:23 xen02 kernel: [54284.644260] bnx2 0000:01:00.1 eno2: NIC Copper Link is Up, 1000 Mbps full duplex
Aug 18 14:45:23 xen02 kernel: [54284.644275] , receive & transmit flow control ON
Aug 18 14:45:23 xen02 kernel: [54284.644373] br-eno2: port 1(eno2) entered blocking state
Aug 18 14:45:23 xen02 kernel: [54284.644386] br-eno2: port 1(eno2) entered forwarding state
Aug 18 14:45:31 xen02 kernel: [54292.350727] usb 6-3: reset high-speed USB device number 4 using ehci-pci
Aug 18 14:45:47 xen02 kernel: [54308.549804] usb 6-3: device not accepting address 4, error -110
Aug 18 14:45:47 xen02 kernel: [54308.669880] usb 6-3: reset high-speed USB device number 4 using ehci-pci
Aug 18 14:46:03 xen02 kernel: [54324.676957] usb 6-3: device not accepting address 4, error -110
Aug 18 14:46:03 xen02 kernel: [54324.796936] usb 6-3: reset high-speed USB device number 4 using ehci-pci
Aug 18 14:46:10 xen02 kernel: [54331.872538] bnx2 0000:01:00.1 eno2: <--- start FTQ dump --->
Aug 18 14:46:10 xen02 kernel: [54331.873570] bnx2 0000:01:00.1 eno2: RV2P_PFTQ_CTL 00010000
Aug 18 14:46:10 xen02 kernel: [54331.874207] bnx2 0000:01:00.1 eno2: RV2P_TFTQ_CTL 00020000
Aug 18 14:46:10 xen02 kernel: [54331.874793] bnx2 0000:01:00.1 eno2: RV2P_MFTQ_CTL 00004000
Aug 18 14:46:10 xen02 kernel: [54331.875336] bnx2 0000:01:00.1 eno2: TBDR_FTQ_CTL 00004002
Aug 18 14:46:10 xen02 kernel: [54331.875876] bnx2 0000:01:00.1 eno2: TDMA_FTQ_CTL 00010000
Aug 18 14:46:10 xen02 kernel: [54331.876408] bnx2 0000:01:00.1 eno2: TXP_FTQ_CTL 00010000
Aug 18 14:46:10 xen02 kernel: [54331.876953] bnx2 0000:01:00.1 eno2: TXP_FTQ_CTL 00010000
Aug 18 14:46:10 xen02 kernel: [54331.877475] bnx2 0000:01:00.1 eno2: TPAT_FTQ_CTL 00010000
Aug 18 14:46:10 xen02 kernel: [54331.877999] bnx2 0000:01:00.1 eno2: RXP_CFTQ_CTL 00008000
Aug 18 14:46:10 xen02 kernel: [54331.878524] bnx2 0000:01:00.1 eno2: RXP_FTQ_CTL 00100000
Aug 18 14:46:10 xen02 kernel: [54331.879061] bnx2 0000:01:00.1 eno2: COM_COMXQ_FTQ_CTL 00010000
Aug 18 14:46:10 xen02 kernel: [54331.879595] bnx2 0000:01:00.1 eno2: COM_COMTQ_FTQ_CTL 00020000
Aug 18 14:46:10 xen02 kernel: [54331.880129] bnx2 0000:01:00.1 eno2: COM_COMQ_FTQ_CTL 00010000
Aug 18 14:46:10 xen02 kernel: [54331.880673] bnx2 0000:01:00.1 eno2: CP_CPQ_FTQ_CTL 00004000
Aug 18 14:46:10 xen02 kernel: [54331.881209] bnx2 0000:01:00.1 eno2: CPU states:
Aug 18 14:46:10 xen02 kernel: [54331.881754] bnx2 0000:01:00.1 eno2: 045000 mode b84c state 80001000 evt_mask 500 pc 8001294 pc 8001284 instr 8e260000
Aug 18 14:46:10 xen02 kernel: [54331.882330] bnx2 0000:01:00.1 eno2: 085000 mode b84c state 80005000 evt_mask 500 pc 8000a4c pc 8000a4c instr 10400016
Aug 18 14:46:10 xen02 kernel: [54331.882917] bnx2 0000:01:00.1 eno2: 0c5000 mode b84c state 80001000 evt_mask 500 pc 8004c20 pc 8004c14 instr 10e00088
Aug 18 14:46:10 xen02 kernel: [54331.883497] bnx2 0000:01:00.1 eno2: 105000 mode b8cc state 80000000 evt_mask 500 pc 8000aa4 pc 8000b28 instr 3c028000
Aug 18 14:46:10 xen02 kernel: [54331.884088] bnx2 0000:01:00.1 eno2: 145000 mode b880 state 80004000 evt_mask 500 pc 800adec pc 800ae00 instr 8c6366e4
Qualquer conselho seria muito apreciado. Obrigado!
atualização 20170926 - Atualizei a segunda máquina com 2 placas Intel NIC de porta dupla e desativei o bnx2 e a máquina ainda continua travando. Sem nenhum domU, a primeira máquina permaneceu ativa por 6 dias.
Responder1
Eu tive exatamente esse problema e estou convencido de que a culpa é do chipset instável. Veja abaixo os detalhes do histórico.
Gambiarra: Limite e fixe seu Dom0 para usar apenas 1 ou 2 núcleos de CPU, conforme sugerido emhttps://wiki.debian.org/Xen#Other_configuration_tweaksehttps://wiki.xenproject.org/wiki/Tuning_Xen_for_Performance
Etapas de solução alternativa:
1: Em /etc/default/grub, adicione/modifique GRUB_CMDLINE_XEN para conter:
GRUB_CMDLINE_XEN="dom0_max_vcpus=1 dom0_vcpus_pin"
(No meu Dom0, também limito a memória com:dom0_mem=2048Me desligue o balão automático)
2: execute update-grub
para atualizar o bootloader
3: No seu DomU/etc/xen/.cfgarquivos, adicione o seguinte a cada DomU para mantê-lo fora da CPU 0:
cpus="all,^0"
(ou se você limitou a 2 núcleos, use cpus="all,^0-1"
:)
4: Desligue seu DomU e reinicie para obter as novas configurações do kernel. Seu Dom0 agora deve ter apenas um VCPU aparecendo na top
saída
5: Redefina a placa "dias desde o último kernel panic" na sua parede e cruze os dedos!
Fundo:
Uma história triste! Isso começou a acontecer comigo imediatamente após atualizar um DomU ocupado para um novo PowerEdge R710 Dom0. Foi absolutamente brutal solucionar o problema! Aconteceu com apenas um DomU em execução na caixa (portanto, ter 24 VMs não é sua causa raiz). Nada funcionou para pará-lo ou corrigi-lo, ele seria acionado em horários de pico e o erro mudaria de "tempo limite da fila de transmissão esgotado" para erros com o controlador RAID sendo somente leitura. Tentei tudo na sua lista, incluindo mudar para NICs Intel e1000 e um novo chassi físico R710. Eu mexi em vão no BIOS tentando obter a NIC e o RAID em um IRQ separado. Durante uma semana, o servidor explodiu várias vezes ao dia com importantes sites de produção. Foi realmente horrível em todos os sentidos :(
Finalmente, encontrei algum alívio seguindo as sugestões no final deste bug aqui:https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=866952. Embora não fosse nossa descrição de bug (era Dom0 que não inicializa), era atual, Xen e um R710. Ele recomenda reduzir e fixar CPUS Dom0 conformehttps://wiki.xenproject.org/wiki/Tuning_Xen_for_Performance.
Desesperado por algo para tentar, eu tentei e (ZOMG!) funcionou! Limitar o Dom0 para usar apenas duas CPUs e fixar os DomUs para usar apenas os outros núcleos fez com que o problema desaparecesse para mim e permanecesse por mais de 2 meses. Na verdade, eu tinha certeza de que isso resolveria o problema completamente, mas o erro ocorreu novamente na semana passada. Vou tentar reduzir para apenas 1 CPU fixada para o Dom0 a seguir.
Estou convencido de que o problema é o tratamento de interrupções prejudicado pelo chipset Intel, e nenhuma das soluções alternativas que encontramos online funciona. Isso porque são todos de muitos anos atrás.
Responder2
Não em um R710, mas já vi isso com outros Broadcoms integrados, sempre foi uma NIC com defeito, suspeito que você precisará adicionar uma NIC plug-in ou ficar sem ela.
Você pode verificar isso inicializando a partir de um stick USB simples (não XEN) e saturando a NIC (por exemplo, com alguns netcat
s) - você provavelmente verá algumas mensagens de erro relacionadas a erros de barramento pouco antes da falha.