Debian server keeps rebooting unexpectedly

My lab server, running Debian-Wheezy-7.8-Stable, keeps rebooting without any notification, sometimes after only a few hours of activity. The server is set up for numerical computation under considerably high load, as well as parallel computing. I printed /var/log/messages and the output of last reboot, but I found these log messages hard to understand. I tried to look at the entries in /var/log/messages from just before the time of the reboot, but the file seems to contain only messages logged after the reboot.

I searched around and found that other people have the same problem, but the cause seems to differ from case to case, and /var/log/messages appears to be the key to investigating it. What does /var/log/messages actually say about this unwanted reboot? And how can a beginner start learning to read this log? I mean, are there any important keywords to look for, or something like that?

Thank you for any help you can provide.

last reboot

reboot   system boot  3.2.0-4-amd64    Wed May 20 03:29 - 12:43  (09:14)
reboot   system boot  3.2.0-4-amd64    Tue May 19 16:01 - 12:43  (20:42)

/var/log/messages

May 18 07:35:01 labserver rsyslogd: [origin software="rsyslogd" swVersion="5.8.11" x-pid="2400" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
May 19 07:35:01 labserver rsyslogd: [origin software="rsyslogd" swVersion="5.8.11" x-pid="2400" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
May 19 16:01:19 labserver kernel: imklog 5.8.11, log source = /proc/kmsg started.
May 19 16:01:19 labserver rsyslogd: [origin software="rsyslogd" swVersion="5.8.11" x-pid="2401" x-info="http://www.rsyslog.com"] start
May 19 16:01:19 labserver kernel: [    0.000000] Initializing cgroup subsys cpuset
May 19 16:01:19 labserver kernel: [    0.000000] Initializing cgroup subsys cpu
May 19 16:01:19 labserver kernel: [    0.000000] Linux version 3.2.0-4-amd64 ([email protected]) (gcc version 4.6.3 (Debian 4.6.3-14) ) #1 SMP Debian 3.2.65-1+deb7u2
May 19 16:01:19 labserver kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.2.0-4-amd64 root=UUID=1fc245ac-9058-4208-862a-7f4e8e1b20b2 ro text
May 19 16:01:19 labserver kernel: [    0.000000] BIOS-provided physical RAM map:
May 19 16:01:19 labserver kernel: [    0.000000]  BIOS-e820: 0000000000000000 - 000000000009ac00 (usable)
May 19 16:01:19 labserver kernel: [    0.000000]  BIOS-e820: 000000000009ac00 - 00000000000a0000 (reserved)
May 19 16:01:19 labserver kernel: [    0.000000]  BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
May 19 16:01:19 labserver kernel: [    0.000000]  BIOS-e820: 0000000000100000 - 000000007df71000 (usable)
May 19 16:01:19 labserver kernel: [    0.000000]  BIOS-e820: 000000007df71000 - 000000007e0f1000 (reserved)
May 19 16:01:19 labserver kernel: [    0.000000]  BIOS-e820: 000000007e0f1000 - 000000007e2ec000 (ACPI NVS)
May 19 16:01:19 labserver kernel: [    0.000000]  BIOS-e820: 000000007e2ec000 - 000000007f367000 (reserved)
May 19 16:01:19 labserver kernel: [    0.000000]  BIOS-e820: 000000007f367000 - 000000007f800000 (ACPI NVS)
May 19 16:01:19 labserver kernel: [    0.000000]  BIOS-e820: 0000000080000000 - 0000000090000000 (reserved)
May 19 16:01:19 labserver kernel: [    0.000000]  BIOS-e820: 00000000fed1c000 - 00000000fed40000 (reserved)
May 19 16:01:19 labserver kernel: [    0.000000]  BIOS-e820: 00000000ff000000 - 0000000100000000 (reserved)
May 19 16:01:19 labserver kernel: [    0.000000]  BIOS-e820: 0000000100000000 - 0000000880000000 (usable)
May 19 16:01:19 labserver kernel: [    0.000000] NX (Execute Disable) protection: active
May 19 16:01:19 labserver kernel: [    0.000000] SMBIOS 2.7 present.
May 19 16:01:19 labserver kernel: [    0.000000] No AGP bridge found
May 19 16:01:19 labserver kernel: [    0.000000] last_pfn = 0x880000 max_arch_pfn = 0x400000000
May 19 16:01:19 labserver kernel: [    0.000000] x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106
May 19 16:01:19 labserver kernel: [    0.000000] last_pfn = 0x7df71 max_arch_pfn = 0x400000000
May 19 16:01:19 labserver kernel: [    0.000000] found SMP MP-table at [ffff8800000fd900] fd900
May 19 16:01:19 labserver kernel: [    0.000000] Using GB pages for direct mapping
May 19 16:01:19 labserver kernel: [    0.000000] init_memory_mapping: 0000000000000000-000000007df71000
May 19 16:01:19 labserver kernel: [    0.000000] init_memory_mapping: 0000000100000000-0000000880000000
May 19 16:01:19 labserver kernel: [    0.000000] RAMDISK: 36bea000 - 375ed000
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: RSDP 00000000000f04a0 00024 (v02 ALASKA)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: XSDT 000000007e204088 0008C (v01 ALASKA    A M I 01072009 AMI  00010013)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: FACP 000000007e211040 0010C (v05 ALASKA    A M I 01072009 AMI  00010013)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI Warning: FADT (revision 5) is longer than ACPI 2.0 version, truncating length 268 to 244 (20110623/tbfadt-288)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: DSDT 000000007e2041a8 0CE96 (v02 ALASKA    A M I 00000015 INTL 20051117)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: FACS 000000007e2e3080 00040
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: APIC 000000007e211150 00100 (v03 ALASKA    A M I 01072009 AMI  00010013)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: FPDT 000000007e211250 00044 (v01 ALASKA    A M I 01072009 AMI  00010013)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: MCFG 000000007e211298 0003C (v01 ALASKA OEMMCFG. 01072009 MSFT 00000097)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: HPET 000000007e2112d8 00038 (v01 ALASKA    A M I 01072009 AMI. 00000005)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: PRAD 000000007e211310 000BE (v02 PRADID  PRADTID 00000001 MSFT 03000001)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: SPMI 000000007e2113d0 00040 (v05 A M I   OEMSPMI 00000000 AMI. 00000000)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: SSDT 000000007e211410 D0CB0 (v02  INTEL    CpuPm 00004000 INTL 20051117)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: EINJ 000000007e2e20c0 00130 (v01    AMI AMI EINJ 00000000      00000000)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: ERST 000000007e2e21f0 00230 (v01  AMIER AMI ERST 00000000      00000000)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: HEST 000000007e2e2420 000A8 (v01    AMI AMI HEST 00000000      00000000)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: BERT 000000007e2e24c8 00030 (v01    AMI AMI BERT 00000000      00000000)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: DMAR 000000007e2e24f8 000C4 (v01 A M I   OEMDMAR 00000001 INTL 00000001)
May 19 16:01:19 labserver kernel: [    0.000000] No NUMA configuration found
May 19 16:01:19 labserver kernel: [    0.000000] Faking a node at 0000000000000000-0000000880000000
May 19 16:01:19 labserver kernel: [    0.000000] Initmem setup node 0 0000000000000000-0000000880000000
May 19 16:01:19 labserver kernel: [    0.000000]   NODE_DATA [000000087fffb000 - 000000087fffffff]
May 19 16:01:19 labserver kernel: [    0.000000] Zone PFN ranges:
May 19 16:01:19 labserver kernel: [    0.000000]   DMA      0x00000010 -> 0x00001000
May 19 16:01:19 labserver kernel: [    0.000000]   DMA32    0x00001000 -> 0x00100000
May 19 16:01:19 labserver kernel: [    0.000000]   Normal   0x00100000 -> 0x00880000
May 19 16:01:19 labserver kernel: [    0.000000] Movable zone start PFN for each node
May 19 16:01:19 labserver kernel: [    0.000000] early_node_map[3] active PFN ranges
May 19 16:01:19 labserver kernel: [    0.000000]     0: 0x00000010 -> 0x0000009a
May 19 16:01:19 labserver kernel: [    0.000000]     0: 0x00000100 -> 0x0007df71
May 19 16:01:19 labserver kernel: [    0.000000]     0: 0x00100000 -> 0x00880000
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: PM-Timer IO Port: 0x408
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC (acpi_id[0x02] lapic_id[0x02] enabled)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC (acpi_id[0x04] lapic_id[0x04] enabled)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC (acpi_id[0x06] lapic_id[0x06] enabled)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC (acpi_id[0x08] lapic_id[0x08] enabled)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC (acpi_id[0x0a] lapic_id[0x0a] enabled)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC (acpi_id[0x03] lapic_id[0x03] enabled)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC (acpi_id[0x05] lapic_id[0x05] enabled)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC (acpi_id[0x07] lapic_id[0x07] enabled)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC (acpi_id[0x09] lapic_id[0x09] enabled)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC (acpi_id[0x0b] lapic_id[0x0b] enabled)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1])
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC_NMI (acpi_id[0x02] high edge lint[0x1])
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC_NMI (acpi_id[0x04] high edge lint[0x1])
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC_NMI (acpi_id[0x06] high edge lint[0x1])
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC_NMI (acpi_id[0x08] high edge lint[0x1])
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC_NMI (acpi_id[0x0a] high edge lint[0x1])
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC_NMI (acpi_id[0x03] high edge lint[0x1])
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC_NMI (acpi_id[0x05] high edge lint[0x1])
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC_NMI (acpi_id[0x07] high edge lint[0x1])
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC_NMI (acpi_id[0x09] high edge lint[0x1])
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC_NMI (acpi_id[0x0b] high edge lint[0x1])
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: IOAPIC (id[0x00] address[0xfec00000] gsi_base[0])
May 19 16:01:19 labserver kernel: [    0.000000] IOAPIC[0]: apic_id 0, version 32, address 0xfec00000, GSI 0-23
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: IOAPIC (id[0x02] address[0xfec01000] gsi_base[24])
May 19 16:01:19 labserver kernel: [    0.000000] IOAPIC[1]: apic_id 2, version 32, address 0xfec01000, GSI 24-47
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
May 19 16:01:19 labserver kernel: [    0.000000] Using ACPI (MADT) for SMP configuration information
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: HPET id: 0x8086a701 base: 0xfed00000
May 19 16:01:19 labserver kernel: [    0.000000] SMP: Allowing 12 CPUs, 0 hotplug CPUs
May 19 16:01:19 labserver kernel: [    0.000000] PM: Registered nosave memory: 000000000009a000 - 000000000009b000
May 19 16:01:19 labserver kernel: [    0.000000] PM: Registered nosave memory: 000000000009b000 - 00000000000a0000
May 19 16:01:19 labserver kernel: [    0.000000] PM: Registered nosave memory: 00000000000a0000 - 00000000000e0000
May 19 16:01:19 labserver kernel: [    0.000000] PM: Registered nosave memory: 00000000000e0000 - 0000000000100000
May 19 16:01:19 labserver kernel: [    0.000000] PM: Registered nosave memory: 000000007df71000 - 000000007e0f1000
May 19 16:01:19 labserver kernel: [    0.000000] PM: Registered nosave memory: 000000007e0f1000 - 000000007e2ec000
May 19 16:01:19 labserver kernel: [    0.000000] PM: Registered nosave memory: 000000007e2ec000 - 000000007f367000
May 19 16:01:19 labserver kernel: [    0.000000] PM: Registered nosave memory: 000000007f367000 - 000000007f800000
May 19 16:01:19 labserver kernel: [    0.000000] PM: Registered nosave memory: 000000007f800000 - 0000000080000000
May 19 16:01:19 labserver kernel: [    0.000000] PM: Registered nosave memory: 0000000080000000 - 0000000090000000
May 19 16:01:19 labserver kernel: [    0.000000] PM: Registered nosave memory: 0000000090000000 - 00000000fed1c000
May 19 16:01:19 labserver kernel: [    0.000000] PM: Registered nosave memory: 00000000fed1c000 - 00000000fed40000
May 19 16:01:19 labserver kernel: [    0.000000] PM: Registered nosave memory: 00000000fed40000 - 00000000ff000000
May 19 16:01:19 labserver kernel: [    0.000000] PM: Registered nosave memory: 00000000ff000000 - 0000000100000000
May 19 16:01:19 labserver kernel: [    0.000000] Allocating PCI resources starting at 90000000 (gap: 90000000:6ed1c000)
May 19 16:01:19 labserver kernel: [    0.000000] Booting paravirtualized kernel on bare hardware
May 19 16:01:19 labserver kernel: [    0.000000] setup_percpu: NR_CPUS:512 nr_cpumask_bits:512 nr_cpu_ids:12 nr_node_ids:1
May 19 16:01:19 labserver kernel: [    0.000000] PERCPU: Embedded 27 pages/cpu @ffff88087fc00000 s78848 r8192 d23552 u131072
May 19 16:01:19 labserver kernel: [    0.000000] Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 8258294
May 19 16:01:19 labserver kernel: [    0.000000] Policy zone: Normal
May 19 16:01:19 labserver kernel: [    0.000000] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-3.2.0-4-amd64 root=UUID=1fc245ac-9058-4208-862a-7f4e8e1b20b2 ro text
May 19 16:01:19 labserver kernel: [    0.000000] PID hash table entries: 4096 (order: 3, 32768 bytes)
May 19 16:01:19 labserver kernel: [    0.000000] xsave/xrstor: enabled xstate_bv 0x7, cntxt size 0x340
May 19 16:01:19 labserver kernel: [    0.000000] Checking aperture...
May 19 16:01:19 labserver kernel: [    0.000000] No AGP bridge found
May 19 16:01:19 labserver kernel: [    0.000000] Memory: 32975732k/35651584k available (3434k kernel code, 2130964k absent, 544888k reserved, 3305k data, 576k init)
May 19 16:01:19 labserver kernel: [    0.000000] Hierarchical RCU implementation.
May 19 16:01:19 labserver kernel: [    0.000000]    RCU dyntick-idle grace-period acceleration is enabled.
May 19 16:01:19 labserver kernel: [    0.000000] NR_IRQS:33024 nr_irqs:1184 16
May 19 16:01:19 labserver kernel: [    0.000000] Extended CMOS year: 2000
May 19 16:01:19 labserver kernel: [    0.000000] Console: colour VGA+ 80x25
May 19 16:01:19 labserver kernel: [    0.000000] console [tty0] enabled
May 19 16:01:19 labserver kernel: [    0.000000] Fast TSC calibration using PIT
May 19 16:01:19 labserver kernel: [    0.004000] Detected 2100.074 MHz processor.
May 19 16:01:19 labserver kernel: [    0.000003] Calibrating delay loop (skipped), value calculated using timer frequency.. 4200.14 BogoMIPS (lpj=8400296)
May 19 16:01:19 labserver kernel: [    0.000144] pid_max: default: 32768 minimum: 301
May 19 16:01:19 labserver kernel: [    0.000253] Security Framework initialized
May 19 16:01:19 labserver kernel: [    0.000324] AppArmor: AppArmor disabled by boot time parameter
May 19 16:01:19 labserver kernel: [    0.002355] Dentry cache hash table entries: 4194304 (order: 13, 33554432 bytes)
May 19 16:01:19 labserver kernel: [    0.011585] Inode-cache hash table entries: 2097152 (order: 12, 16777216 bytes)
May 19 16:01:19 labserver kernel: [    0.015724] Mount-cache hash table entries: 256
May 19 16:01:19 labserver kernel: [    0.015915] Initializing cgroup subsys cpuacct
May 19 16:01:19 labserver kernel: [    0.015986] Initializing cgroup subsys memory
May 19 16:01:19 labserver kernel: [    0.016063] Initializing cgroup subsys devices
May 19 16:01:19 labserver kernel: [    0.016133] Initializing cgroup subsys freezer
May 19 16:01:19 labserver kernel: [    0.016201] Initializing cgroup subsys net_cls
May 19 16:01:19 labserver kernel: [    0.016270] Initializing cgroup subsys blkio
May 19 16:01:19 labserver kernel: [    0.016344] Initializing cgroup subsys perf_event
May 19 16:01:19 labserver kernel: [    0.016441] CPU: Physical Processor ID: 0
May 19 16:01:19 labserver kernel: [    0.016509] CPU: Processor Core ID: 0
May 19 16:01:19 labserver kernel: [    0.017564] mce: CPU supports 23 MCE banks
May 19 16:01:19 labserver kernel: [    0.017670] CPU0: Thermal monitoring enabled (TM1)
May 19 16:01:19 labserver kernel: [    0.017768] using mwait in idle threads.
May 19 16:01:19 labserver kernel: [    0.018315] ACPI: Core revision 20110623
May 19 16:01:19 labserver kernel: [    0.049889] DMAR: Host address width 46
May 19 16:01:19 labserver kernel: [    0.049958] DMAR: DRHD base: 0x000000fbffc000 flags: 0x1
May 19 16:01:19 labserver kernel: [    0.050034] IOMMU 0: reg_base_addr fbffc000 ver 1:0 cap d2078c106f0466 ecap f020de
May 19 16:01:19 labserver kernel: [    0.050122] DMAR: RMRR base: 0x0000007f239000 end: 0x0000007f247fff
May 19 16:01:19 labserver kernel: [    0.050195] DMAR: ATSR flags: 0x0
May 19 16:01:19 labserver kernel: [    0.050261] DMAR: RHSA base: 0x000000fbffc000 proximity domain: 0x0
May 19 16:01:19 labserver kernel: [    0.050427] IOAPIC id 0 under DRHD base  0xfbffc000 IOMMU 0
May 19 16:01:19 labserver kernel: [    0.050497] IOAPIC id 2 under DRHD base  0xfbffc000 IOMMU 0
May 19 16:01:19 labserver kernel: [    0.050568] HPET id 0 under DRHD base 0xfbffc000
May 19 16:01:19 labserver kernel: [    0.050741] Enabled IRQ remapping in x2apic mode
May 19 16:01:19 labserver kernel: [    0.050810] Enabling x2apic
May 19 16:01:19 labserver kernel: [    0.050875] Enabled x2apic
May 19 16:01:19 labserver kernel: [    0.050943] Switched APIC routing to cluster x2apic.
May 19 16:01:19 labserver kernel: [    0.051552] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
May 19 16:01:19 labserver kernel: [    0.091256] CPU0: Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz stepping 04
May 19 16:01:19 labserver kernel: [    0.195570] Performance Events: PEBS fmt1+, generic architected perfmon, Intel PMU driver.
May 19 16:01:19 labserver kernel: [    0.195802] ... version:                3
May 19 16:01:19 labserver kernel: [    0.195869] ... bit width:              48
May 19 16:01:19 labserver kernel: [    0.195936] ... generic registers:      4
May 19 16:01:19 labserver kernel: [    0.196003] ... value mask:             0000ffffffffffff
May 19 16:01:19 labserver kernel: [    0.196073] ... max period:             000000007fffffff
May 19 16:01:19 labserver kernel: [    0.196143] ... fixed-purpose events:   3
May 19 16:01:19 labserver kernel: [    0.196210] ... event mask:             000000070000000f
May 19 16:01:19 labserver kernel: [    0.196468] NMI watchdog enabled, takes one hw-pmu counter.
May 19 16:01:19 labserver kernel: [    0.196637] Booting Node   0, Processors  #1
May 19 16:01:19 labserver kernel: [    0.312587] NMI watchdog enabled, takes one hw-pmu counter.
May 19 16:01:19 labserver kernel: [    0.312765]  #2
May 19 16:01:19 labserver kernel: [    0.424400] NMI watchdog enabled, takes one hw-pmu counter.
May 19 16:01:19 labserver kernel: [    0.424578]  #3
May 19 16:01:19 labserver kernel: [    0.536316] NMI watchdog enabled, takes one hw-pmu counter.
May 19 16:01:19 labserver kernel: [    0.536489]  #4
May 19 16:01:19 labserver kernel: [    0.648124] NMI watchdog enabled, takes one hw-pmu counter.
May 19 16:01:19 labserver kernel: [    0.648303]  #5
May 19 16:01:19 labserver kernel: [    0.759941] NMI watchdog enabled, takes one hw-pmu counter.
May 19 16:01:19 labserver kernel: [    0.760115]  #6
May 19 16:01:19 labserver kernel: [    0.871864] NMI watchdog enabled, takes one hw-pmu counter.
May 19 16:01:19 labserver kernel: [    0.872050]  #7
May 19 16:01:19 labserver kernel: [    0.983690] NMI watchdog enabled, takes one hw-pmu counter.
May 19 16:01:19 labserver kernel: [    0.983866]  #8
May 19 16:01:19 labserver kernel: [    1.095600] NMI watchdog enabled, takes one hw-pmu counter.
May 19 16:01:19 labserver kernel: [    1.095774]  #9
May 19 16:01:19 labserver kernel: [    1.207414] NMI watchdog enabled, takes one hw-pmu counter.
May 19 16:01:19 labserver kernel: [    1.207589]  #10
May 19 16:01:19 labserver kernel: [    1.319223] NMI watchdog enabled, takes one hw-pmu counter.
May 19 16:01:19 labserver kernel: [    1.319400]  #11 Ok.
May 19 16:01:19 labserver kernel: [    1.431095] NMI watchdog enabled, takes one hw-pmu counter.
May 19 16:01:19 labserver kernel: [    1.431192] Brought up 12 CPUs
May 19 16:01:19 labserver kernel: [    1.431260] Total of 12 processors activated (50398.84 BogoMIPS).
May 19 16:01:19 labserver kernel: [    1.450786] devtmpfs: initialized
May 19 16:01:19 labserver kernel: [    1.455360] PM: Registering ACPI NVS region at 7e0f1000 (2076672 bytes)
May 19 16:01:19 labserver kernel: [    1.455494] PM: Registering ACPI NVS region at 7f367000 (4820992 bytes)
May 19 16:01:19 labserver kernel: [    1.455843] print_constraints: dummy: 
May 19 16:01:19 labserver kernel: [    1.455977] NET: Registered protocol family 16
May 19 16:01:19 labserver kernel: [    1.456140] ACPI: bus type pci registered
May 19 16:01:19 labserver kernel: [    1.456268] PCI: MMCONFIG for domain 0000 [bus 00-ff] at [mem 0x80000000-0x8fffffff] (base 0x80000000)
May 19 16:01:19 labserver kernel: [    1.456361] PCI: MMCONFIG at [mem 0x80000000-0x8fffffff] reserved in E820
May 19 16:01:19 labserver kernel: [    1.466673] PCI: Using configuration type 1 for base access
May 19 16:01:19 labserver kernel: [    1.468173] bio: create slab <bio-0> at 0
May 19 16:01:19 labserver kernel: [    1.468353] ACPI: Added _OSI(Module Device)
May 19 16:01:19 labserver kernel: [    1.468422] ACPI: Added _OSI(Processor Device)
May 19 16:01:19 labserver kernel: [    1.468491] ACPI: Added _OSI(3.0 _SCP Extensions)
May 19 16:01:19 labserver kernel: [    1.468560] ACPI: Added _OSI(Processor Aggregator Device)
May 19 16:01:19 labserver kernel: [    1.484562] ACPI: Executed 1 blocks of module-level executable AML code
May 19 16:01:19 labserver kernel: [    1.727818] ACPI: Interpreter enabled
May 19 16:01:19 labserver kernel: [    1.727891] ACPI: (supports S0 S1 S4 S5)
May 19 16:01:19 labserver kernel: [    1.728159] ACPI: Using IOAPIC for interrupt routing
May 19 16:01:19 labserver kernel: [    1.736531] ACPI: No dock devices found.
May 19 16:01:19 labserver kernel: [    1.736630] HEST: Table parsing has been initialized.
May 19 16:01:19 labserver kernel: [    1.736704] PCI: Using host bridge windows from ACPI; if necessary, use "pci=nocrs" and report a bug
May 19 16:01:19 labserver kernel: [    1.737041] ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-fe])
May 19 16:01:19 labserver kernel: [    1.737361] pci_root PNP0A08:00: host bridge window [io  0x0000-0x03af]
May 19 16:01:19 labserver kernel: [    1.737435] pci_root PNP0A08:00: host bridge window [io  0x03e0-0x0cf7]
May 19 16:01:19 labserver kernel: [    1.737508] pci_root PNP0A08:00: host bridge window [io  0x03b0-0x03df]
May 19 16:01:19 labserver kernel: [    1.737586] pci_root PNP0A08:00: host bridge window [io  0x0d00-0xffff]
May 19 16:01:19 labserver kernel: [    1.737659] pci_root PNP0A08:00: host bridge window [mem 0x000a0000-0x000bffff]
May 19 16:01:19 labserver kernel: [    1.737747] pci_root PNP0A08:00: host bridge window [mem 0x000c0000-0x000dffff]
May 19 16:01:19 labserver kernel: [    1.737834] pci_root PNP0A08:00: host bridge window [mem 0xfed0e000-0xfed0ffff]
May 19 16:01:19 labserver kernel: [    1.737922] pci_root PNP0A08:00: host bridge window [mem 0x80000000-0xfbffffff]
May 19 16:01:19 labserver kernel: [    1.740791] pci 0000:00:01.0: PCI bridge to [bus 01-01]
May 19 16:01:19 labserver kernel: [    1.745575] pci 0000:00:01.1: PCI bridge to [bus 02-03]
May 19 16:01:19 labserver kernel: [    1.745700] pci 0000:00:02.0: PCI bridge to [bus 04-04]
May 19 16:01:19 labserver kernel: [    1.745816] pci 0000:00:03.0: PCI bridge to [bus 05-05]
May 19 16:01:19 labserver kernel: [    1.745933] pci 0000:00:03.2: PCI bridge to [bus 06-06]
May 19 16:01:19 labserver kernel: [    1.746285] pci 0000:00:11.0: PCI bridge to [bus 07-07]
May 19 16:01:19 labserver kernel: [    1.746541] pci 0000:00:1e.0: PCI bridge to [bus 08-08] (subtractive decode)
May 19 16:01:19 labserver kernel: [    1.747170]  pci0000:00: Requesting ACPI _OSC control (0x1d)
May 19 16:01:19 labserver kernel: [    1.747465]  pci0000:00: ACPI _OSC control (0x15) granted
May 19 16:01:19 labserver kernel: [    1.756901] ACPI: PCI Root Bridge [UNC0] (domain 0000 [bus ff])
May 19 16:01:19 labserver kernel: [    1.758443]  pci0000:ff: Requesting ACPI _OSC control (0x1d)
May 19 16:01:19 labserver kernel: [    1.758528]  pci0000:ff: ACPI _OSC control (0x1d) granted
May 19 16:01:19 labserver kernel: [    1.759439] ACPI: PCI Interrupt Link [LNKA] (IRQs 3 4 5 6 7 10 *11 12 14 15)
May 19 16:01:19 labserver kernel: [    1.760105] ACPI: PCI Interrupt Link [LNKB] (IRQs 3 4 5 6 7 *10 11 12 14 15)
May 19 16:01:19 labserver kernel: [    1.760768] ACPI: PCI Interrupt Link [LNKC] (IRQs 3 4 *5 6 10 11 12 14 15)
May 19 16:01:19 labserver kernel: [    1.761383] ACPI: PCI Interrupt Link [LNKD] (IRQs 3 4 5 6 10 *11 12 14 15)
May 19 16:01:19 labserver kernel: [    1.762006] ACPI: PCI Interrupt Link [LNKE] (IRQs 3 4 5 6 7 10 11 12 14 15) *0
May 19 16:01:19 labserver kernel: [    1.762729] ACPI: PCI Interrupt Link [LNKF] (IRQs 3 4 5 6 7 10 11 12 14 15) *0
May 19 16:01:19 labserver kernel: [    1.763450] ACPI: PCI Interrupt Link [LNKG] (IRQs 3 4 5 6 7 10 11 12 14 15) *0
May 19 16:01:19 labserver kernel: [    1.764170] ACPI: PCI Interrupt Link [LNKH] (IRQs 3 4 5 6 *7 10 11 12 14 15)

Answer 1

You need to provide more information, especially the log entries from just before the system rebooted. That said, as far as I can see, this log may not have anything more to offer. Check other logs, such as syslog.
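To pull out "the entries just before the reboot" without scrolling through the whole boot sequence, something like the following Python sketch may help. It assumes that each boot is marked by the kernel's "Linux version" banner at timestamp 0.000000, as in the excerpt posted above, and that /var/log/syslog exists and is readable (on Debian it usually holds more than /var/log/messages, and reading it usually requires root). The function names are made up for this illustration.

```python
import re
from collections import deque

# Assumption: each boot is marked by the kernel's "Linux version" banner at
# timestamp 0.000000, as in the excerpt posted in the question.
BOOT_BANNER = re.compile(r"kernel: \[\s*0\.000000\] Linux version")


def lines_before_boots(lines, context=20):
    """For every boot banner found, return the `context` lines logged just before it."""
    recent = deque(maxlen=context)  # sliding window of the most recent lines
    chunks = []
    for raw in lines:
        line = raw.rstrip("\n")
        if BOOT_BANNER.search(line):
            chunks.append(list(recent))  # snapshot of what preceded this boot
            recent.clear()
        recent.append(line)
    return chunks


def print_report(path="/var/log/syslog", context=20):
    """Print the tail of each previous session found in `path`."""
    with open(path, errors="replace") as fh:
        for i, chunk in enumerate(lines_before_boots(fh, context), 1):
            print(f"--- last {len(chunk)} lines before boot #{i} ---")
            print("\n".join(chunk))
```

Calling print_report("/var/log/messages") runs the same check on the file from the question. Each chunk will end with a few logger start-up lines (imklog, rsyslogd start); what matters is whatever comes before those. If it is only routine noise from hours earlier, the machine went down without getting a chance to log anything.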

In my experience, sudden reboots with no indication of what actually went wrong are usually hardware-related. Otherwise, the kernel would have had a chance to write something to the logs to give a clue.
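On the question of keywords: when the kernel does get a chance to write something, it tends to use a handful of recognizable phrases. Here is a rough Python sketch that scans log lines for some of them. The pattern list is my own illustrative selection, not an exhaustive or official one, and the function name is hypothetical.

```python
import re

# Assumption: an illustrative, non-exhaustive set of phrases the Linux kernel
# is known to print when something goes badly wrong.
SUSPICIOUS = re.compile(
    r"kernel panic"                    # "Kernel panic - not syncing: ..."
    r"|\bOops\b"                       # kernel oops reports
    r"|\bBUG:"                         # "BUG: unable to handle ...", "BUG: soft lockup"
    r"|Machine check|Hardware Error"   # machine-check (MCE) reports
    r"|Out of memory"                  # OOM killer activity
    r"|temperature above threshold"    # CPU thermal throttling
    r"|I/O error"                      # failing disks or controllers
    r"|soft lockup|segfault",
    re.IGNORECASE,
)


def find_suspicious(lines):
    """Return (line_number, text) for every line that matches a pattern."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, 1)
        if SUSPICIOUS.search(line)
    ]
```

For example, find_suspicious(open("/var/log/syslog", errors="replace")) would list the matching lines with their line numbers. An empty result does not prove the system is healthy; it fits the hardware or power explanation, where nothing gets logged at all.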

Some common causes of sudden reboots:

  • Overheating, probably the leading cause. Get an idea of the temperature and try to log it. Does the server have a display that can show the temperature? Is the room adequately cooled? Maybe replace the thermal paste on the heatsinks covering the CPU(s).

  • Bad hardware or drivers. Get a list using "lspci". For example, a bad DIMM can cause a system to crash and/or reboot suddenly (reseat DIMMs, CPUs and cards). I remember a server that would occasionally reboot because of a problem with its Intel Ethernet card. A failing disk can sometimes cause such problems too, though it would normally just make the system hang rather than reboot.

  • A bad UPS. I remember a UPS whose battery was slowly failing, and one of the signs was a regular weekly power cycle of the servers attached to it. You might also simply have a misconfigured power-cycle schedule.
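For the "try to log it" part of the overheating check, a minimal Python sketch could look like the one below. It assumes the kernel exposes thermal zones under /sys/class/thermal with values in millidegrees Celsius. Not every server board does, in which case the sensors command from lm-sensors or the board's IPMI/BMC readings would be the place to look instead. The function names and the log file path are made up for this example.

```python
import glob
import os
import time


def read_temperatures(base="/sys/class/thermal"):
    """Return {zone_name: degrees_celsius} for each thermal zone found under `base`."""
    temps = {}
    pattern = os.path.join(base, "thermal_zone*", "temp")
    for path in sorted(glob.glob(pattern)):
        zone = os.path.basename(os.path.dirname(path))
        try:
            with open(path) as fh:
                # sysfs reports millidegrees Celsius, e.g. "45500" for 45.5 C
                temps[zone] = int(fh.read().strip()) / 1000.0
        except (OSError, ValueError):
            continue  # skip zones that cannot be read or parsed
    return temps


def log_temperatures(logfile, interval=30, base="/sys/class/thermal"):
    """Append one timestamped line every `interval` seconds until interrupted."""
    with open(logfile, "a") as out:
        while True:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            reading = " ".join(
                f"{zone}={temp:.1f}C"
                for zone, temp in read_temperatures(base).items()
            )
            out.write(f"{stamp} {reading}\n")
            out.flush()
            os.fsync(out.fileno())  # push to disk so the last reading survives a reset
            time.sleep(interval)
```

Running log_temperatures("/root/temperature.log") in the background during a heavy computation and then checking the last lines after the next unexpected reboot should show whether the temperature was climbing right before the machine went down.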
