Troubleshooting hard crashes on Ubuntu 22.10 with a cool system and no heavy load

I am running Ubuntu 22.10 on my home desktop machine.

My system crashes at what appear to be random intervals, with no immediate cause that I can think of. Without any warning, and without any particular action that might trigger it, the computer simply powers off and reboots, as if someone had pressed the "Reset" button on the case/BIOS.

These are my sensor readings immediately after one of these crashes:

k10temp-pci-00c3
Adapter: PCI adapter
Tctl:         +58.6°C  
Tccd1:        +46.2°C  
Tccd2:        +44.5°C  

nvme-pci-2200
Adapter: PCI adapter
Composite:    +46.9°C  (low  = -273.1°C, high = +81.8°C)
                       (crit = +84.8°C)
Sensor 1:     +46.9°C  (low  = -273.1°C, high = +65261.8°C)
Sensor 2:     +56.9°C  (low  = -273.1°C, high = +65261.8°C)

nvme-pci-2300
Adapter: PCI adapter
Composite:    +56.9°C  (low  =  -0.1°C, high = +89.8°C)
                       (crit = +94.8°C)

iwlwifi_1-virtual-0
Adapter: Virtual device
temp1:        +42.0°C  

nct6797-isa-0a20
Adapter: ISA adapter
in0:            1.26 V  (min =  +0.00 V, max =  +1.74 V)
in1:          1000.00 mV (min =  +0.00 V, max =  +0.00 V)  ALARM
in2:            3.33 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in3:            3.31 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in4:            1.02 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in5:          160.00 mV (min =  +0.00 V, max =  +0.00 V)  ALARM
in6:          672.00 mV (min =  +0.00 V, max =  +0.00 V)  ALARM
in7:            3.33 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in8:            3.30 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in9:            1.84 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in10:           0.00 V  (min =  +0.00 V, max =  +0.00 V)
in11:         456.00 mV (min =  +0.00 V, max =  +0.00 V)  ALARM
in12:           1.10 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in13:         680.00 mV (min =  +0.00 V, max =  +0.00 V)  ALARM
in14:           1.53 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
fan1:            0 RPM  (min =    0 RPM)
fan2:         1086 RPM  (min =    0 RPM)
fan3:            0 RPM  (min =    0 RPM)
fan4:            0 RPM  (min =    0 RPM)
fan5:          699 RPM  (min =    0 RPM)
fan6:          969 RPM  (min =    0 RPM)
fan7:         1422 RPM  (min =    0 RPM)
SYSTIN:        +47.0°C  (high = +80.0°C, hyst = +75.0°C)  sensor = CPU diode
CPUTIN:        +41.0°C  (high = +108.0°C, hyst = +90.0°C)  sensor = thermistor
AUXTIN0:       +45.0°C  (high = +108.0°C, hyst = +90.0°C)  sensor = thermistor
AUXTIN1:      -128.0°C    sensor = thermistor
AUXTIN2:       +62.0°C    sensor = thermistor
AUXTIN3:        -2.0°C    sensor = thermistor
Virtual_TEMP:  +58.0°C  
Virtual_TEMP:  +59.0°C  
Virtual_TEMP:  +58.0°C  
Virtual_TEMP:  +58.0°C  
TSI0_TEMP:     +58.5°C  
intrusion0:   ALARM
intrusion1:   ALARM
beep_enable:  disabled

nvme-pci-0100
Adapter: PCI adapter
Composite:    +45.9°C  (low  = -273.1°C, high = +84.8°C)
                       (crit = +84.8°C)
Sensor 1:     +45.9°C  (low  = -273.1°C, high = +65261.8°C)
Sensor 2:     +48.9°C  (low  = -273.1°C, high = +65261.8°C)
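Some readings in a dump like this probably come from unconnected or unconfigured channels rather than real measurements, for example AUXTIN1 at -128.0°C, or the in1 to in14 ALARM flags where both limits are 0 V. The following is a minimal sketch for separating those from the readings worth attention. The function name, the regular expressions, and the 0°C to 125°C plausibility range are assumptions made for illustration; they are not part of lm-sensors.

```python
import re

# Voltage line, e.g. "in1:  1000.00 mV (min =  +0.00 V, max =  +0.00 V)  ALARM"
VOLT_RE = re.compile(
    r"^(?P<label>\w+):\s+[\d.]+ m?V\s+"
    r"\(min =\s*\+?(?P<min>[\d.]+) V, max =\s*\+?(?P<max>[\d.]+) V\)"
    r"(?P<alarm>\s+ALARM)?"
)
# Temperature line, e.g. "AUXTIN1:      -128.0°C    sensor = thermistor"
TEMP_RE = re.compile(r"^(?P<label>[\w ]+):\s+(?P<value>[+-][\d.]+)°C")


def questionable_readings(text, low=0.0, high=125.0):
    """Return (label, reason) pairs for readings that are likely artifacts.

    `low` and `high` are assumed plausibility bounds in degrees Celsius
    for a desktop PC; adjust them to taste.
    """
    findings = []
    for raw in text.splitlines():
        line = raw.strip()
        volt = VOLT_RE.match(line)
        if volt:
            limits_unset = float(volt["min"]) == 0.0 and float(volt["max"]) == 0.0
            # An alarm against limits of 0 V says nothing about the rail itself.
            if volt["alarm"] and limits_unset:
                findings.append((volt["label"], "ALARM with min = max = 0 V"))
            continue
        temp = TEMP_RE.match(line)
        if temp and not low <= float(temp["value"]) <= high:
            findings.append((temp["label"], "temperature outside plausible range"))
    return findings
```

Feeding it the saved text output of sensors lists the channels that can most likely be ignored, leaving the CPU, NVMe, and fan readings to look at more closely.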

This is my journalctl -b -1 -e output, which shows that nothing was logged before the crash/reboot:

mag 29 10:49:14 bwian-MS-7C35 gnome-shell[2742]: Window manager warning: Overwriting existing binding of keysym 38 with keysym 38 (keycode 11).
mag 29 10:49:14 bwian-MS-7C35 gnome-shell[2742]: Window manager warning: Overwriting existing binding of keysym 39 with keysym 39 (keycode 12).
mag 29 10:50:04 bwian-MS-7C35 gnome-shell[2742]: Can't update stage views actor <unnamed>[<MetaWindowGroup>:0x5650fd7ec680] is on because it needs an allocation.
mag 29 10:50:04 bwian-MS-7C35 gnome-shell[2742]: Can't update stage views actor <unnamed>[<MetaWindowActorX11>:0x5650fdea1ba0] is on because it needs an allocation.
mag 29 10:50:04 bwian-MS-7C35 gnome-shell[2742]: Can't update stage views actor <unnamed>[<MetaSurfaceActorX11>:0x5651001067c0] is on because it needs an allocation.
mag 29 10:50:25 bwian-MS-7C35 gnome-shell[2742]: Can't update stage views actor <unnamed>[<MetaSurfaceActorX11>:0x5651001067c0] is on because it needs an allocation.
mag 29 10:52:10 bwian-MS-7C35 gnome-shell[2742]: Can't update stage views actor <unnamed>[<MetaWindowGroup>:0x5650fd7ec680] is on because it needs an allocation.
mag 29 10:52:10 bwian-MS-7C35 gnome-shell[2742]: Can't update stage views actor <unnamed>[<MetaWindowActorX11>:0x5650ffdcd1f0] is on because it needs an allocation.
mag 29 10:52:10 bwian-MS-7C35 gnome-shell[2742]: Can't update stage views actor <unnamed>[<MetaSurfaceActorX11>:0x565109af6f60] is on because it needs an allocation.
mag 29 10:53:08 bwian-MS-7C35 [email protected][2742]: Microsoft Teams - Preview1, Impossible to lookup icon for 'Microsoft Teams - Preview1_13-panel' in path /tmp/.org.chr>
mag 29 10:53:08 bwian-MS-7C35 [email protected][2742]: unable to update icon for Microsoft Teams - Preview1
mag 29 10:53:15 bwian-MS-7C35 [email protected][2742]: Microsoft Teams - Preview1, Impossible to lookup icon for 'Microsoft Teams - Preview1_14-panel' in path /tmp/.org.chr>
mag 29 10:53:15 bwian-MS-7C35 [email protected][2742]: unable to update icon for Microsoft Teams - Preview1
mag 29 10:55:01 bwian-MS-7C35 CRON[29178]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
mag 29 10:55:01 bwian-MS-7C35 CRON[29179]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
mag 29 10:55:01 bwian-MS-7C35 CRON[29178]: pam_unix(cron:session): session closed for user root

kern.log does not show anything that looks relevant to me either:

May 29 09:48:09 bwian-MS-7C35 kernel: [ 2121.745635] kauditd_printk_skb: 7 callbacks suppressed
May 29 09:48:09 bwian-MS-7C35 kernel: [ 2121.745638] audit: type=1400 audit(1685346489.868:116): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="libreoffice-oosplash" pid=18136 comm="apparmor_parser"
May 29 09:48:09 bwian-MS-7C35 kernel: [ 2121.767162] audit: type=1400 audit(1685346489.888:117): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="libreoffice-senddoc" pid=18140 comm="apparmor_parser"
May 29 09:48:12 bwian-MS-7C35 kernel: [ 2124.796003] audit: type=1400 audit(1685346492.916:118): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="libreoffice-soffice" pid=18143 comm="apparmor_parser"
May 29 09:48:12 bwian-MS-7C35 kernel: [ 2124.822358] audit: type=1400 audit(1685346492.944:119): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="libreoffice-soffice//gpg" pid=18143 comm="apparmor_parser"
May 29 09:48:12 bwian-MS-7C35 kernel: [ 2124.846377] audit: type=1400 audit(1685346492.968:120): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="libreoffice-xpdfimport" pid=18182 comm="apparmor_parser"
May 29 11:04:23 bwian-MS-7C35 kernel: [    0.000000] Linux version 5.19.0-42-generic (buildd@lcy02-amd64-019) (x86_64-linux-gnu-gcc-12 (Ubuntu 12.2.0-3ubuntu1) 12.2.0, GNU ld (GNU Binutils for Ubuntu) 2.39) #43-Ubuntu SMP PREEMPT_DYNAMIC Tue Apr 18 18:21:28 UTC 2023 (Ubuntu 5.19.0-42.43-generic 5.19.17)
May 29 11:04:23 bwian-MS-7C35 kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.19.0-42-generic root=UUID=ea1660b0-ea10-41d0-baa8-bc942fb21e02 ro quiet splash vt.handoff=7
May 29 11:04:23 bwian-MS-7C35 kernel: [    0.000000] KERNEL supported cpus:
May 29 11:04:23 bwian-MS-7C35 kernel: [    0.000000]   Intel GenuineIntel
May 29 11:04:23 bwian-MS-7C35 kernel: [    0.000000]   AMD AuthenticAMD
May 29 11:04:23 bwian-MS-7C35 kernel: [    0.000000]   Hygon HygonGenuine
May 29 11:04:23 bwian-MS-7C35 kernel: [    0.000000]   Centaur CentaurHauls
May 29 11:04:23 bwian-MS-7C35 kernel: [    0.000000]   zhaoxin   Shanghai  
May 29 11:04:23 bwian-MS-7C35 kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
May 29 11:04:23 bwian-MS-7C35 kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
May 29 11:04:23 bwian-MS-7C35 kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
May 29 11:04:23 bwian-MS-7C35 kernel: [    0.000000] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
May 29 11:04:23 bwian-MS-7C35 kernel: [    0.000000] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'compacted' format.
May 29 11:04:23 bwian-MS-7C35 kernel: [    0.000000] signal: max sigframe size: 1776
May 29 11:04:23 bwian-MS-7C35 kernel: [    0.000000] BIOS-provided physical RAM map:
May 29 11:04:23 bwian-MS-7C35 kernel: [    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009ffff] usable
May 29 11:04:23 bwian-MS-7C35 kernel: [    0.000000] BIOS-e820: [mem 0x00000000000a0000-0x00000000000fffff] reserved
May 29 11:04:23 bwian-MS-7C35 kernel: [    0.000000] BIOS-e820: [mem 0x0000000000100000-0x0000000009d81fff] usable
May 29 11:04:23 bwian-MS-7C35 kernel: [    0.000000] BIOS-e820: [mem 0x0000000009d82000-0x0000000009ffffff] reserved
May 29 11:04:23 bwian-MS-7C35 kernel: [    0.000000] BIOS-e820: [mem 0x000000000a000000-0x000000000a1fffff] usable
May 29 11:04:23 bwian-MS-7C35 kernel: [    0.000000] BIOS-e820: [mem 0x000000000a200000-0x000000000a20ffff] ACPI NVS
May 29 11:04:23 bwian-MS-7C35 kernel: [    0.000000] BIOS-e820: [mem 0x000000000a210000-0x00000000cacb0fff] usable
May 29 11:04:23 bwian-MS-7C35 kernel: [    0.000000] BIOS-e820: [mem 0x00000000cacb1000-0x00000000cb0a8fff] reserved
May 29 11:04:23 bwian-MS-7C35 kernel: [    0.000000] BIOS-e820: [mem 0x00000000cb0a9000-0x00000000cb10cfff] ACPI data
May 29 11:04:23 bwian-MS-7C35 kernel: [    0.000000] BIOS-e820: [mem 0x00000000cb10d000-0x00000000ccc0cfff] ACPI NVS
May 29 11:04:23 bwian-MS-7C35 kernel: [    0.000000] BIOS-e820: [mem 0x00000000ccc0d000-0x00000000cdbfefff] reserved
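In the excerpt above the kernel timestamp jumps from about 2124 seconds straight back to 0.000000, with no shutdown-related message in between. Finding every such jump in a long kern.log by hand is tedious, so here is a minimal sketch that reports the last kernel line logged before each boot. The function name and the timestamp regular expression are assumptions for illustration. It finds every boot boundary, clean or not, so the reported line still has to be read to judge whether the previous session ended normally.

```python
import re

# Kernel uptime stamp as it appears in kern.log, e.g. "kernel: [ 2124.846377]"
TS_RE = re.compile(r"kernel: \[\s*(\d+\.\d+)\]")


def boot_boundaries(lines):
    """Return (uptime_seconds, last_line) for every point where the kernel
    uptime stamp goes backwards, i.e. where a new boot starts."""
    results = []
    prev_ts = None
    prev_line = None
    for line in lines:
        match = TS_RE.search(line)
        if not match:
            continue  # wrapped continuation or non-kernel line
        ts = float(match.group(1))
        if prev_ts is not None and ts < prev_ts:
            results.append((prev_ts, prev_line.rstrip("\n")))
        prev_ts, prev_line = ts, line
    return results


def report(path="/var/log/kern.log"):
    """Print the last kernel line before each boot found in `path`."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        for uptime, line in boot_boundaries(handle):
            print(f"boot ended after {uptime:.0f} s of uptime: {line}")
```

Calling report() (reading kern.log may require root or membership in the adm group) on a machine with this problem should show ordinary, unrelated messages as the last line before each crash, which matches the picture of a reset that leaves no trace.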

Additional notes:

  • There is nothing I can do to reproduce the problem.
  • Temperatures look fine, the thermal paste was replaced last year, and I am using a high-performance Noctua NX-15 cooler. The PC is dust-free.
  • The system is up to date.
  • At times the problem has correlated with heavy load, on either the GPU or the CPU
  • ...but stress tests did not reproduce the crash.
  • Memtest found no problems
  • I also used some benchmarking and stress-testing programs to try to put some load on the power supply, but they had no wizard or automated tests, and I am not sure I did it correctly. I would be glad if someone could suggest a way to test my power supply
  • The logs look clean, as if the system could not write anything in time before rebooting
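Since the regular logs end without a trace, one thing that might help is a small logger that records sensor readings at a short interval and forces each line to disk, so that the last reading before a reset has a better chance of surviving. The following is a minimal sketch under stated assumptions: it reads temperatures from the kernel's hwmon sysfs files (temp*_input, in millidegrees Celsius), and the function names, the one-second interval, and the log path in the usage note are all hypothetical choices, not an established tool.

```python
import os
import time
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {"hwmonN:chip/tempM": degrees_celsius} for each readable input."""
    temps = {}
    for chip in sorted(Path(root).glob("hwmon*")):
        try:
            name = (chip / "name").read_text().strip()
        except OSError:
            name = "unknown"
        for sensor in sorted(chip.glob("temp*_input")):
            try:
                millideg = int(sensor.read_text().strip())
            except (OSError, ValueError):
                continue  # some channels cannot be read; skip them
            label = sensor.name[: -len("_input")]
            temps[f"{chip.name}:{name}/{label}"] = millideg / 1000.0
    return temps


def append_synced(path, line):
    """Append one line and ask the kernel to flush it to the storage device."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(line + "\n")
        handle.flush()
        os.fsync(handle.fileno())


def log_forever(path, interval=1.0):
    """Write one timestamped line of temperatures every `interval` seconds."""
    while True:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        fields = " ".join(f"{k}={v:.1f}" for k, v in read_hwmon_temps().items())
        append_synced(path, f"{stamp} {fields}")
        time.sleep(interval)
```

Running log_forever("/var/tmp/hwmon.log") in a terminal and checking the tail of that file after the next crash shows the temperatures within roughly one interval of the reset. A reset can still land between two writes, and this covers temperatures only, not the power supply rails.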

My suspicion is that either the power supply or the motherboard is failing, but I have never been able to determine which.

How can I determine which piece of hardware is faulty?

Answer 1

  1. Swap in a known-working power supply to check whether Ubuntu still crashes.

  2. If you don't have one and can't borrow one, install Windows 10 in a dual-boot setup. Windows has more options for easy-to-use stress-testing programs. Run them to see whether Windows crashes. If it runs without problems, a hardware problem can almost be ruled out.

  3. Switch between the Nouveau and proprietary GPU drivers, and also try different versions.

  4. Consider trying Ubuntu 20.04, or using desktop environments other than Cinnamon, such as KDE, Xfce, or LXDE, to see whether the problem persists.
