We have a workstation in our group with a KNPA-U16 motherboard and an AMD RX560 GPU that works most of the time. The operating system is Kubuntu 20.04 and the kernel version is 5.8.0-59.
The problem appears when we leave the workstation unused for a while and then try to use it at the machine itself. The behavior is as follows: the screen comes up for a short time (varying between roughly 0 and 10 s) and then goes black. As mentioned, the time varies; once I even managed to log in before the screen went dark. After that there is no way to wake the screen up again. The machine is still reachable over SSH, though.
The kernel log shows the following:
09:27:51 PC3 kernel: [165861.461855] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
09:27:51 PC3 kernel: [165861.461858] {1}[Hardware Error]: event severity: info
09:27:51 PC3 kernel: [165861.461860] {1}[Hardware Error]: Error 0, type: fatal
09:27:51 PC3 kernel: [165861.461861] {1}[Hardware Error]: fru_text: PcieError
09:27:51 PC3 kernel: [165861.461862] {1}[Hardware Error]: section_type: PCIe error
09:27:51 PC3 kernel: [165861.461863] {1}[Hardware Error]: port_type: 4, root port
09:27:51 PC3 kernel: [165861.461864] {1}[Hardware Error]: version: 0.2
09:27:51 PC3 kernel: [165861.461866] {1}[Hardware Error]: command: 0x0407, status: 0x0010
09:27:51 PC3 kernel: [165861.461867] {1}[Hardware Error]: device_id: 0000:20:03.1
09:27:51 PC3 kernel: [165861.461868] {1}[Hardware Error]: slot: 7
09:27:51 PC3 kernel: [165861.461868] {1}[Hardware Error]: secondary_bus: 0x23
09:27:51 PC3 kernel: [165861.461869] {1}[Hardware Error]: vendor_id: 0x1022, device_id: 0x1453
09:27:51 PC3 kernel: [165861.461870] {1}[Hardware Error]: class_code: 060400
09:27:51 PC3 kernel: [165861.461871] {1}[Hardware Error]: bridge: secondary_status: 0x2000, control: 0x001a
09:27:51 PC3 kernel: [165861.461872] {1}[Hardware Error]: aer_uncor_status: 0x00000000, aer_uncor_mask: 0x04500000
09:27:51 PC3 kernel: [165861.461873] {1}[Hardware Error]: aer_uncor_severity: 0x004e2030
09:27:51 PC3 kernel: [165861.461874] {1}[Hardware Error]: TLP Header: 00000000 00000000 00000000 00000000
09:27:51 PC3 kernel: [165861.461933] pcieport 0000:20:03.1: AER: aer_status: 0x00000000, aer_mask: 0x04500000
09:27:51 PC3 kernel: [165861.461939] pcieport 0000:20:03.1: AER: aer_layer=Transaction Layer, aer_agent=Receiver ID
09:27:51 PC3 kernel: [165861.461941] pcieport 0000:20:03.1: AER: aer_uncor_severity: 0x004e2030
09:27:51 PC3 kernel: [165861.461945] amdgpu 0000:23:00.0: AER: can't recover (no error_detected callback)
09:27:51 PC3 kernel: [165861.461947] snd_hda_intel 0000:23:00.1: AER: can't recover (no error_detected callback)
09:27:52 PC3 kernel: [165862.485806] pcieport 0000:20:03.1: AER: Root Port link has been reset
09:27:52 PC3 kernel: [165862.485854] pcieport 0000:20:03.1: AER: device recovery successful
09:28:02 PC3 kernel: [165866.837702] [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out!
09:28:02 PC3 kernel: [165872.219438] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=841906, emitted seq=841908
09:28:02 PC3 kernel: [165872.219526] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process sddm-greeter pid 88965 thread sddm-greet:cs0 pid 88969
09:28:02 PC3 kernel: [165872.219534] amdgpu 0000:23:00.0: amdgpu: GPU reset begin!
09:28:02 PC3 kernel: [165872.219865] amdgpu:
09:28:02 PC3 kernel: [165872.219865] last message was failed ret is 65535
09:28:02 PC3 kernel: [165872.219870] amdgpu:
09:28:02 PC3 kernel: [165872.219870] failed to send message 281 ret is 65535
09:28:02 PC3 kernel: [165872.219879] amdgpu:
09:28:02 PC3 kernel: [165872.219879] last message was failed ret is 65535
09:28:02 PC3 kernel: [165872.219883] amdgpu:
09:28:02 PC3 kernel: [165872.219883] failed to send message 261 ret is 65535
09:28:02 PC3 kernel: [165872.219887] amdgpu:
09:28:02 PC3 kernel: [165872.219887] last message was failed ret is 65535
09:28:02 PC3 kernel: [165872.219890] amdgpu:
09:28:02 PC3 kernel: [165872.219890] failed to send message 261 ret is 65535
09:28:02 PC3 kernel: [165872.219894] amdgpu:
09:28:02 PC3 kernel: [165872.219894] last message was failed ret is 65535
09:28:02 PC3 kernel: [165872.219897] amdgpu:
09:28:02 PC3 kernel: [165872.219897] failed to send message 261 ret is 65535
09:28:02 PC3 kernel: [165872.219901] amdgpu:
09:28:02 PC3 kernel: [165872.219901] last message was failed ret is 65535
09:28:02 PC3 kernel: [165872.219905] amdgpu:
09:28:02 PC3 kernel: [165872.219905] failed to send message 261 ret is 65535
09:28:02 PC3 kernel: [165872.219909] amdgpu:
09:28:02 PC3 kernel: [165872.219909] last message was failed ret is 65535
09:28:02 PC3 kernel: [165872.219912] amdgpu:
09:28:02 PC3 kernel: [165872.219912] failed to send message 261 ret is 65535
09:28:02 PC3 kernel: [165872.219916] amdgpu:
09:28:02 PC3 kernel: [165872.219916] last message was failed ret is 65535
09:28:02 PC3 kernel: [165872.219919] amdgpu:
09:28:02 PC3 kernel: [165872.219919] failed to send message 261 ret is 65535
09:28:02 PC3 kernel: [165872.219923] amdgpu:
09:28:02 PC3 kernel: [165872.219923] last message was failed ret is 65535
09:28:02 PC3 kernel: [165872.219926] amdgpu:
09:28:02 PC3 kernel: [165872.219926] failed to send message 261 ret is 65535
09:28:02 PC3 kernel: [165872.219930] amdgpu:
09:28:02 PC3 kernel: [165872.219930] last message was failed ret is 65535
09:28:02 PC3 kernel: [165872.219933] amdgpu:
09:28:02 PC3 kernel: [165872.219933] failed to send message 261 ret is 65535
09:28:02 PC3 kernel: [165872.219937] amdgpu:
09:28:02 PC3 kernel: [165872.219937] last message was failed ret is 65535
09:28:02 PC3 kernel: [165872.219940] amdgpu:
09:28:02 PC3 kernel: [165872.219940] failed to send message 261 ret is 65535
09:28:02 PC3 kernel: [165872.219944] amdgpu:
09:28:02 PC3 kernel: [165872.219944] last message was failed ret is 65535
09:28:02 PC3 kernel: [165872.219947] amdgpu:
09:28:02 PC3 kernel: [165872.219947] failed to send message 261 ret is 65535
09:28:02 PC3 kernel: [165872.219951] amdgpu:
09:28:02 PC3 kernel: [165872.219951] last message was failed ret is 65535
09:28:02 PC3 kernel: [165872.219954] amdgpu:
09:28:02 PC3 kernel: [165872.219954] failed to send message 261 ret is 65535
09:28:02 PC3 kernel: [165872.219958] amdgpu:
09:28:02 PC3 kernel: [165872.219958] last message was failed ret is 65535
09:28:02 PC3 kernel: [165872.219961] amdgpu:
09:28:02 PC3 kernel: [165872.219961] failed to send message 261 ret is 65535
09:28:02 PC3 kernel: [165872.219965] amdgpu:
09:28:02 PC3 kernel: [165872.219965] last message was failed ret is 65535
09:28:02 PC3 kernel: [165872.219968] amdgpu:
09:28:02 PC3 kernel: [165872.219968] failed to send message 261 ret is 65535
09:28:02 PC3 kernel: [165872.219972] amdgpu:
09:28:02 PC3 kernel: [165872.219972] last message was failed ret is 65535
09:28:02 PC3 kernel: [165872.219975] amdgpu:
09:28:02 PC3 kernel: [165872.219975] failed to send message 261 ret is 65535
09:28:02 PC3 kernel: [165872.219978] amdgpu:
09:28:02 PC3 kernel: [165872.219978] last message was failed ret is 65535
09:28:02 PC3 kernel: [165872.219981] amdgpu:
09:28:02 PC3 kernel: [165872.219981] failed to send message 261 ret is 65535
09:28:02 PC3 kernel: [165872.219985] amdgpu:
09:28:02 PC3 kernel: [165872.219985] last message was failed ret is 65535
09:28:02 PC3 kernel: [165872.219988] amdgpu:
09:28:02 PC3 kernel: [165872.219988] failed to send message 261 ret is 65535
09:28:02 PC3 kernel: [165872.219992] amdgpu:
09:28:02 PC3 kernel: [165872.219992] last message was failed ret is 65535
09:28:02 PC3 kernel: [165872.219995] amdgpu:
09:28:02 PC3 kernel: [165872.219995] failed to send message 261 ret is 65535
09:28:02 PC3 kernel: [165872.220169] amdgpu:
09:28:02 PC3 kernel: [165872.220169] last message was failed ret is 65535
09:28:02 PC3 kernel: [165872.220173] amdgpu:
09:28:02 PC3 kernel: [165872.220173] failed to send message 306 ret is 65535
09:28:02 PC3 kernel: [165872.220175] amdgpu:
09:28:02 PC3 kernel: [165872.220175] last message was failed ret is 65535
09:28:02 PC3 kernel: [165872.220179] amdgpu:
09:28:02 PC3 kernel: [165872.220179] failed to send message 5e ret is 65535
09:28:02 PC3 kernel: [165872.220183] amdgpu:
09:28:02 PC3 kernel: [165872.220183] last message was failed ret is 65535
09:28:02 PC3 kernel: [165872.220186] amdgpu:
09:28:02 PC3 kernel: [165872.220186] failed to send message 145 ret is 65535
09:28:02 PC3 kernel: [165872.220190] amdgpu:
09:28:02 PC3 kernel: [165872.220190] last message was failed ret is 65535
09:28:02 PC3 kernel: [165872.220195] amdgpu:
09:28:02 PC3 kernel: [165872.220195] failed to send message 146 ret is 65535
09:28:02 PC3 kernel: [165872.220200] amdgpu:
09:28:02 PC3 kernel: [165872.220200] last message was failed ret is 65535
09:28:02 PC3 kernel: [165872.220203] amdgpu:
09:28:02 PC3 kernel: [165872.220203] failed to send message 148 ret is 65535
09:28:02 PC3 kernel: [165872.220207] amdgpu:
09:28:02 PC3 kernel: [165872.220207] last message was failed ret is 65535
09:28:02 PC3 kernel: [165872.220210] amdgpu:
09:28:02 PC3 kernel: [165872.220210] failed to send message 145 ret is 65535
09:28:02 PC3 kernel: [165872.220215] amdgpu:
09:28:02 PC3 kernel: [165872.220215] last message was failed ret is 65535
09:28:02 PC3 kernel: [165872.220219] amdgpu:
09:28:02 PC3 kernel: [165872.220219] failed to send message 146 ret is 65535
09:28:22 PC3 kernel: [165892.248439] [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 20secs aborting
09:28:22 PC3 kernel: [165892.248505] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing D8DE (len 824, WS 0, PS 0) @ 0xDA5E
09:28:22 PC3 kernel: [165892.248569] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing D798 (len 326, WS 0, PS 0) @ 0xD888
09:28:22 PC3 kernel: [165892.248664] [drm:dce110_link_encoder_disable_output [amdgpu]] *ERROR* dce110_link_encoder_disable_output: Failed to execute VBIOS command table!
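To make the AER register values in the log easier to read, here is a minimal Python sketch that decodes the uncorrectable-error bit fields. The bit names follow the `PCI_ERR_UNC_*` definitions in the Linux kernel header `include/uapi/linux/pci_regs.h`; the helper name `decode_aer_uncor` is made up for illustration.

```python
# Bit positions of the AER uncorrectable error registers (status, mask and
# severity share one layout), named after PCI_ERR_UNC_* in Linux's pci_regs.h.
AER_UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
    22: "Uncorrectable Internal Error",
    23: "MC Blocked TLP",
    24: "AtomicOp Egress Blocked",
    25: "TLP Prefix Blocked",
    26: "Poisoned TLP Egress Blocked",
}


def decode_aer_uncor(value):
    """Return the names of all known bits set in an AER uncorrectable register."""
    return [name for bit, name in sorted(AER_UNCOR_BITS.items())
            if value & (1 << bit)]


if __name__ == "__main__":
    # Values copied from the log above (root port 0000:20:03.1).
    for label, value in [("aer_uncor_status", 0x00000000),
                         ("aer_uncor_mask", 0x04500000),
                         ("aer_uncor_severity", 0x004e2030)]:
        names = decode_aer_uncor(value) or "no bits set"
        print(f"{label} = {value:#010x}: {names}")
```

With the values from our log, the status register has no bits set at all, so this record by itself does not name a specific error type. The mask covers Unsupported Request, Uncorrectable Internal Error and Poisoned TLP Egress Blocked.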
We have had this problem for quite a while now (about 1 year, visible in the SMBIOS event log) and have tried several things:
- Removing the graphics card from the PCIe slot and reseating it
- Installing a new kernel
- Installing the proprietary amdgpu drivers
- Disabling all sleep settings we could find in Kubuntu
- Using pcie_aspm=off
- Using a different monitor (DVI / DisplayPort)
- Changing some PCIe-related BIOS settings
But nothing we do seems to change anything. What makes this problem so hard to fix is that it only occurs after the PC has been unused for a while, so testing anything is really tedious.
Does anyone have an idea what the fault could be, or where we could start looking based on the log?
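One check that does not require waiting for the failure is confirming that a setting such as `pcie_aspm=off` from the list above actually reached the running kernel. A minimal Python sketch, assuming the standard Linux paths `/proc/cmdline` and `/sys/module/pcie_aspm/parameters/policy` (the latter may be absent depending on the kernel configuration); the helper names `kernel_params` and `active_aspm_policy` are hypothetical.

```python
from pathlib import Path


def kernel_params(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None.

    Quoted values containing spaces are not handled by this simple split.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def active_aspm_policy(policy_text):
    """Return the bracketed entry, e.g. 'default' from '[default] performance'."""
    for word in policy_text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None


if __name__ == "__main__":
    params = kernel_params(Path("/proc/cmdline").read_text())
    print("pcie_aspm on command line:", params.get("pcie_aspm"))

    policy_file = Path("/sys/module/pcie_aspm/parameters/policy")
    if policy_file.exists():
        print("ASPM policy:", active_aspm_policy(policy_file.read_text()))
    else:
        print("ASPM policy file not present on this kernel")
```

This only shows what the kernel was asked to do. Whether ASPM is really disabled on the link between 0000:20:03.1 and the GPU would still have to be read from the link control register of those devices.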
Update: We tested the graphics card in another PC running Windows and it worked flawlessly. We then installed Win 10 on the same PC, and it works there without problems as well. So it seems to come down to the combination of GPU + motherboard + KDE Neon. Linux appears to enable some kind of power-saving mode that does not work well with the mainboard. However, we have searched a lot and have not found any sleep option that we have not already turned off.
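Since the suspicion is a power-saving feature, one thing that can be inspected over SSH is the runtime power-management state that sysfs exposes per PCI device. A minimal Python sketch, assuming the standard `power/control` and `power/runtime_status` attributes under `/sys/bus/pci/devices/`; the addresses are the root port, the GPU and its audio function as they appear in the log, and the helper name `read_runtime_pm` is hypothetical.

```python
from pathlib import Path

# PCI addresses taken from the kernel log: root port, GPU, GPU audio function.
DEVICES = ["0000:20:03.1", "0000:23:00.0", "0000:23:00.1"]


def read_runtime_pm(device_dir):
    """Read the runtime PM attributes of one device directory.

    Attributes that do not exist are reported as None.
    """
    result = {}
    for attr in ("control", "runtime_status"):
        path = Path(device_dir) / "power" / attr
        result[attr] = path.read_text().strip() if path.exists() else None
    return result


if __name__ == "__main__":
    for bdf in DEVICES:
        print(bdf, read_runtime_pm(Path("/sys/bus/pci/devices") / bdf))
```

`control` reads `auto` when the kernel is allowed to runtime-suspend the device and `on` when the device is kept powered, so running this before and after the idle period would show whether any of the three devices is being suspended.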