Black screen, PCIe errors with AMD GPU and server motherboard


Our group has a workstation with a KNPA-U16 motherboard and an AMD RX560 GPU. It works most of the time. The operating system is Kubuntu 20.04 with kernel version 5.8.0-59.

The problem appears when we leave the workstation unused for a while and then try to use it again locally, at the machine itself. The behavior is as follows: the screen comes on for a short time (anywhere from ~0 to 10 s) and then goes black. The time varies; once I even managed to log in before the screen went dark. After that there is no way to wake the display up again. The machine is still reachable over SSH, however.

The kernel log shows the following:

09:27:51 PC3 kernel: [165861.461855] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4  
09:27:51 PC3 kernel: [165861.461858] {1}[Hardware Error]: event severity: info  
09:27:51 PC3 kernel: [165861.461860] {1}[Hardware Error]:  Error 0, type: fatal  
09:27:51 PC3 kernel: [165861.461861] {1}[Hardware Error]:  fru_text: PcieError  
09:27:51 PC3 kernel: [165861.461862] {1}[Hardware Error]:   section_type: PCIe error  
09:27:51 PC3 kernel: [165861.461863] {1}[Hardware Error]:   port_type: 4, root port  
09:27:51 PC3 kernel: [165861.461864] {1}[Hardware Error]:   version: 0.2  
09:27:51 PC3 kernel: [165861.461866] {1}[Hardware Error]:   command: 0x0407, status: 0x0010  
09:27:51 PC3 kernel: [165861.461867] {1}[Hardware Error]:   device_id: 0000:20:03.1  
09:27:51 PC3 kernel: [165861.461868] {1}[Hardware Error]:   slot: 7  
09:27:51 PC3 kernel: [165861.461868] {1}[Hardware Error]:   secondary_bus: 0x23  
09:27:51 PC3 kernel: [165861.461869] {1}[Hardware Error]:   vendor_id: 0x1022, device_id: 0x1453  
09:27:51 PC3 kernel: [165861.461870] {1}[Hardware Error]:   class_code: 060400  
09:27:51 PC3 kernel: [165861.461871] {1}[Hardware Error]:   bridge: secondary_status: 0x2000, control: 0x001a  
09:27:51 PC3 kernel: [165861.461872] {1}[Hardware Error]:   aer_uncor_status: 0x00000000, aer_uncor_mask: 0x04500000  
09:27:51 PC3 kernel: [165861.461873] {1}[Hardware Error]:   aer_uncor_severity: 0x004e2030  
09:27:51 PC3 kernel: [165861.461874] {1}[Hardware Error]:   TLP Header: 00000000 00000000 00000000 00000000  
09:27:51 PC3 kernel: [165861.461933] pcieport 0000:20:03.1: AER: aer_status: 0x00000000, aer_mask: 0x04500000  
09:27:51 PC3 kernel: [165861.461939] pcieport 0000:20:03.1: AER: aer_layer=Transaction Layer, aer_agent=Receiver ID  
09:27:51 PC3 kernel: [165861.461941] pcieport 0000:20:03.1: AER: aer_uncor_severity: 0x004e2030  
09:27:51 PC3 kernel: [165861.461945] amdgpu 0000:23:00.0: AER: can't recover (no error_detected callback)  
09:27:51 PC3 kernel: [165861.461947] snd_hda_intel 0000:23:00.1: AER: can't recover (no error_detected callback)  
09:27:52 PC3 kernel: [165862.485806] pcieport 0000:20:03.1: AER: Root Port link has been reset  
09:27:52 PC3 kernel: [165862.485854] pcieport 0000:20:03.1: AER: device recovery successful   
09:28:02 PC3 kernel: [165866.837702] [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out!  
09:28:02 PC3 kernel: [165872.219438] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=841906, emitted seq=841908  
09:28:02 PC3 kernel: [165872.219526] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process sddm-greeter pid 88965 thread sddm-greet:cs0 pid 88969  
09:28:02 PC3 kernel: [165872.219534] amdgpu 0000:23:00.0: amdgpu: GPU reset begin!  
09:28:02 PC3 kernel: [165872.219865] amdgpu:   
09:28:02 PC3 kernel: [165872.219865]  last message was failed ret is 65535  
09:28:02 PC3 kernel: [165872.219870] amdgpu:   
09:28:02 PC3 kernel: [165872.219870]  failed to send message 281 ret is 65535   
09:28:02 PC3 kernel: [165872.219879] amdgpu:   
09:28:02 PC3 kernel: [165872.219879]  last message was failed ret is 65535  
09:28:02 PC3 kernel: [165872.219883] amdgpu:   
09:28:02 PC3 kernel: [165872.219883]  failed to send message 261 ret is 65535   
09:28:02 PC3 kernel: [165872.219887] amdgpu:   
09:28:02 PC3 kernel: [165872.219887]  last message was failed ret is 65535  
09:28:02 PC3 kernel: [165872.219890] amdgpu:   
09:28:02 PC3 kernel: [165872.219890]  failed to send message 261 ret is 65535   
09:28:02 PC3 kernel: [165872.219894] amdgpu:   
09:28:02 PC3 kernel: [165872.219894]  last message was failed ret is 65535  
09:28:02 PC3 kernel: [165872.219897] amdgpu:   
09:28:02 PC3 kernel: [165872.219897]  failed to send message 261 ret is 65535   
09:28:02 PC3 kernel: [165872.219901] amdgpu:   
09:28:02 PC3 kernel: [165872.219901]  last message was failed ret is 65535  
09:28:02 PC3 kernel: [165872.219905] amdgpu:   
09:28:02 PC3 kernel: [165872.219905]  failed to send message 261 ret is 65535   
09:28:02 PC3 kernel: [165872.219909] amdgpu:   
09:28:02 PC3 kernel: [165872.219909]  last message was failed ret is 65535  
09:28:02 PC3 kernel: [165872.219912] amdgpu:   
09:28:02 PC3 kernel: [165872.219912]  failed to send message 261 ret is 65535   
09:28:02 PC3 kernel: [165872.219916] amdgpu:   
09:28:02 PC3 kernel: [165872.219916]  last message was failed ret is 65535  
09:28:02 PC3 kernel: [165872.219919] amdgpu:   
09:28:02 PC3 kernel: [165872.219919]  failed to send message 261 ret is 65535   
09:28:02 PC3 kernel: [165872.219923] amdgpu:   
09:28:02 PC3 kernel: [165872.219923]  last message was failed ret is 65535  
09:28:02 PC3 kernel: [165872.219926] amdgpu:   
09:28:02 PC3 kernel: [165872.219926]  failed to send message 261 ret is 65535   
09:28:02 PC3 kernel: [165872.219930] amdgpu:   
09:28:02 PC3 kernel: [165872.219930]  last message was failed ret is 65535  
09:28:02 PC3 kernel: [165872.219933] amdgpu:   
09:28:02 PC3 kernel: [165872.219933]  failed to send message 261 ret is 65535   
09:28:02 PC3 kernel: [165872.219937] amdgpu:   
09:28:02 PC3 kernel: [165872.219937]  last message was failed ret is 65535  
09:28:02 PC3 kernel: [165872.219940] amdgpu:   
09:28:02 PC3 kernel: [165872.219940]  failed to send message 261 ret is 65535   
09:28:02 PC3 kernel: [165872.219944] amdgpu:   
09:28:02 PC3 kernel: [165872.219944]  last message was failed ret is 65535  
09:28:02 PC3 kernel: [165872.219947] amdgpu:   
09:28:02 PC3 kernel: [165872.219947]  failed to send message 261 ret is 65535   
09:28:02 PC3 kernel: [165872.219951] amdgpu:   
09:28:02 PC3 kernel: [165872.219951]  last message was failed ret is 65535  
09:28:02 PC3 kernel: [165872.219954] amdgpu:   
09:28:02 PC3 kernel: [165872.219954]  failed to send message 261 ret is 65535   
09:28:02 PC3 kernel: [165872.219958] amdgpu: 
09:28:02 PC3 kernel: [165872.219958]  last message was failed ret is 65535  
09:28:02 PC3 kernel: [165872.219961] amdgpu:   
09:28:02 PC3 kernel: [165872.219961]  failed to send message 261 ret is 65535   
09:28:02 PC3 kernel: [165872.219965] amdgpu:   
09:28:02 PC3 kernel: [165872.219965]  last message was failed ret is 65535  
09:28:02 PC3 kernel: [165872.219968] amdgpu:   
09:28:02 PC3 kernel: [165872.219968]  failed to send message 261 ret is 65535   
09:28:02 PC3 kernel: [165872.219972] amdgpu:   
09:28:02 PC3 kernel: [165872.219972]  last message was failed ret is 65535  
09:28:02 PC3 kernel: [165872.219975] amdgpu:   
09:28:02 PC3 kernel: [165872.219975]  failed to send message 261 ret is 65535   
09:28:02 PC3 kernel: [165872.219978] amdgpu:   
09:28:02 PC3 kernel: [165872.219978]  last message was failed ret is 65535  
09:28:02 PC3 kernel: [165872.219981] amdgpu:   
09:28:02 PC3 kernel: [165872.219981]  failed to send message 261 ret is 65535   
09:28:02 PC3 kernel: [165872.219985] amdgpu:   
09:28:02 PC3 kernel: [165872.219985]  last message was failed ret is 65535  
09:28:02 PC3 kernel: [165872.219988] amdgpu:   
09:28:02 PC3 kernel: [165872.219988]  failed to send message 261 ret is 65535   
09:28:02 PC3 kernel: [165872.219992] amdgpu:   
09:28:02 PC3 kernel: [165872.219992]  last message was failed ret is 65535  
09:28:02 PC3 kernel: [165872.219995] amdgpu:   
09:28:02 PC3 kernel: [165872.219995]  failed to send message 261 ret is 65535   
09:28:02 PC3 kernel: [165872.220169] amdgpu:   
09:28:02 PC3 kernel: [165872.220169]  last message was failed ret is 65535  
09:28:02 PC3 kernel: [165872.220173] amdgpu:   
09:28:02 PC3 kernel: [165872.220173]  failed to send message 306 ret is 65535   
09:28:02 PC3 kernel: [165872.220175] amdgpu:   
09:28:02 PC3 kernel: [165872.220175]  last message was failed ret is 65535  
09:28:02 PC3 kernel: [165872.220179] amdgpu:   
09:28:02 PC3 kernel: [165872.220179]  failed to send message 5e ret is 65535   
09:28:02 PC3 kernel: [165872.220183] amdgpu:   
09:28:02 PC3 kernel: [165872.220183]  last message was failed ret is 65535  
09:28:02 PC3 kernel: [165872.220186] amdgpu:   
09:28:02 PC3 kernel: [165872.220186]  failed to send message 145 ret is 65535   
09:28:02 PC3 kernel: [165872.220190] amdgpu:   
09:28:02 PC3 kernel: [165872.220190]  last message was failed ret is 65535  
09:28:02 PC3 kernel: [165872.220195] amdgpu:   
09:28:02 PC3 kernel: [165872.220195]  failed to send message 146 ret is 65535   
09:28:02 PC3 kernel: [165872.220200] amdgpu:   
09:28:02 PC3 kernel: [165872.220200]  last message was failed ret is 65535  
09:28:02 PC3 kernel: [165872.220203] amdgpu:   
09:28:02 PC3 kernel: [165872.220203]  failed to send message 148 ret is 65535   
09:28:02 PC3 kernel: [165872.220207] amdgpu:   
09:28:02 PC3 kernel: [165872.220207]  last message was failed ret is 65535  
09:28:02 PC3 kernel: [165872.220210] amdgpu:   
09:28:02 PC3 kernel: [165872.220210]  failed to send message 145 ret is 65535   
09:28:02 PC3 kernel: [165872.220215] amdgpu:   
09:28:02 PC3 kernel: [165872.220215]  last message was failed ret is 65535  
09:28:02 PC3 kernel: [165872.220219] amdgpu:   
09:28:02 PC3 kernel: [165872.220219]  failed to send message 146 ret is 65535   
09:28:22 PC3 kernel: [165892.248439] [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 20secs aborting  
09:28:22 PC3 kernel: [165892.248505] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing D8DE (len 824, WS 0, PS 0) @ 0xDA5E  
09:28:22 PC3 kernel: [165892.248569] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing D798 (len 326, WS 0, PS 0) @ 0xD888  
09:28:22 PC3 kernel: [165892.248664] [drm:dce110_link_encoder_disable_output [amdgpu]] *ERROR* dce110_link_encoder_disable_output: Failed to execute VBIOS command table!  
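
The AER register values in the log can be decoded by hand. Below is a minimal sketch in Python, assuming the bit positions and short labels that the Linux AER driver uses for the uncorrectable error registers. The table `AER_UNCOR_BITS` and the helper `decode_mask` are names made up for this post, so please check them against your kernel source before relying on them.

```python
# Assumed bit layout of the PCIe AER uncorrectable error registers,
# using the short labels printed by the Linux AER driver.
AER_UNCOR_BITS = {
    4: "DLP",                # Data Link Protocol Error
    5: "SDES",               # Surprise Down Error
    12: "TLP",               # Poisoned TLP
    13: "FCP",               # Flow Control Protocol Error
    14: "CmpltTO",           # Completion Timeout
    15: "CmpltAbrt",         # Completer Abort
    16: "UnxCmplt",          # Unexpected Completion
    17: "RxOF",              # Receiver Overflow
    18: "MalfTLP",           # Malformed TLP
    19: "ECRC",              # ECRC Error
    20: "UnsupReq",          # Unsupported Request
    21: "ACSViol",           # ACS Violation
    22: "UncorrIntErr",      # Uncorrectable Internal Error
    23: "BlockedTLP",        # MC Blocked TLP
    24: "AtomicOpBlocked",   # AtomicOp Egress Blocked
    25: "TLPBlockedErr",     # TLP Prefix Blocked
    26: "PoisonTLPBlocked",  # Poisoned TLP Egress Blocked
}


def decode_mask(value, table=AER_UNCOR_BITS):
    """Return the labels of all set bits, lowest bit first."""
    names = []
    for bit in range(32):
        if value & (1 << bit):
            # Bits missing from the table are reported by position.
            names.append(table.get(bit, f"bit {bit}"))
    return names


if __name__ == "__main__":
    # Values copied from the kernel log above.
    registers = [
        ("aer_uncor_status", 0x00000000),
        ("aer_uncor_mask", 0x04500000),
        ("aer_uncor_severity", 0x004E2030),
    ]
    for label, value in registers:
        print(f"{label}: 0x{value:08x} -> {decode_mask(value) or 'no bits set'}")
```

If the table is right, aer_uncor_status being all zeros means the record itself names no specific uncorrectable error, while the mask and severity values only describe how the root port is configured.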

We have had this problem for quite a while now (~1 year, visible in the SMBIOS event log) and have tried several things:

  • Removing the graphics card from the PCIe slot and reseating it
  • Installing a new kernel
  • Installing the proprietary amdgpu drivers
  • Disabling every sleep setting we could find in Kubuntu
  • Using pcie_aspm=off
  • Using a different monitor (DVI / DisplayPort)
  • Changing some PCIe-related BIOS settings
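
Since a boot parameter such as pcie_aspm=off is easy to lose after a kernel update, it may be worth confirming that it is really active. The following is only a sketch in Python, assuming the usual Linux locations `/proc/cmdline` and `/sys/module/pcie_aspm/parameters/policy`; the helper names `has_kernel_param` and `active_aspm_policy` are made up for this post.

```python
from pathlib import Path


def has_kernel_param(cmdline, param):
    """True if `param` appears as a whole token on the kernel command line."""
    return param in cmdline.split()


def active_aspm_policy(policy_text):
    """Return the bracketed entry, e.g. 'performance' from
    'default [performance] powersave powersupersave'."""
    for token in policy_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


if __name__ == "__main__":
    cmdline_file = Path("/proc/cmdline")
    if cmdline_file.exists():
        cmdline = cmdline_file.read_text()
        print("pcie_aspm=off present:", has_kernel_param(cmdline, "pcie_aspm=off"))

    # The policy file may be missing if the kernel was built without ASPM support.
    policy_file = Path("/sys/module/pcie_aspm/parameters/policy")
    if policy_file.exists():
        print("ASPM policy:", active_aspm_policy(policy_file.read_text()))
```

The script only reads files and can be run over SSH while the screen is black.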

But nothing we do seems to make a difference. What makes this problem so hard to fix is that it only occurs after the PC has been unused for a while, so testing anything is really tedious.

Does anyone have an idea what the fault could be, or where in the log we should start looking?


Update: We tested the graphics card in another PC running Windows and it worked flawlessly. We then installed Win 10 on the same PC, and it works there without problems too. So the issue seems to lie in the combination of GPU + motherboard + KDE Neon. Linux appears to enable some kind of power-saving mode that does not work well with the mainboard. However, we have searched a lot and have not found a sleep option that we have not already turned off.
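
One place where power saving can still be active even with all desktop sleep options off is per-device runtime power management in sysfs. As a rough sketch in Python, assuming the standard PCI sysfs attributes `power/control`, `power/runtime_status`, `current_link_speed` and `current_link_width`, the state of the three devices named in the log could be dumped like this. The helper `read_attrs` and the lists `DEVICES` and `ATTRS` are made up for this post.

```python
from pathlib import Path

# PCI addresses taken from the kernel log: root port, GPU, GPU audio function.
DEVICES = ["0000:20:03.1", "0000:23:00.0", "0000:23:00.1"]

# Standard sysfs attributes; any of them may be absent for a given device.
ATTRS = [
    "power/control",          # "auto" = runtime PM allowed, "on" = always powered
    "power/runtime_status",   # e.g. "active" or "suspended"
    "current_link_speed",
    "current_link_width",
]


def read_attrs(device_dir, attrs=ATTRS):
    """Read each attribute below device_dir; unreadable ones map to None."""
    result = {}
    for attr in attrs:
        try:
            result[attr] = (Path(device_dir) / attr).read_text().strip()
        except OSError:
            result[attr] = None
    return result


if __name__ == "__main__":
    root = Path("/sys/bus/pci/devices")
    for dev in DEVICES:
        print(dev, read_attrs(root / dev))
```

Running this over SSH once while the display works and once after it has gone black would show whether the link or the runtime PM state differs between the two cases. Writing "on" to `power/control` as root keeps a device from being runtime-suspended, which would be one more setting to rule out.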
