Black screen, AMD GPU, and PCIe errors on a server motherboard

Our group has a workstation with a KNPA-U16 motherboard and an AMD RX560 GPU that works most of the time. The operating system is Kubuntu 20.04 with kernel version 5.8.0-59.

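The problem appears when the workstation has not been used for a while and someone then tries to use it locally, sitting in front of it. The behavior is as follows: the screen comes up for a short time (roughly 0 to 10 seconds) and then turns black. The timing varies, and on some occasions we even managed to log in before the screen went dark. After that event there is no way to wake the display again. The machine does remain reachable over SSH, though.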

The kernel log shows the following:

09:27:51 PC3 kernel: [165861.461855] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4  
09:27:51 PC3 kernel: [165861.461858] {1}[Hardware Error]: event severity: info  
09:27:51 PC3 kernel: [165861.461860] {1}[Hardware Error]:  Error 0, type: fatal  
09:27:51 PC3 kernel: [165861.461861] {1}[Hardware Error]:  fru_text: PcieError  
09:27:51 PC3 kernel: [165861.461862] {1}[Hardware Error]:   section_type: PCIe error  
09:27:51 PC3 kernel: [165861.461863] {1}[Hardware Error]:   port_type: 4, root port  
09:27:51 PC3 kernel: [165861.461864] {1}[Hardware Error]:   version: 0.2  
09:27:51 PC3 kernel: [165861.461866] {1}[Hardware Error]:   command: 0x0407, status: 0x0010  
09:27:51 PC3 kernel: [165861.461867] {1}[Hardware Error]:   device_id: 0000:20:03.1  
09:27:51 PC3 kernel: [165861.461868] {1}[Hardware Error]:   slot: 7  
09:27:51 PC3 kernel: [165861.461868] {1}[Hardware Error]:   secondary_bus: 0x23  
09:27:51 PC3 kernel: [165861.461869] {1}[Hardware Error]:   vendor_id: 0x1022, device_id: 0x1453  
09:27:51 PC3 kernel: [165861.461870] {1}[Hardware Error]:   class_code: 060400  
09:27:51 PC3 kernel: [165861.461871] {1}[Hardware Error]:   bridge: secondary_status: 0x2000, control: 0x001a  
09:27:51 PC3 kernel: [165861.461872] {1}[Hardware Error]:   aer_uncor_status: 0x00000000, aer_uncor_mask: 0x04500000  
09:27:51 PC3 kernel: [165861.461873] {1}[Hardware Error]:   aer_uncor_severity: 0x004e2030  
09:27:51 PC3 kernel: [165861.461874] {1}[Hardware Error]:   TLP Header: 00000000 00000000 00000000 00000000  
09:27:51 PC3 kernel: [165861.461933] pcieport 0000:20:03.1: AER: aer_status: 0x00000000, aer_mask: 0x04500000  
09:27:51 PC3 kernel: [165861.461939] pcieport 0000:20:03.1: AER: aer_layer=Transaction Layer, aer_agent=Receiver ID  
09:27:51 PC3 kernel: [165861.461941] pcieport 0000:20:03.1: AER: aer_uncor_severity: 0x004e2030  
09:27:51 PC3 kernel: [165861.461945] amdgpu 0000:23:00.0: AER: can't recover (no error_detected callback)  
09:27:51 PC3 kernel: [165861.461947] snd_hda_intel 0000:23:00.1: AER: can't recover (no error_detected callback)  
09:27:52 PC3 kernel: [165862.485806] pcieport 0000:20:03.1: AER: Root Port link has been reset  
09:27:52 PC3 kernel: [165862.485854] pcieport 0000:20:03.1: AER: device recovery successful   
09:28:02 PC3 kernel: [165866.837702] [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out!  
09:28:02 PC3 kernel: [165872.219438] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=841906, emitted seq=841908  
09:28:02 PC3 kernel: [165872.219526] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process sddm-greeter pid 88965 thread sddm-greet:cs0 pid 88969  
09:28:02 PC3 kernel: [165872.219534] amdgpu 0000:23:00.0: amdgpu: GPU reset begin!  
09:28:02 PC3 kernel: [165872.219865] amdgpu:   
09:28:02 PC3 kernel: [165872.219865]  last message was failed ret is 65535  
09:28:02 PC3 kernel: [165872.219870] amdgpu:   
09:28:02 PC3 kernel: [165872.219870]  failed to send message 281 ret is 65535   
09:28:02 PC3 kernel: [165872.219879] amdgpu:   
09:28:02 PC3 kernel: [165872.219879]  last message was failed ret is 65535  
09:28:02 PC3 kernel: [165872.219883] amdgpu:   
09:28:02 PC3 kernel: [165872.219883]  failed to send message 261 ret is 65535   
09:28:02 PC3 kernel: [165872.219887] amdgpu:   
09:28:02 PC3 kernel: [165872.219887]  last message was failed ret is 65535  
09:28:02 PC3 kernel: [165872.219890] amdgpu:   
09:28:02 PC3 kernel: [165872.219890]  failed to send message 261 ret is 65535   
09:28:02 PC3 kernel: [165872.219894] amdgpu:   
09:28:02 PC3 kernel: [165872.219894]  last message was failed ret is 65535  
09:28:02 PC3 kernel: [165872.219897] amdgpu:   
09:28:02 PC3 kernel: [165872.219897]  failed to send message 261 ret is 65535   
09:28:02 PC3 kernel: [165872.219901] amdgpu:   
09:28:02 PC3 kernel: [165872.219901]  last message was failed ret is 65535  
09:28:02 PC3 kernel: [165872.219905] amdgpu:   
09:28:02 PC3 kernel: [165872.219905]  failed to send message 261 ret is 65535   
09:28:02 PC3 kernel: [165872.219909] amdgpu:   
09:28:02 PC3 kernel: [165872.219909]  last message was failed ret is 65535  
09:28:02 PC3 kernel: [165872.219912] amdgpu:   
09:28:02 PC3 kernel: [165872.219912]  failed to send message 261 ret is 65535   
09:28:02 PC3 kernel: [165872.219916] amdgpu:   
09:28:02 PC3 kernel: [165872.219916]  last message was failed ret is 65535  
09:28:02 PC3 kernel: [165872.219919] amdgpu:   
09:28:02 PC3 kernel: [165872.219919]  failed to send message 261 ret is 65535   
09:28:02 PC3 kernel: [165872.219923] amdgpu:   
09:28:02 PC3 kernel: [165872.219923]  last message was failed ret is 65535  
09:28:02 PC3 kernel: [165872.219926] amdgpu:   
09:28:02 PC3 kernel: [165872.219926]  failed to send message 261 ret is 65535   
09:28:02 PC3 kernel: [165872.219930] amdgpu:   
09:28:02 PC3 kernel: [165872.219930]  last message was failed ret is 65535  
09:28:02 PC3 kernel: [165872.219933] amdgpu:   
09:28:02 PC3 kernel: [165872.219933]  failed to send message 261 ret is 65535   
09:28:02 PC3 kernel: [165872.219937] amdgpu:   
09:28:02 PC3 kernel: [165872.219937]  last message was failed ret is 65535  
09:28:02 PC3 kernel: [165872.219940] amdgpu:   
09:28:02 PC3 kernel: [165872.219940]  failed to send message 261 ret is 65535   
09:28:02 PC3 kernel: [165872.219944] amdgpu:   
09:28:02 PC3 kernel: [165872.219944]  last message was failed ret is 65535  
09:28:02 PC3 kernel: [165872.219947] amdgpu:   
09:28:02 PC3 kernel: [165872.219947]  failed to send message 261 ret is 65535   
09:28:02 PC3 kernel: [165872.219951] amdgpu:   
09:28:02 PC3 kernel: [165872.219951]  last message was failed ret is 65535  
09:28:02 PC3 kernel: [165872.219954] amdgpu:   
09:28:02 PC3 kernel: [165872.219954]  failed to send message 261 ret is 65535   
09:28:02 PC3 kernel: [165872.219958] amdgpu: 
09:28:02 PC3 kernel: [165872.219958]  last message was failed ret is 65535  
09:28:02 PC3 kernel: [165872.219961] amdgpu:   
09:28:02 PC3 kernel: [165872.219961]  failed to send message 261 ret is 65535   
09:28:02 PC3 kernel: [165872.219965] amdgpu:   
09:28:02 PC3 kernel: [165872.219965]  last message was failed ret is 65535  
09:28:02 PC3 kernel: [165872.219968] amdgpu:   
09:28:02 PC3 kernel: [165872.219968]  failed to send message 261 ret is 65535   
09:28:02 PC3 kernel: [165872.219972] amdgpu:   
09:28:02 PC3 kernel: [165872.219972]  last message was failed ret is 65535  
09:28:02 PC3 kernel: [165872.219975] amdgpu:   
09:28:02 PC3 kernel: [165872.219975]  failed to send message 261 ret is 65535   
09:28:02 PC3 kernel: [165872.219978] amdgpu:   
09:28:02 PC3 kernel: [165872.219978]  last message was failed ret is 65535  
09:28:02 PC3 kernel: [165872.219981] amdgpu:   
09:28:02 PC3 kernel: [165872.219981]  failed to send message 261 ret is 65535   
09:28:02 PC3 kernel: [165872.219985] amdgpu:   
09:28:02 PC3 kernel: [165872.219985]  last message was failed ret is 65535  
09:28:02 PC3 kernel: [165872.219988] amdgpu:   
09:28:02 PC3 kernel: [165872.219988]  failed to send message 261 ret is 65535   
09:28:02 PC3 kernel: [165872.219992] amdgpu:   
09:28:02 PC3 kernel: [165872.219992]  last message was failed ret is 65535  
09:28:02 PC3 kernel: [165872.219995] amdgpu:   
09:28:02 PC3 kernel: [165872.219995]  failed to send message 261 ret is 65535   
09:28:02 PC3 kernel: [165872.220169] amdgpu:   
09:28:02 PC3 kernel: [165872.220169]  last message was failed ret is 65535  
09:28:02 PC3 kernel: [165872.220173] amdgpu:   
09:28:02 PC3 kernel: [165872.220173]  failed to send message 306 ret is 65535   
09:28:02 PC3 kernel: [165872.220175] amdgpu:   
09:28:02 PC3 kernel: [165872.220175]  last message was failed ret is 65535  
09:28:02 PC3 kernel: [165872.220179] amdgpu:   
09:28:02 PC3 kernel: [165872.220179]  failed to send message 5e ret is 65535   
09:28:02 PC3 kernel: [165872.220183] amdgpu:   
09:28:02 PC3 kernel: [165872.220183]  last message was failed ret is 65535  
09:28:02 PC3 kernel: [165872.220186] amdgpu:   
09:28:02 PC3 kernel: [165872.220186]  failed to send message 145 ret is 65535   
09:28:02 PC3 kernel: [165872.220190] amdgpu:   
09:28:02 PC3 kernel: [165872.220190]  last message was failed ret is 65535  
09:28:02 PC3 kernel: [165872.220195] amdgpu:   
09:28:02 PC3 kernel: [165872.220195]  failed to send message 146 ret is 65535   
09:28:02 PC3 kernel: [165872.220200] amdgpu:   
09:28:02 PC3 kernel: [165872.220200]  last message was failed ret is 65535  
09:28:02 PC3 kernel: [165872.220203] amdgpu:   
09:28:02 PC3 kernel: [165872.220203]  failed to send message 148 ret is 65535   
09:28:02 PC3 kernel: [165872.220207] amdgpu:   
09:28:02 PC3 kernel: [165872.220207]  last message was failed ret is 65535  
09:28:02 PC3 kernel: [165872.220210] amdgpu:   
09:28:02 PC3 kernel: [165872.220210]  failed to send message 145 ret is 65535   
09:28:02 PC3 kernel: [165872.220215] amdgpu:   
09:28:02 PC3 kernel: [165872.220215]  last message was failed ret is 65535  
09:28:02 PC3 kernel: [165872.220219] amdgpu:   
09:28:02 PC3 kernel: [165872.220219]  failed to send message 146 ret is 65535   
09:28:22 PC3 kernel: [165892.248439] [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 20secs aborting  
09:28:22 PC3 kernel: [165892.248505] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing D8DE (len 824, WS 0, PS 0) @ 0xDA5E  
09:28:22 PC3 kernel: [165892.248569] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing D798 (len 326, WS 0, PS 0) @ 0xD888  
09:28:22 PC3 kernel: [165892.248664] [drm:dce110_link_encoder_disable_output [amdgpu]] *ERROR* dce110_link_encoder_disable_output: Failed to execute VBIOS command table!  
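
To make the AER register values in the log easier to read, here is a minimal Python sketch that decodes them into named bits. The bit-to-name table is an assumption taken from the Uncorrectable Error Status register layout in the PCI Express Base Specification; it is not derived from the log itself. The function name `decode_aer_uncor` is made up for illustration.

```python
from typing import List

# Bit names of the AER Uncorrectable Error Status/Mask/Severity registers.
# Assumption: layout as defined in the PCI Express Base Specification.
AER_UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
    22: "Uncorrectable Internal Error",
    23: "MC Blocked TLP",
    24: "AtomicOp Egress Blocked",
    25: "TLP Prefix Blocked",
    26: "Poisoned TLP Egress Blocked",
}


def decode_aer_uncor(value: int) -> List[str]:
    """Return the names of all bits set in a 32-bit AER uncorrectable register."""
    names = []
    for bit in range(32):
        if value & (1 << bit):
            names.append(AER_UNCOR_BITS.get(bit, "reserved/unknown bit %d" % bit))
    return names


if __name__ == "__main__":
    # Values copied from the log lines for root port 0000:20:03.1.
    registers = [
        ("aer_uncor_status", 0x00000000),
        ("aer_uncor_mask", 0x04500000),
        ("aer_uncor_severity", 0x004E2030),
    ]
    for label, value in registers:
        decoded = decode_aer_uncor(value) or ["(no bits set)"]
        print("%s = 0x%08x: %s" % (label, value, ", ".join(decoded)))
```

Run against the values above, the status register decodes to no bits at all, so the record is flagged as fatal without naming a specific uncorrectable error. That matches the all-zero TLP header in the same record.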

We have had this problem for a long time (about a year, according to the SMBIOS event log) and have tried several things:

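  • Removing the graphics card from the PCIe slot and reseating it
  • Installing a newer kernel
  • Installing the proprietary amdgpu driver
  • Turning off every sleep setting we could find in Kubuntu
  • Booting with pcie_aspm=off
  • Using different screens (DVI/DisplayPort)
  • Changing some PCIe-related BIOS settings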
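
Because each test takes so long, it may help to confirm that a setting such as pcie_aspm=off is really active before waiting for the next failure. The following Python sketch is one possible check. It assumes the standard `/proc/cmdline` and sysfs locations, takes the GPU address 0000:23:00.0 from the log above, and uses made-up helper names (`kernel_params`, `read_text_or_none`). The parser splits on whitespace only, so quoted parameters that contain spaces are not handled.

```python
from pathlib import Path
from typing import Dict, Optional


def kernel_params(cmdline: str) -> Dict[str, Optional[str]]:
    """Parse a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def read_text_or_none(path: str) -> Optional[str]:
    """Return the stripped file contents, or None if the file cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    cmdline = read_text_or_none("/proc/cmdline") or ""
    print("pcie_aspm on command line:", kernel_params(cmdline).get("pcie_aspm"))
    # Active ASPM policy as reported by the kernel (the entry in brackets).
    print("ASPM policy:", read_text_or_none("/sys/module/pcie_aspm/parameters/policy"))
    # Runtime power management of the GPU: "on" keeps it awake, "auto" allows suspend.
    gpu = "/sys/bus/pci/devices/0000:23:00.0/power"
    print("GPU power/control:", read_text_or_none(gpu + "/control"))
    print("GPU power/runtime_status:", read_text_or_none(gpu + "/runtime_status"))
```

Since the machine stays reachable over SSH, the same check can be repeated after the screen has gone black to compare the states before and after the failure.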

However, nothing we do seems to change anything. What makes this hard to troubleshoot is that it only happens after the PC has been left unused for a while, so testing anything is really painful.

Does anyone know what the error is, or where we could start looking based on the log?


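Update: We tested the graphics card in another PC running Windows, and it worked fine. We then installed Windows 10 on the same PC, and that also runs without problems. So the cause appears to be the combination of GPU + motherboard + KDE Neon. Linux seems to allow some kind of power-saving state that does not get along with the motherboard. However, despite a lot of searching, we have not found any hibernation option that we had not already turned off.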