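We have a workstation in our group with a KNPA-U16 mainboard and an AMD RX560 GPU that works most of the time. The operating system is Kubuntu 20.04 with kernel version 5.8.0-59.
The problem occurs when the workstation has not been used for a while and someone then tries to use it locally, sitting in front of it. The behavior is as follows: the screen shows up for a short time (roughly 0 to 10 seconds) and then goes black. The time varies, as mentioned; once I even managed to log in before the screen went black. After that, the display can no longer be woken up. The machine is still reachable via ssh, though.
The kernel log shows the following: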
09:27:51 PC3 kernel: [165861.461855] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
09:27:51 PC3 kernel: [165861.461858] {1}[Hardware Error]: event severity: info
09:27:51 PC3 kernel: [165861.461860] {1}[Hardware Error]: Error 0, type: fatal
09:27:51 PC3 kernel: [165861.461861] {1}[Hardware Error]: fru_text: PcieError
09:27:51 PC3 kernel: [165861.461862] {1}[Hardware Error]: section_type: PCIe error
09:27:51 PC3 kernel: [165861.461863] {1}[Hardware Error]: port_type: 4, root port
09:27:51 PC3 kernel: [165861.461864] {1}[Hardware Error]: version: 0.2
09:27:51 PC3 kernel: [165861.461866] {1}[Hardware Error]: command: 0x0407, status: 0x0010
09:27:51 PC3 kernel: [165861.461867] {1}[Hardware Error]: device_id: 0000:20:03.1
09:27:51 PC3 kernel: [165861.461868] {1}[Hardware Error]: slot: 7
09:27:51 PC3 kernel: [165861.461868] {1}[Hardware Error]: secondary_bus: 0x23
09:27:51 PC3 kernel: [165861.461869] {1}[Hardware Error]: vendor_id: 0x1022, device_id: 0x1453
09:27:51 PC3 kernel: [165861.461870] {1}[Hardware Error]: class_code: 060400
09:27:51 PC3 kernel: [165861.461871] {1}[Hardware Error]: bridge: secondary_status: 0x2000, control: 0x001a
09:27:51 PC3 kernel: [165861.461872] {1}[Hardware Error]: aer_uncor_status: 0x00000000, aer_uncor_mask: 0x04500000
09:27:51 PC3 kernel: [165861.461873] {1}[Hardware Error]: aer_uncor_severity: 0x004e2030
09:27:51 PC3 kernel: [165861.461874] {1}[Hardware Error]: TLP Header: 00000000 00000000 00000000 00000000
09:27:51 PC3 kernel: [165861.461933] pcieport 0000:20:03.1: AER: aer_status: 0x00000000, aer_mask: 0x04500000
09:27:51 PC3 kernel: [165861.461939] pcieport 0000:20:03.1: AER: aer_layer=Transaction Layer, aer_agent=Receiver ID
09:27:51 PC3 kernel: [165861.461941] pcieport 0000:20:03.1: AER: aer_uncor_severity: 0x004e2030
09:27:51 PC3 kernel: [165861.461945] amdgpu 0000:23:00.0: AER: can't recover (no error_detected callback)
09:27:51 PC3 kernel: [165861.461947] snd_hda_intel 0000:23:00.1: AER: can't recover (no error_detected callback)
09:27:52 PC3 kernel: [165862.485806] pcieport 0000:20:03.1: AER: Root Port link has been reset
09:27:52 PC3 kernel: [165862.485854] pcieport 0000:20:03.1: AER: device recovery successful
09:28:02 PC3 kernel: [165866.837702] [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out!
09:28:02 PC3 kernel: [165872.219438] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=841906, emitted seq=841908
09:28:02 PC3 kernel: [165872.219526] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process sddm-greeter pid 88965 thread sddm-greet:cs0 pid 88969
09:28:02 PC3 kernel: [165872.219534] amdgpu 0000:23:00.0: amdgpu: GPU reset begin!
09:28:02 PC3 kernel: [165872.219865] amdgpu:
09:28:02 PC3 kernel: [165872.219865] last message was failed ret is 65535
09:28:02 PC3 kernel: [165872.219870] amdgpu:
09:28:02 PC3 kernel: [165872.219870] failed to send message 281 ret is 65535
09:28:02 PC3 kernel: [165872.219879] amdgpu:
09:28:02 PC3 kernel: [165872.219879] last message was failed ret is 65535
09:28:02 PC3 kernel: [165872.219883] amdgpu:
09:28:02 PC3 kernel: [165872.219883] failed to send message 261 ret is 65535
09:28:02 PC3 kernel: [165872.219887] amdgpu:
09:28:02 PC3 kernel: [165872.219887] last message was failed ret is 65535
09:28:02 PC3 kernel: [165872.219890] amdgpu:
09:28:02 PC3 kernel: [165872.219890] failed to send message 261 ret is 65535
09:28:02 PC3 kernel: [165872.219894] amdgpu:
09:28:02 PC3 kernel: [165872.219894] last message was failed ret is 65535
09:28:02 PC3 kernel: [165872.219897] amdgpu:
09:28:02 PC3 kernel: [165872.219897] failed to send message 261 ret is 65535
09:28:02 PC3 kernel: [165872.219901] amdgpu:
09:28:02 PC3 kernel: [165872.219901] last message was failed ret is 65535
09:28:02 PC3 kernel: [165872.219905] amdgpu:
09:28:02 PC3 kernel: [165872.219905] failed to send message 261 ret is 65535
09:28:02 PC3 kernel: [165872.219909] amdgpu:
09:28:02 PC3 kernel: [165872.219909] last message was failed ret is 65535
09:28:02 PC3 kernel: [165872.219912] amdgpu:
09:28:02 PC3 kernel: [165872.219912] failed to send message 261 ret is 65535
09:28:02 PC3 kernel: [165872.219916] amdgpu:
09:28:02 PC3 kernel: [165872.219916] last message was failed ret is 65535
09:28:02 PC3 kernel: [165872.219919] amdgpu:
09:28:02 PC3 kernel: [165872.219919] failed to send message 261 ret is 65535
09:28:02 PC3 kernel: [165872.219923] amdgpu:
09:28:02 PC3 kernel: [165872.219923] last message was failed ret is 65535
09:28:02 PC3 kernel: [165872.219926] amdgpu:
09:28:02 PC3 kernel: [165872.219926] failed to send message 261 ret is 65535
09:28:02 PC3 kernel: [165872.219930] amdgpu:
09:28:02 PC3 kernel: [165872.219930] last message was failed ret is 65535
09:28:02 PC3 kernel: [165872.219933] amdgpu:
09:28:02 PC3 kernel: [165872.219933] failed to send message 261 ret is 65535
09:28:02 PC3 kernel: [165872.219937] amdgpu:
09:28:02 PC3 kernel: [165872.219937] last message was failed ret is 65535
09:28:02 PC3 kernel: [165872.219940] amdgpu:
09:28:02 PC3 kernel: [165872.219940] failed to send message 261 ret is 65535
09:28:02 PC3 kernel: [165872.219944] amdgpu:
09:28:02 PC3 kernel: [165872.219944] last message was failed ret is 65535
09:28:02 PC3 kernel: [165872.219947] amdgpu:
09:28:02 PC3 kernel: [165872.219947] failed to send message 261 ret is 65535
09:28:02 PC3 kernel: [165872.219951] amdgpu:
09:28:02 PC3 kernel: [165872.219951] last message was failed ret is 65535
09:28:02 PC3 kernel: [165872.219954] amdgpu:
09:28:02 PC3 kernel: [165872.219954] failed to send message 261 ret is 65535
09:28:02 PC3 kernel: [165872.219958] amdgpu:
09:28:02 PC3 kernel: [165872.219958] last message was failed ret is 65535
09:28:02 PC3 kernel: [165872.219961] amdgpu:
09:28:02 PC3 kernel: [165872.219961] failed to send message 261 ret is 65535
09:28:02 PC3 kernel: [165872.219965] amdgpu:
09:28:02 PC3 kernel: [165872.219965] last message was failed ret is 65535
09:28:02 PC3 kernel: [165872.219968] amdgpu:
09:28:02 PC3 kernel: [165872.219968] failed to send message 261 ret is 65535
09:28:02 PC3 kernel: [165872.219972] amdgpu:
09:28:02 PC3 kernel: [165872.219972] last message was failed ret is 65535
09:28:02 PC3 kernel: [165872.219975] amdgpu:
09:28:02 PC3 kernel: [165872.219975] failed to send message 261 ret is 65535
09:28:02 PC3 kernel: [165872.219978] amdgpu:
09:28:02 PC3 kernel: [165872.219978] last message was failed ret is 65535
09:28:02 PC3 kernel: [165872.219981] amdgpu:
09:28:02 PC3 kernel: [165872.219981] failed to send message 261 ret is 65535
09:28:02 PC3 kernel: [165872.219985] amdgpu:
09:28:02 PC3 kernel: [165872.219985] last message was failed ret is 65535
09:28:02 PC3 kernel: [165872.219988] amdgpu:
09:28:02 PC3 kernel: [165872.219988] failed to send message 261 ret is 65535
09:28:02 PC3 kernel: [165872.219992] amdgpu:
09:28:02 PC3 kernel: [165872.219992] last message was failed ret is 65535
09:28:02 PC3 kernel: [165872.219995] amdgpu:
09:28:02 PC3 kernel: [165872.219995] failed to send message 261 ret is 65535
09:28:02 PC3 kernel: [165872.220169] amdgpu:
09:28:02 PC3 kernel: [165872.220169] last message was failed ret is 65535
09:28:02 PC3 kernel: [165872.220173] amdgpu:
09:28:02 PC3 kernel: [165872.220173] failed to send message 306 ret is 65535
09:28:02 PC3 kernel: [165872.220175] amdgpu:
09:28:02 PC3 kernel: [165872.220175] last message was failed ret is 65535
09:28:02 PC3 kernel: [165872.220179] amdgpu:
09:28:02 PC3 kernel: [165872.220179] failed to send message 5e ret is 65535
09:28:02 PC3 kernel: [165872.220183] amdgpu:
09:28:02 PC3 kernel: [165872.220183] last message was failed ret is 65535
09:28:02 PC3 kernel: [165872.220186] amdgpu:
09:28:02 PC3 kernel: [165872.220186] failed to send message 145 ret is 65535
09:28:02 PC3 kernel: [165872.220190] amdgpu:
09:28:02 PC3 kernel: [165872.220190] last message was failed ret is 65535
09:28:02 PC3 kernel: [165872.220195] amdgpu:
09:28:02 PC3 kernel: [165872.220195] failed to send message 146 ret is 65535
09:28:02 PC3 kernel: [165872.220200] amdgpu:
09:28:02 PC3 kernel: [165872.220200] last message was failed ret is 65535
09:28:02 PC3 kernel: [165872.220203] amdgpu:
09:28:02 PC3 kernel: [165872.220203] failed to send message 148 ret is 65535
09:28:02 PC3 kernel: [165872.220207] amdgpu:
09:28:02 PC3 kernel: [165872.220207] last message was failed ret is 65535
09:28:02 PC3 kernel: [165872.220210] amdgpu:
09:28:02 PC3 kernel: [165872.220210] failed to send message 145 ret is 65535
09:28:02 PC3 kernel: [165872.220215] amdgpu:
09:28:02 PC3 kernel: [165872.220215] last message was failed ret is 65535
09:28:02 PC3 kernel: [165872.220219] amdgpu:
09:28:02 PC3 kernel: [165872.220219] failed to send message 146 ret is 65535
09:28:22 PC3 kernel: [165892.248439] [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 20secs aborting
09:28:22 PC3 kernel: [165892.248505] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing D8DE (len 824, WS 0, PS 0) @ 0xDA5E
09:28:22 PC3 kernel: [165892.248569] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing D798 (len 326, WS 0, PS 0) @ 0xD888
09:28:22 PC3 kernel: [165892.248664] [drm:dce110_link_encoder_disable_output [amdgpu]] *ERROR* dce110_link_encoder_disable_output: Failed to execute VBIOS command table!
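In case it helps with reading the AER lines, here is a minimal Python sketch that turns the `aer_uncor_*` values from the log into bit names. The bit positions are the ones I understand the PCIe AER Uncorrectable Error Status register to use (the mask and severity registers share the same layout); the table and the `decode_uncor` helper are written for this post and should be checked against the specification before relying on them.

```python
from typing import List

# Bit positions of the AER uncorrectable error registers, as assumed for this
# sketch (status, mask and severity use the same layout).
UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
    22: "Uncorrectable Internal Error",
    23: "MC Blocked TLP",
    24: "AtomicOp Egress Blocked",
    25: "TLP Prefix Blocked",
    26: "Poisoned TLP Egress Blocked",
}


def decode_uncor(value: int) -> List[str]:
    """Return the names of all bits set in the register value, lowest bit first."""
    names = []
    for bit in range(32):
        if value & (1 << bit):
            names.append(UNCOR_BITS.get(bit, f"bit {bit} (unknown/reserved)"))
    return names


if __name__ == "__main__":
    # Values copied from the kernel log above.
    for label, value in [
        ("aer_uncor_status", 0x00000000),
        ("aer_uncor_mask", 0x04500000),
        ("aer_uncor_severity", 0x004E2030),
    ]:
        print(f"{label} = {value:#010x}")
        for name in decode_uncor(value) or ["(no bits set)"]:
            print(f"  {name}")
```

With the values from our log, the status register decodes to no bits at all, and the mask covers Unsupported Request, Uncorrectable Internal Error and Poisoned TLP Egress Blocked.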
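We have had this issue for a long time (about a year, as can be seen in the SMBIOS event log) and have tried several things:
- removing the graphics card from the PCIe slot and plugging it back in
- installing a newer kernel
- installing the proprietary amdgpu driver
- turning off all sleep settings in Kubuntu
- booting with pcie_aspm=off
- using a different screen (DVI / DisplayPort)
- changing the PCIe-related BIOS settings
But nothing we do seems to change anything. What makes this issue so hard to tinker with is that it only occurs after the PC has not been used for a while, so testing anything is really tedious.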
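Because each test needs a long idle period, it may be easier to let the machine sit and then pull the relevant events out of the log afterwards. Below is a rough Python sketch of that idea. It assumes a kern.log-style file with the same `kernel: [uptime] message` layout as the excerpt above; the signature strings are copied from that excerpt, while the tags and the `find_events` helper are names made up for this sketch.

```python
import re
import sys
from typing import Iterable, List, Tuple

# Matches the "kernel: [uptime] message" part of a kern.log-style line.
LINE_RE = re.compile(r"kernel: \[\s*(\d+\.\d+)\] (.*)")

# Substrings taken from the log excerpt, mapped to short tags (hypothetical names).
SIGNATURES = [
    ("Hardware error from APEI", "apei_hw_error"),
    ("Root Port link has been reset", "link_reset"),
    ("Waiting for fences timed out", "fence_timeout"),
    ("ring gfx timeout", "ring_timeout"),
    ("GPU reset begin", "gpu_reset"),
    ("atombios stuck in loop", "atombios_stuck"),
]


def find_events(lines: Iterable[str]) -> List[Tuple[float, str]]:
    """Return (uptime_seconds, tag) for every line matching a known signature."""
    events = []
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue
        uptime, message = float(match.group(1)), match.group(2)
        for needle, tag in SIGNATURES:
            if needle in message:
                events.append((uptime, tag))
                break
    return events


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 aer_events.py /var/log/kern.log  (script name is arbitrary)
    with open(sys.argv[1], errors="replace") as handle:
        events = find_events(handle)
    start = events[0][0] if events else 0.0
    for uptime, tag in events:
        # The offset is relative to the first matching event in the file.
        print(f"{uptime:14.6f}  +{uptime - start:10.3f}s  {tag}")
```

The offsets are only meaningful within one incident; a file covering several incidents or reboots would need to be split first.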
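Does anyone have an idea what the error is, or where to start looking based on the logs?
Update: We tested the graphics card in another PC running Windows, where it worked fine. We then installed Win 10 on the same PC, and that also runs without problems. So it seems to be the combination of GPU + mainboard + KDE neon. It looks as if Linux allows some kind of power-saving state that the mainboard does not get along with. However, I have searched a lot and could not find any suspend or hibernation option that I had not already turned off.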
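To see which power-management settings are actually in effect, rather than which ones we believe we turned off, a read-only check along these lines could be run over ssh. It is only a sketch: it assumes the usual sysfs attributes (`power/control` and `power/runtime_status` under each PCI device, and the `pcie_aspm` module's `policy` parameter) are present on this kernel, and it reports `None` for anything that is missing. The PCI addresses are the ones from the log; `read_attr` and `power_report` are helper names made up for this post.

```python
from pathlib import Path
from typing import Dict, Iterable, Optional

# Root port, GPU and GPU audio function, as they appear in the kernel log.
DEVICES = ["0000:20:03.1", "0000:23:00.0", "0000:23:00.1"]


def read_attr(path: Path) -> Optional[str]:
    """Return the stripped content of a sysfs attribute, or None if unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def power_report(devices: Iterable[str],
                 sysfs_root: Path = Path("/sys")) -> Dict[str, Optional[str]]:
    """Collect ASPM policy and per-device runtime PM settings (read-only)."""
    report = {
        "pcie_aspm policy": read_attr(
            sysfs_root / "module/pcie_aspm/parameters/policy"
        )
    }
    for bdf in devices:
        power_dir = sysfs_root / "bus/pci/devices" / bdf / "power"
        report[f"{bdf} power/control"] = read_attr(power_dir / "control")
        report[f"{bdf} power/runtime_status"] = read_attr(
            power_dir / "runtime_status"
        )
    return report


if __name__ == "__main__":
    for key, value in power_report(DEVICES).items():
        print(f"{key}: {value}")
```

Nothing is written to sysfs here, so it should be safe to run repeatedly, for example once right after boot and once after the machine has been idle for a while.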