Ubuntu 20.04 crashes: ECC error or L2 poison detected

Ubuntu 20.04 crashes randomly at different times. I can't tie the crashes to any specific event.

uname -a
Linux ubuntu 5.11.0-051100-generic #202102142330 SMP Sun Feb 14 23:33:21 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

The crash comes with the following messages:

 kernel:[19849.215258] [Hardware Error]: Uncorrected, software restartable error.

 kernel:[19849.215259] [Hardware Error]: CPU:22 (19:21:0) MC0_STATUS[-|UE|MiscV|AddrV|-|-|-|-|Poison|-]: 0xbc00080001010135

 kernel:[19849.215263] [Hardware Error]: Error Addr: 0x000000076bed1c00

 kernel:[19849.215264] [Hardware Error]: IPID: 0x001000b000000000

 kernel:[19849.215266] [Hardware Error]: Load Store Unit Ext. Error Code: 1, An ECC error or L2 poison was detected on a data cache read by a load.

 kernel:[19849.215269] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
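
To make the status line easier to read, here is a small Python sketch that splits the MC0_STATUS value from the log into its flag bits and error-code fields. The bit positions and field tables are my reading of the MCI_STATUS_* masks and the AMD decoder in the Linux kernel, so treat them as assumptions. decode_status is a made-up helper name, not a kernel or library API.

```python
# Bit positions follow the MCI_STATUS_* masks in the Linux kernel headers
# (an assumption on my part; this is an illustration, not a full decoder).
STATUS_BITS = {
    63: "Val",       # register holds a valid error
    62: "Over",      # an earlier error was overwritten
    61: "UE",        # uncorrected error
    60: "En",        # error reporting was enabled
    59: "MiscV",     # MISC register is valid
    58: "AddrV",     # address register is valid
    57: "PCC",       # processor context corrupt
    44: "Deferred",  # deferred error
    43: "Poison",    # poisoned data was consumed
}

# Field names as the kernel prints them for a memory-type error code,
# i.e. a low word of the form 0000 0001 RRRR TTLL.
MEM_TX = ["GEN", "RD", "WR", "DRD", "DWR", "IRD", "PRF", "EV", "SNP"]
TX = ["INSN", "DATA", "GEN", "RESV"]
LEVEL = ["RESV", "L1", "L2", "L3/GEN"]


def decode_status(status):
    """Split a 64-bit MCA status value into flags and error-code fields."""
    flags = [name for bit, name in sorted(STATUS_BITS.items(), reverse=True)
             if status & (1 << bit)]
    ec = status & 0xFFFF            # low 16 bits: error code
    rrrr = (ec >> 4) & 0xF          # memory transaction type
    return {
        "flags": flags,
        "ext_error_code": (status >> 16) & 0x3F,
        "mem_tx": MEM_TX[rrrr] if rrrr < len(MEM_TX) else "unknown",
        "tx": TX[(ec >> 2) & 0x3],
        "cache_level": LEVEL[ec & 0x3],
    }


if __name__ == "__main__":
    # Value copied from the MC0_STATUS line in the log.
    for key, value in decode_status(0xBC00080001010135).items():
        print(f"{key}: {value}")
```

For the value in the log this yields the flags Val, UE, En, MiscV, AddrV and Poison, extended error code 1, and DRD / DATA / L1, which agrees with what the kernel printed. The 19:21:0 in the CPU field is hexadecimal for family 25, model 33, stepping 0, the same values lscpu reports.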

Hardware information:

### CPU
  Architecture:                    x86_64
  CPU op-mode(s):                  32-bit, 64-bit
  Byte Order:                      Little Endian
  Address sizes:                   48 bits physical, 48 bits virtual
  CPU(s):                          24
  On-line CPU(s) list:             0-23
  Thread(s) per core:              2
  Core(s) per socket:              12
  Socket(s):                       1
  NUMA node(s):                    1
  Vendor ID:                       AuthenticAMD
  CPU family:                      25
  Model:                           33
  Model name:                      AMD Ryzen 9 5900X 12-Core Processor
  Stepping:                        0
  Frequency boost:                 enabled
  CPU MHz:                         2200.000
  CPU max MHz:                     6442.4800
  CPU min MHz:                     2200.0000

### Base Board Information
  Manufacturer: ASRock
  Product Name: X570 Taichi

### Memory:
G Skill Trident Z Neo DDR4 - 3600Mhz 32GB (2 x 16GB)

What is the recommended way to find the root cause? How do I enable more logging, or, if logs already exist, where can I find them? Thanks!

Answer 1

This is not technically an answer, but ...

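The message "ECC error or L2 poison was detected on a data cache read by a load" points to a memory problem, either in the RAM itself or in the cache on the CPU. Neither is good news, but you can test the system RAM with the following procedure:

  1. Restart your system
  2. Hold the Shift key to bring up the GRUB menu
  3. Select "Ubuntu, memtest86+" and press Enter
    The memory test will run until the end of time or until you press the Esc key. Let the machine complete at least one full pass before you exit.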

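Based on reports around the web, this problem seems to appear only on higher-end AMD Ryzen processors. Reading through this long thread on the AMD Community site turned up this interesting bit:

"I replaced the memory and the computer has been rock solid for several days. Hope this helps you like it helped me. The previous memory was Gskill 3600mhz memory ... the new memory is 3200 memory from Corsair."

Your question does not say which memory you have installed, but if it is a set of higher-frequency modules, there may be some issue between the RAM and the CPU that causes the instability. If the memory test fails and you happen to have some compatible 3200 MHz RAM available (even a single DIMM), consider swapping it in and running the memory test again.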

Answer 2

BIOS

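ASRock X570 Taichi

The current BIOS version is P4.30.

Memory

G Skill Trident Z Neo DDR4 - 3600Mhz 32GB (2 x 16GB), Product: F4-3600C16-16GTZNC

AMD Ryzen 9 5900X 12-Core Processor

Ryzen processors are very picky about RAM.

These DIMMs do not appear on the memory support list, as shown here

memtest passed all tests.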

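Looking at the output of sudo lshw -C memory, we see that the DIMMs may be installed in the wrong slots. When using 2 DIMMs of the same size, they should be installed in slots A2 and B2. Here is an image of the board layout and memory slots, taken from the user manual: here ... so just verify this ...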

[Image: board layout and memory slots, from the user manual]

Answer 3

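Following @heynnema's suggestion, I found that the DIMM model installed in my computer is not on the board's compatibility list. These are the steps I followed: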

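  1. Go to the CPU support list on the ASRock X570 Taichi website and find the core type. In my case it is Vermeer.
  2. Find the model of the DIMMs installed in the system by running sudo lshw -C memory (it is F4-3600C16-16GTZNC).
  3. Go to the memory support list for Vermeer and check whether the model is supported. Unfortunately it is not on the list! Maybe that is the reason for the intermittent crashes. I will try a supported DIMM model, see whether the crashes happen again, and update this answer accordingly.

  *-firmware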
       description: BIOS
       vendor: American Megatrends Inc.
       physical id: 0
       version: P4.30
       date: 04/14/2021
       size: 64KiB
       capacity: 16MiB
       capabilities: pci upgrade shadowing cdboot bootselect socketedrom edd int13floppy1200 int13floppy720 int13floppy2880 int5printscreen int9keyboard int14serial int17printer acpi usb biosbootspecification uefi
  *-memory
       description: System Memory
       physical id: e
       slot: System board or motherboard
       size: 32GiB
     *-bank:0
          description: DIMM DDR4 Synchronous Unbuffered (Unregistered) 2133 MHz (0.5 ns)
          product: F4-3600C16-16GTZNC
          vendor: Unknown
          physical id: 0
          serial: 00000000
          slot: DIMM 0
          size: 16GiB
          width: 64 bits
          clock: 2133MHz (0.5ns)
     *-bank:1
          description: [empty]
          product: Unknown
          vendor: Unknown
          physical id: 1
          serial: Unknown
          slot: DIMM 1
     *-bank:2
          description: DIMM DDR4 Synchronous Unbuffered (Unregistered) 2133 MHz (0.5 ns)
          product: F4-3600C16-16GTZNC
          vendor: Unknown
          physical id: 2
          serial: 00000000
          slot: DIMM 0
          size: 16GiB
          width: 64 bits
          clock: 2133MHz (0.5ns)
     *-bank:3
          description: [empty]
          product: Unknown
          vendor: Unknown
          physical id: 3
          serial: Unknown
          slot: DIMM 1
  *-cache:0
       description: L1 cache
       physical id: 11
       slot: L1 - Cache
       size: 768KiB
       capacity: 768KiB
       clock: 1GHz (1.0ns)
       capabilities: pipeline-burst internal write-back unified
       configuration: level=1
  *-cache:1
       description: L2 cache
       physical id: 12
       slot: L2 - Cache
       size: 6MiB
       capacity: 6MiB
       clock: 1GHz (1.0ns)
       capabilities: pipeline-burst internal write-back unified
       configuration: level=2
  *-cache:2
       description: L3 cache
       physical id: 13
       slot: L3 - Cache
       size: 64MiB
       capacity: 64MiB
       clock: 1GHz (1.0ns)
       capabilities: pipeline-burst internal write-back unified
       configuration: level=3
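
If you want to pull the relevant fields out of that listing rather than read it by eye, something like the following Python sketch may help. It only parses the plain-text layout shown in this answer. parse_banks and populated are hypothetical helper names, and treating "has a size line" as "populated" is my assumption based on how the empty banks appear here.

```python
import re

# Trimmed sample in the same layout as the lshw output in this answer.
SAMPLE = """\
     *-bank:0
          description: DIMM DDR4 Synchronous Unbuffered (Unregistered) 2133 MHz (0.5 ns)
          product: F4-3600C16-16GTZNC
          slot: DIMM 0
          size: 16GiB
          clock: 2133MHz (0.5ns)
     *-bank:1
          description: [empty]
          product: Unknown
          slot: DIMM 1
"""


def parse_banks(lshw_text):
    """Collect the key/value lines found under each *-bank:N heading."""
    banks = []
    current = None
    for line in lshw_text.splitlines():
        stripped = line.strip()
        heading = re.match(r"\*-(\S+)", stripped)
        if heading:
            # A new section starts; only bank sections are kept.
            current = None
            if heading.group(1).startswith("bank"):
                current = {"bank": heading.group(1)}
                banks.append(current)
        elif current is not None and ":" in stripped:
            # Split on the first colon only, so values may contain colons.
            key, _, value = stripped.partition(":")
            current[key.strip()] = value.strip()
    return banks


def populated(banks):
    """Assumption: a bank is populated when lshw reports a size for it."""
    return [b for b in banks if "size" in b]


if __name__ == "__main__":
    for bank in populated(parse_banks(SAMPLE)):
        print(bank["bank"], bank["product"], bank["clock"])
```

To run it on a real machine, save the output first (for example sudo lshw -C memory > mem.txt) and pass the file contents to parse_banks. In the listing above this picks out bank:0 and bank:2 as populated, both F4-3600C16-16GTZNC, and both reported at 2133MHz rather than the 3600 in the product name.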