HDD & SSD on Linux: "hard resetting link"

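My current storage setup consists of two conventional HDDs and two SSDs in a Linux box. Each pair sits in its own RAID 1 array, and each array is encrypted with LUKS. I have more of a story than a specific question.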

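For over a year I have been getting random "hard resetting link" errors in the kernel log for some of the drives. I would RMA the offending drive, and the new drive would make the problem stop. A few months later I would see the same error again, at seemingly random times. The drive would be marked as failed in the RAID and would no longer show up in fdisk -l. After a reboot the drive would appear again, I could re-add it to the array, and the array would rebuild. Sooner or later the problem would recur, usually within a few hours.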

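About six months ago I replaced two of the conventional HDDs with SSDs, hoping their failure rate would be lower than that of conventional drives. Over the past few days, however, one of the new SSDs and one of the conventional hard drives have started showing the problem.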

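I am starting to see a pattern: I buy a new drive, and a few months later the trouble begins. I had always put this down to a high HDD failure rate, but now it is happening with SSDs too, so I no longer think the drives are at fault. What else could be wrong? I have installed several operating systems since the problems began, so I would like to rule out software. That leaves the SATA cables or the motherboard. Could disk encryption be putting too much stress on the drives? Is there anything I can do to find out more? Thanks, as always.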

Below is the dmesg output from a question I asked a few months ago, when I ran into the same problem.

[43161.734107] ata3: ATA_REG 0x41 ERR_REG 0x84
[43161.734110] ata3: tag : dhfis dmafis sdbfis sactive
[43161.734113] ata3: tag 0x0: 1 1 0 1  
[43161.734123] ata3.00: exception Emask 0x1 SAct 0x1 SErr 0x180000 action 0x6 frozen
[43161.734127] ata3.00: Ata error. fis:0x21
[43161.734130] ata3: SError: { 10B8B Dispar }
[43161.734134] ata3.00: failed command: READ FPDMA QUEUED
[43161.734142] ata3.00: cmd 60/08:00:a8:03:00/00:00:00:00:00/40 tag 0 ncq 4096 in
[43161.734144]          res 41/84:04:a8:03:00/84:00:00:00:00/40 Emask 0x10 (ATA bus error)
[43161.734148] ata3.00: status: { DRDY ERR }
[43161.734150] ata3.00: error: { ICRC ABRT }
[43161.734155] ata3: hard resetting link
[43161.734158] ata3: nv: skipping hardreset on occupied port
[43162.220095] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[43162.260202] ata3.00: model number mismatch 'WDC WD2002FAEX-007BA0' != 'C WD2002FAEX-007BA0                   �'
[43162.260206] ata3.00: revalidation failed (errno=-19)
[43162.260211] ata3.00: limiting speed to UDMA/133:PIO2
[43167.220123] ata3: hard resetting link
[43167.220127] ata3: nv: skipping hardreset on occupied port
[43167.710060] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[43167.750228] ata3.00: model number mismatch 'WDC WD2002FAEX-007BA0' != 'C WD2002FAEX-007BA0                   �'
[43167.750232] ata3.00: revalidation failed (errno=-19)
[43167.750236] ata3.00: disabled
[43172.710100] ata3: hard resetting link
[43173.620110] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[43173.640455] ata3.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)
[43178.620116] ata3: hard resetting link
[43179.530113] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[43179.550748] ata3.00: ATA-8: WDC WD2002FAEX-007BA0, 05.01D05, max UDMA/133
[43179.550753] ata3.00: 3907029168 sectors, multi 16: LBA48 NCQ (depth 31/32)
[43179.570208] ata3.00: model number mismatch 'WDC WD2002FAEX-007BA0' != 'C WD2002FAEX-007BA0                   �'
[43179.570213] ata3.00: revalidation failed (errno=-19)
[43179.570220] ata3: limiting SATA link speed to 1.5 Gbps
[43179.570224] ata3.00: limiting speed to UDMA/133:PIO3
[43184.530066] ata3: hard resetting link
[43184.530070] ata3: nv: skipping hardreset on occupied port
[43185.020091] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[43185.060949] ata3.00: configured for UDMA/133
[43185.060969] sd 2:0:0:0: [sdd]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[43185.060974] sd 2:0:0:0: [sdd]  Sense Key : Aborted Command [current] [descriptor]
[43185.060980] Descriptor sense data with sense descriptors (in hex):
[43185.060983]         72 0b 47 00 00 00 00 0c 00 0a 80 00 00 00 00 00 
[43185.060995]         00 00 03 a8 
[43185.061000] sd 2:0:0:0: [sdd]  Add. Sense: Scsi parity error
[43185.061006] sd 2:0:0:0: [sdd] CDB: Read(10): 28 00 00 00 03 a8 00 00 08 00
[43185.061017] end_request: I/O error, dev sdd, sector 936
[43185.061023] Buffer I/O error on device sdd, logical block 117
[43185.061044] sd 2:0:0:0: rejecting I/O to offline device
[43185.061048] sd 2:0:0:0: killing request
[43185.061062] ata3: EH complete
[43185.061075] sd 2:0:0:0: rejecting I/O to offline device
[43185.061123] sd 2:0:0:0: rejecting I/O to offline device
[43185.061134] sd 2:0:0:0: rejecting I/O to offline device
[43185.061140] sd 2:0:0:0: rejecting I/O to offline device
[43185.061145] sd 2:0:0:0: [sdd] READ CAPACITY(16) failed
[43185.061147] sd 2:0:0:0: [sdd]  Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[43185.061152] sd 2:0:0:0: [sdd] Sense not available.
[43185.061155] sd 2:0:0:0: rejecting I/O to offline device
[43185.061166] sd 2:0:0:0: rejecting I/O to offline device
[43185.061175] sd 2:0:0:0: rejecting I/O to offline device
[43185.061185] sd 2:0:0:0: rejecting I/O to offline device
[43185.061193] sd 2:0:0:0: rejecting I/O to offline device
[43185.061198] sd 2:0:0:0: [sdd] READ CAPACITY failed
[43185.061202] sd 2:0:0:0: rejecting I/O to offline device
[43185.061209] sd 2:0:0:0: [sdd]  Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[43185.061215] sd 2:0:0:0: [sdd] Sense not available.
[43185.061226] sd 2:0:0:0: rejecting I/O to offline device
[43185.061235] sd 2:0:0:0: rejecting I/O to offline device
[43185.061245] sd 2:0:0:0: rejecting I/O to offline device
[43185.061254] sd 2:0:0:0: rejecting I/O to offline device
[43185.061263] sd 2:0:0:0: rejecting I/O to offline device
[43185.061274] sd 2:0:0:0: rejecting I/O to offline device
[43185.061280] sd 2:0:0:0: [sdd] Asking for cache data failed
[43185.061283] sd 2:0:0:0: [sdd] Assuming drive cache: write through
[43185.061289] sdd: detected capacity change from 2000398934016 to 0
[43185.061610] ata3.00: detaching (SCSI 2:0:0:0)
[43185.062444] sd 2:0:0:0: [sdd] Stopping disk
[43249.120042] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[43249.120046] ata4.00: failed command: FLUSH CACHE EXT
[43249.120051] ata4.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
[43249.120052]          res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[43249.120054] ata4.00: status: { DRDY }
[43249.120059] ata4: hard resetting link
[43249.120060] ata4: nv: skipping hardreset on occupied port
[43249.610042] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[43249.650323] ata4.00: configured for UDMA/133
[43249.650326] ata4.00: retrying FLUSH 0xea Emask 0x4
[43249.650452] ata4.00: device reported invalid CHS sector 0
[43249.650458] ata4: EH complete

Answer 1

You do have a question here. If I understand correctly, you want to determine what is causing this failure?

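I am a cybersecurity engineer, so please understand that I cringe as I type this: eliminate encryption as the cause. Decrypt the drives and see whether the problem persists. The downside is that you will need to run them unencrypted for a few months.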

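The cables are an easy test, and you should start there. Swap them out, although I find it hard to believe they are the problem unless you have neon lights in your box.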

That leaves the motherboard, if it is not one of the other two...

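I am sure that anyone who disagrees with my troubleshooting will chime in. Replacing the cables is cheap. Temporarily disabling encryption carries a security risk, and only you can decide whether you are willing to accept it.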

Answer 2

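It looks like you are getting a lot of errors on your SATA link. As a result, the host cannot reliably get commands across the link, and the data that comes back is sometimes corrupted.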
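To see whether the errors follow a port (cable or controller) rather than a particular drive, it can help to tally link events per ATA port over time. The following is a minimal sketch, not a definitive tool: the function name `tally_link_events` and the event labels are made up for illustration, and the patterns are based only on the message shapes that appear in the log quoted in the question.

```python
import re
from collections import Counter, defaultdict

# Matches "ata3:" or "ata3.00:" and captures the port name ("ata3").
PORT_RE = re.compile(r"\b(ata\d+)(?:\.\d+)?: ")
# Continuation lines ("res ... Emask ...") carry no port prefix of their own.
CONT_RE = re.compile(r"\]\s+res \S+")

# Hypothetical event labels mapped to substrings seen in the quoted log.
EVENTS = {
    "hard_reset": re.compile(r"hard resetting link"),
    "crc_error": re.compile(r"\bICRC\b"),
    "bus_error": re.compile(r"\(ATA bus error\)"),
    "timeout": re.compile(r"\(timeout\)"),
    "speed_limited": re.compile(r"limiting (?:SATA link )?speed"),
}

def tally_link_events(log_text):
    """Count link-level events per ATA port in kernel log text."""
    counts = defaultdict(Counter)
    last_port = None
    for line in log_text.splitlines():
        m = PORT_RE.search(line)
        if m:
            last_port = m.group(1)
        elif not (last_port and CONT_RE.search(line)):
            # Not an ata line and not a continuation of one: ignore it.
            continue
        for name, pattern in EVENTS.items():
            if pattern.search(line):
                counts[last_port][name] += 1
    return {port: dict(c) for port, c in counts.items()}
```

Passing it the text of `dmesg` before and after swapping cables between ports gives a rough comparison: if the counts stay with the port rather than moving with the drive, the cable or controller becomes the more likely suspect.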

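You can see this in the messages where the speed is limited or where the expected drive identification string is not received. You will also see confusing messages from different layers of the driver that do not necessarily reflect what is happening at the SATA hardware level. For example, "limiting speed to UDMA/133:PIO3" strictly applies only to parallel ATA drives (it just means the driver is trying a slower interface speed to see whether the errors clear up), yet the error messages make it clear that the lowest layer, the one actually handling the hardware, knows it is talking to a SATA drive.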
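The SATA-level values in the log can be decoded by hand, which shows what the hardware layer itself is reporting. This is a rough sketch under stated assumptions: it uses the SATA register layout as I understand it (SStatus fields DET, SPD, and IPM in bits 3:0, 7:4, and 11:8; SError diagnostic bits 16 to 23) and the short names the kernel prints in the log. The function names are hypothetical.

```python
# SError diagnostic bits, labelled with the short names the kernel prints
# (the quoted log shows "10B8B" and "Dispar" for SErr 0x180000).
SERROR_BITS = {
    16: "PHYRdyChg",  # PhyRdy changed state
    17: "PHYInt",     # PHY internal error
    18: "CommWake",   # COMWAKE detected
    19: "10B8B",      # 10b-to-8b decode error
    20: "Dispar",     # running disparity error
    21: "BadCRC",     # CRC error on the link
    22: "Handshk",    # handshake error
    23: "LinkSeq",    # link sequence error
}

# SPD field values and the negotiated link speed they stand for.
SPEEDS = {1: "1.5 Gbps", 2: "3.0 Gbps", 3: "6.0 Gbps"}

def decode_serror(value):
    """Return the names of the diagnostic bits set in an SError value."""
    return [name for bit, name in sorted(SERROR_BITS.items())
            if value & (1 << bit)]

def decode_sstatus(value):
    """Split an SStatus value into its DET, SPD, and IPM fields."""
    spd = (value >> 4) & 0xF
    return {
        "det": value & 0xF,          # 3 = device present, PHY communication up
        "spd": spd,
        "ipm": (value >> 8) & 0xF,   # 1 = interface in active state
        "speed": SPEEDS.get(spd, "unknown"),
    }

print(decode_serror(0x180000))
print(decode_sstatus(0x123))
```

Both flags decoded from `SErr 0x180000` describe bit-level corruption on the wire, which is consistent with a marginal cable or connector rather than with bad sectors on the drive.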

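You are right to suspect the SATA cables. Try replacing them, and make sure the new ones are rated for SATA 3.0 Gb/s (also known as "SATA 2" or "SATA II"). I do not think there is anything wrong with your drives. Why would the errors take months to appear after a drive is replaced? Perhaps the cables work loose somehow, and replacing a drive reseats them. Or perhaps it is just chance.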
