Hard drive failing or controller damaged after a hard power-off

2024-7-17

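After nouveau froze everything for the millionth time with dual monitors, I had to cut the power to my MacBook Pro (mid-2010, Fedora 24, Samsung HN-M500MBB hard drive). I wasn't doing anything I/O heavy, just viewing slides in Evince.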

On reboot it started spewing errors about bad sectors, such as the following:

blk_update_request: I/O error, dev sda, sector 969158669
ata1: EH complete
ata1.00: exception Emask 0x0 SAct 0x3c000000 SErr 0x0 action 0x6 frozen
ata1.00: failed command: READ FPDMA QUEUED
ata1.00: cmd 60/08:d0:08:30:c4/00:00:39:00:00/40 tag 26 ncq dma 4096 in
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1.00: failed command: READ FPDMA QUEUED
ata1.00: cmd 60/28:d8:c8:2f:c4/00:00:39:00:00/40 tag 27 ncq dma 20480 in
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1.00: failed command: READ FPDMA QUEUED
ata1.00: cmd 60/38:e0:88:2f:c4/00:00:39:00:00/40 tag 28 ncq dma 28672 in
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1.00: failed command: READ FPDMA QUEUED
ata1.00: cmd 60/78:e8:08:2f:c4/00:00:39:00:00/40 tag 29 ncq dma 61440 in
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1: hard resetting link
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: configured for UDMA/133
ata1.00: device reported invalid CHS sector 0

along with the occasional

sd 0:0:0:0: [sda] tag#19 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 0:0:0:0: [sda] tag#19 Sense Key : Medium Error [current] 
sd 0:0:0:0: [sda] tag#19 Add. Sense: Unrecovered read error - auto reallocate failed
sd 0:0:0:0: [sda] tag#19 CDB: Read(10) 28 00 39 c4 30 08 00 00 08 00
blk_update_request: I/O error, dev sda, sector 969158669
Buffer I/O error on dev dm-2, logical block 1, async page read

and

ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata1.00: irq_stat 0x40000001
ata1.00: failed command: READ SECTOR(S) EXT
ata1.00: cmd 24/00:01:0d:30:c4/00:00:39:00:00/e0 tag 6 pio 512 in
         res 51/40:01:0d:30:c4/00:00:39:00:00/e0 Emask 0x9 (media error)
ata1.00: status: { DRDY ERR }
ata1.00: error: { UNC }
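The three kinds of log excerpts above can be cross-checked by decoding the sector addresses they contain. The following is a minimal sketch, assuming libata prints each taskfile as `command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device`; the function names `parse_taskfile`, `lba48`, and `read10_range` are illustrative and not part of any tool mentioned in the post.

```python
import re

# Assumed field order of a libata "cmd"/"res" taskfile string.
FIELDS = ("command", "feature", "nsect", "lbal", "lbam", "lbah",
          "hob_feature", "hob_nsect", "hob_lbal", "hob_lbam", "hob_lbah",
          "device")


def parse_taskfile(text):
    """Split a string like '51/40:01:0d:30:c4/00:00:39:00:00/e0' into named bytes."""
    values = [int(part, 16) for part in re.split(r"[/:]", text.strip())]
    if len(values) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(values)))
    return dict(zip(FIELDS, values))


def lba48(tf):
    """Combine the six LBA bytes of a parsed taskfile into a 48-bit sector number."""
    return (tf["hob_lbah"] << 40 | tf["hob_lbam"] << 32 | tf["hob_lbal"] << 24
            | tf["lbah"] << 16 | tf["lbam"] << 8 | tf["lbal"])


def read10_range(cdb_hex):
    """Return (first sector, sector count) from a SCSI READ(10) CDB given as hex."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    first = int.from_bytes(cdb[2:6], "big")   # bytes 2..5: logical block address
    count = int.from_bytes(cdb[7:9], "big")   # bytes 7..8: transfer length
    return first, count


if __name__ == "__main__":
    res = parse_taskfile("51/40:01:0d:30:c4/00:00:39:00:00/e0")
    print("media error at sector", lba48(res))
    print("READ(10) range", read10_range("28 00 39 c4 30 08 00 00 08 00"))
```

Under that assumption, the `res` line of the media error decodes to sector 969158669, the same sector named in the `blk_update_request` message, and it lies inside the 8-sector READ(10) request that starts at sector 969158664.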

This is the smartctl output after using hdparm to try to read a few sectors following the bad one:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       469
  2 Throughput_Performance  0x0026   252   252   000    Old_age   Always       -       0
  3 Spin_Up_Time            0x0023   086   086   025    Pre-fail  Always       -       4463
  4 Start_Stop_Count        0x0032   092   092   000    Old_age   Always       -       8099
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       19382
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       980
 12 Power_Cycle_Count       0x0032   092   092   000    Old_age   Always       -       8214
181 Program_Fail_Cnt_Total  0x0022   097   097   000    Old_age   Always       -       66246139
191 G-Sense_Error_Rate      0x0022   100   100   000    Old_age   Always       -       3820
192 Power-Off_Retract_Count 0x0022   100   100   000    Old_age   Always       -       20
194 Temperature_Celsius     0x0002   064   051   000    Old_age   Always       -       32 (Min/Max 15/49)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       15
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       255
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       980
225 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       1583719

Note the pending sectors... Both the short and the long self-test report the same bad sector as the kernel.
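To keep an eye on the sector-health counters while experimenting, the attribute table can be parsed mechanically. This is a rough sketch that assumes the ten-column layout shown above (as printed by `smartctl -A`); `parse_smart_attributes`, `nonzero_watched`, and the `WATCHED` list are hypothetical names chosen for illustration.

```python
# Attributes that directly count problem sectors (an illustrative selection).
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def parse_smart_attributes(text):
    """Parse smartctl's attribute table into {name: {...}}; other lines are skipped."""
    attrs = {}
    for line in text.splitlines():
        parts = line.split(None, 9)          # the 10th column keeps any trailing text
        if len(parts) < 10 or not parts[0].isdigit():
            continue                          # header or unrelated line
        raw = parts[9].split()[0]             # e.g. "32 (Min/Max 15/49)" -> "32"
        attrs[parts[1]] = {
            "id": int(parts[0]),
            "value": int(parts[3]),
            "worst": int(parts[4]),
            "thresh": int(parts[5]),
            "raw": int(raw) if raw.isdigit() else raw,
        }
    return attrs


def nonzero_watched(attrs):
    """Return the watched attributes whose raw value is not zero."""
    return {name: attrs[name]["raw"]
            for name in WATCHED
            if name in attrs and attrs[name]["raw"] != 0}


if __name__ == "__main__":
    sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       15
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
"""
    print(nonzero_watched(parse_smart_attributes(sample)))
```

Run against the table above, only `Current_Pending_Sector` is flagged, which matches the observation that sectors are pending but none have been reallocated.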

hdparm ~~strangely manages to read everything successfully, but it~~ (see the edit below) hangs for a while and then says

reading sector 969158769: SG_IO: bad/missing sense data, sb[]:  70 00 03 00 00 00 00 0a 00 51 e0 01 11 04 00 00 a0 71 00 00 00 00 00 00 00 00 00 00 00 00 00 00
succeeded

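It says that for about 200 sectors after the first bad one. I rewrote a few of them with hdparm --write-sector and they stopped complaining. I'm making a backup now and have ordered a new drive, but in the meantime I'd like to understand what happened and maybe try to fix it.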
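The `sb[]` bytes that hdparm prints look like SCSI fixed-format sense data, so they can be decoded by hand. Here is a small sketch under that assumption (response code 0x70 or 0x71, sense key in the low nibble of byte 2, ASC and ASCQ in bytes 12 and 13); `decode_fixed_sense` and the partial `SENSE_KEYS` table are illustrative names, not part of hdparm.

```python
# A partial table of SCSI sense keys, enough for this example.
SENSE_KEYS = {
    0x0: "No Sense",
    0x1: "Recovered Error",
    0x2: "Not Ready",
    0x3: "Medium Error",
    0x4: "Hardware Error",
    0x5: "Illegal Request",
    0xB: "Aborted Command",
}


def decode_fixed_sense(hex_bytes):
    """Extract sense key, ASC and ASCQ from fixed-format sense data given as hex."""
    sb = bytes.fromhex(hex_bytes)
    if len(sb) < 14:
        raise ValueError("sense buffer too short")
    if sb[0] & 0x7F not in (0x70, 0x71):
        raise ValueError("not fixed-format sense data")
    key = sb[2] & 0x0F
    return {
        "sense_key": key,
        "sense_key_name": SENSE_KEYS.get(key, "unknown"),
        "asc": sb[12],    # additional sense code
        "ascq": sb[13],   # additional sense code qualifier
    }


if __name__ == "__main__":
    info = decode_fixed_sense("70 00 03 00 00 00 00 0a 00 51 e0 01 11 04 00 00")
    print("%s, ASC/ASCQ %02x/%02x" % (info["sense_key_name"], info["asc"], info["ascq"]))
```

For the buffer quoted above this gives sense key 3 (Medium Error) with ASC/ASCQ 11h/04h, the code the SCSI tables list as "Unrecovered read error - auto reallocate failed". That is the same condition the kernel logged, so the "succeeded" line appears to hide a failed read.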

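Note that after I rewrote a few of the bad sectors, the reallocated sector count did not increase, which makes the whole thing even stranger. After being rewritten they read and write fine, as if nothing had happened, but the firmware does not seem to have remapped them as bad sectors.

Any ideas? Should I give up on the drive?

P.S. OS X on the other partition still works perfectly well.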

Edit: aftermath

After the backup, I started doing some experiments on the drive.

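After the first bad sector there were about 150 more bad sectors showing the same problem. I tried reading them with dd and dd_rescue, but that failed. hdparm --read-sector worked (with the sense error shown above) but returned inconsistent data (different on every read). hdparm --write-sector seemed to fix them, so I rewrote all the failing sectors.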
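The read-then-rewrite procedure described above can be sketched as a dry run that only prints the hdparm command lines. The post does not list the exact failing sectors, so the start sector and count below are placeholders, `/dev/sdX` is a stand-in device name, and `read_commands` and `write_commands` are hypothetical helpers. Note that `hdparm --write-sector` overwrites the sector with zeros and destroys its contents, which is why hdparm requires the `--yes-i-know-what-i-am-doing` flag.

```python
import shlex


def read_commands(first, count, device):
    """Build one `hdparm --read-sector` command per sector in the range."""
    return [["hdparm", "--read-sector", str(sector), device]
            for sector in range(first, first + count)]


def write_commands(first, count, device):
    """Build one destructive `hdparm --write-sector` command per sector.

    Each command zero-fills its sector, so use it only on sectors
    that are already known to be unreadable and backed up elsewhere.
    """
    return [["hdparm", "--yes-i-know-what-i-am-doing",
             "--write-sector", str(sector), device]
            for sector in range(first, first + count)]


if __name__ == "__main__":
    # Placeholder range: the first bad sector from the kernel log plus two more.
    for cmd in read_commands(969158669, 3, "/dev/sdX"):
        print(" ".join(shlex.quote(arg) for arg in cmd))
    for cmd in write_commands(969158669, 3, "/dev/sdX"):
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

Printing the commands instead of executing them keeps the sketch harmless. Since hdparm reported "succeeded" even when the read returned sense errors, its output would need to be inspected by hand before deciding which sectors to rewrite.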

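Now smartctl reports 0 pending sectors and 0 reallocations, and both the short and the long self-test complete without errors. Linux boots fine and all the errors are gone.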

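I'm a bit worried about the ~70 kB I killed; LVM makes it hard to work out what they actually contained. I dumped a few MB around that area and it was all zeros, so I'm fairly confident it was either empty space or swap.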
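The size estimate and the all-zeros check can be reproduced with a few lines. This sketch assumes 512-byte logical sectors (consistent with the `pio 512 in` single-sector read in the kernel log) and treats the disk as any seekable binary file object; `region_is_zero` is an illustrative helper, not an existing tool.

```python
import io

SECTOR_SIZE = 512  # assumed logical sector size in bytes


def region_is_zero(dev, first_sector, sector_count, chunk=1 << 20):
    """Return True if every byte read from the given sector range is zero.

    `dev` is a seekable binary file object, for example a disk opened
    read-only. Reading stops early if the end of the file is reached.
    """
    dev.seek(first_sector * SECTOR_SIZE)
    remaining = sector_count * SECTOR_SIZE
    while remaining > 0:
        data = dev.read(min(chunk, remaining))
        if not data:
            break                      # hit end of device or file
        if data.count(0) != len(data):
            return False               # found at least one non-zero byte
        remaining -= len(data)
    return True


if __name__ == "__main__":
    # About 150 rewritten sectors of 512 bytes each:
    print(150 * SECTOR_SIZE, "bytes")
    # Demonstration on an in-memory image instead of a real disk.
    image = io.BytesIO(bytes(4096))
    print(region_is_zero(image, 0, 8))
```

With 150 sectors the affected area comes to 76800 bytes, in line with the rough ~70 kB figure. On a real system the same function could be pointed at the disk opened with `open("/dev/sdX", "rb")` as root, where `/dev/sdX` is again a placeholder.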

It's too early to celebrate, but the results look promising. I'll update the question if anything new happens.
