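Has my Solid State Disk turned into a Super Slim Doorstopper?

I know this is a long question, but I have tried to make it as detailed and informative as possible. tl;dr: just skip the first half of the question, although I think the information in it may be relevant to the problem.

What happened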

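First: the region where I live is currently suffering a massive heat wave. The air temperature inside my room has not dropped below 30°C for 2-3 weeks. For days at a time it has not dropped below 34°C, even in the middle of the night. I have no air conditioning, and my fan does next to nothing. My SSD's temperature sensor seems to be broken (it always reports 5°C); my HDDs are almost always at 48°C, 54°C and 54°C. The GPU is at about 60°C and the CPU at about 52°C. That is not great, but still tolerable as far as I am concerned.

Last night I was using my PC, which runs Arch Linux from a 64GB SSD, when everything froze. I could not even connect to the machine via SSH anymore. After waiting half an hour in the hope of at least getting an SSH connection, I had to cut the power. I should also mention that my computer sometimes gets very slow when I use Audacity (it writes its temporary data to the SSD, because Audacity does not seem to support NTFS file systems and my SSD is the only non-NTFS file system I have), and I recently read about SSDs slowing down when they are full. Because of the large Audacity recordings, my SSD reaches 95% or more used space several times a week, if not daily.

After cutting the power I tried to turn the computer on again. On the BIOS screen, where it checks all disks, the SSD showed S.M.A.R.T. error. After starting grub (on another drive) and trying to boot into Arch (the boot partition is also on another drive), I got the message Device /dev/mapper/mydisk-root not found, or something similar. mydisk-root should be the root partition inside the volume group on my LUKS-encrypted SSD. I tried rebooting a few times but always got the same result. When I finally gave up, I shut the computer down (at the PSU) and went to sleep.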

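What I did next

After waking up, I wanted to boot a live Linux USB to run a SMART scan, look at dmesg, and so on. Suddenly the BIOS said S.M.A.R.T. ok again. I went ahead with the live USB anyway, and I could unlock and mount the SSD as usual. I was also able to make a full backup without any problems.

Then I ran the SMART tests. The long test failed twice at 50%; details are below. The short test completed, and I cannot see anything bad in its results. The last SMART test I ran before that was a long test two weeks ago (see the test log), and everything was fine.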

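Question 1: What is up with my SSD?

This is the SMART attribute table from before I tried any of today's tests, so I assume these are the results of the long test I ran two weeks ago: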

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   050    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   100   100   001    Old_age   Always       -       23891
 12 Power_Cycle_Count       0x0032   100   100   001    Old_age   Always       -       1063
170 Grown_Failing_Block_Ct  0x0033   100   100   010    Pre-fail  Always       -       0
171 Program_Fail_Count      0x0032   100   100   001    Old_age   Always       -       10
172 Erase_Fail_Count        0x0032   100   100   001    Old_age   Always       -       0
173 Wear_Leveling_Count     0x0033   080   080   010    Pre-fail  Always       -       611
174 Unexpect_Power_Loss_Ct  0x0032   100   100   001    Old_age   Always       -       244
181 Non4k_Aligned_Access    0x0022   100   100   001    Old_age   Always       -       302 89 212
183 SATA_Iface_Downshift    0x0032   100   100   001    Old_age   Always       -       0
184 End-to-End_Error        0x0033   100   100   050    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   001    Old_age   Always       -       2
188 Command_Timeout         0x0032   100   100   001    Old_age   Always       -       0
189 Factory_Bad_Block_Ct    0x000e   100   100   001    Old_age   Always       -       58
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       0
195 Hardware_ECC_Recovered  0x003a   100   100   001    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   100   100   001    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   001    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   001    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   001    Old_age   Always       -       1
202 Perc_Rated_Life_Used    0x0018   080   080   001    Old_age   Offline      -       20
206 Write_Error_Rate        0x000e   100   100   001    Old_age   Always       -       10

This is the full -a output after today's failed attempts at the long test (see the test log):

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80) Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 117) The previous self-test completed having
                    the read element of the test failed.
Total time to complete Offline 
data collection:        (  295) seconds.
Offline data collection
capabilities:            (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:    (   2) minutes.
Extended self-test routine
recommended polling time:    (   4) minutes.
Conveyance self-test routine
recommended polling time:    (   3) minutes.
SCT capabilities:          (0x003d) SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   050    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   100   100   001    Old_age   Always       -       23891
 12 Power_Cycle_Count       0x0032   100   100   001    Old_age   Always       -       1063
170 Grown_Failing_Block_Ct  0x0033   100   100   010    Pre-fail  Always       -       0
171 Program_Fail_Count      0x0032   100   100   001    Old_age   Always       -       10
172 Erase_Fail_Count        0x0032   100   100   001    Old_age   Always       -       0
173 Wear_Leveling_Count     0x0033   080   080   010    Pre-fail  Always       -       611
174 Unexpect_Power_Loss_Ct  0x0032   100   100   001    Old_age   Always       -       244
181 Non4k_Aligned_Access    0x0022   100   100   001    Old_age   Always       -       302 89 212
183 SATA_Iface_Downshift    0x0032   100   100   001    Old_age   Always       -       0
184 End-to-End_Error        0x0033   100   100   050    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   001    Old_age   Always       -       2
188 Command_Timeout         0x0032   100   100   001    Old_age   Always       -       0
189 Factory_Bad_Block_Ct    0x000e   100   100   001    Old_age   Always       -       58
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       0
195 Hardware_ECC_Recovered  0x003a   100   100   001    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   100   100   001    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   001    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   001    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   001    Old_age   Always       -       1
202 Perc_Rated_Life_Used    0x0018   080   080   001    Old_age   Offline      -       20
206 Write_Error_Rate        0x000e   100   100   001    Old_age   Always       -       10

SMART Error Log Version: 1
Warning: ATA error count 0 inconsistent with error log pointer 2

ATA Error Count: 0
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 0 occurred at disk power-on lifetime: 23890 hours (995 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  00 50 00 d0 14 d1 40   at LBA = 0x00d114d0 = 13702352

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 d0 14 d1 40 00   1d+05:22:14.080  READ FPDMA QUEUED
  60 00 08 c8 14 d1 40 00   1d+05:22:14.080  READ FPDMA QUEUED
  60 03 08 c0 14 d1 40 00   1d+05:22:14.080  READ FPDMA QUEUED
  60 10 08 b8 14 d1 40 00   1d+05:22:14.080  READ FPDMA QUEUED
  60 00 08 b0 14 d1 40 00   1d+05:22:14.080  READ FPDMA QUEUED

Error -1 occurred at disk power-on lifetime: 23890 hours (995 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  00 50 00 d0 14 d1 40   at LBA = 0x00d114d0 = 13702352

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 d5 00 d8 13 d1 40 00   1d+05:22:14.080  READ FPDMA QUEUED
  60 00 00 d8 12 d1 40 00   1d+05:22:14.080  READ FPDMA QUEUED
  60 da 00 d8 11 d1 40 00   1d+05:22:14.080  READ FPDMA QUEUED
  60 d0 00 d8 10 d1 40 00   1d+05:22:14.080  READ FPDMA QUEUED
  60 d1 80 58 10 d1 40 00   1d+05:22:14.080  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       50%     23891         66387896
# 2  Extended offline    Completed: read failure       50%     23889         66387896
# 3  Extended offline    Completed without error       00%     23437         -
# 4  Short offline       Completed without error       00%       564         -
# 5  Vendor (0xff)       Completed without error       00%       558         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

This is the full -a output after today's short test, which succeeded:

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80) Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:        (  295) seconds.
Offline data collection
capabilities:            (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:    (   2) minutes.
Extended self-test routine
recommended polling time:    (   4) minutes.
Conveyance self-test routine
recommended polling time:    (   3) minutes.
SCT capabilities:          (0x003d) SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   050    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   100   100   001    Old_age   Always       -       23891
 12 Power_Cycle_Count       0x0032   100   100   001    Old_age   Always       -       1063
170 Grown_Failing_Block_Ct  0x0033   100   100   010    Pre-fail  Always       -       0
171 Program_Fail_Count      0x0032   100   100   001    Old_age   Always       -       10
172 Erase_Fail_Count        0x0032   100   100   001    Old_age   Always       -       0
173 Wear_Leveling_Count     0x0033   080   080   010    Pre-fail  Always       -       611
174 Unexpect_Power_Loss_Ct  0x0032   100   100   001    Old_age   Always       -       244
181 Non4k_Aligned_Access    0x0022   100   100   001    Old_age   Always       -       302 89 212
183 SATA_Iface_Downshift    0x0032   100   100   001    Old_age   Always       -       0
184 End-to-End_Error        0x0033   100   100   050    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   001    Old_age   Always       -       2
188 Command_Timeout         0x0032   100   100   001    Old_age   Always       -       0
189 Factory_Bad_Block_Ct    0x000e   100   100   001    Old_age   Always       -       58
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       0
195 Hardware_ECC_Recovered  0x003a   100   100   001    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   100   100   001    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   001    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   001    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   001    Old_age   Always       -       1
202 Perc_Rated_Life_Used    0x0018   080   080   001    Old_age   Offline      -       20
206 Write_Error_Rate        0x000e   100   100   001    Old_age   Always       -       10

SMART Error Log Version: 1
Warning: ATA error count 0 inconsistent with error log pointer 2

ATA Error Count: 0
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 0 occurred at disk power-on lifetime: 23890 hours (995 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  00 50 00 d0 14 d1 40   at LBA = 0x00d114d0 = 13702352

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 d0 14 d1 40 00   1d+05:22:14.080  READ FPDMA QUEUED
  60 00 08 c8 14 d1 40 00   1d+05:22:14.080  READ FPDMA QUEUED
  60 03 08 c0 14 d1 40 00   1d+05:22:14.080  READ FPDMA QUEUED
  60 10 08 b8 14 d1 40 00   1d+05:22:14.080  READ FPDMA QUEUED
  60 00 08 b0 14 d1 40 00   1d+05:22:14.080  READ FPDMA QUEUED

Error -1 occurred at disk power-on lifetime: 23890 hours (995 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  00 50 00 d0 14 d1 40   at LBA = 0x00d114d0 = 13702352

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 d5 00 d8 13 d1 40 00   1d+05:22:14.080  READ FPDMA QUEUED
  60 00 00 d8 12 d1 40 00   1d+05:22:14.080  READ FPDMA QUEUED
  60 da 00 d8 11 d1 40 00   1d+05:22:14.080  READ FPDMA QUEUED
  60 d0 00 d8 10 d1 40 00   1d+05:22:14.080  READ FPDMA QUEUED
  60 d1 80 58 10 d1 40 00   1d+05:22:14.080  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     23891         -
# 2  Extended offline    Completed: read failure       50%     23891         66387896
# 3  Extended offline    Completed: read failure       50%     23889         66387896
# 4  Extended offline    Completed without error       00%     23437         -
# 5  Short offline       Completed without error       00%       564         -
# 6  Vendor (0xff)       Completed without error       00%       558         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

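I find it interesting that all three attribute tables are identical. Or am I missing something here? I am no SMART expert, but as far as I can tell all three are perfect results. (?) I have not tried it yet, but since mounting the disk and working with the files succeeded and the BIOS reports it as ok again, I think I could also boot from it again. Should I?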
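Comparing the tables by eye is error-prone, so here is a minimal Python sketch of how they could be diffed. It assumes the ten-column layout smartctl prints above; the helper names parse_attributes and diff_attributes are made up for illustration, and only three rows are pasted in to keep the example short.

```python
def parse_attributes(text):
    """Parse smartctl attribute rows into {id: (name, raw_value)}."""
    attrs = {}
    for line in text.splitlines():
        # Split into at most 10 fields so a RAW_VALUE with spaces stays whole.
        fields = line.split(None, 9)
        if len(fields) == 10 and fields[0].isdigit():
            attrs[int(fields[0])] = (fields[1], fields[9].strip())
    return attrs


def diff_attributes(before, after):
    """Return {id: (name, old_raw, new_raw)} for rows whose raw value changed."""
    changed = {}
    for attr_id, (name, raw) in after.items():
        old = before.get(attr_id)
        if old is None or old[1] != raw:
            changed[attr_id] = (name, old[1] if old else None, raw)
    return changed


# Three rows copied from the tables in the question.
before = """\
  9 Power_On_Hours          0x0032   100   100   001    Old_age   Always       -       23891
181 Non4k_Aligned_Access    0x0022   100   100   001    Old_age   Always       -       302 89 212
187 Reported_Uncorrect      0x0032   100   100   001    Old_age   Always       -       2
"""
after = before  # the tables posted in the question are identical

print(diff_attributes(parse_attributes(before), parse_attributes(after)))
```

An empty result means no raw value changed between the two snapshots. That is what the question observes, even though the self-test log gained two read failures in between.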

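Question 2: Why did this happen?

Is this just a matter of age, or did my constant use of the SSD cause it?

Does it have something to do with the SSD constantly reaching 90-100% used space?

How can it go from "everything is fine" to "I can't even run a SMART test anymore" in just two weeks?

What do these SMART test results say? The attribute table after today's tests still looks great to me, or am I wrong?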
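One thing the logs do give is a location. Both failed extended tests stop at LBA 66387896, and the ATA error log names LBA 13702352 (0x00d114d0). A small Python sketch can turn those into byte offsets. The 512-byte logical sector size is an assumption (the output above does not state it), and lba_to_offset is a name made up for this example.

```python
SECTOR_SIZE = 512  # assumption: 512-byte logical sectors, not stated in the output


def lba_to_offset(lba, sector_size=SECTOR_SIZE):
    """Convert a logical block address into a byte offset from the start of the disk."""
    return lba * sector_size


# LBAs taken from the self-test log and the ATA error log in the question.
for label, lba in [("self-test first error", 66387896),
                   ("ATA error log", 0x00D114D0)]:
    offset = lba_to_offset(lba)
    print(f"{label}: LBA {lba} -> byte {offset} ({offset / 2**30:.2f} GiB)")
```

Under that assumption the self-test failure sits a little past the middle of a nominal 64GB device, which would fit the "50% remaining" shown in the log. Since the whole device is LUKS ciphertext, the offset does not map directly to a file without going through the encryption and volume layers.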

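Question 3: Is this contagious?

If this SSD is broken and I buy a new one, can I simply dd if=/old/ssd of=/new/ssd and be fine, or will that cause trouble? What is the best way to move to a new disk? Note that I use LUKS in RAW mode on the entire device, with a detached header, and I would like to "clone" all of that onto the new disk.

Edit: I just booted from that SSD again, and it seems to work. I will still buy a new SSD as soon as possible, because I think continuing to use this one is a bad idea. Here are the latest entries in the syslog before the crash: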

System log

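Answer 1

The SMART status shows a number of old-age or pre-fail indicators, but nothing that particularly screams "this killed it!".

Your log shows a power-on lifetime of 995 days and 10 hours, which suggests that you leave the computer permanently on. That is not a bad thing in itself; it just means the drive has been through many hours of small write operations.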
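The figures behind that can be checked with a few lines of Python. The sketch below only does arithmetic on numbers copied from the posted logs (Power_On_Hours, Power_Cycle_Count and the self-test LifeTime column); the function name hours_to_days is made up for illustration.

```python
def hours_to_days(hours):
    """Split a power-on hour count into whole days and leftover hours."""
    return divmod(hours, 24)


# Lifetime reported next to the logged ATA errors.
days, rest = hours_to_days(23890)
print(f"{days} days + {rest} hours")  # 995 days + 10 hours

# Power-on hours between the last clean extended test and the first failed one.
gap_hours = 23889 - 23437
print(gap_hours, "power-on hours, about", round(gap_hours / 24, 1), "days")

# Average length of a powered-on session: Power_On_Hours / Power_Cycle_Count.
print(round(23891 / 1063, 1), "hours per power cycle on average")
```

The last figure is only an average over the drive's whole life, so it says nothing about how the machine was used in recent weeks.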

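To me the SSD looks old and worn. Perc_Rated_Life_Used is surprisingly low, as is Erase_Fail_Count.

What worries me is that you "routinely" hit 95% or more used space, which shrinks the pool of empty blocks available to the wear-levelling algorithm as it does its job. When you are short of space, you put far more stress on a small number of blocks, so a small group of blocks ends up with a very large number of writes while the average across the whole drive stays fairly low. If you do this repeatedly, the wear leveller will probably pick the "best" (least written) blocks first, but by the time the drive is 100% full you are left with the "worst" blocks. Combine that with ordinary programs and the operating system going about their tasks, and you wear out the worst blocks even faster. It is a perfect way to stress the worst parts of the drive and send it to its grave.

You are effectively forcing critical file system and SSD bookkeeping structures onto the worst cells, because they are likely to be written to the drive regularly. When the SSD is nearly full, sooner or later something bad will happen. If the drive runs out of blocks it can reallocate and cannot move a critical structure, it could deadlock itself.

This is why people say you should always try to keep some free space on the drive: the less free space there is, the harder you work the area that is free.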
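The argument can be illustrated with a toy model in Python. This is a deliberately simplified sketch, not a description of how any real SSD controller behaves. It assumes that blocks holding existing data are never moved (no static wear levelling, no over-provisioning) and that every write goes to the least-worn free block. The function simulate and its parameters are hypothetical.

```python
import heapq


def simulate(total_blocks, used_percent, writes):
    """Toy wear model: return (max erase count of a free block, drive-wide mean)."""
    # Assumption: occupied blocks are never rewritten, so only the
    # free pool absorbs new writes.
    free = total_blocks * (100 - used_percent) // 100
    pool = [0] * free  # erase count per free block, kept as a min-heap
    heapq.heapify(pool)
    for _ in range(writes):
        least_worn = heapq.heappop(pool)      # pick the "best" block
        heapq.heappush(pool, least_worn + 1)  # it is now one erase older
    return max(pool), writes / total_blocks


for used in (50, 95):
    worst, mean = simulate(total_blocks=1000, used_percent=used, writes=10_000)
    print(f"{used}% full: worst block {worst} erases, drive-wide mean {mean}")
```

In this model the same 10,000 writes cost each free block 20 erases at 50% full but 200 erases at 95% full, while the drive-wide mean stays at 10. Real controllers typically reserve spare area and also relocate cold data, so the real effect is probably smaller, but the direction matches the reasoning above.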

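Age and a large number of writes to a small group of blocks have probably worn out parts of the drive.

Copying what you need onto a new drive will most likely be fine; hardware failures like this tend not to be contagious.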
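For a block-level copy from a drive that may return read errors, the idea is to keep going past unreadable blocks while preserving offsets. The Python sketch below shows that idea only. The function clone, its block size and the device paths in the comment are hypothetical, and a purpose-built tool such as GNU ddrescue is the more usual choice for a failing disk.

```python
import os


def clone(src_path, dst_path, block_size=1024 * 1024):
    """Copy src to dst block by block, zero-filling blocks that fail to read.

    Returns the byte offsets of the blocks that could not be read.
    """
    bad_offsets = []
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        size = src.seek(0, os.SEEK_END)  # works for regular files and block devices
        for offset in range(0, size, block_size):
            length = min(block_size, size - offset)
            try:
                src.seek(offset)
                data = src.read(length)
            except OSError:
                data = b""
                bad_offsets.append(offset)
            # Pad failed or short reads so later blocks keep their offsets.
            dst.write(data.ljust(length, b"\0"))
    return bad_offsets


# Hypothetical usage (needs root; the target must be at least as large as the source):
# bad = clone("/dev/sdX", "/dev/sdY")
# print("unreadable blocks at offsets:", bad)
```

Because the question describes LUKS on the raw device with a detached header, a copy like this carries over only the ciphertext. The header is stored elsewhere and is not part of what gets copied here.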
