Diagnosing whether a drive is reliable from its SMART attributes

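I'm trying to figure out whether my hard drive is about to die. I looked at the SMART values and it seems like it might be, but it still reads and writes data fine and no new errors have appeared.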

Previously 197 Current_Pending_Sector had a value of 8, but after I wiped the drive it went back to 0, and 196 Reallocated_Event_Count is 0.

Does this mean the drive itself is fine and it was only a temporary system problem?

Also of concern is 188 Command_Timeout, which has a value of 1 and is defined as:

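"The count of aborted operations due to HDD timeout. Normally this attribute value should be equal to zero, and if the value is far above zero, then most likely there is a serious problem with the power supply or an oxidized data cable."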

I have been doing some low-level programming and have had to force the computer off about 50 times.

I assume the 191 G-Sense_Error_Rate value of 438 is fine; I think it was caused by moving the laptop while the hard drive was running.

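The really interesting part is that my Windows partition stopped booting and could not be mounted on another Windows or Linux machine, but it mounted fine on OS X, which let me recover my files. I reinstalled and copied the data back, and it seems to work fine. OS X is on a different drive.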

H2O:~ jeremiah$ smartctl -a /dev/disk1
smartctl 6.3 2014-07-26 r3976 [x86_64-apple-darwin14.1.0] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     HGST HTS541075A9E680
Serial Number:    JD13021X0A00GK
LU WWN Device Id: 5 000cca 764c48bc4
Firmware Version: JA2OA590
User Capacity:    750,156,374,016 bytes [750 GB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      2.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 3.0, 3.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Wed Mar 11 21:59:30 2015 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:        (   45) seconds.
Offline data collection
capabilities:            (0x51) SMART execute Offline immediate.
                    No Auto Offline data collection support.
                    Suspend Offline collection upon new
                    command.
                    No Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:    (   2) minutes.
Extended self-test routine
recommended polling time:    ( 164) minutes.
SCT capabilities:          (0x003d) SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   086   062    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0025   100   100   040    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0023   169   100   033    Pre-fail  Always       -       1
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       981
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002f   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0025   100   100   040    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       2586
 10 Spin_Retry_Count        0x0033   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       851
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0033   100   100   097    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   001   000    Old_age   Always       -       144929376764360
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       1
190 Airflow_Temperature_Cel 0x0022   069   050   045    Old_age   Always       -       31 (Min/Max 24/31)
191 G-Sense_Error_Rate      0x0032   099   099   000    Old_age   Always       -       438
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       2031647
193 Load_Cycle_Count        0x0032   089   089   000    Old_age   Always       -       115337
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   100   100   000    Old_age   Always       -       0
223 Load_Retry_Count        0x002a   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 456 (device log contains only the most recent five errors)
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 456 occurred at disk power-on lifetime: 2549 hours (106 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 38 8d 62 00  Error: UNC 8 sectors at LBA = 0x00628d38 = 6458680

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 38 8d 62 40 00      00:00:34.282  READ DMA EXT
  25 00 08 38 8d 62 40 00      00:00:30.471  READ DMA EXT
  25 00 08 38 8d 62 40 00      00:00:26.660  READ DMA EXT
  25 00 08 38 8d 62 40 00      00:00:22.849  READ DMA EXT
  2f 00 01 10 00 00 00 00      00:00:22.849  READ LOG EXT

Error 455 occurred at disk power-on lifetime: 2549 hours (106 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 38 8d 62 00  Error: UNC 8 sectors at LBA = 0x00628d38 = 6458680

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 38 8d 62 40 00      00:00:30.471  READ DMA EXT
  25 00 08 38 8d 62 40 00      00:00:26.660  READ DMA EXT
  25 00 08 38 8d 62 40 00      00:00:22.849  READ DMA EXT
  2f 00 01 10 00 00 00 00      00:00:22.849  READ LOG EXT
  60 08 a8 38 8d 62 40 00      00:00:19.060  READ FPDMA QUEUED

Error 454 occurred at disk power-on lifetime: 2549 hours (106 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 38 8d 62 00  Error: UNC 8 sectors at LBA = 0x00628d38 = 6458680

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 38 8d 62 40 00      00:00:26.660  READ DMA EXT
  25 00 08 38 8d 62 40 00      00:00:22.849  READ DMA EXT
  2f 00 01 10 00 00 00 00      00:00:22.849  READ LOG EXT
  60 08 a8 38 8d 62 40 00      00:00:19.060  READ FPDMA QUEUED
  60 08 a0 30 8d 62 40 00      00:00:19.059  READ FPDMA QUEUED

Error 453 occurred at disk power-on lifetime: 2549 hours (106 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 38 8d 62 00  Error: UNC 8 sectors at LBA = 0x00628d38 = 6458680

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 38 8d 62 40 00      00:00:22.849  READ DMA EXT
  2f 00 01 10 00 00 00 00      00:00:22.849  READ LOG EXT
  60 08 a8 38 8d 62 40 00      00:00:19.060  READ FPDMA QUEUED
  60 08 a0 30 8d 62 40 00      00:00:19.059  READ FPDMA QUEUED
  60 08 98 28 8d 62 40 00      00:00:19.059  READ FPDMA QUEUED

Error 452 occurred at disk power-on lifetime: 2548 hours (106 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 41 08 38 8d 62 00  Error: UNC at LBA = 0x00628d38 = 6458680

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 a8 38 8d 62 40 00      00:00:19.060  READ FPDMA QUEUED
  60 08 a0 30 8d 62 40 00      00:00:19.059  READ FPDMA QUEUED
  60 08 98 28 8d 62 40 00      00:00:19.059  READ FPDMA QUEUED
  60 08 90 20 8d 62 40 00      00:00:19.059  READ FPDMA QUEUED
  60 08 88 18 8d 62 40 00      00:00:19.059  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Answer 1

"Previously 197 Current_Pending_Sector had a value of 8, but after I wiped the drive it went back to 0, and 196 Reallocated_Event_Count is 0."

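This means that at some point the drive had trouble reading some sectors, but since you wiped the drive those sectors have shown no problems. When you overwrote the whole drive with new data, the sectors went from pending reallocation back to normal, and the drive was presumably happy with the writes, since the sectors were not reallocated at that point. You should run a long SMART self-test (which usually includes a surface scan) to verify, but this was quite possibly a glitch, perhaps related to moving the computer while the drive was running.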

"Also of concern is 188 Command_Timeout, which has a value of 1 and is defined as: ..."

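Not worth worrying about. The drive reports nearly 2,600 power-on hours, with a single command timeout over that whole period. Command timeouts are handled by the operating system, either by retrying the failed command or by failing the I/O operation, so if this were a persistent problem you would know about it. It may or may not be related to the 8 pending sectors.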

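If that number started to rise significantly I would be concerned, but a single-digit timeout count with no other sign of trouble in the running system would not worry me.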

"I have been doing some low-level programming and have had to force the computer off about 50 times."

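That should not affect the physical drive to any degree worth worrying about, although it may affect logical data consistency (file system corruption and the like).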

Also, from sawdust's comment:

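"You should run both the short and the extended self-tests. The large number for ID# 187 Reported_Uncorrect indicates a problem. There seems to have been a slew of uncorrectable read errors about 40 hours ago."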

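That is a good point, but we do not know how the raw value is encoded. What we do know is that the normalized "value" is currently 100, the worst it has been is 1, and the threshold (at which the drive is reported as failed or about to fail) is 0. So at present the drive itself does not consider this attribute anything to worry about.

1.45e14 read errors sounds almost impossibly high; by its own admission the drive has about 183 million sectors (750 GB at 4 KiB/sector). To reach the number of read failures reported as the raw value, every sector would have to have failed about 791,000 times over the reported 2,586 power-on hours, which amounts to a complete read failure of the entire drive surface roughly every 11 to 12 seconds. That is simply an absurd number (in ten seconds you can read only a small fraction of the whole disk surface), so we can conclude with a high degree of certainty that, for this drive and attribute 187, the raw value is something other than a simple integer count.

The raw value is probably split into two parts, with the high or low bits encoding the actual count and the other bits encoding something else. In hexadecimal the raw value of this attribute is 83D0 0005 01C8, and the run of zeros in the middle does hint at such an encoding; while certainly possible, it seems unlikely that a random error count would have such a long run of zeros in the middle. If, for example, we take the low bits (501C8 hex), the result is 328,136 reported errors, which is still a lot but sounds a lot more believable.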
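To make that arithmetic easy to re-check, here is a minimal Python sketch. All the constants are copied from the smartctl output in the question; the 16-bit word split is only an assumption for illustration, since the vendor's real encoding of attribute 187 is not documented here, and `split_words` is a hypothetical helper name.

```python
# Sanity-check the raw value of attribute 187 with figures from the smartctl output.
RAW_187 = 144929376764360          # RAW_VALUE of 187 Reported_Uncorrect
CAPACITY_BYTES = 750_156_374_016   # "User Capacity"
PHYSICAL_SECTOR_BYTES = 4096       # "Sector Sizes: ... 4096 bytes physical"
POWER_ON_HOURS = 2586              # RAW_VALUE of 9 Power_On_Hours

# If the raw value were a plain error count, how often would every sector have failed?
sectors = CAPACITY_BYTES // PHYSICAL_SECTOR_BYTES
failures_per_sector = RAW_187 / sectors
seconds_per_full_surface_failure = POWER_ON_HOURS * 3600 / failures_per_sector


def split_words(raw, word_bits=16, words=3):
    """Split a raw value into fixed-width words, most significant first.

    The 3 x 16-bit layout is a guess for illustration, not the vendor's format.
    """
    mask = (1 << word_bits) - 1
    return [(raw >> (word_bits * i)) & mask for i in reversed(range(words))]


high, mid, low = split_words(RAW_187)
low_32 = RAW_187 & 0xFFFFFFFF      # the "low bits" reading used in the answer

print(f"sectors:             {sectors:,}")
print(f"failures per sector: {failures_per_sector:,.0f}")
print(f"seconds per pass:    {seconds_per_full_surface_failure:.1f}")
print(f"raw as hex:          {RAW_187:012X}")
print(f"16-bit words:        {high:#06x} {mid:#06x} {low:#06x}")
print(f"low 32 bits:         {low_32:,}")
```

One thing this turns up: the lowest 16-bit word is 0x01C8, which is 456 in decimal, the same number as the "ATA Error Count: 456" line in the error log. That could be coincidence, so treat it as a hint about the encoding rather than proof.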

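Bottom line: SMART can be a good monitoring tool, but it is not designed to catch and report every problem. Some drives keep working long after SMART says they should have failed completely, and others fail catastrophically while SMART reports that all is well, even after the failure. Understand SMART data for what it is, an early-warning system and status report, not some absolute truth about the drive's health. Also, you must read the raw values with a critical eye, because their encoding is implementation-defined. Instead, look at how the reported "value" compares with the drive's "threshold" value, since those are supposed to be defined meaningfully by the manufacturer for the specific drive.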
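As a rough illustration of comparing "value" against "threshold" instead of staring at raw numbers, the sketch below parses a few attribute rows copied from the output in the question. It assumes the ten-column table layout shown there, and it follows the rule described in the smartctl documentation that an attribute counts as failed when its normalized value is less than or equal to its threshold. The names `Attr`, `parse_attributes`, `failing_now` and `failed_in_past` are made up for this example.

```python
from typing import List, NamedTuple

SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   086   062    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   001   000    Old_age   Always       -       144929376764360
190 Airflow_Temperature_Cel 0x0022   069   050   045    Old_age   Always       -       31 (Min/Max 24/31)
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
"""


class Attr(NamedTuple):
    id: int
    name: str
    value: int
    worst: int
    thresh: int
    raw: str


def parse_attributes(text: str) -> List[Attr]:
    """Parse rows of the attribute table; header and blank lines are skipped."""
    attrs = []
    for line in text.splitlines():
        # Ten columns; the last one (RAW_VALUE) may itself contain spaces.
        fields = line.split(None, 9)
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attrs.append(Attr(int(fields[0]), fields[1], int(fields[3]),
                          int(fields[4]), int(fields[5]), fields[9]))
    return attrs


def failing_now(a: Attr) -> bool:
    return a.value <= a.thresh


def failed_in_past(a: Attr) -> bool:
    return a.worst <= a.thresh


attrs = parse_attributes(SAMPLE)
for a in sorted(attrs, key=lambda a: a.value - a.thresh):
    status = ("FAILING" if failing_now(a)
              else "failed earlier" if failed_in_past(a) else "ok")
    print(f"{a.id:3d} {a.name:<24} value={a.value:3d} worst={a.worst:3d} "
          f"thresh={a.thresh:3d} margin={a.value - a.thresh:3d} {status}")
```

On these rows nothing is at or below its threshold, including attribute 187 with its worst value of 1 against a threshold of 0, which matches the reading above that the drive itself is not raising an alarm.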

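If you are worried about those earlier pending (which basically means "found difficult to read") sectors, run a full surface scan through SMART. If they come back as "pending", it may be worth considering replacing the drive, but the simple fact is that almost all drives develop some bad sectors over their lifetime, and there are a number of spare sectors to compensate by reallocating the bad ones. Reallocation does require knowing the data, however, so if a sector has failed, it can only be reallocated during a write to that sector.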
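For reference, a small Python sketch of the smartctl invocations involved. It only builds and prints the command lines by default; actually running them needs smartmontools installed and usually root privileges. `/dev/disk1` is the device from the question, and the helper names are hypothetical. The output above lists a recommended polling time of 164 minutes for the extended test, so expect to wait about that long before checking the log.

```python
import subprocess
from typing import List


def selftest_command(device: str, kind: str = "long") -> List[str]:
    """Command that starts a SMART self-test ("short" or "long") on the device."""
    if kind not in ("short", "long"):
        raise ValueError(f"unsupported self-test type: {kind!r}")
    return ["smartctl", "-t", kind, device]


def selftest_log_command(device: str) -> List[str]:
    """Command that prints the self-test log once the test has finished."""
    return ["smartctl", "-l", "selftest", device]


def attributes_command(device: str) -> List[str]:
    """Command that prints the attribute table, to re-check 197 and 196."""
    return ["smartctl", "-A", device]


def run(cmd: List[str]) -> str:
    """Run a command and return its stdout.

    smartctl uses its exit status as a bit mask, so a non-zero status is
    not treated as an error here.
    """
    return subprocess.run(cmd, capture_output=True, text=True, check=False).stdout


if __name__ == "__main__":
    device = "/dev/disk1"  # device name taken from the question
    for cmd in (selftest_command(device, "long"),
                selftest_log_command(device),
                attributes_command(device)):
        print(" ".join(cmd))  # swap print(...) for run(cmd) to execute for real
```

After the long test completes, a non-zero raw value for 197 Current_Pending_Sector or an error entry in the self-test log would be the sign that those sectors really are bad.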
