診斷 RAID/Fedora 14 上寫入速度**非常慢**

診斷 RAID/Fedora 14 上寫入速度**非常慢**

我們有一台運行 Fedora 14 的 Dell PowerEdge T110,它作為嵌入式 Linux 的建置伺服器和 Subversion 伺服器運作。

最近它變得非常慢,無法在新的一天開始之前完成夜間備份。

[編輯] 謝謝 User9517 - 我檢查了日誌,有來自 MRMON(Mega Raid Monitor)的多個訊息。任何有關解釋這些訊息、後續步驟以及如何確定需要更換哪個驅動器的指南都會有所幫助。

Dec 20 09:02:32 localhost MR_MONITOR[2153]: <MRMON096> Controller ID:  0   PD Predictive failure:  #012    -:-:2
Dec 20 09:06:44 localhost MR_MONITOR[2153]: <MRMON113> Controller ID:  0   Unexpected sense:   PD  #012    =   -:-:2No defect spare location available,   CDB   =    0x28 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00    ,   Sense   =    0x70 0x00 0x04 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x32 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
Dec 20 09:09:44 localhost MR_MONITOR[2153]: <MRMON096> Controller ID:  0   PD Predictive failure:  #012    -:-:2

[/編輯]

我正在尋求幫助來找出故障。我絕對不是這方面的專家,最初設定係統的人已經不在了。

每晚備份約 6 GB 的 tgz 文件,於晚上 8 點開始。這通常在凌晨 4 點左右完成(包括複製到外部驅動器)。每週備份約 45 GB,以前從週五晚上 8 點開始,在周六上午 11 點完成。

除了備份之外,即使備份進程未運行,機器的反應速度也明顯緩慢。

以下是我到目前為止收集到的內容:

有一個 RAID 控制器 DELL PERC H200L,連接有四個希捷 1TB 硬碟 (ST31000424SS)。我思考它已設定為 RAID 10,但我不知道如何存取該控制器的配置。我相信 RAID 10,因為有 4 個驅動器,並且 vgdisplay 顯示 4 個驅動器上分配的總共 4 TB 中的 1.81 TB。

[root@fedorabox backup]# vgdisplay
  --- Volume group ---
  VG Name               vg_fedorabox2
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  8
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                5
  Open LV               5
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               1.81 TiB
  PE Size               32.00 MiB
  Total PE              59263
  Alloc PE / Size       28480 / 890.00 GiB
  Free  PE / Size       30783 / 961.97 GiB

我在機器中看不到任何其他實際驅動器,因此我猜想啟動分區 (/dev/sdb1) 以某種方式從 4 個驅動器中劃分出來。

(/dev/sda 是用於備份的外部硬碟 - 但這不是問題。當我們早上到達時,備份仍在 /backup 分區上生成。到 USB 連接驅動器的副本尚未生成開始)

[root@fedorabox backup]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg_fedorabox2-LogVol00
                      9.9G  5.2G  4.3G  55% /
tmpfs                 2.0G  932K  2.0G   1% /dev/shm
/dev/sdb1             504M   56M  423M  12% /boot
/dev/mapper/vg_fedorabox2-LogVol03
                      394G  221G  153G  60% /home
/dev/mapper/vg_fedorabox2-LogVol02
                       99G   29G   65G  32% /shared
/dev/mapper/vg_fedorabox2-LogVol01
                       30G   11G   18G  37% /usr
/dev/sda2             5.5T  2.6T  3.0T  47% /mnt/root/usbbackup2
/dev/mapper/vg_fedorabox2-LogVol04
                      345G  363M  327G   1% /backup

正如我在問題中所說,寫入速度是非常慢的:

[root@fedorabox backup]# dd if=/dev/zero of=/backup/tmp/test.out bs=512 count=32 oflag=dsync
32+0 records in
32+0 records out
16384 bytes (16 kB) copied, 40.382 s, 0.4 kB/s
[root@fedorabox backup]# dd of=/dev/null if=/backup/tmp/test.out bs=512 count=32 oflag=dsync
32+0 records in
32+0 records out
16384 bytes (16 kB) copied, 3.5087e-05 s, 467 MB/s

我可以使用 smartctl 作為 /dev/sg2 到 /dev/sg5 存取四個磁碟機。下面列出了輸出。不知道正常讀書有什麼用更正錯誤在這裡,但我注意到第二個和第四個驅動器(/dev/sg3、sg5)已列出未更正的錯誤用於讀取和驗證。

關於後續步驟的任何建議 - 未糾正的錯誤是正常的還是令人擔憂的?這是速度緩慢的原因,還是我應該注意其他問題?

關於如何更換驅動器以及如何存取 RAID 配置有什麼建議嗎?

[root@fedorabox /]# smartctl -a /dev/sg2
smartctl 5.40 2010-10-16 r3189 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

Device: SEAGATE  ST1000NM0001     Version: PS06
Serial number: Z1N2LEDW
Device type: disk
Transport protocol: SAS
Local Time is: Mon Dec 19 12:10:20 2022 EST
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
Log Sense failed, IE page [scsi response fails sanity test]

Current Drive Temperature:     37 C
Drive Trip Temperature:        68 C
Manufactured in week 33 of year 2012
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  71
Elements in grown defect list: 36
Vendor (Seagate) cache information
  Blocks sent to initiator = 2805494200
  Blocks received from initiator = 1072424796
  Blocks read from cache and sent to initiator = 19110177
  Number of read and write commands whose size <= segment size = 826634038
  Number of read and write commands whose size > segment size = 5264167
Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 11183.37
  number of minutes until next internal SMART test = 43

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   3996823525        0         0  3996823525          0        130.509           0
write:         0        0         0         0          0      62619.327           0
verify: 1594450892        0         0  1594450892          0      51866.259           0

Non-medium error count:        9

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                  32   11182                 - [-   -    -]

Long (extended) Self Test duration: 11100 seconds [185.0 minutes]
[root@fedorabox /]# smartctl -a /dev/sg3
smartctl 5.40 2010-10-16 r3189 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

Device: SEAGATE  ST31000424SS     Version: KS68
Serial number: 9WK3JSJV
Device type: disk
Transport protocol: SAS
Local Time is: Mon Dec 19 12:10:44 2022 EST
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
Log Sense failed, IE page [scsi response fails sanity test]

Current Drive Temperature:     37 C
Drive Trip Temperature:        68 C
Manufactured in week 06 of year 2011
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  81
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  81
Elements in grown defect list: 21
Vendor (Seagate) cache information
  Blocks sent to initiator = 1872227385
  Blocks received from initiator = 3603107317
  Blocks read from cache and sent to initiator = 53905772
  Number of read and write commands whose size <= segment size = 1041622488
  Number of read and write commands whose size > segment size = 5288254
Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 77337.02
  number of minutes until next internal SMART test = 16

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   1454822558        3         0  1454822561   1454822585       2465.838          21
write:         0        0         0         0          0      64012.923           0
verify: 2113323340      143         0  2113323483   2113323510      49057.393          17

Non-medium error count:        4

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                  16     643                 - [-   -    -]
# 2  Background short  Completed                  16       5                 - [-   -    -]
# 3  Background long   Completed                  16       5                 - [-   -    -]

Long (extended) Self Test duration: 11100 seconds [185.0 minutes]
[root@fedorabox /]# smartctl -a /dev/sg4
smartctl 5.40 2010-10-16 r3189 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

Device: SEAGATE  ST31000424SS     Version: KS68
Serial number: 9WK3H8DW
Device type: disk
Transport protocol: SAS
Local Time is: Mon Dec 19 12:11:02 2022 EST
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
Log Sense failed, IE page [scsi response fails sanity test]

Current Drive Temperature:     38 C
Drive Trip Temperature:        68 C
Manufactured in week 06 of year 2011
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  76
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  76
Elements in grown defect list: 1
Vendor (Seagate) cache information
  Blocks sent to initiator = 1437832391
  Blocks received from initiator = 3080050213
  Blocks read from cache and sent to initiator = 2689371046
  Number of read and write commands whose size <= segment size = 3306395247
  Number of read and write commands whose size > segment size = 5018225
Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 77337.17
  number of minutes until next internal SMART test = 58

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   1514637706     1007         0  1514638713   1514638713    1576907.538           0
write:         0        0         0         0          0      61240.330           0
verify: 1697580124       32         0  1697580156   1697580157      48889.638           0

Non-medium error count:       27

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                  16      18                 - [-   -    -]
# 2  Background short  Completed                  16       5                 - [-   -    -]
# 3  Background long   Completed                  16       5                 - [-   -    -]

Long (extended) Self Test duration: 11100 seconds [185.0 minutes]
[root@fedorabox /]# smartctl -a /dev/sg5
smartctl 5.40 2010-10-16 r3189 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

Device: SEAGATE  ST31000424SS     Version: KS68
Serial number: 9WK3FCZ6
Device type: disk
Transport protocol: SAS
Local Time is: Mon Dec 19 12:11:41 2022 EST
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
Log Sense failed, IE page [scsi response fails sanity test]

Current Drive Temperature:     38 C
Drive Trip Temperature:        68 C
Manufactured in week 06 of year 2011
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  81
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  81
Elements in grown defect list: 4096
Vendor (Seagate) cache information
  Blocks sent to initiator = 923606853
  Blocks received from initiator = 3074269061
  Blocks read from cache and sent to initiator = 3237322768
  Number of read and write commands whose size <= segment size = 3044372010
  Number of read and write commands whose size > segment size = 5024782
Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 77336.67
  number of minutes until next internal SMART test = 53

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   2058067359   277563         0  2058344922   2058345511    1420772.201         555
write:         0        0         0         0          0      62186.800           0
verify: 2750944424     2205         0  2750946629   2750946631      50834.359           1

Non-medium error count:      167

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                  16     643                 - [-   -    -]
# 2  Background short  Completed                  16       5                 - [-   -    -]
# 3  Background long   Completed                  16       5                 - [-   -    -]

Long (extended) Self Test duration: 11100 seconds [185.0 minutes]

答案1

Dell PERC 本質上是更名為 LSI MegaRAID SAS。

查看您的lspci -k以了解它使用哪個驅動程式。這有可能megaraid_sas。您成功使用 MegaRAID Monitor 的事實表明情況一定如此。因此,使用該megacli軟體包可以從 Linux 控制您的 RAID 控制器。

不過,今天在哪裡可以找到舊版的 Fedora 版本仍然是一個問題。試著看一下https://hwraid.le-vert.net為了MegaRAID SAS或者,可能,超級RAID對於該軟體。

該軟體有一個小的內聯提醒(運行megacli -h),也描述於MegaRAID SAS 軟體使用者指南您可以從中獲得博通(誰買了 Avago,誰就買了 LSI)。網路上也有一些備忘單。

例如,您可以從獲取診斷資訊開始:

megacli -AdpAllInfo -aALL
megacli -AdpPR -info -aALL
megacli -LdPdInfo -aALL
megacli -AdpBbuCmd -GetBbuStatus -aALL
megacli -AdpEventLog -GetEventLogInfo -aALL

這些命令分別執行以下操作:

  • 取得控制器的狀態和一般警報(包括故障設備數量)
  • 取得巡檢讀取操作的狀態(定期讀取所有設備以儘早發現故障設備)
  • 取得邏輯磁碟及其組成的實體磁碟和他們的地位。如果存在故障磁碟,您將看到它們位於哪些磁碟以及位於哪些插槽中。
  • 取得快取電池狀態
  • 取得適配器事件日誌;這可以幫助您準確地確定何時以及在何種情況下檢測到問題。

儘管您擁有 RAID,但您仍可以監控磁碟和陣列的運作狀況。只有在正確監控和維護的情況下,RAID 才有助於避免停機。smartmontools甚至能夠在某些硬體 RAID 控制器後面監控磁碟;用它!

是時候忘記那些「如果有用就不要碰它」和「如果它沒有壞就不要修復」的咒語。這些與快速發展的世界無關。想想看:舊版的作業系統已經壞了,因為它很舊。稱職的管理員將要修復明顯「未損壞」的系統,以確保它們不會損壞。

更糟的是,像 Fedora 這樣古老(10 年)的非 LTS 系統已經嚴重損壞。在這樣的發行版上託管任何重要業務的想法在設計上被打破了。如果是 CentOS(10 年前是 LTS,目前您將使用 Oracle Linux、AlmaLinux 或 Rocky Linux),則不會不好,但 Fedora 始終不是生產伺服器的合適選擇。所以即使它只有兩年,你也必須更換它。

最好始終安裝硬體管理工具(megacliipmiutil)。你永遠不知道什麼時候需要它們,然後你可能已經無法使用它們,所以提前鋪上稻草。

相關內容