SMART errors on disks used with ZFS and mdadm

2024-11-22

For a while now, every time I reboot my workstation (called 'charm') I have been getting emails like this one:

Subject: SMART error (ErrorCount) detected on host: charm

This message was generated by the smartd daemon running on:

   host name:  charm
   DNS domain: jj5.net

The following warning/error was logged by the smartd daemon:

Device: /dev/nvme3, number of Error Log entries increased from 324 to 326

Device info:
PNY CS3140 1TB SSD, S/N:PNY21242106180100094, FW:CS314312, 1.00 TB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
The original message about this issue was sent at Sat Feb  4 12:53:13 2023 AEDT
Another message will be sent in 24 hours if the problem persists.

Every time I reboot I get four of these emails, one for each NVMe SSD in my workstation. As the error email shows, my drives are PNY CS3140 1TB SSDs, and I have four of them. I don't think the problem is specific to my PNY drives, because the same problem occurs on another computer of mine that uses Samsung 990 PRO NVMe drives.

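When I first set up the workstation I followed the advice in this article and set the ashift setting for my ZFS zpools to 14. The article says the value should be raised to 12 and that going higher does no harm. I thought the SMART errors might be related to this ZFS ashift setting, so I reinstalled my operating system (Ubuntu) and recreated my ZFS zpools without an ashift setting, like this: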

DISK1=/dev/disk/by-id/nvme-eui.6479a74fb0c00509
DISK2=/dev/disk/by-id/nvme-eui.6479a74fb0c00507
DISK3=/dev/disk/by-id/nvme-eui.6479a74fb0c004b7
DISK4=/dev/disk/by-id/nvme-eui.6479a74fb0c00508

zpool create -f \
    -o autotrim=on \
    -O acltype=posixacl -O compression=off \
    -O dnodesize=auto -O normalization=formD -O atime=off -O dedup=off \
    -O xattr=sa \
    best ${DISK1}-part4 ${DISK2}-part4 ${DISK3}-part4 ${DISK4}-part4

zpool create -f \
    -o autotrim=on \
    -O acltype=posixacl -O compression=lz4 \
    -O dnodesize=auto -O normalization=formD -O atime=off -O dedup=on \
    -O xattr=sa \
    fast mirror ${DISK1}-part5 ${DISK2}-part5 mirror ${DISK3}-part5 ${DISK4}-part5

I let the ashift setting be auto-detected, and it picked 9 for all of my disks:

$ zdb | grep ashift
            ashift: 9
            ashift: 9
            ashift: 9
            ashift: 9
            ashift: 9
            ashift: 9
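
For reference, ashift is the base-2 logarithm of the sector size ZFS assumes for a vdev, so the values mentioned above map to sector sizes as in this small sketch. The helper name ashift_to_bytes is made up for illustration and is not part of any ZFS tool.

```shell
#!/bin/sh
# ashift is log2 of the sector size, so the size in bytes is 1 << ashift.
# ashift_to_bytes is a hypothetical helper, not a ZFS command.
ashift_to_bytes() {
    echo $((1 << $1))
}

# The three values discussed in this post: auto-detected, recommended, original.
for a in 9 12 14; do
    printf 'ashift=%s -> %s-byte sectors\n' "$a" "$(ashift_to_bytes "$a")"
done
```

An auto-detected ashift of 9 is consistent with the 512-byte formatted LBA size that smartctl reports for these drives further down.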

You might be interested in this:

$ cat /proc/partitions | grep -v loop
major minor  #blocks  name

 259        0  976762584 nvme1n1
 259        2    1100800 nvme1n1p1
 259        3    1048576 nvme1n1p2
 259        4   52428800 nvme1n1p3
 259        5  104857600 nvme1n1p4
 259        6  817324032 nvme1n1p5
 259        1  976762584 nvme0n1
 259        7    1100800 nvme0n1p1
 259        8    1048576 nvme0n1p2
 259        9   52428800 nvme0n1p3
 259       10  104857600 nvme0n1p4
 259       11  817324032 nvme0n1p5
 259       12  976762584 nvme2n1
 259       14    1100800 nvme2n1p1
 259       15    1048576 nvme2n1p2
 259       16   52428800 nvme2n1p3
 259       17  104857600 nvme2n1p4
 259       18  817324032 nvme2n1p5
 259       13  976762584 nvme3n1
 259       19    1100800 nvme3n1p1
 259       20    1048576 nvme3n1p2
 259       21   52428800 nvme3n1p3
 259       22  104857600 nvme3n1p4
 259       23  817324032 nvme3n1p5
   9        1  104790016 md1
 259       24  104787968 md1p1
   9        0    2093056 md0
 259       25    2091008 md0p1
  11        0    1048575 sr0

And this:

$ cat /proc/mdstat
Personalities : [raid10] [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4]
md0 : active raid10 nvme3n1p2[2] nvme1n1p2[2] nvme2n1p2[0] nvme0n1p2[3]
      2093056 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]

md1 : active raid10 nvme2n1p3[3] nvme3n1p3[2] nvme1n1p3[0] nvme0n1p3[2]
      104790016 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]

unused devices: <none>

Here is some selected output from df:

$ df -h
Filesystem                 Size  Used Avail Use% Mounted on
/dev/md1p1                 100G   47G   51G  48% /
/dev/md0p1                 2.0G  261M  1.6G  15% /boot
/dev/nvme2n1p1             1.1G  6.1M  1.1G   1% /boot/efi
best                       325G  128K  325G   1% /best
fast                      1006G  128K 1006G   1% /fast

The root file system is btrfs.

As you can see, on my SSDs I use partitions 2 and 3 for the mdadm RAID arrays, and partitions 4 and 5 for the ZFS zpools 'best' and 'fast'.

The error message in the email says to see syslog for details, but there isn't much in syslog:

$ cat /var/log/syslog | grep smartd
Feb  8 15:20:33 charm smartd[3202]: Device: /dev/nvme2, number of Error Log entries increased from 323 to 324
Feb  8 15:20:33 charm smartd[3202]: Sending warning via /usr/share/smartmontools/smartd-runner to root ...
Feb  8 15:20:33 charm smartd[3202]: Warning via /usr/share/smartmontools/smartd-runner to root: successful
Feb 10 13:47:49 charm smartd[3233]: smartd 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-60-generic] (local build)
Feb 10 13:47:49 charm smartd[3233]: Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
Feb 10 13:47:49 charm smartd[3233]: Opened configuration file /etc/smartd.conf
Feb 10 13:47:49 charm smartd[3233]: Drive: DEVICESCAN, implied '-a' Directive on line 21 of file /etc/smartd.conf
Feb 10 13:47:49 charm smartd[3233]: Configuration file /etc/smartd.conf was parsed, found DEVICESCAN, scanning devices
Feb 10 13:47:49 charm smartd[3233]: Device: /dev/nvme0, opened
Feb 10 13:47:49 charm smartd[3233]: Device: /dev/nvme0, PNY CS3140 1TB SSD, S/N:PNY21242106180100095, FW:CS314312, 1.00 TB
Feb 10 13:47:49 charm smartd[3233]: Device: /dev/nvme0, is SMART capable. Adding to "monitor" list.
Feb 10 13:47:49 charm smartd[3233]: Device: /dev/nvme0, state read from /var/lib/smartmontools/smartd.PNY_CS3140_1TB_SSD-PNY21242106180100095.nvme.state
Feb 10 13:47:49 charm smartd[3233]: Device: /dev/nvme1, opened
Feb 10 13:47:49 charm smartd[3233]: Device: /dev/nvme1, PNY CS3140 1TB SSD, S/N:PNY21242106180100093, FW:CS314312, 1.00 TB
Feb 10 13:47:49 charm smartd[3233]: Device: /dev/nvme1, is SMART capable. Adding to "monitor" list.
Feb 10 13:47:49 charm smartd[3233]: Device: /dev/nvme1, state read from /var/lib/smartmontools/smartd.PNY_CS3140_1TB_SSD-PNY21242106180100093.nvme.state
Feb 10 13:47:49 charm smartd[3233]: Device: /dev/nvme2, opened
Feb 10 13:47:49 charm smartd[3233]: Device: /dev/nvme2, PNY CS3140 1TB SSD, S/N:PNY21242106180100092, FW:CS314312, 1.00 TB
Feb 10 13:47:49 charm smartd[3233]: Device: /dev/nvme2, is SMART capable. Adding to "monitor" list.
Feb 10 13:47:49 charm smartd[3233]: Device: /dev/nvme2, state read from /var/lib/smartmontools/smartd.PNY_CS3140_1TB_SSD-PNY21242106180100092.nvme.state
Feb 10 13:47:49 charm smartd[3233]: Device: /dev/nvme3, opened
Feb 10 13:47:49 charm smartd[3233]: Device: /dev/nvme3, PNY CS3140 1TB SSD, S/N:PNY21242106180100094, FW:CS314312, 1.00 TB
Feb 10 13:47:49 charm smartd[3233]: Device: /dev/nvme3, is SMART capable. Adding to "monitor" list.
Feb 10 13:47:49 charm smartd[3233]: Device: /dev/nvme3, state read from /var/lib/smartmontools/smartd.PNY_CS3140_1TB_SSD-PNY21242106180100094.nvme.state
Feb 10 13:47:49 charm smartd[3233]: Monitoring 0 ATA/SATA, 0 SCSI/SAS and 4 NVMe devices
Feb 10 13:47:49 charm smartd[3233]: Device: /dev/nvme0, number of Error Log entries increased from 318 to 320
Feb 10 13:47:49 charm smartd[3233]: Sending warning via /usr/share/smartmontools/smartd-runner to root ...
Feb 10 13:47:49 charm smartd[3233]: Warning via /usr/share/smartmontools/smartd-runner to root: successful
Feb 10 13:47:49 charm smartd[3233]: Device: /dev/nvme1, number of Error Log entries increased from 321 to 323
Feb 10 13:47:49 charm smartd[3233]: Sending warning via /usr/share/smartmontools/smartd-runner to root ...
Feb 10 13:47:49 charm smartd[3233]: Warning via /usr/share/smartmontools/smartd-runner to root: successful
Feb 10 13:47:49 charm smartd[3233]: Device: /dev/nvme2, number of Error Log entries increased from 324 to 326
Feb 10 13:47:49 charm smartd[3233]: Sending warning via /usr/share/smartmontools/smartd-runner to root ...
Feb 10 13:47:49 charm smartd[3233]: Warning via /usr/share/smartmontools/smartd-runner to root: successful
Feb 10 13:47:49 charm smartd[3233]: Device: /dev/nvme3, number of Error Log entries increased from 324 to 326
Feb 10 13:47:49 charm smartd[3233]: Sending warning via /usr/share/smartmontools/smartd-runner to root ...
Feb 10 13:47:49 charm smartd[3233]: Warning via /usr/share/smartmontools/smartd-runner to root: successful
Feb 10 13:47:49 charm smartd[3233]: Device: /dev/nvme0, state written to /var/lib/smartmontools/smartd.PNY_CS3140_1TB_SSD-PNY21242106180100095.nvme.state
Feb 10 13:47:49 charm smartd[3233]: Device: /dev/nvme1, state written to /var/lib/smartmontools/smartd.PNY_CS3140_1TB_SSD-PNY21242106180100093.nvme.state
Feb 10 13:47:49 charm smartd[3233]: Device: /dev/nvme2, state written to /var/lib/smartmontools/smartd.PNY_CS3140_1TB_SSD-PNY21242106180100092.nvme.state
Feb 10 13:47:49 charm smartd[3233]: Device: /dev/nvme3, state written to /var/lib/smartmontools/smartd.PNY_CS3140_1TB_SSD-PNY21242106180100094.nvme.state
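
To see at a glance how much each counter grows per boot, the "increased from X to Y" lines in the excerpt above can be reduced to per-device deltas. This is only a rough sketch: the function name smartd_error_deltas is made up, and the pattern assumes the exact smartd message wording shown in the log.

```shell
#!/bin/sh
# Summarise smartd "Error Log entries increased" lines read from stdin.
# Prints one line per match: <device> <old> <new> <delta>
# smartd_error_deltas is a hypothetical helper, not part of smartmontools.
smartd_error_deltas() {
    sed -n 's/.*Device: \([^,]*\), number of Error Log entries increased from \([0-9]*\) to \([0-9]*\).*/\1 \2 \3/p' |
    while read -r dev old new; do
        echo "$dev $old $new $((new - old))"
    done
}

# Example input: two lines copied from the syslog excerpt above.
smartd_error_deltas <<'EOF'
Feb 10 13:47:49 charm smartd[3233]: Device: /dev/nvme0, number of Error Log entries increased from 318 to 320
Feb 10 13:47:49 charm smartd[3233]: Device: /dev/nvme3, number of Error Log entries increased from 324 to 326
EOF
```

On a live system the same function could be fed with grep smartd /var/log/syslog | smartd_error_deltas. In the log above every drive gains exactly 2 entries on the Feb 10 boot.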

Here is some output from smartctl -x for the nvme3 device:

# smartctl -x /dev/nvme3
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-60-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       PNY CS3140 1TB SSD
Serial Number:                      PNY21242106180100094
Firmware Version:                   CS314312
PCI Vendor/Subsystem ID:            0x1987
IEEE OUI Identifier:                0x6479a7
Total NVM Capacity:                 1,000,204,886,016 [1.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      1
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,000,204,886,016 [1.00 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            6479a7 4fb0c00508
Local Time is:                      Sat Feb 11 06:57:14 2023 AEDT
Firmware Updates (0x12):            1 Slot, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005d):     Comp DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x08):         Telmtry_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     84 Celsius
Critical Comp. Temp. Threshold:     89 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     8.80W       -        -    0  0  0  0        0       0
 1 +     7.10W       -        -    1  1  1  1        0       0
 2 +     5.20W       -        -    2  2  2  2        0       0
 3 -   0.0620W       -        -    3  3  3  3     2500    7500
 4 -   0.0440W       -        -    4  4  4  4    10500   65000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        41 Celsius
Available Spare:                    100%
Available Spare Threshold:          5%
Percentage Used:                    16%
Data Units Read:                    21,133,741 [10.8 TB]
Data Units Written:                 151,070,190 [77.3 TB]
Host Read Commands:                 202,445,947
Host Write Commands:                2,302,434,105
Controller Busy Time:               5,268
Power Cycles:                       58
Power On Hours:                     7,801
Unsafe Shutdowns:                   33
Media and Data Integrity Errors:    0
Error Information Log Entries:      326
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 16 of 63 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0        326     0  0x1018  0x4004  0x028            0     0     -

So I'm not sure what the error actually is. I'm not sure what is causing it. I'm not sure how serious it is (everything seems to be working). In particular, I don't know how to fix it.

Any suggestions would be appreciated.

Answer 1

I have the same issue, but with Crucial P3 CT4000P3SSD8 drives. I have two identical ones running in a ZFS mirror pool on Ubuntu 22.04 (kernel 5.15.0-67-generic), and every time I restart the system 2 errors are added to each drive's SMART error log.

I installed nvme-cli and ran nvme error-log /dev/<your_drive>. The log showed error 0x2002 (INVALID_FIELD: A reserved coded value or an unsupported value in a defined field).
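
As a rough cross-check, the Status 0x4004 that smartctl prints in the question looks like the same status as this 0x2002, just with the phase-tag bit still included. The sketch below splits the raw value into fields. The bit layout is my reading of the NVMe base specification, and the function names are made up for illustration, so treat the result as an assumption rather than an authoritative decode.

```shell
#!/bin/sh
# Split a raw 16-bit NVMe error-log Status value (as smartctl prints it).
# Assumed layout: bit 0 = phase tag, bits 8:1 = status code (SC),
# bits 11:9 = status code type (SCT), bit 14 = more, bit 15 = do not retry.
# decode_nvme_status and decode_nvme_peloc are hypothetical helper names.
decode_nvme_status() {
    raw=$(($1))
    printf 'shifted=0x%04x sc=0x%02x sct=%d more=%d dnr=%d\n' \
        $((raw >> 1)) $(((raw >> 1) & 0xff)) $(((raw >> 9) & 0x7)) \
        $(((raw >> 14) & 1)) $(((raw >> 15) & 1))
}

# Parameter Error Location: assumed bits 7:0 = byte offset within the
# 64-byte command, bits 10:8 = bit offset within that byte.
decode_nvme_peloc() {
    raw=$(($1))
    printf 'byte=%d bit=%d\n' $((raw & 0xff)) $(((raw >> 8) & 0x7))
}

decode_nvme_status 0x4004   # Status column from the smartctl output
decode_nvme_peloc 0x028     # PELoc column from the smartctl output
```

Under these assumptions the status code comes out as 0x02 in the generic status type, which matches the INVALID_FIELD text from nvme-cli, and the error location is byte 40 of the command, which is where command dword 10 starts.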

I found this thread on nvme-cli's GitHub page, and in particular this comment by @keithbusch, which says:

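The sqid is the admin queue, so the error most likely indicates that the driver or some tool attempted a harmless optional command that the controller doesn't support. Device vendors are known to be overly cautious about logging these errors. The spec allows this behavior, but it does not require it, and it doesn't help anyone.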

I think we're safe. We just have to learn to live with this overactive logging...
