Does a 4-disk RAID 5 stop working when 1 disk fails?

Answer 1

This is a fundamental problem with RAID5: a bad block encountered during a rebuild is a killer.

Oct  2 15:08:51 it kernel: [1686185.573233] md/raid:md0: device xvdc operational as raid disk 0
Oct  2 15:08:51 it kernel: [1686185.580020] md/raid:md0: device xvde operational as raid disk 2
Oct  2 15:08:51 it kernel: [1686185.588307] md/raid:md0: device xvdd operational as raid disk 1
Oct  2 15:08:51 it kernel: [1686185.595745] md/raid:md0: allocated 4312kB
Oct  2 15:08:51 it kernel: [1686185.600729] md/raid:md0: raid level 5 active with 3 out of 4 devices, algorithm 2
Oct  2 15:08:51 it kernel: [1686185.608928] md0: detected capacity change from 0 to 2705221484544
⋮

The array was assembled in a degraded state, from xvdc, xvde, and xvdd. Apparently there is a hot spare:

Oct  2 15:08:51 it kernel: [1686185.615772] md: recovery of RAID array md0
Oct  2 15:08:51 it kernel: [1686185.621150] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
Oct  2 15:08:51 it kernel: [1686185.627626] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
Oct  2 15:08:51 it kernel: [1686185.634024]  md0: unknown partition table
Oct  2 15:08:51 it kernel: [1686185.645882] md: using 128k window, over a total of 880605952k.

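The 'partition table' message is unrelated. The other messages show that md is attempting a recovery, probably onto a hot spare (which may be the device that failed earlier, if you tried to remove and re-add it).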

⋮
Oct  2 15:24:19 it kernel: [1687112.817845] end_request: I/O error, dev xvde, sector 881423360
Oct  2 15:24:19 it kernel: [1687112.820517] raid5_end_read_request: 1 callbacks suppressed
Oct  2 15:24:19 it kernel: [1687112.821837] md/raid:md0: read error not correctable (sector 881423360 on xvde).
Oct  2 15:24:19 it kernel: [1687112.821837] md/raid:md0: Disk failure on xvde, disabling device.
Oct  2 15:24:19 it kernel: [1687112.821837] md/raid:md0: Operation continuing on 2 devices.

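Here md is trying to read a sector from xvde (one of the three remaining devices). The read fails (probably a bad sector), and because the array is already degraded, md cannot reconstruct that data from the other members. It therefore kicks the disk out of the array, and with two disks now failed, your RAID5 is dead.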
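
To spot this condition in a long kernel log, something like the following may help. It is only a sketch: it assumes the message wording shown in the excerpt above (other kernel versions may phrase it differently), and md_failed_devices is a name made up for illustration.

```shell
# Sketch: read kernel log text on stdin and print "<array> <device>"
# for every member that md disabled.
# Assumes the wording from the excerpt above, e.g.
#   md/raid:md0: Disk failure on xvde, disabling device.
md_failed_devices() {
    sed -n 's|.*md/raid:\([^:]*\): Disk failure on \([^,]*\), disabling device.*|\1 \2|p'
}
```

It could be fed from dmesg or from a saved log file, for example `dmesg | md_failed_devices`.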

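I'm not sure why it is being labeled as a spare; that is odd (though I normally look at /proc/mdstat, so maybe that is just how mdadm labels it). Also, I thought newer kernels were much more hesitant to kick a disk out over bad blocks, but maybe you are running something older?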

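What can you do about this?

Good backups. They are always an important part of any strategy for keeping data alive.

Make sure the array is scrubbed for bad blocks routinely. Your OS may already include a cron job for this. You do it by echoing either repair or check to /sys/block/md0/md/sync_action. "repair" also repairs any parity errors it discovers (e.g., parity that does not match the data on the disks).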

# echo repair > /sys/block/md0/md/sync_action
#

You can watch the progress with cat /proc/mdstat, or with the various files in that sysfs directory. (Somewhat up-to-date documentation is available in the Linux Raid Wiki mdstat article.)

Note: on older kernels (I'm not sure of the exact version), check may not fix bad blocks.
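
The scrub step can be wrapped in a pair of small helpers. This is a sketch under stated assumptions: the function names and the MD_SYSFS variable are hypothetical conveniences, not part of md or mdadm. It relies only on the sync_action and mismatch_cnt files in the array's sysfs directory, and writing to sync_action requires root.

```shell
# Sketch: request a scrub of one md array and report its status.
# MD_SYSFS is a hypothetical override so the helpers can be pointed at
# another array's sysfs directory; it defaults to md0.

md_scrub() {
    action=${1:-check}                       # "check" or "repair"
    md_dir=${MD_SYSFS:-/sys/block/md0/md}
    case $action in
        check|repair) ;;
        *) echo "usage: md_scrub [check|repair]" >&2; return 2 ;;
    esac
    echo "$action" > "$md_dir/sync_action"   # needs root on a real array
}

md_scrub_status() {
    md_dir=${MD_SYSFS:-/sys/block/md0/md}
    # sync_action reads "idle" again once the scrub has finished;
    # mismatch_cnt counts the inconsistent sectors that were found.
    printf 'action=%s mismatches=%s\n' \
        "$(cat "$md_dir/sync_action")" "$(cat "$md_dir/mismatch_cnt")"
}
```

Running `md_scrub check` and then calling `md_scrub_status` from time to time gives a rough picture; /proc/mdstat remains the place to look for a percentage.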

One final option is to switch to RAID6. This requires another disk (you can run a four- or even three-disk RAID6, but you probably don't want to). With a new enough kernel, bad blocks are fixed on the fly when possible. RAID6 can survive two disk failures, so when one disk has failed it can still survive a bad block; it will both map out the bad block and continue with the rebuild.
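
To see why RAID6 on the same four disks is unattractive, it helps to work out the usable capacity. The helper below is a sketch that assumes equal-sized members, one member's worth of parity for RAID5 and two for RAID6; md_usable_bytes is a made-up name, and the per-device size is taken from the "over a total of 880605952k" line in the log.

```shell
# Sketch: usable capacity in bytes of an md array with equal-sized members.
# Arguments: RAID level (5 or 6), number of active devices,
#            per-device size in KiB.
md_usable_bytes() {
    level=$1
    devices=$2
    size_kib=$3
    case $level in
        5) parity=1 ;;   # one member's worth of parity
        6) parity=2 ;;   # two members' worth of parity
        *) echo "unsupported level: $level" >&2; return 2 ;;
    esac
    echo $(( ($devices - $parity) * $size_kib * 1024 ))
}
```

`md_usable_bytes 5 4 880605952` prints 2705221484544, which matches the "detected capacity change" line in the log. `md_usable_bytes 6 4 880605952` prints 1803480989696, so keeping the current capacity under RAID6 takes a fifth disk.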

Answer 2

I'm guessing you created your RAID5 array like this:

$ mdadm --create /dev/md0 --level=5 --raid-devices=4 \
       /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1

That isn't exactly what you want. Instead, you need to add the disks like this:

$ mdadm --create /dev/md0 --level=5 --raid-devices=3 \
       /dev/sda1 /dev/sdb1 /dev/sdc1
$ mdadm --add /dev/md0 /dev/sdd1

Or you can use mdadm's options to add the spare at creation time, like this:

$ mdadm --create /dev/md0 --level=5 --raid-devices=3 --spare-devices=1 \
       /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1

The last drive in the list will become the spare.
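
To confirm afterwards which member ended up as the spare, /proc/mdstat marks spares with an "(S)" suffix (and failed members with "(F)"). The helper below is a rough sketch; md_spares is a hypothetical name, and the optional second argument exists only so the function can be pointed at a saved copy of mdstat.

```shell
# Sketch: print the spare members of one array, one per line.
# Arguments: array name (e.g. md0), optional mdstat-formatted file
#            (defaults to /proc/mdstat).
md_spares() {
    array=$1
    file=${2:-/proc/mdstat}
    awk -v a="$array" '$1 == a && $2 == ":" {
        for (i = 3; i <= NF; i++)
            if ($i ~ /\(S\)$/) {        # e.g. sdd1[3](S)
                sub(/\[.*/, "", $i)     # strip "[3](S)", keep "sdd1"
                print $i
            }
    }' "$file"
}
```

For an array created with the command above, `md_spares md0` would be expected to list the last device once the array is running.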

Excerpt from the mdadm man page

-n, --raid-devices=
      Specify the number of active devices in the array.  This, plus the 
      number of spare devices (see below) must  equal the  number  of  
      component-devices (including "missing" devices) that are listed on 
      the command line for --create. Setting a value of 1 is probably a 
      mistake and so requires that --force be specified first.  A  value 
      of  1  will then be allowed for linear, multipath, RAID0 and RAID1.  
      It is never allowed for RAID4, RAID5 or RAID6. This  number  can only 
      be changed using --grow for RAID1, RAID4, RAID5 and RAID6 arrays, and
      only on kernels which provide the necessary support.

-x, --spare-devices=
      Specify the number of spare (eXtra) devices in the initial array.  
      Spares can also be  added  and  removed  later. The  number  of component
      devices listed on the command line must equal the number of RAID devices 
      plus the number of spare devices.
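
The counting rule in the excerpt is easy to check before running --create. A minimal sketch, with md_create_counts_ok as a made-up helper name:

```shell
# Sketch: succeed only if the number of listed component devices equals
# raid-devices plus spare-devices, as the man page excerpt requires.
# Arguments: raid-devices, spare-devices, then the component devices.
md_create_counts_ok() {
    raid_devices=$1
    spare_devices=$2
    shift 2
    [ "$#" -eq $(( $raid_devices + $spare_devices )) ]
}
```

For instance, `md_create_counts_ok 3 1 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1` succeeds, while `md_create_counts_ok 4 0 /dev/sda1 /dev/sdb1 /dev/sdc1` fails because only three devices are listed for four active slots.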
