I have a RAID-Z2 pool with 6 × 4 TB drives. All of the drives have a little over 40,000 power-on hours. Now they all seem to be degrading at the same time. The pool is degraded and every drive is marked as DEGRADED with "too many errors". Luckily, no data has been lost so far.
    NAME          STATE     READ WRITE CKSUM
    File          DEGRADED     0     0     0
      raidz2-0    DEGRADED     0     0     0
        sda       DEGRADED     0     0     0  too many errors
        sdb       DEGRADED     0     0     0  too many errors
        sdc       DEGRADED     0     0     0  too many errors
        sdd       DEGRADED     0     0     0  too many errors
        sde       DEGRADED     0     0     0  too many errors
        sdf       DEGRADED     0     0     0  too many errors
I would like to build a new RAID-Z1 pool with 3 × 6 TB drives, since I don't need as much space as the original pool provides. My problem is that the old pool has 6 drives and the new one will have 3, but my SAS controller only has 8 ports. So I would like to disconnect one disk from the RAID-Z2 pool, connect the 3 new drives, create a new pool on them, and then rescue my data by copying it to the new pool before the old pool fails.
Is that possible? My understanding is that a RAID-Z2 pool should keep working with one disk missing. But when I tried disconnecting a disk, I could not access any data on the old pool.
Does anyone know how to solve this?
Output of zpool status -v:
pool: File
state: DEGRADED
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://zfsonlinux.org/msg/ZFS-8000-9P
scan: resilvered 6.82G in 0 days 00:04:00 with 0 errors on Sun Aug 23 21:21:15 2020
config:
    NAME          STATE     READ WRITE CKSUM
    File          DEGRADED     0     0     0
      raidz2-0    DEGRADED     0     0     0
        sda       DEGRADED     0     0     0  too many errors
        sdb       DEGRADED     0     0     0  too many errors
        sdc       DEGRADED     0     0     0  too many errors
        sdd       DEGRADED     0     0     0  too many errors
        sde       DEGRADED     0     0     0  too many errors
        sdf       DEGRADED     0     0     0  too many errors
errors: No known data errors
All of the disks report a healthy SMART status:
smartctl -H /dev/sda
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.55-1-pve] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
syslog appears to be empty:
root@boxvm:/var/log# cat syslog | grep sda
root@boxvm:/var/log#
The dmesg output also looks fine:
dmesg | grep sda
[ 8.997624] sd 1:0:0:0: [sda] Enabling DIF Type 2 protection
[ 8.998488] sd 1:0:0:0: [sda] 7814037168 512-byte logical blocks: (4.00 TB/3.64 TiB)
[ 8.998847] sd 1:0:0:0: [sda] Write Protect is off
[ 8.998848] sd 1:0:0:0: [sda] Mode Sense: df 00 10 08
[ 8.999540] sd 1:0:0:0: [sda] Write cache: disabled, read cache: enabled, supports DPO and FUA
[ 9.093385] sda: sda1 sda9
[ 9.096819] sd 1:0:0:0: [sda] Attached SCSI disk
dmesg | grep sdb
[ 8.997642] sd 1:0:1:0: [sdb] Enabling DIF Type 2 protection
[ 8.998467] sd 1:0:1:0: [sdb] 7814037168 512-byte logical blocks: (4.00 TB/3.64 TiB)
[ 8.998828] sd 1:0:1:0: [sdb] Write Protect is off
[ 8.998830] sd 1:0:1:0: [sdb] Mode Sense: df 00 10 08
[ 8.999524] sd 1:0:1:0: [sdb] Write cache: disabled, read cache: enabled, supports DPO and FUA
[ 9.087056] sdb: sdb1 sdb9
[ 9.090465] sd 1:0:1:0: [sdb] Attached SCSI disk
dmesg | grep sdc
[ 8.997812] sd 1:0:2:0: [sdc] Enabling DIF Type 2 protection
[ 8.998639] sd 1:0:2:0: [sdc] 7814037168 512-byte logical blocks: (4.00 TB/3.64 TiB)
[ 8.998998] sd 1:0:2:0: [sdc] Write Protect is off
[ 8.998999] sd 1:0:2:0: [sdc] Mode Sense: df 00 10 08
[ 8.999692] sd 1:0:2:0: [sdc] Write cache: disabled, read cache: enabled, supports DPO and FUA
[ 9.084259] sdc: sdc1 sdc9
[ 9.088030] sd 1:0:2:0: [sdc] Attached SCSI disk
dmesg | grep sdd
[ 8.997932] sd 1:0:3:0: [sdd] Enabling DIF Type 2 protection
[ 8.998761] sd 1:0:3:0: [sdd] 7814037168 512-byte logical blocks: (4.00 TB/3.64 TiB)
[ 8.999120] sd 1:0:3:0: [sdd] Write Protect is off
[ 8.999121] sd 1:0:3:0: [sdd] Mode Sense: df 00 10 08
[ 8.999818] sd 1:0:3:0: [sdd] Write cache: disabled, read cache: enabled, supports DPO and FUA
[ 9.103840] sdd: sdd1 sdd9
[ 9.107482] sd 1:0:3:0: [sdd] Attached SCSI disk
dmesg | grep sde
[ 8.998017] sd 1:0:4:0: [sde] Enabling DIF Type 2 protection
[ 8.998839] sd 1:0:4:0: [sde] 7814037168 512-byte logical blocks: (4.00 TB/3.64 TiB)
[ 8.999234] sd 1:0:4:0: [sde] Write Protect is off
[ 8.999235] sd 1:0:4:0: [sde] Mode Sense: df 00 10 08
[ 8.999933] sd 1:0:4:0: [sde] Write cache: disabled, read cache: enabled, supports DPO and FUA
[ 9.088282] sde: sde1 sde9
[ 9.091665] sd 1:0:4:0: [sde] Attached SCSI disk
dmesg | grep sdf
[ 8.998247] sd 1:0:5:0: [sdf] Enabling DIF Type 2 protection
[ 8.999076] sd 1:0:5:0: [sdf] 7814037168 512-byte logical blocks: (4.00 TB/3.64 TiB)
[ 8.999435] sd 1:0:5:0: [sdf] Write Protect is off
[ 8.999436] sd 1:0:5:0: [sdf] Mode Sense: df 00 10 08
[ 9.000136] sd 1:0:5:0: [sdf] Write cache: disabled, read cache: enabled, supports DPO and FUA
[ 9.090609] sdf: sdf1 sdf9
[ 9.094235] sd 1:0:5:0: [sdf] Attached SCSI disk
dmesg for the SAS controller:
root@boxvm:/var/log# dmesg | grep mpt2
[ 1.151805] mpt2sas_cm0: 64 BIT PCI BUS DMA ADDRESSING SUPPORTED, total mem (65793672 kB)
[ 1.200012] mpt2sas_cm0: CurrentHostPageSize is 0: Setting default host page size to 4k
[ 1.200023] mpt2sas_cm0: MSI-X vectors supported: 1
[ 1.200024] mpt2sas_cm0: 0 1
[ 1.200098] mpt2sas_cm0: High IOPs queues : disabled
[ 1.200099] mpt2sas0-msix0: PCI-MSI-X enabled: IRQ 51
[ 1.200100] mpt2sas_cm0: iomem(0x00000000fc740000), mapped(0x00000000629d5dd1), size(65536)
[ 1.200101] mpt2sas_cm0: ioport(0x000000000000d000), size(256)
[ 1.254826] mpt2sas_cm0: CurrentHostPageSize is 0: Setting default host page size to 4k
[ 1.281681] mpt2sas_cm0: scatter gather: sge_in_main_msg(1), sge_per_chain(9), sge_per_io(128), chains_per_io(15)
[ 1.281746] mpt2sas_cm0: request pool(0x0000000074c49e3e) - dma(0xfcd700000): depth(3492), frame_size(128), pool_size(436 kB)
[ 1.289333] mpt2sas_cm0: sense pool(0x00000000693be9f4)- dma(0xfcba00000): depth(3367),element_size(96), pool_size(315 kB)
[ 1.289400] mpt2sas_cm0: config page(0x00000000f6926acf) - dma(0xfcb9ad000): size(512)
[ 1.289401] mpt2sas_cm0: Allocated physical memory: size(1687 kB)
[ 1.289401] mpt2sas_cm0: Current Controller Queue Depth(3364),Max Controller Queue Depth(3432)
[ 1.289402] mpt2sas_cm0: Scatter Gather Elements per IO(128)
[ 1.333780] mpt2sas_cm0: LSISAS2008: FWVersion(20.00.07.00), ChipRevision(0x03), BiosVersion(00.00.00.00)
[ 1.333781] mpt2sas_cm0: Protocol=(Initiator,Target), Capabilities=(TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set Full,NCQ)
[ 1.334527] mpt2sas_cm0: sending port enable !!
[ 2.861790] mpt2sas_cm0: host_add: handle(0x0001), sas_addr(0x590b11c0155b3300), phys(8)
[ 8.996385] mpt2sas_cm0: port enable: SUCCESS
Answer 1
If your pool is already failing, degrading it further is a very bad idea. If all of your disks show errors at once, the most likely cause is a faulty controller or a faulty power supply rather than faulty disks.
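One way to probe that suspicion, sketched here under assumptions rather than as a definitive procedure: pull the full SMART report for each disk (the `-H` summary hides the detailed error counters), reset the pool's error counters, and then watch whether errors reappear on all six disks together. The pool and device names are taken from the question; the `run` helper and the `DRY_RUN` variable are hypothetical and only there so that the script prints its commands by default instead of touching the pool.

```sh
#!/bin/sh
# Names from the question; adjust to your system.
POOL="File"
DISKS="sda sdb sdc sdd sde sdf"

# Hypothetical safety switch: print commands unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

check_plan() {
    # Full SMART/SCSI report per disk, including the error counter logs.
    # smartctl may exit non-zero when it finds problems, so do not abort here.
    for d in $DISKS; do
        run smartctl -x "/dev/$d"
    done
    # Reset the READ/WRITE/CKSUM counters and the DEGRADED markers.
    run zpool clear "$POOL" || return 1
    # Re-check after some normal read activity to see which disks err again.
    run zpool status -v "$POOL" || return 1
}

check_plan
```

If the counters climb again on every disk at roughly the same time, that points toward something the disks share (controller, cabling, backplane, power) rather than six independent drive failures.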
As a first step, you would do well to invest in an additional controller to hang the replacement disks off.
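With the new disks attached that way, the old pool can stay complete while the data is copied. A minimal sketch of the copy, assuming a recursive snapshot replicated with `zfs send` / `zfs receive`: the pool name `File` comes from the question, while `newpool`, the snapshot name `migrate`, and the `/dev/disk/by-id/NEW-DISK-*` paths are placeholders I made up for illustration. The `run` helper and `DRY_RUN` switch are also hypothetical; by default the script only prints what it would do.

```sh
#!/bin/sh
OLD_POOL="File"          # pool name from the question
NEW_POOL="newpool"       # placeholder name for the new RAID-Z1 pool
SNAP="migrate"           # placeholder snapshot name
# Placeholder device paths; substitute the real by-id names of the new disks.
NEW_DISKS="/dev/disk/by-id/NEW-DISK-1 /dev/disk/by-id/NEW-DISK-2 /dev/disk/by-id/NEW-DISK-3"

# Hypothetical safety switch: print commands unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

migrate_plan() {
    # 1. Create the new RAID-Z1 pool on the three new disks.
    #    NEW_DISKS is deliberately unquoted so it splits into three arguments.
    run zpool create "$NEW_POOL" raidz1 $NEW_DISKS || return 1
    # 2. Take a recursive snapshot of every dataset in the old pool.
    run zfs snapshot -r "$OLD_POOL@$SNAP" || return 1
    # 3. Replicate all datasets and their properties to the new pool.
    #    -u leaves the received file systems unmounted, so their inherited
    #    mountpoints do not collide with the old pool's.
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ zfs send -R $OLD_POOL@$SNAP | zfs receive -F -u $NEW_POOL"
    else
        zfs send -R "$OLD_POOL@$SNAP" | zfs receive -F -u "$NEW_POOL" || return 1
    fi
    # 4. Start a scrub of the copy and show the new pool's status.
    run zpool scrub "$NEW_POOL" || return 1
    run zpool status -v "$NEW_POOL" || return 1
}

migrate_plan
```

Review the printed commands first, then rerun with `DRY_RUN=0` once the device paths are correct. Reading from the old pool this way keeps all six of its disks online, so its full RAID-Z2 redundancy is still available to repair bad reads during the copy.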