RAID array disconnects automatically

A RAID array connected to my system (Debian 7) via USB frequently disconnects for no apparent reason. After the initial connection, the device is detected and the array can be initialized, mounted, read, written, and unmounted quite normally. Invariably, however, after a short period of time (minutes to hours), the component disks disappear from /dev and from the output of fdisk -l, and they remain inaccessible until the device (i.e. the RAID enclosure) is restarted.

Judging by the output in /var/log/messages, the problem appears to lie in a reset of the USB device. After the unprovoked reset, the system repeatedly tries to reconnect the device, assigning a higher USB device number each time, and finally gives up after five reset attempts.

What is responsible for the device resetting? I suspect the fault lies with the USB controller. How can the unwanted automatic reset behavior be prevented?

The following excerpts from /var/log/messages show the typical behavior of the array after initialization and during the subsequent reset:

Initial connection:

Jun 19 19:38:51 hostname kernel: [406823.308418] usb 1-1.3: new high-speed USB device number 24 using ehci_hcd
Jun 19 19:38:51 hostname kernel: [406823.401317] usb 1-1.3: New USB device found, idVendor=152d, idProduct=2351
Jun 19 19:38:51 hostname kernel: [406823.401330] usb 1-1.3: New USB device strings: Mfr=1, Product=2, SerialNumber=5
Jun 19 19:38:51 hostname kernel: [406823.401338] usb 1-1.3: Product: USB to ATA/ATAPI Bridge
Jun 19 19:38:51 hostname kernel: [406823.401345] usb 1-1.3: Manufacturer: JMicron
Jun 19 19:38:51 hostname kernel: [406823.401350] usb 1-1.3: SerialNumber: DCC3..........
Jun 19 19:38:51 hostname kernel: [406823.402469] scsi16 : usb-storage 1-1.3:1.0
Jun 19 19:38:51 hostname mtp-probe: checking bus 1, device 24: "/sys/devices/pci0000:00/0000:00:1a.0/usb1/1-1/1-1.3"
Jun 19 19:38:51 hostname mtp-probe: bus: 1, device: 24 was not an MTP device
Jun 19 19:38:52 hostname kernel: [406824.400835] scsi 16:0:0:0: Direct-Access     WDC WD20 EFRX-68AX9N0          PQ: 0 ANSI: 5
Jun 19 19:38:52 hostname kernel: [406824.401450] scsi 16:0:0:1: Direct-Access     WDC WD20 EFRX-68AX9N0          PQ: 0 ANSI: 5
Jun 19 19:38:52 hostname kernel: [406824.402433] sd 16:0:0:0: Attached scsi generic sg2 type 0
Jun 19 19:38:52 hostname kernel: [406824.402583] sd 16:0:0:1: Attached scsi generic sg3 type 0
Jun 19 19:38:52 hostname kernel: [406824.662288] sd 16:0:0:0: [sdd] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)
Jun 19 19:38:52 hostname kernel: [406824.662789] sd 16:0:0:1: [sde] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)
Jun 19 19:38:52 hostname kernel: [406824.663573] sd 16:0:0:0: [sdd] Write Protect is off
Jun 19 19:38:52 hostname kernel: [406824.664356] sd 16:0:0:1: [sde] Write Protect is off
Jun 19 19:38:52 hostname kernel: [406824.665087] sd 16:0:0:0: [sdd] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
Jun 19 19:38:52 hostname kernel: [406824.666295] sd 16:0:0:1: [sde] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
Jun 19 19:38:52 hostname kernel: [406824.705355]  sdd: sdd1
Jun 19 19:38:52 hostname kernel: [406824.740148]  sde: sde1
Jun 19 19:38:52 hostname kernel: [406824.743667] sd 16:0:0:0: [sdd] Attached SCSI disk
Jun 19 19:38:52 hostname kernel: [406824.746756] sd 16:0:0:1: [sde] Attached SCSI disk

On reset:

Jun 19 20:05:25 hostname kernel: [408416.587392] usb 1-1.3: reset high-speed USB device number 24 using ehci_hcd
Jun 19 20:05:25 hostname kernel: [408416.679688] usb 1-1.3: device firmware changed
Jun 19 20:05:25 hostname kernel: [408416.679852] sd 16:0:0:0: Device offlined - not ready after error recovery
Jun 19 20:05:25 hostname kernel: [408416.679942] usb 1-1.3: USB disconnect, device number 24
Jun 19 20:05:25 hostname kernel: [408416.767366] usb 1-1.3: new high-speed USB device number 25 using ehci_hcd
Jun 19 20:05:25 hostname kernel: [408416.860214] usb 1-1.3: New USB device found, idVendor=152d, idProduct=2351
Jun 19 20:05:25 hostname kernel: [408416.860225] usb 1-1.3: New USB device strings: Mfr=1, Product=2, SerialNumber=5
Jun 19 20:05:25 hostname kernel: [408416.860232] usb 1-1.3: Product: USB to ATA/ATAPI Bridge
Jun 19 20:05:25 hostname kernel: [408416.860237] usb 1-1.3: Manufacturer: JMicron
Jun 19 20:05:25 hostname kernel: [408416.860241] usb 1-1.3: SerialNumber: 152D.............
Jun 19 20:05:25 hostname kernel: [408416.861634] scsi17 : usb-storage 1-1.3:1.0
Jun 19 20:05:25 hostname mtp-probe: checking bus 1, device 25: "/sys/devices/pci0000:00/0000:00:1a.0/usb1/1-1/1-1.3"
Jun 19 20:05:25 hostname mtp-probe: bus: 1, device: 25 was not an MTP device
Jun 19 20:05:47 hostname kernel: [408438.591825] usb 1-1.3: reset high-speed USB device number 25 using ehci_hcd
Jun 19 20:05:57 hostname kernel: [408448.750695] usb 1-1.3: reset high-speed USB device number 25 using ehci_hcd
Jun 19 20:06:02 hostname kernel: [408453.911854] usb 1-1.3: reset high-speed USB device number 25 using ehci_hcd
Jun 19 20:06:03 hostname kernel: [408454.359608] usb 1-1.3: reset high-speed USB device number 25 using ehci_hcd
Jun 19 20:06:03 hostname kernel: [408454.807394] usb 1-1.3: reset high-speed USB device number 25 using ehci_hcd
Jun 19 20:06:04 hostname kernel: [408455.295168] usb 1-1.3: reset high-speed USB device number 25 using ehci_hcd
Jun 19 20:06:04 hostname kernel: [408455.711201] scsi 17:0:0:0: Device offlined - not ready after error recovery
Jun 19 20:06:04 hostname kernel: [408455.711448] usb 1-1.3: USB disconnect, device number 25
Jun 19 20:06:04 hostname kernel: [408455.786917] usb 1-1.3: new high-speed USB device number 26 using ehci_hcd
Jun 19 20:06:05 hostname kernel: [408456.234679] usb 1-1.3: new high-speed USB device number 27 using ehci_hcd
Jun 19 20:06:05 hostname kernel: [408456.686418] usb 1-1.3: new high-speed USB device number 28 using ehci_hcd
Jun 19 20:06:06 hostname kernel: [408457.174018] usb 1-1.3: new high-speed USB device number 29 using ehci_hcd

The attempt to assign USB device number 29 is the last attempt to revive the disk array. To reconnect the device at this point, the RAID enclosure must be power-cycled or unplugged and plugged back in.
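To make the reset pattern in the excerpts above easier to see at a glance, the events can be tallied per USB device number. The following is only a minimal sketch: the function name `summarize_usb_events` and the `sample_lines` list are hypothetical, and the regular expressions assume the exact message wording shown in the log excerpts (other kernel versions may phrase these messages differently).

```python
import re
from collections import Counter

# Assumed message formats, copied from the log excerpts, e.g.
#   "usb 1-1.3: reset high-speed USB device number 25 using ehci_hcd"
#   "usb 1-1.3: USB disconnect, device number 25"
EVENT_RE = re.compile(
    r"usb (?P<port>[\d.-]+): (?P<kind>new|reset) high-speed USB device number (?P<num>\d+)"
)
DISCONNECT_RE = re.compile(
    r"usb (?P<port>[\d.-]+): USB disconnect, device number (?P<num>\d+)"
)


def summarize_usb_events(lines):
    """Count new/reset/disconnect events per (port, device number, kind)."""
    counts = Counter()
    for line in lines:
        m = EVENT_RE.search(line)
        if m:
            counts[(m.group("port"), int(m.group("num")), m.group("kind"))] += 1
            continue
        m = DISCONNECT_RE.search(line)
        if m:
            counts[(m.group("port"), int(m.group("num")), "disconnect")] += 1
    return counts


# A few lines taken from the reset excerpt, used here as sample input.
sample_lines = [
    "Jun 19 20:05:25 hostname kernel: [408416.587392] usb 1-1.3: reset high-speed USB device number 24 using ehci_hcd",
    "Jun 19 20:05:25 hostname kernel: [408416.679942] usb 1-1.3: USB disconnect, device number 24",
    "Jun 19 20:05:25 hostname kernel: [408416.767366] usb 1-1.3: new high-speed USB device number 25 using ehci_hcd",
    "Jun 19 20:05:47 hostname kernel: [408438.591825] usb 1-1.3: reset high-speed USB device number 25 using ehci_hcd",
    "Jun 19 20:05:57 hostname kernel: [408448.750695] usb 1-1.3: reset high-speed USB device number 25 using ehci_hcd",
]

for (port, num, kind), n in sorted(summarize_usb_events(sample_lines).items()):
    print(f"port {port}, device {num}: {kind} x{n}")
```

To run this against the real log, the lines of /var/log/messages could be passed to the function instead of `sample_lines`.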

Update: Recently, the device appeared to reset shortly after a resync. I'm not sure whether this is useful, but the error messages from /var/log/messages are included below:

Jul  5 02:55:02 hdac kernel: [135732.758796] md: md0: resync done.
Jul  5 03:12:04 hdac kernel: [136754.176970] usb 1-1.3: reset high-speed USB device number 10 using ehci_hcd
Jul  5 03:12:04 hdac kernel: [136754.269537] usb 1-1.3: device firmware changed
Jul  5 03:12:04 hdac kernel: [136754.269995] usb 1-1.3: USB disconnect, device number 10
Jul  5 03:12:04 hdac kernel: [136754.269998] sd 6:0:0:1: Device offlined - not ready after error recovery
Jul  5 03:12:04 hdac kernel: [136754.348882] usb 1-1.3: new high-speed USB device number 11 using ehci_hcd
Jul  5 03:12:04 hdac kernel: [136754.442408] usb 1-1.3: New USB device found, idVendor=152d, idProduct=2351
Jul  5 03:12:04 hdac kernel: [136754.442419] usb 1-1.3: New USB device strings: Mfr=1, Product=2, SerialNumber=5
Jul  5 03:12:04 hdac kernel: [136754.442425] usb 1-1.3: Product: USB to ATA/ATAPI Bridge
Jul  5 03:12:04 hdac kernel: [136754.442430] usb 1-1.3: Manufacturer: JMicron
Jul  5 03:12:04 hdac kernel: [136754.442434] usb 1-1.3: SerialNumber: 152D....
Jul  5 03:12:04 hdac kernel: [136754.443581] scsi7 : usb-storage 1-1.3:1.0
Jul  5 03:12:04 hdac mtp-probe: checking bus 1, device 11: "/sys/devices/pci0000:00/0000:00:1a.0/usb1/1-1/1-1.3"
Jul  5 03:12:04 hdac mtp-probe: bus: 1, device: 11 was not an MTP device
Jul  5 03:12:26 hdac kernel: [136776.197464] usb 1-1.3: reset high-speed USB device number 11 using ehci_hcd
Jul  5 03:12:36 hdac kernel: [136786.356068] usb 1-1.3: reset high-speed USB device number 11 using ehci_hcd
Jul  5 03:12:41 hdac kernel: [136791.517429] usb 1-1.3: reset high-speed USB device number 11 using ehci_hcd
Jul  5 03:12:42 hdac kernel: [136791.965221] usb 1-1.3: reset high-speed USB device number 11 using ehci_hcd
Jul  5 03:12:42 hdac kernel: [136792.412974] usb 1-1.3: reset high-speed USB device number 11 using ehci_hcd
Jul  5 03:12:43 hdac kernel: [136792.900701] usb 1-1.3: reset high-speed USB device number 11 using ehci_hcd
Jul  5 03:12:43 hdac kernel: [136793.316865] scsi 7:0:0:0: Device offlined - not ready after error recovery
Jul  5 03:12:43 hdac kernel: [136793.317120] usb 1-1.3: USB disconnect, device number 11
Jul  5 03:12:43 hdac kernel: [136793.388401] usb 1-1.3: new high-speed USB device number 12 using ehci_hcd
Jul  5 03:12:44 hdac kernel: [136793.836359] usb 1-1.3: new high-speed USB device number 13 using ehci_hcd
Jul  5 03:12:44 hdac kernel: [136794.283982] usb 1-1.3: new high-speed USB device number 14 using ehci_hcd
Jul  5 03:12:45 hdac kernel: [136794.776418] usb 1-1.3: new high-speed USB device number 15 using ehci_hcd
Jul  5 07:46:51 hdac rsyslogd: [origin software="rsyslogd" swVersion="5.8.11" x-pid="2234" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
Jul  5 07:46:51 hdac kernel: [153233.051604] md: super_written gets error=-19, uptodate=0
Jul  5 07:46:51 hdac kernel: [153233.080561] lost page write due to I/O error on md0
Jul  5 07:46:51 hdac kernel: [153233.082218] lost page write due to I/O error on md0
Jul  5 07:46:51 hdac kernel: [153233.088909] lost page write due to I/O error on md0
Jul  5 07:46:51 hdac kernel: [153233.088972] lost page write due to I/O error on md0
Jul  5 07:46:51 hdac kernel: [153233.089030] lost page write due to I/O error on md0
Jul  5 07:46:51 hdac kernel: [153233.089084] lost page write due to I/O error on md0
Jul  5 07:46:51 hdac kernel: [153233.089139] lost page write due to I/O error on md0
Jul  5 07:46:51 hdac kernel: [153233.089193] lost page write due to I/O error on md0
Jul  5 07:46:51 hdac kernel: [153233.089247] lost page write due to I/O error on md0
Jul  5 07:46:51 hdac kernel: [153233.089300] lost page write due to I/O error on md0
Jul  5 07:46:52 hdac kernel: [153233.308299] md: super_written gets error=-19, uptodate=0
Jul  5 15:04:02 hdac kernel: [179450.340233] md: super_written gets error=-19, uptodate=0
Jul  5 15:04:02 hdac kernel: [179450.340549] quiet_error: 101 callbacks suppressed
Jul  5 15:04:02 hdac kernel: [179450.340566] lost page write due to I/O error on md0
Jul  5 15:04:02 hdac kernel: [179450.340774] lost page write due to I/O error on md0
Jul  5 15:04:03 hdac kernel: [179450.541182] md: super_written gets error=-19, uptodate=0
Jul  5 15:04:08 hdac kernel: [179455.698562] md: super_written gets error=-19, uptodate=0
Jul  5 15:04:08 hdac kernel: [179455.699059] lost page write due to I/O error on md0
Jul  5 15:04:08 hdac kernel: [179455.902387] md: super_written gets error=-19, uptodate=0
Jul  5 15:04:32 hdac kernel: [179479.848336] md: super_written gets error=-19, uptodate=0
Jul  5 15:04:32 hdac kernel: [179479.848803] lost page write due to I/O error on md0
Jul  5 15:04:32 hdac kernel: [179479.848832] lost page write due to I/O error on md0
Jul  5 15:04:32 hdac kernel: [179480.049689] md: super_written gets error=-19, uptodate=0
Jul  5 15:20:11 hdac kernel: [180418.849041] md: super_written gets error=-19, uptodate=0
Jul  5 15:20:11 hdac kernel: [180418.852710] lost page write due to I/O error on md0
Jul  5 15:20:12 hdac kernel: [180419.056405] md: super_written gets error=-19, uptodate=0
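The `error=-19` in the md lines above is a negated kernel errno value. As a rough sketch, assuming the script runs on a Linux system (so that Python's errno table matches the kernel's), the code can be translated into its symbolic name; the helper name `decode_md_error` is hypothetical.

```python
import errno
import os
import re

# Assumed message format, copied from the log excerpt:
#   "md: super_written gets error=-19, uptodate=0"
MD_ERROR_RE = re.compile(r"md: super_written gets error=(-?\d+)")


def decode_md_error(line):
    """Return (code, symbolic name, description) for an md super_written line, or None."""
    m = MD_ERROR_RE.search(line)
    if m is None:
        return None
    code = abs(int(m.group(1)))  # the kernel logs errno values negated
    return code, errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


sample = (
    "Jul  5 07:46:51 hdac kernel: [153233.051604] "
    "md: super_written gets error=-19, uptodate=0"
)
print(decode_md_error(sample))
```

On Linux, errno 19 is ENODEV ("No such device"), which is consistent with md trying to write its superblock to component disks that have already vanished after the USB disconnect.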
