Seguimiento del kernel de Ubuntu: solución de problemas de fallas aleatorias de "tarea colgada durante 120 segundos"

Seguimiento del kernel de Ubuntu: solución de problemas de fallas aleatorias de "tarea colgada durante 120 segundos"

Finalmente pude obtener un rastro del error de "tarea colgada durante 120 segundos". Nuestros servidores simplemente fallarían aleatoriamente ante picos de carga. Los tiempos de espera de IO parecen normales. La memoria/intercambio está bien. Estos son Ubuntu 10.04 - 64 bits basado en Xen. VPS, algunos incluso son "Baremetals".

A continuación se muestra el rastro:

Aug 24 11:13:25 admin1 kernel: [2732107.501749] INFO: task java:29952 blocked for more than 120 seconds.
Aug 24 11:13:25 admin1 kernel: [2732107.501954] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 24 11:13:25 admin1 kernel: [2732107.502223] java          D 0000000000000000     0 29952      1 0x00000000
Aug 24 11:13:25 admin1 kernel: [2732107.502227]  ffff8801d3b5fbb8 0000000000000082 ffffffff81a35800 ffff8801d635be8c
Aug 24 11:13:25 admin1 kernel: [2732107.502230]  0000000000014cc0 ffff88014fcc5b80 0000000000014cc0 ffff88014fcc5f18
Aug 24 11:13:25 admin1 kernel: [2732107.502233]  ffff8801d3b5ffd8 ffff88014fcc5f20 ffff8801d3b5e010 0000000000014cc0
Aug 24 11:13:25 admin1 kernel: [2732107.502236] Call Trace:
Aug 24 11:13:25 admin1 kernel: [2732107.502243]  [<ffffffff81589bd7>] __mutex_lock_slowpath+0xf7/0x180
Aug 24 11:13:25 admin1 kernel: [2732107.502247]  [<ffffffff81589abb>] mutex_lock+0x2b/0x50
Aug 24 11:13:25 admin1 kernel: [2732107.502251]  [<ffffffff8122633a>] __log_wait_for_space+0xda/0x1b0
Aug 24 11:13:25 admin1 kernel: [2732107.502255]  [<ffffffff814fed2f>] ? inet_recvmsg+0x4f/0x80
Aug 24 11:13:25 admin1 kernel: [2732107.502258]  [<ffffffff812220e1>] start_this_handle+0xf1/0x440
Aug 24 11:13:25 admin1 kernel: [2732107.502260]  [<ffffffff812225c1>] ? journal_start+0x81/0x100
Aug 24 11:13:25 admin1 kernel: [2732107.502263]  [<ffffffff812225f5>] journal_start+0xb5/0x100
Aug 24 11:13:25 admin1 kernel: [2732107.502266]  [<ffffffff811d3e71>] ext3_journal_start_sb+0x31/0x60
Aug 24 11:13:25 admin1 kernel: [2732107.502269]  [<ffffffff811ca63d>] ext3_dirty_inode+0x3d/0xa0
Aug 24 11:13:25 admin1 kernel: [2732107.502273]  [<ffffffff811776ca>] __mark_inode_dirty+0x3a/0x1f0
Aug 24 11:13:25 admin1 kernel: [2732107.502275]  [<ffffffff811ca7f5>] ext3_setattr+0x155/0x2c0
Aug 24 11:13:25 admin1 kernel: [2732107.502279]  [<ffffffff810097ce>] ? __switch_to+0x26e/0x320
Aug 24 11:13:25 admin1 kernel: [2732107.502283]  [<ffffffff8116c2c5>] notify_change+0x175/0x2f0
Aug 24 11:13:25 admin1 kernel: [2732107.502285]  [<ffffffff81588781>] ? schedule+0x411/0xa50
Aug 24 11:13:25 admin1 kernel: [2732107.502289]  [<ffffffff81153064>] do_truncate+0x64/0xa0
Aug 24 11:13:25 admin1 kernel: [2732107.502291]  [<ffffffff81153d7b>] ? generic_file_llseek+0x5b/0x70
Aug 24 11:13:25 admin1 kernel: [2732107.502294]  [<ffffffff811532d8>] sys_ftruncate+0xe8/0xf0
Aug 24 11:13:25 admin1 kernel: [2732107.502297]  [<ffffffff8100b0b2>] system_call_fastpath+0x16/0x1b

y otro justo debajo ahora cuando se ejecutó 'descomprimir', exactamente al mismo tiempo que el seguimiento anterior;

Aug 24 11:13:25 admin1 kernel: [2732107.503946] unzip         D 0000000000000000     0 28123      1 0x00000000
Aug 24 11:13:25 admin1 kernel: [2732107.503951]  ffff880031e156e8 0000000000000086 0000000000000000 0000000000000000
Aug 24 11:13:25 admin1 kernel: [2732107.503954]  0000000000014cc0 ffff8801183f5b80 0000000000014cc0 ffff8801183f5f18
Aug 24 11:13:25 admin1 kernel: [2732107.503957]  ffff880031e15fd8 ffff8801183f5f20 ffff880031e14010 0000000000014cc0
Aug 24 11:13:25 admin1 kernel: [2732107.503960] Call Trace:
Aug 24 11:13:25 admin1 kernel: [2732107.503964]  [<ffffffff81227745>] log_wait_commit+0xc5/0x150
Aug 24 11:13:25 admin1 kernel: [2732107.503967]  [<ffffffff81081150>] ? autoremove_wake_function+0x0/0x40
Aug 24 11:13:25 admin1 kernel: [2732107.503969]  [<ffffffff8158b08e>] ? _raw_spin_lock+0xe/0x20
Aug 24 11:13:25 admin1 kernel: [2732107.503972]  [<ffffffff81225ddb>] log_do_checkpoint+0xab/0x530
Aug 24 11:13:25 admin1 kernel: [2732107.503976]  [<ffffffff81036bd6>] ? __ticket_spin_lock+0x16/0x20
Aug 24 11:13:25 admin1 kernel: [2732107.503979]  [<ffffffff81226306>] __log_wait_for_space+0xa6/0x1b0
Aug 24 11:13:25 admin1 kernel: [2732107.503982]  [<ffffffff81070974>] ? mod_timer+0x174/0x2d0
Aug 24 11:13:25 admin1 kernel: [2732107.503985]  [<ffffffff812220e1>] start_this_handle+0xf1/0x440
Aug 24 11:13:25 admin1 kernel: [2732107.503987]  [<ffffffff812225c1>] ? journal_start+0x81/0x100
Aug 24 11:13:25 admin1 kernel: [2732107.503990]  [<ffffffff812225f5>] journal_start+0xb5/0x100
Aug 24 11:13:25 admin1 kernel: [2732107.503993]  [<ffffffff811d3e71>] ext3_journal_start_sb+0x31/0x60
Aug 24 11:13:25 admin1 kernel: [2732107.503995]  [<ffffffff811ccbe2>] ext3_write_begin+0xa2/0x230
Aug 24 11:13:25 admin1 kernel: [2732107.503998]  [<ffffffff8110029e>] generic_file_buffered_write+0xfe/0x250
Aug 24 11:13:25 admin1 kernel: [2732107.504001]  [<ffffffff811dc65f>] ? ext3_xattr_security_get+0x2f/0x40
Aug 24 11:13:25 admin1 kernel: [2732107.504004]  [<ffffffff81102df0>] __generic_file_aio_write+0x240/0x470
Aug 24 11:13:25 admin1 kernel: [2732107.504006]  [<ffffffff81103085>] generic_file_aio_write+0x65/0xd0
Aug 24 11:13:25 admin1 kernel: [2732107.504009]  [<ffffffff8115370a>] do_sync_write+0xda/0x120
Aug 24 11:13:25 admin1 kernel: [2732107.504012]  [<ffffffff812941a8>] ? apparmor_file_permission+0x18/0x20
Aug 24 11:13:25 admin1 kernel: [2732107.504015]  [<ffffffff8126366f>] ? security_file_permission+0x1f/0x80
Aug 24 11:13:25 admin1 kernel: [2732107.504017]  [<ffffffff811539f8>] vfs_write+0xc8/0x190
Aug 24 11:13:25 admin1 kernel: [2732107.504020]  [<ffffffff81154251>] sys_write+0x51/0x90
Aug 24 11:13:25 admin1 kernel: [2732107.504023]  [<ffffffff8100b0b2>] system_call_fastpath+0x16/0x1b

¿Alguien puede ayudar?

¿Es esto un punto muerto en el sistema de archivos?

tía.

Respuesta1

Esos seguimientos de pila indican que se estaba intentando una operación de escritura en el sistema de archivos en ese momento, así que verifique los problemas relacionados con el almacenamiento. Si tiene sistemas de archivos montados en red, verifique que el servidor sea accesible y funcione normalmente. Si no, revise sus unidades de disco. Deberías ver mensajes en dmesgeste caso y usarherramientas inteligentespara comprobar el estado de las unidades.

información relacionada