Rastreamento do kernel do Ubuntu - solução de problemas de travamentos aleatórios de "tarefa travada por 120 segundos"

Rastreamento do kernel do Ubuntu - solução de problemas de travamentos aleatórios de "tarefa travada por 120 segundos"

Finalmente consegui rastrear um erro de 'tarefa travada por 120 segundos'. Nossos servidores travavam aleatoriamente em picos de carga. Os tempos de espera de IO parecem normais. Memória/Swap está bem. Estes são Ubuntu 10.04 - 64 bits baseado em Xen. VPS, alguns são até “Baremetals”.

Abaixo está o rastreamento:

Aug 24 11:13:25 admin1 kernel: [2732107.501749] INFO: task java:29952 blocked for more than 120 seconds.
Aug 24 11:13:25 admin1 kernel: [2732107.501954] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 24 11:13:25 admin1 kernel: [2732107.502223] java          D 0000000000000000     0 29952      1 0x00000000
Aug 24 11:13:25 admin1 kernel: [2732107.502227]  ffff8801d3b5fbb8 0000000000000082 ffffffff81a35800 ffff8801d635be8c
Aug 24 11:13:25 admin1 kernel: [2732107.502230]  0000000000014cc0 ffff88014fcc5b80 0000000000014cc0 ffff88014fcc5f18
Aug 24 11:13:25 admin1 kernel: [2732107.502233]  ffff8801d3b5ffd8 ffff88014fcc5f20 ffff8801d3b5e010 0000000000014cc0
Aug 24 11:13:25 admin1 kernel: [2732107.502236] Call Trace:
Aug 24 11:13:25 admin1 kernel: [2732107.502243]  [<ffffffff81589bd7>] __mutex_lock_slowpath+0xf7/0x180
Aug 24 11:13:25 admin1 kernel: [2732107.502247]  [<ffffffff81589abb>] mutex_lock+0x2b/0x50
Aug 24 11:13:25 admin1 kernel: [2732107.502251]  [<ffffffff8122633a>] __log_wait_for_space+0xda/0x1b0
Aug 24 11:13:25 admin1 kernel: [2732107.502255]  [<ffffffff814fed2f>] ? inet_recvmsg+0x4f/0x80
Aug 24 11:13:25 admin1 kernel: [2732107.502258]  [<ffffffff812220e1>] start_this_handle+0xf1/0x440
Aug 24 11:13:25 admin1 kernel: [2732107.502260]  [<ffffffff812225c1>] ? journal_start+0x81/0x100
Aug 24 11:13:25 admin1 kernel: [2732107.502263]  [<ffffffff812225f5>] journal_start+0xb5/0x100
Aug 24 11:13:25 admin1 kernel: [2732107.502266]  [<ffffffff811d3e71>] ext3_journal_start_sb+0x31/0x60
Aug 24 11:13:25 admin1 kernel: [2732107.502269]  [<ffffffff811ca63d>] ext3_dirty_inode+0x3d/0xa0
Aug 24 11:13:25 admin1 kernel: [2732107.502273]  [<ffffffff811776ca>] __mark_inode_dirty+0x3a/0x1f0
Aug 24 11:13:25 admin1 kernel: [2732107.502275]  [<ffffffff811ca7f5>] ext3_setattr+0x155/0x2c0
Aug 24 11:13:25 admin1 kernel: [2732107.502279]  [<ffffffff810097ce>] ? __switch_to+0x26e/0x320
Aug 24 11:13:25 admin1 kernel: [2732107.502283]  [<ffffffff8116c2c5>] notify_change+0x175/0x2f0
Aug 24 11:13:25 admin1 kernel: [2732107.502285]  [<ffffffff81588781>] ? schedule+0x411/0xa50
Aug 24 11:13:25 admin1 kernel: [2732107.502289]  [<ffffffff81153064>] do_truncate+0x64/0xa0
Aug 24 11:13:25 admin1 kernel: [2732107.502291]  [<ffffffff81153d7b>] ? generic_file_llseek+0x5b/0x70
Aug 24 11:13:25 admin1 kernel: [2732107.502294]  [<ffffffff811532d8>] sys_ftruncate+0xe8/0xf0
Aug 24 11:13:25 admin1 kernel: [2732107.502297]  [<ffffffff8100b0b2>] system_call_fastpath+0x16/0x1b

e outro logo abaixo, quando 'unzip' foi executado - exatamente ao mesmo tempo que o rastreamento acima;

Aug 24 11:13:25 admin1 kernel: [2732107.503946] unzip         D 0000000000000000     0 28123      1 0x00000000
Aug 24 11:13:25 admin1 kernel: [2732107.503951]  ffff880031e156e8 0000000000000086 0000000000000000 0000000000000000
Aug 24 11:13:25 admin1 kernel: [2732107.503954]  0000000000014cc0 ffff8801183f5b80 0000000000014cc0 ffff8801183f5f18
Aug 24 11:13:25 admin1 kernel: [2732107.503957]  ffff880031e15fd8 ffff8801183f5f20 ffff880031e14010 0000000000014cc0
Aug 24 11:13:25 admin1 kernel: [2732107.503960] Call Trace:
Aug 24 11:13:25 admin1 kernel: [2732107.503964]  [<ffffffff81227745>] log_wait_commit+0xc5/0x150
Aug 24 11:13:25 admin1 kernel: [2732107.503967]  [<ffffffff81081150>] ? autoremove_wake_function+0x0/0x40
Aug 24 11:13:25 admin1 kernel: [2732107.503969]  [<ffffffff8158b08e>] ? _raw_spin_lock+0xe/0x20
Aug 24 11:13:25 admin1 kernel: [2732107.503972]  [<ffffffff81225ddb>] log_do_checkpoint+0xab/0x530
Aug 24 11:13:25 admin1 kernel: [2732107.503976]  [<ffffffff81036bd6>] ? __ticket_spin_lock+0x16/0x20
Aug 24 11:13:25 admin1 kernel: [2732107.503979]  [<ffffffff81226306>] __log_wait_for_space+0xa6/0x1b0
Aug 24 11:13:25 admin1 kernel: [2732107.503982]  [<ffffffff81070974>] ? mod_timer+0x174/0x2d0
Aug 24 11:13:25 admin1 kernel: [2732107.503985]  [<ffffffff812220e1>] start_this_handle+0xf1/0x440
Aug 24 11:13:25 admin1 kernel: [2732107.503987]  [<ffffffff812225c1>] ? journal_start+0x81/0x100
Aug 24 11:13:25 admin1 kernel: [2732107.503990]  [<ffffffff812225f5>] journal_start+0xb5/0x100
Aug 24 11:13:25 admin1 kernel: [2732107.503993]  [<ffffffff811d3e71>] ext3_journal_start_sb+0x31/0x60
Aug 24 11:13:25 admin1 kernel: [2732107.503995]  [<ffffffff811ccbe2>] ext3_write_begin+0xa2/0x230
Aug 24 11:13:25 admin1 kernel: [2732107.503998]  [<ffffffff8110029e>] generic_file_buffered_write+0xfe/0x250
Aug 24 11:13:25 admin1 kernel: [2732107.504001]  [<ffffffff811dc65f>] ? ext3_xattr_security_get+0x2f/0x40
Aug 24 11:13:25 admin1 kernel: [2732107.504004]  [<ffffffff81102df0>] __generic_file_aio_write+0x240/0x470
Aug 24 11:13:25 admin1 kernel: [2732107.504006]  [<ffffffff81103085>] generic_file_aio_write+0x65/0xd0
Aug 24 11:13:25 admin1 kernel: [2732107.504009]  [<ffffffff8115370a>] do_sync_write+0xda/0x120
Aug 24 11:13:25 admin1 kernel: [2732107.504012]  [<ffffffff812941a8>] ? apparmor_file_permission+0x18/0x20
Aug 24 11:13:25 admin1 kernel: [2732107.504015]  [<ffffffff8126366f>] ? security_file_permission+0x1f/0x80
Aug 24 11:13:25 admin1 kernel: [2732107.504017]  [<ffffffff811539f8>] vfs_write+0xc8/0x190
Aug 24 11:13:25 admin1 kernel: [2732107.504020]  [<ffffffff81154251>] sys_write+0x51/0x90
Aug 24 11:13:25 admin1 kernel: [2732107.504023]  [<ffffffff8100b0b2>] system_call_fastpath+0x16/0x1b

alguém pode ajudar?

Isso é um impasse no sistema de arquivos?

tia.

Responder1

Esses rastreamentos de pilha indicam que uma operação de gravação do sistema de arquivos estava sendo tentada no momento, portanto, verifique os problemas relacionados ao armazenamento. Se você tiver sistemas de arquivos montados em rede, verifique se o servidor está acessível e funcionando normalmente. Caso contrário, verifique suas unidades de disco. Você deve ver as mensagens dmesgneste caso e usarferramentas inteligentespara verificar o estado das unidades.

informação relacionada