Como devemos entender esse rastreamento de chamada ‘bloqueado por mais de 120 segundos’?

Como devemos entender esse rastreamento de chamada ‘bloqueado por mais de 120 segundos’?

Ontem, um serviço Java em nosso cluster Docker Swarm parou de responder repentinamente. A tarefa (contêiner) deste serviço simplesmente travou, não respondeu a nenhuma solicitação e não pôde ser eliminada.

Olhando para /var/log/syslog, vemos o seguinte no momento em que o serviço foi interrompido. Estamos um pouco em dúvida sobre como entender esses eventos de log.

Nosso entendimento é que esse problema está relacionado a dados trocados que precisam ser trazidos de volta à memória.

O rastreamento de chamada é gravado em /var/log/syslog exatamente 120 segundos após a última entrada de log do serviço Java.

Estamos executando o cluster no Hyper-V e vemos que há uma menção a hv_netvsc(Consumidor de serviço virtual de rede Hyper-V). Isso poderia ser um problema causado pelo hipervisor?

O processo Java consome bastante recursos e geralmente requer cerca de 15 a 20 GB de RAM.

Como devemos entender esse rastreamento de chamada? Esse rastreamento de chamada poderia descrever a raiz do problema ou é mais provável que seja uma consequência do problema que também eliminou o serviço?

Dec  9 20:45:36 prodswarm1 kernel: [216481.008119] INFO: task java:8748 blocked for more than 120 seconds.
Dec  9 20:45:36 prodswarm1 kernel: [216481.008150]       Not tainted 4.4.0-101-generic #124-Ubuntu
Dec  9 20:45:36 prodswarm1 kernel: [216481.008169] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec  9 20:45:36 prodswarm1 kernel: [216481.008194] java            D ffff881fdf63f6b8     0  8748   7863 0x00000100
Dec  9 20:45:36 prodswarm1 kernel: [216481.008198]  ffff881fdf63f6b8 ffff881fdf63f690 ffff881fd93af000 ffff881fd93a8e00
Dec  9 20:45:36 prodswarm1 kernel: [216481.008200]  ffff881fdf640000 ffff8810278d6e00 7fffffffffffffff ffffffff81840ed0
Dec  9 20:45:36 cprodswarm1phprodswarm1 kernel: [216481.008202]  ffff881fdf63f810 ffff881fdf63f6d0 ffffffff818406d5 0000000000000000
Dec  9 20:45:36 prodswarm1 kernel: [216481.008204] Call Trace:
Dec  9 20:45:36 prodswarm1 kernel: [216481.008211]  [<ffffffff81840ed0>] ? bit_wait+0x60/0x60
Dec  9 20:45:36 prodswarm1 kernel: [216481.008214]  [<ffffffff818406d5>] schedule+0x35/0x80
Dec  9 20:45:36 prodswarm1 kernel: [216481.008216]  [<ffffffff81843825>] schedule_timeout+0x1b5/0x270
Dec  9 20:45:36 prodswarm1 kernel: [216481.008222]  [<ffffffff810b2d0c>] ? __enqueue_entity+0x6c/0x70
Dec  9 20:45:36 prodswarm1 kernel: [216481.008224]  [<ffffffff810b9b87>] ? put_prev_entity+0x97/0x7d0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008228]  [<ffffffff810f635b>] ? ktime_get+0x3b/0xb0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008229]  [<ffffffff81840ed0>] ? bit_wait+0x60/0x60
Dec  9 20:45:36 prodswarm1 kernel: [216481.008231]  [<ffffffff8183fc04>] io_schedule_timeout+0xa4/0x110
Dec  9 20:45:36 prodswarm1 kernel: [216481.008233]  [<ffffffff81840eeb>] bit_wait_io+0x1b/0x70
Dec  9 20:45:36 prodswarm1 kernel: [216481.008234]  [<ffffffff81840a7d>] __wait_on_bit+0x5d/0x90
Dec  9 20:45:36 prodswarm1 kernel: [216481.008239]  [<ffffffff8118efdb>] wait_on_page_bit+0xcb/0xf0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008242]  [<ffffffff810c4460>] ? autoremove_wake_function+0x40/0x40
Dec  9 20:45:36 prodswarm1 kernel: [216481.008246]  [<ffffffff811a5faa>] shrink_page_list+0x78a/0x7a0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008248]  [<ffffffff811a6659>] shrink_inactive_list+0x209/0x520
Dec  9 20:45:36 prodswarm1 kernel: [216481.008251]  [<ffffffff811a72e3>] shrink_lruvec+0x583/0x740
Dec  9 20:45:36 prodswarm1 kernel: [216481.008254]  [<ffffffff81098601>] ? insert_work+0x11/0xc0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008256]  [<ffffffff811a758f>] shrink_zone+0xef/0x2e0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008258]  [<ffffffff811a78db>] do_try_to_free_pages+0x15b/0x3b0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008260]  [<ffffffff811a7d6a>] try_to_free_mem_cgroup_pages+0xba/0x1a0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008264]  [<ffffffff811ffc64>] try_charge+0x194/0x660
Dec  9 20:45:36 prodswarm1 kernel: [216481.008267]  [<ffffffff81203f7b>] mem_cgroup_try_charge+0x6b/0x1e0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008269]  [<ffffffff811fb71a>] do_huge_pmd_anonymous_page+0x16a/0x580
Dec  9 20:45:36 prodswarm1 kernel: [216481.008271]  [<ffffffff811c30b4>] handle_mm_fault+0x15d4/0x1820
Dec  9 20:45:36 prodswarm1 kernel: [216481.008275]  [<ffffffff81103ba7>] ? do_futex+0x307/0x540
Dec  9 20:45:36 prodswarm1 kernel: [216481.008279]  [<ffffffff8106b577>] __do_page_fault+0x197/0x400
Dec  9 20:45:36 prodswarm1 kernel: [216481.008281]  [<ffffffff8106b802>] do_page_fault+0x22/0x30
Dec  9 20:45:36 prodswarm1 kernel: [216481.008283]  [<ffffffff81846978>] page_fault+0x28/0x30
Dec  9 20:45:36 prodswarm1 kernel: [216481.008293] INFO: task java:14024 blocked for more than 120 seconds.
Dec  9 20:45:36 prodswarm1 kernel: [216481.008313]       Not tainted 4.4.0-101-generic #124-Ubuntu
Dec  9 20:45:36 prodswarm1 kernel: [216481.008331] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec  9 20:45:36 prodswarm1 kernel: [216481.008356] java            D ffff8800f2c5b158     0 14024   7863 0x00000100
Dec  9 20:45:36 prodswarm1 kernel: [216481.008358]  ffff8800f2c5b158 0000000000000000 ffff8810208aaa00 ffff8800f2ac0000
Dec  9 20:45:36 prodswarm1 kernel: [216481.008360]  ffff8800f2c5c000 ffff881027956e00 7fffffffffffffff ffffffff81840ed0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008362]  ffff8800f2c5b2b0 ffff8800f2c5b170 ffffffff818406d5 0000000000000000
Dec  9 20:45:36 prodswarm1 kernel: [216481.008363] Call Trace:
Dec  9 20:45:36 prodswarm1 kernel: [216481.008365]  [<ffffffff81840ed0>] ? bit_wait+0x60/0x60
Dec  9 20:45:36 prodswarm1 kernel: [216481.008367]  [<ffffffff818406d5>] schedule+0x35/0x80
Dec  9 20:45:36 prodswarm1 kernel: [216481.008369]  [<ffffffff81843825>] schedule_timeout+0x1b5/0x270
Dec  9 20:45:36 prodswarm1 kernel: [216481.008374]  [<ffffffffc01f095b>] ? netvsc_start_xmit+0x3ab/0x7b0 [hv_netvsc]
Dec  9 20:45:36 prodswarm1 kernel: [216481.008376]  [<ffffffff810f635b>] ? ktime_get+0x3b/0xb0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008377]  [<ffffffff81840ed0>] ? bit_wait+0x60/0x60
Dec  9 20:45:36 prodswarm1 kernel: [216481.008379]  [<ffffffff8183fc04>] io_schedule_timeout+0xa4/0x110
Dec  9 20:45:36 prodswarm1 kernel: [216481.008381]  [<ffffffff81840eeb>] bit_wait_io+0x1b/0x70
Dec  9 20:45:36 prodswarm1 kernel: [216481.008382]  [<ffffffff81840a7d>] __wait_on_bit+0x5d/0x90
Dec  9 20:45:36 prodswarm1 kernel: [216481.008385]  [<ffffffff8118efdb>] wait_on_page_bit+0xcb/0xf0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008387]  [<ffffffff810c4460>] ? autoremove_wake_function+0x40/0x40
Dec  9 20:45:36 prodswarm1 kernel: [216481.008388]  [<ffffffff811a5faa>] shrink_page_list+0x78a/0x7a0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008391]  [<ffffffff811a6659>] shrink_inactive_list+0x209/0x520
Dec  9 20:45:36 prodswarm1 kernel: [216481.008392]  [<ffffffff811a72e3>] shrink_lruvec+0x583/0x740
Dec  9 20:45:36 prodswarm1 kernel: [216481.008394]  [<ffffffff811a758f>] shrink_zone+0xef/0x2e0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008396]  [<ffffffff811a78db>] do_try_to_free_pages+0x15b/0x3b0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008398]  [<ffffffff811a7d6a>] try_to_free_mem_cgroup_pages+0xba/0x1a0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008400]  [<ffffffff811ffc64>] try_charge+0x194/0x660
Dec  9 20:45:36 prodswarm1 kernel: [216481.008402]  [<ffffffff81203836>] __memcg_kmem_charge_memcg+0x96/0xc0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008406]  [<ffffffff811ed611>] new_slab+0x461/0x490
Dec  9 20:45:36 prodswarm1 kernel: [216481.008408]  [<ffffffff811ee48b>] ___slab_alloc+0x22b/0x470
Dec  9 20:45:36 prodswarm1 kernel: [216481.008429]  [<ffffffffc045597b>] ? nfs_generic_pgio+0x6b/0x2a0 [nfs]
Dec  9 20:45:36 prodswarm1 kernel: [216481.008431]  [<ffffffff81203659>] ? __memcg_kmem_get_cache+0x39/0x130
Dec  9 20:45:36 prodswarm1 kernel: [216481.008437]  [<ffffffffc045597b>] ? nfs_generic_pgio+0x6b/0x2a0 [nfs]
Dec  9 20:45:36 prodswarm1 kernel: [216481.008439]  [<ffffffff811ee6f0>] __slab_alloc+0x20/0x40
Dec  9 20:45:36 prodswarm1 kernel: [216481.008441]  [<ffffffff811f0085>] __kmalloc+0x1d5/0x250
Dec  9 20:45:36 prodswarm1 kernel: [216481.008448]  [<ffffffffc045597b>] nfs_generic_pgio+0x6b/0x2a0 [nfs]
Dec  9 20:45:36 prodswarm1 kernel: [216481.008456]  [<ffffffffc0455c18>] nfs_generic_pg_pgios+0x68/0xd0 [nfs]
Dec  9 20:45:36 prodswarm1 kernel: [216481.008463]  [<ffffffffc0455857>] nfs_pageio_doio+0x27/0x60 [nfs]
Dec  9 20:45:36 prodswarm1 kernel: [216481.008470]  [<ffffffffc045631e>] __nfs_pageio_add_request+0xae/0x480 [nfs]
Dec  9 20:45:36 prodswarm1 kernel: [216481.008477]  [<ffffffffc0456da3>] nfs_pageio_add_request+0xa3/0x1b0 [nfs]
Dec  9 20:45:36 prodswarm1 kernel: [216481.008487]  [<ffffffffc045a76a>] nfs_do_writepage+0xca/0x1c0 [nfs]
Dec  9 20:45:36 prodswarm1 kernel: [216481.008496]  [<ffffffffc045a874>] nfs_writepages_callback+0x14/0x30 [nfs]
Dec  9 20:45:36 prodswarm1 kernel: [216481.008499]  [<ffffffff8119ba4e>] write_cache_pages+0x1ee/0x510
Dec  9 20:45:36 prodswarm1 kernel: [216481.008502]  [<ffffffff810c4175>] ? wake_up_bit+0x25/0x30
Dec  9 20:45:36 prodswarm1 kernel: [216481.008511]  [<ffffffffc045a860>] ? nfs_do_writepage+0x1c0/0x1c0 [nfs]
Dec  9 20:45:36 prodswarm1 kernel: [216481.008520]  [<ffffffffc045ac95>] nfs_writepages+0xa5/0x150 [nfs]
Dec  9 20:45:36 prodswarm1 kernel: [216481.008528]  [<ffffffffc044bdeb>] ? nfs_write_end+0x16b/0x4b0 [nfs]
Dec  9 20:45:36 prodswarm1 kernel: [216481.008530]  [<ffffffff8119e25e>] do_writepages+0x1e/0x30
Dec  9 20:45:36 prodswarm1 kernel: [216481.008532]  [<ffffffff81190f86>] __filemap_fdatawrite_range+0xc6/0x100
Dec  9 20:45:36 prodswarm1 kernel: [216481.008534]  [<ffffffff811910ca>] filemap_write_and_wait_range+0x2a/0x70
Dec  9 20:45:36 prodswarm1 kernel: [216481.008539]  [<ffffffffc044b1d3>] nfs_file_fsync+0x83/0x130 [nfs]
Dec  9 20:45:36 prodswarm1 kernel: [216481.008542]  [<ffffffff8124509b>] vfs_fsync_range+0x4b/0xb0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008544]  [<ffffffff8124511c>] vfs_fsync+0x1c/0x20
Dec  9 20:45:36 prodswarm1 kernel: [216481.008549]  [<ffffffffc044b6a6>] nfs_file_flush+0x46/0x60 [nfs]
Dec  9 20:45:36 prodswarm1 kernel: [216481.008552]  [<ffffffff8120d7df>] filp_close+0x2f/0x70
Dec  9 20:45:36 prodswarm1 kernel: [216481.008554]  [<ffffffff8122f193>] __close_fd+0xa3/0xd0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008556]  [<ffffffff8120d843>] SyS_close+0x23/0x50
Dec  9 20:45:36 prodswarm1 kernel: [216481.008558]  [<ffffffff818447f2>] entry_SYSCALL_64_fastpath+0x16/0x71

Responder1

Para mim, parece que é algum tipo de impasse durante a troca. Você precisa monitorar o uso de memória do aplicativo e o uso de swap e ver como ele se comporta quando essa condição ocorre. Dessa forma, você pode ver se o problema é pouca memória alocada para o contêiner.

informação relacionada