¿Cómo debemos entender este seguimiento de llamada 'bloqueado durante más de 120 segundos'?

¿Cómo debemos entender este seguimiento de llamada 'bloqueado durante más de 120 segundos'?

Ayer, un servicio Java en nuestro clúster Docker Swarm dejó de responder repentinamente. La tarea (contenedor) para este servicio simplemente se colgó, no respondió ninguna solicitud y no se pudo eliminar.

Al observar /var/log/syslog, vemos lo siguiente en el momento en que se detuvo el servicio. Tenemos algunas dudas sobre cómo entender estos eventos de registro.

Entendemos que este problema se relaciona con datos intercambiados que deben devolverse a la memoria.

El seguimiento de llamadas se escribe en /var/log/syslog exactamente 120 segundos después de la última entrada de registro del servicio Java.

Estamos ejecutando el clúster sobre Hyper-V y vemos que se menciona hv_netvsc(Consumidor de servicios virtuales de red Hyper-V). ¿Podría ser esto un problema causado por el hipervisor?

El proceso Java consume bastantes recursos y normalmente requiere entre 15 y 20 GB de RAM.

¿Cómo debemos entender este rastreo de llamadas? ¿Podría este seguimiento de llamada describir la raíz del problema o es más probable que sea una consecuencia del problema que también acabó con el servicio?

Dec  9 20:45:36 prodswarm1 kernel: [216481.008119] INFO: task java:8748 blocked for more than 120 seconds.
Dec  9 20:45:36 prodswarm1 kernel: [216481.008150]       Not tainted 4.4.0-101-generic #124-Ubuntu
Dec  9 20:45:36 prodswarm1 kernel: [216481.008169] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec  9 20:45:36 prodswarm1 kernel: [216481.008194] java            D ffff881fdf63f6b8     0  8748   7863 0x00000100
Dec  9 20:45:36 prodswarm1 kernel: [216481.008198]  ffff881fdf63f6b8 ffff881fdf63f690 ffff881fd93af000 ffff881fd93a8e00
Dec  9 20:45:36 prodswarm1 kernel: [216481.008200]  ffff881fdf640000 ffff8810278d6e00 7fffffffffffffff ffffffff81840ed0
Dec  9 20:45:36 cprodswarm1phprodswarm1 kernel: [216481.008202]  ffff881fdf63f810 ffff881fdf63f6d0 ffffffff818406d5 0000000000000000
Dec  9 20:45:36 prodswarm1 kernel: [216481.008204] Call Trace:
Dec  9 20:45:36 prodswarm1 kernel: [216481.008211]  [<ffffffff81840ed0>] ? bit_wait+0x60/0x60
Dec  9 20:45:36 prodswarm1 kernel: [216481.008214]  [<ffffffff818406d5>] schedule+0x35/0x80
Dec  9 20:45:36 prodswarm1 kernel: [216481.008216]  [<ffffffff81843825>] schedule_timeout+0x1b5/0x270
Dec  9 20:45:36 prodswarm1 kernel: [216481.008222]  [<ffffffff810b2d0c>] ? __enqueue_entity+0x6c/0x70
Dec  9 20:45:36 prodswarm1 kernel: [216481.008224]  [<ffffffff810b9b87>] ? put_prev_entity+0x97/0x7d0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008228]  [<ffffffff810f635b>] ? ktime_get+0x3b/0xb0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008229]  [<ffffffff81840ed0>] ? bit_wait+0x60/0x60
Dec  9 20:45:36 prodswarm1 kernel: [216481.008231]  [<ffffffff8183fc04>] io_schedule_timeout+0xa4/0x110
Dec  9 20:45:36 prodswarm1 kernel: [216481.008233]  [<ffffffff81840eeb>] bit_wait_io+0x1b/0x70
Dec  9 20:45:36 prodswarm1 kernel: [216481.008234]  [<ffffffff81840a7d>] __wait_on_bit+0x5d/0x90
Dec  9 20:45:36 prodswarm1 kernel: [216481.008239]  [<ffffffff8118efdb>] wait_on_page_bit+0xcb/0xf0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008242]  [<ffffffff810c4460>] ? autoremove_wake_function+0x40/0x40
Dec  9 20:45:36 prodswarm1 kernel: [216481.008246]  [<ffffffff811a5faa>] shrink_page_list+0x78a/0x7a0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008248]  [<ffffffff811a6659>] shrink_inactive_list+0x209/0x520
Dec  9 20:45:36 prodswarm1 kernel: [216481.008251]  [<ffffffff811a72e3>] shrink_lruvec+0x583/0x740
Dec  9 20:45:36 prodswarm1 kernel: [216481.008254]  [<ffffffff81098601>] ? insert_work+0x11/0xc0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008256]  [<ffffffff811a758f>] shrink_zone+0xef/0x2e0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008258]  [<ffffffff811a78db>] do_try_to_free_pages+0x15b/0x3b0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008260]  [<ffffffff811a7d6a>] try_to_free_mem_cgroup_pages+0xba/0x1a0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008264]  [<ffffffff811ffc64>] try_charge+0x194/0x660
Dec  9 20:45:36 prodswarm1 kernel: [216481.008267]  [<ffffffff81203f7b>] mem_cgroup_try_charge+0x6b/0x1e0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008269]  [<ffffffff811fb71a>] do_huge_pmd_anonymous_page+0x16a/0x580
Dec  9 20:45:36 prodswarm1 kernel: [216481.008271]  [<ffffffff811c30b4>] handle_mm_fault+0x15d4/0x1820
Dec  9 20:45:36 prodswarm1 kernel: [216481.008275]  [<ffffffff81103ba7>] ? do_futex+0x307/0x540
Dec  9 20:45:36 prodswarm1 kernel: [216481.008279]  [<ffffffff8106b577>] __do_page_fault+0x197/0x400
Dec  9 20:45:36 prodswarm1 kernel: [216481.008281]  [<ffffffff8106b802>] do_page_fault+0x22/0x30
Dec  9 20:45:36 prodswarm1 kernel: [216481.008283]  [<ffffffff81846978>] page_fault+0x28/0x30
Dec  9 20:45:36 prodswarm1 kernel: [216481.008293] INFO: task java:14024 blocked for more than 120 seconds.
Dec  9 20:45:36 prodswarm1 kernel: [216481.008313]       Not tainted 4.4.0-101-generic #124-Ubuntu
Dec  9 20:45:36 prodswarm1 kernel: [216481.008331] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec  9 20:45:36 prodswarm1 kernel: [216481.008356] java            D ffff8800f2c5b158     0 14024   7863 0x00000100
Dec  9 20:45:36 prodswarm1 kernel: [216481.008358]  ffff8800f2c5b158 0000000000000000 ffff8810208aaa00 ffff8800f2ac0000
Dec  9 20:45:36 prodswarm1 kernel: [216481.008360]  ffff8800f2c5c000 ffff881027956e00 7fffffffffffffff ffffffff81840ed0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008362]  ffff8800f2c5b2b0 ffff8800f2c5b170 ffffffff818406d5 0000000000000000
Dec  9 20:45:36 prodswarm1 kernel: [216481.008363] Call Trace:
Dec  9 20:45:36 prodswarm1 kernel: [216481.008365]  [<ffffffff81840ed0>] ? bit_wait+0x60/0x60
Dec  9 20:45:36 prodswarm1 kernel: [216481.008367]  [<ffffffff818406d5>] schedule+0x35/0x80
Dec  9 20:45:36 prodswarm1 kernel: [216481.008369]  [<ffffffff81843825>] schedule_timeout+0x1b5/0x270
Dec  9 20:45:36 prodswarm1 kernel: [216481.008374]  [<ffffffffc01f095b>] ? netvsc_start_xmit+0x3ab/0x7b0 [hv_netvsc]
Dec  9 20:45:36 prodswarm1 kernel: [216481.008376]  [<ffffffff810f635b>] ? ktime_get+0x3b/0xb0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008377]  [<ffffffff81840ed0>] ? bit_wait+0x60/0x60
Dec  9 20:45:36 prodswarm1 kernel: [216481.008379]  [<ffffffff8183fc04>] io_schedule_timeout+0xa4/0x110
Dec  9 20:45:36 prodswarm1 kernel: [216481.008381]  [<ffffffff81840eeb>] bit_wait_io+0x1b/0x70
Dec  9 20:45:36 prodswarm1 kernel: [216481.008382]  [<ffffffff81840a7d>] __wait_on_bit+0x5d/0x90
Dec  9 20:45:36 prodswarm1 kernel: [216481.008385]  [<ffffffff8118efdb>] wait_on_page_bit+0xcb/0xf0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008387]  [<ffffffff810c4460>] ? autoremove_wake_function+0x40/0x40
Dec  9 20:45:36 prodswarm1 kernel: [216481.008388]  [<ffffffff811a5faa>] shrink_page_list+0x78a/0x7a0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008391]  [<ffffffff811a6659>] shrink_inactive_list+0x209/0x520
Dec  9 20:45:36 prodswarm1 kernel: [216481.008392]  [<ffffffff811a72e3>] shrink_lruvec+0x583/0x740
Dec  9 20:45:36 prodswarm1 kernel: [216481.008394]  [<ffffffff811a758f>] shrink_zone+0xef/0x2e0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008396]  [<ffffffff811a78db>] do_try_to_free_pages+0x15b/0x3b0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008398]  [<ffffffff811a7d6a>] try_to_free_mem_cgroup_pages+0xba/0x1a0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008400]  [<ffffffff811ffc64>] try_charge+0x194/0x660
Dec  9 20:45:36 prodswarm1 kernel: [216481.008402]  [<ffffffff81203836>] __memcg_kmem_charge_memcg+0x96/0xc0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008406]  [<ffffffff811ed611>] new_slab+0x461/0x490
Dec  9 20:45:36 prodswarm1 kernel: [216481.008408]  [<ffffffff811ee48b>] ___slab_alloc+0x22b/0x470
Dec  9 20:45:36 prodswarm1 kernel: [216481.008429]  [<ffffffffc045597b>] ? nfs_generic_pgio+0x6b/0x2a0 [nfs]
Dec  9 20:45:36 prodswarm1 kernel: [216481.008431]  [<ffffffff81203659>] ? __memcg_kmem_get_cache+0x39/0x130
Dec  9 20:45:36 prodswarm1 kernel: [216481.008437]  [<ffffffffc045597b>] ? nfs_generic_pgio+0x6b/0x2a0 [nfs]
Dec  9 20:45:36 prodswarm1 kernel: [216481.008439]  [<ffffffff811ee6f0>] __slab_alloc+0x20/0x40
Dec  9 20:45:36 prodswarm1 kernel: [216481.008441]  [<ffffffff811f0085>] __kmalloc+0x1d5/0x250
Dec  9 20:45:36 prodswarm1 kernel: [216481.008448]  [<ffffffffc045597b>] nfs_generic_pgio+0x6b/0x2a0 [nfs]
Dec  9 20:45:36 prodswarm1 kernel: [216481.008456]  [<ffffffffc0455c18>] nfs_generic_pg_pgios+0x68/0xd0 [nfs]
Dec  9 20:45:36 prodswarm1 kernel: [216481.008463]  [<ffffffffc0455857>] nfs_pageio_doio+0x27/0x60 [nfs]
Dec  9 20:45:36 prodswarm1 kernel: [216481.008470]  [<ffffffffc045631e>] __nfs_pageio_add_request+0xae/0x480 [nfs]
Dec  9 20:45:36 prodswarm1 kernel: [216481.008477]  [<ffffffffc0456da3>] nfs_pageio_add_request+0xa3/0x1b0 [nfs]
Dec  9 20:45:36 prodswarm1 kernel: [216481.008487]  [<ffffffffc045a76a>] nfs_do_writepage+0xca/0x1c0 [nfs]
Dec  9 20:45:36 prodswarm1 kernel: [216481.008496]  [<ffffffffc045a874>] nfs_writepages_callback+0x14/0x30 [nfs]
Dec  9 20:45:36 prodswarm1 kernel: [216481.008499]  [<ffffffff8119ba4e>] write_cache_pages+0x1ee/0x510
Dec  9 20:45:36 prodswarm1 kernel: [216481.008502]  [<ffffffff810c4175>] ? wake_up_bit+0x25/0x30
Dec  9 20:45:36 prodswarm1 kernel: [216481.008511]  [<ffffffffc045a860>] ? nfs_do_writepage+0x1c0/0x1c0 [nfs]
Dec  9 20:45:36 prodswarm1 kernel: [216481.008520]  [<ffffffffc045ac95>] nfs_writepages+0xa5/0x150 [nfs]
Dec  9 20:45:36 prodswarm1 kernel: [216481.008528]  [<ffffffffc044bdeb>] ? nfs_write_end+0x16b/0x4b0 [nfs]
Dec  9 20:45:36 prodswarm1 kernel: [216481.008530]  [<ffffffff8119e25e>] do_writepages+0x1e/0x30
Dec  9 20:45:36 prodswarm1 kernel: [216481.008532]  [<ffffffff81190f86>] __filemap_fdatawrite_range+0xc6/0x100
Dec  9 20:45:36 prodswarm1 kernel: [216481.008534]  [<ffffffff811910ca>] filemap_write_and_wait_range+0x2a/0x70
Dec  9 20:45:36 prodswarm1 kernel: [216481.008539]  [<ffffffffc044b1d3>] nfs_file_fsync+0x83/0x130 [nfs]
Dec  9 20:45:36 prodswarm1 kernel: [216481.008542]  [<ffffffff8124509b>] vfs_fsync_range+0x4b/0xb0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008544]  [<ffffffff8124511c>] vfs_fsync+0x1c/0x20
Dec  9 20:45:36 prodswarm1 kernel: [216481.008549]  [<ffffffffc044b6a6>] nfs_file_flush+0x46/0x60 [nfs]
Dec  9 20:45:36 prodswarm1 kernel: [216481.008552]  [<ffffffff8120d7df>] filp_close+0x2f/0x70
Dec  9 20:45:36 prodswarm1 kernel: [216481.008554]  [<ffffffff8122f193>] __close_fd+0xa3/0xd0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008556]  [<ffffffff8120d843>] SyS_close+0x23/0x50
Dec  9 20:45:36 prodswarm1 kernel: [216481.008558]  [<ffffffff818447f2>] entry_SYSCALL_64_fastpath+0x16/0x71

Respuesta1

A mí me parece que se trata de una especie de punto muerto durante el intercambio. Debe monitorear el uso de memoria de la aplicación y el uso de intercambio, y ver cómo se comporta cuando ocurre esta condición. De esta manera puede ver si el problema es que hay muy poca memoria asignada para el contenedor.

información relacionada