
Gestern reagierte ein Java-Dienst in unserem Docker Swarm-Cluster plötzlich nicht mehr. Die Aufgabe (Container) für diesen Dienst blieb einfach hängen, beantwortete keine Anfragen und konnte nicht beendet werden.
Wenn wir uns /var/log/syslog ansehen, sehen wir Folgendes zum Zeitpunkt, als der Dienst gestoppt wurde. Wir sind uns nur ein wenig unsicher, wie wir diese Protokollereignisse verstehen sollen.
Unseres Wissens bezieht sich dieses Problem auf ausgelagerte Daten, die wieder in den Speicher zurückgeholt werden müssen.
Die Anrufverfolgung wird genau 120 Sekunden nach dem letzten Protokolleintrag des Java-Dienstes in /var/log/syslog geschrieben.
Wir führen den Cluster auf Hyper-V aus und sehen, dass hv_netvsc
(Hyper-V Network Virtual Service Consumer) erwähnt wird. Könnte dies ein durch den Hypervisor verursachtes Problem sein?
Der Java-Prozess ist ziemlich ressourcenhungrig und benötigt normalerweise etwa 15 – 20 GB RAM.
Wie ist diese Anrufverfolgung zu verstehen? Könnte diese Anrufverfolgung die Ursache des Problems beschreiben oder ist sie eher eine Folge des Problems, das auch den Dienst beendet hat?
Dec 9 20:45:36 prodswarm1 kernel: [216481.008119] INFO: task java:8748 blocked for more than 120 seconds.
Dec 9 20:45:36 prodswarm1 kernel: [216481.008150] Not tainted 4.4.0-101-generic #124-Ubuntu
Dec 9 20:45:36 prodswarm1 kernel: [216481.008169] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 9 20:45:36 prodswarm1 kernel: [216481.008194] java D ffff881fdf63f6b8 0 8748 7863 0x00000100
Dec 9 20:45:36 prodswarm1 kernel: [216481.008198] ffff881fdf63f6b8 ffff881fdf63f690 ffff881fd93af000 ffff881fd93a8e00
Dec 9 20:45:36 prodswarm1 kernel: [216481.008200] ffff881fdf640000 ffff8810278d6e00 7fffffffffffffff ffffffff81840ed0
Dec 9 20:45:36 cprodswarm1phprodswarm1 kernel: [216481.008202] ffff881fdf63f810 ffff881fdf63f6d0 ffffffff818406d5 0000000000000000
Dec 9 20:45:36 prodswarm1 kernel: [216481.008204] Call Trace:
Dec 9 20:45:36 prodswarm1 kernel: [216481.008211] [<ffffffff81840ed0>] ? bit_wait+0x60/0x60
Dec 9 20:45:36 prodswarm1 kernel: [216481.008214] [<ffffffff818406d5>] schedule+0x35/0x80
Dec 9 20:45:36 prodswarm1 kernel: [216481.008216] [<ffffffff81843825>] schedule_timeout+0x1b5/0x270
Dec 9 20:45:36 prodswarm1 kernel: [216481.008222] [<ffffffff810b2d0c>] ? __enqueue_entity+0x6c/0x70
Dec 9 20:45:36 prodswarm1 kernel: [216481.008224] [<ffffffff810b9b87>] ? put_prev_entity+0x97/0x7d0
Dec 9 20:45:36 prodswarm1 kernel: [216481.008228] [<ffffffff810f635b>] ? ktime_get+0x3b/0xb0
Dec 9 20:45:36 prodswarm1 kernel: [216481.008229] [<ffffffff81840ed0>] ? bit_wait+0x60/0x60
Dec 9 20:45:36 prodswarm1 kernel: [216481.008231] [<ffffffff8183fc04>] io_schedule_timeout+0xa4/0x110
Dec 9 20:45:36 prodswarm1 kernel: [216481.008233] [<ffffffff81840eeb>] bit_wait_io+0x1b/0x70
Dec 9 20:45:36 prodswarm1 kernel: [216481.008234] [<ffffffff81840a7d>] __wait_on_bit+0x5d/0x90
Dec 9 20:45:36 prodswarm1 kernel: [216481.008239] [<ffffffff8118efdb>] wait_on_page_bit+0xcb/0xf0
Dec 9 20:45:36 prodswarm1 kernel: [216481.008242] [<ffffffff810c4460>] ? autoremove_wake_function+0x40/0x40
Dec 9 20:45:36 prodswarm1 kernel: [216481.008246] [<ffffffff811a5faa>] shrink_page_list+0x78a/0x7a0
Dec 9 20:45:36 prodswarm1 kernel: [216481.008248] [<ffffffff811a6659>] shrink_inactive_list+0x209/0x520
Dec 9 20:45:36 prodswarm1 kernel: [216481.008251] [<ffffffff811a72e3>] shrink_lruvec+0x583/0x740
Dec 9 20:45:36 prodswarm1 kernel: [216481.008254] [<ffffffff81098601>] ? insert_work+0x11/0xc0
Dec 9 20:45:36 prodswarm1 kernel: [216481.008256] [<ffffffff811a758f>] shrink_zone+0xef/0x2e0
Dec 9 20:45:36 prodswarm1 kernel: [216481.008258] [<ffffffff811a78db>] do_try_to_free_pages+0x15b/0x3b0
Dec 9 20:45:36 prodswarm1 kernel: [216481.008260] [<ffffffff811a7d6a>] try_to_free_mem_cgroup_pages+0xba/0x1a0
Dec 9 20:45:36 prodswarm1 kernel: [216481.008264] [<ffffffff811ffc64>] try_charge+0x194/0x660
Dec 9 20:45:36 prodswarm1 kernel: [216481.008267] [<ffffffff81203f7b>] mem_cgroup_try_charge+0x6b/0x1e0
Dec 9 20:45:36 prodswarm1 kernel: [216481.008269] [<ffffffff811fb71a>] do_huge_pmd_anonymous_page+0x16a/0x580
Dec 9 20:45:36 prodswarm1 kernel: [216481.008271] [<ffffffff811c30b4>] handle_mm_fault+0x15d4/0x1820
Dec 9 20:45:36 prodswarm1 kernel: [216481.008275] [<ffffffff81103ba7>] ? do_futex+0x307/0x540
Dec 9 20:45:36 prodswarm1 kernel: [216481.008279] [<ffffffff8106b577>] __do_page_fault+0x197/0x400
Dec 9 20:45:36 prodswarm1 kernel: [216481.008281] [<ffffffff8106b802>] do_page_fault+0x22/0x30
Dec 9 20:45:36 prodswarm1 kernel: [216481.008283] [<ffffffff81846978>] page_fault+0x28/0x30
Dec 9 20:45:36 prodswarm1 kernel: [216481.008293] INFO: task java:14024 blocked for more than 120 seconds.
Dec 9 20:45:36 prodswarm1 kernel: [216481.008313] Not tainted 4.4.0-101-generic #124-Ubuntu
Dec 9 20:45:36 prodswarm1 kernel: [216481.008331] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 9 20:45:36 prodswarm1 kernel: [216481.008356] java D ffff8800f2c5b158 0 14024 7863 0x00000100
Dec 9 20:45:36 prodswarm1 kernel: [216481.008358] ffff8800f2c5b158 0000000000000000 ffff8810208aaa00 ffff8800f2ac0000
Dec 9 20:45:36 prodswarm1 kernel: [216481.008360] ffff8800f2c5c000 ffff881027956e00 7fffffffffffffff ffffffff81840ed0
Dec 9 20:45:36 prodswarm1 kernel: [216481.008362] ffff8800f2c5b2b0 ffff8800f2c5b170 ffffffff818406d5 0000000000000000
Dec 9 20:45:36 prodswarm1 kernel: [216481.008363] Call Trace:
Dec 9 20:45:36 prodswarm1 kernel: [216481.008365] [<ffffffff81840ed0>] ? bit_wait+0x60/0x60
Dec 9 20:45:36 prodswarm1 kernel: [216481.008367] [<ffffffff818406d5>] schedule+0x35/0x80
Dec 9 20:45:36 prodswarm1 kernel: [216481.008369] [<ffffffff81843825>] schedule_timeout+0x1b5/0x270
Dec 9 20:45:36 prodswarm1 kernel: [216481.008374] [<ffffffffc01f095b>] ? netvsc_start_xmit+0x3ab/0x7b0 [hv_netvsc]
Dec 9 20:45:36 prodswarm1 kernel: [216481.008376] [<ffffffff810f635b>] ? ktime_get+0x3b/0xb0
Dec 9 20:45:36 prodswarm1 kernel: [216481.008377] [<ffffffff81840ed0>] ? bit_wait+0x60/0x60
Dec 9 20:45:36 prodswarm1 kernel: [216481.008379] [<ffffffff8183fc04>] io_schedule_timeout+0xa4/0x110
Dec 9 20:45:36 prodswarm1 kernel: [216481.008381] [<ffffffff81840eeb>] bit_wait_io+0x1b/0x70
Dec 9 20:45:36 prodswarm1 kernel: [216481.008382] [<ffffffff81840a7d>] __wait_on_bit+0x5d/0x90
Dec 9 20:45:36 prodswarm1 kernel: [216481.008385] [<ffffffff8118efdb>] wait_on_page_bit+0xcb/0xf0
Dec 9 20:45:36 prodswarm1 kernel: [216481.008387] [<ffffffff810c4460>] ? autoremove_wake_function+0x40/0x40
Dec 9 20:45:36 prodswarm1 kernel: [216481.008388] [<ffffffff811a5faa>] shrink_page_list+0x78a/0x7a0
Dec 9 20:45:36 prodswarm1 kernel: [216481.008391] [<ffffffff811a6659>] shrink_inactive_list+0x209/0x520
Dec 9 20:45:36 prodswarm1 kernel: [216481.008392] [<ffffffff811a72e3>] shrink_lruvec+0x583/0x740
Dec 9 20:45:36 prodswarm1 kernel: [216481.008394] [<ffffffff811a758f>] shrink_zone+0xef/0x2e0
Dec 9 20:45:36 prodswarm1 kernel: [216481.008396] [<ffffffff811a78db>] do_try_to_free_pages+0x15b/0x3b0
Dec 9 20:45:36 prodswarm1 kernel: [216481.008398] [<ffffffff811a7d6a>] try_to_free_mem_cgroup_pages+0xba/0x1a0
Dec 9 20:45:36 prodswarm1 kernel: [216481.008400] [<ffffffff811ffc64>] try_charge+0x194/0x660
Dec 9 20:45:36 prodswarm1 kernel: [216481.008402] [<ffffffff81203836>] __memcg_kmem_charge_memcg+0x96/0xc0
Dec 9 20:45:36 prodswarm1 kernel: [216481.008406] [<ffffffff811ed611>] new_slab+0x461/0x490
Dec 9 20:45:36 prodswarm1 kernel: [216481.008408] [<ffffffff811ee48b>] ___slab_alloc+0x22b/0x470
Dec 9 20:45:36 prodswarm1 kernel: [216481.008429] [<ffffffffc045597b>] ? nfs_generic_pgio+0x6b/0x2a0 [nfs]
Dec 9 20:45:36 prodswarm1 kernel: [216481.008431] [<ffffffff81203659>] ? __memcg_kmem_get_cache+0x39/0x130
Dec 9 20:45:36 prodswarm1 kernel: [216481.008437] [<ffffffffc045597b>] ? nfs_generic_pgio+0x6b/0x2a0 [nfs]
Dec 9 20:45:36 prodswarm1 kernel: [216481.008439] [<ffffffff811ee6f0>] __slab_alloc+0x20/0x40
Dec 9 20:45:36 prodswarm1 kernel: [216481.008441] [<ffffffff811f0085>] __kmalloc+0x1d5/0x250
Dec 9 20:45:36 prodswarm1 kernel: [216481.008448] [<ffffffffc045597b>] nfs_generic_pgio+0x6b/0x2a0 [nfs]
Dec 9 20:45:36 prodswarm1 kernel: [216481.008456] [<ffffffffc0455c18>] nfs_generic_pg_pgios+0x68/0xd0 [nfs]
Dec 9 20:45:36 prodswarm1 kernel: [216481.008463] [<ffffffffc0455857>] nfs_pageio_doio+0x27/0x60 [nfs]
Dec 9 20:45:36 prodswarm1 kernel: [216481.008470] [<ffffffffc045631e>] __nfs_pageio_add_request+0xae/0x480 [nfs]
Dec 9 20:45:36 prodswarm1 kernel: [216481.008477] [<ffffffffc0456da3>] nfs_pageio_add_request+0xa3/0x1b0 [nfs]
Dec 9 20:45:36 prodswarm1 kernel: [216481.008487] [<ffffffffc045a76a>] nfs_do_writepage+0xca/0x1c0 [nfs]
Dec 9 20:45:36 prodswarm1 kernel: [216481.008496] [<ffffffffc045a874>] nfs_writepages_callback+0x14/0x30 [nfs]
Dec 9 20:45:36 prodswarm1 kernel: [216481.008499] [<ffffffff8119ba4e>] write_cache_pages+0x1ee/0x510
Dec 9 20:45:36 prodswarm1 kernel: [216481.008502] [<ffffffff810c4175>] ? wake_up_bit+0x25/0x30
Dec 9 20:45:36 prodswarm1 kernel: [216481.008511] [<ffffffffc045a860>] ? nfs_do_writepage+0x1c0/0x1c0 [nfs]
Dec 9 20:45:36 prodswarm1 kernel: [216481.008520] [<ffffffffc045ac95>] nfs_writepages+0xa5/0x150 [nfs]
Dec 9 20:45:36 prodswarm1 kernel: [216481.008528] [<ffffffffc044bdeb>] ? nfs_write_end+0x16b/0x4b0 [nfs]
Dec 9 20:45:36 prodswarm1 kernel: [216481.008530] [<ffffffff8119e25e>] do_writepages+0x1e/0x30
Dec 9 20:45:36 prodswarm1 kernel: [216481.008532] [<ffffffff81190f86>] __filemap_fdatawrite_range+0xc6/0x100
Dec 9 20:45:36 prodswarm1 kernel: [216481.008534] [<ffffffff811910ca>] filemap_write_and_wait_range+0x2a/0x70
Dec 9 20:45:36 prodswarm1 kernel: [216481.008539] [<ffffffffc044b1d3>] nfs_file_fsync+0x83/0x130 [nfs]
Dec 9 20:45:36 prodswarm1 kernel: [216481.008542] [<ffffffff8124509b>] vfs_fsync_range+0x4b/0xb0
Dec 9 20:45:36 prodswarm1 kernel: [216481.008544] [<ffffffff8124511c>] vfs_fsync+0x1c/0x20
Dec 9 20:45:36 prodswarm1 kernel: [216481.008549] [<ffffffffc044b6a6>] nfs_file_flush+0x46/0x60 [nfs]
Dec 9 20:45:36 prodswarm1 kernel: [216481.008552] [<ffffffff8120d7df>] filp_close+0x2f/0x70
Dec 9 20:45:36 prodswarm1 kernel: [216481.008554] [<ffffffff8122f193>] __close_fd+0xa3/0xd0
Dec 9 20:45:36 prodswarm1 kernel: [216481.008556] [<ffffffff8120d843>] SyS_close+0x23/0x50
Dec 9 20:45:36 prodswarm1 kernel: [216481.008558] [<ffffffff818447f2>] entry_SYSCALL_64_fastpath+0x16/0x71
Antwort1
Für mich sieht es so aus, als ob es sich beim Swapping um eine Art Deadlock handelt. Sie müssen die Speichernutzung der Anwendung und die Swap-Nutzung überwachen und beobachten, wie sie sich verhält, wenn dieser Zustand eintritt. Auf diese Weise können Sie erkennen, ob das Problem darin besteht, dass dem Container zu wenig Speicher zugewiesen ist.