Utilizo Debian/testing Linux en mis múltiples máquinas. Por razones de seguridad, siempre lo instalo en un LVM cifrado. Normalmente uso los sistemas de archivos ext4 con un tamaño de 1TB a 3TB.
Desafortunadamente, crea ciertos efectos secundarios muy desagradables.
Cuando realizo operaciones intensivas en un sistema de archivos (por ejemplo, compilar los proyectos Buildroot de algunos estudiantes en paralelo, archivar 20 GB de datos en formato tar.xz o instalar un software como Xilinx Vivado, que escribe aproximadamente 130 GB en el sistema de archivos). ), el sistema se congela periódicamente durante aprox. 2 minutos. Cuando ejecuto dmesg, recibo los siguientes mensajes molestos:
[ 8648.672075] systemd[1]: systemd-journald.service: Processes still around after final SIGKILL. Entering failed mode.
[ 8648.672083] systemd[1]: systemd-journald.service: Failed with result 'timeout'.
[ 8648.672140] systemd[1]: systemd-journald.service: Unit process 7708 (systemd-journal) remains running after unit stopped.
[ 8648.672299] systemd[1]: Failed to start Journal Service.
[ 8648.672679] systemd[1]: systemd-journald.service: Scheduled restart job, restart counter is at 14.
[ 8648.672911] systemd[1]: Stopped Journal Service.
[ 8648.672980] systemd[1]: systemd-journald.service: Found left-over process 7708 (systemd-journal) in control group while starting unit. Ignoring.
[ 8648.672983] systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
[ 8648.673699] systemd[1]: Starting Journal Service...
[ 8738.922289] systemd[1]: systemd-journald.service: start operation timed out. Terminating.
[ 8828.923063] systemd[1]: systemd-journald.service: State 'stop-sigterm' timed out. Killing.
[ 8828.923105] systemd[1]: systemd-journald.service: Killing process 7854 (systemd-journal) with signal SIGKILL.
[ 8828.923141] systemd[1]: systemd-journald.service: Killing process 7708 (systemd-journal) with signal SIGKILL.
[ 8919.173428] systemd[1]: systemd-journald.service: Processes still around after SIGKILL. Ignoring.
[ 9009.423787] systemd[1]: systemd-journald.service: State 'final-sigterm' timed out. Killing.
[ 9009.423831] systemd[1]: systemd-journald.service: Killing process 7854 (systemd-journal) with signal SIGKILL.
[ 9099.674142] systemd[1]: systemd-journald.service: Processes still around after final SIGKILL. Entering failed mode.
[ 9099.674173] systemd[1]: systemd-journald.service: Failed with result 'timeout'.
[ 9099.674241] systemd[1]: systemd-journald.service: Unit process 7854 (systemd-journal) remains running after unit stopped.
[ 9099.674477] systemd[1]: Failed to start Journal Service.
[ 9099.674924] systemd[1]: systemd-journald.service: Scheduled restart job, restart counter is at 15.
[ 9099.675102] systemd[1]: Stopped Journal Service.
[ 9099.675185] systemd[1]: systemd-journald.service: Found left-over process 7854 (systemd-journal) in control group while starting unit. Ignoring.
[ 9099.675209] systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
[ 9099.675958] systemd[1]: Starting Journal Service...
Lo interesante es que el problema ocurre con mayor frecuencia en máquinas con mayor RAM (yo uso máquinas con las siguientes capacidades de RAM: 16 GB (CPU Intel i7), 32 GB (CPU Intel i7) o 64 GB (CPU Intel Xeon E-2176M)) .
Encontré viejos informes de problemas similares como:
- https://unix.stackexchange.com/questions/265756/desktop-completely-freezing-on-io-operaciones
- Las operaciones de disco congelan Debian
- https://www.linuxquestions.org/questions/linux-software-2/debian-buster-freezes-on-high-disk-io-task-blocked-for-more-than-120-segundos-4175676665/
Con base en lo anterior, he verificado que uso el programador de fechas límite:
# cat /sys/block/sda/queue/scheduler
[mq-deadline] none
También uso un kernel bastante nuevo:
# uname -a
Linux WZabHP 5.10.0-8-amd64 #1 SMP Debian 5.10.46-2 (2021-07-20) x86_64 GNU/Linux
Incluso intenté cambiar al kernel RT en una de esas máquinas (32 GB, CPU i7), pero no sirvió de nada. Peor aún: las congelaciones resultan en corrupción del sistema de archivos (el efecto se repitió dos veces, después de eso dejé de intentar usar el kernel RT).
¿Cuál puede ser la causa del problema descrito? ¿Qué configuraciones debo verificar/ajustar para solucionarlo?
Actualizar
Encontré otro conjunto de publicaciones que sugieren que los parámetros de la memoria virtual pueden estar asociados con el problema descrito:
- https://www.reddit.com/r/Fedora/comments/ay7dkh/linux_large_transfers_freeze_system_high_io/
- https://unix.stackexchange.com/questions/233421/prevent-large-file-write-from-freezing-the-system
Sin embargo, parece que ajustar el vm.dirty_background_ratio
y
vm.dirty_ratio
requiere un análisis de la velocidad del disco.
Actualización 2
Antes de que ocurra el problema de muerte journald
, aparece lo siguiente en el registro del kernel (estos son los primeros errores relacionados con el tiempo de espera):
Jul 26 20:44:38 WZabHP kernel: [ 484.449978] INFO: task StreamTrans #6:2335 blocked for more than 120 seconds.
Jul 26 20:44:38 WZabHP kernel: [ 484.449981] Not tainted 5.10.0-8-amd64 #1 Debian 5.10.46-2
Jul 26 20:44:38 WZabHP kernel: [ 484.449981] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 26 20:44:38 WZabHP kernel: [ 484.449982] task:StreamTrans #6 state:D stack: 0 pid: 2335 ppid: 1945 flags:0x00004000
Jul 26 20:44:38 WZabHP kernel: [ 484.449985] Call Trace:
Jul 26 20:44:38 WZabHP kernel: [ 484.449989] __schedule+0x282/0x870
Jul 26 20:44:38 WZabHP kernel: [ 484.449991] schedule+0x46/0xb0
Jul 26 20:44:38 WZabHP kernel: [ 484.449997] jbd2_log_wait_commit+0xac/0x120 [jbd2]
Jul 26 20:44:38 WZabHP kernel: [ 484.450000] ? add_wait_queue_exclusive+0x70/0x70
Jul 26 20:44:38 WZabHP kernel: [ 484.450013] ext4_sync_file+0xd4/0x350 [ext4]
Jul 26 20:44:38 WZabHP kernel: [ 484.450016] __x64_sys_fsync+0x34/0x60
Jul 26 20:44:38 WZabHP kernel: [ 484.450017] do_syscall_64+0x33/0x80
Jul 26 20:44:38 WZabHP kernel: [ 484.450019] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 26 20:44:38 WZabHP kernel: [ 484.450021] RIP: 0033:0x7fd2cf3385eb
Jul 26 20:44:38 WZabHP kernel: [ 484.450022] RSP: 002b:00007fd2b8d528c0 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
Jul 26 20:44:38 WZabHP kernel: [ 484.450023] RAX: ffffffffffffffda RBX: 00007fd2af8539d0 RCX: 00007fd2cf3385eb
Jul 26 20:44:38 WZabHP kernel: [ 484.450024] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 000000000000003f
Jul 26 20:44:38 WZabHP kernel: [ 484.450025] RBP: 00007fd2af506ba0 R08: 0000000000000000 R09: 00007fd2cbbe4f28
Jul 26 20:44:38 WZabHP kernel: [ 484.450025] R10: 000200050000002a R11: 0000000000000293 R12: 0000000000000000
Jul 26 20:44:38 WZabHP kernel: [ 484.450026] R13: 00000000000000c4 R14: 00007fd2caa6a6f1 R15: 00000000000000c4
Jul 26 20:44:38 WZabHP kernel: [ 484.450030] INFO: task mozStorage #2:2430 blocked for more than 120 seconds.
Jul 26 20:44:38 WZabHP kernel: [ 484.450031] Not tainted 5.10.0-8-amd64 #1 Debian 5.10.46-2
Jul 26 20:44:38 WZabHP kernel: [ 484.450031] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 26 20:44:38 WZabHP kernel: [ 484.450032] task:mozStorage #2 state:D stack: 0 pid: 2430 ppid: 1945 flags:0x00004000
Jul 26 20:44:38 WZabHP kernel: [ 484.450033] Call Trace:
Jul 26 20:44:38 WZabHP kernel: [ 484.450034] __schedule+0x282/0x870
Jul 26 20:44:38 WZabHP kernel: [ 484.450036] schedule+0x46/0xb0
Jul 26 20:44:38 WZabHP kernel: [ 484.450039] jbd2_log_wait_commit+0xac/0x120 [jbd2]
Jul 26 20:44:38 WZabHP kernel: [ 484.450040] ? add_wait_queue_exclusive+0x70/0x70
Jul 26 20:44:38 WZabHP kernel: [ 484.450047] ext4_sync_file+0xd4/0x350 [ext4]
Jul 26 20:44:38 WZabHP kernel: [ 484.450049] __x64_sys_fsync+0x34/0x60
Jul 26 20:44:38 WZabHP kernel: [ 484.450050] do_syscall_64+0x33/0x80
Jul 26 20:44:38 WZabHP kernel: [ 484.450051] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 26 20:44:38 WZabHP kernel: [ 484.450052] RIP: 0033:0x7fd2cf3385eb
Jul 26 20:44:38 WZabHP kernel: [ 484.450053] RSP: 002b:00007fd2b334c300 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
Jul 26 20:44:38 WZabHP kernel: [ 484.450054] RAX: ffffffffffffffda RBX: 00007fd2b34a6858 RCX: 00007fd2cf3385eb
Jul 26 20:44:38 WZabHP kernel: [ 484.450055] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000073
Jul 26 20:44:38 WZabHP kernel: [ 484.450055] RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000000
Jul 26 20:44:38 WZabHP kernel: [ 484.450056] R10: 00007ffd1278a080 R11: 0000000000000293 R12: 00000000000001f7
Jul 26 20:44:38 WZabHP kernel: [ 484.450057] R13: 00007fd2aeffbd40 R14: 00007fd2b34a67a0 R15: 000000000000004e
Jul 26 20:48:40 WZabHP kernel: [ 726.109957] INFO: task jbd2/dm-2-8:591 blocked for more than 120 seconds.
Jul 26 20:48:40 WZabHP kernel: [ 726.109959] Not tainted 5.10.0-8-amd64 #1 Debian 5.10.46-2
Jul 26 20:48:40 WZabHP kernel: [ 726.109960] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 26 20:48:40 WZabHP kernel: [ 726.109961] task:jbd2/dm-2-8 state:D stack: 0 pid: 591 ppid: 2 flags:0x00004000
Jul 26 20:48:40 WZabHP kernel: [ 726.109963] Call Trace:
Jul 26 20:48:40 WZabHP kernel: [ 726.109967] __schedule+0x282/0x870
Jul 26 20:48:40 WZabHP kernel: [ 726.109969] ? out_of_line_wait_on_bit_lock+0xb0/0xb0
Jul 26 20:48:40 WZabHP kernel: [ 726.109970] schedule+0x46/0xb0
Jul 26 20:48:40 WZabHP kernel: [ 726.109971] io_schedule+0x42/0x70
Jul 26 20:48:40 WZabHP kernel: [ 726.109973] bit_wait_io+0xd/0x50
Jul 26 20:48:40 WZabHP kernel: [ 726.109974] __wait_on_bit+0x2a/0x90
Jul 26 20:48:40 WZabHP kernel: [ 726.109975] out_of_line_wait_on_bit+0x92/0xb0
Jul 26 20:48:40 WZabHP kernel: [ 726.109977] ? var_wake_function+0x20/0x20
Jul 26 20:48:40 WZabHP kernel: [ 726.109982] jbd2_journal_commit_transaction+0x16b3/0x1ad0 [jbd2]
Jul 26 20:48:40 WZabHP kernel: [ 726.109987] kjournald2+0xab/0x270 [jbd2]
Jul 26 20:48:40 WZabHP kernel: [ 726.109988] ? add_wait_queue_exclusive+0x70/0x70
Jul 26 20:48:40 WZabHP kernel: [ 726.109991] ? load_superblock.part.0+0xb0/0xb0 [jbd2]
Jul 26 20:48:40 WZabHP kernel: [ 726.109993] kthread+0x11b/0x140
Jul 26 20:48:40 WZabHP kernel: [ 726.109994] ? __kthread_bind_mask+0x60/0x60
Jul 26 20:48:40 WZabHP kernel: [ 726.109996] ret_from_fork+0x22/0x30
Jul 26 20:48:40 WZabHP kernel: [ 726.109999] INFO: task journal-offline:2579 blocked for more than 120 seconds.
Jul 26 20:48:40 WZabHP kernel: [ 726.109999] Not tainted 5.10.0-8-amd64 #1 Debian 5.10.46-2
Jul 26 20:48:40 WZabHP kernel: [ 726.110000] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 26 20:48:40 WZabHP kernel: [ 726.110001] task:journal-offline state:D stack: 0 pid: 2579 ppid: 1 flags:0x00004324
Jul 26 20:48:40 WZabHP kernel: [ 726.110002] Call Trace:
Jul 26 20:48:40 WZabHP kernel: [ 726.110003] __schedule+0x282/0x870
Jul 26 20:48:40 WZabHP kernel: [ 726.110004] schedule+0x46/0xb0
Jul 26 20:48:40 WZabHP kernel: [ 726.110007] jbd2_log_wait_commit+0xac/0x120 [jbd2]
Jul 26 20:48:40 WZabHP kernel: [ 726.110009] ? add_wait_queue_exclusive+0x70/0x70
Jul 26 20:48:40 WZabHP kernel: [ 726.110021] ext4_sync_file+0xd4/0x350 [ext4]
Jul 26 20:48:40 WZabHP kernel: [ 726.110024] __x64_sys_fsync+0x34/0x60
Jul 26 20:48:40 WZabHP kernel: [ 726.110025] do_syscall_64+0x33/0x80
Jul 26 20:48:40 WZabHP kernel: [ 726.110027] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 26 20:48:40 WZabHP kernel: [ 726.110029] RIP: 0033:0x7f68c72ebabb
Jul 26 20:48:40 WZabHP kernel: [ 726.110030] RSP: 002b:00007f68be229cf0 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
Jul 26 20:48:40 WZabHP kernel: [ 726.110031] RAX: ffffffffffffffda RBX: 000056427bbd8870 RCX: 00007f68c72ebabb
Jul 26 20:48:40 WZabHP kernel: [ 726.110032] RDX: 00007f68c5a2b000 RSI: 00007f68c7649414 RDI: 000000000000001a
Jul 26 20:48:40 WZabHP kernel: [ 726.110032] RBP: 00007f68c764bd30 R08: 0000000000000000 R09: 00007f68be22a700
Jul 26 20:48:40 WZabHP kernel: [ 726.110033] R10: 0000000000000014 R11: 0000000000000293 R12: 0000000000000002
Jul 26 20:48:40 WZabHP kernel: [ 726.110034] R13: 00007ffe691a51df R14: 00007f68be229e00 R15: 000056427bbf9920
Jul 26 20:48:40 WZabHP kernel: [ 726.110035] INFO: task journal-offline:2580 blocked for more than 120 seconds.
Jul 26 20:48:40 WZabHP kernel: [ 726.110036] Not tainted 5.10.0-8-amd64 #1 Debian 5.10.46-2
Jul 26 20:48:40 WZabHP kernel: [ 726.110036] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 26 20:48:40 WZabHP kernel: [ 726.110037] task:journal-offline state:D stack: 0 pid: 2580 ppid: 1 flags:0x00004324
Jul 26 20:48:40 WZabHP kernel: [ 726.110038] Call Trace:
Jul 26 20:48:40 WZabHP kernel: [ 726.110039] __schedule+0x282/0x870
Jul 26 20:48:40 WZabHP kernel: [ 726.110040] schedule+0x46/0xb0
Jul 26 20:48:40 WZabHP kernel: [ 726.110043] jbd2_log_wait_commit+0xac/0x120 [jbd2]
Jul 26 20:48:40 WZabHP kernel: [ 726.110045] ? add_wait_queue_exclusive+0x70/0x70
Jul 26 20:48:40 WZabHP kernel: [ 726.110051] ext4_sync_file+0xd4/0x350 [ext4]
Jul 26 20:48:40 WZabHP kernel: [ 726.110053] __x64_sys_fsync+0x34/0x60
Jul 26 20:48:40 WZabHP kernel: [ 726.110054] do_syscall_64+0x33/0x80
Jul 26 20:48:40 WZabHP kernel: [ 726.110056] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 26 20:48:40 WZabHP kernel: [ 726.110056] RIP: 0033:0x7f68c72ebabb
Jul 26 20:48:40 WZabHP kernel: [ 726.110057] RSP: 002b:00007f68bda28cf0 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
Jul 26 20:48:40 WZabHP kernel: [ 726.110058] RAX: ffffffffffffffda RBX: 000056427bc095d0 RCX: 00007f68c72ebabb
Jul 26 20:48:40 WZabHP kernel: [ 726.110059] RDX: 00007f68c622b000 RSI: 00007f68c7649414 RDI: 0000000000000026
Jul 26 20:48:40 WZabHP kernel: [ 726.110059] RBP: 00007f68c764bd30 R08: 0000000000000000 R09: 00007f68bda29700
Jul 26 20:48:40 WZabHP kernel: [ 726.110060] R10: 0000000000000014 R11: 0000000000000293 R12: 0000000000000002
Jul 26 20:48:40 WZabHP kernel: [ 726.110061] R13: 00007ffe691a51df R14: 00007f68bda28e00 R15: 0000000000802000
Jul 26 20:48:40 WZabHP kernel: [ 726.110064] INFO: task NetworkManager:1090 blocked for more than 120 seconds.
Jul 26 20:48:40 WZabHP kernel: [ 726.110065] Not tainted 5.10.0-8-amd64 #1 Debian 5.10.46-2
Jul 26 20:48:40 WZabHP kernel: [ 726.110065] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 26 20:48:40 WZabHP kernel: [ 726.110066] task:NetworkManager state:D stack: 0 pid: 1090 ppid: 1 flags:0x00004000
Jul 26 20:48:40 WZabHP kernel: [ 726.110067] Call Trace:
Jul 26 20:48:40 WZabHP kernel: [ 726.110068] __schedule+0x282/0x870
Jul 26 20:48:40 WZabHP kernel: [ 726.110069] schedule+0x46/0xb0
Jul 26 20:48:40 WZabHP kernel: [ 726.110072] jbd2_log_wait_commit+0xac/0x120 [jbd2]
Jul 26 20:48:40 WZabHP kernel: [ 726.110073] ? add_wait_queue_exclusive+0x70/0x70
Jul 26 20:48:40 WZabHP kernel: [ 726.110080] ext4_sync_file+0xd4/0x350 [ext4]
Jul 26 20:48:40 WZabHP kernel: [ 726.110081] __x64_sys_fsync+0x34/0x60
Jul 26 20:48:40 WZabHP kernel: [ 726.110083] do_syscall_64+0x33/0x80
Jul 26 20:48:40 WZabHP kernel: [ 726.110084] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 26 20:48:40 WZabHP kernel: [ 726.110085] RIP: 0033:0x7fd15da1c5eb
Jul 26 20:48:40 WZabHP kernel: [ 726.110085] RSP: 002b:00007ffedee571c0 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
Jul 26 20:48:40 WZabHP kernel: [ 726.110086] RAX: ffffffffffffffda RBX: 00007fd15cd06928 RCX: 00007fd15da1c5eb
Jul 26 20:48:40 WZabHP kernel: [ 726.110087] RDX: 0000000000000184 RSI: 000055c28ee17560 RDI: 000000000000001a
Jul 26 20:48:40 WZabHP kernel: [ 726.110088] RBP: 000000000000001a R08: 0000000000000000 R09: 00007ffedee572d0
Jul 26 20:48:40 WZabHP kernel: [ 726.110088] R10: 0000000000000184 R11: 0000000000000293 R12: 00007ffedee572d0
Jul 26 20:48:40 WZabHP kernel: [ 726.110089] R13: 000055c28ed78fd0 R14: 000055c28ee176e4 R15: 0000000000000000
Jul 26 20:48:40 WZabHP kernel: [ 726.110104] INFO: task Permission:2299 blocked for more than 120 seconds.
Jul 26 20:48:40 WZabHP kernel: [ 726.110105] Not tainted 5.10.0-8-amd64 #1 Debian 5.10.46-2
Jul 26 20:48:40 WZabHP kernel: [ 726.110105] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 26 20:48:40 WZabHP kernel: [ 726.110106] task:Permission state:D stack: 0 pid: 2299 ppid: 1945 flags:0x00004000
Jul 26 20:48:40 WZabHP kernel: [ 726.110107] Call Trace:
Jul 26 20:48:40 WZabHP kernel: [ 726.110108] __schedule+0x282/0x870
Jul 26 20:48:40 WZabHP kernel: [ 726.110109] schedule+0x46/0xb0
Jul 26 20:48:40 WZabHP kernel: [ 726.110112] jbd2_log_wait_commit+0xac/0x120 [jbd2]
Jul 26 20:48:40 WZabHP kernel: [ 726.110113] ? add_wait_queue_exclusive+0x70/0x70
Jul 26 20:48:40 WZabHP kernel: [ 726.110120] ext4_sync_file+0xd4/0x350 [ext4]
Jul 26 20:48:40 WZabHP kernel: [ 726.110122] __x64_sys_fsync+0x34/0x60
Jul 26 20:48:40 WZabHP kernel: [ 726.110123] do_syscall_64+0x33/0x80
Jul 26 20:48:40 WZabHP kernel: [ 726.110124] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 26 20:48:40 WZabHP kernel: [ 726.110125] RIP: 0033:0x7fd2cf3385eb
Jul 26 20:48:40 WZabHP kernel: [ 726.110125] RSP: 002b:00007fd2beffe3e0 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
Jul 26 20:48:40 WZabHP kernel: [ 726.110126] RAX: ffffffffffffffda RBX: 00007fd2bf0582e0 RCX: 00007fd2cf3385eb
Jul 26 20:48:40 WZabHP kernel: [ 726.110127] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000089
Jul 26 20:48:40 WZabHP kernel: [ 726.110128] RBP: 0000000000000002 R08: 0000000000000000 R09: 00007fd2cec47af0
Jul 26 20:48:40 WZabHP kernel: [ 726.110128] R10: 00007ffd1278a080 R11: 0000000000000293 R12: 00000000000001f5
Jul 26 20:48:40 WZabHP kernel: [ 726.110129] R13: 0000000000000000 R14: 0000000000010400 R15: 000000000000003f
Jul 26 20:48:40 WZabHP kernel: [ 726.110131] INFO: task StreamTrans #3:2319 blocked for more than 120 seconds.
Jul 26 20:48:40 WZabHP kernel: [ 726.110132] Not tainted 5.10.0-8-amd64 #1 Debian 5.10.46-2
Jul 26 20:48:40 WZabHP kernel: [ 726.110132] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 26 20:48:40 WZabHP kernel: [ 726.110133] task:StreamTrans #3 state:D stack: 0 pid: 2319 ppid: 1945 flags:0x00004000
Jul 26 20:48:40 WZabHP kernel: [ 726.110134] Call Trace:
Jul 26 20:48:40 WZabHP kernel: [ 726.110135] __schedule+0x282/0x870
Jul 26 20:48:40 WZabHP kernel: [ 726.110136] schedule+0x46/0xb0
Jul 26 20:48:40 WZabHP kernel: [ 726.110139] jbd2_log_wait_commit+0xac/0x120 [jbd2]
Jul 26 20:48:40 WZabHP kernel: [ 726.110140] ? add_wait_queue_exclusive+0x70/0x70
Jul 26 20:48:40 WZabHP kernel: [ 726.110146] ext4_sync_file+0xd4/0x350 [ext4]
Jul 26 20:48:40 WZabHP kernel: [ 726.110148] __x64_sys_fsync+0x34/0x60
Jul 26 20:48:40 WZabHP kernel: [ 726.110149] do_syscall_64+0x33/0x80
Jul 26 20:48:40 WZabHP kernel: [ 726.110150] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 26 20:48:40 WZabHP kernel: [ 726.110151] RIP: 0033:0x7fd2cf3385eb
Jul 26 20:48:40 WZabHP kernel: [ 726.110152] RSP: 002b:00007fd2bba8e8c0 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
Jul 26 20:48:40 WZabHP kernel: [ 726.110153] RAX: ffffffffffffffda RBX: 00007fd2af8539d0 RCX: 00007fd2cf3385eb
Jul 26 20:48:40 WZabHP kernel: [ 726.110153] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 000000000000003f
Jul 26 20:48:40 WZabHP kernel: [ 726.110154] RBP: 00007fd2aecd16d0 R08: 0000000000000000 R09: 00007fd2cbbe4f28
Jul 26 20:48:40 WZabHP kernel: [ 726.110155] R10: 0002000500000049 R11: 0000000000000293 R12: 0000000000000000
Jul 26 20:48:40 WZabHP kernel: [ 726.110155] R13: 00000000000000c4 R14: 00007fd2caa6a6f1 R15: 00000000000000c4
Jul 26 20:48:40 WZabHP kernel: [ 726.110157] INFO: task QuotaManager IO:2333 blocked for more than 120 seconds.
Jul 26 20:48:40 WZabHP kernel: [ 726.110158] Not tainted 5.10.0-8-amd64 #1 Debian 5.10.46-2
Jul 26 20:48:40 WZabHP kernel: [ 726.110158] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 26 20:48:40 WZabHP kernel: [ 726.110159] task:QuotaManager IO state:D stack: 0 pid: 2333 ppid: 1945 flags:0x00004000
Jul 26 20:48:40 WZabHP kernel: [ 726.110160] Call Trace:
Jul 26 20:48:40 WZabHP kernel: [ 726.110161] __schedule+0x282/0x870
Jul 26 20:48:40 WZabHP kernel: [ 726.110162] schedule+0x46/0xb0
Jul 26 20:48:40 WZabHP kernel: [ 726.110165] jbd2_log_wait_commit+0xac/0x120 [jbd2]
Jul 26 20:48:40 WZabHP kernel: [ 726.110166] ? add_wait_queue_exclusive+0x70/0x70
Jul 26 20:48:40 WZabHP kernel: [ 726.110172] ext4_sync_file+0xd4/0x350 [ext4]
Jul 26 20:48:40 WZabHP kernel: [ 726.110173] __x64_sys_fsync+0x34/0x60
Jul 26 20:48:40 WZabHP kernel: [ 726.110175] do_syscall_64+0x33/0x80
Jul 26 20:48:40 WZabHP kernel: [ 726.110176] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 26 20:48:40 WZabHP kernel: [ 726.110177] RIP: 0033:0x7fd2cf3385eb
Jul 26 20:48:40 WZabHP kernel: [ 726.110177] RSP: 002b:00007fd2b8d93020 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
Jul 26 20:48:40 WZabHP kernel: [ 726.110178] RAX: ffffffffffffffda RBX: 00007fd2ba6f52e0 RCX: 00007fd2cf3385eb
Jul 26 20:48:40 WZabHP kernel: [ 726.110179] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000047
Jul 26 20:48:40 WZabHP kernel: [ 726.110179] RBP: 0000000000000002 R08: 0000000000000000 R09: 00007fd2b8c1b910
Jul 26 20:48:40 WZabHP kernel: [ 726.110180] R10: 00007ffd1278a080 R11: 0000000000000293 R12: 00000000000001f5
Jul 26 20:48:40 WZabHP kernel: [ 726.110180] R13: 0000000000000000 R14: 0000000000000a00 R15: 000000000000003b
Jul 26 20:48:40 WZabHP kernel: [ 726.110184] INFO: task mozStorage #2:2430 blocked for more than 120 seconds.
Jul 26 20:48:40 WZabHP kernel: [ 726.110185] Not tainted 5.10.0-8-amd64 #1 Debian 5.10.46-2
Jul 26 20:48:40 WZabHP kernel: [ 726.110185] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 26 20:48:40 WZabHP kernel: [ 726.110186] task:mozStorage #2 state:D stack: 0 pid: 2430 ppid: 1945 flags:0x00000000
Jul 26 20:48:40 WZabHP kernel: [ 726.110187] Call Trace:
Jul 26 20:48:40 WZabHP kernel: [ 726.110188] __schedule+0x282/0x870
Jul 26 20:48:40 WZabHP kernel: [ 726.110189] schedule+0x46/0xb0
Jul 26 20:48:40 WZabHP kernel: [ 726.110192] jbd2_log_wait_commit+0xac/0x120 [jbd2]
Jul 26 20:48:40 WZabHP kernel: [ 726.110193] ? add_wait_queue_exclusive+0x70/0x70
Jul 26 20:48:40 WZabHP kernel: [ 726.110196] __jbd2_journal_force_commit+0x5d/0xb0 [jbd2]
Jul 26 20:48:40 WZabHP kernel: [ 726.110199] jbd2_journal_force_commit+0x1d/0x30 [jbd2]
Jul 26 20:48:40 WZabHP kernel: [ 726.110205] ext4_sync_file+0x2c4/0x350 [ext4]
Jul 26 20:48:40 WZabHP kernel: [ 726.110206] __x64_sys_fsync+0x34/0x60
Jul 26 20:48:40 WZabHP kernel: [ 726.110207] do_syscall_64+0x33/0x80
Jul 26 20:48:40 WZabHP kernel: [ 726.110209] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 26 20:48:40 WZabHP kernel: [ 726.110209] RIP: 0033:0x7fd2cf3385eb
Jul 26 20:48:40 WZabHP kernel: [ 726.110210] RSP: 002b:00007fd2b334c300 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
Jul 26 20:48:40 WZabHP kernel: [ 726.110211] RAX: ffffffffffffffda RBX: 00007fd2b34a6858 RCX: 00007fd2cf3385eb
Jul 26 20:48:40 WZabHP kernel: [ 726.110212] RDX: 00000000000a0000 RSI: 00007fd2b334c0f0 RDI: 0000000000000072
Jul 26 20:48:40 WZabHP kernel: [ 726.110212] RBP: 0000000000000002 R08: 0000000000000000 R09: 000000000000003e
Jul 26 20:48:40 WZabHP kernel: [ 726.110213] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000
Jul 26 20:48:40 WZabHP kernel: [ 726.110213] R13: 00007fd2aeffbd40 R14: 00007fd2b34a67a0 R15: 000000000000004e
Entonces el problema está relacionado con la baja sincronización del sistema de archivos. El rendimiento del disco es suficiente. Sin embargo, el problema aparece en instalaciones con volúmenes cifrados. El rendimiento de la CPU (especialmente en el caso de la máquina Xeon) es más que suficiente para realizar el cifrado con la suficiente rapidez.
Otro recurso escaso necesario para el cifrado son los datos aleatorios verdaderos. De hecho, el sistema al que puedo acceder ahora parecía no tener ni rng-tools5 ni rng-tools-debian instalados. Esa es la fuente probable del problema.
Respuesta1
He encontrado el problema similar descrito enhttps://askubuntu.com/questions/1406444/what-is-causing-my-system-to-stall-freeze-corrrupt-data-when-using-lvm-luks.
Una de las modificaciones descritas aquí eliminó la congelación del sistema en escrituras masivas.
cryptsetup --perf-no_write_workqueue refresh name_of_the_mapping
(en mi caso fue: cryptsetup --perf-no_write_workqueue refresh sda6_crypt
)
La opción se puede configurar de forma permanente haciendo:
cryptsetup --perf-no_write_workqueue --persistent refresh name_of_the_mapping
He reportado el problema como un error.https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1016474, pero es cuestionable si se trata de un error en el paquete cryptsetup.
Entonces la solución completa parece ser:
- Instalación del
haveged
paquete - Usando la
--perf-no_write_workqueue
opción en el mapeo del LVM cifrado.
Respuesta2
Parece que el problema puede estar relacionado (como se encuentra en los comentarios) con la falta de entropía en el sistema. Eso explica por qué el problema ocurre en sistemas con LVM cifrado. Los verdaderos bytes aleatorios son necesarios para cifrar los datos escritos.
He instalado rng-tools5
, y haveged
(como sugiere @michalng, ¡gracias!), y parece que el problema se reduce. Sin embargo, se requieren pruebas más exhaustivas para confirmar que esta es realmente la solución correcta a ese problema.