我的 Python 應用程式在多個執行緒中使用 100% 的「系統」CPU 時間時遇到了困難,持續時間長達 10 秒以上。我(有限)的理解是「系統」CPU 時間是在核心中花費的時間。strace
顯示正在進行哪些核心呼叫以及(可選)它們需要多長時間。我已經設法在其中一個暫停期間在一個執行緒上運行 strace,但 strace 輸出在該期間只有一個間隙,並且該間隙兩側的呼叫顯示的時間非常短。
如何正確運行 strace 來找出正在發生的核心呼叫?
在我的 Python 應用程式中,我使用單獨的線程來psutil
檢查系統 CPU 較高的任何線程,然後將它們登出。一旦我看到一個,我就會使用以下命令列來啟動 strace:
strace -tT -o ~/strace.log -p <PID of thread>
從我的應用程式記錄(psutil
在背景使用):
2015-05-19 08:05:47,025 WARNING:log.cpu Logging paused for 0.34s to 08:05:46.985768 - longer than expected - other threads probably delayed as well
2015-05-19 08:05:47,026 WARNING:log.cpu This process utilization (CPUs) user: 0.1, system: 52.4 | OS (%) scputimes(user=0.4, nice=0.0, system=80.2, idle=19.3, iowait=0.0, irq=0.0, softirq=0.0, steal=0.0, guest=0.0, guest_nice=0.0), memory used: 16.6, threads: 101, fds: 70
2015-05-19 08:05:47,027 WARNING:log.cpu Thread 47441 utilization (CPUs) user: 0.0, system: 1.0
2015-05-19 08:05:47,512 WARNING:log.cpu This process utilization (CPUs) user: 0.1, system: 61.4 | OS (%) scputimes(user=0.5, nice=0.0, system=99.4, idle=0.1, iowait=0.0, irq=0.0, softirq=0.0, steal=0.0, guest=0.0, guest_nice=0.0), memory used: 16.6, threads: 101, fds: 70
2015-05-19 08:05:47,516 WARNING:log.cpu Thread 47441 utilization (CPUs) user: 0.0, system: 0.9
2015-05-19 08:05:47,995 WARNING:log.cpu This process utilization (CPUs) user: 0.3, system: 63.7 | OS (%) scputimes(user=0.8, nice=0.0, system=99.2, idle=0.0, iowait=0.0, irq=0.0, softirq=0.0, steal=0.0, guest=0.0, guest_nice=0.0), memory used: 16.6, threads: 101, fds: 70
2015-05-19 08:05:47,995 WARNING:log.cpu Thread 47441 utilization (CPUs) user: 0.0, system: 0.8
2015-05-19 08:05:48,495 WARNING:log.cpu This process utilization (CPUs) user: 0.3, system: 63.8 | OS (%) scputimes(user=0.7, nice=0.0, system=99.3, idle=0.0, iowait=0.0, irq=0.0, softirq=0.0, steal=0.0, guest=0.0, guest_nice=0.0), memory used: 16.6, threads: 101, fds: 70
2015-05-19 08:05:48,496 WARNING:log.cpu Thread 47441 utilization (CPUs) user: 0.0, system: 0.8
2015-05-19 08:05:52,140 WARNING:log.cpu Logging paused for 3.12s to 08:05:52.103363 - longer than expected - other threads probably delayed as well
2015-05-19 08:05:52,153 WARNING:log.cpu This process utilization (CPUs) user: 0.0, system: 63.6 | OS (%) scputimes(user=0.3, nice=0.0, system=98.5, idle=1.2, iowait=0.0, irq=0.0, softirq=0.0, steal=0.0, guest=0.0, guest_nice=0.0), memory used: 16.6, threads: 101, fds: 70
2015-05-19 08:05:52,180 WARNING:log.cpu Thread 47441 utilization (CPUs) user: 0.0, system: 1.0
同一時間段內該執行緒 ID 47441 上的 strace 日誌:
08:05:46 futex(0x7f7b46b305ac, FUTEX_WAKE, 1) = 1 <0.000030>
08:05:46 sched_yield() = 0 <0.000016>
08:05:46 sched_yield() = 0 <0.000017>
08:05:46 sched_yield() = 0 <0.000021>
08:05:46 sched_yield() = 0 <0.000053>
08:05:46 sched_yield() = 0 <0.000023>
08:05:46 sched_yield() = 0 <0.000020>
08:05:46 sched_yield() = 0 <0.000019>
08:05:46 sched_yield() = 0 <0.000021>
08:05:46 futex(0x7f7b46b3202c, FUTEX_WAIT, 2, NULL) = 0 <0.000574>
08:05:46 futex(0x7f7b46b30f2c, FUTEX_WAKE, 1) = 1 <0.000019>
08:05:46 futex(0x7f7b46b3072c, FUTEX_WAKE, 1) = 1 <0.000028>
08:05:46 sched_yield() = 0 <0.000022>
08:05:46 futex(0x7f7b46b3202c, FUTEX_WAIT, 2, NULL) = 0 <0.000305>
08:05:46 futex(0x7f7b46b31fac, FUTEX_WAKE, 1) = 1 <0.000119>
08:05:46 futex(0x7f7b46b319ac, FUTEX_WAKE, 1) = 1 <0.000073>
08:05:46 futex(0x7f7b46b3202c, FUTEX_WAIT, 2, NULL) = 0 <0.000474>
08:05:46 futex(0x7f7b46b31f2c, FUTEX_WAKE, 1) = 1 <0.000094>
08:05:46 futex(0x7f7b46b30dac, FUTEX_WAKE, 1) = 1 <0.000096>
08:05:46 futex(0x7f7b46b3202c, FUTEX_WAIT, 2, NULL) = 0 <0.001139>
08:05:46 futex(0x7f7b46b31eac, FUTEX_WAKE, 1) = 1 <0.000156>
08:05:46 futex(0x7f7b46b303ac, FUTEX_WAKE, 1) = 1 <0.000059>
08:05:52 sched_yield() = 0 <0.000081>
08:05:52 sched_yield() = 0 <0.000050>
08:05:52 futex(0x7f7b46b3202c, FUTEX_WAIT, 2, NULL) = 0 <0.001603>
08:05:52 futex(0x7f7b46b31d2c, FUTEX_WAKE, 1) = 1 <0.000318>
08:05:52 futex(0x7f7b46b30f2c, FUTEX_WAKE, 1) = 1 <0.000259>
08:05:52 sched_yield() = 0 <0.000048>
08:05:52 sched_yield() = 0 <0.000060>
08:05:52 sched_yield() = 0 <0.000203>
08:05:52 futex(0x7f7b46b3202c, FUTEX_WAIT, 2, NULL) = 0 <0.000148>
08:05:52 sched_yield() = 0 <0.000036>
08:05:52 sched_yield() = 0 <0.000164>
08:05:52 futex(0x7f7b46b3202c, FUTEX_WAIT, 2, NULL) = 0 <0.003125>
08:05:52 sched_yield() = 0 <0.000053>
08:05:52 sched_yield() = 0 <0.000052>
08:05:52 sched_yield() = 0 <0.000023>
08:05:52 futex(0x7f7b46b3202c, FUTEX_WAIT, 2, NULL) = 0 <0.002264>
08:05:52 futex(0x7f7b46b3122c, FUTEX_WAKE, 1) = 1 <0.000119>
請注意 strace 每秒記錄 100 個條目,然後有 6 秒的間隙沒有記錄。但沒有顯示任何呼叫花費的時間超過幾毫秒。那麼為什麼該執行緒顯示 100% 的「系統」CPU 時間呢?
我曾經top
確認過所psutil
報告的數字是正確的。我確信我已經追蹤了正確的線程,因為 6 秒的間隙與 psutil 記錄的高 CPU 時間完全對應。
還可以幫助我使用哪些其他工具來確定導致此核心/系統時間的原因?
環境: - Python 3.4.3(在 Python 2.7 上也出現問題) - CentOS 7 核心 3.10.0-229.1.2.el7.x86_64(在 Ubuntu 12 LTS 上也出現問題) - 64 核心 AMD 機器