Linux的進程存在多種狀態,如TASK_RUNNING的運行態、EXIT_DEAD的停止態和TASK_INTERRUPTIBLE的接收信號的等待狀態等等(可在include/linux/sched.h中查看)。其中有一種狀態等待為TASK_UNINTERRUPTIBLE,稱為D狀態,該種狀態下進程不接收信號,只能通過wake_up喚醒。處於這種狀態的情況有很多,例如mutex鎖就可能會設置進程於該狀態,有時候進程在等待某種IO資源就緒時(wait_event機制)會設置進程進入該狀態。一般情況下,進程處於該狀態的時間不會太久,但若IO設備出現故障或者出現進程死鎖等情況,進程就可能長期處於該狀態而無法再返回到TASK_RUNNING態。因此,內核為了便於發現這類情況設計出了hung
task機制專門用於檢測長期處於D狀態的進程並發出告警。本文分析內核hung task機制的源碼並給出一個示例演示。
圖 D狀態死鎖流程圖
其核心思想為創建一個內核監測進程循環監測處於D狀態的每一個進程(任務),統計它們在兩次檢測之間的調度次數,如果發現有任務在兩次監測之間沒有發生任何的調度則可判斷該進程一直處於D狀態,很有可能已經死鎖,因此觸發報警日志打印,輸出進程的基本信息,棧回溯以及寄存器保存信息以供內核開發人員定位。
下面詳細分析實現方式:
static int __init hung_task_init(void) { atomic_notifier_chain_register(&panic_notifier_list, &panic_block); watchdog_task = kthread_run(watchdog, NULL, "khungtaskd"); return 0; } subsys_initcall(hung_task_init);首先,若在內核配置中啟用了該機制,在內核的subsys初始化階段就會調用hung_task_init()函數啟用功能,首先向內核的panic_notifier_list通知鏈注冊回調:
static struct notifier_block panic_block = { .notifier_call = hung_task_panic, };在內核觸發panic時就會調用該hung_task_panic()函數,這個函數的作用稍後再看。繼續往下初始化,調用kthread_run()函數創建了一個名為khungtaskd的線程,執行watchdog()函數,立即嘗試調度執行。該線程就是專用於檢測D狀態死鎖進程的後台內核線程。
/* * kthread which checks for tasks stuck in D state */ static int watchdog(void *dummy) { set_user_nice(current, 0); for ( ; ; ) { unsigned long timeout = sysctl_hung_task_timeout_secs; while (schedule_timeout_interruptible(timeout_jiffies(timeout))) timeout = sysctl_hung_task_timeout_secs; if (atomic_xchg(&reset_hung_task, 0)) continue; check_hung_uninterruptible_tasks(timeout); } return 0; }本進程首先設置優先級為0,即一般優先級,不影響其他進程。然後進入主循環(每隔timeout時間執行一次),首先讓進程睡眠,設置的睡眠時間為
CONFIG_DEFAULT_HUNG_TASK_TIMEOUT,可以通過內核配置選項修改,默認值為120s,睡眠結束被喚醒後判斷原子變量標識reset_hung_task,若被置位則跳過本輪監測,同時會清除該標識。該標識通過reset_hung_task_detector()函數設置(目前內核中尚無其他程序使用該接口):
void reset_hung_task_detector(void) { atomic_set(&reset_hung_task, 1); } EXPORT_SYMBOL_GPL(reset_hung_task_detector);接下來循環的最後即為監測函數check_hung_uninterruptible_tasks(),函數入參為監測超時時間。
/* * Check whether a TASK_UNINTERRUPTIBLE does not get woken up for * a really long time (120 seconds). If that happens, print out * a warning. */ static void check_hung_uninterruptible_tasks(unsigned long timeout) { int max_count = sysctl_hung_task_check_count; int batch_count = HUNG_TASK_BATCHING; struct task_struct *g, *t; /* * If the system crashed already then all bets are off, * do not report extra hung tasks: */ if (test_taint(TAINT_DIE) || did_panic) return; rcu_read_lock(); for_each_process_thread(g, t) { if (!max_count--) goto unlock; if (!--batch_count) { batch_count = HUNG_TASK_BATCHING; if (!rcu_lock_break(g, t)) goto unlock; } /* use "==" to skip the TASK_KILLABLE tasks waiting on NFS */ if (t->state == TASK_UNINTERRUPTIBLE) check_hung_task(t, timeout); } unlock: rcu_read_unlock(); }首先檢測內核是否已經DIE了或者已經panic了,如果是則表明內核已經crash了,無需再進行監測了,直接返回即可。注意這裡的did_panic標識在前文中的panic通知鏈回調函數中hung_task_panic()置位:
static int hung_task_panic(struct notifier_block *this, unsigned long event, void *ptr) { did_panic = 1; return NOTIFY_DONE; }接下去若尚無觸發內核crash,則進入監測流程並逐一檢測內核中的所有進程(任務task),該過程在RCU加鎖的狀態下進行,因此為了避免在進程較多的情況下加鎖時間過長,這裡設置了一個batch_count,一次最多檢測HUNG_TASK_BATCHING個進程。於此同時用戶也可以設定最大的檢測個數max_count=sysctl_hung_task_check_count,默認值為最大PID個數PID_MAX_LIMIT(通過sysctl命令設置)。函數調用for_each_process_thread()函數輪詢內核中的所有進程(任務task),僅對狀態處於TASK_UNINTERRUPTIBLE狀態的進程進行超時判斷,調用check_hung_task()函數,入參為task_struct結構和超時時間(120s):
static void check_hung_task(struct task_struct *t, unsigned long timeout) { unsigned long switch_count = t->nvcsw + t->nivcsw; /* * Ensure the task is not frozen. * Also, skip vfork and any other user process that freezer should skip. */ if (unlikely(t->flags & (PF_FROZEN | PF_FREEZER_SKIP))) return; /* * When a freshly created task is scheduled once, changes its state to * TASK_UNINTERRUPTIBLE without having ever been switched out once, it * musn't be checked. */ if (unlikely(!switch_count)) return; if (switch_count != t->last_switch_count) { t->last_switch_count = switch_count; return; } trace_sched_process_hang(t); if (!sysctl_hung_task_warnings) return; if (sysctl_hung_task_warnings > 0) sysctl_hung_task_warnings--;首先通過t->nvcsw和t->nivcsw的計數累加表示進程從創建開始至今的調度次數總和,其中t->nvcsw表示進程主動放棄CPU的次數,t->nivcsw表示被強制搶占的次數。隨後函數判斷幾個標識:(1)如果進程被frozen了那就跳過檢測;(2)調度次數為0的不檢測。接下來判斷從上一次檢測時保存的進程調度次數和本次是否相同,若不相同則表明這輪timeout(120s)時間內進程發生了調度,則更新該調度值返回,否則則表明該進程已經有timeout(120s)時間沒有得到調度了,一直處於D狀態。接下來的trace_sched_process_hang()暫不清楚作用,然後判斷sysctl_hung_task_warnings標識,它表示需要觸發報警的次數,用戶也可以通過sysctl命令配置,默認值為10,即若當前檢測的進程一直處於D狀態,默認情況下此處每2分鐘發出一次告警,一共發出10次,之後不再發出告警。下面來看告警代碼:
/* * Ok, the task did not get scheduled for more than 2 minutes, * complain: */ pr_err("INFO: task %s:%d blocked for more than %ld seconds.\n", t->comm, t->pid, timeout); pr_err(" %s %s %.*s\n", print_tainted(), init_utsname()->release, (int)strcspn(init_utsname()->version, " "), init_utsname()->version); pr_err("\"echo 0 > /proc/sys/kernel/hung_task_timeout_secs\"" " disables this message.\n"); sched_show_task(t); debug_show_held_locks(t); touch_nmi_watchdog();這裡會在控制台和日志中打印死鎖任務的名稱、PID號、超時時間、內核tainted信息、sysinfo、內核棧barktrace以及寄存器信息等。如果開啟了debug lock則打印鎖占用的情況,並touch nmi_watchdog以防止nmi_watchdog超時(對於我的ARM環境無需考慮nmi_watchdog)。
if (sysctl_hung_task_panic) { trigger_all_cpu_backtrace(); panic("hung_task: blocked tasks"); }最後如果設置了sysctl_hung_task_panic標識則直接觸發panic(該值可通過內核配置文件配置也可以通過sysctl設置)。
Kernel hacking --->
Debug Lockups and Hangs --->
[*] Detect Hung Tasks
(120) Default timeout for hung task detection (in seconds)
2、編寫測試程序
#include <linux/module.h> #include <linux/kernel.h> #include <linux/init.h> #include <linux/mutex.h> DEFINE_MUTEX(dlock); static int __init dlock_init(void) { mutex_lock(&dlock); mutex_lock(&dlock); return 0; } static void __exit dlock_exit(void) { return; } module_init(dlock_init); module_exit(dlock_exit); MODULE_LICENSE("GPL");本示例程序定義了一個mutex鎖,然後在模塊的init函數中重復加鎖,人為造成死鎖現象(mutex_lock()函數會調用__mutex_lock_slowpath()將進程設置為TASK_UNINTERRUPTIBLE狀態),進程進入D狀態後是無法退出的。可以通過ps命令來查看:root@apple:~# busybox ps
PID USER TIME COMMAND
......
521 root 0:00 insmod dlock.ko
......
然後查看該進程的狀態,可見已經進入了D狀態
root@apple:~# cat /proc/521/status Name: insmod
State: D (disk sleep)
Tgid: 521
Ngid: 0
Pid: 521
至此在等待兩分鐘後調試串口就會輸出以下信息,可見每兩分鐘就會輸出一次:
[ 360.625466] INFO: task insmod:521 blocked for more than 120 seconds.
[ 360.631878] Tainted: G O 4.1.15 #5
[ 360.637042] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 360.644986] [<c05278e8>] (__schedule) from [<c0527d34>] (schedule+0x40/0xa4)
[ 360.652129] [<c0527d34>] (schedule) from [<c0527ec8>] (schedule_preempt_disabled+0x18/0x1c)
[ 360.660570] [<c0527ec8>] (schedule_preempt_disabled) from [<c0529200>] (__mutex_lock_slowpath+0x6c/0xe4)
[ 360.670142] [<c0529200>] (__mutex_lock_slowpath) from [<c05292bc>] (mutex_lock+0x44/0x48)
[ 360.678432] [<c05292bc>] (mutex_lock) from [<bf026020>] (dlock_init+0x20/0x2c [dlock])
[ 360.686480] [<bf026020>] (dlock_init [dlock]) from [<c0009558>] (do_one_initcall+0x90/0x1e8)
[ 360.694976] [<c0009558>] (do_one_initcall) from [<c007ac4c>] (do_init_module+0x6c/0x1c0)
[ 360.703170] [<c007ac4c>] (do_init_module) from [<c007c568>] (load_module+0x1690/0x1d34)
[ 360.711284] [<c007c568>] (load_module) from [<c007cce8>] (SyS_init_module+0xdc/0x130)
[ 360.719239] [<c007cce8>] (SyS_init_module) from [<c000f800>] (ret_fast_syscall+0x0/0x54)
[ 480.725351] INFO: task insmod:521 blocked for more than 120 seconds.
[ 480.731759] Tainted: G O 4.1.15 #5
[ 480.736917] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 480.744842] [<c05278e8>] (__schedule) from [<c0527d34>] (schedule+0x40/0xa4)
[ 480.752029] [<c0527d34>] (schedule) from [<c0527ec8>] (schedule_preempt_disabled+0x18/0x1c)
[ 480.760479] [<c0527ec8>] (schedule_preempt_disabled) from [<c0529200>] (__mutex_lock_slowpath+0x6c/0xe4)
[ 480.770066] [<c0529200>] (__mutex_lock_slowpath) from [<c05292bc>] (mutex_lock+0x44/0x48)
[ 480.778363] [<c05292bc>] (mutex_lock) from [<bf026020>] (dlock_init+0x20/0x2c [dlock])
[ 480.786402] [<bf026020>] (dlock_init [dlock]) from [<c0009558>] (do_one_initcall+0x90/0x1e8)
[ 480.794897] [<c0009558>] (do_one_initcall) from [<c007ac4c>] (do_init_module+0x6c/0x1c0)
[ 480.803085] [<c007ac4c>] (do_init_module) from [<c007c568>] (load_module+0x1690/0x1d34)
[ 480.811188] [<c007c568>] (load_module) from [<c007cce8>] (SyS_init_module+0xdc/0x130)
[ 480.819113] [<c007cce8>] (SyS_init_module) from [<c000f800>] (ret_fast_syscall+0x0/0x54)
[ 600.825353] INFO: task insmod:521 blocked for more than 120 seconds.
[ 600.831759] Tainted: G O 4.1.15 #5
[ 600.836916] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 600.844865] [<c05278e8>] (__schedule) from [<c0527d34>] (schedule+0x40/0xa4)
[ 600.852005] [<c0527d34>] (schedule) from [<c0527ec8>] (schedule_preempt_disabled+0x18/0x1c)
[ 600.860445] [<c0527ec8>] (schedule_preempt_disabled) from [<c0529200>] (__mutex_lock_slowpath+0x6c/0xe4)
[ 600.870014] [<c0529200>] (__mutex_lock_slowpath) from [<c05292bc>] (mutex_lock+0x44/0x48)
[ 600.878303] [<c05292bc>] (mutex_lock) from [<bf026020>] (dlock_init+0x20/0x2c [dlock])
[ 600.886339] [<bf026020>] (dlock_init [dlock]) from [<c0009558>] (do_one_initcall+0x90/0x1e8)
[ 600.894835] [<c0009558>] (do_one_initcall) from [<c007ac4c>] (do_init_module+0x6c/0x1c0)
[ 600.903023] [<c007ac4c>] (do_init_module) from [<c007c568>] (load_module+0x1690/0x1d34)
[ 600.911133] [<c007c568>] (load_module) from [<c007cce8>] (SyS_init_module+0xdc/0x130)
[ 600.919059] [<c007cce8>] (SyS_init_module) from [<c000f800>] (ret_fast_syscall+0x0/0x54)