您现在的位置： Linux教程網 >> UnixLinux > >> Linux綜合 >> Linux內核

Linux內核調度算法

為什麼要了解內核的調度策略呢？呵呵，因為它值得我們學習，不算是廢話吧。內核調度程序很先進很強大，管理你的LINUX上跑的大量的亂七八糟的進程，同時還保持著對用戶操作的高靈敏響應，如果可能，為什麼不把這種思想放到自己的應用程序裡呢？或者，有沒有可能更好的實現自己的應用，使得操作系統能夠以自己的意志來分配資源給自己的進程？

帶著這兩個問題來看看KERNEL。首先回顧上我們開發應用程序，基本上就兩種類型，1、IO消耗型：比如Hadoop上的trunk服務，很明顯它的消耗主要在IO上，包括網絡IO磁盤IO等等。2、CPU消耗型，比如mapreduce或者其他的需要對大量數據進行計算處理的組件，就象對高清視頻壓縮成適合手機觀看分辨率的進程，他們的消耗主要在CPU上。當兩類進程都在一台SERVER上運行時，操作系統會如何調度它們呢？現在的服務器都是SMP多核的，那麼一個進程在多CPU時會來回切換嗎？如果我有一個程序，既有IO消耗又有CPU消耗，怎麼讓多核更好的調度我的程序呢？

又多了幾個問題。來看看內核調度程序吧，我們先從它的優先隊列談起吧。調度程序代碼就在內核源碼的kernel/sched.c的schedule函數中。
首先看下面的優先級隊列，每一個runqueue都有。runqueue是什麼？下面會詳細說下，現在大家可以理解為，內核為每一顆CPU分配了一個runqueue，用於維護這顆CPU可以運行的進程。runqueue裡，有幾個成員是prio_array類型，這個東東就是優先隊列，先看看它的定義：
[cpp]

struct prio_array {
unsigned int nr_active; 表示等待執行的進程總數
unsigned long bitmap[BITMAP_SIZE]; 一個unsigned long在內核中只有32位哈，大家要跟64位OS上的C程序中的long區分開，那個是64位的。那麼這個bitmap是干什麼的呢？它是用位的方式，表示某個優先級上有沒有待處理的隊列，是實現快速找到最高待處理優先進程的關鍵。如果我定義了四種優先級，我只需要四位就能表示某個優先級上有沒有進程要運行，例如優先級是2和3上有進程，那麼就應該是0110.......非常省空間，效率也快，不是嗎？
struct list_head queue[MAX_PRIO]; 與上面的bitmap是對應的，它存儲所有等待運行的進程。
};

看看BITMAP_SIZE是怎麼算出來的：#define BITMAP_SIZE ((((MAX_PRIO+1+7)/8)+sizeof(long)-1)/sizeof(long))
那麼，LINUX默認配置（如果你用默認選項編譯內核的話）MAX_PRIO是140，就是說一共內核對進程一共定義了140種優先級。等待某個CPU來處理的進程中，可能包含許多種優先級的進程，但，LINUX是個搶占式調度算法的操作系統，就是說，需要調度時一定是找到最高優先級的進程執行。上面的BITMAP_SIZE值根據MAX_PRIO算出來為5，那麼bitmap實際是32*5=160位，這樣就包含了MAX_PRIO的140位。優先級隊列是怎麼使用的？看2649行代碼：idx = sched_find_first_bit(array->bitmap);這個方法就用來快速的找到優先級最高的隊列。看看它的實現可以方便我們理解這個優先級位的設計：
[cpp]

static inline int sched_find_first_bit(unsigned long *b)
{
if (unlikely(b[0]))
return __ffs(b[0]);
if (unlikely(b[1]))
return __ffs(b[1]) + 32;
if (unlikely(b[2]))
return __ffs(b[2]) + 64;
if (b[3])
return __ffs(b[3]) + 96;
return __ffs(b[4]) + 128;
}

那麼__ffs是干什麼的？
[cpp]

static inline int __ffs(int x)
{
int r = 0;
if (!x)
return 0;
if (!(x & 0xffff)) {
x >>= 16;
r += 16;
}
if (!(x & 0xff)) {
x >>= 8;
r += 8;
}
if (!(x & 0xf)) {
x >>= 4;
r += 4;
}
if (!(x & 3)) {
x >>= 2;
r += 2;
}
if (!(x & 1)) {
x >>= 1;
r += 1;
}
return r;
}

sched_find_first_bit返回值就是最高優先級所在隊列的序號，與queue是對應使用的哈，queue = array->queue + idx;這樣就取到了要處理的進程隊列。這個設計在查找優先級時是非常快的，非常值得我們學習。

好，優先級隊列搞明白了，現在來看看runqueue，每個runqueue包含三個優先級隊列。
[cpp]

struct runqueue {
spinlock_t lock; 這是個自旋鎖，nginx裡解決驚群現象時也是用這個。與普通鎖的區別就是，使用普通鎖時，你去試圖拿一把鎖，結果發現已經被別人拿走了，你就在那睡覺，等別人鎖用完了叫你起來。所以如果有一個人拿住鎖了，一百個人都在門前睡覺等。當之前的人用完鎖回來後，會叫醒所有100個等鎖的人，然後這些人開始互相搶，搶到的人拿鎖進去，其他的人繼續等。自旋鎖不同，當他去拿鎖發現鎖被別人拿走了，他在那不睡覺的等，稍打個盹就看看自己主動看看鎖有沒有還回來。大家比較出優劣了嗎？
/*
* nr_running and cpu_load should be in the same cacheline because
* remote CPUs use both these fields when doing load calculation.
*/
unsigned long nr_running;
#ifdef CONFIG_SMP
unsigned long cpu_load;
#endif
unsigned long long nr_switches;
/*
* This is part of a global counter where only the total sum
* over all CPUs matters. A task can increase this counter on
* one CPU and if it got migrated afterwards it may decrease
* it on another CPU. Always updated under the runqueue lock:
*/
unsigned long nr_uninterruptible;
unsigned long expired_timestamp;
unsigned long long timestamp_last_tick;
task_t *curr, *idle;
struct mm_struct *prev_mm;
prio_array_t *active, *expired, arrays[2];上面說了半天的優先級隊列在這裡，但是在runqueue裡，為什麼不只一個呢？這個在下面講。
int best_expired_prio;
atomic_t nr_iowait;
... ...
};

LINUX是一個時間多路復用的系統，就是說，通過把CPU執行時間分成許多片，再分配給進程們使用，造成即使單CPU系統，也貌似允許多個任務在同時執行。那麼，時間片大小假設為100ms，過短過長，過長了有些不靈敏，過短了，連切換進程時可能都要消耗幾毫秒的時間。分給100個進程執行，在所有進程都用完自己的時間片後，需要重新給所有的進程重新分配時間片，怎麼分配呢？for循環遍歷所有的run狀態進程，重設時間片？這個性能無法容忍！太慢了，跟當前系統進程數相關。那麼2.6內核怎麼做的呢？它用了上面提到的兩個優先級隊列active和expired，顧名思義，active是還有時間片的進程隊列，而expired是時間片耗盡必須重新分配時間片的進程隊列。

這麼設計的好處就是不用再循環一遍所有進程重設時間片了，看看調度函數是怎麼玩的：
[cpp]

array = rq->active;
if (unlikely(!array->nr_active)) {
/*
* Switch the active and expired arrays.
*/
schedstat_inc(rq, sched_switch);
rq->active = rq->expired;
rq->expired = array;
array = rq->active;
rq->expired_timestamp = 0;
rq->best_expired_prio = MAX_PRIO;
} else
schedstat_inc(rq, sched_noswitch);

當所有運行進程的時間片都用完時，就把active和expired隊列互換指針，沒有遍歷哦，而時間片耗盡的進程在出acitve隊列入expired隊列時，已經單獨的重新分配好新時間片了。

再看一下schedule(void)調度函數，當某個進程休眠或者被搶占時，系統就開始調試schedule(void)決定接下來運行哪個進程。上面說過的東東都在這個函數裡有體現哈。
[cpp]

asmlinkage void __sched schedule(void)
{
long *switch_count;
task_t *prev, *next;
runqueue_t *rq;
prio_array_t *array;
struct list_head *queue;
unsigned long long now;
unsigned long run_time;
int cpu, idx;
/*
* Test if we are atomic. Since do_exit() needs to call into
* schedule() atomically, we ignore that path for now.
* Otherwise, whine if we are scheduling when we should not be.
*/
if (likely(!(current->exit_state & (EXIT_DEAD | EXIT_ZOMBIE)))) {先看看當前運行進程的狀態
if (unlikely(in_atomic())) {
printk(KERN_ERR "scheduling while atomic: "
"%s/0x%08x/%d\n",
current->comm, preempt_count(), current->pid);
dump_stack();
}
}
profile_hit(SCHED_PROFILING, __builtin_return_address(0));
need_resched:
preempt_disable();
prev = current;
release_kernel_lock(prev);
need_resched_nonpreemptible:
rq = this_rq(); 這行找到這個CPU對應的runqueue，再次強調，每個CPU有一個自己的runqueue
/*
* The idle thread is not allowed to schedule!
* Remove this check after it has been exercised a bit.
*/
if (unlikely(current == rq->idle) && current->state != TASK_RUNNING) {
printk(KERN_ERR "bad: scheduling from the idle thread!\n");
dump_stack();
}
schedstat_inc(rq, sched_cnt);
now = sched_clock();
if (likely(now - prev->timestamp < NS_MAX_SLEEP_AVG))
run_time = now - prev->timestamp;
else
run_time = NS_MAX_SLEEP_AVG;
/*
* Tasks with interactive credits get charged less run_time
* at high sleep_avg to delay them losing their interactive
* status
*/
if (HIGH_CREDIT(prev))
run_time /= (CURRENT_BONUS(prev) ? : 1);
spin_lock_irq(&rq->lock);
if (unlikely(current->flags & PF_DEAD))
current->state = EXIT_DEAD;
/*
* if entering off of a kernel preemption go straight
* to picking the next task.
*/
switch_count = &prev->nivcsw;
if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
switch_count = &prev->nvcsw;
if (unlikely((prev->state & TASK_INTERRUPTIBLE) &&
unlikely(signal_pending(prev))))
prev->state = TASK_RUNNING;
else {
if (prev->state == TASK_UNINTERRUPTIBLE)
rq->nr_uninterruptible++;
deactivate_task(prev, rq);
}
}
cpu = smp_processor_id();
if (unlikely(!rq->nr_running)) {
go_idle:
idle_balance(cpu, rq);
if (!rq->nr_running) {
next = rq->idle;
rq->expired_timestamp = 0;
wake_sleeping_dependent(cpu, rq);
/*
* wake_sleeping_dependent() might have released
* the runqueue, so break out if we got new
* tasks meanwhile:
*/
if (!rq->nr_running)
goto switch_tasks;
}
} else {
if (dependent_sleeper(cpu, rq)) {
next = rq->idle;
goto switch_tasks;
}
/*
* dependent_sleeper() releases and reacquires the runqueue
* lock, hence go into the idle loop if the rq went
* empty meanwhile:
*/
if (unlikely(!rq->nr_running))
goto go_idle;
}
array = rq->active;
if (unlikely(!array->nr_active)) { 上面說過的，需要重新計算時間片時，就用已經計算好的expired隊列了
/*
* Switch the active and expired arrays.
*/
schedstat_inc(rq, sched_switch);
rq->active = rq->expired;
rq->expired = array;
array = rq->active;
rq->expired_timestamp = 0;
rq->best_expired_prio = MAX_PRIO;
} else
schedstat_inc(rq, sched_noswitch);
idx = sched_find_first_bit(array->bitmap); 找到優先級最高的隊列
queue = array->queue + idx;
next = list_entry(queue->next, task_t, run_list);
if (!rt_task(next) && next->activated > 0) {
unsigned long long delta = now - next->timestamp;
if (next->activated == 1)
delta = delta * (ON_RUNQUEUE_WEIGHT * 128 / 100) / 128;
array = next->array;
dequeue_task(next, array);
recalc_task_prio(next, next->timestamp + delta);
enqueue_task(next, array);
}
next->activated = 0;
switch_tasks:
if (next == rq->idle)
schedstat_inc(rq, sched_goidle);
prefetch(next);
clear_tsk_need_resched(prev);
rcu_qsctr_inc(task_cpu(prev));
prev->sleep_avg -= run_time;
if ((long)prev->sleep_avg <= 0) {
prev->sleep_avg = 0;
if (!(HIGH_CREDIT(prev) || LOW_CREDIT(prev)))
prev->interactive_credit--;
}
prev->timestamp = prev->last_ran = now;
sched_info_switch(prev, next);
if (likely(prev != next)) { 表面現在正在執行的進程，不是選出來的優先級最高的進程
next->timestamp = now;
rq->nr_switches++;
rq->curr = next;
++*switch_count;
prepare_arch_switch(rq, next);
prev = context_switch(rq, prev, next); 所以需要完成進程上下文切換，把之前的進程信息CACHE住
barrier();
finish_task_switch(prev);
} else
spin_unlock_irq(&rq->lock);
prev = current;
if (unlikely(reacquire_kernel_lock(prev) < 0))
goto need_resched_nonpreemptible;
preempt_enable_no_resched();
if (unlikely(test_thread_flag(TIF_NEED_RESCHED)))
goto need_resched;
}

當然，在我們程序中，也可以通過執行以下系統調用來改變自己進程的優先級。nice系統調用可以改變某個進程的基本優先級，setpriority可以改變一組進程的優先級。

上一篇文章： Linux內核編譯錯誤： make: *** [.tmp_vmlinux1] Error 1
下一篇文章： Tiny6410 編譯Linux內核的問題（Linux-2.6.38）

Linux內核

linux系統：定制Linux內核，發揮Linux潛能

Linux內核的三種調度方法

Linux內核分析之調度算法

Linux內核之進程調度

Linux 2.6內核的四種IO調度算法總結

Linux內核的一些算法

Linux內核之進程調度

Linux 內核概述，linux內核概述