Linux Process Management

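Any discussion of program execution inevitably leads to processes. So what exactly is a process?

To let programs run concurrently, and to describe and control them while they do, operating systems introduced the concept of a "process". So that each program can run independently, the operating system gives it a dedicated data structure; in Linux this is task_struct, known as the process descriptor.

The system uses the process control block (PCB) to record a process's basic information and activity, and through it manages and controls the process. The program text, its associated data, and the PCB together make up the process entity.

The traditional OS definition: a process is the execution of a process entity, and is an independent unit of scheduling and resource allocation.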

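A process and a program are two entirely different concepts. The differences are:

1. A program has no process control block; a process does.

2. A process is the execution of a process entity. It is dynamic and has a finite lifetime, whereas a program is just an ordered set of instructions and is static.

3. One program run on different data sets becomes different processes, and each process is uniquely identified by its process control block.

4. Processes are concurrent and interact with one another, unlike the closed nature of a program.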

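The Linux kernel manages processes chiefly through the process descriptor, task_struct, a structure that holds all the information the kernel needs about a process.

The rest of this article walks through the members of task_struct and how they are used.

task_struct has a great many members; the header that defines it, include/linux/sched.h, is roughly 3,000 lines long in the 4.5 kernel.

The more complex members are introduced one group at a time below.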

Process state

volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */

The possible values of the state member are listed below.

See http://lxr.free-electrons.com/source/include/linux/sched.h?v=4.5

/*
 * Task state bitmask. NOTE! These bits are also
 * encoded in fs/proc/array.c: get_task_state().
 *
 * We have two separate sets of flags: task->state
 * is about runnability, while task->exit_state are
 * about the task exiting. Confusing, but this way
 * modifying one set can't modify the other one by
 * mistake.
 */

#define TASK_RUNNING 0

#define TASK_INTERRUPTIBLE 1

#define TASK_UNINTERRUPTIBLE 2

#define __TASK_STOPPED 4

#define __TASK_TRACED 8

/* in tsk->exit_state */

#define EXIT_DEAD 16

#define EXIT_ZOMBIE 32

#define EXIT_TRACE (EXIT_ZOMBIE | EXIT_DEAD)

/* in tsk->state again */

#define TASK_DEAD 64

#define TASK_WAKEKILL 128 /** wake on signals that are deadly **/

#define TASK_WAKING 256

#define TASK_PARKED 512

#define TASK_NOLOAD 1024

#define TASK_STATE_MAX 2048

/* Convenience macros for the sake of set_task_state */

#define TASK_KILLABLE (TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)

#define TASK_STOPPED (TASK_WAKEKILL | __TASK_STOPPED)

#define TASK_TRACED (TASK_WAKEKILL | __TASK_TRACED)

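Five mutually exclusive states

The state field takes one of five mutually exclusive values: no two of them can be set at the same time. Every process in the system is in exactly one of these states.

State: Description

TASK_RUNNING: The process is either executing on a CPU or ready to run and waiting for the scheduler to give it a time slice.

TASK_INTERRUPTIBLE: The process is suspended (blocked) waiting for some condition, such as a hardware interrupt, a resource, or a signal. As soon as the condition is met, the process returns to the ready state, TASK_RUNNING.

TASK_UNINTERRUPTIBLE: Like TASK_INTERRUPTIBLE, except that the process cannot be woken by a signal: neither a delivered signal nor an unrelated external interrupt wakes it, only the availability of the resource it is waiting for. The state is used sparingly but matters a great deal, particularly when a driver probes hardware: the probe must not be interrupted, or the process could be left in an unpredictable state.

TASK_STOPPED: Execution of the process has been stopped. A process enters this state after receiving SIGSTOP, SIGTTIN, SIGTSTP, or SIGTTOU.

TASK_TRACED: The process is being monitored by another process such as a debugger, which has stopped its execution. While a process is traced, any signal puts it into this state.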
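As a rough illustration of how these values are read, the sketch below names the basic state encoded in a state value. It is a userspace model, not kernel code: task_state_name() is a hypothetical helper, the constants are copied from the 4.5 listing above, and the two double-underscore macros are renamed with a _BIT suffix because such identifiers are reserved in user programs.

```c
/* Values copied from the sched.h (v4.5) listing above. */
#define TASK_RUNNING          0
#define TASK_INTERRUPTIBLE    1
#define TASK_UNINTERRUPTIBLE  2
#define TASK_STOPPED_BIT      4   /* __TASK_STOPPED in the kernel */
#define TASK_TRACED_BIT       8   /* __TASK_TRACED in the kernel */

/* Hypothetical helper: name the basic state held in a state value,
 * ignoring modifier bits such as TASK_WAKEKILL (128). */
const char *task_state_name(long state)
{
    if (state == TASK_RUNNING)
        return "TASK_RUNNING";
    if (state & TASK_INTERRUPTIBLE)
        return "TASK_INTERRUPTIBLE";
    if (state & TASK_UNINTERRUPTIBLE)
        return "TASK_UNINTERRUPTIBLE";
    if (state & TASK_STOPPED_BIT)
        return "TASK_STOPPED";
    if (state & TASK_TRACED_BIT)
        return "TASK_TRACED";
    return "unknown";
}
```

For example, task_state_name(128 | 4), the value of the TASK_STOPPED convenience macro, is reported as "TASK_STOPPED".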

Two termination states

Two additional process states can be stored in both the state field and the exit_state field. A process reaches them only when it terminates.

/* task state */

int exit_state;

int exit_code, exit_signal;

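State: Description

EXIT_ZOMBIE: The process has terminated, but its parent has not yet collected its termination information with wait() or a related system call. The process is a zombie.

EXIT_DEAD: The final state of the process.

The exit_code and exit_signal members are covered later in this article.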

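A newer sleep state

See

TASK_KILLABLE: a new process state in Linux

As described above, TASK_UNINTERRUPTIBLE and TASK_INTERRUPTIBLE are both sleep states. Let us now look at how the kernel puts a process to sleep.

How the kernel puts a process to sleep

The Linux kernel has traditionally offered two ways to put a process to sleep.

The usual way is to set the process state to TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE and then call the scheduler's schedule() function, which removes the process from the CPU run queue.

If the process sleeps interruptibly (its state is TASK_INTERRUPTIBLE), it can be woken either by an explicit wake-up call (wake_up_process()) or by a signal that needs handling.

If it sleeps uninterruptibly (its state is TASK_UNINTERRUPTIBLE), only an explicit wake-up call can wake it. Interruptible sleep is the recommended choice; reserve uninterruptible sleep for cases where there is no alternative, such as during device I/O, when handling a signal would be very difficult.

When a task in interruptible sleep receives a signal, it must handle the signal (unless the signal is masked), abandon what it was doing (which requires cleanup code), and return -EINTR to user space. Checking these return codes and acting on them is, once again, the programmer's job.

A lazy programmer may therefore prefer uninterruptible sleep, because signals do not wake such tasks.

The danger is that the wake-up call for a task in uninterruptible sleep may, for whatever reason, never arrive. The process then cannot be killed, and the only remedy is to reboot. So interruptible sleep demands attention to detail, and getting it wrong introduces bugs on both the kernel side and the user side, while uninterruptible sleep risks producing a blocked process that can never be terminated.

The kernel now implements a third way to sleep.

Linux kernel 2.6.25 introduced a new sleep state:

State: Description

TASK_KILLABLE: A process in this state behaves as in TASK_UNINTERRUPTIBLE, except that it responds to fatal signals.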

It is defined as follows:

#define TASK_WAKEKILL 128 /** wake on signals that are deadly **/

/* Convenience macros for the sake of set_task_state */

#define TASK_KILLABLE (TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)

#define TASK_STOPPED (TASK_WAKEKILL | __TASK_STOPPED)

#define TASK_TRACED (TASK_WAKEKILL | __TASK_TRACED)

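In other words, TASK_UNINTERRUPTIBLE + TASK_WAKEKILL = TASK_KILLABLE.

TASK_WAKEKILL is the bit that lets a fatal signal wake the process.

The new state thus lets a task sleep as in TASK_UNINTERRUPTIBLE yet still respond to fatal signals.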
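The effect of the TASK_WAKEKILL bit can be modelled in a few lines. This is a simplified userspace sketch, assuming that a signal wakes a sleeper whenever its state shares a bit with the wake mask and that fatal signals add TASK_WAKEKILL to that mask; signal_would_wake() is a hypothetical name, not a kernel function.

```c
#define TASK_INTERRUPTIBLE    1
#define TASK_UNINTERRUPTIBLE  2
#define TASK_WAKEKILL         128
#define TASK_KILLABLE         (TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)

/* Simplified model: returns 1 if a signal would wake a task sleeping in
 * the given state. Ordinary signals only match TASK_INTERRUPTIBLE;
 * fatal signals also match any state carrying TASK_WAKEKILL. */
int signal_would_wake(long state, int fatal)
{
    long mask = TASK_INTERRUPTIBLE | (fatal ? TASK_WAKEKILL : 0);
    return (state & mask) != 0;
}
```

Under this model a TASK_KILLABLE sleeper ignores ordinary signals but wakes for a fatal one, while a plain TASK_UNINTERRUPTIBLE sleeper ignores both.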

The transitions between process states, and what triggers them, are summarized in a figure in the original article.

Process identifier (PID)

pid_t pid;

pid_t tgid;

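Unix systems identify processes by pid. Linux assigns a distinct pid to every process or lightweight process (thread) in the system, yet Unix programmers expect all threads of one program to share a single pid. To meet that expectation Linux introduced the thread group: every thread in a group shares the pid of the group leader, stored in the tgid field, and getpid() returns the tgid of the current process rather than its pid.

With CONFIG_BASE_SMALL set to 0, PIDs range from 0 to 32767 by default, so the system can hold at most 32768 processes.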

#define PID_MAX_DEFAULT (CONFIG_BASE_SMALL ? 0x1000 : 0x8000)

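See http://lxr.free-electrons.com/source/include/linux/threads.h

To restate: all threads in a thread group report the same PID as the group leader (the first lightweight process in the group), and that value is kept in each thread's tgid member. Only the group leader has a pid member equal to its tgid. The getpid() system call returns the current task's tgid, not its pid.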
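The pid/tgid distinction is visible from user space on Linux. The sketch below assumes glibc and the raw gettid system call: getpid() reports the kernel's tgid and gettid reports the kernel's per-thread pid, so the two match only in the thread group leader. The function name is_thread_group_leader() is made up for this example.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if the calling thread is the leader of its thread group,
 * that is, if its kernel-side pid equals its tgid. */
int is_thread_group_leader(void)
{
    pid_t tgid = getpid();                    /* task_struct::tgid */
    pid_t tid  = (pid_t)syscall(SYS_gettid);  /* task_struct::pid  */
    return tid == tgid;
}
```

Called from the initial thread of a program this should return 1; called from a thread created with pthread_create() it should return 0, because that thread has its own pid but shares the leader's tgid.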

Process kernel stack

void *stack;

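Kernel stack and thread descriptor

For each process, the Linux kernel packs two different data structures into a single memory area allocated for that process:

one is the process's kernel-mode stack;

the other is thread_info, a small data structure closely tied to the process descriptor and known as the thread descriptor.

Linux stores thread_info and the kernel-mode stack together. The area is normally 8192 bytes (two page frames), and its start address must be a multiple of 8192.

In linux/arch/x86/include/asm/page_32_types.h: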

#define THREAD_SIZE_ORDER 1

#define THREAD_SIZE (PAGE_SIZE << THREAD_SIZE_ORDER)

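For efficiency, the kernel places these 8 KB in two consecutive page frames and aligns the first frame to a multiple of 2^13.

A process running in kernel mode uses a stack in the kernel data segment, which is distinct from the stack it uses in user mode.

The user-mode stack lives in the process's linear address space.

The kernel stack comes into play when the process enters the kernel from user space: the privilege level changes, the stack must be switched, and from then on the kernel uses this kernel stack. Kernel control paths need little stack space, so a few thousand bytes of kernel-mode stack suffice.

Note that this stack is used only by kernel routines; Linux provides separate hard-IRQ and soft-IRQ stacks for interrupts.

The original figure, taken from ULK3, shows how the two structures share the area in physical memory: the thread descriptor sits at the start of the area, and the stack grows downward from the end.

In more recent kernels, however, task_struct has no typed pointer to thread_info. It holds a void pointer instead (the stack member), and thread_info is reached through a cast.

The relevant code is in include/linux/sched.h: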

#define task_thread_info(task) ((struct thread_info *)(task)->stack)

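In that figure, the esp register is the CPU stack pointer and holds the address of the top of the stack. On 80x86 systems the stack starts at the end of the area and grows toward its beginning. Immediately after a switch from user mode to kernel mode the kernel stack is always empty, so esp points to the end of the area; as data is pushed onto the stack, esp decreases.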

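Kernel stack data structures: thread_info and thread_union

thread_info is architecture-specific; each architecture defines it in its own thread_info.h.

Architecture: Definition

x86: linux-4.5/arch/x86/include/asm/thread_info.h, line 55

arm: linux-4.5/arch/arm/include/asm/thread_info.h, line 49

arm64: linux-4.5/arch/arm64/include/asm/thread_info.h, line 47

The kernel uses a union to hold a process's thread descriptor and kernel stack: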

union thread_union

{

struct thread_info thread_info;

unsigned long stack[THREAD_SIZE/sizeof(long)];

};

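Getting the thread_info of the process currently running on a CPU

This section explains how the stack pointer (esp) leads to the thread_info of the process currently running on a CPU.

As noted above, thread_info and the kernel-mode stack sit together in two page frames of physical memory whose start address is aligned to 2^13.

Early kernels did not need to support 64-bit processors, so they obtained the base address of thread_info simply by masking off the low 13 bits of esp.

The implementations of current_thread_info(void) in different x86 kernel versions compare as follows:

Architecture, version: Implementation (notes)

x86, 3.14: return (struct thread_info *)(sp & ~(THREAD_SIZE - 1)); (masks off the low 13 bits of esp, leaving the address of thread_info)

x86, 3.15: ti = (void *)(this_cpu_read_stable(kernel_stack) + KERNEL_STACK_OFFSET - THREAD_SIZE);

x86, 4.1: (struct thread_info *)(current_top_of_stack() - THREAD_SIZE);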

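Early versions

The current stack pointer (current_stack_pointer == sp) is esp.

THREAD_SIZE is 8 KB, which in binary is 0000 0000 0000 0000 0010 0000 0000 0000.

~(THREAD_SIZE-1) is therefore 1111 1111 1111 1111 1110 0000 0000 0000: the low 13 bits are all zero. ANDing esp with this value clears its low 13 bits and yields the address of thread_info.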
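The masking trick can be tried out in user space. The sketch below is an illustration under stated assumptions, not the kernel's code: struct thread_info here is a two-field stand-in, THREAD_SIZE is fixed at 8192, and aligned_alloc() (C11) plays the role of the kernel allocator that guarantees THREAD_SIZE alignment. The helper names are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

#define THREAD_SIZE 8192UL  /* two 4 KB pages, as on 32-bit x86 */

/* Hypothetical stand-in for the kernel's thread_info. */
struct thread_info {
    void *task;
    unsigned long flags;
};

/* Same shape as the kernel union: descriptor at the start,
 * stack occupying the whole area. */
union thread_union {
    struct thread_info thread_info;
    unsigned long stack[THREAD_SIZE / sizeof(long)];
};

/* Clear the low 13 bits of any address inside the area to recover
 * the base of the area, which is where thread_info lives. */
struct thread_info *thread_info_from_sp(const void *sp)
{
    return (struct thread_info *)((uintptr_t)sp & ~(uintptr_t)(THREAD_SIZE - 1));
}

/* Allocate one THREAD_SIZE-aligned area; release it with free(). */
union thread_union *alloc_thread_union(void)
{
    return aligned_alloc(THREAD_SIZE, sizeof(union thread_union));
}
```

Any pointer into the stack array of an area returned by alloc_thread_union() maps back to that area's thread_info, which is exactly why the alignment requirement matters: without it, the mask would land on an unrelated address.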

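Code usually wants the process descriptor, task_struct, rather than the address of thread_info. To fetch the task_struct of the process running on the current CPU the kernel provides the current macro. Because thread_info begins with a struct task_struct *task member, the macro is in essence current_thread_info()->task. It is defined in include/asm-generic/current.h: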

#define get_current() (current_thread_info()->task)

#define current get_current()

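This definition is architecture-independent; individual architectures also define more convenient or faster versions of current.

See: http://lxr.free-electrons.com/ident?v=4.5;i=current

Allocating and freeing thread_info

A process's kernel stack is allocated by alloc_thread_info_node() and released by free_thread_info().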

# if THREAD_SIZE >= PAGE_SIZE

static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,

int node)

{

struct page *page = alloc_kmem_pages_node(node, THREADINFO_GFP,

THREAD_SIZE_ORDER);

return page ? page_address(page) : NULL;

}

static inline void free_thread_info(struct thread_info *ti)

{

free_kmem_pages((unsigned long)ti, THREAD_SIZE_ORDER);

}

# else

static struct kmem_cache *thread_info_cache;

static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,

int node)

{

return kmem_cache_alloc_node(thread_info_cache, THREADINFO_GFP, node);

}

static void free_thread_info(struct thread_info *ti)

{

kmem_cache_free(thread_info_cache, ti);

}

# endif

The definition of THREAD_SIZE_ORDER depends on the architecture:

Architecture, version: Definition (notes)

x86, 4.5: arch/x86/include/asm/page_32_types.h, line 20: #define THREAD_SIZE_ORDER 1 (__get_free_pages allocates 2 pages of memory whose start address is 8192-byte aligned)

x86_64, 4.5: arch/x86/include/asm/page_64_types.h, line 10: #define THREAD_SIZE_ORDER (2 + KASAN_STACK_ORDER)

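Process flags

unsigned int flags; /* per process flags, defined below */

These flags describe the condition of the process, as distinct from its run state; the kernel uses them to recognise what the process is currently doing and to decide what to do next.

The possible values of flags are listed below. The macros all begin with PF (process flag).

See

http://lxr.free-electrons.com/source/include/linux/sched.h?v4.5#L2083

For example:

PF_FORKNOEXEC: the process has been forked but has not yet called exec.

PF_SUPERPRIV: the process has used superuser privileges.

PF_DUMPCORE: the process dumped core.

PF_SIGNALED: the process was killed by a signal.

PF_EXITING: the process is shutting down.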

/* Per process flags */

#define PF_EXITING 0x00000004 /* getting shut down */

#define PF_EXITPIDONE 0x00000008 /* pi exit done on shut down */

#define PF_VCPU 0x00000010 /* I'm a virtual CPU */

#define PF_WQ_WORKER 0x00000020 /* I'm a workqueue worker */

#define PF_FORKNOEXEC 0x00000040 /* forked but didn't exec */

#define PF_MCE_PROCESS 0x00000080 /* process policy on mce errors */

#define PF_SUPERPRIV 0x00000100 /* used super-user privileges */

#define PF_DUMPCORE 0x00000200 /* dumped core */

#define PF_SIGNALED 0x00000400 /* killed by a signal */

#define PF_MEMALLOC 0x00000800 /* Allocating memory */

#define PF_NPROC_EXCEEDED 0x00001000 /* set_user noticed that RLIMIT_NPROC was exceeded */

#define PF_USED_MATH 0x00002000 /* if unset the fpu must be initialized before use */

#define PF_USED_ASYNC 0x00004000 /* used async_schedule*(), used by module init */

#define PF_NOFREEZE 0x00008000 /* this thread should not be frozen */

#define PF_FROZEN 0x00010000 /* frozen for system suspend */

#define PF_FSTRANS 0x00020000 /* inside a filesystem transaction */

#define PF_KSWAPD 0x00040000 /* I am kswapd */

#define PF_MEMALLOC_NOIO 0x00080000 /* Allocating memory without IO involved */

#define PF_LESS_THROTTLE 0x00100000 /* Throttle me less: I clean memory */

#define PF_KTHREAD 0x00200000 /* I am a kernel thread */

#define PF_RANDOMIZE 0x00400000 /* randomize virtual address space */

#define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */

#define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_allowed */

#define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */

#define PF_MUTEX_TESTER 0x20000000 /* Thread belongs to the rt mutex tester */

#define PF_FREEZER_SKIP 0x40000000 /* Freezer should not count it as freezable */

#define PF_SUSPEND_TASK 0x80000000 /* this thread called freeze_processes and should not be frozen */
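Each PF_ value occupies its own bit, so flags are tested with a bitwise AND and set with a bitwise OR. A minimal sketch, using two constants from the list above and two hypothetical helper names that operate on a plain copy of the flags word rather than on a real task_struct:

```c
#define PF_EXITING  0x00000004  /* getting shut down */
#define PF_KTHREAD  0x00200000  /* I am a kernel thread */

/* Hypothetical helper: is the PF_KTHREAD bit set in this flags word? */
int flags_is_kthread(unsigned int flags)
{
    return (flags & PF_KTHREAD) != 0;
}

/* Hypothetical helper: return the flags word with PF_EXITING added,
 * leaving every other bit untouched. */
unsigned int flags_mark_exiting(unsigned int flags)
{
    return flags | PF_EXITING;
}
```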

Members describing process relationships

/*
 * pointers to (original) parent process, youngest child, younger sibling,
 * older sibling, respectively. (p->father can be replaced with
 * p->real_parent->pid)
 */

struct task_struct __rcu *real_parent; /* real parent process */

struct task_struct __rcu *parent; /* recipient of SIGCHLD, wait4() reports */

/* children/sibling forms the list of my natural children */

struct list_head children; /* list of my children */

struct list_head sibling; /* linkage in my parent's children list */

struct task_struct *group_leader; /* threadgroup leader */

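In Linux, all processes are related, directly or indirectly. Every process has a parent and may have zero or more children; processes with the same parent are siblings.

Field: Description

real_parent: points to the parent process; if the process that created it no longer exists, points to the init process (PID 1).

parent: points to the process that must be signalled when this one terminates. Its value is usually the same as real_parent.

children: head of the list whose elements are all this process's children.

sibling: links the process into its parent's list of children.

group_leader: points to the leader of the thread group the process belongs to.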

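The ptrace system call

ptrace gives a parent process a way to control the execution of a child and to inspect and modify its core image.

It is used mainly to implement breakpoint debugging. A traced process runs until a signal arrives; the process then stops and its parent is notified. While the process is stopped its memory can be read and written. The parent can then let the child continue, choosing whether to ignore the signal that caused the stop.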

unsigned int ptrace;

/*
 * ptraced is the list of tasks this task is using ptrace on.
 * This includes both natural children and PTRACE_ATTACH targets.
 * p->ptrace_entry is p's link on the p->parent->ptraced list.
 */

struct list_head ptraced;

struct list_head ptrace_entry;

unsigned long ptrace_message;

siginfo_t *last_siginfo; /* For ptrace use. */

A ptrace value of 0 means the task is not being traced. The possible flag values are:

See

http://lxr.free-electrons.com/source/include/linux/ptrace.h?v=4.5

/*
 * Ptrace flags
 *
 * The owner ship rules for task->ptrace which holds the ptrace
 * flags is simple. When a task is running it owns it's task->ptrace
 * flags. When the a task is stopped the ptracer owns task->ptrace.
 */

#define PT_SEIZED 0x00010000 /* SEIZE used, enable new behavior */

#define PT_PTRACED 0x00000001

#define PT_DTRACE 0x00000002 /* delayed trace (used on m68k, i386) */

#define PT_PTRACE_CAP 0x00000004 /* ptracer can follow suid-exec */

#define PT_OPT_FLAG_SHIFT 3

/* PT_TRACE_* event enable flags */

#define PT_EVENT_FLAG(event) (1 << (PT_OPT_FLAG_SHIFT + (event)))

#define PT_TRACESYSGOOD PT_EVENT_FLAG(0)

#define PT_TRACE_FORK PT_EVENT_FLAG(PTRACE_EVENT_FORK)

#define PT_TRACE_VFORK PT_EVENT_FLAG(PTRACE_EVENT_VFORK)

#define PT_TRACE_CLONE PT_EVENT_FLAG(PTRACE_EVENT_CLONE)

#define PT_TRACE_EXEC PT_EVENT_FLAG(PTRACE_EVENT_EXEC)

#define PT_TRACE_VFORK_DONE PT_EVENT_FLAG(PTRACE_EVENT_VFORK_DONE)

#define PT_TRACE_EXIT PT_EVENT_FLAG(PTRACE_EVENT_EXIT)

#define PT_TRACE_SECCOMP PT_EVENT_FLAG(PTRACE_EVENT_SECCOMP)

#define PT_EXITKILL (PTRACE_O_EXITKILL << PT_OPT_FLAG_SHIFT)

#define PT_SUSPEND_SECCOMP (PTRACE_O_SUSPEND_SECCOMP << PT_OPT_FLAG_SHIFT)

/* single stepping state bits (used on ARM and PA-RISC) */

#define PT_SINGLESTEP_BIT 31

#define PT_SINGLESTEP (1<<PT_SINGLESTEP_BIT)

#define PT_BLOCKSTEP_BIT 30

#define PT_BLOCKSTEP (1<<PT_BLOCKSTEP_BIT)
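The PT_EVENT_FLAG arithmetic is easier to follow with numbers plugged in. The sketch below copies PT_OPT_FLAG_SHIFT and PT_EVENT_FLAG from the listing above and takes PTRACE_EVENT_FORK (1) and PTRACE_EVENT_EXIT (6) from the uapi ptrace header; ptrace_opts_to_flags() is a hypothetical helper that mirrors the shift the PT_EXITKILL definition uses.

```c
#define PT_OPT_FLAG_SHIFT 3
#define PT_EVENT_FLAG(event) (1 << (PT_OPT_FLAG_SHIFT + (event)))

/* Event numbers and option bit as defined in the uapi ptrace header. */
#define PTRACE_EVENT_FORK   1
#define PTRACE_EVENT_EXIT   6
#define PTRACE_O_TRACEFORK  (1 << PTRACE_EVENT_FORK)

/* Hypothetical helper: convert user-visible PTRACE_O_* option bits
 * into the PT_* bits stored in task->ptrace by shifting them left
 * past the three low flag bits (PT_PTRACED, PT_DTRACE, PT_PTRACE_CAP). */
unsigned int ptrace_opts_to_flags(unsigned int opts)
{
    return opts << PT_OPT_FLAG_SHIFT;
}
```

With these definitions PT_EVENT_FLAG(PTRACE_EVENT_FORK) is 1 << 4, or 0x10, the same value obtained by shifting PTRACE_O_TRACEFORK left by PT_OPT_FLAG_SHIFT.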

Performance Event

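Performance Events (perf) is a performance-analysis tool released and maintained together with the Linux kernel source. These members support it in analysing a process's performance.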

#ifdef CONFIG_PERF_EVENTS

struct perf_event_context *perf_event_ctxp[perf_nr_task_contexts];

struct mutex perf_event_mutex;

struct list_head perf_event_list;

#endif

For an introduction to the tool, see http://www.ibm.com/developerworks/cn/linux/l-cn-perf1/index.html?ca=drs- and http://www.ibm.com/developerworks/cn/linux/l-cn-perf2/index.html?ca=drs-.

Process scheduling

Priority

int prio, static_prio, normal_prio;

unsigned int rt_priority;

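Field: Description

static_prio: the static priority; it can be changed through the nice system call.

rt_priority: the real-time priority.

normal_prio: depends on the static priority and the scheduling policy.

prio: the dynamic priority.

Real-time priorities range from 0 to MAX_RT_PRIO-1 (that is, 99); static priorities of ordinary processes range from MAX_RT_PRIO to MAX_PRIO-1 (100 to 139). The larger the value, the lower the priority.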

/* http://lxr.free-electrons.com/source/include/linux/sched/prio.h#L21 */

#define MAX_USER_RT_PRIO 100

#define MAX_RT_PRIO MAX_USER_RT_PRIO

/* http://lxr.free-electrons.com/source/include/linux/sched/prio.h#L24 */

#define MAX_PRIO (MAX_RT_PRIO + 40)

#define DEFAULT_PRIO (MAX_RT_PRIO + 20)
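These constants tie the user-visible nice value (-20 to 19) to the static priority range 100 to 139. The sketch below reuses the definitions above and adds three small functions that follow the same arithmetic as the kernel's NICE_TO_PRIO and PRIO_TO_NICE macros and its rt_prio() check; the lowercase function names are chosen for this example.

```c
#define MAX_USER_RT_PRIO 100
#define MAX_RT_PRIO      MAX_USER_RT_PRIO
#define MAX_PRIO         (MAX_RT_PRIO + 40)
#define DEFAULT_PRIO     (MAX_RT_PRIO + 20)

/* nice -20..19 maps onto static_prio 100..139. */
int nice_to_prio(int nice)
{
    return nice + DEFAULT_PRIO;
}

/* Inverse mapping: static_prio 100..139 back to nice -20..19. */
int prio_to_nice(int prio)
{
    return prio - DEFAULT_PRIO;
}

/* Priorities below MAX_RT_PRIO belong to real-time tasks. */
int is_rt_prio(int prio)
{
    return prio < MAX_RT_PRIO;
}
```

A process with the default nice value of 0 therefore has a static priority of 120, and the full nice range exactly fills the 40 slots between MAX_RT_PRIO and MAX_PRIO-1.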

Fields related to scheduling policy

/* http://lxr.free-electrons.com/source/include/linux/sched.h?v=4.5#L1426 */

unsigned int policy;

/* http://lxr.free-electrons.com/source/include/linux/sched.h?v=4.5#L1409 */

const struct sched_class *sched_class;

struct sched_entity se;

struct sched_rt_entity rt;

cpumask_t cpus_allowed;

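Field: Description

policy: the scheduling policy.

sched_class: the scheduling class.

se: the scheduling entity for ordinary processes; every process has one.

rt: the scheduling entity for real-time processes; every process has one.

cpus_allowed: controls which processors the process may run on.

Scheduling policies

policy holds the process's scheduling policy. There are currently six main policies:

See

http://lxr.free-electrons.com/source/include/uapi/linux/sched.h?v=4.5#L36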

/* Scheduling policies */

#define SCHED_NORMAL 0

#define SCHED_FIFO 1

#define SCHED_RR 2

#define SCHED_BATCH 3

/* SCHED_ISO: reserved but not implemented yet */

#define SCHED_IDLE 5

#define SCHED_DEADLINE 6

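Policy: Description (scheduler class)

SCHED_NORMAL (also called SCHED_OTHER): for ordinary processes, implemented by the CFS scheduler. (CFS)

SCHED_BATCH: a variant of the SCHED_NORMAL policy for non-interactive, CPU-bound processes. It is time-shared: CPU time is allocated according to dynamic priority, which can be adjusted with the nice() API. Such processes rank below the two real-time policies, so real-time processes are always scheduled first. The policy is tuned for throughput. (CFS)

SCHED_IDLE: the lowest priority; such processes run only when the system is otherwise idle (for example, using spare machine time for extraterrestrial-signal searches or protein-structure analysis). (CFS)

SCHED_FIFO: first-in, first-out real-time policy. Tasks of equal priority are served in arrival order, and a higher-priority task can preempt a lower-priority one. (RT)

SCHED_RR: round-robin real-time policy. It uses time slices: a task that exhausts its slice goes to the tail of the queue for its priority, which keeps tasks of equal priority fair. A higher-priority task can still preempt a lower-priority one. Real-time tasks can select the policy they need with the sched_setscheduler() API. (RT)

SCHED_DEADLINE: a newer real-time policy for bursty computations that are highly sensitive to latency and completion time. It is based on the Earliest Deadline First (EDF) algorithm.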
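From user space, the policy of a task can be queried with sched_getscheduler(), which returns one of the numeric values listed above. The sketch below is an illustration only: sched_policy_name() and current_policy() are hypothetical helpers, and the switch uses the raw numbers from the kernel listing so that it does not depend on which SCHED_ macros a given C library exposes.

```c
#include <sched.h>

/* Hypothetical helper: map a policy number from the listing above
 * to its name. 4 is reserved (SCHED_ISO) and reported as unknown. */
const char *sched_policy_name(int policy)
{
    switch (policy) {
    case 0: return "SCHED_NORMAL";
    case 1: return "SCHED_FIFO";
    case 2: return "SCHED_RR";
    case 3: return "SCHED_BATCH";
    case 5: return "SCHED_IDLE";
    case 6: return "SCHED_DEADLINE";
    default: return "unknown";
    }
}

/* Policy of the calling process (pid 0 means "myself"),
 * or -1 if the query fails. */
int current_policy(void)
{
    return sched_getscheduler(0);
}
```

For an ordinary process started from a shell, sched_policy_name(current_policy()) would normally be expected to give "SCHED_NORMAL", unless the process was launched under a different policy.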

Scheduling classes

The sched_class structure represents a scheduling class. The kernel currently implements the following five:

extern const struct sched_class stop_sched_class;

extern const struct sched_class dl_sched_class;

extern const struct sched_class rt_sched_class;

extern const struct sched_class fair_sched_class;

extern const struct sched_class idle_sched_class;

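Scheduler class: Description

idle_sched_class: each CPU's first thread, pid 0 (swapper), is a static thread that belongs to idle_sched_class, which is why it does not show up in ps. It generally runs during boot and when a CPU hits an exception and dumps state.

stop_sched_class: the highest-priority class. Its threads preempt everything else and cannot be preempted by any other task. Uses: 1. migrating tasks between CPUs (cpu_stop_cpu_callback); 2. shutting down tasks under HOTPLUG_CPU.

rt_sched_class: RT; used for real-time threads.

fair_sched_class: CFS (completely fair); used for ordinary threads.

The scheduling classes currently rank Stop > Deadline > RealTime > Fair > Idle.

Developers can assign a task to whichever scheduling class suits their design.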

Process address space

/* http://lxr.free-electrons.com/source/include/linux/sched.h?V=4.5#L1453 */

struct mm_struct *mm, *active_mm;

/* per-thread vma caching */

u32 vmacache_seqnum;

struct vm_area_struct *vmacache[VMACACHE_SIZE];

#if defined(SPLIT_RSS_COUNTING)

struct task_rss_stat rss_stat;

#endif

/* http://lxr.free-electrons.com/source/include/linux/sched.h?V=4.5#L1484 */

#ifdef CONFIG_COMPAT_BRK

unsigned brk_randomized:1;

#endif

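Field: Description

mm: the memory descriptor for the user address space owned by the process; for kernel threads, mm is NULL.

active_mm: the memory descriptor the process is actually running on. For an ordinary process the two pointers are equal. A kernel thread has no address space of its own, so its tsk->mm is NULL; the kernel still needs a valid address space to run on, so the thread's active_mm is set to the active_mm of the task that ran before it.

brk_randomized: records whether the heap start (brk) has been randomized. See the discussion on LKML.

rss_stat: caches per-task RSS counter updates.

It follows that if the task that ran before a kernel thread was itself a kernel thread, the incoming thread's mm is NULL and its active_mm is the one the previous thread had borrowed.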

Status flags

int exit_code, exit_signal;

int pdeath_signal; /* The signal sent when the parent dies */

unsigned long jobctl; /* JOBCTL_*, siglock protected */

/* Used for emulating ABI behavior of previous Linux versions */

unsigned int personality;

/* scheduler bits, serialized by scheduler locks */

unsigned sched_reset_on_fork:1;

unsigned sched_contributes_to_load:1;

unsigned sched_migrated:1;

unsigned :0; /* force alignment to the next boundary */

/* unserialized, strictly 'current' */

unsigned in_execve:1; /* bit to tell LSMs we're in execve */

unsigned in_iowait:1;

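Field: Description

exit_code: the termination code of the process. It is either the argument passed to the _exit() or exit_group() system call (normal termination) or an error code supplied by the kernel (abnormal termination).

exit_signal: set to -1 when the task is a member of a thread group. Only when the last member of the group terminates is a signal generated to notify the parent of the group leader.

pdeath_signal: the signal sent to this process when its parent dies.

personality: used to handle different ABIs; see the Linux man pages.

in_execve: tells LSMs that the task is inside do_execve(). See the patch description on LKML.

in_iowait: indicates whether the task counts toward iowait accounting.

sched_reset_on_fork: indicates whether the priority and scheduling policy are reset to their defaults on fork.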
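The exit_code a child stores is what its parent later reads back through the wait family of calls; until the parent does so, the child remains a zombie. The following userspace sketch shows the round trip using fork(), _exit() and waitpid(); the function name run_child_and_collect() is invented for this example.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that terminates immediately with the given code, then
 * reap it and return the code the kernel recorded for it.
 * Returns -1 on any failure or if the child did not exit normally. */
int run_child_and_collect(int code)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);              /* child: becomes a zombie until reaped */

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;                /* parent: reaping failed */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Only the low eight bits of the exit argument reach the parent, so this sketch is meaningful for codes between 0 and 255.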

Time

cputime_t utime, stime, utimescaled, stimescaled;

cputime_t gtime;

struct prev_cputime prev_cputime;

#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN

seqcount_t vtime_seqcount;

unsigned long long vtime_snap;

enum {

/* Task is sleeping or running in a CPU with VTIME inactive */

VTIME_INACTIVE = 0,

/* Task runs in userspace in a CPU with VTIME active */

VTIME_USER,

/* Task runs in kernelspace in a CPU with VTIME active */

VTIME_SYS,

} vtime_snap_whence;

#endif

unsigned long nvcsw, nivcsw; /* context switch counts */

u64 start_time; /* monotonic time in nsec */

u64 real_start_time; /* boot based time in nsec */

/* mm fault and swap info: this can arguably be seen as either mm-specific or thread-specific */

unsigned long min_flt, maj_flt;

struct task_cputime cputime_expires;

struct list_head cpu_timers[3];

/* process credentials */

const struct cred __rcu *real_cred; /* objective and real subjective task

* credentials (COW) */

const struct cred __rcu *cred; /* effective (overridable) subjective task

* credentials (COW) */

char comm[TASK_COMM_LEN]; /* executable name excluding path

- access with [gs]et_task_comm (which lock

it with task_lock())

- initialized normally by setup_new_exec */

/* file system info */

struct nameidata *nameidata;

#ifdef CONFIG_SYSVIPC

/* ipc stuff */

struct sysv_sem sysvsem;

struct sysv_shm sysvshm;

#endif

#ifdef CONFIG_DETECT_HUNG_TASK

/* hung task detection */

unsigned long last_switch_count;

#endif

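Field: Description

utime/stime: the number of timer ticks the process has spent in user mode and in kernel mode.

prev_cputime (prev_utime/prev_stime in older kernels): the previously reported run times; see the patch description on LKML.

utimescaled/stimescaled: the user-mode and kernel-mode run times, scaled by processor frequency.

gtime: virtual machine (guest) run time, counted in ticks.

nvcsw/nivcsw: counts of voluntary and involuntary context switches.

last_switch_count: the sum of nvcsw and nivcsw.

start_time/real_start_time: the creation time of the process; real_start_time also includes time spent asleep. Commonly used for /proc/pid/stat. See the patch description on LKML.

cputime_expires: tracks the processor time being monitored for the process or process group; its three members correspond to the three lists in cpu_timers[3].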
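The context-switch counters are visible to user programs through getrusage(), whose ru_nvcsw and ru_nivcsw fields report voluntary and involuntary switches for the calling process. A minimal sketch, assuming a Linux or other POSIX system that fills in these fields; struct ctxsw_counts and read_ctxsw_counts() are names made up for this example.

```c
#include <sys/resource.h>

/* Hypothetical container for the two counters. */
struct ctxsw_counts {
    long voluntary;    /* process gave up the CPU, e.g. to wait for I/O */
    long involuntary;  /* process was preempted by the scheduler */
};

/* Fill *out with the calling process's context-switch counts.
 * Returns 0 on success, -1 if getrusage() fails. */
int read_ctxsw_counts(struct ctxsw_counts *out)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    out->voluntary = ru.ru_nvcsw;
    out->involuntary = ru.ru_nivcsw;
    return 0;
}
```

Calling this before and after a blocking operation such as sleep() gives a rough view of how the voluntary count grows when a task goes to sleep on its own.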

Signal handling

/* signal handlers */

struct signal_struct *signal;

struct sighand_struct *sighand;

sigset_t blocked, real_blocked;

sigset_t saved_sigmask; /* restored if set_restore_sigmask() was used */

struct sigpending pending;

unsigned long sas_ss_sp;

size_t sas_ss_size;

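Field: Description

signal: points to the signal descriptor of the process.

sighand: points to the signal-handler descriptor of the process.

blocked: the mask of blocked signals; real_blocked is a temporary mask.

pending: the data structure holding the private pending signals.

sas_ss_sp: the address of the alternate stack for signal handlers; sas_ss_size is the size of that stack.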

Other members

(1) Spinlock protecting the allocation and release of resources

/* Protection of (de-)allocation: mm, files, fs, tty, keyrings, mems_allowed,

* mempolicy */

spinlock_t alloc_lock;

(2) Usage count of the process descriptor. A value of 2 means the descriptor is in use and the corresponding process is active.

atomic_t usage;

(3) Number of times the big kernel lock (BKL) has been acquired; -1 if the process has never taken the lock.

int lock_depth; /* BKL lock depth */

(4) Support for unlocked context switches on SMP

#ifdef CONFIG_SMP

#ifdef __ARCH_WANT_UNLOCKED_CTXSW

int oncpu;

#endif

#endif

(5) List of preempt_notifier structures

#ifdef CONFIG_PREEMPT_NOTIFIERS

/* list of struct preempt_notifier: */

struct hlist_head preempt_notifiers;

#endif

(6) FPU usage counter

unsigned char fpu_counter;

(7) blktrace, a tracing tool for the block-device I/O layer of the Linux kernel

#ifdef CONFIG_BLK_DEV_IO_TRACE

unsigned int btrace_seq;

#endif

(8) RCU synchronization primitives

#ifdef CONFIG_PREEMPT_RCU

int rcu_read_lock_nesting;

char rcu_read_unlock_special;

struct list_head rcu_node_entry;

#endif /* #ifdef CONFIG_PREEMPT_RCU */

#ifdef CONFIG_TREE_PREEMPT_RCU

struct rcu_node *rcu_blocked_node;

#endif /* #ifdef CONFIG_TREE_PREEMPT_RCU */

#ifdef CONFIG_RCU_BOOST

struct rt_mutex *rcu_boost_mutex;

#endif /* #ifdef CONFIG_RCU_BOOST */

(9) Run-time statistics gathered by the scheduler

#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)

struct sched_info sched_info;

#endif

(10) Linkage for the list of all processes

struct list_head tasks;

(11) To limit pushing to one attempt

#ifdef CONFIG_SMP

struct plist_node pushable_tasks;

#endif

For the patch description, see http://lkml.indiana.edu/hypermail/linux/kernel/0808.3/0503.html

(12) Protection against kernel stack overflow

#ifdef CONFIG_CC_STACKPROTECTOR

/* Canary value for the -fstack-protector gcc feature */

unsigned long stack_canary;

#endif

The kernel must be compiled with GCC's -fstack-protector option for this to take effect.

(13) PID hash table and list linkage

/* PID/PID hash table linkage. */

struct pid_link pids[PIDTYPE_MAX];

struct list_head thread_group; /* list of all threads in the thread group */

(14) Members used by do_fork()

struct completion *vfork_done; /* for vfork() */

int __user *set_child_tid; /* CLONE_CHILD_SETTID */

int __user *clear_child_tid; /* CLONE_CHILD_CLEARTID */

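When do_fork() is called with a particular flag, vfork_done points to a special address.

If the clone_flags argument of copy_process() has CLONE_CHILD_SETTID or CLONE_CHILD_CLEARTID set, the value of the child_tidptr argument is copied into set_child_tid or clear_child_tid respectively. These flags indicate that the variable child_tidptr points to, in the child's user address space, must be modified.

(15) Page-fault counters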

/* mm fault and swap info: this can arguably be seen as either mm-specific or thread-specific */

unsigned long min_flt, maj_flt;

(16) Process credentials

const struct cred __rcu *real_cred; /* objective and real subjective task

* credentials (COW) */

const struct cred __rcu *cred; /* effective (overridable) subjective task

* credentials (COW) */

struct cred *replacement_session_keyring; /* for KEYCTL_SESSION_TO_PARENT */

(17) Name of the executable

char comm[TASK_COMM_LEN];

(18) Files

/* file system info */

int link_count, total_link_count;

/* filesystem information */

struct fs_struct *fs;

/* open file information */

struct files_struct *files;

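fs describes the process's relationship to the filesystem, including its current directory and root directory.

files describes the files the process currently has open.

(19) Inter-process communication (System V IPC)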

#ifdef CONFIG_SYSVIPC

/* ipc stuff */

struct sysv_sem sysvsem;

#endif

(20) Processor-specific data

/* CPU-specific state of this task */

struct thread_struct thread;

(21) Namespaces

/* namespaces */

struct nsproxy *nsproxy;

(22) Process auditing

struct audit_context *audit_context;

#ifdef CONFIG_AUDITSYSCALL

uid_t loginuid;

unsigned int sessionid;

#endif

(23) Secure computing (seccomp)

seccomp_t seccomp;

(24) Used by copy_process() when the CLONE_PARENT flag is set

/* Thread group tracking */

u32 parent_exec_id;

u32 self_exec_id;

(25) Interrupts

#ifdef CONFIG_GENERIC_HARDIRQS

/* IRQ handler threads */

struct irqaction *irqaction;

#endif

#ifdef CONFIG_TRACE_IRQFLAGS

unsigned int irq_events;

unsigned long hardirq_enable_ip;

unsigned long hardirq_disable_ip;

unsigned int hardirq_enable_event;

unsigned int hardirq_disable_event;

int hardirqs_enabled;

int hardirq_context;

unsigned long softirq_disable_ip;

unsigned long softirq_enable_ip;

unsigned int softirq_disable_event;

unsigned int softirq_enable_event;

int softirqs_enabled;

int softirq_context;

#endif

(26) Lock used by the task_rq_lock() function

/* Protection of the PI data structures: */

raw_spinlock_t pi_lock;

(27) Mutex waiters under the PI protocol, where PI stands for priority inheritance

#ifdef CONFIG_RT_MUTEXES

/* PI waiters blocked on a rt_mutex held by this task */

struct plist_head pi_waiters;

/* Deadlock detection and priority inheritance handling */

struct rt_mutex_waiter *pi_blocked_on;

#endif

(28) Deadlock detection

#ifdef CONFIG_DEBUG_MUTEXES

/* mutex deadlock detection */

struct mutex_waiter *blocked_on;

#endif

(29) lockdep; see the kernel documentation file linux-2.6.38.8/Documentation/lockdep-design.txt

#ifdef CONFIG_LOCKDEP

# define MAX_LOCK_DEPTH 48UL

u64 curr_chain_key;

int lockdep_depth;

unsigned int lockdep_recursion;

struct held_lock held_locks[MAX_LOCK_DEPTH];

gfp_t lockdep_reclaim_gfp;

#endif

(30) Journaling filesystem information

/* journalling filesystem info */

void *journal_info;

(31) Block-device list

/* stacked block device info */

struct bio_list *bio_list;

(32) Memory reclaim

struct reclaim_state *reclaim_state;

(33) Information about block-device I/O traffic

struct backing_dev_info *backing_dev_info;

(34) Information used by the I/O scheduler

struct io_context *io_context;

(35) Per-process I/O accounting

struct task_io_accounting ioac;

#if defined(CONFIG_TASK_XACCT)

u64 acct_rss_mem1; /* accumulated rss usage */

u64 acct_vm_mem1; /* accumulated virtual memory usage */

cputime_t acct_timexpd; /* stime + utime since last update */

#endif

On Ubuntu 11.04, the I/O counters of process 1 can be read with cat; the fields in the output correspond exactly to the members of struct task_io_accounting.

(36) CPUSET support

#ifdef CONFIG_CPUSETS

nodemask_t mems_allowed; /* Protected by alloc_lock */

int mems_allowed_change_disable;

int cpuset_mem_spread_rotor;

int cpuset_slab_spread_rotor;

#endif

(37) Control groups

#ifdef CONFIG_CGROUPS

/* Control Group info protected by css_set_lock */

struct css_set __rcu *cgroups;

/* cg_list protected by css_set_lock and tsk->alloc_lock */

struct list_head cg_list;

#endif

#ifdef CONFIG_CGROUP_MEM_RES_CTLR /* memcg uses this to do batch job */

struct memcg_batch_info {

int do_batch; /* incremented when batch uncharge started */

struct mem_cgroup *memcg; /* target memcg of uncharge */

unsigned long bytes; /* uncharged usage */

unsigned long memsw_bytes; /* uncharged mem+swap usage */

} memcg_batch;

#endif

(38) futex synchronization

#ifdef CONFIG_FUTEX

struct robust_list_head __user *robust_list;

#ifdef CONFIG_COMPAT

struct compat_robust_list_head __user *compat_robust_list;

#endif

struct list_head pi_state_list;

struct futex_pi_state *pi_state_cache;

#endif

(39) Non-uniform memory access (NUMA)

#ifdef CONFIG_NUMA

struct mempolicy *mempolicy; /* Protected by alloc_lock */

short il_next;

#endif

(40) Exclusive filesystem resources

atomic_t fs_excl; /* holding fs exclusive resources */

(41) RCU list head

struct rcu_head rcu;

(42) Pipe

struct pipe_inode_info *splice_pipe;

(43) Delay accounting

#ifdef CONFIG_TASK_DELAY_ACCT

struct task_delay_info *delays;

#endif

(44) Fault injection; see the kernel documentation file linux-2.6.38.8/Documentation/fault-injection/fault-injection.txt

#ifdef CONFIG_FAULT_INJECTION

int make_it_fail;

#endif

(45) Floating proportions

struct prop_local_single dirties;

(46) Infrastructure for displaying latency

#ifdef CONFIG_LATENCYTOP

int latency_record_count;

struct latency_record latency_record[LT_SAVECOUNT];

#endif

(47) Timer slack values, commonly used by the poll and select functions

unsigned long timer_slack_ns;

unsigned long default_timer_slack_ns;

(48) Socket control messages

struct list_head *scm_work_list;

(49) ftrace tracer

#ifdef CONFIG_FUNCTION_GRAPH_TRACER

/* Index of current stored address in ret_stack */

int curr_ret_stack;

/* Stack of return addresses for return function tracing */

struct ftrace_ret_stack *ret_stack;

/* time stamp for last schedule */

unsigned long long ftrace_timestamp;

/*
 * Number of functions that haven't been traced
 * because of depth overrun.
 */

atomic_t trace_overrun;

/* Pause for the tracing */

atomic_t tracing_graph_pause;

#endif

#ifdef CONFIG_TRACING

/* state flags for use by tracers */

unsigned long trace;

/* bitmask of trace recursion */

unsigned long trace_recursion;

#endif /* CONFIG_TRACING */
