
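Reading the Linux Kernel md Source, Part 11: raid5d

With the read/write groundwork from the previous article in place, we can now look at the raid5d code. raid5d is neither the entry point for reads and writes nor the place where they are actually processed; it is simply a transfer station, a traffic hub. But this hub occupies the commanding heights, much like a US base in Singapore that sits on the sea lanes between the Pacific and Indian Oceans.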

4626 /* 
4627  * This is our raid5 kernel thread. 
4628  * 
4629  * We scan the hash table for stripes which can be handled now. 
4630  * During the scan, completed stripes are saved for us by the interrupt 
4631  * handler, so that they will not have to wait for our next wakeup. 
4632  */
4633 static void raid5d(struct mddev *mddev)  
4634 {  
4635         struct r5conf *conf = mddev->private;  
4636         int handled;  
4637         struct blk_plug plug;  
4638  
4639         pr_debug("+++ raid5d active\n");  
4640  
4641         md_check_recovery(mddev);  
4642  
4643         blk_start_plug(&plug);  
4644         handled = 0;  
4645         spin_lock_irq(&conf->device_lock);  
4646         while (1) {  
4647                 struct bio *bio;  
4648                 int batch_size;  
4649  
4650                 if (  
4651                     !list_empty(&conf->bitmap_list)) {  
4652                         /* Now is a good time to flush some bitmap updates */
4653                         conf->seq_flush++;  
4654                         spin_unlock_irq(&conf->device_lock);  
4655                         bitmap_unplug(mddev->bitmap);  
4656                         spin_lock_irq(&conf->device_lock);  
4657                         conf->seq_write = conf->seq_flush;  
4658                         activate_bit_delay(conf);  
4659                 }

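Line 4641: md_check_recovery, which we covered earlier, checks whether a resync needs to be started.

Line 4643: blk_start_plug pairs with blk_finish_plug on line 4688; together they let requests be merged.

Line 4646: why the big loop here? The comment at line 4629 may be a little confusing at first, but once you see this loop it is clear what it refers to. The comment says that we need not wait for the next wakeup of the raid5 thread and can keep processing stripes, because some stripes may already have been completed by the interrupt handler and returned.

Line 4651: checks whether the array's bitmap_list is empty and enters the branch if it is not. What does the bitmap have to do with stripe handling? That question has some history behind it. For a raid5 array, nothing is worse than an unexpected power loss in the middle of a write, because the array then cannot tell which data is consistent and which is not. That is the job of safemode: recording whether the array's data is consistent. But the price of an inconsistency is a full resync of the whole array, which is raid5's biggest headache. Now the bitmap comes along to solve this, and everyone is happy. How does it do it? The bitmap records every stripe before it is written and clears the record once the write completes. After a power loss, only the stripes whose writes had not completed need to be resynced. Wonderful! But do not celebrate too early, because the bitmap is not easy to please either: the bitmap update must be on disk before the stripe is written, which means write-through, that is, a synchronous write. That hurts: writing the bitmap is so slow that it drags raid5 performance right down. Hence bitmap_list: raid5 tells the bitmap to write in batches, somewhat like merging bio requests. Even so, this only partly offsets the performance cost the bitmap brings.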
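The bookkeeping idea behind the bitmap can be illustrated with a toy user-space model. This is only a sketch and not the md bitmap code: `toy_bitmap`, `toy_write_begin`, `toy_write_end` and `toy_resync_count` are hypothetical names, the array size is made up, and the "bitmap" is a plain in-memory array rather than an on-disk structure.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_STRIPES 64          /* hypothetical array size, in stripes */

/* One byte per stripe: 1 = write in flight (dirty), 0 = clean. */
struct toy_bitmap {
    unsigned char dirty[TOY_STRIPES];
};

static void toy_bitmap_init(struct toy_bitmap *bm)
{
    memset(bm->dirty, 0, sizeof(bm->dirty));
}

/* Must be recorded before the stripe data itself is written. */
static void toy_write_begin(struct toy_bitmap *bm, size_t stripe)
{
    assert(stripe < TOY_STRIPES);
    bm->dirty[stripe] = 1;
}

/* Cleared once the write to the stripe has completed. */
static void toy_write_end(struct toy_bitmap *bm, size_t stripe)
{
    assert(stripe < TOY_STRIPES);
    bm->dirty[stripe] = 0;
}

/* After a crash, only stripes still marked dirty need a resync. */
static size_t toy_resync_count(const struct toy_bitmap *bm)
{
    size_t i, n = 0;

    for (i = 0; i < TOY_STRIPES; i++)
        n += bm->dirty[i];
    return n;
}
```

If two writes begin and only one finishes before the simulated crash, `toy_resync_count` reports one stripe instead of the whole array, which is the saving the bitmap buys.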

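Line 4655: issues the batched bitmap write.

Line 4657: updates the sequence number of the batched bitmap write.

Line 4658: releases the stripes that were waiting for the bitmap write.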

4660 raid5_activate_delayed(conf);

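Line 4660: the function name says it activates delayed stripes. Why delay stripe handling at all? In the block layer, delaying is the usual way to get requests merged, and the same reasoning applies here. So when is a stripe delayed? Let us step into raid5_activate_delayed: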

3691 static void raid5_activate_delayed(struct r5conf *conf)  
3692 {  
3693     if (atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD) {  
3694          while (!list_empty(&conf->delayed_list)) {  
3695               struct list_head *l = conf->delayed_list.next;  
3696               struct stripe_head *sh;  
3697               sh = list_entry(l, struct stripe_head, lru);  
3698               list_del_init(l);  
3699               clear_bit(STRIPE_DELAYED, &sh->state);  
3700               if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))  
3701                    atomic_inc(&conf->preread_active_stripes);  
3702               list_add_tail(&sh->lru, &conf->hold_list);  
3703          }  
3704     }  
3705 }

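Line 3693: limits the number of stripes doing preread.

Line 3694: walks the array's delayed list.

Line 3695: takes the first node of the delayed list.

Line 3697: gets the stripe that node belongs to.

Line 3698: removes that stripe from the delayed list.

Line 3700: sets the preread flag.

Line 3702: adds the stripe to hold_list, the preread list.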
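One detail worth noticing is that the threshold is tested once, before the loop, so when the gate is open the whole delayed list is moved in one go. A minimal counter-based sketch of that behavior follows. The names `toy_lists` and `toy_activate_delayed` and the value of `TOY_IO_THRESHOLD` are assumptions for illustration, and the model assumes no delayed stripe already carries the preread flag.

```c
#include <assert.h>

#define TOY_IO_THRESHOLD 1      /* assumed value, for illustration only */

/* Counters standing in for the kernel's lists and atomic counter. */
struct toy_lists {
    int delayed;         /* stripes on the delayed list */
    int hold;            /* stripes on the hold (preread) list */
    int preread_active;  /* stripes counted as preread-active */
};

/* Returns how many stripes were moved from delayed to hold. */
static int toy_activate_delayed(struct toy_lists *l)
{
    int moved = 0;

    assert(l->delayed >= 0);
    /* The gate is checked once; inside, the delayed list is drained. */
    if (l->preread_active < TOY_IO_THRESHOLD) {
        while (l->delayed > 0) {
            l->delayed--;
            l->preread_active++;   /* models setting the preread flag */
            l->hold++;             /* models list_add_tail to hold_list */
            moved++;
        }
    }
    return moved;
}
```

Calling it a second time while `preread_active` is still at or above the threshold moves nothing, so newly delayed stripes keep waiting (and keep collecting writes) until the earlier prereads drain.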

When does a stripe get put on the delayed list? Searching for conf->delayed_list shows that stripes are added when they have the STRIPE_DELAYED flag set:

204          if (test_bit(STRIPE_DELAYED, &sh->state) &&  
205              !test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))  
206               list_add_tail(&sh->lru, &conf->delayed_list);

And when does a stripe get STRIPE_DELAYED set? Searching for the flag again, with only the relevant code excerpted:

2772 static void handle_stripe_dirtying(struct r5conf *conf,  
2773                        struct stripe_head *sh,  
2774                        struct stripe_head_state *s,  
2775                        int disks)  
2776 {  
...  
2808     set_bit(STRIPE_HANDLE, &sh->state);  
2809     if (rmw < rcw && rmw > 0)  
...  
2825                    } else {  
2826                         set_bit(STRIPE_DELAYED, &sh->state);  
2827                         set_bit(STRIPE_HANDLE, &sh->state);  
2828                    }  
2829               }  
2830          }  
2831     if (rcw <= rmw && rcw > 0) {  
...  
2851                    } else {  
2852                         set_bit(STRIPE_DELAYED, &sh->state);  
2853                         set_bit(STRIPE_HANDLE, &sh->state);  
2854                    }

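STRIPE_DELAYED is set in two cases, rcw (reconstruct-write) and rmw (read-modify-write). Neither is a full-stripe write, and both have to preread from disk, so they are bound to be less efficient than a full-stripe write. That is why they are delayed: to give requests a chance to merge. How does the merging work? Following the code, briefly:

1) The first partial-stripe write arrives, gets a struct stripe_head, and is put on the array's delayed_list for delayed handling.

2) A second write arrives, hits the same stripe, and its bio is attached to the same struct stripe_head.

3) Issuing the request now takes fewer I/Os, and if the writes add up to a full stripe, no reads need to be issued at all.

Of course there are many other ways a stripe can be hit; any hit improves speed.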
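The benefit of merging in the three steps above can be put into numbers with a small sketch. It assumes that nothing is cached, that every pending write overwrites its whole block, and that there is a single parity block; the helper names (`toy_rmw_reads`, `toy_rcw_reads`, `toy_preread_cost`) are hypothetical and this is not the counting code from handle_stripe_dirtying.

```c
#include <assert.h>

/* Count set bits among the low ndata bits of mask (one bit per data block). */
static int toy_popcount(unsigned int mask, int ndata)
{
    int i, n = 0;

    assert(ndata > 0 && ndata <= 32);
    for (i = 0; i < ndata; i++)
        if (mask & (1u << i))
            n++;
    return n;
}

/* rmw: read the old copies of the blocks being written, plus the old parity. */
static int toy_rmw_reads(unsigned int write_mask, int ndata)
{
    return toy_popcount(write_mask, ndata) + 1;
}

/* rcw: read the data blocks that are NOT being written. */
static int toy_rcw_reads(unsigned int write_mask, int ndata)
{
    return ndata - toy_popcount(write_mask, ndata);
}

/* Reads issued before the write: the cheaper of the two strategies. */
static int toy_preread_cost(unsigned int write_mask, int ndata)
{
    int rmw = toy_rmw_reads(write_mask, ndata);
    int rcw = toy_rcw_reads(write_mask, ndata);

    return rmw < rcw ? rmw : rcw;
}
```

With four data blocks, two single-block writes handled separately cost two reads each, four in total. Merged into one stripe_head (mask `0x3`) they cost two reads, and a full stripe (mask `0xF`) costs none.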

Back in raid5d:

4662                 while ((bio = remove_bio_from_retry(conf))) {  
4663                         int ok;  
4664                         spin_unlock_irq(&conf->device_lock);  
4665                         ok = retry_aligned_read(conf, bio);  
4666                         spin_lock_irq(&conf->device_lock);  
4667                         if (!ok)  
4668                                 break;  
4669                         handled++;  
4670                 }

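This handles another of the array's lists, the retry list for chunk-aligned reads. In a raid5 array, a read request that falls exactly within a chunk can be sent straight to the disk. If a struct stripe_head cannot be obtained at that point, however, the request is put on the aligned-read retry list; when a struct stripe_head is released, raid5d is woken up and the aligned read is issued again.

Moving on: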

4672          batch_size = handle_active_stripes(conf);  
4673          if (!batch_size)  
4674               break;

handle_active_stripes is the main battlefield for stripe handling, since most stripes are processed through it. Let us step in:

4601 #define MAX_STRIPE_BATCH 8  
4602 static int handle_active_stripes(struct r5conf *conf)  
4603 {  
4604     struct stripe_head *batch[MAX_STRIPE_BATCH], *sh;  
4605     int i, batch_size = 0;  
4606  
4607     while (batch_size < MAX_STRIPE_BATCH &&  
4608               (sh = __get_priority_stripe(conf)) != NULL)  
4609          batch[batch_size++] = sh;  
4610  
4611     if (batch_size == 0)  
4612          return batch_size;  
4613     spin_unlock_irq(&conf->device_lock);  
4614  
4615     for (i = 0; i < batch_size; i++)  
4616          handle_stripe(batch[i]);  
4617  
4618     cond_resched();  
4619  
4620     spin_lock_irq(&conf->device_lock);  
4621     for (i = 0; i < batch_size; i++)  
4622          __release_stripe(conf, batch[i]);  
4623     return batch_size;  
4624 }

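The function can be taken in almost at a glance. First a loop collects up to MAX_STRIPE_BATCH stripes into the batch array; line 4615 handles the stripes in the array one by one; line 4618 gives the scheduler a chance to run; line 4621 puts the stripes back on the array's lists; then the next round begins.

Now into __get_priority_stripe, to see how stripes are actually chosen.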
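The collect-then-process pattern can be sketched in user space as follows. This is a simplified model under stated assumptions: `toy_queue`, `toy_collect_batch` and `toy_drain` are hypothetical names, stripes are plain integers, and the locking is only indicated in comments.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_BATCH 8   /* same value as MAX_STRIPE_BATCH in the listing */

struct toy_queue {
    const int *items;   /* pending stripe ids */
    size_t head;        /* next item to hand out */
    size_t count;       /* total number of items */
};

/* Pull up to TOY_MAX_BATCH items; the kernel does this under device_lock. */
static size_t toy_collect_batch(struct toy_queue *q, int batch[TOY_MAX_BATCH])
{
    size_t n = 0;

    assert(q->head <= q->count);
    while (n < TOY_MAX_BATCH && q->head < q->count)
        batch[n++] = q->items[q->head++];
    return n;
}

/* Drain the queue the way the raid5d loop does: stop on an empty batch. */
static size_t toy_drain(struct toy_queue *q, size_t *rounds)
{
    int batch[TOY_MAX_BATCH];
    size_t n, handled = 0;

    *rounds = 0;
    while ((n = toy_collect_batch(q, batch)) != 0) {
        /* The lock would be dropped here while each stripe is handled. */
        handled += n;
        (*rounds)++;
    }
    return handled;
}
```

Nineteen queued stripes are drained in three rounds of 8, 8 and 3. The point of the batch is that the lock is taken and released once per round rather than once per stripe.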

3966 /* __get_priority_stripe - get the next stripe to process 
3967 * 
3968 * Full stripe writes are allowed to pass preread active stripes up until 
3969 * the bypass_threshold is exceeded.  In general the bypass_count 
3970 * increments when the handle_list is handled before the hold_list; however, it 
3971 * will not be incremented when STRIPE_IO_STARTED is sampled set signifying a 
3972 * stripe with in flight i/o.  The bypass_count will be reset when the 
3973 * head of the hold_list has changed, i.e. the head was promoted to the 
3974 * handle_list. 
3975 */

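Every society has its privileged class and every country its aristocracy, so not all stripes are equal. The function name tells us at once that privileged stripes are picked first, just as in the film 2012, where only the chosen get to board the ark. We may not be able to flip name tablets like an ancient emperor, but we do have the power to choose which stripes get handled first.

First priority goes to handle_list, second to hold_list.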

3976 static struct stripe_head *__get_priority_stripe(struct r5conf *conf)  
3977 {  
3978     struct stripe_head *sh;  
3979  
3980     pr_debug("%s: handle: %s hold: %s full_writes: %d bypass_count: %d\n",  
3981            __func__,  
3982            list_empty(&conf->handle_list) ? "empty" : "busy",  
3983            list_empty(&conf->hold_list) ? "empty" : "busy",  
3984            atomic_read(&conf->pending_full_writes), conf->bypass_count);  
3985  
3986     if (!list_empty(&conf->handle_list)) {  
3987          sh = list_entry(conf->handle_list.next, typeof(*sh), lru);  
3988  
3989          if (list_empty(&conf->hold_list))  
3990               conf->bypass_count = 0;  
3991          else if (!test_bit(STRIPE_IO_STARTED, &sh->state)) {  
3992               if (conf->hold_list.next == conf->last_hold)  
3993                    conf->bypass_count++;  
3994               else {  
3995                    conf->last_hold = conf->hold_list.next;  
3996                    conf->bypass_count -= conf->bypass_threshold;  
3997                    if (conf->bypass_count < 0)  
3998                         conf->bypass_count = 0;  
3999               }  
4000          }  
4001     } else if (!list_empty(&conf->hold_list) &&  
4002             ((conf->bypass_threshold &&  
4003               conf->bypass_count > conf->bypass_threshold) ||  
4004              atomic_read(&conf->pending_full_writes) == 0)) {  
4005          sh = list_entry(conf->hold_list.next,  
4006                    typeof(*sh), lru);  
4007          conf->bypass_count -= conf->bypass_threshold;  
4008          if (conf->bypass_count < 0)  
4009               conf->bypass_count = 0;  
4010     } else
4011          return NULL;  
4012  
4013     list_del_init(&sh->lru);  
4014     atomic_inc(&sh->count);  
4015     BUG_ON(atomic_read(&sh->count) != 1);  
4016     return sh;  
4017 }

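Line 3986: handle_list is preferred.

Line 3987: takes a stripe from it.

Line 3989: checks whether hold_list is empty. This is a society run by the privileged, so why check whether the common people below have enough to eat? Because the Linux kernel knows well that "water can carry a boat, and can also capsize it": squeeze the common people too hard and unrest follows, so at the critical moment the granaries must still be opened. What happens here is that the number of requests issued consecutively from handle_list is counted, and once it reaches a certain level, requests from hold_list are issued when there is idle time.

Line 3991: if the stripe does not already have I/O in flight.

Line 3992: the head of hold_list has not changed, i.e. no stripe has been issued from hold_list in the meantime.

Line 3993: increments bypass_count.

Line 3995: resets last_hold and reduces bypass_count.

Line 4001: hold_list is not empty, and either bypass_count exceeds the threshold or there are no pending full-stripe writes.

Line 4005: returns a stripe from hold_list.

Line 4007: updates bypass_count.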

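With so much handling of bypass_count, a short summary of its role is in order:

1) Taking a stripe from handle_list increments bypass_count, as long as hold_list is non-empty and its head has not changed.

2) If handle_list is empty, check whether bypass_count has exceeded bypass_threshold; if so, a stripe can be taken from hold_list, and bypass_count is reduced by bypass_threshold.

bypass_count thus throttles how fast the inefficient preread stripes are issued, giving I/O more chances to merge.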
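The selection rule summarized above can be replayed with a small counter-based model. It follows the branch structure of the listing, but it is only a sketch: `toy_conf` and `toy_pick_stripe` are hypothetical, the lists are reduced to lengths, `hold_head` is a made-up integer id for the stripe at the head of hold_list, and `io_started` stands in for the STRIPE_IO_STARTED test on the stripe taken from handle_list.

```c
#include <assert.h>

enum toy_pick { TOY_NONE, TOY_HANDLE, TOY_HOLD };

struct toy_conf {
    int handle_len;           /* stripes on handle_list */
    int hold_len;             /* stripes on hold_list */
    int hold_head;            /* id of the stripe at the head of hold_list */
    int last_hold;            /* hold_head as last remembered */
    int io_started;           /* models STRIPE_IO_STARTED on the handle head */
    int pending_full_writes;
    int bypass_count;
    int bypass_threshold;
};

static enum toy_pick toy_pick_stripe(struct toy_conf *c)
{
    assert(c->bypass_threshold >= 0);

    if (c->handle_len > 0) {
        if (c->hold_len == 0) {
            c->bypass_count = 0;
        } else if (!c->io_started) {
            if (c->hold_head == c->last_hold) {
                c->bypass_count++;          /* hold_list was bypassed again */
            } else {
                c->last_hold = c->hold_head;
                c->bypass_count -= c->bypass_threshold;
                if (c->bypass_count < 0)
                    c->bypass_count = 0;
            }
        }
        c->handle_len--;
        return TOY_HANDLE;
    }

    if (c->hold_len > 0 &&
        ((c->bypass_threshold && c->bypass_count > c->bypass_threshold) ||
         c->pending_full_writes == 0)) {
        c->bypass_count -= c->bypass_threshold;
        if (c->bypass_count < 0)
            c->bypass_count = 0;
        c->hold_len--;
        c->hold_head++;                     /* the next stripe becomes the head */
        return TOY_HOLD;
    }

    return TOY_NONE;
}
```

Starting with three stripes on handle_list, one on hold_list, a pending full-stripe write and a threshold of 1, the first three picks come from handle_list and raise bypass_count to 3. Only then, with handle_list empty and the count above the threshold, is the hold_list stripe released.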

Continuing with raid5d:

4675          handled += batch_size;  
4676  
4677          if (mddev->flags & ~(1<<MD_CHANGE_PENDING)) {  
4678               spin_unlock_irq(&conf->device_lock);  
4679               md_check_recovery(mddev);  
4680               spin_lock_irq(&conf->device_lock);  
4681          }  
4682     }

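Line 4675: adds to the count of handled stripes.

Line 4677: if the array state has changed, drops the device lock and runs the recovery check.

That is all there is to raid5d. Every stripe, from allocation to release, passes through raid5d at least once. raid5d welcomes one batch of new stripes and sees off another; each stripe is only a passing traveler. That concludes the tour of raid5d; the next part continues with the raid5 read/write path.

Source: http://blog.csdn.net/liumangxiong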