
Reading the Linux kernel md source code, part 13: raid5 retry read

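In the previous section we saw that when a chunk-aligned read fails, the callback raid5_align_endio adds the request to the array's retry list. Once the raid5d thread has been woken up, it calls retry_aligned_read on that request to retry the read: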

4539static int  retry_aligned_read(struct r5conf *conf, struct bio *raid_bio)  
4540{  
4541     /* We may not be able to submit a whole bio at once as there 
4542     * may not be enough stripe_heads available. 
4543     * We cannot pre-allocate enough stripe_heads as we may need 
4544     * more than exist in the cache (if we allow ever large chunks). 
4545     * So we do one stripe head at a time and record in 
4546     * ->bi_hw_segments how many have been done. 
4547     * 
4548     * We *know* that this entire raid_bio is in one chunk, so 
4549     * it will be only one 'dd_idx' and only need one call to raid5_compute_sector. 
4550     */

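If not enough struct stripe_head structures are available, we cannot submit the whole request at once. We also cannot reserve enough struct stripe_head structures in advance, so we submit one struct stripe_head at a time and record how many have been submitted in bio->bi_hw_segments.

Because this is a chunk-aligned read, the whole range of raid_bio lies inside a single chunk, so one call to raid5_compute_sector is enough to compute the disk index dd_idx.

From the comment we learn that the bio->bi_hw_segments field is reused here to record the number of struct stripe_head structures already submitted. How exactly is it used? Let's continue reading the code: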

4558     logical_sector = raid_bio->bi_sector & ~((sector_t)STRIPE_SECTORS-1);  
4559     sector = raid5_compute_sector(conf, logical_sector,  
4560                          0, &dd_idx, NULL);  
4561     last_sector = raid_bio->bi_sector + (raid_bio->bi_size>>9);

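Line 4558: compute the start sector of the stripe that contains the first sector of the request, because reads are processed in units of one stripe, which is one page in size.

Line 4559: compute the disk index dd_idx and the sector offset on that disk, sector.

Line 4561: the end sector of the request.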
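The arithmetic on lines 4558 and 4561 can be tried in isolation. The following is only a sketch: it assumes 512-byte sectors and 4 KiB pages (so STRIPE_SECTORS is 8), and the helper names stripe_start, request_end and stripes_covered are made up for this illustration and do not exist in the kernel.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages and 512-byte sectors, so one stripe
 * covers 8 sectors. */
#define SECTOR_SHIFT   9
#define STRIPE_SECTORS 8u

typedef uint64_t sector_t;

/* Round a sector down to the start of the stripe that contains it
 * (same mask as line 4558). */
static sector_t stripe_start(sector_t bi_sector)
{
    return bi_sector & ~((sector_t)STRIPE_SECTORS - 1);
}

/* First sector past the end of the request (same arithmetic as
 * line 4561); bi_size is in bytes. */
static sector_t request_end(sector_t bi_sector, uint32_t bi_size)
{
    return bi_sector + (bi_size >> SECTOR_SHIFT);
}

/* Number of iterations the loop starting at line 4563 would make. */
static unsigned stripes_covered(sector_t bi_sector, uint32_t bi_size)
{
    sector_t first = stripe_start(bi_sector);
    sector_t last = request_end(bi_sector, bi_size);

    return (unsigned)((last - first + STRIPE_SECTORS - 1) / STRIPE_SECTORS);
}
```

For a request that starts at sector 1003 and is 4096 bytes long, the start is rounded down to sector 1000, the end is sector 1011, and two stripes are touched. A full 64KB chunk starting on a chunk boundary gives 16 stripes.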

4563     for (; logical_sector < last_sector;  
4564          logical_sector += STRIPE_SECTORS,  
4565               sector += STRIPE_SECTORS,  
4566               scnt++) {  
4567  
4568          if (scnt < raid5_bi_processed_stripes(raid_bio))  
4569               /* already done this stripe */
4570               continue;  
4571  
4572          sh = get_active_stripe(conf, sector, 0, 1, 0);  
4573  
4574          if (!sh) {  
4575               /* failed to get a stripe - must wait */
4576               raid5_set_bi_processed_stripes(raid_bio, scnt);  
4577               conf->retry_read_aligned = raid_bio;  
4578               return handled;  
4579          }  
4580  
4581          if (!add_stripe_bio(sh, raid_bio, dd_idx, 0)) {  
4582               release_stripe(sh);  
4583               raid5_set_bi_processed_stripes(raid_bio, scnt);  
4584               conf->retry_read_aligned = raid_bio;  
4585               return handled;  
4586          }  
4587  
4588          set_bit(R5_ReadNoMerge, &sh->dev[dd_idx].flags);  
4589          handle_stripe(sh);  
4590          release_stripe(sh);  
4591          handled++;  
4592     }

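Line 4563: iterate over every stripe in the chunk. For example, if the chunk is 64KB, a stripe is 4KB, and the request covers the whole chunk, the loop runs 16 times.

Line 4568: if this stripe has already been submitted, skip it. As the comment above explained, bio->bi_hw_segments records how many stripes of the request have been submitted. For example, if only 8 stripes were submitted last time, this continue lets the next call to the function carry on with the remaining 8 stripes.

Line 4572: get the stripe_head that corresponds to sector.

Line 4574: if no stripe_head could be obtained, save the number of stripes already submitted and store raid_bio in the array's retry_read_aligned pointer. The next time raid5d is woken, it takes the bio directly from that pointer and continues submitting stripes.

Line 4578: return the number of stripes submitted in this call.

Line 4581: add the bio to the stripe_head's request list.

Line 4582: if that fails, release the stripe_head, record the number of stripes submitted, and save the read request for retry.

Line 4588: set the flag that tells the block layer not to merge this read.

Line 4589: handle the stripe.

Line 4590: drop the reference on the stripe_head.

Line 4591: increment the count of handled stripes.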
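The skip-and-resume pattern of lines 4568 to 4578 can be modelled in a few lines of user-space C. This is a simplified sketch, not kernel code: struct fake_bio and retry_once are hypothetical names, and the shortage of stripe_heads is simulated with a plain counter instead of get_active_stripe.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a bio; only the resume counter is modelled. */
struct fake_bio {
    unsigned total_stripes;     /* stripes covered by the request */
    unsigned processed_stripes; /* plays the role of the count saved at line 4576 */
};

/* Submit stripes until the request is finished or `available`
 * stripe_heads run out. Returns the number handled in this call,
 * like `handled` in retry_aligned_read. *done becomes true once the
 * whole request has been submitted. */
static unsigned retry_once(struct fake_bio *bio, unsigned available, bool *done)
{
    unsigned handled = 0;

    for (unsigned scnt = 0; scnt < bio->total_stripes; scnt++) {
        if (scnt < bio->processed_stripes)
            continue;                      /* already done in an earlier call */

        if (available == 0) {              /* get_active_stripe() would fail here */
            bio->processed_stripes = scnt; /* remember where to resume */
            *done = false;
            return handled;
        }
        available--;
        handled++;
    }

    bio->processed_stripes = bio->total_stripes;
    *done = true;
    return handled;
}
```

With 16 stripes and only 8 stripe_heads available per call, the first call handles stripes 0 to 7 and records 8; the second call skips those 8 and handles the rest.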

4593     remaining = raid5_dec_bi_active_stripes(raid_bio);  
4594     if (remaining == 0)  
4595          bio_endio(raid_bio, 0);  
4596     if (atomic_dec_and_test(&conf->active_aligned_reads))  
4597          wake_up(&conf->wait_for_stripe);  
4598     return handled;

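Line 4593: decrement the bio's count of active stripes.

Line 4594: every submitted stripe has completed.

Line 4595: call the request's completion callback.

Line 4596: decrement the array's count of in-flight aligned reads; when it reaches zero, line 4597 wakes up the processes waiting on wait_for_stripe.

Line 4598: return the number of stripes submitted.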
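The test on lines 4593 to 4595 is a reference-count pattern: the request completes only when the last holder drops its count. The sketch below illustrates that idea with C11 atomics. It assumes the count starts with one extra reference held by the submitter; this is a modelling assumption, and struct counted_bio, bio_init, bio_attach_stripe and bio_put are hypothetical names, not kernel functions.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of a bio that completes when its last
 * active stripe goes away. */
struct counted_bio {
    atomic_int active_stripes; /* assumption: 1 for the submitter + 1 per attached stripe */
    bool ended;                /* stands in for bio_endio() having been called */
};

static void bio_init(struct counted_bio *b)
{
    atomic_init(&b->active_stripes, 1); /* submitter's own reference */
    b->ended = false;
}

/* Called when the bio is attached to one more stripe. */
static void bio_attach_stripe(struct counted_bio *b)
{
    atomic_fetch_add(&b->active_stripes, 1);
}

/* Drop one reference and return how many remain, in the spirit of
 * raid5_dec_bi_active_stripes on line 4593. */
static int bio_put(struct counted_bio *b)
{
    int remaining = atomic_fetch_sub(&b->active_stripes, 1) - 1;

    if (remaining == 0)
        b->ended = true;
    return remaining;
}
```

In this model, a bio attached to two stripes is not ended when the submitter drops its reference; it ends only after both stripes have dropped theirs as well.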

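We have handed the stripe_head to handle_stripe. How does handle_stripe deal with a chunk-aligned read? Looking at handle_stripe, the first two if blocks are not executed, and we arrive directly at analyse_stripe. This function is long, but only a line or two actually matter here, so I will pick out just those lines; you can open the source code and follow along. First, retry_aligned_read allocated a stripe_head, and add_stripe_bio, the function that attaches the bio to the stripe_head, put the bio on the toread queue of the corresponding disk. Since this is a chunk-aligned read, only one data disk has a bio on its toread queue, so analyse_stripe executes: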

3245          if (test_bit(R5_Wantfill, &dev->flags))  
3246               s->to_fill++;  
3247          else if (dev->toread)  
3248               s->to_read++;

Line 3247: the device's read queue holds a request.

Line 3248: increment the number of devices that need to be read.

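The earlier chunk-aligned read failed because a sector on the physical disk was bad or the disk itself misbehaved. Either the rdev has been marked Faulty, or the sector on the physical disk is a bad block, i.e. is_badblock returns true for it. In the first case, where the disk is marked Faulty: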

3271          if (rdev && test_bit(Faulty, &rdev->flags))  
3272               rdev = NULL;  
3287          if (!rdev)  
3288               /* Not in-sync */;  
...  
3304          else if (test_bit(R5_UPTODATE, &dev->flags) &&  
3305               test_bit(R5_Expanded, &dev->flags))  
3310               set_bit(R5_Insync, &dev->flags);

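If Faulty is set, rdev is set to NULL, so the test at line 3287 succeeds and line 3310 is never reached; dev->flags therefore does not get R5_Insync.

In the second case the sector is a bad block, so the read attempt will necessarily have set R5_ReadError: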

3350          if (test_bit(R5_ReadError, &dev->flags))  
3351               clear_bit(R5_Insync, &dev->flags);

Line 3350: true.

Line 3351: clears R5_Insync.

So in either case the end result is the same: dev->flags ends up without R5_Insync. Continuing:

3352          if (!test_bit(R5_Insync, &dev->flags)) {  
3353               if (s->failed < 2)  
3354                    s->failed_num[s->failed] = i;  
3355               s->failed++;

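Line 3352: true.

Line 3353: true, because s->failed is still 0 at this point. failed_num has only two slots: RAID6 tolerates at most two failed devices, and a RAID5 array has already failed once two devices are gone.

Line 3354: record the index of the failed disk.

Line 3355: increment the count of failed disks.

So after this pass through analyse_stripe we come away with two treasures: first s->to_read, and second s->failed==1 together with s->failed_num[0]=i. Carrying these two, we return to handle_stripe: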
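The bookkeeping collected so far can be condensed into a small model. This is a sketch under simplifying assumptions: struct dev_model and struct state_model are hypothetical reductions of struct r5dev and struct stripe_head_state, the Faulty and bad-block checks are collapsed into a single insync field, and unused failed_num slots are set to -1 purely for readability (the kernel does not do that).

```c
#include <stdbool.h>

/* Hypothetical reduction of struct r5dev to the two facts used here. */
struct dev_model {
    bool toread; /* a read bio is queued on this device */
    bool insync; /* stands in for the R5_Insync flag */
};

/* Hypothetical reduction of struct stripe_head_state. */
struct state_model {
    int to_read;
    int failed;
    int failed_num[2];
};

/* Count pending reads and failed devices, following the excerpts at
 * lines 3247-3248 and 3352-3355. */
static void analyse(const struct dev_model *devs, int disks,
                    struct state_model *s)
{
    s->to_read = 0;
    s->failed = 0;
    s->failed_num[0] = s->failed_num[1] = -1; /* sketch-only marker for "unused" */

    for (int i = disks; i--; ) {
        if (devs[i].toread)
            s->to_read++;
        if (!devs[i].insync) {
            if (s->failed < 2)
                s->failed_num[s->failed] = i;
            s->failed++;
        }
    }
}
```

For a four-disk stripe in which disk 2 carries the read and is not in sync, the model yields to_read == 1, failed == 1 and failed_num[0] == 2, which is the state the walkthrough carries back into handle_stripe.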

3468     /* Now we might consider reading some blocks, either to check/generate 
3469     * parity, or to satisfy requests 
3470     * or to load a block that is being partially written. 
3471     */
3472     if (s.to_read || s.non_overwrite  
3473         || (conf->level == 6 && s.to_write && s.failed)  
3474         || (s.syncing && (s.uptodate + s.compute < disks))  
3475         || s.replacing  
3476         || s.expanding)  
3477          handle_stripe_fill(sh, &s, disks);

Check whether any reads are needed.

Line 3472: s.to_read is true, so we go straight into handle_stripe_fill.

2707/** 
2708 * handle_stripe_fill - read or compute data to satisfy pending requests. 
2709 */
2710static void handle_stripe_fill(struct stripe_head *sh,  
2711                      struct stripe_head_state *s,  
2712                      int disks)  
2713{  
2714     int i;  
2715  
2716     /* look for blocks to read/compute, skip this if a compute 
2717     * is already in flight, or if the stripe contents are in the 
2718     * midst of changing due to a write 
2719     */
2720     if (!test_bit(STRIPE_COMPUTE_RUN, &sh->state) && !sh->check_state &&  
2721         !sh->reconstruct_state)  
2722          for (i = disks; i--; )  
2723               if (fetch_block(sh, s, i, disks))  
2724                    break;  
2725     set_bit(STRIPE_HANDLE, &sh->state);  
2726}

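According to the comment, the function either reads the data directly or reads what is needed to compute it. Obviously the disk we want to read has failed, so what we must do now is read the other disks and compute the missing data from them.

handle_stripe_fill is an old friend by now; we already visited it when discussing Raid5 resync.

Line 2722: call fetch_block for every disk in the stripe.

We follow into fetch_block. We have read this function before, but times have changed, and looking at the same scenery in a different mood leaves a different impression.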

2624static int fetch_block(struct stripe_head *sh, struct stripe_head_state *s,  
2625                 int disk_idx, int disks)  
2626{  
2627     struct r5dev *dev = &sh->dev[disk_idx];  
2628     struct r5dev *fdev[2] = { &sh->dev[s->failed_num[0]],  
2629                      &sh->dev[s->failed_num[1]] };  
2630  
2631     /* is the data in this block needed, and can we get it? */
2632     if (!test_bit(R5_LOCKED, &dev->flags) &&  
2633         !test_bit(R5_UPTODATE, &dev->flags) &&  
2634         (dev->toread ||  
2635          (dev->towrite && !test_bit(R5_OVERWRITE, &dev->flags)) ||  
2636          s->syncing || s->expanding ||  
2637          (s->replacing && want_replace(sh, disk_idx)) ||  
2638          (s->failed >= 1 && fdev[0]->toread) ||  
2639          (s->failed >= 2 && fdev[1]->toread) ||  
2640          (sh->raid_conf->level <= 5 && s->failed && fdev[0]->towrite &&  
2641           !test_bit(R5_OVERWRITE, &fdev[0]->flags)) ||  
2642          (sh->raid_conf->level == 6 && s->failed && s->to_write))) {

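To restate our context: dev->toread of the disk with index i in the stripe is non-empty, s->failed==1, and s->failed_num[0]=i.

Inside fetch_block, when disk_idx==i, lines 2632 and 2633 hold and so does line 2634, so we enter the if branch. When disk_idx!=i, lines 2632 and 2633 hold and line 2638 holds, so we enter the if branch as well.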
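The two cases can be checked against a stripped-down version of the condition. The sketch below keeps only the terms of lines 2632 to 2638 that matter for a plain read with one failed device; the write, sync, replace, reshape and RAID6 terms are deliberately omitted, and struct blk and block_needed are hypothetical names.

```c
#include <stdbool.h>

/* Hypothetical reduction of struct r5dev for this check. */
struct blk {
    bool locked;   /* stands in for R5_LOCKED */
    bool uptodate; /* stands in for R5_UPTODATE */
    bool toread;   /* a read bio is queued on this device */
};

/* Simplified "is this block needed?" test: the block is wanted if it
 * has a reader itself, or if the first failed device has a reader and
 * this block is needed to reconstruct it. */
static bool block_needed(const struct blk *dev, int failed,
                         const struct blk *fdev0)
{
    return !dev->locked && !dev->uptodate &&
           (dev->toread || (failed >= 1 && fdev0->toread));
}
```

The failed device qualifies through its own toread, and every other device qualifies through the failed device's toread. With no failed device and no reader, a block is not needed.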

2648          if ((s->uptodate == disks - 1) &&  
...  
2670          } else if (s->uptodate == disks-2 && s->failed >= 2) {  
...  
2695          } else if (test_bit(R5_Insync, &dev->flags)) {  
2696               set_bit(R5_LOCKED, &dev->flags);  
2697               set_bit(R5_Wantread, &dev->flags);  
2698               s->locked++;  
2699               pr_debug("Reading block %d (sync=%d)\n",  
2700                    disk_idx, s->syncing);  
2701          }

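Since s->uptodate==0, neither of the first two branches applies and we go straight to line 2695. Every non-failed disk therefore gets R5_LOCKED and R5_Wantread set, while the failed disk, which lacks R5_Insync, gets neither. To summarize the whole read-retry flow:

1) A chunk-aligned read is issued.

2) The read fails, the request is added to the retry list, and raid5d is woken up.

3) raid5d removes the read request from the retry list, obtains a struct stripe_head for each stripe, and calls handle_stripe.

4) handle_stripe calls analyse_stripe, which sets s->to_read and s->failed; it then calls handle_stripe_fill to read the data from the other, redundant disks, and finally calls ops_run_io to send the requests to the disks.

5) When all sub-requests sent to the disks have returned, raid5_end_read_request adds the stripe_head to the array's handle_list.

6) raid5d takes the stripe_head off handle_list and calls handle_stripe.

7) Because s->uptodate==disks-1 at this point, handle_stripe calls handle_stripe_fill, which does set_bit(STRIPE_OP_COMPUTE_BLK, &s->ops_request). With that flag set, raid_run_ops calls ops_run_compute5 to compute the block that needs to be read.

8) The compute callback ops_complete_compute sets R5_UPTODATE in the corresponding dev->flags and puts the stripe_head back on handle_list.

9) handle_stripe runs once more. analyse_stripe sets R5_Wantfill and s->to_fill, and handle_stripe then sets STRIPE_OP_BIOFILL and STRIPE_BIOFILL_RUN. After that, raid_run_ops calls ops_run_biofill to copy the computed data into the bio's pages.

10) In the copy callback ops_complete_biofill, once every submitted stripe has returned, the original bio has all the data it asked for, and return_io completes the originally submitted bio.

The next section continues with the raid5 read flow for reads that are not chunk-aligned.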
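Step 7 relies on the RAID5 parity property: the parity block is the XOR of the data blocks, so any single missing block equals the XOR of all the other blocks in the stripe. The following user-space sketch shows only that arithmetic. It is not how ops_run_compute5 is implemented (the kernel goes through its asynchronous XOR machinery), and xor_rebuild is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Rebuild the block at index `missing` by XOR-ing all other blocks
 * (data and parity alike) of the same stripe. Every block is `len`
 * bytes long. */
static void xor_rebuild(uint8_t *blocks[], int disks, int missing, size_t len)
{
    memset(blocks[missing], 0, len);

    for (int d = 0; d < disks; d++) {
        if (d == missing)
            continue;
        for (size_t k = 0; k < len; k++)
            blocks[missing][k] ^= blocks[d][k];
    }
}
```

With two data blocks and one parity block, zeroing one data block and calling xor_rebuild for its index restores the original contents, which is why reading the disks-1 surviving devices is enough to satisfy the failed read.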

Source: http://blog.csdn.net/liumangxiong