The previous section showed that in raid5's sync function sync_request, the potato chips (our stripes) get fried by handle_stripe. Everything up to this point -- creating the array, allocating all kinds of resources, setting up each array's personality -- was preparation for the data flow, just like years of hard study are preparation for university. The data-flow stage, like campus life, is colourful and challenging, but once you get over this hump the kernel code is no longer mysterious; the rest is only a matter of time.
First, let's see where handle_stripe takes our potato chips:
3379 static void handle_stripe(struct stripe_head *sh)
3380 {
3381 	struct stripe_head_state s;
3382 	struct r5conf *conf = sh->raid_conf;
3383 	int i;
3384 	int prexor;
3385 	int disks = sh->disks;
3386 	struct r5dev *pdev, *qdev;
3387 
3388 	clear_bit(STRIPE_HANDLE, &sh->state);
3389 	if (test_and_set_bit_lock(STRIPE_ACTIVE, &sh->state)) {
3390 		/* already being handled, ensure it gets handled
3391 		 * again when current action finishes */
3392 		set_bit(STRIPE_HANDLE, &sh->state);
3393 		return;
3394 	}
3395 
3396 	if (test_and_clear_bit(STRIPE_SYNC_REQUESTED, &sh->state)) {
3397 		set_bit(STRIPE_SYNCING, &sh->state);
3398 		clear_bit(STRIPE_INSYNC, &sh->state);
3399 	}
3400 	clear_bit(STRIPE_DELAYED, &sh->state);
3401 
3402 	pr_debug("handling stripe %llu, state=%#lx cnt=%d, "
3403 		"pd_idx=%d, qd_idx=%d\n, check:%d, reconstruct:%d\n",
3404 	       (unsigned long long)sh->sector, sh->state,
3405 	       atomic_read(&sh->count), sh->pd_idx, sh->qd_idx,
3406 	       sh->check_state, sh->reconstruct_state);
3407 
3408 	analyse_stripe(sh, &s);
This function is fairly long, so here is just the first part: analysing the stripe. The analysis pre-processes the stripe according to its current state, and that summary decides what concrete operation to perform next. Take sync as an example: first the data disks are read; once the reads come back the parity is checked; then the parity value is written out. These steps are not all completed within a single call to handle_stripe, because disk I/O is asynchronous -- handle_stripe has to be called again after the previous disk request's callback fires. A typical data flow therefore enters handle_stripe several times, and each entry takes a somewhat different path through the code.
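Since this enter-and-return dance is the heart of the whole flow, here is a minimal userspace sketch of the re-entry protocol at lines 3388-3394: clear STRIPE_HANDLE, try to take STRIPE_ACTIVE, and if someone else already holds it, re-arm STRIPE_HANDLE so the stripe gets processed again later. Everything named toy_* below is invented for illustration; it is not kernel API, just the same bit-flag pattern written with plain C11 atomics.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Bits in a toy "stripe state" word. */
enum { TOY_ACTIVE = 0, TOY_HANDLE = 1 };

static atomic_ulong toy_state;

static bool toy_test_and_set(int bit)
{
	unsigned long mask = 1UL << bit;
	return atomic_fetch_or(&toy_state, mask) & mask;
}

static void toy_set(int bit)   { atomic_fetch_or(&toy_state, 1UL << bit); }
static void toy_clear(int bit) { atomic_fetch_and(&toy_state, ~(1UL << bit)); }
static bool toy_test(int bit)  { return atomic_load(&toy_state) & (1UL << bit); }

/* Mirrors the shape of handle_stripe's entry: drop HANDLE, try to grab
 * ACTIVE; if somebody else holds ACTIVE, re-arm HANDLE and back off. */
static void toy_handle(const char *who)
{
	toy_clear(TOY_HANDLE);
	if (toy_test_and_set(TOY_ACTIVE)) {
		toy_set(TOY_HANDLE);          /* handle it again later */
		printf("%s: busy, re-queued\n", who);
		return;
	}
	printf("%s: handling one step\n", who);
	toy_clear(TOY_ACTIVE);                /* corresponds to the unlock at the end */
}

int main(void)
{
	toy_set(TOY_HANDLE);
	toy_handle("pass 1");   /* gets ACTIVE, does one step */
	toy_set(TOY_ACTIVE);    /* pretend another context is mid-handling */
	toy_handle("pass 2");   /* finds ACTIVE set, just re-arms HANDLE */
	printf("HANDLE still set: %d\n", toy_test(TOY_HANDLE));
	return 0;
}

The point of the pattern is that a stripe is never handled twice concurrently, yet no request for handling is ever lost -- which is what makes the multi-pass, callback-driven flow below safe.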
struct stripe_head has many state flags; these flags decide how the stripe is handled, so they must be manipulated very carefully. There are quite a few of them, so here is a quick first pass:
enum {
	STRIPE_ACTIVE,		// being handled right now
	STRIPE_HANDLE,		// needs handling
	STRIPE_SYNC_REQUESTED,	// a sync request has arrived
	STRIPE_SYNCING,		// sync is in progress
	STRIPE_INSYNC,		// the stripe is in sync
	STRIPE_PREREAD_ACTIVE,	// pre-read
	STRIPE_DELAYED,		// handling is delayed
	STRIPE_DEGRADED,	// degraded
	STRIPE_BIT_DELAY,	// waiting for bitmap handling
	STRIPE_EXPANDING,	//
	STRIPE_EXPAND_SOURCE,	//
	STRIPE_EXPAND_READY,	//
	STRIPE_IO_STARTED,	/* do not count towards 'bypass_count' */ // I/O has been issued
	STRIPE_FULL_WRITE,	/* all blocks are set to be overwritten */ // full-stripe write
	STRIPE_BIOFILL_RUN,	// bio fill: copying the pages into the bios
	STRIPE_COMPUTE_RUN,	// a compute operation is running
	STRIPE_OPS_REQ_PENDING,	// used for queueing handle_stripe work
	STRIPE_ON_UNPLUG_LIST,	// marks whether the stripe is on the unplug list during a batched release_stripe
};
Line 3388: clear the "needs handling" flag.
Line 3389: set the "being handled" flag.
Line 3392: if the stripe is already being handled, set the "handle again" flag and return.
Line 3396: this is a sync request.
Line 3397: set the "sync in progress" flag.
Line 3398: clear the "in sync" flag.
Line 3400: clear the "delayed handling" flag.
Line 3408: analyse the stripe. This function is long, so it is explained in several parts:
3198 static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s)
3199 {
3200 	struct r5conf *conf = sh->raid_conf;
3201 	int disks = sh->disks;
3202 	struct r5dev *dev;
3203 	int i;
3204 	int do_recovery = 0;
3205 
3206 	memset(s, 0, sizeof(*s));
3207 
3208 	s->expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
3209 	s->expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
3210 	s->failed_num[0] = -1;
3211 	s->failed_num[1] = -1;
3212 
3213 	/* Now to look around and see what can be done */
3214 	rcu_read_lock();
Initialisation and taking the RCU read lock; moving on:
3215 	for (i=disks; i--; ) {
3216 		struct md_rdev *rdev;
3217 		sector_t first_bad;
3218 		int bad_sectors;
3219 		int is_bad = 0;
3220 
3221 		dev = &sh->dev[i];
3222 
3223 		pr_debug("check %d: state 0x%lx read %p write %p written %p\n",
3224 			i, dev->flags,
3225 			dev->toread, dev->towrite, dev->written);
Next comes a big loop that iterates once per member disk. The loop works on dev from line 3221, whose type is struct r5dev, so let's look at that structure first; it is embedded at the end of struct stripe_head:
	struct r5dev {
		/* rreq and rvec are used for the replacement device when
		 * writing data to both devices.
		 */
		struct bio	req, rreq;
		struct bio_vec	vec, rvec;
		struct page	*page;
		struct bio	*toread, *read, *towrite, *written;
		sector_t	sector;		/* sector of this page */
		unsigned long	flags;
	} dev[1]; /* allocated with extra space depending of RAID geometry */
Start with the comment: rreq and rvec are used by the replacement device when data is written to both devices. The r prefix is short for replacement. What is a replacement? It is a stand-in for the original data disk; the replacement feature was introduced only in recent kernel versions, and it matters a lot in real products -- its implementation is covered later. page is the cache page, normally used for computations; the next few bio fields are the head pointers of the read/write bio lists; sector is the physical sector this page corresponds to; flags holds the struct r5dev flags.
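The dev[1] member with its comment also deserves a remark: struct stripe_head is allocated with extra space so that dev[] really has one entry per member disk, and the raid5 code sizes its stripe cache accordingly. Here is a hedged userspace sketch of that allocation trick -- toy_stripe, toy_r5dev and toy_alloc are made-up names for illustration only:

#include <stdlib.h>
#include <stdio.h>

/* Cut-down stand-ins for the kernel structures (illustration only). */
struct toy_r5dev {
	unsigned long flags;
	unsigned long long sector;
};

struct toy_stripe {
	int disks;
	/* trailing array: declared with one element, allocated with 'disks' */
	struct toy_r5dev dev[1];
};

/* Allocate a stripe with room for 'disks' r5dev entries, the same trick
 * the raid5 code uses when it sizes its stripe_head cache. */
static struct toy_stripe *toy_alloc(int disks)
{
	size_t sz = sizeof(struct toy_stripe) +
		    (disks - 1) * sizeof(struct toy_r5dev);
	struct toy_stripe *sh = calloc(1, sz);

	if (sh)
		sh->disks = disks;
	return sh;
}

int main(void)
{
	struct toy_stripe *sh = toy_alloc(5);	/* e.g. 4 data disks + parity */

	if (!sh)
		return 1;
	for (int i = sh->disks; i--; )		/* same loop shape as analyse_stripe */
		sh->dev[i].sector = 1000 + i;
	printf("disks=%d last sector=%llu\n", sh->disks, sh->dev[4].sector);
	free(sh);
	return 0;
}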
3226 		/* maybe we can reply to a read
3227 		 *
3228 		 * new wantfill requests are only permitted while
3229 		 * ops_complete_biofill is guaranteed to be inactive
3230 		 */
3231 		if (test_bit(R5_UPTODATE, &dev->flags) && dev->toread &&
3232 		    !test_bit(STRIPE_BIOFILL_RUN, &sh->state))
3233 			set_bit(R5_Wantfill, &dev->flags);
3234 
3235 		/* now count some things */
3236 		if (test_bit(R5_LOCKED, &dev->flags))
3237 			s->locked++;
3238 		if (test_bit(R5_UPTODATE, &dev->flags))
3239 			s->uptodate++;
3240 		if (test_bit(R5_Wantcompute, &dev->flags)) {
3241 			s->compute++;
3242 			BUG_ON(s->compute > 2);
3243 		}
3244 
3245 		if (test_bit(R5_Wantfill, &dev->flags))
3246 			s->to_fill++;
3247 		else if (dev->toread)
3248 			s->to_read++;
3249 		if (dev->towrite) {
3250 			s->to_write++;
3251 			if (!test_bit(R5_OVERWRITE, &dev->flags))
3252 				s->non_overwrite++;
3253 		}
3254 		if (dev->written)
3255 			s->written++;
Line 3231: which r5dev gets the R5_Wantfill flag? One that is up to date, has a pending read, and is not in the middle of a bio-fill copy. In other words, the data it needs is already current, so all that is left is to copy it from the page into the bio.
Line 3236: count locked devices.
Line 3238: count up-to-date devices.
Line 3240: count devices that need a compute.
Line 3245: count devices waiting for a bio-fill copy.
Line 3247: count devices with pending reads.
Line 3249: count devices with pending writes.
Line 3251: count devices whose write does not fully overwrite the block (non_overwrite).
Line 3254: count devices whose write has already been issued.
3256 		/* Prefer to use the replacement for reads, but only
3257 		 * if it is recovered enough and has no bad blocks.
3258 		 */
3259 		rdev = rcu_dereference(conf->disks[i].replacement);
3260 		if (rdev && !test_bit(Faulty, &rdev->flags) &&
3261 		    rdev->recovery_offset >= sh->sector + STRIPE_SECTORS &&
3262 		    !is_badblock(rdev, sh->sector, STRIPE_SECTORS,
3263 				 &first_bad, &bad_sectors))
3264 			set_bit(R5_ReadRepl, &dev->flags);
3265 		else {
3266 			if (rdev)
3267 				set_bit(R5_NeedReplace, &dev->flags);
3268 			rdev = rcu_dereference(conf->disks[i].rdev);
3269 			clear_bit(R5_ReadRepl, &dev->flags);
3270 		}
3271 		if (rdev && test_bit(Faulty, &rdev->flags))
3272 			rdev = NULL;
3273 		if (rdev) {
3274 			is_bad = is_badblock(rdev, sh->sector, STRIPE_SECTORS,
3275 					     &first_bad, &bad_sectors);
3276 			if (s->blocked_rdev == NULL
3277 			    && (test_bit(Blocked, &rdev->flags)
3278 				|| is_bad < 0)) {
3279 				if (is_bad < 0)
3280 					set_bit(BlockedBadBlocks,
3281 						&rdev->flags);
3282 				s->blocked_rdev = rdev;
3283 				atomic_inc(&rdev->nr_pending);
3284 			}
3285 		}
Line 3256: prefer reading from a replacement that has been rebuilt far enough and has no bad sectors.
Line 3264: read from the replacement (R5_ReadRepl).
Line 3267: the replacement is not usable for reads here, so mark that it still needs to be written (R5_NeedReplace) and fall back to the original rdev.
Line 3271: the device is faulty; treat it as absent.
Line 3273: check for bad sectors in this stripe's range.
Line 3286: initialise the dev state.
Line 3300: no bad sectors, set the in-sync flag.
Line 3312: write-error handling.
Line 3325: repair handling for the data disk.
Line 3336: repair handling for the replacement disk.
Line 3352: record devices that are out of sync.
Line 3360: decide whether we are syncing or rebuilding a replacement.
That wraps up analyse_stripe. So what did it actually accomplish for a sync request? Essentially just setting s.syncing = 1. The function may be long, but on any single pass it does very little.
Back in handle_stripe, skipping the code that does not execute on this pass, we reach:
3468 	/* Now we might consider reading some blocks, either to check/generate
3469 	 * parity, or to satisfy requests
3470 	 * or to load a block that is being partially written.
3471 	 */
3472 	if (s.to_read || s.non_overwrite
3473 	    || (conf->level == 6 && s.to_write && s.failed)
3474 	    || (s.syncing && (s.uptodate + s.compute < disks))
3475 	    || s.replacing
3476 	    || s.expanding)
3477 		handle_stripe_fill(sh, &s, disks);
Line 3468: this is where we get ready to read from disk; reads may be needed when generating parity or when serving read/write requests.
Line 3474: analyse_stripe set the syncing flag, so this condition is satisfied and we enter handle_stripe_fill.
2707 /**
2708  * handle_stripe_fill - read or compute data to satisfy pending requests.
2709  */
2710 static void handle_stripe_fill(struct stripe_head *sh,
2711 			       struct stripe_head_state *s,
2712 			       int disks)
2713 {
2714 	int i;
2715 
2716 	/* look for blocks to read/compute, skip this if a compute
2717 	 * is already in flight, or if the stripe contents are in the
2718 	 * midst of changing due to a write
2719 	 */
2720 	if (!test_bit(STRIPE_COMPUTE_RUN, &sh->state) && !sh->check_state &&
2721 	    !sh->reconstruct_state)
2722 		for (i = disks; i--; )
2723 			if (fetch_block(sh, s, i, disks))
2724 				break;
2725 	set_bit(STRIPE_HANDLE, &sh->state);
2726 }
Line 2720: if a compute, check or reconstruct is already in progress, there is no need to read from disk again.
Line 2722: loop over every r5dev to see whether it needs to be read.
Step into fetch_block:
2618 /* fetch_block - checks the given member device to see if its data needs
2619 * to be read or computed to satisfy a request.
2620 *
2621 * Returns 1 when no more member devices need to be checked, otherwise returns
2622 * 0 to tell the loop in handle_stripe_fill to continue
2623 */
Check whether the given member device needs its data read in (or computed). Returning 1 means no further member devices need to be checked; returning 0 tells the loop in handle_stripe_fill to keep going.
2624 static int fetch_block(struct stripe_head *sh, struct stripe_head_state *s,
2625 		       int disk_idx, int disks)
2626 {
2627 	struct r5dev *dev = &sh->dev[disk_idx];
2628 	struct r5dev *fdev[2] = { &sh->dev[s->failed_num[0]],
2629 				  &sh->dev[s->failed_num[1]] };
2630 
2631 	/* is the data in this block needed, and can we get it? */
2632 	if (!test_bit(R5_LOCKED, &dev->flags) &&
2633 	    !test_bit(R5_UPTODATE, &dev->flags) &&
2634 	    (dev->toread ||
2635 	     (dev->towrite && !test_bit(R5_OVERWRITE, &dev->flags)) ||
2636 	     s->syncing || s->expanding ||
2637 	     (s->replacing && want_replace(sh, disk_idx)) ||
2638 	     (s->failed >= 1 && fdev[0]->toread) ||
2639 	     (s->failed >= 2 && fdev[1]->toread) ||
2640 	     (sh->raid_conf->level <= 5 && s->failed && fdev[0]->towrite &&
2641 	      !test_bit(R5_OVERWRITE, &fdev[0]->flags)) ||
2642 	     (sh->raid_conf->level == 6 && s->failed && s->to_write))) {
2643 		/* we would like to get this block, possibly by computing it,
2644 		 * otherwise read it if the backing disk is insync
2645 		 */
2646 		BUG_ON(test_bit(R5_Wantcompute, &dev->flags));
2647 		BUG_ON(test_bit(R5_Wantread, &dev->flags));
2648 		if ((s->uptodate == disks - 1) &&
2649 		    (s->failed && (disk_idx == s->failed_num[0] ||
2650 				   disk_idx == s->failed_num[1]))) {
2651 			/* have disk failed, and we're requested to fetch it;
2652 			 * do compute it
2653 			 */
2654 			pr_debug("Computing stripe %llu block %d\n",
2655 				 (unsigned long long)sh->sector, disk_idx);
2656 			set_bit(STRIPE_COMPUTE_RUN, &sh->state);
2657 			set_bit(STRIPE_OP_COMPUTE_BLK, &s->ops_request);
2658 			set_bit(R5_Wantcompute, &dev->flags);
2659 			sh->ops.target = disk_idx;
2660 			sh->ops.target2 = -1; /* no 2nd target */
2661 			s->req_compute = 1;
2662 			/* Careful: from this point on 'uptodate' is in the eye
2663 			 * of raid_run_ops which services 'compute' operations
2664 			 * before writes. R5_Wantcompute flags a block that will
2665 			 * be R5_UPTODATE by the time it is needed for a
2666 			 * subsequent operation.
2667 			 */
2668 			s->uptodate++;
2669 			return 1;
2670 		} else if (s->uptodate == disks-2 && s->failed >= 2) {
2671 			/* Computing 2-failure is *very* expensive; only
2672 			 * do it if failed >= 2
2673 			 */
2674 			int other;
2675 			for (other = disks; other--; ) {
2676 				if (other == disk_idx)
2677 					continue;
2678 				if (!test_bit(R5_UPTODATE,
2679 					      &sh->dev[other].flags))
2680 					break;
2681 			}
2682 			BUG_ON(other < 0);
2683 			pr_debug("Computing stripe %llu blocks %d,%d\n",
2684 				 (unsigned long long)sh->sector,
2685 				 disk_idx, other);
2686 			set_bit(STRIPE_COMPUTE_RUN, &sh->state);
2687 			set_bit(STRIPE_OP_COMPUTE_BLK, &s->ops_request);
2688 			set_bit(R5_Wantcompute, &sh->dev[disk_idx].flags);
2689 			set_bit(R5_Wantcompute, &sh->dev[other].flags);
2690 			sh->ops.target = disk_idx;
2691 			sh->ops.target2 = other;
2692 			s->uptodate += 2;
2693 			s->req_compute = 1;
2694 			return 1;
2695 		} else if (test_bit(R5_Insync, &dev->flags)) {
2696 			set_bit(R5_LOCKED, &dev->flags);
2697 			set_bit(R5_Wantread, &dev->flags);
2698 			s->locked++;
2699 			pr_debug("Reading block %d (sync=%d)\n",
2700 				 disk_idx, s->syncing);
2701 		}
2702 	}
2703 
2704 	return 0;
2705 }
Coming into this function, the only card in our hand is s.syncing; can it do anything for us here?
Line 2632: decide whether this device needs to be read.
Line 2636: this clearly evaluates to true because s.syncing == 1; ignore the other conditions for now.
Line 2648: nothing has been read in yet, so s->uptodate == 0 and this branch is not taken.
Line 2670: likewise not taken.
Line 2695: the branch that actually runs.
Line 2696: set the device's locked flag.
Line 2697: set the device's "want read" flag.
Line 2698: increment the stripe's count of locked devices.
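To summarise what this first pass does, here is a hedged userspace condensation of the fetch_block decision for a healthy resync (the F_* flags and toy_* names are invented for illustration; the real test at line 2632 has many more arms):

#include <stdbool.h>
#include <stdio.h>

/* Per-device flags we care about in this sketch (invented bit values). */
#define F_LOCKED   (1u << 0)
#define F_UPTODATE (1u << 1)
#define F_INSYNC   (1u << 2)
#define F_WANTREAD (1u << 3)

struct toy_dev { unsigned flags; };

/* Condensed version of the fetch_block decision for a healthy resync pass:
 * nothing is locked or up to date yet, s->syncing is the only reason to
 * fetch, so every in-sync device is marked Wantread and locked. */
static int toy_fetch_block(struct toy_dev *dev, bool syncing, int *locked)
{
	if (!(dev->flags & F_LOCKED) && !(dev->flags & F_UPTODATE) && syncing) {
		if (dev->flags & F_INSYNC) {
			dev->flags |= F_LOCKED | F_WANTREAD;
			(*locked)++;
		}
	}
	return 0;	/* keep checking the remaining devices */
}

int main(void)
{
	struct toy_dev dev[5] = {	/* 4 data + 1 parity, all in sync */
		{ F_INSYNC }, { F_INSYNC }, { F_INSYNC }, { F_INSYNC }, { F_INSYNC }
	};
	int locked = 0;

	for (int i = 5; i--; )
		toy_fetch_block(&dev[i], true, &locked);
	printf("devices marked Wantread: %d\n", locked);
	return 0;
}

Running it reports that all five devices were marked for reading, which is exactly the situation described next.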
When handle_stripe_fill returns, every struct r5dev in the stripe has the R5_Wantread flag set. Later in the same pass, handle_stripe calls ops_run_io to issue the reads:
3673 ops_run_io(sh, &s);
Let's follow this function too; to keep the focus on sync, only the relevant code is listed:
 537 static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
 538 {
 539 	struct r5conf *conf = sh->raid_conf;
 540 	int i, disks = sh->disks;
 541 
 542 	might_sleep();
 543 
 544 	for (i = disks; i--; ) {
 545 		int rw;
 546 		int replace_only = 0;
 547 		struct bio *bi, *rbi;
 548 		struct md_rdev *rdev, *rrdev = NULL;
...
 554 		} else if (test_and_clear_bit(R5_Wantread, &sh->dev[i].flags))
 555 			rw = READ;
...
 560 		} else
 561 			continue;
 564 
 565 		bi = &sh->dev[i].req;
 566 		rbi = &sh->dev[i].rreq; /* For writing to replacement */
 567 
 568 		bi->bi_rw = rw;
 569 		rbi->bi_rw = rw;
 570 		if (rw & WRITE) {
 573 		} else
 574 			bi->bi_end_io = raid5_end_read_request;
 575 
 576 		rcu_read_lock();
 577 		rrdev = rcu_dereference(conf->disks[i].replacement);
 578 		smp_mb(); /* Ensure that if rrdev is NULL, rdev won't be */
 579 		rdev = rcu_dereference(conf->disks[i].rdev);
 580 		if (!rdev) {
 581 			rdev = rrdev;
 582 			rrdev = NULL;
 583 		}
...
 598 		if (rdev)
 599 			atomic_inc(&rdev->nr_pending);
...
 604 		rcu_read_unlock();
...
 643 		if (rdev) {
 644 			if (s->syncing || s->expanding || s->expanded
 645 			    || s->replacing)
 646 				md_sync_acct(rdev->bdev, STRIPE_SECTORS);
 647 
 648 			set_bit(STRIPE_IO_STARTED, &sh->state);
 649 
 650 			bi->bi_bdev = rdev->bdev;
 651 			pr_debug("%s: for %llu schedule op %ld on disc %d\n",
 652 				__func__, (unsigned long long)sh->sector,
 653 				bi->bi_rw, i);
 654 			atomic_inc(&sh->count);
 655 			if (use_new_offset(conf, sh))
 656 				bi->bi_sector = (sh->sector
 657 						 + rdev->new_data_offset);
 658 			else
 659 				bi->bi_sector = (sh->sector
 660 						 + rdev->data_offset);
 661 			if (test_bit(R5_ReadNoMerge, &sh->dev[i].flags))
 662 				bi->bi_rw |= REQ_FLUSH;
 663 
 664 			bi->bi_flags = 1 << BIO_UPTODATE;
 665 			bi->bi_idx = 0;
 666 			bi->bi_io_vec[0].bv_len = STRIPE_SIZE;
 667 			bi->bi_io_vec[0].bv_offset = 0;
 668 			bi->bi_size = STRIPE_SIZE;
 669 			bi->bi_next = NULL;
 670 			if (rrdev)
 671 				set_bit(R5_DOUBLE_LOCKED, &sh->dev[i].flags);
 672 			generic_make_request(bi);
 673 		}
...
 709 	}
 710 }
Line 542: this function may sleep.
Line 544: iterate over every r5dev.
Line 554: the read flag is set, so rw = READ.
Line 568: mark the bio as a read.
Line 574: set the bio completion callback to raid5_end_read_request; this is where execution resumes after the read request has completed.
Line 598: increment the device's nr_pending.
Line 646: sync accounting.
Line 648: mark that I/O has been started.
Line 650: point the bio at the corresponding member disk.
Line 654: increment the stripe_head reference count.
Lines 655-660: compute the target sector, adding the data offset on the disk.
Line 661: for a no-merge read, set the REQ_FLUSH flag on the bio.
Line 664: fill in the remaining bio fields.
Line 672: submit the bio to the disk.
When the disk finishes the read request, raid5_end_read_request is called:
1710 static void raid5_end_read_request(struct bio * bi, int error)
1711 {
...
1824 	rdev_dec_pending(rdev, conf->mddev);
1825 	clear_bit(R5_LOCKED, &sh->dev[i].flags);
1826 	set_bit(STRIPE_HANDLE, &sh->state);
1827 	release_stripe(sh);
1828 }
This function clears R5_LOCKED and queues the stripe_head for handling again. After passing through raid5d, handle_stripe is entered once more; on this pass analyse_stripe increments s->uptodate once for every member disk, so s->uptodate equals the number of disks. handle_stripe then reaches:
3528 	if (sh->check_state ||
3529 	    (s.syncing && s.locked == 0 &&
3530 	     !test_bit(STRIPE_COMPUTE_RUN, &sh->state) &&
3531 	     !test_bit(STRIPE_INSYNC, &sh->state))) {
3532 		if (conf->level == 6)
3533 			handle_parity_checks6(conf, sh, &s, disks);
3534 		else
3535 			handle_parity_checks5(conf, sh, &s, disks);
3536 	}
Line 3535 is taken to verify the parity, and we enter handle_parity_checks5:
2881 	switch (sh->check_state) {
2882 	case check_state_idle:
2883 		/* start a new check operation if there are no failures */
2884 		if (s->failed == 0) {
2885 			BUG_ON(s->uptodate != disks);
2886 			sh->check_state = check_state_run;
2887 			set_bit(STRIPE_OP_CHECK, &s->ops_request);
2888 			clear_bit(R5_UPTODATE, &sh->dev[sh->pd_idx].flags);
2889 			s->uptodate--;
2890 			break;
2891 		}
Line 2881: check_state is 0, so we take the branch at line 2882.
Line 2886: set the state to check_state_run.
Line 2887: request the STRIPE_OP_CHECK operation.
Line 2889: decrement s->uptodate (the parity block is no longer counted as up to date).
Because STRIPE_OP_CHECK has been requested, handle_stripe will call raid_run_ops, which in turn reaches:
1412 	if (test_bit(STRIPE_OP_CHECK, &ops_request)) {
1413 		if (sh->check_state == check_state_run)
1414 			ops_run_check_p(sh, percpu);
ops_run_check_p checks whether the stripe's parity is in sync; its completion callback is:
1301 static void ops_complete_check(void *stripe_head_ref)
1302 {
1303 	struct stripe_head *sh = stripe_head_ref;
1304 
1305 	pr_debug("%s: stripe %llu\n", __func__,
1306 		(unsigned long long)sh->sector);
1307 
1308 	sh->check_state = check_state_check_result;
1309 	set_bit(STRIPE_HANDLE, &sh->state);
1310 	release_stripe(sh);
1311 }
 
Line 1308 sets the state to check_state_check_result, and the stripe is put back on the handle_list. handle_stripe calls handle_parity_checks5 once again, but this time check_state == check_state_check_result:
2916 	case check_state_check_result:
2917 		sh->check_state = check_state_idle;
2918 
2919 		/* if a failure occurred during the check operation, leave
2920 		 * STRIPE_INSYNC not set and let the stripe be handled again
2921 		 */
2922 		if (s->failed)
2923 			break;
2924 
2925 		/* handle a successful check operation, if parity is correct
2926 		 * we are done. Otherwise update the mismatch count and repair
2927 		 * parity if !MD_RECOVERY_CHECK
2928 		 */
2929 		if ((sh->ops.zero_sum_result & SUM_CHECK_P_RESULT) == 0)
2930 			/* parity is correct (on disc,
2931 			 * not in buffer any more)
2932 			 */
2933 			set_bit(STRIPE_INSYNC, &sh->state);
2934 		else {
2935 			conf->mddev->resync_mismatches += STRIPE_SECTORS;
2936 			if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery))
2937 				/* don't try to repair!! */
2938 				set_bit(STRIPE_INSYNC, &sh->state);
2939 			else {
2940 				sh->check_state = check_state_compute_run;
2941 				set_bit(STRIPE_COMPUTE_RUN, &sh->state);
2942 				set_bit(STRIPE_OP_COMPUTE_BLK, &s->ops_request);
2943 				set_bit(R5_Wantcompute,
2944 					&sh->dev[sh->pd_idx].flags);
2945 				sh->ops.target = sh->pd_idx;
2946 				sh->ops.target2 = -1;
2947 				s->uptodate++;
2948 			}
2949 		}
2950 		break;
Line 2929: the check found the parity consistent.
Line 2933: simply mark the stripe as in sync; nothing else needs to be done.
Line 2934: the stripe is not consistent.
Line 2940: set check_state to check_state_compute_run.
Line 2942: request STRIPE_OP_COMPUTE_BLK, i.e. prepare to recompute the parity.
Line 2943: the compute target is the stripe's parity disk.
Line 2947: uptodate was decremented when the check was started, so restore it here.
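To make "check P, and recompute it on a mismatch" concrete, here is a hedged userspace sketch of what the check and the repair boil down to for RAID5 -- a plain XOR over the data blocks. toy_check_p and toy_compute_p are invented names; in the kernel the same work is done asynchronously by ops_run_check_p and ops_run_compute5 through the async_tx engine:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLK 8			/* toy block size; the kernel uses STRIPE_SIZE (4K) */
#define DATA_DISKS 3

/* XOR all data blocks into 'out'; this is what "compute P" boils down to. */
static void toy_compute_p(uint8_t data[DATA_DISKS][BLK], uint8_t out[BLK])
{
	memset(out, 0, BLK);
	for (int d = 0; d < DATA_DISKS; d++)
		for (int i = 0; i < BLK; i++)
			out[i] ^= data[d][i];
}

/* "check P": the XOR of the data blocks must equal the parity block. */
static int toy_check_p(uint8_t data[DATA_DISKS][BLK], uint8_t parity[BLK])
{
	uint8_t sum[BLK];

	toy_compute_p(data, sum);
	for (int i = 0; i < BLK; i++)
		if (sum[i] != parity[i])
			return 1;	/* mismatch */
	return 0;
}

int main(void)
{
	uint8_t data[DATA_DISKS][BLK] = { "block0", "block1", "block2" };
	uint8_t parity[BLK];

	toy_compute_p(data, parity);
	printf("check after compute: %s\n",
	       toy_check_p(data, parity) ? "MISMATCH" : "in sync");

	parity[0] ^= 0xff;		/* corrupt the parity block */
	if (toy_check_p(data, parity)) {
		/* what the resync path does: recompute P and write it back */
		toy_compute_p(data, parity);
		printf("repaired, check again: %s\n",
		       toy_check_p(data, parity) ? "MISMATCH" : "in sync");
	}
	return 0;
}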
If the stripe turned out to be in sync, we come back to handle_stripe carrying the STRIPE_INSYNC flag:
3550 	if ((s.syncing || s.replacing) && s.locked == 0 &&
3551 	    test_bit(STRIPE_INSYNC, &sh->state)) {
3552 		md_done_sync(conf->mddev, STRIPE_SECTORS, 1);
3553 		clear_bit(STRIPE_SYNCING, &sh->state);
3554 	}
If the stripe was not in sync, we arrive at raid_run_ops carrying the STRIPE_OP_COMPUTE_BLK request; that function calls __raid_run_ops:
1383 	if (test_bit(STRIPE_OP_COMPUTE_BLK, &ops_request)) {
1384 		if (level < 6)
1385 			tx = ops_run_compute5(sh, percpu);
which finally calls ops_run_compute5 to compute the value of the stripe's parity block; its completion callback is ops_complete_compute:
 856 static void ops_complete_compute(void *stripe_head_ref)
 857 {
 858 	struct stripe_head *sh = stripe_head_ref;
 859 
 860 	pr_debug("%s: stripe %llu\n", __func__,
 861 		(unsigned long long)sh->sector);
 862 
 863 	/* mark the computed target(s) as uptodate */
 864 	mark_target_uptodate(sh, sh->ops.target);
 865 	mark_target_uptodate(sh, sh->ops.target2);
 866 
 867 	clear_bit(STRIPE_COMPUTE_RUN, &sh->state);
 868 	if (sh->check_state == check_state_compute_run)
 869 		sh->check_state = check_state_compute_result;
 870 	set_bit(STRIPE_HANDLE, &sh->state);
 871 	release_stripe(sh);
 872 }
Line 864: mark the parity dev as R5_UPTODATE.
Line 869: handle_parity_checks5 set check_state_compute_run, so advance the state to check_state_compute_result here.
Line 870: set the handle flag; after line 871 the stripe enters handle_stripe once more.
On the next entry into handle_stripe we reach handle_parity_checks5 yet again, this time with check_state == check_state_compute_result:
2894 	case check_state_compute_result:
2895 		sh->check_state = check_state_idle;
2896 		if (!dev)
2897 			dev = &sh->dev[sh->pd_idx];
2898 
2899 		/* check that a write has not made the stripe insync */
2900 		if (test_bit(STRIPE_INSYNC, &sh->state))
2901 			break;
2902 
2903 		/* either failed parity check, or recovery is happening */
2904 		BUG_ON(!test_bit(R5_UPTODATE, &dev->flags));
2905 		BUG_ON(s->uptodate != disks);
2906 
2907 		set_bit(R5_LOCKED, &dev->flags);
2908 		s->locked++;
2909 		set_bit(R5_Wantwrite, &dev->flags);
2910 
2911 		clear_bit(STRIPE_DEGRADED, &sh->state);
2912 		set_bit(STRIPE_INSYNC, &sh->state);
2913 		break;
At first glance, line 2912 sets STRIPE_INSYNC, which would mean the sync of this stripe is finished. But don't celebrate too early: line 2908 does s->locked++, and one of the conditions for finishing the sync is s.locked == 0. So there is one more thing to do before the sync can complete: line 2909 sets R5_Wantwrite, telling us to go through ops_run_io once more and write the freshly computed parity back to the stripe's parity disk. When that write succeeds and we come back, the end-of-sync condition is finally satisfied. And with that, one simple sync pass is complete.
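As a recap, here is a purely illustrative sketch of the check_state progression we just walked through, one handle_stripe entry per loop iteration. The state names mirror the kernel's; everything else is invented for the demo, and the final parity-write pass is folded into the last step:

#include <stdio.h>

/* The check_state values this section walked through. */
enum toy_check_state {
	check_state_idle,
	check_state_run,
	check_state_check_result,
	check_state_compute_run,
	check_state_compute_result,
};

int main(void)
{
	enum toy_check_state st = check_state_idle;
	int pass = 0;
	int parity_ok = 0;	/* pretend the on-disk parity is stale */
	int insync = 0;

	/* Each loop iteration stands for one entry into handle_stripe after
	 * the previous asynchronous step (disk reads, xor check, compute,
	 * parity write) has completed. */
	while (!insync) {
		printf("pass %d: state %d\n", ++pass, st);
		switch (st) {
		case check_state_idle:
			st = check_state_run;		/* kick off STRIPE_OP_CHECK */
			break;
		case check_state_run:
			st = check_state_check_result;	/* ops_complete_check fired */
			break;
		case check_state_check_result:
			if (parity_ok)
				insync = 1;		/* STRIPE_INSYNC, done */
			else
				st = check_state_compute_run;	/* recompute P */
			break;
		case check_state_compute_run:
			st = check_state_compute_result;	/* ops_complete_compute fired */
			break;
		case check_state_compute_result:
			/* kernel: mark STRIPE_INSYNC and request the parity
			 * write via R5_Wantwrite; once the write completes and
			 * nothing is locked, md_done_sync is called */
			insync = 1;
			break;
		}
	}
	printf("stripe in sync after %d passes\n", pass);
	return 0;
}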
Source: http://blog.csdn.net/liumangxiong