As we all know, every Linux block device has a corresponding request queue, and the make_request function registered on that queue is the entry point for requests to that block device. For a RAID array this is already set up when struct mddev is allocated; md_alloc contains the following code:
4846        blk_queue_make_request(mddev->queue, md_make_request);
4847        blk_set_stacking_limits(&mddev->queue->limits);
Although the nationwide PM2.5 readings keep climbing steadily, the haze cannot cloud our code-reading eyes in the least: even among tens of thousands of lines we can quickly spot that the RAID read/write entry point is md_make_request.
328/* Rather than calling directly into the personality make_request function,
329 * IO requests come here first so that we can check if the device is
330 * being suspended pending a reconfiguration.
331 * We hold a refcount over the call to ->make_request. By the time that
332 * call has finished, the bio has been linked into some internal structure
333 * and so is visible to ->quiesce(), so we don't need the refcount any more.
334 */
In plain words: before calling the personality's make_request, we first check whether the device has been suspended for a reconfiguration. Before the call we increment the device's active I/O count and decrement it once make_request returns. The point of this count is to guarantee that, by the time ->quiesce() runs, every bio already accepted by md_make_request has been handed to the personality and linked into its internal structures.
335static void md_make_request(struct request_queue *q, struct bio *bio)
336{
337        const int rw = bio_data_dir(bio);
338        struct mddev *mddev = q->queuedata;
339        int cpu;
340        unsigned int sectors;
341
342        if (mddev == NULL || mddev->pers == NULL
343            || !mddev->ready) {
344                bio_io_error(bio);
345                return;
346        }
347        smp_rmb(); /* Ensure implications of 'active' are visible */
348        rcu_read_lock();
349        if (mddev->suspended) {
350                DEFINE_WAIT(__wait);
351                for (;;) {
352                        prepare_to_wait(&mddev->sb_wait, &__wait,
353                                        TASK_UNINTERRUPTIBLE);
354                        if (!mddev->suspended)
355                                break;
356                        rcu_read_unlock();
357                        schedule();
358                        rcu_read_lock();
359                }
360                finish_wait(&mddev->sb_wait, &__wait);
361        }
362        atomic_inc(&mddev->active_io);
363        rcu_read_unlock();
364
365        /*
366         * save the sectors now since our bio can
367         * go away inside make_request
368         */
369        sectors = bio_sectors(bio);
370        mddev->pers->make_request(mddev, bio);
371
372        cpu = part_stat_lock();
373        part_stat_inc(cpu, &mddev->gendisk->part0, ios[rw]);
374        part_stat_add(cpu, &mddev->gendisk->part0, sectors[rw], sectors);
375        part_stat_unlock();
376
377        if (atomic_dec_and_test(&mddev->active_io) && mddev->suspended)
378                wake_up(&mddev->sb_wait);
379}
Line 337: get the I/O direction; it is used for the per-device statistics further down.
Line 338: get the array pointer, which was stored in q->queuedata back in md_alloc.
Line 342: basic sanity checks.
Line 348: take the RCU read lock before touching the struct mddev state.
Line 349: the array is suspended.
Line 350: if the array is suspended, i.e. being reconfigured as the comment above explains, join the sb_wait wait queue.
Line 360: the array is no longer suspended, so leave the wait queue.
Line 362: increment the array's active I/O count; the reason is given in the comment above.
Line 370: hand the bio down to the array.
Line 372: everything from here on is statistics accounting.
Line 377: decrement the active I/O count; if a reconfiguration is waiting, wake it up.
The only statement on the real data path is line 370; everything else belongs to the control path.
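To see why the active_io counter matters, it helps to look at the other end of the handshake. The suspend side in md.c of this era looks roughly like the sketch below (paraphrased from memory, so treat it as an illustration rather than an exact quote): the reconfiguration path sets mddev->suspended and then sleeps on sb_wait until every I/O that already entered md_make_request has been passed to the personality and has dropped active_io back to zero.

void mddev_suspend(struct mddev *mddev)
{
        BUG_ON(mddev->suspended);
        mddev->suspended = 1;
        synchronize_rcu();              /* pairs with the rcu_read_lock() in md_make_request */
        wait_event(mddev->sb_wait,
                   atomic_read(&mddev->active_io) == 0);
        mddev->pers->quiesce(mddev, 1); /* the personality flushes its internal queues */
}

This is also why line 377 only wakes sb_wait when mddev->suspended is set: the suspender is the only party waiting for active_io to reach zero.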
For a RAID5 array, this request function corresponds to make_request in raid5.c:
4075static void make_request(struct mddev *mddev, struct bio * bi)
4076{
4077        struct r5conf *conf = mddev->private;
4078        int dd_idx;
4079        sector_t new_sector;
4080        sector_t logical_sector, last_sector;
4081        struct stripe_head *sh;
4082        const int rw = bio_data_dir(bi);
4083        int remaining;
4084
4085        if (unlikely(bi->bi_rw & REQ_FLUSH)) {
4086                md_flush_request(mddev, bi);
4087                return;
4088        }
4089
4090        md_write_start(mddev, bi);
4091
4092        if (rw == READ &&
4093             mddev->reshape_position == MaxSector &&
4094             chunk_aligned_read(mddev,bi))
4095                return;
Line 4085: a FLUSH request is handed off to md_flush_request.
Line 4090: write-start processing; let's step inside:
7156/* md_write_start(mddev, bi)
7157 * If we need to update some array metadata (e.g. 'active' flag
7158 * in superblock) before writing, schedule a superblock update
7159 * and wait for it to complete.
7160 */
If some array metadata (for example the 'active' flag in the superblock) needs to be updated before the write, a superblock update is scheduled and waited on synchronously.
7161void md_write_start(struct mddev *mddev, struct bio *bi)
7162{
7163        int did_change = 0;
7164        if (bio_data_dir(bi) != WRITE)
7165                return;
7166
7167        BUG_ON(mddev->ro == 1);
7168        if (mddev->ro == 2) {
7169                /* need to switch to read/write */
7170                mddev->ro = 0;
7171                set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
7172                md_wakeup_thread(mddev->thread);
7173                md_wakeup_thread(mddev->sync_thread);
7174                did_change = 1;
7175        }
7176        atomic_inc(&mddev->writes_pending);
7177        if (mddev->safemode == 1)
7178                mddev->safemode = 0;
7179        if (mddev->in_sync) {
7180                spin_lock_irq(&mddev->write_lock);
7181                if (mddev->in_sync) {
7182                        mddev->in_sync = 0;
7183                        set_bit(MD_CHANGE_CLEAN, &mddev->flags);
7184                        set_bit(MD_CHANGE_PENDING, &mddev->flags);
7185                        md_wakeup_thread(mddev->thread);
7186                        did_change = 1;
7187                }
7188                spin_unlock_irq(&mddev->write_lock);
7189        }
7190        if (did_change)
7191                sysfs_notify_dirent_safe(mddev->sysfs_state);
7192        wait_event(mddev->sb_wait,
7193                   !test_bit(MD_CHANGE_PENDING, &mddev->flags));
7194}
Line 7164: if this is not a write, return immediately.
Line 7168: if the array is in the temporary read-only state (ro == 2), switch it back to read-write and set the flag asking for a recovery check.
Line 7177: if the array is in safe mode, take it out of safe mode. Safe mode is discussed in part four of this series.
Line 7179: if the array is in_sync, clear in_sync and set the array-changed flags.
Line 7190: push the new state out to sysfs.
Line 7192: this is the most important statement in the function. The wait_event tells us right away that the function is synchronous, and the condition is !test_bit(MD_CHANGE_PENDING, &mddev->flags), the very flag set on line 7184. So what this line really does is wait until the cleared in_sync has been written out to the array superblock on disk. Once this sinks in, you understand what safe mode is and how safemode and in_sync divide the work between them.
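For reference, the waiter on line 7192 is released by the md thread: when md_check_recovery() sees the dirty flags it calls md_update_sb(), and once the superblock write has reached the member disks that function ends with roughly the following (again a sketch from memory, not an exact quote):

        /* end of md_update_sb(), after the superblock write has completed */
        spin_lock_irq(&mddev->write_lock);
        clear_bit(MD_CHANGE_PENDING, &mddev->flags);
        spin_unlock_irq(&mddev->write_lock);
        wake_up(&mddev->sb_wait);       /* releases the wait_event() in md_write_start() */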
Where there is an md_write_start there must also be an md_write_end. Its job is, once there are no more pending writes, to arm a timer; when that timer eventually fires, in_sync is set back to 1 and written to the on-disk superblock, and the array returns from unsafe mode to safe mode.
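A sketch of md_write_end() (paraphrased from the md.c of this era, so details may differ slightly): when the last pending write finishes it arms safemode_timer, and when that timer fires the md thread ends up writing in_sync = 1 back to the superblock.

void md_write_end(struct mddev *mddev)
{
        if (atomic_dec_and_test(&mddev->writes_pending)) {
                if (mddev->safemode == 2)
                        md_wakeup_thread(mddev->thread);
                else if (mddev->safemode_delay)
                        mod_timer(&mddev->safemode_timer,
                                  jiffies + mddev->safemode_delay);
        }
}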
Back to make_request.
Line 4092: if this is a read and no reshape is in progress, try chunk_aligned_read, which is, as the name says, a chunk-aligned read. A chunk-aligned read is a request that falls entirely inside one chunk, i.e. inside the portion of a stripe that lives on a single member disk. Because all of the requested data sits on one physical disk, the bio can be sent straight to that disk without allocating a struct stripe_head, which makes this the simplest I/O path of all. Let's step into chunk_aligned_read:
3885static int chunk_aligned_read(struct mddev *mddev, struct bio * raid_bio)
3886{
3887        struct r5conf *conf = mddev->private;
3888        int dd_idx;
3889        struct bio* align_bi;
3890        struct md_rdev *rdev;
3891        sector_t end_sector;
3892
3893        if (!in_chunk_boundary(mddev, raid_bio)) {
3894                pr_debug("chunk_aligned_read : non aligned\n");
3895                return 0;
3896        }
Line 3893: check whether the request stays inside one chunk; if not, return 0 and fall back to the normal read/write path.
To satisfy our curiosity, let's follow in_chunk_boundary anyway:
3775static int in_chunk_boundary(struct mddev *mddev, struct bio *bio)
3776{
3777        sector_t sector = bio->bi_sector + get_start_sect(bio->bi_bdev);
3778        unsigned int chunk_sectors = mddev->chunk_sectors;
3779        unsigned int bio_sectors = bio->bi_size >> 9;
3780
3781        if (mddev->new_chunk_sectors < mddev->chunk_sectors)
3782                chunk_sectors = mddev->new_chunk_sectors;
3783        return  chunk_sectors >=
3784                ((sector & (chunk_sectors - 1)) + bio_sectors);
3785}
Line 3777: compute the bio's sector relative to the whole device (bi_sector plus the partition's start sector).
Line 3779: compute the number of sectors in the request.
Line 3781: this is only relevant during a reshape; ignore it for now.
Line 3783: check whether the last sector of the request falls in the same chunk as the first; if so, the request is contained in a single chunk.
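A quick worked example of that check, using a hypothetical chunk size of 64KB (chunk_sectors = 128); the numbers are made up purely for illustration and the helper below simply mirrors the test on line 3783:

#include <stdio.h>

/* The request stays inside one chunk if (offset inside the chunk) plus the
 * request length does not exceed the chunk size. chunk_sectors is a power
 * of two, so '&' extracts the offset inside the chunk. */
static int fits_in_one_chunk(unsigned long long sector, unsigned int bio_sectors,
                             unsigned int chunk_sectors)
{
        return chunk_sectors >= ((sector & (chunk_sectors - 1ULL)) + bio_sectors);
}

int main(void)
{
        printf("%d\n", fits_in_one_chunk(1000, 8, 128));  /* (1000 & 127) + 8  = 112 <= 128 -> 1 */
        printf("%d\n", fits_in_one_chunk(1144, 16, 128)); /* (1144 & 127) + 16 = 136 >  128 -> 0 */
        return 0;
}

The second request straddles a chunk boundary, so chunk_aligned_read would return 0 and the normal stripe path would handle it.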
3897        /*
3898         * use bio_clone_mddev to make a copy of the bio
3899         */
3900        align_bi = bio_clone_mddev(raid_bio, GFP_NOIO, mddev);
3901        if (!align_bi)
3902                return 0;
3903        /*
3904         * set bi_end_io to a new function, and set bi_private to the
3905         * original bio.
3906         */
3907        align_bi->bi_end_io  = raid5_align_endio;
3908        align_bi->bi_private = raid_bio;
3909        /*
3910         *      compute position
3911         */
3912        align_bi->bi_sector =  raid5_compute_sector(conf, raid_bio->bi_sector,
3913                                                    0,
3914                                                    &dd_idx, NULL);
3915
3916        end_sector = align_bi->bi_sector + (align_bi->bi_size >> 9);
Line 3900: clone the bio. Why clone? Because the bio submitted to the array and the bio submitted to the member disk differ in content, for example bi_sector and bi_end_io, so the next few lines adjust those fields of the clone.
Line 3907: set the completion callback to raid5_align_endio; once this bio has been sent to the disk, that function is where we will pick up its completion handling.
Line 3908: point bi_private at the original bio.
Line 3912: compute the bio's offset on the member disk. Here is the prototype:
1943/*
1944 * Input: a 'big' sector number,
1945 * Output: index of the data and parity disk, and the sector # in them.
1946 */
1947static sector_t raid5_compute_sector(struct r5conf *conf, sector_t r_sector,
1948                                     int previous, int *dd_idx,
1949                                     struct stripe_head *sh)
The input parameter r_sector is the array-relative sector; the output parameter dd_idx is the index, within the array, of the disk holding r_sector; the return value is the corresponding sector offset on that disk.
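To make that mapping concrete, here is the core arithmetic as a standalone sketch. It deliberately ignores the parity rotation that raid5_compute_sector then applies (the real function shifts dd_idx around the parity disk according to the layout algorithm), and all names and numbers are illustrative only:

#include <stdio.h>

/* Simplified array-sector -> (data disk index, disk sector) mapping for RAID5,
 * ignoring parity rotation. data_disks = raid_disks - 1. */
static unsigned long long compute_sector(unsigned long long r_sector,
                                         unsigned int chunk_sectors,
                                         unsigned int data_disks,
                                         unsigned int *dd_idx)
{
        unsigned long long chunk_number = r_sector / chunk_sectors;
        unsigned int chunk_offset = r_sector % chunk_sectors;
        unsigned long long stripe = chunk_number / data_disks;

        *dd_idx = chunk_number % data_disks;    /* before the parity adjustment */
        return stripe * chunk_sectors + chunk_offset;
}

int main(void)
{
        unsigned int dd_idx;
        /* 4-disk RAID5 (3 data chunks per stripe), 64KB chunks (128 sectors):
         * array sector 1000 is in chunk 7 at offset 104, i.e. stripe 2,
         * data disk index 1, disk sector 2 * 128 + 104 = 360. */
        unsigned long long s = compute_sector(1000, 128, 3, &dd_idx);
        printf("dd_idx=%u sector=%llu\n", dd_idx, s);
        return 0;
}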
Line 3916: the sector offset on the member disk at which this bio ends.
3917        rcu_read_lock();
3918        rdev = rcu_dereference(conf->disks[dd_idx].replacement);
3919        if (!rdev || test_bit(Faulty, &rdev->flags) ||
3920            rdev->recovery_offset < end_sector) {
3921                rdev = rcu_dereference(conf->disks[dd_idx].rdev);
3922                if (rdev &&
3923                    (test_bit(Faulty, &rdev->flags) ||
3924                    !(test_bit(In_sync, &rdev->flags) ||
3925                      rdev->recovery_offset >= end_sector)))
3926                        rdev = NULL;
3927        }
Line 3918: prefer the replacement disk. The replacement mechanism is described later; assume for now there is none.
Line 3921: no replacement disk, so rdev points at the corresponding data disk.
Line 3922: if the disk is Faulty, or it is neither In_sync nor recovered up to end_sector, the data on it is not valid, so clear the rdev pointer.
3928        if (rdev) {
3929                sector_t first_bad;
3930                int bad_sectors;
3931
3932                atomic_inc(&rdev->nr_pending);
3933                rcu_read_unlock();
3934                raid_bio->bi_next = (void*)rdev;
3935                align_bi->bi_bdev =  rdev->bdev;
3936                align_bi->bi_flags &= ~(1 << BIO_SEG_VALID);
3937
3938                if (!bio_fits_rdev(align_bi) ||
3939                    is_badblock(rdev, align_bi->bi_sector, align_bi->bi_size>>9,
3940                                &first_bad, &bad_sectors)) {
3941                        /* too big in some way, or has a known bad block */
3942                        bio_put(align_bi);
3943                        rdev_dec_pending(rdev, mddev);
3944                        return 0;
3945                }
3946
3947                /* No reshape active, so we can trust rdev->data_offset */
3948                align_bi->bi_sector += rdev->data_offset;
3949
3950                spin_lock_irq(&conf->device_lock);
3951                wait_event_lock_irq(conf->wait_for_stripe,
3952                                    conf->quiesce == 0,
3953                                    conf->device_lock, /* nothing */);
3954                atomic_inc(&conf->active_aligned_reads);
3955                spin_unlock_irq(&conf->device_lock);
3956
3957                generic_make_request(align_bi);
3958                return 1;
3959        } else {
3960                rcu_read_unlock();
3961                bio_put(align_bi);
3962                return 0;
3963        }
Line 3928: the data on the corresponding member disk is valid.
Line 3932: increment the disk's count of pending requests.
Line 3933: drop the RCU read lock.
Line 3934: bi_next is reused here to stash the rdev pointer; it is needed again when the bio completes.
Line 3935: set the target block device of the clone.
Line 3938: check that the bio fits the member disk's request-queue limits and does not touch a known bad block; otherwise it cannot be sent straight to the disk.
Line 3948: add the disk's data_offset to get the actual sector on the disk.
Line 3951: if the array is processing a configuration command, wait until conf->quiesce drops to 0.
Line 3957: submit the bio to the disk.
Line 3958: a chunk-aligned read has been submitted, so return 1.
Line 3961: the target disk is missing or its data is invalid; release the cloned bio.
Line 3962: no read was submitted, return 0.
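The wait on conf->quiesce at line 3951 and the active_aligned_reads counter form the same kind of handshake as suspended/active_io in md_make_request. The quiesce branch of raid5_quiesce() looks roughly like the sketch below (from memory, so not verbatim): it blocks new activity and waits until both active_stripes and active_aligned_reads drop to zero, and it is exactly this wait that raid5_align_endio releases on lines 3854-3855 further down.

        /* inside raid5_quiesce(), quiesce request (sketch) */
        spin_lock_irq(&conf->device_lock);
        conf->quiesce = 2;      /* stop new aligned reads and stripe activations */
        wait_event_lock_irq(conf->wait_for_stripe,
                            atomic_read(&conf->active_stripes) == 0 &&
                            atomic_read(&conf->active_aligned_reads) == 0,
                            conf->device_lock, /* nothing */);
        conf->quiesce = 1;
        spin_unlock_irq(&conf->device_lock);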
So if the data on the disk at the requested location is valid, the request goes straight down to that disk. Let's follow the next episode of the chunk-aligned-read story in the bio's completion callback, raid5_align_endio:
3829/*
3830 * The "raid5_align_endio" should check if the read succeeded and if it
3831 * did, call bio_endio on the original bio (having bio_put the new bio
3832 * first).
3833 * If the read failed..
3834 */
If the chunk-aligned read succeeded, the original bio is completed and returned to the caller. If it failed... well, heh heh heh.
3835static void raid5_align_endio(struct bio *bi, int error)
3836{
3837        struct bio* raid_bi  = bi->bi_private;
3838        struct mddev *mddev;
3839        struct r5conf *conf;
3840        int uptodate = test_bit(BIO_UPTODATE, &bi->bi_flags);
3841        struct md_rdev *rdev;
3842
3843        bio_put(bi);
3844
3845        rdev = (void*)raid_bi->bi_next;
3846        raid_bi->bi_next = NULL;
3847        mddev = rdev->mddev;
3848        conf = mddev->private;
3849
3850        rdev_dec_pending(rdev, conf->mddev);
3851
3852        if (!error && uptodate) {
3853                bio_endio(raid_bi, 0);
3854                if (atomic_dec_and_test(&conf->active_aligned_reads))
3855                        wake_up(&conf->wait_for_stripe);
3856                return;
3857        }
3858
3859
3860        pr_debug("raid5_align_endio : io error...handing IO for a retry\n");
3861
3862        add_bio_to_retry(raid_bi, conf);
3863}
Line 3843: drop the reference on the cloned bio.
Line 3845: recover the rdev pointer stashed in bi_next.
Line 3850: decrement the rdev's count of outstanding I/O.
Line 3852: the read succeeded.
Line 3853: complete the original bio.
Line 3854: wake up any waiting control command, if there is one.
Line 3862: the read failed, so add the original bio to the array's retry list.
 
3787/*
3788 * add bio to the retry LIFO  ( in O(1) ... we are in interrupt )
3789 * later sampled by raid5d.
3790 */
3791static void add_bio_to_retry(struct bio *bi,struct r5conf *conf)
3792{
3793        unsigned long flags;
3794
3795        spin_lock_irqsave(&conf->device_lock, flags);
3796
3797        bi->bi_next = conf->retry_read_aligned_list;
3798        conf->retry_read_aligned_list = bi;
3799
3800        spin_unlock_irqrestore(&conf->device_lock, flags);
3801        md_wakeup_thread(conf->mddev->thread);
3802}
The retry list is last-in first-out; the bio to be retried is pushed onto retry_read_aligned_list. We have already run into this list in raid5d. Don't remember? No problem, let's look back:
4662                while ((bio = remove_bio_from_retry(conf))) {
4663                        int ok;
4664                        spin_unlock_irq(&conf->device_lock);
4665                        ok = retry_aligned_read(conf, bio);
4666                        spin_lock_irq(&conf->device_lock);
4667                        if (!ok)
4668                                break;
4669                        handled++;
4670                }
Line 4662: take one bio off the retry list (a simplified sketch of the pop side follows below).
Line 4665: retry the read. retry_aligned_read will not retrace the old path of sending the request straight to the disk, because retrying the method that just failed would amount to an endless loop. So what will it do instead? Have a guess; here is a hint: the array has redundant data. The true face of the retry read will be unveiled in the next section.
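One last detail before we go: remove_bio_from_retry() on line 4662 is simply the pop side of the LIFO push we saw in add_bio_to_retry. A simplified sketch is below; the real function also hands back a partially processed bio kept in conf->retry_read_aligned and maintains some per-bio bookkeeping, which is omitted here. Note that it runs with conf->device_lock already held by the raid5d loop above.

static struct bio *remove_bio_from_retry(struct r5conf *conf)
{
        /* caller (raid5d) holds conf->device_lock */
        struct bio *bi = conf->retry_read_aligned_list;

        if (bi) {
                conf->retry_read_aligned_list = bi->bi_next;
                bi->bi_next = NULL;
        }
        return bi;
}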
Source: http://blog.csdn.net/liumangxiong