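Writeback-related data structures
The main data structures involved in writeback are:
1. backing_dev_info: describes all the information about a backing device. A block device's request queue normally contains a backing_dev object.
2. bdi_writeback: wraps the writeback kernel thread together with the inode lists it has to work on.
3. wb_writeback_work: wraps a single writeback work item.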
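The relationships among these data structures are shown in the figure below:
Each data structure is introduced briefly in the following sections.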
bdi information
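When a block device is added, its bdi object must be registered on the system-wide bdi list. For ext3, the bdi object of the underlying block device has to be associated with the ext3 root_inode at mount time. The bdi object is defined as follows: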
struct backing_dev_info {
    struct list_head bdi_list;
    unsigned long ra_pages;     /* max readahead in PAGE_CACHE_SIZE units */
    unsigned long state;        /* Always use atomic bitops on this */
    unsigned int capabilities;  /* Device capabilities */
    congested_fn *congested_fn; /* Function pointer if device is md/dm */
    void *congested_data;       /* Pointer to aux data for congested func */

    char *name;

    struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];

    unsigned long bw_time_stamp;       /* last time write bw is updated */
    unsigned long dirtied_stamp;
    unsigned long written_stamp;       /* pages written at bw_time_stamp */
    unsigned long write_bandwidth;     /* the estimated write bandwidth */
    unsigned long avg_write_bandwidth; /* further smoothed write bw */

    /*
     * The base dirty throttle rate, re-calculated on every 200ms.
     * All the bdi tasks' dirty rate will be curbed under it.
     * @dirty_ratelimit tracks the estimated @balanced_dirty_ratelimit
     * in small steps and is much more smooth/stable than the latter.
     */
    unsigned long dirty_ratelimit;
    unsigned long balanced_dirty_ratelimit;

    struct prop_local_percpu completions;
    int dirty_exceeded;

    unsigned int min_ratio;
    unsigned int max_ratio, max_prop_frac;

    struct bdi_writeback wb;  /* default writeback info for this bdi (the writeback object) */
    spinlock_t wb_lock;       /* protects work_list */

    /* list of pending work items */
    struct list_head work_list;

    struct device *dev;

    /* timer used in laptop mode */
    struct timer_list laptop_mode_wb_timer;

#ifdef CONFIG_DEBUG_FS
    struct dentry *debug_dir;
    struct dentry *debug_stats;
#endif
};
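The bdi structure embeds a writeback object (wb), which describes the writeback kernel thread and wraps the inode lists to be processed. The bdi structure also holds a work_list, the queue of work items that the writeback kernel thread has to process. If there is no work on this list, the writeback kernel thread goes to sleep and waits.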
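The producer/consumer role of work_list can be illustrated with a minimal user-space sketch. Everything here is an assumption made for illustration: `struct work_item`, `struct bdi_model`, `bdi_model_queue_work` and `bdi_model_next_work` are hypothetical names, the list is singly linked instead of a kernel `list_head`, and there is no locking, whereas the kernel protects the list with wb_lock.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct work_item {
    long nr_pages;            /* how many pages this request asks for */
    struct work_item *next;   /* singly linked here; the kernel uses list_head */
};

struct bdi_model {
    struct work_item *head;   /* oldest pending work item */
    struct work_item *tail;   /* newest pending work item */
};

/* Append a work item at the tail, as a producer would. */
void bdi_model_queue_work(struct bdi_model *bdi, struct work_item *w)
{
    assert(bdi != NULL && w != NULL);
    w->next = NULL;
    if (bdi->tail)
        bdi->tail->next = w;
    else
        bdi->head = w;
    bdi->tail = w;
}

/* Detach and return the oldest work item, or NULL when the list is empty. */
struct work_item *bdi_model_next_work(struct bdi_model *bdi)
{
    struct work_item *w = bdi->head;

    if (!w)
        return NULL;          /* nothing to do: the thread would sleep here */
    bdi->head = w->next;
    if (!bdi->head)
        bdi->tail = NULL;
    w->next = NULL;
    return w;
}
```

A NULL return from `bdi_model_next_work` corresponds to the point where the real writeback thread would go to sleep until new work is queued.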
writeback
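The writeback object wraps the kernel thread's task and the inode lists to be processed. When the page cache/buffer cache needs to flush an inode whose dirty pages sit in its radix tree, the inode can be linked onto the writeback object's b_dirty list and the writeback thread is then woken up. During processing, the inode is moved to the b_io list. Using several lists reduces the amount of state shared between threads. The writeback structure is defined as follows: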
struct bdi_writeback {
    struct backing_dev_info *bdi;   /* our parent bdi */
    unsigned int nr;

    unsigned long last_old_flush;   /* last old data flush */
    unsigned long last_active;      /* last time bdi thread was active */

    struct task_struct *task;       /* writeback thread */
    struct timer_list wakeup_timer; /* used for delayed bdi thread wakeup */
    struct list_head b_dirty;       /* dirty inodes */
    struct list_head b_io;          /* parked for writeback */
    struct list_head b_more_io;     /* parked for more writeback */
    spinlock_t list_lock;           /* protects the b_* lists */
};
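The movement of inodes from b_dirty to b_io can be sketched in user space as follows. This is only a rough model under stated assumptions: `struct inode_model`, `struct inode_queue`, `struct wb_model`, `queue_push` and `wb_model_queue_io` are hypothetical names, fixed-size arrays replace the kernel's linked lists, the cutoff rule (move everything dirtied at or before a timestamp) is a simplification, and counter wrap-around and locking are ignored.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_INODES 16

/* Hypothetical stand-in for an inode: only what the sketch needs. */
struct inode_model {
    unsigned long ino;
    unsigned long dirtied_when;   /* timestamp of first dirtying */
};

/* A bounded array standing in for one of the b_* lists. */
struct inode_queue {
    struct inode_model *items[MAX_INODES];
    size_t count;
};

struct wb_model {
    struct inode_queue b_dirty;   /* dirty inodes */
    struct inode_queue b_io;      /* inodes selected for writeback */
};

/* Append an inode to a queue; the queue must not be full. */
void queue_push(struct inode_queue *q, struct inode_model *inode)
{
    assert(q->count < MAX_INODES);
    q->items[q->count++] = inode;
}

/*
 * Move every inode dirtied at or before 'cutoff' from b_dirty to b_io,
 * keeping the relative order of both groups. Returns the number moved.
 */
size_t wb_model_queue_io(struct wb_model *wb, unsigned long cutoff)
{
    size_t kept = 0, moved = 0;

    for (size_t i = 0; i < wb->b_dirty.count; i++) {
        struct inode_model *inode = wb->b_dirty.items[i];

        if (inode->dirtied_when <= cutoff) {
            queue_push(&wb->b_io, inode);
            moved++;
        } else {
            wb->b_dirty.items[kept++] = inode;   /* compact in place */
        }
    }
    wb->b_dirty.count = kept;
    return moved;
}
```

Because the writeback thread then works only on b_io, new inodes can keep arriving on b_dirty without disturbing the batch being written.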
writeback work
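The wb_writeback_work structure wraps a writeback task; different tasks can use different flushing policies. These work items are what the writeback thread operates on. If the work queue is empty, the kernel thread can go to sleep. wb_writeback_work is defined as follows: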
struct wb_writeback_work {
    long nr_pages;
    struct super_block *sb;         /* superblock object */
    unsigned long *older_than_this;
    enum writeback_sync_modes sync_mode;
    unsigned int tagged_writepages:1;
    unsigned int for_kupdate:1;
    unsigned int range_cyclic:1;
    unsigned int for_background:1;
    enum wb_reason reason;          /* why was writeback initiated? */

    struct list_head list;          /* pending work list, linked into bdi->work_list */
    struct completion *done;        /* set if the caller waits; signalled when the work completes */
};
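To make "different tasks, different policies" concrete, here is a small user-space sketch of how work items for three kinds of requests might be filled in. It is an illustration under assumptions, not the kernel's code: `struct work_model`, the `SYNC_*_MODEL` constants and the three `make_*` helpers are hypothetical, and the particular field choices (for example an unbounded page budget for background flushing) are a simplified reading of the flags shown above.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical mirror of the two sync modes. */
enum sync_mode_model { SYNC_NONE_MODEL, SYNC_ALL_MODEL };

/* Hypothetical, reduced version of a writeback work item. */
struct work_model {
    long nr_pages;                  /* page budget for this request */
    enum sync_mode_model sync_mode; /* wait for I/O or not */
    unsigned int for_kupdate:1;     /* periodic flush of old data */
    unsigned int for_background:1;  /* flush until below the background threshold */
    unsigned int range_cyclic:1;    /* resume where the last pass stopped */
    bool caller_waits;              /* stands in for the 'done' completion */
};

/* Background flushing: no page limit, never waits for I/O. */
struct work_model make_background_work(void)
{
    struct work_model w = {
        .nr_pages = LONG_MAX,
        .sync_mode = SYNC_NONE_MODEL,
        .for_background = 1,
        .range_cyclic = 1,
    };
    return w;
}

/* Periodic (kupdate-style) flushing of a bounded number of pages. */
struct work_model make_periodic_work(long nr_pages)
{
    assert(nr_pages > 0);
    struct work_model w = {
        .nr_pages = nr_pages,
        .sync_mode = SYNC_NONE_MODEL,
        .for_kupdate = 1,
        .range_cyclic = 1,
    };
    return w;
}

/* Data-integrity request: waits for I/O and the caller blocks on completion. */
struct work_model make_sync_work(long nr_pages)
{
    assert(nr_pages > 0);
    struct work_model w = {
        .nr_pages = nr_pages,
        .sync_mode = SYNC_ALL_MODEL,
        .caller_waits = true,
    };
    return w;
}
```

The point of the sketch is that the thread's processing loop stays the same for every request; only the fields of the work item change the behavior.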
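Analysis of the main writeback functions
The main functions of the writeback mechanism fall into two groups:
1. Managing bdi objects and creating (forking) the corresponding writeback kernel threads that flush cached data.
2. The writeback kernel thread's handler function, which performs the actual flushing of dirty pages.
Writeback thread management
Linux has a kernel daemon thread that manages the system bdi list and is responsible for creating a writeback thread for each block device. When a bdi has dirty pages but no kernel thread has been allocated for it yet, bdi_forker_thread allocates one; when a writeback thread has been idle for a long time, bdi_forker_thread releases it.
The writeback thread management code is analyzed below: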
static int bdi_forker_thread(void *ptr)
{
    struct bdi_writeback *me = ptr;

    current->flags |= PF_SWAPWRITE;
    set_freezable();

    /*
     * Our parent may run at a different priority, just set us to normal
     */
    set_user_nice(current, 0);

    for (;;) {
        struct task_struct *task = NULL;
        struct backing_dev_info *bdi;
        enum {
            NO_ACTION,   /* Nothing to do */
            FORK_THREAD, /* Fork bdi thread */
            KILL_THREAD, /* Kill inactive bdi thread */
        } action = NO_ACTION;

        /*
         * Temporary measure, we want to make sure we don't see
         * dirty data on the default backing_dev_info
         */
        if (wb_has_dirty_io(me) || !list_empty(&me->bdi->work_list)) {
            del_timer(&me->wakeup_timer);
            wb_do_writeback(me, 0);
        }

        spin_lock_bh(&bdi_lock);
        /*
         * In the following loop we are going to check whether we have
         * some work to do without any synchronization with tasks
         * waking us up to do work for them. Set the task state here
         * so that we don't miss wakeups after verifying conditions.
         */
        set_current_state(TASK_INTERRUPTIBLE);

        /*
         * Walk all bdi objects and check whether each one has dirty
         * data. If it does, a thread must be forked for it so that
         * writeback can be performed.
         */
        list_for_each_entry(bdi, &bdi_list, bdi_list) {
            bool have_dirty_io;

            if (!bdi_cap_writeback_dirty(bdi) ||
                bdi_cap_flush_forker(bdi))
                continue;

            WARN(!test_bit(BDI_registered, &bdi->state),
                 "bdi %p/%s is not registered!\n", bdi, bdi->name);

            /* Check whether there is dirty data */
            have_dirty_io = !list_empty(&bdi->work_list) ||
                            wb_has_dirty_io(&bdi->wb);

            /*
             * If the bdi has work to do, but the thread does not
             * exist - create it.
             */
            if (!bdi->wb.task && have_dirty_io) {
                /*
                 * Set the pending bit - if someone will try to
                 * unregister this bdi - it'll wait on this bit.
                 */
                /* Dirty data but no thread: fork a thread next */
                set_bit(BDI_pending, &bdi->state);
                action = FORK_THREAD;
                break;
            }

            spin_lock(&bdi->wb_lock);

            /*
             * If there is no work to do and the bdi thread was
             * inactive long enough - kill it. The wb_lock is taken
             * to make sure no-one adds more work to this bdi and
             * wakes the bdi thread up.
             */
            /*
             * If a bdi has had no dirty data for a long time, kill
             * the writeback thread that belongs to it.
             */
            if (bdi->wb.task && !have_dirty_io &&
                time_after(jiffies, bdi->wb.last_active +
                                    bdi_longest_inactive())) {
                task = bdi->wb.task;
                bdi->wb.task = NULL;
                spin_unlock(&bdi->wb_lock);
                set_bit(BDI_pending, &bdi->state);
                action = KILL_THREAD;
                break;
            }
            spin_unlock(&bdi->wb_lock);
        }
        spin_unlock_bh(&bdi_lock);

        /* Keep working if default bdi still has things to do */
        if (!list_empty(&me->bdi->work_list))
            __set_current_state(TASK_RUNNING);

        /* Carry out the FORK or KILL action */
        switch (action) {
        case FORK_THREAD:
            /* Fork a bdi_writeback_thread thread, named flush-major:minor */
            __set_current_state(TASK_RUNNING);
            task = kthread_create(bdi_writeback_thread, &bdi->wb,
                                  "flush-%s", dev_name(bdi->dev));
            if (IS_ERR(task)) {
                /*
                 * If thread creation fails, force writeout of
                 * the bdi from the thread. Hopefully 1024 is
                 * large enough for efficient IO.
                 */
                writeback_inodes_wb(&bdi->wb, 1024,
                                    WB_REASON_FORKER_THREAD);
            } else {
                /*
                 * The spinlock makes sure we do not lose
                 * wake-ups when racing with 'bdi_queue_work()'.
                 * And as soon as the bdi thread is visible, we
                 * can start it.
                 */
                spin_lock_bh(&bdi->wb_lock);
                bdi->wb.task = task;
                spin_unlock_bh(&bdi->wb_lock);
                wake_up_process(task);
            }
            bdi_clear_pending(bdi);
            break;

        case KILL_THREAD:
            /* Kill a thread */
            __set_current_state(TASK_RUNNING);
            kthread_stop(task);
            bdi_clear_pending(bdi);
            break;

        case NO_ACTION:
            /* Nothing to do: put this thread to sleep for a while */
            if (!wb_has_dirty_io(me) || !dirty_writeback_interval)
                /*
                 * There are no dirty data. The only thing we
                 * should now care about is checking for
                 * inactive bdi threads and killing them. Thus,
                 * let's sleep for longer time, save energy and
                 * be friendly for battery-driven devices.
                 */
                schedule_timeout(bdi_longest_inactive());
            else
                schedule_timeout(msecs_to_jiffies(dirty_writeback_interval * 10));
            try_to_freeze();
            break;
        }
    }

    return 0;
}
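The per-bdi decision made inside the loop can be isolated as a pure function, which makes the three outcomes easier to see. The following user-space sketch is a simplified model: `forker_decide` and `time_after_model` are hypothetical names, the timestamps are plain `unsigned long` values rather than jiffies, and all locking and the pending bit are left out. The wrap-safe comparison relies on the usual two's-complement conversion from unsigned to signed, as the kernel's own macro does.

```c
#include <assert.h>
#include <stdbool.h>

enum forker_action { NO_ACTION, FORK_THREAD, KILL_THREAD };

/* Wrap-safe "a is later than b" for a free-running tick counter. */
bool time_after_model(unsigned long a, unsigned long b)
{
    return (long)(b - a) < 0;
}

/*
 * Decide what the forker should do for one bdi:
 *  - dirty data but no thread       -> FORK_THREAD
 *  - thread, no dirty data, idle for
 *    longer than longest_inactive   -> KILL_THREAD
 *  - anything else                  -> NO_ACTION
 */
enum forker_action forker_decide(bool has_task, bool have_dirty_io,
                                 unsigned long now,
                                 unsigned long last_active,
                                 unsigned long longest_inactive)
{
    assert(longest_inactive > 0);   /* a zero window would kill threads at once */

    if (!has_task && have_dirty_io)
        return FORK_THREAD;

    if (has_task && !have_dirty_io &&
        time_after_model(now, last_active + longest_inactive))
        return KILL_THREAD;

    return NO_ACTION;
}
```

Note that a bdi with a live thread and pending dirty data yields NO_ACTION: the forker leaves the existing writeback thread to do the work.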
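The writeback thread
The writeback thread is created by bdi_forker_thread, and its job is to process pending flush requests. The thread's handler function is bdi_writeback_thread, which calls wb_do_writeback to do the actual work. That function is analyzed below: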
long wb_do_writeback(struct bdi_writeback *wb, int force_wait)
{
    struct backing_dev_info *bdi = wb->bdi;
    struct wb_writeback_work *work;
    long wrote = 0;

    set_bit(BDI_writeback_running, &wb->bdi->state);

    /* Process pending work; all waiting work items are queued on bdi->work_list */
    while ((work = get_next_work_item(bdi)) != NULL) {
        /*
         * Override sync mode, in case we must wait for completion
         * because this thread is exiting now.
         */
        if (force_wait)
            work->sync_mode = WB_SYNC_ALL;

        trace_writeback_exec(bdi, work);

        /* Call wb_writeback to process the corresponding inodes */
        wrote += wb_writeback(wb, work);

        /*
         * Notify the caller of completion if this is a synchronous
         * work item, otherwise just free it.
         */
        /* Tell the upper layer that the work item has completed */
        if (work->done)
            complete(work->done);
        else
            kfree(work);
    }

    /*
     * Check for periodic writeback, kupdated() style
     */
    /*
     * Handle periodic dirty-page flushing. The buffer cache takes this
     * path: the functions below create a work item and call
     * wb_writeback to process it.
     */
    wrote += wb_check_old_data_flush(wb);
    wrote += wb_check_background_flush(wb);
    clear_bit(BDI_writeback_running, &wb->bdi->state);

    return wrote;
}
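The ownership rule in the loop above (notify a waiting caller, otherwise free the item) can be modeled in user space as follows. This is a sketch under assumptions: `struct job` and `drain_jobs` are hypothetical names, a `bool` flag stands in for the kernel completion, `free` stands in for `kfree`, and every job is treated as if all of its requested pages were written.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a queued writeback request. */
struct job {
    long nr_pages;     /* pages requested */
    bool sync_all;     /* true: wait for the I/O to finish */
    bool *done;        /* non-NULL if a caller waits on this job */
    struct job *next;  /* next pending job */
};

/*
 * Drain the pending list. Jobs with a waiter are owned by the caller and
 * only signalled; jobs without one were heap-allocated and are freed here.
 * Returns the total number of pages "written".
 */
long drain_jobs(struct job **list, bool force_wait)
{
    long wrote = 0;
    struct job *job;

    assert(list != NULL);
    while ((job = *list) != NULL) {
        *list = job->next;          /* detach from the pending list */

        if (force_wait)
            job->sync_all = true;   /* thread is exiting: must wait */

        wrote += job->nr_pages;     /* model: every page gets written */

        if (job->done)
            *job->done = true;      /* wake the waiting caller */
        else
            free(job);              /* fire-and-forget job: release it */
    }
    return wrote;
}
```

The split mirrors a common pattern: a synchronous caller keeps the request in its own storage and waits for the signal, while an asynchronous caller hands ownership to the worker.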
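Summary
This article walked through the writeback code as of linux-3.2. Overall, the writeback mechanism is fairly simple: at its core, a resident kernel thread allocates a writeback thread for each bdi object, and those threads flush the dirty pages held in the cache.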