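Page cache
The page cache is described in detail in Section 5.6, "Writing and Reading Files", of the book 《linux內核情景分析》 (Linux Kernel Scenario Analysis); the key points are excerpted here.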
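The filesystem layer has three main data structures: the file structure, the dentry structure, and the inode structure.
file structure: represents one context (one open instance) of the target file. Different processes can establish different contexts on the same file, and a single process can establish several contexts by opening the same file more than once. The buffer queue therefore cannot live in the file structure, because these file structures share nothing with one another.
dentry structure: represents a file name. Through links, several dentry structures can refer to the same file (hard links in particular give one inode several names), so dentries and files are not in a one-to-one relationship either, and the buffer queue cannot be attached to this structure.
inode structure: that leaves only the inode. An inode and a file are in a one-to-one relationship; one could say the inode is the file. The inode carries an i_mapping pointer to an address_space structure, which is normally the inode's own embedded i_data field (that is, i_mapping == &inode->i_data), and the buffer queue lives in that structure.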
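What hangs on the buffer queue is memory pages, not record blocks. So when a process calls mmap() to map a file into its user address space, only the corresponding memory mapping tables need to be set up, and the cached pages are mapped naturally into the process's user space. That is why the field is named i_mapping.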
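The relationship between the three structures can be sketched in user-space C. This is only an illustration: the structures below are heavily reduced stand-ins for the kernel ones, and `inode_init` and `file_mapping` are hypothetical helper names, not kernel functions. The point it shows is that however many file and dentry objects exist, they all reach the same address_space through the single inode.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced stand-ins for the kernel structures (illustration only). */
struct address_space {
    unsigned long nrpages;              /* pages cached for this file */
};

struct inode {
    struct address_space *i_mapping;    /* normally points at i_data below */
    struct address_space  i_data;       /* the embedded cache descriptor */
};

struct dentry {
    const char   *name;                 /* one name for the file */
    struct inode *d_inode;              /* several dentries may share this */
};

struct file {
    struct dentry *f_dentry;            /* the name used to open the file */
    long           f_pos;               /* per-open position: not shared */
};

/* Hypothetical helper: set up the usual i_mapping == &i_data link. */
static void inode_init(struct inode *inode)
{
    inode->i_data.nrpages = 0;
    inode->i_mapping = &inode->i_data;
}

/* Hypothetical helper: walk file -> dentry -> inode -> address_space. */
static struct address_space *file_mapping(const struct file *f)
{
    assert(f && f->f_dentry && f->f_dentry->d_inode);
    return f->f_dentry->d_inode->i_mapping;
}
```

Two opens through two different names still end up at one cache: with `struct dentry a = {"a", &ino}, b = {"b", &ino};` and one file object on each, `file_mapping()` returns the same pointer for both, while `f_pos` stays private to each open.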
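The radix tree concept is also needed here. Look at the figure first (image from 《深入linux內核架構》, Professional Linux Kernel Architecture).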
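A radix tree is not a balanced tree. The tree is built from two kinds of data structure: the root and the interior (non-leaf) nodes. The root is a simple structure that holds the height of the tree and a pointer to the first node. Each node is essentially an array: count is the number of pointers in use in that node, and the other entries are pointers to nodes on the next level down. The leaves are pointers to page instances.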
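The layout described above can be sketched as a small user-space radix tree. This is a simplified model, not the kernel implementation: it assumes 6 index bits per level, treats height 0 as an empty tree, never frees memory, and the names `rnode`, `rroot`, `tree_insert` and `tree_lookup` are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAP_SHIFT 6                       /* index bits consumed per level */
#define MAP_SIZE  (1UL << MAP_SHIFT)      /* 64 slots per node */
#define MAP_MASK  (MAP_SIZE - 1)

struct rnode {
    unsigned int count;                   /* number of non-NULL slots */
    void *slots[MAP_SIZE];                /* child nodes, or items at the bottom */
};

struct rroot {
    unsigned int height;                  /* number of levels; 0 = empty tree */
    struct rnode *rnode;                  /* first node of the tree */
};

/* Largest index a tree of the given height can address. */
static unsigned long max_index(unsigned int height)
{
    unsigned int bits = height * MAP_SHIFT;
    if (bits >= sizeof(unsigned long) * 8)
        return ~0UL;
    return (1UL << bits) - 1;
}

static int tree_insert(struct rroot *root, unsigned long index, void *item)
{
    assert(item != NULL);
    /* Grow upwards: the old top node becomes slot 0 of a new top node. */
    while (root->height == 0 || index > max_index(root->height)) {
        struct rnode *n = calloc(1, sizeof(*n));
        if (!n)
            return -1;
        if (root->rnode) {
            n->slots[0] = root->rnode;
            n->count = 1;
        }
        root->rnode = n;
        root->height++;
    }
    /* Walk down, creating interior nodes on demand. */
    struct rnode *node = root->rnode;
    for (unsigned int level = root->height; level > 1; level--) {
        unsigned long slot = (index >> ((level - 1) * MAP_SHIFT)) & MAP_MASK;
        if (!node->slots[slot]) {
            struct rnode *n = calloc(1, sizeof(*n));
            if (!n)
                return -1;
            node->slots[slot] = n;
            node->count++;
        }
        node = node->slots[slot];
    }
    if (!node->slots[index & MAP_MASK])
        node->count++;
    node->slots[index & MAP_MASK] = item;
    return 0;
}

static void *tree_lookup(const struct rroot *root, unsigned long index)
{
    if (root->height == 0 || index > max_index(root->height))
        return NULL;
    const struct rnode *node = root->rnode;
    for (unsigned int level = root->height; level > 1; level--) {
        node = node->slots[(index >> ((level - 1) * MAP_SHIFT)) & MAP_MASK];
        if (!node)
            return NULL;                  /* hole: nothing cached here */
    }
    return node->slots[index & MAP_MASK];
}
```

The tree is not balanced in the usual sense: its height depends only on the largest index stored, so a file with one page at a high offset gets a tall, sparse tree, and a lookup costs one array access per level.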
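The node structure also contains search tags, such as a dirty tag and a writeback tag, so a search can tell quickly which parts of the tree contain tagged pages.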
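Why tags make the search fast can be shown with a fixed two-level sketch. It is an assumption-laden model rather than the kernel layout: one 64-bit dirty bitmap per node, a fixed height of two (4096 pages), and hypothetical function names. A bit in the upper node means "some slot below this child is dirty", so a clean subtree is skipped with a single test.

```c
#include <assert.h>
#include <stdint.h>

#define MAP_SHIFT 6
#define MAP_SIZE  64u

struct leaf_node {
    uint64_t dirty;                       /* bit i set: slot i is dirty */
};

struct top_node {
    uint64_t dirty;                       /* bit i set: child i has a dirty slot */
    struct leaf_node child[MAP_SIZE];
};

/* Tag a page as dirty and propagate the tag to the parent node. */
static void tag_set_dirty(struct top_node *top, unsigned int index)
{
    assert(index < MAP_SIZE * MAP_SIZE);
    unsigned int hi = index >> MAP_SHIFT, lo = index & (MAP_SIZE - 1);
    top->child[hi].dirty |= UINT64_C(1) << lo;
    top->dirty |= UINT64_C(1) << hi;
}

/* Clear the tag; the parent bit is cleared only when the child is clean. */
static void tag_clear_dirty(struct top_node *top, unsigned int index)
{
    assert(index < MAP_SIZE * MAP_SIZE);
    unsigned int hi = index >> MAP_SHIFT, lo = index & (MAP_SIZE - 1);
    top->child[hi].dirty &= ~(UINT64_C(1) << lo);
    if (top->child[hi].dirty == 0)
        top->dirty &= ~(UINT64_C(1) << hi);
}

/* Return the first dirty index >= start, or -1 if there is none. */
static long next_dirty(const struct top_node *top, unsigned int start)
{
    unsigned int first = start >> MAP_SHIFT;
    for (unsigned int hi = first; hi < MAP_SIZE; hi++) {
        if (!(top->dirty & (UINT64_C(1) << hi)))
            continue;                     /* whole subtree clean: skip 64 slots */
        unsigned int lo = (hi == first) ? (start & (MAP_SIZE - 1)) : 0;
        for (; lo < MAP_SIZE; lo++)
            if (top->child[hi].dirty & (UINT64_C(1) << lo))
                return (long)((hi << MAP_SHIFT) | lo);
    }
    return -1;
}
```

A writeback pass would call something like `next_dirty()` repeatedly, advancing `start` past each hit, and so never visit the clean parts of the tree.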
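Block buffers
Structurally, a block buffer consists of two parts:
1. The buffer head: holds all the management data related to the state of the buffer, such as the block number, block size, and access counter. The buffered data is not stored directly after the buffer head; it lives in a separate area of physical memory that a pointer in the buffer head refers to.
2. The useful data is kept in specially allocated pages, and these pages may also be present in the page cache at the same time.
The buffer head: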
/*
 * Historically, a buffer_head was used to map a single block
 * within a page, and of course as the unit of I/O through the
 * filesystem and block layers.  Nowadays the basic I/O unit
 * is the bio, and buffer_heads are used for extracting block
 * mappings (via a get_block_t call), for tracking state within
 * a page (via a page_mapping) and for wrapping bio submission
 * for backward compatibility reasons (e.g. submit_bh).
 */
struct buffer_head {
    unsigned long b_state;              /* buffer state bitmap; see bh_state_bits */
    struct buffer_head *b_this_page;    /* circular list of page's buffers: next buffer head */
    struct page *b_page;                /* the page this bh is mapped to (its page descriptor) */

    sector_t b_blocknr;                 /* start block number: logical block on the device */
    size_t b_size;                      /* size of mapping: the block size */
    char *b_data;                       /* pointer to data within the page */

    struct block_device *b_bdev;        /* the block device descriptor */
    bh_end_io_t *b_end_io;              /* I/O completion callback */
    void *b_private;                    /* reserved for b_end_io: its data argument */
    struct list_head b_assoc_buffers;   /* associated with another mapping */
    struct address_space *b_assoc_map;  /* mapping this buffer is associated with */
    atomic_t b_count;                   /* users using this buffer_head (usage counter) */
};
Common buffer head state flags:
enum bh_state_bits {
    BH_Uptodate,      /* Contains valid data */
    BH_Dirty,         /* Is dirty */
    BH_Lock,          /* Is locked */
    BH_Req,           /* Has been submitted for I/O */
    BH_Uptodate_Lock, /* Used by the first bh in a page, to serialise
                       * IO completion of other buffers in the page
                       */
    BH_Mapped,        /* Has a disk mapping: b_bdev and b_blocknr are valid */
    BH_New,           /* Disk mapping was newly created by get_block; not accessed yet */
    BH_Async_Read,    /* Is under end_buffer_async_read I/O */
    BH_Async_Write,   /* Is under end_buffer_async_write I/O */
    BH_Delay,         /* Buffer is not yet allocated on disk */
    BH_Boundary,      /* Block is followed by a discontiguity */
    BH_Write_EIO,     /* I/O error on write */
    BH_Unwritten,     /* Buffer is allocated on disk but not written */
    BH_Quiet,         /* Buffer Error Prinks to be quiet */
    BH_Meta,          /* Buffer contains metadata */
    BH_Prio,          /* Buffer should be submitted with REQ_PRIO */
    BH_PrivateStart,  /* not a state bit, but the first bit available
                       * for private allocation by other entities
                       */
};
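One detail worth spelling out is that the enum values are bit numbers inside b_state, not ready-made masks. The user-space sketch below shows the idea with a subset of the flags. The helpers `bh_set`, `bh_clear` and `bh_test` are hypothetical and non-atomic; the kernel itself manipulates these bits with atomic bit operations through generated helpers such as `buffer_dirty()` and `set_buffer_dirty()`.

```c
#include <assert.h>
#include <stdbool.h>

/* A subset of the flags, in the same order as the kernel enum. */
enum bh_state_bits { BH_Uptodate, BH_Dirty, BH_Lock, BH_Req };

struct bh_sketch {
    unsigned long b_state;                /* one bit per flag */
};

#define BH_BITS ((int)(sizeof(unsigned long) * 8))

/* Non-atomic stand-ins for the kernel's bit operations. */
static void bh_set(struct bh_sketch *bh, int bit)
{
    assert(bit >= 0 && bit < BH_BITS);
    bh->b_state |= 1UL << bit;
}

static void bh_clear(struct bh_sketch *bh, int bit)
{
    assert(bit >= 0 && bit < BH_BITS);
    bh->b_state &= ~(1UL << bit);
}

static bool bh_test(const struct bh_sketch *bh, int bit)
{
    assert(bit >= 0 && bit < BH_BITS);
    return (bh->b_state >> bit) & 1UL;
}
```

For instance, after `bh_set(&bh, BH_Dirty)` on a zeroed buffer head, b_state holds 2 (bit 1), not 1, which is why comparing b_state directly against an enum value would be a bug.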
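As the figure above shows, one buffer page corresponds to 4 buffers (for example, 1 KiB blocks in a 4 KiB page). This is what unifies the page cache and the buffer cache: a change made through a buffer or through its page is visible through the other, because both refer to the same memory.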
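How one page carries several buffers can be sketched as follows. The model assumes a 4096-byte page, stores the buffer heads in a fixed array inside a made-up `page_sketch` structure, and assumes the blocks are consecutive on disk, which a real filesystem does not guarantee. The function `attach_buffers` is a hypothetical name. What it does share with the kernel layout is that b_data points into the page itself and b_this_page forms a circular list.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u            /* assumed page size */
#define SKETCH_MAX_BUFS  8u               /* smallest block modelled: 512 bytes */

struct bh {
    struct bh    *b_this_page;            /* next buffer of the same page (ring) */
    char         *b_data;                 /* points into the page's memory */
    size_t        b_size;                 /* block size */
    unsigned long b_blocknr;              /* block number on the device */
};

struct page_sketch {
    char       data[SKETCH_PAGE_SIZE];    /* the page contents */
    struct bh  bufs[SKETCH_MAX_BUFS];     /* simplification: heads stored inline */
    struct bh *private;                   /* first buffer head of the ring */
};

/* Split one page into blocksize-sized buffers and link them in a ring. */
static unsigned int attach_buffers(struct page_sketch *page, size_t blocksize,
                                   unsigned long first_block)
{
    assert(blocksize > 0 && SKETCH_PAGE_SIZE % blocksize == 0);
    unsigned int n = (unsigned int)(SKETCH_PAGE_SIZE / blocksize);
    assert(n <= SKETCH_MAX_BUFS);

    for (unsigned int i = 0; i < n; i++) {
        struct bh *bh = &page->bufs[i];
        bh->b_data = page->data + i * blocksize;   /* no separate copy of the data */
        bh->b_size = blocksize;
        bh->b_blocknr = first_block + i;           /* assumes contiguous blocks */
        bh->b_this_page = &page->bufs[(i + 1) % n];
    }
    page->private = &page->bufs[0];
    return n;
}
```

With a block size of 1024 the function returns 4, and a byte written through `bufs[1].b_data[0]` is the same byte as `data[1024]` in the page, which is the sense in which the two caches share storage.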
The address_space structure:
struct address_space {
    struct inode *host;                 /* owner: inode, block_device (inode of the host file) */
    struct radix_tree_root page_tree;   /* radix tree of all pages (the tree root) */
    spinlock_t tree_lock;               /* and lock protecting it */
    unsigned int i_mmap_writable;       /* count VM_SHARED mappings */
    struct rb_root i_mmap;              /* tree of private and shared mappings */
    struct list_head i_mmap_nonlinear;  /* list VM_NONLINEAR mappings */
    struct mutex i_mmap_mutex;          /* protect tree, count, list */
    /* Protected by tree_lock together with the radix tree */
    unsigned long nrpages;              /* number of total pages */
    pgoff_t writeback_index;            /* writeback starts here */
    const struct address_space_operations *a_ops;   /* methods (function pointers) */
    unsigned long flags;                /* error bits/gfp mask */
    struct backing_dev_info *backing_dev_info;      /* device readahead, etc */
    spinlock_t private_lock;            /* for use by the address_space */
    struct list_head private_list;      /* ditto */
    void *private_data;                 /* ditto */
} __attribute__((aligned(sizeof(long))));
struct inode *host and struct radix_tree_root page_tree are what tie the file to its memory pages.
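The key into page_tree is the page's index within the file, so a byte offset has to be split into a page index and an offset inside that page. A minimal sketch, assuming 4 KiB pages; `page_pos` and `pos_to_page` are hypothetical names used only here:

```c
#include <assert.h>

#define SKETCH_PAGE_SHIFT 12                          /* assumed: 4 KiB pages */
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)

static_assert((SKETCH_PAGE_SIZE & (SKETCH_PAGE_SIZE - 1)) == 0,
              "page size must be a power of two");

struct page_pos {
    unsigned long index;                  /* key used in the radix tree */
    unsigned long offset;                 /* byte offset inside that page */
};

/* Split a byte position in the file into (page index, offset in page). */
static struct page_pos pos_to_page(unsigned long long pos)
{
    struct page_pos p;
    p.index  = (unsigned long)(pos >> SKETCH_PAGE_SHIFT);
    p.offset = (unsigned long)(pos & (SKETCH_PAGE_SIZE - 1));
    return p;
}
```

For example, file offset 10000 falls in page index 2 at offset 1808, since 2 * 4096 = 8192 and 10000 - 8192 = 1808.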
struct address_space_operations {
    /* Write: from the page to the owner's image on disk */
    int (*writepage)(struct page *page, struct writeback_control *wbc);
    /* Read: from the owner's image on disk into the page */
    int (*readpage)(struct file *, struct page *);

    /* Write back some dirty pages from this mapping. */
    int (*writepages)(struct address_space *, struct writeback_control *);

    /* Set a page dirty.  Return true if this dirtied it */
    int (*set_page_dirty)(struct page *page);

    /* Read a list of the owner's pages from disk */
    int (*readpages)(struct file *filp, struct address_space *mapping,
            struct list_head *pages, unsigned nr_pages);

    int (*write_begin)(struct file *, struct address_space *mapping,
            loff_t pos, unsigned len, unsigned flags,
            struct page **pagep, void **fsdata);
    int (*write_end)(struct file *, struct address_space *mapping,
            loff_t pos, unsigned len, unsigned copied,
            struct page *page, void *fsdata);

    /* Unfortunately this kludge is needed for FIBMAP. Don't use it */
    sector_t (*bmap)(struct address_space *, sector_t);
    void (*invalidatepage) (struct page *, unsigned long);
    int (*releasepage) (struct page *, gfp_t);
    void (*freepage)(struct page *);
    ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
            loff_t offset, unsigned long nr_segs);
    int (*get_xip_mem)(struct address_space *, pgoff_t, int,
            void **, unsigned long *);
    /*
     * migrate the contents of a page to the specified target. If sync
     * is false, it must not block.
     */
    int (*migratepage) (struct address_space *,
            struct page *, struct page *, enum migrate_mode);
    int (*launder_page) (struct page *);
    int (*is_partially_uptodate) (struct page *, read_descriptor_t *,
            unsigned long);
    int (*error_remove_page)(struct address_space *, struct page *);

    /* swapfile support */
    int (*swap_activate)(struct swap_info_struct *sis, struct file *file,
            sector_t *span);
    void (*swap_deactivate)(struct file *file);
};
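The operations table is an ordinary table of function pointers: generic cache code calls through a_ops without knowing which filesystem sits behind the mapping. The sketch below models that dispatch in user space. Everything in it is invented for illustration (the `_sketch` structures, the 16-byte "page", the in-memory "disk", and the `demo_` functions), and the signatures are deliberately simpler than the kernel's.

```c
#include <assert.h>
#include <string.h>

struct page_sketch {
    char data[16];                        /* tiny stand-in for a page */
    int  uptodate;                        /* 1 once data reflects the disk */
};

struct mapping_sketch;

struct aops_sketch {
    int (*readpage)(struct mapping_sketch *, struct page_sketch *);
    int (*writepage)(struct mapping_sketch *, struct page_sketch *);
};

struct mapping_sketch {
    const struct aops_sketch *a_ops;      /* methods supplied by the "filesystem" */
    char disk[16];                        /* stand-in for the backing store */
};

/* A hypothetical filesystem's read method: backing store -> page. */
static int demo_readpage(struct mapping_sketch *m, struct page_sketch *page)
{
    memcpy(page->data, m->disk, sizeof(page->data));
    page->uptodate = 1;
    return 0;
}

/* A hypothetical filesystem's write method: page -> backing store. */
static int demo_writepage(struct mapping_sketch *m, struct page_sketch *page)
{
    memcpy(m->disk, page->data, sizeof(m->disk));
    return 0;
}

static const struct aops_sketch demo_aops = {
    .readpage  = demo_readpage,
    .writepage = demo_writepage,
};

/* Generic code: fill the page through whatever method the mapping provides. */
static int read_through_cache(struct mapping_sketch *m, struct page_sketch *page)
{
    assert(m && m->a_ops && page);
    if (page->uptodate)
        return 0;                         /* cache hit: no I/O needed */
    if (!m->a_ops->readpage)
        return -1;                        /* this mapping cannot be read */
    return m->a_ops->readpage(m, page);
}
```

A caller would set `.a_ops = &demo_aops` on a mapping and then use only `read_through_cache()`; swapping in another table changes where the data comes from without touching the generic code.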