I have been reading APUE lately, and one of its chapters covers file systems, so here I will put together a brief summary of the Linux virtual file system. Some source code is included, but nothing too deep.
In the previous blog post we ran into a few data structures; as before, let's start from observable behavior and work our way down. We already know that the process descriptor's file-system-related structures include struct files_struct; besides that there is also:
struct fs_struct {
    int users;
    spinlock_t lock;
    seqcount_t seq;
    int umask;
    int in_exec;
    struct path root, pwd;
};
There is also this structure:
struct nsproxy {
    atomic_t count;
    struct uts_namespace *uts_ns;
    struct ipc_namespace *ipc_ns;
    struct mnt_namespace *mnt_ns;
    struct pid_namespace *pid_ns_for_children;
    struct net *net_ns;
};
Looking at the contents of these two structures, neither has much to do with basic file operations (read, write, and so on), so we need to return to struct files_struct and examine it:
struct files_struct {
    /*
     * read mostly part
     */
    atomic_t count;
    struct fdtable __rcu *fdt;
    struct fdtable fdtab;
    /*
     * written part on a separate cache line in SMP
     */
    spinlock_t file_lock ____cacheline_aligned_in_smp;
    int next_fd;
    unsigned long close_on_exec_init[1];
    unsigned long open_fds_init[1];
    struct file __rcu * fd_array[NR_OPEN_DEFAULT];
};
The following is taken from LKD (Linux Kernel Development). I cannot verify it experimentally, because these structures live inside the kernel, and I do not yet know how to debug the kernel.
The fd_array array holds pointers to the open file objects. Since NR_OPEN_DEFAULT is bounded, if a process opens more files than this limit the kernel allocates a new array and points fdt at it. We briefly looked at struct fdtable before; here it is again:
struct fdtable {
    unsigned int max_fds;
    struct file __rcu **fd;  /* current fd array */
    unsigned long *close_on_exec;
    unsigned long *open_fds;
    struct rcu_head rcu;
};
The fd field here plays the same role as fd_array: both point to the open file objects.
Now that file objects have come up, let's study them in some detail. According to the material I have read so far (Linux Kernel Development and Understanding the Linux Kernel), the virtual file system (VFS) has four primary object types:
The superblock object, which represents a specific mounted filesystem. The inode object, which represents a specific file. The dentry object, which represents a directory entry, a single component of a path. The file object, which represents a file opened by a process. Here I borrow a figure from Understanding the Linux Kernel to show the relationships among these four object types.

First, struct file, whose definition is essentially:
struct file {
    union {
        struct llist_node fu_llist;
        struct rcu_head fu_rcuhead;
    } f_u;
    struct path f_path;
    struct inode *f_inode;  /* cached value */
    const struct file_operations *f_op;
    /*
     * Protects f_ep_links, f_flags.
     * Must not be taken from IRQ context.
     */
    spinlock_t f_lock;
    atomic_long_t f_count;
    unsigned int f_flags;
    fmode_t f_mode;
    struct mutex f_pos_lock;
    loff_t f_pos;
    struct fown_struct f_owner;
    const struct cred *f_cred;
    struct file_ra_state f_ra;
    u64 f_version;
#ifdef CONFIG_SECURITY
    void *f_security;
#endif
    /* needed for tty driver, and maybe others */
    void *private_data;
#ifdef CONFIG_EPOLL
    /* Used by fs/eventpoll.c to link all the hooks to this file */
    struct list_head f_ep_links;
    struct list_head f_tfile_llink;
#endif /* #ifdef CONFIG_EPOLL */
    struct address_space *f_mapping;
} __attribute__((aligned(4)));  /* lest something weird decides that 2 is OK */
The file object is the in-memory representation of an open file. The object (not the physical file) is created by the open system call and destroyed by close, and all these file-related calls are really methods defined in the file operations table. Because several processes can open and operate on the same file simultaneously, a single file may have several file objects at once. The file object represents an open file only from a process's point of view; it in turn points to a dentry object, and it is really the dentry that stands for the actual open file. So a file's file object is not unique: every open() yields a new file descriptor, and even the same process opening the same file twice gets distinct descriptors, which refer to distinct file objects in fd_array. But while the file objects are not unique, the corresponding inode and dentry certainly are.
Three fields here are particularly important:
struct path f_path;
struct inode *f_inode; /* cached value */
const struct file_operations *f_op;
First, f_path, defined in include/linux/path.h:
struct path {
    struct vfsmount *mnt;
    struct dentry *dentry;
};
Next, the f_inode field. Its type is a pointer to an inode object, which differs from the figure above: here the file object is linked directly to the inode. It also differs from the descriptions in Linux Kernel Development and Understanding the Linux Kernel, where the file object has no such field; it was probably introduced sometime after 2.6.
Still, the comment lets us make a simple guess at its purpose: f_inode likely caches the inode, so it can be accessed directly without going through the dentry object.
Next, struct file_operations, which defines all the operations available on a file object:
struct file_operations {
    struct module *owner;
    loff_t (*llseek) (struct file *, loff_t, int);
    ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
    ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
    ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
    ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
    ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
    ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
    int (*iterate) (struct file *, struct dir_context *);
    unsigned int (*poll) (struct file *, struct poll_table_struct *);
    long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
    long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
    int (*mmap) (struct file *, struct vm_area_struct *);
    void (*mremap)(struct file *, struct vm_area_struct *);
    int (*open) (struct inode *, struct file *);
    int (*flush) (struct file *, fl_owner_t id);
    int (*release) (struct inode *, struct file *);
    int (*fsync) (struct file *, loff_t, loff_t, int datasync);
    int (*aio_fsync) (struct kiocb *, int datasync);
    int (*fasync) (int, struct file *, int);
    int (*lock) (struct file *, int, struct file_lock *);
    ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
    unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
    int (*check_flags)(int);
    int (*flock) (struct file *, int, struct file_lock *);
    ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
    ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
    int (*setlease)(struct file *, long, struct file_lock **, void **);
    long (*fallocate)(struct file *file, int mode, loff_t offset,
                      loff_t len);
    void (*show_fdinfo)(struct seq_file *m, struct file *f);
};
Even this brief look at the file object shows that the VFS is implemented in a strongly object-oriented style: each object bundles the data being operated on together with the functions that operate on that data.
After this brief analysis of the file object, we go one layer down to the dentry object. The VFS treats directories as files, so a given path may contain both directory files and regular files, and each component of the path is represented by an inode object. Although inodes can represent all of them, the VFS constantly performs directory-oriented operations such as pathname lookup. Resolving a pathname means examining every component, verifying that it is valid, and then going on to find the next component. To make such lookups easier, the VFS introduces the concept of a directory entry (dentry): each dentry represents one specific component of a path. One point must be clear: every component of a path, the final regular file included, is a dentry object. Resolving a path and walking its components is no trivial exercise; it is a time-consuming, string-comparison-heavy process, slow to execute and tedious to code. Dentry objects make this simpler (I cannot yet say exactly how much worse things would be without them).
Back to the main topic. The dentry object is defined in include/linux/dcache.h:
struct dentry {
    /* RCU lookup touched fields */
    unsigned int d_flags;       /* protected by d_lock */
    seqcount_t d_seq;           /* per dentry seqlock */
    struct hlist_bl_node d_hash;    /* lookup hash list */
    struct dentry *d_parent;    /* parent directory */
    struct qstr d_name;
    struct inode *d_inode;      /* Where the name belongs to - NULL is
                                 * negative */
    unsigned char d_iname[DNAME_INLINE_LEN];    /* small names */
    /* Ref lookup also touches following */
    struct lockref d_lockref;   /* per-dentry lock and refcount */
    const struct dentry_operations *d_op;
    struct super_block *d_sb;   /* The root of the dentry tree */
    unsigned long d_time;       /* used by d_revalidate */
    void *d_fsdata;             /* fs-specific data */
    struct list_head d_lru;     /* LRU list */
    struct list_head d_child;   /* child of parent list */
    struct list_head d_subdirs; /* our children */
    /*
     * d_alias and d_rcu can share memory
     */
    union {
        struct hlist_node d_alias;  /* inode alias list */
        struct rcu_head d_rcu;
    } d_u;
};
The following draws directly on Linux Kernel Development and Understanding the Linux Kernel.
A dentry object can be in one of three states: used, unused, or negative.
A used dentry corresponds to a valid inode (d_inode points to it) and has one or more active users (its reference count is positive); its contents cannot be discarded. An unused dentry also corresponds to a valid inode, but the VFS is not currently using it (its reference count is zero). It still points to a valid object and is kept in the cache so it can be reused, which makes later path lookups faster; its contents may be discarded when memory must be reclaimed. A negative dentry has no valid inode (d_inode is NULL), because the inode was deleted or the path is no longer correct; the dentry is retained anyway, and kept in the dentry cache, so that future lookups of the same name resolve quickly. Its contents too can be discarded when necessary. The dentry cache has now come up, so let's look at it briefly.
Reading a directory entry from disk and constructing the corresponding dentry object takes significant time, and having finished with a dentry we may well need it again soon, so keeping it in memory pays off. To get the most out of these dentry objects, Linux maintains a dentry cache built from two kinds of data structures:
First, a set of dentry objects in the used, unused, or negative state, together with a hash table from which the dentry for a given file or directory name can be fetched quickly; if the object is not in the dentry cache, the hash lookup returns NULL. Dentries in use are placed on a doubly linked list headed by the i_dentry field of the corresponding inode object (each inode may be associated with several hard links, hence a list). The dentry's d_alias field links the neighboring elements. In the code above these fields are of type struct hlist_head and struct hlist_node (the books describe them as struct list_head, which older kernels used).
Unused and negative dentries are placed on a least-recently-used (LRU) doubly linked list. Insertion always happens at the head, so entries near the head are always newer than entries near the tail. Whenever the kernel shrinks the dentry cache, negative dentries drift toward the tail of the LRU list and are gradually freed from there.
Second, the hash table and its hash function, which is what turns a given path into the corresponding dentry quickly.
Next, a quick look at the dentry operations:
struct dentry_operations {
    int (*d_revalidate)(struct dentry *, unsigned int);
    int (*d_weak_revalidate)(struct dentry *, unsigned int);
    int (*d_hash)(const struct dentry *, struct qstr *);
    int (*d_compare)(const struct dentry *, const struct dentry *,
                     unsigned int, const char *, const struct qstr *);
    int (*d_delete)(const struct dentry *);
    void (*d_release)(struct dentry *);
    void (*d_prune)(struct dentry *);
    void (*d_iput)(struct dentry *, struct inode *);
    char *(*d_dname)(struct dentry *, char *, int);
    struct vfsmount *(*d_automount)(struct path *);
    int (*d_manage)(struct dentry *, bool);
    struct inode *(*d_select_inode)(struct dentry *, unsigned);
} ____cacheline_aligned;
As the analysis above showed, the VFS is designed in an object-oriented fashion, so we will keep following that thread: first the data members, then the operations. struct inode is defined in include/linux/fs.h:
struct inode {
    umode_t i_mode;
    unsigned short i_opflags;
    kuid_t i_uid;
    kgid_t i_gid;
    unsigned int i_flags;
#ifdef CONFIG_FS_POSIX_ACL
    struct posix_acl *i_acl;
    struct posix_acl *i_default_acl;
#endif
    const struct inode_operations *i_op;
    struct super_block *i_sb;
    struct address_space *i_mapping;
#ifdef CONFIG_SECURITY
    void *i_security;
#endif
    /* Stat data, not accessed from path walking */
    unsigned long i_ino;
    /*
     * Filesystems may only read i_nlink directly. They shall use the
     * following functions for modification:
     *
     * (set|clear|inc|drop)_nlink
     * inode_(inc|dec)_link_count
     */
    union {
        const unsigned int i_nlink;
        unsigned int __i_nlink;
    };
    dev_t i_rdev;
    loff_t i_size;
    struct timespec i_atime;
    struct timespec i_mtime;
    struct timespec i_ctime;
    spinlock_t i_lock;  /* i_blocks, i_bytes, maybe i_size */
    unsigned short i_bytes;
    unsigned int i_blkbits;
    blkcnt_t i_blocks;
#ifdef __NEED_I_SIZE_ORDERED
    seqcount_t i_size_seqcount;
#endif
    /* Misc */
    unsigned long i_state;
    struct mutex i_mutex;
    unsigned long dirtied_when;  /* jiffies of first dirtying */
    struct hlist_node i_hash;
    struct list_head i_wb_list;  /* backing dev IO list */
    struct list_head i_lru;      /* inode LRU list */
    struct list_head i_sb_list;
    union {
        struct hlist_head i_dentry;
        struct rcu_head i_rcu;
    };
    u64 i_version;
    atomic_t i_count;
    atomic_t i_dio_count;
    atomic_t i_writecount;
#ifdef CONFIG_IMA
    atomic_t i_readcount;  /* struct files open RO */
#endif
    const struct file_operations *i_fop;  /* former ->i_op->default_file_ops */
    struct file_lock *i_flock;
    struct address_space i_data;
    struct list_head i_devices;
    union {
        struct pipe_inode_info *i_pipe;
        struct block_device *i_bdev;
        struct cdev *i_cdev;
    };
    __u32 i_generation;
#ifdef CONFIG_FSNOTIFY
    __u32 i_fsnotify_mask;  /* all events this inode cares about */
    struct hlist_head i_fsnotify_marks;
#endif
    void *i_private;  /* fs or device private pointer */
};
Three fields stand out:
unsigned long i_state;
const struct inode_operations *i_op;
struct super_block *i_sb;
Starting with i_state, its possible flag values are:
#define I_DIRTY_SYNC        (1 << 0)
#define I_DIRTY_DATASYNC    (1 << 1)
#define I_DIRTY_PAGES       (1 << 2)
#define __I_NEW             3
#define I_NEW               (1 << __I_NEW)
#define I_WILL_FREE         (1 << 4)
#define I_FREEING           (1 << 5)
#define I_CLEAR             (1 << 6)
#define __I_SYNC            7
#define I_SYNC              (1 << __I_SYNC)
#define I_REFERENCED        (1 << 8)
#define __I_DIO_WAKEUP      9
#define I_DIO_WAKEUP        (1 << __I_DIO_WAKEUP)
#define I_LINKABLE          (1 << 10)
/* the inode is "dirty": its on-disk contents must be updated */
#define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
Next come the inode operations, also defined in include/linux/fs.h:
struct inode_operations {
    struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
    void * (*follow_link) (struct dentry *, struct nameidata *);
    int (*permission) (struct inode *, int);
    struct posix_acl * (*get_acl)(struct inode *, int);
    int (*readlink) (struct dentry *, char __user *,int);
    void (*put_link) (struct dentry *, struct nameidata *, void *);
    int (*create) (struct inode *,struct dentry *, umode_t, bool);
    int (*link) (struct dentry *,struct inode *,struct dentry *);
    int (*unlink) (struct inode *,struct dentry *);
    int (*symlink) (struct inode *,struct dentry *,const char *);
    int (*mkdir) (struct inode *,struct dentry *,umode_t);
    int (*rmdir) (struct inode *,struct dentry *);
    int (*mknod) (struct inode *,struct dentry *,umode_t,dev_t);
    int (*rename) (struct inode *, struct dentry *,
                   struct inode *, struct dentry *);
    int (*rename2) (struct inode *, struct dentry *,
                    struct inode *, struct dentry *, unsigned int);
    int (*setattr) (struct dentry *, struct iattr *);
    int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *);
    int (*setxattr) (struct dentry *, const char *,const void *,size_t,int);
    ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
    ssize_t (*listxattr) (struct dentry *, char *, size_t);
    int (*removexattr) (struct dentry *, const char *);
    int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
                  u64 len);
    int (*update_time)(struct inode *, struct timespec *, int);
    int (*atomic_open)(struct inode *, struct dentry *,
                       struct file *, unsigned open_flag,
                       umode_t create_mode, int *opened);
    int (*tmpfile) (struct inode *, struct dentry *, umode_t);
    int (*set_acl)(struct inode *, struct posix_acl *, int);
    /* WARNING: probably going away soon, do not use! */
} ____cacheline_aligned;
Then the superblock object itself, defined in include/linux/fs.h:
struct super_block {
    struct list_head s_list;        /* Keep this first */
    dev_t s_dev;                    /* search index; _not_ kdev_t */
    unsigned char s_blocksize_bits;
    unsigned long s_blocksize;
    loff_t s_maxbytes;              /* Max file size */
    struct file_system_type *s_type;
    const struct super_operations *s_op;
    const struct dquot_operations *dq_op;
    const struct quotactl_ops *s_qcop;
    const struct export_operations *s_export_op;
    unsigned long s_flags;
    unsigned long s_iflags;         /* internal SB_I_* flags */
    unsigned long s_magic;
    struct dentry *s_root;
    struct rw_semaphore s_umount;
    int s_count;
    atomic_t s_active;
#ifdef CONFIG_SECURITY
    void *s_security;
#endif
    const struct xattr_handler **s_xattr;
    struct list_head s_inodes;      /* all inodes */
    struct hlist_bl_head s_anon;    /* anonymous dentries for (nfs) exporting */
    struct list_head s_mounts;      /* list of mounts; _not_ for fs use */
    struct block_device *s_bdev;
    struct backing_dev_info *s_bdi;
    struct mtd_info *s_mtd;
    struct hlist_node s_instances;
    unsigned int s_quota_types;     /* Bitmask of supported quota types */
    struct quota_info s_dquot;      /* Diskquota specific options */
    struct sb_writers s_writers;
    char s_id[32];                  /* Informational name */
    u8 s_uuid[16];                  /* UUID */
    void *s_fs_info;                /* Filesystem private info */
    unsigned int s_max_links;
    fmode_t s_mode;
    /* Granularity of c/m/atime in ns.
       Cannot be worse than a second */
    u32 s_time_gran;
    /*
     * The next field is for VFS *only*. No filesystems have any business
     * even looking at it. You had been warned.
     */
    struct mutex s_vfs_rename_mutex;    /* Kludge */
    /*
     * Filesystem subtype. If non-empty the filesystem type field
     * in /proc/mounts will be "type.subtype"
     */
    char *s_subtype;
    /*
     * Saved mount options for lazy filesystems using
     * generic_show_options()
     */
    char __rcu *s_options;
    const struct dentry_operations *s_d_op; /* default d_op for dentries */
    /*
     * Saved pool identifier for cleancache (-1 means none)
     */
    int cleancache_poolid;
    struct shrinker s_shrink;       /* per-sb shrinker handle */
    /* Number of inodes with nlink == 0 but still referenced */
    atomic_long_t s_remove_count;
    /* Being remounted read-only */
    int s_readonly_remount;
    /* AIO completions deferred from interrupt context */
    struct workqueue_struct *s_dio_done_wq;
    struct hlist_head s_pins;
    /*
     * Keep the lru lists last in the structure so they always sit on their
     * own individual cachelines.
     */
    struct list_lru s_dentry_lru ____cacheline_aligned_in_smp;
    struct list_lru s_inode_lru ____cacheline_aligned_in_smp;
    struct rcu_head rcu;
    /*
     * Indicates how deep in a filesystem stack this SB is
     */
    int s_stack_depth;
};
Finally, the superblock operations, likewise defined in include/linux/fs.h:
struct super_operations {
    struct inode *(*alloc_inode)(struct super_block *sb);
    void (*destroy_inode)(struct inode *);
    void (*dirty_inode) (struct inode *, int flags);
    int (*write_inode) (struct inode *, struct writeback_control *wbc);
    int (*drop_inode) (struct inode *);
    void (*evict_inode) (struct inode *);
    void (*put_super) (struct super_block *);
    int (*sync_fs)(struct super_block *sb, int wait);
    int (*freeze_super) (struct super_block *);
    int (*freeze_fs) (struct super_block *);
    int (*thaw_super) (struct super_block *);
    int (*unfreeze_fs) (struct super_block *);
    int (*statfs) (struct dentry *, struct kstatfs *);
    int (*remount_fs) (struct super_block *, int *, char *);
    void (*umount_begin) (struct super_block *);
    int (*show_options)(struct seq_file *, struct dentry *);
    int (*show_devname)(struct seq_file *, struct dentry *);
    int (*show_path)(struct seq_file *, struct dentry *);
    int (*show_stats)(struct seq_file *, struct dentry *);
#ifdef CONFIG_QUOTA
    ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
    ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
    struct dquot **(*get_dquots)(struct inode *);
#endif
    int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
    long (*nr_cached_objects)(struct super_block *, int);
    long (*free_cached_objects)(struct super_block *, long, int);
};
To summarize the four objects. The superblock object stores information about a mounted filesystem; for disk-based filesystems it typically corresponds to a filesystem control block stored on disk. The inode object stores general information about a specific file; for disk-based filesystems it typically corresponds to a file control block on disk, and each inode object carries an inode number that uniquely identifies the file within its filesystem. The file object stores the information about the interaction between an open file and a process; it exists in kernel memory only while the process has the file open, which is to say it has no image in the concrete filesystem (as opposed to the virtual one). The dentry object stores the information linking a directory entry (that is, a particular name of a file) to the corresponding file; it likewise has no image in the concrete filesystem.
While studying the filesystem we also met the dentry cache, and there is a similar inode cache; both are instances of a disk cache. A disk cache is a software mechanism that lets the kernel keep in RAM certain information that normally lives on disk, so that further accesses to that data are fast and do not require a slow access to the disk itself.
Related concepts include the hardware cache and the memory cache; I will look at those in detail when I run into them.
Finally, a recommendation, also from the web: http://wenku.baidu.com/link?url=nrZ4fZXU7e8dTtx9rrdrfgdK3hqnw8LEJcWxvvq4yME-SoFflpBRVaVnUYYMwdKXquqF47Twh4DwPuZdxSuGxyrgqBvfWal7MzN6mnAeXb_
In particular, the figure on page 15 uses a concrete example to illustrate the relationships among the four VFS object types described above.