歡迎來到Linux教程網
Linux教程網
Linux教程網
Linux教程網
您现在的位置: Linux教程網 >> UnixLinux >  >> Linux基礎 >> Linux服務器

Linux系統調用fsync函數詳解

   功能描述:

  同步內存中所有已修改的文件數據到儲存設備。

  用法:

  #include

  int fsync(int fd);

  參數:

  fd:文件描述詞。

  返回說明:

  成功執行時,返回0。失敗返回-1,errno被設為以下的某個值

  EBADF: 文件描述詞無效

  EIO : 讀寫的過程中發生錯誤

  EROFS, EINVAL:文件所在的文件系統不支持同步

  強制把系統緩存寫入文件sync和fsync函數,, fflush和fsync的聯系和區別2010-05-10 11:25傳統的U N I X實現在內核中設有緩沖存儲器,大多數磁盤I / O都通過緩存進行。當將數據寫

  到文件上時,通常該數據先由內核復制到緩存中,如果該緩存尚未寫滿,則並不將其排入輸出

  隊列,而是等待其寫滿或者當內核需要重用該緩存以便存放其他磁盤塊數據時,再將該緩存排

  入輸出隊列,然後待其到達隊首時,才進行實際的I / O操作。這種輸出方式被稱之為延遲寫

  (delayed write)(Bach 〔1 9 8 6〕第3章詳細討論了延遲寫)。延遲寫減少了磁盤讀寫次數,但是

  第4章文件和目錄8 7

  下載

  卻降低了文件內容的更新速度,使得欲寫到文件中的數據在一段時間內並沒有寫到磁盤上。當

  系統發生故障時,這種延遲可能造成文件更新內容的丟失。為了保證磁盤上實際文件系統與緩

  存中內容的一致性,U N I X系統提供了s y n c和f s y n c兩個系統調用函數。

  #include

  void sync(void);

  int fsync(intf i l e d e s) ;

  返回:若成功則為0,若出錯則為-1

  s y n c只是將所有修改過的塊的緩存排入寫隊列,然後就返回,它並不等待實際I / O操作結束。

  系統精靈進程(通常稱為u p d a t e )一般每隔3 0秒調用一次s y n c函數。這就保證了定期刷新內

  核的塊緩存。命令s y n c ( 1 )也調用s y n c函數。

  函數f s y n c只引用單個文件(由文件描述符f i l e d e s指定),它等待I / O結束,然後返回。f s y n c可

  用於數據庫這樣的應用程序,它確保修改過的塊立即寫到磁盤上。比較一下f s y n c和O _ S Y N C標

  志(見3 . 1 3節)。當調用f s y n c時,它更新文件的內容,而對於O _ S Y N C,則每次對文件調用w r i t e

  函數時就更新文件的內容。

  fflush和fsync的聯系和區別

  [zz ] http://blog.chinaunix.net/u2/73874/showart_1421917.html

  1.提供者fflush是libc.a中提供的方法,fsync是系統提供的系統調用。2.原形fflush接受一個參數FILE *.fflush(FILE *);fsync接受的時一個Int型的文件描述符。fsync(int fd);3.功能fflush:是把C庫中的緩沖調用write函數寫到磁盤[其實是寫到內核的緩沖區]。fsync:是把內核緩沖刷到磁盤上。

  c庫緩沖-----fflush---------〉內核緩沖--------fsync-----〉磁盤

  再轉一篇英文的

  Write-back support

  UBIFS supports write-back, which means that file changes do not go to the flash media straight away, but they are cached and go to the flash later, when it is absolutely necessary. This helps to greatly reduce the amount of I/O which results in better performance. Write-back caching is a standard technique which is used by most file systems like ext3 or XFS.

  In contrast, JFFS2 does not have write-back support and all the JFFS2 file system changes go the flash synchronously. Well, this is not completely true and JFFS2 does have a small buffer of a NAND page size (if the underlying flash is NAND). This buffer contains last written data and is flushed once it is full. However, because the amount of cached data are very small, JFFS2 is very close to a synchronous file system.

  Write-back support requires the application programmers to take extra care about synchronizing important files in time. Otherwise the files may corrupt or disappear in case of power-cuts, which happens very often in many embedded devices. Let's take a glimpse at Linux manual pages:

  $ man 2 write

  ....

  NOTES

  A successful return from write() does not make any guarantee that data

  has been committed to disk. In fact, on some buggy implementations, it

  does not even guarantee that space has successfully been reserved for

  the data. The only way to be sure is to call fsync(2) after you are

  done writing all your data.

  ...

  This is true for UBIFS (except of the "some buggy implementations" part, because UBIFS does reserves space for cached dirty data). This is also true for JFFS2, as well as for any other Linux file system.

  However, some (perhaps not very good) user-space programmers do not take write-back into account. They do not read manual pages carefully. When such applications are used in embedded systems which run JFFS2 - they work fine, because JFFS2 is almost synchronous. Of course, the applications are buggy, but they appear to work well enough with JFFS2. But the bugs show up when UBIFS is used instead. Please, be careful and check/test your applications with respect to power cut tolerance if you switch from JFFS2 to UBIFS. The following is a list of useful hints and advices.

  If you want to switch into synchronous mode, use the -o sync option when mounting UBIFS; however, the file system performance will drop - be careful; Also remember that UBIFS mounted in synchronous mode provides less guarantees than JFFS2 - refer this section for details.

  Always keep in mind the above statement from the manual pages and run fsync() for all important files you change; of course, there is no need to synchronize "throw-away" temporary files; Just think how important is the file data and decide; and do not use fsync() unnecessarily, because this will hit the performance;

  If you want to be more accurate, you may use fdatasync(), in which cases only data changes will be flushed, but not inode meta-data changes (e.g., "mtime" or permissions); this might be more optimal than using fsync() if the synchronization is done often, e.g., in a loop; otherwise just stick with fsync();

  In shell, the sync command may be used, but it synchronizes whole file system which might be not very optimal; and there is a similar libc sync() function;

  You may use the O_SYNC flag of the open() call; this will make sure all the data (but not meta-data) changes go to the media before the write() operation returns; but in general, it is better to use fsync(), because O_SYNC makes each write to be synchronous, while fsync() allows to accumulate many writes and synchronize them at once;

  It is possible to make certain inodes to be synchronous by default by setting the "sync" inode flag; in a shell, the chattr +S command may be used; in C programs, use the FS_IOC_SETFLAGS ioctl command; Note, the mkfs.ubifs tool checks for the "sync" flag in the original FS tree, so the synchronous files in the original FS tree will be synchronous in the resulting UBIFS image.

  Let us stress that the above items are true for any Linux file system, including JFFS2.

  fsync() may be called for directories - it synchronizes the directory inode meta-data. The "sync" flag may also be set for directories to make the directory inode synchronous. But the flag is inherited, which means all new children of this directory will also have this flag. New files and sub-directories of this directory will also be synchronous, and their children, and so forth. This feature is very useful if one needs to create a whole sub-tree of synchronous files and directories, or to make all new children of some directory to be synchronous by default (e.g., /etc).

  The fdatasync() call for directories is "no-op" in UBIFS and all UBIFS operations which change directory entries are synchronous. However, you should not assume this for portability (e.g., this is not true for ext2). Similarly, the "dirsync" inode flag has no effect in UBIFS.

  The functions mentioned above work on file-descriptors, not on streams (FILE *). To synchronize a stream, you should first get its file descriptor using the fileno() libc function, then flush the stream using fflush(), and then synchronize the file using fsync() or fdatasync(). You may use other synchronization methods, but remember to flush the stream before synchronizing the file. The fflush() function flushes the libc-level buffers, while sync(), fsync(), etc flush kernel-level buffers.

  Please, refer this FAQ entry for information about how to atomically update the contents of a file. Also, the Theodore Tso's article is a good reading.

  Write-back knobs in Linux

  Linux has several knobs in "/proc/sys/vm" which you may use to tune write-back. The knobs are global, so they affect all file-systems. Please, refer the "Documentation/sysctl/vm.txt" file fore more information. The file may be found in the Linux kernel source tree. Below are interesting knobs described in UBIFS context and in a simplified form.

  dirty_writeback_centisecs - how often the Linux periodic write-back thread wakes up and writes out dirty data. This is a mechanism which makes sure all dirty data hits the media at some point.

  dirty_expire_centisecs - dirty data expire period. This is maximum time data may stay dirty. After this period of time it will be written back by the Linux periodic write-back thread. IOW, the periodic write-back thread wakes up every "dirty_writeback_centisecs" centi-seconds and synchronizes data which was dirtied "dirty_expire_centisecs" centi-seconds ago.

  dirty_background_ratio - maximum amount of dirty data in percent of total memory. When the amount of dirty data becomes larger, the periodic write-back thread starts synchronizing it until it becomes smaller. Even non-expired data will be synchronized. This may be used to set a "soft" limit for the amount of dirty data in the system.

  dirty_ratio - maximum amount of dirty data at which writers will first synchronize the existing dirty data before adding more. IOW, this is a "hard" limit of the amount of dirty data in the system.

  Note, UBIFS additionally has small write-buffers which are synchronized every 3-5 seconds. This means that most of the dirty data are delayed by dirty_expire_centisecs centi-seconds, but the last few KiB are additionally delayed by 3-5 seconds.

  UBIFS write-buffer

  UBIFS is asynchronous file-system (read this section for more information). As other Linux file-system, it utilizes the page cache. The page cache is a generic Linux memory-management mechanism. It may be very large and cache a lot of data. When you write to a file, the data are written to the page cache, marked as dirty, and the write returns (unless the file is synchronous). Later the data are written-back.

  Write-buffer is an additional UBIFS buffer, which is implemented inside UBIFS, and it sits between the page cache and the flash. This means that write-back actually writes to the write-buffer, not directly to the flash.

  The write-buffer is designated to speed-up UBIFS on NAND flashes. NAND flashes consist of NAND pages, which are usually 512, 2KiB or 4KiB in size. NAND page is the minimal read/write unit of NAND flash (see this section).

  Write-buffer size is equivalent to NAND page size (so it is tiny comparing to the page cache). It's purpose is to accumulate small writes, and write full NAND pages instead of partially filled. Indeed, imagine we have to write 4 512-byte nodes with half a second interval, and NAND page size is 2KiB. Without write-buffer we would have to write 4 NAND pages and waste 6KiB of flash space, while write-buffer allows us to write only once and waste nothing. This means we write less, we create less dirty space so UBIFS garbage collector will have to do less work, we save power.

  Well, the example shows an ideal situation, and even with the write-buffer we may waste space, for example in case of synchronous I/O, or if the data arrives with long time intervals. This is because the write-buffer has an associated timer, which flushes it every 3-5 seconds, even if it isn't full. We do this for data integrity reasons.

  Of course, when UBIFS has to write a lot of data, it does not use write buffer. Only the last part of the data which is smaller than the NAND page ends up in the write-buffer and waits more for data, until it is flushed by the timer.

  The write-buffer implementation is a little more complex, and we actually have several of them - one for each journal head. But this does not change the basic idea behind the write-buffer.

  Few notes with regards to synchronization:

  "sync()" also synchronizes all write-buffers;

  "fsync(fd)" also synchronizes all write-buffers which contain pieces of "fd";

  synchronous files, as well as files opened with "O_SYNC", bypass write-buffers, so the I/O is indeed synchronous for this files;

  write-buffers are also bypassed if the file-system is mounted with the "-o sync" mount option.

  Take into account that write-buffers delay the data synchronization timeout defined by "dirty_expire_centisecs" (see here) by 3-5 seconds. However, since write-buffers are small, only few data are delayed.

  UBIFS in synchronous mode vs JFFS2

  When UBIFS is mounted in synchronous mode (-o sync mount options) - all file system operations become synchronous. This means that all data are written to flash before the file-system operations return.

  For example, if you write 10MiB of data to a file f.dat using the write() call, and UBIFS is in synchronous mode, then UBIFS guarantees that all 10MiB of data and the meta-data (file size and date changes) will reach the flash media before write() returns. And if a power cut happens after the write() call returns, the file will contain the written data.

  The same is true for situations when f.dat has was opened with O_SYNC or has the sync flag (see man 2 chattr).

  It is well-known that the JFFS2 file-system is synchronous (except a small write-buffer). However, UBIFS in synchronous mode is not the same as JFFS2 and provides somewhat less guarantees that JFFS2 does with respect to sudden power cuts.

  In JFFS2 all the meta-data (like inode atime/mtime/ctime, inode size, UID/GID, etc) are stored in the data node headers. Data nodes carry 4KiB of (compressed) data. This means that the meta-data information is duplicated in many places, but this also means that every time JFFS2 writes a data node to the flash media, it updates inode size as well. So when JFFS2 mounts it scans the flash media, finds the latest data node, and fetches the inode size from there.

  In practice this means that JFFS2 will write these 10MiB of data sequentially, from the beginning to the end. And if you have a power cut, you will just lose some amount of data at the end of the inode. For example, if JFFS2 starts writing those 10MiB of data, write 5MiB, and a power cut happens, you will end up with a 5MiB f.dat file. You lose only the last 5MiB.

  Things are a little bit more complex in case of UBIFS, where data are stored in data nodes and meta-data are stored in (separate) inode nodes. The meta-data are not duplicated in each data node, like in JFFS2. UBIFS never writes data nodes beyond the on-flash inode size. If it has to write a data node and the data node is beyond the on-flash inode size (the in-memory inode has up-to-data size, but it is dirty and was not flushed yet), then UBIFS first writes the inode to the media, and then it starts writing the data. And if you have an interrupt, you lose data nodes and you have holes (or old data nodes, if you are overwriting). Lets consider an example.

  User creates an empty file f.dat. The file is synchronous, or UBIFS is mounted in synchronous mode. User calls the write() function with a 10MiB buffer.

  The kernel first copies all 10MiB of the data to the page cache. Inode size is changed to 10MiB as well and the inode is marked as dirty. Nothing has been written to the flash media so far. If a power cut happens at this point, the user will end up with an empty f.dat file.

  UBIFS sees that the I/O has to be synchronous, and starts synchronizing the inode. First of all, it writes the inode node to the flash media. If a power cut happens at this moment, the user will end up with a 10MiB file which contains no data (hole), and if he read this file, he will get 10MiB of zeroes.

  UBIFS starts writing the data. If a power cut happens at this point, the user will end up with a 10MiB file containing a hole at the end.

  Note, if the I/O was not synchronous, UBIFS would skip the last step and would just return. And the actual write-back would then happen in back-ground. But power cuts during write-back could anyway lead to files with holes at the end.

  Thus, synchronous I/O in UBIFS provides less guarantees than JFFS2 I/O - UBIFS has an effect of holes at the end of files. In ideal world applications should not assume anything about the contents of files which were not synchronized before a power-cut has happened. And "mainstream" file-systems like ext3 do not provide JFSS2-like guarantees.

  However, UBIFS is sometimes used as a JFFS2 replacement and people may want it to behave the same way as JFFS2 if it is mounted synchronously. This is doable, but needs some non-trivial development, so this was not implemented so far. On the other hand, there was no strong demand. You may implement this as an exercise, or you may try to convince UBIFS authors to do this.

  Synchronization exceptions for buggy applications

  As this section describes, UBIFS is an asynchronous file-system, and applications should synchronize their files whenever it is required. The same applies to most Linux file-systems, e.g. XFS.

  However, many applications ignore this and do not synchronize files properly. And there was a huge war between user-space and kernel developers related to ext4 delayed allocation feature. Please, see the Theodore Tso's blog post. More information may be found in this LWN article.

  In short, the flame war was about 2 cases. The first case was about the atomic re-name, where many user-space programs did not synchronize the copy before re-naming it. The second case was about applications which truncate files, then change them. There was no final agreement, but the "we cannot ignore the real world" argument found ext4 developers' understanding, and there were 2 ext4 changes which help both problems.

  Roughly speaking, the first change made ext4 synchronize files on close if they were previously truncated. This was a hack from file-system point of view, but it "fixed" applications which truncate files, write new contents, and close the files without synchronizing them.

  The second change made ext4 synchronize the renamed file.

  Well, this is not exactly correct description, because ext4 does not write the files synchronously, but actually initiates asynchronous write-out of the files, so the performance hit is not very high. For the truncation case this means that the file is synchronized soon after it is closed. For the re-name case this means that ext4 writes data before it writes the re-name meta-data.

  However, the application writers should never rely on these things, because this is not portable. Instead, they should properly synchronize files. The ext4 fixes were because there were many broken user-space applications in the wild already.

  We have plans to implement these features in UBIFS, but this has not been done yet. The problem is that UBI/MTD are fully synchronous and we cannot initiate asynchronous write-out, so we'd have to synchronously write files on close/rename, which is slow. So implementing these features would require implementing asynchronous I/O in UBI, which is a big job. But feel free to do this :-).

Copyright © Linux教程網 All Rights Reserved