Hi all,

The following patch fully implements O_SYNC, fsync and fdatasync, at least for ext2. The infrastructure it includes should make it trivial for any other filesystem to do likewise.

The basic changes are:

* Include a per-inode list of dirty buffers.
* Pass a "datasync" parameter down to the filesystems when fsync or fdatasync is called, to distinguish between the two (when fdatasync is specified, we don't have to flush the inode to disk if only timestamps have changed).
* Split I_DIRTY into two bits, one (I_DIRTY_SYNC) which is set for all dirty inodes, and the other (I_DIRTY_DATASYNC) which is set only if fdatasync needs to flush the inode (i.e. it is set for everything except timestamp updates). This means: the old (flags & I_DIRTY) construct still returns true if the inode is in any way dirty; and (flags |= I_DIRTY) sets both bits, as expected.
* fs/ext2 and __block_commit_write are modified to record all newly dirtied buffers (both data and metadata) on the inode's dirty block list.
* generic_file_write now honours the O_SYNC flag and calls generic_osync_inode(), which flushes the inode dirty buffer list and calls the inode's fsync method.

Note: currently, the O_SYNC code in generic_file_write calls generic_osync_inode with datasync==1, which means that O_SYNC is interpreted as O_DSYNC according to SUS. In other words, O_SYNC is not guaranteed to flush timestamp updates to disk (but fsync is).

This is important: we do not currently have an O_DSYNC flag (although that would now be trivial to implement), so existing apps are forced to use O_SYNC instead. Apps such as Oracle rely on O_SYNC for write ordering, but due to a 2.2 bug, existing kernels don't do the timestamp update, and hence we achieve decent performance even without O_DSYNC. We cannot suddenly cause all of those applications to experience a massive performance drop.

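To make the I_DIRTY split and the "datasync" parameter described above concrete, here is a minimal userspace sketch in C. It is not code from the patch: the numeric bit values and the helper inode_needs_write() are assumptions made purely for illustration.

```c
/*
 * Illustrative sketch only, not taken from the patch.
 * The bit values below are assumed for the example.
 */
#define I_DIRTY_SYNC      1  /* set for every dirty inode */
#define I_DIRTY_DATASYNC  2  /* set unless only timestamps changed */
#define I_DIRTY           (I_DIRTY_SYNC | I_DIRTY_DATASYNC)

/*
 * Hypothetical helper: does the inode have to be written out?
 * datasync != 0 models fdatasync (and the current O_SYNC path),
 * datasync == 0 models a full fsync.
 */
static int inode_needs_write(unsigned int flags, int datasync)
{
    if (!(flags & I_DIRTY))
        return 0;               /* inode is clean */
    if (datasync && !(flags & I_DIRTY_DATASYNC))
        return 0;               /* only timestamps are dirty */
    return 1;
}
```

With these definitions, an inode carrying only I_DIRTY_SYNC (a timestamp update) is skipped when datasync is set but written on a full fsync, while (flags |= I_DIRTY) makes it dirty for both.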
One way round this would be to split O_SYNC into O_DSYNC and O_TRUESYNC, and in glibc to redefine O_SYNC to be (O_DSYNC | O_TRUESYNC). If we keep the new O_DSYNC to have the same value as the old O_SYNC, then:

* Old applications which specified O_SYNC will continue to get their expected (O_DSYNC) behaviour.
* New applications can specify O_SYNC or O_DSYNC and get the selected behaviour on new kernels.
* New applications specifying either O_SYNC or O_DSYNC will still get O_SYNC on old kernels.

In performance testing, "dd" with 64k blocks, writing into an existing, preallocated file, gets close to theoretical disk bandwidth (about 13MB/sec on a Cheetah) when using O_SYNC or when doing an fdatasync between each write. Doing fsync instead gives only about 3MB/sec and results in a lot of audible disk seeking, as expected.

If I don't preallocate the file, then even fdatasync is slow, as it now has to sync the changed i_size information after every write (and it gets slower as the file grows and the distance between the inode and the data being written increases).

--Stephen
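
For reference, the O_DSYNC / O_TRUESYNC split proposed above can be sketched as follows. This is an illustration under stated assumptions, not part of the patch: the names are prefixed (OLD_, NEW_) to avoid clashing with <fcntl.h>, 010000 is the existing i386 O_SYNC value, the NEW_O_TRUESYNC value is an arbitrary placeholder, and old kernels are assumed to ignore flag bits they do not know.

```c
/* Illustrative sketch of the proposed flag split; not from the patch. */
#define OLD_O_SYNC      010000     /* existing O_SYNC value on i386 */
#define NEW_O_DSYNC     OLD_O_SYNC /* new O_DSYNC reuses the old value */
#define NEW_O_TRUESYNC  04000000   /* placeholder value, assumed */
#define NEW_O_SYNC      (NEW_O_DSYNC | NEW_O_TRUESYNC)

enum sync_mode { SYNC_NONE, SYNC_DATA, SYNC_FULL };

/* How a new kernel would interpret the open flags. */
static enum sync_mode new_kernel_mode(unsigned int flags)
{
    if ((flags & NEW_O_SYNC) == NEW_O_SYNC)
        return SYNC_FULL;       /* both bits set: full O_SYNC */
    if (flags & NEW_O_DSYNC)
        return SYNC_DATA;       /* data-only sync */
    return SYNC_NONE;
}

/* An old kernel only knows the single O_SYNC bit. */
static int old_kernel_is_sync(unsigned int flags)
{
    return (flags & OLD_O_SYNC) != 0;
}
```

An old binary passing OLD_O_SYNC gets SYNC_DATA on a new kernel, and a new binary passing either NEW_O_SYNC or NEW_O_DSYNC still sets the one bit an old kernel tests, which covers the three cases listed above.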