Hi all,

The following patch fully implements O_SYNC, fsync and fdatasync,
at least for ext2.  The infrastructure it includes should make it
trivial for any other filesystem to do likewise.

The basic changes are:

	Include a per-inode list of dirty buffers

	Pass a "datasync" parameter down to the filesystems when fsync
	or fdatasync is called, to distinguish between the two (when
	fdatasync is specified, we don't have to flush the inode to disk
	if only timestamps have changed)

	Split I_DIRTY into two bits, one (I_DIRTY_SYNC) which is set
	for all dirty inodes, and the other (I_DIRTY_DATASYNC) which 
	is set only if fdatasync needs to flush the inode (ie. it is
	set for everything except for timestamp updates).  This means:

		The old (flags & I_DIRTY) construct still returns 
		true if the inode is in any way dirty; and

		(flags |= I_DIRTY) sets both bits, as expected.

	fs/ext2 and __block_commit_write are modified to record all
	newly dirtied buffers (both data and metadata) on the inode's
	dirty buffer list

	generic_file_write now honours the O_SYNC flag and calls
	generic_osync_inode(), which flushes the inode dirty buffer
	list and calls the inode's fsync method.
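
To illustrate how the two dirty bits and the "datasync" parameter
interact, here is a minimal user-space sketch.  It is not code from the
patch: the bit values, the toy_inode structure and the toy_* helper
names are all assumptions made for the example.

```c
#include <assert.h>

/* Hypothetical bit values, chosen for illustration only. */
#define I_DIRTY_SYNC      1  /* set for every dirty inode */
#define I_DIRTY_DATASYNC  2  /* set unless only timestamps changed */
#define I_DIRTY           (I_DIRTY_SYNC | I_DIRTY_DATASYNC)

/* Toy stand-in for the relevant part of an inode (hypothetical). */
struct toy_inode {
	unsigned long flags;
};

/* Timestamp-only change: fsync must flush it, fdatasync need not. */
static void toy_mark_time_dirty(struct toy_inode *inode)
{
	inode->flags |= I_DIRTY_SYNC;
}

/* Any other change: both fsync and fdatasync must flush it. */
static void toy_mark_dirty(struct toy_inode *inode)
{
	inode->flags |= I_DIRTY;
}

/* Returns nonzero if this sync call has to write the inode out. */
static int toy_inode_needs_write(const struct toy_inode *inode, int datasync)
{
	assert(inode != 0);
	if (!(inode->flags & I_DIRTY))
		return 0;	/* inode is clean */
	if (datasync && !(inode->flags & I_DIRTY_DATASYNC))
		return 0;	/* only timestamps changed */
	return 1;
}
```

With these definitions, an inode marked by toy_mark_time_dirty() still
tests true against I_DIRTY, but toy_inode_needs_write() reports that it
needs writing only when datasync is 0.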

Note: currently, the O_SYNC code in generic_file_write calls 
generic_osync_inode with datasync==1, which means that O_SYNC is
interpreted as what the SUS calls O_DSYNC.  In other words,
O_SYNC is not guaranteed to flush timestamp updates to disk (but
fsync is).  This is important: we do not currently have an O_DSYNC
flag (although that would now be trivial to implement), so existing
apps are forced to use O_SYNC instead.  Apps such as Oracle rely on
O_SYNC for write ordering, but due to a 2.2 bug, existing kernels
don't do the timestamp update and hence we achieve decent 
performance even without O_DSYNC.  We cannot suddenly cause all of
those applications to experience a massive performance drop.

One way round this would be to split O_SYNC into O_DSYNC and
O_TRUESYNC, and in glibc to redefine O_SYNC to be (O_DSYNC |
O_TRUESYNC).  If we keep the new O_DSYNC to have the same value
as the old O_SYNC, then:

	* Old applications which specified O_SYNC will continue
	  to get their expected (O_DSYNC) behaviour

	* New applications can specify O_SYNC or O_DSYNC and get
	  the selected behaviour on new kernels

	* New applications specifying either O_SYNC or O_DSYNC will
	  still get O_SYNC on old kernels.
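
The three cases above can be modelled with a small sketch.  The names
carry OLD_/NEW_ prefixes so they do not clash with <fcntl.h>, the
numeric values are illustrative only, and the model assumes that an old
kernel simply ignores flag bits it does not know about.

```c
#include <assert.h>

/* Illustrative values only; the real values are per-architecture. */
#define OLD_O_SYNC	010000		/* the existing O_SYNC bit */
#define NEW_O_DSYNC	OLD_O_SYNC	/* reuses the old O_SYNC value */
#define NEW_O_TRUESYNC	04000000	/* hypothetical spare bit */
#define NEW_O_SYNC	(NEW_O_DSYNC | NEW_O_TRUESYNC)

enum sync_mode { SYNC_NONE, SYNC_DATA, SYNC_FULL };

/* How a new kernel would interpret the open flags. */
static enum sync_mode new_kernel_mode(int flags)
{
	if ((flags & NEW_O_SYNC) == NEW_O_SYNC)
		return SYNC_FULL;	/* data, metadata and timestamps */
	if (flags & NEW_O_DSYNC)
		return SYNC_DATA;	/* timestamps may be skipped */
	return SYNC_NONE;
}

/* An old kernel looks only at the one bit it knows about. */
static int old_kernel_is_sync(int flags)
{
	return (flags & OLD_O_SYNC) != 0;
}
```

An old binary passing OLD_O_SYNC gets SYNC_DATA from new_kernel_mode(),
while old_kernel_is_sync() is true for both NEW_O_SYNC and NEW_O_DSYNC.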

In performance testing, "dd" with 64k blocks and writing into an 
existing, preallocated file, gets close to theoretical disk bandwidth
(about 13MB/sec on a Cheetah), when using O_SYNC or when doing an
fdatasync after each write.  Doing fsync instead gives only about
3MB/sec and results in a lot of audible disk seeking, as expected.
If I don't preallocate the file, then even fdatasync is slow, as it
now has to sync the changed i_size information after every write (and
it gets slower as the file grows and the distance between the inode 
and the data being written increases).
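
For anyone wanting to try a similar comparison without dd, the write
pattern can be approximated with a sketch like the one below.  This is
not the test I ran; write_synced() and its parameters are made up for
the example, and it uses only the standard open/write/fsync/fdatasync
calls.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (64 * 1024)	/* 64k blocks, as in the dd test */

/*
 * Write nchunks blocks to path, syncing after each one with either
 * fdatasync or fsync.  The file is not truncated, so a second pass
 * over the same path writes into already-allocated blocks.
 * Returns 0 on success, -1 on any error.
 */
static int write_synced(const char *path, int nchunks, int use_fdatasync)
{
	static char buf[CHUNK];
	int fd, i;

	memset(buf, 'x', sizeof(buf));
	fd = open(path, O_WRONLY | O_CREAT, 0600);
	if (fd < 0)
		return -1;
	for (i = 0; i < nchunks; i++) {
		if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
			goto fail;
		if ((use_fdatasync ? fdatasync(fd) : fsync(fd)) != 0)
			goto fail;
	}
	return close(fd);
fail:
	close(fd);
	return -1;
}
```

Call it once to preallocate the file, then time a second call over the
same path with use_fdatasync set to 1 and again with it set to 0.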

--Stephen