Hi,

I've been looking at a discrepancy between msync() behaviour on 2.4.9
and newer 2.4 kernels, and it looks like things changed again in
2.5.68.  From the ChangeLog:

ChangeSet 1.971.76.156 2003/04/09 11:31:36 akpm@digeo.com
  [PATCH] Make msync(MS_ASYNC) no longer start the I/O
  
  MS_ASYNC will currently wait on previously-submitted I/O, then start new I/O
  and not wait on it.  This can cause undesirable blocking if msync is called
  rapidly against the same memory.
  
  So instead, change msync(MS_ASYNC) to not start any IO at all.  Just flush
  the pte dirty bits into the pageframe and leave it at that.
  
  The IO _will_ happen within a kupdate period.  And the application can use
  fsync() or fadvise(FADV_DONTNEED) if it actually wants to schedule the IO
  immediately.

Unfortunately, this seems to contradict SingleUnix requirements, which
state:

        When MS_ASYNC is specified, msync() shall return immediately
        once all the write operations are initiated or queued for
        servicing
        
although I can't find an unambiguous definition of "queued for servicing"
in the online standard.  I'm reading it as requiring that the I/O has
reached the block device layer, not simply that it has been marked dirty
for some future writeback pass to catch; Uli agrees with that
interpretation.

The comment that was added with this change in 2.5.68 states:

 * MS_ASYNC does not start I/O (it used to, up to 2.5.67).  Instead, it just
 * marks the relevant pages dirty.  The application may now run fsync() to
 * write out the dirty pages and wait on the writeout and check the result.
 * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
 * async writeout immediately.
 * So my _not_ starting I/O in MS_ASYNC we provide complete flexibility to
 * applications.

but that's actually misleading --- in every kernel I've looked at as far
back as 2.4.9, the "new" behaviour of simply queueing the pages for
kupdated writeback is already available by specifying flags==0 for
msync(), so there's no additional flexibility added by making MS_ASYNC
do the same thing.  And FADV_DONTNEED doesn't make any guarantees that
anything gets queued for writeback: it only does a filemap writeback if
there's no congestion on the backing device.
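To make the two "push it out yourself" options from the comment concrete,
here is a rough user-space sketch.  The helper names msync_then_push() and
demo_push() and the /tmp scratch file are made up for illustration, and
I'm assuming posix_fadvise(POSIX_FADV_DONTNEED) is the user-space spelling
of the fadvise(FADV_DONTNEED) call the comment refers to:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>	/* only for the assert()-based usage shown below */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: call msync(MS_ASYNC), then explicitly push the data.
 *   wait != 0: fsync() writes the dirty pages and waits for completion.
 *   wait == 0: posix_fadvise(POSIX_FADV_DONTNEED) is only a hint, so
 *              nothing here guarantees that I/O was actually queued.
 * Returns 0 on success, -1 on failure.
 */
static int msync_then_push(int fd, void *map, size_t len, int wait)
{
	if (msync(map, len, MS_ASYNC) < 0)
		return -1;
	if (wait)
		return fsync(fd);
	/* posix_fadvise() returns an error number directly, not -1/errno. */
	return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED) == 0 ? 0 : -1;
}

/* Hypothetical driver: dirty one page of an unlinked scratch file. */
static int demo_push(int wait)
{
	char path[] = "/tmp/msync-push-XXXXXX";
	long pagesize = sysconf(_SC_PAGESIZE);
	size_t len;
	char *map;
	int fd, ret;

	if (pagesize <= 0)
		return -1;
	len = (size_t)pagesize;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);		/* the file disappears once fd is closed */

	if (ftruncate(fd, (off_t)len) < 0) {
		close(fd);
		return -1;
	}

	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		close(fd);
		return -1;
	}

	memset(map, 'x', len);	/* dirty the page through the mapping */
	ret = msync_then_push(fd, map, len, wait);

	munmap(map, len);
	close(fd);
	return ret;
}
```

Something like assert(demo_push(1) == 0) covers the fsync() path and
demo_push(0) the fadvise path.  A zero return from the latter only means
the hint was accepted, which is exactly the problem described above.
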

So we're actually _more_ flexible, as well as more compliant, under the
old behaviour --- flags==0 gives the "defer to kupdated" behaviour,
MS_ASYNC guarantees that IOs have been queued, and MS_SYNC waits for
synchronised completion.
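For reference, a minimal user-space sketch of the three flag values
discussed above.  The helper name dirty_and_msync() and the /tmp scratch
file are illustrative only and not part of the patch; the comments describe
the old semantics as I've laid them out here, not something the code can
verify from user space:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>	/* only for the assert()-based usage shown below */
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: dirty one page of a shared file mapping, then call
 * msync() with the given flags:
 *   0        - Linux extension: leave the dirty pages for later writeback
 *   MS_ASYNC - old behaviour: return once the writes have been queued
 *   MS_SYNC  - wait for the writes to complete
 * Returns msync()'s return value, or -1 if the setup failed.
 */
static int dirty_and_msync(int flags)
{
	char path[] = "/tmp/msync-demo-XXXXXX";
	long pagesize = sysconf(_SC_PAGESIZE);
	size_t len;
	char *map;
	int fd, ret;

	if (pagesize <= 0)
		return -1;
	len = (size_t)pagesize;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);		/* the file disappears once fd is closed */

	if (ftruncate(fd, (off_t)len) < 0) {
		close(fd);
		return -1;
	}

	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		close(fd);
		return -1;
	}

	memset(map, 'x', len);	/* dirty the page through the mapping */
	ret = msync(map, len, flags);

	munmap(map, len);
	close(fd);
	return ret;
}
```

A caller would use it as, say, assert(dirty_and_msync(MS_ASYNC) == 0) or
assert(dirty_and_msync(MS_SYNC) == 0).  dirty_and_msync(0) exercises the
flags==0 case, which SingleUnix leaves undefined, so portable code
shouldn't rely on it.
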

All that's really missing is documentation to define how Linux deals
with flags==0 as an extension to SingleUnix (which requires either
MS_SYNC, MS_ASYNC or MS_INVALIDATE to be set, and which does not define
a behaviour for flags==0).
  
The 2.5.68 changeset also includes the comment:

  (This has triggered an ext3 bug - the page's buffers get dirtied so fast
  that kjournald keeps writing the buffers over and over for 10-20 seconds
  before deciding to give up for some reason)

Was that ever resolved?  If it's still there, I should have a look at
it, since restoring the old behaviour would bring back the trigger.

The patch below restores the old MS_ASYNC behaviour in 2.6.

--Stephen