* msync() behaviour broken for MS_ASYNC, revert patch?
@ 2004-03-31 22:16 Stephen C. Tweedie
2004-03-31 22:37 ` Linus Torvalds
` (2 more replies)
0 siblings, 3 replies; 18+ messages in thread
From: Stephen C. Tweedie @ 2004-03-31 22:16 UTC (permalink / raw)
To: linux-mm, Andrew Morton; +Cc: linux-kernel, Stephen Tweedie, Ulrich Drepper
[-- Attachment #1: Type: text/plain, Size: 3170 bytes --]
Hi,
I've been looking at a discrepancy between msync() behaviour on 2.4.9
and newer 2.4 kernels, and it looks like things changed again in
2.5.68. From the ChangeLog:
ChangeSet 1.971.76.156 2003/04/09 11:31:36 akpm@digeo.com
[PATCH] Make msync(MS_ASYNC) no longer start the I/O
MS_ASYNC will currently wait on previously-submitted I/O, then start new I/O
and not wait on it. This can cause undesirable blocking if msync is called
rapidly against the same memory.
So instead, change msync(MS_ASYNC) to not start any IO at all. Just flush
the pte dirty bits into the pageframe and leave it at that.
The IO _will_ happen within a kupdate period. And the application can use
fsync() or fadvise(FADV_DONTNEED) if it actually wants to schedule the IO
immediately.
Unfortunately, this seems to contradict SingleUnix requirements, which
state:
When MS_ASYNC is specified, msync() shall return immediately
once all the write operations are initiated or queued for
servicing
although I can't find an unambiguous definition of "queued for service"
in the online standard. I'm reading it as requiring that the I/O has
reached the block device layer, not simply that it has been marked dirty
for some future writeback pass to catch; Uli agrees with that
interpretation.
The comment that was added with this change in 2.5.68 states:
* MS_ASYNC does not start I/O (it used to, up to 2.5.67). Instead, it just
* marks the relevant pages dirty. The application may now run fsync() to
* write out the dirty pages and wait on the writeout and check the result.
* Or the application may run fadvise(FADV_DONTNEED) against the fd to start
* async writeout immediately.
* So my _not_ starting I/O in MS_ASYNC we provide complete flexibility to
* applications.
but that's actually misleading --- in every kernel I've looked at as far
back as 2.4.9, the "new" behaviour of simply queueing the pages for
kupdated writeback is already available by specifying flags==0 for
msync(), so there's no additional flexibility added by making MS_ASYNC
do the same thing. And FADV_DONTNEED doesn't make any guarantees that
anything gets queued for writeback: it only does a filemap writeback if
there's no congestion on the backing device.
So we're actually _more_ flexible, as well as more compliant, under the
old behaviour --- flags==0 gives the "defer to kupdated" behaviour,
MS_ASYNC guarantees that IOs have been queued, and MS_SYNC waits for
synchronised completion.
All that's really missing is documentation to define how Linux deals
with flags==0 as an extension to SingleUnix (which requires either
MS_SYNC, MS_ASYNC or MS_INVALIDATE to be set, and which does not define
a behaviour for flags==0.)
The 2.5.68 changeset also includes the comment:
(This has triggered an ext3 bug - the page's buffers get dirtied so fast
that kjournald keeps writing the buffers over and over for 10-20 seconds
before deciding to give up for some reason)
Was that ever resolved? If it's still there, I should have a look at it
if we're restoring the old trigger.
Patch below reverts the behaviour in 2.6.
--Stephen
[-- Attachment #2: msync-async-writeout.patch --]
[-- Type: text/plain, Size: 1935 bytes --]
--- linux-2.6-tmp/mm/msync.c.=K0000=.orig
+++ linux-2.6-tmp/mm/msync.c
@@ -127,13 +127,10 @@ static int filemap_sync(struct vm_area_s
/*
* MS_SYNC syncs the entire file - including mappings.
*
- * MS_ASYNC does not start I/O (it used to, up to 2.5.67). Instead, it just
- * marks the relevant pages dirty. The application may now run fsync() to
- * write out the dirty pages and wait on the writeout and check the result.
- * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
- * async writeout immediately.
- * So my _not_ starting I/O in MS_ASYNC we provide complete flexibility to
- * applications.
+ * MS_ASYNC once again starts I/O (it did not between 2.5.68 and 2.6.4.)
+ * SingleUnix requires it. If an application wants to queue dirty pages
+ * for normal asychronous writeback, msync with flags==0 should achieve
+ * that on all kernels at least as far back as 2.4.
*/
static int msync_interval(struct vm_area_struct * vma,
unsigned long start, unsigned long end, int flags)
@@ -147,20 +144,22 @@ static int msync_interval(struct vm_area
if (file && (vma->vm_flags & VM_SHARED)) {
ret = filemap_sync(vma, start, end-start, flags);
- if (!ret && (flags & MS_SYNC)) {
+ if (!ret && (flags & (MS_SYNC|MS_ASYNC))) {
struct address_space *mapping = file->f_mapping;
int err;
down(&mapping->host->i_sem);
ret = filemap_fdatawrite(mapping);
- if (file->f_op && file->f_op->fsync) {
- err = file->f_op->fsync(file,file->f_dentry,1);
- if (err && !ret)
+ if (flags & MS_SYNC) {
+ if (file->f_op && file->f_op->fsync) {
+ err = file->f_op->fsync(file, file->f_dentry, 1);
+ if (err && !ret)
+ ret = err;
+ }
+ err = filemap_fdatawait(mapping);
+ if (!ret)
ret = err;
}
- err = filemap_fdatawait(mapping);
- if (!ret)
- ret = err;
up(&mapping->host->i_sem);
}
}
^ permalink raw reply [flat|nested] 18+ messages in thread* Re: msync() behaviour broken for MS_ASYNC, revert patch? 2004-03-31 22:16 msync() behaviour broken for MS_ASYNC, revert patch? Stephen C. Tweedie @ 2004-03-31 22:37 ` Linus Torvalds 2004-03-31 23:41 ` Stephen C. Tweedie 2004-03-31 22:53 ` Andrew Morton 2004-04-16 22:35 ` Jamie Lokier 2 siblings, 1 reply; 18+ messages in thread From: Linus Torvalds @ 2004-03-31 22:37 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: linux-mm, Andrew Morton, linux-kernel, Ulrich Drepper On Wed, 31 Mar 2004, Stephen C. Tweedie wrote: > > although I can't find an unambiguous definition of "queued for service" > in the online standard. I'm reading it as requiring that the I/O has > reached the block device layer, not simply that it has been marked dirty > for some future writeback pass to catch; Uli agrees with that > interpretation. That interpretation makes pretty much zero sense. If you care about the data hitting the disk, you have to use fsync() or similar _anyway_, and pretending anything else is just bogus. As such, just marking the pages dirty is as much of a "queing" them for write as actually writing them, since in both cases the guarantees are _exactly_ the same: the pages have not hit the disk by the time the system call returns, but will hit the disk at some time in the future. Having the requirement that it is on some sw-only request queue is nonsensical, since such a queue is totally invisible from a user perspective. User space has no idea about "block device layer" vs "VM layer" queues, and trying to distinguiosh between the two is madness. It's just an internal implementation issue that has no meaning to the user. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: msync() behaviour broken for MS_ASYNC, revert patch? 2004-03-31 22:37 ` Linus Torvalds @ 2004-03-31 23:41 ` Stephen C. Tweedie 2004-04-01 0:08 ` Linus Torvalds 0 siblings, 1 reply; 18+ messages in thread From: Stephen C. Tweedie @ 2004-03-31 23:41 UTC (permalink / raw) To: Linus Torvalds Cc: linux-mm, Andrew Morton, linux-kernel, Ulrich Drepper, Stephen Tweedie Hi, On Wed, 2004-03-31 at 23:37, Linus Torvalds wrote: > On Wed, 31 Mar 2004, Stephen C. Tweedie wrote: > > > > although I can't find an unambiguous definition of "queued for service" > > in the online standard. I'm reading it as requiring that the I/O has > > reached the block device layer, not simply that it has been marked dirty > > for some future writeback pass to catch; Uli agrees with that > > interpretation. > > That interpretation makes pretty much zero sense. > > If you care about the data hitting the disk, you have to use fsync() or > similar _anyway_, and pretending anything else is just bogus. You can make the same argument for either implementation of MS_ASYNC. And there's at least one way in which the "submit IO now" version can be used meaningfully --- if you've got several specific areas of data in one or more mappings that need flushed to disk, you'd be able to initiate IO with multiple MS_ASYNC calls and then wait for completion with either MS_SYNC or fsync(). That gives you an interface that corresponds somewhat with the region-based filemap_sync(); filemap_fdatawrite(); filemap_datawait() that the kernel itself uses. > Having the requirement that it is on some sw-only request queue is > nonsensical, since such a queue is totally invisible from a user > perspective. It's very much visible, just from a performance perspective, if you want to support "kick off this IO, I'm going to wait for the completion shortly." If that's the interpretation of MS_ASYNC, then the app is basically saying it doesn't want the writeback mechanism to be idle until the writes have completed, regardless of whether it's a block device or an NFS file or whatever underneath. But whether that's a legal use of MS_ASYNC really depends on what the standard is requiring. I could be persuaded either way. Uli? Does anyone know what other Unixen do here? --Stephen -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: msync() behaviour broken for MS_ASYNC, revert patch? 2004-03-31 23:41 ` Stephen C. Tweedie @ 2004-04-01 0:08 ` Linus Torvalds 2004-04-01 0:30 ` Andrew Morton 2004-04-01 15:40 ` Stephen C. Tweedie 0 siblings, 2 replies; 18+ messages in thread From: Linus Torvalds @ 2004-04-01 0:08 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: linux-mm, Andrew Morton, linux-kernel, Ulrich Drepper On Wed, 1 Apr 2004, Stephen C. Tweedie wrote: > > On Wed, 2004-03-31 at 23:37, Linus Torvalds wrote: > > > If you care about the data hitting the disk, you have to use fsync() or > > similar _anyway_, and pretending anything else is just bogus. > > You can make the same argument for either implementation of MS_ASYNC. Exactly. Which is why I say that the implementation cannot matter, because user space would be _buggy_ if it depended on some timing issue. > And there's at least one way in which the "submit IO now" version can be > used meaningfully --- if you've got several specific areas of data in > one or more mappings that need flushed to disk, you'd be able to > initiate IO with multiple MS_ASYNC calls and then wait for completion > with either MS_SYNC or fsync(). Why wouldn't you be able to do that with the current one? Tha advantage of the current MS_ASYNC is absolutely astoundingly HUGE: because we don't wait for in-progress IO, it can be used to efficiently synchronize multiple different areas, and then after that waiting for them with _one_ single fsync(). In contrast, the "wait for queued IO" approach can't sanely do that, exactly because it will wait in the middle, depending on other activity at the same time. It will always have the worry that it happens to do the msync() at the wrong time, and then wait synchronously when it shouldn't. More importanrtly, the current behaviour makes certain patterns _possible_ that your suggested semantics simply cannot do efficiently. If we have data records smaller than a page, and want to mark them dirty as they happen, the current msync() allows that - it doesn't matter that another datum was marked dirty just a moment ago. Then, you do one fsync() only when you actually want to _commit_ a series of updates before you change the index. But if we want to have another flag, with MS_HALF_ASYNC, that's certainly ok by me. I'm all for choice. It's just that I most definitely want the choice of doing it the way we do it now, since I consider that to be the _sane_ way. > It's very much visible, just from a performance perspective, if you want > to support "kick off this IO, I'm going to wait for the completion > shortly." That may well be worth a call of its own. It has nothing to do with memory mapping, though - what you're really looking for is fasync(). And yes, I agree that _that_ would make sense. Havign some primitives to start writeout of an area of a file would likely be a good thing. I'd be perfectly happy with a set of file cache control operations, including - start writeback in [a,b] - wait for [a,b] stable - and maybe "punch hole in [a,b]" Then you could use these for write() in addition to mmap(), and you can first mark multiple regions dirty, and then do a single wait (which is clearly more efficient than synchronously waiting for multiple regions). But none of these have anything to do with what SuS or any other standard says about MS_ASYNC. > But whether that's a legal use of MS_ASYNC really depends on what the > standard is requiring. I could be persuaded either way. Uli? My argument was that a standard CANNOT say anything one way or the other, because the behaviour IS NOT USER-VISIBLE! A program fundamentally cannot care, since the only issue is a pure implementation issue of "which queue" the data got queued onto. Bringing in a standards body is irrelevant. It's like trying to use the bible to determine whether protons have a positive charge. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: msync() behaviour broken for MS_ASYNC, revert patch? 2004-04-01 0:08 ` Linus Torvalds @ 2004-04-01 0:30 ` Andrew Morton 2004-04-01 15:40 ` Stephen C. Tweedie 1 sibling, 0 replies; 18+ messages in thread From: Andrew Morton @ 2004-04-01 0:30 UTC (permalink / raw) To: Linus Torvalds; +Cc: sct, linux-mm, linux-kernel, drepper Linus Torvalds <torvalds@osdl.org> wrote: > > I'd be perfectly happy with a set of file cache control operations, > including > > - start writeback in [a,b] > - wait for [a,b] stable > - and maybe "punch hole in [a,b]" Yup, there are a number of linux-specific fadvise() extensions we can/should be adding, including "start writeback on this byte range for flush" and "start writeback on this byte range for data integrity" and "wait on writeback of this byte range". Some of these are needed internally for the fs-AIO implementation, and also for an O_SYNC which only writes the pages which the writer wrote. It's pretty simple, and it'll be happening. One wrinkle is that we'd need to add the start/end loff_t pair to the a_ops->writepages() prototype. But instead I intend to put the start/end info into struct writeback_control and pass it that way. It seems sleazy at first but when you think about it, it isn't. It provides forward and backward compatability, it recognises that it's just a hint and that filesystems can legitimately sync the whole file and it produces smaller+faster code. We might need a wait_on_page_writeback_range() a_op though. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: msync() behaviour broken for MS_ASYNC, revert patch? 2004-04-01 0:08 ` Linus Torvalds 2004-04-01 0:30 ` Andrew Morton @ 2004-04-01 15:40 ` Stephen C. Tweedie 2004-04-01 16:02 ` Linus Torvalds ` (2 more replies) 1 sibling, 3 replies; 18+ messages in thread From: Stephen C. Tweedie @ 2004-04-01 15:40 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-mm, Andrew Morton, linux-kernel, Ulrich Drepper Hi, On Thu, 2004-04-01 at 01:08, Linus Torvalds wrote: > > You can make the same argument for either implementation of MS_ASYNC. > Exactly. > Which is why I say that the implementation cannot matter, because user > space would be _buggy_ if it depended on some timing issue. I see it purely as a performance issue. That's the context in which we saw the initial complaint about the 2.4 behaviour change. > > And there's at least one way in which the "submit IO now" version can be > > used meaningfully --- if you've got several specific areas of data in > > one or more mappings that need flushed to disk, you'd be able to > > initiate IO with multiple MS_ASYNC calls and then wait for completion > > with either MS_SYNC or fsync(). > > Why wouldn't you be able to do that with the current one? You can, but only across one fd. A = mmap(..., a); B = mmap(..., b); msync(A, ..., MS_ASYNC); msync(B, ..., MS_ASYNC); fsync(a); fsync(b); has rather different performance characteristics according to which way you go. Do deferred writeback and the two fsync()s do serialised IO, with the fs idle in between. Submit the background IO immediately and you avoid that. Anyway, I just tried on a Solaris-2.8 box, and the results are rather interesting. Doing a simple (touch-one-char, msync one page) loop on a mmap'ed file on a local scsi disk, MS_ASYNC gives ~15000 msyncs per second; MS_SYNC gives ~900. [A null getpid() loop gives about 250,000 loops a second.] However, the "iostat" shows *exactly* the same disk throughput in each case. MS_ASYNC is causing immediate IO kickoff, but shows only ~900 ios per second, the same as the ios-per-second for MS_SYNC and the same as the MS_SYNC loop frequency. So it appears that on Solaris, MS_ASYNC is kicking off instant IO, but is not waiting for existing IO to complete first. So if we have an IO already in progress, then many msync calls end up queuing the *same* subsequent IO, and once one new IO is queued, further MS_ASYNC msyncs don't bother scheduling a new one (on the basis that the already-scheduled one hasn't started yet so the new data is already guaranteed to hit disk.) So Solaris behaviour is indeed to begin IO as soon as possible on MS_ASYNC, but they are doing it far more efficiently than our current msync code can do. > Tha advantage of the current MS_ASYNC is absolutely astoundingly HUGE: > because we don't wait for in-progress IO, it can be used to efficiently > synchronize multiple different areas, and then after that waiting for them > with _one_ single fsync(). The Solaris one manages to preserve those properties while still scheduling the IO "soon". I'm not sure how we could do that in the current VFS, short of having a background thread scheduling deferred writepage()s as soon as the existing page becomes unlocked. > More importanrtly, the current behaviour makes certain patterns _possible_ > that your suggested semantics simply cannot do efficiently. If we have > data records smaller than a page, and want to mark them dirty as they > happen, the current msync() allows that - it doesn't matter that another > datum was marked dirty just a moment ago. Then, you do one fsync() only > when you actually want to _commit_ a series of updates before you change > the index. > But if we want to have another flag, with MS_HALF_ASYNC, that's certainly > ok by me. I'm all for choice. Yes, but we _used_ to have that choice --- call msync() with flags == 0, and you'd get the deferred kupdated writeback; call it with MS_ASYNC and you'd get instant IO kickoff; call it with MS_SYNC and you'd get synchronous completion. But now we've lost the instant kickoff, async completion option, and MS_ASYNC behaves just like flags==0. So I'm all for adding the choice back, and *documenting* it so that people know exactly what to expect in all three cases. Whether the choice comes from an fadvise option or an msync() doesn't bother me that much. In that case, the decision about which version of the behaviour MS_ASYNC should give is (as it should be) a matter of obeying the standard correctly, and the other useful behaviours are preserved elsewhere. Which brings us back to trying to interpret the vague standard. Both Uli's interpretation and the Solaris implementation suggest that we need to start the writepage sooner rather than later. > > It's very much visible, just from a performance perspective, if you want > > to support "kick off this IO, I'm going to wait for the completion > > shortly." > > That may well be worth a call of its own. It has nothing to do with memory > mapping, though - what you're really looking for is fasync(). Indeed. And msync(flags==0) remains as a way of synchronising mmaps with the inode-dirty-list fasync writeback. > And yes, I agree that _that_ would make sense. Havign some primitives to > start writeout of an area of a file would likely be a good thing. > > I'd be perfectly happy with a set of file cache control operations, > including > > - start writeback in [a,b] posix_fadvise() seems to do something a little like this already: the FADV_DONTNEED handler tries if (!bdi_write_congested(mapping->backing_dev_info)) filemap_flush(mapping); before going into the invalidate_mapping_pages() call. Having that (a) limited to the specific file range passed into the fadvise(), and (b) available as a separate function independent of the DONTNEED page invalidator, would seem like an entirely sensible extension. The obvious implementations would be somewhat inefficient in some cases, though --- currently __filemap_fdatawrite simply list_splice()s the inode dirty list into the io list. Walking a long dirty list to flush just a few pages from a narrow range could get slow, and walking the radix tree would be inefficient if there are only a few dirty pages hidden in a large cache of clean pages. > My argument was that a standard CANNOT say anything one way or the other, > because the behaviour IS NOT USER-VISIBLE! Worse, it doesn't seem to be implemented consistently either. I've been trying on a few other Unixen while writing this. First on a Tru64 box, and it is _not_ kicking off any IO at all for MS_ASYNC, except for the 30-second regular sync. The same appears to be true on FreeBSD. And on HP-UX, things go in the other direction: the performance of MS_ASYNC is identical to MS_SYNC, both in terms of observed disk IO during the sync and the overall rate of the msync loop. So it appears we've got Unix precedent for pretty-much any reasonable interpretation of MS_ASYNC that we want. Patch withdrawn! --Stephen -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: msync() behaviour broken for MS_ASYNC, revert patch? 2004-04-01 15:40 ` Stephen C. Tweedie @ 2004-04-01 16:02 ` Linus Torvalds 2004-04-01 16:33 ` Stephen C. Tweedie 2004-04-01 16:19 ` Jamie Lokier 2004-04-01 18:51 ` Andrew Morton 2 siblings, 1 reply; 18+ messages in thread From: Linus Torvalds @ 2004-04-01 16:02 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: linux-mm, Andrew Morton, linux-kernel, Ulrich Drepper On Thu, 1 Apr 2004, Stephen C. Tweedie wrote: > > So it appears that on Solaris, MS_ASYNC is kicking off instant IO, but > is not waiting for existing IO to complete first. A much more likely schenario is that Solaris is really doing the same thing we are, but it _also_ ends up opportunistically trying to put the resultant pages on the IO queues if possible (ie do a "write-ahead": start writeout if that doesn't imply blocking). We could probably do that too, it seems easy enough. A "TestSetPageLocked()" along with setting the BIO_RW_AHEAD flag. The only problem is that I don't think we really have support for doing write-ahead (ie we clear the page "dirty" bit too early, so if the write gets cancelled due to the IO queues being full, the dirty bit gets lost. So we don't want to go there for now, but it's something to keep in mind, perhaps. > Worse, it doesn't seem to be implemented consistently either. I've been > trying on a few other Unixen while writing this. First on a Tru64 box, > and it is _not_ kicking off any IO at all for MS_ASYNC, except for the > 30-second regular sync. The same appears to be true on FreeBSD. And on > HP-UX, things go in the other direction: the performance of MS_ASYNC is > identical to MS_SYNC, both in terms of observed disk IO during the sync > and the overall rate of the msync loop. If you check HP-UX, make sure it's a recent one. HPUX has historically been just too broken for words when it comes to mmap() (ie some _really_ strange semantics, like not being able to unmap partial mappings etc). Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: msync() behaviour broken for MS_ASYNC, revert patch? 2004-04-01 16:02 ` Linus Torvalds @ 2004-04-01 16:33 ` Stephen C. Tweedie 0 siblings, 0 replies; 18+ messages in thread From: Stephen C. Tweedie @ 2004-04-01 16:33 UTC (permalink / raw) To: Linus Torvalds Cc: linux-mm, Andrew Morton, linux-kernel, Ulrich Drepper, Stephen Tweedie Hi, On Thu, 2004-04-01 at 17:02, Linus Torvalds wrote: > > Worse, it doesn't seem to be implemented consistently either. I've been > > trying on a few other Unixen while writing this. First on a Tru64 box, > > and it is _not_ kicking off any IO at all for MS_ASYNC, except for the > > 30-second regular sync. The same appears to be true on FreeBSD. And on > > HP-UX, things go in the other direction: the performance of MS_ASYNC is > > identical to MS_SYNC, both in terms of observed disk IO during the sync > > and the overall rate of the msync loop. > > If you check HP-UX, make sure it's a recent one. HPUX has historically > been just too broken for words when it comes to mmap() (ie some _really_ > strange semantics, like not being able to unmap partial mappings etc). I'm not sure what counts as "recent" for that, but this was on HP-UX 11. That's the most recent I've got access to. --Stephen -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: msync() behaviour broken for MS_ASYNC, revert patch? 2004-04-01 15:40 ` Stephen C. Tweedie 2004-04-01 16:02 ` Linus Torvalds @ 2004-04-01 16:19 ` Jamie Lokier 2004-04-01 16:56 ` s390 storage key inconsistency? [was Re: msync() behaviour broken for MS_ASYNC, revert patch?] Stephen C. Tweedie 2004-04-01 16:57 ` msync() behaviour broken for MS_ASYNC, revert patch? Stephen C. Tweedie 2004-04-01 18:51 ` Andrew Morton 2 siblings, 2 replies; 18+ messages in thread From: Jamie Lokier @ 2004-04-01 16:19 UTC (permalink / raw) To: Stephen C. Tweedie Cc: Linus Torvalds, linux-mm, Andrew Morton, linux-kernel, Ulrich Drepper Stephen C. Tweedie wrote: > Yes, but we _used_ to have that choice --- call msync() with flags == 0, > and you'd get the deferred kupdated writeback; Is that not equivalent to MS_INVALIDATE? It seems to be equivalent in 2.6.4. The code in 2.6.4 ignores MS_INVALIDATE except for trivial error checks, so msync() with flags == MS_INVALIDATE has the same effect as msync() with flags == 0. Some documentation I'm looking at says MS_INVALIDATE updates the mapped page to contain the current contents of the file. 2.6.4 seems to do the reverse: update the file to contain the current content of the mapped page. "man msync" agrees with the the latter. (I can't look at SUS right now). On systems where the CPU caches are fully coherent, the only difference is that the former is a no-op and the latter does the same as the new behaviour of MS_ASYNC. On systems where the CPU caches aren't coherent, some cache synchronising or flushing operations are implied. On either type of system, MS_INVALIDATE doesn't seem to be doing what the documentation I'm looking at says it should do. -- Jamie -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 18+ messages in thread
* s390 storage key inconsistency? [was Re: msync() behaviour broken for MS_ASYNC, revert patch?] 2004-04-01 16:19 ` Jamie Lokier @ 2004-04-01 16:56 ` Stephen C. Tweedie 2004-04-01 16:57 ` msync() behaviour broken for MS_ASYNC, revert patch? Stephen C. Tweedie 1 sibling, 0 replies; 18+ messages in thread From: Stephen C. Tweedie @ 2004-04-01 16:56 UTC (permalink / raw) To: Jamie Lokier, Linux on 390 Port Cc: Linus Torvalds, linux-mm, Andrew Morton, linux-kernel, Ulrich Drepper, Stephen Tweedie Hi, On Thu, 2004-04-01 at 17:19, Jamie Lokier wrote: > Some documentation I'm looking at says MS_INVALIDATE updates the > mapped page to contain the current contents of the file. 2.6.4 seems > to do the reverse: update the file to contain the current content of > the mapped page. "man msync" agrees with the the latter. (I can't > look at SUS right now). btw, just looking at the filemap_sync_pte() code for MS_INVALIDATE, I noticed if (!PageReserved(page) && (ptep_clear_flush_dirty(vma, address, ptep) || page_test_and_clear_dirty(page))) set_page_dirty(page); I just happened to follow the function and noticed that on s390, page_test_and_clear_dirty() has the comment: * Test and clear dirty bit in storage key. * We can't clear the changed bit atomically. This is a potential * race against modification of the referenced bit. This function * should therefore only be called if it is not mapped in any * address space. but in this case the page is clearly mapped in the caller's address space, else we wouldn't have reached this. Is this a problem? --Stephen -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: msync() behaviour broken for MS_ASYNC, revert patch? 2004-04-01 16:19 ` Jamie Lokier 2004-04-01 16:56 ` s390 storage key inconsistency? [was Re: msync() behaviour broken for MS_ASYNC, revert patch?] Stephen C. Tweedie @ 2004-04-01 16:57 ` Stephen C. Tweedie 1 sibling, 0 replies; 18+ messages in thread From: Stephen C. Tweedie @ 2004-04-01 16:57 UTC (permalink / raw) To: Jamie Lokier Cc: Linus Torvalds, linux-mm, Andrew Morton, linux-kernel, Ulrich Drepper, Stephen Tweedie Hi, On Thu, 2004-04-01 at 17:19, Jamie Lokier wrote: > Stephen C. Tweedie wrote: > > Yes, but we _used_ to have that choice --- call msync() with flags == 0, > > and you'd get the deferred kupdated writeback; > > Is that not equivalent to MS_INVALIDATE? It seems to be equivalent in > 2.6.4. It is in all the kernels I've looked at, but that's mainly because we seem to ignore MS_INVALIDATE. > Some documentation I'm looking at says MS_INVALIDATE updates the > mapped page to contain the current contents of the file. 2.6.4 seems > to do the reverse: update the file to contain the current content of > the mapped page. "man msync" agrees with the the latter. (I can't > look at SUS right now). SUSv3 says When MS_INVALIDATE is specified, msync() shall invalidate all cached copies of mapped data that are inconsistent with the permanent storage locations such that subsequent references shall obtain data that was consistent with the permanent storage locations sometime between the call to msync() and the first subsequent memory reference to the data. which seems to imply that dirty ptes should simply be cleared, rather than propagated to the page dirty bits. That's easy enough --- we already propagate the flags down to filemap_sync_pte, where the page and pte dirty bits are modified. Does anyone know any reason why we don't do MS_INVALIDATE there already? --Stephen -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: msync() behaviour broken for MS_ASYNC, revert patch? 2004-04-01 15:40 ` Stephen C. Tweedie 2004-04-01 16:02 ` Linus Torvalds 2004-04-01 16:19 ` Jamie Lokier @ 2004-04-01 18:51 ` Andrew Morton 2 siblings, 0 replies; 18+ messages in thread From: Andrew Morton @ 2004-04-01 18:51 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: torvalds, linux-mm, linux-kernel, drepper "Stephen C. Tweedie" <sct@redhat.com> wrote: > > > Tha advantage of the current MS_ASYNC is absolutely astoundingly HUGE: > > because we don't wait for in-progress IO, it can be used to efficiently > > synchronize multiple different areas, and then after that waiting for them > > with _one_ single fsync(). > > The Solaris one manages to preserve those properties while still > scheduling the IO "soon". I'm not sure how we could do that in the > current VFS, short of having a background thread scheduling deferred > writepage()s as soon as the existing page becomes unlocked. filemap_flush() will do exactly this. So if you want the Solaris semantics, calling filemap_flush() intead of filemap_fdatawrite() should do it. > posix_fadvise() seems to do something a little like this already: the > FADV_DONTNEED handler tries > > if (!bdi_write_congested(mapping->backing_dev_info)) > filemap_flush(mapping); > > before going into the invalidate_mapping_pages() call. Having that (a) > limited to the specific file range passed into the fadvise(), and (b) > available as a separate function independent of the DONTNEED page > invalidator, would seem like an entirely sensible extension. > > The obvious implementations would be somewhat inefficient in some cases, > though --- currently __filemap_fdatawrite simply list_splice()s the > inode dirty list into the io list. Walking a long dirty list to flush > just a few pages from a narrow range could get slow, and walking the > radix tree would be inefficient if there are only a few dirty pages > hidden in a large cache of clean pages. The patches I have queued in -mm allow us to do this. We use find_get_pages_tag() to iterate over only the dirty pages in the tree. That still has the efficiency problem that when searching for dirty pages we also visit pages which are both dirty and under writeback (we're not interested in those pages if it is a non-blocking flush), although I've only observed that to be a problem when the queue size was bumped up to 10,000 requests and I fixed that up for the common cases by other means. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: msync() behaviour broken for MS_ASYNC, revert patch? 2004-03-31 22:16 msync() behaviour broken for MS_ASYNC, revert patch? Stephen C. Tweedie 2004-03-31 22:37 ` Linus Torvalds @ 2004-03-31 22:53 ` Andrew Morton 2004-03-31 23:20 ` Stephen C. Tweedie 2004-04-16 22:35 ` Jamie Lokier 2 siblings, 1 reply; 18+ messages in thread From: Andrew Morton @ 2004-03-31 22:53 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: linux-mm, linux-kernel, drepper "Stephen C. Tweedie" <sct@redhat.com> wrote: > > Hi, > > I've been looking at a discrepancy between msync() behaviour on 2.4.9 > and newer 2.4 kernels, and it looks like things changed again in > 2.5.68. From the ChangeLog: > > ChangeSet 1.971.76.156 2003/04/09 11:31:36 akpm@digeo.com > [PATCH] Make msync(MS_ASYNC) no longer start the I/O > > MS_ASYNC will currently wait on previously-submitted I/O, then start new I/O > and not wait on it. This can cause undesirable blocking if msync is called > rapidly against the same memory. > > So instead, change msync(MS_ASYNC) to not start any IO at all. Just flush > the pte dirty bits into the pageframe and leave it at that. > > The IO _will_ happen within a kupdate period. And the application can use > fsync() or fadvise(FADV_DONTNEED) if it actually wants to schedule the IO > immediately. > > Unfortunately, this seems to contradict SingleUnix requirements, which > state: > > When MS_ASYNC is specified, msync() shall return immediately > once all the write operations are initiated or queued for > servicing > > although I can't find an unambiguous definition of "queued for service" > in the online standard. I'm reading it as requiring that the I/O has > reached the block device layer, not simply that it has been marked dirty > for some future writeback pass to catch; Uli agrees with that > interpretation. I don't think I agree with that. If "queued for service" means we've started the I/O, then what does "initiated" mean, and why did they specify "initiated" separately? What triggered all this was a dinky little test app which Linus wrote to time some aspect of P4 tlb writeback latency. It sits in a loop dirtying a page then msyncing it with MS_ASYNC. It ran very poorly, because MS_ASYNC ended up waiting on the previously-submitted I/O before starting new I/O. One approach to improving that would be for MS_ASYNC to say "if the page is already under writeout then just skip the I/O". But that's worthless, really - it makes the MS_ASYNC semantics too vague. As you point out, Linus's app should have used the "flags=0" linux extension. Didn't think of that. Your reversion patch would mean that current applications which use MS_ASYNC will again suffer large latencies if the pages are under writeout. Sure, users could switch apps to using flags=0 to avoid that, but people don't know to do that. So given that SUS is ambiguous about this, I'd suggest that we be able to demonstrate some real-world reason why this matters. Why are you concerned about this? > The 2.5.68 changeset also includes the comment: > > (This has triggered an ext3 bug - the page's buffers get dirtied so fast > that kjournald keeps writing the buffers over and over for 10-20 seconds > before deciding to give up for some reason) > > Was that ever resolved? If it's still there, I should have a look at it > if we're restoring the old trigger. (These changelog thingies are useful, aren't they?) I don't recall checking since that time. I expect that Linus's test app will still livelock kjournals in the current -linus tree - kjournald sits there trying to write out the dirty buffers but the dang things just keep on getting dirtied. If so, I'm sure this patch (queued for 2.6.6) will fix it: ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.5-rc3/2.6.5-rc3-mm3/broken-out/jbd-move-locked-buffers.patch -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: msync() behaviour broken for MS_ASYNC, revert patch? 2004-03-31 22:53 ` Andrew Morton @ 2004-03-31 23:20 ` Stephen C. Tweedie 0 siblings, 0 replies; 18+ messages in thread From: Stephen C. Tweedie @ 2004-03-31 23:20 UTC (permalink / raw) To: Andrew Morton Cc: linux-mm, linux-kernel, Ulrich Drepper, Linus Torvalds, Stephen Tweedie Hi, On Wed, 2004-03-31 at 23:53, Andrew Morton wrote: > "Stephen C. Tweedie" <sct@redhat.com> wrote: > > Unfortunately, this seems to contradict SingleUnix requirements, which > > state: > > When MS_ASYNC is specified, msync() shall return immediately > > once all the write operations are initiated or queued for > > servicing > > although I can't find an unambiguous definition of "queued for service" > > in the online standard. I'm reading it as requiring that the I/O has > > reached the block device layer > I don't think I agree with that. If "queued for service" means we've > started the I/O, then what does "initiated" mean, and why did they specify > "initiated" separately? I'd interpret "initiated" as having reached hardware. "Queued for service" is much more open to interpretation: Uli came up with "the data must be actively put in a stage where I/O is initiated", which still doesn't really address what sort of queueing is allowed. > What triggered all this was a dinky little test app which Linus wrote to > time some aspect of P4 tlb writeback latency. It sits in a loop dirtying a > page then msyncing it with MS_ASYNC. It ran very poorly, because MS_ASYNC > ended up waiting on the previously-submitted I/O before starting new I/O. Sure. There are lots of ways an interface can be misused, though: you only know if one use is valid or not once you've determined what the _correct_ use is. I'm much more concerned about getting a correct interpretation of the spec than of making IO fast for the sake of a memory benchmark. :-) > One approach to improving that would be for MS_ASYNC to say "if the page is > already under writeout then just skip the I/O". But that's worthless, > really - it makes the MS_ASYNC semantics too vague. Agreed. > Your reversion patch would mean that current applications which use > MS_ASYNC will again suffer large latencies if the pages are under writeout. Well, this whole issue came up precisely because somebody was seeing exactly such a latency hit going from 2.4.9 to a later kernel. We've not really been consistent about it in the past. > Sure, users could switch apps to using flags=0 to avoid that, but people > don't know to do that. Exactly why we need documentation for that combination, whatever happens. > So given that SUS is ambiguous about this, I'd suggest that we be able to > demonstrate some real-world reason why this matters. Why are you concerned > about this? Just for the reason you mentioned --- a real-world app (in-house, so flags==0 is actually a valid solution for them) which was seeing performance degradation when the "MS_ASYNC submits IO" was introduced in the first place. But it was internally written, so I've no idea at all whether or not the app was assuming one behaviour or the other on other Unixen. --Stephen -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: msync() behaviour broken for MS_ASYNC, revert patch? 2004-03-31 22:16 msync() behaviour broken for MS_ASYNC, revert patch? Stephen C. Tweedie 2004-03-31 22:37 ` Linus Torvalds 2004-03-31 22:53 ` Andrew Morton @ 2004-04-16 22:35 ` Jamie Lokier 2004-04-19 21:54 ` Stephen C. Tweedie 2 siblings, 1 reply; 18+ messages in thread From: Jamie Lokier @ 2004-04-16 22:35 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: linux-mm, Andrew Morton, linux-kernel, Ulrich Drepper Stephen C. Tweedie wrote: > I've been looking at a discrepancy between msync() behaviour on 2.4.9 > and newer 2.4 kernels, and it looks like things changed again in > 2.5.68. When you say a discrepancy between 2.4.9 and newer 2.4 kernels, do you mean that the msync() behaviour changed during the 2.4 series? If so, what was the change? Thanks, -- Jamie -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: msync() behaviour broken for MS_ASYNC, revert patch? 2004-04-16 22:35 ` Jamie Lokier @ 2004-04-19 21:54 ` Stephen C. Tweedie 2004-04-21 2:10 ` Jamie Lokier 0 siblings, 1 reply; 18+ messages in thread From: Stephen C. Tweedie @ 2004-04-19 21:54 UTC (permalink / raw) To: Jamie Lokier Cc: linux-mm, Andrew Morton, linux-kernel, Ulrich Drepper, Stephen Tweedie Hi, On Fri, 2004-04-16 at 23:35, Jamie Lokier wrote: > Stephen C. Tweedie wrote: > > I've been looking at a discrepancy between msync() behaviour on 2.4.9 > > and newer 2.4 kernels, and it looks like things changed again in > > 2.5.68. > > When you say a discrepancy between 2.4.9 and newer 2.4 kernels, do you > mean that the msync() behaviour changed during the 2.4 series? Yes. > If so, what was the change? 2.4.9 behaved like current 2.6 --- on MS_ASYNC, it did a set_page_dirty() which means the page will get picked up by the next 5-second bdflush pass. But later 2.4 kernels were changed so that they started MS_ASYNC IO immediately with filemap_fdatasync() (which is asynchronous regarding the new IO, but which blocks synchronously if there is already old IO in flight on the page.) That was reverted back to the earlier, 2.4.9 behaviour in the 2.5 series. Cheers, Stephen -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: msync() behaviour broken for MS_ASYNC, revert patch? 2004-04-19 21:54 ` Stephen C. Tweedie @ 2004-04-21 2:10 ` Jamie Lokier 2004-04-21 9:52 ` Stephen C. Tweedie 0 siblings, 1 reply; 18+ messages in thread From: Jamie Lokier @ 2004-04-21 2:10 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: linux-mm, Andrew Morton, linux-kernel, Ulrich Drepper Stephen C. Tweedie wrote: > > If so, what was the change? > > 2.4.9 behaved like current 2.6 --- on MS_ASYNC, it did a > set_page_dirty() which means the page will get picked up by the next > 5-second bdflush pass. But later 2.4 kernels were changed so that they > started MS_ASYNC IO immediately with filemap_fdatasync() (which is > asynchronous regarding the new IO, but which blocks synchronously if > there is already old IO in flight on the page.) > > That was reverted back to the earlier, 2.4.9 behaviour in the 2.5 > series. It was 2.5.68. Thanks, that's very helpful. msync(0) has always had behaviour consistent with the <=2.4.9 and >=2.5.68 MS_ASYNC behaviour, is that right? If so, programs may as well "#define MS_ASYNC 0" on Linux, to get well defined and consistent behaviour. It would be nice to change the definition in libc to zero, but I don't think it's possible because msync(MS_SYNC|MS_ASYNC) needs to fail. -- Jamie -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: msync() behaviour broken for MS_ASYNC, revert patch? 2004-04-21 2:10 ` Jamie Lokier @ 2004-04-21 9:52 ` Stephen C. Tweedie 0 siblings, 0 replies; 18+ messages in thread From: Stephen C. Tweedie @ 2004-04-21 9:52 UTC (permalink / raw) To: Jamie Lokier Cc: linux-mm, Andrew Morton, linux-kernel, Ulrich Drepper, Stephen Tweedie Hi, On Wed, 2004-04-21 at 03:10, Jamie Lokier wrote: > msync(0) has always had behaviour consistent with the <=2.4.9 and > >=2.5.68 MS_ASYNC behaviour, is that right? Not sure about "always", but it looks like it recently at least. 2.2 msync was implemented very differently but seems, from the source, to have the same property --- do_write_page() calls f_op->write() on msync, and MS_SYNC forces an fsync after the writes. But 2.4 and 2.6 share much more similar code to each other. So all since 2.2 seem to do the fully-async, deferred writeback behaviour for flags==0. --Stephen -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2004-04-21 9:52 UTC | newest] Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2004-03-31 22:16 msync() behaviour broken for MS_ASYNC, revert patch? Stephen C. Tweedie 2004-03-31 22:37 ` Linus Torvalds 2004-03-31 23:41 ` Stephen C. Tweedie 2004-04-01 0:08 ` Linus Torvalds 2004-04-01 0:30 ` Andrew Morton 2004-04-01 15:40 ` Stephen C. Tweedie 2004-04-01 16:02 ` Linus Torvalds 2004-04-01 16:33 ` Stephen C. Tweedie 2004-04-01 16:19 ` Jamie Lokier 2004-04-01 16:56 ` s390 storage key inconsistency? [was Re: msync() behaviour broken for MS_ASYNC, revert patch?] Stephen C. Tweedie 2004-04-01 16:57 ` msync() behaviour broken for MS_ASYNC, revert patch? Stephen C. Tweedie 2004-04-01 18:51 ` Andrew Morton 2004-03-31 22:53 ` Andrew Morton 2004-03-31 23:20 ` Stephen C. Tweedie 2004-04-16 22:35 ` Jamie Lokier 2004-04-19 21:54 ` Stephen C. Tweedie 2004-04-21 2:10 ` Jamie Lokier 2004-04-21 9:52 ` Stephen C. Tweedie
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox