linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* Re: Checking page_count(page) in invalidate_complete_page
       [not found] <4518333E.2060101@oracle.com>
@ 2006-09-25 21:10 ` Andrew Morton
  2006-09-25 22:30   ` Chuck Lever
                     ` (4 more replies)
  0 siblings, 5 replies; 56+ messages in thread
From: Andrew Morton @ 2006-09-25 21:10 UTC (permalink / raw)
  To: chuck.lever; +Cc: Trond Myklebust, Steve Dickson, linux-mm

(Added linux-mm)

On Mon, 25 Sep 2006 15:51:26 -0400
Chuck Lever <chuck.lever@oracle.com> wrote:

> Hi Andrew-
> 
> Steve Dickson and I have independently discovered some cache 
> invalidation problems in 2.6.18's NFS client.  Using git bisect, I was 
> able to track it back to this commit:
> 
> > commit 016eb4a0ed06a3677d67a584da901f0e9a63c666
> > Author: Andrew Morton <akpm@osdl.org>
> > Date:   Fri Sep 8 09:48:38 2006 -0700
> > 
> >     [PATCH] invalidate_complete_page() race fix
> > 
> >     If a CPU faults this page into pagetables after invalidate_mapping_pages()
> >     checked page_mapped(), invalidate_complete_page() will still proceed to remove
> >     the page from pagecache.  This leaves the page-faulting process with a
> >     detached page.  If it was MAP_SHARED then file data loss will ensue.
> > 
> >     Fix that up by checking the page's refcount after taking tree_lock.
> > 
> >     Cc: Nick Piggin <nickpiggin@yahoo.com.au>
> >     Cc: Hugh Dickins <hugh@veritas.com>
> >     Cc: <stable@kernel.org>
> >     Signed-off-by: Andrew Morton <akpm@osdl.org>
> >     Signed-off-by: Linus Torvalds <torvalds@osdl.org>
> 
> Instrumenting get_page and put_page has shown that the page reclaim 
> logic is temporarily bumping the page count on otherwise active but idle 
> pages, racing with the new page_count check in invalidate_complete_page 
> and preventing pages from being invalidated.
> 
> One problem for the NFS client is that invalidate_inode_pages2 is being 
> used to invalidate the pages associated with a cached directory.  If the 
> directory pages can't be invalidated because of this race, the contents 
> of the directory pages don't match the dentry cache, and all kinds of 
> strange behavior results.

NFS is presently ignoring the return value from invalidate_inode_pages2(),
in two places.  Could I suggest you fix that?  Then we'd at least not be
seeing "strange behaviour" and things will be easier to diagnose next time.

> I haven't checked the data invalidation behavior for regular files, but 
> the result will be file data corruption that comes and goes.
> 
> Would it be acceptable to revert that page_count(page) != 2 check in 
> invalidate_complete_page ?

Unfortunately not - that patch fixes cramfs failures and potential file
corruption.

The way to keep memory reclaim away from that page is to take
zone->lru_lock, but that's quite impractical for several reasons.

We could retry the invalidation a few times, but that stinks.

I think invalidate_inode_pages2() is sufficiently different from (ie:
stronger than) invalidate_inode_pages() to justify the addition of a new
invalidate_complete_page2(), which skips the page refcount check.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-09-25 21:10 ` Checking page_count(page) in invalidate_complete_page Andrew Morton
@ 2006-09-25 22:30   ` Chuck Lever
  2006-09-25 22:53     ` Andrew Morton
  2006-09-25 22:57     ` Steve Dickson
  2006-09-25 22:40   ` Chuck Lever
                     ` (3 subsequent siblings)
  4 siblings, 2 replies; 56+ messages in thread
From: Chuck Lever @ 2006-09-25 22:30 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Trond Myklebust, Steve Dickson, linux-mm

Andrew Morton wrote:
> (Added linux-mm)
> 
> On Mon, 25 Sep 2006 15:51:26 -0400
> Chuck Lever <chuck.lever@oracle.com> wrote:
> 
>> Hi Andrew-
>>
>> Steve Dickson and I have independently discovered some cache 
>> invalidation problems in 2.6.18's NFS client.  Using git bisect, I was 
>> able to track it back to this commit:
>>
>>> commit 016eb4a0ed06a3677d67a584da901f0e9a63c666
>>> Author: Andrew Morton <akpm@osdl.org>
>>> Date:   Fri Sep 8 09:48:38 2006 -0700
>>>
>>>     [PATCH] invalidate_complete_page() race fix
>>>
>>>     If a CPU faults this page into pagetables after invalidate_mapping_pages()
>>>     checked page_mapped(), invalidate_complete_page() will still proceed to remove
>>>     the page from pagecache.  This leaves the page-faulting process with a
>>>     detached page.  If it was MAP_SHARED then file data loss will ensue.
>>>
>>>     Fix that up by checking the page's refcount after taking tree_lock.
>>>
>>>     Cc: Nick Piggin <nickpiggin@yahoo.com.au>
>>>     Cc: Hugh Dickins <hugh@veritas.com>
>>>     Cc: <stable@kernel.org>
>>>     Signed-off-by: Andrew Morton <akpm@osdl.org>
>>>     Signed-off-by: Linus Torvalds <torvalds@osdl.org>
>> Instrumenting get_page and put_page has shown that the page reclaim 
>> logic is temporarily bumping the page count on otherwise active but idle 
>> pages, racing with the new page_count check in invalidate_complete_page 
>> and preventing pages from being invalidated.
>>
>> One problem for the NFS client is that invalidate_inode_pages2 is being 
>> used to invalidate the pages associated with a cached directory.  If the 
>> directory pages can't be invalidated because of this race, the contents 
>> of the directory pages don't match the dentry cache, and all kinds of 
>> strange behavior results.
> 
> NFS is presently ignoring the return value from invalidate_inode_pages2(),
> in two places.  Could I suggest you fix that?  Then we'd at least not be
> seeing "strange behaviour" and things will be easier to diagnose next time.

Yes, it certainly should check the return code there for directories and 
symlinks.  For files, it is expected that some pages could be left valid 
if they are still being used for other I/O operations.

Another place that's missing a return code check is 
invalidate_inode_pages2_range call in nfs_readdir_filler.

>> Would it be acceptable to revert that page_count(page) != 2 check in 
>> invalidate_complete_page ?
> 
> Unfortunately not - that patch fixes cramfs failures and potential file
> corruption.

It seems that the NFS client could now safely use a page cache 
invalidator that would wait for other page users to ensure that every 
page is invalidated properly, instead of skipping the pages that can't 
be immediately invalidated.

In my opinion that would be the correct fix here for NFS.

> The way to keep memory reclaim away from that page is to take
> zone->lru_lock, but that's quite impractical for several reasons.
> 
> We could retry the invalidation a few times, but that stinks.
> 
> I think invalidate_inode_pages2() is sufficiently different from (ie:
> stronger than) invalidate_inode_pages() to justify the addition of a new
> invalidate_complete_page2(), which skips the page refcount check.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-09-25 21:10 ` Checking page_count(page) in invalidate_complete_page Andrew Morton
  2006-09-25 22:30   ` Chuck Lever
@ 2006-09-25 22:40   ` Chuck Lever
  2006-09-25 23:02     ` Andrew Morton
  2006-09-25 22:50   ` Steve Dickson
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 56+ messages in thread
From: Chuck Lever @ 2006-09-25 22:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Trond Myklebust, Steve Dickson, linux-mm

Woops.

I got them backwards.

invalidate_inode_pages2 appears to wait for locked pages, while 
invalidate_inode_pages skips them.

Andrew Morton wrote:
> (Added linux-mm)
> 
> On Mon, 25 Sep 2006 15:51:26 -0400
> Chuck Lever <chuck.lever@oracle.com> wrote:
> 
>> Hi Andrew-
>>
>> Steve Dickson and I have independently discovered some cache 
>> invalidation problems in 2.6.18's NFS client.  Using git bisect, I was 
>> able to track it back to this commit:
>>
>>> commit 016eb4a0ed06a3677d67a584da901f0e9a63c666
>>> Author: Andrew Morton <akpm@osdl.org>
>>> Date:   Fri Sep 8 09:48:38 2006 -0700
>>>
>>>     [PATCH] invalidate_complete_page() race fix
>>>
>>>     If a CPU faults this page into pagetables after invalidate_mapping_pages()
>>>     checked page_mapped(), invalidate_complete_page() will still proceed to remove
>>>     the page from pagecache.  This leaves the page-faulting process with a
>>>     detached page.  If it was MAP_SHARED then file data loss will ensue.
>>>
>>>     Fix that up by checking the page's refcount after taking tree_lock.
>>>
>>>     Cc: Nick Piggin <nickpiggin@yahoo.com.au>
>>>     Cc: Hugh Dickins <hugh@veritas.com>
>>>     Cc: <stable@kernel.org>
>>>     Signed-off-by: Andrew Morton <akpm@osdl.org>
>>>     Signed-off-by: Linus Torvalds <torvalds@osdl.org>
>> Instrumenting get_page and put_page has shown that the page reclaim 
>> logic is temporarily bumping the page count on otherwise active but idle 
>> pages, racing with the new page_count check in invalidate_complete_page 
>> and preventing pages from being invalidated.
>>
>> One problem for the NFS client is that invalidate_inode_pages2 is being 
>> used to invalidate the pages associated with a cached directory.  If the 
>> directory pages can't be invalidated because of this race, the contents 
>> of the directory pages don't match the dentry cache, and all kinds of 
>> strange behavior results.
> 
> NFS is presently ignoring the return value from invalidate_inode_pages2(),
> in two places.  Could I suggest you fix that?  Then we'd at least not be
> seeing "strange behaviour" and things will be easier to diagnose next time.
> 
>> I haven't checked the data invalidation behavior for regular files, but 
>> the result will be file data corruption that comes and goes.
>>
>> Would it be acceptable to revert that page_count(page) != 2 check in 
>> invalidate_complete_page ?
> 
> Unfortunately not - that patch fixes cramfs failures and potential file
> corruption.
> 
> The way to keep memory reclaim away from that page is to take
> zone->lru_lock, but that's quite impractical for several reasons.
> 
> We could retry the invalidation a few times, but that stinks.
> 
> I think invalidate_inode_pages2() is sufficiently different from (ie:
> stronger than) invalidate_inode_pages() to justify the addition of a new
> invalidate_complete_page2(), which skips the page refcount check.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-09-25 21:10 ` Checking page_count(page) in invalidate_complete_page Andrew Morton
  2006-09-25 22:30   ` Chuck Lever
  2006-09-25 22:40   ` Chuck Lever
@ 2006-09-25 22:50   ` Steve Dickson
  2006-09-25 22:51   ` Nick Piggin
  2006-09-25 22:56   ` Chuck Lever
  4 siblings, 0 replies; 56+ messages in thread
From: Steve Dickson @ 2006-09-25 22:50 UTC (permalink / raw)
  To: Andrew Morton; +Cc: chuck.lever, Trond Myklebust, linux-mm

Andrew Morton wrote:

> I think invalidate_inode_pages2() is sufficiently different from (ie:
> stronger than) invalidate_inode_pages() to justify the addition of a new
> invalidate_complete_page2(), which skips the page refcount check.
Removing the page refcount does fix the problem at least the
one we are seeing with NFS not being able to flush its readdir cache..

steved.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-09-25 21:10 ` Checking page_count(page) in invalidate_complete_page Andrew Morton
                     ` (2 preceding siblings ...)
  2006-09-25 22:50   ` Steve Dickson
@ 2006-09-25 22:51   ` Nick Piggin
  2006-09-25 23:14     ` Chuck Lever
  2006-09-25 22:56   ` Chuck Lever
  4 siblings, 1 reply; 56+ messages in thread
From: Nick Piggin @ 2006-09-25 22:51 UTC (permalink / raw)
  To: Andrew Morton; +Cc: chuck.lever, Trond Myklebust, Steve Dickson, linux-mm

Andrew Morton wrote:

>>I haven't checked the data invalidation behavior for regular files, but 
>>the result will be file data corruption that comes and goes.
>>
>>Would it be acceptable to revert that page_count(page) != 2 check in 
>>invalidate_complete_page ?
>>
>
>Unfortunately not - that patch fixes cramfs failures and potential file
>corruption.
>
>The way to keep memory reclaim away from that page is to take
>zone->lru_lock, but that's quite impractical for several reasons.
>

Also, you can't guarantee anything much about its refcount even then
(because it could be on a private reclaim list or pagevec somewhere).

>We could retry the invalidation a few times, but that stinks.
>
>I think invalidate_inode_pages2() is sufficiently different from (ie:
>stronger than) invalidate_inode_pages() to justify the addition of a new
>invalidate_complete_page2(), which skips the page refcount check.
>

Yes, I think that would be possible using the lock_page in do_no_page trick.
That would also enable you to invalidate pages that have direct IO going
into them, and other weird and wonderful get_user_pages happenings.

I haven't thrown away those patches, and I am looking for a justification
for them because they make the code look nicer ;)

For 2.6.18.stable, Andrew's idea of checking the return value and retry
might be the only option.

Nick
--

Send instant messages to your online friends http://au.messenger.yahoo.com 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-09-25 22:30   ` Chuck Lever
@ 2006-09-25 22:53     ` Andrew Morton
  2006-09-25 22:57     ` Steve Dickson
  1 sibling, 0 replies; 56+ messages in thread
From: Andrew Morton @ 2006-09-25 22:53 UTC (permalink / raw)
  To: chuck.lever; +Cc: Trond Myklebust, Steve Dickson, linux-mm

On Mon, 25 Sep 2006 18:30:54 -0400
Chuck Lever <chuck.lever@oracle.com> wrote:

> It seems that the NFS client could now safely use a page cache 
> invalidator that would wait for other page users to ensure that every 
> page is invalidated properly, instead of skipping the pages that can't 
> be immediately invalidated.
> 
> In my opinion that would be the correct fix here for NFS.

OK.  But unfortunately there isn't anything to wait *on*.  It would require
a sleep-a-bit-then-retry loop, with something in there to prevent infinite
looping, and something else to handle the case where the
infinite-loop-preventor prevented an infinite loop.  Which gets rather
nasty.


And block-backed direct-io needs the stronger invalidate_inode_pages2() as
well - right now it's subtly broken, only nobody has noticed yet.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-09-25 21:10 ` Checking page_count(page) in invalidate_complete_page Andrew Morton
                     ` (3 preceding siblings ...)
  2006-09-25 22:51   ` Nick Piggin
@ 2006-09-25 22:56   ` Chuck Lever
  4 siblings, 0 replies; 56+ messages in thread
From: Chuck Lever @ 2006-09-25 22:56 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Trond Myklebust, Steve Dickson, linux-mm

Andrew Morton wrote:
> I think invalidate_inode_pages2() is sufficiently different from (ie:
> stronger than) invalidate_inode_pages() to justify the addition of a new
> invalidate_complete_page2(), which skips the page refcount check.

Almost all of invalidate_complete_page() is the same in both cases.  How 
about using a boolean switch to determine whether the page_count is 
honored or not?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-09-25 22:30   ` Chuck Lever
  2006-09-25 22:53     ` Andrew Morton
@ 2006-09-25 22:57     ` Steve Dickson
  2006-09-25 23:14       ` Nick Piggin
  1 sibling, 1 reply; 56+ messages in thread
From: Steve Dickson @ 2006-09-25 22:57 UTC (permalink / raw)
  To: chuck.lever; +Cc: Andrew Morton, Trond Myklebust, linux-mm

Chuck Lever wrote:
> 
> It seems that the NFS client could now safely use a page cache 
> invalidator that would wait for other page users to ensure that every 
> page is invalidated properly, instead of skipping the pages that can't 
> be immediately invalidated.
> 
> In my opinion that would be the correct fix here for NFS.
I would have to agree with this... in debugging this I
changed the invalidate_inode_pages2 in nfs_revalidate_mapping
to truncate_inode_pages() for non-file inode which also seem
to work... So it does beg the question as to why aren't we
waiting for page to be invalidated? Is there some type of
VM deadlock we are trying to avoid?

steved.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-09-25 22:40   ` Chuck Lever
@ 2006-09-25 23:02     ` Andrew Morton
  0 siblings, 0 replies; 56+ messages in thread
From: Andrew Morton @ 2006-09-25 23:02 UTC (permalink / raw)
  To: chuck.lever; +Cc: Trond Myklebust, Steve Dickson, linux-mm

On Mon, 25 Sep 2006 18:40:51 -0400
Chuck Lever <chuck.lever@oracle.com> wrote:

> invalidate_inode_pages2 appears to wait for locked pages, while 
> invalidate_inode_pages skips them.

That won't help here: vmscan takes a ref on the page prior
to trying to lock it.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-09-25 22:57     ` Steve Dickson
@ 2006-09-25 23:14       ` Nick Piggin
  0 siblings, 0 replies; 56+ messages in thread
From: Nick Piggin @ 2006-09-25 23:14 UTC (permalink / raw)
  To: Steve Dickson; +Cc: chuck.lever, Andrew Morton, Trond Myklebust, linux-mm

Steve Dickson wrote:

> Chuck Lever wrote:
>
>>
>> It seems that the NFS client could now safely use a page cache 
>> invalidator that would wait for other page users to ensure that every 
>> page is invalidated properly, instead of skipping the pages that 
>> can't be immediately invalidated.
>>
>> In my opinion that would be the correct fix here for NFS.
>
> I would have to agree with this... in debugging this I
> changed the invalidate_inode_pages2 in nfs_revalidate_mapping
> to truncate_inode_pages() for non-file inode which also seem
> to work... So it does beg the question as to why aren't we
> waiting for page to be invalidated? Is there some type of
> VM deadlock we are trying to avoid?


Some kind of VM race.

http://marc.theaimsgroup.com/?l=linux-mm&m=115443228617576&w=2

It turns out that Andrew's patch that check page_count fixes the same
problem: It does so by ensuring nothing will touch this page before it
is invalidated; the patch at the above url[*] does so by ensuring just
page faults will not touch the page.

[*] has some implementation bugs so don't use it.

Andrew's patch solves a couple of silent data loss / corruption issues
so there is no option to circumvent it upstream.

--

Send instant messages to your online friends http://au.messenger.yahoo.com 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-09-25 22:51   ` Nick Piggin
@ 2006-09-25 23:14     ` Chuck Lever
  2006-09-25 23:21       ` Nick Piggin
  0 siblings, 1 reply; 56+ messages in thread
From: Chuck Lever @ 2006-09-25 23:14 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, Trond Myklebust, Steve Dickson, linux-mm

Nick Piggin wrote:
> Andrew Morton wrote:
> Also, you can't guarantee anything much about its refcount even then
> (because it could be on a private reclaim list or pagevec somewhere).
> 
>> We could retry the invalidation a few times, but that stinks.
>>
>> I think invalidate_inode_pages2() is sufficiently different from (ie:
>> stronger than) invalidate_inode_pages() to justify the addition of a new
>> invalidate_complete_page2(), which skips the page refcount check.
>>
> 
> Yes, I think that would be possible using the lock_page in do_no_page 
> trick.
> That would also enable you to invalidate pages that have direct IO going
> into them, and other weird and wonderful get_user_pages happenings.
> 
> I haven't thrown away those patches, and I am looking for a justification
> for them because they make the code look nicer ;)
> 
> For 2.6.18.stable, Andrew's idea of checking the return value and retry
> might be the only option.

I think allowing callers of invalidate_inode_pages2() to get the 
previous behavior is reasonable here.  There are only 2 of them: v9fs 
and the NFS client.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-09-25 23:14     ` Chuck Lever
@ 2006-09-25 23:21       ` Nick Piggin
  2006-09-26  0:01         ` Chuck Lever
  0 siblings, 1 reply; 56+ messages in thread
From: Nick Piggin @ 2006-09-25 23:21 UTC (permalink / raw)
  To: chuck.lever; +Cc: Andrew Morton, Trond Myklebust, Steve Dickson, linux-mm

Chuck Lever wrote:

> Nick Piggin wrote:
>
>> Andrew Morton wrote:
>> Also, you can't guarantee anything much about its refcount even then
>> (because it could be on a private reclaim list or pagevec somewhere).
>>
>>> We could retry the invalidation a few times, but that stinks.
>>>
>>> I think invalidate_inode_pages2() is sufficiently different from (ie:
>>> stronger than) invalidate_inode_pages() to justify the addition of a 
>>> new
>>> invalidate_complete_page2(), which skips the page refcount check.
>>>
>>
>> Yes, I think that would be possible using the lock_page in do_no_page 
>> trick.
>> That would also enable you to invalidate pages that have direct IO going
>> into them, and other weird and wonderful get_user_pages happenings.
>>
>> I haven't thrown away those patches, and I am looking for a 
>> justification
>> for them because they make the code look nicer ;)
>>
>> For 2.6.18.stable, Andrew's idea of checking the return value and retry
>> might be the only option.
>
>
> I think allowing callers of invalidate_inode_pages2() to get the 
> previous behavior is reasonable here.  There are only 2 of them: v9fs 
> and the NFS client.



That still reintroduces the page fault race, but if the dumb 
check'n'retry is
no good then it may be OK for 2.6.18.stable, considering the page fault race
is much less common than the reclaim one. Not sure, not my call.

Upstream, it should be fixed properly without re-introducing bugs along the
way.

--

Send instant messages to your online friends http://au.messenger.yahoo.com 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-09-25 23:21       ` Nick Piggin
@ 2006-09-26  0:01         ` Chuck Lever
  2006-09-26  0:13           ` Nick Piggin
  0 siblings, 1 reply; 56+ messages in thread
From: Chuck Lever @ 2006-09-26  0:01 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, Trond Myklebust, Steve Dickson, linux-mm

Nick Piggin wrote:
> Chuck Lever wrote:
> 
>> Nick Piggin wrote:
>>
>>> Andrew Morton wrote:
>>> Also, you can't guarantee anything much about its refcount even then
>>> (because it could be on a private reclaim list or pagevec somewhere).
>>>
>>>> We could retry the invalidation a few times, but that stinks.
>>>>
>>>> I think invalidate_inode_pages2() is sufficiently different from (ie:
>>>> stronger than) invalidate_inode_pages() to justify the addition of a 
>>>> new
>>>> invalidate_complete_page2(), which skips the page refcount check.
>>>>
>>>
>>> Yes, I think that would be possible using the lock_page in do_no_page 
>>> trick.
>>> That would also enable you to invalidate pages that have direct IO going
>>> into them, and other weird and wonderful get_user_pages happenings.
>>>
>>> I haven't thrown away those patches, and I am looking for a 
>>> justification
>>> for them because they make the code look nicer ;)
>>>
>>> For 2.6.18.stable, Andrew's idea of checking the return value and retry
>>> might be the only option.
>>
>>
>> I think allowing callers of invalidate_inode_pages2() to get the 
>> previous behavior is reasonable here.  There are only 2 of them: v9fs 
>> and the NFS client.
> 
> 
> 
> That still reintroduces the page fault race, but if the dumb 
> check'n'retry is
> no good then it may be OK for 2.6.18.stable, considering the page fault 
> race
> is much less common than the reclaim one. Not sure, not my call.

The NFS client uses invalidate_inode_pages2 for files, symlinks, and 
directories.  The latter two won't have the do_no_page race since you 
can't map those types of file objects.

For NFS files, the do_no_page race does exist (at least theoretically -- 
I've never seen a report of such a problem).  Most people are not brave 
enough to use shared mapped files on NFS... so that race may be very 
rare indeed.

> Upstream, it should be fixed properly without re-introducing bugs along the
> way.

Of course... thanks for sending the history.

I'm wondering aloud here, because I'm a VM neophyte.  I'm not sure how 
common the reclaim race is in real environments.  For instance, the test 
I'm running is pretty simple, and I run it just after the client has 
rebooted.  Why is try_to_free_pages being called here?  If I knew that I 
could probably make a better guess about how common this race is.

Also, the last get_page() call is from pagevec_strip().  Why do we need 
to try to strip buffers off of a page that is guaranteed not to have any 
buffers?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-09-26  0:01         ` Chuck Lever
@ 2006-09-26  0:13           ` Nick Piggin
  2006-09-26  1:33             ` Chuck Lever
  0 siblings, 1 reply; 56+ messages in thread
From: Nick Piggin @ 2006-09-26  0:13 UTC (permalink / raw)
  To: chuck.lever; +Cc: Andrew Morton, Trond Myklebust, Steve Dickson, linux-mm

Chuck Lever wrote:

> Nick Piggin wrote:
>
>> That still reintroduces the page fault race, but if the dumb 
>> check'n'retry is
>> no good then it may be OK for 2.6.18.stable, considering the page 
>> fault race
>> is much less common than the reclaim one. Not sure, not my call.
>
>
> The NFS client uses invalidate_inode_pages2 for files, symlinks, and 
> directories.  The latter two won't have the do_no_page race since you 
> can't map those types of file objects.


But they're present on the LRU? That's unusual (I guess NFS doesn't have 
a buffer cache for a backing
block device).

> For NFS files, the do_no_page race does exist (at least theoretically 
> -- I've never seen a report of such a problem).  Most people are not 
> brave enough to use shared mapped files on NFS... so that race may be 
> very rare indeed.


I think the race is rare, but we have had at least one customer see it. 
The problem with mainline kernel
is that there will be no indication from the kernel, and it will be hard 
enough to reproduce that it will
probably be considered as a hardware bug or a userspace bug if someone 
hits it.

>> Upstream, it should be fixed properly without re-introducing bugs 
>> along the
>> way.
>
>
> Of course... thanks for sending the history.
>
> I'm wondering aloud here, because I'm a VM neophyte.  I'm not sure how 
> common the reclaim race is in real environments.  For instance, the 
> test I'm running is pretty simple, and I run it just after the client 
> has rebooted.  Why is try_to_free_pages being called here?  If I knew 
> that I could probably make a better guess about how common this race is.


It will be called because you're low on memory somewhere. Reclaim is 
common (usually to free up old pagecache).

>
> Also, the last get_page() call is from pagevec_strip().  Why do we 
> need to try to strip buffers off of a page that is guaranteed not to 
> have any buffers?


I don't see where pagevec_strip calls get_page()?

--

Send instant messages to your online friends http://au.messenger.yahoo.com 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-09-26  0:13           ` Nick Piggin
@ 2006-09-26  1:33             ` Chuck Lever
  2006-09-26  1:48               ` Nick Piggin
  2006-09-26  6:25               ` Nick Piggin
  0 siblings, 2 replies; 56+ messages in thread
From: Chuck Lever @ 2006-09-26  1:33 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, Trond Myklebust, Steve Dickson, linux-mm

Nick Piggin wrote:
> Chuck Lever wrote:
>> Nick Piggin wrote:
>>> That still reintroduces the page fault race, but if the dumb 
>>> check'n'retry is
>>> no good then it may be OK for 2.6.18.stable, considering the page 
>>> fault race
>>> is much less common than the reclaim one. Not sure, not my call.
>>
>>
>> The NFS client uses invalidate_inode_pages2 for files, symlinks, and 
>> directories.  The latter two won't have the do_no_page race since you 
>> can't map those types of file objects.
> 
> 
> But they're present on the LRU? That's unusual (I guess NFS doesn't have 
> a buffer cache for a backing
> block device).

That is correct -- NFS doesn't use the buffer cache.

>> Also, the last get_page() call is from pagevec_strip().  Why do we 
>> need to try to strip buffers off of a page that is guaranteed not to 
>> have any buffers?
> 
> 
> I don't see where pagevec_strip calls get_page()?

Ah, you're right, I was off by one symbol.  The get_page call is in 
pagevec_lookup() (via find_get_pages).  That could mean there isn't a 
reclaim race at all.

The page references I'm noting are:

1.  add_to_page_cache_lru  1 -> 2

2.  do_generic_mapping_read, released in nfs_do_filldir

3.  read_cache_page, released in nfs_readdir

4.  pagevec_lookup (from invalidate_inode_pages2_range)  2 -> 3

The odd thing is some of the pages for the directory are released, so 
they must have a ref count of 2 *after* the pagevec_lookup.  Looks like 
more investigation is needed.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-09-26  1:33             ` Chuck Lever
@ 2006-09-26  1:48               ` Nick Piggin
  2006-09-28 16:26                 ` Chuck Lever
  2006-09-26  6:25               ` Nick Piggin
  1 sibling, 1 reply; 56+ messages in thread
From: Nick Piggin @ 2006-09-26  1:48 UTC (permalink / raw)
  To: chuck.lever; +Cc: Andrew Morton, Trond Myklebust, Steve Dickson, linux-mm

Chuck Lever wrote:

> Nick Piggin wrote:
>
>> I don't see where pagevec_strip calls get_page()?
>
>
> Ah, you're right, I was off by one symbol.  The get_page call is in 
> pagevec_lookup() (via find_get_pages).  That could mean there isn't a 
> reclaim race at all.


There will be transient (ie. race) conditions where the page count is 
elevated for whatever
reason. Reclaim is one of those, so this failure you're seeing will need 
to be fixed on the
NFS / invalidate side somehow.

>
> The page references I'm noting are:
>
> 1.  add_to_page_cache_lru  1 -> 2
>
> 2.  do_generic_mapping_read, released in nfs_do_filldir
>
> 3.  read_cache_page, released in nfs_readdir
>
> 4.  pagevec_lookup (from invalidate_inode_pages2_range)  2 -> 3
>
> The odd thing is some of the pages for the directory are released, so 
> they must have a ref count of 2 *after* the pagevec_lookup.  Looks 
> like more investigation is needed.


Well they could get flushed out of the lru add pagevecs which drops 
their count. Just to
test that theory you could run lru_add_drain_all() at the start of the 
invalidate function.
That's not going to actually fix the problem, but it might help 2.6.18 
limp along when
combined with some other work.

--

Send instant messages to your online friends http://au.messenger.yahoo.com 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-09-26  1:33             ` Chuck Lever
  2006-09-26  1:48               ` Nick Piggin
@ 2006-09-26  6:25               ` Nick Piggin
  2006-09-26 13:12                 ` Chuck Lever
  1 sibling, 1 reply; 56+ messages in thread
From: Nick Piggin @ 2006-09-26  6:25 UTC (permalink / raw)
  To: chuck.lever; +Cc: Andrew Morton, Trond Myklebust, Steve Dickson, linux-mm

Chuck Lever wrote:

> Nick Piggin wrote:
>
>> But they're present on the LRU? That's unusual (I guess NFS doesn't 
>> have a buffer cache for a backing
>> block device).
>
>
> That is correct -- NFS doesn't use the buffer cache.



So that raises another question: how do they get to invalidate_inode_pages2
if they are not part of the buffer or pagecache?
--

Send instant messages to your online friends http://au.messenger.yahoo.com 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-09-26  6:25               ` Nick Piggin
@ 2006-09-26 13:12                 ` Chuck Lever
  2006-09-27  4:47                   ` Nick Piggin
  0 siblings, 1 reply; 56+ messages in thread
From: Chuck Lever @ 2006-09-26 13:12 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, Trond Myklebust, Steve Dickson, linux-mm

Nick Piggin wrote:
> Chuck Lever wrote:
> 
>> Nick Piggin wrote:
>>
>>> But they're present on the LRU? That's unusual (I guess NFS doesn't 
>>> have a buffer cache for a backing
>>> block device).
>>
>>
>> That is correct -- NFS doesn't use the buffer cache.
> 
> 
> So that raises another question: how do they get to invalidate_inode_pages2
> if they are not part of the buffer or pagecache?

It does use the page cache to cache data pages for files, directories, 
and symlinks.  It does not use buffers, however, since incoming file 
system data is read from a socket, not from a block device.  I believe 
the client provides a dummy backing device for the few things in the VFS 
that require it.

Invalidate_inode_pages2() is used to remove page cache data that the 
client has determined is stale.  The client detects that the file has 
changed on the server, and it is not responsible for those changes, by 
examining the file's attributes and noticing mtime or size changes. 
When such a change is detected, all pages cached for a file are 
invalidated, and the page cache is gradually repopulated from the server 
as applications access parts of the file.

I'd like to understand the difference between invalidate_inode_pages2() 
and truncate_inode_pages() in this scenario.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-09-26 13:12                 ` Chuck Lever
@ 2006-09-27  4:47                   ` Nick Piggin
  2006-09-27  8:25                     ` Andrew Morton
  2006-09-27 15:54                     ` Chuck Lever
  0 siblings, 2 replies; 56+ messages in thread
From: Nick Piggin @ 2006-09-27  4:47 UTC (permalink / raw)
  To: chuck.lever; +Cc: Andrew Morton, Trond Myklebust, Steve Dickson, linux-mm

Chuck Lever wrote:

> Nick Piggin wrote:
>
>> So that raises another question: how do they get to 
>> invalidate_inode_pages2
>> if they are not part of the buffer or pagecache?
>
>
> It does use the page cache to cache data pages for files, directories, 
> and symlinks.  It does not use buffers, however, since incoming file 
> system data is read from a socket, not from a block device.  I believe 
> the client provides a dummy backing device for the few things in the 
> VFS that require it.


OK, it uses the directory's inode's pagecache rather than the block's 
buffer cache like AFAIK many other
filesystems do. That's OK, I like that model better anyway ;)

> Invalidate_inode_pages2() is used to remove page cache data that the 
> client has determined is stale.  The client detects that the file has 
> changed on the server, and it is not responsible for those changes, by 
> examining the file's attributes and noticing mtime or size changes. 
> When such a change is detected, all pages cached for a file are 
> invalidated, and the page cache is gradually repopulated from the 
> server as applications access parts of the file.
>
> I'd like to understand the difference between 
> invalidate_inode_pages2() and truncate_inode_pages() in this scenario.


truncate_inode_pages will a) throw away everything including dirty 
pages, and b) probably be fairly
racy unless the inode's i_size (for normal files) is modified.

We really want to make an invalidation that works properly for you.

If you can guarantee that a pagecache page can never get mapped to a 
user mapping (eg. perhaps for
directories and symlinks) and also ensure that you don't dirty it via 
the filesystem, then you don't
have to worry about it becoming dirty, so we can skip the checks Andrew 
has added and maybe add a
WARN_ON(PageDirty()).

Now that won't help you for regular file pages that can be mmapped. For 
those, we need to ensure
that the page isn't mapped, and the page will not be dirtied via a 
get_user_pages user, and you need
to ensure that it can't get dirtied via the filesystem. The former two 
will require VM changes of
the scale that aren't going to get into 2.6.19, but I'm working on them.

For now, can make do with flushing lru pagevecs, and also testing and 
retrying in the caller?

--

Send instant messages to your online friends http://au.messenger.yahoo.com 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-09-27  4:47                   ` Nick Piggin
@ 2006-09-27  8:25                     ` Andrew Morton
  2006-09-27  8:39                       ` Nick Piggin
  2006-09-27 15:54                     ` Chuck Lever
  1 sibling, 1 reply; 56+ messages in thread
From: Andrew Morton @ 2006-09-27  8:25 UTC (permalink / raw)
  To: Nick Piggin; +Cc: chuck.lever, Trond Myklebust, Steve Dickson, linux-mm

On Wed, 27 Sep 2006 14:47:26 +1000
Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> truncate_inode_pages will a) throw away everything including dirty 
> pages, and b) probably be fairly
> racy unless the inode's i_size (for normal files) is modified.
> 
> We really want to make an invalidation that works properly for you.
> 
> If you can guarantee that a pagecache page can never get mapped to a 
> user mapping (eg. perhaps for
> directories and symlinks) and also ensure that you don't dirty it via 
> the filesystem, then you don't
> have to worry about it becoming dirty, so we can skip the checks Andrew 
> has added and maybe add a
> WARN_ON(PageDirty()).

None of that is true for when invalidate_inode_pages2() is used by
block-backed direct-io.

We should fix it for that application..

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-09-27  8:25                     ` Andrew Morton
@ 2006-09-27  8:39                       ` Nick Piggin
  2006-09-27 16:03                         ` Andrew Morton
  0 siblings, 1 reply; 56+ messages in thread
From: Nick Piggin @ 2006-09-27  8:39 UTC (permalink / raw)
  To: Andrew Morton; +Cc: chuck.lever, Trond Myklebust, Steve Dickson, linux-mm

Andrew Morton wrote:

>On Wed, 27 Sep 2006 14:47:26 +1000
>Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
>
>>truncate_inode_pages will a) throw away everything including dirty 
>>pages, and b) probably be fairly
>>racy unless the inode's i_size (for normal files) is modified.
>>
>>We really want to make an invalidation that works properly for you.
>>
>>If you can guarantee that a pagecache page can never get mapped to a 
>>user mapping (eg. perhaps for
>>directories and symlinks) and also ensure that you don't dirty it via 
>>the filesystem, then you don't
>>have to worry about it becoming dirty, so we can skip the checks Andrew 
>>has added and maybe add a
>>WARN_ON(PageDirty()).
>>
>
>None of that is true for when invalidate_inode_pages2() is used by
>block-backed direct-io.
>

Sure, you wouldn't be able to skip those checks for direct IO, just for the
NFS usage.

>We should fix it for that application..
>

I thought direct IO fell back to buffered IO when the teardown failed. I
was surprised that it didn't and gave me errors when I was testing. Do
you see a better way to handle the direct IO case?

--

Send instant messages to your online friends http://au.messenger.yahoo.com 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-09-27  4:47                   ` Nick Piggin
  2006-09-27  8:25                     ` Andrew Morton
@ 2006-09-27 15:54                     ` Chuck Lever
  1 sibling, 0 replies; 56+ messages in thread
From: Chuck Lever @ 2006-09-27 15:54 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, Trond Myklebust, Steve Dickson, linux-mm

Nick Piggin wrote:
> Chuck Lever wrote:
>> Invalidate_inode_pages2() is used to remove page cache data that the 
>> client has determined is stale.  The client detects that the file has 
>> changed on the server, and it is not responsible for those changes, by 
>> examining the file's attributes and noticing mtime or size changes. 
>> When such a change is detected, all pages cached for a file are 
>> invalidated, and the page cache is gradually repopulated from the 
>> server as applications access parts of the file.
>>
> 
> If you can guarantee that a pagecache page can never get mapped to a 
> user mapping (eg. perhaps for
> directories and symlinks) and also ensure that you don't dirty it via 
> the filesystem, then you don't
> have to worry about it becoming dirty, so we can skip the checks Andrew 
> has added and maybe add a
> WARN_ON(PageDirty()).

As far as I can tell, for cached non-regular files, they can't be mapped 
or written to, and cannot be targets of direct I/O.

So, I think it's acceptable to add logic in nfs_revalidate_mapping() to 
use invalidate_inode_pages2() for non-regular files (and check the 
return code, of course), as long as we have addressed the page count 
issue somehow.

> Now that won't help you for regular file pages that can be mmapped. For 
> those, we need to ensure
> that the page isn't mapped, and the page will not be dirtied via a 
> get_user_pages user, and you need
> to ensure that it can't get dirtied via the filesystem. The former two 
> will require VM changes of
> the scale that aren't going to get into 2.6.19, but I'm working on them.
> 
> For now, can make do with flushing lru pagevecs, and also testing and 
> retrying in the caller?

I planned to play with this a bit yesterday, but got swept up in some 
higher priority issues.  Will try to get to testing this soonest.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-09-27  8:39                       ` Nick Piggin
@ 2006-09-27 16:03                         ` Andrew Morton
  0 siblings, 0 replies; 56+ messages in thread
From: Andrew Morton @ 2006-09-27 16:03 UTC (permalink / raw)
  To: Nick Piggin; +Cc: chuck.lever, Trond Myklebust, Steve Dickson, linux-mm

On Wed, 27 Sep 2006 18:39:19 +1000
Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> Andrew Morton wrote:
> 
> >On Wed, 27 Sep 2006 14:47:26 +1000
> >Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> >
> >
> >>truncate_inode_pages will a) throw away everything including dirty 
> >>pages, and b) probably be fairly
> >>racy unless the inode's i_size (for normal files) is modified.
> >>
> >>We really want to make an invalidation that works properly for you.
> >>
> >>If you can guarantee that a pagecache page can never get mapped to a 
> >>user mapping (eg. perhaps for
> >>directories and symlinks) and also ensure that you don't dirty it via 
> >>the filesystem, then you don't
> >>have to worry about it becoming dirty, so we can skip the checks Andrew 
> >>has added and maybe add a
> >>WARN_ON(PageDirty()).
> >>
> >
> >None of that is true for when invalidate_inode_pages2() is used by
> >block-backed direct-io.
> >
> 
> Sure, you wouldn't be able to skip those checks for direct IO, just for the
> NFS usage.
> 
> >We should fix it for that application..
> >
> 
> I thought direct IO fell back to buffered IO when the teardown failed.

See mm/filemap.c:generic_file_direct_IO().

> I
> was surprised that it didn't and gave me errors when I was testing. Do
> you see a better way to handle the direct IO case?
> 

Ignore page_count() in invalidate_inode_pages2()?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-09-26  1:48               ` Nick Piggin
@ 2006-09-28 16:26                 ` Chuck Lever
  2006-09-28 16:36                   ` Andrew Morton
  0 siblings, 1 reply; 56+ messages in thread
From: Chuck Lever @ 2006-09-28 16:26 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, Trond Myklebust, Steve Dickson, linux-mm

Hi Nick-

Nick Piggin wrote:
>> The page references I'm noting are:
>>
>> 1.  add_to_page_cache_lru  1 -> 2
>>
>> 2.  do_generic_mapping_read, released in nfs_do_filldir
>>
>> 3.  read_cache_page, released in nfs_readdir
>>
>> 4.  pagevec_lookup (from invalidate_inode_pages2_range)  2 -> 3
>>
>> The odd thing is some of the pages for the directory are released, so 
>> they must have a ref count of 2 *after* the pagevec_lookup.  Looks 
>> like more investigation is needed.
> 
> 
> Well they could get flushed out of the lru add pagevecs which drops 
> their count. Just to
> test that theory you could run lru_add_drain_all() at the start of the 
> invalidate function.
> That's not going to actually fix the problem, but it might help 2.6.18 
> limp along when
> combined with some other work.

I've verified that the pages that fail to be invalidated are indeed 
languishing in the per-CPU lru-add pagevec.  Adding a call to 
lru_add_drain_all() to the invalidate_inode_pages2() path does fix the 
NFS readdir problem.

I think a call to lru_add_drain_all() belongs in both the 
invalidate_inode_pages() and the invalidate_inode_pages2() path.  Do you 
agree?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-09-28 16:26                 ` Chuck Lever
@ 2006-09-28 16:36                   ` Andrew Morton
  2006-09-28 16:40                     ` Andrew Morton
  2006-09-28 16:41                     ` Chuck Lever
  0 siblings, 2 replies; 56+ messages in thread
From: Andrew Morton @ 2006-09-28 16:36 UTC (permalink / raw)
  To: chuck.lever; +Cc: Nick Piggin, Trond Myklebust, Steve Dickson, linux-mm

On Thu, 28 Sep 2006 12:26:36 -0400
Chuck Lever <chuck.lever@oracle.com> wrote:

> Hi Nick-
> 
> Nick Piggin wrote:
> >> The page references I'm noting are:
> >>
> >> 1.  add_to_page_cache_lru  1 -> 2
> >>
> >> 2.  do_generic_mapping_read, released in nfs_do_filldir
> >>
> >> 3.  read_cache_page, released in nfs_readdir
> >>
> >> 4.  pagevec_lookup (from invalidate_inode_pages2_range)  2 -> 3
> >>
> >> The odd thing is some of the pages for the directory are released, so 
> >> they must have a ref count of 2 *after* the pagevec_lookup.  Looks 
> >> like more investigation is needed.
> > 
> > 
> > Well they could get flushed out of the lru add pagevecs which drops 
> > their count. Just to
> > test that theory you could run lru_add_drain_all() at the start of the 
> > invalidate function.
> > That's not going to actually fix the problem, but it might help 2.6.18 
> > limp along when
> > combined with some other work.
> 
> I've verified that the pages that fail to be invalidated are indeed 
> languishing in the per-CPU lru-add pagevec.  Adding a call to 
> lru_add_drain_all() to the invalidate_inode_pages2() path does fix the 
> NFS readdir problem.
> 
> I think a call to lru_add_drain_all() belongs in both the 
> invalidate_inode_pages() and the invalidate_inode_pages2() path.  Do you 
> agree?

Yes.  But the page-pinned-by-vmscan scenario which you earlier identified
can still happen, I believe.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-09-28 16:36                   ` Andrew Morton
@ 2006-09-28 16:40                     ` Andrew Morton
  2006-09-28 16:42                       ` Chuck Lever
  2006-09-28 16:41                     ` Chuck Lever
  1 sibling, 1 reply; 56+ messages in thread
From: Andrew Morton @ 2006-09-28 16:40 UTC (permalink / raw)
  To: chuck.lever, Nick Piggin, Trond Myklebust, Steve Dickson, linux-mm

On Thu, 28 Sep 2006 09:36:40 -0700
Andrew Morton <akpm@osdl.org> wrote:

> > I think a call to lru_add_drain_all() belongs in both the 
> > invalidate_inode_pages() and the invalidate_inode_pages2() path.  Do you 
> > agree?
> 
> Yes.

Or maybe not.  lru_add_drain() will only drain the local CPU's buffer.  If
the page is sitting in another CPU's buffer, the same problem will occur.

IOW, you got lucky.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-09-28 16:36                   ` Andrew Morton
  2006-09-28 16:40                     ` Andrew Morton
@ 2006-09-28 16:41                     ` Chuck Lever
  1 sibling, 0 replies; 56+ messages in thread
From: Chuck Lever @ 2006-09-28 16:41 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Nick Piggin, Trond Myklebust, Steve Dickson, linux-mm

Andrew Morton wrote:
> On Thu, 28 Sep 2006 12:26:36 -0400
> Chuck Lever <chuck.lever@oracle.com> wrote:
> 
>> Hi Nick-
>>
>> Nick Piggin wrote:
>>>> The page references I'm noting are:
>>>>
>>>> 1.  add_to_page_cache_lru  1 -> 2
>>>>
>>>> 2.  do_generic_mapping_read, released in nfs_do_filldir
>>>>
>>>> 3.  read_cache_page, released in nfs_readdir
>>>>
>>>> 4.  pagevec_lookup (from invalidate_inode_pages2_range)  2 -> 3
>>>>
>>>> The odd thing is some of the pages for the directory are released, so 
>>>> they must have a ref count of 2 *after* the pagevec_lookup.  Looks 
>>>> like more investigation is needed.
>>>
>>> Well they could get flushed out of the lru add pagevecs which drops 
>>> their count. Just to
>>> test that theory you could run lru_add_drain_all() at the start of the 
>>> invalidate function.
>>> That's not going to actually fix the problem, but it might help 2.6.18 
>>> limp along when
>>> combined with some other work.
>> I've verified that the pages that fail to be invalidated are indeed 
>> languishing in the per-CPU lru-add pagevec.  Adding a call to 
>> lru_add_drain_all() to the invalidate_inode_pages2() path does fix the 
>> NFS readdir problem.
>>
>> I think a call to lru_add_drain_all() belongs in both the 
>> invalidate_inode_pages() and the invalidate_inode_pages2() path.  Do you 
>> agree?
> 
> Yes.  But the page-pinned-by-vmscan scenario which you earlier identified
> can still happen, I believe.

Yes it can.  From what I gather from the conversation between you and 
Nick, that is a much harder problem to solve.  Given the debugging I've 
done, I don't think I was hitting that case, even though I still believe 
it is possible in the current code.

In this case, however, it seems like the non-empty LRU pagevecs will be 
a pretty common show-stopper for invalidate_inode_pages.  It will hit 
NFS directories and files quite a bit I would expect because the average 
size of these objects is only a few pages.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-09-28 16:40                     ` Andrew Morton
@ 2006-09-28 16:42                       ` Chuck Lever
  2006-09-28 17:03                         ` Andrew Morton
  0 siblings, 1 reply; 56+ messages in thread
From: Chuck Lever @ 2006-09-28 16:42 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Nick Piggin, Trond Myklebust, Steve Dickson, linux-mm

Andrew Morton wrote:
> On Thu, 28 Sep 2006 09:36:40 -0700
> Andrew Morton <akpm@osdl.org> wrote:
> 
>>> I think a call to lru_add_drain_all() belongs in both the 
>>> invalidate_inode_pages() and the invalidate_inode_pages2() path.  Do you 
>>> agree?
>> Yes.
> 
> Or maybe not.  lru_add_drain() will only drain the local CPU's buffer.  If
> the page is sitting in another CPU's buffer, the same problem will occur.
> 
> IOW, you got lucky.

I used lru_add_drain_all(), so it hit all the per-CPU pagevecs.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-09-28 16:42                       ` Chuck Lever
@ 2006-09-28 17:03                         ` Andrew Morton
  2006-09-28 17:09                           ` Chuck Lever
  0 siblings, 1 reply; 56+ messages in thread
From: Andrew Morton @ 2006-09-28 17:03 UTC (permalink / raw)
  To: chuck.lever; +Cc: Nick Piggin, Trond Myklebust, Steve Dickson, linux-mm

On Thu, 28 Sep 2006 12:42:44 -0400
Chuck Lever <chuck.lever@oracle.com> wrote:

> Andrew Morton wrote:
> > On Thu, 28 Sep 2006 09:36:40 -0700
> > Andrew Morton <akpm@osdl.org> wrote:
> > 
> >>> I think a call to lru_add_drain_all() belongs in both the 
> >>> invalidate_inode_pages() and the invalidate_inode_pages2() path.  Do you 
> >>> agree?
> >> Yes.
> > 
> > Or maybe not.  lru_add_drain() will only drain the local CPU's buffer.  If
> > the page is sitting in another CPU's buffer, the same problem will occur.
> > 
> > IOW, you got lucky.
> 
> I used lru_add_drain_all(), so it hit all the per-CPU pagevecs.

lru_add_drain_all() is a nasty, hacky, not-exported-to-modules thing.  It
equates to lru_add_drain() if !CONFIG_NUMA.

Sigh, we're not getting there, are we?

I'm still thinking we add invalidate_complete_page2() to get us out of
trouble and park the problem :(.  That'd be a good approach for 2.6.18.x,
which I assume is fairly urgent.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-09-28 17:03                         ` Andrew Morton
@ 2006-09-28 17:09                           ` Chuck Lever
  2006-09-29  0:37                             ` Nick Piggin
  0 siblings, 1 reply; 56+ messages in thread
From: Chuck Lever @ 2006-09-28 17:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Nick Piggin, Trond Myklebust, Steve Dickson, linux-mm

Andrew Morton wrote:
> On Thu, 28 Sep 2006 12:42:44 -0400
> Chuck Lever <chuck.lever@oracle.com> wrote:
> 
>> Andrew Morton wrote:
>>> On Thu, 28 Sep 2006 09:36:40 -0700
>>> Andrew Morton <akpm@osdl.org> wrote:
>>>
>>>>> I think a call to lru_add_drain_all() belongs in both the 
>>>>> invalidate_inode_pages() and the invalidate_inode_pages2() path.  Do you 
>>>>> agree?
>>>> Yes.
>>> Or maybe not.  lru_add_drain() will only drain the local CPU's buffer.  If
>>> the page is sitting in another CPU's buffer, the same problem will occur.
>>>
>>> IOW, you got lucky.
>> I used lru_add_drain_all(), so it hit all the per-CPU pagevecs.
> 
> lru_add_drain_all() is a nasty, hacky, not-exported-to-modules thing.  It
> equates to lru_add_drain() if !CONFIG_NUMA.
> 
> Sigh, we're not getting there, are we?
> 
> I'm still thinking we add invalidate_complete_page2() to get us out of
> trouble and park the problem :(.  That'd be a good approach for 2.6.18.x,
> which I assume is fairly urgent.

Choosing which fix to include is above my pay grade.  Both of these 
proposals address the NFS readdir cache invalidation problem.

But it seems like there is a real problem here -- the pages that are 
waiting to be added the LRU will always have a page count that is too 
high for invalidate_inode_pages to work on them.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-09-28 17:09                           ` Chuck Lever
@ 2006-09-29  0:37                             ` Nick Piggin
  2006-09-29 20:34                               ` Chuck Lever
  0 siblings, 1 reply; 56+ messages in thread
From: Nick Piggin @ 2006-09-29  0:37 UTC (permalink / raw)
  To: chuck.lever; +Cc: Andrew Morton, Trond Myklebust, Steve Dickson, linux-mm

Chuck Lever wrote:

> Andrew Morton wrote:
>
>> lru_add_drain_all() is a nasty, hacky, not-exported-to-modules 
>> thing.  It
>> equates to lru_add_drain() if !CONFIG_NUMA.
>

It should drain on all CPUs though, I can't remember why it doesn't.
Not that I disagree that throwing IPIs around is a hack ;)

>>
>> Sigh, we're not getting there, are we?
>>
>> I'm still thinking we add invalidate_complete_page2() to get us out of
>> trouble and park the problem :(.  That'd be a good approach for 
>> 2.6.18.x,
>> which I assume is fairly urgent.
>
>
> Choosing which fix to include is above my pay grade.  Both of these 
> proposals address the NFS readdir cache invalidation problem.
>
> But it seems like there is a real problem here -- the pages that are 
> waiting to be added the LRU will always have a page count that is too 
> high for invalidate_inode_pages to work on them.


If you do the lru_add_drain_all, then the vmscan problem should be probably
mostly fixable by detecting failure, waiting, and retrying a few times.

After that, making an invalidate_complete_page2 ignore the page count or
dirty status would only save you from a very small number of cases, and they
would be likely to be a data loss / corruption case.

OTOH, we haven't had many complains before, so for 2.6.18, an
invalidate_complete_page2 may indeed be the best option?

--

Send instant messages to your online friends http://au.messenger.yahoo.com 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-09-29  0:37                             ` Nick Piggin
@ 2006-09-29 20:34                               ` Chuck Lever
  2006-09-29 20:45                                 ` Peter Zijlstra
  0 siblings, 1 reply; 56+ messages in thread
From: Chuck Lever @ 2006-09-29 20:34 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, Trond Myklebust, Steve Dickson, linux-mm

Nick Piggin wrote:
> Chuck Lever wrote:
> 
>> Andrew Morton wrote:
>>
>>> lru_add_drain_all() is a nasty, hacky, not-exported-to-modules 
>>> thing.  It
>>> equates to lru_add_drain() if !CONFIG_NUMA.
>>
> 
> It should drain on all CPUs though, I can't remember why it doesn't.
> Not that I disagree that throwing IPIs around is a hack ;)
 >
>>> Sigh, we're not getting there, are we?
>>>
>>> I'm still thinking we add invalidate_complete_page2() to get us out of
>>> trouble and park the problem :(.  That'd be a good approach for 
>>> 2.6.18.x,
>>> which I assume is fairly urgent.
>>
>>
>> Choosing which fix to include is above my pay grade.  Both of these 
>> proposals address the NFS readdir cache invalidation problem.
>>
>> But it seems like there is a real problem here -- the pages that are 
>> waiting to be added the LRU will always have a page count that is too 
>> high for invalidate_inode_pages to work on them.

One option that hasn't been entertained is to remove the "batched LRU
add" logic all together.  Just gut lru_cache_add -- it should send
pages immediately to be added to the LRU list.  This is a bit slower, 
but it fixes the invalidation problems, and makes the icky 
lru_add_drain_all() a no-op.

Note that lru_add_drain_all() would have to be called much more often if 
it is called from the inode invalidation logic... So, which is more 
overhead, adding a page to the LRU at a time, or draining the per-CPU 
LRU pagevecs all the time?

> If you do the lru_add_drain_all, then the vmscan problem should be probably
> mostly fixable by detecting failure, waiting, and retrying a few times.

Why shouldn't invalidate_complete_pages() do that?

> After that, making an invalidate_complete_page2 ignore the page count or
> dirty status would only save you from a very small number of cases, and 
> they
> would be likely to be a data loss / corruption case.
> 
> OTOH, we haven't had many complaints before, so for 2.6.18, an
> invalidate_complete_page2 may indeed be the best option?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-09-29 20:34                               ` Chuck Lever
@ 2006-09-29 20:45                                 ` Peter Zijlstra
  2006-09-29 21:02                                   ` Chuck Lever
  0 siblings, 1 reply; 56+ messages in thread
From: Peter Zijlstra @ 2006-09-29 20:45 UTC (permalink / raw)
  To: chuck.lever
  Cc: Nick Piggin, Andrew Morton, Trond Myklebust, Steve Dickson, linux-mm

On Fri, 2006-09-29 at 16:34 -0400, Chuck Lever wrote:

> One option that hasn't been entertained is to remove the "batched LRU
> add" logic all together.  Just gut lru_cache_add -- it should send
> pages immediately to be added to the LRU list.  This is a bit slower, 
> but it fixes the invalidation problems, and makes the icky 
> lru_add_drain_all() a no-op.

That would bring some larger machines to their knees contending on
zone->lock.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-09-29 20:45                                 ` Peter Zijlstra
@ 2006-09-29 21:02                                   ` Chuck Lever
  2006-09-29 21:17                                     ` Peter Zijlstra
  0 siblings, 1 reply; 56+ messages in thread
From: Chuck Lever @ 2006-09-29 21:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nick Piggin, Andrew Morton, Trond Myklebust, Steve Dickson, linux-mm

Peter Zijlstra wrote:
> On Fri, 2006-09-29 at 16:34 -0400, Chuck Lever wrote:
> 
>> One option that hasn't been entertained is to remove the "batched LRU
>> add" logic all together.  Just gut lru_cache_add -- it should send
>> pages immediately to be added to the LRU list.  This is a bit slower, 
>> but it fixes the invalidation problems, and makes the icky 
>> lru_add_drain_all() a no-op.
> 
> That would bring some larger machines to their knees contending on
> zone->lock.
> 

Versus calling lru_add_drain_all() every time a file system tries to 
invalidate an inode's cached pages?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-09-29 21:02                                   ` Chuck Lever
@ 2006-09-29 21:17                                     ` Peter Zijlstra
  2006-09-29 21:44                                       ` Andrew Morton
  0 siblings, 1 reply; 56+ messages in thread
From: Peter Zijlstra @ 2006-09-29 21:17 UTC (permalink / raw)
  To: chuck.lever
  Cc: Nick Piggin, Andrew Morton, Trond Myklebust, Steve Dickson, linux-mm

On Fri, 2006-09-29 at 17:02 -0400, Chuck Lever wrote:
> Peter Zijlstra wrote:
> > On Fri, 2006-09-29 at 16:34 -0400, Chuck Lever wrote:
> > 
> >> One option that hasn't been entertained is to remove the "batched LRU
> >> add" logic all together.  Just gut lru_cache_add -- it should send
> >> pages immediately to be added to the LRU list.  This is a bit slower, 
> >> but it fixes the invalidation problems, and makes the icky 
> >> lru_add_drain_all() a no-op.
> > 
> > That would bring some larger machines to their knees contending on
> > zone->lock.
> > 
> 
> Versus calling lru_add_drain_all() every time a file system tries to 
> invalidate an inode's cached pages?

Surely not all filesystems are that broken? From what I gather from this
discussion its only NFS.

But even then, you can run quite a few workloads without invalidating
inodes often.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-09-29 21:17                                     ` Peter Zijlstra
@ 2006-09-29 21:44                                       ` Andrew Morton
  2006-09-29 21:48                                         ` Chuck Lever
  0 siblings, 1 reply; 56+ messages in thread
From: Andrew Morton @ 2006-09-29 21:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: chuck.lever, Nick Piggin, Trond Myklebust, Steve Dickson, linux-mm

buggerit, let's do this.  It'll fix NFS, yes?



From: Andrew Morton <akpm@osdl.org>

The recent fix to invalidate_inode_pages() (git commit 016eb4a) managed to
unfix invalidate_inode_pages2().

The problem is that various bits of code in the kernel can take transient refs
on pages: the page scanner will do this when inspecting a batch of pages, and
the lru_cache_add() batching pagevecs also hold a ref.

Net result is transient failures in invalidate_inode_pages2().  This affects
NFS directory invalidation (observed) and presumably also block-backed
direct-io (not yet reported).

Fix it by reverting invalidate_inode_pages2() back to the old version which
ignores the page refcounts.

We may come up with something more clever later, but for now we need a 2.6.18
fix for NFS.

Cc: Chuck Lever <cel@citi.umich.edu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
---

 mm/truncate.c |   34 ++++++++++++++++++++++++++++++++--
 1 files changed, 32 insertions(+), 2 deletions(-)

diff -puN mm/truncate.c~invalidate_inode_pages2-ignore-page-refcounts mm/truncate.c
--- a/mm/truncate.c~invalidate_inode_pages2-ignore-page-refcounts
+++ a/mm/truncate.c
@@ -261,9 +261,39 @@ unsigned long invalidate_inode_pages(str
 {
 	return invalidate_mapping_pages(mapping, 0, ~0UL);
 }
-
 EXPORT_SYMBOL(invalidate_inode_pages);
 
+/*
+ * This is like invalidate_complete_page(), except it ignores the page's
+ * refcount.  We do this because invalidate_indoe_pages2() needs stronger
+ * invalidation guarantees, and cannot afford to leave pages behind because
+ * shrink_list() has a temp ref on them, or because they're transiently sitting
+ * in the lru_cache_add() pagevecs.
+ */
+static int
+invalidate_complete_page2(struct address_space *mapping, struct page *page)
+{
+	if (page->mapping != mapping)
+		return 0;
+
+	if (PagePrivate(page) && !try_to_release_page(page, 0))
+		return 0;
+
+	write_lock_irq(&mapping->tree_lock);
+	if (PageDirty(page))
+		goto failed;
+
+	BUG_ON(PagePrivate(page));
+	__remove_from_page_cache(page);
+	write_unlock_irq(&mapping->tree_lock);
+	ClearPageUptodate(page);
+	page_cache_release(page);	/* pagecache ref */
+	return 1;
+failed:
+	write_unlock_irq(&mapping->tree_lock);
+	return 0;
+}
+
 /**
  * invalidate_inode_pages2_range - remove range of pages from an address_space
  * @mapping: the address_space
@@ -330,7 +360,7 @@ int invalidate_inode_pages2_range(struct
 				}
 			}
 			was_dirty = test_clear_page_dirty(page);
-			if (!invalidate_complete_page(mapping, page)) {
+			if (!invalidate_complete_page2(mapping, page)) {
 				if (was_dirty)
 					set_page_dirty(page);
 				ret = -EIO;
_

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-09-29 21:44                                       ` Andrew Morton
@ 2006-09-29 21:48                                         ` Chuck Lever
  2006-09-29 22:29                                           ` Andrew Morton
  0 siblings, 1 reply; 56+ messages in thread
From: Chuck Lever @ 2006-09-29 21:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Zijlstra, Nick Piggin, Trond Myklebust, Steve Dickson, linux-mm

Andrew Morton wrote:
> buggerit, let's do this.  It'll fix NFS, yes?

It looks right to me.  I'll discuss a patch with Trond that adds a 
warning in nfs_revalidate_mapping() and perhaps a performance counter to 
see how many times we hit this in practice.

> From: Andrew Morton <akpm@osdl.org>
> 
> The recent fix to invalidate_inode_pages() (git commit 016eb4a) managed to
> unfix invalidate_inode_pages2().
> 
> The problem is that various bits of code in the kernel can take transient refs
> on pages: the page scanner will do this when inspecting a batch of pages, and
> the lru_cache_add() batching pagevecs also hold a ref.
> 
> Net result is transient failures in invalidate_inode_pages2().  This affects
> NFS directory invalidation (observed) and presumably also block-backed
> direct-io (not yet reported).
> 
> Fix it by reverting invalidate_inode_pages2() back to the old version which
> ignores the page refcounts.
> 
> We may come up with something more clever later, but for now we need a 2.6.18
> fix for NFS.
> 
> Cc: Chuck Lever <cel@citi.umich.edu>
> Cc: Nick Piggin <nickpiggin@yahoo.com.au>
> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: <stable@kernel.org>
> Signed-off-by: Andrew Morton <akpm@osdl.org>
> ---
> 
>  mm/truncate.c |   34 ++++++++++++++++++++++++++++++++--
>  1 files changed, 32 insertions(+), 2 deletions(-)
> 
> diff -puN mm/truncate.c~invalidate_inode_pages2-ignore-page-refcounts mm/truncate.c
> --- a/mm/truncate.c~invalidate_inode_pages2-ignore-page-refcounts
> +++ a/mm/truncate.c
> @@ -261,9 +261,39 @@ unsigned long invalidate_inode_pages(str
>  {
>  	return invalidate_mapping_pages(mapping, 0, ~0UL);
>  }
> -
>  EXPORT_SYMBOL(invalidate_inode_pages);
>  
> +/*
> + * This is like invalidate_complete_page(), except it ignores the page's
> + * refcount.  We do this because invalidate_indoe_pages2() needs stronger
> + * invalidation guarantees, and cannot afford to leave pages behind because
> + * shrink_list() has a temp ref on them, or because they're transiently sitting
> + * in the lru_cache_add() pagevecs.
> + */
> +static int
> +invalidate_complete_page2(struct address_space *mapping, struct page *page)
> +{
> +	if (page->mapping != mapping)
> +		return 0;
> +
> +	if (PagePrivate(page) && !try_to_release_page(page, 0))
> +		return 0;
> +
> +	write_lock_irq(&mapping->tree_lock);
> +	if (PageDirty(page))
> +		goto failed;
> +
> +	BUG_ON(PagePrivate(page));
> +	__remove_from_page_cache(page);
> +	write_unlock_irq(&mapping->tree_lock);
> +	ClearPageUptodate(page);
> +	page_cache_release(page);	/* pagecache ref */
> +	return 1;
> +failed:
> +	write_unlock_irq(&mapping->tree_lock);
> +	return 0;
> +}
> +
>  /**
>   * invalidate_inode_pages2_range - remove range of pages from an address_space
>   * @mapping: the address_space
> @@ -330,7 +360,7 @@ int invalidate_inode_pages2_range(struct
>  				}
>  			}
>  			was_dirty = test_clear_page_dirty(page);
> -			if (!invalidate_complete_page(mapping, page)) {
> +			if (!invalidate_complete_page2(mapping, page)) {
>  				if (was_dirty)
>  					set_page_dirty(page);
>  				ret = -EIO;
> _
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-09-29 21:48                                         ` Chuck Lever
@ 2006-09-29 22:29                                           ` Andrew Morton
  2006-09-29 23:05                                             ` Chuck Lever
  2006-10-01  4:21                                             ` Chuck Lever
  0 siblings, 2 replies; 56+ messages in thread
From: Andrew Morton @ 2006-09-29 22:29 UTC (permalink / raw)
  To: chuck.lever
  Cc: Peter Zijlstra, Nick Piggin, Trond Myklebust, Steve Dickson, linux-mm

On Fri, 29 Sep 2006 17:48:23 -0400
Chuck Lever <chuck.lever@oracle.com> wrote:

> Andrew Morton wrote:
> > buggerit, let's do this.  It'll fix NFS, yes?
> 
> It looks right to me.

s/right/less incorrect/ ;)

Please double-check that it passes testing.

>  I'll discuss a patch with Trond that adds a 
> warning in nfs_revalidate_mapping() and perhaps a performance counter to 
> see how many times we hit this in practice.

Yes, please do carefully check the invalidate_inode_pages2() return value:
we've had ongoing subtle problem with that code and we really do want the
early warning which an explicit check+printk will give us.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-09-29 22:29                                           ` Andrew Morton
@ 2006-09-29 23:05                                             ` Chuck Lever
  2006-10-01  4:21                                             ` Chuck Lever
  1 sibling, 0 replies; 56+ messages in thread
From: Chuck Lever @ 2006-09-29 23:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Zijlstra, Nick Piggin, Trond Myklebust, Steve Dickson, linux-mm

Andrew Morton wrote:
> On Fri, 29 Sep 2006 17:48:23 -0400
> Chuck Lever <chuck.lever@oracle.com> wrote:
> 
>> Andrew Morton wrote:
>>> buggerit, let's do this.  It'll fix NFS, yes?
>> It looks right to me.
> 
> s/right/less incorrect/ ;)

Yes, that's what I meant.  :-)

> Please double-check that it passes testing.

Of course... since it's the weekend, give me a day.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-09-29 22:29                                           ` Andrew Morton
  2006-09-29 23:05                                             ` Chuck Lever
@ 2006-10-01  4:21                                             ` Chuck Lever
  2006-10-02 12:01                                               ` Steve Dickson
  1 sibling, 1 reply; 56+ messages in thread
From: Chuck Lever @ 2006-10-01  4:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Zijlstra, Nick Piggin, Trond Myklebust, Steve Dickson, linux-mm

Andrew Morton wrote:
> On Fri, 29 Sep 2006 17:48:23 -0400
> Chuck Lever <chuck.lever@oracle.com> wrote:
> 
>> Andrew Morton wrote:
>>> buggerit, let's do this.  It'll fix NFS, yes?
>> It looks right to me.
> 
> s/right/less incorrect/ ;)
> 
> Please double-check that it passes testing.

I've run this for 24 hours worth of continuous NFS testing.  I haven't 
seen a problem with it.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-10-01  4:21                                             ` Chuck Lever
@ 2006-10-02 12:01                                               ` Steve Dickson
  2006-10-02 13:25                                                 ` Trond Myklebust
  0 siblings, 1 reply; 56+ messages in thread
From: Steve Dickson @ 2006-10-02 12:01 UTC (permalink / raw)
  To: chuck.lever
  Cc: Andrew Morton, Peter Zijlstra, Nick Piggin, Trond Myklebust, linux-mm


Chuck Lever wrote:
> Andrew Morton wrote:
>> On Fri, 29 Sep 2006 17:48:23 -0400
>> Chuck Lever <chuck.lever@oracle.com> wrote:
>>
>>> Andrew Morton wrote:
>>>> buggerit, let's do this.  It'll fix NFS, yes?
>>> It looks right to me.
>>
>> s/right/less incorrect/ ;)
>>
>> Please double-check that it passes testing.
> 
> I've run this for 24 hours worth of continuous NFS testing.  I haven't 
> seen a problem with it.
Sorry for the delayed response.... life got in the way...
But for what's worth... I'm not longer seeing that readdir failure
with this patch...

Question: Maybe I missed this... but what is NFS suppose to do
when invalidate_inode_pages2() fails on non-file inodes? Noting
it with as metric as Chuck suggested is a good way to
detect  its happening, but does it make sense for NFS to keep
calling invalidate_inode_pages2() until it does not fail when
trying to flush the readdir cache?

steved.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-10-02 12:01                                               ` Steve Dickson
@ 2006-10-02 13:25                                                 ` Trond Myklebust
  2006-10-02 16:57                                                   ` Andrew Morton
  0 siblings, 1 reply; 56+ messages in thread
From: Trond Myklebust @ 2006-10-02 13:25 UTC (permalink / raw)
  To: Steve Dickson
  Cc: chuck.lever, Andrew Morton, Peter Zijlstra, Nick Piggin, linux-mm

On Mon, 2006-10-02 at 08:01 -0400, Steve Dickson wrote:
> Question: Maybe I missed this... but what is NFS suppose to do
> when invalidate_inode_pages2() fails on non-file inodes? Noting
> it with as metric as Chuck suggested is a good way to
> detect  its happening, but does it make sense for NFS to keep
> calling invalidate_inode_pages2() until it does not fail when
> trying to flush the readdir cache?

Definitely not. There is not much we can do at the filesystem level if
the VM fails, and that is why we haven't bothered checking the return
value previously.

Cheers,
  Trond

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-10-02 13:25                                                 ` Trond Myklebust
@ 2006-10-02 16:57                                                   ` Andrew Morton
  2006-10-02 17:02                                                     ` Steve Dickson
  0 siblings, 1 reply; 56+ messages in thread
From: Andrew Morton @ 2006-10-02 16:57 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Steve Dickson, chuck.lever, Peter Zijlstra, Nick Piggin, linux-mm

On Mon, 02 Oct 2006 09:25:22 -0400
Trond Myklebust <Trond.Myklebust@netapp.com> wrote:

> On Mon, 2006-10-02 at 08:01 -0400, Steve Dickson wrote:
> > Question: Maybe I missed this... but what is NFS suppose to do
> > when invalidate_inode_pages2() fails on non-file inodes? Noting
> > it with as metric as Chuck suggested is a good way to
> > detect  its happening, but does it make sense for NFS to keep
> > calling invalidate_inode_pages2() until it does not fail when
> > trying to flush the readdir cache?
> 
> Definitely not. There is not much we can do at the filesystem level if
> the VM fails, and that is why we haven't bothered checking the return
> value previously.
> 

Please add a printk.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-10-02 16:57                                                   ` Andrew Morton
@ 2006-10-02 17:02                                                     ` Steve Dickson
  2006-10-02 18:20                                                       ` Andrew Morton
  0 siblings, 1 reply; 56+ messages in thread
From: Steve Dickson @ 2006-10-02 17:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Trond Myklebust, chuck.lever, Peter Zijlstra, Nick Piggin, linux-mm


Andrew Morton wrote:
> On Mon, 02 Oct 2006 09:25:22 -0400
> Trond Myklebust <Trond.Myklebust@netapp.com> wrote:
> 
>> On Mon, 2006-10-02 at 08:01 -0400, Steve Dickson wrote:
>>> Question: Maybe I missed this... but what is NFS suppose to do
>>> when invalidate_inode_pages2() fails on non-file inodes? Noting
>>> it with as metric as Chuck suggested is a good way to
>>> detect  its happening, but does it make sense for NFS to keep
>>> calling invalidate_inode_pages2() until it does not fail when
>>> trying to flush the readdir cache?
>> Definitely not. There is not much we can do at the filesystem level if
>> the VM fails, and that is why we haven't bothered checking the return
>> value previously.
>>
> 
> Please add a printk.
Please add a printk that can be turned off....

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-10-02 17:02                                                     ` Steve Dickson
@ 2006-10-02 18:20                                                       ` Andrew Morton
  2006-10-02 19:02                                                         ` Steve Dickson
  0 siblings, 1 reply; 56+ messages in thread
From: Andrew Morton @ 2006-10-02 18:20 UTC (permalink / raw)
  To: Steve Dickson
  Cc: Trond Myklebust, chuck.lever, Peter Zijlstra, Nick Piggin, linux-mm

On Mon, 02 Oct 2006 13:02:03 -0400
Steve Dickson <SteveD@redhat.com> wrote:

> 
> 
> Andrew Morton wrote:
> > On Mon, 02 Oct 2006 09:25:22 -0400
> > Trond Myklebust <Trond.Myklebust@netapp.com> wrote:
> > 
> >> On Mon, 2006-10-02 at 08:01 -0400, Steve Dickson wrote:
> >>> Question: Maybe I missed this... but what is NFS suppose to do
> >>> when invalidate_inode_pages2() fails on non-file inodes? Noting
> >>> it with as metric as Chuck suggested is a good way to
> >>> detect  its happening, but does it make sense for NFS to keep
> >>> calling invalidate_inode_pages2() until it does not fail when
> >>> trying to flush the readdir cache?
> >> Definitely not. There is not much we can do at the filesystem level if
> >> the VM fails, and that is why we haven't bothered checking the return
> >> value previously.
> >>
> > 
> > Please add a printk.
> Please add a printk that can be turned off....

Please add a printk which can't be turned off.  This is our user's data
we're talking about here.

If that printk comes out then we need to fix the kernel so that it no
longer wants to print that printk.  We don't want to just hide it.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-10-02 18:20                                                       ` Andrew Morton
@ 2006-10-02 19:02                                                         ` Steve Dickson
  2006-10-03  2:14                                                           ` Chuck Lever
  0 siblings, 1 reply; 56+ messages in thread
From: Steve Dickson @ 2006-10-02 19:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Trond Myklebust, chuck.lever, Peter Zijlstra, Nick Piggin, linux-mm


Andrew Morton wrote:
> 
> This is our user's data we're talking about here.
Point...

> 
> If that printk comes out then we need to fix the kernel so that it no
> longer wants to print that printk.  We don't want to just hide it.
I'm concern about the printk popping when we are flushing the
readdir cache (i.e. stale data) and either flooding the console
to a ton a messages (basically bring a system to its knees for
no good reason) or scaring the hell out people by saying we have a
major problem when in reality we are just flushing stale data...

So I definitely agree the printk should be there and be on by default,
but I so think it would be a good idea to have way to turn it off
if need be...

steved.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-10-02 19:02                                                         ` Steve Dickson
@ 2006-10-03  2:14                                                           ` Chuck Lever
  2006-10-03  4:18                                                             ` Trond Myklebust
  0 siblings, 1 reply; 56+ messages in thread
From: Chuck Lever @ 2006-10-03  2:14 UTC (permalink / raw)
  To: Steve Dickson; +Cc: Andrew Morton, Trond Myklebust, linux-mm

[-- Attachment #1: Type: text/plain, Size: 1298 bytes --]

Steve Dickson wrote:
> Andrew Morton wrote:
>>
>> This is our user's data we're talking about here.
> Point...
> 
>>
>> If that printk comes out then we need to fix the kernel so that it no
>> longer wants to print that printk.  We don't want to just hide it.
 >
> I'm concern about the printk popping when we are flushing the
> readdir cache (i.e. stale data) and either flooding the console
> to a ton a messages (basically bring a system to its knees for
> no good reason) or scaring the hell out people by saying we have a
> major problem when in reality we are just flushing stale data...
> 
> So I definitely agree the printk should be there and be on by default,
> but I so think it would be a good idea to have way to turn it off
> if need be...

[ Sorry for the attachment... anyone know how to include a diff inline 
with Thunderbird? ]

The attached patch is my suggestion for reporting the cache invalidation 
failure from within the NFS client.  Please review and comment.  My 
testing with this patch applied has not triggered a single message, but 
I haven't tried any memory exhaustion scenarios.

I honestly doubt that we will see log floods.  The original problem that 
was causing stale data to remain cached has been addressed.  The reclaim 
race will almost certainly be rare.

[-- Attachment #2: nfs-check-return-codes.diff --]
[-- Type: text/plain, Size: 5097 bytes --]

NFS: Add return code checks for page invalidation

Print a warning if the page invalidation functions don't behave as
expected.  A BUG is probably overkill here since the client's internal data
structures will remain consistent.

We're trying to catch cases where invaliding an inode's page cache races
with vmscan or direct I/O, resulting in stale data remaining in the page
cache.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---

 fs/nfs/dir.c           |   34 +++++++++++++++++++++++++++++-----
 fs/nfs/direct.c        |    2 +-
 fs/nfs/inode.c         |   25 +++++++++++++++++++++++--
 fs/nfs/iostat.h        |    1 +
 include/linux/nfs_fs.h |    1 +
 5 files changed, 55 insertions(+), 8 deletions(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 7432f1a..0bb1a42 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -156,6 +156,32 @@ typedef struct {
 	int		error;
 } nfs_readdir_descriptor_t;
 
+/*
+ * Trim off all pages past page zero.  This ensures consistent page
+ * alignment of cached data.
+ *
+ * NB: This assumes we have exclusive access to this mapping either
+ *     through inode->i_mutex or some other mechanism.
+ */
+static void nfs_truncate_directory_cache(struct inode *inode)
+{
+	int result;
+
+	dfprintk(DIRCACHE, "NFS: %s: truncating directory (%s/%Ld)\n",
+			__FUNCTION__, inode->i_sb->s_id,
+			(long long)NFS_FILEID(inode));
+
+	result = invalidate_inode_pages2_range(inode->i_mapping,
+							PAGE_CACHE_SIZE, -1);
+	if (unlikely(result < 0)) {
+		nfs_inc_stats(inode, NFSIOS_INVALIDATEFAILED);
+		printk(KERN_ERR
+			"NFS: error %d invalidating cache for dir (%s/%Ld)\n",
+				result, inode->i_sb->s_id,
+				(long long)NFS_FILEID(inode));
+	}
+}
+
 /* Now we cache directories properly, by stuffing the dirent
  * data directly in the page cache.
  *
@@ -199,12 +225,10 @@ int nfs_readdir_filler(nfs_readdir_descr
 	spin_lock(&inode->i_lock);
 	NFS_I(inode)->cache_validity |= NFS_INO_INVALID_ATIME;
 	spin_unlock(&inode->i_lock);
-	/* Ensure consistent page alignment of the data.
-	 * Note: assumes we have exclusive access to this mapping either
-	 *	 through inode->i_mutex or some other mechanism.
-	 */
+
 	if (page->index == 0)
-		invalidate_inode_pages2_range(inode->i_mapping, PAGE_CACHE_SIZE, -1);
+		nfs_truncate_directory_cache(inode);
+
 	unlock_page(page);
 	return 0;
  error:
diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 377839b..fe69c39 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -823,7 +823,7 @@ ssize_t nfs_file_direct_write(struct kio
 	 *      occur before the writes complete.  Kind of racey.
 	 */
 	if (mapping->nrpages)
-		invalidate_inode_pages2(mapping);
+		nfs_invalidate_mapping(mapping->host, mapping);
 
 	if (retval > 0)
 		iocb->ki_pos = pos + retval;
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index bc9376c..e1cf978 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -657,6 +657,27 @@ int nfs_revalidate_inode(struct nfs_serv
 }
 
 /**
+ * nfs_invalidate_mapping - Invalidate the inode's page cache
+ * @inode - pointer to host inode
+ * @mapping - pointer to mapping
+ */
+void nfs_invalidate_mapping(struct inode *inode, struct address_space *mapping)
+{
+	int result;
+
+	nfs_inc_stats(inode, NFSIOS_DATAINVALIDATE);
+
+	result = invalidate_inode_pages2(mapping);
+	if (unlikely(result) < 0) {
+		nfs_inc_stats(inode, NFSIOS_INVALIDATEFAILED);
+		printk(KERN_ERR
+			"NFS: error %d invalidating pages for inode (%s/%Ld)\n",
+				result, inode->i_sb->s_id,
+				(long long)NFS_FILEID(inode));
+	}
+}
+
+/**
  * nfs_revalidate_mapping - Revalidate the pagecache
  * @inode - pointer to host inode
  * @mapping - pointer to mapping
@@ -673,10 +694,10 @@ int nfs_revalidate_mapping(struct inode 
 		ret = __nfs_revalidate_inode(NFS_SERVER(inode), inode);
 
 	if (nfsi->cache_validity & NFS_INO_INVALID_DATA) {
-		nfs_inc_stats(inode, NFSIOS_DATAINVALIDATE);
 		if (S_ISREG(inode->i_mode))
 			nfs_sync_mapping(mapping);
-		invalidate_inode_pages2(mapping);
+
+		nfs_invalidate_mapping(inode, mapping);
 
 		spin_lock(&inode->i_lock);
 		nfsi->cache_validity &= ~NFS_INO_INVALID_DATA;
diff --git a/fs/nfs/iostat.h b/fs/nfs/iostat.h
index 6350ecb..df41150 100644
--- a/fs/nfs/iostat.h
+++ b/fs/nfs/iostat.h
@@ -104,6 +104,7 @@ enum nfs_stat_eventcounters {
 	NFSIOS_SHORTREAD,
 	NFSIOS_SHORTWRITE,
 	NFSIOS_DELAY,
+	NFSIOS_INVALIDATEFAILED,
 	__NFSIOS_COUNTSMAX,
 };
 
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 98c9b9f..dc3cac3 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -306,6 +306,7 @@ extern int nfs_attribute_timeout(struct 
 extern int nfs_revalidate_inode(struct nfs_server *server, struct inode *inode);
 extern int __nfs_revalidate_inode(struct nfs_server *, struct inode *);
 extern int nfs_revalidate_mapping(struct inode *inode, struct address_space *mapping);
+extern void nfs_invalidate_mapping(struct inode *inode, struct address_space *mapping);
 extern int nfs_setattr(struct dentry *, struct iattr *);
 extern void nfs_setattr_update_inode(struct inode *inode, struct iattr *attr);
 extern void nfs_begin_attr_update(struct inode *);

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-10-03  2:14                                                           ` Chuck Lever
@ 2006-10-03  4:18                                                             ` Trond Myklebust
  2006-10-03  4:24                                                               ` Andrew Morton
  2006-10-03 18:50                                                               ` Chuck Lever
  0 siblings, 2 replies; 56+ messages in thread
From: Trond Myklebust @ 2006-10-03  4:18 UTC (permalink / raw)
  To: chuck.lever; +Cc: Steve Dickson, Andrew Morton, linux-mm

On Mon, 2006-10-02 at 22:14 -0400, Chuck Lever wrote:
> Steve Dickson wrote:
> > Andrew Morton wrote:
> >>
> >> This is our user's data we're talking about here.
> > Point...
> >
> >>
> >> If that printk comes out then we need to fix the kernel so that it no
> >> longer wants to print that printk.  We don't want to just hide it.

So what _is_ stopping us from fixing it right now? Are we missing an
audit of the possible errors? That can be arranged...

> > I'm concern about the printk popping when we are flushing the
> > readdir cache (i.e. stale data) and either flooding the console
> > to a ton a messages (basically bring a system to its knees for
> > no good reason) or scaring the hell out people by saying we have a
> > major problem when in reality we are just flushing stale data...
> > 
> > So I definitely agree the printk should be there and be on by default,
> > but I so think it would be a good idea to have way to turn it off
> > if need be...

Why? If we know there is a problem, then why wait to fix it?

> [ Sorry for the attachment... anyone know how to include a diff inline 
> with Thunderbird? ]
> 
> The attached patch is my suggestion for reporting the cache invalidation 
> failure from within the NFS client.  Please review and comment.  My 
> testing with this patch applied has not triggered a single message, but 
> I haven't tried any memory exhaustion scenarios.
> 
> I honestly doubt that we will see log floods.  The original problem that 
> was causing stale data to remain cached has been addressed.  The reclaim 
> race will almost certainly be rare.
> plain text document attachment (nfs-check-return-codes.diff)
> NFS: Add return code checks for page invalidation
> 
> Print a warning if the page invalidation functions don't behave as
> expected.  A BUG is probably overkill here since the client's internal data
> structures will remain consistent.
> 
> We're trying to catch cases where invaliding an inode's page cache races
> with vmscan or direct I/O, resulting in stale data remaining in the page
> cache.
> 
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
> 
>  fs/nfs/dir.c           |   34 +++++++++++++++++++++++++++++-----
>  fs/nfs/direct.c        |    2 +-
>  fs/nfs/inode.c         |   25 +++++++++++++++++++++++--
>  fs/nfs/iostat.h        |    1 +
>  include/linux/nfs_fs.h |    1 +
>  5 files changed, 55 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> index 7432f1a..0bb1a42 100644
> --- a/fs/nfs/dir.c
> +++ b/fs/nfs/dir.c
> @@ -156,6 +156,32 @@ typedef struct {
>  	int		error;
>  } nfs_readdir_descriptor_t;
>  
> +/*
> + * Trim off all pages past page zero.  This ensures consistent page
> + * alignment of cached data.
> + *
> + * NB: This assumes we have exclusive access to this mapping either
> + *     through inode->i_mutex or some other mechanism.
> + */
> +static void nfs_truncate_directory_cache(struct inode *inode)
> +{
> +	int result;
> +
> +	dfprintk(DIRCACHE, "NFS: %s: truncating directory (%s/%Ld)\n",
> +			__FUNCTION__, inode->i_sb->s_id,
> +			(long long)NFS_FILEID(inode));
> +
> +	result = invalidate_inode_pages2_range(inode->i_mapping,
> +							PAGE_CACHE_SIZE, -1);
> +	if (unlikely(result < 0)) {
> +		nfs_inc_stats(inode, NFSIOS_INVALIDATEFAILED);
> +		printk(KERN_ERR
> +			"NFS: error %d invalidating cache for dir (%s/%Ld)\n",
> +				result, inode->i_sb->s_id,
> +				(long long)NFS_FILEID(inode));

See gripe below.

> +	}
> +}
> +
>  /* Now we cache directories properly, by stuffing the dirent
>   * data directly in the page cache.
>   *
> @@ -199,12 +225,10 @@ int nfs_readdir_filler(nfs_readdir_descr
>  	spin_lock(&inode->i_lock);
>  	NFS_I(inode)->cache_validity |= NFS_INO_INVALID_ATIME;
>  	spin_unlock(&inode->i_lock);
> -	/* Ensure consistent page alignment of the data.
> -	 * Note: assumes we have exclusive access to this mapping either
> -	 *	 through inode->i_mutex or some other mechanism.
> -	 */
> +
>  	if (page->index == 0)
> -		invalidate_inode_pages2_range(inode->i_mapping, PAGE_CACHE_SIZE, -1);
> +		nfs_truncate_directory_cache(inode);
> +
>  	unlock_page(page);
>  	return 0;
>   error:
> diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
> index 377839b..fe69c39 100644
> --- a/fs/nfs/direct.c
> +++ b/fs/nfs/direct.c
> @@ -823,7 +823,7 @@ ssize_t nfs_file_direct_write(struct kio
>  	 *      occur before the writes complete.  Kind of racey.
>  	 */
>  	if (mapping->nrpages)
> -		invalidate_inode_pages2(mapping);
> +		nfs_invalidate_mapping(mapping->host, mapping);

This looks wrong. Why are we bumping the NFSIOS_DATAINVALIDATE counter
on a direct write? We're not registering a cache consistency problem
here.

>  
>  	if (retval > 0)
>  		iocb->ki_pos = pos + retval;
> diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
> index bc9376c..e1cf978 100644
> --- a/fs/nfs/inode.c
> +++ b/fs/nfs/inode.c
> @@ -657,6 +657,27 @@ int nfs_revalidate_inode(struct nfs_serv
>  }
>  
>  /**
> + * nfs_invalidate_mapping - Invalidate the inode's page cache
> + * @inode - pointer to host inode
> + * @mapping - pointer to mapping
> + */
> +void nfs_invalidate_mapping(struct inode *inode, struct address_space *mapping)
> +{
> +	int result;
> +
> +	nfs_inc_stats(inode, NFSIOS_DATAINVALIDATE);
> +
> +	result = invalidate_inode_pages2(mapping);
> +	if (unlikely(result) < 0) {
> +		nfs_inc_stats(inode, NFSIOS_INVALIDATEFAILED);
> +		printk(KERN_ERR
> +			"NFS: error %d invalidating pages for inode (%s/%Ld)\n",
> +				result, inode->i_sb->s_id,
> +				(long long)NFS_FILEID(inode));

So what _are_ users expected to do about this? Sue us? Complain bitterly
to lkml, and then get told that the VM is broken?

IOW: What the hell is the point of these messages?

> +	}
> +}
> +
> +/**
>   * nfs_revalidate_mapping - Revalidate the pagecache
>   * @inode - pointer to host inode
>   * @mapping - pointer to mapping
> @@ -673,10 +694,10 @@ int nfs_revalidate_mapping(struct inode 
>  		ret = __nfs_revalidate_inode(NFS_SERVER(inode), inode);
>  
>  	if (nfsi->cache_validity & NFS_INO_INVALID_DATA) {
> -		nfs_inc_stats(inode, NFSIOS_DATAINVALIDATE);
>  		if (S_ISREG(inode->i_mode))
>  			nfs_sync_mapping(mapping);
> -		invalidate_inode_pages2(mapping);
> +
> +		nfs_invalidate_mapping(inode, mapping);
>  
>  		spin_lock(&inode->i_lock);
>  		nfsi->cache_validity &= ~NFS_INO_INVALID_DATA;
> diff --git a/fs/nfs/iostat.h b/fs/nfs/iostat.h
> index 6350ecb..df41150 100644
> --- a/fs/nfs/iostat.h
> +++ b/fs/nfs/iostat.h
> @@ -104,6 +104,7 @@ enum nfs_stat_eventcounters {
>  	NFSIOS_SHORTREAD,
>  	NFSIOS_SHORTWRITE,
>  	NFSIOS_DELAY,
> +	NFSIOS_INVALIDATEFAILED,
>  	__NFSIOS_COUNTSMAX,
>  };
>  
> diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
> index 98c9b9f..dc3cac3 100644
> --- a/include/linux/nfs_fs.h
> +++ b/include/linux/nfs_fs.h
> @@ -306,6 +306,7 @@ extern int nfs_attribute_timeout(struct 
>  extern int nfs_revalidate_inode(struct nfs_server *server, struct inode *inode);
>  extern int __nfs_revalidate_inode(struct nfs_server *, struct inode *);
>  extern int nfs_revalidate_mapping(struct inode *inode, struct address_space *mapping);
> +extern void nfs_invalidate_mapping(struct inode *inode, struct address_space *mapping);
>  extern int nfs_setattr(struct dentry *, struct iattr *);
>  extern void nfs_setattr_update_inode(struct inode *inode, struct iattr *attr);
>  extern void nfs_begin_attr_update(struct inode *);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-10-03  4:18                                                             ` Trond Myklebust
@ 2006-10-03  4:24                                                               ` Andrew Morton
  2006-10-03 18:50                                                               ` Chuck Lever
  1 sibling, 0 replies; 56+ messages in thread
From: Andrew Morton @ 2006-10-03  4:24 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: chuck.lever, Steve Dickson, linux-mm

On Tue, 03 Oct 2006 00:18:37 -0400
Trond Myklebust <Trond.Myklebust@netapp.com> wrote:

> On Mon, 2006-10-02 at 22:14 -0400, Chuck Lever wrote:
> > Steve Dickson wrote:
> > > Andrew Morton wrote:
> > >>
> > >> This is our user's data we're talking about here.
> > > Point...
> > >
> > >>
> > >> If that printk comes out then we need to fix the kernel so that it no
> > >> longer wants to print that printk.  We don't want to just hide it.
> 
> So what _is_ stopping us from fixing it right now? Are we missing an
> audit of the possible errors? That can be arranged...

We hope that it'll never come out.

> > > I'm concern about the printk popping when we are flushing the
> > > readdir cache (i.e. stale data) and either flooding the console
> > > to a ton a messages (basically bring a system to its knees for
> > > no good reason) or scaring the hell out people by saying we have a
> > > major problem when in reality we are just flushing stale data...
> > > 
> > > So I definitely agree the printk should be there and be on by default,
> > > but I so think it would be a good idea to have way to turn it off
> > > if need be...
> 
> Why? If we know there is a problem, then why wait to fix it?

There is now no known problem.  But as I said before, this is an area where
we've had relatively frequent problems, and those problems are subtle.

So the printk is just an early warning system.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-10-03  4:18                                                             ` Trond Myklebust
  2006-10-03  4:24                                                               ` Andrew Morton
@ 2006-10-03 18:50                                                               ` Chuck Lever
  2006-10-03 19:10                                                                 ` Trond Myklebust
  1 sibling, 1 reply; 56+ messages in thread
From: Chuck Lever @ 2006-10-03 18:50 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Steve Dickson, Andrew Morton, linux-mm

Trond Myklebust wrote:
>> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
>> index 7432f1a..0bb1a42 100644
>> --- a/fs/nfs/dir.c
>> +++ b/fs/nfs/dir.c
>> @@ -156,6 +156,32 @@ typedef struct {
>>  	int		error;
>>  } nfs_readdir_descriptor_t;
>>  
>> +/*
>> + * Trim off all pages past page zero.  This ensures consistent page
>> + * alignment of cached data.
>> + *
>> + * NB: This assumes we have exclusive access to this mapping either
>> + *     through inode->i_mutex or some other mechanism.
>> + */
>> +static void nfs_truncate_directory_cache(struct inode *inode)
>> +{
>> +	int result;
>> +
>> +	dfprintk(DIRCACHE, "NFS: %s: truncating directory (%s/%Ld)\n",
>> +			__FUNCTION__, inode->i_sb->s_id,
>> +			(long long)NFS_FILEID(inode));
>> +
>> +	result = invalidate_inode_pages2_range(inode->i_mapping,
>> +							PAGE_CACHE_SIZE, -1);
>> +	if (unlikely(result < 0)) {
>> +		nfs_inc_stats(inode, NFSIOS_INVALIDATEFAILED);
>> +		printk(KERN_ERR
>> +			"NFS: error %d invalidating cache for dir (%s/%Ld)\n",
>> +				result, inode->i_sb->s_id,
>> +				(long long)NFS_FILEID(inode));
> 
> See gripe below.
> 
>> +	}
>> +}
>> +
>>  /* Now we cache directories properly, by stuffing the dirent
>>   * data directly in the page cache.
>>   *
>> @@ -199,12 +225,10 @@ int nfs_readdir_filler(nfs_readdir_descr
>>  	spin_lock(&inode->i_lock);
>>  	NFS_I(inode)->cache_validity |= NFS_INO_INVALID_ATIME;
>>  	spin_unlock(&inode->i_lock);
>> -	/* Ensure consistent page alignment of the data.
>> -	 * Note: assumes we have exclusive access to this mapping either
>> -	 *	 through inode->i_mutex or some other mechanism.
>> -	 */
>> +
>>  	if (page->index == 0)
>> -		invalidate_inode_pages2_range(inode->i_mapping, PAGE_CACHE_SIZE, -1);
>> +		nfs_truncate_directory_cache(inode);
>> +
>>  	unlock_page(page);
>>  	return 0;
>>   error:
>> diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
>> index 377839b..fe69c39 100644
>> --- a/fs/nfs/direct.c
>> +++ b/fs/nfs/direct.c
>> @@ -823,7 +823,7 @@ ssize_t nfs_file_direct_write(struct kio
>>  	 *      occur before the writes complete.  Kind of racey.
>>  	 */
>>  	if (mapping->nrpages)
>> -		invalidate_inode_pages2(mapping);
>> +		nfs_invalidate_mapping(mapping->host, mapping);
> 
> This looks wrong. Why are we bumping the NFSIOS_DATAINVALIDATE counter
> on a direct write? We're not registering a cache consistency problem
> here.

We're looking for potential races between direct I/O and cache 
invalidation, among others.  Is your concern that this may report false 
positives?

I'm not sure this invalidation is useful in any event.  Direct writes 
are treated like some other client has modified the file, so cached 
pages will get invalidated eventually anyway.  Maybe we should just 
remove this one?

>>  
>>  	if (retval > 0)
>>  		iocb->ki_pos = pos + retval;
>> diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
>> index bc9376c..e1cf978 100644
>> --- a/fs/nfs/inode.c
>> +++ b/fs/nfs/inode.c
>> @@ -657,6 +657,27 @@ int nfs_revalidate_inode(struct nfs_serv
>>  }
>>  
>>  /**
>> + * nfs_invalidate_mapping - Invalidate the inode's page cache
>> + * @inode - pointer to host inode
>> + * @mapping - pointer to mapping
>> + */
>> +void nfs_invalidate_mapping(struct inode *inode, struct address_space *mapping)
>> +{
>> +	int result;
>> +
>> +	nfs_inc_stats(inode, NFSIOS_DATAINVALIDATE);
>> +
>> +	result = invalidate_inode_pages2(mapping);
>> +	if (unlikely(result) < 0) {
>> +		nfs_inc_stats(inode, NFSIOS_INVALIDATEFAILED);
>> +		printk(KERN_ERR
>> +			"NFS: error %d invalidating pages for inode (%s/%Ld)\n",
>> +				result, inode->i_sb->s_id,
>> +				(long long)NFS_FILEID(inode));
> 
> So what _are_ users expected to do about this? Sue us? Complain bitterly
> to lkml, and then get told that the VM is broken?

Such a message will be reported to distributors or lkml, and we will be 
able to collect data about the scenario where there is a problem.

Another option for customers is to run application-level data 
consistency checks when this error is reported.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-10-03 18:50                                                               ` Chuck Lever
@ 2006-10-03 19:10                                                                 ` Trond Myklebust
  2006-10-03 19:21                                                                   ` Chuck Lever
  2006-10-03 21:37                                                                   ` Andrew Morton
  0 siblings, 2 replies; 56+ messages in thread
From: Trond Myklebust @ 2006-10-03 19:10 UTC (permalink / raw)
  To: chuck.lever; +Cc: Steve Dickson, Andrew Morton, linux-mm

On Tue, 2006-10-03 at 14:50 -0400, Chuck Lever wrote:

> >> diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
> >> index 377839b..fe69c39 100644
> >> --- a/fs/nfs/direct.c
> >> +++ b/fs/nfs/direct.c
> >> @@ -823,7 +823,7 @@ ssize_t nfs_file_direct_write(struct kio
> >>  	 *      occur before the writes complete.  Kind of racey.
> >>  	 */
> >>  	if (mapping->nrpages)
> >> -		invalidate_inode_pages2(mapping);
> >> +		nfs_invalidate_mapping(mapping->host, mapping);
> > 
> > This looks wrong. Why are we bumping the NFSIOS_DATAINVALIDATE counter
> > on a direct write? We're not registering a cache consistency problem
> > here.
> 
> We're looking for potential races between direct I/O and cache 
> invalidation, among others.  Is your concern that this may report false 
> positives?

No. I simply don't see what the use case is for this statistic. AFAICS
it is purely a debugging tool for _developers_. That would have
absolutely no place at all in /proc/self/mountstats, which is supposed
to provide useful statistics for _administrators_.

> I'm not sure this invalidation is useful in any event.  Direct writes 
> are treated like some other client has modified the file, so cached 
> pages will get invalidated eventually anyway.  Maybe we should just 
> remove this one?

That would break the principle that if one process modifies the file,
then all processes on the same client will immediately see those
changes.

> >> +	result = invalidate_inode_pages2(mapping);
> >> +	if (unlikely(result) < 0) {
> >> +		nfs_inc_stats(inode, NFSIOS_INVALIDATEFAILED);
> >> +		printk(KERN_ERR
> >> +			"NFS: error %d invalidating pages for inode (%s/%Ld)\n",
> >> +				result, inode->i_sb->s_id,
> >> +				(long long)NFS_FILEID(inode));
> > 
> > So what _are_ users expected to do about this? Sue us? Complain bitterly
> > to lkml, and then get told that the VM is broken?
> 
> Such a message will be reported to distributors or lkml, and we will be 
> able to collect data about the scenario where there is a problem.

i.e. we're throwing our hands up into the air, and saying "we don't
understand what is going on here"?

Sorry, but that is not an option. If we can't design a clearly defined
API for cache invalidation, with well defined errors, then we should all
go home and start learning Solaris programming now.

> Another option for customers is to run application-level data 
> consistency checks when this error is reported.

And then sue us?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-10-03 19:10                                                                 ` Trond Myklebust
@ 2006-10-03 19:21                                                                   ` Chuck Lever
  2006-10-03 21:37                                                                   ` Andrew Morton
  1 sibling, 0 replies; 56+ messages in thread
From: Chuck Lever @ 2006-10-03 19:21 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Steve Dickson, Andrew Morton, linux-mm

Trond Myklebust wrote:
> On Tue, 2006-10-03 at 14:50 -0400, Chuck Lever wrote:
> 
>>>> diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
>>>> index 377839b..fe69c39 100644
>>>> --- a/fs/nfs/direct.c
>>>> +++ b/fs/nfs/direct.c
>>>> @@ -823,7 +823,7 @@ ssize_t nfs_file_direct_write(struct kio
>>>>  	 *      occur before the writes complete.  Kind of racey.
>>>>  	 */
>>>>  	if (mapping->nrpages)
>>>> -		invalidate_inode_pages2(mapping);
>>>> +		nfs_invalidate_mapping(mapping->host, mapping);
>>> This looks wrong. Why are we bumping the NFSIOS_DATAINVALIDATE counter
>>> on a direct write? We're not registering a cache consistency problem
>>> here.
>> We're looking for potential races between direct I/O and cache 
>> invalidation, among others.  Is your concern that this may report false 
>> positives?
> 
> No. I simply don't see what the use case is for this statistic. AFAICS
> it is purely a debugging tool for _developers_. That would have
> absolutely no place at all in /proc/self/mountstats, which is supposed
> to provide useful statistics for _administrators_.

Don't you think admins need to know when the client has potentially 
caused data corruption?

>> I'm not sure this invalidation is useful in any event.  Direct writes 
>> are treated like some other client has modified the file, so cached 
>> pages will get invalidated eventually anyway.  Maybe we should just 
>> remove this one?
> 
> That would break the principle that if one process modifies the file,
> then all processes on the same client will immediately see those
> changes.

We could achieve that simply by setting the NFS_INO_INVALID_DATA flag in 
nfs_file_direct_write() and/or nfs_direct_write_complete().

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-10-03 19:10                                                                 ` Trond Myklebust
  2006-10-03 19:21                                                                   ` Chuck Lever
@ 2006-10-03 21:37                                                                   ` Andrew Morton
  2006-10-04 19:29                                                                     ` Chuck Lever
  1 sibling, 1 reply; 56+ messages in thread
From: Andrew Morton @ 2006-10-03 21:37 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: chuck.lever, Steve Dickson, linux-mm

On Tue, 03 Oct 2006 15:10:01 -0400
Trond Myklebust <Trond.Myklebust@netapp.com> wrote:

> > >> +	result = invalidate_inode_pages2(mapping);
> > >> +	if (unlikely(result) < 0) {
> > >> +		nfs_inc_stats(inode, NFSIOS_INVALIDATEFAILED);
> > >> +		printk(KERN_ERR
> > >> +			"NFS: error %d invalidating pages for inode (%s/%Ld)\n",
> > >> +				result, inode->i_sb->s_id,
> > >> +				(long long)NFS_FILEID(inode));
> > > 
> > > So what _are_ users expected to do about this? Sue us? Complain bitterly
> > > to lkml, and then get told that the VM is broken?
> > 
> > Such a message will be reported to distributors or lkml, and we will be 
> > able to collect data about the scenario where there is a problem.
> 
> i.e. we're throwing our hands up into the air, and saying "we don't
> understand what is going on here"?
> 
> Sorry, but that is not an option. If we can't design a clearly defined
> API for cache invalidation, with well defined errors, then we should all
> go home and start learning Solaris programming now.

I don't think we need the stats increment in there.  We could make it a
plain WARN_ON() or BUG_ON().  Or just whack a printk in there.  It's hardly
a big deal.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-10-03 21:37                                                                   ` Andrew Morton
@ 2006-10-04 19:29                                                                     ` Chuck Lever
  2006-10-04 19:43                                                                       ` Andrew Morton
  0 siblings, 1 reply; 56+ messages in thread
From: Chuck Lever @ 2006-10-04 19:29 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Trond Myklebust, Steve Dickson, linux-mm

Andrew Morton wrote:
> On Tue, 03 Oct 2006 15:10:01 -0400
> Trond Myklebust <Trond.Myklebust@netapp.com> wrote:
> 
>>>>> +	result = invalidate_inode_pages2(mapping);
>>>>> +	if (unlikely(result) < 0) {
>>>>> +		nfs_inc_stats(inode, NFSIOS_INVALIDATEFAILED);
>>>>> +		printk(KERN_ERR
>>>>> +			"NFS: error %d invalidating pages for inode (%s/%Ld)\n",
>>>>> +				result, inode->i_sb->s_id,
>>>>> +				(long long)NFS_FILEID(inode));
>>>> So what _are_ users expected to do about this? Sue us? Complain bitterly
>>>> to lkml, and then get told that the VM is broken?
>>> Such a message will be reported to distributors or lkml, and we will be 
>>> able to collect data about the scenario where there is a problem.
>> i.e. we're throwing our hands up into the air, and saying "we don't
>> understand what is going on here"?
>>
>> Sorry, but that is not an option. If we can't design a clearly defined
>> API for cache invalidation, with well defined errors, then we should all
>> go home and start learning Solaris programming now.
> 
> I don't think we need the stats increment in there.  We could make it a
> plain WARN_ON() or BUG_ON().  Or just whack a printk in there.  It's hardly
> a big deal.

A "WARN_ON(ret != 0);" placed at the end of 
invalidate_inode_pages2_range() should be sufficient and harmless.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-10-04 19:29                                                                     ` Chuck Lever
@ 2006-10-04 19:43                                                                       ` Andrew Morton
  2006-10-04 19:53                                                                         ` Steve Dickson
  0 siblings, 1 reply; 56+ messages in thread
From: Andrew Morton @ 2006-10-04 19:43 UTC (permalink / raw)
  To: chuck.lever; +Cc: Trond Myklebust, Steve Dickson, linux-mm

On Wed, 04 Oct 2006 15:29:24 -0400
Chuck Lever <chuck.lever@oracle.com> wrote:

> Andrew Morton wrote:
> > On Tue, 03 Oct 2006 15:10:01 -0400
> > Trond Myklebust <Trond.Myklebust@netapp.com> wrote:
> > 
> >>>>> +	result = invalidate_inode_pages2(mapping);
> >>>>> +	if (unlikely(result) < 0) {
> >>>>> +		nfs_inc_stats(inode, NFSIOS_INVALIDATEFAILED);
> >>>>> +		printk(KERN_ERR
> >>>>> +			"NFS: error %d invalidating pages for inode (%s/%Ld)\n",
> >>>>> +				result, inode->i_sb->s_id,
> >>>>> +				(long long)NFS_FILEID(inode));
> >>>> So what _are_ users expected to do about this? Sue us? Complain bitterly
> >>>> to lkml, and then get told that the VM is broken?
> >>> Such a message will be reported to distributors or lkml, and we will be 
> >>> able to collect data about the scenario where there is a problem.
> >> i.e. we're throwing our hands up into the air, and saying "we don't
> >> understand what is going on here"?
> >>
> >> Sorry, but that is not an option. If we can't design a clearly defined
> >> API for cache invalidation, with well defined errors, then we should all
> >> go home and start learning Solaris programming now.
> > 
> > I don't think we need the stats increment in there.  We could make it a
> > plain WARN_ON() or BUG_ON().  Or just whack a printk in there.  It's hardly
> > a big deal.
> 
> A "WARN_ON(ret != 0);" placed at the end of 
> invalidate_inode_pages2_range() should be sufficient and harmless.

true.  I think. I'll take a look at adding a WARN_ON_ONCE().

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Checking page_count(page) in invalidate_complete_page
  2006-10-04 19:43                                                                       ` Andrew Morton
@ 2006-10-04 19:53                                                                         ` Steve Dickson
  0 siblings, 0 replies; 56+ messages in thread
From: Steve Dickson @ 2006-10-04 19:53 UTC (permalink / raw)
  To: Andrew Morton; +Cc: chuck.lever, Trond Myklebust, linux-mm


Andrew Morton wrote:
>> A "WARN_ON(ret != 0);" placed at the end of 
>> invalidate_inode_pages2_range() should be sufficient and harmless.
> 
> true.  I think. I'll take a look at adding a WARN_ON_ONCE().
Something like that would be very good... imho...

steved.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

end of thread, other threads:[~2006-10-04 19:53 UTC | newest]

Thread overview: 56+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <4518333E.2060101@oracle.com>
2006-09-25 21:10 ` Checking page_count(page) in invalidate_complete_page Andrew Morton
2006-09-25 22:30   ` Chuck Lever
2006-09-25 22:53     ` Andrew Morton
2006-09-25 22:57     ` Steve Dickson
2006-09-25 23:14       ` Nick Piggin
2006-09-25 22:40   ` Chuck Lever
2006-09-25 23:02     ` Andrew Morton
2006-09-25 22:50   ` Steve Dickson
2006-09-25 22:51   ` Nick Piggin
2006-09-25 23:14     ` Chuck Lever
2006-09-25 23:21       ` Nick Piggin
2006-09-26  0:01         ` Chuck Lever
2006-09-26  0:13           ` Nick Piggin
2006-09-26  1:33             ` Chuck Lever
2006-09-26  1:48               ` Nick Piggin
2006-09-28 16:26                 ` Chuck Lever
2006-09-28 16:36                   ` Andrew Morton
2006-09-28 16:40                     ` Andrew Morton
2006-09-28 16:42                       ` Chuck Lever
2006-09-28 17:03                         ` Andrew Morton
2006-09-28 17:09                           ` Chuck Lever
2006-09-29  0:37                             ` Nick Piggin
2006-09-29 20:34                               ` Chuck Lever
2006-09-29 20:45                                 ` Peter Zijlstra
2006-09-29 21:02                                   ` Chuck Lever
2006-09-29 21:17                                     ` Peter Zijlstra
2006-09-29 21:44                                       ` Andrew Morton
2006-09-29 21:48                                         ` Chuck Lever
2006-09-29 22:29                                           ` Andrew Morton
2006-09-29 23:05                                             ` Chuck Lever
2006-10-01  4:21                                             ` Chuck Lever
2006-10-02 12:01                                               ` Steve Dickson
2006-10-02 13:25                                                 ` Trond Myklebust
2006-10-02 16:57                                                   ` Andrew Morton
2006-10-02 17:02                                                     ` Steve Dickson
2006-10-02 18:20                                                       ` Andrew Morton
2006-10-02 19:02                                                         ` Steve Dickson
2006-10-03  2:14                                                           ` Chuck Lever
2006-10-03  4:18                                                             ` Trond Myklebust
2006-10-03  4:24                                                               ` Andrew Morton
2006-10-03 18:50                                                               ` Chuck Lever
2006-10-03 19:10                                                                 ` Trond Myklebust
2006-10-03 19:21                                                                   ` Chuck Lever
2006-10-03 21:37                                                                   ` Andrew Morton
2006-10-04 19:29                                                                     ` Chuck Lever
2006-10-04 19:43                                                                       ` Andrew Morton
2006-10-04 19:53                                                                         ` Steve Dickson
2006-09-28 16:41                     ` Chuck Lever
2006-09-26  6:25               ` Nick Piggin
2006-09-26 13:12                 ` Chuck Lever
2006-09-27  4:47                   ` Nick Piggin
2006-09-27  8:25                     ` Andrew Morton
2006-09-27  8:39                       ` Nick Piggin
2006-09-27 16:03                         ` Andrew Morton
2006-09-27 15:54                     ` Chuck Lever
2006-09-25 22:56   ` Chuck Lever

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox