From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with ESMTP id 5EF896B004A for ; Tue, 7 Jun 2011 14:22:57 -0400 (EDT) Received: by pzk4 with SMTP id 4so3321840pzk.14 for ; Tue, 07 Jun 2011 11:22:54 -0700 (PDT) Subject: Re: [PATCH] mm: Fix assertion mapping->nrpages == 0 in end_writeback() Mime-Version: 1.0 (Apple Message framework v1084) Content-Type: multipart/alternative; boundary=Apple-Mail-1-598770631 From: Jinshan Xiong In-Reply-To: <1307425597.3649.61.camel@tucsk.pomaz.szeredi.hu> Date: Tue, 7 Jun 2011 11:22:48 -0700 Message-Id: <24671813-6F79-4746-8BF1-7CC50F4BBBCA@whamcloud.com> References: <1306748258-4732-1-git-send-email-jack@suse.cz> <20110606151614.0037e236.akpm@linux-foundation.org> <1307425597.3649.61.camel@tucsk.pomaz.szeredi.hu> Sender: owner-linux-mm@kvack.org List-ID: To: Miklos Szeredi Cc: Andrew Morton , Jan Kara , linux-mm@kvack.org, Al Viro , stable@kernel.org, Nick Piggin --Apple-Mail-1-598770631 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=us-ascii On Jun 6, 2011, at 10:46 PM, Miklos Szeredi wrote: > On Mon, 2011-06-06 at 15:16 -0700, Andrew Morton wrote: >> On Mon, 30 May 2011 11:37:38 +0200 >> Jan Kara wrote: >>=20 >>> Under heavy memory and filesystem load, users observe the assertion >>> mapping->nrpages =3D=3D 0 in end_writeback() trigger. This can be = caused >>> by page reclaim reclaiming the last page from a mapping in the = following >>> race: >>> CPU0 CPU1 >>> ... >>> shrink_page_list() >>> __remove_mapping() >>> __delete_from_page_cache() >>> radix_tree_delete() >>> evict_inode() >>> truncate_inode_pages() >>> truncate_inode_pages_range() >>> pagevec_lookup() - finds = nothing >>> end_writeback() >>> mapping->nrpages !=3D 0 -> = BUG >>> page->mapping =3D NULL >>> mapping->nrpages-- >>>=20 >>> Fix the problem by cycling the mapping->tree_lock at the end of >>> truncate_inode_pages_range() to synchronize with page reclaim. >>>=20 >>> Analyzed by Jay , lost in LKML, and dug >>> out by Miklos Szeredi . >>>=20 >>> CC: Jay >>> CC: stable@kernel.org >>> Acked-by: Miklos Szeredi >>> Signed-off-by: Jan Kara >>> --- >>> mm/truncate.c | 7 +++++++ >>> 1 files changed, 7 insertions(+), 0 deletions(-) >>>=20 >>> Andrew, would you merge this patch please? Thanks. >>>=20 >>> diff --git a/mm/truncate.c b/mm/truncate.c >>> index a956675..ec3d292 100644 >>> --- a/mm/truncate.c >>> +++ b/mm/truncate.c >>> @@ -291,6 +291,13 @@ void truncate_inode_pages_range(struct = address_space *mapping, >>> pagevec_release(&pvec); >>> mem_cgroup_uncharge_end(); >>> } >>> + /* >>> + * Cycle the tree_lock to make sure all = __delete_from_page_cache() >>> + * calls run from page reclaim have finished as well (this = handles the >>> + * case when page reclaim took the last page from our range). >>> + */ >>> + spin_lock_irq(&mapping->tree_lock); >>> + spin_unlock_irq(&mapping->tree_lock); >>> } >>> EXPORT_SYMBOL(truncate_inode_pages_range); >>=20 >> That's one ugly patch. >>=20 >>=20 >> Perhaps this regression was added by Nick's RCUification of = pagecache.=20 >>=20 >> Before that patch, mapping->nrpages and the radix-tree state were >> coherent for holders of tree_lock. So pagevec_lookup() would never >> return "no pages" while ->nrpages is non-zero. >>=20 >> After that patch, find_get_pages() uses RCU to protect the radix-tree >> but I don't think it correctly protects the aggregate (radix-tree + >> nrpages). >=20 > Yes, that's the case. >=20 >>=20 >>=20 >> If it's not that then I see another possibility.=20 >> truncate_inode_pages_range() does >>=20 >> if (mapping->nrpages =3D=3D 0) >> return; >>=20 >> Is there anything to prevent a page getting added to the inode = _after_ >> this test? i_mutex? If not, that would trigger the BUG. >=20 > That BUG is in the inode eviction phase, so there's nothing that could > be adding a page. >=20 > And the only thing that could be removing one is page reclaim. >=20 >> Either way, I don't think that the uglypatch expresses a full >> understanding of te bug ;) >=20 > I don't see a better way, how would we make nrpages update atomically > wrt the radix-tree while using only RCU? >=20 > The question is, does it matter that those two can get temporarily out > of sync? >=20 > In case of inode eviction it does, not only because of that BUG_ON, = but > because page reclaim must be somehow synchronised with eviction. > Otherwise it may access tree_lock on the mapping of an already freed > inode. I tend to think your patch is absolutely ok to fix this problem. = However, I think it would be better to move: spin_lock(&mapping->tree_lock); spin_unlock(&mapping->tree_lock); into end_writeback(). This is because truncate_inode_pages_range() is a = generic function and it will be called somewhere else, maybe = unnecessarily to do this extra thing. Actually, I'd like to hold an inode refcount in page stealing process. = The reason is obvious: it makes no sense to steal pages from a = to-be-freed inode. However, the problem is the overhead to grab an inode = is damned heavy. Thanks, Jinshan >=20 > In other cases? AFAICS it doesn't matter. Most ->nrpages accesses > weren't under tree_lock before Nick's RCUification, so their use were > just optimization. =20 >=20 > Thanks, > Miklos >=20 >=20 --Apple-Mail-1-598770631 Content-Transfer-Encoding: quoted-printable Content-Type: text/html; charset=us-ascii
On Mon, 2011-06-06 at 15:16 -0700, Andrew Morton = wrote:
On Mon, 30 May 2011 11:37:38 = +0200
Jan Kara <jack@suse.cz> = wrote:

Under heavy memory and filesystem load, users observe the = assertion
mapping->nrpages =3D=3D 0 in = end_writeback() trigger. This can be = caused
by page reclaim reclaiming the last page from a mapping in = the following
race:
CPU0 = CPU1
=  ...
=  shrink_page_list()
=    __remove_mapping()
=      __delete_from_page_cache()
<= /blockquote>
=        radix_tree_delete()
= evict_inode()
=  truncate_inode_pages()
=    truncate_inode_pages_range()
=      pagevec_lookup() - finds = nothing
= = = = =  end_writeback()
=    mapping->nrpages !=3D 0 -> = BUG
=        page->mapping =3D = NULL
=        mapping->nrpages--

Fix the problem by cycling the = mapping->tree_lock at the end = of
truncate_inode_pages_range() to synchronize with page = reclaim.

Analyzed by Jay <jinshan.xiong@whamcloud.com>, lost in LKML, and dug
out by Miklos Szeredi <mszeredi@suse.de>.

CC: Jay <jinshan.xiong@whamcloud.com>
CC: stable@kernel.org
Acked-by: = Miklos Szeredi <mszeredi@suse.de>
=
Signed-off-by: Jan Kara <jack@suse.cz>
---
mm/truncate.c | =    7 +++++++
1 files changed, 7 = insertions(+), 0 deletions(-)

Andrew, would you merge this = patch please? Thanks.

diff --git a/mm/truncate.c = b/mm/truncate.c
index a956675..ec3d292 = 100644
--- = a/mm/truncate.c
+++ = b/mm/truncate.c
@@ -291,6 +291,13 @@ void = truncate_inode_pages_range(struct address_space = *mapping,
= pagevec_release(&pvec);
= mem_cgroup_uncharge_end();
= }
+ = /*
+ * Cycle the tree_lock to make = sure all = __delete_from_page_cache()
+ * calls run from page reclaim = have finished as well (this handles = the
+ = * case when page reclaim took the last page from our = range).
+ = */
+ = spin_lock_irq(&mapping->tree_lock);
+ = spin_unlock_irq(&mapping->tree_lock);
= }
= EXPORT_SYMBOL(truncate_inode_pages_range);

That's = one ugly patch.


Perhaps this = regression was added by Nick's RCUification of pagecache. =

Before that patch, mapping->nrpages and the radix-tree = state were
coherent for = holders of tree_lock.  So pagevec_lookup() would = never
return "no pages" while = ->nrpages is non-zero.

After that = patch, find_get_pages() uses RCU to protect the = radix-tree
but I don't think = it correctly protects the aggregate (radix-tree = +
nrpages).

Yes, that's the = case.



If it's not = that then I see another possibility.
truncate_inode_pages_range() = does

=        if (mapping->nrpages =3D=3D = 0)
=             &n= bsp;  return;

Is there = anything to prevent a page getting added to the inode = _after_
this test? =  i_mutex?  If not, that would trigger the = BUG.

That BUG is in the inode eviction phase, so = there's nothing that could
be adding a page.

And the only = thing that could be removing one is page reclaim.

Either way, I don't think that the uglypatch expresses a = full
understanding of te bug = ;)

I don't see a better way, how would we make = nrpages update atomically
wrt the radix-tree while using only = RCU?

The question is, does it matter that those two can get = temporarily out
of sync?

In case of inode eviction it does, = not only because of that BUG_ON, but
because page reclaim must be = somehow synchronised with eviction.
Otherwise it may access tree_lock = on the mapping of an already = freed
inode.

I tend to = think your patch is absolutely ok to fix this problem. However, I think = it would be better to = move:

spin_lock(&mapping->tree_lock);
spin_unlock(&mapping->tree_lock);

in= to end_writeback(). This is because truncate_inode_pages_range() is = a generic function and it will be called somewhere else, maybe = unnecessarily to do this extra thing.

Actually, = I'd like to hold an inode refcount in page stealing process. The reason = is obvious: it makes no sense to steal pages from a to-be-freed inode. = However, the problem is the overhead to grab an inode is damned = heavy.

Thanks,
Jinshan


In other cases?  AFAICS it doesn't = matter.  Most ->nrpages accesses
weren't under tree_lock = before Nick's RCUification, so their use were
just optimization. =   

Thanks,
Miklos


=
= --Apple-Mail-1-598770631-- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org