On Jun 8, 2011, at 9:40 AM, Jan Kara wrote: > On Tue 07-06-11 11:22:48, Jinshan Xiong wrote: >> >> On Jun 6, 2011, at 10:46 PM, Miklos Szeredi wrote: >> >>> On Mon, 2011-06-06 at 15:16 -0700, Andrew Morton wrote: >>>> On Mon, 30 May 2011 11:37:38 +0200 >>>> Jan Kara wrote: >>>> >>>>> Under heavy memory and filesystem load, users observe the assertion >>>>> mapping->nrpages == 0 in end_writeback() trigger. This can be caused >>>>> by page reclaim reclaiming the last page from a mapping in the following >>>>> race: >>>>> CPU0 CPU1 >>>>> ... >>>>> shrink_page_list() >>>>> __remove_mapping() >>>>> __delete_from_page_cache() >>>>> radix_tree_delete() >>>>> evict_inode() >>>>> truncate_inode_pages() >>>>> truncate_inode_pages_range() >>>>> pagevec_lookup() - finds nothing >>>>> end_writeback() >>>>> mapping->nrpages != 0 -> BUG >>>>> page->mapping = NULL >>>>> mapping->nrpages-- >>>>> >>>>> Fix the problem by cycling the mapping->tree_lock at the end of >>>>> truncate_inode_pages_range() to synchronize with page reclaim. >>>>> >>>>> Analyzed by Jay , lost in LKML, and dug >>>>> out by Miklos Szeredi . >>>>> >>>>> CC: Jay >>>>> CC: stable@kernel.org >>>>> Acked-by: Miklos Szeredi >>>>> Signed-off-by: Jan Kara >>>>> --- >>>>> mm/truncate.c | 7 +++++++ >>>>> 1 files changed, 7 insertions(+), 0 deletions(-) >>>>> >>>>> Andrew, would you merge this patch please? Thanks. >>>>> >>>>> diff --git a/mm/truncate.c b/mm/truncate.c >>>>> index a956675..ec3d292 100644 >>>>> --- a/mm/truncate.c >>>>> +++ b/mm/truncate.c >>>>> @@ -291,6 +291,13 @@ void truncate_inode_pages_range(struct address_space *mapping, >>>>> pagevec_release(&pvec); >>>>> mem_cgroup_uncharge_end(); >>>>> } >>>>> + /* >>>>> + * Cycle the tree_lock to make sure all __delete_from_page_cache() >>>>> + * calls run from page reclaim have finished as well (this handles the >>>>> + * case when page reclaim took the last page from our range). >>>>> + */ >>>>> + spin_lock_irq(&mapping->tree_lock); >>>>> + spin_unlock_irq(&mapping->tree_lock); >>>>> } >>>>> EXPORT_SYMBOL(truncate_inode_pages_range); >>>> >>>> That's one ugly patch. >>>> >>>> >>>> Perhaps this regression was added by Nick's RCUification of pagecache. >>>> >>>> Before that patch, mapping->nrpages and the radix-tree state were >>>> coherent for holders of tree_lock. So pagevec_lookup() would never >>>> return "no pages" while ->nrpages is non-zero. >>>> >>>> After that patch, find_get_pages() uses RCU to protect the radix-tree >>>> but I don't think it correctly protects the aggregate (radix-tree + >>>> nrpages). >>> >>> Yes, that's the case. >>> >>>> >>>> >>>> If it's not that then I see another possibility. >>>> truncate_inode_pages_range() does >>>> >>>> if (mapping->nrpages == 0) >>>> return; >>>> >>>> Is there anything to prevent a page getting added to the inode _after_ >>>> this test? i_mutex? If not, that would trigger the BUG. >>> >>> That BUG is in the inode eviction phase, so there's nothing that could >>> be adding a page. >>> >>> And the only thing that could be removing one is page reclaim. >>> >>>> Either way, I don't think that the uglypatch expresses a full >>>> understanding of te bug ;) >>> >>> I don't see a better way, how would we make nrpages update atomically >>> wrt the radix-tree while using only RCU? >>> >>> The question is, does it matter that those two can get temporarily out >>> of sync? >>> >>> In case of inode eviction it does, not only because of that BUG_ON, but >>> because page reclaim must be somehow synchronised with eviction. >>> Otherwise it may access tree_lock on the mapping of an already freed >>> inode. >> >> I tend to think your patch is absolutely ok to fix this problem. However, I think it would be better to move: >> >> spin_lock(&mapping->tree_lock); >> spin_unlock(&mapping->tree_lock); >> >> into end_writeback(). This is because truncate_inode_pages_range() is a >> generic function and it will be called somewhere else, maybe >> unnecessarily to do this extra thing. > Possible. I just thought it would be nice from > truncate_inode_pages_range() to return only after we are really sure there > are no outstanding pages in the requested range... > >> Actually, I'd like to hold an inode refcount in page stealing process. >> The reason is obvious: it makes no sense to steal pages from a >> to-be-freed inode. However, the problem is the overhead to grab an inode >> is damned heavy. > No a good idea I think. If you happen to be the last one to drop inode > reference, you have to handle inode deletion and you really want to limit > places from where that can happen because that needs all sorts of > filesystem locks etc. Indeed. Thanks for pointing it out. > > Honza > -- > Jan Kara > SUSE Labs, CR