From mboxrd@z Thu Jan 1 00:00:00 1970 Date: Mon, 20 Oct 2008 08:17:35 -0700 (PDT) From: Linus Torvalds Subject: Re: [patch] mm: fix anon_vma races In-Reply-To: Message-ID: References: <20081016041033.GB10371@wotan.suse.de> <1224285222.10548.22.camel@lappy.programming.kicks-ass.net> <20081018013258.GA3595@wotan.suse.de> <20081018022541.GA19018@wotan.suse.de> <20081018052046.GA26472@wotan.suse.de> <1224326299.28131.132.camel@twins> <1224413500.10548.55.camel@lappy.programming.kicks-ass.net> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org Return-Path: To: Hugh Dickins Cc: Peter Zijlstra , Nick Piggin , Linux Memory Management List List-ID: On Mon, 20 Oct 2008, Hugh Dickins wrote: > > When you say "to the point where we don't need to care about anything > else", are you there agreeing with Nick that your smp_wmb() and > smp_read_barrier_depends() are no longer needed? Yes. The anon_vma only has two fields: the spinlock itself, and the list. And with all list allocations being inside the spinlock, and with the spinlock itself being a memory barrier, I'm now convinced that the worry about memory ordering was unnecessary. Well, not unnecessary, because I think the discussion was good, and I think we fixed another bug, but the smp_wmb++smp_read_barrier_depends does seem to be a non-issue in this path. > But this is all _irrelevant_ : the tricky test to worry about in > page_lock_anon_vma() is of page_mapped() i.e. does this page currently > have any ptes in userspace, not of page_mapping() or page->mapping. I'm not arguing for removing the page_mapped() we have now, I'm just wondering about the one Nick wanted to add at the end. > In the case of file pages, it is in some places crucial to check > page->mapping against a racing file truncation; but there's no such > issue with anon pages, their page->mapping pointing to anon_vma is > left set until the page is finally freed, it is not volatile. > > But in the case of anon pages, what page->mapping points to may be > volatile, in the sense that that memory might at some point get reused > for a different anon_vma, or the slab page below it get freed and > reused for a different purpose completely: that's what we have to > careful of in the case of anon pages, and it's RCU and the > page_mapped() test which guard that. .. and I'm not worried about the slab page. It's stable, since we hold the RCU read-lock. No worries about that one either. It's the "struct page" itself - the one we use for lookup in page_lock_anon_vma(). And I'm worried about the need for *re-doing* the page_mapped() test. The problem that makes "page_lock_anon_vma()" such a total disaster is that yes, it does locking, but it does locking _after_ the lookup, and the lock doesn't actually protect any of the data that it is using for the lookup itself. And yes, we have various tricks to try to make the data "safe" even if we race with the lookup, like the RCU stability of the anon_vma allocation, so that even if we race, we don't do anything bad. And I don't worry about the anon_vma, that part looks fine. But page_remove_rmap() is run totally unlocked wrt page_lock_anon_vma(), it looks like. page_remove_rmap() is run under the pte lock, and page_lock_anon_vma() is run under no lock at all that I can see that is reliable. So yes, we have the same kind of "delay the destroy" wrt page->mapping (ie page_remove_rmap() doesn't actually clear page->mapping at all), but none of this runs under any lock at all. So as far as I can tell, the only reason "page_lock_anon_vma()" is safe is because we hopefully always hold an elevated page count when we call it, so at least the struct page itself will never get freed, so page->mapping is safe just because it's not cleared. But assuming all that is true, it boils down to this: - the anon_vma we get may be the wrong one or a stale one (since page_remove_rmap ran concurrently and we ended up freeing the anon_vma), but it's always a "valid" anon_vma in the sense that it's initialized and the list is always pointing to *some* stable set of vm_area_struct's. - if we raced, and the anon_vma is stale, we're going to walk over some bogus list that pertains to a totally different page, but WHO REALLY CARES? If it is about another page that got that anon_vma reallocated to it, all the code that traverses the list of vma's still has to check the page tables _anyway_. And that will never trigger, and that will get the pte lock for checking anyway, so at _that_ point do we correctly finally synchronize with a lock that is actually relevant. - the "anon_vma->lock" is totally irrelevant wrt "page_mapped()", and I'm not seeing what it could possibly help to re-check after getting that lock. So what I'm trying to figure out is why Nick wanted to add another check for page_mapped(). I'm not seeing what it is supposed to protect against. (Yes, we have checks for "page_mapped()" inside the "try_tp_unmap_xyz()" loops, but those are for a different reason - they're there to exit the loop early when we know there's no point. They don't claim to be about locking serialization). Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org