* [patch] mm: fix anon_vma races
@ 2008-10-16 4:10 Nick Piggin
2008-10-17 22:14 ` Hugh Dickins
2008-10-17 23:13 ` Peter Zijlstra
0 siblings, 2 replies; 52+ messages in thread
From: Nick Piggin @ 2008-10-16 4:10 UTC (permalink / raw)
To: Hugh Dickins, Linux Memory Management List
Hi,
I'd still like independent confirmation of these problems and the fix, but
it looks buggy to me...
---
There are some races in the anon_vma code.
The race comes about because adding our vma to the tail of anon_vma->head comes
after the assignment of vma->anon_vma to a new anon_vma pointer. In the case
where this is a new anon_vma, this is done without holding any locks. So a
parallel anon_vma_prepare might see the vma already has an anon_vma, so it
won't serialise on the page_table_lock. It may proceed to attach the anon_vma to a
page in the page fault path. Another thread may then pick up the page from the
LRU list, find its mapcount incremented, and attempt to iterate over the
anon_vma's list concurrently with the first thread (because the first one is
not holding the anon_vma lock). This is a fairly subtle race, and only likely
to be hit in kernels where the spinlock is preemptible and the first thread is
preempted at the right time... but OTOH it is _possible_ to hit it; on bigger
SMP systems cacheline transfer latencies could be very large, or we could take
an NMI inside the lock or something. Fix this by initialising the list before
adding the anon_vma to vma.
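
To make the interleaving concrete, here is a schematic sketch (illustrative
only, not the actual rmap.c source) of how the unlocked publication can be
observed:

	/*
	 * CPU0: anon_vma_prepare()          CPU1: parallel fault, same vma
	 * ------------------------          -----------------------------
	 * anon_vma = anon_vma_alloc();
	 * vma->anon_vma = anon_vma;
	 *                                   if (vma->anon_vma)   // non-NULL,
	 *                                           return 0;    // so skips the
	 *                                                        // page_table_lock
	 *                                   ...
	 *                                   page_add_anon_rmap(page, vma, address);
	 *                                   // the page now points at an anon_vma
	 *                                   // whose list is still being set up
	 * list_add_tail(&vma->anon_vma_node,
	 *               &anon_vma->head);
	 *
	 * A third CPU doing reclaim can find the page on the LRU and walk
	 * anon_vma->head concurrently with CPU0's insertion, which is done
	 * without the anon_vma lock held.
	 */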
After that, there is a similar data-race with memory ordering where the store
to make the anon_vma visible passes previous stores to initialize the anon_vma.
This race also includes stores to initialize the anon_vma spinlock by the
slab constructor. Add and comment appropriate barriers to solve this.
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -81,8 +81,15 @@ int anon_vma_prepare(struct vm_area_stru
 		/* page_table_lock to protect against threads */
 		spin_lock(&mm->page_table_lock);
 		if (likely(!vma->anon_vma)) {
-			vma->anon_vma = anon_vma;
 			list_add_tail(&vma->anon_vma_node, &anon_vma->head);
+			/*
+			 * This smp_wmb() is required to order all previous
+			 * stores to initialize the anon_vma (by the slab
+			 * ctor) and add this vma, with the store to make it
+			 * visible to other CPUs via vma->anon_vma.
+			 */
+			smp_wmb();
+			vma->anon_vma = anon_vma;
 			allocated = NULL;
 		}
 		spin_unlock(&mm->page_table_lock);
@@ -91,6 +98,15 @@ int anon_vma_prepare(struct vm_area_stru
 			spin_unlock(&locked->lock);
 		if (unlikely(allocated))
 			anon_vma_free(allocated);
+	} else {
+		/*
+		 * This smp_read_barrier_depends is required to order the data
+		 * dependent loads of fields in anon_vma, with the load of the
+		 * anon_vma pointer vma->anon_vma. This complements the above
+		 * smp_wmb, and prevents a CPU from loading uninitialized
+		 * contents of anon_vma.
+		 */
+		smp_read_barrier_depends();
 	}
 	return 0;
 }

* Re: [patch] mm: fix anon_vma races
  2008-10-16  4:10 [patch] mm: fix anon_vma races Nick Piggin
@ 2008-10-17 22:14 ` Hugh Dickins
  2008-10-17 23:05 ` Linus Torvalds
  2008-10-17 23:13 ` Peter Zijlstra
  1 sibling, 1 reply; 52+ messages in thread
From: Hugh Dickins @ 2008-10-17 22:14 UTC (permalink / raw)
To: Nick Piggin; +Cc: Linus Torvalds, linux-kernel, linux-mm

Added Linus and LKML to the Cc now that I see he's suggesting a similar
but not identical patch under that other thread.

On Thu, 16 Oct 2008, Nick Piggin wrote:
> Hi,
>
> I'd still like independent confirmation of these problems and the fix, but
> it looks buggy to me...
>
> ---
>
> There are some races in the anon_vma code.
>
> The race comes about because adding our vma to the tail of anon_vma->head comes
> after the assignment of vma->anon_vma to a new anon_vma pointer. In the case
> where this is a new anon_vma, this is done without holding any locks. So a
> parallel anon_vma_prepare might see the vma already has an anon_vma, so it
> won't serialise on the page_table_lock. It may proceed to attach the anon_vma
> to a page in the page fault path. Another thread may then pick up the page
> from the LRU list, find its mapcount incremented, and attempt to iterate over
> the anon_vma's list concurrently with the first thread (because the first one
> is not holding the anon_vma lock). This is a fairly subtle race, and only
> likely to be hit in kernels where the spinlock is preemptible and the first
> thread is preempted at the right time... but OTOH it is _possible_ to hit it;
> on bigger SMP systems cacheline transfer latencies could be very large, or we
> could take an NMI inside the lock or something. Fix this by initialising the
> list before adding the anon_vma to vma.
>
> After that, there is a similar data-race with memory ordering where the store
> to make the anon_vma visible passes previous stores to initialize the anon_vma.
> This race also includes stores to initialize the anon_vma spinlock by the
> slab constructor. Add and comment appropriate barriers to solve this.
>
> Signed-off-by: Nick Piggin <npiggin@suse.de>

Ow, you do break my head with these things. I'm pretty sure you're on
to something here, it certainly looks suspicious. But I've not yet
convinced myself that what you (or Linus) propose is the right answer -
I was going to think it through some more, but spotting the other patch
from Linus, I thought I ought at least to introduce you to each other ;)

My problem is really with the smp_read_barrier_depends() you each
have in anon_vma_prepare(). But the only thing which its CPU
does with the anon_vma is put its address into a struct page
(or am I forgetting more?). Wouldn't the smp_read_barrier_depends()
need to be, not there in anon_vma_prepare(), but over on the third
CPU, perhaps in page_lock_anon_vma()?

I've also been toying with the idea of taking the newly alloc'ed
anon_vma's lock here in the !vma->anon_vma case, instead of having
that "locked = NULL" business. That might be a good idea, but I'm
rather thinking it doesn't quite address the issue at hand.

Taking a break...
Hugh

> ---
> Index: linux-2.6/mm/rmap.c
> ===================================================================
> --- linux-2.6.orig/mm/rmap.c
> +++ linux-2.6/mm/rmap.c
> @@ -81,8 +81,15 @@ int anon_vma_prepare(struct vm_area_stru
>  		/* page_table_lock to protect against threads */
>  		spin_lock(&mm->page_table_lock);
>  		if (likely(!vma->anon_vma)) {
> -			vma->anon_vma = anon_vma;
>  			list_add_tail(&vma->anon_vma_node, &anon_vma->head);
> +			/*
> +			 * This smp_wmb() is required to order all previous
> +			 * stores to initialize the anon_vma (by the slab
> +			 * ctor) and add this vma, with the store to make it
> +			 * visible to other CPUs via vma->anon_vma.
> +			 */
> +			smp_wmb();
> +			vma->anon_vma = anon_vma;
>  			allocated = NULL;
>  		}
>  		spin_unlock(&mm->page_table_lock);
> @@ -91,6 +98,15 @@ int anon_vma_prepare(struct vm_area_stru
>  			spin_unlock(&locked->lock);
>  		if (unlikely(allocated))
>  			anon_vma_free(allocated);
> +	} else {
> +		/*
> +		 * This smp_read_barrier_depends is required to order the data
> +		 * dependent loads of fields in anon_vma, with the load of the
> +		 * anon_vma pointer vma->anon_vma. This complements the above
> +		 * smp_wmb, and prevents a CPU from loading uninitialized
> +		 * contents of anon_vma.
> +		 */
> +		smp_read_barrier_depends();
>  	}
>  	return 0;
>  }

* Re: [patch] mm: fix anon_vma races
  2008-10-17 22:14 ` Hugh Dickins
@ 2008-10-17 23:05 ` Linus Torvalds
  2008-10-18  0:13 ` Hugh Dickins
  0 siblings, 1 reply; 52+ messages in thread
From: Linus Torvalds @ 2008-10-17 23:05 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Nick Piggin, linux-kernel, linux-mm

On Fri, 17 Oct 2008, Hugh Dickins wrote:
>
> My problem is really with the smp_read_barrier_depends() you each
> have in anon_vma_prepare(). But the only thing which its CPU
> does with the anon_vma is put its address into a struct page
> (or am I forgetting more?). Wouldn't the smp_read_barrier_depends()
> need to be, not there in anon_vma_prepare(), but over on the third
> CPU, perhaps in page_lock_anon_vma()?

I thought about it, but it's a disaster from a maintenance standpoint to
put it there, rather than make it all clear in the _one_ function that
actually does things optimistically.

I agree that it's a bit subtle the way I did it (haven't seen Nick's
patch, I assume he was upset at me for shouting at him), but that's part
of why I put that comment in there and said things are subtle.

Anyway, technically you're right: the smp_read_barrier_depends() really
would be more obvious in the place where we actually fetch that "anon_vma"
pointer again and actually dereference it.

HOWEVER:

 - there are potentially multiple places that do that, and putting it in
   the anon_vma_prepare() thing not only matches things with the
   smp_wmb(), making that whole pairing much more obvious, but it also
   means that we're guaranteed that any anon_vma user will have done the
   smp_read_barrier_depends(), since they all have to do that prepare
   thing anyway.

   So putting it there is simpler and gives better guarantees, and pairs
   up the barriers better.

 - Now, "simpler" (etc) is no help if it doesn't work, so now I have to
   convince you that it's _sufficient_ to do that "read_barrier_depends()"
   early, even if we then end up re-doing the first read and thus the
   "depends" part doesn't work any more. So "simpler" is all good, but not
   if it's incorrect.

   And I admit it, here my argument is one of implementation. The fact is,
   the only architecture where "read_barrier_depends()" exists at all as
   anything but a no-op is alpha, and there it's a full read barrier. On
   all other architectures, causality implies a read barrier anyway, so
   for them, placement (or non-placement) of the smp_read_barrier_depends
   is a total non-issue.

   And so, since on the only architecture where it could possibly matter,
   that _depends thing turns into a full read barrier, and since
   "anon_vma" is actually stable once written, and since the only
   ordering constraint is that initial ordering of seeing the "anon_vma"
   turn non-NULL, you may as well think of that "read_barrier_depends()"
   as a full read barrier between the _original_ read of the anon_vma
   pointer and then the read of the lock data we want to protect.

   Which it is, on alpha. And that is sufficient. IOW, think of it as a
   real read barrier, with no dependency thing, but one that only happens
   when an architecture doesn't already guarantee the causality barrier.

   And once you think of it as a "smp_rmb() for alpha", you realize that
   it's perfectly ok for it to be where it is.

Anyway, lockless is bad. It would certainly be a *lot* simpler to just
take the page_table_lock around the whole thing, except I think we really
*really* don't want to do that. That thing is solidly in a couple of
*very* timing-critical routines. Doing another lock there is just not an
option.
		Linus

* Re: [patch] mm: fix anon_vma races
  2008-10-17 23:05 ` Linus Torvalds
@ 2008-10-18  0:13 ` Hugh Dickins
  2008-10-18  0:25 ` Linus Torvalds
  2008-10-18  1:53 ` Nick Piggin
  0 siblings, 2 replies; 52+ messages in thread
From: Hugh Dickins @ 2008-10-18 0:13 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Nick Piggin, linux-kernel, linux-mm

On Fri, 17 Oct 2008, Linus Torvalds wrote:
> On Fri, 17 Oct 2008, Hugh Dickins wrote:
> >
> > My problem is really with the smp_read_barrier_depends() you each
> > have in anon_vma_prepare(). But the only thing which its CPU
> > does with the anon_vma is put its address into a struct page
> > (or am I forgetting more?). Wouldn't the smp_read_barrier_depends()
> > need to be, not there in anon_vma_prepare(), but over on the third
> > CPU, perhaps in page_lock_anon_vma()?
>
> I thought about it, but it's a disaster from a maintenance standpoint to
> put it there, rather than make it all clear in the _one_ function that
> actually does things optimistically.
>
> I agree that it's a bit subtle the way I did it (haven't seen Nick's
> patch, I assume he was upset at me for shouting at him), but that's part
> of why I put that comment in there and said things are subtle.

Nick's patch was below in the mail you're replying to - I expect it
looked so much like yours that, at a glance, you thought it was yours -
though there are little differences. (Yes, I think he felt he'd make
more progress by backing away from your harangues and taking cover with
a patch to linux-mm - though surely you were right that his original
ctor angle was mistaken.)

> Anyway, technically you're right: the smp_read_barrier_depends() really
> would be more obvious in the place where we actually fetch that "anon_vma"
> pointer again and actually dereference it.
>
> HOWEVER:
>
>  - there are potentially multiple places that do that, and putting it in
>    the anon_vma_prepare() thing not only matches things with the
>    smp_wmb(), making that whole pairing much more obvious, but it also
>    means that we're guaranteed that any anon_vma user will have done the
>    smp_read_barrier_depends(), since they all have to do that prepare
>    thing anyway.

No, it's not so that any anon_vma user would have done the
smp_read_barrier_depends() placed in anon_vma_prepare().

Anyone faulting in a page would have done it (swapoff? that
assumes it's been done, let's not worry about it right now).

But they're doing it to make the page's ptes accessible to
memory reclaim, and the CPU doing memory reclaim will not
(unless by coincidence) have done that anon_vma_prepare() -
it's just reading the links which the faulters are providing.

But I've given up thought for the night, will leave digesting all
you've written until morning, just wanted to point that out before
sleeping.

Hugh

>    So putting it there is simpler and gives better guarantees, and pairs
>    up the barriers better.
>
>  - Now, "simpler" (etc) is no help if it doesn't work, so now I have to
>    convince you that it's _sufficient_ to do that "read_barrier_depends()"
>    early, even if we then end up re-doing the first read and thus the
>    "depends" part doesn't work any more. So "simpler" is all good, but not
>    if it's incorrect.
>
>    And I admit it, here my argument is one of implementation. The fact is,
>    the only architecture where "read_barrier_depends()" exists at all as
>    anything but a no-op is alpha, and there it's a full read barrier.
>    On all other architectures, causality implies a read barrier anyway,
>    so for them, placement (or non-placement) of the
>    smp_read_barrier_depends is a total non-issue.
>
>    And so, since on the only architecture where it could possibly matter,
>    that _depends thing turns into a full read barrier, and since
>    "anon_vma" is actually stable once written, and since the only
>    ordering constraint is that initial ordering of seeing the "anon_vma"
>    turn non-NULL, you may as well think of that "read_barrier_depends()"
>    as a full read barrier between the _original_ read of the anon_vma
>    pointer and then the read of the lock data we want to protect.
>
>    Which it is, on alpha. And that is sufficient. IOW, think of it as a
>    real read barrier, with no dependency thing, but one that only happens
>    when an architecture doesn't already guarantee the causality barrier.
>
>    And once you think of it as a "smp_rmb() for alpha", you realize that
>    it's perfectly ok for it to be where it is.
>
> Anyway, lockless is bad. It would certainly be a *lot* simpler to just
> take the page_table_lock around the whole thing, except I think we really
> *really* don't want to do that. That thing is solidly in a couple of
> *very* timing-critical routines. Doing another lock there is just not an
> option.
>
> 		Linus

* Re: [patch] mm: fix anon_vma races
  2008-10-18  0:13 ` Hugh Dickins
@ 2008-10-18  0:25 ` Linus Torvalds
  0 siblings, 0 replies; 52+ messages in thread
From: Linus Torvalds @ 2008-10-18 0:25 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Nick Piggin, linux-kernel, linux-mm

On Sat, 18 Oct 2008, Hugh Dickins wrote:
>
> But they're doing it to make the page's ptes accessible to
> memory reclaim, and the CPU doing memory reclaim will not
> (unless by coincidence) have done that anon_vma_prepare() -
> it's just reading the links which the faulters are providing.

Ahh. Good point. Then yes, those ones would really need the
smp_read_barrier_depends() too. Very annoying.

Of course, we _could_ just admit that the situation is really *really*
unlikely, and there are probably something like five people running
Linux/alpha, and that we don't care enough. With just the smp_wmb(), we
cover all non-alpha people, since this is all through a pointer, so all
the readers will inevitably be of the smp_read_barrier_depends kind.

If we want to guarantee the proper smp_read_barrier_depends() behaviour,
then we'd need to find them all. Possibly by renaming the whole field
(prepend an underscore like we usually do) and forcing all readers to use
some "get_anon_vma(vma)" helper function, aka

	static inline struct anon_vma *get_anon_vma(struct vm_area_struct *vma)
	{
		struct anon_vma *anon_vma = vma->_anon_vma;
		smp_read_barrier_depends();
		return anon_vma;
	}

which would find them all.

Ugh. I really would have preferred not having something like that.
Especially considering how limited that issue really is. Hmm.

		Linus

* Re: [patch] mm: fix anon_vma races
  2008-10-18  0:13 ` Hugh Dickins
  2008-10-18  0:25 ` Linus Torvalds
@ 2008-10-18  1:53 ` Nick Piggin
  2008-10-18  2:50 ` Paul Mackerras
  1 sibling, 1 reply; 52+ messages in thread
From: Nick Piggin @ 2008-10-18 1:53 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Linus Torvalds, linux-kernel, linux-mm, linux-arch

On Sat, Oct 18, 2008 at 01:13:16AM +0100, Hugh Dickins wrote:
> On Fri, 17 Oct 2008, Linus Torvalds wrote:
> > would be more obvious in the place where we actually fetch that "anon_vma"
> > pointer again and actually dereference it.
> >
> > HOWEVER:
> >
> >  - there are potentially multiple places that do that, and putting it in
> >    the anon_vma_prepare() thing not only matches things with the
> >    smp_wmb(), making that whole pairing much more obvious, but it also
> >    means that we're guaranteed that any anon_vma user will have done the
> >    smp_read_barrier_depends(), since they all have to do that prepare
> >    thing anyway.
>
> No, it's not so that any anon_vma user would have done the
> smp_read_barrier_depends() placed in anon_vma_prepare().
>
> Anyone faulting in a page would have done it (swapoff? that
> assumes it's been done, let's not worry about it right now).
>
> But they're doing it to make the page's ptes accessible to
> memory reclaim, and the CPU doing memory reclaim will not
> (unless by coincidence) have done that anon_vma_prepare() -
> it's just reading the links which the faulters are providing.

Yes, that's a very important flaw you point out with the fix. Good
spotting.

Actually another thing I was staying awake thinking about was the
pairwise consistency problem. "Apparently" Linux is supposed to support
arbitrary pairwise consistency. This means:

CPU0
	anon_vma->initialized = 1;
	smp_wmb();
	vma->anon_vma = anon_vma;

CPU1
	if (vma->anon_vma)
		page->anon_vma = vma->anon_vma;

CPU2
	if (page->anon_vma) {
		smp_read_barrier_depends();
		assert(page->anon_vma->initialized);
	}

The assertion may trigger because the store from CPU0 may not have
propagated to CPU2 before the stores from CPU1.

But after thinking about this a bit more, I think Linux would be
broken all over the map under such ordering schemes. I think we'd
have to mandate causal consistency. Are there any architectures we
run on where this is not guaranteed? (I think recent clarifications
to x86 ordering give us CC on that architecture).

powerpc, ia64, alpha, sparc, arm, mips? (cced linux-arch)

* Re: [patch] mm: fix anon_vma races
  2008-10-18  1:53 ` Nick Piggin
@ 2008-10-18  2:50 ` Paul Mackerras
  2008-10-18  2:57 ` Linus Torvalds
  2008-10-18  5:49 ` Nick Piggin
  0 siblings, 2 replies; 52+ messages in thread
From: Paul Mackerras @ 2008-10-18 2:50 UTC (permalink / raw)
To: Nick Piggin
Cc: Hugh Dickins, Linus Torvalds, linux-kernel, linux-mm, linux-arch

Nick Piggin writes:

> But after thinking about this a bit more, I think Linux would be
> broken all over the map under such ordering schemes. I think we'd
> have to mandate causal consistency. Are there any architectures we
> run on where this is not guaranteed? (I think recent clarifications
> to x86 ordering give us CC on that architecture).
>
> powerpc, ia64, alpha, sparc, arm, mips? (cced linux-arch)

Not sure what you mean by causal consistency, but I assume it's the
same as saying that barriers give cumulative ordering, as described on
page 413 of the Power Architecture V2.05 document at:

http://www.power.org/resources/reading/PowerISA_V2.05.pdf

The ordering provided by sync, lwsync and eieio is cumulative (see
pages 446 and 448), so we should be OK on powerpc AFAICS. (The
cumulative property of eieio only applies to accesses to normal system
memory, but that should be OK since we use sync when we want barriers
that affect non-cacheable accesses as well as cacheable.)

Paul.

* Re: [patch] mm: fix anon_vma races
  2008-10-18  2:50 ` Paul Mackerras
@ 2008-10-18  2:57 ` Linus Torvalds
  0 siblings, 0 replies; 52+ messages in thread
From: Linus Torvalds @ 2008-10-18 2:57 UTC (permalink / raw)
To: Paul Mackerras
Cc: Nick Piggin, Hugh Dickins, linux-kernel, linux-mm, linux-arch

On Sat, 18 Oct 2008, Paul Mackerras wrote:
>
> Not sure what you mean by causal consistency, but I assume it's the
> same as saying that barriers give cumulative ordering, as described on
> page 413 of the Power Architecture V2.05 document at:

I'm pretty sure that everybody but alpha is ok.

And alpha needs the smp_read_barrier_depends() not because it doesn't
really support causality, but because each CPU internally doesn't
guarantee that it handles the cache invalidates in-order without a
barrier.

So without the smp_read_barrier_depends(), alpha will actually have the
proper causal relationships (cachelines will move to exclusive state on
CPU0 in the right order and others will see the causality), but CPU2 may
see stale data because it has not yet invalidated "anon_vma.initialized" -
its cache invalidation queue hadn't been flushed in order.

Alpha is insane. And the odd man out.

		Linus
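
By way of a compilable userspace analogue of the publish/consume pairing
being discussed (all names here are illustrative; C11's memory_order_consume
is the standard's counterpart of the kernel's dependent-load ordering, and
the release store plays the role of smp_wmb() plus the pointer store):

	#include <stdatomic.h>
	#include <stdio.h>
	#include <stdlib.h>

	struct obj {
		int initialized;	/* stands in for ctor-initialised anon_vma fields */
	};

	static _Atomic(struct obj *) shared;	/* stands in for vma->anon_vma */

	/* Publisher: the release store orders the initialising store before
	 * the pointer becomes visible (the smp_wmb() half of the pairing). */
	static void publish(void)
	{
		struct obj *p = malloc(sizeof(*p));
		p->initialized = 1;
		atomic_store_explicit(&shared, p, memory_order_release);
	}

	/* Consumer: a dependent load through the published pointer. On every
	 * architecture except alpha the address dependency alone orders the
	 * two loads; memory_order_consume (promoted to acquire by current
	 * compilers) makes that explicit, the way smp_read_barrier_depends()
	 * does in the kernel. */
	static void consume(void)
	{
		struct obj *p = atomic_load_explicit(&shared, memory_order_consume);
		if (p)
			printf("%d\n", p->initialized);	/* may only print 1 */
	}

	int main(void)
	{
		publish();	/* single-threaded here; imagine the two calls
				 * running on two different CPUs */
		consume();
		return 0;
	}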

* Re: [patch] mm: fix anon_vma races
  2008-10-18  2:50 ` Paul Mackerras
  2008-10-18  2:57 ` Linus Torvalds
@ 2008-10-18  5:49 ` Nick Piggin
  2008-10-18 10:49 ` Paul Mackerras
  2008-10-18 17:00 ` Linus Torvalds
  1 sibling, 2 replies; 52+ messages in thread
From: Nick Piggin @ 2008-10-18 5:49 UTC (permalink / raw)
To: Paul Mackerras
Cc: Hugh Dickins, Linus Torvalds, linux-kernel, linux-mm, linux-arch

On Sat, Oct 18, 2008 at 01:50:57PM +1100, Paul Mackerras wrote:
> Nick Piggin writes:
>
> > But after thinking about this a bit more, I think Linux would be
> > broken all over the map under such ordering schemes. I think we'd
> > have to mandate causal consistency. Are there any architectures we
> > run on where this is not guaranteed? (I think recent clarifications
> > to x86 ordering give us CC on that architecture).
> >
> > powerpc, ia64, alpha, sparc, arm, mips? (cced linux-arch)
>
> Not sure what you mean by causal consistency, but I assume it's the

I think it can be called transitive. Basically (assume memory starts
off zeroed):

CPU0
	x := 1

CPU1
	if (x == 1) {
		fence
		y := 1
	}

CPU2
	if (y == 1) {
		fence
		assert(x == 1)
	}

As opposed to pairwise, which only provides an ordering of visibility
between any given two CPUs (so the store to x might be propagated to
CPU2 after the store to y, regardless of the fences).

Apparently pairwise ordering is more interesting than just a theoretical
thing, and not just restricted to Alpha's funny caches. It can allow for
arbitrary network propagation of stores / cache coherency between CPUs.
x86's publicly documented memory model supposedly could allow for such
ordering up until a year or so ago (when they clarified and strengthened
it).

> same as saying that barriers give cumulative ordering, as described on
> page 413 of the Power Architecture V2.05 document at:
>
> http://www.power.org/resources/reading/PowerISA_V2.05.pdf
>
> The ordering provided by sync, lwsync and eieio is cumulative (see
> pages 446 and 448), so we should be OK on powerpc AFAICS. (The
> cumulative property of eieio only applies to accesses to normal system
> memory, but that should be OK since we use sync when we want barriers
> that affect non-cacheable accesses as well as cacheable.)

The section on cumulative ordering sounds like it might do the trick.
But I haven't really worked through exactly what it is saying ;)

* Re: [patch] mm: fix anon_vma races
  2008-10-18  5:49 ` Nick Piggin
@ 2008-10-18 10:49 ` Paul Mackerras
  0 siblings, 0 replies; 52+ messages in thread
From: Paul Mackerras @ 2008-10-18 10:49 UTC (permalink / raw)
To: Nick Piggin
Cc: Hugh Dickins, Linus Torvalds, linux-kernel, linux-mm, linux-arch

Nick Piggin writes:

> > Not sure what you mean by causal consistency, but I assume it's the
>
> I think it can be called transitive. Basically (assume memory starts
> off zeroed):
>
> CPU0
> 	x := 1
>
> CPU1
> 	if (x == 1) {
> 		fence
> 		y := 1
> 	}
>
> CPU2
> 	if (y == 1) {
> 		fence
> 		assert(x == 1)
> 	}

That's essentially the same as example 1 on page 415, so yes we are
talking about the same thing.

Paul.

* Re: [patch] mm: fix anon_vma races
  2008-10-18  5:49 ` Nick Piggin
  2008-10-18 10:49 ` Paul Mackerras
@ 2008-10-18 17:00 ` Linus Torvalds
  2008-10-18 18:44 ` Matthew Wilcox
  2008-10-19  2:53 ` Nick Piggin
  1 sibling, 2 replies; 52+ messages in thread
From: Linus Torvalds @ 2008-10-18 17:00 UTC (permalink / raw)
To: Nick Piggin
Cc: Paul Mackerras, Hugh Dickins, linux-kernel, linux-mm, linux-arch

On Sat, 18 Oct 2008, Nick Piggin wrote:
>
> I think it can be called transitive. Basically (assume memory starts
> off zeroed):

Alpha is transitive. It has a notion of "processor issue order" and
"location access order", and the ordering those two create is a
transitive "before" and "after" ordering.

The issue with alpha is not that it wouldn't be transitive - the issue is
that *local* read dependencies do not cause a "processor issue order"!

So the real issue with alpha is not about any big pair-wise ordering vs
transitive thing, the big issue is that alpha's totally _local_ and
per-core orderings are so totally screwed up, and are defined to be very
loose - because back when alpha was designed, loose memory ordering was
thought to be a good thing for performance.

They were wrong, but that was mainly because the alpha designers lived in
a time when threading wasn't really even an issue. They were optimizing
purely for the case where memory ordering doesn't matter, and considered
locking etc to be one of those non-RISCy rare operations that can be
basically ignored.

> CPU0
> 	x := 1

So this creates a "location access" event on 'x' on alpha, call it "event
A".

> CPU1
> 	if (x == 1) {
> 		fence
> 		y := 1
> 	}

This has two events: let's call the read of 'x' "B", and "C" is the write
to 'y'.

And according to the alpha rules, we now have:

 - A << B

   Because we saw a '1' in B, we now have a "location access ordering"
   on the _same_ variable between A and B.

 - B < C

   Because we have the fence in between the read and the write, we now
   have a "processor issue order" between B and C (despite the fact that
   they are different variables).

And now, the alpha definition of "before" means that we can declare that A
is before C.

But on alpha, we really do need that fence, even if the address of 'y' was
somehow directly data-dependent on 'x'. THAT is what makes alpha special,
not any odd ordering rules.

> CPU2
> 	if (y == 1) {
> 		fence
> 		assert(x == 1)
> 	}

So again, we now have two events: the access of 'y' is "D", and the access
of x is "E". And again, according to the alpha rules, we have two
orderings:

 - C << D

   Because we saw a '1' in D, we have another "location access ordering"
   on the variable 'y' between C and D.

 - D < E

   Because of the fence, we have a "processor issue ordering" between D
   and E.

And for the same reason as above, we now get that C is "before" E
according to the alpha ordering rules. And because the definition of
"before" is transitive, then A is before E.

And that, in turn, means that that assert() can never trigger, because if
it triggered, then by the access ordering rules that would imply that
E << A, which would mean that E is "before" A, which in turn would violate
the whole chain we just got to.

So while the alpha architecture manual doesn't have the exact sequence
mentioned above (it only has nine so-called "Litmus tests"), it's fairly
close to Litmus test 3, and the ordering on alpha is very clear: it's all
transitive and causal (ie "before" can never be "after").

> Apparently pairwise ordering is more interesting than just a theoretical
> thing, and not just restricted to Alpha's funny caches.
Nobody does just pairwise ordering, afaik. It's an insane model. Everybody
does some form of transitive ordering.

The real (and only really odd) issue with alpha is that for everybody
else, if you have

	read x -> data dependency -> read y

(ie you read a pointer and dereference it, or you read an index and
dereference an array through it), then on all other architectures that
will imply a local processor ordering, which in turn will be part of the
whole transitive order of operations.

On alpha, it doesn't. You can think of it as alpha doing value speculation
(ie allowing speculative reads even across data dependencies), so on
alpha, you could imagine a CPU doing address speculation, and turning the
two reads into a sequence of

 (a) read off speculative pointer '*p'
 (b) read x
 (c) verify that x == p

and THAT is what "smp_read_barrier_depends()" will basically stop on
alpha. Nothing else. Other CPU's will always basically do

 (a) read x
 (b) read *x

so they have an implied read barrier between those two events thanks to
simply the causal relationship.

Some more notes:

 - The reason that alpha has this odd thing is not actually that any
   alpha implementation does value speculation, but the way the caches
   are invalidated, the invalidates can be delayed and re-ordered almost
   arbitrarily on the local CPU, and in the absence of a memory barrier
   the second read (that does happen "after" the first read in some local
   internal CPU sense and wasn't really speculative in that way) can get
   stale data because one cacheline has been updated before another one
   has.

   So while you can think of it as value speculation, the underlying
   cause is actually not some fancy speculation infrastructure, just an
   internal implementation issue.

 - The _data_ dependency is important, because other architectures _will_
   still speculatively move memory operations around across other
   "causal" relationships, notably across control dependencies. IOW, if
   you have

	if (read(x))
		read(y)

   then there is NOT necessarily any real ordering between the reads,
   because the conditional ends up being speculated, and you may well see
   "y" being read before "x", and you really need a smp_rmb() on other
   architectures than alpha too.

   So in this sense, alpha is very "consistent" - for alpha, _no_ amount
   of local causality matters, and only accesses to the *same* variable
   are implicitly locally ordered.

 - On x86, the new memory ordering semantics mean that _all_ local causal
   relationships are honored, so x86, like alpha, is very consistent. It
   will consider both the data dependency and the control dependency to
   be 100% the same. It just does it differently than alpha: for alpha,
   neither matters for ordering, for x86, both matter.

   Of course, even on x86, the local causal relationships still do allow
   loads to pass stores, so x86 isn't _totally_ ordered. x86 obviously
   still does need the smp_mb().

So alpha is "more consistent" in the respect of really having very clear
rules. The fact that those "clear rules" are totally insane and very
inconvenient for threading (and weren't the big performance advantage
that people used to think they would be) is a separate matter.

		Linus

* Re: [patch] mm: fix anon_vma races
  2008-10-18 17:00 ` Linus Torvalds
@ 2008-10-18 18:44 ` Matthew Wilcox
  2008-10-19  2:54 ` Nick Piggin
  1 sibling, 1 reply; 52+ messages in thread
From: Matthew Wilcox @ 2008-10-18 18:44 UTC (permalink / raw)
To: Linus Torvalds
Cc: Nick Piggin, Paul Mackerras, Hugh Dickins, linux-kernel, linux-mm,
	linux-arch

On Sat, Oct 18, 2008 at 10:00:30AM -0700, Linus Torvalds wrote:
> > Apparently pairwise ordering is more interesting than just a theoretical
> > thing, and not just restricted to Alpha's funny caches.
>
> Nobody does just pairwise ordering, afaik. It's an insane model. Everybody
> does some form of transitive ordering.

I assume you're talking about CPUs in particular here, and I don't know
of any counterexamples.

If you're talking about PCI devices, the model is not transitive.
Here's the exact text from Appendix E of PCI 3.0:

	A system may have multiple Producer-Consumer pairs operating
	simultaneously, with different data - flag-status sets located all
	around the system. But since only one Producer can write to a single
	data-flag set, there are no ordering requirements between different
	masters. Writes from one master on one bus may occur in one order on
	one bus, with respect to another master's writes, and occur in another
	order on another bus. In this case, the rules allow for some writes
	to be rearranged; for example, an agent on Bus 1 may see Transaction
	A from a master on Bus 1 complete first, followed by Transaction B
	from another master on Bus 0. An agent on Bus 0 may see Transaction
	B complete first followed by Transaction A. Even though the actual
	transactions complete in a different order, this causes no problem
	since the different masters must be addressing different data-flag
	sets.

I seem to remember earlier versions of the spec having clearer language
about A happening before B and B happening before C not meaning that A
happened before C from the perspective of a third device, but I can't
find it right now.

Anyway, as the spec says, you're not supposed to use PCI like that,
so it doesn't matter.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

* Re: [patch] mm: fix anon_vma races
  2008-10-18 18:44 ` Matthew Wilcox
@ 2008-10-19  2:54 ` Nick Piggin
  0 siblings, 0 replies; 52+ messages in thread
From: Nick Piggin @ 2008-10-19 2:54 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Linus Torvalds, Paul Mackerras, Hugh Dickins, linux-kernel,
	linux-mm, linux-arch

On Sat, Oct 18, 2008 at 12:44:05PM -0600, Matthew Wilcox wrote:
> On Sat, Oct 18, 2008 at 10:00:30AM -0700, Linus Torvalds wrote:
> > > Apparently pairwise ordering is more interesting than just a theoretical
> > > thing, and not just restricted to Alpha's funny caches.
> >
> > Nobody does just pairwise ordering, afaik. It's an insane model. Everybody
> > does some form of transitive ordering.
>
> I assume you're talking about CPUs in particular here, and I don't know
> of any counterexamples.

Yes, just CPUs.

> If you're talking about PCI devices, the model is not transitive.
> Here's the exact text from Appendix E of PCI 3.0:
>
> [...]
>
> I seem to remember earlier versions of the spec having clearer language
> about A happening before B and B happening before C not meaning that A
> happened before C from the perspective of a third device, but I can't
> find it right now.
>
> Anyway, as the spec says, you're not supposed to use PCI like that,
> so it doesn't matter.

Interesting. Hopefully, as you say, it won't matter.

* Re: [patch] mm: fix anon_vma races
  2008-10-18 17:00 ` Linus Torvalds
  2008-10-18 18:44 ` Matthew Wilcox
@ 2008-10-19  2:53 ` Nick Piggin
  1 sibling, 0 replies; 52+ messages in thread
From: Nick Piggin @ 2008-10-19 2:53 UTC (permalink / raw)
To: Linus Torvalds
Cc: Paul Mackerras, Hugh Dickins, linux-kernel, linux-mm, linux-arch

On Sat, Oct 18, 2008 at 10:00:30AM -0700, Linus Torvalds wrote:
>
> On Sat, 18 Oct 2008, Nick Piggin wrote:
> >
> > I think it can be called transitive. Basically (assume memory starts
> > off zeroed):
>
> Alpha is transitive. It has a notion of "processor issue order" and
> "location access order", and the ordering those two create is a
> transitive "before" and "after" ordering.
>
> The issue with alpha is not that it wouldn't be transitive - the issue is
> that *local* read dependencies do not cause a "processor issue order"!

That's fine. That's not so different to most other weakly ordered
processors having control dependencies not appearing in-order. So long
as stores propagate according to causality.

> [...]
>
> So while the alpha architecture manual doesn't have the exact sequence
> mentioned above (it only has nine so-called "Litmus tests"), it's fairly
> close to Litmus test 3, and the ordering on alpha is very clear: it's all
> transitive and causal (ie "before" can never be "after").

OK, good.

> > Apparently pairwise ordering is more interesting than just a theoretical
> > thing, and not just restricted to Alpha's funny caches.
>
> Nobody does just pairwise ordering, afaik. It's an insane model. Everybody
> does some form of transitive ordering.
We were chatting with Andy Glew a while back, and he said it actually
can be quite beneficial for HW designers (but I imagine that is the
same as a lot of "insane" things) ;)

I remember though that you said Linux should be pairwise-safe. I think
that's wrong (for more reasons than this anon-vma race), which is why I
got concerned and started off this subthread. I think Linux probably
has a lot of problems in a pairwise consistency model, so I'd just like
to check whether we actually attempt to support any architecture where
that is the case.

x86, powerpc, alpha are good ;) That gives me hope.

Thanks,
Nick

* Re: [patch] mm: fix anon_vma races
  2008-10-16  4:10 [patch] mm: fix anon_vma races Nick Piggin
  2008-10-17 22:14 ` Hugh Dickins
@ 2008-10-17 23:13 ` Peter Zijlstra
  2008-10-17 23:53 ` Linus Torvalds
  1 sibling, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2008-10-17 23:13 UTC (permalink / raw)
To: Nick Piggin; +Cc: Hugh Dickins, Linux Memory Management List, Linus Torvalds

On Thu, 2008-10-16 at 06:10 +0200, Nick Piggin wrote:

> Signed-off-by: Nick Piggin <npiggin@suse.de>
> ---
> Index: linux-2.6/mm/rmap.c
> ===================================================================
> --- linux-2.6.orig/mm/rmap.c
> +++ linux-2.6/mm/rmap.c
> @@ -81,8 +81,15 @@ int anon_vma_prepare(struct vm_area_stru
>  		/* page_table_lock to protect against threads */
>  		spin_lock(&mm->page_table_lock);
>  		if (likely(!vma->anon_vma)) {
> -			vma->anon_vma = anon_vma;
>  			list_add_tail(&vma->anon_vma_node, &anon_vma->head);
> +			/*
> +			 * This smp_wmb() is required to order all previous
> +			 * stores to initialize the anon_vma (by the slab
> +			 * ctor) and add this vma, with the store to make it
> +			 * visible to other CPUs via vma->anon_vma.
> +			 */
> +			smp_wmb();
> +			vma->anon_vma = anon_vma;

I'm not getting why you explicitly move the list_add_tail() before the
wmb; doesn't the list also expose the anon_vma to other cpus?

Hmm, I guess not, since we won't set the page->mapping until well after
the wmb.. still, tricky.

>  			allocated = NULL;
>  		}
>  		spin_unlock(&mm->page_table_lock);
> @@ -91,6 +98,15 @@ int anon_vma_prepare(struct vm_area_stru
>  			spin_unlock(&locked->lock);
>  		if (unlikely(allocated))
>  			anon_vma_free(allocated);
> +	} else {
> +		/*
> +		 * This smp_read_barrier_depends is required to order the data
> +		 * dependent loads of fields in anon_vma, with the load of the
> +		 * anon_vma pointer vma->anon_vma. This complements the above
> +		 * smp_wmb, and prevents a CPU from loading uninitialized
> +		 * contents of anon_vma.
> +		 */
> +		smp_read_barrier_depends();
>  	}
>  	return 0;
>  }

* Re: [patch] mm: fix anon_vma races
  2008-10-17 23:13 ` Peter Zijlstra
@ 2008-10-17 23:53 ` Linus Torvalds
  2008-10-18  0:42 ` Linus Torvalds
  0 siblings, 1 reply; 52+ messages in thread
From: Linus Torvalds @ 2008-10-17 23:53 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Nick Piggin, Hugh Dickins, Linux Memory Management List

On Sat, 18 Oct 2008, Peter Zijlstra wrote:
> On Thu, 2008-10-16 at 06:10 +0200, Nick Piggin wrote:
>
> [...]
>
> I'm not getting why you explicitly move the list_add_tail() before the
> wmb; doesn't the list also expose the anon_vma to other cpus?

I do think the anon_vma locking might be good to look over. It is very
non-obvious. Especially the initial create is really really quite
suspect.

I suspect we should start out with the anon-vma locked, and do the

	spin_unlock(&anon_vma->lock);

unconditionally in anon_vma_prepare(), and just simplify it. As it is,
newly allocated anon_vma's get exposed in unlocked state while we're
still working on them.

But I think that what Nick did is correct - we always start traversal
through anon_vma->head, so no, the "list_add_tail()" won't expose it to
anybody else, because nobody else has seen the anon_vma.

That said, that's really too damn subtle. We shouldn't rely on memory
ordering for the list handling, when the list handling is _supposed_ to
be using that anon_vma->lock thing.

		Linus

* Re: [patch] mm: fix anon_vma races
  2008-10-17 23:53 ` Linus Torvalds
@ 2008-10-18  0:42 ` Linus Torvalds
  2008-10-18  1:08 ` Linus Torvalds
  2008-10-19  1:13 ` Hugh Dickins
  0 siblings, 2 replies; 52+ messages in thread
From: Linus Torvalds @ 2008-10-18 0:42 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Nick Piggin, Hugh Dickins, Linux Memory Management List

On Fri, 17 Oct 2008, Linus Torvalds wrote:
>
> But I think that what Nick did is correct - we always start traversal
> through anon_vma->head, so no, the "list_add_tail()" won't expose it to
> anybody else, because nobody else has seen the anon_vma.
>
> That said, that's really too damn subtle. We shouldn't rely on memory
> ordering for the list handling, when the list handling is _supposed_ to
> be using that anon_vma->lock thing.

So maybe a better patch would be as follows? It simplifies the whole thing
by just always locking and unlocking the vma, whether it's newly allocated
or not (and whether it then gets dropped as unnecessary or not). It still
does that "smp_read_barrier_depends()" in the same old place.

I don't have the energy to look at Hugh's point about people reading
anon_vma without doing the whole "prepare" thing.

It adds more lines than it removes, but that's just because of the
comments. With the locking simplification, it actually removes more lines
of actual code than it adds. And now we always do that list_add_tail()
with the anon_vma lock held, which should simplify thinking about this,
and avoid at least one subtle ordering issue.

		Linus

---
 mm/rmap.c |   33 +++++++++++++++++++++++----------
 1 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 0383acf..9221bf7 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -63,35 +63,48 @@ int anon_vma_prepare(struct vm_area_struct *vma)
 	might_sleep();
 	if (unlikely(!anon_vma)) {
 		struct mm_struct *mm = vma->vm_mm;
-		struct anon_vma *allocated, *locked;
+		struct anon_vma *allocated;
 
 		anon_vma = find_mergeable_anon_vma(vma);
-		if (anon_vma) {
-			allocated = NULL;
-			locked = anon_vma;
-			spin_lock(&locked->lock);
-		} else {
+		allocated = NULL;
+		if (!anon_vma) {
 			anon_vma = anon_vma_alloc();
 			if (unlikely(!anon_vma))
 				return -ENOMEM;
 			allocated = anon_vma;
-			locked = NULL;
 		}
+		spin_lock(&anon_vma->lock);
 
 		/* page_table_lock to protect against threads */
 		spin_lock(&mm->page_table_lock);
 		if (likely(!vma->anon_vma)) {
+			/*
+			 * We hold the mm->page_table_lock, but another
+			 * CPU may be doing an optimistic load (the one
+			 * at the top), and we want to make sure that
+			 * the anon_vma changes are visible.
+			 */
+			smp_wmb();
 			vma->anon_vma = anon_vma;
 			list_add_tail(&vma->anon_vma_node, &anon_vma->head);
 			allocated = NULL;
 		}
 		spin_unlock(&mm->page_table_lock);
-
-		if (locked)
-			spin_unlock(&locked->lock);
+		spin_unlock(&anon_vma->lock);
 
 		if (unlikely(allocated))
 			anon_vma_free(allocated);
 	}
+	/*
+	 * Subtle: we looked up anon_vma without any locking
+	 * (in the common case), and are going to look at the
+	 * spinlock etc behind it. In order to know that it's
+	 * initialized, we need to do a read barrier here.
+	 *
+	 * We can use the cheaper "depends" version, since we
+	 * are following a pointer, and only on alpha may that
+	 * give a stale value.
+	 */
+	smp_read_barrier_depends();
 	return 0;
 }

* Re: [patch] mm: fix anon_vma races
  2008-10-18  0:42 ` Linus Torvalds
@ 2008-10-18  1:08 ` Linus Torvalds
  2008-10-18  1:32 ` Nick Piggin
  1 sibling, 1 reply; 52+ messages in thread
From: Linus Torvalds @ 2008-10-18 1:08 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Nick Piggin, Hugh Dickins, Linux Memory Management List

On Fri, 17 Oct 2008, Linus Torvalds wrote:
>
> So maybe a better patch would be as follows? It simplifies the whole thing
> by just always locking and unlocking the vma, whether it's newly allocated
> or not (and whether it then gets dropped as unnecessary or not).

Side note: it would be nicer if we had a "spin_lock_init_locked()", so
that we could avoid the more expensive "true lock" when doing the initial
allocation, but we don't. That said, the case of having to allocate a new
anon_vma _should_ be the rare one.

I dunno. Nick's approach of just depending on memory ordering for the
list update probably works too. Even if it's rather more subtle than I'd
like.

		Linus

* Re: [patch] mm: fix anon_vma races
  2008-10-18  1:08 ` Linus Torvalds
@ 2008-10-18  1:32 ` Nick Piggin
  2008-10-18  2:11 ` Linus Torvalds
  0 siblings, 1 reply; 52+ messages in thread
From: Nick Piggin @ 2008-10-18 1:32 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Peter Zijlstra, Hugh Dickins, Linux Memory Management List

On Fri, Oct 17, 2008 at 06:08:05PM -0700, Linus Torvalds wrote:
>
> On Fri, 17 Oct 2008, Linus Torvalds wrote:
> >
> > So maybe a better patch would be as follows? It simplifies the whole thing
> > by just always locking and unlocking the vma, whether it's newly allocated
> > or not (and whether it then gets dropped as unnecessary or not).
>
> Side note: it would be nicer if we had a "spin_lock_init_locked()", so
> that we could avoid the more expensive "true lock" when doing the initial
> allocation, but we don't. That said, the case of having to allocate a new
> anon_vma _should_ be the rare one.

We can't do that, unfortunately, because anon_vmas are allocated with
SLAB_DESTROY_BY_RCU. Obviously that's the easier way to solve most of
the orderings (then you just need to order the store to the spinlock
with the store to make the anon_vma visible). But no, we can't do that.

But it seems like there might be other problems here too... Hmm.
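
For context, a minimal sketch of the SLAB_DESTROY_BY_RCU lookup discipline
being referred to (schematic kernel-style code, not the actual rmap.c; the
identity recheck mirrors the test added to page_lock_anon_vma() in the
patch further down):

	/*
	 * Under SLAB_DESTROY_BY_RCU the *memory* stays type-stable across
	 * an RCU grace period, but the *object* may be freed and reused
	 * for a new anon_vma at any time. So a lockless reader may end up
	 * locking a reused object, and must recheck its identity after
	 * taking the lock.
	 */
	struct anon_vma *lookup_anon_vma(struct page *page)
	{
		unsigned long mapping;
		struct anon_vma *anon_vma;

		rcu_read_lock();
		mapping = (unsigned long)page->mapping;
		if (!(mapping & PAGE_MAPPING_ANON))
			goto bail;

		anon_vma = (struct anon_vma *)(mapping - PAGE_MAPPING_ANON);
		spin_lock(&anon_vma->lock);	/* safe: memory is type-stable */

		/* Did the object get freed and reused while we raced in? */
		if (mapping != (unsigned long)page->mapping) {
			spin_unlock(&anon_vma->lock);
			goto bail;
		}
		return anon_vma;	/* caller unlocks and ends the RCU
					 * read-side critical section */
	bail:
		rcu_read_unlock();
		return NULL;
	}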

* Re: [patch] mm: fix anon_vma races
  2008-10-18  1:32 ` Nick Piggin
@ 2008-10-18  2:11 ` Linus Torvalds
  2008-10-18  2:25 ` Nick Piggin
  0 siblings, 1 reply; 52+ messages in thread
From: Linus Torvalds @ 2008-10-18 2:11 UTC (permalink / raw)
To: Nick Piggin; +Cc: Peter Zijlstra, Hugh Dickins, Linux Memory Management List

On Sat, 18 Oct 2008, Nick Piggin wrote:
>
> > Side note: it would be nicer if we had a "spin_lock_init_locked()", so
> > that we could avoid the more expensive "true lock" when doing the initial
> > allocation, but we don't. That said, the case of having to allocate a new
> > anon_vma _should_ be the rare one.
>
> We can't do that, unfortunately, because anon_vmas are allocated with
> SLAB_DESTROY_BY_RCU.

Aughh. I see what you're saying. We don't _free_ them by RCU, we just
destroy the page allocation. So an anon_vma can get _re-allocated_ for
another page (without being destroyed), concurrently with somebody
optimistically being busy with that same anon_vma that they got through
that optimistic 'page_lock_anon_vma()' thing.

So if we were to just set the lock, we might actually be messing with
something that is still actively used by the previous page that was
unmapped concurrently and still being accessed by try_to_unmap_anon. So
even though we allocated a "new" anon_vma, it might still be busy.

Yes? No?

That thing really is too subtle for words. But if that's actually what you
are alluding to, then doesn't that mean that we _really_ should be doing
that "spin_lock(&anon_vma->lock)" even for the first allocation, and that
the current code is broken? Because otherwise that other concurrent user
that found the stale vma through page_lock_anon_vma() will now try to
follow the linked list and _think_ it's stable (thanks to the lock), but
we're actually inserting entries into it without holding any locks at all.

But I'm hoping I actually am totally *not* understanding what you meant,
and am actually just terminally confused.

Hugh, this is very much your code. Can you please tell me I'm really
confused here, and un-confuse me. Pretty please?

		Linus
* Re: [patch] mm: fix anon_vma races 2008-10-18 2:11 ` Linus Torvalds @ 2008-10-18 2:25 ` Nick Piggin 2008-10-18 2:35 ` Nick Piggin ` (2 more replies) 0 siblings, 3 replies; 52+ messages in thread From: Nick Piggin @ 2008-10-18 2:25 UTC (permalink / raw) To: Linus Torvalds; +Cc: Peter Zijlstra, Hugh Dickins, Linux Memory Management List On Fri, Oct 17, 2008 at 07:11:38PM -0700, Linus Torvalds wrote: > > > On Sat, 18 Oct 2008, Nick Piggin wrote: > > > > > > Side note: it would be nicer if we had a "spin_lock_init_locked()", so > > > that we could avoid the more expensive "true lock" when doing the initial > > > allocation, but we don't. That said, the case of having to allocate a new > > > anon_vma _should_ be the rare one. > > > > We can't do that, unfortunately, because anon_vmas are allocated with > > SLAB_DESTROY_BY_RCU. > > Aughh. I see what you're saying. We don't _free_ them by RCU, we just > destroy the page allocation. So an anon_vma can get _re-allocated_ for > another page (without being destroyed), concurrently with somebody > optimistically being busy with that same anon_vma that they got through > that optimistic 'page_lock_anon_vma()' thing. > > So if we were to just set the lock, we might actually be messing with > something that is still actively used by the previous page that was > unmapped concurrently and still being accessed by try_to_unmap_anon. So > even though we allocated a "new" anon_vma, it might still be busy. > > Yes? No? That's what I'm thinking, yes. But I admit the last time I looked at this really closely was probably reading through Hugh's patches and changelogs (which at the time must have convinced me ;)). So I could be wrong. > That thing really is too subtle for words. But if that's actually what you > are alluding to, then doesn't that mean that we _really_ should be doing > that "spin_lock(&anon_vma->lock)" even for the first allocation, and that > the current code is broken? Because otherwise that other concurrent user > that found the stale vma through page_lock_anon_vma() will now try to > follow the linked list and _think_ it's stable (thanks to the lock), but > we're actually inserting entries into it without holding any locks at all. Yes, that's what I meant by "has other problems". Another thing is also that even if we have the lock here, I can't see why page_lock_anon_vma is safe against finding an anon_vma which has been deallocated then allocated for something else (and had vmas inserted into it etc.). I think most of our memory ordering problems can be solved by locking. Note that I don't think we need any barriers there, and callers don't need any read_barrier_depends either, because AFAIKS they all take the lock too. (it shouldn't actually need the reordering of the assignments either, which shows it is a bit more robust than relying on ordering, but I think it is neater if we reorder them). Whether this page_lock_anon_vma is really a problem or not... I've added a test in there for what I think may be a problem (at least, if I'm wrong, I'd like to know why and have a comment in there). > But I'm hoping I actually am totally *not* understanding what you meant, > and am actually just terminally confused. > > Hugh, this is very much your code. Can you please tell me I'm really > confused here, and un-confuse me. Pretty please?
Ditto ;) --- Index: linux-2.6/mm/rmap.c =================================================================== --- linux-2.6.orig/mm/rmap.c +++ linux-2.6/mm/rmap.c @@ -63,32 +63,42 @@ int anon_vma_prepare(struct vm_area_stru might_sleep(); if (unlikely(!anon_vma)) { struct mm_struct *mm = vma->vm_mm; - struct anon_vma *allocated, *locked; + struct anon_vma *allocated; anon_vma = find_mergeable_anon_vma(vma); if (anon_vma) { allocated = NULL; - locked = anon_vma; - spin_lock(&locked->lock); } else { anon_vma = anon_vma_alloc(); if (unlikely(!anon_vma)) return -ENOMEM; allocated = anon_vma; - locked = NULL; } + /* + * The lock is required even for new anon_vmas, because as + * soon as we store vma->anon_vma = anon_vma, then the + * anon_vma becomes visible via the vma. This means another + * CPU can find the anon_vma, then store it into the struct + * page with page_add_anon_rmap. At this point, anon_vma can + * be loaded from the page with page_lock_anon_vma. + * + * So long as the anon_vma->lock is taken before looking at + * any fields in the anon_vma, the lock should take care of + * races and memory ordering issues WRT anon_vma fields. + */ + spin_lock(&anon_vma->lock); + /* page_table_lock to protect against threads */ spin_lock(&mm->page_table_lock); if (likely(!vma->anon_vma)) { - vma->anon_vma = anon_vma; list_add_tail(&vma->anon_vma_node, &anon_vma->head); + vma->anon_vma = anon_vma; allocated = NULL; } spin_unlock(&mm->page_table_lock); + spin_lock(&anon_vma->lock); - if (locked) - spin_unlock(&locked->lock); if (unlikely(allocated)) anon_vma_free(allocated); } @@ -171,6 +181,10 @@ static struct anon_vma *page_lock_anon_v anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON); spin_lock(&anon_vma->lock); + + if (anon_mapping != (unsigned long)page->mapping) + goto out; + return anon_vma; out: rcu_read_unlock(); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch] mm: fix anon_vma races 2008-10-18 2:25 ` Nick Piggin @ 2008-10-18 2:35 ` Nick Piggin 2008-10-18 2:53 ` Linus Torvalds 2008-10-18 19:14 ` Hugh Dickins 2 siblings, 0 replies; 52+ messages in thread From: Nick Piggin @ 2008-10-18 2:35 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, Peter Zijlstra, Hugh Dickins, Linux Memory Management List On Saturday 18 October 2008 13:25, Nick Piggin wrote: > @@ -171,6 +181,10 @@ static struct anon_vma *page_lock_anon_v > > anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON); > spin_lock(&anon_vma->lock); > + > + if (anon_mapping != (unsigned long)page->mapping) > + goto out; > + > return anon_vma; > out: > rcu_read_unlock(); Err, that should probably be another check for page_mapped(), because I think page->mapping probably won't get cleared until after anon_vma may have been freed and reused... -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch] mm: fix anon_vma races 2008-10-18 2:25 ` Nick Piggin 2008-10-18 2:35 ` Nick Piggin @ 2008-10-18 2:53 ` Linus Torvalds 2008-10-18 5:20 ` Nick Piggin 2008-10-18 19:14 ` Hugh Dickins 2 siblings, 1 reply; 52+ messages in thread From: Linus Torvalds @ 2008-10-18 2:53 UTC (permalink / raw) To: Nick Piggin; +Cc: Peter Zijlstra, Hugh Dickins, Linux Memory Management List On Sat, 18 Oct 2008, Nick Piggin wrote: > @@ -171,6 +181,10 @@ static struct anon_vma *page_lock_anon_v > > anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON); > spin_lock(&anon_vma->lock); > + > + if (anon_mapping != (unsigned long)page->mapping) > + goto out; > + > return anon_vma; > out: > rcu_read_unlock(); I see why you'd like to try to do this, but look a bit closer, and you'll realize that this is *really* wrong. So there's the brown-paper-bag-reason why it's wrong: you need to unlock in this case, but there's a subtler reason why I doubt the whole approach works: I don't think we actually hold the anon_vma lock when we set page->mapping. So I don't think you really fixed the race that you want to fix, and I don't think that does what you wanted to do. But I might have missed something. I'm off to play poker. It's Friday night, there's only so many memory ordering and locking issues I can take in one day. I'm hoping that by the time I look at this again, you and Hugh will have sorted it out. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
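As background for the point Linus raises here: the store to page->mapping is indeed made under the pte lock only, from the fault path. Roughly this shape, abridged from the 2.6.27-era __page_set_anon_rmap() in mm/rmap.c (the stats update is omitted and the comment is added here):

static void __page_set_anon_rmap(struct page *page,
	struct vm_area_struct *vma, unsigned long address)
{
	struct anon_vma *anon_vma = vma->anon_vma;

	BUG_ON(!anon_vma);
	/*
	 * Called from page_add_anon_rmap() with the pte lock held;
	 * anon_vma->lock is not taken for this store.
	 */
	anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
	page->mapping = (struct address_space *) anon_vma;
	page->index = linear_page_index(vma, address);
}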
* Re: [patch] mm: fix anon_vma races 2008-10-18 2:53 ` Linus Torvalds @ 2008-10-18 5:20 ` Nick Piggin 2008-10-18 10:38 ` Peter Zijlstra 0 siblings, 1 reply; 52+ messages in thread From: Nick Piggin @ 2008-10-18 5:20 UTC (permalink / raw) To: Linus Torvalds; +Cc: Peter Zijlstra, Hugh Dickins, Linux Memory Management List On Fri, Oct 17, 2008 at 07:53:49PM -0700, Linus Torvalds wrote: > > > On Sat, 18 Oct 2008, Nick Piggin wrote: > > @@ -171,6 +181,10 @@ static struct anon_vma *page_lock_anon_v > > > > anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON); > > spin_lock(&anon_vma->lock); > > + > > + if (anon_mapping != (unsigned long)page->mapping) > > + goto out; > > + > > return anon_vma; > > out: > > rcu_read_unlock(); > > I see why you'd like to try to do this, but look a bit closer, and you'll > realize that this is *really* wrong. > > So there's the brown-paper-bag-reason why it's wrong: you need to unlock > in this case, Check. > but there's a subtler reason why I doubt the whole approach > works: I don't think we actually hold the anon_vma lock when we set > page->mapping. No, we don't, but I think that's OK because we do an atomic assignment to page->mapping. I can't see any bugs there (the change to always take the anon_vma lock when inserting a new anon_vma into vma->anon_vma should ensure the vma->anon_vma assignment happens after the busy lock is visible, and the fact that anyone looking into the anon_vma should hold the lock takes care of everything else). > So I don't think you really fixed the race that you want to fix, and I > don't think that does what you wanted to do. > > But I might have missed something. No, I think this race is different. It's because it is "hard" to get a reference on anon_vma from the lru->page path, because unmapping a vma doesn't take any of the same locks (in particular it doesn't take the page lock, which would be the big hammer solution). So we can have a thread in reclaim who has a locked page, and is just about to call page_lock_anon_vma. At this point, another thread might unmap the whole vma. If this is the last vma in the anon_vma, then it garbage collects it in anon_vma_unlink. page->mapping does not get NULLed out or anything. So the first thread picks up the anon_vma under RCU, sees page_mapped is still true (let's say this part runs just before the unmapper decrements the last ->mapcount, then the anon_vma gets garbage collected). Then we take the anon_vma lock. Still OK because we are under SLAB_DESTROY_BY_RCU. Then we return the anon_vma and start using it. But when we took the anon_vma lock, we don't actually know that the anon_vma hasn't been allocated and used for something else entirely. Taking the anon_vma->lock in anon_vma_prepare of a new anon_vma closes the obvious list corruption problems that could occur if we tried to walk it at the same time a new vma was being put on there. But AFAIKS, even then we have a problem where we might be trying to walk over completely the wrong vmas now. Slight improvement attached.
--- Index: linux-2.6/mm/rmap.c =================================================================== --- linux-2.6.orig/mm/rmap.c +++ linux-2.6/mm/rmap.c @@ -63,32 +63,42 @@ int anon_vma_prepare(struct vm_area_stru might_sleep(); if (unlikely(!anon_vma)) { struct mm_struct *mm = vma->vm_mm; - struct anon_vma *allocated, *locked; + struct anon_vma *allocated; anon_vma = find_mergeable_anon_vma(vma); if (anon_vma) { allocated = NULL; - locked = anon_vma; - spin_lock(&locked->lock); } else { anon_vma = anon_vma_alloc(); if (unlikely(!anon_vma)) return -ENOMEM; allocated = anon_vma; - locked = NULL; } + /* + * The lock is required even for new anon_vmas, because as + * soon as we store vma->anon_vma = anon_vma, then the + * anon_vma becomes visible via the vma. This means another + * CPU can find the anon_vma, then store it into the struct + * page with page_add_anon_rmap. At this point, anon_vma can + * be loaded from the page with page_lock_anon_vma. + * + * So long as the anon_vma->lock is taken before looking at + * any fields in the anon_vma, the lock should take care of + * races and memory ordering issues WRT anon_vma fields. + */ + spin_lock(&anon_vma->lock); + /* page_table_lock to protect against threads */ spin_lock(&mm->page_table_lock); if (likely(!vma->anon_vma)) { - vma->anon_vma = anon_vma; list_add_tail(&vma->anon_vma_node, &anon_vma->head); + vma->anon_vma = anon_vma; allocated = NULL; } spin_unlock(&mm->page_table_lock); + spin_lock(&anon_vma->lock); - if (locked) - spin_unlock(&locked->lock); if (unlikely(allocated)) anon_vma_free(allocated); } @@ -171,6 +181,21 @@ static struct anon_vma *page_lock_anon_v anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON); spin_lock(&anon_vma->lock); + + /* + * If the page is no longer mapped, we have no way to keep the + * anon_vma stable. It may be freed and even re-allocated for some + * other set of anonymous mappings at any point. If the page is + * mapped while we have the lock on the anon_vma, then we know + * anon_vma_unlink can't run and garbage collect the anon_vma + * (because unmapping the page happens before unlinking the anon_vma). + */ + if (unlikely(!page_mapped(page))) { + spin_unlock(&anon_vma->lock); + goto out; + } + BUG_ON(page->mapping != anon_mapping); + return anon_vma; out: rcu_read_unlock(); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
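For readers keeping score, the consumer being raced against in all of this is the rmap walk, which holds only the anon_vma lock while it iterates the list. Roughly this shape, abridged from the 2.6.27-era try_to_unmap_anon() in mm/rmap.c (the comment is added here):

static int try_to_unmap_anon(struct page *page, int migration)
{
	struct anon_vma *anon_vma;
	struct vm_area_struct *vma;
	int ret = SWAP_AGAIN;

	anon_vma = page_lock_anon_vma(page);
	if (!anon_vma)
		return ret;

	/*
	 * The list is only stable while anon_vma->lock is held. If the
	 * anon_vma was recycled, these may be somebody else's vmas; the
	 * pte checks inside try_to_unmap_one() are what make that safe.
	 */
	list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
		ret = try_to_unmap_one(page, vma, migration);
		if (ret == SWAP_FAIL || !page_mapped(page))
			break;
	}

	page_unlock_anon_vma(anon_vma);
	return ret;
}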
* Re: [patch] mm: fix anon_vma races 2008-10-18 5:20 ` Nick Piggin @ 2008-10-18 10:38 ` Peter Zijlstra 2008-10-19 9:52 ` Hugh Dickins 0 siblings, 1 reply; 52+ messages in thread From: Peter Zijlstra @ 2008-10-18 10:38 UTC (permalink / raw) To: Nick Piggin; +Cc: Linus Torvalds, Hugh Dickins, Linux Memory Management List On Sat, 2008-10-18 at 07:20 +0200, Nick Piggin wrote: > Index: linux-2.6/mm/rmap.c > =================================================================== > --- linux-2.6.orig/mm/rmap.c > +++ linux-2.6/mm/rmap.c > @@ -63,32 +63,42 @@ int anon_vma_prepare(struct vm_area_stru > might_sleep(); > if (unlikely(!anon_vma)) { > struct mm_struct *mm = vma->vm_mm; > - struct anon_vma *allocated, *locked; > + struct anon_vma *allocated; > > anon_vma = find_mergeable_anon_vma(vma); > if (anon_vma) { > allocated = NULL; > - locked = anon_vma; > - spin_lock(&locked->lock); > } else { > anon_vma = anon_vma_alloc(); > if (unlikely(!anon_vma)) > return -ENOMEM; > allocated = anon_vma; > - locked = NULL; > } > > + /* > + * The lock is required even for new anon_vmas, because as > + * soon as we store vma->anon_vma = anon_vma, then the > + * anon_vma becomes visible via the vma. This means another > + * CPU can find the anon_vma, then store it into the struct > + * page with page_add_anon_rmap. At this point, anon_vma can > + * be loaded from the page with page_lock_anon_vma. > + * > + * So long as the anon_vma->lock is taken before looking at > + * any fields in the anon_vma, the lock should take care of > + * races and memory ordering issues WRT anon_vma fields. > + */ > + spin_lock(&anon_vma->lock); > + > /* page_table_lock to protect against threads */ > spin_lock(&mm->page_table_lock); > if (likely(!vma->anon_vma)) { > - vma->anon_vma = anon_vma; > list_add_tail(&vma->anon_vma_node, &anon_vma->head); > + vma->anon_vma = anon_vma; > allocated = NULL; > } > spin_unlock(&mm->page_table_lock); > + spin_lock(&anon_vma->lock); did you perchance mean, spin_unlock() ? > > - if (locked) > - spin_unlock(&locked->lock); > if (unlikely(allocated)) > anon_vma_free(allocated); > } > @@ -171,6 +181,21 @@ static struct anon_vma *page_lock_anon_v > > anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON); > spin_lock(&anon_vma->lock); > + > + /* > + * If the page is no longer mapped, we have no way to keep the > + * anon_vma stable. It may be freed and even re-allocated for some > + * other set of anonymous mappings at any point. If the page is > + * mapped while we have the lock on the anon_vma, then we know > + * anon_vma_unlink can't run and garbage collect the anon_vma > + * (because unmapping the page happens before unlinking the anon_vma). > + */ > + if (unlikely(!page_mapped(page))) { > + spin_unlock(&anon_vma->lock); > + goto out; > + } > + BUG_ON(page->mapping != anon_mapping); > + > return anon_vma; > out: > rcu_read_unlock(); fault_creation: anon_vma_prepare() page_add_new_anon_rmap(); expand_creation: anon_vma_prepare() anon_vma_lock(); rmap_lookup: page_referenced()/try_to_unmap() page_lock_anon_vma() vma_lookup: vma_adjust()/vma_* vma->anon_vma teardown: unmap_vmas() zap_range() page_remove_rmap() free_page() free_pgtables() anon_vma_unlink() free_range() IOW we remove rmap, free the page (set mapping=NULL) and then unlink and free the anon_vma. But at that time vma->anon_vma is still set. head starts to hurt,.. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch] mm: fix anon_vma races 2008-10-18 10:38 ` Peter Zijlstra @ 2008-10-19 9:52 ` Hugh Dickins 2008-10-19 10:51 ` Peter Zijlstra 0 siblings, 1 reply; 52+ messages in thread From: Hugh Dickins @ 2008-10-19 9:52 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Nick Piggin, Linus Torvalds, Linux Memory Management List On Sat, 18 Oct 2008, Peter Zijlstra wrote: > > fault_creation: > > anon_vma_prepare() > page_add_new_anon_rmap(); > > expand_creation: > > anon_vma_prepare() > anon_vma_lock(); > > rmap_lookup: > > page_referenced()/try_to_unmap() > page_lock_anon_vma() > > vma_lookup: > > vma_adjust()/vma_* > vma->anon_vma > > teardown: > > unmap_vmas() > zap_range() > page_remove_rmap() > free_page() > free_pgtables() > anon_vma_unlink() > free_range() > > IOW we remove rmap, free the page (set mapping=NULL) and then unlink and > free the anon_vma. > > But at that time vma->anon_vma is still set. > > > head starts to hurt,.. Comprehension isn't my strong suit at the moment: I don't grasp what problem you're seeing here - if you can spell it out in more detail for me, I'd like to try stopping your head hurt - though not at cost of making mine hurt more! Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch] mm: fix anon_vma races 2008-10-19 9:52 ` Hugh Dickins @ 2008-10-19 10:51 ` Peter Zijlstra 2008-10-19 12:39 ` Hugh Dickins 2008-10-19 18:25 ` Linus Torvalds 0 siblings, 2 replies; 52+ messages in thread From: Peter Zijlstra @ 2008-10-19 10:51 UTC (permalink / raw) To: Hugh Dickins; +Cc: Nick Piggin, Linus Torvalds, Linux Memory Management List On Sun, 2008-10-19 at 10:52 +0100, Hugh Dickins wrote: > On Sat, 18 Oct 2008, Peter Zijlstra wrote: > > > > fault_creation: > > > > anon_vma_prepare() > > page_add_new_anon_rmap(); > > > > expand_creation: > > > > anon_vma_prepare() > > anon_vma_lock(); > > > > rmap_lookup: > > > > page_referenced()/try_to_unmap() > > page_lock_anon_vma() > > > > vma_lookup: > > > > vma_adjust()/vma_* > > vma->anon_vma > > > > teardown: > > > > unmap_vmas() > > zap_range() > > page_remove_rmap() > > free_page() > > free_pgtables() > > anon_vma_unlink() > > free_range() > > > > IOW we remove rmap, free the page (set mapping=NULL) and then unlink and > > free the anon_vma. > > > > But at that time vma->anon_vma is still set. > > > > > > head starts to hurt,.. > > Comprehension isn't my strong suit at the moment: I don't grasp > what problem you're seeing here - if you can spell it out in more > detail for me, I'd like to try stopping your head hurt - though not > at cost of making mine hurt more! Heh, I meant to continue on this path more later yesterday, but weekend chores kept me from it. The above was my feeble attempt at getting an overview of what we do with these anon_vma things so as to get a feel for the interaction. I think my main concern in all this is validating that we have the right anon_vma on dereference - both the vma->anon_vma and the page->mapping one. Part of the confusion is that we don't clear those pointers at the end of their lifetimes (page_remove_rmap and anon_vma_unlink). I guess the !page_mapping() test in page_lock_anon_vma() is meant to deal with this; I think the point is that we have a stable page reference, and thus the mapping is stable too (although page_referenced() only holds a ref, and that could race with a fault which would install the anon_vma? - still, that looks like a race that falls the safe way) Still it looks odd to have a rcu_read_lock() section without an rcu_dereference() or smp_read_depend barrier. I think I see how the vma->anon_vma references work too, given the added locking in anon_vma_prepare() proposed in this thread. But I need to ponder those a bit more. And alas, I need to run out again.. these weekends are too short. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch] mm: fix anon_vma races 2008-10-19 10:51 ` Peter Zijlstra @ 2008-10-19 12:39 ` Hugh Dickins 0 siblings, 0 replies; 52+ messages in thread From: Hugh Dickins @ 2008-10-19 12:39 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Nick Piggin, Linus Torvalds, Linux Memory Management List On Sun, 19 Oct 2008, Peter Zijlstra wrote: > > I think my main concern in all this is validating that we have the right > anon_vma on dereference - both the vma->anon_vma and the page->mapping > one. > > Part of the confusion is that we don't clear those pointers at the end > of their lifetimes (page_remove_rmap and anon_vma_unlink). Yes, I would very much have liked to clear it in page_remove_rmap(), as I say there: it still feels ugly to be cleaning up after it in free_hot_cold_page() (gosh! and that's still the name of where it happens!), though there are some good debug advantages to having it set indefinitely too. Clearing vma->anon_vma at the end, I don't think I ever cared about that: it's very common to kfree() something without resetting the pointers to it, I don't think there's any worrying race in its case. > > I guess the !page_mapping() test in page_lock_anon_vma() is meant to !page_mapped() > deal with this; I think the point is that we have a stable page > reference, and thus the mapping is stable too (although > page_referenced() only holds a ref, and that could race with a fault > which would install the anon_vma? - still, that looks like a race that > falls the safe way) page_lock_anon_vma() is like those scenes where sailors are pulleyed down a rope from one ship to another in stormy mid-ocean. There, now you understand it, need I say more? If we see page_mapcount is raised (in the RCU locked section), we can be sure that the slab page which the struct anon_vma rests on will not get freed and reused for something else: page_mapcount may go down to 0 at any instant, but having been non-0, we are assured that anon_vma->lock will remain the spinlock in a struct anon_vma, even if by the time we've acquired that spinlock, that particular piece of memory has been freed and reused for the anon_vma of another group of vmas altogether. Certainly mapcount could also go up and another vma be added to the anon_vma's list while we wait to get the spinlock, as you say, but that's no worry. > > Still it looks odd to have a rcu_read_lock() section without an > rcu_dereference() or smp_read_depend barrier. At the 2.6.8 time I wrote it, I was using preempt_disable() and preempt_enable(), and there was no such thing as rcu_dereference(). But I don't think it's now lacking in that respect: the whole idea was to keep almost all of the code free of RCU worries, just concentrate them all into page_lock_anon_vma() (and slab.c). > > I think I see how the vma->anon_vma references work too, given the added > locking in anon_vma_prepare() proposed in this thread. But I need to > ponder those a bit more. > > And alas, I need to run out again.. these weekends are too short. I know the feeling: I also seem to have promised many too many people that I'll be attending to this or that at the weekend. Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch] mm: fix anon_vma races 2008-10-19 10:51 ` Peter Zijlstra 2008-10-19 12:39 ` Hugh Dickins @ 2008-10-19 18:25 ` Linus Torvalds 2008-10-19 18:45 ` Peter Zijlstra ` (3 more replies) 1 sibling, 4 replies; 52+ messages in thread From: Linus Torvalds @ 2008-10-19 18:25 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Hugh Dickins, Nick Piggin, Linux Memory Management List On Sun, 19 Oct 2008, Peter Zijlstra wrote: > > Part of the confusion is that we don't clear those pointers at the end > of their lifetimes (page_remove_rmap and anon_vma_unlink). > > I guess the !page_mapping() test in page_lock_anon_vma() is meant to > deal with this Hmm. So that part I'm still not entirely convinced about. The thing is, we have two issues on anon_vma usage, and the page_lock_anon_vma() usage in particular: - the integrity of the list itself Here it should be sufficient to just always get the lock, to the point where we don't need to care about anything else. So getting the lock properly on new allocation makes all the other races irrelevant. - the integrity of the _result_ of traversing the list This is what the !page_mapping() thing is supposedly protecting against, I think. But as far as I can tell, there's really two different use cases here: (a) people who care deeply about the result and (b) people who don't. And the difference between the two cases is whether they had the page locked or not. The "try_to_unmap()" callers care deeply, and lock the page. In contrast, some "page_referenced()" callers (really just shrink_active_list) don't care deeply, and to them the return value is really just a heuristic. As far as I can tell, all the people who care deeply will lock the page (and _have_ to lock the page), and thus 'page->mapping' should be stable for those cases. And then we have the other cases, who just want a heuristic, and they don't hold the page lock, but if we look at the wrong active_vma that has gotten reallocated to something else, they don't even really care. So I'm not seeing the reason for that check for page_mapped() at the end. Does it actually protect against anything relevant? Anyway, I _think_ the part that everybody agrees about is the initial locking of the anon_vma. Whether we then even need any memory barriers and/or the page_mapped() check is an independent question. Yes? No? So I'm suggesting this commit as the part we at least all agree on. But I haven't pushed it out yet, so you can still holler.. But I think all the discussion is about other issues, and we all agree on at least this part? Linus --- ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch] mm: fix anon_vma races 2008-10-19 18:25 ` Linus Torvalds @ 2008-10-19 18:45 ` Peter Zijlstra 2008-10-19 19:00 ` Hugh Dickins ` (2 subsequent siblings) 3 siblings, 0 replies; 52+ messages in thread From: Peter Zijlstra @ 2008-10-19 18:45 UTC (permalink / raw) To: Linus Torvalds; +Cc: Hugh Dickins, Nick Piggin, Linux Memory Management List On Sun, 2008-10-19 at 11:25 -0700, Linus Torvalds wrote: > Anyway, I _think_ the part that everybody agrees about is the initial > locking of the anon_vma. Whether we then even need any memory barriers > and/or the page_mapped() check is an independent question. Yes? No? Yes > So I'm suggesting this commit as the part we at least all agree on. But I > haven't pushed it out yet, so you can still holler.. But I think all the > discussion is about other issues, and we all agree on at least this part? > > Linus > > --- > From f422f2ec50872331820f15711f48b9ffc9cbb64e Mon Sep 17 00:00:00 2001 > From: Linus Torvalds <torvalds@linux-foundation.org> > Date: Sun, 19 Oct 2008 10:32:20 -0700 > Subject: [PATCH] anon_vma_prepare: properly lock even newly allocated entries > > The anon_vma code is very subtle, and we end up doing optimistic lookups > of anon_vmas under RCU in page_lock_anon_vma() with no locking. Other > CPU's can also see the newly allocated entry immediately after we've > exposed it by setting "vma->anon_vma" to the new value. > > We protect against the anon_vma being destroyed by having the SLAB > marked as SLAB_DESTROY_BY_RCU, so the RCU lookup can depend on the > allocation not being destroyed - but it might still be free'd and > re-allocated here to a new vma. > > As a result, we should not do the anon_vma list ops on a newly allocated > vma without proper locking. > > Acked-by: Nick Piggin <npiggin@suse.de> > Acked-by: Hugh Dickins <hugh@veritas.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> > Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> > --- > mm/rmap.c | 42 ++++++++++++++++++++++++++++++++---------- > 1 files changed, 32 insertions(+), 10 deletions(-) > > diff --git a/mm/rmap.c b/mm/rmap.c > index 0383acf..e8d639b 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -55,7 +55,33 @@ > > struct kmem_cache *anon_vma_cachep; > > -/* This must be called under the mmap_sem. */ > +/** > + * anon_vma_prepare - attach an anon_vma to a memory region > + * @vma: the memory region in question > + * > + * This makes sure the memory mapping described by 'vma' has > + * an 'anon_vma' attached to it, so that we can associate the > + * anonymous pages mapped into it with that anon_vma. > + * > + * The common case will be that we already have one, but > + * if not we either need to find an adjacent mapping that we > + * can re-use the anon_vma from (very common when the only > + * reason for splitting a vma has been mprotect()), or we > + * allocate a new one. > + * > + * Anon-vma allocations are very subtle, because we may have > + * optimistically looked up an anon_vma in page_lock_anon_vma() > + * and that may actually touch the spinlock even in the newly > + * allocated vma (it depends on RCU to make sure that the > + * anon_vma isn't actually destroyed). > + * > + * As a result, we need to do proper anon_vma locking even > + * for the new allocation. At the same time, we do not want > + * to do any locking for the common case of already having > + * an anon_vma. > + * > + * This must be called with the mmap_sem held for reading.
> + */ > int anon_vma_prepare(struct vm_area_struct *vma) > { > struct anon_vma *anon_vma = vma->anon_vma; > @@ -63,20 +89,17 @@ int anon_vma_prepare(struct vm_area_struct *vma) > might_sleep(); > if (unlikely(!anon_vma)) { > struct mm_struct *mm = vma->vm_mm; > - struct anon_vma *allocated, *locked; > + struct anon_vma *allocated; > > anon_vma = find_mergeable_anon_vma(vma); > - if (anon_vma) { > - allocated = NULL; > - locked = anon_vma; > - spin_lock(&locked->lock); > - } else { > + allocated = NULL; > + if (!anon_vma) { > anon_vma = anon_vma_alloc(); > if (unlikely(!anon_vma)) > return -ENOMEM; > allocated = anon_vma; > - locked = NULL; > } > + spin_lock(&anon_vma->lock); > > /* page_table_lock to protect against threads */ > spin_lock(&mm->page_table_lock); > @@ -87,8 +110,7 @@ int anon_vma_prepare(struct vm_area_struct *vma) > } > spin_unlock(&mm->page_table_lock); > > - if (locked) > - spin_unlock(&locked->lock); > + spin_unlock(&anon_vma->lock); > if (unlikely(allocated)) > anon_vma_free(allocated); > } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch] mm: fix anon_vma races 2008-10-19 18:25 ` Linus Torvalds 2008-10-19 18:45 ` Peter Zijlstra @ 2008-10-19 19:00 ` Hugh Dickins 2008-10-20 4:03 ` Hugh Dickins 2008-10-21 2:44 ` Nick Piggin 3 siblings, 0 replies; 52+ messages in thread From: Hugh Dickins @ 2008-10-19 19:00 UTC (permalink / raw) To: Linus Torvalds; +Cc: Peter Zijlstra, Nick Piggin, Linux Memory Management List On Sun, 19 Oct 2008, Linus Torvalds wrote: > > Anyway, I _think_ the part that everybody agrees about is the initial > locking of the anon_vma. Whether we then even need any memory barriers > and/or the page_mapped() check is an independent question. Yes? No? > > So I'm suggesting this commit as the part we at least all agree on. But I > haven't pushed it out yet, so you can still holler.. But I think all the > discussion is about other issues, and we all agree on at least this part? I'll have to postpone answering the rest of your mail until later, but yes, I agree your patch is what we've all agreed on so far, and I can't even quibble with your description - it's good. Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch] mm: fix anon_vma races 2008-10-19 18:25 ` Linus Torvalds 2008-10-19 18:45 ` Peter Zijlstra 2008-10-19 19:00 ` Hugh Dickins @ 2008-10-20 4:03 ` Hugh Dickins 2008-10-20 15:17 ` Linus Torvalds 2008-10-21 2:44 ` Nick Piggin 3 siblings, 1 reply; 52+ messages in thread From: Hugh Dickins @ 2008-10-20 4:03 UTC (permalink / raw) To: Linus Torvalds; +Cc: Peter Zijlstra, Nick Piggin, Linux Memory Management List On Sun, 19 Oct 2008, Linus Torvalds wrote: > On Sun, 19 Oct 2008, Peter Zijlstra wrote: > > > > Part of the confusion is that we don't clear those pointers at the end > > of their lifetimes (page_remove_rmap and anon_vma_unlink). > > > > I guess the !page_mapping() test in page_lock_anon_vma() is meant to !page_mapped() > > deal with this > > Hmm. So that part I'm still not entirely convinced about. > > The thing is, we have two issues on anon_vma usage, and the > page_lock_anon_vma() usage in particular: > > - the integrity of the list itself > > Here it should be sufficient to just always get the lock, to the point > where we don't need to care about anything else. So getting the lock > properly on new allocation makes all the other races irrelevant. Yes, I should have removed anon_vma_prepare()'s optimization of not locking a newly allocated struct anon_vma when I moved page_lock_anon_vma() over to rely on SLAB_DESTROY_BY_RCU. When you say "to the point where we don't need to care about anything else", are you there agreeing with Nick that your smp_wmb() and smp_read_barrier_depends() are no longer needed? > > - the integrity of the _result_ of traversing the list > > This is what the !page_mapping() thing is supposedly protecting !page_mapped() > against, I think. > > But as far as I can tell, there's really two different use cases here: > (a) people who care deeply about the result and (b) people who don't. You could look at it that way, but I don't think it's like that really. IIRC at one time it was the case that every caller had the page already locked when coming in here, but none of them could care too deeply about the result because there was a trylock of the page_table_lock which could fail on one of the paths. Nowadays the only trylocking in rmap.c is one place where it's being called without holding page lock: that doesn't matter for anon pages, but matters for file pages because page->mapping could go NULL from truncation at any instant without the page lock, and we do need page->mapping to locate the prio_tree of vmas - in that case it trylocks for page lock, and just assumes referenced if couldn't get it. > > And the difference between the two cases is whether they had the page > locked or not. The "try_to_unmap()" callers care deeply, and lock the > page. In contrast, some "page_referenced()" callers (really just > shrink_active_list) don't care deeply, and to them the return value is > really just a heuristic. > > As far as I can tell, all the people who care deeply will lock the page > (and _have_ to lock the page), and thus 'page->mapping' should be stable > for those cases. It's certainly true that try_to_unmap() callers care more deeply than page_referenced() callers, and that the only trylock is on the page_referenced() file path. But this is all _irrelevant_: the tricky test to worry about in page_lock_anon_vma() is of page_mapped() i.e. does this page currently have any ptes in userspace, not of page_mapping() or page->mapping.
In the case of file pages, it is in some places crucial to check page->mapping against a racing file truncation; but there's no such issue with anon pages, their page->mapping pointing to anon_vma is left set until the page is finally freed, it is not volatile. But in the case of anon pages, what page->mapping points to may be volatile, in the sense that that memory might at some point get reused for a different anon_vma, or the slab page below it get freed and reused for a different purpose completely: that's what we have to be careful of in the case of anon pages, and it's RCU and the page_mapped() test which guard that. > > And then we have the other cases, who just want a heuristic, and they > don't hold the page lock, but if we look at the wrong active_vma that has anon_vma > gotten reallocated to something else, they don't even really care. It's not that they don't care, it's that if the struct anon_vma has gotten reallocated to something else, they can then be sure there's no longer anything to care about. The struct anon_vma bundles together in a list all those vmas which might conceivably contain a pte of the page we're interested in. If the anon_vma has gotten freed, even reallocated, or utterly reused, that implies that its list of vmas was emptied, there's no longer any vma which might contain that page. > > So I'm not seeing the reason for that check for page_mapped() at the end. > Does it actually protect against anything relevant? Absolutely, it is crucial that page_lock_anon_vma(page) checks page_mapped(page) after doing the rcu_read_lock() and before doing the spin_lock(&anon_vma->lock). When we get into page_lock_anon_vma, we do not know whether what page->mapping (less its PageAnon bit) points to is really still the anon_vma for the page. Maybe all the tasks which had that page mapped at the time we last checked page_mapped() have now exited, their vmas and anon_vmas and mms been freed. So first we rcu_read_lock(): and because the anon_vma cache is SLAB_DESTROY_BY_RCU, we know that while we hold that, if the memory is now pointing to an anon_vma type of object, it will remain pointing to an anon_vma type of object until the rcu_read_unlock(). That's important because what we're really wanting to do is spin_lock(&anon_vma->lock), but that would be a corrupting thing to do if the struct anon_vma's memory has meanwhile got reused for something else. Okay, we've got rcu_read_lock() to stabilize the situation, but we still do not know whether the memory pointed to by page->mapping is still in use as an anon_vma. Well, we don't try to answer precisely that question: but if the page we're holding (caller better have a reference on it!) has any ptes in userspace, i.e. page_mapcount is raised, i.e. page_mapped is true, then we can be sure that what it's pointing to is the anon_vma. And if page_mapped isn't true any longer, then we're no longer interested in getting that lock anyway and can just back out. At any instant thereafter the tasks concerned may exit or unmap the page, the vmas and anon_vmas and mms may get freed and reused; but SLAB_DESTROY_BY_RCU guarantees that while we have rcu_read_lock() that struct anon_vma will remain of type struct anon_vma, and we can now safely spin_lock(&anon_vma->lock). Once we've got that spin lock, we can safely (given the fix in your patch) traverse the anon_vma list.
Maybe the struct anon_vma is still relevant to our page, and maybe we'll find ptes of our page in one or more of the vmas it bundles together; or maybe it's still the right anon_vma, but our page gets unmapped while we search and no ptes are found; or maybe that struct anon_vma has just got freed back to slab and its list is empty; or maybe it has meanwhile got reused for another unrelated bundle of vmas, and we search that bundle for our page which won't be there - that's okay, it's a very unlikely case, but it's allowed for. > > Anyway, I _think_ the part that everybody agrees about is the initial > locking of the anon_vma. Whether we then even need any memory barriers > and/or the page_mapped() check is an independent question. Yes? No? > > So I'm suggesting this commit as the part we at least all agree on. But I > haven't pushed it out yet, so you can still holler.. But I think all the > discussion is about other issues, and we all agree on at least this part? Yes, agreed. Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
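To put Hugh's walkthrough next to the code it describes, here is the lookup itself, abridged from the 2.6.27-era page_lock_anon_vma() in mm/rmap.c, with his argument mapped onto the steps as comments (the comments are added here, they are not in the source):

static struct anon_vma *page_lock_anon_vma(struct page *page)
{
	struct anon_vma *anon_vma;
	unsigned long anon_mapping;

	rcu_read_lock();	/* slab memory now stays an anon_vma */
	anon_mapping = (unsigned long) page->mapping;
	if (!(anon_mapping & PAGE_MAPPING_ANON))
		goto out;
	if (!page_mapped(page))	/* raised mapcount: mapping really is an anon_vma */
		goto out;

	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
	spin_lock(&anon_vma->lock);	/* safe even if meanwhile freed and reused */
	return anon_vma;
out:
	rcu_read_unlock();
	return NULL;
}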
* Re: [patch] mm: fix anon_vma races 2008-10-20 4:03 ` Hugh Dickins @ 2008-10-20 15:17 ` Linus Torvalds 2008-10-20 18:21 ` Hugh Dickins 2008-10-21 2:56 ` Nick Piggin 0 siblings, 2 replies; 52+ messages in thread From: Linus Torvalds @ 2008-10-20 15:17 UTC (permalink / raw) To: Hugh Dickins; +Cc: Peter Zijlstra, Nick Piggin, Linux Memory Management List On Mon, 20 Oct 2008, Hugh Dickins wrote: > > When you say "to the point where we don't need to care about anything > else", are you there agreeing with Nick that your smp_wmb() and > smp_read_barrier_depends() are no longer needed? Yes. The anon_vma only has two fields: the spinlock itself, and the list. And with all list allocations being inside the spinlock, and with the spinlock itself being a memory barrier, I'm now convinced that the worry about memory ordering was unnecessary. Well, not unnecessary, because I think the discussion was good, and I think we fixed another bug, but the smp_wmb++smp_read_barrier_depends does seem to be a non-issue in this path. > But this is all _irrelevant_ : the tricky test to worry about in > page_lock_anon_vma() is of page_mapped() i.e. does this page currently > have any ptes in userspace, not of page_mapping() or page->mapping. I'm not arguing for removing the page_mapped() we have now, I'm just wondering about the one Nick wanted to add at the end. > In the case of file pages, it is in some places crucial to check > page->mapping against a racing file truncation; but there's no such > issue with anon pages, their page->mapping pointing to anon_vma is > left set until the page is finally freed, it is not volatile. > > But in the case of anon pages, what page->mapping points to may be > volatile, in the sense that that memory might at some point get reused > for a different anon_vma, or the slab page below it get freed and > reused for a different purpose completely: that's what we have to > careful of in the case of anon pages, and it's RCU and the > page_mapped() test which guard that. .. and I'm not worried about the slab page. It's stable, since we hold the RCU read-lock. No worries about that one either. It's the "struct page" itself - the one we use for lookup in page_lock_anon_vma(). And I'm worried about the need for *re-doing* the page_mapped() test. The problem that makes "page_lock_anon_vma()" such a total disaster is that yes, it does locking, but it does locking _after_ the lookup, and the lock doesn't actually protect any of the data that it is using for the lookup itself. And yes, we have various tricks to try to make the data "safe" even if we race with the lookup, like the RCU stability of the anon_vma allocation, so that even if we race, we don't do anything bad. And I don't worry about the anon_vma, that part looks fine. But page_remove_rmap() is run totally unlocked wrt page_lock_anon_vma(), it looks like. page_remove_rmap() is run under the pte lock, and page_lock_anon_vma() is run under no lock at all that I can see that is reliable. So yes, we have the same kind of "delay the destroy" wrt page->mapping (ie page_remove_rmap() doesn't actually clear page->mapping at all), but none of this runs under any lock at all. So as far as I can tell, the only reason "page_lock_anon_vma()" is safe is because we hopefully always hold an elevated page count when we call it, so at least the struct page itself will never get freed, so page->mapping is safe just because it's not cleared. 
But assuming all that is true, it boils down to this: - the anon_vma we get may be the wrong one or a stale one (since page_remove_rmap ran concurrently and we ended up freeing the anon_vma), but it's always a "valid" anon_vma in the sense that it's initialized and the list is always pointing to *some* stable set of vm_area_struct's. - if we raced, and the anon_vma is stale, we're going to walk over some bogus list that pertains to a totally different page, but WHO REALLY CARES? If it is about another page that got that anon_vma reallocated to it, all the code that traverses the list of vma's still has to check the page tables _anyway_. And that will never trigger, and that will get the pte lock for checking anyway, so at _that_ point do we correctly finally synchronize with a lock that is actually relevant. - the "anon_vma->lock" is totally irrelevant wrt "page_mapped()", and I'm not seeing what it could possibly help to re-check after getting that lock. So what I'm trying to figure out is why Nick wanted to add another check for page_mapped(). I'm not seeing what it is supposed to protect against. (Yes, we have checks for "page_mapped()" inside the "try_to_unmap_xyz()" loops, but those are for a different reason - they're there to exit the loop early when we know there's no point. They don't claim to be about locking serialization). Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
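The "lock that is actually relevant" in Linus's second bullet is the pte lock. A condensed sketch of that final check, modeled on page_check_address() but given a hypothetical name and a simplified signature here (the real helper differs in details):

/*
 * Return the locked pte mapping 'page' at 'address' in 'mm', or NULL.
 * This is where a walk over a stale anon_vma becomes harmless: if the
 * vma never mapped this page, no pte matches and we just skip it.
 */
static pte_t *pte_maps_page(struct page *page, struct mm_struct *mm,
			    unsigned long address, spinlock_t **ptlp)
{
	pgd_t *pgd;
	pud_t *pud;
	pmd_t *pmd;
	pte_t *pte;
	spinlock_t *ptl;

	pgd = pgd_offset(mm, address);
	if (!pgd_present(*pgd))
		return NULL;
	pud = pud_offset(pgd, address);
	if (!pud_present(*pud))
		return NULL;
	pmd = pmd_offset(pud, address);
	if (!pmd_present(*pmd))
		return NULL;

	pte = pte_offset_map(pmd, address);
	ptl = pte_lockptr(mm, pmd);
	spin_lock(ptl);
	/* Re-check under the lock: only a present, matching pte counts. */
	if (pte_present(*pte) && page_to_pfn(page) == pte_pfn(*pte)) {
		*ptlp = ptl;
		return pte;
	}
	pte_unmap_unlock(pte, ptl);
	return NULL;
}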
* Re: [patch] mm: fix anon_vma races 2008-10-20 15:17 ` Linus Torvalds @ 2008-10-20 18:21 ` Hugh Dickins 2008-10-21 2:56 ` Nick Piggin 1 sibling, 0 replies; 52+ messages in thread From: Hugh Dickins @ 2008-10-20 18:21 UTC (permalink / raw) To: Linus Torvalds; +Cc: Peter Zijlstra, Nick Piggin, Linux Memory Management List On Mon, 20 Oct 2008, Linus Torvalds wrote: > On Mon, 20 Oct 2008, Hugh Dickins wrote: > > > > When you say "to the point where we don't need to care about anything > > else", are you there agreeing with Nick that your smp_wmb() and > > smp_read_barrier_depends() are no longer needed? > > Yes. The anon_vma only has two fields: the spinlock itself, and the list. > And with all list allocations being inside the spinlock, and with the > spinlock itself being a memory barrier, I'm now convinced that the worry > about memory ordering was unnecessary. Okay, thanks, that's a relief. I'm afraid that once a barrier discussion comes up and we insert them, then I become dazedly paranoid and it's very hard to shake me from seeing a need for barriers everywhere, including a barrier before and after every barrier ad infinitum to make sure they're really barriers. I still get a twinge of anxiety seeing anon_vma_prepare()'s unlocked if (unlikely(!anon_vma)) { since it looks like the kind of thing that can be a problem. But on reflection, I guess there are lots and lots of places where we do such opportunistic checks before going the slow path taking the lock. > > Well, not unnecessary, because I think the discussion was good, and I > think we fixed another bug, Yes, that was a valuable find, which Nick's ctor aberration led us to. Though whether it ever bit anyone, I doubt. We did have a spate of anon_vma corruptions 2.5 years ago, but I think they were just one manifestation of some more general slab corruption, don't match this. > but the smp_wmb++smp_read_barrier_depends does > seem to be a non-issue in this path. > > > But this is all _irrelevant_ : the tricky test to worry about in > > page_lock_anon_vma() is of page_mapped() i.e. does this page currently > > have any ptes in userspace, not of page_mapping() or page->mapping. > > I'm not arguing for removing the page_mapped() we have now, I'm just > wondering about the one Nick wanted to add at the end. Oh, that, sorry I didn't realize - but there again, it was well worth my writing it down again, else I wouldn't have corrected my embarrassingly mistaken goahead to Nick on moving the check. [snipped a lot of good understanding of how it works] > So what I'm trying to figure out is why Nick wanted to add another check > for page_mapped(). I'm not seeing what it is supposed to protect against. I think it's a matter of mental comfort, or good interface design. You're right that it will make no actual difference to what happens in its two sole callers page_referenced_anon() and try_to_unmap_anon(), beyond adding an extra branch to short-circuit a futile search which would already terminate after the first iteration (each loop already has a page_mapped test, to get out a.s.a.p. if the list is very long). But (particularly because he didn't realize it could happen: I put no comment there beyond "tricky") he thinks it would be better to know that when you emerge from a successful page_lock_anon_vma(), then the anon_vma you have is indeed still the right one for the page (as you noted, we do assume caller holds a reference on page). 
One might argue that a comment would be better than a runtime test: so long as page_lock_anon_vma() is a static function with just those two callers. In writing this, another barrier anxiety crossed my mind. I've made a big deal of checking page_mapped after getting rcu_read_lock, but now I wonder if another barrier is needed for that. Documentation/memory-barriers.txt "LOCKING FUNCTIONS" groups RCU along with spin locks in discussing their semipermeable characteristics, so I guess no extra barrier needed; but it does work rather differently. In CLASSIC_RCU the preempt_disable() has a compiler barrier() but not any processor *mb(). As I understand it, that's fine because if page->_mapcount was loaded before the preempt_disable and we don't preempt before the preempt_disable, then so what, that's okay; and if we are preempted immediately before the preempt_disable, then all the business of context switch is sure to reload it again after. In PREEMPT_RCU? I don't know, that's some study I've never got around to; but I think you and Peter will know whether it's good. Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch] mm: fix anon_vma races 2008-10-20 15:17 ` Linus Torvalds 2008-10-20 18:21 ` Hugh Dickins @ 2008-10-21 2:56 ` Nick Piggin 2008-10-21 3:25 ` Linus Torvalds 1 sibling, 1 reply; 52+ messages in thread From: Nick Piggin @ 2008-10-21 2:56 UTC (permalink / raw) To: Linus Torvalds Cc: Hugh Dickins, Peter Zijlstra, Nick Piggin, Linux Memory Management List On Tuesday 21 October 2008 02:17, Linus Torvalds wrote: > On Mon, 20 Oct 2008, Hugh Dickins wrote: > > But in the case of anon pages, what page->mapping points to may be > > volatile, in the sense that that memory might at some point get reused > > for a different anon_vma, or the slab page below it get freed and > > reused for a different purpose completely: that's what we have to > > careful of in the case of anon pages, and it's RCU and the > > page_mapped() test which guard that. > > .. and I'm not worried about the slab page. It's stable, since we hold the > RCU read-lock. No worries about that one either. > > It's the "struct page" itself - the one we use for lookup in > page_lock_anon_vma(). And I'm worried about the need for *re-doing* the > page_mapped() test. > > The problem that makes "page_lock_anon_vma()" such a total disaster is > that yes, it does locking, but it does locking _after_ the lookup, and the > lock doesn't actually protect any of the data that it is using for the > lookup itself. > > And yes, we have various tricks to try to make the data "safe" even if we > race with the lookup, like the RCU stability of the anon_vma allocation, > so that even if we race, we don't do anything bad. And I don't worry about > the anon_vma, that part looks fine. > > But page_remove_rmap() is run totally unlocked wrt page_lock_anon_vma(), > it looks like. page_remove_rmap() is run under the pte lock, and > page_lock_anon_vma() is run under no lock at all that I can see that is > reliable. > > So yes, we have the same kind of "delay the destroy" wrt page->mapping (ie > page_remove_rmap() doesn't actually clear page->mapping at all), but none > of this runs under any lock at all. > > So as far as I can tell, the only reason "page_lock_anon_vma()" is safe is > because we hopefully always hold an elevated page count when we call it, > so at least the struct page itself will never get freed, so page->mapping > is safe just because it's not cleared. > > But assuming all that is true, it boils down to this: > > - the anon_vma we get may be the wrong one or a stale one (since > page_remove_rmap ran concurrently and we ended up freeing the > anon_vma), but it's always a "valid" anon_vma in the sense that it's > initialized and the list is always pointing to *some* stable set of > vm_area_struct's. > > - if we raced, and the anon_vma is stale, we're going to walk over > some bogus list that pertains to a totally different page, but WHO > REALLY CARES? If it is about another page that got that anon_vma > reallocated to it, all the code that traverses the list of vma's still > has to check the page tables _anyway_. And that will never trigger, and > that will get the pte lock for checking anyway, so at _that_ point do > we correctly finally synchronize with a lock that is actually relevant. > > - the "anon_vma->lock" is totally irrelevant wrt "page_mapped()", and I'm > not seeing what it could possibly help to re-check after getting that > lock. > > So what I'm trying to figure out is why Nick wanted to add another check > for page_mapped(). I'm not seeing what it is supposed to protect against. 
It's not supposed to protect against anything that would be a problem in the existing code (well, I initially thought it might be, but Hugh explained why it's not needed). I'd still like to put the check in, in order to constrain this peculiarity of SLAB_DESTROY_BY_RCU to those couple of functions which allocate or take a reference. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch] mm: fix anon_vma races 2008-10-21 2:56 ` Nick Piggin @ 2008-10-21 3:25 ` Linus Torvalds 2008-10-21 4:33 ` Nick Piggin 2008-10-21 4:34 ` Nick Piggin 0 siblings, 2 replies; 52+ messages in thread From: Linus Torvalds @ 2008-10-21 3:25 UTC (permalink / raw) To: Nick Piggin Cc: Hugh Dickins, Peter Zijlstra, Nick Piggin, Linux Memory Management List On Tue, 21 Oct 2008, Nick Piggin wrote: > > > > So what I'm trying to figure out is why Nick wanted to add another check > > for page_mapped(). I'm not seeing what it is supposed to protect against. > > It's not supposed to protect against anything that would be a problem > in the existing code (well, I initially thought it might be, but Hugh > explained why it's not needed). I'd still like to put the check in, in > order to constrain this peculiarity of SLAB_DESTROY_BY_RCU to those > couple of functions which allocate or take a reference. Hmm. Ok, as long as I understand what it is for (and if it's not a bug-fix but a "like to drop the stale anon_vma early"), I'm ok. So I won't mind, and Hugh seems to prefer it. So if you send that patch along with a good explanation for a changelog entry, I'll apply it. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch] mm: fix anon_vma races 2008-10-21 3:25 ` Linus Torvalds @ 2008-10-21 4:33 ` Nick Piggin 2008-10-21 12:58 ` Hugh Dickins 2008-10-21 15:59 ` Christoph Lameter 2008-10-21 4:34 ` Nick Piggin 1 sibling, 2 replies; 52+ messages in thread From: Nick Piggin @ 2008-10-21 4:33 UTC (permalink / raw) To: Linus Torvalds Cc: Nick Piggin, Hugh Dickins, Peter Zijlstra, Linux Memory Management List On Mon, Oct 20, 2008 at 08:25:54PM -0700, Linus Torvalds wrote: > > > On Tue, 21 Oct 2008, Nick Piggin wrote: > > > > > > So what I'm trying to figure out is why Nick wanted to add another check > > > for page_mapped(). I'm not seeing what it is supposed to protect against. > > > > It's not supposed to protect against anything that would be a problem > > in the existing code (well, I initially thought it might be, but Hugh > > explained why its not needed). I'd still like to put the check in, in > > order to constrain this peculiarity of SLAB_DESTROY_BY_RCU to those > > couple of functions which allocate or take a reference. > > Hmm. Ok, as long as I understand what it is for (and if it's not a > bug-fix but a "like to drop the stale anon_vma early), I'm ok. > > So I won't mind, and Hugh seems to prefer it. So if you send that patch > alogn with a good explanation for a changelog entry, I'll apply it. Well something like this, then. Hugh? -- With the existing SLAB_DESTROY_BY_RCU scheme for anon_vma, page_lock_anon_vma might take the lock of the anon_vma at a point where it has already been freed then re-allocated and reused for something else. This is OK (with the exception of the now-fixed case where newly allocated anon_vma had its list manipulated without holding the lock), because in order to get to the pte, the page tables must be walked and the pte confirmed to point to this page anyway. So technically it should work. The problem with it is that it is quite subtle, and it means that we have to keep this stale-anon_vma problem in the back of our minds, when reviewing or modifying any part of the anonymous rmap code. It *could* be that it would break some otherwise legitimate change to the code. Add another page_mapped check to weed out these anon_vmas. Comment the existing page_mapped check a little bit. Signed-off-by: Nick Piggin <npiggin@suse.de> --- Index: linux-2.6/mm/rmap.c =================================================================== --- linux-2.6.orig/mm/rmap.c +++ linux-2.6/mm/rmap.c @@ -200,11 +200,47 @@ struct anon_vma *page_lock_anon_vma(stru anon_mapping = (unsigned long) page->mapping; if (!(anon_mapping & PAGE_MAPPING_ANON)) goto out; + + /* + * The page_mapped check is required in order to ensure anon_vma is + * protected under this RCU critical section before we touch it. + * + * If page_mapped was not checked, page->mapping may refer to an + * anon_vma that has since been freed (see page_remove_rmap comment not + * resetting PageAnon). And hence it would not be protected with RCU + * and could be freed and reused at any time. + */ if (!page_mapped(page)) goto out; anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON); spin_lock(&anon_vma->lock); + + /* + * If the page is no longer mapped, we have no way to keep the anon_vma + * stable. It may be freed and even re-allocated for some other set of + * anonymous mappings at any point. Technically this should be OK, as + * we hold the spinlock, and should be able to tolerate finding + * unrelated vmas on our list. However we'd rather nip these in the bud + * here, for simplicity. 
+ * + * If the page is mapped while we have the lock on the anon_vma, then + * we know anon_vma_unlink can't run and garbage collect the anon_vma: + * unmapping the page and decrementing its mapcount happens before + * unlinking the anon_vma; unlinking the anon_vma requires the + * anon_vma lock to be held. So this check ensures we have a stable + * anon_vma. + * + * Note: the page can still become unmapped, and the !page_mapped + * condition become true at any point. This check is definitely not + * preventing any such thing. + */ + if (unlikely(!page_mapped(page))) { + spin_unlock(&anon_vma->lock); + goto out; + } + VM_BUG_ON(anon_mapping != (unsigned long)page->mapping); + return anon_vma; out: rcu_read_unlock(); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
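The ordering the new comment relies on is visible from the teardown side. A condensed sketch, with the real call chain (zap_pte_range(), then free_pgtables()/remove_vma()) elided and names as of this era's tree - treat it as an assumption-laden outline, not the literal code:

	/* munmap/exit side, in this order: */
	/* under the pte lock: */
	page_remove_rmap(page);		/* the page's mapcount drops here */
	/* ... later, after the pte lock is dropped: */
	anon_vma_unlink(vma);		/* takes anon_vma->lock, removes the vma,
					 * frees the anon_vma once its list is empty */

So while anon_vma->lock is held and page_mapped(page) is still true, the anon_vma_unlink() that could free this anon_vma has not finished, and cannot finish until the lock is released.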
* Re: [patch] mm: fix anon_vma races 2008-10-21 4:33 ` Nick Piggin @ 2008-10-21 12:58 ` Hugh Dickins 2008-10-21 15:59 ` Christoph Lameter 1 sibling, 0 replies; 52+ messages in thread From: Hugh Dickins @ 2008-10-21 12:58 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, Nick Piggin, Peter Zijlstra, Linux Memory Management List On Tue, 21 Oct 2008, Nick Piggin wrote: > On Mon, Oct 20, 2008 at 08:25:54PM -0700, Linus Torvalds wrote: > > On Tue, 21 Oct 2008, Nick Piggin wrote: > > > > > > > > So what I'm trying to figure out is why Nick wanted to add another check > > > > for page_mapped(). I'm not seeing what it is supposed to protect against. > > > > > > It's not supposed to protect against anything that would be a problem > > > in the existing code (well, I initially thought it might be, but Hugh > > > explained why its not needed). I'd still like to put the check in, in > > > order to constrain this peculiarity of SLAB_DESTROY_BY_RCU to those > > > couple of functions which allocate or take a reference. > > > > Hmm. Ok, as long as I understand what it is for (and if it's not a > > bug-fix but a "like to drop the stale anon_vma early), I'm ok. > > > > So I won't mind, and Hugh seems to prefer it. So if you send that patch (I'd prefer just a comment myself, but I do see Nick's point of view.) > > alogn with a good explanation for a changelog entry, I'll apply it. > > Well something like this, then. Hugh? Yes, that's good. I'm not a huge fan of such comments that dwarf the code, but I've rather disqualified myself by going to the opposite extreme, and it's hard to get over in less, and I think all you've said is correct and relevant. Would it be better without the VM_BUG_ON on page->mapping? I find that a bit distracting and of no interest, though I guess it's a further way of clarifying the assumptions. You could as well add a VM_BUG_ON(!page_count(page)), but I don't really want that. Thanks, Hugh > > -- > > With the existing SLAB_DESTROY_BY_RCU scheme for anon_vma, page_lock_anon_vma > might take the lock of the anon_vma at a point where it has already been freed > then re-allocated and reused for something else. > > This is OK (with the exception of the now-fixed case where newly allocated > anon_vma had its list manipulated without holding the lock), because in order > to get to the pte, the page tables must be walked and the pte confirmed to > point to this page anyway. So technically it should work. > > The problem with it is that it is quite subtle, and it means that we have to > keep this stale-anon_vma problem in the back of our minds, when reviewing or > modifying any part of the anonymous rmap code. It *could* be that it would > break some otherwise legitimate change to the code. > > Add another page_mapped check to weed out these anon_vmas. Comment the > existing page_mapped check a little bit. > > Signed-off-by: Nick Piggin <npiggin@suse.de> > --- > Index: linux-2.6/mm/rmap.c > =================================================================== > --- linux-2.6.orig/mm/rmap.c > +++ linux-2.6/mm/rmap.c > @@ -200,11 +200,47 @@ struct anon_vma *page_lock_anon_vma(stru > anon_mapping = (unsigned long) page->mapping; > if (!(anon_mapping & PAGE_MAPPING_ANON)) > goto out; > + > + /* > + * The page_mapped check is required in order to ensure anon_vma is > + * protected under this RCU critical section before we touch it. 
> + * > + * If page_mapped was not checked, page->mapping may refer to an > + * anon_vma that has since been freed (see page_remove_rmap comment not > + * resetting PageAnon). And hence it would not be protected with RCU > + * and could be freed and reused at any time. > + */ > if (!page_mapped(page)) > goto out; > > anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON); > spin_lock(&anon_vma->lock); > + > + /* > + * If the page is no longer mapped, we have no way to keep the anon_vma > + * stable. It may be freed and even re-allocated for some other set of > + * anonymous mappings at any point. Technically this should be OK, as > + * we hold the spinlock, and should be able to tolerate finding > + * unrelated vmas on our list. However we'd rather nip these in the bud > + * here, for simplicity. > + * > + * If the page is mapped while we have the lock on the anon_vma, then > + * we know anon_vma_unlink can't run and garbage collect the anon_vma: > + * unmapping the page and decrementing its mapcount happens before > + * unlinking the anon_vma; unlinking the anon_vma requires the > + * anon_vma lock to be held. So this check ensures we have a stable > + * anon_vma. > + * > + * Note: the page can still become unmapped, and the !page_mapped > + * condition become true at any point. This check is definitely not > + * preventing any such thing. > + */ > + if (unlikely(!page_mapped(page))) { > + spin_unlock(&anon_vma->lock); > + goto out; > + } > + VM_BUG_ON(anon_mapping != (unsigned long)page->mapping); > + > return anon_vma; > out: > rcu_read_unlock(); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch] mm: fix anon_vma races 2008-10-21 4:33 ` Nick Piggin 2008-10-21 12:58 ` Hugh Dickins @ 2008-10-21 15:59 ` Christoph Lameter 2008-10-22 9:29 ` Nick Piggin 1 sibling, 1 reply; 52+ messages in thread From: Christoph Lameter @ 2008-10-21 15:59 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, Nick Piggin, Hugh Dickins, Peter Zijlstra, Linux Memory Management List Nick Piggin wrote: > if (!page_mapped(page)) > goto out; > > anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON); Isn't it possible for the anon_vma to be freed and reallocated for another use after this statement and before taking the lock? > spin_lock(&anon_vma->lock); Then we may take the spinlock on another anon_vma structure not related to this page. > + > + /* > + * If the page is no longer mapped, we have no way to keep the anon_vma > + * stable. It may be freed and even re-allocated for some other set of > + * anonymous mappings at any point. Technically this should be OK, as > + * we hold the spinlock, and should be able to tolerate finding > + * unrelated vmas on our list. However we'd rather nip these in the bud > + * here, for simplicity. > + * > + * If the page is mapped while we have the lock on the anon_vma, then > + * we know anon_vma_unlink can't run and garbage collect the anon_vma: > + * unmapping the page and decrementing its mapcount happens before > + * unlinking the anon_vma; unlinking the anon_vma requires the > + * anon_vma lock to be held. So this check ensures we have a stable > + * anon_vma. > + * > + * Note: the page can still become unmapped, and the !page_mapped > + * condition become true at any point. This check is definitely not > + * preventing any such thing. > + */ What is this then? An optimization? > + if (unlikely(!page_mapped(page))) { > + spin_unlock(&anon_vma->lock); > + goto out; > + } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch] mm: fix anon_vma races 2008-10-21 15:59 ` Christoph Lameter @ 2008-10-22 9:29 ` Nick Piggin 0 siblings, 0 replies; 52+ messages in thread From: Nick Piggin @ 2008-10-22 9:29 UTC (permalink / raw) To: Christoph Lameter Cc: Linus Torvalds, Nick Piggin, Hugh Dickins, Peter Zijlstra, Linux Memory Management List On Tue, Oct 21, 2008 at 10:59:40AM -0500, Christoph Lameter wrote: > Nick Piggin wrote: > > > > if (!page_mapped(page)) > > goto out; > > > > anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON); > > Isnt it possible for the anon_vma to be freed and reallocated for another use > after this statement and before taking the lock? Yes. > > spin_lock(&anon_vma->lock); > > Then we may take the spinlock on another anon_vma structure not related to > this page. Yes. > > + > > + /* > > + * If the page is no longer mapped, we have no way to keep the anon_vma > > + * stable. It may be freed and even re-allocated for some other set of > > + * anonymous mappings at any point. Technically this should be OK, as > > + * we hold the spinlock, and should be able to tolerate finding > > + * unrelated vmas on our list. However we'd rather nip these in the bud > > + * here, for simplicity. > > + * > > + * If the page is mapped while we have the lock on the anon_vma, then > > + * we know anon_vma_unlink can't run and garbage collect the anon_vma: > > + * unmapping the page and decrementing its mapcount happens before > > + * unlinking the anon_vma; unlinking the anon_vma requires the > > + * anon_vma lock to be held. So this check ensures we have a stable > > + * anon_vma. > > + * > > + * Note: the page can still become unmapped, and the !page_mapped > > + * condition become true at any point. This check is definitely not > > + * preventing any such thing. > > + */ > > What is this then? An optimization? As the comment says, it filters out those unrelated anon_vmas. In doing so it allows us to guarantee the reference with the lock alone (as-per the next patch). Also just means we don't have to care about that case (even though it's not technically wrong). > > > + if (unlikely(!page_mapped(page))) { > > + spin_unlock(&anon_vma->lock); > > + goto out; > > + } > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
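Stepping back, the contract Christoph and Nick are circling is the general SLAB_DESTROY_BY_RCU one (this mirrors the mm/slab.c comment; the cache name, object type and revalidation test below are invented for illustration):

	cachep = kmem_cache_create("example", sizeof(struct obj), 0,
				   SLAB_DESTROY_BY_RCU, obj_ctor);

	rcu_read_lock();
	obj = unlocked_lookup(key);	/* obj may have been freed and recycled,
					 * but the memory remains a struct obj
					 * whose lock the ctor has initialized */
	spin_lock(&obj->lock);
	if (!obj_matches(obj, key)) {	/* must revalidate identity under the lock;
					 * for anon_vma this is page_mapped() */
		spin_unlock(&obj->lock);
		obj = NULL;
	}
	rcu_read_unlock();

The flag defers freeing of the slab page, not of individual objects, so an RCU-side lookup can never touch unmapped memory but can land on a recycled object - hence the revalidation step.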
* Re: [patch] mm: fix anon_vma races 2008-10-21 3:25 ` Linus Torvalds 2008-10-21 4:33 ` Nick Piggin @ 2008-10-21 4:34 ` Nick Piggin 2008-10-21 13:55 ` Hugh Dickins 1 sibling, 1 reply; 52+ messages in thread From: Nick Piggin @ 2008-10-21 4:34 UTC (permalink / raw) To: Linus Torvalds Cc: Nick Piggin, Hugh Dickins, Peter Zijlstra, Linux Memory Management List On Mon, Oct 20, 2008 at 08:25:54PM -0700, Linus Torvalds wrote: > > > On Tue, 21 Oct 2008, Nick Piggin wrote: > > > > > > So what I'm trying to figure out is why Nick wanted to add another check > > > for page_mapped(). I'm not seeing what it is supposed to protect against. > > > > It's not supposed to protect against anything that would be a problem > > in the existing code (well, I initially thought it might be, but Hugh > > explained why its not needed). I'd still like to put the check in, in > > order to constrain this peculiarity of SLAB_DESTROY_BY_RCU to those > > couple of functions which allocate or take a reference. > > Hmm. Ok, as long as I understand what it is for (and if it's not a > bug-fix but a "like to drop the stale anon_vma early), I'm ok. > > So I won't mind, and Hugh seems to prefer it. So if you send that patch > alogn with a good explanation for a changelog entry, I'll apply it. And after that patch, I *think* we should be able to do something like this. -- With the change to return only stable, non-empty anon_vmas from page_lock_anon_vma, we no longer have to hold off RCU while looking at the anon_vma. After this change, the lockless referencing, and interesting SLAB_DESTROY_BY_RCU behaviour is pretty well localised to page_lock_anon_vma and anon_vma_prepare. Today, for normal RCU, this doesn't matter much. For preemptible RCU and preemptible anon_vma lock, this change could help with keeping RCU ticking. It could also help if we ever wanted to add a sleeping lock to anon_vma. Basically just fewer nested dependencies ~= more flexible and maintainable. Signed-off-by: Nick Piggin <npiggin@suse.de> --- Index: linux-2.6/mm/rmap.c =================================================================== --- linux-2.6.orig/mm/rmap.c +++ linux-2.6/mm/rmap.c @@ -239,6 +239,8 @@ struct anon_vma *page_lock_anon_vma(stru spin_unlock(&anon_vma->lock); goto out; } + rcu_read_unlock(); + VM_BUG_ON(anon_mapping != (unsigned long)page->mapping); return anon_vma; @@ -250,7 +252,6 @@ out: void page_unlock_anon_vma(struct anon_vma *anon_vma) { spin_unlock(&anon_vma->lock); - rcu_read_unlock(); } /* -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch] mm: fix anon_vma races 2008-10-21 4:34 ` Nick Piggin @ 2008-10-21 13:55 ` Hugh Dickins 0 siblings, 0 replies; 52+ messages in thread From: Hugh Dickins @ 2008-10-21 13:55 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, Christoph Lameter, Nick Piggin, Peter Zijlstra, Linux Memory Management List On Tue, 21 Oct 2008, Nick Piggin wrote: > On Mon, Oct 20, 2008 at 08:25:54PM -0700, Linus Torvalds wrote: > > And after that patch, I *think* we should be able to do something like > this. > > -- > With the change to return only stable, non-empty anon_vmas from > page_lock_anon_vma, we no longer have to hold off RCU while looking at > the anon_vma. After this change, the lockless referencing, and interesting > SLAB_DESTROY_BY_RCU behaviour is pretty well localised to page_lock_anon_vma > and anon_vma_prepare. > > Today, for normal RCU, this doesn't matter much. For preemptible RCU and > preemptible anon_vma lock, this change could help with keeping RCU ticking. > It could also help if we ever wanted to add a sleeping lock to anon_vma. > Basically just fewer nested dependencies ~= more flexible and maintainable. > > Signed-off-by: Nick Piggin <npiggin@suse.de> Interesting. That's how it used to be originally (and we just did the spin_unlock directly without any page_unlock_anon_vma wrapper). I rather liked keeping the RCU trickery in the one function. But it worried ChristophL that way (and caused the -rt tree problems?): eventually he persuaded me to allow the patch moving rcu_read_unlock() after the spin_unlock(). I think he was seeing the same point that you are seeing, when you say that this can come (only) after your patch checking page_mapped i.e. anon_vma stability after getting the spinlock. Since I only knew classic RCU in which rcu_read_lock is preempt_disable, and a spin_lock does preempt_disable, it was all theoretical to me. I like this patch, but let's see how Christoph feels about it. Hugh > --- > Index: linux-2.6/mm/rmap.c > =================================================================== > --- linux-2.6.orig/mm/rmap.c > +++ linux-2.6/mm/rmap.c > @@ -239,6 +239,8 @@ struct anon_vma *page_lock_anon_vma(stru > spin_unlock(&anon_vma->lock); > goto out; > } > + rcu_read_unlock(); > + > VM_BUG_ON(anon_mapping != (unsigned long)page->mapping); > > return anon_vma; > @@ -250,7 +252,6 @@ out: > void page_unlock_anon_vma(struct anon_vma *anon_vma) > { > spin_unlock(&anon_vma->lock); > - rcu_read_unlock(); > } > > /* -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
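As a usage sketch (schematic loop body; the real callers are try_to_unmap_anon() and page_referenced_anon()), the caller-side contract after this patch would be:

	anon_vma = page_lock_anon_vma(page);
	if (!anon_vma)
		return;
	/* the RCU read-side section already ended inside page_lock_anon_vma();
	 * only anon_vma->lock is held across the walk */
	list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
		/* per-vma work; pte lock taken as needed */
	}
	page_unlock_anon_vma(anon_vma);		/* now a bare spin_unlock */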
* Re: [patch] mm: fix anon_vma races 2008-10-19 18:25 ` Linus Torvalds ` (2 preceding siblings ...) 2008-10-20 4:03 ` Hugh Dickins @ 2008-10-21 2:44 ` Nick Piggin 3 siblings, 0 replies; 52+ messages in thread From: Nick Piggin @ 2008-10-21 2:44 UTC (permalink / raw) To: Linus Torvalds, stable Cc: Peter Zijlstra, Hugh Dickins, Nick Piggin, Linux Memory Management List On Monday 20 October 2008 05:25, Linus Torvalds wrote: > On Sun, 19 Oct 2008, Peter Zijlstra wrote: > > Part of the confusion is that we don't clear those pointers at the end > > of their lifetimes (page_remove_rmap and anon_vma_unlink). > > > > I guess the !page_mapping() test in page_lock_anon_vma() is meant to > > deal with this > > Hmm. So that part I'm still not entirely convinced about. > > The thing is, we have two issues on anon_vma usage, and the > page_lock_anon_vma() usage in particular: > > - the integrity of the list itself > > Here it should be sufficient to just always get the lock, to the point > where we don't need to care about anything else. So getting the lock > properly on new allocation makes all the other races irrelevant. > > - the integrity of the _result_ of traversing the list > > This is what the !page_mapping() thing is supposedly protecting > against, I think. > > But as far as I can tell, there's really two different use cases here: > (a) people who care deeply about the result and (b) people who don't. > > And the difference between the two cases is whether they had the page > locked or not. The "try_to_unmap()" callers care deeply, and lock the > page. In contrast, some "page_referenced()" callers (really just > shrink_active_list) don't care deeply, and to them the return value is > really just a heuristic. > > As far as I can tell, all the people who care deeply will lock the page > (and _have_ to lock the page), and thus 'page->mapping' should be stable > for those cases. > > And then we have the other cases, who just want a heuristic, and they > don't hold the page lock, but if we look at the wrong active_vma that has > gotten reallocated to something else, they don't even really care. > > So I'm not seeing the reason for that check for page_mapped() at the end. > Does it actually protect against anything relevant? > > Anyway, I _think_ the part that everybody agrees about is the initial > locking of the anon_vma. Whether we then even need any memory barriers > and/or the page_mapped() check is an independent question. Yes? No? > > So I'm suggesting this commit as the part we at least all agree on. But I > haven't pushed it out yet, so you can still holler.. But I think all the > discussion is about other issues, and we all agree on at least this part? > > Linus > > --- > From f422f2ec50872331820f15711f48b9ffc9cbb64e Mon Sep 17 00:00:00 2001 > From: Linus Torvalds <torvalds@linux-foundation.org> > Date: Sun, 19 Oct 2008 10:32:20 -0700 > Subject: [PATCH] anon_vma_prepare: properly lock even newly allocated > entries > > The anon_vma code is very subtle, and we end up doing optimistic lookups > of anon_vmas under RCU in page_lock_anon_vma() with no locking. Other > CPU's can also see the newly allocated entry immediately after we've > exposed it by setting "vma->anon_vma" to the new value. > > We protect against the anon_vma being destroyed by having the SLAB > marked as SLAB_DESTROY_BY_RCU, so the RCU lookup can depend on the > allocation not being destroyed - but it might still be free'd and > re-allocated here to a new vma. 
> > As a result, we should not do the anon_vma list ops on a newly allocated > vma without proper locking. Thanks, this is exactly what I proposed. I preferred to add more explicit comments about why it's OK to expose anon_vma to vma->anon_vma without ordering initialisation etc. But those comments were mainly for the benefit of others. If you and Hugh now agree with me on that, then maybe no comments are required. > > Acked-by: Nick Piggin <npiggin@suse.de> > Acked-by: Hugh Dickins <hugh@veritas.com> > Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> > --- > mm/rmap.c | 42 ++++++++++++++++++++++++++++++++---------- > 1 files changed, 32 insertions(+), 10 deletions(-) > > diff --git a/mm/rmap.c b/mm/rmap.c > index 0383acf..e8d639b 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -55,7 +55,33 @@ > > struct kmem_cache *anon_vma_cachep; > > -/* This must be called under the mmap_sem. */ > +/** > + * anon_vma_prepare - attach an anon_vma to a memory region > + * @vma: the memory region in question > + * > + * This makes sure the memory mapping described by 'vma' has > + * an 'anon_vma' attached to it, so that we can associate the > + * anonymous pages mapped into it with that anon_vma. > + * > + * The common case will be that we already have one, but if > + * if not we either need to find an adjacent mapping that we > + * can re-use the anon_vma from (very common when the only > + * reason for splitting a vma has been mprotect()), or we > + * allocate a new one. > + * > + * Anon-vma allocations are very subtle, because we may have > + * optimistically looked up an anon_vma in page_lock_anon_vma() > + * and that may actually touch the spinlock even in the newly > + * allocated vma (it depends on RCU to make sure that the > + * anon_vma isn't actually destroyed). > + * > + * As a result, we need to do proper anon_vma locking even > + * for the new allocation. At the same time, we do not want > + * to do any locking for the common case of already having > + * an anon_vma. > + * > + * This must be called with the mmap_sem held for reading. > + */ > int anon_vma_prepare(struct vm_area_struct *vma) > { > struct anon_vma *anon_vma = vma->anon_vma; > @@ -63,20 +89,17 @@ int anon_vma_prepare(struct vm_area_struct *vma) > might_sleep(); > if (unlikely(!anon_vma)) { > struct mm_struct *mm = vma->vm_mm; > - struct anon_vma *allocated, *locked; > + struct anon_vma *allocated; > > anon_vma = find_mergeable_anon_vma(vma); > - if (anon_vma) { > - allocated = NULL; > - locked = anon_vma; > - spin_lock(&locked->lock); > - } else { > + allocated = NULL; > + if (!anon_vma) { > anon_vma = anon_vma_alloc(); > if (unlikely(!anon_vma)) > return -ENOMEM; > allocated = anon_vma; > - locked = NULL; > } > + spin_lock(&anon_vma->lock); > > /* page_table_lock to protect against threads */ > spin_lock(&mm->page_table_lock); > @@ -87,8 +110,7 @@ int anon_vma_prepare(struct vm_area_struct *vma) > } > spin_unlock(&mm->page_table_lock); > > - if (locked) > - spin_unlock(&locked->lock); > + spin_unlock(&anon_vma->lock); > if (unlikely(allocated)) > anon_vma_free(allocated); > } > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
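For reference, applying that diff leaves anon_vma_prepare() looking roughly like this (reconstructed mechanically from the hunks above; the long comment block is trimmed):

	int anon_vma_prepare(struct vm_area_struct *vma)
	{
		struct anon_vma *anon_vma = vma->anon_vma;

		might_sleep();
		if (unlikely(!anon_vma)) {
			struct mm_struct *mm = vma->vm_mm;
			struct anon_vma *allocated;

			anon_vma = find_mergeable_anon_vma(vma);
			allocated = NULL;
			if (!anon_vma) {
				anon_vma = anon_vma_alloc();
				if (unlikely(!anon_vma))
					return -ENOMEM;
				allocated = anon_vma;
			}
			spin_lock(&anon_vma->lock);

			/* page_table_lock to protect against threads */
			spin_lock(&mm->page_table_lock);
			if (likely(!vma->anon_vma)) {
				vma->anon_vma = anon_vma;
				list_add_tail(&vma->anon_vma_node, &anon_vma->head);
				allocated = NULL;
			}
			spin_unlock(&mm->page_table_lock);

			spin_unlock(&anon_vma->lock);
			if (unlikely(allocated))
				anon_vma_free(allocated);
		}
		return 0;
	}

The list manipulation now always happens under anon_vma->lock, whether the anon_vma was merged from a neighbour or freshly allocated.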
* Re: [patch] mm: fix anon_vma races 2008-10-18 2:25 ` Nick Piggin 2008-10-18 2:35 ` Nick Piggin 2008-10-18 2:53 ` Linus Torvalds @ 2008-10-18 19:14 ` Hugh Dickins 2008-10-19 3:03 ` Nick Piggin 2 siblings, 1 reply; 52+ messages in thread From: Hugh Dickins @ 2008-10-18 19:14 UTC (permalink / raw) To: Nick Piggin; +Cc: Linus Torvalds, Peter Zijlstra, Linux Memory Management List Sorry, I've only just got back to this, and just had one quick read through the thread. A couple of points on what I believe so far... On Sat, 18 Oct 2008, Nick Piggin wrote: > On Fri, Oct 17, 2008 at 07:11:38PM -0700, Linus Torvalds wrote: > > On Sat, 18 Oct 2008, Nick Piggin wrote: > > > > > > We can't do that, unfortuantely, because anon_vmas are allocated with > > > SLAB_DESTROY_BY_RCU. > > > > Aughh. I see what you're saying. We don't _free_ them by RCU, we just > > destroy the page allocation. So an anon_vma can get _re-allocated_ for > > another page (without being destroyed), concurrently with somebody > > optimistically being busy with that same anon_vma that they got through > > that optimistic 'page_lock_anon_vma()' thing. > > > > So if we were to just set the lock, we might actually be messing with > > something that is still actively used by the previous page that was > > unmapped concurrently and still being accessed by try_to_unmap_anon. So > > even though we allocated a "new" anon_vma, it might still be busy. > > > > Yes? No? Yes, I believe Linus is right; but need to mull it over some more. That not-taking-the-lock-on-newly-allocated optimization comes from Andrea, and dates from before I removed the page_map_lock(). It looks like an optimization that was valid in Andrea's original, but something I should have removed when going to SLAB_DESTROY_BY_RCU. > > That's what I'm thinking, yes. But I admit the last time I looked at > this really closely was probably reading through Hugh's patches and > changelogs (which at the time must have convinced me ;)). So I could > be wrong. > > > > That thing really is too subtle for words. But if that's actually what you > > are alluding to, then doesn't that mean that we _really_ should be doing > > that "spin_lock(&anon_vma->lock)" even for the first allocation, and that > > the current code is broken? Because otherwise that other concurrent user > > that found the stale vma through page_lock_anon_vma() will now try to > > follow the linked list and _think_ it's stable (thanks to the lock), but > > we're actually inserting entries into it without holding any locks at all. > > Yes, that's what I meant by "has other problems". Another thing is also > that even if we have the lock here, I can't see why page_lock_anon_vma > is safe against finding an anon_vma which has been deallocated then > allocated for something else (and had vmas inserted into it etc.). And Nick is right that page_lock_anon_vma() is not safe against finding an anon_vma which has now been allocated for something else: but that is no surprise, it's very much in the nature of SLAB_DESTROY_BY_RCU (I left most of the comment in mm/slab.c, just said "tricky" here). It should be no problem: having locked the right-or-perhaps-wrong anon_vma, we then go on to search its list for a page which may or may not be there, even when it's the right anon_vma; there's no need for special code to deal with the very unlikely case that we've now got an irrelevant list, it's just that the page we're looking for won't be found in it. But not-taking-the-lock-on-newly-allocated does then look wrong. 
Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch] mm: fix anon_vma races 2008-10-18 19:14 ` Hugh Dickins @ 2008-10-19 3:03 ` Nick Piggin 2008-10-19 7:07 ` Hugh Dickins 0 siblings, 1 reply; 52+ messages in thread From: Nick Piggin @ 2008-10-19 3:03 UTC (permalink / raw) To: Hugh Dickins; +Cc: Linus Torvalds, Peter Zijlstra, Linux Memory Management List On Sat, Oct 18, 2008 at 08:14:12PM +0100, Hugh Dickins wrote: > And Nick is right that page_lock_anon_vma() is not safe against finding > an anon_vma which has now been allocated for something else: but that > is no surprise, it's very much in the nature of SLAB_DESTROY_BY_RCU > (I left most of the comment in mm/slab.c, just said "tricky" here). > > It should be no problem: having locked the right-or-perhaps-wrong > anon_vma, we then go on to search its list for a page which may or > may not be there, even when it's the right anon_vma; there's no need OK, so it may be correct but I think that's pretty nasty if we can avoid it so easily. Then we have to keep in mind this special case throughout the code rather than just confining it to the low level take-a-reference function and never having to worry about it. > for special code to deal with the very unlikely case that we've now > got an irrelevant list, it's just that the page we're looking for > won't be found in it. There is already a page_mapped check in there. I'm just going to propose we move that down. No extra branches in the fastpath. OK? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch] mm: fix anon_vma races 2008-10-19 3:03 ` Nick Piggin @ 2008-10-19 7:07 ` Hugh Dickins 2008-10-20 3:26 ` Hugh Dickins 0 siblings, 1 reply; 52+ messages in thread From: Hugh Dickins @ 2008-10-19 7:07 UTC (permalink / raw) To: Nick Piggin; +Cc: Linus Torvalds, Peter Zijlstra, Linux Memory Management List On Sun, 19 Oct 2008, Nick Piggin wrote: > On Sat, Oct 18, 2008 at 08:14:12PM +0100, Hugh Dickins wrote: > > And Nick is right that page_lock_anon_vma() is not safe against finding > > an anon_vma which has now been allocated for something else: but that > > is no surprise, it's very much in the nature of SLAB_DESTROY_BY_RCU > > (I left most of the comment in mm/slab.c, just said "tricky" here). > > > > It should be no problem: having locked the right-or-perhaps-wrong > > anon_vma, we then go on to search its list for a page which may or > > may not be there, even when it's the right anon_vma; there's no need > > OK, so it may be correct but I think that's pretty nasty if we can > avoid it so easily. Then we have to keep in mind this special case > throughout the code rather than just confining it to the low level > take-a-reference function and never having to worry about it. > > > > for special code to deal with the very unlikely case that we've now > > got an irrelevant list, it's just that the page we're looking for > > won't be found in it. > > There is already a page_mapped check in there. I'm just going to > propose we move that down. No extra branches in the fastpath. OK? That should be OK, yes. Looking back at the history, I believe I sited the page_mapped test where it is, partly for simpler flow, and partly to avoid overhead of taking spinlock unnecessarily. But that's much too rare a case to be worrying about: better to leave you comfortable with page_lock_anon_vma() itself. More important will be the comments you add, I'll probably want to quibble on those! Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch] mm: fix anon_vma races 2008-10-19 7:07 ` Hugh Dickins @ 2008-10-20 3:26 ` Hugh Dickins 2008-10-21 2:45 ` Nick Piggin 0 siblings, 1 reply; 52+ messages in thread From: Hugh Dickins @ 2008-10-20 3:26 UTC (permalink / raw) To: Nick Piggin; +Cc: Linus Torvalds, Peter Zijlstra, Linux Memory Management List On Sun, 19 Oct 2008, Hugh Dickins wrote: > On Sun, 19 Oct 2008, Nick Piggin wrote: > > > > There is already a page_mapped check in there. I'm just going to > > propose we move that down. No extra branches in the fastpath. OK? > > That should be OK, yes. Looking back at the history, I believe > I sited the page_mapped test where it is, partly for simpler flow, > and partly to avoid overhead of taking spinlock unnecessarily. Arrgh! What terrible advice I gave you there, completely wrong: that's what happens when I rush a reply instead of thinking. I'm three-quarters through replying to Linus on this, and going into that detail, remember now why its placement is critical. Repeat the page_mapped check before returning if you wish, but do not remove the one that's there: see other mail for explanation. Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch] mm: fix anon_vma races 2008-10-20 3:26 ` Hugh Dickins @ 2008-10-21 2:45 ` Nick Piggin 0 siblings, 0 replies; 52+ messages in thread From: Nick Piggin @ 2008-10-21 2:45 UTC (permalink / raw) To: Hugh Dickins Cc: Nick Piggin, Linus Torvalds, Peter Zijlstra, Linux Memory Management List On Monday 20 October 2008 14:26, Hugh Dickins wrote: > On Sun, 19 Oct 2008, Hugh Dickins wrote: > > On Sun, 19 Oct 2008, Nick Piggin wrote: > > > There is already a page_mapped check in there. I'm just going to > > > propose we move that down. No extra branches in the fastpath. OK? > > > > That should be OK, yes. Looking back at the history, I believe > > I sited the page_mapped test where it is, partly for simpler flow, > > and partly to avoid overhead of taking spinlock unnecessarily. > > Arrgh! What terrible advice I gave you there, completely wrong: > that's what happens when I rush a reply instead of thinking. > > I'm three-quarters through replying to Linus on this, and going > into that detail, remember now why its placement is critical. > > Repeat the page_mapped check before returning if you wish, > but do not remove the one that's there: see other mail for > explanation. Right, thanks for that. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch] mm: fix anon_vma races 2008-10-18 0:42 ` Linus Torvalds 2008-10-18 1:08 ` Linus Torvalds @ 2008-10-19 1:13 ` Hugh Dickins 2008-10-19 2:41 ` Nick Piggin 1 sibling, 1 reply; 52+ messages in thread From: Hugh Dickins @ 2008-10-19 1:13 UTC (permalink / raw) To: Linus Torvalds; +Cc: Peter Zijlstra, Nick Piggin, Linux Memory Management List On Fri, 17 Oct 2008, Linus Torvalds wrote: > On Fri, 17 Oct 2008, Linus Torvalds wrote: > > > > But I think that what Nick did is correct - we always start traversal > > through anon_vma->head, so no, the "list_add_tail()" won't expose it to > > anybody else, because nobody else has seen the anon_vma(). > > > > That said, that's really too damn subtle. We shouldn't rely on memory > > ordering for the list handling, when the list handling is _supposed_ to be > > using that anon_vma->lock thing. > > So maybe a better patch would be as follows? It simplifies the whole thing > by just always locking and unlocking the vma, whether it's newly allocated > or not (and whether it then gets dropped as unnecessary or not). > > It still does that "smp_read_barrier_depends()" in the same old place. I > don't have the energy to look at Hugh's point about people reading > anon_vma without doing the whole "prepare" thing. > > It adds more lines than it removes, but it's just because of the comments. > With the locking simplification, it actually removes more lines of actual > code than it adds. And now we always do that list_add_tail() with the > anon_vma lock held, which should simplify thinking about this, and avoid > at least one subtle ordering issue. > > Linus I'm slowly approaching the conclusion that this is the only patch which is needed here. The newly-allocated "locked = NULL" mis-optimization still looks wrong to me in the face of SLAB_DESTROY_BY_RCU, and you kill that. You also address Nick's second point about barriers: you've arranged them differently, but I don't think that matters; or the smp_wmb() could go into the "allocated = anon_vma" block, couldn't it? that would reduce its overhead a little. (If we needed more than an Alpha-barrier in the common path, then I'd look harder for a way to avoid it more often, but it'll do as is.) I thought for a while that even the barriers weren't needed, because the only thing mmap.c and memory.c do with anon_vma (until they've up_readed mmap_sem and down_writed it to rearrange vmas) is note its address. Then I found one exception, the use of anon_vma_lock() in expand_downwards() and expand_upwards() (it's not really being used as an anon_vma lock, just as a convenient spinlock to serialize concurrent stack faults for a moment): but I don't think that could ever actually need the barriers, the anon_vma for the stack should be well-established before there can be any racing threads on it. But at last I realized the significant exception is right there in anon_vma_prepare(): the spin_lock(&anon_vma->lock) of an anon_vma coming back from find_mergeable_anon_vma() does need that lock to be visibly initialized - that is the clinching case for barriers. That leaves Nick's original point, of the three CPUs with the third doing reclaim, with my point about needing smp_read_barrier_depends() over there. I now think those races were illusory, that we were all overlooking something. 
Reclaim (or page migration) doesn't arrive at those pages by scanning the old mem_map[] array, it gets them off an LRU list, whose spinlock is locked and unlocked to take them off; and the original faulting CPU had to lock and unlock that spinlock to put them on the LRU originally, at a stage after its anon_vma_prepare(). Surely we have enough barriers there to make sure that anon_vma->lock is visibly initialized by the time page_lock_anon_vma() tries to take it? And it's not any kind of coincidence: isn't this a general pattern, that a newly initialized structure containing a lock is made available to other threads such as reclaim, and they can rely on that lock being visibly initialized, because the structure is made available to them by being put onto and examined on some separately locked list or tree? Nick, are you happy with Linus's patch below? Or if not, please explain again what's missing - thanks. Hugh > > --- > mm/rmap.c | 33 +++++++++++++++++++++++---------- > 1 files changed, 23 insertions(+), 10 deletions(-) > > diff --git a/mm/rmap.c b/mm/rmap.c > index 0383acf..9221bf7 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -63,35 +63,48 @@ int anon_vma_prepare(struct vm_area_struct *vma) > might_sleep(); > if (unlikely(!anon_vma)) { > struct mm_struct *mm = vma->vm_mm; > - struct anon_vma *allocated, *locked; > + struct anon_vma *allocated; > > anon_vma = find_mergeable_anon_vma(vma); > - if (anon_vma) { > - allocated = NULL; > - locked = anon_vma; > - spin_lock(&locked->lock); > - } else { > + allocated = NULL; > + if (!anon_vma) { > anon_vma = anon_vma_alloc(); > if (unlikely(!anon_vma)) > return -ENOMEM; > allocated = anon_vma; > - locked = NULL; > } > + spin_lock(&anon_vma->lock); > > /* page_table_lock to protect against threads */ > spin_lock(&mm->page_table_lock); > if (likely(!vma->anon_vma)) { > + /* > + * We hold the mm->page_table_lock, but another > + * CPU may be doing an optimistic load (the one > + * at the top), and we want to make sure that > + * the anon_vma changes are visible. > + */ > + smp_wmb(); > vma->anon_vma = anon_vma; > list_add_tail(&vma->anon_vma_node, &anon_vma->head); > allocated = NULL; > } > spin_unlock(&mm->page_table_lock); > - > - if (locked) > - spin_unlock(&locked->lock); > + spin_unlock(&anon_vma->lock); > if (unlikely(allocated)) > anon_vma_free(allocated); > } > + /* > + * Subtle: we looked up anon_vma without any locking > + * (in the comon case), and are going to look at the > + * spinlock etc behind it. In order to know that it's > + * initialized, we need to do a read barrier here. > + * > + * We can use the cheaper "depends" version, since we > + * are following a pointer, and only on alpha may that > + * give a stale value. > + */ > + smp_read_barrier_depends(); > return 0; > } > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch] mm: fix anon_vma races 2008-10-19 1:13 ` Hugh Dickins @ 2008-10-19 2:41 ` Nick Piggin 2008-10-19 9:45 ` Hugh Dickins 0 siblings, 1 reply; 52+ messages in thread From: Nick Piggin @ 2008-10-19 2:41 UTC (permalink / raw) To: Hugh Dickins; +Cc: Linus Torvalds, Peter Zijlstra, Linux Memory Management List On Sun, Oct 19, 2008 at 02:13:06AM +0100, Hugh Dickins wrote: > On Fri, 17 Oct 2008, Linus Torvalds wrote: > > On Fri, 17 Oct 2008, Linus Torvalds wrote: > > > > > > But I think that what Nick did is correct - we always start traversal > > > through anon_vma->head, so no, the "list_add_tail()" won't expose it to > > > anybody else, because nobody else has seen the anon_vma(). > > > > > > That said, that's really too damn subtle. We shouldn't rely on memory > > > ordering for the list handling, when the list handling is _supposed_ to be > > > using that anon_vma->lock thing. > > > > So maybe a better patch would be as follows? It simplifies the whole thing > > by just always locking and unlocking the vma, whether it's newly allocated > > or not (and whether it then gets dropped as unnecessary or not). > > > > It still does that "smp_read_barrier_depends()" in the same old place. I > > don't have the energy to look at Hugh's point about people reading > > anon_vma without doing the whole "prepare" thing. > > > > It adds more lines than it removes, but it's just because of the comments. > > With the locking simplification, it actually removes more lines of actual > > code than it adds. And now we always do that list_add_tail() with the > > anon_vma lock held, which should simplify thinking about this, and avoid > > at least one subtle ordering issue. > > > > Linus > > I'm slowly approaching the conclusion that this is the only patch > which is needed here. > > The newly-allocated "locked = NULL" mis-optimization still looks > wrong to me in the face of SLAB_DESTROY_BY_RCU, and you kill that. > > You also address Nick's second point about barriers: you've arranged > them differently, but I don't think that matters; or the smp_wmb() > could go into the "allocated = anon_vma" block, couldn't it? that > would reduce its overhead a little. (If we needed more than an > Alpha-barrier in the common path, then I'd look harder for a way > to avoid it more often, but it'll do as is.) > > I thought for a while that even the barriers weren't needed, because > the only thing mmap.c and memory.c do with anon_vma (until they've > up_readed mmap_sem and down_writed it to rearrange vmas) is note its > address. Then I found one exception, the use of anon_vma_lock() > in expand_downwards() and expand_upwards() (it's not really being > used as an anon_vma lock, just as a convenient spinlock to serialize > concurrent stack faults for a moment): but I don't think that could > ever actually need the barriers, the anon_vma for the stack should > be well-established before there can be any racing threads on it. > > But at last I realized the significant exception is right there in > anon_vma_prepare(): the spin_lock(&anon_vma->lock) of an anon_vma > coming back from find_mergeable_anon_vma() does need that lock to > be visibly initialized - that is the clinching case for barriers. > > That leaves Nick's original point, of the three CPUs with the third > doing reclaim, with my point about needing smp_read_barrier_depends() > over there. I now think those races were illusory, that we were all > overlooking something. 
Reclaim (or page migration) doesn't arrive at > those pages by scanning the old mem_map[] array, it gets them off an > LRU list, whose spinlock is locked and unlocked to take them off; and > the original faulting CPU had to lock and unlock that spinlock to put > them on the LRU originally, at a stage after its anon_vma_prepare(). > > Surely we have enough barriers there to make sure that anon_vma->lock > is visibly initialized by the time page_lock_anon_vma() tries to take > it? I don't think so. According to the race, the stores to anon_vma from the first process would not arrive at the reclaimer in time. There is no amount of barriers the reclaimer can perform to change this. But anyway all that goes away if we just use locking properly as per my last patch (see below). > And it's not any kind of coincidence: isn't this a general pattern, > that a newly initialized structure containing a lock is made available > to other threads such as reclaim, and they can rely on that lock being > visibly initialized, because the structure is made available to them > by being put onto and examined on some separately locked list or tree? > > Nick, are you happy with Linus's patch below? > Or if not, please explain again what's missing - thanks. No, I don't agree with the need for barriers and I think Linus had also not worked through the whole problem at this point because of the just-in-case barriers and hoping to initialise the newly allocated lock as locked, as an optimisation. With my patch, the rules are simple: anybody who wants to look in the anon_vma or modify the anon_vma must take the anon_vma->lock. Then there are no problems with ordering, and no need for any explicit barriers. I don't understand your point about requiring ->lock to be initialised coming from find_mergeable_anon_vma. Why? All the critical memory operations get ordered inside the lock AFAIKS (I haven't actually applied Linus' patch to see what the result is, but that should be the case with my patch). So, no. I still think my patch is the right one. This btw. is modulo the final hunk of my last patch. That was supposed to address the different, perceived issue I saw with getting an old anon_vma. I still think that's pretty ugly even if correct. It would be nice just to add the single branch (or better, just move one down) to avoid having to think about it. I will submit a separate patch for that. Nick -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch] mm: fix anon_vma races 2008-10-19 2:41 ` Nick Piggin @ 2008-10-19 9:45 ` Hugh Dickins 2008-10-21 3:59 ` Nick Piggin 0 siblings, 1 reply; 52+ messages in thread From: Hugh Dickins @ 2008-10-19 9:45 UTC (permalink / raw) To: Nick Piggin; +Cc: Linus Torvalds, Peter Zijlstra, Linux Memory Management List I'm fairly lost by now in all this, suffering from barrier sickness, and we're not understanding each other very well. I'll have a try. On Sun, 19 Oct 2008, Nick Piggin wrote: > On Sun, Oct 19, 2008 at 02:13:06AM +0100, Hugh Dickins wrote: > > > > That leaves Nick's original point, of the three CPUs with the third > > doing reclaim, with my point about needing smp_read_barrier_depends() > > over there. I now think those races were illusory, that we were all > > overlooking something. Reclaim (or page migration) doesn't arrive at > > those pages by scanning the old mem_map[] array, it gets them off an > > LRU list, whose spinlock is locked and unlocked to take them off; and > > the original faulting CPU had to lock and unlock that spinlock to put > > them on the LRU originally, at a stage after its anon_vma_prepare(). > > > > Surely we have enough barriers there to make sure that anon_vma->lock > > is visibly initialized by the time page_lock_anon_vma() tries to take > > it? > > I don't think so. According to the race, the stores to anon_vma from > the first process would not arrive at the reclaimer in time. There > is no amount of barriers the reclaimer can perform to change this. No amount of barriers the reclaimer can do alone, the other end needs the complementary barriers. I'm suggesting the lock-unlock pair when page is put on LRU provides that (though I do recall that unlock-lock is stronger), in association with reclaimer's lock when it goes to take page from LRU. I don't grasp where I'm going wrong on that; but you're more interested in asserting that it's irrelevant by now anyway, and I'll probably be prepared to accept that. > But > anyway all that goes away if we just use locking properly as per my > last patch (see below). I don't see your last patch below, so presume you're referring to the previous one you posted (and your "modulo final hunk" remark bears that out). I probably gave it less attention than it deserved amidst all the other flurry of discussion. There is no dispute between us over locking the newly allocated anon_vma in anon_vma_prepare(): I think all three of us came to suggest that independently, and we all agree there are separate reasons why it's essential. It's what Linus did in the patch which I was commending but you're demurring from. I think you're saying that with just that change to anon_vma_prepare(), we've then no need for the smp_wmb() and smp_read_barrier_depends() he retained from his first version. I'm afraid my barrier sickness has reached that advanced stage in which I can no longer tell whether that's obviously true or obviously false: like those perspective cube outlines you can switch either way in your mind, I see it both ways and feel very very stupid. You'll have to settle that with Linus. > > Nick, are you happy with Linus's patch below? > > Or if not, please explain again what's missing - thanks. > > No, I don't agree with the need for barriers and I think Linus had also > not worked through the whole problem at this point because of the > just-in-case barriers and hoping to initialise the newly allocated lock > as locked, as an optimisation.
> > With my patch, the rules are simple: anybody who wants to look in the > anon_vma or modify the anon_vma must take the anon_vma->lock. Then there > are no problems with ordering, and no need for any explicit barriers. But we still have the case where one caller sails through anon_vma_prepare() seeing vma->anon_vma set, before the initialization of that struct anon_vma (and in particular the lock) is visible to it. I think you're saying, hell, we don't need separate steps to make a lock visible, that would be intolerable: locking the lock makes it visible. So all we have to do is not skip the locking of it in the newly allocated case. If Linus is persuaded, then so am I. > > I don't understand your point about requiring ->lock to be initialised > coming from find_mergeable_anon_vma. Why? All the critical memory operations > get ordered inside the lock AFAIKS (I haven't actually applied Linus' > patch to see what the result is, but that should be the case with my > patch). When find_mergeable_anon_vma returns the anon_vma from an adjacent vma which could be merged with this one, that's the one case (setting aside extend_stack) where this faulting CPU will want to access the content of a struct anon_vma which may have been initialized by another CPU - to lock it and add vma to its list - rather than just use its address. > > So, no. I still think my patch is the right one. > > > This btw. is modulo the final hunk of my last patch. That was supposed to > address the different, percieved issue I saw with getting an old anon_vma. > I still think that's pretty ugly even if correct. It would be nice just > to add the single branch (or better, just move one down) to avoid having > to think about it. I will submit a separate patch for that. Yes, I think we've now understood and agreed on that part. Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch] mm: fix anon_vma races 2008-10-19 9:45 ` Hugh Dickins @ 2008-10-21 3:59 ` Nick Piggin 0 siblings, 0 replies; 52+ messages in thread From: Nick Piggin @ 2008-10-21 3:59 UTC (permalink / raw) To: Hugh Dickins Cc: Nick Piggin, Linus Torvalds, Peter Zijlstra, Linux Memory Management List On Sunday 19 October 2008 20:45, Hugh Dickins wrote: > I'm fairly lost by now in all this, suffering from barrier sickness, > and we're not understanding each other very well. I'll have a try. Not sure if you're still worried about this I'll just try to answer anyway. Ignore it if you like ;) > On Sun, 19 Oct 2008, Nick Piggin wrote: > > On Sun, Oct 19, 2008 at 02:13:06AM +0100, Hugh Dickins wrote: > > > That leaves Nick's original point, of the three CPUs with the third > > > doing reclaim, with my point about needing smp_read_barrier_depends() > > > over there. I now think those races were illusory, that we were all > > > overlooking something. Reclaim (or page migration) doesn't arrive at > > > those pages by scanning the old mem_map[] array, it gets them off an > > > LRU list, whose spinlock is locked and unlocked to take them off; and > > > the original faulting CPU had to lock and unlock that spinlock to put > > > them on the LRU originally, at a stage after its anon_vma_prepare(). > > > > > > Surely we have enough barriers there to make sure that anon_vma->lock > > > is visibly initialized by the time page_lock_anon_vma() tries to take > > > it? > > > > I don't think so. According to the race, the stores to anon_vma from > > the first process would not arrive at the reclaimer in time. There > > is no amount of barriers the reclaimer can perform to change this. > > No amount of barriers the reclaimer can do alone, the other end needs > the complementary barriers. I'm suggesting the lock-unlock pair when > page is put on LRU provides that (though I do recall that unlock-lock > is stronger), in association with reclaimer's lock when when it goes > to take page from LRU. > > I don't grasp where I'm going wrong on that; but you're more interested > in asserting that it's irrelevant by now anyway, and I'll probably be > prepared to accept that. Well it's just that CPU1 is what puts the page on the LRU, but CPU0 is the one who's stores we need to order. CPU0 anon_vma->initialised = 1; vma->anon_vma = vma; CPU1 anon_vma = vma->anon_vma; page->anon_vma = anon_vma; spin_lock(lru_lock); list_add(page, lru); spin_unlock(lru_lock); CPU2 spin_lock(lru_lock); anon_vma = page->anon_vma; So CPU2 definitely would see page->anon_vma to be what CPU1 set it to. Locks provide that much ordering, of course. But it's CPU0's ordering which is what matters -- CPU2 could still see anon_vma->initialised == 0 So CPU0 needs smp_wmb() between those. > I think you're saying that with just that change to anon_vma_prepare(), > we've then no need for the smp_wmb() and smp_read_barrier_depends() he > retained from his first version. I'm afraid my barrier sickness has > reached that advanced stage in which I can no long tell whether that's > obviously true or obviously false: like those perspective cube outlines > you can switch either way in your mind, I see it both ways and feel > very very stupid. You'll have to settle that with Linus. I think you may have worked though it with Linus? However going back to my example above: if CPU0 were to hold anon_vma->lock around its assignments, and CPU2 were to take the lock before checking initialised, then the wmb() would not be required. 
It could still be the case that the store to vma->anon_vma becomes
visible first (and so CPU1 could assign the pointer to a page->mapping).
However, by the time anybody is allowed to look inside the anon_vma,
initialised must be 1 (the spin_unlock must not be visible until _all_
prior stores are).

> > > Nick, are you happy with Linus's patch below?
> > > Or if not, please explain again what's missing - thanks.
> >
> > No, I don't agree with the need for barriers, and I think Linus had
> > also not worked through the whole problem at this point, because of
> > the just-in-case barriers and hoping to initialise the newly allocated
> > lock as locked, as an optimisation.
> >
> > With my patch, the rules are simple: anybody who wants to look in the
> > anon_vma or modify the anon_vma must take the anon_vma->lock. Then
> > there are no problems with ordering, and no need for any explicit
> > barriers.
>
> But we still have the case where one caller sails through
> anon_vma_prepare() seeing vma->anon_vma set, before the initialization
> of that struct anon_vma (and in particular the lock) is visible to it.
>
> I think you're saying, hell, we don't need separate steps to make a
> lock visible, that would be intolerable: locking the lock makes it
> visible. So all we have to do is not skip the locking of it in the
> newly allocated case. If Linus is persuaded, then so am I.

The lock definitely gets initialised by CPU0, by the ctor. So it would
be wrong to allow the anon_vma to become visible and have CPU2 try to
lock a possibly uninitialised lock.

But lock barriers say that subsequent stores are not allowed to become
visible before the store to acquire the lock, and normal (obvious)
cache coherency rules say that the stores to initialise the lock must
not come after the store to lock it. So the store to vma->anon_vma
could not be visible before the lock is locked.

Inside the critical section, things can get jumbled around as usual,
but if you only ever care about those orderings from within the same
lock, everything is guaranteed to be visible.

> > I don't understand your point about requiring ->lock to be initialised
> > coming from find_mergeable_anon_vma. Why? All the critical memory
> > operations get ordered inside the lock AFAIKS (I haven't actually
> > applied Linus' patch to see what the result is, but that should be the
> > case with my patch).
>
> When find_mergeable_anon_vma returns the anon_vma from an adjacent vma
> which could be merged with this one, that's the one case (setting aside
> extend_stack) where this faulting CPU will want to access the content
> of a struct anon_vma which may have been initialized by another CPU -
> to lock it and add vma to its list - rather than just use its address.

OK. But the mergeable anon_vma checks don't actually look inside the
anon_vma AFAIKS. It will end up taking the lock (and hence having the
same memory ordering guarantees as CPU2).
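For contrast, the barrier-based fix from the original patch makes CPU0's
ordering explicit instead of hiding it behind the lock. A sketch in the
same style (the LRU steps are the ones from the CPU1/CPU2 example above;
"initialised" is still the stand-in field):

	/* CPU0: the publishing store must not pass the init stores */
	anon_vma->initialised = 1;	/* really the slab ctor's stores */
	smp_wmb();			/* order init before publish */
	vma->anon_vma = anon_vma;

	/* CPU1: the lru_lock orders CPU1's own stores, not CPU0's */
	anon_vma = vma->anon_vma;
	page->anon_vma = anon_vma;
	spin_lock(lru_lock);
	list_add(page, lru);
	spin_unlock(lru_lock);

	/*
	 * CPU2: the load of anon_vma's contents depends on the load of
	 * the pointer; pairing with the smp_wmb() needs
	 * smp_read_barrier_depends() (in practice only Alpha reorders
	 * dependent loads).
	 */
	spin_lock(lru_lock);
	anon_vma = page->anon_vma;
	spin_unlock(lru_lock);
	smp_read_barrier_depends();
	/* now anon_vma->initialised must read as 1 */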