* [patch] mm: fix anon_vma races
@ 2008-10-16 4:10 Nick Piggin
2008-10-17 22:14 ` Hugh Dickins
2008-10-17 23:13 ` Peter Zijlstra
0 siblings, 2 replies; 52+ messages in thread
From: Nick Piggin @ 2008-10-16 4:10 UTC (permalink / raw)
To: Hugh Dickins, Linux Memory Management List
Hi,
I'd still like independent confirmation of these problems and the fix, but
it looks buggy to me...
---
There are some races in the anon_vma code.
The race comes about because adding our vma to the tail of anon_vma->head comes
after the assignment of vma->anon_vma to a new anon_vma pointer. In the case
where this is a new anon_vma, this is done without holding any locks. So a
parallel anon_vma_prepare might see the vma already has an anon_vma, so it
won't serialise on the page_table_lock. It may proceed to attach the anon_vma to a
page in the page fault path. Another thread may then pick up the page from the
LRU list, find its mapcount incremented, and attempt to iterate over the
anon_vma's list concurrently with the first thread (because the first one is
not holding the anon_vma lock). This is a fairly subtle race, and only likely
to be hit in kernels where the spinlock is preemptible and the first thread is
preempted at the right time... but OTOH it is _possible_ to hit it; on bigger
SMP systems cacheline transfer latencies could be very large, or we could take
an NMI inside the lock or something. Fix this by initialising the list before
adding the anon_vma to vma.
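
To make the interleaving concrete, here is a schematic sketch (illustrative
only, not the actual rmap.c source) of how the unlocked publication can be
observed:

	/*
	 * CPU0: anon_vma_prepare()          CPU1: parallel fault, same vma
	 * ------------------------          -----------------------------
	 * anon_vma = anon_vma_alloc();
	 * vma->anon_vma = anon_vma;
	 *                                   if (vma->anon_vma)   // non-NULL,
	 *                                           return 0;    // so skips the
	 *                                                        // page_table_lock
	 *                                   ...
	 *                                   page_add_anon_rmap(page, vma, address);
	 *                                   // the page now points at an anon_vma
	 *                                   // whose list is still being set up
	 * list_add_tail(&vma->anon_vma_node,
	 *               &anon_vma->head);
	 *
	 * A third CPU doing reclaim can find the page on the LRU and walk
	 * anon_vma->head concurrently with CPU0's insertion, which is done
	 * without the anon_vma lock held.
	 */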
After that, there is a similar data-race with memory ordering where the store
to make the anon_vma visible passes previous stores to initialize the anon_vma.
This race also includes stores to initialize the anon_vma spinlock by the
slab constructor. Add and comment appropriate barriers to solve this.
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -81,8 +81,15 @@ int anon_vma_prepare(struct vm_area_stru
 		/* page_table_lock to protect against threads */
 		spin_lock(&mm->page_table_lock);
 		if (likely(!vma->anon_vma)) {
-			vma->anon_vma = anon_vma;
 			list_add_tail(&vma->anon_vma_node, &anon_vma->head);
+			/*
+			 * This smp_wmb() is required to order all previous
+			 * stores to initialize the anon_vma (by the slab
+			 * ctor) and add this vma, with the store to make it
+			 * visible to other CPUs via vma->anon_vma.
+			 */
+			smp_wmb();
+			vma->anon_vma = anon_vma;
 			allocated = NULL;
 		}
 		spin_unlock(&mm->page_table_lock);
@@ -91,6 +98,15 @@ int anon_vma_prepare(struct vm_area_stru
 			spin_unlock(&locked->lock);
 		if (unlikely(allocated))
 			anon_vma_free(allocated);
+	} else {
+		/*
+		 * This smp_read_barrier_depends is required to order the data
+		 * dependent loads of fields in anon_vma, with the load of the
+		 * anon_vma pointer vma->anon_vma. This complements the above
+		 * smp_wmb, and prevents a CPU from loading uninitialized
+		 * contents of anon_vma.
+		 */
+		smp_read_barrier_depends();
 	}
 	return 0;
 }

* Re: [patch] mm: fix anon_vma races
  2008-10-16  4:10 [patch] mm: fix anon_vma races Nick Piggin
@ 2008-10-17 22:14 ` Hugh Dickins
  2008-10-17 23:05 ` Linus Torvalds
  2008-10-17 23:13 ` Peter Zijlstra
  1 sibling, 1 reply; 52+ messages in thread
From: Hugh Dickins @ 2008-10-17 22:14 UTC (permalink / raw)
To: Nick Piggin; +Cc: Linus Torvalds, linux-kernel, linux-mm

Added Linus and LKML to the Cc now that I see he's suggesting a similar
but not identical patch under that other thread.

On Thu, 16 Oct 2008, Nick Piggin wrote:
> Hi,
>
> I'd still like independent confirmation of these problems and the fix, but
> it looks buggy to me...
>
> ---
>
> There are some races in the anon_vma code.
>
> The race comes about because adding our vma to the tail of anon_vma->head comes
> after the assignment of vma->anon_vma to a new anon_vma pointer. In the case
> where this is a new anon_vma, this is done without holding any locks. So a
> parallel anon_vma_prepare might see the vma already has an anon_vma, so it
> won't serialise on the page_table_lock. It may proceed to attach the anon_vma
> to a page in the page fault path. Another thread may then pick up the page
> from the LRU list, find its mapcount incremented, and attempt to iterate over
> the anon_vma's list concurrently with the first thread (because the first one
> is not holding the anon_vma lock). This is a fairly subtle race, and only
> likely to be hit in kernels where the spinlock is preemptible and the first
> thread is preempted at the right time... but OTOH it is _possible_ to hit it;
> on bigger SMP systems cacheline transfer latencies could be very large, or we
> could take an NMI inside the lock or something. Fix this by initialising the
> list before adding the anon_vma to vma.
>
> After that, there is a similar data-race with memory ordering where the store
> to make the anon_vma visible passes previous stores to initialize the anon_vma.
> This race also includes stores to initialize the anon_vma spinlock by the
> slab constructor. Add and comment appropriate barriers to solve this.
>
> Signed-off-by: Nick Piggin <npiggin@suse.de>

Ow, you do break my head with these things. I'm pretty sure you're on
to something here, it certainly looks suspicious. But I've not yet
convinced myself that what you (or Linus) propose is the right answer -
I was going to think it through some more, but spotting the other patch
from Linus, I thought I ought at least to introduce you to each other ;)

My problem is really with the smp_read_barrier_depends() you each
have in anon_vma_prepare(). But the only thing which its CPU
does with the anon_vma is put its address into a struct page
(or am I forgetting more?). Wouldn't the smp_read_barrier_depends()
need to be, not there in anon_vma_prepare(), but over on the third
CPU, perhaps in page_lock_anon_vma()?

I've also been toying with the idea of taking the newly alloc'ed
anon_vma's lock here in the !vma->anon_vma case, instead of having
that "locked = NULL" business. That might be a good idea, but I'm
rather thinking it doesn't quite address the issue at hand.

Taking a break...
Hugh

> ---
> Index: linux-2.6/mm/rmap.c
> ===================================================================
> --- linux-2.6.orig/mm/rmap.c
> +++ linux-2.6/mm/rmap.c
> @@ -81,8 +81,15 @@ int anon_vma_prepare(struct vm_area_stru
>  		/* page_table_lock to protect against threads */
>  		spin_lock(&mm->page_table_lock);
>  		if (likely(!vma->anon_vma)) {
> -			vma->anon_vma = anon_vma;
>  			list_add_tail(&vma->anon_vma_node, &anon_vma->head);
> +			/*
> +			 * This smp_wmb() is required to order all previous
> +			 * stores to initialize the anon_vma (by the slab
> +			 * ctor) and add this vma, with the store to make it
> +			 * visible to other CPUs via vma->anon_vma.
> +			 */
> +			smp_wmb();
> +			vma->anon_vma = anon_vma;
>  			allocated = NULL;
>  		}
>  		spin_unlock(&mm->page_table_lock);
> @@ -91,6 +98,15 @@ int anon_vma_prepare(struct vm_area_stru
>  			spin_unlock(&locked->lock);
>  		if (unlikely(allocated))
>  			anon_vma_free(allocated);
> +	} else {
> +		/*
> +		 * This smp_read_barrier_depends is required to order the data
> +		 * dependent loads of fields in anon_vma, with the load of the
> +		 * anon_vma pointer vma->anon_vma. This complements the above
> +		 * smp_wmb, and prevents a CPU from loading uninitialized
> +		 * contents of anon_vma.
> +		 */
> +		smp_read_barrier_depends();
>  	}
>  	return 0;
>  }

* Re: [patch] mm: fix anon_vma races
  2008-10-17 22:14 ` Hugh Dickins
@ 2008-10-17 23:05 ` Linus Torvalds
  2008-10-18  0:13 ` Hugh Dickins
  0 siblings, 1 reply; 52+ messages in thread
From: Linus Torvalds @ 2008-10-17 23:05 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Nick Piggin, linux-kernel, linux-mm

On Fri, 17 Oct 2008, Hugh Dickins wrote:
>
> My problem is really with the smp_read_barrier_depends() you each
> have in anon_vma_prepare(). But the only thing which its CPU
> does with the anon_vma is put its address into a struct page
> (or am I forgetting more?). Wouldn't the smp_read_barrier_depends()
> need to be, not there in anon_vma_prepare(), but over on the third
> CPU, perhaps in page_lock_anon_vma()?

I thought about it, but it's a disaster from a maintenance standpoint to
put it there, rather than make it all clear in the _one_ function that
actually does things optimistically.

I agree that it's a bit subtle the way I did it (haven't seen Nick's
patch, I assume he was upset at me for shouting at him), but that's part
of why I put that comment in there and said things are subtle.

Anyway, technically you're right: the smp_read_barrier_depends() really
would be more obvious in the place where we actually fetch that "anon_vma"
pointer again and actually dereference it.

HOWEVER:

 - there are potentially multiple places that do that, and putting it in
   the anon_vma_prepare() thing not only matches things with the
   smp_wmb(), making that whole pairing much more obvious, but it also
   means that we're guaranteed that any anon_vma user will have done the
   smp_read_barrier_depends(), since they all have to do that prepare
   thing anyway.

   So putting it there is simpler and gives better guarantees, and pairs
   up the barriers better.

 - Now, "simpler" (etc) is no help if it doesn't work, so now I have to
   convince you that it's _sufficient_ to do that "read_barrier_depends()"
   early, even if we then end up re-doing the first read and thus the
   "depends" part doesn't work any more. So "simpler" is all good, but not
   if it's incorrect.

   And I admit it, here my argument is one of implementation. The fact is,
   the only architecture where "read_barrier_depends()" exists at all as
   anything but a no-op is alpha, and there it's a full read barrier. On
   all other architectures, causality implies a read barrier anyway, so
   for them, placement (or non-placement) of the smp_read_barrier_depends
   is a total non-issue.

   And so, since on the only architecture where it could possibly matter,
   that _depends thing turns into a full read barrier, and since
   "anon_vma" is actually stable once written, and since the only
   ordering constraint is that initial ordering of seeing the "anon_vma"
   turn non-NULL, you may as well think of that "read_barrier_depends()"
   as a full read barrier between the _original_ read of the anon_vma
   pointer and then the read of the lock data we want to protect.

   Which it is, on alpha. And that is sufficient. IOW, think of it as a
   real read barrier, with no dependency thing, but one that only happens
   when an architecture doesn't already guarantee the causality barrier.

   And once you think of it as a "smp_rmb() for alpha", you realize that
   it's perfectly ok for it to be where it is.

Anyway, lockless is bad. It would certainly be a *lot* simpler to just
take the page_table_lock around the whole thing, except I think we really
*really* don't want to do that. That thing is solidly in a couple of
*very* timing-critical routines. Doing another lock there is just not an
option.
		Linus

* Re: [patch] mm: fix anon_vma races
  2008-10-17 23:05 ` Linus Torvalds
@ 2008-10-18  0:13 ` Hugh Dickins
  2008-10-18  0:25 ` Linus Torvalds
  2008-10-18  1:53 ` Nick Piggin
  0 siblings, 2 replies; 52+ messages in thread
From: Hugh Dickins @ 2008-10-18 0:13 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Nick Piggin, linux-kernel, linux-mm

On Fri, 17 Oct 2008, Linus Torvalds wrote:
> On Fri, 17 Oct 2008, Hugh Dickins wrote:
> >
> > My problem is really with the smp_read_barrier_depends() you each
> > have in anon_vma_prepare(). But the only thing which its CPU
> > does with the anon_vma is put its address into a struct page
> > (or am I forgetting more?). Wouldn't the smp_read_barrier_depends()
> > need to be, not there in anon_vma_prepare(), but over on the third
> > CPU, perhaps in page_lock_anon_vma()?
>
> I thought about it, but it's a disaster from a maintenance standpoint to
> put it there, rather than make it all clear in the _one_ function that
> actually does things optimistically.
>
> I agree that it's a bit subtle the way I did it (haven't seen Nick's
> patch, I assume he was upset at me for shouting at him), but that's part
> of why I put that comment in there and said things are subtle.

Nick's patch was below in the mail you're replying to - I expect it
looked so much like yours that, at a glance, you thought it was yours -
though there are little differences. (Yes, I think he felt he'd make
more progress by backing away from your harangues and taking cover with
a patch to linux-mm - though surely you were right that his original
ctor angle was mistaken.)

> Anyway, technically you're right: the smp_read_barrier_depends() really
> would be more obvious in the place where we actually fetch that "anon_vma"
> pointer again and actually dereference it.
>
> HOWEVER:
>
>  - there are potentially multiple places that do that, and putting it in
>    the anon_vma_prepare() thing not only matches things with the
>    smp_wmb(), making that whole pairing much more obvious, but it also
>    means that we're guaranteed that any anon_vma user will have done the
>    smp_read_barrier_depends(), since they all have to do that prepare
>    thing anyway.

No, it's not so that any anon_vma user would have done the
smp_read_barrier_depends() placed in anon_vma_prepare().

Anyone faulting in a page would have done it (swapoff? that
assumes it's been done, let's not worry about it right now).

But they're doing it to make the page's ptes accessible to
memory reclaim, and the CPU doing memory reclaim will not
(unless by coincidence) have done that anon_vma_prepare() -
it's just reading the links which the faulters are providing.

But I've given up thought for the night, will leave digesting all
you've written until morning, just wanted to point that out before
sleeping.

Hugh

>    So putting it there is simpler and gives better guarantees, and pairs
>    up the barriers better.
>
>  - Now, "simpler" (etc) is no help if it doesn't work, so now I have to
>    convince you that it's _sufficient_ to do that "read_barrier_depends()"
>    early, even if we then end up re-doing the first read and thus the
>    "depends" part doesn't work any more. So "simpler" is all good, but not
>    if it's incorrect.
>
>    And I admit it, here my argument is one of implementation. The fact is,
>    the only architecture where "read_barrier_depends()" exists at all as
>    anything but a no-op is alpha, and there it's a full read barrier.
>    On all other architectures, causality implies a read barrier anyway,
>    so for them, placement (or non-placement) of the
>    smp_read_barrier_depends is a total non-issue.
>
>    And so, since on the only architecture where it could possibly matter,
>    that _depends thing turns into a full read barrier, and since
>    "anon_vma" is actually stable once written, and since the only
>    ordering constraint is that initial ordering of seeing the "anon_vma"
>    turn non-NULL, you may as well think of that "read_barrier_depends()"
>    as a full read barrier between the _original_ read of the anon_vma
>    pointer and then the read of the lock data we want to protect.
>
>    Which it is, on alpha. And that is sufficient. IOW, think of it as a
>    real read barrier, with no dependency thing, but one that only happens
>    when an architecture doesn't already guarantee the causality barrier.
>
>    And once you think of it as a "smp_rmb() for alpha", you realize that
>    it's perfectly ok for it to be where it is.
>
> Anyway, lockless is bad. It would certainly be a *lot* simpler to just
> take the page_table_lock around the whole thing, except I think we really
> *really* don't want to do that. That thing is solidly in a couple of
> *very* timing-critical routines. Doing another lock there is just not an
> option.
>
> 		Linus

* Re: [patch] mm: fix anon_vma races
  2008-10-18  0:13 ` Hugh Dickins
@ 2008-10-18  0:25 ` Linus Torvalds
  0 siblings, 0 replies; 52+ messages in thread
From: Linus Torvalds @ 2008-10-18 0:25 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Nick Piggin, linux-kernel, linux-mm

On Sat, 18 Oct 2008, Hugh Dickins wrote:
>
> But they're doing it to make the page's ptes accessible to
> memory reclaim, and the CPU doing memory reclaim will not
> (unless by coincidence) have done that anon_vma_prepare() -
> it's just reading the links which the faulters are providing.

Ahh. Good point. Then yes, those ones would really need the
smp_read_barrier_depends() too. Very annoying.

Of course, we _could_ just admit that the situation is really *really*
unlikely, and there are probably something like five people running
Linux/alpha, and that we don't care enough. With just the smp_wmb(), we
cover all non-alpha people, since this is all through a pointer, so all
the readers will inevitably be of the smp_read_barrier_depends kind.

If we want to guarantee the proper smp_read_barrier_depends() behaviour,
then we'd need to find them all. Possibly by renaming the whole field
(prepend an underscore like we usually do) and forcing all readers to use
some "get_anon_vma(vma)" helper function, aka

	static inline struct anon_vma *get_anon_vma(struct vm_area_struct *vma)
	{
		struct anon_vma *anon_vma = vma->_anon_vma;
		smp_read_barrier_depends();
		return anon_vma;
	}

which would find them all.

Ugh. I really would have preferred not having something like that.
Especially considering how limited that issue really is. Hmm.

		Linus

* Re: [patch] mm: fix anon_vma races
  2008-10-18  0:13 ` Hugh Dickins
  2008-10-18  0:25 ` Linus Torvalds
@ 2008-10-18  1:53 ` Nick Piggin
  2008-10-18  2:50 ` Paul Mackerras
  1 sibling, 1 reply; 52+ messages in thread
From: Nick Piggin @ 2008-10-18 1:53 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Linus Torvalds, linux-kernel, linux-mm, linux-arch

On Sat, Oct 18, 2008 at 01:13:16AM +0100, Hugh Dickins wrote:
> On Fri, 17 Oct 2008, Linus Torvalds wrote:
> > would be more obvious in the place where we actually fetch that "anon_vma"
> > pointer again and actually dereference it.
> >
> > HOWEVER:
> >
> >  - there are potentially multiple places that do that, and putting it in
> >    the anon_vma_prepare() thing not only matches things with the
> >    smp_wmb(), making that whole pairing much more obvious, but it also
> >    means that we're guaranteed that any anon_vma user will have done the
> >    smp_read_barrier_depends(), since they all have to do that prepare
> >    thing anyway.
>
> No, it's not so that any anon_vma user would have done the
> smp_read_barrier_depends() placed in anon_vma_prepare().
>
> Anyone faulting in a page would have done it (swapoff? that
> assumes it's been done, let's not worry about it right now).
>
> But they're doing it to make the page's ptes accessible to
> memory reclaim, and the CPU doing memory reclaim will not
> (unless by coincidence) have done that anon_vma_prepare() -
> it's just reading the links which the faulters are providing.

Yes, that's a very important flaw you point out with the fix. Good
spotting.

Actually another thing I was staying awake thinking about was the
pairwise consistency problem. "Apparently" Linux is supposed to support
arbitrary pairwise consistency. This means:

CPU0
	anon_vma->initialized = 1;
	smp_wmb();
	vma->anon_vma = anon_vma;

CPU1
	if (vma->anon_vma)
		page->anon_vma = vma->anon_vma;

CPU2
	if (page->anon_vma) {
		smp_read_barrier_depends();
		assert(page->anon_vma->initialized);
	}

The assertion may trigger because the store from CPU0 may not have
propagated to CPU2 before the stores from CPU1.

But after thinking about this a bit more, I think Linux would be
broken all over the map under such ordering schemes. I think we'd
have to mandate causal consistency. Are there any architectures we
run on where this is not guaranteed? (I think recent clarifications
to x86 ordering give us CC on that architecture).

powerpc, ia64, alpha, sparc, arm, mips? (cced linux-arch)

* Re: [patch] mm: fix anon_vma races
  2008-10-18  1:53 ` Nick Piggin
@ 2008-10-18  2:50 ` Paul Mackerras
  2008-10-18  2:57 ` Linus Torvalds
  2008-10-18  5:49 ` Nick Piggin
  0 siblings, 2 replies; 52+ messages in thread
From: Paul Mackerras @ 2008-10-18 2:50 UTC (permalink / raw)
To: Nick Piggin
Cc: Hugh Dickins, Linus Torvalds, linux-kernel, linux-mm, linux-arch

Nick Piggin writes:

> But after thinking about this a bit more, I think Linux would be
> broken all over the map under such ordering schemes. I think we'd
> have to mandate causal consistency. Are there any architectures we
> run on where this is not guaranteed? (I think recent clarifications
> to x86 ordering give us CC on that architecture).
>
> powerpc, ia64, alpha, sparc, arm, mips? (cced linux-arch)

Not sure what you mean by causal consistency, but I assume it's the
same as saying that barriers give cumulative ordering, as described on
page 413 of the Power Architecture V2.05 document at:

http://www.power.org/resources/reading/PowerISA_V2.05.pdf

The ordering provided by sync, lwsync and eieio is cumulative (see
pages 446 and 448), so we should be OK on powerpc AFAICS. (The
cumulative property of eieio only applies to accesses to normal system
memory, but that should be OK since we use sync when we want barriers
that affect non-cacheable accesses as well as cacheable.)

Paul.

* Re: [patch] mm: fix anon_vma races
  2008-10-18  2:50 ` Paul Mackerras
@ 2008-10-18  2:57 ` Linus Torvalds
  0 siblings, 0 replies; 52+ messages in thread
From: Linus Torvalds @ 2008-10-18 2:57 UTC (permalink / raw)
To: Paul Mackerras
Cc: Nick Piggin, Hugh Dickins, linux-kernel, linux-mm, linux-arch

On Sat, 18 Oct 2008, Paul Mackerras wrote:
>
> Not sure what you mean by causal consistency, but I assume it's the
> same as saying that barriers give cumulative ordering, as described on
> page 413 of the Power Architecture V2.05 document at:

I'm pretty sure that everybody but alpha is ok.

And alpha needs the smp_read_barrier_depends() not because it doesn't
really support causality, but because each CPU internally doesn't
guarantee that it handles the cache invalidates in-order without a
barrier.

So without the smp_read_barrier_depends(), alpha will actually have the
proper causal relationships (cachelines will move to exclusive state on
CPU0 in the right order and others will see the causality), but CPU2 may
see stale data because it has not yet invalidated "anon_vma.initialized" -
its cache invalidation queue hadn't been flushed in order.

Alpha is insane. And the odd man out.

		Linus
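
By way of a compilable userspace analogue of the publish/consume pairing
being discussed (all names here are illustrative; C11's memory_order_consume
is the standard's counterpart of the kernel's dependent-load ordering, and
the release store plays the role of smp_wmb() plus the pointer store):

	#include <stdatomic.h>
	#include <stdio.h>
	#include <stdlib.h>

	struct obj {
		int initialized;	/* stands in for ctor-initialised anon_vma fields */
	};

	static _Atomic(struct obj *) shared;	/* stands in for vma->anon_vma */

	/* Publisher: the release store orders the initialising store before
	 * the pointer becomes visible (the smp_wmb() half of the pairing). */
	static void publish(void)
	{
		struct obj *p = malloc(sizeof(*p));
		p->initialized = 1;
		atomic_store_explicit(&shared, p, memory_order_release);
	}

	/* Consumer: a dependent load through the published pointer. On every
	 * architecture except alpha the address dependency alone orders the
	 * two loads; memory_order_consume (promoted to acquire by current
	 * compilers) makes that explicit, the way smp_read_barrier_depends()
	 * does in the kernel. */
	static void consume(void)
	{
		struct obj *p = atomic_load_explicit(&shared, memory_order_consume);
		if (p)
			printf("%d\n", p->initialized);	/* may only print 1 */
	}

	int main(void)
	{
		publish();	/* single-threaded here; imagine the two calls
				 * running on two different CPUs */
		consume();
		return 0;
	}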

* Re: [patch] mm: fix anon_vma races
  2008-10-18  2:50 ` Paul Mackerras
  2008-10-18  2:57 ` Linus Torvalds
@ 2008-10-18  5:49 ` Nick Piggin
  2008-10-18 10:49 ` Paul Mackerras
  2008-10-18 17:00 ` Linus Torvalds
  1 sibling, 2 replies; 52+ messages in thread
From: Nick Piggin @ 2008-10-18 5:49 UTC (permalink / raw)
To: Paul Mackerras
Cc: Hugh Dickins, Linus Torvalds, linux-kernel, linux-mm, linux-arch

On Sat, Oct 18, 2008 at 01:50:57PM +1100, Paul Mackerras wrote:
> Nick Piggin writes:
>
> > But after thinking about this a bit more, I think Linux would be
> > broken all over the map under such ordering schemes. I think we'd
> > have to mandate causal consistency. Are there any architectures we
> > run on where this is not guaranteed? (I think recent clarifications
> > to x86 ordering give us CC on that architecture).
> >
> > powerpc, ia64, alpha, sparc, arm, mips? (cced linux-arch)
>
> Not sure what you mean by causal consistency, but I assume it's the

I think it can be called transitive. Basically (assume memory starts
off zeroed):

CPU0
	x := 1

CPU1
	if (x == 1) {
		fence
		y := 1
	}

CPU2
	if (y == 1) {
		fence
		assert(x == 1)
	}

As opposed to pairwise, which only provides an ordering of visibility
between any given two CPUs (so the store to x might be propagated to
CPU2 after the store to y, regardless of the fences).

Apparently pairwise ordering is more interesting than just a theoretical
thing, and not just restricted to Alpha's funny caches. It can allow for
arbitrary network propagation of stores / cache coherency between CPUs.
x86's publicly documented memory model supposedly could allow for such
ordering up until a year or so ago (when they clarified and strengthened
it).

> same as saying that barriers give cumulative ordering, as described on
> page 413 of the Power Architecture V2.05 document at:
>
> http://www.power.org/resources/reading/PowerISA_V2.05.pdf
>
> The ordering provided by sync, lwsync and eieio is cumulative (see
> pages 446 and 448), so we should be OK on powerpc AFAICS. (The
> cumulative property of eieio only applies to accesses to normal system
> memory, but that should be OK since we use sync when we want barriers
> that affect non-cacheable accesses as well as cacheable.)

The section on cumulative ordering sounds like it might do the trick.
But I haven't really worked through exactly what it is saying ;)

* Re: [patch] mm: fix anon_vma races
  2008-10-18  5:49 ` Nick Piggin
@ 2008-10-18 10:49 ` Paul Mackerras
  0 siblings, 0 replies; 52+ messages in thread
From: Paul Mackerras @ 2008-10-18 10:49 UTC (permalink / raw)
To: Nick Piggin
Cc: Hugh Dickins, Linus Torvalds, linux-kernel, linux-mm, linux-arch

Nick Piggin writes:

> > Not sure what you mean by causal consistency, but I assume it's the
>
> I think it can be called transitive. Basically (assume memory starts
> off zeroed):
>
> CPU0
> 	x := 1
>
> CPU1
> 	if (x == 1) {
> 		fence
> 		y := 1
> 	}
>
> CPU2
> 	if (y == 1) {
> 		fence
> 		assert(x == 1)
> 	}

That's essentially the same as example 1 on page 415, so yes we are
talking about the same thing.

Paul.

* Re: [patch] mm: fix anon_vma races
  2008-10-18  5:49 ` Nick Piggin
  2008-10-18 10:49 ` Paul Mackerras
@ 2008-10-18 17:00 ` Linus Torvalds
  2008-10-18 18:44 ` Matthew Wilcox
  2008-10-19  2:53 ` Nick Piggin
  1 sibling, 2 replies; 52+ messages in thread
From: Linus Torvalds @ 2008-10-18 17:00 UTC (permalink / raw)
To: Nick Piggin
Cc: Paul Mackerras, Hugh Dickins, linux-kernel, linux-mm, linux-arch

On Sat, 18 Oct 2008, Nick Piggin wrote:
>
> I think it can be called transitive. Basically (assume memory starts
> off zeroed):

Alpha is transitive. It has a notion of "processor issue order" and
"location access order", and the ordering those two create is a
transitive "before" and "after" ordering.

The issue with alpha is not that it wouldn't be transitive - the issue is
that *local* read dependencies do not cause a "processor issue order"!

So the real issue with alpha is not about any big pair-wise ordering vs
transitive thing, the big issue is that alpha's totally _local_ and
per-core orderings are so totally screwed up, and are defined to be very
loose - because back when alpha was designed, loose memory ordering was
thought to be a good thing for performance.

They were wrong, but that was mainly because the alpha designers lived in
a time when threading wasn't really even an issue. They were optimizing
purely for the case where memory ordering doesn't matter, and considered
locking etc to be one of those non-RISCy rare operations that can be
basically ignored.

> CPU0
> 	x := 1

So this creates a "location access" event on 'x' on alpha, call it "event
A".

> CPU1
> 	if (x == 1) {
> 		fence
> 		y := 1
> 	}

This has two events: let's call the read of 'x' "B", and "C" is the write
to 'y'.

And according to the alpha rules, we now have:

 - A << B

   Because we saw a '1' in B, we now have a "location access ordering"
   on the _same_ variable between A and B.

 - B < C

   Because we have the fence in between the read and the write, we now
   have a "processor issue order" between B and C (despite the fact that
   they are different variables).

And now, the alpha definition of "before" means that we can declare that A
is before C.

But on alpha, we really do need that fence, even if the address of 'y' was
somehow directly data-dependent on 'x'. THAT is what makes alpha special,
not any odd ordering rules.

> CPU2
> 	if (y == 1) {
> 		fence
> 		assert(x == 1)
> 	}

So again, we now have two events: the access of 'y' is "D", and the access
of x is "E". And again, according to the alpha rules, we have two
orderings:

 - C << D

   Because we saw a '1' in D, we have another "location access ordering"
   on the variable 'y' between C and D.

 - D < E

   Because of the fence, we have a "processor issue ordering" between D
   and E.

And for the same reason as above, we now get that C is "before" E
according to the alpha ordering rules. And because the definition of
"before" is transitive, then A is before E.

And that, in turn, means that that assert() can never trigger, because if
it triggered, then by the access ordering rules that would imply that
E << A, which would mean that E is "before" A, which in turn would violate
the whole chain we just got to.

So while the alpha architecture manual doesn't have the exact sequence
mentioned above (it only has nine so-called "Litmus tests"), it's fairly
close to Litmus test 3, and the ordering on alpha is very clear: it's all
transitive and causal (ie "before" can never be "after").

> Apparently pairwise ordering is more interesting than just a theoretical
> thing, and not just restricted to Alpha's funny caches.
Nobody does just pairwise ordering, afaik. It's an insane model. Everybody
does some form of transitive ordering.

The real (and only really odd) issue with alpha is that for everybody
else, if you have

	read x -> data dependency -> read y

(ie you read a pointer and dereference it, or you read an index and
dereference an array through it), then on all other architectures that
will imply a local processor ordering, which in turn will be part of the
whole transitive order of operations.

On alpha, it doesn't. You can think of it as alpha doing value speculation
(ie allowing speculative reads even across data dependencies), so on
alpha, you could imagine a CPU doing address speculation, and turning the
two reads into a sequence of

 (a) read off speculative pointer '*p'
 (b) read x
 (c) verify that x == p

and THAT is what "smp_read_barrier_depends()" will basically stop on
alpha. Nothing else. Other CPU's will always basically do

 (a) read x
 (b) read *x

so they have an implied read barrier between those two events thanks to
simply the causal relationship.

Some more notes:

 - The reason that alpha has this odd thing is not actually that any
   alpha implementation does value speculation, but the way the caches
   are invalidated, the invalidates can be delayed and re-ordered almost
   arbitrarily on the local CPU, and in the absence of a memory barrier
   the second read (that does happen "after" the first read in some local
   internal CPU sense and wasn't really speculative in that way) can get
   stale data because one cacheline has been updated before another one
   has.

   So while you can think of it as value speculation, the underlying
   cause is actually not some fancy speculation infrastructure, just an
   internal implementation issue.

 - The _data_ dependency is important, because other architectures _will_
   still speculatively move memory operations around across other
   "causal" relationships, notably across control dependencies. IOW, if
   you have

	if (read(x))
		read(y)

   then there is NOT necessarily any real ordering between the reads,
   because the conditional ends up being speculated, and you may well see
   "y" being read before "x", and you really need a smp_rmb() on other
   architectures than alpha too.

   So in this sense, alpha is very "consistent" - for alpha, _no_ amount
   of local causality matters, and only accesses to the *same* variable
   are implicitly locally ordered.

 - On x86, the new memory ordering semantics mean that _all_ local causal
   relationships are honored, so x86, like alpha, is very consistent. It
   will consider both the data dependency and the control dependency to
   be 100% the same. It just does it differently than alpha: for alpha,
   neither matters for ordering, for x86, both matter.

   Of course, even on x86, the local causal relationships still do allow
   loads to pass stores, so x86 isn't _totally_ ordered. x86 obviously
   still does need the smp_mb().

So alpha is "more consistent" in the respect of really having very clear
rules. The fact that those "clear rules" are totally insane and very
inconvenient for threading (and weren't the big performance advantage
that people used to think they would be) is a separate matter.

		Linus

* Re: [patch] mm: fix anon_vma races
  2008-10-18 17:00 ` Linus Torvalds
@ 2008-10-18 18:44 ` Matthew Wilcox
  2008-10-19  2:54 ` Nick Piggin
  1 sibling, 1 reply; 52+ messages in thread
From: Matthew Wilcox @ 2008-10-18 18:44 UTC (permalink / raw)
To: Linus Torvalds
Cc: Nick Piggin, Paul Mackerras, Hugh Dickins, linux-kernel, linux-mm,
	linux-arch

On Sat, Oct 18, 2008 at 10:00:30AM -0700, Linus Torvalds wrote:
> > Apparently pairwise ordering is more interesting than just a theoretical
> > thing, and not just restricted to Alpha's funny caches.
>
> Nobody does just pairwise ordering, afaik. It's an insane model. Everybody
> does some form of transitive ordering.

I assume you're talking about CPUs in particular here, and I don't know
of any counterexamples.

If you're talking about PCI devices, the model is not transitive.
Here's the exact text from Appendix E of PCI 3.0:

	A system may have multiple Producer-Consumer pairs operating
	simultaneously, with different data - flag-status sets located all
	around the system. But since only one Producer can write to a single
	data-flag set, there are no ordering requirements between different
	masters. Writes from one master on one bus may occur in one order on
	one bus, with respect to another master's writes, and occur in another
	order on another bus. In this case, the rules allow for some writes
	to be rearranged; for example, an agent on Bus 1 may see Transaction
	A from a master on Bus 1 complete first, followed by Transaction B
	from another master on Bus 0. An agent on Bus 0 may see Transaction
	B complete first followed by Transaction A. Even though the actual
	transactions complete in a different order, this causes no problem
	since the different masters must be addressing different data-flag
	sets.

I seem to remember earlier versions of the spec having clearer language
about A happening before B and B happening before C not meaning that A
happened before C from the perspective of a third device, but I can't
find it right now.

Anyway, as the spec says, you're not supposed to use PCI like that,
so it doesn't matter.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

* Re: [patch] mm: fix anon_vma races
  2008-10-18 18:44 ` Matthew Wilcox
@ 2008-10-19  2:54 ` Nick Piggin
  0 siblings, 0 replies; 52+ messages in thread
From: Nick Piggin @ 2008-10-19 2:54 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Linus Torvalds, Paul Mackerras, Hugh Dickins, linux-kernel,
	linux-mm, linux-arch

On Sat, Oct 18, 2008 at 12:44:05PM -0600, Matthew Wilcox wrote:
> On Sat, Oct 18, 2008 at 10:00:30AM -0700, Linus Torvalds wrote:
> > > Apparently pairwise ordering is more interesting than just a theoretical
> > > thing, and not just restricted to Alpha's funny caches.
> >
> > Nobody does just pairwise ordering, afaik. It's an insane model. Everybody
> > does some form of transitive ordering.
>
> I assume you're talking about CPUs in particular here, and I don't know
> of any counterexamples.

Yes, just CPUs.

> If you're talking about PCI devices, the model is not transitive.
> Here's the exact text from Appendix E of PCI 3.0:
>
> [...]
>
> I seem to remember earlier versions of the spec having clearer language
> about A happening before B and B happening before C not meaning that A
> happened before C from the perspective of a third device, but I can't
> find it right now.
>
> Anyway, as the spec says, you're not supposed to use PCI like that,
> so it doesn't matter.

Interesting. Hopefully, as you say, it won't matter.

* Re: [patch] mm: fix anon_vma races
  2008-10-18 17:00 ` Linus Torvalds
  2008-10-18 18:44 ` Matthew Wilcox
@ 2008-10-19  2:53 ` Nick Piggin
  1 sibling, 0 replies; 52+ messages in thread
From: Nick Piggin @ 2008-10-19 2:53 UTC (permalink / raw)
To: Linus Torvalds
Cc: Paul Mackerras, Hugh Dickins, linux-kernel, linux-mm, linux-arch

On Sat, Oct 18, 2008 at 10:00:30AM -0700, Linus Torvalds wrote:
>
> On Sat, 18 Oct 2008, Nick Piggin wrote:
> >
> > I think it can be called transitive. Basically (assume memory starts
> > off zeroed):
>
> Alpha is transitive. It has a notion of "processor issue order" and
> "location access order", and the ordering those two create is a
> transitive "before" and "after" ordering.
>
> The issue with alpha is not that it wouldn't be transitive - the issue is
> that *local* read dependencies do not cause a "processor issue order"!

That's fine. That's not so different to most other weakly ordered
processors having control dependencies not appearing in-order. So long
as stores propagate according to causality.

> [...]
>
> So while the alpha architecture manual doesn't have the exact sequence
> mentioned above (it only has nine so-called "Litmus tests"), it's fairly
> close to Litmus test 3, and the ordering on alpha is very clear: it's all
> transitive and causal (ie "before" can never be "after").

OK, good.

> > Apparently pairwise ordering is more interesting than just a theoretical
> > thing, and not just restricted to Alpha's funny caches.
>
> Nobody does just pairwise ordering, afaik. It's an insane model. Everybody
> does some form of transitive ordering.
We were chatting with Andy Glew a while back, and he said it actually
can be quite beneficial for HW designers (but I imagine that is the
same as a lot of "insane" things) ;)

I remember though that you said Linux should be pairwise-safe. I think
that's wrong (for more reasons than this anon-vma race), which is why I
got concerned and started off this subthread. I think Linux probably
has a lot of problems in a pairwise consistency model, so I'd just like
to check whether we actually attempt to support any architecture where
that is the case.

x86, powerpc, alpha are good ;) That gives me hope.

Thanks,
Nick

* Re: [patch] mm: fix anon_vma races
  2008-10-16  4:10 [patch] mm: fix anon_vma races Nick Piggin
  2008-10-17 22:14 ` Hugh Dickins
@ 2008-10-17 23:13 ` Peter Zijlstra
  2008-10-17 23:53 ` Linus Torvalds
  1 sibling, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2008-10-17 23:13 UTC (permalink / raw)
To: Nick Piggin; +Cc: Hugh Dickins, Linux Memory Management List, Linus Torvalds

On Thu, 2008-10-16 at 06:10 +0200, Nick Piggin wrote:

> Signed-off-by: Nick Piggin <npiggin@suse.de>
> ---
> Index: linux-2.6/mm/rmap.c
> ===================================================================
> --- linux-2.6.orig/mm/rmap.c
> +++ linux-2.6/mm/rmap.c
> @@ -81,8 +81,15 @@ int anon_vma_prepare(struct vm_area_stru
>  		/* page_table_lock to protect against threads */
>  		spin_lock(&mm->page_table_lock);
>  		if (likely(!vma->anon_vma)) {
> -			vma->anon_vma = anon_vma;
>  			list_add_tail(&vma->anon_vma_node, &anon_vma->head);
> +			/*
> +			 * This smp_wmb() is required to order all previous
> +			 * stores to initialize the anon_vma (by the slab
> +			 * ctor) and add this vma, with the store to make it
> +			 * visible to other CPUs via vma->anon_vma.
> +			 */
> +			smp_wmb();
> +			vma->anon_vma = anon_vma;

I'm not getting why you explicitly move the list_add_tail() before the
wmb; doesn't the list also expose the anon_vma to other cpus?

Hmm, I guess not, since we won't set the page->mapping until well after
the wmb.. still, tricky.

>  			allocated = NULL;
>  		}
>  		spin_unlock(&mm->page_table_lock);
> @@ -91,6 +98,15 @@ int anon_vma_prepare(struct vm_area_stru
>  			spin_unlock(&locked->lock);
>  		if (unlikely(allocated))
>  			anon_vma_free(allocated);
> +	} else {
> +		/*
> +		 * This smp_read_barrier_depends is required to order the data
> +		 * dependent loads of fields in anon_vma, with the load of the
> +		 * anon_vma pointer vma->anon_vma. This complements the above
> +		 * smp_wmb, and prevents a CPU from loading uninitialized
> +		 * contents of anon_vma.
> +		 */
> +		smp_read_barrier_depends();
>  	}
>  	return 0;
>  }

* Re: [patch] mm: fix anon_vma races
  2008-10-17 23:13 ` Peter Zijlstra
@ 2008-10-17 23:53 ` Linus Torvalds
  2008-10-18  0:42 ` Linus Torvalds
  0 siblings, 1 reply; 52+ messages in thread
From: Linus Torvalds @ 2008-10-17 23:53 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Nick Piggin, Hugh Dickins, Linux Memory Management List

On Sat, 18 Oct 2008, Peter Zijlstra wrote:
> On Thu, 2008-10-16 at 06:10 +0200, Nick Piggin wrote:
>
> [...]
>
> I'm not getting why you explicitly move the list_add_tail() before the
> wmb; doesn't the list also expose the anon_vma to other cpus?

I do think the anon_vma locking might be good to look over. It is very
non-obvious. Especially the initial create is really really quite
suspect.

I suspect we should start out with the anon-vma locked, and do the

	spin_unlock(&anon_vma->lock);

unconditionally in anon_vma_prepare(), and just simplify it. As it is,
newly allocated anon_vma's get exposed in unlocked state while we're
still working on them.

But I think that what Nick did is correct - we always start traversal
through anon_vma->head, so no, the "list_add_tail()" won't expose it to
anybody else, because nobody else has seen the anon_vma.

That said, that's really too damn subtle. We shouldn't rely on memory
ordering for the list handling, when the list handling is _supposed_ to
be using that anon_vma->lock thing.

		Linus

* Re: [patch] mm: fix anon_vma races
  2008-10-17 23:53 ` Linus Torvalds
@ 2008-10-18  0:42 ` Linus Torvalds
  2008-10-18  1:08 ` Linus Torvalds
  2008-10-19  1:13 ` Hugh Dickins
  0 siblings, 2 replies; 52+ messages in thread
From: Linus Torvalds @ 2008-10-18 0:42 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Nick Piggin, Hugh Dickins, Linux Memory Management List

On Fri, 17 Oct 2008, Linus Torvalds wrote:
>
> But I think that what Nick did is correct - we always start traversal
> through anon_vma->head, so no, the "list_add_tail()" won't expose it to
> anybody else, because nobody else has seen the anon_vma.
>
> That said, that's really too damn subtle. We shouldn't rely on memory
> ordering for the list handling, when the list handling is _supposed_ to
> be using that anon_vma->lock thing.

So maybe a better patch would be as follows? It simplifies the whole thing
by just always locking and unlocking the vma, whether it's newly allocated
or not (and whether it then gets dropped as unnecessary or not). It still
does that "smp_read_barrier_depends()" in the same old place.

I don't have the energy to look at Hugh's point about people reading
anon_vma without doing the whole "prepare" thing.

It adds more lines than it removes, but that's just because of the
comments. With the locking simplification, it actually removes more lines
of actual code than it adds. And now we always do that list_add_tail()
with the anon_vma lock held, which should simplify thinking about this,
and avoid at least one subtle ordering issue.

		Linus

---
 mm/rmap.c |   33 +++++++++++++++++++++++----------
 1 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 0383acf..9221bf7 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -63,35 +63,48 @@ int anon_vma_prepare(struct vm_area_struct *vma)
 	might_sleep();
 	if (unlikely(!anon_vma)) {
 		struct mm_struct *mm = vma->vm_mm;
-		struct anon_vma *allocated, *locked;
+		struct anon_vma *allocated;
 
 		anon_vma = find_mergeable_anon_vma(vma);
-		if (anon_vma) {
-			allocated = NULL;
-			locked = anon_vma;
-			spin_lock(&locked->lock);
-		} else {
+		allocated = NULL;
+		if (!anon_vma) {
 			anon_vma = anon_vma_alloc();
 			if (unlikely(!anon_vma))
 				return -ENOMEM;
 			allocated = anon_vma;
-			locked = NULL;
 		}
+		spin_lock(&anon_vma->lock);
 
 		/* page_table_lock to protect against threads */
 		spin_lock(&mm->page_table_lock);
 		if (likely(!vma->anon_vma)) {
+			/*
+			 * We hold the mm->page_table_lock, but another
+			 * CPU may be doing an optimistic load (the one
+			 * at the top), and we want to make sure that
+			 * the anon_vma changes are visible.
+			 */
+			smp_wmb();
 			vma->anon_vma = anon_vma;
 			list_add_tail(&vma->anon_vma_node, &anon_vma->head);
 			allocated = NULL;
 		}
 		spin_unlock(&mm->page_table_lock);
-
-		if (locked)
-			spin_unlock(&locked->lock);
+		spin_unlock(&anon_vma->lock);
 
 		if (unlikely(allocated))
 			anon_vma_free(allocated);
 	}
+	/*
+	 * Subtle: we looked up anon_vma without any locking
+	 * (in the common case), and are going to look at the
+	 * spinlock etc behind it. In order to know that it's
+	 * initialized, we need to do a read barrier here.
+	 *
+	 * We can use the cheaper "depends" version, since we
+	 * are following a pointer, and only on alpha may that
+	 * give a stale value.
+	 */
+	smp_read_barrier_depends();
 	return 0;
 }

* Re: [patch] mm: fix anon_vma races
  2008-10-18  0:42 ` Linus Torvalds
@ 2008-10-18  1:08 ` Linus Torvalds
  2008-10-18  1:32 ` Nick Piggin
  1 sibling, 1 reply; 52+ messages in thread
From: Linus Torvalds @ 2008-10-18 1:08 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Nick Piggin, Hugh Dickins, Linux Memory Management List

On Fri, 17 Oct 2008, Linus Torvalds wrote:
>
> So maybe a better patch would be as follows? It simplifies the whole thing
> by just always locking and unlocking the vma, whether it's newly allocated
> or not (and whether it then gets dropped as unnecessary or not).

Side note: it would be nicer if we had a "spin_lock_init_locked()", so
that we could avoid the more expensive "true lock" when doing the initial
allocation, but we don't. That said, the case of having to allocate a new
anon_vma _should_ be the rare one.

I dunno. Nick's approach of just depending on memory ordering for the
list update probably works too. Even if it's rather more subtle than I'd
like.

		Linus

* Re: [patch] mm: fix anon_vma races
  2008-10-18  1:08 ` Linus Torvalds
@ 2008-10-18  1:32 ` Nick Piggin
  2008-10-18  2:11 ` Linus Torvalds
  0 siblings, 1 reply; 52+ messages in thread
From: Nick Piggin @ 2008-10-18 1:32 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Peter Zijlstra, Hugh Dickins, Linux Memory Management List

On Fri, Oct 17, 2008 at 06:08:05PM -0700, Linus Torvalds wrote:
>
> On Fri, 17 Oct 2008, Linus Torvalds wrote:
> >
> > So maybe a better patch would be as follows? It simplifies the whole thing
> > by just always locking and unlocking the vma, whether it's newly allocated
> > or not (and whether it then gets dropped as unnecessary or not).
>
> Side note: it would be nicer if we had a "spin_lock_init_locked()", so
> that we could avoid the more expensive "true lock" when doing the initial
> allocation, but we don't. That said, the case of having to allocate a new
> anon_vma _should_ be the rare one.

We can't do that, unfortunately, because anon_vmas are allocated with
SLAB_DESTROY_BY_RCU. Obviously that's the easier way to solve most of
the orderings (then you just need to order the store to the spinlock
with the store to make the anon_vma visible). But no, we can't do that.

But it seems like there might be other problems here too... Hmm.
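
For context, a minimal sketch of the SLAB_DESTROY_BY_RCU lookup discipline
being referred to (schematic kernel-style code, not the actual rmap.c; the
identity recheck mirrors the test added to page_lock_anon_vma() in the
patch further down):

	/*
	 * Under SLAB_DESTROY_BY_RCU the *memory* stays type-stable across
	 * an RCU grace period, but the *object* may be freed and reused
	 * for a new anon_vma at any time. So a lockless reader may end up
	 * locking a reused object, and must recheck its identity after
	 * taking the lock.
	 */
	struct anon_vma *lookup_anon_vma(struct page *page)
	{
		unsigned long mapping;
		struct anon_vma *anon_vma;

		rcu_read_lock();
		mapping = (unsigned long)page->mapping;
		if (!(mapping & PAGE_MAPPING_ANON))
			goto bail;

		anon_vma = (struct anon_vma *)(mapping - PAGE_MAPPING_ANON);
		spin_lock(&anon_vma->lock);	/* safe: memory is type-stable */

		/* Did the object get freed and reused while we raced in? */
		if (mapping != (unsigned long)page->mapping) {
			spin_unlock(&anon_vma->lock);
			goto bail;
		}
		return anon_vma;	/* caller unlocks and ends the RCU
					 * read-side critical section */
	bail:
		rcu_read_unlock();
		return NULL;
	}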

* Re: [patch] mm: fix anon_vma races
  2008-10-18  1:32 ` Nick Piggin
@ 2008-10-18  2:11 ` Linus Torvalds
  2008-10-18  2:25 ` Nick Piggin
  0 siblings, 1 reply; 52+ messages in thread
From: Linus Torvalds @ 2008-10-18 2:11 UTC (permalink / raw)
To: Nick Piggin; +Cc: Peter Zijlstra, Hugh Dickins, Linux Memory Management List

On Sat, 18 Oct 2008, Nick Piggin wrote:
>
> > Side note: it would be nicer if we had a "spin_lock_init_locked()", so
> > that we could avoid the more expensive "true lock" when doing the initial
> > allocation, but we don't. That said, the case of having to allocate a new
> > anon_vma _should_ be the rare one.
>
> We can't do that, unfortunately, because anon_vmas are allocated with
> SLAB_DESTROY_BY_RCU.

Aughh. I see what you're saying. We don't _free_ them by RCU, we just
destroy the page allocation. So an anon_vma can get _re-allocated_ for
another page (without being destroyed), concurrently with somebody
optimistically being busy with that same anon_vma that they got through
that optimistic 'page_lock_anon_vma()' thing.

So if we were to just set the lock, we might actually be messing with
something that is still actively used by the previous page that was
unmapped concurrently and still being accessed by try_to_unmap_anon. So
even though we allocated a "new" anon_vma, it might still be busy.

Yes? No?

That thing really is too subtle for words. But if that's actually what you
are alluding to, then doesn't that mean that we _really_ should be doing
that "spin_lock(&anon_vma->lock)" even for the first allocation, and that
the current code is broken? Because otherwise that other concurrent user
that found the stale vma through page_lock_anon_vma() will now try to
follow the linked list and _think_ it's stable (thanks to the lock), but
we're actually inserting entries into it without holding any locks at all.

But I'm hoping I actually am totally *not* understanding what you meant,
and am actually just terminally confused.

Hugh, this is very much your code. Can you please tell me I'm really
confused here, and un-confuse me. Pretty please?

		Linus
* Re: [patch] mm: fix anon_vma races 2008-10-18 2:11 ` Linus Torvalds @ 2008-10-18 2:25 ` Nick Piggin 2008-10-18 2:35 ` Nick Piggin ` (2 more replies) 0 siblings, 3 replies; 52+ messages in thread From: Nick Piggin @ 2008-10-18 2:25 UTC (permalink / raw) To: Linus Torvalds; +Cc: Peter Zijlstra, Hugh Dickins, Linux Memory Management List On Fri, Oct 17, 2008 at 07:11:38PM -0700, Linus Torvalds wrote: > > > On Sat, 18 Oct 2008, Nick Piggin wrote: > > > > > > Side note: it would be nicer if we had a "spin_lock_init_locked()", so > > > that we could avoid the more expensive "true lock" when doing the initial > > > allocation, but we don't. That said, the case of having to allocate a new > > > anon_vma _should_ be the rare one. > > > > We can't do that, unfortunately, because anon_vmas are allocated with > > SLAB_DESTROY_BY_RCU. > > Aughh. I see what you're saying. We don't _free_ them by RCU, we just > destroy the page allocation. So an anon_vma can get _re-allocated_ for > another page (without being destroyed), concurrently with somebody > optimistically being busy with that same anon_vma that they got through > that optimistic 'page_lock_anon_vma()' thing. > > So if we were to just set the lock, we might actually be messing with > something that is still actively used by the previous page that was > unmapped concurrently and still being accessed by try_to_unmap_anon. So > even though we allocated a "new" anon_vma, it might still be busy. > > Yes? No? That's what I'm thinking, yes. But I admit the last time I looked at this really closely was probably reading through Hugh's patches and changelogs (which at the time must have convinced me ;)). So I could be wrong. > That thing really is too subtle for words. But if that's actually what you > are alluding to, then doesn't that mean that we _really_ should be doing > that "spin_lock(&anon_vma->lock)" even for the first allocation, and that > the current code is broken? Because otherwise that other concurrent user > that found the stale vma through page_lock_anon_vma() will now try to > follow the linked list and _think_ it's stable (thanks to the lock), but > we're actually inserting entries into it without holding any locks at all. Yes, that's what I meant by "has other problems". Another thing is also that even if we have the lock here, I can't see why page_lock_anon_vma is safe against finding an anon_vma which has been deallocated then allocated for something else (and had vmas inserted into it etc.). I think most of our memory ordering problems can be solved by locking. Note that I don't think we need any barriers there, and callers don't need any read_barrier_depends either, because AFAIKS they all take the lock too. (it shouldn't actually need the reordering of the assignments either, which shows it is a bit more robust than relying on ordering, but I think it is neater if we reorder them). Whether this page_lock_anon_vma is really a problem or not... I've added a test in there for what I think may be a problem (at least, if I'm wrong, I'd like to know why and have a comment in there). > But I'm hoping I actually am totally *not* understanding what you meant, > and am actually just terminally confused. > > Hugh, this is very much your code. Can you please tell me I'm really > confused here, and un-confuse me. Pretty please?
Ditto ;) --- Index: linux-2.6/mm/rmap.c =================================================================== --- linux-2.6.orig/mm/rmap.c +++ linux-2.6/mm/rmap.c @@ -63,32 +63,42 @@ int anon_vma_prepare(struct vm_area_stru might_sleep(); if (unlikely(!anon_vma)) { struct mm_struct *mm = vma->vm_mm; - struct anon_vma *allocated, *locked; + struct anon_vma *allocated; anon_vma = find_mergeable_anon_vma(vma); if (anon_vma) { allocated = NULL; - locked = anon_vma; - spin_lock(&locked->lock); } else { anon_vma = anon_vma_alloc(); if (unlikely(!anon_vma)) return -ENOMEM; allocated = anon_vma; - locked = NULL; } + /* + * The lock is required even for new anon_vmas, because as + * soon as we store vma->anon_vma = anon_vma, then the + * anon_vma becomes visible via the vma. This means another + * CPU can find the anon_vma, then store it into the struct + * page with page_add_anon_rmap. At this point, anon_vma can + * be loaded from the page with page_lock_anon_vma. + * + * So long as the anon_vma->lock is taken before looking at + * any fields in the anon_vma, the lock should take care of + * races and memory ordering issues WRT anon_vma fields. + */ + spin_lock(&anon_vma->lock); + /* page_table_lock to protect against threads */ spin_lock(&mm->page_table_lock); if (likely(!vma->anon_vma)) { - vma->anon_vma = anon_vma; list_add_tail(&vma->anon_vma_node, &anon_vma->head); + vma->anon_vma = anon_vma; allocated = NULL; } spin_unlock(&mm->page_table_lock); + spin_lock(&anon_vma->lock); - if (locked) - spin_unlock(&locked->lock); if (unlikely(allocated)) anon_vma_free(allocated); } @@ -171,6 +181,10 @@ static struct anon_vma *page_lock_anon_v anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON); spin_lock(&anon_vma->lock); + + if (anon_mapping != (unsigned long)page->mapping) + goto out; + return anon_vma; out: rcu_read_unlock(); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch] mm: fix anon_vma races 2008-10-18 2:25 ` Nick Piggin @ 2008-10-18 2:35 ` Nick Piggin 2008-10-18 2:53 ` Linus Torvalds 2008-10-18 19:14 ` Hugh Dickins 2 siblings, 0 replies; 52+ messages in thread From: Nick Piggin @ 2008-10-18 2:35 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, Peter Zijlstra, Hugh Dickins, Linux Memory Management List On Saturday 18 October 2008 13:25, Nick Piggin wrote: > @@ -171,6 +181,10 @@ static struct anon_vma *page_lock_anon_v > > anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON); > spin_lock(&anon_vma->lock); > + > + if (anon_mapping != (unsigned long)page->mapping) > + goto out; > + > return anon_vma; > out: > rcu_read_unlock(); Err, that should probably be another check for page_mapped(), because I think page->mapping probably won't get cleared until after anon_vma may have been freed and reused... -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch] mm: fix anon_vma races 2008-10-18 2:25 ` Nick Piggin 2008-10-18 2:35 ` Nick Piggin @ 2008-10-18 2:53 ` Linus Torvalds 2008-10-18 5:20 ` Nick Piggin 2008-10-18 19:14 ` Hugh Dickins 2 siblings, 1 reply; 52+ messages in thread From: Linus Torvalds @ 2008-10-18 2:53 UTC (permalink / raw) To: Nick Piggin; +Cc: Peter Zijlstra, Hugh Dickins, Linux Memory Management List On Sat, 18 Oct 2008, Nick Piggin wrote: > @@ -171,6 +181,10 @@ static struct anon_vma *page_lock_anon_v > > anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON); > spin_lock(&anon_vma->lock); > + > + if (anon_mapping != (unsigned long)page->mapping) > + goto out; > + > return anon_vma; > out: > rcu_read_unlock(); I see why you'd like to try to do this, but look a bit closer, and you'll realize that this is *really* wrong. So there's the brown-paper-bag-reason why it's wrong: you need to unlock in this case, but there's a subtler reason why I doubt the whole approach works: I don't think we actually hold the anon_vma lock when we set page->mapping. So I don't think you really fixed the race that you want to fix, and I don't think that does what you wanted to do. But I might have missed something. I'm off to play poker. It's Friday night, there's only so many memory ordering and locking issues I can take in one day. I'm hoping that by the time I look at this again, you and Hugh will have sorted it out. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
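As background for the point Linus raises here: the store to page->mapping is indeed made under the pte lock only, from the fault path. Roughly this shape, abridged from the 2.6.27-era __page_set_anon_rmap() in mm/rmap.c (the stats update is omitted and the comment is added here):

static void __page_set_anon_rmap(struct page *page,
	struct vm_area_struct *vma, unsigned long address)
{
	struct anon_vma *anon_vma = vma->anon_vma;

	BUG_ON(!anon_vma);
	/*
	 * Called from page_add_anon_rmap() with the pte lock held;
	 * anon_vma->lock is not taken for this store.
	 */
	anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
	page->mapping = (struct address_space *) anon_vma;
	page->index = linear_page_index(vma, address);
}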
* Re: [patch] mm: fix anon_vma races 2008-10-18 2:53 ` Linus Torvalds @ 2008-10-18 5:20 ` Nick Piggin 2008-10-18 10:38 ` Peter Zijlstra 0 siblings, 1 reply; 52+ messages in thread From: Nick Piggin @ 2008-10-18 5:20 UTC (permalink / raw) To: Linus Torvalds; +Cc: Peter Zijlstra, Hugh Dickins, Linux Memory Management List On Fri, Oct 17, 2008 at 07:53:49PM -0700, Linus Torvalds wrote: > > > On Sat, 18 Oct 2008, Nick Piggin wrote: > > @@ -171,6 +181,10 @@ static struct anon_vma *page_lock_anon_v > > > > anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON); > > spin_lock(&anon_vma->lock); > > + > > + if (anon_mapping != (unsigned long)page->mapping) > > + goto out; > > + > > return anon_vma; > > out: > > rcu_read_unlock(); > > I see why you'd like to try to do this, but look a bit closer, and you'll > realize that this is *really* wrong. > > So there's the brown-paper-bag-reason why it's wrong: you need to unlock > in this case, Check. > but there's a subtler reason why I doubt the whole approach > works: I don't think we actually hold the anon_vma lock when we set > page->mapping. No, we don't, but I think that's OK because we do an atomic assignment to page->mapping. I can't see any bugs there (the change to always take the anon_vma lock when inserting a new anon_vma into vma->anon_vma should ensure the vma->anon_vma assignment happens after the busy lock is visible, and the fact that anyone looking into the anon_vma should hold the lock takes care of everything else). > So I don't think you really fixed the race that you want to fix, and I > don't think that does what you wanted to do. > > But I might have missed something. No, I think this race is different. It's because it is "hard" to get a reference on anon_vma from the lru->page path, because unmapping a vma doesn't take any of the same locks (in particular it doesn't take the page lock, which would be the big hammer solution). So we can have a thread in reclaim who has a locked page, and is just about to call page_lock_anon_vma. At this point, another thread might unmap the whole vma. If this is the last vma in the anon_vma, then it garbage collects it in anon_vma_unlink. page->mapping does not get NULLed out or anything. So the first thread picks up the anon_vma under RCU, sees page_mapped is still true (let's say this part runs just before the unmapper decrements the last ->mapcount, then the anon_vma gets garbage collected). Then we take the anon_vma lock. Still OK because we are under SLAB_DESTROY_BY_RCU. Then we return the anon_vma and start using it. But when we took the anon_vma lock, we don't actually know that the anon_vma hasn't been allocated and used for something else entirely. Taking the anon_vma->lock in anon_vma_prepare of a new anon_vma closes the obvious list corruption problems that could occur if we tried to walk it at the same time a new vma was being put on there. But AFAIKS, even then we have a problem where we might be trying to walk over completely the wrong vmas now. Slight improvement attached.
--- Index: linux-2.6/mm/rmap.c =================================================================== --- linux-2.6.orig/mm/rmap.c +++ linux-2.6/mm/rmap.c @@ -63,32 +63,42 @@ int anon_vma_prepare(struct vm_area_stru might_sleep(); if (unlikely(!anon_vma)) { struct mm_struct *mm = vma->vm_mm; - struct anon_vma *allocated, *locked; + struct anon_vma *allocated; anon_vma = find_mergeable_anon_vma(vma); if (anon_vma) { allocated = NULL; - locked = anon_vma; - spin_lock(&locked->lock); } else { anon_vma = anon_vma_alloc(); if (unlikely(!anon_vma)) return -ENOMEM; allocated = anon_vma; - locked = NULL; } + /* + * The lock is required even for new anon_vmas, because as + * soon as we store vma->anon_vma = anon_vma, then the + * anon_vma becomes visible via the vma. This means another + * CPU can find the anon_vma, then store it into the struct + * page with page_add_anon_rmap. At this point, anon_vma can + * be loaded from the page with page_lock_anon_vma. + * + * So long as the anon_vma->lock is taken before looking at + * any fields in the anon_vma, the lock should take care of + * races and memory ordering issues WRT anon_vma fields. + */ + spin_lock(&anon_vma->lock); + /* page_table_lock to protect against threads */ spin_lock(&mm->page_table_lock); if (likely(!vma->anon_vma)) { - vma->anon_vma = anon_vma; list_add_tail(&vma->anon_vma_node, &anon_vma->head); + vma->anon_vma = anon_vma; allocated = NULL; } spin_unlock(&mm->page_table_lock); + spin_lock(&anon_vma->lock); - if (locked) - spin_unlock(&locked->lock); if (unlikely(allocated)) anon_vma_free(allocated); } @@ -171,6 +181,21 @@ static struct anon_vma *page_lock_anon_v anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON); spin_lock(&anon_vma->lock); + + /* + * If the page is no longer mapped, we have no way to keep the + * anon_vma stable. It may be freed and even re-allocated for some + * other set of anonymous mappings at any point. If the page is + * mapped while we have the lock on the anon_vma, then we know + * anon_vma_unlink can't run and garbage collect the anon_vma + * (because unmapping the page happens before unlinking the anon_vma). + */ + if (unlikely(!page_mapped(page))) { + spin_unlock(&anon_vma->lock); + goto out; + } + BUG_ON(page->mapping != anon_mapping); + return anon_vma; out: rcu_read_unlock(); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
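For readers keeping score, the consumer being raced against in all of this is the rmap walk, which holds only the anon_vma lock while it iterates the list. Roughly this shape, abridged from the 2.6.27-era try_to_unmap_anon() in mm/rmap.c (the comment is added here):

static int try_to_unmap_anon(struct page *page, int migration)
{
	struct anon_vma *anon_vma;
	struct vm_area_struct *vma;
	int ret = SWAP_AGAIN;

	anon_vma = page_lock_anon_vma(page);
	if (!anon_vma)
		return ret;

	/*
	 * The list is only stable while anon_vma->lock is held. If the
	 * anon_vma was recycled, these may be somebody else's vmas; the
	 * pte checks inside try_to_unmap_one() are what make that safe.
	 */
	list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
		ret = try_to_unmap_one(page, vma, migration);
		if (ret == SWAP_FAIL || !page_mapped(page))
			break;
	}

	page_unlock_anon_vma(anon_vma);
	return ret;
}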
* Re: [patch] mm: fix anon_vma races 2008-10-18 5:20 ` Nick Piggin @ 2008-10-18 10:38 ` Peter Zijlstra 2008-10-19 9:52 ` Hugh Dickins 0 siblings, 1 reply; 52+ messages in thread From: Peter Zijlstra @ 2008-10-18 10:38 UTC (permalink / raw) To: Nick Piggin; +Cc: Linus Torvalds, Hugh Dickins, Linux Memory Management List On Sat, 2008-10-18 at 07:20 +0200, Nick Piggin wrote: > Index: linux-2.6/mm/rmap.c > =================================================================== > --- linux-2.6.orig/mm/rmap.c > +++ linux-2.6/mm/rmap.c > @@ -63,32 +63,42 @@ int anon_vma_prepare(struct vm_area_stru > might_sleep(); > if (unlikely(!anon_vma)) { > struct mm_struct *mm = vma->vm_mm; > - struct anon_vma *allocated, *locked; > + struct anon_vma *allocated; > > anon_vma = find_mergeable_anon_vma(vma); > if (anon_vma) { > allocated = NULL; > - locked = anon_vma; > - spin_lock(&locked->lock); > } else { > anon_vma = anon_vma_alloc(); > if (unlikely(!anon_vma)) > return -ENOMEM; > allocated = anon_vma; > - locked = NULL; > } > > + /* > + * The lock is required even for new anon_vmas, because as > + * soon as we store vma->anon_vma = anon_vma, then the > + * anon_vma becomes visible via the vma. This means another > + * CPU can find the anon_vma, then store it into the struct > + * page with page_add_anon_rmap. At this point, anon_vma can > + * be loaded from the page with page_lock_anon_vma. > + * > + * So long as the anon_vma->lock is taken before looking at > + * any fields in the anon_vma, the lock should take care of > + * races and memory ordering issues WRT anon_vma fields. > + */ > + spin_lock(&anon_vma->lock); > + > /* page_table_lock to protect against threads */ > spin_lock(&mm->page_table_lock); > if (likely(!vma->anon_vma)) { > - vma->anon_vma = anon_vma; > list_add_tail(&vma->anon_vma_node, &anon_vma->head); > + vma->anon_vma = anon_vma; > allocated = NULL; > } > spin_unlock(&mm->page_table_lock); > + spin_lock(&anon_vma->lock); did you perchance mean, spin_unlock() ? > > - if (locked) > - spin_unlock(&locked->lock); > if (unlikely(allocated)) > anon_vma_free(allocated); > } > @@ -171,6 +181,21 @@ static struct anon_vma *page_lock_anon_v > > anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON); > spin_lock(&anon_vma->lock); > + > + /* > + * If the page is no longer mapped, we have no way to keep the > + * anon_vma stable. It may be freed and even re-allocated for some > + * other set of anonymous mappings at any point. If the page is > + * mapped while we have the lock on the anon_vma, then we know > + * anon_vma_unlink can't run and garbage collect the anon_vma > + * (because unmapping the page happens before unlinking the anon_vma). > + */ > + if (unlikely(!page_mapped(page))) { > + spin_unlock(&anon_vma->lock); > + goto out; > + } > + BUG_ON(page->mapping != anon_mapping); > + > return anon_vma; > out: > rcu_read_unlock(); fault_creation: anon_vma_prepare() page_add_new_anon_rmap(); expand_creation: anon_vma_prepare() anon_vma_lock(); rmap_lookup: page_referenced()/try_to_unmap() page_lock_anon_vma() vma_lookup: vma_adjust()/vma_* vma->anon_vma teardown: unmap_vmas() zap_range() page_remove_rmap() free_page() free_pgtables() anon_vma_unlink() free_range() IOW we remove rmap, free the page (set mapping=NULL) and then unlink and free the anon_vma. But at that time vma->anon_vma is still set. head starts to hurt,.. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch] mm: fix anon_vma races 2008-10-18 10:38 ` Peter Zijlstra @ 2008-10-19 9:52 ` Hugh Dickins 2008-10-19 10:51 ` Peter Zijlstra 0 siblings, 1 reply; 52+ messages in thread From: Hugh Dickins @ 2008-10-19 9:52 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Nick Piggin, Linus Torvalds, Linux Memory Management List On Sat, 18 Oct 2008, Peter Zijlstra wrote: > > fault_creation: > > anon_vma_prepare() > page_add_new_anon_rmap(); > > expand_creation: > > anon_vma_prepare() > anon_vma_lock(); > > rmap_lookup: > > page_referenced()/try_to_unmap() > page_lock_anon_vma() > > vma_lookup: > > vma_adjust()/vma_* > vma->anon_vma > > teardown: > > unmap_vmas() > zap_range() > page_remove_rmap() > free_page() > free_pgtables() > anon_vma_unlink() > free_range() > > IOW we remove rmap, free the page (set mapping=NULL) and then unlink and > free the anon_vma. > > But at that time vma->anon_vma is still set. > > > head starts to hurt,.. Comprehension isn't my strong suit at the moment: I don't grasp what problem you're seeing here - if you can spell it out in more detail for me, I'd like to try stopping your head hurt - though not at cost of making mine hurt more! Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch] mm: fix anon_vma races 2008-10-19 9:52 ` Hugh Dickins @ 2008-10-19 10:51 ` Peter Zijlstra 2008-10-19 12:39 ` Hugh Dickins 2008-10-19 18:25 ` Linus Torvalds 0 siblings, 2 replies; 52+ messages in thread From: Peter Zijlstra @ 2008-10-19 10:51 UTC (permalink / raw) To: Hugh Dickins; +Cc: Nick Piggin, Linus Torvalds, Linux Memory Management List On Sun, 2008-10-19 at 10:52 +0100, Hugh Dickins wrote: > On Sat, 18 Oct 2008, Peter Zijlstra wrote: > > > > fault_creation: > > > > anon_vma_prepare() > > page_add_new_anon_rmap(); > > > > expand_creation: > > > > anon_vma_prepare() > > anon_vma_lock(); > > > > rmap_lookup: > > > > page_referenced()/try_to_unmap() > > page_lock_anon_vma() > > > > vma_lookup: > > > > vma_adjust()/vma_* > > vma->anon_vma > > > > teardown: > > > > unmap_vmas() > > zap_range() > > page_remove_rmap() > > free_page() > > free_pgtables() > > anon_vma_unlink() > > free_range() > > > > IOW we remove rmap, free the page (set mapping=NULL) and then unlink and > > free the anon_vma. > > > > But at that time vma->anon_vma is still set. > > > > > > head starts to hurt,.. > > Comprehension isn't my strong suit at the moment: I don't grasp > what problem you're seeing here - if you can spell it out in more > detail for me, I'd like to try stopping your head hurt - though not > at cost of making mine hurt more! Heh, I meant to continue on this path more later yesterday, but weekend chores kept me from it. The above was my feeble attempt at getting an overview of what we do with these anon_vma things so as to get a feel for the interaction. I think my main concern in all this is validating that we have the right anon_vma on dereference - both the vma->anon_vma and the page->mapping one. Part of the confusion is that we don't clear those pointers at the end of their lifetimes (page_remove_rmap and anon_vma_unlink). I guess the !page_mapping() test in page_lock_anon_vma() is meant to deal with this; I think the point is that we have a stable page reference, and thus the mapping is stable too (although page_referenced() only holds a ref, and that could race with a fault which would install the anon_vma? - still, that looks like a race that falls the safe way) Still it looks odd to have a rcu_read_lock() section without an rcu_dereference() or smp_read_depend barrier. I think I see how the vma->anon_vma references work too, given the added locking in anon_vma_prepare() proposed in this thread. But I need to ponder those a bit more. And alas, I need to run out again.. these weekends are too short. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch] mm: fix anon_vma races 2008-10-19 10:51 ` Peter Zijlstra @ 2008-10-19 12:39 ` Hugh Dickins 0 siblings, 0 replies; 52+ messages in thread From: Hugh Dickins @ 2008-10-19 12:39 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Nick Piggin, Linus Torvalds, Linux Memory Management List On Sun, 19 Oct 2008, Peter Zijlstra wrote: > > I think my main concern in all this is validating that we have the right > anon_vma on dereference - both the vma->anon_vma and the page->mapping > one. > > Part of the confusion is that we don't clear those pointers at the end > of their lifetimes (page_remove_rmap and anon_vma_unlink). Yes, I would very much have liked to clear it in page_remove_rmap(), as I say there: it still feels ugly to be cleaning up after it in free_hot_cold_page() (gosh! and that's still the name of where it happens!), though there are some good debug advantages to having it set indefinitely too. Clearing vma->anon_vma at the end, I don't think I ever cared about that: it's very common to kfree() something without resetting the pointers to it, I don't think there's any worrying race in its case. > > I guess the !page_mapping() test in page_lock_anon_vma() is meant to !page_mapped() > deal with this; I think the point is that we have a stable page > reference, and thus the mapping is stable too (although > page_referenced() only holds a ref, and that could race with a fault > which would install the anon_vma? - still, that looks like a race that > falls the safe way) page_lock_anon_vma() is like those scenes where sailors are pulleyed down a rope from one ship to another in stormy mid-ocean. There, now you understand it, need I say more? If we see page_mapcount is raised (in the RCU locked section), we can be sure that the slab page which the struct anon_vma rests on will not get freed and reused for something else: page_mapcount may go down to 0 at any instant, but having been non-0, we are assured that anon_vma->lock will remain the spinlock in a struct anon_vma, even if by the time we've acquired that spinlock, that particular piece of memory has been freed and reused for the anon_vma of another group of vmas altogether. Certainly mapcount could also go up and another vma be added to the anon_vma's list while we wait to get the spinlock, as you say, but that's no worry. > > Still it looks odd to have a rcu_read_lock() section without an > rcu_dereference() or smp_read_depend barrier. At the 2.6.8 time I wrote it, I was using preempt_disable() and preempt_enable(), and there was no such thing as rcu_dereference(). But I don't think it's now lacking in that respect: the whole idea was to keep almost all of the code free of RCU worries, just concentrate them all into page_lock_anon_vma() (and slab.c). > > I think I see how the vma->anon_vma references work too, given the added > locking in anon_vma_prepare() proposed in this thread. But I need to > ponder those a bit more. > > And alas, I need to run out again.. these weekends are too short. I know the feeling: I also seem to have promised many too many people that I'll be attending to this or that at the weekend. Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch] mm: fix anon_vma races 2008-10-19 10:51 ` Peter Zijlstra 2008-10-19 12:39 ` Hugh Dickins @ 2008-10-19 18:25 ` Linus Torvalds 2008-10-19 18:45 ` Peter Zijlstra ` (3 more replies) 1 sibling, 4 replies; 52+ messages in thread From: Linus Torvalds @ 2008-10-19 18:25 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Hugh Dickins, Nick Piggin, Linux Memory Management List On Sun, 19 Oct 2008, Peter Zijlstra wrote: > > Part of the confusion is that we don't clear those pointers at the end > of their lifetimes (page_remove_rmap and anon_vma_unlink). > > I guess the !page_mapping() test in page_lock_anon_vma() is meant to > deal with this Hmm. So that part I'm still not entirely convinced about. The thing is, we have two issues on anon_vma usage, and the page_lock_anon_vma() usage in particular: - the integrity of the list itself Here it should be sufficient to just always get the lock, to the point where we don't need to care about anything else. So getting the lock properly on new allocation makes all the other races irrelevant. - the integrity of the _result_ of traversing the list This is what the !page_mapping() thing is supposedly protecting against, I think. But as far as I can tell, there's really two different use cases here: (a) people who care deeply about the result and (b) people who don't. And the difference between the two cases is whether they had the page locked or not. The "try_to_unmap()" callers care deeply, and lock the page. In contrast, some "page_referenced()" callers (really just shrink_active_list) don't care deeply, and to them the return value is really just a heuristic. As far as I can tell, all the people who care deeply will lock the page (and _have_ to lock the page), and thus 'page->mapping' should be stable for those cases. And then we have the other cases, who just want a heuristic, and they don't hold the page lock, but if we look at the wrong active_vma that has gotten reallocated to something else, they don't even really care. So I'm not seeing the reason for that check for page_mapped() at the end. Does it actually protect against anything relevant? Anyway, I _think_ the part that everybody agrees about is the initial locking of the anon_vma. Whether we then even need any memory barriers and/or the page_mapped() check is an independent question. Yes? No? So I'm suggesting this commit as the part we at least all agree on. But I haven't pushed it out yet, so you can still holler.. But I think all the discussion is about other issues, and we all agree on at least this part? Linus --- ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch] mm: fix anon_vma races 2008-10-19 18:25 ` Linus Torvalds @ 2008-10-19 18:45 ` Peter Zijlstra 2008-10-19 19:00 ` Hugh Dickins ` (2 subsequent siblings) 3 siblings, 0 replies; 52+ messages in thread From: Peter Zijlstra @ 2008-10-19 18:45 UTC (permalink / raw) To: Linus Torvalds; +Cc: Hugh Dickins, Nick Piggin, Linux Memory Management List On Sun, 2008-10-19 at 11:25 -0700, Linus Torvalds wrote: > Anyway, I _think_ the part that everybody agrees about is the initial > locking of the anon_vma. Whether we then even need any memory barriers > and/or the page_mapped() check is an independent question. Yes? No? Yes > So I'm suggesting this commit as the part we at least all agree on. But I > haven't pushed it out yet, so you can still holler.. But I think all the > discussion is about other issues, and we all agree on at least this part? > > Linus > > --- > From f422f2ec50872331820f15711f48b9ffc9cbb64e Mon Sep 17 00:00:00 2001 > From: Linus Torvalds <torvalds@linux-foundation.org> > Date: Sun, 19 Oct 2008 10:32:20 -0700 > Subject: [PATCH] anon_vma_prepare: properly lock even newly allocated entries > > The anon_vma code is very subtle, and we end up doing optimistic lookups > of anon_vmas under RCU in page_lock_anon_vma() with no locking. Other > CPU's can also see the newly allocated entry immediately after we've > exposed it by setting "vma->anon_vma" to the new value. > > We protect against the anon_vma being destroyed by having the SLAB > marked as SLAB_DESTROY_BY_RCU, so the RCU lookup can depend on the > allocation not being destroyed - but it might still be free'd and > re-allocated here to a new vma. > > As a result, we should not do the anon_vma list ops on a newly allocated > vma without proper locking. > > Acked-by: Nick Piggin <npiggin@suse.de> > Acked-by: Hugh Dickins <hugh@veritas.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> > Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> > --- > mm/rmap.c | 42 ++++++++++++++++++++++++++++++++---------- > 1 files changed, 32 insertions(+), 10 deletions(-) > > diff --git a/mm/rmap.c b/mm/rmap.c > index 0383acf..e8d639b 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -55,7 +55,33 @@ > > struct kmem_cache *anon_vma_cachep; > > -/* This must be called under the mmap_sem. */ > +/** > + * anon_vma_prepare - attach an anon_vma to a memory region > + * @vma: the memory region in question > + * > + * This makes sure the memory mapping described by 'vma' has > + * an 'anon_vma' attached to it, so that we can associate the > + * anonymous pages mapped into it with that anon_vma. > + * > + * The common case will be that we already have one, but > + * if not we either need to find an adjacent mapping that we > + * can re-use the anon_vma from (very common when the only > + * reason for splitting a vma has been mprotect()), or we > + * allocate a new one. > + * > + * Anon-vma allocations are very subtle, because we may have > + * optimistically looked up an anon_vma in page_lock_anon_vma() > + * and that may actually touch the spinlock even in the newly > + * allocated vma (it depends on RCU to make sure that the > + * anon_vma isn't actually destroyed). > + * > + * As a result, we need to do proper anon_vma locking even > + * for the new allocation. At the same time, we do not want > + * to do any locking for the common case of already having > + * an anon_vma. > + * > + * This must be called with the mmap_sem held for reading.
> + */ > int anon_vma_prepare(struct vm_area_struct *vma) > { > struct anon_vma *anon_vma = vma->anon_vma; > @@ -63,20 +89,17 @@ int anon_vma_prepare(struct vm_area_struct *vma) > might_sleep(); > if (unlikely(!anon_vma)) { > struct mm_struct *mm = vma->vm_mm; > - struct anon_vma *allocated, *locked; > + struct anon_vma *allocated; > > anon_vma = find_mergeable_anon_vma(vma); > - if (anon_vma) { > - allocated = NULL; > - locked = anon_vma; > - spin_lock(&locked->lock); > - } else { > + allocated = NULL; > + if (!anon_vma) { > anon_vma = anon_vma_alloc(); > if (unlikely(!anon_vma)) > return -ENOMEM; > allocated = anon_vma; > - locked = NULL; > } > + spin_lock(&anon_vma->lock); > > /* page_table_lock to protect against threads */ > spin_lock(&mm->page_table_lock); > @@ -87,8 +110,7 @@ int anon_vma_prepare(struct vm_area_struct *vma) > } > spin_unlock(&mm->page_table_lock); > > - if (locked) > - spin_unlock(&locked->lock); > + spin_unlock(&anon_vma->lock); > if (unlikely(allocated)) > anon_vma_free(allocated); > } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch] mm: fix anon_vma races 2008-10-19 18:25 ` Linus Torvalds 2008-10-19 18:45 ` Peter Zijlstra @ 2008-10-19 19:00 ` Hugh Dickins 2008-10-20 4:03 ` Hugh Dickins 2008-10-21 2:44 ` Nick Piggin 3 siblings, 0 replies; 52+ messages in thread From: Hugh Dickins @ 2008-10-19 19:00 UTC (permalink / raw) To: Linus Torvalds; +Cc: Peter Zijlstra, Nick Piggin, Linux Memory Management List On Sun, 19 Oct 2008, Linus Torvalds wrote: > > Anyway, I _think_ the part that everybody agrees about is the initial > locking of the anon_vma. Whether we then even need any memory barriers > and/or the page_mapped() check is an independent question. Yes? No? > > So I'm suggesting this commit as the part we at least all agree on. But I > haven't pushed it out yet, so you can still holler.. But I think all the > discussion is about other issues, and we all agree on at least this part? I'll have to postpone answering the rest of your mail until later, but yes, I agree your patch is what we've all agreed on so far, and I can't even quibble with your description - it's good. Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch] mm: fix anon_vma races 2008-10-19 18:25 ` Linus Torvalds 2008-10-19 18:45 ` Peter Zijlstra 2008-10-19 19:00 ` Hugh Dickins @ 2008-10-20 4:03 ` Hugh Dickins 2008-10-20 15:17 ` Linus Torvalds 2008-10-21 2:44 ` Nick Piggin 3 siblings, 1 reply; 52+ messages in thread From: Hugh Dickins @ 2008-10-20 4:03 UTC (permalink / raw) To: Linus Torvalds; +Cc: Peter Zijlstra, Nick Piggin, Linux Memory Management List On Sun, 19 Oct 2008, Linus Torvalds wrote: > On Sun, 19 Oct 2008, Peter Zijlstra wrote: > > > > Part of the confusion is that we don't clear those pointers at the end > > of their lifetimes (page_remove_rmap and anon_vma_unlink). > > > > I guess the !page_mapping() test in page_lock_anon_vma() is meant to !page_mapped() > > deal with this > > Hmm. So that part I'm still not entirely convinced about. > > The thing is, we have two issues on anon_vma usage, and the > page_lock_anon_vma() usage in particular: > > - the integrity of the list itself > > Here it should be sufficient to just always get the lock, to the point > where we don't need to care about anything else. So getting the lock > properly on new allocation makes all the other races irrelevant. Yes, I should have removed anon_vma_prepare()'s optimization of not locking a newly allocated struct anon_vma when I moved page_lock_anon_vma() over to rely on SLAB_DESTROY_BY_RCU. When you say "to the point where we don't need to care about anything else", are you there agreeing with Nick that your smp_wmb() and smp_read_barrier_depends() are no longer needed? > > - the integrity of the _result_ of traversing the list > > This is what the !page_mapping() thing is supposedly protecting !page_mapped() > against, I think. > > But as far as I can tell, there's really two different use cases here: > (a) people who care deeply about the result and (b) people who don't. You could look at it that way, but I don't think it's like that really. IIRC at one time it was the case that every caller had the page already locked when coming in here, but none of them could care too deeply about the result because there was a trylock of the page_table_lock which could fail on one of the paths. Nowadays the only trylocking in rmap.c is one place where it's being called without holding page lock: that doesn't matter for anon pages, but matters for file pages because page->mapping could go NULL from truncation at any instant without the page lock, and we do need page->mapping to locate the prio_tree of vmas - in that case it trylocks for page lock, and just assumes referenced if couldn't get it. > > And the difference between the two cases is whether they had the page > locked or not. The "try_to_unmap()" callers care deeply, and lock the > page. In contrast, some "page_referenced()" callers (really just > shrink_active_list) don't care deeply, and to them the return value is > really just a heuristic. > > As far as I can tell, all the people who care deeply will lock the page > (and _have_ to lock the page), and thus 'page->mapping' should be stable > for those cases. It's certainly true that try_to_unmap() callers care more deeply than page_referenced() callers, and that the only trylock is on the page_referenced() file path. But this is all _irrelevant_: the tricky test to worry about in page_lock_anon_vma() is of page_mapped() i.e. does this page currently have any ptes in userspace, not of page_mapping() or page->mapping.
In the case of file pages, it is in some places crucial to check page->mapping against a racing file truncation; but there's no such issue with anon pages, their page->mapping pointing to anon_vma is left set until the page is finally freed, it is not volatile. But in the case of anon pages, what page->mapping points to may be volatile, in the sense that that memory might at some point get reused for a different anon_vma, or the slab page below it get freed and reused for a different purpose completely: that's what we have to be careful of in the case of anon pages, and it's RCU and the page_mapped() test which guard that. > > And then we have the other cases, who just want a heuristic, and they > don't hold the page lock, but if we look at the wrong active_vma that has anon_vma > gotten reallocated to something else, they don't even really care. It's not that they don't care, it's that if the struct anon_vma has gotten reallocated to something else, they can then be sure there's no longer anything to care about. The struct anon_vma bundles together in a list all those vmas which might conceivably contain a pte of the page we're interested in. If the anon_vma has gotten freed, even reallocated, or utterly reused, that implies that its list of vmas was emptied, there's no longer any vma which might contain that page. > > So I'm not seeing the reason for that check for page_mapped() at the end. > Does it actually protect against anything relevant? Absolutely, it is crucial that page_lock_anon_vma(page) checks page_mapped(page) after doing the rcu_read_lock() and before doing the spin_lock(&anon_vma->lock). When we get into page_lock_anon_vma, we do not know whether what page->mapping (less its PageAnon bit) points to is really still the anon_vma for the page. Maybe all the tasks which had that page mapped at the time we last checked page_mapped() have now exited, their vmas and anon_vmas and mms been freed. So first we rcu_read_lock(): and because the anon_vma cache is SLAB_DESTROY_BY_RCU, we know that while we hold that, if the memory is now pointing to an anon_vma type of object, it will remain pointing to an anon_vma type of object until the rcu_read_unlock(). That's important because what we're really wanting to do is spin_lock(&anon_vma->lock), but that would be a corrupting thing to do if the struct anon_vma's memory has meanwhile got reused for something else. Okay, we've got rcu_read_lock() to stabilize the situation, but we still do not know whether the memory pointed to by page->mapping is still in use as an anon_vma. Well, we don't try to answer precisely that question: but if the page we're holding (caller better have a reference on it!) has any ptes in userspace, i.e. page_mapcount is raised, i.e. page_mapped is true, then we can be sure that what it's pointing to is the anon_vma. And if page_mapped isn't true any longer, then we're no longer interested in getting that lock anyway and can just back out. At any instant thereafter the tasks concerned may exit or unmap the page, the vmas and anon_vmas and mms may get freed and reused; but SLAB_DESTROY_BY_RCU guarantees that while we have rcu_read_lock() that struct anon_vma will remain of type struct anon_vma, and we can now safely spin_lock(&anon_vma->lock). Once we've got that spin lock, we can safely (given the fix in your patch) traverse the anon_vma list.
Maybe the struct anon_vma is still relevant to our page, and maybe we'll find ptes of our page in one or more of the vmas it bundles together; or maybe it's still the right anon_vma, but our page gets unmapped while we search and no ptes are found; or maybe that struct anon_vma has just got freed back to slab and its list is empty; or maybe it has meanwhile got reused for another unrelated bundle of vmas, and we search that bundle for our page which won't be there - that's okay, it's a very unlikely case, but it's allowed for. > > Anyway, I _think_ the part that everybody agrees about is the initial > locking of the anon_vma. Whether we then even need any memory barriers > and/or the page_mapped() check is an independent question. Yes? No? > > So I'm suggesting this commit as the part we at least all agree on. But I > haven't pushed it out yet, so you can still holler.. But I think all the > discussion is about other issues, and we all agree on at least this part? Yes, agreed. Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
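To put Hugh's walkthrough next to the code it describes, here is the lookup itself, abridged from the 2.6.27-era page_lock_anon_vma() in mm/rmap.c, with his argument mapped onto the steps as comments (the comments are added here, they are not in the source):

static struct anon_vma *page_lock_anon_vma(struct page *page)
{
	struct anon_vma *anon_vma;
	unsigned long anon_mapping;

	rcu_read_lock();	/* slab memory now stays an anon_vma */
	anon_mapping = (unsigned long) page->mapping;
	if (!(anon_mapping & PAGE_MAPPING_ANON))
		goto out;
	if (!page_mapped(page))	/* raised mapcount: mapping really is an anon_vma */
		goto out;

	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
	spin_lock(&anon_vma->lock);	/* safe even if meanwhile freed and reused */
	return anon_vma;
out:
	rcu_read_unlock();
	return NULL;
}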
* Re: [patch] mm: fix anon_vma races 2008-10-20 4:03 ` Hugh Dickins @ 2008-10-20 15:17 ` Linus Torvalds 2008-10-20 18:21 ` Hugh Dickins 2008-10-21 2:56 ` Nick Piggin 0 siblings, 2 replies; 52+ messages in thread From: Linus Torvalds @ 2008-10-20 15:17 UTC (permalink / raw) To: Hugh Dickins; +Cc: Peter Zijlstra, Nick Piggin, Linux Memory Management List On Mon, 20 Oct 2008, Hugh Dickins wrote: > > When you say "to the point where we don't need to care about anything > else", are you there agreeing with Nick that your smp_wmb() and > smp_read_barrier_depends() are no longer needed? Yes. The anon_vma only has two fields: the spinlock itself, and the list. And with all list allocations being inside the spinlock, and with the spinlock itself being a memory barrier, I'm now convinced that the worry about memory ordering was unnecessary. Well, not unnecessary, because I think the discussion was good, and I think we fixed another bug, but the smp_wmb++smp_read_barrier_depends does seem to be a non-issue in this path. > But this is all _irrelevant_ : the tricky test to worry about in > page_lock_anon_vma() is of page_mapped() i.e. does this page currently > have any ptes in userspace, not of page_mapping() or page->mapping. I'm not arguing for removing the page_mapped() we have now, I'm just wondering about the one Nick wanted to add at the end. > In the case of file pages, it is in some places crucial to check > page->mapping against a racing file truncation; but there's no such > issue with anon pages, their page->mapping pointing to anon_vma is > left set until the page is finally freed, it is not volatile. > > But in the case of anon pages, what page->mapping points to may be > volatile, in the sense that that memory might at some point get reused > for a different anon_vma, or the slab page below it get freed and > reused for a different purpose completely: that's what we have to > careful of in the case of anon pages, and it's RCU and the > page_mapped() test which guard that. .. and I'm not worried about the slab page. It's stable, since we hold the RCU read-lock. No worries about that one either. It's the "struct page" itself - the one we use for lookup in page_lock_anon_vma(). And I'm worried about the need for *re-doing* the page_mapped() test. The problem that makes "page_lock_anon_vma()" such a total disaster is that yes, it does locking, but it does locking _after_ the lookup, and the lock doesn't actually protect any of the data that it is using for the lookup itself. And yes, we have various tricks to try to make the data "safe" even if we race with the lookup, like the RCU stability of the anon_vma allocation, so that even if we race, we don't do anything bad. And I don't worry about the anon_vma, that part looks fine. But page_remove_rmap() is run totally unlocked wrt page_lock_anon_vma(), it looks like. page_remove_rmap() is run under the pte lock, and page_lock_anon_vma() is run under no lock at all that I can see that is reliable. So yes, we have the same kind of "delay the destroy" wrt page->mapping (ie page_remove_rmap() doesn't actually clear page->mapping at all), but none of this runs under any lock at all. So as far as I can tell, the only reason "page_lock_anon_vma()" is safe is because we hopefully always hold an elevated page count when we call it, so at least the struct page itself will never get freed, so page->mapping is safe just because it's not cleared. 
But assuming all that is true, it boils down to this: - the anon_vma we get may be the wrong one or a stale one (since page_remove_rmap ran concurrently and we ended up freeing the anon_vma), but it's always a "valid" anon_vma in the sense that it's initialized and the list is always pointing to *some* stable set of vm_area_struct's. - if we raced, and the anon_vma is stale, we're going to walk over some bogus list that pertains to a totally different page, but WHO REALLY CARES? If it is about another page that got that anon_vma reallocated to it, all the code that traverses the list of vma's still has to check the page tables _anyway_. And that will never trigger, and that will get the pte lock for checking anyway, so at _that_ point do we correctly finally synchronize with a lock that is actually relevant. - the "anon_vma->lock" is totally irrelevant wrt "page_mapped()", and I'm not seeing what it could possibly help to re-check after getting that lock. So what I'm trying to figure out is why Nick wanted to add another check for page_mapped(). I'm not seeing what it is supposed to protect against. (Yes, we have checks for "page_mapped()" inside the "try_to_unmap_xyz()" loops, but those are for a different reason - they're there to exit the loop early when we know there's no point. They don't claim to be about locking serialization). Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
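The "lock that is actually relevant" in Linus's second bullet is the pte lock. A condensed sketch of that final check, modeled on page_check_address() but given a hypothetical name and a simplified signature here (the real helper differs in details):

/*
 * Return the locked pte mapping 'page' at 'address' in 'mm', or NULL.
 * This is where a walk over a stale anon_vma becomes harmless: if the
 * vma never mapped this page, no pte matches and we just skip it.
 */
static pte_t *pte_maps_page(struct page *page, struct mm_struct *mm,
			    unsigned long address, spinlock_t **ptlp)
{
	pgd_t *pgd;
	pud_t *pud;
	pmd_t *pmd;
	pte_t *pte;
	spinlock_t *ptl;

	pgd = pgd_offset(mm, address);
	if (!pgd_present(*pgd))
		return NULL;
	pud = pud_offset(pgd, address);
	if (!pud_present(*pud))
		return NULL;
	pmd = pmd_offset(pud, address);
	if (!pmd_present(*pmd))
		return NULL;

	pte = pte_offset_map(pmd, address);
	ptl = pte_lockptr(mm, pmd);
	spin_lock(ptl);
	/* Re-check under the lock: only a present, matching pte counts. */
	if (pte_present(*pte) && page_to_pfn(page) == pte_pfn(*pte)) {
		*ptlp = ptl;
		return pte;
	}
	pte_unmap_unlock(pte, ptl);
	return NULL;
}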
* Re: [patch] mm: fix anon_vma races 2008-10-20 15:17 ` Linus Torvalds @ 2008-10-20 18:21 ` Hugh Dickins 2008-10-21 2:56 ` Nick Piggin 1 sibling, 0 replies; 52+ messages in thread From: Hugh Dickins @ 2008-10-20 18:21 UTC (permalink / raw) To: Linus Torvalds; +Cc: Peter Zijlstra, Nick Piggin, Linux Memory Management List On Mon, 20 Oct 2008, Linus Torvalds wrote: > On Mon, 20 Oct 2008, Hugh Dickins wrote: > > > > When you say "to the point where we don't need to care about anything > > else", are you there agreeing with Nick that your smp_wmb() and > > smp_read_barrier_depends() are no longer needed? > > Yes. The anon_vma only has two fields: the spinlock itself, and the list. > And with all list allocations being inside the spinlock, and with the > spinlock itself being a memory barrier, I'm now convinced that the worry > about memory ordering was unnecessary. Okay, thanks, that's a relief. I'm afraid that once a barrier discussion comes up and we insert them, then I become dazedly paranoid and it's very hard to shake me from seeing a need for barriers everywhere, including a barrier before and after every barrier ad infinitum to make sure they're really barriers. I still get a twinge of anxiety seeing anon_vma_prepare()'s unlocked if (unlikely(!anon_vma)) { since it looks like the kind of thing that can be a problem. But on reflection, I guess there are lots and lots of places where we do such opportunistic checks before going the slow path taking the lock. > > Well, not unnecessary, because I think the discussion was good, and I > think we fixed another bug, Yes, that was a valuable find, which Nick's ctor aberration led us to. Though whether it ever bit anyone, I doubt. We did have a spate of anon_vma corruptions 2.5 years ago, but I think they were just one manifestation of some more general slab corruption, don't match this. > but the smp_wmb++smp_read_barrier_depends does > seem to be a non-issue in this path. > > > But this is all _irrelevant_ : the tricky test to worry about in > > page_lock_anon_vma() is of page_mapped() i.e. does this page currently > > have any ptes in userspace, not of page_mapping() or page->mapping. > > I'm not arguing for removing the page_mapped() we have now, I'm just > wondering about the one Nick wanted to add at the end. Oh, that, sorry I didn't realize - but there again, it was well worth my writing it down again, else I wouldn't have corrected my embarrassingly mistaken goahead to Nick on moving the check. [snipped a lot of good understanding of how it works] > So what I'm trying to figure out is why Nick wanted to add another check > for page_mapped(). I'm not seeing what it is supposed to protect against. I think it's a matter of mental comfort, or good interface design. You're right that it will make no actual difference to what happens in its two sole callers page_referenced_anon() and try_to_unmap_anon(), beyond adding an extra branch to short-circuit a futile search which would already terminate after the first iteration (each loop already has a page_mapped test, to get out a.s.a.p. if the list is very long). But (particularly because he didn't realize it could happen: I put no comment there beyond "tricky") he thinks it would be better to know that when you emerge from a successful page_lock_anon_vma(), then the anon_vma you have is indeed still the right one for the page (as you noted, we do assume caller holds a reference on page). 
One might argue that a comment would be better than a runtime test: so long as page_lock_anon_vma() is a static function with just those two callers. In writing this, another barrier anxiety crossed my mind. I've made a big deal of checking page_mapped after getting rcu_read_lock, but now I wonder if another barrier is needed for that. Documentation/memory-barriers.txt "LOCKING FUNCTIONS" groups RCU along with spin locks in discussing their semipermeable characteristics, so I guess no extra barrier needed; but it does work rather differently. In CLASSIC_RCU the preempt_disable() has a compiler barrier() but not any processor *mb(). As I understand it, that's fine because if page->_mapcount was loaded before the preempt_disable and we don't preempt before the preempt_disable, then so what, that's okay; and if we are preempted immediately before the preempt_disable, then all the business of context switch is sure to reload it again after. In PREEMPT_RCU? I don't know, that's some study I've never got around to; but I think you and Peter will know whether it's good. Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch] mm: fix anon_vma races 2008-10-20 15:17 ` Linus Torvalds 2008-10-20 18:21 ` Hugh Dickins @ 2008-10-21 2:56 ` Nick Piggin 2008-10-21 3:25 ` Linus Torvalds 1 sibling, 1 reply; 52+ messages in thread From: Nick Piggin @ 2008-10-21 2:56 UTC (permalink / raw) To: Linus Torvalds Cc: Hugh Dickins, Peter Zijlstra, Nick Piggin, Linux Memory Management List On Tuesday 21 October 2008 02:17, Linus Torvalds wrote: > On Mon, 20 Oct 2008, Hugh Dickins wrote: > > But in the case of anon pages, what page->mapping points to may be > > volatile, in the sense that that memory might at some point get reused > > for a different anon_vma, or the slab page below it get freed and > > reused for a different purpose completely: that's what we have to > > careful of in the case of anon pages, and it's RCU and the > > page_mapped() test which guard that. > > .. and I'm not worried about the slab page. It's stable, since we hold the > RCU read-lock. No worries about that one either. > > It's the "struct page" itself - the one we use for lookup in > page_lock_anon_vma(). And I'm worried about the need for *re-doing* the > page_mapped() test. > > The problem that makes "page_lock_anon_vma()" such a total disaster is > that yes, it does locking, but it does locking _after_ the lookup, and the > lock doesn't actually protect any of the data that it is using for the > lookup itself. > > And yes, we have various tricks to try to make the data "safe" even if we > race with the lookup, like the RCU stability of the anon_vma allocation, > so that even if we race, we don't do anything bad. And I don't worry about > the anon_vma, that part looks fine. > > But page_remove_rmap() is run totally unlocked wrt page_lock_anon_vma(), > it looks like. page_remove_rmap() is run under the pte lock, and > page_lock_anon_vma() is run under no lock at all that I can see that is > reliable. > > So yes, we have the same kind of "delay the destroy" wrt page->mapping (ie > page_remove_rmap() doesn't actually clear page->mapping at all), but none > of this runs under any lock at all. > > So as far as I can tell, the only reason "page_lock_anon_vma()" is safe is > because we hopefully always hold an elevated page count when we call it, > so at least the struct page itself will never get freed, so page->mapping > is safe just because it's not cleared. > > But assuming all that is true, it boils down to this: > > - the anon_vma we get may be the wrong one or a stale one (since > page_remove_rmap ran concurrently and we ended up freeing the > anon_vma), but it's always a "valid" anon_vma in the sense that it's > initialized and the list is always pointing to *some* stable set of > vm_area_struct's. > > - if we raced, and the anon_vma is stale, we're going to walk over > some bogus list that pertains to a totally different page, but WHO > REALLY CARES? If it is about another page that got that anon_vma > reallocated to it, all the code that traverses the list of vma's still > has to check the page tables _anyway_. And that will never trigger, and > that will get the pte lock for checking anyway, so at _that_ point do > we correctly finally synchronize with a lock that is actually relevant. > > - the "anon_vma->lock" is totally irrelevant wrt "page_mapped()", and I'm > not seeing what it could possibly help to re-check after getting that > lock. > > So what I'm trying to figure out is why Nick wanted to add another check > for page_mapped(). I'm not seeing what it is supposed to protect against. 
It's not supposed to protect against anything that would be a problem in the existing code (well, I initially thought it might be, but Hugh explained why it's not needed). I'd still like to put the check in, in order to constrain this peculiarity of SLAB_DESTROY_BY_RCU to those couple of functions which allocate or take a reference. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch] mm: fix anon_vma races 2008-10-21 2:56 ` Nick Piggin @ 2008-10-21 3:25 ` Linus Torvalds 2008-10-21 4:33 ` Nick Piggin 2008-10-21 4:34 ` Nick Piggin 0 siblings, 2 replies; 52+ messages in thread From: Linus Torvalds @ 2008-10-21 3:25 UTC (permalink / raw) To: Nick Piggin Cc: Hugh Dickins, Peter Zijlstra, Nick Piggin, Linux Memory Management List On Tue, 21 Oct 2008, Nick Piggin wrote: > > > > So what I'm trying to figure out is why Nick wanted to add another check > > for page_mapped(). I'm not seeing what it is supposed to protect against. > > It's not supposed to protect against anything that would be a problem > in the existing code (well, I initially thought it might be, but Hugh > explained why it's not needed). I'd still like to put the check in, in > order to constrain this peculiarity of SLAB_DESTROY_BY_RCU to those > couple of functions which allocate or take a reference. Hmm. Ok, as long as I understand what it is for (and if it's not a bug-fix but a "like to drop the stale anon_vma early"), I'm ok. So I won't mind, and Hugh seems to prefer it. So if you send that patch along with a good explanation for a changelog entry, I'll apply it. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch] mm: fix anon_vma races 2008-10-21 3:25 ` Linus Torvalds @ 2008-10-21 4:33 ` Nick Piggin 2008-10-21 12:58 ` Hugh Dickins 2008-10-21 15:59 ` Christoph Lameter 2008-10-21 4:34 ` Nick Piggin 1 sibling, 2 replies; 52+ messages in thread From: Nick Piggin @ 2008-10-21 4:33 UTC (permalink / raw) To: Linus Torvalds Cc: Nick Piggin, Hugh Dickins, Peter Zijlstra, Linux Memory Management List On Mon, Oct 20, 2008 at 08:25:54PM -0700, Linus Torvalds wrote: > > > On Tue, 21 Oct 2008, Nick Piggin wrote: > > > > > > So what I'm trying to figure out is why Nick wanted to add another check > > > for page_mapped(). I'm not seeing what it is supposed to protect against. > > > > It's not supposed to protect against anything that would be a problem > > in the existing code (well, I initially thought it might be, but Hugh > > explained why its not needed). I'd still like to put the check in, in > > order to constrain this peculiarity of SLAB_DESTROY_BY_RCU to those > > couple of functions which allocate or take a reference. > > Hmm. Ok, as long as I understand what it is for (and if it's not a > bug-fix but a "like to drop the stale anon_vma early), I'm ok. > > So I won't mind, and Hugh seems to prefer it. So if you send that patch > alogn with a good explanation for a changelog entry, I'll apply it. Well something like this, then. Hugh? -- With the existing SLAB_DESTROY_BY_RCU scheme for anon_vma, page_lock_anon_vma might take the lock of the anon_vma at a point where it has already been freed then re-allocated and reused for something else. This is OK (with the exception of the now-fixed case where newly allocated anon_vma had its list manipulated without holding the lock), because in order to get to the pte, the page tables must be walked and the pte confirmed to point to this page anyway. So technically it should work. The problem with it is that it is quite subtle, and it means that we have to keep this stale-anon_vma problem in the back of our minds, when reviewing or modifying any part of the anonymous rmap code. It *could* be that it would break some otherwise legitimate change to the code. Add another page_mapped check to weed out these anon_vmas. Comment the existing page_mapped check a little bit. Signed-off-by: Nick Piggin <npiggin@suse.de> --- Index: linux-2.6/mm/rmap.c =================================================================== --- linux-2.6.orig/mm/rmap.c +++ linux-2.6/mm/rmap.c @@ -200,11 +200,47 @@ struct anon_vma *page_lock_anon_vma(stru anon_mapping = (unsigned long) page->mapping; if (!(anon_mapping & PAGE_MAPPING_ANON)) goto out; + + /* + * The page_mapped check is required in order to ensure anon_vma is + * protected under this RCU critical section before we touch it. + * + * If page_mapped was not checked, page->mapping may refer to an + * anon_vma that has since been freed (see page_remove_rmap comment not + * resetting PageAnon). And hence it would not be protected with RCU + * and could be freed and reused at any time. + */ if (!page_mapped(page)) goto out; anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON); spin_lock(&anon_vma->lock); + + /* + * If the page is no longer mapped, we have no way to keep the anon_vma + * stable. It may be freed and even re-allocated for some other set of + * anonymous mappings at any point. Technically this should be OK, as + * we hold the spinlock, and should be able to tolerate finding + * unrelated vmas on our list. However we'd rather nip these in the bud + * here, for simplicity. 
+ * + * If the page is mapped while we have the lock on the anon_vma, then + * we know anon_vma_unlink can't run and garbage collect the anon_vma: + * unmapping the page and decrementing its mapcount happens before + * unlinking the anon_vma; unlinking the anon_vma requires the + * anon_vma lock to be held. So this check ensures we have a stable + * anon_vma. + * + * Note: the page can still become unmapped, and the !page_mapped + * condition become true at any point. This check is definitely not + * preventing any such thing. + */ + if (unlikely(!page_mapped(page))) { + spin_unlock(&anon_vma->lock); + goto out; + } + VM_BUG_ON(anon_mapping != (unsigned long)page->mapping); + return anon_vma; out: rcu_read_unlock(); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
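The ordering the new comment relies on is visible from the teardown side. A condensed sketch, with the real call chain (zap_pte_range(), then free_pgtables()/remove_vma()) elided and names as of this era's tree - treat it as an assumption-laden outline, not the literal code:

	/* munmap/exit side, in this order: */
	/* under the pte lock: */
	page_remove_rmap(page);		/* the page's mapcount drops here */
	/* ... later, after the pte lock is dropped: */
	anon_vma_unlink(vma);		/* takes anon_vma->lock, removes the vma,
					 * frees the anon_vma once its list is empty */

So while anon_vma->lock is held and page_mapped(page) is still true, the anon_vma_unlink() that could free this anon_vma has not finished, and cannot finish until the lock is released.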
* Re: [patch] mm: fix anon_vma races 2008-10-21 4:33 ` Nick Piggin @ 2008-10-21 12:58 ` Hugh Dickins 2008-10-21 15:59 ` Christoph Lameter 1 sibling, 0 replies; 52+ messages in thread From: Hugh Dickins @ 2008-10-21 12:58 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, Nick Piggin, Peter Zijlstra, Linux Memory Management List On Tue, 21 Oct 2008, Nick Piggin wrote: > On Mon, Oct 20, 2008 at 08:25:54PM -0700, Linus Torvalds wrote: > > On Tue, 21 Oct 2008, Nick Piggin wrote: > > > > > > > > So what I'm trying to figure out is why Nick wanted to add another check > > > > for page_mapped(). I'm not seeing what it is supposed to protect against. > > > > > > It's not supposed to protect against anything that would be a problem > > > in the existing code (well, I initially thought it might be, but Hugh > > > explained why its not needed). I'd still like to put the check in, in > > > order to constrain this peculiarity of SLAB_DESTROY_BY_RCU to those > > > couple of functions which allocate or take a reference. > > > > Hmm. Ok, as long as I understand what it is for (and if it's not a > > bug-fix but a "like to drop the stale anon_vma early), I'm ok. > > > > So I won't mind, and Hugh seems to prefer it. So if you send that patch (I'd prefer just a comment myself, but I do see Nick's point of view.) > > alogn with a good explanation for a changelog entry, I'll apply it. > > Well something like this, then. Hugh? Yes, that's good. I'm not a huge fan of such comments that dwarf the code, but I've rather disqualified myself by going to the opposite extreme, and it's hard to get over in less, and I think all you've said is correct and relevant. Would it be better without the VM_BUG_ON on page->mapping? I find that a bit distracting and of no interest, though I guess it's a further way of clarifying the assumptions. You could as well add a VM_BUG_ON(!page_count(page)), but I don't really want that. Thanks, Hugh > > -- > > With the existing SLAB_DESTROY_BY_RCU scheme for anon_vma, page_lock_anon_vma > might take the lock of the anon_vma at a point where it has already been freed > then re-allocated and reused for something else. > > This is OK (with the exception of the now-fixed case where newly allocated > anon_vma had its list manipulated without holding the lock), because in order > to get to the pte, the page tables must be walked and the pte confirmed to > point to this page anyway. So technically it should work. > > The problem with it is that it is quite subtle, and it means that we have to > keep this stale-anon_vma problem in the back of our minds, when reviewing or > modifying any part of the anonymous rmap code. It *could* be that it would > break some otherwise legitimate change to the code. > > Add another page_mapped check to weed out these anon_vmas. Comment the > existing page_mapped check a little bit. > > Signed-off-by: Nick Piggin <npiggin@suse.de> > --- > Index: linux-2.6/mm/rmap.c > =================================================================== > --- linux-2.6.orig/mm/rmap.c > +++ linux-2.6/mm/rmap.c > @@ -200,11 +200,47 @@ struct anon_vma *page_lock_anon_vma(stru > anon_mapping = (unsigned long) page->mapping; > if (!(anon_mapping & PAGE_MAPPING_ANON)) > goto out; > + > + /* > + * The page_mapped check is required in order to ensure anon_vma is > + * protected under this RCU critical section before we touch it. 
> + * > + * If page_mapped was not checked, page->mapping may refer to an > + * anon_vma that has since been freed (see page_remove_rmap comment not > + * resetting PageAnon). And hence it would not be protected with RCU > + * and could be freed and reused at any time. > + */ > if (!page_mapped(page)) > goto out; > > anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON); > spin_lock(&anon_vma->lock); > + > + /* > + * If the page is no longer mapped, we have no way to keep the anon_vma > + * stable. It may be freed and even re-allocated for some other set of > + * anonymous mappings at any point. Technically this should be OK, as > + * we hold the spinlock, and should be able to tolerate finding > + * unrelated vmas on our list. However we'd rather nip these in the bud > + * here, for simplicity. > + * > + * If the page is mapped while we have the lock on the anon_vma, then > + * we know anon_vma_unlink can't run and garbage collect the anon_vma: > + * unmapping the page and decrementing its mapcount happens before > + * unlinking the anon_vma; unlinking the anon_vma requires the > + * anon_vma lock to be held. So this check ensures we have a stable > + * anon_vma. > + * > + * Note: the page can still become unmapped, and the !page_mapped > + * condition become true at any point. This check is definitely not > + * preventing any such thing. > + */ > + if (unlikely(!page_mapped(page))) { > + spin_unlock(&anon_vma->lock); > + goto out; > + } > + VM_BUG_ON(anon_mapping != (unsigned long)page->mapping); > + > return anon_vma; > out: > rcu_read_unlock(); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch] mm: fix anon_vma races 2008-10-21 4:33 ` Nick Piggin 2008-10-21 12:58 ` Hugh Dickins @ 2008-10-21 15:59 ` Christoph Lameter 2008-10-22 9:29 ` Nick Piggin 1 sibling, 1 reply; 52+ messages in thread From: Christoph Lameter @ 2008-10-21 15:59 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, Nick Piggin, Hugh Dickins, Peter Zijlstra, Linux Memory Management List Nick Piggin wrote: > if (!page_mapped(page)) > goto out; > > anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON); Isn't it possible for the anon_vma to be freed and reallocated for another use after this statement and before taking the lock? > spin_lock(&anon_vma->lock); Then we may take the spinlock on another anon_vma structure not related to this page. > + > + /* > + * If the page is no longer mapped, we have no way to keep the anon_vma > + * stable. It may be freed and even re-allocated for some other set of > + * anonymous mappings at any point. Technically this should be OK, as > + * we hold the spinlock, and should be able to tolerate finding > + * unrelated vmas on our list. However we'd rather nip these in the bud > + * here, for simplicity. > + * > + * If the page is mapped while we have the lock on the anon_vma, then > + * we know anon_vma_unlink can't run and garbage collect the anon_vma: > + * unmapping the page and decrementing its mapcount happens before > + * unlinking the anon_vma; unlinking the anon_vma requires the > + * anon_vma lock to be held. So this check ensures we have a stable > + * anon_vma. > + * > + * Note: the page can still become unmapped, and the !page_mapped > + * condition become true at any point. This check is definitely not > + * preventing any such thing. > + */ What is this then? An optimization? > + if (unlikely(!page_mapped(page))) { > + spin_unlock(&anon_vma->lock); > + goto out; > + } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch] mm: fix anon_vma races 2008-10-21 15:59 ` Christoph Lameter @ 2008-10-22 9:29 ` Nick Piggin 0 siblings, 0 replies; 52+ messages in thread From: Nick Piggin @ 2008-10-22 9:29 UTC (permalink / raw) To: Christoph Lameter Cc: Linus Torvalds, Nick Piggin, Hugh Dickins, Peter Zijlstra, Linux Memory Management List On Tue, Oct 21, 2008 at 10:59:40AM -0500, Christoph Lameter wrote: > Nick Piggin wrote: > > > > if (!page_mapped(page)) > > goto out; > > > > anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON); > > Isnt it possible for the anon_vma to be freed and reallocated for another use > after this statement and before taking the lock? Yes. > > spin_lock(&anon_vma->lock); > > Then we may take the spinlock on another anon_vma structure not related to > this page. Yes. > > + > > + /* > > + * If the page is no longer mapped, we have no way to keep the anon_vma > > + * stable. It may be freed and even re-allocated for some other set of > > + * anonymous mappings at any point. Technically this should be OK, as > > + * we hold the spinlock, and should be able to tolerate finding > > + * unrelated vmas on our list. However we'd rather nip these in the bud > > + * here, for simplicity. > > + * > > + * If the page is mapped while we have the lock on the anon_vma, then > > + * we know anon_vma_unlink can't run and garbage collect the anon_vma: > > + * unmapping the page and decrementing its mapcount happens before > > + * unlinking the anon_vma; unlinking the anon_vma requires the > > + * anon_vma lock to be held. So this check ensures we have a stable > > + * anon_vma. > > + * > > + * Note: the page can still become unmapped, and the !page_mapped > > + * condition become true at any point. This check is definitely not > > + * preventing any such thing. > > + */ > > What is this then? An optimization? As the comment says, it filters out those unrelated anon_vmas. In doing so it allows us to guarantee the reference with the lock alone (as-per the next patch). Also just means we don't have to care about that case (even though it's not technically wrong). > > > + if (unlikely(!page_mapped(page))) { > > + spin_unlock(&anon_vma->lock); > > + goto out; > > + } > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
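Stepping back, the contract Christoph and Nick are circling is the general SLAB_DESTROY_BY_RCU one (this mirrors the mm/slab.c comment; the cache name, object type and revalidation test below are invented for illustration):

	cachep = kmem_cache_create("example", sizeof(struct obj), 0,
				   SLAB_DESTROY_BY_RCU, obj_ctor);

	rcu_read_lock();
	obj = unlocked_lookup(key);	/* obj may have been freed and recycled,
					 * but the memory remains a struct obj
					 * whose lock the ctor has initialized */
	spin_lock(&obj->lock);
	if (!obj_matches(obj, key)) {	/* must revalidate identity under the lock;
					 * for anon_vma this is page_mapped() */
		spin_unlock(&obj->lock);
		obj = NULL;
	}
	rcu_read_unlock();

The flag defers freeing of the slab page, not of individual objects, so an RCU-side lookup can never touch unmapped memory but can land on a recycled object - hence the revalidation step.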
* Re: [patch] mm: fix anon_vma races 2008-10-21 3:25 ` Linus Torvalds 2008-10-21 4:33 ` Nick Piggin @ 2008-10-21 4:34 ` Nick Piggin 2008-10-21 13:55 ` Hugh Dickins 1 sibling, 1 reply; 52+ messages in thread From: Nick Piggin @ 2008-10-21 4:34 UTC (permalink / raw) To: Linus Torvalds Cc: Nick Piggin, Hugh Dickins, Peter Zijlstra, Linux Memory Management List On Mon, Oct 20, 2008 at 08:25:54PM -0700, Linus Torvalds wrote: > > > On Tue, 21 Oct 2008, Nick Piggin wrote: > > > > > > So what I'm trying to figure out is why Nick wanted to add another check > > > for page_mapped(). I'm not seeing what it is supposed to protect against. > > > > It's not supposed to protect against anything that would be a problem > > in the existing code (well, I initially thought it might be, but Hugh > > explained why its not needed). I'd still like to put the check in, in > > order to constrain this peculiarity of SLAB_DESTROY_BY_RCU to those > > couple of functions which allocate or take a reference. > > Hmm. Ok, as long as I understand what it is for (and if it's not a > bug-fix but a "like to drop the stale anon_vma early), I'm ok. > > So I won't mind, and Hugh seems to prefer it. So if you send that patch > alogn with a good explanation for a changelog entry, I'll apply it. And after that patch, I *think* we should be able to do something like this. -- With the change to return only stable, non-empty anon_vmas from page_lock_anon_vma, we no longer have to hold off RCU while looking at the anon_vma. After this change, the lockless referencing, and interesting SLAB_DESTROY_BY_RCU behaviour is pretty well localised to page_lock_anon_vma and anon_vma_prepare. Today, for normal RCU, this doesn't matter much. For preemptible RCU and preemptible anon_vma lock, this change could help with keeping RCU ticking. It could also help if we ever wanted to add a sleeping lock to anon_vma. Basically just fewer nested dependencies ~= more flexible and maintainable. Signed-off-by: Nick Piggin <npiggin@suse.de> --- Index: linux-2.6/mm/rmap.c =================================================================== --- linux-2.6.orig/mm/rmap.c +++ linux-2.6/mm/rmap.c @@ -239,6 +239,8 @@ struct anon_vma *page_lock_anon_vma(stru spin_unlock(&anon_vma->lock); goto out; } + rcu_read_unlock(); + VM_BUG_ON(anon_mapping != (unsigned long)page->mapping); return anon_vma; @@ -250,7 +252,6 @@ out: void page_unlock_anon_vma(struct anon_vma *anon_vma) { spin_unlock(&anon_vma->lock); - rcu_read_unlock(); } /* -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch] mm: fix anon_vma races 2008-10-21 4:34 ` Nick Piggin @ 2008-10-21 13:55 ` Hugh Dickins 0 siblings, 0 replies; 52+ messages in thread From: Hugh Dickins @ 2008-10-21 13:55 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, Christoph Lameter, Nick Piggin, Peter Zijlstra, Linux Memory Management List On Tue, 21 Oct 2008, Nick Piggin wrote: > On Mon, Oct 20, 2008 at 08:25:54PM -0700, Linus Torvalds wrote: > > And after that patch, I *think* we should be able to do something like > this. > > -- > With the change to return only stable, non-empty anon_vmas from > page_lock_anon_vma, we no longer have to hold off RCU while looking at > the anon_vma. After this change, the lockless referencing, and interesting > SLAB_DESTROY_BY_RCU behaviour is pretty well localised to page_lock_anon_vma > and anon_vma_prepare. > > Today, for normal RCU, this doesn't matter much. For preemptible RCU and > preemptible anon_vma lock, this change could help with keeping RCU ticking. > It could also help if we ever wanted to add a sleeping lock to anon_vma. > Basically just fewer nested dependencies ~= more flexible and maintainable. > > Signed-off-by: Nick Piggin <npiggin@suse.de> Interesting. That's how it used to be originally (and we just did the spin_unlock directly without any page_unlock_anon_vma wrapper). I rather liked keeping the RCU trickery in the one function. But it worried ChristophL that way (and caused the -rt tree problems?): eventually he persuaded me to allow the patch moving rcu_read_unlock() after the spin_unlock(). I think he was seeing the same point that you are seeing, when you say that this can come (only) after your patch checking page_mapped i.e. anon_vma stability after getting the spinlock. Since I only knew classic RCU in which rcu_read_lock is preempt_disable, and a spin_lock does preempt_disable, it was all theoretical to me. I like this patch, but let's see how Christoph feels about it. Hugh > --- > Index: linux-2.6/mm/rmap.c > =================================================================== > --- linux-2.6.orig/mm/rmap.c > +++ linux-2.6/mm/rmap.c > @@ -239,6 +239,8 @@ struct anon_vma *page_lock_anon_vma(stru > spin_unlock(&anon_vma->lock); > goto out; > } > + rcu_read_unlock(); > + > VM_BUG_ON(anon_mapping != (unsigned long)page->mapping); > > return anon_vma; > @@ -250,7 +252,6 @@ out: > void page_unlock_anon_vma(struct anon_vma *anon_vma) > { > spin_unlock(&anon_vma->lock); > - rcu_read_unlock(); > } > > /* -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
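As a usage sketch (schematic loop body; the real callers are try_to_unmap_anon() and page_referenced_anon()), the caller-side contract after this patch would be:

	anon_vma = page_lock_anon_vma(page);
	if (!anon_vma)
		return;
	/* the RCU read-side section already ended inside page_lock_anon_vma();
	 * only anon_vma->lock is held across the walk */
	list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
		/* per-vma work; pte lock taken as needed */
	}
	page_unlock_anon_vma(anon_vma);		/* now a bare spin_unlock */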
* Re: [patch] mm: fix anon_vma races 2008-10-19 18:25 ` Linus Torvalds ` (2 preceding siblings ...) 2008-10-20 4:03 ` Hugh Dickins @ 2008-10-21 2:44 ` Nick Piggin 3 siblings, 0 replies; 52+ messages in thread From: Nick Piggin @ 2008-10-21 2:44 UTC (permalink / raw) To: Linus Torvalds, stable Cc: Peter Zijlstra, Hugh Dickins, Nick Piggin, Linux Memory Management List On Monday 20 October 2008 05:25, Linus Torvalds wrote: > On Sun, 19 Oct 2008, Peter Zijlstra wrote: > > Part of the confusion is that we don't clear those pointers at the end > > of their lifetimes (page_remove_rmap and anon_vma_unlink). > > > > I guess the !page_mapping() test in page_lock_anon_vma() is meant to > > deal with this > > Hmm. So that part I'm still not entirely convinced about. > > The thing is, we have two issues on anon_vma usage, and the > page_lock_anon_vma() usage in particular: > > - the integrity of the list itself > > Here it should be sufficient to just always get the lock, to the point > where we don't need to care about anything else. So getting the lock > properly on new allocation makes all the other races irrelevant. > > - the integrity of the _result_ of traversing the list > > This is what the !page_mapping() thing is supposedly protecting > against, I think. > > But as far as I can tell, there's really two different use cases here: > (a) people who care deeply about the result and (b) people who don't. > > And the difference between the two cases is whether they had the page > locked or not. The "try_to_unmap()" callers care deeply, and lock the > page. In contrast, some "page_referenced()" callers (really just > shrink_active_list) don't care deeply, and to them the return value is > really just a heuristic. > > As far as I can tell, all the people who care deeply will lock the page > (and _have_ to lock the page), and thus 'page->mapping' should be stable > for those cases. > > And then we have the other cases, who just want a heuristic, and they > don't hold the page lock, but if we look at the wrong active_vma that has > gotten reallocated to something else, they don't even really care. > > So I'm not seeing the reason for that check for page_mapped() at the end. > Does it actually protect against anything relevant? > > Anyway, I _think_ the part that everybody agrees about is the initial > locking of the anon_vma. Whether we then even need any memory barriers > and/or the page_mapped() check is an independent question. Yes? No? > > So I'm suggesting this commit as the part we at least all agree on. But I > haven't pushed it out yet, so you can still holler.. But I think all the > discussion is about other issues, and we all agree on at least this part? > > Linus > > --- > From f422f2ec50872331820f15711f48b9ffc9cbb64e Mon Sep 17 00:00:00 2001 > From: Linus Torvalds <torvalds@linux-foundation.org> > Date: Sun, 19 Oct 2008 10:32:20 -0700 > Subject: [PATCH] anon_vma_prepare: properly lock even newly allocated > entries > > The anon_vma code is very subtle, and we end up doing optimistic lookups > of anon_vmas under RCU in page_lock_anon_vma() with no locking. Other > CPU's can also see the newly allocated entry immediately after we've > exposed it by setting "vma->anon_vma" to the new value. > > We protect against the anon_vma being destroyed by having the SLAB > marked as SLAB_DESTROY_BY_RCU, so the RCU lookup can depend on the > allocation not being destroyed - but it might still be free'd and > re-allocated here to a new vma. 
> > As a result, we should not do the anon_vma list ops on a newly allocated > vma without proper locking. Thanks, this is exactly what I proposed. I preferred to add more explicit comments about why it's OK to expose anon_vma to vma->anon_vma without ordering initialisation etc. But those comments were mainly for the benefit of others. If you and Hugh now agree with me on that, then maybe no comments are required. > > Acked-by: Nick Piggin <npiggin@suse.de> > Acked-by: Hugh Dickins <hugh@veritas.com> > Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> > --- > mm/rmap.c | 42 ++++++++++++++++++++++++++++++++---------- > 1 files changed, 32 insertions(+), 10 deletions(-) > > diff --git a/mm/rmap.c b/mm/rmap.c > index 0383acf..e8d639b 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -55,7 +55,33 @@ > > struct kmem_cache *anon_vma_cachep; > > -/* This must be called under the mmap_sem. */ > +/** > + * anon_vma_prepare - attach an anon_vma to a memory region > + * @vma: the memory region in question > + * > + * This makes sure the memory mapping described by 'vma' has > + * an 'anon_vma' attached to it, so that we can associate the > + * anonymous pages mapped into it with that anon_vma. > + * > + * The common case will be that we already have one, but if > + * if not we either need to find an adjacent mapping that we > + * can re-use the anon_vma from (very common when the only > + * reason for splitting a vma has been mprotect()), or we > + * allocate a new one. > + * > + * Anon-vma allocations are very subtle, because we may have > + * optimistically looked up an anon_vma in page_lock_anon_vma() > + * and that may actually touch the spinlock even in the newly > + * allocated vma (it depends on RCU to make sure that the > + * anon_vma isn't actually destroyed). > + * > + * As a result, we need to do proper anon_vma locking even > + * for the new allocation. At the same time, we do not want > + * to do any locking for the common case of already having > + * an anon_vma. > + * > + * This must be called with the mmap_sem held for reading. > + */ > int anon_vma_prepare(struct vm_area_struct *vma) > { > struct anon_vma *anon_vma = vma->anon_vma; > @@ -63,20 +89,17 @@ int anon_vma_prepare(struct vm_area_struct *vma) > might_sleep(); > if (unlikely(!anon_vma)) { > struct mm_struct *mm = vma->vm_mm; > - struct anon_vma *allocated, *locked; > + struct anon_vma *allocated; > > anon_vma = find_mergeable_anon_vma(vma); > - if (anon_vma) { > - allocated = NULL; > - locked = anon_vma; > - spin_lock(&locked->lock); > - } else { > + allocated = NULL; > + if (!anon_vma) { > anon_vma = anon_vma_alloc(); > if (unlikely(!anon_vma)) > return -ENOMEM; > allocated = anon_vma; > - locked = NULL; > } > + spin_lock(&anon_vma->lock); > > /* page_table_lock to protect against threads */ > spin_lock(&mm->page_table_lock); > @@ -87,8 +110,7 @@ int anon_vma_prepare(struct vm_area_struct *vma) > } > spin_unlock(&mm->page_table_lock); > > - if (locked) > - spin_unlock(&locked->lock); > + spin_unlock(&anon_vma->lock); > if (unlikely(allocated)) > anon_vma_free(allocated); > } > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
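For reference, applying that diff leaves anon_vma_prepare() looking roughly like this (reconstructed mechanically from the hunks above; the long comment block is trimmed):

	int anon_vma_prepare(struct vm_area_struct *vma)
	{
		struct anon_vma *anon_vma = vma->anon_vma;

		might_sleep();
		if (unlikely(!anon_vma)) {
			struct mm_struct *mm = vma->vm_mm;
			struct anon_vma *allocated;

			anon_vma = find_mergeable_anon_vma(vma);
			allocated = NULL;
			if (!anon_vma) {
				anon_vma = anon_vma_alloc();
				if (unlikely(!anon_vma))
					return -ENOMEM;
				allocated = anon_vma;
			}
			spin_lock(&anon_vma->lock);

			/* page_table_lock to protect against threads */
			spin_lock(&mm->page_table_lock);
			if (likely(!vma->anon_vma)) {
				vma->anon_vma = anon_vma;
				list_add_tail(&vma->anon_vma_node, &anon_vma->head);
				allocated = NULL;
			}
			spin_unlock(&mm->page_table_lock);

			spin_unlock(&anon_vma->lock);
			if (unlikely(allocated))
				anon_vma_free(allocated);
		}
		return 0;
	}

The list manipulation now always happens under anon_vma->lock, whether the anon_vma was merged from a neighbour or freshly allocated.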
* Re: [patch] mm: fix anon_vma races 2008-10-18 2:25 ` Nick Piggin 2008-10-18 2:35 ` Nick Piggin 2008-10-18 2:53 ` Linus Torvalds @ 2008-10-18 19:14 ` Hugh Dickins 2008-10-19 3:03 ` Nick Piggin 2 siblings, 1 reply; 52+ messages in thread From: Hugh Dickins @ 2008-10-18 19:14 UTC (permalink / raw) To: Nick Piggin; +Cc: Linus Torvalds, Peter Zijlstra, Linux Memory Management List Sorry, I've only just got back to this, and just had one quick read through the thread. A couple of points on what I believe so far... On Sat, 18 Oct 2008, Nick Piggin wrote: > On Fri, Oct 17, 2008 at 07:11:38PM -0700, Linus Torvalds wrote: > > On Sat, 18 Oct 2008, Nick Piggin wrote: > > > > > > We can't do that, unfortuantely, because anon_vmas are allocated with > > > SLAB_DESTROY_BY_RCU. > > > > Aughh. I see what you're saying. We don't _free_ them by RCU, we just > > destroy the page allocation. So an anon_vma can get _re-allocated_ for > > another page (without being destroyed), concurrently with somebody > > optimistically being busy with that same anon_vma that they got through > > that optimistic 'page_lock_anon_vma()' thing. > > > > So if we were to just set the lock, we might actually be messing with > > something that is still actively used by the previous page that was > > unmapped concurrently and still being accessed by try_to_unmap_anon. So > > even though we allocated a "new" anon_vma, it might still be busy. > > > > Yes? No? Yes, I believe Linus is right; but need to mull it over some more. That not-taking-the-lock-on-newly-allocated optimization comes from Andrea, and dates from before I removed the page_map_lock(). It looks like an optimization that was valid in Andrea's original, but something I should have removed when going to SLAB_DESTROY_BY_RCU. > > That's what I'm thinking, yes. But I admit the last time I looked at > this really closely was probably reading through Hugh's patches and > changelogs (which at the time must have convinced me ;)). So I could > be wrong. > > > > That thing really is too subtle for words. But if that's actually what you > > are alluding to, then doesn't that mean that we _really_ should be doing > > that "spin_lock(&anon_vma->lock)" even for the first allocation, and that > > the current code is broken? Because otherwise that other concurrent user > > that found the stale vma through page_lock_anon_vma() will now try to > > follow the linked list and _think_ it's stable (thanks to the lock), but > > we're actually inserting entries into it without holding any locks at all. > > Yes, that's what I meant by "has other problems". Another thing is also > that even if we have the lock here, I can't see why page_lock_anon_vma > is safe against finding an anon_vma which has been deallocated then > allocated for something else (and had vmas inserted into it etc.). And Nick is right that page_lock_anon_vma() is not safe against finding an anon_vma which has now been allocated for something else: but that is no surprise, it's very much in the nature of SLAB_DESTROY_BY_RCU (I left most of the comment in mm/slab.c, just said "tricky" here). It should be no problem: having locked the right-or-perhaps-wrong anon_vma, we then go on to search its list for a page which may or may not be there, even when it's the right anon_vma; there's no need for special code to deal with the very unlikely case that we've now got an irrelevant list, it's just that the page we're looking for won't be found in it. But not-taking-the-lock-on-newly-allocated does then look wrong. 
Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch] mm: fix anon_vma races 2008-10-18 19:14 ` Hugh Dickins @ 2008-10-19 3:03 ` Nick Piggin 2008-10-19 7:07 ` Hugh Dickins 0 siblings, 1 reply; 52+ messages in thread From: Nick Piggin @ 2008-10-19 3:03 UTC (permalink / raw) To: Hugh Dickins; +Cc: Linus Torvalds, Peter Zijlstra, Linux Memory Management List On Sat, Oct 18, 2008 at 08:14:12PM +0100, Hugh Dickins wrote: > And Nick is right that page_lock_anon_vma() is not safe against finding > an anon_vma which has now been allocated for something else: but that > is no surprise, it's very much in the nature of SLAB_DESTROY_BY_RCU > (I left most of the comment in mm/slab.c, just said "tricky" here). > > It should be no problem: having locked the right-or-perhaps-wrong > anon_vma, we then go on to search its list for a page which may or > may not be there, even when it's the right anon_vma; there's no need OK, so it may be correct but I think that's pretty nasty if we can avoid it so easily. Then we have to keep in mind this special case throughout the code rather than just confining it to the low level take-a-reference function and never having to worry about it. > for special code to deal with the very unlikely case that we've now > got an irrelevant list, it's just that the page we're looking for > won't be found in it. There is already a page_mapped check in there. I'm just going to propose we move that down. No extra branches in the fastpath. OK? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch] mm: fix anon_vma races 2008-10-19 3:03 ` Nick Piggin @ 2008-10-19 7:07 ` Hugh Dickins 2008-10-20 3:26 ` Hugh Dickins 0 siblings, 1 reply; 52+ messages in thread From: Hugh Dickins @ 2008-10-19 7:07 UTC (permalink / raw) To: Nick Piggin; +Cc: Linus Torvalds, Peter Zijlstra, Linux Memory Management List On Sun, 19 Oct 2008, Nick Piggin wrote: > On Sat, Oct 18, 2008 at 08:14:12PM +0100, Hugh Dickins wrote: > > And Nick is right that page_lock_anon_vma() is not safe against finding > > an anon_vma which has now been allocated for something else: but that > > is no surprise, it's very much in the nature of SLAB_DESTROY_BY_RCU > > (I left most of the comment in mm/slab.c, just said "tricky" here). > > > > It should be no problem: having locked the right-or-perhaps-wrong > > anon_vma, we then go on to search its list for a page which may or > > may not be there, even when it's the right anon_vma; there's no need > > OK, so it may be correct but I think that's pretty nasty if we can > avoid it so easily. Then we have to keep in mind this special case > throughout the code rather than just confining it to the low level > take-a-reference function and never having to worry about it. > > > > for special code to deal with the very unlikely case that we've now > > got an irrelevant list, it's just that the page we're looking for > > won't be found in it. > > There is already a page_mapped check in there. I'm just going to > propose we move that down. No extra branches in the fastpath. OK? That should be OK, yes. Looking back at the history, I believe I sited the page_mapped test where it is, partly for simpler flow, and partly to avoid overhead of taking spinlock unnecessarily. But that's much too rare a case to be worrying about: better to leave you comfortable with page_lock_anon_vma() itself. More important will be the comments you add, I'll probably want to quibble on those! Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch] mm: fix anon_vma races 2008-10-19 7:07 ` Hugh Dickins @ 2008-10-20 3:26 ` Hugh Dickins 2008-10-21 2:45 ` Nick Piggin 0 siblings, 1 reply; 52+ messages in thread From: Hugh Dickins @ 2008-10-20 3:26 UTC (permalink / raw) To: Nick Piggin; +Cc: Linus Torvalds, Peter Zijlstra, Linux Memory Management List On Sun, 19 Oct 2008, Hugh Dickins wrote: > On Sun, 19 Oct 2008, Nick Piggin wrote: > > > > There is already a page_mapped check in there. I'm just going to > > propose we move that down. No extra branches in the fastpath. OK? > > That should be OK, yes. Looking back at the history, I believe > I sited the page_mapped test where it is, partly for simpler flow, > and partly to avoid overhead of taking spinlock unnecessarily. Arrgh! What terrible advice I gave you there, completely wrong: that's what happens when I rush a reply instead of thinking. I'm three-quarters through replying to Linus on this, and going into that detail, remember now why its placement is critical. Repeat the page_mapped check before returning if you wish, but do not remove the one that's there: see other mail for explanation. Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch] mm: fix anon_vma races 2008-10-20 3:26 ` Hugh Dickins @ 2008-10-21 2:45 ` Nick Piggin 0 siblings, 0 replies; 52+ messages in thread From: Nick Piggin @ 2008-10-21 2:45 UTC (permalink / raw) To: Hugh Dickins Cc: Nick Piggin, Linus Torvalds, Peter Zijlstra, Linux Memory Management List On Monday 20 October 2008 14:26, Hugh Dickins wrote: > On Sun, 19 Oct 2008, Hugh Dickins wrote: > > On Sun, 19 Oct 2008, Nick Piggin wrote: > > > There is already a page_mapped check in there. I'm just going to > > > propose we move that down. No extra branches in the fastpath. OK? > > > > That should be OK, yes. Looking back at the history, I believe > > I sited the page_mapped test where it is, partly for simpler flow, > > and partly to avoid overhead of taking spinlock unnecessarily. > > Arrgh! What terrible advice I gave you there, completely wrong: > that's what happens when I rush a reply instead of thinking. > > I'm three-quarters through replying to Linus on this, and going > into that detail, remember now why its placement is critical. > > Repeat the page_mapped check before returning if you wish, > but do not remove the one that's there: see other mail for > explanation. Right, thanks for that. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch] mm: fix anon_vma races 2008-10-18 0:42 ` Linus Torvalds 2008-10-18 1:08 ` Linus Torvalds @ 2008-10-19 1:13 ` Hugh Dickins 2008-10-19 2:41 ` Nick Piggin 1 sibling, 1 reply; 52+ messages in thread From: Hugh Dickins @ 2008-10-19 1:13 UTC (permalink / raw) To: Linus Torvalds; +Cc: Peter Zijlstra, Nick Piggin, Linux Memory Management List On Fri, 17 Oct 2008, Linus Torvalds wrote: > On Fri, 17 Oct 2008, Linus Torvalds wrote: > > > > But I think that what Nick did is correct - we always start traversal > > through anon_vma->head, so no, the "list_add_tail()" won't expose it to > > anybody else, because nobody else has seen the anon_vma(). > > > > That said, that's really too damn subtle. We shouldn't rely on memory > > ordering for the list handling, when the list handling is _supposed_ to be > > using that anon_vma->lock thing. > > So maybe a better patch would be as follows? It simplifies the whole thing > by just always locking and unlocking the vma, whether it's newly allocated > or not (and whether it then gets dropped as unnecessary or not). > > It still does that "smp_read_barrier_depends()" in the same old place. I > don't have the energy to look at Hugh's point about people reading > anon_vma without doing the whole "prepare" thing. > > It adds more lines than it removes, but it's just because of the comments. > With the locking simplification, it actually removes more lines of actual > code than it adds. And now we always do that list_add_tail() with the > anon_vma lock held, which should simplify thinking about this, and avoid > at least one subtle ordering issue. > > Linus I'm slowly approaching the conclusion that this is the only patch which is needed here. The newly-allocated "locked = NULL" mis-optimization still looks wrong to me in the face of SLAB_DESTROY_BY_RCU, and you kill that. You also address Nick's second point about barriers: you've arranged them differently, but I don't think that matters; or the smp_wmb() could go into the "allocated = anon_vma" block, couldn't it? that would reduce its overhead a little. (If we needed more than an Alpha-barrier in the common path, then I'd look harder for a way to avoid it more often, but it'll do as is.) I thought for a while that even the barriers weren't needed, because the only thing mmap.c and memory.c do with anon_vma (until they've up_readed mmap_sem and down_writed it to rearrange vmas) is note its address. Then I found one exception, the use of anon_vma_lock() in expand_downwards() and expand_upwards() (it's not really being used as an anon_vma lock, just as a convenient spinlock to serialize concurrent stack faults for a moment): but I don't think that could ever actually need the barriers, the anon_vma for the stack should be well-established before there can be any racing threads on it. But at last I realized the significant exception is right there in anon_vma_prepare(): the spin_lock(&anon_vma->lock) of an anon_vma coming back from find_mergeable_anon_vma() does need that lock to be visibly initialized - that is the clinching case for barriers. That leaves Nick's original point, of the three CPUs with the third doing reclaim, with my point about needing smp_read_barrier_depends() over there. I now think those races were illusory, that we were all overlooking something. 
Reclaim (or page migration) doesn't arrive at those pages by scanning the old mem_map[] array, it gets them off an LRU list, whose spinlock is locked and unlocked to take them off; and the original faulting CPU had to lock and unlock that spinlock to put them on the LRU originally, at a stage after its anon_vma_prepare(). Surely we have enough barriers there to make sure that anon_vma->lock is visibly initialized by the time page_lock_anon_vma() tries to take it? And it's not any kind of coincidence: isn't this a general pattern, that a newly initialized structure containing a lock is made available to other threads such as reclaim, and they can rely on that lock being visibly initialized, because the structure is made available to them by being put onto and examined on some separately locked list or tree? Nick, are you happy with Linus's patch below? Or if not, please explain again what's missing - thanks. Hugh > > --- > mm/rmap.c | 33 +++++++++++++++++++++++---------- > 1 files changed, 23 insertions(+), 10 deletions(-) > > diff --git a/mm/rmap.c b/mm/rmap.c > index 0383acf..9221bf7 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -63,35 +63,48 @@ int anon_vma_prepare(struct vm_area_struct *vma) > might_sleep(); > if (unlikely(!anon_vma)) { > struct mm_struct *mm = vma->vm_mm; > - struct anon_vma *allocated, *locked; > + struct anon_vma *allocated; > > anon_vma = find_mergeable_anon_vma(vma); > - if (anon_vma) { > - allocated = NULL; > - locked = anon_vma; > - spin_lock(&locked->lock); > - } else { > + allocated = NULL; > + if (!anon_vma) { > anon_vma = anon_vma_alloc(); > if (unlikely(!anon_vma)) > return -ENOMEM; > allocated = anon_vma; > - locked = NULL; > } > + spin_lock(&anon_vma->lock); > > /* page_table_lock to protect against threads */ > spin_lock(&mm->page_table_lock); > if (likely(!vma->anon_vma)) { > + /* > + * We hold the mm->page_table_lock, but another > + * CPU may be doing an optimistic load (the one > + * at the top), and we want to make sure that > + * the anon_vma changes are visible. > + */ > + smp_wmb(); > vma->anon_vma = anon_vma; > list_add_tail(&vma->anon_vma_node, &anon_vma->head); > allocated = NULL; > } > spin_unlock(&mm->page_table_lock); > - > - if (locked) > - spin_unlock(&locked->lock); > + spin_unlock(&anon_vma->lock); > if (unlikely(allocated)) > anon_vma_free(allocated); > } > + /* > + * Subtle: we looked up anon_vma without any locking > + * (in the comon case), and are going to look at the > + * spinlock etc behind it. In order to know that it's > + * initialized, we need to do a read barrier here. > + * > + * We can use the cheaper "depends" version, since we > + * are following a pointer, and only on alpha may that > + * give a stale value. > + */ > + smp_read_barrier_depends(); > return 0; > } > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch] mm: fix anon_vma races 2008-10-19 1:13 ` Hugh Dickins @ 2008-10-19 2:41 ` Nick Piggin 2008-10-19 9:45 ` Hugh Dickins 0 siblings, 1 reply; 52+ messages in thread From: Nick Piggin @ 2008-10-19 2:41 UTC (permalink / raw) To: Hugh Dickins; +Cc: Linus Torvalds, Peter Zijlstra, Linux Memory Management List On Sun, Oct 19, 2008 at 02:13:06AM +0100, Hugh Dickins wrote: > On Fri, 17 Oct 2008, Linus Torvalds wrote: > > On Fri, 17 Oct 2008, Linus Torvalds wrote: > > > > > > But I think that what Nick did is correct - we always start traversal > > > through anon_vma->head, so no, the "list_add_tail()" won't expose it to > > > anybody else, because nobody else has seen the anon_vma(). > > > > > > That said, that's really too damn subtle. We shouldn't rely on memory > > > ordering for the list handling, when the list handling is _supposed_ to be > > > using that anon_vma->lock thing. > > > > So maybe a better patch would be as follows? It simplifies the whole thing > > by just always locking and unlocking the vma, whether it's newly allocated > > or not (and whether it then gets dropped as unnecessary or not). > > > > It still does that "smp_read_barrier_depends()" in the same old place. I > > don't have the energy to look at Hugh's point about people reading > > anon_vma without doing the whole "prepare" thing. > > > > It adds more lines than it removes, but it's just because of the comments. > > With the locking simplification, it actually removes more lines of actual > > code than it adds. And now we always do that list_add_tail() with the > > anon_vma lock held, which should simplify thinking about this, and avoid > > at least one subtle ordering issue. > > > > Linus > > I'm slowly approaching the conclusion that this is the only patch > which is needed here. > > The newly-allocated "locked = NULL" mis-optimization still looks > wrong to me in the face of SLAB_DESTROY_BY_RCU, and you kill that. > > You also address Nick's second point about barriers: you've arranged > them differently, but I don't think that matters; or the smp_wmb() > could go into the "allocated = anon_vma" block, couldn't it? that > would reduce its overhead a little. (If we needed more than an > Alpha-barrier in the common path, then I'd look harder for a way > to avoid it more often, but it'll do as is.) > > I thought for a while that even the barriers weren't needed, because > the only thing mmap.c and memory.c do with anon_vma (until they've > up_readed mmap_sem and down_writed it to rearrange vmas) is note its > address. Then I found one exception, the use of anon_vma_lock() > in expand_downwards() and expand_upwards() (it's not really being > used as an anon_vma lock, just as a convenient spinlock to serialize > concurrent stack faults for a moment): but I don't think that could > ever actually need the barriers, the anon_vma for the stack should > be well-established before there can be any racing threads on it. > > But at last I realized the significant exception is right there in > anon_vma_prepare(): the spin_lock(&anon_vma->lock) of an anon_vma > coming back from find_mergeable_anon_vma() does need that lock to > be visibly initialized - that is the clinching case for barriers. > > That leaves Nick's original point, of the three CPUs with the third > doing reclaim, with my point about needing smp_read_barrier_depends() > over there. I now think those races were illusory, that we were all > overlooking something. 
Reclaim (or page migration) doesn't arrive at > those pages by scanning the old mem_map[] array, it gets them off an > LRU list, whose spinlock is locked and unlocked to take them off; and > the original faulting CPU had to lock and unlock that spinlock to put > them on the LRU originally, at a stage after its anon_vma_prepare(). > > Surely we have enough barriers there to make sure that anon_vma->lock > is visibly initialized by the time page_lock_anon_vma() tries to take > it? I don't think so. According to the race, the stores to anon_vma from the first process would not arrive at the reclaimer in time. There is no amount of barriers the reclaimer can perform to change this. But anyway all that goes away if we just use locking properly as per my last patch (see below). > And it's not any kind of coincidence: isn't this a general pattern, > that a newly initialized structure containing a lock is made available > to other threads such as reclaim, and they can rely on that lock being > visibly initialized, because the structure is made available to them > by being put onto and examined on some separately locked list or tree? > > Nick, are you happy with Linus's patch below? > Or if not, please explain again what's missing - thanks. No, I don't agree with the need for barriers and I think Linus had also not worked through the whole problem at this point because of the just-in-case barriers and hoping to initialise the newly allocated lock as locked, as an optimisation. With my patch, the rules are simple: anybody who wants to look in the anon_vma or modify the anon_vma must take the anon_vma->lock. Then there are no problems with ordering, and no need for any explicit barriers. I don't understand your point about requiring ->lock to be initialised coming from find_mergeable_anon_vma. Why? All the critical memory operations get ordered inside the lock AFAIKS (I haven't actually applied Linus' patch to see what the result is, but that should be the case with my patch). So, no. I still think my patch is the right one. This btw. is modulo the final hunk of my last patch. That was supposed to address the different, perceived issue I saw with getting an old anon_vma. I still think that's pretty ugly even if correct. It would be nice just to add the single branch (or better, just move one down) to avoid having to think about it. I will submit a separate patch for that. Nick -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch] mm: fix anon_vma races 2008-10-19 2:41 ` Nick Piggin @ 2008-10-19 9:45 ` Hugh Dickins 2008-10-21 3:59 ` Nick Piggin 0 siblings, 1 reply; 52+ messages in thread From: Hugh Dickins @ 2008-10-19 9:45 UTC (permalink / raw) To: Nick Piggin; +Cc: Linus Torvalds, Peter Zijlstra, Linux Memory Management List I'm fairly lost by now in all this, suffering from barrier sickness, and we're not understanding each other very well. I'll have a try. On Sun, 19 Oct 2008, Nick Piggin wrote: > On Sun, Oct 19, 2008 at 02:13:06AM +0100, Hugh Dickins wrote: > > > > That leaves Nick's original point, of the three CPUs with the third > > doing reclaim, with my point about needing smp_read_barrier_depends() > > over there. I now think those races were illusory, that we were all > > overlooking something. Reclaim (or page migration) doesn't arrive at > > those pages by scanning the old mem_map[] array, it gets them off an > > LRU list, whose spinlock is locked and unlocked to take them off; and > > the original faulting CPU had to lock and unlock that spinlock to put > > them on the LRU originally, at a stage after its anon_vma_prepare(). > > > > Surely we have enough barriers there to make sure that anon_vma->lock > > is visibly initialized by the time page_lock_anon_vma() tries to take > > it? > > I don't think so. According to the race, the stores to anon_vma from > the first process would not arrive at the reclaimer in time. There > is no amount of barriers the reclaimer can perform to change this. No amount of barriers the reclaimer can do alone, the other end needs the complementary barriers. I'm suggesting the lock-unlock pair when page is put on LRU provides that (though I do recall that unlock-lock is stronger), in association with reclaimer's lock when it goes to take page from LRU. I don't grasp where I'm going wrong on that; but you're more interested in asserting that it's irrelevant by now anyway, and I'll probably be prepared to accept that. > But > anyway all that goes away if we just use locking properly as per my > last patch (see below). I don't see your last patch below, so presume you're referring to the previous one you posted (and your "modulo final hunk" remark bears that out). I probably gave it less attention than it deserved amidst all the other flurry of discussion. There is no dispute between us over locking the newly allocated anon_vma in anon_vma_prepare(): I think all three of us came to suggest that independently, and we all agree there are separate reasons why it's essential. It's what Linus did in the patch which I was commending but you're demurring from. I think you're saying that with just that change to anon_vma_prepare(), we've then no need for the smp_wmb() and smp_read_barrier_depends() he retained from his first version. I'm afraid my barrier sickness has reached that advanced stage in which I can no longer tell whether that's obviously true or obviously false: like those perspective cube outlines you can switch either way in your mind, I see it both ways and feel very very stupid. You'll have to settle that with Linus. > > Nick, are you happy with Linus's patch below? > > Or if not, please explain again what's missing - thanks. > > No, I don't agree with the need for barriers and I think Linus had also > not worked through the whole problem at this point because of the > just-in-case barriers and hoping to initialise the newly allocated lock > as locked, as an optimisation.
> > With my patch, the rules are simple: anybody who wants to look in the > anon_vma or modify the anon_vma must take the anon_vma->lock. Then there > are no problems with ordering, and no need for any explicit barriers. But we still have the case where one caller sails through anon_vma_prepare() seeing vma->anon_vma set, before the initialization of that struct anon_vma (and in particular the lock) is visible to it. I think you're saying, hell, we don't need separate steps to make a lock visible, that would be intolerable: locking the lock makes it visible. So all we have to do is not skip the locking of it in the newly allocated case. If Linus is persuaded, then so am I. > > I don't understand your point about requiring ->lock to be initialised > coming from find_mergeable_anon_vma. Why? All the critical memory operations > get ordered inside the lock AFAIKS (I haven't actually applied Linus' > patch to see what the result is, but that should be the case with my > patch). When find_mergeable_anon_vma returns the anon_vma from an adjacent vma which could be merged with this one, that's the one case (setting aside extend_stack) where this faulting CPU will want to access the content of a struct anon_vma which may have been initialized by another CPU - to lock it and add vma to its list - rather than just use its address. > > So, no. I still think my patch is the right one. > > > This btw. is modulo the final hunk of my last patch. That was supposed to > address the different, percieved issue I saw with getting an old anon_vma. > I still think that's pretty ugly even if correct. It would be nice just > to add the single branch (or better, just move one down) to avoid having > to think about it. I will submit a separate patch for that. Yes, I think we've now understood and agreed on that part. Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch] mm: fix anon_vma races 2008-10-19 9:45 ` Hugh Dickins @ 2008-10-21 3:59 ` Nick Piggin 0 siblings, 0 replies; 52+ messages in thread From: Nick Piggin @ 2008-10-21 3:59 UTC (permalink / raw) To: Hugh Dickins Cc: Nick Piggin, Linus Torvalds, Peter Zijlstra, Linux Memory Management List On Sunday 19 October 2008 20:45, Hugh Dickins wrote: > I'm fairly lost by now in all this, suffering from barrier sickness, > and we're not understanding each other very well. I'll have a try. Not sure if you're still worried about this I'll just try to answer anyway. Ignore it if you like ;) > On Sun, 19 Oct 2008, Nick Piggin wrote: > > On Sun, Oct 19, 2008 at 02:13:06AM +0100, Hugh Dickins wrote: > > > That leaves Nick's original point, of the three CPUs with the third > > > doing reclaim, with my point about needing smp_read_barrier_depends() > > > over there. I now think those races were illusory, that we were all > > > overlooking something. Reclaim (or page migration) doesn't arrive at > > > those pages by scanning the old mem_map[] array, it gets them off an > > > LRU list, whose spinlock is locked and unlocked to take them off; and > > > the original faulting CPU had to lock and unlock that spinlock to put > > > them on the LRU originally, at a stage after its anon_vma_prepare(). > > > > > > Surely we have enough barriers there to make sure that anon_vma->lock > > > is visibly initialized by the time page_lock_anon_vma() tries to take > > > it? > > > > I don't think so. According to the race, the stores to anon_vma from > > the first process would not arrive at the reclaimer in time. There > > is no amount of barriers the reclaimer can perform to change this. > > No amount of barriers the reclaimer can do alone, the other end needs > the complementary barriers. I'm suggesting the lock-unlock pair when > page is put on LRU provides that (though I do recall that unlock-lock > is stronger), in association with reclaimer's lock when when it goes > to take page from LRU. > > I don't grasp where I'm going wrong on that; but you're more interested > in asserting that it's irrelevant by now anyway, and I'll probably be > prepared to accept that. Well it's just that CPU1 is what puts the page on the LRU, but CPU0 is the one who's stores we need to order. CPU0 anon_vma->initialised = 1; vma->anon_vma = vma; CPU1 anon_vma = vma->anon_vma; page->anon_vma = anon_vma; spin_lock(lru_lock); list_add(page, lru); spin_unlock(lru_lock); CPU2 spin_lock(lru_lock); anon_vma = page->anon_vma; So CPU2 definitely would see page->anon_vma to be what CPU1 set it to. Locks provide that much ordering, of course. But it's CPU0's ordering which is what matters -- CPU2 could still see anon_vma->initialised == 0 So CPU0 needs smp_wmb() between those. > I think you're saying that with just that change to anon_vma_prepare(), > we've then no need for the smp_wmb() and smp_read_barrier_depends() he > retained from his first version. I'm afraid my barrier sickness has > reached that advanced stage in which I can no long tell whether that's > obviously true or obviously false: like those perspective cube outlines > you can switch either way in your mind, I see it both ways and feel > very very stupid. You'll have to settle that with Linus. I think you may have worked though it with Linus? However going back to my example above: if CPU0 were to hold anon_vma->lock around its assignments, and CPU2 were to take the lock before checking initialised, then the wmb() would not be required. 
It could still be the case that the store to vma->anon_vma becomes
visible first (and so CPU1 could assign the pointer to a page->mapping).
However, by the time anybody is allowed to look inside the anon_vma,
initialised must be 1 (the spin_unlock must not be visible until _all_
prior stores are).

> > > Nick, are you happy with Linus's patch below?
> > > Or if not, please explain again what's missing - thanks.
> >
> > No, I don't agree with the need for barriers, and I think Linus had
> > also not worked through the whole problem at this point, because of
> > the just-in-case barriers and hoping to initialise the newly allocated
> > lock as locked, as an optimisation.
> >
> > With my patch, the rules are simple: anybody who wants to look in the
> > anon_vma or modify the anon_vma must take the anon_vma->lock. Then
> > there are no problems with ordering, and no need for any explicit
> > barriers.
>
> But we still have the case where one caller sails through
> anon_vma_prepare() seeing vma->anon_vma set, before the initialization
> of that struct anon_vma (and in particular the lock) is visible to it.
>
> I think you're saying, hell, we don't need separate steps to make a
> lock visible, that would be intolerable: locking the lock makes it
> visible. So all we have to do is not skip the locking of it in the
> newly allocated case. If Linus is persuaded, then so am I.

The lock definitely gets initialised by CPU0, by the ctor. So it would
be wrong to allow the anon_vma to become visible and have CPU2 try to
lock a possibly uninitialised lock.

But lock barriers say that subsequent stores are not allowed to become
visible before the store to acquire the lock, and normal (obvious)
cache coherency rules say that the stores to initialise the lock must
not come after the store to lock it. So the store to vma->anon_vma
could not be visible before the lock is locked.

Inside the critical section, things can get jumbled around as usual,
but if you only ever care about those orderings from within the same
lock, everything is guaranteed to be visible.

> > I don't understand your point about requiring ->lock to be initialised
> > coming from find_mergeable_anon_vma. Why? All the critical memory
> > operations get ordered inside the lock AFAIKS (I haven't actually
> > applied Linus' patch to see what the result is, but that should be the
> > case with my patch).
>
> When find_mergeable_anon_vma returns the anon_vma from an adjacent vma
> which could be merged with this one, that's the one case (setting aside
> extend_stack) where this faulting CPU will want to access the content
> of a struct anon_vma which may have been initialized by another CPU -
> to lock it and add vma to its list - rather than just use its address.

OK. But the mergeable anon_vma checks don't actually look inside the
anon_vma AFAIKS. It will end up taking the lock (and hence having the
same memory ordering guarantees as CPU2).
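For contrast, the barrier-based fix from the original patch makes CPU0's
ordering explicit instead of hiding it behind the lock. A sketch in the
same style (the LRU steps are the ones from the CPU1/CPU2 example above;
"initialised" is still the stand-in field):

	/* CPU0: the publishing store must not pass the init stores */
	anon_vma->initialised = 1;	/* really the slab ctor's stores */
	smp_wmb();			/* order init before publish */
	vma->anon_vma = anon_vma;

	/* CPU1: the lru_lock orders CPU1's own stores, not CPU0's */
	anon_vma = vma->anon_vma;
	page->anon_vma = anon_vma;
	spin_lock(lru_lock);
	list_add(page, lru);
	spin_unlock(lru_lock);

	/*
	 * CPU2: the load of anon_vma's contents depends on the load of
	 * the pointer; pairing with the smp_wmb() needs
	 * smp_read_barrier_depends() (in practice only Alpha reorders
	 * dependent loads).
	 */
	spin_lock(lru_lock);
	anon_vma = page->anon_vma;
	spin_unlock(lru_lock);
	smp_read_barrier_depends();
	/* now anon_vma->initialised must read as 1 */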