* [patch] mm: fix race in COW logic
@ 2008-06-22 15:30 Nick Piggin
2008-06-22 17:11 ` Hugh Dickins
0 siblings, 1 reply; 13+ messages in thread
From: Nick Piggin @ 2008-06-22 15:30 UTC (permalink / raw)
To: Linux Memory Management List, Hugh Dickins, Andrew Morton,
Linus Torvalds
Hi,
Can someone please review my thinking here? (no this is not a fix for
Robin's recent issue, but something completely different)
Thanks,
Nick
There is a race in the COW logic. It contains a shortcut to avoid the
COW and reuse the page if we have the sole reference on the page; however, it
is possible to have two racing do_wp_page()ers, with one causing the other to
mistakenly believe it is safe to take the shortcut when it is not. This could
lead to data corruption.
Process 1 and process 2 each have a wp pte of the same anon page (ie. one
forked the other). The page's mapcount is 2. Then they both attempt to write
to it around the same time...
proc1                            proc2 thr1                       proc2 thr2
CPU0                             CPU1                             CPU3
do_wp_page()                     do_wp_page()
                                 trylock_page()
                                 can_share_swap_page()
                                  load page mapcount (==2)
                                  reuse = 0
                                 pte unlock
                                 copy page to new_page
                                 pte lock
                                 page_remove_rmap(page);
trylock_page()
can_share_swap_page()
 load page mapcount (==1)
 reuse = 1
 ptep_set_access_flags (allow W)

write private key into page
                                                                  read from page
                                 ptep_clear_flush()
                                 set_pte_at(pte of new_page)
Fix this by moving the page_remove_rmap of the old page after the pte clear
and flush. Potentially the entire branch could be moved down here, but in
order to stay consistent, I won't (should probably move all the *_mm_counter
stuff with one patch).
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -1766,7 +1766,6 @@ gotten:
 	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
 	if (likely(pte_same(*page_table, orig_pte))) {
 		if (old_page) {
-			page_remove_rmap(old_page, vma);
 			if (!PageAnon(old_page)) {
 				dec_mm_counter(mm, file_rss);
 				inc_mm_counter(mm, anon_rss);
@@ -1788,6 +1787,24 @@ gotten:
 		lru_cache_add_active(new_page);
 		page_add_new_anon_rmap(new_page, vma, address);
 
+		if (old_page) {
+			/*
+			 * Only after switching the pte to the new page may
+			 * we remove the mapcount here. Otherwise another
+			 * process may come and find the rmap count decremented
+			 * before the pte is switched to the new page, and
+			 * "reuse" the old page writing into it while our pte
+			 * here still points into it and can be read by other
+			 * threads.
+			 *
+			 * The ptep_clear_flush should be enough to prevent
+			 * any possible reordering making the old page visible
+			 * to other threads afterwards, so just executing
+			 * after it is fine.
+			 */
+			page_remove_rmap(old_page, vma);
+		}
+
 		/* Free the old page.. */
 		new_page = old_page;
 		ret |= VM_FAULT_WRITE;
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* Re: [patch] mm: fix race in COW logic
2008-06-22 15:30 [patch] mm: fix race in COW logic Nick Piggin
@ 2008-06-22 17:11 ` Hugh Dickins
2008-06-22 17:35 ` Linus Torvalds
0 siblings, 1 reply; 13+ messages in thread
From: Hugh Dickins @ 2008-06-22 17:11 UTC (permalink / raw)
To: Nick Piggin; +Cc: Linux Memory Management List, Andrew Morton, Linus Torvalds
On Sun, 22 Jun 2008, Nick Piggin wrote:
>
> Can someone please review my thinking here? (no this is not a fix for
> Robin's recent issue, but something completely different)
You have a wicked mind, and I think you're right, and the fix right.
One thing though, in moving the page_remove_rmap in that way, aren't
you assuming that there's an appropriate wmbarrier between the two
locations? If that is necessarily so (there's plenty happening in
between), it may deserve a comment to say just where that barrier is.
Hugh
>
> Thanks,
> Nick
>
>
> There is a race in the COW logic. It contains a shortcut to avoid the
> COW and reuse the page if we have the sole reference on the page, however it
> is possible to have two racing do_wp_page()ers with one causing the other to
> mistakenly believe it is safe to take the shortcut when it is not. This could
> lead to data corruption.
>
> Process 1 and process2 each have a wp pte of the same anon page (ie. one
> forked the other). The page's mapcount is 2. Then they both attempt to write
> to it around the same time...
>
> proc1                            proc2 thr1                       proc2 thr2
> CPU0                             CPU1                             CPU3
> do_wp_page()                     do_wp_page()
>                                  trylock_page()
>                                  can_share_swap_page()
>                                   load page mapcount (==2)
>                                   reuse = 0
>                                  pte unlock
>                                  copy page to new_page
>                                  pte lock
>                                  page_remove_rmap(page);
> trylock_page()
> can_share_swap_page()
>  load page mapcount (==1)
>  reuse = 1
>  ptep_set_access_flags (allow W)
>
> write private key into page
>                                                                   read from page
>                                  ptep_clear_flush()
>                                  set_pte_at(pte of new_page)
>
>
> Fix this by moving the page_remove_rmap of the old page after the pte clear
> and flush. Potentially the entire branch could be moved down here, but in
> order to stay consistent, I won't (should probably move all the *_mm_counter
> stuff with one patch).
>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
> ---
> Index: linux-2.6/mm/memory.c
> ===================================================================
> --- linux-2.6.orig/mm/memory.c
> +++ linux-2.6/mm/memory.c
> @@ -1766,7 +1766,6 @@ gotten:
>  	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
>  	if (likely(pte_same(*page_table, orig_pte))) {
>  		if (old_page) {
> -			page_remove_rmap(old_page, vma);
>  			if (!PageAnon(old_page)) {
>  				dec_mm_counter(mm, file_rss);
>  				inc_mm_counter(mm, anon_rss);
> @@ -1788,6 +1787,24 @@ gotten:
>  		lru_cache_add_active(new_page);
>  		page_add_new_anon_rmap(new_page, vma, address);
>
> +		if (old_page) {
> +			/*
> +			 * Only after switching the pte to the new page may
> +			 * we remove the mapcount here. Otherwise another
> +			 * process may come and find the rmap count decremented
> +			 * before the pte is switched to the new page, and
> +			 * "reuse" the old page writing into it while our pte
> +			 * here still points into it and can be read by other
> +			 * threads.
> +			 *
> +			 * The ptep_clear_flush should be enough to prevent
> +			 * any possible reordering making the old page visible
> +			 * to other threads afterwards, so just executing
> +			 * after it is fine.
> +			 */
> +			page_remove_rmap(old_page, vma);
> +		}
> +
>  		/* Free the old page.. */
>  		new_page = old_page;
>  		ret |= VM_FAULT_WRITE;
* Re: [patch] mm: fix race in COW logic
2008-06-22 17:11 ` Hugh Dickins
@ 2008-06-22 17:35 ` Linus Torvalds
2008-06-22 18:10 ` Hugh Dickins
2008-06-23 1:52 ` Nick Piggin
0 siblings, 2 replies; 13+ messages in thread
From: Linus Torvalds @ 2008-06-22 17:35 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Nick Piggin, Linux Memory Management List, Andrew Morton
On Sun, 22 Jun 2008, Hugh Dickins wrote:
>
> You have a wicked mind, and I think you're right, and the fix right.
Agreed. I think the patch is fine, although I'd personally probably like
it even more if the mm counter updates followed the rmap updates.
> One thing though, in moving the page_remove_rmap in that way, aren't
> you assuming that there's an appropriate wmbarrier between the two
> locations? If that is necessarily so (there's plenty happening in
> between), it may deserve a comment to say just where that barrier is.
In this case, I don't think memory ordering matters.
What matters is that the map count never goes down to one - and by
re-ordering the inc/dec accesses, that simply won't happen. IOW, memory
ordering is immaterial, only the ordering of count updates (from the
standpoint of the faulting CPU - so that's not even an SMP issue) matters.
Linus
* Re: [patch] mm: fix race in COW logic
2008-06-22 17:35 ` Linus Torvalds
@ 2008-06-22 18:10 ` Hugh Dickins
2008-06-22 18:18 ` Linus Torvalds
2008-06-23 1:49 ` Nick Piggin
2008-06-23 1:52 ` Nick Piggin
1 sibling, 2 replies; 13+ messages in thread
From: Hugh Dickins @ 2008-06-22 18:10 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Nick Piggin, Linux Memory Management List, Andrew Morton
On Sun, 22 Jun 2008, Linus Torvalds wrote:
> On Sun, 22 Jun 2008, Hugh Dickins wrote:
>
> > One thing though, in moving the page_remove_rmap in that way, aren't
> > you assuming that there's an appropriate wmbarrier between the two
> > locations? If that is necessarily so (there's plenty happening in
> > between), it may deserve a comment to say just where that barrier is.
>
> In this case, I don't think memory ordering matters.
>
> What matters is that the map count never goes down to one - and by
> re-ordering the inc/dec accesses, that simply won't happen. IOW, memory
> ordering is immaterial, only the ordering of count updates (from the
> standpoint of the faulting CPU - so that's not even an SMP issue) matters.
I'm puzzled. The page_remove_rmap has moved to the other side of the
page_add_new_anon_rmap, but they are operating on different pages.
It's true that the total of their mapcounts doesn't go down to one
in the critical area, but that total isn't computed anywhere.
After asking, I thought the answer was going to be that page_remove_rmap
uses atomic_add_negative, and atomic ops which return a value do
themselves provide sufficient barrier. I'm wondering if that's so
obvious that you've generously sought out a different meaning to my query.
Hugh
* Re: [patch] mm: fix race in COW logic
2008-06-22 18:10 ` Hugh Dickins
@ 2008-06-22 18:18 ` Linus Torvalds
2008-06-23 1:49 ` Nick Piggin
1 sibling, 0 replies; 13+ messages in thread
From: Linus Torvalds @ 2008-06-22 18:18 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Nick Piggin, Linux Memory Management List, Andrew Morton
On Sun, 22 Jun 2008, Hugh Dickins wrote:
>
> I'm puzzled.
No, you're right. It was me who was confused. The pages are different,
duh, and yes, there's an SMP memory ordering issue between updating the
new page table entry and decrementing the use count for the old page.
My bad. Ignore me.
Linus
* Re: [patch] mm: fix race in COW logic
2008-06-22 18:10 ` Hugh Dickins
2008-06-22 18:18 ` Linus Torvalds
@ 2008-06-23 1:49 ` Nick Piggin
2008-06-23 10:04 ` Hugh Dickins
1 sibling, 1 reply; 13+ messages in thread
From: Nick Piggin @ 2008-06-23 1:49 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Linus Torvalds, Linux Memory Management List, Andrew Morton
On Sun, Jun 22, 2008 at 07:10:41PM +0100, Hugh Dickins wrote:
> On Sun, 22 Jun 2008, Linus Torvalds wrote:
> > On Sun, 22 Jun 2008, Hugh Dickins wrote:
> >
> > > One thing though, in moving the page_remove_rmap in that way, aren't
> > > you assuming that there's an appropriate wmbarrier between the two
> > > locations? If that is necessarily so (there's plenty happening in
> > > between), it may deserve a comment to say just where that barrier is.
> >
> > In this case, I don't think memory ordering matters.
> >
> > What matters is that the map count never goes down to one - and by
> > re-ordering the inc/dec accesses, that simply won't happen. IOW, memory
> > ordering is immaterial, only the ordering of count updates (from the
> > standpoint of the faulting CPU - so that's not even an SMP issue) matters.
>
> I'm puzzled. The page_remove_rmap has moved to the other side of the
> page_add_new_anon_rmap, but they are operating on different pages.
> It's true that the total of their mapcounts doesn't go down to one
> in the critical area, but that total isn't computed anywhere.
>
> After asking, I thought the answer was going to be that page_remove_rmap
> uses atomic_add_negative, and atomic ops which return a value do
> themselves provide sufficient barrier. I'm wondering if that's so
> obvious that you've generously sought out a different meaning to my query.
I was initially thinking an smp_wmb might have been in order (excuse the pun),
but then I rethought it and added the 2nd paragraph to my comment. But I may
still have been wrong. Let's ignore the barriers implicit in the rmap
functions for now, and if we find they are required we can add a nice
/* smp_wmb() for ..., provided by ...! */
Now. The critical memory operations AFAIKS are:
dec page->mapcount
load page->mapcount (== 1)
store pte (RW)
store via pte
load via pte
ptep_clear_flush
store new pte
Note that I don't believe the page_add_new_anon_rmap is part of the critical
ordering. Unless that is for a different issue?
Now if we move the decrement of page->mapcount to below the ptep_clear_flush,
then our TLB shootdown protocol *should* guarantee that nothing may load via
that pte after page->mapcount has been decremented, right?
Now we only have pairwise barrier semantics, so if the leftmost process is
not part of the TLB shootdown (which it is not), then it is possible that
it may see the store to decrement the mapcount before the store to clear the
pte. Maybe. I was hoping causality would not allow a subsequent store through
the pte to be seen by the rightmost guy before the TLB flush. But maybe I
was wrong? (at any rate, page_remove_rmap gives us smp_wmb if required, so
the code is not technically wrong)
* Re: [patch] mm: fix race in COW logic
2008-06-22 17:35 ` Linus Torvalds
2008-06-22 18:10 ` Hugh Dickins
@ 2008-06-23 1:52 ` Nick Piggin
1 sibling, 0 replies; 13+ messages in thread
From: Nick Piggin @ 2008-06-23 1:52 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Hugh Dickins, Linux Memory Management List, Andrew Morton
On Sun, Jun 22, 2008 at 10:35:50AM -0700, Linus Torvalds wrote:
>
>
> On Sun, 22 Jun 2008, Hugh Dickins wrote:
> >
> > You have a wicked mind, and I think you're right, and the fix right.
>
> Agreed. I think the patch is fine, although I'd personally probably like
> it even more if the mm counter updates followed the rmap updates.
I would definitely have done that except it would want updating everywhere
else too, making the patch a bugfix+more (and mixing subtle issues that
may very well need bisecting some day!).
As you see in the changelog, I would prefer that too. I can make a followup
to move all the counter updates afterward, and eliminate this branch.
* Re: [patch] mm: fix race in COW logic
2008-06-23 1:49 ` Nick Piggin
@ 2008-06-23 10:04 ` Hugh Dickins
2008-06-23 12:18 ` Nick Piggin
0 siblings, 1 reply; 13+ messages in thread
From: Hugh Dickins @ 2008-06-23 10:04 UTC (permalink / raw)
To: Nick Piggin; +Cc: Linus Torvalds, Linux Memory Management List, Andrew Morton
On Mon, 23 Jun 2008, Nick Piggin wrote:
> On Sun, Jun 22, 2008 at 07:10:41PM +0100, Hugh Dickins wrote:
> > On Sun, 22 Jun 2008, Linus Torvalds wrote:
> > > On Sun, 22 Jun 2008, Hugh Dickins wrote:
> > >
> > > > One thing though, in moving the page_remove_rmap in that way, aren't
> > > > you assuming that there's an appropriate wmbarrier between the two
> > > > locations? If that is necessarily so (there's plenty happening in
> > > > between), it may deserve a comment to say just where that barrier is.
...
> I was initially thinking an smp_wmb might have been in order (excuse the pun),
> but then I rethought it and added the 2d paragraph to my comment.
First off, I've a bewildering and shameful confession and apology
to make: I made that suggestion that you add a comment without even
reading through the comment you had made! Ugh (at myself)! Sorry
for that. I've a suspicion that the longer a comment is, the more
likely my eye will skip over it...
> But I may
> still have been wrong. Let's ignore the barriers implicit in the rmap
> functions for now, and if we find they are required we can add a nice
> /* smp_wmb() for ..., provided by ...! */
Well, okay, but we're discussing a minuscule likelihood on top of the
minuscule likelihood of the race you astutely identified. So I don't
want to waste anyone's time with an academic exercise; but if any of
us learn something from it, then it may be worth it.
>
> Now. The critical memory operations AFAIKS are:
>
> dec page->mapcount
> load page->mapcount (== 1)
> store pte (RW)
> store via pte
> load via pte
> ptep_clear_flush
> store new pte
>
> Note that I don't believe the page_add_new_anon_rmap is part of the critical
> ordering. Unless that is for a different issue?
Agreed, the page_add_new_anon_rmap is irrelevant to this, it only got
mentioned when I was trying to make sense of the mail Linus retracted.
>
> Now if we move the decrement of page->mapcount to below the ptep_clear_flush,
> then our TLB shootdown protocol *should* guarantee that nothing may load via
> that pte after page->mapcount has been decremented, right?
Via that old pte, yes. But page->_mapcount is not held at a
virtual address affected by the TLB shootdown, so I don't see how the
ptep_clear_flush would give a guarantee on the visibility of the mapcount
change. Besides which, the shootdown hits the right hand CPU not the left.
I think it likely that any implementation of ptep_clear_flush would
include instructions which give that guarantee (particularly since
it has to do something inter-CPU to handle the CPU on the right);
but I don't see how ptep_clear_flush gives that guarantee itself.
>
> Now we only have pairwise barrier semantics, so if the leftmost process is
> not part of the TLB shootdown (which it is not), then it is possible that
> it may see the store to decrement the mapcount before the store to clear the
> pte. Maybe. I was hoping causality would not allow a subsequent store through
> the pte to be seen by the rightmost guy before the TLB flush. But maybe I
> was wrong?
If you're amidst the maybes, imagine the darkness I'm in!
And I'm not adding much to the argument with that remark,
so please don't feel obliged to respond.
> (at any rate, page_remove_rmap gives us smp_wmb if required, so
> the code is not technically wrong)
Originally I came at the question because it seemed to me that if
moving the page_remove_rmap down was to be fully effective, it needed
to move through a suitable barrier; it hadn't occurred to me that it
was carrying the suitable barrier with it. But if that is indeed
correct, I think it would be better to rely upon that, than resort
to more difficult arguments.
I would love to find a sure-footed way to think about memory barriers,
but I don't think I'll ever get the knack; particularly when even you
and Paul and Linus can sometimes be caught in doubt.
Hugh
* Re: [patch] mm: fix race in COW logic
2008-06-23 10:04 ` Hugh Dickins
@ 2008-06-23 12:18 ` Nick Piggin
2008-06-23 12:30 ` Nick Piggin
2008-06-27 9:13 ` Peter Zijlstra
0 siblings, 2 replies; 13+ messages in thread
From: Nick Piggin @ 2008-06-23 12:18 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Linus Torvalds, Linux Memory Management List, Andrew Morton
On Mon, Jun 23, 2008 at 11:04:31AM +0100, Hugh Dickins wrote:
> On Mon, 23 Jun 2008, Nick Piggin wrote:
> > On Sun, Jun 22, 2008 at 07:10:41PM +0100, Hugh Dickins wrote:
> > > On Sun, 22 Jun 2008, Linus Torvalds wrote:
> > > > On Sun, 22 Jun 2008, Hugh Dickins wrote:
> > > >
> > > > > One thing though, in moving the page_remove_rmap in that way, aren't
> > > > > you assuming that there's an appropriate wmbarrier between the two
> > > > > locations? If that is necessarily so (there's plenty happening in
> > > > > between), it may deserve a comment to say just where that barrier is.
> ...
> > I was initially thinking an smp_wmb might have been in order (excuse the pun),
> > but then I rethought it and added the 2d paragraph to my comment.
>
> First off, I've a bewildering and shameful confession and apology
> to make: I made that suggestion that you add a comment without even
> reading through the comment you had made! Ugh (at myself)! Sorry
> for that. I've a suspicion that the longer a comment is, the more
> likely my eye will skip over it...
No need for that :) You miss less than most, I think...
> > But I may
> > still have been wrong. Let's ignore the barriers implicit in the rmap
> > functions for now, and if we find they are required we can add a nice
> > /* smp_wmb() for ..., provided by ...! */
>
> Well, okay, but we're discussing a minuscule likelihood on top of the
> minuscule likelihood of the race you astutely identified. So I don't
> want to waste anyone's time with an academic exercise; but if any of
> us learn something from it, then it may be worth it.
No, I don't think it's a waste of time at all. Although we've already
established the atomic operation in the rmap operation should be
enough, understanding and properly commenting this will I think help
maintain this and other code too.
BTW, yes, the race identified is pretty slim, requiring a lot to happen
on other CPUs within a small non-preemptible window of instructions. Still,
it *might* actually happen in practice on larger systems where cachelines
might be highly contended or take a long time to get from one place to
another... and on the -rt patchset, where I assume the ptl is preemptible too.
> > Now. The critical memory operations AFAIKS are:
> >
> > dec page->mapcount
> > load page->mapcount (== 1)
> > store pte (RW)
> > store via pte
> > load via pte
> > ptep_clear_flush
> > store new pte
> >
> > Note that I don't believe the page_add_new_anon_rmap is part of the critical
> > ordering. Unless that is for a different issue?
>
> Agreed, the page_add_new_anon_rmap is irrelevant to this, it only got
> mentioned when I was trying to make sense of the mail Linus retracted.
OK, good.
> > Now if we move the decrement of page->mapcount to below the ptep_clear_flush,
> > then our TLB shootdown protocol *should* guarantee that nothing may load via
> > that pte after page->mapcount has been decremented, right?
>
> Via that old pte, yes. But page->_mapcount is not held at a
> virtual address affected by the TLB shootdown, so I don't see how the
> ptep_clear_flush would give a guarantee on the visibility of the mapcount
> change. Besides which, the shootdown hits the right hand CPU not the left.
Right. Yes, I am just assuming that the TLB shootdown guarantees this,
and it may even be true for all implementations, but you're probably
right that it is wrong to rely on it...
> I think it likely that any implementation of ptep_clear_flush would
> include instructions which give that guarantee (particularly since
> it has to do something inter-CPU to handle the CPU on the right);
> but I don't see how ptep_clear_flush gives that guarantee itself.
>
> >
> > Now we only have pairwise barrier semantics, so if the leftmost process is
> > not part of the TLB shootdown (which it is not), then it is possible that
> > it may see the store to decrement the mapcount before the store to clear the
> > pte. Maybe. I was hoping causality would not allow a subsequent store through
> > the pte to be seen by the rightmost guy before the TLB flush. But maybe I
> > was wrong?
>
> If you're amidst the maybes, imagine the darkness I'm in!
> And I'm not adding much to the argument with that remark,
> so please don't feel obliged to respond.
>
> > (at any rate, page_remove_rmap gives us smp_wmb if required, so
> > the code is not technically wrong)
>
> Originally I came at the question because it seemed to me that if
> moving the page_remove_rmap down was to be fully effective, it needed
> to move through a suitable barrier; it hadn't occurred to me that it
> was carrying the suitable barrier with it. But if that is indeed
> correct, I think it would be better to rely upon that, than resort
> to more difficult arguments.
No, I actually think you make a good point, and I'll resubmit the
patch with a replacement comment to say we've got the ordering
covered, if nothing else, by the atomic op in rmap.
> I would love to find a sure-footed way to think about memory barriers,
> but I don't think I'll ever get the knack; particularly when even you
> and Paul and Linus can sometimes be caught in doubt.
Actually it can get pretty hard if you try to handwave about some
abstract higher level operation like "TLB flushing" (like I was).
It's a good idea to first distil out all the critical load and
store instructions before trying to reason about ordering issues.
* Re: [patch] mm: fix race in COW logic
2008-06-23 12:18 ` Nick Piggin
@ 2008-06-23 12:30 ` Nick Piggin
2008-06-23 15:39 ` Hugh Dickins
2008-06-27 9:19 ` Peter Zijlstra
2008-06-27 9:13 ` Peter Zijlstra
1 sibling, 2 replies; 13+ messages in thread
From: Nick Piggin @ 2008-06-23 12:30 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Linus Torvalds, Linux Memory Management List, Andrew Morton
On Mon, Jun 23, 2008 at 02:18:31PM +0200, Nick Piggin wrote:
> On Mon, Jun 23, 2008 at 11:04:31AM +0100, Hugh Dickins wrote:
> > moving the page_remove_rmap down was to be fully effective, it needed
> > to move through a suitable barrier; it hadn't occurred to me that it
> > was carrying the suitable barrier with it. But if that is indeed
> > correct, I think it would be better to rely upon that, than resort
> > to more difficult arguments.
>
> No I actually think you make a good point, and I'll resubmit the
> patch with a replacement comment to say we've got the ordering
> covered if nothing else then by the atomic op in rmap.
OK, this is a new comment. I don't actually know if it is any good.
It is hard to be coherent if you write these things in English.
Maybe it is best to illustrate with the interleaving diagram in the
changelog?
--
There is a race in the COW logic. It contains a shortcut to avoid the
COW and reuse the page if we have the sole reference on the page; however, it
is possible to have two racing do_wp_page()ers, with one causing the other to
mistakenly believe it is safe to take the shortcut when it is not. This could
lead to data corruption.
Process 1 and process 2 each have a wp pte of the same anon page (ie. one
forked the other). The page's mapcount is 2. Then they both attempt to write
to it around the same time...
proc1                            proc2 thr1                       proc2 thr2
CPU0                             CPU1                             CPU3
do_wp_page()                     do_wp_page()
                                 trylock_page()
                                 can_share_swap_page()
                                  load page mapcount (==2)
                                  reuse = 0
                                 pte unlock
                                 copy page to new_page
                                 pte lock
                                 page_remove_rmap(page);
trylock_page()
can_share_swap_page()
 load page mapcount (==1)
 reuse = 1
 ptep_set_access_flags (allow W)

write private key into page
                                                                  read from page
                                 ptep_clear_flush()
                                 set_pte_at(pte of new_page)
Fix this by moving the page_remove_rmap of the old page after the pte clear
and flush. Potentially the entire branch could be moved down here, but in
order to stay consistent, I won't (should probably move all the *_mm_counter
stuff with one patch).
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -1766,7 +1766,6 @@ gotten:
 	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
 	if (likely(pte_same(*page_table, orig_pte))) {
 		if (old_page) {
-			page_remove_rmap(old_page, vma);
 			if (!PageAnon(old_page)) {
 				dec_mm_counter(mm, file_rss);
 				inc_mm_counter(mm, anon_rss);
@@ -1788,6 +1787,32 @@ gotten:
 		lru_cache_add_active(new_page);
 		page_add_new_anon_rmap(new_page, vma, address);
 
+		if (old_page) {
+			/*
+			 * Only after switching the pte to the new page may
+			 * we remove the mapcount here. Otherwise another
+			 * process may come and find the rmap count decremented
+			 * before the pte is switched to the new page, and
+			 * "reuse" the old page writing into it while our pte
+			 * here still points into it and can be read by other
+			 * threads.
+			 *
+			 * The critical issue is to order this
+			 * page_remove_rmap with the ptep_clear_flush above.
+			 * Those stores are ordered, if nothing else, by
+			 * the barrier present in the atomic_add_negative
+			 * in page_remove_rmap.
+			 *
+			 * Then the TLB flush in ptep_clear_flush ensures that
+			 * no process can access the old page before the
+			 * decremented mapcount is visible. And the old page
+			 * cannot be reused until after the decremented
+			 * mapcount is visible. So transitively, TLBs to
+			 * old page will be flushed before it can be reused.
+			 */
+			page_remove_rmap(old_page, vma);
+		}
+
 		/* Free the old page.. */
 		new_page = old_page;
 		ret |= VM_FAULT_WRITE;
* Re: [patch] mm: fix race in COW logic
2008-06-23 12:30 ` Nick Piggin
@ 2008-06-23 15:39 ` Hugh Dickins
2008-06-27 9:19 ` Peter Zijlstra
1 sibling, 0 replies; 13+ messages in thread
From: Hugh Dickins @ 2008-06-23 15:39 UTC (permalink / raw)
To: Nick Piggin; +Cc: Linus Torvalds, Linux Memory Management List, Andrew Morton
On Mon, 23 Jun 2008, Nick Piggin wrote:
> On Mon, Jun 23, 2008 at 02:18:31PM +0200, Nick Piggin wrote:
> > On Mon, Jun 23, 2008 at 11:04:31AM +0100, Hugh Dickins wrote:
> > > moving the page_remove_rmap down was to be fully effective, it needed
> > > to move through a suitable barrier; it hadn't occurred to me that it
> > > was carrying the suitable barrier with it. But if that is indeed
> > > correct, I think it would be better to rely upon that, than resort
> > > to more difficult arguments.
> >
> > No I actually think you make a good point, and I'll resubmit the
> > patch with a replacement comment to say we've got the ordering
> > covered if nothing else then by the atomic op in rmap.
>
> OK, this is a new comment. I don't actually know if it is any good.
> It is hard to be coherent if you write these things in English.
Thanks for this.
Very hard, and pretty good, I think. I wouldn't want to swear to the
rightness of every sentence (and there's a few slight typos that don't
bother me). And although you're right to say it could lead to data
corruption (because something is read which should never have been
seen, and gets wrongly incorporated into the data), it's easier to
understand if you concentrate on the "read private key" aspect.
> Maybe it is best to illustrate with the interleaving diagram in the
> changelog?
Oh, no, I think just leave that to the changelog. The longer that
comment gets, the more it distracts from the rest of the function:
if it weren't for trying to avoid multiple "if (old_page)"s, we
might very well have positioned the page_remove_rmap there in the
first place, with no comment whatsoever.
Linus was inclined to move the dec/inc mm_counters too, but you
perceive some reason to leave them in place: I've not followed
that up. So far as I'm concerned...
Acked-by: Hugh Dickins <hugh@veritas.com>
>
> --
> There is a race in the COW logic. It contains a shortcut to avoid the
> COW and reuse the page if we have the sole reference on the page; however,
> it is possible to have two racing do_wp_page()ers, with one causing the
> other to mistakenly believe it is safe to take the shortcut when it is
> not. This could lead to data corruption.
>
> Process 1 and process 2 each have a wp pte of the same anon page (i.e. one
> forked the other). The page's mapcount is 2. Then they both attempt to write
> to it around the same time...
>
> proc1                          proc2 thr1                     proc2 thr2
> CPU0                           CPU1                           CPU3
> do_wp_page()                   do_wp_page()
>                                 trylock_page()
>                                 can_share_swap_page()
>                                  load page mapcount (==2)
>                                  reuse = 0
>                                 pte unlock
>                                 copy page to new_page
>                                 pte lock
>                                 page_remove_rmap(page);
> trylock_page()
> can_share_swap_page()
>  load page mapcount (==1)
>  reuse = 1
> ptep_set_access_flags (allow W)
>
> write private key into page
>                                                               read from page
>                                 ptep_clear_flush()
>                                 set_pte_at(pte of new_page)
>
>
> Fix this by moving the page_remove_rmap of the old page after the pte clear
> and flush. Potentially the entire branch could be moved down here, but in
> order to stay consistent, I won't (should probably move all the *_mm_counter
> stuff with one patch).
>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
> ---
> Index: linux-2.6/mm/memory.c
> ===================================================================
> --- linux-2.6.orig/mm/memory.c
> +++ linux-2.6/mm/memory.c
> @@ -1766,7 +1766,6 @@ gotten:
> page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
> if (likely(pte_same(*page_table, orig_pte))) {
> if (old_page) {
> - page_remove_rmap(old_page, vma);
> if (!PageAnon(old_page)) {
> dec_mm_counter(mm, file_rss);
> inc_mm_counter(mm, anon_rss);
> @@ -1788,6 +1787,32 @@ gotten:
> lru_cache_add_active(new_page);
> page_add_new_anon_rmap(new_page, vma, address);
>
> + if (old_page) {
> + /*
> + * Only after switching the pte to the new page may
> + * we remove the mapcount here. Otherwise another
> + * process may come and find the rmap count decremented
> + * before the pte is switched to the new page, and
> + * "reuse" the old page writing into it while our pte
> + * here still points into it and can be read by other
> + * threads.
> + *
> + * The critical issue is to order this
> + * page_remove_rmap with the ptep_clear_flush above.
> + * Those stores are ordered by (if nothing else,)
> + * the barrier present in the atomic_add_negative
> + * in page_remove_rmap.
> + *
> + * Then the TLB flush in ptep_clear_flush ensures that
> + * no process can access the old page before the
> + * decremented mapcount is visible. And the old page
> + * cannot be reused until after the decremented
> + * mapcount is visible. So transitively, TLBs to
> + * old page will be flushed before it can be reused.
> + */
> + page_remove_rmap(old_page, vma);
> + }
> +
> /* Free the old page.. */
> new_page = old_page;
> ret |= VM_FAULT_WRITE;
* Re: [patch] mm: fix race in COW logic
2008-06-23 12:18 ` Nick Piggin
2008-06-23 12:30 ` Nick Piggin
@ 2008-06-27 9:13 ` Peter Zijlstra
1 sibling, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2008-06-27 9:13 UTC (permalink / raw)
To: Nick Piggin
Cc: Hugh Dickins, Linus Torvalds, Linux Memory Management List,
Andrew Morton
On Mon, 2008-06-23 at 14:18 +0200, Nick Piggin wrote:
> BTW. Yes, the race identified is pretty slim, requiring a lot to happen
> on other CPUs within a small non-preemptible window of instructions, but
> it *might* actually happen in practice on larger systems where cachelines
> might be highly contended or take a long time to get from one place to
> another...
> and the -rt patchset where I assume ptl is preemptible too.
Indeed.
* Re: [patch] mm: fix race in COW logic
2008-06-23 12:30 ` Nick Piggin
2008-06-23 15:39 ` Hugh Dickins
@ 2008-06-27 9:19 ` Peter Zijlstra
1 sibling, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2008-06-27 9:19 UTC (permalink / raw)
To: Nick Piggin
Cc: Hugh Dickins, Linus Torvalds, Linux Memory Management List,
Andrew Morton
On Mon, 2008-06-23 at 14:30 +0200, Nick Piggin wrote:
> On Mon, Jun 23, 2008 at 02:18:31PM +0200, Nick Piggin wrote:
> > On Mon, Jun 23, 2008 at 11:04:31AM +0100, Hugh Dickins wrote:
> > > moving the page_remove_rmap down was to be fully effective, it needed
> > > to move through a suitable barrier; it hadn't occurred to me that it
> > > was carrying the suitable barrier with it. But if that is indeed
> > > correct, I think it would be better to rely upon that, than resort
> > > to more difficult arguments.
> >
> > No, I actually think you make a good point, and I'll resubmit the
> > patch with a replacement comment to say we've got the ordering
> > covered, if nothing else, by the atomic op in rmap.
>
> OK, this is a new comment. I don't actually know if it is any good.
> It is hard to be coherent if you write these things in English.
> Maybe it is best to illustrate with the interleaving diagram in the
> changelog?
>
> --
> There is a race in the COW logic. It contains a shortcut to avoid the
> COW and reuse the page if we have the sole reference on the page; however,
> it is possible to have two racing do_wp_page()ers, with one causing the
> other to mistakenly believe it is safe to take the shortcut when it is
> not. This could lead to data corruption.
>
> Process 1 and process 2 each have a wp pte of the same anon page (i.e. one
> forked the other). The page's mapcount is 2. Then they both attempt to write
> to it around the same time...
>
> proc1                          proc2 thr1                     proc2 thr2
> CPU0                           CPU1                           CPU3
> do_wp_page()                   do_wp_page()
>                                 trylock_page()
>                                 can_share_swap_page()
>                                  load page mapcount (==2)
>                                  reuse = 0
>                                 pte unlock
>                                 copy page to new_page
>                                 pte lock
>                                 page_remove_rmap(page);
> trylock_page()
> can_share_swap_page()
>  load page mapcount (==1)
>  reuse = 1
> ptep_set_access_flags (allow W)
>
> write private key into page
>                                                               read from page
>                                 ptep_clear_flush()
>                                 set_pte_at(pte of new_page)
>
>
> Fix this by moving the page_remove_rmap of the old page after the pte clear
> and flush. Potentially the entire branch could be moved down here, but in
> order to stay consistent, I won't (should probably move all the *_mm_counter
> stuff with one patch).
Since I bothered to read all the way through this thread, I might as
well provide an ack,..
Acked-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
> ---
> Index: linux-2.6/mm/memory.c
> ===================================================================
> --- linux-2.6.orig/mm/memory.c
> +++ linux-2.6/mm/memory.c
> @@ -1766,7 +1766,6 @@ gotten:
> page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
> if (likely(pte_same(*page_table, orig_pte))) {
> if (old_page) {
> - page_remove_rmap(old_page, vma);
> if (!PageAnon(old_page)) {
> dec_mm_counter(mm, file_rss);
> inc_mm_counter(mm, anon_rss);
> @@ -1788,6 +1787,32 @@ gotten:
> lru_cache_add_active(new_page);
> page_add_new_anon_rmap(new_page, vma, address);
>
> + if (old_page) {
> + /*
> + * Only after switching the pte to the new page may
> + * we remove the mapcount here. Otherwise another
> + * process may come and find the rmap count decremented
> + * before the pte is switched to the new page, and
> + * "reuse" the old page writing into it while our pte
> + * here still points into it and can be read by other
> + * threads.
> + *
> + * The critical issue is to order this
> + * page_remove_rmap with the ptep_clear_flush above.
> + * Those stores are ordered by (if nothing else,)
> + * the barrier present in the atomic_add_negative
> + * in page_remove_rmap.
> + *
> + * Then the TLB flush in ptep_clear_flush ensures that
> + * no process can access the old page before the
> + * decremented mapcount is visible. And the old page
> + * cannot be reused until after the decremented
> + * mapcount is visible. So transitively, TLBs to
> + * old page will be flushed before it can be reused.
> + */
> + page_remove_rmap(old_page, vma);
> + }
> +
> /* Free the old page.. */
> new_page = old_page;
> ret |= VM_FAULT_WRITE;
>
Thread overview: 13+ messages
2008-06-22 15:30 [patch] mm: fix race in COW logic Nick Piggin
2008-06-22 17:11 ` Hugh Dickins
2008-06-22 17:35 ` Linus Torvalds
2008-06-22 18:10 ` Hugh Dickins
2008-06-22 18:18 ` Linus Torvalds
2008-06-23 1:49 ` Nick Piggin
2008-06-23 10:04 ` Hugh Dickins
2008-06-23 12:18 ` Nick Piggin
2008-06-23 12:30 ` Nick Piggin
2008-06-23 15:39 ` Hugh Dickins
2008-06-27 9:19 ` Peter Zijlstra
2008-06-27 9:13 ` Peter Zijlstra
2008-06-23 1:52 ` Nick Piggin