* [rfc][patch] mm: dirty page accounting race fix
@ 2008-08-14 9:45 Nick Piggin
2008-08-14 10:27 ` Peter Zijlstra
2008-08-14 11:55 ` Hugh Dickins
0 siblings, 2 replies; 12+ messages in thread
From: Nick Piggin @ 2008-08-14 9:45 UTC (permalink / raw)
To: Linux Memory Management List, Hugh Dickins, Peter Zijlstra,
Linux Kernel Mailing List
Hi,
Wonder if you would be kind enough to verify this? It solves my issues (well
still running after several hours whereas it previously would lock up in a
few seconds).
Thanks,
Nick
---
There is a race with dirty page accounting where a page may not properly
be accounted for.
clear_page_dirty_for_io() calls page_mkclean; then TestClearPageDirty.
page_mkclean walks the rmaps for that page, and for each one it cleans and
write protects the pte if it was dirty. It uses page_check_address to find the
pte. That function has a shortcut to avoid the ptl if the pte is not
present. Unfortunately, the pte can be switched to not-present then back to
present by other code while holding the page table lock -- this should not
be a signal for page_mkclean to ignore that pte, because it may be dirty.
For example, do_wp_page calls ptep_clear_flush_notify before marking the
pte writable and dirty. do_wp_page does subsequently set the page dirty, but
that can happen before clear_page_dirty_for_io() calls TestClearPageDirty.
There may also be other code which does similar things, and/or architectures
which implement other pte manipulation functions using a clear-set sequence,
I haven't looked thoroughly. But it should be fixed.
The consequence of the bug is loss of data integrity (msync) and loss of
dirty page accounting accuracy. XIP's __xip_unmap is probably also unreliable,
which can lead to data corruption.
Fix this by always taking ptl to check the pte in page_check_address.
It's possible to retain this optimization for page_referenced and
try_to_unmap.
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/include/linux/rmap.h
===================================================================
--- linux-2.6.orig/include/linux/rmap.h
+++ linux-2.6/include/linux/rmap.h
@@ -102,7 +102,7 @@ int try_to_unmap(struct page *, int igno
* Called from mm/filemap_xip.c to unmap empty zero page
*/
pte_t *page_check_address(struct page *, struct mm_struct *,
- unsigned long, spinlock_t **);
+ unsigned long, spinlock_t **, int);
/*
* Used by swapoff to help locate where page is expected in vma.
Index: linux-2.6/mm/filemap_xip.c
===================================================================
--- linux-2.6.orig/mm/filemap_xip.c
+++ linux-2.6/mm/filemap_xip.c
@@ -185,7 +185,7 @@ __xip_unmap (struct address_space * mapp
address = vma->vm_start +
((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
BUG_ON(address < vma->vm_start || address >= vma->vm_end);
- pte = page_check_address(page, mm, address, &ptl);
+ pte = page_check_address(page, mm, address, &ptl, 1);
if (pte) {
/* Nuke the page table entry. */
flush_cache_page(vma, address, pte_pfn(*pte));
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -224,10 +224,14 @@ unsigned long page_address_in_vma(struct
/*
* Check that @page is mapped at @address into @mm.
*
+ * If @synch is false, page_check_address may perform a racy check to avoid
+ * the page table lock when the pte is not present (helpful when reclaiming
+ * highly shared pages).
+ *
* On success returns with pte mapped and locked.
*/
pte_t *page_check_address(struct page *page, struct mm_struct *mm,
- unsigned long address, spinlock_t **ptlp)
+ unsigned long address, spinlock_t **ptlp, int synch)
{
pgd_t *pgd;
pud_t *pud;
@@ -249,7 +253,7 @@ pte_t *page_check_address(struct page *p
pte = pte_offset_map(pmd, address);
/* Make a quick check before getting the lock */
- if (!pte_present(*pte)) {
+ if (!synch && !pte_present(*pte)) {
pte_unmap(pte);
return NULL;
}
@@ -281,7 +285,7 @@ static int page_referenced_one(struct pa
if (address == -EFAULT)
goto out;
- pte = page_check_address(page, mm, address, &ptl);
+ pte = page_check_address(page, mm, address, &ptl, 0);
if (!pte)
goto out;
@@ -450,7 +454,7 @@ static int page_mkclean_one(struct page
if (address == -EFAULT)
goto out;
- pte = page_check_address(page, mm, address, &ptl);
+ pte = page_check_address(page, mm, address, &ptl, 1);
if (!pte)
goto out;
@@ -697,7 +701,7 @@ static int try_to_unmap_one(struct page
if (address == -EFAULT)
goto out;
- pte = page_check_address(page, mm, address, &ptl);
+ pte = page_check_address(page, mm, address, &ptl, 0);
if (!pte)
goto out;
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>
* Re: [rfc][patch] mm: dirty page accounting race fix
2008-08-14 9:45 [rfc][patch] mm: dirty page accounting race fix Nick Piggin
@ 2008-08-14 10:27 ` Peter Zijlstra
2008-08-14 11:55 ` Hugh Dickins
1 sibling, 0 replies; 12+ messages in thread
From: Peter Zijlstra @ 2008-08-14 10:27 UTC (permalink / raw)
To: Nick Piggin
Cc: Linux Memory Management List, Hugh Dickins, Linux Kernel Mailing List
On Thu, 2008-08-14 at 11:45 +0200, Nick Piggin wrote:
> Hi,
>
> Wonder if you would be kind enough to verify this? It solves my issues (well
> still running after several hours whereas it previously would lock up in a
> few seconds).
>
> Thanks,
> Nick
s/synch/sync/ ?
Indeed, that does look like a hole, and the proposed fix seems sane.
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
> There is a race with dirty page accounting where a page may not properly
> be accounted for.
>
> clear_page_dirty_for_io() calls page_mkclean; then TestClearPageDirty.
>
> page_mkclean walks the rmaps for that page, and for each one it cleans and
> write protects the pte if it was dirty. It uses page_check_address to find the
> pte. That function has a shortcut to avoid the ptl if the pte is not
> present. Unfortunately, the pte can be switched to not-present then back to
> present by other code while holding the page table lock -- this should not
> be a signal for page_mkclean to ignore that pte, because it may be dirty.
>
> For example, do_wp_page calls ptep_clear_flush_notify before marking the
> pte writable and dirty. do_wp_page does subsequently set the page dirty, but
> that can happen before clear_page_dirty_for_io() calls TestClearPageDirty.
> There may also be other code which does similar things, and/or architectures
> which implement other pte manipulation functions using a clear-set sequence,
> I haven't looked thoroughly. But it should be fixed.
>
> The consequence of the bug is loss of data integrity (msync) and loss of
> dirty page accounting accuracy. XIP's __xip_unmap is probably also unreliable,
> which can lead to data corruption.
>
> Fix this by always taking ptl to check the pte in page_check_address.
>
> It's possible to retain this optimization for page_referenced and
> try_to_unmap.
>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
> ---
> Index: linux-2.6/include/linux/rmap.h
> ===================================================================
> --- linux-2.6.orig/include/linux/rmap.h
> +++ linux-2.6/include/linux/rmap.h
> @@ -102,7 +102,7 @@ int try_to_unmap(struct page *, int igno
> * Called from mm/filemap_xip.c to unmap empty zero page
> */
> pte_t *page_check_address(struct page *, struct mm_struct *,
> - unsigned long, spinlock_t **);
> + unsigned long, spinlock_t **, int);
>
> /*
> * Used by swapoff to help locate where page is expected in vma.
> Index: linux-2.6/mm/filemap_xip.c
> ===================================================================
> --- linux-2.6.orig/mm/filemap_xip.c
> +++ linux-2.6/mm/filemap_xip.c
> @@ -185,7 +185,7 @@ __xip_unmap (struct address_space * mapp
> address = vma->vm_start +
> ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> BUG_ON(address < vma->vm_start || address >= vma->vm_end);
> - pte = page_check_address(page, mm, address, &ptl);
> + pte = page_check_address(page, mm, address, &ptl, 1);
> if (pte) {
> /* Nuke the page table entry. */
> flush_cache_page(vma, address, pte_pfn(*pte));
> Index: linux-2.6/mm/rmap.c
> ===================================================================
> --- linux-2.6.orig/mm/rmap.c
> +++ linux-2.6/mm/rmap.c
> @@ -224,10 +224,14 @@ unsigned long page_address_in_vma(struct
> /*
> * Check that @page is mapped at @address into @mm.
> *
> + * If @synch is false, page_check_address may perform a racy check to avoid
> + * the page table lock when the pte is not present (helpful when reclaiming
> + * highly shared pages).
> + *
> * On success returns with pte mapped and locked.
> */
> pte_t *page_check_address(struct page *page, struct mm_struct *mm,
> - unsigned long address, spinlock_t **ptlp)
> + unsigned long address, spinlock_t **ptlp, int synch)
> {
> pgd_t *pgd;
> pud_t *pud;
> @@ -249,7 +253,7 @@ pte_t *page_check_address(struct page *p
>
> pte = pte_offset_map(pmd, address);
> /* Make a quick check before getting the lock */
> - if (!pte_present(*pte)) {
> + if (!synch && !pte_present(*pte)) {
> pte_unmap(pte);
> return NULL;
> }
> @@ -281,7 +285,7 @@ static int page_referenced_one(struct pa
> if (address == -EFAULT)
> goto out;
>
> - pte = page_check_address(page, mm, address, &ptl);
> + pte = page_check_address(page, mm, address, &ptl, 0);
> if (!pte)
> goto out;
>
> @@ -450,7 +454,7 @@ static int page_mkclean_one(struct page
> if (address == -EFAULT)
> goto out;
>
> - pte = page_check_address(page, mm, address, &ptl);
> + pte = page_check_address(page, mm, address, &ptl, 1);
> if (!pte)
> goto out;
>
> @@ -697,7 +701,7 @@ static int try_to_unmap_one(struct page
> if (address == -EFAULT)
> goto out;
>
> - pte = page_check_address(page, mm, address, &ptl);
> + pte = page_check_address(page, mm, address, &ptl, 0);
> if (!pte)
> goto out;
>
* Re: [rfc][patch] mm: dirty page accounting race fix
2008-08-14 9:45 [rfc][patch] mm: dirty page accounting race fix Nick Piggin
2008-08-14 10:27 ` Peter Zijlstra
@ 2008-08-14 11:55 ` Hugh Dickins
2008-08-14 12:18 ` Peter Zijlstra
` (2 more replies)
1 sibling, 3 replies; 12+ messages in thread
From: Hugh Dickins @ 2008-08-14 11:55 UTC (permalink / raw)
To: Nick Piggin
Cc: Linux Memory Management List, Peter Zijlstra, Linux Kernel Mailing List
On Thu, 14 Aug 2008, Nick Piggin wrote:
> Hi,
>
> Wonder if you would be kind enough to verify this? It solves my issues (well
> still running after several hours whereas it previously would lock up in a
> few seconds).
>
> Thanks,
> Nick
>
> ---
> There is a race with dirty page accounting where a page may not properly
> be accounted for.
>
> clear_page_dirty_for_io() calls page_mkclean; then TestClearPageDirty.
>
> page_mkclean walks the rmaps for that page, and for each one it cleans and
> write protects the pte if it was dirty. It uses page_check_address to find the
> pte. That function has a shortcut to avoid the ptl if the pte is not
> present. Unfortunately, the pte can be switched to not-present then back to
> present by other code while holding the page table lock -- this should not
> be a signal for page_mkclean to ignore that pte, because it may be dirty.
>
> For example, do_wp_page calls ptep_clear_flush_notify before marking the
> pte writable and dirty. do_wp_page does subsequently set the page dirty, but
> that can happen before clear_page_dirty_for_io() calls TestClearPageDirty.
> There may also be other code which does similar things, and/or architectures
> which implement other pte manipulation functions using a clear-set sequence,
> I haven't looked thoroughly. But it should be fixed.
>
> The consequence of the bug is loss of data integrity (msync) and loss of
> dirty page accounting accuracy. XIP's __xip_unmap is probably also unreliable,
> which can lead to data corruption.
>
> Fix this by always taking ptl to check the pte in page_check_address.
>
> It's possible to retain this optimization for page_referenced and
> try_to_unmap.
>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
> ---
Thanks for the inlined patch.
> Index: linux-2.6/include/linux/rmap.h
> ===================================================================
> --- linux-2.6.orig/include/linux/rmap.h
> +++ linux-2.6/include/linux/rmap.h
> @@ -102,7 +102,7 @@ int try_to_unmap(struct page *, int igno
> * Called from mm/filemap_xip.c to unmap empty zero page
> */
> pte_t *page_check_address(struct page *, struct mm_struct *,
> - unsigned long, spinlock_t **);
> + unsigned long, spinlock_t **, int);
>
> /*
> * Used by swapoff to help locate where page is expected in vma.
> Index: linux-2.6/mm/filemap_xip.c
> ===================================================================
> --- linux-2.6.orig/mm/filemap_xip.c
> +++ linux-2.6/mm/filemap_xip.c
> @@ -185,7 +185,7 @@ __xip_unmap (struct address_space * mapp
> address = vma->vm_start +
> ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> BUG_ON(address < vma->vm_start || address >= vma->vm_end);
> - pte = page_check_address(page, mm, address, &ptl);
> + pte = page_check_address(page, mm, address, &ptl, 1);
> if (pte) {
> /* Nuke the page table entry. */
> flush_cache_page(vma, address, pte_pfn(*pte));
> Index: linux-2.6/mm/rmap.c
> ===================================================================
> --- linux-2.6.orig/mm/rmap.c
> +++ linux-2.6/mm/rmap.c
> @@ -224,10 +224,14 @@ unsigned long page_address_in_vma(struct
> /*
> * Check that @page is mapped at @address into @mm.
> *
> + * If @synch is false, page_check_address may perform a racy check to avoid
> + * the page table lock when the pte is not present (helpful when reclaiming
> + * highly shared pages).
> + *
> * On success returns with pte mapped and locked.
> */
> pte_t *page_check_address(struct page *page, struct mm_struct *mm,
> - unsigned long address, spinlock_t **ptlp)
> + unsigned long address, spinlock_t **ptlp, int synch)
> {
> pgd_t *pgd;
> pud_t *pud;
> @@ -249,7 +253,7 @@ pte_t *page_check_address(struct page *p
>
> pte = pte_offset_map(pmd, address);
> /* Make a quick check before getting the lock */
> - if (!pte_present(*pte)) {
> + if (!synch && !pte_present(*pte)) {
> pte_unmap(pte);
> return NULL;
> }
> @@ -281,7 +285,7 @@ static int page_referenced_one(struct pa
> if (address == -EFAULT)
> goto out;
>
> - pte = page_check_address(page, mm, address, &ptl);
> + pte = page_check_address(page, mm, address, &ptl, 0);
> if (!pte)
> goto out;
>
> @@ -450,7 +454,7 @@ static int page_mkclean_one(struct page
> if (address == -EFAULT)
> goto out;
>
> - pte = page_check_address(page, mm, address, &ptl);
> + pte = page_check_address(page, mm, address, &ptl, 1);
> if (!pte)
> goto out;
>
> @@ -697,7 +701,7 @@ static int try_to_unmap_one(struct page
> if (address == -EFAULT)
> goto out;
>
> - pte = page_check_address(page, mm, address, &ptl);
> + pte = page_check_address(page, mm, address, &ptl, 0);
> if (!pte)
> goto out;
>
>
I'm not against this if it really turns out to be the answer,
it's simple enough and it sounds like a very good find; but
I'm still not convinced that you've got to the bottom of it.
Am I confused, or is your "do_wp_page calls ptep_clear_flush_notify"
example a very bad one? The page it's dealing with there doesn't
go back into the page table (its COW does), and the dirty_accounting
case doesn't even get down there, it's dealt with in the reuse case
above, which uses ptep_set_access_flags. Now, I think that one may
well behave as you suggest on some arches (though it's extending
permissions not restricting them, so maybe not); but please check
that out and improve your example.
Even if it does, it's not clear to me that your fix is the answer.
That may well be because the whole of dirty page accounting grew too
subtle for me! But holding the page table lock on one pte of the
page doesn't guarantee much about the integrity of the whole dance:
do_wp_page does its set_page_dirty_balance for this case, you'd
need to spell out the bad sequence more to convince me.
Sorry, that may be a lot of work, to get it through my skull!
And I may be lazily asking you to do my thinking for me.
But I got a bit distracted: mprotect's change_pte_range is
traditionally where the pte_modify operation has been split up into
stages on some arches, that really can be restricting permissions
and needs to tread carefully. Now I go to look there, I see its
/*
* Avoid taking write faults for pages we know to be
* dirty.
*/
if (dirty_accountable && pte_dirty(ptent))
ptent = pte_mkwrite(ptent);
and get rather worried: isn't that likely to be giving write permission
to a pte in a vma we are precisely taking write permission away from?
That's a different issue, of course; but perhaps it's even relevant.
Hugh
* Re: [rfc][patch] mm: dirty page accounting race fix
2008-08-14 11:55 ` Hugh Dickins
@ 2008-08-14 12:18 ` Peter Zijlstra
2008-08-14 12:47 ` Hugh Dickins
2008-08-14 12:35 ` Nick Piggin
2008-08-14 12:49 ` Peter Zijlstra
2 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2008-08-14 12:18 UTC (permalink / raw)
To: Hugh Dickins
Cc: Nick Piggin, Linux Memory Management List, Linux Kernel Mailing List
On Thu, 2008-08-14 at 12:55 +0100, Hugh Dickins wrote:
> But I got a bit distracted: mprotect's change_pte_range is
> traditionally where the pte_modify operation has been split up into
> stages on some arches, that really can be restricting permissions
> and needs to tread carefully. Now I go to look there, I see its
> /*
> * Avoid taking write faults for pages we know to be
> * dirty.
> */
> if (dirty_accountable && pte_dirty(ptent))
> ptent = pte_mkwrite(ptent);
>
> and get rather worried: isn't that likely to be giving write permission
> to a pte in a vma we are precisely taking write permission away from?
Exactly, we do that because the page is already dirty, therefore we do
not need to trap on write to mark it dirty - at least, that was the idea
behind this optimization.
* Re: [rfc][patch] mm: dirty page accounting race fix
2008-08-14 11:55 ` Hugh Dickins
2008-08-14 12:18 ` Peter Zijlstra
@ 2008-08-14 12:35 ` Nick Piggin
2008-08-14 13:20 ` Hugh Dickins
2008-08-14 12:49 ` Peter Zijlstra
2 siblings, 1 reply; 12+ messages in thread
From: Nick Piggin @ 2008-08-14 12:35 UTC (permalink / raw)
To: Hugh Dickins
Cc: Linux Memory Management List, Peter Zijlstra, Linux Kernel Mailing List
On Thu, Aug 14, 2008 at 12:55:46PM +0100, Hugh Dickins wrote:
> On Thu, 14 Aug 2008, Nick Piggin wrote:
> > - pte = page_check_address(page, mm, address, &ptl);
> > + pte = page_check_address(page, mm, address, &ptl, 0);
> > if (!pte)
> > goto out;
> >
> >
>
> I'm not against this if it really turns out to be the answer,
> it's simple enough and it sounds like a very good find; but
> I'm still not convinced that you've got to the bottom of it.
>
> Am I confused, or is your "do_wp_page calls ptep_clear_flush_notify"
> example a very bad one? The page it's dealing with there doesn't
> go back into the page table (its COW does), and the dirty_accounting
> case doesn't even get down there, it's dealt with in the reuse case
> above, which uses ptep_set_access_flags. Now, I think that one may
Oh you're right definitely. Thanks.
Actually, the bug I am running into is not with a vanilla kernel...
I am making several of my own required changes to solve other races
I need to plug, so I'm sorry the changelog might be misleading...
I have not actually reproduced a problem with the vanilla kernel.
> well behave as you suggest on some arches (though it's extending
> permissions not restricting them, so maybe not); but please check
> that out and improve your example.
>
> Even if it does, it's not clear to me that your fix is the answer.
> That may well be because the whole of dirty page accounting grew too
> subtle for me! But holding the page table lock on one pte of the
> page doesn't guarantee much about the integrity of the whole dance:
> do_wp_page does its set_page_dirty_balance for this case, you'd
> need to spell out the bad sequence more to convince me.
Hmm, no, even in that case I think we get away with it because of
the wait_on_page_locked which ensures clearing the page dirty
bit before do_wp_page sets the page dirty...
> Sorry, that may be a lot of work, to get it through my skull!
> And I may be lazily asking you to do my thinking for me.
Maybe I've found another one: ppc64's set_pte_at seems to clear
the pte, and lots of pte accessors are implemented with set_pte_at.
mprotect's modify_prot_commit for example.
Even if I'm wrong and we happen to be safe everywhere, it seems
really fragile to ask that no architectures ever allow transient
!pte_present in cases where it matters, and no generic code
emit the wrong sequence either. Or is there some reason I'm missing
that makes this more robust?
> But I got a bit distracted: mprotect's change_pte_range is
> traditionally where the pte_modify operation has been split up into
> stages on some arches, that really can be restricting permissions
> and needs to tread carefully. Now I go to look there, I see its
> /*
> * Avoid taking write faults for pages we know to be
> * dirty.
> */
> if (dirty_accountable && pte_dirty(ptent))
> ptent = pte_mkwrite(ptent);
>
> and get rather worried: isn't that likely to be giving write permission
> to a pte in a vma we are precisely taking write permission away from?
> That's a different issue, of course; but perhaps it's even relevant.
Hmm, vma_wants_writenotify is only true if VM_WRITE, and in that
case we might be OK?
* Re: [rfc][patch] mm: dirty page accounting race fix
2008-08-14 12:18 ` Peter Zijlstra
@ 2008-08-14 12:47 ` Hugh Dickins
2008-08-14 20:25 ` Jeremy Fitzhardinge
0 siblings, 1 reply; 12+ messages in thread
From: Hugh Dickins @ 2008-08-14 12:47 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Nick Piggin, Linux Memory Management List, Linux Kernel Mailing List
On Thu, 14 Aug 2008, Peter Zijlstra wrote:
> On Thu, 2008-08-14 at 12:55 +0100, Hugh Dickins wrote:
>
> > But I got a bit distracted: mprotect's change_pte_range is
> > traditionally where the pte_modify operation has been split up into
> > stages on some arches, that really can be restricting permissions
> > and needs to tread carefully. Now I go to look there, I see its
> > /*
> > * Avoid taking write faults for pages we know to be
> > * dirty.
> > */
> > if (dirty_accountable && pte_dirty(ptent))
> > ptent = pte_mkwrite(ptent);
> >
> > and get rather worried: isn't that likely to be giving write permission
> > to a pte in a vma we are precisely taking write permission away from?
>
> Exactly, we do that because the page is already dirty, therefore we do
> not need to trap on write to mark it dirty - at least, that was the idea
> behind this optimization.
I realized that was the intended optimization, what I'd missed is that
dirty_accountable can only be true there if (vma->vm_flags & VM_WRITE):
that's checked in vma_wants_writenotify(), which is how dirty_accountable
gets to be set.
So those lines are okay, panic over, phew.
Hugh
* Re: [rfc][patch] mm: dirty page accounting race fix
2008-08-14 11:55 ` Hugh Dickins
2008-08-14 12:18 ` Peter Zijlstra
2008-08-14 12:35 ` Nick Piggin
@ 2008-08-14 12:49 ` Peter Zijlstra
2008-08-14 13:39 ` Hugh Dickins
2 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2008-08-14 12:49 UTC (permalink / raw)
To: Hugh Dickins
Cc: Nick Piggin, Linux Memory Management List, Linux Kernel Mailing List
On Thu, 2008-08-14 at 12:55 +0100, Hugh Dickins wrote:
> Am I confused, or is your "do_wp_page calls ptep_clear_flush_notify"
> example a very bad one? The page it's dealing with there doesn't
> go back into the page table (its COW does), and the dirty_accounting
> case doesn't even get down there, it's dealt with in the reuse case
> above, which uses ptep_set_access_flags.
Also, the new page is only added to the rmap _after_ it has been
installed. So page_mkclean() will never get to it to see the empty pte.
> Now, I think that one may
> well behave as you suggest on some arches (though it's extending
> permissions not restricting them, so maybe not); but please check
> that out and improve your example.
Another case I just looked at is if ptep_clear_flush_young() actually
does the clear bit. But the few arches (x86_64, ppc64) that I looked at
don't seem to do so.
If someone would, you could hit this race.
/me continues searching for a convincing candidate..
> Even if it does, it's not clear to me that your fix is the answer.
> That may well be because the whole of dirty page accounting grew too
> subtle for me!
CPU1                                    CPU2

lock(pte_lock)
ptep_clear_flush(ptep)
                                        page_mkclean()
                                          page_check_address()
                                            !pte_present(ptep)
                                              return NULL
ptep_set(ptep, new_pte);
unlock(pte_lock)
Now, if page_check_address() doesn't return prematurely, but is forced
to take the pte_lock, we won't see that hole and will not skip the page.
> But holding the page table lock on one pte of the
> page doesn't guarantee much about the integrity of the whole dance:
> do_wp_page does its set_page_dirty_balance for this case, you'd
> need to spell out the bad sequence more to convince me.
Now you're confusing me... are you saying ptes can be changed from under
your feet even while holding the pte_lock?
* Re: [rfc][patch] mm: dirty page accounting race fix
2008-08-14 12:35 ` Nick Piggin
@ 2008-08-14 13:20 ` Hugh Dickins
0 siblings, 0 replies; 12+ messages in thread
From: Hugh Dickins @ 2008-08-14 13:20 UTC (permalink / raw)
To: Nick Piggin
Cc: Linux Memory Management List, Peter Zijlstra, Linux Kernel Mailing List
On Thu, 14 Aug 2008, Nick Piggin wrote:
> On Thu, Aug 14, 2008 at 12:55:46PM +0100, Hugh Dickins wrote:
>
> Maybe I've found another one: ppc64's set_pte_at seems to clear
> the pte, and lots of pte accessors are implemented with set_pte_at.
> mprotect's modify_prot_commit for example.
>
> Even if I'm wrong and we happen to be safe everywhere, it seems
> really fragile to ask that no architectures ever allow transient
> !pte_present in cases where it matters, and no generic code
> emit the wrong sequence either. Or is there some reason I'm missing
> that makes this more robust?
I agree completely that should be allowed (within pagetable lock)
and is sometimes essential, mprotect being the classic example.
So I'll try to think through your case later on, focussing on
mprotect instead, and report back once I've pictured it.
> Hmm, vma_wants_writenotify is only true if VM_WRITE, and in that
> case we might be OK?
Yes, that's what I'd missed: with that worry out of the way,
I should think a bit clearer.
Hugh
* Re: [rfc][patch] mm: dirty page accounting race fix
2008-08-14 12:49 ` Peter Zijlstra
@ 2008-08-14 13:39 ` Hugh Dickins
2008-08-14 13:52 ` Nick Piggin
0 siblings, 1 reply; 12+ messages in thread
From: Hugh Dickins @ 2008-08-14 13:39 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Nick Piggin, Linux Memory Management List, Linux Kernel Mailing List
On Thu, 14 Aug 2008, Peter Zijlstra wrote:
> On Thu, 2008-08-14 at 12:55 +0100, Hugh Dickins wrote:
>
> > But holding the page table lock on one pte of the
> > page doesn't guarantee much about the integrity of the whole dance:
> > do_wp_page does its set_page_dirty_balance for this case, you'd
> > need to spell out the bad sequence more to convince me.
>
> Now you're confusing me... are you saying ptes can be changed from under
> your feet even while holding the pte_lock?
Well, yes, dirty and accessed can be changed from another thread in
userspace while we hold pt lock in the kernel. (But dirty could only
be changed if the pte is writable, and in dirty balancing cases that
should be being prevented.)
But no, that isn't what I was thinking of. pt lock better be enough
to secure against kernel modifications to the pte. I was just thinking
there are (potentially) all those other ptes of the page, and this pte
may be modified the next instant, it wasn't obvious to me that missing
the one is so terrible.
Give me time, please don't let me confuse you, one of us is enough.
Hugh
* Re: [rfc][patch] mm: dirty page accounting race fix
2008-08-14 13:39 ` Hugh Dickins
@ 2008-08-14 13:52 ` Nick Piggin
2008-08-14 19:09 ` Hugh Dickins
0 siblings, 1 reply; 12+ messages in thread
From: Nick Piggin @ 2008-08-14 13:52 UTC (permalink / raw)
To: Hugh Dickins
Cc: Peter Zijlstra, Linux Memory Management List, Linux Kernel Mailing List
On Thu, Aug 14, 2008 at 02:39:02PM +0100, Hugh Dickins wrote:
> On Thu, 14 Aug 2008, Peter Zijlstra wrote:
> > On Thu, 2008-08-14 at 12:55 +0100, Hugh Dickins wrote:
> >
> > > But holding the page table lock on one pte of the
> > > page doesn't guarantee much about the integrity of the whole dance:
> > > do_wp_page does its set_page_dirty_balance for this case, you'd
> > > need to spell out the bad sequence more to convince me.
> >
> > Now you're confusing me... are you saying ptes can be changed from under
> > your feet even while holding the pte_lock?
>
> Well, yes, dirty and accessed can be changed from another thread in
> userspace while we hold pt lock in the kernel. (But dirty could only
> be changed if the pte is writable, and in dirty balancing cases that
> should be being prevented.)
>
> But no, that isn't what I was thinking of. pt lock better be enough
> to secure against kernel modifications to the pte. I was just thinking
> there are (potentially) all those other ptes of the page, and this pte
> may be modified the next instant, it wasn't obvious to me that missing
> the one is so terrible.
If I may... perhaps I didn't explain the race well enough. I'll try to
shed further light on it (or prove myself wrong):
It is true that there are other ptes, and any of those others at any
time may be modified when we don't hold their particular ptl.
But after we clean them, they'll require a page fault to dirty them again,
and we're blocking out page faults (effectively -- see the comments in
clear_page_dirty_for_io and wait_on_page_locked in do_wp_page) by holding
the page lock over the call to clear_page_dirty_for_io.
So if we find and clean a dirty/writable pte, we guarantee it won't get
to the set_page_dirty part of the fault handler until clear_page_dirty_for_io
has done its TestClearPageDirty.
The problem is that transiently-cleared-but-actually-dirty ptes, which may
be seen when not holding ptl, will not be made readonly at all.
TestClearPageDirty will clean the page, but we've still got a dirty pte.
HTH
* Re: [rfc][patch] mm: dirty page accounting race fix
2008-08-14 13:52 ` Nick Piggin
@ 2008-08-14 19:09 ` Hugh Dickins
0 siblings, 0 replies; 12+ messages in thread
From: Hugh Dickins @ 2008-08-14 19:09 UTC (permalink / raw)
To: Nick Piggin
Cc: Peter Zijlstra, Linux Memory Management List, Linux Kernel Mailing List
On Thu, 14 Aug 2008, Nick Piggin wrote:
>
> If I may...
You're very welcome...
> perhaps I didn't explain the race well enough. I'll try to
> shed further light on it (or prove myself wrong):
>
> It is true that there are other ptes, and any of those others at any
> time may be modified when we don't hold their particular ptl.
>
> But after we clean them, they'll require a page fault to dirty them again,
> and we're blocking out page faults (effectively -- see the comments in
> clear_page_dirty_for_io and wait_on_page_locked in do_wp_page) by holding
> the page lock over the call to clear_page_dirty_for_io.
Yes, that's the part of the protocol I was forgetting. Not exactly
relevant to the mprotect case where your hole opens up, but exactly
the part of it that synchronizes all the ptes with the struct page,
which was worrying me there.
I really ought to read all those private messages to Virginia.
Who is she? A friend of Paige McWright, perhaps?
>
> So if we find and clean a dirty/writable pte, we guarantee it won't get
> to the set_page_dirty part of the fault handler until clear_page_dirty_for_io
> has done its TestClearPageDirty.
>
> The problem is that transiently-cleared-but-actually-dirty ptes, which
> may be seen when not holding ptl, will not be made readonly at all.
> TestClearPageDirty will clean the page, but we've still got a dirty pte.
Yes. The pte is still marked dirty and writable (after briefly being
absent) and the page may be modified by userspace without entering the
accounting; it will eventually get picked up and written out, but if
we're taking the accounting seriously we ought to fix the hole.
>
> HTH
Very much so: thank you.
Hugh
* Re: [rfc][patch] mm: dirty page accounting race fix
2008-08-14 12:47 ` Hugh Dickins
@ 2008-08-14 20:25 ` Jeremy Fitzhardinge
0 siblings, 0 replies; 12+ messages in thread
From: Jeremy Fitzhardinge @ 2008-08-14 20:25 UTC (permalink / raw)
To: Hugh Dickins
Cc: Peter Zijlstra, Nick Piggin, Linux Memory Management List,
Linux Kernel Mailing List
Hugh Dickins wrote:
> I realized that was the intended optimization, what I'd missed is that
> dirty_accountable can only be true there if (vma->vm_flags & VM_WRITE):
> that's checked in vma_wants_writenotify(), which is how dirty_accountable
> gets to be set.
>
> So those lines are okay, panic over, phew.
>
I got bitten by precisely the same train of thought. I think that code is
officially Non Obvious (or at least Not Immediately Obvious, which is
bad in security-sensitive code).
J
Thread overview: 12+ messages
2008-08-14 9:45 [rfc][patch] mm: dirty page accounting race fix Nick Piggin
2008-08-14 10:27 ` Peter Zijlstra
2008-08-14 11:55 ` Hugh Dickins
2008-08-14 12:18 ` Peter Zijlstra
2008-08-14 12:47 ` Hugh Dickins
2008-08-14 20:25 ` Jeremy Fitzhardinge
2008-08-14 12:35 ` Nick Piggin
2008-08-14 13:20 ` Hugh Dickins
2008-08-14 12:49 ` Peter Zijlstra
2008-08-14 13:39 ` Hugh Dickins
2008-08-14 13:52 ` Nick Piggin
2008-08-14 19:09 ` Hugh Dickins