page_mkwrite caller is racy?

linux-mm.kvack.org archive mirror

* page_mkwrite caller is racy?
@ 2007-01-29 10:20 Nick Piggin
  2007-01-29 16:08 ` Hugh Dickins
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Nick Piggin @ 2007-01-29 10:20 UTC
  To: linux-kernel, Linux Memory Management, David Howells,
	Hugh Dickins, Andrew Morton

Hi,

After do_wp_page calls page_mkwrite on its target (old_page), it then drops the
reference to the page before locking the ptl and verifying that the pte points
to old_page.

Unfortunately, old_page may have been truncated and freed, or reclaimed, then
re-allocated and used again for the same pagecache position and faulted in
read-only into the same pte by another thread. Then you will have a situation
where page_mkwrite succeeds but the page we use is actually a readonly one.

Moving page_cache_release(old_page) to below the next statement will fix that
problem.
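
The window described above can be illustrated with a small userspace model. This is only a sketch under loud assumptions: `toy_page`, `toy_alloc`, `toy_put` and the other `toy_*` names are hypothetical stand-ins, not kernel interfaces, and the "concurrent" truncate and refault is replayed sequentially at the point where the real code runs without the page table lock. The model shows why a pointer comparison after relocking is only trustworthy while a reference pins the page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_NFRAMES 2

/* Hypothetical stand-in for struct page: a refcount plus a flag that
 * records whether page_mkwrite was run on this instance of the page. */
struct toy_page {
    int refcount;
    bool mkwrite_done;
};

static struct toy_page toy_frames[TOY_NFRAMES];

/* Hand out the first free frame, so a freed frame is reused at once. */
static struct toy_page *toy_alloc(void)
{
    for (int i = 0; i < TOY_NFRAMES; i++) {
        if (toy_frames[i].refcount == 0) {
            toy_frames[i].refcount = 1;      /* page cache reference */
            toy_frames[i].mkwrite_done = false;
            return &toy_frames[i];
        }
    }
    return NULL;
}

static void toy_put(struct toy_page *page)
{
    assert(page->refcount > 0);
    page->refcount--;
}

/* What another thread may do while the page table lock is not held:
 * truncate the page, then fault a fresh read-only page into the pte. */
static struct toy_page *toy_truncate_and_refault(struct toy_page *mapped)
{
    toy_put(mapped);          /* page cache drops its reference */
    return toy_alloc();       /* new page for the same file offset */
}

/* Returns true if the fault path would make writable a page on which
 * page_mkwrite was never called. */
static bool toy_stale_page_made_writable(bool release_before_lock)
{
    memset(toy_frames, 0, sizeof(toy_frames));

    struct toy_page *pte = toy_alloc();
    struct toy_page *orig_pte = pte;
    struct toy_page *old_page = pte;

    old_page->refcount++;             /* fault path pins old_page */
    old_page->mkwrite_done = true;    /* page_mkwrite succeeded */

    if (release_before_lock)
        toy_put(old_page);            /* ordering reported as racy */

    pte = toy_truncate_and_refault(pte);   /* the race window */

    /* The page table lock would be taken here. */
    if (!release_before_lock)
        toy_put(old_page);            /* ordering proposed as the fix */

    /* Models the pte_same() check: only the address is compared. */
    return pte == orig_pte && !pte->mkwrite_done;
}
```

In this model, calling `toy_stale_page_made_writable(true)` reuses the freed frame at the same address, so the comparison passes for a page that never saw page_mkwrite. With `false`, the held reference forces the refault onto a different frame and the check fails as intended.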

But it is sad that this thing got merged without any callers to even know how it
is intended to work. Must it be able to sleep?

Nick

-- 
SUSE Labs, Novell Inc.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: page_mkwrite caller is racy?
  2007-01-29 10:20 page_mkwrite caller is racy? Nick Piggin
@ 2007-01-29 16:08 ` Hugh Dickins
  2007-01-29 20:41   ` Anton Altaparmakov
  2007-01-30  1:14   ` Nick Piggin
  2007-01-29 20:00 ` Mark Fasheh
  2007-02-01 11:44 ` David Howells
  2 siblings, 2 replies; 9+ messages in thread
From: Hugh Dickins @ 2007-01-29 16:08 UTC
  To: Nick Piggin
  Cc: linux-kernel, Linux Memory Management, David Howells, Andrew Morton

On Mon, 29 Jan 2007, Nick Piggin wrote:
> 
> After do_wp_page calls page_mkwrite on its target (old_page), it then drops
> the reference to the page before locking the ptl and verifying that the pte
> points to old_page.
> 
> Unfortunately, old_page may have been truncated and freed, or reclaimed, then
> re-allocated and used again for the same pagecache position and faulted in
> read-only into the same pte by another thread. Then you will have a situation
> where page_mkwrite succeeds but the page we use is actually a readonly one.

You're right.  Well observed.  It was I who originally added that
page_cache_release/page_cache_get, and the page_cache_get certainly
followed getting the page_table_lock when I first added them.

Looks like amidst all the intervening versions, with the patch going
into and getting dropped from -mm from time to time, those positions
became reversed without us noticing (almost certainly when the lock
and the pte_offset_map got merged into the pte_offset_map_lock).

> 
> Moving page_cache_release(old_page) to below the next statement
> will fix that problem.

Yes.  I'm reluctant to steal your credit, but also reluctant to go
back and forth too much over this: please insert your Signed-off-by
_before_ mine in the patch below (substituting your own comment if
you prefer) and send it to Andrew.

Not a priority for 2.6.20 or -stable: aside from the unlikelihood,
we don't seem to have any page_mkwrite users yet, as you point out.

> 
> But it is sad that this thing got merged without any callers to even
> know how it is intended to work.

I'm rather to blame for that: I pushed Peter to rearranging his work
on top of what David had, since they were dabbling in related issues,
and we'd already solved a number of them in relation to page_mkwrite;
so then when dirty tracking was wanted in, page_mkwrite came with it.

At the time I believed that AntonA was on the point of using it in
NTFS, but apparently not yet.

> Must it be able to sleep?

Not as David was using it: that was something I felt strongly it
should be allowed to do.  For example, in order to allocate backing
store for the mmap'ed page to be written (that need has been talked
about off and on for years).

Hugh

After do_wp_page has tested page_mkwrite, it must release old_page after
acquiring page table lock, not before: at some stage that ordering got
reversed, leaving a (very unlikely) window in which old_page might be
truncated, freed, and reused in the same position.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
---

 mm/memory.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--- 2.6.20-rc6/mm/memory.c	2007-01-25 08:25:27.000000000 +0000
+++ linux/mm/memory.c	2007-01-29 15:35:56.000000000 +0000
@@ -1531,8 +1531,6 @@ static int do_wp_page(struct mm_struct *
 			if (vma->vm_ops->page_mkwrite(vma, old_page) < 0)
 				goto unwritable_page;

-			page_cache_release(old_page);
-
 			/*
 			 * Since we dropped the lock we need to revalidate
 			 * the PTE as someone else may have changed it.  If
@@ -1541,6 +1539,7 @@ static int do_wp_page(struct mm_struct *
 			 */
 			page_table = pte_offset_map_lock(mm, pmd, address,
 							 &ptl);
+			page_cache_release(old_page);
 			if (!pte_same(*page_table, orig_pte))
 				goto unlock;
 		}


* Re: page_mkwrite caller is racy?
  2007-01-29 16:08 ` Hugh Dickins
@ 2007-01-29 20:41   ` Anton Altaparmakov
  2007-01-30  1:14   ` Nick Piggin
  1 sibling, 0 replies; 9+ messages in thread
From: Anton Altaparmakov @ 2007-01-29 20:41 UTC
  To: Hugh Dickins
  Cc: Nick Piggin, linux-kernel, Linux Memory Management,
	David Howells, Andrew Morton

On Mon, 29 Jan 2007, Hugh Dickins wrote:
> On Mon, 29 Jan 2007, Nick Piggin wrote:
> > After do_wp_page calls page_mkwrite on its target (old_page), it then drops
> > the reference to the page before locking the ptl and verifying that the pte
> > points to old_page.
> > 
> > Unfortunately, old_page may have been truncated and freed, or reclaimed, then
> > re-allocated and used again for the same pagecache position and faulted in
> > read-only into the same pte by another thread. Then you will have a situation
> > where page_mkwrite succeeds but the page we use is actually a readonly one.
> 
> You're right.  Well observed.  It was I who originally added that
> page_cache_release/page_cache_get, and the page_cache_get certainly
> followed getting the page_table_lock when I first added them.
> 
> Looks like amidst all the intervening versions, with the patch going
> into and getting dropped from -mm from time to time, those positions
> became reversed without us noticing (almost certainly when the lock
> and the pte_offset_map got merged into the pte_offset_map_lock).
> 
> > 
> > Moving page_cache_release(old_page) to below the next statement
> > will fix that problem.
> 
> Yes.  I'm reluctant to steal your credit, but also reluctant to go
> back and forth too much over this: please insert your Signed-off-by
> _before_ mine in the patch below (substituting your own comment if
> you prefer) and send it to Andrew.
> 
> Not a priority for 2.6.20 or -stable: aside from the unlikelihood,
> we don't seem to have any page_mkwrite users yet, as you point out.
> 
> > But it is sad that this thing got merged without any callers to even
> > know how it is intended to work.
> 
> I'm rather to blame for that: I pushed Peter to rearranging his work
> on top of what David had, since they were dabbling in related issues,
> and we'd already solved a number of them in relation to page_mkwrite;
> so then when dirty tracking was wanted in, page_mkwrite came with it.
> 
> At the time I believed that AntonA was on the point of using it in
> NTFS, but apparently not yet.

Other things got more important...  I am still on the verge of using it 
but I have to finish off other work first so the "verge" may be a little 
while off still.

> > Must it be able to sleep?
> 
> Not as David was using it: that was something I felt strongly it
> should be allowed to do.  For example, in order to allocate backing
> store for the mmap'ed page to be written (that need has been talked
> about off and on for years).

Yes, this is exactly what I need it for in NTFS.  I also need to be 
able to perform an mmap'ed write into a non-initialized region, i.e. a 
region which has disk allocation but has not been zeroed yet.  In the 
worst case I could have a huge file that is fully allocated on disk but 
completely uninitialized, and a single byte write towards the end of the 
file would require me to zero the entire file up to at least the written 
byte.  So the one byte write may trigger a multi-gigabyte (or terabyte!) 
write operation, which from ->writepage would be bad news but from 
page_mkwrite is much better.
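
The arithmetic behind that worst case can be sketched as follows. This is an illustration only: `toy_bytes_to_zero` and its parameters are hypothetical names, not part of the NTFS driver, and the model assumes a single "initialized size" boundary below which data is valid and above which allocated space still has to be zeroed.

```c
#include <stdint.h>

/* Hypothetical helper: how many bytes must be zeroed before a write at
 * write_offset can land, given that only the first initialized_size
 * bytes of the file have been initialized so far. */
static uint64_t toy_bytes_to_zero(uint64_t initialized_size,
                                  uint64_t write_offset)
{
    if (write_offset <= initialized_size)
        return 0;   /* the write falls inside initialized data */

    /* Everything between the old boundary and the write is zeroed. */
    return write_offset - initialized_size;
}
```

For a 1 TiB file with nothing initialized, a one byte write at the last offset gives `toy_bytes_to_zero(0, (1ULL << 40) - 1)`, which is just under 1 TiB of zeroing triggered by that single byte.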

Best regards,

	Anton

> 
> Hugh
> 
> 
> After do_wp_page has tested page_mkwrite, it must release old_page after
> acquiring page table lock, not before: at some stage that ordering got
> reversed, leaving a (very unlikely) window in which old_page might be
> truncated, freed, and reused in the same position.
> 
> Signed-off-by: Hugh Dickins <hugh@veritas.com>
> ---
> 
>  mm/memory.c |    3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> --- 2.6.20-rc6/mm/memory.c	2007-01-25 08:25:27.000000000 +0000
> +++ linux/mm/memory.c	2007-01-29 15:35:56.000000000 +0000
> @@ -1531,8 +1531,6 @@ static int do_wp_page(struct mm_struct *
>  			if (vma->vm_ops->page_mkwrite(vma, old_page) < 0)
>  				goto unwritable_page;
>  
> -			page_cache_release(old_page);
> -
>  			/*
>  			 * Since we dropped the lock we need to revalidate
>  			 * the PTE as someone else may have changed it.  If
> @@ -1541,6 +1539,7 @@ static int do_wp_page(struct mm_struct *
>  			 */
>  			page_table = pte_offset_map_lock(mm, pmd, address,
>  							 &ptl);
> +			page_cache_release(old_page);
>  			if (!pte_same(*page_table, orig_pte))
>  				goto unlock;
>  		}

Best regards,

	Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/


* Re: page_mkwrite caller is racy?
  2007-01-29 16:08 ` Hugh Dickins
  2007-01-29 20:41   ` Anton Altaparmakov
@ 2007-01-30  1:14   ` Nick Piggin
  2007-01-30  1:51     ` Mark Fasheh
  1 sibling, 1 reply; 9+ messages in thread
From: Nick Piggin @ 2007-01-30  1:14 UTC
  To: Hugh Dickins
  Cc: linux-kernel, Linux Memory Management, David Howells,
	Andrew Morton, Anton Altaparmakov, Mark Fasheh

Hugh Dickins wrote:
> On Mon, 29 Jan 2007, Nick Piggin wrote:

>>Moving page_cache_release(old_page) to below the next statement
>>will fix that problem.
> 
> 
> Yes.  I'm reluctant to steal your credit, but also reluctant to go
> back and forth too much over this: please insert your Signed-off-by
> _before_ mine in the patch below (substituting your own comment if
> you prefer) and send it to Andrew.
> 
> Not a priority for 2.6.20 or -stable: aside from the unlikelihood,
> we don't seem to have any page_mkwrite users yet, as you point out.

Agreed. Thanks for doing the patch.

>>But it is sad that this thing got merged without any callers to even
>>know how it is intended to work.
> 
> 
> I'm rather to blame for that: I pushed Peter to rearranging his work
> on top of what David had, since they were dabbling in related issues,
> and we'd already solved a number of them in relation to page_mkwrite;
> so then when dirty tracking was wanted in, page_mkwrite came with it.

Well, it's not a big problem -- I knew there were several people lined
up who wanted it. XFS is another one IIRC.

>>Must it be able to sleep?
> 
> 
> Not as David was using it: that was something I felt strongly it
> should be allowed to do.  For example, in order to allocate backing
> store for the mmap'ed page to be written (that need has been talked
> about off and on for years).

Fine, and Mark and Anton confirm it (cc'ed, thanks guys).

This is another discussion, but do we want the page locked here? Or
are the filesystems happy to exclude truncate themselves?


> After do_wp_page has tested page_mkwrite, it must release old_page after
> acquiring page table lock, not before: at some stage that ordering got
> reversed, leaving a (very unlikely) window in which old_page might be
> truncated, freed, and reused in the same position.

Andrew please apply.

Signed-off-by: Nick Piggin <npiggin@suse.de>

> Signed-off-by: Hugh Dickins <hugh@veritas.com>
> ---
> 
>  mm/memory.c |    3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> --- 2.6.20-rc6/mm/memory.c	2007-01-25 08:25:27.000000000 +0000
> +++ linux/mm/memory.c	2007-01-29 15:35:56.000000000 +0000
> @@ -1531,8 +1531,6 @@ static int do_wp_page(struct mm_struct *
>  			if (vma->vm_ops->page_mkwrite(vma, old_page) < 0)
>  				goto unwritable_page;
>  
> -			page_cache_release(old_page);
> -
>  			/*
>  			 * Since we dropped the lock we need to revalidate
>  			 * the PTE as someone else may have changed it.  If
> @@ -1541,6 +1539,7 @@ static int do_wp_page(struct mm_struct *
>  			 */
>  			page_table = pte_offset_map_lock(mm, pmd, address,
>  							 &ptl);
> +			page_cache_release(old_page);
>  			if (!pte_same(*page_table, orig_pte))
>  				goto unlock;
>  		}
> 


-- 
SUSE Labs, Novell Inc.

* Re: page_mkwrite caller is racy?
  2007-01-30  1:14   ` Nick Piggin
@ 2007-01-30  1:51     ` Mark Fasheh
  2007-01-30 14:58       ` Anton Altaparmakov
  0 siblings, 1 reply; 9+ messages in thread
From: Mark Fasheh @ 2007-01-30  1:51 UTC
  To: Nick Piggin
  Cc: Hugh Dickins, linux-kernel, Linux Memory Management,
	David Howells, Andrew Morton, Anton Altaparmakov

On Tue, Jan 30, 2007 at 12:14:24PM +1100, Nick Piggin wrote:
> This is another discussion, but do we want the page locked here? Or
> are the filesystems happy to exclude truncate themselves?

No page lock please. Generally, Ocfs2 wants to order cluster locks outside
of page locks. Also, the sparse b-tree support I'm working on right now will
need to be able to allocate in ->page_mkwrite() which would become very
nasty if we came in with the page lock - aside from the additional cluster
locks taken, ocfs2 will want to zero some adjacent pages (because we support
atomic allocation up to 1 meg).
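
The ordering concern can be sketched with a tiny lock-rank checker. Everything here is a hypothetical userspace model: the `toy_*` names are invented for illustration, no real locks are taken, and the only rule encoded is the one stated above, that the cluster lock ranks outside the page lock.

```c
#include <stdbool.h>

/* Lower value = outer lock. The cluster lock must be taken first. */
enum toy_lock { TOY_CLUSTER_LOCK = 0, TOY_PAGE_LOCK = 1, TOY_NLOCKS };

struct toy_ctx {
    bool held[TOY_NLOCKS];
    int inversions;   /* times an outer lock was taken under an inner one */
};

static void toy_acquire(struct toy_ctx *ctx, enum toy_lock lock)
{
    /* Taking a lock while any higher-ranked (inner) lock is already
     * held violates the ordering. */
    for (int i = lock + 1; i < TOY_NLOCKS; i++) {
        if (ctx->held[i])
            ctx->inversions++;
    }
    ctx->held[lock] = true;
}

static void toy_release(struct toy_ctx *ctx, enum toy_lock lock)
{
    ctx->held[lock] = false;
}

/* Models a mkwrite-style handler that needs the cluster lock and then
 * the page lock. Returns the number of ordering violations seen. */
static int toy_mkwrite(bool enter_with_page_lock)
{
    struct toy_ctx ctx = {0};

    if (enter_with_page_lock)
        toy_acquire(&ctx, TOY_PAGE_LOCK);   /* done by the caller */

    toy_acquire(&ctx, TOY_CLUSTER_LOCK);
    if (!enter_with_page_lock)
        toy_acquire(&ctx, TOY_PAGE_LOCK);

    /* ... allocation and zeroing of adjacent pages would go here ... */

    toy_release(&ctx, TOY_PAGE_LOCK);
    toy_release(&ctx, TOY_CLUSTER_LOCK);
    return ctx.inversions;
}
```

In this model `toy_mkwrite(true)` reports one inversion, while `toy_mkwrite(false)` reports none, which is the case for entering the handler without the page lock held.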

Thanks,
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh@oracle.com


* Re: page_mkwrite caller is racy?
  2007-01-30  1:51     ` Mark Fasheh
@ 2007-01-30 14:58       ` Anton Altaparmakov
  2007-01-31  1:18         ` Nick Piggin
  0 siblings, 1 reply; 9+ messages in thread
From: Anton Altaparmakov @ 2007-01-30 14:58 UTC
  To: Mark Fasheh
  Cc: Nick Piggin, Hugh Dickins, linux-kernel, Linux Memory Management,
	David Howells, Andrew Morton

On Mon, 29 Jan 2007, Mark Fasheh wrote:
> On Tue, Jan 30, 2007 at 12:14:24PM +1100, Nick Piggin wrote:
> > This is another discussion, but do we want the page locked here? Or
> > are the filesystems happy to exclude truncate themselves?
> 
> No page lock please. Generally, Ocfs2 wants to order cluster locks outside
> of page locks. Also, the sparse b-tree support I'm working on right now will
> need to be able to allocate in ->page_mkwrite() which would become very
> nasty if we came in with the page lock - aside from the additional cluster
> locks taken, ocfs2 will want to zero some adjacent pages (because we support
> atomic allocation up to 1 meg).

Ditto for NTFS.  I will need to lock pages on both sides of the page for 
large volume cluster sizes thus I will have to drop the page lock if it is 
already taken so it might as well not be...  Although I do not feel 
strongly about it.  If the page is locked I will just drop the lock and 
then take it again.  If possible to not have the page locked that would 
make my code a little easier/more efficient I expect...
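
To make "pages on both sides of the page" concrete, here is a minimal sketch of which page indices share a cluster with a faulting page. The names `toy_page_range` and `toy_cluster_pages` are hypothetical, and the sketch assumes the page size and cluster size are both powers of two, with clusters aligned to their own size.

```c
#include <stdint.h>

/* Inclusive range of page indices that a handler would have to lock. */
struct toy_page_range {
    uint64_t first;
    uint64_t last;
};

/* Hypothetical helper: find every page backed by the same cluster as
 * page_index. With clusters no larger than a page, only the faulting
 * page itself is involved. */
static struct toy_page_range toy_cluster_pages(uint64_t page_index,
                                               uint32_t page_size,
                                               uint32_t cluster_size)
{
    struct toy_page_range range = { page_index, page_index };

    if (cluster_size > page_size) {
        uint64_t pages_per_cluster = cluster_size / page_size;

        /* Round down to the first page of the containing cluster. */
        range.first = page_index - page_index % pages_per_cluster;
        range.last = range.first + pages_per_cluster - 1;
    }
    return range;
}
```

With 4 KiB pages and 64 KiB clusters, a fault on page 21 covers pages 16 through 31, so neighbours on both sides have to be locked along with the faulting page.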

Best regards,

	Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/


* Re: page_mkwrite caller is racy?
  2007-01-30 14:58       ` Anton Altaparmakov
@ 2007-01-31  1:18         ` Nick Piggin
  0 siblings, 0 replies; 9+ messages in thread
From: Nick Piggin @ 2007-01-31  1:18 UTC
  To: Anton Altaparmakov
  Cc: Mark Fasheh, Hugh Dickins, linux-kernel, Linux Memory Management,
	David Howells, Andrew Morton

Anton Altaparmakov wrote:
> On Mon, 29 Jan 2007, Mark Fasheh wrote:
> 
>>
>>No page lock please. Generally, Ocfs2 wants to order cluster locks outside
>>of page locks. Also, the sparse b-tree support I'm working on right now will
>>need to be able to allocate in ->page_mkwrite() which would become very
>>nasty if we came in with the page lock - aside from the additional cluster
>>locks taken, ocfs2 will want to zero some adjacent pages (because we support
>>atomic allocation up to 1 meg).
> 
> 
> Ditto for NTFS.  I will need to lock pages on both sides of the page for 
> large volume cluster sizes thus I will have to drop the page lock if it is 
> already taken so it might as well not be...  Although I do not feel 
> strongly about it.  If the page is locked I will just drop the lock and 
> then take it again.  If possible to not have the page locked that would 
> make my code a little easier/more efficient I expect...

OK, that makes sense. Thanks to you both.

-- 
SUSE Labs, Novell Inc.

* Re: page_mkwrite caller is racy?
  2007-01-29 10:20 page_mkwrite caller is racy? Nick Piggin
  2007-01-29 16:08 ` Hugh Dickins
@ 2007-01-29 20:00 ` Mark Fasheh
  2007-02-01 11:44 ` David Howells
  2 siblings, 0 replies; 9+ messages in thread
From: Mark Fasheh @ 2007-01-29 20:00 UTC
  To: Nick Piggin
  Cc: linux-kernel, Linux Memory Management, David Howells,
	Hugh Dickins, Andrew Morton

On Mon, Jan 29, 2007 at 09:20:58PM +1100, Nick Piggin wrote:
> But it is sad that this thing got merged without any callers to even know
> how it is intended to work. Must it be able to sleep?

Ocfs2 absolutely needs to be able to sleep in there in order to take cluster
locks, do allocation, etc. I suspect ext3 and other file systems will want
to sleep in there when they start caring about being able to allocate the
page before it gets written to.

For an example of what I'm talking about, there's a shared_writeable_mmap
branch in ocfs2.git which makes use of ->page_mkwrite(). It's got some other
small problems which need fixing (when I get the time to do so), but
generally it should illustrate what we're likely to do.

Thanks,
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh@oracle.com


* Re: page_mkwrite caller is racy?
  2007-01-29 10:20 page_mkwrite caller is racy? Nick Piggin
  2007-01-29 16:08 ` Hugh Dickins
  2007-01-29 20:00 ` Mark Fasheh
@ 2007-02-01 11:44 ` David Howells
  2 siblings, 0 replies; 9+ messages in thread
From: David Howells @ 2007-02-01 11:44 UTC
  To: Hugh Dickins
  Cc: Nick Piggin, linux-kernel, Linux Memory Management,
	David Howells, Andrew Morton

Hugh Dickins <hugh@veritas.com> wrote:

> > Must it be able to sleep?
> 
> Not as David was using it

It absolutely *must* be able to sleep.  It has to wait for FS-Cache to finish
writing the page to the cache before letting the PTE be made writable.

David
