* Can get_user_pages( ,write=1, force=1, ) result in a read-only pte and _count=2?
From: Robin Holt @ 2008-06-18 16:41 UTC
To: Nick Piggin, Ingo Molnar, Hugh Dickins
Cc: Christoph Lameter, Jack Steiner, linux-mm
I am running into a problem where I think a call to get_user_pages(...,
write=1, force=1,...) is returning a readable pte and a page ref count
of 2. I have not yet trapped the event, but I think I see one place
where this _may_ be happening.
In the sles10 kernel source:
int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
unsigned long start, int len, int write, int force,
struct page **pages, struct vm_area_struct **vmas)
{
...
retry:
cond_resched();
while (!(page = follow_page(vma, start, foll_flags))) {
int ret;
ret = __handle_mm_fault(mm, vma, start,
foll_flags & FOLL_WRITE);
...
/*
* The VM_FAULT_WRITE bit tells us that do_wp_page has
* broken COW when necessary, even if maybe_mkwrite
* decided not to set pte_write. We can thus safely do
* subsequent page lookups as if they were reads.
*/
if (ret & VM_FAULT_WRITE)
foll_flags &= ~FOLL_WRITE;
cond_resched();
}
The case I am seeing is under heavy memory pressure.
I think the first pass at follow_page has failed and we called
__handle_mm_fault(). At the time in __handle_mm_fault where the page table
is unlocked, there is a writable pte in the process's page table, and a
struct page with a reference count of 1. ret will have VM_FAULT_WRITE
set so the get_user_pages code will clear FOLL_WRITE from foll_flags.
Between the time above and the second attempt at follow_page, the
page gets swapped out. The second attempt at follow_page, now without
FOLL_WRITE (and FOLL_GET is set) will result in a read-only pte with a
reference count of 2. Any subsequent write fault by the process will
result in a COW break and the process pointing at a different page than
the get_user_pages() returned page.
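For concreteness, here is the suspected interleaving as a timeline (my
reconstruction of the scenario above; I have not trapped these values):
	gup:        follow_page() fails -> __handle_mm_fault(write fault)
	            writable pte installed, page _count == 1
	            ret has VM_FAULT_WRITE -> FOLL_WRITE cleared
	reclaim:    pte unmapped, page swapped out
	gup retry:  follow_page() fails -> __handle_mm_fault(read fault)
	            swapin installs a read-only pte; FOLL_GET takes a ref
	            caller now holds the page: _count == 2, read-only pte
	userspace:  later write fault -> do_wp_page will not reuse the page
	            -> COW again -> process and kernel see different pages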
Is this sequence plausible or am I missing something key?
If this sequence is plausible, I need to know how to either work around
this problem or if it should really be fixed in the kernel.
Thanks,
Robin Holt
* Re: Can get_user_pages( ,write=1, force=1, ) result in a read-only pte and _count=2?
From: Nick Piggin @ 2008-06-18 17:29 UTC
To: Robin Holt
Cc: Ingo Molnar, Hugh Dickins, Christoph Lameter, Jack Steiner, linux-mm
On Thursday 19 June 2008 02:41, Robin Holt wrote:
> I am running into a problem where I think a call to get_user_pages(...,
> write=1, force=1,...) is returning a readable pte and a page ref count
> of 2. I have not yet trapped the event, but I think I see one place
> where this _may_ be happening.
>
> In the sles10 kernel source:
> int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> unsigned long start, int len, int write, int force,
> struct page **pages, struct vm_area_struct **vmas)
> {
> ...
> retry:
> cond_resched();
> while (!(page = follow_page(vma, start, foll_flags))) {
> int ret;
> ret = __handle_mm_fault(mm, vma, start,
> foll_flags & FOLL_WRITE);
> ...
> /*
> * The VM_FAULT_WRITE bit tells us that do_wp_page has
> * broken COW when necessary, even if maybe_mkwrite
> * decided not to set pte_write. We can thus safely do
> * subsequent page lookups as if they were reads.
> */
> if (ret & VM_FAULT_WRITE)
> foll_flags &= ~FOLL_WRITE;
>
> cond_resched();
> }
>
> The case I am seeing is under heavy memory pressure.
>
> I think the first pass at follow_page has failed and we called
> __handle_mm_fault(). At the time in __handle_mm_fault where the page table
> is unlocked, there is a writable pte in the process's page table, and a
> struct page with a reference count of 1. ret will have VM_FAULT_WRITE
> set so the get_user_pages code will clear FOLL_WRITE from foll_flags.
>
> Between the time above and the second attempt at follow_page, the
> page gets swapped out. The second attempt at follow_page, now without
> FOLL_WRITE (and FOLL_GET is set) will result in a read-only pte with a
> reference count of 2.
There would not be a writeable pte in the page table, otherwise
VM_FAULT_WRITE should not get returned. But it can be returned via
other paths...
However, assuming it was returned, then mmap_sem is still held, so
the vma should not get changed from a writeable to a readonly one,
so I can't see the problem you're describing with that sequence.
A swap fault, for one, could return with VM_FAULT_WRITE, then
subsequently have its page swapped out, then set up a readonly pte
due to the __handle_mm_fault with write access cleared. *I think*.
But although that feels a bit unclean, I don't think it would cause
a problem because the previous VM_FAULT_WRITE (while under mmap_sem)
ensures our swap page should still be valid to write into via get
user pages (and a subsequent write access should cause do_wp_page to
go through the proper reuse logic and not COW).
> Any subsequent write fault by the process will
> result in a COW break and the process pointing at a different page than
> the get_user_pages() returned page.
>
> Is this sequence plausible or am I missing something key?
>
> If this sequence is plausible, I need to know how to either work around
> this problem or if it should really be fixed in the kernel.
I'd be interested to know the situation that leads to this problem.
If possible a test case would be ideal.
But, with force=1, it is possible to create private "COW" copies of pages
that have readonly ptes in the process page table, and that the process
never has permission to write into (these are "Linus pages").
This situation should not cause the process to be able to write into the
address and cause a further COW, but in the case of shared vmas, it will
cause the page to become disconnected from the file...
* Re: Can get_user_pages( ,write=1, force=1, ) result in a read-only pte and _count=2?
From: Hugh Dickins @ 2008-06-18 19:01 UTC
To: Nick Piggin
Cc: Robin Holt, Ingo Molnar, Christoph Lameter, Jack Steiner, linux-mm
On Thu, 19 Jun 2008, Nick Piggin wrote:
> On Thursday 19 June 2008 02:41, Robin Holt wrote:
> > I am running into a problem where I think a call to get_user_pages(...,
> > write=1, force=1,...) is returning a readable pte and a page ref count
I'm hoping Robin doesn't really need force=1 - can't you do what you
need with force=0, Robin? force=1 is weird and really only for ptrace
I think. And assuming that Robin meant to say "readonly pte" above.
> > of 2. I have not yet trapped the event, but I think I see one place
> > where this _may_ be happening.
> >
> > The case I am seeing is under heavy memory pressure.
> >
> > I think the first pass at follow_page has failed and we called
> > __handle_mm_fault(). At the time in __handle_mm_fault where the page table
> > is unlocked, there is a writable pte in the process's page table, and a
> > struct page with a reference count of 1. ret will have VM_FAULT_WRITE
> > set so the get_user_pages code will clear FOLL_WRITE from foll_flags.
> >
> > Between the time above and the second attempt at follow_page, the
> > page gets swapped out. The second attempt at follow_page, now without
> > FOLL_WRITE (and FOLL_GET is set) will result in a read-only pte with a
> > reference count of 2.
>
> There would not be a writeable pte in the page table, otherwise
> VM_FAULT_WRITE should not get returned. But it can be returned via
> other paths...
In his scenario, there wasn't a writeable pte originally; the
first call to handle_pte_fault instantiates the writeable pte
and returns with VM_FAULT_WRITE set.
>
> However, assuming it was returned, then mmap_sem is still held, so
> the vma should not get changed from a writeable to a readonly one,
> so I can't see the problem you're describing with that sequence.
The vma doesn't get changed, but the pte just instantiated writably
above, gets swapped out before the next follow_page, then brought
back in by the second call to handle_pte_fault. But this is with
FOLL_WRITE cleared, so it behaves as a read fault, and puts just
a readonly pte.
>
> A swap fault, for one, could return with VM_FAULT_WRITE, then
> subsequently have its page swapped out, then set up a readonly pte
> due to the __handle_mm_fault with write access cleared. *I think*.
Yes.
> But although that feels a bit unclean, I don't think it would cause
> a problem because the previous VM_FAULT_WRITE (while under mmap_sem)
> ensures our swap page should still be valid to write into via get
> user pages (and a subsequent write access should cause do_wp_page to
> go through the proper reuse logic and not COW).
I think perhaps Robin is wanting to write into the page both from the
kernel (hence the get_user_pages) and from userspace: but finding that
the attempt to write from userspace breaks COW again (because gup
raised the page count and it's a readonly pte), so they end up
writing into different pages. We know that COW didn't need to
be broken a second time, but do_wp_page doesn't know that.
> > Any subsequent write fault by the process will
> > result in a COW break and the process pointing at a different page than
> > the get_user_pages() returned page.
> >
> > Is this sequence plausible or am I missing something key?
> >
> > If this sequence is plausible, I need to know how to either work around
> > this problem or if it should really be fixed in the kernel.
>
> I'd be interested to know the situation that leads to this problem.
> If possible a test case would be ideal.
Might it help if do_wp_page returned VM_FAULT_WRITE (perhaps renamed)
only in the case where maybe_mkwrite decided not to mkwrite i.e. the
weird write=1,force=1 on readonly vma case?
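Concretely, something like this in do_wp_page's reuse path (a sketch
only, against 2.6.26-era mm/memory.c; the helper name is invented and
this is untested):
	/*
	 * Hypothetical: report the COW-broken case distinctly only when
	 * maybe_mkwrite() declined to set pte_write, i.e. the weird
	 * write=1,force=1 on readonly vma case, where looping on
	 * pte_write in get_user_pages really would be futile.
	 */
	static inline int wp_reuse_fault_flags(pte_t entry)
	{
		if (pte_write(entry))
			return 0;	/* gup can loop until it sees pte_write */
		return VM_FAULT_WRITE;	/* readonly vma: tell gup to stop */
	}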
Hugh
* Re: Can get_user_pages( ,write=1, force=1, ) result in a read-only pte and _count=2?
From: Robin Holt @ 2008-06-18 20:33 UTC
To: Hugh Dickins
Cc: Nick Piggin, Robin Holt, Ingo Molnar, Christoph Lameter,
Jack Steiner, linux-mm
On Wed, Jun 18, 2008 at 08:01:48PM +0100, Hugh Dickins wrote:
> On Thu, 19 Jun 2008, Nick Piggin wrote:
> > On Thursday 19 June 2008 02:41, Robin Holt wrote:
> > > I am running into a problem where I think a call to get_user_pages(...,
> > > write=1, force=1,...) is returning a readable pte and a page ref count
>
> I'm hoping Robin doesn't really need force=1 - can't you do what you
> need with force=0, Robin? force=1 is weird and really only for ptrace
> I think. And assuming that Robin meant to say "readonly pte" above.
I don't know the exact reason for force=1. The driver has been this
way for years and I don't recall the history. I can dig into that.
... Removed quoted text appended at the end of this email.
> I think perhaps Robin is wanting to write into the page both from the
> kernel (hence the get_user_pages) and from userspace: but finding that
> the attempt to write from userspace breaks COW again (because gup
> raised the page count and it's a readonly pte), so they end up
> writing into different pages. We know that COW didn't need to
> be broken a second time, but do_wp_page doesn't know that.
That is exactly the problem I think I am seeing. How should I be handling
this to get the correct behavior? As a test, should I be looking at
the process's page table to see if the pfn matches and is writable?
If it is not, should I put the page and redo the call to get_user_pages()?
> > > Any subsequent write fault by the process will
> > > result in a COW break and the process pointing at a different page than
> > > the get_user_pages() returned page.
> > >
> > > Is this sequence plausible or am I missing something key?
> > >
> > > If this sequence is plausible, I need to know how to either work around
> > > this problem or if it should really be fixed in the kernel.
> >
> > I'd be interested to know the situation that leads to this problem.
> > If possible a test case would be ideal.
>
> Might it help if do_wp_page returned VM_FAULT_WRITE (perhaps renamed)
> only in the case where maybe_mkwrite decided not to mkwrite i.e. the
> weird write=1,force=1 on readonly vma case?
I don't think it is in the return value, but rather the clearing of the
FOLL_WRITE flag. Is that being done to handle a force=1 where the vma
is marked readonly? Could follow_page handle the force case
differently?
What is the intent of force=1?
Thanks,
Robin Holt
Cut from above:
> > > of 2. I have not yet trapped the event, but I think I see one place
> > > where this _may_ be happening.
> > >
> > > The case I am seeing is under heavy memory pressure.
> > >
> > > I think the first pass at follow_page has failed and we called
> > > __handle_mm_fault(). At the time in __handle_mm_fault where the page table
> > > is unlocked, there is a writable pte in the process's page table, and a
> > > struct page with a reference count of 1. ret will have VM_FAULT_WRITE
> > > set so the get_user_pages code will clear FOLL_WRITE from foll_flags.
> > >
> > > Between the time above and the second attempt at follow_page, the
> > > page gets swapped out. The second attempt at follow_page, now without
> > > FOLL_WRITE (and FOLL_GET is set) will result in a read-only pte with a
> > > reference count of 2.
> >
> > There would not be a writeable pte in the page table, otherwise
> > VM_FAULT_WRITE should not get returned. But it can be returned via
> > other paths...
>
> In his scenario, there wasn't a writeable pte originally; the
> first call to handle_pte_fault instantiates the writeable pte
> and returns with VM_FAULT_WRITE set.
>
> >
> > However, assuming it was returned, then mmap_sem is still held, so
> > the vma should not get changed from a writeable to a readonly one,
> > so I can't see the problem you're describing with that sequence.
>
> The vma doesn't get changed, but the pte just instantiated writably
> above, gets swapped out before the next follow_page, then brought
> back in by the second call to handle_pte_fault. But this is with
> FOLL_WRITE cleared, so it behaves as a read fault, and puts just
> a readonly pte.
>
> >
> > A swap fault, for one, could return with VM_FAULT_WRITE, then
> > subsequently have its page swapped out, then set up a readonly pte
> > due to the __handle_mm_fault with write access cleared. *I think*.
>
> Yes.
>
> > But although that feels a bit unclean, I don't think it would cause
> > a problem because the previous VM_FAULT_WRITE (while under mmap_sem)
> > ensures our swap page should still be valid to write into via get
> > user pages (and a subsequent write access should cause do_wp_page to
> > go through the proper reuse logic and not COW).
>
* Re: Can get_user_pages( ,write=1, force=1, ) result in a read-only pte and _count=2?
From: Hugh Dickins @ 2008-06-18 21:46 UTC
To: Robin Holt
Cc: Nick Piggin, Ingo Molnar, Christoph Lameter, Jack Steiner, linux-mm
On Wed, 18 Jun 2008, Robin Holt wrote:
> On Wed, Jun 18, 2008 at 08:01:48PM +0100, Hugh Dickins wrote:
> > I think perhaps Robin is wanting to write into the page both from the
> > kernel (hence the get_user_pages) and from userspace: but finding that
> > the attempt to write from userspace breaks COW again (because gup
> > raised the page count and it's a readonly pte), so they end up
> > writing into different pages. We know that COW didn't need to
> > be broken a second time, but do_wp_page doesn't know that.
>
> That is exactly the problem I think I am seeing. How should I be handling
> this to get the correct behavior? As a test, should I be looking at
> the process's page table to see if the pfn matches and is writable?
> If it is not, should I put the page and redo the call to get_user_pages()?
I'd rather consider it a bug in get_user_pages than expect you to code
around it: it is the kind of job that get_user_pages is supposed to be
doing, but you've discovered a case it doesn't quite manage to handle
(if we're all interpreting things correctly).
But perhaps you need to work with distro kernels already out there:
in which case, yes, the procedure you describe above sounds right.
You're GPL, yes? Then I expect you can use apply_to_page_range()
to do the page table walking for you (but I've not used it myself).
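As an illustration only (an untested sketch with 2.6.26-ish signatures;
the helper names are invented, and the pte check is a heuristic done
under down_read of mmap_sem, without the page table lock):
	struct gup_check {
		struct page *page;	/* page returned by get_user_pages() */
		int ok;
	};
	static int gup_check_pte(pte_t *pte, pgtable_t token,
				 unsigned long addr, void *data)
	{
		struct gup_check *c = data;
		/* present, writable, and still the page we pinned? */
		if (pte_present(*pte) && pte_write(*pte) &&
		    pte_pfn(*pte) == page_to_pfn(c->page))
			c->ok = 1;
		return 0;
	}
	/* Caller holds down_read(&mm->mmap_sem). */
	static int pin_writable_page(struct task_struct *tsk,
				     struct mm_struct *mm,
				     unsigned long addr, struct page **pagep)
	{
		struct gup_check c;
		int ret;
		do {
			ret = get_user_pages(tsk, mm, addr, 1, 1, 0,
					     pagep, NULL);
			if (ret < 1)
				return ret;
			c.page = *pagep;
			c.ok = 0;
			apply_to_page_range(mm, addr, PAGE_SIZE,
					    gup_check_pte, &c);
			if (!c.ok)
				put_page(*pagep);	/* raced: retry */
		} while (!c.ok);
		return 1;
	}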
> > Might it help if do_wp_page returned VM_FAULT_WRITE (perhaps renamed)
> > only in the case where maybe_mkwrite decided not to mkwrite i.e. the
> > weird write=1,force=1 on readonly vma case?
>
> I don't think it is in the return value, but rather the clearing of the
> FOLL_WRITE flag. Is that being done to handle a force=1 where the vma
> is marked readonly?
Yes, and that's a much better way of changing it than I was suggesting.
Does the patch below work for you? Does it look sensible to others?
> Could follow_page handle the force case differently?
IIRC it used to, but that proved unsatisfactory,
and the VM_FAULT_WRITE notification replaced that.
>
> What is the intent of force=1?
Ah. Ignoring the readonly case (which has never been problematic:
I've never had to think about it, I think it allows get_user_pages
to access PROT_NONE areas), write=1,force=1 is intended to allow
ptrace to modify a user-readonly area (e.g. set breakpoint in text),
avoiding the danger of its modifications leaking back into the file
which has been mapped.
But it's weird because it requires VM_MAYWRITE, which means if there
was any chance of the modification leaking back into the shared file,
it must have been opened for reading and writing, so the user process
has actually got permission to modify it. And it's weird because it
causes COWs in a shared mapping which you normally think could never
contain COWs - I used to rail against it for that reason, but in the
end did an audit and couldn't find any place where that violation of
our assumptions actually mattered enough to get so excited.
Hugh
--- 2.6.26-rc6/mm/memory.c 2008-05-26 20:00:39.000000000 +0100
+++ linux/mm/memory.c 2008-06-18 22:06:46.000000000 +0100
@@ -1152,9 +1152,15 @@ int get_user_pages(struct task_struct *t
* do_wp_page has broken COW when necessary,
* even if maybe_mkwrite decided not to set
* pte_write. We can thus safely do subsequent
- * page lookups as if they were reads.
+ * page lookups as if they were reads. But only
+ * do so when looping for pte_write is futile:
+ * in some cases userspace may also be wanting
+ * to write to the gotten user page, which a
+ * read fault here might prevent (a readonly
+ * page would get reCOWed by userspace write).
*/
- if (ret & VM_FAULT_WRITE)
+ if ((ret & VM_FAULT_WRITE) &&
+ !(vma->vm_flags & VM_WRITE))
foll_flags &= ~FOLL_WRITE;
cond_resched();
* Re: Can get_user_pages( ,write=1, force=1, ) result in a read-only pte and _count=2?
From: Nick Piggin @ 2008-06-19 3:07 UTC
To: Hugh Dickins
Cc: Robin Holt, Ingo Molnar, Christoph Lameter, Jack Steiner, linux-mm
On Thursday 19 June 2008 05:01, Hugh Dickins wrote:
> On Thu, 19 Jun 2008, Nick Piggin wrote:
> > But although that feels a bit unclean, I don't think it would cause
> > a problem because the previous VM_FAULT_WRITE (while under mmap_sem)
> > ensures our swap page should still be valid to write into via get
> > user pages (and a subsequent write access should cause do_wp_page to
> > go through the proper reuse logic and not COW).
>
> I think perhaps Robin is wanting to write into the page both from the
> kernel (hence the get_user_pages) and from userspace: but finding that
> the attempt to write from userspace breaks COW again (because gup
> raised the page count and it's a readonly pte), so they end up
> writing into different pages. We know that COW didn't need to
> be broken a second time, but do_wp_page doesn't know that.
I'm having trouble seeing the path that leads to this situation. I
can't see what the significance of the elevated page count is.
We're talking about swap pages, as in do_swap_page? Then AFAIKS it
is only the mapcount that is taken into account, and get_user_pages
will first break COW, but that should set mapcount back to 1, in
which case the userspace access should notice that in do_swap_page
and prevent the 2nd COW from happening.
Unless... hmm, no, it can also be called directly via handle_pte_fault,
and if it happens to fail the trylock_page, I think I do see how it
can be COWed. But it doesn't seem to have anything to do with page
count so I don't know if I'm on the right track or maybe missing the
obvious...
* Re: Can get_user_pages( ,write=1, force=1, ) result in a read-only pte and _count=2?
From: Nick Piggin @ 2008-06-19 3:31 UTC
To: Hugh Dickins
Cc: Robin Holt, Ingo Molnar, Christoph Lameter, Jack Steiner, linux-mm
On Thursday 19 June 2008 07:46, Hugh Dickins wrote:
> contain COWs - I used to rail against it for that reason, but in the
> end did an audit and couldn't find any place where that violation of
> our assumptions actually mattered enough to get so excited.
Still, they're slightly troublesome, as our get_user_pages problems
demonstrate :)
>
> Hugh
>
> --- 2.6.26-rc6/mm/memory.c 2008-05-26 20:00:39.000000000 +0100
> +++ linux/mm/memory.c 2008-06-18 22:06:46.000000000 +0100
> @@ -1152,9 +1152,15 @@ int get_user_pages(struct task_struct *t
> * do_wp_page has broken COW when necessary,
> * even if maybe_mkwrite decided not to set
> * pte_write. We can thus safely do subsequent
> - * page lookups as if they were reads.
> + * page lookups as if they were reads. But only
> + * do so when looping for pte_write is futile:
> + * in some cases userspace may also be wanting
> + * to write to the gotten user page, which a
> + * read fault here might prevent (a readonly
> + * page would get reCOWed by userspace write).
> */
> - if (ret & VM_FAULT_WRITE)
> + if ((ret & VM_FAULT_WRITE) &&
> + !(vma->vm_flags & VM_WRITE))
> foll_flags &= ~FOLL_WRITE;
>
> cond_resched();
Hmm, doesn't this give the same problem for !VM_WRITE vmas? If you
called get_user_pages again, isn't that going to cause another COW
on the already-COWed page that we're hoping to write into? (not sure
about mprotect either, could that be used to make the vma writeable
afterwards and then write to it?)
I would rather (if my reading of the code is correct) make the
trylock page into a full lock_page. The indeterminism of the trylock
has always bugged me anyway... Shouldn't that cause a swap page not
to get reCOWed if we have the only mapping to it?
If the lock_page cost bothers you, we could do a quick unlocked check
on page_mapcount > 1 before taking the lock (which would also avoid
the extra atomic ops and barriers in many cases where the page really
is shared)
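As a sketch of what I mean (untested, 2.6.26-era names, assuming the
usual do_wp_page locals; the ptl has to be dropped around the sleeping
lock_page, with a pte_same recheck afterwards):
	if (PageAnon(old_page)) {
		if (page_mapcount(old_page) > 1) {
			/* clearly shared: must COW; skip the lock_page
			 * and its atomic ops entirely */
			reuse = 0;
		} else {
			/* possibly exclusive: take the full page lock
			 * for an accurate can_share_swap_page() */
			page_cache_get(old_page);
			pte_unmap_unlock(page_table, ptl);
			lock_page(old_page);
			reuse = can_share_swap_page(old_page);
			unlock_page(old_page);
			page_table = pte_offset_map_lock(mm, pmd,
							 address, &ptl);
			page_cache_release(old_page);
			if (!pte_same(*page_table, orig_pte))
				goto unlock;
		}
	}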
* Re: Can get_user_pages( ,write=1, force=1, ) result in a read-only pte and _count=2?
From: Nick Piggin @ 2008-06-19 3:34 UTC
To: Hugh Dickins
Cc: Robin Holt, Ingo Molnar, Christoph Lameter, Jack Steiner, linux-mm
On Thursday 19 June 2008 13:31, Nick Piggin wrote:
> On Thursday 19 June 2008 07:46, Hugh Dickins wrote:
> > contain COWs - I used to rail against it for that reason, but in the
> > end did an audit and couldn't find any place where that violation of
> > our assumptions actually mattered enough to get so excited.
>
> Still, they're slightly troublesome, as our get_user_pages problems
> demonstrate :)
>
> > Hugh
> >
> > --- 2.6.26-rc6/mm/memory.c 2008-05-26 20:00:39.000000000 +0100
> > +++ linux/mm/memory.c 2008-06-18 22:06:46.000000000 +0100
> > @@ -1152,9 +1152,15 @@ int get_user_pages(struct task_struct *t
> > * do_wp_page has broken COW when necessary,
> > * even if maybe_mkwrite decided not to set
> > * pte_write. We can thus safely do subsequent
> > - * page lookups as if they were reads.
> > + * page lookups as if they were reads. But only
> > + * do so when looping for pte_write is futile:
> > + * in some cases userspace may also be wanting
> > + * to write to the gotten user page, which a
> > + * read fault here might prevent (a readonly
> > + * page would get reCOWed by userspace write).
> > */
> > - if (ret & VM_FAULT_WRITE)
> > + if ((ret & VM_FAULT_WRITE) &&
> > + !(vma->vm_flags & VM_WRITE))
> > foll_flags &= ~FOLL_WRITE;
> >
> > cond_resched();
>
> Hmm, doesn't this give the same problem for !VM_WRITE vmas? If you
> called get_user_pages again, isn't that going to cause another COW
> on the already-COWed page that we're hoping to write into? (not sure
> about mprotect either, could that be used to make the vma writeable
> afterwards and then write to it?)
>
> I would rather (if my reading of the code is correct) make the
> trylock page into a full lock_page. The indeterminism of the trylock
> has always bugged me anyway... Shouldn't that cause a swap page not
> to get reCOWed if we have the only mapping to it?
Although, I have to qualify that. I do like this patch, if only
because it avoids this crazy corner case completely for the
common force=0 case. So I would like to see this patch go in
(hopefully it doesn't just push the obscure bugs further under the
radar!)
* Re: Can get_user_pages( ,write=1, force=1, ) result in a read-only pte and _count=2?
From: Hugh Dickins @ 2008-06-19 11:09 UTC
To: Nick Piggin
Cc: Robin Holt, Ingo Molnar, Christoph Lameter, Jack Steiner, linux-mm
On Thu, 19 Jun 2008, Nick Piggin wrote:
> On Thursday 19 June 2008 05:01, Hugh Dickins wrote:
> > On Thu, 19 Jun 2008, Nick Piggin wrote:
>
> > > But although that feels a bit unclean, I don't think it would cause
> > > a problem because the previous VM_FAULT_WRITE (while under mmap_sem)
> > > ensures our swap page should still be valid to write into via get
> > > user pages (and a subsequent write access should cause do_wp_page to
> > > go through the proper reuse logic and not COW).
> >
> > I think perhaps Robin is wanting to write into the page both from the
> > kernel (hence the get_user_pages) and from userspace: but finding that
> > the attempt to write from userspace breaks COW again (because gup
> > raised the page count and it's a readonly pte), so they end up
> > writing into different pages. We know that COW didn't need to
> > be broken a second time, but do_wp_page doesn't know that.
>
> I'm having trouble seeing the path that leads to this situation. I
> can't see what the significance of the elevated page count is.
The trouble is, you're looking at what can_share_swap_page actually
does, instead of letting your mind regress a few years to what it
used to do before we had page_mapcount ;)
Yes, sorry, my page count "explanation" is nonsense.
>
> We're talking about swap pages, as in do_swap_page? Then AFAIKS it
> is only the mapcount that is taken into account, and get_user_pages
> will first break COW, but that should set mapcount back to 1, in
> which case the userspace access should notice that in do_swap_page
> and prevent the 2nd COW from happening.
(I assume Robin is not forking; we do know that causes this kind
of problem, but he didn't mention any forking, so I assume not.)
>
> Unless... hmm, no, it can also be called directly via handle_pte_fault,
> and if it happens to fail the trylock_page, I think I do see how it
> can be COWed. But it doesn't seem to have anything to do with page
> count so I don't know if I'm on the right track or maybe missing the
> obvious...
The !TestSetPageLocked (there, now I'm looking at the source!).
Yes, I suppose if it goes the wrong way on that, it would account
for it; though it'd be nice to have some confirmation that's what's
happening.
Over to your next mail...
Hugh
* Re: Can get_user_pages( ,write=1, force=1, ) result in a read-only pte and _count=2?
From: Hugh Dickins @ 2008-06-19 11:39 UTC
To: Nick Piggin
Cc: Robin Holt, Ingo Molnar, Christoph Lameter, Jack Steiner, linux-mm
On Thu, 19 Jun 2008, Nick Piggin wrote:
> On Thursday 19 June 2008 07:46, Hugh Dickins wrote:
>
> > contain COWs - I used to rail against it for that reason, but in the
> > end did an audit and couldn't find any place where that violation of
> > our assumptions actually mattered enough to get so excited.
>
> Still, they're slightly troublesome, as our get_user_pages problems
> demonstrate :)
Indeed, but fighting Linus over them does get tiring. (And now's
not the time for another fight, after Oleg's ZERO_PAGE discovery!)
> > - if (ret & VM_FAULT_WRITE)
> > + if ((ret & VM_FAULT_WRITE) &&
> > + !(vma->vm_flags & VM_WRITE))
> > foll_flags &= ~FOLL_WRITE;
> Hmm, doesn't this give the same problem for !VM_WRITE vmas? If you
> called get_user_pages again, isn't that going to cause another COW
> on the already-COWed page that we're hoping to write into?
Sure, if it fixes any issue at all (now very much in doubt),
it only fixes it for writing to VM_WRITE vmas; but that should
be the only case normal people are concerned with - and Robin's
userspace is trying to write to the page, so it better have VM_WRITE.
> (not sure
> about mprotect either, could that be used to make the vma writeable
> afterwards and then write to it?)
Er, yes, but I didn't get the point you were trying to make there.
>
> I would rather (if my reading of the code is correct) make the
> trylock page into a full lock_page. The indeterminism of the trylock
> has always bugged me anyway... Shouldn't that cause a swap page not
> to get reCOWed if we have the only mapping to it?
>
> If the lock_page cost bothers you, we could do a quick unlocked check
> on page_mapcount > 1 before taking the lock (which would also avoid
> the extra atomic ops and barriers in many cases where the page really
> is shared)
That indeterminism has certainly bothered me too. There was another
interesting case which it interfered with, a year or two back. I'll
have to search mboxes later to locate it.
We do have page table lock at that point, so it gets a bit tedious
(like the page_mkwrite case) to use lock_page there: more overhead
than just that of the lock_page.
I've had a quick look at my collection of uncompleted/unpublished
swap patches, and here's a hunk from one of them which is trying
to address that point. But I'll have to look back and see what
else this depends upon.
- if (!TestSetPageLocked(old_page)) {
- reuse = can_share_swap_page(old_page);
- unlock_page(old_page);
+ if (page_mapcount(old_page) == 1) {
+ extern int page_swapcount(struct page *);
+ if (!PageSwapCache(old_page))
+ reuse = 1;
+ else if (!TestSetPageLocked(old_page)) {
+ reuse = !page_is_shared(old_page);
+ unlock_page(old_page);
+ } else if (!page_swapcount(old_page))
+ reuse = 1;
I probably won't get back to this today. And there are also good
reasons in -mm for me to check back on all these swapcount issues.
Hugh
* Re: Can get_user_pages( ,write=1, force=1, ) result in a read-only pte and _count=2?
From: Nick Piggin @ 2008-06-19 12:07 UTC
To: Hugh Dickins
Cc: Robin Holt, Ingo Molnar, Christoph Lameter, Jack Steiner, linux-mm
On Thursday 19 June 2008 21:39, Hugh Dickins wrote:
> That indeterminism has certainly bothered me too. There was another
> interesting case which it interfered with, a year or two back. I'll
> have to search mboxes later to locate it.
>
> We do have page table lock at that point, so it gets a bit tedious
> (like the page_mkwrite case) to use lock_page there: more overhead
> than just that of the lock_page.
Mmm, right.
> I've had a quick look at my collection of uncompleted/unpublished
> swap patches, and here's a hunk from one of them which is trying
> to address that point. But I'll have to look back and see what
> else this depends upon.
>
> - if (!TestSetPageLocked(old_page)) {
> - reuse = can_share_swap_page(old_page);
> - unlock_page(old_page);
> + if (page_mapcount(old_page) == 1) {
> + extern int page_swapcount(struct page *);
> + if (!PageSwapCache(old_page))
> + reuse = 1;
> + else if (!TestSetPageLocked(old_page)) {
> + reuse = !page_is_shared(old_page);
> + unlock_page(old_page);
> + } else if (!page_swapcount(old_page))
> + reuse = 1;
>
> I probably won't get back to this today. And there are also good
> reasons in -mm for me to check back on all these swapcount issues.
I don't see how you can get an accurate page_swapcount without
the page lock. Anyway, if you volunteer to take a look at the
problem, great. I expect Robin could just as well fix it for
their code in the meantime by using force=0...
* Re: Can get_user_pages( ,write=1, force=1, ) result in a read-only pte and _count=2?
From: Nick Piggin @ 2008-06-19 12:21 UTC
To: Hugh Dickins
Cc: Robin Holt, Ingo Molnar, Christoph Lameter, Jack Steiner, linux-mm
On Thursday 19 June 2008 22:07, Nick Piggin wrote:
> On Thursday 19 June 2008 21:39, Hugh Dickins wrote:
> > I probably won't get back to this today. And there are also good
> > reasons in -mm for me to check back on all these swapcount issues.
>
> I don't see how you can get an accurate page_swapcount without
> the page lock. Anyway, if you volunteer to take a look at the
> problem, great. I expect Robin could just as well fix it for
> their code in the meantime by using force=0...
You could always use another page flag, of course ;)
Or get rid of Linus pages
* Re: Can get_user_pages( ,write=1, force=1, ) result in a read-only pte and _count=2?
From: Hugh Dickins @ 2008-06-19 12:34 UTC
To: Nick Piggin
Cc: Robin Holt, Ingo Molnar, Christoph Lameter, Jack Steiner, linux-mm
On Thu, 19 Jun 2008, Nick Piggin wrote:
> On Thursday 19 June 2008 21:39, Hugh Dickins wrote:
>
> > I've had a quick look at my collection of uncompleted/unpublished
> > swap patches, and here's a hunk from one of them which is trying
> > to address that point. But I'll have to look back and see what
> > else this depends upon.
> >
> > - if (!TestSetPageLocked(old_page)) {
> > - reuse = can_share_swap_page(old_page);
> > - unlock_page(old_page);
> > + if (page_mapcount(old_page) == 1) {
> > + extern int page_swapcount(struct page *);
> > + if (!PageSwapCache(old_page))
> > + reuse = 1;
> > + else if (!TestSetPageLocked(old_page)) {
> > + reuse = !page_is_shared(old_page);
> > + unlock_page(old_page);
> > + } else if (!page_swapcount(old_page))
> > + reuse = 1;
> >
> > I probably won't get back to this today. And there are also good
> > reasons in -mm for me to check back on all these swapcount issues.
>
> I don't see how you can get an accurate page_swapcount without
> the page lock.
I doubt it's an accurate swapcount, just a case where one can be
sure of !page_swapcount. It's certainly not something to take on
trust; these are patches I need to be sceptical about and refresh my mind on.
> Anyway, if you volunteer to take a look at the problem, great.
I do.
> I expect Robin could just as well fix it for
> their code in the meantime by using force=0...
Sorry, please explain, I don't see that: though their driver happens
to say force=1, I don't think it's needed and I don't think it's
making any difference in this case.
Hugh
* Re: Can get_user_pages( ,write=1, force=1, ) result in a read-only pte and _count=2?
From: Nick Piggin @ 2008-06-19 12:53 UTC
To: Hugh Dickins
Cc: Robin Holt, Ingo Molnar, Christoph Lameter, Jack Steiner, linux-mm
On Thursday 19 June 2008 22:34, Hugh Dickins wrote:
> On Thu, 19 Jun 2008, Nick Piggin wrote:
> > On Thursday 19 June 2008 21:39, Hugh Dickins wrote:
> > > I've had a quick look at my collection of uncompleted/unpublished
> > > swap patches, and here's a hunk from one of them which is trying
> > > to address that point. But I'll have to look back and see what
> > > else this depends upon.
> > >
> > > - if (!TestSetPageLocked(old_page)) {
> > > - reuse = can_share_swap_page(old_page);
> > > - unlock_page(old_page);
> > > + if (page_mapcount(old_page) == 1) {
> > > + extern int page_swapcount(struct page *);
> > > + if (!PageSwapCache(old_page))
> > > + reuse = 1;
> > > + else if (!TestSetPageLocked(old_page)) {
> > > + reuse = !page_is_shared(old_page);
> > > + unlock_page(old_page);
> > > + } else if (!page_swapcount(old_page))
> > > + reuse = 1;
> > >
> > > I probably won't get back to this today. And there are also good
> > > reasons in -mm for me to check back on all these swapcount issues.
> >
> > I don't see how you can get an accurate page_swapcount without
> > the page lock.
>
> I doubt it's an accurate swapcount, just a case where one can be
> sure of !page_swapcount. It's certainly not something to take on
> trust; these are patches I need to be sceptical about and refresh my mind on.
I don't know if you can be sure of that, because after checking
page_mapcount, but before checking page_swapcount, can't another
process have moved its swapcount to mapcount?
> > Anyway, if you volunteer to take a look at the problem, great.
>
> I do.
Thanks
> > I expect Robin could just as well fix it for
> > their code in the meantime by using force=0...
>
> Sorry, please explain, I don't see that: though their driver happens
> to say force=1, I don't think it's needed and I don't think it's
> making any difference in this case.
Oh, I missed that. You're now thinking they do have VM_WRITE on
the vma and hence your patch isn't going to work (and neither will
force=0). OK, that sounds right to me.
* Re: Can get_user_pages( ,write=1, force=1, ) result in a read-only pte and _count=2?
From: Hugh Dickins @ 2008-06-19 13:25 UTC
To: Nick Piggin
Cc: Robin Holt, Ingo Molnar, Christoph Lameter, Jack Steiner, linux-mm
On Thu, 19 Jun 2008, Nick Piggin wrote:
> On Thursday 19 June 2008 22:34, Hugh Dickins wrote:
> >
> > I doubt it's an accurate swapcount, just a case where one can be
> > sure of !page_swapcount. It's certainly not something to take on
> > trust; these are patches I need to be sceptical about and refresh my mind on.
>
> I don't know if you can be sure of that, because after checking
> page_mapcount, but before checking page_swapcount, can't another
> process have moved its swapcount to mapcount?
Obviously that's the concern. I need to go over the whole patch
and refresh my mind on this area before I can give you an answer.
> > > I expect Robin could just as well fix it for
> > > their code in the meantime by using force=0...
> >
> > Sorry, please explain, I don't see that: though their driver happens
> > to say force=1, I don't think it's needed and I don't think it's
> > making any difference in this case.
>
> Oh, I missed that. You're now thinking they do have VM_WRITE on
> the vma and hence your patch isn't going to work (and neither
> force=0). OK, that sounds right to me.
I'm still confused. I thought all along that they have VM_WRITE on
the vma, which Robin has (by implication) confirmed when he says that
userspace is trying to write to the same page - I don't think he'd
expect it to be able to do so without VM_WRITE.
And my gup patch may (I'm unsure, I haven't tried to picture the whole
sequence again) still be useful in the case that they do have VM_WRITE,
but it would make no difference if they didn't have VM_WRITE.
Hugh
* Re: Can get_user_pages( ,write=1, force=1, ) result in a read-only pte and _count=2?
From: Robin Holt @ 2008-06-19 13:35 UTC
To: Hugh Dickins
Cc: Nick Piggin, Robin Holt, Ingo Molnar, Christoph Lameter,
Jack Steiner, linux-mm
On Thu, Jun 19, 2008 at 02:25:10PM +0100, Hugh Dickins wrote:
> On Thu, 19 Jun 2008, Nick Piggin wrote:
> > On Thursday 19 June 2008 22:34, Hugh Dickins wrote:
> > Oh, I missed that. You're now thinking they do have VM_WRITE on
> > the vma and hence your patch isn't going to work (and neither will
> > force=0). OK, that sounds right to me.
>
> I'm still confused. I thought all along that they have VM_WRITE on
> the vma, which Robin has (by implication) confirmed when he says that
> userspace is trying to write to the same page - I don't think he'd
> expect it to be able to do so without VM_WRITE.
It does have VM_WRITE set. I do expect this patch to at least change
the problem. I am also working on testing (separately) with force=0 to
verify that does not introduce other regressions. I am doing this
testing against a sles10 kernel and not Linus' latest and greatest. I
will try to test Linus' kernel later, but that will take more time.
Thanks,
Robin
* Re: Can get_user_pages( ,write=1, force=1, ) result in a read-only pte and _count=2?
From: Robin Holt @ 2008-06-19 13:38 UTC
To: Hugh Dickins
Cc: Nick Piggin, Robin Holt, Ingo Molnar, Christoph Lameter,
Jack Steiner, linux-mm
On Thu, Jun 19, 2008 at 12:09:15PM +0100, Hugh Dickins wrote:
> On Thu, 19 Jun 2008, Nick Piggin wrote:
> > On Thursday 19 June 2008 05:01, Hugh Dickins wrote:
> > > On Thu, 19 Jun 2008, Nick Piggin wrote:
> > We're talking about swap pages, as in do_swap_page? Then AFAIKS it
> > is only the mapcount that is taken into account, and get_user_pages
> > will first break COW, but that should set mapcount back to 1, in
> > which case the userspace access should notice that in do_swap_page
> > and prevent the 2nd COW from happening.
>
> (I assume Robin is not forking; we do know that causes this kind
> of problem, but he didn't mention any forking, so I assume not.)
There has been a fork, but long before this mapping was created. There was
a hole at this location; the mapping gets established and the pages get
populated after all ranks of the MPI job have initialized.
Thanks,
Robin
* Re: Can get_user_pages( ,write=1, force=1, ) result in a read-only pte and _count=2?
From: Hugh Dickins @ 2008-06-19 13:49 UTC
To: Robin Holt
Cc: Nick Piggin, Ingo Molnar, Christoph Lameter, Jack Steiner, linux-mm
On Thu, 19 Jun 2008, Robin Holt wrote:
> On Thu, Jun 19, 2008 at 12:09:15PM +0100, Hugh Dickins wrote:
> >
> > (I assume Robin is not forking; we do know that causes this kind
> > of problem, but he didn't mention any forking, so I assume not.)
>
> There has been a fork, but long before this mapping was created. There was
> a hole at this location; the mapping gets established and the pages get
> populated after all ranks of the MPI job have initialized.
There's usually been a fork somewhen in the past! That's no problem.
The fork problem comes when someone has done a get_user_pages to break
all the COWs, then another thread does a fork which writeprotects and
raises page_mapcount, so the next write from userspace breaks COW again
and writes to a different page from that which the kernel is holding.
That one kept on coming up, but I've not heard of it again since we
added madvise MADV_DONTFORK so apps could exclude such parts of the
address space from copy_page_range.
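For reference, the userspace side is just this (illustrative; buf and
len stand for whatever region the driver has pinned):
	#include <stddef.h>
	#include <stdio.h>
	#include <sys/mman.h>
	/* Exclude a gup-pinned buffer from fork(): the child sees a
	 * hole here instead of a COW write-protected mapping. */
	static void exclude_from_fork(void *buf, size_t len)
	{
		if (madvise(buf, len, MADV_DONTFORK))
			perror("madvise(MADV_DONTFORK)");
	}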
Hugh
* Re: Can get_user_pages( ,write=1, force=1, ) result in a read-only pte and _count=2?
From: Robin Holt @ 2008-06-19 16:32 UTC
To: Hugh Dickins
Cc: Robin Holt, Nick Piggin, Ingo Molnar, Christoph Lameter,
Jack Steiner, linux-mm
On Wed, Jun 18, 2008 at 10:46:09PM +0100, Hugh Dickins wrote:
> On Wed, 18 Jun 2008, Robin Holt wrote:
> > On Wed, Jun 18, 2008 at 08:01:48PM +0100, Hugh Dickins wrote:
> --- 2.6.26-rc6/mm/memory.c 2008-05-26 20:00:39.000000000 +0100
> +++ linux/mm/memory.c 2008-06-18 22:06:46.000000000 +0100
> @@ -1152,9 +1152,15 @@ int get_user_pages(struct task_struct *t
> * do_wp_page has broken COW when necessary,
> * even if maybe_mkwrite decided not to set
> * pte_write. We can thus safely do subsequent
> - * page lookups as if they were reads.
> + * page lookups as if they were reads. But only
> + * do so when looping for pte_write is futile:
> + * in some cases userspace may also be wanting
> + * to write to the gotten user page, which a
> + * read fault here might prevent (a readonly
> + * page would get reCOWed by userspace write).
> */
> - if (ret & VM_FAULT_WRITE)
> + if ((ret & VM_FAULT_WRITE) &&
> + !(vma->vm_flags & VM_WRITE))
> foll_flags &= ~FOLL_WRITE;
>
> cond_resched();
I applied the equivalent of this to the sles10 kernel and still saw the
problem. I also changed the driver to use force=0 and gave more memory
to the test. That passed in the same way as force=1. I then restricted
memory and got the same failure.
I am not convinced yet that we can use force=0, since I do not recall
the reason force=1 was used. I will look into that separately
from this.
I am working on putting in a trap to detect the problem closer to the
time of failure.
Thanks,
Robin Holt
* Re: Can get_user_pages( ,write=1, force=1, ) result in a read-only pte and _count=2?
From: Christoph Lameter @ 2008-06-19 17:48 UTC
To: Nick Piggin; +Cc: Hugh Dickins, Robin Holt, Ingo Molnar, Jack Steiner, linux-mm
On Thu, 19 Jun 2008, Nick Piggin wrote:
> You could always use another page flag, of course ;)
Some are available now. If you are fast you will get one...
* Re: Can get_user_pages( ,write=1, force=1, ) result in a read-only pte and _count=2?
From: Nick Piggin @ 2008-06-20 9:23 UTC
To: Robin Holt
Cc: Hugh Dickins, Ingo Molnar, Christoph Lameter, Jack Steiner, linux-mm
On Friday 20 June 2008 02:32, Robin Holt wrote:
> On Wed, Jun 18, 2008 at 10:46:09PM +0100, Hugh Dickins wrote:
> > On Wed, 18 Jun 2008, Robin Holt wrote:
> > > On Wed, Jun 18, 2008 at 08:01:48PM +0100, Hugh Dickins wrote:
> >
> > --- 2.6.26-rc6/mm/memory.c 2008-05-26 20:00:39.000000000 +0100
> > +++ linux/mm/memory.c 2008-06-18 22:06:46.000000000 +0100
> > @@ -1152,9 +1152,15 @@ int get_user_pages(struct task_struct *t
> > * do_wp_page has broken COW when necessary,
> > * even if maybe_mkwrite decided not to set
> > * pte_write. We can thus safely do subsequent
> > - * page lookups as if they were reads.
> > + * page lookups as if they were reads. But only
> > + * do so when looping for pte_write is futile:
> > + * in some cases userspace may also be wanting
> > + * to write to the gotten user page, which a
> > + * read fault here might prevent (a readonly
> > + * page would get reCOWed by userspace write).
> > */
> > - if (ret & VM_FAULT_WRITE)
> > + if ((ret & VM_FAULT_WRITE) &&
> > + !(vma->vm_flags & VM_WRITE))
> > foll_flags &= ~FOLL_WRITE;
> >
> > cond_resched();
>
> I applied the equivalent of this to the sles10 kernel and still saw the
> problem.
If you were able to test my hypothesis, that might help while Hugh
comes up with a more efficient solution (this adds 3 atomic ops in
the do_wp_page path for anonymous pages)...
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c 2008-06-20 19:16:20.000000000 +1000
+++ linux-2.6/mm/memory.c 2008-06-20 19:19:08.000000000 +1000
@@ -1677,10 +1677,15 @@
* not dirty accountable.
*/
if (PageAnon(old_page)) {
- if (!TestSetPageLocked(old_page)) {
- reuse = can_share_swap_page(old_page);
- unlock_page(old_page);
- }
+ page_cache_get(old_page);
+ pte_unmap_unlock(page_table, ptl);
+ lock_page(old_page);
+ reuse = can_share_swap_page(old_page);
+ unlock_page(old_page);
+ page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+ page_cache_release(old_page);
+ if (!pte_same(*page_table, orig_pte))
+ goto unlock;
} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
(VM_WRITE|VM_SHARED))) {
/*
* Re: Can get_user_pages( ,write=1, force=1, ) result in a read-only pte and _count=2?
From: Robin Holt @ 2008-06-23 15:54 UTC
To: Hugh Dickins
Cc: Robin Holt, Nick Piggin, Ingo Molnar, Christoph Lameter,
Jack Steiner, linux-mm
On Thu, Jun 19, 2008 at 02:49:50PM +0100, Hugh Dickins wrote:
> On Thu, 19 Jun 2008, Robin Holt wrote:
> > On Thu, Jun 19, 2008 at 12:09:15PM +0100, Hugh Dickins wrote:
> > >
> > > (I assume Robin is not forking, we do know that causes this kind
> > > of problem, but he didn't mention any forking so I assume not.)
> >
> > There has been a fork, but long before this mapping was created. There was
> > a hole at this location; the mapping gets established and the pages get
> > populated after all ranks of the MPI job have initialized.
>
> There's usually been a fork somewhen in the past! That's no problem.
>
> The fork problem comes when someone has done a get_user_pages to break
> all the COWs, then another thread does a fork which writeprotects and
> raises page_mapcount, so the next write from userspace breaks COW again
> and writes to a different page from that which the kernel is holding.
>
> That one kept on coming up, but I've not heard of it again since we
> added madvise MADV_DONTFORK so apps could exclude such parts of the
> address space from copy_page_range.
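For readers unfamiliar with that workaround, a minimal userspace sketch
(the buffer name and the driver-registration comment are illustrative,
not from this thread; assumes Linux 2.6.16+ so MADV_DONTFORK exists):

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
	long pagesize = sysconf(_SC_PAGESIZE);
	void *buf;

	if (posix_memalign(&buf, pagesize, pagesize))
		exit(1);

	/* ... hand buf to the pinning driver (XPMEM/IB style),
	 * which takes its page reference via get_user_pages() ... */

	/* exclude the region from copy_page_range at fork, so a
	 * child cannot raise page_mapcount and trigger the COW
	 * that would split userspace from the pinned page */
	if (madvise(buf, pagesize, MADV_DONTFORK))
		perror("madvise(MADV_DONTFORK)");

	return 0;
}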
I finally tracked this down. I think it is a problem specific to XPMEM
on the SLES10 kernel and will not be a problem once Andrea's mmu_notifier
is in the kernel. It is a problem, as far as I can tell, specific to
the way XPMEM works.
I will open a SuSE bugzilla to work the issue directly with them.
Prior to the transition event, we have a page of memory that was
pre-faulted by a process. The process has exported (via XPMEM) a
window of its own address space. A remote process has attached and
touched the page of memory. The fault will call into XPMEM which does
the get_user_pages.
At this point, both processes have a writable PTE entry to the same
page and XPMEM has one additional reference count (_count) on the page
acquired via get_user_pages().
Memory pressure causes the swap-out code to be called. It clears both
processes' page table entries, drops their _count references, etc. The only
thing that remains different from normal at this point is that XPMEM retains
a reference.
Both processes then read-fault the page which results in readable PTEs
being installed.
The failure point comes when either process write faults the page.
At that point, a COW is initiated and now the two processes are looking
at separate pages.
The scenario would be different in the case of mmu_notifiers.
The notifier callout when the readable PTE was being replaced with a
writable PTE would result in the remote page table getting cleared and
XPMEM releasing the _count.
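A kernel-side sketch of the pin that starts this sequence, using the
2.6-era get_user_pages() signature quoted earlier in the thread (the
function name is illustrative and error handling is trimmed):

#include <linux/mm.h>
#include <linux/sched.h>

/* XPMEM-style pin: take an extra _count on the user's page */
static struct page *example_pin_user_page(unsigned long uaddr)
{
	struct page *page;

	down_read(&current->mm->mmap_sem);
	if (get_user_pages(current, current->mm, uaddr, 1,
			   1 /* write */, 1 /* force */,
			   &page, NULL) != 1)
		page = NULL;
	up_read(&current->mm->mmap_sem);

	/*
	 * From here the driver holds "page" across the swap-out,
	 * read-fault, write-fault sequence described above: once
	 * do_wp_page() COWs, the process writes a different page
	 * and this reference goes stale.
	 */
	return page;
}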
All that said, I think the race we discussed earlier in the thread is
a legitimate one and believe Hugh's fix is correct.
Thank you for all your patience,
Robin
* Re: Can get_user_pages( ,write=1, force=1, ) result in a read-only pte and _count=2?
2008-06-23 15:54 ` Robin Holt
@ 2008-06-23 16:48 ` Hugh Dickins
2008-06-23 17:52 ` Robin Holt
0 siblings, 1 reply; 30+ messages in thread
From: Hugh Dickins @ 2008-06-23 16:48 UTC (permalink / raw)
To: Robin Holt
Cc: Nick Piggin, Ingo Molnar, Christoph Lameter, Jack Steiner, linux-mm
On Mon, 23 Jun 2008, Robin Holt wrote:
>
> I finally tracked this down. I think it is a problem specific to XPMEM
> on the SLES10 kernel and will not be a problem once Andrea's mmu_notifier
> is in the kernel. It is a problem, as far as I can tell, specific to
> the way XPMEM works.
>
> I will open a SuSE bugzilla to work the issue directly with them.
>
> Prior to the transition event, we have a page of memory that was
> pre-faulted by a process. The process has exported (via XPMEM) a
> window of its own address space. A remote process has attached and
> touched the page of memory. The fault will call into XPMEM which does
> the get_user_pages.
>
> At this point, both processes have a writable PTE entry to the same
> page and XPMEM has one additional reference count (_count) on the page
> acquired via get_user_pages().
>
> Memory pressure causes the swap-out code to be called. It clears both
> processes' page table entries, drops their _count references, etc. The only
> thing that remains different from normal at this point is that XPMEM retains
> a reference.
>
> Both processes then read-fault the page which results in readable PTEs
> being installed.
>
> The failure point comes when either process write faults the page.
> At that point, a COW is initiated and now the two processes are looking
> at separate pages.
A COW is initiated because memory pressure happens to have the page
locked at the instant do_wp_page is looking to reuse it? That's the
scenario we were assuming before (once Nick had swept away my page
count mismemories), but it's not specific to XPMEM. Well, I suppose
it is specific to a small proportion of get_user_pages callers, and it
might be that all the rest have by now worked around the issue somehow.
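Concretely, the reuse test under discussion (the pre-patch code, visible
as the '-' lines in the diffs in this thread) can lose that race as
follows; the interleaving comments are illustrative additions:

	if (PageAnon(old_page)) {
		/*
		 * If shrink_page_list() on another CPU holds the page
		 * lock at this instant (writing the page to swap),
		 * the trylock fails and reuse stays 0 ...
		 */
		if (!TestSetPageLocked(old_page)) {
			reuse = can_share_swap_page(old_page);
			unlock_page(old_page);
		}
		/*
		 * ... so do_wp_page() falls through to COW even though
		 * no other process maps the page, and a get_user_pages()
		 * caller is left holding the old copy.
		 */
	}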
>
> The scenario would be different in the case of mmu_notifiers.
> The notifier callout when the readable PTE was being replaced with a
> writable PTE would result in the remote page table getting cleared and
> XPMEM releasing the _count.
But XPMEM's contribution to the _count shouldn't matter.
>
> All that said, I think the race we discussed earlier in the thread is
> a legitimate one and believe Hugh's fix is correct.
My fix? Would that be the get_user_pages VM_WRITE test before clearing
FOLL_WRITE - which I believe didn't fix you at all? Or the grand new
reuse test in do_wp_page that I'm still working on - of which Nick sent
a lock_page approximation for you to try? Would you still be able to
try mine when I'm ready, or does it now appear irrelevant to you?
(Nick questioned the page_swapcount test in the lines I sent out, and
he was absolutely right: at the time I thought that once I looked at
the page_swapcount end of the patch, I'd find it taking page_mapcount
into account under swap_lock; but when I got there, found I had moved
away to other work before actually completing that end of it. Probably
got stalled on deciding function names, that's taken most of my time!)
>
> Thank you for all your patience,
> Robin
Not at all, thank you for raising a real issue: all the better
if you can get around it without a targeted fix for now.
If you're interested, by the way, in the earlier discussion
I mentioned, a window into it can be found at
http://lkml.org/lkml/2006/9/14/384
but it's a broken thread, with misunderstanding on all sides,
so rather hard to get a grasp of it.
Hugh
* Re: Can get_user_pages( ,write=1, force=1, ) result in a read-only pte and _count=2?
2008-06-23 16:48 ` Hugh Dickins
@ 2008-06-23 17:52 ` Robin Holt
2008-06-23 20:58 ` Hugh Dickins
0 siblings, 1 reply; 30+ messages in thread
From: Robin Holt @ 2008-06-23 17:52 UTC (permalink / raw)
To: Hugh Dickins
Cc: Robin Holt, Nick Piggin, Ingo Molnar, Christoph Lameter,
Jack Steiner, linux-mm
On Mon, Jun 23, 2008 at 05:48:17PM +0100, Hugh Dickins wrote:
> On Mon, 23 Jun 2008, Robin Holt wrote:
> > All that said, I think the race we discussed earlier in the thread is
> > a legitimate one and believe Hugh's fix is correct.
>
> My fix? Would that be the get_user_pages VM_WRITE test before clearing
> FOLL_WRITE - which I believe didn't fix you at all? Or the grand new
It did not fix the problem I was seeing, but I believe it is a possible
race condition. I certainly admit to not having a complete enough
understanding and there may be something which prevents that from being
a problem, but I currently still think there is a problem, just not one
I can reproduce.
> reuse test in do_wp_page that I'm still working on - of which Nick sent
> a lock_page approximation for you to try? Would you still be able to
> try mine when I'm ready, or does it now appear irrelevant to you?
Before your response, I had convinced myself my problem was specific to
XPMEM, but I see your point and may agree that it is a problem for all
get_user_pages() users.
I can certainly test when you have it ready.
I had confused myself about Nick's first patch. I will give that
another look over and see if it fixes the problem.
> http://lkml.org/lkml/2006/9/14/384
>
> but it's a broken thread, with misunderstanding on all sides,
> so rather hard to get a grasp of it.
That is extremely similar to the issue I am seeing. I think that if
Infiniband were using the mmu_notifier stuff, they would be closer, but
IIRC, there are significant hardware restrictions which prevent demand
paging from working on some IB devices.
Thanks,
Robin
* Re: Can get_user_pages( ,write=1, force=1, ) result in a read-only pte and _count=2?
2008-06-19 13:49 ` Hugh Dickins
2008-06-23 15:54 ` Robin Holt
@ 2008-06-23 19:11 ` Robin Holt
2008-06-23 19:12 ` Robin Holt
1 sibling, 1 reply; 30+ messages in thread
From: Robin Holt @ 2008-06-23 19:11 UTC (permalink / raw)
To: Hugh Dickins
Cc: Robin Holt, Nick Piggin, Ingo Molnar, Christoph Lameter,
Jack Steiner, linux-mm
On Thu, Jun 19, 2008 at 02:49:50PM +0100, Hugh Dickins wrote:
> On Thu, 19 Jun 2008, Robin Holt wrote:
> > On Thu, Jun 19, 2008 at 12:09:15PM +0100, Hugh Dickins wrote:
> > >
> > > (I assume Robin is not forking, we do know that causes this kind
> > > of problem, but he didn't mention any forking so I assume not.)
> >
> > There has been a fork long before this mapping was created. There was a
> > hole at this location and the mapping gets established and pages populated
> > following all ranks of the MPI job getting initialized.
>
> There's usually been a fork somewhen in the past! That's no problem.
>
> The fork problem comes when someone has done a get_user_pages to break
> all the COWs, then another thread does a fork which writeprotects and
> raises page_mapcount, so the next write from userspace breaks COW again
> and writes to a different page from that which the kernel is holding.
>
> That one kept on coming up, but I've not heard of it again since we
> added madvise MADV_DONTFORK so apps could exclude such parts of the
> address space from copy_page_range.
I think you still have a hole. Here is what I _think_ I was actually
running into. A range of memory was exported with xpmem. This is on
a sles10 kernel which has no mmu_notifier equivalent functionality.
The exporting process has write faulted a range of addresses which
it plans on using for a functionality validation test which verifies
its results.
The address range is then imported by the other MPI ranks (128 ranks
total) and pages are faulted in.
At this time, the system comes under severe memory pressure. The swap
code makes a swap entry and replaces both processes' PTEs with the swap
entry. XPMEM is holding an extra reference (_count) on the page.
The importer now faults the page again (it does not matter whether read
or write, merely that it faults first). After that, the exporter read
faults the address and then write faults. The read followed by write
seems to be the key. At that point, the _count and _mapcount are both
elevated to the point where the page will be COW'd.
To verify that it was _something_ like this, I had inserted a BUG_ON when
we return from get_user_pages() to verify the _mapcount is 1 or greater
and the _count is 2 or greater. Additionally, I walked the process page
tables at this point and verified pte_write was true.
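A sketch of that check as described (the placement and variable names
are illustrative, not Robin's actual debugging patch):

	/* after a successful pin, mmap_sem held for read: */
	ret = get_user_pages(tsk, mm, uaddr, 1, 1, 1, &page, NULL);
	if (ret == 1) {
		/* the page must still be mapped, and must carry the
		 * extra reference get_user_pages() just took */
		BUG_ON(page_mapcount(page) < 1);
		BUG_ON(page_count(page) < 2);
	}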
I also added a page flag (just a kludge to verify). When XPMEM exports
the page, I set the page flag. In can_share_swap_page, I changed the
return to: (count == 1) && test_bit(27, &page->flags);
I clearly messed something up, but it does indicate I am finally in
the right neighborhood. The test program completes with success.
I definitely messed up the clearing of bit 27 because the machine will
no longer launch new executables after the job completes. If I reboot,
I can rerun the job to successful completion again.
* Re: Can get_user_pages( ,write=1, force=1, ) result in a read-only pte and _count=2?
2008-06-23 19:11 ` Robin Holt
@ 2008-06-23 19:12 ` Robin Holt
0 siblings, 0 replies; 30+ messages in thread
From: Robin Holt @ 2008-06-23 19:12 UTC (permalink / raw)
To: Hugh Dickins
Cc: Robin Holt, Nick Piggin, Ingo Molnar, Christoph Lameter,
Jack Steiner, linux-mm
Ooops, this was a draft I meant to throw away. Please ignore.
Sorry for the confusion,
Robin
On Mon, Jun 23, 2008 at 02:11:35PM -0500, Robin Holt wrote:
> [the withdrawn draft quoted in full; snipped]
* Re: Can get_user_pages( ,write=1, force=1, ) result in a read-only pte and _count=2?
2008-06-23 17:52 ` Robin Holt
@ 2008-06-23 20:58 ` Hugh Dickins
2008-06-24 11:56 ` Robin Holt
2008-06-24 15:19 ` Robin Holt
0 siblings, 2 replies; 30+ messages in thread
From: Hugh Dickins @ 2008-06-23 20:58 UTC (permalink / raw)
To: Robin Holt
Cc: Nick Piggin, Ingo Molnar, Christoph Lameter, Jack Steiner,
Rik van Riel, Lee Schermerhorn, linux-mm
On Mon, 23 Jun 2008, Robin Holt wrote:
> On Mon, Jun 23, 2008 at 05:48:17PM +0100, Hugh Dickins wrote:
>
> > reuse test in do_wp_page that I'm still working on - of which Nick sent
> > a lock_page approximation for you to try? Would you still be able to
> > try mine when I'm ready, or does it now appear irrelevant to you?
>
> Before your response, I had convinced myself my problem was specific to
> XPMEM, but I see your point and may agree that it is a problem for all
> get_user_pages() users.
>
> I can certainly test when you have it ready.
Thanks a lot, Robin. Here it is below.
>
> I had confused myself about Nick's first patch. I will give that
> another look over and see if it fixes the problem.
Nick's _first_ patch? The one I was thinking of was the one
with lock_page in do_wp_page, but it shouldn't be necessary now if
what's below works - though his is a much smaller and less worrying
patch, so anyone looking for a quick fix to the issue might well
prefer his.
>
> > http://lkml.org/lkml/2006/9/14/384
> >
> > but it's a broken thread, with misunderstanding on all sides,
> > so rather hard to get a grasp of it.
>
> That is extremely similar to the issue I am seeing. I think that if
> Infiniband were using the mmu_notifier stuff, they would be closer, but
> IIRC, there are significant hardware restrictions which prevent demand
> paging for working on some IB devices.
Ah, I'm glad you've managed to glean something from it, good.
Here's the rollup of the patches I'm proposing for two issues:
it doesn't get my signoff yet because I'll want to split it into
little stages properly commented, and I'll want to do more strenuous
testing; but this shouldn't blow up in your face. It is against 2.6.26-rc7,
and should apply quite easily to earlier kernels, but needs a little more work
against 2.6.26-rc5-mm3 - though it simplifies some of that too (Rik+Lee Cc'ed).
I say two issues, two competing issues. One is the issue which may be
your issue, that we want to decide COW in do_wp_page without needing
PageLocked, so a concurrent shrink_page_list() doesn't occasionally
force an unwanted COW, messing up some get_user_pages() uses. The
other, of no interest to you, is that we do want PageLocked when
deciding COW in do_wp_page, because that's a good moment to free
up the swap space - leaving the modified page with swap just makes
a big seek likely when it's next written to swap.
Hugh
include/linux/swap.h |  21 ++++--
mm/memory.c          |  15 +++-
mm/migrate.c         |   5 -
mm/page_io.c         |   2
mm/swap_state.c      |  12 ++-
mm/swapfile.c        | 128 ++++++++++++++++++++++-------------------
6 files changed, 105 insertions(+), 78 deletions(-)
--- 2.6.26-rc7/include/linux/swap.h 2008-05-03 21:55:11.000000000 +0100
+++ linux/include/linux/swap.h 2008-06-23 18:12:47.000000000 +0100
@@ -250,8 +250,9 @@ extern unsigned int count_swap_pages(int
extern sector_t map_swap_page(struct swap_info_struct *, pgoff_t);
extern sector_t swapdev_block(int, pgoff_t);
extern struct swap_info_struct *get_swap_info_struct(unsigned);
-extern int can_share_swap_page(struct page *);
-extern int remove_exclusive_swap_page(struct page *);
+extern int can_reuse_swap_page_unlocked(struct page *);
+extern int can_reuse_swap_page_locked(struct page *);
+extern int try_to_free_swap(struct page *);
struct backing_dev_info;
extern spinlock_t swap_lock;
@@ -319,8 +320,6 @@ static inline struct page *lookup_swap_c
return NULL;
}
-#define can_share_swap_page(p) (page_mapcount(p) == 1)
-
static inline int add_to_swap_cache(struct page *page, swp_entry_t entry,
gfp_t gfp_mask)
{
@@ -337,9 +336,19 @@ static inline void delete_from_swap_cach
#define swap_token_default_timeout 0
-static inline int remove_exclusive_swap_page(struct page *p)
+static inline int can_reuse_swap_page_unlocked(struct page *page)
{
- return 0;
+ return 0; /* irrelevant: never called */
+}
+
+static inline int can_reuse_swap_page_locked(struct page *page)
+{
+ return 0; /* irrelevant: never called */
+}
+
+static inline int try_to_free_swap(struct page *page)
+{
+ return 0; /* irrelevant: never called */
}
static inline swp_entry_t get_swap_page(void)
--- 2.6.26-rc7/mm/memory.c 2008-06-21 08:41:19.000000000 +0100
+++ linux/mm/memory.c 2008-06-23 18:12:47.000000000 +0100
@@ -1686,9 +1686,14 @@ static int do_wp_page(struct mm_struct *
* not dirty accountable.
*/
if (PageAnon(old_page)) {
- if (!TestSetPageLocked(old_page)) {
- reuse = can_share_swap_page(old_page);
- unlock_page(old_page);
+ if (page_mapcount(old_page) == 1) {
+ if (!PageSwapCache(old_page))
+ reuse = 1;
+ else if (!TestSetPageLocked(old_page)) {
+ reuse = can_reuse_swap_page_locked(old_page);
+ unlock_page(old_page);
+ } else
+ reuse = can_reuse_swap_page_unlocked(old_page);
}
} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
(VM_WRITE|VM_SHARED))) {
@@ -2185,7 +2190,7 @@ static int do_swap_page(struct mm_struct
inc_mm_counter(mm, anon_rss);
pte = mk_pte(page, vma->vm_page_prot);
- if (write_access && can_share_swap_page(page)) {
+ if (write_access && can_reuse_swap_page_locked(page)) {
pte = maybe_mkwrite(pte_mkdirty(pte), vma);
write_access = 0;
}
@@ -2196,7 +2201,7 @@ static int do_swap_page(struct mm_struct
swap_free(entry);
if (vm_swap_full())
- remove_exclusive_swap_page(page);
+ try_to_free_swap(page);
unlock_page(page);
if (write_access) {
--- 2.6.26-rc7/mm/migrate.c 2008-06-21 08:41:19.000000000 +0100
+++ linux/mm/migrate.c 2008-06-23 18:12:47.000000000 +0100
@@ -330,8 +330,10 @@ static int migrate_page_move_mapping(str
get_page(newpage); /* add cache reference */
#ifdef CONFIG_SWAP
if (PageSwapCache(page)) {
- SetPageSwapCache(newpage);
set_page_private(newpage, page_private(page));
+ /* page_swapcount() relies on private whenever PageSwapCache */
+ smp_wmb();
+ SetPageSwapCache(newpage);
}
#endif
@@ -398,7 +400,6 @@ static void migrate_page_copy(struct pag
#endif
ClearPageActive(page);
ClearPagePrivate(page);
- set_page_private(page, 0);
page->mapping = NULL;
/*
--- 2.6.26-rc7/mm/page_io.c 2008-04-17 03:49:44.000000000 +0100
+++ linux/mm/page_io.c 2008-06-23 18:12:47.000000000 +0100
@@ -98,7 +98,7 @@ int swap_writepage(struct page *page, st
struct bio *bio;
int ret = 0, rw = WRITE;
- if (remove_exclusive_swap_page(page)) {
+ if (try_to_free_swap(page)) {
unlock_page(page);
goto out;
}
--- 2.6.26-rc7/mm/swap_state.c 2008-05-03 21:55:12.000000000 +0100
+++ linux/mm/swap_state.c 2008-06-23 18:12:47.000000000 +0100
@@ -76,13 +76,15 @@ int add_to_swap_cache(struct page *page,
BUG_ON(PagePrivate(page));
error = radix_tree_preload(gfp_mask);
if (!error) {
+ set_page_private(page, entry.val);
+ /* page_swapcount() relies on private whenever PageSwapCache */
+ smp_wmb();
write_lock_irq(&swapper_space.tree_lock);
error = radix_tree_insert(&swapper_space.page_tree,
entry.val, page);
if (!error) {
page_cache_get(page);
SetPageSwapCache(page);
- set_page_private(page, entry.val);
total_swapcache_pages++;
__inc_zone_page_state(page, NR_FILE_PAGES);
INC_CACHE_INFO(add_total);
@@ -105,7 +107,6 @@ void __delete_from_swap_cache(struct pag
BUG_ON(PagePrivate(page));
radix_tree_delete(&swapper_space.page_tree, page_private(page));
- set_page_private(page, 0);
ClearPageSwapCache(page);
total_swapcache_pages--;
__dec_zone_page_state(page, NR_FILE_PAGES);
@@ -188,13 +189,14 @@ void delete_from_swap_cache(struct page
*
* Its ok to check for PageSwapCache without the page lock
* here because we are going to recheck again inside
- * exclusive_swap_page() _with_ the lock.
+ * try_to_free_swap() _with_ the lock.
* - Marcelo
*/
static inline void free_swap_cache(struct page *page)
{
- if (PageSwapCache(page) && !TestSetPageLocked(page)) {
- remove_exclusive_swap_page(page);
+ if (PageSwapCache(page) && !page_mapped(page) &&
+ !TestSetPageLocked(page)) {
+ try_to_free_swap(page);
unlock_page(page);
}
}
--- 2.6.26-rc7/mm/swapfile.c 2008-05-03 21:55:12.000000000 +0100
+++ linux/mm/swapfile.c 2008-06-23 18:12:47.000000000 +0100
@@ -251,7 +251,6 @@ static struct swap_info_struct * swap_in
goto bad_offset;
if (!p->swap_map[offset])
goto bad_free;
- spin_lock(&swap_lock);
return p;
bad_free:
@@ -300,90 +299,104 @@ void swap_free(swp_entry_t entry)
p = swap_info_get(entry);
if (p) {
+ spin_lock(&swap_lock);
swap_entry_free(p, swp_offset(entry));
spin_unlock(&swap_lock);
}
}
/*
- * How many references to page are currently swapped out?
+ * How many page table references are there to this page's swap entry?
+ * including references to the page itself if add_mapcount.
+ *
+ * page_swapcount is only called on a page which was recently PageSwapCache
+ * (but might have been deleted from swap cache just before getting here).
+ *
+ * When called with PageLocked, the swapcount is stable, and the mapcount
+ * cannot rise (but may fall if the page is concurrently unmapped from a
+ * page table whose lock we don't hold).
+ *
+ * When called without PageLocked, the swapcount is stable while we hold
+ * swap_lock, and the mapcount cannot rise from 1 while we hold that page
+ * table lock (but may fall if the page is concurrently unmapped from a
+ * page table whose lock we don't hold).
+ *
+ * do_swap_page and unuse_pte call page_add_anon_rmap before swap_free,
+ * try_to_unmap_one calls swap_duplicate before page_remove_rmap:
+ * so in general, swapcount+mapcount should never be seen too low -
+ * but you need to consider more memory barriers if extending its use.
*/
-static inline int page_swapcount(struct page *page)
+static int page_swapcount(struct page *page, int add_mapcount)
{
int count = 0;
struct swap_info_struct *p;
swp_entry_t entry;
- entry.val = page_private(page);
- p = swap_info_get(entry);
- if (p) {
- /* Subtract the 1 for the swap cache itself */
- count = p->swap_map[swp_offset(entry)] - 1;
- spin_unlock(&swap_lock);
+ spin_lock(&swap_lock);
+ if (add_mapcount)
+ count = page_mapcount(page);
+ if (PageSwapCache(page)) {
+ /* We can rely on page_private once PageSwapCache is visible */
+ smp_rmb();
+ entry.val = page_private(page);
+ p = swap_info_get(entry);
+ if (p) {
+ /* Subtract the 1 for the swap cache itself */
+ count += p->swap_map[swp_offset(entry)] - 1;
+ }
}
+ spin_unlock(&swap_lock);
return count;
}
/*
- * We can use this swap cache entry directly
- * if there are no other references to it.
+ * Can do_wp_page() make the faulting swapcache page writable without COW?
+ * But something else, probably shrink_page_list(), already has PageLocked.
+ * Don't be misled to COW the page unnecessarily: check swapcount+mapcount.
*/
-int can_share_swap_page(struct page *page)
+int can_reuse_swap_page_unlocked(struct page *page)
{
- int count;
-
- BUG_ON(!PageLocked(page));
- count = page_mapcount(page);
- if (count <= 1 && PageSwapCache(page))
- count += page_swapcount(page);
- return count == 1;
+ return page_swapcount(page, 1) == 1;
}
/*
- * Work out if there are any other processes sharing this
- * swap cache page. Free it if you can. Return success.
+ * Can do_wp_page() make the faulting swapcache page writable without COW?
+ * having acquired PageLocked. In this case, since we have that lock and
+ * are about to modify the page, we'd better free its swap space - it won't
+ * be read again, and writing there later would probably require an extra seek.
*/
-int remove_exclusive_swap_page(struct page *page)
+int can_reuse_swap_page_locked(struct page *page)
{
- int retval;
- struct swap_info_struct * p;
- swp_entry_t entry;
+ VM_BUG_ON(!PageLocked(page));
+ if (page_swapcount(page, 1) != 1)
+ return 0;
+ if (!PageSwapCache(page))
+ return 1;
+ if (PageWriteback(page))
+ return 1;
- BUG_ON(PagePrivate(page));
- BUG_ON(!PageLocked(page));
+ delete_from_swap_cache(page);
+ SetPageDirty(page);
+ return 1;
+}
+/*
+ * If swap is getting full, or if there are no more mappings of this page,
+ * then try_to_free_swap is called to free its swap space.
+ */
+int try_to_free_swap(struct page *page)
+{
+ VM_BUG_ON(!PageLocked(page));
if (!PageSwapCache(page))
return 0;
if (PageWriteback(page))
return 0;
- if (page_count(page) != 2) /* 2: us + cache */
+ if (page_swapcount(page, 0)) /* Here we don't care about mapcount */
return 0;
- entry.val = page_private(page);
- p = swap_info_get(entry);
- if (!p)
- return 0;
-
- /* Is the only swap cache user the cache itself? */
- retval = 0;
- if (p->swap_map[swp_offset(entry)] == 1) {
- /* Recheck the page count with the swapcache lock held.. */
- write_lock_irq(&swapper_space.tree_lock);
- if ((page_count(page) == 2) && !PageWriteback(page)) {
- __delete_from_swap_cache(page);
- SetPageDirty(page);
- retval = 1;
- }
- write_unlock_irq(&swapper_space.tree_lock);
- }
- spin_unlock(&swap_lock);
-
- if (retval) {
- swap_free(entry);
- page_cache_release(page);
- }
-
- return retval;
+ delete_from_swap_cache(page);
+ SetPageDirty(page);
+ return 1;
}
/*
@@ -400,6 +413,7 @@ void free_swap_and_cache(swp_entry_t ent
p = swap_info_get(entry);
if (p) {
+ spin_lock(&swap_lock);
if (swap_entry_free(p, swp_offset(entry)) == 1) {
page = find_get_page(&swapper_space, entry.val);
if (page && unlikely(TestSetPageLocked(page))) {
@@ -410,14 +424,10 @@ void free_swap_and_cache(swp_entry_t ent
spin_unlock(&swap_lock);
}
if (page) {
- int one_user;
-
- BUG_ON(PagePrivate(page));
- one_user = (page_count(page) == 2);
- /* Only cache user (+us), or swap space full? Free it! */
+ /* Not mapped elsewhere, or swap space full? Free it! */
/* Also recheck PageSwapCache after page is locked (above) */
if (PageSwapCache(page) && !PageWriteback(page) &&
- (one_user || vm_swap_full())) {
+ (!page_mapped(page) || vm_swap_full())) {
delete_from_swap_cache(page);
SetPageDirty(page);
}
* Re: Can get_user_pages( ,write=1, force=1, ) result in a read-only pte and _count=2?
2008-06-23 20:58 ` Hugh Dickins
@ 2008-06-24 11:56 ` Robin Holt
2008-06-24 15:19 ` Robin Holt
1 sibling, 0 replies; 30+ messages in thread
From: Robin Holt @ 2008-06-24 11:56 UTC (permalink / raw)
To: Hugh Dickins
Cc: Robin Holt, Nick Piggin, Ingo Molnar, Christoph Lameter,
Jack Steiner, Rik van Riel, Lee Schermerhorn, linux-mm
Argh, the current kernel does not boot on ia64. I will look into
that first. It fails even without your patch applied.
Thanks,
Robin
On Mon, Jun 23, 2008 at 09:58:42PM +0100, Hugh Dickins wrote:
> [Hugh's message and rollup patch quoted in full; snipped]
* Re: Can get_user_pages( ,write=1, force=1, ) result in a read-only pte and _count=2?
2008-06-23 20:58 ` Hugh Dickins
2008-06-24 11:56 ` Robin Holt
@ 2008-06-24 15:19 ` Robin Holt
2008-06-24 20:19 ` Hugh Dickins
1 sibling, 1 reply; 30+ messages in thread
From: Robin Holt @ 2008-06-24 15:19 UTC (permalink / raw)
To: Hugh Dickins
Cc: Robin Holt, Nick Piggin, Ingo Molnar, Christoph Lameter,
Jack Steiner, Rik van Riel, Lee Schermerhorn, linux-mm
OK. I just gave it a try and I still get a failure with this patch
applied.
I can't help wondering if we (XPMEM, IB, etc.) shouldn't be setting a
page flag indicating that swapping this page out should not even be
attempted. I would guess this has already been discussed to death.
If so, I am sorry, but I missed those discussions.
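A purely hypothetical sketch of that idea; nothing like PG_nonswap
exists in the kernels under discussion, and the bit number and helper
are invented for illustration:

#include <linux/page-flags.h>

#define PG_nonswap 27	/* invented bit, for illustration only */

static inline int PageNonswap(struct page *page)
{
	return test_bit(PG_nonswap, &page->flags);
}

/*
 * Hypothetical use: the pinning driver sets the bit after
 * get_user_pages(), and the reclaim scan skips such pages before
 * try_to_unmap(), e.g. in shrink_page_list():
 *
 *	if (PageNonswap(page))
 *		goto keep_locked;
 *
 * so the PTEs are never cleared and the read-fault/write-fault/COW
 * sequence described earlier can never begin.
 */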
Thanks,
Robin
On Mon, Jun 23, 2008 at 09:58:42PM +0100, Hugh Dickins wrote:
> On Mon, 23 Jun 2008, Robin Holt wrote:
> > On Mon, Jun 23, 2008 at 05:48:17PM +0100, Hugh Dickins wrote:
> >
> > > reuse test in do_wp_page that I'm still working on - of which Nick sent
> > > a lock_page approximation for you to try? Would you still be able to
> > > try mine when I'm ready, or does it now appear irrelevant to you?
> >
> > Before your response, I had convinced myself my problem was specific to
> > XPMEM, but I see your point and may agree that it is a problem for all
> > get_user_pages() users.
> >
> > I can certainly test when you have it ready.
>
> Thanks a lot, Robin. Here it is below.
>
> >
> > I had confused myself about Nick's first patch. I will give that
> > another look over and see if it fixes the problem.
>
> Nick's _first_ patch? The one I was thinking of was the one
> with lock_page in do_wp_page, but it shouldn't be necessary now if
> what's below works - though his is a much smaller and less worrying
> patch, so anyone looking for a quick fix to the issue might well
> prefer his.
>
> >
> > > http://lkml.org/lkml/2006/9/14/384
> > >
> > > but it's a broken thread, with misunderstanding on all sides,
> > > so rather hard to get a grasp of it.
> >
> > That is extremely similar to the issue I am seeing. I think that if
> > Infiniband were using the mmu_notifier stuff, they would be closer, but
> > IIRC, there are significant hardware restrictions which prevent demand
> > paging for working on some IB devices.
>
> Ah, I'm glad you've managed to glean something from it, good.
>
> Here's the rollup of the patches I'm proposing for two issues:
> it doesn't get my signoff yet because I'll want to split it into
> little stages properly commented, and I'll want to do more strenuous
> testing; but this shouldn't blow up in your face. Against 2.6.26-rc7,
> should apply quite easily to earlier, but a little more work against
> 2.6.26-rc5-mm3 - though it simplifies some of that too (Rik+Lee Cced).
>
> I say two issues, two competing issues. One is the issue which may be
> your issue, that we want to decide COW in do_wp_page without needing
> PageLocked, so a concurrent shrink_page_list() doesn't occasionally
> force an unwanted COW, messing up some get_user_pages() uses. The
> other, of no interest to you, is that we do want PageLocked when
> deciding COW in do_wp_page, because that's a good moment to free
> up the swap space - leaving the modified page with swap just makes
> a big seek likely when it's next written to swap.
>
> Hugh
>
> include/linux/swap.h | 21 ++++--
> mm/memory.c | 15 +++-
> mm/migrate.c | 5 -
> mm/page_io.c | 2
> mm/swap_state.c | 12 ++-
> mm/swapfile.c | 128 ++++++++++++++++++++++-------------------
> 6 files changed, 105 insertions(+), 78 deletions(-)
>
> --- 2.6.26-rc7/include/linux/swap.h 2008-05-03 21:55:11.000000000 +0100
> +++ linux/include/linux/swap.h 2008-06-23 18:12:47.000000000 +0100
> @@ -250,8 +250,9 @@ extern unsigned int count_swap_pages(int
> extern sector_t map_swap_page(struct swap_info_struct *, pgoff_t);
> extern sector_t swapdev_block(int, pgoff_t);
> extern struct swap_info_struct *get_swap_info_struct(unsigned);
> -extern int can_share_swap_page(struct page *);
> -extern int remove_exclusive_swap_page(struct page *);
> +extern int can_reuse_swap_page_unlocked(struct page *);
> +extern int can_reuse_swap_page_locked(struct page *);
> +extern int try_to_free_swap(struct page *);
> struct backing_dev_info;
>
> extern spinlock_t swap_lock;
> @@ -319,8 +320,6 @@ static inline struct page *lookup_swap_c
> return NULL;
> }
>
> -#define can_share_swap_page(p) (page_mapcount(p) == 1)
> -
> static inline int add_to_swap_cache(struct page *page, swp_entry_t entry,
> gfp_t gfp_mask)
> {
> @@ -337,9 +336,19 @@ static inline void delete_from_swap_cach
>
> #define swap_token_default_timeout 0
>
> -static inline int remove_exclusive_swap_page(struct page *p)
> +static inline int can_reuse_swap_page_unlocked(struct page *page)
> {
> - return 0;
> + return 0; /* irrelevant: never called */
> +}
> +
> +static inline int can_reuse_swap_page_locked(struct page *page)
> +{
> + return 0; /* irrelevant: never called */
> +}
> +
> +static inline int try_to_free_swap(struct page *page)
> +{
> + return 0; /* irrelevant: never called */
> }
>
> static inline swp_entry_t get_swap_page(void)
> --- 2.6.26-rc7/mm/memory.c 2008-06-21 08:41:19.000000000 +0100
> +++ linux/mm/memory.c 2008-06-23 18:12:47.000000000 +0100
> @@ -1686,9 +1686,14 @@ static int do_wp_page(struct mm_struct *
> * not dirty accountable.
> */
> if (PageAnon(old_page)) {
> - if (!TestSetPageLocked(old_page)) {
> - reuse = can_share_swap_page(old_page);
> - unlock_page(old_page);
> + if (page_mapcount(old_page) == 1) {
> + if (!PageSwapCache(old_page))
> + reuse = 1;
> + else if (!TestSetPageLocked(old_page)) {
> + reuse = can_reuse_swap_page_locked(old_page);
> + unlock_page(old_page);
> + } else
> + reuse = can_reuse_swap_page_unlocked(old_page);
> }
> } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
> (VM_WRITE|VM_SHARED))) {
> @@ -2185,7 +2190,7 @@ static int do_swap_page(struct mm_struct
>
> inc_mm_counter(mm, anon_rss);
> pte = mk_pte(page, vma->vm_page_prot);
> - if (write_access && can_share_swap_page(page)) {
> + if (write_access && can_reuse_swap_page_locked(page)) {
> pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> write_access = 0;
> }
> @@ -2196,7 +2201,7 @@ static int do_swap_page(struct mm_struct
>
> swap_free(entry);
> if (vm_swap_full())
> - remove_exclusive_swap_page(page);
> + try_to_free_swap(page);
> unlock_page(page);
>
> if (write_access) {
> --- 2.6.26-rc7/mm/migrate.c 2008-06-21 08:41:19.000000000 +0100
> +++ linux/mm/migrate.c 2008-06-23 18:12:47.000000000 +0100
> @@ -330,8 +330,10 @@ static int migrate_page_move_mapping(str
> get_page(newpage); /* add cache reference */
> #ifdef CONFIG_SWAP
> if (PageSwapCache(page)) {
> - SetPageSwapCache(newpage);
> set_page_private(newpage, page_private(page));
> + /* page_swapcount() relies on private whenever PageSwapCache */
> + smp_wmb();
> + SetPageSwapCache(newpage);
> }
> #endif
>
> @@ -398,7 +400,6 @@ static void migrate_page_copy(struct pag
> #endif
> ClearPageActive(page);
> ClearPagePrivate(page);
> - set_page_private(page, 0);
> page->mapping = NULL;
>
> /*
> --- 2.6.26-rc7/mm/page_io.c 2008-04-17 03:49:44.000000000 +0100
> +++ linux/mm/page_io.c 2008-06-23 18:12:47.000000000 +0100
> @@ -98,7 +98,7 @@ int swap_writepage(struct page *page, st
> struct bio *bio;
> int ret = 0, rw = WRITE;
>
> - if (remove_exclusive_swap_page(page)) {
> + if (try_to_free_swap(page)) {
> unlock_page(page);
> goto out;
> }
> --- 2.6.26-rc7/mm/swap_state.c 2008-05-03 21:55:12.000000000 +0100
> +++ linux/mm/swap_state.c 2008-06-23 18:12:47.000000000 +0100
> @@ -76,13 +76,15 @@ int add_to_swap_cache(struct page *page,
> BUG_ON(PagePrivate(page));
> error = radix_tree_preload(gfp_mask);
> if (!error) {
> + set_page_private(page, entry.val);
> + /* page_swapcount() relies on private whenever PageSwapCache */
> + smp_wmb();
> write_lock_irq(&swapper_space.tree_lock);
> error = radix_tree_insert(&swapper_space.page_tree,
> entry.val, page);
> if (!error) {
> page_cache_get(page);
> SetPageSwapCache(page);
> - set_page_private(page, entry.val);
> total_swapcache_pages++;
> __inc_zone_page_state(page, NR_FILE_PAGES);
> INC_CACHE_INFO(add_total);
> @@ -105,7 +107,6 @@ void __delete_from_swap_cache(struct pag
> BUG_ON(PagePrivate(page));
>
> radix_tree_delete(&swapper_space.page_tree, page_private(page));
> - set_page_private(page, 0);
> ClearPageSwapCache(page);
> total_swapcache_pages--;
> __dec_zone_page_state(page, NR_FILE_PAGES);
> @@ -188,13 +189,14 @@ void delete_from_swap_cache(struct page
> *
> * Its ok to check for PageSwapCache without the page lock
> * here because we are going to recheck again inside
> - * exclusive_swap_page() _with_ the lock.
> + * try_to_free_swap() _with_ the lock.
> * - Marcelo
> */
> static inline void free_swap_cache(struct page *page)
> {
> - if (PageSwapCache(page) && !TestSetPageLocked(page)) {
> - remove_exclusive_swap_page(page);
> + if (PageSwapCache(page) && !page_mapped(page) &&
> + !TestSetPageLocked(page)) {
> + try_to_free_swap(page);
> unlock_page(page);
> }
> }
> --- 2.6.26-rc7/mm/swapfile.c 2008-05-03 21:55:12.000000000 +0100
> +++ linux/mm/swapfile.c 2008-06-23 18:12:47.000000000 +0100
> @@ -251,7 +251,6 @@ static struct swap_info_struct * swap_in
> goto bad_offset;
> if (!p->swap_map[offset])
> goto bad_free;
> - spin_lock(&swap_lock);
> return p;
>
> bad_free:
> @@ -300,90 +299,104 @@ void swap_free(swp_entry_t entry)
>
> p = swap_info_get(entry);
> if (p) {
> + spin_lock(&swap_lock);
> swap_entry_free(p, swp_offset(entry));
> spin_unlock(&swap_lock);
> }
> }
>
> /*
> - * How many references to page are currently swapped out?
> + * How many page table references are there to this page's swap entry?
> + * including references to the page itself if add_mapcount.
> + *
> + * page_swapcount is only called on a page which was recently PageSwapCache
> + * (but might have been deleted from swap cache just before getting here).
> + *
> + * When called with PageLocked, the swapcount is stable, and the mapcount
> + * cannot rise (but may fall if the page is concurrently unmapped from a
> + * page table whose lock we don't hold).
> + *
> + * When called without PageLocked, the swapcount is stable while we hold
> + * swap_lock, and the mapcount cannot rise from 1 while we hold that page
> + * table lock (but may fall if the page is concurrently unmapped from a
> + * page table whose lock we don't hold).
> + *
> + * do_swap_page and unuse_pte call page_add_anon_rmap before swap_free,
> + * try_to_unmap_one calls swap_duplicate before page_remove_rmap:
> + * so in general, swapcount+mapcount should never be seen too low -
> + * but you need to consider more memory barriers if extending its use.
> */
> -static inline int page_swapcount(struct page *page)
> +static int page_swapcount(struct page *page, int add_mapcount)
> {
> int count = 0;
> struct swap_info_struct *p;
> swp_entry_t entry;
>
> - entry.val = page_private(page);
> - p = swap_info_get(entry);
> - if (p) {
> - /* Subtract the 1 for the swap cache itself */
> - count = p->swap_map[swp_offset(entry)] - 1;
> - spin_unlock(&swap_lock);
> + spin_lock(&swap_lock);
> + if (add_mapcount)
> + count = page_mapcount(page);
> + if (PageSwapCache(page)) {
> + /* We can rely on page_private once PageSwapCache is visible */
> + smp_rmb();
> + entry.val = page_private(page);
> + p = swap_info_get(entry);
> + if (p) {
> + /* Subtract the 1 for the swap cache itself */
> + count += p->swap_map[swp_offset(entry)] - 1;
> + }
> }
> + spin_unlock(&swap_lock);
> return count;
> }
>
> /*
> - * We can use this swap cache entry directly
> - * if there are no other references to it.
> + * Can do_wp_page() make the faulting swapcache page writable without COW?
> + * But something else, probably shrink_page_list(), already has PageLocked.
> + * Don't be misled into COWing the page unnecessarily: check swapcount+mapcount.
> */
> -int can_share_swap_page(struct page *page)
> +int can_reuse_swap_page_unlocked(struct page *page)
> {
> - int count;
> -
> - BUG_ON(!PageLocked(page));
> - count = page_mapcount(page);
> - if (count <= 1 && PageSwapCache(page))
> - count += page_swapcount(page);
> - return count == 1;
> + return page_swapcount(page, 1) == 1;
> }
>
> /*
> - * Work out if there are any other processes sharing this
> - * swap cache page. Free it if you can. Return success.
> + * Can do_wp_page() make the faulting swapcache page writable without COW,
> + * having acquired PageLocked? In this case, since we have that lock and
> + * are about to modify the page, we'd better free its swap space - it won't
> + * be read again, and writing there later would probably require an extra seek.
> */
> -int remove_exclusive_swap_page(struct page *page)
> +int can_reuse_swap_page_locked(struct page *page)
> {
> - int retval;
> - struct swap_info_struct * p;
> - swp_entry_t entry;
> + VM_BUG_ON(!PageLocked(page));
> + if (page_swapcount(page, 1) != 1)
> + return 0;
> + if (!PageSwapCache(page))
> + return 1;
> + if (PageWriteback(page))
> + return 1;
>
> - BUG_ON(PagePrivate(page));
> - BUG_ON(!PageLocked(page));
> + delete_from_swap_cache(page);
> + SetPageDirty(page);
> + return 1;
> +}
>
> +/*
> + * If swap is getting full, or if there are no more mappings of this page,
> + * then try_to_free_swap is called to free its swap space.
> + */
> +int try_to_free_swap(struct page *page)
> +{
> + VM_BUG_ON(!PageLocked(page));
> if (!PageSwapCache(page))
> return 0;
> if (PageWriteback(page))
> return 0;
> - if (page_count(page) != 2) /* 2: us + cache */
> + if (page_swapcount(page, 0)) /* Here we don't care about mapcount */
> return 0;
>
> - entry.val = page_private(page);
> - p = swap_info_get(entry);
> - if (!p)
> - return 0;
> -
> - /* Is the only swap cache user the cache itself? */
> - retval = 0;
> - if (p->swap_map[swp_offset(entry)] == 1) {
> - /* Recheck the page count with the swapcache lock held.. */
> - write_lock_irq(&swapper_space.tree_lock);
> - if ((page_count(page) == 2) && !PageWriteback(page)) {
> - __delete_from_swap_cache(page);
> - SetPageDirty(page);
> - retval = 1;
> - }
> - write_unlock_irq(&swapper_space.tree_lock);
> - }
> - spin_unlock(&swap_lock);
> -
> - if (retval) {
> - swap_free(entry);
> - page_cache_release(page);
> - }
> -
> - return retval;
> + delete_from_swap_cache(page);
> + SetPageDirty(page);
> + return 1;
> }
>
> /*
> @@ -400,6 +413,7 @@ void free_swap_and_cache(swp_entry_t ent
>
> p = swap_info_get(entry);
> if (p) {
> + spin_lock(&swap_lock);
> if (swap_entry_free(p, swp_offset(entry)) == 1) {
> page = find_get_page(&swapper_space, entry.val);
> if (page && unlikely(TestSetPageLocked(page))) {
> @@ -410,14 +424,10 @@ void free_swap_and_cache(swp_entry_t ent
> spin_unlock(&swap_lock);
> }
> if (page) {
> - int one_user;
> -
> - BUG_ON(PagePrivate(page));
> - one_user = (page_count(page) == 2);
> - /* Only cache user (+us), or swap space full? Free it! */
> + /* Not mapped elsewhere, or swap space full? Free it! */
> /* Also recheck PageSwapCache after page is locked (above) */
> if (PageSwapCache(page) && !PageWriteback(page) &&
> - (one_user || vm_swap_full())) {
> + (!page_mapped(page) || vm_swap_full())) {
> delete_from_swap_cache(page);
> SetPageDirty(page);
> }
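
The set_page_private() versus SetPageSwapCache() ordering above is the
usual publish/subscribe barrier pairing: the writer stores the payload
before raising the flag, and the reader, once it sees the flag, issues
a read barrier before loading the payload. A minimal sketch of the
idiom in isolation (shared_data, flag and use() are placeholder names,
not kernel symbols):

	/* Writer: publish the payload, then the flag that guards it */
	shared_data = value;
	smp_wmb();	/* payload store ordered before flag store */
	flag = 1;

	/* Reader: test the flag; only then is the payload safe to load */
	if (flag) {
		smp_rmb();	/* flag load ordered before payload load */
		use(shared_data);
	}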
* Re: Can get_user_pages( ,write=1, force=1, ) result in a read-only pte and _count=2?
2008-06-24 15:19 ` Robin Holt
@ 2008-06-24 20:19 ` Hugh Dickins
0 siblings, 0 replies; 30+ messages in thread
From: Hugh Dickins @ 2008-06-24 20:19 UTC (permalink / raw)
To: Robin Holt
Cc: Nick Piggin, Ingo Molnar, Christoph Lameter, Jack Steiner,
Rik van Riel, Lee Schermerhorn, linux-mm
On Tue, 24 Jun 2008, Robin Holt wrote:
> OK. I just gave it a try and I still get a failure with this patch
> applied.
Thanks a lot for trying. That's disappointing; I don't understand it.
Would it be possible for you to debug what is actually happening when
this page is COWed? It seems to me that there's something else going
on that we don't yet know about.
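Something like this bit of instrumentation (hypothetical, not part of
the patch) dropped into do_wp_page() just before the copy might show
which anonymous pages are being COWed while extra references are held:

	/* Hypothetical debug aid: report anon pages COWed while
	 * references beyond the mapcount (+ swapcache) remain. */
	if (PageAnon(old_page) && page_count(old_page) >
			page_mapcount(old_page) + !!PageSwapCache(old_page))
		printk(KERN_DEBUG "COWing pinned page %p count %d mapcount %d\n",
			old_page, page_count(old_page), page_mapcount(old_page));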
> I can't help wondering if we (XPMEM, IB, etc) shouldn't be setting a
> page flag indicating that the attempt to swap this page out should not
> even be tried. I would guess this has already been discussed to death.
> If so, I am sorry, but I missed those discussions.
No, I don't think there's been any such discussion,
apart from the http://lkml.org/lkml/2006/9/14/384
threads I indicated before. get_user_pages has always been very
much about pinning a page in memory, despite attempts to swap it out.
If you want to prevent the page from being pulled out of its pagetable,
then use mlock(2); but it shouldn't be necessary, once we have that
fix to the PageLocked case. Unless there's something else going on.
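For reference, the userspace side of that would look roughly like the
sketch below (addr and len stand for whatever buffer gets registered;
an illustration, not a tested recipe):

	#include <sys/mman.h>
	#include <stdio.h>

	/* Pin the range so reclaim cannot swap it out from under
	 * the registration. */
	if (mlock(addr, len) != 0)
		perror("mlock");

	/* ... hand [addr, addr + len) to XPMEM/IB registration ... */

	/* Drop the pin once the registration is torn down. */
	munlock(addr, len);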
Hugh