[RFC] page fault retry with NOPAGE

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [RFC] page fault retry with NOPAGE_RETRY
@ 2006-09-14 22:55 Benjamin Herrenschmidt
  2006-09-15  0:19 ` Linus Torvalds
                   ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Benjamin Herrenschmidt @ 2006-09-14 22:55 UTC (permalink / raw)
  To: linux-mm; +Cc: Linux Kernel list, Linus Torvalds

Hi !

While tracking some issues with mapping of SPEs on Cell into userland
processes, I figured out that we have a problem that can be solved with
a small generic change (which might be useful for other situations as
well).

Basically, when a user tries to access some mmap'ed SPE registers, the
no_page() function for those tries to schedule in a physial SPE the
virtual SPE context of that user. For that, it needs to wait for an SPE
to become available, which can take some time, and with the current SPE
scheduler, can take a long time indeed.

This wait is interruptible. However, we have no way to fail "gracefully"
from no_page() if the routine we use underneath returns a failure due to
a signal (we use, logically, -EINTR). It's a generic issue with no_page
handlers. They can either wait non-interruptibly, or fail with a sigbus
or oom result.

However, it would be trivial to add the ability for no_page() to fail in
such a way that execution just comes back all the way to userland (and
thus pending signals are delivered), and the faulting instruction is
simply taken again. All we need is something like:

in mm.h:

 #define NOPAGE_SIGBUS   (NULL)
 #define NOPAGE_OOM      ((struct page *) (-1))
+#define NOPAGE_RETRY	((struct page *) (-2))

and in memory.c, in do_no_page():

        /* no page was available -- either SIGBUS or OOM */
        if (new_page == NOPAGE_SIGBUS)
                return VM_FAULT_SIGBUS;
        if (new_page == NOPAGE_OOM)
                return VM_FAULT_OOM;
+       if (new_page == NOPAGE_RETRY)
+               return VM_FAULT_MINOR;

In fact, it would be nicer to turn the whole serie into -one- if
(unlikely(something wrong)) and inside that if, a switch case of fault
codes. That would probably make the code better than what it is today
and allow for adding error cases without performance harm to the fast
path. I'll provide a patch doing that if there is no strong disagreement
with the approach.

With the above change, a no_page() handler can just return NOPAGE_RETRY
(for example if it was waiting on availability of a hardware resource
and a signal got detected) and this will simply be counted as a minor
fault. I've looked at the callers of handle_mm_fault() on a couple of
archs and it seems that the approach will work though I haven't looked
at all of them yet (though at this point, I only plan to use that for
PowerPC SPE code anyway).

Somebody pointed to me that this might also be used to shoot another
bird, though I have not really though about it and wether it's good or
bad, which is the old problem of needing struct page for things that can
be mmap'ed. Using that trick, a driver could do the set_pte() itself in
the no_page handler and return NOPAGE_RETRY. I'm not sure about
advertising that feature though as I like all  callers of things like
set_pte() to be in well known locations, as there are various issues
related to manipulating the page tables that driver writers might not
get right. Though I suppose that if we consider the approach good, we
can provide a helper that "does the right thing" as well (like calling
update_mmu_cache(), flush_tlb_whatever(), etc...).

Any comment ? If it's ok, I'll cook a patch candidate for 2.6.19 along
with the matching change to spufs to make use of it.

Ben.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] page fault retry with NOPAGE_RETRY
  2006-09-14 22:55 [RFC] page fault retry with NOPAGE_RETRY Benjamin Herrenschmidt
@ 2006-09-15  0:19 ` Linus Torvalds
  2006-09-15  7:11 ` Andrew Morton
  2006-09-15 21:35 ` Arnd Bergmann
  2 siblings, 0 replies; 28+ messages in thread
From: Linus Torvalds @ 2006-09-15  0:19 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linux-mm, Linux Kernel list


On Fri, 15 Sep 2006, Benjamin Herrenschmidt wrote:
> 
> This wait is interruptible. However, we have no way to fail "gracefully"
> from no_page() if the routine we use underneath returns a failure due to
> a signal (we use, logically, -EINTR). It's a generic issue with no_page
> handlers. They can either wait non-interruptibly, or fail with a sigbus
> or oom result.

I certainly personally have nothing against adding a NOPAGE_RETRY, it 
seems very straightforward. 

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] page fault retry with NOPAGE_RETRY
  2006-09-14 22:55 [RFC] page fault retry with NOPAGE_RETRY Benjamin Herrenschmidt
  2006-09-15  0:19 ` Linus Torvalds
@ 2006-09-15  7:11 ` Andrew Morton
  2006-09-15  7:35   ` Andrew Morton
  2006-09-19 23:35   ` Mike Waychison
  2006-09-15 21:35 ` Arnd Bergmann
  2 siblings, 2 replies; 28+ messages in thread
From: Andrew Morton @ 2006-09-15  7:11 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linux-mm, Linux Kernel list, Linus Torvalds, Mike Waychison

On Fri, 15 Sep 2006 08:55:08 +1000
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> in mm.h:
> 
>  #define NOPAGE_SIGBUS   (NULL)
>  #define NOPAGE_OOM      ((struct page *) (-1))
> +#define NOPAGE_RETRY	((struct page *) (-2))
> 
> and in memory.c, in do_no_page():
> 
> 
>         /* no page was available -- either SIGBUS or OOM */
>         if (new_page == NOPAGE_SIGBUS)
>                 return VM_FAULT_SIGBUS;
>         if (new_page == NOPAGE_OOM)
>                 return VM_FAULT_OOM;
> +       if (new_page == NOPAGE_RETRY)
> +               return VM_FAULT_MINOR;

Google are using such a patch (Mike owns it).

It is to reduce mmap_sem contention with threaded apps.  If one thread
takes a major pagefault, it will perform a synchronous disk read while
holding down_read(mmap_sem).  This causes any other thread which wishes to
perform any mmap/munmap/mprotect/etc (which does down_write(mmap_sem)) to
block behind that disk IO.  As you can understand, that can be pretty bad
in the right circumstances.

The patch modifies filemap_nopage() to look to see if it needs to block on
the page coming uptodate and if so, it does up_read(mmap_sem);
wait_on_page_locked(); return NOPAGE_RETRY.  That causes the top-level
do_page_fault() code to rerun the entire pagefault.

It hasn't been submitted for upstream yet because

a) To avoid livelock possibilities (page reclaim, looping FADV_DONTNEED,
   etc) it only does the retry a single time.  After that it falls back to
   the traditional synchronous-read-inside-mmap_sem approach.  A flag in
   current->flags is used to detect the second attempt.  It keeps the patch
   simple, but is a bit hacky.

   To resolve this cleanly I'm thinking we change all the pagefault code
   everywhere: instantiate a new `struct pagefault_args' in do_page_fault()
   and pass that all around the place.  So all the pagefault code, all the
   ->nopage handlers etc will take a single argument.

   This will, I hope, result in less code, faster code and less stack
   consumption.  It could also be used for things like the
   lock-the-page-in-filemap_nopage() proposal: the ->nopage()
   implementation could set a boolean in pagefault_args indicating whether
   the page has been locked.

   And, of course, fielmap_nopage could set another boolean in
   pagefault_args to indicate that it has already tried to rerun the
   pagefault once.

b) It could be more efficient.  Most of the time, there's no need to
   back all the way out of the pagefault handler and rerun the whole thing.
   Because most of the time, nobody changed anything in the mm_struct.  We
   _could_ just retake the mmap_sem after the page comes uptodate and, if
   nothing has changed, proceed.  I see two ways of doing this:

   - The simple way: look to see if any other processes are sharing
     this mm_struct.  If not, just do the synchronous read inside mmap_sem.

   - The better way: put a sequence counter in the mm_struct,
     increment that in every place where down_write(mmap_sem) is performed.
      The pagefault code then can re-take the mmap_sem for reading and look
     to see if the sequence counter is unchanged.  If it is, proceed.  If
     it _has_ changed then drop mmap_sem again and return NOPAGE_RETRY.

otoh, maybe using another bit in page->flags is a good compromise ;)

Mike, could you whip that patch out please?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] page fault retry with NOPAGE_RETRY
  2006-09-15  7:11 ` Andrew Morton
@ 2006-09-15  7:35   ` Andrew Morton
  2006-09-15 13:30     ` Hugh Dickins
  2006-09-19 23:35   ` Mike Waychison
  1 sibling, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2006-09-15  7:35 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, linux-mm, Linux Kernel list,
	Linus Torvalds, Mike Waychison

On Fri, 15 Sep 2006 00:11:51 -0700
Andrew Morton <akpm@osdl.org> wrote:

> b) It could be more efficient.  Most of the time, there's no need to
>    back all the way out of the pagefault handler and rerun the whole thing.
>    Because most of the time, nobody changed anything in the mm_struct.  We
>    _could_ just retake the mmap_sem after the page comes uptodate and, if
>    nothing has changed, proceed.  I see two ways of doing this:
> 
>    - The simple way: look to see if any other processes are sharing
>      this mm_struct.  If not, just do the synchronous read inside mmap_sem.

This assumes that no other heavyweight process will try to modify this
single-threaded process's mm.  I don't _think_ that happens anywhere, does
it?  access_process_vm() is the only case I can think of, and it does
down_read(other process's mmap_sem).

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] page fault retry with NOPAGE_RETRY
  2006-09-15  7:35   ` Andrew Morton
@ 2006-09-15 13:30     ` Hugh Dickins
  2006-09-16  1:03       ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 28+ messages in thread
From: Hugh Dickins @ 2006-09-15 13:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Benjamin Herrenschmidt, linux-mm, Linux Kernel list,
	Linus Torvalds, Mike Waychison

On Fri, 15 Sep 2006, Andrew Morton wrote:
> 
> This assumes that no other heavyweight process will try to modify this
> single-threaded process's mm.  I don't _think_ that happens anywhere, does
> it?  access_process_vm() is the only case I can think of,

"Modify" in the sense of fault into.
Yes, access_process_vm() is all I can think of too.

> and it does down_read(other process's mmap_sem).

If there were anything else, it'd have to do so too (if not down_write).

I too like NOPAGE_RETRY: as you've both observed, it can help to solve
several different problems.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] page fault retry with NOPAGE_RETRY
  2006-09-14 22:55 [RFC] page fault retry with NOPAGE_RETRY Benjamin Herrenschmidt
  2006-09-15  0:19 ` Linus Torvalds
  2006-09-15  7:11 ` Andrew Morton
@ 2006-09-15 21:35 ` Arnd Bergmann
  2 siblings, 0 replies; 28+ messages in thread
From: Arnd Bergmann @ 2006-09-15 21:35 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linux-mm, Linux Kernel list, Linus Torvalds

Am Friday 15 September 2006 00:55 schrieb Benjamin Herrenschmidt:
> Somebody pointed to me that this might also be used to shoot another
> bird, though I have not really though about it and wether it's good or
> bad, which is the old problem of needing struct page for things that can
> be mmap'ed. Using that trick, a driver could do the set_pte() itself in
> the no_page handler and return NOPAGE_RETRY. I'm not sure about
> advertising that feature though as I like all A callers of things like
> set_pte() to be in well known locations, as there are various issues
> related to manipulating the page tables that driver writers might not
> get right. Though I suppose that if we consider the approach good, we
> can provide a helper that "does the right thing" as well (like calling
> update_mmu_cache(), flush_tlb_whatever(), etc...).

One more point where it can help: When the backing store for the spufs
mem file changes between vmalloc memory backed and pointing to a physical
spu, we need to change vm_page_prot between (_PAGE_NO_CACHE | _PAGE_GUARDED)
and the opposite. While all my investigations (with help from Hugh Dickins
and Christoph Hellwig) show that it should be safe to do in the current
code, the idea is still scary. When the nopage function for that file
can simply return NOPAGE_RETRY after setting up the page tables,
we don't need to worry about vm_page_prot any more.

	Arnd <><

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] page fault retry with NOPAGE_RETRY
  2006-09-15 13:30     ` Hugh Dickins
@ 2006-09-16  1:03       ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 28+ messages in thread
From: Benjamin Herrenschmidt @ 2006-09-16  1:03 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, linux-mm, Linux Kernel list, Linus Torvalds,
	Mike Waychison

On Fri, 2006-09-15 at 14:30 +0100, Hugh Dickins wrote:
> On Fri, 15 Sep 2006, Andrew Morton wrote:
> > 
> > This assumes that no other heavyweight process will try to modify this
> > single-threaded process's mm.  I don't _think_ that happens anywhere, does
> > it?  access_process_vm() is the only case I can think of,
> 
> "Modify" in the sense of fault into.
> Yes, access_process_vm() is all I can think of too.
> 
> > and it does down_read(other process's mmap_sem).
> 
> If there were anything else, it'd have to do so too (if not down_write).
> 
> I too like NOPAGE_RETRY: as you've both observed, it can help to solve
> several different problems.

Yes, I don't need any of the safeguards that Andrew mentioned in my case
though. I want to return all the way to userland because I want signals
to be handled (which might also be a good thing in your case in fact, so
that a process being starved by that new mecanism can still be
interrupted).

I would ask that if you decide that the more complex approach is not
2.6.19 material, that the simple addition of NOPAGE_RETRY as I've
defined could be included in a first step so I can solve my problem (and
possibly other drivers wanting to do funky things with no_page() and
still take signals), and the google patch be rebased on top of that for
additional simmering :)

Cheers,
Ben.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] page fault retry with NOPAGE_RETRY
  2006-09-15  7:11 ` Andrew Morton
  2006-09-15  7:35   ` Andrew Morton
@ 2006-09-19 23:35   ` Mike Waychison
  2006-09-19 23:50     ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 28+ messages in thread
From: Mike Waychison @ 2006-09-19 23:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Benjamin Herrenschmidt, linux-mm, Linux Kernel list, Linus Torvalds

[-- Attachment #1: Type: text/plain, Size: 4540 bytes --]

Patch attached.

As Andrew points out, the logic is a bit hacky and using a flag in 
current->flags to determine whether we have done the retry or not already.

I too think the right approach to being able to handle these kinds of 
retries in a more general fashion is to introduce a struct 
pagefault_args along the page faulting path.  Within it, we could 
introduce a reason for the retry so the higher levels would be able to 
better understand what to do.

struct pagefault_args {
   u32 reason;
};

#define PAGEFAULT_MAYRETRY_IO         0x1
#define PAGEFAULT_RETRY_IO            0x2
#define PAGEFAULT_RETRY_SIGNALPENDING 0x4

do_page_fault could for example set PAGEFAULT_MAYRETRY_IO, letting a 
nopage implementation drop locks for IO and signalling back up to 
do_page_fault by also setting PAGEFAULT_RETRY_IO.

PAGEFAULT_RETRY_SIGNALPENDING could be set to tell do_page_fault to 
deliver the signal and replay the faulting instruction (as opposed to 
the "goto retry" in the patch attached).

Mike Waychison

Andrew Morton wrote:
> On Fri, 15 Sep 2006 08:55:08 +1000
> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> 
> 
>>in mm.h:
>>
>> #define NOPAGE_SIGBUS   (NULL)
>> #define NOPAGE_OOM      ((struct page *) (-1))
>>+#define NOPAGE_RETRY	((struct page *) (-2))
>>
>>and in memory.c, in do_no_page():
>>
>>
>>        /* no page was available -- either SIGBUS or OOM */
>>        if (new_page == NOPAGE_SIGBUS)
>>                return VM_FAULT_SIGBUS;
>>        if (new_page == NOPAGE_OOM)
>>                return VM_FAULT_OOM;
>>+       if (new_page == NOPAGE_RETRY)
>>+               return VM_FAULT_MINOR;
> 
> 
> Google are using such a patch (Mike owns it).
> 
> It is to reduce mmap_sem contention with threaded apps.  If one thread
> takes a major pagefault, it will perform a synchronous disk read while
> holding down_read(mmap_sem).  This causes any other thread which wishes to
> perform any mmap/munmap/mprotect/etc (which does down_write(mmap_sem)) to
> block behind that disk IO.  As you can understand, that can be pretty bad
> in the right circumstances.
> 
> The patch modifies filemap_nopage() to look to see if it needs to block on
> the page coming uptodate and if so, it does up_read(mmap_sem);
> wait_on_page_locked(); return NOPAGE_RETRY.  That causes the top-level
> do_page_fault() code to rerun the entire pagefault.
> 
> It hasn't been submitted for upstream yet because
> 
> a) To avoid livelock possibilities (page reclaim, looping FADV_DONTNEED,
>    etc) it only does the retry a single time.  After that it falls back to
>    the traditional synchronous-read-inside-mmap_sem approach.  A flag in
>    current->flags is used to detect the second attempt.  It keeps the patch
>    simple, but is a bit hacky.
> 
>    To resolve this cleanly I'm thinking we change all the pagefault code
>    everywhere: instantiate a new `struct pagefault_args' in do_page_fault()
>    and pass that all around the place.  So all the pagefault code, all the
>    ->nopage handlers etc will take a single argument.
> 
>    This will, I hope, result in less code, faster code and less stack
>    consumption.  It could also be used for things like the
>    lock-the-page-in-filemap_nopage() proposal: the ->nopage()
>    implementation could set a boolean in pagefault_args indicating whether
>    the page has been locked.
> 
>    And, of course, fielmap_nopage could set another boolean in
>    pagefault_args to indicate that it has already tried to rerun the
>    pagefault once.
> 
> b) It could be more efficient.  Most of the time, there's no need to
>    back all the way out of the pagefault handler and rerun the whole thing.
>    Because most of the time, nobody changed anything in the mm_struct.  We
>    _could_ just retake the mmap_sem after the page comes uptodate and, if
>    nothing has changed, proceed.  I see two ways of doing this:
> 
>    - The simple way: look to see if any other processes are sharing
>      this mm_struct.  If not, just do the synchronous read inside mmap_sem.
> 
>    - The better way: put a sequence counter in the mm_struct,
>      increment that in every place where down_write(mmap_sem) is performed.
>       The pagefault code then can re-take the mmap_sem for reading and look
>      to see if the sequence counter is unchanged.  If it is, proceed.  If
>      it _has_ changed then drop mmap_sem again and return NOPAGE_RETRY.
> 
> otoh, maybe using another bit in page->flags is a good compromise ;)
> 
> Mike, could you whip that patch out please?


[-- Attachment #2: mmap_sem_drop_for_faults-2.6.18-rc7.patch --]
[-- Type: text/plain, Size: 7742 bytes --]

	Allow major faults to drop the mmap_sem read lock while waiting for IO to complete.  This is done by flagging in current->flags that the caller can tolerate a retry in the ->nopage path.
	
	Benchmarks indicate that this double data structure walking in the case of a major fault results in << 1% performance hit. (test was to balloon and drain the page cache, followed by 64 threads faulting in pages off disk in reverse order).
	
	Signed-off-by: Mike Waychison <mikew@google.com>

Differences ...

 arch/i386/mm/fault.c   |   30 ++++++++++++++++++++++++++----
 arch/x86_64/mm/fault.c |   30 ++++++++++++++++++++++++++----
 include/linux/mm.h     |    2 ++
 include/linux/sched.h  |    1 +
 mm/filemap.c           |   22 +++++++++++++++++++++-
 mm/memory.c            |    4 ++++
 6 files changed, 80 insertions(+), 9 deletions(-)

Index: linux-2.6.18-rc7/arch/i386/mm/fault.c
===================================================================
--- linux-2.6.18-rc7.orig/arch/i386/mm/fault.c	2006-09-19 16:04:21.000000000 -0700
+++ linux-2.6.18-rc7/arch/i386/mm/fault.c	2006-09-19 16:08:15.000000000 -0700
@@ -325,8 +325,8 @@ static inline int vmalloc_fault(unsigned
  *	bit 3 == 1 means use of reserved bit detected
  *	bit 4 == 1 means fault was an instruction fetch
  */
-fastcall void __kprobes do_page_fault(struct pt_regs *regs,
-				      unsigned long error_code)
+static inline void __do_page_fault(struct pt_regs *regs,
+				   unsigned long error_code)
 {
 	struct task_struct *tsk;
 	struct mm_struct *mm;
@@ -407,7 +407,7 @@ fastcall void __kprobes do_page_fault(st
 			goto bad_area_nosemaphore;
 		down_read(&mm->mmap_sem);
 	}
-
+retry:
 	vma = find_vma(mm, address);
 	if (!vma)
 		goto bad_area;
@@ -461,7 +461,15 @@ good_area:
 	 */
 	switch (handle_mm_fault(mm, vma, address, write)) {
 		case VM_FAULT_MINOR:
-			tsk->min_flt++;
+			/*
+			 * If we had to retry (PF_FAULT_MAYRETRY cleared), then
+			 * the page originally wasn't up to date before the
+			 * retry, but now it is.
+			 */
+			if (!(current->flags & PF_FAULT_MAYRETRY))
+				tsk->maj_flt++;
+			else
+				tsk->min_flt++;
 			break;
 		case VM_FAULT_MAJOR:
 			tsk->maj_flt++;
@@ -470,6 +478,12 @@ good_area:
 			goto do_sigbus;
 		case VM_FAULT_OOM:
 			goto out_of_memory;
+		case VM_FAULT_RETRY:
+			if (current->flags & PF_FAULT_MAYRETRY) {
+				current->flags &= ~PF_FAULT_MAYRETRY;
+				goto retry;
+			}
+			BUG();
 		default:
 			BUG();
 	}
@@ -625,6 +639,14 @@ do_sigbus:
 	force_sig_info_fault(SIGBUS, BUS_ADRERR, address, tsk);
 }
 
+fastcall void __kprobes do_page_fault(struct pt_regs *regs,
+				      unsigned long error_code)
+{
+	current->flags |= PF_FAULT_MAYRETRY;
+	__do_page_fault(regs, error_code);
+	current->flags &= ~PF_FAULT_MAYRETRY;
+}
+
 #ifndef CONFIG_X86_PAE
 void vmalloc_sync_all(void)
 {
Index: linux-2.6.18-rc7/arch/x86_64/mm/fault.c
===================================================================
--- linux-2.6.18-rc7.orig/arch/x86_64/mm/fault.c	2006-09-19 16:04:28.000000000 -0700
+++ linux-2.6.18-rc7/arch/x86_64/mm/fault.c	2006-09-19 16:09:02.000000000 -0700
@@ -336,8 +336,8 @@ int exception_trace = 1;
  * and the problem, and then passes it off to one of the appropriate
  * routines.
  */
-asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,
-					unsigned long error_code)
+static inline void __do_page_fault(struct pt_regs *regs,
+				   unsigned long error_code)
 {
 	struct task_struct *tsk;
 	struct mm_struct *mm;
@@ -435,7 +435,7 @@ asmlinkage void __kprobes do_page_fault(
 			goto bad_area_nosemaphore;
 		down_read(&mm->mmap_sem);
 	}
-
+retry:
 	vma = find_vma(mm, address);
 	if (!vma)
 		goto bad_area;
@@ -481,13 +481,27 @@ good_area:
 	 */
 	switch (handle_mm_fault(mm, vma, address, write)) {
 	case VM_FAULT_MINOR:
-		tsk->min_flt++;
+		/*
+		 * If we had to retry (PF_FAULT_MAYRETRY cleared), then the
+		 * page originally wasn't up to date before the retry, but now
+		 * it is.
+		 */
+		if (!(current->flags & PF_FAULT_MAYRETRY))
+			tsk->maj_flt++;
+		else
+			tsk->min_flt++;
 		break;
 	case VM_FAULT_MAJOR:
 		tsk->maj_flt++;
 		break;
 	case VM_FAULT_SIGBUS:
 		goto do_sigbus;
+	case VM_FAULT_RETRY:
+		if (current->flags & PF_FAULT_MAYRETRY) {
+			current->flags &= ~PF_FAULT_MAYRETRY;
+			goto retry;
+		}
+		BUG();
 	default:
 		goto out_of_memory;
 	}
@@ -613,6 +627,14 @@ do_sigbus:
 	return;
 }
 
+asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,
+					unsigned long error_code)
+{
+	current->flags |= PF_FAULT_MAYRETRY;
+	__do_page_fault(regs, error_code);
+	current->flags &= ~PF_FAULT_MAYRETRY;
+}
+
 DEFINE_SPINLOCK(pgd_lock);
 struct page *pgd_list;
 
Index: linux-2.6.18-rc7/include/linux/mm.h
===================================================================
--- linux-2.6.18-rc7.orig/include/linux/mm.h	2006-09-19 16:04:53.000000000 -0700
+++ linux-2.6.18-rc7/include/linux/mm.h	2006-09-19 16:05:33.000000000 -0700
@@ -623,6 +623,7 @@ static inline int page_mapped(struct pag
  */
 #define NOPAGE_SIGBUS	(NULL)
 #define NOPAGE_OOM	((struct page *) (-1))
+#define NOPAGE_RETRY	((struct page *) (-2))
 
 /*
  * Different kinds of faults, as returned by handle_mm_fault().
@@ -633,6 +634,7 @@ static inline int page_mapped(struct pag
 #define VM_FAULT_SIGBUS	0x01
 #define VM_FAULT_MINOR	0x02
 #define VM_FAULT_MAJOR	0x03
+#define VM_FAULT_RETRY	0x04
 
 /* 
  * Special case for get_user_pages.
Index: linux-2.6.18-rc7/mm/filemap.c
===================================================================
--- linux-2.6.18-rc7.orig/mm/filemap.c	2006-09-19 16:04:55.000000000 -0700
+++ linux-2.6.18-rc7/mm/filemap.c	2006-09-19 16:05:33.000000000 -0700
@@ -1486,7 +1486,27 @@ page_not_uptodate:
 		majmin = VM_FAULT_MAJOR;
 		count_vm_event(PGMAJFAULT);
 	}
-	lock_page(page);
+
+	if (!(current->flags & PF_FAULT_MAYRETRY)) {
+		lock_page(page);
+	} else if (TestSetPageLocked(page)) {
+		struct mm_struct *mm = area->vm_mm;
+
+		/*
+		 * Page is already locked by someone else.
+		 *
+		 * We don't want to be holding down_read(mmap_sem)
+		 * inside lock_page(), so use wait_on_page_locked() here.
+		 */
+		up_read(&mm->mmap_sem);
+		wait_on_page_locked(page);
+		down_read(&mm->mmap_sem);
+		/*
+		 * The VMA tree may have changed at this point.
+		 */
+		page_cache_release(page);
+		return NOPAGE_RETRY;
+	}
 
 	/* Did it get unhashed while we waited for it? */
 	if (!page->mapping) {
Index: linux-2.6.18-rc7/mm/memory.c
===================================================================
--- linux-2.6.18-rc7.orig/mm/memory.c	2006-09-19 16:04:55.000000000 -0700
+++ linux-2.6.18-rc7/mm/memory.c	2006-09-19 16:05:33.000000000 -0700
@@ -2122,6 +2122,10 @@ retry:
 		return VM_FAULT_SIGBUS;
 	if (new_page == NOPAGE_OOM)
 		return VM_FAULT_OOM;
+	/* page may be available, but we have to restart the process because
+	 * mmap_sem was dropped during the ->nopage */
+	if (new_page == NOPAGE_RETRY)
+		return VM_FAULT_RETRY;
 
 	/*
 	 * Should we do an early C-O-W break?
Index: linux-2.6.18-rc7/include/linux/sched.h
===================================================================
--- linux-2.6.18-rc7.orig/include/linux/sched.h	2006-09-19 16:04:54.000000000 -0700
+++ linux-2.6.18-rc7/include/linux/sched.h	2006-09-19 16:10:01.000000000 -0700
@@ -1054,6 +1054,7 @@ static inline void put_task_struct(struc
 #define PF_SWAPWRITE	0x00800000	/* Allowed to write to swap */
 #define PF_SPREAD_PAGE	0x01000000	/* Spread page cache over cpuset */
 #define PF_SPREAD_SLAB	0x02000000	/* Spread some slab caches over cpuset */
+#define PF_FAULT_MAYRETRY 0x04000000	/* I may drop mmap_sem during fault */
 #define PF_MEMPOLICY	0x10000000	/* Non-default NUMA mempolicy */
 #define PF_MUTEX_TESTER	0x20000000	/* Thread belongs to the rt mutex tester */
 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] page fault retry with NOPAGE_RETRY
  2006-09-19 23:35   ` Mike Waychison
@ 2006-09-19 23:50     ` Benjamin Herrenschmidt
  2006-09-19 23:59       ` Andrew Morton
                         ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Benjamin Herrenschmidt @ 2006-09-19 23:50 UTC (permalink / raw)
  To: Mike Waychison; +Cc: Andrew Morton, linux-mm, Linux Kernel list, Linus Torvalds

On Tue, 2006-09-19 at 16:35 -0700, Mike Waychison wrote:
> Patch attached.
> 
> As Andrew points out, the logic is a bit hacky and using a flag in 
> current->flags to determine whether we have done the retry or not already.
> 
> I too think the right approach to being able to handle these kinds of 
> retries in a more general fashion is to introduce a struct 
> pagefault_args along the page faulting path.  Within it, we could 
> introduce a reason for the retry so the higher levels would be able to 
> better understand what to do.

 .../...

I need to re-read your mail and Andrew as at this point, I don't quite
see why we need that args and/or that current->flags bit instead of
always returning all the way to userland and let the faulting
instruction happen again (which means you don't block in the kernel, can
take signals etc... thus do you actually need to prevent multiple
retries ?)

Ben.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] page fault retry with NOPAGE_RETRY
  2006-09-19 23:50     ` Benjamin Herrenschmidt
@ 2006-09-19 23:59       ` Andrew Morton
  2006-09-20  0:06         ` Benjamin Herrenschmidt
  2006-09-20  0:05       ` Benjamin Herrenschmidt
  2006-09-20  1:14       ` Mike Waychison
  2 siblings, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2006-09-19 23:59 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Mike Waychison, linux-mm, Linux Kernel list, Linus Torvalds

On Wed, 20 Sep 2006 09:50:35 +1000
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> On Tue, 2006-09-19 at 16:35 -0700, Mike Waychison wrote:
> > Patch attached.
> > 
> > As Andrew points out, the logic is a bit hacky and using a flag in 
> > current->flags to determine whether we have done the retry or not already.
> > 
> > I too think the right approach to being able to handle these kinds of 
> > retries in a more general fashion is to introduce a struct 
> > pagefault_args along the page faulting path.  Within it, we could 
> > introduce a reason for the retry so the higher levels would be able to 
> > better understand what to do.
> 
>  .../...
> 
> I need to re-read your mail and Andrew as at this point, I don't quite
> see why we need that args and/or that current->flags bit instead of
> always returning all the way to userland and let the faulting
> instruction happen again (which means you don't block in the kernel, can
> take signals etc...

That would amount to a busy wait, waiting for the disk IO to complete.

So we need to go to sleep somewhere (in D state, because we _are_ waiting
for disk IO).  Returning all the way to userspace and immediately retaking
the fault is unneeded extra work.

> thus do you actually need to prevent multiple
> retries ?)

I expect there are livelock scenarios.  For example, process A could spin
on posix_fadvise(some libc text page, POSIX_FADV_DONTNEED), perhaps causing
other applications to get permanently stuck in the kernel.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] page fault retry with NOPAGE_RETRY
  2006-09-19 23:50     ` Benjamin Herrenschmidt
  2006-09-19 23:59       ` Andrew Morton
@ 2006-09-20  0:05       ` Benjamin Herrenschmidt
  2006-09-20  0:21         ` Andrew Morton
  2006-09-20  1:14       ` Mike Waychison
  2 siblings, 1 reply; 28+ messages in thread
From: Benjamin Herrenschmidt @ 2006-09-20  0:05 UTC (permalink / raw)
  To: Mike Waychison; +Cc: Andrew Morton, linux-mm, Linux Kernel list, Linus Torvalds

> I need to re-read your mail and Andrew as at this point, I don't quite
> see why we need that args and/or that current->flags bit instead of
> always returning all the way to userland and let the faulting
> instruction happen again (which means you don't block in the kernel, can
> take signals etc... thus do you actually need to prevent multiple
> retries ?)

Actually... I can see it's faster to not return all the way and take the
fault again ... though only in some cases. For example, if the pte has
been filled in the meantime (concurrent faults) it's actually faster to
just go back. The only reason I see why you need those args is to tell
the no_page() handler wether retrying is acceptable or wether it should
use the old way. Any reason why this is necessary at all ? What are you
trying to avoid by not letting it always do the retry path ?

My thinking was something around the lines of no_page() always does the
retry logic. Then, we do something like:

handle_pte_fault() gets modified. If do_no_page() returns
VM_FAULT_RETRY, it checks pte_present() again. If the PTE is present, it
returns VM_FAULT_MINOR. If PTE is absent, it checks for signals, and
returns VM_FAULT_MINOR if a signal is pending. If PTE is absent and no
signals are pending, it returns VM_FAULT_RETRY.

In addition, we still need to modify all archs do_page_fault() to handle
VM_FAULT_RETRY...

Or is there a specific scenario you are trying to avoid by keeping this
mecanism for preventing retries ?

Ben.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] page fault retry with NOPAGE_RETRY
  2006-09-19 23:59       ` Andrew Morton
@ 2006-09-20  0:06         ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 28+ messages in thread
From: Benjamin Herrenschmidt @ 2006-09-20  0:06 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Mike Waychison, linux-mm, Linux Kernel list, Linus Torvalds

On Tue, 2006-09-19 at 16:59 -0700, Andrew Morton wrote:
> On Wed, 20 Sep 2006 09:50:35 +1000
> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> 
> > On Tue, 2006-09-19 at 16:35 -0700, Mike Waychison wrote:
> > > Patch attached.
> > > 
> > > As Andrew points out, the logic is a bit hacky and using a flag in 
> > > current->flags to determine whether we have done the retry or not already.
> > > 
> > > I too think the right approach to being able to handle these kinds of 
> > > retries in a more general fashion is to introduce a struct 
> > > pagefault_args along the page faulting path.  Within it, we could 
> > > introduce a reason for the retry so the higher levels would be able to 
> > > better understand what to do.
> > 
> >  .../...
> > 
> > I need to re-read your mail and Andrew as at this point, I don't quite
> > see why we need that args and/or that current->flags bit instead of
> > always returning all the way to userland and let the faulting
> > instruction happen again (which means you don't block in the kernel, can
> > take signals etc...
> 
> That would amount to a busy wait, waiting for the disk IO to complete.
> 
> So we need to go to sleep somewhere (in D state, because we _are_ waiting
> for disk IO).  Returning all the way to userspace and immediately retaking
> the fault is unneeded extra work.

No, I'm not saying immediately... you do the wait thing in filemap.c.
Anyway, see my other message.

> > thus do you actually need to prevent multiple
> > retries ?)
> 
> I expect there are livelock scenarios.  For example, process A could spin
> on posix_fadvise(some libc text page, POSIX_FADV_DONTNEED), perhaps causing
> other applications to get permanently stuck in the kernel.

Unless you add a way to handle signals.. see my other mail.

Ben.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] page fault retry with NOPAGE_RETRY
  2006-09-20  0:05       ` Benjamin Herrenschmidt
@ 2006-09-20  0:21         ` Andrew Morton
  2006-09-20  1:57           ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2006-09-20  0:21 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Mike Waychison, linux-mm, Linux Kernel list, Linus Torvalds

On Wed, 20 Sep 2006 10:05:12 +1000
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> 
> > I need to re-read your mail and Andrew as at this point, I don't quite
> > see why we need that args and/or that current->flags bit instead of
> > always returning all the way to userland and let the faulting
> > instruction happen again (which means you don't block in the kernel, can
> > take signals etc... thus do you actually need to prevent multiple
> > retries ?)
> 
> Actually... I can see it's faster to not return all the way and take the
> fault again ... though only in some cases. For example, if the pte has
> been filled in the meantime (concurrent faults) it's actually faster to
> just go back. The only reason I see why you need those args is to tell
> the no_page() handler wether retrying is acceptable or wether it should
> use the old way. Any reason why this is necessary at all ? What are you
> trying to avoid by not letting it always do the retry path ?

Livelocks.  I've described the deliberate one, but I fear there are
accidental ones I haven't thought of.

> My thinking was something around the lines of no_page() always does the
> retry logic. Then, we do something like:
> 
> handle_pte_fault() gets modified. If do_no_page() returns
> VM_FAULT_RETRY, it checks pte_present() again. If the PTE is present, it
> returns VM_FAULT_MINOR. If PTE is absent, it checks for signals, and
> returns VM_FAULT_MINOR if a signal is pending. If PTE is absent and no
> signals are pending, it returns VM_FAULT_RETRY.

You forget that the point of this optimisation is to undo mmap_sem while
waiting on the disk IO.  Once we've done that we cannot go looking at ptes
or vmas: another thread could have munapped the whole lot or anything. 
(And we always need to be afraid of use_mm()..)

Once mmap_sem has been dropped we need to go all the way back to the
process's virtual address and redo the vma lookup.  The easy and clean way
of doing that is to rerun the fault, reuse all the existing code.

> In addition, we still need to modify all archs do_page_fault() to handle
> VM_FAULT_RETRY...

Yup, it would need some temporary ifdeffery while architetures convert.

But bear in mind my earlier comments regarding possible optimisations to
this code.

> Or is there a specific scenario you are trying to avoid by keeping this
> mecanism for preventing retries ?
> 
> Ben.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] page fault retry with NOPAGE_RETRY
  2006-09-19 23:50     ` Benjamin Herrenschmidt
  2006-09-19 23:59       ` Andrew Morton
  2006-09-20  0:05       ` Benjamin Herrenschmidt
@ 2006-09-20  1:14       ` Mike Waychison
  2006-09-20  2:02         ` Benjamin Herrenschmidt
  2 siblings, 1 reply; 28+ messages in thread
From: Mike Waychison @ 2006-09-20  1:14 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Andrew Morton, linux-mm, Linux Kernel list, Linus Torvalds

Benjamin Herrenschmidt wrote:
> On Tue, 2006-09-19 at 16:35 -0700, Mike Waychison wrote:
> 
>>Patch attached.
>>
>>As Andrew points out, the logic is a bit hacky and using a flag in 
>>current->flags to determine whether we have done the retry or not already.
>>
>>I too think the right approach to being able to handle these kinds of 
>>retries in a more general fashion is to introduce a struct 
>>pagefault_args along the page faulting path.  Within it, we could 
>>introduce a reason for the retry so the higher levels would be able to 
>>better understand what to do.
> 
> 
>  .../...
> 
> I need to re-read your mail and Andrew as at this point, I don't quite
> see why we need that args and/or that current->flags bit instead of
> always returning all the way to userland and let the faulting
> instruction happen again (which means you don't block in the kernel, can
> take signals etc... thus do you actually need to prevent multiple
> retries ?)
> 
> Ben.
> 
> 

I think the disconnect here is that the retries in the mmap_sem case and 
the returning short for a signal are two different beasts, but they 
would both want to use a NOPAGE_RETRY return code.

In the case of a signal, we definitely want to return back to userspace 
and deliver it.  However, for dropping the mmap_sem while waiting for 
the synchronous IO, it's a lot easier to directly rerun the fault 
handler so that we can make another pass without allowing the for the 
drop (avoiding livelock).

If we were to return to userspace after having dropped mmap_sem (without 
updating pte, because mm/vmas may change) we would lose major vs minor 
fault accounting as well.

Mike Waychison

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] page fault retry with NOPAGE_RETRY
  2006-09-20  0:21         ` Andrew Morton
@ 2006-09-20  1:57           ` Benjamin Herrenschmidt
  2006-09-20  3:05             ` Andrew Morton
  0 siblings, 1 reply; 28+ messages in thread
From: Benjamin Herrenschmidt @ 2006-09-20  1:57 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Mike Waychison, linux-mm, Linux Kernel list, Linus Torvalds

> Livelocks.  I've described the deliberate one, but I fear there are
> accidental ones I haven't thought of.

But if you check for signal pending in handle_pte_fault() and possibly
do a cond_resched(), there should be no livelock situation...

> > My thinking was something around the lines of no_page() always does the
> > retry logic. Then, we do something like:
> > 
> > handle_pte_fault() gets modified. If do_no_page() returns
> > VM_FAULT_RETRY, it checks pte_present() again. If the PTE is present, it
> > returns VM_FAULT_MINOR. If PTE is absent, it checks for signals, and
> > returns VM_FAULT_MINOR if a signal is pending. If PTE is absent and no
> > signals are pending, it returns VM_FAULT_RETRY.
> 
> You forget that the point of this optimisation is to undo mmap_sem while
> waiting on the disk IO.  Once we've done that we cannot go looking at ptes
> or vmas: another thread could have munapped the whole lot or anything. 
> (And we always need to be afraid of use_mm()..)

Wait wait .. .we don't need to have the mmap sem to -look- at a PTE. Or
the hardware would be in pain when doing table walk. My point is, if the
PTE is -present- (which we can check without taking the mmap sem), we
just return to userland and re-do the instruction.

> Once mmap_sem has been dropped we need to go all the way back to the
> process's virtual address and redo the vma lookup. The easy and clean way
> of doing that is to rerun the fault, reuse all the existing code.

Sure, but what I'm saying is that we can still check if the PTE is
present or a signal pending and based on that, decide to return either
VM_FAULT_MINOR or VM_FAULT_RETRY. The former would cause do_page_fault()
to return all the way to userland while the later would just loop in
do_page_fault() as per Mike patch.

> > In addition, we still need to modify all archs do_page_fault() to handle
> > VM_FAULT_RETRY...
> 
> Yup, it would need some temporary ifdeffery while architetures convert.
> 
> But bear in mind my earlier comments regarding possible optimisations to
> this code.

Yup and I think my idea does, and I still don't see why we need this
additional burden of MAY_RETRY to carry around... (that is, I don't see
the livelock if we are careful enough to test for signals when we get a
VM_FAULT_RETRY result, and possibly cond_resched() or test
need_resched() and go back to userland, either way is fine by me).

Ben.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] page fault retry with NOPAGE_RETRY
  2006-09-20  1:14       ` Mike Waychison
@ 2006-09-20  2:02         ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 28+ messages in thread
From: Benjamin Herrenschmidt @ 2006-09-20  2:02 UTC (permalink / raw)
  To: Mike Waychison; +Cc: Andrew Morton, linux-mm, Linux Kernel list, Linus Torvalds

> I think the disconnect here is that the retries in the mmap_sem case and 
> the returning short for a signal are two different beasts, but they 
> would both want to use a NOPAGE_RETRY return code.
> 
> In the case of a signal, we definitely want to return back to userspace 
> and deliver it.  However, for dropping the mmap_sem while waiting for 
> the synchronous IO, it's a lot easier to directly rerun the fault 
> handler so that we can make another pass without allowing the for the 
> drop (avoiding livelock).

I'm not sure I see what your exactly livelock scenario is... 

> If we were to return to userspace after having dropped mmap_sem (without 
> updating pte, because mm/vmas may change) we would lose major vs minor 
> fault accounting as well.

My point is we do the looping in the fault handler if the PTE is still
absent and there is no signal pending and we return to userspace (and
drop the mmap_sem) if any of these criterias is met. We might want to
add a cond_resched() in the first case too.

Wouldn't that solve the livelock issue ?

Ben.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] page fault retry with NOPAGE_RETRY
  2006-09-20  1:57           ` Benjamin Herrenschmidt
@ 2006-09-20  3:05             ` Andrew Morton
  2006-09-20  5:04               ` Benjamin Herrenschmidt
  2006-09-20  5:06               ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 28+ messages in thread
From: Andrew Morton @ 2006-09-20  3:05 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Mike Waychison, linux-mm, Linux Kernel list, Linus Torvalds

On Wed, 20 Sep 2006 11:57:09 +1000
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> > You forget that the point of this optimisation is to undo mmap_sem while
> > waiting on the disk IO.  Once we've done that we cannot go looking at ptes
> > or vmas: another thread could have munapped the whole lot or anything. 
> > (And we always need to be afraid of use_mm()..)
> 
> Wait wait .. .we don't need to have the mmap sem to -look- at a PTE.

The pte resides in a pagetable page.  Once we've dropped mmap_sem, that
pagetable page might not be there any more: munmap() might have freed it. 
We have to retake mmap_sem, do a find_vma() and a new pagetable walk.

There are some optimisations we could make to avoid all of that in the
common case, but this is the conceptual behaviour.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] page fault retry with NOPAGE_RETRY
  2006-09-20  3:05             ` Andrew Morton
@ 2006-09-20  5:04               ` Benjamin Herrenschmidt
  2006-09-20  5:26                 ` Andrew Morton
  2006-09-20  5:06               ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 28+ messages in thread
From: Benjamin Herrenschmidt @ 2006-09-20  5:04 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Mike Waychison, linux-mm, Linux Kernel list, Linus Torvalds

On Tue, 2006-09-19 at 20:05 -0700, Andrew Morton wrote:

> resides in a pagetable page.  Once we've dropped mmap_sem, that
> pagetable page might not be there any more: munmap() might have freed it. 
> We have to retake mmap_sem, do a find_vma() and a new pagetable walk.
> 
> There are some optimisations we could make to avoid all of that in the
> common case, but this is the conceptual behaviour.

It's a non-issue anyway the no_page handler in Mike's patch _does_
re-take mmap_sem before returning RETRY thus my whole idea still stands
perfectly fine unless I've missed something, which means we can make it
without changing no_page arguments. Let me re-describe it:

 - somebody->no_page() returns RETRY. It may have dropped the mmap sem,
but if it did, like in Mike's patch, it will have re-taken it before
returning.

 - upon return (in handle_pte_fault typically) if we get something else
than that retry, we return 
as usual.

 - if we got RETRY we do something like

	if (signal_pending(current) || need_resched() || pte_present(*pte))
		return VM_FAULT_MINOR;
	else
		return VM_FAULT_RETRY;

Thus we still have to change arch to test for VM_FAULT_RETRY and loop on
it (or return to userland if they want but that's less optimal) but we
don't have to carry around a "MAY_RETRY" thing nor change no_page()
arguments.

The idea is that we can't livelock since we'll always schedule and we
can take signals so the process can always be killed.

We'll also avoid the loop and coming back if the PTE has been filled up
in the meantime (just a cheap optimisation avoiding a new find_vma()
etc...).

And it's simpler :)

Now, I may have missed something of course, but I'd like to know what.
So far, I don't see what won't work with the above. 

Ben.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] page fault retry with NOPAGE_RETRY
  2006-09-20  3:05             ` Andrew Morton
  2006-09-20  5:04               ` Benjamin Herrenschmidt
@ 2006-09-20  5:06               ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 28+ messages in thread
From: Benjamin Herrenschmidt @ 2006-09-20  5:06 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Mike Waychison, linux-mm, Linux Kernel list, Linus Torvalds

On Tue, 2006-09-19 at 20:05 -0700, Andrew Morton wrote:
> On Wed, 20 Sep 2006 11:57:09 +1000
> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> 
> > > You forget that the point of this optimisation is to undo mmap_sem while
> > > waiting on the disk IO.  Once we've done that we cannot go looking at ptes
> > > or vmas: another thread could have munapped the whole lot or anything. 
> > > (And we always need to be afraid of use_mm()..)
> > 
> > Wait wait .. .we don't need to have the mmap sem to -look- at a PTE.
> 
> The pte resides in a pagetable page.  Once we've dropped mmap_sem, that
> pagetable page might not be there any more: munmap() might have freed it. 
> We have to retake mmap_sem, do a find_vma() and a new pagetable walk.

Ok, see my other email, we have the mmap_sem anyway so it's fine. Note
that I got a little bit mislead on that "we can wlak the page table
without mmap_sem" because we can on powerpc :) but possibly not on
x86.... we use RCU to free pagetable pages to make that possible because
our hash refill code (equivalent to our TLB miss if you want to see it
that way) has to be able to do that. I think the x86 MMU has some way to
atomically do the walking which is why you don't need that on x86 but I
don't know x86 so ...

Ben.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] page fault retry with NOPAGE_RETRY
  2006-09-20  5:04               ` Benjamin Herrenschmidt
@ 2006-09-20  5:26                 ` Andrew Morton
  2006-09-20  6:54                   ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2006-09-20  5:26 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Mike Waychison, linux-mm, Linux Kernel list, Linus Torvalds

On Wed, 20 Sep 2006 15:04:24 +1000
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> On Tue, 2006-09-19 at 20:05 -0700, Andrew Morton wrote:
> 
> > resides in a pagetable page.  Once we've dropped mmap_sem, that
> > pagetable page might not be there any more: munmap() might have freed it. 
> > We have to retake mmap_sem, do a find_vma() and a new pagetable walk.
> > 
> > There are some optimisations we could make to avoid all of that in the
> > common case, but this is the conceptual behaviour.
> 
> It's a non-issue anyway the no_page handler in Mike's patch _does_
> re-take mmap_sem before returning RETRY thus my whole idea still stands
> perfectly fine unless I've missed something, which means we can make it
> without changing no_page arguments. Let me re-describe it:
> 
>  - somebody->no_page() returns RETRY. It may have dropped the mmap sem,
> but if it did, like in Mike's patch, it will have re-taken it before
> returning.
> 
>  - upon return (in handle_pte_fault typically) if we get something else
> than that retry, we return 
> as usual.
> 
>  - if we got RETRY we do something like
> 
> 	if (signal_pending(current) || need_resched() || pte_present(*pte))
> 		return VM_FAULT_MINOR;
> 	else
> 		return VM_FAULT_RETRY;
> 
> Thus we still have to change arch to test for VM_FAULT_RETRY and loop on
> it (or return to userland if they want but that's less optimal) but we
> don't have to carry around a "MAY_RETRY" thing nor change no_page()
> arguments.
> 
> The idea is that we can't livelock since we'll always schedule and we
> can take signals so the process can always be killed.
> 
> We'll also avoid the loop and coming back if the PTE has been filled up
> in the meantime (just a cheap optimisation avoiding a new find_vma()
> etc...).
> 
> And it's simpler :)
> 
> Now, I may have missed something of course, but I'd like to know what.
> So far, I don't see what won't work with the above. 
> 

It's a choice between two behaviours:

a) get stuck in the kernel until someone kills you and

b) fault the page in and proceed as expected.

Option b) is better, no?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] page fault retry with NOPAGE_RETRY
  2006-09-20  5:26                 ` Andrew Morton
@ 2006-09-20  6:54                   ` Benjamin Herrenschmidt
  2006-09-20 17:53                     ` Andrew Morton
  0 siblings, 1 reply; 28+ messages in thread
From: Benjamin Herrenschmidt @ 2006-09-20  6:54 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Mike Waychison, linux-mm, Linux Kernel list, Linus Torvalds

> It's a choice between two behaviours:
> 
> a) get stuck in the kernel until someone kills you and
> 
> b) fault the page in and proceed as expected.
> 
> Option b) is better, no?

That's what I don't understand... where is the actual race that can
cause the livelock you are mentioning. Is this that TryLock always
fails, we wait, the lock gets available but since we have dropped the
semaphore around the wait, it might get stolen again by the time we are
taking the mmap_sem back and that going on ever and ever again while
lock_page() would get precedence since we have the mmap_sem ?

Ben.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] page fault retry with NOPAGE_RETRY
  2006-09-20  6:54                   ` Benjamin Herrenschmidt
@ 2006-09-20 17:53                     ` Andrew Morton
  2006-09-21 22:05                       ` Benjamin Herrenschmidt
  2006-09-23 14:21                       ` Hugh Dickins
  0 siblings, 2 replies; 28+ messages in thread
From: Andrew Morton @ 2006-09-20 17:53 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Mike Waychison, linux-mm, Linux Kernel list, Linus Torvalds

On Wed, 20 Sep 2006 16:54:59 +1000
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> > It's a choice between two behaviours:
> > 
> > a) get stuck in the kernel until someone kills you and
> > 
> > b) fault the page in and proceed as expected.
> > 
> > Option b) is better, no?
> 
> That's what I don't understand... where is the actual race that can
> cause the livelock you are mentioning.

Suppose a program (let's call it "DoS") is written which sits in a loop
doing fadvise(FADV_DONTNEED) against some parts of /lib/libc.so.

Now suppose another process (let's call it "bash") tries to execute that
page.  bash will take a major fault, will submit a read and then it will do
wait_on_page().  I/O completes and the page comes unlocked.  Now there is a
time window in which DoS can shoot that page down again.  And it's quite a
lengthy window - bash need to wake up, get scheduled, take mmap_sem, return
to do_page_fault(), redo the vma lookup and the pagetable walk and the
pagecache lookup.

That's plenty of time in which DoS can shoot down the page again. 
Particularly since every other program in the machine is stuck in disk wait
in its pagefault handler ;)

All of this will cause bash to get permanently stuck in the kernel.  And I
don't think it's acceptable to just allow bash to be killed off:

- one would need a statically-linked shell to be able to do this.

- if one didn't kill off DoS first, it wouldn't help.  A statically
  linked `ps' is also needed.

- having to kill off sshd, xinetd, httpd, etc isn't a very happy solution.

- you can't kill off /sbin/init.

So I think there's a nasty DoS here if we permit infinite retries.  But
it's not just that - there might be other situations under really heavy
memory pressure where livelocks like this can occur.  It's just a general
robustness-of-implementation issue.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] page fault retry with NOPAGE_RETRY
  2006-09-20 17:53                     ` Andrew Morton
@ 2006-09-21 22:05                       ` Benjamin Herrenschmidt
  2006-09-21 22:41                         ` Andrew Morton
  2006-09-23 14:21                       ` Hugh Dickins
  1 sibling, 1 reply; 28+ messages in thread
From: Benjamin Herrenschmidt @ 2006-09-21 22:05 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Mike Waychison, linux-mm, Linux Kernel list, Linus Torvalds

> So I think there's a nasty DoS here if we permit infinite retries.  But
> it's not just that - there might be other situations under really heavy
> memory pressure where livelocks like this can occur.  It's just a general
> robustness-of-implementation issue.

Got it. Now, changing args to no_page() will be a pretty big task....

Ben.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] page fault retry with NOPAGE_RETRY
  2006-09-21 22:05                       ` Benjamin Herrenschmidt
@ 2006-09-21 22:41                         ` Andrew Morton
  2006-09-21 23:09                           ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2006-09-21 22:41 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Mike Waychison, linux-mm, Linux Kernel list, Linus Torvalds

On Fri, 22 Sep 2006 08:05:04 +1000
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> > So I think there's a nasty DoS here if we permit infinite retries.  But
> > it's not just that - there might be other situations under really heavy
> > memory pressure where livelocks like this can occur.  It's just a general
> > robustness-of-implementation issue.
> 
> Got it. Now, changing args to no_page() will be a pretty big task....
> 

Not as big as removing the pt_regs arg from every interrupt handler ;)

But pretty mechanical.  Problem is, I don't think we have our mechanic.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] page fault retry with NOPAGE_RETRY
  2006-09-21 22:41                         ` Andrew Morton
@ 2006-09-21 23:09                           ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 28+ messages in thread
From: Benjamin Herrenschmidt @ 2006-09-21 23:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Mike Waychison, linux-mm, Linux Kernel list, Linus Torvalds

On Thu, 2006-09-21 at 15:41 -0700, Andrew Morton wrote:
> On Fri, 22 Sep 2006 08:05:04 +1000
> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> 
> > > So I think there's a nasty DoS here if we permit infinite retries.  But
> > > it's not just that - there might be other situations under really heavy
> > > memory pressure where livelocks like this can occur.  It's just a general
> > > robustness-of-implementation issue.
> > 
> > Got it. Now, changing args to no_page() will be a pretty big task....
> > 
> 
> Not as big as removing the pt_regs arg from every interrupt handler ;)

Which is a change I'm not 100% convinced about btw ... I remember
actually using that in a few occasions... mostly for debugging though.
Bah, anyway, I suppose we can always have a per-cpu global with the last
irq pt_regs pointer if really needed for debug.
 
> But pretty mechanical.  Problem is, I don't think we have our mechanic.

Yup, we would need to decide what to put in there....

Ben.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] page fault retry with NOPAGE_RETRY
  2006-09-20 17:53                     ` Andrew Morton
  2006-09-21 22:05                       ` Benjamin Herrenschmidt
@ 2006-09-23 14:21                       ` Hugh Dickins
  2006-09-23 19:46                         ` Andrew Morton
  1 sibling, 1 reply; 28+ messages in thread
From: Hugh Dickins @ 2006-09-23 14:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Benjamin Herrenschmidt, Mike Waychison, linux-mm,
	Linux Kernel list, Linus Torvalds

On Wed, 20 Sep 2006, Andrew Morton wrote:
> On Wed, 20 Sep 2006 16:54:59 +1000
> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> > 
> > That's what I don't understand... where is the actual race that can
> > cause the livelock you are mentioning.
> 
> Suppose a program (let's call it "DoS") is written which sits in a loop
> doing fadvise(FADV_DONTNEED) against some parts of /lib/libc.so.

I agree there's an issue here, but I believe you're attacking the wrong
end, thereby complicating and uglifying the pagefault path (in every
arch) with your proposed arg block and retry limitation.

(Maybe one day there will be need for such an arg block,
but I don't see that yet.)

Isn't the real problem that fadvise(FADV_DONTNEED) is much more
powerful than it should be?  Whereas madvise(MADV_DONTNEED) is simply
releasing pages from my address space, fadvise(FADV_DONTNEED) is going
so far as to remove them from pagecache (if nothing at that instant
prevents): forcing others into I/O.  Why should I be allowed to
invalidate pagecache useful to others so quickly?

Shouldn't it merely, say, move the pages in its range to the inactive
list, giving other processes a chance to reassert an interest in them?
May not turn out as easy as that, I admit.

I'm fine with your idea of dropping mmap_sem while nopage waits on I/O,
I'm fine with your idea of an mm mmap transaction count, so nopage can
just reget mmap_sem without backing out when nothing changed meanwhile.

But I do think Ben should have the simple NOPAGE_RETRY he proposed,
going right back out to userspace; and that should be enough for your
case too (the mmap transaction count would make its use a rarity).

> So I think there's a nasty DoS here if we permit infinite retries.  But
> it's not just that - there might be other situations under really heavy
> memory pressure where livelocks like this can occur.

filemap_nopage would want to mark_page_accessed() before returning
NOPAGE_RETRY, but if that's not good enough to hold the page in cache
before the retried fault grabs it, your memory pressure is already
into thrashing.  I believe the livelock is peculiar to FADV_DONTNEED.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] page fault retry with NOPAGE_RETRY
  2006-09-23 14:21                       ` Hugh Dickins
@ 2006-09-23 19:46                         ` Andrew Morton
  2006-09-23 22:35                           ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2006-09-23 19:46 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Benjamin Herrenschmidt, Mike Waychison, linux-mm,
	Linux Kernel list, Linus Torvalds

On Sat, 23 Sep 2006 15:21:40 +0100 (BST)
Hugh Dickins <hugh@veritas.com> wrote:

> On Wed, 20 Sep 2006, Andrew Morton wrote:
> > On Wed, 20 Sep 2006 16:54:59 +1000
> > Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> > > 
> > > That's what I don't understand... where is the actual race that can
> > > cause the livelock you are mentioning.
> > 
> > Suppose a program (let's call it "DoS") is written which sits in a loop
> > doing fadvise(FADV_DONTNEED) against some parts of /lib/libc.so.
> 
> I agree there's an issue here, but I believe you're attacking the wrong
> end, thereby complicating and uglifying the pagefault path (in every
> arch) with your proposed arg block and retry limitation.

"simplifying and cleaning up the pagefault path (in every arch) with my
proposed arg block and retry improvement".

We're presently passing from four to six arguments down many layers of
function call.  Can be replaced with a single arg.

> (Maybe one day there will be need for such an arg block,
> but I don't see that yet.)

I agree it's marginal.

> Isn't the real problem that fadvise(FADV_DONTNEED) is much more
> powerful than it should be?  Whereas madvise(MADV_DONTNEED) is simply
> releasing pages from my address space, fadvise(FADV_DONTNEED) is going
> so far as to remove them from pagecache (if nothing at that instant
> prevents): forcing others into I/O.  Why should I be allowed to
> invalidate pagecache useful to others so quickly?
> 
> Shouldn't it merely, say, move the pages in its range to the inactive
> list, giving other processes a chance to reassert an interest in them?
> May not turn out as easy as that, I admit.

Could be, although that would cause inodes to remain unreclaimable.

> I'm fine with your idea of dropping mmap_sem while nopage waits on I/O,
> I'm fine with your idea of an mm mmap transaction count, so nopage can
> just reget mmap_sem without backing out when nothing changed meanwhile.
> 
> But I do think Ben should have the simple NOPAGE_RETRY he proposed,
> going right back out to userspace; and that should be enough for your
> case too (the mmap transaction count would make its use a rarity).

Perhaps we should concentrate on that for now.  Did we have a patch to look
at?

> > So I think there's a nasty DoS here if we permit infinite retries.  But
> > it's not just that - there might be other situations under really heavy
> > memory pressure where livelocks like this can occur.
> 
> filemap_nopage would want to mark_page_accessed() before returning
> NOPAGE_RETRY, but if that's not good enough to hold the page in cache
> before the retried fault grabs it, your memory pressure is already
> into thrashing.  I believe the livelock is peculiar to FADV_DONTNEED.

Maybe.  Putting a potential infinite loop like this into the pagefault path
gives me the creeps.

Bear in mind that direct-io (both block and NFS) shoots down pagecache too.
It has to.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] page fault retry with NOPAGE_RETRY
  2006-09-23 19:46                         ` Andrew Morton
@ 2006-09-23 22:35                           ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 28+ messages in thread
From: Benjamin Herrenschmidt @ 2006-09-23 22:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Mike Waychison, linux-mm, Linux Kernel list,
	Linus Torvalds

> Perhaps we should concentrate on that for now.  Did we have a patch to look
> at?

Only a hand written proto-patch. Below is a real (but untested) one.
Note that there might be still issues when called from get_user_pages()
which of course won't go back to userland. For the two usage scenario I
have in mind, it should be ok though. One (a signal pending) will loop
back in until the resource is available, the other (no_page() inserts
the PTE itself) is just fine. For the former case, I've added a
cond_resched() to the loop, we might want to look into adding the info
of wether we are coming from get_user_pages() vs. do_page_fault() to
these new arguments you want to add to page fault handlers. That would
allow in our case to do a non-interruptibe sleep when caused by
get_user_pages().

---

Add a way for a no_page() handler to request a retry of the faulting
instruction. It goes back to userland on page faults and just tries
again in get_user_pages(). I added a cond_resched() in the loop in that
later case.

Signed-off-by: Benjamin Herrenchmidt <benh@kernel.crashing.org>

Index: linux-work/include/linux/mm.h
===================================================================
--- linux-work.orig/include/linux/mm.h	2006-08-30 08:51:21.000000000 +1000
+++ linux-work/include/linux/mm.h	2006-09-24 08:25:33.000000000 +1000
@@ -623,6 +623,7 @@
  */
 #define NOPAGE_SIGBUS	(NULL)
 #define NOPAGE_OOM	((struct page *) (-1))
+#define NOPAGE_RETRY	((struct page *) (-2))

 /*
  * Different kinds of faults, as returned by handle_mm_fault().
Index: linux-work/mm/memory.c
===================================================================
--- linux-work.orig/mm/memory.c	2006-08-17 16:16:06.000000000 +1000
+++ linux-work/mm/memory.c	2006-09-24 08:34:09.000000000 +1000
@@ -1081,6 +1081,7 @@
 				default:
 					BUG();
 				}
+				cond_resched();
 			}
 			if (pages) {
 				pages[i] = page;
@@ -2117,11 +2118,13 @@
 	 * after the next truncate_count read.
 	 */

-	/* no page was available -- either SIGBUS or OOM */
-	if (new_page == NOPAGE_SIGBUS)
+	/* no page was available -- either SIGBUS, OOM or RETRY */
+	if (unlikely(new_page == NOPAGE_SIGBUS))
 		return VM_FAULT_SIGBUS;
-	if (new_page == NOPAGE_OOM)
+	else if (unlikely(new_page == NOPAGE_OOM))
 		return VM_FAULT_OOM;
+	else if (unlikely(new_page == NOPAGE_RETRY))
+		return VM_FAULT_MINOR;

 	/*
 	 * Should we do an early C-O-W break?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2006-09-23 22:35 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-09-14 22:55 [RFC] page fault retry with NOPAGE_RETRY Benjamin Herrenschmidt
2006-09-15  0:19 ` Linus Torvalds
2006-09-15  7:11 ` Andrew Morton
2006-09-15  7:35   ` Andrew Morton
2006-09-15 13:30     ` Hugh Dickins
2006-09-16  1:03       ` Benjamin Herrenschmidt
2006-09-19 23:35   ` Mike Waychison
2006-09-19 23:50     ` Benjamin Herrenschmidt
2006-09-19 23:59       ` Andrew Morton
2006-09-20  0:06         ` Benjamin Herrenschmidt
2006-09-20  0:05       ` Benjamin Herrenschmidt
2006-09-20  0:21         ` Andrew Morton
2006-09-20  1:57           ` Benjamin Herrenschmidt
2006-09-20  3:05             ` Andrew Morton
2006-09-20  5:04               ` Benjamin Herrenschmidt
2006-09-20  5:26                 ` Andrew Morton
2006-09-20  6:54                   ` Benjamin Herrenschmidt
2006-09-20 17:53                     ` Andrew Morton
2006-09-21 22:05                       ` Benjamin Herrenschmidt
2006-09-21 22:41                         ` Andrew Morton
2006-09-21 23:09                           ` Benjamin Herrenschmidt
2006-09-23 14:21                       ` Hugh Dickins
2006-09-23 19:46                         ` Andrew Morton
2006-09-23 22:35                           ` Benjamin Herrenschmidt
2006-09-20  5:06               ` Benjamin Herrenschmidt
2006-09-20  1:14       ` Mike Waychison
2006-09-20  2:02         ` Benjamin Herrenschmidt
2006-09-15 21:35 ` Arnd Bergmann

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox