Patch attached. As Andrew points out, the logic is a bit hacky and using a flag in current->flags to determine whether we have done the retry or not already. I too think the right approach to being able to handle these kinds of retries in a more general fashion is to introduce a struct pagefault_args along the page faulting path. Within it, we could introduce a reason for the retry so the higher levels would be able to better understand what to do. struct pagefault_args { u32 reason; }; #define PAGEFAULT_MAYRETRY_IO 0x1 #define PAGEFAULT_RETRY_IO 0x2 #define PAGEFAULT_RETRY_SIGNALPENDING 0x4 do_page_fault could for example set PAGEFAULT_MAYRETRY_IO, letting a nopage implementation drop locks for IO and signalling back up to do_page_fault by also setting PAGEFAULT_RETRY_IO. PAGEFAULT_RETRY_SIGNALPENDING could be set to tell do_page_fault to deliver the signal and replay the faulting instruction (as opposed to the "goto retry" in the patch attached). Mike Waychison Andrew Morton wrote: > On Fri, 15 Sep 2006 08:55:08 +1000 > Benjamin Herrenschmidt wrote: > > >>in mm.h: >> >> #define NOPAGE_SIGBUS (NULL) >> #define NOPAGE_OOM ((struct page *) (-1)) >>+#define NOPAGE_RETRY ((struct page *) (-2)) >> >>and in memory.c, in do_no_page(): >> >> >> /* no page was available -- either SIGBUS or OOM */ >> if (new_page == NOPAGE_SIGBUS) >> return VM_FAULT_SIGBUS; >> if (new_page == NOPAGE_OOM) >> return VM_FAULT_OOM; >>+ if (new_page == NOPAGE_RETRY) >>+ return VM_FAULT_MINOR; > > > Google are using such a patch (Mike owns it). > > It is to reduce mmap_sem contention with threaded apps. If one thread > takes a major pagefault, it will perform a synchronous disk read while > holding down_read(mmap_sem). This causes any other thread which wishes to > perform any mmap/munmap/mprotect/etc (which does down_write(mmap_sem)) to > block behind that disk IO. As you can understand, that can be pretty bad > in the right circumstances. > > The patch modifies filemap_nopage() to look to see if it needs to block on > the page coming uptodate and if so, it does up_read(mmap_sem); > wait_on_page_locked(); return NOPAGE_RETRY. That causes the top-level > do_page_fault() code to rerun the entire pagefault. > > It hasn't been submitted for upstream yet because > > a) To avoid livelock possibilities (page reclaim, looping FADV_DONTNEED, > etc) it only does the retry a single time. After that it falls back to > the traditional synchronous-read-inside-mmap_sem approach. A flag in > current->flags is used to detect the second attempt. It keeps the patch > simple, but is a bit hacky. > > To resolve this cleanly I'm thinking we change all the pagefault code > everywhere: instantiate a new `struct pagefault_args' in do_page_fault() > and pass that all around the place. So all the pagefault code, all the > ->nopage handlers etc will take a single argument. > > This will, I hope, result in less code, faster code and less stack > consumption. It could also be used for things like the > lock-the-page-in-filemap_nopage() proposal: the ->nopage() > implementation could set a boolean in pagefault_args indicating whether > the page has been locked. > > And, of course, fielmap_nopage could set another boolean in > pagefault_args to indicate that it has already tried to rerun the > pagefault once. > > b) It could be more efficient. Most of the time, there's no need to > back all the way out of the pagefault handler and rerun the whole thing. > Because most of the time, nobody changed anything in the mm_struct. We > _could_ just retake the mmap_sem after the page comes uptodate and, if > nothing has changed, proceed. I see two ways of doing this: > > - The simple way: look to see if any other processes are sharing > this mm_struct. If not, just do the synchronous read inside mmap_sem. > > - The better way: put a sequence counter in the mm_struct, > increment that in every place where down_write(mmap_sem) is performed. > The pagefault code then can re-take the mmap_sem for reading and look > to see if the sequence counter is unchanged. If it is, proceed. If > it _has_ changed then drop mmap_sem again and return NOPAGE_RETRY. > > otoh, maybe using another bit in page->flags is a good compromise ;) > > Mike, could you whip that patch out please?