* mmap/munmap latency on multithreaded apps, because pagefaults hold mmap_sem during disk read
       [not found] ` <48C57898.1080304@gmail.com>
@ 2008-09-12 19:16   ` Török Edwin
  2008-09-12 19:35     ` Mike Waychison
  0 siblings, 1 reply; 3+ messages in thread
From: Török Edwin @ 2008-09-12 19:16 UTC (permalink / raw)
  To: Mike Waychison
  Cc: Andi Kleen, Theodore Tso, Peter Zijlstra, Ingo Molnar,
      Linux Kernel, Thomas Gleixner, mingo@redhat.com, H. Peter Anvin,
      Benjamin Herrenschmidt, linux-mm

On 2008-09-08 22:10, Torok Edwin wrote:
> [snip]
> There is, however, a problem with mmap [mmap with N threads is as slow
> as mmap with 1 thread, i.e. it is sequential :(], page faults and disk
> I/O; I think I am hitting the problem described in this thread (2
> years ago!):
> http://lwn.net/Articles/200215/
> http://lkml.org/lkml/2006/9/19/260
>
> It looks like such a patch is still not part of 2.6.27; what happened
> to it? I will see if that patch applies to 2.6.27, and will rerun my
> test with that patch applied too.

The patch doesn't apply to 2.6.27-rc6, so I tried applying it manually.
There have been many changes since 2.6.18 (such as replacing
find_get_page with find_lock_page, filemap returning VM_FAULT codes,
etc.). I have probably done something wrong, because the resulting
kernel won't boot: I get abnormal exits and random SIGBUS during boot.

Can you please help port the patch to 2.6.27-rc6? I have attached my
two attempts at the end of this mail.

Also, it looks like the original patch only releases mmap_sem when
there is lock contention on the page, but keeps mmap_sem held during
the read? I would like mmap_sem to be released during disk I/O too.

I also tried changing i_mmap_lock into a semaphore, but that won't
work, since some users of i_mmap_lock can't sleep. Taking the
i_mmap_lock spinlock in filemap_fault is also not possible, since we
would sleep while holding a spinlock.

Just to confirm that the problem is with page faults and mmap, I
dropped mmap_sem in filemap_fault, and then I got the same performance
in my test program for mmap and read. Of course this is totally unsafe,
because the mapping could change at any time.
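Roughly, that unsafe experiment amounts to the sketch below (from
memory, not the exact change I made; __do_filemap_fault is a made-up
stand-in for the real body of filemap_fault):

/*
 * UNSAFE sketch: drop mmap_sem across the whole fault so other threads
 * can fault/mmap/munmap concurrently. Nothing pins the vma while the
 * semaphore is dropped, so the mapping can be unmapped or changed
 * underneath us -- hence the abnormal exits and random SIGBUS.
 */
int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
        struct rw_semaphore *sem = &vma->vm_mm->mmap_sem;
        int ret;

        up_read(sem);                       /* our caller took this for read */
        ret = __do_filemap_fault(vma, vmf); /* hypothetical helper */
        down_read(sem);                     /* restore the caller's state */
        return ret;
}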
> [2] the test program is available here:
> http://edwintorok.googlepages.com/scalability.tar.gz
> You just build it using 'make' (has to be GNU make), and then run:
> $ sh ./runtest.sh /usr/bin/ | tee log
> $ sh ./postproc.sh log

I've written a latency tracer (using ptrace), and I identified the
mutex/mmap-related latencies (total runtime 23m):
- mmap-ed files (created by libclamav): ~6680 ms total
- creating/removing anonymous mappings, created by glibc when I use
  functions like fopen/fclose:

With 8 threads:
=====> Total: 3227.732 ms, average: 3.590 ms, times: 899
=== /lib/libc.so.6 (mmap)
=== /lib/libc.so.6 (_IO_file_doallocate)
=== /lib/libc.so.6 (_IO_doallocbuf)
=== /lib/libc.so.6 (_IO_file_seekoff)
=== /lib/libc.so.6 (_IO_file_attach)
=== /lib/libc.so.6 (_IO_fdopen)
=== /usr/lib/libz.so.1 (gzflush)
=== /usr/lib/libz.so.1 (gzdopen)
=== /usr/local/lib/libclamav.so.5 libclamav/scanners.c:470 (cli_scangzip)

=====> Total: 2069.519 ms, average: 3.624 ms, times: 571
=== /lib/libc.so.6 (munmap)
=== /lib/libc.so.6 (_IO_setb)
=== /lib/libc.so.6 (_IO_file_close_it)
=== /lib/libc.so.6 (_IO_fclose)
=== /usr/lib/libz.so.1 (gzerror)
=== /usr/local/lib/libclamav.so.5 libclamav/scanners.c:529 (cli_scangzip)

With 4 threads:
=====> Total: 578.607 ms, average: 4.743 ms, times: 122
=== /lib/libc.so.6 (munmap)
=== /lib/libc.so.6 (_IO_setb)
=== /lib/libc.so.6 (_IO_file_close_it)
=== /lib/libc.so.6 (_IO_fclose)
=== /usr/lib/libz.so.1 (gzerror)
=== /usr/local/lib/libclamav.so.5 libclamav/scanners.c:529 (cli_scangzip)

=====> Total: 148.083 ms, average: 2.278 ms, times: 65
=== /lib/libc.so.6 (mmap)
=== /lib/libc.so.6 (_IO_file_doallocate)
=== /lib/libc.so.6 (_IO_doallocbuf)
=== /lib/libc.so.6 (_IO_file_seekoff)
=== /lib/libc.so.6 (_IO_file_attach)
=== /lib/libc.so.6 (_IO_fdopen)
=== /usr/lib/libz.so.1 (gzflush)
=== /usr/lib/libz.so.1 (gzdopen)
=== /usr/local/lib/libclamav.so.5 libclamav/scanners.c:470 (cli_scangzip)

With 8 threads the situation is much worse than with 4 threads, even
for functions using anonymous mappings.

Of course the latency tracer has its own overhead (1.2 ms average,
67 ms max for 8 threads; 0.1 ms average, 31 ms max for 4 threads), but
these latencies are above that value.

Best regards,
--Edwin

---
First attempt:

 arch/x86/mm/fault.c      |   33 +++++++++++++++++++++++++++------
 include/linux/mm.h       |    1 +
 include/linux/mm_types.h |    1 -
 include/linux/sched.h    |    1 +
 mm/filemap.c             |   18 ++++++++++++------
 5 files changed, 41 insertions(+), 13 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 455f3fe..38bea4b 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -578,10 +578,7 @@ int show_unhandled_signals = 1;
  * and the problem, and then passes it off to one of the appropriate
  * routines.
  */
-#ifdef CONFIG_X86_64
-asmlinkage
-#endif
-void __kprobes do_page_fault(struct pt_regs *regs, unsigned long error_code)
+static inline void __do_page_fault(struct pt_regs *regs, unsigned long error_code)
 {
        struct task_struct *tsk;
        struct mm_struct *mm;
@@ -702,6 +699,7 @@ again:
                down_read(&mm->mmap_sem);
        }

+retry:
        vma = find_vma(mm, address);
        if (!vma)
                goto bad_area;
@@ -761,8 +759,21 @@ survive:
        }
        if (fault & VM_FAULT_MAJOR)
                tsk->maj_flt++;
-       else
-               tsk->min_flt++;
+       else {
+               if ((fault & VM_FAULT_RETRY) && (current->flags & PF_FAULT_MAYRETRY)) {
+                       current->flags &= ~PF_FAULT_MAYRETRY;
+                       goto retry;
+               }
+               /*
+                * If we had to retry (PF_FAULT_MAYRETRY cleared), then
+                * the page originally wasn't up to date before the
+                * retry, but now it is.
+                */
+               if (!(current->flags & PF_FAULT_MAYRETRY))
+                       tsk->maj_flt++;
+               else
+                       tsk->min_flt++;
+       }

 #ifdef CONFIG_X86_32
        /*
@@ -909,6 +920,16 @@ do_sigbus:
        tsk->thread.trap_no = 14;
        force_sig_info_fault(SIGBUS, BUS_ADRERR, address, tsk);
 }
+#ifdef CONFIG_X86_64
+asmlinkage
+#endif
+void __kprobes do_page_fault(struct pt_regs *regs,
+                               unsigned long error_code)
+{
+       current->flags |= PF_FAULT_MAYRETRY;
+       __do_page_fault(regs, error_code);
+       current->flags &= ~PF_FAULT_MAYRETRY;
+}

 DEFINE_SPINLOCK(pgd_lock);
 LIST_HEAD(pgd_list);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 72a15dc..4511f68 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -697,6 +697,7 @@ static inline int page_mapped(struct page *page)

 #define VM_FAULT_NOPAGE 0x0100  /* ->fault installed the pte, not return page */
 #define VM_FAULT_LOCKED 0x0200  /* ->fault locked the returned page */
+#define VM_FAULT_RETRY  0x0400

 #define VM_FAULT_ERROR  (VM_FAULT_OOM | VM_FAULT_SIGBUS)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index bf33413..9d065a4 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -108,7 +108,6 @@ struct vm_area_struct {
        unsigned long vm_start;         /* Our start address within vm_mm. */
        unsigned long vm_end;           /* The first byte after our end address
                                           within vm_mm. */
-
        /* linked list of VM areas per task, sorted by address */
        struct vm_area_struct *vm_next;

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3d9120c..bc39432 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1507,6 +1507,7 @@ extern cputime_t task_gtime(struct task_struct *p);
 #define PF_SPREAD_PAGE  0x01000000      /* Spread page cache over cpuset */
 #define PF_SPREAD_SLAB  0x02000000      /* Spread some slab caches over cpuset */
 #define PF_THREAD_BOUND 0x04000000      /* Thread bound to specific cpu */
+#define PF_FAULT_MAYRETRY 0x08000000    /* I may drop mmap_sem during fault */
 #define PF_MEMPOLICY    0x10000000      /* Non-default NUMA mempolicy */
 #define PF_MUTEX_TESTER 0x20000000      /* Thread belongs to the rt mutex tester */
 #define PF_FREEZER_SKIP 0x40000000      /* Freezer should not count it as freezeable */
diff --git a/mm/filemap.c b/mm/filemap.c
index 876bc59..f9f11bd 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1443,18 +1443,17 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
        /*
         * Do we have something in the page cache already?
         */
-retry_find:
        page = find_lock_page(mapping, vmf->pgoff);
        /*
         * For sequential accesses, we use the generic readahead logic.
         */
        if (VM_SequentialReadHint(vma)) {
                if (!page) {
+                       up_read(&vma->vm_mm->mmap_sem);
                        page_cache_sync_readahead(mapping, ra, file,
                                                           vmf->pgoff, 1);
-                       page = find_lock_page(mapping, vmf->pgoff);
-                       if (!page)
-                               goto no_cached_page;
+                       down_read(&vma->vm_mm->mmap_sem);
+                       return VM_FAULT_RETRY;
                }
                if (PageReadahead(page)) {
                        page_cache_async_readahead(mapping, ra, file, page,
@@ -1489,7 +1488,10 @@ retry_find:

                if (vmf->pgoff > ra_pages / 2)
                        start = vmf->pgoff - ra_pages / 2;
+               up_read(&vma->vm_mm->mmap_sem);
                do_page_cache_readahead(mapping, file, start, ra_pages);
+               down_read(&vma->vm_mm->mmap_sem);
+               return VM_FAULT_RETRY;
        }
        page = find_lock_page(mapping, vmf->pgoff);
        if (!page)
                goto no_cached_page;
@@ -1527,7 +1529,9 @@ no_cached_page:
         * We're only likely to ever get here if MADV_RANDOM is in
         * effect.
         */
+       up_read(&vma->vm_mm->mmap_sem);
        error = page_cache_read(file, vmf->pgoff);
+       down_read(&vma->vm_mm->mmap_sem);

        /*
         * The page we want has now been added to the page cache.
@@ -1535,7 +1539,7 @@ no_cached_page:
         * meantime, we'll just come back here and read it again.
         */
        if (error >= 0)
-               goto retry_find;
+               return VM_FAULT_RETRY;

        /*
         * An error return from page_cache_read can result if the
@@ -1560,16 +1564,18 @@ page_not_uptodate:
         * and we need to check for errors.
         */
        ClearPageError(page);
+       up_read(&vma->vm_mm->mmap_sem);
        error = mapping->a_ops->readpage(file, page);
        if (!error) {
                wait_on_page_locked(page);
                if (!PageUptodate(page))
                        error = -EIO;
        }
+       down_read(&vma->vm_mm->mmap_sem);
        page_cache_release(page);

        if (!error || error == AOP_TRUNCATED_PAGE)
-               goto retry_find;
+               return VM_FAULT_RETRY;

        /* Things didn't work out. Return zero to tell the mm layer so. */
        shrink_readahead_size_eio(file, ra);

---
Second attempt:

 arch/x86/mm/fault.c   |   33 ++++++++++++++++++++++++-----
 include/linux/mm.h    |    1
 include/linux/sched.h |    1
 mm/filemap.c          |   56 +++++++++++++++++++++++++++++++++++++++++++++-----
 4 files changed, 80 insertions(+), 11 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 455f3fe..38bea4b 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -578,10 +578,7 @@ int show_unhandled_signals = 1;
  * and the problem, and then passes it off to one of the appropriate
  * routines.
  */
-#ifdef CONFIG_X86_64
-asmlinkage
-#endif
-void __kprobes do_page_fault(struct pt_regs *regs, unsigned long error_code)
+static inline void __do_page_fault(struct pt_regs *regs, unsigned long error_code)
 {
        struct task_struct *tsk;
        struct mm_struct *mm;
@@ -702,6 +699,7 @@ again:
                down_read(&mm->mmap_sem);
        }

+retry:
        vma = find_vma(mm, address);
        if (!vma)
                goto bad_area;
@@ -761,8 +759,21 @@ survive:
        }
        if (fault & VM_FAULT_MAJOR)
                tsk->maj_flt++;
-       else
-               tsk->min_flt++;
+       else {
+               if ((fault & VM_FAULT_RETRY) && (current->flags & PF_FAULT_MAYRETRY)) {
+                       current->flags &= ~PF_FAULT_MAYRETRY;
+                       goto retry;
+               }
+               /*
+                * If we had to retry (PF_FAULT_MAYRETRY cleared), then
+                * the page originally wasn't up to date before the
+                * retry, but now it is.
+                */
+               if (!(current->flags & PF_FAULT_MAYRETRY))
+                       tsk->maj_flt++;
+               else
+                       tsk->min_flt++;
+       }

 #ifdef CONFIG_X86_32
        /*
@@ -909,6 +920,16 @@ do_sigbus:
        tsk->thread.trap_no = 14;
        force_sig_info_fault(SIGBUS, BUS_ADRERR, address, tsk);
 }
+#ifdef CONFIG_X86_64
+asmlinkage
+#endif
+void __kprobes do_page_fault(struct pt_regs *regs,
+                               unsigned long error_code)
+{
+       current->flags |= PF_FAULT_MAYRETRY;
+       __do_page_fault(regs, error_code);
+       current->flags &= ~PF_FAULT_MAYRETRY;
+}

 DEFINE_SPINLOCK(pgd_lock);
 LIST_HEAD(pgd_list);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 72a15dc..e150c80 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -694,6 +694,7 @@ static inline int page_mapped(struct page *page)
 #define VM_FAULT_SIGBUS 0x0002
 #define VM_FAULT_MAJOR  0x0004
 #define VM_FAULT_WRITE  0x0008  /* Special case for get_user_pages */
+#define VM_FAULT_RETRY  0x0016

 #define VM_FAULT_NOPAGE 0x0100  /* ->fault installed the pte, not return page */
 #define VM_FAULT_LOCKED 0x0200  /* ->fault locked the returned page */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3d9120c..bc39432 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1507,6 +1507,7 @@ extern cputime_t task_gtime(struct task_struct *p);
 #define PF_SPREAD_PAGE  0x01000000      /* Spread page cache over cpuset */
 #define PF_SPREAD_SLAB  0x02000000      /* Spread some slab caches over cpuset */
 #define PF_THREAD_BOUND 0x04000000      /* Thread bound to specific cpu */
+#define PF_FAULT_MAYRETRY 0x08000000    /* I may drop mmap_sem during fault */
 #define PF_MEMPOLICY    0x10000000      /* Non-default NUMA mempolicy */
 #define PF_MUTEX_TESTER 0x20000000      /* Thread belongs to the rt mutex tester */
 #define PF_FREEZER_SKIP 0x40000000      /* Freezer should not count it as freezeable */
diff --git a/mm/filemap.c b/mm/filemap.c
index 876bc59..212ea0f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -670,6 +670,7 @@ repeat:
 }
 EXPORT_SYMBOL(find_get_page);

+#define NOPAGE_RETRY ((struct page *)-1)
 /**
  * find_lock_page - locate, pin and lock a pagecache page
  * @mapping: the address_space to search
@@ -680,14 +681,31 @@ EXPORT_SYMBOL(find_get_page);
  *
  * Returns zero if the page was not present. find_lock_page() may sleep.
  */
-struct page *find_lock_page(struct address_space *mapping, pgoff_t offset)
+static struct page *__find_lock_page(struct address_space *mapping,
+               pgoff_t offset, struct rw_semaphore *mmap_sem)
 {
        struct page *page;

 repeat:
        page = find_get_page(mapping, offset);
        if (page) {
-               lock_page(page);
+               if (!mmap_sem) {
+                       lock_page(page);
+               } else if (!trylock_page(page)) {
+                       /*
+                        * Page is already locked by someone else.
+                        * We don't want to be holding down_read(mmap_sem)
+                        * inside lock_page(), so use wait_on_page_locked() here.
+                        */
+                       up_read(mmap_sem);
+                       wait_on_page_locked(page);
+                       down_read(mmap_sem);
+                       /*
+                        * The VMA tree may have changed at this point.
+                        */
+                       page_cache_release(page);
+                       goto repeat;
+               }
                /* Has the page been truncated?
                 */
                if (unlikely(page->mapping != mapping)) {
                        unlock_page(page);
@@ -698,6 +716,10 @@ repeat:
        }
        return page;
 }
+struct page *find_lock_page(struct address_space *mapping, pgoff_t offset)
+{
+       return __find_lock_page(mapping, offset, NULL);
+}
 EXPORT_SYMBOL(find_lock_page);

 /**
@@ -1427,6 +1449,8 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
        struct address_space *mapping = file->f_mapping;
        struct file_ra_state *ra = &file->f_ra;
        struct inode *inode = mapping->host;
+       struct rw_semaphore *mmap_sem;
+       struct rw_semaphore *mmap_sem_mayretry;
        struct page *page;
        pgoff_t size;
        int did_readaround = 0;
@@ -1435,6 +1459,7 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
        size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
        if (vmf->pgoff >= size)
                return VM_FAULT_SIGBUS;
+       up_read(&vma->vm_mm->mmap_sem);

        /* If we don't want any read-ahead, don't bother */
        if (VM_RandomReadHint(vma))
@@ -1443,16 +1468,25 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
        /*
         * Do we have something in the page cache already?
         */
+       mmap_sem = &vma->vm_mm->mmap_sem;
+       mmap_sem_mayretry = current->flags & PF_FAULT_MAYRETRY ? mmap_sem : NULL;
 retry_find:
-       page = find_lock_page(mapping, vmf->pgoff);
+       page = __find_lock_page(mapping, vmf->pgoff, mmap_sem_mayretry);
+       if (page == NOPAGE_RETRY)
+               goto nopage_retry;
        /*
         * For sequential accesses, we use the generic readahead logic.
         */
        if (VM_SequentialReadHint(vma)) {
                if (!page) {
+                       up_read(mmap_sem);
                        page_cache_sync_readahead(mapping, ra, file,
                                                           vmf->pgoff, 1);
-                       page = find_lock_page(mapping, vmf->pgoff);
+                       down_read(mmap_sem);
+                       page = __find_lock_page(mapping, vmf->pgoff,
+                                               mmap_sem_mayretry);
+                       if (page == NOPAGE_RETRY)
+                               goto nopage_retry;
                        if (!page)
                                goto no_cached_page;
                }
@@ -1489,9 +1523,15 @@ retry_find:

                if (vmf->pgoff > ra_pages / 2)
                        start = vmf->pgoff - ra_pages / 2;
+               up_read(mmap_sem);
                do_page_cache_readahead(mapping, file, start, ra_pages);
+               down_read(mmap_sem);
        }
-       page = find_lock_page(mapping, vmf->pgoff);
+       page = __find_lock_page(mapping, vmf->pgoff,
+                       (current->flags & PF_FAULT_MAYRETRY) ?
+                       &vma->vm_mm->mmap_sem : NULL);
+       if (page == NOPAGE_RETRY)
+               goto nopage_retry;
        if (!page)
                goto no_cached_page;
        }
@@ -1527,7 +1567,9 @@ no_cached_page:
         * We're only likely to ever get here if MADV_RANDOM is in
         * effect.
         */
+       up_read(mmap_sem);
        error = page_cache_read(file, vmf->pgoff);
+       down_read(mmap_sem);

        /*
         * The page we want has now been added to the page cache.
@@ -1560,12 +1602,14 @@ page_not_uptodate:
         * and we need to check for errors.
         */
        ClearPageError(page);
+       up_read(mmap_sem);
        error = mapping->a_ops->readpage(file, page);
        if (!error) {
                wait_on_page_locked(page);
                if (!PageUptodate(page))
                        error = -EIO;
        }
+       down_read(mmap_sem);
        page_cache_release(page);

        if (!error || error == AOP_TRUNCATED_PAGE)
@@ -1574,6 +1618,8 @@ page_not_uptodate:
        /* Things didn't work out. Return zero to tell the mm layer so. */
        shrink_readahead_size_eio(file, ra);
        return VM_FAULT_SIGBUS;
+nopage_retry:
+       return VM_FAULT_RETRY;
 }
 EXPORT_SYMBOL(filemap_fault);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 3+ messages in thread
* Re: mmap/munmap latency on multithreaded apps, because pagefaults hold mmap_sem during disk read
  2008-09-12 19:16 ` mmap/munmap latency on multithreaded apps, because pagefaults hold mmap_sem during disk read Török Edwin
@ 2008-09-12 19:35   ` Mike Waychison
  2008-09-12 20:10     ` Török Edwin
  0 siblings, 1 reply; 3+ messages in thread
From: Mike Waychison @ 2008-09-12 19:35 UTC (permalink / raw)
  To: Török Edwin
  Cc: Andi Kleen, Theodore Tso, Peter Zijlstra, Ingo Molnar,
      Linux Kernel, Thomas Gleixner, mingo@redhat.com, H. Peter Anvin,
      Benjamin Herrenschmidt, linux-mm

Torok Edwin wrote:
> On 2008-09-08 22:10, Torok Edwin wrote:
>> [snip]
>> There is, however, a problem with mmap [mmap with N threads is as
>> slow as mmap with 1 thread, i.e. it is sequential :(], page faults
>> and disk I/O; I think I am hitting the problem described in this
>> thread (2 years ago!):
>> http://lwn.net/Articles/200215/
>> http://lkml.org/lkml/2006/9/19/260
>>
>> It looks like such a patch is still not part of 2.6.27; what happened
>> to it? I will see if that patch applies to 2.6.27, and will rerun my
>> test with that patch applied too.
>
> The patch doesn't apply to 2.6.27-rc6, so I tried applying it
> manually. There have been many changes since 2.6.18 (such as replacing
> find_get_page with find_lock_page, filemap returning VM_FAULT codes,
> etc.). I have probably done something wrong, because the resulting
> kernel won't boot: I get abnormal exits and random SIGBUS during boot.
>
> Can you please help port the patch to 2.6.27-rc6? I have attached my
> two attempts at the end of this mail.

I actually have to forward-port this and a bunch of other mm speed-ups
in the coming two weeks, though they'll be ports from 2.6.18 to 2.6.26.
I'll be sending them out to linux-mm once I've done so and completed
some testing.

> Also, it looks like the original patch only releases mmap_sem when
> there is lock contention on the page, but keeps mmap_sem held during
> the read? I would like mmap_sem to be released during disk I/O too.

The 'lock'ing of the page is the part that waits for the read to
finish, and is the part that is contended.

> I also tried changing i_mmap_lock into a semaphore, but that won't
> work, since some users of i_mmap_lock can't sleep. Taking the
> i_mmap_lock spinlock in filemap_fault is also not possible, since we
> would sleep while holding a spinlock.

You shouldn't be seeing much contention on the i_mmap_lock, at least
not in the fault path (it's mostly just painful when you have a lot of
threads in direct reclaim and you have a large file mmapped).

> Just to confirm that the problem is with page faults and mmap, I
> dropped mmap_sem in filemap_fault, and then I got the same performance
> in my test program for mmap and read. Of course this is totally
> unsafe, because the mapping could change at any time.
>
> [snip: latency traces and both patch attempts, quoted in full above]
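To illustrate the point about page locking: in the generic fault path
the pattern is roughly the following (a simplified sketch, not the
literal 2.6.27 code):

/*
 * A faulting thread holds mmap_sem for read the whole time. If another
 * thread already has the page locked for I/O, lock_page() sleeps until
 * the I/O completion path unlocks the page, so "locking" the page is
 * effectively waiting for the disk read to finish -- with mmap_sem
 * still held, which is what serializes the other threads.
 */
page = find_get_page(mapping, vmf->pgoff);
if (page) {
        lock_page(page);        /* may sleep for a full disk read */
        if (unlikely(page->mapping != mapping)) {
                /* Raced with truncate: drop the page and retry. */
                unlock_page(page);
                page_cache_release(page);
                page = NULL;
        }
}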
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 3+ messages in thread
* Re: mmap/munmap latency on multithreaded apps, because pagefaults hold mmap_sem during disk read
  2008-09-12 19:35   ` Mike Waychison
@ 2008-09-12 20:10     ` Török Edwin
  0 siblings, 0 replies; 3+ messages in thread
From: Török Edwin @ 2008-09-12 20:10 UTC (permalink / raw)
  To: Mike Waychison
  Cc: Andi Kleen, Theodore Tso, Peter Zijlstra, Ingo Molnar,
      Linux Kernel, Thomas Gleixner, mingo@redhat.com, H. Peter Anvin,
      Benjamin Herrenschmidt, linux-mm

On 2008-09-12 22:35, Mike Waychison wrote:
> Torok Edwin wrote:
>> On 2008-09-08 22:10, Torok Edwin wrote:
>>> [snip]
>>> There is, however, a problem with mmap [mmap with N threads is as
>>> slow as mmap with 1 thread, i.e. it is sequential :(], page faults
>>> and disk I/O; I think I am hitting the problem described in this
>>> thread (2 years ago!):
>>> http://lwn.net/Articles/200215/
>>> http://lkml.org/lkml/2006/9/19/260
>>>
>>> It looks like such a patch is still not part of 2.6.27; what
>>> happened to it? I will see if that patch applies to 2.6.27, and
>>> will rerun my test with that patch applied too.
>>
>> The patch doesn't apply to 2.6.27-rc6, so I tried applying it
>> manually. There have been many changes since 2.6.18 (such as
>> replacing find_get_page with find_lock_page, filemap returning
>> VM_FAULT codes, etc.). I have probably done something wrong, because
>> the resulting kernel won't boot: I get abnormal exits and random
>> SIGBUS during boot.
>>
>> Can you please help port the patch to 2.6.27-rc6? I have attached my
>> two attempts at the end of this mail.
>
> I actually have to forward-port this and a bunch of other mm speed-ups
> in the coming two weeks, though they'll be ports from 2.6.18 to
> 2.6.26. I'll be sending them out to linux-mm once I've done so and
> completed some testing.

That would be great, thanks!

>> Also, it looks like the original patch only releases mmap_sem when
>> there is lock contention on the page, but keeps mmap_sem held during
>> the read? I would like mmap_sem to be released during disk I/O too.
>
> The 'lock'ing of the page is the part that waits for the read to
> finish, and is the part that is contended.

Didn't know that, thanks for explaining.

>> I also tried changing i_mmap_lock into a semaphore, but that won't
>> work, since some users of i_mmap_lock can't sleep. Taking the
>> i_mmap_lock spinlock in filemap_fault is also not possible, since we
>> would sleep while holding a spinlock.
>
> You shouldn't be seeing much contention on the i_mmap_lock, at least
> not in the fault path (it's mostly just painful when you have a lot of
> threads in direct reclaim and you have a large file mmapped).

I was thinking of using i_mmap_lock as an alternative to holding
mmap_sem, but it didn't seem like a good idea after all.

Best regards,
--Edwin

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 3+ messages in thread
end of thread, other threads: [~2008-09-12 20:10 UTC | newest]
Thread overview: 3+ messages
[not found] <48B1CC15.2040006@gmail.com>
[not found] ` <1219643476.20732.1.camel@twins>
[not found] ` <48B25988.8040302@gmail.com>
[not found] ` <1219656190.8515.7.camel@twins>
[not found] ` <48B28015.3040602@gmail.com>
[not found] ` <1219658527.8515.16.camel@twins>
[not found] ` <48B287D8.1000000@gmail.com>
[not found] ` <1219660582.8515.24.camel@twins>
[not found] ` <48B290E7.4070805@gmail.com>
[not found] ` <1219664477.8515.54.camel@twins>
[not found] ` <20080825134801.GN1408@mit.edu>
[not found] ` <87y72k9otw.fsf@basil.nowhere.org>
[not found] ` <48C57898.1080304@gmail.com>
2008-09-12 19:16 ` mmap/munmap latency on multithreaded apps, because pagefaults hold mmap_sem during disk read Török Edwin
2008-09-12 19:35 ` Mike Waychison
2008-09-12 20:10 ` Török Edwin