* [PATCH 0/2][RFC] New version of shared page tables
From: Dave McCracken @ 2006-05-03 15:43 UTC
To: Hugh Dickins; +Cc: Linux Memory Management, Linux Kernel

I've done some cleanup and some bugfixing.  Hugh, please review this
version instead of the old one.

I like my locking mechanism for unsharing on this one a lot better.  It
works on an address range instead of depending on a vma, which more
closely reflects the way it's used.

The first patch just standardizes the pxd_page/pxd_page_kernel macros
for all architectures.  The second patch is the heart of shared page
tables.

This version of the patches is against 2.6.17-rc3.

Dave McCracken

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
* Re: [PATCH 0/2][RFC] New version of shared page tables
From: Hugh Dickins @ 2006-05-03 15:56 UTC
To: Dave McCracken; +Cc: Linux Memory Management, Linux Kernel

On Wed, 3 May 2006, Dave McCracken wrote:
>
> I've done some cleanup and some bugfixing.  Hugh, please review
> this version instead of the old one.

Grrr, just as I'm writing up my notes on the last revision!
I need a new go-faster brain.  Okay, I'll switch over now.

Sisyphughs
* Re: [PATCH 0/2][RFC] New version of shared page tables
From: Dave McCracken @ 2006-05-03 16:06 UTC
To: Hugh Dickins; +Cc: Linux Memory Management, Linux Kernel

--On Wednesday, May 03, 2006 16:56:12 +0100 Hugh Dickins
<hugh@veritas.com> wrote:

>> I've done some cleanup and some bugfixing.  Hugh, please review
>> this version instead of the old one.
>
> Grrr, just as I'm writing up my notes on the last revision!
> I need a new go-faster brain.  Okay, I'll switch over now.

Sorry.  The changes should be relatively minor.  Just a tweak to the
unshare locking and some extra code to handle hugepage copy_page_range,
mostly.

Dave
* Re: [PATCH 0/2][RFC] New version of shared page tables
From: Hugh Dickins @ 2006-05-06 15:25 UTC
To: Dave McCracken; +Cc: Linux Memory Management, Linux Kernel

On Wed, 3 May 2006, Dave McCracken wrote:
>
> The changes should be relatively minor.  Just a tweak to the unshare
> locking and some extra code to handle hugepage copy_page_range, mostly.

I didn't pay any attention to 1/2, the pxd_page/pxd_page_kernel patch.
It was well worth separating that one out, really helps to reduce the
scale and worry of the main patch.  Notice recent mails suggesting it's
the answer to a pud_page anomaly: I hadn't realized the type of pud_page
varies from arch to arch, that's horrid: if you're sorting it out, then
please make that clear in the comment and push it forward.  Though I
notice only a couple of instances of pxd_page_kernel outside of
include/asm-* in 2.6.17-rc: I now think you'd do much better not to
propagate that obscure _kernel suffix further, but go for pxd_page_vaddr
(or suchlike) throughout instead: more change, but much clearer.

I do agree with Christoph that you'd do well to separate out the hugetlb
part of the main patch.  Not just for the locking, more because that's
become such a specialist and fast-moving area recently.  I didn't pay
attention to that part of 2/2 either, but got the impression that your
patch has not kept up with the changes there.

Let me say (while perhaps others are still reading) that I'm seriously
wondering whether you should actually restrict your shared pagetable
work to the hugetlb case.  I realize that would be a disappointing
limitation to you, and would remove the 25%/50% improvement cases,
leaving only the 3%/4% last-ounce-of-performance cases.
But it's worrying me a lot that these complications to core mm code will
_almost_ never apply to the majority of users, will get little testing
outside of specialist setups.  I'd feel safer to remove that "almost",
and consign shared pagetables to the hugetlb ghetto, if that would
indeed remove their handling from the common code paths.  (Whereas, if
we didn't have hugetlb, I would be arguing strongly for shared pts.)

Patch 2/2 does look cleaner than before, and dropping PTSHARE_PUD has
helped to make it all simpler.  But you've not yet added the rss
accounting: that's going to make it quite a lot nastier.  An argument
for sticking just to the hugetlb case: although hugetlb accounts rss at
present, I think we could justify not doing so (though hugetlb rss is
more relevant now it's not prefaulted).

However, I'll continue commenting on your non-hugetlb modifications.

A lot of page migration work has come into the tree (in mm/mempolicy.c
and mm/migrate.c) since 2.6.15: an optimistic guess would be that all
you need do there is skip shared pagetables unless MPOL_MF_MOVE_ALL (as
page_mapcount > 1 pages would be skipped).  You don't have to worry
about the pagetable becoming shared after you've tested: it's already
accepted that what's initially tested may change later (and your rmap.c
changes should cover the wider TLB flushing needed).

Naming: pt_check_unshare_pxd, I think better call them pt_unshare_pxd;
ah, you've already got pt_unshare_pxd, which is similar but different.
That's confusing - forced upon you by pagetable folding peculiarities?
I'd rather have just one copy of that central locking code.  You've
generally been helpful with your underscores: one exception is
pt_vmashared: pt_sharing_vma?  Ah, it's supposed to be gone from the
latest version, but one trace remains - please delete that.
ptshare.h contains more declarations/definitions never used:
pt_increment_pte, pt_decrement_pte, pt_increment_pmd, pt_decrement_pmd
at least: please check and delete what you're not actually using.

Compiler warns "value computed is not used" on the "*shared++;" lines
in page_check_address: should be "(*shared)++;".  So at present the
shared path in rmap.c has not really been tested.

page_referenced_one needs to skip decrementing *mapcount when shared:
the vma prio_tree search will bring it back again and again to the same
shared pagetable, though that pte is only counted once in mapcount: so
it's currently liable to break out too early, missing other entries.

try_to_unmap_one will now be flushing TLB on all cpus for each shared
pagetable entry it unmaps, where often (seeing inactive mm) it wouldn't
have needed to flush TLB at all; but that might work out on balance, it
won't be finding so many entries to unmap.

The prio_tree_iter_inits in pt_share_pte and pt_share_pmd should limit
their scope to the range of the pagetable involved, not the whole vma.
next_shareable_vma likewise?  I thought so at first, but perhaps its
check for a similar vma often avoids immediate unsharing.  Optimizations
only, you've probably had them in mind for later.

I mentioned the off-by-one in pt_shareable_pte and pt_shareable_pmd
before: ought to say "vma->vm_end - 1 >= end"; but must admit that's
nitpicking, since vm_end is PAGE_SIZE aligned anyway, so no real issue
can arise - fix it to help stop others worrying later?  Whereas
pt_share_pte and pt_share_pmd have the complementary issue: "end = base
+ PXD_SIZE" may wrap to 0, so you need to -1 somewhere (but you won't
need base and end if pt_trans_start/end go away).

pt_share_pte and pt_share_pmd: preferable to swap around their pxd and
address arguments, so they resemble what they're replacing.

pt_shareable now has the same off-by-one too: or would have, but
"end = base + (mask-1)" is quite wrong, isn't it?  base + ~mask?
Move those calculations lower down, after the common tests? or do
compiler and processor nowadays optimize such orderings well?  And
there's a leftover "vmas in transition" comment on vm_flags.

pt_shareable is still not rejecting if vma->anon_vma is set: it's quite
possible for a vm_file vma to be private and writable, gather some COW
pages, and then be mprotected to readonly, so passing the vm_flags
test - but its pagetables must not be shared.

VM_PTSHARE came and went, good, you never had the mmap_sem needed to
set it.  VM_TRANSITION came and went, you've replaced it by the
mm->pt_trans_start, pt_trans_end.  At first I thought that a big
improvement, now I'm not so sure.  If they stay, those added fields
should be under #ifdef so as not to enlarge the basic mm_struct.

I'd prefer something other than "lock" in pt_unshare_lock_range and
pt_unlock_range, but I think I'm going to suggest you go back to using
pt_unshare_range alone: let's look at the three callsites.

sys_remap_file_pages: doesn't really need the locked range, you could
just call pt_unshare_range a little lower down, once i_mmap_lock taken.

mprotect_fixup: that does need some protection, yes, because the pte
protections are out-of-synch with the vm_flags for a while (in a way
that's okay for the owning mm, but not for "parasites" wanting to
share).  Please move the pt_unshare_lock_range (or whatever) down above
vma_merge, so you can remove the pt_unlock_range from the -ENOMEM case
above it.

mremap move_vma: not good enough, you're unsharing and locking the old
range, but you also need to lock the new range before copy_vma, to hold
it unshared too; which could be done, though not with the interfaces
you've provided.  (The VM_TRANSITION version was insufficient too, and
cleared the flag at a point where "vma" might already have been freed.)

Are those the only places which need this range locking?
I was worried at first that there might be more, then I came around to
thinking you'd identified the right places, now suddenly I see
pt_check_unshare_pxd in zap_pxd_range as vulnerable: the vma remains in
the prio_tree, so it might immediately get shared again; what the
zapping does is not wholly wrong, but its TLB flushing would be
inadequate if the table has become shared in the meantime.  Or am I
mistaken?

Unless you have firm performance evidence to the contrary, on a
workload that you're seriously trying to address, I suggest you drop
the range locking transitional stuff, and
down_read_trylock(&svma->vm_mm->mmap_sem) in (or near calls to)
next_shareable_vma instead.  That will fail more often than your
transitional checks, but give much stronger assurance that nothing
funny is going on in the vma found.  But do you then need to add
down_write(&mm->mmap_sem) in exit_mmap, if pagetable sharing is
enabled? currently I believe so.

But then, I think, you can remove the pt_check_unshare_pxd from
free_pyd_range: odd how it was doing those once in the unmap_vmas path,
then again in the free_pgtables path - what was your thinking there?
Yes, the pagetable may have gotten reshared in between, but the TLB
flushing would already be inadequate if so.

I admire the simplicity of the way you just unshare when you have to,
letting faults fill back in lazily; but does that have a problem in the
case of a VM_LOCKED vma, losing the guarantees mlock should be giving?

I read through a lot of old mails while reviewing, going back to
Daniel's first implementation in 2002.  The most interesting remark I
found (and have lost again) was one from wli, questioning the locking
required when changing *pmd.  Hmm, let's look at your
pt_check_unshare_pte (similar remarks apply to pt_check_unshare_pmd,
pt_unshare_pte, pt_unshare_pmd), there's a lot to question in that
locking.
Well, the locking that you do have, it's unclear why you're using the
spinlock in the pagetable struct page there: doesn't it amount to?

	page = pmd_page(*pmd);
	if (atomic_add_unless(&page->_mapcount, -1, -1))
		return 0;
	pmd_clear_flush(mm, address, pmd);
	return 1;

Ah, probably atomic_add_unless wasn't available when you wrote it.

But then what of the "Oops, already mapped" pt_decrement_share in
pt_share_pte?  That's under different locking (rightly, the level
above, since the question there is whether a racing thread has set
*pmd): what happens if that decrement brings the share count down to,
umm, something awkward - hard to be specific, partly because of how
_mapcount starts from -1, partly because you've gone for a share count
rather than a reference count - I understand that you were avoiding the
overhead of maintaining another reference count on the common path, but
it leaves me deeply suspicious, I fear it's hiding bugs.

I'd agree it'll be rare (usually the racing thread will have found the
same pagetable to share as we have, and so raised its share count), but
I do believe that pt_decrement_share can go wrong: the process that had
that pagetable may be exiting, find it shared in its
pt_check_unshare_pte so skip zapping, then we're left with that
pagetable to free - but we do nothing other than decrement the count
one too far.  I think that will get fixed by pt_share_pte holding
i_mmap_lock and its down_read_trylock of svma->vm_mm->mmap_sem across
the lower block: then it only needs to pt_increment_share when all's
well at the end, the decrement case gone.  pte_lock nests within
i_mmap_lock, should be fine for pmd_lock also.

Now, back to the question of the pmd_clear_flush: currently, we may add
a valid entry *pmd at any time, but we only clear it in free_pgtables,
after all occupying vmas have been removed from anon_vma and prio_tree.
You're relaxing that; most paths are protected by holding mmap_sem, but
file truncation and rmap lookup are not.
The easiest protection against races here is to hold i_mmap_lock, since
both unmap_mapping_range and page_check_address do (but slightly messy
since the unmap_mapping_range path must then avoid retaking it in the
pt_check_unshares).  If you're taking i_mmap_lock in
pt_check_unshare_pte etc, you could then skip the atomic_add_unless I
was suggesting above, revert to your existing structure but using
i_mmap_lock instead of pmd_page ptl.  Except, you must not drop the
lock until after your pmd_clear_flush (which only needs flush_tlb_mm,
doesn't it, rather than flush_tlb_all?).  Because once you drop the
lock, the process you were sharing with could unmap and free all the
pages, and not knowing it had been sharing, only flush for its own mm -
other threads of your process might be able to access those pages after
they were freed by the other.

I don't suppose extending the use of i_mmap_lock as I suggest will be
popular, it's liable to reduce your scalability: I'm more pointing to
an obvious way to fix some problems than necessarily the end solution.

Please don't interpret these detailed comments as meaning that I think
your patch is almost ready: I'm afraid that the longer I spend looking
at it, the more I find to worry about - not a good sign.  (And let me
repeat, I've not looked at the hugetlb end of it at all.)  And though
it's easy to find performance advocates in favour of your patch, it's
hard to find kernel hackers who care for maintainability wanting it in.
And I worry that it will tie our hands, repeatedly posing a difficulty
for other future developments (rather as sys_remap_file_pages did, or I
feared Christoph's pte xchging would).

How was Ray Bryant's shared,anonymous,fork,munmap,private bug of
25 Jan resolved?  We didn't hear the end of that.

Hugh
* Re: [PATCH 0/2][RFC] New version of shared page tables
From: Ray Bryant @ 2006-05-08 19:32 UTC
To: Hugh Dickins; +Cc: Dave McCracken, Linux Memory Management, Linux Kernel

On Saturday 06 May 2006 10:25, Hugh Dickins wrote:
<snip>
> How was Ray Bryant's shared,anonymous,fork,munmap,private bug of
> 25 Jan resolved?  We didn't hear the end of that.

I never heard anything back from Dave, either.

> Hugh
<snip>

--
Ray Bryant
AMD Performance Labs                   Austin, Tx
512-602-0038 (o)                       512-507-7807 (c)
* Re: [PATCH 0/2][RFC] New version of shared page tables
From: Dave McCracken @ 2006-05-16 21:09 UTC
To: Ray Bryant, Hugh Dickins; +Cc: Linux Memory Management, Linux Kernel

--On Monday, May 08, 2006 14:32:39 -0500 Ray Bryant
<raybry@mpdtxmail.amd.com> wrote:

> On Saturday 06 May 2006 10:25, Hugh Dickins wrote:
> <snip>
>> How was Ray Bryant's shared,anonymous,fork,munmap,private bug of
>> 25 Jan resolved?  We didn't hear the end of that.
>
> I never heard anything back from Dave, either.

My apologies.  As I recall your problem looked to be a race in an area
where I was redoing the concurrency control.  I intended to ask you to
retest when my new version came out.  Unfortunately the new version
took awhile, and by the time I sent it out I forgot to ask you about
it.

I believe your problem should be fixed in recent versions.  If not,
I'll make another pass at it.

Dave McCracken
* Re: [PATCH 0/2][RFC] New version of shared page tables
From: Ray Bryant @ 2006-05-19 16:55 UTC
To: Dave McCracken; +Cc: Hugh Dickins, Linux Memory Management, Linux Kernel

On Tuesday 16 May 2006 16:09, Dave McCracken wrote:
> --On Monday, May 08, 2006 14:32:39 -0500 Ray Bryant
> <raybry@mpdtxmail.amd.com> wrote:
>> On Saturday 06 May 2006 10:25, Hugh Dickins wrote:
>> <snip>
>>> How was Ray Bryant's shared,anonymous,fork,munmap,private bug of
>>> 25 Jan resolved?  We didn't hear the end of that.
>>
>> I never heard anything back from Dave, either.
>
> My apologies.  As I recall your problem looked to be a race in an area
> where I was redoing the concurrency control.  I intended to ask you to
> retest when my new version came out.  Unfortunately the new version
> took awhile, and by the time I sent it out I forgot to ask you about it.
>
> I believe your problem should be fixed in recent versions.  If not,
> I'll make another pass at it.
>
> Dave McCracken

Let me build up a kernel with the latest patches and give it a try.
(Sorry for delay, didn't see this note until today.)

--
Ray Bryant
AMD Performance Labs                   Austin, Tx
512-602-0038 (o)                       512-507-7807 (c)
* Re: [PATCH 0/2][RFC] New version of shared page tables
From: Ray Bryant @ 2006-05-22 18:00 UTC
To: Dave McCracken; +Cc: Hugh Dickins, Linux Memory Management, Linux Kernel

[-- Attachment #1: Type: text/plain, Size: 2946 bytes --]

On Tuesday 16 May 2006 16:09, Dave McCracken wrote:
> --On Monday, May 08, 2006 14:32:39 -0500 Ray Bryant
> <raybry@mpdtxmail.amd.com> wrote:
>> On Saturday 06 May 2006 10:25, Hugh Dickins wrote:
>> <snip>
>>> How was Ray Bryant's shared,anonymous,fork,munmap,private bug of
>>> 25 Jan resolved?  We didn't hear the end of that.
>>
>> I never heard anything back from Dave, either.
>
> My apologies.  As I recall your problem looked to be a race in an area
> where I was redoing the concurrency control.  I intended to ask you to
> retest when my new version came out.  Unfortunately the new version
> took awhile, and by the time I sent it out I forgot to ask you about it.
>
> I believe your problem should be fixed in recent versions.  If not,
> I'll make another pass at it.
>
> Dave McCracken

Dave,

I'm sending you a test case and a small kernel patch (see attachments).
The patch applies to 2.6.17-rc1, on top of your patches from 4/10/2006
(I'm assuming these are the most recent ones).  What the patch does is
to add a system call that will return the pfn and ptep for a given
virtual address.

What the test program does (I think :-) ) is to create a mmap'd shared
region, then fork off a child.  The child then re-mmaps() private a
portion of the region.  Call it without arguments for now, that should
map 512 pte's and share them between the parent and 1 child.  [Later on
we can try more pages and more children (e. g. ./shpt_test1 128 64).]
At this point, what I expect to have happened is that at the shared
region address in the child, there will be a number of pages that are
still shared with the parent, hence have the same pfn and ptep as they
used to, followed by a set of pages in the re-mmapped() region where
the pfn's and ptep's are different, because that set of pages is no
longer shared.

What I find is that in the re-mmapped() region, the pfn's are
different, but the ptep's have not changed.  Hence, we've modified the
parent address space rather than getting our own copy of that part of
the address space.

Now I'm not positive as to what the semantics SHOULD be here, so that
may be the error involved, but it seems to me that if I mmap() the
region private in the child, I should get a nice new private copy, and
the pte's should no longer be shared with the parent.  Is that the way
you guys understand the semantics of this?

Anyway, take a look at my test case and see if it makes any sense to
you.  If it turns out my test case is in error, then mea culpa, and
I'll fix the problems and try again.

Best Regards,
Ray
--
Ray Bryant
AMD Performance Labs                   Austin, Tx
512-602-0038 (o)                       512-507-7807 (c)

[-- Attachment #2: add-sys_get_vminfo.patch --]
[-- Type: text/x-diff, Size: 2407 bytes --]

Index: linux-2.6.17-rc1-ptsh/mm/memory.c
===================================================================
--- linux-2.6.17-rc1-ptsh.orig/mm/memory.c
+++ linux-2.6.17-rc1-ptsh/mm/memory.c
@@ -2463,3 +2463,80 @@ int in_gate_area_no_task(unsigned long a
 }
 #endif	/* __HAVE_ARCH_GATE_AREA */
+
+#define VMINFO_RESULTS 3
+asmlinkage long
+sys_get_vminfo(pid_t pid, unsigned long addr, long *user_addr)
+{
+	int ret;
+	struct page *p;
+	struct task_struct *task = NULL;
+	struct mm_struct *mm = NULL;
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *ptep = NULL;
+	unsigned long results[VMINFO_RESULTS];
+
+	if (pid >= 0) {
+		read_lock(&tasklist_lock);
+		task = find_task_by_pid(pid);
+		if (task) {
+			task_lock(task);
+			mm = task->mm;
+			if (mm)
+				atomic_inc(&mm->mm_users);
+		} else {
+			read_unlock(&tasklist_lock);
+			return -ESRCH;
+		}
+		read_unlock(&tasklist_lock);
+	} else
+		return -1;
+
+	ret = get_user_pages(task, mm, addr, 1, 0, 0, &p, NULL);
+	results[0] = 0;
+	results[1] = -1;
+	if (ret >= 0) {
+		results[0] = page_to_pfn(p);
+		results[1] = page_to_nid(p);
+		put_page(p);
+	} else
+		ret = EINVAL;
+
+	pgd = pgd_offset(mm, addr);
+	if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
+		goto no_page_table;
+
+	pud = pud_offset(pgd, addr);
+	if (pud_none(*pud) || unlikely(pud_bad(*pud)))
+		goto no_page_table;
+
+	pmd = pmd_offset(pud, addr);
+	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
+		goto no_page_table;
+
+	ptep = pte_offset_map(pmd, addr);
+	pte_unmap(ptep);
+
+	if (mm)
+		mmput(mm);
+
+	task_unlock(task);
+
+copy_vminfo_to_user:
+	results[2] = (unsigned long) ptep;
+
+	if (copy_to_user(user_addr, results, VMINFO_RESULTS*sizeof(long)))
+		ret = -EFAULT;
+
+	return ret;
+
+no_page_table:
+	ptep = NULL;
+
+	ret = ENOMEM;
+
+	goto copy_vminfo_to_user;
+
+}
Index: linux-2.6.17-rc1-ptsh/include/asm-x86_64/unistd.h
===================================================================
--- linux-2.6.17-rc1-ptsh.orig/include/asm-x86_64/unistd.h
+++ linux-2.6.17-rc1-ptsh/include/asm-x86_64/unistd.h
@@ -611,8 +611,10 @@ __SYSCALL(__NR_set_robust_list, sys_set_
 __SYSCALL(__NR_get_robust_list, sys_get_robust_list)
 #define __NR_splice 275
 __SYSCALL(__NR_splice, sys_splice)
-
-#define __NR_syscall_max __NR_splice
+#define __NR_get_vminfo 276
+__SYSCALL(__NR_get_vminfo, sys_get_vminfo)
+
+#define __NR_syscall_max __NR_get_vminfo
 
 #ifndef __NO_STUBS

[-- Attachment #3: shpt_test1.c --]
[-- Type: text/x-csrc, Size: 10201 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <sys/shm.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/time.h>
#include <asm/atomic.h>

#define PAGE_SIZE 4096
#define INITIAL_HOLE 50
#define ADDR (0x00002aaaaaf00000UL)
#define ADDR1 (ADDR + INITIAL_HOLE*PAGE_SIZE)
#define MAX_NCHILD 128
#define PAGES_PER_PTE_PAGE 512
#define PAGE_SIZE_IN_KB (PAGE_SIZE/1024)

/*
 * Now test what happens if we create a shared region under the shpt
 * kernel and then the children remap part of the shared region.  We
 * use a hacked kernel with an additional system call (get_vminfo())
 * [see below for details] to make sure that each child gets its own
 * pfn in this case and that the shared page table entries are no
 * longer shared.  This test was suggested by Christoph Lameter @ SGI.
 */

static inline long
timeval_diff_in_ms(struct timeval *a, struct timeval *b)
{
	if (a->tv_usec > b->tv_usec) {
		// borrow
		a->tv_sec--;
		a->tv_usec += 1000000;
	}
	return(1000 * (b->tv_sec - a->tv_sec) +
	       (b->tv_usec - a->tv_usec)/1000);
}

long get_pte_kb()
{
	FILE *f;
	int err;
	long parameter;
	char str[128];

	f = fopen("/proc/meminfo", "r");
	if (!f) {
		printf("fopen() error in %s\n", __FUNCTION__);
		perror("fopen");
		exit(-1);
	}
	while (1) {
		parameter = -1;
		err = fscanf(f, "%s %ld", str, &parameter);
		if (err == EOF || !strcmp(str, "PageTables:"))
			break;
	}
	return parameter;
}

/*
 * hacked in system call to return the following info for a virtual address:
 *	results[0] = pfn of virtual address "addr"
 *	results[1] = nodeid where pfn lives (for NUMA boxen)
 *	results[2] = address of the pte
 *
 * Note well, this assumes the page has already been faulted in.
 * If this hasn't happened, the system call results are undefined.
 */
#define __NR_get_vminfo 276
static inline long
get_vminfo(pid_t pid, void *addr, long *results)
{
	return (syscall(__NR_get_vminfo, pid, addr, results));
}

main(int argc, char **argv)
{
	char *pages, *pages1;
	int count;
	int errors, pc, nchild, child;
	long shared_region_size, i, remapped_region_size;
	long starting_pte_kb, ending_pte_kb;
	pid_t pid[MAX_NCHILD];
	void *addr = (void *) ADDR;
	void *addr1 = (void *) ADDR1;
	atomic_t *atom = (atomic_t *) (addr + 8);
	volatile long *flag = (long *) (addr + 16);
	struct timeval start, forkend, end;
	long results[3];
	long *page_pfn, *page_ptep;
	int pfn_should_match_dont=0, ptep_should_match_dont=0;
	int pfn_shouldnt_match_do=0, ptep_shouldnt_match_do=0;

	setbuf(stdout, NULL);
	printf("Main starts......\n");
	printf("argc=%d\n", argc);

	/* first arg is the number of pte pages to use */
	if (argc == 1)
		pc = 1;
	else
		sscanf(argv[1], "%d", &pc);

	/* second arg is the number of threads to create */
	if (argc < 3)
		nchild = 1;
	else
		sscanf(argv[2], "%d", &nchild);

	/* find out how many pages of pte's we've already used */
	/* (we'll subtract this off of the number we get after */
	/* the children are all forked, below.) ............. */
	starting_pte_kb = get_pte_kb();

	pc = PAGES_PER_PTE_PAGE * pc;
	if (nchild > MAX_NCHILD)
		nchild = MAX_NCHILD;
	printf("Number of pages to map: %d nchild: %d\n", pc, nchild);

	shared_region_size = (long)pc * (long)PAGE_SIZE;
	printf("Shared region size: %5.2f GB\n",
	       shared_region_size/(1024.0*1024.0*1024.0));

	pages = (char *)mmap(addr, shared_region_size,
			     PROT_READ | PROT_WRITE,
			     MAP_SHARED | MAP_ANONYMOUS | MAP_FIXED, 0, 0);
	if (pages == MAP_FAILED) {
		printf("mmap() failed.\n");
		perror("mmap");
		exit(999);
	}
	printf("mapped region starts at: %p\n", pages);

	/* initialize the communication flags in the shared region */
	atomic_set(atom, 0);
	*flag = 0;

	printf("writing data..........\n");
	errors = 0;
	/* initialize the first byte of page N to N */
	for (i=0; i<shared_region_size; i+=PAGE_SIZE) {
		pages[i] = (char) (i/PAGE_SIZE);
	}
	/* paranoia check. ........................ */
	for (i=0; i<shared_region_size; i+=PAGE_SIZE) {
		if (pages[i] != (char) (i/PAGE_SIZE))
			errors++;
	}
	printf("done writing data...... errors=%d\n", errors);

	printf("Forking.....\n");
	gettimeofday(&start, NULL);
	for (child=0; child<nchild; child++) {
		if (pid[child]=fork()) {
			if ((nchild < 16) || ((child % 16) == 0))
				printf("parent (pid:%d) created child #%3d: pid:%d\n",
				       getpid(), child, pid[child]);
		} else {
			char tmp, rc;
			int tests = 0;
			errors = 0;
			/* check to make sure child sees same data as above */
			/* also record pfn and pte addresses for later comparison */
			page_pfn = (long *) malloc(pc*sizeof(long));
			page_ptep = (long *) malloc(pc*sizeof(long));
			if (!page_pfn || !page_ptep) {
				printf("Ack, PID: %d couldn't allocate both page_pfn (%p) and page_pte (%p)\n",
				       getpid(), page_pfn, page_ptep);
				perror("mmap");
				atomic_add(1, atom);
				exit(-1);
			} else
				printf("PID: %d page_pfn:%p page_pte:%p\n",
				       getpid(), page_pfn, page_ptep);
			for (i=0; i<shared_region_size; i+=PAGE_SIZE) {
				if (pages[i] != (char) (i/PAGE_SIZE))
					errors++;
				rc = get_vminfo(getpid(), &pages[i], results);
				if (rc >= 0) {
					page_pfn[i/PAGE_SIZE] = results[0];
					page_ptep[i/PAGE_SIZE] = results[2];
				} else
					printf("PID:%d i=%d vmaddr:%p returned %d\n",
					       getpid(), i, &pages[i], rc);
			}
			if (errors > 0)
				printf("child(%d) sees errors=%d\n",
				       getpid(), errors);

			/* now (re-)mmap a portion of the shared region */
			/* yeah, this is a little bit arbitrary :-) */
			remapped_region_size = PAGE_SIZE*pc/8;
			printf("remapped_region_size: %ld\n",
			       remapped_region_size);
			pages1 = (char *)mmap(addr1, remapped_region_size,
					      PROT_READ | PROT_WRITE,
					      MAP_ANONYMOUS | MAP_FIXED | MAP_PRIVATE,
					      0, 0);
			if (pages1 == MAP_FAILED) {
				printf("mmap() failed in child process %d\n",
				       getpid());
				perror("mmap");
				atomic_add(1, atom);
				exit(-1);
			}

			/* fault the pages in and put some different data in the pages */
			tmp = (char) (getpid() & 0xFF);
			for (i=0; i<remapped_region_size; i+=PAGE_SIZE) {
				pages1[i] = tmp;
			}

			errors = 0;
			tests = 0;
			/*
			 * now print out the pfn and pte addresses for the
			 * entire shared region (up to the end of the
			 * re-mapped region, above).  we expect some of the
			 * addresses to still use shared ptes, but above the
			 * initial hole, we should see distinct ptes.
			 * Another plausible implementation would be to
			 * unshare the entire region; that would be legal as
			 * well.
			 */
			for (i=0; i<INITIAL_HOLE*PAGE_SIZE+remapped_region_size; i+=PAGE_SIZE) {
				if (((void *)&pages[i]) < addr1) {
					/* we expect shared pte's in this region */
					if (pages[i] != tmp)
						errors++;
					tests++;
					rc = get_vminfo(getpid(), &pages[i], results);
					if (rc >= 0) {
						if (results[0] != page_pfn[i/PAGE_SIZE])
							pfn_should_match_dont++;
						if (results[2] != page_ptep[i/PAGE_SIZE])
							ptep_should_match_dont++;
						printf("Expect shared: PID:%d i=%d vmaddr:%p pfn:0x%lx was 0x%lx pte:0x%lx was 0x%lx\n",
						       getpid(), i, &pages[i],
						       page_pfn[i/PAGE_SIZE], results[0],
						       page_ptep[i/PAGE_SIZE], results[2]);
					} else
						printf("Expect shared: PID:%d i=%d vmaddr:%p returned %d\n",
						       getpid(), i, &pages[i], rc);
				} else {
					/* we expect unshared pte's in this region */
					if (pages[i] != tmp)
						errors++;
					tests++;
					rc = get_vminfo(getpid(), &pages[i], results);
					if (rc >= 0) {
						if (results[0] == page_pfn[i/PAGE_SIZE])
							pfn_shouldnt_match_do++;
						if (results[2] == page_ptep[i/PAGE_SIZE])
							ptep_shouldnt_match_do++;
						printf("Expect unshared: PID:%d i=%d vmaddr:%p pfn:0x%lx was 0x%lx pte:0x%lx was 0x%lx\n",
						       getpid(), i, &pages[i],
						       page_pfn[i/PAGE_SIZE], results[0],
						       page_ptep[i/PAGE_SIZE], results[2]);
					} else
						printf("Expect unshared: PID:%d i=%d vmaddr:%p returned %d\n",
						       getpid(), i, &pages[i], rc);
				}
			}

			/* print the number of errors found for the region we have examined */
			if (errors > 0)
				printf("child(%d) sees errors=%d in region, tests:%d\n",
				       getpid(), errors, tests);
			if (pfn_should_match_dont || ptep_should_match_dont ||
			    pfn_shouldnt_match_do || ptep_shouldnt_match_do) {
				int tmp = pfn_should_match_dont + ptep_should_match_dont +
					  pfn_shouldnt_match_do + ptep_shouldnt_match_do;
				printf("child(%d) sees match/mismatch errors %d in region, tests:%d\n",
				       getpid(), tmp, tests);
				printf("child(%d) pfn_should_match_dont: %d\n",
				       getpid(), pfn_should_match_dont);
				printf("child(%d) ptep_should_match_dont: %d\n",
				       getpid(), ptep_should_match_dont);
				printf("child(%d) pfn_shouldnt_match_do: %d\n",
				       getpid(), pfn_shouldnt_match_do);
				printf("child(%d) ptep_shouldnt_match_do: %d\n",
				       getpid(), ptep_shouldnt_match_do);
			}

			/* indicate that this child is done */
			atomic_add(1, atom);
			while (!(*flag))
				sleep(1);
			exit(errors);
		}
	}
	gettimeofday(&forkend, 0);

	/* wait for all of the children to get started and check their data */
	printf("Parent is waiting....\n");
	while (atomic_read(atom) < nchild)
		usleep(1000L);
	gettimeofday(&end, NULL);
	printf("All children are now sleeping....elapsed ms: %ld fork ms: %ld\n",
	       timeval_diff_in_ms(&start, &end),
	       timeval_diff_in_ms(&start, &forkend));

	/* now let us check to see how many pages of pte's have been used */
	ending_pte_kb = get_pte_kb();
	/* We can use this number to see if the shared region is still using
	 * any shared pte's after the mmap() by the child, although it really
	 * doesn't matter (e. g. it would be allowed to revert the whole
	 * shared region to non-shared ptes.) ........................... */
	printf("pte pages used: \t%8ld\n",
	       (ending_pte_kb-starting_pte_kb)/PAGE_SIZE_IN_KB);
	printf("KB of pte pages used: \t%8ld\n",
	       ending_pte_kb-starting_pte_kb);

	*flag = 1;
	for (i=0; i<nchild; i++) {
		int status;
		waitpid(pid[i], &status, 0);
		if (status != 0)
			printf("pid %d exited with non-zero status: %d\n",
			       pid[i], status);
	}
	munmap(pages, shared_region_size);
	printf("Main exits......\n");
}
* Re: [PATCH 0/2][RFC] New version of shared page tables 2006-05-06 15:25 ` Hugh Dickins 2006-05-08 19:32 ` Ray Bryant @ 2006-05-08 19:49 ` Brian Twichell 2006-05-09 3:42 ` Nick Piggin 2006-05-09 19:22 ` Hugh Dickins 1 sibling, 2 replies; 16+ messages in thread From: Brian Twichell @ 2006-05-08 19:49 UTC (permalink / raw) To: Hugh Dickins; +Cc: Dave McCracken, Linux Memory Management, Linux Kernel Hugh Dickins wrote: >Let me say (while perhaps others are still reading) that I'm seriously >wondering whether you should actually restrict your shared pagetable work >to the hugetlb case. I realize that would be a disappointing limitation >to you, and would remove the 25%/50% improvement cases, leaving only the >3%/4% last-ounce-of-performance cases. > >But it's worrying me a lot that these complications to core mm code will >_almost_ never apply to the majority of users, will get little testing >outside of specialist setups. I'd feel safer to remove that "almost", >and consign shared pagetables to the hugetlb ghetto, if that would >indeed remove their handling from the common code paths. (Whereas, >if we didn't have hugetlb, I would be arguing strongly for shared pts.) > Hi, In the case of x86-64, if pagetable sharing for small pages was eliminated, we'd lose more than the 27-33% throughput improvement observed when the bufferpools are in small pages. We'd also lose a significant chunk of the 3% improvement observed when the bufferpools are in hugepages. This occurs because there is still small page pagetable sharing being achieved, minimally for database text, when the bufferpools are in hugepages. The performance counters indicated that ITLB and DTLB page walks were reduced by 28% and 10%, respectively, in the x86-64/hugepage case. To be clear, all measurements discussed in my post were performed with kernels config'ed to share pagetables for both small pages and hugepages. 
If we had to choose between pagetable sharing for small pages and hugepages, we would be in favor of retaining pagetable sharing for small pages. That is where the discernable benefit is for customers that run with "out-of-the-box" settings. Also, there is still some benefit there on x86-64 for customers that use hugepages for the bufferpools. Cheers, Brian -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 0/2][RFC] New version of shared page tables 2006-05-08 19:49 ` Brian Twichell @ 2006-05-09 3:42 ` Nick Piggin 2006-05-10 2:07 ` Chen, Kenneth W 2006-05-10 19:45 ` Brian Twichell 2006-05-09 19:22 ` Hugh Dickins 1 sibling, 2 replies; 16+ messages in thread From: Nick Piggin @ 2006-05-09 3:42 UTC (permalink / raw) To: Brian Twichell Cc: Hugh Dickins, Dave McCracken, Linux Memory Management, Linux Kernel Brian Twichell wrote: > Hugh Dickins wrote: > >> Let me say (while perhaps others are still reading) that I'm seriously >> wondering whether you should actually restrict your shared pagetable >> work >> to the hugetlb case. I realize that would be a disappointing limitation >> to you, and would remove the 25%/50% improvement cases, leaving only the >> 3%/4% last-ounce-of-performance cases. >> >> But it's worrying me a lot that these complications to core mm code will >> _almost_ never apply to the majority of users, will get little testing >> outside of specialist setups. I'd feel safer to remove that "almost", >> and consign shared pagetables to the hugetlb ghetto, if that would >> indeed remove their handling from the common code paths. (Whereas, >> if we didn't have hugetlb, I would be arguing strongly for shared pts.) >> > Hi, > > In the case of x86-64, if pagetable sharing for small pages was > eliminated, we'd lose more than the 27-33% throughput improvement > observed when the bufferpools are in small pages. We'd also lose a > significant chunk of the 3% improvement observed when the bufferpools > are in hugepages. This occurs because there is still small page > pagetable sharing being achieved, minimally for database text, when > the bufferpools are in hugepages. The performance counters indicated > that ITLB and DTLB page walks were reduced by 28% and 10%, > respectively, in the x86-64/hugepage case. Aside, can you just enlighten me as to how TLB misses are improved on x86-64? 
As far as I knew, it doesn't have ASIDs so I wouldn't have thought it could share TLBs anyway... But I'm not up to scratch with modern implementations. > > To be clear, all measurements discussed in my post were performed with > kernels config'ed to share pagetables for both small pages and hugepages. > > If we had to choose between pagetable sharing for small pages and > hugepages, we would be in favor of retaining pagetable sharing for > small pages. That is where the discernable benefit is for customers > that run with "out-of-the-box" settings. Also, there is still some > benefit there on x86-64 for customers that use hugepages for the > bufferpools. Of course if it was free performance then we'd want it. The downsides are that it is a significant complexity for a pretty small (3%) performance gain for your apparent target workload, which is pretty uncommon among all Linux users. Ignoring the complexity, it is still not free. Sharing data across processes adds to synchronisation overhead and hurts scalability. Some of these page fault scalability scenarios have shown to be important enough that we have introduced complexity _there_. And it seems customers running "out-of-the-box" settings really want to start using hugepages if they're interested in getting the most performance possible, no? --- Send instant messages to your online friends http://au.messenger.yahoo.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* RE: [PATCH 0/2][RFC] New version of shared page tables 2006-05-09 3:42 ` Nick Piggin @ 2006-05-10 2:07 ` Chen, Kenneth W 2006-05-10 19:45 ` Brian Twichell 1 sibling, 0 replies; 16+ messages in thread From: Chen, Kenneth W @ 2006-05-10 2:07 UTC (permalink / raw) To: 'Nick Piggin', Brian Twichell Cc: Hugh Dickins, Dave McCracken, Linux Memory Management, Linux Kernel Nick Piggin wrote on Monday, May 08, 2006 8:42 PM > Brian Twichell wrote: > > In the case of x86-64, if pagetable sharing for small pages was > > eliminated, we'd lose more than the 27-33% throughput improvement > > observed when the bufferpools are in small pages. We'd also lose a > > significant chunk of the 3% improvement observed when the bufferpools > > are in hugepages. This occurs because there is still small page > > pagetable sharing being achieved, minimally for database text, when > > the bufferpools are in hugepages. The performance counters indicated > > that ITLB and DTLB page walks were reduced by 28% and 10%, > > respectively, in the x86-64/hugepage case. > > > Aside, can you just enlighten me as to how TLB misses are improved on > x86-64? As far as I knew, it doesn't have ASIDs so I wouldn't have thought > it could share TLBs anyway... > But I'm not up to scratch with modern implementations. Allow me to jump in if I may: The number of TLB misses did not change much (both i-side and d-side, as expected). What changed is that the penalty of a TLB miss is reduced: i.e., the number of page table walks performed by the hardware is reduced. This is due to specialized buffering of information that reduces the need to perform page walks. With page table sharing, the overall size of the page tables is reduced; in turn, this gives a better hit rate on the buffered items and helps to mitigate page walks upon a TLB miss. - Ken -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 0/2][RFC] New version of shared page tables 2006-05-09 3:42 ` Nick Piggin 2006-05-10 2:07 ` Chen, Kenneth W @ 2006-05-10 19:45 ` Brian Twichell 2006-05-12 5:17 ` Nick Piggin 1 sibling, 1 reply; 16+ messages in thread From: Brian Twichell @ 2006-05-10 19:45 UTC (permalink / raw) To: Nick Piggin Cc: Hugh Dickins, Dave McCracken, Linux Memory Management, Linux Kernel Nick Piggin wrote: > Brian Twichell wrote: > >> >> If we had to choose between pagetable sharing for small pages and >> hugepages, we would be in favor of retaining pagetable sharing for >> small pages. That is where the discernable benefit is for customers >> that run with "out-of-the-box" settings. Also, there is still some >> benefit there on x86-64 for customers that use hugepages for the >> bufferpools. > > > Of course if it was free performance then we'd want it. The downsides > are that it > is a significant complexity for a pretty small (3%) performance gain > for your apparent > target workload, which is pretty uncommon among all Linux users. Our performance data demonstrated that the potential gain for the non-hugepage case is much higher than 3%. > > Ignoring the complexity, it is still not free. Sharing data across > processes adds to > synchronisation overhead and hurts scalability. Some of these page > fault scalability > scenarios have shown to be important enough that we have introduced > complexity _there_. True, but this needs to be balanced against the fact that pagetable sharing will reduce the number of page faults when it is achieved. Let's say you have N processes which touch all the pages in an M page shared memory region. Without shared pagetables this requires N*M page faults; if pagetable sharing is achieved, only M pagefaults are required. > > And it seems customers running "out-of-the-box" settings really want > to start using > hugepages if they're interested in getting the most performance > possible, no? 
My perspective is that, once the customer is required to invoke "echo XXX > /proc/sys/vm/nr_hugepages" they've left the "out-of-the-box" domain, and entered the domain of hoping that the number of hugepages is sufficient, because if it's not, they'll probably need to reboot, which can be pretty inconvenient for a production transaction-processing application. Cheers, Brian -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 0/2][RFC] New version of shared page tables 2006-05-10 19:45 ` Brian Twichell @ 2006-05-12 5:17 ` Nick Piggin 0 siblings, 0 replies; 16+ messages in thread From: Nick Piggin @ 2006-05-12 5:17 UTC (permalink / raw) To: Brian Twichell Cc: Hugh Dickins, Dave McCracken, Linux Memory Management, Linux Kernel Brian Twichell wrote: > Nick Piggin wrote: >> Of course if it was free performance then we'd want it. The downsides >> are that it >> is a significant complexity for a pretty small (3%) performance gain >> for your apparent >> target workload, which is pretty uncommon among all Linux users. > > > Our performance data demonstrated that the potential gain for the > non-hugepage case is much higher than 3%. The point is, there are hugepages. They were a significant additional complexity but the concession was made because they did provide a large speedup for databases. > >> >> Ignoring the complexity, it is still not free. Sharing data across >> processes adds to >> synchronisation overhead and hurts scalability. Some of these page >> fault scalability >> scenarios have shown to be important enough that we have introduced >> complexity _there_. > > > True, but this needs to be balanced against the fact that pagetable > sharing will reduce the number of page faults when it is achieved. > Let's say you have N processes which touch all the pages in an M page > shared memory region. Without shared pagetables this requires N*M page > faults; if pagetable sharing is achieved, only M pagefaults are required. > >> >> And it seems customers running "out-of-the-box" settings really want >> to start using >> hugepages if they're interested in getting the most performance >> possible, no? 
> > > My perspective is that, once the customer is required to invoke "echo > XXX > /proc/sys/vm/nr_hugepages" they've left the "out-of-the-box" > domain, and entered the domain of hoping that the number of hugepages is > sufficient, because if it's not, they'll probably need to reboot, which > can be pretty inconvenient for a production transaction-processing > application. I think it is pretty easy to reserve hugepages at bootup. This is what a production transaction processing system will be doing, won't it? Especially if they're performance constrained and hugepages gives them a 30% performance boost. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
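[Editorial note: for reference, these are the two standard Linux interfaces being discussed — the boot-time route is the one Nick alludes to, since it reserves the pool before memory fragments. Values are illustrative; both require root:]

```shell
# 1) At boot, via the kernel command line (reserved before fragmentation):
#       hugepages=4096
#
# 2) At runtime (the mechanism Brian mentions; may come up short or
#    require a reboot if memory is already fragmented):
echo 4096 > /proc/sys/vm/nr_hugepages
grep HugePages /proc/meminfo
```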
* Re: [PATCH 0/2][RFC] New version of shared page tables 2006-05-08 19:49 ` Brian Twichell 2006-05-09 3:42 ` Nick Piggin @ 2006-05-09 19:22 ` Hugh Dickins 1 sibling, 0 replies; 16+ messages in thread From: Hugh Dickins @ 2006-05-09 19:22 UTC (permalink / raw) To: Brian Twichell; +Cc: Dave McCracken, Linux Memory Management, Linux Kernel On Mon, 8 May 2006, Brian Twichell wrote: > > If we had to choose between pagetable sharing for small pages and hugepages, > we would be in favor of retaining pagetable sharing for small pages. That is > where the discernable benefit is for customers that run with "out-of-the-box" > settings. Also, there is still some benefit there on x86-64 for customers > that use hugepages for the bufferpools. Thanks for the further info, Brian. Okay, the hugepage end of it does add a different kind of complexity, in an area already complex from the different arch implementations. If you've found that a significant part of the hugepage test improvement is actually due to the smallpage changes, let's turn around what I said, and suggest Dave concentrate on getting the smallpage changes right, putting the hugepage part of it on the backburner at least for now (or if he's particularly keen still to present it, as 3/3). Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 0/2][RFC] New version of shared page tables 2006-05-03 15:43 [PATCH 0/2][RFC] New version of shared page tables Dave McCracken 2006-05-03 15:56 ` Hugh Dickins @ 2006-05-05 19:25 ` Brian Twichell 2006-05-06 3:37 ` Chen, Kenneth W 1 sibling, 1 reply; 16+ messages in thread From: Brian Twichell @ 2006-05-05 19:25 UTC (permalink / raw) To: Dave McCracken Cc: Hugh Dickins, Linux Memory Management, Linux Kernel, slpratt

Hi,

We reevaluated shared pagetables with recent patches from Dave. As with our previous evaluation, a database transaction-processing workload was used. This time our evaluation focused on a 4-way x86-64 configuration with 8 GB of memory.

In the case that the bufferpools were in small pages, shared pagetables provided a 27% improvement in transaction throughput. The performance increase is attributable to multiple factors. First, pagetable memory consumption was reduced from 1.65 GB to 51 MB, freeing up 20% of the system's memory. This memory was devoted to enlarging the database bufferpools, which allowed more database data to be cached in memory. The effect of this was to reduce the number of disk I/O's per transaction by 23%, which contributed to a similar reduction in the context switch rate.

A second major component of the performance improvement is reduced TLB and cache miss rates, due to the smaller pagetable footprint. To try to isolate this benefit, we performed an experiment where pagetables were shared, but the database bufferpools were not enlarged. In this configuration, shared pagetables provided a 9% increase in database transaction throughput. Analysis of processor performance counters revealed the following benefits from pagetable sharing:

- ITLB and DTLB page walks were reduced by 27% and 26%, respectively.
- L1 and L2 cache misses were reduced by 5%. This is due to fewer pagetable entries crowding the caches.
- Front-side bus traffic was reduced approximately 10%.

When the bufferpools were in hugepages, shared pagetables provided a 3% increase in database transaction throughput. Some of the underlying benefits of pagetable sharing were as follows:

- Pagetable memory consumption was reduced from 53 MB to 37 MB.
- ITLB and DTLB page walks were reduced by 28% and 10%, respectively.
- L1 and L2 cache misses were reduced by 2% and 6.5%, respectively.
- Front-side bus traffic was reduced by approximately 4%.

The database transaction throughput achieved using small pages with shared pagetables (with bufferpools enlarged) was within 3% of the transaction throughput achieved using hugepages without shared pagetables. Thus shared pagetables provided nearly all the benefit of hugepages, without having to deal with the limitations of hugepages. We believe this would be a significant benefit to customers running these types of workloads.

We also measured the benefit of shared pagetables on our larger setups. On our 4-way x86-64 setup with 64 GB memory, using small pages for the bufferpools, shared pagetables provided a 33% increase in transaction throughput. Using hugepages for the bufferpools, shared pagetables provided a 3% increase. Performance with small pages and shared pagetables was within 4% of the performance using hugepages without shared pagetables.

On our ppc64 setups we used both Oracle and DB2 to evaluate the benefit of shared pagetables. When database bufferpools were in small pages, shared pagetables provided an increase in database transaction throughput in the range of 60-65%, while in the hugepage case the improvement was up to 2.4%.

We thank Kshitij Doshi and Ken Chen from Intel for their assistance in analyzing the x86-64 data.

Cheers, Brian -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* RE: [PATCH 0/2][RFC] New version of shared page tables 2006-05-05 19:25 ` Brian Twichell @ 2006-05-06 3:37 ` Chen, Kenneth W 0 siblings, 0 replies; 16+ messages in thread From: Chen, Kenneth W @ 2006-05-06 3:37 UTC (permalink / raw) To: 'Brian Twichell', Dave McCracken Cc: Hugh Dickins, Linux Memory Management, Linux Kernel, slpratt Brian Twichell wrote on Friday, May 05, 2006 12:26 PM > We also measured the benefit of shared pagetables on our larger setups. > On our 4-way x86-64 setup with 64 GB memory, using small pages for the > bufferpools, shared pagetables provided a 33% increase in transaction > throughput. Using hugepages for the bufferpools, shared pagetables > provided a 3% increase. Performance with small pages and shared > pagetables was within 4% of the performance using hugepages without > shared pagetables. > > On our ppc64 setups we used both Oracle and DB2 to evaluate the benefit > of shared pagetables. When database bufferpools were in small pages, > shared pagetables provided an increase in database transaction > throughput in the range of 60-65%, while in the hugepage case the > improvement was up to 2.4%. I would also like to add that I have run this set of patches on ia64 and observed similar performance upside. We have multiple data points showing that this feature benefits several architectures. I'm advocating for the upstream inclusion. - Ken -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2006-05-22 18:00 UTC | newest] Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2006-05-03 15:43 [PATCH 0/2][RFC] New version of shared page tables Dave McCracken 2006-05-03 15:56 ` Hugh Dickins 2006-05-03 16:06 ` Dave McCracken 2006-05-06 15:25 ` Hugh Dickins 2006-05-08 19:32 ` Ray Bryant 2006-05-16 21:09 ` Dave McCracken 2006-05-19 16:55 ` Ray Bryant 2006-05-22 18:00 ` Ray Bryant 2006-05-08 19:49 ` Brian Twichell 2006-05-09 3:42 ` Nick Piggin 2006-05-10 2:07 ` Chen, Kenneth W 2006-05-10 19:45 ` Brian Twichell 2006-05-12 5:17 ` Nick Piggin 2006-05-09 19:22 ` Hugh Dickins 2006-05-05 19:25 ` Brian Twichell 2006-05-06 3:37 ` Chen, Kenneth W