* Re: [RFT][PATCH 0/2] pagefault scalability alternative
@ 2005-08-24 14:27 linux
2005-08-24 15:21 ` Hugh Dickins
0 siblings, 1 reply; 12+ messages in thread
From: linux @ 2005-08-24 14:27 UTC (permalink / raw)
To: clameter; +Cc: linux-mm
> Atomicity can be guaranteed to some degree by using the present bit.
> For an update, the present bit is first switched off. When the new
> value is written, the piece of the entry that does not contain the
> present bit is written first, so the entry stays "not present"; the
> word with the present bit is written last.

Er... no. That would work if reads were atomic but writes weren't, but
consider the following:
Reader                          Writer
Read first half
                                Write not-present bit
                                Write other half
                                Write present bit
Read second half

Voila, mismatched halves.
Unless you can give a guarantee on relative rates of progress, this
can't be made to work.
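
In code, the quoted protocol and such a lockless reader amount to
something like this sketch (made-up helper names, not code from any
patch; assumes a PAE-style two-word pte with the present bit in the
low word):

    /* Writer: runs under the page table lock. */
    static void pte_update_protocol(u32 *pte, u32 new_lo, u32 new_hi)
    {
            pte[0] = 0;             /* clear the word with the present bit */
            smp_wmb();
            pte[1] = new_hi;        /* entry still "not present" */
            smp_wmb();
            pte[0] = new_lo;        /* present bit goes live last */
    }

    /* Reader: no lock, two separate 32-bit loads. */
    static u64 pte_read_naive(const u32 *pte)
    {
            u32 lo = pte[0];        /* may pre-date a concurrent update... */
            u32 hi = pte[1];        /* ...while this half post-dates it */
            return ((u64)hi << 32) | lo;    /* possibly mismatched halves */
    }

The interleaving above is exactly the reader's two loads straddling
the writer's three stores: the assembled 64-bit value never existed
in the page table.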
The first obvious fix is to read the first half a second time and make
sure it matches, retrying if not. The idea being that if the PTE changed
from AB to AC, you might not notice the change, but it wouldn't matter,
either. But that can fail, too, in sufficiently contrived circumstances:
Reader                          Writer
Read first half
                                Write not-present bit
                                Write other half
                                Write present bit
Read second half
                                Write not-present bit
                                Write other half
                                Write present bit
Read first half

If it changed from AB -> CD -> AE, you could read AD and not notice the
problem.
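
That read-twice fix, in the same sketch form (again hypothetical, not
code from any patch):

    /* Reader with recheck: still broken under AB -> CD -> AE. */
    static u64 pte_read_retry(const u32 *pte)
    {
            u32 lo, hi, lo2;

            do {
                    lo = pte[0];            /* first half */
                    smp_rmb();
                    hi = pte[1];            /* second half */
                    smp_rmb();
                    lo2 = pte[0];           /* re-read first half */
            } while (lo != lo2);            /* retry if it moved */
            /*
             * If the writer went AB -> CD -> AE between our loads,
             * lo == lo2 == A but hi == D: the check passes, and we
             * return AD, a value that never existed.
             */
            return ((u64)hi << 32) | lo;
    }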
And remember that relative rates in SMP systems are *usually* matched,
but if you depend for correctness on a requirement that there be no
interrupts, no NMI, no SMM, no I-cache miss, no I-cache parity error that
triggered a re-fetch, no single-bit ECC error that triggered scrubbing,
etc., then you're really tightly constraining the rest of the system.
Modern processors do all kinds of strange low-probability exception
handling in order to speed up the common case.
* Re: [RFT][PATCH 0/2] pagefault scalability alternative
  2005-08-24 14:27 [RFT][PATCH 0/2] pagefault scalability alternative linux
@ 2005-08-24 15:21 ` Hugh Dickins
  0 siblings, 0 replies; 12+ messages in thread
From: Hugh Dickins @ 2005-08-24 15:21 UTC (permalink / raw)
  To: linux; +Cc: clameter, linux-mm

On Wed, 24 Aug 2005 linux@horizon.com wrote:
> > Atomicity can be guaranteed to some degree by using the present bit.
> > For an update, the present bit is first switched off. When the new
> > value is written, the piece of the entry that does not contain the
> > present bit is written first, so the entry stays "not present"; the
> > word with the present bit is written last.
>
> Er... no. That would work if reads were atomic but writes weren't, but
> consider the following:
>
> Reader                          Writer
> Read first half
>                                 Write not-present bit
>                                 Write other half
>                                 Write present bit
> Read second half
>
> Voila, mismatched halves.

True. But not an issue for the patch under discussion.

In the case of the pt entries, all the writes are done within ptlock,
and any reads done outside of ptlock (to choose which fault handler)
are rechecked within ptlock before making any critical decision (in
the PAE case, which might have mismatched halves).

In the case of the pmd entries, a transition from present to not
present is only made in free_pgtables (either while mmap_sem is held
exclusively, or when the mm no longer has users), after unlinking from
the prio_tree and anon_vma list by which kswapd might have got to them
without mmap_sem (the unlinking taking the necessary locks). And the
pfn is never changed while present.

Hugh
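
The recheck pattern Hugh describes - sample the pte without ptlock,
then confirm it under ptlock before doing anything critical - is
roughly the following (a sketch of the idea with hypothetical helpers,
not the actual patch code):

    static int fault_sketch(struct mm_struct *mm, struct vm_area_struct *vma,
                            unsigned long address, pte_t *ptep, pte_t entry)
    {
            /* "entry" was sampled without ptlock: on PAE it may even be
             * two mismatched halves, so treat it purely as a hint. */
            struct page *page = fault_prepare_page(vma, address, entry);
                                    /* hypothetical: allocate a new page
                                     * or look up a cache page, nothing
                                     * irrevocable yet */

            spin_lock(&mm->page_table_lock);
            if (!pte_same(*ptep, entry)) {
                    /* Raced with an update (or read torn halves):
                     * back out; nothing irrevocable was done. */
                    spin_unlock(&mm->page_table_lock);
                    page_cache_release(page);
                    return VM_FAULT_MINOR;
            }
            set_pte_at(mm, address, ptep, mk_pte(page, vma->vm_page_prot));
            spin_unlock(&mm->page_table_lock);
            return VM_FAULT_MINOR;
    }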
* [RFT][PATCH 0/2] pagefault scalability alternative
@ 2005-08-22 21:27 Hugh Dickins
  2005-08-22 22:29 ` Christoph Lameter
  0 siblings, 1 reply; 12+ messages in thread
From: Hugh Dickins @ 2005-08-22 21:27 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Nick Piggin, Linus Torvalds, Andrew Morton, linux-mm

Here's my alternative to Christoph's pagefault scalability patches:
no pte xchging, just narrowing the scope of the page_table_lock and
(if CONFIG_SPLIT_PTLOCK=y when SMP) splitting it up per page table.

Currently only supports i386 (PAE or not), x86_64 and ia64 (the latter
unbuilt and untested so far). The rest ought not to build (removed an
arg from pte_alloc_kernel). I'll take a look through the other arches:
most should be easy, a few (e.g. the sparcs) need more care.

(What I've done for oprofile backtrace is probably not quite right,
but I think in the right direction: can no longer lock out swapout
with page_table_lock, should just try to copy atomically - I'm hoping
someone can help me out there to get it right.)

Certainly not to be considered for merging into -mm yet: contains
various tangential mods (e.g. mremap move speedup) which should be
split off into separate patches for description, review and merge.

I do expect we shall want to merge the narrowing of page_table_lock in
due course - unless you find it's broken. Whether we shall want the
ptlock splitting, whether with or without anonymous pte xchging,
depends on how they all perform.

Presented as a Request For Testing - any chance, Christoph, that you
could get someone to run it up on SGI's ia64 512-ways, to compare
against the vanilla 2.6.13-rc6-mm1 including your patches? Thanks!

(The rss counting in this patch matches how it was in -rc6-mm1. Later
I'll want to look at the rss delta mechanism and integrate that in -
the narrowing won't want it, but the splitting would. If you think
we'd get fairer test numbers by temporarily suppressing rss counting
in each version, please do so.)

Diffstat below is against 2.6.13-rc6-mm1 minus Christoph's version.
No disrespect intended - but it's a bit easier to see what this one is
up to if diffed against the simpler base. I'll send the removal of
page-fault-patches from -rc6-mm1 as 1/2, then mine as 2/2.

Hugh

 arch/i386/kernel/vm86.c        |   17 -
 arch/i386/mm/ioremap.c         |    4
 arch/i386/mm/pgtable.c         |   51 +++
 arch/i386/oprofile/backtrace.c |   42 +-
 arch/ia64/mm/init.c            |   11
 arch/x86_64/mm/ioremap.c       |    4
 fs/exec.c                      |   14
 fs/hugetlbfs/inode.c           |    4
 fs/proc/task_mmu.c             |   19 -
 include/asm-generic/tlb.h      |    4
 include/asm-i386/pgalloc.h     |   11
 include/asm-i386/pgtable.h     |   14
 include/asm-ia64/pgalloc.h     |   13
 include/asm-x86_64/pgalloc.h   |   24 -
 include/linux/hugetlb.h        |    2
 include/linux/mm.h             |   73 ++++-
 include/linux/rmap.h           |    3
 include/linux/sched.h          |   30 ++
 kernel/fork.c                  |   19 -
 kernel/futex.c                 |    6
 mm/Kconfig                     |   16 +
 mm/filemap_xip.c               |   14
 mm/fremap.c                    |   53 +--
 mm/hugetlb.c                   |   33 +-
 mm/memory.c                    |  578 ++++++++++++++++++-----------------
 mm/mempolicy.c                 |    7
 mm/mmap.c                      |   85 ++----
 mm/mprotect.c                  |    7
 mm/mremap.c                    |  169 +++------
 mm/msync.c                     |   49 +--
 mm/rmap.c                      |  115 ++++---
 mm/swap_state.c                |    3
 mm/swapfile.c                  |   20 -
 mm/vmalloc.c                   |    4
 34 files changed, 740 insertions(+), 778 deletions(-)
* Re: [RFT][PATCH 0/2] pagefault scalability alternative
  2005-08-22 21:27 Hugh Dickins
@ 2005-08-22 22:29 ` Christoph Lameter
  2005-08-23  0:32   ` Nick Piggin
  2005-08-23  8:14   ` Hugh Dickins
  0 siblings, 2 replies; 12+ messages in thread
From: Christoph Lameter @ 2005-08-22 22:29 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Nick Piggin, Linus Torvalds, Andrew Morton, linux-mm

On Mon, 22 Aug 2005, Hugh Dickins wrote:

> Here's my alternative to Christoph's pagefault scalability patches:
> no pte xchging, just narrowing the scope of the page_table_lock and
> (if CONFIG_SPLIT_PTLOCK=y when SMP) splitting it up per page table.

The basic idea is to have a spinlock per page table entry it seems.
I think that is a good idea since it avoids atomic operations, and I
hope it will bring the same performance as my patches (seems that the
page_table_lock can now be cached on the node where the fault is
happening). However, these are very extensive changes to the vm.

The vm code in various places expects the page table lock to lock the
complete page table. How do the page-based ptls and the real ptl
interact?

There are these various hackish things in there that will hopefully be
taken care of. F.e. there really should be a spinlock_t ptl in the
struct page. spinlock_t is often much bigger than an unsigned long.

The patch generally drops the first acquisition of the page table lock
from handle_mm_fault that is used to protect the read operations on
the page table. I doubt that this works with i386 PAE since the page
table read operations are not protected by the ptl. These are 64 bit
which cannot be reliably retrieved in a 32-bit operation on i386 as
you pointed out last fall. There may be concurrent writes so that one
gets two pieces that do not fit. PAE mode either needs to fall back to
take the page_table_lock for reads or use some tricks to guarantee
64bit atomicity.

I have various bad feelings about some elements but I like the general
direction.

> Certainly not to be considered for merging into -mm yet: contains
> various tangential mods (e.g. mremap move speedup) which should be
> split off into separate patches for description, review and merge.

Could you modularize these patches? It's difficult to review as one.
Maybe separate the narrowing and the splitting and the miscellaneous
things?

> Presented as a Request For Testing - any chance, Christoph, that you
> could get someone to run it up on SGI's ia64 512-ways, to compare
> against the vanilla 2.6.13-rc6-mm1 including your patches? Thanks!

Compiles and boots fine on ia64. Survives my benchmark on a smaller
box. Numbers and more details will follow later. It takes some time to
get a bigger iron.
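
The "spinlock_t ptl in the struct page" point refers to overlaying the
per-page-table lock on an existing field of that page table's struct
page. A sketch of the shape of it (field and function names made up
here; the fragility Christoph notes is that debug spinlocks outgrow an
unsigned long):

    /* One lock per page-table page, squeezed into its struct page. */
    #define page_ptl(page)  ((spinlock_t *)&(page)->private)

    static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
    {
    #ifdef CONFIG_SPLIT_PTLOCK
            return page_ptl(pmd_page(*pmd));        /* split: per pt page */
    #else
            return &mm->page_table_lock;            /* unsplit: one lock */
    #endif
    }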
* Re: [RFT][PATCH 0/2] pagefault scalability alternative
  2005-08-22 22:29 ` Christoph Lameter
@ 2005-08-23  0:32   ` Nick Piggin
  2005-08-23  7:04     ` Hugh Dickins
  0 siblings, 1 reply; 12+ messages in thread
From: Nick Piggin @ 2005-08-23 0:32 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, linux-mm

Christoph Lameter wrote:

> The patch generally drops the first acquisition of the page table lock
> from handle_mm_fault that is used to protect the read operations on
> the page table. I doubt that this works with i386 PAE since the page
> table read operations are not protected by the ptl. These are 64 bit
> which cannot be reliably retrieved in a 32-bit operation on i386 as
> you pointed out last fall. There may be concurrent writes so that one
> gets two pieces that do not fit. PAE mode either needs to fall back to
> take the page_table_lock for reads or use some tricks to guarantee
> 64bit atomicity.

Oh yes, you need 64-bit atomic reads and writes for that. We actually
did see that load in handle_pte_fault being cut in half by a store.

I wouldn't be too worried about that though, as it's only for PAE.

--
SUSE Labs, Novell Inc.
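
For reference, if one did want an atomic 64-bit pte read on i386 PAE,
cmpxchg8b can provide it, since a compare-exchange always returns the
old value it found atomically. A sketch, assuming a cmpxchg64-style
helper built on cmpxchg8b (the locked bus cycle makes this costly on
every fault, which is part of why the patches try to avoid needing
it):

    static inline u64 pte_read_atomic(u64 *ptep)
    {
            /* Compare-and-exchange with guess 0: if *ptep is 0 we
             * store 0 back (a no-op); either way the return value is
             * an atomic 8-byte snapshot of the entry. */
            return cmpxchg64(ptep, 0ULL, 0ULL);
    }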
* Re: [RFT][PATCH 0/2] pagefault scalability alternative
  2005-08-23  0:32 ` Nick Piggin
@ 2005-08-23  7:04   ` Hugh Dickins
  0 siblings, 0 replies; 12+ messages in thread
From: Hugh Dickins @ 2005-08-23 7:04 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Christoph Lameter, Linus Torvalds, Andrew Morton, linux-mm

On Tue, 23 Aug 2005, Nick Piggin wrote:
> Christoph Lameter wrote:
>
> > The patch generally drops the first acquisition of the page table lock
> > from handle_mm_fault that is used to protect the read operations on the
> > page table. I doubt that this works with i386 PAE since the page table
> > read operations are not protected by the ptl. These are 64 bit which
> > cannot be reliably retrieved in a 32-bit operation on i386 as you
> > pointed out last fall. There may be concurrent writes so that one gets
> > two pieces that do not fit. PAE mode either needs to fall back to take
> > the page_table_lock for reads or use some tricks to guarantee 64bit
> > atomicity.
>
> Oh yes, you need 64-bit atomic reads and writes for that.

I don't believe we do. Let me expand on that in my reply to Christoph.

Hugh
* Re: [RFT][PATCH 0/2] pagefault scalability alternative
  2005-08-22 22:29 ` Christoph Lameter
  2005-08-23  0:32   ` Nick Piggin
@ 2005-08-23  8:14   ` Hugh Dickins
  2005-08-23 10:03     ` Nick Piggin
  2005-08-23 16:30     ` Christoph Lameter
  1 sibling, 2 replies; 12+ messages in thread
From: Hugh Dickins @ 2005-08-23 8:14 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Nick Piggin, Linus Torvalds, Andrew Morton, linux-mm

Thanks a lot for looking into it so quickly, Christoph. Sorry for
giving you the work of deciphering it with so little description.

On Mon, 22 Aug 2005, Christoph Lameter wrote:
> On Mon, 22 Aug 2005, Hugh Dickins wrote:
>
> > Here's my alternative to Christoph's pagefault scalability patches:
> > no pte xchging, just narrowing the scope of the page_table_lock and
> > (if CONFIG_SPLIT_PTLOCK=y when SMP) splitting it up per page table.
>
> The basic idea is to have a spinlock per page table entry it seems.

A spinlock per page table, not a spinlock per page table entry.
That's the split ptlock Y part, but most of the patch is just moving
the taking and releasing of the lock inwards, good whether or not we
split.

> I think that is a good idea since it avoids atomic operations, and I
> hope it will bring the same performance as my patches (seems that the
> page_table_lock can now be cached on the node where the fault is
> happening). However, these are very extensive changes to the vm.

Maybe not push it for 2.6.13 ;-?

> The vm code in various places expects the page table lock to lock the
> complete page table. How do the page-based ptls and the real ptl
> interact?

If split ptlock N, they're one and the same - though that doesn't mean
there are no issues raised, e.g. zap drops the lock at the end of the
page table: are all arches' tlb mmu_gather operations happy with that?
I have more checking to do there.

If split ptlock Y, then the mm->page_table_lock (which could be
renamed) doesn't do much more than guard page table and anon_vma
allocation, and a few other odds and ends. All the interesting load
falls on the per-pt lock. So long as arches don't have special code
involving page_table_lock, that change shouldn't matter to them; but a
few do (e.g. sparc64) and need checking/conversion.

> There are these various hackish things in there that will hopefully be
> taken care of. F.e. there really should be a spinlock_t ptl in the
> struct page. spinlock_t is often much bigger than an unsigned long.

Yes, see my reply to Nick: I believe it's okay for now, even with
debug options, but fragile. If it stays, it needs robustification.

> The patch generally drops the first acquisition of the page table lock
> from handle_mm_fault that is used to protect the read operations on
> the page table. I doubt that this works with i386 PAE since the page
> table read operations are not protected by the ptl. These are 64 bit
> which cannot be reliably retrieved in a 32-bit operation on i386 as
> you pointed out last fall. There may be concurrent writes so that one
> gets two pieces that do not fit. PAE mode either needs to fall back to
> take the page_table_lock for reads or use some tricks to guarantee
> 64bit atomicity.

Yes, you referred to that "futility" in mail a few days ago: sorry if
it seemed like I was ignoring you. I did embark upon a reply, but in
the course of that reply decided that I needed to spend the time
getting the patch right, then explain it after.

I've memories of that too. Spent a while looking through my sent
mail - very spooky.
It was probably this concluding remark from 12 Dec 04:

> > Oh, hold on, isn't handle_mm_fault's pmd without page_table_lock
> > similarly racy, in both the 64-on-32 cases, and on architectures
> > which have a more complex pmd_t (sparc, m68k, h8300)? Sigh.

The list is frv, h8300, i386 PAE, m68k, m68knommu, sparc, uml
3level32. Needn't worry about h8300 and m68knommu because they're
NOMMU. Needn't worry about frv and m68k since they're neither SMP nor
PREEMPT (I haven't deciphered frv here, wonder if it's just been
defined the other way round from the other architectures). UML would
follow what's decided for the others.

So the problem ones are i386 PAE and sparc: I haven't got down to
sparc yet, I expect it to need a little reordering and barriers, but
no great problem.

I don't believe we need to read or write the PAE entries atomically.

When writing we certainly need the ptlock, and we certainly need
correct ordering (there's already a wmb between writing the top half
and writing the bottom); oh, and yes, ptep_establish for rewriting
existing entries does need the atomicity it already has (I think: I'm
writing this reply in a rush, not cross-checking every word).

But the reading. In particular, that "entry = *pte" in
handle_pte_fault. I believe that's fine, provided that the do_..._page
handlers are necessarily sceptical about the entry they're passed.
They're quite free to do things like allocate a new page, or look up a
cache page, without checking, so long as they recheck entry under
ptlock before proceeding further, as they already did. But they must
not do anything irrevocable, anything that might issue an error
message to the logs, if the entry they're passed is actually a
mismatch of two halves. I believe I've already put in the necessary
code for that, e.g. the sizeof(pte_t) checks.

Another aspect is peeking at (in particular) *pmd without any lock:
that too might give mismatched halves and nonsense; that's what
alarmed me in my mail last December.

After dealing with the really hard issues (how to get the definitions
and inlines into the header files without crashing the HIGHPTE build)
yesterday, I spent several hours ruminating again on that *pmd issue,
holding off from making a hundred edits; and in the end added just an
unsigned long cast into the i386 definition of pmd_none. We must avoid
basing decisions on two mismatched halves; but pmd_present is already
safe, and now pmd_none also. The remaining races are benign.

What do you think?

> I have various bad feelings about some elements but I like the general
> direction.

Great (except for the bad feelings!).

> > Certainly not to be considered for merging into -mm yet: contains
> > various tangential mods (e.g. mremap move speedup) which should be
> > split off into separate patches for description, review and merge.
>
> Could you modularize these patches? It's difficult to review as one.
> Maybe separate the narrowing and the splitting and the miscellaneous
> things?

Of course I must. This wasn't sent for review (though your review is
much appreciated), just as something to try out to see if worth
pursuing. A suite of 39 seemed more hindrance than help at this stage.
(You may well feel a little review is in order before putting strange
patches on your special machines!)

The first sub-patches I post should be for some of the very tangential
things, tidyups that could safely go forward to 2.6.14 (perhaps).
Hopefully merging those would reduce the diff somewhat - though it'll
certainly need helpful subdivision and description beyond that.
> > Presented as a Request For Testing - any chance, Christoph, that you
> > could get someone to run it up on SGI's ia64 512-ways, to compare
> > against the vanilla 2.6.13-rc6-mm1 including your patches? Thanks!
>
> Compiles and boots fine on ia64. Survives my benchmark on a smaller
> box. Numbers and more details will follow later. It takes some time to
> get a bigger iron.

Thanks again for such prompt feedback,
Hugh
* Re: [RFT][PATCH 0/2] pagefault scalability alternative
  2005-08-23  8:14 ` Hugh Dickins
@ 2005-08-23 10:03   ` Nick Piggin
  0 siblings, 0 replies; 12+ messages in thread
From: Nick Piggin @ 2005-08-23 10:03 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Christoph Lameter, Linus Torvalds, Andrew Morton, linux-mm

Hugh Dickins wrote:

> So the problem ones are i386 PAE and sparc: I haven't got down to
> sparc yet, I expect it to need a little reordering and barriers, but
> no great problem.

I don't think that case is a problem, because I don't think we ever
allocate or free pmd entries due to some CPU errata. That is, unless
something has changed very recently.

> I don't believe we need to read or write the PAE entries atomically.

Hmm, OK. I didn't see the trickery you were doing in do_swap_page and
do_file_page etc. So actually, that seems OK. Wrapping it in a helper
function might be nice (the recheck-under-lock for sizeof(pte_t) >
sizeof(long), that is).

--
SUSE Labs, Novell Inc.
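
The helper Nick suggests might look something like this (a
hypothetical sketch: take the lock, confirm the unlocked sample, and
keep the lock only on success):

    /*
     * Lock the ptl and confirm that a pte sampled without it is still
     * what we saw; on i386 PAE the unlocked sample may even have been
     * two mismatched halves. Returns with the ptl held if the sample
     * is confirmed, unlocked if the caller must back out and retry.
     */
    static inline int pte_lock_and_confirm(spinlock_t *ptl,
                                           pte_t *ptep, pte_t sampled)
    {
            spin_lock(ptl);
            if (pte_same(*ptep, sampled))
                    return 1;               /* proceed, ptl held */
            spin_unlock(ptl);
            return 0;                       /* raced: back out */
    }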
* Re: [RFT][PATCH 0/2] pagefault scalability alternative
  2005-08-23  8:14 ` Hugh Dickins
  2005-08-23 10:03   ` Nick Piggin
@ 2005-08-23 16:30   ` Christoph Lameter
  2005-08-23 16:43     ` Martin J. Bligh
                       ` (2 more replies)
  1 sibling, 3 replies; 12+ messages in thread
From: Christoph Lameter @ 2005-08-23 16:30 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Nick Piggin, Linus Torvalds, Andrew Morton, linux-mm

On Tue, 23 Aug 2005, Hugh Dickins wrote:

> > The basic idea is to have a spinlock per page table entry it seems.
> A spinlock per page table, not a spinlock per page table entry.

That's a spinlock per pmd? Calling it "per page table" is a bit
confusing, since "page table" may refer to the whole tree. Could you
develop a clearer way of referring to these locks that is not
page_table_lock or ptl?

> After dealing with the really hard issues (how to get the definitions
> and inlines into the header files without crashing the HIGHPTE build)
> yesterday, I spent several hours ruminating again on that *pmd issue,
> holding off from making a hundred edits; and in the end added just an
> unsigned long cast into the i386 definition of pmd_none. We must avoid
> basing decisions on two mismatched halves; but pmd_present is already
> safe, and now pmd_none also. The remaining races are benign.
>
> What do you think?

Atomicity can be guaranteed to some degree by using the present bit.
For an update, the present bit is first switched off. When the new
value is written, the piece of the entry that does not contain the
present bit is written first, so the entry stays "not present"; the
word with the present bit is written last.

This means that if any p?d entry is found without the present bit set,
a lock must be taken and the entry reread to get a consistent value.

Here are the results of the performance test. In summary, these show
that the performance of both our approaches is equivalent. I would
prefer your patches over mine, since they have a broader scope and may
accelerate other aspects of vm operations.

Note that these tests need to be taken with some caution. Results are
topology dependent, and it's just one special case (allocating new
pages in do_anon_page) that is measured. Results are somewhat skewed
if the amount of memory per task (mem/threads) becomes too small, so
that not enough time is spent in concurrent page faulting.

We only scale well up to 32 processors. Beyond that performance is
still dropping, and there is severe contention at 60. This is still
better than experiencing this drop at 4 processors (2.6.13), but not
all that we are after. This performance pattern is typical for only
dealing with the page_table_lock.

I tried the delta patches, which increase performance somewhat more,
but I do not get the performance results in the very high range that I
saw last year. Either something is wrong with the delta patches or
there is another issue these days that limits performance. I still
have to figure out what is going on there. I may know more after I
test on a machine with more processors.

Two samples, each allocating 1, 4, 8, 16 GB with 1-60 processors.

1.
2.6.13-rc6-mm1

Gb Rep Thr CLine User System Wall flt/cpu/s fault/wsec
1 3 1 1 0.06s 2.08s 2.01s 91530.726 91686.487
1 3 2 1 0.04s 2.40s 1.03s 80313.725 148253.347
1 3 4 1 0.04s 2.48s 0.07s 78019.048 247860.666
1 3 8 1 0.04s 2.76s 0.05s 70217.143 336562.559
1 3 16 1 0.07s 4.37s 0.05s 44201.439 332361.815
1 3 32 1 5.94s 10.92s 1.00s 11650.154 180992.401
1 3 60 1 42.57s 21.80s 2.02s 3054.057 89132.235

Gb Rep Thr CLine User System Wall flt/cpu/s fault/wsec
4 3 1 1 0.13s 8.28s 8.04s 93356.125 93399.776
4 3 2 1 0.13s 9.44s 5.01s 82091.023 152346.188
4 3 4 1 0.12s 9.80s 3.00s 79245.466 256976.998
4 3 8 1 0.17s 10.54s 2.00s 73361.194 383107.125
4 3 16 1 0.16s 17.06s 1.09s 45637.883 404563.542
4 3 32 1 4.27s 42.62s 2.06s 16768.273 294151.260
4 3 60 1 40.02s 110.99s 4.04s 5207.607 177074.387

Gb Rep Thr CLine User System Wall flt/cpu/s fault/wsec
8 3 1 1 0.32s 16.84s 17.01s 91637.381 91636.318
8 3 2 1 0.32s 18.80s 10.02s 82228.356 153285.701
8 3 4 1 0.30s 19.45s 6.00s 79630.620 261810.203
8 3 8 1 0.34s 20.94s 4.00s 73885.006 391636.418
8 3 16 1 0.42s 34.06s 3.07s 45600.835 417784.690
8 3 32 1 9.57s 87.58s 5.01s 16188.390 303679.562
8 3 60 1 37.34s 246.24s 7.07s 5546.221 202734.992

Gb Rep Thr CLine User System Wall flt/cpu/s fault/wsec
16 3 1 1 0.64s 40.12s 40.07s 77161.695 77175.960
16 3 2 1 0.64s 38.24s 20.08s 80891.998 151015.426
16 3 4 1 0.67s 38.75s 11.09s 79784.113 263917.598
16 3 8 1 0.62s 41.82s 7.08s 74107.802 399410.789
16 3 16 1 0.61s 67.76s 7.03s 46003.627 429354.596
16 3 32 1 8.76s 173.04s 9.04s 17302.854 333248.692
16 3 60 1 32.76s 466.27s 13.03s 6303.609 235490.831

Gb Rep Thr CLine User System Wall flt/cpu/s fault/wsec
1 3 1 1 0.03s 2.08s 2.01s 92739.623 92765.448
1 3 2 1 0.02s 2.38s 1.03s 81647.841 150542.942
1 3 4 1 0.06s 2.46s 0.07s 77649.289 247254.017
1 3 8 1 0.05s 2.75s 0.05s 70017.094 346483.976
1 3 16 1 0.06s 4.39s 0.06s 44161.725 313310.777
1 3 32 1 9.02s 11.20s 1.02s 9717.675 162578.985
1 3 60 1 28.92s 29.71s 2.01s 3353.254 93278.693

Gb Rep Thr CLine User System Wall flt/cpu/s fault/wsec
4 3 1 1 0.16s 8.20s 8.03s 93935.977 93937.837
4 3 2 1 0.19s 9.33s 5.01s 82539.043 153124.158
4 3 4 1 0.22s 9.70s 3.00s 79213.537 257326.049
4 3 8 1 0.23s 10.48s 2.00s 73361.194 383192.157
4 3 16 1 0.22s 16.97s 1.09s 45722.791 405459.259
4 3 32 1 4.67s 43.56s 2.06s 16301.136 292609.111
4 3 60 1 21.01s 99.12s 4.00s 6546.181 193120.292

Gb Rep Thr CLine User System Wall flt/cpu/s fault/wsec
8 3 1 1 0.28s 16.77s 17.00s 92196.014 92248.241
8 3 2 1 0.36s 18.64s 10.02s 82747.475 154108.957
8 3 4 1 0.36s 19.39s 6.00s 79598.381 261456.810
8 3 8 1 0.31s 20.96s 4.00s 73898.891 392375.888
8 3 16 1 0.37s 34.28s 3.07s 45385.042 416385.059
8 3 32 1 8.75s 88.48s 5.01s 16175.737 303964.057
8 3 60 1 34.85s 213.80s 7.03s 6325.563 213671.451

Gb Rep Thr CLine User System Wall flt/cpu/s fault/wsec
16 3 1 1 0.63s 40.19s 40.08s 77040.752 77044.800
16 3 2 1 0.58s 38.34s 20.08s 80800.575 150916.421
16 3 4 1 0.71s 38.66s 11.09s 79873.248 264287.489
16 3 8 1 0.64s 41.91s 7.08s 73905.836 399511.701
16 3 16 1 0.64s 67.46s 7.03s 46187.350 430267.445
16 3 32 1 8.12s 171.97s 9.04s 17466.563 333665.446
16 3 60 1 28.56s 483.76s 13.03s 6140.067 235670.414

2.
2.6.13-rc6-mm1-hugh

Gb Rep Thr CLine User System Wall flt/cpu/s fault/wsec
1 3 1 1 0.02s 2.12s 2.01s 91530.726 91290.645
1 3 2 1 0.03s 2.40s 1.03s 80842.105 148577.211
1 3 4 1 0.04s 2.50s 0.07s 77161.695 246742.307
1 3 8 1 0.06s 2.74s 0.05s 69917.496 333526.774
1 3 16 1 0.04s 4.38s 0.05s 44321.010 329911.942
1 3 32 1 2.70s 11.10s 0.09s 14242.828 208238.299
1 3 60 1 10.39s 25.69s 1.06s 5448.016 122221.286

Gb Rep Thr CLine User System Wall flt/cpu/s fault/wsec
4 3 1 1 0.20s 8.24s 8.04s 93090.909 93102.689
4 3 2 1 0.17s 9.40s 5.01s 82125.313 152597.761
4 3 4 1 0.18s 9.76s 3.00s 78990.759 256940.071
4 3 8 1 0.14s 10.58s 2.00s 73361.194 383839.118
4 3 16 1 0.18s 17.34s 1.09s 44887.671 400584.413
4 3 32 1 3.03s 44.14s 2.06s 16670.171 296955.678
4 3 60 1 42.77s 124.64s 4.07s 4697.360 164568.728

Gb Rep Thr CLine User System Wall flt/cpu/s fault/wsec
8 3 1 1 0.41s 16.72s 17.01s 91787.115 91811.036
8 3 2 1 0.32s 18.75s 10.02s 82469.799 153767.106
8 3 4 1 0.31s 19.49s 6.00s 79405.493 260233.329
8 3 8 1 0.34s 21.00s 4.00s 73691.154 390162.630
8 3 16 1 0.33s 33.82s 3.07s 46054.814 420596.124
8 3 32 1 6.98s 87.06s 4.09s 16724.767 315990.572
8 3 60 1 39.50s 252.83s 7.06s 5380.182 204361.061

Gb Rep Thr CLine User System Wall flt/cpu/s fault/wsec
16 3 1 1 0.62s 40.28s 40.09s 76897.624 76894.371
16 3 2 1 0.73s 38.20s 20.08s 80775.678 150731.248
16 3 4 1 0.62s 38.86s 11.09s 79670.955 263253.128
16 3 8 1 0.67s 41.89s 7.09s 73891.948 398115.325
16 3 16 1 0.67s 68.00s 7.04s 45802.679 424756.786
16 3 32 1 8.13s 170.75s 9.06s 17584.902 325968.378
16 3 60 1 19.06s 443.08s 12.08s 6806.696 244372.117

Gb Rep Thr CLine User System Wall flt/cpu/s fault/wsec
1 3 1 1 0.04s 2.08s 2.01s 92390.977 92501.417
1 3 2 1 0.04s 2.38s 1.03s 81108.911 149261.811
1 3 4 1 0.04s 2.48s 0.07s 77895.404 245772.564
1 3 8 1 0.04s 2.71s 0.05s 71338.171 339935.008
1 3 16 1 0.08s 4.41s 0.06s 43690.667 321102.290
1 3 32 1 6.30s 10.29s 1.01s 11843.855 176712.623
1 3 60 1 31.78s 24.45s 2.00s 3496.372 95318.257

Gb Rep Thr CLine User System Wall flt/cpu/s fault/wsec
4 3 1 1 0.16s 8.23s 8.03s 93622.857 93678.202
4 3 2 1 0.17s 9.40s 5.01s 82125.313 152957.631
4 3 4 1 0.13s 9.81s 3.00s 79022.508 256991.607
4 3 8 1 0.16s 10.59s 2.00s 73142.857 383723.246
4 3 16 1 0.16s 17.08s 1.09s 45616.705 404165.286
4 3 32 1 4.28s 43.45s 2.07s 16471.850 283376.758
4 3 60 1 55.40s 115.75s 5.00s 4594.718 156683.131

Gb Rep Thr CLine User System Wall flt/cpu/s fault/wsec
8 3 1 1 0.32s 16.76s 17.00s 92044.944 92058.036
8 3 2 1 0.29s 18.68s 10.01s 82887.015 154397.006
8 3 4 1 0.33s 19.41s 6.00s 79662.885 262009.505
8 3 8 1 0.32s 20.91s 3.09s 74079.879 393444.882
8 3 16 1 0.32s 34.22s 3.07s 45521.649 417768.300
8 3 32 1 3.44s 85.73s 4.08s 17636.959 325891.128
8 3 60 1 56.83s 248.51s 8.02s 5150.986 191074.214

Gb Rep Thr CLine User System Wall flt/cpu/s fault/wsec
16 3 1 1 0.62s 40.08s 40.07s 77267.833 77269.273
16 3 2 1 0.67s 38.21s 20.08s 80891.998 151088.966
16 3 4 1 0.69s 38.68s 11.09s 79889.476 264299.054
16 3 8 1 0.65s 41.70s 7.08s 74261.756 400677.914
16 3 16 1 0.68s 68.20s 7.04s 45664.383 423956.953
16 3 32 1 4.05s 172.59s 9.03s 17808.292 338026.854
16 3 60 1 49.84s 458.57s 13.09s 6187.311 224887.539
* Re: [RFT][PATCH 0/2] pagefault scalability alternative
  2005-08-23 16:30 ` Christoph Lameter
@ 2005-08-23 16:43   ` Martin J. Bligh
  0 siblings, 0 replies; 12+ messages in thread
From: Martin J. Bligh @ 2005-08-23 16:43 UTC (permalink / raw)
  To: Christoph Lameter, Hugh Dickins
  Cc: Nick Piggin, Linus Torvalds, Andrew Morton, linux-mm

>> > The basic idea is to have a spinlock per page table entry it seems.
>> A spinlock per page table, not a spinlock per page table entry.
>
> That's a spinlock per pmd? Calling it "per page table" is a bit
> confusing, since "page table" may refer to the whole tree. Could you
> develop a clearer way of referring to these locks that is not
> page_table_lock or ptl?

Isn't that per pagetable page? Though maybe that makes less sense with
large pages.

M.
* Re: [RFT][PATCH 0/2] pagefault scalability alternative
  2005-08-23 16:30 ` Christoph Lameter
  2005-08-23 16:43   ` Martin J. Bligh
@ 2005-08-23 18:29   ` Hugh Dickins
  2005-08-27 22:10     ` Avi Kivity
  2 siblings, 0 replies; 12+ messages in thread
From: Hugh Dickins @ 2005-08-23 18:29 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Nick Piggin, Linus Torvalds, Andrew Morton, linux-mm

On Tue, 23 Aug 2005, Christoph Lameter wrote:
> On Tue, 23 Aug 2005, Hugh Dickins wrote:
>
> > > The basic idea is to have a spinlock per page table entry it seems.
> > A spinlock per page table, not a spinlock per page table entry.
>
> That's a spinlock per pmd? Calling it "per page table" is a bit
> confusing, since "page table" may refer to the whole tree. Could you
> develop a clearer way of referring to these locks that is not
> page_table_lock or ptl?

Sorry to confuse. I'm used to using "page table" for the leaf (or
rather, the twig above the pages themselves); it may be hard to
retrain myself now. Martin suggests "page table page", and that's fine
by me.

And, quite aside from getting confused between the different levels,
there's the confusion that C arrays introduce: when we write "pte",
sometimes we're thinking of a single page table entry, and sometimes
of the contiguous array of them, the "page table page".

You suggest I mean a spinlock per pmd: I'd say it's a spinlock per pmd
entry. Oh well, let the code speak for itself. Every PMD_SIZE bytes of
userspace gets its own spinlock (and "PMD_SIZE" argues for your
nomenclature, not mine).

> Atomicity can be guaranteed to some degree by using the present bit.
> For an update, the present bit is first switched off. When the new
> value is written, the piece of the entry that does not contain the
> present bit is written first, so the entry stays "not present"; the
> word with the present bit is written last.

Exactly. And many of the tests (e.g. in the _alloc functions) are
testing present, and need no change. But the p?d_none_or_clear_bad
tests would be in danger of advancing to the "bad" test, and getting
it wrong, if we assemble "none" from two parts: we need just to test
the word with the present bit in it. (But this would go wrong if we
did it for pte_none, I think, because the swap entry gets stored in
the upper part when PAE.)

> This means that if any p?d entry is found without the present bit set,
> a lock must be taken and the entry reread to get a consistent value.

That's indeed what happens in the _alloc functions, pmd_alloc etc. In
the others - the lookups which don't wish to allocate - they just give
up on seeing p?d_none, with no need to reread: if there's a race,
we're happy to lose it in those contexts.

> Here are the results of the performance test. In summary, these show
> that the performance of both our approaches is equivalent.

Many thanks for getting those done. Interesting: or perhaps not -
boringly similar! They fit with what I see on the 32-bit and 64-bit
2*HT*Xeons I have here: sometimes one does better, sometimes the
other. I'd rather been expecting the bigger machines to show some
advantage to your patch, yet not a decisive advantage. Perhaps that
will become apparent later, on the bigger machines neither can
presently scale to; or maybe I was just preparing myself for some
disappointment.

> I would prefer your patches over mine, since they have a broader scope
> and may accelerate other aspects of vm operations.

That is very generous of you, especially after all the effort you've
put into posting and reposting yours down the months: thank you.
But I do agree that mine covers more bases: no doubt similar tests on
the shared file pages would degenerate quickly from other contention
(e.g. the pagecache Nick is playing in), but we can expect that as
such issues get dealt with, the narrowed and split locking would give
the same advantage to those as to the anonymous.

I still fear that your pte xchging, with its parallel set of locking
rules, would tie our hands down the road, and might have to be backed
out later even if brought in. But it's certainly not ruled out: it
will be interesting to see if it gives any boost on top of the split.

I do consistently see a small advantage to CONFIG_SPLIT_PTLOCK N over
Y when not multithreading, and little visible advantage at 4. Not
particularly anxious to offer it as a config option to the user: I
wonder whether to tie it to CONFIG_NR_CPUS, enable split ptlock at 4
or at 8 (would need to get good regular test coverage of both).

By the way, a little detail in pft.c: I think there's one too many
zeroes where wall.tv_nsec is converted for printing - hence the
remarkable trend that Wall time is always X.0Ys. But that does not
invalidate our conclusions.

Hugh
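
The unsigned long cast Hugh mentions earlier in the thread - making
pmd_none judge only the word that carries the present bit - presumably
amounts to something like this (a sketch reconstructed from the
description, assuming the PAE layout with the present bit in the low
word; not the actual diff):

    /* PAE pmd_val() is 64-bit, read as two 32-bit halves.  The write
     * protocol clears the present-carrying low word first and sets it
     * last, so truncating to unsigned long tests only that word and
     * cannot be fooled by mismatched halves: */
    #define pmd_none(x)     (!(unsigned long)pmd_val(x))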
* Re: [RFT][PATCH 0/2] pagefault scalability alternative
  2005-08-23 16:30 ` Christoph Lameter
  2005-08-23 16:43   ` Martin J. Bligh
  2005-08-23 18:29   ` Hugh Dickins
@ 2005-08-27 22:10   ` Avi Kivity
  2 siblings, 0 replies; 12+ messages in thread
From: Avi Kivity @ 2005-08-27 22:10 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Hugh Dickins, Nick Piggin, Linus Torvalds, Andrew Morton, linux-mm

Christoph Lameter wrote:

>> A spinlock per page table, not a spinlock per page table entry.
>
> That's a spinlock per pmd? Calling it "per page table" is a bit
> confusing, since "page table" may refer to the whole tree. Could you
> develop a clearer way of referring to these locks that is not
> page_table_lock or ptl?

I've heard the term "paging element" used to describe the p*t's.

--
Do not meddle in the internals of kernels, for they are subtle and
quick to panic.