* Re: [RFT][PATCH 0/2] pagefault scalability alternative
@ 2005-08-24 14:27 linux
2005-08-24 15:21 ` Hugh Dickins
0 siblings, 1 reply; 12+ messages in thread
From: linux @ 2005-08-24 14:27 UTC (permalink / raw)
To: clameter; +Cc: linux-mm
> Atomicity can be guaranteed to some degree by using the present bit.
> For an update, the present bit is first switched off. When the new
> value is written, the piece of the entry that does not contain the
> present bit is written first, so the entry stays "not present"; the
> word with the present bit is written last.

Er... no. That would work if reads were atomic but writes weren't, but
consider the following:
Reader                          Writer
Read first half
                                Write not-present bit
                                Write other half
                                Write present bit
Read second half

Voila, mismatched halves.
Unless you can give a guarantee on relative rates of progress, this
can't be made to work.
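
In code, the quoted protocol and such a lockless reader amount to
something like this sketch (made-up helper names, not code from any
patch; assumes a PAE-style two-word pte with the present bit in the
low word):

    /* Writer: runs under the page table lock. */
    static void pte_update_protocol(u32 *pte, u32 new_lo, u32 new_hi)
    {
            pte[0] = 0;             /* clear the word with the present bit */
            smp_wmb();
            pte[1] = new_hi;        /* entry still "not present" */
            smp_wmb();
            pte[0] = new_lo;        /* present bit goes live last */
    }

    /* Reader: no lock, two separate 32-bit loads. */
    static u64 pte_read_naive(const u32 *pte)
    {
            u32 lo = pte[0];        /* may pre-date a concurrent update... */
            u32 hi = pte[1];        /* ...while this half post-dates it */
            return ((u64)hi << 32) | lo;    /* possibly mismatched halves */
    }

The interleaving above is exactly the reader's two loads straddling
the writer's three stores: the assembled 64-bit value never existed
in the page table.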
The first obvious fix is to read the first half a second time and make
sure it matches, retrying if not. The idea being that if the PTE changed
from AB to AC, you might not notice the change, but it wouldn't matter,
either. But that can fail, too, in sufficiently contrived circumstances:
Reader                          Writer
Read first half
                                Write not-present bit
                                Write other half
                                Write present bit
Read second half
                                Write not-present bit
                                Write other half
                                Write present bit
Read first half

If it changed from AB -> CD -> AE, you could read AD and not notice the
problem.
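
That read-twice fix, in the same sketch form (again hypothetical, not
code from any patch):

    /* Reader with recheck: still broken under AB -> CD -> AE. */
    static u64 pte_read_retry(const u32 *pte)
    {
            u32 lo, hi, lo2;

            do {
                    lo = pte[0];            /* first half */
                    smp_rmb();
                    hi = pte[1];            /* second half */
                    smp_rmb();
                    lo2 = pte[0];           /* re-read first half */
            } while (lo != lo2);            /* retry if it moved */
            /*
             * If the writer went AB -> CD -> AE between our loads,
             * lo == lo2 == A but hi == D: the check passes, and we
             * return AD, a value that never existed.
             */
            return ((u64)hi << 32) | lo;
    }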
And remember that relative rates in SMP systems are *usually* matched,
but if you depend for correctness on a requirement that there be no
interrupts, no NMI, no SMM, no I-cache miss, no I-cache parity error that
triggered a re-fetch, no single-bit ECC error that triggered scrubbing,
etc., then you're really tightly constraining the rest of the system.
Modern processors do all kinds of strange low-probability exception
handling in order to speed up the common case.
* Re: [RFT][PATCH 0/2] pagefault scalability alternative
  2005-08-24 14:27 [RFT][PATCH 0/2] pagefault scalability alternative linux
@ 2005-08-24 15:21 ` Hugh Dickins
  0 siblings, 0 replies; 12+ messages in thread
From: Hugh Dickins @ 2005-08-24 15:21 UTC (permalink / raw)
  To: linux; +Cc: clameter, linux-mm

On Wed, 24 Aug 2005 linux@horizon.com wrote:
> > Atomicity can be guaranteed to some degree by using the present bit.
> > For an update, the present bit is first switched off. When the new
> > value is written, the piece of the entry that does not contain the
> > present bit is written first, so the entry stays "not present"; the
> > word with the present bit is written last.
>
> Er... no. That would work if reads were atomic but writes weren't, but
> consider the following:
>
> Reader                          Writer
> Read first half
>                                 Write not-present bit
>                                 Write other half
>                                 Write present bit
> Read second half
>
> Voila, mismatched halves.

True. But not an issue for the patch under discussion.

In the case of the pt entries, all the writes are done within ptlock,
and any reads done outside of ptlock (to choose which fault handler)
are rechecked within ptlock before making any critical decision (in
the PAE case, which might have mismatched halves).

In the case of the pmd entries, a transition from present to not
present is only made in free_pgtables (either while mmap_sem is held
exclusively, or when the mm no longer has users), after unlinking from
the prio_tree and anon_vma list by which kswapd might have got to them
without mmap_sem (the unlinking taking the necessary locks). And the
pfn is never changed while present.

Hugh
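
The recheck pattern Hugh describes - sample the pte without ptlock,
then confirm it under ptlock before doing anything critical - is
roughly the following (a sketch of the idea with hypothetical helpers,
not the actual patch code):

    static int fault_sketch(struct mm_struct *mm, struct vm_area_struct *vma,
                            unsigned long address, pte_t *ptep, pte_t entry)
    {
            /* "entry" was sampled without ptlock: on PAE it may even be
             * two mismatched halves, so treat it purely as a hint. */
            struct page *page = fault_prepare_page(vma, address, entry);
                                    /* hypothetical: allocate a new page
                                     * or look up a cache page, nothing
                                     * irrevocable yet */

            spin_lock(&mm->page_table_lock);
            if (!pte_same(*ptep, entry)) {
                    /* Raced with an update (or read torn halves):
                     * back out; nothing irrevocable was done. */
                    spin_unlock(&mm->page_table_lock);
                    page_cache_release(page);
                    return VM_FAULT_MINOR;
            }
            set_pte_at(mm, address, ptep, mk_pte(page, vma->vm_page_prot));
            spin_unlock(&mm->page_table_lock);
            return VM_FAULT_MINOR;
    }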
* [RFT][PATCH 0/2] pagefault scalability alternative
@ 2005-08-22 21:27 Hugh Dickins
  2005-08-22 22:29 ` Christoph Lameter
  0 siblings, 1 reply; 12+ messages in thread
From: Hugh Dickins @ 2005-08-22 21:27 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Nick Piggin, Linus Torvalds, Andrew Morton, linux-mm

Here's my alternative to Christoph's pagefault scalability patches:
no pte xchging, just narrowing the scope of the page_table_lock and
(if CONFIG_SPLIT_PTLOCK=y when SMP) splitting it up per page table.

Currently only supports i386 (PAE or not), x86_64 and ia64 (the latter
unbuilt and untested so far). The rest ought not to build (removed an
arg from pte_alloc_kernel). I'll take a look through the other arches:
most should be easy, a few (e.g. the sparcs) need more care.

(What I've done for oprofile backtrace is probably not quite right,
but I think in the right direction: can no longer lock out swapout
with page_table_lock, should just try to copy atomically - I'm hoping
someone can help me out there to get it right.)

Certainly not to be considered for merging into -mm yet: contains
various tangential mods (e.g. mremap move speedup) which should be
split off into separate patches for description, review and merge.

I do expect we shall want to merge the narrowing of page_table_lock in
due course - unless you find it's broken. Whether we shall want the
ptlock splitting, whether with or without anonymous pte xchging,
depends on how they all perform.

Presented as a Request For Testing - any chance, Christoph, that you
could get someone to run it up on SGI's ia64 512-ways, to compare
against the vanilla 2.6.13-rc6-mm1 including your patches? Thanks!

(The rss counting in this patch matches how it was in -rc6-mm1. Later
I'll want to look at the rss delta mechanism and integrate that in -
the narrowing won't want it, but the splitting would. If you think
we'd get fairer test numbers by temporarily suppressing rss counting
in each version, please do so.)

Diffstat below is against 2.6.13-rc6-mm1 minus Christoph's version.
No disrespect intended - but it's a bit easier to see what this one is
up to if diffed against the simpler base. I'll send the removal of
page-fault-patches from -rc6-mm1 as 1/2, then mine as 2/2.

Hugh

 arch/i386/kernel/vm86.c        |   17 -
 arch/i386/mm/ioremap.c         |    4
 arch/i386/mm/pgtable.c         |   51 +++
 arch/i386/oprofile/backtrace.c |   42 +-
 arch/ia64/mm/init.c            |   11
 arch/x86_64/mm/ioremap.c       |    4
 fs/exec.c                      |   14
 fs/hugetlbfs/inode.c           |    4
 fs/proc/task_mmu.c             |   19 -
 include/asm-generic/tlb.h      |    4
 include/asm-i386/pgalloc.h     |   11
 include/asm-i386/pgtable.h     |   14
 include/asm-ia64/pgalloc.h     |   13
 include/asm-x86_64/pgalloc.h   |   24 -
 include/linux/hugetlb.h        |    2
 include/linux/mm.h             |   73 ++++-
 include/linux/rmap.h           |    3
 include/linux/sched.h          |   30 ++
 kernel/fork.c                  |   19 -
 kernel/futex.c                 |    6
 mm/Kconfig                     |   16 +
 mm/filemap_xip.c               |   14
 mm/fremap.c                    |   53 +--
 mm/hugetlb.c                   |   33 +-
 mm/memory.c                    |  578 ++++++++++++++++++-----------------
 mm/mempolicy.c                 |    7
 mm/mmap.c                      |   85 ++----
 mm/mprotect.c                  |    7
 mm/mremap.c                    |  169 +++------
 mm/msync.c                     |   49 +--
 mm/rmap.c                      |  115 ++++---
 mm/swap_state.c                |    3
 mm/swapfile.c                  |   20 -
 mm/vmalloc.c                   |    4
 34 files changed, 740 insertions(+), 778 deletions(-)
* Re: [RFT][PATCH 0/2] pagefault scalability alternative
  2005-08-22 21:27 Hugh Dickins
@ 2005-08-22 22:29 ` Christoph Lameter
  2005-08-23  0:32   ` Nick Piggin
  2005-08-23  8:14   ` Hugh Dickins
  0 siblings, 2 replies; 12+ messages in thread
From: Christoph Lameter @ 2005-08-22 22:29 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Nick Piggin, Linus Torvalds, Andrew Morton, linux-mm

On Mon, 22 Aug 2005, Hugh Dickins wrote:

> Here's my alternative to Christoph's pagefault scalability patches:
> no pte xchging, just narrowing the scope of the page_table_lock and
> (if CONFIG_SPLIT_PTLOCK=y when SMP) splitting it up per page table.

The basic idea is to have a spinlock per page table entry it seems.
I think that is a good idea since it avoids atomic operations, and I
hope it will bring the same performance as my patches (seems that the
page_table_lock can now be cached on the node where the fault is
happening). However, these are very extensive changes to the vm.

The vm code in various places expects the page table lock to lock the
complete page table. How do the page-based ptls and the real ptl
interact?

There are these various hackish things in there that will hopefully be
taken care of. F.e. there really should be a spinlock_t ptl in the
struct page. spinlock_t is often much bigger than an unsigned long.

The patch generally drops the first acquisition of the page table lock
from handle_mm_fault that is used to protect the read operations on
the page table. I doubt that this works with i386 PAE since the page
table read operations are not protected by the ptl. These are 64 bit
which cannot be reliably retrieved in a 32-bit operation on i386 as
you pointed out last fall. There may be concurrent writes so that one
gets two pieces that do not fit. PAE mode either needs to fall back to
take the page_table_lock for reads or use some tricks to guarantee
64bit atomicity.

I have various bad feelings about some elements but I like the general
direction.

> Certainly not to be considered for merging into -mm yet: contains
> various tangential mods (e.g. mremap move speedup) which should be
> split off into separate patches for description, review and merge.

Could you modularize these patches? It's difficult to review as one.
Maybe separate the narrowing and the splitting and the miscellaneous
things?

> Presented as a Request For Testing - any chance, Christoph, that you
> could get someone to run it up on SGI's ia64 512-ways, to compare
> against the vanilla 2.6.13-rc6-mm1 including your patches? Thanks!

Compiles and boots fine on ia64. Survives my benchmark on a smaller
box. Numbers and more details will follow later. It takes some time to
get a bigger iron.
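
The "spinlock_t ptl in the struct page" point refers to overlaying the
per-page-table lock on an existing field of that page table's struct
page. A sketch of the shape of it (field and function names made up
here; the fragility Christoph notes is that debug spinlocks outgrow an
unsigned long):

    /* One lock per page-table page, squeezed into its struct page. */
    #define page_ptl(page)  ((spinlock_t *)&(page)->private)

    static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
    {
    #ifdef CONFIG_SPLIT_PTLOCK
            return page_ptl(pmd_page(*pmd));        /* split: per pt page */
    #else
            return &mm->page_table_lock;            /* unsplit: one lock */
    #endif
    }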
* Re: [RFT][PATCH 0/2] pagefault scalability alternative
  2005-08-22 22:29 ` Christoph Lameter
@ 2005-08-23  0:32   ` Nick Piggin
  2005-08-23  7:04     ` Hugh Dickins
  0 siblings, 1 reply; 12+ messages in thread
From: Nick Piggin @ 2005-08-23 0:32 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, linux-mm

Christoph Lameter wrote:

> The patch generally drops the first acquisition of the page table lock
> from handle_mm_fault that is used to protect the read operations on
> the page table. I doubt that this works with i386 PAE since the page
> table read operations are not protected by the ptl. These are 64 bit
> which cannot be reliably retrieved in a 32-bit operation on i386 as
> you pointed out last fall. There may be concurrent writes so that one
> gets two pieces that do not fit. PAE mode either needs to fall back to
> take the page_table_lock for reads or use some tricks to guarantee
> 64bit atomicity.

Oh yes, you need 64-bit atomic reads and writes for that. We actually
did see that load in handle_pte_fault being cut in half by a store.

I wouldn't be too worried about that though, as it's only for PAE.

--
SUSE Labs, Novell Inc.
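
For reference, if one did want an atomic 64-bit pte read on i386 PAE,
cmpxchg8b can provide it, since a compare-exchange always returns the
old value it found atomically. A sketch, assuming a cmpxchg64-style
helper built on cmpxchg8b (the locked bus cycle makes this costly on
every fault, which is part of why the patches try to avoid needing
it):

    static inline u64 pte_read_atomic(u64 *ptep)
    {
            /* Compare-and-exchange with guess 0: if *ptep is 0 we
             * store 0 back (a no-op); either way the return value is
             * an atomic 8-byte snapshot of the entry. */
            return cmpxchg64(ptep, 0ULL, 0ULL);
    }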
* Re: [RFT][PATCH 0/2] pagefault scalability alternative
  2005-08-23  0:32 ` Nick Piggin
@ 2005-08-23  7:04   ` Hugh Dickins
  0 siblings, 0 replies; 12+ messages in thread
From: Hugh Dickins @ 2005-08-23 7:04 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Christoph Lameter, Linus Torvalds, Andrew Morton, linux-mm

On Tue, 23 Aug 2005, Nick Piggin wrote:
> Christoph Lameter wrote:
>
> > The patch generally drops the first acquisition of the page table lock
> > from handle_mm_fault that is used to protect the read operations on the
> > page table. I doubt that this works with i386 PAE since the page table
> > read operations are not protected by the ptl. These are 64 bit which
> > cannot be reliably retrieved in a 32-bit operation on i386 as you
> > pointed out last fall. There may be concurrent writes so that one gets
> > two pieces that do not fit. PAE mode either needs to fall back to take
> > the page_table_lock for reads or use some tricks to guarantee 64bit
> > atomicity.
>
> Oh yes, you need 64-bit atomic reads and writes for that.

I don't believe we do. Let me expand on that in my reply to Christoph.

Hugh
* Re: [RFT][PATCH 0/2] pagefault scalability alternative
  2005-08-22 22:29 ` Christoph Lameter
  2005-08-23  0:32   ` Nick Piggin
@ 2005-08-23  8:14   ` Hugh Dickins
  2005-08-23 10:03     ` Nick Piggin
  2005-08-23 16:30     ` Christoph Lameter
  1 sibling, 2 replies; 12+ messages in thread
From: Hugh Dickins @ 2005-08-23 8:14 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Nick Piggin, Linus Torvalds, Andrew Morton, linux-mm

Thanks a lot for looking into it so quickly, Christoph. Sorry for
giving you the work of deciphering it with so little description.

On Mon, 22 Aug 2005, Christoph Lameter wrote:
> On Mon, 22 Aug 2005, Hugh Dickins wrote:
>
> > Here's my alternative to Christoph's pagefault scalability patches:
> > no pte xchging, just narrowing the scope of the page_table_lock and
> > (if CONFIG_SPLIT_PTLOCK=y when SMP) splitting it up per page table.
>
> The basic idea is to have a spinlock per page table entry it seems.

A spinlock per page table, not a spinlock per page table entry.
That's the split ptlock Y part, but most of the patch is just moving
the taking and releasing of the lock inwards, good whether or not we
split.

> I think that is a good idea since it avoids atomic operations, and I
> hope it will bring the same performance as my patches (seems that the
> page_table_lock can now be cached on the node where the fault is
> happening). However, these are very extensive changes to the vm.

Maybe not push it for 2.6.13 ;-?

> The vm code in various places expects the page table lock to lock the
> complete page table. How do the page-based ptls and the real ptl
> interact?

If split ptlock N, they're one and the same - though that doesn't mean
there are no issues raised, e.g. zap drops the lock at the end of the
page table: are all arches' tlb mmu_gather operations happy with that?
I have more checking to do there.

If split ptlock Y, then the mm->page_table_lock (which could be
renamed) doesn't do much more than guard page table and anon_vma
allocation, and a few other odds and ends. All the interesting load
falls on the per-pt lock. So long as arches don't have special code
involving page_table_lock, that change shouldn't matter to them; but a
few do (e.g. sparc64) and need checking/conversion.

> There are these various hackish things in there that will hopefully be
> taken care of. F.e. there really should be a spinlock_t ptl in the
> struct page. spinlock_t is often much bigger than an unsigned long.

Yes, see my reply to Nick: I believe it's okay for now, even with
debug options, but fragile. If it stays, it needs robustification.

> The patch generally drops the first acquisition of the page table lock
> from handle_mm_fault that is used to protect the read operations on
> the page table. I doubt that this works with i386 PAE since the page
> table read operations are not protected by the ptl. These are 64 bit
> which cannot be reliably retrieved in a 32-bit operation on i386 as
> you pointed out last fall. There may be concurrent writes so that one
> gets two pieces that do not fit. PAE mode either needs to fall back to
> take the page_table_lock for reads or use some tricks to guarantee
> 64bit atomicity.

Yes, you referred to that "futility" in mail a few days ago: sorry if
it seemed like I was ignoring you. I did embark upon a reply, but in
the course of that reply decided that I needed to spend the time
getting the patch right, then explain it after.

I've memories of that too. Spent a while looking through my sent
mail - very spooky.
It was probably this concluding remark from 12 Dec 04:

> > Oh, hold on, isn't handle_mm_fault's pmd without page_table_lock
> > similarly racy, in both the 64-on-32 cases, and on architectures
> > which have a more complex pmd_t (sparc, m68k, h8300)? Sigh.

The list is frv, h8300, i386 PAE, m68k, m68knommu, sparc, uml
3level32. Needn't worry about h8300 and m68knommu because they're
NOMMU. Needn't worry about frv and m68k since they're neither SMP nor
PREEMPT (I haven't deciphered frv here, wonder if it's just been
defined the other way round from the other architectures). UML would
follow what's decided for the others.

So the problem ones are i386 PAE and sparc: I haven't got down to
sparc yet, I expect it to need a little reordering and barriers, but
no great problem.

I don't believe we need to read or write the PAE entries atomically.

When writing we certainly need the ptlock, and we certainly need
correct ordering (there's already a wmb between writing the top half
and writing the bottom); oh, and yes, ptep_establish for rewriting
existing entries does need the atomicity it already has (I think: I'm
writing this reply in a rush, not cross-checking every word).

But the reading. In particular, that "entry = *pte" in
handle_pte_fault. I believe that's fine, provided that the do_..._page
handlers are necessarily sceptical about the entry they're passed.
They're quite free to do things like allocate a new page, or look up a
cache page, without checking, so long as they recheck entry under
ptlock before proceeding further, as they already did. But they must
not do anything irrevocable, anything that might issue an error
message to the logs, if the entry they're passed is actually a
mismatch of two halves. I believe I've already put in the necessary
code for that, e.g. the sizeof(pte_t) checks.

Another aspect is peeking at (in particular) *pmd without any lock:
that too might give mismatched halves and nonsense; that's what
alarmed me in my mail last December.

After dealing with the really hard issues (how to get the definitions
and inlines into the header files without crashing the HIGHPTE build)
yesterday, I spent several hours ruminating again on that *pmd issue,
holding off from making a hundred edits; and in the end added just an
unsigned long cast into the i386 definition of pmd_none. We must avoid
basing decisions on two mismatched halves; but pmd_present is already
safe, and now pmd_none also. The remaining races are benign.

What do you think?

> I have various bad feelings about some elements but I like the general
> direction.

Great (except for the bad feelings!).

> > Certainly not to be considered for merging into -mm yet: contains
> > various tangential mods (e.g. mremap move speedup) which should be
> > split off into separate patches for description, review and merge.
>
> Could you modularize these patches? It's difficult to review as one.
> Maybe separate the narrowing and the splitting and the miscellaneous
> things?

Of course I must. This wasn't sent for review (though your review is
much appreciated), just as something to try out to see if worth
pursuing. A suite of 39 seemed more hindrance than help at this stage.
(You may well feel a little review is in order before putting strange
patches on your special machines!)

The first sub-patches I post should be for some of the very tangential
things, tidyups that could safely go forward to 2.6.14 (perhaps).
Hopefully merging those would reduce the diff somewhat - though it'll
certainly need helpful subdivision and description beyond that.
> > Presented as a Request For Testing - any chance, Christoph, that you
> > could get someone to run it up on SGI's ia64 512-ways, to compare
> > against the vanilla 2.6.13-rc6-mm1 including your patches? Thanks!
>
> Compiles and boots fine on ia64. Survives my benchmark on a smaller
> box. Numbers and more details will follow later. It takes some time to
> get a bigger iron.

Thanks again for such prompt feedback,
Hugh
* Re: [RFT][PATCH 0/2] pagefault scalability alternative
  2005-08-23  8:14 ` Hugh Dickins
@ 2005-08-23 10:03   ` Nick Piggin
  0 siblings, 0 replies; 12+ messages in thread
From: Nick Piggin @ 2005-08-23 10:03 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Christoph Lameter, Linus Torvalds, Andrew Morton, linux-mm

Hugh Dickins wrote:

> So the problem ones are i386 PAE and sparc: I haven't got down to
> sparc yet, I expect it to need a little reordering and barriers, but
> no great problem.

I don't think that case is a problem, because I don't think we ever
allocate or free pmd entries due to some CPU errata. That is, unless
something has changed very recently.

> I don't believe we need to read or write the PAE entries atomically.

Hmm, OK. I didn't see the trickery you were doing in do_swap_page and
do_file_page etc. So actually, that seems OK. Wrapping it in a helper
function might be nice (the recheck-under-lock for sizeof(pte_t) >
sizeof(long), that is).

--
SUSE Labs, Novell Inc.
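
The helper Nick suggests might look something like this (a
hypothetical sketch: take the lock, confirm the unlocked sample, and
keep the lock only on success):

    /*
     * Lock the ptl and confirm that a pte sampled without it is still
     * what we saw; on i386 PAE the unlocked sample may even have been
     * two mismatched halves. Returns with the ptl held if the sample
     * is confirmed, unlocked if the caller must back out and retry.
     */
    static inline int pte_lock_and_confirm(spinlock_t *ptl,
                                           pte_t *ptep, pte_t sampled)
    {
            spin_lock(ptl);
            if (pte_same(*ptep, sampled))
                    return 1;               /* proceed, ptl held */
            spin_unlock(ptl);
            return 0;                       /* raced: back out */
    }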
* Re: [RFT][PATCH 0/2] pagefault scalability alternative
  2005-08-23  8:14 ` Hugh Dickins
  2005-08-23 10:03   ` Nick Piggin
@ 2005-08-23 16:30   ` Christoph Lameter
  2005-08-23 16:43     ` Martin J. Bligh
                       ` (2 more replies)
  1 sibling, 3 replies; 12+ messages in thread
From: Christoph Lameter @ 2005-08-23 16:30 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Nick Piggin, Linus Torvalds, Andrew Morton, linux-mm

On Tue, 23 Aug 2005, Hugh Dickins wrote:

> > The basic idea is to have a spinlock per page table entry it seems.
> A spinlock per page table, not a spinlock per page table entry.

That's a spinlock per pmd? Calling it "per page table" is a bit
confusing, since "page table" may refer to the whole tree. Could you
develop a clearer way of referring to these locks that is not
page_table_lock or ptl?

> After dealing with the really hard issues (how to get the definitions
> and inlines into the header files without crashing the HIGHPTE build)
> yesterday, I spent several hours ruminating again on that *pmd issue,
> holding off from making a hundred edits; and in the end added just an
> unsigned long cast into the i386 definition of pmd_none. We must avoid
> basing decisions on two mismatched halves; but pmd_present is already
> safe, and now pmd_none also. The remaining races are benign.
>
> What do you think?

Atomicity can be guaranteed to some degree by using the present bit.
For an update, the present bit is first switched off. When the new
value is written, the piece of the entry that does not contain the
present bit is written first, so the entry stays "not present"; the
word with the present bit is written last.

This means that if any p?d entry is found without the present bit set,
a lock must be taken and the entry reread to get a consistent value.

Here are the results of the performance test. In summary, these show
that the performance of both our approaches is equivalent. I would
prefer your patches over mine, since they have a broader scope and may
accelerate other aspects of vm operations.

Note that these tests need to be taken with some caution. Results are
topology dependent, and it's just one special case (allocating new
pages in do_anon_page) that is measured. Results are somewhat skewed
if the amount of memory per task (mem/threads) becomes too small, so
that not enough time is spent in concurrent page faulting.

We only scale well up to 32 processors. Beyond that performance is
still dropping, and there is severe contention at 60. This is still
better than experiencing this drop at 4 processors (2.6.13), but not
all that we are after. This performance pattern is typical for only
dealing with the page_table_lock.

I tried the delta patches, which increase performance somewhat more,
but I do not get the performance results in the very high range that I
saw last year. Either something is wrong with the delta patches or
there is another issue these days that limits performance. I still
have to figure out what is going on there. I may know more after I
test on a machine with more processors.

Two samples, each allocating 1, 4, 8, 16 GB with 1-60 processors.

1.
2.6.13-rc6-mm1

Gb Rep Thr CLine User System Wall flt/cpu/s fault/wsec
1 3 1 1 0.06s 2.08s 2.01s 91530.726 91686.487
1 3 2 1 0.04s 2.40s 1.03s 80313.725 148253.347
1 3 4 1 0.04s 2.48s 0.07s 78019.048 247860.666
1 3 8 1 0.04s 2.76s 0.05s 70217.143 336562.559
1 3 16 1 0.07s 4.37s 0.05s 44201.439 332361.815
1 3 32 1 5.94s 10.92s 1.00s 11650.154 180992.401
1 3 60 1 42.57s 21.80s 2.02s 3054.057 89132.235

Gb Rep Thr CLine User System Wall flt/cpu/s fault/wsec
4 3 1 1 0.13s 8.28s 8.04s 93356.125 93399.776
4 3 2 1 0.13s 9.44s 5.01s 82091.023 152346.188
4 3 4 1 0.12s 9.80s 3.00s 79245.466 256976.998
4 3 8 1 0.17s 10.54s 2.00s 73361.194 383107.125
4 3 16 1 0.16s 17.06s 1.09s 45637.883 404563.542
4 3 32 1 4.27s 42.62s 2.06s 16768.273 294151.260
4 3 60 1 40.02s 110.99s 4.04s 5207.607 177074.387

Gb Rep Thr CLine User System Wall flt/cpu/s fault/wsec
8 3 1 1 0.32s 16.84s 17.01s 91637.381 91636.318
8 3 2 1 0.32s 18.80s 10.02s 82228.356 153285.701
8 3 4 1 0.30s 19.45s 6.00s 79630.620 261810.203
8 3 8 1 0.34s 20.94s 4.00s 73885.006 391636.418
8 3 16 1 0.42s 34.06s 3.07s 45600.835 417784.690
8 3 32 1 9.57s 87.58s 5.01s 16188.390 303679.562
8 3 60 1 37.34s 246.24s 7.07s 5546.221 202734.992

Gb Rep Thr CLine User System Wall flt/cpu/s fault/wsec
16 3 1 1 0.64s 40.12s 40.07s 77161.695 77175.960
16 3 2 1 0.64s 38.24s 20.08s 80891.998 151015.426
16 3 4 1 0.67s 38.75s 11.09s 79784.113 263917.598
16 3 8 1 0.62s 41.82s 7.08s 74107.802 399410.789
16 3 16 1 0.61s 67.76s 7.03s 46003.627 429354.596
16 3 32 1 8.76s 173.04s 9.04s 17302.854 333248.692
16 3 60 1 32.76s 466.27s 13.03s 6303.609 235490.831

Gb Rep Thr CLine User System Wall flt/cpu/s fault/wsec
1 3 1 1 0.03s 2.08s 2.01s 92739.623 92765.448
1 3 2 1 0.02s 2.38s 1.03s 81647.841 150542.942
1 3 4 1 0.06s 2.46s 0.07s 77649.289 247254.017
1 3 8 1 0.05s 2.75s 0.05s 70017.094 346483.976
1 3 16 1 0.06s 4.39s 0.06s 44161.725 313310.777
1 3 32 1 9.02s 11.20s 1.02s 9717.675 162578.985
1 3 60 1 28.92s 29.71s 2.01s 3353.254 93278.693

Gb Rep Thr CLine User System Wall flt/cpu/s fault/wsec
4 3 1 1 0.16s 8.20s 8.03s 93935.977 93937.837
4 3 2 1 0.19s 9.33s 5.01s 82539.043 153124.158
4 3 4 1 0.22s 9.70s 3.00s 79213.537 257326.049
4 3 8 1 0.23s 10.48s 2.00s 73361.194 383192.157
4 3 16 1 0.22s 16.97s 1.09s 45722.791 405459.259
4 3 32 1 4.67s 43.56s 2.06s 16301.136 292609.111
4 3 60 1 21.01s 99.12s 4.00s 6546.181 193120.292

Gb Rep Thr CLine User System Wall flt/cpu/s fault/wsec
8 3 1 1 0.28s 16.77s 17.00s 92196.014 92248.241
8 3 2 1 0.36s 18.64s 10.02s 82747.475 154108.957
8 3 4 1 0.36s 19.39s 6.00s 79598.381 261456.810
8 3 8 1 0.31s 20.96s 4.00s 73898.891 392375.888
8 3 16 1 0.37s 34.28s 3.07s 45385.042 416385.059
8 3 32 1 8.75s 88.48s 5.01s 16175.737 303964.057
8 3 60 1 34.85s 213.80s 7.03s 6325.563 213671.451

Gb Rep Thr CLine User System Wall flt/cpu/s fault/wsec
16 3 1 1 0.63s 40.19s 40.08s 77040.752 77044.800
16 3 2 1 0.58s 38.34s 20.08s 80800.575 150916.421
16 3 4 1 0.71s 38.66s 11.09s 79873.248 264287.489
16 3 8 1 0.64s 41.91s 7.08s 73905.836 399511.701
16 3 16 1 0.64s 67.46s 7.03s 46187.350 430267.445
16 3 32 1 8.12s 171.97s 9.04s 17466.563 333665.446
16 3 60 1 28.56s 483.76s 13.03s 6140.067 235670.414

2.
2.6.13-rc6-mm1-hugh

Gb Rep Thr CLine User System Wall flt/cpu/s fault/wsec
1 3 1 1 0.02s 2.12s 2.01s 91530.726 91290.645
1 3 2 1 0.03s 2.40s 1.03s 80842.105 148577.211
1 3 4 1 0.04s 2.50s 0.07s 77161.695 246742.307
1 3 8 1 0.06s 2.74s 0.05s 69917.496 333526.774
1 3 16 1 0.04s 4.38s 0.05s 44321.010 329911.942
1 3 32 1 2.70s 11.10s 0.09s 14242.828 208238.299
1 3 60 1 10.39s 25.69s 1.06s 5448.016 122221.286

Gb Rep Thr CLine User System Wall flt/cpu/s fault/wsec
4 3 1 1 0.20s 8.24s 8.04s 93090.909 93102.689
4 3 2 1 0.17s 9.40s 5.01s 82125.313 152597.761
4 3 4 1 0.18s 9.76s 3.00s 78990.759 256940.071
4 3 8 1 0.14s 10.58s 2.00s 73361.194 383839.118
4 3 16 1 0.18s 17.34s 1.09s 44887.671 400584.413
4 3 32 1 3.03s 44.14s 2.06s 16670.171 296955.678
4 3 60 1 42.77s 124.64s 4.07s 4697.360 164568.728

Gb Rep Thr CLine User System Wall flt/cpu/s fault/wsec
8 3 1 1 0.41s 16.72s 17.01s 91787.115 91811.036
8 3 2 1 0.32s 18.75s 10.02s 82469.799 153767.106
8 3 4 1 0.31s 19.49s 6.00s 79405.493 260233.329
8 3 8 1 0.34s 21.00s 4.00s 73691.154 390162.630
8 3 16 1 0.33s 33.82s 3.07s 46054.814 420596.124
8 3 32 1 6.98s 87.06s 4.09s 16724.767 315990.572
8 3 60 1 39.50s 252.83s 7.06s 5380.182 204361.061

Gb Rep Thr CLine User System Wall flt/cpu/s fault/wsec
16 3 1 1 0.62s 40.28s 40.09s 76897.624 76894.371
16 3 2 1 0.73s 38.20s 20.08s 80775.678 150731.248
16 3 4 1 0.62s 38.86s 11.09s 79670.955 263253.128
16 3 8 1 0.67s 41.89s 7.09s 73891.948 398115.325
16 3 16 1 0.67s 68.00s 7.04s 45802.679 424756.786
16 3 32 1 8.13s 170.75s 9.06s 17584.902 325968.378
16 3 60 1 19.06s 443.08s 12.08s 6806.696 244372.117

Gb Rep Thr CLine User System Wall flt/cpu/s fault/wsec
1 3 1 1 0.04s 2.08s 2.01s 92390.977 92501.417
1 3 2 1 0.04s 2.38s 1.03s 81108.911 149261.811
1 3 4 1 0.04s 2.48s 0.07s 77895.404 245772.564
1 3 8 1 0.04s 2.71s 0.05s 71338.171 339935.008
1 3 16 1 0.08s 4.41s 0.06s 43690.667 321102.290
1 3 32 1 6.30s 10.29s 1.01s 11843.855 176712.623
1 3 60 1 31.78s 24.45s 2.00s 3496.372 95318.257

Gb Rep Thr CLine User System Wall flt/cpu/s fault/wsec
4 3 1 1 0.16s 8.23s 8.03s 93622.857 93678.202
4 3 2 1 0.17s 9.40s 5.01s 82125.313 152957.631
4 3 4 1 0.13s 9.81s 3.00s 79022.508 256991.607
4 3 8 1 0.16s 10.59s 2.00s 73142.857 383723.246
4 3 16 1 0.16s 17.08s 1.09s 45616.705 404165.286
4 3 32 1 4.28s 43.45s 2.07s 16471.850 283376.758
4 3 60 1 55.40s 115.75s 5.00s 4594.718 156683.131

Gb Rep Thr CLine User System Wall flt/cpu/s fault/wsec
8 3 1 1 0.32s 16.76s 17.00s 92044.944 92058.036
8 3 2 1 0.29s 18.68s 10.01s 82887.015 154397.006
8 3 4 1 0.33s 19.41s 6.00s 79662.885 262009.505
8 3 8 1 0.32s 20.91s 3.09s 74079.879 393444.882
8 3 16 1 0.32s 34.22s 3.07s 45521.649 417768.300
8 3 32 1 3.44s 85.73s 4.08s 17636.959 325891.128
8 3 60 1 56.83s 248.51s 8.02s 5150.986 191074.214

Gb Rep Thr CLine User System Wall flt/cpu/s fault/wsec
16 3 1 1 0.62s 40.08s 40.07s 77267.833 77269.273
16 3 2 1 0.67s 38.21s 20.08s 80891.998 151088.966
16 3 4 1 0.69s 38.68s 11.09s 79889.476 264299.054
16 3 8 1 0.65s 41.70s 7.08s 74261.756 400677.914
16 3 16 1 0.68s 68.20s 7.04s 45664.383 423956.953
16 3 32 1 4.05s 172.59s 9.03s 17808.292 338026.854
16 3 60 1 49.84s 458.57s 13.09s 6187.311 224887.539
* Re: [RFT][PATCH 0/2] pagefault scalability alternative
  2005-08-23 16:30 ` Christoph Lameter
@ 2005-08-23 16:43   ` Martin J. Bligh
  0 siblings, 0 replies; 12+ messages in thread
From: Martin J. Bligh @ 2005-08-23 16:43 UTC (permalink / raw)
  To: Christoph Lameter, Hugh Dickins
  Cc: Nick Piggin, Linus Torvalds, Andrew Morton, linux-mm

>> > The basic idea is to have a spinlock per page table entry it seems.
>> A spinlock per page table, not a spinlock per page table entry.
>
> That's a spinlock per pmd? Calling it "per page table" is a bit
> confusing, since "page table" may refer to the whole tree. Could you
> develop a clearer way of referring to these locks that is not
> page_table_lock or ptl?

Isn't that per pagetable page? Though maybe that makes less sense with
large pages.

M.
* Re: [RFT][PATCH 0/2] pagefault scalability alternative
  2005-08-23 16:30 ` Christoph Lameter
  2005-08-23 16:43   ` Martin J. Bligh
@ 2005-08-23 18:29   ` Hugh Dickins
  2005-08-27 22:10     ` Avi Kivity
  2 siblings, 0 replies; 12+ messages in thread
From: Hugh Dickins @ 2005-08-23 18:29 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Nick Piggin, Linus Torvalds, Andrew Morton, linux-mm

On Tue, 23 Aug 2005, Christoph Lameter wrote:
> On Tue, 23 Aug 2005, Hugh Dickins wrote:
>
> > > The basic idea is to have a spinlock per page table entry it seems.
> > A spinlock per page table, not a spinlock per page table entry.
>
> That's a spinlock per pmd? Calling it "per page table" is a bit
> confusing, since "page table" may refer to the whole tree. Could you
> develop a clearer way of referring to these locks that is not
> page_table_lock or ptl?

Sorry to confuse. I'm used to using "page table" for the leaf (or
rather, the twig above the pages themselves); it may be hard to
retrain myself now. Martin suggests "page table page", and that's fine
by me.

And, quite aside from getting confused between the different levels,
there's the confusion that C arrays introduce: when we write "pte",
sometimes we're thinking of a single page table entry, and sometimes
of the contiguous array of them, the "page table page".

You suggest I mean a spinlock per pmd: I'd say it's a spinlock per pmd
entry. Oh well, let the code speak for itself. Every PMD_SIZE bytes of
userspace gets its own spinlock (and "PMD_SIZE" argues for your
nomenclature, not mine).

> Atomicity can be guaranteed to some degree by using the present bit.
> For an update, the present bit is first switched off. When the new
> value is written, the piece of the entry that does not contain the
> present bit is written first, so the entry stays "not present"; the
> word with the present bit is written last.

Exactly. And many of the tests (e.g. in the _alloc functions) are
testing present, and need no change. But the p?d_none_or_clear_bad
tests would be in danger of advancing to the "bad" test, and getting
it wrong, if we assemble "none" from two parts: we need just to test
the word with the present bit in it. (But this would go wrong if we
did it for pte_none, I think, because the swap entry gets stored in
the upper part when PAE.)

> This means that if any p?d entry is found without the present bit set,
> a lock must be taken and the entry reread to get a consistent value.

That's indeed what happens in the _alloc functions, pmd_alloc etc. In
the others - the lookups which don't wish to allocate - they just give
up on seeing p?d_none, with no need to reread: if there's a race,
we're happy to lose it in those contexts.

> Here are the results of the performance test. In summary, these show
> that the performance of both our approaches is equivalent.

Many thanks for getting those done. Interesting: or perhaps not -
boringly similar! They fit with what I see on the 32-bit and 64-bit
2*HT*Xeons I have here: sometimes one does better, sometimes the
other. I'd rather been expecting the bigger machines to show some
advantage to your patch, yet not a decisive advantage. Perhaps that
will become apparent later, on the bigger machines neither can
presently scale to; or maybe I was just preparing myself for some
disappointment.

> I would prefer your patches over mine, since they have a broader scope
> and may accelerate other aspects of vm operations.

That is very generous of you, especially after all the effort you've
put into posting and reposting yours down the months: thank you.
But I do agree that mine covers more bases: no doubt similar tests on
the shared file pages would degenerate quickly from other contention
(e.g. the pagecache Nick is playing in), but we can expect that as
such issues get dealt with, the narrowed and split locking would give
the same advantage to those as to the anonymous.

I still fear that your pte xchging, with its parallel set of locking
rules, would tie our hands down the road, and might have to be backed
out later even if brought in. But it's certainly not ruled out: it
will be interesting to see if it gives any boost on top of the split.

I do consistently see a small advantage to CONFIG_SPLIT_PTLOCK N over
Y when not multithreading, and little visible advantage at 4. Not
particularly anxious to offer it as a config option to the user: I
wonder whether to tie it to CONFIG_NR_CPUS, enable split ptlock at 4
or at 8 (would need to get good regular test coverage of both).

By the way, a little detail in pft.c: I think there's one too many
zeroes where wall.tv_nsec is converted for printing - hence the
remarkable trend that Wall time is always X.0Ys. But that does not
invalidate our conclusions.

Hugh
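
The unsigned long cast Hugh mentions earlier in the thread - making
pmd_none judge only the word that carries the present bit - presumably
amounts to something like this (a sketch reconstructed from the
description, assuming the PAE layout with the present bit in the low
word; not the actual diff):

    /* PAE pmd_val() is 64-bit, read as two 32-bit halves.  The write
     * protocol clears the present-carrying low word first and sets it
     * last, so truncating to unsigned long tests only that word and
     * cannot be fooled by mismatched halves: */
    #define pmd_none(x)     (!(unsigned long)pmd_val(x))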
* Re: [RFT][PATCH 0/2] pagefault scalability alternative
  2005-08-23 16:30 ` Christoph Lameter
  2005-08-23 16:43   ` Martin J. Bligh
  2005-08-23 18:29   ` Hugh Dickins
@ 2005-08-27 22:10   ` Avi Kivity
  2 siblings, 0 replies; 12+ messages in thread
From: Avi Kivity @ 2005-08-27 22:10 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Hugh Dickins, Nick Piggin, Linus Torvalds, Andrew Morton, linux-mm

Christoph Lameter wrote:

>> A spinlock per page table, not a spinlock per page table entry.
>
> That's a spinlock per pmd? Calling it "per page table" is a bit
> confusing, since "page table" may refer to the whole tree. Could you
> develop a clearer way of referring to these locks that is not
> page_table_lock or ptl?

I've heard the term "paging element" used to describe the p*t's.

--
Do not meddle in the internals of kernels, for they are subtle and
quick to panic.