* [PATCH 0/2][RFC] New version of shared page tables
From: Dave McCracken @ 2006-05-03 15:43 UTC
To: Hugh Dickins; +Cc: Linux Memory Management, Linux Kernel

I've done some cleanup and some bugfixing.  Hugh, please review this
version instead of the old one.

I like my locking mechanism for unsharing on this one a lot better.  It
works on an address range instead of depending on a vma, which more
closely reflects the way it's used.

The first patch just standardizes the pxd_page/pxd_page_kernel macros
for all architectures.  The second patch is the heart of shared page
tables.

This version of the patches is against 2.6.17-rc3.

Dave McCracken

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
* Re: [PATCH 0/2][RFC] New version of shared page tables
From: Hugh Dickins @ 2006-05-03 15:56 UTC
To: Dave McCracken; +Cc: Linux Memory Management, Linux Kernel

On Wed, 3 May 2006, Dave McCracken wrote:
>
> I've done some cleanup and some bugfixing.  Hugh, please review
> this version instead of the old one.

Grrr, just as I'm writing up my notes on the last revision!
I need a new go-faster brain.  Okay, I'll switch over now.

Sisyphughs
* Re: [PATCH 0/2][RFC] New version of shared page tables
From: Dave McCracken @ 2006-05-03 16:06 UTC
To: Hugh Dickins; +Cc: Linux Memory Management, Linux Kernel

--On Wednesday, May 03, 2006 16:56:12 +0100 Hugh Dickins
<hugh@veritas.com> wrote:

>> I've done some cleanup and some bugfixing.  Hugh, please review
>> this version instead of the old one.
>
> Grrr, just as I'm writing up my notes on the last revision!
> I need a new go-faster brain.  Okay, I'll switch over now.

Sorry.  The changes should be relatively minor.  Just a tweak to the
unshare locking and some extra code to handle hugepage copy_page_range,
mostly.

Dave
* Re: [PATCH 0/2][RFC] New version of shared page tables
From: Hugh Dickins @ 2006-05-06 15:25 UTC
To: Dave McCracken; +Cc: Linux Memory Management, Linux Kernel

On Wed, 3 May 2006, Dave McCracken wrote:
>
> The changes should be relatively minor.  Just a tweak to the unshare
> locking and some extra code to handle hugepage copy_page_range, mostly.

I didn't pay any attention to 1/2, the pxd_page/pxd_page_kernel patch.
It was well worth separating that one out, really helps to reduce the
scale and worry of the main patch.  Notice recent mails suggesting it's
the answer to a pud_page anomaly: I hadn't realized the type of pud_page
varies from arch to arch, that's horrid: if you're sorting it out, then
please make that clear in the comment and push it forward.  Though I
notice only a couple of instances of pxd_page_kernel outside of
include/asm-* in 2.6.17-rc: I now think you'd do much better not to
propagate that obscure _kernel suffix further, but go for pxd_page_vaddr
(or suchlike) throughout instead: more change, but much clearer.

I do agree with Christoph that you'd do well to separate out the hugetlb
part of the main patch.  Not just for the locking, more because that's
become such a specialist and fast-moving area recently.  I didn't pay
attention to that part of 2/2 either, but got the impression that your
patch has not kept up with the changes there.

Let me say (while perhaps others are still reading) that I'm seriously
wondering whether you should actually restrict your shared pagetable
work to the hugetlb case.  I realize that would be a disappointing
limitation to you, and would remove the 25%/50% improvement cases,
leaving only the 3%/4% last-ounce-of-performance cases.
But it's worrying me a lot that these complications to core mm code will
_almost_ never apply to the majority of users, will get little testing
outside of specialist setups.  I'd feel safer to remove that "almost",
and consign shared pagetables to the hugetlb ghetto, if that would
indeed remove their handling from the common code paths.  (Whereas, if
we didn't have hugetlb, I would be arguing strongly for shared pts.)

Patch 2/2 does look cleaner than before, and dropping PTSHARE_PUD has
helped to make it all simpler.  But you've not yet added the rss
accounting: that's going to make it quite a lot nastier.  An argument
for sticking just to the hugetlb case: although hugetlb accounts rss at
present, I think we could justify not doing so (though hugetlb rss is
more relevant now it's not prefaulted).

However, I'll continue commenting on your non-hugetlb modifications.

A lot of page migration work has come into the tree (in mm/mempolicy.c
and mm/migrate.c) since 2.6.15: an optimistic guess would be that all
you need do there is skip shared pagetables unless MPOL_MF_MOVE_ALL (as
page_mapcount > 1 pages would be skipped).  You don't have to worry
about the pagetable becoming shared after you've tested: it's already
accepted that what's initially tested may change later (and your rmap.c
changes should cover the wider TLB flushing needed).

Naming: pt_check_unshare_pxd, I think better call them pt_unshare_pxd;
ah, you've already got pt_unshare_pxd, which is similar but different.
That's confusing - forced upon you by pagetable folding peculiarities?
I'd rather have just one copy of that central locking code.  You've
generally been helpful with your underscores: one exception is
pt_vmashared: pt_sharing_vma?  Ah, it's supposed to be gone from the
latest version, but one trace remains - please delete that.
ptshare.h contains more declarations/definitions never used:
pt_increment_pte, pt_decrement_pte, pt_increment_pmd, pt_decrement_pmd
at least: please check and delete what you're not actually using.

Compiler warns "value computed is not used" on the "*shared++;" lines
in page_check_address: should be "(*shared)++;".  So at present the
shared path in rmap.c has not really been tested.

page_referenced_one needs to skip decrementing *mapcount when shared:
the vma prio_tree search will bring it back again and again to the same
shared pagetable, though that pte is only counted once in mapcount: so
it's currently liable to break out too early, missing other entries.

try_to_unmap_one will now be flushing TLB on all cpus for each shared
pagetable entry it unmaps, where often (seeing inactive mm) it wouldn't
have needed to flush TLB at all; but that might work out on balance, it
won't be finding so many entries to unmap.

The prio_tree_iter_inits in pt_share_pte and pt_share_pmd should limit
their scope to the range of the pagetable involved, not the whole vma.
next_shareable_vma likewise?  I thought so at first, but perhaps its
check for a similar vma often avoids immediate unsharing.  Optimizations
only, you've probably had them in mind for later.

I mentioned the off-by-one in pt_shareable_pte and pt_shareable_pmd
before: ought to say "vma->vm_end - 1 >= end"; but must admit that's
nitpicking, since vm_end is PAGE_SIZE aligned anyway, so no real issue
can arise - fix it to help stop others worrying later?  Whereas
pt_share_pte and pt_share_pmd have the complementary issue: "end = base
+ PXD_SIZE" may wrap to 0, so you need to -1 somewhere (but you won't
need base and end if pt_trans_start/end go away).

pt_share_pte and pt_share_pmd: preferable to swap around their pxd and
address arguments, so they resemble what they're replacing.

pt_shareable now has the same off-by-one too: or would have, but
"end = base + (mask-1)" is quite wrong, isn't it?  base + ~mask?
Move those calculations lower down, after the common tests? or do
compiler and processor nowadays optimize such orderings well?  And
there's a leftover "vmas in transition" comment on vm_flags.

pt_shareable is still not rejecting if vma->anon_vma is set: it's quite
possible for a vm_file vma to be private and writable, gather some COW
pages, and then be mprotected to readonly, so passing the vm_flags
test - but its pagetables must not be shared.

VM_PTSHARE came and went, good, you never had the mmap_sem needed to
set it.  VM_TRANSITION came and went, you've replaced it by the
mm->pt_trans_start, pt_trans_end.  At first I thought that a big
improvement, now I'm not so sure.  If they stay, those added fields
should be under #ifdef so as not to enlarge the basic mm_struct.

I'd prefer something other than "lock" in pt_unshare_lock_range and
pt_unlock_range, but I think I'm going to suggest you go back to using
pt_unshare_range alone: let's look at the three callsites.

sys_remap_file_pages: doesn't really need the locked range, you could
just call pt_unshare_range a little lower down, once i_mmap_lock taken.

mprotect_fixup: that does need some protection, yes, because the pte
protections are out-of-synch with the vm_flags for a while (in a way
that's okay for the owning mm, but not for "parasites" wanting to
share).  Please move the pt_unshare_lock_range (or whatever) down above
vma_merge, so you can remove the pt_unlock_range from the -ENOMEM case
above it.

mremap move_vma: not good enough, you're unsharing and locking the old
range, but you also need to lock the new range before copy_vma, to hold
it unshared too; which could be done, though not with the interfaces
you've provided.  (The VM_TRANSITION version was insufficient too, and
cleared the flag at a point where "vma" might already have been freed.)

Are those the only places which need this range locking?
I was worried at first that there might be more, then I came around to
thinking you'd identified the right places, now suddenly I see
pt_check_unshare_pxd in zap_pxd_range as vulnerable: the vma remains in
the prio_tree, so it might immediately get shared again; what the
zapping does is not wholly wrong, but its TLB flushing would be
inadequate if the table has become shared in the meantime.  Or am I
mistaken?

Unless you have firm performance evidence to the contrary, on a
workload that you're seriously trying to address, I suggest you drop
the range locking transitional stuff, and
down_read_trylock(&svma->vm_mm->mmap_sem) in (or near calls to)
next_shareable_vma instead.  That will fail more often than your
transitional checks, but give much stronger assurance that nothing
funny is going on in the vma found.  But do you then need to add
down_write(&mm->mmap_sem) in exit_mmap, if pagetable sharing is
enabled? currently I believe so.

But then, I think, you can remove the pt_check_unshare_pxd from
free_pyd_range: odd how it was doing those once in the unmap_vmas path,
then again in the free_pgtables path - what was your thinking there?
Yes, the pagetable may have gotten reshared in between, but the TLB
flushing would already be inadequate if so.

I admire the simplicity of the way you just unshare when you have to,
letting faults fill back in lazily; but does that have a problem in the
case of a VM_LOCKED vma, losing the guarantees mlock should be giving?

I read through a lot of old mails while reviewing, going back to
Daniel's first implementation in 2002.  The most interesting remark I
found (and have lost again) was one from wli, questioning the locking
required when changing *pmd.  Hmm, let's look at your
pt_check_unshare_pte (similar remarks apply to pt_check_unshare_pmd,
pt_unshare_pte, pt_unshare_pmd), there's a lot to question in that
locking.
Well, the locking that you do have, it's unclear why you're using the
spinlock in the pagetable struct page there: doesn't it amount to?

	page = pmd_page(*pmd);
	if (atomic_add_unless(&page->_mapcount, -1, -1))
		return 0;
	pmd_clear_flush(mm, address, pmd);
	return 1;

Ah, probably atomic_add_unless wasn't available when you wrote it.

But then what of the "Oops, already mapped" pt_decrement_share in
pt_share_pte?  That's under different locking (rightly, the level
above, since the question there is whether a racing thread has set
*pmd): what happens if that decrement brings the share count down to,
umm, something awkward - hard to be specific, partly because of how
_mapcount starts from -1, partly because you've gone for a share count
rather than a reference count - I understand that you were avoiding the
overhead of maintaining another reference count on the common path, but
it leaves me deeply suspicious, I fear it's hiding bugs.

I'd agree it'll be rare (usually the racing thread will have found the
same pagetable to share as we have, and so raised its share count), but
I do believe that pt_decrement_share can go wrong: the process that had
that pagetable may be exiting, find it shared in its
pt_check_unshare_pte so skip zapping, then we're left with that
pagetable to free - but we do nothing other than decrement the count
one too far.  I think that will get fixed by pt_share_pte holding
i_mmap_lock and its down_read_trylock of svma->vm_mm->mmap_sem across
the lower block: then it only needs to pt_increment_share when all's
well at the end, the decrement case gone.  pte_lock nests within
i_mmap_lock, should be fine for pmd_lock also.

Now, back to the question of the pmd_clear_flush: currently, we may add
a valid entry *pmd at any time, but we only clear it in free_pgtables,
after all occupying vmas have been removed from anon_vma and prio_tree.
You're relaxing that; most paths are protected by holding mmap_sem, but
file truncation and rmap lookup are not.
The easiest protection against races here is to hold i_mmap_lock, since
both unmap_mapping_range and page_check_address do (but slightly messy
since the unmap_mapping_range path must then avoid retaking it in the
pt_check_unshares).  If you're taking i_mmap_lock in
pt_check_unshare_pte etc, you could then skip the atomic_add_unless I
was suggesting above, revert to your existing structure but using
i_mmap_lock instead of pmd_page ptl.  Except, you must not drop the
lock until after your pmd_clear_flush (which only needs flush_tlb_mm,
doesn't it, rather than flush_tlb_all?).  Because once you drop the
lock, the process you were sharing with could unmap and free all the
pages, and not knowing it had been sharing, only flush for its own mm -
other threads of your process might be able to access those pages after
they were freed by the other.

I don't suppose extending the use of i_mmap_lock as I suggest will be
popular, it's liable to reduce your scalability: I'm more pointing to
an obvious way to fix some problems than necessarily the end solution.

Please don't interpret these detailed comments as meaning that I think
your patch is almost ready: I'm afraid that the longer I spend looking
at it, the more I find to worry about - not a good sign.  (And let me
repeat, I've not looked at the hugetlb end of it at all.)  And though
it's easy to find performance advocates in favour of your patch, it's
hard to find kernel hackers who care for maintainability wanting it in.
And I worry that it will tie our hands, repeatedly posing a difficulty
for other future developments (rather as sys_remap_file_pages did, or I
feared Christoph's pte xchging would).

How was Ray Bryant's shared,anonymous,fork,munmap,private bug of
25 Jan resolved?  We didn't hear the end of that.

Hugh
* Re: [PATCH 0/2][RFC] New version of shared page tables
From: Ray Bryant @ 2006-05-08 19:32 UTC
To: Hugh Dickins; +Cc: Dave McCracken, Linux Memory Management, Linux Kernel

On Saturday 06 May 2006 10:25, Hugh Dickins wrote:
<snip>
> How was Ray Bryant's shared,anonymous,fork,munmap,private bug of
> 25 Jan resolved?  We didn't hear the end of that.

I never heard anything back from Dave, either.

> Hugh
<snip>

--
Ray Bryant
AMD Performance Labs                   Austin, Tx
512-602-0038 (o)                       512-507-7807 (c)
* Re: [PATCH 0/2][RFC] New version of shared page tables
From: Dave McCracken @ 2006-05-16 21:09 UTC
To: Ray Bryant, Hugh Dickins; +Cc: Linux Memory Management, Linux Kernel

--On Monday, May 08, 2006 14:32:39 -0500 Ray Bryant
<raybry@mpdtxmail.amd.com> wrote:

> On Saturday 06 May 2006 10:25, Hugh Dickins wrote:
> <snip>
>> How was Ray Bryant's shared,anonymous,fork,munmap,private bug of
>> 25 Jan resolved?  We didn't hear the end of that.
>
> I never heard anything back from Dave, either.

My apologies.  As I recall your problem looked to be a race in an area
where I was redoing the concurrency control.  I intended to ask you to
retest when my new version came out.  Unfortunately the new version
took awhile, and by the time I sent it out I forgot to ask you about
it.

I believe your problem should be fixed in recent versions.  If not,
I'll make another pass at it.

Dave McCracken
* Re: [PATCH 0/2][RFC] New version of shared page tables
From: Ray Bryant @ 2006-05-19 16:55 UTC
To: Dave McCracken; +Cc: Hugh Dickins, Linux Memory Management, Linux Kernel

On Tuesday 16 May 2006 16:09, Dave McCracken wrote:
> --On Monday, May 08, 2006 14:32:39 -0500 Ray Bryant
> <raybry@mpdtxmail.amd.com> wrote:
>> On Saturday 06 May 2006 10:25, Hugh Dickins wrote:
>> <snip>
>>> How was Ray Bryant's shared,anonymous,fork,munmap,private bug of
>>> 25 Jan resolved?  We didn't hear the end of that.
>>
>> I never heard anything back from Dave, either.
>
> My apologies.  As I recall your problem looked to be a race in an area
> where I was redoing the concurrency control.  I intended to ask you to
> retest when my new version came out.  Unfortunately the new version
> took awhile, and by the time I sent it out I forgot to ask you about it.
>
> I believe your problem should be fixed in recent versions.  If not,
> I'll make another pass at it.
>
> Dave McCracken

Let me build up a kernel with the latest patches and give it a try.
(Sorry for delay, didn't see this note until today.)

--
Ray Bryant
AMD Performance Labs                   Austin, Tx
512-602-0038 (o)                       512-507-7807 (c)
* Re: [PATCH 0/2][RFC] New version of shared page tables
From: Ray Bryant @ 2006-05-22 18:00 UTC
To: Dave McCracken; +Cc: Hugh Dickins, Linux Memory Management, Linux Kernel

[-- Attachment #1: Type: text/plain, Size: 2946 bytes --]

On Tuesday 16 May 2006 16:09, Dave McCracken wrote:
> --On Monday, May 08, 2006 14:32:39 -0500 Ray Bryant
> <raybry@mpdtxmail.amd.com> wrote:
>> On Saturday 06 May 2006 10:25, Hugh Dickins wrote:
>> <snip>
>>> How was Ray Bryant's shared,anonymous,fork,munmap,private bug of
>>> 25 Jan resolved?  We didn't hear the end of that.
>>
>> I never heard anything back from Dave, either.
>
> My apologies.  As I recall your problem looked to be a race in an area
> where I was redoing the concurrency control.  I intended to ask you to
> retest when my new version came out.  Unfortunately the new version
> took awhile, and by the time I sent it out I forgot to ask you about it.
>
> I believe your problem should be fixed in recent versions.  If not,
> I'll make another pass at it.
>
> Dave McCracken

Dave,

I'm sending you a test case and a small kernel patch (see attachments).
The patch applies to 2.6.17-rc1, on top of your patches from 4/10/2006
(I'm assuming these are the most recent ones).  What the patch does is
to add a system call that will return the pfn and ptep for a given
virtual address.

What the test program does (I think :-) ) is to create a mmap'd shared
region, then fork off a child.  The child then re-mmaps() private a
portion of the region.  Call it without arguments for now, that should
map 512 pte's and share them between the parent and 1 child.  [Later on
we can try more pages and more children (e. g. ./shpt_test1 128 64).]
At this point, what I expect to have happened is that at the shared
region address in the child, there will be a number of pages that are
still shared with the parent, hence have the same pfn and ptep as they
used to, followed by a set of pages in the re-mmapped() region where
the pfn's and ptep's are different, because that set of pages is no
longer shared.

What I find is that in the re-mmapped() region, the pfn's are
different, but the ptep's have not changed.  Hence, we've modified the
parent address space rather than getting our own copy of that part of
the address space.

Now I'm not positive as to what the semantics SHOULD be here, so that
may be the error involved, but it seems to me that if I mmap() the
region private in the child, I should get a nice new private copy, and
the pte's should no longer be shared with the parent.  Is that the way
you guys understand the semantics of this?

Anyway, take a look at my test case and see if it makes any sense to
you.  If it turns out my test case is in error, then mea culpa, and
I'll fix the problems and try again.

Best Regards,
Ray
--
Ray Bryant
AMD Performance Labs                   Austin, Tx
512-602-0038 (o)                       512-507-7807 (c)

[-- Attachment #2: add-sys_get_vminfo.patch --]
[-- Type: text/x-diff, Size: 2407 bytes --]

Index: linux-2.6.17-rc1-ptsh/mm/memory.c
===================================================================
--- linux-2.6.17-rc1-ptsh.orig/mm/memory.c
+++ linux-2.6.17-rc1-ptsh/mm/memory.c
@@ -2463,3 +2463,80 @@ int in_gate_area_no_task(unsigned long a
 }
 #endif	/* __HAVE_ARCH_GATE_AREA */
+
+#define VMINFO_RESULTS 3
+asmlinkage long
+sys_get_vminfo(pid_t pid, unsigned long addr, long *user_addr)
+{
+	int ret;
+	struct page *p;
+	struct task_struct *task = NULL;
+	struct mm_struct *mm = NULL;
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *ptep = NULL;
+	unsigned long results[VMINFO_RESULTS];
+
+	if (pid >= 0) {
+		read_lock(&tasklist_lock);
+		task = find_task_by_pid(pid);
+		if (task) {
+			task_lock(task);
+			mm = task->mm;
+			if (mm)
+				atomic_inc(&mm->mm_users);
+		} else {
+			read_unlock(&tasklist_lock);
+			return -ESRCH;
+		}
+		read_unlock(&tasklist_lock);
+	} else
+		return -1;
+
+	ret = get_user_pages(task, mm, addr, 1, 0, 0, &p, NULL);
+	results[0] = 0;
+	results[1] = -1;
+	if (ret >= 0) {
+		results[0] = page_to_pfn(p);
+		results[1] = page_to_nid(p);
+		put_page(p);
+	} else
+		ret = EINVAL;
+
+	pgd = pgd_offset(mm, addr);
+	if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
+		goto no_page_table;
+
+	pud = pud_offset(pgd, addr);
+	if (pud_none(*pud) || unlikely(pud_bad(*pud)))
+		goto no_page_table;
+
+	pmd = pmd_offset(pud, addr);
+	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
+		goto no_page_table;
+
+	ptep = pte_offset_map(pmd, addr);
+	pte_unmap(ptep);
+
+	if (mm)
+		mmput(mm);
+
+	task_unlock(task);
+
+copy_vminfo_to_user:
+	results[2] = (unsigned long) ptep;
+
+	if (copy_to_user(user_addr, results, VMINFO_RESULTS*sizeof(long)))
+		ret = -EFAULT;
+
+	return ret;
+
+no_page_table:
+	ptep = NULL;
+
+	ret = ENOMEM;
+
+	goto copy_vminfo_to_user;
+
+}
Index: linux-2.6.17-rc1-ptsh/include/asm-x86_64/unistd.h
===================================================================
--- linux-2.6.17-rc1-ptsh.orig/include/asm-x86_64/unistd.h
+++ linux-2.6.17-rc1-ptsh/include/asm-x86_64/unistd.h
@@ -611,8 +611,10 @@ __SYSCALL(__NR_set_robust_list, sys_set_
 __SYSCALL(__NR_get_robust_list, sys_get_robust_list)
 #define __NR_splice 275
 __SYSCALL(__NR_splice, sys_splice)
-
-#define __NR_syscall_max __NR_splice
+#define __NR_get_vminfo 276
+__SYSCALL(__NR_get_vminfo, sys_get_vminfo)
+
+#define __NR_syscall_max __NR_get_vminfo
 
 #ifndef __NO_STUBS

[-- Attachment #3: shpt_test1.c --]
[-- Type: text/x-csrc, Size: 10201 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <sys/shm.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/time.h>
#include <asm/atomic.h>

#define PAGE_SIZE 4096
#define INITIAL_HOLE 50
#define ADDR (0x00002aaaaaf00000UL)
#define ADDR1 (ADDR + INITIAL_HOLE*PAGE_SIZE)
#define MAX_NCHILD 128
#define PAGES_PER_PTE_PAGE 512
#define PAGE_SIZE_IN_KB (PAGE_SIZE/1024)

/*
 * Now test what happens if we create a shared region under the shpt
 * kernel and then the children remap part of the shared region.  We
 * use a hacked kernel with an additional system call (get_vminfo())
 * [see below for details] to make sure that each child gets its own
 * pfn in this case and that the shared page table entries are no
 * longer shared.  This test was suggested by Christoph Lameter @ SGI.
 */

static inline long
timeval_diff_in_ms(struct timeval *a, struct timeval *b)
{
	if (a->tv_usec > b->tv_usec) {
		// borrow
		a->tv_sec--;
		a->tv_usec += 1000000;
	}
	return(1000 * (b->tv_sec - a->tv_sec) +
	       (b->tv_usec - a->tv_usec)/1000);
}

long get_pte_kb()
{
	FILE *f;
	int err;
	long parameter;
	char str[128];

	f = fopen("/proc/meminfo", "r");
	if (!f) {
		printf("fopen() error in %s\n", __FUNCTION__);
		perror("fopen");
		exit(-1);
	}
	while (1) {
		parameter = -1;
		err = fscanf(f, "%s %ld", str, &parameter);
		if (err == EOF || !strcmp(str, "PageTables:"))
			break;
	}
	return parameter;
}

/*
 * hacked in system call to return the following info for a virtual address:
 *	results[0] = pfn of virtual address "addr"
 *	results[1] = nodeid where pfn lives (for NUMA boxen)
 *	results[2] = address of the pte
 *
 * Note well, this assumes the page has already been faulted in.
 * If this hasn't happened, the system call results are undefined.
 */
#define __NR_get_vminfo 276
static inline long
get_vminfo(pid_t pid, void *addr, long *results)
{
	return (syscall(__NR_get_vminfo, pid, addr, results));
}

main(int argc, char **argv)
{
	char *pages, *pages1;
	int count;
	int errors, pc, nchild, child;
	long shared_region_size, i, remapped_region_size;
	long starting_pte_kb, ending_pte_kb;
	pid_t pid[MAX_NCHILD];
	void *addr = (void *) ADDR;
	void *addr1 = (void *) ADDR1;
	atomic_t *atom = (atomic_t *) (addr + 8);
	volatile long *flag = (long *) (addr + 16);
	struct timeval start, forkend, end;
	long results[3];
	long *page_pfn, *page_ptep;
	int pfn_should_match_dont=0, ptep_should_match_dont=0;
	int pfn_shouldnt_match_do=0, ptep_shouldnt_match_do=0;

	setbuf(stdout, NULL);
	printf("Main starts......\n");
	printf("argc=%d\n", argc);

	/* first arg is the number of pte pages to use */
	if (argc == 1)
		pc = 1;
	else
		sscanf(argv[1], "%d", &pc);

	/* second arg is the number of threads to create */
	if (argc < 3)
		nchild = 1;
	else
		sscanf(argv[2], "%d", &nchild);

	/* find out how many pages of pte's we've already used */
	/* (we'll subtract this off of the number we get after */
	/* the children are all forked, below.) ............. */
	starting_pte_kb = get_pte_kb();

	pc = PAGES_PER_PTE_PAGE * pc;
	if (nchild > MAX_NCHILD)
		nchild = MAX_NCHILD;
	printf("Number of pages to map: %d nchild: %d\n", pc, nchild);

	shared_region_size = (long)pc * (long)PAGE_SIZE;
	printf("Shared region size: %5.2f GB\n",
	       shared_region_size/(1024.0*1024.0*1024.0));

	pages = (char *)mmap(addr, shared_region_size,
			     PROT_READ | PROT_WRITE,
			     MAP_SHARED | MAP_ANONYMOUS | MAP_FIXED, 0, 0);
	if (pages == MAP_FAILED) {
		printf("mmap() failed.\n");
		perror("mmap");
		exit(999);
	}
	printf("mapped region starts at: %p\n", pages);

	/* initialize the communication flags in the shared region */
	atomic_set(atom, 0);
	*flag = 0;

	printf("writing data..........\n");
	errors = 0;
	/* initialize the first byte of page N to N */
	for (i=0; i<shared_region_size; i+=PAGE_SIZE) {
		pages[i] = (char) (i/PAGE_SIZE);
	}
	/* paranoia check. ........................ */
	for (i=0; i<shared_region_size; i+=PAGE_SIZE) {
		if (pages[i] != (char) (i/PAGE_SIZE))
			errors++;
	}
	printf("done writing data...... errors=%d\n", errors);

	printf("Forking.....\n");
	gettimeofday(&start, NULL);
	for (child=0; child<nchild; child++) {
		if (pid[child]=fork()) {
			if ((nchild < 16) || ((child % 16) == 0))
				printf("parent (pid:%d) created child #%3d: pid:%d\n",
				       getpid(), child, pid[child]);
		} else {
			char tmp, rc;
			int tests = 0;
			errors = 0;
			/* check to make sure child sees same data as above */
			/* also record pfn and pte addresses for later comparison */
			page_pfn = (long *) malloc(pc*sizeof(long));
			page_ptep = (long *) malloc(pc*sizeof(long));
			if (!page_pfn || !page_ptep) {
				printf("Ack, PID: %d couldn't allocate both page_pfn (%p) and page_pte (%p)\n",
				       getpid(), page_pfn, page_ptep);
				perror("mmap");
				atomic_add(1, atom);
				exit(-1);
			} else
				printf("PID: %d page_pfn:%p page_pte:%p\n",
				       getpid(), page_pfn, page_ptep);
			for (i=0; i<shared_region_size; i+=PAGE_SIZE) {
				if (pages[i] != (char) (i/PAGE_SIZE))
					errors++;
				rc = get_vminfo(getpid(), &pages[i], results);
				if (rc >= 0) {
					page_pfn[i/PAGE_SIZE] = results[0];
					page_ptep[i/PAGE_SIZE] = results[2];
				} else
					printf("PID:%d i=%d vmaddr:%p returned %d\n",
					       getpid(), i, &pages[i], rc);
			}
			if (errors > 0)
				printf("child(%d) sees errors=%d\n",
				       getpid(), errors);

			/* now (re-)mmap a portion of the shared region */
			/* yeah, this is a little bit arbitrary :-) */
			remapped_region_size = PAGE_SIZE*pc/8;
			printf("remapped_region_size: %ld\n",
			       remapped_region_size);
			pages1 = (char *)mmap(addr1, remapped_region_size,
					      PROT_READ | PROT_WRITE,
					      MAP_ANONYMOUS | MAP_FIXED | MAP_PRIVATE,
					      0, 0);
			if (pages1 == MAP_FAILED) {
				printf("mmap() failed in child process %d\n",
				       getpid());
				perror("mmap");
				atomic_add(1, atom);
				exit(-1);
			}

			/* fault the pages in and put some different data in the pages */
			tmp = (char) (getpid() & 0xFF);
			for (i=0; i<remapped_region_size; i+=PAGE_SIZE) {
				pages1[i] = tmp;
			}

			errors = 0;
			tests = 0;
			/*
			 * now print out the pfn and pte addresses for the
			 * entire shared region (up to the end of the
			 * re-mapped region, above).  we expect some of the
			 * addresses to still use shared ptes, but above the
			 * initial hole, we should see distinct ptes.
			 * Another plausible implementation would be to
			 * unshare the entire region; that would be legal as
			 * well.
			 */
			for (i=0; i<INITIAL_HOLE*PAGE_SIZE+remapped_region_size; i+=PAGE_SIZE) {
				if (((void *)&pages[i]) < addr1) {
					/* we expect shared pte's in this region */
					if (pages[i] != tmp)
						errors++;
					tests++;
					rc = get_vminfo(getpid(), &pages[i], results);
					if (rc >= 0) {
						if (results[0] != page_pfn[i/PAGE_SIZE])
							pfn_should_match_dont++;
						if (results[2] != page_ptep[i/PAGE_SIZE])
							ptep_should_match_dont++;
						printf("Expect shared: PID:%d i=%d vmaddr:%p pfn:0x%lx was 0x%lx pte:0x%lx was 0x%lx\n",
						       getpid(), i, &pages[i],
						       page_pfn[i/PAGE_SIZE], results[0],
						       page_ptep[i/PAGE_SIZE], results[2]);
					} else
						printf("Expect shared: PID:%d i=%d vmaddr:%p returned %d\n",
						       getpid(), i, &pages[i], rc);
				} else {
					/* we expect unshared pte's in this region */
					if (pages[i] != tmp)
						errors++;
					tests++;
					rc = get_vminfo(getpid(), &pages[i], results);
					if (rc >= 0) {
						if (results[0] == page_pfn[i/PAGE_SIZE])
							pfn_shouldnt_match_do++;
						if (results[2] == page_ptep[i/PAGE_SIZE])
							ptep_shouldnt_match_do++;
						printf("Expect unshared: PID:%d i=%d vmaddr:%p pfn:0x%lx was 0x%lx pte:0x%lx was 0x%lx\n",
						       getpid(), i, &pages[i],
						       page_pfn[i/PAGE_SIZE], results[0],
						       page_ptep[i/PAGE_SIZE], results[2]);
					} else
						printf("Expect unshared: PID:%d i=%d vmaddr:%p returned %d\n",
						       getpid(), i, &pages[i], rc);
				}
			}

			/* print the number of errors found for the region we have examined */
			if (errors > 0)
				printf("child(%d) sees errors=%d in region, tests:%d\n",
				       getpid(), errors, tests);
			if (pfn_should_match_dont || ptep_should_match_dont ||
			    pfn_shouldnt_match_do || ptep_shouldnt_match_do) {
				int tmp = pfn_should_match_dont + ptep_should_match_dont +
					  pfn_shouldnt_match_do + ptep_shouldnt_match_do;
				printf("child(%d) sees match/mismatch errors %d in region, tests:%d\n",
				       getpid(), tmp, tests);
				printf("child(%d) pfn_should_match_dont: %d\n",
				       getpid(), pfn_should_match_dont);
				printf("child(%d) ptep_should_match_dont: %d\n",
				       getpid(), ptep_should_match_dont);
				printf("child(%d) pfn_shouldnt_match_do: %d\n",
				       getpid(), pfn_shouldnt_match_do);
				printf("child(%d) ptep_shouldnt_match_do: %d\n",
				       getpid(), ptep_shouldnt_match_do);
			}

			/* indicate that this child is done */
			atomic_add(1, atom);
			while (!(*flag))
				sleep(1);
			exit(errors);
		}
	}
	gettimeofday(&forkend, 0);

	/* wait for all of the children to get started and check their data */
	printf("Parent is waiting....\n");
	while (atomic_read(atom) < nchild)
		usleep(1000L);
	gettimeofday(&end, NULL);
	printf("All children are now sleeping....elapsed ms: %ld fork ms: %ld\n",
	       timeval_diff_in_ms(&start, &end),
	       timeval_diff_in_ms(&start, &forkend));

	/* now let us check to see how many pages of pte's have been used */
	ending_pte_kb = get_pte_kb();
	/* We can use this number to see if the shared region is still using
	 * any shared pte's after the mmap() by the child, although it really
	 * doesn't matter (e. g. it would be allowed to revert the whole
	 * shared region to non-shared ptes.) ........................... */
	printf("pte pages used: \t%8ld\n",
	       (ending_pte_kb-starting_pte_kb)/PAGE_SIZE_IN_KB);
	printf("KB of pte pages used: \t%8ld\n",
	       ending_pte_kb-starting_pte_kb);

	*flag = 1;
	for (i=0; i<nchild; i++) {
		int status;
		waitpid(pid[i], &status, 0);
		if (status != 0)
			printf("pid %d exited with non-zero status: %d\n",
			       pid[i], status);
	}
	munmap(pages, shared_region_size);
	printf("Main exits......\n");
}
* Re: [PATCH 0/2][RFC] New version of shared page tables 2006-05-06 15:25 ` Hugh Dickins 2006-05-08 19:32 ` Ray Bryant @ 2006-05-08 19:49 ` Brian Twichell 2006-05-09 3:42 ` Nick Piggin 2006-05-09 19:22 ` Hugh Dickins 1 sibling, 2 replies; 16+ messages in thread From: Brian Twichell @ 2006-05-08 19:49 UTC (permalink / raw) To: Hugh Dickins; +Cc: Dave McCracken, Linux Memory Management, Linux Kernel Hugh Dickins wrote: >Let me say (while perhaps others are still reading) that I'm seriously >wondering whether you should actually restrict your shared pagetable work >to the hugetlb case. I realize that would be a disappointing limitation >to you, and would remove the 25%/50% improvement cases, leaving only the >3%/4% last-ounce-of-performance cases. > >But it's worrying me a lot that these complications to core mm code will >_almost_ never apply to the majority of users, will get little testing >outside of specialist setups. I'd feel safer to remove that "almost", >and consign shared pagetables to the hugetlb ghetto, if that would >indeed remove their handling from the common code paths. (Whereas, >if we didn't have hugetlb, I would be arguing strongly for shared pts.) > Hi, In the case of x86-64, if pagetable sharing for small pages was eliminated, we'd lose more than the 27-33% throughput improvement observed when the bufferpools are in small pages. We'd also lose a significant chunk of the 3% improvement observed when the bufferpools are in hugepages. This occurs because there is still small page pagetable sharing being achieved, minimally for database text, when the bufferpools are in hugepages. The performance counters indicated that ITLB and DTLB page walks were reduced by 28% and 10%, respectively, in the x86-64/hugepage case. To be clear, all measurements discussed in my post were performed with kernels config'ed to share pagetables for both small pages and hugepages. 
If we had to choose between pagetable sharing for small pages and hugepages, we would be in favor of retaining pagetable sharing for small pages. That is where the discernable benefit is for customers that run with "out-of-the-box" settings. Also, there is still some benefit there on x86-64 for customers that use hugepages for the bufferpools. Cheers, Brian -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 0/2][RFC] New version of shared page tables 2006-05-08 19:49 ` Brian Twichell @ 2006-05-09 3:42 ` Nick Piggin 2006-05-10 2:07 ` Chen, Kenneth W 2006-05-10 19:45 ` Brian Twichell 2006-05-09 19:22 ` Hugh Dickins 1 sibling, 2 replies; 16+ messages in thread From: Nick Piggin @ 2006-05-09 3:42 UTC (permalink / raw) To: Brian Twichell Cc: Hugh Dickins, Dave McCracken, Linux Memory Management, Linux Kernel Brian Twichell wrote: > Hugh Dickins wrote: > >> Let me say (while perhaps others are still reading) that I'm seriously >> wondering whether you should actually restrict your shared pagetable >> work >> to the hugetlb case. I realize that would be a disappointing limitation >> to you, and would remove the 25%/50% improvement cases, leaving only the >> 3%/4% last-ounce-of-performance cases. >> >> But it's worrying me a lot that these complications to core mm code will >> _almost_ never apply to the majority of users, will get little testing >> outside of specialist setups. I'd feel safer to remove that "almost", >> and consign shared pagetables to the hugetlb ghetto, if that would >> indeed remove their handling from the common code paths. (Whereas, >> if we didn't have hugetlb, I would be arguing strongly for shared pts.) >> > Hi, > > In the case of x86-64, if pagetable sharing for small pages was > eliminated, we'd lose more than the 27-33% throughput improvement > observed when the bufferpools are in small pages. We'd also lose a > significant chunk of the 3% improvement observed when the bufferpools > are in hugepages. This occurs because there is still small page > pagetable sharing being achieved, minimally for database text, when > the bufferpools are in hugepages. The performance counters indicated > that ITLB and DTLB page walks were reduced by 28% and 10%, > respectively, in the x86-64/hugepage case. Aside, can you just enlighten me as to how TLB misses are improved on x86-64? 
As far as I knew, it doesn't have ASIDs so I wouldn't have thought it could share TLBs anyway... But I'm not up to scratch with modern implementations. > > To be clear, all measurements discussed in my post were performed with > kernels config'ed to share pagetables for both small pages and hugepages. > > If we had to choose between pagetable sharing for small pages and > hugepages, we would be in favor of retaining pagetable sharing for > small pages. That is where the discernable benefit is for customers > that run with "out-of-the-box" settings. Also, there is still some > benefit there on x86-64 for customers that use hugepages for the > bufferpools. Of course if it was free performance then we'd want it. The downsides are that it is a significant complexity for a pretty small (3%) performance gain for your apparent target workload, which is pretty uncommon among all Linux users. Ignoring the complexity, it is still not free. Sharing data across processes adds to synchronisation overhead and hurts scalability. Some of these page fault scalability scenarios have shown to be important enough that we have introduced complexity _there_. And it seems customers running "out-of-the-box" settings really want to start using hugepages if they're interested in getting the most performance possible, no? --- Send instant messages to your online friends http://au.messenger.yahoo.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* RE: [PATCH 0/2][RFC] New version of shared page tables 2006-05-09 3:42 ` Nick Piggin @ 2006-05-10 2:07 ` Chen, Kenneth W 2006-05-10 19:45 ` Brian Twichell 1 sibling, 0 replies; 16+ messages in thread From: Chen, Kenneth W @ 2006-05-10 2:07 UTC (permalink / raw) To: 'Nick Piggin', Brian Twichell Cc: Hugh Dickins, Dave McCracken, Linux Memory Management, Linux Kernel Nick Piggin wrote on Monday, May 08, 2006 8:42 PM > Brian Twichell wrote: > > In the case of x86-64, if pagetable sharing for small pages was > > eliminated, we'd lose more than the 27-33% throughput improvement > > observed when the bufferpools are in small pages. We'd also lose a > > significant chunk of the 3% improvement observed when the bufferpools > > are in hugepages. This occurs because there is still small page > > pagetable sharing being achieved, minimally for database text, when > > the bufferpools are in hugepages. The performance counters indicated > > that ITLB and DTLB page walks were reduced by 28% and 10%, > > respectively, in the x86-64/hugepage case. > > > Aside, can you just enlighten me as to how TLB misses are improved on > x86-64? As far as I knew, it doesn't have ASIDs so I wouldn't have thought > it could share TLBs anyway... > But I'm not up to scratch with modern implementations. Allow me to jump in if I may: The number of TLB misses did not change much (both i-side and d-side, as expected). What changed is that the penalty of a TLB miss is reduced: i.e., the number of page table walks performed by the hardware is reduced. This is due to specialized buffering of information that reduces the need to perform page walks. With page table sharing, the overall size of the page tables is reduced; in turn, this gives a better hit rate on the buffered items and helps to mitigate page walks upon a TLB miss. - Ken -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 0/2][RFC] New version of shared page tables 2006-05-09 3:42 ` Nick Piggin 2006-05-10 2:07 ` Chen, Kenneth W @ 2006-05-10 19:45 ` Brian Twichell 2006-05-12 5:17 ` Nick Piggin 1 sibling, 1 reply; 16+ messages in thread From: Brian Twichell @ 2006-05-10 19:45 UTC (permalink / raw) To: Nick Piggin Cc: Hugh Dickins, Dave McCracken, Linux Memory Management, Linux Kernel Nick Piggin wrote: > Brian Twichell wrote: > >> >> If we had to choose between pagetable sharing for small pages and >> hugepages, we would be in favor of retaining pagetable sharing for >> small pages. That is where the discernable benefit is for customers >> that run with "out-of-the-box" settings. Also, there is still some >> benefit there on x86-64 for customers that use hugepages for the >> bufferpools. > > > Of course if it was free performance then we'd want it. The downsides > are that it > is a significant complexity for a pretty small (3%) performance gain > for your apparent > target workload, which is pretty uncommon among all Linux users. Our performance data demonstrated that the potential gain for the non-hugepage case is much higher than 3%. > > Ignoring the complexity, it is still not free. Sharing data across > processes adds to > synchronisation overhead and hurts scalability. Some of these page > fault scalability > scenarios have shown to be important enough that we have introduced > complexity _there_. True, but this needs to be balanced against the fact that pagetable sharing will reduce the number of page faults when it is achieved. Let's say you have N processes which touch all the pages in an M page shared memory region. Without shared pagetables this requires N*M page faults; if pagetable sharing is achieved, only M pagefaults are required. > > And it seems customers running "out-of-the-box" settings really want > to start using > hugepages if they're interested in getting the most performance > possible, no? 
My perspective is that, once the customer is required to invoke "echo XXX > /proc/sys/vm/nr_hugepages" they've left the "out-of-the-box" domain, and entered the domain of hoping that the number of hugepages is sufficient, because if it's not, they'll probably need to reboot, which can be pretty inconvenient for a production transaction-processing application. Cheers, Brian -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 0/2][RFC] New version of shared page tables 2006-05-10 19:45 ` Brian Twichell @ 2006-05-12 5:17 ` Nick Piggin 0 siblings, 0 replies; 16+ messages in thread From: Nick Piggin @ 2006-05-12 5:17 UTC (permalink / raw) To: Brian Twichell Cc: Hugh Dickins, Dave McCracken, Linux Memory Management, Linux Kernel Brian Twichell wrote: > Nick Piggin wrote: >> Of course if it was free performance then we'd want it. The downsides >> are that it >> is a significant complexity for a pretty small (3%) performance gain >> for your apparent >> target workload, which is pretty uncommon among all Linux users. > > > Our performance data demonstrated that the potential gain for the > non-hugepage case is much higher than 3%. The point is, there are hugepages. They were a significant additional complexity but the concession was made because they did provide a large speedup for databases. > >> >> Ignoring the complexity, it is still not free. Sharing data across >> processes adds to >> synchronisation overhead and hurts scalability. Some of these page >> fault scalability >> scenarios have shown to be important enough that we have introduced >> complexity _there_. > > > True, but this needs to be balanced against the fact that pagetable > sharing will reduce the number of page faults when it is achieved. > Let's say you have N processes which touch all the pages in an M page > shared memory region. Without shared pagetables this requires N*M page > faults; if pagetable sharing is achieved, only M pagefaults are required. > >> >> And it seems customers running "out-of-the-box" settings really want >> to start using >> hugepages if they're interested in getting the most performance >> possible, no? 
> > > My perspective is that, once the customer is required to invoke "echo > XXX > /proc/sys/vm/nr_hugepages" they've left the "out-of-the-box" > domain, and entered the domain of hoping that the number of hugepages is > sufficient, because if it's not, they'll probably need to reboot, which > can be pretty inconvenient for a production transaction-processing > application. I think it is pretty easy to reserve hugepages at bootup. This is what a production transaction processing system will be doing, won't it? Especially if they're performance constrained and hugepages gives them a 30% performance boost. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
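[Editorial note: for reference, these are the two standard Linux interfaces being discussed — the boot-time route is the one Nick alludes to, since it reserves the pool before memory fragments. Values are illustrative; both require root:]

```shell
# 1) At boot, via the kernel command line (reserved before fragmentation):
#       hugepages=4096
#
# 2) At runtime (the mechanism Brian mentions; may come up short or
#    require a reboot if memory is already fragmented):
echo 4096 > /proc/sys/vm/nr_hugepages
grep HugePages /proc/meminfo
```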
* Re: [PATCH 0/2][RFC] New version of shared page tables 2006-05-08 19:49 ` Brian Twichell 2006-05-09 3:42 ` Nick Piggin @ 2006-05-09 19:22 ` Hugh Dickins 1 sibling, 0 replies; 16+ messages in thread From: Hugh Dickins @ 2006-05-09 19:22 UTC (permalink / raw) To: Brian Twichell; +Cc: Dave McCracken, Linux Memory Management, Linux Kernel On Mon, 8 May 2006, Brian Twichell wrote: > > If we had to choose between pagetable sharing for small pages and hugepages, > we would be in favor of retaining pagetable sharing for small pages. That is > where the discernable benefit is for customers that run with "out-of-the-box" > settings. Also, there is still some benefit there on x86-64 for customers > that use hugepages for the bufferpools. Thanks for the further info, Brian. Okay, the hugepage end of it does add a different kind of complexity, in an area already complex from the different arch implementations. If you've found that a significant part of the hugepage test improvement is actually due to the smallpage changes, let's turn around what I said, and suggest Dave concentrate on getting the smallpage changes right, putting the hugepage part of it on the backburner at least for now (or if he's particularly keen still to present it, as 3/3). Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 0/2][RFC] New version of shared page tables 2006-05-03 15:43 [PATCH 0/2][RFC] New version of shared page tables Dave McCracken 2006-05-03 15:56 ` Hugh Dickins @ 2006-05-05 19:25 ` Brian Twichell 2006-05-06 3:37 ` Chen, Kenneth W 1 sibling, 1 reply; 16+ messages in thread From: Brian Twichell @ 2006-05-05 19:25 UTC (permalink / raw) To: Dave McCracken Cc: Hugh Dickins, Linux Memory Management, Linux Kernel, slpratt

Hi,

We reevaluated shared pagetables with recent patches from Dave. As with our previous evaluation, a database transaction-processing workload was used. This time our evaluation focused on a 4-way x86-64 configuration with 8 GB of memory.

In the case that the bufferpools were in small pages, shared pagetables provided a 27% improvement in transaction throughput. The performance increase is attributable to multiple factors. First, pagetable memory consumption was reduced from 1.65 GB to 51 MB, freeing up 20% of the system's memory. This memory was devoted to enlarging the database bufferpools, which allowed more database data to be cached in memory. The effect of this was to reduce the number of disk I/O's per transaction by 23%, which contributed to a similar reduction in the context switch rate.

A second major component of the performance improvement is reduced TLB and cache miss rates, due to the smaller pagetable footprint. To try to isolate this benefit, we performed an experiment where pagetables were shared, but the database bufferpools were not enlarged. In this configuration, shared pagetables provided a 9% increase in database transaction throughput. Analysis of processor performance counters revealed the following benefits from pagetable sharing:

- ITLB and DTLB page walks were reduced by 27% and 26%, respectively.
- L1 and L2 cache misses were reduced by 5%. This is due to fewer pagetable entries crowding the caches.
- Front-side bus traffic was reduced approximately 10%.

When the bufferpools were in hugepages, shared pagetables provided a 3% increase in database transaction throughput. Some of the underlying benefits of pagetable sharing were as follows:

- Pagetable memory consumption was reduced from 53 MB to 37 MB.
- ITLB and DTLB page walks were reduced by 28% and 10%, respectively.
- L1 and L2 cache misses were reduced by 2% and 6.5%, respectively.
- Front-side bus traffic was reduced by approximately 4%.

The database transaction throughput achieved using small pages with shared pagetables (with bufferpools enlarged) was within 3% of the transaction throughput achieved using hugepages without shared pagetables. Thus shared pagetables provided nearly all the benefit of hugepages, without having to deal with the limitations of hugepages. We believe this would be a significant benefit to customers running these types of workloads.

We also measured the benefit of shared pagetables on our larger setups. On our 4-way x86-64 setup with 64 GB memory, using small pages for the bufferpools, shared pagetables provided a 33% increase in transaction throughput. Using hugepages for the bufferpools, shared pagetables provided a 3% increase. Performance with small pages and shared pagetables was within 4% of the performance using hugepages without shared pagetables.

On our ppc64 setups we used both Oracle and DB2 to evaluate the benefit of shared pagetables. When database bufferpools were in small pages, shared pagetables provided an increase in database transaction throughput in the range of 60-65%, while in the hugepage case the improvement was up to 2.4%.

We thank Kshitij Doshi and Ken Chen from Intel for their assistance in analyzing the x86-64 data.

Cheers, Brian -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* RE: [PATCH 0/2][RFC] New version of shared page tables 2006-05-05 19:25 ` Brian Twichell @ 2006-05-06 3:37 ` Chen, Kenneth W 0 siblings, 0 replies; 16+ messages in thread From: Chen, Kenneth W @ 2006-05-06 3:37 UTC (permalink / raw) To: 'Brian Twichell', Dave McCracken Cc: Hugh Dickins, Linux Memory Management, Linux Kernel, slpratt Brian Twichell wrote on Friday, May 05, 2006 12:26 PM > We also measured the benefit of shared pagetables on our larger setups. > On our 4-way x86-64 setup with 64 GB memory, using small pages for the > bufferpools, shared pagetables provided a 33% increase in transaction > throughput. Using hugepages for the bufferpools, shared pagetables > provided a 3% increase. Performance with small pages and shared > pagetables was within 4% of the performance using hugepages without > shared pagetables. > > On our ppc64 setups we used both Oracle and DB2 to evaluate the benefit > of shared pagetables. When database bufferpools were in small pages, > shared pagetables provided an increase in database transaction > throughput in the range of 60-65%, while in the hugepage case the > improvement was up to 2.4%. I would also like to add that I have run this set of patches on ia64 and observed similar performance upside. We have multiple data points showing that this feature benefits several architectures. I'm advocating for the upstream inclusion. - Ken -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2006-05-22 18:00 UTC | newest] Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2006-05-03 15:43 [PATCH 0/2][RFC] New version of shared page tables Dave McCracken 2006-05-03 15:56 ` Hugh Dickins 2006-05-03 16:06 ` Dave McCracken 2006-05-06 15:25 ` Hugh Dickins 2006-05-08 19:32 ` Ray Bryant 2006-05-16 21:09 ` Dave McCracken 2006-05-19 16:55 ` Ray Bryant 2006-05-22 18:00 ` Ray Bryant 2006-05-08 19:49 ` Brian Twichell 2006-05-09 3:42 ` Nick Piggin 2006-05-10 2:07 ` Chen, Kenneth W 2006-05-10 19:45 ` Brian Twichell 2006-05-12 5:17 ` Nick Piggin 2006-05-09 19:22 ` Hugh Dickins 2006-05-05 19:25 ` Brian Twichell 2006-05-06 3:37 ` Chen, Kenneth W