shared pagetable benchmarking

* shared pagetable benchmarking
@ 2002-12-20 11:11 Andrew Morton
  2002-12-20 11:13 ` William Lee Irwin III
                   ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Andrew Morton @ 2002-12-20 11:11 UTC (permalink / raw)
  To: Dave McCracken, linux-mm

Did a bit of timing and profiling.  It's a uniprocessor
kernel, 7G, PAE.

The workload is application and removal of ~80 patches using
my patch scripts.  Tons and tons of forks from bash.

2.5 ends up being 13% slower than 2.4, after disabling highpte
to make it fair.  3%-odd of this is HZ=1000.  So say 10%.

Pagetable sharing actually slowed this test down by several
percent overall.  Which is unfortunate, because the main
thing which Linus likes about shared pagetables is that it
"speeds up forks".

Is there anything we can do to fix all of this up a bit?



2.4.21-pre2:

c0106d60 system_call                                  10   0.1786
c012ca00 __free_pages_ok                              10   0.0124
c0114c38 mm_init                                      13   0.0396
c0124a84 find_vma                                     13   0.1548
c01283ac generic_file_write                           13   0.0073
c012cd24 rmqueue                                      14   0.0201
c0122570 __free_pte                                   16   0.1667
c0123df0 handle_mm_fault                              16   0.0625
c0123aac do_anonymous_page                            25   0.0801
c0126c3c file_read_actor                              33   0.1684
c0123be4 do_no_page                                   37   0.0706
c01226d0 copy_page_range                              47   0.0810
c0112878 do_page_fault                                49   0.0352
c01225e8 clear_page_tables                            70   0.3017
c0122914 zap_page_range                               72   0.1118
c01234b0 do_wp_page                                  275   0.3736
00000000 total                                      1062   0.0008
created /tmp/prof.time
akpm-prof pushpatch 99  11.83s user 10.56s system 99% cpu 22.439 total

c01283ac generic_file_write                            9   0.0050
c012d8b4 free_page_and_swap_cache                      9   0.1500
c012625c __find_get_page                              10   0.2083
c0118cdc exit_notify                                  11   0.0174
c012ca00 __free_pages_ok                              11   0.0137
c013c030 link_path_walk                               16   0.0077
c012cd24 rmqueue                                      18   0.0259
c0123aac do_anonymous_page                            20   0.0641
c0126c3c file_read_actor                              25   0.1276
c01226d0 copy_page_range                              27   0.0466
c0123be4 do_no_page                                   29   0.0553
c01225e8 clear_page_tables                            32   0.1379
c01052b0 poll_idle                                    33   0.8250
c0122914 zap_page_range                               42   0.0652
c0112878 do_page_fault                                50   0.0359
c01234b0 do_wp_page                                  161   0.2188
00000000 total                                       791   0.0006
created /tmp/prof.time
akpm-prof poppatch 99  8.60s user 7.57s system 97% cpu 16.530 total



2.5.52-mm3:

c012b998 free_hot_cold_page                           94   0.4896
c0117348 do_schedule                                 103   0.1717
c012ba7c buffered_rmqueue                            110   0.5729
c01c1b4c strnlen_user                                116   1.3810
c012e36c kmem_cache_alloc                            120   1.8750
c0134454 find_vma                                    133   1.5114
c010a558 system_call                                 134   3.0455
c01504ac d_lookup                                    143   0.6164
c0148af4 link_path_walk                              153   0.0915
c0116b88 kmap_atomic_to_page                         175   1.9886
c0133060 handle_mm_fault                             195   0.6414
c01c1d48 __copy_from_user                            212   1.8929
c011598c pte_alloc_one                               213   1.6641
c0132c74 do_anonymous_page                           260   0.6311
c011cab0 do_softirq                                  300   1.7045
c01c1ce0 __copy_to_user                              369   3.5481
c0132e10 do_no_page                                  529   0.8936
c013c9e4 pte_unshare                                 572   0.5789
c0116b04 kmap_atomic                                 585   5.2232
c0131b44 zap_pte_range                               585   1.4199
c0115bc0 do_page_fault                               600   0.4808
c0136428 page_add_rmap                               766   2.1517
c0131890 clear_page_tables                           860   2.6220
c013658c page_remove_rmap                            928   1.9333
c013250c do_wp_page                                 2594   3.3601
00000000 total                                     15261   0.0097
created /tmp/prof.time
akpm-prof pushpatch 99  12.36s user 14.61s system 97% cpu 27.768 total


c0117348 do_schedule                                  77   0.1283
c010a558 system_call                                  85   1.9318
c0134454 find_vma                                     90   1.0227
c01504ac d_lookup                                    106   0.4569
c0116b88 kmap_atomic_to_page                         107   1.2159
c0133060 handle_mm_fault                             113   0.3717
c011598c pte_alloc_one                               135   1.0547
c0148af4 link_path_walk                              135   0.0807
c01c1d48 __copy_from_user                            162   1.4464
c0132c74 do_anonymous_page                           218   0.5291
c01c1ce0 __copy_to_user                              297   2.8558
c011cab0 do_softirq                                  319   1.8125
c0132e10 do_no_page                                  325   0.5490
c0131b44 zap_pte_range                               362   0.8786
c013c9e4 pte_unshare                                 375   0.3796
c0116b04 kmap_atomic                                 384   3.4286
c0115bc0 do_page_fault                               447   0.3582
c0136428 page_add_rmap                               505   1.4185
c0131890 clear_page_tables                           563   1.7165
c013658c page_remove_rmap                            585   1.2188
c013250c do_wp_page                                 1559   2.0194
00000000 total                                     10586   0.0067
created /tmp/prof.time
akpm-prof poppatch 99  9.00s user 10.31s system 96% cpu 19.926 total

OK, remove shpte:
=================

c0134344 find_vma                                    110   1.2500
c01c07fc strnlen_user                                112   1.3333
c0118c94 copy_process                                113   0.0456
c012b77c buffered_rmqueue                            120   0.6250
c014f0a0 __d_lookup                                  133   0.6520
c010a558 system_call                                 145   3.2955
c01474e0 link_path_walk                              162   0.0769
c0116b28 kmap_atomic_to_page                         165   1.8750
c0132f00 handle_mm_fault                             166   0.5461
c012e02c kmem_cache_alloc                            185   2.8906
c012e0d8 kmem_cache_free                             188   2.9375
c0115a0c pgd_alloc                                   193   0.9650
c01c09f8 __copy_from_user                            196   1.7500
c011598c pte_alloc_one                               216   1.6875
c0132b24 do_anonymous_page                           249   0.6163
c011c880 do_softirq                                  293   1.6648
c01c0990 __copy_to_user                              418   4.0192
c0132cb8 do_no_page                                  446   0.7637
c01317b4 copy_page_range                             496   0.7425
c0116aa4 kmap_atomic                                 591   5.2768
c0115b60 do_page_fault                               594   0.4760
c0131a50 zap_pte_range                               609   1.2378
c0136314 page_add_rmap                               632   1.7556
c01314f0 clear_page_tables                           688   2.2051
c013647c page_remove_rmap                            817   1.7021
c01323d4 do_wp_page                                 2600   3.4759
00000000 total                                     14713   0.0094
created /tmp/prof.time
akpm-prof pushpatch 99  12.29s user 14.40s system 99% cpu 26.913 total

c014f0a0 __d_lookup                                   91   0.4461
c010a558 system_call                                  93   2.1136
c0132f00 handle_mm_fault                             111   0.3651
c012e02c kmem_cache_alloc                            113   1.7656
c01474e0 link_path_walk                              118   0.0560
c011598c pte_alloc_one                               129   1.0078
c0116b28 kmap_atomic_to_page                         129   1.4659
c012e0d8 kmem_cache_free                             140   2.1875
c01c09f8 __copy_from_user                            160   1.4286
c0115a0c pgd_alloc                                   170   0.8500
c0132b24 do_anonymous_page                           184   0.4554
c011c880 do_softirq                                  297   1.6875
c01317b4 copy_page_range                             309   0.4626
c01c0990 __copy_to_user                              318   3.0577
c0132cb8 do_no_page                                  335   0.5736
c0131a50 zap_pte_range                               364   0.7398
c01314f0 clear_page_tables                           393   1.2596
c0115b60 do_page_fault                               441   0.3534
c0136314 page_add_rmap                               441   1.2250
c0116aa4 kmap_atomic                                 448   4.0000
c013647c page_remove_rmap                            550   1.1458
c01323d4 do_wp_page                                 1593   2.1297
00000000 total                                     10335   0.0066
created /tmp/prof.time
akpm-prof poppatch 99  9.07s user 10.03s system 99% cpu 19.290 total

Also remove highpte
===================

c01c037c strnlen_user                                108   1.2857
c010a558 system_call                                 111   2.5227
c012b79c buffered_rmqueue                            113   0.5885
c0134144 find_vma                                    117   1.3295
c01171e4 do_schedule                                 118   0.1954
c0132d20 handle_mm_fault                             132   0.4583
c014ebf0 __d_lookup                                  142   0.6961
c0131010 page_address                                147   1.0500
c0116ae4 kmap_atomic                                 163   1.4554
c0147030 link_path_walk                              186   0.0882
c01c0578 __copy_from_user                            203   1.8125
c011597c pte_alloc_one                               224   1.5556
c0115a0c pgd_alloc                                   231   1.1550
c0132990 do_anonymous_page                           260   0.7065
c011c8d0 do_softirq                                  283   1.6080
c01c0510 __copy_to_user                              380   3.6538
c0132b00 do_no_page                                  401   0.7371
c01316fc copy_page_range                             451   0.7723
c0131944 zap_pte_range                               588   1.1575
c013602c page_add_rmap                               601   2.5042
c0115b60 do_page_fault                               607   0.4716
c0131440 clear_page_tables                           637   2.1233
c013611c page_remove_rmap                            657   2.0030
c01322b4 do_wp_page                                 2530   3.6988
00000000 total                                     13554   0.0087
created /tmp/prof.time
akpm-prof pushpatch 99  11.97s user 13.36s system 99% cpu 25.541 total

c0132d20 handle_mm_fault                             100   0.3472
c0147030 link_path_walk                              106   0.0503
c0116ae4 kmap_atomic                                 109   0.9732
c0115a0c pgd_alloc                                   140   0.7000
c011597c pte_alloc_one                               151   1.0486
c01c0578 __copy_from_user                            162   1.4464
c0132990 do_anonymous_page                           204   0.5543
c011c8d0 do_softirq                                  305   1.7330
c01c0510 __copy_to_user                              308   2.9615
c0132b00 do_no_page                                  310   0.5699
c01316fc copy_page_range                             314   0.5377
c013602c page_add_rmap                               361   1.5042
c0131944 zap_pte_range                               379   0.7461
c0115b60 do_page_fault                               409   0.3178
c013611c page_remove_rmap                            430   1.3110
c0131440 clear_page_tables                           443   1.4767
c01322b4 do_wp_page                                 1662   2.4298
00000000 total                                      9706   0.0062
created /tmp/prof.time
akpm-prof poppatch 99  8.86s user 9.33s system 98% cpu 18.433 total
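
As a rough cross-check of the percentages quoted at the top of this message, the sketch below sums the pushpatch and poppatch wall-clock times of each run and computes the relative slowdown.  The helper name slowdown_percent and the constant names are illustrative assumptions; only the timings themselves come from the runs above.

```c
/*
 * Relative slowdown of `test` versus `base`, in percent.
 * Hypothetical helper for illustration; not part of the patch scripts.
 */
double slowdown_percent(double base, double test)
{
	return (test - base) / base * 100.0;
}

/* Wall-clock totals (pushpatch + poppatch, in seconds) copied from above. */
const double T_24         = 22.439 + 16.530;	/* 2.4.21-pre2 */
const double T_25_SHPTE   = 27.768 + 19.926;	/* 2.5.52-mm3 as shipped */
const double T_25_NOSHPTE = 26.913 + 19.290;	/* shpte removed */
const double T_25_PLAIN   = 25.541 + 18.433;	/* shpte and highpte removed */
```

With these inputs, slowdown_percent(T_24, T_25_PLAIN) comes to a little under 13, and slowdown_percent(T_25_NOSHPTE, T_25_SHPTE) to a little over 3, which is consistent with "13% slower" and "several percent".
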
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/


* Re: shared pagetable benchmarking
  2002-12-20 11:11 shared pagetable benchmarking Andrew Morton
@ 2002-12-20 11:13 ` William Lee Irwin III
  2002-12-20 16:30 ` Dave McCracken
  2002-12-23 18:19 ` Dave McCracken
  2 siblings, 0 replies; 24+ messages in thread
From: William Lee Irwin III @ 2002-12-20 11:13 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Dave McCracken, linux-mm

On Fri, Dec 20, 2002 at 03:11:09AM -0800, Andrew Morton wrote:
> Did a bit of timing and profiling.  It's a uniprocessor
> kernel, 7G, PAE.
> The workload is application and removal of ~80 patches using
> my patch scripts.  Tons and tons of forks from bash.
> 2.5 ends up being 13% slower than 2.4, after disabling highpte
> to make it fair.  3%-odd of this is HZ=1000.  So say 10%.
> Pagetable sharing actually slowed this test down by several
> percent overall.  Which is unfortunate, because the main
> thing which Linus likes about shared pagetables is that it
> "speeds up forks".
> Is there anything we can do to fix all of this up a bit?

For testing purposes, try removing the opportunistic mmap()-time
sharing.


Bill

* Re: shared pagetable benchmarking
  2002-12-20 11:11 shared pagetable benchmarking Andrew Morton
  2002-12-20 11:13 ` William Lee Irwin III
@ 2002-12-20 16:30 ` Dave McCracken
  2002-12-20 19:59   ` Andrew Morton
  2002-12-23 18:19 ` Dave McCracken
  2 siblings, 1 reply; 24+ messages in thread
From: Dave McCracken @ 2002-12-20 16:30 UTC (permalink / raw)
  To: Andrew Morton, linux-mm

--On Friday, December 20, 2002 03:11:09 -0800 Andrew Morton
<akpm@digeo.com> wrote:

> The workload is application and removal of ~80 patches using
> my patch scripts.  Tons and tons of forks from bash.
> 
> 2.5 ends up being 13% slower than 2.4, after disabling highpte
> to make it fair.  3%-odd of this is HZ=1000.  So say 10%.
> 
> Pagetable sharing actually slowed this test down by several
> percent overall.  Which is unfortunate, because the main
> thing which Linus likes about shared pagetables is that it
> "speeds up forks".
> 
> Is there anything we can do to fix all of this up a bit?

Ok, let's consider just what shared page tables does for fork.

In fork without shared page tables, there is a fixed cost per mapped page
where the pte entry has to be copied from the parent's pte page to the
child's.  This cost is higher for resident pages in 2.5 than 2.4 because of
rmap.

Shared page tables don't reduce that cost; they just defer it by
marking each pte page copy-on-write.  The cost is incurred when either the
parent or the child first tries to write to a page in that pte page.  The
savings come when there are pte pages that never have to be unshared,
either because they map a shared region or because they're not written to
(typically because the child quickly does an exec).

The worst case condition for shared page tables is when every pte page has
to be unshared.  Unfortunately this is also a common case.  Almost every
parent or child will touch three pages after fork:  the current stack page,
libc's data page, and the application's data page.  Each one of these is in
a separate pte page.  Since each pte page maps 4M (2M for PAE), small
processes only have those three pte pages, and they're all unshared.
Unfortunately this includes most base utilities, in particular shells, so
shell scripts will not benefit from shared page tables.
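
The arithmetic behind "each pte page maps 4M (2M for PAE)" can be sketched as follows.  This is a minimal userspace model, not kernel code: the function names are hypothetical, and it assumes 4K pages with 1024 ptes per pte page without PAE and 512 with PAE.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u	/* assumed i386 page size */

/* Bytes of virtual address space mapped by one pte page. */
uint32_t pte_page_span(uint32_t ptes_per_page)
{
	return ptes_per_page * PAGE_SIZE;
}

/* Index of the pte page that covers addr. */
uint32_t pte_page_index(uint32_t addr, uint32_t span)
{
	return addr / span;
}

/* Count how many distinct pte pages cover the given addresses. */
size_t count_pte_pages(const uint32_t *addrs, size_t n, uint32_t span)
{
	size_t distinct = 0;

	for (size_t i = 0; i < n; i++) {
		size_t j;

		/* Look for an earlier address in the same pte page. */
		for (j = 0; j < i; j++)
			if (pte_page_index(addrs[j], span) ==
			    pte_page_index(addrs[i], span))
				break;
		if (j == i)
			distinct++;
	}
	return distinct;
}
```

Fed one address each for the stack, libc's data and the application's data (for example 0xbffff000, 0x40150000 and 0x0804a000, which are made-up but typical i386 values), count_pte_pages reports three distinct pte pages for either span, so every pte page of a small process ends up unshared.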

There is a small penalty for deferring the pte page copy, as Andrew's tests
show.  However, as soon as even one pte page is not copied, fork
performance improves dramatically.  My tests show that fork/exit for a 4
pte page process is about 25% to 30% faster with shared page tables than
without, simply because of the single extra page that's not unshared.  This
savings is multiplied for each additional pte page that remains shared.

I'll look for ways to optimize the unsharing to reduce the penalty, but I'm
not optimistic that we can eliminate it entirely.

Let's also not lose sight of what I consider the primary goal of shared
page tables, which is to greatly reduce the page table memory overhead of
massively shared large regions.

Dave McCracken

======================================================================
Dave McCracken          IBM Linux Base Kernel Team      1-512-838-3059
dmccr@us.ibm.com                                        T/L   678-3059


* Re: shared pagetable benchmarking
  2002-12-20 16:30 ` Dave McCracken
@ 2002-12-20 19:59   ` Andrew Morton
  2002-12-23 16:15     ` Dave McCracken
  0 siblings, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2002-12-20 19:59 UTC (permalink / raw)
  To: Dave McCracken; +Cc: linux-mm

Dave McCracken wrote:
> 
> [ ... ]
>

Thanks.

> I'll look for ways to optimize the unsharing to reduce the penalty, but I'm
> not optimistic that we can eliminate it entirely.

So changing userspace to place its writeable memory on a new 4M boundary
would be a big win?

It's years since I played with elf, but I think this is feasible.  Change
the linker and just wait for it to propagate.

Do we know someone who can guide us in prototyping that?

Do we know where the writes are occurring?

> Let's also not lose sight of what I consider the primary goal of shared
> page tables, which is to greatly reduce the page table memory overhead of
> massively shared large regions.

Well yes.  But this is optimising the (extremely) uncommon case while
penalising the (very) common one.

It's the same with the reverse map - we've gone and added significant
expense even to machines and workloads which perform no page reclaim
at all.  Perhaps pagetable sharing can get that back for us.

* Re: shared pagetable benchmarking
  2002-12-20 19:59   ` Andrew Morton
@ 2002-12-23 16:15     ` Dave McCracken
  2002-12-23 23:54       ` Andrew Morton
  2002-12-27  9:39       ` Daniel Phillips
  0 siblings, 2 replies; 24+ messages in thread
From: Dave McCracken @ 2002-12-23 16:15 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm

--On Friday, December 20, 2002 11:59:12 -0800 Andrew Morton
<akpm@digeo.com> wrote:

> So changing userspace to place its writeable memory on a new 4M boundary
> would be a big win?
> 
> It's years since I played with elf, but I think this is feasible.  Change
> the linker and just wait for it to propagate.

Actually it'd require changes to both the linker and the kernel memory
range allocator.  Right now ld.so maps all memory needed for an entire
shared library, then uses mprotect and MAP_FIXED to modify parts of it to
be writable (or at least that's what I see using strace).  If it was done
using separate mmap calls we could redirect the writable regions to be in a
different pmd.
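
The two mapping patterns being contrasted can be sketched in userspace as below.  This is a simplified illustration under stated assumptions: anonymous memory stands in for the file-backed library image, the function names are hypothetical, and the MAP_FIXED step seen in the strace output is left out.  Placing the separate writable mapping in a different pmd is the proposed kernel change; nothing here makes that happen.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/*
 * Pattern 1: map the whole image once, then make a window of it
 * writable with mprotect().  The writable part stays inside the same
 * address range as the read-only part.  rw_off and rw_len must be
 * multiples of the page size.  Returns NULL on failure.
 */
void *map_one_region(size_t total, size_t rw_off, size_t rw_len)
{
	char *base = mmap(NULL, total, PROT_READ,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (base == MAP_FAILED)
		return NULL;
	if (mprotect(base + rw_off, rw_len, PROT_READ | PROT_WRITE) != 0) {
		munmap(base, total);
		return NULL;
	}
	return base;
}

/*
 * Pattern 2: separate mmap() calls for the read-only and writable
 * parts, which leaves the kernel free to choose where each one goes.
 * Returns 0 on success, -1 on failure.
 */
int map_separately(size_t ro_len, size_t rw_len, void **ro, void **rw)
{
	*ro = mmap(NULL, ro_len, PROT_READ,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (*ro == MAP_FAILED)
		return -1;

	*rw = mmap(NULL, rw_len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (*rw == MAP_FAILED) {
		munmap(*ro, ro_len);
		return -1;
	}
	return 0;
}
```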

>> Let's also not lose sight of what I consider the primary goal of shared
>> page tables, which is to greatly reduce the page table memory overhead of
>> massively shared large regions.
> 
> Well yes.  But this is optimising the (extremely) uncommon case while
> penalising the (very) common one.

I guess I don't see wasting extra pte pages on duplicated mappings of
shared memory as extremely uncommon.  Granted, it's not that significant
for small applications, but it can make a machine unusable with some large
applications.  I think being able to run applications that couldn't run
before is worth some consideration.

I also have a couple of ideas for ways to eliminate the penalty for small
tasks.  Would you grant that it's a worthwhile effort if the penalty for
small applications was zero?

Dave McCracken

======================================================================
Dave McCracken          IBM Linux Base Kernel Team      1-512-838-3059
dmccr@us.ibm.com                                        T/L   678-3059


* Re: shared pagetable benchmarking
  2002-12-20 11:11 shared pagetable benchmarking Andrew Morton
  2002-12-20 11:13 ` William Lee Irwin III
  2002-12-20 16:30 ` Dave McCracken
@ 2002-12-23 18:19 ` Dave McCracken
  2 siblings, 0 replies; 24+ messages in thread
From: Dave McCracken @ 2002-12-23 18:19 UTC (permalink / raw)
  To: Andrew Morton, linux-mm

[-- Attachment #1: Type: text/plain, Size: 620 bytes --]


--On Friday, December 20, 2002 03:11:09 -0800 Andrew Morton
<akpm@digeo.com> wrote:

> Is there anything we can do to fix all of this up a bit?

Ok, here's my first attempt at optimization.  I track how many pte pages a
task has, and just do the copy if it doesn't have more than 3.  My fork
tests show that for a process with 3 pte pages, this patch produces
performance equal to 2.5.52.

Dave McCracken

======================================================================
Dave McCracken          IBM Linux Base Kernel Team      1-512-838-3059
dmccr@us.ibm.com                                        T/L   678-3059

[-- Attachment #2: shpte-2.5.52-mm2-1.diff --]
[-- Type: text/plain, Size: 3938 bytes --]

--- 2.5.52-mm2-shsent/./include/linux/mm.h	2002-12-20 10:39:44.000000000 -0600
+++ 2.5.52-mm2-shpte/./include/linux/mm.h	2002-12-20 11:09:51.000000000 -0600
@@ -123,9 +123,6 @@
  * low four bits) to a page protection mask..
  */
 extern pgprot_t protection_map[16];
-#ifdef CONFIG_SHAREPTE
-extern pgprot_t protection_pmd[8];
-#endif
 
 /*
  * These are the virtual MM functions - opening of an area, closing and
--- 2.5.52-mm2-shsent/./include/linux/sched.h	2002-12-20 10:39:44.000000000 -0600
+++ 2.5.52-mm2-shpte/./include/linux/sched.h	2002-12-23 10:18:08.000000000 -0600
@@ -183,6 +183,7 @@
 	struct vm_area_struct * mmap_cache;	/* last find_vma result */
 	unsigned long free_area_cache;		/* first hole */
 	pgd_t * pgd;
+	atomic_t ptepages;			/* Number of pte pages allocated */
 	atomic_t mm_users;			/* How many users with user space? */
 	atomic_t mm_count;			/* How many references to "struct mm_struct" (users count as 1) */
 	int map_count;				/* number of VMAs */
--- 2.5.52-mm2-shsent/./kernel/fork.c	2002-12-20 10:39:44.000000000 -0600
+++ 2.5.52-mm2-shpte/./kernel/fork.c	2002-12-23 10:32:15.000000000 -0600
@@ -238,6 +238,7 @@
 	mm->free_area_cache = TASK_UNMAPPED_BASE;
 	mm->map_count = 0;
 	mm->rss = 0;
+	atomic_set(&mm->ptepages, 0);
 	mm->cpu_vm_mask = 0;
 	pprev = &mm->mmap;
 
--- 2.5.52-mm2-shsent/./mm/memory.c	2002-12-20 10:39:45.000000000 -0600
+++ 2.5.52-mm2-shpte/./mm/memory.c	2002-12-23 10:22:52.000000000 -0600
@@ -116,6 +116,7 @@
 
 	pmd_clear(dir);
 	pgtable_remove_rmap_locked(ptepage, tlb->mm);
+	atomic_dec(&tlb->mm->ptepages);
 	dec_page_state(nr_page_table_pages);
 	ClearPagePtepage(ptepage);
 
@@ -184,6 +185,7 @@
 		SetPagePtepage(new);
 		pgtable_add_rmap(new, mm, address);
 		pmd_populate(mm, pmd, new);
+		atomic_inc(&mm->ptepages);
 		inc_page_state(nr_page_table_pages);
 	}
 out:
@@ -217,7 +219,6 @@
 #define PTE_TABLE_MASK	((PTRS_PER_PTE-1) * sizeof(pte_t))
 #define PMD_TABLE_MASK	((PTRS_PER_PMD-1) * sizeof(pmd_t))
 
-#ifndef CONFIG_SHAREPTE
 /*
  * copy one vm_area from one task to the other. Assumes the page tables
  * already present in the new task to be cleared in the whole range
@@ -354,7 +355,6 @@
 nomem:
 	return -ENOMEM;
 }
-#endif
 
 static void zap_pte_range(mmu_gather_t *tlb, pmd_t * pmd, unsigned long address, unsigned long size)
 {
--- 2.5.52-mm2-shsent/./mm/ptshare.c	2002-12-20 10:39:45.000000000 -0600
+++ 2.5.52-mm2-shpte/./mm/ptshare.c	2002-12-23 12:17:23.000000000 -0600
@@ -23,7 +23,7 @@
 /*
  * Protections that can be set on the pmd entry (see discussion in mmap.c).
  */
-pgprot_t protection_pmd[8] = {
+static pgprot_t protection_pmd[8] = {
 	__PMD000, __PMD001, __PMD010, __PMD011, __PMD100, __PMD101, __PMD110, __PMD111
 };
 
@@ -459,6 +459,28 @@
 }
 
 /**
+ * fork_page_range - Either copy or share a page range at fork time
+ * @dst: the mm_struct of the forked child
+ * @src: the mm_struct of the forked parent
+ * @vma: the vm_area to be shared
+ * @prev_pmd: A pointer to the pmd entry we did at last invocation
+ *
+ * This wrapper decides whether to share page tables on fork or just make
+ * a copy.  The current criterion is whether a page table has more than 3
+ * pte pages, since all forked processes will unshare 3 pte pages after fork,
+ * even the ones doing an immediate exec.  Tests indicate that if a page
+ * table has more than 3 pte pages, it's a performance win to share.
+ */
+int fork_page_range(struct mm_struct *dst, struct mm_struct *src,
+		    struct vm_area_struct *vma, pmd_t **prev_pmd)
+{
+	if (atomic_read(&src->ptepages) > 3)
+		return share_page_range(dst, src, vma, prev_pmd);
+
+	return copy_page_range(dst, src, vma);
+}
+
+/**
  * unshare_page_range - Make sure no pte pages are shared in a given range
  * @mm: the mm_struct whose page table we unshare from
  * @address: the base address of the range


* Re: shared pagetable benchmarking
  2002-12-23 16:15     ` Dave McCracken
@ 2002-12-23 23:54       ` Andrew Morton
  2002-12-27  9:39       ` Daniel Phillips
  1 sibling, 0 replies; 24+ messages in thread
From: Andrew Morton @ 2002-12-23 23:54 UTC (permalink / raw)
  To: Dave McCracken; +Cc: linux-mm

Dave McCracken wrote:
> 
> --On Friday, December 20, 2002 11:59:12 -0800 Andrew Morton
> <akpm@digeo.com> wrote:
> 
> > So changing userspace to place its writeable memory on a new 4M boundary
> > would be a big win?
> >
> > It's years since I played with elf, but I think this is feasible.  Change
> > the linker and just wait for it to propagate.
> 
> Actually it'd require changes to both the linker and the kernel memory
> range allocator.  Right now ld.so maps all memory needed for an entire
> shared library, then uses mprotect and MAP_FIXED to modify parts of it to
> be writable (or at least that's what I see using strace).  If it was done
> using separate mmap calls we could redirect the writable regions to be in a
> different pmd.

Yup.

Over the weekend I got all this going.  With binutils patches from HJ,
a kernel patch from Bill and tons of rebuilding things I had everything
in /proc/pid/maps on a separate 4M segment.

I also fixed run-child-first-on-fork.

Summary:

                2.4.20    2.5-shpte    2.5-shpte+weekend_hacks

aim9 fork_test  1950      1300         1700
aim9 exec_test  700       545          572
patch-scripts   16.5      19.5         18.5

The fork test isn't very interesting.  When you toss in an exec(),
the benefits are small.

It appears that Linus's only interest in shared pagetables is that
it could reclaim the fork/exec overhead which the reverse mapping
introduced.  As far as I can tell he is not concerned about space
consumption issues.

And if that is the selection criterion, I do not believe that these
speedups are sufficient to warrant a merge.

> >> Let's also not lose sight of what I consider the primary goal of shared
> >> page tables, which is to greatly reduce the page table memory overhead of
> >> massively shared large regions.
> >
> > Well yes.  But this is optimising the (extremely) uncommon case while
> > penalising the (very) common one.
> 
> I guess I don't see wasting extra pte pages on duplicated mappings of
> shared memory as extremely uncommon.  Granted, it's not that significant
> for small applications, but it can make a machine unusable with some large
> applications.  I think being able to run applications that couldn't run
> before is worth some consideration.
> 
> I also have a couple of ideas for ways to eliminate the penalty for small
> tasks.  Would you grant that it's a worthwhile effort if the penalty for
> small applications was zero?
> 

It's not my call, David.  I've been putting myself in the role of
helping to get the code working and tested, and providing Linus
with whatever info can help him make a decision.  I guess he works
by observing what people are talking about, asking about and hurting
over on the mailing lists.  As well as his own experience.  And the
issue of pagetable consumption just doesn't have any visibility.

I expect his position would be that it's a specialised, rare problem
and that the fix is more appropriate to a specialised vendor kernel.

I suggest that you discuss it with him.  If that ends up being thumbs-down
I can continue to maintain the patch across 2.6.x.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: shared pagetable benchmarking
  2002-12-23 16:15     ` Dave McCracken
  2002-12-23 23:54       ` Andrew Morton
@ 2002-12-27  9:39       ` Daniel Phillips
  2002-12-27  9:58         ` Andrew Morton
  1 sibling, 1 reply; 24+ messages in thread
From: Daniel Phillips @ 2002-12-27  9:39 UTC (permalink / raw)
  To: Dave McCracken, Andrew Morton; +Cc: linux-mm

On Monday 23 December 2002 17:15, Dave McCracken wrote:
> >> Let's also not lose sight of what I consider the primary goal of shared
> >> page tables, which is to greatly reduce the page table memory overhead
> >> of massively shared large regions.
> >
> > Well yes.  But this is optimising the (extremely) uncommon case while
> > penalising the (very) common one.
>
> I guess I don't see wasting extra pte pages on duplicated mappings of
> shared memory as extremely uncommon.  Granted, it's not that significant
> for small applications, but it can make a machine unusable with some large
> applications.  I think being able to run applications that couldn't run
> before is worth some consideration.
>
> I also have a couple of ideas for ways to eliminate the penalty for small
> tasks.  Would you grant that it's a worthwhile effort if the penalty for
> small applications was zero?

Hi Dave, Andrew,

A feature of my original demonstration patch was that I could enable/disable 
sharing with a per-fork granularity.  This is a good thing.  You can use this 
by detecting the case you can't optimize, i.e., forking from bash, and 
essentially using the old code.  The sawoff for improved efficiency comes in 
somewhere over 4 meg worth of shared memory, which just doesn't happen in 
fork+exec from bash.  Then there is the always-unshare situation with the stack,
which I'm sure you're aware of, where it's never worth doing the share.

That said, was not Ingo working on a replacement for fork+exec that doesn't 
do the useless fork?  Would this not make the vast majority of 
impossible-to-optimize cases go away?

Regards,

Daniel


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: shared pagetable benchmarking
  2002-12-27  9:39       ` Daniel Phillips
@ 2002-12-27  9:58         ` Andrew Morton
  2002-12-27 15:59           ` Daniel Phillips
  0 siblings, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2002-12-27  9:58 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Dave McCracken, linux-mm

Daniel Phillips wrote:
> 
> On Monday 23 December 2002 17:15, Dave McCracken wrote:
> > >> Let's also not lose sight of what I consider the primary goal of shared
> > >> page tables, which is to greatly reduce the page table memory overhead
> > >> of massively shared large regions.
> > >
> > > Well yes.  But this is optimising the (extremely) uncommon case while
> > > penalising the (very) common one.
> >
> > I guess I don't see wasting extra pte pages on duplicated mappings of
> > shared memory as extremely uncommon.  Granted, it's not that significant
> > for small applications, but it can make a machine unusable with some large
> > applications.  I think being able to run applications that couldn't run
> > before to be worth some consideration.
> >
> > I also have a couple of ideas for ways to eliminate the penalty for small
> > tasks.  Would you grant that it's a worthwhile effort if the penalty for
> > small applications was zero?
> 
> Hi Dave, Andrew,

Daniel!

> A feature of my original demonstration patch was that I could enable/disable
> sharing with a per-fork granularity.  This is a good thing.  You can use this
> by detecting the case you can't optimize, i.e., forking from bash, and
> essentially using the old code.  The sawoff for improved efficiency comes in
> somewhere over 4 meg worth of shared memory, which just doesn't happen in
> fork+exec from bash.  Then there is the always-unshare situation with the stack,
> which I'm sure you're aware of, where it's never worth doing the share.

Yes, Dave did a prototype of that, and I am sure that it will pull back
the small additional cost of pagetable sharing in those cases.

But that's not the problem.  The problem is that it doesn't *speed up*
that case.  Which appears to be the only thing which interests Linus
in shared pagetables at this time: he "_hate_"s the fact that fork/exec
got slower.

> That said, was not Ingo working on a replacement for fork+exec that doesn't
> do the useless fork?  Would this not make the vast majority of
> impossible-to-optimize cases go away?

That's news to me.

posix_spawn() has been suggested by Ulrich, and he says that things like
bash could easily be converted.

I don't know how much it would gain - possibly not a huge amount; the rmap
setup in exec seems to be where the major cost lies.  Plus there's still
exit().

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: shared pagetable benchmarking
  2002-12-27  9:58         ` Andrew Morton
@ 2002-12-27 15:59           ` Daniel Phillips
  2002-12-27 20:02             ` Linus Torvalds
  0 siblings, 1 reply; 24+ messages in thread
From: Daniel Phillips @ 2002-12-27 15:59 UTC (permalink / raw)
  To: Andrew Morton, Daniel Phillips; +Cc: Dave McCracken, linux-mm, Linus Torvalds

Hi Andrew,

On Friday 27 December 2002 10:58, Andrew Morton wrote:
> > A feature of my original demonstration patch was that I could
> > enable/disable sharing with a per-fork granularity.  This is a good
> > thing.  You can use this by detecting the case you can't optimize, i.e.,
> > forking from bash, and essentially using the old code.  The sawoff for
> > improved efficiency comes in somewhere over 4 meg worth of shared memory,
> > which just doesn't happen in fork+exec from bash.  Then there is the
> > always-unshare situation with the stack, which I'm sure you're aware of,
> > where it's never worth doing the share.
>
> Yes, Dave did a prototype of that, and I am sure that it will pull back
> the small additional cost of pagetable sharing in those cases.
>
> But that's not the problem.  The problem is that it doesn't *speed up*
> that case.  Which appears to be the only thing which interests Linus
> in shared pagetables at this time: he "_hate_"s the fact that fork/exec
> got slower.

Did you ask Linus?  To my thinking, if it breaks even on small forks and wins
on the big forks that are bothering the database people etc (and aren't we 
all database people in the end) it's a clear win.

> > That said, was not Ingo working on a replacement for fork+exec that
> > doesn't do the useless fork?  Would this not make the vast majority of
> > impossible-to-optimize cases go away?
>
> That's news to me.
>
> posix_spawn() has been suggested by Ulrich, and he says that things like
> bash could easily be converted.

Yes, that's the reference.  I somehow got Ingo mixed in there because they 
were doing the thread cleanups together at the time, which quite possibly 
inspired that rather badly needed improvement.

> I don't know how much it would gain - possibly not a huge amount; the rmap
> setup in exec seems to be where the major cost lies.  Plus there's still
> exit().

What you'd lose is the useless setup/teardown of three page table pages every 
fork+exec, a good thing regardless of page table sharing, but especially 
convenient for sharing, as it carves away a good percentage of the 
non-improved cases, allowing the improved ones to stand out more.

Anyway, I'll bow out of the rmap-optimizing game until next cycle.  There are 
still some nice optimizations that can be done, but what's the hurry?  There 
is plenty of longstanding kernel badness that dwarfs this in importance 
(knfsd comes to mind).  I am just glad that rmap has stuck, as I am very 
happy to trade a few percentage points of speed in some applications that 
suck by design, in return for a VM that actually gives BSD a run for its 
money in terms of stability.

I guess that if pte sharing doesn't make it into Linus's tree then Redhat 
will be only too pleased to make it part of Advanced Server, as there are a 
couple of ridiculously big tech companies I could name that need it so badly 
it hurts.

Regards,

Daniel

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: shared pagetable benchmarking
  2002-12-27 15:59           ` Daniel Phillips
@ 2002-12-27 20:02             ` Linus Torvalds
  2002-12-27 20:16               ` Dave McCracken
  0 siblings, 1 reply; 24+ messages in thread
From: Linus Torvalds @ 2002-12-27 20:02 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Andrew Morton, Dave McCracken, linux-mm

On Fri, 27 Dec 2002, Daniel Phillips wrote:
> 
> Did you ask Linus?  To my thinking, if it breaks even on small forks and wins
> on the big forks that are bothering the database people etc (and aren't we 
> all database people in the end) it's a clear win.

It doesn't break even on small forks. It _slows_them_down_.

I personally think that small forks are a hell of a lot more important
than big ones, since big ones happen rarely and don't tend to be all that
performance-critical anyway.

		Linus


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: shared pagetable benchmarking
  2002-12-27 20:02             ` Linus Torvalds
@ 2002-12-27 20:16               ` Dave McCracken
  2002-12-27 20:18                 ` Linus Torvalds
  0 siblings, 1 reply; 24+ messages in thread
From: Dave McCracken @ 2002-12-27 20:16 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Daniel Phillips, Andrew Morton, linux-mm

--On Friday, December 27, 2002 12:02:56 -0800 Linus Torvalds
<torvalds@transmeta.com> wrote:

> It doesn't break even on small forks. It _slows_them_down_.

I gave Andrew a patch that does make it break even on small forks, by doing
the copy at fork time when a process only has 3 pte pages.  My tests
indicate that any process with 4 or more pte pages usually is faster by
doing the share.

Dave McCracken


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: shared pagetable benchmarking
  2002-12-27 20:16               ` Dave McCracken
@ 2002-12-27 20:18                 ` Linus Torvalds
  2002-12-27 20:45                   ` Dave McCracken
  0 siblings, 1 reply; 24+ messages in thread
From: Linus Torvalds @ 2002-12-27 20:18 UTC (permalink / raw)
  To: Dave McCracken; +Cc: Daniel Phillips, Andrew Morton, linux-mm

On Fri, 27 Dec 2002, Dave McCracken wrote:
> 
> I gave Andrew a patch that does make it break even on small forks, by doing
> the copy at fork time when a process only has 3 pte pages.  My tests
> indicate that any process with 4 or more pte pages usually is faster by
> doing the share.

Ok, so it doesn't actually break even, it just disables itself. That's not 
the same thing in my book, but may of course be acceptable.

I'd personally be much happier if just the real cause for the rmap
slowdown was fixed, possibly by having it be done lazily (the shared page
table stuff tries to do the _copy_ of the rmap information lazily, but
maybe the real solution is to go one level further and just set the dang
things up lazily in the first place, since most of the time it's not even
needed).

That's clearly not 2.6.x material. But at this point I doubt that shared
page tables are either, unless they fix something more important than 
fork() speed for processes that are larger than 16MB.
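
For reference, the 16MB figure appears to line up with Dave's
four-pte-page threshold if one assumes non-PAE i386: 4-byte ptes, 1024 per
4KB pte page, so 4MB mapped per pte page.  Under PAE (8-byte ptes) each pte
page would map 2MB instead.  A minimal sketch of that arithmetic follows;
the helper names pte_page_span() and should_share() are hypothetical and
are not taken from Dave's patch, and the ">= 4" cutoff is only a reading of
his description above.

```c
#include <assert.h>

/* Assumed i386 base page size; not stated in the thread. */
#define PAGE_SIZE_BYTES 4096UL

/*
 * Bytes of virtual address space mapped by one pte page.
 * pte_size is 4 for classic i386 paging, 8 for PAE.
 */
unsigned long pte_page_span(unsigned long pte_size)
{
	assert(pte_size == 4 || pte_size == 8);
	return (PAGE_SIZE_BYTES / pte_size) * PAGE_SIZE_BYTES;
}

/*
 * Hypothetical fork-time policy: copy the page tables eagerly for
 * small processes, share them once the process has 4 or more pte pages.
 * Returns 1 to share, 0 to copy.
 */
int should_share(unsigned long nr_pte_pages)
{
	return nr_pte_pages >= 4;
}
```

With these assumptions, 4 * pte_page_span(4) comes to 16MB, and the same
four pte pages would cover only 8MB on a PAE kernel.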

		Linus


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: shared pagetable benchmarking
  2002-12-27 20:18                 ` Linus Torvalds
@ 2002-12-27 20:45                   ` Dave McCracken
  2002-12-27 20:50                     ` Linus Torvalds
  0 siblings, 1 reply; 24+ messages in thread
From: Dave McCracken @ 2002-12-27 20:45 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Daniel Phillips, Andrew Morton, linux-mm

--On Friday, December 27, 2002 12:18:23 -0800 Linus Torvalds
<torvalds@transmeta.com> wrote:

> That's clearly not 2.6.x material. But at this point I doubt that shared
> page tables are either, unless they fix something more important than 
> fork() speed for processes that are larger than 16MB.

The other thing it does is eliminate the duplicate pte pages for shared
regions everywhere they span a complete pte page.  While hugetlb can also
do this for some specialized applications, shared page tables will do it
for every shared region that's large enough.  I dunno whether you consider
that important enough to qualify, but I figured I should point it out.

Dave McCracken


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: shared pagetable benchmarking
  2002-12-27 20:45                   ` Dave McCracken
@ 2002-12-27 20:50                     ` Linus Torvalds
  2002-12-27 23:56                       ` Daniel Phillips
  2002-12-28  0:45                       ` Martin J. Bligh
  0 siblings, 2 replies; 24+ messages in thread
From: Linus Torvalds @ 2002-12-27 20:50 UTC (permalink / raw)
  To: Dave McCracken; +Cc: Daniel Phillips, Andrew Morton, linux-mm

On Fri, 27 Dec 2002, Dave McCracken wrote:
>
> The other thing it does is eliminate the duplicate pte pages for shared
> regions everywhere they span a complete pte page.  While hugetlb can also
> do this for some specialized applications, shared page tables will do it
> for every shared region that's large enough.  I dunno whether you consider
> that important enough to qualify, but I figured I should point it out.

I don't consider it important enough to qualify unless there are some real 
loads where it really matters. I can well imagine that such loads exist 
(where low-memory usage by page tables is a real problem), but I'd like to 
have that confirmed as a bug-report and that the sharing really does fix 
it.

In other words, I can believe that the sharing is 2.6.x material, but
considering the fundamental nature of it I want it to be a confirmed
bug-fix, not a feature.

		Linus


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: shared pagetable benchmarking
  2002-12-27 20:50                     ` Linus Torvalds
@ 2002-12-27 23:56                       ` Daniel Phillips
  2002-12-28  0:45                       ` Martin J. Bligh
  1 sibling, 0 replies; 24+ messages in thread
From: Daniel Phillips @ 2002-12-27 23:56 UTC (permalink / raw)
  To: Linus Torvalds, Dave McCracken, Wim Coekaerts
  Cc: Daniel Phillips, Andrew Morton, linux-mm

On Friday 27 December 2002 21:50, Linus Torvalds wrote:
> On Fri, 27 Dec 2002, Dave McCracken wrote:
> > The other thing it does is eliminate the duplicate pte pages for shared
> > regions everywhere they span a complete pte page.  While hugetlb can also
> > do this for some specialized applications, shared page tables will do it
> > for every shared region that's large enough.  I dunno whether you
> > consider that important enough to qualify, but I figured I should point
> > it out.
>
> I don't consider it important enough to qualify unless there are some real
> loads where it really matters.

Well, I know IBM has real loads that it matters on, otherwise Dave wouldn't 
be on this.  I have reason to believe Oracle has loads that care about this 
as well, so it's time for somebody there to either speak up or kiss this 
facility goodbye for another cycle or two.  Wim?

--
Daniel

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: shared pagetable benchmarking
  2002-12-27 20:50                     ` Linus Torvalds
  2002-12-27 23:56                       ` Daniel Phillips
@ 2002-12-28  0:45                       ` Martin J. Bligh
  2002-12-28  2:34                         ` Andrew Morton
  1 sibling, 1 reply; 24+ messages in thread
From: Martin J. Bligh @ 2002-12-28  0:45 UTC (permalink / raw)
  To: Linus Torvalds, Dave McCracken; +Cc: Daniel Phillips, Andrew Morton, linux-mm

> I don't consider it important enough to qualify unless there are some real 
> loads where it really matters. I can well imagine that such loads exist 
> (where low-memory usage by page tables is a real problem), but I'd like to 
> have that confirmed as a bug-report and that the sharing really does fix 
> it.

We had over 10GB of PTEs running Oracle Apps (on 2.4 without RMAP) - 
RMAP would add another 5GB or so to that (2GB shared memory segment 
across many processes). But you can stick PTEs in highmem, whereas 
it's not easy to do that with pte_chains ... sticking 5GB of overhead 
into ZONE_NORMAL is tricky ;-) The really nice thing about shared 
pagetables as a solution is that it's totally transparent, and requires 
no app modifications. Obviously degrading fork for small tasks is
unacceptable, but Dave seems to have fixed that issue now.
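
To give a feel for how the numbers above can add up, here is a rough
sketch of the pte-page cost of mapping one shared segment from many
processes.  It assumes 4KB pages and PAE-style 8-byte ptes, and the
process count in the usage note is purely illustrative; none of these
values were given for the Oracle Apps run.  The function names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of pte pages one process needs to map segment_bytes with 4KB pages. */
uint64_t pte_bytes_per_process(uint64_t segment_bytes, uint64_t pte_size)
{
	const uint64_t page_size = 4096;
	uint64_t ptes, ptes_per_page, pte_pages;

	assert(pte_size == 4 || pte_size == 8);
	ptes = (segment_bytes + page_size - 1) / page_size;
	ptes_per_page = page_size / pte_size;
	pte_pages = (ptes + ptes_per_page - 1) / ptes_per_page;
	return pte_pages * page_size;
}

/*
 * Total pte-page bytes across nr_procs processes.  With sharing, and
 * assuming the segment is aligned so that it spans complete pte pages,
 * one set of pte pages serves every process.
 */
uint64_t pte_bytes_total(uint64_t segment_bytes, uint64_t pte_size,
			 uint64_t nr_procs, int shared)
{
	uint64_t per_proc = pte_bytes_per_process(segment_bytes, pte_size);

	return shared ? per_proc : per_proc * nr_procs;
}
```

Under these assumptions a 2GB segment costs 4MB of pte pages per process,
so roughly 2560 processes would reach 10GB unshared, against 4MB in total
when the pte pages are shared.  This does not model the pte_chain overhead.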

I think the long-term fix for the rmap performance hit is object-based 
RMAP (doing the reverse mappings shared on a per-area basis) which we've 
talked about, but not for 2.6 ... it may not turn out to be that hard 
though ... K42 did it before.

M.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: shared pagetable benchmarking
  2002-12-28  0:45                       ` Martin J. Bligh
@ 2002-12-28  2:34                         ` Andrew Morton
  2002-12-28  3:10                           ` Linus Torvalds
  2002-12-28  3:19                           ` Martin J. Bligh
  0 siblings, 2 replies; 24+ messages in thread
From: Andrew Morton @ 2002-12-28  2:34 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Linus Torvalds, Dave McCracken, Daniel Phillips, linux-mm

"Martin J. Bligh" wrote:
> 
> > I don't consider it important enough to qualify unless there are some real
> > loads where it really matters. I can well imagine that such loads exist
> > (where low-memory usage by page tables is a real problem), but I'd like to
> > have that confirmed as a bug-report and that the sharing really does fix
> > it.
> 
> We had over 10GB of PTEs running Oracle Apps (on 2.4 without RMAP) -
> RMAP would add another 5GB or so to that (2GB shared memory segment
> across many processes). But you can stick PTEs in highmem, whereas
> it's not easy to do that with pte_chains ... sticking 5GB of overhead
> into ZONE_NORMAL is tricky ;-) The really nice thing about shared
> pagetables as a solution is that it's totally transparent, and requires
> no app modifications. Obviously degrading fork for small tasks is
> unacceptable, but Dave seems to have fixed that issue now.

To what extent is that a "real" workload?

What other applications are affected, and to what extent?

Why are hugepages not a sufficient solution?

Is this problem sufficiently common to warrant the inclusion of
pagetable sharing in the main kernel, as opposed to a specialised
Oracle/DB2 derivative?

> I think the long-term fix for the rmap performance hit is object-based
> RMAP (doing the reverse mappings shared on a per-area basis) which we've
> talked about, but not for 2.6 ... it may not turn out to be that hard
> though ... K42 did it before.

I think we can do a few things still in the 2.6 context.  The fact that
my "apply seventy patches with patch-scripts" test takes 350,000 pagefaults
in 13 seconds makes one go "hmm".

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: shared pagetable benchmarking
  2002-12-28  2:34                         ` Andrew Morton
@ 2002-12-28  3:10                           ` Linus Torvalds
  2002-12-28  6:58                             ` Andrew Morton
  2002-12-28  3:19                           ` Martin J. Bligh
  1 sibling, 1 reply; 24+ messages in thread
From: Linus Torvalds @ 2002-12-28  3:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Martin J. Bligh, Dave McCracken, Daniel Phillips, linux-mm, Ingo Molnar

On Fri, 27 Dec 2002, Andrew Morton wrote:
> 
> I think we can do a few things still in the 2.6 context.  The fact that
> my "apply seventy patches with patch-scripts" test takes 350,000 pagefaults
> in 13 seconds makes one go "hmm".

Hmm.. Whatever happened to the MAP_POPULATE tests?

The current "filemap_populate()" function is extremely stupid (it takes 
advantage neither of the locality of the page tables _nor_ of the radix 
tree layout), but even so it would probably be a win to pre-populate at 
mmap time.

But having a better "populate()" function that actually does multiple
pages at once by just accessing the radix trees and page table trees
directly should really be very low-overhead for the normal case, and be a
_big_ win in avoiding page faults.

Even with the existing stupid populate function, it might be interesting
seeing what would happen just from doing something silly like

===== arch/i386/kernel/sys_i386.c 1.10 vs edited =====
--- 1.10/arch/i386/kernel/sys_i386.c	Sat Dec 21 08:24:45 2002
+++ edited/arch/i386/kernel/sys_i386.c	Fri Dec 27 19:08:30 2002
@@ -54,6 +54,8 @@
 		file = fget(fd);
 		if (!file)
 			goto out;
+		if (prot & PROT_EXEC)
+			flags |= MAP_POPULATE | MAP_NONBLOCK;
 	}
 
 	down_write(&current->mm->mmap_sem);

(yeah, yeah, and maybe do the same in binfmt_elf.c too)
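
For comparison, the same flag can be requested explicitly from user space
instead of being forced on inside the kernel.  The sketch below assumes a
reasonably modern Linux and glibc, where MAP_POPULATE is accepted on
private file mappings; that was not the case for the kernel discussed in
this thread.  map_file_populated() is a hypothetical helper name, and the
fallback define simply degrades to a plain mmap() where the flag is
missing.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#ifndef MAP_POPULATE
#define MAP_POPULATE 0	/* assumption: no prefaulting on this platform */
#endif

/*
 * Map a whole file read-only and ask the kernel to pre-populate the
 * page tables at mmap() time rather than on first touch.
 * Returns MAP_FAILED on error; on success stores the length in *len_out.
 */
void *map_file_populated(const char *path, size_t *len_out)
{
	struct stat st;
	void *addr;
	int fd;

	assert(path != NULL && len_out != NULL);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return MAP_FAILED;

	if (fstat(fd, &st) < 0 || st.st_size == 0) {
		close(fd);
		return MAP_FAILED;
	}

	addr = mmap(NULL, (size_t)st.st_size, PROT_READ,
		    MAP_PRIVATE | MAP_POPULATE, fd, 0);
	close(fd);	/* the mapping stays valid after the fd is closed */

	if (addr != MAP_FAILED)
		*len_out = (size_t)st.st_size;
	return addr;
}
```

The caller is expected to munmap() the region with the returned length.
Whether prefaulting is a net win depends on how much of the mapping is
actually touched, which is what the thread is trying to measure.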

			Linus


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: shared pagetable benchmarking
  2002-12-28  2:34                         ` Andrew Morton
  2002-12-28  3:10                           ` Linus Torvalds
@ 2002-12-28  3:19                           ` Martin J. Bligh
  1 sibling, 0 replies; 24+ messages in thread
From: Martin J. Bligh @ 2002-12-28  3:19 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linus Torvalds, Dave McCracken, Daniel Phillips, linux-mm

> To what extent is that a "real" workload?

It was meant to be a simulation of a real customer environment,
I don't think it's unrealistic (they were actually trying to push it
to at least twice that).

> What other applications are affected, and to what extent?

Anything that does heavy sharing. Databases and Java heaps spring
to mind.

> Why are hugepages not a sufficient solution?

They may be for some workloads, but it's not as generalised. For
instance, one other thing that's being muttered about a lot is very
large heaps for Java workloads, and they want those swap backed.
Large pages also require application modification and machine setup
for static pool size reservations in the current implementation.

> Is this problem sufficiently common to warrant the inclusion of
> pagetable sharing in the main kernel, as opposed to a specialised
> Oracle/DB2 derivative?

If we can get it not to degrade anything else (eg fork on small tasks),
I think it's worthwhile. I *think* we're there now, though a few more
perf checks are probably needed.

>> I think the long-term fix for the rmap performance hit is object-based
>> RMAP (doing the reverse mappings shared on a per-area basis) which we've
>> talked about, but not for 2.6 ... it may not turn out to be that hard
>> though ... K42 did it before.
>
> I think we can do a few things still in the 2.6 context.  The fact that
> my "apply seventy patches with patch-scripts" test takes 350,000
> pagefaults in 13 seconds makes one go "hmm".

Fixing that would be a worthy goal, IMHO.

M.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: shared pagetable benchmarking
  2002-12-28  3:10                           ` Linus Torvalds
@ 2002-12-28  6:58                             ` Andrew Morton
  2002-12-28  7:39                               ` Ingo Molnar
  2002-12-28  7:47                               ` Linus Torvalds
  0 siblings, 2 replies; 24+ messages in thread
From: Andrew Morton @ 2002-12-28  6:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Martin J. Bligh, Dave McCracken, Daniel Phillips, linux-mm, Ingo Molnar

Linus Torvalds wrote:
> 
> On Fri, 27 Dec 2002, Andrew Morton wrote:
> >
> > I think we can do a few things still in the 2.6 context.  The fact that
> > my "apply seventy patches with patch-scripts" test takes 350,000 pagefaults
> > in 13 seconds makes one go "hmm".
> 
> Hmm.. Whatever happened to the MAP_POPULATE tests?
> 
> The current "filemap_populate()" function is extremely stupid (it takes
> advantage neither of the locality of the page tables _nor_ of the radix
> tree layout), but even so it would probably be a win to pre-populate at
> mmap time.

Yup.  Ingo said at the time:

  It would be faster to iterate the pagecache mapping's radix tree
  and the pagetables at once, but it's also *much* more complex. I have
  tried to implement it and had to unroll the change - mixing radix tree
  walking and pagetable walking and getting all the VM details right is
  really complex - especially considering all the re-lookup race checks
  that have to occur upon IO.

But find_get_pages() is well-suited to this, and was not in place when
he did this work.

> But having a better "populate()" function that actually does multiple
> pages at once by just accessing the radix trees and page table trees
> directly should really be very low-overhead for the normal case, and be a
> _big_ win in avoiding page faults.
> 
> Even with the existing stupid populate function, it might be interesting
> seeing what would happen just from doing something silly like
> 
> ===== arch/i386/kernel/sys_i386.c 1.10 vs edited =====
> --- 1.10/arch/i386/kernel/sys_i386.c    Sat Dec 21 08:24:45 2002
> +++ edited/arch/i386/kernel/sys_i386.c  Fri Dec 27 19:08:30 2002
> @@ -54,6 +54,8 @@
>                 file = fget(fd);
>                 if (!file)
>                         goto out;
> +               if (prot & PROT_EXEC)
> +                       flags |= MAP_POPULATE | MAP_NONBLOCK;

Yes, this could be used to prototype it, I think.

It doesn't work as-is, because remap_file_pages() requires a shared
mapping.  Disabling that check results in a scrogged ld.so and a
non-booting system.  remap_file_pages() plays games with the vma
protection in ways which I do not understand.

So hum.  I'll finish off some other stuff, take a more detailed look
at this soon.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: shared pagetable benchmarking
  2002-12-28  6:58                             ` Andrew Morton
@ 2002-12-28  7:39                               ` Ingo Molnar
  2002-12-28  7:47                               ` Linus Torvalds
  1 sibling, 0 replies; 24+ messages in thread
From: Ingo Molnar @ 2002-12-28  7:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Martin J. Bligh, Dave McCracken, Daniel Phillips,
	linux-mm

On Fri, 27 Dec 2002, Andrew Morton wrote:

> Yup.  Ingo said at the time:
> 
>   It would be faster to iterate the pagecache mapping's radix tree
>   and the pagetables at once, but it's also *much* more complex. I have
>   tried to implement it and had to unroll the change - mixing radix tree
>   walking and pagetable walking and getting all the VM details right is
>   really complex - especially considering all the re-lookup race checks
>   that have to occur upon IO.
> 
> But find_get_pages() is well-suited to this, and was not in place when
> he did this work.

i agree that find_get_pages() would simplify this work. I did not consider
group-lookup - i tried to implement an algorithm that had a single-page
scope, to keep the amount of locked pages to the minimum.

	Ingo


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: shared pagetable benchmarking
  2002-12-28  6:58                             ` Andrew Morton
  2002-12-28  7:39                               ` Ingo Molnar
@ 2002-12-28  7:47                               ` Linus Torvalds
  2002-12-28 23:28                                 ` Andrew Morton
  1 sibling, 1 reply; 24+ messages in thread
From: Linus Torvalds @ 2002-12-28  7:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Martin J. Bligh, Dave McCracken, Daniel Phillips, linux-mm, Ingo Molnar

On Fri, 27 Dec 2002, Andrew Morton wrote:
> >                 if (!file)
> >                         goto out;
> > +               if (prot & PROT_EXEC)
> > +                       flags |= MAP_POPULATE | MAP_NONBLOCK;
> 
> Yes, this could be used to prototype it, I think.
> 
> It doesn't work as-is, because remap_file_pages() requires a shared
> mapping.  Disabling that check results in a scrogged ld.so and a
> non-booting system.  remap_file_pages() plays games with the vma
> protection in ways which I do not understand.

Ahh.. Those file protection games are wrong for anything but the specific 
case of the sys_remap_file_pages() system call. The mmap() case should 
_not_ use that system call path at all, but should instead just call the 
populate function directly. Something like the appended patch.

CAREFUL! I've not checked all the details on this, but moving the
MAP_POPULATE check upwards should get rid of the problems with the vma
going away etc, so it should make this at least closer to correct, and
makes all the extra work that sys_remap_file_pages() does totally
unnecessary, since we know the vma and ranges already.

This has not been compiled, much less tested. Consider it an example ONLY.

		Linus

----
===== arch/i386/kernel/sys_i386.c 1.10 vs edited =====
--- 1.10/arch/i386/kernel/sys_i386.c	Sat Dec 21 08:24:45 2002
+++ edited/arch/i386/kernel/sys_i386.c	Fri Dec 27 19:08:30 2002
@@ -54,6 +54,8 @@
 		file = fget(fd);
 		if (!file)
 			goto out;
+		if (prot & PROT_EXEC)
+			flags |= MAP_POPULATE | MAP_NONBLOCK;
 	}
 
 	down_write(&current->mm->mmap_sem);
===== mm/mmap.c 1.58 vs edited =====
--- 1.58/mm/mmap.c	Sat Dec 14 09:42:45 2002
+++ edited/mm/mmap.c	Fri Dec 27 23:45:45 2002
@@ -576,6 +576,11 @@
 		error = file->f_op->mmap(file, vma);
 		if (error)
 			goto unmap_and_free_vma;
+
+		if (flags & MAP_POPULATE) {
+			if (vma->vm_ops && vma->vm_ops->populate)
+				vma->vm_ops->populate(vma, addr, len, prot, pgoff, flags & MAP_NONBLOCK);
+		}
 	} else if (vm_flags & VM_SHARED) {
 		error = shmem_zero_setup(vma);
 		if (error)
@@ -606,12 +611,6 @@
 	if (vm_flags & VM_LOCKED) {
 		mm->locked_vm += len >> PAGE_SHIFT;
 		make_pages_present(addr, addr + len);
-	}
-	if (flags & MAP_POPULATE) {
-		up_write(&mm->mmap_sem);
-		sys_remap_file_pages(addr, len, prot,
-					pgoff, flags & MAP_NONBLOCK);
-		down_write(&mm->mmap_sem);
 	}
 	return addr;
 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: shared pagetable benchmarking
  2002-12-28  7:47                               ` Linus Torvalds
@ 2002-12-28 23:28                                 ` Andrew Morton
  0 siblings, 0 replies; 24+ messages in thread
From: Andrew Morton @ 2002-12-28 23:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Martin J. Bligh, Dave McCracken, Daniel Phillips, linux-mm, Ingo Molnar

Linus Torvalds wrote:
> 
> > ...
> The mmap() case should
> _not_ use that system call path at all, but should instead just call the
> populate function directly. Something like the appended patch.

Seems to do the right thing, but alas, it's slower:

without:
pushpatch 99  8.20s user 10.00s system 99% cpu 18.341 total
poppatch 99  5.76s user 6.65s system 99% cpu 12.521 total
c0114c64 kmap_atomic_to_page                          84   0.9438
c01308ec handle_mm_fault                              92   0.4340
c01c4b58 __copy_from_user                             94   0.8393
c012f330 clear_page_tables                           113   0.5650
c01305b0 do_anonymous_page                           123   0.3844
c011a9c0 do_softirq                                  145   0.8239
c0113d9c pte_alloc_one                               146   1.1406
c012f534 copy_page_range                             174   0.3595
c01c4af0 __copy_to_user                              188   1.8077
c01306f0 do_no_page                                  241   0.4744
c012f718 zap_pte_range                               265   0.6370
c0113ec0 do_page_fault                               321   0.2956
c0133a8c page_add_rmap                               322   1.1838
c0114be4 kmap_atomic                                 326   3.0185
c0133b9c page_remove_rmap                            360   0.9574
c012ff54 do_wp_page                                 1245   1.9095
00000000 total                                      6812   0.0042

(374019 pagefaults)

with:
pushpatch 99  8.16s user 11.76s system 99% cpu 20.072 total
poppatch 99  5.68s user 7.93s system 99% cpu 13.656 total
c012f330 clear_page_tables                           111   0.5550
c0114c64 kmap_atomic_to_page                         121   1.3596
c0113d9c pte_alloc_one                               140   1.0938
c011a9c0 do_softirq                                  150   0.8523
c01305b0 do_anonymous_page                           157   0.4906
c01c4af0 __copy_to_user                              157   1.5096
c012e590 install_page                                202   0.6012
c0113ec0 do_page_fault                               209   0.1924
c012f534 copy_page_range                             215   0.4442
c01306f0 do_no_page                                  224   0.4409
c0114be4 kmap_atomic                                 392   3.6296
c012f718 zap_pte_range                               417   1.0024
c0133a8c page_add_rmap                               563   2.0699
c0133b9c page_remove_rmap                            653   1.7367
c012ff54 do_wp_page                                 1318   2.0215
00000000 total                                      8072   0.0050

(240622 pagefaults)

That's uniprocessor, highpte.  Pagefaults drop by about a third, but
system time goes up.  Presumably there are lots of cached libc pages
which get mapped up front and which these scripts don't actually need.

It needs more analysis/instrumentation/work, but it's not promising.
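
To put numbers on "slower", the deltas between the two runs can be worked
out directly from the figures above. A small sketch; pct_change() and
report() are illustrative names, and the inputs are copied from the timings
and profiles quoted in this message.

```c
#include <assert.h>
#include <stdio.h>

/* Relative change from `before` to `after`, in percent. */
double pct_change(double before, double after)
{
	assert(before != 0.0);
	return (after - before) / before * 100.0;
}

/* Print the without/with deltas for the numbers quoted above. */
void report(void)
{
	printf("pushpatch system time: %+.1f%%\n", pct_change(10.00, 11.76));
	printf("poppatch system time:  %+.1f%%\n", pct_change(6.65, 7.93));
	printf("pushpatch total time:  %+.1f%%\n", pct_change(18.341, 20.072));
	printf("poppatch total time:   %+.1f%%\n", pct_change(12.521, 13.656));
	printf("pagefaults:            %+.1f%%\n", pct_change(374019, 240622));
	printf("profile ticks:         %+.1f%%\n", pct_change(6812, 8072));
	printf("page_add_rmap:         %+.1f%%\n", pct_change(322, 563));
	printf("page_remove_rmap:      %+.1f%%\n", pct_change(360, 653));
}
```

Calling report() from main() prints one line per metric. The rmap lines are
the interesting ones: they grow far more than the fault count shrinks.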

Cache misses against the pte_chains are what is hurting here. Something
which may help on P4 is to keep the pte_chains at 32 bytes, so that
virtually-adjacent pages' pte_chains will probably share cachelines.  I
have a pseudo-4way HT box sitting here awaiting commissioning...
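
The cacheline argument can be made concrete with a little arithmetic. This
is a sketch under assumptions: a 128-byte line size for the P4 (not stated
in this message), and pte_chains packed back to back from a line-aligned
base, which the allocator does not guarantee for virtually-adjacent pages.
objs_per_line() and same_line() are hypothetical helpers.

```c
#include <assert.h>
#include <stddef.h>

/* How many fixed-size objects fit in one cacheline. */
size_t objs_per_line(size_t obj_size, size_t line_size)
{
	assert(obj_size > 0 && obj_size <= line_size);
	return line_size / obj_size;
}

/*
 * Nonzero if objects i and j of a packed array that starts on a
 * cacheline boundary fall within the same cacheline.
 */
int same_line(size_t i, size_t j, size_t obj_size, size_t line_size)
{
	assert(line_size > 0);
	return (i * obj_size) / line_size == (j * obj_size) / line_size;
}
```

With 32-byte chains and an assumed 128-byte line, objs_per_line(32, 128)
gives four chains per line, so one miss could cover up to four neighbouring
pages. A chain padded out to the full line size shares with nobody.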

end of thread

Thread overview: 24+ messages
2002-12-20 11:11 shared pagetable benchmarking Andrew Morton
2002-12-20 11:13 ` William Lee Irwin III
2002-12-20 16:30 ` Dave McCracken
2002-12-20 19:59   ` Andrew Morton
2002-12-23 16:15     ` Dave McCracken
2002-12-23 23:54       ` Andrew Morton
2002-12-27  9:39       ` Daniel Phillips
2002-12-27  9:58         ` Andrew Morton
2002-12-27 15:59           ` Daniel Phillips
2002-12-27 20:02             ` Linus Torvalds
2002-12-27 20:16               ` Dave McCracken
2002-12-27 20:18                 ` Linus Torvalds
2002-12-27 20:45                   ` Dave McCracken
2002-12-27 20:50                     ` Linus Torvalds
2002-12-27 23:56                       ` Daniel Phillips
2002-12-28  0:45                       ` Martin J. Bligh
2002-12-28  2:34                         ` Andrew Morton
2002-12-28  3:10                           ` Linus Torvalds
2002-12-28  6:58                             ` Andrew Morton
2002-12-28  7:39                               ` Ingo Molnar
2002-12-28  7:47                               ` Linus Torvalds
2002-12-28 23:28                                 ` Andrew Morton
2002-12-28  3:19                           ` Martin J. Bligh
2002-12-23 18:19 ` Dave McCracken
