* shared pagetable benchmarking
@ 2002-12-20 11:11 Andrew Morton
2002-12-20 11:13 ` William Lee Irwin III
` (2 more replies)
0 siblings, 3 replies; 24+ messages in thread
From: Andrew Morton @ 2002-12-20 11:11 UTC (permalink / raw)
To: Dave McCracken, linux-mm
Did a bit of timing and profiling. It's a uniprocessor
kernel, 7G, PAE.
The workload is application and removal of ~80 patches using
my patch scripts. Tons and tons of forks from bash.
2.5 ends up being 13% slower than 2.4, after disabling highpte
to make it fair. 3%-odd of this is HZ=1000. So say 10%.
Pagetable sharing actually slowed this test down by several
percent overall. Which is unfortunate, because the main
thing which Linus likes about shared pagetables is that it
"speeds up forks".
Is there anything we can do to fix all of this up a bit?
2.4.21-pre2:
c0106d60 system_call 10 0.1786
c012ca00 __free_pages_ok 10 0.0124
c0114c38 mm_init 13 0.0396
c0124a84 find_vma 13 0.1548
c01283ac generic_file_write 13 0.0073
c012cd24 rmqueue 14 0.0201
c0122570 __free_pte 16 0.1667
c0123df0 handle_mm_fault 16 0.0625
c0123aac do_anonymous_page 25 0.0801
c0126c3c file_read_actor 33 0.1684
c0123be4 do_no_page 37 0.0706
c01226d0 copy_page_range 47 0.0810
c0112878 do_page_fault 49 0.0352
c01225e8 clear_page_tables 70 0.3017
c0122914 zap_page_range 72 0.1118
c01234b0 do_wp_page 275 0.3736
00000000 total 1062 0.0008
created /tmp/prof.time
akpm-prof pushpatch 99 11.83s user 10.56s system 99% cpu 22.439 total
c01283ac generic_file_write 9 0.0050
c012d8b4 free_page_and_swap_cache 9 0.1500
c012625c __find_get_page 10 0.2083
c0118cdc exit_notify 11 0.0174
c012ca00 __free_pages_ok 11 0.0137
c013c030 link_path_walk 16 0.0077
c012cd24 rmqueue 18 0.0259
c0123aac do_anonymous_page 20 0.0641
c0126c3c file_read_actor 25 0.1276
c01226d0 copy_page_range 27 0.0466
c0123be4 do_no_page 29 0.0553
c01225e8 clear_page_tables 32 0.1379
c01052b0 poll_idle 33 0.8250
c0122914 zap_page_range 42 0.0652
c0112878 do_page_fault 50 0.0359
c01234b0 do_wp_page 161 0.2188
00000000 total 791 0.0006
created /tmp/prof.time
akpm-prof poppatch 99 8.60s user 7.57s system 97% cpu 16.530 total
2.5.52-mm3:
c012b998 free_hot_cold_page 94 0.4896
c0117348 do_schedule 103 0.1717
c012ba7c buffered_rmqueue 110 0.5729
c01c1b4c strnlen_user 116 1.3810
c012e36c kmem_cache_alloc 120 1.8750
c0134454 find_vma 133 1.5114
c010a558 system_call 134 3.0455
c01504ac d_lookup 143 0.6164
c0148af4 link_path_walk 153 0.0915
c0116b88 kmap_atomic_to_page 175 1.9886
c0133060 handle_mm_fault 195 0.6414
c01c1d48 __copy_from_user 212 1.8929
c011598c pte_alloc_one 213 1.6641
c0132c74 do_anonymous_page 260 0.6311
c011cab0 do_softirq 300 1.7045
c01c1ce0 __copy_to_user 369 3.5481
c0132e10 do_no_page 529 0.8936
c013c9e4 pte_unshare 572 0.5789
c0116b04 kmap_atomic 585 5.2232
c0131b44 zap_pte_range 585 1.4199
c0115bc0 do_page_fault 600 0.4808
c0136428 page_add_rmap 766 2.1517
c0131890 clear_page_tables 860 2.6220
c013658c page_remove_rmap 928 1.9333
c013250c do_wp_page 2594 3.3601
00000000 total 15261 0.0097
created /tmp/prof.time
akpm-prof pushpatch 99 12.36s user 14.61s system 97% cpu 27.768 total
c0117348 do_schedule 77 0.1283
c010a558 system_call 85 1.9318
c0134454 find_vma 90 1.0227
c01504ac d_lookup 106 0.4569
c0116b88 kmap_atomic_to_page 107 1.2159
c0133060 handle_mm_fault 113 0.3717
c011598c pte_alloc_one 135 1.0547
c0148af4 link_path_walk 135 0.0807
c01c1d48 __copy_from_user 162 1.4464
c0132c74 do_anonymous_page 218 0.5291
c01c1ce0 __copy_to_user 297 2.8558
c011cab0 do_softirq 319 1.8125
c0132e10 do_no_page 325 0.5490
c0131b44 zap_pte_range 362 0.8786
c013c9e4 pte_unshare 375 0.3796
c0116b04 kmap_atomic 384 3.4286
c0115bc0 do_page_fault 447 0.3582
c0136428 page_add_rmap 505 1.4185
c0131890 clear_page_tables 563 1.7165
c013658c page_remove_rmap 585 1.2188
c013250c do_wp_page 1559 2.0194
00000000 total 10586 0.0067
created /tmp/prof.time
akpm-prof poppatch 99 9.00s user 10.31s system 96% cpu 19.926 total
OK, remove shpte:
=================
c0134344 find_vma 110 1.2500
c01c07fc strnlen_user 112 1.3333
c0118c94 copy_process 113 0.0456
c012b77c buffered_rmqueue 120 0.6250
c014f0a0 __d_lookup 133 0.6520
c010a558 system_call 145 3.2955
c01474e0 link_path_walk 162 0.0769
c0116b28 kmap_atomic_to_page 165 1.8750
c0132f00 handle_mm_fault 166 0.5461
c012e02c kmem_cache_alloc 185 2.8906
c012e0d8 kmem_cache_free 188 2.9375
c0115a0c pgd_alloc 193 0.9650
c01c09f8 __copy_from_user 196 1.7500
c011598c pte_alloc_one 216 1.6875
c0132b24 do_anonymous_page 249 0.6163
c011c880 do_softirq 293 1.6648
c01c0990 __copy_to_user 418 4.0192
c0132cb8 do_no_page 446 0.7637
c01317b4 copy_page_range 496 0.7425
c0116aa4 kmap_atomic 591 5.2768
c0115b60 do_page_fault 594 0.4760
c0131a50 zap_pte_range 609 1.2378
c0136314 page_add_rmap 632 1.7556
c01314f0 clear_page_tables 688 2.2051
c013647c page_remove_rmap 817 1.7021
c01323d4 do_wp_page 2600 3.4759
00000000 total 14713 0.0094
created /tmp/prof.time
akpm-prof pushpatch 99 12.29s user 14.40s system 99% cpu 26.913 total
c014f0a0 __d_lookup 91 0.4461
c010a558 system_call 93 2.1136
c0132f00 handle_mm_fault 111 0.3651
c012e02c kmem_cache_alloc 113 1.7656
c01474e0 link_path_walk 118 0.0560
c011598c pte_alloc_one 129 1.0078
c0116b28 kmap_atomic_to_page 129 1.4659
c012e0d8 kmem_cache_free 140 2.1875
c01c09f8 __copy_from_user 160 1.4286
c0115a0c pgd_alloc 170 0.8500
c0132b24 do_anonymous_page 184 0.4554
c011c880 do_softirq 297 1.6875
c01317b4 copy_page_range 309 0.4626
c01c0990 __copy_to_user 318 3.0577
c0132cb8 do_no_page 335 0.5736
c0131a50 zap_pte_range 364 0.7398
c01314f0 clear_page_tables 393 1.2596
c0115b60 do_page_fault 441 0.3534
c0136314 page_add_rmap 441 1.2250
c0116aa4 kmap_atomic 448 4.0000
c013647c page_remove_rmap 550 1.1458
c01323d4 do_wp_page 1593 2.1297
00000000 total 10335 0.0066
created /tmp/prof.time
akpm-prof poppatch 99 9.07s user 10.03s system 99% cpu 19.290 total
Also remove highpte
===================
c01c037c strnlen_user 108 1.2857
c010a558 system_call 111 2.5227
c012b79c buffered_rmqueue 113 0.5885
c0134144 find_vma 117 1.3295
c01171e4 do_schedule 118 0.1954
c0132d20 handle_mm_fault 132 0.4583
c014ebf0 __d_lookup 142 0.6961
c0131010 page_address 147 1.0500
c0116ae4 kmap_atomic 163 1.4554
c0147030 link_path_walk 186 0.0882
c01c0578 __copy_from_user 203 1.8125
c011597c pte_alloc_one 224 1.5556
c0115a0c pgd_alloc 231 1.1550
c0132990 do_anonymous_page 260 0.7065
c011c8d0 do_softirq 283 1.6080
c01c0510 __copy_to_user 380 3.6538
c0132b00 do_no_page 401 0.7371
c01316fc copy_page_range 451 0.7723
c0131944 zap_pte_range 588 1.1575
c013602c page_add_rmap 601 2.5042
c0115b60 do_page_fault 607 0.4716
c0131440 clear_page_tables 637 2.1233
c013611c page_remove_rmap 657 2.0030
c01322b4 do_wp_page 2530 3.6988
00000000 total 13554 0.0087
created /tmp/prof.time
akpm-prof pushpatch 99 11.97s user 13.36s system 99% cpu 25.541 total
c0132d20 handle_mm_fault 100 0.3472
c0147030 link_path_walk 106 0.0503
c0116ae4 kmap_atomic 109 0.9732
c0115a0c pgd_alloc 140 0.7000
c011597c pte_alloc_one 151 1.0486
c01c0578 __copy_from_user 162 1.4464
c0132990 do_anonymous_page 204 0.5543
c011c8d0 do_softirq 305 1.7330
c01c0510 __copy_to_user 308 2.9615
c0132b00 do_no_page 310 0.5699
c01316fc copy_page_range 314 0.5377
c013602c page_add_rmap 361 1.5042
c0131944 zap_pte_range 379 0.7461
c0115b60 do_page_fault 409 0.3178
c013611c page_remove_rmap 430 1.3110
c0131440 clear_page_tables 443 1.4767
c01322b4 do_wp_page 1662 2.4298
00000000 total 9706 0.0062
created /tmp/prof.time
akpm-prof poppatch 99 8.86s user 9.33s system 98% cpu 18.433 total
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
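The percentages quoted at the top of the message can be checked against the wall-clock totals on the "akpm-prof" lines. A minimal sketch in Python, assuming the pushpatch and poppatch totals are simply added for each kernel; the dictionary labels are shorthand introduced here, not names from the original post:

```python
# Wall-clock totals in seconds, copied from the "akpm-prof" lines above:
# (pushpatch total, poppatch total) for each kernel configuration.
TOTALS = {
    "2.4.21-pre2": (22.439, 16.530),
    "2.5.52-mm3": (27.768, 19.926),
    "2.5.52-mm3, no shpte": (26.913, 19.290),
    "2.5.52-mm3, no shpte, no highpte": (25.541, 18.433),
}


def combined(label):
    """Push plus pop wall-clock time for one configuration."""
    push, pop = TOTALS[label]
    return push + pop


def percent_slower(label, baseline):
    """How much slower `label` is than `baseline`, in percent."""
    return 100.0 * (combined(label) / combined(baseline) - 1.0)


if __name__ == "__main__":
    base = "2.4.21-pre2"
    for label in TOTALS:
        if label != base:
            print(f"{label}: {percent_slower(label, base):.1f}% slower than {base}")
    cost = percent_slower("2.5.52-mm3", "2.5.52-mm3, no shpte")
    print(f"shpte enabled vs. removed: {cost:.1f}% slower")
```

Under that assumption, 2.5 without shpte and highpte comes out roughly 13% slower than 2.4, and enabling shpte costs about 3%, which is consistent with the figures in the message.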
* Re: shared pagetable benchmarking
From: William Lee Irwin III @ 2002-12-20 11:13 UTC (permalink / raw)
To: Andrew Morton; +Cc: Dave McCracken, linux-mm

On Fri, Dec 20, 2002 at 03:11:09AM -0800, Andrew Morton wrote:
> Did a bit of timing and profiling. It's a uniprocessor
> kernel, 7G, PAE.
> The workload is application and removal of ~80 patches using
> my patch scripts. Tons and tons of forks from bash.
> 2.5 ends up being 13% slower than 2.4, after disabling highpte
> to make it fair. 3%-odd of this is HZ=1000. So say 10%.
> Pagetable sharing actually slowed this test down by several
> percent overall. Which is unfortunate, because the main
> thing which Linus likes about shared pagetables is that it
> "speeds up forks".
> Is there anything we can do to fix all of this up a bit?

For testing purposes, try removing the opportunistic mmap()-time sharing.

Bill
* Re: shared pagetable benchmarking
From: Dave McCracken @ 2002-12-20 16:30 UTC (permalink / raw)
To: Andrew Morton, linux-mm

--On Friday, December 20, 2002 03:11:09 -0800 Andrew Morton
<akpm@digeo.com> wrote:

> The workload is application and removal of ~80 patches using
> my patch scripts. Tons and tons of forks from bash.
>
> 2.5 ends up being 13% slower than 2.4, after disabling highpte
> to make it fair. 3%-odd of this is HZ=1000. So say 10%.
>
> Pagetable sharing actually slowed this test down by several
> percent overall. Which is unfortunate, because the main
> thing which Linus likes about shared pagetables is that it
> "speeds up forks".
>
> Is there anything we can do to fix all of this up a bit?

Ok, let's consider just what shared page tables does for fork.

In fork without shared page tables, there is a fixed cost per mapped page
where the pte entry has to be copied from the parent's pte page to the
child's. This cost is higher for resident pages in 2.5 than 2.4 because
of rmap.

What shared page tables does isn't reduce that cost, it just defers it by
marking each pte page copy-on-write. The cost is incurred when either the
parent or the child first tries to write to a page in that pte page. The
savings comes when there are pte pages that never have to be unshared,
either because they map a shared region or they're not written to
(typically because the child quickly does an exec).

The worst case condition for shared page tables is when every pte page has
to be unshared. Unfortunately this is also a common case. Almost every
parent or child will touch three pages after fork: the current stack page,
libc's data page, and the application's data page. Each one of these is in
a separate pte page.

Since each pte page maps 4M (2M for PAE), small processes only have those
three pte pages, and they're all unshared. Unfortunately this includes
most base utilities, in particular shells, so shell scripts will not
benefit from shared page tables.

There is a small penalty for deferring the pte page copy, as Andrew's tests
show. However, as soon as even one pte page is not copied, fork
performance improves dramatically. My tests show that fork/exit for a 4
pte page process is about 25% to 30% faster with shared page tables than
without, simply because of the single extra page that's not unshared. This
savings is multiplied for each additional pte page that remains shared.

I'll look for ways to optimize the unsharing to reduce the penalty, but I'm
not optimistic that we can eliminate it entirely.

Let's also not lose sight of what I consider the primary goal of shared
page tables, which is to greatly reduce the page table memory overhead of
massively shared large regions.

Dave McCracken

======================================================================
Dave McCracken          IBM Linux Base Kernel Team      1-512-838-3059
dmccr@us.ibm.com                                        T/L   678-3059
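Dave's argument about small processes can be made concrete by counting pte pages. The following is a rough sketch, not kernel code: the 4M and 2M spans come from the message, while the address layout, the write addresses and the function names are hypothetical and chosen only for illustration.

```python
MIB = 1024 * 1024
# Address range mapped by one pte page, as stated in the message.
PTE_PAGE_SPAN = {"non-PAE": 4 * MIB, "PAE": 2 * MIB}


def pte_pages_covering(ranges, span):
    """Indices of the pte pages needed to map half-open (start, end) ranges."""
    pages = set()
    for start, end in ranges:
        pages.update(range(start // span, (end - 1) // span + 1))
    return pages


def still_shared_after(ranges, written_addresses, span):
    """Pte pages that remain shared after fork, given the addresses written."""
    unshared = {addr // span for addr in written_addresses}
    return pte_pages_covering(ranges, span) - unshared


# Hypothetical layout of a small process: (start, end) of each mapping.
SMALL_PROCESS = [
    (0x08048000, 0x08060000),  # application text and data
    (0x40000000, 0x40140000),  # libc text and data
    (0xBFFFC000, 0xC0000000),  # stack
]
# One write each to application data, libc data and the current stack page.
WRITES = [0x0805F000, 0x4013F000, 0xBFFFF000]

# A larger process that also maps 4M it never writes to (also hypothetical).
LARGER_PROCESS = SMALL_PROCESS + [(0x50000000, 0x50400000)]

if __name__ == "__main__":
    for name, span in PTE_PAGE_SPAN.items():
        small = still_shared_after(SMALL_PROCESS, WRITES, span)
        large = still_shared_after(LARGER_PROCESS, WRITES, span)
        print(f"{name}: small keeps {len(small)} shared, larger keeps {len(large)}")
```

In this toy layout the small process ends up with no shared pte pages at all, which is the worst case described above, while the larger process keeps the pte pages of its untouched region shared.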
* Re: shared pagetable benchmarking
From: Andrew Morton @ 2002-12-20 19:59 UTC (permalink / raw)
To: Dave McCracken; +Cc: linux-mm

Dave McCracken wrote:
>
> [ ... ]
>

Thanks.

> I'll look for ways to optimize the unsharing to reduce the penalty, but I'm
> not optimistic that we can eliminate it entirely.

So changing userspace to place its writeable memory on a new 4M boundary
would be a big win?

It's years since I played with elf, but I think this is feasible. Change
the linker and just wait for it to propagate.

Do we know someone who can guide us in prototyping that?

Do we know where the writes are occurring?

> Let's also not lose sight of what I consider the primary goal of shared
> page tables, which is to greatly reduce the page table memory overhead of
> massively shared large regions.

Well yes. But this is optimising the (extremely) uncommon case while
penalising the (very) common one.

It's the same with the reverse map - we've gone and added significant
expense even to machines and workloads which perform no page reclaim at
all. Perhaps pagetable sharing can get that back for us.
* Re: shared pagetable benchmarking
From: Dave McCracken @ 2002-12-23 16:15 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-mm

--On Friday, December 20, 2002 11:59:12 -0800 Andrew Morton
<akpm@digeo.com> wrote:

> So changing userspace to place its writeable memory on a new 4M boundary
> would be a big win?
>
> It's years since I played with elf, but I think this is feasible. Change
> the linker and just wait for it to propagate.

Actually it'd require changes to both the linker and the kernel memory
range allocator. Right now ld.so maps all memory needed for an entire
shared library, then uses mprotect and MAP_FIXED to modify parts of it to
be writable (or at least that's what I see using strace). If it was done
using separate mmap calls we could redirect the writable regions to be in a
different pmd.

>> Let's also not lose sight of what I consider the primary goal of shared
>> page tables, which is to greatly reduce the page table memory overhead of
>> massively shared large regions.
>
> Well yes. But this is optimising the (extremely) uncommon case while
> penalising the (very) common one.

I guess I don't see wasting extra pte pages on duplicated mappings of
shared memory as extremely uncommon. Granted, it's not that significant
for small applications, but it can make a machine unusable with some large
applications. I think being able to run applications that couldn't run
before to be worth some consideration.

I also have a couple of ideas for ways to eliminate the penalty for small
tasks. Would you grant that it's a worthwhile effort if the penalty for
small applications was zero?

Dave McCracken

======================================================================
Dave McCracken          IBM Linux Base Kernel Team      1-512-838-3059
dmccr@us.ibm.com                                        T/L   678-3059
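One way to see whether a process's writable regions sit apart from its read-only ones is to scan its /proc/<pid>/maps listing. The sketch below is an illustration only: it assumes the usual "start-end perms offset dev inode path" line format, uses the 4M figure from the discussion as the window size, and runs on a made-up sample rather than on output from the thread. The function names are hypothetical.

```python
SPAN = 4 * 1024 * 1024  # the 4M boundary under discussion (2M with PAE)


def parse_maps(text):
    """Yield (start, end, writable) for each line of /proc/<pid>/maps text."""
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        start, end = (int(x, 16) for x in fields[0].split("-"))
        yield start, end, "w" in fields[1]


def windows(start, end, span=SPAN):
    """Span-aligned windows (one pte page each) covered by [start, end)."""
    return set(range(start // span, (end - 1) // span + 1))


def mixed_windows(text, span=SPAN):
    """Windows holding both writable and non-writable mappings."""
    writable, readonly = set(), set()
    for start, end, is_writable in parse_maps(text):
        target = writable if is_writable else readonly
        target.update(windows(start, end, span))
    return writable & readonly


# Made-up sample: program data directly follows program text.
SAMPLE = """\
08048000-080c0000 r-xp 00000000 03:01 1234 /bin/example
080c0000-080c6000 rw-p 00078000 03:01 1234 /bin/example
40000000-40012000 r-xp 00000000 03:01 5678 /lib/libexample.so
bfffc000-c0000000 rw-p 00000000 00:00 0
"""

if __name__ == "__main__":
    print(sorted(hex(w * SPAN) for w in mixed_windows(SAMPLE)))
```

In the sample, the program's text and data fall into the same 4M window, so the first write to the data would force that pte page to be unshared. Moving the writable mapping to its own window, as proposed in the thread, would leave the list empty.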
* Re: shared pagetable benchmarking
From: Andrew Morton @ 2002-12-23 23:54 UTC (permalink / raw)
To: Dave McCracken; +Cc: linux-mm

Dave McCracken wrote:
>
> --On Friday, December 20, 2002 11:59:12 -0800 Andrew Morton
> <akpm@digeo.com> wrote:
>
> > So changing userspace to place its writeable memory on a new 4M boundary
> > would be a big win?
> >
> > It's years since I played with elf, but I think this is feasible. Change
> > the linker and just wait for it to propagate.
>
> Actually it'd require changes to both the linker and the kernel memory
> range allocator. Right now ld.so maps all memory needed for an entire
> shared library, then uses mprotect and MAP_FIXED to modify parts of it to
> be writable (or at least that's what I see using strace). If it was done
> using separate mmap calls we could redirect the writable regions to be in a
> different pmd.

Yup.

Over the weekend I got all this going. With binutils patches from HJ,
a kernel patch from Bill and tons of rebuilding things I had everything
in /proc/pid/maps on a separate 4M segment. I also fixed
run-child-first-on-fork.

Summary:

                   2.4.20   2.5-shpte   2.5-shpte+weekend_hacks
aim9 fork_test      1950      1300        1700
aim9 exec_test       700       545         572
patch-scripts       16.5      19.5        18.5

The fork test isn't very interesting. When you toss in an exec(), the
benefits are small.

It appears that Linus's only interest in shared pagetables is that it
could reclaim the fork/exec overhead which the reverse mapping introduced.
As far as I can tell he is not concerned about space consumption issues.

And if that is the selection criterion, I do not believe that these
speedups are sufficient to warrant a merge.

> >> Let's also not lose sight of what I consider the primary goal of shared
> >> page tables, which is to greatly reduce the page table memory overhead of
> >> massively shared large regions.
> >
> > Well yes. But this is optimising the (extremely) uncommon case while
> > penalising the (very) common one.
>
> I guess I don't see wasting extra pte pages on duplicated mappings of
> shared memory as extremely uncommon. Granted, it's not that significant
> for small applications, but it can make a machine unusable with some large
> applications. I think being able to run applications that couldn't run
> before to be worth some consideration.
>
> I also have a couple of ideas for ways to eliminate the penalty for small
> tasks. Would you grant that it's a worthwhile effort if the penalty for
> small applications was zero?
>

It's not my call, David. I've been putting myself in the role of
helping to get the code working and tested, and providing Linus
with whatever info can help him make a decision.

I guess he works by observing what people are talking about, asking
about and hurting over on the mailing lists. As well as his own
experience. And the issue of pagetable consumption just doesn't
have any visibility.

I expect his position would be that it's a specialised, rare problem
and that the fix is more appropriate to a specialised vendor kernel.

I suggest that you discuss it with him. If that ends up being thumbs-down
I can continue to maintain the patch across 2.6.x.
* Re: shared pagetable benchmarking
From: Daniel Phillips @ 2002-12-27 9:39 UTC (permalink / raw)
To: Dave McCracken, Andrew Morton; +Cc: linux-mm

On Monday 23 December 2002 17:15, Dave McCracken wrote:
> >> Let's also not lose sight of what I consider the primary goal of shared
> >> page tables, which is to greatly reduce the page table memory overhead
> >> of massively shared large regions.
> >
> > Well yes. But this is optimising the (extremely) uncommon case while
> > penalising the (very) common one.
>
> I guess I don't see wasting extra pte pages on duplicated mappings of
> shared memory as extremely uncommon. Granted, it's not that significant
> for small applications, but it can make a machine unusable with some large
> applications. I think being able to run applications that couldn't run
> before to be worth some consideration.
>
> I also have a couple of ideas for ways to eliminate the penalty for small
> tasks. Would you grant that it's a worthwhile effort if the penalty for
> small applications was zero?

Hi Dave, Andrew,

A feature of my original demonstration patch was that I could
enable/disable sharing with a per-fork granularity. This is a good thing.
You can use this by detecting the case you can't optimize, i.e., forking
from bash, and essentially using the old code. The sawoff for improved
efficiency comes in somewhere over 4 meg worth of shared memory, which just
doesn't happen in fork+exec from bash. Then there is always-unshare
situation with the stack, which I'm sure you're aware of, where it's never
worth doing the share.

That said, was not Ingo working on a replacement for fork+exec that doesn't
do the useless fork? Would this not make the vast majority of
impossible-to-optimize cases go away?

Regards,

Daniel
* Re: shared pagetable benchmarking
From: Andrew Morton @ 2002-12-27 9:58 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Dave McCracken, linux-mm

Daniel Phillips wrote:
>
> On Monday 23 December 2002 17:15, Dave McCracken wrote:
> > >> Let's also not lose sight of what I consider the primary goal of shared
> > >> page tables, which is to greatly reduce the page table memory overhead
> > >> of massively shared large regions.
> > >
> > > Well yes. But this is optimising the (extremely) uncommon case while
> > > penalising the (very) common one.
> >
> > I guess I don't see wasting extra pte pages on duplicated mappings of
> > shared memory as extremely uncommon. Granted, it's not that significant
> > for small applications, but it can make a machine unusable with some large
> > applications. I think being able to run applications that couldn't run
> > before to be worth some consideration.
> >
> > I also have a couple of ideas for ways to eliminate the penalty for small
> > tasks. Would you grant that it's a worthwhile effort if the penalty for
> > small applications was zero?
>
> Hi Dave, Andrew,

Daniel!

> A feature of my original demonstration patch was that I could enable/disable
> sharing with a per-fork granularity. This is a good thing. You can use this
> by detecting the case you can't optimize, i.e., forking from bash, and
> essentially using the old code. The sawoff for improved efficiency comes in
> somewhere over 4 meg worth of shared memory, which just doesn't happen in
> fork+exec from bash. Then there is always-unshare situation with the stack,
> which I'm sure you're aware of, where it's never worth doing the share.

Yes, Dave did a prototype of that, and I am sure that it will pull back
the small additional cost of pagetable sharing in those cases.

But that's not the problem. The problem is that it doesn't *speed up*
that case. Which appears to be the only thing which interests Linus
in shared pagetables at this time: he "_hate_"s the fact that fork/exec
got slower.

> That said, was not Ingo working on a replacement for fork+exec that doesn't
> do the useless fork? Would this not make the vast majority of
> impossible-to-optimize cases go away?

That's news to me.

posix_spawn() has been suggested by Ulrich, and he says that things like
bash could easily be converted.

I don't know how much it would gain - possibly not a huge amount; the rmap
setup in exec seems to be where the major cost lies. Plus there's still
exit().
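For readers unfamiliar with the interface mentioned here, the difference between fork+exec and posix_spawn() from the caller's side can be sketched as follows. This is an illustration in Python using the standard `os` module (`os.posix_spawn` needs Python 3.8 or later on a POSIX system); the helper names are hypothetical. Whether posix_spawn() actually avoids the page table copy depends on how the C library and kernel implement it, so the sketch shows only the interface, not a measured gain.

```python
import os
import sys


def run_with_fork_exec(argv):
    """Classic fork() followed by exec() in the child; returns the exit status."""
    pid = os.fork()
    if pid == 0:
        try:
            os.execv(argv[0], argv)
        finally:
            os._exit(127)  # only reached if exec failed
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)


def run_with_posix_spawn(argv):
    """Same job through posix_spawn(); the caller never forks explicitly."""
    pid = os.posix_spawn(argv[0], argv, os.environ)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)


if __name__ == "__main__":
    command = [sys.executable, "-c", "pass"]  # a trivial child process
    print(run_with_fork_exec(command), run_with_posix_spawn(command))
```

Both helpers still pay for the exec and the child's exit, which is the part Andrew suspects dominates.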
* Re: shared pagetable benchmarking
From: Daniel Phillips @ 2002-12-27 15:59 UTC (permalink / raw)
To: Andrew Morton, Daniel Phillips; +Cc: Dave McCracken, linux-mm, Linus Torvalds

Hi Andrew,

On Friday 27 December 2002 10:58, Andrew Morton wrote:
> > A feature of my original demonstration patch was that I could
> > enable/disable sharing with a per-fork granularity. This is a good
> > thing. You can use this by detecting the case you can't optimize, i.e.,
> > forking from bash, and essentially using the old code. The sawoff for
> > improved efficiency comes in somewhere over 4 meg worth of shared memory,
> > which just doesn't happen in fork+exec from bash. Then there is
> > always-unshare situation with the stack, which I'm sure you're aware of,
> > where it's never worth doing the share.
>
> Yes, Dave did a prototype of that, and I am sure that it will pull back
> the small additional cost of pagetable sharing in those cases.
>
> But that's not the problem. The problem is that it doesn't *speed up*
> that case. Which appears to be the only thing which interests Linus
> in shared pagetables at this time: he "_hate_"s the fact that fork/exec
> got slower.

Did you ask Linus? To my thinking, if it breaks even on small forks and
wins on the big forks that are bothering the database people etc (and
aren't we all database people in the end) it's a clear win.

> > That said, was not Ingo working on a replacement for fork+exec that
> > doesn't do the useless fork? Would this not make the vast majority of
> > impossible-to-optimize cases go away?
>
> That's news to me.
>
> posix_spawn() has been suggested by Ulrich, and he says that things like
> bash could easily be converted.

Yes, that's the reference. I somehow got Ingo mixed in there because they
were doing the thread cleanups together at the time, which quite possibly
inspired that rather badly needed improvement.

> I don't know how much it would gain - possibly not a huge amount; the rmap
> setup in exec seems to be where the major cost lies. Plus there's still
> exit().

What you'd lose is the useless setup/teardown of three page table pages
every fork+exec, a good thing regardless of page table sharing, but
especially convenient for sharing, as it carves away a good percentage of
the non-improved cases, allowing the improved ones to stand out more.

Anyway, I'll bow out of the rmap-optimizing game until next cycle. There
are still some nice optimizations that can be done, but what's the hurry?
There is plenty of longstanding kernel badness that dwarfs this in
importance (knfsd comes to mind). I am just glad that rmap has stuck, as I
am very happy to trade a few percentage points of speed in some
applications that suck by design, in return for a VM that actually gives
BSD a run for its money in terms of stability.

I guess that if pte sharing doesn't make it into Linus's tree then Redhat
will be only too pleased to make it part of Advanced Server, as there are a
couple of ridiculously big tech companies I could name that need it so
badly it hurts.

Regards,

Daniel
* Re: shared pagetable benchmarking
From: Linus Torvalds @ 2002-12-27 20:02 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Andrew Morton, Dave McCracken, linux-mm

On Fri, 27 Dec 2002, Daniel Phillips wrote:
>
> Did you ask Linus? To my thinking, if it breaks even on small forks and wins
> on the big forks that are bothering the database people etc (and aren't we
> all database people in the end) it's a clear win.

It doesn't break even on small forks. It _slows_them_down_.

I personally think that small forks are a hell of a lot more important
than big ones, since big ones happen rarely and don't tend to be all that
performance-critical anyway.

Linus
* Re: shared pagetable benchmarking
From: Dave McCracken @ 2002-12-27 20:16 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Daniel Phillips, Andrew Morton, linux-mm

--On Friday, December 27, 2002 12:02:56 -0800 Linus Torvalds
<torvalds@transmeta.com> wrote:

> It doesn't break even on small forks. It _slows_them_down_.

I gave Andrew a patch that does make it break even on small forks, by doing
the copy at fork time when a process only has 3 pte pages. My tests
indicate that any process with 4 or more pte pages usually is faster by
doing the share.

Dave McCracken
* Re: shared pagetable benchmarking
From: Linus Torvalds @ 2002-12-27 20:18 UTC (permalink / raw)
To: Dave McCracken; +Cc: Daniel Phillips, Andrew Morton, linux-mm

On Fri, 27 Dec 2002, Dave McCracken wrote:
>
> I gave Andrew a patch that does make it break even on small forks, by doing
> the copy at fork time when a process only has 3 pte pages. My tests
> indicate that any process with 4 or more pte pages usually is faster by
> doing the share.

Ok, so it doesn't actually break even, it just disables itself. That's not
the same thing in my book, but may of course be acceptable.

I'd personally be much happier if just the real cause for the rmap
slowdown was fixed, possibly by having it be done lazily (the shared page
table stuff tries to do the _copy_ of the rmap information lazily, but
maybe the real solution is to go one level further and just set the dang
things up lazily in the first place, since most of the time it's not even
needed).

That's clearly not 2.6.x material. But at this point I doubt that shared
page tables are either, unless they fix something more important than
fork() speed for processes that are larger than 16MB.

Linus
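The cutoff Dave describes and the 16MB figure Linus quotes are connected by simple arithmetic. The sketch below is hypothetical: the actual condition in Dave's patch is not shown in the thread, so `fork_strategy` only restates "copy at 3 pte pages, share at 4 or more" from his message, and the byte figure assumes every pte page is fully used.

```python
MIB = 1024 * 1024
# From Dave's message: copy at fork time with 3 pte pages, share with 4 or more.
SHARE_THRESHOLD_PTE_PAGES = 4


def fork_strategy(n_pte_pages, threshold=SHARE_THRESHOLD_PTE_PAGES):
    """Pick 'copy' or 'share' for a forking process with n_pte_pages."""
    return "share" if n_pte_pages >= threshold else "copy"


def threshold_bytes(span, threshold=SHARE_THRESHOLD_PTE_PAGES):
    """Address space covered by `threshold` fully used pte pages."""
    return threshold * span


if __name__ == "__main__":
    print(fork_strategy(3), fork_strategy(4))
    print(threshold_bytes(4 * MIB) // MIB, "MiB without PAE")
    print(threshold_bytes(2 * MIB) // MIB, "MiB with PAE")
```

Four pte pages at 4M each give the 16MB mentioned above; with 2M pte pages under PAE the same count covers 8MB. Because pte pages can be sparsely populated, a process may reach four pte pages while mapping far less memory than that.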
* Re: shared pagetable benchmarking
From: Dave McCracken @ 2002-12-27 20:45 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Daniel Phillips, Andrew Morton, linux-mm

--On Friday, December 27, 2002 12:18:23 -0800 Linus Torvalds
<torvalds@transmeta.com> wrote:

> That's clearly not 2.6.x material. But at this point I doubt that shared
> page tables are either, unless they fix something more important than
> fork() speed for processes that are larger than 16MB.

The other thing it does is eliminate the duplicate pte pages for shared
regions everywhere they span a complete pte page. While hugetlb can also
do this for some specialized applications, shared page tables will do it
for every shared region that's large enough. I dunno whether you consider
that important enough to qualify, but I figured I should point it out.

Dave McCracken
* Re: shared pagetable benchmarking
2002-12-27 20:45 ` Dave McCracken
@ 2002-12-27 20:50 ` Linus Torvalds
2002-12-27 23:56 ` Daniel Phillips
2002-12-28 0:45 ` Martin J. Bligh
0 siblings, 2 replies; 24+ messages in thread
From: Linus Torvalds @ 2002-12-27 20:50 UTC
To: Dave McCracken; +Cc: Daniel Phillips, Andrew Morton, linux-mm

On Fri, 27 Dec 2002, Dave McCracken wrote:
>
> The other thing it does is eliminate the duplicate pte pages for shared
> regions everywhere they span a complete pte page. While hugetlb can also
> do this for some specialized applications, shared page tables will do it
> for every shared region that's large enough. I dunno whether you consider
> that important enough to qualify, but I figured I should point it out.

I don't consider it important enough to qualify unless there are some real
loads where it really matters. I can well imagine that such loads exist
(where low-memory usage by page tables is a real problem), but I'd like to
have that confirmed as a bug-report and that the sharing really does fix
it.

In other words, I can believe that the sharing is 2.6.x material, but
considering the fundamental nature of it I want it to be a confirmed
bug-fix, not a feature.

Linus
* Re: shared pagetable benchmarking
2002-12-27 20:50 ` Linus Torvalds
@ 2002-12-27 23:56 ` Daniel Phillips
2002-12-28 0:45 ` Martin J. Bligh
1 sibling, 0 replies; 24+ messages in thread
From: Daniel Phillips @ 2002-12-27 23:56 UTC
To: Linus Torvalds, Dave McCracken, Wim Coekaerts
Cc: Daniel Phillips, Andrew Morton, linux-mm

On Friday 27 December 2002 21:50, Linus Torvalds wrote:
> On Fri, 27 Dec 2002, Dave McCracken wrote:
> > The other thing it does is eliminate the duplicate pte pages for shared
> > regions everywhere they span a complete pte page. While hugetlb can also
> > do this for some specialized applications, shared page tables will do it
> > for every shared region that's large enough. I dunno whether you
> > consider that important enough to qualify, but I figured I should point
> > it out.
>
> I don't consider it important enough to qualify unless there are some real
> loads where it really matters.

Well, I know IBM has real loads that it matters on, otherwise Dave wouldn't
be on this. I have reason to believe Oracle has loads that care about this
as well, so it's time for somebody there to either speak up or kiss this
facility goodbye for another cycle or two. Wim?

--
Daniel
* Re: shared pagetable benchmarking
2002-12-27 20:50 ` Linus Torvalds
2002-12-27 23:56 ` Daniel Phillips
@ 2002-12-28 0:45 ` Martin J. Bligh
2002-12-28 2:34 ` Andrew Morton
1 sibling, 1 reply; 24+ messages in thread
From: Martin J. Bligh @ 2002-12-28 0:45 UTC
To: Linus Torvalds, Dave McCracken; +Cc: Daniel Phillips, Andrew Morton, linux-mm

> I don't consider it important enough to qualify unless there are some real
> loads where it really matters. I can well imagine that such loads exist
> (where low-memory usage by page tables is a real problem), but I'd like to
> have that confirmed as a bug-report and that the sharing really does fix
> it.

We had over 10Gb of PTEs running Oracle Apps (on 2.4 without RMAP) -
RMAP would add another 5Gb or so to that (2Gb shared memory segment
across many processes). But you can stick PTEs in highmem, whereas
it's not easy to do that with pte_chains ... sticking 5Gb of overhead
into ZONE_NORMAL is tricky ;-) The really nice thing about shared
pagetables as a solution is that it's totally transparent, and requires
no app modifications. Obviously degrading fork for small tasks is
unacceptable, but Dave seems to have fixed that issue now.

I think the long-term fix for the rmap performance hit is object-based
RMAP (doing the reverse mappings shared on a per-area basis) which we've
talked about, but not for 2.6 ... it may not turn out to be that hard
though ... K42 did it before.

M.
* Re: shared pagetable benchmarking
2002-12-28 0:45 ` Martin J. Bligh
@ 2002-12-28 2:34 ` Andrew Morton
2002-12-28 3:10 ` Linus Torvalds
2002-12-28 3:19 ` Martin J. Bligh
0 siblings, 2 replies; 24+ messages in thread
From: Andrew Morton @ 2002-12-28 2:34 UTC
To: Martin J. Bligh; +Cc: Linus Torvalds, Dave McCracken, Daniel Phillips, linux-mm

"Martin J. Bligh" wrote:
>
> > I don't consider it important enough to qualify unless there are some real
> > loads where it really matters. I can well imagine that such loads exist
> > (where low-memory usage by page tables is a real problem), but I'd like to
> > have that confirmed as a bug-report and that the sharing really does fix
> > it.
>
> We had over 10Gb of PTEs running Oracle Apps (on 2.4 without RMAP) -
> RMAP would add another 5Gb or so to that (2Gb shared memory segment
> across many processes). But you can stick PTEs in highmem, whereas
> it's not easy to do that with pte_chains ... sticking 5Gb of overhead
> into ZONE_NORMAL is tricky ;-) The really nice thing about shared
> pagetables as a solution is that it's totally transparent, and requires
> no app modifications. Obviously degrading fork for small tasks is
> unacceptable, but Dave seems to have fixed that issue now.

To what extent is that a "real" workload?

What other applications are affected, and to what extent?

Why are hugepages not a sufficient solution?

Is this problem sufficiently common to warrant the inclusion of
pagetable sharing in the main kernel, as opposed to a specialised
Oracle/DB2 derivative?

> I think the long-term fix for the rmap performance hit is object-based
> RMAP (doing the reverse mappings shared on a per-area basis) which we've
> talked about, but not for 2.6 ... it may not turn out to be that hard
> though ... K42 did it before.

I think we can do a few things still in the 2.6 context. The fact that
my "apply seventy patches with patch-scripts" test takes 350,000 pagefaults
in 13 seconds makes one go "hmm".
* Re: shared pagetable benchmarking
2002-12-28 2:34 ` Andrew Morton
@ 2002-12-28 3:10 ` Linus Torvalds
2002-12-28 6:58 ` Andrew Morton
2002-12-28 3:19 ` Martin J. Bligh
1 sibling, 1 reply; 24+ messages in thread
From: Linus Torvalds @ 2002-12-28 3:10 UTC
To: Andrew Morton
Cc: Martin J. Bligh, Dave McCracken, Daniel Phillips, linux-mm, Ingo Molnar

On Fri, 27 Dec 2002, Andrew Morton wrote:
>
> I think we can do a few things still in the 2.6 context. The fact that
> my "apply seventy patches with patch-scripts" test takes 350,000 pagefaults
> in 13 seconds makes one go "hmm".

Hmm.. Whatever happened to the MAP_POPULATE tests?

The current "filemap_populate()" function is extremely stupid (it takes
advantage neither of the locality of the page tables _nor_ of the radix
tree layout), but even so it would probably be a win to pre-populate at
mmap time.

But having a better "populate()" function that actually does multiple
pages at once by just accessing the radix trees and page table trees
directly should really be very low-overhead for the normal case, and be a
_big_ win in avoiding page faults.

Even with the existing stupid populate function, it might be interesting
seeing what would happen just from doing something silly like

===== arch/i386/kernel/sys_i386.c 1.10 vs edited =====
--- 1.10/arch/i386/kernel/sys_i386.c	Sat Dec 21 08:24:45 2002
+++ edited/arch/i386/kernel/sys_i386.c	Fri Dec 27 19:08:30 2002
@@ -54,6 +54,8 @@
 		file = fget(fd);
 		if (!file)
 			goto out;
+		if (prot & PROT_EXEC)
+			flags |= MAP_POPULATE | MAP_NONBLOCK;
 	}
 	down_write(&current->mm->mmap_sem);

(yeah, yeah, and maybe do the same in binfmt_elf.c too)

Linus
* Re: shared pagetable benchmarking
2002-12-28 3:10 ` Linus Torvalds
@ 2002-12-28 6:58 ` Andrew Morton
2002-12-28 7:39 ` Ingo Molnar
2002-12-28 7:47 ` Linus Torvalds
0 siblings, 2 replies; 24+ messages in thread
From: Andrew Morton @ 2002-12-28 6:58 UTC
To: Linus Torvalds
Cc: Martin J. Bligh, Dave McCracken, Daniel Phillips, linux-mm, Ingo Molnar

Linus Torvalds wrote:
>
> On Fri, 27 Dec 2002, Andrew Morton wrote:
> >
> > I think we can do a few things still in the 2.6 context. The fact that
> > my "apply seventy patches with patch-scripts" test takes 350,000 pagefaults
> > in 13 seconds makes one go "hmm".
>
> Hmm.. Whatever happened to the MAP_POPULATE tests?
>
> The current "filemap_populate()" function is extremely stupid (it takes
> advantage neither of the locality of the page tables _nor_ of the radix
> tree layout), but even so it would probably be a win to pre-populate at
> mmap time.

Yup. Ingo said at the time:

  It would be faster to iterate the pagecache mapping's radix tree
  and the pagetables at once, but it's also *much* more complex. I have
  tried to implement it and had to unroll the change - mixing radix tree
  walking and pagetable walking and getting all the VM details right is
  really complex - especially considering all the re-lookup race checks
  that have to occur upon IO.

But find_get_pages() is well-suited to this, and was not in place when
he did this work.

> But having a better "populate()" function that actually does multiple
> pages at once by just accessing the radix trees and page table trees
> directly should really be very low-overhead for the normal case, and be a
> _big_ win in avoiding page faults.
>
> Even with the existing stupid populate function, it might be interesting
> seeing what would happen just from doing something silly like
>
> ===== arch/i386/kernel/sys_i386.c 1.10 vs edited =====
> --- 1.10/arch/i386/kernel/sys_i386.c	Sat Dec 21 08:24:45 2002
> +++ edited/arch/i386/kernel/sys_i386.c	Fri Dec 27 19:08:30 2002
> @@ -54,6 +54,8 @@
> 		file = fget(fd);
> 		if (!file)
> 			goto out;
> +		if (prot & PROT_EXEC)
> +			flags |= MAP_POPULATE | MAP_NONBLOCK;

Yes, this could be used to prototype it, I think.

It doesn't work as-is, because remap_file_pages() requires a shared
mapping. Disabling that check results in a scrogged ld.so and a
non-booting system. remap_file_pages() plays games with the vma
protection in ways which I do not understand.

So hum. I'll finish off some other stuff, take a more detailed look at
this soon.
* Re: shared pagetable benchmarking
2002-12-28 6:58 ` Andrew Morton
@ 2002-12-28 7:39 ` Ingo Molnar
2002-12-28 7:47 ` Linus Torvalds
1 sibling, 0 replies; 24+ messages in thread
From: Ingo Molnar @ 2002-12-28 7:39 UTC
To: Andrew Morton
Cc: Linus Torvalds, Martin J. Bligh, Dave McCracken, Daniel Phillips, linux-mm

On Fri, 27 Dec 2002, Andrew Morton wrote:

> Yup. Ingo said at the time:
>
> It would be faster to iterate the pagecache mapping's radix tree
> and the pagetables at once, but it's also *much* more complex. I have
> tried to implement it and had to unroll the change - mixing radix tree
> walking and pagetable walking and getting all the VM details right is
> really complex - especially considering all the re-lookup race checks
> that have to occur upon IO.
>
> But find_get_pages() is well-suited to this, and was not in place when
> he did this work.

i agree that find_get_pages() would simplify this work. I did not consider
group-lookup - i tried to implement an algorithm that had a single-page
scope, to keep the amount of locked pages to the minimum.

Ingo
* Re: shared pagetable benchmarking
2002-12-28 6:58 ` Andrew Morton
2002-12-28 7:39 ` Ingo Molnar
@ 2002-12-28 7:47 ` Linus Torvalds
2002-12-28 23:28 ` Andrew Morton
1 sibling, 1 reply; 24+ messages in thread
From: Linus Torvalds @ 2002-12-28 7:47 UTC
To: Andrew Morton
Cc: Martin J. Bligh, Dave McCracken, Daniel Phillips, linux-mm, Ingo Molnar

On Fri, 27 Dec 2002, Andrew Morton wrote:
> > 		if (!file)
> > 			goto out;
> > +		if (prot & PROT_EXEC)
> > +			flags |= MAP_POPULATE | MAP_NONBLOCK;
>
> Yes, this could be used to prototype it, I think.
>
> It doesn't work as-is, because remap_file_pages() requires a shared
> mapping. Disabling that check results in a scrogged ld.so and a
> non-booting system. remap_file_pages() plays games with the vma
> protection in ways which I do not understand.

Ahh.. Those file protection games are wrong for anything but the specific
case of the sys_remap_file_pages() system call. The mmap() case should
_not_ use that system call path at all, but should instead just call the
populate function directly. Something like the appended patch.

CAREFUL! I've not checked all the details on this, but moving the
MAP_POPULATE check upwards should get rid of the problems with the vma
going away etc, so it should make this at least closer to correct, and
makes all the extra work that sys_remap_file_pages() does totally
unnecessary, since we know the vma and ranges already.

This has not been compiled, much less tested. Consider an example ONLY.

Linus

----
===== arch/i386/kernel/sys_i386.c 1.10 vs edited =====
--- 1.10/arch/i386/kernel/sys_i386.c	Sat Dec 21 08:24:45 2002
+++ edited/arch/i386/kernel/sys_i386.c	Fri Dec 27 19:08:30 2002
@@ -54,6 +54,8 @@
 		file = fget(fd);
 		if (!file)
 			goto out;
+		if (prot & PROT_EXEC)
+			flags |= MAP_POPULATE | MAP_NONBLOCK;
 	}
 	down_write(&current->mm->mmap_sem);
===== mm/mmap.c 1.58 vs edited =====
--- 1.58/mm/mmap.c	Sat Dec 14 09:42:45 2002
+++ edited/mm/mmap.c	Fri Dec 27 23:45:45 2002
@@ -576,6 +576,11 @@
 		error = file->f_op->mmap(file, vma);
 		if (error)
 			goto unmap_and_free_vma;
+
+		if (flags & MAP_POPULATE) {
+			if (vma->vm_ops && vma->vm_ops->populate)
+				vma->vm_ops->populate(vma, addr, len, prot, pgoff, flags & MAP_NONBLOCK);
+		}
 	} else if (vm_flags & VM_SHARED) {
 		error = shmem_zero_setup(vma);
 		if (error)
@@ -606,12 +611,6 @@
 	if (vm_flags & VM_LOCKED) {
 		mm->locked_vm += len >> PAGE_SHIFT;
 		make_pages_present(addr, addr + len);
-	}
-	if (flags & MAP_POPULATE) {
-		up_write(&mm->mmap_sem);
-		sys_remap_file_pages(addr, len, prot,
-					pgoff, flags & MAP_NONBLOCK);
-		down_write(&mm->mmap_sem);
 	}
 	return addr;
* Re: shared pagetable benchmarking
2002-12-28 7:47 ` Linus Torvalds
@ 2002-12-28 23:28 ` Andrew Morton
0 siblings, 0 replies; 24+ messages in thread
From: Andrew Morton @ 2002-12-28 23:28 UTC
To: Linus Torvalds
Cc: Martin J. Bligh, Dave McCracken, Daniel Phillips, linux-mm, Ingo Molnar

Linus Torvalds wrote:
>
>
> ...
> The mmap() case should
> _not_ use that system call path at all, but should instead just call the
> populate function directly. Something like the appended patch.

Seems to do the right thing, but alas, it's slower:

without:

pushpatch 99 8.20s user 10.00s system 99% cpu 18.341 total
poppatch 99 5.76s user 6.65s system 99% cpu 12.521 total

c0114c64 kmap_atomic_to_page 84 0.9438
c01308ec handle_mm_fault 92 0.4340
c01c4b58 __copy_from_user 94 0.8393
c012f330 clear_page_tables 113 0.5650
c01305b0 do_anonymous_page 123 0.3844
c011a9c0 do_softirq 145 0.8239
c0113d9c pte_alloc_one 146 1.1406
c012f534 copy_page_range 174 0.3595
c01c4af0 __copy_to_user 188 1.8077
c01306f0 do_no_page 241 0.4744
c012f718 zap_pte_range 265 0.6370
c0113ec0 do_page_fault 321 0.2956
c0133a8c page_add_rmap 322 1.1838
c0114be4 kmap_atomic 326 3.0185
c0133b9c page_remove_rmap 360 0.9574
c012ff54 do_wp_page 1245 1.9095
00000000 total 6812 0.0042

(374019 pagefaults)

with:

pushpatch 99 8.16s user 11.76s system 99% cpu 20.072 total
poppatch 99 5.68s user 7.93s system 99% cpu 13.656 total

c012f330 clear_page_tables 111 0.5550
c0114c64 kmap_atomic_to_page 121 1.3596
c0113d9c pte_alloc_one 140 1.0938
c011a9c0 do_softirq 150 0.8523
c01305b0 do_anonymous_page 157 0.4906
c01c4af0 __copy_to_user 157 1.5096
c012e590 install_page 202 0.6012
c0113ec0 do_page_fault 209 0.1924
c012f534 copy_page_range 215 0.4442
c01306f0 do_no_page 224 0.4409
c0114be4 kmap_atomic 392 3.6296
c012f718 zap_pte_range 417 1.0024
c0133a8c page_add_rmap 563 2.0699
c0133b9c page_remove_rmap 653 1.7367
c012ff54 do_wp_page 1318 2.0215
00000000 total 8072 0.0050

(240622 pagefaults)

That's uniprocessor, highpte. Presumably there are lots of cached libc
pages which these scripts don't actually need.

It needs more analysis/instrumentation/work, but it's not promising.
Cache misses against the pte_chains is what is hurting here. Something
which may help on P4 is to keep the pte_chains at 32 bytes, so that
virtually-adjacent pages' pte_chains will probably share cachelines.

I have a pseudo-4way HT box sitting here awaiting commissioning...
* Re: shared pagetable benchmarking
2002-12-28 2:34 ` Andrew Morton
2002-12-28 3:10 ` Linus Torvalds
@ 2002-12-28 3:19 ` Martin J. Bligh
1 sibling, 0 replies; 24+ messages in thread
From: Martin J. Bligh @ 2002-12-28 3:19 UTC
To: Andrew Morton; +Cc: Linus Torvalds, Dave McCracken, Daniel Phillips, linux-mm

> To what extent is that a "real" workload?

It was meant to be a simulation of a real customer environment, I don't
think it's unrealistic (they were actually trying to push it to at
least twice that).

> What other applications are affected, and to what extent?

Anything that does heavy sharing. Databases and Java heaps spring to mind.

> Why are hugepages not a sufficient solution?

They may be for some workloads, but it's not as generalised. For instance,
one other thing that's being muttered about a lot is very large heaps for
Java workloads, and they want those swap backed. Large pages also require
application modification and machine setup for static pool size
reservations in the current implementation.

> Is this problem sufficiently common to warrant the inclusion of
> pagetable sharing in the main kernel, as opposed to a specialised
> Oracle/DB2 derivative?

If we can get it not to degrade anything else (eg fork on small tasks),
I think it's worthwhile. I *think* we're there now, though a few more
perf checks are probably needed.

>> I think the long-term fix for the rmap performance hit is object-based
>> RMAP (doing the reverse mappings shared on a per-area basis) which we've
>> talked about, but not for 2.6 ... it may not turn out to be that hard
>> though ... K42 did it before.
>
> I think we can do a few things still in the 2.6 context. The fact that
> my "apply seventy patches with patch-scripts" test takes 350,000
> pagefaults in 13 seconds makes one go "hmm".

Fixing that would be a worthy goal, IMHO.

M.
* Re: shared pagetable benchmarking
2002-12-20 11:11 shared pagetable benchmarking Andrew Morton
2002-12-20 11:13 ` William Lee Irwin III
2002-12-20 16:30 ` Dave McCracken
@ 2002-12-23 18:19 ` Dave McCracken
2 siblings, 0 replies; 24+ messages in thread
From: Dave McCracken @ 2002-12-23 18:19 UTC
To: Andrew Morton, linux-mm

[-- Attachment #1: Type: text/plain, Size: 620 bytes --]

--On Friday, December 20, 2002 03:11:09 -0800 Andrew Morton
<akpm@digeo.com> wrote:

> Is there anything we can do to fix all of this up a bit?

Ok, here's my first attempt at optimization. I track how many pte pages a
task has, and just do the copy if it doesn't have more than 3. My fork
tests show that for a process with 3 pte pages, this patch produces
performance equal to 2.5.52.

Dave McCracken

======================================================================
Dave McCracken          IBM Linux Base Kernel Team      1-512-838-3059
dmccr@us.ibm.com                                        T/L   678-3059

[-- Attachment #2: shpte-2.5.52-mm2-1.diff --]
[-- Type: text/plain, Size: 3938 bytes --]

--- 2.5.52-mm2-shsent/./include/linux/mm.h	2002-12-20 10:39:44.000000000 -0600
+++ 2.5.52-mm2-shpte/./include/linux/mm.h	2002-12-20 11:09:51.000000000 -0600
@@ -123,9 +123,6 @@
  * low four bits) to a page protection mask..
  */
 extern pgprot_t protection_map[16];
-#ifdef CONFIG_SHAREPTE
-extern pgprot_t protection_pmd[8];
-#endif
 /*
  * These are the virtual MM functions - opening of an area, closing and
--- 2.5.52-mm2-shsent/./include/linux/sched.h	2002-12-20 10:39:44.000000000 -0600
+++ 2.5.52-mm2-shpte/./include/linux/sched.h	2002-12-23 10:18:08.000000000 -0600
@@ -183,6 +183,7 @@
 	struct vm_area_struct * mmap_cache;	/* last find_vma result */
 	unsigned long free_area_cache;		/* first hole */
 	pgd_t * pgd;
+	atomic_t ptepages;			/* Number of pte pages allocated */
 	atomic_t mm_users;			/* How many users with user space? */
 	atomic_t mm_count;			/* How many references to "struct mm_struct" (users count as 1) */
 	int map_count;				/* number of VMAs */
--- 2.5.52-mm2-shsent/./kernel/fork.c	2002-12-20 10:39:44.000000000 -0600
+++ 2.5.52-mm2-shpte/./kernel/fork.c	2002-12-23 10:32:15.000000000 -0600
@@ -238,6 +238,7 @@
 	mm->free_area_cache = TASK_UNMAPPED_BASE;
 	mm->map_count = 0;
 	mm->rss = 0;
+	atomic_set(&mm->ptepages, 0);
 	mm->cpu_vm_mask = 0;
 	pprev = &mm->mmap;
--- 2.5.52-mm2-shsent/./mm/memory.c	2002-12-20 10:39:45.000000000 -0600
+++ 2.5.52-mm2-shpte/./mm/memory.c	2002-12-23 10:22:52.000000000 -0600
@@ -116,6 +116,7 @@
 	pmd_clear(dir);
 	pgtable_remove_rmap_locked(ptepage, tlb->mm);
+	atomic_dec(&tlb->mm->ptepages);
 	dec_page_state(nr_page_table_pages);
 	ClearPagePtepage(ptepage);
@@ -184,6 +185,7 @@
 		SetPagePtepage(new);
 		pgtable_add_rmap(new, mm, address);
 		pmd_populate(mm, pmd, new);
+		atomic_inc(&mm->ptepages);
 		inc_page_state(nr_page_table_pages);
 	}
 out:
@@ -217,7 +219,6 @@
 #define PTE_TABLE_MASK	((PTRS_PER_PTE-1) * sizeof(pte_t))
 #define PMD_TABLE_MASK	((PTRS_PER_PMD-1) * sizeof(pmd_t))
-#ifndef CONFIG_SHAREPTE
 /*
  * copy one vm_area from one task to the other. Assumes the page tables
  * already present in the new task to be cleared in the whole range
@@ -354,7 +355,6 @@
 nomem:
 	return -ENOMEM;
 }
-#endif
 static void zap_pte_range(mmu_gather_t *tlb, pmd_t * pmd, unsigned long address, unsigned long size)
 {
--- 2.5.52-mm2-shsent/./mm/ptshare.c	2002-12-20 10:39:45.000000000 -0600
+++ 2.5.52-mm2-shpte/./mm/ptshare.c	2002-12-23 12:17:23.000000000 -0600
@@ -23,7 +23,7 @@
 /*
  * Protections that can be set on the pmd entry (see discussion in mmap.c).
  */
-pgprot_t protection_pmd[8] = {
+static pgprot_t protection_pmd[8] = {
 	__PMD000, __PMD001, __PMD010, __PMD011, __PMD100, __PMD101, __PMD110, __PMD111
 };
@@ -459,6 +459,28 @@
 }
 /**
+ * fork_page_range - Either copy or share a page range at fork time
+ * @dst: the mm_struct of the forked child
+ * @src: the mm_struct of the forked parent
+ * @vma: the vm_area to be shared
+ * @prev_pmd: A pointer to the pmd entry we did at last invocation
+ *
+ * This wrapper decides whether to share page tables on fork or just make
+ * a copy. The current criterion is whether a page table has more than 3
+ * pte pages, since all forked processes will unshare 3 pte pages after fork,
+ * even the ones doing an immediate exec. Tests indicate that if a page
+ * table has more than 3 pte pages, it's a performance win to share.
+ */
+int fork_page_range(struct mm_struct *dst, struct mm_struct *src,
+		    struct vm_area_struct *vma, pmd_t **prev_pmd)
+{
+	if (atomic_read(&src->ptepages) > 3)
+		return share_page_range(dst, src, vma, prev_pmd);
+
+	return copy_page_range(dst, src, vma);
+}
+
+/**
  * unshare_page_range - Make sure no pte pages are shared in a given range
  * @mm: the mm_struct whose page table we unshare from
  * @address: the base address of the range