* RE: Anticipatory prefaulting in the page fault handler V1
@ 2004-12-08 17:44 Luck, Tony
2004-12-08 17:57 ` Christoph Lameter
0 siblings, 1 reply; 23+ messages in thread
From: Luck, Tony @ 2004-12-08 17:44 UTC (permalink / raw)
To: Christoph Lameter, nickpiggin
Cc: Jeff Garzik, torvalds, hugh, benh, linux-mm, linux-ia64, linux-kernel
>If a fault occurred for page x and is then followed by page
>x+1 then it may be reasonable to expect another page fault
>at x+2 in the future.
What if the application had used "madvise(start, len, MADV_RANDOM)"
to tell the kernel that this isn't "reasonable"?
-Tony
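For reference, a minimal user-space sketch of the hint Tony describes: an application maps an anonymous region and marks it with MADV_RANDOM. The helper name map_random_access() is hypothetical; mmap(), madvise() and MADV_RANDOM are the standard interfaces. Whether a kernel carrying the prefaulting patch honours the hint is exactly the question raised here.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* expose madvise() and MAP_ANONYMOUS on glibc */
#endif
#include <stddef.h>
#include <sys/mman.h>

/* Map `len` bytes of anonymous memory and advise the kernel that the
 * region will be accessed in random order.
 * Returns the mapping, or NULL on failure. (Hypothetical helper.) */
void *map_random_access(size_t len)
{
    void *start = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (start == MAP_FAILED)
        return NULL;

    /* MADV_RANDOM: expect page references in random order. */
    if (madvise(start, len, MADV_RANDOM) != 0) {
        munmap(start, len);
        return NULL;
    }
    return start;
}
```

The caller releases the region with munmap(start, len) when done.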
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>
^ permalink raw reply [flat|nested] 23+ messages in thread
* RE: Anticipatory prefaulting in the page fault handler V1
2004-12-08 17:44 Anticipatory prefaulting in the page fault handler V1 Luck, Tony
@ 2004-12-08 17:57 ` Christoph Lameter
0 siblings, 0 replies; 23+ messages in thread
From: Christoph Lameter @ 2004-12-08 17:57 UTC (permalink / raw)
To: Luck, Tony
Cc: nickpiggin, Jeff Garzik, torvalds, hugh, benh, linux-mm,
linux-ia64, linux-kernel
On Wed, 8 Dec 2004, Luck, Tony wrote:
> >If a fault occurred for page x and is then followed by page
> >x+1 then it may be reasonable to expect another page fault
> >at x+2 in the future.
>
> What if the application had used "madvise(start, len, MADV_RANDOM)"
> to tell the kernel that this isn't "reasonable"?
We could use that as a way to switch off the preallocation. How expensive
is that check?
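As a rough illustration of the cost: if the hint is kept as a flag bit in the VMA (the updated patch later in this thread tests vma->vm_flags & VM_RAND_READ), the check is a single mask-and-test on a structure the fault handler already has in hand. The sketch below is a user-space model only; struct model_vma, MODEL_VM_RAND_READ and prefault_allowed() are hypothetical stand-ins, not kernel definitions.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's VM_RAND_READ flag bit. */
#define MODEL_VM_RAND_READ 0x1UL

/* Hypothetical, stripped-down model of a VMA: only the flags matter here. */
struct model_vma {
    unsigned long vm_flags;
};

/* Prefault only when the region is not marked random-access and the
 * faulting address matches the predicted next sequential address. */
bool prefault_allowed(const struct model_vma *vma,
                      unsigned long predicted, unsigned long addr)
{
    if (vma->vm_flags & MODEL_VM_RAND_READ)
        return false;            /* application asked for no read-ahead */
    return predicted == addr;
}
```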
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Anticipatory prefaulting in the page fault handler V1
2004-12-14 12:24 ` Akinobu Mita
2004-12-14 15:25 ` Akinobu Mita
@ 2004-12-14 20:25 ` Christoph Lameter
1 sibling, 0 replies; 23+ messages in thread
From: Christoph Lameter @ 2004-12-14 20:25 UTC (permalink / raw)
To: Akinobu Mita
Cc: Martin J. Bligh, nickpiggin, Jeff Garzik, torvalds, hugh, benh,
linux-mm, linux-ia64, linux-kernel
On Tue, 14 Dec 2004, Akinobu Mita wrote:
> This is why I inserted pte_none() for each page_table in case of
> read fault too.
>
> If a read access fault occurred for the address "addr", it is
> completely unnecessary to check the page_table for "addr" with
> pte_none(), because page_table_lock is never released until
> do_anonymous_page returns (in the read access fault case).
>
> But there is not any guarantee that the page_tables for addr+PAGE_SIZE,
> addr+2*PAGE_SIZE, ... have not been mapped yet.
Right. Thanks for pointing that out.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Anticipatory prefaulting in the page fault handler V1
2004-12-08 17:24 ` Anticipatory prefaulting in the page fault handler V1 Christoph Lameter
` (4 preceding siblings ...)
2004-12-09 10:57 ` Pavel Machek
@ 2004-12-14 15:28 ` Adam Litke
5 siblings, 0 replies; 23+ messages in thread
From: Adam Litke @ 2004-12-14 15:28 UTC (permalink / raw)
To: Christoph Lameter
Cc: nickpiggin, Jeff Garzik, torvalds, hugh, benh, linux-mm,
linux-ia64, linux-kernel
What benchmark are you using to generate the following results? I'd
like to run this on some of my hardware and see how the results compare.
On Wed, 2004-12-08 at 11:24, Christoph Lameter wrote:
> Standard Kernel on a 512 Cpu machine allocating 32GB with an increasing
> number of threads (and thus increasing parallelism of page faults):
>
> Gb Rep Threads User System Wall flt/cpu/s fault/wsec
> 32 3 1 1.416s 138.165s 139.050s 45073.831 45097.498
> 32 3 2 1.397s 148.523s 78.044s 41965.149 80201.646
> 32 3 4 1.390s 152.618s 44.044s 40851.258 141545.239
> 32 3 8 1.500s 374.008s 53.001s 16754.519 118671.950
> 32 3 16 1.415s 1051.759s 73.094s 5973.803 85087.358
> 32 3 32 1.867s 3400.417s 117.003s 1849.186 53754.928
> 32 3 64 5.361s 11633.040s 197.034s 540.577 31881.112
> 32 3 128 23.387s 39386.390s 332.055s 159.642 18918.599
> 32 3 256 15.409s 20031.450s 168.095s 313.837 37237.918
> 32 3 512 18.720s 10338.511s 86.047s 607.446 72752.686
>
> Patched kernel:
>
> Gb Rep Threads User System Wall flt/cpu/s fault/wsec
> 32 3 1 1.098s 138.544s 139.063s 45053.657 45057.920
> 32 3 2 1.022s 127.770s 67.086s 48849.350 92707.085
> 32 3 4 0.995s 119.666s 37.045s 52141.503 167955.292
> 32 3 8 0.928s 87.400s 18.034s 71227.407 342934.242
> 32 3 16 1.067s 72.943s 11.035s 85007.293 553989.377
> 32 3 32 1.248s 133.753s 10.038s 46602.680 606062.151
> 32 3 64 5.557s 438.634s 13.093s 14163.802 451418.617
> 32 3 128 17.860s 1496.797s 19.048s 4153.714 322808.509
> 32 3 256 13.382s 766.063s 10.016s 8071.695 618816.838
> 32 3 512 17.067s 369.106s 5.041s 16291.764 1161285.521
>
> These numbers are roughly equal to what can be accomplished with the
> page fault scalability patches.
>
> Kernel patches with both the page fault scalability patches and
> prefaulting:
>
> Gb Rep Threads User System Wall flt/cpu/s fault/wsec
> 32 10 1 4.103s 456.384s 460.046s 45541.992 45544.369
> 32 10 2 4.005s 415.119s 221.095s 50036.407 94484.174
> 32 10 4 3.855s 371.317s 111.076s 55898.259 187635.724
> 32 10 8 3.902s 308.673s 67.094s 67092.476 308634.397
> 32 10 16 4.011s 224.213s 37.016s 91889.781 564241.062
> 32 10 32 5.483s 209.391s 27.046s 97598.647 763495.417
> 32 10 64 19.166s 219.925s 26.030s 87713.212 797286.395
> 32 10 128 53.482s 342.342s 27.024s 52981.744 769687.791
> 32 10 256 67.334s 180.321s 15.036s 84679.911 1364614.334
> 32 10 512 66.516s 93.098s 9.015s 131387.893 2291548.865
>
> The fault rate doubles when both patches are applied.
>
> And on the high end (512 processors allocating 256G) (No numbers
> for regular kernels because they are extremely slow, also no
> number for a low number of threads. Also very slow)
>
> With prefaulting:
>
> Gb Rep Threads User System Wall flt/cpu/s fault/wsec
> 256 3 4 8.241s 1414.348s 449.016s 35380.301 112056.239
> 256 3 8 8.306s 1300.982s 247.025s 38441.977 203559.271
> 256 3 16 8.368s 1223.853s 154.089s 40846.272 324940.924
> 256 3 32 8.536s 1284.041s 110.097s 38938.970 453556.624
> 256 3 64 13.025s 3587.203s 110.010s 13980.123 457131.492
> 256 3 128 25.025s 11460.700s 145.071s 4382.104 345404.909
> 256 3 256 26.150s 6061.649s 75.086s 8267.625 663414.482
> 256 3 512 20.637s 3037.097s 38.062s 16460.435 1302993.019
>
> Page fault scalability patch and prefaulting. Max prefault order
> increased to 5 (max preallocation of 32 pages):
>
> Gb Rep Threads User System Wall flt/cpu/s fault/wsec
> 256 10 8 33.571s 4516.293s 863.021s 36874.099 194356.930
> 256 10 16 33.103s 3737.688s 461.028s 44492.553 363704.484
> 256 10 32 35.094s 3436.561s 321.080s 48326.262 521352.840
> 256 10 64 46.675s 2899.997s 245.020s 56936.124 684214.256
> 256 10 128 85.493s 2890.198s 203.008s 56380.890 826122.524
> 256 10 256 74.299s 1374.973s 99.088s 115762.963 1679630.272
> 256 10 512 62.760s 706.559s 53.027s 218078.311 3149273.714
>
> We are getting into an almost linear scalability in the high end with
> both patches and end up with a fault rate > 3 mio faults per second.
>
> The one thing that still takes up a lot of time is the zeroing
> of pages in the page fault handler. There is another
> set of patches that I am working on which will prezero pages
> and lead to another increase in performance by a factor of 2-4
> (if prezeroed pages are available, which may not always be the case).
> Maybe we can reach 10 mio faults/sec that way.
>
> Patch against 2.6.10-rc3-bk3:
>
> Index: linux-2.6.9/include/linux/sched.h
> ===================================================================
> --- linux-2.6.9.orig/include/linux/sched.h 2004-12-01 10:37:31.000000000 -0800
> +++ linux-2.6.9/include/linux/sched.h 2004-12-01 10:38:15.000000000 -0800
> @@ -537,6 +537,8 @@
> #endif
>
> struct list_head tasks;
> + unsigned long anon_fault_next_addr; /* Predicted sequential fault address */
> + int anon_fault_order; /* Last order of allocation on fault */
> /*
> * ptrace_list/ptrace_children forms the list of my children
> * that were stolen by a ptracer.
> Index: linux-2.6.9/mm/memory.c
> ===================================================================
> --- linux-2.6.9.orig/mm/memory.c 2004-12-01 10:38:11.000000000 -0800
> +++ linux-2.6.9/mm/memory.c 2004-12-01 10:45:01.000000000 -0800
> @@ -55,6 +55,7 @@
>
> #include <linux/swapops.h>
> #include <linux/elf.h>
> +#include <linux/pagevec.h>
>
> #ifndef CONFIG_DISCONTIGMEM
> /* use the per-pgdat data instead for discontigmem - mbligh */
> @@ -1432,8 +1433,106 @@
> unsigned long addr)
> {
> pte_t entry;
> - struct page * page = ZERO_PAGE(addr);
> + struct page * page;
> +
> + addr &= PAGE_MASK;
> +
> + if (current->anon_fault_next_addr == addr) {
> + unsigned long end_addr;
> + int order = current->anon_fault_order;
> +
> + /* Sequence of page faults detected. Perform preallocation of pages */
>
> + /* The order of preallocations increases with each successful prediction */
> + order++;
> +
> + if ((1 << order) < PAGEVEC_SIZE)
> + end_addr = addr + (1 << (order + PAGE_SHIFT));
> + else
> + end_addr = addr + PAGEVEC_SIZE * PAGE_SIZE;
> +
> + if (end_addr > vma->vm_end)
> + end_addr = vma->vm_end;
> + if ((addr & PMD_MASK) != (end_addr & PMD_MASK))
> + end_addr &= PMD_MASK;
> +
> + current->anon_fault_next_addr = end_addr;
> + current->anon_fault_order = order;
> +
> + if (write_access) {
> +
> + struct pagevec pv;
> + unsigned long a;
> + struct page **p;
> +
> + pte_unmap(page_table);
> + spin_unlock(&mm->page_table_lock);
> +
> + pagevec_init(&pv, 0);
> +
> + if (unlikely(anon_vma_prepare(vma)))
> + return VM_FAULT_OOM;
> +
> + /* Allocate the necessary pages */
> + for(a = addr;a < end_addr ; a += PAGE_SIZE) {
> + struct page *p = alloc_page_vma(GFP_HIGHUSER, vma, a);
> +
> + if (p) {
> + clear_user_highpage(p, a);
> + pagevec_add(&pv,p);
> + } else
> + break;
> + }
> + end_addr = a;
> +
> + spin_lock(&mm->page_table_lock);
> +
> + for(p = pv.pages; addr < end_addr; addr += PAGE_SIZE, p++) {
> +
> + page_table = pte_offset_map(pmd, addr);
> + if (!pte_none(*page_table)) {
> + /* Someone else got there first */
> + page_cache_release(*p);
> + pte_unmap(page_table);
> + continue;
> + }
> +
> + entry = maybe_mkwrite(pte_mkdirty(mk_pte(*p,
> + vma->vm_page_prot)),
> + vma);
> +
> + mm->rss++;
> + lru_cache_add_active(*p);
> + mark_page_accessed(*p);
> + page_add_anon_rmap(*p, vma, addr);
> +
> + set_pte(page_table, entry);
> + pte_unmap(page_table);
> +
> + /* No need to invalidate - it was non-present before */
> + update_mmu_cache(vma, addr, entry);
> + }
> + } else {
> + /* Read */
> + for(;addr < end_addr; addr += PAGE_SIZE) {
> + page_table = pte_offset_map(pmd, addr);
> + entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
> + set_pte(page_table, entry);
> + pte_unmap(page_table);
> +
> + /* No need to invalidate - it was non-present before */
> + update_mmu_cache(vma, addr, entry);
> +
> + };
> + }
> + spin_unlock(&mm->page_table_lock);
> + return VM_FAULT_MINOR;
> + }
> +
> + current->anon_fault_next_addr = addr + PAGE_SIZE;
> + current->anon_fault_order = 0;
> +
> + page = ZERO_PAGE(addr);
> /* Read-only mapping of ZERO_PAGE. */
> entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Anticipatory prefaulting in the page fault handler V1
2004-12-14 12:24 ` Akinobu Mita
@ 2004-12-14 15:25 ` Akinobu Mita
2004-12-14 20:25 ` Christoph Lameter
1 sibling, 0 replies; 23+ messages in thread
From: Akinobu Mita @ 2004-12-14 15:25 UTC (permalink / raw)
To: Christoph Lameter
Cc: Martin J. Bligh, nickpiggin, Jeff Garzik, torvalds, hugh, benh,
linux-mm, linux-ia64, linux-kernel
On Tuesday 14 December 2004 21:24, Akinobu Mita wrote:
> But there is not any guarantee that the page_tables for addr+PAGE_SIZE,
> addr+2*PAGE_SIZE, ... have not been mapped yet.
>
> Anyway, I will try your V2 patch.
>
The patch below fixes the V2 patch and adds a debug printk.
The output coincides with the processes that segfaulted.
# dmesg | grep ^comm:
comm: xscreensaver, addr_orig: ccdc40, addr: cce000, pid: 2995
comm: rhn-applet-gui, addr_orig: b6fd8020, addr: b6fd9000, pid: 3029
comm: rhn-applet-gui, addr_orig: b6e95020, addr: b6e96000, pid: 3029
comm: rhn-applet-gui, addr_orig: b6fd8020, addr: b6fd9000, pid: 3029
comm: rhn-applet-gui, addr_orig: b6e95020, addr: b6e96000, pid: 3029
comm: rhn-applet-gui, addr_orig: b6fd8020, addr: b6fd9000, pid: 3029
comm: X, addr_orig: 87e8000, addr: 87e9000, pid: 2874
comm: X, addr_orig: 87ea000, addr: 87eb000, pid: 2874
---
Read-access prefaulting may overwrite a page_table entry which has
already been mapped. This patch fixes that, and it shows which processes
might suffer from this problem.
--- 2.6-rc/mm/memory.c.orig 2004-12-14 22:06:08.000000000 +0900
+++ 2.6-rc/mm/memory.c 2004-12-14 23:42:34.000000000 +0900
@@ -1434,6 +1434,7 @@ do_anonymous_page(struct mm_struct *mm,
{
pte_t entry;
unsigned long end_addr;
+ unsigned long addr_orig = addr;
addr &= PAGE_MASK;
@@ -1517,9 +1518,15 @@ do_anonymous_page(struct mm_struct *mm,
/* Read */
entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
nextread:
- set_pte(page_table, entry);
- pte_unmap(page_table);
- update_mmu_cache(vma, addr, entry);
+ if (!pte_none(*page_table)) {
+ printk("comm: %s, addr_orig: %lx, addr: %lx, pid: %d\n",
+ current->comm, addr_orig, addr, current->pid);
+ pte_unmap(page_table);
+ } else {
+ set_pte(page_table, entry);
+ pte_unmap(page_table);
+ update_mmu_cache(vma, addr, entry);
+ }
addr += PAGE_SIZE;
if (unlikely(addr < end_addr)) {
pte_offset_map(pmd, addr);
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Anticipatory prefaulting in the page fault handler V1
2004-12-13 17:10 ` Christoph Lameter
2004-12-13 22:16 ` Martin J. Bligh
@ 2004-12-14 12:24 ` Akinobu Mita
2004-12-14 15:25 ` Akinobu Mita
2004-12-14 20:25 ` Christoph Lameter
1 sibling, 2 replies; 23+ messages in thread
From: Akinobu Mita @ 2004-12-14 12:24 UTC (permalink / raw)
To: Christoph Lameter
Cc: Martin J. Bligh, nickpiggin, Jeff Garzik, torvalds, hugh, benh,
linux-mm, linux-ia64, linux-kernel
On Tuesday 14 December 2004 02:10, Christoph Lameter wrote:
> On Mon, 13 Dec 2004, Akinobu Mita wrote:
> > 3) don't set_pte() for an entry which has already been set
>
> Not sure how this could have happened in the patch.
This is why I inserted pte_none() for each page_table in case of
read fault too.
If a read access fault occurred for the address "addr", it is
completely unnecessary to check the page_table for "addr" with
pte_none(), because page_table_lock is never released until
do_anonymous_page returns (in the read access fault case).
But there is not any guarantee that the page_tables for addr+PAGE_SIZE,
addr+2*PAGE_SIZE, ... have not been mapped yet.
Anyway, I will try your V2 patch.
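The point above can be modelled in user space. The sketch below is not kernel code: model_pte_t, model_pte_none() and prefault_read() are hypothetical names, and a value of 0 is assumed to mean "no mapping". It shows the rule being argued for: every entry touched by a read prefault is checked, and entries that are already mapped are left alone.

```c
#include <stddef.h>

typedef unsigned long model_pte_t;   /* hypothetical: 0 means "no mapping" */

int model_pte_none(model_pte_t pte)
{
    return pte == 0;
}

/* Install `zero_pte` into every empty slot in table[first..last).
 * Slots that are already mapped are skipped, never overwritten.
 * Returns the number of entries installed. */
size_t prefault_read(model_pte_t *table, size_t first, size_t last,
                     model_pte_t zero_pte)
{
    size_t installed = 0;

    for (size_t i = first; i < last; i++) {
        if (!model_pte_none(table[i]))
            continue;            /* someone else got there first */
        table[i] = zero_pte;
        installed++;
    }
    return installed;
}
```

With a table whose second slot is already mapped, only the other slots are filled and the existing entry keeps its value.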
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Anticipatory prefaulting in the page fault handler V1
2004-12-13 17:10 ` Christoph Lameter
@ 2004-12-13 22:16 ` Martin J. Bligh
2004-12-14 12:24 ` Akinobu Mita
1 sibling, 0 replies; 23+ messages in thread
From: Martin J. Bligh @ 2004-12-13 22:16 UTC (permalink / raw)
To: Christoph Lameter, Akinobu Mita
Cc: nickpiggin, Jeff Garzik, torvalds, hugh, benh, linux-mm,
linux-ia64, linux-kernel
>> I also encountered segfaulting processes.
>> The patch below fixes several problems.
>>
>> 1) if no pages could be allocated, return VM_FAULT_OOM
>> 2) fix the duplicated pte_offset_map() call
>
> I also saw these two issues and I think I dealt with them in a forthcoming
> patch.
>
>> 3) don't set_pte() for an entry which has already been set
>
> Not sure how this could have happened in the patch.
>
> Could you try my updated version:
Urgle. There was a fix from Hugh too ... any chance you could just stick
a whole new patch somewhere? I'm too idle/stupid to work it out ;-)
M.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Anticipatory prefaulting in the page fault handler V1
2004-12-13 14:30 ` Akinobu Mita
@ 2004-12-13 17:10 ` Christoph Lameter
2004-12-13 22:16 ` Martin J. Bligh
2004-12-14 12:24 ` Akinobu Mita
0 siblings, 2 replies; 23+ messages in thread
From: Christoph Lameter @ 2004-12-13 17:10 UTC (permalink / raw)
To: Akinobu Mita
Cc: Martin J. Bligh, nickpiggin, Jeff Garzik, torvalds, hugh, benh,
linux-mm, linux-ia64, linux-kernel
On Mon, 13 Dec 2004, Akinobu Mita wrote:
> I also encountered segfaulting processes.
> The patch below fixes several problems.
>
> 1) if no pages could be allocated, return VM_FAULT_OOM
> 2) fix the duplicated pte_offset_map() call
I also saw these two issues and I think I dealt with them in a forthcoming
patch.
> 3) don't set_pte() for an entry which has already been set
Not sure how this could have happened in the patch.
Could you try my updated version:
Index: linux-2.6.9/include/linux/sched.h
===================================================================
--- linux-2.6.9.orig/include/linux/sched.h 2004-12-08 15:01:48.801457702 -0800
+++ linux-2.6.9/include/linux/sched.h 2004-12-08 15:02:04.286479345 -0800
@@ -537,6 +537,8 @@
#endif
struct list_head tasks;
+ unsigned long anon_fault_next_addr; /* Predicted sequential fault address */
+ int anon_fault_order; /* Last order of allocation on fault */
/*
* ptrace_list/ptrace_children forms the list of my children
* that were stolen by a ptracer.
Index: linux-2.6.9/mm/memory.c
===================================================================
--- linux-2.6.9.orig/mm/memory.c 2004-12-08 15:01:50.668339751 -0800
+++ linux-2.6.9/mm/memory.c 2004-12-09 14:21:17.090061608 -0800
@@ -55,6 +55,7 @@
#include <linux/swapops.h>
#include <linux/elf.h>
+#include <linux/pagevec.h>
#ifndef CONFIG_DISCONTIGMEM
/* use the per-pgdat data instead for discontigmem - mbligh */
@@ -1432,52 +1433,99 @@
unsigned long addr)
{
pte_t entry;
- struct page * page = ZERO_PAGE(addr);
-
- /* Read-only mapping of ZERO_PAGE. */
- entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
+ unsigned long end_addr;
+
+ addr &= PAGE_MASK;
+
+ if (likely((vma->vm_flags & VM_RAND_READ) || current->anon_fault_next_addr != addr)) {
+ /* Single page */
+ current->anon_fault_order = 0;
+ end_addr = addr + PAGE_SIZE;
+ } else {
+ /* Sequence of faults detect. Perform preallocation */
+ int order = ++current->anon_fault_order;
+
+ if ((1 << order) < PAGEVEC_SIZE)
+ end_addr = addr + (PAGE_SIZE << order);
+ else
+ end_addr = addr + PAGEVEC_SIZE * PAGE_SIZE;
- /* ..except if it's a write access */
+ if (end_addr > vma->vm_end)
+ end_addr = vma->vm_end;
+ if ((addr & PMD_MASK) != (end_addr & PMD_MASK))
+ end_addr &= PMD_MASK;
+ }
if (write_access) {
- /* Allocate our own private page. */
+
+ unsigned long a;
+ struct page **p;
+ struct pagevec pv;
+
pte_unmap(page_table);
spin_unlock(&mm->page_table_lock);
+ pagevec_init(&pv, 0);
+
if (unlikely(anon_vma_prepare(vma)))
- goto no_mem;
- page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
- if (!page)
- goto no_mem;
- clear_user_highpage(page, addr);
+ return VM_FAULT_OOM;
+
+ /* Allocate the necessary pages */
+ for(a = addr; a < end_addr ; a += PAGE_SIZE) {
+ struct page *p = alloc_page_vma(GFP_HIGHUSER, vma, a);
+
+ if (likely(p)) {
+ clear_user_highpage(p, a);
+ pagevec_add(&pv, p);
+ } else {
+ if (a == addr)
+ return VM_FAULT_OOM;
+ break;
+ }
+ }
spin_lock(&mm->page_table_lock);
- page_table = pte_offset_map(pmd, addr);
- if (!pte_none(*page_table)) {
+ for(p = pv.pages; addr < a; addr += PAGE_SIZE, p++) {
+
+ page_table = pte_offset_map(pmd, addr);
+ if (unlikely(!pte_none(*page_table))) {
+ /* Someone else got there first */
+ pte_unmap(page_table);
+ page_cache_release(*p);
+ continue;
+ }
+
+ entry = maybe_mkwrite(pte_mkdirty(mk_pte(*p,
+ vma->vm_page_prot)),
+ vma);
+
+ mm->rss++;
+ lru_cache_add_active(*p);
+ mark_page_accessed(*p);
+ page_add_anon_rmap(*p, vma, addr);
+
+ set_pte(page_table, entry);
pte_unmap(page_table);
- page_cache_release(page);
- spin_unlock(&mm->page_table_lock);
- goto out;
+
+ /* No need to invalidate - it was non-present before */
+ update_mmu_cache(vma, addr, entry);
+ }
+ } else {
+ /* Read */
+ entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
+nextread:
+ set_pte(page_table, entry);
+ pte_unmap(page_table);
+ update_mmu_cache(vma, addr, entry);
+ addr += PAGE_SIZE;
+ if (unlikely(addr < end_addr)) {
+ pte_offset_map(pmd, addr);
+ goto nextread;
}
- mm->rss++;
- entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
- vma->vm_page_prot)),
- vma);
- lru_cache_add_active(page);
- mark_page_accessed(page);
- page_add_anon_rmap(page, vma, addr);
}
-
- set_pte(page_table, entry);
- pte_unmap(page_table);
-
- /* No need to invalidate - it was non-present before */
- update_mmu_cache(vma, addr, entry);
+ current->anon_fault_next_addr = addr;
spin_unlock(&mm->page_table_lock);
-out:
return VM_FAULT_MINOR;
-no_mem:
- return VM_FAULT_OOM;
}
/*
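The window calculation in the patch above can be tried out in isolation. The following is a user-space sketch under stated assumptions: 4 KiB pages, a 2 MiB PMD span and a pagevec of 16 entries are illustrative values, and all MODEL_* names and prefault_end() are hypothetical. The logic follows the patch: the batch doubles with each correct prediction, is capped at one pagevec, and is then clamped to the end of the VMA and to the PMD that contains the faulting address.

```c
#define MODEL_PAGE_SHIFT   12                      /* assumed 4 KiB pages */
#define MODEL_PAGE_SIZE    (1UL << MODEL_PAGE_SHIFT)
#define MODEL_PMD_SHIFT    21                      /* assumed 2 MiB per PMD */
#define MODEL_PMD_MASK     (~((1UL << MODEL_PMD_SHIFT) - 1))
#define MODEL_PAGEVEC_SIZE 16UL                    /* assumed pagevec capacity */

/* Return the exclusive end address of the prefault window for a fault
 * at page-aligned `addr` with the given prediction `order`. */
unsigned long prefault_end(unsigned long addr, int order, unsigned long vm_end)
{
    unsigned long end;

    /* Batch is 2^order pages, capped at one pagevec. The range check
     * on `order` keeps the shift well-defined for large values. */
    if (order >= 0 && order < 31 && (1UL << order) < MODEL_PAGEVEC_SIZE)
        end = addr + (MODEL_PAGE_SIZE << order);
    else
        end = addr + MODEL_PAGEVEC_SIZE * MODEL_PAGE_SIZE;

    /* Never run past the end of the VMA. */
    if (end > vm_end)
        end = vm_end;

    /* Never cross into the next PMD: one page table per batch. */
    if ((addr & MODEL_PMD_MASK) != (end & MODEL_PMD_MASK))
        end &= MODEL_PMD_MASK;

    return end;
}
```

For example, a fault at 0x1FF000 with order 2 would cover four pages, but the window is cut back to 0x200000 because the batch would otherwise spill into the next PMD.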
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Anticipatory prefaulting in the page fault handler V1
2004-12-09 19:32 ` Christoph Lameter
@ 2004-12-13 14:30 ` Akinobu Mita
2004-12-13 17:10 ` Christoph Lameter
0 siblings, 1 reply; 23+ messages in thread
From: Akinobu Mita @ 2004-12-13 14:30 UTC (permalink / raw)
To: Christoph Lameter, Martin J. Bligh
Cc: nickpiggin, Jeff Garzik, torvalds, hugh, benh, linux-mm,
linux-ia64, linux-kernel
On Friday 10 December 2004 04:32, Christoph Lameter wrote:
> On Wed, 8 Dec 2004, Martin J. Bligh wrote:
> > I tried benchmarking it ... but processes just segfault all the time.
> > Any chance you could try it out on SMP ia32 system?
>
> I tried it on my i386 system and it works fine. Sorry about the puny
> memory sizes (the system is a PIII-450 with 384k memory)
>
I also encountered segfaulting processes.
The patch below fixes several problems.
1) if no pages could be allocated, return VM_FAULT_OOM
2) fix the duplicated pte_offset_map() call
3) don't set_pte() for an entry which has already been set
Actually, 3) fixes my segfault problem.
--- 2.6-rc/mm/memory.c.orig 2004-12-13 22:17:04.000000000 +0900
+++ 2.6-rc/mm/memory.c 2004-12-13 22:22:14.000000000 +0900
@@ -1483,6 +1483,8 @@ do_anonymous_page(struct mm_struct *mm,
} else
break;
}
+ if (a == addr)
+ goto no_mem;
end_addr = a;
spin_lock(&mm->page_table_lock);
@@ -1514,8 +1516,17 @@ do_anonymous_page(struct mm_struct *mm,
}
} else {
/* Read */
+ int first = 1;
+
for(;addr < end_addr; addr += PAGE_SIZE) {
- page_table = pte_offset_map(pmd, addr);
+ if (!first)
+ page_table = pte_offset_map(pmd, addr);
+ first = 0;
+ if (!pte_none(*page_table)) {
+ /* Someone else got there first */
+ pte_unmap(page_table);
+ continue;
+ }
entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
set_pte(page_table, entry);
pte_unmap(page_table);
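Fix 1) can be illustrated outside the kernel. In this sketch (a model, not the patch itself) a batch allocation may stop early; that is fine as long as at least the page for the faulting address was obtained, and only an empty batch is reported as out of memory. The names alloc_batch(), model_alloc_page(), model_budget and the MODEL_FAULT_* codes are hypothetical, and malloc() stands in for the page allocator.

```c
#include <stddef.h>
#include <stdlib.h>

enum { MODEL_FAULT_MINOR = 0, MODEL_FAULT_OOM = -1 };  /* hypothetical codes */

typedef void *(*page_alloc_fn)(void);

/* Hypothetical allocator with a budget, so that failure can be injected. */
size_t model_budget;

void *model_alloc_page(void)
{
    if (model_budget == 0)
        return NULL;
    model_budget--;
    return malloc(4096);
}

/* Try to allocate `want` pages (want >= 1) into `pages`. A short batch
 * is accepted; only a batch with no pages at all is an OOM condition. */
int alloc_batch(page_alloc_fn alloc, void **pages, size_t want, size_t *got)
{
    size_t n = 0;

    while (n < want) {
        void *p = alloc();
        if (!p)
            break;               /* keep what we have so far */
        pages[n++] = p;
    }
    *got = n;
    return n == 0 ? MODEL_FAULT_OOM : MODEL_FAULT_MINOR;
}
```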
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Anticipatory prefaulting in the page fault handler V1
2004-12-08 22:50 ` Martin J. Bligh
@ 2004-12-09 19:32 ` Christoph Lameter
2004-12-13 14:30 ` Akinobu Mita
0 siblings, 1 reply; 23+ messages in thread
From: Christoph Lameter @ 2004-12-09 19:32 UTC (permalink / raw)
To: Martin J. Bligh
Cc: nickpiggin, Jeff Garzik, torvalds, hugh, benh, linux-mm,
linux-ia64, linux-kernel
On Wed, 8 Dec 2004, Martin J. Bligh wrote:
> I tried benchmarking it ... but processes just segfault all the time.
> Any chance you could try it out on SMP ia32 system?
I tried it on my i386 system and it works fine. Sorry about the puny
memory sizes (the system is a PIII-450 with 384k memory)
clameter@schroedinger:~/pfault/code$ ./pft -t -b256000 -r3 -f1
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
0 3 1 0.000s 0.004s 0.000s 37407.481 29200.500
0 3 2 0.002s 0.002s 0.000s 31177.059 27227.723
clameter@schroedinger:~/pfault/code$ uname -a
Linux schroedinger 2.6.10-rc3-bk3-prezero #8 SMP Wed Dec 8 15:22:28 PST
2004 i686 GNU/Linux
Could you send me your .config?
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Anticipatory prefaulting in the page fault handler V1
2004-12-09 10:57 ` Pavel Machek
2004-12-09 11:32 ` Nick Piggin
@ 2004-12-09 17:05 ` Christoph Lameter
1 sibling, 0 replies; 23+ messages in thread
From: Christoph Lameter @ 2004-12-09 17:05 UTC (permalink / raw)
To: Pavel Machek
Cc: nickpiggin, Jeff Garzik, torvalds, hugh, benh, linux-mm,
linux-ia64, linux-kernel
On Thu, 9 Dec 2004, Pavel Machek wrote:
> Hi!
>
> > Standard Kernel on a 512 Cpu machine allocating 32GB with an increasing
> > number of threads (and thus increasing parallelism of page faults):
> >
> > Gb Rep Threads User System Wall flt/cpu/s fault/wsec
> > 32 3 1 1.416s 138.165s 139.050s 45073.831 45097.498
> ...
> > Patched kernel:
> >
> > Gb Rep Threads User System Wall flt/cpu/s fault/wsec
> > 32 3 1 1.098s 138.544s 139.063s 45053.657 45057.920
> ...
> > These numbers are roughly equal to what can be accomplished with the
> > page fault scalability patches.
> >
> > Kernel patches with both the page fault scalability patches and
> > prefaulting:
> >
> > Gb Rep Threads User System Wall flt/cpu/s fault/wsec
> > 32 10 1 4.103s 456.384s 460.046s 45541.992 45544.369
> ...
> >
> > The fault rate doubles when both patches are applied.
> ...
> > We are getting into an almost linear scalability in the high end with
> > both patches and end up with a fault rate > 3 mio faults per second.
>
> Well, with both patches you also slow the single-threaded case by more
> than a factor of two. What are the effects of this patch on a UP system?
The faults per second are slightly increased, so it's faster. The last
numbers are 10 repetitions and not 3. Do not look at the wall time.
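To make the comparison concrete, the wall times can be normalised by the repetition count from the Rep column. The helper below is a trivial sketch and wall_per_rep() is a hypothetical name. Using the single-thread rows quoted above, 139.050s over 3 repetitions is about 46.35s per repetition, and 460.046s over 10 repetitions is about 46.00s per repetition.

```c
/* Wall-clock seconds spent per benchmark repetition. */
double wall_per_rep(double wall_seconds, int reps)
{
    return wall_seconds / reps;
}
```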
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Anticipatory prefaulting in the page fault handler V1
2004-12-09 10:57 ` Pavel Machek
@ 2004-12-09 11:32 ` Nick Piggin
2004-12-09 17:05 ` Christoph Lameter
1 sibling, 0 replies; 23+ messages in thread
From: Nick Piggin @ 2004-12-09 11:32 UTC (permalink / raw)
To: Pavel Machek
Cc: Christoph Lameter, Jeff Garzik, torvalds, hugh, benh, linux-mm,
linux-ia64, linux-kernel
Pavel Machek wrote:
> Hi!
>
>
>>Standard Kernel on a 512 Cpu machine allocating 32GB with an increasing
>>number of threads (and thus increasing parallelism of page faults):
>>
>> Gb Rep Threads User System Wall flt/cpu/s fault/wsec
>> 32 3 1 1.416s 138.165s 139.050s 45073.831 45097.498
>
> ...
>
>>Patched kernel:
>>
>>Gb Rep Threads User System Wall flt/cpu/s fault/wsec
>> 32 3 1 1.098s 138.544s 139.063s 45053.657 45057.920
>
> ...
>
>>These numbers are roughly equal to what can be accomplished with the
>>page fault scalability patches.
>>
>>Kernel patches with both the page fault scalability patches and
>>prefaulting:
>>
>> Gb Rep Threads User System Wall flt/cpu/s fault/wsec
>> 32 10 1 4.103s 456.384s 460.046s 45541.992 45544.369
>
> ...
>
>>The fault rate doubles when both patches are applied.
>
> ...
>
>>We are getting into an almost linear scalability in the high end with
>>both patches and end up with a fault rate > 3 mio faults per second.
>
>
> Well, with both patches you also slow the single-threaded case by more
> than a factor of two. What are the effects of this patch on a UP system?
fault/wsec is the important number.
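Since fault/wsec is a rate, it can be compared directly across runs with different repetition counts. A small sketch (rate_ratio() is a hypothetical helper): for the single-thread rows quoted above, 45544.369 against 45097.498 is a ratio of roughly 1.01, that is, a slight gain rather than a slowdown.

```c
/* Ratio of two fault rates; a value above 1.0 means `patched` is faster. */
double rate_ratio(double baseline, double patched)
{
    return patched / baseline;
}
```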
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Anticipatory prefaulting in the page fault handler V1
2004-12-08 17:24 ` Anticipatory prefaulting in the page fault handler V1 Christoph Lameter
` (3 preceding siblings ...)
2004-12-08 22:50 ` Martin J. Bligh
@ 2004-12-09 10:57 ` Pavel Machek
2004-12-09 11:32 ` Nick Piggin
2004-12-09 17:05 ` Christoph Lameter
2004-12-14 15:28 ` Adam Litke
5 siblings, 2 replies; 23+ messages in thread
From: Pavel Machek @ 2004-12-09 10:57 UTC (permalink / raw)
To: Christoph Lameter
Cc: nickpiggin, Jeff Garzik, torvalds, hugh, benh, linux-mm,
linux-ia64, linux-kernel
Hi!
> Standard Kernel on a 512 Cpu machine allocating 32GB with an increasing
> number of threads (and thus increasing parallelism of page faults):
>
> Gb Rep Threads User System Wall flt/cpu/s fault/wsec
> 32 3 1 1.416s 138.165s 139.050s 45073.831 45097.498
...
> Patched kernel:
>
> Gb Rep Threads User System Wall flt/cpu/s fault/wsec
> 32 3 1 1.098s 138.544s 139.063s 45053.657 45057.920
...
> These numbers are roughly equal to what can be accomplished with the
> page fault scalability patches.
>
> Kernel patches with both the page fault scalability patches and
> prefaulting:
>
> Gb Rep Threads User System Wall flt/cpu/s fault/wsec
> 32 10 1 4.103s 456.384s 460.046s 45541.992 45544.369
...
>
> The fault rate doubles when both patches are applied.
...
> We are getting into an almost linear scalability in the high end with
> both patches and end up with a fault rate > 3 mio faults per second.
Well, with both patches you also slow the single-threaded case by more
than a factor of two. What are the effects of this patch on a UP system?
Pavel
--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Anticipatory prefaulting in the page fault handler V1
2004-12-08 17:24 ` Anticipatory prefaulting in the page fault handler V1 Christoph Lameter
` (2 preceding siblings ...)
2004-12-08 19:07 ` Martin J. Bligh
@ 2004-12-08 22:50 ` Martin J. Bligh
2004-12-09 19:32 ` Christoph Lameter
2004-12-09 10:57 ` Pavel Machek
2004-12-14 15:28 ` Adam Litke
5 siblings, 1 reply; 23+ messages in thread
From: Martin J. Bligh @ 2004-12-08 22:50 UTC (permalink / raw)
To: Christoph Lameter, nickpiggin
Cc: Jeff Garzik, torvalds, hugh, benh, linux-mm, linux-ia64, linux-kernel
> The page fault handler for anonymous pages can generate significant overhead
> apart from its essential function, which is to clear and set up a new page
> table entry for a never accessed memory location. This overhead increases
> significantly in an SMP environment.
>
> In the page table scalability patches, we addressed the issue by changing
> the locking scheme so that multiple fault handlers are able to be processed
> concurrently on multiple cpus. This patch attempts to aggregate multiple
> page faults into a single one. It does that by noting
> anonymous page faults generated in sequence by an application.
>
> If a fault occurred for page x and is then followed by page x+1 then it may
> be reasonable to expect another page fault at x+2 in the future. If page
> table entries for x+1 and x+2 would be prepared in the fault handling for
> page x+1 then the overhead of taking a fault for x+2 is avoided. However
> page x+2 may never be used and thus we may have increased the rss
> of an application unnecessarily. The swapper will take care of removing
> that page if memory should get tight.
I tried benchmarking it ... but processes just segfault all the time.
Any chance you could try it out on an SMP ia32 system?
M.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Anticipatory prefaulting in the page fault handler V1
2004-12-08 21:26 ` David S. Miller
@ 2004-12-08 21:42 ` Linus Torvalds
0 siblings, 0 replies; 23+ messages in thread
From: Linus Torvalds @ 2004-12-08 21:42 UTC (permalink / raw)
To: David S. Miller
Cc: Christoph Lameter, jbarnes, nickpiggin, jgarzik, hugh, benh,
linux-mm, linux-ia64, linux-kernel
On Wed, 8 Dec 2004, David S. Miller wrote:
>
> I see. Yet I noticed that while the patch makes system time decrease,
> for some reason the wall time is increasing with the patch applied.
> Why is that, or am I misreading your tables?
I assume that you're looking at the final "both patches applied" case.
It has ten repetitions, while the other two tables only have three. That
would explain the discrepancy.
Linus
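The normalization Linus points to can be checked with a few lines of C. This is only a sketch: `wall_per_rep` is a hypothetical helper, not something from the patch or the benchmark, and the inputs are the single-thread Wall and Rep values exactly as printed in the tables.

```c
#include <assert.h>

/* Hypothetical helper: average wall-clock seconds for one repetition. */
static double wall_per_rep(double wall_seconds, int reps)
{
	assert(reps > 0);	/* a run with zero repetitions has no average */
	return wall_seconds / reps;
}
```

With the printed single-thread rows, `wall_per_rep(139.050, 3)` (standard kernel) and `wall_per_rep(460.046, 10)` (both patches) come out at about 46.35 and 46.00 seconds, so per repetition the wall time is essentially unchanged.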
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Anticipatory prefaulting in the page fault handler V1
2004-12-08 17:56 ` Christoph Lameter
2004-12-08 18:33 ` Jesse Barnes
@ 2004-12-08 21:26 ` David S. Miller
2004-12-08 21:42 ` Linus Torvalds
1 sibling, 1 reply; 23+ messages in thread
From: David S. Miller @ 2004-12-08 21:26 UTC (permalink / raw)
To: Christoph Lameter
Cc: jbarnes, nickpiggin, jgarzik, torvalds, hugh, benh, linux-mm,
linux-ia64, linux-kernel
On Wed, 8 Dec 2004 09:56:00 -0800 (PST)
Christoph Lameter <clameter@sgi.com> wrote:
> A patch like this is important for applications that allocate and preset
> large amounts of memory on startup. It will drastically reduce the startup
> times.
I see. Yet I noticed that while the patch makes system time decrease,
for some reason the wall time is increasing with the patch applied.
Why is that, or am I misreading your tables?
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Anticipatory prefaulting in the page fault handler V1
2004-12-08 17:24 ` Anticipatory prefaulting in the page fault handler V1 Christoph Lameter
2004-12-08 17:33 ` Jesse Barnes
2004-12-08 17:55 ` Dave Hansen
@ 2004-12-08 19:07 ` Martin J. Bligh
2004-12-08 22:50 ` Martin J. Bligh
` (2 subsequent siblings)
5 siblings, 0 replies; 23+ messages in thread
From: Martin J. Bligh @ 2004-12-08 19:07 UTC (permalink / raw)
To: Christoph Lameter, nickpiggin
Cc: Jeff Garzik, torvalds, hugh, benh, linux-mm, linux-ia64, linux-kernel
> The page fault handler for anonymous pages can generate significant overhead
> apart from its essential function, which is to clear and set up a new page
> table entry for a never accessed memory location. This overhead increases
> significantly in an SMP environment.
>
> In the page table scalability patches, we addressed the issue by changing
> the locking scheme so that multiple fault handlers are able to be processed
> concurrently on multiple cpus. This patch attempts to aggregate multiple
> page faults into a single one. It does that by noting
> anonymous page faults generated in sequence by an application.
>
> If a fault occurred for page x and is then followed by page x+1 then it may
> be reasonable to expect another page fault at x+2 in the future. If page
> table entries for x+1 and x+2 would be prepared in the fault handling for
> page x+1 then the overhead of taking a fault for x+2 is avoided. However
> page x+2 may never be used and thus we may have increased the rss
> of an application unnecessarily. The swapper will take care of removing
> that page if memory should get tight.
>
> The following patch makes the anonymous fault handler anticipate future
> faults. For each fault a prediction is made where the fault would occur
> (assuming linear access by the application). If the prediction turns out to
> be right (next fault is where expected) then a number of pages is
> preallocated in order to avoid a series of future faults. The order of the
> preallocation increases by the power of two for each success in sequence.
>
> The first successful prediction leads to an additional page being allocated.
> Second successful prediction leads to 2 additional pages being allocated.
> Third to 4 pages and so on. The max order is 3 by default. In a large
> contiguous allocation the number of faults is reduced by a factor of 8.
>
> The patch may be combined with the page fault scalability patch (another
> edition of the patch is needed which will be forthcoming after the
> page fault scalability patch has been included). The combined patches
> will triple the possible page fault rate from ~1 million faults/sec to
> 3 million faults/sec.
>
> Standard Kernel on a 512 Cpu machine allocating 32GB with an increasing
> number of threads (and thus increasing parallelism of page faults):
Mmmm ... we tried doing this before for filebacked pages by sniffing the
pagecache, but it crippled forky workloads (like kernel compile) with the
extra cost in zap_pte_range, etc.
Perhaps the locality is better for the anon stuff, but the cost is also
higher. Exactly what benchmark were you running on this? If you just run
a microbenchmark that allocates memory, then it will definitely be faster.
On other things, I suspect not ...
M.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Anticipatory prefaulting in the page fault handler V1
2004-12-08 17:56 ` Christoph Lameter
@ 2004-12-08 18:33 ` Jesse Barnes
2004-12-08 21:26 ` David S. Miller
1 sibling, 0 replies; 23+ messages in thread
From: Jesse Barnes @ 2004-12-08 18:33 UTC (permalink / raw)
To: Christoph Lameter
Cc: nickpiggin, Jeff Garzik, torvalds, hugh, benh, linux-mm,
linux-ia64, linux-kernel
On Wednesday, December 8, 2004 9:56 am, Christoph Lameter wrote:
> > And again, I'm not sure how important that is, maybe this approach will
> > work well in the majority of cases (obviously it's a big win in
> > faults/sec for your benchmark, but I wonder about subsequent references
> > from other CPUs to those pages). You can look at
> > /sys/devices/platform/nodeN/meminfo to see where the pages are coming
> > from.
>
> The origin of the pages has not changed and the existing locality
> constraints are observed.
>
> A patch like this is important for applications that allocate and preset
> large amounts of memory on startup. It will drastically reduce the startup
> times.
Ok, that sounds good. My case was probably a bit contrived, but I'm glad to
see that you had already thought of it anyway.
Jesse
^ permalink raw reply [flat|nested] 23+ messages in thread
* RE: Anticipatory prefaulting in the page fault handler V1
@ 2004-12-08 18:31 Luck, Tony
0 siblings, 0 replies; 23+ messages in thread
From: Luck, Tony @ 2004-12-08 18:31 UTC (permalink / raw)
To: Christoph Lameter
Cc: nickpiggin, Jeff Garzik, torvalds, hugh, benh, linux-mm,
linux-ia64, linux-kernel
>We could use that as a way to switch off the preallocation. How
>expensive is that check?
If you already looked up the vma, then it is very cheap. Just
check for VM_RAND_READ in vma->vm_flags.
-Tony
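A minimal user-space sketch of the check Tony describes. The `sketch_vma` type, the `SKETCH_VM_RAND_READ` value and the `prefault_allowed` helper are stand-ins invented for illustration; in the kernel the test would be made against `VM_RAND_READ` in `vma->vm_flags`, as stated above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-in; the real constant lives in the kernel headers. */
#define SKETCH_VM_RAND_READ 0x00010000UL

/* Hypothetical, stripped-down model of a vma: only the flags word. */
struct sketch_vma {
	unsigned long vm_flags;
};

/* Prefaulting is skipped when the mapping was marked for random access. */
static bool prefault_allowed(const struct sketch_vma *vma)
{
	assert(vma != NULL);
	return (vma->vm_flags & SKETCH_VM_RAND_READ) == 0;
}
```

The cost is a single load and mask on a structure the fault handler already holds, which is why the check is cheap once the vma has been looked up.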
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Anticipatory prefaulting in the page fault handler V1
2004-12-08 17:33 ` Jesse Barnes
@ 2004-12-08 17:56 ` Christoph Lameter
2004-12-08 18:33 ` Jesse Barnes
2004-12-08 21:26 ` David S. Miller
0 siblings, 2 replies; 23+ messages in thread
From: Christoph Lameter @ 2004-12-08 17:56 UTC (permalink / raw)
To: Jesse Barnes
Cc: nickpiggin, Jeff Garzik, torvalds, hugh, benh, linux-mm,
linux-ia64, linux-kernel
On Wed, 8 Dec 2004, Jesse Barnes wrote:
> Nice results! Any idea how many applications benefit from this sort of
> anticipatory faulting? It has implications for NUMA allocation. Imagine an
> app that allocates a large virtual address space and then tries to fault in
> pages near each CPU in turn. With this patch applied, CPU 2 would be
> referencing pages near CPU 1, and CPU 3 would then fault in 4 pages, which
> would then be used by CPUs 4-6. Unless I'm missing something...
Faults are predicted for each thread executing on a different processor.
So each processor does its own predictions which will not generate
preallocations on a different processor (unless the thread is moved to
another processor but that is a very special situation).
> And again, I'm not sure how important that is, maybe this approach will work
> well in the majority of cases (obviously it's a big win in faults/sec for
> your benchmark, but I wonder about subsequent references from other CPUs to
> those pages). You can look at /sys/devices/platform/nodeN/meminfo to see
> where the pages are coming from.
The origin of the pages has not changed and the existing locality
constraints are observed.
A patch like this is important for applications that allocate and preset
large amounts of memory on startup. It will drastically reduce the startup
times.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Anticipatory prefaulting in the page fault handler V1
2004-12-08 17:24 ` Anticipatory prefaulting in the page fault handler V1 Christoph Lameter
2004-12-08 17:33 ` Jesse Barnes
@ 2004-12-08 17:55 ` Dave Hansen
2004-12-08 19:07 ` Martin J. Bligh
` (3 subsequent siblings)
5 siblings, 0 replies; 23+ messages in thread
From: Dave Hansen @ 2004-12-08 17:55 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, Jeff Garzik, Linus Torvalds, hugh,
Benjamin Herrenschmidt, linux-mm, linux-ia64,
Linux Kernel Mailing List
On Wed, 2004-12-08 at 09:24, Christoph Lameter wrote:
> The page fault handler for anonymous pages can generate significant overhead
> apart from its essential function, which is to clear and set up a new page
> table entry for a never accessed memory location. This overhead increases
> significantly in an SMP environment.
do_anonymous_page() is a relatively compact function at this point.
This would probably be a lot more readable if it was broken out into at
least another function or two that do_anonymous_page() calls into. That
way, you also get a much cleaner separation if anyone needs to turn it
off in the future.
Speaking of that, have you seen this impair performance on any other
workloads?
-- Dave
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Anticipatory prefaulting in the page fault handler V1
2004-12-08 17:24 ` Anticipatory prefaulting in the page fault handler V1 Christoph Lameter
@ 2004-12-08 17:33 ` Jesse Barnes
2004-12-08 17:56 ` Christoph Lameter
2004-12-08 17:55 ` Dave Hansen
` (4 subsequent siblings)
5 siblings, 1 reply; 23+ messages in thread
From: Jesse Barnes @ 2004-12-08 17:33 UTC (permalink / raw)
To: Christoph Lameter
Cc: nickpiggin, Jeff Garzik, torvalds, hugh, benh, linux-mm,
linux-ia64, linux-kernel
On Wednesday, December 8, 2004 9:24 am, Christoph Lameter wrote:
> Page fault scalability patch and prefaulting. Max prefault order
> increased to 5 (max preallocation of 32 pages):
>
> Gb Rep Threads User System Wall flt/cpu/s fault/wsec
> 256 10 8 33.571s 4516.293s 863.021s 36874.099 194356.930
> 256 10 16 33.103s 3737.688s 461.028s 44492.553 363704.484
> 256 10 32 35.094s 3436.561s 321.080s 48326.262 521352.840
> 256 10 64 46.675s 2899.997s 245.020s 56936.124 684214.256
> 256 10 128 85.493s 2890.198s 203.008s 56380.890 826122.524
> 256 10 256 74.299s 1374.973s 99.088s 115762.963 1679630.272
> 256 10 512 62.760s 706.559s 53.027s 218078.311 3149273.714
>
> We are approaching almost linear scalability at the high end with
> both patches and end up with a fault rate of more than 3 million faults per second.
Nice results! Any idea how many applications benefit from this sort of
anticipatory faulting? It has implications for NUMA allocation. Imagine an
app that allocates a large virtual address space and then tries to fault in
pages near each CPU in turn. With this patch applied, CPU 2 would be
referencing pages near CPU 1, and CPU 3 would then fault in 4 pages, which
would then be used by CPUs 4-6. Unless I'm missing something...
And again, I'm not sure how important that is, maybe this approach will work
well in the majority of cases (obviously it's a big win in faults/sec for
your benchmark, but I wonder about subsequent references from other CPUs to
those pages). You can look at /sys/devices/platform/nodeN/meminfo to see
where the pages are coming from.
Jesse
^ permalink raw reply [flat|nested] 23+ messages in thread
* Anticipatory prefaulting in the page fault handler V1
2004-12-02 18:10 ` cliff white
@ 2004-12-08 17:24 ` Christoph Lameter
2004-12-08 17:33 ` Jesse Barnes
` (5 more replies)
0 siblings, 6 replies; 23+ messages in thread
From: Christoph Lameter @ 2004-12-08 17:24 UTC (permalink / raw)
To: nickpiggin
Cc: Jeff Garzik, torvalds, hugh, benh, linux-mm, linux-ia64, linux-kernel
The page fault handler for anonymous pages can generate significant overhead
apart from its essential function, which is to clear and set up a new page
table entry for a never accessed memory location. This overhead increases
significantly in an SMP environment.
In the page table scalability patches, we addressed the issue by changing
the locking scheme so that multiple fault handlers are able to be processed
concurrently on multiple cpus. This patch attempts to aggregate multiple
page faults into a single one. It does that by noting
anonymous page faults generated in sequence by an application.
If a fault occurred for page x and is then followed by page x+1 then it may
be reasonable to expect another page fault at x+2 in the future. If page
table entries for x+1 and x+2 would be prepared in the fault handling for
page x+1 then the overhead of taking a fault for x+2 is avoided. However
page x+2 may never be used and thus we may have increased the rss
of an application unnecessarily. The swapper will take care of removing
that page if memory should get tight.
The following patch makes the anonymous fault handler anticipate future
faults. For each fault a prediction is made where the fault would occur
(assuming linear access by the application). If the prediction turns out to
be right (next fault is where expected) then a number of pages is
preallocated in order to avoid a series of future faults. The order of the
preallocation increases by the power of two for each success in sequence.
The first successful prediction leads to an additional page being allocated.
Second successful prediction leads to 2 additional pages being allocated.
Third to 4 pages and so on. The max order is 3 by default. In a large
contiguous allocation the number of faults is reduced by a factor of 8.
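The growth schedule described above can be modelled in user space. The following is a sketch under stated assumptions, not the patch itself: `count_faults` and its parameters are hypothetical, the application is assumed to touch pages strictly in ascending order, each successful prediction is assumed to map 1 << order pages starting at the faulting page, and the clamping to the vma end and the pmd boundary that the patch performs is ignored.

```c
#include <assert.h>

/*
 * Hypothetical model: count the faults taken while touching npages pages
 * in sequence, with the prefault order capped at max_order.
 */
static unsigned long count_faults(unsigned long npages, int max_order)
{
	unsigned long page = 0;		/* next page the application touches */
	unsigned long predicted = 0;	/* page where the next fault is expected */
	unsigned long faults = 0;
	int have_prediction = 0;	/* no prediction before the first fault */
	int order = 0;

	assert(max_order >= 0 && max_order < 16);

	while (page < npages) {
		unsigned long mapped = 1;	/* a plain fault maps one page */

		faults++;
		if (have_prediction && page == predicted) {
			/* Prediction was right: grow the order up to the cap. */
			if (order < max_order)
				order++;
			mapped = 1UL << order;
		} else {
			/* First fault or a miss: start over at order 0. */
			order = 0;
		}
		predicted = page + mapped;
		have_prediction = 1;
		page = predicted;	/* sequential access resumes after the mapped run */
	}
	return faults;
}
```

For a long sequential run the ratio of `npages` to `count_faults(npages, 3)` approaches 8, which is the reduction quoted above, while `count_faults(npages, 0)` models a kernel that takes one fault per page.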
The patch may be combined with the page fault scalability patch (another
edition of the patch is needed which will be forthcoming after the
page fault scalability patch has been included). The combined patches
will triple the possible page fault rate from ~1 million faults/sec to
3 million faults/sec.
Standard Kernel on a 512 Cpu machine allocating 32GB with an increasing
number of threads (and thus increasing parallelism of page faults):
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
32 3 1 1.416s 138.165s 139.050s 45073.831 45097.498
32 3 2 1.397s 148.523s 78.044s 41965.149 80201.646
32 3 4 1.390s 152.618s 44.044s 40851.258 141545.239
32 3 8 1.500s 374.008s 53.001s 16754.519 118671.950
32 3 16 1.415s 1051.759s 73.094s 5973.803 85087.358
32 3 32 1.867s 3400.417s 117.003s 1849.186 53754.928
32 3 64 5.361s 11633.040s 197.034s 540.577 31881.112
32 3 128 23.387s 39386.390s 332.055s 159.642 18918.599
32 3 256 15.409s 20031.450s 168.095s 313.837 37237.918
32 3 512 18.720s 10338.511s 86.047s 607.446 72752.686
Patched kernel:
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
32 3 1 1.098s 138.544s 139.063s 45053.657 45057.920
32 3 2 1.022s 127.770s 67.086s 48849.350 92707.085
32 3 4 0.995s 119.666s 37.045s 52141.503 167955.292
32 3 8 0.928s 87.400s 18.034s 71227.407 342934.242
32 3 16 1.067s 72.943s 11.035s 85007.293 553989.377
32 3 32 1.248s 133.753s 10.038s 46602.680 606062.151
32 3 64 5.557s 438.634s 13.093s 14163.802 451418.617
32 3 128 17.860s 1496.797s 19.048s 4153.714 322808.509
32 3 256 13.382s 766.063s 10.016s 8071.695 618816.838
32 3 512 17.067s 369.106s 5.041s 16291.764 1161285.521
These numbers are roughly equal to what can be accomplished with the
page fault scalability patches.
Kernel patches with both the page fault scalability patches and
prefaulting:
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
32 10 1 4.103s 456.384s 460.046s 45541.992 45544.369
32 10 2 4.005s 415.119s 221.095s 50036.407 94484.174
32 10 4 3.855s 371.317s 111.076s 55898.259 187635.724
32 10 8 3.902s 308.673s 67.094s 67092.476 308634.397
32 10 16 4.011s 224.213s 37.016s 91889.781 564241.062
32 10 32 5.483s 209.391s 27.046s 97598.647 763495.417
32 10 64 19.166s 219.925s 26.030s 87713.212 797286.395
32 10 128 53.482s 342.342s 27.024s 52981.744 769687.791
32 10 256 67.334s 180.321s 15.036s 84679.911 1364614.334
32 10 512 66.516s 93.098s 9.015s 131387.893 2291548.865
The fault rate doubles when both patches are applied.
And on the high end (512 processors allocating 256G). There are no numbers
for regular kernels because they are extremely slow, and no numbers
for a low number of threads, which is also very slow.
With prefaulting:
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
256 3 4 8.241s 1414.348s 449.016s 35380.301 112056.239
256 3 8 8.306s 1300.982s 247.025s 38441.977 203559.271
256 3 16 8.368s 1223.853s 154.089s 40846.272 324940.924
256 3 32 8.536s 1284.041s 110.097s 38938.970 453556.624
256 3 64 13.025s 3587.203s 110.010s 13980.123 457131.492
256 3 128 25.025s 11460.700s 145.071s 4382.104 345404.909
256 3 256 26.150s 6061.649s 75.086s 8267.625 663414.482
256 3 512 20.637s 3037.097s 38.062s 16460.435 1302993.019
Page fault scalability patch and prefaulting. Max prefault order
increased to 5 (max preallocation of 32 pages):
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
256 10 8 33.571s 4516.293s 863.021s 36874.099 194356.930
256 10 16 33.103s 3737.688s 461.028s 44492.553 363704.484
256 10 32 35.094s 3436.561s 321.080s 48326.262 521352.840
256 10 64 46.675s 2899.997s 245.020s 56936.124 684214.256
256 10 128 85.493s 2890.198s 203.008s 56380.890 826122.524
256 10 256 74.299s 1374.973s 99.088s 115762.963 1679630.272
256 10 512 62.760s 706.559s 53.027s 218078.311 3149273.714
We are approaching almost linear scalability at the high end with
both patches and end up with a fault rate of more than 3 million faults per second.
The one thing that still takes up a lot of time is the zeroing
of pages in the page fault handler. There is another
set of patches that I am working on which will prezero pages
and lead to another increase in performance by a factor of 2-4
(if prezeroed pages are available, which may not always be the case).
Maybe we can reach 10 million faults/sec that way.
Patch against 2.6.10-rc3-bk3:
Index: linux-2.6.9/include/linux/sched.h
===================================================================
--- linux-2.6.9.orig/include/linux/sched.h 2004-12-01 10:37:31.000000000 -0800
+++ linux-2.6.9/include/linux/sched.h 2004-12-01 10:38:15.000000000 -0800
@@ -537,6 +537,8 @@
#endif
struct list_head tasks;
+ unsigned long anon_fault_next_addr; /* Predicted sequential fault address */
+ int anon_fault_order; /* Last order of allocation on fault */
/*
* ptrace_list/ptrace_children forms the list of my children
* that were stolen by a ptracer.
Index: linux-2.6.9/mm/memory.c
===================================================================
--- linux-2.6.9.orig/mm/memory.c 2004-12-01 10:38:11.000000000 -0800
+++ linux-2.6.9/mm/memory.c 2004-12-01 10:45:01.000000000 -0800
@@ -55,6 +55,7 @@
#include <linux/swapops.h>
#include <linux/elf.h>
+#include <linux/pagevec.h>
#ifndef CONFIG_DISCONTIGMEM
/* use the per-pgdat data instead for discontigmem - mbligh */
@@ -1432,8 +1433,106 @@
unsigned long addr)
{
pte_t entry;
- struct page * page = ZERO_PAGE(addr);
+ struct page * page;
+
+ addr &= PAGE_MASK;
+
+ if (current->anon_fault_next_addr == addr) {
+ unsigned long end_addr;
+ int order = current->anon_fault_order;
+
+ /* Sequence of page faults detected. Perform preallocation of pages */
+ /* The order of preallocations increases with each successful prediction */
+ order++;
+
+ if ((1 << order) < PAGEVEC_SIZE)
+ end_addr = addr + (1 << (order + PAGE_SHIFT));
+ else
+ end_addr = addr + PAGEVEC_SIZE * PAGE_SIZE;
+
+ if (end_addr > vma->vm_end)
+ end_addr = vma->vm_end;
+ if ((addr & PMD_MASK) != (end_addr & PMD_MASK))
+ end_addr &= PMD_MASK;
+
+ current->anon_fault_next_addr = end_addr;
+ current->anon_fault_order = order;
+
+ if (write_access) {
+
+ struct pagevec pv;
+ unsigned long a;
+ struct page **p;
+
+ pte_unmap(page_table);
+ spin_unlock(&mm->page_table_lock);
+
+ pagevec_init(&pv, 0);
+
+ if (unlikely(anon_vma_prepare(vma)))
+ return VM_FAULT_OOM;
+
+ /* Allocate the necessary pages */
+ for(a = addr;a < end_addr ; a += PAGE_SIZE) {
+ struct page *p = alloc_page_vma(GFP_HIGHUSER, vma, a);
+
+ if (p) {
+ clear_user_highpage(p, a);
+ pagevec_add(&pv,p);
+ } else
+ break;
+ }
+ end_addr = a;
+
+ spin_lock(&mm->page_table_lock);
+
+ for(p = pv.pages; addr < end_addr; addr += PAGE_SIZE, p++) {
+
+ page_table = pte_offset_map(pmd, addr);
+ if (!pte_none(*page_table)) {
+ /* Someone else got there first */
+ page_cache_release(*p);
+ pte_unmap(page_table);
+ continue;
+ }
+
+ entry = maybe_mkwrite(pte_mkdirty(mk_pte(*p,
+ vma->vm_page_prot)),
+ vma);
+
+ mm->rss++;
+ lru_cache_add_active(*p);
+ mark_page_accessed(*p);
+ page_add_anon_rmap(*p, vma, addr);
+
+ set_pte(page_table, entry);
+ pte_unmap(page_table);
+
+ /* No need to invalidate - it was non-present before */
+ update_mmu_cache(vma, addr, entry);
+ }
+ } else {
+ /* Read */
+ for(;addr < end_addr; addr += PAGE_SIZE) {
+ page_table = pte_offset_map(pmd, addr);
+ entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
+ set_pte(page_table, entry);
+ pte_unmap(page_table);
+
+ /* No need to invalidate - it was non-present before */
+ update_mmu_cache(vma, addr, entry);
+
+ };
+ }
+ spin_unlock(&mm->page_table_lock);
+ return VM_FAULT_MINOR;
+ }
+
+ current->anon_fault_next_addr = addr + PAGE_SIZE;
+ current->anon_fault_order = 0;
+
+ page = ZERO_PAGE(addr);
/* Read-only mapping of ZERO_PAGE. */
entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
^ permalink raw reply [flat|nested] 23+ messages in thread
end of thread, other threads:[~2004-12-14 20:25 UTC | newest]
Thread overview: 23+ messages
2004-12-08 17:44 Anticipatory prefaulting in the page fault handler V1 Luck, Tony
2004-12-08 17:57 ` Christoph Lameter
-- strict thread matches above, loose matches on Subject: below --
2004-12-08 18:31 Luck, Tony
2004-11-22 15:00 page fault scalability patch V11 [1/7]: sloppy rss Hugh Dickins
2004-11-22 21:50 ` deferred rss update instead of " Christoph Lameter
2004-11-22 22:22 ` Linus Torvalds
2004-11-22 22:27 ` Christoph Lameter
2004-11-22 22:40 ` Linus Torvalds
2004-12-01 23:41 ` page fault scalability patch V12 [0/7]: Overview and performance tests Christoph Lameter
2004-12-02 0:10 ` Linus Torvalds
2004-12-02 6:21 ` Jeff Garzik
2004-12-02 6:34 ` Andrew Morton
2004-12-02 6:48 ` Jeff Garzik
2004-12-02 7:02 ` Andrew Morton
2004-12-02 7:26 ` Martin J. Bligh
2004-12-02 7:31 ` Jeff Garzik
2004-12-02 18:10 ` cliff white
2004-12-08 17:24 ` Anticipatory prefaulting in the page fault handler V1 Christoph Lameter
2004-12-08 17:33 ` Jesse Barnes
2004-12-08 17:56 ` Christoph Lameter
2004-12-08 18:33 ` Jesse Barnes
2004-12-08 21:26 ` David S. Miller
2004-12-08 21:42 ` Linus Torvalds
2004-12-08 17:55 ` Dave Hansen
2004-12-08 19:07 ` Martin J. Bligh
2004-12-08 22:50 ` Martin J. Bligh
2004-12-09 19:32 ` Christoph Lameter
2004-12-13 14:30 ` Akinobu Mita
2004-12-13 17:10 ` Christoph Lameter
2004-12-13 22:16 ` Martin J. Bligh
2004-12-14 12:24 ` Akinobu Mita
2004-12-14 15:25 ` Akinobu Mita
2004-12-14 20:25 ` Christoph Lameter
2004-12-09 10:57 ` Pavel Machek
2004-12-09 11:32 ` Nick Piggin
2004-12-09 17:05 ` Christoph Lameter
2004-12-14 15:28 ` Adam Litke