* RE: Anticipatory prefaulting in the page fault handler V1
@ 2004-12-08 17:44 Luck, Tony
2004-12-08 17:57 ` Christoph Lameter
0 siblings, 1 reply; 23+ messages in thread
From: Luck, Tony @ 2004-12-08 17:44 UTC (permalink / raw)
To: Christoph Lameter, nickpiggin
Cc: Jeff Garzik, torvalds, hugh, benh, linux-mm, linux-ia64, linux-kernel
>If a fault occurred for page x and is then followed by page
>x+1 then it may be reasonable to expect another page fault
>at x+2 in the future.
What if the application had used "madvise(start, len, MADV_RANDOM)"
to tell the kernel that this isn't "reasonable"?
-Tony
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: aart@kvack.org
* RE: Anticipatory prefaulting in the page fault handler V1
2004-12-08 17:44 Anticipatory prefaulting in the page fault handler V1 Luck, Tony
@ 2004-12-08 17:57 ` Christoph Lameter
0 siblings, 0 replies; 23+ messages in thread
From: Christoph Lameter @ 2004-12-08 17:57 UTC (permalink / raw)
To: Luck, Tony
Cc: nickpiggin, Jeff Garzik, torvalds, hugh, benh, linux-mm,
linux-ia64, linux-kernel
On Wed, 8 Dec 2004, Luck, Tony wrote:
> >If a fault occurred for page x and is then followed by page
> >x+1 then it may be reasonable to expect another page fault
> >at x+2 in the future.
>
> What if the application had used "madvise(start, len, MADV_RANDOM)"
> to tell the kernel that this isn't "reasonable"?
We could use that as a way to switch off the preallocation. How expensive
is that check?
* RE: Anticipatory prefaulting in the page fault handler V1
@ 2004-12-08 18:31 Luck, Tony
0 siblings, 0 replies; 23+ messages in thread
From: Luck, Tony @ 2004-12-08 18:31 UTC (permalink / raw)
To: Christoph Lameter
Cc: nickpiggin, Jeff Garzik, torvalds, hugh, benh, linux-mm,
linux-ia64, linux-kernel
>We could use that as a way to switch of the preallocation. How
>expensive is that check?
If you already looked up the vma, then it is very cheap. Just
check for VM_RAND_READ in vma->vm_flags.
-Tony
* Re: page fault scalability patch V11 [1/7]: sloppy rss
@ 2004-11-22 15:00 Hugh Dickins
2004-11-22 21:50 ` deferred rss update instead of " Christoph Lameter
0 siblings, 1 reply; 23+ messages in thread
From: Hugh Dickins @ 2004-11-22 15:00 UTC (permalink / raw)
To: Christoph Lameter
Cc: torvalds, akpm, Benjamin Herrenschmidt, Nick Piggin, linux-mm,
linux-ia64, linux-kernel
On Fri, 19 Nov 2004, Christoph Lameter wrote:
> On Fri, 19 Nov 2004, Hugh Dickins wrote:
>
> > Sorry, against what tree do these patches apply?
> > Apparently not linux-2.6.9, nor latest -bk, nor -mm?
>
> 2.6.10-rc2-bk3
Ah, thanks - got it patched now, but your mailer (or something else)
is eating trailing spaces. Better than adding them, but we have to
apply this patch before your set:
--- 2.6.10-rc2-bk3/include/asm-i386/system.h 2004-11-15 16:21:12.000000000 +0000
+++ linux/include/asm-i386/system.h 2004-11-22 14:44:30.761904592 +0000
@@ -273,9 +273,9 @@ static inline unsigned long __cmpxchg(vo
#define cmpxchg(ptr,o,n)\
((__typeof__(*(ptr)))__cmpxchg((ptr),(unsigned long)(o),\
(unsigned long)(n),sizeof(*(ptr))))
-
+
#ifdef __KERNEL__
-struct alt_instr {
+struct alt_instr {
__u8 *instr; /* original instruction */
__u8 *replacement;
__u8 cpuid; /* cpuid bit set for replacement */
--- 2.6.10-rc2-bk3/include/asm-s390/pgalloc.h 2004-05-10 03:33:39.000000000 +0100
+++ linux/include/asm-s390/pgalloc.h 2004-11-22 14:54:43.704723120 +0000
@@ -99,7 +99,7 @@ static inline void pgd_populate(struct m
#endif /* __s390x__ */
-static inline void
+static inline void
pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmd, pte_t *pte)
{
#ifndef __s390x__
--- 2.6.10-rc2-bk3/mm/memory.c 2004-11-18 17:56:11.000000000 +0000
+++ linux/mm/memory.c 2004-11-22 14:39:33.924030808 +0000
@@ -1424,7 +1424,7 @@ out:
/*
* We are called with the MM semaphore and page_table_lock
* spinlock held to protect against concurrent faults in
- * multithreaded programs.
+ * multithreaded programs.
*/
static int
do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
@@ -1615,7 +1615,7 @@ static int do_file_page(struct mm_struct
* Fall back to the linear mapping if the fs does not support
* ->populate:
*/
- if (!vma->vm_ops || !vma->vm_ops->populate ||
+ if (!vma->vm_ops || !vma->vm_ops->populate ||
(write_access && !(vma->vm_flags & VM_SHARED))) {
pte_clear(pte);
return do_no_page(mm, vma, address, write_access, pte, pmd);
* deferred rss update instead of sloppy rss
2004-11-22 15:00 page fault scalability patch V11 [1/7]: sloppy rss Hugh Dickins
@ 2004-11-22 21:50 ` Christoph Lameter
2004-11-22 22:22 ` Linus Torvalds
0 siblings, 1 reply; 23+ messages in thread
From: Christoph Lameter @ 2004-11-22 21:50 UTC (permalink / raw)
To: Hugh Dickins
Cc: torvalds, akpm, Benjamin Herrenschmidt, Nick Piggin, linux-mm,
linux-ia64, linux-kernel
One way to solve the rss issues is--as discussed--to put rss into the
task structure and then have the page fault handler increment that rss.
The problem is then that the proc filesystem must do an extensive scan
over all threads to find users of a certain mm_struct.
The following patch does put the rss into task_struct. The page fault
handler then increments current->rss if the page_table_lock is not
held.
The timer interrupt checks whether task->rss is nonzero (during the
stime/utime updates; rss is defined near those fields so it is hopefully
on the same cacheline and has a minimal impact).
If rss is nonzero and both the page_table_lock and the mmap_sem can be
taken, then task->rss is added to mm->rss and task->rss is zeroed.
This avoids all the proc issues. The only disadvantage is that rss may
be inaccurate for a couple of clock ticks.
This also improves performance (sorry, only a 4p system):
sloppy rss:
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
4 10 1 0.593s 29.897s 30.050s 85973.585 85948.565
4 10 2 0.616s 42.184s 22.045s 61247.450 116719.558
4 10 4 0.559s 44.918s 14.076s 57641.255 177553.945
deferred rss:
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
4 10 1 0.565s 29.429s 30.000s 87395.518 87366.580
4 10 2 0.500s 33.514s 18.002s 77067.935 145426.659
4 10 4 0.533s 44.455s 14.085s 58269.368 176413.196
Index: linux-2.6.9/include/linux/sched.h
===================================================================
--- linux-2.6.9.orig/include/linux/sched.h 2004-11-15 11:13:39.000000000 -0800
+++ linux-2.6.9/include/linux/sched.h 2004-11-22 13:18:58.000000000 -0800
@@ -584,6 +584,10 @@
unsigned long it_real_incr, it_prof_incr, it_virt_incr;
struct timer_list real_timer;
unsigned long utime, stime;
+ long rss; /* rss counter when mm->rss is not usable. mm->page_table_lock
+ * not held but mm->mmap_sem must be held for sync with
+ * the timer interrupt which clears rss and updates mm->rss.
+ */
unsigned long nvcsw, nivcsw; /* context switch counts */
struct timespec start_time;
/* mm fault and swap info: this can arguably be seen as either mm-specific or thread-specific */
Index: linux-2.6.9/mm/rmap.c
===================================================================
--- linux-2.6.9.orig/mm/rmap.c 2004-11-22 09:51:58.000000000 -0800
+++ linux-2.6.9/mm/rmap.c 2004-11-22 11:16:02.000000000 -0800
@@ -263,8 +263,6 @@
pte_t *pte;
int referenced = 0;
- if (!mm->rss)
- goto out;
address = vma_address(page, vma);
if (address == -EFAULT)
goto out;
@@ -507,8 +505,6 @@
pte_t pteval;
int ret = SWAP_AGAIN;
- if (!mm->rss)
- goto out;
address = vma_address(page, vma);
if (address == -EFAULT)
goto out;
@@ -791,8 +787,7 @@
if (vma->vm_flags & (VM_LOCKED|VM_RESERVED))
continue;
cursor = (unsigned long) vma->vm_private_data;
- while (vma->vm_mm->rss &&
- cursor < max_nl_cursor &&
+ while (cursor < max_nl_cursor &&
cursor < vma->vm_end - vma->vm_start) {
try_to_unmap_cluster(cursor, &mapcount, vma);
cursor += CLUSTER_SIZE;
Index: linux-2.6.9/kernel/fork.c
===================================================================
--- linux-2.6.9.orig/kernel/fork.c 2004-11-22 09:51:58.000000000 -0800
+++ linux-2.6.9/kernel/fork.c 2004-11-22 11:16:02.000000000 -0800
@@ -876,6 +876,7 @@
p->io_context = NULL;
p->io_wait = NULL;
p->audit_context = NULL;
+ p->rss = 0;
#ifdef CONFIG_NUMA
p->mempolicy = mpol_copy(p->mempolicy);
if (IS_ERR(p->mempolicy)) {
Index: linux-2.6.9/kernel/exit.c
===================================================================
--- linux-2.6.9.orig/kernel/exit.c 2004-11-22 09:51:58.000000000 -0800
+++ linux-2.6.9/kernel/exit.c 2004-11-22 11:16:02.000000000 -0800
@@ -501,6 +501,9 @@
/* more a memory barrier than a real lock */
task_lock(tsk);
tsk->mm = NULL;
+ /* only holding mmap_sem here maybe get page_table_lock too? */
+ mm->rss += tsk->rss;
+ tsk->rss = 0;
up_read(&mm->mmap_sem);
enter_lazy_tlb(mm, current);
task_unlock(tsk);
Index: linux-2.6.9/kernel/timer.c
===================================================================
--- linux-2.6.9.orig/kernel/timer.c 2004-11-22 09:51:58.000000000 -0800
+++ linux-2.6.9/kernel/timer.c 2004-11-22 11:42:12.000000000 -0800
@@ -815,6 +815,15 @@
if (psecs / HZ >= p->signal->rlim[RLIMIT_CPU].rlim_max)
send_sig(SIGKILL, p, 1);
}
+ /* Update mm->rss if necessary */
+ if (p->rss && p->mm && down_write_trylock(&p->mm->mmap_sem)) {
+ if (spin_trylock(&p->mm->page_table_lock)) {
+ p->mm->rss += p->rss;
+ p->rss = 0;
+ spin_unlock(&p->mm->page_table_lock);
+ }
+ up_write(&p->mm->mmap_sem);
+ }
}
static inline void do_it_virt(struct task_struct * p, unsigned long ticks)
* Re: deferred rss update instead of sloppy rss
2004-11-22 21:50 ` deferred rss update instead of " Christoph Lameter
@ 2004-11-22 22:22 ` Linus Torvalds
2004-11-22 22:27 ` Christoph Lameter
0 siblings, 1 reply; 23+ messages in thread
From: Linus Torvalds @ 2004-11-22 22:22 UTC (permalink / raw)
To: Christoph Lameter
Cc: Hugh Dickins, akpm, Benjamin Herrenschmidt, Nick Piggin,
linux-mm, linux-ia64, linux-kernel
On Mon, 22 Nov 2004, Christoph Lameter wrote:
>
> The problem is then that the proc filesystem must do an extensive scan
> over all threads to find users of a certain mm_struct.
The alternative is to just add a simple list into the task_struct and the
head of it into mm_struct. Then, at fork, you just finish the fork() with
list_add(p->mm_list, p->mm->thread_list);
and do the proper list_del() in exit_mm() or wherever.
You'll still loop in /proc, but you'll do the minimal loop necessary.
Linus
* Re: deferred rss update instead of sloppy rss
2004-11-22 22:22 ` Linus Torvalds
@ 2004-11-22 22:27 ` Christoph Lameter
2004-11-22 22:40 ` Linus Torvalds
0 siblings, 1 reply; 23+ messages in thread
From: Christoph Lameter @ 2004-11-22 22:27 UTC (permalink / raw)
To: Linus Torvalds
Cc: Hugh Dickins, akpm, Benjamin Herrenschmidt, Nick Piggin,
linux-mm, linux-ia64, linux-kernel
On Mon, 22 Nov 2004, Linus Torvalds wrote:
> The alternative is to just add a simple list into the task_struct and the
> head of it into mm_struct. Then, at fork, you just finish the fork() with
>
> list_add(p->mm_list, p->mm->thread_list);
>
> and do the proper list_del() in exit_mm() or wherever.
>
> You'll still loop in /proc, but you'll do the minimal loop necessary.
I think the approach that I posted is simpler, unless there are other
benefits to be gained from being able to easily figure out which tasks
use an mm.
* Re: deferred rss update instead of sloppy rss
2004-11-22 22:27 ` Christoph Lameter
@ 2004-11-22 22:40 ` Linus Torvalds
2004-12-01 23:41 ` page fault scalability patch V12 [0/7]: Overview and performance tests Christoph Lameter
0 siblings, 1 reply; 23+ messages in thread
From: Linus Torvalds @ 2004-11-22 22:40 UTC (permalink / raw)
To: Christoph Lameter
Cc: Hugh Dickins, akpm, Benjamin Herrenschmidt, Nick Piggin,
linux-mm, linux-ia64, linux-kernel
On Mon, 22 Nov 2004, Christoph Lameter wrote:
>
> I think the approach that I posted is simpler unless there are other
> benefits to be gained if it would be easy to figure out which tasks use an
> mm.
I'm just worried that your timer tick thing won't catch things in a timely
manner. That said, if that isn't an issue and people don't have problems
with it, fine. And if /proc literally is the only real user, then
I guess it really can't matter.
Linus
* page fault scalability patch V12 [0/7]: Overview and performance tests
2004-11-22 22:40 ` Linus Torvalds
@ 2004-12-01 23:41 ` Christoph Lameter
2004-12-02 0:10 ` Linus Torvalds
0 siblings, 1 reply; 23+ messages in thread
From: Christoph Lameter @ 2004-12-01 23:41 UTC (permalink / raw)
To: Linus Torvalds
Cc: Hugh Dickins, akpm, Benjamin Herrenschmidt, Nick Piggin,
linux-mm, linux-ia64, linux-kernel
Changes from V11->V12 of this patch:
- dump sloppy_rss in favor of list_rss (Linus' proposal)
- keep up against current Linus tree (patch is based on 2.6.10-rc2-bk14)
This is a series of patches that increases the scalability of
the page fault handler for SMP. Here are some performance results
on a machine with 512 processors allocating 32 GB with an increasing
number of threads (that are assigned a processor each).
Without the patches:
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
32 3 1 1.416s 138.165s 139.050s 45073.831 45097.498
32 3 2 1.397s 148.523s 78.044s 41965.149 80201.646
32 3 4 1.390s 152.618s 44.044s 40851.258 141545.239
32 3 8 1.500s 374.008s 53.001s 16754.519 118671.950
32 3 16 1.415s 1051.759s 73.094s 5973.803 85087.358
32 3 32 1.867s 3400.417s 117.003s 1849.186 53754.928
32 3 64 5.361s 11633.040s 197.034s 540.577 31881.112
32 3 128 23.387s 39386.390s 332.055s 159.642 18918.599
32 3 256 15.409s 20031.450s 168.095s 313.837 37237.918
32 3 512 18.720s 10338.511s 86.047s 607.446 72752.686
With the patches:
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
32 3 1 1.451s 140.151s 141.060s 44430.367 44428.115
32 3 2 1.399s 136.349s 73.041s 45673.303 85699.793
32 3 4 1.321s 129.760s 39.027s 47996.303 160197.217
32 3 8 1.279s 100.648s 20.039s 61724.641 308454.557
32 3 16 1.414s 153.975s 15.090s 40488.236 395681.716
32 3 32 2.534s 337.021s 17.016s 18528.487 366445.400
32 3 64 4.271s 709.872s 18.057s 8809.787 338656.440
32 3 128 18.734s 1805.094s 21.084s 3449.586 288005.644
32 3 256 14.698s 963.787s 11.078s 6429.787 534077.540
32 3 512 15.299s 453.990s 5.098s 13406.321 1050416.414
For more than 8 cpus the page fault rate increases by orders
of magnitude. For more than 64 cpus the performance improvement
is more than tenfold.
The performance increase is accomplished by avoiding the use of the
page_table_lock spinlock (but not mm->mmap_sem!) through new atomic
operations on pte's (ptep_xchg, ptep_cmpxchg) and on pmd and pgd's
(pgd_test_and_populate, pmd_test_and_populate).
The page table lock can be avoided in the following situations:
1. An empty pte or pmd entry is populated
This is safe since the swapper may only depopulate them and the
swapper code has been changed to never set a pte to be empty until the
page has been evicted. The population of an empty pte is frequent
if a process touches newly allocated memory.
2. Modifications of flags in a pte entry (write/accessed).
These modifications are done by the CPU or by low level handlers
on various platforms also bypassing the page_table_lock. So this
seems to be safe too.
One essential change in the VM is the use of pte_cmpxchg (or its
generic emulation) on page table entries before doing an
update_mmu_cache without holding the page table lock. However, we do
similar things now with other atomic pte operations such as
ptep_get_and_clear and ptep_test_and_clear_dirty. These operations
clear a pte *after* doing an operation on it. The ptep_cmpxchg as used
in this patch operates on a *cleared* pte and replaces it with a pte
pointing to valid memory. The effect of this change on various
architectures has to be thought through. Local definitions of
ptep_cmpxchg and ptep_xchg may be necessary.
For IA64 an icache coherency issue may arise that potentially requires
the flushing of the icache (as done via update_mmu_cache on IA64) prior
to the use of ptep_cmpxchg. Similar issues may arise on other platforms.
The patch introduces a split counter for rss handling to avoid the
atomic operations and locks currently necessary for rss modifications.
In addition to mm->rss, tsk->rss is introduced. tsk->rss is defined to
be in the same cache line as tsk->mm (which is already used by the
fault handler), and thus tsk->rss can be incremented quickly without
locks. The cache line does not need to be shared between
processors in the page table handler.
A tasklist is generated for each mm (rcu based). Values in that list
are added up to calculate rss or anon_rss values.
The patchset is composed of 7 patches:
1/7: Avoid page_table_lock in handle_mm_fault
This patch defers the acquisition of the page_table_lock as much as
possible and uses atomic operations for allocating anonymous memory.
These atomic operations are simulated by acquiring the page_table_lock
for very small time frames if an architecture does not define
__HAVE_ARCH_ATOMIC_TABLE_OPS. It also changes the swapper so that a
pte will not be set to empty if a page is in transition to swap.
If only the first two patches are applied then the time that the
page_table_lock is held is simply reduced. The lock may then be
acquired multiple times during a page fault.
2/7: Atomic pte operations for ia64
3/7: Make cmpxchg generally available on i386
The atomic operations on the page table rely heavily on cmpxchg
instructions. This patch adds emulations for cmpxchg and cmpxchg8b
for old 80386 and 80486 cpus. The emulations are only included if a
kernel is built for these old cpus, and are skipped in favor of the
real cmpxchg instructions if a kernel built for a 386 or 486 is
then run on a more recent cpu.
This patch may be used independently of the other patches.
4/7: Atomic pte operations for i386
A generally available cmpxchg (last patch) must be available for
this patch to preserve the ability to build kernels for 386 and 486.
5/7: Atomic pte operation for x86_64
6/7: Atomic pte operations for s390
7/7: Split counter implementation for rss
Add tsk->rss and tsk->anon_rss. Add tasklist. Add logic
to calculate rss from tasklist.
There are some additional outstanding performance enhancements that are
not available yet but which require this patch. Those modifications
push the maximum page fault rate from roughly 1 million faults per
second, as shown above, to over 3 million faults per second.
The last editions of the sloppy rss, atomic rss and deferred rss patches
will be posted to linux-ia64 for archival purposes.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
2004-12-01 23:41 ` page fault scalability patch V12 [0/7]: Overview and performance tests Christoph Lameter
@ 2004-12-02 0:10 ` Linus Torvalds
2004-12-02 6:21 ` Jeff Garzik
0 siblings, 1 reply; 23+ messages in thread
From: Linus Torvalds @ 2004-12-02 0:10 UTC (permalink / raw)
To: Christoph Lameter
Cc: Hugh Dickins, akpm, Benjamin Herrenschmidt, Nick Piggin,
linux-mm, linux-ia64, linux-kernel
On Wed, 1 Dec 2004, Christoph Lameter wrote:
>
> Changes from V11->V12 of this patch:
> - dump sloppy_rss in favor of list_rss (Linus' proposal)
> - keep up against current Linus tree (patch is based on 2.6.10-rc2-bk14)
>
> This is a series of patches that increases the scalability of
> the page fault handler for SMP. Here are some performance results
> on a machine with 512 processors allocating 32 GB with an increasing
> number of threads (that are assigned a processor each).
Ok, consider me convinced. I don't want to apply this before I get 2.6.10
out the door, but I'm happy with it. I assume Andrew has already picked up
the previous version.
Linus
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
2004-12-02 0:10 ` Linus Torvalds
@ 2004-12-02 6:21 ` Jeff Garzik
2004-12-02 6:34 ` Andrew Morton
0 siblings, 1 reply; 23+ messages in thread
From: Jeff Garzik @ 2004-12-02 6:21 UTC (permalink / raw)
To: Linus Torvalds
Cc: Christoph Lameter, Hugh Dickins, akpm, Benjamin Herrenschmidt,
Nick Piggin, linux-mm, linux-ia64, linux-kernel
Linus Torvalds wrote:
> Ok, consider me convinced. I don't want to apply this before I get 2.6.10
> out the door, but I'm happy with it. I assume Andrew has already picked up
> the previous version.
Does that mean that 2.6.10 is actually close to the door?
/me runs...
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
2004-12-02 6:21 ` Jeff Garzik
@ 2004-12-02 6:34 ` Andrew Morton
2004-12-02 6:48 ` Jeff Garzik
0 siblings, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2004-12-02 6:34 UTC (permalink / raw)
To: Jeff Garzik
Cc: torvalds, clameter, hugh, benh, nickpiggin, linux-mm, linux-ia64,
linux-kernel
Jeff Garzik <jgarzik@pobox.com> wrote:
>
> Linus Torvalds wrote:
> > Ok, consider me convinced. I don't want to apply this before I get 2.6.10
> > out the door, but I'm happy with it. I assume Andrew has already picked up
> > the previous version.
>
>
> Does that mean that 2.6.10 is actually close to the door?
>
We need an -rc3 yet. And I need to do another pass through the
regressions-since-2.6.9 list. We've made pretty good progress there
recently. Mid to late December is looking like the 2.6.10 date.
We need to be achieving higher-quality major releases than we did in
2.6.8 and 2.6.9. Really the only tool we have to ensure this is longer
stabilisation periods.
Of course, nobody will test -rc3 and a zillion people will test final
2.6.10, which is when we get lots of useful bug reports. If this keeps on
happening then we'll need to get more serious about the 2.6.10.n process.
Or start alternating between stable and flakey releases, so 2.6.11 will be
a feature release with a 2-month development period and 2.6.12 will be a
bugfix-only release, with perhaps a 2-week development period, so people
know that the even-numbered releases are better stabilised.
We'll see. It all depends on how many bugs you can fix in the next two
weeks ;)
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
2004-12-02 6:34 ` Andrew Morton
@ 2004-12-02 6:48 ` Jeff Garzik
2004-12-02 7:02 ` Andrew Morton
0 siblings, 1 reply; 23+ messages in thread
From: Jeff Garzik @ 2004-12-02 6:48 UTC (permalink / raw)
To: Andrew Morton
Cc: torvalds, clameter, hugh, benh, nickpiggin, linux-mm, linux-ia64,
linux-kernel
Andrew Morton wrote:
> We need to be achieving higher-quality major releases than we did in
> 2.6.8 and 2.6.9. Really the only tool we have to ensure this is longer
> stabilisation periods.
I'm still hoping that distros (like my employer) and orgs like OSDL will
step up, and hook 2.6.x BK snapshots into daily test harnesses.
Something like John Cherry's reports to lkml on warnings and errors
would be darned useful. His reports are IMO an ideal model: show
day-to-day _changes_ in test results. Don't just dump a huge list of
testsuite results, results which are often clogged with expected
failures and testsuite bug noise.
Jeff
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
2004-12-02 6:48 ` Jeff Garzik
@ 2004-12-02 7:02 ` Andrew Morton
2004-12-02 7:26 ` Martin J. Bligh
0 siblings, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2004-12-02 7:02 UTC (permalink / raw)
To: Jeff Garzik
Cc: torvalds, clameter, hugh, benh, nickpiggin, linux-mm, linux-ia64,
linux-kernel
Jeff Garzik <jgarzik@pobox.com> wrote:
>
> Andrew Morton wrote:
> > We need to be be achieving higher-quality major releases than we did in
> > 2.6.8 and 2.6.9. Really the only tool we have to ensure this is longer
> > stabilisation periods.
>
>
> I'm still hoping that distros (like my employer) and orgs like OSDL will
> step up, and hook 2.6.x BK snapshots into daily test harnesses.
I believe that both IBM and OSDL are doing this, or are getting geared up
to do this. With both Linus bk and -mm.
However I have my doubts about how useful it will end up being. These test
suites don't seem to pick up many regressions. I've challenged Gerrit to
go back through a release cycle's bugfixes and work out how many of those
bugs would have been detected by the test suite.
My suspicion is that the answer will be "a very small proportion", and that
really is the bottom line.
We simply get far better coverage testing by releasing code, because of all
the wild, whacky and weird things which people do with their computers.
Bless them.
> Something like John Cherry's reports to lkml on warnings and errors
> would be darned useful. His reports are IMO an ideal model: show
> day-to-day _changes_ in test results. Don't just dump a huge list of
> testsuite results, results which are often clogged with expected
> failures and testsuite bug noise.
>
Yes, we need humans between the tests and the developers. Someone who has
good experience with the tests and who can say "hey, something changed
when I do X". If nothing changed, we don't hear anything.
It's a developer role, not a testing role. All testing is, really.
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
2004-12-02 7:02 ` Andrew Morton
@ 2004-12-02 7:26 ` Martin J. Bligh
2004-12-02 7:31 ` Jeff Garzik
0 siblings, 1 reply; 23+ messages in thread
From: Martin J. Bligh @ 2004-12-02 7:26 UTC (permalink / raw)
To: Andrew Morton, Jeff Garzik
Cc: torvalds, clameter, hugh, benh, nickpiggin, linux-mm, linux-ia64,
linux-kernel
--Andrew Morton <akpm@osdl.org> wrote (on Wednesday, December 01, 2004 23:02:17 -0800):
> Jeff Garzik <jgarzik@pobox.com> wrote:
>>
>> Andrew Morton wrote:
>> > We need to be achieving higher-quality major releases than we did in
>> > 2.6.8 and 2.6.9. Really the only tool we have to ensure this is longer
>> > stabilisation periods.
>>
>>
>> I'm still hoping that distros (like my employer) and orgs like OSDL will
>> step up, and hook 2.6.x BK snapshots into daily test harnesses.
>
> I believe that both IBM and OSDL are doing this, or are getting geared up
> to do this. With both Linus bk and -mm.
I already run a bunch of tests on a variety of machines for every new
kernel ... but don't have an automated way to compare the results as yet,
so don't actually look at them much ;-(. Sometime soon (quite possibly over
Christmas) things will calm down enough I'll get a couple of days to write
the appropriate perl script, and start publishing stuff.
> However I have my doubts about how useful it will end up being. These test
> suites don't seem to pick up many regressions. I've challenged Gerrit to
> go back through a release cycle's bugfixes and work out how many of those
> bugs would have been detected by the test suite.
>
> My suspicion is that the answer will be "a very small proportion", and that
> really is the bottom line.
Yeah, probably. Though the stress tests catch a lot more than the
functionality ones. The big pain in the ass is drivers, because I don't
have a hope in hell of testing more than 1% of them.
M.
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
2004-12-02 7:26 ` Martin J. Bligh
@ 2004-12-02 7:31 ` Jeff Garzik
2004-12-02 18:10 ` cliff white
0 siblings, 1 reply; 23+ messages in thread
From: Jeff Garzik @ 2004-12-02 7:31 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Andrew Morton, torvalds, clameter, hugh, benh, nickpiggin,
linux-mm, linux-ia64, linux-kernel
Martin J. Bligh wrote:
> Yeah, probably. Though the stress tests catch a lot more than the
> functionality ones. The big pain in the ass is drivers, because I don't
> have a hope in hell of testing more than 1% of them.
My dream is that hardware vendors rotate their current machines through
a test shop :) It would be nice to make sure that the popular drivers
get daily test coverage.
Jeff, dreaming on
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
2004-12-02 7:31 ` Jeff Garzik
@ 2004-12-02 18:10 ` cliff white
2004-12-08 17:24 ` Anticipatory prefaulting in the page fault handler V1 Christoph Lameter
0 siblings, 1 reply; 23+ messages in thread
From: cliff white @ 2004-12-02 18:10 UTC (permalink / raw)
To: Jeff Garzik
Cc: mbligh, akpm, torvalds, clameter, hugh, benh, nickpiggin,
linux-mm, linux-ia64, linux-kernel
On Thu, 02 Dec 2004 02:31:35 -0500
Jeff Garzik <jgarzik@pobox.com> wrote:
> Martin J. Bligh wrote:
> > Yeah, probably. Though the stress tests catch a lot more than the
> > functionality ones. The big pain in the ass is drivers, because I don't
> > have a hope in hell of testing more than 1% of them.
>
> My dream is that hardware vendors rotate their current machines through
> a test shop :) It would be nice to make sure that the popular drivers
> get daily test coverage.
>
> Jeff, dreaming on
OSDL has recently re-done the donation policy, and we're much better positioned
to support that sort of thing now - Contact Tom Hanrahan at OSDL if you
are a vendor, or know a vendor. ( Or you can become a vendor )
cliffw
--
The church is near, but the road is icy.
The bar is far, but i will walk carefully. - Russian proverb
* Anticipatory prefaulting in the page fault handler V1
2004-12-02 18:10 ` cliff white
@ 2004-12-08 17:24 ` Christoph Lameter
2004-12-08 17:33 ` Jesse Barnes
` (5 more replies)
0 siblings, 6 replies; 23+ messages in thread
From: Christoph Lameter @ 2004-12-08 17:24 UTC (permalink / raw)
To: nickpiggin
Cc: Jeff Garzik, torvalds, hugh, benh, linux-mm, linux-ia64, linux-kernel
The page fault handler for anonymous pages can generate significant overhead
apart from its essential function, which is to clear and set up a new page
table entry for a never-accessed memory location. This overhead increases
significantly in an SMP environment.
In the page table scalability patches, we addressed the issue by changing
the locking scheme so that multiple fault handlers can be processed
concurrently on multiple CPUs. This patch instead attempts to aggregate
multiple page faults into a single one. It does that by noticing
anonymous page faults generated in sequence by an application.
If a fault occurs for page x and is then followed by a fault for page x+1,
it may be reasonable to expect another page fault at x+2 in the future. If
the page table entries for x+1 and x+2 are prepared while handling the fault
for x+1, the overhead of taking a fault for x+2 is avoided. However,
page x+2 may never be used, and thus we may have increased the rss
of the application unnecessarily. The swapper will take care of removing
that page if memory should get tight.
The following patch makes the anonymous fault handler anticipate future
faults. For each fault, a prediction is made of where the next fault will
occur (assuming linear access by the application). If the prediction turns
out to be right (the next fault is where expected), then a number of pages
is preallocated in order to avoid a series of future faults. The size of the
preallocation doubles with each successful prediction in sequence:
the first successful prediction leads to one additional page being allocated,
the second to 2 additional pages, the third to 4 pages, and so on. The max
order is 3 by default. In a large continuous allocation the number of faults
is reduced by a factor of 8.
The patch may be combined with the page fault scalability patch (another
edition of this patch will be needed, forthcoming after the
page fault scalability patch has been included). The combined patches
triple the possible page fault rate from ~1 million faults/sec to ~3 million
faults/sec.
Standard kernel on a 512-CPU machine allocating 32GB with an increasing
number of threads (and thus increasing parallelism of page faults):
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
32 3 1 1.416s 138.165s 139.050s 45073.831 45097.498
32 3 2 1.397s 148.523s 78.044s 41965.149 80201.646
32 3 4 1.390s 152.618s 44.044s 40851.258 141545.239
32 3 8 1.500s 374.008s 53.001s 16754.519 118671.950
32 3 16 1.415s 1051.759s 73.094s 5973.803 85087.358
32 3 32 1.867s 3400.417s 117.003s 1849.186 53754.928
32 3 64 5.361s 11633.040s 197.034s 540.577 31881.112
32 3 128 23.387s 39386.390s 332.055s 159.642 18918.599
32 3 256 15.409s 20031.450s 168.095s 313.837 37237.918
32 3 512 18.720s 10338.511s 86.047s 607.446 72752.686
Patched kernel:
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
32 3 1 1.098s 138.544s 139.063s 45053.657 45057.920
32 3 2 1.022s 127.770s 67.086s 48849.350 92707.085
32 3 4 0.995s 119.666s 37.045s 52141.503 167955.292
32 3 8 0.928s 87.400s 18.034s 71227.407 342934.242
32 3 16 1.067s 72.943s 11.035s 85007.293 553989.377
32 3 32 1.248s 133.753s 10.038s 46602.680 606062.151
32 3 64 5.557s 438.634s 13.093s 14163.802 451418.617
32 3 128 17.860s 1496.797s 19.048s 4153.714 322808.509
32 3 256 13.382s 766.063s 10.016s 8071.695 618816.838
32 3 512 17.067s 369.106s 5.041s 16291.764 1161285.521
These numbers are roughly equal to what can be accomplished with the
page fault scalability patches.
Kernel with both the page fault scalability patches and prefaulting
applied:
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
32 10 1 4.103s 456.384s 460.046s 45541.992 45544.369
32 10 2 4.005s 415.119s 221.095s 50036.407 94484.174
32 10 4 3.855s 371.317s 111.076s 55898.259 187635.724
32 10 8 3.902s 308.673s 67.094s 67092.476 308634.397
32 10 16 4.011s 224.213s 37.016s 91889.781 564241.062
32 10 32 5.483s 209.391s 27.046s 97598.647 763495.417
32 10 64 19.166s 219.925s 26.030s 87713.212 797286.395
32 10 128 53.482s 342.342s 27.024s 52981.744 769687.791
32 10 256 67.334s 180.321s 15.036s 84679.911 1364614.334
32 10 512 66.516s 93.098s 9.015s 131387.893 2291548.865
The fault rate doubles when both patches are applied.
And on the high end (512 processors allocating 256GB). No numbers for the
regular kernel, which is extremely slow here, and none for low thread
counts, which are also very slow.
With prefaulting:
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
256 3 4 8.241s 1414.348s 449.016s 35380.301 112056.239
256 3 8 8.306s 1300.982s 247.025s 38441.977 203559.271
256 3 16 8.368s 1223.853s 154.089s 40846.272 324940.924
256 3 32 8.536s 1284.041s 110.097s 38938.970 453556.624
256 3 64 13.025s 3587.203s 110.010s 13980.123 457131.492
256 3 128 25.025s 11460.700s 145.071s 4382.104 345404.909
256 3 256 26.150s 6061.649s 75.086s 8267.625 663414.482
256 3 512 20.637s 3037.097s 38.062s 16460.435 1302993.019
Page fault scalability patch and prefaulting. Max prefault order
increased to 5 (max preallocation of 32 pages):
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
256 10 8 33.571s 4516.293s 863.021s 36874.099 194356.930
256 10 16 33.103s 3737.688s 461.028s 44492.553 363704.484
256 10 32 35.094s 3436.561s 321.080s 48326.262 521352.840
256 10 64 46.675s 2899.997s 245.020s 56936.124 684214.256
256 10 128 85.493s 2890.198s 203.008s 56380.890 826122.524
256 10 256 74.299s 1374.973s 99.088s 115762.963 1679630.272
256 10 512 62.760s 706.559s 53.027s 218078.311 3149273.714
We approach almost linear scalability at the high end with
both patches and end up with a fault rate > 3 million faults per second.
The one thing that still takes up a lot of time is the zeroing
of pages in the page fault handler. There is another
set of patches that I am working on which will prezero pages
and lead to a further increase in performance by a factor of 2-4
(if prezeroed pages are available, which may not always be the case).
Maybe we can reach 10 million faults/sec that way.
Patch against 2.6.10-rc3-bk3:
Index: linux-2.6.9/include/linux/sched.h
===================================================================
--- linux-2.6.9.orig/include/linux/sched.h 2004-12-01 10:37:31.000000000 -0800
+++ linux-2.6.9/include/linux/sched.h 2004-12-01 10:38:15.000000000 -0800
@@ -537,6 +537,8 @@
#endif
struct list_head tasks;
+ unsigned long anon_fault_next_addr; /* Predicted sequential fault address */
+ int anon_fault_order; /* Last order of allocation on fault */
/*
* ptrace_list/ptrace_children forms the list of my children
* that were stolen by a ptracer.
Index: linux-2.6.9/mm/memory.c
===================================================================
--- linux-2.6.9.orig/mm/memory.c 2004-12-01 10:38:11.000000000 -0800
+++ linux-2.6.9/mm/memory.c 2004-12-01 10:45:01.000000000 -0800
@@ -55,6 +55,7 @@
#include <linux/swapops.h>
#include <linux/elf.h>
+#include <linux/pagevec.h>
#ifndef CONFIG_DISCONTIGMEM
/* use the per-pgdat data instead for discontigmem - mbligh */
@@ -1432,8 +1433,106 @@
unsigned long addr)
{
pte_t entry;
- struct page * page = ZERO_PAGE(addr);
+ struct page * page;
+
+ addr &= PAGE_MASK;
+
+ if (current->anon_fault_next_addr == addr) {
+ unsigned long end_addr;
+ int order = current->anon_fault_order;
+
+ /* Sequence of page faults detected. Perform preallocation of pages */
+ /* The order of preallocations increases with each successful prediction */
+ order++;
+
+ if ((1 << order) < PAGEVEC_SIZE)
+ end_addr = addr + (1 << (order + PAGE_SHIFT));
+ else
+ end_addr = addr + PAGEVEC_SIZE * PAGE_SIZE;
+
+ if (end_addr > vma->vm_end)
+ end_addr = vma->vm_end;
+ if ((addr & PMD_MASK) != (end_addr & PMD_MASK))
+ end_addr &= PMD_MASK;
+
+ current->anon_fault_next_addr = end_addr;
+ current->anon_fault_order = order;
+
+ if (write_access) {
+
+ struct pagevec pv;
+ unsigned long a;
+ struct page **p;
+
+ pte_unmap(page_table);
+ spin_unlock(&mm->page_table_lock);
+
+ pagevec_init(&pv, 0);
+
+ if (unlikely(anon_vma_prepare(vma)))
+ return VM_FAULT_OOM;
+
+ /* Allocate the necessary pages */
+ for(a = addr;a < end_addr ; a += PAGE_SIZE) {
+ struct page *p = alloc_page_vma(GFP_HIGHUSER, vma, a);
+
+ if (p) {
+ clear_user_highpage(p, a);
+ pagevec_add(&pv,p);
+ } else
+ break;
+ }
+ end_addr = a;
+
+ spin_lock(&mm->page_table_lock);
+
+ for(p = pv.pages; addr < end_addr; addr += PAGE_SIZE, p++) {
+
+ page_table = pte_offset_map(pmd, addr);
+ if (!pte_none(*page_table)) {
+ /* Someone else got there first */
+ page_cache_release(*p);
+ pte_unmap(page_table);
+ continue;
+ }
+
+ entry = maybe_mkwrite(pte_mkdirty(mk_pte(*p,
+ vma->vm_page_prot)),
+ vma);
+
+ mm->rss++;
+ lru_cache_add_active(*p);
+ mark_page_accessed(*p);
+ page_add_anon_rmap(*p, vma, addr);
+
+ set_pte(page_table, entry);
+ pte_unmap(page_table);
+
+ /* No need to invalidate - it was non-present before */
+ update_mmu_cache(vma, addr, entry);
+ }
+ } else {
+ /* Read */
+ for(;addr < end_addr; addr += PAGE_SIZE) {
+ page_table = pte_offset_map(pmd, addr);
+ entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
+ set_pte(page_table, entry);
+ pte_unmap(page_table);
+
+ /* No need to invalidate - it was non-present before */
+ update_mmu_cache(vma, addr, entry);
+
+ };
+ }
+ spin_unlock(&mm->page_table_lock);
+ return VM_FAULT_MINOR;
+ }
+
+ current->anon_fault_next_addr = addr + PAGE_SIZE;
+ current->anon_fault_order = 0;
+
+ page = ZERO_PAGE(addr);
/* Read-only mapping of ZERO_PAGE. */
entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
* Re: Anticipatory prefaulting in the page fault handler V1
2004-12-08 17:24 ` Anticipatory prefaulting in the page fault handler V1 Christoph Lameter
@ 2004-12-08 17:33 ` Jesse Barnes
2004-12-08 17:56 ` Christoph Lameter
2004-12-08 17:55 ` Dave Hansen
` (4 subsequent siblings)
5 siblings, 1 reply; 23+ messages in thread
From: Jesse Barnes @ 2004-12-08 17:33 UTC (permalink / raw)
To: Christoph Lameter
Cc: nickpiggin, Jeff Garzik, torvalds, hugh, benh, linux-mm,
linux-ia64, linux-kernel
On Wednesday, December 8, 2004 9:24 am, Christoph Lameter wrote:
> Page fault scalability patch and prefaulting. Max prefault order
> increased to 5 (max preallocation of 32 pages):
>
> Gb Rep Threads User System Wall flt/cpu/s fault/wsec
> 256 10 8 33.571s 4516.293s 863.021s 36874.099 194356.930
> 256 10 16 33.103s 3737.688s 461.028s 44492.553 363704.484
> 256 10 32 35.094s 3436.561s 321.080s 48326.262 521352.840
> 256 10 64 46.675s 2899.997s 245.020s 56936.124 684214.256
> 256 10 128 85.493s 2890.198s 203.008s 56380.890 826122.524
> 256 10 256 74.299s 1374.973s 99.088s 115762.963 1679630.272
> 256 10 512 62.760s 706.559s 53.027s 218078.311 3149273.714
>
> We are getting into an almost linear scalability in the high end with
> both patches and end up with a fault rate > 3 mio faults per second.
Nice results! Any idea how many applications benefit from this sort of
anticipatory faulting? It has implications for NUMA allocation. Imagine an
app that allocates a large virtual address space and then tries to fault in
pages near each CPU in turn. With this patch applied, CPU 2 would be
referencing pages near CPU 1, and CPU 3 would then fault in 4 pages, which
would then be used by CPUs 4-6. Unless I'm missing something...
And again, I'm not sure how important that is, maybe this approach will work
well in the majority of cases (obviously it's a big win in faults/sec for
your benchmark, but I wonder about subsequent references from other CPUs to
those pages). You can look at /sys/devices/platform/nodeN/meminfo to see
where the pages are coming from.
Jesse
* Re: Anticipatory prefaulting in the page fault handler V1
2004-12-08 17:33 ` Jesse Barnes
@ 2004-12-08 17:56 ` Christoph Lameter
2004-12-08 18:33 ` Jesse Barnes
2004-12-08 21:26 ` David S. Miller
0 siblings, 2 replies; 23+ messages in thread
From: Christoph Lameter @ 2004-12-08 17:56 UTC (permalink / raw)
To: Jesse Barnes
Cc: nickpiggin, Jeff Garzik, torvalds, hugh, benh, linux-mm,
linux-ia64, linux-kernel
On Wed, 8 Dec 2004, Jesse Barnes wrote:
> Nice results! Any idea how many applications benefit from this sort of
> anticipatory faulting? It has implications for NUMA allocation. Imagine an
> app that allocates a large virtual address space and then tries to fault in
> pages near each CPU in turn. With this patch applied, CPU 2 would be
> referencing pages near CPU 1, and CPU 3 would then fault in 4 pages, which
> would then be used by CPUs 4-6. Unless I'm missing something...
Faults are predicted for each thread executing on a different processor.
So each processor does its own predictions which will not generate
preallocations on a different processor (unless the thread is moved to
another processor but that is a very special situation).
> And again, I'm not sure how important that is, maybe this approach will work
> well in the majority of cases (obviously it's a big win in faults/sec for
> your benchmark, but I wonder about subsequent references from other CPUs to
> those pages). You can look at /sys/devices/platform/nodeN/meminfo to see
> where the pages are coming from.
The origin of the pages has not changed and the existing locality
constraints are observed.
A patch like this is important for applications that allocate and preset
large amounts of memory on startup. It will drastically reduce the startup
times.
* Re: Anticipatory prefaulting in the page fault handler V1
2004-12-08 17:56 ` Christoph Lameter
@ 2004-12-08 18:33 ` Jesse Barnes
2004-12-08 21:26 ` David S. Miller
1 sibling, 0 replies; 23+ messages in thread
From: Jesse Barnes @ 2004-12-08 18:33 UTC (permalink / raw)
To: Christoph Lameter
Cc: nickpiggin, Jeff Garzik, torvalds, hugh, benh, linux-mm,
linux-ia64, linux-kernel
On Wednesday, December 8, 2004 9:56 am, Christoph Lameter wrote:
> > And again, I'm not sure how important that is, maybe this approach will
> > work well in the majority of cases (obviously it's a big win in
> > faults/sec for your benchmark, but I wonder about subsequent references
> > from other CPUs to those pages). You can look at
> > /sys/devices/platform/nodeN/meminfo to see where the pages are coming
> > from.
>
> The origin of the pages has not changed and the existing locality
> constraints are observed.
>
> A patch like this is important for applications that allocate and preset
> large amounts of memory on startup. It will drastically reduce the startup
> times.
Ok, that sounds good. My case was probably a bit contrived, but I'm glad to
see that you had already thought of it anyway.
Jesse
* Re: Anticipatory prefaulting in the page fault handler V1
2004-12-08 17:56 ` Christoph Lameter
2004-12-08 18:33 ` Jesse Barnes
@ 2004-12-08 21:26 ` David S. Miller
2004-12-08 21:42 ` Linus Torvalds
1 sibling, 1 reply; 23+ messages in thread
From: David S. Miller @ 2004-12-08 21:26 UTC (permalink / raw)
To: Christoph Lameter
Cc: jbarnes, nickpiggin, jgarzik, torvalds, hugh, benh, linux-mm,
linux-ia64, linux-kernel
On Wed, 8 Dec 2004 09:56:00 -0800 (PST)
Christoph Lameter <clameter@sgi.com> wrote:
> A patch like this is important for applications that allocate and preset
> large amounts of memory on startup. It will drastically reduce the startup
> times.
I see. Yet I noticed that while the patch makes system time decrease,
for some reason the wall time is increasing with the patch applied.
Why is that, or am I misreading your tables?
* Re: Anticipatory prefaulting in the page fault handler V1
2004-12-08 21:26 ` David S. Miller
@ 2004-12-08 21:42 ` Linus Torvalds
0 siblings, 0 replies; 23+ messages in thread
From: Linus Torvalds @ 2004-12-08 21:42 UTC (permalink / raw)
To: David S. Miller
Cc: Christoph Lameter, jbarnes, nickpiggin, jgarzik, hugh, benh,
linux-mm, linux-ia64, linux-kernel
On Wed, 8 Dec 2004, David S. Miller wrote:
>
> I see. Yet I noticed that while the patch makes system time decrease,
> for some reason the wall time is increasing with the patch applied.
> Why is that, or am I misreading your tables?
I assume that you're looking at the final "both patches applied" case.
It has ten repetitions, while the other two tables only have three. That
would explain the discrepancy.
Linus
* Re: Anticipatory prefaulting in the page fault handler V1
2004-12-08 17:24 ` Anticipatory prefaulting in the page fault handler V1 Christoph Lameter
2004-12-08 17:33 ` Jesse Barnes
@ 2004-12-08 17:55 ` Dave Hansen
2004-12-08 19:07 ` Martin J. Bligh
` (3 subsequent siblings)
5 siblings, 0 replies; 23+ messages in thread
From: Dave Hansen @ 2004-12-08 17:55 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, Jeff Garzik, Linus Torvalds, hugh,
Benjamin Herrenschmidt, linux-mm, linux-ia64,
Linux Kernel Mailing List
On Wed, 2004-12-08 at 09:24, Christoph Lameter wrote:
> The page fault handler for anonymous pages can generate significant overhead
> apart from its essential function which is to clear and setup a new page
> table entry for a never accessed memory location. This overhead increases
> significantly in an SMP environment.
do_anonymous_page() is a relatively compact function at this point.
This would probably be a lot more readable if it were broken out into at
least another function or two that do_anonymous_page() calls into. That
way, you also get a much cleaner separation if anyone needs to turn it
off in the future.
Speaking of that, have you seen this impair performance on any other
workloads?
-- Dave
* Re: Anticipatory prefaulting in the page fault handler V1
2004-12-08 17:24 ` Anticipatory prefaulting in the page fault handler V1 Christoph Lameter
2004-12-08 17:33 ` Jesse Barnes
2004-12-08 17:55 ` Dave Hansen
@ 2004-12-08 19:07 ` Martin J. Bligh
2004-12-08 22:50 ` Martin J. Bligh
` (2 subsequent siblings)
5 siblings, 0 replies; 23+ messages in thread
From: Martin J. Bligh @ 2004-12-08 19:07 UTC (permalink / raw)
To: Christoph Lameter, nickpiggin
Cc: Jeff Garzik, torvalds, hugh, benh, linux-mm, linux-ia64, linux-kernel
> The page fault handler for anonymous pages can generate significant overhead
> apart from its essential function which is to clear and setup a new page
> table entry for a never accessed memory location. This overhead increases
> significantly in an SMP environment.
>
> In the page table scalability patches, we addressed the issue by changing
> the locking scheme so that multiple fault handlers are able to be processed
> concurrently on multiple cpus. This patch attempts to aggregate multiple
> page faults into a single one. It does that by noting
> anonymous page faults generated in sequence by an application.
>
> If a fault occurred for page x and is then followed by page x+1 then it may
> be reasonable to expect another page fault at x+2 in the future. If page
> table entries for x+1 and x+2 would be prepared in the fault handling for
> page x+1 then the overhead of taking a fault for x+2 is avoided. However
> page x+2 may never be used and thus we may have increased the rss
> of an application unnecessarily. The swapper will take care of removing
> that page if memory should get tight.
>
> The following patch makes the anonymous fault handler anticipate future
> faults. For each fault a prediction is made where the fault would occur
> (assuming linear access by the application). If the prediction turns out to
> be right (next fault is where expected) then a number of pages is
> preallocated in order to avoid a series of future faults. The order of the
> preallocation increases by the power of two for each success in sequence.
>
> The first successful prediction leads to an additional page being allocated.
> Second successful prediction leads to 2 additional pages being allocated.
> Third to 4 pages and so on. The max order is 3 by default. In a large
> continuous allocation the number of faults is reduced by a factor of 8.
>
> The patch may be combined with the page fault scalability patch (another
> edition of the patch is needed which will be forthcoming after the
> page fault scalability patch has been included). The combined patches
> will triple the possible page fault rate from ~1 mio faults sec to 3 mio
> faults sec.
>
> Standard Kernel on a 512 Cpu machine allocating 32GB with an increasing
> number of threads (and thus increasing parallellism of page faults):
Mmmm ... we tried doing this before for file-backed pages by sniffing the
pagecache, but it crippled forky workloads (like kernel compile) with the
extra cost in zap_pte_range, etc.
Perhaps the locality is better for the anon stuff, but the cost is also
higher. Exactly what benchmark were you running on this? If you just run
a microbenchmark that allocates memory, then it will definitely be faster.
On other things, I suspect not ...
M.
* Re: Anticipatory prefaulting in the page fault handler V1
2004-12-08 17:24 ` Anticipatory prefaulting in the page fault handler V1 Christoph Lameter
` (2 preceding siblings ...)
2004-12-08 19:07 ` Martin J. Bligh
@ 2004-12-08 22:50 ` Martin J. Bligh
2004-12-09 19:32 ` Christoph Lameter
2004-12-09 10:57 ` Pavel Machek
2004-12-14 15:28 ` Adam Litke
5 siblings, 1 reply; 23+ messages in thread
From: Martin J. Bligh @ 2004-12-08 22:50 UTC (permalink / raw)
To: Christoph Lameter, nickpiggin
Cc: Jeff Garzik, torvalds, hugh, benh, linux-mm, linux-ia64, linux-kernel
> The page fault handler for anonymous pages can generate significant overhead
> apart from its essential function which is to clear and setup a new page
> table entry for a never accessed memory location. This overhead increases
> significantly in an SMP environment.
>
> In the page table scalability patches, we addressed the issue by changing
> the locking scheme so that multiple fault handlers are able to be processed
> concurrently on multiple cpus. This patch attempts to aggregate multiple
> page faults into a single one. It does that by noting
> anonymous page faults generated in sequence by an application.
>
> If a fault occurred for page x and is then followed by page x+1 then it may
> be reasonable to expect another page fault at x+2 in the future. If page
> table entries for x+1 and x+2 would be prepared in the fault handling for
> page x+1 then the overhead of taking a fault for x+2 is avoided. However
> page x+2 may never be used and thus we may have increased the rss
> of an application unnecessarily. The swapper will take care of removing
> that page if memory should get tight.
I tried benchmarking it ... but processes just segfault all the time.
Any chance you could try it out on SMP ia32 system?
M.
* Re: Anticipatory prefaulting in the page fault handler V1
2004-12-08 22:50 ` Martin J. Bligh
@ 2004-12-09 19:32 ` Christoph Lameter
2004-12-13 14:30 ` Akinobu Mita
0 siblings, 1 reply; 23+ messages in thread
From: Christoph Lameter @ 2004-12-09 19:32 UTC (permalink / raw)
To: Martin J. Bligh
Cc: nickpiggin, Jeff Garzik, torvalds, hugh, benh, linux-mm,
linux-ia64, linux-kernel
On Wed, 8 Dec 2004, Martin J. Bligh wrote:
> I tried benchmarking it ... but processes just segfault all the time.
> Any chance you could try it out on SMP ia32 system?
I tried it on my i386 system and it works fine. Sorry about the puny
memory size (the system is a PIII-450 with 384MB of memory)
clameter@schroedinger:~/pfault/code$ ./pft -t -b256000 -r3 -f1
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
0 3 1 0.000s 0.004s 0.000s 37407.481 29200.500
0 3 2 0.002s 0.002s 0.000s 31177.059 27227.723
clameter@schroedinger:~/pfault/code$ uname -a
Linux schroedinger 2.6.10-rc3-bk3-prezero #8 SMP Wed Dec 8 15:22:28 PST
2004 i686 GNU/Linux
Could you send me your .config?
* Re: Anticipatory prefaulting in the page fault handler V1
2004-12-09 19:32 ` Christoph Lameter
@ 2004-12-13 14:30 ` Akinobu Mita
2004-12-13 17:10 ` Christoph Lameter
0 siblings, 1 reply; 23+ messages in thread
From: Akinobu Mita @ 2004-12-13 14:30 UTC (permalink / raw)
To: Christoph Lameter, Martin J. Bligh
Cc: nickpiggin, Jeff Garzik, torvalds, hugh, benh, linux-mm,
linux-ia64, linux-kernel
On Friday 10 December 2004 04:32, Christoph Lameter wrote:
> On Wed, 8 Dec 2004, Martin J. Bligh wrote:
> > I tried benchmarking it ... but processes just segfault all the time.
> > Any chance you could try it out on SMP ia32 system?
>
> I tried it on my i386 system and it works fine. Sorry about the puny
> memory sizes (the system is a PIII-450 with 384k memory)
>
I also encountered process segfaults.
The patch below fixes several problems:
1) if no pages could be allocated, return VM_FAULT_OOM
2) fix a duplicated pte_offset_map() call
3) don't set_pte() for an entry which has already been set
Actually, 3) is what fixes my segfault problem.
--- 2.6-rc/mm/memory.c.orig 2004-12-13 22:17:04.000000000 +0900
+++ 2.6-rc/mm/memory.c 2004-12-13 22:22:14.000000000 +0900
@@ -1483,6 +1483,8 @@ do_anonymous_page(struct mm_struct *mm,
} else
break;
}
+ if (a == addr)
+ goto no_mem;
end_addr = a;
spin_lock(&mm->page_table_lock);
@@ -1514,8 +1516,17 @@ do_anonymous_page(struct mm_struct *mm,
}
} else {
/* Read */
+ int first = 1;
+
for(;addr < end_addr; addr += PAGE_SIZE) {
- page_table = pte_offset_map(pmd, addr);
+ if (!first)
+ page_table = pte_offset_map(pmd, addr);
+ first = 0;
+ if (!pte_none(*page_table)) {
+ /* Someone else got there first */
+ pte_unmap(page_table);
+ continue;
+ }
entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
set_pte(page_table, entry);
pte_unmap(page_table);
* Re: Anticipatory prefaulting in the page fault handler V1
2004-12-13 14:30 ` Akinobu Mita
@ 2004-12-13 17:10 ` Christoph Lameter
2004-12-13 22:16 ` Martin J. Bligh
2004-12-14 12:24 ` Akinobu Mita
0 siblings, 2 replies; 23+ messages in thread
From: Christoph Lameter @ 2004-12-13 17:10 UTC (permalink / raw)
To: Akinobu Mita
Cc: Martin J. Bligh, nickpiggin, Jeff Garzik, torvalds, hugh, benh,
linux-mm, linux-ia64, linux-kernel
On Mon, 13 Dec 2004, Akinobu Mita wrote:
> I also encountered processes segfault.
> Below patch fix several problems.
>
> 1) if no pages could allocated, returns VM_FAULT_OOM
> 2) fix duplicated pte_offset_map() call
I also saw these two issues and I think I dealt with them in a forthcoming
patch.
> 3) don't set_pte() for the entry which already have been set
Not sure how this could have happened in the patch.
Could you try my updated version:
Index: linux-2.6.9/include/linux/sched.h
===================================================================
--- linux-2.6.9.orig/include/linux/sched.h 2004-12-08 15:01:48.801457702 -0800
+++ linux-2.6.9/include/linux/sched.h 2004-12-08 15:02:04.286479345 -0800
@@ -537,6 +537,8 @@
#endif
struct list_head tasks;
+ unsigned long anon_fault_next_addr; /* Predicted sequential fault address */
+ int anon_fault_order; /* Last order of allocation on fault */
/*
* ptrace_list/ptrace_children forms the list of my children
* that were stolen by a ptracer.
Index: linux-2.6.9/mm/memory.c
===================================================================
--- linux-2.6.9.orig/mm/memory.c 2004-12-08 15:01:50.668339751 -0800
+++ linux-2.6.9/mm/memory.c 2004-12-09 14:21:17.090061608 -0800
@@ -55,6 +55,7 @@
#include <linux/swapops.h>
#include <linux/elf.h>
+#include <linux/pagevec.h>
#ifndef CONFIG_DISCONTIGMEM
/* use the per-pgdat data instead for discontigmem - mbligh */
@@ -1432,52 +1433,99 @@
unsigned long addr)
{
pte_t entry;
- struct page * page = ZERO_PAGE(addr);
-
- /* Read-only mapping of ZERO_PAGE. */
- entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
+ unsigned long end_addr;
+
+ addr &= PAGE_MASK;
+
+ if (likely((vma->vm_flags & VM_RAND_READ) || current->anon_fault_next_addr != addr)) {
+ /* Single page */
+ current->anon_fault_order = 0;
+ end_addr = addr + PAGE_SIZE;
+ } else {
+		/* Sequence of faults detected. Perform preallocation */
+ int order = ++current->anon_fault_order;
+
+ if ((1 << order) < PAGEVEC_SIZE)
+ end_addr = addr + (PAGE_SIZE << order);
+ else
+ end_addr = addr + PAGEVEC_SIZE * PAGE_SIZE;
- /* ..except if it's a write access */
+ if (end_addr > vma->vm_end)
+ end_addr = vma->vm_end;
+ if ((addr & PMD_MASK) != (end_addr & PMD_MASK))
+ end_addr &= PMD_MASK;
+ }
if (write_access) {
- /* Allocate our own private page. */
+
+ unsigned long a;
+ struct page **p;
+ struct pagevec pv;
+
pte_unmap(page_table);
spin_unlock(&mm->page_table_lock);
+ pagevec_init(&pv, 0);
+
if (unlikely(anon_vma_prepare(vma)))
- goto no_mem;
- page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
- if (!page)
- goto no_mem;
- clear_user_highpage(page, addr);
+ return VM_FAULT_OOM;
+
+ /* Allocate the necessary pages */
+ for(a = addr; a < end_addr ; a += PAGE_SIZE) {
+ struct page *p = alloc_page_vma(GFP_HIGHUSER, vma, a);
+
+ if (likely(p)) {
+ clear_user_highpage(p, a);
+ pagevec_add(&pv, p);
+ } else {
+ if (a == addr)
+ return VM_FAULT_OOM;
+ break;
+ }
+ }
spin_lock(&mm->page_table_lock);
- page_table = pte_offset_map(pmd, addr);
- if (!pte_none(*page_table)) {
+ for(p = pv.pages; addr < a; addr += PAGE_SIZE, p++) {
+
+ page_table = pte_offset_map(pmd, addr);
+ if (unlikely(!pte_none(*page_table))) {
+ /* Someone else got there first */
+ pte_unmap(page_table);
+ page_cache_release(*p);
+ continue;
+ }
+
+ entry = maybe_mkwrite(pte_mkdirty(mk_pte(*p,
+ vma->vm_page_prot)),
+ vma);
+
+ mm->rss++;
+ lru_cache_add_active(*p);
+ mark_page_accessed(*p);
+ page_add_anon_rmap(*p, vma, addr);
+
+ set_pte(page_table, entry);
pte_unmap(page_table);
- page_cache_release(page);
- spin_unlock(&mm->page_table_lock);
- goto out;
+
+ /* No need to invalidate - it was non-present before */
+ update_mmu_cache(vma, addr, entry);
+ }
+ } else {
+ /* Read */
+ entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
+nextread:
+ set_pte(page_table, entry);
+ pte_unmap(page_table);
+ update_mmu_cache(vma, addr, entry);
+ addr += PAGE_SIZE;
+ if (unlikely(addr < end_addr)) {
+ pte_offset_map(pmd, addr);
+ goto nextread;
}
- mm->rss++;
- entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
- vma->vm_page_prot)),
- vma);
- lru_cache_add_active(page);
- mark_page_accessed(page);
- page_add_anon_rmap(page, vma, addr);
}
-
- set_pte(page_table, entry);
- pte_unmap(page_table);
-
- /* No need to invalidate - it was non-present before */
- update_mmu_cache(vma, addr, entry);
+ current->anon_fault_next_addr = addr;
spin_unlock(&mm->page_table_lock);
-out:
return VM_FAULT_MINOR;
-no_mem:
- return VM_FAULT_OOM;
}
/*
--
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Anticipatory prefaulting in the page fault handler V1
2004-12-13 17:10 ` Christoph Lameter
@ 2004-12-13 22:16 ` Martin J. Bligh
2004-12-14 12:24 ` Akinobu Mita
1 sibling, 0 replies; 23+ messages in thread
From: Martin J. Bligh @ 2004-12-13 22:16 UTC (permalink / raw)
To: Christoph Lameter, Akinobu Mita
Cc: nickpiggin, Jeff Garzik, torvalds, hugh, benh, linux-mm,
linux-ia64, linux-kernel
>> I also encountered process segfaults.
>> The patch below fixes several problems.
>>
>> 1) if no pages could be allocated, return VM_FAULT_OOM
>> 2) fix a duplicated pte_offset_map() call
>
> I also saw these two issues and I think I dealt with them in a forthcoming
> patch.
>
>> 3) don't set_pte() an entry which has already been set
>
> Not sure how this could have happened in the patch.
>
> Could you try my updated version:
Urgle. There was a fix from Hugh too ... any chance you could just stick
a whole new patch somewhere? I'm too idle/stupid to work it out ;-)
M.
--
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Anticipatory prefaulting in the page fault handler V1
2004-12-13 17:10 ` Christoph Lameter
2004-12-13 22:16 ` Martin J. Bligh
@ 2004-12-14 12:24 ` Akinobu Mita
2004-12-14 15:25 ` Akinobu Mita
2004-12-14 20:25 ` Christoph Lameter
1 sibling, 2 replies; 23+ messages in thread
From: Akinobu Mita @ 2004-12-14 12:24 UTC (permalink / raw)
To: Christoph Lameter
Cc: Martin J. Bligh, nickpiggin, Jeff Garzik, torvalds, hugh, benh,
linux-mm, linux-ia64, linux-kernel
On Tuesday 14 December 2004 02:10, Christoph Lameter wrote:
> On Mon, 13 Dec 2004, Akinobu Mita wrote:
> > 3) don't set_pte() an entry which has already been set
>
> Not sure how this could have happened in the patch.
This is why I inserted a pte_none() check for each page table entry in the
read fault case too.
If a read access fault occurred for the address "addr", it is completely
unnecessary to check pte_none() on the page table entry for "addr", because
the page_table_lock is never released until do_anonymous_page() returns (in
the read access fault case).
But there is no guarantee that the page table entries for addr+PAGE_SIZE,
addr+2*PAGE_SIZE, ... have not already been mapped.
Anyway, I will try your V2 patch.
--
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Anticipatory prefaulting in the page fault handler V1
2004-12-14 12:24 ` Akinobu Mita
@ 2004-12-14 15:25 ` Akinobu Mita
2004-12-14 20:25 ` Christoph Lameter
1 sibling, 0 replies; 23+ messages in thread
From: Akinobu Mita @ 2004-12-14 15:25 UTC (permalink / raw)
To: Christoph Lameter
Cc: Martin J. Bligh, nickpiggin, Jeff Garzik, torvalds, hugh, benh,
linux-mm, linux-ia64, linux-kernel
On Tuesday 14 December 2004 21:24, Akinobu Mita wrote:
> But there is no guarantee that the page table entries for addr+PAGE_SIZE,
> addr+2*PAGE_SIZE, ... have not already been mapped.
>
> Anyway, I will try your V2 patch.
>
The patch below fixes the V2 patch and adds a debug printk.
The output coincides with the segfaulting processes.
# dmesg | grep ^comm:
comm: xscreensaver, addr_orig: ccdc40, addr: cce000, pid: 2995
comm: rhn-applet-gui, addr_orig: b6fd8020, addr: b6fd9000, pid: 3029
comm: rhn-applet-gui, addr_orig: b6e95020, addr: b6e96000, pid: 3029
comm: rhn-applet-gui, addr_orig: b6fd8020, addr: b6fd9000, pid: 3029
comm: rhn-applet-gui, addr_orig: b6e95020, addr: b6e96000, pid: 3029
comm: rhn-applet-gui, addr_orig: b6fd8020, addr: b6fd9000, pid: 3029
comm: X, addr_orig: 87e8000, addr: 87e9000, pid: 2874
comm: X, addr_orig: 87ea000, addr: 87eb000, pid: 2874
---
The read access prefaulting may overwrite a page table entry which has
already been mapped. This patch fixes that, and it shows which processes
might be hit by the problem.
--- 2.6-rc/mm/memory.c.orig 2004-12-14 22:06:08.000000000 +0900
+++ 2.6-rc/mm/memory.c 2004-12-14 23:42:34.000000000 +0900
@@ -1434,6 +1434,7 @@ do_anonymous_page(struct mm_struct *mm,
{
pte_t entry;
unsigned long end_addr;
+ unsigned long addr_orig = addr;
addr &= PAGE_MASK;
@@ -1517,9 +1518,15 @@ do_anonymous_page(struct mm_struct *mm,
/* Read */
entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
nextread:
- set_pte(page_table, entry);
- pte_unmap(page_table);
- update_mmu_cache(vma, addr, entry);
+ if (!pte_none(*page_table)) {
+ printk("comm: %s, addr_orig: %lx, addr: %lx, pid: %d\n",
+ current->comm, addr_orig, addr, current->pid);
+ pte_unmap(page_table);
+ } else {
+ set_pte(page_table, entry);
+ pte_unmap(page_table);
+ update_mmu_cache(vma, addr, entry);
+ }
addr += PAGE_SIZE;
if (unlikely(addr < end_addr)) {
pte_offset_map(pmd, addr);
--
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Anticipatory prefaulting in the page fault handler V1
2004-12-14 12:24 ` Akinobu Mita
2004-12-14 15:25 ` Akinobu Mita
@ 2004-12-14 20:25 ` Christoph Lameter
1 sibling, 0 replies; 23+ messages in thread
From: Christoph Lameter @ 2004-12-14 20:25 UTC (permalink / raw)
To: Akinobu Mita
Cc: Martin J. Bligh, nickpiggin, Jeff Garzik, torvalds, hugh, benh,
linux-mm, linux-ia64, linux-kernel
On Tue, 14 Dec 2004, Akinobu Mita wrote:
> This is why I inserted a pte_none() check for each page table entry in the
> read fault case too.
>
> If a read access fault occurred for the address "addr", it is completely
> unnecessary to check pte_none() on the page table entry for "addr", because
> the page_table_lock is never released until do_anonymous_page() returns (in
> the read access fault case).
>
> But there is no guarantee that the page table entries for addr+PAGE_SIZE,
> addr+2*PAGE_SIZE, ... have not already been mapped.
Right. Thanks for pointing that out.
--
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Anticipatory prefaulting in the page fault handler V1
2004-12-08 17:24 ` Anticipatory prefaulting in the page fault handler V1 Christoph Lameter
` (3 preceding siblings ...)
2004-12-08 22:50 ` Martin J. Bligh
@ 2004-12-09 10:57 ` Pavel Machek
2004-12-09 11:32 ` Nick Piggin
2004-12-09 17:05 ` Christoph Lameter
2004-12-14 15:28 ` Adam Litke
5 siblings, 2 replies; 23+ messages in thread
From: Pavel Machek @ 2004-12-09 10:57 UTC (permalink / raw)
To: Christoph Lameter
Cc: nickpiggin, Jeff Garzik, torvalds, hugh, benh, linux-mm,
linux-ia64, linux-kernel
Hi!
> Standard Kernel on a 512 Cpu machine allocating 32GB with an increasing
> number of threads (and thus increasing parallelism of page faults):
>
> Gb Rep Threads User System Wall flt/cpu/s fault/wsec
> 32 3 1 1.416s 138.165s 139.050s 45073.831 45097.498
...
> Patched kernel:
>
> Gb Rep Threads User System Wall flt/cpu/s fault/wsec
> 32 3 1 1.098s 138.544s 139.063s 45053.657 45057.920
...
> These numbers are roughly equal to what can be accomplished with the
> page fault scalability patches.
>
> Kernel patched with both the page fault scalability patches and
> prefaulting:
>
> Gb Rep Threads User System Wall flt/cpu/s fault/wsec
> 32 10 1 4.103s 456.384s 460.046s 45541.992 45544.369
...
>
> The fault rate doubles when both patches are applied.
...
> We are approaching almost linear scalability at the high end with
> both patches and end up with a fault rate > 3 million faults per second.
Well, with both patches you also slow the single-threaded case down by
more than a factor of two. What are the effects of this patch on a UP system?
Pavel
--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!
--
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Anticipatory prefaulting in the page fault handler V1
2004-12-09 10:57 ` Pavel Machek
@ 2004-12-09 11:32 ` Nick Piggin
2004-12-09 17:05 ` Christoph Lameter
1 sibling, 0 replies; 23+ messages in thread
From: Nick Piggin @ 2004-12-09 11:32 UTC (permalink / raw)
To: Pavel Machek
Cc: Christoph Lameter, Jeff Garzik, torvalds, hugh, benh, linux-mm,
linux-ia64, linux-kernel
Pavel Machek wrote:
> Hi!
>
>
>>Standard Kernel on a 512 Cpu machine allocating 32GB with an increasing
>>number of threads (and thus increasing parallelism of page faults):
>>
>> Gb Rep Threads User System Wall flt/cpu/s fault/wsec
>> 32 3 1 1.416s 138.165s 139.050s 45073.831 45097.498
>
> ...
>
>>Patched kernel:
>>
>>Gb Rep Threads User System Wall flt/cpu/s fault/wsec
>> 32 3 1 1.098s 138.544s 139.063s 45053.657 45057.920
>
> ...
>
>>These numbers are roughly equal to what can be accomplished with the
>>page fault scalability patches.
>>
>>Kernel patched with both the page fault scalability patches and
>>prefaulting:
>>
>> Gb Rep Threads User System Wall flt/cpu/s fault/wsec
>> 32 10 1 4.103s 456.384s 460.046s 45541.992 45544.369
>
> ...
>
>>The fault rate doubles when both patches are applied.
>
> ...
>
>>We are approaching almost linear scalability at the high end with
>>both patches and end up with a fault rate > 3 million faults per second.
>
>
> Well, with both patches you also slow the single-threaded case down by
> more than a factor of two. What are the effects of this patch on a UP system?
fault/wsec is the important number.
--
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Anticipatory prefaulting in the page fault handler V1
2004-12-09 10:57 ` Pavel Machek
2004-12-09 11:32 ` Nick Piggin
@ 2004-12-09 17:05 ` Christoph Lameter
1 sibling, 0 replies; 23+ messages in thread
From: Christoph Lameter @ 2004-12-09 17:05 UTC (permalink / raw)
To: Pavel Machek
Cc: nickpiggin, Jeff Garzik, torvalds, hugh, benh, linux-mm,
linux-ia64, linux-kernel
On Thu, 9 Dec 2004, Pavel Machek wrote:
> Hi!
>
> > Standard Kernel on a 512 Cpu machine allocating 32GB with an increasing
> > number of threads (and thus increasing parallelism of page faults):
> >
> > Gb Rep Threads User System Wall flt/cpu/s fault/wsec
> > 32 3 1 1.416s 138.165s 139.050s 45073.831 45097.498
> ...
> > Patched kernel:
> >
> > Gb Rep Threads User System Wall flt/cpu/s fault/wsec
> > 32 3 1 1.098s 138.544s 139.063s 45053.657 45057.920
> ...
> > These numbers are roughly equal to what can be accomplished with the
> > page fault scalability patches.
> >
> > Kernel patched with both the page fault scalability patches and
> > prefaulting:
> >
> > Gb Rep Threads User System Wall flt/cpu/s fault/wsec
> > 32 10 1 4.103s 456.384s 460.046s 45541.992 45544.369
> ...
> >
> > The fault rate doubles when both patches are applied.
> ...
> > We are approaching almost linear scalability at the high end with
> > both patches and end up with a fault rate > 3 million faults per second.
>
> Well, with both patches you also slow the single-threaded case down by
> more than a factor of two. What are the effects of this patch on a UP system?
The faults per second are slightly increased, so it's faster. The last
numbers are 10 repetitions and not 3. Do not look at the wall time.
--
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Anticipatory prefaulting in the page fault handler V1
2004-12-08 17:24 ` Anticipatory prefaulting in the page fault handler V1 Christoph Lameter
` (4 preceding siblings ...)
2004-12-09 10:57 ` Pavel Machek
@ 2004-12-14 15:28 ` Adam Litke
5 siblings, 0 replies; 23+ messages in thread
From: Adam Litke @ 2004-12-14 15:28 UTC (permalink / raw)
To: Christoph Lameter
Cc: nickpiggin, Jeff Garzik, torvalds, hugh, benh, linux-mm,
linux-ia64, linux-kernel
What benchmark are you using to generate the following results? I'd
like to run this on some of my hardware and see how the results compare.
On Wed, 2004-12-08 at 11:24, Christoph Lameter wrote:
> Standard Kernel on a 512 Cpu machine allocating 32GB with an increasing
> number of threads (and thus increasing parallelism of page faults):
>
> Gb Rep Threads User System Wall flt/cpu/s fault/wsec
> 32 3 1 1.416s 138.165s 139.050s 45073.831 45097.498
> 32 3 2 1.397s 148.523s 78.044s 41965.149 80201.646
> 32 3 4 1.390s 152.618s 44.044s 40851.258 141545.239
> 32 3 8 1.500s 374.008s 53.001s 16754.519 118671.950
> 32 3 16 1.415s 1051.759s 73.094s 5973.803 85087.358
> 32 3 32 1.867s 3400.417s 117.003s 1849.186 53754.928
> 32 3 64 5.361s 11633.040s 197.034s 540.577 31881.112
> 32 3 128 23.387s 39386.390s 332.055s 159.642 18918.599
> 32 3 256 15.409s 20031.450s 168.095s 313.837 37237.918
> 32 3 512 18.720s 10338.511s 86.047s 607.446 72752.686
>
> Patched kernel:
>
> Gb Rep Threads User System Wall flt/cpu/s fault/wsec
> 32 3 1 1.098s 138.544s 139.063s 45053.657 45057.920
> 32 3 2 1.022s 127.770s 67.086s 48849.350 92707.085
> 32 3 4 0.995s 119.666s 37.045s 52141.503 167955.292
> 32 3 8 0.928s 87.400s 18.034s 71227.407 342934.242
> 32 3 16 1.067s 72.943s 11.035s 85007.293 553989.377
> 32 3 32 1.248s 133.753s 10.038s 46602.680 606062.151
> 32 3 64 5.557s 438.634s 13.093s 14163.802 451418.617
> 32 3 128 17.860s 1496.797s 19.048s 4153.714 322808.509
> 32 3 256 13.382s 766.063s 10.016s 8071.695 618816.838
> 32 3 512 17.067s 369.106s 5.041s 16291.764 1161285.521
>
> These numbers are roughly equal to what can be accomplished with the
> page fault scalability patches.
>
> Kernel patched with both the page fault scalability patches and
> prefaulting:
>
> Gb Rep Threads User System Wall flt/cpu/s fault/wsec
> 32 10 1 4.103s 456.384s 460.046s 45541.992 45544.369
> 32 10 2 4.005s 415.119s 221.095s 50036.407 94484.174
> 32 10 4 3.855s 371.317s 111.076s 55898.259 187635.724
> 32 10 8 3.902s 308.673s 67.094s 67092.476 308634.397
> 32 10 16 4.011s 224.213s 37.016s 91889.781 564241.062
> 32 10 32 5.483s 209.391s 27.046s 97598.647 763495.417
> 32 10 64 19.166s 219.925s 26.030s 87713.212 797286.395
> 32 10 128 53.482s 342.342s 27.024s 52981.744 769687.791
> 32 10 256 67.334s 180.321s 15.036s 84679.911 1364614.334
> 32 10 512 66.516s 93.098s 9.015s 131387.893 2291548.865
>
> The fault rate doubles when both patches are applied.
>
> And on the high end (512 processors allocating 256G). (No numbers
> for regular kernels because they are extremely slow; also no
> numbers for low thread counts, which are also very slow.)
>
> With prefaulting:
>
> Gb Rep Threads User System Wall flt/cpu/s fault/wsec
> 256 3 4 8.241s 1414.348s 449.016s 35380.301 112056.239
> 256 3 8 8.306s 1300.982s 247.025s 38441.977 203559.271
> 256 3 16 8.368s 1223.853s 154.089s 40846.272 324940.924
> 256 3 32 8.536s 1284.041s 110.097s 38938.970 453556.624
> 256 3 64 13.025s 3587.203s 110.010s 13980.123 457131.492
> 256 3 128 25.025s 11460.700s 145.071s 4382.104 345404.909
> 256 3 256 26.150s 6061.649s 75.086s 8267.625 663414.482
> 256 3 512 20.637s 3037.097s 38.062s 16460.435 1302993.019
>
> Page fault scalability patch and prefaulting. Max prefault order
> increased to 5 (max preallocation of 32 pages):
>
> Gb Rep Threads User System Wall flt/cpu/s fault/wsec
> 256 10 8 33.571s 4516.293s 863.021s 36874.099 194356.930
> 256 10 16 33.103s 3737.688s 461.028s 44492.553 363704.484
> 256 10 32 35.094s 3436.561s 321.080s 48326.262 521352.840
> 256 10 64 46.675s 2899.997s 245.020s 56936.124 684214.256
> 256 10 128 85.493s 2890.198s 203.008s 56380.890 826122.524
> 256 10 256 74.299s 1374.973s 99.088s 115762.963 1679630.272
> 256 10 512 62.760s 706.559s 53.027s 218078.311 3149273.714
>
> We are approaching almost linear scalability at the high end with
> both patches and end up with a fault rate > 3 million faults per second.
>
> The one thing that takes up a lot of time is still the zeroing
> of pages in the page fault handler. There is another
> set of patches that I am working on which will prezero pages
> and leads to another increase in performance by a factor of 2-4
> (if prezeroed pages are available, which may not always be the case).
> Maybe we can reach 10 million faults/sec that way.
>
> Patch against 2.6.10-rc3-bk3:
>
> Index: linux-2.6.9/include/linux/sched.h
> ===================================================================
> --- linux-2.6.9.orig/include/linux/sched.h 2004-12-01 10:37:31.000000000 -0800
> +++ linux-2.6.9/include/linux/sched.h 2004-12-01 10:38:15.000000000 -0800
> @@ -537,6 +537,8 @@
> #endif
>
> struct list_head tasks;
> + unsigned long anon_fault_next_addr; /* Predicted sequential fault address */
> + int anon_fault_order; /* Last order of allocation on fault */
> /*
> * ptrace_list/ptrace_children forms the list of my children
> * that were stolen by a ptracer.
> Index: linux-2.6.9/mm/memory.c
> ===================================================================
> --- linux-2.6.9.orig/mm/memory.c 2004-12-01 10:38:11.000000000 -0800
> +++ linux-2.6.9/mm/memory.c 2004-12-01 10:45:01.000000000 -0800
> @@ -55,6 +55,7 @@
>
> #include <linux/swapops.h>
> #include <linux/elf.h>
> +#include <linux/pagevec.h>
>
> #ifndef CONFIG_DISCONTIGMEM
> /* use the per-pgdat data instead for discontigmem - mbligh */
> @@ -1432,8 +1433,106 @@
> unsigned long addr)
> {
> pte_t entry;
> - struct page * page = ZERO_PAGE(addr);
> + struct page * page;
> +
> + addr &= PAGE_MASK;
> +
> + if (current->anon_fault_next_addr == addr) {
> + unsigned long end_addr;
> + int order = current->anon_fault_order;
> +
> + /* Sequence of page faults detected. Perform preallocation of pages */
>
> + /* The order of preallocations increases with each successful prediction */
> + order++;
> +
> + if ((1 << order) < PAGEVEC_SIZE)
> + end_addr = addr + (1 << (order + PAGE_SHIFT));
> + else
> + end_addr = addr + PAGEVEC_SIZE * PAGE_SIZE;
> +
> + if (end_addr > vma->vm_end)
> + end_addr = vma->vm_end;
> + if ((addr & PMD_MASK) != (end_addr & PMD_MASK))
> + end_addr &= PMD_MASK;
> +
> + current->anon_fault_next_addr = end_addr;
> + current->anon_fault_order = order;
> +
> + if (write_access) {
> +
> + struct pagevec pv;
> + unsigned long a;
> + struct page **p;
> +
> + pte_unmap(page_table);
> + spin_unlock(&mm->page_table_lock);
> +
> + pagevec_init(&pv, 0);
> +
> + if (unlikely(anon_vma_prepare(vma)))
> + return VM_FAULT_OOM;
> +
> + /* Allocate the necessary pages */
> + for(a = addr;a < end_addr ; a += PAGE_SIZE) {
> + struct page *p = alloc_page_vma(GFP_HIGHUSER, vma, a);
> +
> + if (p) {
> + clear_user_highpage(p, a);
> + pagevec_add(&pv,p);
> + } else
> + break;
> + }
> + end_addr = a;
> +
> + spin_lock(&mm->page_table_lock);
> +
> + for(p = pv.pages; addr < end_addr; addr += PAGE_SIZE, p++) {
> +
> + page_table = pte_offset_map(pmd, addr);
> + if (!pte_none(*page_table)) {
> + /* Someone else got there first */
> + page_cache_release(*p);
> + pte_unmap(page_table);
> + continue;
> + }
> +
> + entry = maybe_mkwrite(pte_mkdirty(mk_pte(*p,
> + vma->vm_page_prot)),
> + vma);
> +
> + mm->rss++;
> + lru_cache_add_active(*p);
> + mark_page_accessed(*p);
> + page_add_anon_rmap(*p, vma, addr);
> +
> + set_pte(page_table, entry);
> + pte_unmap(page_table);
> +
> + /* No need to invalidate - it was non-present before */
> + update_mmu_cache(vma, addr, entry);
> + }
> + } else {
> + /* Read */
> + for(;addr < end_addr; addr += PAGE_SIZE) {
> + page_table = pte_offset_map(pmd, addr);
> + entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
> + set_pte(page_table, entry);
> + pte_unmap(page_table);
> +
> + /* No need to invalidate - it was non-present before */
> + update_mmu_cache(vma, addr, entry);
> +
> + };
> + }
> + spin_unlock(&mm->page_table_lock);
> + return VM_FAULT_MINOR;
> + }
> +
> + current->anon_fault_next_addr = addr + PAGE_SIZE;
> + current->anon_fault_order = 0;
> +
> + page = ZERO_PAGE(addr);
> /* Read-only mapping of ZERO_PAGE. */
> entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
--
^ permalink raw reply [flat|nested] 23+ messages in thread
end of thread, other threads:[~2004-12-14 20:25 UTC | newest]
Thread overview: 23+ messages
-- links below jump to the message on this page --
2004-12-08 17:44 Anticipatory prefaulting in the page fault handler V1 Luck, Tony
2004-12-08 17:57 ` Christoph Lameter
-- strict thread matches above, loose matches on Subject: below --
2004-12-08 18:31 Luck, Tony
2004-11-22 15:00 page fault scalability patch V11 [1/7]: sloppy rss Hugh Dickins
2004-11-22 21:50 ` deferred rss update instead of " Christoph Lameter
2004-11-22 22:22 ` Linus Torvalds
2004-11-22 22:27 ` Christoph Lameter
2004-11-22 22:40 ` Linus Torvalds
2004-12-01 23:41 ` page fault scalability patch V12 [0/7]: Overview and performance tests Christoph Lameter
2004-12-02 0:10 ` Linus Torvalds
2004-12-02 6:21 ` Jeff Garzik
2004-12-02 6:34 ` Andrew Morton
2004-12-02 6:48 ` Jeff Garzik
2004-12-02 7:02 ` Andrew Morton
2004-12-02 7:26 ` Martin J. Bligh
2004-12-02 7:31 ` Jeff Garzik
2004-12-02 18:10 ` cliff white
2004-12-08 17:24 ` Anticipatory prefaulting in the page fault handler V1 Christoph Lameter
2004-12-08 17:33 ` Jesse Barnes
2004-12-08 17:56 ` Christoph Lameter
2004-12-08 18:33 ` Jesse Barnes
2004-12-08 21:26 ` David S. Miller
2004-12-08 21:42 ` Linus Torvalds
2004-12-08 17:55 ` Dave Hansen
2004-12-08 19:07 ` Martin J. Bligh
2004-12-08 22:50 ` Martin J. Bligh
2004-12-09 19:32 ` Christoph Lameter
2004-12-13 14:30 ` Akinobu Mita
2004-12-13 17:10 ` Christoph Lameter
2004-12-13 22:16 ` Martin J. Bligh
2004-12-14 12:24 ` Akinobu Mita
2004-12-14 15:25 ` Akinobu Mita
2004-12-14 20:25 ` Christoph Lameter
2004-12-09 10:57 ` Pavel Machek
2004-12-09 11:32 ` Nick Piggin
2004-12-09 17:05 ` Christoph Lameter
2004-12-14 15:28 ` Adam Litke