[PATCH 0/3] Page Fault Scalability V20: Overview

* [PATCH 0/3] Page Fault Scalability V20: Overview
@ 2005-04-29 19:59 Christoph Lameter
  2005-04-29 19:59 ` [PATCH 1/3] Page Fault Scalability V20: Avoid spurious page faults Christoph Lameter
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Christoph Lameter @ 2005-04-29 19:59 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, linux-ia64, Christoph Lameter

This patch addresses the scalability issues that arise in Linux if
more than 4 processors generate page faults to populate memory
simultaneously. One example where this occurs is during the startup
phase of large scale applications. These typically start by forking
multiple threads that then concurrently initialize memory in order
to reduce the time needed. However, concurrent memory initialization
only works as expected on Linux if no more than 4 threads are
started. For more than 8 processors the performance begins to drop
until we reach single thread performance at 16-32 cpus. Performance
drops exponentially for configurations of higher cpu counts.

Without this patch these applications may seem to freeze for long periods
due to contention around the page_table_lock. These modifications
allow the page fault handler for anonymous faults to scale linearly
(verified for up to 64 cpus).

Changelog:

V19->V20
 - Adapted to use set_pte_at and conform to the other pte api changes.
   This also required a change to ptep_cmpxchg and ptep_xchg.
 - Use atomic64_t if available for counters to allow more than 8TB memory.
 - Enable ATOMIC_TABLE_OPS via Kconfig for SMP configurations with
   suitable processors under IA32, X86_64 and IA64.
 - Drop rss patch which was already accepted.

Charts with performance data and a short description are available at
http://oss.sgi.com/projects/page_fault_performance/atomic-ptes.pdf

The basic approach in this patch set is the same as used in SGI's 2.4.X
based kernels which have been in production use by SGI in ProPack 3
for a long time. More information may be found at
http://oss.sgi.com/projects/page_fault_performance .

The patch set is currently composed of 3 patches:

1/3: Avoid spurious page faults

	ptes are currently sporadically set to zero for synchronization
	with the MMU which may cause spurious page faults. This patch
	uses ptep_xchg and ptep_cmpxchg to avoid clearing ptes and
	therefore also spurious page faults.

	The patch introduces CONFIG_ATOMIC_TABLE_OPS that is enabled
	if the hardware is able to support atomic operations and if
	a SMP kernel is being configured. A Kconfig update for i386,
	x86_64 and ia64 has been provided. On i386 this option is
	restricted to CPUs better than a 486 and non-PAE mode (that
	way all the cmpxchg issues on old i386 CPUs and the problems
	with 64bit atomic operations on recent i386 CPUs are avoided).

	If CONFIG_ATOMIC_TABLE_OPS is not set then ptep_xchg and
	ptep_cmpxchg are realized by falling back to clearing a pte
	before updating it.

	The patch does not change the use of mm->page_table_lock and
	the only performance improvement is the replacement of
	xchg-with-zero-and-then-write-new-pte-value with an xchg with
	the new value for SMP on some architectures if
	CONFIG_ATOMIC_TABLE_OPS is configured. It should not do anything
	major to VM operations.

2/3: Drop the first use of the page_table_lock in handle_mm_fault

	The patch introduces two new functions:

	page_table_atomic_start(mm), page_table_atomic_stop(mm)

	that fall back to the use of the page_table_lock if
	CONFIG_ATOMIC_TABLE_OPS is not defined.

	If CONFIG_ATOMIC_TABLE_OPS is defined those functions may
	be used to prep the CPU for atomic table ops (i386 in PAE mode
	may, for example, get the MMX register ready for 64bit atomic
	ops) but these are currently empty by default.

	Two operations may then be performed on the page table without
	acquiring the page table lock:

	a) updating access bits in pte
	b) anonymous read faults installing a mapping to the zero page.

	All counters are still protected with the page_table_lock thus
	avoiding any issues there.

	Some additional counters are added to /proc/meminfo for
	diagnostic purposes. This includes counting spurious faults
	with no effect.

3/3: Drop the use of the page_table_lock in do_anonymous_page

	The second acquisition of the page_table_lock is removed
	from do_anonymous_page, which allows the anonymous
	write fault to be handled without the page_table_lock.

	The macros for manipulating rss and anon_rss in include/linux/sched.h
	are changed if CONFIG_ATOMIC_TABLE_OPS is set to use atomic
	operations for rss and anon_rss (safest solution for now, other
	solutions may easily be implemented by changing those macros).
	A 64 bit atomic type will be used if available.

	This patch typically yields significant increases in page fault
	performance for threaded applications on SMP systems. A nice
	color chart can be seen at
	http://oss.sgi.com/projects/page_fault_performance/atomic-ptes.pdf


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

* [PATCH 1/3] Page Fault Scalability V20: Avoid spurious page faults
  2005-04-29 19:59 [PATCH 0/3] Page Fault Scalability V20: Overview Christoph Lameter
@ 2005-04-29 19:59 ` Christoph Lameter
  2005-04-29 19:59 ` [PATCH 2/3] Page Fault Scalability V20: Avoid first acquisition of lock Christoph Lameter
  2005-04-29 19:59 ` [PATCH 3/3] Page Fault Scalability V20: Avoid lock for anonymous write fault Christoph Lameter
  2 siblings, 0 replies; 6+ messages in thread
From: Christoph Lameter @ 2005-04-29 19:59 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, linux-ia64, Christoph Lameter

Updating a page table entry (pte) can be difficult since the MMU may
modify the pte concurrently. The current approach taken is to first
exchange the pte contents with zero. Clearing the pte by writing
zero to it also resets the present bit, which will stop the MMU from
modifying the pte and allows the processing of the bits that were set.
Then the pte is set to its new value.

While the present bit is not set, accesses to the page mapped by the pte
will result in page faults, which may install a new pte over the non
present entry. In order to avoid that scenario the page_table_lock is held.
An access will still result in a page fault but the fault handler will
also try to acquire the page_table_lock. The page_table_lock is released
after the pte has been setup by the first process. The second process will
now acquire the page_table_lock and find that there is already a pte
setup for this page and return without having done anything.

This means that a useless page fault has been generated.

However, most architectures have the capability to atomically exchange the
value of the pte. For those, clearing a pte before setting it to
a new value is not necessary. The use of atomic exchanges avoids
useless page faults.
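
The difference between the two update styles can be illustrated outside the
kernel. The following is only a user-space sketch using C11 atomics, not the
kernel code: sim_pte_t, SIM_PTE_PRESENT and both functions are hypothetical
names invented for this illustration, and bit 0 merely plays the role of the
present bit.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for a pte; bit 0 plays the role of the present bit. */
typedef _Atomic uint64_t sim_pte_t;
#define SIM_PTE_PRESENT 0x1ULL

/*
 * Clear-then-set: between the two steps the entry is zero (not present),
 * so a concurrent access in that window would take a page fault.
 * *saw_nonpresent_window reports whether the entry was observed empty.
 */
static uint64_t sim_xchg_by_clearing(sim_pte_t *ptep, uint64_t newval,
				     int *saw_nonpresent_window)
{
	uint64_t old = atomic_exchange(ptep, 0);	/* step 1: clear */

	*saw_nonpresent_window = !(atomic_load(ptep) & SIM_PTE_PRESENT);
	atomic_store(ptep, newval);			/* step 2: set new value */
	return old;
}

/* Atomic exchange: the entry goes from the old to the new value in one step. */
static uint64_t sim_xchg_atomic(sim_pte_t *ptep, uint64_t newval)
{
	return atomic_exchange(ptep, newval);
}
```

Both functions return the previous value, so a caller can still process the
bits that were set. Only the first one ever exposes an empty entry.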

The following patch introduces two new atomic operations ptep_xchg and
ptep_cmpxchg that may be provided by an architecture. The fallback in
include/asm-generic/pgtable.h is to simulate both operations through the
existing ptep_get_and_clear function. So there is essentially no change if
atomic operations on ptes have not been defined. Architectures that do
not support atomic operations on ptes may continue to use the clearing of
a pte.

Atomic operations are enabled for i386, ia64 and x86_64 if a suitable
CPU is configured in SMP mode. Generic atomic definitions for ptep_xchg
and ptep_cmpxchg have been provided based on the existing xchg() and
cmpxchg() functions that already work atomically on many platforms.

The provided generic atomic functions may be overridden as usual by defining
the appropriate __HAVE_ARCH_xxx constant and providing a different
implementation.

This patch is a piece of my attempt to reduce the use of the page_table_lock
in the page fault handler through atomic operations. This is only possible
if it can be ensured that a pte is never cleared if the pte is in
use even when the page_table_lock is not held. Clearing a pte before setting
it to another value could result in a situation in which a fault generated by
another cpu could install a pte which is then immediately overwritten by
the first CPU setting the pte to a valid value again. This patch is necessary
for the other patches removing the use of the page_table_lock to work properly.
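
The compare-and-exchange pattern that makes this safe can be sketched in user
space with C11 atomics. This is an illustration under assumptions, not the
kernel implementation: sim_pte_t, the SIM_PTE_* bits and both functions are
hypothetical. The update is only committed if the entry still holds the value
that was read, so a concurrent change is detected and retried rather than
overwritten.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef _Atomic uint64_t sim_pte_t;	/* hypothetical stand-in for a pte */
#define SIM_PTE_WRITE 0x02ULL		/* hypothetical protection bit */
#define SIM_PTE_DIRTY 0x40ULL		/* hypothetical bit set by "hardware" */

/* Returns true if *ptep still held oldval and was replaced by newval. */
static bool sim_ptep_cmpxchg(sim_pte_t *ptep, uint64_t oldval, uint64_t newval)
{
	return atomic_compare_exchange_strong(ptep, &oldval, newval);
}

/*
 * Remove write permission without ever clearing the entry. If another
 * party changes the entry between the load and the cmpxchg, the cmpxchg
 * fails and the loop recomputes the new value from the fresh contents.
 */
static uint64_t sim_write_protect(sim_pte_t *ptep)
{
	uint64_t old, updated;

	do {
		old = atomic_load(ptep);
		updated = old & ~SIM_PTE_WRITE;
	} while (!sim_ptep_cmpxchg(ptep, old, updated));
	return updated;
}
```

The retry loop has the same shape as the redo label used in change_pte_range
in the patch below.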

Some numbers:

AIM7 Benchmark on an 8 processor system:

w/o patch
Tasks    jobs/min  jti  jobs/min/task      real       cpu
    1      471.79  100       471.7899     12.34      2.30   Wed Mar 30 21:29:54 2005
  100    18068.92   89       180.6892     32.21    158.37   Wed Mar 30 21:30:27 2005
  200    21427.39   84       107.1369     54.32    315.84   Wed Mar 30 21:31:22 2005
  300    21500.87   82        71.6696     81.21    473.74   Wed Mar 30 21:32:43 2005
  400    24886.42   83        62.2160     93.55    633.73   Wed Mar 30 21:34:23 2005
  500    25658.89   81        51.3178    113.41    789.44   Wed Mar 30 21:36:17 2005
  600    25693.47   81        42.8225    135.91    949.00   Wed Mar 30 21:38:33 2005
  700    26098.32   80        37.2833    156.10   1108.17   Wed Mar 30 21:41:10 2005
  800    26334.25   80        32.9178    176.80   1266.73   Wed Mar 30 21:44:07 2005
  900    26913.85   80        29.9043    194.62   1422.11   Wed Mar 30 21:47:22 2005
 1000    26749.89   80        26.7499    217.57   1583.95   Wed Mar 30 21:51:01 2005

w/patch:
Tasks    jobs/min  jti  jobs/min/task      real       cpu
    1      470.30  100       470.3030     12.38      2.33   Wed Mar 30 21:57:27 2005
  100    18465.05   89       184.6505     31.52    158.62   Wed Mar 30 21:57:58 2005
  200    22399.26   86       111.9963     51.97    315.95   Wed Mar 30 21:58:51 2005
  300    24274.61   84        80.9154     71.93    475.04   Wed Mar 30 22:00:03 2005
  400    25120.86   82        62.8021     92.67    634.10   Wed Mar 30 22:01:36 2005
  500    25742.87   81        51.4857    113.04    791.13   Wed Mar 30 22:03:30 2005
  600    26322.73   82        43.8712    132.66    948.31   Wed Mar 30 22:05:43 2005
  700    25718.40   80        36.7406    158.41   1112.30   Wed Mar 30 22:08:22 2005
  800    26361.08   80        32.9514    176.62   1269.94   Wed Mar 30 22:11:19 2005
  900    26975.67   81        29.9730    194.17   1424.56   Wed Mar 30 22:14:33 2005
 1000    26765.51   80        26.7655    217.44   1585.27   Wed Mar 30 22:18:12 2005

There are minor performance improvements for some numbers of tasks and minimal
losses for others. The improvement may be due to the avoidance of one store and
the avoidance of useless page faults.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.11/mm/rmap.c
===================================================================
--- linux-2.6.11.orig/mm/rmap.c	2005-04-29 08:25:55.000000000 -0700
+++ linux-2.6.11/mm/rmap.c	2005-04-29 08:26:12.000000000 -0700
@@ -574,11 +574,6 @@ static int try_to_unmap_one(struct page 
 
 	/* Nuke the page table entry. */
 	flush_cache_page(vma, address, page_to_pfn(page));
-	pteval = ptep_clear_flush(vma, address, pte);
-
-	/* Move the dirty bit to the physical page now the pte is gone. */
-	if (pte_dirty(pteval))
-		set_page_dirty(page);
 
 	if (PageAnon(page)) {
 		swp_entry_t entry = { .val = page->private };
@@ -593,10 +588,15 @@ static int try_to_unmap_one(struct page 
 			list_add(&mm->mmlist, &init_mm.mmlist);
 			spin_unlock(&mmlist_lock);
 		}
-		set_pte_at(mm, address, pte, swp_entry_to_pte(entry));
+		pteval = ptep_xchg_flush(vma, address, pte, swp_entry_to_pte(entry));
 		BUG_ON(pte_file(*pte));
 		dec_mm_counter(mm, anon_rss);
-	}
+	} else
+		pteval = ptep_clear_flush(vma, address, pte);
+
+	/* Move the dirty bit to the physical page now the pte is gone. */
+	if (pte_dirty(pteval))
+		set_page_dirty(page);
 
 	inc_mm_counter(mm, rss);
 	page_remove_rmap(page);
@@ -689,15 +689,15 @@ static void try_to_unmap_cluster(unsigne
 		if (ptep_clear_flush_young(vma, address, pte))
 			continue;
 
-		/* Nuke the page table entry. */
 		flush_cache_page(vma, address, pfn);
-		pteval = ptep_clear_flush(vma, address, pte);
 
 		/* If nonlinear, store the file page offset in the pte. */
 		if (page->index != linear_page_index(vma, address))
-			set_pte_at(mm, address, pte, pgoff_to_pte(page->index));
+			pteval = ptep_xchg_flush(vma, address, pte, pgoff_to_pte(page->index));
+		else
+			pteval = ptep_clear_flush(vma, address, pte);
 
-		/* Move the dirty bit to the physical page now the pte is gone. */
+		/* Move the dirty bit to the physical page now that the pte is gone. */
 		if (pte_dirty(pteval))
 			set_page_dirty(page);
 
Index: linux-2.6.11/mm/mprotect.c
===================================================================
--- linux-2.6.11.orig/mm/mprotect.c	2005-04-29 08:25:55.000000000 -0700
+++ linux-2.6.11/mm/mprotect.c	2005-04-29 08:26:12.000000000 -0700
@@ -32,17 +32,19 @@ static void change_pte_range(struct mm_s
 
 	pte = pte_offset_map(pmd, addr);
 	do {
-		if (pte_present(*pte)) {
-			pte_t ptent;
+		pte_t ptent;
+redo:
+		ptent = *pte;
+		if (!pte_present(ptent))
+			continue;
 
-			/* Avoid an SMP race with hardware updated dirty/clean
-			 * bits by wiping the pte and then setting the new pte
-			 * into place.
-			 */
-			ptent = pte_modify(ptep_get_and_clear(mm, addr, pte), newprot);
-			set_pte_at(mm, addr, pte, ptent);
-			lazy_mmu_prot_update(ptent);
-		}
+		/* Deal with a potential SMP race with hardware/arch
+		 * interrupt updating dirty/clean bits through the use
+		 * of ptep_cmpxchg.
+		 */
+		if (!ptep_cmpxchg(mm, addr, pte, ptent, pte_modify(ptent, newprot)))
+				goto redo;
+		lazy_mmu_prot_update(ptent);
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 	pte_unmap(pte - 1);
 }
Index: linux-2.6.11/include/asm-generic/pgtable.h
===================================================================
--- linux-2.6.11.orig/include/asm-generic/pgtable.h	2005-04-29 08:25:54.000000000 -0700
+++ linux-2.6.11/include/asm-generic/pgtable.h	2005-04-29 08:26:12.000000000 -0700
@@ -111,6 +111,92 @@ do {				  					  \
 })
 #endif
 
+#ifdef CONFIG_ATOMIC_TABLE_OPS
+
+/*
+ * The architecture does support atomic table operations.
+ * We may be able to provide atomic ptep_xchg and ptep_cmpxchg using
+ * cmpxchg and xchg.
+ */
+#ifndef __HAVE_ARCH_PTEP_XCHG
+#define ptep_xchg(__mm, __address, __ptep, __pteval) \
+	__pte(xchg(&pte_val(*(__ptep)), pte_val(__pteval)))
+#endif
+
+#ifndef __HAVE_ARCH_PTEP_CMPXCHG
+#define ptep_cmpxchg(__mm, __address, __ptep,__oldval,__newval)		\
+	(cmpxchg(&pte_val(*(__ptep)),					\
+			pte_val(__oldval),				\
+			pte_val(__newval)				\
+		) == pte_val(__oldval)					\
+	)
+#endif
+
+#ifndef __HAVE_ARCH_PTEP_XCHG_FLUSH
+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval)		\
+({									\
+	pte_t __pte = ptep_xchg(__vma, __address, __ptep, __pteval);	\
+	flush_tlb_page(__vma, __address);				\
+	__pte;								\
+})
+#endif
+
+#else
+
+/*
+ * No support for atomic operations on the page table.
+ * Exchanging of pte values is done by first swapping zeros into
+ * a pte and then putting new content into the pte entry.
+ * However, these functions will generate an empty pte for a
+ * short time frame. This means that the page_table_lock must be held
+ * to avoid a page fault that would install a new entry.
+ */
+#ifndef __HAVE_ARCH_PTEP_XCHG
+#define ptep_xchg(__mm, __address, __ptep, __pteval)			\
+({									\
+	pte_t __pte = ptep_get_and_clear(__mm, __address, __ptep);	\
+	set_pte_at(__mm, __address, __ptep, __pteval);			\
+	__pte;								\
+})
+#endif
+
+#ifndef __HAVE_ARCH_PTEP_XCHG_FLUSH
+#ifndef __HAVE_ARCH_PTEP_XCHG
+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval)		\
+({									\
+	pte_t __pte = ptep_clear_flush(__vma, __address, __ptep);	\
+	set_pte_at((__vma)->mm, __address, __ptep, __pteval);		\
+	__pte;								\
+})
+#else
+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval)		\
+({									\
+	pte_t __pte = ptep_xchg((__vma)->mm, __address, __ptep, __pteval);\
+	flush_tlb_page(__vma, __address);				\
+	__pte;								\
+})
+#endif
+#endif
+
+/*
+ * The fallback function for ptep_cmpxchg avoids any real use of cmpxchg
+ * since cmpxchg may not be available on certain architectures. Instead
+ * the clearing of a pte is used as a form of locking mechanism.
+ * This approach will only work if the page_table_lock is held to insure
+ * that the pte is not populated by a page fault generated on another
+ * CPU.
+ */
+#ifndef __HAVE_ARCH_PTEP_CMPXCHG
+#define ptep_cmpxchg(__mm, __address, __ptep, __old, __new)		\
+({									\
+	pte_t prev = ptep_get_and_clear(__mm, __address, __ptep);	\
+	int r = pte_val(prev) == pte_val(__old);			\
+	set_pte_at(__mm, __address, __ptep, r ? (__new) : prev);	\
+	r;								\
+})
+#endif
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_SET_WRPROTECT
 static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long address, pte_t *ptep)
 {
Index: linux-2.6.11/arch/ia64/Kconfig
===================================================================
--- linux-2.6.11.orig/arch/ia64/Kconfig	2005-04-29 08:26:06.000000000 -0700
+++ linux-2.6.11/arch/ia64/Kconfig	2005-04-29 09:23:07.000000000 -0700
@@ -273,6 +273,11 @@ config PREEMPT
           Say Y here if you are building a kernel for a desktop, embedded
           or real-time system.  Say N if you are unsure.
 
+config ATOMIC_TABLE_OPS
+	bool
+	depends on SMP
+	default y
+
 config HAVE_DEC_LOCK
 	bool
 	depends on (SMP || PREEMPT)
Index: linux-2.6.11/arch/i386/Kconfig
===================================================================
--- linux-2.6.11.orig/arch/i386/Kconfig	2005-04-29 08:25:50.000000000 -0700
+++ linux-2.6.11/arch/i386/Kconfig	2005-04-29 08:26:12.000000000 -0700
@@ -886,6 +886,11 @@ config HAVE_DEC_LOCK
 	depends on (SMP || PREEMPT) && X86_CMPXCHG
 	default y
 
+config ATOMIC_TABLE_OPS
+	bool
+	depends on SMP && X86_CMPXCHG && !X86_PAE
+	default y
+
 # turning this on wastes a bunch of space.
 # Summit needs it only when NUMA is on
 config BOOT_IOREMAP
Index: linux-2.6.11/arch/x86_64/Kconfig
===================================================================
--- linux-2.6.11.orig/arch/x86_64/Kconfig	2005-04-29 08:25:51.000000000 -0700
+++ linux-2.6.11/arch/x86_64/Kconfig	2005-04-29 08:26:12.000000000 -0700
@@ -223,6 +223,11 @@ config PREEMPT
 	  Say Y here if you are feeling brave and building a kernel for a
 	  desktop, embedded or real-time system.  Say N if you are unsure.
 
+config ATOMIC_TABLE_OPS
+	bool
+	depends on SMP
+	default y
+
 config PREEMPT_BKL
 	bool "Preempt The Big Kernel Lock"
 	depends on PREEMPT
Index: linux-2.6.11/mm/memory.c
===================================================================
--- linux-2.6.11.orig/mm/memory.c	2005-04-29 08:25:55.000000000 -0700
+++ linux-2.6.11/mm/memory.c	2005-04-29 08:26:12.000000000 -0700
@@ -551,15 +551,19 @@ static void zap_pte_range(struct mmu_gat
 				     page->index > details->last_index))
 					continue;
 			}
-			ptent = ptep_get_and_clear(tlb->mm, addr, pte);
-			tlb_remove_tlb_entry(tlb, pte, addr);
-			if (unlikely(!page))
+			if (unlikely(!page)) {
+				ptent = ptep_get_and_clear(tlb->mm, addr, pte);
+				tlb_remove_tlb_entry(tlb, pte, addr);
 				continue;
+			}
 			if (unlikely(details) && details->nonlinear_vma
 			    && linear_page_index(details->nonlinear_vma,
 						addr) != page->index)
-				set_pte_at(tlb->mm, addr, pte,
-					   pgoff_to_pte(page->index));
+				ptent = ptep_xchg(tlb->mm, addr, pte,
+						  pgoff_to_pte(page->index));
+			else
+				ptent = ptep_get_and_clear(tlb->mm, addr, pte);
+			tlb_remove_tlb_entry(tlb, pte, addr);
 			if (pte_dirty(ptent))
 				set_page_dirty(page);
 			if (PageAnon(page))

* [PATCH 2/3] Page Fault Scalability V20: Avoid first acquisition of lock
  2005-04-29 19:59 [PATCH 0/3] Page Fault Scalability V20: Overview Christoph Lameter
  2005-04-29 19:59 ` [PATCH 1/3] Page Fault Scalability V20: Avoid spurious page faults Christoph Lameter
@ 2005-04-29 19:59 ` Christoph Lameter
  2005-04-29 19:59 ` [PATCH 3/3] Page Fault Scalability V20: Avoid lock for anonymous write fault Christoph Lameter
  2 siblings, 0 replies; 6+ messages in thread
From: Christoph Lameter @ 2005-04-29 19:59 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, linux-ia64, Christoph Lameter

The page fault handler attempts to use the page_table_lock only for short
time periods. It repeatedly drops and reacquires the lock. When the lock
is reacquired, checks are made whether the underlying pte has changed before
replacing the pte value. These locations are a good fit for the use of
ptep_cmpxchg.

The following patch allows the use of atomic operations to remove the first
acquisition of the page_table_lock. A section
using atomic pte operations is begun with

	page_table_atomic_start(struct mm_struct *)

and ends with

	page_table_atomic_stop(struct mm_struct *)

Both of these become spin_lock(page_table_lock) and
spin_unlock(page_table_lock) if atomic page table operations are not
configured (CONFIG_ATOMIC_TABLE_OPS undefined).
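
The fallback behaviour can be sketched in user space as follows. This is only
an illustration under assumptions and not the definitions from the patch:
struct sim_mm, the toy spinlock, the sim_ prefixed functions and the
SIM_ATOMIC_TABLE_OPS switch are all hypothetical names. With the switch
undefined, start/stop take and release the lock; with it defined they are
empty.

```c
#include <stdatomic.h>

/* Hypothetical user-space stand-in for struct mm_struct. */
struct sim_mm {
	atomic_flag page_table_lock;	/* toy spinlock */
	int lock_acquisitions;		/* how often the lock was taken */
};

static void sim_spin_lock(struct sim_mm *mm)
{
	while (atomic_flag_test_and_set_explicit(&mm->page_table_lock,
						 memory_order_acquire))
		;	/* spin until the flag was previously clear */
	mm->lock_acquisitions++;	/* updated only while holding the lock */
}

static void sim_spin_unlock(struct sim_mm *mm)
{
	atomic_flag_clear_explicit(&mm->page_table_lock, memory_order_release);
}

#ifdef SIM_ATOMIC_TABLE_OPS
/* Atomic pte operations available: nothing to prepare in this sketch. */
static void sim_page_table_atomic_start(struct sim_mm *mm) { (void)mm; }
static void sim_page_table_atomic_stop(struct sim_mm *mm) { (void)mm; }
#else
/* No atomic pte operations: the section is simply a locked section. */
static void sim_page_table_atomic_start(struct sim_mm *mm) { sim_spin_lock(mm); }
static void sim_page_table_atomic_stop(struct sim_mm *mm) { sim_spin_unlock(mm); }
#endif
```

Callers bracket their pte accesses with the start/stop pair and do not need
to know which variant was compiled in.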

Atomic pte operations using ptep_xchg and ptep_cmpxchg only work for the lowest
layer of the page table. Higher layers may also be populated in an atomic
way by defining pmd_test_and_populate() etc. The generic versions of these
functions fall back to the page_table_lock. Populating higher level page
table entries is rare and therefore this is not likely to be performance
critical. For ia64 a definition of higher level atomic operations is
included.

This patch requires the patch to avoid spurious page faults to be applied
first and will only remove the first acquisition of the page_table_lock in
the page fault handler. This will allow the following page table operations
without acquiring the page_table_lock:

1. Updating of access bits (handle_mm_fault)
2. Anonymous read faults (do_anonymous_page)

The page_table_lock is still acquired for creating a new pte for an anonymous
write fault and therefore the problems with atomic updates of rss do not yet occur.

The patch also adds some diagnostic features by counting the number of cmpxchg
failures (useful for verifying that this patch works correctly) and the number
of page faults that led to no change in the page table. The statistics may be
accessed via /proc/meminfo.
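
The lock-free install of an entry together with a failure counter can be
sketched in user space as below. This is a hedged illustration only:
sim_pte_t, sim_install_if_unchanged and the counter variable are hypothetical
and merely mirror the idea of passing the originally read pte value down to
the fault handler and counting failed compare-and-exchange attempts.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef _Atomic uint64_t sim_pte_t;	/* hypothetical stand-in for a pte */

/* Hypothetical statistic: failed installs during simulated read faults. */
static atomic_ulong sim_cmpxchg_fail_anon_read;

/*
 * Install new_entry only if the pte still holds orig_entry, the value
 * read at the start of the fault. Returns true if this caller installed
 * the entry; otherwise another party got there first and the failure
 * is counted.
 */
static bool sim_install_if_unchanged(sim_pte_t *ptep, uint64_t orig_entry,
				     uint64_t new_entry)
{
	if (atomic_compare_exchange_strong(ptep, &orig_entry, new_entry))
		return true;
	atomic_fetch_add(&sim_cmpxchg_fail_anon_read, 1);
	return false;
}
```

A failed install is not an error in this sketch. It means the entry was
populated concurrently, which is why the count is useful as a diagnostic.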

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.11/mm/memory.c
===================================================================
--- linux-2.6.11.orig/mm/memory.c	2005-04-29 12:12:38.000000000 -0700
+++ linux-2.6.11/mm/memory.c	2005-04-29 12:12:45.000000000 -0700
@@ -36,6 +36,8 @@
  *		(Gerhard.Wichert@pdb.siemens.de)
  *
  * Aug/Sep 2004 Changed to four level page tables (Andi Kleen)
+ * Jan 2005 	Scalability improvement by reducing the use and the length of time
+ *		the page table lock is held (Christoph Lameter)
  */
 
 #include <linux/kernel_stat.h>
@@ -1655,8 +1657,7 @@ void swapin_readahead(swp_entry_t entry,
 }
 
 /*
- * We hold the mm semaphore and the page_table_lock on entry and
- * should release the pagetable lock on exit..
+ * We hold the mm semaphore and have started atomic pte operations
  */
 static int do_swap_page(struct mm_struct * mm,
 	struct vm_area_struct * vma, unsigned long address,
@@ -1668,15 +1669,14 @@ static int do_swap_page(struct mm_struct
 	int ret = VM_FAULT_MINOR;
 
 	pte_unmap(page_table);
-	spin_unlock(&mm->page_table_lock);
+	page_table_atomic_stop(mm);
 	page = lookup_swap_cache(entry);
 	if (!page) {
  		swapin_readahead(entry, address, vma);
  		page = read_swap_cache_async(entry, vma, address);
 		if (!page) {
 			/*
-			 * Back out if somebody else faulted in this pte while
-			 * we released the page table lock.
+			 * Back out if somebody else faulted in this pte
 			 */
 			spin_lock(&mm->page_table_lock);
 			page_table = pte_offset_map(pmd, address);
@@ -1699,8 +1699,7 @@ static int do_swap_page(struct mm_struct
 	lock_page(page);
 
 	/*
-	 * Back out if somebody else faulted in this pte while we
-	 * released the page table lock.
+	 * Back out if somebody else faulted in this pte
 	 */
 	spin_lock(&mm->page_table_lock);
 	page_table = pte_offset_map(pmd, address);
@@ -1748,62 +1747,72 @@ out:
 }
 
 /*
- * We are called with the MM semaphore and page_table_lock
- * spinlock held to protect against concurrent faults in
- * multithreaded programs. 
+ * We are called with atomic operations started and the
+ * value of the pte that was read in orig_entry.
  */
 static int
 do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		pte_t *page_table, pmd_t *pmd, int write_access,
-		unsigned long addr)
+		unsigned long addr, pte_t orig_entry)
 {
 	pte_t entry;
-	struct page * page = ZERO_PAGE(addr);
+	struct page * page;
 
-	/* Read-only mapping of ZERO_PAGE. */
-	entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
+	if (unlikely(!write_access)) {
 
-	/* ..except if it's a write access */
-	if (write_access) {
-		/* Allocate our own private page. */
+		/* Read-only mapping of ZERO_PAGE. */
+		entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
+
+		/*
+		 * If the cmpxchg fails then another cpu may
+		 * already have populated the entry
+		*/
+
+		if (ptep_cmpxchg(mm, addr, page_table, orig_entry, entry)) {
+			update_mmu_cache(vma, addr, entry);
+			lazy_mmu_prot_update(entry);
+		} else
+			inc_page_state(cmpxchg_fail_anon_read);
 		pte_unmap(page_table);
-		spin_unlock(&mm->page_table_lock);
+		page_table_atomic_stop(mm);
+		return VM_FAULT_MINOR;
+	}
 
-		if (unlikely(anon_vma_prepare(vma)))
-			goto no_mem;
-		page = alloc_zeroed_user_highpage(vma, addr);
-		if (!page)
-			goto no_mem;
+	/* This leaves the write case */
+	page_table_atomic_stop(mm);
+	if (unlikely(anon_vma_prepare(vma)))
+		return VM_FAULT_OOM;
 
-		spin_lock(&mm->page_table_lock);
-		page_table = pte_offset_map(pmd, addr);
+	page = alloc_zeroed_user_highpage(vma, addr);
+	if (!page)
+		return VM_FAULT_OOM;
 
-		if (!pte_none(*page_table)) {
-			pte_unmap(page_table);
-			page_cache_release(page);
-			spin_unlock(&mm->page_table_lock);
-			goto out;
-		}
-		inc_mm_counter(mm, rss);
-		entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
-							 vma->vm_page_prot)),
-				      vma);
-		lru_cache_add_active(page);
-		SetPageReferenced(page);
-		page_add_anon_rmap(page, vma, addr);
-	}
+	entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
+						vma->vm_page_prot)),
+				vma);
+	spin_lock(&mm->page_table_lock);
 
-	set_pte_at(mm, addr, page_table, entry);
-	pte_unmap(page_table);
+	if (!ptep_cmpxchg(mm, addr, page_table, orig_entry, entry)) {
+		pte_unmap(page_table);
+		page_cache_release(page);
+		spin_unlock(&mm->page_table_lock);
+		inc_page_state(cmpxchg_fail_anon_write);
+		return VM_FAULT_MINOR;
+        }
 
-	/* No need to invalidate - it was non-present before */
+	/*
+	 * These two functions must come after the cmpxchg
+	 * because if the page is on the LRU then try_to_unmap may come
+	 * in and unmap the pte.
+	 */
+	page_add_anon_rmap(page, vma, addr);
+	lru_cache_add_active(page);
+	inc_mm_counter(mm, rss);
+	pte_unmap(page_table);
 	update_mmu_cache(vma, addr, entry);
 	lazy_mmu_prot_update(entry);
 	spin_unlock(&mm->page_table_lock);
-out:
 	return VM_FAULT_MINOR;
-no_mem:
-	return VM_FAULT_OOM;
 }
 
 /*
@@ -1815,12 +1824,12 @@ no_mem:
  * As this is called only for pages that do not currently exist, we
  * do not need to flush old virtual caches or the TLB.
  *
- * This is called with the MM semaphore held and the page table
- * spinlock held. Exit with the spinlock released.
+ * This is called with the MM semaphore held and atomic pte operations started.
  */
 static int
 do_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
-	unsigned long address, int write_access, pte_t *page_table, pmd_t *pmd)
+	unsigned long address, int write_access, pte_t *page_table,
+        pmd_t *pmd, pte_t orig_entry)
 {
 	struct page * new_page;
 	struct address_space *mapping = NULL;
@@ -1831,9 +1840,9 @@ do_no_page(struct mm_struct *mm, struct 
 
 	if (!vma->vm_ops || !vma->vm_ops->nopage)
 		return do_anonymous_page(mm, vma, page_table,
-					pmd, write_access, address);
+					pmd, write_access, address, orig_entry);
 	pte_unmap(page_table);
-	spin_unlock(&mm->page_table_lock);
+	page_table_atomic_stop(mm);
 
 	if (vma->vm_file) {
 		mapping = vma->vm_file->f_mapping;
@@ -1940,7 +1949,7 @@ oom:
  * nonlinear vmas.
  */
 static int do_file_page(struct mm_struct * mm, struct vm_area_struct * vma,
-	unsigned long address, int write_access, pte_t *pte, pmd_t *pmd)
+	unsigned long address, int write_access, pte_t *pte, pmd_t *pmd, pte_t entry)
 {
 	unsigned long pgoff;
 	int err;
@@ -1953,13 +1962,13 @@ static int do_file_page(struct mm_struct
 	if (!vma->vm_ops || !vma->vm_ops->populate || 
 			(write_access && !(vma->vm_flags & VM_SHARED))) {
 		pte_clear(mm, address, pte);
-		return do_no_page(mm, vma, address, write_access, pte, pmd);
+		return do_no_page(mm, vma, address, write_access, pte, pmd, entry);
 	}
 
-	pgoff = pte_to_pgoff(*pte);
+	pgoff = pte_to_pgoff(entry);
 
 	pte_unmap(pte);
-	spin_unlock(&mm->page_table_lock);
+	page_table_atomic_stop(mm);
 
 	err = vma->vm_ops->populate(vma, address & PAGE_MASK, PAGE_SIZE, vma->vm_page_prot, pgoff, 0);
 	if (err == -ENOMEM)
@@ -1978,50 +1987,72 @@ static int do_file_page(struct mm_struct
  * with external mmu caches can use to update those (ie the Sparc or
  * PowerPC hashed page tables that act as extended TLBs).
  *
- * Note the "page_table_lock". It is to protect against kswapd removing
- * pages from under us. Note that kswapd only ever _removes_ pages, never
- * adds them. As such, once we have noticed that the page is not present,
- * we can drop the lock early.
- *
- * The adding of pages is protected by the MM semaphore (which we hold),
- * so we don't need to worry about a page being suddenly been added into
- * our VM.
- *
- * We enter with the pagetable spinlock held, we are supposed to
- * release it when done.
- */
+ * Note that kswapd only ever _removes_ pages, never adds them.
+ * We exploit that case if possible to avoid taking the
+ * page table lock.
+*/
 static inline int handle_pte_fault(struct mm_struct *mm,
 	struct vm_area_struct * vma, unsigned long address,
 	int write_access, pte_t *pte, pmd_t *pmd)
 {
 	pte_t entry;
+	pte_t new_entry;
 
 	entry = *pte;
 	if (!pte_present(entry)) {
 		/*
-		 * If it truly wasn't present, we know that kswapd
-		 * and the PTE updates will not touch it later. So
-		 * drop the lock.
+		 * Pass the value of the pte to do_no_page and do_file_page
+		 * This value may be used to verify that the pte is still
+		 * not present allowing atomic insertion of ptes.
 		 */
 		if (pte_none(entry))
-			return do_no_page(mm, vma, address, write_access, pte, pmd);
+			return do_no_page(mm, vma, address, write_access, pte, pmd, entry);
 		if (pte_file(entry))
-			return do_file_page(mm, vma, address, write_access, pte, pmd);
+			return do_file_page(mm, vma, address, write_access, pte, pmd, entry);
 		return do_swap_page(mm, vma, address, pte, pmd, entry, write_access);
 	}
 
+	new_entry = pte_mkyoung(entry);
 	if (write_access) {
-		if (!pte_write(entry))
-			return do_wp_page(mm, vma, address, pte, pmd, entry);
-
+		if (!pte_write(entry)) {
+#ifdef CONFIG_ATOMIC_TABLE_OPS
+			/* do_wp_page modifies a pte. We can add a pte without the
+			 * page_table_lock but not modify a pte since a cmpxchg
+			 * does not allow us to verify that the page was not
+			 * changed under us. So acquire the page table lock.
+			 */
+			spin_lock(&mm->page_table_lock);
+			if (pte_same(entry, *pte))
+#endif
+				return do_wp_page(mm, vma, address, pte, pmd, entry);
+#ifdef CONFIG_ATOMIC_TABLE_OPS
+			/* pte was changed under us. Another processor may have
+			 * done what we needed to do.
+			 */
+			pte_unmap(pte);
+			spin_unlock(&mm->page_table_lock);
+			return VM_FAULT_MINOR;
+#endif
+		}
-		entry = pte_mkdirty(entry);
+		new_entry = pte_mkdirty(new_entry);
 	}
-	entry = pte_mkyoung(entry);
-	ptep_set_access_flags(vma, address, pte, entry, write_access);
-	update_mmu_cache(vma, address, entry);
-	lazy_mmu_prot_update(entry);
+
+	/*
+	 * If the cmpxchg fails then another processor may have done
+	 * the changes for us. If not then another fault will bring
+	 * another chance to do this again.
+	 */
+	if (ptep_cmpxchg(mm, address, pte, entry, new_entry)) {
+		flush_tlb_page(vma, address);
+		update_mmu_cache(vma, address, entry);
+		lazy_mmu_prot_update(entry);
+	} else
+		inc_page_state(cmpxchg_fail_flag_update);
+
 	pte_unmap(pte);
-	spin_unlock(&mm->page_table_lock);
+	page_table_atomic_stop(mm);
+	if (pte_val(new_entry) == pte_val(entry))
+		inc_page_state(spurious_page_faults);
 	return VM_FAULT_MINOR;
 }
 
@@ -2040,33 +2071,73 @@ int handle_mm_fault(struct mm_struct *mm
 
 	inc_page_state(pgfault);
 
-	if (is_vm_hugetlb_page(vma))
+	if (unlikely(is_vm_hugetlb_page(vma)))
 		return VM_FAULT_SIGBUS;	/* mapping truncation does this. */
 
 	/*
-	 * We need the page table lock to synchronize with kswapd
-	 * and the SMP-safe atomic PTE updates.
+	 * We try to rely on the mmap_sem and the SMP-safe atomic PTE updates
+	 * to synchronize with kswapd. However, the arch may fall back
+	 * in page_table_atomic_start to the page table lock.
+	 *
+	 * We may be able to avoid taking and releasing the page_table_lock
+	 * for the p??_alloc functions through atomic operations so we
+	 * duplicate the functionality of pmd_alloc, pud_alloc and
+	 * pte_alloc_map here.
 	 */
+	page_table_atomic_start(mm);
 	pgd = pgd_offset(mm, address);
-	spin_lock(&mm->page_table_lock);
+	if (unlikely(pgd_none(*pgd))) {
+		pud_t *new;
 
-	pud = pud_alloc(mm, pgd, address);
-	if (!pud)
-		goto oom;
+		page_table_atomic_stop(mm);
+		new = pud_alloc_one(mm, address);
 
-	pmd = pmd_alloc(mm, pud, address);
-	if (!pmd)
-		goto oom;
+		if (!new)
+			return VM_FAULT_OOM;
 
-	pte = pte_alloc_map(mm, pmd, address);
-	if (!pte)
-		goto oom;
+		page_table_atomic_start(mm);
+		if (!pgd_test_and_populate(mm, pgd, new))
+			pud_free(new);
+	}
+
+	pud = pud_offset(pgd, address);
+	if (unlikely(pud_none(*pud))) {
+		pmd_t *new;
+
+		page_table_atomic_stop(mm);
+		new = pmd_alloc_one(mm, address);
+
+		if (!new)
+			return VM_FAULT_OOM;
 	
-	return handle_pte_fault(mm, vma, address, write_access, pte, pmd);
+		page_table_atomic_start(mm);
 
- oom:
-	spin_unlock(&mm->page_table_lock);
-	return VM_FAULT_OOM;
+		if (!pud_test_and_populate(mm, pud, new))
+			pmd_free(new);
+	}
+
+	pmd = pmd_offset(pud, address);
+	if (unlikely(!pmd_present(*pmd))) {
+		struct page *new;
+
+		page_table_atomic_stop(mm);
+		new = pte_alloc_one(mm, address);
+
+		if (!new)
+			return VM_FAULT_OOM;
+
+		page_table_atomic_start(mm);
+
+		if (!pmd_test_and_populate(mm, pmd, new))
+			pte_free(new);
+		else {
+			inc_page_state(nr_page_table_pages);
+			mm->nr_ptes++;
+		}
+	}
+
+	pte = pte_offset_map(pmd, address);
+	return handle_pte_fault(mm, vma, address, write_access, pte, pmd);
 }
 
 #ifndef __PAGETABLE_PUD_FOLDED
Index: linux-2.6.11/include/asm-generic/pgtable-nopud.h
===================================================================
--- linux-2.6.11.orig/include/asm-generic/pgtable-nopud.h	2005-04-29 12:12:01.000000000 -0700
+++ linux-2.6.11/include/asm-generic/pgtable-nopud.h	2005-04-29 12:12:45.000000000 -0700
@@ -27,8 +27,14 @@ static inline int pgd_bad(pgd_t pgd)		{ 
 static inline int pgd_present(pgd_t pgd)	{ return 1; }
 static inline void pgd_clear(pgd_t *pgd)	{ }
 #define pud_ERROR(pud)				(pgd_ERROR((pud).pgd))
-
 #define pgd_populate(mm, pgd, pud)		do { } while (0)
+
+#define __HAVE_ARCH_PGD_TEST_AND_POPULATE
+static inline int pgd_test_and_populate(struct mm_struct *mm, pgd_t *pgd, pud_t *pud)
+{
+	return 1;
+}
+
 /*
  * (puds are folded into pgds so this doesn't get actually called,
  * but the define is needed for a generic inline function.)
Index: linux-2.6.11/include/asm-generic/pgtable-nopmd.h
===================================================================
--- linux-2.6.11.orig/include/asm-generic/pgtable-nopmd.h	2005-04-29 12:12:01.000000000 -0700
+++ linux-2.6.11/include/asm-generic/pgtable-nopmd.h	2005-04-29 12:12:45.000000000 -0700
@@ -31,6 +31,11 @@ static inline void pud_clear(pud_t *pud)
 #define pmd_ERROR(pmd)				(pud_ERROR((pmd).pud))
 
 #define pud_populate(mm, pmd, pte)		do { } while (0)
+#define __HAVE_ARCH_PUD_TEST_AND_POPULATE
+static inline int pud_test_and_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
+{
+	return 1;
+}
 
 /*
  * (pmds are folded into puds so this doesn't get actually called,
Index: linux-2.6.11/include/asm-generic/pgtable.h
===================================================================
--- linux-2.6.11.orig/include/asm-generic/pgtable.h	2005-04-29 12:12:38.000000000 -0700
+++ linux-2.6.11/include/asm-generic/pgtable.h	2005-04-29 12:12:45.000000000 -0700
@@ -141,6 +141,65 @@ do {				  					  \
 })
 #endif
 
+/*
+ * page_table_atomic_start and page_table_atomic_stop may be used to
+ * define special measures that an arch needs to guarantee atomic
+ * operations outside of a spinlock. In the case that an arch does
+ * not support atomic page table operations we will fall back to the
+ * page table lock.
+ */
+#ifndef __HAVE_ARCH_PAGE_TABLE_ATOMIC_START
+#define page_table_atomic_start(mm) do { } while (0)
+#endif
+
+#ifndef __HAVE_ARCH_PAGE_TABLE_ATOMIC_STOP
+#define page_table_atomic_stop(mm) do { } while (0)
+#endif
+
+/*
+ * Fallback functions for atomic population of higher page table
+ * structures. These simply acquire the page_table_lock for
+ * synchronization. An architecture may override these generic
+ * functions to provide atomic populate functions to make these
+ * more effective.
+ */
+
+#ifndef __HAVE_ARCH_PGD_TEST_AND_POPULATE
+#define pgd_test_and_populate(__mm, __pgd, __pud)			\
+({									\
+	int __rc;							\
+	spin_lock(&(__mm)->page_table_lock);				\
+	__rc = pgd_none(*(__pgd));					\
+	if (__rc) pgd_populate(__mm, __pgd, __pud);			\
+	spin_unlock(&(__mm)->page_table_lock);				\
+	__rc;								\
+})
+#endif
+
+#ifndef __HAVE_ARCH_PUD_TEST_AND_POPULATE
+#define pud_test_and_populate(__mm, __pud, __pmd)			\
+({									\
+	int __rc;							\
+	spin_lock(&(__mm)->page_table_lock);				\
+	__rc = pud_none(*(__pud));					\
+	if (__rc) pud_populate(__mm, __pud, __pmd);			\
+	spin_unlock(&(__mm)->page_table_lock);				\
+	__rc;								\
+})
+#endif
+
+#ifndef __HAVE_ARCH_PMD_TEST_AND_POPULATE
+#define pmd_test_and_populate(__mm, __pmd, __page)			\
+({									\
+	int __rc;							\
+	spin_lock(&(__mm)->page_table_lock);				\
+	__rc = !pmd_present(*(__pmd));					\
+	if (__rc) pmd_populate(__mm, __pmd, __page);			\
+	spin_unlock(&(__mm)->page_table_lock);				\
+	__rc;								\
+})
+#endif
+
 #else
 
 /*
@@ -151,6 +210,11 @@ do {				  					  \
  * short time frame. This means that the page_table_lock must be held
  * to avoid a page fault that would install a new entry.
  */
+
+/* Fall back to the page table lock to synchronize page table access */
+#define page_table_atomic_start(mm)	spin_lock(&(mm)->page_table_lock)
+#define page_table_atomic_stop(mm)	spin_unlock(&(mm)->page_table_lock)
+
 #ifndef __HAVE_ARCH_PTEP_XCHG
 #define ptep_xchg(__mm, __address, __ptep, __pteval)			\
 ({									\
@@ -195,6 +259,41 @@ do {				  					  \
 	r;								\
 })
 #endif
+
+/*
+ * Fallback functions for atomic population of higher page table
+ * structures. These rely on the page_table_lock being held.
+ */
+#ifndef __HAVE_ARCH_PGD_TEST_AND_POPULATE
+#define pgd_test_and_populate(__mm, __pgd, __pud)			\
+({									\
+	int __rc;							\
+	__rc = pgd_none(*(__pgd));					\
+	if (__rc) pgd_populate(__mm, __pgd, __pud);			\
+	__rc;								\
+})
+#endif
+
+#ifndef __HAVE_ARCH_PUD_TEST_AND_POPULATE
+#define pud_test_and_populate(__mm, __pud, __pmd)			\
+({									\
+	int __rc;							\
+	__rc = pud_none(*(__pud));					\
+	if (__rc) pud_populate(__mm, __pud, __pmd);			\
+	__rc;								\
+})
+#endif
+
+#ifndef __HAVE_ARCH_PMD_TEST_AND_POPULATE
+#define pmd_test_and_populate(__mm, __pmd, __page)			\
+({									\
+	int __rc;							\
+	__rc = !pmd_present(*(__pmd));					\
+	if (__rc) pmd_populate(__mm, __pmd, __page);			\
+	__rc;								\
+})
+#endif
+
 #endif
 
 #ifndef __HAVE_ARCH_PTEP_SET_WRPROTECT
Index: linux-2.6.11/include/linux/page-flags.h
===================================================================
--- linux-2.6.11.orig/include/linux/page-flags.h	2005-04-29 12:12:02.000000000 -0700
+++ linux-2.6.11/include/linux/page-flags.h	2005-04-29 12:12:45.000000000 -0700
@@ -131,6 +131,13 @@ struct page_state {
 	unsigned long allocstall;	/* direct reclaim calls */
 
 	unsigned long pgrotated;	/* pages rotated to tail of the LRU */
+
+	/* Page fault related counters */
+	unsigned long spurious_page_faults;	/* Faults with no ops */
+	unsigned long cmpxchg_fail_flag_update;	/* cmpxchg failures for pte flag update */
+	unsigned long cmpxchg_fail_flag_reuse;	/* cmpxchg failures on COW reuse of pte */
+	unsigned long cmpxchg_fail_anon_read;	/* cmpxchg failures on anonymous read */
+	unsigned long cmpxchg_fail_anon_write;	/* cmpxchg failures on anonymous write */
 };
 
 extern void get_page_state(struct page_state *ret);
Index: linux-2.6.11/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.11.orig/fs/proc/proc_misc.c	2005-04-29 12:12:01.000000000 -0700
+++ linux-2.6.11/fs/proc/proc_misc.c	2005-04-29 12:12:45.000000000 -0700
@@ -128,7 +128,7 @@ static int meminfo_read_proc(char *page,
 	struct vmalloc_info vmi;
 	long cached;
 
-	get_page_state(&ps);
+	get_full_page_state(&ps);
 	get_zone_counts(&active, &inactive, &free);
 
 /*
@@ -173,7 +173,12 @@ static int meminfo_read_proc(char *page,
 		"PageTables:   %8lu kB\n"
 		"VmallocTotal: %8lu kB\n"
 		"VmallocUsed:  %8lu kB\n"
-		"VmallocChunk: %8lu kB\n",
+		"VmallocChunk: %8lu kB\n"
+		"Spurious page faults    : %8lu\n"
+		"cmpxchg fail flag update: %8lu\n"
+		"cmpxchg fail COW reuse  : %8lu\n"
+		"cmpxchg fail anon read  : %8lu\n"
+		"cmpxchg fail anon write : %8lu\n",
 		K(i.totalram),
 		K(i.freeram),
 		K(i.bufferram),
@@ -196,7 +201,12 @@ static int meminfo_read_proc(char *page,
 		K(ps.nr_page_table_pages),
 		(unsigned long)VMALLOC_TOTAL >> 10,
 		vmi.used >> 10,
-		vmi.largest_chunk >> 10
+		vmi.largest_chunk >> 10,
+		ps.spurious_page_faults,
+		ps.cmpxchg_fail_flag_update,
+		ps.cmpxchg_fail_flag_reuse,
+		ps.cmpxchg_fail_anon_read,
+		ps.cmpxchg_fail_anon_write
 		);
 
 		len += hugetlb_report_meminfo(page + len);
Index: linux-2.6.11/include/asm-ia64/pgalloc.h
===================================================================
--- linux-2.6.11.orig/include/asm-ia64/pgalloc.h	2005-04-29 12:12:01.000000000 -0700
+++ linux-2.6.11/include/asm-ia64/pgalloc.h	2005-04-29 12:14:58.000000000 -0700
@@ -22,6 +22,10 @@
 
 #include <asm/mmu_context.h>
 
+/* Empty entries of PMD and PUD */
+#define PMD_NONE       0
+#define PUD_NONE       0
+
 /*
  * Very stupidly, we used to get new pgd's and pmd's, init their contents
  * to point to the NULL versions of the next level page table, later on
@@ -108,6 +112,21 @@ pmd_alloc_one (struct mm_struct *mm, uns
 	return pmd;
 }
 
+/* Atomic populate */
+static inline int
+pud_test_and_populate (struct mm_struct *mm, pud_t *pud_entry, pmd_t *pmd)
+{
+	return ia64_cmpxchg8_acq(pud_entry, __pa(pmd), PUD_NONE) == PUD_NONE;
+}
+
+/* Atomic populate */
+static inline int
+pmd_test_and_populate (struct mm_struct *mm, pmd_t *pmd_entry, struct page *pte)
+{
+	return ia64_cmpxchg8_acq(pmd_entry, page_to_phys(pte), PMD_NONE) == PMD_NONE;
+}
+
+
 static inline void
 pmd_free (pmd_t *pmd)
 {
Index: linux-2.6.11/include/asm-ia64/pgtable.h
===================================================================
--- linux-2.6.11.orig/include/asm-ia64/pgtable.h	2005-04-29 12:12:01.000000000 -0700
+++ linux-2.6.11/include/asm-ia64/pgtable.h	2005-04-29 12:12:45.000000000 -0700
@@ -560,6 +560,8 @@ do {											\
 #define __HAVE_ARCH_PTE_SAME
 #define __HAVE_ARCH_PGD_OFFSET_GATE
 #define __HAVE_ARCH_LAZY_MMU_PROT_UPDATE
+#define __HAVE_ARCH_PUD_TEST_AND_POPULATE
+#define __HAVE_ARCH_PMD_TEST_AND_POPULATE
 
 #include <asm-generic/pgtable-nopud.h>
 #include <asm-generic/pgtable.h>
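
The allocate-outside-the-lock, then test-and-populate pattern that
handle_mm_fault uses above can be sketched in user space roughly as
follows. This is only an illustration built on C11 atomics: sim_table,
sim_table_new, sim_test_and_populate and sim_get_or_alloc are
hypothetical names and are not part of the patch or of the kernel.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

#define SIM_ENTRIES 8	/* hypothetical number of slots per table */

/* Hypothetical stand-in for one level of a page table. */
struct sim_table {
	_Atomic(struct sim_table *) slot[SIM_ENTRIES];
};

/* Allocate a table with all slots empty (NULL). Returns NULL on OOM. */
static struct sim_table *sim_table_new(void)
{
	struct sim_table *t = malloc(sizeof(*t));
	size_t i;

	if (t == NULL)
		return NULL;
	for (i = 0; i < SIM_ENTRIES; i++)
		atomic_init(&t->slot[i], NULL);
	return t;
}

/*
 * Populate the slot only if it is still empty. Returns 1 if this
 * caller installed new_table, 0 if someone else got there first.
 */
static int sim_test_and_populate(_Atomic(struct sim_table *) *slot,
				 struct sim_table *new_table)
{
	struct sim_table *expected = NULL;

	assert(new_table != NULL);
	return atomic_compare_exchange_strong(slot, &expected, new_table);
}

/* Return the child at idx, allocating it on demand. NULL means OOM. */
static struct sim_table *sim_get_or_alloc(struct sim_table *parent,
					  unsigned int idx)
{
	if (atomic_load(&parent->slot[idx]) == NULL) {
		/* Allocation happens without any lock held. */
		struct sim_table *new_table = sim_table_new();

		if (new_table == NULL)
			return NULL;
		/* Lost the race: drop our copy and use the winner's. */
		if (!sim_test_and_populate(&parent->slot[idx], new_table))
			free(new_table);
	}
	return atomic_load(&parent->slot[idx]);
}
```

A caller that loses the race simply frees its freshly allocated table
and continues with whatever the winner installed, which mirrors the
pud_free/pmd_free/pte_free calls in the hunk above.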
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

* [PATCH 3/3] Page Fault Scalability V20: Avoid lock for anonymous write fault
  2005-04-29 19:59 [PATCH 0/3] Page Fault Scalability V20: Overview Christoph Lameter
  2005-04-29 19:59 ` [PATCH 1/3] Page Fault Scalability V20: Avoid spurious page faults Christoph Lameter
  2005-04-29 19:59 ` [PATCH 2/3] Page Fault Scalability V20: Avoid first acquisition of lock Christoph Lameter
@ 2005-04-29 19:59 ` Christoph Lameter
  2005-04-29 21:02   ` Christoph Hellwig
  2 siblings, 1 reply; 6+ messages in thread
From: Christoph Lameter @ 2005-04-29 19:59 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, linux-ia64, Christoph Lameter

Do not use the page_table_lock in do_anonymous_page. This will significantly
increase the parallelism in the page fault handler for SMP systems. The patch
also modifies the definitions of _mm_counter functions so that rss and anon_rss
become atomic (and will use atomic64_t if available).
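
As a rough user-space illustration of the idea (not the kernel code
itself), installing an anonymous pte without the page_table_lock comes
down to a compare-and-exchange against the value that was read when the
fault was first examined, plus an atomic counter update. The names
sim_pte_t, sim_mm and sim_install_anon_pte below are hypothetical and
exist only for this sketch, which assumes C11 atomics.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; these are not the kernel's types. */
typedef _Atomic uint64_t sim_pte_t;
#define SIM_PTE_NONE 0ULL

struct sim_mm {
	atomic_long rss;	/* resident pages, updated without a lock */
};

/*
 * Try to replace the pte value seen earlier (orig) with new_val.
 * Returns true if this caller won. A false return means another CPU
 * changed the pte first; the caller must then release the page it
 * prepared and return, much like the cmpxchg_fail_anon_write path.
 */
static bool sim_install_anon_pte(struct sim_mm *mm, sim_pte_t *pte,
				 uint64_t orig, uint64_t new_val)
{
	uint64_t expected = orig;

	assert(new_val != SIM_PTE_NONE);
	if (!atomic_compare_exchange_strong(pte, &expected, new_val))
		return false;

	/* No spinlock protects the counter, so it must be atomic too. */
	atomic_fetch_add(&mm->rss, 1);
	return true;
}
```

Only the winner of the exchange touches the counter, so rss stays
consistent even when several threads fault on the same address.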

For the benefit of these performance enhancements, see the charts at
http://oss.sgi.com/projects/page_fault_performance/atomic-ptes.pdf

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.11/mm/memory.c
===================================================================
--- linux-2.6.11.orig/mm/memory.c	2005-04-29 10:31:50.000000000 -0700
+++ linux-2.6.11/mm/memory.c	2005-04-29 10:33:06.000000000 -0700
@@ -1790,12 +1790,12 @@ do_anonymous_page(struct mm_struct *mm, 
 	entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
 						vma->vm_page_prot)),
 				vma);
-	spin_lock(&mm->page_table_lock);
+	page_table_atomic_start(mm);
 
 	if (!ptep_cmpxchg(mm, addr, page_table, orig_entry, entry)) {
 		pte_unmap(page_table);
 		page_cache_release(page);
-		spin_unlock(&mm->page_table_lock);
+		page_table_atomic_stop(mm);
 		inc_page_state(cmpxchg_fail_anon_write);
 		return VM_FAULT_MINOR;
         }
@@ -1811,7 +1811,7 @@ do_anonymous_page(struct mm_struct *mm, 
 	pte_unmap(page_table);
 	update_mmu_cache(vma, addr, entry);
 	lazy_mmu_prot_update(entry);
-	spin_unlock(&mm->page_table_lock);
+	page_table_atomic_stop(mm);
 	return VM_FAULT_MINOR;
 }
 
Index: linux-2.6.11/include/linux/sched.h
===================================================================
--- linux-2.6.11.orig/include/linux/sched.h	2005-04-29 08:25:55.000000000 -0700
+++ linux-2.6.11/include/linux/sched.h	2005-04-29 10:33:06.000000000 -0700
@@ -204,12 +204,43 @@ arch_get_unmapped_area_topdown(struct fi
 extern void arch_unmap_area(struct vm_area_struct *area);
 extern void arch_unmap_area_topdown(struct vm_area_struct *area);
 
+#ifdef CONFIG_ATOMIC_TABLE_OPS
+/*
+ * No spinlock is held during atomic page table operations. The
+ * counters are not protected anymore and must also be
+ * incremented atomically.
+ */
+#ifdef ATOMIC64_INIT
+#define set_mm_counter(mm, member, value) atomic64_set(&(mm)->_##member, value)
+#define get_mm_counter(mm, member) ((unsigned long)atomic64_read(&(mm)->_##member))
+#define add_mm_counter(mm, member, value) atomic64_add(value, &(mm)->_##member)
+#define inc_mm_counter(mm, member) atomic64_inc(&(mm)->_##member)
+#define dec_mm_counter(mm, member) atomic64_dec(&(mm)->_##member)
+typedef atomic64_t mm_counter_t;
+#else
+/*
+ * This may limit process memory to 2^31 * PAGE_SIZE which may be around 8TB
+ * if using 4KB page size
+ */
+#define set_mm_counter(mm, member, value) atomic_set(&(mm)->_##member, value)
+#define get_mm_counter(mm, member) ((unsigned long)atomic_read(&(mm)->_##member))
+#define add_mm_counter(mm, member, value) atomic_add(value, &(mm)->_##member)
+#define inc_mm_counter(mm, member) atomic_inc(&(mm)->_##member)
+#define dec_mm_counter(mm, member) atomic_dec(&(mm)->_##member)
+typedef atomic_t mm_counter_t;
+#endif
+#else
+/*
+ * No atomic page table operations. Counters are protected by
+ * the page table lock
+ */
 #define set_mm_counter(mm, member, value) (mm)->_##member = (value)
 #define get_mm_counter(mm, member) ((mm)->_##member)
 #define add_mm_counter(mm, member, value) (mm)->_##member += (value)
 #define inc_mm_counter(mm, member) (mm)->_##member++
 #define dec_mm_counter(mm, member) (mm)->_##member--
 typedef unsigned long mm_counter_t;
+#endif
 
 struct mm_struct {
 	struct vm_area_struct * mmap;		/* list of VMAs */

* Re: [PATCH 3/3] Page Fault Scalability V20: Avoid lock for anonymous write fault
  2005-04-29 19:59 ` [PATCH 3/3] Page Fault Scalability V20: Avoid lock for anonymous write fault Christoph Lameter
@ 2005-04-29 21:02   ` Christoph Hellwig
  2005-04-29 23:06     ` Christoph Lameter
  0 siblings, 1 reply; 6+ messages in thread
From: Christoph Hellwig @ 2005-04-29 21:02 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel, linux-mm, linux-ia64

On Fri, Apr 29, 2005 at 12:59:17PM -0700, Christoph Lameter wrote:
> Do not use the page_table_lock in do_anonymous_page. This will significantly
> increase the parallelism in the page fault handler for SMP systems. The patch
> also modifies the definitions of _mm_counter functions so that rss and anon_rss
> become atomic (and will use atomic64_t if available).

I thought we said all architectures should provide an atomic64_t (and
given that it's not actually 64bit on 32bit architecture we should
probably rename it to atomic_long_t)


* Re: [PATCH 3/3] Page Fault Scalability V20: Avoid lock for anonymous write fault
  2005-04-29 21:02   ` Christoph Hellwig
@ 2005-04-29 23:06     ` Christoph Lameter
  0 siblings, 0 replies; 6+ messages in thread
From: Christoph Lameter @ 2005-04-29 23:06 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-kernel, linux-mm, linux-ia64

On Fri, 29 Apr 2005, Christoph Hellwig wrote:

> On Fri, Apr 29, 2005 at 12:59:17PM -0700, Christoph Lameter wrote:
> > Do not use the page_table_lock in do_anonymous_page. This will significantly
> > increase the parallelism in the page fault handler for SMP systems. The patch
> > also modifies the definitions of _mm_counter functions so that rss and anon_rss
> > become atomic (and will use atomic64_t if available).
>
> I thought we said all architectures should provide an atomic64_t (and
> given that it's not actually 64bit on 32bit architecture we should
> probably rename it to atomic_long_t)

Yes, the way atomic types are provided may need a revision.
First of all, we need atomic types that are size bound:

	atomic8_t
	atomic16_t
	atomic32_t

and (if available)

	atomic64_t

and then some aliases

	atomic_t -> atomic type for int
	atomic_long_t -> atomic type for long

If these types are available then this patch could be cleaned up to
just use atomic_long_t.
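
To make the cleanup concrete, here is a user-space sketch of what the
counter macros might look like with a single long-sized atomic type. It
uses C11 atomic_long as a stand-in for a kernel atomic_long_t; every
sim_ name is hypothetical, and the limit helper only reproduces the
2^31 * PAGE_SIZE arithmetic from the comment in the patch.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical alias: one counter type, sized like long. */
typedef atomic_long sim_mm_counter_t;

struct sim_mm {
	sim_mm_counter_t _rss;
	sim_mm_counter_t _anon_rss;
};

/* Counter accessors in the style of the _mm_counter macros. */
#define sim_set_mm_counter(mm, member, value) \
	atomic_store(&(mm)->_##member, (value))
#define sim_get_mm_counter(mm, member) \
	((unsigned long)atomic_load(&(mm)->_##member))
#define sim_add_mm_counter(mm, member, value) \
	atomic_fetch_add(&(mm)->_##member, (value))
#define sim_inc_mm_counter(mm, member) \
	atomic_fetch_add(&(mm)->_##member, 1)
#define sim_dec_mm_counter(mm, member) \
	atomic_fetch_sub(&(mm)->_##member, 1)

/*
 * Memory that a signed 32-bit page counter can describe:
 * 2^31 pages times the page size in bytes.
 */
static unsigned long long sim_counter_limit_bytes(unsigned long long page_size)
{
	return (1ULL << 31) * page_size;
}
```

With 4KB pages the helper yields 2^43 bytes, which is the 8TB figure
mentioned for the atomic_t fallback.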