linux-mm.kvack.org archive mirror
* pagefault scalability patches
@ 2005-08-17 22:17 Andrew Morton
  2005-08-17 22:19 ` Christoph Lameter
                   ` (3 more replies)
  0 siblings, 4 replies; 47+ messages in thread
From: Andrew Morton @ 2005-08-17 22:17 UTC (permalink / raw)
  To: Linus Torvalds, Hugh Dickins; +Cc: Christoph Lameter, Nick Piggin, linux-mm

[-- Attachment #1: Type: text/plain, Size: 629 bytes --]


These are getting in the way now, and I need to make a go/no-go decision.

I have vague feelings of ickiness with the patches wrt:

a) general increase of complexity

b) the fact that they only partially address the problem: anonymous page
   faults are addressed, but lots of other places aren't.

c) the fact that they address one particular part of one particular
   workload on exceedingly rare machines.

I believe that Nick has plans to address b).

I'd like us to thrash this out (again), please.  Hugh, could you (for the
nth and final time) describe your concerns with these patches?

Thanks.

(Three patches attached)

[-- Attachment #2: page-fault-patches-introduce-pte_xchg-and-pte_cmpxchg.patch --]
[-- Type: application/octet-stream, Size: 12175 bytes --]


From: Christoph Lameter <christoph@lameter.com>

Updating a page table entry (pte) can be difficult since the MMU may modify
the pte concurrently.  The current approach taken is to first exchange the pte
contents with zero.  Clearing the pte by writing zero to it also resets the
present bit, which will stop the MMU from modifying the pte and allows the
processing of the bits that were set.  Then the pte is set to its new value.

While the present bit is not set, accesses to the page mapped by the pte will
result in page faults, which may install a new pte over the non-present
entry.  To avoid that scenario the page_table_lock is held.  An access will
still result in a page fault, but the fault handler will also try to acquire
the page_table_lock.  The page_table_lock is released after the pte has been
set up by the first process.  The second process will then acquire the
page_table_lock, find that a pte has already been set up for this page, and
return without having done anything.

This means that a useless page fault has been generated.

However, most architectures have the capability to atomically exchange the
value of the pte.  For those, clearing the pte before setting it to a new
value is not necessary.  The use of atomic exchanges avoids these useless
page faults.
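
To make the difference concrete, here is a minimal sketch (not part of the
patch) of the two update sequences for replacing a pte with new_val; new_val
and old are just illustrative names:

	/*
	 * Traditional update: clear first, then set.  The pte is
	 * transiently not present, so the page_table_lock must be held
	 * to keep a concurrent fault from installing a new pte.
	 */
	old = ptep_get_and_clear(mm, addr, ptep);
	set_pte_at(mm, addr, ptep, new_val);

	/*
	 * Atomic update: the pte goes directly from the old value to
	 * the new one, so there is no window with a cleared pte and no
	 * spurious fault can be triggered.
	 */
	old = ptep_xchg(mm, addr, ptep, new_val);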

The following patch introduces two new atomic operations ptep_xchg and
ptep_cmpxchg that may be provided by an architecture.  The fallback in
include/asm-generic/pgtable.h is to simulate both operations through the
existing ptep_get_and_clear function.  So there is essentially no change if
atomic operations on ptes have not been defined.  Architectures that do not
support atomic operations on ptes may continue to use the clearing of a pte.

Atomic operations are enabled for i386, ia64 and x86_64 if a suitable CPU is
configured in SMP mode.  Generic atomic definitions for ptep_xchg and
ptep_cmpxchg have been provided based on the existing xchg() and cmpxchg()
functions that already work atomically on many platforms.

The provided generic atomic functions may be overridden as usual by defining
the appropriate __HAVE_ARCH_xxx constant and providing a different
implementation.
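
As a sketch of such an override (hypothetical, not taken from any real
architecture), a port could place something like the following in its
asm/pgtable.h before including asm-generic/pgtable.h:

	/* Hypothetical override sketch; names and placement are assumptions. */
	#define __HAVE_ARCH_PTEP_XCHG
	static inline pte_t ptep_xchg(struct mm_struct *mm, unsigned long addr,
				      pte_t *ptep, pte_t pteval)
	{
		/* Architecture-specific atomic exchange of the pte word. */
		return __pte(xchg(&pte_val(*ptep), pte_val(pteval)));
	}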

This patch is a piece of my attempt to reduce the use of the page_table_lock
in the page fault handler through atomic operations.  This is only possible if
it can be ensured that a pte is never cleared if the pte is in use even when
the page_table_lock is not held.  Clearing a pte before setting it to another
value could result in a situation in which a fault generated by another cpu
could install a pte which is then immediately overwritten by the first CPU
setting the pte to a valid value again.  The patch is necessary for the other
patches removing the use of the page_table_lock to work properly.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
---

 arch/i386/Kconfig             |    5 ++
 arch/ia64/Kconfig             |    5 ++
 arch/x86_64/Kconfig           |    5 ++
 include/asm-generic/pgtable.h |   86 ++++++++++++++++++++++++++++++++++++++++++
 mm/memory.c                   |   14 ++++--
 mm/mprotect.c                 |   22 +++++-----
 mm/rmap.c                     |   22 +++++-----
 7 files changed, 133 insertions(+), 26 deletions(-)

diff -puN arch/i386/Kconfig~page-fault-patches-introduce-pte_xchg-and-pte_cmpxchg arch/i386/Kconfig
--- 25/arch/i386/Kconfig~page-fault-patches-introduce-pte_xchg-and-pte_cmpxchg	Wed Aug 17 14:53:01 2005
+++ 25-akpm/arch/i386/Kconfig	Wed Aug 17 15:09:24 2005
@@ -905,6 +905,11 @@ config HAVE_DEC_LOCK
 	depends on (SMP || PREEMPT) && X86_CMPXCHG
 	default y
 
+config ATOMIC_TABLE_OPS
+	bool
+	depends on SMP && X86_CMPXCHG && !X86_PAE
+	default y
+
 # turning this on wastes a bunch of space.
 # Summit needs it only when NUMA is on
 config BOOT_IOREMAP
diff -puN arch/ia64/Kconfig~page-fault-patches-introduce-pte_xchg-and-pte_cmpxchg arch/ia64/Kconfig
--- 25/arch/ia64/Kconfig~page-fault-patches-introduce-pte_xchg-and-pte_cmpxchg	Wed Aug 17 14:53:01 2005
+++ 25-akpm/arch/ia64/Kconfig	Wed Aug 17 15:09:24 2005
@@ -297,6 +297,11 @@ config PREEMPT
 
 source "mm/Kconfig"
 
+config ATOMIC_TABLE_OPS
+	bool
+	depends on SMP
+	default y
+
 config HAVE_DEC_LOCK
 	bool
 	depends on (SMP || PREEMPT)
diff -puN arch/x86_64/Kconfig~page-fault-patches-introduce-pte_xchg-and-pte_cmpxchg arch/x86_64/Kconfig
--- 25/arch/x86_64/Kconfig~page-fault-patches-introduce-pte_xchg-and-pte_cmpxchg	Wed Aug 17 14:53:01 2005
+++ 25-akpm/arch/x86_64/Kconfig	Wed Aug 17 15:09:24 2005
@@ -217,6 +217,11 @@ config SCHED_SMT
 	  cost of slightly increased overhead in some places. If unsure say
 	  N here.
 
+config ATOMIC_TABLE_OPS
+	bool
+	  depends on SMP
+	  default y
+
 source "kernel/Kconfig.preempt"
 
 config K8_NUMA
diff -puN include/asm-generic/pgtable.h~page-fault-patches-introduce-pte_xchg-and-pte_cmpxchg include/asm-generic/pgtable.h
--- 25/include/asm-generic/pgtable.h~page-fault-patches-introduce-pte_xchg-and-pte_cmpxchg	Wed Aug 17 14:53:01 2005
+++ 25-akpm/include/asm-generic/pgtable.h	Wed Aug 17 15:09:38 2005
@@ -111,6 +111,92 @@ do {				  					  \
 })
 #endif
 
+#ifdef CONFIG_ATOMIC_TABLE_OPS
+
+/*
+ * The architecture does support atomic table operations.
+ * We may be able to provide atomic ptep_xchg and ptep_cmpxchg using
+ * cmpxchg and xchg.
+ */
+#ifndef __HAVE_ARCH_PTEP_XCHG
+#define ptep_xchg(__mm, __address, __ptep, __pteval) \
+	__pte(xchg(&pte_val(*(__ptep)), pte_val(__pteval)))
+#endif
+
+#ifndef __HAVE_ARCH_PTEP_CMPXCHG
+#define ptep_cmpxchg(__mm, __address, __ptep,__oldval,__newval)		\
+	(cmpxchg(&pte_val(*(__ptep)),					\
+			pte_val(__oldval),				\
+			pte_val(__newval)				\
+		) == pte_val(__oldval)					\
+	)
+#endif
+
+#ifndef __HAVE_ARCH_PTEP_XCHG_FLUSH
+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval)		\
+({									\
+	pte_t __pte = ptep_xchg(__vma, __address, __ptep, __pteval);	\
+	flush_tlb_page(__vma, __address);				\
+	__pte;								\
+})
+#endif
+
+#else
+
+/*
+ * No support for atomic operations on the page table.
+ * Exchanging of pte values is done by first swapping zeros into
+ * a pte and then putting new content into the pte entry.
+ * However, these functions will generate an empty pte for a
+ * short time frame. This means that the page_table_lock must be held
+ * to avoid a page fault that would install a new entry.
+ */
+#ifndef __HAVE_ARCH_PTEP_XCHG
+#define ptep_xchg(__mm, __address, __ptep, __pteval)			\
+({									\
+	pte_t __pte = ptep_get_and_clear(__mm, __address, __ptep);	\
+	set_pte_at(__mm, __address, __ptep, __pteval);			\
+	__pte;								\
+})
+#endif
+
+#ifndef __HAVE_ARCH_PTEP_XCHG_FLUSH
+#ifndef __HAVE_ARCH_PTEP_XCHG
+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval)		\
+({									\
+	pte_t __pte = ptep_clear_flush(__vma, __address, __ptep);	\
+	set_pte_at((__vma)->vm_mm, __address, __ptep, __pteval);		\
+	__pte;								\
+})
+#else
+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval)		\
+({									\
+	pte_t __pte = ptep_xchg((__vma)->vm_mm, __address, __ptep, __pteval);\
+	flush_tlb_page(__vma, __address);				\
+	__pte;								\
+})
+#endif
+#endif
+
+/*
+ * The fallback function for ptep_cmpxchg avoids any real use of cmpxchg
+ * since cmpxchg may not be available on certain architectures. Instead
+ * the clearing of a pte is used as a form of locking mechanism.
+ * This approach will only work if the page_table_lock is held to ensure
+ * that the pte is not populated by a page fault generated on another
+ * CPU.
+ */
+#ifndef __HAVE_ARCH_PTEP_CMPXCHG
+#define ptep_cmpxchg(__mm, __address, __ptep, __old, __new)		\
+({									\
+	pte_t prev = ptep_get_and_clear(__mm, __address, __ptep);	\
+	int r = pte_val(prev) == pte_val(__old);			\
+	set_pte_at(__mm, __address, __ptep, r ? (__new) : prev);	\
+	r;								\
+})
+#endif
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_SET_WRPROTECT
 static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long address, pte_t *ptep)
 {
diff -puN mm/memory.c~page-fault-patches-introduce-pte_xchg-and-pte_cmpxchg mm/memory.c
--- 25/mm/memory.c~page-fault-patches-introduce-pte_xchg-and-pte_cmpxchg	Wed Aug 17 14:53:01 2005
+++ 25-akpm/mm/memory.c	Wed Aug 17 15:09:33 2005
@@ -551,15 +551,19 @@ static void zap_pte_range(struct mmu_gat
 				     page->index > details->last_index))
 					continue;
 			}
-			ptent = ptep_get_and_clear(tlb->mm, addr, pte);
-			tlb_remove_tlb_entry(tlb, pte, addr);
-			if (unlikely(!page))
+			if (unlikely(!page)) {
+				ptent = ptep_get_and_clear(tlb->mm, addr, pte);
+				tlb_remove_tlb_entry(tlb, pte, addr);
 				continue;
+			}
 			if (unlikely(details) && details->nonlinear_vma
 			    && linear_page_index(details->nonlinear_vma,
 						addr) != page->index)
-				set_pte_at(tlb->mm, addr, pte,
-					   pgoff_to_pte(page->index));
+				ptent = ptep_xchg(tlb->mm, addr, pte,
+						  pgoff_to_pte(page->index));
+			else
+				ptent = ptep_get_and_clear(tlb->mm, addr, pte);
+			tlb_remove_tlb_entry(tlb, pte, addr);
 			if (pte_dirty(ptent))
 				set_page_dirty(page);
 			if (PageAnon(page))
diff -puN mm/mprotect.c~page-fault-patches-introduce-pte_xchg-and-pte_cmpxchg mm/mprotect.c
--- 25/mm/mprotect.c~page-fault-patches-introduce-pte_xchg-and-pte_cmpxchg	Wed Aug 17 14:53:01 2005
+++ 25-akpm/mm/mprotect.c	Wed Aug 17 14:53:01 2005
@@ -32,17 +32,19 @@ static void change_pte_range(struct mm_s
 
 	pte = pte_offset_map(pmd, addr);
 	do {
-		if (pte_present(*pte)) {
-			pte_t ptent;
+		pte_t ptent;
+redo:
+		ptent = *pte;
+		if (!pte_present(ptent))
+			continue;
 
-			/* Avoid an SMP race with hardware updated dirty/clean
-			 * bits by wiping the pte and then setting the new pte
-			 * into place.
-			 */
-			ptent = pte_modify(ptep_get_and_clear(mm, addr, pte), newprot);
-			set_pte_at(mm, addr, pte, ptent);
-			lazy_mmu_prot_update(ptent);
-		}
+		/* Deal with a potential SMP race with hardware/arch
+		 * interrupt updating dirty/clean bits through the use
+		 * of ptep_cmpxchg.
+		 */
+		if (!ptep_cmpxchg(mm, addr, pte, ptent, pte_modify(ptent, newprot)))
+				goto redo;
+		lazy_mmu_prot_update(ptent);
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 	pte_unmap(pte - 1);
 }
diff -puN mm/rmap.c~page-fault-patches-introduce-pte_xchg-and-pte_cmpxchg mm/rmap.c
--- 25/mm/rmap.c~page-fault-patches-introduce-pte_xchg-and-pte_cmpxchg	Wed Aug 17 14:53:01 2005
+++ 25-akpm/mm/rmap.c	Wed Aug 17 15:02:57 2005
@@ -539,11 +539,6 @@ static int try_to_unmap_one(struct page 
 
 	/* Nuke the page table entry. */
 	flush_cache_page(vma, address, page_to_pfn(page));
-	pteval = ptep_clear_flush(vma, address, pte);
-
-	/* Move the dirty bit to the physical page now the pte is gone. */
-	if (pte_dirty(pteval))
-		set_page_dirty(page);
 
 	if (PageAnon(page)) {
 		swp_entry_t entry = { .val = page->private };
@@ -558,10 +553,15 @@ static int try_to_unmap_one(struct page 
 			list_add(&mm->mmlist, &init_mm.mmlist);
 			spin_unlock(&mmlist_lock);
 		}
-		set_pte_at(mm, address, pte, swp_entry_to_pte(entry));
+		pteval = ptep_xchg_flush(vma, address, pte, swp_entry_to_pte(entry));
 		BUG_ON(pte_file(*pte));
 		dec_mm_counter(mm, anon_rss);
-	}
+	} else
+		pteval = ptep_clear_flush(vma, address, pte);
+
+	/* Move the dirty bit to the physical page now the pte is gone. */
+	if (pte_dirty(pteval))
+		set_page_dirty(page);
 
 	dec_mm_counter(mm, rss);
 	page_remove_rmap(page);
@@ -653,15 +653,15 @@ static void try_to_unmap_cluster(unsigne
 		if (ptep_clear_flush_young(vma, address, pte))
 			continue;
 
-		/* Nuke the page table entry. */
 		flush_cache_page(vma, address, pfn);
-		pteval = ptep_clear_flush(vma, address, pte);
 
 		/* If nonlinear, store the file page offset in the pte. */
 		if (page->index != linear_page_index(vma, address))
-			set_pte_at(mm, address, pte, pgoff_to_pte(page->index));
+			pteval = ptep_xchg_flush(vma, address, pte, pgoff_to_pte(page->index));
+		else
+			pteval = ptep_clear_flush(vma, address, pte);
 
-		/* Move the dirty bit to the physical page now the pte is gone. */
+		/* Move the dirty bit to the physical page now that the pte is gone. */
 		if (pte_dirty(pteval))
 			set_page_dirty(page);
 
_

[-- Attachment #3: page-fault-patches-optional-page_lock-acquisition-in.patch --]
[-- Type: application/octet-stream, Size: 25364 bytes --]


From: Christoph Lameter <christoph@lameter.com>

The page fault handler attempts to hold the page_table_lock only for short
periods of time.  It repeatedly drops and reacquires the lock.  When the lock
is reacquired, it checks whether the underlying pte has changed before
replacing the pte value.  These locations are a good fit for the use of
ptep_cmpxchg.

The following patch allows the use of atomic operations to remove the first
acquisition of the page_table_lock.  A section using atomic pte operations is
begun with

	page_table_atomic_start(struct mm_struct *)

and ends with

	page_table_atomic_stop(struct mm_struct *)

Both of these become spin_lock(page_table_lock) and
spin_unlock(page_table_lock) if atomic page table operations are not
configured (CONFIG_ATOMIC_TABLE_OPS undefined).
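
As a simplified sketch (illustration only; see the mm/memory.c changes below
for the real code), a lockless pte update in the fault path is bracketed
like this:

	page_table_atomic_start(mm);	/* no-op with atomic ops, else spin_lock */
	entry = *pte;
	new_entry = pte_mkyoung(entry);
	if (ptep_cmpxchg(mm, address, pte, entry, new_entry))
		update_mmu_cache(vma, address, new_entry);
	/* on failure another CPU raced with us and did the work */
	page_table_atomic_stop(mm);	/* no-op with atomic ops, else spin_unlock */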

Atomic pte operations using ptep_xchg and ptep_cmpxchg only work for the
lowest layer of the page table.  Higher layers may also be populated in an
atomic way by defining pmd_test_and_populate() etc.  The generic versions of
these functions fall back to the page_table_lock.  Populating higher-level
page table entries is rare and therefore this is not likely to be performance
critical.  For ia64, definitions of the higher-level atomic operations are
included.
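
The intended calling pattern for these test-and-populate helpers is roughly
the following sketch (simplified from the __handle_mm_fault changes below):
allocate outside the atomic section, test-and-populate atomically, and free
the new table if another CPU won the race.

	if (unlikely(pud_none(*pud))) {
		pmd_t *new;

		page_table_atomic_stop(mm);
		new = pmd_alloc_one(mm, address);	/* may sleep */
		if (!new)
			return VM_FAULT_OOM;
		page_table_atomic_start(mm);
		if (!pud_test_and_populate(mm, pud, new))
			pmd_free(new);			/* lost the race */
	}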

This patch depends on the patch avoiding spurious page faults being applied
first, and it only removes the first acquisition of the page_table_lock in
the page fault handler.  This allows the following page table operations
without acquiring the page_table_lock:

1. Updating of access bits (handle_mm_fault)
2. Anonymous read faults (do_anonymous_page)

The page_table_lock is still acquired for creating a new pte for an anonymous
write fault and therefore the problems with atomic updates of rss do not yet
occur.

The patch also adds some diagnostic features: it counts the number of cmpxchg
failures (useful for verifying that this patch works correctly) and the number
of page faults that led to no change in the page table.  The statistics may be
accessed via /proc/meminfo.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
---

 include/asm-generic/4level-fixup.h  |    1 
 include/asm-generic/pgtable-nopmd.h |    5 
 include/asm-generic/pgtable-nopud.h |    8 
 include/asm-generic/pgtable.h       |   99 +++++++++++
 include/asm-ia64/pgalloc.h          |   19 ++
 include/asm-ia64/pgtable.h          |    2 
 include/linux/page-flags.h          |    6 
 mm/memory.c                         |  299 ++++++++++++++++++++++++------------
 mm/page_alloc.c                     |    6 
 proc/proc_misc.c                    |    0 
 10 files changed, 347 insertions(+), 98 deletions(-)

diff -puN fs/proc/proc_misc.c~page-fault-patches-optional-page_lock-acquisition-in fs/proc/proc_misc.c
diff -puN include/asm-generic/pgtable.h~page-fault-patches-optional-page_lock-acquisition-in include/asm-generic/pgtable.h
--- 25/include/asm-generic/pgtable.h~page-fault-patches-optional-page_lock-acquisition-in	Wed Aug 17 15:09:54 2005
+++ 25-akpm/include/asm-generic/pgtable.h	Wed Aug 17 15:09:54 2005
@@ -141,6 +141,65 @@ do {				  					  \
 })
 #endif
 
+/*
+ * page_table_atomic_start and page_table_atomic_stop may be used to
+ * define special measures that an arch needs to guarantee atomic
+ * operations outside of a spinlock. In the case that an arch does
+ * not support atomic page table operations we will fall back to the
+ * page table lock.
+ */
+#ifndef __HAVE_ARCH_PAGE_TABLE_ATOMIC_START
+#define page_table_atomic_start(mm) do { } while (0)
+#endif
+
+#ifndef __HAVE_ARCH_PAGE_TABLE_ATOMIC_START
+#define page_table_atomic_stop(mm) do { } while (0)
+#endif
+
+/*
+ * Fallback functions for atomic population of higher page table
+ * structures. These simply acquire the page_table_lock for
+ * synchronization. An architecture may override these generic
+ * functions to provide atomic populate functions to make these
+ * more effective.
+ */
+
+#ifndef __HAVE_ARCH_PGD_TEST_AND_POPULATE
+#define pgd_test_and_populate(__mm, __pgd, __pud)			\
+({									\
+	int __rc;							\
+	spin_lock(&mm->page_table_lock);				\
+	__rc = pgd_none(*(__pgd));					\
+	if (__rc) pgd_populate(__mm, __pgd, __pud);			\
+	spin_unlock(&mm->page_table_lock);				\
+	__rc;								\
+})
+#endif
+
+#ifndef __HAVE_ARCH_PUD_TEST_AND_POPULATE
+#define pud_test_and_populate(__mm, __pud, __pmd)			\
+({									\
+	int __rc;							\
+	spin_lock(&mm->page_table_lock);				\
+	__rc = pud_none(*(__pud));					\
+	if (__rc) pud_populate(__mm, __pud, __pmd);			\
+	spin_unlock(&mm->page_table_lock);				\
+	__rc;								\
+})
+#endif
+
+#ifndef __HAVE_ARCH_PMD_TEST_AND_POPULATE
+#define pmd_test_and_populate(__mm, __pmd, __page)			\
+({									\
+	int __rc;							\
+	spin_lock(&mm->page_table_lock);				\
+	__rc = !pmd_present(*(__pmd));					\
+	if (__rc) pmd_populate(__mm, __pmd, __page);			\
+	spin_unlock(&mm->page_table_lock);				\
+	__rc;								\
+})
+#endif
+
 #else
 
 /*
@@ -151,6 +210,11 @@ do {				  					  \
  * short time frame. This means that the page_table_lock must be held
  * to avoid a page fault that would install a new entry.
  */
+
+/* Fall back to the page table lock to synchronize page table access */
+#define page_table_atomic_start(mm)	spin_lock(&(mm)->page_table_lock)
+#define page_table_atomic_stop(mm)	spin_unlock(&(mm)->page_table_lock)
+
 #ifndef __HAVE_ARCH_PTEP_XCHG
 #define ptep_xchg(__mm, __address, __ptep, __pteval)			\
 ({									\
@@ -195,6 +259,41 @@ do {				  					  \
 	r;								\
 })
 #endif
+
+/*
+ * Fallback functions for atomic population of higher page table
+ * structures. These rely on the page_table_lock being held.
+ */
+#ifndef __HAVE_ARCH_PGD_TEST_AND_POPULATE
+#define pgd_test_and_populate(__mm, __pgd, __pud)			\
+({									\
+	int __rc;							\
+	__rc = pgd_none(*(__pgd));					\
+	if (__rc) pgd_populate(__mm, __pgd, __pud);			\
+	__rc;								\
+})
+#endif
+
+#ifndef __HAVE_ARCH_PUD_TEST_AND_POPULATE
+#define pud_test_and_populate(__mm, __pud, __pmd)			\
+({									\
+       int __rc;							\
+       __rc = pud_none(*(__pud));					\
+       if (__rc) pud_populate(__mm, __pud, __pmd);			\
+       __rc;								\
+})
+#endif
+
+#ifndef __HAVE_ARCH_PMD_TEST_AND_POPULATE
+#define pmd_test_and_populate(__mm, __pmd, __page)			\
+({									\
+       int __rc;							\
+       __rc = !pmd_present(*(__pmd));					\
+       if (__rc) pmd_populate(__mm, __pmd, __page);			\
+       __rc;								\
+})
+#endif
+
 #endif
 
 #ifndef __HAVE_ARCH_PTEP_SET_WRPROTECT
diff -puN include/asm-generic/pgtable-nopmd.h~page-fault-patches-optional-page_lock-acquisition-in include/asm-generic/pgtable-nopmd.h
--- 25/include/asm-generic/pgtable-nopmd.h~page-fault-patches-optional-page_lock-acquisition-in	Wed Aug 17 15:09:54 2005
+++ 25-akpm/include/asm-generic/pgtable-nopmd.h	Wed Aug 17 15:09:54 2005
@@ -31,6 +31,11 @@ static inline void pud_clear(pud_t *pud)
 #define pmd_ERROR(pmd)				(pud_ERROR((pmd).pud))
 
 #define pud_populate(mm, pmd, pte)		do { } while (0)
+#define __HAVE_ARCH_PUD_TEST_AND_POPULATE
+static inline int pud_test_and_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
+{
+	return 1;
+}
 
 /*
  * (pmds are folded into puds so this doesn't get actually called,
diff -puN include/asm-generic/pgtable-nopud.h~page-fault-patches-optional-page_lock-acquisition-in include/asm-generic/pgtable-nopud.h
--- 25/include/asm-generic/pgtable-nopud.h~page-fault-patches-optional-page_lock-acquisition-in	Wed Aug 17 15:09:54 2005
+++ 25-akpm/include/asm-generic/pgtable-nopud.h	Wed Aug 17 15:09:54 2005
@@ -27,8 +27,14 @@ static inline int pgd_bad(pgd_t pgd)		{ 
 static inline int pgd_present(pgd_t pgd)	{ return 1; }
 static inline void pgd_clear(pgd_t *pgd)	{ }
 #define pud_ERROR(pud)				(pgd_ERROR((pud).pgd))
-
 #define pgd_populate(mm, pgd, pud)		do { } while (0)
+
+#define __HAVE_ARCH_PGD_TEST_AND_POPULATE
+static inline int pgd_test_and_populate(struct mm_struct *mm, pgd_t *pgd, pud_t *pud)
+{
+	return 1;
+}
+
 /*
  * (puds are folded into pgds so this doesn't get actually called,
  * but the define is needed for a generic inline function.)
diff -puN include/asm-ia64/pgalloc.h~page-fault-patches-optional-page_lock-acquisition-in include/asm-ia64/pgalloc.h
--- 25/include/asm-ia64/pgalloc.h~page-fault-patches-optional-page_lock-acquisition-in	Wed Aug 17 15:09:54 2005
+++ 25-akpm/include/asm-ia64/pgalloc.h	Wed Aug 17 15:09:54 2005
@@ -1,6 +1,10 @@
 #ifndef _ASM_IA64_PGALLOC_H
 #define _ASM_IA64_PGALLOC_H
 
+/* Empty entries of PMD and PGD */
+#define PMD_NONE       0
+#define PUD_NONE       0
+
 /*
  * This file contains the functions and defines necessary to allocate
  * page tables.
@@ -86,6 +90,21 @@ static inline void pgd_free(pgd_t * pgd)
 	pgtable_quicklist_free(pgd);
 }
 
+/* Atomic populate */
+static inline int
+pud_test_and_populate (struct mm_struct *mm, pud_t *pud_entry, pmd_t *pmd)
+{
+	return ia64_cmpxchg8_acq(pud_entry,__pa(pmd), PUD_NONE) == PUD_NONE;
+}
+
+/* Atomic populate */
+static inline int
+pmd_test_and_populate (struct mm_struct *mm, pmd_t *pmd_entry, struct page *pte)
+{
+	return ia64_cmpxchg8_acq(pmd_entry, page_to_phys(pte), PMD_NONE) == PMD_NONE;
+}
+
+
 static inline void
 pud_populate(struct mm_struct *mm, pud_t * pud_entry, pmd_t * pmd)
 {
diff -puN include/asm-ia64/pgtable.h~page-fault-patches-optional-page_lock-acquisition-in include/asm-ia64/pgtable.h
--- 25/include/asm-ia64/pgtable.h~page-fault-patches-optional-page_lock-acquisition-in	Wed Aug 17 15:09:54 2005
+++ 25-akpm/include/asm-ia64/pgtable.h	Wed Aug 17 15:09:54 2005
@@ -565,6 +565,8 @@ do {											\
 #define __HAVE_ARCH_PTE_SAME
 #define __HAVE_ARCH_PGD_OFFSET_GATE
 #define __HAVE_ARCH_LAZY_MMU_PROT_UPDATE
+#define __HAVE_ARCH_PUD_TEST_AND_POPULATE
+#define __HAVE_ARCH_PMD_TEST_AND_POPULATE
 
 #include <asm-generic/pgtable-nopud.h>
 #include <asm-generic/pgtable.h>
diff -puN include/linux/page-flags.h~page-fault-patches-optional-page_lock-acquisition-in include/linux/page-flags.h
--- 25/include/linux/page-flags.h~page-fault-patches-optional-page_lock-acquisition-in	Wed Aug 17 15:09:54 2005
+++ 25-akpm/include/linux/page-flags.h	Wed Aug 17 15:09:59 2005
@@ -131,6 +131,12 @@ struct page_state {
 
 	unsigned long pgrotated;	/* pages rotated to tail of the LRU */
 	unsigned long nr_bounce;	/* pages for bounce buffers */
+	unsigned long spurious_page_faults;	/* Faults with no ops */
+	unsigned long cmpxchg_fail_flag_update;	/* cmpxchg failures for pte flag update */
+	unsigned long cmpxchg_fail_flag_reuse;	/* cmpxchg failures when cow reuse of pte */
+
+	unsigned long cmpxchg_fail_anon_read;	/* cmpxchg failures on anonymous read */
+	unsigned long cmpxchg_fail_anon_write;	/* cmpxchg failures on anonymous write */
 };
 
 extern void get_page_state(struct page_state *ret);
diff -puN mm/memory.c~page-fault-patches-optional-page_lock-acquisition-in mm/memory.c
--- 25/mm/memory.c~page-fault-patches-optional-page_lock-acquisition-in	Wed Aug 17 15:09:54 2005
+++ 25-akpm/mm/memory.c	Wed Aug 17 15:10:11 2005
@@ -36,6 +36,8 @@
  *		(Gerhard.Wichert@pdb.siemens.de)
  *
  * Aug/Sep 2004 Changed to four level page tables (Andi Kleen)
+ * Jan 2005 	Scalability improvement by reducing the use and the length of time
+ *		the page table lock is held (Christoph Lameter)
  */
 
 #include <linux/kernel_stat.h>
@@ -977,7 +979,7 @@ int get_user_pages(struct task_struct *t
 				 */
 				if (ret & VM_FAULT_WRITE)
 					write_access = 0;
-				
+
 				switch (ret & ~VM_FAULT_WRITE) {
 				case VM_FAULT_MINOR:
 					tsk->min_flt++;
@@ -1646,8 +1648,7 @@ void swapin_readahead(swp_entry_t entry,
 }
 
 /*
- * We hold the mm semaphore and the page_table_lock on entry and
- * should release the pagetable lock on exit..
+ * We hold the mm semaphore and have started atomic pte operations
  */
 static int do_swap_page(struct mm_struct * mm,
 	struct vm_area_struct * vma, unsigned long address,
@@ -1659,15 +1660,14 @@ static int do_swap_page(struct mm_struct
 	int ret = VM_FAULT_MINOR;
 
 	pte_unmap(page_table);
-	spin_unlock(&mm->page_table_lock);
+	page_table_atomic_stop(mm);
 	page = lookup_swap_cache(entry);
 	if (!page) {
  		swapin_readahead(entry, address, vma);
  		page = read_swap_cache_async(entry, vma, address);
 		if (!page) {
 			/*
-			 * Back out if somebody else faulted in this pte while
-			 * we released the page table lock.
+			 * Back out if somebody else faulted in this pte
 			 */
 			spin_lock(&mm->page_table_lock);
 			page_table = pte_offset_map(pmd, address);
@@ -1690,8 +1690,7 @@ static int do_swap_page(struct mm_struct
 	lock_page(page);
 
 	/*
-	 * Back out if somebody else faulted in this pte while we
-	 * released the page table lock.
+	 * Back out if somebody else faulted in this pte
 	 */
 	spin_lock(&mm->page_table_lock);
 	page_table = pte_offset_map(pmd, address);
@@ -1746,61 +1745,79 @@ out_nomap:
 }
 
 /*
- * We are called with the MM semaphore and page_table_lock
- * spinlock held to protect against concurrent faults in
- * multithreaded programs. 
+ * We are called with atomic operations started and the
+ * value of the pte that was read in orig_entry.
  */
 static int
 do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		pte_t *page_table, pmd_t *pmd, int write_access,
-		unsigned long addr)
+		unsigned long addr, pte_t orig_entry)
 {
 	pte_t entry;
-	struct page * page = ZERO_PAGE(addr);
+	struct page * page;
 
-	/* Read-only mapping of ZERO_PAGE. */
-	entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
+	if (unlikely(!write_access)) {
 
-	/* ..except if it's a write access */
-	if (write_access) {
-		/* Allocate our own private page. */
+		/* Read-only mapping of ZERO_PAGE. */
+		entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr),
+					vma->vm_page_prot));
+
+		/*
+		 * If the cmpxchg fails then another cpu may
+		 * already have populated the entry
+		 */
+		if (ptep_cmpxchg(mm, addr, page_table, orig_entry, entry)) {
+			update_mmu_cache(vma, addr, entry);
+			lazy_mmu_prot_update(entry);
+		} else {
+			inc_page_state(cmpxchg_fail_anon_read);
+		}
 		pte_unmap(page_table);
-		spin_unlock(&mm->page_table_lock);
+		goto minor_fault;
+	}
 
-		if (unlikely(anon_vma_prepare(vma)))
-			goto no_mem;
-		page = alloc_zeroed_user_highpage(vma, addr);
-		if (!page)
-			goto no_mem;
+	/* This leaves the write case */
+	page_table_atomic_stop(mm);
+	if (unlikely(anon_vma_prepare(vma)))
+		goto oom;
 
-		spin_lock(&mm->page_table_lock);
-		page_table = pte_offset_map(pmd, addr);
+	page = alloc_zeroed_user_highpage(vma, addr);
+	if (!page)
+		goto oom;
 
-		if (!pte_none(*page_table)) {
-			pte_unmap(page_table);
-			page_cache_release(page);
-			spin_unlock(&mm->page_table_lock);
-			goto out;
-		}
-		inc_mm_counter(mm, rss);
-		entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
-							 vma->vm_page_prot)),
-				      vma);
-		lru_cache_add_active(page);
-		SetPageReferenced(page);
-		page_add_anon_rmap(page, vma, addr);
-	}
+	entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
+						vma->vm_page_prot)),
+				vma);
+	spin_lock(&mm->page_table_lock);
 
-	set_pte_at(mm, addr, page_table, entry);
-	pte_unmap(page_table);
+	if (!ptep_cmpxchg(mm, addr, page_table, orig_entry, entry)) {
+		pte_unmap(page_table);
+		page_cache_release(page);
+		inc_page_state(cmpxchg_fail_anon_write);
+		goto minor_fault_atomic;
+        }
 
-	/* No need to invalidate - it was non-present before */
+	/*
+	 * These two functions must come after the cmpxchg
+	 * because if the page is on the LRU then try_to_unmap may come
+	 * in and unmap the pte.
+	 */
+	page_add_anon_rmap(page, vma, addr);
+	lru_cache_add_active(page);
+	inc_mm_counter(mm, rss);
+	pte_unmap(page_table);
 	update_mmu_cache(vma, addr, entry);
 	lazy_mmu_prot_update(entry);
+
+minor_fault:
 	spin_unlock(&mm->page_table_lock);
-out:
 	return VM_FAULT_MINOR;
-no_mem:
+
+minor_fault_atomic:
+	page_table_atomic_stop(mm);
+	return VM_FAULT_MINOR;
+
+oom:
 	return VM_FAULT_OOM;
 }
 
@@ -1813,12 +1830,12 @@ no_mem:
  * As this is called only for pages that do not currently exist, we
  * do not need to flush old virtual caches or the TLB.
  *
- * This is called with the MM semaphore held and the page table
- * spinlock held. Exit with the spinlock released.
+ * This is called with the MM semaphore held and atomic pte operations started.
  */
 static int
 do_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
-	unsigned long address, int write_access, pte_t *page_table, pmd_t *pmd)
+	unsigned long address, int write_access, pte_t *page_table,
+        pmd_t *pmd, pte_t orig_entry)
 {
 	struct page * new_page;
 	struct address_space *mapping = NULL;
@@ -1829,9 +1846,9 @@ do_no_page(struct mm_struct *mm, struct 
 
 	if (!vma->vm_ops || !vma->vm_ops->nopage)
 		return do_anonymous_page(mm, vma, page_table,
-					pmd, write_access, address);
+					pmd, write_access, address, orig_entry);
 	pte_unmap(page_table);
-	spin_unlock(&mm->page_table_lock);
+	page_table_atomic_stop(mm);
 
 	if (vma->vm_file) {
 		mapping = vma->vm_file->f_mapping;
@@ -1938,7 +1955,7 @@ oom:
  * nonlinear vmas.
  */
 static int do_file_page(struct mm_struct * mm, struct vm_area_struct * vma,
-	unsigned long address, int write_access, pte_t *pte, pmd_t *pmd)
+	unsigned long address, int write_access, pte_t *pte, pmd_t *pmd, pte_t entry)
 {
 	unsigned long pgoff;
 	int err;
@@ -1951,13 +1968,13 @@ static int do_file_page(struct mm_struct
 	if (!vma->vm_ops->populate ||
 			(write_access && !(vma->vm_flags & VM_SHARED))) {
 		pte_clear(mm, address, pte);
-		return do_no_page(mm, vma, address, write_access, pte, pmd);
+		return do_no_page(mm, vma, address, write_access, pte, pmd, entry);
 	}
 
-	pgoff = pte_to_pgoff(*pte);
+	pgoff = pte_to_pgoff(entry);
 
 	pte_unmap(pte);
-	spin_unlock(&mm->page_table_lock);
+	page_table_atomic_stop(mm);
 
 	err = vma->vm_ops->populate(vma, address & PAGE_MASK, PAGE_SIZE, vma->vm_page_prot, pgoff, 0);
 	if (err == -ENOMEM)
@@ -1976,49 +1993,80 @@ static int do_file_page(struct mm_struct
  * with external mmu caches can use to update those (ie the Sparc or
  * PowerPC hashed page tables that act as extended TLBs).
  *
- * Note the "page_table_lock". It is to protect against kswapd removing
- * pages from under us. Note that kswapd only ever _removes_ pages, never
- * adds them. As such, once we have noticed that the page is not present,
- * we can drop the lock early.
- *
- * The adding of pages is protected by the MM semaphore (which we hold),
- * so we don't need to worry about a page being suddenly been added into
- * our VM.
- *
- * We enter with the pagetable spinlock held, we are supposed to
- * release it when done.
- */
+ * Note that kswapd only ever _removes_ pages, never adds them.
+ * We exploit that case if possible to avoid taking the
+ * page table lock.
+*/
 static inline int handle_pte_fault(struct mm_struct *mm,
 	struct vm_area_struct * vma, unsigned long address,
 	int write_access, pte_t *pte, pmd_t *pmd)
 {
 	pte_t entry;
+	pte_t new_entry;
 
 	entry = *pte;
 	if (!pte_present(entry)) {
 		/*
-		 * If it truly wasn't present, we know that kswapd
-		 * and the PTE updates will not touch it later. So
-		 * drop the lock.
+		 * Pass the value of the pte to do_no_page and do_file_page
+		 * This value may be used to verify that the pte is still
+		 * not present allowing atomic insertion of ptes.
 		 */
 		if (pte_none(entry))
-			return do_no_page(mm, vma, address, write_access, pte, pmd);
+			return do_no_page(mm, vma, address, write_access,
+						pte, pmd, entry);
 		if (pte_file(entry))
-			return do_file_page(mm, vma, address, write_access, pte, pmd);
-		return do_swap_page(mm, vma, address, pte, pmd, entry, write_access);
+			return do_file_page(mm, vma, address, write_access,
+						pte, pmd, entry);
+		return do_swap_page(mm, vma, address, pte, pmd,
+						entry, write_access);
 	}
 
+	new_entry = pte_mkyoung(entry);
 	if (write_access) {
-		if (!pte_write(entry))
+		if (!pte_write(entry)) {
+#ifdef CONFIG_ATOMIC_TABLE_OPS
+			/*
+			 * do_wp_page modifies a pte. We can add a pte without
+			 * the page_table_lock but not modify a pte since a
+			 * cmpxchg does not allow us to verify that the page
+			 * was not changed under us. So acquire the page table
+			 * lock.
+			 */
+			spin_lock(&mm->page_table_lock);
+			if (pte_same(entry, *pte))
+				return do_wp_page(mm, vma, address, pte,
+							pmd, entry);
+			/*
+			 * pte was changed under us. Another processor may have
+			 * done what we needed to do.
+			 */
+			pte_unmap(pte);
+			spin_unlock(&mm->page_table_lock);
+			return VM_FAULT_MINOR;
+#else
 			return do_wp_page(mm, vma, address, pte, pmd, entry);
-		entry = pte_mkdirty(entry);
+#endif
+		}
+		new_entry = pte_mkdirty(new_entry);
 	}
-	entry = pte_mkyoung(entry);
-	ptep_set_access_flags(vma, address, pte, entry, write_access);
-	update_mmu_cache(vma, address, entry);
-	lazy_mmu_prot_update(entry);
+
+	/*
+	 * If the cmpxchg fails then another processor may have done
+	 * the changes for us. If not then another fault will bring
+	 * another chance to do this again.
+	*/
+	if (ptep_cmpxchg(mm, address, pte, entry, new_entry)) {
+		flush_tlb_page(vma, address);
+		update_mmu_cache(vma, address, entry);
+		lazy_mmu_prot_update(entry);
+	} else {
+		inc_page_state(cmpxchg_fail_flag_update);
+	}
+
 	pte_unmap(pte);
-	spin_unlock(&mm->page_table_lock);
+	page_table_atomic_stop(mm);
+	if (pte_val(new_entry) == pte_val(entry))
+		inc_page_state(spurious_page_faults);
 	return VM_FAULT_MINOR;
 }
 
@@ -2037,33 +2085,90 @@ int __handle_mm_fault(struct mm_struct *
 
 	inc_page_state(pgfault);
 
-	if (is_vm_hugetlb_page(vma))
-		return VM_FAULT_SIGBUS;	/* mapping truncation does this. */
+	if (unlikely(is_vm_hugetlb_page(vma)))
+		goto sigbus;		/* mapping truncation does this. */
 
 	/*
-	 * We need the page table lock to synchronize with kswapd
-	 * and the SMP-safe atomic PTE updates.
+	 * We try to rely on the mmap_sem and the SMP-safe atomic PTE updates.
+	 * to synchronize with kswapd. However, the arch may fall back
+	 * in page_table_atomic_start to the page table lock.
+	 *
+	 * We may be able to avoid taking and releasing the page_table_lock
+	 * for the p??_alloc functions through atomic operations so we
+	 * duplicate the functionality of pmd_alloc, pud_alloc and
+	 * pte_alloc_map here.
 	 */
+	page_table_atomic_start(mm);
 	pgd = pgd_offset(mm, address);
-	spin_lock(&mm->page_table_lock);
+	if (unlikely(pgd_none(*pgd))) {
+#ifdef __ARCH_HAS_4LEVEL_HACK
+		/* The hack does not allow a clean fall back.
+		 * We need to insert a pmd entry into a pgd. pgd_test_and_populate is set
+		 * up to take a pmd entry. pud_none(pgd) == 0, therefore
+		 * the pud population branch will never be taken.
+		 */
+		pmd_t *new;
 
-	pud = pud_alloc(mm, pgd, address);
-	if (!pud)
-		goto oom;
+		page_table_atomic_stop(mm);
+		new = pmd_alloc_one(mm, address);
+#else
+		pud_t *new;
 
-	pmd = pmd_alloc(mm, pud, address);
-	if (!pmd)
-		goto oom;
+		page_table_atomic_stop(mm);
+		new = pud_alloc_one(mm, address);
+#endif
 
-	pte = pte_alloc_map(mm, pmd, address);
-	if (!pte)
-		goto oom;
-	
-	return handle_pte_fault(mm, vma, address, write_access, pte, pmd);
+		if (!new)
+			goto oom;
 
- oom:
-	spin_unlock(&mm->page_table_lock);
+		page_table_atomic_start(mm);
+		if (!pgd_test_and_populate(mm, pgd, new))
+			pud_free(new);
+	}
+
+	pud = pud_offset(pgd, address);
+	if (unlikely(pud_none(*pud))) {
+		pmd_t *new;
+
+		page_table_atomic_stop(mm);
+		new = pmd_alloc_one(mm, address);
+
+		if (!new)
+			goto oom;
+
+		page_table_atomic_start(mm);
+
+		if (!pud_test_and_populate(mm, pud, new))
+			pmd_free(new);
+	}
+
+	pmd = pmd_offset(pud, address);
+	if (unlikely(!pmd_present(*pmd))) {
+		struct page *new;
+
+		page_table_atomic_stop(mm);
+		new = pte_alloc_one(mm, address);
+
+		if (!new)
+			goto oom;
+
+		page_table_atomic_start(mm);
+
+		if (!pmd_test_and_populate(mm, pmd, new))
+			pte_free(new);
+		else {
+			inc_page_state(nr_page_table_pages);
+			mm->nr_ptes++;
+		}
+	}
+
+	pte = pte_offset_map(pmd, address);
+	return handle_pte_fault(mm, vma, address, write_access, pte, pmd);
+oom:
 	return VM_FAULT_OOM;
+
+sigbus:
+	return VM_FAULT_SIGBUS;
 }
 
 #ifndef __PAGETABLE_PUD_FOLDED
diff -puN mm/page_alloc.c~page-fault-patches-optional-page_lock-acquisition-in mm/page_alloc.c
--- 25/mm/page_alloc.c~page-fault-patches-optional-page_lock-acquisition-in	Wed Aug 17 15:09:59 2005
+++ 25-akpm/mm/page_alloc.c	Wed Aug 17 15:09:59 2005
@@ -2218,6 +2218,12 @@ static char *vmstat_text[] = {
 
 	"pgrotated",
 	"nr_bounce",
+	"spurious_page_faults",
+	"cmpxchg_fail_flag_update",
+	"cmpxchg_fail_flag_reuse",
+
+	"cmpxchg_fail_anon_read",
+	"cmpxchg_fail_anon_write",
 };
 
 static void *vmstat_start(struct seq_file *m, loff_t *pos)
diff -puN include/asm-generic/4level-fixup.h~page-fault-patches-optional-page_lock-acquisition-in include/asm-generic/4level-fixup.h
--- 25/include/asm-generic/4level-fixup.h~page-fault-patches-optional-page_lock-acquisition-in	Wed Aug 17 15:10:08 2005
+++ 25-akpm/include/asm-generic/4level-fixup.h	Wed Aug 17 15:10:08 2005
@@ -26,6 +26,7 @@
 #define pud_present(pud)		1
 #define pud_ERROR(pud)			do { } while (0)
 #define pud_clear(pud)			pgd_clear(pud)
+#define pud_populate			pgd_populate
 
 #undef pud_free_tlb
 #define pud_free_tlb(tlb, x)            do { } while (0)
_

[-- Attachment #4: page-fault-patches-no-pagetable-lock-in-do_anon_page.patch --]
[-- Type: application/octet-stream, Size: 4302 bytes --]


From: Christoph Lameter <christoph@lameter.com>

Do not use the page_table_lock in do_anonymous_page.  This significantly
increases the parallelism in the page fault handler on SMP systems.  The patch
also modifies the definitions of the *_mm_counter() macros so that rss and
anon_rss become atomic (using atomic64_t if available).
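
As a sketch of why this matters (illustration only, not part of the patch):
once the lock no longer covers the anonymous-fault path, two CPUs may fault
in the same mm concurrently, so a plain increment could lose updates.

	/*
	 * Without the page_table_lock, (mm)->_rss++ on two CPUs can lose
	 * an update.  With CONFIG_ATOMIC_TABLE_OPS the macro below instead
	 * expands to atomic64_inc(&mm->_rss) (or atomic_inc if ATOMIC64_INIT
	 * is not available), making the increment safe.
	 */
	inc_mm_counter(mm, rss);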

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
---

 include/linux/sched.h |   31 +++++++++++++++++++++++++++++++
 mm/memory.c           |   14 +++++---------
 2 files changed, 36 insertions(+), 9 deletions(-)

diff -puN include/linux/sched.h~page-fault-patches-no-pagetable-lock-in-do_anon_page include/linux/sched.h
--- 25/include/linux/sched.h~page-fault-patches-no-pagetable-lock-in-do_anon_page	Wed Aug 17 15:10:30 2005
+++ 25-akpm/include/linux/sched.h	Wed Aug 17 15:10:45 2005
@@ -204,12 +204,43 @@ arch_get_unmapped_area_topdown(struct fi
 extern void arch_unmap_area(struct mm_struct *, unsigned long);
 extern void arch_unmap_area_topdown(struct mm_struct *, unsigned long);
 
+#ifdef CONFIG_ATOMIC_TABLE_OPS
+/*
+ * No spinlock is held during atomic page table operations. The
+ * counters are not protected anymore and must also be
+ * incremented atomically.
+*/
+#ifdef ATOMIC64_INIT
+#define set_mm_counter(mm, member, value) atomic64_set(&(mm)->_##member, value)
+#define get_mm_counter(mm, member) ((unsigned long)atomic64_read(&(mm)->_##member))
+#define add_mm_counter(mm, member, value) atomic64_add(value, &(mm)->_##member)
+#define inc_mm_counter(mm, member) atomic64_inc(&(mm)->_##member)
+#define dec_mm_counter(mm, member) atomic64_dec(&(mm)->_##member)
+typedef atomic64_t mm_counter_t;
+#else
+/*
+ * This may limit process memory to 2^31 * PAGE_SIZE which may be around 8TB
+ * if using 4KB page size
+ */
+#define set_mm_counter(mm, member, value) atomic_set(&(mm)->_##member, value)
+#define get_mm_counter(mm, member) ((unsigned long)atomic_read(&(mm)->_##member))
+#define add_mm_counter(mm, member, value) atomic_add(value, &(mm)->_##member)
+#define inc_mm_counter(mm, member) atomic_inc(&(mm)->_##member)
+#define dec_mm_counter(mm, member) atomic_dec(&(mm)->_##member)
+typedef atomic_t mm_counter_t;
+#endif
+#else
+/*
+ * No atomic page table operations. Counters are protected by
+ * the page table lock
+ */
 #define set_mm_counter(mm, member, value) (mm)->_##member = (value)
 #define get_mm_counter(mm, member) ((mm)->_##member)
 #define add_mm_counter(mm, member, value) (mm)->_##member += (value)
 #define inc_mm_counter(mm, member) (mm)->_##member++
 #define dec_mm_counter(mm, member) (mm)->_##member--
 typedef unsigned long mm_counter_t;
+#endif
 
 struct mm_struct {
 	struct vm_area_struct * mmap;		/* list of VMAs */
diff -puN mm/memory.c~page-fault-patches-no-pagetable-lock-in-do_anon_page mm/memory.c
--- 25/mm/memory.c~page-fault-patches-no-pagetable-lock-in-do_anon_page	Wed Aug 17 15:10:30 2005
+++ 25-akpm/mm/memory.c	Wed Aug 17 15:10:48 2005
@@ -1772,12 +1772,12 @@ do_anonymous_page(struct mm_struct *mm, 
 		} else {
 			inc_page_state(cmpxchg_fail_anon_read);
 		}
-		pte_unmap(page_table);
 		goto minor_fault;
 	}
 
 	/* This leaves the write case */
 	page_table_atomic_stop(mm);
+	pte_unmap(page_table);
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
 
@@ -1788,13 +1788,13 @@ do_anonymous_page(struct mm_struct *mm, 
 	entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
 						vma->vm_page_prot)),
 				vma);
-	spin_lock(&mm->page_table_lock);
+	page_table = pte_offset_map(pmd, addr);
+	page_table_atomic_start(mm);
 
 	if (!ptep_cmpxchg(mm, addr, page_table, orig_entry, entry)) {
-		pte_unmap(page_table);
 		page_cache_release(page);
 		inc_page_state(cmpxchg_fail_anon_write);
-		goto minor_fault_atomic;
+		goto minor_fault;
         }
 
 	/*
@@ -1805,16 +1805,12 @@ do_anonymous_page(struct mm_struct *mm, 
 	page_add_anon_rmap(page, vma, addr);
 	lru_cache_add_active(page);
 	inc_mm_counter(mm, rss);
-	pte_unmap(page_table);
 	update_mmu_cache(vma, addr, entry);
 	lazy_mmu_prot_update(entry);
 
 minor_fault:
-	spin_unlock(&mm->page_table_lock);
-	return VM_FAULT_MINOR;
-
-minor_fault_atomic:
 	page_table_atomic_stop(mm);
+	pte_unmap(page_table);
 	return VM_FAULT_MINOR;
 
 oom:
_

* Re: pagefault scalability patches
  2005-08-17 22:17 pagefault scalability patches Andrew Morton
@ 2005-08-17 22:19 ` Christoph Lameter
  2005-08-17 22:36 ` Linus Torvalds
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 47+ messages in thread
From: Christoph Lameter @ 2005-08-17 22:19 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linus Torvalds, Hugh Dickins, Nick Piggin, linux-mm

On Wed, 17 Aug 2005, Andrew Morton wrote:

> a) general increase of complexity

The complexity is necessary in order to move to atomic operations that will
also allow future enhancements.
 
> b) the fact that they only partially address the problem: anonymous page
>    faults are addressed, but lots of other places aren't.

The patches also allow atomic updates of pte flags. The most common
bottleneck is anonymous faults.

> c) the fact that they address one particular part of one particular
>    workload on exceedingly rare machines.

No, this is a general fix for anonymous page faults on SMP machines. As
noted at the KS, others are seeing similar performance problems.

* Re: pagefault scalability patches
  2005-08-17 22:17 pagefault scalability patches Andrew Morton
  2005-08-17 22:19 ` Christoph Lameter
@ 2005-08-17 22:36 ` Linus Torvalds
  2005-08-17 22:51   ` Christoph Lameter
  2005-08-18  0:43 ` Andrew Morton
  2005-08-18  2:00 ` Nick Piggin
  3 siblings, 1 reply; 47+ messages in thread
From: Linus Torvalds @ 2005-08-17 22:36 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Hugh Dickins, Christoph Lameter, Nick Piggin, linux-mm


On Wed, 17 Aug 2005, Andrew Morton wrote:
> 
> These are getting in the way now, and I need to make a go/no-go decision.
> 
> I have vague feelings of ickiness with the patches wrt:
> 
> a) general increase of complexity
> 
> b) the fact that they only partially address the problem: anonymous page
>    faults are addressed, but lots of other places aren't.
> 
> c) the fact that they address one particular part of one particular
>    workload on exceedingly rare machines.
> 
> I believe that Nick has plans to address b).
> 
> I'd like us to thrash this out (again), please.  Hugh, could you (for the
> nth and final time) describe your concerns with these patches?

Hmm.. I personally like the anonymous page thing, since I actually think
that's one of the most important ones. It's the one that does _not_ 
actually only matter for some esoteric case: if you can get rid of a lock 
in the page fault logic, that's a big win, in my opinion. Locks are 
expensive.

HOWEVER, the fact that it makes the mm counters be atomic just makes it
pointless. It may help scalability, but it loses the attribute that I
considered a big win above - it no longer helps the non-contended case (at
least on x86, an uncontended spinlock is about as expensive as an atomic
op).

I thought Christoph (Nick?) had a patch to make the counters be
per-thread, and then just fold them back into the mm_struct every once in a
while?

		Linus

* Re: pagefault scalability patches
  2005-08-17 22:36 ` Linus Torvalds
@ 2005-08-17 22:51   ` Christoph Lameter
  2005-08-17 23:01     ` Linus Torvalds
  2005-08-22  2:13     ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 47+ messages in thread
From: Christoph Lameter @ 2005-08-17 22:51 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andrew Morton, Hugh Dickins, Nick Piggin, linux-mm

On Wed, 17 Aug 2005, Linus Torvalds wrote:

> HOWEVER, the fact that it makes the mm counters be atomic just makes it
> pointless. It may help scalability, but it loses the attribute that I
> considered a big win above - it no longer helps the non-contended case (at
> least on x86, a uncontended spinlock is about as expensive as a atomic
> op).

We are trading 2x (spinlock(page_table_lock), 
spin_unlock(page_table_lock)) against one atomic inc.

> 
> I thought Christoph (Nick?) had a patch to make the counters be
> per-thread, and then just folded back into the mm-struct every once in a
> while?

Yes I do, but I did not want to risk that can of worms becoming entwined
with the page fault scalability patches.


* Re: pagefault scalability patches
  2005-08-17 22:51   ` Christoph Lameter
@ 2005-08-17 23:01     ` Linus Torvalds
  2005-08-17 23:12       ` Christoph Lameter
  2005-08-22  2:13     ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 47+ messages in thread
From: Linus Torvalds @ 2005-08-17 23:01 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andrew Morton, Hugh Dickins, Nick Piggin, linux-mm


On Wed, 17 Aug 2005, Christoph Lameter wrote:
>
> We are trading 2x (spinlock(page_table_lock), 
> spin_unlock(page_table_lock)) against one atomic inc.

Bzzt. Thank you for playing.

Spin-unlock is free on x86 and x86-64, since it's a plain normal store. The
x86 memory ordering semantics take care of the rest.

In other words, one uncontended spinlock/unlock pair is pretty much
_exactly_ the same cost as one single atomic operation, and there is no 
win.

		Linus

* Re: pagefault scalability patches
  2005-08-17 23:01     ` Linus Torvalds
@ 2005-08-17 23:12       ` Christoph Lameter
  2005-08-17 23:23         ` Linus Torvalds
  2005-08-17 23:30         ` Andrew Morton
  0 siblings, 2 replies; 47+ messages in thread
From: Christoph Lameter @ 2005-08-17 23:12 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andrew Morton, Hugh Dickins, Nick Piggin, linux-mm

On Wed, 17 Aug 2005, Linus Torvalds wrote:

> On Wed, 17 Aug 2005, Christoph Lameter wrote:
> >
> > We are trading 2x (spinlock(page_table_lock), 
> > spin_unlock(page_table_lock)) against one atomic inc.
> 
> Bzzt. Thank you for playing.
> 
> Spinunlock is free on x86 and x86-64, since it's a plain normal store. The 
> x86 memory ordering semantics take care of the rest.

The same is basically true for ia64 (there is an additional memory barrier 
since ordering must be enforced at that point).

> In other words, one uncontended spinlock/unlock pair is pretty much
> _exactly_ the same cost as one single atomic operation, and there is no 
> win.

We have no problems if the lock is not contended. Then we just reduce the
overhead by eliminating one semaphore instruction.

If it is contended, then spin_lock/unlock serializes the page
fault handler.

Numbers:

Unpatched:

 Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
 16   3    1    0.757s     62.772s  63.052s 49515.393  49522.112
 16   3    2    0.674s     68.268s  36.048s 45627.693  86230.400
 16   3    4    0.649s     66.478s  20.061s 46861.663 152603.033
 16   3    8    0.666s    179.611s  25.096s 17449.372 121159.869
 16   3   16    0.721s    545.957s  38.015s  5754.251  82456.599
 16   3   32    1.327s   1718.947s  59.083s  1828.620  52573.584
 16   3   64    3.181s   5260.674s  93.047s   597.609  33651.618
 16   3  128   15.966s  19392.742s 163.090s   162.078  19192.626
 16   3  256   18.214s   9994.286s  84.077s   314.180  37105.725
 16   3  512   13.866s   5023.788s  42.063s   624.443  73776.955

Page fault patches

 Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
  4   3    1    0.153s     12.314s  12.047s 63077.153  63065.474
  4   3    2    0.155s     10.430s   5.074s 74290.224 136812.728
  4   3    4    0.157s      9.377s   2.095s 82477.064 266154.766
  4   3    8    0.164s     10.348s   2.002s 74807.804 388714.846
  4   3   16    0.250s     20.687s   2.017s 37560.913 360858.481
  4   3   32    0.568s     43.941s   2.034s 17668.743 335334.362
  4   3   64    2.954s     95.528s   2.066s  7985.502 294723.687
  4   3  128   15.449s    259.534s   3.053s  2859.924 222310.352
  4   3  256   16.108s    137.784s   2.019s  5110.263 357984.896
  4   3  512   14.083s     82.377s   1.049s  8152.851 527180.088


* Re: pagefault scalability patches
  2005-08-17 23:12       ` Christoph Lameter
@ 2005-08-17 23:23         ` Linus Torvalds
  2005-08-17 23:31           ` Christoph Lameter
  2005-08-17 23:30         ` Andrew Morton
  1 sibling, 1 reply; 47+ messages in thread
From: Linus Torvalds @ 2005-08-17 23:23 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andrew Morton, Hugh Dickins, Nick Piggin, linux-mm


On Wed, 17 Aug 2005, Christoph Lameter wrote:
>
> We have no problems if the lock are not contended. Then we just reduce the 
> overhead by eliminating one semaphore instruction.

We _do_ have a problem.

Do a kernel benchmark on UP vs SMP, and realize that the cost of just
uncontended spinlocks is about 20% on some kernel loads. That's with
purely single-threaded benchmarks, tied to one CPU - the cost of atomic
ops really is that high. The only difference is the spinlock/unlock.

(Now, the page fault case may not be that bad, but the point remains: 
locking and atomic ops are bad. The anonymous page thing is one of the 
hottest pieces of code in the kernel under perfectly normal loads, and 
getting rid of spinlocks there is worth it).

The thing is, I personally don't care very much at all about 5000 threads
doing page faults in the same VM at the same time. I care about _one_
thread doing page faults in the same VM, and the fact that your patch, if
done right, could help that. That's why I like the patch. Not because of 
your scalability numbers ;)

So we're coming from two different angles here.

			Linus

* Re: pagefault scalability patches
  2005-08-17 23:12       ` Christoph Lameter
  2005-08-17 23:23         ` Linus Torvalds
@ 2005-08-17 23:30         ` Andrew Morton
  2005-08-17 23:33           ` Christoph Lameter
  1 sibling, 1 reply; 47+ messages in thread
From: Andrew Morton @ 2005-08-17 23:30 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: torvalds, hugh, piggin, linux-mm

Christoph Lameter <clameter@engr.sgi.com> wrote:
>
> Numbers:
> 
> Unpatched:
> 
>  Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
>  16   3    1    0.757s     62.772s  63.052s 49515.393  49522.112
> 
> ...
>
> Page fault patches
> 
>  Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
>   4   3    1    0.153s     12.314s  12.047s 63077.153  63065.474

With what workload, on what hardware?


* Re: pagefault scalability patches
  2005-08-17 23:23         ` Linus Torvalds
@ 2005-08-17 23:31           ` Christoph Lameter
  0 siblings, 0 replies; 47+ messages in thread
From: Christoph Lameter @ 2005-08-17 23:31 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andrew Morton, Hugh Dickins, Nick Piggin, linux-mm

On Wed, 17 Aug 2005, Linus Torvalds wrote:

> On Wed, 17 Aug 2005, Christoph Lameter wrote:
> >
> > We have no problems if the lock are not contended. Then we just reduce the 
> > overhead by eliminating one semaphore instruction.
> 
> We _do_ have a problem.

Ok. That's even better :-)

> Do a kernel benchmark on UP vs SMP, and realize that the cost of just
> uncontended spinlocks is about 20% on some kernel loads. That's with
> purely single-threaded benchmarks, tied to one CPU - the cost of atomic
> ops really is that high. The only difference is the spinlock/unlock.
> 
> (Now, the page fault case may not be that bad, but the point remains: 
> locking and atomic ops are bad. The anonymous page thing is one of the 
> hottest pieces of code in the kernel under perfectly normal loads, and 
> getting rid of spinlocks there is worth it).

Right.
 
> The thing is, I personally don't care very much at all about 5000 threads
> doing page faults in the same VM at the same time. I care about _one_
> thread doing page faults in the same VM, and the fact that your patch, if
> done right, could help that. That's why I like the patch. Not because of 
> your scalability numbers ;)

I will submit the list rss stuff later, but there are several 
modifications to mm management that will cause additional large-scale 
changes. Andrew is already complaining, so I did not want to risk more 
invasive patches than this. The rss counter management is already handled 
through macros (due to the first page fault scalability patch that was 
accepted in January), so it will be easy to substitute alternate 
implementations in order to avoid atomic operations.

The list rss patches give another threefold improvement in page fault 
numbers on our large scale systems.

> So we're coming from two different angles here.

I thought we were on the same page?

* Re: pagefault scalability patches
  2005-08-17 23:30         ` Andrew Morton
@ 2005-08-17 23:33           ` Christoph Lameter
  2005-08-17 23:44             ` Andrew Morton
  0 siblings, 1 reply; 47+ messages in thread
From: Christoph Lameter @ 2005-08-17 23:33 UTC (permalink / raw)
  To: Andrew Morton; +Cc: torvalds, hugh, piggin, linux-mm

On Wed, 17 Aug 2005, Andrew Morton wrote:

> With what workload, on what hardware?

This is the page fault scalability test that I posted last year with the 
first edition of this patchset. Hardware is Itanium with 256 nodes 
otherwise I would not have been able to test up to 512 processors.

* Re: pagefault scalability patches
  2005-08-17 23:33           ` Christoph Lameter
@ 2005-08-17 23:44             ` Andrew Morton
  2005-08-17 23:52               ` Peter Chubb
  0 siblings, 1 reply; 47+ messages in thread
From: Andrew Morton @ 2005-08-17 23:44 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: torvalds, hugh, piggin, linux-mm

Christoph Lameter <clameter@engr.sgi.com> wrote:
>
> On Wed, 17 Aug 2005, Andrew Morton wrote:
> 
> > With what workload, on what hardware?
> 
> This is the page fault scalability test that I posted last year with the 
> first edition of this patchset. Hardware is Itanium with 256 nodes 
> otherwise I would not have been able to test up to 512 processors.

We forget things easily - please don't expect us to remember what that test
did.

What did it do?

The decreases in system CPU time for the single-threaded case are
extraordinarily high.  What's going on?

* Re: pagefault scalability patches
  2005-08-17 23:44             ` Andrew Morton
@ 2005-08-17 23:52               ` Peter Chubb
  2005-08-17 23:58                 ` Christoph Lameter
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Chubb @ 2005-08-17 23:52 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Christoph Lameter, torvalds, hugh, piggin, linux-mm

>>>>> "Andrew" == Andrew Morton <akpm@osdl.org> writes:

Andrew> The decreases in system CPU time for the single-threaded case
Andrew> are extraordinarily high.  

Are the sizes of the test the same?  The unpatched version says 16G,
the patched one 4G --- with a quarter the memory size I'd expect less
than a quarter of the overhead...

-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*

* Re: pagefault scalability patches
  2005-08-17 23:52               ` Peter Chubb
@ 2005-08-17 23:58                 ` Christoph Lameter
  2005-08-18  0:47                   ` Andrew Morton
  0 siblings, 1 reply; 47+ messages in thread
From: Christoph Lameter @ 2005-08-17 23:58 UTC (permalink / raw)
  To: Peter Chubb; +Cc: Andrew Morton, torvalds, hugh, piggin, linux-mm

On Thu, 18 Aug 2005, Peter Chubb wrote:

> >>>>> "Andrew" == Andrew Morton <akpm@osdl.org> writes:
> 
> Andrew> The decreases in system CPU time for the single-threaded case
> Andrew> are extraordinarily high.  
> 
> Are the sizes of the test the same?  The unpatched version says 16G,
> the patched one 4G --- with a quarter the memory size I'd expect less
> than a quarter of the overhead...

Yup I screwed up.

Patched:

 Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
 16   3    1    0.859s     64.994s  65.084s 47768.542  47771.664
 16   3    2    0.682s     63.165s  33.097s 49269.255  92591.334
 16   3    4    0.632s     52.805s  17.061s 58866.320 178579.491
 16   3    8    0.683s     44.233s   8.074s 70034.218 359660.206
 16   3   16    0.666s     82.785s   8.052s 37694.972 368802.163
 16   3   32    1.301s    172.066s   8.085s 18144.775 355252.190
 16   3   64    4.958s    364.566s   9.054s  8512.883 329495.174
 16   3  128   20.006s    860.666s  11.000s  3571.958 285801.678
 16   3  256   12.773s    546.095s   6.071s  5628.745 468417.083
 16   3  512   14.547s    253.782s   3.053s 11723.346 889858.164

Tool used to measure this is at 
http://marc.theaimsgroup.com/?l=linux-kernel&m=109257807215046&w=2

The code for the test program follows the patch.
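In outline, a test of this kind mmaps a large anonymous region, splits it
among N threads, and has each thread write to every page of its slice while
wall and CPU time are measured. A minimal userspace sketch of that shape
(not the actual tool linked above; sizes and page size are hard-coded for
illustration) is:

/*
 * Sketch of an anonymous page fault microbenchmark -- not the referenced
 * program, just its general shape.  Each thread writes one byte to every
 * page of its slice of a large MAP_ANONYMOUS mapping, so every page is
 * populated by a fault.  Timing the create/join phase in the parent gives
 * faults per wall second, as in the tables above.
 */
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>

#define NTHREADS 4
#define BYTES	 (1UL << 30)		/* 1 GB; scale up for real runs */

static char *region;

static void *toucher(void *arg)
{
	unsigned long slice = BYTES / NTHREADS;
	char *p = region + (long)arg * slice;
	unsigned long off;

	for (off = 0; off < slice; off += 4096)
		p[off] = 1;		/* first write => page fault */
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];
	long i;

	region = mmap(NULL, BYTES, PROT_READ | PROT_WRITE,
		      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (region == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, toucher, (void *)i);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	printf("touched %lu pages\n", BYTES / 4096);
	return 0;
}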


* Re: pagefault scalability patches
  2005-08-17 22:17 pagefault scalability patches Andrew Morton
  2005-08-17 22:19 ` Christoph Lameter
  2005-08-17 22:36 ` Linus Torvalds
@ 2005-08-18  0:43 ` Andrew Morton
  2005-08-18 16:04   ` Christoph Lameter
                     ` (2 more replies)
  2005-08-18  2:00 ` Nick Piggin
  3 siblings, 3 replies; 47+ messages in thread
From: Andrew Morton @ 2005-08-18  0:43 UTC (permalink / raw)
  To: torvalds, hugh, clameter, piggin, linux-mm

Andrew Morton <akpm@osdl.org> wrote:
>
> I have vague feelings of ickiness with the patches wrt:
> 
>  a) general increase of complexity
> 
>  b) the fact that they only partially address the problem: anonymous page
>     faults are addressed, but lots of other places aren't.
> 
>  c) the fact that they address one particular part of one particular
>     workload on exceedingly rare machines.

d) the fact that some architectures will be using atomic pte ops and
   others will be using page_table_lock in core MM code.

   Using different locking/atomicity schemes in different architectures
   has obvious complexity and test coverage drawbacks.

   Is it still the case that some architectures must retain the
   page_table_lock approach because they use it to lock other arch-internal
   things?

* Re: pagefault scalability patches
  2005-08-17 23:58                 ` Christoph Lameter
@ 2005-08-18  0:47                   ` Andrew Morton
  2005-08-18 16:09                     ` Christoph Lameter
  0 siblings, 1 reply; 47+ messages in thread
From: Andrew Morton @ 2005-08-18  0:47 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: peterc, torvalds, hugh, piggin, linux-mm

Christoph Lameter <clameter@engr.sgi.com> wrote:
>
> On Thu, 18 Aug 2005, Peter Chubb wrote:
> 
> > >>>>> "Andrew" == Andrew Morton <akpm@osdl.org> writes:
> > 
> > Andrew> The decreases in system CPU time for the single-threaded case
> > Andrew> are extraordinarily high.  
> > 
> > Are the sizes of the test the same?  The unpatched version says 16G,
> > the patched one 4G --- with a quarter the memory size I'd expect less
> > than a quarter of the overhead...
> 
> Yup I screwed up.
> 
> Patched:
> 
>  Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
>  16   3    1    0.859s     64.994s  65.084s 47768.542  47771.664

Versus:

> Unpatched:
>
>  Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
>  16   3    1    0.757s     62.772s  63.052s 49515.393  49522.112

It got slower?



* Re: pagefault scalability patches
  2005-08-17 22:17 pagefault scalability patches Andrew Morton
                   ` (2 preceding siblings ...)
  2005-08-18  0:43 ` Andrew Morton
@ 2005-08-18  2:00 ` Nick Piggin
  2005-08-18  8:38   ` Nick Piggin
  3 siblings, 1 reply; 47+ messages in thread
From: Nick Piggin @ 2005-08-18  2:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Hugh Dickins, Christoph Lameter, Nick Piggin, linux-mm

Andrew Morton wrote:

>These are getting in the way now, and I need to make a go/no-go decision.
>
>I have vague feelings of ickiness with the patches wrt:
>
>a) general increase of complexity
>
>b) the fact that they only partially address the problem: anonymous page
>   faults are addressed, but lots of other places aren't.
>
>c) the fact that they address one particular part of one particular
>   workload on exceedingly rare machines.
>
>I believe that Nick has plans to address b).
>
>I'd like us to thrash this out (again), please.  Hugh, could you (for the
>nth and final time) describe your concerns with these patches?
>
>

That's true, I do have a more general API that gives a bit more
flexibility in the arch implementation, and allows complete removal
of ptl...

I'd like to get time to finish that up and get it working on ppc64
and see it in the kernel, however it is very intrusive (eg. does
things like remove ptl from around mmu gather operations).

Basically it is going to take a long time to get everyone on side
even if the patch was 100% ready today (which it isn't).

If the big ticket item is taking the ptl out of the anonymous fault
path, then we probably should forget my stuff and consider Christoph's
on its own merits.

FWIW, I don't think it is an unreasonable approach to solving the
problem at hand in a fairly unintrusive manner.



* Re: pagefault scalability patches
  2005-08-18  2:00 ` Nick Piggin
@ 2005-08-18  8:38   ` Nick Piggin
  2005-08-18 16:17     ` Christoph Lameter
  0 siblings, 1 reply; 47+ messages in thread
From: Nick Piggin @ 2005-08-18  8:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Hugh Dickins, Christoph Lameter, Nick Piggin, linux-mm

Nick Piggin wrote:

> If the big ticket item is taking the ptl out of the anonymous fault
> path, then we probably should forget my stuff

( for now :) )

> and consider Christoph's
> on its own merits.
> 
> FWIW, I don't think it is an unreasonable approach to solving the
> problem at hand in a fairly unintrusive manner.
> 

To be clear: by "it" I mean Christoph's patches, not mine.

-- 
SUSE Labs, Novell Inc.


* Re: pagefault scalability patches
  2005-08-18  0:43 ` Andrew Morton
@ 2005-08-18 16:04   ` Christoph Lameter
  2005-08-18 20:16   ` Hugh Dickins
  2005-08-22  2:09   ` pagefault scalability patches Benjamin Herrenschmidt
  2 siblings, 0 replies; 47+ messages in thread
From: Christoph Lameter @ 2005-08-18 16:04 UTC (permalink / raw)
  To: Andrew Morton; +Cc: torvalds, hugh, piggin, linux-mm

On Wed, 17 Aug 2005, Andrew Morton wrote:

> d) the fact that some architectures will be using atomic pte ops and
>    others will be using page_table_lock in core MM code.

We could generally go to atomic pte operations, but this would 
require extensive changes to all architectures. There is a tradeoff 
between atomic operations and using regular loads and stores on page table 
entries. If a number of page table entries have to be modified then it is 
advantageous to take a lock. If an individual entry is modified then it is 
better to do an atomic operation.
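
To make the tradeoff concrete, a single-entry update via cmpxchg looks
roughly like the sketch below (illustrative only, not code from the patches,
and assuming an architecture whose ptes fit in an unsigned long, i.e. no PAE):

/*
 * Sketch only: atomically replace one pte, succeeding only if it still
 * contains oldval.  A concurrent fault or MMU update makes this fail,
 * so the caller retries instead of silently losing the other update.
 */
static inline int sketch_ptep_cmpxchg(pte_t *ptep, pte_t oldval, pte_t newval)
{
	return cmpxchg((unsigned long *)ptep,
		       pte_val(oldval), pte_val(newval)) == pte_val(oldval);
}

/*
 * The lock-based alternative serializes the whole mm:
 *
 *	spin_lock(&mm->page_table_lock);
 *	... read, modify and write any number of ptes ...
 *	spin_unlock(&mm->page_table_lock);
 *
 * so once several entries are touched in one operation, a single lock
 * round trip can still be cheaper than one cmpxchg per entry.
 */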

>    Using different locking/atomicity schemes in different architectures
>    has obvious complexity and test coverage drawbacks.

We could require the same locking scheme for all architectures. Some 
architectures would then have to simulate the atomicity 
which would cause performance loss.


* Re: pagefault scalability patches
  2005-08-18  0:47                   ` Andrew Morton
@ 2005-08-18 16:09                     ` Christoph Lameter
  0 siblings, 0 replies; 47+ messages in thread
From: Christoph Lameter @ 2005-08-18 16:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: peterc, torvalds, hugh, piggin, linux-mm

On Wed, 17 Aug 2005, Andrew Morton wrote:

> > Patched:
> > 
> >  Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
> >  16   3    1    0.859s     64.994s  65.084s 47768.542  47771.664
> 
> Versus:
> 
> > Unpatched:
> >
> >  Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
> >  16   3    1    0.757s     62.772s  63.052s 49515.393  49522.112
> 
> It got slower?

For that sample, yes. There is a certain unpredictability that comes with NUMA 
systems: memory placement affects the tests. This is within the margin of error.

Another test shows just the opposite:

unpatched:
 Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
 16   3    1    0.735s     64.074s  64.083s 48537.993  48519.120
 16   3    2    0.773s     94.774s  49.046s 32923.047  63588.898
 16   3    4    0.717s     87.110s  29.092s 35816.846 105117.121
 16   3    8    0.677s    136.768s  21.069s 22886.951 145008.228
 16   3   16    0.757s    288.464s  23.045s 10876.524 134128.797
 16   3   32   13.612s    297.150s  23.034s 10122.600 134723.354
 16   3   64   60.201s    318.414s  27.048s  8308.505 114470.017
 16   3  128  279.422s    322.942s  41.063s  5222.299  75562.812
 16   3  256  280.823s    146.732s  28.073s  7357.466 109486.455
 16   3  512  282.124s     77.636s  24.023s  8743.940 129787.460

patched:

 Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
 16   3    1    0.702s     62.858s  63.056s 49491.809  49489.633
 16   3    2    0.734s     72.348s  38.004s 43043.199  82674.132
 16   3    4    0.718s     76.552s  25.012s 40710.056 125186.047
 16   3    8    0.782s     58.417s  12.020s 53137.972 257740.814
 16   3   16    1.534s     93.568s   9.092s 33077.207 316995.454
 16   3   32    3.297s    173.145s   9.078s 17828.534 321373.156
 16   3   64    9.001s    445.874s  11.064s  6915.569 270213.663
 16   3  128   27.157s   1500.321s  16.060s  2059.426 189481.849
 16   3  256   25.647s    762.183s   8.083s  3992.895 355973.645
 16   3  512   26.167s    407.595s   5.008s  7252.183 619054.581


* Re: pagefault scalability patches
  2005-08-18  8:38   ` Nick Piggin
@ 2005-08-18 16:17     ` Christoph Lameter
  2005-08-22  2:04       ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 47+ messages in thread
From: Christoph Lameter @ 2005-08-18 16:17 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Linus Torvalds, Hugh Dickins, Nick Piggin, linux-mm

On Thu, 18 Aug 2005, Nick Piggin wrote:

> Nick Piggin wrote:
> 
> > If the big ticket item is taking the ptl out of the anonymous fault
> > path, then we probably should forget my stuff
> 
> ( for now :) )

I think we can gradually work atomic operations into various code paths 
where this will be advantageous and your work may be a very important base 
to get there.

* Re: pagefault scalability patches
  2005-08-18  0:43 ` Andrew Morton
  2005-08-18 16:04   ` Christoph Lameter
@ 2005-08-18 20:16   ` Hugh Dickins
  2005-08-19  1:22     ` [PATCH] use mm_counter macros for nr_pte since its also under ptl Christoph Lameter
                       ` (2 more replies)
  2005-08-22  2:09   ` pagefault scalability patches Benjamin Herrenschmidt
  2 siblings, 3 replies; 47+ messages in thread
From: Hugh Dickins @ 2005-08-18 20:16 UTC (permalink / raw)
  To: Andrew Morton; +Cc: torvalds, clameter, piggin, linux-mm

On Wed, 17 Aug 2005, Andrew Morton wrote:
> Andrew Morton <akpm@osdl.org> wrote:
> >
> > I have vague feelings of ickiness with the patches wrt:
> > 
> >  a) general increase of complexity
> > 
> >  b) the fact that they only partially address the problem: anonymous page
> >     faults are addressed, but lots of other places aren't.
> > 
> >  c) the fact that they address one particular part of one particular
> >     workload on exceedingly rare machines.
> 
> d) the fact that some architectures will be using atomic pte ops and
>    others will be using page_table_lock in core MM code.
> 
>    Using different locking/atomicity schemes in different architectures
>    has obvious complexity and test coverage drawbacks.
> 
>    Is it still the case that some architectures must retain the
>    page_table_lock approach because they use it to lock other arch-internal
>    things?

With the addition of (d) you've got a good summary of my objections,
without my having to write a word - thank you.

I'll add scattered observations here.
Towards the end, I do have a constructive alternative (*).

There's a lot about atomic pte ops in this thread, but it's a pte
cmpxchg which do_anonymous_page has to do - if I remember PaulMcK's
bogroll rightly, cmpxchgs are extra bad news.

Christoph and Nick are keen to go further, deeper into the atomics
and cmpxchgs, away from the page table lock.  Is that sensible when
we have batch operations like zap_pte_range and copy_pte_range?

Christoph's current patch does _not_ increase the need for atomics
in those two (except for rss and anon_rss, but we'd do well to batch
those updates anyway - zap_pte_range's tlb stuff half does it).  It
does have to ptep_cmpxchg in mprotect and ptep_xchg in try_to_unmap,
but I guess neither of those is a serious worry.
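
By "batch those updates" I mean something of this shape (a sketch only,
not code from any of the patches; the actual pte teardown is elided):

	int freed = 0, freed_anon = 0;

	for (; addr < end; pte++, addr += PAGE_SIZE) {
		/* tear down *pte as zap_pte_range does; for each
		 * present page that goes away: */
		freed++;
		if (PageAnon(page))
			freed_anon++;
	}
	/* one counter update per batch instead of one per pte */
	add_mm_counter(mm, rss, -freed);
	add_mm_counter(mm, anon_rss, -freed_anon);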

Nobody (except me, in the last few days) has actually been testing
whether these patches do anything for page fault scalability since
they went into -mm.  Proof: if that's what you're testing, you very
soon hit the BUG_ON(mm->nr_ptes...) at the end of exit_mmap.  And
once you've worked your way through the architectural maze, you
realize that nr_ptes used to be protected by page_table_lock but
is currently unprotected when CONFIG_ATOMIC_TABLE_OPS.  (I fixed
that here by adding back page_table_lock around it, but Christoph
will probably prefer to go atomic with it; for people just testing
the scalability, it's okay to remove that BUG_ON for the moment.)

Christoph likes to assure us "No this is a general fix for anonymous
page faults on SMP machines.  As noted at the KS, other are seeing
similar performance problems".  Perhaps, but if so, they should be
speaking up, telling us Christoph's patches solve their problems,
and providing patches to convert their architectures over.

How many architectures have been converted to ATOMIC_TABLE_OPS
(could we call that ATOMIC_PAGE_TABLE_OPS?): just ia64, x86_64
and i386.  i386 being a joke, since it's only the non-PAE case
which is converted, yet surely anyone getting into a serious
number of cpus on i386 will be using PAE?

I may well be to blame for this.  Perhaps my hostility has
discouraged others from doing the work to add to what's there.
Certainly it was me who advised Christoph to drop the i386 PAE
support he originally had, since it was too ugly and buggy.

And it was probably my resistance to the per-task rss patch which
has led him to hold that back for now.  I think wisely, that is a
separate issue.  But from what Linus says, it does rather look like
we can't sensibly go forward with anonymous pte cmpxchging, without
a matching rss solution.

My resistance to the rss patch (then; I haven't seen a recent one) was all
this infrastructure for just a couple of counts which don't really
matter.  (There were three places in rmap.c which avoided rss 0 mms,
but that was a historic necessity: I've deleted those checks from the
rmap.c waiting in -mm.)  Can't we just let them be racy?

Plus, fear of tools looking into /proc for the rss of one of Christoph's
512-threaded processes, and each lookup of each thread of the process
examining every other task_struct of the process?  We need somehow to
prevent that, to look no further than the mm_struct in most cases.

(*) I realized that the time was coming to decide one way or the other
on these page fault scalability patches in -mm.  So I've spent the
last few days prototyping an alternative, to see how well it compares.

The thing I really like in Christoph's patches is not the cmpxchging,
but the narrowing of the page table lock.  There is very little need
to be holding it across the pgd->pud->pmd->pt traversals: in general
you can enter the do_..._page functions without acquiring page table
lock at all, entering them with a "speculative" entry which need only
be confirmed by pte_same once new page has been allocated.  (The
PAE case does need to be more careful about it, though it can still
avoid preliminary page table lock at least in do_anonymous_page case.)
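
In sketch form (illustrative only, neither my prototype nor Christoph's
patch; error paths and the arch-specific pte install are elided) the
anonymous fast path becomes:

	pte_t orig = *ptep;			/* speculative read, no lock held */
	struct page *page;

	if (!pte_none(orig))
		goto out;			/* raced with another fault */

	page = alloc_page(GFP_HIGHUSER);	/* may sleep, still no lock */
	if (!page)
		goto oom;
	clear_user_highpage(page, address);

	spin_lock(&mm->page_table_lock);	/* or a per-pagetable lock */
	if (pte_same(*ptep, orig)) {
		/* install mk_pte(page, vma->vm_page_prot) here,
		 * bump rss, add the page to the anon rmap */
	} else {
		__free_page(page);		/* somebody beat us to it */
	}
	spin_unlock(&mm->page_table_lock);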

This advantage applies to all architectures, though a few will need
a bit more research and care (the question you ask in your (d)).  In
most of the loops, it's mainly a matter of removing page_table_lock
from the outer level, and inserting it at the inner pte level (which
pleases the low latency people too).  It satisfies Linus' desire to
reduce the locking in the simple anonymous case, on all architectures.

Perhaps it was Nick who first pointed this out to me, or was it me to
him, I forget?  We simply have to be careful to unlink a vma from its
prio_tree and from its anon_vma in free_pgtables, before any page
tables are freed - that way, vmscan's rmap functions cannot reach
page tables on their way to being freed.

With the page table lock moved inward, we can then easily choose to
use a per-pagetable lock, to handle the page fault scalability issue
without departing far from our existing locking conventions.  Indeed,
I have a working prototype for that, but I don't have equipment to test
scalability on SGI's scale, and on my 2*HT*Xeons the best results are
coming from just narrowing the page table lock, not from splitting it.

I'm not ready.  The patches, as I say, are currently just prototypes,
and mix in the usual tidyups I cannot resist when hacking (e.g. why
does almost every do_..._page have its arguments in a different order?).
I was intent on getting numbers, hoping to find that the numbers which
emerge from this would be clearly better than with Christoph's patches.
But no, they're comparable (on my puny machines), in some cases one
or the other better (of course mine do work out better for PAE, and
presumably for any non-ATOMIC_TABLE_OPS architectures, but you could
say I loaded the dice against Christoph there).

I find proceeding in this way easier to understand, and would myself
prefer Christoph's patches removed from -mm, so we can build the
narrower page_table_lock solution there, then see what works best
as a scalability solution on top - per-pagetable locking, or pte
cmpxchging.  But we all find our own ways easier to understand.

You might like me to post my patch for testing (not for merging into
any tree at this stage): please give me a couple of days to jiggle
around with it first.

Nick (if you've got this far), you mention in one of your mails of this
thread, that you remove page_table_lock from around the tlb mmu_gather
stuff: yes, me too, but I did it with less awareness, and your comment
makes me realize it needs a little more care.  Did you actually find
a problem with that (beyond needing preempt_disable, which ought to go
into them anyway) on some architecture, or were you just voicing caution?

Hugh

* [PATCH] use mm_counter macros for nr_pte since its also under ptl
  2005-08-18 20:16   ` Hugh Dickins
@ 2005-08-19  1:22     ` Christoph Lameter
  2005-08-19  3:17       ` Andrew Morton
  2005-08-19  1:33     ` pagefault scalability patches Christoph Lameter
  2005-08-19  3:53     ` [RFC] Concept for delayed counter updates in mm_struct Christoph Lameter
  2 siblings, 1 reply; 47+ messages in thread
From: Christoph Lameter @ 2005-08-19  1:22 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, torvalds, piggin, linux-mm

On Thu, 18 Aug 2005, Hugh Dickins wrote:

> they went into -mm.  Proof: if that's what you're testing, you very
> soon hit the BUG_ON(mm->nr_ptes...) at the end of exit_mmap.  And
> once you've worked your way through the architectural maze, you
> realize that nr_ptes used to be protected by page_table_lock but
> is currently unprotected when CONFIG_ATOMIC_TABLE_OPS.  (I fixed
> that here by adding back page_table_lock around it, but Christoph
> will probably prefer to go atomic with it; for people just testing
> the scalability, it's okay to remove that BUG_ON for the moment.)

Ah thanks.

Actually this is a bug already present in Linus' tree (but still my 
fault). nr_ptes needs to be managed through the mm counter macros like
the other counters protected by the page_table_lock. 

This is a patch against Linus' current tree and independent of the page 
fault scalability patches.

---

Make nr_ptes an mm_counter.

nr_ptes is also protected by the page_table_lock, like rss and anon_rss. This patch
changes all accesses to nr_ptes to use the macros provided for that purpose.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.13-rc6/include/linux/sched.h
===================================================================
--- linux-2.6.13-rc6.orig/include/linux/sched.h	2005-08-07 11:18:56.000000000 -0700
+++ linux-2.6.13-rc6/include/linux/sched.h	2005-08-18 18:10:28.000000000 -0700
@@ -238,9 +238,10 @@ struct mm_struct {
 	unsigned long start_brk, brk, start_stack;
 	unsigned long arg_start, arg_end, env_start, env_end;
 	unsigned long total_vm, locked_vm, shared_vm;
-	unsigned long exec_vm, stack_vm, reserved_vm, def_flags, nr_ptes;
+	unsigned long exec_vm, stack_vm, reserved_vm, def_flags;
 
 	/* Special counters protected by the page_table_lock */
+	mm_counter_t _nr_ptes;
 	mm_counter_t _rss;
 	mm_counter_t _anon_rss;
 
Index: linux-2.6.13-rc6/fs/proc/task_mmu.c
===================================================================
--- linux-2.6.13-rc6.orig/fs/proc/task_mmu.c	2005-08-07 11:18:56.000000000 -0700
+++ linux-2.6.13-rc6/fs/proc/task_mmu.c	2005-08-18 18:10:28.000000000 -0700
@@ -27,7 +27,7 @@ char *task_mem(struct mm_struct *mm, cha
 		get_mm_counter(mm, rss) << (PAGE_SHIFT-10),
 		data << (PAGE_SHIFT-10),
 		mm->stack_vm << (PAGE_SHIFT-10), text, lib,
-		(PTRS_PER_PTE*sizeof(pte_t)*mm->nr_ptes) >> 10);
+		(PTRS_PER_PTE*sizeof(pte_t)*get_mm_counter(mm, nr_ptes)) >> 10);
 	return buffer;
 }
 
Index: linux-2.6.13-rc6/kernel/fork.c
===================================================================
--- linux-2.6.13-rc6.orig/kernel/fork.c	2005-08-07 11:18:56.000000000 -0700
+++ linux-2.6.13-rc6/kernel/fork.c	2005-08-18 18:10:28.000000000 -0700
@@ -320,7 +320,7 @@ static struct mm_struct * mm_init(struct
 	init_rwsem(&mm->mmap_sem);
 	INIT_LIST_HEAD(&mm->mmlist);
 	mm->core_waiters = 0;
-	mm->nr_ptes = 0;
+	set_mm_counter(mm, nr_ptes, 0);
 	spin_lock_init(&mm->page_table_lock);
 	rwlock_init(&mm->ioctx_list_lock);
 	mm->ioctx_list = NULL;
Index: linux-2.6.13-rc6/mm/memory.c
===================================================================
--- linux-2.6.13-rc6.orig/mm/memory.c	2005-08-07 11:18:56.000000000 -0700
+++ linux-2.6.13-rc6/mm/memory.c	2005-08-18 18:10:28.000000000 -0700
@@ -116,7 +116,7 @@ static void free_pte_range(struct mmu_ga
 	pmd_clear(pmd);
 	pte_free_tlb(tlb, page);
 	dec_page_state(nr_page_table_pages);
-	tlb->mm->nr_ptes--;
+	dec_mm_counter(tlb->mm, nr_ptes);
 }
 
 static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
@@ -299,7 +299,7 @@ pte_t fastcall *pte_alloc_map(struct mm_
 			pte_free(new);
 			goto out;
 		}
-		mm->nr_ptes++;
+		inc_mm_counter(mm, nr_ptes);
 		inc_page_state(nr_page_table_pages);
 		pmd_populate(mm, pmd, new);
 	}
Index: linux-2.6.13-rc6/mm/mmap.c
===================================================================
--- linux-2.6.13-rc6.orig/mm/mmap.c	2005-08-07 11:18:56.000000000 -0700
+++ linux-2.6.13-rc6/mm/mmap.c	2005-08-18 18:10:28.000000000 -0700
@@ -1969,7 +1969,7 @@ void exit_mmap(struct mm_struct *mm)
 		vma = next;
 	}
 
-	BUG_ON(mm->nr_ptes > (FIRST_USER_ADDRESS+PMD_SIZE-1)>>PMD_SHIFT);
+	BUG_ON(get_mm_counter(mm, nr_ptes) > (FIRST_USER_ADDRESS+PMD_SIZE-1)>>PMD_SHIFT);
 }
 
 /* Insert vm structure into process list sorted by address
Index: linux-2.6.13-rc6/arch/um/kernel/skas/mmu.c
===================================================================
--- linux-2.6.13-rc6.orig/arch/um/kernel/skas/mmu.c	2005-08-07 11:18:56.000000000 -0700
+++ linux-2.6.13-rc6/arch/um/kernel/skas/mmu.c	2005-08-18 18:10:28.000000000 -0700
@@ -115,7 +115,7 @@ int init_new_context_skas(struct task_st
 		if(ret)
 			goto out_free;
 
-		mm->nr_ptes--;
+		dec_mm_counter(mm, nr_ptes);
 
 		if((cur_mm != NULL) && (cur_mm != &init_mm))
 			mm_id->u.pid = copy_context_skas0(stack,

* Re: pagefault scalability patches
  2005-08-18 20:16   ` Hugh Dickins
  2005-08-19  1:22     ` [PATCH] use mm_counter macros for nr_pte since its also under ptl Christoph Lameter
@ 2005-08-19  1:33     ` Christoph Lameter
  2005-08-19  3:53     ` [RFC] Concept for delayed counter updates in mm_struct Christoph Lameter
  2 siblings, 0 replies; 47+ messages in thread
From: Christoph Lameter @ 2005-08-19  1:33 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, torvalds, piggin, linux-mm

On Thu, 18 Aug 2005, Hugh Dickins wrote:

> There's a lot about atomic pte ops in this thread, but it's a pte
> cmpxchg which do_anonymous_page has to do - if I remember PaulMcK's
> bogroll rightly, cmpxchgs are extra bad news.

Same badness as spin_lock, yes, but they serialize for extremely small time 
periods. So they are better than a spinlock.

> Christoph and Nick are keen to go further, deeper into the atomics
> and cmpxchgs, away from the page table lock.  Is that sensible when
> we have batch operations like zap_pte_range and copy_pte_range?

I did a batch faulting scheme last year too. See 
http://marc.theaimsgroup.com/?l=linux-kernel&m=110488578521535&w=2

> How many architectures have been converted to ATOMIC_TABLE_OPS
> (could we call that ATOMIC_PAGE_TABLE_OPS?): just ia64, x86_64
> and i386.  i386 being a joke, since it's only the non-PAE case
> which is converted, yet surely anyone getting into a serious
> number of cpus on i386 will be using PAE?

Right. This is just a start. If it were in the kernel then other 
people would do the work, as I have heard repeatedly. Chicken and egg.

> I may well be to blame for this.  Perhaps my hostility has
> discouraged others from doing the work to add to what's there.
> Certainly it was me who advised Christoph to drop the i386 PAE
> support he originally had, since it was too ugly and buggy.

PAE support can be added within the framework provided by these 
patches.

> And it was probably my resistance to the per-task rss patch which
> has led him to hold that back for now.  I think wisely, that is a
> separate issue.  But from what Linus says, it does rather look like
> we can't sensibly go forward with anonymous pte cmpxchging, without
> a matching rss solution.

I am working on getting the bit rot out of my old patches. This is going 
to take a few days.

> matter.  (There were three places in rmap.c which avoided rss 0 mms,
> but that was a historic necessity: I've deleted those checks from the
> rmap.c waiting in -mm.)  Can't we just let them be racy?

It's great that these are gone. I just tried to find them and was happy to 
discover they were already gone.

> With the page table lock moved inward, we can then easily choose to
> use a per-pagetable lock, to handle the page fault scalability issue
> without departing far from our existing locking conventions.  Indeed,
> I have a working prototype for that, but I don't have equipment to test
> scalability on SGI's scale, and on my 2*HT*Xeons the best results are
> coming from just narrowing the page table lock, not from splitting it.

I have tried that last year too. I thought you helped me see the light on 
the futility of that approach?

> I find proceeding in this way easier to understand, and would myself
> prefer Christoph's patches removed from -mm, so we can build the
> narrower page_table_lock solution there, then see what works best
> as a scalability solution on top - per-pagetable locking, or pte
> cmpxchging.  But we all find our own ways easier to understand.

Oh no. We have been there before, and I fear that if this gets removed then there 
will be no progress for a long time, like before. Please at least leave the 
first patch in, which provides an infrastructure for atomic pte operations 
that may then be deployed in a variety of ways and be useful for 
approaches that Hugh or Nick may come up with.

> You might like me to post my patch for testing (not for merging into
> any tree at this stage): please give me a couple of days to jiggle
> around with it first.

I'd be interested to see if you can really come up with anything that we 
have not tried yet.

* Re: [PATCH] use mm_counter macros for nr_pte since its also under ptl
  2005-08-19  1:22     ` [PATCH] use mm_counter macros for nr_pte since its also under ptl Christoph Lameter
@ 2005-08-19  3:17       ` Andrew Morton
  2005-08-19  3:51         ` Christoph Lameter
  0 siblings, 1 reply; 47+ messages in thread
From: Andrew Morton @ 2005-08-19  3:17 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: hugh, torvalds, piggin, linux-mm

Christoph Lameter <clameter@engr.sgi.com> wrote:
>
> Actually this is a bug already present in Linus' tree (but still my 
>  fault). nr_ptes needs to be managed through the mm counter macros like
>  the other counters protected by the page_table_lock. 

Does that mean that Linus's tree can presently go BUG?

* Re: [PATCH] use mm_counter macros for nr_pte since its also under ptl
  2005-08-19  3:17       ` Andrew Morton
@ 2005-08-19  3:51         ` Christoph Lameter
  0 siblings, 0 replies; 47+ messages in thread
From: Christoph Lameter @ 2005-08-19  3:51 UTC (permalink / raw)
  To: Andrew Morton; +Cc: hugh, torvalds, piggin, linux-mm

On Thu, 18 Aug 2005, Andrew Morton wrote:

> Christoph Lameter <clameter@engr.sgi.com> wrote:
> >
> > Actually this is a bug already present in Linus' tree (but still my 
> >  fault). nr_ptes needs to be managed through the mm counter macros like
> >  the other counters protected by the page_table_lock. 
> 
> Does that mean that Linus's tree can presently go BUG?
> 
No, because the mm_counter macros do nothing special at this point. It is just an 
uncleanness if the page fault patches are not applied.



* [RFC] Concept for delayed counter updates in mm_struct
  2005-08-18 20:16   ` Hugh Dickins
  2005-08-19  1:22     ` [PATCH] use mm_counter macros for nr_pte since its also under ptl Christoph Lameter
  2005-08-19  1:33     ` pagefault scalability patches Christoph Lameter
@ 2005-08-19  3:53     ` Christoph Lameter
  2005-08-19  4:29       ` Andrew Morton
  2 siblings, 1 reply; 47+ messages in thread
From: Christoph Lameter @ 2005-08-19  3:53 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, torvalds, nickpiggin, linux-mm

I think there may be an easier way of avoiding atomic increments
if the page_table_lock is not used than methods that I had proposed last 
year (building lists of task_structs).

If we keep deltas in the task_struct then we can at some later point add 
those to an mm_struct (by calling mm_counter_catchup(mm)).

The main problem in the past with using current for rss information were 
primarily concerns with get_user_pages(). I hope that the approach here 
solves the issues neatly. get_user_pages() first shifts any deltas into 
the current->mm. Then it does the handle_mm_fault() thing which may 
accumulate new deltas in current. These are stuffed into the target mm 
after the page_table_lock has been acquired.

What is missing in this patch are points where mm_counter_catchup can be called.
These points must be code where the page table lock is held. One way of providing
these would be to call mm_counter_catchup when a task is in the scheduler.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.13-rc6/kernel/fork.c
===================================================================
--- linux-2.6.13-rc6.orig/kernel/fork.c	2005-08-18 18:10:28.000000000 -0700
+++ linux-2.6.13-rc6/kernel/fork.c	2005-08-18 20:34:14.000000000 -0700
@@ -173,6 +173,9 @@ static struct task_struct *dup_task_stru
 	*tsk = *orig;
 	tsk->thread_info = ti;
 	ti->task = tsk;
+	tsk->delta_rss = 0;
+	tsk->delta_anon_rss = 0;
+	tsk->delta_nr_ptes = 0;
 
 	/* One for us, one for whoever does the "release_task()" (usually parent) */
 	atomic_set(&tsk->usage,2);
Index: linux-2.6.13-rc6/include/linux/sched.h
===================================================================
--- linux-2.6.13-rc6.orig/include/linux/sched.h	2005-08-18 18:10:28.000000000 -0700
+++ linux-2.6.13-rc6/include/linux/sched.h	2005-08-18 20:15:50.000000000 -0700
@@ -604,6 +604,15 @@ struct task_struct {
 	unsigned long flags;	/* per process flags, defined below */
 	unsigned long ptrace;
 
+	/*
+	 * The counters in the mm_struct require the page table lock
+	 * These deltas here accumulate changes that are later folded
+	 * into the corresponding mm_struct counters
+	 */
+	long delta_rss;
+	long delta_anon_rss;
+	long delta_nr_ptes;
+
 	int lock_depth;		/* BKL lock depth */
 
 #if defined(CONFIG_SMP) && defined(__ARCH_WANT_UNLOCKED_CTXSW)
@@ -1347,6 +1356,23 @@ static inline void thaw_processes(void) 
 static inline int try_to_freeze(void) { return 0; }
 
 #endif /* CONFIG_PM */
+
+/*
+ * Update mm_struct counters with deltas from task_struct.
+ * Must be called with the page_table_lock held.
+ */
+inline static void mm_counter_catchup(struct mm_struct *mm)
+{
+	if (unlikely(current->delta_rss | current->delta_anon_rss | current->delta_nr_ptes)) {
+		add_mm_counter(mm, rss, current->delta_rss);
+		add_mm_counter(mm, anon_rss, current->delta_anon_rss);
+		add_mm_counter(mm, nr_ptes, current->delta_nr_ptes);
+		current->delta_rss = 0;
+		current->delta_anon_rss = 0;
+		current->delta_nr_ptes = 0;
+	}
+}
+
 #endif /* __KERNEL__ */
 
 #endif
Index: linux-2.6.13-rc6/mm/memory.c
===================================================================
--- linux-2.6.13-rc6.orig/mm/memory.c	2005-08-18 18:10:28.000000000 -0700
+++ linux-2.6.13-rc6/mm/memory.c	2005-08-18 20:33:37.000000000 -0700
@@ -299,7 +299,7 @@ pte_t fastcall *pte_alloc_map(struct mm_
 			pte_free(new);
 			goto out;
 		}
-		inc_mm_counter(mm, nr_ptes);
+		current->delta_nr_ptes++;
 		inc_page_state(nr_page_table_pages);
 		pmd_populate(mm, pmd, new);
 	}
@@ -892,6 +892,13 @@ int get_user_pages(struct task_struct *t
 	flags &= force ? (VM_MAYREAD | VM_MAYWRITE) : (VM_READ | VM_WRITE);
 	i = 0;
 
+	if (mm != current->mm) {
+		/* Insure that there are no deltas for current->mm */
+		spin_lock(&current->mm->page_table_lock);
+		mm_counter_catchup(current->mm);
+		spin_unlock(&current->mm->page_table_lock);
+	}
+
 	do {
 		struct vm_area_struct *	vma;
 
@@ -989,6 +996,12 @@ int get_user_pages(struct task_struct *t
 					BUG();
 				}
 				spin_lock(&mm->page_table_lock);
+				/*
+				 * Update any counters in the mm handled so that
+				 * they are not reflected in the mm of the running
+				 * process
+				 */
+				mm_counter_catchup(mm);
 			}
 			if (pages) {
 				pages[i] = page;
@@ -1778,7 +1791,7 @@ do_anonymous_page(struct mm_struct *mm, 
 			spin_unlock(&mm->page_table_lock);
 			goto out;
 		}
-		inc_mm_counter(mm, rss);
+		current->delta_rss++;
 		entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
 							 vma->vm_page_prot)),
 				      vma);
Index: linux-2.6.13-rc6/mm/rmap.c
===================================================================
--- linux-2.6.13-rc6.orig/mm/rmap.c	2005-08-07 11:18:56.000000000 -0700
+++ linux-2.6.13-rc6/mm/rmap.c	2005-08-18 20:15:08.000000000 -0700
@@ -448,7 +448,7 @@ void page_add_anon_rmap(struct page *pag
 	BUG_ON(PageReserved(page));
 	BUG_ON(!anon_vma);
 
-	inc_mm_counter(vma->vm_mm, anon_rss);
+	current->delta_anon_rss++;
 
 	anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
 	index = (address - vma->vm_start) >> PAGE_SHIFT;

* Re: [RFC] Concept for delayed counter updates in mm_struct
  2005-08-19  3:53     ` [RFC] Concept for delayed counter updates in mm_struct Christoph Lameter
@ 2005-08-19  4:29       ` Andrew Morton
  2005-08-19  4:34         ` Andi Kleen
  2005-08-19  4:49         ` Linus Torvalds
  0 siblings, 2 replies; 47+ messages in thread
From: Andrew Morton @ 2005-08-19  4:29 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: hugh, torvalds, nickpiggin, linux-mm

Christoph Lameter <clameter@engr.sgi.com> wrote:
>
> I think there may be an easier way of avoiding atomic increments
> if the page_table_lock is not used than methods that I had proposed last 
> year (building lists of task_structs).
> 
> If we keep deltas in the task_struct then we can at some later point add 
> those to an mm_struct (by calling mm_counter_catchup(mm)).
> 
> The main problem in the past with using current for rss information were 
> primarily concerns with get_user_pages(). I hope that the approach here 
> solves the issues neatly. get_user_pages() first shifts any deltas into 
> the current->mm. Then it does the handle_mm_fault() thing which may 
> accumulate new deltas in current. These are stuffed into the target mm 
> after the page_table_lock has been acquired.
> 
> What is missing in this patch are points where mm_counter_catchup can be called.
> These points must be code where the page table lock is held. One way of providing
> these would be to call mm_counter_catchup when a task is in the scheduler.
> 

That sounds sane.

> 
> Index: linux-2.6.13-rc6/kernel/fork.c
> ===================================================================
> --- linux-2.6.13-rc6.orig/kernel/fork.c	2005-08-18 18:10:28.000000000 -0700
> +++ linux-2.6.13-rc6/kernel/fork.c	2005-08-18 20:34:14.000000000 -0700
> @@ -173,6 +173,9 @@ static struct task_struct *dup_task_stru
>  	*tsk = *orig;
>  	tsk->thread_info = ti;
>  	ti->task = tsk;
> +	tsk->delta_rss = 0;
> +	tsk->delta_anon_rss = 0;
> +	tsk->delta_nr_ptes = 0;
>  
>  	/* One for us, one for whoever does the "release_task()" (usually parent) */
>  	atomic_set(&tsk->usage,2);
> Index: linux-2.6.13-rc6/include/linux/sched.h
> ===================================================================
> --- linux-2.6.13-rc6.orig/include/linux/sched.h	2005-08-18 18:10:28.000000000 -0700
> +++ linux-2.6.13-rc6/include/linux/sched.h	2005-08-18 20:15:50.000000000 -0700
> @@ -604,6 +604,15 @@ struct task_struct {
>  	unsigned long flags;	/* per process flags, defined below */
>  	unsigned long ptrace;
>  
> +	/*
> +	 * The counters in the mm_struct require the page table lock
> +	 * These deltas here accumulate changes that are later folded
> +	 * into the corresponding mm_struct counters
> +	 */
> +	long delta_rss;
> +	long delta_anon_rss;
> +	long delta_nr_ptes;
> +
>  	int lock_depth;		/* BKL lock depth */
>  
>  #if defined(CONFIG_SMP) && defined(__ARCH_WANT_UNLOCKED_CTXSW)
> @@ -1347,6 +1356,23 @@ static inline void thaw_processes(void) 
>  static inline int try_to_freeze(void) { return 0; }
>  
>  #endif /* CONFIG_PM */
> +
> +/*
> + * Update mm_struct counters with deltas from task_struct.
> + * Must be called with the page_table_lock held.
> + */
> +inline static void mm_counter_catchup(struct mm_struct *mm)

`static inline', please.

> +{
> +	if (unlikely(current->delta_rss | current->delta_anon_rss | current->delta_nr_ptes)) {
> +		add_mm_counter(mm, rss, current->delta_rss);
> +		add_mm_counter(mm, anon_rss, current->delta_anon_rss);
> +		add_mm_counter(mm, nr_ptes, current->delta_nr_ptes);
> +		current->delta_rss = 0;
> +		current->delta_anon_rss = 0;
> +		current->delta_nr_ptes = 0;
> +	}
> +}

This looks way too big to be inlined.

Also, evaluation of `current' takes ~14 bytes of code on x86 and sometimes
the compiler doesn't CSE it.  This is why we often do

	struct task_struct *tsk = current;

	<use tsk>

> +	if (mm != current->mm) {
> +		/* Insure that there are no deltas for current->mm */

"Ensure" ;)

> @@ -989,6 +996,12 @@ int get_user_pages(struct task_struct *t
>  					BUG();
>  				}
>  				spin_lock(&mm->page_table_lock);
> +				/*
> +				 * Update any counters in the mm handled so that
> +				 * they are not reflected in the mm of the running
> +				 * process
> +				 */

Is ptrace->get_user_pages() the only place where one process pokes
at another process's memory?  I think so..

* Re: [RFC] Concept for delayed counter updates in mm_struct
  2005-08-19  4:29       ` Andrew Morton
@ 2005-08-19  4:34         ` Andi Kleen
  2005-08-19  4:49         ` Linus Torvalds
  1 sibling, 0 replies; 47+ messages in thread
From: Andi Kleen @ 2005-08-19  4:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Christoph Lameter, hugh, torvalds, nickpiggin, linux-mm

> Is ptrace->get_user_pages() the only place where one process pokes
> at another process's memory?  I think so..

/proc/*/{cmdline,env}. But they all use get_user_pages()

-Andi


* Re: [RFC] Concept for delayed counter updates in mm_struct
  2005-08-19  4:29       ` Andrew Morton
  2005-08-19  4:34         ` Andi Kleen
@ 2005-08-19  4:49         ` Linus Torvalds
  2005-08-19 16:06           ` Christoph Lameter
                             ` (2 more replies)
  1 sibling, 3 replies; 47+ messages in thread
From: Linus Torvalds @ 2005-08-19  4:49 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Christoph Lameter, hugh, nickpiggin, linux-mm


On Thu, 18 Aug 2005, Andrew Morton wrote:
> Christoph Lameter <clameter@engr.sgi.com> wrote:
>
> > > What is missing in this patch are points where mm_counter_catchup can be called.
> > These points must be code where the page table lock is held. One way of providing
> > these would be to call mm_counter_catchup when a task is in the scheduler.
> > 
> 
> That sounds sane.

But that patch doesn't work.

There's no locking around the scheduler. It's all per-CPU, and the only 
exclusivity is in the per-rq locking.

So if you gather the mm counters in the scheduler, you'd need to do it all 
with atomic ops. But you're still using the non-atomic add_mm_counter..

So you need to make those mm counters really atomic now. 

		Linus

* Re: [RFC] Concept for delayed counter updates in mm_struct
  2005-08-19  4:49         ` Linus Torvalds
@ 2005-08-19 16:06           ` Christoph Lameter
  2005-08-20  7:33           ` [PATCH] mm_struct counter deltas in task_struct Christoph Lameter
  2005-08-20  7:35           ` [PATCH] Use deltas to replace atomic inc Christoph Lameter
  2 siblings, 0 replies; 47+ messages in thread
From: Christoph Lameter @ 2005-08-19 16:06 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andrew Morton, hugh, nickpiggin, linux-mm

On Thu, 18 Aug 2005, Linus Torvalds wrote:

> On Thu, 18 Aug 2005, Andrew Morton wrote:
> > Christoph Lameter <clameter@engr.sgi.com> wrote:
> >
> > > What is missing in this patch are points where mm_counter_catchup can be called.
> > > These points must be code where the page table lock is held. One way of providing
> > > these would be to call mm_counter_catchup when a task is in the scheduler.
> > That sounds sane.
> 
> But that patch doesn't work.
> 
> There's no locking around the scheduler. It's all per-CPU, and the only 
> exclusivity is in the per-rq locking.
> 
> So if you gather the mm counters in the scheduler, you'd need to do it all 
> with atomic ops. But you're still using the non-atomic add_mm_counter..

We can check the deltas and, if they are nonzero, take the page table lock 
and update the counters. If this is too much effort then we need to find 
some other place where the page table lock is already taken.
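
Roughly, the first option would look like this (a sketch only, using the
delta fields and the mm_counter_catchup() helper from the RFC):

	/* fold any pending deltas opportunistically, never spin here */
	if (current->delta_rss | current->delta_anon_rss | current->delta_nr_ptes) {
		if (spin_trylock(&current->mm->page_table_lock)) {
			mm_counter_catchup(current->mm);
			spin_unlock(&current->mm->page_table_lock);
		}
		/* otherwise leave the deltas for the next opportunity */
	}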

> So you need to make those mm counters really atomic now. 

That's what we are trying to avoid.

* [PATCH] mm_struct counter deltas in task_struct
  2005-08-19  4:49         ` Linus Torvalds
  2005-08-19 16:06           ` Christoph Lameter
@ 2005-08-20  7:33           ` Christoph Lameter
  2005-08-20  7:35           ` [PATCH] Use deltas to replace atomic inc Christoph Lameter
  2 siblings, 0 replies; 47+ messages in thread
From: Christoph Lameter @ 2005-08-20  7:33 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andrew Morton, hugh, nickpiggin, linux-mm

Introduce counter deltas in the task_struct. Instead of updating
the counters in the mm_struct via inc_mm_counter() etc., one may now use
inc_mm_delta(), which will increment a delta in the task_struct.
The delta is later folded into the mm_struct counter during schedule().
The advantage is that the operations on the deltas do not need any locks.

The delta counters may be used for a variety of purposes outside of the
page fault scalability patchset (for example, it may be possible to switch the 
existing tlb "freed" handling to use this method).

The method to fold the counters in schedule() may require some scrutiny. We only
take the page_table_lock if it is available; otherwise counter updates are deferred
until the next schedule(). If the page_table_lock is busy for extended time periods
then lots of deltas may accumulate. The reported RSS visible through /proc may lag
a bit as a result. One may want to add other points where the mm counters will
be updated.

The main problems in the past with using current to store mm_struct counter
information were primarily concerns with get_user_pages(). The approach here
solves the issues in the following way:

get_user_pages() first shifts any deltas into the current->mm. Then it does
the handle_mm_fault() thing which may accumulate new deltas in current.
These are stuffed into the target mm after the page_table_lock has been
acquired.

This patch only introduces the counter deltas and is independent of the
page fault scalabilty patches. It does not make the page fault
scalability patchset use the deltas.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.13-rc6-mm1/include/linux/sched.h
===================================================================
--- linux-2.6.13-rc6-mm1.orig/include/linux/sched.h	2005-08-19 11:47:48.000000000 -0700
+++ linux-2.6.13-rc6-mm1/include/linux/sched.h	2005-08-19 23:36:22.000000000 -0700
@@ -265,6 +265,16 @@ typedef atomic_t mm_counter_t;
 typedef unsigned long mm_counter_t;
 #endif
 
+/*
+ * mm_counter operations through the deltas in task_struct
+ * that do not require holding the page_table_lock.
+ */
+#define inc_mm_delta(member) current->delta_##member++
+#define dec_mm_delta(member) current->delta_##member--
+
+#define mm_counter_updates_pending(__p) \
+	((__p)->delta_nr_ptes | (__p)->delta_rss | (__p)->delta_anon_rss)
+
 struct mm_struct {
 	struct vm_area_struct * mmap;		/* list of VMAs */
 	struct rb_root mm_rb;
@@ -700,6 +710,15 @@ struct task_struct {
 
 	struct mm_struct *mm, *active_mm;
 
+	/*
+	 * Deltas for corresponding counters in mm_struct which require
+	 * the page_table_lock. The deltas may be updated and are later
+	 * folded into the corresponding mm_struct counters.
+	 */
+	long delta_rss;
+	long delta_anon_rss;
+	long delta_nr_ptes;
+
 /* task state */
 	struct linux_binfmt *binfmt;
 	long exit_state;
@@ -1417,6 +1436,9 @@ static inline void thaw_processes(void) 
 static inline int try_to_freeze(void) { return 0; }
 
 #endif /* CONFIG_PM */
+
+extern void mm_counter_catchup(struct task_struct *t, struct mm_struct *mm);
+
 #endif /* __KERNEL__ */
 
 #endif
Index: linux-2.6.13-rc6-mm1/kernel/fork.c
===================================================================
--- linux-2.6.13-rc6-mm1.orig/kernel/fork.c	2005-08-19 11:47:24.000000000 -0700
+++ linux-2.6.13-rc6-mm1/kernel/fork.c	2005-08-19 23:34:42.000000000 -0700
@@ -173,6 +173,9 @@ static struct task_struct *dup_task_stru
 	*ti = *orig->thread_info;
 	*tsk = *orig;
 	tsk->thread_info = ti;
+	tsk->delta_rss = 0;
+	tsk->delta_anon_rss = 0;
+	tsk->delta_nr_ptes = 0;
 	ti->task = tsk;
 
 	/* One for us, one for whoever does the "release_task()" (usually parent) */
@@ -427,6 +430,13 @@ void mm_release(struct task_struct *tsk,
 {
 	struct completion *vfork_done = tsk->vfork_done;
 
+	/* If we are still carrying deltas then apply them */
+	if (mm && mm_counter_updates_pending(tsk)) {
+		spin_lock(&mm->page_table_lock);
+		mm_counter_catchup(tsk, mm);
+		spin_unlock(&mm->page_table_lock);
+	}
+
 	/* Get rid of any cached register state */
 	deactivate_mm(tsk, mm);
 
Index: linux-2.6.13-rc6-mm1/mm/memory.c
===================================================================
--- linux-2.6.13-rc6-mm1.orig/mm/memory.c	2005-08-19 11:45:27.000000000 -0700
+++ linux-2.6.13-rc6-mm1/mm/memory.c	2005-08-19 23:34:42.000000000 -0700
@@ -886,6 +886,21 @@ untouched_anonymous_page(struct mm_struc
 	return 0;
 }
 
+/*
+ * Update the mm_struct counters protected by
+ * the page_table_lock using the deltas in the task_struct.
+ * Must hold the page_table_lock.
+ */
+void mm_counter_catchup(struct task_struct *t, struct mm_struct *mm)
+{
+	add_mm_counter(mm, rss, t->delta_rss);
+	add_mm_counter(mm, anon_rss, t->delta_anon_rss);
+	add_mm_counter(mm, nr_ptes, t->delta_nr_ptes);
+	t->delta_rss = 0;
+	t->delta_anon_rss = 0;
+	t->delta_nr_ptes = 0;
+}
+
 int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		unsigned long start, int len, int write, int force,
 		struct page **pages, struct vm_area_struct **vmas)
@@ -893,7 +908,18 @@ int get_user_pages(struct task_struct *t
 	int i;
 	unsigned int flags;
 
-	/* 
+	if (mm != current->mm && mm_counter_updates_pending(current)) {
+		/*
+		 * Access to a foreign mm requires us first to bring
+		 * the counters in our mm up to date with counter deltas
+		 * so that we do not carry any deltas.
+		 */
+		spin_lock(&current->mm->page_table_lock);
+		mm_counter_catchup(current, current->mm);
+		spin_unlock(&current->mm->page_table_lock);
+	}
+
+	/*
 	 * Require read or write permissions.
 	 * If 'force' is set, we only require the "MAY" flags.
 	 */
@@ -1011,6 +1037,15 @@ int get_user_pages(struct task_struct *t
 			start += PAGE_SIZE;
 			len--;
 		} while (len && start < vma->vm_end);
+
+		if (mm != current->mm && mm_counter_updates_pending(current))
+			/*
+			 * Foreign mm. Update any counters delta in the
+			 * foreign mm otherwise they will be later added
+			 * to the mm_struct of this process.
+			 */
+			mm_counter_catchup(current, mm);
+
 		spin_unlock(&mm->page_table_lock);
 	} while (len);
 	return i;
Index: linux-2.6.13-rc6-mm1/kernel/sched.c
===================================================================
--- linux-2.6.13-rc6-mm1.orig/kernel/sched.c	2005-08-19 11:47:49.000000000 -0700
+++ linux-2.6.13-rc6-mm1/kernel/sched.c	2005-08-19 23:34:42.000000000 -0700
@@ -2917,6 +2917,15 @@ asmlinkage void __sched schedule(void)
 	}
 	profile_hit(SCHED_PROFILING, __builtin_return_address(0));
 
+	/* If we have the opportunity then update the mm_counters */
+	if (unlikely(current->mm
+		&& mm_counter_updates_pending(current)
+		&& spin_trylock(&current->mm->page_table_lock))) {
+
+		mm_counter_catchup(current, current->mm);
+		spin_unlock(&current->mm->page_table_lock);
+	}
+
 need_resched:
 	preempt_disable();
 	prev = current;

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH] Use deltas to replace atomic inc
  2005-08-19  4:49         ` Linus Torvalds
  2005-08-19 16:06           ` Christoph Lameter
  2005-08-20  7:33           ` [PATCH] mm_struct counter deltas in task_struct Christoph Lameter
@ 2005-08-20  7:35           ` Christoph Lameter
  2005-08-20  7:58             ` Andrew Morton
  2 siblings, 1 reply; 47+ messages in thread
From: Christoph Lameter @ 2005-08-20  7:35 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andrew Morton, hugh, nickpiggin, linux-mm

This patch applies on top of the counter delta patches and the page fault
scalability patchset in 2.6.13-rc6-mm1.

It switches the code paths that can potentially run without the page table lock
to use inc_mm_delta instead of inc_mm_counter (which requires the ptl or atomic
operations). We can then remove the definitions for making the mm_struct
counters atomic.

As a consequence page_add_anon_rmap no longer requires the page_table_lock.
It will always increase the rss delta of the currently executing process instead
of increasing the rss of the mm belonging to the vma. Most of the time this is
okay, except when unuse_mm uses this function. In that case the deferred
counters need to be charged to the mm_structs as they are processed, similarly
to what was done for get_user_pages().

The use of deltas could be taken further and other places could be switched.
Obviously this would be possible for places like unuse_pte() that now use a
mix of mm_counter and mm_delta operations.

If CONFIG_ATOMIC_TABLE_OPS is not defined for an arch then we will still be
using the deltas. This will help somewhat in avoiding bouncing cachelines;
however, the page_table_lock will still be taken, which is the major
scalability bottleneck. Maybe we need to fall back to not using deltas?

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.13-rc6-mm1/include/linux/sched.h
===================================================================
--- linux-2.6.13-rc6-mm1.orig/include/linux/sched.h	2005-08-19 23:42:27.000000000 -0700
+++ linux-2.6.13-rc6-mm1/include/linux/sched.h	2005-08-20 00:22:23.000000000 -0700
@@ -227,35 +227,8 @@ arch_get_unmapped_area_topdown(struct fi
 extern void arch_unmap_area(struct mm_struct *, unsigned long);
 extern void arch_unmap_area_topdown(struct mm_struct *, unsigned long);
 
-#ifdef CONFIG_ATOMIC_TABLE_OPS
 /*
- * No spinlock is held during atomic page table operations. The
- * counters are not protected anymore and must also be
- * incremented atomically.
-*/
-#ifdef ATOMIC64_INIT
-#define set_mm_counter(mm, member, value) atomic64_set(&(mm)->_##member, value)
-#define get_mm_counter(mm, member) ((unsigned long)atomic64_read(&(mm)->_##member))
-#define add_mm_counter(mm, member, value) atomic64_add(value, &(mm)->_##member)
-#define inc_mm_counter(mm, member) atomic64_inc(&(mm)->_##member)
-#define dec_mm_counter(mm, member) atomic64_dec(&(mm)->_##member)
-typedef atomic64_t mm_counter_t;
-#else
-/*
- * This may limit process memory to 2^31 * PAGE_SIZE which may be around 8TB
- * if using 4KB page size
- */
-#define set_mm_counter(mm, member, value) atomic_set(&(mm)->_##member, value)
-#define get_mm_counter(mm, member) ((unsigned long)atomic_read(&(mm)->_##member))
-#define add_mm_counter(mm, member, value) atomic_add(value, &(mm)->_##member)
-#define inc_mm_counter(mm, member) atomic_inc(&(mm)->_##member)
-#define dec_mm_counter(mm, member) atomic_dec(&(mm)->_##member)
-typedef atomic_t mm_counter_t;
-#endif
-#else
-/*
- * No atomic page table operations. Counters are protected by
- * the page table lock
+ * Operations for mm_struct counters protected by the page table lock
  */
 #define set_mm_counter(mm, member, value) (mm)->_##member = (value)
 #define get_mm_counter(mm, member) ((mm)->_##member)
@@ -263,7 +236,6 @@ typedef atomic_t mm_counter_t;
 #define inc_mm_counter(mm, member) (mm)->_##member++
 #define dec_mm_counter(mm, member) (mm)->_##member--
 typedef unsigned long mm_counter_t;
-#endif
 
 /*
  * mm_counter operations through the deltas in task_struct
Index: linux-2.6.13-rc6-mm1/mm/memory.c
===================================================================
--- linux-2.6.13-rc6-mm1.orig/mm/memory.c	2005-08-19 23:42:27.000000000 -0700
+++ linux-2.6.13-rc6-mm1/mm/memory.c	2005-08-20 00:22:23.000000000 -0700
@@ -1842,7 +1842,7 @@ do_anonymous_page(struct mm_struct *mm, 
 	 */
 	page_add_anon_rmap(page, vma, addr);
 	lru_cache_add_active(page);
-	inc_mm_counter(mm, rss);
+	inc_mm_delta(rss);
 	update_mmu_cache(vma, addr, entry);
 	lazy_mmu_prot_update(entry);
 
@@ -2192,7 +2192,7 @@ int __handle_mm_fault(struct mm_struct *
 			pte_free(new);
 		else {
 			inc_page_state(nr_page_table_pages);
-			inc_mm_counter(mm, nr_ptes);
+			inc_mm_delta(nr_ptes);
 		}
 	}
 
Index: linux-2.6.13-rc6-mm1/mm/rmap.c
===================================================================
--- linux-2.6.13-rc6-mm1.orig/mm/rmap.c	2005-08-19 11:45:27.000000000 -0700
+++ linux-2.6.13-rc6-mm1/mm/rmap.c	2005-08-20 00:22:23.000000000 -0700
@@ -437,15 +437,13 @@ int page_referenced(struct page *page, i
  * @page:	the page to add the mapping to
  * @vma:	the vm area in which the mapping is added
  * @address:	the user virtual address mapped
- *
- * The caller needs to hold the mm->page_table_lock.
  */
 void page_add_anon_rmap(struct page *page,
 	struct vm_area_struct *vma, unsigned long address)
 {
 	BUG_ON(PageReserved(page));
 
-	inc_mm_counter(vma->vm_mm, anon_rss);
+	inc_mm_delta(anon_rss);
 
 	if (atomic_inc_and_test(&page->_mapcount)) {
 		struct anon_vma *anon_vma = vma->anon_vma;
Index: linux-2.6.13-rc6-mm1/mm/swapfile.c
===================================================================
--- linux-2.6.13-rc6-mm1.orig/mm/swapfile.c	2005-08-19 11:47:49.000000000 -0700
+++ linux-2.6.13-rc6-mm1/mm/swapfile.c	2005-08-20 00:22:23.000000000 -0700
@@ -508,6 +508,16 @@ static int unuse_mm(struct mm_struct *mm
 {
 	struct vm_area_struct *vma;
 
+	/*
+	 * Ensure that existing deltas are charged to the current mm since
+	 * we will charge the next batch manually to the target mm
+	 */
+	if (current->mm && mm_counter_updates_pending(current)) {
+		spin_lock(&current->mm->page_table_lock);
+		mm_counter_catchup(current, current->mm);
+		spin_unlock(&current->mm->page_table_lock);
+	}
+
 	if (!down_read_trylock(&mm->mmap_sem)) {
 		/*
 		 * Activate page so shrink_cache is unlikely to unmap its
@@ -523,6 +533,13 @@ static int unuse_mm(struct mm_struct *mm
 		if (vma->anon_vma && unuse_vma(vma, entry, page))
 			break;
 	}
+
+	/*
+	 * Make sure all the deferred counters get charged
+	 * to the right mm_struct.
+	 */
+	mm_counter_catchup(current, mm);
+
 	spin_unlock(&mm->page_table_lock);
 	up_read(&mm->mmap_sem);
 	/*

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Use deltas to replace atomic inc
  2005-08-20  7:35           ` [PATCH] Use deltas to replace atomic inc Christoph Lameter
@ 2005-08-20  7:58             ` Andrew Morton
  2005-08-22  3:32               ` Christoph Lameter
  0 siblings, 1 reply; 47+ messages in thread
From: Andrew Morton @ 2005-08-20  7:58 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: torvalds, hugh, nickpiggin, linux-mm

Christoph Lameter <clameter@engr.sgi.com> wrote:
>
>  @@ -508,6 +508,16 @@ static int unuse_mm(struct mm_struct *mm
>   {
>   	struct vm_area_struct *vma;
>   
>  +	/*
>  +	 * Ensure that existing deltas are charged to the current mm since
>  +	 * we will charge the next batch manually to the target mm
>  +	 */
>  +	if (current->mm && mm_counter_updates_pending(current)) {

Is there a race window right here?

>  +		spin_lock(&current->mm->page_table_lock);
>  +		mm_counter_catchup(current, current->mm);
>  +		spin_unlock(&current->mm->page_table_lock);
>  +	}

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: pagefault scalability patches
  2005-08-18 16:17     ` Christoph Lameter
@ 2005-08-22  2:04       ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 47+ messages in thread
From: Benjamin Herrenschmidt @ 2005-08-22  2:04 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nick Piggin, Andrew Morton, Linus Torvalds, Hugh Dickins,
	Nick Piggin, linux-mm

On Thu, 2005-08-18 at 09:17 -0700, Christoph Lameter wrote:
> On Thu, 18 Aug 2005, Nick Piggin wrote:
> 
> > Nick Piggin wrote:
> > 
> > > If the big ticket item is taking the ptl out of the anonymous fault
> > > path, then we probably should forget my stuff
> > 
> > ( for now :) )
> 
> I think we can gradually work atomic operations into various code paths 
> where this will be advantageous and your work may be a very important base 
> to get there.

Don't forget however that when doing things like tearing down page
tables, it's a lot more efficient to take 1 lock, then do a bunch of
things non-atomically, then drop that lock.

At least on PPC, the cost of a lock is approx. equivalent to the cost of
an atomic, and is measurable on such things.
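
Schematically (not taken from any of the posted patches; the
update_one_pte_*() helpers are hypothetical):

	/* batched: one lock/unlock round-trip covers the whole range */
	spin_lock(&mm->page_table_lock);
	for (addr = start; addr < end; addr += PAGE_SIZE)
		update_one_pte_nonatomic(mm, addr);
	spin_unlock(&mm->page_table_lock);

	/* per-entry atomics: no lock, but one atomic operation per pte */
	for (addr = start; addr < end; addr += PAGE_SIZE)
		update_one_pte_atomic(mm, addr);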

That said, I think your approach for the anonymous page case is a good
first step for now. I'll have to adapt ppc64 to it but it shouldn't be
too hard.

Ben.



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: pagefault scalability patches
  2005-08-18  0:43 ` Andrew Morton
  2005-08-18 16:04   ` Christoph Lameter
  2005-08-18 20:16   ` Hugh Dickins
@ 2005-08-22  2:09   ` Benjamin Herrenschmidt
  2 siblings, 0 replies; 47+ messages in thread
From: Benjamin Herrenschmidt @ 2005-08-22  2:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: torvalds, hugh, clameter, piggin, linux-mm

On Wed, 2005-08-17 at 17:43 -0700, Andrew Morton wrote:
> Andrew Morton <akpm@osdl.org> wrote:

> d) the fact that some architectures will be using atomic pte ops and
>    others will be using page_table_lock in core MM code.
> 
>    Using different locking/atomicity schemes in different architectures
>    has obvious complexity and test coverage drawbacks.
> 
>    Is it still the case that some architectures must retain the
>    page_table_lock approach because they use it to lock other arch-internal
>    things?

Yes. The ppc64 case for example isn't trivial due to the difference
between manipulating the linux page tables, and sync'ing the hash
table. 

If we go toward non-PTL page faults, I'll need to review all the hash
management code paths that assume that, thanks to the PTL, nothing will be
happening to the page tables between a PTL update and the matching hash
flush.

I think it shouldn't be too bad though as long as we are only ever doing
that to fill a previously !present PTE (no hash flush necessary). If we
ever want that for the COW case as well (where set_pte is called for an
already present PTE), then things would get more complicated and I may
have to rely more on the per-PTE locking mechanism we have (which is
currently mostly used to avoid duplicates in the hash table).

Ben.



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: pagefault scalability patches
  2005-08-17 22:51   ` Christoph Lameter
  2005-08-17 23:01     ` Linus Torvalds
@ 2005-08-22  2:13     ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 47+ messages in thread
From: Benjamin Herrenschmidt @ 2005-08-22  2:13 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Linus Torvalds, Andrew Morton, Hugh Dickins, Nick Piggin, linux-mm

On Wed, 2005-08-17 at 15:51 -0700, Christoph Lameter wrote:
> On Wed, 17 Aug 2005, Linus Torvalds wrote:
> 
> > HOWEVER, the fact that it makes the mm counters be atomic just makes it
> > pointless. It may help scalability, but it loses the attribute that I
> > considered a big win above - it no longer helps the non-contended case (at
> > least on x86, a uncontended spinlock is about as expensive as a atomic
> > op).
> 
> We are trading 2x (spinlock(page_table_lock), 
> spin_unlock(page_table_lock)) against one atomic inc.

At least on ppc, unlock isn't atomic

> > I thought Christoph (Nick?) had a patch to make the counters be
> > per-thread, and then just folded back into the mm-struct every once in a
> > while?
> 
Yes I do, but I did not want to risk that can of worms becoming entwined
with the page fault scalability patches.




^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Use deltas to replace atomic inc
  2005-08-20  7:58             ` Andrew Morton
@ 2005-08-22  3:32               ` Christoph Lameter
  2005-08-22  3:48                 ` Linus Torvalds
  0 siblings, 1 reply; 47+ messages in thread
From: Christoph Lameter @ 2005-08-22  3:32 UTC (permalink / raw)
  To: Andrew Morton; +Cc: torvalds, hugh, nickpiggin, linux-mm

On Sat, 20 Aug 2005, Andrew Morton wrote:

> Christoph Lameter <clameter@engr.sgi.com> wrote:
> >
> >  @@ -508,6 +508,16 @@ static int unuse_mm(struct mm_struct *mm
> >   {
> >   	struct vm_area_struct *vma;
> >   
> >  +	/*
> >  +	 * Ensure that existing deltas are charged to the current mm since
> >  +	 * we will charge the next batch manually to the target mm
> >  +	 */
> >  +	if (current->mm && mm_counter_updates_pending(current)) {
> 
> Is there a race window right here?

Why? current is tied to a thread.

The thing that bothers me more is that schedule() can be called both from
handle_mm_fault and from unuse_mm. We may need some flag like
PF_NO_COUNTER_UPDATES there to ensure that schedule() does not add
deltas to the current->mm.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Use deltas to replace atomic inc
  2005-08-22  3:32               ` Christoph Lameter
@ 2005-08-22  3:48                 ` Linus Torvalds
  2005-08-22  4:06                   ` Christoph Lameter
  0 siblings, 1 reply; 47+ messages in thread
From: Linus Torvalds @ 2005-08-22  3:48 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andrew Morton, hugh, nickpiggin, linux-mm


On Sun, 21 Aug 2005, Christoph Lameter wrote:
> 
> The thing that bothers me more is that schedule() can be called both from
> handle_mm_fault and from unuse_mm. We may need some flag like
> PF_NO_COUNTER_UPDATES there to ensure that schedule() does not add
> deltas to the current->mm.

Why? I don't think it's ever wrong to do the thing. We should be holding
no locks at that point (and we haven't grabbed the RQ lock yet), so it
should always be safe to get the page table lock.

I think the delta approach looks quite reasonable, although I think 
somebody should check that the cache behaviour is ok (ie the deltas should 
hopefully be in a cacheline that we need to look at anyway).

		Linus

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Use deltas to replace atomic inc
  2005-08-22  3:48                 ` Linus Torvalds
@ 2005-08-22  4:06                   ` Christoph Lameter
  2005-08-22  4:18                     ` Linus Torvalds
  0 siblings, 1 reply; 47+ messages in thread
From: Christoph Lameter @ 2005-08-22  4:06 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andrew Morton, hugh, nickpiggin, linux-mm

On Sun, 21 Aug 2005, Linus Torvalds wrote:

> On Sun, 21 Aug 2005, Christoph Lameter wrote:
> > 
> > The thing that bothers me more is that schedule() can be called both from
> > handle_mm_fault and from unuse_mm. We may need some flag like
> > PF_NO_COUNTER_UPDATES there to ensure that schedule() does not add
> > deltas to the current->mm.
> 
> Why? I don't think it's ever wrong to do the thing. We should be holding
> no locks at that point (and we haven't grabbed the RQ lock yet), so it
> should always be safe to get the page table lock.

get_user_pages and unuse_mm may be working on an mm that is not 
current->mm. If schedule is called then the deltas are added to the wrong 
mm (current->mm). If we had PF_NO_COUNTER_UPDATES then we could force no
counter updates to occur until get_user_pages or unuse_mm assigns the
deltas to a specific mm using mm_counter_catchup(current, target_mm).
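
Roughly, such a flag would just gate the fold in schedule(), something like
this sketch (the V2 patches later in the thread implement it under the name
PF_NOMMCOUNTER):

	if (unlikely(current->mm
		&& mm_counter_updates_pending(current)
		&& !(current->flags & PF_NO_COUNTER_UPDATES)
		&& spin_trylock(&current->mm->page_table_lock))) {

		mm_counter_catchup(current, current->mm);
		spin_unlock(&current->mm->page_table_lock);
	}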

> I think the delta approach looks quite reasonable, although I think 
> somebody should check that the cache behaviour is ok (ie the deltas should 
> hopefully be in a cacheline that we need to look at anyway).

I put the deltas near exit_state, which is also checked in schedule().

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Use deltas to replace atomic inc
  2005-08-22  4:06                   ` Christoph Lameter
@ 2005-08-22  4:18                     ` Linus Torvalds
  2005-08-22 13:23                       ` Christoph Lameter
  0 siblings, 1 reply; 47+ messages in thread
From: Linus Torvalds @ 2005-08-22  4:18 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andrew Morton, hugh, nickpiggin, linux-mm


On Sun, 21 Aug 2005, Christoph Lameter wrote:
> > 
> > Why? I don't think it's ever wrong to do the thing. We should be holding
> > no locks at that point (and we haven't grabbed the RQ lock yet), so it
> > should always be safe to get the page table lock.
> 
> get_user_pages and unuse_mm may be working on an mm that is not 
> current->mm. If schedule is called then the deltas are added to the wrong 
> mm (current->mm).

Hmm. But we already hold (and _have_ to hold) the mm lock there, don't we?

Why not make the rule be that we only use the delta stuff when we don't 
hold the mm lock. Which is pretty seldom, but the big one is obviously 
anon page faults.

Whenever we already -do- hold the page table lock for other reasons, 
there's no actual upside to using the delta representation. In fact, 
there's only downsides, since it just makes the things like scheduling 
slower.

Or did I miss some clever thing?

			Linus

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Use deltas to replace atomic inc
  2005-08-22  4:18                     ` Linus Torvalds
@ 2005-08-22 13:23                       ` Christoph Lameter
  2005-08-22 14:22                         ` Hugh Dickins
  0 siblings, 1 reply; 47+ messages in thread
From: Christoph Lameter @ 2005-08-22 13:23 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andrew Morton, hugh, nickpiggin, linux-mm

On Sun, 21 Aug 2005, Linus Torvalds wrote:

> On Sun, 21 Aug 2005, Christoph Lameter wrote:
> > > 
> > > Why? I don't think it's ever wrong to do the thing. We should be holding
> > > no locks at that point (and we haven't grabbed the RQ lock yet), so it
> > > should always be safe to get the page table lock.
> > 
> > get_user_pages and unuse_mm may be working on an mm that is not 
> > current->mm. If schedule is called then the deltas are added to the wrong 
> > mm (current->mm).
> 
> Hmm. But we already hold (and _have_ to hold) the mm lock there, don't we?

We drop the lock for allocations. At that point schedule() may be called.

> Why not make the rule be that we only use the delta stuff when we don't 
> hold the mm lock. Which is pretty seldom, but the big one is obviously 
> anon page faults.

Yes, I have tried to follow that. But there are some functions that are
called from both contexts. Most importantly this is page_add_anon_rmap.
If we removed the incrementing of anon_rss from that function then we could
at least avoid the problem for the unuse_mm path. But we would still have
the issue with handle_mm_fault. So we would still need some flag.

> Whenever we already -do- hold the page table lock for other reasons, 
> there's no actual upside to using the delta representation. In fact, 
> there's only downsides, since it just makes the things like scheduling 
> slower.

There is a slight benefit in avoiding additional writes to the mm_struct,
which limits cacheline bouncing under heavy contention.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Use deltas to replace atomic inc
  2005-08-22 13:23                       ` Christoph Lameter
@ 2005-08-22 14:22                         ` Hugh Dickins
  2005-08-22 15:24                           ` Christoph Lameter
                                             ` (2 more replies)
  0 siblings, 3 replies; 47+ messages in thread
From: Hugh Dickins @ 2005-08-22 14:22 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Linus Torvalds, Andrew Morton, nickpiggin, linux-mm

On Mon, 22 Aug 2005, Christoph Lameter wrote:
> On Sun, 21 Aug 2005, Linus Torvalds wrote:
> 
> > Why not make the rule be that we only use the delta stuff when we don't 
> > hold the mm lock. Which is pretty seldom, but the big one is obviously 
> > anon page faults.
> 
> Yes, I have tried to follow that. But there are some functions that are
> called from both contexts. Most importantly this is page_add_anon_rmap.
> If we removed the incrementing of anon_rss from that function then we could
> at least avoid the problem for the unuse_mm path. But we would still have
> the issue with handle_mm_fault. So we would still need some flag.

I'm not following this thread closely yet, concentrating on getting my
alternative page fault scalability work ready for you (which will need an
rss solution like yours, when locking across the mm is split: the rules
may be a little different, I'll need to see what you end up with).

(Your deltas seem sensible, but hard to place the reaccumulation:
I worry that you may be taking page_table_lock more just for that.)

Just to say, please do remove the anon_rss incrementation from inside
page_add_anon_rmap if it helps you: I put it there as a lazy way of
avoiding a bigger patch, but it would be much nicer to count separate
file_rss and anon_rss, with /proc showing the sum of the two as rss.
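
A minimal sketch of that split, assuming hypothetical _file_rss/_anon_rss
counters behind the existing accessor macros:

	/* /proc would then report the sum of the two as rss */
	#define get_mm_rss(mm) \
		(get_mm_counter(mm, file_rss) + get_mm_counter(mm, anon_rss))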

Especially since it's anonymous that most concerns you: the present
setup is biased against anonymous since it's counted in two fields -
perhaps no issue once you've rearranged, but not good for atomics.

If you don't get there first, I do intend in due course to replace
rss and anon_rss by file_rss and anon_rss (and try to sort out the
tlb mmu gathering stuff, which is a little odd in dealing with
rss but not with anon_rss).

Hugh

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Use deltas to replace atomic inc
  2005-08-22 14:22                         ` Hugh Dickins
@ 2005-08-22 15:24                           ` Christoph Lameter
  2005-08-22 15:43                             ` Andi Kleen
  2005-08-22 20:30                           ` [PATCH] mm_struct counter deltas V2 Christoph Lameter
  2005-08-22 20:31                           ` [PATCH] Use deltas to replace atomic inc V2 Christoph Lameter
  2 siblings, 1 reply; 47+ messages in thread
From: Christoph Lameter @ 2005-08-22 15:24 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Linus Torvalds, Andrew Morton, nickpiggin, linux-mm

On Mon, 22 Aug 2005, Hugh Dickins wrote:

> (Your deltas seem sensible, but hard to place the reaccumulation:
> I worry that you may be taking page_table_lock more just for that.)

The page_table_lock is taken using a spin_trylock. It's skipped if
contended.

> Just to say, please do remove the anon_rss incrementation from inside
> page_add_anon_rmap if it helps you: I put it there as a lazy way of
> avoiding a bigger patch, but it would be much nicer to count separate
> file_rss and anon_rss, with /proc showing the sum of the two as rss.

Hmm. That's a big project. Let's first focus on getting the deltas right.

> Especially since it's anonymous that most concerns you: the present
> setup is biased against anonymous since it's counted in two fields -
> perhaps no issue once you've rearranged, but not good for atomics.

There will be no atomics once the deltas are in.
 
> If you don't get there first, I do intend in due course to replace
> rss and anon_rss by file_rss and anon_rss (and try to sort out the
> tlb mmu gathering stuff, which is a little odd in dealing with
> rss but not with anon_rss).

That will be great.



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Use deltas to replace atomic inc
  2005-08-22 15:24                           ` Christoph Lameter
@ 2005-08-22 15:43                             ` Andi Kleen
  2005-08-22 16:24                               ` Christoph Lameter
  0 siblings, 1 reply; 47+ messages in thread
From: Andi Kleen @ 2005-08-22 15:43 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, nickpiggin, linux-mm

On Mon, Aug 22, 2005 at 08:24:50AM -0700, Christoph Lameter wrote:
> On Mon, 22 Aug 2005, Hugh Dickins wrote:
> 
> > (Your deltas seem sensible, but hard to place the reaccumulation:
> > I worry that you may be taking page_table_lock more just for that.)
> 
> The page_table_lock is taken using a spin_trylock. It's skipped if
> contended.

Hmm - doesn't trylock cause a cache line bounce on the bus too?
I think it does. That would mean its latency is not much better
than a real spinlock (assuming it doesn't have to spin).

-Andi

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Use deltas to replace atomic inc
  2005-08-22 15:43                             ` Andi Kleen
@ 2005-08-22 16:24                               ` Christoph Lameter
  0 siblings, 0 replies; 47+ messages in thread
From: Christoph Lameter @ 2005-08-22 16:24 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, nickpiggin, linux-mm

On Mon, 22 Aug 2005, Andi Kleen wrote:

> > The page_table_lock is taken using a spin_trylock. It's skipped if
> > contended.
> Hmm - doesn't trylock cause a cache line bounce on the bus too?
> I think it does. That would mean its latency is not much better
> than a real spinlock (assuming it doesn't have to spin).

Trylock does a cmpxchg on ia64 and thus acquires an exclusive cache line.
So yes. But this is only done if there are updates pending. schedule() is 
not called that frequently. On bootup I see on average updates of 5-20 
pages per mm_counter_catchup. 



^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH] mm_struct counter deltas V2
  2005-08-22 14:22                         ` Hugh Dickins
  2005-08-22 15:24                           ` Christoph Lameter
@ 2005-08-22 20:30                           ` Christoph Lameter
  2005-08-22 20:31                           ` [PATCH] Use deltas to replace atomic inc V2 Christoph Lameter
  2 siblings, 0 replies; 47+ messages in thread
From: Christoph Lameter @ 2005-08-22 20:30 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Linus Torvalds, Andrew Morton, nickpiggin, linux-mm

New version of the counter deltas. This includes PF_NOMMCOUNTER to switch
off the counter consolidation in get_user_pages().

---
mm_struct counter deltas in task_struct

Introduce counter deltas in the task_struct. Instead of updating the
counters in the mm_struct via inc_mm_counter() etc. one may now use
inc_mm_delta(). inc_mm_delta() increments a delta in the task_struct.
The delta is later folded into the corresponding mm_struct counter during
schedule(). The advantage is that the operations on the deltas do not need
any locks.

The delta counters may be used for a variety of purposes outside of the
page fault scalability patchset (e.g. the existing tlb "freed" handling
may be switched to use this method).

The method of folding the counters in schedule() may require some scrutiny.
We only take the page_table_lock if it is available; otherwise counter updates
are deferred until the next schedule(). If the page_table_lock stays busy for
extended periods then lots of deltas may accumulate, and the RSS reported
through /proc may lag a bit as a result. One may want to add other points
where the mm counters are updated.

The main problem in the past with using current to store mm_struct counter
information was the concern about get_user_pages(). The approach here
solves that issue in the following way:

get_user_pages() first folds any pending deltas into current->mm. Then it does
the handle_mm_fault() processing, which may accumulate new deltas in current.
PF_NOMMCOUNTER is set to disable the schedule() counter consolidation, which
would add the deltas to the wrong mm. The resulting deltas are charged to the
target mm after the page_table_lock has been acquired for the last time in
get_user_pages().

This patch only introduces the counter deltas and is independent of the
page fault scalability patches. It does not make the page fault
scalability patchset use the deltas.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.13-rc6-mm1/include/linux/sched.h
===================================================================
--- linux-2.6.13-rc6-mm1.orig/include/linux/sched.h	2005-08-19 11:47:48.000000000 -0700
+++ linux-2.6.13-rc6-mm1/include/linux/sched.h	2005-08-22 12:34:51.000000000 -0700
@@ -265,6 +265,16 @@ typedef atomic_t mm_counter_t;
 typedef unsigned long mm_counter_t;
 #endif
 
+/*
+ * mm_counter operations through the deltas in task_struct
+ * that do not require holding the page_table_lock.
+ */
+#define inc_mm_delta(member) current->delta_##member++
+#define dec_mm_delta(member) current->delta_##member--
+
+#define mm_counter_updates_pending(__p) \
+	((__p)->delta_nr_ptes | (__p)->delta_rss | (__p)->delta_anon_rss)
+
 struct mm_struct {
 	struct vm_area_struct * mmap;		/* list of VMAs */
 	struct rb_root mm_rb;
@@ -700,6 +710,15 @@ struct task_struct {
 
 	struct mm_struct *mm, *active_mm;
 
+	/*
+	 * Deltas for corresponding counters in mm_struct which require
+	 * the page_table_lock. The deltas may be updated and are later
+	 * folded into the corresponding mm_struct counters.
+	 */
+	long delta_rss;
+	long delta_anon_rss;
+	long delta_nr_ptes;
+
 /* task state */
 	struct linux_binfmt *binfmt;
 	long exit_state;
@@ -889,6 +908,7 @@ do { if (atomic_dec_and_test(&(tsk)->usa
 #define PF_SYNCWRITE	0x00200000	/* I am doing a sync write */
 #define PF_BORROWED_MM	0x00400000	/* I am a kthread doing use_mm */
 #define PF_RANDOMIZE	0x00800000	/* randomize virtual address space */
+#define PF_NOMMCOUNTER	0x01000000	/* No delta processing for mm_struct */
 
 /*
  * Only the _current_ task can read/write to tsk->flags, but other
@@ -1417,6 +1437,9 @@ static inline void thaw_processes(void) 
 static inline int try_to_freeze(void) { return 0; }
 
 #endif /* CONFIG_PM */
+
+extern void mm_counter_catchup(struct task_struct *t, struct mm_struct *mm);
+
 #endif /* __KERNEL__ */
 
 #endif
Index: linux-2.6.13-rc6-mm1/kernel/fork.c
===================================================================
--- linux-2.6.13-rc6-mm1.orig/kernel/fork.c	2005-08-19 11:47:24.000000000 -0700
+++ linux-2.6.13-rc6-mm1/kernel/fork.c	2005-08-22 12:34:51.000000000 -0700
@@ -173,6 +173,9 @@ static struct task_struct *dup_task_stru
 	*ti = *orig->thread_info;
 	*tsk = *orig;
 	tsk->thread_info = ti;
+	tsk->delta_rss = 0;
+	tsk->delta_anon_rss = 0;
+	tsk->delta_nr_ptes = 0;
 	ti->task = tsk;
 
 	/* One for us, one for whoever does the "release_task()" (usually parent) */
@@ -427,6 +430,13 @@ void mm_release(struct task_struct *tsk,
 {
 	struct completion *vfork_done = tsk->vfork_done;
 
+	/* If we are still carrying deltas then apply them */
+	if (mm && mm_counter_updates_pending(tsk)) {
+		spin_lock(&mm->page_table_lock);
+		mm_counter_catchup(tsk, mm);
+		spin_unlock(&mm->page_table_lock);
+	}
+
 	/* Get rid of any cached register state */
 	deactivate_mm(tsk, mm);
 
Index: linux-2.6.13-rc6-mm1/mm/memory.c
===================================================================
--- linux-2.6.13-rc6-mm1.orig/mm/memory.c	2005-08-19 11:45:27.000000000 -0700
+++ linux-2.6.13-rc6-mm1/mm/memory.c	2005-08-22 12:34:51.000000000 -0700
@@ -886,6 +886,21 @@ untouched_anonymous_page(struct mm_struc
 	return 0;
 }
 
+/*
+ * Update the mm_struct counters protected by
+ * the page_table_lock using the deltas in the task_struct.
+ * Must hold the page_table_lock.
+ */
+void mm_counter_catchup(struct task_struct *t, struct mm_struct *mm)
+{
+	add_mm_counter(mm, rss, t->delta_rss);
+	add_mm_counter(mm, anon_rss, t->delta_anon_rss);
+	add_mm_counter(mm, nr_ptes, t->delta_nr_ptes);
+	t->delta_rss = 0;
+	t->delta_anon_rss = 0;
+	t->delta_nr_ptes = 0;
+}
+
 int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		unsigned long start, int len, int write, int force,
 		struct page **pages, struct vm_area_struct **vmas)
@@ -951,7 +966,21 @@ int get_user_pages(struct task_struct *t
 						&start, &len, i);
 			continue;
 		}
+
+		if (mm != current->mm && mm_counter_updates_pending(current)) {
+			/*
+			 * Access to a foreign mm requires us first to bring
+			 * the counters in our mm up to date with counter deltas
+			 * so that we do not carry any deltas.
+			 */
+			spin_lock(&current->mm->page_table_lock);
+			mm_counter_catchup(current, current->mm);
+			spin_unlock(&current->mm->page_table_lock);
+		}
+
 		spin_lock(&mm->page_table_lock);
+		current->flags |= PF_NOMMCOUNTER;
+
 		do {
 			int write_access = write;
 			struct page *page;
@@ -991,8 +1020,10 @@ int get_user_pages(struct task_struct *t
 					tsk->maj_flt++;
 					break;
 				case VM_FAULT_SIGBUS:
+					current->flags &= ~PF_NOMMCOUNTER;
 					return i ? i : -EFAULT;
 				case VM_FAULT_OOM:
+					current->flags &= ~PF_NOMMCOUNTER;
 					return i ? i : -ENOMEM;
 				default:
 					BUG();
@@ -1011,6 +1042,17 @@ int get_user_pages(struct task_struct *t
 			start += PAGE_SIZE;
 			len--;
 		} while (len && start < vma->vm_end);
+		current->flags &= ~PF_NOMMCOUNTER;
+
+		if (mm != current->mm && mm_counter_updates_pending(current)) {
+			/*
+			 * Foreign mm. Update any counters delta in the
+			 * foreign mm otherwise they will be later added
+			 * to the mm_struct of this process.
+			 */
+			mm_counter_catchup(current, mm);
+		}
+
 		spin_unlock(&mm->page_table_lock);
 	} while (len);
 	return i;
Index: linux-2.6.13-rc6-mm1/kernel/sched.c
===================================================================
--- linux-2.6.13-rc6-mm1.orig/kernel/sched.c	2005-08-19 11:47:49.000000000 -0700
+++ linux-2.6.13-rc6-mm1/kernel/sched.c	2005-08-22 12:34:51.000000000 -0700
@@ -2917,6 +2917,16 @@ asmlinkage void __sched schedule(void)
 	}
 	profile_hit(SCHED_PROFILING, __builtin_return_address(0));
 
+	/* If we have the opportunity then update the mm_counters */
+	if (unlikely(current->mm
+		&& mm_counter_updates_pending(current)
+		&& !(current->flags & PF_NOMMCOUNTER)
+		&& spin_trylock(&current->mm->page_table_lock))) {
+
+		mm_counter_catchup(current, current->mm);
+		spin_unlock(&current->mm->page_table_lock);
+	}
+
 need_resched:
 	preempt_disable();
 	prev = current;

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH] Use deltas to replace atomic inc V2
  2005-08-22 14:22                         ` Hugh Dickins
  2005-08-22 15:24                           ` Christoph Lameter
  2005-08-22 20:30                           ` [PATCH] mm_struct counter deltas V2 Christoph Lameter
@ 2005-08-22 20:31                           ` Christoph Lameter
  2 siblings, 0 replies; 47+ messages in thread
From: Christoph Lameter @ 2005-08-22 20:31 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Linus Torvalds, Andrew Morton, nickpiggin, linux-mm

Patch with PF_NOMMCOUNTER

--

This patch applies on top of the counter delta patches and the page fault
scalability patchset in 2.6.13-rc6-mm1.

It switches the code paths that can potentially run without the page table lock
to use inc_mm_delta instead of inc_mm_counter (which requires the ptl or atomic
operations). We can then remove the definitions for making the mm_struct
counters atomic.

As a consequence page_add_anon_rmap no longer requires the page_table_lock.
It will always increase the rss delta of the currently executing process instead
of increasing the rss of the mm belonging to the vma. Most of the time this is
okay, except when unuse_mm uses this function. In that case the deferred
counters need to be charged to the mm_structs as they are processed, similarly
to what was done for get_user_pages().

The use of deltas could be taken further and other places could be switched.
Obviously this would be possible for places like unuse_pte() that now use a
mix of mm_counter and mm_delta operations.

If CONFIG_ATOMIC_TABLE_OPS is not defined for an arch then we will still be
using the deltas. This will help somewhat in avoiding bouncing cachelines;
however, the page_table_lock will still be taken, which is the major
scalability bottleneck. Maybe we need to fall back to not using deltas?

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.13-rc6-mm1/include/linux/sched.h
===================================================================
--- linux-2.6.13-rc6-mm1.orig/include/linux/sched.h	2005-08-22 12:34:51.000000000 -0700
+++ linux-2.6.13-rc6-mm1/include/linux/sched.h	2005-08-22 12:44:51.000000000 -0700
@@ -227,35 +227,8 @@ arch_get_unmapped_area_topdown(struct fi
 extern void arch_unmap_area(struct mm_struct *, unsigned long);
 extern void arch_unmap_area_topdown(struct mm_struct *, unsigned long);
 
-#ifdef CONFIG_ATOMIC_TABLE_OPS
 /*
- * No spinlock is held during atomic page table operations. The
- * counters are not protected anymore and must also be
- * incremented atomically.
-*/
-#ifdef ATOMIC64_INIT
-#define set_mm_counter(mm, member, value) atomic64_set(&(mm)->_##member, value)
-#define get_mm_counter(mm, member) ((unsigned long)atomic64_read(&(mm)->_##member))
-#define add_mm_counter(mm, member, value) atomic64_add(value, &(mm)->_##member)
-#define inc_mm_counter(mm, member) atomic64_inc(&(mm)->_##member)
-#define dec_mm_counter(mm, member) atomic64_dec(&(mm)->_##member)
-typedef atomic64_t mm_counter_t;
-#else
-/*
- * This may limit process memory to 2^31 * PAGE_SIZE which may be around 8TB
- * if using 4KB page size
- */
-#define set_mm_counter(mm, member, value) atomic_set(&(mm)->_##member, value)
-#define get_mm_counter(mm, member) ((unsigned long)atomic_read(&(mm)->_##member))
-#define add_mm_counter(mm, member, value) atomic_add(value, &(mm)->_##member)
-#define inc_mm_counter(mm, member) atomic_inc(&(mm)->_##member)
-#define dec_mm_counter(mm, member) atomic_dec(&(mm)->_##member)
-typedef atomic_t mm_counter_t;
-#endif
-#else
-/*
- * No atomic page table operations. Counters are protected by
- * the page table lock
+ * Operations for mm_struct counters protected by the page table lock
  */
 #define set_mm_counter(mm, member, value) (mm)->_##member = (value)
 #define get_mm_counter(mm, member) ((mm)->_##member)
@@ -263,7 +236,6 @@ typedef atomic_t mm_counter_t;
 #define inc_mm_counter(mm, member) (mm)->_##member++
 #define dec_mm_counter(mm, member) (mm)->_##member--
 typedef unsigned long mm_counter_t;
-#endif
 
 /*
  * mm_counter operations through the deltas in task_struct
Index: linux-2.6.13-rc6-mm1/mm/memory.c
===================================================================
--- linux-2.6.13-rc6-mm1.orig/mm/memory.c	2005-08-22 12:34:51.000000000 -0700
+++ linux-2.6.13-rc6-mm1/mm/memory.c	2005-08-22 12:44:51.000000000 -0700
@@ -1849,7 +1849,7 @@ do_anonymous_page(struct mm_struct *mm, 
 	 */
 	page_add_anon_rmap(page, vma, addr);
 	lru_cache_add_active(page);
-	inc_mm_counter(mm, rss);
+	inc_mm_delta(rss);
 	update_mmu_cache(vma, addr, entry);
 	lazy_mmu_prot_update(entry);
 
@@ -2199,7 +2199,7 @@ int __handle_mm_fault(struct mm_struct *
 			pte_free(new);
 		else {
 			inc_page_state(nr_page_table_pages);
-			inc_mm_counter(mm, nr_ptes);
+			inc_mm_delta(nr_ptes);
 		}
 	}
 
Index: linux-2.6.13-rc6-mm1/mm/rmap.c
===================================================================
--- linux-2.6.13-rc6-mm1.orig/mm/rmap.c	2005-08-19 11:45:27.000000000 -0700
+++ linux-2.6.13-rc6-mm1/mm/rmap.c	2005-08-22 12:44:51.000000000 -0700
@@ -437,15 +437,13 @@ int page_referenced(struct page *page, i
  * @page:	the page to add the mapping to
  * @vma:	the vm area in which the mapping is added
  * @address:	the user virtual address mapped
- *
- * The caller needs to hold the mm->page_table_lock.
  */
 void page_add_anon_rmap(struct page *page,
 	struct vm_area_struct *vma, unsigned long address)
 {
 	BUG_ON(PageReserved(page));
 
-	inc_mm_counter(vma->vm_mm, anon_rss);
+	inc_mm_delta(anon_rss);
 
 	if (atomic_inc_and_test(&page->_mapcount)) {
 		struct anon_vma *anon_vma = vma->anon_vma;
Index: linux-2.6.13-rc6-mm1/mm/swapfile.c
===================================================================
--- linux-2.6.13-rc6-mm1.orig/mm/swapfile.c	2005-08-19 11:47:49.000000000 -0700
+++ linux-2.6.13-rc6-mm1/mm/swapfile.c	2005-08-22 12:44:51.000000000 -0700
@@ -508,6 +508,17 @@ static int unuse_mm(struct mm_struct *mm
 {
 	struct vm_area_struct *vma;
 
+	/*
+	 * Ensure that existing deltas are charged to the current mm since
+	 * we will charge the next batch manually to the target mm
+	 */
+	if (current->mm && mm_counter_updates_pending(current)) {
+		spin_lock(&current->mm->page_table_lock);
+		mm_counter_catchup(current, current->mm);
+		spin_unlock(&current->mm->page_table_lock);
+	}
+	current->flags |= PF_NOMMCOUNTER;
+
 	if (!down_read_trylock(&mm->mmap_sem)) {
 		/*
 		 * Activate page so shrink_cache is unlikely to unmap its
@@ -523,6 +534,14 @@ static int unuse_mm(struct mm_struct *mm
 		if (vma->anon_vma && unuse_vma(vma, entry, page))
 			break;
 	}
+
+	/*
+	 * Make sure all the deferred counters get charged
+	 * to the right mm_struct.
+	 */
+	mm_counter_catchup(current, mm);
+	current->flags &= ~PF_NOMMCOUNTER;
+
 	spin_unlock(&mm->page_table_lock);
 	up_read(&mm->mmap_sem);
 	/*

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2005-08-22 20:31 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-08-17 22:17 pagefault scalability patches Andrew Morton
2005-08-17 22:19 ` Christoph Lameter
2005-08-17 22:36 ` Linus Torvalds
2005-08-17 22:51   ` Christoph Lameter
2005-08-17 23:01     ` Linus Torvalds
2005-08-17 23:12       ` Christoph Lameter
2005-08-17 23:23         ` Linus Torvalds
2005-08-17 23:31           ` Christoph Lameter
2005-08-17 23:30         ` Andrew Morton
2005-08-17 23:33           ` Christoph Lameter
2005-08-17 23:44             ` Andrew Morton
2005-08-17 23:52               ` Peter Chubb
2005-08-17 23:58                 ` Christoph Lameter
2005-08-18  0:47                   ` Andrew Morton
2005-08-18 16:09                     ` Christoph Lameter
2005-08-22  2:13     ` Benjamin Herrenschmidt
2005-08-18  0:43 ` Andrew Morton
2005-08-18 16:04   ` Christoph Lameter
2005-08-18 20:16   ` Hugh Dickins
2005-08-19  1:22     ` [PATCH] use mm_counter macros for nr_pte since its also under ptl Christoph Lameter
2005-08-19  3:17       ` Andrew Morton
2005-08-19  3:51         ` Christoph Lameter
2005-08-19  1:33     ` pagefault scalability patches Christoph Lameter
2005-08-19  3:53     ` [RFC] Concept for delayed counter updates in mm_struct Christoph Lameter
2005-08-19  4:29       ` Andrew Morton
2005-08-19  4:34         ` Andi Kleen
2005-08-19  4:49         ` Linus Torvalds
2005-08-19 16:06           ` Christoph Lameter
2005-08-20  7:33           ` [PATCH] mm_struct counter deltas in task_struct Christoph Lameter
2005-08-20  7:35           ` [PATCH] Use deltas to replace atomic inc Christoph Lameter
2005-08-20  7:58             ` Andrew Morton
2005-08-22  3:32               ` Christoph Lameter
2005-08-22  3:48                 ` Linus Torvalds
2005-08-22  4:06                   ` Christoph Lameter
2005-08-22  4:18                     ` Linus Torvalds
2005-08-22 13:23                       ` Christoph Lameter
2005-08-22 14:22                         ` Hugh Dickins
2005-08-22 15:24                           ` Christoph Lameter
2005-08-22 15:43                             ` Andi Kleen
2005-08-22 16:24                               ` Christoph Lameter
2005-08-22 20:30                           ` [PATCH] mm_struct counter deltas V2 Christoph Lameter
2005-08-22 20:31                           ` [PATCH] Use deltas to replace atomic inc V2 Christoph Lameter
2005-08-22  2:09   ` pagefault scalability patches Benjamin Herrenschmidt
2005-08-18  2:00 ` Nick Piggin
2005-08-18  8:38   ` Nick Piggin
2005-08-18 16:17     ` Christoph Lameter
2005-08-22  2:04       ` Benjamin Herrenschmidt
