linux-mm.kvack.org archive mirror
* [RFC][PATCH 0/10] alternate 4-level page tables patches
@ 2004-12-18  6:55 Nick Piggin
  2004-12-18  6:55 ` [PATCH 1/10] " Nick Piggin
  2004-12-19  0:07 ` [RFC][PATCH 0/10] " Hugh Dickins
  0 siblings, 2 replies; 77+ messages in thread
From: Nick Piggin @ 2004-12-18  6:55 UTC
  To: Linux Memory Management, Andi Kleen
  Cc: Hugh Dickins, Linus Torvalds, Andrew Morton

Hi,

Apologies for not making progress on this sooner, but better late than never.

First off - don't let all the signed-off-by: lines fool you; I'm only intending
this for comments, not merging. I just spent a bit of time getting the descriptions
into better shape.

Second - much of it is Andi's code (especially 4 level core, and x86-64 stuff).
If any attributions aren't quite accurate at this stage, don't worry too much!


Anyway, although we have a working 4-level page tables implementation, I am keeping
on with this because of my personal taste preference. Not that I have anything against
Andi's taste or technical implementation... but I wouldn't like progress to be held
up on account of me, so I wouldn't be too upset if this were forgotten until 2.7 (or
forever)... /end disclaimer

Well, the patches follow. Tested lightly on i386 in 32-bit and 36-bit (PAE) modes,
on ia64, and on x86-64 with the full 4 levels.

Comments?

Nick


A bit of an aside: I was hoping to have a page table folding implementation that is
basically transparent to architectures. That is, a 3-level arch could just include
some generic header to fold the 4th level, and call it a day (without any other mods
to arch/?/* or include/asm-?/*).

The reality is, this isn't going to happen with our current semantics. That probably
isn't a big deal, though, because I don't expect we'll have to support a 5-level
implementation any time soon. But it is something I'd like to explore further.

I'll illustrate with an example: in the current setup, if the pmd is folded into
the pgd, pgd_present is always true, and pmd_present is what actually examines the
entry in the pgd. So the architecture still has to implement pmd_present, even though
in a 2-level setup it is logically the pgd_present check.

I would like to change that so pgd_present really does check the actual pgd entry,
and pmd_present is unconditionally true. IMO this would work better and be less
confusing than the current setup... but that's getting off topic...
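
To make the contrast concrete, here is a rough sketch of the two conventions for a
2-level architecture (my illustration, not code from the patches; the real check is
along the lines of what i386 does today):

	/* current convention: stubs at the upper level, real check below */
	static inline int pgd_present(pgd_t pgd)	{ return 1; }
	#define pmd_present(pmd)	(pmd_val(pmd) & _PAGE_PRESENT)

	/* proposed convention: real check at the upper level, stub below */
	#define pgd_present(pgd)	(pgd_val(pgd) & _PAGE_PRESENT)
	static inline int pmd_present(pmd_t pmd)	{ return 1; }

Either way the same bit gets tested once per walk; the difference is which accessor
the architecture must supply and which can come from a generic header.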

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: aart@kvack.org


* [PATCH 1/10] alternate 4-level page tables patches
  2004-12-18  6:55 [RFC][PATCH 0/10] alternate 4-level page tables patches Nick Piggin
@ 2004-12-18  6:55 ` Nick Piggin
  2004-12-18  6:56   ` [PATCH 2/10] " Nick Piggin
  2004-12-19  0:07 ` [RFC][PATCH 0/10] " Hugh Dickins
  1 sibling, 1 reply; 77+ messages in thread
From: Nick Piggin @ 2004-12-18  6:55 UTC
  To: Nick Piggin
  Cc: Linux Memory Management, Andi Kleen, Hugh Dickins,
	Linus Torvalds, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 5 bytes --]

1/10

[-- Attachment #2: 3level-compat.patch --]
[-- Type: text/plain, Size: 2305 bytes --]



Generic headers to fold the 3-level pagetable into 2 levels.
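
A 2-level architecture consumes this with a single include at the top of its pgtable
header, as patch 2/10 does for i386:

	#include <asm-generic/pgtable-nopmd.h>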

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>


---

 linux-2.6-npiggin/include/asm-generic/pgtable-nopmd.h |   59 ++++++++++++++++++
 1 files changed, 59 insertions(+)

diff -puN /dev/null include/asm-generic/pgtable-nopmd.h
--- /dev/null	2004-09-06 19:38:39.000000000 +1000
+++ linux-2.6-npiggin/include/asm-generic/pgtable-nopmd.h	2004-12-18 17:07:48.000000000 +1100
@@ -0,0 +1,59 @@
+#ifndef _PGTABLE_NOPMD_H
+#define _PGTABLE_NOPMD_H
+
+#ifndef __ASSEMBLY__
+
+/*
+ * Having the pmd type consist of a pgd gets the size right, and allows
+ * us to conceptually access the pgd entry that this pmd is folded into
+ * without casting.
+ */
+typedef struct { pgd_t pgd; } pmd_t;
+
+#define PMD_SHIFT	PGDIR_SHIFT
+#define PTRS_PER_PMD	1
+#define PMD_SIZE  	(1UL << PMD_SHIFT)
+#define PMD_MASK  	(~(PMD_SIZE-1))
+
+/*
+ * The "pgd_xxx()" functions here are trivial for a folded two-level
+ * setup: the pmd is never bad, and a pmd always exists (as it's folded
+ * into the pgd entry)
+ */
+static inline int pgd_none(pgd_t pgd)		{ return 0; }
+static inline int pgd_bad(pgd_t pgd)		{ return 0; }
+static inline int pgd_present(pgd_t pgd)	{ return 1; }
+static inline void pgd_clear(pgd_t *pgd)	{ }
+#define pmd_ERROR(pmd)				(pgd_ERROR((pmd).pgd))
+
+#define pgd_populate(mm, pmd, pte)		do { } while (0)
+#define pgd_populate_kernel(mm, pmd, pte)	do { } while (0)
+
+/*
+ * (pmds are folded into pgds so this doesn't get actually called,
+ * but the define is needed for a generic inline function.)
+ */
+#define set_pgd(pgdptr, pgdval)			set_pmd((pmd_t *)(pgdptr), (pmd_t) { pgdval })
+
+static inline pmd_t * pmd_offset(pgd_t * pgd, unsigned long address)
+{
+	return (pmd_t *)pgd;
+}
+
+#define pmd_val(x)				(pgd_val((x).pgd))
+#define __pmd(x)				((pmd_t) { __pgd(x) } )
+
+#define pgd_page(pgd)				(pmd_page((pmd_t){ pgd }))
+#define pgd_page_kernel(pgd)			(pmd_page_kernel((pmd_t){ pgd }))
+
+/*
+ * allocating and freeing a pmd is trivial: the 1-entry pmd is
+ * inside the pgd, so has no extra memory associated with it.
+ */
+#define pmd_alloc_one(mm, address)		NULL
+#define pmd_free(x)				do { } while (0)
+#define __pmd_free_tlb(tlb, x)			do { } while (0)
+
+#endif /* __ASSEMBLY__ */
+
+#endif /* _PGTABLE_NOPMD_H */

_


* [PATCH 2/10] alternate 4-level page tables patches
  2004-12-18  6:55 ` [PATCH 1/10] " Nick Piggin
@ 2004-12-18  6:56   ` Nick Piggin
  2004-12-18  6:56     ` [PATCH 3/10] " Nick Piggin
  0 siblings, 1 reply; 77+ messages in thread
From: Nick Piggin @ 2004-12-18  6:56 UTC
  To: Nick Piggin
  Cc: Linux Memory Management, Andi Kleen, Hugh Dickins,
	Linus Torvalds, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 5 bytes --]

2/10

[-- Attachment #2: 3level-i386-cleanup.patch --]
[-- Type: text/plain, Size: 9391 bytes --]


Adapt the i386 architecture to use the generic 2-level folding header.
Just to show how it is done.

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>


---

 linux-2.6-npiggin/include/asm-i386/mmzone.h              |    1 
 linux-2.6-npiggin/include/asm-i386/page.h                |    6 --
 linux-2.6-npiggin/include/asm-i386/pgalloc.h             |   17 +++----
 linux-2.6-npiggin/include/asm-i386/pgtable-2level-defs.h |    2 
 linux-2.6-npiggin/include/asm-i386/pgtable-2level.h      |   33 +++------------
 linux-2.6-npiggin/include/asm-i386/pgtable-3level.h      |   11 +++++
 linux-2.6-npiggin/include/asm-i386/pgtable.h             |   13 +----
 7 files changed, 31 insertions(+), 52 deletions(-)

diff -puN include/asm-i386/pgtable-2level.h~3level-i386-cleanup include/asm-i386/pgtable-2level.h
--- linux-2.6/include/asm-i386/pgtable-2level.h~3level-i386-cleanup	2004-12-18 16:47:57.000000000 +1100
+++ linux-2.6-npiggin/include/asm-i386/pgtable-2level.h	2004-12-18 16:47:57.000000000 +1100
@@ -1,44 +1,22 @@
 #ifndef _I386_PGTABLE_2LEVEL_H
 #define _I386_PGTABLE_2LEVEL_H
 
+#include <asm-generic/pgtable-nopmd.h>
+
 #define pte_ERROR(e) \
 	printk("%s:%d: bad pte %08lx.\n", __FILE__, __LINE__, (e).pte_low)
-#define pmd_ERROR(e) \
-	printk("%s:%d: bad pmd %08lx.\n", __FILE__, __LINE__, pmd_val(e))
 #define pgd_ERROR(e) \
 	printk("%s:%d: bad pgd %08lx.\n", __FILE__, __LINE__, pgd_val(e))
 
 /*
- * The "pgd_xxx()" functions here are trivial for a folded two-level
- * setup: the pgd is never bad, and a pmd always exists (as it's folded
- * into the pgd entry)
- */
-static inline int pgd_none(pgd_t pgd)		{ return 0; }
-static inline int pgd_bad(pgd_t pgd)		{ return 0; }
-static inline int pgd_present(pgd_t pgd)	{ return 1; }
-#define pgd_clear(xp)				do { } while (0)
-
-/*
  * Certain architectures need to do special things when PTEs
  * within a page table are directly modified.  Thus, the following
  * hook is made available.
  */
 #define set_pte(pteptr, pteval) (*(pteptr) = pteval)
 #define set_pte_atomic(pteptr, pteval) set_pte(pteptr,pteval)
-/*
- * (pmds are folded into pgds so this doesn't get actually called,
- * but the define is needed for a generic inline function.)
- */
-#define set_pmd(pmdptr, pmdval) (*(pmdptr) = pmdval)
-#define set_pgd(pgdptr, pgdval) (*(pgdptr) = pgdval)
+#define set_pmd(pmdptr, pmdval) (*(pmdptr) = (pmdval))
 
-#define pgd_page(pgd) \
-((unsigned long) __va(pgd_val(pgd) & PAGE_MASK))
-
-static inline pmd_t * pmd_offset(pgd_t * dir, unsigned long address)
-{
-	return (pmd_t *) dir;
-}
 #define ptep_get_and_clear(xp)	__pte(xchg(&(xp)->pte_low, 0))
 #define pte_same(a, b)		((a).pte_low == (b).pte_low)
 #define pte_page(x)		pfn_to_page(pte_pfn(x))
@@ -47,6 +25,11 @@ static inline pmd_t * pmd_offset(pgd_t *
 #define pfn_pte(pfn, prot)	__pte(((pfn) << PAGE_SHIFT) | pgprot_val(prot))
 #define pfn_pmd(pfn, prot)	__pmd(((pfn) << PAGE_SHIFT) | pgprot_val(prot))
 
+#define pmd_page(pmd) (pfn_to_page(pmd_val(pmd) >> PAGE_SHIFT))
+
+#define pmd_page_kernel(pmd) \
+((unsigned long) __va(pmd_val(pmd) & PAGE_MASK))
+
 /*
  * All present user pages are user-executable:
  */
diff -puN include/asm-i386/page.h~3level-i386-cleanup include/asm-i386/page.h
--- linux-2.6/include/asm-i386/page.h~3level-i386-cleanup	2004-12-18 16:47:57.000000000 +1100
+++ linux-2.6-npiggin/include/asm-i386/page.h	2004-12-18 16:47:57.000000000 +1100
@@ -46,11 +46,12 @@ typedef struct { unsigned long pte_low, 
 typedef struct { unsigned long long pmd; } pmd_t;
 typedef struct { unsigned long long pgd; } pgd_t;
 typedef struct { unsigned long long pgprot; } pgprot_t;
+#define pmd_val(x)	((x).pmd)
 #define pte_val(x)	((x).pte_low | ((unsigned long long)(x).pte_high << 32))
+#define __pmd(x) ((pmd_t) { (x) } )
 #define HPAGE_SHIFT	21
 #else
 typedef struct { unsigned long pte_low; } pte_t;
-typedef struct { unsigned long pmd; } pmd_t;
 typedef struct { unsigned long pgd; } pgd_t;
 typedef struct { unsigned long pgprot; } pgprot_t;
 #define boot_pte_t pte_t /* or would you rather have a typedef */
@@ -66,13 +67,10 @@ typedef struct { unsigned long pgprot; }
 #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA
 #endif
 
-
-#define pmd_val(x)	((x).pmd)
 #define pgd_val(x)	((x).pgd)
 #define pgprot_val(x)	((x).pgprot)
 
 #define __pte(x) ((pte_t) { (x) } )
-#define __pmd(x) ((pmd_t) { (x) } )
 #define __pgd(x) ((pgd_t) { (x) } )
 #define __pgprot(x)	((pgprot_t) { (x) } )
 
diff -puN include/asm-i386/pgtable-2level-defs.h~3level-i386-cleanup include/asm-i386/pgtable-2level-defs.h
--- linux-2.6/include/asm-i386/pgtable-2level-defs.h~3level-i386-cleanup	2004-12-18 16:47:57.000000000 +1100
+++ linux-2.6-npiggin/include/asm-i386/pgtable-2level-defs.h	2004-12-18 16:47:57.000000000 +1100
@@ -12,8 +12,6 @@
  * the i386 is two-level, so we don't really have any
  * PMD directory physically.
  */
-#define PMD_SHIFT	22
-#define PTRS_PER_PMD	1
 
 #define PTRS_PER_PTE	1024
 
diff -puN include/asm-i386/pgtable-3level.h~3level-i386-cleanup include/asm-i386/pgtable-3level.h
--- linux-2.6/include/asm-i386/pgtable-3level.h~3level-i386-cleanup	2004-12-18 16:47:57.000000000 +1100
+++ linux-2.6-npiggin/include/asm-i386/pgtable-3level.h	2004-12-18 17:07:46.000000000 +1100
@@ -70,9 +70,18 @@ static inline void set_pte(pte_t *ptep, 
  */
 static inline void pgd_clear (pgd_t * pgd) { }
 
+#define pmd_page(pmd) (pfn_to_page(pmd_val(pmd) >> PAGE_SHIFT))
+
+#define pmd_page_kernel(pmd) \
+((unsigned long) __va(pmd_val(pmd) & PAGE_MASK))
+
 #define pgd_page(pgd) \
+((struct page *) __va(pgd_val(pgd) & PAGE_MASK))
+
+#define pgd_page_kernel(pgd) \
 ((unsigned long) __va(pgd_val(pgd) & PAGE_MASK))
 
+
 /* Find an entry in the second-level page table.. */
 #define pmd_offset(dir, address) ((pmd_t *) pgd_page(*(dir)) + \
 			pmd_index(address))
@@ -142,4 +151,6 @@ static inline pmd_t pfn_pmd(unsigned lon
 #define __pte_to_swp_entry(pte)		((swp_entry_t){ (pte).pte_high })
 #define __swp_entry_to_pte(x)		((pte_t){ 0, (x).val })
 
+#define __pmd_free_tlb(tlb, x)		do { } while (0)
+
 #endif /* _I386_PGTABLE_3LEVEL_H */
diff -puN include/asm-i386/pgalloc.h~3level-i386-cleanup include/asm-i386/pgalloc.h
--- linux-2.6/include/asm-i386/pgalloc.h~3level-i386-cleanup	2004-12-18 16:47:57.000000000 +1100
+++ linux-2.6-npiggin/include/asm-i386/pgalloc.h	2004-12-18 17:07:46.000000000 +1100
@@ -10,12 +10,10 @@
 #define pmd_populate_kernel(mm, pmd, pte) \
 		set_pmd(pmd, __pmd(_PAGE_TABLE + __pa(pte)))
 
-static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
-{
-	set_pmd(pmd, __pmd(_PAGE_TABLE +
-		((unsigned long long)page_to_pfn(pte) <<
-			(unsigned long long) PAGE_SHIFT)));
-}
+#define pmd_populate(mm, pmd, pte) 				\
+	set_pmd(pmd, __pmd(_PAGE_TABLE +			\
+		((unsigned long long)page_to_pfn(pte) <<	\
+			(unsigned long long) PAGE_SHIFT)))
 /*
  * Allocate and free page tables.
  */
@@ -39,16 +37,15 @@ static inline void pte_free(struct page 
 
 #define __pte_free_tlb(tlb,pte) tlb_remove_page((tlb),(pte))
 
+#ifdef CONFIG_X86_PAE
 /*
- * allocating and freeing a pmd is trivial: the 1-entry pmd is
- * inside the pgd, so has no extra memory associated with it.
- * (In the PAE case we free the pmds as part of the pgd.)
+ * In the PAE case we free the pmds as part of the pgd.
  */
-
 #define pmd_alloc_one(mm, addr)		({ BUG(); ((pmd_t *)2); })
 #define pmd_free(x)			do { } while (0)
 #define __pmd_free_tlb(tlb,x)		do { } while (0)
 #define pgd_populate(mm, pmd, pte)	BUG()
+#endif
 
 #define check_pgt_cache()	do { } while (0)
 
diff -puN include/asm-i386/pgtable.h~3level-i386-cleanup include/asm-i386/pgtable.h
--- linux-2.6/include/asm-i386/pgtable.h~3level-i386-cleanup	2004-12-18 16:47:57.000000000 +1100
+++ linux-2.6-npiggin/include/asm-i386/pgtable.h	2004-12-18 17:07:46.000000000 +1100
@@ -50,12 +50,12 @@ void paging_init(void);
  */
 #ifdef CONFIG_X86_PAE
 # include <asm/pgtable-3level-defs.h>
+# define PMD_SIZE	(1UL << PMD_SHIFT)
+# define PMD_MASK	(~(PMD_SIZE-1))
 #else
 # include <asm/pgtable-2level-defs.h>
 #endif
 
-#define PMD_SIZE	(1UL << PMD_SHIFT)
-#define PMD_MASK	(~(PMD_SIZE-1))
 #define PGDIR_SIZE	(1UL << PGDIR_SHIFT)
 #define PGDIR_MASK	(~(PGDIR_SIZE-1))
 
@@ -293,15 +293,8 @@ static inline pte_t pte_modify(pte_t pte
 
 #define page_pte(page) page_pte_prot(page, __pgprot(0))
 
-#define pmd_page_kernel(pmd) \
-((unsigned long) __va(pmd_val(pmd) & PAGE_MASK))
-
-#ifndef CONFIG_DISCONTIGMEM
-#define pmd_page(pmd) (pfn_to_page(pmd_val(pmd) >> PAGE_SHIFT))
-#endif /* !CONFIG_DISCONTIGMEM */
-
 #define pmd_large(pmd) \
-	((pmd_val(pmd) & (_PAGE_PSE|_PAGE_PRESENT)) == (_PAGE_PSE|_PAGE_PRESENT))
+((pmd_val(pmd) & (_PAGE_PSE|_PAGE_PRESENT)) == (_PAGE_PSE|_PAGE_PRESENT))
 
 /*
  * the pgd page can be thought of an array like this: pgd_t[PTRS_PER_PGD]
diff -puN include/asm-i386/mmzone.h~3level-i386-cleanup include/asm-i386/mmzone.h
--- linux-2.6/include/asm-i386/mmzone.h~3level-i386-cleanup	2004-12-18 16:47:57.000000000 +1100
+++ linux-2.6-npiggin/include/asm-i386/mmzone.h	2004-12-18 16:47:57.000000000 +1100
@@ -116,7 +116,6 @@ static inline struct pglist_data *pfn_to
 	(unsigned long)(__page - __zone->zone_mem_map)			\
 		+ __zone->zone_start_pfn;				\
 })
-#define pmd_page(pmd)		(pfn_to_page(pmd_val(pmd) >> PAGE_SHIFT))
 
 #ifdef CONFIG_X86_NUMAQ            /* we have contiguous memory on NUMA-Q */
 #define pfn_valid(pfn)          ((pfn) < num_physpages)

_


* [PATCH 3/10] alternate 4-level page tables patches
  2004-12-18  6:56   ` [PATCH 2/10] " Nick Piggin
@ 2004-12-18  6:56     ` Nick Piggin
  2004-12-18  6:57       ` [PATCH 4/10] " Nick Piggin
  0 siblings, 1 reply; 77+ messages in thread
From: Nick Piggin @ 2004-12-18  6:56 UTC
  To: Linux Memory Management
  Cc: Andi Kleen, Hugh Dickins, Linus Torvalds, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 5 bytes --]

3/10

[-- Attachment #2: 3level-split-copy_page_range.patch --]
[-- Type: text/plain, Size: 8698 bytes --]



Split copy_page_range into the usual set of page table walking functions.
Needed to handle the complexity when moving to 4 levels.

Split out from Andi Kleen's 4level patch.
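
Each of the new per-level functions follows the same clamped iteration idiom;
condensed from the pmd-level walker in the patch below:

	for (; addr < end; addr = next, src_pmd++, dst_pmd++) {
		next = (addr + PMD_SIZE) & PMD_MASK;	/* end of this pmd's span */
		if (next > end)
			next = end;			/* clamp to the copy range */
		if (pmd_none(*src_pmd))
			continue;
		/* check pmd_bad, then descend to copy_pte_range for [addr, next) */
	}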

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>


---

 linux-2.6-npiggin/mm/memory.c |  290 ++++++++++++++++++++++--------------------
 1 files changed, 152 insertions(+), 138 deletions(-)

diff -puN mm/memory.c~3level-split-copy_page_range mm/memory.c
--- linux-2.6/mm/memory.c~3level-split-copy_page_range	2004-12-18 16:48:55.000000000 +1100
+++ linux-2.6-npiggin/mm/memory.c	2004-12-18 17:07:49.000000000 +1100
@@ -204,165 +204,179 @@ pte_t fastcall * pte_alloc_kernel(struct
 out:
 	return pte_offset_kernel(pmd, address);
 }
-#define PTE_TABLE_MASK	((PTRS_PER_PTE-1) * sizeof(pte_t))
-#define PMD_TABLE_MASK	((PTRS_PER_PMD-1) * sizeof(pmd_t))
 
 /*
  * copy one vm_area from one task to the other. Assumes the page tables
  * already present in the new task to be cleared in the whole range
  * covered by this vma.
  *
- * 08Jan98 Merged into one routine from several inline routines to reduce
- *         variable count and make things faster. -jj
- *
  * dst->page_table_lock is held on entry and exit,
- * but may be dropped within pmd_alloc() and pte_alloc_map().
+ * but may be dropped within p[mg]d_alloc() and pte_alloc_map().
  */
+
+static inline void
+copy_swap_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, pte_t pte)
+{
+	if (pte_file(pte))
+		return;
+	swap_duplicate(pte_to_swp_entry(pte));
+	if (list_empty(&dst_mm->mmlist)) {
+		spin_lock(&mmlist_lock);
+		list_add(&dst_mm->mmlist, &src_mm->mmlist);
+		spin_unlock(&mmlist_lock);
+	}
+}
+
+static inline void
+copy_one_pte(struct mm_struct *dst_mm,  struct mm_struct *src_mm,
+		pte_t *dst_pte, pte_t *src_pte, unsigned long vm_flags,
+		unsigned long addr)
+{
+	pte_t pte = *src_pte;
+	struct page *page;
+	unsigned long pfn;
+
+	/* pte contains position in swap, so copy. */
+	if (!pte_present(pte)) {
+		copy_swap_pte(dst_mm, src_mm, pte);
+		set_pte(dst_pte, pte);
+		return;
+	}
+	pfn = pte_pfn(pte);
+	/* the pte points outside of valid memory, the
+	 * mapping is assumed to be good, meaningful
+	 * and not mapped via rmap - duplicate the
+	 * mapping as is.
+	 */
+	page = NULL;
+	if (pfn_valid(pfn))
+		page = pfn_to_page(pfn);
+
+	if (!page || PageReserved(page)) {
+		set_pte(dst_pte, pte);
+		return;
+	}
+
+	/*
+	 * If it's a COW mapping, write protect it both
+	 * in the parent and the child
+	 */
+	if ((vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE) {
+		ptep_set_wrprotect(src_pte);
+		pte = *src_pte;
+	}
+
+	/*
+	 * If it's a shared mapping, mark it clean in
+	 * the child
+	 */
+	if (vm_flags & VM_SHARED)
+		pte = pte_mkclean(pte);
+	pte = pte_mkold(pte);
+	get_page(page);
+	dst_mm->rss++;
+	if (PageAnon(page))
+		dst_mm->anon_rss++;
+	set_pte(dst_pte, pte);
+	page_dup_rmap(page);
+}
+
+static int copy_pte_range(struct mm_struct *dst_mm,  struct mm_struct *src_mm,
+		pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
+		unsigned long addr, unsigned long end)
+{
+	pte_t *src_pte, *dst_pte;
+	pte_t *s, *d;
+	unsigned long vm_flags = vma->vm_flags;
+
+	d = dst_pte = pte_alloc_map(dst_mm, dst_pmd, addr);
+	if (!dst_pte)
+		return -ENOMEM;
+
+	spin_lock(&src_mm->page_table_lock);
+	s = src_pte = pte_offset_map_nested(src_pmd, addr);
+	for (; addr < end; addr += PAGE_SIZE, s++, d++) {
+		if (pte_none(*s))
+			continue;
+		copy_one_pte(dst_mm, src_mm, d, s, vm_flags, addr);
+	}
+	pte_unmap_nested(src_pte);
+	pte_unmap(dst_pte);
+	spin_unlock(&src_mm->page_table_lock);
+	cond_resched_lock(&dst_mm->page_table_lock);
+	return 0;
+}
+
+static int copy_pmd_range(struct mm_struct *dst_mm,  struct mm_struct *src_mm,
+		pgd_t *dst_pgd, pgd_t *src_pgd, struct vm_area_struct *vma,
+		unsigned long addr, unsigned long end)
+{
+	pmd_t *src_pmd, *dst_pmd;
+	int err = 0;
+	unsigned long next;
+
+	src_pmd = pmd_offset(src_pgd, addr);
+	dst_pmd = pmd_alloc(dst_mm, dst_pgd, addr);
+	if (!dst_pmd)
+		return -ENOMEM;
+
+	for (; addr < end; addr = next, src_pmd++, dst_pmd++) {
+		next = (addr + PMD_SIZE) & PMD_MASK;
+		if (next > end)
+			next = end;
+		if (pmd_none(*src_pmd))
+			continue;
+		if (pmd_bad(*src_pmd)) {
+			pmd_ERROR(*src_pmd);
+			pmd_clear(src_pmd);
+			continue;
+		}
+		err = copy_pte_range(dst_mm, src_mm, dst_pmd, src_pmd,
+							vma, addr, next);
+		if (err)
+			break;
+	}
+	return err;
+}
+
 int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
-			struct vm_area_struct *vma)
+		struct vm_area_struct *vma)
 {
-	pgd_t * src_pgd, * dst_pgd;
-	unsigned long address = vma->vm_start;
-	unsigned long end = vma->vm_end;
-	unsigned long cow;
+	pgd_t *src_pgd, *dst_pgd;
+	unsigned long addr, start, end, next;
+	int err = 0;
 
 	if (is_vm_hugetlb_page(vma))
 		return copy_hugetlb_page_range(dst, src, vma);
 
-	cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
-	src_pgd = pgd_offset(src, address)-1;
-	dst_pgd = pgd_offset(dst, address)-1;
-
-	for (;;) {
-		pmd_t * src_pmd, * dst_pmd;
-
-		src_pgd++; dst_pgd++;
-		
-		/* copy_pmd_range */
-		
+	start = vma->vm_start;
+	src_pgd = pgd_offset(src, start);
+	dst_pgd = pgd_offset(dst, start);
+
+	end = vma->vm_end;
+	addr = start;
+	while (addr && (addr < end-1)) {
+		next = (addr + PGDIR_SIZE) & PGDIR_MASK;
+		if (next > end || next <= addr)
+			next = end;
 		if (pgd_none(*src_pgd))
-			goto skip_copy_pmd_range;
-		if (unlikely(pgd_bad(*src_pgd))) {
+			continue;
+		if (pgd_bad(*src_pgd)) {
 			pgd_ERROR(*src_pgd);
 			pgd_clear(src_pgd);
-skip_copy_pmd_range:	address = (address + PGDIR_SIZE) & PGDIR_MASK;
-			if (!address || (address >= end))
-				goto out;
 			continue;
 		}
+		err = copy_pmd_range(dst, src, dst_pgd, src_pgd,
+							vma, addr, next);
+		if (err)
+			break;
 
-		src_pmd = pmd_offset(src_pgd, address);
-		dst_pmd = pmd_alloc(dst, dst_pgd, address);
-		if (!dst_pmd)
-			goto nomem;
-
-		do {
-			pte_t * src_pte, * dst_pte;
-		
-			/* copy_pte_range */
-		
-			if (pmd_none(*src_pmd))
-				goto skip_copy_pte_range;
-			if (unlikely(pmd_bad(*src_pmd))) {
-				pmd_ERROR(*src_pmd);
-				pmd_clear(src_pmd);
-skip_copy_pte_range:
-				address = (address + PMD_SIZE) & PMD_MASK;
-				if (address >= end)
-					goto out;
-				goto cont_copy_pmd_range;
-			}
-
-			dst_pte = pte_alloc_map(dst, dst_pmd, address);
-			if (!dst_pte)
-				goto nomem;
-			spin_lock(&src->page_table_lock);	
-			src_pte = pte_offset_map_nested(src_pmd, address);
-			do {
-				pte_t pte = *src_pte;
-				struct page *page;
-				unsigned long pfn;
-
-				/* copy_one_pte */
-
-				if (pte_none(pte))
-					goto cont_copy_pte_range_noset;
-				/* pte contains position in swap, so copy. */
-				if (!pte_present(pte)) {
-					if (!pte_file(pte)) {
-						swap_duplicate(pte_to_swp_entry(pte));
-						if (list_empty(&dst->mmlist)) {
-							spin_lock(&mmlist_lock);
-							list_add(&dst->mmlist,
-								 &src->mmlist);
-							spin_unlock(&mmlist_lock);
-						}
-					}
-					set_pte(dst_pte, pte);
-					goto cont_copy_pte_range_noset;
-				}
-				pfn = pte_pfn(pte);
-				/* the pte points outside of valid memory, the
-				 * mapping is assumed to be good, meaningful
-				 * and not mapped via rmap - duplicate the
-				 * mapping as is.
-				 */
-				page = NULL;
-				if (pfn_valid(pfn)) 
-					page = pfn_to_page(pfn); 
-
-				if (!page || PageReserved(page)) {
-					set_pte(dst_pte, pte);
-					goto cont_copy_pte_range_noset;
-				}
-
-				/*
-				 * If it's a COW mapping, write protect it both
-				 * in the parent and the child
-				 */
-				if (cow) {
-					ptep_set_wrprotect(src_pte);
-					pte = *src_pte;
-				}
-
-				/*
-				 * If it's a shared mapping, mark it clean in
-				 * the child
-				 */
-				if (vma->vm_flags & VM_SHARED)
-					pte = pte_mkclean(pte);
-				pte = pte_mkold(pte);
-				get_page(page);
-				dst->rss++;
-				if (PageAnon(page))
-					dst->anon_rss++;
-				set_pte(dst_pte, pte);
-				page_dup_rmap(page);
-cont_copy_pte_range_noset:
-				address += PAGE_SIZE;
-				if (address >= end) {
-					pte_unmap_nested(src_pte);
-					pte_unmap(dst_pte);
-					goto out_unlock;
-				}
-				src_pte++;
-				dst_pte++;
-			} while ((unsigned long)src_pte & PTE_TABLE_MASK);
-			pte_unmap_nested(src_pte-1);
-			pte_unmap(dst_pte-1);
-			spin_unlock(&src->page_table_lock);
-			cond_resched_lock(&dst->page_table_lock);
-cont_copy_pmd_range:
-			src_pmd++;
-			dst_pmd++;
-		} while ((unsigned long)src_pmd & PMD_TABLE_MASK);
+		src_pgd++;
+		dst_pgd++;
+		addr = next;
 	}
-out_unlock:
-	spin_unlock(&src->page_table_lock);
-out:
-	return 0;
-nomem:
-	return -ENOMEM;
+
+	return err;
 }
 
 static void zap_pte_range(struct mmu_gather *tlb,

_


* [PATCH 4/10] alternate 4-level page tables patches
  2004-12-18  6:56     ` [PATCH 3/10] " Nick Piggin
@ 2004-12-18  6:57       ` Nick Piggin
  2004-12-18  6:58         ` [PATCH 5/10] " Nick Piggin
                           ` (2 more replies)
  0 siblings, 3 replies; 77+ messages in thread
From: Nick Piggin @ 2004-12-18  6:57 UTC
  To: Linux Memory Management
  Cc: Andi Kleen, Hugh Dickins, Linus Torvalds, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 5 bytes --]

4/10

[-- Attachment #2: 3level-clear_page_range.patch --]
[-- Type: text/plain, Size: 6693 bytes --]



Rename clear_page_tables to clear_page_range. clear_page_range takes byte
ranges, and aggressively frees page table pages. This may be useful for controlling
page table memory consumption on 4-level architectures (and even 3-level ones).
The caller-visible change is sketched after the list below.

Possible downsides are:
- flush_tlb_pgtables gets called more often (only a problem for sparc64
  AFAIKS).

- the opportunistic "expand to fill the PGDIR_SIZE hole" logic, which ensures
  something actually gets done under the old system, is still in place.
  This could sometimes make unmapping small regions less efficient. There
  are some other solutions to look at if this turns out to be a problem.
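
The caller-visible change, condensed from the mm/mmap.c hunk below, is from pgd
indices to byte addresses:

	/* before: a pgd index plus a count of pgd entries */
	clear_page_tables(tlb, start_index, end_index - start_index);

	/* after: a byte range */
	clear_page_range(tlb, first, last);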

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>


---

 linux-2.6-npiggin/include/linux/mm.h |    2 
 linux-2.6-npiggin/mm/memory.c        |   80 +++++++++++++++++++++--------------
 linux-2.6-npiggin/mm/mmap.c          |   23 +++-------
 3 files changed, 58 insertions(+), 47 deletions(-)

diff -puN include/linux/mm.h~3level-clear_page_range include/linux/mm.h
--- linux-2.6/include/linux/mm.h~3level-clear_page_range	2004-12-18 16:50:44.000000000 +1100
+++ linux-2.6-npiggin/include/linux/mm.h	2004-12-18 17:07:48.000000000 +1100
@@ -566,7 +566,7 @@ int unmap_vmas(struct mmu_gather **tlbp,
 		struct vm_area_struct *start_vma, unsigned long start_addr,
 		unsigned long end_addr, unsigned long *nr_accounted,
 		struct zap_details *);
-void clear_page_tables(struct mmu_gather *tlb, unsigned long first, int nr);
+void clear_page_range(struct mmu_gather *tlb, unsigned long addr, unsigned long end);
 int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
 			struct vm_area_struct *vma);
 int zeromap_page_range(struct vm_area_struct *vma, unsigned long from,
diff -puN mm/memory.c~3level-clear_page_range mm/memory.c
--- linux-2.6/mm/memory.c~3level-clear_page_range	2004-12-18 16:50:44.000000000 +1100
+++ linux-2.6-npiggin/mm/memory.c	2004-12-18 17:07:48.000000000 +1100
@@ -100,58 +100,76 @@ static inline void copy_cow_page(struct 
  * Note: this doesn't free the actual pages themselves. That
  * has been handled earlier when unmapping all the memory regions.
  */
-static inline void free_one_pmd(struct mmu_gather *tlb, pmd_t * dir)
+static inline void clear_pmd_range(struct mmu_gather *tlb, pmd_t *pmd, unsigned long start, unsigned long end)
 {
 	struct page *page;
 
-	if (pmd_none(*dir))
+	if (pmd_none(*pmd))
 		return;
-	if (unlikely(pmd_bad(*dir))) {
-		pmd_ERROR(*dir);
-		pmd_clear(dir);
+	if (unlikely(pmd_bad(*pmd))) {
+		pmd_ERROR(*pmd);
+		pmd_clear(pmd);
 		return;
 	}
-	page = pmd_page(*dir);
-	pmd_clear(dir);
-	dec_page_state(nr_page_table_pages);
-	tlb->mm->nr_ptes--;
-	pte_free_tlb(tlb, page);
+	if (!(start & ~PMD_MASK) && !(end & ~PMD_MASK)) {
+		page = pmd_page(*pmd);
+		pmd_clear(pmd);
+		dec_page_state(nr_page_table_pages);
+		tlb->mm->nr_ptes--;
+		pte_free_tlb(tlb, page);
+	}
 }
 
-static inline void free_one_pgd(struct mmu_gather *tlb, pgd_t * dir)
+static inline void clear_pgd_range(struct mmu_gather *tlb, pgd_t *pgd, unsigned long start, unsigned long end)
 {
-	int j;
-	pmd_t * pmd;
+	unsigned long addr = start, next;
+	pmd_t *pmd, *__pmd;
 
-	if (pgd_none(*dir))
+	if (pgd_none(*pgd))
 		return;
-	if (unlikely(pgd_bad(*dir))) {
-		pgd_ERROR(*dir);
-		pgd_clear(dir);
+	if (unlikely(pgd_bad(*pgd))) {
+		pgd_ERROR(*pgd);
+		pgd_clear(pgd);
 		return;
 	}
-	pmd = pmd_offset(dir, 0);
-	pgd_clear(dir);
-	for (j = 0; j < PTRS_PER_PMD ; j++)
-		free_one_pmd(tlb, pmd+j);
-	pmd_free_tlb(tlb, pmd);
+
+	pmd = __pmd = pmd_offset(pgd, start);
+	do {
+		next = (addr + PMD_SIZE) & PMD_MASK;
+		if (next > end || next <= addr)
+			next = end;
+		
+		clear_pmd_range(tlb, pmd, addr, next);
+		pmd++;
+		addr = next;
+	} while (addr && (addr <= end - 1));
+
+	if (!(start & ~PGDIR_MASK) && !(end & ~PGDIR_MASK)) {
+		pgd_clear(pgd);
+		pmd_free_tlb(tlb, __pmd);
+	}
 }
 
 /*
- * This function clears all user-level page tables of a process - this
- * is needed by execve(), so that old pages aren't in the way.
+ * This function clears user-level page tables of a process.
  *
  * Must be called with pagetable lock held.
  */
-void clear_page_tables(struct mmu_gather *tlb, unsigned long first, int nr)
+void clear_page_range(struct mmu_gather *tlb, unsigned long start, unsigned long end)
 {
-	pgd_t * page_dir = tlb->mm->pgd;
+	unsigned long addr = start, next;
+	unsigned long i, nr = pgd_index(end + PGDIR_SIZE-1) - pgd_index(start);
+	pgd_t * pgd = pgd_offset(tlb->mm, start);
 
-	page_dir += first;
-	do {
-		free_one_pgd(tlb, page_dir);
-		page_dir++;
-	} while (--nr);
+	for (i = 0; i < nr; i++) {
+		next = (addr + PGDIR_SIZE) & PGDIR_MASK;
+		if (next > end || next <= addr)
+			next = end;
+		
+		clear_pgd_range(tlb, pgd, addr, next);
+		pgd++;
+		addr = next;
+	}
 }
 
 pte_t fastcall * pte_alloc_map(struct mm_struct *mm, pmd_t *pmd, unsigned long address)
diff -puN mm/mmap.c~3level-clear_page_range mm/mmap.c
--- linux-2.6/mm/mmap.c~3level-clear_page_range	2004-12-18 16:50:44.000000000 +1100
+++ linux-2.6-npiggin/mm/mmap.c	2004-12-18 16:50:44.000000000 +1100
@@ -1474,7 +1474,6 @@ static void free_pgtables(struct mmu_gat
 {
 	unsigned long first = start & PGDIR_MASK;
 	unsigned long last = end + PGDIR_SIZE - 1;
-	unsigned long start_index, end_index;
 	struct mm_struct *mm = tlb->mm;
 
 	if (!prev) {
@@ -1499,24 +1498,16 @@ static void free_pgtables(struct mmu_gat
 				last = next->vm_start;
 		}
 		if (prev->vm_end > first)
-			first = prev->vm_end + PGDIR_SIZE - 1;
+			first = prev->vm_end;
 		break;
 	}
 no_mmaps:
 	if (last < first)	/* for arches with discontiguous pgd indices */
 		return;
-	/*
-	 * If the PGD bits are not consecutive in the virtual address, the
-	 * old method of shifting the VA >> by PGDIR_SHIFT doesn't work.
-	 */
-	start_index = pgd_index(first);
-	if (start_index < FIRST_USER_PGD_NR)
-		start_index = FIRST_USER_PGD_NR;
-	end_index = pgd_index(last);
-	if (end_index > start_index) {
-		clear_page_tables(tlb, start_index, end_index - start_index);
-		flush_tlb_pgtables(mm, first & PGDIR_MASK, last & PGDIR_MASK);
-	}
+	if (first < FIRST_USER_PGD_NR * PGDIR_SIZE)
+		first = FIRST_USER_PGD_NR * PGDIR_SIZE;
+	clear_page_range(tlb, first, last);
+	flush_tlb_pgtables(mm, first, last);
 }
 
 /* Normal function to fix up a mapping
@@ -1844,7 +1835,9 @@ void exit_mmap(struct mm_struct *mm)
 					~0UL, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
 	BUG_ON(mm->map_count);	/* This is just debugging */
-	clear_page_tables(tlb, FIRST_USER_PGD_NR, USER_PTRS_PER_PGD);
+	clear_page_range(tlb, FIRST_USER_PGD_NR * PGDIR_SIZE,
+			(TASK_SIZE + PGDIR_SIZE - 1) & PGDIR_MASK);
+	
 	tlb_finish_mmu(tlb, 0, MM_VM_SIZE(mm));
 
 	vma = mm->mmap;

_


* [PATCH 5/10] alternate 4-level page tables patches
  2004-12-18  6:57       ` [PATCH 4/10] " Nick Piggin
@ 2004-12-18  6:58         ` Nick Piggin
  2004-12-18  6:58           ` [PATCH 6/10] " Nick Piggin
  2004-12-18  9:05         ` [PATCH 4/10] " Nick Piggin
  2004-12-18 10:45         ` William Lee Irwin III
  2 siblings, 1 reply; 77+ messages in thread
From: Nick Piggin @ 2004-12-18  6:58 UTC
  To: Linux Memory Management
  Cc: Andi Kleen, Hugh Dickins, Linus Torvalds, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 5 bytes --]

5/10

[-- Attachment #2: 4level-compat.patch --]
[-- Type: text/plain, Size: 6300 bytes --]



Generic headers to fold the 4-level pagetable into 3 levels.
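
As with the nopmd header from patch 1/10, a 3-level architecture would pick this up
with a single include in its pgtable header (a sketch of the intended usage; the
per-arch conversions are in later patches):

	#include <asm-generic/pgtable-nopud.h>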

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>


---

 linux-2.6-npiggin/include/asm-generic/pgtable-nopmd.h |   46 +++++++------
 linux-2.6-npiggin/include/asm-generic/pgtable-nopud.h |   61 ++++++++++++++++++
 linux-2.6-npiggin/include/asm-generic/tlb.h           |    6 +
 3 files changed, 91 insertions(+), 22 deletions(-)

diff -puN /dev/null include/asm-generic/pgtable-nopud.h
--- /dev/null	2004-09-06 19:38:39.000000000 +1000
+++ linux-2.6-npiggin/include/asm-generic/pgtable-nopud.h	2004-12-18 16:57:19.000000000 +1100
@@ -0,0 +1,61 @@
+#ifndef _PGTABLE_NOPUD_H
+#define _PGTABLE_NOPUD_H
+
+#ifndef __ASSEMBLY__
+
+/*
+ * Having the pud type consist of a pgd gets the size right, and allows
+ * us to conceptually access the pgd entry that this pud is folded into
+ * without casting.
+ */
+typedef struct { pgd_t pgd; } pud_t;
+
+#define PUD_SHIFT	PGDIR_SHIFT
+#define PTRS_PER_PUD	1
+#define PUD_SIZE  	(1UL << PUD_SHIFT)
+#define PUD_MASK  	(~(PUD_SIZE-1))
+
+/*
+ * The "pgd_xxx()" functions here are trivial for a folded two-level
+ * setup: the pud is never bad, and a pud always exists (as it's folded
+ * into the pgd entry)
+ */
+static inline int pgd_none(pgd_t pgd)		{ return 0; }
+static inline int pgd_bad(pgd_t pgd)		{ return 0; }
+static inline int pgd_present(pgd_t pgd)	{ return 1; }
+static inline void pgd_clear(pgd_t *pgd)	{ }
+#define pud_ERROR(pud)				(pgd_ERROR((pud).pgd))
+
+#define pgd_populate(mm, pgd, pud)		do { } while (0)
+/*
+ * (puds are folded into pgds so this doesn't get actually called,
+ * but the define is needed for a generic inline function.)
+ */
+#define set_pgd(pgdptr, pgdval)			set_pud((pud_t *)(pgdptr), (pud_t) { pgdval })
+
+static inline pud_t * pud_offset(pgd_t * pgd, unsigned long address)
+{
+	return (pud_t *)pgd;
+}
+
+static inline pud_t * pud_offset_k(pgd_t * pgd, unsigned long address)
+{
+	return (pud_t *)pgd;
+}
+
+#define pud_val(x)				(pgd_val((x).pgd))
+#define __pud(x)				((pud_t) { __pgd(x) } )
+
+#define pgd_page(pgd)				(pud_page((pud_t){ pgd }))
+#define pgd_page_kernel(pgd)			(pud_page_kernel((pud_t){ pgd }))
+
+/*
+ * allocating and freeing a pud is trivial: the 1-entry pud is
+ * inside the pgd, so has no extra memory associated with it.
+ */
+#define pud_alloc_one(mm, address)		NULL
+#define pud_free(x)				do { } while (0)
+#define __pud_free_tlb(tlb, x)			do { } while (0)
+
+#endif /* __ASSEMBLY__ */
+#endif /* _PGTABLE_NOPUD_H */
diff -puN include/asm-generic/pgtable-nopmd.h~4level-compat include/asm-generic/pgtable-nopmd.h
--- linux-2.6/include/asm-generic/pgtable-nopmd.h~4level-compat	2004-12-18 16:57:19.000000000 +1100
+++ linux-2.6-npiggin/include/asm-generic/pgtable-nopmd.h	2004-12-18 16:57:19.000000000 +1100
@@ -3,52 +3,54 @@
 
 #ifndef __ASSEMBLY__
 
+#include <asm-generic/pgtable-nopud.h>
+
 /*
- * Having the pmd type consist of a pgd gets the size right, and allows
- * us to conceptually access the pgd entry that this pmd is folded into
+ * Having the pmd type consist of a pud gets the size right, and allows
+ * us to conceptually access the pud entry that this pmd is folded into
  * without casting.
  */
-typedef struct { pgd_t pgd; } pmd_t;
+typedef struct { pud_t pud; } pmd_t;
 
-#define PMD_SHIFT	PGDIR_SHIFT
+#define PMD_SHIFT	PUD_SHIFT
 #define PTRS_PER_PMD	1
 #define PMD_SIZE  	(1UL << PMD_SHIFT)
 #define PMD_MASK  	(~(PMD_SIZE-1))
 
 /*
- * The "pgd_xxx()" functions here are trivial for a folded two-level
+ * The "pud_xxx()" functions here are trivial for a folded two-level
  * setup: the pmd is never bad, and a pmd always exists (as it's folded
- * into the pgd entry)
+ * into the pud entry)
  */
-static inline int pgd_none(pgd_t pgd)		{ return 0; }
-static inline int pgd_bad(pgd_t pgd)		{ return 0; }
-static inline int pgd_present(pgd_t pgd)	{ return 1; }
-static inline void pgd_clear(pgd_t *pgd)	{ }
-#define pmd_ERROR(pmd)				(pgd_ERROR((pmd).pgd))
+static inline int pud_none(pud_t pud)		{ return 0; }
+static inline int pud_bad(pud_t pud)		{ return 0; }
+static inline int pud_present(pud_t pud)	{ return 1; }
+static inline void pud_clear(pud_t *pud)	{ }
+#define pmd_ERROR(pmd)				(pud_ERROR((pmd).pud))
 
-#define pgd_populate(mm, pmd, pte)		do { } while (0)
-#define pgd_populate_kernel(mm, pmd, pte)	do { } while (0)
+#define pud_populate(mm, pmd, pte)		do { } while (0)
+#define pud_populate_kernel(mm, pmd, pte)	do { } while (0)
 
 /*
- * (pmds are folded into pgds so this doesn't get actually called,
+ * (pmds are folded into puds so this doesn't get actually called,
  * but the define is needed for a generic inline function.)
  */
-#define set_pgd(pgdptr, pgdval)			set_pmd((pmd_t *)(pgdptr), (pmd_t) { pgdval })
+#define set_pud(pudptr, pudval)			set_pmd((pmd_t *)(pudptr), (pmd_t) { pudval })
 
-static inline pmd_t * pmd_offset(pgd_t * pgd, unsigned long address)
+static inline pmd_t * pmd_offset(pud_t * pud, unsigned long address)
 {
-	return (pmd_t *)pgd;
+	return (pmd_t *)pud;
 }
 
-#define pmd_val(x)				(pgd_val((x).pgd))
-#define __pmd(x)				((pmd_t) { __pgd(x) } )
+#define pmd_val(x)				(pud_val((x).pud))
+#define __pmd(x)				((pmd_t) { __pud(x) } )
 
-#define pgd_page(pgd)				(pmd_page((pmd_t){ pgd }))
-#define pgd_page_kernel(pgd)			(pmd_page_kernel((pmd_t){ pgd }))
+#define pud_page(pud)				(pmd_page((pmd_t){ pud }))
+#define pud_page_kernel(pud)			(pmd_page_kernel((pmd_t){ pud }))
 
 /*
  * allocating and freeing a pmd is trivial: the 1-entry pmd is
- * inside the pgd, so has no extra memory associated with it.
+ * inside the pud, so has no extra memory associated with it.
  */
 #define pmd_alloc_one(mm, address)		NULL
 #define pmd_free(x)				do { } while (0)
diff -puN include/asm-generic/tlb.h~4level-compat include/asm-generic/tlb.h
--- linux-2.6/include/asm-generic/tlb.h~4level-compat	2004-12-18 16:57:19.000000000 +1100
+++ linux-2.6-npiggin/include/asm-generic/tlb.h	2004-12-18 16:57:19.000000000 +1100
@@ -141,6 +141,12 @@ static inline void tlb_remove_page(struc
 		__pte_free_tlb(tlb, ptep);			\
 	} while (0)
 
+#define pud_free_tlb(tlb, pudp)					\
+	do {							\
+		tlb->need_flush = 1;				\
+		__pud_free_tlb(tlb, pudp);			\
+	} while (0)
+
 #define pmd_free_tlb(tlb, pmdp)					\
 	do {							\
 		tlb->need_flush = 1;				\

_


* [PATCH 6/10] alternate 4-level page tables patches
  2004-12-18  6:58         ` [PATCH 5/10] " Nick Piggin
@ 2004-12-18  6:58           ` Nick Piggin
  2004-12-18  6:59             ` [PATCH 7/10] " Nick Piggin
  0 siblings, 1 reply; 77+ messages in thread
From: Nick Piggin @ 2004-12-18  6:58 UTC
  To: Linux Memory Management
  Cc: Andi Kleen, Hugh Dickins, Linus Torvalds, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 5 bytes --]

6/10

[-- Attachment #2: 4level-core-patch.patch --]
[-- Type: text/plain, Size: 44216 bytes --]



Extend the Linux MM to 4-level page tables.

This is the core patch for mm/*, fs/*, include/linux/*.

It breaks all architectures, which will be fixed in separate patches.

The conversion is quite straightforward.  All the functions walking the page
table hierarchy have been changed to deal with another level at the top.  The
additional level is called pml4.

mm/memory.c has changed a lot because it did most of the heavy lifting here.
Most of the changes here are extensions of the previous code.
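
For reference, the descent that every walker performs after this patch - this shape
is taken from the fs/exec.c and mm/fremap.c hunks below:

	pgd = pgd_offset(mm, addr);
	pud = pud_alloc(mm, pgd, addr);	/* just (pud_t *)pgd on folded archs */
	if (!pud)
		goto out;
	pmd = pmd_alloc(mm, pud, addr);
	if (!pmd)
		goto out;
	pte = pte_alloc_map(mm, pmd, addr);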

Signed-off-by: Andi Kleen <ak@suse.de>

Converted to use the pud_t 'page upper' level between pgd and pmd instead of
Andi's pml4 level above pgd.

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>

---


 linux-2.6-npiggin/drivers/char/drm/drm_memory.h |    3 
 linux-2.6-npiggin/fs/exec.c                     |    6 
 linux-2.6-npiggin/include/linux/init_task.h     |    2 
 linux-2.6-npiggin/include/linux/mm.h            |   20 -
 linux-2.6-npiggin/mm/fremap.c                   |   18 -
 linux-2.6-npiggin/mm/memory.c                   |  408 ++++++++++++++++++------
 linux-2.6-npiggin/mm/mempolicy.c                |   22 +
 linux-2.6-npiggin/mm/mprotect.c                 |   65 ++-
 linux-2.6-npiggin/mm/mremap.c                   |   29 +
 linux-2.6-npiggin/mm/msync.c                    |   55 ++-
 linux-2.6-npiggin/mm/rmap.c                     |   21 +
 linux-2.6-npiggin/mm/swapfile.c                 |   81 +++-
 linux-2.6-npiggin/mm/vmalloc.c                  |  113 ++++--
 13 files changed, 644 insertions(+), 199 deletions(-)

diff -puN fs/exec.c~4level-core-patch fs/exec.c
--- linux-2.6/fs/exec.c~4level-core-patch	2004-12-18 16:57:50.000000000 +1100
+++ linux-2.6-npiggin/fs/exec.c	2004-12-18 16:57:50.000000000 +1100
@@ -300,6 +300,7 @@ void install_arg_page(struct vm_area_str
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pgd_t * pgd;
+	pud_t * pud;
 	pmd_t * pmd;
 	pte_t * pte;
 
@@ -310,7 +311,10 @@ void install_arg_page(struct vm_area_str
 	pgd = pgd_offset(mm, address);
 
 	spin_lock(&mm->page_table_lock);
-	pmd = pmd_alloc(mm, pgd, address);
+	pud = pud_alloc(mm, pgd, address);
+	if (!pud)
+		goto out;
+	pmd = pmd_alloc(mm, pud, address);
 	if (!pmd)
 		goto out;
 	pte = pte_alloc_map(mm, pmd, address);
diff -puN include/linux/init_task.h~4level-core-patch include/linux/init_task.h
--- linux-2.6/include/linux/init_task.h~4level-core-patch	2004-12-18 16:57:50.000000000 +1100
+++ linux-2.6-npiggin/include/linux/init_task.h	2004-12-18 16:57:50.000000000 +1100
@@ -34,7 +34,7 @@
 #define INIT_MM(name) \
 {			 					\
 	.mm_rb		= RB_ROOT,				\
-	.pgd		= swapper_pg_dir, 			\
+	.pgd		= swapper_pg_dir,			\
 	.mm_users	= ATOMIC_INIT(2), 			\
 	.mm_count	= ATOMIC_INIT(1), 			\
 	.mmap_sem	= __RWSEM_INITIALIZER(name.mmap_sem),	\
diff -puN include/linux/mm.h~4level-core-patch include/linux/mm.h
--- linux-2.6/include/linux/mm.h~4level-core-patch	2004-12-18 16:57:50.000000000 +1100
+++ linux-2.6-npiggin/include/linux/mm.h	2004-12-18 16:57:50.000000000 +1100
@@ -581,7 +581,8 @@ static inline void unmap_shared_mapping_
 }
 
 extern int vmtruncate(struct inode * inode, loff_t offset);
-extern pmd_t *FASTCALL(__pmd_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address));
+extern pud_t *FASTCALL(__pud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address));
+extern pmd_t *FASTCALL(__pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address));
 extern pte_t *FASTCALL(pte_alloc_kernel(struct mm_struct *mm, pmd_t *pmd, unsigned long address));
 extern pte_t *FASTCALL(pte_alloc_map(struct mm_struct *mm, pmd_t *pmd, unsigned long address));
 extern int install_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long addr, struct page *page, pgprot_t prot);
@@ -626,15 +627,22 @@ extern struct shrinker *set_shrinker(int
 extern void remove_shrinker(struct shrinker *shrinker);
 
 /*
- * On a two-level page table, this ends up being trivial. Thus the
- * inlining and the symmetry break with pte_alloc_map() that does all
+ * On a two-level or three-level page table, this ends up being trivial. Thus
+ * the inlining and the symmetry break with pte_alloc_map() that does all
  * of this out-of-line.
  */
-static inline pmd_t *pmd_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address)
+static inline pud_t *pud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address)
 {
 	if (pgd_none(*pgd))
-		return __pmd_alloc(mm, pgd, address);
-	return pmd_offset(pgd, address);
+		return __pud_alloc(mm, pgd, address);
+	return pud_offset(pgd, address);
+}
+
+static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address)
+{
+	if (pud_none(*pud))
+		return __pmd_alloc(mm, pud, address);
+	return pmd_offset(pud, address);
 }
 
 extern void free_area_init(unsigned long * zones_size);
diff -puN include/linux/sched.h~4level-core-patch include/linux/sched.h
diff -puN kernel/fork.c~4level-core-patch kernel/fork.c
diff -puN mm/fremap.c~4level-core-patch mm/fremap.c
--- linux-2.6/mm/fremap.c~4level-core-patch	2004-12-18 16:57:50.000000000 +1100
+++ linux-2.6-npiggin/mm/fremap.c	2004-12-18 16:57:50.000000000 +1100
@@ -60,14 +60,19 @@ int install_page(struct mm_struct *mm, s
 	pgoff_t size;
 	int err = -ENOMEM;
 	pte_t *pte;
-	pgd_t *pgd;
 	pmd_t *pmd;
+	pud_t *pud;
+	pgd_t *pgd;
 	pte_t pte_val;
 
 	pgd = pgd_offset(mm, addr);
 	spin_lock(&mm->page_table_lock);
+	
+	pud = pud_alloc(mm, pgd, addr);
+	if (!pud)
+	    goto err_unlock;
 
-	pmd = pmd_alloc(mm, pgd, addr);
+	pmd = pmd_alloc(mm, pud, addr);
 	if (!pmd)
 		goto err_unlock;
 
@@ -112,14 +117,19 @@ int install_file_pte(struct mm_struct *m
 {
 	int err = -ENOMEM;
 	pte_t *pte;
-	pgd_t *pgd;
 	pmd_t *pmd;
+	pud_t *pud;
+	pgd_t *pgd;
 	pte_t pte_val;
 
 	pgd = pgd_offset(mm, addr);
 	spin_lock(&mm->page_table_lock);
+	
+	pud = pud_alloc(mm, pgd, addr);
+	if (!pud)
+		goto err_unlock;
 
-	pmd = pmd_alloc(mm, pgd, addr);
+	pmd = pmd_alloc(mm, pud, addr);
 	if (!pmd)
 		goto err_unlock;
 
diff -puN mm/memory.c~4level-core-patch mm/memory.c
--- linux-2.6/mm/memory.c~4level-core-patch	2004-12-18 16:57:50.000000000 +1100
+++ linux-2.6-npiggin/mm/memory.c	2004-12-18 17:07:43.000000000 +1100
@@ -34,6 +34,8 @@
  *
  * 16.07.99  -  Support of BIGMEM added by Gerhard Wichert, Siemens AG
  *		(Gerhard.Wichert@pdb.siemens.de)
+ *
+ * Aug/Sep 2004 Changed to four level page tables (Andi Kleen)
  */
 
 #include <linux/kernel_stat.h>
@@ -120,11 +122,42 @@ static inline void clear_pmd_range(struc
 	}
 }
 
-static inline void clear_pgd_range(struct mmu_gather *tlb, pgd_t *pgd, unsigned long start, unsigned long end)
+static inline void clear_pud_range(struct mmu_gather *tlb, pud_t *pud, unsigned long start, unsigned long end)
 {
 	unsigned long addr = start, next;
 	pmd_t *pmd, *__pmd;
 
+	if (pud_none(*pud))
+		return;
+	if (unlikely(pud_bad(*pud))) {
+		pud_ERROR(*pud);
+		pud_clear(pud);
+		return;
+	}
+
+	pmd = __pmd = pmd_offset(pud, start);
+	do {
+		next = (addr + PMD_SIZE) & PMD_MASK;
+		if (next > end || next <= addr)
+			next = end;
+		
+		clear_pmd_range(tlb, pmd, addr, next);
+		pmd++;
+		addr = next;
+	} while (addr && (addr < end));
+
+	if (!(start & ~PUD_MASK) && !(end & ~PUD_MASK)) {
+		pud_clear(pud);
+		pmd_free_tlb(tlb, __pmd);
+	}
+}
+
+
+static inline void clear_pgd_range(struct mmu_gather *tlb, pgd_t *pgd, unsigned long start, unsigned long end)
+{
+	unsigned long addr = start, next;
+	pud_t *pud, *__pud;
+
 	if (pgd_none(*pgd))
 		return;
 	if (unlikely(pgd_bad(*pgd))) {
@@ -133,20 +166,20 @@ static inline void clear_pgd_range(struc
 		return;
 	}
 
-	pmd = __pmd = pmd_offset(pgd, start);
+	pud = __pud = pud_offset(pgd, start);
 	do {
-		next = (addr + PMD_SIZE) & PMD_MASK;
+		next = (addr + PUD_SIZE) & PUD_MASK;
 		if (next > end || next <= addr)
 			next = end;
 		
-		clear_pmd_range(tlb, pmd, addr, next);
-		pmd++;
+		clear_pud_range(tlb, pud, addr, next);
+		pud++;
 		addr = next;
-	} while (addr && (addr <= end - 1));
+	} while (addr && (addr <= end));
 
 	if (!(start & ~PGDIR_MASK) && !(end & ~PGDIR_MASK)) {
 		pgd_clear(pgd);
-		pmd_free_tlb(tlb, __pmd);
+		pud_free_tlb(tlb, __pud);
 	}
 }
 
@@ -326,15 +359,15 @@ static int copy_pte_range(struct mm_stru
 }
 
 static int copy_pmd_range(struct mm_struct *dst_mm,  struct mm_struct *src_mm,
-		pgd_t *dst_pgd, pgd_t *src_pgd, struct vm_area_struct *vma,
+		pud_t *dst_pud, pud_t *src_pud, struct vm_area_struct *vma,
 		unsigned long addr, unsigned long end)
 {
 	pmd_t *src_pmd, *dst_pmd;
 	int err = 0;
 	unsigned long next;
 
-	src_pmd = pmd_offset(src_pgd, addr);
-	dst_pmd = pmd_alloc(dst_mm, dst_pgd, addr);
+	src_pmd = pmd_offset(src_pud, addr);
+	dst_pmd = pmd_alloc(dst_mm, dst_pud, addr);
 	if (!dst_pmd)
 		return -ENOMEM;
 
@@ -357,6 +390,38 @@ static int copy_pmd_range(struct mm_stru
 	return err;
 }
 
+static int copy_pud_range(struct mm_struct *dst_mm,  struct mm_struct *src_mm,
+		pgd_t *dst_pgd, pgd_t *src_pgd, struct vm_area_struct *vma,
+		unsigned long addr, unsigned long end)
+{
+	pud_t *src_pud, *dst_pud;
+	int err = 0;
+	unsigned long next;
+
+	src_pud = pud_offset(src_pgd, addr);
+	dst_pud = pud_alloc(dst_mm, dst_pgd, addr);
+	if (!dst_pud)
+		return -ENOMEM;
+
+	for (; addr < end; addr = next, src_pud++, dst_pud++) {
+		next = (addr + PUD_SIZE) & PUD_MASK;
+		if (next > end)
+			next = end;
+		if (pud_none(*src_pud))
+			continue;
+		if (pud_bad(*src_pud)) {
+			pud_ERROR(*src_pud);
+			pud_clear(src_pud);
+			continue;
+		}
+		err = copy_pmd_range(dst_mm, src_mm, dst_pud, src_pud,
+							vma, addr, next);
+		if (err)
+			break;
+	}
+	return err;
+}
+
 int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
 		struct vm_area_struct *vma)
 {
@@ -384,7 +449,7 @@ int copy_page_range(struct mm_struct *ds
 			pgd_clear(src_pgd);
 			continue;
 		}
-		err = copy_pmd_range(dst, src, dst_pgd, src_pgd,
+		err = copy_pud_range(dst, src, dst_pgd, src_pgd,
 							vma, addr, next);
 		if (err)
 			break;
@@ -481,23 +546,23 @@ static void zap_pte_range(struct mmu_gat
 }
 
 static void zap_pmd_range(struct mmu_gather *tlb,
-		pgd_t * dir, unsigned long address,
+		pud_t *pud, unsigned long address,
 		unsigned long size, struct zap_details *details)
 {
 	pmd_t * pmd;
 	unsigned long end;
 
-	if (pgd_none(*dir))
+	if (pud_none(*pud))
 		return;
-	if (unlikely(pgd_bad(*dir))) {
-		pgd_ERROR(*dir);
-		pgd_clear(dir);
+	if (unlikely(pud_bad(*pud))) {
+		pud_ERROR(*pud);
+		pud_clear(pud);
 		return;
 	}
-	pmd = pmd_offset(dir, address);
+	pmd = pmd_offset(pud, address);
 	end = address + size;
-	if (end > ((address + PGDIR_SIZE) & PGDIR_MASK))
-		end = ((address + PGDIR_SIZE) & PGDIR_MASK);
+	if (end > ((address + PUD_SIZE) & PUD_MASK))
+		end = ((address + PUD_SIZE) & PUD_MASK);
 	do {
 		zap_pte_range(tlb, pmd, address, end - address, details);
 		address = (address + PMD_SIZE) & PMD_MASK; 
@@ -505,20 +570,46 @@ static void zap_pmd_range(struct mmu_gat
 	} while (address && (address < end));
 }
 
+static void zap_pud_range(struct mmu_gather *tlb,
+		pgd_t * pgd, unsigned long address,
+		unsigned long end, struct zap_details *details)
+{
+	pud_t * pud;
+
+	if (pgd_none(*pgd))
+		return;
+	if (unlikely(pgd_bad(*pgd))) {
+		pgd_ERROR(*pgd);
+		pgd_clear(pgd);
+		return;
+	}
+	pud = pud_offset(pgd, address);
+	do {
+		zap_pmd_range(tlb, pud, address, end - address, details);
+		address = (address + PUD_SIZE) & PUD_MASK; 
+		pud++;
+	} while (address && (address < end));
+}
+
 static void unmap_page_range(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, unsigned long address,
 		unsigned long end, struct zap_details *details)
 {
-	pgd_t * dir;
+	unsigned long next;
+	pgd_t *pgd;
+	int i;
 
 	BUG_ON(address >= end);
-	dir = pgd_offset(vma->vm_mm, address);
+	pgd = pgd_offset(vma->vm_mm, address);
 	tlb_start_vma(tlb, vma);
-	do {
-		zap_pmd_range(tlb, dir, address, end - address, details);
-		address = (address + PGDIR_SIZE) & PGDIR_MASK;
-		dir++;
-	} while (address && (address < end));
+	for (i = pgd_index(address); i <= pgd_index(end-1); i++) {
+		next = (address + PGDIR_SIZE) & PGDIR_MASK;
+		if (next <= address || next > end)
+			next = end;
+		zap_pud_range(tlb, pgd, address, next, details);
+		address = next;
+		pgd++;
+	}
 	tlb_end_vma(tlb, vma);
 }
 
@@ -660,6 +751,7 @@ struct page *
 follow_page(struct mm_struct *mm, unsigned long address, int write) 
 {
 	pgd_t *pgd;
+	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *ptep, pte;
 	unsigned long pfn;
@@ -673,13 +765,15 @@ follow_page(struct mm_struct *mm, unsign
 	if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
 		goto out;
 
-	pmd = pmd_offset(pgd, address);
-	if (pmd_none(*pmd))
+	pud = pud_offset(pgd, address);
+	if (pud_none(*pud) || unlikely(pud_bad(*pud)))
+		goto out;
+	
+	pmd = pmd_offset(pud, address);
+	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
 		goto out;
 	if (pmd_huge(*pmd))
 		return follow_huge_pmd(mm, address, pmd, write);
-	if (unlikely(pmd_bad(*pmd)))
-		goto out;
 
 	ptep = pte_offset_map(pmd, address);
 	if (!ptep)
@@ -723,6 +817,7 @@ untouched_anonymous_page(struct mm_struc
 			 unsigned long address)
 {
 	pgd_t *pgd;
+	pud_t *pud;
 	pmd_t *pmd;
 
 	/* Check if the vma is for an anonymous mapping. */
@@ -734,8 +829,12 @@ untouched_anonymous_page(struct mm_struc
 	if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
 		return 1;
 
+	pud = pud_offset(pgd, address);
+	if (pud_none(*pud) || unlikely(pud_bad(*pud)))
+		return 1;
+
 	/* Check if page middle directory entry exists. */
-	pmd = pmd_offset(pgd, address);
+	pmd = pmd_offset(pud, address);
 	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
 		return 1;
 
@@ -767,6 +866,7 @@ int get_user_pages(struct task_struct *t
 			unsigned long pg = start & PAGE_MASK;
 			struct vm_area_struct *gate_vma = get_gate_vma(tsk);
 			pgd_t *pgd;
+			pud_t *pud;
 			pmd_t *pmd;
 			pte_t *pte;
 			if (write) /* user gate pages are read-only */
@@ -776,7 +876,9 @@ int get_user_pages(struct task_struct *t
 			else
 				pgd = pgd_offset_gate(mm, pg);
 			BUG_ON(pgd_none(*pgd));
-			pmd = pmd_offset(pgd, pg);
+			pud = pud_offset(pgd, pg);
+			BUG_ON(pud_none(*pud));
+			pmd = pmd_offset(pud, pg);
 			BUG_ON(pmd_none(*pmd));
 			pte = pte_offset_map(pmd, pg);
 			BUG_ON(pte_none(*pte));
@@ -889,16 +991,16 @@ static void zeromap_pte_range(pte_t * pt
 	} while (address && (address < end));
 }
 
-static inline int zeromap_pmd_range(struct mm_struct *mm, pmd_t * pmd, unsigned long address,
-                                    unsigned long size, pgprot_t prot)
+static inline int zeromap_pmd_range(struct mm_struct *mm, pmd_t * pmd,
+		unsigned long address, unsigned long size, pgprot_t prot)
 {
 	unsigned long base, end;
 
-	base = address & PGDIR_MASK;
-	address &= ~PGDIR_MASK;
+	base = address & PUD_MASK;
+	address &= ~PUD_MASK;
 	end = address + size;
-	if (end > PGDIR_SIZE)
-		end = PGDIR_SIZE;
+	if (end > PUD_SIZE)
+		end = PUD_SIZE;
 	do {
 		pte_t * pte = pte_alloc_map(mm, pmd, base + address);
 		if (!pte)
@@ -911,31 +1013,64 @@ static inline int zeromap_pmd_range(stru
 	return 0;
 }
 
-int zeromap_page_range(struct vm_area_struct *vma, unsigned long address, unsigned long size, pgprot_t prot)
+static inline int zeromap_pud_range(struct mm_struct *mm, pud_t * pud,
+				    unsigned long address,
+                                    unsigned long size, pgprot_t prot)
+{
+	unsigned long base, end;
+	int error = 0;
+
+	base = address & PGDIR_MASK;
+	address &= ~PGDIR_MASK;
+	end = address + size;
+	if (end > PGDIR_SIZE)
+		end = PGDIR_SIZE;
+	do {
+		pmd_t * pmd = pmd_alloc(mm, pud, base + address);
+		error = -ENOMEM;
+		if (!pmd)
+			break;
+		error = zeromap_pmd_range(mm, pmd, address, end - address, prot);
+		if (error)
+			break;
+		address = (address + PUD_SIZE) & PUD_MASK;
+		pud++;
+	} while (address && (address < end));
+	return 0;
+}
+
+int zeromap_page_range(struct vm_area_struct *vma, unsigned long address,
+					unsigned long size, pgprot_t prot)
 {
+	int i;
 	int error = 0;
-	pgd_t * dir;
+	pgd_t * pgd;
 	unsigned long beg = address;
 	unsigned long end = address + size;
+	unsigned long next;
 	struct mm_struct *mm = vma->vm_mm;
 
-	dir = pgd_offset(mm, address);
+	pgd = pgd_offset(mm, address);
 	flush_cache_range(vma, beg, end);
-	if (address >= end)
-		BUG();
+	BUG_ON(address >= end);
+	BUG_ON(end > vma->vm_end);
 
 	spin_lock(&mm->page_table_lock);
-	do {
-		pmd_t *pmd = pmd_alloc(mm, dir, address);
+	for (i = pgd_index(address); i <= pgd_index(end-1); i++) {
+		pud_t *pud = pud_alloc(mm, pgd, address);
 		error = -ENOMEM;
-		if (!pmd)
+		if (!pud)
 			break;
-		error = zeromap_pmd_range(mm, pmd, address, end - address, prot);
+		next = (address + PGDIR_SIZE) & PGDIR_MASK;
+		if (next <= beg || next > end)
+			next = end;
+		error = zeromap_pud_range(mm, pud, address,
+						next - address, prot);
 		if (error)
 			break;
-		address = (address + PGDIR_SIZE) & PGDIR_MASK;
-		dir++;
-	} while (address && (address < end));
+		address = next;
+		pgd++;
+	}
 	/*
 	 * Why flush? zeromap_pte_range has a BUG_ON for !pte_none()
 	 */
@@ -949,8 +1084,9 @@ int zeromap_page_range(struct vm_area_st
  * mappings are removed. any references to nonexistent pages results
  * in null mappings (currently treated as "copy-on-access")
  */
-static inline void remap_pte_range(pte_t * pte, unsigned long address, unsigned long size,
-	unsigned long pfn, pgprot_t prot)
+static inline void
+remap_pte_range(pte_t * pte, unsigned long address, unsigned long size,
+		unsigned long pfn, pgprot_t prot)
 {
 	unsigned long end;
 
@@ -968,22 +1104,24 @@ static inline void remap_pte_range(pte_t
 	} while (address && (address < end));
 }
 
-static inline int remap_pmd_range(struct mm_struct *mm, pmd_t * pmd, unsigned long address, unsigned long size,
-	unsigned long pfn, pgprot_t prot)
+static inline int
+remap_pmd_range(struct mm_struct *mm, pmd_t * pmd, unsigned long address,
+		unsigned long size, unsigned long pfn, pgprot_t prot)
 {
 	unsigned long base, end;
 
-	base = address & PGDIR_MASK;
-	address &= ~PGDIR_MASK;
+	base = address & PUD_MASK;
+	address &= ~PUD_MASK;
 	end = address + size;
-	if (end > PGDIR_SIZE)
-		end = PGDIR_SIZE;
-	pfn -= address >> PAGE_SHIFT;
+	if (end > PUD_SIZE)
+		end = PUD_SIZE;
+	pfn -= (address >> PAGE_SHIFT);
 	do {
 		pte_t * pte = pte_alloc_map(mm, pmd, base + address);
 		if (!pte)
 			return -ENOMEM;
-		remap_pte_range(pte, base + address, end - address, pfn + (address >> PAGE_SHIFT), prot);
+		remap_pte_range(pte, base + address, end - address,
+				(address >> PAGE_SHIFT) + pfn, prot);
 		pte_unmap(pte);
 		address = (address + PMD_SIZE) & PMD_MASK;
 		pmd++;
@@ -991,20 +1129,50 @@ static inline int remap_pmd_range(struct
 	return 0;
 }
 
+static inline int remap_pud_range(struct mm_struct *mm, pud_t * pud,
+				  unsigned long address, unsigned long size,
+				  unsigned long pfn, pgprot_t prot)
+{
+	unsigned long base, end;
+	int error;
+
+	base = address & PGDIR_MASK;
+	address &= ~PGDIR_MASK;
+	end = address + size;
+	if (end > PGDIR_SIZE)
+		end = PGDIR_SIZE;
+	pfn -= address >> PAGE_SHIFT;
+	do {
+		pmd_t *pmd = pmd_alloc(mm, pud, base+address);
+		error = -ENOMEM;
+		if (!pmd)
+			break;
+		error = remap_pmd_range(mm, pmd, base + address, end - address,
+				(address >> PAGE_SHIFT) + pfn, prot);
+		if (error)
+			break;
+		address = (address + PUD_SIZE) & PUD_MASK;
+		pud++;
+	} while (address && (address < end));
+	return error;
+}
+
 /*  Note: this is only safe if the mm semaphore is held when called. */
-int remap_pfn_range(struct vm_area_struct *vma, unsigned long from, unsigned long pfn, unsigned long size, pgprot_t prot)
+int remap_pfn_range(struct vm_area_struct *vma, unsigned long from,
+		    unsigned long pfn, unsigned long size, pgprot_t prot)
 {
 	int error = 0;
-	pgd_t * dir;
+	pgd_t *pgd;
 	unsigned long beg = from;
 	unsigned long end = from + size;
+	unsigned long next;
 	struct mm_struct *mm = vma->vm_mm;
+	int i;
 
 	pfn -= from >> PAGE_SHIFT;
-	dir = pgd_offset(mm, from);
+	pgd = pgd_offset(mm, from);
 	flush_cache_range(vma, beg, end);
-	if (from >= end)
-		BUG();
+	BUG_ON(from >= end);
 
 	/*
 	 * Physically remapped pages are special. Tell the
@@ -1015,25 +1183,32 @@ int remap_pfn_range(struct vm_area_struc
 	 *	this region.
 	 */
 	vma->vm_flags |= VM_IO | VM_RESERVED;
+
 	spin_lock(&mm->page_table_lock);
-	do {
-		pmd_t *pmd = pmd_alloc(mm, dir, from);
+	for (i = pgd_index(beg); i <= pgd_index(end-1); i++) {
+		pud_t *pud = pud_alloc(mm, pgd, from);
 		error = -ENOMEM;
-		if (!pmd)
+		if (!pud)
 			break;
-		error = remap_pmd_range(mm, pmd, from, end - from, pfn + (from >> PAGE_SHIFT), prot);
+		next = (from + PGDIR_SIZE) & PGDIR_MASK;
+		if (next > end || next <= from)
+			next = end;
+		error = remap_pud_range(mm, pud, from, next - from,
+					pfn + (from >> PAGE_SHIFT), prot);
 		if (error)
 			break;
-		from = (from + PGDIR_SIZE) & PGDIR_MASK;
-		dir++;
-	} while (from && (from < end));
+		from = next;
+		pgd++;
+	}
 	/*
 	 * Why flush? remap_pte_range has a BUG_ON for !pte_none()
 	 */
 	flush_tlb_range(vma, beg, end);
 	spin_unlock(&mm->page_table_lock);
+
 	return error;
 }
+
 EXPORT_SYMBOL(remap_pfn_range);
 
 /*
@@ -1725,13 +1900,14 @@ static inline int handle_pte_fault(struc
  * By the time we get here, we already hold the mm semaphore
  */
 int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct * vma,
-	unsigned long address, int write_access)
+		unsigned long address, int write_access)
 {
 	pgd_t *pgd;
+	pud_t *pud;
 	pmd_t *pmd;
+	pte_t *pte;
 
 	__set_current_state(TASK_RUNNING);
-	pgd = pgd_offset(mm, address);
 
 	inc_page_state(pgfault);
 
@@ -1742,18 +1918,63 @@ int handle_mm_fault(struct mm_struct *mm
 	 * We need the page table lock to synchronize with kswapd
 	 * and the SMP-safe atomic PTE updates.
 	 */
+	pgd = pgd_offset(mm, address);
 	spin_lock(&mm->page_table_lock);
-	pmd = pmd_alloc(mm, pgd, address);
 
-	if (pmd) {
-		pte_t * pte = pte_alloc_map(mm, pmd, address);
-		if (pte)
-			return handle_pte_fault(mm, vma, address, write_access, pte, pmd);
-	}
+	pud = pud_alloc(mm, pgd, address);
+	if (!pud)
+		goto oom;
+
+	pmd = pmd_alloc(mm, pud, address);
+	if (!pmd)
+		goto oom;
+
+	pte = pte_alloc_map(mm, pmd, address);
+	if (!pte)
+		goto oom;
+
+	return handle_pte_fault(mm, vma, address, write_access, pte, pmd);
+
+ oom:
 	spin_unlock(&mm->page_table_lock);
 	return VM_FAULT_OOM;
 }
 
+#if (PTRS_PER_PGD > 1)
+/*
+ * Allocate page upper directory.
+ *
+ * We've already handled the fast-path in-line, and we own the
+ * page table lock.
+ *
+ * On a two-level or three-level page table, this ends up actually being
+ * entirely optimized away.
+ */
+pud_t fastcall *__pud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address)
+{
+	pud_t *new;
+
+	spin_unlock(&mm->page_table_lock);
+	new = pud_alloc_one(mm, address);
+	spin_lock(&mm->page_table_lock);
+	if (!new)
+		return NULL;
+
+	/*
+	 * Because we dropped the lock, we should re-check the
+	 * entry, as somebody else could have populated it..
+	 */
+	if (pgd_present(*pgd)) {
+		pud_free(new);
+		goto out;
+	}
+	pgd_populate(mm, pgd, new);
+out:
+	return pud_offset(pgd, address);
+}
+#endif
+
+#if (PTRS_PER_PUD > 1)
 /*
  * Allocate page middle directory.
  *
@@ -1763,7 +1984,7 @@ int handle_mm_fault(struct mm_struct *mm
  * On a two-level page table, this ends up actually being entirely
  * optimized away.
  */
-pmd_t fastcall *__pmd_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address)
+pmd_t fastcall *__pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address)
 {
 	pmd_t *new;
 
@@ -1777,14 +1998,15 @@ pmd_t fastcall *__pmd_alloc(struct mm_st
 	 * Because we dropped the lock, we should re-check the
 	 * entry, as somebody else could have populated it..
 	 */
-	if (pgd_present(*pgd)) {
+	if (pud_present(*pud)) {
 		pmd_free(new);
 		goto out;
 	}
-	pgd_populate(mm, pgd, new);
+	pud_populate(mm, pud, new);
 out:
-	return pmd_offset(pgd, address);
+	return pmd_offset(pud, address);
 }
+#endif
 
 int make_pages_present(unsigned long addr, unsigned long end)
 {
@@ -1815,17 +2037,21 @@ struct page * vmalloc_to_page(void * vma
 	unsigned long addr = (unsigned long) vmalloc_addr;
 	struct page *page = NULL;
 	pgd_t *pgd = pgd_offset_k(addr);
+	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *ptep, pte;
   
 	if (!pgd_none(*pgd)) {
-		pmd = pmd_offset(pgd, addr);
-		if (!pmd_none(*pmd)) {
-			ptep = pte_offset_map(pmd, addr);
-			pte = *ptep;
-			if (pte_present(pte))
-				page = pte_page(pte);
-			pte_unmap(ptep);
+		pud = pud_offset_k(pgd, addr);
+		if (!pud_none(*pud)) {
+			pmd = pmd_offset(pud, addr);
+			if (!pmd_none(*pmd)) {
+				ptep = pte_offset_map(pmd, addr);
+				pte = *ptep;
+				if (pte_present(pte))
+					page = pte_page(pte);
+				pte_unmap(ptep);
+			}
 		}
 	}
 	return page;
diff -puN mm/mempolicy.c~4level-core-patch mm/mempolicy.c
--- linux-2.6/mm/mempolicy.c~4level-core-patch	2004-12-18 16:57:50.000000000 +1100
+++ linux-2.6-npiggin/mm/mempolicy.c	2004-12-18 16:57:50.000000000 +1100
@@ -234,18 +234,29 @@ static struct mempolicy *mpol_new(int mo
 
 /* Ensure all existing pages follow the policy. */
 static int
-verify_pages(unsigned long addr, unsigned long end, unsigned long *nodes)
+verify_pages(struct mm_struct *mm,
+	     unsigned long addr, unsigned long end, unsigned long *nodes)
 {
 	while (addr < end) {
 		struct page *p;
 		pte_t *pte;
 		pmd_t *pmd;
-		pgd_t *pgd = pgd_offset_k(addr);
+		pud_t *pud;
+		pgd_t *pgd;
+		pgd = pgd_offset(mm, addr);
 		if (pgd_none(*pgd)) {
-			addr = (addr + PGDIR_SIZE) & PGDIR_MASK;
+			unsigned long next = (addr + PGDIR_SIZE) & PGDIR_MASK;
+			if (next <= addr)
+				break;
+			addr = next;
+			continue;
+		}
+		pud = pud_offset(pgd, addr);
+		if (pud_none(*pud)) {
+			addr = (addr + PUD_SIZE) & PUD_MASK;
 			continue;
 		}
-		pmd = pmd_offset(pgd, addr);
+		pmd = pmd_offset(pud, addr);
 		if (pmd_none(*pmd)) {
 			addr = (addr + PMD_SIZE) & PMD_MASK;
 			continue;
@@ -283,7 +294,8 @@ check_range(struct mm_struct *mm, unsign
 		if (prev && prev->vm_end < vma->vm_start)
 			return ERR_PTR(-EFAULT);
 		if ((flags & MPOL_MF_STRICT) && !is_vm_hugetlb_page(vma)) {
-			err = verify_pages(vma->vm_start, vma->vm_end, nodes);
+			err = verify_pages(vma->vm_mm,
+					   vma->vm_start, vma->vm_end, nodes);
 			if (err) {
 				first = ERR_PTR(err);
 				break;
diff -puN mm/mmap.c~4level-core-patch mm/mmap.c
diff -puN mm/mprotect.c~4level-core-patch mm/mprotect.c
--- linux-2.6/mm/mprotect.c~4level-core-patch	2004-12-18 16:57:50.000000000 +1100
+++ linux-2.6-npiggin/mm/mprotect.c	2004-12-18 16:57:50.000000000 +1100
@@ -62,12 +62,38 @@ change_pte_range(pmd_t *pmd, unsigned lo
 }
 
 static inline void
-change_pmd_range(pgd_t *pgd, unsigned long address,
+change_pmd_range(pud_t *pud, unsigned long address,
 		unsigned long size, pgprot_t newprot)
 {
 	pmd_t * pmd;
 	unsigned long end;
 
+	if (pud_none(*pud))
+		return;
+	if (pud_bad(*pud)) {
+		pud_ERROR(*pud);
+		pud_clear(pud);
+		return;
+	}
+	pmd = pmd_offset(pud, address);
+	address &= ~PUD_MASK;
+	end = address + size;
+	if (end > PUD_SIZE)
+		end = PUD_SIZE;
+	do {
+		change_pte_range(pmd, address, end - address, newprot);
+		address = (address + PMD_SIZE) & PMD_MASK;
+		pmd++;
+	} while (address && (address < end));
+}
+
+static inline void
+change_pud_range(pgd_t *pgd, unsigned long address,
+		unsigned long size, pgprot_t newprot)
+{
+	pud_t * pud;
+	unsigned long end;
+
 	if (pgd_none(*pgd))
 		return;
 	if (pgd_bad(*pgd)) {
@@ -75,15 +101,15 @@ change_pmd_range(pgd_t *pgd, unsigned lo
 		pgd_clear(pgd);
 		return;
 	}
-	pmd = pmd_offset(pgd, address);
+	pud = pud_offset(pgd, address);
 	address &= ~PGDIR_MASK;
 	end = address + size;
 	if (end > PGDIR_SIZE)
 		end = PGDIR_SIZE;
 	do {
-		change_pte_range(pmd, address, end - address, newprot);
-		address = (address + PMD_SIZE) & PMD_MASK;
-		pmd++;
+		change_pmd_range(pud, address, end - address, newprot);
+		address = (address + PUD_SIZE) & PUD_MASK;
+		pud++;
 	} while (address && (address < end));
 }
 
@@ -91,22 +117,25 @@ static void
 change_protection(struct vm_area_struct *vma, unsigned long start,
 		unsigned long end, pgprot_t newprot)
 {
-	pgd_t *dir;
-	unsigned long beg = start;
+	struct mm_struct *mm = current->mm;
+	pgd_t *pgd;
+	unsigned long beg = start, next;
+	int i;
 
-	dir = pgd_offset(current->mm, start);
+	pgd = pgd_offset(mm, start);
 	flush_cache_range(vma, beg, end);
-	if (start >= end)
-		BUG();
-	spin_lock(&current->mm->page_table_lock);
-	do {
-		change_pmd_range(dir, start, end - start, newprot);
-		start = (start + PGDIR_SIZE) & PGDIR_MASK;
-		dir++;
-	} while (start && (start < end));
+	BUG_ON(start >= end);
+	spin_lock(&mm->page_table_lock);
+	for (i = pgd_index(start); i <= pgd_index(end-1); i++) {
+		next = (start + PGDIR_SIZE) & PGDIR_MASK;
+		if (next <= start || next > end)
+			next = end;
+		change_pud_range(pgd, start, next - start, newprot);
+		start = next;
+		pgd++;
+	}
 	flush_tlb_range(vma, beg, end);
-	spin_unlock(&current->mm->page_table_lock);
-	return;
+	spin_unlock(&mm->page_table_lock);
 }
 
 static int
diff -puN mm/mremap.c~4level-core-patch mm/mremap.c
--- linux-2.6/mm/mremap.c~4level-core-patch	2004-12-18 16:57:50.000000000 +1100
+++ linux-2.6-npiggin/mm/mremap.c	2004-12-18 16:57:50.000000000 +1100
@@ -25,19 +25,24 @@
 static pte_t *get_one_pte_map_nested(struct mm_struct *mm, unsigned long addr)
 {
 	pgd_t *pgd;
+	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte = NULL;
 
 	pgd = pgd_offset(mm, addr);
 	if (pgd_none(*pgd))
 		goto end;
-	if (pgd_bad(*pgd)) {
-		pgd_ERROR(*pgd);
-		pgd_clear(pgd);
+
+	pud = pud_offset(pgd, addr);
+	if (pud_none(*pud))
+		goto end;
+	if (pud_bad(*pud)) {
+		pud_ERROR(*pud);
+		pud_clear(pud);
 		goto end;
 	}
 
-	pmd = pmd_offset(pgd, addr);
+	pmd = pmd_offset(pud, addr);
 	if (pmd_none(*pmd))
 		goto end;
 	if (pmd_bad(*pmd)) {
@@ -58,12 +63,17 @@ end:
 static pte_t *get_one_pte_map(struct mm_struct *mm, unsigned long addr)
 {
 	pgd_t *pgd;
+	pud_t *pud;
 	pmd_t *pmd;
 
 	pgd = pgd_offset(mm, addr);
 	if (pgd_none(*pgd))
 		return NULL;
-	pmd = pmd_offset(pgd, addr);
+
+	pud = pud_offset(pgd, addr);
+	if (pud_none(*pud))
+		return NULL;
+	pmd = pmd_offset(pud, addr);
 	if (!pmd_present(*pmd))
 		return NULL;
 	return pte_offset_map(pmd, addr);
@@ -71,10 +81,17 @@ static pte_t *get_one_pte_map(struct mm_
 
 static inline pte_t *alloc_one_pte_map(struct mm_struct *mm, unsigned long addr)
 {
+	pgd_t *pgd;
+	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte = NULL;
 
-	pmd = pmd_alloc(mm, pgd_offset(mm, addr), addr);
+	pgd = pgd_offset(mm, addr);
+
+	pud = pud_alloc(mm, pgd, addr);
+	if (!pud)
+		return NULL;
+	pmd = pmd_alloc(mm, pud, addr);
 	if (pmd)
 		pte = pte_alloc_map(mm, pmd, addr);
 	return pte;
diff -puN mm/msync.c~4level-core-patch mm/msync.c
--- linux-2.6/mm/msync.c~4level-core-patch	2004-12-18 16:57:50.000000000 +1100
+++ linux-2.6-npiggin/mm/msync.c	2004-12-18 17:07:43.000000000 +1100
@@ -67,13 +67,39 @@ static int filemap_sync_pte_range(pmd_t 
 	return error;
 }
 
-static inline int filemap_sync_pmd_range(pgd_t * pgd,
+static inline int filemap_sync_pmd_range(pud_t * pud,
 	unsigned long address, unsigned long end, 
 	struct vm_area_struct *vma, unsigned int flags)
 {
 	pmd_t * pmd;
 	int error;
 
+	if (pud_none(*pud))
+		return 0;
+	if (pud_bad(*pud)) {
+		pud_ERROR(*pud);
+		pud_clear(pud);
+		return 0;
+	}
+	pmd = pmd_offset(pud, address);
+	if ((address & PUD_MASK) != (end & PUD_MASK))
+		end = (address & PUD_MASK) + PUD_SIZE;
+	error = 0;
+	do {
+		error |= filemap_sync_pte_range(pmd, address, end, vma, flags);
+		address = (address + PMD_SIZE) & PMD_MASK;
+		pmd++;
+	} while (address && (address < end));
+	return error;
+}
+
+static inline int filemap_sync_pud_range(pgd_t *pgd,
+	unsigned long address, unsigned long end,
+	struct vm_area_struct *vma, unsigned int flags)
+{
+	pud_t *pud;
+	int error;
+
 	if (pgd_none(*pgd))
 		return 0;
 	if (pgd_bad(*pgd)) {
@@ -81,14 +107,14 @@ static inline int filemap_sync_pmd_range
 		pgd_clear(pgd);
 		return 0;
 	}
-	pmd = pmd_offset(pgd, address);
+	pud = pud_offset(pgd, address);
 	if ((address & PGDIR_MASK) != (end & PGDIR_MASK))
 		end = (address & PGDIR_MASK) + PGDIR_SIZE;
 	error = 0;
 	do {
-		error |= filemap_sync_pte_range(pmd, address, end, vma, flags);
-		address = (address + PMD_SIZE) & PMD_MASK;
-		pmd++;
+		error |= filemap_sync_pmd_range(pud, address, end, vma, flags);
+		address = (address + PUD_SIZE) & PUD_MASK;
+		pud++;
 	} while (address && (address < end));
 	return error;
 }
@@ -96,8 +122,10 @@ static inline int filemap_sync_pmd_range
 static int filemap_sync(struct vm_area_struct * vma, unsigned long address,
 	size_t size, unsigned int flags)
 {
-	pgd_t * dir;
+	pgd_t *pgd;
 	unsigned long end = address + size;
+	unsigned long next;
+	int i;
 	int error = 0;
 
	/* Acquire the lock early; it may be possible to avoid dropping
@@ -105,7 +133,7 @@ static int filemap_sync(struct vm_area_s
 	 */
 	spin_lock(&vma->vm_mm->page_table_lock);
 
-	dir = pgd_offset(vma->vm_mm, address);
+	pgd = pgd_offset(vma->vm_mm, address);
 	flush_cache_range(vma, address, end);
 
 	/* For hugepages we can't go walking the page table normally,
@@ -116,11 +144,14 @@ static int filemap_sync(struct vm_area_s
 
 	if (address >= end)
 		BUG();
-	do {
-		error |= filemap_sync_pmd_range(dir, address, end, vma, flags);
-		address = (address + PGDIR_SIZE) & PGDIR_MASK;
-		dir++;
-	} while (address && (address < end));
+	for (i = pgd_index(address); i <= pgd_index(end-1); i++) {
+		next = (address + PGDIR_SIZE) & PGDIR_MASK;
+		if (next <= address || next > end)
+			next = end;
+		error |= filemap_sync_pud_range(pgd, address, next, vma, flags);
+		address = next;
+		pgd++;
+	}
 	/*
 	 * Why flush ? filemap_sync_pte already flushed the tlbs with the
 	 * dirty bits.
diff -puN mm/rmap.c~4level-core-patch mm/rmap.c
--- linux-2.6/mm/rmap.c~4level-core-patch	2004-12-18 16:57:50.000000000 +1100
+++ linux-2.6-npiggin/mm/rmap.c	2004-12-18 16:57:50.000000000 +1100
@@ -259,6 +259,7 @@ static int page_referenced_one(struct pa
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long address;
 	pgd_t *pgd;
+	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
 	int referenced = 0;
@@ -275,7 +276,11 @@ static int page_referenced_one(struct pa
 	if (!pgd_present(*pgd))
 		goto out_unlock;
 
-	pmd = pmd_offset(pgd, address);
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		goto out_unlock;
+
+	pmd = pmd_offset(pud, address);
 	if (!pmd_present(*pmd))
 		goto out_unlock;
 
@@ -502,6 +507,7 @@ static int try_to_unmap_one(struct page 
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long address;
 	pgd_t *pgd;
+	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
 	pte_t pteval;
@@ -523,7 +529,11 @@ static int try_to_unmap_one(struct page 
 	if (!pgd_present(*pgd))
 		goto out_unlock;
 
-	pmd = pmd_offset(pgd, address);
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		goto out_unlock;
+
+	pmd = pmd_offset(pud, address);
 	if (!pmd_present(*pmd))
 		goto out_unlock;
 
@@ -631,6 +641,7 @@ static void try_to_unmap_cluster(unsigne
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pgd_t *pgd;
+	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
 	pte_t pteval;
@@ -656,7 +667,11 @@ static void try_to_unmap_cluster(unsigne
 	if (!pgd_present(*pgd))
 		goto out_unlock;
 
-	pmd = pmd_offset(pgd, address);
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		goto out_unlock;
+
+	pmd = pmd_offset(pud, address);
 	if (!pmd_present(*pmd))
 		goto out_unlock;
 
diff -puN mm/swapfile.c~4level-core-patch mm/swapfile.c
--- linux-2.6/mm/swapfile.c~4level-core-patch	2004-12-18 16:57:50.000000000 +1100
+++ linux-2.6-npiggin/mm/swapfile.c	2004-12-18 17:07:43.000000000 +1100
@@ -486,27 +486,27 @@ static unsigned long unuse_pmd(struct vm
 }
 
 /* vma->vm_mm->page_table_lock is held */
-static unsigned long unuse_pgd(struct vm_area_struct * vma, pgd_t *dir,
-	unsigned long address, unsigned long size,
+static unsigned long unuse_pud(struct vm_area_struct * vma, pud_t *pud,
+	unsigned long address, unsigned long size, unsigned long offset,
 	swp_entry_t entry, struct page *page)
 {
 	pmd_t * pmd;
-	unsigned long offset, end;
+	unsigned long end;
 	unsigned long foundaddr;
 
-	if (pgd_none(*dir))
+	if (pud_none(*pud))
 		return 0;
-	if (pgd_bad(*dir)) {
-		pgd_ERROR(*dir);
-		pgd_clear(dir);
+	if (pud_bad(*pud)) {
+		pud_ERROR(*pud);
+		pud_clear(pud);
 		return 0;
 	}
-	pmd = pmd_offset(dir, address);
-	offset = address & PGDIR_MASK;
-	address &= ~PGDIR_MASK;
+	pmd = pmd_offset(pud, address);
+	offset += address & PUD_MASK;
+	address &= ~PUD_MASK;
 	end = address + size;
-	if (end > PGDIR_SIZE)
-		end = PGDIR_SIZE;
+	if (end > PUD_SIZE)
+		end = PUD_SIZE;
 	if (address >= end)
 		BUG();
 	do {
@@ -521,12 +521,48 @@ static unsigned long unuse_pgd(struct vm
 }
 
 /* vma->vm_mm->page_table_lock is held */
+static unsigned long unuse_pgd(struct vm_area_struct * vma, pgd_t *pgd,
+	unsigned long address, unsigned long size,
+	swp_entry_t entry, struct page *page)
+{
+	pud_t * pud;
+	unsigned long offset;
+	unsigned long foundaddr;
+	unsigned long end;
+
+	if (pgd_none(*pgd))
+		return 0;
+	if (pgd_bad(*pgd)) {
+		pgd_ERROR(*pgd);
+		pgd_clear(pgd);
+		return 0;
+	}
+	pud = pud_offset(pgd, address);
+	offset = address & PGDIR_MASK;
+	address &= ~PGDIR_MASK;
+	end = address + size;
+	if (end > PGDIR_SIZE)
+		end = PGDIR_SIZE;
+	BUG_ON(address >= end);
+	do {
+		foundaddr = unuse_pud(vma, pud, address, end - address,
+					offset, entry, page);
+		if (foundaddr)
+			return foundaddr;
+		address = (address + PUD_SIZE) & PUD_MASK;
+		pud++;
+	} while (address && (address < end));
+	return 0;
+}
+
+/* vma->vm_mm->page_table_lock is held */
 static unsigned long unuse_vma(struct vm_area_struct * vma,
 	swp_entry_t entry, struct page *page)
 {
-	pgd_t *pgdir;
-	unsigned long start, end;
+	pgd_t *pgd;
+	unsigned long start, end, next;
 	unsigned long foundaddr;
+	int i;
 
 	if (page->mapping) {
 		start = page_address_in_vma(page, vma);
@@ -538,15 +574,18 @@ static unsigned long unuse_vma(struct vm
 		start = vma->vm_start;
 		end = vma->vm_end;
 	}
-	pgdir = pgd_offset(vma->vm_mm, start);
-	do {
-		foundaddr = unuse_pgd(vma, pgdir, start, end - start,
-						entry, page);
+	pgd = pgd_offset(vma->vm_mm, start);
+	for (i = pgd_index(start); i <= pgd_index(end-1); i++) {
+		next = (start + PGDIR_SIZE) & PGDIR_MASK;
+		if (next > end || next <= start)
+			next = end;
+		foundaddr = unuse_pgd(vma, pgd, start, next - start, entry, page);
 		if (foundaddr)
 			return foundaddr;
-		start = (start + PGDIR_SIZE) & PGDIR_MASK;
-		pgdir++;
-	} while (start && (start < end));
+		start = next;
+		pgd++;
+	}
 	return 0;
 }
 
diff -puN mm/vmalloc.c~4level-core-patch mm/vmalloc.c
--- linux-2.6/mm/vmalloc.c~4level-core-patch	2004-12-18 16:57:50.000000000 +1100
+++ linux-2.6-npiggin/mm/vmalloc.c	2004-12-18 17:07:43.000000000 +1100
@@ -56,25 +56,25 @@ static void unmap_area_pte(pmd_t *pmd, u
 	} while (address < end);
 }
 
-static void unmap_area_pmd(pgd_t *dir, unsigned long address,
+static void unmap_area_pmd(pud_t *pud, unsigned long address,
 				  unsigned long size)
 {
 	unsigned long end;
 	pmd_t *pmd;
 
-	if (pgd_none(*dir))
+	if (pud_none(*pud))
 		return;
-	if (pgd_bad(*dir)) {
-		pgd_ERROR(*dir);
-		pgd_clear(dir);
+	if (pud_bad(*pud)) {
+		pud_ERROR(*pud);
+		pud_clear(pud);
 		return;
 	}
 
-	pmd = pmd_offset(dir, address);
-	address &= ~PGDIR_MASK;
+	pmd = pmd_offset(pud, address);
+	address &= ~PUD_MASK;
 	end = address + size;
-	if (end > PGDIR_SIZE)
-		end = PGDIR_SIZE;
+	if (end > PUD_SIZE)
+		end = PUD_SIZE;
 
 	do {
 		unmap_area_pte(pmd, address, end - address);
@@ -83,6 +83,33 @@ static void unmap_area_pmd(pgd_t *dir, u
 	} while (address < end);
 }
 
+static void unmap_area_pud(pgd_t *pgd, unsigned long address,
+			   unsigned long size)
+{
+	pud_t *pud;
+	unsigned long end;
+
+	if (pgd_none(*pgd))
+		return;
+	if (pgd_bad(*pgd)) {
+		pgd_ERROR(*pgd);
+		pgd_clear(pgd);
+		return;
+	}
+
+	pud = pud_offset_k(pgd, address);
+	address &= ~PGDIR_MASK;
+	end = address + size;
+	if (end > PGDIR_SIZE)
+		end = PGDIR_SIZE;
+
+	do {
+		unmap_area_pmd(pud, address, end - address);
+		address = (address + PUD_SIZE) & PUD_MASK;
+		pud++;
+	} while (address && (address < end));
+}
+
 static int map_area_pte(pte_t *pte, unsigned long address,
 			       unsigned long size, pgprot_t prot,
 			       struct page ***pages)
@@ -96,7 +123,6 @@ static int map_area_pte(pte_t *pte, unsi
 
 	do {
 		struct page *page = **pages;
-
 		WARN_ON(!pte_none(*pte));
 		if (!page)
 			return -ENOMEM;
@@ -115,11 +141,11 @@ static int map_area_pmd(pmd_t *pmd, unsi
 {
 	unsigned long base, end;
 
-	base = address & PGDIR_MASK;
-	address &= ~PGDIR_MASK;
+	base = address & PUD_MASK;
+	address &= ~PUD_MASK;
 	end = address + size;
-	if (end > PGDIR_SIZE)
-		end = PGDIR_SIZE;
+	if (end > PUD_SIZE)
+		end = PUD_SIZE;
 
 	do {
 		pte_t * pte = pte_alloc_kernel(&init_mm, pmd, base + address);
@@ -134,19 +160,41 @@ static int map_area_pmd(pmd_t *pmd, unsi
 	return 0;
 }
 
+static int map_area_pud(pud_t *pud, unsigned long address,
+			       unsigned long end, pgprot_t prot,
+			       struct page ***pages)
+{
+	do {
+		pmd_t *pmd = pmd_alloc(&init_mm, pud, address);
+		if (!pmd)
+			return -ENOMEM;
+		if (map_area_pmd(pmd, address, end - address, prot, pages))
+			return -ENOMEM;
+		address = (address + PUD_SIZE) & PUD_MASK;
+		pud++;
+	} while (address && address < end);
+
+	return 0;
+}
+
 void unmap_vm_area(struct vm_struct *area)
 {
 	unsigned long address = (unsigned long) area->addr;
 	unsigned long end = (address + area->size);
-	pgd_t *dir;
+	unsigned long next;
+	pgd_t *pgd;
+	int i;
 
-	dir = pgd_offset_k(address);
+	pgd = pgd_offset_k(address);
 	flush_cache_vunmap(address, end);
-	do {
-		unmap_area_pmd(dir, address, end - address);
-		address = (address + PGDIR_SIZE) & PGDIR_MASK;
-		dir++;
-	} while (address && (address < end));
+	for (i = pgd_index(address); i <= pgd_index(end-1); i++) {
+		next = (address + PGDIR_SIZE) & PGDIR_MASK;
+		if (next <= address || next > end)
+			next = end;
+		unmap_area_pud(pgd, address, next - address);
+		address = next;
+		pgd++;
+	}
 	flush_tlb_kernel_range((unsigned long) area->addr, end);
 }
 
@@ -154,25 +202,30 @@ int map_vm_area(struct vm_struct *area, 
 {
 	unsigned long address = (unsigned long) area->addr;
 	unsigned long end = address + (area->size-PAGE_SIZE);
-	pgd_t *dir;
+	unsigned long next;
+	pgd_t *pgd;
 	int err = 0;
+	int i;
 
-	dir = pgd_offset_k(address);
+	pgd = pgd_offset_k(address);
 	spin_lock(&init_mm.page_table_lock);
-	do {
-		pmd_t *pmd = pmd_alloc(&init_mm, dir, address);
-		if (!pmd) {
+	for (i = pgd_index(address); i <= pgd_index(end-1); i++) {
+		pud_t *pud = pud_alloc(&init_mm, pgd, address);
+		if (!pud) {
 			err = -ENOMEM;
 			break;
 		}
-		if (map_area_pmd(pmd, address, end - address, prot, pages)) {
+		next = (address + PGDIR_SIZE) & PGDIR_MASK;
+		if (next <= address || next > end)
+			next = end;
+		if (map_area_pud(pud, address, next, prot, pages)) {
 			err = -ENOMEM;
 			break;
 		}
 
-		address = (address + PGDIR_SIZE) & PGDIR_MASK;
-		dir++;
-	} while (address && (address < end));
+		address = next;
+		pgd++;
+	}
 
 	spin_unlock(&init_mm.page_table_lock);
 	flush_cache_vmap((unsigned long) area->addr, end);
diff -puN drivers/char/drm/drm_memory.h~4level-core-patch drivers/char/drm/drm_memory.h
--- linux-2.6/drivers/char/drm/drm_memory.h~4level-core-patch	2004-12-18 16:57:50.000000000 +1100
+++ linux-2.6-npiggin/drivers/char/drm/drm_memory.h	2004-12-18 16:57:50.000000000 +1100
@@ -125,7 +125,8 @@ static inline unsigned long
 drm_follow_page (void *vaddr)
 {
 	pgd_t *pgd = pgd_offset_k((unsigned long) vaddr);
-	pmd_t *pmd = pmd_offset(pgd, (unsigned long) vaddr);
+	pud_t *pud = pud_offset(pgd, (unsigned long) vaddr);
+	pmd_t *pmd = pmd_offset(pud, (unsigned long) vaddr);
 	pte_t *ptep = pte_offset_kernel(pmd, (unsigned long) vaddr);
 	return pte_pfn(*ptep) << PAGE_SHIFT;
 }

_
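
An aside for anyone reviewing the conversion: every walker touched above
follows the same descent, so it may help to see it once in isolation.
A minimal sketch of the pattern -- illustrative only, not from the patch
(cf. get_one_pte_map() in the mm/mremap.c hunk), and assuming
mm->page_table_lock is held:

	static pte_t *example_walk(struct mm_struct *mm, unsigned long addr)
	{
		pgd_t *pgd = pgd_offset(mm, addr);	/* top level */
		pud_t *pud;
		pmd_t *pmd;

		if (pgd_none(*pgd))
			return NULL;
		pud = pud_offset(pgd, addr);		/* the new level */
		if (pud_none(*pud))
			return NULL;
		pmd = pmd_offset(pud, addr);		/* takes a pud now, not a pgd */
		if (!pmd_present(*pmd))
			return NULL;
		return pte_offset_map(pmd, addr);	/* pte_unmap() when done */
	}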

^ permalink raw reply	[flat|nested] 77+ messages in thread

* [PATCH 7/10] alternate 4-level page tables patches
  2004-12-18  6:58           ` [PATCH 6/10] " Nick Piggin
@ 2004-12-18  6:59             ` Nick Piggin
  2004-12-18  7:00               ` [PATCH 8/10] " Nick Piggin
  0 siblings, 1 reply; 77+ messages in thread
From: Nick Piggin @ 2004-12-18  6:59 UTC (permalink / raw)
  To: Linux Memory Management
  Cc: Andi Kleen, Hugh Dickins, Linus Torvalds, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 5 bytes --]

7/10

[-- Attachment #2: 4level-architecture-changes-for-i386.patch --]
[-- Type: text/plain, Size: 14555 bytes --]



i386: works with both 2-level and 3-level (PAE) page tables.

Signed-off-by: Andi Kleen <ak@suse.de>

Converted to use pud_t by Nick Piggin

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>

---
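
A note before the diffs: include/asm-i386/pgtable-3level.h below pulls in
asm-generic/pgtable-nopud.h, which folds the new pud level back into the
pgd for PAE. Roughly what such a folding header has to provide -- a sketch
under that assumption, not the header's actual text:

	/* Sketch only: the folded pud shares the pgd entry's storage,
	 * so the extra level costs no memory access at walk time. */
	typedef struct { pgd_t pgd; } pud_t;

	#define PUD_SHIFT	PGDIR_SHIFT
	#define PTRS_PER_PUD	1
	#define PUD_SIZE	(1UL << PUD_SHIFT)
	#define PUD_MASK	(~(PUD_SIZE - 1))

	static inline pud_t *pud_offset(pgd_t *pgd, unsigned long address)
	{
		return (pud_t *)pgd;	/* same slot, one name down */
	}

With that in place pud_alloc() degenerates to pud_offset() and can never
fail, and the pud_*() predicates take over whatever the old pgd_*()
predicates did -- on PAE those are constants, as the pgtable-3level.h
hunk shows.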

 linux-2.6-npiggin/arch/i386/kernel/acpi/sleep.c     |    3 +-
 linux-2.6-npiggin/arch/i386/kernel/vm86.c           |   11 ++++++++-
 linux-2.6-npiggin/arch/i386/mm/fault.c              |   13 ++++++++--
 linux-2.6-npiggin/arch/i386/mm/hugetlbpage.c        |    8 +++++-
 linux-2.6-npiggin/arch/i386/mm/init.c               |   18 ++++++++++-----
 linux-2.6-npiggin/arch/i386/mm/ioremap.c            |    7 +++++
 linux-2.6-npiggin/arch/i386/mm/pageattr.c           |   14 ++++++++---
 linux-2.6-npiggin/arch/i386/mm/pgtable.c            |   12 ++++++++--
 linux-2.6-npiggin/include/asm-i386/pgalloc.h        |    3 --
 linux-2.6-npiggin/include/asm-i386/pgtable-3level.h |   24 ++++++++++----------
 linux-2.6-npiggin/include/asm-i386/pgtable.h        |    1 
 11 files changed, 81 insertions(+), 33 deletions(-)

diff -puN arch/i386/kernel/acpi/sleep.c~4level-architecture-changes-for-i386 arch/i386/kernel/acpi/sleep.c
--- linux-2.6/arch/i386/kernel/acpi/sleep.c~4level-architecture-changes-for-i386	2004-12-18 17:03:11.000000000 +1100
+++ linux-2.6-npiggin/arch/i386/kernel/acpi/sleep.c	2004-12-18 17:03:11.000000000 +1100
@@ -7,6 +7,7 @@
 
 #include <linux/acpi.h>
 #include <linux/bootmem.h>
+#include <asm/current.h> /* XXX remove me */
 #include <asm/smp.h>
 
 
@@ -24,7 +25,7 @@ static void init_low_mapping(pgd_t *pgd,
 	int pgd_ofs = 0;
 
 	while ((pgd_ofs < pgd_limit) && (pgd_ofs + USER_PTRS_PER_PGD < PTRS_PER_PGD)) {
-		set_pgd(pgd, *(pgd+USER_PTRS_PER_PGD));
+		set_pgd(pgd, (*(pgd+USER_PTRS_PER_PGD)));
 		pgd_ofs++, pgd++;
 	}
 }
diff -puN arch/i386/kernel/vm86.c~4level-architecture-changes-for-i386 arch/i386/kernel/vm86.c
--- linux-2.6/arch/i386/kernel/vm86.c~4level-architecture-changes-for-i386	2004-12-18 17:03:11.000000000 +1100
+++ linux-2.6-npiggin/arch/i386/kernel/vm86.c	2004-12-18 17:03:11.000000000 +1100
@@ -137,6 +137,7 @@ struct pt_regs * fastcall save_v86_state
 static void mark_screen_rdonly(struct task_struct * tsk)
 {
 	pgd_t *pgd;
+	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte, *mapped;
 	int i;
@@ -151,7 +152,15 @@ static void mark_screen_rdonly(struct ta
 		pgd_clear(pgd);
 		goto out;
 	}
-	pmd = pmd_offset(pgd, 0xA0000);
+	pud = pud_offset(pgd, 0xA0000);
+	if (pud_none(*pud))
+		goto out;
+	if (pud_bad(*pud)) {
+		pud_ERROR(*pud);
+		pud_clear(pud);
+		goto out;
+	}
+	pmd = pmd_offset(pud, 0xA0000);
 	if (pmd_none(*pmd))
 		goto out;
 	if (pmd_bad(*pmd)) {
diff -puN arch/i386/mm/fault.c~4level-architecture-changes-for-i386 arch/i386/mm/fault.c
--- linux-2.6/arch/i386/mm/fault.c~4level-architecture-changes-for-i386	2004-12-18 17:03:11.000000000 +1100
+++ linux-2.6-npiggin/arch/i386/mm/fault.c	2004-12-18 17:03:11.000000000 +1100
@@ -518,6 +518,7 @@ vmalloc_fault:
 		int index = pgd_index(address);
 		unsigned long pgd_paddr;
 		pgd_t *pgd, *pgd_k;
+		pud_t *pud, *pud_k;
 		pmd_t *pmd, *pmd_k;
 		pte_t *pte_k;
 
@@ -530,11 +531,17 @@ vmalloc_fault:
 
 		/*
 		 * set_pgd(pgd, *pgd_k); here would be useless on PAE
-		 * and redundant with the set_pmd() on non-PAE.
+		 * and redundant with the set_pmd() on non-PAE. As would
+		 * set_pud.
 		 */
 
-		pmd = pmd_offset(pgd, address);
-		pmd_k = pmd_offset(pgd_k, address);
+		pud = pud_offset(pgd, address);
+		pud_k = pud_offset(pgd_k, address);
+		if (!pud_present(*pud_k))
+			goto no_context;
+
+		pmd = pmd_offset(pud, address);
+		pmd_k = pmd_offset(pud_k, address);
 		if (!pmd_present(*pmd_k))
 			goto no_context;
 		set_pmd(pmd, *pmd_k);
diff -puN arch/i386/mm/hugetlbpage.c~4level-architecture-changes-for-i386 arch/i386/mm/hugetlbpage.c
--- linux-2.6/arch/i386/mm/hugetlbpage.c~4level-architecture-changes-for-i386	2004-12-18 17:03:11.000000000 +1100
+++ linux-2.6-npiggin/arch/i386/mm/hugetlbpage.c	2004-12-18 17:03:11.000000000 +1100
@@ -21,20 +21,24 @@
 static pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
 {
 	pgd_t *pgd;
+	pud_t *pud;
 	pmd_t *pmd = NULL;
 
 	pgd = pgd_offset(mm, addr);
-	pmd = pmd_alloc(mm, pgd, addr);
+	pud = pud_alloc(mm, pgd, addr);
+	if (!pud)
+		return NULL;
+	pmd = pmd_alloc(mm, pud, addr);
 	return (pte_t *) pmd;
 }
 
 static pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
 {
 	pgd_t *pgd;
+	pud_t *pud;
 	pmd_t *pmd = NULL;
 
 	pgd = pgd_offset(mm, addr);
-	pmd = pmd_offset(pgd, addr);
+	pud = pud_offset(pgd, addr);
+	pmd = pmd_offset(pud, addr);
 	return (pte_t *) pmd;
 }
 
diff -puN arch/i386/mm/init.c~4level-architecture-changes-for-i386 arch/i386/mm/init.c
--- linux-2.6/arch/i386/mm/init.c~4level-architecture-changes-for-i386	2004-12-18 17:03:11.000000000 +1100
+++ linux-2.6-npiggin/arch/i386/mm/init.c	2004-12-18 17:03:11.000000000 +1100
@@ -54,15 +54,18 @@ static int noinline do_test_wp_bit(void)
  */
 static pmd_t * __init one_md_table_init(pgd_t *pgd)
 {
+	pud_t *pud;
 	pmd_t *pmd_table;
 		
 #ifdef CONFIG_X86_PAE
 	pmd_table = (pmd_t *) alloc_bootmem_low_pages(PAGE_SIZE);
 	set_pgd(pgd, __pgd(__pa(pmd_table) | _PAGE_PRESENT));
-	if (pmd_table != pmd_offset(pgd, 0)) 
+	pud = pud_offset(pgd, 0);
+	if (pmd_table != pmd_offset(pud, 0)) 
 		BUG();
 #else
-	pmd_table = pmd_offset(pgd, 0);
+	pud = pud_offset(pgd, 0);
+	pmd_table = pmd_offset(pud, 0);
 #endif
 
 	return pmd_table;
@@ -100,6 +103,7 @@ static pte_t * __init one_page_table_ini
 static void __init page_table_range_init (unsigned long start, unsigned long end, pgd_t *pgd_base)
 {
 	pgd_t *pgd;
+	pud_t *pud;
 	pmd_t *pmd;
 	int pgd_idx, pmd_idx;
 	unsigned long vaddr;
@@ -112,8 +116,8 @@ static void __init page_table_range_init
 	for ( ; (pgd_idx < PTRS_PER_PGD) && (vaddr != end); pgd++, pgd_idx++) {
 		if (pgd_none(*pgd)) 
 			one_md_table_init(pgd);
-
-		pmd = pmd_offset(pgd, vaddr);
+		pud = pud_offset(pgd, vaddr);
+		pmd = pmd_offset(pud, vaddr);
 		for (; (pmd_idx < PTRS_PER_PMD) && (vaddr != end); pmd++, pmd_idx++) {
 			if (pmd_none(*pmd)) 
 				one_page_table_init(pmd);
@@ -233,7 +237,7 @@ EXPORT_SYMBOL(kmap_prot);
 EXPORT_SYMBOL(kmap_pte);
 
 #define kmap_get_fixmap_pte(vaddr)					\
-	pte_offset_kernel(pmd_offset(pgd_offset_k(vaddr), (vaddr)), (vaddr))
+	pte_offset_kernel(pmd_offset(pud_offset(pgd_offset_k(vaddr), vaddr), (vaddr)), (vaddr))
 
 void __init kmap_init(void)
 {
@@ -249,6 +253,7 @@ void __init kmap_init(void)
 void __init permanent_kmaps_init(pgd_t *pgd_base)
 {
 	pgd_t *pgd;
+	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
 	unsigned long vaddr;
@@ -257,7 +262,8 @@ void __init permanent_kmaps_init(pgd_t *
 	page_table_range_init(vaddr, vaddr + PAGE_SIZE*LAST_PKMAP, pgd_base);
 
 	pgd = swapper_pg_dir + pgd_index(vaddr);
-	pmd = pmd_offset(pgd, vaddr);
+	pud = pud_offset(pgd, vaddr);
+	pmd = pmd_offset(pud, vaddr);
 	pte = pte_offset_kernel(pmd, vaddr);
 	pkmap_page_table = pte;	
 }
diff -puN arch/i386/mm/ioremap.c~4level-architecture-changes-for-i386 arch/i386/mm/ioremap.c
--- linux-2.6/arch/i386/mm/ioremap.c~4level-architecture-changes-for-i386	2004-12-18 17:03:11.000000000 +1100
+++ linux-2.6-npiggin/arch/i386/mm/ioremap.c	2004-12-18 17:03:11.000000000 +1100
@@ -80,9 +80,14 @@ static int remap_area_pages(unsigned lon
 		BUG();
 	spin_lock(&init_mm.page_table_lock);
 	do {
+		pud_t *pud;
 		pmd_t *pmd;
-		pmd = pmd_alloc(&init_mm, dir, address);
+
 		error = -ENOMEM;
+		pud = pud_alloc(&init_mm, dir, address);
+		if (!pud)
+			break;
+		pmd = pmd_alloc(&init_mm, pud, address);
 		if (!pmd)
 			break;
 		if (remap_area_pmd(pmd, address, end - address,
diff -puN arch/i386/mm/pageattr.c~4level-architecture-changes-for-i386 arch/i386/mm/pageattr.c
--- linux-2.6/arch/i386/mm/pageattr.c~4level-architecture-changes-for-i386	2004-12-18 17:03:11.000000000 +1100
+++ linux-2.6-npiggin/arch/i386/mm/pageattr.c	2004-12-18 17:03:11.000000000 +1100
@@ -19,11 +19,15 @@ static struct list_head df_list = LIST_H
 
 pte_t *lookup_address(unsigned long address) 
 { 
-	pgd_t *pgd = pgd_offset_k(address); 
+	pgd_t *pgd = pgd_offset_k(address);
+	pud_t *pud;
 	pmd_t *pmd;
 	if (pgd_none(*pgd))
 		return NULL;
-	pmd = pmd_offset(pgd, address); 	       
+	pud = pud_offset(pgd, address);
+	if (pud_none(*pud))
+		return NULL;
+	pmd = pmd_offset(pud, address);
 	if (pmd_none(*pmd))
 		return NULL;
 	if (pmd_large(*pmd))
@@ -77,9 +81,11 @@ static void set_pmd_pte(pte_t *kpte, uns
 	spin_lock_irqsave(&pgd_lock, flags);
 	for (page = pgd_list; page; page = (struct page *)page->index) {
 		pgd_t *pgd;
+		pud_t *pud;
 		pmd_t *pmd;
 		pgd = (pgd_t *)page_address(page) + pgd_index(address);
-		pmd = pmd_offset(pgd, address);
+		pud = pud_offset(pgd, address);
+		pmd = pmd_offset(pud, address);
 		set_pte_atomic((pte_t *)pmd, pte);
 	}
 	spin_unlock_irqrestore(&pgd_lock, flags);
@@ -92,7 +98,7 @@ static void set_pmd_pte(pte_t *kpte, uns
 static inline void revert_page(struct page *kpte_page, unsigned long address)
 {
 	pte_t *linear = (pte_t *) 
-		pmd_offset(pgd_offset(&init_mm, address), address);
+		pmd_offset(pud_offset(pgd_offset_k(address), address), address);
 	set_pmd_pte(linear,  address,
 		    pfn_pte((__pa(address) & LARGE_PAGE_MASK) >> PAGE_SHIFT,
 			    PAGE_KERNEL_LARGE));
diff -puN arch/i386/mm/pgtable.c~4level-architecture-changes-for-i386 arch/i386/mm/pgtable.c
--- linux-2.6/arch/i386/mm/pgtable.c~4level-architecture-changes-for-i386	2004-12-18 17:03:11.000000000 +1100
+++ linux-2.6-npiggin/arch/i386/mm/pgtable.c	2004-12-18 17:03:11.000000000 +1100
@@ -62,6 +62,7 @@ void show_mem(void)
 static void set_pte_pfn(unsigned long vaddr, unsigned long pfn, pgprot_t flags)
 {
 	pgd_t *pgd;
+	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
 
@@ -70,7 +71,12 @@ static void set_pte_pfn(unsigned long va
 		BUG();
 		return;
 	}
-	pmd = pmd_offset(pgd, vaddr);
+	pud = pud_offset(pgd, vaddr);
+	if (pud_none(*pud)) {
+		BUG();
+		return;
+	}
+	pmd = pmd_offset(pud, vaddr);
 	if (pmd_none(*pmd)) {
 		BUG();
 		return;
@@ -95,6 +101,7 @@ static void set_pte_pfn(unsigned long va
 void set_pmd_pfn(unsigned long vaddr, unsigned long pfn, pgprot_t flags)
 {
 	pgd_t *pgd;
+	pud_t *pud;
 	pmd_t *pmd;
 
 	if (vaddr & (PMD_SIZE-1)) {		/* vaddr is misaligned */
@@ -110,7 +117,8 @@ void set_pmd_pfn(unsigned long vaddr, un
 		printk ("set_pmd_pfn: pgd_none\n");
 		return; /* BUG(); */
 	}
-	pmd = pmd_offset(pgd, vaddr);
+	pud = pud_offset(pgd, vaddr);
+	pmd = pmd_offset(pud, vaddr);
 	set_pmd(pmd, pfn_pmd(pfn, flags));
 	/*
 	 * It's enough to flush this one mapping.
diff -puN include/asm-i386/mmu_context.h~4level-architecture-changes-for-i386 include/asm-i386/mmu_context.h
diff -puN include/asm-i386/page.h~4level-architecture-changes-for-i386 include/asm-i386/page.h
diff -puN include/asm-i386/pgalloc.h~4level-architecture-changes-for-i386 include/asm-i386/pgalloc.h
--- linux-2.6/include/asm-i386/pgalloc.h~4level-architecture-changes-for-i386	2004-12-18 17:03:11.000000000 +1100
+++ linux-2.6-npiggin/include/asm-i386/pgalloc.h	2004-12-18 17:03:11.000000000 +1100
@@ -17,7 +17,6 @@
 /*
  * Allocate and free page tables.
  */
-
 extern pgd_t *pgd_alloc(struct mm_struct *);
 extern void pgd_free(pgd_t *pgd);
 
@@ -44,7 +43,7 @@ static inline void pte_free(struct page 
 #define pmd_alloc_one(mm, addr)		({ BUG(); ((pmd_t *)2); })
 #define pmd_free(x)			do { } while (0)
 #define __pmd_free_tlb(tlb,x)		do { } while (0)
-#define pgd_populate(mm, pmd, pte)	BUG()
+#define pud_populate(mm, pud, pmd)	BUG()
 #endif
 
 #define check_pgt_cache()	do { } while (0)
diff -puN include/asm-i386/pgtable.h~4level-architecture-changes-for-i386 include/asm-i386/pgtable.h
--- linux-2.6/include/asm-i386/pgtable.h~4level-architecture-changes-for-i386	2004-12-18 17:03:11.000000000 +1100
+++ linux-2.6-npiggin/include/asm-i386/pgtable.h	2004-12-18 17:03:11.000000000 +1100
@@ -303,6 +303,7 @@ static inline pte_t pte_modify(pte_t pte
  * control the given virtual address
  */
 #define pgd_index(address) (((address) >> PGDIR_SHIFT) & (PTRS_PER_PGD-1))
+#define pgd_index_k(addr) pgd_index(addr)
 
 /*
  * pgd_offset() returns a (pgd_t *)
diff -puN include/asm-i386/pgtable-2level.h~4level-architecture-changes-for-i386 include/asm-i386/pgtable-2level.h
diff -puN include/asm-i386/pgtable-3level.h~4level-architecture-changes-for-i386 include/asm-i386/pgtable-3level.h
--- linux-2.6/include/asm-i386/pgtable-3level.h~4level-architecture-changes-for-i386	2004-12-18 17:03:11.000000000 +1100
+++ linux-2.6-npiggin/include/asm-i386/pgtable-3level.h	2004-12-18 17:03:11.000000000 +1100
@@ -1,6 +1,8 @@
 #ifndef _I386_PGTABLE_3LEVEL_H
 #define _I386_PGTABLE_3LEVEL_H
 
+#include <asm-generic/pgtable-nopud.h>
+
 /*
  * Intel Physical Address Extension (PAE) Mode - three-level page
  * tables on PPro+ CPUs.
@@ -15,9 +17,9 @@
 #define pgd_ERROR(e) \
 	printk("%s:%d: bad pgd %p(%016Lx).\n", __FILE__, __LINE__, &(e), pgd_val(e))
 
-static inline int pgd_none(pgd_t pgd)		{ return 0; }
-static inline int pgd_bad(pgd_t pgd)		{ return 0; }
-static inline int pgd_present(pgd_t pgd)	{ return 1; }
+#define pud_none(pud)				0
+#define pud_bad(pud)				0
+#define pud_present(pud)			1
 
 /*
  * Is the pte executable?
@@ -59,8 +61,8 @@ static inline void set_pte(pte_t *ptep, 
 		set_64bit((unsigned long long *)(pteptr),pte_val(pteval))
 #define set_pmd(pmdptr,pmdval) \
 		set_64bit((unsigned long long *)(pmdptr),pmd_val(pmdval))
-#define set_pgd(pgdptr,pgdval) \
-		set_64bit((unsigned long long *)(pgdptr),pgd_val(pgdval))
+#define set_pud(pudptr,pudval) \
+		set_64bit((unsigned long long *)(pudptr),pud_val(pudval))
 
 /*
  * Pentium-II erratum A13: in PAE mode we explicitly have to flush
@@ -68,22 +70,22 @@ static inline void set_pte(pte_t *ptep, 
  * We do not let the generic code free and clear pgd entries due to
  * this erratum.
  */
-static inline void pgd_clear (pgd_t * pgd) { }
+static inline void pud_clear (pud_t * pud) { }
 
 #define pmd_page(pmd) (pfn_to_page(pmd_val(pmd) >> PAGE_SHIFT))
 
 #define pmd_page_kernel(pmd) \
 ((unsigned long) __va(pmd_val(pmd) & PAGE_MASK))
 
-#define pgd_page(pgd) \
-((struct page *) __va(pgd_val(pgd) & PAGE_MASK))
+#define pud_page(pud) \
+((struct page *) __va(pud_val(pud) & PAGE_MASK))
 
-#define pgd_page_kernel(pgd) \
-((unsigned long) __va(pgd_val(pgd) & PAGE_MASK))
+#define pud_page_kernel(pud) \
+((unsigned long) __va(pud_val(pud) & PAGE_MASK))
 
 
 /* Find an entry in the second-level page table.. */
-#define pmd_offset(dir, address) ((pmd_t *) pgd_page(*(dir)) + \
+#define pmd_offset(pud, address) ((pmd_t *) pud_page(*(pud)) + \
 			pmd_index(address))
 
 static inline pte_t ptep_get_and_clear(pte_t *ptep)

_
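
One idiom worth calling out from the hunks above (remap_area_pages(),
huge_pte_alloc()): the tables are allocated top-down, every level can
fail, and the __pud_alloc()/__pmd_alloc() helpers in the core patch drop
mm->page_table_lock around the allocation and re-check the entry before
populating it. A sketch of a full allocating descent, shaped like the
new handle_mm_fault() (illustrative, not from the patch):

	static int example_alloc_walk(struct mm_struct *mm, unsigned long address)
	{
		pgd_t *pgd;
		pud_t *pud;
		pmd_t *pmd;
		pte_t *pte;

		spin_lock(&mm->page_table_lock);
		pgd = pgd_offset(mm, address);
		pud = pud_alloc(mm, pgd, address);	/* may drop/retake the lock */
		if (!pud)
			goto oom;
		pmd = pmd_alloc(mm, pud, address);	/* likewise */
		if (!pmd)
			goto oom;
		pte = pte_alloc_map(mm, pmd, address);
		if (!pte)
			goto oom;
		/* ... operate on pte, then pte_unmap(pte) ... */
		spin_unlock(&mm->page_table_lock);
		return 0;
	oom:
		spin_unlock(&mm->page_table_lock);
		return -ENOMEM;
	}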

^ permalink raw reply	[flat|nested] 77+ messages in thread

* [PATCH 8/10] alternate 4-level page tables patches
  2004-12-18  6:59             ` [PATCH 7/10] " Nick Piggin
@ 2004-12-18  7:00               ` Nick Piggin
  2004-12-18  7:00                 ` [PATCH 9/10] " Nick Piggin
  0 siblings, 1 reply; 77+ messages in thread
From: Nick Piggin @ 2004-12-18  7:00 UTC (permalink / raw)
  To: Linux Memory Management
  Cc: Andi Kleen, Hugh Dickins, Linus Torvalds, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 5 bytes --]

8/10

[-- Attachment #2: 4level-ia64.patch --]
[-- Type: text/plain, Size: 7795 bytes --]



Convert the ia64 architecture over to handle 4-level page tables (the pud
level is folded away via asm-generic/pgtable-nopud.h).

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
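
The conversion is largely mechanical: ia64 keeps its three hardware
levels, the pud is folded away (see the pgtable.h hunk), and every
operation that used to inspect a pgd entry is respelled one name down.
A sketch of the resulting equivalence, valid only under that folding
assumption:

	static void example_folded_equivalence(struct mm_struct *mm,
					       unsigned long addr)
	{
		pgd_t *pgd = pgd_offset(mm, addr);
		pud_t *pud = pud_offset(pgd, addr);	/* folded: just a cast */

		/* pud_none()/pud_bad()/pud_clear()/pud_page() now do what
		 * pgd_none()/pgd_bad()/pgd_clear()/pgd_page() did before,
		 * on the very same entry. */
		BUG_ON((void *)pud != (void *)pgd);
	}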


---

 linux-2.6-npiggin/arch/ia64/mm/fault.c       |    7 ++++-
 linux-2.6-npiggin/arch/ia64/mm/hugetlbpage.c |   35 ++++++++++++---------------
 linux-2.6-npiggin/arch/ia64/mm/init.c        |   14 +++++++++-
 linux-2.6-npiggin/include/asm-ia64/pgalloc.h |    5 +--
 linux-2.6-npiggin/include/asm-ia64/pgtable.h |   14 ++++++----
 linux-2.6-npiggin/include/asm-ia64/tlb.h     |    6 ++++
 6 files changed, 50 insertions(+), 31 deletions(-)

diff -puN include/asm-ia64/pgtable.h~4level-ia64 include/asm-ia64/pgtable.h
--- linux-2.6/include/asm-ia64/pgtable.h~4level-ia64	2004-12-18 17:03:12.000000000 +1100
+++ linux-2.6-npiggin/include/asm-ia64/pgtable.h	2004-12-18 17:03:12.000000000 +1100
@@ -254,11 +254,12 @@ ia64_phys_addr_valid (unsigned long addr
 #define pmd_page_kernel(pmd)		((unsigned long) __va(pmd_val(pmd) & _PFN_MASK))
 #define pmd_page(pmd)			virt_to_page((pmd_val(pmd) + PAGE_OFFSET))
 
-#define pgd_none(pgd)			(!pgd_val(pgd))
-#define pgd_bad(pgd)			(!ia64_phys_addr_valid(pgd_val(pgd)))
-#define pgd_present(pgd)		(pgd_val(pgd) != 0UL)
-#define pgd_clear(pgdp)			(pgd_val(*(pgdp)) = 0UL)
-#define pgd_page(pgd)			((unsigned long) __va(pgd_val(pgd) & _PFN_MASK))
+#define pud_none(pud)			(!pud_val(pud))
+#define pud_bad(pud)			(!ia64_phys_addr_valid(pud_val(pud)))
+#define pud_present(pud)		(pud_val(pud) != 0UL)
+#define pud_clear(pudp)			(pud_val(*(pudp)) = 0UL)
+
+#define pud_page(pud)			((unsigned long) __va(pud_val(pud) & _PFN_MASK))
 
 /*
  * The following have defined behavior only work if pte_present() is true.
@@ -328,7 +329,7 @@ pgd_offset (struct mm_struct *mm, unsign
 
 /* Find an entry in the second-level page table.. */
 #define pmd_offset(dir,addr) \
-	((pmd_t *) pgd_page(*(dir)) + (((addr) >> PMD_SHIFT) & (PTRS_PER_PMD - 1)))
+	((pmd_t *) pud_page(*(dir)) + (((addr) >> PMD_SHIFT) & (PTRS_PER_PMD - 1)))
 
 /*
  * Find an entry in the third-level page table.  This looks more complicated than it
@@ -561,5 +562,6 @@ do {											\
 #define __HAVE_ARCH_PTE_SAME
 #define __HAVE_ARCH_PGD_OFFSET_GATE
 #include <asm-generic/pgtable.h>
+#include <asm-generic/pgtable-nopud.h>
 
 #endif /* _ASM_IA64_PGTABLE_H */
diff -puN arch/ia64/mm/fault.c~4level-ia64 arch/ia64/mm/fault.c
--- linux-2.6/arch/ia64/mm/fault.c~4level-ia64	2004-12-18 17:03:12.000000000 +1100
+++ linux-2.6-npiggin/arch/ia64/mm/fault.c	2004-12-18 17:03:12.000000000 +1100
@@ -51,6 +51,7 @@ static int
 mapped_kernel_page_is_present (unsigned long address)
 {
 	pgd_t *pgd;
+	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *ptep, pte;
 
@@ -58,7 +59,11 @@ mapped_kernel_page_is_present (unsigned 
 	if (pgd_none(*pgd) || pgd_bad(*pgd))
 		return 0;
 
-	pmd = pmd_offset(pgd, address);
+	pud = pud_offset(pgd, address);
+	if (pud_none(*pud) || pud_bad(*pud))
+		return 0;
+
+	pmd = pmd_offset(pud, address);
 	if (pmd_none(*pmd) || pmd_bad(*pmd))
 		return 0;
 
diff -puN arch/ia64/mm/hugetlbpage.c~4level-ia64 arch/ia64/mm/hugetlbpage.c
--- linux-2.6/arch/ia64/mm/hugetlbpage.c~4level-ia64	2004-12-18 17:03:12.000000000 +1100
+++ linux-2.6-npiggin/arch/ia64/mm/hugetlbpage.c	2004-12-18 17:03:12.000000000 +1100
@@ -29,13 +29,17 @@ huge_pte_alloc (struct mm_struct *mm, un
 {
 	unsigned long taddr = htlbpage_to_page(addr);
 	pgd_t *pgd;
+	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte = NULL;
 
 	pgd = pgd_offset(mm, taddr);
-	pmd = pmd_alloc(mm, pgd, taddr);
-	if (pmd)
-		pte = pte_alloc_map(mm, pmd, taddr);
+	pud = pud_alloc(mm, pgd, taddr);
+	if (pud) {
+		pmd = pmd_alloc(mm, pud, taddr);
+		if (pmd)
+			pte = pte_alloc_map(mm, pmd, taddr);
+	}
 	return pte;
 }
 
@@ -44,14 +48,18 @@ huge_pte_offset (struct mm_struct *mm, u
 {
 	unsigned long taddr = htlbpage_to_page(addr);
 	pgd_t *pgd;
+	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte = NULL;
 
 	pgd = pgd_offset(mm, taddr);
 	if (pgd_present(*pgd)) {
-		pmd = pmd_offset(pgd, taddr);
-		if (pmd_present(*pmd))
-			pte = pte_offset_map(pmd, taddr);
+		pud = pud_offset(pgd, taddr);
+		if (pud_present(*pud)) {
+			pmd = pmd_offset(pud, taddr);
+			if (pmd_present(*pmd))
+				pte = pte_offset_map(pmd, taddr);
+		}
 	}
 
 	return pte;
@@ -187,7 +195,6 @@ void hugetlb_free_pgtables(struct mmu_ga
 {
 	unsigned long first = start & HUGETLB_PGDIR_MASK;
 	unsigned long last = end + HUGETLB_PGDIR_SIZE - 1;
-	unsigned long start_index, end_index;
 	struct mm_struct *mm = tlb->mm;
 
 	if (!prev) {
@@ -212,23 +219,13 @@ void hugetlb_free_pgtables(struct mmu_ga
 				last = next->vm_start;
 		}
 		if (prev->vm_end > first)
-			first = prev->vm_end + HUGETLB_PGDIR_SIZE - 1;
+			first = prev->vm_end;
 		break;
 	}
 no_mmaps:
 	if (last < first)	/* for arches with discontiguous pgd indices */
 		return;
-	/*
-	 * If the PGD bits are not consecutive in the virtual address, the
-	 * old method of shifting the VA >> by PGDIR_SHIFT doesn't work.
-	 */
-
-	start_index = pgd_index(htlbpage_to_page(first));
-	end_index = pgd_index(htlbpage_to_page(last));
-
-	if (end_index > start_index) {
-		clear_page_tables(tlb, start_index, end_index - start_index);
-	}
+	clear_page_range(tlb, first, last);
 }
 
 void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start, unsigned long end)
diff -puN arch/ia64/mm/init.c~4level-ia64 arch/ia64/mm/init.c
--- linux-2.6/arch/ia64/mm/init.c~4level-ia64	2004-12-18 17:03:12.000000000 +1100
+++ linux-2.6-npiggin/arch/ia64/mm/init.c	2004-12-18 17:03:12.000000000 +1100
@@ -237,6 +237,7 @@ struct page *
 put_kernel_page (struct page *page, unsigned long address, pgprot_t pgprot)
 {
 	pgd_t *pgd;
+	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
 
@@ -248,7 +249,11 @@ put_kernel_page (struct page *page, unsi
 
 	spin_lock(&init_mm.page_table_lock);
 	{
-		pmd = pmd_alloc(&init_mm, pgd, address);
+		pud = pud_alloc(&init_mm, pgd, address);
+		if (!pud)
+			goto out;
+
+		pmd = pmd_alloc(&init_mm, pud, address);
 		if (!pmd)
 			goto out;
 		pte = pte_alloc_map(&init_mm, pmd, address);
@@ -381,6 +386,7 @@ create_mem_map_page_table (u64 start, u6
 	struct page *map_start, *map_end;
 	int node;
 	pgd_t *pgd;
+	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
 
@@ -395,7 +401,11 @@ create_mem_map_page_table (u64 start, u6
 		pgd = pgd_offset_k(address);
 		if (pgd_none(*pgd))
 			pgd_populate(&init_mm, pgd, alloc_bootmem_pages_node(NODE_DATA(node), PAGE_SIZE));
-		pmd = pmd_offset(pgd, address);
+		pud = pud_offset(pgd, address);
+
+		if (pud_none(*pud))
+			pud_populate(&init_mm, pud, alloc_bootmem_pages_node(NODE_DATA(node), PAGE_SIZE));
+		pmd = pmd_offset(pud, address);
 
 		if (pmd_none(*pmd))
 			pmd_populate_kernel(&init_mm, pmd, alloc_bootmem_pages_node(NODE_DATA(node), PAGE_SIZE));
diff -puN include/asm-ia64/pgalloc.h~4level-ia64 include/asm-ia64/pgalloc.h
--- linux-2.6/include/asm-ia64/pgalloc.h~4level-ia64	2004-12-18 17:03:12.000000000 +1100
+++ linux-2.6-npiggin/include/asm-ia64/pgalloc.h	2004-12-18 17:03:12.000000000 +1100
@@ -79,12 +79,11 @@ pgd_free (pgd_t *pgd)
 }
 
 static inline void
-pgd_populate (struct mm_struct *mm, pgd_t *pgd_entry, pmd_t *pmd)
+pud_populate (struct mm_struct *mm, pud_t *pud_entry, pmd_t *pmd)
 {
-	pgd_val(*pgd_entry) = __pa(pmd);
+	pud_val(*pud_entry) = __pa(pmd);
 }
 
-
 static inline pmd_t*
 pmd_alloc_one_fast (struct mm_struct *mm, unsigned long addr)
 {
diff -puN include/asm-ia64/tlb.h~4level-ia64 include/asm-ia64/tlb.h
--- linux-2.6/include/asm-ia64/tlb.h~4level-ia64	2004-12-18 17:03:12.000000000 +1100
+++ linux-2.6-npiggin/include/asm-ia64/tlb.h	2004-12-18 17:03:12.000000000 +1100
@@ -236,4 +236,10 @@ do {							\
 	__pmd_free_tlb(tlb, ptep);			\
 } while (0)
 
+#define pud_free_tlb(tlb, pudp)				\
+do {							\
+	tlb->need_flush = 1;				\
+	__pud_free_tlb(tlb, pudp);			\
+} while (0)
+
 #endif /* _ASM_IA64_TLB_H */

_
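
On the tlb.h addition: pud_free_tlb() mirrors the pmd_free_tlb() right
above it, flagging the mmu_gather before the page goes back. A
hypothetical call site -- the real one would live in the generic
page-table teardown path, which this mail does not touch:

	static void example_free_pud(struct mmu_gather *tlb, pgd_t *pgd,
				     unsigned long addr)
	{
		pud_t *pud = pud_offset(pgd, addr);

		/* assumes all pmd pages under this pud are already freed */
		pgd_clear(pgd);
		pud_free_tlb(tlb, pud);	/* sets tlb->need_flush, frees the pud */
	}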

^ permalink raw reply	[flat|nested] 77+ messages in thread

* [PATCH 9/10] alternate 4-level page tables patches
  2004-12-18  7:00               ` [PATCH 8/10] " Nick Piggin
@ 2004-12-18  7:00                 ` Nick Piggin
  2004-12-18  7:01                   ` [PATCH 10/10] " Nick Piggin
  0 siblings, 1 reply; 77+ messages in thread
From: Nick Piggin @ 2004-12-18  7:00 UTC (permalink / raw)
  To: Linux Memory Management
  Cc: Andi Kleen, Hugh Dickins, Linus Torvalds, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 5 bytes --]

9/10

[-- Attachment #2: 4level-x86-64.patch --]
[-- Type: text/plain, Size: 44615 bytes --]



From: Andi Kleen <ak@suse.de>

Converted to true 4 levels.  The address space per process is expanded to
47 bits now; the supported physical address space is 46 bits.

Lmbench fork/exit numbers are down a few percent because those paths now
have to walk many more page tables, but some planned future optimizations
will hopefully recover it.

See Documentation/x86_64/mm.txt for more details on the memory map.
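
For the record, the 47-bit figure falls straight out of the radix
arithmetic; the constants below are the standard x86-64 long-mode
values, quoted here only for illustration:

	#define PAGE_SHIFT	12	/* 4K pages */
	#define PMD_SHIFT	21	/* 12 + 9 */
	#define PUD_SHIFT	30	/* 21 + 9 */
	#define PGDIR_SHIFT	39	/* 30 + 9 */

	/* 512 entries (9 bits) at each of the four levels:
	 *	9 + 9 + 9 + 9 + 12 = 48 implemented virtual bits,
	 * of which the canonical lower half (bit 47 clear) is user
	 * space: 2^47 bytes = 128 TB per process. */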

Converted to pud_t by Nick Piggin.

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>


---

 linux-2.6-npiggin/Documentation/x86_64/mm.txt      |  168 ++-------------------
 linux-2.6-npiggin/arch/x86_64/ia32/syscall32.c     |   31 ++-
 linux-2.6-npiggin/arch/x86_64/kernel/acpi/sleep.c  |    8 -
 linux-2.6-npiggin/arch/x86_64/kernel/head.S        |    1 
 linux-2.6-npiggin/arch/x86_64/kernel/init_task.c   |    2 
 linux-2.6-npiggin/arch/x86_64/kernel/reboot.c      |    2 
 linux-2.6-npiggin/arch/x86_64/kernel/setup64.c     |   13 -
 linux-2.6-npiggin/arch/x86_64/mm/fault.c           |  111 ++++++++-----
 linux-2.6-npiggin/arch/x86_64/mm/init.c            |  101 +++++-------
 linux-2.6-npiggin/arch/x86_64/mm/ioremap.c         |   43 ++++-
 linux-2.6-npiggin/arch/x86_64/mm/pageattr.c        |   34 ++--
 linux-2.6-npiggin/include/asm-x86_64/e820.h        |    3 
 linux-2.6-npiggin/include/asm-x86_64/mmu_context.h |    5 
 linux-2.6-npiggin/include/asm-x86_64/page.h        |   12 -
 linux-2.6-npiggin/include/asm-x86_64/pda.h         |    1 
 linux-2.6-npiggin/include/asm-x86_64/pgalloc.h     |   38 ++++
 linux-2.6-npiggin/include/asm-x86_64/pgtable.h     |  138 +++++++----------
 linux-2.6-npiggin/include/asm-x86_64/processor.h   |    4 
 18 files changed, 314 insertions(+), 401 deletions(-)

diff -puN Documentation/x86_64/mm.txt~4level-x86-64 Documentation/x86_64/mm.txt
--- linux-2.6/Documentation/x86_64/mm.txt~4level-x86-64	2004-12-18 17:03:12.000000000 +1100
+++ linux-2.6-npiggin/Documentation/x86_64/mm.txt	2004-12-18 17:03:12.000000000 +1100
@@ -1,148 +1,24 @@
-The paging design used on the x86-64 linux kernel port in 2.4.x provides:
 
-o	per process virtual address space limit of 512 Gigabytes
-o	top of userspace stack located at address 0x0000007fffffffff
-o	PAGE_OFFSET = 0xffff800000000000
-o	start of the kernel = 0xffffffff800000000
-o	global RAM per system 2^64-PAGE_OFFSET-sizeof(kernel) = 128 Terabytes - 2 Gigabytes
-o	no need of any common code change
-o	no need to use highmem to handle the 128 Terabytes of RAM
-
-Description:
-
-	Userspace is able to modify and it sees only the 3rd/2nd/1st level
-	pagetables (pgd_offset() implicitly walks the 1st slot of the 4th
-	level pagetable and it returns an entry into the 3rd level pagetable).
-	This is where the per-process 512 Gigabytes limit cames from.
-
-	The common code pgd is the PDPE, the pmd is the PDE, the
-	pte is the PTE. The PML4E remains invisible to the common
-	code.
-
-	The kernel uses all the first 47 bits of the negative half
-	of the virtual address space to build the direct mapping using
-	2 Mbytes page size. The kernel virtual	addresses have bit number
-	47 always set to 1 (and in turn also bits 48-63 are set to 1 too,
-	due the sign extension). This is where the 128 Terabytes - 2 Gigabytes global
-	limit of RAM cames from.
-
-	Since the per-process limit is 512 Gigabytes (due to kernel common
-	code 3 level pagetable limitation), the higher virtual address mapped
-	into userspace is 0x7fffffffff and it makes sense to use it
-	as the top of the userspace stack to allow the stack to grow as
-	much as possible.
-
-	Setting the PAGE_OFFSET to 2^39 (after the last userspace
-	virtual address) wouldn't make much difference compared to
-	setting PAGE_OFFSET to 0xffff800000000000 because we have an
-	hole into the virtual address space. The last byte mapped by the
-	255th slot in the 4th level pagetable is at virtual address
-	0x00007fffffffffff and the first byte mapped by the 256th slot in the
-	4th level pagetable is at address 0xffff800000000000. Due to this
-	hole we can't trivially build a direct mapping across all the
-	512 slots of the 4th level pagetable, so we simply use only the
-	second (negative) half of the 4th level pagetable for that purpose
-	(that provides us 128 Terabytes of contigous virtual addresses).
-	Strictly speaking we could build a direct mapping also across the hole
-	using some DISCONTIGMEM trick, but we don't need such a large
-	direct mapping right now.
-
-Future:
-
-	During 2.5.x we can break the 512 Gigabytes per-process limit
-	possibly by removing from the common code any knowledge about the
-	architectural dependent physical layout of the virtual to physical
-	mapping.
-
-	Once the 512 Gigabytes limit will be removed the kernel stack will
-	be moved (most probably to virtual address 0x00007fffffffffff).
-	Nothing	will break in userspace due that move, as nothing breaks
-	in IA32 compiling the kernel with CONFIG_2G.
-
-Linus agreed on not breaking common code and to live with the 512 Gigabytes
-per-process limitation for the 2.4.x timeframe and he has given me and Andi
-some very useful hints... (thanks! :)
-
-Thanks also to H. Peter Anvin for his interesting and useful suggestions on
-the x86-64-discuss lists!
-
-Other memory management related issues follows:
-
-PAGE_SIZE:
-
-	If somebody is wondering why these days we still have a so small
-	4k pagesize (16 or 32 kbytes would be much better for performance
-	of course), the PAGE_SIZE have to remain 4k for 32bit apps to
-	provide 100% backwards compatible IA32 API (we can't allow silent
-	fs corruption or as best a loss of coherency with the page cache
-	by allocating MAP_SHARED areas in MAP_ANONYMOUS memory with a
-	do_mmap_fake). I think it could be possible to have a dynamic page
-	size between 32bit and 64bit apps but it would need extremely
-	intrusive changes in the common code as first for page cache and
-	we sure don't want to depend on them right now even if the
-	hardware would support that.
-
-PAGETABLE SIZE:
-
-	In turn we can't afford to have pagetables larger than 4k because
-	we could not be able to allocate them due physical memory
-	fragmentation, and failing to allocate the kernel stack is a minor
-	issue compared to failing the allocation of a pagetable. If we
-	fail the allocation of a pagetable the only thing we can do is to
-	sched_yield polling the freelist (deadlock prone) or to segfault
-	the task (not even the sighandler would be sure to run).
-
-KERNEL STACK:
-
-	1st stage:
-
-	The kernel stack will be at first allocated with an order 2 allocation
-	(16k) (the utilization of the stack for a 64bit platform really
-	isn't exactly the double of a 32bit platform because the local
-	variables may not be all 64bit wide, but not much less). This will
-	make things even worse than they are right now on IA32 with
-	respect of failing fork/clone due memory fragmentation.
-
-	2nd stage:
-
-	We'll benchmark if reserving one register as task_struct
-	pointer will improve performance of the kernel (instead of
-	recalculating the task_struct pointer starting from the stack
-	pointer each time). My guess is that recalculating will be faster
-	but it worth a try.
-
-		If reserving one register for the task_struct pointer
-		will be faster we can as well split task_struct and kernel
-		stack. task_struct can be a slab allocation or a
-		PAGE_SIZEd allocation, and the kernel stack can then be
-		allocated in a order 1 allocation. Really this is risky,
-		since 8k on a 64bit platform is going to be less than 7k
-		on a 32bit platform but we could try it out. This would
-		reduce the fragmentation problem of an order of magnitude
-		making it equal to the current IA32.
-
-		We must also consider the x86-64 seems to provide in hardware a
-		per-irq stack that could allow us to remove the irq handler
-		footprint from the regular per-process-stack, so it could allow
-		us to live with a smaller kernel stack compared to the other
-		linux architectures.
-
-	3rd stage:
-
-	Before going into production if we still have the order 2
-	allocation we can add a sysctl that allows the kernel stack to be
-	allocated with vmalloc during memory fragmentation. This have to
-	remain turned off during benchmarks :) but it should be ok in real
-	life.
-
-Order of PAGE_CACHE_SIZE and other allocations:
-
-	On the long run we can increase the PAGE_CACHE_SIZE to be
-	an order 2 allocations and also the slab/buffercache etc.ec..
-	could be all done with order 2 allocations. To make the above
-	to work we should change lots of common code thus it can be done
-	only once the basic port will be in a production state. Having
-	a working PAGE_CACHE_SIZE would be a benefit also for
-	IA32 and other architectures of course.
+<previous description obsolete, deleted>
 
-Andrea <andrea@suse.de> SuSE
+Virtual memory map with 4 level page tables:
+
+0000000000000000 - 00007fffffffffff (=47bits) user space, different per mm
+hole caused by [48:63] sign extension
+ffff800000000000 - ffff80ffffffffff (=40bits) guard hole
+ffff810000000000 - ffffc0ffffffffff (=46bits) direct mapping of phys. memory
+ffffc10000000000 - ffffc1ffffffffff (=40bits) hole
+ffffc20000000000 - ffffe1ffffffffff (=45bits) vmalloc/ioremap space
+... unused hole ...
+ffffffff80000000 - ffffffff82800000 (=40MB)   kernel text mapping, from phys 0
+... unused hole ...
+ffffffff88000000 - fffffffffff00000 (=1919MB) module mapping space
+
+vmalloc space is lazily synchronized into the different PML4 pages of
+the processes using the page fault handler, with init_level4_pgt as
+reference.
+
+Current X86-64 implementations only support 40 bits of address space,
+but we support up to 46 bits. This expands into MBZ space in the page tables.
+
+-Andi Kleen, Jul 2004
diff -puN arch/x86_64/ia32/syscall32.c~4level-x86-64 arch/x86_64/ia32/syscall32.c
--- linux-2.6/arch/x86_64/ia32/syscall32.c~4level-x86-64	2004-12-18 17:03:12.000000000 +1100
+++ linux-2.6-npiggin/arch/x86_64/ia32/syscall32.c	2004-12-18 17:03:12.000000000 +1100
@@ -40,23 +40,30 @@ static int use_sysenter = -1;
  */
 int __map_syscall32(struct mm_struct *mm, unsigned long address)
 { 
+	pgd_t *pgd;
+	pud_t *pud;
 	pte_t *pte;
 	pmd_t *pmd;
-	int err = 0;
+	int err = -ENOMEM;
 
 	spin_lock(&mm->page_table_lock); 
-	pmd = pmd_alloc(mm, pgd_offset(mm, address), address); 
-	if (pmd && (pte = pte_alloc_map(mm, pmd, address)) != NULL) { 
-		if (pte_none(*pte)) { 
-			set_pte(pte, 
-				mk_pte(virt_to_page(syscall32_page), 
-				       PAGE_KERNEL_VSYSCALL)); 
+ 	pgd = pgd_offset(mm, address);
+ 	pud = pud_alloc(mm, pgd, address);
+ 	if (pud) {
+ 		pmd = pmd_alloc(mm, pud, address);
+ 		if (pmd && (pte = pte_alloc_map(mm, pmd, address)) != NULL) {
+ 			if (pte_none(*pte)) {
+ 				set_pte(pte,
+ 					mk_pte(virt_to_page(syscall32_page),
+ 					       PAGE_KERNEL_VSYSCALL));
+ 			}
+ 			/* Flush only the local CPU. Other CPUs taking a fault
+ 			   will just end up here again.
+			   This is probably not needed, just paranoia. */
+ 			__flush_tlb_one(address);
+ 			err = 0;
 		}
-		/* Flush only the local CPU. Other CPUs taking a fault
-		   will just end up here again */
-		__flush_tlb_one(address); 
-	} else
-		err = -ENOMEM; 
+	}
 	spin_unlock(&mm->page_table_lock);
 	return err;
 }
diff -puN arch/x86_64/kernel/acpi/sleep.c~4level-x86-64 arch/x86_64/kernel/acpi/sleep.c
--- linux-2.6/arch/x86_64/kernel/acpi/sleep.c~4level-x86-64	2004-12-18 17:03:12.000000000 +1100
+++ linux-2.6-npiggin/arch/x86_64/kernel/acpi/sleep.c	2004-12-18 17:03:12.000000000 +1100
@@ -61,9 +61,13 @@ extern char wakeup_start, wakeup_end;
 
 extern unsigned long FASTCALL(acpi_copy_wakeup_routine(unsigned long));
 
+static pgd_t low_ptr;
+
 static void init_low_mapping(void)
 {
-	cpu_pda[0].level4_pgt[0] = cpu_pda[0].level4_pgt[pml4_index(PAGE_OFFSET)];
+	pgd_t *slot0 = pgd_offset(current->mm, 0UL);
+	low_ptr = *slot0;
+	set_pgd(slot0, *pgd_offset(current->mm, PAGE_OFFSET));
 	flush_tlb_all();
 }
 
@@ -97,7 +101,7 @@ int acpi_save_state_disk (void)
  */
 void acpi_restore_state_mem (void)
 {
-	cpu_pda[0].level4_pgt[0] = 0;
+	set_pgd(pgd_offset(current->mm, 0UL), low_ptr);
 	flush_tlb_all();
 }
 
diff -puN arch/x86_64/kernel/head.S~4level-x86-64 arch/x86_64/kernel/head.S
--- linux-2.6/arch/x86_64/kernel/head.S~4level-x86-64	2004-12-18 17:03:12.000000000 +1100
+++ linux-2.6-npiggin/arch/x86_64/kernel/head.S	2004-12-18 17:03:12.000000000 +1100
@@ -225,7 +225,6 @@ ENTRY(init_level4_pgt)
 	.quad	0x0000000000103007		/* -> level3_kernel_pgt */
 
 .org 0x2000
-/* Kernel does not "know" about 4-th level of page tables. */
 ENTRY(level3_ident_pgt)
 	.quad	0x0000000000104007
 	.fill	511,8,0
diff -puN arch/x86_64/kernel/init_task.c~4level-x86-64 arch/x86_64/kernel/init_task.c
--- linux-2.6/arch/x86_64/kernel/init_task.c~4level-x86-64	2004-12-18 17:03:12.000000000 +1100
+++ linux-2.6-npiggin/arch/x86_64/kernel/init_task.c	2004-12-18 17:03:12.000000000 +1100
@@ -47,5 +47,3 @@ EXPORT_SYMBOL(init_task);
 DEFINE_PER_CPU(struct tss_struct, init_tss) ____cacheline_maxaligned_in_smp;
 
 #define ALIGN_TO_4K __attribute__((section(".data.init_task")))
-
-pgd_t boot_vmalloc_pgt[512]  ALIGN_TO_4K;
diff -puN arch/x86_64/kernel/setup64.c~4level-x86-64 arch/x86_64/kernel/setup64.c
--- linux-2.6/arch/x86_64/kernel/setup64.c~4level-x86-64	2004-12-18 17:03:12.000000000 +1100
+++ linux-2.6-npiggin/arch/x86_64/kernel/setup64.c	2004-12-18 17:03:12.000000000 +1100
@@ -66,7 +66,7 @@ __setup("noexec=", nonx_setup); 
 
 /*
  * Great future plan:
- * Declare PDA itself and support (irqstack,tss,pml4) as per cpu data.
+ * Declare PDA itself and support (irqstack,tss,pgd) as per cpu data.
  * Always point %gs to its beginning
  */
 void __init setup_per_cpu_areas(void)
@@ -100,7 +100,6 @@ void __init setup_per_cpu_areas(void)
 
 void pda_init(int cpu)
 { 
-        pml4_t *level4;
 	struct x8664_pda *pda = &cpu_pda[cpu];
 
 	/* Setup up data that may be needed in __get_free_pages early */
@@ -119,22 +118,14 @@ void pda_init(int cpu)
 		/* others are initialized in smpboot.c */
 		pda->pcurrent = &init_task;
 		pda->irqstackptr = boot_cpu_stack; 
-		level4 = init_level4_pgt; 
 	} else {
-		level4 = (pml4_t *)__get_free_pages(GFP_ATOMIC, 0); 
-		if (!level4) 
-			panic("Cannot allocate top level page for cpu %d", cpu); 
 		pda->irqstackptr = (char *)
 			__get_free_pages(GFP_ATOMIC, IRQSTACK_ORDER);
 		if (!pda->irqstackptr)
 			panic("cannot allocate irqstack for cpu %d", cpu); 
 	}
 
-	pda->level4_pgt = (unsigned long *)level4; 
-	if (level4 != init_level4_pgt)
-		memcpy(level4, &init_level4_pgt, PAGE_SIZE); 
-	set_pml4(level4 + 510, mk_kernel_pml4(__pa_symbol(boot_vmalloc_pgt)));
-	asm volatile("movq %0,%%cr3" :: "r" (__pa(level4))); 
+	asm volatile("movq %0,%%cr3" :: "r" (__pa_symbol(&init_level4_pgt)));
 
 	pda->irqstackptr += IRQSTACKSIZE-64;
 } 
diff -puN arch/x86_64/mm/fault.c~4level-x86-64 arch/x86_64/mm/fault.c
--- linux-2.6/arch/x86_64/mm/fault.c~4level-x86-64	2004-12-18 17:03:12.000000000 +1100
+++ linux-2.6-npiggin/arch/x86_64/mm/fault.c	2004-12-18 17:03:12.000000000 +1100
@@ -143,25 +143,25 @@ static int bad_address(void *p) 
 
 void dump_pagetable(unsigned long address)
 {
-	pml4_t *pml4;
 	pgd_t *pgd;
+	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
 
-	asm("movq %%cr3,%0" : "=r" (pml4));
+	asm("movq %%cr3,%0" : "=r" (pgd));
 
-	pml4 = __va((unsigned long)pml4 & PHYSICAL_PAGE_MASK); 
-	pml4 += pml4_index(address);
-	printk("PML4 %lx ", pml4_val(*pml4));
-	if (bad_address(pml4)) goto bad;
-	if (!pml4_present(*pml4)) goto ret; 
-
-	pgd = __pgd_offset_k((pgd_t *)pml4_page(*pml4), address);
+	pgd = __va((unsigned long)pgd & PHYSICAL_PAGE_MASK); 
+	pgd += pgd_index(address);
+	printk("PGD %lx ", pgd_val(*pgd));
 	if (bad_address(pgd)) goto bad;
-	printk("PGD %lx ", pgd_val(*pgd)); 
-	if (!pgd_present(*pgd))	goto ret;
+	if (!pgd_present(*pgd)) goto ret; 
+
+	pud = __pud_offset_k((pud_t *)pgd_page(*pgd), address);
+	if (bad_address(pud)) goto bad;
+	printk("PUD %lx ", pud_val(*pud));
+	if (!pud_present(*pud))	goto ret;
 
-	pmd = pmd_offset(pgd, address);
+	pmd = pmd_offset(pud, address);
 	if (bad_address(pmd)) goto bad;
 	printk("PMD %lx ", pmd_val(*pmd));
 	if (!pmd_present(*pmd))	goto ret;	 
@@ -232,7 +232,53 @@ static noinline void pgtable_bad(unsigne
 	do_exit(SIGKILL);
 }
 
-int page_fault_trace; 
+/*
+ * Handle a fault on the vmalloc or module mapping area
+ */
+static int vmalloc_fault(unsigned long address)
+{
+	pgd_t *pgd, *pgd_ref;
+	pud_t *pud, *pud_ref;
+	pmd_t *pmd, *pmd_ref;
+	pte_t *pte, *pte_ref;
+
+	/* Copy kernel mappings over when needed. This can also
+	   happen within a race in page table update. In the latter
+	   case just flush. */
+
+	pgd = pgd_offset(current->mm ?: &init_mm, address);
+	pgd_ref = pgd_offset_k(address);
+	if (pgd_none(*pgd_ref))
+		return -1;
+	if (pgd_none(*pgd))
+		set_pgd(pgd, *pgd_ref);
+
+	/* Below here mismatches are bugs because these lower tables
+	   are shared */
+
+	pud = pud_offset(pgd, address);
+	pud_ref = pud_offset(pgd_ref, address);
+	if (pud_none(*pud_ref))
+		return -1;
+	if (pud_none(*pud) || pud_page(*pud) != pud_page(*pud_ref))
+		BUG();
+	pmd = pmd_offset(pud, address);
+	pmd_ref = pmd_offset(pud_ref, address);
+	if (pmd_none(*pmd_ref))
+		return -1;
+	if (pmd_none(*pmd) || pmd_page(*pmd) != pmd_page(*pmd_ref))
+		BUG();
+	pte_ref = pte_offset_kernel(pmd_ref, address);
+	if (!pte_present(*pte_ref))
+		return -1;
+	pte = pte_offset_kernel(pmd, address);
+	if (!pte_present(*pte) || pte_page(*pte) != pte_page(*pte_ref))
+		BUG();
+	__flush_tlb_all();
+	return 0;
+}
+
+int page_fault_trace = 0;
 int exception_trace = 1;
 
 /*
@@ -300,8 +346,11 @@ asmlinkage void do_page_fault(struct pt_
 	 * protection error (error_code & 1) == 0.
 	 */
 	if (unlikely(address >= TASK_SIZE)) {
-		if (!(error_code & 5))
-			goto vmalloc_fault;
+		if (!(error_code & 5)) {
+			if (vmalloc_fault(address) < 0)
+				goto bad_area_nosemaphore;
+			return;
+		}
 		/*
 		 * Don't take the mm semaphore here. If we fixup a prefetch
 		 * fault we could otherwise deadlock.
@@ -310,7 +359,7 @@ asmlinkage void do_page_fault(struct pt_
 	}
 
 	if (unlikely(error_code & (1 << 3)))
-		goto page_table_corruption;
+		pgtable_bad(address, regs, error_code);
 
 	/*
 	 * If we're in an interrupt or have no user
@@ -524,34 +573,4 @@ do_sigbus:
 	info.si_addr = (void __user *)address;
 	force_sig_info(SIGBUS, &info, tsk);
 	return;
-
-vmalloc_fault:
-	{
-		pgd_t *pgd;
-		pmd_t *pmd;
-		pte_t *pte; 
-
-		/*
-		 * x86-64 has the same kernel 3rd level pages for all CPUs.
-		 * But for vmalloc/modules the TLB synchronization works lazily,
-		 * so it can happen that we get a page fault for something
-		 * that is really already in the page table. Just check if it
-		 * is really there and when yes flush the local TLB. 
-		 */
-		pgd = pgd_offset_k(address);
-		if (!pgd_present(*pgd))
-			goto bad_area_nosemaphore;
-		pmd = pmd_offset(pgd, address);
-		if (!pmd_present(*pmd))
-			goto bad_area_nosemaphore;
-		pte = pte_offset_kernel(pmd, address); 
-		if (!pte_present(*pte))
-			goto bad_area_nosemaphore;
-
-		__flush_tlb_all();		
-		return;
-	}
-
-page_table_corruption:
-	pgtable_bad(address, regs, error_code);
 }
diff -puN arch/x86_64/mm/init.c~4level-x86-64 arch/x86_64/mm/init.c
--- linux-2.6/arch/x86_64/mm/init.c~4level-x86-64	2004-12-18 17:03:12.000000000 +1100
+++ linux-2.6-npiggin/arch/x86_64/mm/init.c	2004-12-18 17:03:12.000000000 +1100
@@ -108,28 +108,28 @@ static void *spp_getpage(void)
 static void set_pte_phys(unsigned long vaddr,
 			 unsigned long phys, pgprot_t prot)
 {
-	pml4_t *level4;
 	pgd_t *pgd;
+	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte, new_pte;
 
 	Dprintk("set_pte_phys %lx to %lx\n", vaddr, phys);
 
-	level4 = pml4_offset_k(vaddr);
-	if (pml4_none(*level4)) {
-		printk("PML4 FIXMAP MISSING, it should be setup in head.S!\n");
+	pgd = pgd_offset_k(vaddr);
+	if (pgd_none(*pgd)) {
+		printk("PGD FIXMAP MISSING, it should be setup in head.S!\n");
 		return;
 	}
-	pgd = level3_offset_k(level4, vaddr);
-	if (pgd_none(*pgd)) {
+	pud = pud_offset(pgd, vaddr);
+	if (pud_none(*pud)) {
 		pmd = (pmd_t *) spp_getpage(); 
-		set_pgd(pgd, __pgd(__pa(pmd) | _KERNPG_TABLE | _PAGE_USER));
-		if (pmd != pmd_offset(pgd, 0)) {
-			printk("PAGETABLE BUG #01! %p <-> %p\n", pmd, pmd_offset(pgd,0));
+		set_pud(pud, __pud(__pa(pmd) | _KERNPG_TABLE | _PAGE_USER));
+		if (pmd != pmd_offset(pud, 0)) {
+			printk("PAGETABLE BUG #01! %p <-> %p\n", pmd, pmd_offset(pud,0));
 			return;
 		}
 	}
-	pmd = pmd_offset(pgd, vaddr);
+	pmd = pmd_offset(pud, vaddr);
 	if (pmd_none(*pmd)) {
 		pte = (pte_t *) spp_getpage();
 		set_pmd(pmd, __pmd(__pa(pte) | _KERNPG_TABLE | _PAGE_USER));
@@ -210,31 +210,31 @@ static __init void unmap_low_page(int i)
 	ti->allocated = 0; 
 } 
 
-static void __init phys_pgd_init(pgd_t *pgd, unsigned long address, unsigned long end)
+static void __init phys_pud_init(pud_t *pud, unsigned long address, unsigned long end)
 { 
 	long i, j; 
 
-	i = pgd_index(address);
-	pgd = pgd + i;
-	for (; i < PTRS_PER_PGD; pgd++, i++) {
+	i = pud_index(address);
+	pud = pud + i;
+	for (; i < PTRS_PER_PUD; pud++, i++) {
 		int map; 
 		unsigned long paddr, pmd_phys;
 		pmd_t *pmd;
 
-		paddr = (address & PML4_MASK) + i*PGDIR_SIZE;
+		paddr = address + i*PUD_SIZE;
 		if (paddr >= end) { 
-			for (; i < PTRS_PER_PGD; i++, pgd++) 
-				set_pgd(pgd, __pgd(0)); 
+			for (; i < PTRS_PER_PUD; i++, pud++) 
+				set_pud(pud, __pud(0)); 
 			break;
 		} 
 
-		if (!e820_mapped(paddr, paddr+PGDIR_SIZE, 0)) { 
-			set_pgd(pgd, __pgd(0)); 
+		if (!e820_mapped(paddr, paddr+PUD_SIZE, 0)) { 
+			set_pud(pud, __pud(0)); 
 			continue;
 		} 
 
 		pmd = alloc_low_page(&map, &pmd_phys);
-		set_pgd(pgd, __pgd(pmd_phys | _KERNPG_TABLE));
+		set_pud(pud, __pud(pmd_phys | _KERNPG_TABLE));
 		for (j = 0; j < PTRS_PER_PMD; pmd++, j++, paddr += PMD_SIZE) {
 			unsigned long pe;
 
@@ -260,7 +260,7 @@ void __init init_memory_mapping(void) 
 	unsigned long adr;	       
 	unsigned long end;
 	unsigned long next; 
-	unsigned long pgds, pmds, tables; 
+	unsigned long puds, pmds, tables; 
 
 	Dprintk("init_memory_mapping\n");
 
@@ -273,9 +273,9 @@ void __init init_memory_mapping(void) 
 	 * discovered.
 	 */
 
-	pgds = (end + PGDIR_SIZE - 1) >> PGDIR_SHIFT;
+	puds = (end + PUD_SIZE - 1) >> PUD_SHIFT;
 	pmds = (end + PMD_SIZE - 1) >> PMD_SHIFT; 
-	tables = round_up(pgds*8, PAGE_SIZE) + round_up(pmds * 8, PAGE_SIZE); 
+	tables = round_up(puds*8, PAGE_SIZE) + round_up(pmds * 8, PAGE_SIZE); 
 
 	table_start = find_e820_area(0x8000, __pa_symbol(&_text), tables); 
 	if (table_start == -1UL) 
@@ -288,13 +288,13 @@ void __init init_memory_mapping(void) 
 
 	for (adr = PAGE_OFFSET; adr < end; adr = next) { 
 		int map;
-		unsigned long pgd_phys; 
-		pgd_t *pgd = alloc_low_page(&map, &pgd_phys);
-		next = adr + PML4_SIZE;
+		unsigned long pud_phys; 
+		pud_t *pud = alloc_low_page(&map, &pud_phys);
+		next = adr + PGDIR_SIZE;
 		if (next > end) 
 			next = end; 
-		phys_pgd_init(pgd, adr-PAGE_OFFSET, next-PAGE_OFFSET); 
-		set_pml4(init_level4_pgt + pml4_index(adr), mk_kernel_pml4(pgd_phys));
+		phys_pud_init(pud, adr-PAGE_OFFSET, next-PAGE_OFFSET); 
+		set_pgd(init_level4_pgt + pgd_index(adr), mk_kernel_pgd(pud_phys));
 		unmap_low_page(map);   
 	} 
 	asm volatile("movq %%cr4,%0" : "=r" (mmu_cr4_features));
@@ -306,25 +306,12 @@ void __init init_memory_mapping(void) 
 
 extern struct x8664_pda cpu_pda[NR_CPUS];
 
-static unsigned long low_pml4[NR_CPUS];
-
-void swap_low_mappings(void)
-{
-	int i;
-	for (i = 0; i < NR_CPUS; i++) {
-	        unsigned long t;
-		if (!cpu_pda[i].level4_pgt) 
-			continue;
-		t = cpu_pda[i].level4_pgt[0];
-		cpu_pda[i].level4_pgt[0] = low_pml4[i];
-		low_pml4[i] = t;
-	}
-	flush_tlb_all();
-}
-
+/* Assumes all CPUs still execute in init_mm */
 void zap_low_mappings(void)
 {
-	swap_low_mappings();
+	pgd_t *pgd = pgd_offset_k(0UL);
+	pgd_clear(pgd);
+	flush_tlb_all();
 }
 
 #ifndef CONFIG_DISCONTIGMEM
@@ -361,10 +348,14 @@ void __init clear_kernel_mapping(unsigne
 	
 	for (; address < end; address += LARGE_PAGE_SIZE) { 
 		pgd_t *pgd = pgd_offset_k(address);
-               pmd_t *pmd;
-		if (!pgd || pgd_none(*pgd))
+		pud_t *pud;
+		pmd_t *pmd;
+		if (pgd_none(*pgd))
+			continue;
+		pud = pud_offset(pgd, address);
+		if (pud_none(*pud))
 			continue; 
-               pmd = pmd_offset(pgd, address);
+		pmd = pmd_offset(pud, address);
 		if (!pmd || pmd_none(*pmd))
 			continue; 
 		if (0 == (pmd_val(*pmd) & _PAGE_PSE)) { 
@@ -531,29 +522,29 @@ void __init reserve_bootmem_generic(unsi
 int kern_addr_valid(unsigned long addr) 
 { 
 	unsigned long above = ((long)addr) >> __VIRTUAL_MASK_SHIFT;
-       pml4_t *pml4;
        pgd_t *pgd;
+       pud_t *pud;
        pmd_t *pmd;
        pte_t *pte;
 
 	if (above != 0 && above != -1UL)
 		return 0; 
 	
-       pml4 = pml4_offset_k(addr);
-	if (pml4_none(*pml4))
+	pgd = pgd_offset_k(addr);
+	if (pgd_none(*pgd))
 		return 0;
 
-       pgd = pgd_offset_k(addr);
-	if (pgd_none(*pgd))
+	pud = pud_offset(pgd, addr);
+	if (pud_none(*pud))
 		return 0; 
 
-       pmd = pmd_offset(pgd, addr);
+	pmd = pmd_offset(pud, addr);
 	if (pmd_none(*pmd))
 		return 0;
 	if (pmd_large(*pmd))
 		return pfn_valid(pmd_pfn(*pmd));
 
-       pte = pte_offset_kernel(pmd, addr);
+	pte = pte_offset_kernel(pmd, addr);
 	if (pte_none(*pte))
 		return 0;
 	return pfn_valid(pte_pfn(*pte));
diff -puN arch/x86_64/mm/ioremap.c~4level-x86-64 arch/x86_64/mm/ioremap.c
--- linux-2.6/arch/x86_64/mm/ioremap.c~4level-x86-64	2004-12-18 17:03:12.000000000 +1100
+++ linux-2.6-npiggin/arch/x86_64/mm/ioremap.c	2004-12-18 17:03:12.000000000 +1100
@@ -49,10 +49,10 @@ static inline int remap_area_pmd(pmd_t *
 {
 	unsigned long end;
 
-	address &= ~PGDIR_MASK;
+	address &= ~PUD_MASK;
 	end = address + size;
-	if (end > PGDIR_SIZE)
-		end = PGDIR_SIZE;
+	if (end > PUD_SIZE)
+		end = PUD_SIZE;
 	phys_addr -= address;
 	if (address >= end)
 		BUG();
@@ -67,31 +67,54 @@ static inline int remap_area_pmd(pmd_t *
 	return 0;
 }
 
+static inline int remap_area_pud(pud_t * pud, unsigned long address, unsigned long size,
+	unsigned long phys_addr, unsigned long flags)
+{
+	unsigned long end;
+
+	address &= ~PGDIR_MASK;
+	end = address + size;
+	if (end > PGDIR_SIZE)
+		end = PGDIR_SIZE;
+	phys_addr -= address;
+	if (address >= end)
+		BUG();
+	do {
+		pmd_t * pmd = pmd_alloc(&init_mm, pud, address);
+		if (!pmd)
+			return -ENOMEM;
+		remap_area_pmd(pmd, address, end - address, address + phys_addr, flags);
+		address = (address + PUD_SIZE) & PUD_MASK;
+		pud++;
+	} while (address && (address < end));
+	return 0;
+}
+
 static int remap_area_pages(unsigned long address, unsigned long phys_addr,
 				 unsigned long size, unsigned long flags)
 {
 	int error;
-	pgd_t * dir;
+	pgd_t *pgd;
 	unsigned long end = address + size;
 
 	phys_addr -= address;
-	dir = pgd_offset_k(address);
+	pgd = pgd_offset_k(address);
 	flush_cache_all();
 	if (address >= end)
 		BUG();
 	spin_lock(&init_mm.page_table_lock);
 	do {
-		pmd_t *pmd;
-		pmd = pmd_alloc(&init_mm, dir, address);
+		pud_t *pud;
+		pud = pud_alloc(&init_mm, pgd, address);
 		error = -ENOMEM;
-		if (!pmd)
+		if (!pud)
 			break;
-		if (remap_area_pmd(pmd, address, end - address,
+		if (remap_area_pud(pud, address, end - address,
 					 phys_addr + address, flags))
 			break;
 		error = 0;
 		address = (address + PGDIR_SIZE) & PGDIR_MASK;
-		dir++;
+		pgd++;
 	} while (address && (address < end));
 	spin_unlock(&init_mm.page_table_lock);
 	flush_tlb_all();
diff -puN arch/x86_64/mm/pageattr.c~4level-x86-64 arch/x86_64/mm/pageattr.c
--- linux-2.6/arch/x86_64/mm/pageattr.c~4level-x86-64	2004-12-18 17:03:12.000000000 +1100
+++ linux-2.6-npiggin/arch/x86_64/mm/pageattr.c	2004-12-18 17:03:12.000000000 +1100
@@ -16,12 +16,16 @@
 
 static inline pte_t *lookup_address(unsigned long address) 
 { 
-	pgd_t *pgd = pgd_offset_k(address); 
+	pgd_t *pgd = pgd_offset_k(address);
+	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
-	if (!pgd || !pgd_present(*pgd))
+	if (pgd_none(*pgd))
+		return NULL;
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
 		return NULL; 
-	pmd = pmd_offset(pgd, address); 	       
+	pmd = pmd_offset(pud, address);
 	if (!pmd_present(*pmd))
 		return NULL; 
 	if (pmd_large(*pmd))
@@ -98,16 +102,20 @@ static inline void save_page(unsigned lo
  */
 static void revert_page(unsigned long address, pgprot_t ref_prot)
 {
-       pgd_t *pgd;
-       pmd_t *pmd; 
-       pte_t large_pte; 
-       
-       pgd = pgd_offset_k(address); 
-       pmd = pmd_offset(pgd, address);
-       BUG_ON(pmd_val(*pmd) & _PAGE_PSE); 
-       pgprot_val(ref_prot) |= _PAGE_PSE;
-       large_pte = mk_pte_phys(__pa(address) & LARGE_PAGE_MASK, ref_prot);
-       set_pte((pte_t *)pmd, large_pte);
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t large_pte;
+
+	pgd = pgd_offset_k(address);
+	BUG_ON(pgd_none(*pgd));
+	pud = pud_offset(pgd, address);
+	BUG_ON(pud_none(*pud));
+	pmd = pmd_offset(pud, address);
+	BUG_ON(pmd_val(*pmd) & _PAGE_PSE);
+	pgprot_val(ref_prot) |= _PAGE_PSE;
+	large_pte = mk_pte_phys(__pa(address) & LARGE_PAGE_MASK, ref_prot);
+	set_pte((pte_t *)pmd, large_pte);
 }      
 
 static int
diff -puN include/asm-x86_64/e820.h~4level-x86-64 include/asm-x86_64/e820.h
--- linux-2.6/include/asm-x86_64/e820.h~4level-x86-64	2004-12-18 17:03:12.000000000 +1100
+++ linux-2.6-npiggin/include/asm-x86_64/e820.h	2004-12-18 17:03:12.000000000 +1100
@@ -26,9 +26,6 @@
 
 #define LOWMEMSIZE()	(0x9f000)
 
-#define MAXMEM		(120UL * 1024 * 1024 * 1024 * 1024)  /* 120TB */ 
-
-
 #ifndef __ASSEMBLY__
 struct e820entry {
 	u64 addr;	/* start of memory segment */
diff -puN include/asm-x86_64/mmu_context.h~4level-x86-64 include/asm-x86_64/mmu_context.h
--- linux-2.6/include/asm-x86_64/mmu_context.h~4level-x86-64	2004-12-18 17:03:12.000000000 +1100
+++ linux-2.6-npiggin/include/asm-x86_64/mmu_context.h	2004-12-18 17:03:12.000000000 +1100
@@ -40,10 +40,7 @@ static inline void switch_mm(struct mm_s
 		write_pda(active_mm, next);
 #endif
 		set_bit(cpu, &next->cpu_vm_mask);
-		/* Re-load page tables */
-		*read_pda(level4_pgt) = __pa(next->pgd) | _PAGE_TABLE;
-		__flush_tlb();
-
+		asm volatile("movq %0,%%cr3" :: "r" (__pa(next->pgd)) : "memory");
 		if (unlikely(next->context.ldt != prev->context.ldt)) 
 			load_LDT_nolock(&next->context, cpu);
 	}
diff -puN include/asm-x86_64/page.h~4level-x86-64 include/asm-x86_64/page.h
--- linux-2.6/include/asm-x86_64/page.h~4level-x86-64	2004-12-18 17:03:12.000000000 +1100
+++ linux-2.6-npiggin/include/asm-x86_64/page.h	2004-12-18 17:03:12.000000000 +1100
@@ -43,22 +43,22 @@ void copy_page(void *, void *);
  */
 typedef struct { unsigned long pte; } pte_t;
 typedef struct { unsigned long pmd; } pmd_t;
+typedef struct { unsigned long pud; } pud_t;
 typedef struct { unsigned long pgd; } pgd_t;
-typedef struct { unsigned long pml4; } pml4_t;
 #define PTE_MASK	PHYSICAL_PAGE_MASK
 
 typedef struct { unsigned long pgprot; } pgprot_t;
 
 #define pte_val(x)	((x).pte)
 #define pmd_val(x)	((x).pmd)
+#define pud_val(x)	((x).pud)
 #define pgd_val(x)	((x).pgd)
-#define pml4_val(x)	((x).pml4)
 #define pgprot_val(x)	((x).pgprot)
 
 #define __pte(x) ((pte_t) { (x) } )
 #define __pmd(x) ((pmd_t) { (x) } )
+#define __pud(x) ((pud_t) { (x) } )
 #define __pgd(x) ((pgd_t) { (x) } )
-#define __pml4(x) ((pml4_t) { (x) } )
 #define __pgprot(x)	((pgprot_t) { (x) } )
 
 extern unsigned long vm_stack_flags, vm_stack_flags32;
@@ -67,19 +67,19 @@ extern unsigned long vm_force_exec32;
 
 #define __START_KERNEL		0xffffffff80100000UL
 #define __START_KERNEL_map	0xffffffff80000000UL
-#define __PAGE_OFFSET           0x0000010000000000UL	/* 1 << 40 */
+#define __PAGE_OFFSET           0xffff810000000000UL
 
 #else
 #define __START_KERNEL		0xffffffff80100000
 #define __START_KERNEL_map	0xffffffff80000000
-#define __PAGE_OFFSET           0x0000010000000000	/* 1 << 40 */
+#define __PAGE_OFFSET           0xffff810000000000
 #endif /* !__ASSEMBLY__ */
 
 /* to align the pointer to the (next) page boundary */
 #define PAGE_ALIGN(addr)	(((addr)+PAGE_SIZE-1)&PAGE_MASK)
 
 /* See Documentation/x86_64/mm.txt for a description of the memory map. */
-#define __PHYSICAL_MASK_SHIFT	40
+#define __PHYSICAL_MASK_SHIFT	46
 #define __PHYSICAL_MASK		((1UL << __PHYSICAL_MASK_SHIFT) - 1)
 #define __VIRTUAL_MASK_SHIFT	48
 #define __VIRTUAL_MASK		((1UL << __VIRTUAL_MASK_SHIFT) - 1)
diff -puN include/asm-x86_64/pda.h~4level-x86-64 include/asm-x86_64/pda.h
--- linux-2.6/include/asm-x86_64/pda.h~4level-x86-64	2004-12-18 17:03:12.000000000 +1100
+++ linux-2.6-npiggin/include/asm-x86_64/pda.h	2004-12-18 17:03:12.000000000 +1100
@@ -17,7 +17,6 @@ struct x8664_pda {
         int irqcount;		    /* Irq nesting counter. Starts with -1 */  	
 	int cpunumber;		    /* Logical CPU number */
 	char *irqstackptr;	/* top of irqstack */
-	unsigned long volatile *level4_pgt; /* Per CPU top level page table */
 	unsigned int __softirq_pending;
 	unsigned int __nmi_count;	/* number of NMI on this CPUs */
 	struct mm_struct *active_mm;
diff -puN include/asm-x86_64/pgalloc.h~4level-x86-64 include/asm-x86_64/pgalloc.h
--- linux-2.6/include/asm-x86_64/pgalloc.h~4level-x86-64	2004-12-18 17:03:12.000000000 +1100
+++ linux-2.6-npiggin/include/asm-x86_64/pgalloc.h	2004-12-18 17:03:12.000000000 +1100
@@ -9,8 +9,10 @@
 
 #define pmd_populate_kernel(mm, pmd, pte) \
 		set_pmd(pmd, __pmd(_PAGE_TABLE | __pa(pte)))
-#define pgd_populate(mm, pgd, pmd) \
-		set_pgd(pgd, __pgd(_PAGE_TABLE | __pa(pmd)))
+#define pud_populate(mm, pud, pmd) \
+		set_pud(pud, __pud(_PAGE_TABLE | __pa(pmd)))
+#define pgd_populate(mm, pgd, pud) \
+		set_pgd(pgd, __pgd(_PAGE_TABLE | __pa(pud)))
 
 static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
 {
@@ -33,12 +35,37 @@ static inline pmd_t *pmd_alloc_one (stru
 	return (pmd_t *)get_zeroed_page(GFP_KERNEL|__GFP_REPEAT);
 }
 
-static inline pgd_t *pgd_alloc (struct mm_struct *mm)
+static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-	return (pgd_t *)get_zeroed_page(GFP_KERNEL|__GFP_REPEAT);
+	return (pud_t *)get_zeroed_page(GFP_KERNEL|__GFP_REPEAT);
 }
 
-static inline void pgd_free (pgd_t *pgd)
+static inline void pud_free (pud_t *pud)
+{
+	BUG_ON((unsigned long)pud & (PAGE_SIZE-1));
+	free_page((unsigned long)pud);
+}
+
+static inline pgd_t *pgd_alloc(struct mm_struct *mm)
+{
+	unsigned boundary;
+	pgd_t *pgd = (pgd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+	if (!pgd)
+		return NULL;
+	/*
+	 * Copy kernel pointers in from init.
+	 * Could keep a freelist or slab cache of those because the kernel
+	 * part never changes.
+	 */
+	boundary = pgd_index(__PAGE_OFFSET);
+	memset(pgd, 0, boundary * sizeof(pgd_t));
+	memcpy(pgd + boundary,
+	       init_level4_pgt + boundary,
+	       (PTRS_PER_PGD - boundary) * sizeof(pgd_t));
+	return pgd;
+}
+
+static inline void pgd_free(pgd_t *pgd)
 {
 	BUG_ON((unsigned long)pgd & (PAGE_SIZE-1));
 	free_page((unsigned long)pgd);
@@ -73,5 +100,6 @@ extern inline void pte_free(struct page 
 
 #define __pte_free_tlb(tlb,pte) tlb_remove_page((tlb),(pte))
 #define __pmd_free_tlb(tlb,x)   pmd_free(x)
+#define __pud_free_tlb(tlb,x)   pud_free(x)
 
 #endif /* _X86_64_PGALLOC_H */
diff -puN include/asm-x86_64/pgtable.h~4level-x86-64 include/asm-x86_64/pgtable.h
--- linux-2.6/include/asm-x86_64/pgtable.h~4level-x86-64	2004-12-18 17:03:12.000000000 +1100
+++ linux-2.6-npiggin/include/asm-x86_64/pgtable.h	2004-12-18 17:03:12.000000000 +1100
@@ -4,12 +4,6 @@
 /*
  * This file contains the functions and defines necessary to modify and use
  * the x86-64 page table tree.
- * 
- * x86-64 has a 4 level table setup. Generic linux MM only supports
- * three levels. The fourth level is currently a single static page that
- * is shared by everybody and just contains a pointer to the current
- * three level page setup on the beginning and some kernel mappings at 
- * the end. For more details see Documentation/x86_64/mm.txt
  */
 #include <asm/processor.h>
 #include <asm/fixmap.h>
@@ -17,15 +11,14 @@
 #include <linux/threads.h>
 #include <asm/pda.h>
 
-extern pgd_t level3_kernel_pgt[512];
-extern pgd_t level3_physmem_pgt[512];
-extern pgd_t level3_ident_pgt[512];
+extern pud_t level3_kernel_pgt[512];
+extern pud_t level3_physmem_pgt[512];
+extern pud_t level3_ident_pgt[512];
 extern pmd_t level2_kernel_pgt[512];
-extern pml4_t init_level4_pgt[];
-extern pgd_t boot_vmalloc_pgt[];
+extern pgd_t init_level4_pgt[];
 extern unsigned long __supported_pte_mask;
 
-#define swapper_pg_dir NULL
+#define swapper_pg_dir init_level4_pgt
 
 extern void paging_init(void);
 extern void clear_kernel_mapping(unsigned long addr, unsigned long size);
@@ -39,16 +32,19 @@ extern unsigned long pgkern_mask;
 extern unsigned long empty_zero_page[PAGE_SIZE/sizeof(unsigned long)];
 #define ZERO_PAGE(vaddr) (virt_to_page(empty_zero_page))
 
-#define PML4_SHIFT	39
-#define PTRS_PER_PML4	512
-
 /*
  * PGDIR_SHIFT determines what a top-level page table entry can map
  */
-#define PGDIR_SHIFT	30
+#define PGDIR_SHIFT	39
 #define PTRS_PER_PGD	512
 
 /*
+ * 3rd level page
+ */
+#define PUD_SHIFT	30
+#define PTRS_PER_PUD	512
+
+/*
  * PMD_SHIFT determines the size of the area a middle-level
  * page table can map
  */
@@ -64,14 +60,13 @@ extern unsigned long empty_zero_page[PAG
 	printk("%s:%d: bad pte %p(%016lx).\n", __FILE__, __LINE__, &(e), pte_val(e))
 #define pmd_ERROR(e) \
 	printk("%s:%d: bad pmd %p(%016lx).\n", __FILE__, __LINE__, &(e), pmd_val(e))
+#define pud_ERROR(e) \
+	printk("%s:%d: bad pud %p(%016lx).\n", __FILE__, __LINE__, &(e), pud_val(e))
 #define pgd_ERROR(e) \
 	printk("%s:%d: bad pgd %p(%016lx).\n", __FILE__, __LINE__, &(e), pgd_val(e))
 
-
-#define pml4_none(x)	(!pml4_val(x))
 #define pgd_none(x)	(!pgd_val(x))
-
-extern inline int pgd_present(pgd_t pgd)	{ return !pgd_none(pgd); }
+#define pud_none(x)	(!pud_val(x))
 
 static inline void set_pte(pte_t *dst, pte_t val)
 {
@@ -83,6 +78,16 @@ static inline void set_pmd(pmd_t *dst, p
         pmd_val(*dst) = pmd_val(val); 
 } 
 
+static inline void set_pud(pud_t *dst, pud_t val)
+{
+	pud_val(*dst) = pud_val(val);
+}
+
+extern inline void pud_clear (pud_t *pud)
+{
+	set_pud(pud, __pud(0));
+}
+
 static inline void set_pgd(pgd_t *dst, pgd_t val)
 {
 	pgd_val(*dst) = pgd_val(val); 
@@ -93,45 +98,30 @@ extern inline void pgd_clear (pgd_t * pg
 	set_pgd(pgd, __pgd(0));
 }
 
-static inline void set_pml4(pml4_t *dst, pml4_t val)
-{
-	pml4_val(*dst) = pml4_val(val); 
-}
-
-#define pgd_page(pgd) \
-((unsigned long) __va(pgd_val(pgd) & PHYSICAL_PAGE_MASK))
+#define pud_page(pud) \
+((unsigned long) __va(pud_val(pud) & PHYSICAL_PAGE_MASK))
 
 #define ptep_get_and_clear(xp)	__pte(xchg(&(xp)->pte, 0))
 #define pte_same(a, b)		((a).pte == (b).pte)
 
-#define PML4_SIZE	(1UL << PML4_SHIFT)
-#define PML4_MASK       (~(PML4_SIZE-1))
 #define PMD_SIZE	(1UL << PMD_SHIFT)
 #define PMD_MASK	(~(PMD_SIZE-1))
+#define PUD_SIZE	(1UL << PUD_SHIFT)
+#define PUD_MASK	(~(PUD_SIZE-1))
 #define PGDIR_SIZE	(1UL << PGDIR_SHIFT)
 #define PGDIR_MASK	(~(PGDIR_SIZE-1))
 
 #define USER_PTRS_PER_PGD	(TASK_SIZE/PGDIR_SIZE)
 #define FIRST_USER_PGD_NR	0
 
-#define USER_PGD_PTRS (PAGE_OFFSET >> PGDIR_SHIFT)
-#define KERNEL_PGD_PTRS (PTRS_PER_PGD-USER_PGD_PTRS)
-
-#define TWOLEVEL_PGDIR_SHIFT	20
-#define BOOT_USER_L4_PTRS 1
-#define BOOT_KERNEL_L4_PTRS 511	/* But we will do it in 4rd level */
-
-
-
 #ifndef __ASSEMBLY__
-#define VMALLOC_START    0xffffff0000000000UL
-#define VMALLOC_END      0xffffff7fffffffffUL
-#define MODULES_VADDR    0xffffffffa0000000UL
-#define MODULES_END      0xffffffffafffffffUL
+#define MAXMEM		 0x3fffffffffffUL
+#define VMALLOC_START    0xffffc20000000000UL
+#define VMALLOC_END      0xffffe1ffffffffffUL
+#define MODULES_VADDR    0xffffffff88000000
+#define MODULES_END      0xfffffffffff00000
 #define MODULES_LEN   (MODULES_END - MODULES_VADDR)
 
-#define IOMAP_START      0xfffffe8000000000UL
-
 #define _PAGE_BIT_PRESENT	0
 #define _PAGE_BIT_RW		1
 #define _PAGE_BIT_USER		2
@@ -222,6 +212,14 @@ static inline unsigned long pgd_bad(pgd_
        return val & ~(_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED);      
 } 
 
+static inline unsigned long pud_bad(pud_t pud)
+{
+       unsigned long val = pud_val(pud);
+       val &= ~PTE_MASK;
+       val &= ~(_PAGE_USER | _PAGE_DIRTY);
+       return val & ~(_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED);
+}
+
 #define pte_none(x)	(!pte_val(x))
 #define pte_present(x)	(pte_val(x) & (_PAGE_PRESENT | _PAGE_PROTNONE))
 #define pte_clear(xp)	do { set_pte(xp, __pte(0)); } while (0)
@@ -300,54 +298,32 @@ static inline int pmd_large(pmd_t pte) {
 
 /*
  * Level 4 access.
- * Never use these in the common code.
  */
-#define pml4_page(pml4) ((unsigned long) __va(pml4_val(pml4) & PTE_MASK))
-#define pml4_index(address) ((address >> PML4_SHIFT) & (PTRS_PER_PML4-1))
-#define pml4_offset_k(address) (init_level4_pgt + pml4_index(address))
-#define pml4_present(pml4) (pml4_val(pml4) & _PAGE_PRESENT)
-#define mk_kernel_pml4(address) ((pml4_t){ (address) | _KERNPG_TABLE })
-#define level3_offset_k(dir, address) ((pgd_t *) pml4_page(*(dir)) + pgd_index(address))
+#define pgd_page(pgd) ((unsigned long) __va((unsigned long)pgd_val(pgd) & PTE_MASK))
+#define pgd_index(address) (((address) >> PGDIR_SHIFT) & (PTRS_PER_PGD-1))
+#define pgd_offset(mm, addr) ((mm)->pgd + pgd_index(addr))
+#define pgd_offset_k(address) (init_level4_pgt + pgd_index(address))
+#define pgd_present(pgd) (pgd_val(pgd) & _PAGE_PRESENT)
+#define mk_kernel_pgd(address) ((pgd_t){ (address) | _KERNPG_TABLE })
 
-/* PGD - Level3 access */
+/* PUD - Level3 access */
 /* to find an entry in a page-table-directory. */
-#define pgd_index(address) ((address >> PGDIR_SHIFT) & (PTRS_PER_PGD-1))
-static inline pgd_t *__pgd_offset_k(pgd_t *pgd, unsigned long address)
+#define pud_index(address) (((address) >> PUD_SHIFT) & (PTRS_PER_PUD-1))
+#define pud_offset(pgd, address) ((pud_t *) pgd_page(*(pgd)) + pud_index(address))
+#define pud_offset_k(pgd, addr) pud_offset(pgd, addr)
+#define pud_present(pud) (pud_val(pud) & _PAGE_PRESENT)
+
+static inline pud_t *__pud_offset_k(pud_t *pud, unsigned long address)
 { 
-	return pgd + pgd_index(address);
+	return pud + pud_index(address);
 } 
 
-/* Find correct pgd via the hidden fourth level page level: */
-
-/* This accesses the reference page table of the boot cpu. 
-   Other CPUs get synced lazily via the page fault handler. */
-static inline pgd_t *pgd_offset_k(unsigned long address)
-{
-	unsigned long addr;
-
-	addr = pml4_val(init_level4_pgt[pml4_index(address)]);
-	addr &= PHYSICAL_PAGE_MASK;
-	return __pgd_offset_k((pgd_t *)__va(addr), address);
-}
-
-/* Access the pgd of the page table as seen by the current CPU. */ 
-static inline pgd_t *current_pgd_offset_k(unsigned long address)
-{
-	unsigned long addr;
-
-	addr = read_pda(level4_pgt)[pml4_index(address)];
-	addr &= PHYSICAL_PAGE_MASK;
-	return __pgd_offset_k((pgd_t *)__va(addr), address);
-}
-
-#define pgd_offset(mm, address) ((mm)->pgd+pgd_index(address))
-
 /* PMD  - Level 2 access */
 #define pmd_page_kernel(pmd) ((unsigned long) __va(pmd_val(pmd) & PTE_MASK))
 #define pmd_page(pmd)		(pfn_to_page(pmd_val(pmd) >> PAGE_SHIFT))
 
 #define pmd_index(address) (((address) >> PMD_SHIFT) & (PTRS_PER_PMD-1))
-#define pmd_offset(dir, address) ((pmd_t *) pgd_page(*(dir)) + \
+#define pmd_offset(dir, address) ((pmd_t *) pud_page(*(dir)) + \
 			pmd_index(address))
 #define pmd_none(x)	(!pmd_val(x))
 #define pmd_present(x)	(pmd_val(x) & _PAGE_PRESENT)
diff -puN include/asm-x86_64/processor.h~4level-x86-64 include/asm-x86_64/processor.h
--- linux-2.6/include/asm-x86_64/processor.h~4level-x86-64	2004-12-18 17:03:12.000000000 +1100
+++ linux-2.6-npiggin/include/asm-x86_64/processor.h	2004-12-18 17:03:12.000000000 +1100
@@ -165,9 +165,9 @@ static inline void clear_in_cr4 (unsigne
 
 
 /*
- * User space process size: 512GB - 1GB (default).
+ * User space process size: 47 bits.
  */
-#define TASK_SIZE	(0x0000007fc0000000UL)
+#define TASK_SIZE	(0x800000000000)
 
 /* This decides where the kernel will search for a free chunk of vm
  * space during mmap's.
diff -puN arch/x86_64/kernel/reboot.c~4level-x86-64 arch/x86_64/kernel/reboot.c
--- linux-2.6/arch/x86_64/kernel/reboot.c~4level-x86-64	2004-12-18 17:03:12.000000000 +1100
+++ linux-2.6-npiggin/arch/x86_64/kernel/reboot.c	2004-12-18 17:03:12.000000000 +1100
@@ -74,7 +74,7 @@ static void reboot_warm(void)
 	local_irq_disable(); 
 		
 	/* restore identity mapping */
-	init_level4_pgt[0] = __pml4(__pa(level3_ident_pgt) | 7); 
+	init_level4_pgt[0] = __pgd(__pa(level3_ident_pgt) | 7); 
 	__flush_tlb_all(); 
 
 	/* Move the trampoline to low memory */

_


* [PATCH 10/10] alternate 4-level page tables patches
  2004-12-18  7:00                 ` [PATCH 9/10] " Nick Piggin
@ 2004-12-18  7:01                   ` Nick Piggin
  2004-12-18  7:31                     ` Andi Kleen
  0 siblings, 1 reply; 77+ messages in thread
From: Nick Piggin @ 2004-12-18  7:01 UTC (permalink / raw)
  To: Linux Memory Management
  Cc: Andi Kleen, Hugh Dickins, Linus Torvalds, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 6 bytes --]

10/10

[-- Attachment #2: mm-inline-ptbl-walkers.patch --]
[-- Type: text/plain, Size: 5888 bytes --]



Convert some pagetable walking functions over to be inline where
they are only used once. This is worth a percent or so on lmbench
fork.

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>


---

 linux-2.6-npiggin/mm/memory.c   |    8 ++++----
 linux-2.6-npiggin/mm/msync.c    |    4 ++--
 linux-2.6-npiggin/mm/swapfile.c |    6 +++---
 linux-2.6-npiggin/mm/vmalloc.c  |   12 ++++++------
 4 files changed, 15 insertions(+), 15 deletions(-)

diff -puN mm/memory.c~mm-inline-ptbl-walkers mm/memory.c
--- linux-2.6/mm/memory.c~mm-inline-ptbl-walkers	2004-12-18 17:47:33.000000000 +1100
+++ linux-2.6-npiggin/mm/memory.c	2004-12-18 17:48:14.000000000 +1100
@@ -462,7 +462,7 @@ int copy_page_range(struct mm_struct *ds
 	return err;
 }
 
-static void zap_pte_range(struct mmu_gather *tlb,
+static inline void zap_pte_range(struct mmu_gather *tlb,
 		pmd_t *pmd, unsigned long address,
 		unsigned long size, struct zap_details *details)
 {
@@ -545,7 +545,7 @@ static void zap_pte_range(struct mmu_gat
 	pte_unmap(ptep-1);
 }
 
-static void zap_pmd_range(struct mmu_gather *tlb,
+static inline void zap_pmd_range(struct mmu_gather *tlb,
 		pud_t *pud, unsigned long address,
 		unsigned long size, struct zap_details *details)
 {
@@ -570,7 +570,7 @@ static void zap_pmd_range(struct mmu_gat
 	} while (address && (address < end));
 }
 
-static void zap_pud_range(struct mmu_gather *tlb,
+static inline void zap_pud_range(struct mmu_gather *tlb,
 		pgd_t * pgd, unsigned long address,
 		unsigned long end, struct zap_details *details)
 {
@@ -973,7 +973,7 @@ out:
 
 EXPORT_SYMBOL(get_user_pages);
 
-static void zeromap_pte_range(pte_t * pte, unsigned long address,
+static inline void zeromap_pte_range(pte_t * pte, unsigned long address,
                                      unsigned long size, pgprot_t prot)
 {
 	unsigned long end;
diff -puN mm/msync.c~mm-inline-ptbl-walkers mm/msync.c
--- linux-2.6/mm/msync.c~mm-inline-ptbl-walkers	2004-12-18 17:47:33.000000000 +1100
+++ linux-2.6-npiggin/mm/msync.c	2004-12-18 17:47:33.000000000 +1100
@@ -21,7 +21,7 @@
  * Called with mm->page_table_lock held to protect against other
  * threads/the swapper from ripping pte's out from under us.
  */
-static int filemap_sync_pte(pte_t *ptep, struct vm_area_struct *vma,
+static inline int filemap_sync_pte(pte_t *ptep, struct vm_area_struct *vma,
 	unsigned long address, unsigned int flags)
 {
 	pte_t pte = *ptep;
@@ -38,7 +38,7 @@ static int filemap_sync_pte(pte_t *ptep,
 	return 0;
 }
 
-static int filemap_sync_pte_range(pmd_t * pmd,
+static inline int filemap_sync_pte_range(pmd_t * pmd,
 	unsigned long address, unsigned long end, 
 	struct vm_area_struct *vma, unsigned int flags)
 {
diff -puN mm/swapfile.c~mm-inline-ptbl-walkers mm/swapfile.c
--- linux-2.6/mm/swapfile.c~mm-inline-ptbl-walkers	2004-12-18 17:47:33.000000000 +1100
+++ linux-2.6-npiggin/mm/swapfile.c	2004-12-18 17:47:33.000000000 +1100
@@ -427,7 +427,7 @@ void free_swap_and_cache(swp_entry_t ent
  * what to do if a write is requested later.
  */
 /* vma->vm_mm->page_table_lock is held */
-static void
+static inline void
 unuse_pte(struct vm_area_struct *vma, unsigned long address, pte_t *dir,
 	swp_entry_t entry, struct page *page)
 {
@@ -439,7 +439,7 @@ unuse_pte(struct vm_area_struct *vma, un
 }
 
 /* vma->vm_mm->page_table_lock is held */
-static unsigned long unuse_pmd(struct vm_area_struct * vma, pmd_t *dir,
+static inline unsigned long unuse_pmd(struct vm_area_struct * vma, pmd_t *dir,
 	unsigned long address, unsigned long size, unsigned long offset,
 	swp_entry_t entry, struct page *page)
 {
@@ -486,7 +486,7 @@ static unsigned long unuse_pmd(struct vm
 }
 
 /* vma->vm_mm->page_table_lock is held */
-static unsigned long unuse_pud(struct vm_area_struct * vma, pud_t *pud,
+static inline unsigned long unuse_pud(struct vm_area_struct * vma, pud_t *pud,
         unsigned long address, unsigned long size, unsigned long offset,
 	swp_entry_t entry, struct page *page)
 {
diff -puN mm/vmalloc.c~mm-inline-ptbl-walkers mm/vmalloc.c
--- linux-2.6/mm/vmalloc.c~mm-inline-ptbl-walkers	2004-12-18 17:47:33.000000000 +1100
+++ linux-2.6-npiggin/mm/vmalloc.c	2004-12-18 17:47:33.000000000 +1100
@@ -23,7 +23,7 @@
 rwlock_t vmlist_lock = RW_LOCK_UNLOCKED;
 struct vm_struct *vmlist;
 
-static void unmap_area_pte(pmd_t *pmd, unsigned long address,
+static inline void unmap_area_pte(pmd_t *pmd, unsigned long address,
 				  unsigned long size)
 {
 	unsigned long end;
@@ -56,7 +56,7 @@ static void unmap_area_pte(pmd_t *pmd, u
 	} while (address < end);
 }
 
-static void unmap_area_pmd(pud_t *pud, unsigned long address,
+static inline void unmap_area_pmd(pud_t *pud, unsigned long address,
 				  unsigned long size)
 {
 	unsigned long end;
@@ -83,7 +83,7 @@ static void unmap_area_pmd(pud_t *pud, u
 	} while (address < end);
 }
 
-static void unmap_area_pud(pgd_t *pgd, unsigned long address,
+static inline void unmap_area_pud(pgd_t *pgd, unsigned long address,
 			   unsigned long size)
 {
 	pud_t *pud;
@@ -110,7 +110,7 @@ static void unmap_area_pud(pgd_t *pgd, u
 	} while (address && (address < end));
 }
 
-static int map_area_pte(pte_t *pte, unsigned long address,
+static inline int map_area_pte(pte_t *pte, unsigned long address,
 			       unsigned long size, pgprot_t prot,
 			       struct page ***pages)
 {
@@ -135,7 +135,7 @@ static int map_area_pte(pte_t *pte, unsi
 	return 0;
 }
 
-static int map_area_pmd(pmd_t *pmd, unsigned long address,
+static inline int map_area_pmd(pmd_t *pmd, unsigned long address,
 			       unsigned long size, pgprot_t prot,
 			       struct page ***pages)
 {
@@ -160,7 +160,7 @@ static int map_area_pmd(pmd_t *pmd, unsi
 	return 0;
 }
 
-static int map_area_pud(pud_t *pud, unsigned long address,
+static inline int map_area_pud(pud_t *pud, unsigned long address,
 			       unsigned long end, pgprot_t prot,
 			       struct page ***pages)
 {

_


* Re: [PATCH 10/10] alternate 4-level page tables patches
  2004-12-18  7:01                   ` [PATCH 10/10] " Nick Piggin
@ 2004-12-18  7:31                     ` Andi Kleen
  2004-12-18  7:46                       ` Nick Piggin
                                         ` (2 more replies)
  0 siblings, 3 replies; 77+ messages in thread
From: Andi Kleen @ 2004-12-18  7:31 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linux Memory Management, Andi Kleen, Hugh Dickins,
	Linus Torvalds, Andrew Morton

On Sat, Dec 18, 2004 at 06:01:05PM +1100, Nick Piggin wrote:
> 10/10

> 
> 
> Convert some pagetable walking functions over to be inline where
> they are only used once. This is worth a percent or so on lmbench
> fork.

Any modern gcc (3.4+ or 3.3-hammer) should use unit-at-a-time anyways,
which automatically inlines all static functions that are only used once.

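(Illustrative sketch only, not from any posted patch -- with
unit-at-a-time gcc sees the whole translation unit before emitting
code, so a static function with a single caller can be folded in
without any "inline" keyword; the function names here are made up:)

	static int add_one(int x)	/* called exactly once below */
	{
		return x + 1;
	}

	int caller(int x)
	{
		/* -funit-at-a-time: gcc inlines add_one() here by
		   itself; -fno-unit-at-a-time emits a real call */
		return add_one(x) * 2;
	}
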
I like it because during debugging you can turn it off, which makes
it much easier to read oopses when not everything is inlined.  And
when turned on it generates much smaller and faster code, as you've
shown.

Ok except on i386 where someone decided to explicitly turn it off
all the time :/

I've been reenabling it on the suse kernel for a long time because
it doesn't seem to have any bad side effects and makes the code
considerably smaller.  It would be better to just turn it on in mainline 
again, then you'll see much more gain everywhere.

BTW we can do much better with all the page table walking by
adding some bitmaps of used ptes to struct page and skipping
holes quickly. DaveM has a patch for that in the queue; I hope a patch
similar to his can be added once 4level page tables are in.

-Andi

Here's the patch: 

Enable unit-at-a-time by default. At least with 3.3-hammer and 3.4 
it seems to work just fine. Has been tested with 3.3-hammer over
several suse releases.

Signed-off-by: Andi Kleen <ak@suse.de>

diff -u linux-2.6.10rc2-time/arch/i386/Makefile-o linux-2.6.10rc2-time/arch/i386/Makefile
--- linux-2.6.10rc2-time/arch/i386/Makefile-o	2004-11-15 12:34:25.000000000 +0100
+++ linux-2.6.10rc2-time/arch/i386/Makefile	2004-12-18 08:27:14.000000000 +0100
@@ -57,9 +57,8 @@
 GCC_VERSION			:= $(call cc-version)
 cflags-$(CONFIG_REGPARM) 	+= $(shell if [ $(GCC_VERSION) -ge 0300 ] ; then echo "-mregparm=3"; fi ;)
 
-# Disable unit-at-a-time mode, it makes gcc use a lot more stack
-# due to the lack of sharing of stacklots.
-CFLAGS += $(call cc-option,-fno-unit-at-a-time)
+# Enable unit-at-a-time mode. It generates considerably smaller code.
+CFLAGS += $(call cc-option,-funit-at-a-time)
 
 CFLAGS += $(cflags-y)
 

* Re: [PATCH 10/10] alternate 4-level page tables patches
  2004-12-18  7:31                     ` Andi Kleen
@ 2004-12-18  7:46                       ` Nick Piggin
  2004-12-18  8:08                       ` Andrew Morton
  2004-12-18 19:06                       ` Linus Torvalds
  2 siblings, 0 replies; 77+ messages in thread
From: Nick Piggin @ 2004-12-18  7:46 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linux Memory Management, Hugh Dickins, Linus Torvalds, Andrew Morton

Andi Kleen wrote:
> On Sat, Dec 18, 2004 at 06:01:05PM +1100, Nick Piggin wrote:
> 
>>10/10
> 
> 
>>
>>Convert some pagetable walking functions over to be inline where
>>they are only used once. This is worth a percent or so on lmbench
>>fork.
> 
> 
> Any modern gcc (3.4+ or 3.3-hammer) should use unit-at-a-time anyways,
> which automatically inlines all static functions that are only used once.
> 
> I like it because during debugging you can turn it off, which makes
> it much easier to read oopses when not everything is inlined.  And
> when turned on it generates much smaller and faster code, as you've
> shown.
> 

Yep, that makes a lot of sense.

> Ok except on i386 where someone decided to explicitly turn it off
> all the time :/
> 
> I've been reenabling it on the suse kernel for a long time because
> it doesn't seem to have any bad side effects and makes the code
> considerably smaller.  It would be better to just turn it on in mainline 
> again, then you'll see much more gain everywhere.
> 

I won't get into this argument ;)

But I'll just say that this inline patch isn't very important at all -
it seems to be worth about 1% at best.

> BTW we can do much better with all the page table walking by
> adding some bitmaps of used ptes to struct page and skipping
> holes quickly. DaveM has a patch for that in the queue; I hope a patch
> similar to his can be added once 4level page tables are in.
> 

Hmm, haven't seen them. Would be interesting - I guess you can get
a pretty big cache saving by testing a single bit rather than a
full word, assuming the common case is pretty sparse. I wonder how
it goes in practice?

* Re: [PATCH 10/10] alternate 4-level page tables patches
  2004-12-18  7:31                     ` Andi Kleen
  2004-12-18  7:46                       ` Nick Piggin
@ 2004-12-18  8:08                       ` Andrew Morton
  2004-12-18  9:48                         ` Andi Kleen
  2004-12-18 19:06                       ` Linus Torvalds
  2 siblings, 1 reply; 77+ messages in thread
From: Andrew Morton @ 2004-12-18  8:08 UTC (permalink / raw)
  To: Andi Kleen; +Cc: nickpiggin, linux-mm, hugh, torvalds

Andi Kleen <ak@suse.de> wrote:
>
>  Enable unit-at-a-time by default. At least with 3.3-hammer and 3.4 
>  it seems to work just fine. Has been tested with 3.3-hammer over
>  several suse releases.

iirc, we turned this off because the compiler would go nuts inlining things
and would consume too much stack:

	void a(void)
	{
		int a1;
	}

	void b(void)
	{
		int b1;
	}

	void c(void)
	{
		a();
		b();
	}

Here, c() will allocate eight bytes for locals.

If gcc ever fixes up the unionisation-of-local-storage thing then we'll be
OK.


* Re: [PATCH 4/10] alternate 4-level page tables patches
  2004-12-18  6:57       ` [PATCH 4/10] " Nick Piggin
  2004-12-18  6:58         ` [PATCH 5/10] " Nick Piggin
@ 2004-12-18  9:05         ` Nick Piggin
  2004-12-18  9:50           ` Andi Kleen
  2004-12-18 10:45         ` William Lee Irwin III
  2 siblings, 1 reply; 77+ messages in thread
From: Nick Piggin @ 2004-12-18  9:05 UTC (permalink / raw)
  To: Linux Memory Management
  Cc: Andi Kleen, Hugh Dickins, Linus Torvalds, Andrew Morton

Nick Piggin wrote:
> 4/10
> 
> 
> ------------------------------------------------------------------------
> 
> 
> 
> Rename clear_page_tables to clear_page_range. clear_page_range takes byte
> ranges, and aggressively frees page table pages. Maybe useful to control
> page table memory consumption on 4-level architectures (and even 3 level
> ones).
> 

I maybe didn't do this patch justice by hiding it away in this series.
It may be worthy of its own thread - surely there must be some significant
downsides if nobody had implemented it in the past (or maybe it's just
a case of "that doesn't happen much").

Anyway, to show off its best case: start 100 processes that each allocate
1GB of memory, touch all pages, then free it (but don't exit). Do that on
i386 with PAE (but most any 3+ level setup will be more or less vulnerable
to the same problem).

npiggin@intel:~/tests/pte$ grep PageTables meminfo.100*
meminfo.100:PageTables:     181228 kB
meminfo.100.optimized:PageTables:       2476 kB

You see, the 1GB we've allocated isn't perfectly PGDIR aligned, so none of
the page tables can get freed. So it is a potentially significant saving
in some cases.
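
(Back-of-envelope check of those numbers, assuming 3-level i386 PAE:
each pte page maps 512 entries * 4k = 2MB, so keeping 1GB of formerly
mapped address space around pins about 512 pte pages = 2MB of page
tables per process; times 100 processes that is roughly 200MB, which
lines up with the ~181MB PageTables figure above.)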

* Re: [PATCH 10/10] alternate 4-level page tables patches
  2004-12-18  8:08                       ` Andrew Morton
@ 2004-12-18  9:48                         ` Andi Kleen
  0 siblings, 0 replies; 77+ messages in thread
From: Andi Kleen @ 2004-12-18  9:48 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andi Kleen, nickpiggin, linux-mm, hugh, torvalds

On Sat, Dec 18, 2004 at 12:08:41AM -0800, Andrew Morton wrote:
> Andi Kleen <ak@suse.de> wrote:
> >
> >  Enable unit-at-a-time by default. At least with 3.3-hammer and 3.4 
> >  it seems to work just fine. Has been tested with 3.3-hammer over
> >  several suse releases.
> 
> iirc, we turned this off because the compiler would go nuts inlining things
> and would consume too much stack:

I haven't had any report where this really happened with 3.3-hammer.

And in general, if it happens in only one or two places, it should
be fixed there with a few strategic "noinline"s.
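
(Hedged sketch of that escape hatch -- "noinline" here stands for
__attribute__((noinline)), and memset/memcpy are assumed from
<linux/string.h>; the function itself is made up:)

	/* keep a stack-heavy helper out of line so its 512 bytes of
	   locals are not merged into every caller's frame */
	static noinline void fill_buffer(char *out, unsigned long len)
	{
		char buf[512];

		memset(buf, 0xff, sizeof(buf));
		if (len > sizeof(buf))
			len = sizeof(buf);
		memcpy(out, buf, len);
	}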

-Andi

* Re: [PATCH 4/10] alternate 4-level page tables patches
  2004-12-18  9:05         ` [PATCH 4/10] " Nick Piggin
@ 2004-12-18  9:50           ` Andi Kleen
  2004-12-18 10:06             ` Nick Piggin
  0 siblings, 1 reply; 77+ messages in thread
From: Andi Kleen @ 2004-12-18  9:50 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linux Memory Management, Andi Kleen, Hugh Dickins,
	Linus Torvalds, Andrew Morton

On Sat, Dec 18, 2004 at 08:05:26PM +1100, Nick Piggin wrote:
> Nick Piggin wrote:
> >4/10
> >
> >
> >------------------------------------------------------------------------
> >
> >
> >
> >Rename clear_page_tables to clear_page_range. clear_page_range takes byte
> >ranges, and aggressively frees page table pages. Maybe useful to control
> >page table memory consumption on 4-level architectures (and even 3 level
> >ones).
> >
> 
> I maybe didn't do this patch justice by hiding it away in this series.
> It may be worthy of its own thread - surely there must be some significant
> downsides if nobody had implemented it in the past (or maybe it's just
> a case of "that doesn't happen much").

Yes, more could be done in this area. When I did 4level I just tried
to keep the same semantics without optimizing anything.

Another way I thought about was to have a reference count of the used
ptes/pmds per page table page in struct page and free the page when it goes 
to zero. That would give perfect garbage collection. Drawback is that
it may be a bit intrusive again.
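
As a stand-alone model of what I mean (user-space C with made-up names,
just to illustrate the cascade - in the kernel the count would live in
struct page):

	#include <stdlib.h>

	/* Each directory counts its populated slots; clearing the last
	 * slot frees the directory and clears its slot in the parent,
	 * so frees cascade up through the levels by themselves. */
	struct pt_dir {
		struct pt_dir *parent;
		int slot_in_parent;
		int nr_present;			/* the per-page refcount */
		void *slot[512];
	};

	static void pt_set(struct pt_dir *d, int slot, void *child)
	{
		if (!d->slot[slot])
			d->nr_present++;	/* counted at populate time */
		d->slot[slot] = child;
	}

	static void pt_clear(struct pt_dir *d, int slot)
	{
		d->slot[slot] = NULL;
		if (--d->nr_present == 0 && d->parent) {
			struct pt_dir *parent = d->parent;
			int idx = d->slot_in_parent;

			free(d);		/* the page goes back... */
			pt_clear(parent, idx);	/* ...and the free cascades */
		}
	}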

-Andi


* Re: [PATCH 4/10] alternate 4-level page tables patches
  2004-12-18  9:50           ` Andi Kleen
@ 2004-12-18 10:06             ` Nick Piggin
  2004-12-18 10:11               ` Andi Kleen
                                 ` (2 more replies)
  0 siblings, 3 replies; 77+ messages in thread
From: Nick Piggin @ 2004-12-18 10:06 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linux Memory Management, Hugh Dickins, Linus Torvalds, Andrew Morton

Andi Kleen wrote:
> On Sat, Dec 18, 2004 at 08:05:26PM +1100, Nick Piggin wrote:
> 
>>Nick Piggin wrote:
>>
>>>4/10
>>>
>>>
>>>------------------------------------------------------------------------
>>>
>>>
>>>
>>>Rename clear_page_tables to clear_page_range. clear_page_range takes byte
>>>ranges, and aggressively frees page table pages. Maybe useful to control
>>>page table memory consumption on 4-level architectures (and even 3 level
>>>ones).
>>>
>>
>>I maybe didn't do this patch justice by hiding it away in this series.
>>It may be worthy of its own thread - surely there must be some significant
>>downsides if nobody has implemented it in the past (or maybe it's just
>>a case of "that doesn't happen much").
> 
> 
> Yes, more could be done in this area. When I did 4level I just tried
> to keep the same semantics without optimizing anything.
> 

Sure - and we can look at it further later (we may even include a variant
of this patch in your 4level patches for example).

I just noticed it wasn't too difficult code-wise to implement, and Hugh
thought it might be worthwhile... hence I included it in this patchset.
Maybe a bit rude of me to change behaviour in the middle of a 4level
patchset though ;)

> Another way I thought about was to have a reference count of the used
> ptes/pmds per page table page in struct page and free the page when it goes 
> to zero. That would give perfect garbage collection. Drawback is that
> it may be a bit intrusive again.
> 

Yes I thought about that a bit too.

Note that this (4/10) patch should give perfect garbage collection too
(modulo bugs). The difference is in where the overheads lie. I suspect
refcounting may be too much overhead (at least, SMP overhead); especially
in light of Christoph's results.

Although I think it would enable you to do page table reclaim when
reclaiming mapped, file backed pages quite easily... but I'm not sure if
that is enough to offset the slowdowns.

* Re: [PATCH 4/10] alternate 4-level page tables patches
  2004-12-18 10:06             ` Nick Piggin
@ 2004-12-18 10:11               ` Andi Kleen
  2004-12-18 10:22               ` Nick Piggin
  2004-12-18 11:06               ` William Lee Irwin III
  2 siblings, 0 replies; 77+ messages in thread
From: Andi Kleen @ 2004-12-18 10:11 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, Linux Memory Management, Hugh Dickins,
	Linus Torvalds, Andrew Morton

> Note that this (4/10) patch should give perfect garbage collection too
> (modulo bugs). The difference is in where the overheads lie. I suspect
> refcounting may be too much overhead (at least, SMP overhead); especially
> in light of Christoph's results.

Not sure - walking a lot of page tables is certainly worse. That is
why the current code is so simple-minded - it tries to avoid walking
too much.

-Andi

* Re: [PATCH 4/10] alternate 4-level page tables patches
  2004-12-18 10:06             ` Nick Piggin
  2004-12-18 10:11               ` Andi Kleen
@ 2004-12-18 10:22               ` Nick Piggin
  2004-12-18 10:29                 ` Nick Piggin
  2004-12-18 11:06               ` William Lee Irwin III
  2 siblings, 1 reply; 77+ messages in thread
From: Nick Piggin @ 2004-12-18 10:22 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linux Memory Management, Hugh Dickins, Linus Torvalds, Andrew Morton

Nick Piggin wrote:
> Andi Kleen wrote:
> 

>> Another way I thought about was to have a reference count of the used
>> ptes/pmds per page table page in struct page and free the page when it 
>> goes to zero. That would give perfect garbage collection. Drawback is 
>> that
>> it may be a bit intrusive again.
>>
> 
> Yes I thought about that a bit too.
> 
> Note that this (4/10) patch should give perfect garbage collection too
> (modulo bugs). The difference is in where the overheads lie. I suspect
> refcounting may be too much overhead (at least, SMP overhead); especially
> in light of Christoph's results.
> 

Hmm... you could refcount just the pud and pmd directories, and
have clear_page_range just expand to cover the nearest pte-page-sized
hole...

That way, clear_page_range can drive perfect pmd_page (lowest
level directory) freeing, and that will drive refcount garbage
collection of the next level directory, which will cascade
up(down?)wards to all levels.

This should have minimal overhead in the fault paths, while also
lowering unmap overhead (maybe even more than it is today).

But there is no point rushing into things... as you said, we can
look into more optimisations after the base is solid.


* Re: [PATCH 4/10] alternate 4-level page tables patches
  2004-12-18 10:22               ` Nick Piggin
@ 2004-12-18 10:29                 ` Nick Piggin
  0 siblings, 0 replies; 77+ messages in thread
From: Nick Piggin @ 2004-12-18 10:29 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linux Memory Management, Hugh Dickins, Linus Torvalds, Andrew Morton

Nick Piggin wrote:
> Nick Piggin wrote:
> 
>> Andi Kleen wrote:
>>
> 
>>> Another way I thought about was to have a reference count of the used
>>> ptes/pmds per page table page in struct page and free the page when 
>>> it goes to zero. That would give perfect garbage collection. Drawback 
>>> is that
>>> it may be a bit intrusive again.
>>>
>>
>> Yes I thought about that a bit too.
>>
>> Note that this (4/10) patch should give perfect garbage collection too
>> (modulo bugs). The difference is in where the overheads lie. I suspect
>> refcounting may be too much overhead (at least, SMP overhead); especially
>> in light of Christoph's results.
>>
> 
> Hmm... you could refcount just the pud and pmd directories, and

.. or to hide my bias, the pgd and pmd!

My thinking is that populating these directories should happen
infrequently enough that an atomic counter wouldn't show up on the
radar, for either single-threaded or massively multithreaded
performance.

In the current fully ptl-locked scheme, it wouldn't even have to be
an atomic counter...

* Re: [PATCH 4/10] alternate 4-level page tables patches
  2004-12-18  6:57       ` [PATCH 4/10] " Nick Piggin
  2004-12-18  6:58         ` [PATCH 5/10] " Nick Piggin
  2004-12-18  9:05         ` [PATCH 4/10] " Nick Piggin
@ 2004-12-18 10:45         ` William Lee Irwin III
  2004-12-18 10:58           ` Nick Piggin
  2 siblings, 1 reply; 77+ messages in thread
From: William Lee Irwin III @ 2004-12-18 10:45 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linux Memory Management, Andi Kleen, Hugh Dickins,
	Linus Torvalds, Andrew Morton

On Sat, Dec 18, 2004 at 05:57:12PM +1100, Nick Piggin wrote:
> Rename clear_page_tables to clear_page_range. clear_page_range takes byte
> ranges, and aggressively frees page table pages. Maybe useful to control
> page table memory consumption on 4-level architectures (and even 3 level
> ones).
> Possible downsides are:
> - flush_tlb_pgtables gets called more often (only a problem for sparc64
>   AFAIKS).
> - the opportunistic "expand to fill PGDIR_SIZE hole" logic that ensures
>   something actually gets done under the old system is still in place.
>   This could sometimes make unmapping small regions more inefficient. There
>   are some other solutions to look at if this is the case though.
> Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>

I wrote something equivalent to this in September, but dropped it on the
floor after some private replies etc. indicated no one gave a damn about
the testcase I posted in Message-ID: <20040908110718.GX3106@holomorphy.com>
where the pagetable leak fooled the OOM killer into shooting the wrong
processes long enough to trigger a panic() in oom_kill.c


-- wli

* Re: [PATCH 4/10] alternate 4-level page tables patches
  2004-12-18 10:45         ` William Lee Irwin III
@ 2004-12-18 10:58           ` Nick Piggin
  0 siblings, 0 replies; 77+ messages in thread
From: Nick Piggin @ 2004-12-18 10:58 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Linux Memory Management, Andi Kleen, Hugh Dickins,
	Linus Torvalds, Andrew Morton

William Lee Irwin III wrote:
> On Sat, Dec 18, 2004 at 05:57:12PM +1100, Nick Piggin wrote:
> 
>>Rename clear_page_tables to clear_page_range. clear_page_range takes byte
>>ranges, and aggressively frees page table pages. Maybe useful to control
>>page table memory consumption on 4-level architectures (and even 3 level
>>ones).
>>Possible downsides are:
>>- flush_tlb_pgtables gets called more often (only a problem for sparc64
>>  AFAIKS).
>>- the opportunistic "expand to fill PGDIR_SIZE hole" logic that ensures
>>  something actually gets done under the old system is still in place.
>>  This could sometimes make unmapping small regions more inefficient. There
>>  are some other solutions to look at if this is the case though.
>>Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
> 
> 
> I wrote something equivalent to this in September, but dropped it on the
> floor after some private replies etc. indicated no one gave a damn about
> the testcase I posted in Message-ID: <20040908110718.GX3106@holomorphy.com>
> where the pagetable leak fooled the OOM killer into shooting the wrong
> processes long enough to trigger a panic() in oom_kill.c
> 
> 

Hmm... I actually didn't see your post, but it seems like something
worth pursuing.

My inkling is that the hybrid expand-to-hole / refcounting approach
I outlined earlier in the thread might be the way to go.

I'll be interested to hear what the powers that be think.

* Re: [PATCH 4/10] alternate 4-level page tables patches
  2004-12-18 10:06             ` Nick Piggin
  2004-12-18 10:11               ` Andi Kleen
  2004-12-18 10:22               ` Nick Piggin
@ 2004-12-18 11:06               ` William Lee Irwin III
  2004-12-18 11:17                 ` Nick Piggin
  2 siblings, 1 reply; 77+ messages in thread
From: William Lee Irwin III @ 2004-12-18 11:06 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, Linux Memory Management, Hugh Dickins,
	Linus Torvalds, Andrew Morton

On Sat, Dec 18, 2004 at 09:06:29PM +1100, Nick Piggin wrote:
> Yes I thought about that a bit too.
> Note that this (4/10) patch should give perfect garbage collection too
> (modulo bugs). The difference is in where the overheads lie. I suspect
> refcounting may be too much overhead (at least, SMP overhead); especially
> in light of Christoph's results.

If this were so, then clear_page_tables() during process destruction
would be unnecessary. detach_vmas_to_be_unmapped() makes additional
work for such schemes, but even improvements are still rather helpful.


On Sat, Dec 18, 2004 at 09:06:29PM +1100, Nick Piggin wrote:
> Although I think it would enable you to do page table reclaim when
> reclaiming mapped, file backed pages quite easily... but I'm not sure if
> that is enough to offset the slowdowns.

That would be a far more appropriate response to high multiprogramming
levels than what is now done.


-- wli

* Re: [PATCH 4/10] alternate 4-level page tables patches
  2004-12-18 11:06               ` William Lee Irwin III
@ 2004-12-18 11:17                 ` Nick Piggin
  2004-12-18 11:32                   ` William Lee Irwin III
  0 siblings, 1 reply; 77+ messages in thread
From: Nick Piggin @ 2004-12-18 11:17 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Andi Kleen, Linux Memory Management, Hugh Dickins,
	Linus Torvalds, Andrew Morton

William Lee Irwin III wrote:
> On Sat, Dec 18, 2004 at 09:06:29PM +1100, Nick Piggin wrote:
> 
>>Yes I thought about that a bit too.
>>Note that this (4/10) patch should give perfect garbage collection too
>>(modulo bugs). The difference is in where the overheads lie. I suspect
>>refcounting may be too much overhead (at least, SMP overhead); especially
>>in light of Christoph's results.
> 
> 
> If this were so, then clear_page_tables() during process destruction
> would be unnecessary. detach_vmas_to_be_unmapped() makes additional
> work for such schemes, but even improvements are still rather helpful.
> 

If what were so?

> 
> On Sat, Dec 18, 2004 at 09:06:29PM +1100, Nick Piggin wrote:
> 
>>Although I think it would enable you to do page table reclaim when
>>reclaiming mapped, file backed pages quite easily... but I'm not sure if
>>that is enough to offset the slowdowns.
> 
> 
> That would be a far more appropriate response to high multiprogramming
> levels than what is now done.
> 

On a select few workloads, yes.

* Re: [PATCH 4/10] alternate 4-level page tables patches
  2004-12-18 11:17                 ` Nick Piggin
@ 2004-12-18 11:32                   ` William Lee Irwin III
  2004-12-18 11:55                     ` Nick Piggin
  0 siblings, 1 reply; 77+ messages in thread
From: William Lee Irwin III @ 2004-12-18 11:32 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, Linux Memory Management, Hugh Dickins,
	Linus Torvalds, Andrew Morton

William Lee Irwin III wrote:
>> If this were so, then clear_page_tables() during process destruction
>> would be unnecessary. detach_vmas_to_be_unmapped() makes additional
>> work for such schemes, but even improvements are still rather helpful.

On Sat, Dec 18, 2004 at 10:17:17PM +1100, Nick Piggin wrote:
> If what were so?

If clear_page_tables() implemented perfect GC.


On Sat, Dec 18, 2004 at 09:06:29PM +1100, Nick Piggin wrote:
>>> Although I think it would enable you to do page table reclaim when
>>> reclaiming mapped, file backed pages quite easily... but I'm not sure if
>>> that is enough to offset the slowdowns.

William Lee Irwin III wrote:
>> That would be a far more appropriate response to high multiprogramming
>> levels than what is now done.

On Sat, Dec 18, 2004 at 10:17:17PM +1100, Nick Piggin wrote:
> On a select few workloads, yes.

Counterexamples would be illustrative.


-- wli

* Re: [PATCH 4/10] alternate 4-level page tables patches
  2004-12-18 11:32                   ` William Lee Irwin III
@ 2004-12-18 11:55                     ` Nick Piggin
  2004-12-18 12:46                       ` William Lee Irwin III
  0 siblings, 1 reply; 77+ messages in thread
From: Nick Piggin @ 2004-12-18 11:55 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Andi Kleen, Linux Memory Management, Hugh Dickins,
	Linus Torvalds, Andrew Morton

William Lee Irwin III wrote:
> William Lee Irwin III wrote:
> 
>>>If this were so, then clear_page_tables() during process destruction
>>>would be unnecessary. detach_vmas_to_be_unmapped() makes additional
>>>work for such schemes, but even improvements are still rather helpful.
> 
> 
> On Sat, Dec 18, 2004 at 10:17:17PM +1100, Nick Piggin wrote:
> 
>>If what were so?
> 
> 
> If clear_page_tables() implemented perfect GC.
> 

Oh... well it does perfectly free memory in the context of what ranges
have been previously cleared with clear_page_tables. So that doesn't
free you from the requirement of calling clear_page_tables at some
point.

I suspect though, you are referring to refcounting, in which case yes,
GC could probably be performed at unmap time, and clear_page_tables
could disappear. I still think it would be too costly to refcount down
to the pte_t level, especially SMP-wise.... but I'm just basing that
on a few minutes of thought, so - I don't really know.

> 
> On Sat, Dec 18, 2004 at 09:06:29PM +1100, Nick Piggin wrote:
> 
>>>>Although I think it would enable you to do page table reclaim when
>>>>reclaiming mapped, file backed pages quite easily... but I'm not sure if
>>>>that is enough to offset the slowdowns.
> 
> 
> William Lee Irwin III wrote:
> 
>>>That would be a far more appropriate response to high multiprogramming
>>>levels than what is now done.
> 
> 
> On Sat, Dec 18, 2004 at 10:17:17PM +1100, Nick Piggin wrote:
> 
>>On a select few workloads, yes.
> 
> 
> Counterexamples would be illustrative.
> 

Oh, just workloads where memory is fairly dense in virtual space, and
not shared (much). Non-oracle workloads, perhaps? :)

Seriously? On my typical desktop, I have 250MB used, of which 1MB is
page tables. I suspect this is a pretty typical ratio on desktops,
but I have less experience with high-end database servers and that type
of stuff.

I was hoping you could provide an example rather than me a counter ;)

* Re: [PATCH 4/10] alternate 4-level page tables patches
  2004-12-18 11:55                     ` Nick Piggin
@ 2004-12-18 12:46                       ` William Lee Irwin III
  2004-12-18 12:48                         ` William Lee Irwin III
  2004-12-19  0:05                         ` Nick Piggin
  0 siblings, 2 replies; 77+ messages in thread
From: William Lee Irwin III @ 2004-12-18 12:46 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, Linux Memory Management, Hugh Dickins,
	Linus Torvalds, Andrew Morton

William Lee Irwin III wrote:
>> If clear_page_tables() implemented perfect GC.

On Sat, Dec 18, 2004 at 10:55:58PM +1100, Nick Piggin wrote:
> Oh... well it does perfectly free memory in the context of what ranges
> have been previously cleared with clear_page_tables. So that doesn't
> free you from the requirement of calling clear_page_tables at some
> point.
> I suspect though, you are referring to refcounting, in which case yes,
> GC could probably be performed at unmap time, and clear_page_tables
> could disappear. I still think it would be too costly to refcount down
> to the pte_t level, especially SMP-wise.... but I'm just basing that
> on a few minutes of thought, so - I don't really know.

vmas are unmapped one-by-one during process destruction.


William Lee Irwin III wrote:
>> Counterexamples would be illustrative.

On Sat, Dec 18, 2004 at 10:55:58PM +1100, Nick Piggin wrote:
> Oh, just workloads where memory is fairly dense in virtual space, and
> not shared (much). Non-oracle workloads, perhaps? :)
> Seriously? On my typical desktop, I have 250MB used, of which 1MB is
> page tables. I suspect this is a pretty typical ratio on desktops,
> but I have less experience with high-end database servers and that type
> of stuff.
> I was hoping you could provide an example rather than me a counter ;)

Page replacement is largely irrelevant to databases. Administrators
etc. rather go through pains to avoid page replacement and at some
cost. They rather reclaim when page replacement occurs. More beneficial
for databases would be increasing the multiprogramming level a system
can maintain without page replacement or background data structure
reclamation.  This is, of course, not to say that databases can
tolerate leaks or effective leaks of kernel memory or data structures.

Effective eviction of process data is far more pertinent to laptops and
desktops, where every wasted pagetable page is another page of
userspace program data that has to be swapped out and another write to
a disk spun by a battery with a limited lifetime (though the timer is
probably a larger concern wrt. battery life). Idle processes are likely
to be the largest concern there. The kernel's memory footprint
is always pure overhead, and pagetables are a very large part of it.

(a) idle bloatzilla
(b) idle mutt
(c) idle shells
(d) numerous daemons started up by initscripts and rarely ever invoked


-- wli

* Re: [PATCH 4/10] alternate 4-level page tables patches
  2004-12-18 12:46                       ` William Lee Irwin III
@ 2004-12-18 12:48                         ` William Lee Irwin III
  2004-12-19  0:05                         ` Nick Piggin
  1 sibling, 0 replies; 77+ messages in thread
From: William Lee Irwin III @ 2004-12-18 12:48 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, Linux Memory Management, Hugh Dickins,
	Linus Torvalds, Andrew Morton

On Sat, Dec 18, 2004 at 04:46:35AM -0800, William Lee Irwin III wrote:
> cost. They rather reclaim when page replacement occurs. More beneficial

Editing blunder.

They rather claim defeat when page replacement occurs.


-- wli

* Re: [PATCH 10/10] alternate 4-level page tables patches
  2004-12-18  7:31                     ` Andi Kleen
  2004-12-18  7:46                       ` Nick Piggin
  2004-12-18  8:08                       ` Andrew Morton
@ 2004-12-18 19:06                       ` Linus Torvalds
  2004-12-20 17:43                         ` Andi Kleen
  2 siblings, 1 reply; 77+ messages in thread
From: Linus Torvalds @ 2004-12-18 19:06 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Nick Piggin, Linux Memory Management, Hugh Dickins, Andrew Morton


On Sat, 18 Dec 2004, Andi Kleen wrote:
> 
> Ok except on i386 where someone decided to explicitely turn it off 
> all the time :/

Because it used to be broken as hell. The code it generated was absolute 
and utter crap.

Maybe some versions of gcc get it right now, but what it _used_ to do was 
to make functions that had hundreds of bytes of stack-space, because gcc 
would never re-use stack slots, and if you have code like

	static int fn_case1(..)
	..

	static int fn_case2(..)
	..

	switch (ioctl) {
	case abc:
		err = fn_case1(..);
		break;
	case def:
		err = fn_case2(..);
		break;
	..
	case xyz:
		err = fn_case25(..);
		break;
	}

which actually is not that unusual, gcc would make a TOTAL mess of it, 
because it would add up _all_ the stack space for _all_ the functions, and 
instead of having 16 bytes of stack used, it would use a kilobyte.
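
A toy version of the effect (hypothetical sizes, not from any real
driver):

	#include <string.h>

	/* Each helper needs 256 bytes of stack, and only one of them ever
	 * runs per call. If gcc inlines all of them into dispatch() without
	 * sharing stack slots, the frame becomes the _sum_ of the buffers
	 * instead of the maximum. */
	static int fn_case1(int a) { char buf[256]; memset(buf, 1, sizeof buf); return buf[a & 255]; }
	static int fn_case2(int a) { char buf[256]; memset(buf, 2, sizeof buf); return buf[a & 255]; }
	static int fn_case3(int a) { char buf[256]; memset(buf, 3, sizeof buf); return buf[a & 255]; }

	int dispatch(int ioctl, int a)
	{
		switch (ioctl) {
		case 1: return fn_case1(a);	/* 256 bytes...           */
		case 2: return fn_case2(a);	/* ...plus 256 more...    */
		case 3: return fn_case3(a);	/* ...plus 256 more: ~768 */
		}				/* bytes of frame where   */
		return -1;			/* 256 would have done.   */
	}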

It may be less of an issue on x86-64, because
 - you probably haven't even looked
 - with more registers, you need less spilling, and inlining works better.

> Enable unit-at-a-time by default. At least with 3.3-hammer and 3.4 
> it seems to work just fine. Has been tested with 3.3-hammer over
> several suse releases.

How about looking at those ioctl functions, and verifying that gcc has 
been fixed? With more than just _one_ boutique compiler version?

		Linus

* Re: [PATCH 4/10] alternate 4-level page tables patches
  2004-12-18 12:46                       ` William Lee Irwin III
  2004-12-18 12:48                         ` William Lee Irwin III
@ 2004-12-19  0:05                         ` Nick Piggin
  2004-12-19  0:20                           ` William Lee Irwin III
  1 sibling, 1 reply; 77+ messages in thread
From: Nick Piggin @ 2004-12-19  0:05 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Andi Kleen, Linux Memory Management, Hugh Dickins,
	Linus Torvalds, Andrew Morton

William Lee Irwin III wrote:
> William Lee Irwin III wrote:
> 
>>>If clear_page_tables() implemented perfect GC.
> 
> 
> On Sat, Dec 18, 2004 at 10:55:58PM +1100, Nick Piggin wrote:
> 
>>Oh... well it does perfectly free memory in the context of what ranges
>>have been previously cleared with clear_page_tables. So that doesn't
>>free you from the requirement of calling clear_page_tables at some
>>point.
>>I suspect though, you are referring to refcounting, in which case yes,
>>GC could probably be performed at unmap time, and clear_page_tables
>>could disappear. I still think it would be too costly to refcount down
>>to the pte_t level, especially SMP-wise.... but I'm just basing that
>>on a few minutes of thought, so - I don't really know.
> 
> 
> vmas are unmapped one-by-one during process destruction.
> 

Yeah but clear_page_tables isn't called for each vma that is unmapped
at exit time. Rather, one big one is called at the end - I suspect
this is usually more efficient.
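
(For reference, the 2.6 exit path looks roughly like this - paraphrased
from mm/mmap.c, so treat the details as approximate:)

	/* exit_mmap(), abridged: all vmas are torn down under one
	 * mmu_gather, and the page tables are cleared once at the end. */
	void exit_mmap(struct mm_struct *mm)
	{
		struct mmu_gather *tlb;
		unsigned long nr_accounted = 0;

		spin_lock(&mm->page_table_lock);
		tlb = tlb_gather_mmu(mm, 1);	/* full_mm_flush = 1 */
		flush_cache_mm(mm);
		mm->map_count -= unmap_vmas(&tlb, mm, mm->mmap, 0, ~0UL,
						&nr_accounted, NULL);
		clear_page_tables(tlb, FIRST_USER_PGD_NR, USER_PTRS_PER_PGD);
		tlb_finish_mmu(tlb, 0, MM_VM_SIZE(mm));
		spin_unlock(&mm->page_table_lock);
		/* vma freeing etc. omitted */
	}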

> 
> William Lee Irwin III wrote:
> 
>>>Counterexamples would be illustrative.
> 
> 
> On Sat, Dec 18, 2004 at 10:55:58PM +1100, Nick Piggin wrote:
> 
>>Oh, just workloads where memory is fairly dense in virtual space, and
>>not shared (much). Non-oracle workloads, perhaps? :)
>>Seriously? On my typical desktop, I have 250MB used, of which 1MB is
>>page tables. I suspect this is a pretty typical ratio on desktops,
>>but I have less experience with high-end database servers and that type
>>of stuff.
>>I was hoping you could provide an example rather than me a counter ;)
> 
> 
> Page replacement is largely irrelevant to databases. Administrators
> etc. rather go through pains to avoid page replacement and at some
> cost. They rather reclaim when page replacement occurs. More beneficial
> for databases would be increasing the multiprogramming level a system
> can maintain without page replacement or background data structure
> reclamation.  This is, of course, not to say that databases can
> tolerate leaks or effective leaks of kernel memory or data structures.
> 

OK. Well with the simple patch I've shown, we no longer 'leak' pagetables
(although the unmap-time cost may require moving to a partially refcounted
approach).

Does anyone know of workloads that have significant clear_page_tables
cost?

> Effective eviction of process data is far more pertinent to laptops and
> desktops, where every wasted pagetable page is another page of
> userspace program data that has to be swapped out and another write to
> a disk spun by a battery with a limited lifetime (though the timer is
> probably a larger concern wrt. battery life). Idle processes are likely
> to be the largest concern there. The kernel's memory footprint
> is always pure overhead, and pagetables are a very large part of it.
> 
> (a) idle bloatzilla
> (b) idle mutt
> (c) idle shells
> (d) numerous daemons started up by initscripts and rarely ever invoked
> 

Oh sure, but in those cases, the pagetables aren't such a big waste
of space, because memory access isn't too sparse, and you don't have
a huge amount of sharing (even executables, shared libraries - there
just aren't that many processes running to make page tables a large
fraction of resident memory).

So I'm not saying there are no savings to be had at all, but just that
maybe they aren't worth it (I don't know - maybe it is possible to do
a full refcounting implementation without adding fastpath overhead).

I mean, I've got 250MB used and only 1/250th of that is in pagetables.

* Re: [RFC][PATCH 0/10] alternate 4-level page tables patches
  2004-12-18  6:55 [RFC][PATCH 0/10] alternate 4-level page tables patches Nick Piggin
  2004-12-18  6:55 ` [PATCH 1/10] " Nick Piggin
@ 2004-12-19  0:07 ` Hugh Dickins
  2004-12-19  0:33   ` Nick Piggin
  2004-12-20 18:04   ` Andi Kleen
  1 sibling, 2 replies; 77+ messages in thread
From: Hugh Dickins @ 2004-12-19  0:07 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linux Memory Management, Andi Kleen, Linus Torvalds, Andrew Morton

On Sat, 18 Dec 2004, Nick Piggin wrote:
> 
> Apologies for not making progress on this done sooner, but better late than never.
> 
> First off - don't let all the signed-off-by: things fool you, I'm only intending
> this for comments, not merging. I just spent a bit of time getting the descriptions
> in better shape.
> 
> Second - much of it is Andi's code (especially 4 level core, and x86-64 stuff).
> If any attributions aren't quite accurate at this stage, don't worry too much!
> 
> 
> Anyway, although we have a working 4-level page tables implementation, I am keeping
> with this because my personal taste preference. Not that it is anything against
> Andi's taste or technical implementation... but I wouldn't like progress to be held
> up on account of me, so I wouldn't be too upset to forget about this until 2.7 (or
> for ever)... /end disclaimer
> 
> Well, the patches follow. Tested lightly on i386 32 and 36 bits, ia64, and x86-64
> with full 4 levels.
> 
> Comments?

I had been sceptical whether it's now worth a revised implementation.
But these look like good tasteful patches to me, nicely split up.

In all they will amount to more change than Andi's original version -
partly because of the de-pml4-ing in x86_64, but more because of the
genericizing of nopmd and then nopud - but that's worthwhile.
The changes seem to be the ones which ought to be in there.

I think Andi's work has benefitted from having
your eye and hand go over it for a second round.

> A bit of an aside: I was hoping to have a page table folding implementation that is
> basically transparent to architectures. That is, a 3-level arch could just include
> some generic header to fold the 4th level, and call it a day (without any other mods
> to arch/?/* or include/asm-?/*).
> 
> The reality is, this isn't going to happen with our current semantics. It probably
> isn't a really big deal though, because I don't expect we'd have to support a 5
> level implementation any time soon. But it is something I'd like to explore further.
> 
> I'll illustrate with an example: in the current setup, if the pmd is folded into
> the pgd, pgd_present is always true, and pmd_present is what actually examines the
> entry in the pgd. Now clearly, the architecture has to implement pmd_present, which
> is pgd_present in a 2-level setup.
> 
> I would like to change that so pgd_present really does check the actual pgd entry,
> and pmd_present is unconditionally true. IMO this would work better and be less
> confusing than the current setup... but that's getting off topic...

Thanks for going into that.  Of course I'm disappointed, I had
been hoping that pud would obviate the need for immediate change
in all the arches.  But I trust your explanation for why not, and
after several readings I think I'm beginning to understand it!

My vote is for you (with arch assistants) to extend this work to the
other arches, and these patches to replace the current 4level patches
in -mm.  But what does Andi think - are those "inline"s his only dissent?

Hugh


* Re: [PATCH 4/10] alternate 4-level page tables patches
  2004-12-19  0:05                         ` Nick Piggin
@ 2004-12-19  0:20                           ` William Lee Irwin III
  2004-12-19  0:38                             ` Nick Piggin
  2004-12-19  1:31                             ` Linus Torvalds
  0 siblings, 2 replies; 77+ messages in thread
From: William Lee Irwin III @ 2004-12-19  0:20 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, Linux Memory Management, Hugh Dickins,
	Linus Torvalds, Andrew Morton

William Lee Irwin III wrote:
>> vmas are unmapped one-by-one during process destruction.

On Sun, Dec 19, 2004 at 11:05:22AM +1100, Nick Piggin wrote:
> Yeah but clear_page_tables isn't called for each vma that is unmapped
> at exit time. Rather, one big one is called at the end - I suspect
> this is usually more efficient.

For clear_page_tables() you want to scan as little as possible. The
exit()-time performance issue is tlb_finish_mmu().


-- wli

* Re: [RFC][PATCH 0/10] alternate 4-level page tables patches
  2004-12-19  0:07 ` [RFC][PATCH 0/10] " Hugh Dickins
@ 2004-12-19  0:33   ` Nick Piggin
  2004-12-20 18:04   ` Andi Kleen
  1 sibling, 0 replies; 77+ messages in thread
From: Nick Piggin @ 2004-12-19  0:33 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linux Memory Management, Andi Kleen, Linus Torvalds, Andrew Morton

Hugh Dickins wrote:
> On Sat, 18 Dec 2004, Nick Piggin wrote:
> 

>>Well, the patches follow. Tested lightly on i386 32 and 36 bits, ia64, and x86-64
>>with full 4 levels.
>>
>>Comments?
> 
> 
> I had been sceptical whether it's now worth a revised implementation.
> But these look like good tasteful patches to me, nicely split up.
> 
> In all they will amount to more change than Andi's original version -
> partly because of the de-pml4-ing in x86_64, but more because of the
> genericizing of nopmd and then nopud - but that's worthwhile.
> The changes seem to be the ones which ought to be in there.
> 
> I think Andi's work has benefitted from having
> your eye and hand go over it for a second round.
> 

Well yes - and let's not lose sight of what the patches actually consist
of: _most_ of the hard work is Andi's, and fortunately things are clean
enough that moving from pml4 to pud wasn't a lot harder than a
s/pgd/pud, s/pml4/pgd!

- even for x86-64, which I had expected to be a much harder job.

[snip]

> My vote is for you (with arch assistants) to extend this work to the
> other arches, and these patches to replace the current 4level patches
> in -mm.  But what does Andi think - are those "inline"s his only dissent?
> 

The rest of the architectures shouldn't be much of a problem, I hope. If
there are any difficulties, Andi should already have them covered,
and the rest is more or less a straight search-and-replace.

But yeah, we obviously want to get Andi on side _if_ we are to go with
`pud`...

Thanks for the comments Hugh.

* Re: [PATCH 4/10] alternate 4-level page tables patches
  2004-12-19  0:20                           ` William Lee Irwin III
@ 2004-12-19  0:38                             ` Nick Piggin
  2004-12-19  1:01                               ` William Lee Irwin III
  2004-12-19  1:31                             ` Linus Torvalds
  1 sibling, 1 reply; 77+ messages in thread
From: Nick Piggin @ 2004-12-19  0:38 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Andi Kleen, Linux Memory Management, Hugh Dickins,
	Linus Torvalds, Andrew Morton

William Lee Irwin III wrote:
> William Lee Irwin III wrote:
> 
>>>vmas are unmapped one-by-one during process destruction.
> 
> 
> On Sun, Dec 19, 2004 at 11:05:22AM +1100, Nick Piggin wrote:
> 
>>Yeah but clear_page_tables isn't called for each vma that is unmapped
>>at exit time. Rather, one big one is called at the end - I suspect
>>this is usually more efficient.
> 
> 
> For clear_page_tables() you want to scan as little as possible. The

Sure. I wonder if we could cut down the amount of scanning by keeping
track of what ranges of vmas have been unmapped... still, I don't think
I have seen this function high on a profile, so until then I personally
don't think I'll bother ;)

> exit()-time performance issue is tlb_finish_mmu().
> 

Makes sense. I guess there is often a lot of memory one has to shoot
down.


* Re: [PATCH 4/10] alternate 4-level page tables patches
  2004-12-19  0:38                             ` Nick Piggin
@ 2004-12-19  1:01                               ` William Lee Irwin III
  0 siblings, 0 replies; 77+ messages in thread
From: William Lee Irwin III @ 2004-12-19  1:01 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, Linux Memory Management, Hugh Dickins,
	Linus Torvalds, Andrew Morton

William Lee Irwin III wrote:
>> For clear_page_tables() you want to scan as little as possible. The

On Sun, Dec 19, 2004 at 11:38:56AM +1100, Nick Piggin wrote:
> Sure. I wonder if we could cut down the amount of scanning by keeping
> track of what ranges of vmas have been unmapped... still, I don't think
> I have seen this function high on a profile, so until then I personally
> don't think I'll bother ;)

Probably best to talk to davem about this. Or at least he's a source of
information about this independent of me.


William Lee Irwin III wrote:
>> exit()-time performance issue is tlb_finish_mmu().

On Sun, Dec 19, 2004 at 11:38:56AM +1100, Nick Piggin wrote:
> Makes sense. I guess there is often a lot of memory one has to shoot
> down.

Also, some of the comments in the prior post relate to one of the
differences in how I implemented all this.


-- wli

* Re: [PATCH 4/10] alternate 4-level page tables patches
  2004-12-19  0:20                           ` William Lee Irwin III
  2004-12-19  0:38                             ` Nick Piggin
@ 2004-12-19  1:31                             ` Linus Torvalds
  2004-12-19  2:08                               ` William Lee Irwin III
  1 sibling, 1 reply; 77+ messages in thread
From: Linus Torvalds @ 2004-12-19  1:31 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Nick Piggin, Andi Kleen, Linux Memory Management, Hugh Dickins,
	Andrew Morton, Ingo Molnar


On Sat, 18 Dec 2004, William Lee Irwin III wrote:
>
> William Lee Irwin III wrote:
> >> vmas are unmapped one-by-one during process destruction.
> 
> On Sun, Dec 19, 2004 at 11:05:22AM +1100, Nick Piggin wrote:
> > Yeah but clear_page_tables isn't called for each vma that is unmapped
> > at exit time. Rather, one big one is called at the end - I suspect
> > this is usually more efficient.
> 
> For clear_page_tables() you want to scan as little as possible. The
> exit()-time performance issue is tlb_finish_mmu().

Note that the fact that we share lots of code between "unmap" and "exit" 
is likely a (performance) bug.

The exit case is really a lot simpler, not just because we get rid of the 
whole VM, but because nobody else can be reading the page tables at the 
same time, and in particular we do not have a lot of the races that a 
simple unmap can have. The whole "gather/flush" thing is overkill, I 
think.

Actually, looking at the code, I wonder why we haven't marked the exit 
case to be "fast". We have this special optimization for single-CPU which 
doesn't bunch pages up and free them in chunks, and we should probably 
mark the exit case to use the fast-case where we can flush the TLB's 
early. Hmm?

Ingo, is there any reason we don't do this:

--- 1.24/include/asm-generic/tlb.h	2004-07-10 17:14:00 -07:00
+++ edited/include/asm-generic/tlb.h	2004-12-18 17:30:43 -08:00
@@ -58,7 +58,7 @@
 	tlb->mm = mm;
 
 	/* Use fast mode if only one CPU is online */
-	tlb->nr = num_online_cpus() > 1 ? 0U : ~0U;
+	tlb->nr = num_online_cpus() > 1 && !full_mm_flush ? 0U : ~0U;
 
 	tlb->fullmm = full_mm_flush;
 	tlb->freed = 0;

which should make the exit case TLB handling go much faster. Was there 
some race in that too? Nobody should be using the VM any more at that 
point, so it _should_ be safe, no?
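
For context, fast mode just short-circuits the batching in
tlb_remove_page() - roughly (paraphrasing the same header from memory,
so details may be off):

	static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
	{
		tlb->need_flush = 1;
		if (tlb->nr == ~0U) {		/* fast mode */
			/* no batching: free the page immediately */
			free_page_and_swap_cache(page);
			return;
		}
		tlb->pages[tlb->nr++] = page;
		if (tlb->nr >= FREE_PTE_NR)
			tlb_flush_mmu(tlb, 0, 0);
	}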

		Linus

* Re: [PATCH 4/10] alternate 4-level page tables patches
  2004-12-19  1:31                             ` Linus Torvalds
@ 2004-12-19  2:08                               ` William Lee Irwin III
  2004-12-19  2:26                                 ` Nick Piggin
  2004-12-19  5:23                                 ` Linus Torvalds
  0 siblings, 2 replies; 77+ messages in thread
From: William Lee Irwin III @ 2004-12-19  2:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Andi Kleen, Linux Memory Management, Hugh Dickins,
	Andrew Morton, Ingo Molnar

On Sat, 18 Dec 2004, William Lee Irwin III wrote:
>> For clear_page_tables() you want to scan as little as possible. The
>> exit()-time performance issue is tlb_finish_mmu().

On Sat, Dec 18, 2004 at 05:31:48PM -0800, Linus Torvalds wrote:
> Note that the fact that we share lots of code between "unmap" and "exit" 
> is likely a (performance) bug.
> The exit case is really a lot simpler, not just because we get rid of the 
> whole VM, but because nobody else can be reading the page tables at the 
> same time, and in particular we do not have a lot of the races that a 
> simple unmap can have. The whole "gather/flush" thing is overkill, I 
> think.

For x86-style MMU's you could literally not bother flushing the TLB at
all, since you'll just switch to another set of pagetables.


On Sat, Dec 18, 2004 at 05:31:48PM -0800, Linus Torvalds wrote:
> Actually, looking at the code, I wonder why we haven't marked the exit 
> case to be "fast". We have this special optimization for single-CPU which 
> doesn't bunch pages up and free them in chunks, and we should probably 
> mark the exit case to use the fast-case where we can flush the TLB's 
> early. Hmm?
> Ingo, is there any reason we don't do this:
> --- 1.24/include/asm-generic/tlb.h	2004-07-10 17:14:00 -07:00
> +++ edited/include/asm-generic/tlb.h	2004-12-18 17:30:43 -08:00
> @@ -58,7 +58,7 @@
>  	tlb->mm = mm;
>  
>  	/* Use fast mode if only one CPU is online */
> -	tlb->nr = num_online_cpus() > 1 ? 0U : ~0U;
> +	tlb->nr = num_online_cpus() > 1 && !full_mm_flush ? 0U : ~0U;
>  
>  	tlb->fullmm = full_mm_flush;
>  	tlb->freed = 0;
> which should make the exit case TLB handling go much faster. Was there 
> some race in that too? Nobody should be using the VM any more at that 
> point, so it _should_ be safe, no?

The stale translations can't be left around for ASID-tagged TLB's, lest
the next user of the ASID inherit them.


-- wli

* Re: [PATCH 4/10] alternate 4-level page tables patches
  2004-12-19  2:08                               ` William Lee Irwin III
@ 2004-12-19  2:26                                 ` Nick Piggin
  2004-12-19  5:23                                 ` Linus Torvalds
  1 sibling, 0 replies; 77+ messages in thread
From: Nick Piggin @ 2004-12-19  2:26 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Linus Torvalds, Andi Kleen, Linux Memory Management,
	Hugh Dickins, Andrew Morton, Ingo Molnar

William Lee Irwin III wrote:
> On Sat, 18 Dec 2004, William Lee Irwin III wrote:
> 
>>>For clear_page_tables() you want to scan as little as possible. The
>>>exit()-time performance issue is tlb_finish_mmu().
> 
> 
> On Sat, Dec 18, 2004 at 05:31:48PM -0800, Linus Torvalds wrote:
> 
>>Note that the fact that we share lots of code between "unmap" and "exit" 
>>is likely a (performance) bug.
>>The exit case is really a lot simpler, not just because we get rid of the 
>>whole VM, but because nobody else can be reading the page tables at the 
>>same time, and in particular we do not have a lot of the races that a 
>>simple unmap can have. The whole "gather/flush" thing is overkill, I 
>>think.
> 
> 
> For x86-style MMU's you could literally not bother flushing the TLB at
> all, since you'll just switch to another set of pagetables.
> 
> 

I think you're right. At exit-time, I think the flush should just amount
to __flush_tlb() anyway, in which case it is only another cr3 reload, so
it is probably not a big saving?
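
(Roughly - a sketch of what that cr3 round-trip amounts to, not the
literal header macro:)

	static inline void flush_tlb_sketch(void)
	{
		unsigned long cr3;

		/* rewriting %cr3 flushes all non-global TLB entries */
		asm volatile("movl %%cr3, %0" : "=r" (cr3));
		asm volatile("movl %0, %%cr3" : : "r" (cr3) : "memory");
	}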

> On Sat, Dec 18, 2004 at 05:31:48PM -0800, Linus Torvalds wrote:
> 
>>Actually, looking at the code, I wonder why we haven't marked the exit 
>>case to be "fast". We have this special optimization for single-CPU which 
>>doesn't bunch pages up and free them in chunks, and we should probably 
>>mark the exit case to use the fast-case where we can flush the TLB's 
>>early. Hmm?
>>Ingo, is there any reason we don't do this:
>>--- 1.24/include/asm-generic/tlb.h	2004-07-10 17:14:00 -07:00
>>+++ edited/include/asm-generic/tlb.h	2004-12-18 17:30:43 -08:00
>>@@ -58,7 +58,7 @@
>> 	tlb->mm = mm;
>> 
>> 	/* Use fast mode if only one CPU is online */
>>-	tlb->nr = num_online_cpus() > 1 ? 0U : ~0U;
>>+	tlb->nr = num_online_cpus() > 1 && !full_mm_flush ? 0U : ~0U;
>> 
>> 	tlb->fullmm = full_mm_flush;
>> 	tlb->freed = 0;
>>which should make the exit case TLB handling go much faster. Was there 
>>some race in that too? Nobody should be using the VM any more at that 
>>point, so it _should_ be safe, no?
> 
> 
> The stale translations can't be left around for ASID-tagged TLB's, lest
> the next user of the ASID inherit them.
> 

I think they should still be flushed before the ASID is reused though,
right? (not that I know how Linux uses ASID TLBs)

It looks to me like Linus' patch should fly (albeit with my limited
knowledge of !i386 architectures).

* Re: [PATCH 4/10] alternate 4-level page tables patches
  2004-12-19  2:08                               ` William Lee Irwin III
  2004-12-19  2:26                                 ` Nick Piggin
@ 2004-12-19  5:23                                 ` Linus Torvalds
  2004-12-19  6:02                                   ` William Lee Irwin III
  1 sibling, 1 reply; 77+ messages in thread
From: Linus Torvalds @ 2004-12-19  5:23 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Nick Piggin, Andi Kleen, Linux Memory Management, Hugh Dickins,
	Andrew Morton, Ingo Molnar


On Sat, 18 Dec 2004, William Lee Irwin III wrote:
> 
> For x86-style MMU's you could literally not bother flushing the TLB at
> all, since you'll just switch to another set of pagetables.

Yes, we could avoid the flush entirely, since we've already "flushed" the
TLB by virtue of having switched to another set of page tables.

And it's in no way x86-specific:

> The stale translations can't be left around for ASID-tagged TLB's, lest
> the next user of the ASID inherit them.

We need to (and do) flush the TLB on ASID re-use, regardless. That's true
in any case. 

		Linus

* Re: [PATCH 4/10] alternate 4-level page tables patches
  2004-12-19  5:23                                 ` Linus Torvalds
@ 2004-12-19  6:02                                   ` William Lee Irwin III
  2004-12-19 18:17                                     ` Linus Torvalds
  0 siblings, 1 reply; 77+ messages in thread
From: William Lee Irwin III @ 2004-12-19  6:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Andi Kleen, Linux Memory Management, Hugh Dickins,
	Andrew Morton, Ingo Molnar

On Sat, Dec 18, 2004 at 09:23:06PM -0800, Linus Torvalds wrote:
> Yes, we could avoid the flush entirely, since we've already "flushed" the 
> TLB by virtue of having switched to another TLB.
> And it's in no way x86-specific:

I'd say there's a little reliance on the semantics being emulated by
other architectures, but I doubt it strongly influences performance.


On Sat, 18 Dec 2004, William Lee Irwin III wrote:
>> The stale translations can't be left around for ASID-tagged TLB's, lest
>> the next user of the ASID inherit them.

On Sat, Dec 18, 2004 at 09:23:06PM -0800, Linus Torvalds wrote:
> We need to (and do) flush the TLB on ASID re-use, regardless. That's true
> in any case. 

If it's already been audited and there's nothing to do, all the better.


-- wli

* Re: [PATCH 4/10] alternate 4-level page tables patches
  2004-12-19  6:02                                   ` William Lee Irwin III
@ 2004-12-19 18:17                                     ` Linus Torvalds
  2004-12-20  1:00                                       ` William Lee Irwin III
  0 siblings, 1 reply; 77+ messages in thread
From: Linus Torvalds @ 2004-12-19 18:17 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Nick Piggin, Andi Kleen, Linux Memory Management, Hugh Dickins,
	Andrew Morton, Ingo Molnar


On Sat, 18 Dec 2004, William Lee Irwin III wrote:
> 
> On Sat, Dec 18, 2004 at 09:23:06PM -0800, Linus Torvalds wrote:
> > We need to (and do) flush the TLB on ASID re-use, regardless. That's true
> > in any case. 
> 
> If it's already been audited and there's nothing to do, all the better.

It's more an issue of "if they don't, it won't work". 

That should be true at least for the "traditional" kind of ASID's, where
the ASID space is smaller than the process space, and ASID's get re-used
while a process is live anyway.

Of course, who knows what evil things the ppc external hash does with the 
thing. 

Anyway, I don't think we should necessarily remove the flush entirely, and
the simple one-liner only did the "immediate free" thing without the
complex "batch things up and free them only after a flush".

		Linus

* Re: [PATCH 4/10] alternate 4-level page tables patches
  2004-12-19 18:17                                     ` Linus Torvalds
@ 2004-12-20  1:00                                       ` William Lee Irwin III
  0 siblings, 0 replies; 77+ messages in thread
From: William Lee Irwin III @ 2004-12-20  1:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Andi Kleen, Linux Memory Management, Hugh Dickins,
	Andrew Morton, Ingo Molnar

On Sat, 18 Dec 2004, William Lee Irwin III wrote:
>> If it's already been audited and there's nothing to do, all the better.

On Sun, Dec 19, 2004 at 10:17:44AM -0800, Linus Torvalds wrote:
> It's more an issue of "if they don't, it won't work". 
> That should be true at least for the "traditional" kind of ASI's, where
> the ASI space is smaller than the process space, and ASI's get re-used
> while a process is live anyway.
> Of course, who knows what evil things the ppc external hash does with the 
> thing. 
> Anyway, I don't think we should necessarily remove the flush entirely, and
> the simple one-liner only did the "immediate free" thing without the
> complex "batch things up and free them only after a flush".


The "adversary argument" I used to brew up things to look for involved
eagerly flushing ASID's during exit(), marking the ASID dead at exit
on account of reliance on the flush, and reusing it without a flush due
to the death marker. I didn't bother coming up with more than one
possible scenario. Others will be visible while searching for one.


-- wli

* Re: [PATCH 10/10] alternate 4-level page tables patches
  2004-12-18 19:06                       ` Linus Torvalds
@ 2004-12-20 17:43                         ` Andi Kleen
  2004-12-20 17:47                           ` Randy.Dunlap
  2004-12-20 18:08                           ` Linus Torvalds
  0 siblings, 2 replies; 77+ messages in thread
From: Andi Kleen @ 2004-12-20 17:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Nick Piggin, Linux Memory Management, Hugh Dickins,
	Andrew Morton

On Sat, Dec 18, 2004 at 11:06:48AM -0800, Linus Torvalds wrote:
> 
> 
> On Sat, 18 Dec 2004, Andi Kleen wrote:
> > 
> > Ok except on i386 where someone decided to explicitely turn it off 
> > all the time :/
> 
> Because it used to be broken as hell. The code it generated was absolute 
> and utter crap.

I disagree. It generated significantly smaller code and the SUSE 
kernel has been shipping with it for several releases and I'm not
aware of any bug report related to unit-at-a-time.

> 
> Maybe some versions of gcc get it right now, but what it _used_ to do was 
> to make functions that had hundreds of bytes of stack-space, because gcc 
> would never re-use stack slots, and if you have code like

The right fix in that case would have been to add a few "noinline"s
to these cases (should be easy to check for if it really happens 
by grepping assembly code for large stack frames), not penalize code quality
of the whole kernel.

I did a grep over a gcc 4.0-snapshot compiled i386 kernel.  There
are a few really bad cases (e.g. GDTH, intelfb, some WAN stuff)
that should be fixed, but from a quick review they all just put a single big
object on the stack, and are not affected by unit-at-a-time.

[note names are after the occurrence, not before]

everything > 0x400

     808:       81 ec 58 09 00 00       sub    $0x958,%esp
./drivers/video/intelfb/intelfb.o
     808:       81 ec 58 09 00 00       sub    $0x958,%esp
./drivers/video/intelfb/intelfbdrv.o
     3b4:       81 ec 08 04 00 00       sub    $0x408,%esp
./drivers/net/wan/cyclomx.o
      e8:       81 ec 08 04 00 00       sub    $0x408,%esp
./drivers/net/wan/cycx_x25.o
    4d82:       81 ec 44 06 00 00       sub    $0x644,%esp
    184b:       81 ec 48 02 00 00       sub    $0x248,%esp
./drivers/scsi/gdth.o

More smaller ones.

-Andi

* Re: [PATCH 10/10] alternate 4-level page tables patches
  2004-12-20 17:43                         ` Andi Kleen
@ 2004-12-20 17:47                           ` Randy.Dunlap
  2004-12-20 18:08                           ` Linus Torvalds
  1 sibling, 0 replies; 77+ messages in thread
From: Randy.Dunlap @ 2004-12-20 17:47 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, Nick Piggin, Linux Memory Management,
	Hugh Dickins, Andrew Morton

Andi Kleen wrote:
> On Sat, Dec 18, 2004 at 11:06:48AM -0800, Linus Torvalds wrote:
> 
>>
>>On Sat, 18 Dec 2004, Andi Kleen wrote:
>>
>>>Ok except on i386 where someone decided to explicitly turn it off 
>>>all the time :/
>>
>>Because it used to be broken as hell. The code it generated was absolute 
>>and utter crap.
> 
> 
> I disagree. It generated significantly smaller code and the SUSE 
> kernel has been shipping with it for several releases and I'm not
> aware of any bug report related to unit-at-a-time.
> 
> 
>>Maybe some versions of gcc get it right now, but what it _used_ to do was 
>>to make functions that had hundreds of bytes of stack-space, because gcc 
>>would never re-use stack slots, and if you have code like
> 
> 
> The right fix in that case would have been to add a few "noinline"s
> to these cases (should be easy to check for if it really happens 
> by grepping assembly code for large stack frames), not penalize code quality
> of the whole kernel.
> 
> I did a grep over a gcc 4.0-snapshot compiled i386 kernel.  There
> are a few really bad cases (e.g. GDTH, intelfb, some WAN stuff)
> that should be fixed, but from a quick review they all just put a single big
> object on the stack, and are not affected by unit-at-a-time.
> 
> [note names are after the occurrence, not before]
> 
> everything > 0x400
> 
>      808:       81 ec 58 09 00 00       sub    $0x958,%esp
> ./drivers/video/intelfb/intelfb.o
>      808:       81 ec 58 09 00 00       sub    $0x958,%esp
> ./drivers/video/intelfb/intelfbdrv.o
>      3b4:       81 ec 08 04 00 00       sub    $0x408,%esp
> ./drivers/net/wan/cyclomx.o
>       e8:       81 ec 08 04 00 00       sub    $0x408,%esp
> ./drivers/net/wan/cycx_x25.o
>     4d82:       81 ec 44 06 00 00       sub    $0x644,%esp
>     184b:       81 ec 48 02 00 00       sub    $0x248,%esp
> ./drivers/scsi/gdth.o
> 
> More smaller ones.

I posted a patch for intelfbdrv yesterday (on linux-fbdev-devel
m-l) and I'm working on gdth stack usage right now.
Basically I'm just tackling the top offenders (> 1000 bytes).

-- 
~Randy

* Re: [RFC][PATCH 0/10] alternate 4-level page tables patches
  2004-12-19  0:07 ` [RFC][PATCH 0/10] " Hugh Dickins
  2004-12-19  0:33   ` Nick Piggin
@ 2004-12-20 18:04   ` Andi Kleen
  2004-12-20 18:40     ` Linus Torvalds
  1 sibling, 1 reply; 77+ messages in thread
From: Andi Kleen @ 2004-12-20 18:04 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Nick Piggin, Linux Memory Management, Andi Kleen, Linus Torvalds,
	Andrew Morton

On Sun, Dec 19, 2004 at 12:07:34AM +0000, Hugh Dickins wrote:
> Thanks for going into that.  Of course I'm disappointed, I had
> been hoping that pud would obviate the need for immediate change
> in all the arches.  But I trust your explanation for why not, and
> after several readings I think I'm beginning to understand it!
> 
> My vote is for you (with arch assistants) to extend this work to the
> other arches, and these patches to replace the current 4level patches
> in -mm.  But what does Andi think - are those "inline"s his only dissent?

I don't see the point of redoing the work. IMHO Nick's new patches
only have cosmetic advantages over mine. To me it seems like quite a lot
of work just to rename some data types, with unclear gain.

And the arch maintainers may be unwilling to redo this multiple times :)

One issue I see is that there is still some work to be done - in particular
the optimized page table walking will need to be added to regain
lmbench fork/exec performance. I've been waiting for my patches
to be merged to then work on top of that. Doing another round
of changes would make this difficult, because it would mean more
delay and/or conflicting patches.

But I'm not strongly opposed to it. If everybody else thinks "pud_t" 
is the greatest thing since sliced bread and a much better name
than "pml4_t", then I guess we could eat the delay and disruption
that another round of these disruptive patches takes.

But I have my doubts it is worth it. Also, who guarantees that
somebody else won't turn up and want to rename it to "pad_t" or
"pod_t" or somesuch, and then we would have to wait even
longer for things to settle down. ;-) In my patches I avoided
the problem by just picking the name AMD gave it, which
seems to be the standard now in the x86-64 world at least (Intel
uses it too).

-Andi

* Re: [PATCH 10/10] alternate 4-level page tables patches
  2004-12-20 17:43                         ` Andi Kleen
  2004-12-20 17:47                           ` Randy.Dunlap
@ 2004-12-20 18:08                           ` Linus Torvalds
  2004-12-20 18:15                             ` Linus Torvalds
  2004-12-20 18:19                             ` Andi Kleen
  1 sibling, 2 replies; 77+ messages in thread
From: Linus Torvalds @ 2004-12-20 18:08 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Nick Piggin, Linux Memory Management, Hugh Dickins, Andrew Morton


On Mon, 20 Dec 2004, Andi Kleen wrote:
> > 
> > Because it used to be broken as hell. The code it generated was absolute 
> > and utter crap.
> 
> I disagree. It generated significantly smaller code and the SUSE 
> kernel has been shipping with it for several releases and I'm not
> aware of any bug report related to unit-at-a-time.

You didn't answer my question: have you checked anything but your recent 
version of gcc?

The fact is, there _were_ lots of complaints about unit-at-a-time. There 
was a reason that thing got disabled. Maybe they got fixed, BUT THAT 
DOESN'T HELP, if people are still using the old compilers that support 
the notion, but do crap for it.

We still support gcc-2.95. By implication, that pretty much means that we 
support all the early unit-at-a-time compilers too. Not just the 
potentially fixed ones.

Thus your "it works for SuSE" argument is totally pointless, and totally 
misses the issue.

> The right fix in that case would have been to add a few "noinline"s
> to these cases (should be easy to check for if it really happens 
> by grepping assembly code for large stack frames), not penalize code quality
> of the whole kernel.

No. The right fix is _always_ to make sure that we are conservative enough 
that we don't have to depend on getting compiler-specific details really 
really right. 

The thing is, performance (even when unit-at-a-time works) comes second to 
stability. And I don't say that as a user (although it's obviously true 
for users too), I say that as a _developer_. The amount of effort needed 
to chase down strange problem reports due to compiler issues is just not 
worth it.

I would suggest that if you want unit-at-a-time, you make it a config 
option, and you mark it very clearly as requiring a new enough compiler 
that it's worth it and stable. That way if people have problems, we can 
ask them "did you have unit-at-a-time enabled?" and see if the problem 
goes away.

		Linus

* Re: [PATCH 10/10] alternate 4-level page tables patches
  2004-12-20 18:08                           ` Linus Torvalds
@ 2004-12-20 18:15                             ` Linus Torvalds
  2004-12-20 18:19                             ` Andi Kleen
  1 sibling, 0 replies; 77+ messages in thread
From: Linus Torvalds @ 2004-12-20 18:15 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Nick Piggin, Linux Memory Management, Hugh Dickins, Andrew Morton


On Mon, 20 Dec 2004, Linus Torvalds wrote:
> 
> I would suggest that if you want unit-at-a-time, you make it a config 
> option, and you mark it very clearly as requiring a new enough compiler 
> that it's worth it and stable. That way if people have problems, we can 
> ask them "did you have unit-at-a-time enabled?" and see if the problem 
> goes away.

Btw, if you do this, I'd also suggest checking out exactly when gcc
started to do things right - not just "praying" about which version of gcc
is recent enough. Exactly so that the KConfig help message can say "if
your version of gcc is more recent than 3.3.4" rather than "if you have
some unspecified recent compiler".

The thing is, individual big stack users are fairly easy to find. But a 
chain where a few functions grew the stack a bit more, and the combined 
stack usage became big is harder to see.

I guess I could try to make sparse generate some call-chain information
(it would need to take function pointer structure usage into account to
be really useful, since a lot of the callchains are through the VFS and
MM pointers).

		Linus

* Re: [PATCH 10/10] alternate 4-level page tables patches
  2004-12-20 18:08                           ` Linus Torvalds
  2004-12-20 18:15                             ` Linus Torvalds
@ 2004-12-20 18:19                             ` Andi Kleen
  2004-12-20 18:47                               ` Linus Torvalds
  1 sibling, 1 reply; 77+ messages in thread
From: Andi Kleen @ 2004-12-20 18:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Nick Piggin, Linux Memory Management, Hugh Dickins,
	Andrew Morton

On Mon, Dec 20, 2004 at 10:08:29AM -0800, Linus Torvalds wrote:
> 
> 
> On Mon, 20 Dec 2004, Andi Kleen wrote:
> > > 
> > > Because it used to be broken as hell. The code it generated was absolute 
> > > and utter crap.
> > 
> > I disagree. It generated significantly smaller code and the SUSE 
> > kernel has been shipping with it for several releases and I'm not
> > aware of any bug report related to unit-at-a-time.
> 
> You didn't answer my question: have you checked anything but your recent 
> version of gcc?

I have experience with 3.3-hammer (from SUSE kernel releases) and exact
data from a 4.0 snapshot (as posted).

> 
> The fact is, there _were_ lots of complaints about unit-at-a-time. There 

I remember there was one, but they took a brute-force sledgehammer fix.
The right fix would have been to add the noinlines, not penalize
everybody.


> was a reason that thing got disabled. Maybe they got fixed, BUT THAT 
> DOESN'T HELP, if people are still using the old compilers that support 
> the notion, but do crap for it.

It helps when you add the noinlines. I can do that later - search
for Arjan's old report (I think he reported it), check what compiler
version he used, compile everything with that compiler and unit-at-a-time,
eyeball all the big stack frames, and add noinline where it is
really needed.
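
For illustration, the annotation in question is just this - the
function and its callee are invented names:

	/* keep this large frame out of any inlined call chain */
	static noinline int big_frame_helper(struct device *dev)
	{
		char buf[1024];

		return fill_and_send(dev, buf, sizeof(buf));
	}

That way the 1k buffer is only live while the helper itself runs,
instead of being merged into the caller's frame by inlining.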

> 
> We still support gcc-2.95. By implication, that pretty much means that we 
> support all the early unit-at-a-time compilers too. Not just the 
> potentially fixed ones.

The only widely used compilers with unit-at-a-time are 3.3-hammer
(actually several iterations, since it has changed a bit over time)
and 3.4.

> Thus your "it works for SuSE" argument is totally pointless, and totally 
> misses the issue.

Well, it's possible that there is a problem in 3.4 that isn't in
3.3-hammer (that is what suse uses), but if yes it should 
be easy to workaround with a few noinlines.

> 
> > The right fix in that case would have been to add a few "noinline"s
> > to these cases (should be easy to check for if it really happens 
> > by grepping assembly code for large stack frames), not penalize code quality
> > of the whole kernel.
> 
> No. The right fix is _always_ to make sure that we are conservative enough 
> that we don't have to depend on getting compiler-specific details really 
> really right. 
> 
> The thing is, performance (even when unit-at-a-time works) comes second to 
> stability. And I don't say that as a user (although it's obviously true 
> for users too), I say that as a _developer_. The amount of effort needed 
> to chase down strange problem reports due to compiler issues is just not 
> worth it.

I agree in the general case, but at least for stack consumption stuff
I don't. Since we have so much code it's pretty much required that
someone does the regular objdump -S ... | grep sub.*esp check
and verifies that nobody added more stack pigs. As the data in my
last mail has shown this is pretty much required. And when there
is a unit-at-a-time problem it can be quickly caught this way.

And I fixed quite a lot of stack consumption bugs over the years, but
none of them was caused by unit-at-a-time.

BTW what I heard from gcc people is that they plan to make unit-at-a-time
mandatory in some future version, so eventually we have to deal with
it anyways.

And I had it always enabled on x86-64 since the beginning and there
was so far not a *single* bug report related to it. 

> I would suggest that if you want unit-at-a-time, you make it a config 
> option, and you mark it very clearly as requiring a new enough compiler 
> that it's worth it and stable. That way if people have problems, we can 
> ask them "did you have unit-at-a-time enabled?" and see if the problem 
> goes away.

If you really suspect unit-at-a-time better just grep the stack frames.
And we already have too many such dumb options, like the totally useless
option to change all the code alignments in the config (I bet 99% of
all users will get it wrong). At least I will not add more of them.

-Andi


* Re: [RFC][PATCH 0/10] alternate 4-level page tables patches
  2004-12-20 18:04   ` Andi Kleen
@ 2004-12-20 18:40     ` Linus Torvalds
  2004-12-20 18:53       ` Andi Kleen
  0 siblings, 1 reply; 77+ messages in thread
From: Linus Torvalds @ 2004-12-20 18:40 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Hugh Dickins, Nick Piggin, Linux Memory Management, Andrew Morton


On Mon, 20 Dec 2004, Andi Kleen wrote:
> 
> But I'm not strongly opposed to it. If everybody else thinks "pud_t" 
> is the greatest thing since sliced bread and a much better name
> than "pml4_t", then I guess we could eat the delay and disruption
> that another round of these disruptive patches takes.

To me, it's not the name, but the _placement_.

"pml4_t" is at the _top_, and replaces "pgd_t" in that position. While 
"pud_t" is in the _middle_, and extends upon the existing practice of 
folding the mid directory.

I had a reason why I put "pmd_t" in between the old pgd_t and pte_t when I
expanded from two to three levels: it ends up adding the levels at the
point where they are conceptually the least intrusive.

By "conceptually least intrusive", think about this: one of the most core
header files in the kernel, <linux/sched.h> mentions "pgd_t", but it does
_not_ mention "pmd_t". Why?
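
(Concretely, the only page table reference in there is the top-level
pointer - roughly:

	struct mm_struct {
		...
		pgd_t *pgd;	/* root of the page table tree */
		...
	};

so as long as the top level keeps its name and type, sched.h never has
to know how many levels hang below it.)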

Basically, by doing the new folded table in the middle, it _only_ affects 
code that actually walks the page tables. Basically, what I wanted in the 
original 2->3 leval expansion was that people who don't use the new level 
should be able to conceptually totally ignore it. I think that is even 
more true in the 3->4 level expansion.

I haven't done any side-by-side comparisons on your original patches, and
on Nick's version of your patches, but I'm pretty certain that Nick's
patches are more "directed", with less noise. Not because of any name
issues, but because I think the right place to do the folding is in the
middle.

Quite frankly, I don't love Nick's patches either. I'd prefer to see the
infrastructure happen first - have the patch sequence first make _every_
single architecture use the "generic pud_t folding", and basically be in 
the position where the first <n> patches just do the syntactic part that 
makes it possible for then patches <n+1>, <n+2> to actually convert 
individual architectures that want it.

But Nick's patches seem to come fairly close to that.

So no, naming isn't the big difference. The conceptual difference is
bigger. It's just that once you conceptually do it in the middle, a
numbered name like "pml4_t" just doesn't make any sense (I don't think it
makes much sense at the top either, since there is no 1..2..3 to match it,
but that's a separate issue ;)

			Linus

* Re: [PATCH 10/10] alternate 4-level page tables patches
  2004-12-20 18:19                             ` Andi Kleen
@ 2004-12-20 18:47                               ` Linus Torvalds
  2004-12-20 18:52                                 ` Linus Torvalds
  2004-12-20 18:59                                 ` Andi Kleen
  0 siblings, 2 replies; 77+ messages in thread
From: Linus Torvalds @ 2004-12-20 18:47 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Nick Piggin, Linux Memory Management, Hugh Dickins, Andrew Morton


On Mon, 20 Dec 2004, Andi Kleen wrote:
> 
> I remember there was one, but they took a brute-force sledgehammer fix.
> The right fix would have been to add the noinlines, not penalize
> everybody.

No. 

Adding work-arounds to source code for broken compilers is just not 
acceptable. If some compiler feature works badly, it is _disabled_.

Look at "-fno-strict-aliasing". Exactly the same issue. Sure, we could 
have tried to find every place where it was an issue, but very 
fundamentally that's HARD. The issues aren't obvious from the source code, 
and the "fixes" are not obvious either and do not improve readability. 
Even though arguably the aliasing logic _could_ have helped other places.

So if a compiler does something we don't want to handle, we disable that
feature. It's just not _possible_ to audit the source code for these kinds
of compiler features unless you write a tool that does most of it
automatically (or at least points out where the things need to be done).

Once you start doing "noinline" and depend on those being right, you end
up having to support that forever - with new code inevitably causing
subtle breakage because of some strange compiler rule that in no way is
obvious (ie adding/removing a "static" just because you ended up exporting
it to somebody else suddenly has very non-local issues - that's BAD).

> It helps when you add the noinlines. I can do that later - search
> for Arjan's old report (I think he reported it), check what compiler
> version he used, compile everything with it and unit-at-a-time
> and eyeball all the big stack frames and add noinline
> if it should be really needed.

If you do that _first_, then sure. And have some automated checker tool
that we can run occasionally to verify that we don't break this magic rule
later by mistake.

			Linus

* Re: [PATCH 10/10] alternate 4-level page tables patches
  2004-12-20 18:47                               ` Linus Torvalds
@ 2004-12-20 18:52                                 ` Linus Torvalds
  2004-12-20 18:59                                 ` Andi Kleen
  1 sibling, 0 replies; 77+ messages in thread
From: Linus Torvalds @ 2004-12-20 18:52 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Nick Piggin, Linux Memory Management, Hugh Dickins, Andrew Morton


On Mon, 20 Dec 2004, Linus Torvalds wrote:
> 
> If you do that _first_, then sure. And have some automated checker tool
> that we can run occasionally to verify that we don't break this magic rule
> later by mistake.

Note: the reason I care so deeply is that this kind of problem tends to 
bite us _exactly_ where we don't want to be bitten: in random drivers and 
surrounding code that not very many actual core developers really 
end up using. 

If some subtle issue only happens in very specific code, it's much easier 
to work around. And if it happens in core code, you can at least rest easy 
in the knowledge that many people are going to get hit by it, and we can 
thus find it easily. 

So, ironically, the worst bugs are those that affect only a small
percentage of users. You'd think that the worst bugs are those that cause 
the most problems, but it actually ends up being exactly the other way 
around: the _least_ problems or the most _subtle_ problems are the ones 
that I'm nervous about. 

				Linus

* Re: [RFC][PATCH 0/10] alternate 4-level page tables patches
  2004-12-20 18:40     ` Linus Torvalds
@ 2004-12-20 18:53       ` Andi Kleen
  2004-12-21  0:04         ` Linus Torvalds
  0 siblings, 1 reply; 77+ messages in thread
From: Andi Kleen @ 2004-12-20 18:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Hugh Dickins, Nick Piggin, Linux Memory Management,
	Andrew Morton

> I haven't done any side-by-side comparisons on your original patches, and
> on Nick's version of your patches, but I'm pretty certain that Nick's
> patches are more "directed", with less noise. Not because of any name
> issues, but because I think the right place to do the folding is in the
> middle.

I don't think it's true. Currently his patches are much smaller,
but only because he didn't convert the architectures yet. I think 
if he does a full conversion it will be about the same size.

[like he originally hoped to not touch the architectures that
much, but it didn't work out] 


> 
> Quite frankly, I don't love Nick's patches either. I'd prefer to see the
> infrastructure happen first - have the patch sequence first make _every_
> single architecture use the "generic pud_t folding", and basically be in 
> the position where the first <n> patches just do the syntactic part that 
> makes it possible for then patches <n+1>, <n+2> to actually convert 
> individual architectures that want it.

I'm not sure what you mean with that. You have to convert the architectures,
otherwise they won't compile. That's true for my patch and true for
Nick's (except that he didn't do all the work of converting the archs yet)

While it may be possible to do some
hacks that allow code to stay unconverted, I intentionally didn't do this:
the risk of some common code not getting converted and breaking
true 4level page tables is too high.

At least my patchkit was
infrastructure (basically mm/* and a few related headers),
compat stuff (the nopml4-* emulation layer) and then a single patch
for each architecture.  You'll need to add it pretty much
all at once, otherwise things won't compile. I don't see how
you could do it less intrusively (Nick's patches definitely require
a similar flag day).

Currently only x86-64 is truly 4 level. ppc64 plans to be, but they
haven't done it yet. All the others seem to want to stay at 2 or 3 levels
for now.

Ok in theory you could leave out the x86-64 patch at first, but then
you would need a different patch that makes it use nopml4 (or pud_t) 

> So no, naming isn't the big difference. The conceptual difference is
> bigger. It's just that once you conceptually do it in the middle, a
> numbered name like "pml4_t" just doesn't make any sense (I don't think it

Sorry I didn't invent it, just copied it from the x86-64 architecture
manuals because I didn't see any reason to be different.

-Andi

* Re: [PATCH 10/10] alternate 4-level page tables patches
  2004-12-20 18:59                                 ` Andi Kleen
@ 2004-12-20 18:57                                   ` Randy.Dunlap
  0 siblings, 0 replies; 77+ messages in thread
From: Randy.Dunlap @ 2004-12-20 18:57 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, Nick Piggin, Linux Memory Management,
	Hugh Dickins, Andrew Morton

Andi Kleen wrote:
> 
>>If you do that _first_, then sure. And have some automated checker tool
>>that we can run occasionally to verify that we don't break this magic rule
>>later by mistake.
> 
> 
> scripts/checkstack.pl
> 
> There is probably a makefile target for it too, but I cannot find it 
> right now. Probably should be in make buildcheck.

It's in 'make checkstack'.

-- 
~Randy

* Re: [PATCH 10/10] alternate 4-level page tables patches
  2004-12-20 18:47                               ` Linus Torvalds
  2004-12-20 18:52                                 ` Linus Torvalds
@ 2004-12-20 18:59                                 ` Andi Kleen
  2004-12-20 18:57                                   ` Randy.Dunlap
  1 sibling, 1 reply; 77+ messages in thread
From: Andi Kleen @ 2004-12-20 18:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Nick Piggin, Linux Memory Management, Hugh Dickins,
	Andrew Morton

On Mon, Dec 20, 2004 at 10:47:05AM -0800, Linus Torvalds wrote:
> 
> 
> On Mon, 20 Dec 2004, Andi Kleen wrote:
> > 
> > I remember there was one, but they took a brute-force sledgehammer fix.
> > The right fix would have been to add the noinlines, not penalize
> > everybody.
> 
> No. 
> 
> Adding work-arounds to source code for broken compilers is just not 
> acceptable. If some compiler feature works badly, it is _disabled_.


> 
> Look at "-fno-strict-aliasing". Exactly the same issue. Sure, we could 
> have tried to find every place where it was an issue, but very 
> fundamentally that's HARD. The issues aren't obvious from the source code, 
> and the "fixes" are not obvious either and do not improve readability. 
> Even though arguably the aliasing logic _could_ have helped other places.

IMHO that's totally different. Yes aliasing problems are happening
everywhere because they cause problems with very common coding idioms.
But the big stack frame thing is pretty rare if it happens at all  
(I'm still sceptical) 

> 
> So if a compiler does something we don't want to handle, we disable that
> feature. It's just not _possible_ to audit the source code for these kinds
> of compiler features unless you write a tool that does most of it
> automatically (or at least points out where the things need to be done).

./scripts/checkstack.pl

> 
> Once you start doing "noinline" and depend on those being right, you end
> up having to support that forever - with new code inevitably causing
> subtle breakage because of some strange compiler rule that in no way is
> obvious (ie adding/removing a "static" just because you ended up exporting
> it to somebody else suddenly has very non-local issues - that's BAD).

You're vastly exaggerating the problem. It happens pretty seldom.

I think even Arjan only found one case or two in millions of lines
of code.

And as I said the stack frame sizes need to be regularly checked
anyways, since there seems to be a fraction of the driver people
who are just not aware of it (totally independent of unit-at-a-time) 

> 
> > It helps when you add the noinlines. I can do that later - search
> > for Arjan's old report (I think he reported it), check what compiler
> > version he used, compile everything with that compiler and unit-at-a-time,
> > eyeball all the big stack frames, and add noinline where it is
> > really needed.
> 
> If you do that _first_, then sure. And have some automated checker tool
> that we can run occasionally to verify that we don't break this magic rule
> later by mistake.

scripts/checkstack.pl

There is probably a makefile target for it too, but I cannot find it 
right now. Probably should be in make buildcheck.

-Andi

* Re: [RFC][PATCH 0/10] alternate 4-level page tables patches
  2004-12-20 18:53       ` Andi Kleen
@ 2004-12-21  0:04         ` Linus Torvalds
  2004-12-21  0:22           ` Andi Kleen
  0 siblings, 1 reply; 77+ messages in thread
From: Linus Torvalds @ 2004-12-21  0:04 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Hugh Dickins, Nick Piggin, Linux Memory Management, Andrew Morton


On Mon, 20 Dec 2004, Andi Kleen wrote:
> 
> I'm not sure what you mean with that. You have to convert the architectures,
> otherwise they won't compile. That's true for my patch and true for
> Nick's (except that he didn't do all the work of converting the archs yet)

Well, you do have to convert the architectures, in the sense that you need 
to fix up the types for the "pmd_offset()" etc functions.

But you shouldn't have to fix up anything else. Especially if "pgd_t" 
doesn't change, the _only_ things that need fixing up is anything that 
walks the page tables. Nothing else.

>>   It's just that once you conceptually do it in the middle, a
>> numbered name like "pml4_t" just doesn't make any sense (
>
> Sorry I didn't invent it, just copied it from the x86-64 architecture
> manuals because I didn't see any reason to be different.

The thing is, I doubt the x86-64 architecture manuals use "pgd", "pmd" and 
"pte", do they? So regardless, there's no consitent naming.

		Linus

* Re: [RFC][PATCH 0/10] alternate 4-level page tables patches
  2004-12-21  0:04         ` Linus Torvalds
@ 2004-12-21  0:22           ` Andi Kleen
  2004-12-21  0:43             ` Linus Torvalds
  2004-12-21  0:47             ` Nick Piggin
  0 siblings, 2 replies; 77+ messages in thread
From: Andi Kleen @ 2004-12-21  0:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Hugh Dickins, Nick Piggin, Linux Memory Management,
	Andrew Morton

On Mon, Dec 20, 2004 at 04:04:49PM -0800, Linus Torvalds wrote:
> 
> 
> On Mon, 20 Dec 2004, Andi Kleen wrote:
> > 
> > I'm not sure what you mean with that. You have to convert the architectures,
> > otherwise they won't compile. That's true for my patch and true for
> > Nick's (except that he didn't do all the work of converting the archs yet)
> 
> Well, you do have to convert the architectures, in the sense that you need 
> to fix up the types for the "pmd_offset()" etc functions.
> 
> But you shouldn't have to fix up anything else. Especially if "pgd_t" 
> doesn't change, the _only_ things that need fixing up is anything that 
> walks the page tables. Nothing else.

Actually anything that looks up anything in the page tables. 
And there is plenty of that in each architecture.

You have to break this code, otherwise you cannot catch the code
walking page tables and risk unconverted generic code.

I repeat again: the differences on what code needs
to be changed between my patchkit and Nick's are quite minor.

The main difference is just the naming. And that mine is actually
tested on many architectures and and has been in -mm* for some time
and is ready for merging, while Nick's is still in the early stages.

> 
> >>   It's just that once you conceptually do it in the middle, a
> >> numbered name like "pml4_t" just doesn't make any sense (
> >
> > Sorry I didn't invent it, just copied it from the x86-64 architecture
> > manuals because I didn't see any reason to be different.
> 
> The thing is, I doubt the x86-64 architecture manuals use "pgd", "pmd" and 
> "pte", do they? So regardless, there's no consitent naming.

There is consistent naming for the highest level at least. 

They use pte, pde, pdpe, pml4e (for the entries, the levels are
called pte, pde, pdp, pml4).

-Andi

* Re: [RFC][PATCH 0/10] alternate 4-level page tables patches
  2004-12-21  0:22           ` Andi Kleen
@ 2004-12-21  0:43             ` Linus Torvalds
  2004-12-21  0:47             ` Nick Piggin
  1 sibling, 0 replies; 77+ messages in thread
From: Linus Torvalds @ 2004-12-21  0:43 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Hugh Dickins, Nick Piggin, Linux Memory Management, Andrew Morton


On Tue, 21 Dec 2004, Andi Kleen wrote:
> 
> I repeat again: the differences on what code needs
> to be changed between my patchkit and Nick's are quite minor.

Oh, I believe that. But you don't seem to see what my argument was:

> The main difference is just the naming.

No. There's a more conceptual issue, as I mentioned.

Yes, the code ends up being the same, you'll always have

		pgd = pgd_offset(mm, address);
		if (!pgd_present(*pgd))
			goto out;
		pmd = pmd_offset(pgd, address);
		if (!pmd_present(*pmd))
			goto out;
		pte = pte_offset_map(pmd, address);
		if (!pte_present(*pte))
			goto out_unmap;
		...

	out_unmap:
		pte_unmap(pte);
	out:

and regardless of where you add the new thing, it's going to be the same. 
In one case you have a pml4_offset() _before_ the pgd_offset, in the other 
case you have a pud_offset() _after_ the pgd_offset. 
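
As a sketch, using each patchset's names, the same 4-level walk would
start out as either:

	/* new level at the top (pml4): */
	pml4 = pml4_offset(mm, address);
	pgd = pgd_offset(pml4, address);
	pmd = pmd_offset(pgd, address);
	...

or:

	/* new level in the middle (pud): */
	pgd = pgd_offset(mm, address);
	pud = pud_offset(pgd, address);
	pmd = pmd_offset(pud, address);
	...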

But that doesn't change the fact that you conceptually add it to two 
totally different locations. That you say "it's only naming" means that 
you don't see what I tried to argue. IT IS NOT ONLY NAMING.

I told you why I think the second location is the right one. You ignored 
it. Fine.

> > The thing is, I doubt the x86-64 architecture manuals use "pgd", "pmd" and 
> > "pte", do they? So regardless, there's no consitent naming.
> 
> There is consistent naming for the highest level at least. 
> 
> They use pte, pde, pdpe, pml4e (for the entries, the levels are
> called pte, pde, pdp, pml4).

My point is that using pml4 is clearly not consistent in _LINUX_, since
AMD doesn't use the names Linux uses for the other levels. So the naming 
_really_ doesn't matter. The only thing that matters is the location of 
the new level.

And I _guarantee_ that your patches touch more files than Nick's patches 
do. Exactly because you change the meaning (and name) of the top-level 
directory, which is referenced in places that don't otherwise care about 
the internals.

I already pointed you at <linux/sched.h> as an example of something that 
cares about the top level, but not the middle ones. Same goes for 
kernel/fork.c, for all the same reasons. 

Not a lot of code, I agree. I think it's an "how do you approach it" 
issue - the end result is largely the same, the _approach_ is different. 

And your approach means you change files that you have absolutely no 
reason to change. Like kernel/fork.c. 

		Linus

* Re: [RFC][PATCH 0/10] alternate 4-level page tables patches
  2004-12-21  0:22           ` Andi Kleen
  2004-12-21  0:43             ` Linus Torvalds
@ 2004-12-21  0:47             ` Nick Piggin
  2004-12-21  2:55               ` Hugh Dickins
  1 sibling, 1 reply; 77+ messages in thread
From: Nick Piggin @ 2004-12-21  0:47 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, Hugh Dickins, Linux Memory Management, Andrew Morton

On Tue, 2004-12-21 at 01:22 +0100, Andi Kleen wrote:

> I repeat again: the differences on what code needs
> to be changed between my patchkit and Nick's are quite minor.
> 

The thing I prefer about the pud is that the folding method is identical
to pmd. If you have a look at asm-generic/pgtable-nopmd.h and -nopud.h,
they are the same file, with a few things renamed.
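
A heavily abridged sketch of the shape of such a folding header:

	/* fold the pud level back into the pgd */
	typedef struct { pgd_t pgd; } pud_t;

	#define PUD_SHIFT	PGDIR_SHIFT
	#define PTRS_PER_PUD	1

	/* with the pud folded, a pgd entry can never be none or bad */
	static inline int pgd_none(pgd_t pgd)		{ return 0; }
	static inline int pgd_bad(pgd_t pgd)		{ return 0; }
	static inline int pgd_present(pgd_t pgd)	{ return 1; }

	static inline pud_t *pud_offset(pgd_t *pgd, unsigned long address)
	{
		return (pud_t *)pgd;	/* the "pud" is just the pgd slot */
	}

The nopmd variant is the same thing shifted down one level
(s/pud/pmd/, s/pgd/pud/).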

> The main difference is just the naming. And that mine is actually
> tested on many architectures and and has been in -mm* for some time
> and is ready for merging, while Nick's is still in the early stages.
> 

True, it will need more testing than yours would - yours could almost
go in as soon as 2.6.10 is released... but considering most of
the hard stuff _is_ your work, hopefully most problems should be
resolved already.

I understand you'd like the 4-levels patch to be present in 2.6.11... I
don't think that going with the "pud" version would necessarily prevent
that from happening.

> > 
> > >>   It's just that once you conceptually do it in the middle, a
> > >> numbered name like "pml4_t" just doesn't make any sense (
> > >
> > > Sorry I didn't invent it, just copied it from the x86-64 architecture
> > > manuals because I didn't see any reason to be different.
> > 
> > The thing is, I doubt the x86-64 architecture manuals use "pgd", "pmd" and 
> > "pte", do they? So regardless, there's no consitent naming.
> 
> There is consistent naming for the highest level at least. 
> 
> They use pte, pde, pdpe, pml4e (for the entries, the levels are
> called pte, pde, pdp, pml4).
> 

Well I won't argue about naming, because I don't think anyone cares
enough for it to be a problem. But pud is consistent with _Linux_
naming, at least (ie. p?d)...

Anyway, I'll continue to try to get more architecture support, and
let someone else decide between pud and pml4 ;) Although if it looks
like it is going to really slow down progress for you, then I am
happy to abandon it.

Nick



* Re: [RFC][PATCH 0/10] alternate 4-level page tables patches
  2004-12-21  0:47             ` Nick Piggin
@ 2004-12-21  2:55               ` Hugh Dickins
  2004-12-21  3:21                 ` Nick Piggin
  2004-12-21  3:47                 ` Linus Torvalds
  0 siblings, 2 replies; 77+ messages in thread
From: Hugh Dickins @ 2004-12-21  2:55 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, Linus Torvalds, Linux Memory Management, Andrew Morton

On Tue, 21 Dec 2004, Nick Piggin wrote:
> 
> Anyway, I'll continue to try to get more architecture support,

Sorry for being dense: despite your earlier explanation,
I've yet again lost track of why pud needs any flag day through
the architectures - beyond the inclusion of some generic nopud.h,
but different from the one you're rightly aiming for.

Good as they are, imagine setting aside your nopmd.h mods as a cleanup
for some other occasion.  Then wouldn't a generic nopud.h something like

#define pud_t				pgd_t
#define pud_alloc(mm, pgd, address)	(pgd)
#define pud_offset(pgd, start)		(pgd)
#define pud_none(pud)			0
#define pud_bad(pud)			0
#define pud_ERROR(pud)
#define pud_clear(pud)
#define PUD_SIZE			PGDIR_SIZE
#define PUD_MASK			PGDIR_MASK

get your kernel with common patch 6/10 compiling and working correctly
on all architectures?  with just a one-line mod to each architecture
to include it?

Your answer seems to be no, so I guess there's a place in the code
you can point to, which shows up the nonsense of this suggestion:
please humiliate me!

Certainly x86_64 then needs to use other definitions to get its 4levels
working.  And it'd be highly advisable to convert i386 and some other
common architectures (as you have already done) to use more typesafe
declarations in which a pud_t is distinct from a pgd_t, so that people
building mods to the common pagetable code cannot mix levels by mistake.

But I don't see why the pagetable code in each arch subdirectory needs
to have a pud level inserted all at once (whereas a flag day was needed
for the pml4 patch, because mm->pgd got replaced by mm->pml4).

Hugh


* Re: [RFC][PATCH 0/10] alternate 4-level page tables patches
  2004-12-21  2:55               ` Hugh Dickins
@ 2004-12-21  3:21                 ` Nick Piggin
  2004-12-21  3:47                 ` Linus Torvalds
  1 sibling, 0 replies; 77+ messages in thread
From: Nick Piggin @ 2004-12-21  3:21 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andi Kleen, Linus Torvalds, Linux Memory Management, Andrew Morton

On Tue, 2004-12-21 at 02:55 +0000, Hugh Dickins wrote:
> On Tue, 21 Dec 2004, Nick Piggin wrote:
> > 
> > Anyway, I'll continue to try to get more architecture support,
> 
> Sorry for being dense: despite your earlier explanation,
> I've yet again lost track of why pud needs any flag day through
> the architectures - beyond the inclusion of some generic nopud.h,
> but different from the one you're rightly aiming for.
> 
> Good as they are, imagine setting aside your nopmd.h mods as a cleanup
> for some other occasion.  Then wouldn't a generic nopud.h something like
> 
> #define pud_t				pgd_t
> #define pud_alloc(mm, pgd, address)	(pgd)
> #define pud_offset(pgd, start)		(pgd)
> #define pud_none(pud)			0
> #define pud_bad(pud)			0
> #define pud_ERROR(pud)
> #define pud_clear(pud)
> #define PUD_SIZE			PGDIR_SIZE
> #define PUD_MASK			PGDIR_MASK
> 
> get your kernel with common patch 6/10 compiling and working correctly
> on all architectures?  with just a one-line mod to each architecture
> to include it?
> 
> Your answer seems to be no, so I guess there's a place in the code
> you can point to, which shows up the nonsense of this suggestion:
> please humiliate me!
> 

Hmm yeah that would be ideal, I think.

Maybe I didn't explore it far enough... but __pmd_alloc will now be
doing a pud_populate to set up the allocated pmd, and that pud_populate
needs to be what pgd_populate was before the conversion...

Hmm, so I guess

#define pud_populate pgd_populate

would do it.

I did get about that far, but that was in the context of the nopmd
cleanup, where I was getting tangled up on these (and other) things.
You may be right in that it is much easier to do the above conversion
in terms of nopud only, because you are starting with a cleaner slate...




* Re: [RFC][PATCH 0/10] alternate 4-level page tables patches
  2004-12-21  2:55               ` Hugh Dickins
  2004-12-21  3:21                 ` Nick Piggin
@ 2004-12-21  3:47                 ` Linus Torvalds
  2004-12-21  3:56                   ` Linus Torvalds
  1 sibling, 1 reply; 77+ messages in thread
From: Linus Torvalds @ 2004-12-21  3:47 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Nick Piggin, Andi Kleen, Linux Memory Management, Andrew Morton


On Tue, 21 Dec 2004, Hugh Dickins wrote:

> On Tue, 21 Dec 2004, Nick Piggin wrote:
> > 
> > Anyway, I'll continue to try to get more architecture support,
> 
> Sorry for being dense: despite your earlier explanation,
> I've yet again lost track of why pud needs any flag day through
> the architectures - beyond the inclusion of some generic nopud.h,
> but different from the one you're rightly aiming for.

It really should not need one.

If you do it right (and "right" here means "wrong"), all architectures 
should continue to work, aside from the fact that they get some nasty 
pointer type warnings.

> Good as they are, imagine setting aside your nopmd.h mods as a cleanup
> for some other occasion.  Then wouldn't a generic nopud.h something like
> 
> #define pud_t				pgd_t
> #define pud_alloc(mm, pgd, address)	(pgd)
> #define pud_offset(pgd, start)		(pgd)
> #define pud_none(pud)			0
> #define pud_bad(pud)			0
> #define pud_ERROR(pud)
> #define pud_clear(pud)
> #define PUD_SIZE			PGDIR_SIZE
> #define PUD_MASK			PGDIR_MASK

That gets it 99% of the way, but the fact is, code that hasn't been
changed to actually _use_ pud_t etc will get a warning because it will
pass down a "pgd_t *" to the "pmd_present()" things, and that's wrong -
they should be converted to get "pud_t"s.

(Or, the other way around: if an architecture has _not_ updated its 
pmd_offset() etc friends, it will get horrible pointer type warnings from 
code that expects a pud_t).

But since such an architecture will actually only _have_ three levels of 
pages tables anyway, the warnings will be only warnings - the code 
generated should be correct anyway.

(It may be _possible_ to avoid the warnings by just making "pud_t" and
"pmd_t" be the same type for such architectures, and just allowing
_mixing_ of three-level and four-level accesses.  I have to say that I 
consider that pretty borderline programming practice though).

> But I don't see why the pagetable code in each arch subdirectory needs
> to have a pud level inserted all at once (whereas a flag day was needed
> for the pml4 patch, because mm->pgd got replaced by mm->pml4).

There is a "flag day", because even architectures that haven't been
updated to 4-level page tables will see the four-level page table accessor
functions in generic code. But see above: I think we can make the
"flagness"  be less critical, in the sense that it will generate warnings,
but the code will still work.

But yes, that really _requires_ that the new level is in the "middle", aka 
the pud_t approach of Nick's patches. And I may be missing something.

		Linus

* Re: [RFC][PATCH 0/10] alternate 4-level page tables patches
  2004-12-21  3:47                 ` Linus Torvalds
@ 2004-12-21  3:56                   ` Linus Torvalds
  2004-12-21  4:04                     ` Nick Piggin
                                       ` (2 more replies)
  0 siblings, 3 replies; 77+ messages in thread
From: Linus Torvalds @ 2004-12-21  3:56 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Nick Piggin, Andi Kleen, Linux Memory Management, Andrew Morton


On Mon, 20 Dec 2004, Linus Torvalds wrote:
> 
> (It may be _possible_ to avoid the warnings by just making "pud_t" and
> "pmd_t" be the same type for such architectures, and just allowing
> _mixing_ of three-level and four-level accesses.  I have to say that I 
> consider that pretty borderline programming practice though).

Actually, I notice that this is exactly what you did, sorry for not being 
more careful about reading your defines.

Thinking some more about it, I don't much like the "mixing" of 3-level and
4-level things, but since the only downside is a lack of type-safety for
the 4-level case (ie you can get it wrong without getting any warning),
and since that type safety _does_ exist in the case where the four levels 
are actually used, I think it's ok. 

It would be bad if the architecture that supported 4level page tables was
really rare and broken (so that mistakes would happen and not get noticed
for a while), but I suspect x86-64 by now is probably the second- or
third-most used architecture, so it's not like the lack of type safety on 
other architectures where it doesn't matter would be a huge maintenance 
problem.

Color me convinced. 

Nick, can you see if such a patch is possible? I'll test ppc64 still 
working..

		Linus

* Re: [RFC][PATCH 0/10] alternate 4-level page tables patches
  2004-12-21  3:56                   ` Linus Torvalds
@ 2004-12-21  4:04                     ` Nick Piggin
  2004-12-21  4:08                       ` Nick Piggin
  2004-12-21  9:36                     ` Andi Kleen
  2004-12-21 10:52                     ` Nick Piggin
  2 siblings, 1 reply; 77+ messages in thread
From: Nick Piggin @ 2004-12-21  4:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Andi Kleen, Linux Memory Management, Andrew Morton

On Mon, 2004-12-20 at 19:56 -0800, Linus Torvalds wrote:

> Color me convinced. 
> 
> Nick, can you see if such a patch is possible? I'll test ppc64 still 
> working..
> 

Yep, I'm beginning to think it is the way to go as well: we'll have all
the generic code and some key architectures compiling with the struct
type checking... and the 4-level fallback header will keep arch
maintainers from being inconvenienced while spitting out enough warnings
that they'll get on to fixing it.

I'll take a look shortly.

Nick



* Re: [RFC][PATCH 0/10] alternate 4-level page tables patches
  2004-12-21  4:04                     ` Nick Piggin
@ 2004-12-21  4:08                       ` Nick Piggin
  0 siblings, 0 replies; 77+ messages in thread
From: Nick Piggin @ 2004-12-21  4:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Andi Kleen, Linux Memory Management, Andrew Morton

On Tue, 2004-12-21 at 15:04 +1100, Nick Piggin wrote:
> On Mon, 2004-12-20 at 19:56 -0800, Linus Torvalds wrote:
> 
> > Color me convinced. 
> > 
> > Nick, can you see if such a patch is possible? I'll test ppc64 still 
> > working..
> > 
> 
> Yep, I'm beginning to think it is the way to go as well: we'll have all
> the generic code and some key architectures compiling with the struct

struct/stricter

Too much C for one day :P



* Re: [RFC][PATCH 0/10] alternate 4-level page tables patches
  2004-12-21  3:56                   ` Linus Torvalds
  2004-12-21  4:04                     ` Nick Piggin
@ 2004-12-21  9:36                     ` Andi Kleen
  2004-12-21 10:13                       ` Hugh Dickins
                                         ` (2 more replies)
  2004-12-21 10:52                     ` Nick Piggin
  2 siblings, 3 replies; 77+ messages in thread
From: Andi Kleen @ 2004-12-21  9:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Nick Piggin, Andi Kleen, Linux Memory Management,
	Andrew Morton

On Mon, Dec 20, 2004 at 07:56:36PM -0800, Linus Torvalds wrote:
> 
> 
> On Mon, 20 Dec 2004, Linus Torvalds wrote:
> > 
> > (It may be _possible_ to avoid the warnings by just making "pud_t" and
> > "pmd_t" be the same type for such architectures, and just allowing
> > _mixing_ of three-level and four-level accesses.  I have to say that I 
> > consider that pretty borderline programming practice though).
> 
> Actually, I notice that this is exactly what you did, sorry for not being 
> more careful about reading your defines.
> 
> Thinking some more about it, I don't much like the "mixing" of 3-level and
> 4-level things, but since the only downside is a lack of type-safety for
> the 4-level case (ie you can get it wrong without getting any warning),
> and since that type safety _does_ exist in the case where the four levels 
> are actually used, I think it's ok. 


Sorry, but I don't.

> 
> It would be bad if the architecture that supported 4level page tables was
> really rare and broken (so that mistakes would happen and not get noticed
> for a while), but I suspect x86-64 by now is probably the second- or
> third-most used architecture, so it's not like the lack of type safety on 
> other architectures where it doesn't matter would be a huge maintenance 
> problem.

Sorry, but I think that's a very bad approach. If the i386 users
don't get warnings I will need to spend a lot of time just patching
up after them. While x86-64 is getting more and more popular, most
hacking still happens on i386.

Please use a typesafe approach that causes warnings
and errors on i386 too. Otherwise it'll cause me a lot of additional
work in the long term. Trading the small advantage of a perhaps
slightly easier migration now for a long-term maintenance hassle
is a bad tradeoff IMHO.

Also, is the flag day really that bad? I already did nearly all the work
IMHO (with the help of some architecture maintainers, thanks guys!)
and the patches are really not *that* bad. Linus, can you please
at least take a second look at them before going with the
non-typesafe hack?

Thanks,

-Andi

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC][PATCH 0/10] alternate 4-level page tables patches
  2004-12-21  9:36                     ` Andi Kleen
@ 2004-12-21 10:13                       ` Hugh Dickins
  2004-12-21 10:59                       ` Nick Piggin
  2004-12-21 17:36                       ` Linus Torvalds
  2 siblings, 0 replies; 77+ messages in thread
From: Hugh Dickins @ 2004-12-21 10:13 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, Nick Piggin, Linux Memory Management, Andrew Morton

On Tue, 21 Dec 2004, Andi Kleen wrote:
> 
> Sorry, but I think that's a very bad approach. If the i386 users
> don't get warnings, I will need to spend a lot of time just patching
> up behind them. While x86-64 is getting more and more popular, most
> hacking still happens on i386.
> 
> Please use a type-safe approach that causes warnings and errors on
> i386 too; otherwise it'll cause me much additional work in the long
> term. Getting the small advantage of a perhaps slightly easier
> migration at the cost of a long-term maintenance hassle is a bad
> tradeoff IMHO.

I agree; that's what I was asking for too: if i386 is not initially
converted to a type-safe pud_t, then I'd soon want to add a patch
for that.  The type-unsafe pud_t == pgd_t is great for doing a
simple conversion of all architectures in one small patch, but
it in no way excludes implementing a type-safe pud_t on selected
(perhaps eventually all) architectures - both those that need it
for 4 levels and those where it's advisable for build testing.
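
For illustration, a minimal sketch of the two styles being discussed.
The type-unsafe lines follow the style of the proposed fallback header;
the type-safe variant is only a guess at what such a conversion could
look like (folded in the same style as the existing 2-level pmd
folding), not code from any posted patch:

/* Style 1, type-unsafe: pud_t is literally pgd_t, so a pgd/pud mixup
 * compiles silently on any architecture using this style:
 *
 *	#define pud_t			pgd_t
 *	#define pud_offset(pgd, addr)	(pgd)
 *	#define pud_present(pud)	1
 */

/* Style 2, type-safe (sketch only): the same bits, but a distinct C
 * type, so the compiler catches a pgd/pud mixup even though the level
 * is folded away. */
typedef struct { pgd_t pgd; } pud_t;

#define PTRS_PER_PUD	1

static inline pud_t *pud_offset(pgd_t *pgd, unsigned long address)
{
	return (pud_t *)pgd;	/* the pud lives inside the pgd entry */
}
static inline int pud_none(pud_t pud)		{ return 0; }
static inline int pud_present(pud_t pud)	{ return 1; }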

Hugh


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC][PATCH 0/10] alternate 4-level page tables patches
  2004-12-21  3:56                   ` Linus Torvalds
  2004-12-21  4:04                     ` Nick Piggin
  2004-12-21  9:36                     ` Andi Kleen
@ 2004-12-21 10:52                     ` Nick Piggin
  2 siblings, 0 replies; 77+ messages in thread
From: Nick Piggin @ 2004-12-21 10:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Andi Kleen, Linux Memory Management, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 768 bytes --]

Linus Torvalds wrote:

> Nick, can you see if such a patch is possible? I'll test ppc64 still 
> working..
> 

OK I seem to have got something working after fumbling around in the
dark for a bit. I apologise if it blows up straight away for you, which
isn't unlikely.

Tested only on i386 with 2 levels for now (there's not much point in
testing i386 with 3 levels, really). I'll do some testing on ia64 and
x86-64 tomorrow, but I've run out of time tonight.

You'll want the full rollup here (against 2.6.10-rc3):
http://www.kerneltrap.org/~npiggin/vm/4level.patch.gz

And attached is the broken-out patch (included in the above). An arch
only needs to include this from asm/pgtable.h, with no other changes
(a usage sketch follows the patch below). As you can see, it wasn't
_quite_ as clean as Hugh had hoped, but it's not too bad.

Nick

[-- Attachment #2: 4level-fallback.patch --]
[-- Type: text/plain, Size: 3259 bytes --]




---

 linux-2.6-npiggin/include/asm-generic/4level-fixup.h |   32 +++++++++++++++++++
 linux-2.6-npiggin/include/linux/mm.h                 |    6 +++
 linux-2.6-npiggin/mm/memory.c                        |   25 ++++++++++++++
 3 files changed, 63 insertions(+)

diff -puN /dev/null include/asm-generic/4level-fixup.h
--- /dev/null	2004-09-06 19:38:39.000000000 +1000
+++ linux-2.6-npiggin/include/asm-generic/4level-fixup.h	2004-12-21 20:27:48.000000000 +1100
@@ -0,0 +1,32 @@
+#ifndef _4LEVEL_FIXUP_H
+#define _4LEVEL_FIXUP_H
+
+#define __ARCH_HAS_4LEVEL_HACK
+
+#define PUD_SIZE			PGDIR_SIZE
+#define PUD_MASK			PGDIR_MASK
+#define PTRS_PER_PUD			1
+
+#define pud_t				pgd_t
+
+#define pmd_alloc(mm, pud, address)			\
+({	pmd_t *ret;					\
+	if (pgd_none(*pud))				\
+		ret = __pmd_alloc(mm, pud, address);	\
+	else						\
+		ret = pmd_offset(pud, address);		\
+	ret;						\
+})
+
+#define pud_alloc(mm, pgd, address)	(pgd)
+#define pud_offset(pgd, start)		(pgd)
+#define pud_none(pud)			0
+#define pud_bad(pud)			0
+#define pud_present(pud)		1
+#define pud_ERROR(pud)			do { printk("pud_ERROR\n"); BUG(); } while (0)
+#define pud_clear(pud)			do { } while (0)
+
+#define pud_free(x)			do { } while (0)
+#define __pud_free_tlb(tlb, x)		do { } while (0)
+
+#endif
diff -puN include/linux/mm.h~4level-fallback include/linux/mm.h
--- linux-2.6/include/linux/mm.h~4level-fallback	2004-12-21 20:27:48.000000000 +1100
+++ linux-2.6-npiggin/include/linux/mm.h	2004-12-21 20:27:48.000000000 +1100
@@ -631,6 +631,11 @@ extern void remove_shrinker(struct shrin
  * the inlining and the symmetry break with pte_alloc_map() that does all
  * of this out-of-line.
  */
+/*
+ * The following ifdef is needed to get the 4level-fixup.h header to work.
+ * Remove it when 4level-fixup.h has been removed.
+ */
+#ifndef __ARCH_HAS_4LEVEL_HACK 
 static inline pud_t *pud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address)
 {
 	if (pgd_none(*pgd))
@@ -644,6 +649,7 @@ static inline pmd_t *pmd_alloc(struct mm
 		return __pmd_alloc(mm, pud, address);
 	return pmd_offset(pud, address);
 }
+#endif
 
 extern void free_area_init(unsigned long * zones_size);
 extern void free_area_init_node(int nid, pg_data_t *pgdat,
diff -puN mm/memory.c~4level-fallback mm/memory.c
--- linux-2.6/mm/memory.c~4level-fallback	2004-12-21 20:27:48.000000000 +1100
+++ linux-2.6-npiggin/mm/memory.c	2004-12-21 20:27:48.000000000 +1100
@@ -1940,6 +1940,7 @@ int handle_mm_fault(struct mm_struct *mm
 	return VM_FAULT_OOM;
 }
 
+#ifndef __ARCH_HAS_4LEVEL_HACK
 #if (PTRS_PER_PGD > 1)
 /*
  * Allocate page upper directory.
@@ -2007,6 +2008,30 @@ out:
 	return pmd_offset(pud, address);
 }
 #endif
+#else
+pmd_t fastcall *__pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address)
+{
+	pmd_t *new;
+
+	spin_unlock(&mm->page_table_lock);
+	new = pmd_alloc_one(mm, address);
+	spin_lock(&mm->page_table_lock);
+	if (!new)
+		return NULL;
+
+	/*
+	 * Because we dropped the lock, we should re-check the
+	 * entry, as somebody else could have populated it..
+	 */
+	if (pgd_present(*pud)) {
+		pmd_free(new);
+		goto out;
+	}
+	pgd_populate(mm, pud, new);
+out:
+	return pmd_offset(pud, address);
+}
+#endif
 
 int make_pages_present(unsigned long addr, unsigned long end)
 {

_
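
For reference, a minimal sketch of how an unconverted architecture
would adopt the fallback header, and of what generic 4-level code then
looks like on it. This is illustration only, not part of the posted
patch - the arch name "foo" and the lookup_pte() helper are made up:

/* include/asm-foo/pgtable.h on a hypothetical unconverted 3-level
 * arch: pulling in the fixup header is the only change required. */
#include <asm-generic/4level-fixup.h>	/* pud_t becomes pgd_t */

/* Generic code can now walk four levels everywhere.  On this arch the
 * pud step costs nothing: pud_offset(pgd, addr) is just (pgd), and
 * pud_none()/pud_bad() are constant 0, so the checks fold away. */
static pte_t *lookup_pte(struct mm_struct *mm, unsigned long addr)
{
	pgd_t *pgd = pgd_offset(mm, addr);
	pud_t *pud;
	pmd_t *pmd;

	if (pgd_none(*pgd) || pgd_bad(*pgd))
		return NULL;
	pud = pud_offset(pgd, addr);
	if (pud_none(*pud) || pud_bad(*pud))
		return NULL;
	pmd = pmd_offset(pud, addr);
	if (pmd_none(*pmd) || pmd_bad(*pmd))
		return NULL;
	return pte_offset_map(pmd, addr);	/* caller must pte_unmap() */
}

Note that under the hack the __pmd_alloc() above deliberately keeps
using pgd_present()/pgd_populate() on its pud argument - that is
exactly the point of pud_t being pgd_t here.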

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC][PATCH 0/10] alternate 4-level page tables patches
  2004-12-21  9:36                     ` Andi Kleen
  2004-12-21 10:13                       ` Hugh Dickins
@ 2004-12-21 10:59                       ` Nick Piggin
  2004-12-21 17:36                       ` Linus Torvalds
  2 siblings, 0 replies; 77+ messages in thread
From: Nick Piggin @ 2004-12-21 10:59 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, Hugh Dickins, Linux Memory Management, Andrew Morton

Andi Kleen wrote:
> On Mon, Dec 20, 2004 at 07:56:36PM -0800, Linus Torvalds wrote:

>>It would be bad if the architecture that supported 4level page tables was
>>really rare and broken (so that mistakes would happen and not get noticed
>>for a while), but I suspect x86-64 by now is probably the second- or
>>third-most used architecture, so it's not like the lack of type safety on 
>>other architectures where it doesn't matter would be a huge maintenance 
>>problem.
> 
> 
> Sorry, but I think that's a very bad approach. If the i386 users
> don't get warnings, I will need to spend a lot of time just patching
> up behind them. While x86-64 is getting more and more popular, most
> hacking still happens on i386.
> 
> Please use a type-safe approach that causes warnings and errors on
> i386 too; otherwise it'll cause me much additional work in the long
> term. Getting the small advantage of a perhaps slightly easier
> migration at the cost of a long-term maintenance hassle is a bad
> tradeoff IMHO.
> 

Oh yes, you're right there. And i386 will get warnings. The un-type-safe
headers are just for those architectures that haven't converted over yet.

I'm somewhat on the fence with this.

On one hand, it does allow users, developers and arch maintainers of
more obscure architectures to keep their kernels working (provided that
I can get the thing working) and to migrate over slowly.

But on the other hand, is it really needed? As you said, you've already
done much of the arch work in your patch...

Maybe it is a good thing in that it would allow my patchset to be picked
up _sooner_, which would encourage arch maintainers and speed progress.
With any luck, all traces of it would be gone before 2.6.11 is released.


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC][PATCH 0/10] alternate 4-level page tables patches
  2004-12-21  9:36                     ` Andi Kleen
  2004-12-21 10:13                       ` Hugh Dickins
  2004-12-21 10:59                       ` Nick Piggin
@ 2004-12-21 17:36                       ` Linus Torvalds
  2004-12-21 20:19                         ` Andi Kleen
  2 siblings, 1 reply; 77+ messages in thread
From: Linus Torvalds @ 2004-12-21 17:36 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Hugh Dickins, Nick Piggin, Linux Memory Management, Andrew Morton


On Tue, 21 Dec 2004, Andi Kleen wrote:
> 
> Sorry, but I think that's a very bad approach. If the i386 users
> don't get warnings I will need to spend a lot of time just patching
> behind them. While x86-64 is getting more and more popular most
> hacking still happens on i386.

That's true, but it's not an issue for several reasons:

 - we can easily update just _x86_ to be type-safe (ie add the fourth 
   level to x86 just to get type safety, even if it's folded). That 
   doesn't mean that we have to worry about 20 _other_ architectures, that 
   most developers can't even test.

   IOW, the lack of type safety is not something forced by the approach.
   The lack of type safety is an _option_, to allow architectures to
   avoid a flag-day when everybody needs to switch.

   In fact, the lack of type safety would allow every single intermediate
   patch to always compile and work - on all architectures. That isn't
   true of the current series, and it's a really nice feature, because
   it means that you really can build the thing up entirely, up to the
   point where you "turn it on" one architecture at a time.

 - even if we left x86 type-unsafe, the fact is, the things that walk the 
   page tables almost never get changed. I don't remember the last time we 
   really changed things around all that much. So even without x86, it 
   likely wouldn't be a problem.
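
To make the type-safety point concrete, a sketch (mine, not from either
patchset) of the kind of slip being discussed, assuming pmd_offset() is
a properly typed inline on the converted architectures:

	pgd_t *pgd = pgd_offset(mm, addr);
	/* Slip: the pud_offset() step was forgotten.  With pud_t ==
	 * pgd_t this compiles silently; with a distinct (even folded)
	 * pud_t, gcc warns about the incompatible pointer type, so the
	 * mistake is caught on i386 too, not only on x86-64. */
	pmd_t *pmd = pmd_offset(pgd, addr);	/* meant: pmd_offset(pud, addr) */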

> Also, is the flag day really that bad?

I think that _avoiding_ a flag-day is always good. Also, more importantly,
it looks like this approach allows each patch to be smaller and more 
self-contained, ie we never have the situation where "uhhuh, now it won't 
compile on arch Xxxx for ten patches, until we turn things on". The 
smaller the patches are, the more obvious any problems will be.

Think of it this way: for random architecture X, the four-level page table 
patches really should make _no_ difference until they are enabled. So you 
can do 90% of the work, and be pretty confident that things work. Most 
importantly, if things _don't_ work before the thing has been enabled, 
that's a big clue ;)

And then, the last (small) patch for architecture X actually ends up 
enabling the work. Everybody will be happier with something like that, 
since it makes merging _much_ easier. For example, I'll have zero problems 
at all with merging the infrastructure the day after 2.6.10 is released, 
since I'll know that it won't hurt any of the other architectures, and it 
won't make trouble for anybody.

		Linus

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC][PATCH 0/10] alternate 4-level page tables patches
  2004-12-21 17:36                       ` Linus Torvalds
@ 2004-12-21 20:19                         ` Andi Kleen
  2004-12-21 23:49                           ` Nick Piggin
  0 siblings, 1 reply; 77+ messages in thread
From: Andi Kleen @ 2004-12-21 20:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Hugh Dickins, Nick Piggin, Linux Memory Management,
	Andrew Morton

> That's true, but it's not an issue for several reasons:
> 
>  - we can easily update just _x86_ to be type-safe (ie add the fourth 
>    level to x86 just to get type safety, even if it's folded). That 
>    doesn't mean that we have to worry about 20 _other_ architectures, that 
>    most developers can't even test.

I already covered nearly all of them anyway (m68k is the main exception),
and quite a few of them have even been tested, thanks to the port
maintainers.

> > Also, is the flag day really that bad?
> 
> I think that _avoiding_ a flag-day is always good. Also, more importantly,
> it looks like this approach allows each patch to be smaller and more 
> self-contained, ie we never have the situation where "uhhuh, now it won't 
> compile on arch Xxxx for ten patches, until we turn things on". The 
> smaller the patches are, the more obvious any problems will be.

With the warnings, the port maintainers will need to do the conversion
work anyway; they can't just leave the warnings in (at least if they
care to still maintain their code in the future).

> 
> Think of it this way: for random architecture X, the four-level page table 
> patches really should make _no_ difference until they are enabled. So you 
> can do 90% of the work, and be pretty confident that things work. Most 
> importantly, if things _don't_ work before the thing has been enabled, 
> that's a big clue ;)

My approach was to just do the straightforward conversions. The only
risk (from experience) so far was that things would not compile when I
forgot one replacement, but when they compile they tend to work.

I must say I would still prefer it if my patches were applied instead
of going through all of this again in a slightly different form.
E.g. who is doing all this "PUD" stuff? Nick's patch so far was only
a prototype and probably needs quite a bit more work, and then a new
-mm testing cycle.

-Andi

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC][PATCH 0/10] alternate 4-level page tables patches
  2004-12-21 20:19                         ` Andi Kleen
@ 2004-12-21 23:49                           ` Nick Piggin
  2004-12-22 10:38                             ` Andi Kleen
  0 siblings, 1 reply; 77+ messages in thread
From: Nick Piggin @ 2004-12-21 23:49 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, Hugh Dickins, Linux Memory Management, Andrew Morton

Andi Kleen wrote:

>>Think of it this way: for random architecture X, the four-level page table 
>>patches really should make _no_ difference until they are enabled. So you 
>>can do 90% of the work, and be pretty confident that things work. Most 
>>importantly, if things _don't_ work before the thing has been enabled, 
>>that's a big clue ;)
> 
> 
> My approach was to just do the straightforward conversions. The only
> risk (from experience) so far was that things would not compile when I
> forgot one replacement, but when they compile they tend to work.
> 

That is more or less the same with the 'pud' patches - the hard part
is in the infrastructure and generic code; the architectures are
generally pretty simple.

> I must say I would still prefer it if my patches were applied instead
> of going through all of this again in a slightly different form.
> E.g. who is doing all this "PUD" stuff? Nick's patch so far was only
> a prototype and probably needs quite a bit more work, and then a new
> -mm testing cycle.
> 

To summarise my position:

I would like 'pud' to go in because, once one of the implementations gets
into 2.6, it is going to be a lot harder to justify changing it. And I
personally like pud better (not the name, but the place), so I would
prefer that to get in. Again, that is nothing against your implementation
or your personal taste.

So all I can do is put up my suggestion and leave it to someone else to
decide. I'm not established/experienced enough in this code to be making
big choices.

I understand you'd be frustrated if 4level wasn't in 2.6.11, but as I
said, I don't think the choice of pud over pml4 would necessarily cause
such a delay.

As far as I understand, you don't have any problem with the 'pud'
implementation in principle?

Nick

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC][PATCH 0/10] alternate 4-level page tables patches
  2004-12-21 23:49                           ` Nick Piggin
@ 2004-12-22 10:38                             ` Andi Kleen
  2004-12-22 11:19                               ` Nick Piggin
  0 siblings, 1 reply; 77+ messages in thread
From: Andi Kleen @ 2004-12-22 10:38 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, Linus Torvalds, Hugh Dickins,
	Linux Memory Management, Andrew Morton

> I understand you'd be frustrated if 4level wasn't in 2.6.11, but as I
> said, I don't think the choice of pud over pml4 would necessarily cause
> such a delay.

It would require another testing cycle in -mm*, at least several
weeks, and probably some support from the arch maintainers again.
That may push it too late.

> 
> As far as I understand, you don't have any problem with the 'pud'
> implementation in principle?

I don't have anything directly against the name (although I'm still not sure
what it actually stands for) or the location (top level or mid level), 
but I'm worried about the delay of redoing the testing cycle completely.

I don't see any technical advantages of your approach over mine;
eventually all the work has to be done anyway, so in the end it boils
down to which names are preferred. However, I suspect you could use
your time better, Nick, than redoing things that have already been
done ;-)

-Andi


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC][PATCH 0/10] alternate 4-level page tables patches
  2004-12-22 10:38                             ` Andi Kleen
@ 2004-12-22 11:19                               ` Nick Piggin
  2004-12-22 11:23                                 ` Nick Piggin
  2004-12-22 18:07                                 ` Andi Kleen
  0 siblings, 2 replies; 77+ messages in thread
From: Nick Piggin @ 2004-12-22 11:19 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, Hugh Dickins, Linux Memory Management, Andrew Morton

Andi Kleen wrote:
>>I understand you'd be frustrated if 4level wasn't in 2.6.11, but as I
>>said, I don't think the choice of pud over pml4 would necessarily cause
>>such a delay.
> 
> 
> It would require another testing cycle in -mm*, at least several
> weeks, and probably some support from the arch maintainers again.
> That may push it too late.
> 

Yes, it would ideally need a week or so in -mm. And yes, arch
maintainers would need to give some support again, unfortunately: the
proposed fallback header is only a dirty make-this-compile hack that
shouldn't be propagated into a proper 2.6 release if possible.

> 
>>As far as I understand, you don't have any problem with the 'pud'
>>implementation in principle?
> 
> 
> I don't have anything directly against the name (although I'm still not sure
> what it actually stands for) or the location (top level or mid level), 
> but I'm worried about the delay of redoing the testing cycle completely.
> 

The name, I guess, comes from "upper": so you have global, upper and
middle page table directories, and it sort-of fits :)

But it is the location rather than the name that is the important
factor in my continuing to pursue this.

> I don't see any technical advantages of your approach over mine;
> eventually all the work has to be done anyway, so in the end it boils
> down to which names are preferred. However, I suspect you could use
> your time better, Nick, than redoing things that have already been
> done ;-)
> 

Well, I suspect there are no advantages at all if you look at the
compiled binary.

But the advantages I see in the source code are that a) pud folding
matches exactly how pmd folding was done on 2-level architectures, and
b) it doesn't touch either of the "business ends" of the page table
structure (i.e. the topmost or bottommost levels). I think these two
points give some (if only slight) advantage in maintainability and
consistency.
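
A sketch of the two layouts (my summary of the above, not from either
patchset):

	pml4 approach - the new level goes on top, so the folding magic
	sits in the topmost level:

		pml4 -> pgd -> pmd -> pte	(4-level archs, e.g. x86-64)
		pml4=pgd  ->   pmd -> pte	(3-level archs: top level folded)

	pud approach - pgd stays the top and pte stays the bottom; the
	new level folds away in the middle, the same way pmd already
	folds on 2-level archs:

		pgd -> pud -> pmd -> pte	(4-level archs)
		pgd=pud  ->   pmd -> pte	(3-level archs: pud folded)
		pgd=pud=pmd   ->      pte	(2-level archs, as today)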

It is unfortunate, and nobody's fault but my own, that I didn't look at your
patches earlier and work with you while you were still in the earlier stages
of coding. So I apologise for that.

I agree that the situation we now have, where I'm essentially posting a
"competing" implementation that is just a slight variation on your
patches but with less testing and arch work, is not ideal. The only
reason I feel strongly enough to have gone this far is that this is very
core code.

And yeah, I'm sure I could use my time better!! This is just a bedtime
project, which is why I have been a bit slow with it ;)


I hope we can reach a conclusion. I don't want to (nor am I in any way
in a position to) just say no to pml4. Nor do I want the situation where
nobody can agree and the choice ends up being made by a vote or other
means. But I do think there are legitimate reasons for pud over pml4.

If I can get the bulk of the architectures changed and tested, the arch
maintainers don't kick up too much fuss, it has a relatively
trouble-free run in -mm, and Andrew and Linus are still happy to merge
before 2.6.11, would you be OK with the pud version (in principle)?

Nick

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC][PATCH 0/10] alternate 4-level page tables patches
  2004-12-22 11:19                               ` Nick Piggin
@ 2004-12-22 11:23                                 ` Nick Piggin
  2004-12-22 18:07                                 ` Andi Kleen
  1 sibling, 0 replies; 77+ messages in thread
From: Nick Piggin @ 2004-12-22 11:23 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, Hugh Dickins, Linux Memory Management, Andrew Morton

Nick Piggin wrote:

> If I can get the bulk of the architectures changed and tested, the arch
> maintainers don't kick up too much fuss, it has a relatively
> trouble-free run in -mm, and Andrew and Linus are still happy to merge
> before 2.6.11, would you be OK with the pud version (in principle)?
> 

Well, I'll be away from tomorrow for the next few days... I hope
when I come back everything has been magically resolved :-)

Please play nicely, everyone!!

Nick

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC][PATCH 0/10] alternate 4-level page tables patches
  2004-12-22 11:19                               ` Nick Piggin
  2004-12-22 11:23                                 ` Nick Piggin
@ 2004-12-22 18:07                                 ` Andi Kleen
  2004-12-30 21:24                                   ` Nick Piggin
  1 sibling, 1 reply; 77+ messages in thread
From: Andi Kleen @ 2004-12-22 18:07 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, Linus Torvalds, Hugh Dickins,
	Linux Memory Management, Andrew Morton

On Wed, Dec 22, 2004 at 10:19:09PM +1100, Nick Piggin wrote:
> But the advantages I see in the source code are that a) pud folding
> matches exactly how pmd folding was done on 2-level architectures, and
> b) it doesn't touch either of the "business ends" of the page table
> structure (i.e. the topmost or bottommost levels). I think these two
> points give some (if only slight) advantage in maintainability and
> consistency.

Sure, but when it's merged, pml4_t (or p<whatever>_t) would then be
the "business end", so it doesn't make much difference in the longer
term. After all, future Linux coders will not really care about what
was in the past, just about what is in the code at the time they hack
on it.


> If I can get the bulk of the architectures changed and tested, the arch
> maintainers don't kick up too much fuss, it has a relatively
> trouble-free run in -mm, and Andrew and Linus are still happy to merge
> before 2.6.11, would you be OK with the pud version (in principle)?

I can't say I'm very enthusiastic about it (but more due to scheduling
issues than technical ones). I don't see anything wrong with the patches
in themselves, but I also don't think they have any particular advantages
over the pml4 version. But in the end, the main thing I care about is
that 4-level page tables get in in some form; where exactly the new
level is added, and how it is named, is secondary.

I would prefer it if this happened sooner, though, because the work
is not finished (the optimized walking is still needed), and I've just
been waiting for things to get merged and settled down a bit before
continuing.

-Andi


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC][PATCH 0/10] alternate 4-level page tables patches
  2004-12-22 18:07                                 ` Andi Kleen
@ 2004-12-30 21:24                                   ` Nick Piggin
  0 siblings, 0 replies; 77+ messages in thread
From: Nick Piggin @ 2004-12-30 21:24 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, Hugh Dickins, Linux Memory Management, Andrew Morton

Andi Kleen wrote:
> On Wed, Dec 22, 2004 at 10:19:09PM +1100, Nick Piggin wrote:
> 
>>But the advantages I see in the source code are that a) pud folding
>>matches exactly how pmd folding was done on 2-level architectures, and
>>b) it doesn't touch either of the "business ends" of the page table
>>structure (i.e. the topmost or bottommost levels). I think these two
>>points give some (if only slight) advantage in maintainability and
>>consistency.
> 
> 
> Sure, but when it's merged, pml4_t (or p<whatever>_t) would then be
> the "business end", so it doesn't make much difference in the longer
> term. After all, future Linux coders will not really care about what
> was in the past, just about what is in the code at the time they hack
> on it.
> 

Yeah, OK - raw-code-wise the pml4 patch isn't much different. But the
conceptual intrusiveness of having the folding 'magic' in the top-level
page table is a bit higher.

Also, pml4 does have some implementation intrusiveness in that it
introduces a new _way_ of folding the table, whereas pud folds in the
same manner as pmd.

> 
> 
>>If I can get the bulk of the architectures changed and tested, the arch
>>maintainers don't kick up too much fuss, it has a relatively
>>trouble-free run in -mm, and Andrew and Linus are still happy to merge
>>before 2.6.11, would you be OK with the pud version (in principle)?
> 
> 
> I can't say I'm very enthusiastic about it (but more due to scheduling
> issues than technical ones). I don't see anything wrong with the patches
> in themselves, but I also don't think they have any particular advantages
> over the pml4 version. But in the end, the main thing I care about is
> that 4-level page tables get in in some form; where exactly the new
> level is added, and how it is named, is secondary.
> 

So long as you are not completely against it, that is a good start ;)

> I would prefer it if this happened sooner, though, because the work
> is not finished (the optimized walking is still needed), and I've just
> been waiting for things to get merged and settled down a bit before
> continuing.
> 

Yeah sure. I can also try to help with that (regardless of which patch
is merged).

^ permalink raw reply	[flat|nested] 77+ messages in thread

end of thread, other threads:[~2004-12-30 21:24 UTC | newest]

Thread overview: 77+ messages
2004-12-18  6:55 [RFC][PATCH 0/10] alternate 4-level page tables patches Nick Piggin
2004-12-18  6:55 ` [PATCH 1/10] " Nick Piggin
2004-12-18  6:56   ` [PATCH 2/10] " Nick Piggin
2004-12-18  6:56     ` [PATCH 3/10] " Nick Piggin
2004-12-18  6:57       ` [PATCH 4/10] " Nick Piggin
2004-12-18  6:58         ` [PATCH 5/10] " Nick Piggin
2004-12-18  6:58           ` [PATCH 6/10] " Nick Piggin
2004-12-18  6:59             ` [PATCH 7/10] " Nick Piggin
2004-12-18  7:00               ` [PATCH 8/10] " Nick Piggin
2004-12-18  7:00                 ` [PATCH 9/10] " Nick Piggin
2004-12-18  7:01                   ` [PATCH 10/10] " Nick Piggin
2004-12-18  7:31                     ` Andi Kleen
2004-12-18  7:46                       ` Nick Piggin
2004-12-18  8:08                       ` Andrew Morton
2004-12-18  9:48                         ` Andi Kleen
2004-12-18 19:06                       ` Linus Torvalds
2004-12-20 17:43                         ` Andi Kleen
2004-12-20 17:47                           ` Randy.Dunlap
2004-12-20 18:08                           ` Linus Torvalds
2004-12-20 18:15                             ` Linus Torvalds
2004-12-20 18:19                             ` Andi Kleen
2004-12-20 18:47                               ` Linus Torvalds
2004-12-20 18:52                                 ` Linus Torvalds
2004-12-20 18:59                                 ` Andi Kleen
2004-12-20 18:57                                   ` Randy.Dunlap
2004-12-18  9:05         ` [PATCH 4/10] " Nick Piggin
2004-12-18  9:50           ` Andi Kleen
2004-12-18 10:06             ` Nick Piggin
2004-12-18 10:11               ` Andi Kleen
2004-12-18 10:22               ` Nick Piggin
2004-12-18 10:29                 ` Nick Piggin
2004-12-18 11:06               ` William Lee Irwin III
2004-12-18 11:17                 ` Nick Piggin
2004-12-18 11:32                   ` William Lee Irwin III
2004-12-18 11:55                     ` Nick Piggin
2004-12-18 12:46                       ` William Lee Irwin III
2004-12-18 12:48                         ` William Lee Irwin III
2004-12-19  0:05                         ` Nick Piggin
2004-12-19  0:20                           ` William Lee Irwin III
2004-12-19  0:38                             ` Nick Piggin
2004-12-19  1:01                               ` William Lee Irwin III
2004-12-19  1:31                             ` Linus Torvalds
2004-12-19  2:08                               ` William Lee Irwin III
2004-12-19  2:26                                 ` Nick Piggin
2004-12-19  5:23                                 ` Linus Torvalds
2004-12-19  6:02                                   ` William Lee Irwin III
2004-12-19 18:17                                     ` Linus Torvalds
2004-12-20  1:00                                       ` William Lee Irwin III
2004-12-18 10:45         ` William Lee Irwin III
2004-12-18 10:58           ` Nick Piggin
2004-12-19  0:07 ` [RFC][PATCH 0/10] " Hugh Dickins
2004-12-19  0:33   ` Nick Piggin
2004-12-20 18:04   ` Andi Kleen
2004-12-20 18:40     ` Linus Torvalds
2004-12-20 18:53       ` Andi Kleen
2004-12-21  0:04         ` Linus Torvalds
2004-12-21  0:22           ` Andi Kleen
2004-12-21  0:43             ` Linus Torvalds
2004-12-21  0:47             ` Nick Piggin
2004-12-21  2:55               ` Hugh Dickins
2004-12-21  3:21                 ` Nick Piggin
2004-12-21  3:47                 ` Linus Torvalds
2004-12-21  3:56                   ` Linus Torvalds
2004-12-21  4:04                     ` Nick Piggin
2004-12-21  4:08                       ` Nick Piggin
2004-12-21  9:36                     ` Andi Kleen
2004-12-21 10:13                       ` Hugh Dickins
2004-12-21 10:59                       ` Nick Piggin
2004-12-21 17:36                       ` Linus Torvalds
2004-12-21 20:19                         ` Andi Kleen
2004-12-21 23:49                           ` Nick Piggin
2004-12-22 10:38                             ` Andi Kleen
2004-12-22 11:19                               ` Nick Piggin
2004-12-22 11:23                                 ` Nick Piggin
2004-12-22 18:07                                 ` Andi Kleen
2004-12-30 21:24                                   ` Nick Piggin
2004-12-21 10:52                     ` Nick Piggin
