[PATCH 0/4] Demand faunting for huge pages

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/4] Demand faunting for huge pages
@ 2005-08-17 18:56 Adam Litke
  2005-08-17 19:03 ` [PATCH 1/4] x86-pte_huge Adam Litke
                   ` (4 more replies)
  0 siblings, 5 replies; 14+ messages in thread
From: Adam Litke @ 2005-08-17 18:56 UTC (permalink / raw)
  To: linux-mm; +Cc: agl, christoph, ak, kenneth.w.chen, david

The following patch set implements demand faulting for huge pages.  In
response to helpful feedback from Christoph Lameter, Kenneth Chen, and
Andi Kleen, I've split up the demand fault patch (previously posted on
LKML: http://lkml.org/lkml/2005/8/5/154 ) into a smaller, more
digestible set.

The first three patches should be pretty clear-cut and harmless and just
make way for a neater switch to demand faulting.  The code touched by
the x86 patches is either already present or (AFAICT) not needed for
other architectures.  Comments?  Anyone want to try this out on their
specific huge page workload and architecture combinati?

The patches are:
  x86-pte_huge - Create pte_huge() test function
  x86-move-stale-pgtable - Check for stale pte in huge_pte_alloc()
  x86-walk-check - Check for not present huge page table entries
  htlb-fault - Demand faulting for huge pages

Patches coming soon in reply to this message.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/4] x86-pte_huge
  2005-08-17 18:56 [PATCH 0/4] Demand faunting for huge pages Adam Litke
@ 2005-08-17 19:03 ` Adam Litke
  2005-08-17 19:18   ` Dave Hansen
  2005-08-17 19:03 ` [PATCH 2/4] x86-move-stale-pgtable Adam Litke
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 14+ messages in thread
From: Adam Litke @ 2005-08-17 19:03 UTC (permalink / raw)
  To: linux-mm; +Cc: christoph, ak, kenneth.w.chen, david

Initial Post (Wed, 17 Aug 2005)

This patch adds a macro pte_huge(pte) for i386/x86_64  which is needed by a
patch later in the series.  Instead of repeating (_PAGE_PRESENT | _PAGE_PSE),
I've added __LARGE_PTE to i386 to match x86_64.

Diffed against 2.6.13-rc6-git7

Signed-off-by: Adam Litke <agl@us.ibm.com>
---
 asm-i386/pgtable.h   |    4 +++-
 asm-x86_64/pgtable.h |    3 ++-
 2 files changed, 5 insertions(+), 2 deletions(-)
diff -upN reference/include/asm-i386/pgtable.h current/include/asm-i386/pgtable.h
--- reference/include/asm-i386/pgtable.h
+++ current/include/asm-i386/pgtable.h
@@ -215,11 +215,13 @@ extern unsigned long pg0[];
  * The following only work if pte_present() is true.
  * Undefined behaviour if not..
  */
+#define __LARGE_PTE (_PAGE_PSE | _PAGE_PRESENT)
 static inline int pte_user(pte_t pte)		{ return (pte).pte_low & _PAGE_USER; }
 static inline int pte_read(pte_t pte)		{ return (pte).pte_low & _PAGE_USER; }
 static inline int pte_dirty(pte_t pte)		{ return (pte).pte_low & _PAGE_DIRTY; }
 static inline int pte_young(pte_t pte)		{ return (pte).pte_low & _PAGE_ACCESSED; }
 static inline int pte_write(pte_t pte)		{ return (pte).pte_low & _PAGE_RW; }
+static inline int pte_huge(pte_t pte)		{ return ((pte).pte_low & __LARGE_PTE) == __LARGE_PTE; }
 
 /*
  * The following only works if pte_present() is not true.
@@ -236,7 +238,7 @@ static inline pte_t pte_mkexec(pte_t pte
 static inline pte_t pte_mkdirty(pte_t pte)	{ (pte).pte_low |= _PAGE_DIRTY; return pte; }
 static inline pte_t pte_mkyoung(pte_t pte)	{ (pte).pte_low |= _PAGE_ACCESSED; return pte; }
 static inline pte_t pte_mkwrite(pte_t pte)	{ (pte).pte_low |= _PAGE_RW; return pte; }
-static inline pte_t pte_mkhuge(pte_t pte)	{ (pte).pte_low |= _PAGE_PRESENT | _PAGE_PSE; return pte; }
+static inline pte_t pte_mkhuge(pte_t pte)	{ (pte).pte_low |= __LARGE_PTE; return pte; }
 
 #ifdef CONFIG_X86_PAE
 # include <asm/pgtable-3level.h>
diff -upN reference/include/asm-x86_64/pgtable.h current/include/asm-x86_64/pgtable.h
--- reference/include/asm-x86_64/pgtable.h
+++ current/include/asm-x86_64/pgtable.h
@@ -247,6 +247,7 @@ static inline pte_t pfn_pte(unsigned lon
  * The following only work if pte_present() is true.
  * Undefined behaviour if not..
  */
+#define __LARGE_PTE (_PAGE_PSE|_PAGE_PRESENT)
 static inline int pte_user(pte_t pte)		{ return pte_val(pte) & _PAGE_USER; }
 extern inline int pte_read(pte_t pte)		{ return pte_val(pte) & _PAGE_USER; }
 extern inline int pte_exec(pte_t pte)		{ return pte_val(pte) & _PAGE_USER; }
@@ -254,8 +255,8 @@ extern inline int pte_dirty(pte_t pte)		
 extern inline int pte_young(pte_t pte)		{ return pte_val(pte) & _PAGE_ACCESSED; }
 extern inline int pte_write(pte_t pte)		{ return pte_val(pte) & _PAGE_RW; }
 static inline int pte_file(pte_t pte)		{ return pte_val(pte) & _PAGE_FILE; }
+static inline int pte_huge(pte_t pte)           { return (pte_val(pte) & __LARGE_PTE) == __LARGE_PTE; }
 
-#define __LARGE_PTE (_PAGE_PSE|_PAGE_PRESENT)
 extern inline pte_t pte_rdprotect(pte_t pte)	{ set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_USER)); return pte; }
 extern inline pte_t pte_exprotect(pte_t pte)	{ set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_USER)); return pte; }
 extern inline pte_t pte_mkclean(pte_t pte)	{ set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_DIRTY)); return pte; }


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/4] x86-pte_huge
  2005-08-17 19:03 ` [PATCH 1/4] x86-pte_huge Adam Litke
@ 2005-08-17 19:18   ` Dave Hansen
  2005-08-17 19:27     ` Adam Litke
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Hansen @ 2005-08-17 19:18 UTC (permalink / raw)
  To: Adam Litke; +Cc: linux-mm, christoph, ak, kenneth.w.chen, david

On Wed, 2005-08-17 at 14:03 -0500, Adam Litke wrote:
> @@ -254,8 +255,8 @@ extern inline int pte_dirty(pte_t pte)              
>  extern inline int pte_young(pte_t pte)         { return pte_val(pte) & _PAGE_ACCESSED; }
>  extern inline int pte_write(pte_t pte)         { return pte_val(pte) & _PAGE_RW; }
>  static inline int pte_file(pte_t pte)          { return pte_val(pte) & _PAGE_FILE; }
> +static inline int pte_huge(pte_t pte)           { return (pte_val(pte) & __LARGE_PTE) == __LARGE_PTE; }

Looks like a little whitespace issue.  Probably just tabs vs. spaces.

-- Dave

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/4] x86-pte_huge
  2005-08-17 19:18   ` Dave Hansen
@ 2005-08-17 19:27     ` Adam Litke
  0 siblings, 0 replies; 14+ messages in thread
From: Adam Litke @ 2005-08-17 19:27 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-mm, christoph, ak, kenneth.w.chen, david

On Wed, 2005-08-17 at 12:18 -0700, Dave Hansen wrote:
> Looks like a little whitespace issue.  Probably just tabs vs. spaces.

Ughh, don't know how that slipped in.

Fixed whitespace issue in asm-x86_64/pgtable.h

Initial Post (Wed, 17 Aug 2005)

This patch adds a macro pte_huge(pte) for i386/x86_64  which is needed by a
patch later in the series.  Instead of repeating (_PAGE_PRESENT | _PAGE_PSE),
I've added __LARGE_PTE to i386 to match x86_64.

Diffed against 2.6.13-rc6-git7

Signed-off-by: Adam Litke <agl@us.ibm.com>
---
 asm-i386/pgtable.h   |    4 +++-
 asm-x86_64/pgtable.h |    3 ++-
 2 files changed, 5 insertions(+), 2 deletions(-)
diff -upN reference/include/asm-i386/pgtable.h current/include/asm-i386/pgtable.h
--- reference/include/asm-i386/pgtable.h
+++ current/include/asm-i386/pgtable.h
@@ -215,11 +215,13 @@ extern unsigned long pg0[];
  * The following only work if pte_present() is true.
  * Undefined behaviour if not..
  */
+#define __LARGE_PTE (_PAGE_PSE | _PAGE_PRESENT)
 static inline int pte_user(pte_t pte)		{ return (pte).pte_low & _PAGE_USER; }
 static inline int pte_read(pte_t pte)		{ return (pte).pte_low & _PAGE_USER; }
 static inline int pte_dirty(pte_t pte)		{ return (pte).pte_low & _PAGE_DIRTY; }
 static inline int pte_young(pte_t pte)		{ return (pte).pte_low & _PAGE_ACCESSED; }
 static inline int pte_write(pte_t pte)		{ return (pte).pte_low & _PAGE_RW; }
+static inline int pte_huge(pte_t pte)		{ return ((pte).pte_low & __LARGE_PTE) == __LARGE_PTE; }
 
 /*
  * The following only works if pte_present() is not true.
@@ -236,7 +238,7 @@ static inline pte_t pte_mkexec(pte_t pte
 static inline pte_t pte_mkdirty(pte_t pte)	{ (pte).pte_low |= _PAGE_DIRTY; return pte; }
 static inline pte_t pte_mkyoung(pte_t pte)	{ (pte).pte_low |= _PAGE_ACCESSED; return pte; }
 static inline pte_t pte_mkwrite(pte_t pte)	{ (pte).pte_low |= _PAGE_RW; return pte; }
-static inline pte_t pte_mkhuge(pte_t pte)	{ (pte).pte_low |= _PAGE_PRESENT | _PAGE_PSE; return pte; }
+static inline pte_t pte_mkhuge(pte_t pte)	{ (pte).pte_low |= __LARGE_PTE; return pte; }
 
 #ifdef CONFIG_X86_PAE
 # include <asm/pgtable-3level.h>
diff -upN reference/include/asm-x86_64/pgtable.h current/include/asm-x86_64/pgtable.h
--- reference/include/asm-x86_64/pgtable.h
+++ current/include/asm-x86_64/pgtable.h
@@ -247,6 +247,7 @@ static inline pte_t pfn_pte(unsigned lon
  * The following only work if pte_present() is true.
  * Undefined behaviour if not..
  */
+#define __LARGE_PTE (_PAGE_PSE|_PAGE_PRESENT)
 static inline int pte_user(pte_t pte)		{ return pte_val(pte) & _PAGE_USER; }
 extern inline int pte_read(pte_t pte)		{ return pte_val(pte) & _PAGE_USER; }
 extern inline int pte_exec(pte_t pte)		{ return pte_val(pte) & _PAGE_USER; }
@@ -254,8 +255,8 @@ extern inline int pte_dirty(pte_t pte)		
 extern inline int pte_young(pte_t pte)		{ return pte_val(pte) & _PAGE_ACCESSED; }
 extern inline int pte_write(pte_t pte)		{ return pte_val(pte) & _PAGE_RW; }
 static inline int pte_file(pte_t pte)		{ return pte_val(pte) & _PAGE_FILE; }
+static inline int pte_huge(pte_t pte)		{ return (pte_val(pte) & __LARGE_PTE) == __LARGE_PTE; }
 
-#define __LARGE_PTE (_PAGE_PSE|_PAGE_PRESENT)
 extern inline pte_t pte_rdprotect(pte_t pte)	{ set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_USER)); return pte; }
 extern inline pte_t pte_exprotect(pte_t pte)	{ set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_USER)); return pte; }
 extern inline pte_t pte_mkclean(pte_t pte)	{ set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_DIRTY)); return pte; }


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 2/4] x86-move-stale-pgtable
  2005-08-17 18:56 [PATCH 0/4] Demand faunting for huge pages Adam Litke
  2005-08-17 19:03 ` [PATCH 1/4] x86-pte_huge Adam Litke
@ 2005-08-17 19:03 ` Adam Litke
  2005-08-17 19:04 ` [PATCH 3/4] x86-walk-check Adam Litke
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 14+ messages in thread
From: Adam Litke @ 2005-08-17 19:03 UTC (permalink / raw)
  To: linux-mm; +Cc: christoph, ak, kenneth.w.chen, david

Initial Post (Wed, 17 Aug 2005)

This patch moves the
	if (! pte_none(*pte))
		hugetlb_clean_stale_pgtable(pte);
logic into huge_pte_alloc() so all of its callers can be immune to the bug
described by Kenneth Chen at http://lkml.org/lkml/2004/6/16/246

> It turns out there is a bug in hugetlb_prefault(): with 3 level page table,
> huge_pte_alloc() might return a pmd that points to a PTE page. It happens
> if the virtual address for hugetlb mmap is recycled from previously used
> normal page mmap. free_pgtables() might not scrub the pmd entry on
> munmap and hugetlb_prefault skips on any pmd presence regardless what type 
> it is.

Unless I am missing something, it seems more correct to place the check inside
huge_pte_alloc() to prevent a the same bug wherever a huge pte is allocated.
It also allows checking for this condition when lazily faulting huge pages
later in the series.

Diffed against 2.6.13-rc6-git7

Signed-off-by: Adam Litke <agl@us.ibm.com>
---
 arch/i386/mm/hugetlbpage.c |   13 +++++++++++--
 mm/hugetlb.c               |    2 --
 2 files changed, 11 insertions(+), 4 deletions(-)
diff -upN reference/arch/i386/mm/hugetlbpage.c current/arch/i386/mm/hugetlbpage.c
--- reference/arch/i386/mm/hugetlbpage.c
+++ current/arch/i386/mm/hugetlbpage.c
@@ -22,12 +22,21 @@ pte_t *huge_pte_alloc(struct mm_struct *
 {
 	pgd_t *pgd;
 	pud_t *pud;
-	pmd_t *pmd = NULL;
+	pmd_t *pmd;
+	pte_t *pte = NULL;
 
 	pgd = pgd_offset(mm, addr);
 	pud = pud_alloc(mm, pgd, addr);
 	pmd = pmd_alloc(mm, pud, addr);
-	return (pte_t *) pmd;
+
+	if (!pmd)
+		goto out;
+	
+	pte = (pte_t *) pmd;
+	if (!pte_none(*pte) && !pte_huge(*pte))
+		hugetlb_clean_stale_pgtable(pte);
+out:
+	return pte;
 }
 
 pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
diff -upN reference/mm/hugetlb.c current/mm/hugetlb.c
--- reference/mm/hugetlb.c
+++ current/mm/hugetlb.c
@@ -360,8 +360,6 @@ int hugetlb_prefault(struct address_spac
 			ret = -ENOMEM;
 			goto out;
 		}
-		if (! pte_none(*pte))
-			hugetlb_clean_stale_pgtable(pte);
 
 		idx = ((addr - vma->vm_start) >> HPAGE_SHIFT)
 			+ (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 3/4] x86-walk-check
  2005-08-17 18:56 [PATCH 0/4] Demand faunting for huge pages Adam Litke
  2005-08-17 19:03 ` [PATCH 1/4] x86-pte_huge Adam Litke
  2005-08-17 19:03 ` [PATCH 2/4] x86-move-stale-pgtable Adam Litke
@ 2005-08-17 19:04 ` Adam Litke
  2005-08-17 19:41   ` Dave Hansen
  2005-08-17 19:05 ` [PATCH 4/4] htlb-fault Adam Litke
  2005-08-17 21:04 ` [PATCH 0/4] Demand faunting for huge pages Andi Kleen
  4 siblings, 1 reply; 14+ messages in thread
From: Adam Litke @ 2005-08-17 19:04 UTC (permalink / raw)
  To: linux-mm; +Cc: christoph, ak, kenneth.w.chen, david

Initial Post (Wed, 17 Aug 2005)

For demand faulting, we cannot assume that the page tables will be populated.
Do what the rest of the architectures do and test p?d_present() while walking
down the page table.

Diffed against 2.6.13-rc6-git7

Signed-off-by: Adam Litke <agl@us.ibm.com>

---
 hugetlbpage.c |    8 ++++++--
 1 files changed, 6 insertions(+), 2 deletions(-)
diff -upN reference/arch/i386/mm/hugetlbpage.c current/arch/i386/mm/hugetlbpage.c
--- reference/arch/i386/mm/hugetlbpage.c
+++ current/arch/i386/mm/hugetlbpage.c
@@ -46,8 +46,12 @@ pte_t *huge_pte_offset(struct mm_struct 
 	pmd_t *pmd = NULL;
 
 	pgd = pgd_offset(mm, addr);
-	pud = pud_offset(pgd, addr);
-	pmd = pmd_offset(pud, addr);
+	if (pgd_present(*pgd)) {
+		pud = pud_offset(pgd, addr);
+		if (pud_present(*pud)) {
+			pmd = pmd_offset(pud, addr);
+		}
+	}
 	return (pte_t *) pmd;
 }
 


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/4] x86-walk-check
  2005-08-17 19:04 ` [PATCH 3/4] x86-walk-check Adam Litke
@ 2005-08-17 19:41   ` Dave Hansen
  0 siblings, 0 replies; 14+ messages in thread
From: Dave Hansen @ 2005-08-17 19:41 UTC (permalink / raw)
  To: Adam Litke; +Cc: linux-mm, christoph, ak, kenneth.w.chen, david

On Wed, 2005-08-17 at 14:04 -0500, Adam Litke wrote:
> 
> +       if (pgd_present(*pgd)) {
> +               pud = pud_offset(pgd, addr);
> +               if (pud_present(*pud)) {
> +                       pmd = pmd_offset(pud, addr);
> +               }
> +       }

You can probably kill that extra set of braces on the indented if().

Or, do something like this (which I think is a little bit more
consistent with a lot of other code.

-       pud = pud_offset(pgd, addr);
-       pmd = pmd_offset(pud, addr);
+       if (!pgd_present(*pgd))
+		goto out;
+
+       pud = pud_offset(pgd, addr);
+       if (pud_present(*pud))
+       	pmd = pmd_offset(pud, addr);
+
+out:
        return (pte_t *) pmd;

-- Dave

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 4/4] htlb-fault
  2005-08-17 18:56 [PATCH 0/4] Demand faunting for huge pages Adam Litke
                   ` (2 preceding siblings ...)
  2005-08-17 19:04 ` [PATCH 3/4] x86-walk-check Adam Litke
@ 2005-08-17 19:05 ` Adam Litke
  2005-08-17 21:04 ` [PATCH 0/4] Demand faunting for huge pages Andi Kleen
  4 siblings, 0 replies; 14+ messages in thread
From: Adam Litke @ 2005-08-17 19:05 UTC (permalink / raw)
  To: linux-mm; +Cc: christoph, ak, kenneth.w.chen, david

Version 2 (Wed, 17 Aug 2005)
        Removed spurious WARN_ON()
    Patches added earlier in the series:
        Check for p?d_none() in arch/i386/mm/hugetlbpage.c:huge_pte_offset()
	Move i386 stale pte check into huge_pte_alloc()

Initial Post (Fri, 05 Aug 2005)

Below is a patch to implement demand faulting for huge pages.  The main
motivation for changing from prefaulting to demand faulting is so that
huge page memory areas can be allocated according to NUMA policy.

Thanks to consolidated hugetlb code, switching the behavior requires changing
only one fault handler.  The bulk of the patch just moves the logic from 
hugelb_prefault() to hugetlb_pte_fault().

Diffed against 2.6.13-rc6-git7

Signed-off-by: Adam Litke <agl@us.ibm.com>
---
 fs/hugetlbfs/inode.c    |    6 --
 include/linux/hugetlb.h |    2 
 mm/hugetlb.c            |  137 +++++++++++++++++++++++++++---------------------
 mm/memory.c             |    7 --
 4 files changed, 82 insertions(+), 70 deletions(-)
diff -upN reference/fs/hugetlbfs/inode.c current/fs/hugetlbfs/inode.c
--- reference/fs/hugetlbfs/inode.c
+++ current/fs/hugetlbfs/inode.c
@@ -48,7 +48,6 @@ int sysctl_hugetlb_shm_group;
 static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	struct inode *inode = file->f_dentry->d_inode;
-	struct address_space *mapping = inode->i_mapping;
 	loff_t len, vma_len;
 	int ret;
 
@@ -79,10 +78,7 @@ static int hugetlbfs_file_mmap(struct fi
 	if (!(vma->vm_flags & VM_WRITE) && len > inode->i_size)
 		goto out;
 
-	ret = hugetlb_prefault(mapping, vma);
-	if (ret)
-		goto out;
-
+	ret = 0;
 	if (inode->i_size < len)
 		inode->i_size = len;
 out:
diff -upN reference/include/linux/hugetlb.h current/include/linux/hugetlb.h
--- reference/include/linux/hugetlb.h
+++ current/include/linux/hugetlb.h
@@ -25,6 +25,8 @@ int is_hugepage_mem_enough(size_t);
 unsigned long hugetlb_total_pages(void);
 struct page *alloc_huge_page(void);
 void free_huge_page(struct page *);
+int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct * vma,
+			unsigned long address, int write_access);
 
 extern unsigned long max_huge_pages;
 extern const unsigned long hugetlb_zero, hugetlb_infinity;
diff -upN reference/mm/hugetlb.c current/mm/hugetlb.c
--- reference/mm/hugetlb.c
+++ current/mm/hugetlb.c
@@ -277,18 +277,20 @@ int copy_hugetlb_page_range(struct mm_st
 	unsigned long addr = vma->vm_start;
 	unsigned long end = vma->vm_end;
 
-	while (addr < end) {
+	for (; addr < end; addr += HPAGE_SIZE) {
+		src_pte = huge_pte_offset(src, addr);
+		if (!src_pte || pte_none(*src_pte))
+			continue;
+		
 		dst_pte = huge_pte_alloc(dst, addr);
 		if (!dst_pte)
 			goto nomem;
-		src_pte = huge_pte_offset(src, addr);
-		BUG_ON(!src_pte || pte_none(*src_pte)); /* prefaulted */
+		BUG_ON(!src_pte);
 		entry = *src_pte;
 		ptepage = pte_page(entry);
 		get_page(ptepage);
 		add_mm_counter(dst, rss, HPAGE_SIZE / PAGE_SIZE);
 		set_huge_pte_at(dst, addr, dst_pte, entry);
-		addr += HPAGE_SIZE;
 	}
 	return 0;
 
@@ -338,61 +340,6 @@ void zap_hugepage_range(struct vm_area_s
 	spin_unlock(&mm->page_table_lock);
 }
 
-int hugetlb_prefault(struct address_space *mapping, struct vm_area_struct *vma)
-{
-	struct mm_struct *mm = current->mm;
-	unsigned long addr;
-	int ret = 0;
-
-	WARN_ON(!is_vm_hugetlb_page(vma));
-	BUG_ON(vma->vm_start & ~HPAGE_MASK);
-	BUG_ON(vma->vm_end & ~HPAGE_MASK);
-
-	hugetlb_prefault_arch_hook(mm);
-
-	spin_lock(&mm->page_table_lock);
-	for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
-		unsigned long idx;
-		pte_t *pte = huge_pte_alloc(mm, addr);
-		struct page *page;
-
-		if (!pte) {
-			ret = -ENOMEM;
-			goto out;
-		}
-
-		idx = ((addr - vma->vm_start) >> HPAGE_SHIFT)
-			+ (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
-		page = find_get_page(mapping, idx);
-		if (!page) {
-			/* charge the fs quota first */
-			if (hugetlb_get_quota(mapping)) {
-				ret = -ENOMEM;
-				goto out;
-			}
-			page = alloc_huge_page();
-			if (!page) {
-				hugetlb_put_quota(mapping);
-				ret = -ENOMEM;
-				goto out;
-			}
-			ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
-			if (! ret) {
-				unlock_page(page);
-			} else {
-				hugetlb_put_quota(mapping);
-				free_huge_page(page);
-				goto out;
-			}
-		}
-		add_mm_counter(mm, rss, HPAGE_SIZE / PAGE_SIZE);
-		set_huge_pte_at(mm, addr, pte, make_huge_pte(vma, page));
-	}
-out:
-	spin_unlock(&mm->page_table_lock);
-	return ret;
-}
-
 int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			struct page **pages, struct vm_area_struct **vmas,
 			unsigned long *position, int *length, int i)
@@ -440,3 +387,75 @@ int follow_hugetlb_page(struct mm_struct
 
 	return i;
 }
+
+int hugetlb_pte_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+			unsigned long address, int write_access)
+{
+	int ret = VM_FAULT_MINOR;
+	unsigned long idx;
+	pte_t *pte;
+	struct page *page;
+	struct address_space *mapping;
+
+	BUG_ON(vma->vm_start & ~HPAGE_MASK);
+	BUG_ON(vma->vm_end & ~HPAGE_MASK);
+	BUG_ON(!vma->vm_file);
+
+	pte = huge_pte_alloc(mm, address);
+	if (!pte) {
+		ret = VM_FAULT_SIGBUS;
+		goto out;
+	}
+	if (! pte_none(*pte))
+		goto flush;
+
+	mapping = vma->vm_file->f_mapping;
+	idx = ((address - vma->vm_start) >> HPAGE_SHIFT)
+		+ (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
+retry:
+	page = find_get_page(mapping, idx);
+	if (!page) {
+		/* charge the fs quota first */
+		if (hugetlb_get_quota(mapping)) {
+			ret = VM_FAULT_SIGBUS;
+			goto out;
+		}
+		page = alloc_huge_page();
+		if (!page) {
+			hugetlb_put_quota(mapping);
+			ret = VM_FAULT_SIGBUS;
+			goto out;
+		}
+		if (add_to_page_cache(page, mapping, idx, GFP_ATOMIC)) {
+			put_page(page);
+			goto retry;
+		}
+		unlock_page(page);
+	}
+	add_mm_counter(mm, rss, HPAGE_SIZE / PAGE_SIZE);
+	set_huge_pte_at(mm, address, pte, make_huge_pte(vma, page));
+flush:
+	flush_tlb_page(vma, address);
+out:
+	return ret;
+}
+
+int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+			unsigned long address, int write_access)
+{
+	pte_t *ptep;
+	int rc = VM_FAULT_MINOR;
+
+	spin_lock(&mm->page_table_lock);
+
+	ptep = huge_pte_alloc(mm, address & HPAGE_MASK);
+	if (! ptep) {
+		rc = VM_FAULT_SIGBUS;
+		goto out;
+	}
+	if (pte_none(*ptep))
+		rc = hugetlb_pte_fault(mm, vma, address, write_access);
+out:
+	spin_unlock(&mm->page_table_lock);
+	return rc;
+}
diff -upN reference/mm/memory.c current/mm/memory.c
--- reference/mm/memory.c
+++ current/mm/memory.c
@@ -937,11 +937,6 @@ int get_user_pages(struct task_struct *t
 				|| !(flags & vma->vm_flags))
 			return i ? : -EFAULT;
 
-		if (is_vm_hugetlb_page(vma)) {
-			i = follow_hugetlb_page(mm, vma, pages, vmas,
-						&start, &len, i);
-			continue;
-		}
 		spin_lock(&mm->page_table_lock);
 		do {
 			int write_access = write;
@@ -2034,7 +2029,7 @@ int __handle_mm_fault(struct mm_struct *
 	inc_page_state(pgfault);
 
 	if (is_vm_hugetlb_page(vma))
-		return VM_FAULT_SIGBUS;	/* mapping truncation does this. */
+		return hugetlb_fault(mm, vma, address, write_access);
 
 	/*
 	 * We need the page table lock to synchronize with kswapd


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/4] Demand faunting for huge pages
  2005-08-17 18:56 [PATCH 0/4] Demand faunting for huge pages Adam Litke
                   ` (3 preceding siblings ...)
  2005-08-17 19:05 ` [PATCH 4/4] htlb-fault Adam Litke
@ 2005-08-17 21:04 ` Andi Kleen
  2005-08-18  0:33   ` David Gibson
  2005-08-18 20:29   ` Adam Litke
  4 siblings, 2 replies; 14+ messages in thread
From: Andi Kleen @ 2005-08-17 21:04 UTC (permalink / raw)
  To: Adam Litke; +Cc: linux-mm, christoph, ak, kenneth.w.chen, david

What about the overcommit issue Ken noted? It needs to be solved
in some way at least, either with the full check or the lazy simple
check.

Also I still think your get_user_pages approach is questionable.

-Andi
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/4] Demand faunting for huge pages
  2005-08-17 21:04 ` [PATCH 0/4] Demand faunting for huge pages Andi Kleen
@ 2005-08-18  0:33   ` David Gibson
  2005-08-18  0:35     ` Andi Kleen
  2005-08-18 15:29     ` Ray Bryant
  2005-08-18 20:29   ` Adam Litke
  1 sibling, 2 replies; 14+ messages in thread
From: David Gibson @ 2005-08-18  0:33 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Adam Litke, linux-mm, christoph, kenneth.w.chen

On Wed, Aug 17, 2005 at 11:04:32PM +0200, Andi Kleen wrote:
> 
> What about the overcommit issue Ken noted? It needs to be solved
> in some way at least, either with the full check or the lazy simple
> check.

Hrm... I'm not 100% convinced that just allowing overcommit isn't the
right thing to do.  Overcommit has some unfortunate consequences, but
the semantics are clearly defined and trivial to implement.

Strict accounting leads to nicer behaviour in some cases - you'll tend
to die early rather than late - but it seems an awful lot of work for
a fairly small improvement in behaviour.

If we add copy-on-write for hugepages (i.e. MAP_PRIVATE support)
strict accounting is even harder to implement, and has clearly-wrong
behaviour in some circumstances: a process using most of the system's
hugepages, mapped MAP_PRIVATE couldn't fork()/exec() a trivial helper
program.

> Also I still think your get_user_pages approach is questionable.
> 
> -Andi
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/people/dgibson
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/4] Demand faunting for huge pages
  2005-08-18  0:33   ` David Gibson
@ 2005-08-18  0:35     ` Andi Kleen
  2005-08-18 20:33       ` Adam Litke
  2005-08-18 15:29     ` Ray Bryant
  1 sibling, 1 reply; 14+ messages in thread
From: Andi Kleen @ 2005-08-18  0:35 UTC (permalink / raw)
  To: David Gibson; +Cc: Andi Kleen, Adam Litke, linux-mm, christoph, kenneth.w.chen

On Thu, Aug 18, 2005 at 10:33:02AM +1000, David Gibson wrote:
> On Wed, Aug 17, 2005 at 11:04:32PM +0200, Andi Kleen wrote:
> > 
> > What about the overcommit issue Ken noted? It needs to be solved
> > in some way at least, either with the full check or the lazy simple
> > check.
> 
> Hrm... I'm not 100% convinced that just allowing overcommit isn't the
> right thing to do.  Overcommit has some unfortunate consequences, but
> the semantics are clearly defined and trivial to implement.

I disagree. With Linux's primitive hugepage allocation scheme (static
pool that is usually too small) at least simple overcommit check
is absolutely essential.

> Strict accounting leads to nicer behaviour in some cases - you'll tend
> to die early rather than late - but it seems an awful lot of work for
> a fairly small improvement in behaviour.

Strict is a lot of work, but a simple "right in 99% of all cases, but racy" 
check is quite easy.

-Andi
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/4] Demand faunting for huge pages
  2005-08-18  0:35     ` Andi Kleen
@ 2005-08-18 20:33       ` Adam Litke
  0 siblings, 0 replies; 14+ messages in thread
From: Adam Litke @ 2005-08-18 20:33 UTC (permalink / raw)
  To: Andi Kleen; +Cc: David Gibson, linux-mm, christoph, kenneth.w.chen

On Thu, 2005-08-18 at 02:35 +0200, Andi Kleen wrote:
> I disagree. With Linux's primitive hugepage allocation scheme (static
> pool that is usually too small) at least simple overcommit check
> is absolutely essential.
> 
> > Strict accounting leads to nicer behaviour in some cases - you'll tend
> > to die early rather than late - but it seems an awful lot of work for
> > a fairly small improvement in behaviour.
> 
> Strict is a lot of work, but a simple "right in 99% of all cases, but racy" 
> check is quite easy.

How about something like the following?
---
Initial Post (Thu, 18 Aug 2005)

Basic overcommit checking for hugetlb_file_map() based on an implementation
used with demand faulting in SLES9.

Since demand faulting can't guarantee the availability of pages at mmap time,
this patch implements a basic sanity check to ensure that the number of huge
pages required to satisfy the mmap are currently available.  Despite the
obvious race, I think it is a good start on doing proper accounting.  I'd like
to work towards an accounting system that mimics the semantics of normal pages
(especially for the MAP_PRIVATE/COW case).  That work is underway and builds on
what this patch starts.

Huge page shared memory segments are simpler and still maintain their commit on shmget semantics.

Diffed against 2.6.13-rc6-git7

Signed-off-by: Adam Litke <agl@us.ibm.com>

---
 fs/hugetlbfs/inode.c    |   36 ++++++++++++++++++++++++++++++++++++
 include/linux/hugetlb.h |    3 +++
 2 files changed, 39 insertions(+)
diff -upN reference/fs/hugetlbfs/inode.c current/fs/hugetlbfs/inode.c
--- reference/fs/hugetlbfs/inode.c
+++ current/fs/hugetlbfs/inode.c
@@ -45,9 +45,41 @@ static struct backing_dev_info hugetlbfs
 
 int sysctl_hugetlb_shm_group;
 
+static void huge_pagevec_release(struct pagevec *pvec);
+
+unsigned long
+huge_pages_needed(struct address_space *mapping, struct vm_area_struct *vma,
+		unsigned long start, unsigned long end)
+{
+	int i;
+	struct pagevec pvec;
+	unsigned long hugepages = (end - start) >> HPAGE_SHIFT;
+	pgoff_t next = vma->vm_pgoff + ((start - vma->vm_start)>>PAGE_SHIFT);
+	pgoff_t endpg = next + ((end - start) >> PAGE_SHIFT);
+
+	pagevec_init(&pvec, 0);
+	while (next < endpg) {
+		if (!pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE))
+			break;
+		for (i = 0; i < pagevec_count(&pvec); i++) {
+			struct page *page = pvec.pages[i];
+			if (page->index > next)
+				next = page->index;
+			if (page->index >= endpg)
+				break;
+			next++;
+			hugepages--;
+		}
+		huge_pagevec_release(&pvec);
+	}
+	return hugepages << HPAGE_SHIFT;
+}
+
 static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	struct inode *inode = file->f_dentry->d_inode;
+	struct address_space *mapping = inode->i_mapping;
+	unsigned long bytes;
 	loff_t len, vma_len;
 	int ret;
 
@@ -66,6 +98,10 @@ static int hugetlbfs_file_mmap(struct fi
 	if (vma->vm_end - vma->vm_start < HPAGE_SIZE)
 		return -EINVAL;
 
+	bytes = huge_pages_needed(mapping, vma, vma->vm_start, vma->vm_end);
+	if (!is_hugepage_mem_enough(bytes))
+		return -ENOMEM;
+
 	vma_len = (loff_t)(vma->vm_end - vma->vm_start);
 
 	down(&inode->i_sem);
diff -upN reference/include/linux/hugetlb.h current/include/linux/hugetlb.h
--- reference/include/linux/hugetlb.h
+++ current/include/linux/hugetlb.h
@@ -42,6 +42,9 @@ struct page *follow_huge_pmd(struct mm_s
 				pmd_t *pmd, int write);
 int is_aligned_hugepage_range(unsigned long addr, unsigned long len);
 int pmd_huge(pmd_t pmd);
+unsigned long huge_pages_needed(struct address_space *mapping,
+			struct vm_area_struct *vma,
+			unsigned long start, unsigned long end);
 
 #ifndef ARCH_HAS_HUGEPAGE_ONLY_RANGE
 #define is_hugepage_only_range(mm, addr, len)	0


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/4] Demand faunting for huge pages
  2005-08-18  0:33   ` David Gibson
  2005-08-18  0:35     ` Andi Kleen
@ 2005-08-18 15:29     ` Ray Bryant
  1 sibling, 0 replies; 14+ messages in thread
From: Ray Bryant @ 2005-08-18 15:29 UTC (permalink / raw)
  To: David Gibson
  Cc: Andi Kleen, Adam Litke, linux-mm, christoph, kenneth.w.chen, akpm

On Wednesday 17 August 2005 19:33, David Gibson wrote:

>
> Strict accounting leads to nicer behaviour in some cases - you'll tend
> to die early rather than late - but it seems an awful lot of work for
> a fairly small improvement in behaviour.
>

The last time we went around on this (April 2004?) Andrew thought that adding 
demand allocation for hugetlb pages without strict accounting was effectively 
an ABI change -- in the current approach the mmap() will fail if you ask for 
too many hugetlb pages whilst in the demand fault approach you will get 
SIGBUS at a later point in time.   At one time this was considered serious 
enough to fix.

Andy Whitcroft provided some code for the patch that Ken and I did back in
April 2004 time frame.   I can't find that one but the following patch from
Christoph Lameter appears to be the code.  The idea is that at mmap() time
a strict reservation is made that guarantees the necessary number of 
hugetlb pages is available. 

http://marc.theaimsgroup.com/?l=linux-kernel&m=109842250714489&w=2

-- 
Ray Bryant
AMD Performance Labs                   Austin, Tx
512-602-0038 (o)                 512-507-7807 (c)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/4] Demand faunting for huge pages
  2005-08-17 21:04 ` [PATCH 0/4] Demand faunting for huge pages Andi Kleen
  2005-08-18  0:33   ` David Gibson
@ 2005-08-18 20:29   ` Adam Litke
  1 sibling, 0 replies; 14+ messages in thread
From: Adam Litke @ 2005-08-18 20:29 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-mm, christoph, kenneth.w.chen, david

On Wed, 2005-08-17 at 23:04 +0200, Andi Kleen wrote:
> Also I still think your get_user_pages approach is questionable.

I am pretty sure that my approach is safe and merely removes an
optimization.  Hopefully the following better states my reasons for
thinking so.  If anyone else who was involved in the demand fault
discussion when it went around the last time (see below) could chime in
I think it would help further clarify the issue.

---

Initial Post (Thu, 18 Aug 2005)

In preparation for hugetlb demand faulting, remove this get_user_pages()
optimization.  Since huge pages will no longer be prefaulted, we can't assume
that the huge ptes are established and hence, calling follow_hugetlb_page() is
not valid.

With the follow_hugetlb_page() call removed, the normal code path will be
triggered.  follow_page() will either use follow_huge_addr() or
follow_huge_pmd() to check for a previously faulted "page" to return.  When
this fails (ie. with demand faults), __handle_mm_fault() gets called which
invokes the hugetlb_fault() handler to instantiate the huge page.

This patch doesn't make a lot of sense by itself, but I've broken it out to
facilitate discussion on this specific element of the demand fault changes.
While coding this up, I referenced previous discussion on this topic starting
at http://lkml.org/lkml/2004/4/13/176 , which contains more opinions about the
correctness of this approach.

Diffed against 2.6.13-rc6-git7

Signed-off-by: Adam Litke <agl@us.ibm.com>

---
 memory.c |    5 -----
 1 files changed, 5 deletions(-)
diff -upN reference/mm/memory.c current/mm/memory.c
--- reference/mm/memory.c
+++ current/mm/memory.c
@@ -937,11 +937,6 @@ int get_user_pages(struct task_struct *t
 				|| !(flags & vma->vm_flags))
 			return i ? : -EFAULT;

-		if (is_vm_hugetlb_page(vma)) {
-			i = follow_hugetlb_page(mm, vma, pages, vmas,
-						&start, &len, i);
-			continue;
-		}
 		spin_lock(&mm->page_table_lock);
 		do {
 			int write_access = write;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2005-08-18 20:38 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-08-17 18:56 [PATCH 0/4] Demand faunting for huge pages Adam Litke
2005-08-17 19:03 ` [PATCH 1/4] x86-pte_huge Adam Litke
2005-08-17 19:18   ` Dave Hansen
2005-08-17 19:27     ` Adam Litke
2005-08-17 19:03 ` [PATCH 2/4] x86-move-stale-pgtable Adam Litke
2005-08-17 19:04 ` [PATCH 3/4] x86-walk-check Adam Litke
2005-08-17 19:41   ` Dave Hansen
2005-08-17 19:05 ` [PATCH 4/4] htlb-fault Adam Litke
2005-08-17 21:04 ` [PATCH 0/4] Demand faunting for huge pages Andi Kleen
2005-08-18  0:33   ` David Gibson
2005-08-18  0:35     ` Andi Kleen
2005-08-18 20:33       ` Adam Litke
2005-08-18 15:29     ` Ray Bryant
2005-08-18 20:29   ` Adam Litke

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox