* [PATCH 0/4] Demand faulting for huge pages
@ 2005-08-17 18:56 Adam Litke
2005-08-17 19:03 ` [PATCH 1/4] x86-pte_huge Adam Litke
` (4 more replies)
0 siblings, 5 replies; 14+ messages in thread
From: Adam Litke @ 2005-08-17 18:56 UTC (permalink / raw)
To: linux-mm; +Cc: agl, christoph, ak, kenneth.w.chen, david
The following patch set implements demand faulting for huge pages. In
response to helpful feedback from Christoph Lameter, Kenneth Chen, and
Andi Kleen, I've split up the demand fault patch (previously posted on
LKML: http://lkml.org/lkml/2005/8/5/154 ) into a smaller, more
digestible set.
The first three patches should be pretty clear-cut and harmless and just
make way for a neater switch to demand faulting. The code touched by
the x86 patches is either already present or (AFAICT) not needed for
other architectures. Comments? Anyone want to try this out on their
specific huge page workload and architecture combination?
The patches are:
x86-pte_huge - Create pte_huge() test function
x86-move-stale-pgtable - Check for stale pte in huge_pte_alloc()
x86-walk-check - Check for not present huge page table entries
htlb-fault - Demand faulting for huge pages
Patches coming soon in reply to this message.
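At a high level, the series replaces the old prefault loop with a
fault-time path.  The sketch below is only an illustration of that
shape (find_or_alloc_huge_page() is a made-up placeholder standing in
for the page cache lookup and allocation); the real locking, quota,
and error handling are in the htlb-fault patch (4/4).

static int hugetlb_fault_sketch(struct mm_struct *mm,
				struct vm_area_struct *vma,
				unsigned long address, int write_access)
{
	pte_t *pte = huge_pte_alloc(mm, address & HPAGE_MASK);
	struct page *page;

	if (!pte)
		return VM_FAULT_SIGBUS;
	if (!pte_none(*pte))
		return VM_FAULT_MINOR;	/* raced with another thread */

	/* Find the huge page in the page cache or allocate a fresh one
	 * (placeholder for the find_get_page()/alloc_huge_page() logic). */
	page = find_or_alloc_huge_page(vma, address);
	if (!page)
		return VM_FAULT_SIGBUS;

	add_mm_counter(mm, rss, HPAGE_SIZE / PAGE_SIZE);
	set_huge_pte_at(mm, address, pte, make_huge_pte(vma, page));
	return VM_FAULT_MINOR;
}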
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 14+ messages in thread* [PATCH 1/4] x86-pte_huge 2005-08-17 18:56 [PATCH 0/4] Demand faunting for huge pages Adam Litke @ 2005-08-17 19:03 ` Adam Litke 2005-08-17 19:18 ` Dave Hansen 2005-08-17 19:03 ` [PATCH 2/4] x86-move-stale-pgtable Adam Litke ` (3 subsequent siblings) 4 siblings, 1 reply; 14+ messages in thread From: Adam Litke @ 2005-08-17 19:03 UTC (permalink / raw) To: linux-mm; +Cc: christoph, ak, kenneth.w.chen, david Initial Post (Wed, 17 Aug 2005) This patch adds a macro pte_huge(pte) for i386/x86_64 which is needed by a patch later in the series. Instead of repeating (_PAGE_PRESENT | _PAGE_PSE), I've added __LARGE_PTE to i386 to match x86_64. Diffed against 2.6.13-rc6-git7 Signed-off-by: Adam Litke <agl@us.ibm.com> --- asm-i386/pgtable.h | 4 +++- asm-x86_64/pgtable.h | 3 ++- 2 files changed, 5 insertions(+), 2 deletions(-) diff -upN reference/include/asm-i386/pgtable.h current/include/asm-i386/pgtable.h --- reference/include/asm-i386/pgtable.h +++ current/include/asm-i386/pgtable.h @@ -215,11 +215,13 @@ extern unsigned long pg0[]; * The following only work if pte_present() is true. * Undefined behaviour if not.. */ +#define __LARGE_PTE (_PAGE_PSE | _PAGE_PRESENT) static inline int pte_user(pte_t pte) { return (pte).pte_low & _PAGE_USER; } static inline int pte_read(pte_t pte) { return (pte).pte_low & _PAGE_USER; } static inline int pte_dirty(pte_t pte) { return (pte).pte_low & _PAGE_DIRTY; } static inline int pte_young(pte_t pte) { return (pte).pte_low & _PAGE_ACCESSED; } static inline int pte_write(pte_t pte) { return (pte).pte_low & _PAGE_RW; } +static inline int pte_huge(pte_t pte) { return ((pte).pte_low & __LARGE_PTE) == __LARGE_PTE; } /* * The following only works if pte_present() is not true. @@ -236,7 +238,7 @@ static inline pte_t pte_mkexec(pte_t pte static inline pte_t pte_mkdirty(pte_t pte) { (pte).pte_low |= _PAGE_DIRTY; return pte; } static inline pte_t pte_mkyoung(pte_t pte) { (pte).pte_low |= _PAGE_ACCESSED; return pte; } static inline pte_t pte_mkwrite(pte_t pte) { (pte).pte_low |= _PAGE_RW; return pte; } -static inline pte_t pte_mkhuge(pte_t pte) { (pte).pte_low |= _PAGE_PRESENT | _PAGE_PSE; return pte; } +static inline pte_t pte_mkhuge(pte_t pte) { (pte).pte_low |= __LARGE_PTE; return pte; } #ifdef CONFIG_X86_PAE # include <asm/pgtable-3level.h> diff -upN reference/include/asm-x86_64/pgtable.h current/include/asm-x86_64/pgtable.h --- reference/include/asm-x86_64/pgtable.h +++ current/include/asm-x86_64/pgtable.h @@ -247,6 +247,7 @@ static inline pte_t pfn_pte(unsigned lon * The following only work if pte_present() is true. * Undefined behaviour if not.. 
*/ +#define __LARGE_PTE (_PAGE_PSE|_PAGE_PRESENT) static inline int pte_user(pte_t pte) { return pte_val(pte) & _PAGE_USER; } extern inline int pte_read(pte_t pte) { return pte_val(pte) & _PAGE_USER; } extern inline int pte_exec(pte_t pte) { return pte_val(pte) & _PAGE_USER; } @@ -254,8 +255,8 @@ extern inline int pte_dirty(pte_t pte) extern inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED; } extern inline int pte_write(pte_t pte) { return pte_val(pte) & _PAGE_RW; } static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; } +static inline int pte_huge(pte_t pte) { return (pte_val(pte) & __LARGE_PTE) == __LARGE_PTE; } -#define __LARGE_PTE (_PAGE_PSE|_PAGE_PRESENT) extern inline pte_t pte_rdprotect(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_USER)); return pte; } extern inline pte_t pte_exprotect(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_USER)); return pte; } extern inline pte_t pte_mkclean(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_DIRTY)); return pte; } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/4] x86-pte_huge
  2005-08-17 19:03 ` [PATCH 1/4] x86-pte_huge Adam Litke
@ 2005-08-17 19:18   ` Dave Hansen
  2005-08-17 19:27     ` Adam Litke
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Hansen @ 2005-08-17 19:18 UTC (permalink / raw)
  To: Adam Litke; +Cc: linux-mm, christoph, ak, kenneth.w.chen, david

On Wed, 2005-08-17 at 14:03 -0500, Adam Litke wrote:
> @@ -254,8 +255,8 @@ extern inline int pte_dirty(pte_t pte)
>  extern inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED; }
>  extern inline int pte_write(pte_t pte) { return pte_val(pte) & _PAGE_RW; }
>  static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; }
> +static inline int pte_huge(pte_t pte) { return (pte_val(pte) & __LARGE_PTE) == __LARGE_PTE; }

Looks like a little whitespace issue.  Probably just tabs vs. spaces.

-- Dave

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [PATCH 1/4] x86-pte_huge 2005-08-17 19:18 ` Dave Hansen @ 2005-08-17 19:27 ` Adam Litke 0 siblings, 0 replies; 14+ messages in thread From: Adam Litke @ 2005-08-17 19:27 UTC (permalink / raw) To: Dave Hansen; +Cc: linux-mm, christoph, ak, kenneth.w.chen, david On Wed, 2005-08-17 at 12:18 -0700, Dave Hansen wrote: > Looks like a little whitespace issue. Probably just tabs vs. spaces. Ughh, don't know how that slipped in. Fixed whitespace issue in asm-x86_64/pgtable.h Initial Post (Wed, 17 Aug 2005) This patch adds a macro pte_huge(pte) for i386/x86_64 which is needed by a patch later in the series. Instead of repeating (_PAGE_PRESENT | _PAGE_PSE), I've added __LARGE_PTE to i386 to match x86_64. Diffed against 2.6.13-rc6-git7 Signed-off-by: Adam Litke <agl@us.ibm.com> --- asm-i386/pgtable.h | 4 +++- asm-x86_64/pgtable.h | 3 ++- 2 files changed, 5 insertions(+), 2 deletions(-) diff -upN reference/include/asm-i386/pgtable.h current/include/asm-i386/pgtable.h --- reference/include/asm-i386/pgtable.h +++ current/include/asm-i386/pgtable.h @@ -215,11 +215,13 @@ extern unsigned long pg0[]; * The following only work if pte_present() is true. * Undefined behaviour if not.. */ +#define __LARGE_PTE (_PAGE_PSE | _PAGE_PRESENT) static inline int pte_user(pte_t pte) { return (pte).pte_low & _PAGE_USER; } static inline int pte_read(pte_t pte) { return (pte).pte_low & _PAGE_USER; } static inline int pte_dirty(pte_t pte) { return (pte).pte_low & _PAGE_DIRTY; } static inline int pte_young(pte_t pte) { return (pte).pte_low & _PAGE_ACCESSED; } static inline int pte_write(pte_t pte) { return (pte).pte_low & _PAGE_RW; } +static inline int pte_huge(pte_t pte) { return ((pte).pte_low & __LARGE_PTE) == __LARGE_PTE; } /* * The following only works if pte_present() is not true. @@ -236,7 +238,7 @@ static inline pte_t pte_mkexec(pte_t pte static inline pte_t pte_mkdirty(pte_t pte) { (pte).pte_low |= _PAGE_DIRTY; return pte; } static inline pte_t pte_mkyoung(pte_t pte) { (pte).pte_low |= _PAGE_ACCESSED; return pte; } static inline pte_t pte_mkwrite(pte_t pte) { (pte).pte_low |= _PAGE_RW; return pte; } -static inline pte_t pte_mkhuge(pte_t pte) { (pte).pte_low |= _PAGE_PRESENT | _PAGE_PSE; return pte; } +static inline pte_t pte_mkhuge(pte_t pte) { (pte).pte_low |= __LARGE_PTE; return pte; } #ifdef CONFIG_X86_PAE # include <asm/pgtable-3level.h> diff -upN reference/include/asm-x86_64/pgtable.h current/include/asm-x86_64/pgtable.h --- reference/include/asm-x86_64/pgtable.h +++ current/include/asm-x86_64/pgtable.h @@ -247,6 +247,7 @@ static inline pte_t pfn_pte(unsigned lon * The following only work if pte_present() is true. * Undefined behaviour if not.. 
*/ +#define __LARGE_PTE (_PAGE_PSE|_PAGE_PRESENT) static inline int pte_user(pte_t pte) { return pte_val(pte) & _PAGE_USER; } extern inline int pte_read(pte_t pte) { return pte_val(pte) & _PAGE_USER; } extern inline int pte_exec(pte_t pte) { return pte_val(pte) & _PAGE_USER; } @@ -254,8 +255,8 @@ extern inline int pte_dirty(pte_t pte) extern inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED; } extern inline int pte_write(pte_t pte) { return pte_val(pte) & _PAGE_RW; } static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; } +static inline int pte_huge(pte_t pte) { return (pte_val(pte) & __LARGE_PTE) == __LARGE_PTE; } -#define __LARGE_PTE (_PAGE_PSE|_PAGE_PRESENT) extern inline pte_t pte_rdprotect(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_USER)); return pte; } extern inline pte_t pte_exprotect(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_USER)); return pte; } extern inline pte_t pte_mkclean(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_DIRTY)); return pte; } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH 2/4] x86-move-stale-pgtable 2005-08-17 18:56 [PATCH 0/4] Demand faunting for huge pages Adam Litke 2005-08-17 19:03 ` [PATCH 1/4] x86-pte_huge Adam Litke @ 2005-08-17 19:03 ` Adam Litke 2005-08-17 19:04 ` [PATCH 3/4] x86-walk-check Adam Litke ` (2 subsequent siblings) 4 siblings, 0 replies; 14+ messages in thread From: Adam Litke @ 2005-08-17 19:03 UTC (permalink / raw) To: linux-mm; +Cc: christoph, ak, kenneth.w.chen, david Initial Post (Wed, 17 Aug 2005) This patch moves the if (! pte_none(*pte)) hugetlb_clean_stale_pgtable(pte); logic into huge_pte_alloc() so all of its callers can be immune to the bug described by Kenneth Chen at http://lkml.org/lkml/2004/6/16/246 > It turns out there is a bug in hugetlb_prefault(): with 3 level page table, > huge_pte_alloc() might return a pmd that points to a PTE page. It happens > if the virtual address for hugetlb mmap is recycled from previously used > normal page mmap. free_pgtables() might not scrub the pmd entry on > munmap and hugetlb_prefault skips on any pmd presence regardless what type > it is. Unless I am missing something, it seems more correct to place the check inside huge_pte_alloc() to prevent a the same bug wherever a huge pte is allocated. It also allows checking for this condition when lazily faulting huge pages later in the series. Diffed against 2.6.13-rc6-git7 Signed-off-by: Adam Litke <agl@us.ibm.com> --- arch/i386/mm/hugetlbpage.c | 13 +++++++++++-- mm/hugetlb.c | 2 -- 2 files changed, 11 insertions(+), 4 deletions(-) diff -upN reference/arch/i386/mm/hugetlbpage.c current/arch/i386/mm/hugetlbpage.c --- reference/arch/i386/mm/hugetlbpage.c +++ current/arch/i386/mm/hugetlbpage.c @@ -22,12 +22,21 @@ pte_t *huge_pte_alloc(struct mm_struct * { pgd_t *pgd; pud_t *pud; - pmd_t *pmd = NULL; + pmd_t *pmd; + pte_t *pte = NULL; pgd = pgd_offset(mm, addr); pud = pud_alloc(mm, pgd, addr); pmd = pmd_alloc(mm, pud, addr); - return (pte_t *) pmd; + + if (!pmd) + goto out; + + pte = (pte_t *) pmd; + if (!pte_none(*pte) && !pte_huge(*pte)) + hugetlb_clean_stale_pgtable(pte); +out: + return pte; } pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr) diff -upN reference/mm/hugetlb.c current/mm/hugetlb.c --- reference/mm/hugetlb.c +++ current/mm/hugetlb.c @@ -360,8 +360,6 @@ int hugetlb_prefault(struct address_spac ret = -ENOMEM; goto out; } - if (! pte_none(*pte)) - hugetlb_clean_stale_pgtable(pte); idx = ((addr - vma->vm_start) >> HPAGE_SHIFT) + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT)); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH 3/4] x86-walk-check
  2005-08-17 18:56 [PATCH 0/4] Demand faulting for huge pages Adam Litke
  2005-08-17 19:03 ` [PATCH 1/4] x86-pte_huge Adam Litke
  2005-08-17 19:03 ` [PATCH 2/4] x86-move-stale-pgtable Adam Litke
@ 2005-08-17 19:04 ` Adam Litke
  2005-08-17 19:41   ` Dave Hansen
  2005-08-17 19:05 ` [PATCH 4/4] htlb-fault Adam Litke
  2005-08-17 21:04 ` [PATCH 0/4] Demand faulting for huge pages Andi Kleen
  4 siblings, 1 reply; 14+ messages in thread
From: Adam Litke @ 2005-08-17 19:04 UTC (permalink / raw)
  To: linux-mm; +Cc: christoph, ak, kenneth.w.chen, david

Initial Post (Wed, 17 Aug 2005)

For demand faulting, we cannot assume that the page tables will be
populated.  Do what the rest of the architectures do and test
p?d_present() while walking down the page table.

Diffed against 2.6.13-rc6-git7

Signed-off-by: Adam Litke <agl@us.ibm.com>
---
 hugetlbpage.c |    8 ++++++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff -upN reference/arch/i386/mm/hugetlbpage.c current/arch/i386/mm/hugetlbpage.c
--- reference/arch/i386/mm/hugetlbpage.c
+++ current/arch/i386/mm/hugetlbpage.c
@@ -46,8 +46,12 @@ pte_t *huge_pte_offset(struct mm_struct
 	pmd_t *pmd = NULL;
 
 	pgd = pgd_offset(mm, addr);
-	pud = pud_offset(pgd, addr);
-	pmd = pmd_offset(pud, addr);
+	if (pgd_present(*pgd)) {
+		pud = pud_offset(pgd, addr);
+		if (pud_present(*pud)) {
+			pmd = pmd_offset(pud, addr);
+		}
+	}
 	return (pte_t *) pmd;
 }

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [PATCH 3/4] x86-walk-check
  2005-08-17 19:04 ` [PATCH 3/4] x86-walk-check Adam Litke
@ 2005-08-17 19:41   ` Dave Hansen
  0 siblings, 0 replies; 14+ messages in thread
From: Dave Hansen @ 2005-08-17 19:41 UTC (permalink / raw)
  To: Adam Litke; +Cc: linux-mm, christoph, ak, kenneth.w.chen, david

On Wed, 2005-08-17 at 14:04 -0500, Adam Litke wrote:
> 
> +	if (pgd_present(*pgd)) {
> +		pud = pud_offset(pgd, addr);
> +		if (pud_present(*pud)) {
> +			pmd = pmd_offset(pud, addr);
> +		}
> +	}

You can probably kill that extra set of braces on the indented if().
Or, do something like this (which I think is a little bit more
consistent with a lot of other code):

-	pud = pud_offset(pgd, addr);
-	pmd = pmd_offset(pud, addr);
+	if (!pgd_present(*pgd))
+		goto out;
+
+	pud = pud_offset(pgd, addr);
+	if (pud_present(*pud))
+		pmd = pmd_offset(pud, addr);
+
+out:
 	return (pte_t *) pmd;

-- Dave

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread
* [PATCH 4/4] htlb-fault 2005-08-17 18:56 [PATCH 0/4] Demand faunting for huge pages Adam Litke ` (2 preceding siblings ...) 2005-08-17 19:04 ` [PATCH 3/4] x86-walk-check Adam Litke @ 2005-08-17 19:05 ` Adam Litke 2005-08-17 21:04 ` [PATCH 0/4] Demand faunting for huge pages Andi Kleen 4 siblings, 0 replies; 14+ messages in thread From: Adam Litke @ 2005-08-17 19:05 UTC (permalink / raw) To: linux-mm; +Cc: christoph, ak, kenneth.w.chen, david Version 2 (Wed, 17 Aug 2005) Removed spurious WARN_ON() Patches added earlier in the series: Check for p?d_none() in arch/i386/mm/hugetlbpage.c:huge_pte_offset() Move i386 stale pte check into huge_pte_alloc() Initial Post (Fri, 05 Aug 2005) Below is a patch to implement demand faulting for huge pages. The main motivation for changing from prefaulting to demand faulting is so that huge page memory areas can be allocated according to NUMA policy. Thanks to consolidated hugetlb code, switching the behavior requires changing only one fault handler. The bulk of the patch just moves the logic from hugelb_prefault() to hugetlb_pte_fault(). Diffed against 2.6.13-rc6-git7 Signed-off-by: Adam Litke <agl@us.ibm.com> --- fs/hugetlbfs/inode.c | 6 -- include/linux/hugetlb.h | 2 mm/hugetlb.c | 137 +++++++++++++++++++++++++++--------------------- mm/memory.c | 7 -- 4 files changed, 82 insertions(+), 70 deletions(-) diff -upN reference/fs/hugetlbfs/inode.c current/fs/hugetlbfs/inode.c --- reference/fs/hugetlbfs/inode.c +++ current/fs/hugetlbfs/inode.c @@ -48,7 +48,6 @@ int sysctl_hugetlb_shm_group; static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma) { struct inode *inode = file->f_dentry->d_inode; - struct address_space *mapping = inode->i_mapping; loff_t len, vma_len; int ret; @@ -79,10 +78,7 @@ static int hugetlbfs_file_mmap(struct fi if (!(vma->vm_flags & VM_WRITE) && len > inode->i_size) goto out; - ret = hugetlb_prefault(mapping, vma); - if (ret) - goto out; - + ret = 0; if (inode->i_size < len) inode->i_size = len; out: diff -upN reference/include/linux/hugetlb.h current/include/linux/hugetlb.h --- reference/include/linux/hugetlb.h +++ current/include/linux/hugetlb.h @@ -25,6 +25,8 @@ int is_hugepage_mem_enough(size_t); unsigned long hugetlb_total_pages(void); struct page *alloc_huge_page(void); void free_huge_page(struct page *); +int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct * vma, + unsigned long address, int write_access); extern unsigned long max_huge_pages; extern const unsigned long hugetlb_zero, hugetlb_infinity; diff -upN reference/mm/hugetlb.c current/mm/hugetlb.c --- reference/mm/hugetlb.c +++ current/mm/hugetlb.c @@ -277,18 +277,20 @@ int copy_hugetlb_page_range(struct mm_st unsigned long addr = vma->vm_start; unsigned long end = vma->vm_end; - while (addr < end) { + for (; addr < end; addr += HPAGE_SIZE) { + src_pte = huge_pte_offset(src, addr); + if (!src_pte || pte_none(*src_pte)) + continue; + dst_pte = huge_pte_alloc(dst, addr); if (!dst_pte) goto nomem; - src_pte = huge_pte_offset(src, addr); - BUG_ON(!src_pte || pte_none(*src_pte)); /* prefaulted */ + BUG_ON(!src_pte); entry = *src_pte; ptepage = pte_page(entry); get_page(ptepage); add_mm_counter(dst, rss, HPAGE_SIZE / PAGE_SIZE); set_huge_pte_at(dst, addr, dst_pte, entry); - addr += HPAGE_SIZE; } return 0; @@ -338,61 +340,6 @@ void zap_hugepage_range(struct vm_area_s spin_unlock(&mm->page_table_lock); } -int hugetlb_prefault(struct address_space *mapping, struct vm_area_struct *vma) -{ - struct mm_struct *mm = current->mm; - unsigned long addr; - int 
ret = 0; - - WARN_ON(!is_vm_hugetlb_page(vma)); - BUG_ON(vma->vm_start & ~HPAGE_MASK); - BUG_ON(vma->vm_end & ~HPAGE_MASK); - - hugetlb_prefault_arch_hook(mm); - - spin_lock(&mm->page_table_lock); - for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) { - unsigned long idx; - pte_t *pte = huge_pte_alloc(mm, addr); - struct page *page; - - if (!pte) { - ret = -ENOMEM; - goto out; - } - - idx = ((addr - vma->vm_start) >> HPAGE_SHIFT) - + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT)); - page = find_get_page(mapping, idx); - if (!page) { - /* charge the fs quota first */ - if (hugetlb_get_quota(mapping)) { - ret = -ENOMEM; - goto out; - } - page = alloc_huge_page(); - if (!page) { - hugetlb_put_quota(mapping); - ret = -ENOMEM; - goto out; - } - ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC); - if (! ret) { - unlock_page(page); - } else { - hugetlb_put_quota(mapping); - free_huge_page(page); - goto out; - } - } - add_mm_counter(mm, rss, HPAGE_SIZE / PAGE_SIZE); - set_huge_pte_at(mm, addr, pte, make_huge_pte(vma, page)); - } -out: - spin_unlock(&mm->page_table_lock); - return ret; -} - int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, struct page **pages, struct vm_area_struct **vmas, unsigned long *position, int *length, int i) @@ -440,3 +387,75 @@ int follow_hugetlb_page(struct mm_struct return i; } + +int hugetlb_pte_fault(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, int write_access) +{ + int ret = VM_FAULT_MINOR; + unsigned long idx; + pte_t *pte; + struct page *page; + struct address_space *mapping; + + BUG_ON(vma->vm_start & ~HPAGE_MASK); + BUG_ON(vma->vm_end & ~HPAGE_MASK); + BUG_ON(!vma->vm_file); + + pte = huge_pte_alloc(mm, address); + if (!pte) { + ret = VM_FAULT_SIGBUS; + goto out; + } + if (! pte_none(*pte)) + goto flush; + + mapping = vma->vm_file->f_mapping; + idx = ((address - vma->vm_start) >> HPAGE_SHIFT) + + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT)); +retry: + page = find_get_page(mapping, idx); + if (!page) { + /* charge the fs quota first */ + if (hugetlb_get_quota(mapping)) { + ret = VM_FAULT_SIGBUS; + goto out; + } + page = alloc_huge_page(); + if (!page) { + hugetlb_put_quota(mapping); + ret = VM_FAULT_SIGBUS; + goto out; + } + if (add_to_page_cache(page, mapping, idx, GFP_ATOMIC)) { + put_page(page); + goto retry; + } + unlock_page(page); + } + add_mm_counter(mm, rss, HPAGE_SIZE / PAGE_SIZE); + set_huge_pte_at(mm, address, pte, make_huge_pte(vma, page)); +flush: + flush_tlb_page(vma, address); +out: + return ret; +} + +int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, int write_access) +{ + pte_t *ptep; + int rc = VM_FAULT_MINOR; + + spin_lock(&mm->page_table_lock); + + ptep = huge_pte_alloc(mm, address & HPAGE_MASK); + if (! ptep) { + rc = VM_FAULT_SIGBUS; + goto out; + } + if (pte_none(*ptep)) + rc = hugetlb_pte_fault(mm, vma, address, write_access); +out: + spin_unlock(&mm->page_table_lock); + return rc; +} diff -upN reference/mm/memory.c current/mm/memory.c --- reference/mm/memory.c +++ current/mm/memory.c @@ -937,11 +937,6 @@ int get_user_pages(struct task_struct *t || !(flags & vma->vm_flags)) return i ? 
: -EFAULT; - if (is_vm_hugetlb_page(vma)) { - i = follow_hugetlb_page(mm, vma, pages, vmas, - &start, &len, i); - continue; - } spin_lock(&mm->page_table_lock); do { int write_access = write; @@ -2034,7 +2029,7 @@ int __handle_mm_fault(struct mm_struct * inc_page_state(pgfault); if (is_vm_hugetlb_page(vma)) - return VM_FAULT_SIGBUS; /* mapping truncation does this. */ + return hugetlb_fault(mm, vma, address, write_access); /* * We need the page table lock to synchronize with kswapd -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 0/4] Demand faulting for huge pages
  2005-08-17 18:56 [PATCH 0/4] Demand faulting for huge pages Adam Litke
                   ` (3 preceding siblings ...)
  2005-08-17 19:05 ` [PATCH 4/4] htlb-fault Adam Litke
@ 2005-08-17 21:04 ` Andi Kleen
  2005-08-18  0:33   ` David Gibson
  2005-08-18 20:29   ` Adam Litke
  4 siblings, 2 replies; 14+ messages in thread
From: Andi Kleen @ 2005-08-17 21:04 UTC (permalink / raw)
  To: Adam Litke; +Cc: linux-mm, christoph, ak, kenneth.w.chen, david

What about the overcommit issue Ken noted?  It needs to be solved
in some way at least, either with the full check or the lazy simple
check.

Also I still think your get_user_pages approach is questionable.

-Andi

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [PATCH 0/4] Demand faulting for huge pages
  2005-08-17 21:04 ` [PATCH 0/4] Demand faulting for huge pages Andi Kleen
@ 2005-08-18  0:33   ` David Gibson
  2005-08-18  0:35     ` Andi Kleen
  2005-08-18 15:29     ` Ray Bryant
  1 sibling, 2 replies; 14+ messages in thread
From: David Gibson @ 2005-08-18 0:33 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Adam Litke, linux-mm, christoph, kenneth.w.chen

On Wed, Aug 17, 2005 at 11:04:32PM +0200, Andi Kleen wrote:
> 
> What about the overcommit issue Ken noted?  It needs to be solved
> in some way at least, either with the full check or the lazy simple
> check.

Hrm... I'm not 100% convinced that just allowing overcommit isn't the
right thing to do.  Overcommit has some unfortunate consequences, but
the semantics are clearly defined and trivial to implement.

Strict accounting leads to nicer behaviour in some cases - you'll tend
to die early rather than late - but it seems an awful lot of work for
a fairly small improvement in behaviour.

If we add copy-on-write for hugepages (i.e. MAP_PRIVATE support),
strict accounting is even harder to implement, and has clearly-wrong
behaviour in some circumstances: a process using most of the system's
hugepages, mapped MAP_PRIVATE, couldn't fork()/exec() a trivial helper
program.

> Also I still think your get_user_pages approach is questionable.
> 
> -Andi

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/people/dgibson

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [PATCH 0/4] Demand faulting for huge pages
  2005-08-18  0:33 ` David Gibson
@ 2005-08-18  0:35   ` Andi Kleen
  2005-08-18 20:33     ` Adam Litke
  2005-08-18 15:29   ` Ray Bryant
  1 sibling, 1 reply; 14+ messages in thread
From: Andi Kleen @ 2005-08-18 0:35 UTC (permalink / raw)
  To: David Gibson; +Cc: Andi Kleen, Adam Litke, linux-mm, christoph, kenneth.w.chen

On Thu, Aug 18, 2005 at 10:33:02AM +1000, David Gibson wrote:
> On Wed, Aug 17, 2005 at 11:04:32PM +0200, Andi Kleen wrote:
> > 
> > What about the overcommit issue Ken noted?  It needs to be solved
> > in some way at least, either with the full check or the lazy simple
> > check.
> 
> Hrm... I'm not 100% convinced that just allowing overcommit isn't the
> right thing to do.  Overcommit has some unfortunate consequences, but
> the semantics are clearly defined and trivial to implement.

I disagree.  With Linux's primitive hugepage allocation scheme (static
pool that is usually too small), at least a simple overcommit check
is absolutely essential.

> Strict accounting leads to nicer behaviour in some cases - you'll tend
> to die early rather than late - but it seems an awful lot of work for
> a fairly small improvement in behaviour.

Strict is a lot of work, but a simple "right in 99% of all cases, but racy"
check is quite easy.

-Andi

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread
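For concreteness, the kind of racy mmap-time check Andi describes could
be sketched roughly as below.  This is only an illustration: the wrapper
name is invented, is_hugepage_mem_enough() is the existing interface,
and Adam's actual proposal (which also accounts for pages already
instantiated in the page cache) follows in the next message.

/* Sketch only: compare the mapping's worst-case need against the
 * free huge page pool at mmap() time. */
static int hugetlb_rough_overcommit_check(struct vm_area_struct *vma)
{
	unsigned long bytes = vma->vm_end - vma->vm_start;

	/* Racy by design: the pool can shrink before the faults happen. */
	if (!is_hugepage_mem_enough(bytes))
		return -ENOMEM;
	return 0;
}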
* Re: [PATCH 0/4] Demand faunting for huge pages 2005-08-18 0:35 ` Andi Kleen @ 2005-08-18 20:33 ` Adam Litke 0 siblings, 0 replies; 14+ messages in thread From: Adam Litke @ 2005-08-18 20:33 UTC (permalink / raw) To: Andi Kleen; +Cc: David Gibson, linux-mm, christoph, kenneth.w.chen On Thu, 2005-08-18 at 02:35 +0200, Andi Kleen wrote: > I disagree. With Linux's primitive hugepage allocation scheme (static > pool that is usually too small) at least simple overcommit check > is absolutely essential. > > > Strict accounting leads to nicer behaviour in some cases - you'll tend > > to die early rather than late - but it seems an awful lot of work for > > a fairly small improvement in behaviour. > > Strict is a lot of work, but a simple "right in 99% of all cases, but racy" > check is quite easy. How about something like the following? --- Initial Post (Thu, 18 Aug 2005) Basic overcommit checking for hugetlb_file_map() based on an implementation used with demand faulting in SLES9. Since demand faulting can't guarantee the availability of pages at mmap time, this patch implements a basic sanity check to ensure that the number of huge pages required to satisfy the mmap are currently available. Despite the obvious race, I think it is a good start on doing proper accounting. I'd like to work towards an accounting system that mimics the semantics of normal pages (especially for the MAP_PRIVATE/COW case). That work is underway and builds on what this patch starts. Huge page shared memory segments are simpler and still maintain their commit on shmget semantics. Diffed against 2.6.13-rc6-git7 Signed-off-by: Adam Litke <agl@us.ibm.com> --- fs/hugetlbfs/inode.c | 36 ++++++++++++++++++++++++++++++++++++ include/linux/hugetlb.h | 3 +++ 2 files changed, 39 insertions(+) diff -upN reference/fs/hugetlbfs/inode.c current/fs/hugetlbfs/inode.c --- reference/fs/hugetlbfs/inode.c +++ current/fs/hugetlbfs/inode.c @@ -45,9 +45,41 @@ static struct backing_dev_info hugetlbfs int sysctl_hugetlb_shm_group; +static void huge_pagevec_release(struct pagevec *pvec); + +unsigned long +huge_pages_needed(struct address_space *mapping, struct vm_area_struct *vma, + unsigned long start, unsigned long end) +{ + int i; + struct pagevec pvec; + unsigned long hugepages = (end - start) >> HPAGE_SHIFT; + pgoff_t next = vma->vm_pgoff + ((start - vma->vm_start)>>PAGE_SHIFT); + pgoff_t endpg = next + ((end - start) >> PAGE_SHIFT); + + pagevec_init(&pvec, 0); + while (next < endpg) { + if (!pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) + break; + for (i = 0; i < pagevec_count(&pvec); i++) { + struct page *page = pvec.pages[i]; + if (page->index > next) + next = page->index; + if (page->index >= endpg) + break; + next++; + hugepages--; + } + huge_pagevec_release(&pvec); + } + return hugepages << HPAGE_SHIFT; +} + static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma) { struct inode *inode = file->f_dentry->d_inode; + struct address_space *mapping = inode->i_mapping; + unsigned long bytes; loff_t len, vma_len; int ret; @@ -66,6 +98,10 @@ static int hugetlbfs_file_mmap(struct fi if (vma->vm_end - vma->vm_start < HPAGE_SIZE) return -EINVAL; + bytes = huge_pages_needed(mapping, vma, vma->vm_start, vma->vm_end); + if (!is_hugepage_mem_enough(bytes)) + return -ENOMEM; + vma_len = (loff_t)(vma->vm_end - vma->vm_start); down(&inode->i_sem); diff -upN reference/include/linux/hugetlb.h current/include/linux/hugetlb.h --- reference/include/linux/hugetlb.h +++ current/include/linux/hugetlb.h @@ -42,6 +42,9 @@ struct page 
*follow_huge_pmd(struct mm_s pmd_t *pmd, int write); int is_aligned_hugepage_range(unsigned long addr, unsigned long len); int pmd_huge(pmd_t pmd); +unsigned long huge_pages_needed(struct address_space *mapping, + struct vm_area_struct *vma, + unsigned long start, unsigned long end); #ifndef ARCH_HAS_HUGEPAGE_ONLY_RANGE #define is_hugepage_only_range(mm, addr, len) 0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 0/4] Demand faulting for huge pages
  2005-08-18  0:33 ` David Gibson
  2005-08-18  0:35   ` Andi Kleen
@ 2005-08-18 15:29   ` Ray Bryant
  1 sibling, 0 replies; 14+ messages in thread
From: Ray Bryant @ 2005-08-18 15:29 UTC (permalink / raw)
  To: David Gibson
  Cc: Andi Kleen, Adam Litke, linux-mm, christoph, kenneth.w.chen, akpm

On Wednesday 17 August 2005 19:33, David Gibson wrote:
>
> Strict accounting leads to nicer behaviour in some cases - you'll tend
> to die early rather than late - but it seems an awful lot of work for
> a fairly small improvement in behaviour.
>

The last time we went around on this (April 2004?) Andrew thought that
adding demand allocation for hugetlb pages without strict accounting
was effectively an ABI change -- in the current approach the mmap()
will fail if you ask for too many hugetlb pages, whilst in the demand
fault approach you will get SIGBUS at a later point in time.  At one
time this was considered serious enough to fix.

Andy Whitcroft provided some code for the patch that Ken and I did
back in the April 2004 time frame.  I can't find that one, but the
following patch from Christoph Lameter appears to be the code.  The
idea is that at mmap() time a strict reservation is made that
guarantees the necessary number of hugetlb pages is available.

http://marc.theaimsgroup.com/?l=linux-kernel&m=109842250714489&w=2

-- 
Ray Bryant
AMD Performance Labs          Austin, Tx
512-602-0038 (o)              512-507-7807 (c)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [PATCH 0/4] Demand faulting for huge pages
  2005-08-17 21:04 ` [PATCH 0/4] Demand faulting for huge pages Andi Kleen
  2005-08-18  0:33   ` David Gibson
@ 2005-08-18 20:29   ` Adam Litke
  1 sibling, 0 replies; 14+ messages in thread
From: Adam Litke @ 2005-08-18 20:29 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-mm, christoph, kenneth.w.chen, david

On Wed, 2005-08-17 at 23:04 +0200, Andi Kleen wrote:
> Also I still think your get_user_pages approach is questionable.

I am pretty sure that my approach is safe and merely removes an
optimization.  Hopefully the following better states my reasons for
thinking so.  If anyone else who was involved in the demand fault
discussion when it went around the last time (see below) could chime
in, I think it would help further clarify the issue.

---
Initial Post (Thu, 18 Aug 2005)

In preparation for hugetlb demand faulting, remove this
get_user_pages() optimization.  Since huge pages will no longer be
prefaulted, we can't assume that the huge ptes are established and
hence, calling follow_hugetlb_page() is not valid.

With the follow_hugetlb_page() call removed, the normal code path will
be triggered.  follow_page() will either use follow_huge_addr() or
follow_huge_pmd() to check for a previously faulted "page" to return.
When this fails (ie. with demand faults), __handle_mm_fault() gets
called which invokes the hugetlb_fault() handler to instantiate the
huge page.

This patch doesn't make a lot of sense by itself, but I've broken it
out to facilitate discussion on this specific element of the demand
fault changes.  While coding this up, I referenced previous discussion
on this topic starting at http://lkml.org/lkml/2004/4/13/176 , which
contains more opinions about the correctness of this approach.

Diffed against 2.6.13-rc6-git7

Signed-off-by: Adam Litke <agl@us.ibm.com>
---
 memory.c |    5 -----
 1 files changed, 5 deletions(-)

diff -upN reference/mm/memory.c current/mm/memory.c
--- reference/mm/memory.c
+++ current/mm/memory.c
@@ -937,11 +937,6 @@ int get_user_pages(struct task_struct *t
 		    || !(flags & vma->vm_flags))
 			return i ? : -EFAULT;
 
-		if (is_vm_hugetlb_page(vma)) {
-			i = follow_hugetlb_page(mm, vma, pages, vmas,
-						&start, &len, i);
-			continue;
-		}
 		spin_lock(&mm->page_table_lock);
 		do {
 			int write_access = write;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread
end of thread, other threads:[~2005-08-18 20:38 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-08-17 18:56 [PATCH 0/4] Demand faulting for huge pages Adam Litke
2005-08-17 19:03 ` [PATCH 1/4] x86-pte_huge Adam Litke
2005-08-17 19:18   ` Dave Hansen
2005-08-17 19:27     ` Adam Litke
2005-08-17 19:03 ` [PATCH 2/4] x86-move-stale-pgtable Adam Litke
2005-08-17 19:04 ` [PATCH 3/4] x86-walk-check Adam Litke
2005-08-17 19:41   ` Dave Hansen
2005-08-17 19:05 ` [PATCH 4/4] htlb-fault Adam Litke
2005-08-17 21:04 ` [PATCH 0/4] Demand faulting for huge pages Andi Kleen
2005-08-18  0:33   ` David Gibson
2005-08-18  0:35     ` Andi Kleen
2005-08-18 20:33       ` Adam Litke
2005-08-18 15:29     ` Ray Bryant
2005-08-18 20:29   ` Adam Litke
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox