* [PATCH 0/4] hugetlb: copy on write
@ 2005-11-09 23:28 Adam Litke
2005-11-09 23:36 ` [PATCH 1/4] Hugetlb: Remove duplicate i_size check Adam Litke
` (3 more replies)
0 siblings, 4 replies; 14+ messages in thread
From: Adam Litke @ 2005-11-09 23:28 UTC (permalink / raw)
To: akpm
Cc: linux-mm, linux-kernel, David Gibson, wli, hugh, rohit.seth,
kenneth.w.chen, ADAM G. LITKE [imap]
This is a resend of the patches I sent on Nov 7th. I've broken them out
as requested. Comments (especially on the copy-on-write portion)
appreciated. Does anyone have a fundamental objection to moving forward
with these?
remove-dup-isize-check - Remove duplicated i_size truncation race check
rename-find_lock_huge_page - Switch to a more appropriate name
hugetlb_no_page - Mild reorg to support multiple fault types
htlb-cow - Copy on write support
--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH 1/4] Hugetlb: Remove duplicate i_size check
2005-11-09 23:28 [PATCH 0/4] hugetlb: copy on write Adam Litke
@ 2005-11-09 23:36 ` Adam Litke
2005-11-10 0:10 ` William Lee Irwin III
2005-11-09 23:37 ` [PATCH 2/4] Hugetlb: Rename find_lock_page to find_or_alloc_huge_page Adam Litke
` (2 subsequent siblings)
3 siblings, 1 reply; 14+ messages in thread
From: Adam Litke @ 2005-11-09 23:36 UTC (permalink / raw)
To: akpm
Cc: linux-mm, linux-kernel, David Gibson, wli, hugh, rohit.seth,
kenneth.w.chen, ADAM G. LITKE [imap]

On Wed, 2005-10-26 at 12:00 +1000, David Gibson wrote:
> - The check against i_size was duplicated: once in
>   find_lock_huge_page() and again in hugetlb_fault() after taking the
>   page_table_lock. We only really need the locked one, so remove the
>   other.

Original post by David Gibson <david@gibson.dropbear.id.au>

Version 2: Wed 9 Nov 2005
Split this cleanup out into a standalone patch

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Adam Litke <agl@us.ibm.com>
---
 hugetlb.c |    7 -------
 1 files changed, 7 deletions(-)
diff -upN reference/mm/hugetlb.c current/mm/hugetlb.c
--- reference/mm/hugetlb.c
+++ current/mm/hugetlb.c
@@ -344,19 +344,12 @@ static struct page *find_lock_huge_page(
 {
 	struct page *page;
 	int err;
-	struct inode *inode = mapping->host;
-	unsigned long size;

 retry:
 	page = find_lock_page(mapping, idx);
 	if (page)
 		goto out;

-	/* Check to make sure the mapping hasn't been truncated */
-	size = i_size_read(inode) >> HPAGE_SHIFT;
-	if (idx >= size)
-		goto out;
-
 	if (hugetlb_get_quota(mapping))
 		goto out;
 	page = alloc_huge_page();
--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/4] Hugetlb: Remove duplicate i_size check
2005-11-09 23:36 ` [PATCH 1/4] Hugetlb: Remove duplicate i_size check Adam Litke
@ 2005-11-10 0:10 ` William Lee Irwin III
0 siblings, 0 replies; 14+ messages in thread
From: William Lee Irwin III @ 2005-11-10 0:10 UTC (permalink / raw)
To: Adam Litke
Cc: akpm, linux-mm, linux-kernel, David Gibson, hugh, rohit.seth,
kenneth.w.chen

On Wed, 2005-10-26 at 12:00 +1000, David Gibson wrote:
>> - The check against i_size was duplicated: once in
>>   find_lock_huge_page() and again in hugetlb_fault() after taking the
>>   page_table_lock. We only really need the locked one, so remove the
>>   other.

On Wed, Nov 09, 2005 at 05:36:49PM -0600, Adam Litke wrote:
> Original post by David Gibson <david@gibson.dropbear.id.au>
> Version 2: Wed 9 Nov 2005
> Split this cleanup out into a standalone patch
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Adam Litke <agl@us.ibm.com>

Innocuous enough.

Acked-by: William Irwin <wli@holomorphy.com>
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH 2/4] Hugetlb: Rename find_lock_page to find_or_alloc_huge_page
2005-11-09 23:28 [PATCH 0/4] hugetlb: copy on write Adam Litke
2005-11-09 23:36 ` [PATCH 1/4] Hugetlb: Remove duplicate i_size check Adam Litke
@ 2005-11-09 23:37 ` Adam Litke
2005-11-10 0:11 ` William Lee Irwin III
2005-11-09 23:38 ` [PATCH 3/4] Hugetlb: Reorganize hugetlb_fault to prepare for COW Adam Litke
2005-11-09 23:39 ` [PATCH 4/4] Hugetlb: Copy on Write support Adam Litke
3 siblings, 1 reply; 14+ messages in thread
From: Adam Litke @ 2005-11-09 23:37 UTC (permalink / raw)
To: akpm
Cc: linux-mm, linux-kernel, David Gibson, wli, hugh, rohit.seth,
kenneth.w.chen, ADAM G. LITKE [imap]

On Wed, 2005-10-26 at 12:00 +1000, David Gibson wrote:
- find_lock_huge_page() isn't a great name, since it does extra things
  not analogous to find_lock_page(). Rename it
  find_or_alloc_huge_page() which is closer to the mark.

Original post by David Gibson <david@gibson.dropbear.id.au>

Version 2: Wed 9 Nov 2005
Split into a separate patch

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Adam Litke <agl@us.ibm.com>
---
 hugetlb.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)
diff -upN reference/mm/hugetlb.c current/mm/hugetlb.c
--- reference/mm/hugetlb.c
+++ current/mm/hugetlb.c
@@ -339,8 +339,8 @@ void unmap_hugepage_range(struct vm_area
 	flush_tlb_range(vma, start, end);
 }

-static struct page *find_lock_huge_page(struct address_space *mapping,
-			unsigned long idx)
+static struct page *find_or_alloc_huge_page(struct address_space *mapping,
+			unsigned long idx)
 {
 	struct page *page;
 	int err;
@@ -392,7 +392,7 @@ int hugetlb_fault(struct mm_struct *mm,
 	 * Use page lock to guard against racing truncation
 	 * before we get page_table_lock.
 	 */
-	page = find_lock_huge_page(mapping, idx);
+	page = find_or_alloc_huge_page(mapping, idx);
 	if (!page)
 		goto out;

--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/4] Hugetlb: Rename find_lock_page to find_or_alloc_huge_page
2005-11-09 23:37 ` [PATCH 2/4] Hugetlb: Rename find_lock_page to find_or_alloc_huge_page Adam Litke
@ 2005-11-10 0:11 ` William Lee Irwin III
0 siblings, 0 replies; 14+ messages in thread
From: William Lee Irwin III @ 2005-11-10 0:11 UTC (permalink / raw)
To: Adam Litke
Cc: akpm, linux-mm, linux-kernel, David Gibson, hugh, rohit.seth,
kenneth.w.chen

On Wed, Nov 09, 2005 at 05:37:52PM -0600, Adam Litke wrote:
> Hugetlb: Rename find_lock_page to find_or_alloc_huge_page
> On Wed, 2005-10-26 at 12:00 +1000, David Gibson wrote:
> - find_lock_huge_page() isn't a great name, since it does extra things
>   not analogous to find_lock_page(). Rename it
>   find_or_alloc_huge_page() which is closer to the mark.
> Original post by David Gibson <david@gibson.dropbear.id.au>
> Version 2: Wed 9 Nov 2005
> Split into a separate patch
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Adam Litke <agl@us.ibm.com>

Also innocuous.

Acked-by: William Irwin <wli@holomorphy.com>
--
wli
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH 3/4] Hugetlb: Reorganize hugetlb_fault to prepare for COW
2005-11-09 23:28 [PATCH 0/4] hugetlb: copy on write Adam Litke
2005-11-09 23:36 ` [PATCH 1/4] Hugetlb: Remove duplicate i_size check Adam Litke
2005-11-09 23:37 ` [PATCH 2/4] Hugetlb: Rename find_lock_page to find_or_alloc_huge_page Adam Litke
@ 2005-11-09 23:38 ` Adam Litke
2005-11-10 0:13 ` William Lee Irwin III
2005-11-09 23:39 ` [PATCH 4/4] Hugetlb: Copy on Write support Adam Litke
3 siblings, 1 reply; 14+ messages in thread
From: Adam Litke @ 2005-11-09 23:38 UTC (permalink / raw)
To: akpm
Cc: linux-mm, linux-kernel, David Gibson, wli, hugh, rohit.seth,
kenneth.w.chen, ADAM G. LITKE [imap]

This patch splits the "no_page()" type activity into its own function,
hugetlb_no_page(). hugetlb_fault() becomes the entry point for hugetlb faults
and delegates to the appropriate handler depending on the type of fault. Right
now we still have only hugetlb_no_page() but a later patch introduces a COW
fault.

Original post by David Gibson <david@gibson.dropbear.id.au>

Version 2: Wed 9 Nov 2005
Broken out into a separate patch

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Adam Litke <agl@us.ibm.com>
---
 hugetlb.c |   34 +++++++++++++++++++++++++---------
 1 files changed, 25 insertions(+), 9 deletions(-)
diff -upN reference/mm/hugetlb.c current/mm/hugetlb.c
--- reference/mm/hugetlb.c
+++ current/mm/hugetlb.c
@@ -370,20 +370,15 @@ out:
 	return page;
 }

-int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
-			unsigned long address, int write_access)
+int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
+			unsigned long address, pte_t *ptep)
 {
 	int ret = VM_FAULT_SIGBUS;
 	unsigned long idx;
 	unsigned long size;
-	pte_t *pte;
 	struct page *page;
 	struct address_space *mapping;

-	pte = huge_pte_alloc(mm, address);
-	if (!pte)
-		goto out;
-
 	mapping = vma->vm_file->f_mapping;
 	idx = ((address - vma->vm_start) >> HPAGE_SHIFT)
 		+ (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
@@ -402,11 +397,11 @@ int hugetlb_fault(struct mm_struct *mm,
 		goto backout;

 	ret = VM_FAULT_MINOR;
-	if (!pte_none(*pte))
+	if (!pte_none(*ptep))
 		goto backout;

 	add_mm_counter(mm, file_rss, HPAGE_SIZE / PAGE_SIZE);
-	set_huge_pte_at(mm, address, pte, make_huge_pte(vma, page));
+	set_huge_pte_at(mm, address, ptep, make_huge_pte(vma, page));
 	spin_unlock(&mm->page_table_lock);
 	unlock_page(page);
 out:
@@ -420,6 +415,27 @@ backout:
 	goto out;
 }

+int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+			unsigned long address, int write_access)
+{
+	pte_t *ptep;
+	pte_t entry;
+
+	ptep = huge_pte_alloc(mm, address);
+	if (!ptep)
+		return VM_FAULT_OOM;
+
+	entry = *ptep;
+	if (pte_none(entry))
+		return hugetlb_no_page(mm, vma, address, ptep);
+
+	/*
+	 * We could get here if another thread instantiated the pte
+	 * before the test above.
+	 */
+	return VM_FAULT_MINOR;
+}
+
 int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			struct page **pages, struct vm_area_struct **vmas,
 			unsigned long *position, int *length, int i)
--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 3/4] Hugetlb: Reorganize hugetlb_fault to prepare for COW
2005-11-09 23:38 ` [PATCH 3/4] Hugetlb: Reorganize hugetlb_fault to prepare for COW Adam Litke
@ 2005-11-10 0:13 ` William Lee Irwin III
0 siblings, 0 replies; 14+ messages in thread
From: William Lee Irwin III @ 2005-11-10 0:13 UTC (permalink / raw)
To: Adam Litke
Cc: akpm, linux-mm, linux-kernel, David Gibson, hugh, rohit.seth,
kenneth.w.chen

On Wed, Nov 09, 2005 at 05:38:47PM -0600, Adam Litke wrote:
> Hugetlb: Reorganize hugetlb_fault to prepare for COW
> This patch splits the "no_page()" type activity into its own function,
> hugetlb_no_page(). hugetlb_fault() becomes the entry point for hugetlb faults
> and delegates to the appropriate handler depending on the type of fault. Right
> now we still have only hugetlb_no_page() but a later patch introduces a COW
> fault.
> Original post by David Gibson <david@gibson.dropbear.id.au>
> Version 2: Wed 9 Nov 2005
> Broken out into a separate patch
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Adam Litke <agl@us.ibm.com>

Straightforward enough.

Acked-by: William Irwin <wli@holomorphy.com>
--
wli
^ permalink raw reply [flat|nested] 14+ messages in thread
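For anyone following the index arithmetic that hugetlb_no_page() inherits in patch 3, and the i_size bound that patch 1 leaves in the locked path, here is a minimal user-space sketch. It assumes 4 KiB base pages and 2 MiB huge pages (the real shifts are per-architecture), and every ex_ name is hypothetical, not something from the patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed sizes for illustration only: 4 KiB base pages, 2 MiB huge pages. */
#define EX_PAGE_SHIFT   12
#define EX_HPAGE_SHIFT  21

/*
 * Huge-page index of a faulting address within the backing file.
 * vm_pgoff is the mapping's file offset counted in base pages, so it is
 * rescaled to huge pages before being added.
 */
static uint64_t ex_huge_index(uint64_t address, uint64_t vm_start,
                              uint64_t vm_pgoff)
{
    assert(address >= vm_start);
    return ((address - vm_start) >> EX_HPAGE_SHIFT)
        + (vm_pgoff >> (EX_HPAGE_SHIFT - EX_PAGE_SHIFT));
}

/* True when idx lies at or past the file size measured in huge pages. */
static bool ex_beyond_i_size(uint64_t idx, uint64_t i_size_bytes)
{
    return idx >= (i_size_bytes >> EX_HPAGE_SHIFT);
}
```

With these assumed sizes, a fault five huge pages into a mapping whose vm_pgoff is 1024 base pages (4 MiB, i.e. two huge pages) lands on index 7 of the file.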
* [PATCH 4/4] Hugetlb: Copy on Write support
2005-11-09 23:28 [PATCH 0/4] hugetlb: copy on write Adam Litke
` (2 preceding siblings ...)
2005-11-09 23:38 ` [PATCH 3/4] Hugetlb: Reorganize hugetlb_fault to prepare for COW Adam Litke
@ 2005-11-09 23:39 ` Adam Litke
2005-11-10 0:15 ` William Lee Irwin III
2005-11-10 1:52 ` Rohit Seth
3 siblings, 2 replies; 14+ messages in thread
From: Adam Litke @ 2005-11-09 23:39 UTC (permalink / raw)
To: akpm
Cc: linux-mm, linux-kernel, David Gibson, wli, hugh, rohit.seth,
kenneth.w.chen, ADAM G. LITKE [imap]

Implement copy-on-write support for hugetlb mappings so MAP_PRIVATE can be
supported. This helps us to safely use hugetlb pages in many more
applications. The patch makes the following changes. If needed, I also have
it broken out according to the following paragraphs.

1. Add a pair of functions to set/clear write access on huge ptes. The
   writable check in make_huge_pte is moved out to the caller for use by COW
   later.

2. Hugetlb copy-on-write requires special case handling in the following
   situations:
   - copy_hugetlb_page_range() - Copied pages must be write protected so a COW
     fault will be triggered (if necessary) if those pages are written to.
   - find_or_alloc_huge_page() - Only MAP_SHARED pages are added to the page
     cache. MAP_PRIVATE pages still need to be locked however.

3. Provide hugetlb_cow() and calls from hugetlb_fault() and hugetlb_no_page()
   which handles the COW fault by making the actual copy.

4. Remove the check in hugetlbfs_file_mmap() so that MAP_PRIVATE mmaps will be
   allowed. Make MAP_HUGETLB exempt from the deprecated VM_RESERVED mapping
   check.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Adam Litke <agl@us.ibm.com>
---
 fs/hugetlbfs/inode.c    |    3 -
 include/linux/hugetlb.h |   12 +++++
 mm/hugetlb.c            |  115 ++++++++++++++++++++++++++++++++++++++++--------
 mm/mmap.c               |    4 -
 4 files changed, 110 insertions(+), 24 deletions(-)
diff -upN reference/fs/hugetlbfs/inode.c current/fs/hugetlbfs/inode.c
--- reference/fs/hugetlbfs/inode.c
+++ current/fs/hugetlbfs/inode.c
@@ -100,9 +100,6 @@ static int hugetlbfs_file_mmap(struct fi
 	loff_t len, vma_len;
 	int ret;

-	if ((vma->vm_flags & (VM_MAYSHARE | VM_WRITE)) == VM_WRITE)
-		return -EINVAL;
-
 	if (vma->vm_pgoff & (HPAGE_SIZE / PAGE_SIZE - 1))
 		return -EINVAL;

diff -upN reference/include/linux/hugetlb.h current/include/linux/hugetlb.h
--- reference/include/linux/hugetlb.h
+++ current/include/linux/hugetlb.h
@@ -65,6 +65,18 @@ pte_t huge_ptep_get_and_clear(struct mm_
 			    pte_t *ptep);
 #endif

+#define huge_ptep_set_wrprotect(mm, addr, ptep) \
+	ptep_set_wrprotect(mm, addr, ptep)
+static inline void set_huge_ptep_writable(struct vm_area_struct *vma,
+					  unsigned long address, pte_t *ptep)
+{
+	pte_t entry;
+
+	entry = pte_mkwrite(pte_mkdirty(*ptep));
+	ptep_set_access_flags(vma, address, ptep, entry, 1);
+	update_mmu_cache(vma, address, entry);
+}
+
 #ifndef ARCH_HAS_HUGETLB_PREFAULT_HOOK
 #define hugetlb_prefault_arch_hook(mm)		do { } while (0)
 #else
diff -upN reference/mm/hugetlb.c current/mm/hugetlb.c
--- reference/mm/hugetlb.c
+++ current/mm/hugetlb.c
@@ -255,11 +255,12 @@ struct vm_operations_struct hugetlb_vm_o
 	.nopage = hugetlb_nopage,
 };

-static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page)
+static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
+				int writable)
 {
 	pte_t entry;

-	if (vma->vm_flags & VM_WRITE) {
+	if (writable) {
 		entry =
 		    pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
 	} else {
@@ -277,6 +278,9 @@ int copy_hugetlb_page_range(struct mm_st
 	pte_t *src_pte, *dst_pte, entry;
 	struct page *ptepage;
 	unsigned long addr;
+	int cow;
+
+	cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;

 	for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
 		src_pte = huge_pte_offset(src, addr);
@@ -288,6 +292,8 @@ int copy_hugetlb_page_range(struct mm_st
 		spin_lock(&dst->page_table_lock);
 		spin_lock(&src->page_table_lock);
 		if (!pte_none(*src_pte)) {
+			if (cow)
+				huge_ptep_set_wrprotect(src, addr, src_pte);
 			entry = *src_pte;
 			ptepage = pte_page(entry);
 			get_page(ptepage);
@@ -340,7 +346,7 @@ void unmap_hugepage_range(struct vm_area
 }

 static struct page *find_or_alloc_huge_page(struct address_space *mapping,
-			unsigned long idx)
+			unsigned long idx, int shared)
 {
 	struct page *page;
 	int err;
@@ -358,26 +364,80 @@ retry:
 		goto out;
 	}

-	err = add_to_page_cache(page, mapping, idx, GFP_KERNEL);
-	if (err) {
-		put_page(page);
-		hugetlb_put_quota(mapping);
-		if (err == -EEXIST)
-			goto retry;
-		page = NULL;
+	if (shared) {
+		err = add_to_page_cache(page, mapping, idx, GFP_KERNEL);
+		if (err) {
+			put_page(page);
+			hugetlb_put_quota(mapping);
+			if (err == -EEXIST)
+				goto retry;
+			page = NULL;
+		}
+	} else {
+		/* Caller expects a locked page */
+		lock_page(page);
 	}
 out:
 	return page;
 }

+static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
+			unsigned long address, pte_t *ptep, pte_t pte)
+{
+	struct page *old_page, *new_page;
+	int i, avoidcopy;
+
+	old_page = pte_page(pte);
+
+	/* If no-one else is actually using this page, avoid the copy
+	 * and just make the page writable */
+	avoidcopy = (page_count(old_page) == 1);
+	if (avoidcopy) {
+		set_huge_ptep_writable(vma, address, ptep);
+		return VM_FAULT_MINOR;
+	}
+
+	page_cache_get(old_page);
+	new_page = alloc_huge_page();
+
+	if (! new_page) {
+		page_cache_release(old_page);
+
+		/* Logically this is OOM, not a SIGBUS, but an OOM
+		 * could cause the kernel to go killing other
+		 * processes which won't help the hugepage situation
+		 * at all (?) */
+		return VM_FAULT_SIGBUS;
+	}
+
+	spin_unlock(&mm->page_table_lock);
+	for (i = 0; i < HPAGE_SIZE/PAGE_SIZE; i++)
+		copy_user_highpage(new_page + i, old_page + i,
+				   address + i*PAGE_SIZE);
+	spin_lock(&mm->page_table_lock);
+
+	ptep = huge_pte_offset(mm, address & HPAGE_MASK);
+	if (likely(pte_same(*ptep, pte))) {
+		/* Break COW */
+		set_huge_pte_at(mm, address, ptep,
+				make_huge_pte(vma, new_page, 1));
+		/* Make the old page be freed below */
+		new_page = old_page;
+	}
+	page_cache_release(new_page);
+	page_cache_release(old_page);
+	return VM_FAULT_MINOR;
+}
+
 int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
-			unsigned long address, pte_t *ptep)
+			unsigned long address, pte_t *ptep, int write_access)
 {
 	int ret = VM_FAULT_SIGBUS;
 	unsigned long idx;
 	unsigned long size;
 	struct page *page;
 	struct address_space *mapping;
+	pte_t new_pte;

 	mapping = vma->vm_file->f_mapping;
 	idx = ((address - vma->vm_start) >> HPAGE_SHIFT)
@@ -387,10 +447,13 @@ int hugetlb_no_page(struct mm_struct *mm
 	 * Use page lock to guard against racing truncation
 	 * before we get page_table_lock.
 	 */
-	page = find_or_alloc_huge_page(mapping, idx);
+	page = find_or_alloc_huge_page(mapping, idx,
+					vma->vm_flags & VM_SHARED);
 	if (!page)
 		goto out;

+	BUG_ON(!PageLocked(page));
+
 	spin_lock(&mm->page_table_lock);
 	size = i_size_read(mapping->host) >> HPAGE_SHIFT;
 	if (idx >= size)
@@ -401,7 +464,15 @@ int hugetlb_no_page(struct mm_struct *mm
 		goto backout;

 	add_mm_counter(mm, file_rss, HPAGE_SIZE / PAGE_SIZE);
-	set_huge_pte_at(mm, address, ptep, make_huge_pte(vma, page));
+	new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
+				&& (vma->vm_flags & VM_SHARED)));
+	set_huge_pte_at(mm, address, ptep, new_pte);
+
+	if (write_access && !(vma->vm_flags & VM_SHARED)) {
+		/* Optimization, do the COW without a second fault */
+		ret = hugetlb_cow(mm, vma, address, ptep, new_pte);
+	}
+
 	spin_unlock(&mm->page_table_lock);
 	unlock_page(page);
 out:
@@ -420,6 +491,7 @@ int hugetlb_fault(struct mm_struct *mm,
 {
 	pte_t *ptep;
 	pte_t entry;
+	int ret;

 	ptep = huge_pte_alloc(mm, address);
 	if (!ptep)
@@ -427,13 +499,18 @@ int hugetlb_fault(struct mm_struct *mm,

 	entry = *ptep;
 	if (pte_none(entry))
-		return hugetlb_no_page(mm, vma, address, ptep);
+		return hugetlb_no_page(mm, vma, address, ptep, write_access);

-	/*
-	 * We could get here if another thread instantiated the pte
-	 * before the test above.
-	 */
-	return VM_FAULT_MINOR;
+	ret = VM_FAULT_MINOR;
+
+	spin_lock(&mm->page_table_lock);
+	/* Check for a racing update before calling hugetlb_cow */
+	if (likely(pte_same(entry, *ptep)))
+		if (write_access && !pte_write(entry))
+			ret = hugetlb_cow(mm, vma, address, ptep, entry);
+	spin_unlock(&mm->page_table_lock);
+
+	return ret;
 }

 int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
diff -upN reference/mm/mmap.c current/mm/mmap.c
--- reference/mm/mmap.c
+++ current/mm/mmap.c
@@ -1077,8 +1077,8 @@ munmap_back:
 		error = file->f_op->mmap(file, vma);
 		if (error)
 			goto unmap_and_free_vma;
-		if ((vma->vm_flags & (VM_SHARED | VM_WRITE | VM_RESERVED))
-						== (VM_WRITE | VM_RESERVED)) {
+		if ((vma->vm_flags & (VM_SHARED | VM_WRITE | VM_RESERVED
+					| VM_HUGETLB)) == (VM_WRITE | VM_RESERVED)) {
 			printk(KERN_WARNING "program %s is using MAP_PRIVATE, "
 				"PROT_WRITE mmap of VM_RESERVED memory, which "
 				"is deprecated. Please report this to "
--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
^ permalink raw reply [flat|nested] 14+ messages in thread
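The decision hugetlb_cow() makes in patch 4 can be modelled in user space. The following is only a rough sketch under stated assumptions: a plain integer stands in for page_count(), malloc() stands in for alloc_huge_page(), the page is shrunk to 4096 bytes, and there is no locking or pte_same() recheck. All ex_ names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define EX_HPAGE_BYTES 4096   /* shrunk stand-in for HPAGE_SIZE */

struct ex_page {
    int refcount;                       /* stand-in for page_count() */
    unsigned char data[EX_HPAGE_BYTES];
};

struct ex_pte {
    struct ex_page *page;
    bool writable;
};

enum ex_fault { EX_FAULT_MINOR, EX_FAULT_SIGBUS };

/* Handle a write fault on a write-protected entry of a private mapping. */
static enum ex_fault ex_cow(struct ex_pte *pte)
{
    struct ex_page *old_page = pte->page;
    struct ex_page *new_page;

    assert(old_page != NULL && old_page->refcount > 0);

    /* Sole user: skip the copy and just grant write access. */
    if (old_page->refcount == 1) {
        pte->writable = true;
        return EX_FAULT_MINOR;
    }

    new_page = malloc(sizeof(*new_page));
    if (new_page == NULL)
        return EX_FAULT_SIGBUS;   /* the patch also prefers SIGBUS to OOM */

    memcpy(new_page->data, old_page->data, EX_HPAGE_BYTES);
    new_page->refcount = 1;

    /* Break COW: point the entry at the private copy, drop the old ref. */
    pte->page = new_page;
    pte->writable = true;
    old_page->refcount--;
    return EX_FAULT_MINOR;
}
```

In this model, two entries sharing one page after a fork diverge on the first write: the writer receives a private copy, and the remaining holder, now the sole user, takes the no-copy path on its own first write.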
* Re: [PATCH 4/4] Hugetlb: Copy on Write support
2005-11-09 23:39 ` [PATCH 4/4] Hugetlb: Copy on Write support Adam Litke
@ 2005-11-10 0:15 ` William Lee Irwin III
2005-11-10 0:49 ` David Gibson
2005-11-10 1:52 ` Rohit Seth
1 sibling, 1 reply; 14+ messages in thread
From: William Lee Irwin III @ 2005-11-10 0:15 UTC (permalink / raw)
To: Adam Litke
Cc: akpm, linux-mm, linux-kernel, David Gibson, hugh, rohit.seth,
kenneth.w.chen

On Wed, Nov 09, 2005 at 05:39:55PM -0600, Adam Litke wrote:
> Hugetlb: Copy on Write support
> Implement copy-on-write support for hugetlb mappings so MAP_PRIVATE can be
> supported. This helps us to safely use hugetlb pages in many more
> applications. The patch makes the following changes. If needed, I also have
> it broken out according to the following paragraphs.
> 1. Add a pair of functions to set/clear write access on huge ptes. The
>    writable check in make_huge_pte is moved out to the caller for use by COW
>    later.
> 2. Hugetlb copy-on-write requires special case handling in the following
>    situations:
>    - copy_hugetlb_page_range() - Copied pages must be write protected so a COW
>      fault will be triggered (if necessary) if those pages are written to.
>    - find_or_alloc_huge_page() - Only MAP_SHARED pages are added to the page
>      cache. MAP_PRIVATE pages still need to be locked however.
> 3. Provide hugetlb_cow() and calls from hugetlb_fault() and hugetlb_no_page()
>    which handles the COW fault by making the actual copy.
> 4. Remove the check in hugetlbfs_file_mmap() so that MAP_PRIVATE mmaps will be
>    allowed. Make MAP_HUGETLB exempt from the deprecated VM_RESERVED mapping
>    check.

Did you do the audit of pte protection bits I asked about? If not, I'll
dredge them up and check to make sure.
--
wli
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 4/4] Hugetlb: Copy on Write support
2005-11-10 0:15 ` William Lee Irwin III
@ 2005-11-10 0:49 ` David Gibson
2005-11-10 0:56 ` William Lee Irwin III
0 siblings, 1 reply; 14+ messages in thread
From: David Gibson @ 2005-11-10 0:49 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Adam Litke, akpm, linux-mm, linux-kernel, hugh, rohit.seth,
kenneth.w.chen

On Wed, Nov 09, 2005 at 04:15:34PM -0800, William Lee Irwin wrote:
> On Wed, Nov 09, 2005 at 05:39:55PM -0600, Adam Litke wrote:
> > Hugetlb: Copy on Write support
> > Implement copy-on-write support for hugetlb mappings so MAP_PRIVATE can be
> > supported. This helps us to safely use hugetlb pages in many more
> > applications. The patch makes the following changes. If needed, I also have
> > it broken out according to the following paragraphs.
> > 1. Add a pair of functions to set/clear write access on huge ptes. The
> >    writable check in make_huge_pte is moved out to the caller for use by COW
> >    later.
> > 2. Hugetlb copy-on-write requires special case handling in the following
> >    situations:
> >    - copy_hugetlb_page_range() - Copied pages must be write protected so a COW
> >      fault will be triggered (if necessary) if those pages are written to.
> >    - find_or_alloc_huge_page() - Only MAP_SHARED pages are added to the page
> >      cache. MAP_PRIVATE pages still need to be locked however.
> > 3. Provide hugetlb_cow() and calls from hugetlb_fault() and hugetlb_no_page()
> >    which handles the COW fault by making the actual copy.
> > 4. Remove the check in hugetlbfs_file_mmap() so that MAP_PRIVATE mmaps will be
> >    allowed. Make MAP_HUGETLB exempt from the deprecated VM_RESERVED mapping
> >    check.
>
> Did you do the audit of pte protection bits I asked about? If not, I'll
> dredge them up and check to make sure.

I still don't know what you're talking about here - you never
responded to my mail asking for clarification. The hugepage code
already relies on pte_mkwrite() and pte_wrprotect() working correctly,
I don't see that COW makes any difference.
--
David Gibson                   | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
                               | _way_ _around_!
http://www.ozlabs.org/~dgibson
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 4/4] Hugetlb: Copy on Write support
2005-11-10 0:49 ` David Gibson
@ 2005-11-10 0:56 ` William Lee Irwin III
0 siblings, 0 replies; 14+ messages in thread
From: William Lee Irwin III @ 2005-11-10 0:56 UTC (permalink / raw)
To: David Gibson
Cc: Adam Litke, akpm, linux-mm, linux-kernel, hugh, rohit.seth,
kenneth.w.chen

On Wed, Nov 09, 2005 at 04:15:34PM -0800, William Lee Irwin wrote:
>> Did you do the audit of pte protection bits I asked about? If not, I'll
>> dredge them up and check to make sure.

On Thu, Nov 10, 2005 at 11:49:07AM +1100, David Gibson wrote:
> I still don't know what you're talking about here - you never
> responded to my mail asking for clarification. The hugepage code
> already relies on pte_mkwrite() and pte_wrprotect() working correctly,
> I don't see that COW makes any difference.

You appear to have a good idea of what's going on given that you've
reminded me of that reliance. It looks like I dropped that email packet
for some reason, sorry about that.

Acked-by: William Irwin <wli@holomorphy.com>
--
wli
^ permalink raw reply [flat|nested] 14+ messages in thread
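Since the exchange above turns on which huge ptes end up write-protected, the two flag predicates from patch 4 are restated here as a small sketch. The bit values are illustrative assumptions in the style of include/linux/mm.h and should not be taken as authoritative; the ex_ names are hypothetical.

```c
#include <assert.h>   /* for the assert() checks suggested in the note below */
#include <stdbool.h>

/* Illustrative bit values only; the kernel's own definitions are what count. */
#define EX_VM_WRITE     0x02UL
#define EX_VM_SHARED    0x08UL
#define EX_VM_MAYWRITE  0x20UL

/*
 * copy_hugetlb_page_range(): write-protect at fork time only for private
 * mappings that may become writable.
 */
static bool ex_needs_cow(unsigned long vm_flags)
{
    return (vm_flags & (EX_VM_SHARED | EX_VM_MAYWRITE)) == EX_VM_MAYWRITE;
}

/*
 * hugetlb_no_page(): a freshly instantiated pte is writable only for
 * shared, writable mappings; private ones start read-only and go
 * through the COW path on the first write.
 */
static bool ex_fresh_pte_writable(unsigned long vm_flags)
{
    return (vm_flags & EX_VM_WRITE) && (vm_flags & EX_VM_SHARED);
}
```

Under these assumed values, a MAP_PRIVATE writable mapping has ex_needs_cow() true and ex_fresh_pte_writable() false, while a MAP_SHARED writable mapping is the reverse; a couple of assert() calls over those two flag combinations is enough to check that reading.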
* Re: [PATCH 4/4] Hugetlb: Copy on Write support
2005-11-09 23:39 ` [PATCH 4/4] Hugetlb: Copy on Write support Adam Litke
2005-11-10 0:15 ` William Lee Irwin III
@ 2005-11-10 1:52 ` Rohit Seth
2005-11-10 3:54 ` David Gibson
1 sibling, 1 reply; 14+ messages in thread
From: Rohit Seth @ 2005-11-10 1:52 UTC (permalink / raw)
To: Adam Litke
Cc: akpm, linux-mm, linux-kernel, David Gibson, wli, hugh,
kenneth.w.chen

On Wed, 2005-11-09 at 17:39 -0600, Adam Litke wrote:
>
> +#define huge_ptep_set_wrprotect(mm, addr, ptep) \
> +	ptep_set_wrprotect(mm, addr, ptep)
> +static inline void set_huge_ptep_writable(struct vm_area_struct *vma,
> +					  unsigned long address, pte_t *ptep)
> +{
> +	pte_t entry;
> +
> +	entry = pte_mkwrite(pte_mkdirty(*ptep));
> +	ptep_set_access_flags(vma, address, ptep, entry, 1);
> +	update_mmu_cache(vma, address, entry);
> +}

lazy_mmu_prot_update will need to be called here to make caches coherent
for some archs.

-rohit
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 4/4] Hugetlb: Copy on Write support
2005-11-10 1:52 ` Rohit Seth
@ 2005-11-10 3:54 ` David Gibson
2005-11-10 4:20 ` William Lee Irwin III
0 siblings, 1 reply; 14+ messages in thread
From: David Gibson @ 2005-11-10 3:54 UTC (permalink / raw)
To: Rohit Seth
Cc: Adam Litke, akpm, linux-mm, linux-kernel, wli, hugh,
kenneth.w.chen

On Wed, Nov 09, 2005 at 05:52:44PM -0800, Rohit Seth wrote:
> On Wed, 2005-11-09 at 17:39 -0600, Adam Litke wrote:
> >
>
> > +#define huge_ptep_set_wrprotect(mm, addr, ptep) \
> > +	ptep_set_wrprotect(mm, addr, ptep)
> > +static inline void set_huge_ptep_writable(struct vm_area_struct *vma,
> > +					  unsigned long address, pte_t *ptep)
> > +{
> > +	pte_t entry;
> > +
> > +	entry = pte_mkwrite(pte_mkdirty(*ptep));
> > +	ptep_set_access_flags(vma, address, ptep, entry, 1);
> > +	update_mmu_cache(vma, address, entry);
> > +}
>
> lazy_mmu_prot_update will need to be called here to make caches coherent
> for some archs.

Ah, yes indeed. Revised version below. While I was at it, I moved
set_huge_ptep_writable() into mm/hugetlb.c, since there's no actual
need for it to be in the .h, and abolished huge_ptep_set_wrprotect()
since there's no need for the macro at all.

Hugetlb: Copy on Write support

Implement copy-on-write support for hugetlb mappings so MAP_PRIVATE can be
supported. This helps us to safely use hugetlb pages in many more
applications. The patch makes the following changes. If needed, I also have
it broken out according to the following paragraphs.

1. Add a pair of functions to set/clear write access on huge ptes. The
   writable check in make_huge_pte is moved out to the caller for use by COW
   later.

2. Hugetlb copy-on-write requires special case handling in the following
   situations:
   - copy_hugetlb_page_range() - Copied pages must be write protected so a COW
     fault will be triggered (if necessary) if those pages are written to.
   - find_or_alloc_huge_page() - Only MAP_SHARED pages are added to the page
     cache. MAP_PRIVATE pages still need to be locked however.

3. Provide hugetlb_cow() and calls from hugetlb_fault() and hugetlb_no_page()
   which handles the COW fault by making the actual copy.

4. Remove the check in hugetlbfs_file_mmap() so that MAP_PRIVATE mmaps will be
   allowed. Make MAP_HUGETLB exempt from the deprecated VM_RESERVED mapping
   check.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Adam Litke <agl@us.ibm.com>

Index: working-2.6/fs/hugetlbfs/inode.c
===================================================================
--- working-2.6.orig/fs/hugetlbfs/inode.c	2005-11-10 14:41:51.000000000 +1100
+++ working-2.6/fs/hugetlbfs/inode.c	2005-11-10 14:44:13.000000000 +1100
@@ -100,9 +100,6 @@
 	loff_t len, vma_len;
 	int ret;

-	if ((vma->vm_flags & (VM_MAYSHARE | VM_WRITE)) == VM_WRITE)
-		return -EINVAL;
-
 	if (vma->vm_pgoff & (HPAGE_SIZE / PAGE_SIZE - 1))
 		return -EINVAL;

Index: working-2.6/mm/hugetlb.c
===================================================================
--- working-2.6.orig/mm/hugetlb.c	2005-11-10 14:41:51.000000000 +1100
+++ working-2.6/mm/hugetlb.c	2005-11-10 14:44:13.000000000 +1100
@@ -255,11 +255,12 @@
 	.nopage = hugetlb_nopage,
 };

-static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page)
+static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
+				int writable)
 {
 	pte_t entry;

-	if (vma->vm_flags & VM_WRITE) {
+	if (writable) {
 		entry =
 		    pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
 	} else {
@@ -271,12 +272,27 @@
 	return entry;
 }

+static void set_huge_ptep_writable(struct vm_area_struct *vma,
+				   unsigned long address, pte_t *ptep)
+{
+	pte_t entry;
+
+	entry = pte_mkwrite(pte_mkdirty(*ptep));
+	ptep_set_access_flags(vma, address, ptep, entry, 1);
+	update_mmu_cache(vma, address, entry);
+	lazy_mmu_prot_update(entry);
+}
+
+
 int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			    struct vm_area_struct *vma)
 {
 	pte_t *src_pte, *dst_pte, entry;
 	struct page *ptepage;
 	unsigned long addr;
+	int cow;
+
+	cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;

 	for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
 		src_pte = huge_pte_offset(src, addr);
@@ -288,6 +304,8 @@
 		spin_lock(&dst->page_table_lock);
 		spin_lock(&src->page_table_lock);
 		if (!pte_none(*src_pte)) {
+			if (cow)
+				ptep_set_wrprotect(src, addr, src_pte);
 			entry = *src_pte;
 			ptepage = pte_page(entry);
 			get_page(ptepage);
@@ -340,7 +358,7 @@
 }

 static struct page *find_or_alloc_huge_page(struct address_space *mapping,
-			unsigned long idx)
+			unsigned long idx, int shared)
 {
 	struct page *page;
 	int err;
@@ -358,26 +376,80 @@
 		goto out;
 	}

-	err = add_to_page_cache(page, mapping, idx, GFP_KERNEL);
-	if (err) {
-		put_page(page);
-		hugetlb_put_quota(mapping);
-		if (err == -EEXIST)
-			goto retry;
-		page = NULL;
+	if (shared) {
+		err = add_to_page_cache(page, mapping, idx, GFP_KERNEL);
+		if (err) {
+			put_page(page);
+			hugetlb_put_quota(mapping);
+			if (err == -EEXIST)
+				goto retry;
+			page = NULL;
+		}
+	} else {
+		/* Caller expects a locked page */
+		lock_page(page);
 	}
 out:
 	return page;
 }

+static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
+			unsigned long address, pte_t *ptep, pte_t pte)
+{
+	struct page *old_page, *new_page;
+	int i, avoidcopy;
+
+	old_page = pte_page(pte);
+
+	/* If no-one else is actually using this page, avoid the copy
+	 * and just make the page writable */
+	avoidcopy = (page_count(old_page) == 1);
+	if (avoidcopy) {
+		set_huge_ptep_writable(vma, address, ptep);
+		return VM_FAULT_MINOR;
+	}
+
+	page_cache_get(old_page);
+	new_page = alloc_huge_page();
+
+	if (! new_page) {
+		page_cache_release(old_page);
+
+		/* Logically this is OOM, not a SIGBUS, but an OOM
+		 * could cause the kernel to go killing other
+		 * processes which won't help the hugepage situation
+		 * at all (?) */
+		return VM_FAULT_SIGBUS;
+	}
+
+	spin_unlock(&mm->page_table_lock);
+	for (i = 0; i < HPAGE_SIZE/PAGE_SIZE; i++)
+		copy_user_highpage(new_page + i, old_page + i,
+				   address + i*PAGE_SIZE);
+	spin_lock(&mm->page_table_lock);
+
+	ptep = huge_pte_offset(mm, address & HPAGE_MASK);
+	if (likely(pte_same(*ptep, pte))) {
+		/* Break COW */
+		set_huge_pte_at(mm, address, ptep,
+				make_huge_pte(vma, new_page, 1));
+		/* Make the old page be freed below */
+		new_page = old_page;
+	}
+	page_cache_release(new_page);
+	page_cache_release(old_page);
+	return VM_FAULT_MINOR;
+}
+
 int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
-			unsigned long address, pte_t *ptep)
+			unsigned long address, pte_t *ptep, int write_access)
 {
 	int ret = VM_FAULT_SIGBUS;
 	unsigned long idx;
 	unsigned long size;
 	struct page *page;
 	struct address_space *mapping;
+	pte_t new_pte;

 	mapping = vma->vm_file->f_mapping;
 	idx = ((address - vma->vm_start) >> HPAGE_SHIFT)
@@ -387,10 +459,13 @@
 	 * Use page lock to guard against racing truncation
 	 * before we get page_table_lock.
*/ - page = find_or_alloc_huge_page(mapping, idx); + page = find_or_alloc_huge_page(mapping, idx, + vma->vm_flags & VM_SHARED); if (!page) goto out; + BUG_ON(!PageLocked(page)); + spin_lock(&mm->page_table_lock); size = i_size_read(mapping->host) >> HPAGE_SHIFT; if (idx >= size) @@ -401,7 +476,15 @@ goto backout; add_mm_counter(mm, file_rss, HPAGE_SIZE / PAGE_SIZE); - set_huge_pte_at(mm, address, ptep, make_huge_pte(vma, page)); + new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE) + && (vma->vm_flags & VM_SHARED))); + set_huge_pte_at(mm, address, ptep, new_pte); + + if (write_access && !(vma->vm_flags & VM_SHARED)) { + /* Optimization, do the COW without a second fault */ + ret = hugetlb_cow(mm, vma, address, ptep, new_pte); + } + spin_unlock(&mm->page_table_lock); unlock_page(page); out: @@ -420,6 +503,7 @@ { pte_t *ptep; pte_t entry; + int ret; ptep = huge_pte_alloc(mm, address); if (!ptep) @@ -427,13 +511,18 @@ entry = *ptep; if (pte_none(entry)) - return hugetlb_no_page(mm, vma, address, ptep); + return hugetlb_no_page(mm, vma, address, ptep, write_access); - /* - * We could get here if another thread instantiated the pte - * before the test above. 
- */ - return VM_FAULT_MINOR; + ret = VM_FAULT_MINOR; + + spin_lock(&mm->page_table_lock); + /* Check for a racing update before calling hugetlb_cow */ + if (likely(pte_same(entry, *ptep))) + if (write_access && !pte_write(entry)) + ret = hugetlb_cow(mm, vma, address, ptep, entry); + spin_unlock(&mm->page_table_lock); + + return ret; } int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, Index: working-2.6/mm/mmap.c =================================================================== --- working-2.6.orig/mm/mmap.c 2005-11-10 14:41:51.000000000 +1100 +++ working-2.6/mm/mmap.c 2005-11-10 14:44:13.000000000 +1100 @@ -1076,8 +1076,9 @@ error = file->f_op->mmap(file, vma); if (error) goto unmap_and_free_vma; - if ((vma->vm_flags & (VM_SHARED | VM_WRITE | VM_RESERVED)) - == (VM_WRITE | VM_RESERVED)) { + if ((vma->vm_flags + & (VM_SHARED | VM_WRITE | VM_RESERVED | VM_HUGETLB)) + == (VM_WRITE | VM_RESERVED)) { printk(KERN_WARNING "program %s is using MAP_PRIVATE, " "PROT_WRITE mmap of VM_RESERVED memory, which " "is deprecated. Please report this to " -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 14+ messages in thread
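
The flag tests in the patch above decide three things: whether a mapping
is copy-on-write at fork time, whether a freshly faulted huge page gets a
writable PTE straight away, and whether a fault on a present PTE has to go
through hugetlb_cow().  A minimal userspace sketch of those predicates
follows.  The helper names (is_cow_mapping, initial_pte_writable,
needs_cow) and the numeric flag values are assumptions made for
illustration; the authoritative definitions are the kernel's own.

```c
#include <stdbool.h>

/*
 * Illustrative flag values only (assumption): the authoritative
 * definitions are in the kernel's include/linux/mm.h.
 */
#define VM_WRITE	0x02UL
#define VM_SHARED	0x08UL
#define VM_MAYWRITE	0x20UL

/*
 * Mirrors the "cow" test in copy_hugetlb_page_range(): the mapping is
 * private (VM_SHARED clear) and may become writable (VM_MAYWRITE set).
 */
static bool is_cow_mapping(unsigned long vm_flags)
{
	return (vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}

/*
 * Mirrors the "writable" argument computed in hugetlb_no_page(): only a
 * shared, writable mapping gets a writable PTE on the first fault.
 */
static bool initial_pte_writable(unsigned long vm_flags)
{
	return (vm_flags & VM_WRITE) && (vm_flags & VM_SHARED);
}

/*
 * Mirrors the test in hugetlb_fault(): a write access to a present but
 * read-only PTE is what triggers the copy.
 */
static bool needs_cow(bool write_access, bool pte_writable)
{
	return write_access && !pte_writable;
}
```

Under these assumptions a private writable mapping starts with a
read-only PTE, so its first write always takes the hugetlb_cow() path,
while a shared writable mapping never does.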
* Re: [PATCH 4/4] Hugetlb: Copy on Write support
  2005-11-10  3:54 ` David Gibson
@ 2005-11-10  4:20 ` William Lee Irwin III
  0 siblings, 0 replies; 14+ messages in thread
From: William Lee Irwin III @ 2005-11-10  4:20 UTC (permalink / raw)
  To: David Gibson
  Cc: Rohit Seth, Adam Litke, akpm, linux-mm, linux-kernel, hugh,
	kenneth.w.chen

On Wed, Nov 09, 2005 at 05:52:44PM -0800, Rohit Seth wrote:
>> lazy_mmu_prot_update will need to be called here to make caches coherent
>> for some archs.

On Thu, Nov 10, 2005 at 02:54:03PM +1100, David Gibson wrote:
> Ah, yes indeed.  Revised version below.  While I was at it, I moved
> set_huge_ptep_writable() into mm/hugetlb.c, since there's no actual
> need for it to be in the .h, and abolished huge_ptep_set_wrprotect()
> since there's no need for the macro at all.
> Hugetlb: Copy on Write support

Re-acking. Good catch, thanks Rohit.

Acked-by: William Irwin <wli@holomorphy.com>

-- wli
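
The user-visible behaviour this series aims to provide for hugetlbfs is
the usual MAP_PRIVATE contract: after fork(), a write by one process must
not be seen by the other.  The sketch below demonstrates that contract in
userspace.  It deliberately uses an ordinary anonymous mapping rather
than a hugetlbfs file (an assumption made so that it runs without any
huge pages configured), and the function name is hypothetical.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if a child's write to a MAP_PRIVATE mapping stays invisible
 * to the parent, 0 if the write leaked, and -1 on a setup failure.
 */
static int private_write_is_isolated(void)
{
	size_t len = (size_t)sysconf(_SC_PAGESIZE);
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	pid_t pid;
	int status, isolated;

	if (p == MAP_FAILED)
		return -1;

	p[0] = 'A';		/* fault the page in before forking */

	pid = fork();
	if (pid < 0) {
		munmap(p, len);
		return -1;
	}
	if (pid == 0) {
		p[0] = 'B';	/* write fault: the child gets its own copy */
		_exit(p[0] == 'B' ? 0 : 1);
	}

	if (waitpid(pid, &status, 0) < 0 ||
	    !WIFEXITED(status) || WEXITSTATUS(status) != 0) {
		munmap(p, len);
		return -1;
	}

	isolated = (p[0] == 'A');	/* the parent must still see its own data */
	munmap(p, len);
	return isolated;
}
```

With the patches applied, the same test against a MAP_PRIVATE mapping of
a hugetlbfs file would be expected to behave the same way, provided
enough huge pages are available for the copy.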
end of thread, other threads:[~2005-11-10  4:20 UTC | newest]

Thread overview: 14+ messages
2005-11-09 23:28 [PATCH 0/4] hugetlb: copy on write Adam Litke
2005-11-09 23:36 ` [PATCH 1/4] Hugetlb: Remove duplicate i_size check Adam Litke
2005-11-10  0:10 ` William Lee Irwin III
2005-11-09 23:37 ` [PATCH 2/4] Hugetlb: Rename find_lock_page to find_or_alloc_huge_page Adam Litke
2005-11-10  0:11 ` William Lee Irwin III
2005-11-09 23:38 ` [PATCH 3/4] Hugetlb: Reorganize hugetlb_fault to prepare for COW Adam Litke
2005-11-10  0:13 ` William Lee Irwin III
2005-11-09 23:39 ` [PATCH 4/4] Hugetlb: Copy on Write support Adam Litke
2005-11-10  0:15 ` William Lee Irwin III
2005-11-10  0:49 ` David Gibson
2005-11-10  0:56 ` William Lee Irwin III
2005-11-10  1:52 ` Rohit Seth
2005-11-10  3:54 ` David Gibson
2005-11-10  4:20 ` William Lee Irwin III