* [PATCH] Determine if mapping is MAP_SHARED using VM_MAYSHARE and not VM_SHARED in hugetlbfs @ 2009-05-19 8:36 Mel Gorman 2009-05-20 10:12 ` Eric B Munson 2009-05-25 21:09 ` Hugh Dickins 0 siblings, 2 replies; 9+ messages in thread From: Mel Gorman @ 2009-05-19 8:36 UTC (permalink / raw) To: linux-mm, npiggin, apw, ebmunson, andi, hugh, avid, kenneth.w.chen, wli Cc: linux-kernel, akpm, starlight hugetlbfs reserves huge pages and accounts for them differently depending on whether the mapping was mapped MAP_SHARED or MAP_PRIVATE. However, the check made against VMA->vm_flags is sometimes VM_SHARED and not VM_MAYSHARE. For file-backed mappings, such as hugetlbfs, VM_SHARED is set only if the mapping is MAP_SHARED *and* it is read-write. For example, if a shared memory mapping was created read-write with shmget() for populating of data and mapped SHM_RDONLY by other processes, then hugetlbfs gets the accounting wrong and reservations leak. This patch alters mm/hugetlb.c and replaces VM_SHARED with VM_MAYSHARE when the intent of the code was to check whether the VMA was mapped MAP_SHARED or MAP_PRIVATE. The patch needs wider review as there are places where we really mean VM_SHARED and not VM_MAYSHARE. I believe I got all the right places, but a second opinion is needed. When/if this patch passes review, it'll be needed for 2.6.30 and -stable as it partially addresses the problem reported in http://bugzilla.kernel.org/show_bug.cgi?id=13302 and http://bugzilla.kernel.org/show_bug.cgi?id=12134. Signed-off-by: Mel Gorman <mel@csn.ul.ie> --- mm/hugetlb.c | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 28c655b..e83ad2c 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -316,7 +316,7 @@ static void resv_map_release(struct kref *ref) static struct resv_map *vma_resv_map(struct vm_area_struct *vma) { VM_BUG_ON(!is_vm_hugetlb_page(vma)); - if (!(vma->vm_flags & VM_SHARED)) + if (!(vma->vm_flags & VM_MAYSHARE)) return (struct resv_map *)(get_vma_private_data(vma) & ~HPAGE_RESV_MASK); return NULL; @@ -325,7 +325,7 @@ static struct resv_map *vma_resv_map(struct vm_area_struct *vma) static void set_vma_resv_map(struct vm_area_struct *vma, struct resv_map *map) { VM_BUG_ON(!is_vm_hugetlb_page(vma)); - VM_BUG_ON(vma->vm_flags & VM_SHARED); + VM_BUG_ON(vma->vm_flags & VM_MAYSHARE); set_vma_private_data(vma, (get_vma_private_data(vma) & HPAGE_RESV_MASK) | (unsigned long)map); @@ -334,7 +334,7 @@ static void set_vma_resv_map(struct vm_area_struct *vma, struct resv_map *map) static void set_vma_resv_flags(struct vm_area_struct *vma, unsigned long flags) { VM_BUG_ON(!is_vm_hugetlb_page(vma)); - VM_BUG_ON(vma->vm_flags & VM_SHARED); + VM_BUG_ON(vma->vm_flags & VM_MAYSHARE); set_vma_private_data(vma, get_vma_private_data(vma) | flags); } @@ -353,7 +353,7 @@ static void decrement_hugepage_resv_vma(struct hstate *h, if (vma->vm_flags & VM_NORESERVE) return; - if (vma->vm_flags & VM_SHARED) { + if (vma->vm_flags & VM_MAYSHARE) { /* Shared mappings always use reserves */ h->resv_huge_pages--; } else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) { @@ -369,14 +369,14 @@ static void decrement_hugepage_resv_vma(struct hstate *h, void reset_vma_resv_huge_pages(struct vm_area_struct *vma) { VM_BUG_ON(!is_vm_hugetlb_page(vma)); - if (!(vma->vm_flags & VM_SHARED)) + if (!(vma->vm_flags & VM_MAYSHARE)) vma->vm_private_data = (void *)0; } /* Returns true if the VMA has associated reserve pages */ static int vma_has_reserves(struct vm_area_struct *vma) { - if (vma->vm_flags & VM_SHARED) + if (vma->vm_flags & VM_MAYSHARE) return 1; if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) return 1; @@ -924,7 +924,7 @@ static long vma_needs_reservation(struct hstate *h, struct address_space *mapping = vma->vm_file->f_mapping; struct inode *inode = mapping->host; - if (vma->vm_flags & VM_SHARED) { + if (vma->vm_flags & VM_MAYSHARE) { pgoff_t idx = vma_hugecache_offset(h, vma, addr); return region_chg(&inode->i_mapping->private_list, idx, idx + 1); @@ -949,7 +949,7 @@ static void vma_commit_reservation(struct hstate *h, struct address_space *mapping = vma->vm_file->f_mapping; struct inode *inode = mapping->host; - if (vma->vm_flags & VM_SHARED) { + if (vma->vm_flags & VM_MAYSHARE) { pgoff_t idx = vma_hugecache_offset(h, vma, addr); region_add(&inode->i_mapping->private_list, idx, idx + 1); @@ -1893,7 +1893,7 @@ retry_avoidcopy: * at the time of fork() could consume its reserves on COW instead * of the full address range. */ - if (!(vma->vm_flags & VM_SHARED) && + if (!(vma->vm_flags & VM_MAYSHARE) && is_vma_resv_set(vma, HPAGE_RESV_OWNER) && old_page != pagecache_page) outside_reserve = 1; @@ -2000,7 +2000,7 @@ retry: clear_huge_page(page, address, huge_page_size(h)); __SetPageUptodate(page); - if (vma->vm_flags & VM_SHARED) { + if (vma->vm_flags & VM_MAYSHARE) { int err; struct inode *inode = mapping->host; @@ -2104,7 +2104,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, goto out_mutex; } - if (!(vma->vm_flags & VM_SHARED)) + if (!(vma->vm_flags & VM_MAYSHARE)) pagecache_page = hugetlbfs_pagecache_page(h, vma, address); } @@ -2289,7 +2289,7 @@ int hugetlb_reserve_pages(struct inode *inode, * to reserve the full area even if read-only as mprotect() may be * called to make the mapping read-write. Assume !vma is a shm mapping */ - if (!vma || vma->vm_flags & VM_SHARED) + if (!vma || vma->vm_flags & VM_MAYSHARE) chg = region_chg(&inode->i_mapping->private_list, from, to); else { struct resv_map *resv_map = resv_map_alloc(); @@ -2330,7 +2330,7 @@ int hugetlb_reserve_pages(struct inode *inode, * consumed reservations are stored in the map. Hence, nothing * else has to be done for private mappings here */ - if (!vma || vma->vm_flags & VM_SHARED) + if (!vma || vma->vm_flags & VM_MAYSHARE) region_add(&inode->i_mapping->private_list, from, to); return 0; } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] Determine if mapping is MAP_SHARED using VM_MAYSHARE and not VM_SHARED in hugetlbfs 2009-05-19 8:36 [PATCH] Determine if mapping is MAP_SHARED using VM_MAYSHARE and not VM_SHARED in hugetlbfs Mel Gorman @ 2009-05-20 10:12 ` Eric B Munson 2009-05-25 21:09 ` Hugh Dickins 1 sibling, 0 replies; 9+ messages in thread From: Eric B Munson @ 2009-05-20 10:12 UTC (permalink / raw) To: Mel Gorman Cc: linux-mm, npiggin, apw, andi, hugh, avid, kenneth.w.chen, wli, linux-kernel, akpm, starlight [-- Attachment #1: Type: text/plain, Size: 6548 bytes --] On Tue, 19 May 2009, Mel Gorman wrote: > hugetlbfs reserves huge pages and accounts for them differently depending > on whether the mapping was mapped MAP_SHARED or MAP_PRIVATE. However, the > check made against VMA->vm_flags is sometimes VM_SHARED and not VM_MAYSHARE. > For file-backed mappings, such as hugetlbfs, VM_SHARED is set only if the > mapping is MAP_SHARED *and* it is read-write. For example, if a shared > memory mapping was created read-write with shmget() for populating of data > and mapped SHM_RDONLY by other processes, then hugetlbfs gets the accounting > wrong and reservations leak. > > This patch alters mm/hugetlb.c and replaces VM_SHARED with VM_MAYSHARE when > the intent of the code was to check whether the VMA was mapped MAP_SHARED > or MAP_PRIVATE. > > The patch needs wider review as there are places where we really mean > VM_SHARED and not VM_MAYSHARE. I believe I got all the right places, but a > second opinion is needed. When/if this patch passes review, it'll be needed > for 2.6.30 and -stable as it partially addresses the problem reported in > http://bugzilla.kernel.org/show_bug.cgi?id=13302 and > http://bugzilla.kernel.org/show_bug.cgi?id=12134. > > Signed-off-by: Mel Gorman <mel@csn.ul.ie> The libhugetlbfs test suite has a test that triggers this bug reliably, I have run this test both with and without this patch on x86_64 and ppc64. The bug triggers every time wiouth the patch and does not with the patch applied. Tested-by: Eric B Munson <ebmunson@us.ibm.com> > --- > mm/hugetlb.c | 26 +++++++++++++------------- > 1 file changed, 13 insertions(+), 13 deletions(-) > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index 28c655b..e83ad2c 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -316,7 +316,7 @@ static void resv_map_release(struct kref *ref) > static struct resv_map *vma_resv_map(struct vm_area_struct *vma) > { > VM_BUG_ON(!is_vm_hugetlb_page(vma)); > - if (!(vma->vm_flags & VM_SHARED)) > + if (!(vma->vm_flags & VM_MAYSHARE)) > return (struct resv_map *)(get_vma_private_data(vma) & > ~HPAGE_RESV_MASK); > return NULL; > @@ -325,7 +325,7 @@ static struct resv_map *vma_resv_map(struct vm_area_struct *vma) > static void set_vma_resv_map(struct vm_area_struct *vma, struct resv_map *map) > { > VM_BUG_ON(!is_vm_hugetlb_page(vma)); > - VM_BUG_ON(vma->vm_flags & VM_SHARED); > + VM_BUG_ON(vma->vm_flags & VM_MAYSHARE); > > set_vma_private_data(vma, (get_vma_private_data(vma) & > HPAGE_RESV_MASK) | (unsigned long)map); > @@ -334,7 +334,7 @@ static void set_vma_resv_map(struct vm_area_struct *vma, struct resv_map *map) > static void set_vma_resv_flags(struct vm_area_struct *vma, unsigned long flags) > { > VM_BUG_ON(!is_vm_hugetlb_page(vma)); > - VM_BUG_ON(vma->vm_flags & VM_SHARED); > + VM_BUG_ON(vma->vm_flags & VM_MAYSHARE); > > set_vma_private_data(vma, get_vma_private_data(vma) | flags); > } > @@ -353,7 +353,7 @@ static void decrement_hugepage_resv_vma(struct hstate *h, > if (vma->vm_flags & VM_NORESERVE) > return; > > - if (vma->vm_flags & VM_SHARED) { > + if (vma->vm_flags & VM_MAYSHARE) { > /* Shared mappings always use reserves */ > h->resv_huge_pages--; > } else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) { > @@ -369,14 +369,14 @@ static void decrement_hugepage_resv_vma(struct hstate *h, > void reset_vma_resv_huge_pages(struct vm_area_struct *vma) > { > VM_BUG_ON(!is_vm_hugetlb_page(vma)); > - if (!(vma->vm_flags & VM_SHARED)) > + if (!(vma->vm_flags & VM_MAYSHARE)) > vma->vm_private_data = (void *)0; > } > > /* Returns true if the VMA has associated reserve pages */ > static int vma_has_reserves(struct vm_area_struct *vma) > { > - if (vma->vm_flags & VM_SHARED) > + if (vma->vm_flags & VM_MAYSHARE) > return 1; > if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) > return 1; > @@ -924,7 +924,7 @@ static long vma_needs_reservation(struct hstate *h, > struct address_space *mapping = vma->vm_file->f_mapping; > struct inode *inode = mapping->host; > > - if (vma->vm_flags & VM_SHARED) { > + if (vma->vm_flags & VM_MAYSHARE) { > pgoff_t idx = vma_hugecache_offset(h, vma, addr); > return region_chg(&inode->i_mapping->private_list, > idx, idx + 1); > @@ -949,7 +949,7 @@ static void vma_commit_reservation(struct hstate *h, > struct address_space *mapping = vma->vm_file->f_mapping; > struct inode *inode = mapping->host; > > - if (vma->vm_flags & VM_SHARED) { > + if (vma->vm_flags & VM_MAYSHARE) { > pgoff_t idx = vma_hugecache_offset(h, vma, addr); > region_add(&inode->i_mapping->private_list, idx, idx + 1); > > @@ -1893,7 +1893,7 @@ retry_avoidcopy: > * at the time of fork() could consume its reserves on COW instead > * of the full address range. > */ > - if (!(vma->vm_flags & VM_SHARED) && > + if (!(vma->vm_flags & VM_MAYSHARE) && > is_vma_resv_set(vma, HPAGE_RESV_OWNER) && > old_page != pagecache_page) > outside_reserve = 1; > @@ -2000,7 +2000,7 @@ retry: > clear_huge_page(page, address, huge_page_size(h)); > __SetPageUptodate(page); > > - if (vma->vm_flags & VM_SHARED) { > + if (vma->vm_flags & VM_MAYSHARE) { > int err; > struct inode *inode = mapping->host; > > @@ -2104,7 +2104,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, > goto out_mutex; > } > > - if (!(vma->vm_flags & VM_SHARED)) > + if (!(vma->vm_flags & VM_MAYSHARE)) > pagecache_page = hugetlbfs_pagecache_page(h, > vma, address); > } > @@ -2289,7 +2289,7 @@ int hugetlb_reserve_pages(struct inode *inode, > * to reserve the full area even if read-only as mprotect() may be > * called to make the mapping read-write. Assume !vma is a shm mapping > */ > - if (!vma || vma->vm_flags & VM_SHARED) > + if (!vma || vma->vm_flags & VM_MAYSHARE) > chg = region_chg(&inode->i_mapping->private_list, from, to); > else { > struct resv_map *resv_map = resv_map_alloc(); > @@ -2330,7 +2330,7 @@ int hugetlb_reserve_pages(struct inode *inode, > * consumed reservations are stored in the map. Hence, nothing > * else has to be done for private mappings here > */ > - if (!vma || vma->vm_flags & VM_SHARED) > + if (!vma || vma->vm_flags & VM_MAYSHARE) > region_add(&inode->i_mapping->private_list, from, to); > return 0; > } > -- Eric B Munson IBM Linux Technology Center ebmunson@us.ibm.com [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 197 bytes --] ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] Determine if mapping is MAP_SHARED using VM_MAYSHARE and not VM_SHARED in hugetlbfs 2009-05-19 8:36 [PATCH] Determine if mapping is MAP_SHARED using VM_MAYSHARE and not VM_SHARED in hugetlbfs Mel Gorman 2009-05-20 10:12 ` Eric B Munson @ 2009-05-25 21:09 ` Hugh Dickins 2009-05-26 10:12 ` Mel Gorman 1 sibling, 1 reply; 9+ messages in thread From: Hugh Dickins @ 2009-05-25 21:09 UTC (permalink / raw) To: Mel Gorman Cc: npiggin, apw, agl, ebmunson, andi, david, kenchen, wli, akpm, starlight, linux-kernel, linux-mm On Tue, 19 May 2009, Mel Gorman wrote: > hugetlbfs reserves huge pages and accounts for them differently depending > on whether the mapping was mapped MAP_SHARED or MAP_PRIVATE. However, the > check made against VMA->vm_flags is sometimes VM_SHARED and not VM_MAYSHARE. > For file-backed mappings, such as hugetlbfs, VM_SHARED is set only if the > mapping is MAP_SHARED *and* it is read-write. For example, if a shared > memory mapping was created read-write with shmget() for populating of data > and mapped SHM_RDONLY by other processes, then hugetlbfs gets the accounting > wrong and reservations leak. > > This patch alters mm/hugetlb.c and replaces VM_SHARED with VM_MAYSHARE when > the intent of the code was to check whether the VMA was mapped MAP_SHARED > or MAP_PRIVATE. > > The patch needs wider review as there are places where we really mean > VM_SHARED and not VM_MAYSHARE. I believe I got all the right places, but a > second opinion is needed. When/if this patch passes review, it'll be needed > for 2.6.30 and -stable as it partially addresses the problem reported in > http://bugzilla.kernel.org/show_bug.cgi?id=13302 and > http://bugzilla.kernel.org/show_bug.cgi?id=12134. > > Signed-off-by: Mel Gorman <mel@csn.ul.ie> After another session looking at this one, Mel, I'm dubious about it. Let's make clear that I never attempted to understand hugetlb reservations and hugetlb private mappings at the time they went in; and after a little while gazing at the code, I wouldn't pretend to understand them now. It would be much better to hear from Adam and Andy about this than me. You're right to say that VM_MAYSHARE reflects MAP_SHARED, where VM_SHARED does not. But your description of VM_SHARED isn't quite clear: VM_SHARED is used if the file was opened read-write and its mapping is MAP_SHARED, even when the mapping is not PROT_WRITE (since the file was opened read- write, the mapping is eligible for an mprotect to PROT_WRITE later on). Yes, mm/hugetlb.c uses VM_SHARED throughout, rather than VM_MAYSHARE; and that means that its reservations behaviour won't quite follow the MAP_SHARED/MAP_PRIVATE split; but does that actually matter, so long as it remains consistent with itself? It would be nicer if it did follow that split, but I wouldn't want us to change its established behaviour around now without better reason. You suggest that you're fixing an inconsistency in the reservations behaviour, but you don't actually say what; and I don't see any confirmation from Starlight that it fixes actual anomalies seen. I'm all for fixing the bugs, but it's not self-evident that this patch does fix any: please explain in more detail. I've ended up worrying about the VM_SHAREDs you've left behind in mm/hugetlb.c: unless you can pin down exactly what you're fixing with this patch, my worry is that you're unbalancing the existing reservation assumptions. Certainly the patch shouldn't go in without libhugetlbfs testing by libhugetlbfs experts. Something I've noticed, to confirm that I can't really expect to understand how hugetlb works these days. I experimented by creating a hugetlb file, opening read-write, mmap'ing one page shared read-write (but not faulting it in); opening read-only, mmap'ing the one page read-only (shared or private, doesn't matter), faulting it in (contains zeroes of course); writing ffffffff to the one page through the read-write mapping, then looking at the read-only mapping - still contains zeroes, whereas with any normal file and mapping it should contain ffffffff, whether the read-only mapping was shared or private. And to fix that would need more than just a VM_SHARED to VM_MAYSHARE change, wouldn't it? It may well not be something fixable: perhaps there cannot be a reasonable private reservations strategy without that non-standard behaviour. But it does tell me not to trust my own preconceptions around here. Hugh > --- > mm/hugetlb.c | 26 +++++++++++++------------- > 1 file changed, 13 insertions(+), 13 deletions(-) > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index 28c655b..e83ad2c 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -316,7 +316,7 @@ static void resv_map_release(struct kref *ref) > static struct resv_map *vma_resv_map(struct vm_area_struct *vma) > { > VM_BUG_ON(!is_vm_hugetlb_page(vma)); > - if (!(vma->vm_flags & VM_SHARED)) > + if (!(vma->vm_flags & VM_MAYSHARE)) > return (struct resv_map *)(get_vma_private_data(vma) & > ~HPAGE_RESV_MASK); > return NULL; > @@ -325,7 +325,7 @@ static struct resv_map *vma_resv_map(struct vm_area_struct *vma) > static void set_vma_resv_map(struct vm_area_struct *vma, struct resv_map *map) > { > VM_BUG_ON(!is_vm_hugetlb_page(vma)); > - VM_BUG_ON(vma->vm_flags & VM_SHARED); > + VM_BUG_ON(vma->vm_flags & VM_MAYSHARE); > > set_vma_private_data(vma, (get_vma_private_data(vma) & > HPAGE_RESV_MASK) | (unsigned long)map); > @@ -334,7 +334,7 @@ static void set_vma_resv_map(struct vm_area_struct *vma, struct resv_map *map) > static void set_vma_resv_flags(struct vm_area_struct *vma, unsigned long flags) > { > VM_BUG_ON(!is_vm_hugetlb_page(vma)); > - VM_BUG_ON(vma->vm_flags & VM_SHARED); > + VM_BUG_ON(vma->vm_flags & VM_MAYSHARE); > > set_vma_private_data(vma, get_vma_private_data(vma) | flags); > } > @@ -353,7 +353,7 @@ static void decrement_hugepage_resv_vma(struct hstate *h, > if (vma->vm_flags & VM_NORESERVE) > return; > > - if (vma->vm_flags & VM_SHARED) { > + if (vma->vm_flags & VM_MAYSHARE) { > /* Shared mappings always use reserves */ > h->resv_huge_pages--; > } else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) { > @@ -369,14 +369,14 @@ static void decrement_hugepage_resv_vma(struct hstate *h, > void reset_vma_resv_huge_pages(struct vm_area_struct *vma) > { > VM_BUG_ON(!is_vm_hugetlb_page(vma)); > - if (!(vma->vm_flags & VM_SHARED)) > + if (!(vma->vm_flags & VM_MAYSHARE)) > vma->vm_private_data = (void *)0; > } > > /* Returns true if the VMA has associated reserve pages */ > static int vma_has_reserves(struct vm_area_struct *vma) > { > - if (vma->vm_flags & VM_SHARED) > + if (vma->vm_flags & VM_MAYSHARE) > return 1; > if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) > return 1; > @@ -924,7 +924,7 @@ static long vma_needs_reservation(struct hstate *h, > struct address_space *mapping = vma->vm_file->f_mapping; > struct inode *inode = mapping->host; > > - if (vma->vm_flags & VM_SHARED) { > + if (vma->vm_flags & VM_MAYSHARE) { > pgoff_t idx = vma_hugecache_offset(h, vma, addr); > return region_chg(&inode->i_mapping->private_list, > idx, idx + 1); > @@ -949,7 +949,7 @@ static void vma_commit_reservation(struct hstate *h, > struct address_space *mapping = vma->vm_file->f_mapping; > struct inode *inode = mapping->host; > > - if (vma->vm_flags & VM_SHARED) { > + if (vma->vm_flags & VM_MAYSHARE) { > pgoff_t idx = vma_hugecache_offset(h, vma, addr); > region_add(&inode->i_mapping->private_list, idx, idx + 1); > > @@ -1893,7 +1893,7 @@ retry_avoidcopy: > * at the time of fork() could consume its reserves on COW instead > * of the full address range. > */ > - if (!(vma->vm_flags & VM_SHARED) && > + if (!(vma->vm_flags & VM_MAYSHARE) && > is_vma_resv_set(vma, HPAGE_RESV_OWNER) && > old_page != pagecache_page) > outside_reserve = 1; > @@ -2000,7 +2000,7 @@ retry: > clear_huge_page(page, address, huge_page_size(h)); > __SetPageUptodate(page); > > - if (vma->vm_flags & VM_SHARED) { > + if (vma->vm_flags & VM_MAYSHARE) { > int err; > struct inode *inode = mapping->host; > > @@ -2104,7 +2104,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, > goto out_mutex; > } > > - if (!(vma->vm_flags & VM_SHARED)) > + if (!(vma->vm_flags & VM_MAYSHARE)) > pagecache_page = hugetlbfs_pagecache_page(h, > vma, address); > } > @@ -2289,7 +2289,7 @@ int hugetlb_reserve_pages(struct inode *inode, > * to reserve the full area even if read-only as mprotect() may be > * called to make the mapping read-write. Assume !vma is a shm mapping > */ > - if (!vma || vma->vm_flags & VM_SHARED) > + if (!vma || vma->vm_flags & VM_MAYSHARE) > chg = region_chg(&inode->i_mapping->private_list, from, to); > else { > struct resv_map *resv_map = resv_map_alloc(); > @@ -2330,7 +2330,7 @@ int hugetlb_reserve_pages(struct inode *inode, > * consumed reservations are stored in the map. Hence, nothing > * else has to be done for private mappings here > */ > - if (!vma || vma->vm_flags & VM_SHARED) > + if (!vma || vma->vm_flags & VM_MAYSHARE) > region_add(&inode->i_mapping->private_list, from, to); > return 0; > } > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] Determine if mapping is MAP_SHARED using VM_MAYSHARE and not VM_SHARED in hugetlbfs 2009-05-25 21:09 ` Hugh Dickins @ 2009-05-26 10:12 ` Mel Gorman 2009-05-26 12:54 ` Eric B Munson 2009-05-26 20:51 ` Hugh Dickins 0 siblings, 2 replies; 9+ messages in thread From: Mel Gorman @ 2009-05-26 10:12 UTC (permalink / raw) To: Hugh Dickins Cc: npiggin, apw, agl, ebmunson, andi, david, kenchen, wli, akpm, starlight, linux-kernel, linux-mm On Mon, May 25, 2009 at 10:09:43PM +0100, Hugh Dickins wrote: > On Tue, 19 May 2009, Mel Gorman wrote: > > > hugetlbfs reserves huge pages and accounts for them differently depending > > on whether the mapping was mapped MAP_SHARED or MAP_PRIVATE. However, the > > check made against VMA->vm_flags is sometimes VM_SHARED and not VM_MAYSHARE. > > For file-backed mappings, such as hugetlbfs, VM_SHARED is set only if the > > mapping is MAP_SHARED *and* it is read-write. For example, if a shared > > memory mapping was created read-write with shmget() for populating of data > > and mapped SHM_RDONLY by other processes, then hugetlbfs gets the accounting > > wrong and reservations leak. > > > > This patch alters mm/hugetlb.c and replaces VM_SHARED with VM_MAYSHARE when > > the intent of the code was to check whether the VMA was mapped MAP_SHARED > > or MAP_PRIVATE. > > > > The patch needs wider review as there are places where we really mean > > VM_SHARED and not VM_MAYSHARE. I believe I got all the right places, but a > > second opinion is needed. When/if this patch passes review, it'll be needed > > for 2.6.30 and -stable as it partially addresses the problem reported in > > http://bugzilla.kernel.org/show_bug.cgi?id=13302 and > > http://bugzilla.kernel.org/show_bug.cgi?id=12134. > > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie> > > After another session looking at this one, Mel, I'm dubious about it. > That doesn't surprise me. The patch is a lot less clear cut which is why I wanted more people to think about it. > Let's make clear that I never attempted to understand hugetlb reservations > and hugetlb private mappings at the time they went in; and after a little > while gazing at the code, I wouldn't pretend to understand them now. It > would be much better to hear from Adam and Andy about this than me. > For what it's worth, I wrote chunks of the reservation code, particularly with respect to the private reservations. It wasn't complete enough though which Andy fixed up as I was off for two weeks holiday at the time bugs came to light (thanks Andy!). Adam was active in hugetlbfs when the shared reservations were first implemented. Thing is, Adam is not active in kernel work at all any more and while Andy still is, it's not in this area. Hopefully they'll respond, but they might not. > You're right to say that VM_MAYSHARE reflects MAP_SHARED, where VM_SHARED > does not. But your description of VM_SHARED isn't quite clear: VM_SHARED > is used if the file was opened read-write and its mapping is MAP_SHARED, > even when the mapping is not PROT_WRITE (since the file was opened read- > write, the mapping is eligible for an mprotect to PROT_WRITE later on). > Very true, I've cleared that up in the description. For anyone watching, the relevant code is vm_flags |= VM_SHARED | VM_MAYSHARE; if (!(file->f_mode & FMODE_WRITE)) vm_flags &= ~(VM_MAYWRITE | VM_SHARED); > Yes, mm/hugetlb.c uses VM_SHARED throughout, rather than VM_MAYSHARE; > and that means that its reservations behaviour won't quite follow the > MAP_SHARED/MAP_PRIVATE split; but does that actually matter, so long > as it remains consistent with itself? It needs to be consistent with itself at minimum. The purpose of the reservations in hugetlbfs is so that future faults will succeed for the process that called mmap(). It's not going to be a perfect match to the core VM although as always, I'd like to bring it as close as possible. > It would be nicer if it did > follow that split, but I wouldn't want us to change its established > behaviour around now without better reason. > > You suggest that you're fixing an inconsistency in the reservations > behaviour, but you don't actually say what; and I don't see any > confirmation from Starlight that it fixes actual anomalies seen. > I'm all for fixing the bugs, but it's not self-evident that this > patch does fix any: please explain in more detail. > Minimally, this patch fixes a testcase I added to libhugetlbfs specifically for this problem. It's in the "next" branch of libhugetlbfs and should be released as part of 2.4. # git clone git://libhugetlbfs.git.sourceforge.net/gitroot/libhugetlbfs # cd libhugetlbfs # git checkout origin/next -b origin-next # make # ./obj/hugeadm --create-global-mounts # ./obj/hugeadm --pool-pages-min 2M:128 # make func The test that this patch fixes up is shm-perms. It can be run directly with just # ./tests/obj32/shm-perms Does this help explain the problem any better? ====== hugetlbfs reserves huge pages and accounts for them differently depending on whether the mapping was mapped MAP_SHARED or MAP_PRIVATE. For MAP_SHARED mappings, hugepages are reserved when mmap() is first called and are tracked based on information associated with the inode. MAP_PRIVATE track the reservations based on the VMA created as part of the mmap() operation. However, the check hugetlbfs makes when determining if a VMA is MAP_SHARED is with the VM_SHARED flag and not VM_MAYSHARE. For file-backed mappings, such as hugetlbfs, VM_SHARED is set only if the mapping is MAP_SHARED and the file was opened read-write. If a shared memory mapping was mapped shared-read-write for populating of data and mapped shared-read-only by other processes, then hugetlbfs gets inconsistent on how it accounts for the creation of reservations and how they are consumed. ====== > I've ended up worrying about the VM_SHAREDs you've left behind in > mm/hugetlb.c: unless you can pin down exactly what you're fixing > with this patch, my worry is that you're unbalancing the existing > reservation assumptions. Certainly the patch shouldn't go in > without libhugetlbfs testing by libhugetlbfs experts. > libhugetlbfs experts are thin on the ground. Currently, there are only two that are active in its development - Eric Munson and myself. The previous maintainer, Nish Aravamudan, moved away from hugepage development some time ago. I did run though the tests and didn't spot additional regressions. Best, I go through the remaining VM_SHARED and see what they are used for and what the expectation is. copy_hugetlb_page_range Here, it's used to determine if COW is happening. In that case it wants to know that the mapping it's dealing with is shared and read-write so I think that's ok. hugetlb_no_page Here, we are checking if COW should be broken early and then it's checking for the right write attribute for the page tables. Think that's ok too. follow_hugetlb_page This is checking of the zero page can be shared or not. Crap, this one looks like it should have been converted to VM_MAYSHARE as well. V2 is below which converts follow_hugetlb_page() as well. > Something I've noticed, to confirm that I can't really expect > to understand how hugetlb works these days. I experimented by > creating a hugetlb file, opening read-write, mmap'ing one page > shared read-write (but not faulting it in); At this point, one hugepage is reserved for the mapping but is not faulted and does not exist in the hugetlbfs page cache. > opening read-only, > mmap'ing the one page read-only (shared or private, doesn't matter), > faulting it in (contains zeroes of course); Of course. > writing ffffffff to > the one page through the read-write mapping, So, now a hugepage has been allocated and inserted into the page cache. > then looking at the > read-only mapping - still contains zeroes, whereas with any > normal file and mapping it should contain ffffffff, whether > the read-only mapping was shared or private. > I think the critical difference is that a normal file exists on a physical medium so both processes share the same data source. How would the normal file mapping behave on tmpfs for example? If tmpfs behaves correctly, I'll try and get hugetlbfs to match. There is one potential problem in there. I would have expected the pages to be shared if the second process was mapping MAP_SHARED because the page should have been in the page cache when the read-write process faulted. I'll check it out. > And to fix that would need more than just a VM_SHARED to VM_MAYSHARE > change, wouldn't it? It may well not be something fixable: perhaps > there cannot be a reasonable private reservations strategy without > that non-standard behaviour. > > But it does tell me not to trust my own preconceptions around here. > I'll have a look at that behaviour after this bug gets cleared up and see what I can find. My expectation is that anything I find in that area though will be more than the VM_SHARED vs VM_MAYSHARE though. Here is V2 of the patch. Starlight, can you confirm this patch fixes your problem for 2.6.29.4? Eric, can you confirm this passes libhugetlbfs tests and not screw something else up? Thanks ==== CUT HERE ==== ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] Determine if mapping is MAP_SHARED using VM_MAYSHARE and not VM_SHARED in hugetlbfs 2009-05-26 10:12 ` Mel Gorman @ 2009-05-26 12:54 ` Eric B Munson 2009-05-26 20:51 ` Hugh Dickins 1 sibling, 0 replies; 9+ messages in thread From: Eric B Munson @ 2009-05-26 12:54 UTC (permalink / raw) To: Mel Gorman Cc: Hugh Dickins, npiggin, apw, agl, andi, david, kenchen, wli, akpm, starlight, linux-kernel, linux-mm [-- Attachment #1: Type: text/plain, Size: 7312 bytes --] <snip> > Here is V2 of the patch. Starlight, can you confirm this patch fixes > your problem for 2.6.29.4? Eric, can you confirm this passes > libhugetlbfs tests and not screw something else up? > > Thanks This patch passes all libhugetlbfs tests on x86_64 and ppc64 using kernels 2.6.29.4 and 2.6.30-rc7. Tested-by: Eric B Munson <ebmunson@us.ibm.com> > > ==== CUT HERE ==== > From 3ea2ed7c5f307bc4b53cfe2ceddd90c8e1298078 Mon Sep 17 00:00:00 2001 > From: Mel Gorman <mel@csn.ul.ie> > Date: Tue, 26 May 2009 10:47:09 +0100 > Subject: [PATCH] Account for MAP_SHARED mappings using VM_MAYSHARE and not VM_SHARED in hugetlbfs V2 > > Changelog since V1 > o Convert follow_hugetlb_page to use VM_MAYSHARE > > hugetlbfs reserves huge pages and accounts for them differently depending on > whether the mapping was mapped MAP_SHARED or MAP_PRIVATE. For MAP_SHARED > mappings, hugepages are reserved when mmap() is first called and are > tracked based on information associated with the inode. MAP_PRIVATE track > the reservations based on the VMA created as part of the mmap() operation. > > However, the check hugetlbfs makes when determining if a VMA is MAP_SHARED > is with the VM_SHARED flag and not VM_MAYSHARE. For file-backed mappings, > such as hugetlbfs, VM_SHARED is set only if the mapping is MAP_SHARED > and the file was opened read-write. If a shared memory mapping was mapped > shared-read-write for populating of data and mapped shared-read-only by > other processes, then hugetlbfs gets inconsistent on how it accounts for > the creation of reservations and how they are consumed. > > This patch alters mm/hugetlb.c and replaces VM_SHARED with VM_MAYSHARE when > the intent of the code was to check whether the VMA was mapped MAP_SHARED > or MAP_PRIVATE. > > If this patch passes review, it's needed for 2.6.30 and -stable. > > Signed-off-by: Mel Gorman <mel@csn.ul.ie> > --- > mm/hugetlb.c | 28 ++++++++++++++-------------- > 1 file changed, 14 insertions(+), 14 deletions(-) > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index 28c655b..3687f42 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -316,7 +316,7 @@ static void resv_map_release(struct kref *ref) > static struct resv_map *vma_resv_map(struct vm_area_struct *vma) > { > VM_BUG_ON(!is_vm_hugetlb_page(vma)); > - if (!(vma->vm_flags & VM_SHARED)) > + if (!(vma->vm_flags & VM_MAYSHARE)) > return (struct resv_map *)(get_vma_private_data(vma) & > ~HPAGE_RESV_MASK); > return NULL; > @@ -325,7 +325,7 @@ static struct resv_map *vma_resv_map(struct vm_area_struct *vma) > static void set_vma_resv_map(struct vm_area_struct *vma, struct resv_map *map) > { > VM_BUG_ON(!is_vm_hugetlb_page(vma)); > - VM_BUG_ON(vma->vm_flags & VM_SHARED); > + VM_BUG_ON(vma->vm_flags & VM_MAYSHARE); > > set_vma_private_data(vma, (get_vma_private_data(vma) & > HPAGE_RESV_MASK) | (unsigned long)map); > @@ -334,7 +334,7 @@ static void set_vma_resv_map(struct vm_area_struct *vma, struct resv_map *map) > static void set_vma_resv_flags(struct vm_area_struct *vma, unsigned long flags) > { > VM_BUG_ON(!is_vm_hugetlb_page(vma)); > - VM_BUG_ON(vma->vm_flags & VM_SHARED); > + VM_BUG_ON(vma->vm_flags & VM_MAYSHARE); > > set_vma_private_data(vma, get_vma_private_data(vma) | flags); > } > @@ -353,7 +353,7 @@ static void decrement_hugepage_resv_vma(struct hstate *h, > if (vma->vm_flags & VM_NORESERVE) > return; > > - if (vma->vm_flags & VM_SHARED) { > + if (vma->vm_flags & VM_MAYSHARE) { > /* Shared mappings always use reserves */ > h->resv_huge_pages--; > } else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) { > @@ -369,14 +369,14 @@ static void decrement_hugepage_resv_vma(struct hstate *h, > void reset_vma_resv_huge_pages(struct vm_area_struct *vma) > { > VM_BUG_ON(!is_vm_hugetlb_page(vma)); > - if (!(vma->vm_flags & VM_SHARED)) > + if (!(vma->vm_flags & VM_MAYSHARE)) > vma->vm_private_data = (void *)0; > } > > /* Returns true if the VMA has associated reserve pages */ > static int vma_has_reserves(struct vm_area_struct *vma) > { > - if (vma->vm_flags & VM_SHARED) > + if (vma->vm_flags & VM_MAYSHARE) > return 1; > if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) > return 1; > @@ -924,7 +924,7 @@ static long vma_needs_reservation(struct hstate *h, > struct address_space *mapping = vma->vm_file->f_mapping; > struct inode *inode = mapping->host; > > - if (vma->vm_flags & VM_SHARED) { > + if (vma->vm_flags & VM_MAYSHARE) { > pgoff_t idx = vma_hugecache_offset(h, vma, addr); > return region_chg(&inode->i_mapping->private_list, > idx, idx + 1); > @@ -949,7 +949,7 @@ static void vma_commit_reservation(struct hstate *h, > struct address_space *mapping = vma->vm_file->f_mapping; > struct inode *inode = mapping->host; > > - if (vma->vm_flags & VM_SHARED) { > + if (vma->vm_flags & VM_MAYSHARE) { > pgoff_t idx = vma_hugecache_offset(h, vma, addr); > region_add(&inode->i_mapping->private_list, idx, idx + 1); > > @@ -1893,7 +1893,7 @@ retry_avoidcopy: > * at the time of fork() could consume its reserves on COW instead > * of the full address range. > */ > - if (!(vma->vm_flags & VM_SHARED) && > + if (!(vma->vm_flags & VM_MAYSHARE) && > is_vma_resv_set(vma, HPAGE_RESV_OWNER) && > old_page != pagecache_page) > outside_reserve = 1; > @@ -2000,7 +2000,7 @@ retry: > clear_huge_page(page, address, huge_page_size(h)); > __SetPageUptodate(page); > > - if (vma->vm_flags & VM_SHARED) { > + if (vma->vm_flags & VM_MAYSHARE) { > int err; > struct inode *inode = mapping->host; > > @@ -2104,7 +2104,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, > goto out_mutex; > } > > - if (!(vma->vm_flags & VM_SHARED)) > + if (!(vma->vm_flags & VM_MAYSHARE)) > pagecache_page = hugetlbfs_pagecache_page(h, > vma, address); > } > @@ -2168,7 +2168,7 @@ int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, > int remainder = *length; > struct hstate *h = hstate_vma(vma); > int zeropage_ok = 0; > - int shared = vma->vm_flags & VM_SHARED; > + int shared = vma->vm_flags & VM_MAYSHARE; > > spin_lock(&mm->page_table_lock); > while (vaddr < vma->vm_end && remainder) { > @@ -2289,7 +2289,7 @@ int hugetlb_reserve_pages(struct inode *inode, > * to reserve the full area even if read-only as mprotect() may be > * called to make the mapping read-write. Assume !vma is a shm mapping > */ > - if (!vma || vma->vm_flags & VM_SHARED) > + if (!vma || vma->vm_flags & VM_MAYSHARE) > chg = region_chg(&inode->i_mapping->private_list, from, to); > else { > struct resv_map *resv_map = resv_map_alloc(); > @@ -2330,7 +2330,7 @@ int hugetlb_reserve_pages(struct inode *inode, > * consumed reservations are stored in the map. Hence, nothing > * else has to be done for private mappings here > */ > - if (!vma || vma->vm_flags & VM_SHARED) > + if (!vma || vma->vm_flags & VM_MAYSHARE) > region_add(&inode->i_mapping->private_list, from, to); > return 0; > } > -- Eric B Munson IBM Linux Technology Center ebmunson@us.ibm.com [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 197 bytes --] ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] Determine if mapping is MAP_SHARED using VM_MAYSHARE and not VM_SHARED in hugetlbfs 2009-05-26 10:12 ` Mel Gorman 2009-05-26 12:54 ` Eric B Munson @ 2009-05-26 20:51 ` Hugh Dickins 2009-05-27 0:48 ` Mel Gorman 1 sibling, 1 reply; 9+ messages in thread From: Hugh Dickins @ 2009-05-26 20:51 UTC (permalink / raw) To: Mel Gorman Cc: kosaki.motohiro, npiggin, apw, agl, ebmunson, andi, david, kenchen, wli, akpm, starlight, linux-kernel, linux-mm On Tue, 26 May 2009, Mel Gorman wrote: > On Mon, May 25, 2009 at 10:09:43PM +0100, Hugh Dickins wrote: > ... > > Let's make clear that I never attempted to understand hugetlb reservations > > and hugetlb private mappings at the time they went in; and after a little > > while gazing at the code, I wouldn't pretend to understand them now. It > > would be much better to hear from Adam and Andy about this than me. > > For what it's worth, I wrote chunks of the reservation code, particularly > with respect to the private reservations. Ah, I hadn't realized you were so involved in it from the start: good. > It wasn't complete enough though > which Andy fixed up as I was off for two weeks holiday at the time bugs > came to light (thanks Andy!). Adam was active in hugetlbfs when the shared > reservations were first implemented. Thing is, Adam is not active in kernel > work at all any more And I hadn't realized that either (I did notice you'd left Adam off this thread, but he was on the other, so I added him to this too). > and while Andy still is, it's not in this area. Hopefully > they'll respond, but they might not. ... > Minimally, this patch fixes a testcase I added to libhugetlbfs > specifically for this problem. It's in the "next" branch of libhugetlbfs > and should be released as part of 2.4. > > # git clone git://libhugetlbfs.git.sourceforge.net/gitroot/libhugetlbfs > # cd libhugetlbfs > # git checkout origin/next -b origin-next > # make > # ./obj/hugeadm --create-global-mounts > # ./obj/hugeadm --pool-pages-min 2M:128 > # make func Originally I was going to say that I was sorry you pointed me there; because doing that "make func" a second time did nasty things for me, hung and eventually quite froze my machines (and not in shm-perms). But that was before applying your hugetlb.c patch: the good news is that your patch appears to fix all that nastiness. > > The test that this patch fixes up is shm-perms. It can be run directly > with just > > # ./tests/obj32/shm-perms > > Does this help explain the problem any better? I'm an ingrate: no, it doesn't help much. I'd have liked to hear how to reproduce "gets inconsistent on how it accounts for the creation of reservations and how they are consumed", what /proc/meminfo looks like in the end. I didn't see any such inconsistency when I was messing around, and I don't see shm-perms testing for any such inconsistency (just testing for whether something you think ought to work, does work). But now I've found your patch fixes my freezes from libhugetlbfs testing, I'm much happier with it; and see below for why I'm even happier. > > ====== > hugetlbfs reserves huge pages and accounts for them differently depending on > whether the mapping was mapped MAP_SHARED or MAP_PRIVATE. For MAP_SHARED > mappings, hugepages are reserved when mmap() is first called and are > tracked based on information associated with the inode. MAP_PRIVATE track > the reservations based on the VMA created as part of the mmap() operation. > > However, the check hugetlbfs makes when determining if a VMA is MAP_SHARED > is with the VM_SHARED flag and not VM_MAYSHARE. For file-backed mappings, > such as hugetlbfs, VM_SHARED is set only if the mapping is MAP_SHARED > and the file was opened read-write. If a shared memory mapping was mapped > shared-read-write for populating of data and mapped shared-read-only by > other processes, then hugetlbfs gets inconsistent on how it accounts for > the creation of reservations and how they are consumed. > ====== ... > > Best, I go through the remaining VM_SHARED and see what they are used > for and what the expectation is. > > copy_hugetlb_page_range > Here, it's used to determine if COW is happening. In that case > it wants to know that the mapping it's dealing with is shared > and read-write so I think that's ok. Yes, that test is copied from mm/memory.c and should be okay. It's an odd test and could be written in other ways, I think cow = !(vma->vm_flags & VM_SHARED) would be sufficient, wouldn't it? but perhaps that requires too much knowledge of how the flags work. Anyway, no reason to change it; though equally you could change it to VM_MAYSHARE. > > hugetlb_no_page > Here, we are checking if COW should be broken early and then > it's checking for the right write attribute for the page tables. > Think that's ok too. These were the ones which worried me, since reservation checks (which you did change) are called conditional upon them. But in the end I agree with you, they don't need changing: because they're checked along with write_access (or VM_WRITE), and you cannot write to an area on which VM_SHARED and VM_MAYSHARE differ. > > follow_hugetlb_page > This is checking of the zero page can be shared or not. Crap, > this one looks like it should have been converted to VM_MAYSHARE > as well. Now, what makes you say that? I really am eager to understand, because I don't comprehend that VM_SHARED at all. I believe Kosaki-san's 4b2e38ad simply copied it from Linus's 672ca28e to mm/memory.c. But even back when that change was made, I confessed to having lost the plot on it: so far as I can see, putting a VM_SHARED test in there just happened to prevent some VMware code going the wrong way, but I don't see the actual justification for it. So, given that I don't understand it in the first place, I can't really support changing that VM_SHARED to VM_MAYSHARE. > > Something I've noticed, to confirm that I can't really expect > > to understand how hugetlb works these days. I experimented by > > creating a hugetlb file, opening read-write, mmap'ing one page > > shared read-write (but not faulting it in); > > At this point, one hugepage is reserved for the mapping but is not > faulted and does not exist in the hugetlbfs page cache. > > > opening read-only, > > mmap'ing the one page read-only (shared or private, doesn't matter), > > faulting it in (contains zeroes of course); > > Of course. > > > writing ffffffff to > > the one page through the read-write mapping, > > So, now a hugepage has been allocated and inserted into the page cache. > > > then looking at the > > read-only mapping - still contains zeroes, whereas with any > > normal file and mapping it should contain ffffffff, whether > > the read-only mapping was shared or private. > > > > I think the critical difference is that a normal file exists on a physical > medium so both processes share the same data source. How would the normal > file mapping behave on tmpfs for example? If tmpfs behaves correctly, I'll > try and get hugetlbfs to match. tmpfs and ramfs behave just as if there were a normal file existing on a physical medium, they behave just like every(?) other filesystem than hugetlbfs: a page in a MAP_PRIVATE mapping is shared with the underlying object (or other mappings) until it gets modified. > > There is one potential problem in there. I would have expected the pages > to be shared if the second process was mapping MAP_SHARED because the > page should have been in the page cache when the read-write process > faulted. I'll check it out. The big reason why I do now like your patch (modulo your additional change to follow_huge_page) is that it seems to fix this for MAP_SHARED. It's not quite obvious how it fixes it (you claim to be correcting the accounting rather than the actual sharing/COWing of pages), but it does seem to do so. (I say "seem to" because I've managed to confuse myself with thoroughly erratic results here, but now putting it down to forgetting to remove my old test file sometimes, so picking up cached pages from earlier tests; or getting in a muddle between my readwrite and readonly fds. Please check it out for yourself, before and after your patch.) And while I dislike hugetlbfs's behaviour on the MAP_PRIVATE pages, I willingly concede that it's much more important that it behave correctly on the MAP_SHARED ones (as it stood before, "shared memory" was not necessarily getting shared). Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] Determine if mapping is MAP_SHARED using VM_MAYSHARE and not VM_SHARED in hugetlbfs 2009-05-26 20:51 ` Hugh Dickins @ 2009-05-27 0:48 ` Mel Gorman 2009-05-27 3:17 ` KOSAKI Motohiro 0 siblings, 1 reply; 9+ messages in thread From: Mel Gorman @ 2009-05-27 0:48 UTC (permalink / raw) To: Hugh Dickins Cc: kosaki.motohiro, npiggin, apw, agl, ebmunson, andi, david, kenchen, wli, akpm, starlight, linux-kernel, linux-mm On Tue, May 26, 2009 at 09:51:19PM +0100, Hugh Dickins wrote: > On Tue, 26 May 2009, Mel Gorman wrote: > > On Mon, May 25, 2009 at 10:09:43PM +0100, Hugh Dickins wrote: > > ... > > > Let's make clear that I never attempted to understand hugetlb reservations > > > and hugetlb private mappings at the time they went in; and after a little > > > while gazing at the code, I wouldn't pretend to understand them now. It > > > would be much better to hear from Adam and Andy about this than me. > > > > For what it's worth, I wrote chunks of the reservation code, particularly > > with respect to the private reservations. > > Ah, I hadn't realized you were so involved in it from the start: good. > > > It wasn't complete enough though > > which Andy fixed up as I was off for two weeks holiday at the time bugs > > came to light (thanks Andy!). Adam was active in hugetlbfs when the shared > > reservations were first implemented. Thing is, Adam is not active in kernel > > work at all any more > > And I hadn't realized that either (I did notice you'd left Adam off > this thread, but he was on the other, so I added him to this too). > Cool. > > and while Andy still is, it's not in this area. Hopefully > > they'll respond, but they might not. > > ... > > Minimally, this patch fixes a testcase I added to libhugetlbfs > > specifically for this problem. It's in the "next" branch of libhugetlbfs > > and should be released as part of 2.4. > > > > # git clone git://libhugetlbfs.git.sourceforge.net/gitroot/libhugetlbfs > > # cd libhugetlbfs > > # git checkout origin/next -b origin-next > > # make > > # ./obj/hugeadm --create-global-mounts > > # ./obj/hugeadm --pool-pages-min 2M:128 > > # make func > > Originally I was going to say that I was sorry you pointed me there; > because doing that "make func" a second time did nasty things for me, > hung and eventually quite froze my machines (and not in shm-perms). > Crap, yes, I should have warned you. I've been on a bit of a search for bugs in hugetlbfs recently. For problems encountered, I added a testcase to libhugetlbfs so it'd be caught in distros if backports were needed. Some of those tests can really hurt a vunerable kernel. Sorry :( > But that was before applying your hugetlb.c patch: the good news > is that your patch appears to fix all that nastiness. > Nice, good to know. > > > > The test that this patch fixes up is shm-perms. It can be run directly > > with just > > > > # ./tests/obj32/shm-perms > > > > Does this help explain the problem any better? > > I'm an ingrate: no, it doesn't help much. I'd have liked to hear how > to reproduce "gets inconsistent on how it accounts for the creation > of reservations and how they are consumed", what /proc/meminfo > looks like in the end. Ok, I see the problem with my first description now. This is my fault :(. I was working on two patches at the same time. One where the hugepage_total count and hugepage_total free would get out of sync and the second where the reservations are getting doubled. This patch fixes the latter problem but I described the reservations as "leaking" which is very misleading. > I didn't see any such inconsistency when > I was messing around, and I don't see shm-perms testing for any > such inconsistency (just testing for whether something you think > ought to work, does work). > The inconsistency is with hugetlbfs making the reservations. For MAP_SHARED mappings, one reservation is made that is shared between processes. Once the first mapping succeeds, future mappings should also succeed. For MAP_SHARED-read-write mapping, hugetlbfs is making the reservation and associating it with the inode correctly. For MAP_SHARED-read-only mappings, hugetlbfs is maping the reservation as if it was MAP_PRIVATE - i.e. requiring a full reservation for each process mapping. What the test case is checking is that a large number of processes can map the same mapping multiple times correctly. If the hugepage pool happened to be particularly massive > But now I've found your patch fixes my freezes from libhugetlbfs > testing, I'm much happier with it; and see below for why I'm even > happier. > > > > > ====== > > hugetlbfs reserves huge pages and accounts for them differently depending on > > whether the mapping was mapped MAP_SHARED or MAP_PRIVATE. For MAP_SHARED > > mappings, hugepages are reserved when mmap() is first called and are > > tracked based on information associated with the inode. MAP_PRIVATE track > > the reservations based on the VMA created as part of the mmap() operation. > > > > However, the check hugetlbfs makes when determining if a VMA is MAP_SHARED > > is with the VM_SHARED flag and not VM_MAYSHARE. For file-backed mappings, > > such as hugetlbfs, VM_SHARED is set only if the mapping is MAP_SHARED > > and the file was opened read-write. If a shared memory mapping was mapped > > shared-read-write for populating of data and mapped shared-read-only by > > other processes, then hugetlbfs gets inconsistent on how it accounts for > > the creation of reservations and how they are consumed. > > ====== > > ... > > > > Best, I go through the remaining VM_SHARED and see what they are used > > for and what the expectation is. > > > > copy_hugetlb_page_range > > Here, it's used to determine if COW is happening. In that case > > it wants to know that the mapping it's dealing with is shared > > and read-write so I think that's ok. > > Yes, that test is copied from mm/memory.c and should be okay. > It's an odd test and could be written in other ways, I think > cow = !(vma->vm_flags & VM_SHARED) > would be sufficient, wouldn't it? but perhaps that requires > too much knowledge of how the flags work. It should be sufficient. Writing either statement requires a lot of knowledge of how the flags work unfortunately :/. > Anyway, no reason > to change it; though equally you could change it to VM_MAYSHARE. > > > > > hugetlb_no_page > > Here, we are checking if COW should be broken early and then > > it's checking for the right write attribute for the page tables. > > Think that's ok too. > > These were the ones which worried me, since reservation checks (which > you did change) are called conditional upon them. But in the end I > agree with you, they don't need changing: because they're checked > along with write_access (or VM_WRITE), and you cannot write to > an area on which VM_SHARED and VM_MAYSHARE differ. > Yep. > > > > follow_hugetlb_page > > This is checking of the zero page can be shared or not. Crap, > > this one looks like it should have been converted to VM_MAYSHARE > > as well. > > Now, what makes you say that? > > I really am eager to understand, because I don't comprehend > that VM_SHARED at all. I think I understand it, but I keep changing my mind on whether VM_SHARED is sufficient or not. In this specific case, the zeropage must not be used by process A where it's possible that process B has populated it with data. when I said "Crap" earlier, the scenario I imagined went something like; o Process A opens a hugetlbfs file read/write but does not map the file o Process B opens the same hugetlbfs read-only and maps it MAP_SHARED. hugetlbfs allows mmaps to files that have not been ftruncate() so it can fault pages without SIGBUS o Process A writes the file - currently this is impossible as hugetlbfs does not support write() but lets pretend it was possible o Process B calls mlock() which calls into follow_hugetlb_page(). VM_SHARED is not set because it's a read-only mapping and it returns the wrong page. This last step is where I went wrong. As process 2 had no PTE for that location, it would have faulted the page as normal and gotten the correct page and never considered the zero page so VM_SHARED was ok after all. But this is sufficiently difficult that I'm worried that there is some other scenario where Process B uses the zero page when it shouldn't. Testing for VM_MAYSHARE would prevent the zero page being used incorrectly whether the mapping is read-only or read-write but maybe that's too paranoid. Kosaki, can you comment on what impact (if any) testing for VM_MAYSHARE would have here with respect to core-dumping? > I believe Kosaki-san's 4b2e38ad simply > copied it from Linus's 672ca28e to mm/memory.c. But even back > when that change was made, I confessed to having lost the plot > on it: so far as I can see, putting a VM_SHARED test in there > just happened to prevent some VMware code going the wrong way, > but I don't see the actual justification for it. > Having no idea how vmware broke exactly, I'm not sure what exactly was fixed. Maybe by not checking VM_SHARED, it was possible that a caller of get_user_pages() would not see updates made by a parallel writer. > So, given that I don't understand it in the first place, > I can't really support changing that VM_SHARED to VM_MAYSHARE. > Lets see what Kosaki says. If he's happy with VM_SHARED, I'll leave it alone. > > > Something I've noticed, to confirm that I can't really expect > > > to understand how hugetlb works these days. I experimented by > > > creating a hugetlb file, opening read-write, mmap'ing one page > > > shared read-write (but not faulting it in); > > > > At this point, one hugepage is reserved for the mapping but is not > > faulted and does not exist in the hugetlbfs page cache. > > > > > opening read-only, > > > mmap'ing the one page read-only (shared or private, doesn't matter), > > > faulting it in (contains zeroes of course); > > > > Of course. > > > > > writing ffffffff to > > > the one page through the read-write mapping, > > > > So, now a hugepage has been allocated and inserted into the page cache. > > > > > then looking at the > > > read-only mapping - still contains zeroes, whereas with any > > > normal file and mapping it should contain ffffffff, whether > > > the read-only mapping was shared or private. > > > > > > > I think the critical difference is that a normal file exists on a physical > > medium so both processes share the same data source. How would the normal > > file mapping behave on tmpfs for example? If tmpfs behaves correctly, I'll > > try and get hugetlbfs to match. > > tmpfs and ramfs behave just as if there were a normal file existing on > a physical medium, they behave just like every(?) other filesystem than > hugetlbfs: a page in a MAP_PRIVATE mapping is shared with the underlying > object (or other mappings) until it gets modified. > Ok, I'll look at what's involved at making hugetlbfs do the same thing. > > > > There is one potential problem in there. I would have expected the pages > > to be shared if the second process was mapping MAP_SHARED because the > > page should have been in the page cache when the read-write process > > faulted. I'll check it out. > > The big reason why I do now like your patch (modulo your additional > change to follow_huge_page) is that it seems to fix this for MAP_SHARED. > It's not quite obvious how it fixes it (you claim to be correcting the > accounting rather than the actual sharing/COWing of pages), but it does > seem to do so. > In this case, it's the accounting we are caring about, not the sharing. >From the point of view of sharing, VM_SHARED vs VM_MAYSHARE may be fine but the double reservations mean MAP_SHARED fails when it shouldn't. > (I say "seem to" because I've managed to confuse myself with thoroughly > erratic results here, but now putting it down to forgetting to remove my > old test file sometimes, so picking up cached pages from earlier tests; > or getting in a muddle between my readwrite and readonly fds. Please > check it out for yourself, before and after your patch.) > I'll investigate it more and see can I find something new that breaks as a result of the patch. > And while I dislike hugetlbfs's behaviour on the MAP_PRIVATE pages, > I willingly concede that it's much more important that it behave > correctly on the MAP_SHARED ones (as it stood before, "shared memory" > was not necessarily getting shared). > Thanks. I'll see what I can do about bringing it closer in line with expected behaviour in the future. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] Determine if mapping is MAP_SHARED using VM_MAYSHARE and not VM_SHARED in hugetlbfs 2009-05-27 0:48 ` Mel Gorman @ 2009-05-27 3:17 ` KOSAKI Motohiro 2009-05-27 9:56 ` Mel Gorman 0 siblings, 1 reply; 9+ messages in thread From: KOSAKI Motohiro @ 2009-05-27 3:17 UTC (permalink / raw) To: Mel Gorman Cc: kosaki.motohiro, Hugh Dickins, npiggin, apw, agl, ebmunson, andi, david, kenchen, wli, akpm, starlight, linux-kernel, linux-mm Hi > > > follow_hugetlb_page > > > This is checking of the zero page can be shared or not. Crap, > > > this one looks like it should have been converted to VM_MAYSHARE > > > as well. > > > > Now, what makes you say that? > > > > I really am eager to understand, because I don't comprehend > > that VM_SHARED at all. > > I think I understand it, but I keep changing my mind on whether > VM_SHARED is sufficient or not. > > In this specific case, the zeropage must not be used by process A where > it's possible that process B has populated it with data. when I said "Crap" > earlier, the scenario I imagined went something like; > > o Process A opens a hugetlbfs file read/write but does not map the file > o Process B opens the same hugetlbfs read-only and maps it > MAP_SHARED. hugetlbfs allows mmaps to files that have not been ftruncate() > so it can fault pages without SIGBUS > o Process A writes the file - currently this is impossible as hugetlbfs > does not support write() but lets pretend it was possible > o Process B calls mlock() which calls into follow_hugetlb_page(). > VM_SHARED is not set because it's a read-only mapping and it returns > the wrong page. > > This last step is where I went wrong. As process 2 had no PTE for that > location, it would have faulted the page as normal and gotten the correct > page and never considered the zero page so VM_SHARED was ok after all. > > But this is sufficiently difficult that I'm worried that there is some other > scenario where Process B uses the zero page when it shouldn't. Testing for > VM_MAYSHARE would prevent the zero page being used incorrectly whether the > mapping is read-only or read-write but maybe that's too paranoid. > > Kosaki, can you comment on what impact (if any) testing for VM_MAYSHARE > would have here with respect to core-dumping? Thank you for very kindful explanation. Perhaps, I don't understand this issue yet. Honestly I didn't think this issue at my patch making time. following is my current analysis. if I'm misunderstanding anythink, please correct me. hugepage mlocking call make_pages_present(). above case, follow_page_page() don't use ZERO_PAGE because vma don't have VM_SHARED. but that's ok. make_pages_present's intention is not get struct page, it is to make page population. in this case, we need follow_hugetlb_page() call hugetlb_fault(), I think. In the other hand, when core-dump case .text segment: open(O_RDONLY) + mmap(MAP_SHARED) .data segment: open(O_RDONLY) + mmap(MAP_PRIVATE) it mean .text can't use ZERO_PAGE. but I think no problem. In general .text is smaller than .data. It doesn't make so slowness. > > I believe Kosaki-san's 4b2e38ad simply > > copied it from Linus's 672ca28e to mm/memory.c. But even back > > when that change was made, I confessed to having lost the plot > > on it: so far as I can see, putting a VM_SHARED test in there > > just happened to prevent some VMware code going the wrong way, > > but I don't see the actual justification for it. > > > > Having no idea how vmware broke exactly, I'm not sure what exactly was > fixed. Maybe by not checking VM_SHARED, it was possible that a caller of > get_user_pages() would not see updates made by a parallel writer. > > > So, given that I don't understand it in the first place, > > I can't really support changing that VM_SHARED to VM_MAYSHARE. > > > > Lets see what Kosaki says. If he's happy with VM_SHARED, I'll leave it > alone. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] Determine if mapping is MAP_SHARED using VM_MAYSHARE and not VM_SHARED in hugetlbfs 2009-05-27 3:17 ` KOSAKI Motohiro @ 2009-05-27 9:56 ` Mel Gorman 0 siblings, 0 replies; 9+ messages in thread From: Mel Gorman @ 2009-05-27 9:56 UTC (permalink / raw) To: KOSAKI Motohiro Cc: Hugh Dickins, npiggin, apw, agl, ebmunson, andi, david, kenchen, wli, akpm, starlight, linux-kernel, linux-mm On Wed, May 27, 2009 at 12:17:41PM +0900, KOSAKI Motohiro wrote: > Hi > > > > > follow_hugetlb_page > > > > This is checking of the zero page can be shared or not. Crap, > > > > this one looks like it should have been converted to VM_MAYSHARE > > > > as well. > > > > > > Now, what makes you say that? > > > > > > I really am eager to understand, because I don't comprehend > > > that VM_SHARED at all. > > > > I think I understand it, but I keep changing my mind on whether > > VM_SHARED is sufficient or not. > > > > In this specific case, the zeropage must not be used by process A where > > it's possible that process B has populated it with data. when I said "Crap" > > earlier, the scenario I imagined went something like; > > > > o Process A opens a hugetlbfs file read/write but does not map the file > > o Process B opens the same hugetlbfs read-only and maps it > > MAP_SHARED. hugetlbfs allows mmaps to files that have not been ftruncate() > > so it can fault pages without SIGBUS > > o Process A writes the file - currently this is impossible as hugetlbfs > > does not support write() but lets pretend it was possible > > o Process B calls mlock() which calls into follow_hugetlb_page(). > > VM_SHARED is not set because it's a read-only mapping and it returns > > the wrong page. > > > > This last step is where I went wrong. As process 2 had no PTE for that > > location, it would have faulted the page as normal and gotten the correct > > page and never considered the zero page so VM_SHARED was ok after all. > > > > But this is sufficiently difficult that I'm worried that there is some other > > scenario where Process B uses the zero page when it shouldn't. Testing for > > VM_MAYSHARE would prevent the zero page being used incorrectly whether the > > mapping is read-only or read-write but maybe that's too paranoid. > > > > Kosaki, can you comment on what impact (if any) testing for VM_MAYSHARE > > would have here with respect to core-dumping? > > Thank you for very kindful explanation. > > Perhaps, I don't understand this issue yet. Honestly I didn't think this > issue at my patch making time. > > following is my current analysis. if I'm misunderstanding anythink, please > correct me. > > hugepage mlocking call make_pages_present(). > above case, follow_page_page() don't use ZERO_PAGE because vma don't have > VM_SHARED. > but that's ok. make_pages_present's intention is not get struct page, > it is to make page population. in this case, we need follow_hugetlb_page() call > hugetlb_fault(), I think. > > > In the other hand, when core-dump case > > .text segment: open(O_RDONLY) + mmap(MAP_SHARED) > .data segment: open(O_RDONLY) + mmap(MAP_PRIVATE) > > it mean .text can't use ZERO_PAGE. but I think no problem. In general > .text is smaller than .data. It doesn't make so slowness. > Ok, in that case, I'm going to leave VM_SHARED here alone rather than switching it to VM_MAYSHARE. Right now, VM_SHARED appears to be covering the cases we care about in this instance. Thanks. > > > > > I believe Kosaki-san's 4b2e38ad simply > > > copied it from Linus's 672ca28e to mm/memory.c. But even back > > > when that change was made, I confessed to having lost the plot > > > on it: so far as I can see, putting a VM_SHARED test in there > > > just happened to prevent some VMware code going the wrong way, > > > but I don't see the actual justification for it. > > > > > > > Having no idea how vmware broke exactly, I'm not sure what exactly was > > fixed. Maybe by not checking VM_SHARED, it was possible that a caller of > > get_user_pages() would not see updates made by a parallel writer. > > > > > So, given that I don't understand it in the first place, > > > I can't really support changing that VM_SHARED to VM_MAYSHARE. > > > > > > > Lets see what Kosaki says. If he's happy with VM_SHARED, I'll leave it > > alone. > > -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2009-05-27 9:55 UTC | newest] Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2009-05-19 8:36 [PATCH] Determine if mapping is MAP_SHARED using VM_MAYSHARE and not VM_SHARED in hugetlbfs Mel Gorman 2009-05-20 10:12 ` Eric B Munson 2009-05-25 21:09 ` Hugh Dickins 2009-05-26 10:12 ` Mel Gorman 2009-05-26 12:54 ` Eric B Munson 2009-05-26 20:51 ` Hugh Dickins 2009-05-27 0:48 ` Mel Gorman 2009-05-27 3:17 ` KOSAKI Motohiro 2009-05-27 9:56 ` Mel Gorman
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox