* [PATCH 1/5] hugetlb: Account for hugepages as locked_vm
2007-09-13 17:58 [PATCH 0/5] [hugetlb] Dynamic huge page pool resizing Adam Litke
@ 2007-09-13 17:59 ` Adam Litke
2007-09-14 5:41 ` Ken Chen
2007-09-13 17:59 ` [PATCH 2/5] hugetlb: Move update_and_free_page Adam Litke
` (3 subsequent siblings)
4 siblings, 1 reply; 14+ messages in thread
From: Adam Litke @ 2007-09-13 17:59 UTC
To: linux-mm
Cc: libhugetlbfs-devel, Adam Litke, Andy Whitcroft, Mel Gorman,
Bill Irwin, Ken Chen, Dave McCracken
Hugepages allocated to a process are pinned into memory and are not
reclaimable. Currently they do not contribute towards the process' locked
memory. This patch includes those pages in the process' 'locked_vm' pages.
NOTE: The locked_vm counter is only updated at fault and unmap time. Huge
pages are different from regular mlocked memory which is faulted in all at
once. Therefore, it does not make sense to charge at mmap time for huge
page mappings. This difference results in a deviation from normal mlock
accounting which cannot be trivially reconciled given the inherent
differences with huge pages.
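For reference, locked_vm counts base pages, so each huge-page fault adds
HPAGE_SIZE >> PAGE_SHIFT to the counter. A minimal sketch of that
conversion (illustrative only, not part of the patch; the helper name is
made up):

static unsigned long hpages_to_locked_vm(unsigned long nr_hpages)
{
	/* e.g. 2MB huge pages with 4kB base pages: 512 per huge page */
	return nr_hpages * (HPAGE_SIZE >> PAGE_SHIFT);
}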
Signed-off-by: Adam Litke <agl@us.ibm.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Andy Whitcroft <apw@shadowen.org>
---
mm/hugetlb.c | 11 +++++++++++
1 files changed, 11 insertions(+), 0 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index de4cf45..1dfeafa 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -428,6 +428,7 @@ void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
continue;
page = pte_page(pte);
+ mm->locked_vm -= HPAGE_SIZE >> PAGE_SHIFT;
if (pte_dirty(pte))
set_page_dirty(page);
list_add(&page->lru, &page_list);
@@ -561,6 +562,16 @@ retry:
&& (vma->vm_flags & VM_SHARED)));
set_huge_pte_at(mm, address, ptep, new_pte);
+ /*
+ * Account for huge pages as locked memory.
+ * The locked limits are not enforced at mmap time because hugetlbfs
+ * behaves differently than normal locked memory: 1) The pages are
+ * not pinned immediately, and 2) The pages come from a pre-configured
+ * pool of memory to which the administrator has separately arranged
+ * access.
+ */
+ mm->locked_vm += HPAGE_SIZE >> PAGE_SHIFT;
+
if (write_access && !(vma->vm_flags & VM_SHARED)) {
/* Optimization, do the COW without a second fault */
ret = hugetlb_cow(mm, vma, address, ptep, new_pte);
* Re: [PATCH 1/5] hugetlb: Account for hugepages as locked_vm
2007-09-13 17:59 ` [PATCH 1/5] hugetlb: Account for hugepages as locked_vm Adam Litke
@ 2007-09-14 5:41 ` Ken Chen
2007-09-14 9:15 ` Mel Gorman
0 siblings, 1 reply; 14+ messages in thread
From: Ken Chen @ 2007-09-14 5:41 UTC
To: Adam Litke
Cc: linux-mm, libhugetlbfs-devel, Andy Whitcroft, Mel Gorman,
Bill Irwin, Dave McCracken
On 9/13/07, Adam Litke <agl@us.ibm.com> wrote:
> Hugepages allocated to a process are pinned into memory and are not
> reclaimable. Currently they do not contribute towards the process' locked
> memory. This patch includes those pages in the process' 'locked_vm' pages.
On x86_64, hugetlb can share page table entries if multiple processes
have their virtual addresses all lined up perfectly. Because of that,
mm->locked_vm can go negative with this patch, depending on the order
in which processes fault in hugetlb pages and which one unmaps them last.
Have you checked all users of mm->locked_vm to make sure a negative
number won't trigger unpleasant results?
- Ken
* Re: [PATCH 1/5] hugetlb: Account for hugepages as locked_vm
2007-09-14 5:41 ` Ken Chen
@ 2007-09-14 9:15 ` Mel Gorman
0 siblings, 0 replies; 14+ messages in thread
From: Mel Gorman @ 2007-09-14 9:15 UTC
To: Ken Chen
Cc: Adam Litke, linux-mm, libhugetlbfs-devel, Andy Whitcroft,
Bill Irwin, Dave McCracken
On (13/09/07 22:41), Ken Chen didst pronounce:
> On 9/13/07, Adam Litke <agl@us.ibm.com> wrote:
> > Hugepages allocated to a process are pinned into memory and are not
> > reclaimable. Currently they do not contribute towards the process' locked
> > memory. This patch includes those pages in the process' 'locked_vm' pages.
>
> On x86_64, hugetlb can share page table entries if multiple processes
> have their virtual addresses all lined up perfectly. Because of that,
> mm->locked_vm can go negative with this patch, depending on the order
> in which processes fault in hugetlb pages and which one unmaps them last.
>
Hmmm, on closer inspection you are right. The worst case is where two
processes share a PMD and each faults in half of the hugepages in that
region. Whichever of them unmaps last will get bad values.
> Have you checked all users of mm->locked_vm to make sure a negative
> number won't trigger unpleasant results?
>
This, if it can occur, is bad. It looks stupid if absolutely nothing
else. Besides, locked_vm is an unsigned long: wrapping negative would
actually yield a huge positive value, so it's possible that a hostile
process A could cause a situation where an innocent process B gets a
large locked_vm value and can no longer dynamically resize the hugepage
pool.
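To illustrate, a minimal userspace sketch of the wrap (not kernel code;
the printed value assumes a 64-bit unsigned long):

#include <stdio.h>

int main(void)
{
	unsigned long locked_vm = 0;

	/* an unmap that charges pages this mm never faulted in */
	locked_vm -= 512;
	printf("locked_vm = %lu\n", locked_vm); /* 18446744073709551104 */
	return 0;
}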
The choices for a fix I can think of are:
a) Do not use locked_vm at all. Instead use filesystem quotas to prevent the
pool growing in an unbounded fashion (this is Adam and Andy Whitcroft's idea,
not mine, but it makes sense in light of this problem with locked_vm). I
liked the idea of being able to limit additional hugepage usage with
RLIMIT_MEMLOCK but maybe that is not such a great plan.
b) Double-count locked_vm, i.e. when pagetables are shared, the process
about to share increments its locked_vm based on the pages already
faulted. On fault, all sharing mm's get their locked_vm increased and
unmap acts as it does now. This would require taking many
page_table_locks to update locked_vm, which would be very expensive.
Anyone got better suggestions than this? Mr. McCracken, how did you
handle the mlocked case in your pagetable sharing patches back when you
were working on them? I am assuming the problem is somewhat similar.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* [PATCH 2/5] hugetlb: Move update_and_free_page
2007-09-13 17:58 [PATCH 0/5] [hugetlb] Dynamic huge page pool resizing Adam Litke
2007-09-13 17:59 ` [PATCH 1/5] hugetlb: Account for hugepages as locked_vm Adam Litke
@ 2007-09-13 17:59 ` Adam Litke
2007-09-13 17:59 ` [PATCH 3/5] hugetlb: Try to grow hugetlb pool for MAP_PRIVATE mappings Adam Litke
` (2 subsequent siblings)
4 siblings, 0 replies; 14+ messages in thread
From: Adam Litke @ 2007-09-13 17:59 UTC
To: linux-mm
Cc: libhugetlbfs-devel, Adam Litke, Andy Whitcroft, Mel Gorman,
Bill Irwin, Ken Chen, Dave McCracken
This patch simply moves update_and_free_page() so that it can be reused
later in this patch series. The implementation is not changed.
Signed-off-by: Adam Litke <agl@us.ibm.com>
Acked-by: Andy Whitcroft <apw@shadowen.org>
---
mm/hugetlb.c | 30 +++++++++++++++---------------
1 files changed, 15 insertions(+), 15 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1dfeafa..50195a2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -90,6 +90,21 @@ static struct page *dequeue_huge_page(struct vm_area_struct *vma,
return page;
}
+static void update_and_free_page(struct page *page)
+{
+ int i;
+ nr_huge_pages--;
+ nr_huge_pages_node[page_to_nid(page)]--;
+ for (i = 0; i < (HPAGE_SIZE / PAGE_SIZE); i++) {
+ page[i].flags &= ~(1 << PG_locked | 1 << PG_error | 1 << PG_referenced |
+ 1 << PG_dirty | 1 << PG_active | 1 << PG_reserved |
+ 1 << PG_private | 1<< PG_writeback);
+ }
+ set_compound_page_dtor(page, NULL);
+ set_page_refcounted(page);
+ __free_pages(page, HUGETLB_PAGE_ORDER);
+}
+
static void free_huge_page(struct page *page)
{
BUG_ON(page_count(page));
@@ -199,21 +214,6 @@ static unsigned int cpuset_mems_nr(unsigned int *array)
}
#ifdef CONFIG_SYSCTL
-static void update_and_free_page(struct page *page)
-{
- int i;
- nr_huge_pages--;
- nr_huge_pages_node[page_to_nid(page)]--;
- for (i = 0; i < (HPAGE_SIZE / PAGE_SIZE); i++) {
- page[i].flags &= ~(1 << PG_locked | 1 << PG_error | 1 << PG_referenced |
- 1 << PG_dirty | 1 << PG_active | 1 << PG_reserved |
- 1 << PG_private | 1<< PG_writeback);
- }
- set_compound_page_dtor(page, NULL);
- set_page_refcounted(page);
- __free_pages(page, HUGETLB_PAGE_ORDER);
-}
-
#ifdef CONFIG_HIGHMEM
static void try_to_free_low(unsigned long count)
{
* [PATCH 3/5] hugetlb: Try to grow hugetlb pool for MAP_PRIVATE mappings
2007-09-13 17:58 [PATCH 0/5] [hugetlb] Dynamic huge page pool resizing Adam Litke
2007-09-13 17:59 ` [PATCH 1/5] hugetlb: Account for hugepages as locked_vm Adam Litke
2007-09-13 17:59 ` [PATCH 2/5] hugetlb: Move update_and_free_page Adam Litke
@ 2007-09-13 17:59 ` Adam Litke
2007-09-13 18:06 ` [Libhugetlbfs-devel] " Dave Hansen
2007-09-13 17:59 ` [PATCH 4/5] hugetlb: Try to grow hugetlb pool for MAP_SHARED mappings Adam Litke
2007-09-13 17:59 ` [PATCH 5/5] hugetlb: Add hugetlb_dynamic_pool sysctl Adam Litke
4 siblings, 1 reply; 14+ messages in thread
From: Adam Litke @ 2007-09-13 17:59 UTC
To: linux-mm
Cc: libhugetlbfs-devel, Adam Litke, Andy Whitcroft, Mel Gorman,
Bill Irwin, Ken Chen, Dave McCracken
Because we overcommit hugepages for MAP_PRIVATE mappings, it is possible
that the hugetlb pool will be exhausted or completely reserved when a
hugepage is needed to satisfy a page fault. Before killing the process in
this situation, try to allocate a hugepage directly from the buddy
allocator. Only do this if the process would remain within its locked_vm
memory limits.
The explicitly configured pool size becomes a low watermark. When
dynamically grown, the allocated huge pages are accounted as a surplus over
the watermark. As huge pages are freed on a node, surplus pages are
released to the buddy allocator so that the pool will shrink back to the
watermark.
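As an illustration of the watermark behaviour (the numbers here are made
up, not from the patch):

/*
 * Pool configured with nr_hugepages = 10 (the low watermark).
 * Private mappings fault in 12 huge pages:
 *     nr_huge_pages = 12, surplus = 2
 * As pages are freed, the first 2 go straight back to the buddy
 * allocator and the rest return to the hugetlb free lists, shrinking
 * the pool back to 10.
 */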
Signed-off-by: Adam Litke <agl@us.ibm.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Andy Whitcroft <apw@shadowen.org>
---
mm/hugetlb.c | 71 +++++++++++++++++++++++++++++++++++++++++++++++++++++++---
1 files changed, 67 insertions(+), 4 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 50195a2..ec5207e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -27,6 +27,7 @@ unsigned long max_huge_pages;
static struct list_head hugepage_freelists[MAX_NUMNODES];
static unsigned int nr_huge_pages_node[MAX_NUMNODES];
static unsigned int free_huge_pages_node[MAX_NUMNODES];
+static unsigned int surplus_huge_pages_node[MAX_NUMNODES];
static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
unsigned long hugepages_treat_as_movable;
@@ -107,12 +108,18 @@ static void update_and_free_page(struct page *page)
static void free_huge_page(struct page *page)
{
- BUG_ON(page_count(page));
+ int nid = page_to_nid(page);
+ BUG_ON(page_count(page));
INIT_LIST_HEAD(&page->lru);
spin_lock(&hugetlb_lock);
- enqueue_huge_page(page);
+ if (surplus_huge_pages_node[nid]) {
+ update_and_free_page(page);
+ surplus_huge_pages_node[nid]--;
+ } else {
+ enqueue_huge_page(page);
+ }
spin_unlock(&hugetlb_lock);
}
@@ -148,10 +155,57 @@ static int alloc_fresh_huge_page(void)
return 0;
}
+/*
+ * Returns 1 if a process remains within lock limits after locking
+ * hpage_delta huge pages. It is expected that mmap_sem is held
+ * when calling this function, otherwise the locked_vm counter may
+ * change unexpectedly
+ */
+static int within_locked_vm_limits(long hpage_delta)
+{
+ unsigned long locked_pages, locked_pages_limit;
+
+ /* Check locked page limits */
+ locked_pages = current->mm->locked_vm;
+ locked_pages += hpage_delta * (HPAGE_SIZE >> PAGE_SHIFT);
+ locked_pages_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur;
+ locked_pages_limit >>= PAGE_SHIFT;
+
+ /* Return 0 if we would exceed locked_vm limits */
+ if (locked_pages > locked_pages_limit)
+ return 0;
+
+ /* Nice, we're within limits */
+ return 1;
+}
+
+static struct page *alloc_buddy_huge_page(struct vm_area_struct *vma,
+ unsigned long address)
+{
+ struct page *page;
+
+ /* Check we remain within limits if 1 huge page is allocated */
+ if (!within_locked_vm_limits(1))
+ return NULL;
+
+ page = alloc_pages(htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
+ HUGETLB_PAGE_ORDER);
+ if (page) {
+ set_compound_page_dtor(page, free_huge_page);
+ spin_lock(&hugetlb_lock);
+ nr_huge_pages++;
+ nr_huge_pages_node[page_to_nid(page)]++;
+ surplus_huge_pages_node[page_to_nid(page)]++;
+ spin_unlock(&hugetlb_lock);
+ }
+
+ return page;
+}
+
static struct page *alloc_huge_page(struct vm_area_struct *vma,
unsigned long addr)
{
- struct page *page;
+ struct page *page = NULL;
spin_lock(&hugetlb_lock);
if (vma->vm_flags & VM_MAYSHARE)
@@ -171,7 +225,16 @@ fail:
if (vma->vm_flags & VM_MAYSHARE)
resv_huge_pages++;
spin_unlock(&hugetlb_lock);
- return NULL;
+
+ /*
+ * Private mappings do not use reserved huge pages so the allocation
+ * may have failed due to an undersized hugetlb pool. Try to grab a
+ * surplus huge page from the buddy allocator.
+ */
+ if (!(vma->vm_flags & VM_MAYSHARE))
+ page = alloc_buddy_huge_page(vma, addr);
+
+ return page;
}
static int __init hugetlb_init(void)
* Re: [Libhugetlbfs-devel] [PATCH 3/5] hugetlb: Try to grow hugetlb pool for MAP_PRIVATE mappings
2007-09-13 17:59 ` [PATCH 3/5] hugetlb: Try to grow hugetlb pool for MAP_PRIVATE mappings Adam Litke
@ 2007-09-13 18:06 ` Dave Hansen
2007-09-13 20:21 ` Adam Litke
0 siblings, 1 reply; 14+ messages in thread
From: Dave Hansen @ 2007-09-13 18:06 UTC
To: Adam Litke
Cc: linux-mm, libhugetlbfs-devel, Dave McCracken, Mel Gorman,
Ken Chen, Andy Whitcroft, Bill Irwin
On Thu, 2007-09-13 at 10:59 -0700, Adam Litke wrote:
> +static int within_locked_vm_limits(long hpage_delta)
> +{
> + unsigned long locked_pages, locked_pages_limit;
> +
> + /* Check locked page limits */
> + locked_pages = current->mm->locked_vm;
> + locked_pages += hpage_delta * (HPAGE_SIZE >> PAGE_SHIFT);
> + locked_pages_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur;
> + locked_pages_limit >>= PAGE_SHIFT;
> +
> + /* Return 0 if we would exceed locked_vm limits */
> + if (locked_pages > locked_pages_limit)
> + return 0;
> +
> + /* Nice, we're within limits */
> + return 1;
> +}
> +
> +static struct page *alloc_buddy_huge_page(struct vm_area_struct *vma,
> + unsigned long address)
> +{
> + struct page *page;
> +
> + /* Check we remain within limits if 1 huge page is allocated */
> + if (!within_locked_vm_limits(1))
> + return NULL;
> +
> + page = alloc_pages(htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
...
Is there locking around this operation? Or is there a way that a
process could do this concurrently in two different threads, where both
pass the within_locked_vm_limits() check and both succeed in allocating,
even though doing so actually takes them over the limit?
-- Dave
* Re: [Libhugetlbfs-devel] [PATCH 3/5] hugetlb: Try to grow hugetlb pool for MAP_PRIVATE mappings
2007-09-13 18:06 ` [Libhugetlbfs-devel] " Dave Hansen
@ 2007-09-13 20:21 ` Adam Litke
2007-09-14 5:46 ` David Gibson
0 siblings, 1 reply; 14+ messages in thread
From: Adam Litke @ 2007-09-13 20:21 UTC
To: Dave Hansen
Cc: linux-mm, libhugetlbfs-devel, Dave McCracken, Mel Gorman,
Ken Chen, Andy Whitcroft, Bill Irwin
On Thu, 2007-09-13 at 11:06 -0700, Dave Hansen wrote:
> On Thu, 2007-09-13 at 10:59 -0700, Adam Litke wrote:
> > +static int within_locked_vm_limits(long hpage_delta)
> > +{
> > + unsigned long locked_pages, locked_pages_limit;
> > +
> > + /* Check locked page limits */
> > + locked_pages = current->mm->locked_vm;
> > + locked_pages += hpage_delta * (HPAGE_SIZE >> PAGE_SHIFT);
> > + locked_pages_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur;
> > + locked_pages_limit >>= PAGE_SHIFT;
> > +
> > + /* Return 0 if we would exceed locked_vm limits */
> > + if (locked_pages > locked_pages_limit)
> > + return 0;
> > +
> > + /* Nice, we're within limits */
> > + return 1;
> > +}
> > +
> > +static struct page *alloc_buddy_huge_page(struct vm_area_struct *vma,
> > + unsigned long address)
> > +{
> > + struct page *page;
> > +
> > + /* Check we remain within limits if 1 huge page is allocated */
> > + if (!within_locked_vm_limits(1))
> > + return NULL;
> > +
> > + page = alloc_pages(htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
> ...
>
> Is there locking around this operation? Or is there a way that a
> process could do this concurrently in two different threads, where both
> pass the within_locked_vm_limits() check and both succeed in allocating,
> even though doing so actually takes them over the limit?
This case is prevented by hugetlb_instantiation_mutex. I'll include a
comment to make that clearer.
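Roughly, the effect is that the check and the allocation are serialized
(a sketch of the idea with assumed surrounding code, not the actual
fault path):

	mutex_lock(&hugetlb_instantiation_mutex);
	if (within_locked_vm_limits(1))
		page = alloc_buddy_huge_page(vma, address);
	mutex_unlock(&hugetlb_instantiation_mutex);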
--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
* Re: [Libhugetlbfs-devel] [PATCH 3/5] hugetlb: Try to grow hugetlb pool for MAP_PRIVATE mappings
2007-09-13 20:21 ` Adam Litke
@ 2007-09-14 5:46 ` David Gibson
2007-09-14 13:33 ` Adam Litke
0 siblings, 1 reply; 14+ messages in thread
From: David Gibson @ 2007-09-14 5:46 UTC
To: Adam Litke
Cc: Dave Hansen, libhugetlbfs-devel, Dave McCracken, linux-mm,
Mel Gorman, Ken Chen, Andy Whitcroft, Bill Irwin
On Thu, Sep 13, 2007 at 03:21:30PM -0500, Adam Litke wrote:
> On Thu, 2007-09-13 at 11:06 -0700, Dave Hansen wrote:
> > On Thu, 2007-09-13 at 10:59 -0700, Adam Litke wrote:
> > > +static int within_locked_vm_limits(long hpage_delta)
> > > +{
> > > + unsigned long locked_pages, locked_pages_limit;
> > > +
> > > + /* Check locked page limits */
> > > + locked_pages = current->mm->locked_vm;
> > > + locked_pages += hpage_delta * (HPAGE_SIZE >> PAGE_SHIFT);
> > > + locked_pages_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur;
> > > + locked_pages_limit >>= PAGE_SHIFT;
> > > +
> > > + /* Return 0 if we would exceed locked_vm limits */
> > > + if (locked_pages > locked_pages_limit)
> > > + return 0;
> > > +
> > > + /* Nice, we're within limits */
> > > + return 1;
> > > +}
> > > +
> > > +static struct page *alloc_buddy_huge_page(struct vm_area_struct *vma,
> > > + unsigned long address)
> > > +{
> > > + struct page *page;
> > > +
> > > + /* Check we remain within limits if 1 huge page is allocated */
> > > + if (!within_locked_vm_limits(1))
> > > + return NULL;
> > > +
> > > + page = alloc_pages(htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
> > ...
> >
> > Is there locking around this operation? Or is there a way that a
> > process could do this concurrently in two different threads, where both
> > pass the within_locked_vm_limits() check and both succeed in allocating,
> > even though doing so actually takes them over the limit?
>
> This case is prevented by hugetlb_instantiation_mutex. I'll include a
> comment to make that clearer.
Hrm... a number of people are trying to get rid of, or at least reduce
the scope of, the instantiation mutex, since it can be a significant
bottleneck when clearing large numbers of hugepages on big SMP
systems.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
* Re: [Libhugetlbfs-devel] [PATCH 3/5] hugetlb: Try to grow hugetlb pool for MAP_PRIVATE mappings
2007-09-14 5:46 ` David Gibson
@ 2007-09-14 13:33 ` Adam Litke
0 siblings, 0 replies; 14+ messages in thread
From: Adam Litke @ 2007-09-14 13:33 UTC
To: David Gibson
Cc: Dave Hansen, libhugetlbfs-devel, Dave McCracken, linux-mm,
Mel Gorman, Ken Chen, Andy Whitcroft, Bill Irwin
On Fri, 2007-09-14 at 15:46 +1000, David Gibson wrote:
> On Thu, Sep 13, 2007 at 03:21:30PM -0500, Adam Litke wrote:
> > On Thu, 2007-09-13 at 11:06 -0700, Dave Hansen wrote:
> > > On Thu, 2007-09-13 at 10:59 -0700, Adam Litke wrote:
> > > > +static int within_locked_vm_limits(long hpage_delta)
> > > > +{
> > > > + unsigned long locked_pages, locked_pages_limit;
> > > > +
> > > > + /* Check locked page limits */
> > > > + locked_pages = current->mm->locked_vm;
> > > > + locked_pages += hpage_delta * (HPAGE_SIZE >> PAGE_SHIFT);
> > > > + locked_pages_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur;
> > > > + locked_pages_limit >>= PAGE_SHIFT;
> > > > +
> > > > + /* Return 0 if we would exceed locked_vm limits */
> > > > + if (locked_pages > locked_pages_limit)
> > > > + return 0;
> > > > +
> > > > + /* Nice, we're within limits */
> > > > + return 1;
> > > > +}
> > > > +
> > > > +static struct page *alloc_buddy_huge_page(struct vm_area_struct *vma,
> > > > + unsigned long address)
> > > > +{
> > > > + struct page *page;
> > > > +
> > > > + /* Check we remain within limits if 1 huge page is allocated */
> > > > + if (!within_locked_vm_limits(1))
> > > > + return NULL;
> > > > +
> > > > + page = alloc_pages(htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
> > > ...
> > >
> > > Is there locking around this operation? Or is there a way that a
> > > process could do this concurrently in two different threads, where both
> > > pass the within_locked_vm_limits() check and both succeed in allocating,
> > > even though doing so actually takes them over the limit?
> >
> > This case is prevented by hugetlb_instantiation_mutex. I'll include a
> > comment to make that clearer.
>
> Hrm... a number of people are trying to get rid of, or at least reduce
> the scope of, the instantiation mutex, since it can be a significant
> bottleneck when clearing large numbers of hugepages on big SMP
> systems.
Yes, and with the exception of this bit, this patch series furthers that
goal substantially. With a dynamic hugetlb pool, the
alloc-instantiation race can be handled by stretching the pool during
the race window to accommodate the temporary overage.
As for the safety of within_locked_vm_limits() depending on
hugetlb_instantiation_mutex, perhaps this is another reason to not use
the locked ulimit as a way to manage hugetlb pool growth (since we do
have the fs quota method).
--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
* [PATCH 4/5] hugetlb: Try to grow hugetlb pool for MAP_SHARED mappings
2007-09-13 17:58 [PATCH 0/5] [hugetlb] Dynamic huge page pool resizing Adam Litke
` (2 preceding siblings ...)
2007-09-13 17:59 ` [PATCH 3/5] hugetlb: Try to grow hugetlb pool for MAP_PRIVATE mappings Adam Litke
@ 2007-09-13 17:59 ` Adam Litke
2007-09-13 22:24 ` Dave McCracken
2007-09-13 17:59 ` [PATCH 5/5] hugetlb: Add hugetlb_dynamic_pool sysctl Adam Litke
4 siblings, 1 reply; 14+ messages in thread
From: Adam Litke @ 2007-09-13 17:59 UTC
To: linux-mm
Cc: libhugetlbfs-devel, Adam Litke, Andy Whitcroft, Mel Gorman,
Bill Irwin, Ken Chen, Dave McCracken
Shared mappings require special handling because the huge pages needed to
fully populate the VMA must be reserved at mmap time. If not enough pages
are available when making the reservation, allocate all of the shortfall at
once from the buddy allocator and add the pages directly to the hugetlb
pool. If they cannot be allocated, then fail the mapping. The page
surplus is accounted for in the same way as for private mappings; faulted
surplus pages will be freed at unmap time. Reserved, surplus pages that
have not been used must be freed separately when their reservation has been
released.
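For reference, a worked example of the shortfall computation used by
gather_surplus_pages() below (the numbers are illustrative):

/*
 * resv_huge_pages = 5, free_huge_pages = 6, new reservation delta = 3:
 *     needed = (5 + 3) - 6 = 2
 * so two surplus pages are allocated from the buddy allocator and added
 * to the pool; a non-positive result means the free pool already covers
 * the reservation.
 */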
Signed-off-by: Adam Litke <agl@us.ibm.com>
Acked-by: Andy Whitcroft <apw@shadowen.org>
---
mm/hugetlb.c | 161 ++++++++++++++++++++++++++++++++++++++++++++++++++--------
1 files changed, 138 insertions(+), 23 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ec5207e..0cedcd0 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -28,6 +28,7 @@ static struct list_head hugepage_freelists[MAX_NUMNODES];
static unsigned int nr_huge_pages_node[MAX_NUMNODES];
static unsigned int free_huge_pages_node[MAX_NUMNODES];
static unsigned int surplus_huge_pages_node[MAX_NUMNODES];
+static unsigned long unused_surplus_pages;
static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
unsigned long hugepages_treat_as_movable;
@@ -85,6 +86,10 @@ static struct page *dequeue_huge_page(struct vm_area_struct *vma,
list_del(&page->lru);
free_huge_pages--;
free_huge_pages_node[nid]--;
+ if (vma && vma->vm_flags & VM_MAYSHARE) {
+ resv_huge_pages--;
+ unused_surplus_pages--;
+ }
break;
}
}
@@ -202,15 +207,120 @@ static struct page *alloc_buddy_huge_page(struct vm_area_struct *vma,
return page;
}
+/*
+ * Increase the hugetlb pool such that it can accommodate a reservation
+ * of size 'delta'.
+ */
+static int gather_surplus_pages(int delta)
+{
+ struct list_head surplus_list;
+ struct page *page, *tmp;
+ int ret, i;
+ int needed, allocated;
+
+ needed = (resv_huge_pages + delta) - free_huge_pages;
+ if (!needed)
+ return 0;
+
+ allocated = 0;
+ INIT_LIST_HEAD(&surplus_list);
+
+ ret = -ENOMEM;
+retry:
+ spin_unlock(&hugetlb_lock);
+ for (i = 0; i < needed; i++) {
+ page = alloc_buddy_huge_page(NULL, 0);
+ if (!page) {
+ /*
+ * We were not able to allocate enough pages to
+ * satisfy the entire reservation so we free what
+ * we've allocated so far.
+ */
+ spin_lock(&hugetlb_lock);
+ needed = 0;
+ goto free;
+ }
+
+ list_add(&page->lru, &surplus_list);
+ }
+ allocated += needed;
+
+ /*
+ * After retaking hugetlb_lock, we need to recalculate 'needed'
+ * because either resv_huge_pages or free_huge_pages may have changed.
+ */
+ spin_lock(&hugetlb_lock);
+ needed = (resv_huge_pages + delta) - (free_huge_pages + allocated);
+ if (needed > 0)
+ goto retry;
+
+ /*
+ * The surplus_list now contains _at_least_ the number of extra pages
+ * needed to accommodate the reservation. Add the appropriate number
+ * of pages to the hugetlb pool and free the extras back to the buddy
+ * allocator.
+ *
+ * Those pages that get added to the pool may never be allocated and
+ * subsequently freed so keep track of them in unused_surplus_pages
+ * so they can be freed again when a reservation is released.
+ */
+ needed += allocated;
+ unused_surplus_pages += needed;
+ ret = 0;
+free:
+ list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
+ list_del(&page->lru);
+ if ((--needed) >= 0)
+ enqueue_huge_page(page);
+ else
+ update_and_free_page(page);
+ }
+
+ return ret;
+}
+
+/*
+ * When releasing a reservation, free all unused, surplus huge pages that are
+ * no longer reserved.
+ */
+void return_unused_surplus_pages(void)
+{
+ static int nid = -1;
+ int delta;
+ struct page *page;
+
+ delta = unused_surplus_pages - resv_huge_pages;
+
+ while (delta) {
+ nid = next_node(nid, node_online_map);
+ if (nid == MAX_NUMNODES)
+ nid = first_node(node_online_map);
+
+ if (!surplus_huge_pages_node[nid])
+ continue;
+
+ if (!list_empty(&hugepage_freelists[nid])) {
+ page = list_entry(hugepage_freelists[nid].next,
+ struct page, lru);
+ list_del(&page->lru);
+ update_and_free_page(page);
+ free_huge_pages--;
+ free_huge_pages_node[nid]--;
+ surplus_huge_pages_node[nid]--;
+ unused_surplus_pages--;
+ delta--;
+ }
+ }
+}
+
static struct page *alloc_huge_page(struct vm_area_struct *vma,
unsigned long addr)
{
struct page *page = NULL;
+ int use_reserved_page = vma->vm_flags & VM_MAYSHARE;
spin_lock(&hugetlb_lock);
- if (vma->vm_flags & VM_MAYSHARE)
- resv_huge_pages--;
- else if (free_huge_pages <= resv_huge_pages)
+ if (!use_reserved_page && (free_huge_pages <= resv_huge_pages))
goto fail;
page = dequeue_huge_page(vma, addr);
@@ -222,8 +332,6 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
return page;
fail:
- if (vma->vm_flags & VM_MAYSHARE)
- resv_huge_pages++;
spin_unlock(&hugetlb_lock);
/*
@@ -231,7 +339,7 @@ fail:
* may have failed due to an undersized hugetlb pool. Try to grab a
* surplus huge page from the buddy allocator.
*/
- if (!(vma->vm_flags & VM_MAYSHARE))
+ if (!use_reserved_page)
page = alloc_buddy_huge_page(vma, addr);
return page;
@@ -915,21 +1023,6 @@ static int hugetlb_acct_memory(long delta)
int ret = -ENOMEM;
spin_lock(&hugetlb_lock);
- if ((delta + resv_huge_pages) <= free_huge_pages) {
- resv_huge_pages += delta;
- ret = 0;
- }
- spin_unlock(&hugetlb_lock);
- return ret;
-}
-
-int hugetlb_reserve_pages(struct inode *inode, long from, long to)
-{
- long ret, chg;
-
- chg = region_chg(&inode->i_mapping->private_list, from, to);
- if (chg < 0)
- return chg;
/*
* When cpuset is configured, it breaks the strict hugetlb page
* reservation as the accounting is done on a global variable. Such
@@ -947,8 +1040,30 @@ int hugetlb_reserve_pages(struct inode *inode, long from, long to)
* a best attempt and hopefully to minimize the impact of changing
* semantics that cpuset has.
*/
- if (chg > cpuset_mems_nr(free_huge_pages_node))
- return -ENOMEM;
+ if (delta > 0) {
+ if (gather_surplus_pages(delta) < 0)
+ goto out;
+
+ if (delta > cpuset_mems_nr(free_huge_pages_node))
+ goto out;
+ }
+
+ ret = 0;
+ resv_huge_pages += delta;
+ if (delta <= 0)
+ return_unused_surplus_pages();
+out:
+ spin_unlock(&hugetlb_lock);
+ return ret;
+}
+
+int hugetlb_reserve_pages(struct inode *inode, long from, long to)
+{
+ long ret, chg;
+
+ chg = region_chg(&inode->i_mapping->private_list, from, to);
+ if (chg < 0)
+ return chg;
ret = hugetlb_acct_memory(chg);
if (ret < 0)
* Re: [PATCH 4/5] hugetlb: Try to grow hugetlb pool for MAP_SHARED mappings
2007-09-13 17:59 ` [PATCH 4/5] hugetlb: Try to grow hugetlb pool for MAP_SHARED mappings Adam Litke
@ 2007-09-13 22:24 ` Dave McCracken
2007-09-14 14:03 ` Adam Litke
0 siblings, 1 reply; 14+ messages in thread
From: Dave McCracken @ 2007-09-13 22:24 UTC
To: Adam Litke
Cc: linux-mm, libhugetlbfs-devel, Andy Whitcroft, Mel Gorman,
Bill Irwin, Ken Chen
On Thursday 13 September 2007, Adam Litke wrote:
> +static int gather_surplus_pages(int delta)
> +{
> + struct list_head surplus_list;
> + struct page *page, *tmp;
> + int ret, i;
> + int needed, allocated;
> +
> + needed = (resv_huge_pages + delta) - free_huge_pages;
> + if (!needed)
> + return 0;
It looks here like needed can be less than zero. Do we really intend to
continue with the function if that's true? Or should that test really be "if
(needed <= 0)"?
Dave
* Re: [PATCH 4/5] hugetlb: Try to grow hugetlb pool for MAP_SHARED mappings
2007-09-13 22:24 ` Dave McCracken
@ 2007-09-14 14:03 ` Adam Litke
0 siblings, 0 replies; 14+ messages in thread
From: Adam Litke @ 2007-09-14 14:03 UTC
To: Dave McCracken
Cc: linux-mm, libhugetlbfs-devel, Andy Whitcroft, Mel Gorman,
Bill Irwin, Ken Chen
On Thu, 2007-09-13 at 17:24 -0500, Dave McCracken wrote:
> On Thursday 13 September 2007, Adam Litke wrote:
> > +static int gather_surplus_pages(int delta)
> > +{
> > + struct list_head surplus_list;
> > + struct page *page, *tmp;
> > + int ret, i;
> > + int needed, allocated;
> > +
> > + needed = (resv_huge_pages + delta) - free_huge_pages;
> > + if (!needed)
> > + return 0;
>
> It looks here like needed can be less than zero. Do we really intend to
> continue with the function if that's true? Or should that test really be "if
> (needed <= 0)"?
You are right about that. Thanks for the review :)
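For reference, the adjusted guard would then read (a sketch):

	needed = (resv_huge_pages + delta) - free_huge_pages;
	if (needed <= 0)	/* free pool already covers the reservation */
		return 0;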
--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
* [PATCH 5/5] hugetlb: Add hugetlb_dynamic_pool sysctl
2007-09-13 17:58 [PATCH 0/5] [hugetlb] Dynamic huge page pool resizing Adam Litke
` (3 preceding siblings ...)
2007-09-13 17:59 ` [PATCH 4/5] hugetlb: Try to grow hugetlb pool for MAP_SHARED mappings Adam Litke
@ 2007-09-13 17:59 ` Adam Litke
4 siblings, 0 replies; 14+ messages in thread
From: Adam Litke @ 2007-09-13 17:59 UTC
To: linux-mm
Cc: libhugetlbfs-devel, Adam Litke, Andy Whitcroft, Mel Gorman,
Bill Irwin, Ken Chen, Dave McCracken
Allowing the hugetlb pool to grow dynamically changes the semantics of the
system by permitting more system memory to be used for huge pages than has
been explicitly dedicated to the pool.
This patch introduces a sysctl which must be enabled to turn on the dynamic
pool resizing feature. This will avoid an involuntary change in behavior.
When hugetlb pool growth is enabled via the hugetlb_dynamic_pool sysctl, an
upper-bound on huge page allocation can be set by constraining the size of
the hugetlb filesystem via the 'size' mount option.
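A sketch of enabling the feature from userspace — the path follows the
vm_table entry below, the helper name is made up, and it assumes this
patch is applied:

#include <fcntl.h>
#include <unistd.h>

static int enable_hugetlb_dynamic_pool(void)
{
	int fd = open("/proc/sys/vm/hugetlb_dynamic_pool", O_WRONLY);

	if (fd < 0)
		return -1;
	if (write(fd, "1", 1) != 1) {
		close(fd);
		return -1;
	}
	return close(fd);
}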
Signed-off-by: Adam Litke <agl@us.ibm.com>
Acked-by: Andy Whitcroft <apw@shadowen.org>
---
include/linux/hugetlb.h | 1 +
kernel/sysctl.c | 8 ++++++++
mm/hugetlb.c | 5 +++++
3 files changed, 14 insertions(+), 0 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index e6a71c8..cec45db 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -33,6 +33,7 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed);
extern unsigned long max_huge_pages;
extern unsigned long hugepages_treat_as_movable;
+extern int hugetlb_dynamic_pool;
extern const unsigned long hugetlb_zero, hugetlb_infinity;
extern int sysctl_hugetlb_shm_group;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 8bdb8c0..fd60a5e 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -879,6 +879,14 @@ static ctl_table vm_table[] = {
.mode = 0644,
.proc_handler = &hugetlb_treat_movable_handler,
},
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "hugetlb_dynamic_pool",
+ .data = &hugetlb_dynamic_pool,
+ .maxlen = sizeof(hugetlb_dynamic_pool),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
#endif
{
.ctl_name = VM_LOWMEM_RESERVE_RATIO,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0cedcd0..caef721 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -31,6 +31,7 @@ static unsigned int surplus_huge_pages_node[MAX_NUMNODES];
static unsigned long unused_surplus_pages;
static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
unsigned long hugepages_treat_as_movable;
+int hugetlb_dynamic_pool;
/*
* Protects updates to hugepage_freelists, nr_huge_pages, and free_huge_pages
@@ -189,6 +190,10 @@ static struct page *alloc_buddy_huge_page(struct vm_area_struct *vma,
{
struct page *page;
+ /* Check if the dynamic pool is enabled */
+ if (!hugetlb_dynamic_pool)
+ return NULL;
+
/* Check we remain within limits if 1 huge page is allocated */
if (!within_locked_vm_limits(1))
return NULL;