linux-mm.kvack.org archive mirror
* RE: [patch] hugetlb strict commit accounting
@ 2006-03-09 12:14 Chen, Kenneth W
  0 siblings, 0 replies; 11+ messages in thread
From: Chen, Kenneth W @ 2006-03-09 12:14 UTC (permalink / raw)
  To: 'David Gibson'
  Cc: wli, 'Andrew Morton', linux-mm, linux-kernel

Chen, Kenneth W wrote on Thursday, March 09, 2006 4:02 AM
> David Gibson wrote on Thursday, March 09, 2006 3:27 AM
> > Again, there are no changes to the fault handler.  Including the
> > promised changes which would mean my instantiation serialization path
> > isn't necessary ;-).
> 
> This is the major portion that I omitted in the first patch and is the
> real kicker that fulfills the promise of guaranteed hugetlb page
> availability for shared mappings.

Take a look at the following snippets of the earlier patch: in
hugetlb_reserve_pages(), region_chg() calculates an estimate of how many
pages are needed, then hugetlb_acct_memory() is called to make sure enough
pages are available, and finally region_add() confirms the reservation.
It looks OK to me.


+int hugetlb_acct_memory(long delta)
+{
+	atomic_add(delta, &resv_huge_pages);
+	if (delta > 0 && atomic_read(&resv_huge_pages) >
+			VMACCTPG(hugetlb_total_pages())) {
+		atomic_add(-delta, &resv_huge_pages);
+		return -ENOMEM;
+	}
+	return 0;
+}
+
+static int hugetlb_reserve_pages(struct inode *inode, int from, int to)
+{
+	int ret, chg;
+
+	chg = region_chg(&inode->i_mapping->private_list, from, to);
+	if (chg < 0)
+		return chg;
+	ret = hugetlb_acct_memory(chg);
+	if (ret < 0)
+		return ret;
+	region_add(&inode->i_mapping->private_list, from, to);
+	return 0;
 }



* Re: [patch] hugetlb strict commit accounting
  2006-03-10  0:45 Chen, Kenneth W
@ 2006-03-10  2:38 ` 'David Gibson'
  0 siblings, 0 replies; 11+ messages in thread
From: 'David Gibson' @ 2006-03-10  2:38 UTC (permalink / raw)
  To: Chen, Kenneth W; +Cc: wli, 'Andrew Morton', linux-mm

On Thu, Mar 09, 2006 at 04:45:11PM -0800, Chen, Kenneth W wrote:
> hugetlb strict commit accounting for shared mapping - v2
> 
> Changes since v1:
> 
> * change resv_huge_pages to a normal unsigned long
> * add proper locking around updates/accesses of resv_huge_pages
> * resv_huge_pages records the future need for hugetlb pages
> * strict commit accounting for shared mappings
> * don't allow free_huge_pages to dip below the reserved pages via the sysctl path
> 
> 
> David - what do you think? I don't think the kernel needs to traverse the
> page cache twice. It already has all the information it needs to calculate
> the future reservation requirement: at truncate time it knows (1) the total
> length, (2) how much to truncate, and (3) how many hugetlb pages were freed
> by the truncate.  Then you can just do the math.  This version doesn't do
> the extra traversal.  I suspect you can do the same thing with yours too.

Ah.. yes, I believe I can.  Erm... except I'm not sure about the locking;
I suspect that in both approaches we may need to hold tree_lock across a
larger chunk of the truncate path.

> I still want to convince you that this patch is better because it allows
> arbitrary mmap offset.

I'm almost convinced.  The only fundamental thing I still dislike is the
100 or so extra lines of code for the region manipulation.


One minor nitpick remaining:
[snip]
> +#define VMACCTPG(x) ((x) >> (HPAGE_SHIFT - PAGE_SHIFT))

This macro confuses me every time I see it - the name utterly fails to
conjure its very simple meaning.  Let's kill it; it's not really any
more verbose to expand its two callers.
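
For what it's worth, a minimal sketch of the two call sites with the macro
expanded in place (same arithmetic as the patch, just spelled out):

	int from = vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT);
	int to = (vma->vm_pgoff +
		  ((vma->vm_end - vma->vm_start) >> PAGE_SHIFT))
			>> (HPAGE_SHIFT - PAGE_SHIFT);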

> +static int hugetlb_acct_memory(long delta)
> +{
> +	int ret = -ENOMEM;
> +
> +	spin_lock(&hugetlb_lock);
> +	if ((delta + resv_huge_pages) <= free_huge_pages) {
> +		resv_huge_pages += delta;
> +		ret = 0;
> +	}
> +	spin_unlock(&hugetlb_lock);
> +	return ret;
> +}
> +
> +int hugetlb_reserve_pages(struct inode *inode, struct vm_area_struct *vma)
> +{
> +	int ret, chg;
> +	int from = VMACCTPG(vma->vm_pgoff);
> +	int to = VMACCTPG(vma->vm_pgoff +
> +			 ((vma->vm_end - vma->vm_start) >> PAGE_SHIFT));
> +
> +	chg = region_chg(&inode->i_mapping->private_list, from, to);
> +	if (chg < 0)
> +		return chg;
> +	ret = hugetlb_acct_memory(chg);
> +	if (ret < 0)
> +		return ret;
> +	region_add(&inode->i_mapping->private_list, from, to);
> +	return 0;
> +}
> +
> +void hugetlb_unreserve_pages(struct inode *inode, pgoff_t offset, int freed)
> +{
> +	int chg;
> +	chg  = region_truncate(&inode->i_mapping->private_list, offset);
> +	hugetlb_acct_memory(freed - chg);
> +}
> 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* [patch] hugetlb strict commit accounting
@ 2006-03-10  0:45 Chen, Kenneth W
  2006-03-10  2:38 ` 'David Gibson'
  0 siblings, 1 reply; 11+ messages in thread
From: Chen, Kenneth W @ 2006-03-10  0:45 UTC (permalink / raw)
  To: wli, 'David Gibson', 'Andrew Morton', linux-mm

hugetlb strict commit accounting for shared mapping - v2

Changes since v1:

* change resv_huge_pages to a normal unsigned long
* add proper locking around updates/accesses of resv_huge_pages
* resv_huge_pages records the future need for hugetlb pages
* strict commit accounting for shared mappings
* don't allow free_huge_pages to dip below the reserved pages via the sysctl path


David - what do you think? I don't think the kernel needs to traverse the
page cache twice. It already has all the information it needs to calculate
the future reservation requirement: at truncate time it knows (1) the total
length, (2) how much to truncate, and (3) how many hugetlb pages were freed
by the truncate.  Then you can just do the math.  This version doesn't do
the extra traversal.  I suspect you can do the same thing with yours too.
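
As a worked example of that math (numbers purely hypothetical): suppose a
shared mapping reserved huge pages [0, 10) and 3 of the pages at or above
offset 4 had already been faulted in when the file is truncated down to 4
huge pages.  region_truncate() drops [4, 10) and returns chg = 6,
truncate_hugepages() frees the 3 instantiated pages so freed = 3, and
hugetlb_acct_memory(freed - chg) = hugetlb_acct_memory(-3) releases exactly
the 3 pages that were still reserved for future faults in the truncated
range.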

I still want to convince you that this patch is better because it allows
arbitrary mmap offset.


Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>

--- ./fs/hugetlbfs/inode.c.orig	2006-03-09 15:02:25.558844840 -0800
+++ ./fs/hugetlbfs/inode.c	2006-03-09 16:54:38.877121731 -0800
@@ -56,48 +56,9 @@ static void huge_pagevec_release(struct 
 	pagevec_reinit(pvec);
 }
 
-/*
- * huge_pages_needed tries to determine the number of new huge pages that
- * will be required to fully populate this VMA.  This will be equal to
- * the size of the VMA in huge pages minus the number of huge pages
- * (covered by this VMA) that are found in the page cache.
- *
- * Result is in bytes to be compatible with is_hugepage_mem_enough()
- */
-static unsigned long
-huge_pages_needed(struct address_space *mapping, struct vm_area_struct *vma)
-{
-	int i;
-	struct pagevec pvec;
-	unsigned long start = vma->vm_start;
-	unsigned long end = vma->vm_end;
-	unsigned long hugepages = (end - start) >> HPAGE_SHIFT;
-	pgoff_t next = vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT);
-	pgoff_t endpg = next + hugepages;
-
-	pagevec_init(&pvec, 0);
-	while (next < endpg) {
-		if (!pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE))
-			break;
-		for (i = 0; i < pagevec_count(&pvec); i++) {
-			struct page *page = pvec.pages[i];
-			if (page->index > next)
-				next = page->index;
-			if (page->index >= endpg)
-				break;
-			next++;
-			hugepages--;
-		}
-		huge_pagevec_release(&pvec);
-	}
-	return hugepages << HPAGE_SHIFT;
-}
-
 static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	struct inode *inode = file->f_dentry->d_inode;
-	struct address_space *mapping = inode->i_mapping;
-	unsigned long bytes;
 	loff_t len, vma_len;
 	int ret;
 
@@ -113,10 +74,6 @@ static int hugetlbfs_file_mmap(struct fi
 	if (vma->vm_end - vma->vm_start < HPAGE_SIZE)
 		return -EINVAL;
 
-	bytes = huge_pages_needed(mapping, vma);
-	if (!is_hugepage_mem_enough(bytes))
-		return -ENOMEM;
-
 	vma_len = (loff_t)(vma->vm_end - vma->vm_start);
 
 	mutex_lock(&inode->i_mutex);
@@ -129,6 +86,10 @@ static int hugetlbfs_file_mmap(struct fi
 	if (!(vma->vm_flags & VM_WRITE) && len > inode->i_size)
 		goto out;
 
+	if (vma->vm_flags & VM_MAYSHARE)
+		if (hugetlb_reserve_pages(inode, vma))
+			goto out;
+
 	ret = 0;
 	hugetlb_prefault_arch_hook(vma->vm_mm);
 	if (inode->i_size < len)
@@ -232,7 +193,7 @@ static void truncate_hugepages(struct ad
 	const pgoff_t start = lstart >> HPAGE_SHIFT;
 	struct pagevec pvec;
 	pgoff_t next;
-	int i;
+	int i, freed = 0;
 
 	pagevec_init(&pvec, 0);
 	next = start;
@@ -254,10 +215,12 @@ static void truncate_hugepages(struct ad
 			truncate_huge_page(page);
 			unlock_page(page);
 			hugetlb_put_quota(mapping);
+			freed++;
 		}
 		huge_pagevec_release(&pvec);
 	}
 	BUG_ON(!lstart && mapping->nrpages);
+	hugetlb_unreserve_pages(mapping->host, start, freed);
 }
 
 static void hugetlbfs_delete_inode(struct inode *inode)
@@ -401,6 +364,7 @@ static struct inode *hugetlbfs_get_inode
 		inode->i_mapping->a_ops = &hugetlbfs_aops;
 		inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info;
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
+		INIT_LIST_HEAD(&inode->i_mapping->private_list);
 		info = HUGETLBFS_I(inode);
 		mpol_shared_policy_init(&info->policy, MPOL_DEFAULT, NULL);
 		switch (mode & S_IFMT) {
--- ./include/linux/hugetlb.h.orig	2006-03-09 15:02:25.559821402 -0800
+++ ./include/linux/hugetlb.h	2006-03-09 16:54:55.444504341 -0800
@@ -26,6 +26,8 @@ struct page *alloc_huge_page(struct vm_a
 void free_huge_page(struct page *);
 int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, int write_access);
+int hugetlb_reserve_pages(struct inode *inode, struct vm_area_struct *vma);
+void hugetlb_unreserve_pages(struct inode *inode, pgoff_t offset, int freed);
 
 extern unsigned long max_huge_pages;
 extern const unsigned long hugetlb_zero, hugetlb_infinity;
--- ./mm/hugetlb.c.orig	2006-03-09 15:02:25.559821402 -0800
+++ ./mm/hugetlb.c	2006-03-09 17:27:10.301902514 -0800
@@ -20,7 +20,7 @@
 #include <linux/hugetlb.h>
 
 const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
-static unsigned long nr_huge_pages, free_huge_pages;
+static unsigned long nr_huge_pages, free_huge_pages, resv_huge_pages;
 unsigned long max_huge_pages;
 static struct list_head hugepage_freelists[MAX_NUMNODES];
 static unsigned int nr_huge_pages_node[MAX_NUMNODES];
@@ -98,6 +98,12 @@ struct page *alloc_huge_page(struct vm_a
 	int i;
 
 	spin_lock(&hugetlb_lock);
+	if (vma->vm_flags & VM_MAYSHARE)
+		resv_huge_pages--;
+	else if (free_huge_pages <= resv_huge_pages) {
+		spin_unlock(&hugetlb_lock);
+		return NULL;
+	}
 	page = dequeue_huge_page(vma, addr);
 	if (!page) {
 		spin_unlock(&hugetlb_lock);
@@ -199,6 +205,7 @@ static unsigned long set_max_huge_pages(
 		return nr_huge_pages;
 
 	spin_lock(&hugetlb_lock);
+	count = max(count, resv_huge_pages);
 	try_to_free_low(count);
 	while (count < nr_huge_pages) {
 		struct page *page = dequeue_huge_page(NULL, 0);
@@ -225,9 +232,11 @@ int hugetlb_report_meminfo(char *buf)
 	return sprintf(buf,
 			"HugePages_Total: %5lu\n"
 			"HugePages_Free:  %5lu\n"
+			"HugePages_Resv:  %5lu\n"
 			"Hugepagesize:    %5lu kB\n",
 			nr_huge_pages,
 			free_huge_pages,
+			resv_huge_pages,
 			HPAGE_SIZE/1024);
 }
 
@@ -572,3 +581,166 @@ int follow_hugetlb_page(struct mm_struct
 
 	return i;
 }
+
+struct file_region {
+	struct list_head link;
+	int from;
+	int to;
+};
+
+static int region_add(struct list_head *head, int f, int t)
+{
+	struct file_region *rg;
+	struct file_region *nrg;
+	struct file_region *trg;
+
+	/* Locate the region we are either in or before. */
+	list_for_each_entry(rg, head, link)
+		if (f <= rg->to)
+			break;
+
+	/* Round our left edge to the current segment if it encloses us. */
+	if (f > rg->from)
+		f = rg->from;
+
+	/* Check for and consume any regions we now overlap with. */
+	nrg = rg;
+	list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
+		if (&rg->link == head)
+			break;
+		if (rg->from > t)
+			break;
+
+		/* If this area reaches higher then extend our area to
+		 * include it completely.  If this is not the first area
+		 * which we intend to reuse, free it. */
+		if (rg->to > t)
+			t = rg->to;
+		if (rg != nrg) {
+			list_del(&rg->link);
+			kfree(rg);
+		}
+	}
+	nrg->from = f;
+	nrg->to = t;
+	return 0;
+}
+
+static int region_chg(struct list_head *head, int f, int t)
+{
+	struct file_region *rg;
+	struct file_region *nrg;
+	loff_t chg = 0;
+
+	/* Locate the region we are before or in. */
+	list_for_each_entry(rg, head, link)
+		if (f <= rg->to)
+			break;
+
+	/* If we are below the current region then a new region is required.
+	 * Subtle, allocate a new region at the position but make it zero
+	 * size such that we can guarantee to record the reservation. */
+	if (&rg->link == head || t < rg->from) {
+		nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
+		if (nrg == 0)
+			return -ENOMEM;
+		nrg->from = f;
+		nrg->to   = f;
+		INIT_LIST_HEAD(&nrg->link);
+		list_add(&nrg->link, rg->link.prev);
+
+		return t - f;
+	}
+
+	/* Round our left edge to the current segment if it encloses us. */
+	if (f > rg->from)
+		f = rg->from;
+	chg = t - f;
+
+	/* Check for and consume any regions we now overlap with. */
+	list_for_each_entry(rg, rg->link.prev, link) {
+		if (&rg->link == head)
+			break;
+		if (rg->from > t)
+			return chg;
+
+		/* We overlap with this area, if it extends further than
+		 * us then we must extend ourselves.  Account for its
+		 * existing reservation. */
+		if (rg->to > t) {
+			chg += rg->to - t;
+			t = rg->to;
+		}
+		chg -= rg->to - rg->from;
+	}
+	return chg;
+}
+
+static int region_truncate(struct list_head *head, int end)
+{
+	struct file_region *rg;
+	struct file_region *trg;
+	int chg = 0;
+
+	/* Locate the region we are either in or before. */
+	list_for_each_entry(rg, head, link)
+		if (end <= rg->to)
+			break;
+	if (&rg->link == head)
+		return 0;
+
+	/* If we are in the middle of a region then adjust it. */
+	if (end > rg->from) {
+		chg = rg->to - end;
+		rg->to = end;
+		rg = list_entry(rg->link.next, typeof(*rg), link);
+	}
+
+	/* Drop any remaining regions. */
+	list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
+		if (&rg->link == head)
+			break;
+		chg += rg->to - rg->from;
+		list_del(&rg->link);
+		kfree(rg);
+	}
+	return chg;
+}
+
+#define VMACCTPG(x) ((x) >> (HPAGE_SHIFT - PAGE_SHIFT))
+static int hugetlb_acct_memory(long delta)
+{
+	int ret = -ENOMEM;
+
+	spin_lock(&hugetlb_lock);
+	if ((delta + resv_huge_pages) <= free_huge_pages) {
+		resv_huge_pages += delta;
+		ret = 0;
+	}
+	spin_unlock(&hugetlb_lock);
+	return ret;
+}
+
+int hugetlb_reserve_pages(struct inode *inode, struct vm_area_struct *vma)
+{
+	int ret, chg;
+	int from = VMACCTPG(vma->vm_pgoff);
+	int to = VMACCTPG(vma->vm_pgoff +
+			 ((vma->vm_end - vma->vm_start) >> PAGE_SHIFT));
+
+	chg = region_chg(&inode->i_mapping->private_list, from, to);
+	if (chg < 0)
+		return chg;
+	ret = hugetlb_acct_memory(chg);
+	if (ret < 0)
+		return ret;
+	region_add(&inode->i_mapping->private_list, from, to);
+	return 0;
+}
+
+void hugetlb_unreserve_pages(struct inode *inode, pgoff_t offset, int freed)
+{
+	int chg;
+	chg  = region_truncate(&inode->i_mapping->private_list, offset);
+	hugetlb_acct_memory(freed - chg);
+}



* Re: [patch] hugetlb strict commit accounting
  2006-03-09 12:31       ` Chen, Kenneth W
@ 2006-03-09 12:54         ` 'David Gibson'
  0 siblings, 0 replies; 11+ messages in thread
From: 'David Gibson' @ 2006-03-09 12:54 UTC (permalink / raw)
  To: Chen, Kenneth W; +Cc: wli, 'Andrew Morton', linux-mm, linux-kernel

On Thu, Mar 09, 2006 at 04:31:11AM -0800, Chen, Kenneth W wrote:
> David Gibson wrote on Thursday, March 09, 2006 4:07 AM
> > > Well, the reservation is already done at mmap time for shared mappings.
> > > Why does the kernel need to do anything at fault time?  Doing it at fault
> > > time is an indication of weakness (or brokenness) - you already promised
> > > at mmap time that a page will be available for faulting.  Why check again
> > > at fault time?
> > 
> > You can't know (or bound) at mmap() time how many pages a PRIVATE
> > mapping will take (because of fork()).  So unless you have a test at
> > fault time (essentially deciding whether to draw from the "reserved" or
> > "unreserved" hugepage pool), a supposedly reserved SHARED mapping will
> > OOM later if there have been enough COW faults to use up all the
> > hugepages before it is instantiated.
> 
> I see. But that is easy to fix.  I just need to do exactly the same
> thing you did to alloc_huge_page().  I will then need to change the
> definition of 'reservation' to needs-in-the-future (also an easy thing
> to change).

Well.. except that then you *do* need to traverse the page cache on
truncate(), just like I do.  Note that in my latest revision,
hugetlb_extend_reservation() no longer walks the radix tree, only
hugetlb_truncate_reservation() does (extend *does* still take the
tree_lock, an oversight which I will send a patch for tomorrow).

(Oh, and you'll need to walk the reserved range list in
alloc_huge_page(), rather than one comparison like I have.  Although
in practice I imagine there will never be more than one entry on the
list, so I guess that doesn't really matter)
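
For illustration, a hedged sketch of what that per-fault walk might look
like - region_contains() is an invented helper name, not something in
either patch:

	/* hypothetical: is huge page index idx covered by a reserved region? */
	static int region_contains(struct list_head *head, int idx)
	{
		struct file_region *rg;

		list_for_each_entry(rg, head, link)
			if (rg->from <= idx && idx < rg->to)
				return 1;
		return 0;
	}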

> The real question I want to bring up is whether the kernel should do its
> own accounting or rely on traversing the page cache.  My opinion is that
> the kernel should do its own accounting because it is simpler: you only
> need to do it at mmap and ftruncate time.

And, as we've seen above, a little bit at fault time - which would be
exactly the same three places where my patch adds accounting.  I'm
quite willing to be convinced your patch is the better approach, but
this isn't an argument for it.


Incidentally, I've just realised that removing the dodgy heuristic and
allowing unconstrained overcommit for PRIVATE mappings (which both our
patches do) is potentially problematic.  In particular it means my
hugepage malloc() implementation will always OOM rather than fall back
to normal pages :( (I believe it currently will usually fall back, and
only OOM if you get unlucky with the timing).

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* RE: [patch] hugetlb strict commit accounting
  2006-03-09 12:06     ` 'David Gibson'
@ 2006-03-09 12:31       ` Chen, Kenneth W
  2006-03-09 12:54         ` 'David Gibson'
  0 siblings, 1 reply; 11+ messages in thread
From: Chen, Kenneth W @ 2006-03-09 12:31 UTC (permalink / raw)
  To: 'David Gibson'
  Cc: wli, 'Andrew Morton', linux-mm, linux-kernel

David Gibson wrote on Thursday, March 09, 2006 4:07 AM
> > Well, the reservation is already done at mmap time for shared mappings.
> > Why does the kernel need to do anything at fault time?  Doing it at fault
> > time is an indication of weakness (or brokenness) - you already promised
> > at mmap time that a page will be available for faulting.  Why check again
> > at fault time?
> 
> You can't know (or bound) at mmap() time how many pages a PRIVATE
> mapping will take (because of fork()).  So unless you have a test at
> fault time (essentially deciding whether to draw from the "reserved" or
> "unreserved" hugepage pool), a supposedly reserved SHARED mapping will
> OOM later if there have been enough COW faults to use up all the
> hugepages before it is instantiated.

I see. But that is easy to fix.  I just need to do exactly the same
thing you did to alloc_huge_page().  I will then need to change the
definition of 'reservation' to needs-in-the-future (also an easy thing
to change).

The real question I want to bring up is whether the kernel should do its
own accounting or rely on traversing the page cache.  My opinion is that
the kernel should do its own accounting because it is simpler: you only
need to do it at mmap and ftruncate time.

- Ken


* Re: [patch] hugetlb strict commit accounting
  2006-03-09 12:02   ` Chen, Kenneth W
@ 2006-03-09 12:14     ` 'David Gibson'
  0 siblings, 0 replies; 11+ messages in thread
From: 'David Gibson' @ 2006-03-09 12:14 UTC (permalink / raw)
  To: Chen, Kenneth W; +Cc: wli, 'Andrew Morton', linux-mm, linux-kernel

On Thu, Mar 09, 2006 at 04:02:06AM -0800, Chen, Kenneth W wrote:
> David Gibson wrote on Thursday, March 09, 2006 3:27 AM
> > Again, there are no changes to the fault handler.  Including the
> > promised changes which would mean my instantiation serialization path
> > isn't necessary ;-).
> 
> This is the major portion that I omitted in the first patch and is the
> real kicker that fulfills the promise of guaranteed hugetlb page
> availability for shared mappings.
> 
> You can shower me all over on the lock protection :-) yes, this is not
> perfect and was the reason I did not post it earlier, but I want to give
> you the concept of how I envision this route working.
> 
> Again, PRIVATE mappings are busted; you can't count them from the inode.
> You would have to count them via the mm_struct (I think).

I don't think there's any sane way to reserve for PRIVATE mappings.
To do it strictly you'd have to reaccount the whole block on every
fork(), and that would mean that any process using more than half of
the system's hugepages could never fork(), even if the child was just
going to exec().

Given that, it's simplest just to allow free overcommit for PRIVATE
mappings.  *But* you can ensure that PRIVATE allocations (i.e. COW
faults) don't mess with any previously reserved SHARED mappings.

> Note: the definition of "reservation" in the earlier patch is the total
> hugetlb pages needed for that file, including the ones that are already
> faulted in.  Maybe that threw you off a bit, because I'm guessing your
> definition is "needed in the future" and you are probably looking for a
> decrement of the counter in the fault path?

No, I realised that distinction.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [patch] hugetlb strict commit accounting
  2006-03-09 11:43   ` Chen, Kenneth W
@ 2006-03-09 12:06     ` 'David Gibson'
  2006-03-09 12:31       ` Chen, Kenneth W
  0 siblings, 1 reply; 11+ messages in thread
From: 'David Gibson' @ 2006-03-09 12:06 UTC (permalink / raw)
  To: Chen, Kenneth W; +Cc: wli, 'Andrew Morton', linux-mm, linux-kernel

On Thu, Mar 09, 2006 at 03:43:12AM -0800, Chen, Kenneth W wrote:
> David Gibson wrote on Thursday, March 09, 2006 3:27 AM
> > Um... as far as I can tell, this patch doesn't actually reserve
> > anything.  There are no changes to the fault handler or
> > alloc_huge_page(), so there's nothing to stop PRIVATE mappings dipping
> > into the supposedly reserved pool.
> 
> Well, the reservation is already done at mmap time for shared mappings.
> Why does the kernel need to do anything at fault time?  Doing it at fault
> time is an indication of weakness (or brokenness) - you already promised
> at mmap time that a page will be available for faulting.  Why check again
> at fault time?

You can't know (or bound) at mmap() time how many pages a PRIVATE
mapping will take (because of fork()).  So unless you have a test at
fault time (essentially deciding whether to draw from the "reserved" or
"unreserved" hugepage pool), a supposedly reserved SHARED mapping will
OOM later if there have been enough COW faults to use up all the
hugepages before it is instantiated.

> I don't think your implementation handles PRIVATE mappings either, does it?
> Private mappings don't enter the page cache hanging off the address_space
> pointer, so either way it is busted.

Depends what you mean by "handle".  With my patch PRIVATE mappings are
never reserved or guaranteed (I couldn't think of any set of sane
semantics for it), *but* they will never eat into the pool reserved
for SHARED mappings.  With yours, they can, so:
	p = mmap(SHARED)
	/* Lots of COW faults elsewhere */
	*p = x;
Will result in an OOM Kill on the last line.
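
Spelled out as a hypothetical user-space reproducer (mount point, file
name, and huge page size are invented; error handling omitted):

	#include <fcntl.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#define LENGTH	(2UL * 1024 * 1024)	/* assume one 2MB huge page */

	int main(void)
	{
		int fd = open("/mnt/huge/shared", O_CREAT | O_RDWR, 0600);
		char *p;

		ftruncate(fd, LENGTH);
		/* "reserved" at mmap() time under the proposed accounting */
		p = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		/* ... lots of COW faults on private hugepage mappings elsewhere
		 * drain the free pool ... */
		p[0] = 1;	/* without a fault-time check, this can OOM-kill */
		return 0;
	}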

> > This looks a bit like a case of "let's make it an atomic_t to sprinkle
> > it with magic atomicity dust" without thinking about what operations
> > are and need to be atomic.  I think resv_huge_pages should be an
> > ordinary int, but protected by a lock (exactly which lock is not
> > immediately obvious).
> 
> Yeah, I agree.  It crossed my mind whether I should fix that or post a
> fairly straightforward backport.  I decided to do the latter and got
> bitten :-(  That fix is in the pipeline if people agree that this
> variable reservation system is the better one.
> 
> 
> > What is the list of regions (mapping->private_list) protected by?
> > mmap_sem (the only thing I can think of off hand that's already taken)
> > doesn't cut it, because the mapping can be accessed by multiple mms.
> 
> I think it is the inode->i_mutex.

Ok, you should double check that's taken in all the right places, but
it sounds plausible.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* RE: [patch] hugetlb strict commit accounting
  2006-03-09 11:26 ` 'David Gibson'
  2006-03-09 11:43   ` Chen, Kenneth W
@ 2006-03-09 12:02   ` Chen, Kenneth W
  2006-03-09 12:14     ` 'David Gibson'
  1 sibling, 1 reply; 11+ messages in thread
From: Chen, Kenneth W @ 2006-03-09 12:02 UTC (permalink / raw)
  To: 'David Gibson'
  Cc: wli, 'Andrew Morton', linux-mm, linux-kernel

David Gibson wrote on Thursday, March 09, 2006 3:27 AM
> Again, there are no changes to the fault handler.  Including the
> promised changes which would mean my instantiation serialization path
> isn't necessary ;-).

This is the major portion that I omitted in the first patch and is the
real kicker that fulfills the promise of guaranteed hugetlb page
availability for shared mappings.

You can shower me all over on the lock protection :-) yes, this is not
perfect and was the reason I did not post it earlier, but I want to give
you the concept of how I envision this route working.

Again, PRIVATE mappings are busted; you can't count them from the inode.
You would have to count them via the mm_struct (I think).

- Ken

Note: the definition of "reservation" in the earlier patch is the total
hugetlb pages needed for that file, including the ones that are already
faulted in.  Maybe that threw you off a bit, because I'm guessing your
definition is "needed in the future" and you are probably looking for a
decrement of the counter in the fault path?


--- ./mm/hugetlb.c.orig	2006-03-09 04:46:38.965547435 -0800
+++ ./mm/hugetlb.c	2006-03-09 04:48:20.804413375 -0800
@@ -196,6 +196,8 @@ static unsigned long set_max_huge_pages(
 		enqueue_huge_page(page);
 		spin_unlock(&hugetlb_lock);
 	}
+	if (count < atomic_read(&resv_huge_pages))
+		count = atomic_read(&resv_huge_pages);
 	if (count >= nr_huge_pages)
 		return nr_huge_pages;
 


* RE: [patch] hugetlb strict commit accounting
  2006-03-09 11:26 ` 'David Gibson'
@ 2006-03-09 11:43   ` Chen, Kenneth W
  2006-03-09 12:06     ` 'David Gibson'
  2006-03-09 12:02   ` Chen, Kenneth W
  1 sibling, 1 reply; 11+ messages in thread
From: Chen, Kenneth W @ 2006-03-09 11:43 UTC (permalink / raw)
  To: 'David Gibson'
  Cc: wli, 'Andrew Morton', linux-mm, linux-kernel

David Gibson wrote on Thursday, March 09, 2006 3:27 AM
> Um... as far as I can tell, this patch doesn't actually reserve
> anything.  There are no changes to the fault handler or
> alloc_huge_page(), so there's nothing to stop PRIVATE mappings dipping
> into the supposedly reserved pool.

Well, the reservation is already done at mmap time for shared mappings.
Why does the kernel need to do anything at fault time?  Doing it at fault
time is an indication of weakness (or brokenness) - you already promised
at mmap time that a page will be available for faulting.  Why check again
at fault time?

I don't think your implementation handles PRIVATE mappings either, does it?
Private mappings don't enter the page cache hanging off the address_space
pointer, so either way it is busted.


> This looks a bit like a case of "let's make it an atomic_t to sprinkle
> it with magic atomicity dust" without thinking about what operations
> are and need to be atomic.  I think resv_huge_pages should be an
> ordinary int, but protected by a lock (exactly which lock is not
> immediately obvious).

Yeah, I agree.  It crossed my mind whether I should fix that or post a
fairly straightforward backport.  I decided to do the latter and got
bitten :-(  That fix is in the pipeline if people agree that this
variable reservation system is the better one.


> What is the list of regions (mapping->private_list) protected by?
> mmap_sem (the only thing I can think of off hand that's already taken)
> doesn't cut it, because the mapping can be accessed by multiple mms.

I think it is the inode->i_mutex.

- Ken


* Re: [patch] hugetlb strict commit accounting
  2006-03-09 10:55 Chen, Kenneth W
@ 2006-03-09 11:26 ` 'David Gibson'
  2006-03-09 11:43   ` Chen, Kenneth W
  2006-03-09 12:02   ` Chen, Kenneth W
  0 siblings, 2 replies; 11+ messages in thread
From: 'David Gibson' @ 2006-03-09 11:26 UTC (permalink / raw)
  To: Chen, Kenneth W; +Cc: wli, 'Andrew Morton', linux-mm, linux-kernel

On Thu, Mar 09, 2006 at 02:55:23AM -0800, Chen, Kenneth W wrote:
> Here is a competing implementation of hugetlb strict commit accounting -
> a backport of what was done about two years ago by Andy Whitcroft, Ray
> Bryant, and me.
> 
> It serves essentially the same purpose as this patch currently sitting in -mm:
> 
> http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.16-rc5/2.6.16-rc5-mm3/broken-out/hugepage-strict-page-reservation-for-hugepage-inodes.patch
> 
> Except it is BETTER and more robust :-)  because:
> 
> (1) it handles arbitrary variable lengths and arbitrary variable offsets
> (2) it doesn't need to perform a linear traversal of the page cache
> 
> It is more flexible in that it will handle any arbitrary mmap offset,
> versus the one in -mm that always reserves the entire hugetlb file size.
> I've heard numerous complaints from application developers that hugetlb
> is difficult to use in its current state.  Having another peculiar
> behavior like "reservation only works if the mmap offset is zero" adds
> another horrendous factor.

Um... as far as I can tell, this patch doesn't actually reserve
anything.  There are no changes to the fault handler or
alloc_huge_page(), so there's nothing to stop PRIVATE mappings dipping
into the supposedly reserved pool.

More comments below.

>  fs/hugetlbfs/inode.c    |  188 ++++++++++++++++++++++++++++++++++++++----------
>  include/linux/hugetlb.h |    1 
>  mm/hugetlb.c            |    3 
>  3 files changed, 156 insertions(+), 36 deletions(-)
> 
> --- ./fs/hugetlbfs/inode.c.orig	2006-03-09 02:29:28.166820138 -0800
> +++ ./fs/hugetlbfs/inode.c	2006-03-09 03:20:29.311313889 -0800
> @@ -56,48 +56,160 @@ static void huge_pagevec_release(struct 
>  	pagevec_reinit(pvec);
>  }
>  
> -/*
> - * huge_pages_needed tries to determine the number of new huge pages that
> - * will be required to fully populate this VMA.  This will be equal to
> - * the size of the VMA in huge pages minus the number of huge pages
> - * (covered by this VMA) that are found in the page cache.
> - *
> - * Result is in bytes to be compatible with is_hugepage_mem_enough()
> - */
> -static unsigned long
> -huge_pages_needed(struct address_space *mapping, struct vm_area_struct *vma)
> +struct file_region {
> +	struct list_head link;
> +	int from;
> +	int to;
> +};
> +
> +static int region_add(struct list_head *head, int f, int t)
>  {
> -	int i;
> -	struct pagevec pvec;
> -	unsigned long start = vma->vm_start;
> -	unsigned long end = vma->vm_end;
> -	unsigned long hugepages = (end - start) >> HPAGE_SHIFT;
> -	pgoff_t next = vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT);
> -	pgoff_t endpg = next + hugepages;
> +	struct file_region *rg;
> +	struct file_region *nrg;
> +	struct file_region *trg;
> +
> +	/* Locate the region we are either in or before. */
> +	list_for_each_entry(rg, head, link)
> +		if (f <= rg->to)
> +			break;
>  
> -	pagevec_init(&pvec, 0);
> -	while (next < endpg) {
> -		if (!pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE))
> +	/* Round our left edge to the current segment if it encloses us. */
> +	if (f > rg->from)
> +		f = rg->from;
> +
> +	/* Check for and consume any regions we now overlap with. */
> +	nrg = rg;
> +	list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
> +		if (&rg->link == head)
>  			break;
> -		for (i = 0; i < pagevec_count(&pvec); i++) {
> -			struct page *page = pvec.pages[i];
> -			if (page->index > next)
> -				next = page->index;
> -			if (page->index >= endpg)
> -				break;
> -			next++;
> -			hugepages--;
> +		if (rg->from > t)
> +			break;
> +
> +		/* If this area reaches higher then extend our area to
> +		 * include it completely.  If this is not the first area
> +		 * which we intend to reuse, free it. */
> +		if (rg->to > t)
> +			t = rg->to;
> +		if (rg != nrg) {
> +			list_del(&rg->link);
> +			kfree(rg);
>  		}
> -		huge_pagevec_release(&pvec);
>  	}
> -	return hugepages << HPAGE_SHIFT;
> +	nrg->from = f;
> +	nrg->to = t;
> +	return 0;
> +}
> +
> +static int region_chg(struct list_head *head, int f, int t)
> +{
> +	struct file_region *rg;
> +	struct file_region *nrg;
> +	loff_t chg = 0;
> +
> +	/* Locate the region we are before or in. */
> +	list_for_each_entry(rg, head, link)
> +		if (f <= rg->to)
> +			break;
> +
> +	/* If we are below the current region then a new region is required.
> +	 * Subtle, allocate a new region at the position but make it zero
> +	 * size such that we can guarantee to record the reservation. */
> +	if (&rg->link == head || t < rg->from) {
> +		nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
> +		if (nrg == 0)
> +			return -ENOMEM;
> +		nrg->from = f;
> +		nrg->to   = f;
> +		INIT_LIST_HEAD(&nrg->link);
> +		list_add(&nrg->link, rg->link.prev);
> +
> +		return t - f;
> +	}
> +
> +	/* Round our left edge to the current segment if it encloses us. */
> +	if (f > rg->from)
> +		f = rg->from;
> +	chg = t - f;
> +
> +	/* Check for and consume any regions we now overlap with. */

Looks like we have this test code duplicated from region_add().  Any
way to avoid that?

> +	list_for_each_entry(rg, rg->link.prev, link) {
> +		if (&rg->link == head)
> +			break;
> +		if (rg->from > t)
> +			return chg;
> +
> +		/* We overlap with this area, if it extends further than
> +		 * us then we must extend ourselves.  Account for its
> +		 * existing reservation. */
> +		if (rg->to > t) {
> +			chg += rg->to - t;
> +			t = rg->to;
> +		}
> +		chg -= rg->to - rg->from;
> +	}
> +	return chg;
> +}
> +
> +static int region_truncate(struct list_head *head, int end)
> +{
> +	struct file_region *rg;
> +	struct file_region *trg;
> +	int chg = 0;
> +
> +	/* Locate the region we are either in or before. */
> +	list_for_each_entry(rg, head, link)
> +		if (end <= rg->to)
> +			break;
> +	if (&rg->link == head)
> +		return 0;
> +
> +	/* If we are in the middle of a region then adjust it. */
> +	if (end > rg->from) {
> +		chg = rg->to - end;
> +		rg->to = end;
> +		rg = list_entry(rg->link.next, typeof(*rg), link);
> +	}
> +
> +	/* Drop any remaining regions. */
> +	list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
> +		if (&rg->link == head)
> +			break;
> +		chg += rg->to - rg->from;
> +		list_del(&rg->link);
> +		kfree(rg);
> +	}
> +	return chg;
> +}
> +
> +#define VMACCTPG(x) ((x) >> (HPAGE_SHIFT - PAGE_SHIFT))
> +int hugetlb_acct_memory(long delta)
> +{
> +	atomic_add(delta, &resv_huge_pages);
> +	if (delta > 0 && atomic_read(&resv_huge_pages) >
> +			VMACCTPG(hugetlb_total_pages())) {
> +		atomic_add(-delta, &resv_huge_pages);
> +		return -ENOMEM;
> +	}
> +	return 0;
> +}

This looks a bit like a case of "let's make it an atomic_t to sprinkle
it with magic atomicity dust" without thinking about what operations
are and need to be atomic.  I think resv_huge_pages should be an
ordinary int, but protected by a lock (exactly which lock is not
immediately obvious).
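
To make the race concrete, a hedged sketch of one interleaving the atomic
version permits (two CPUs, counts purely hypothetical):

	/*
	 * CPU0: hugetlb_acct_memory(+N)      CPU1: hugetlb_acct_memory(+M)
	 *   atomic_add(N, &resv_huge_pages)
	 *                                      atomic_add(M, &resv_huge_pages)
	 *   atomic_read() sees N + M > total    atomic_read() sees N + M > total
	 *   atomic_add(-N), return -ENOMEM      atomic_add(-M), return -ENOMEM
	 *
	 * Both reservations fail even though either one alone would have fit:
	 * the add, the check, and the undo are not one atomic operation.
	 */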

> +static int hugetlb_reserve_pages(struct inode *inode, int from, int to)
> +{
> +	int ret, chg;
> +
> +	chg = region_chg(&inode->i_mapping->private_list, from, to);

What is the list of regions (mapping->private_list) protected by?
mmap_sem (the only thing I can think of off hand that's already taken)
doesn't cut it, because the mapping can be accessed by multiple mms.

> +	if (chg < 0)
> +		return chg;
> +	ret = hugetlb_acct_memory(chg);
> +	if (ret < 0)
> +		return ret;
> +	region_add(&inode->i_mapping->private_list, from, to);
> +	return 0;
>  }
>  
>  static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
>  {
>  	struct inode *inode = file->f_dentry->d_inode;
> -	struct address_space *mapping = inode->i_mapping;
> -	unsigned long bytes;
>  	loff_t len, vma_len;
>  	int ret;
>  
> @@ -113,10 +225,6 @@ static int hugetlbfs_file_mmap(struct fi
>  	if (vma->vm_end - vma->vm_start < HPAGE_SIZE)
>  		return -EINVAL;
>  
> -	bytes = huge_pages_needed(mapping, vma);
> -	if (!is_hugepage_mem_enough(bytes))
> -		return -ENOMEM;
> -
>  	vma_len = (loff_t)(vma->vm_end - vma->vm_start);
>  
>  	mutex_lock(&inode->i_mutex);
> @@ -129,6 +237,11 @@ static int hugetlbfs_file_mmap(struct fi
>  	if (!(vma->vm_flags & VM_WRITE) && len > inode->i_size)
>  		goto out;
>  
> +	if (vma->vm_flags & VM_MAYSHARE)
> +		if (hugetlb_reserve_pages(inode, VMACCTPG(vma->vm_pgoff),
> +			VMACCTPG(vma->vm_pgoff + (vma_len >> PAGE_SHIFT))))
> +			goto out;
> +
>  	ret = 0;
>  	hugetlb_prefault_arch_hook(vma->vm_mm);
>  	if (inode->i_size < len)
> @@ -258,6 +371,8 @@ static void truncate_hugepages(struct ad
>  		huge_pagevec_release(&pvec);
>  	}
>  	BUG_ON(!lstart && mapping->nrpages);
> +	i = region_truncate(&mapping->private_list, start);
> +	hugetlb_acct_memory(-i);
>  }
>  
>  static void hugetlbfs_delete_inode(struct inode *inode)
> @@ -401,6 +516,7 @@ static struct inode *hugetlbfs_get_inode
>  		inode->i_mapping->a_ops = &hugetlbfs_aops;
>  		inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info;
>  		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
> +		INIT_LIST_HEAD(&inode->i_mapping->private_list);
>  		info = HUGETLBFS_I(inode);
>  		mpol_shared_policy_init(&info->policy, MPOL_DEFAULT, NULL);
>  		switch (mode & S_IFMT) {
> --- ./include/linux/hugetlb.h.orig	2006-03-09 02:29:28.943187316 -0800
> +++ ./include/linux/hugetlb.h	2006-03-09 03:11:45.820109364 -0800
> @@ -30,6 +30,7 @@ int hugetlb_fault(struct mm_struct *mm, 
>  extern unsigned long max_huge_pages;
>  extern const unsigned long hugetlb_zero, hugetlb_infinity;
>  extern int sysctl_hugetlb_shm_group;
> +extern atomic_t resv_huge_pages;
>  
>  /* arch callbacks */
>  
> --- ./mm/hugetlb.c.orig	2006-03-09 02:29:29.314281061 -0800
> +++ ./mm/hugetlb.c	2006-03-09 03:24:34.546662447 -0800
> @@ -21,6 +21,7 @@
>  
>  const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
>  static unsigned long nr_huge_pages, free_huge_pages;
> +atomic_t resv_huge_pages;
>  unsigned long max_huge_pages;
>  static struct list_head hugepage_freelists[MAX_NUMNODES];
>  static unsigned int nr_huge_pages_node[MAX_NUMNODES];
> @@ -225,9 +226,11 @@ int hugetlb_report_meminfo(char *buf)
>  	return sprintf(buf,
>  			"HugePages_Total: %5lu\n"
>  			"HugePages_Free:  %5lu\n"
> +			"HugePages_Resv:  %5u\n"
>  			"Hugepagesize:    %5lu kB\n",
>  			nr_huge_pages,
>  			free_huge_pages,
> +			atomic_read(&resv_huge_pages),
>  			HPAGE_SIZE/1024);
>  }

Again, there are no changes to the fault handler.  Including the
promised changes which would mean my instantiation serialization path
isn't necessary ;-).

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* [patch] hugetlb strict commit accounting
@ 2006-03-09 10:55 Chen, Kenneth W
  2006-03-09 11:26 ` 'David Gibson'
  0 siblings, 1 reply; 11+ messages in thread
From: Chen, Kenneth W @ 2006-03-09 10:55 UTC (permalink / raw)
  To: 'David Gibson', wli, 'Andrew Morton'
  Cc: linux-mm, linux-kernel

Here is a competing implementation of hugetlb strict commit accounting -
a backport of what was done about two years ago by Andy Whitcroft, Ray
Bryant, and me.

It serves essentially the same purpose as this patch currently sitting in -mm:

http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.16-rc5/2.6.16-rc5-mm3/broken-out/hugepage-strict-page-reservation-for-hugepage-inodes.patch

Except it is BETTER and more robust :-)  because:

(1) it handles arbitrary variable lengths and arbitrary variable offsets
(2) it doesn't need to perform a linear traversal of the page cache

It is more flexible in that it will handle any arbitrary mmap offset,
versus the one in -mm that always reserves the entire hugetlb file size.
I've heard numerous complaints from application developers that hugetlb
is difficult to use in its current state.  Having another peculiar
behavior like "reservation only works if the mmap offset is zero" adds
another horrendous factor.
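
To make the arbitrary-offset point concrete, a hypothetical usage sketch
(path invented, HPAGE_SIZE standing in for the system huge page size);
only the mapped [2, 3) huge-page range needs to be reserved, not the
whole file:

	/* hypothetical: map only the third huge page of a larger file */
	int fd = open("/mnt/huge/data", O_RDWR);
	char *p = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
		       MAP_SHARED, fd, 2 * HPAGE_SIZE);	/* nonzero offset */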


Posted here for -mm consideration.

Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>


 fs/hugetlbfs/inode.c    |  188 ++++++++++++++++++++++++++++++++++++++----------
 include/linux/hugetlb.h |    1 
 mm/hugetlb.c            |    3 
 3 files changed, 156 insertions(+), 36 deletions(-)

--- ./fs/hugetlbfs/inode.c.orig	2006-03-09 02:29:28.166820138 -0800
+++ ./fs/hugetlbfs/inode.c	2006-03-09 03:20:29.311313889 -0800
@@ -56,48 +56,160 @@ static void huge_pagevec_release(struct 
 	pagevec_reinit(pvec);
 }
 
-/*
- * huge_pages_needed tries to determine the number of new huge pages that
- * will be required to fully populate this VMA.  This will be equal to
- * the size of the VMA in huge pages minus the number of huge pages
- * (covered by this VMA) that are found in the page cache.
- *
- * Result is in bytes to be compatible with is_hugepage_mem_enough()
- */
-static unsigned long
-huge_pages_needed(struct address_space *mapping, struct vm_area_struct *vma)
+struct file_region {
+	struct list_head link;
+	int from;
+	int to;
+};
+
+static int region_add(struct list_head *head, int f, int t)
 {
-	int i;
-	struct pagevec pvec;
-	unsigned long start = vma->vm_start;
-	unsigned long end = vma->vm_end;
-	unsigned long hugepages = (end - start) >> HPAGE_SHIFT;
-	pgoff_t next = vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT);
-	pgoff_t endpg = next + hugepages;
+	struct file_region *rg;
+	struct file_region *nrg;
+	struct file_region *trg;
+
+	/* Locate the region we are either in or before. */
+	list_for_each_entry(rg, head, link)
+		if (f <= rg->to)
+			break;
 
-	pagevec_init(&pvec, 0);
-	while (next < endpg) {
-		if (!pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE))
+	/* Round our left edge to the current segment if it encloses us. */
+	if (f > rg->from)
+		f = rg->from;
+
+	/* Check for and consume any regions we now overlap with. */
+	nrg = rg;
+	list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
+		if (&rg->link == head)
 			break;
-		for (i = 0; i < pagevec_count(&pvec); i++) {
-			struct page *page = pvec.pages[i];
-			if (page->index > next)
-				next = page->index;
-			if (page->index >= endpg)
-				break;
-			next++;
-			hugepages--;
+		if (rg->from > t)
+			break;
+
+		/* If this area reaches higher then extend our area to
+		 * include it completely.  If this is not the first area
+		 * which we intend to reuse, free it. */
+		if (rg->to > t)
+			t = rg->to;
+		if (rg != nrg) {
+			list_del(&rg->link);
+			kfree(rg);
 		}
-		huge_pagevec_release(&pvec);
 	}
-	return hugepages << HPAGE_SHIFT;
+	nrg->from = f;
+	nrg->to = t;
+	return 0;
+}
+
+static int region_chg(struct list_head *head, int f, int t)
+{
+	struct file_region *rg;
+	struct file_region *nrg;
+	loff_t chg = 0;
+
+	/* Locate the region we are before or in. */
+	list_for_each_entry(rg, head, link)
+		if (f <= rg->to)
+			break;
+
+	/* If we are below the current region then a new region is required.
+	 * Subtle, allocate a new region at the position but make it zero
+	 * size such that we can guarantee to record the reservation. */
+	if (&rg->link == head || t < rg->from) {
+		nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
+		if (nrg == 0)
+			return -ENOMEM;
+		nrg->from = f;
+		nrg->to   = f;
+		INIT_LIST_HEAD(&nrg->link);
+		list_add(&nrg->link, rg->link.prev);
+
+		return t - f;
+	}
+
+	/* Round our left edge to the current segment if it encloses us. */
+	if (f > rg->from)
+		f = rg->from;
+	chg = t - f;
+
+	/* Check for and consume any regions we now overlap with. */
+	list_for_each_entry(rg, rg->link.prev, link) {
+		if (&rg->link == head)
+			break;
+		if (rg->from > t)
+			return chg;
+
+		/* We overlap with this area, if it extends further than
+		 * us then we must extend ourselves.  Account for its
+		 * existing reservation. */
+		if (rg->to > t) {
+			chg += rg->to - t;
+			t = rg->to;
+		}
+		chg -= rg->to - rg->from;
+	}
+	return chg;
+}
+
+static int region_truncate(struct list_head *head, int end)
+{
+	struct file_region *rg;
+	struct file_region *trg;
+	int chg = 0;
+
+	/* Locate the region we are either in or before. */
+	list_for_each_entry(rg, head, link)
+		if (end <= rg->to)
+			break;
+	if (&rg->link == head)
+		return 0;
+
+	/* If we are in the middle of a region then adjust it. */
+	if (end > rg->from) {
+		chg = rg->to - end;
+		rg->to = end;
+		rg = list_entry(rg->link.next, typeof(*rg), link);
+	}
+
+	/* Drop any remaining regions. */
+	list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
+		if (&rg->link == head)
+			break;
+		chg += rg->to - rg->from;
+		list_del(&rg->link);
+		kfree(rg);
+	}
+	return chg;
+}
+
+#define VMACCTPG(x) ((x) >> (HPAGE_SHIFT - PAGE_SHIFT))
+int hugetlb_acct_memory(long delta)
+{
+	atomic_add(delta, &resv_huge_pages);
+	if (delta > 0 && atomic_read(&resv_huge_pages) >
+			VMACCTPG(hugetlb_total_pages())) {
+		atomic_add(-delta, &resv_huge_pages);
+		return -ENOMEM;
+	}
+	return 0;
+}
+
+static int hugetlb_reserve_pages(struct inode *inode, int from, int to)
+{
+	int ret, chg;
+
+	chg = region_chg(&inode->i_mapping->private_list, from, to);
+	if (chg < 0)
+		return chg;
+	ret = hugetlb_acct_memory(chg);
+	if (ret < 0)
+		return ret;
+	region_add(&inode->i_mapping->private_list, from, to);
+	return 0;
 }
 
 static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	struct inode *inode = file->f_dentry->d_inode;
-	struct address_space *mapping = inode->i_mapping;
-	unsigned long bytes;
 	loff_t len, vma_len;
 	int ret;
 
@@ -113,10 +225,6 @@ static int hugetlbfs_file_mmap(struct fi
 	if (vma->vm_end - vma->vm_start < HPAGE_SIZE)
 		return -EINVAL;
 
-	bytes = huge_pages_needed(mapping, vma);
-	if (!is_hugepage_mem_enough(bytes))
-		return -ENOMEM;
-
 	vma_len = (loff_t)(vma->vm_end - vma->vm_start);
 
 	mutex_lock(&inode->i_mutex);
@@ -129,6 +237,11 @@ static int hugetlbfs_file_mmap(struct fi
 	if (!(vma->vm_flags & VM_WRITE) && len > inode->i_size)
 		goto out;
 
+	if (vma->vm_flags & VM_MAYSHARE)
+		if (hugetlb_reserve_pages(inode, VMACCTPG(vma->vm_pgoff),
+			VMACCTPG(vma->vm_pgoff + (vma_len >> PAGE_SHIFT))))
+			goto out;
+
 	ret = 0;
 	hugetlb_prefault_arch_hook(vma->vm_mm);
 	if (inode->i_size < len)
@@ -258,6 +371,8 @@ static void truncate_hugepages(struct ad
 		huge_pagevec_release(&pvec);
 	}
 	BUG_ON(!lstart && mapping->nrpages);
+	i = region_truncate(&mapping->private_list, start);
+	hugetlb_acct_memory(-i);
 }
 
 static void hugetlbfs_delete_inode(struct inode *inode)
@@ -401,6 +516,7 @@ static struct inode *hugetlbfs_get_inode
 		inode->i_mapping->a_ops = &hugetlbfs_aops;
 		inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info;
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
+		INIT_LIST_HEAD(&inode->i_mapping->private_list);
 		info = HUGETLBFS_I(inode);
 		mpol_shared_policy_init(&info->policy, MPOL_DEFAULT, NULL);
 		switch (mode & S_IFMT) {
--- ./include/linux/hugetlb.h.orig	2006-03-09 02:29:28.943187316 -0800
+++ ./include/linux/hugetlb.h	2006-03-09 03:11:45.820109364 -0800
@@ -30,6 +30,7 @@ int hugetlb_fault(struct mm_struct *mm, 
 extern unsigned long max_huge_pages;
 extern const unsigned long hugetlb_zero, hugetlb_infinity;
 extern int sysctl_hugetlb_shm_group;
+extern atomic_t resv_huge_pages;
 
 /* arch callbacks */
 
--- ./mm/hugetlb.c.orig	2006-03-09 02:29:29.314281061 -0800
+++ ./mm/hugetlb.c	2006-03-09 03:24:34.546662447 -0800
@@ -21,6 +21,7 @@
 
 const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
 static unsigned long nr_huge_pages, free_huge_pages;
+atomic_t resv_huge_pages;
 unsigned long max_huge_pages;
 static struct list_head hugepage_freelists[MAX_NUMNODES];
 static unsigned int nr_huge_pages_node[MAX_NUMNODES];
@@ -225,9 +226,11 @@ int hugetlb_report_meminfo(char *buf)
 	return sprintf(buf,
 			"HugePages_Total: %5lu\n"
 			"HugePages_Free:  %5lu\n"
+			"HugePages_Resv:  %5u\n"
 			"Hugepagesize:    %5lu kB\n",
 			nr_huge_pages,
 			free_huge_pages,
+			atomic_read(&resv_huge_pages),
 			HPAGE_SIZE/1024);
 }
 



Thread overview: 11+ messages
2006-03-09 12:14 [patch] hugetlb strict commit accounting Chen, Kenneth W
  -- strict thread matches above, loose matches on Subject: below --
2006-03-10  0:45 Chen, Kenneth W
2006-03-10  2:38 ` 'David Gibson'
2006-03-09 10:55 Chen, Kenneth W
2006-03-09 11:26 ` 'David Gibson'
2006-03-09 11:43   ` Chen, Kenneth W
2006-03-09 12:06     ` 'David Gibson'
2006-03-09 12:31       ` Chen, Kenneth W
2006-03-09 12:54         ` 'David Gibson'
2006-03-09 12:02   ` Chen, Kenneth W
2006-03-09 12:14     ` 'David Gibson'
