* [PATCH] VM: add vm.free_node_memory sysctl
[not found] ` <20050802210746.GA26494@elte.hu>
@ 2005-08-03 13:56 ` Martin Hicks
2005-08-03 14:15 ` Andi Kleen
0 siblings, 1 reply; 10+ messages in thread
From: Martin Hicks @ 2005-08-03 13:56 UTC (permalink / raw)
To: Ingo Molnar, Linux MM
Cc: Martin Hicks, Andrew Morton, torvalds, linux-kernel, ak
On Tue, Aug 02, 2005 at 11:07:46PM +0200, Ingo Molnar wrote:
>
> * Martin Hicks <mort@sgi.com> wrote:
>
> > On Mon, Aug 01, 2005 at 09:54:26PM +0200, Ingo Molnar wrote:
> > >
> > > * Andrew Morton <akpm@osdl.org> wrote:
> > >
> > > > > We could perhaps add a CAP_SYS_ADMIN-only sysctl for this hack,
> > > >
> > > > That would be more appropriate.
> > > >
> > > > (I'm still not sure what happened to the idea of adding a call to
> > > > "clear out this node+zone's pagecache now" rather than "set this
> > > > node+zone's policy")
> > >
> > > let's do that as a sysctl hack. It would be useful for debugging purposes
> > > anyway. But i'm not sure whether it's the same issue - Martin?
> >
> > (Sorry..I was on vacation yesterday)
> >
> > Yes, this is the same issue with a different way of making it happen.
> > Setting a zone's policy allows reclaim to happen automatically.
> >
> > I'll send in a patch to add a sysctl to do the manual dumping of
> > pagecache really soon.
>
> cool! [ Incidentally, when i found this problem i was looking for
> existing bits in the kernel to write such a patch myself (which i wanted
> to use on non-NUMA to create more reproducible workloads for
> performance-testing) - now i'll wait for your patch. ]
Here's the promised sysctl to dump a node's pagecache. Please review!
This patch depends on the zone reclaim atomic ops cleanup:
http://marc.theaimsgroup.com/?l=linux-mm&m=112307646306476&w=2
I split up zone_reclaim():
- __zone_reclaim() does the Real Work
- zone_reclaim() checks the rate-limiting stuff.
For the sysctl we don't want to be rate limited. We always want to scan
the LRU lists looking for tossable pages.
Thanks,
mh
--
Martin Hicks || Silicon Graphics Inc. || mort@sgi.com
This patch adds the vm.free_node_memory sysctl. This allows a root user
to ask the kernel to drop as many pages as possible out of the specified
node's pagecache.
Takes a single integer node ID, e.g.,
echo 14 > /proc/sys/vm/free_node_memory
will clear the pagecache on node 14.
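For example, a batch scheduler could clear every node a job is about to
use before launching it. The following is only a sketch (node numbers and
the numactl invocation are made up for illustration); it assumes this
patch is applied and the sysctl keeps the name vm.free_node_memory:

  #!/bin/sh
  # Drop the pagecache on each node the job will run on (run as root).
  NODES="12 13 14 15"                  # hypothetical node list for the job
  for n in $NODES; do
          echo $n > /proc/sys/vm/free_node_memory
  done
  # Then start the job with its memory bound to those nodes, e.g.:
  # numactl --membind=12,13,14,15 ./hpc_app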
Signed-off-by: Martin Hicks <mort@sgi.com>
---
commit 9b0a83e09e4fea07cf877dc7f6ff8b38c0f48d61
tree 58d5467efa7f3bf103203e25c95c6f0936ed653f
parent 414acb15f0f237cbf560bfa56c74ca9d19c5cd5a
author Martin Hicks <mort@tomahawk.engr.sgi.com> Wed, 03 Aug 2005 06:53:33 -0700
committer Martin Hicks <mort@tomahawk.engr.sgi.com> Wed, 03 Aug 2005 06:53:33 -0700
include/linux/mmzone.h | 3 ++
include/linux/sysctl.h | 1 +
kernel/sysctl.c | 10 +++++++
mm/vmscan.c | 66 +++++++++++++++++++++++++++++++++++++++---------
4 files changed, 68 insertions(+), 12 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -403,6 +403,9 @@ int min_free_kbytes_sysctl_handler(struc
extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1];
int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *, int, struct file *,
void __user *, size_t *, loff_t *);
+extern int sysctl_free_node_memory;
+int free_node_memory_sysctl_handler(struct ctl_table *, int, struct file *,
+ void __user *, size_t *, loff_t *);
#include <linux/topology.h>
/* Returns the number of the current Node. */
diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -180,6 +180,7 @@ enum
VM_VFS_CACHE_PRESSURE=26, /* dcache/icache reclaim pressure */
VM_LEGACY_VA_LAYOUT=27, /* legacy/compatibility virtual address space layout */
VM_SWAP_TOKEN_TIMEOUT=28, /* default time for token time out */
+ VM_FREE_NODE_MEMORY=29, /* free page cache from specified node */
};
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -851,6 +851,16 @@ static ctl_table vm_table[] = {
.strategy = &sysctl_jiffies,
},
#endif
+ {
+ .ctl_name = VM_FREE_NODE_MEMORY,
+ .procname = "free_node_memory",
+ .data = &sysctl_free_node_memory,
+ .maxlen = sizeof(sysctl_free_node_memory),
+ .mode = 0644,
+ .proc_handler = &free_node_memory_sysctl_handler,
+ .strategy = &sysctl_intvec,
+ .extra1 = &zero,
+ },
{ .ctl_name = 0 }
};
diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -33,6 +33,7 @@
#include <linux/cpuset.h>
#include <linux/notifier.h>
#include <linux/rwsem.h>
+#include <linux/sysctl.h>
#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -1329,21 +1330,17 @@ static int __init kswapd_init(void)
module_init(kswapd_init)
-
/*
- * Try to free up some pages from this zone through reclaim.
+ * Try to free up pages from the zone through reclaim.
*/
-int zone_reclaim(struct zone *zone, unsigned int gfp_mask, unsigned int order)
+int __zone_reclaim(struct zone *zone, unsigned int gfp_mask, unsigned int order)
{
struct scan_control sc;
int nr_pages = 1 << order;
- int total_reclaimed = 0;
/* The reclaim may sleep, so don't do it if sleep isn't allowed */
if (!(gfp_mask & __GFP_WAIT))
return 0;
- if (zone->all_unreclaimable)
- return 0;
sc.gfp_mask = gfp_mask;
sc.may_writepage = 0;
@@ -1359,15 +1356,22 @@ int zone_reclaim(struct zone *zone, unsi
else
sc.swap_cluster_max = SWAP_CLUSTER_MAX;
+ shrink_zone(zone, &sc);
+ return sc.nr_reclaimed;
+}
+
+/*
+ * Checks to make sure that reclaim isn't active on the zone already
+ */
+int zone_reclaim(struct zone *zone, unsigned int gfp_mask, unsigned int order)
+{
+ if (zone->all_unreclaimable)
+ return 0;
/* Don't reclaim the zone if there are other reclaimers active */
if (atomic_read(&zone->reclaim_in_progress) > 0)
- goto out;
-
- shrink_zone(zone, &sc);
- total_reclaimed = sc.nr_reclaimed;
+ return 0;
- out:
- return total_reclaimed;
+ return __zone_reclaim(zone, gfp_mask, order);
}
asmlinkage long sys_set_zone_reclaim(unsigned int node, unsigned int zone,
@@ -1393,6 +1397,44 @@ asmlinkage long sys_set_zone_reclaim(uns
z->reclaim_pages = 1;
else
z->reclaim_pages = 0;
+ }
+
+ return 0;
+}
+
+int sysctl_free_node_memory;
+static DECLARE_MUTEX(free_node_memory_lock);
+
+int free_node_memory_sysctl_handler(ctl_table *table, int write,
+ struct file *file, void __user *buffer,
+ size_t *length, loff_t *ppos)
+{
+ struct zone *z;
+ int node;
+ int gfp_mask = __GFP_WAIT;
+ int i;
+
+ if (!write)
+ return 0;
+
+ down_interruptible(&free_node_memory_lock);
+ proc_dointvec(table, write, file, buffer, length, ppos);
+
+ node = sysctl_free_node_memory;
+ up(&free_node_memory_lock);
+
+ if (node >= MAX_NUMNODES || !node_online(node))
+ return -EINVAL;
+
+ for (i = 0; i < MAX_NR_ZONES; i++) {
+ z = &NODE_DATA(node)->node_zones[i];
+
+ if (!z->present_pages)
+ continue;
+
+ /* Reclaim pages from the zone */
+ while (__zone_reclaim(z, gfp_mask, SWAP_CLUSTER_MAX) != 0)
+ ;
}
return 0;
* Re: [PATCH] VM: add vm.free_node_memory sysctl
2005-08-03 13:56 ` [PATCH] VM: add vm.free_node_memory sysctl Martin Hicks
@ 2005-08-03 14:15 ` Andi Kleen
2005-08-03 14:24 ` Martin Hicks
0 siblings, 1 reply; 10+ messages in thread
From: Andi Kleen @ 2005-08-03 14:15 UTC (permalink / raw)
To: Martin Hicks
Cc: Ingo Molnar, Linux MM, Andrew Morton, torvalds, linux-kernel, ak
On Wed, Aug 03, 2005 at 09:56:46AM -0400, Martin Hicks wrote:
>
> On Tue, Aug 02, 2005 at 11:07:46PM +0200, Ingo Molnar wrote:
> >
> > * Martin Hicks <mort@sgi.com> wrote:
> >
> > > On Mon, Aug 01, 2005 at 09:54:26PM +0200, Ingo Molnar wrote:
> > > >
> > > > * Andrew Morton <akpm@osdl.org> wrote:
> > > >
> > > > > > We could perhaps add a CAP_SYS_ADMIN-only sysctl for this hack,
> > > > >
> > > > > That would be more appropriate.
> > > > >
> > > > > (I'm still not sure what happened to the idea of adding a call to
> > > > > "clear out this node+zone's pagecache now" rather than "set this
> > > > > node+zone's policy")
> > > >
> > > > let's do that as a sysctl hack. It would be useful for debugging purposes
> > > > anyway. But i'm not sure whether it's the same issue - Martin?
> > >
> > > (Sorry..I was on vacation yesterday)
> > >
> > > Yes, this is the same issue with a different way of making it happen.
> > > Setting a zone's policy allows reclaim to happen automatically.
> > >
> > > I'll send in a patch to add a sysctl to do the manual dumping of
> > > pagecache really soon.
> >
> > cool! [ Incidentally, when i found this problem i was looking for
> > existing bits in the kernel to write such a patch myself (which i wanted
> > > to use on non-NUMA to create more reproducible workloads for
> > performance-testing) - now i'll wait for your patch. ]
>
> Here's the promised sysctl to dump a node's pagecache. Please review!
>
> This patch depends on the zone reclaim atomic ops cleanup:
> http://marc.theaimsgroup.com/?l=linux-mm&m=112307646306476&w=2
Doesn't numactl --bind=node memhog nodesize-someslack do the same?
It just might kick in the oom killer if someslack is too small
or someone has unfreeable data there. But then there should already
be a sysctl to turn that one off.
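Spelled out, that workaround would look something like this (a rough
sketch only: the slack value is a guess, and it assumes the memhog tool
shipped with numactl and the per-node meminfo file in sysfs):

  # Reclaim node 3's pagecache by allocating nearly all of its memory
  # with a bound policy, touching it, and exiting.
  NODE=3
  KB=$(awk '/MemTotal/ {print $4}' /sys/devices/system/node/node$NODE/meminfo)
  SLACK_KB=262144                      # leave ~256MB of slack; pure guesswork
  numactl --membind=$NODE memhog $(( (KB - SLACK_KB) / 1024 ))m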
-Andi
* Re: [PATCH] VM: add vm.free_node_memory sysctl
2005-08-03 14:15 ` Andi Kleen
@ 2005-08-03 14:24 ` Martin Hicks
2005-08-03 14:38 ` Andi Kleen
0 siblings, 1 reply; 10+ messages in thread
From: Martin Hicks @ 2005-08-03 14:24 UTC (permalink / raw)
To: Andi Kleen; +Cc: Ingo Molnar, Linux MM, Andrew Morton, torvalds, linux-kernel
On Wed, Aug 03, 2005 at 04:15:29PM +0200, Andi Kleen wrote:
> On Wed, Aug 03, 2005 at 09:56:46AM -0400, Martin Hicks wrote:
> >
> > Here's the promised sysctl to dump a node's pagecache. Please review!
> >
> > This patch depends on the zone reclaim atomic ops cleanup:
> > http://marc.theaimsgroup.com/?l=linux-mm&m=112307646306476&w=2
>
> Doesn't numactl --bind=node memhog nodesize-someslack do the same?
>
> It just might kick in the oom killer if someslack is too small
> or someone has unfreeable data there. But then there should already
> be a sysctl to turn that one off.
Doesn't the memhog hack also cause the machine to swap a lot? The
zone_reclaim() path doesn't let the memory reclaim code swap.
mh
--
Martin Hicks || Silicon Graphics Inc. || mort@sgi.com
* Re: [PATCH] VM: add vm.free_node_memory sysctl
2005-08-03 14:24 ` Martin Hicks
@ 2005-08-03 14:38 ` Andi Kleen
2005-08-03 14:56 ` Martin Hicks
2005-08-03 19:59 ` Ray Bryant
0 siblings, 2 replies; 10+ messages in thread
From: Andi Kleen @ 2005-08-03 14:38 UTC (permalink / raw)
To: Martin Hicks
Cc: Andi Kleen, Ingo Molnar, Linux MM, Andrew Morton, torvalds, linux-kernel
On Wed, Aug 03, 2005 at 10:24:40AM -0400, Martin Hicks wrote:
>
> On Wed, Aug 03, 2005 at 04:15:29PM +0200, Andi Kleen wrote:
> > On Wed, Aug 03, 2005 at 09:56:46AM -0400, Martin Hicks wrote:
> > >
> > > Here's the promised sysctl to dump a node's pagecache. Please review!
> > >
> > > This patch depends on the zone reclaim atomic ops cleanup:
> > > http://marc.theaimsgroup.com/?l=linux-mm&m=112307646306476&w=2
> >
> > Doesn't numactl --bind=node memhog nodesize-someslack do the same?
> >
> > It just might kick in the oom killer if someslack is too small
> > or someone has unfreeable data there. But then there should already
> > be a sysctl to turn that one off.
>
> Doesn't the memhog hack also cause the machine to swap a lot? The
Hack? - compared to your "solutions" it looks very clean to me.
> zone_reclaim() path doesn't let the memory reclaim code swap.
reclaim with bound policy should only swap on the bound nodemask
(or at least it did when I originally implemented NUMA policy)
-Andi
* Re: [PATCH] VM: add vm.free_node_memory sysctl
2005-08-03 14:38 ` Andi Kleen
@ 2005-08-03 14:56 ` Martin Hicks
2005-08-03 19:59 ` Ray Bryant
1 sibling, 0 replies; 10+ messages in thread
From: Martin Hicks @ 2005-08-03 14:56 UTC (permalink / raw)
To: Andi Kleen; +Cc: Ingo Molnar, Linux MM, Andrew Morton, torvalds, linux-kernel
On Wed, Aug 03, 2005 at 04:38:55PM +0200, Andi Kleen wrote:
> On Wed, Aug 03, 2005 at 10:24:40AM -0400, Martin Hicks wrote:
>
> > zone_reclaim() path doesn't let the memory reclaim code swap.
>
> reclaim with bound policy should only swap on the bound nodemask
> (or at least it did when I originally implemented NUMA policy)
Yes, it still looks like it only swaps on the bound nodemask.
mh
--
Martin Hicks || Silicon Graphics Inc. || mort@sgi.com
* Re: [PATCH] VM: add vm.free_node_memory sysctl
2005-08-03 14:38 ` Andi Kleen
2005-08-03 14:56 ` Martin Hicks
@ 2005-08-03 19:59 ` Ray Bryant
2005-08-03 20:08 ` Andi Kleen
1 sibling, 1 reply; 10+ messages in thread
From: Ray Bryant @ 2005-08-03 19:59 UTC (permalink / raw)
To: Andi Kleen
Cc: Martin Hicks, Ingo Molnar, Linux MM, Andrew Morton, torvalds,
linux-kernel
On Wednesday 03 August 2005 09:38, Andi Kleen wrote:
> On Wed, Aug 03, 2005 at 10:24:40AM -0400, Martin Hicks wrote:
> > On Wed, Aug 03, 2005 at 04:15:29PM +0200, Andi Kleen wrote:
> > > On Wed, Aug 03, 2005 at 09:56:46AM -0400, Martin Hicks wrote:
> > > > Here's the promised sysctl to dump a node's pagecache. Please
> > > > review!
> > > >
> > > > This patch depends on the zone reclaim atomic ops cleanup:
> > > > http://marc.theaimsgroup.com/?l=linux-mm&m=112307646306476&w=2
> > >
> > > Doesn't numactl --bind=node memhog nodesize-someslack do the same?
> > >
> > > It just might kick in the oom killer if someslack is too small
> > > or someone has unfreeable data there. But then there should already
> > > be a sysctl to turn that one off.
> >
Hmmm.... What happens if there are already mapped pages (e. g. mapped in the
sense that pages are mapped into an address space) on the node and you want
to allocate some more, but can't because the node is full of clean page cache
pages? Then one would have to set the memhog argument to the right thing to
keep the existing mapped memory from being swapped out, right? Is the data
to set that argument readily available to user space? Martin's patch has the
advantage of targeting just the clean page cache pages.
The way I see this, the problem is that clean page cache pages >>should<< be
easily available to be used to satisfy a request for mapped pages. This
works correctly in non-NUMA Linux systems. But in NUMA Linux systems, we
keep tripping over this problem all the time, particularly in the HPC space,
and patches like Martin's come about as an attempt to solve this in the VMM.
(We trip over this in the sense that we end up allocating off node storage
because the current node is full of page cache pages.)
The best answer we have at the present time is to run a memory hog program
that forces the clean page cache pages to be reclaimed by putting the node in
question under memory pressure, but this seems like an indirect way to solve
the problem at hand which is, really, to quickly release those page cache
pages and make them available for user programs to allocate. So the most
direct way to fix this is to fix it in the VMM rather than depending on a
memory hog based work-around of some kind. Perhaps we haven't gotten the
right set of patches together to do this, but my take is that is where the
fix belongs.
And, just for the record ( :-) ), this is not just an Altix problem.
Opterons are NUMA systems too, and we encounter exactly this same problem in
the HPC space on 4-node systems.
--
Ray Bryant
AMD Performance Labs Austin, Tx
512-602-0038 (o) 512-507-7807 (c)
* Re: [PATCH] VM: add vm.free_node_memory sysctl
2005-08-03 19:59 ` Ray Bryant
@ 2005-08-03 20:08 ` Andi Kleen
2005-08-05 17:45 ` Ray Bryant
0 siblings, 1 reply; 10+ messages in thread
From: Andi Kleen @ 2005-08-03 20:08 UTC (permalink / raw)
To: Ray Bryant
Cc: Andi Kleen, Martin Hicks, Ingo Molnar, Linux MM, Andrew Morton,
torvalds, linux-kernel
On Wed, Aug 03, 2005 at 02:59:22PM -0500, Ray Bryant wrote:
> On Wednesday 03 August 2005 09:38, Andi Kleen wrote:
> > On Wed, Aug 03, 2005 at 10:24:40AM -0400, Martin Hicks wrote:
> > > On Wed, Aug 03, 2005 at 04:15:29PM +0200, Andi Kleen wrote:
> > > > On Wed, Aug 03, 2005 at 09:56:46AM -0400, Martin Hicks wrote:
> > > > > Here's the promised sysctl to dump a node's pagecache. Please
> > > > > review!
> > > > >
> > > > > This patch depends on the zone reclaim atomic ops cleanup:
> > > > > http://marc.theaimsgroup.com/?l=linux-mm&m=112307646306476&w=2
> > > >
> > > > Doesn't numactl --bind=node memhog nodesize-someslack do the same?
> > > >
> > > > It just might kick in the oom killer if someslack is too small
> > > > or someone has unfreeable data there. But then there should already
> > > > be a sysctl to turn that one off.
> > >
> Hmmm.... What happens if there are already mapped pages (e. g. mapped in the
> sense that pages are mapped into an address space) on the node and you want
> to allocate some more, but can't because the node is full of clean page cache
> pages? Then one would have to set the memhog argument to the right thing to
If you have a bind policy in the memory grabbing program then the standard try_to_free_pages
should DTRT. That is because we generate a custom zonelist containing only the zones
on the bound nodes, and the zone reclaim only looks into those.
With preferred or other policies it's different though; in those cases try_to_free_pages
will also look into other nodes because the policy is not binding.
That said, it would probably be possible to make non-bind policies more
aggressive at freeing on the current node before looking into other nodes.
I think the zone balancing has been mostly tuned on non-NUMA systems, so
some improvements might be possible here.
Most people don't use BIND, and changing the default policies like this
might give NUMA systems a better "out of the box" experience. However, this
memory balancing is very subtle code and easy to break, so it would need some
care.
I don't think sysctls or new syscalls are the way to go here though.
-Andi
* Re: [PATCH] VM: add vm.free_node_memory sysctl
2005-08-03 20:08 ` Andi Kleen
@ 2005-08-05 17:45 ` Ray Bryant
2005-08-05 21:48 ` Andi Kleen
0 siblings, 1 reply; 10+ messages in thread
From: Ray Bryant @ 2005-08-05 17:45 UTC (permalink / raw)
To: Andi Kleen
Cc: Martin Hicks, Ingo Molnar, Linux MM, Andrew Morton, torvalds,
linux-kernel
On Wednesday 03 August 2005 15:08, Andi Kleen wrote:
> >
> > Hmmm.... What happens if there are already mapped pages (e. g. mapped in
> > the sense that pages are mapped into an address space) on the node and
> > you want to allocate some more, but can't because the node is full of
> > clean page cache pages? Then one would have to set the memhog argument
> > to the right thing to
>
> If you have a bind policy in the memory grabbing program then the standard
> try_to_free_pages should DTRT. That is because we generate a custom zonelist
> containing only the zones on the bound nodes, and the zone reclaim only looks
> into those.
>
It may depend on what your definition of DTRT is here. :-)
As I understand things, if we have a node that has some mapped memory
allocated, and if one starts up a numactl -bind node memhog nodesize-slop so
as to clear some clean page cache pages from that node, then unless the
"slop" is sized in proportion to the amount of mapped memory used on the
node, the existing mapped memory will get swapped out in order to
satisfy the new request. In addition, clean page-cache pages will get
discarded. I think what Martin and I would prefer to see is an interface
that allows one to just get rid of the clean page cache (or at least enough
of it) so that additional mapped page allocations will occur locally to the
node without causing swapping.
AFAIK, the number of mapped pages on the node is not exported to user space
(by, for example, /sys). So there is no good way to size the "slop" to
allow for an existing allocation. If there was, then using a bound memory
hog would likely be a reasonable replacement for Martin's syscall to release
all free page cache, at least for small to medium sized systems.
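(To make the sizing concrete: if a per-node Mapped count were exported
next to MemTotal, say in /sys/devices/system/node/nodeN/meminfo, the hog
could be sized roughly as below. The Mapped line is hypothetical -- it is
not exported today, which is exactly the problem.)

  MEMINFO=/sys/devices/system/node/node3/meminfo
  TOTAL_KB=$(awk '/MemTotal/ {print $4}' $MEMINFO)
  MAPPED_KB=$(awk '/Mapped/ {print $4}' $MEMINFO)    # hypothetical field
  # hog everything that is not mapped, minus a little slack
  numactl --membind=3 memhog $(( (TOTAL_KB - MAPPED_KB - 65536) / 1024 ))m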
> With preferred or other policies it's different though; in those cases
> try_to_free_pages will also look into other nodes because the policy is not binding.
>
> That said, it would probably be possible to make non-bind policies more
> aggressive at freeing on the current node before looking into other nodes.
> I think the zone balancing has been mostly tuned on non-NUMA systems, so
> some improvements might be possible here.
>
> Most people don't use BIND, and changing the default policies like this
> might give NUMA systems a better "out of the box" experience. However, this
> memory balancing is very subtle code and easy to break, so it would need
> some care.
>
Of course!
> I don't think sysctls or new syscalls are the way to go here though.
>
The reason we ended up with a sysctl/syscall (to control the aggressiveness
with which __alloc_pages will try to free page cache before spilling) is that
deciding whether or not to spend the effort to free up page cache pages on
the local node before spilling is a workload dependent optimization. For
an HPC application it is typically worth the effort to try to free local
node page cache before spilling off node because the program will run
sufficiently long that the improvement due to getting local storage
dominates the extra cost of doing the page allocation. For file server
workloads, for example, it is typically important to minimize the time to do
the page allocation; if it turns out to be on a remote node it really doesn't
matter that much. So it seems to me that we need some way for the
application to tell the system which approach it prefers based on the type of
workload it is -- hence the sysctl or syscall approach.
> -Andi
--
Ray Bryant
AMD Performance Labs Austin, Tx
512-602-0038 (o) 512-507-7807 (c)
* Re: [PATCH] VM: add vm.free_node_memory sysctl
2005-08-05 17:45 ` Ray Bryant
@ 2005-08-05 21:48 ` Andi Kleen
2005-08-15 16:05 ` Martin Hicks
0 siblings, 1 reply; 10+ messages in thread
From: Andi Kleen @ 2005-08-05 21:48 UTC (permalink / raw)
To: Ray Bryant
Cc: Andi Kleen, Martin Hicks, Ingo Molnar, Linux MM, Andrew Morton,
torvalds, linux-kernel
On Fri, Aug 05, 2005 at 12:45:58PM -0500, Ray Bryant wrote:
> > try_to_free_pages should DTRT. That is because we generate a custom zonelist
> > containing only the zones on the bound nodes, and the zone reclaim only looks
> > into those.
> >
>
> It may depend on what your definition of DTRT is here. :-)
>
> As I understand things, if we have a node that has some mapped memory
> allocated, and if one starts up a numactl -bind node memhog nodesize-slop so
> as to clear some clean page cache pages from that node, then unless the
> "slop" is sized in proportion to the amount of mapped memory used on the
> node, then the existing mapped memory will get swapped out in order to
> satisfy the new request. In addition, clean page-cache pages will get
The VM should first eat clean pages, but yes at some point it will
swap if you want enough. It has to because it doesn't know how
to migrate to other nodes.
> discarded. I think what Martin and I would prefer to see is an interface
> that allows one to just get rid of the clean page cache (or at least enough
> of it) so that additional mapped page allocations will occur locally to the
> node without causing swapping.
That seems like a very special narrow case. But have you tried if memhog
really doesn't work this way?
>
> AFAIK, the number of mapped pages on the node is not exported to user space
> (by, for example, /sys). So there is no good way to size the "slop" to
> allow for an existing allocation. If there was, then using a bound memory
> hog would likely be a reasonable replacement for Martin's syscall to release
> > all free page cache, at least for small to medium sized systems.
I guess it could be exported without too much trouble.
> The reason we ended up with a sysctl/syscall (to control the aggressiveness
> with which __alloc_pages will try to free page cache before spilling) is that
> deciding whether or not to spend the effort to free up page cache pages on
> the local node before spilling is a workload dependent optimization. For
> an HPC application it is typically worth the effort to try to free local
> node page cache before spilling off node because the program will run
> sufficiently long to make the improvement due to getting local storage
> dominates the extra cost of doing the page allocation. For file server
> workloads, for example, it is typically important to minimize the time to do
> the page allocation; if it turns out to be on a remote node it really doesn't
> matter that much. So it seems to me that we need some way for the
> application to tell the system which approach it prefers based on the type of
> workload it is -- hence the sysctl or syscall approach.
Ideally it should just work transparently. Maybe NUMA allocation
should be a bit more aggressive at cleaning local pages before fallback.
Problem is that it potentially makes the fast path slow.
-Andi
* Re: [PATCH] VM: add vm.free_node_memory sysctl
2005-08-05 21:48 ` Andi Kleen
@ 2005-08-15 16:05 ` Martin Hicks
0 siblings, 0 replies; 10+ messages in thread
From: Martin Hicks @ 2005-08-15 16:05 UTC (permalink / raw)
To: Andi Kleen
Cc: Ray Bryant, Martin Hicks, Ingo Molnar, Linux MM, Andrew Morton,
torvalds, linux-kernel
On Fri, Aug 05, 2005 at 11:48:58PM +0200, Andi Kleen wrote:
> On Fri, Aug 05, 2005 at 12:45:58PM -0500, Ray Bryant wrote:
>
> > discarded. I think what Martin and I would prefer to see is an interface
> > that allows one to just get rid of the clean page cache (or at least enough
> > of it) so that additional mapped page allocations will occur locally to the
> > node without causing swapping.
>
> That seems like a very special narrow case. But have you tried if memhog
> really doesn't work this way?
Yes. This *is* a very special narrow case. This doesn't really apply
to a desktop machine, nor to a normal unix server load.
It *does* apply to cleaning up a node (or set of nodes) before running HPC
apps which really need to get local memory to perform correctly.
I'm not suggesting that this is something useful for your average
computer, but it is a feature that would make life a lot easier on big
machines running HPC apps.
The memhog approach works, but it may negatively impact the performance
of other jobs on the same node (by swapping or needlessly forcing the
node into memory reclaim and dirty page writeback).
> >
> > AFAIK, the number of mapped pages on the node is not exported to user space
> > (by, for example, /sys). So there is no good way to size the "slop" to
> > allow for an existing allocation. If there was, then using a bound memory
> > hog would likely be a reasonable replacement for Martin's syscall to release
> > all free page cache, at least for small to medium sized systems.
>
> I guess it could be exported without too much trouble.
I did this to test out the memhog method.
> > The reason we ended up with a sysctl/syscall (to control the aggressiveness
> > with which __alloc_pages will try to free page cache before spilling) is that
> > deciding whether or not to spend the effort to free up page cache pages on
> > the local node before spilling is a workload dependent optimization. For
> > an HPC application it is typically worth the effort to try to free local
> > node page cache before spilling off node because the program will run
> > sufficiently long to make the improvement due to getting local storage
> > dominates the extra cost of doing the page allocation. For file server
> > workloads, for example, it is typically important to minimize the time to do
> > the page allocation; if it turns out to be on a remote node it really doesn't
> > matter that much. So it seems to me that we need some way for the
> > application to tell the system which approach it prefers based on the type of
> > workload it is -- hence the sysctl or syscall approach.
>
> Ideally it should just work transparently. Maybe NUMA allocation
> should be a bit more aggressive at cleaning local pages before fallback.
> Problem is that it potentially makes the fast path slow.
This is what we need: a better level of control over how NUMA
allocations work. In some cases we *really* would prefer local pages,
even at the cost of page cache.
It does have the potential to make the fast path slower. Some workloads
are willing to make this sacrifice.
mh
--
Martin Hicks || Silicon Graphics Inc. || mort@sgi.com