linux-mm.kvack.org archive mirror
* NUMA: Patch for node based swapping
@ 2004-10-12 15:02 Christoph Lameter
  2004-10-12 15:16 ` Martin J. Bligh
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Christoph Lameter @ 2004-10-12 15:02 UTC (permalink / raw)
  To: linux-kernel; +Cc: nickpiggin, linux-mm

In a NUMA system, individual nodes may run out of memory. This can happen
even when processes only read from files, since the cached pages fill up
the node's memory.

However, as long as the system as a whole has enough memory available,
kswapd is not run at all. A process allocating memory on a node that has
none left will therefore get memory from other nodes, which is inefficient.
It would be better if kswapd threw out some pages (for example cached pages
from files that have only been read once) to reclaim memory on the node.

The following patch checks the free memory in a zone after each allocation.
If it falls below a certain minimum, kswapd is started for that zone alone.

The minimum may be controlled through /proc/sys/vm/node_swap.
By default node_swap is set to 100, which means that kswapd will be run on
a zone if less than 10% of its pages are free after an allocation.

Nick Piggin has a much better overall solution in the overhaul of the
memory subsystem that he is working on. I hope this patch can serve as a
stopgap until Nick's work gets into the kernel.

Index: linux-2.6.9-rc4/mm/page_alloc.c
===================================================================
--- linux-2.6.9-rc4.orig/mm/page_alloc.c	2004-10-10 19:57:03.000000000 -0700
+++ linux-2.6.9-rc4/mm/page_alloc.c	2004-10-11 12:54:51.000000000 -0700
@@ -41,6 +41,9 @@
 long nr_swap_pages;
 int numnodes = 1;
 int sysctl_lower_zone_protection = 0;
+#ifdef CONFIG_NUMA
+int sysctl_node_swap = 100;		/* invoke kswapd when local node memory falls below ~10% */
+#endif

 EXPORT_SYMBOL(totalram_pages);
 EXPORT_SYMBOL(nr_swap_pages);
@@ -483,6 +486,13 @@
 	p = &z->pageset[cpu];
 	if (pg == orig) {
 		z->pageset[cpu].numa_hit++;
+		/*
+		 * If zone allocation leaves less than a (sysctl_node_swap * 10) %
+		 * of the zone free then invoke kswapd.
+		 * (to make it efficient we do (pages * sysctl_node_swap) / 1024))
+		 */
+		if (z->free_pages < (z->present_pages * sysctl_node_swap) >> 10)
+			wakeup_kswapd(z);
 	} else {
 		p->numa_miss++;
 		zonelist->zones[0]->pageset[cpu].numa_foreign++;
Index: linux-2.6.9-rc4/kernel/sysctl.c
===================================================================
--- linux-2.6.9-rc4.orig/kernel/sysctl.c	2004-10-10 19:57:03.000000000 -0700
+++ linux-2.6.9-rc4/kernel/sysctl.c	2004-10-11 12:54:51.000000000 -0700
@@ -65,6 +65,9 @@
 extern int min_free_kbytes;
 extern int printk_ratelimit_jiffies;
 extern int printk_ratelimit_burst;
+#ifdef CONFIG_NUMA
+extern int sysctl_node_swap;
+#endif

 #if defined(CONFIG_X86_LOCAL_APIC) && defined(__i386__)
 int unknown_nmi_panic;
@@ -800,7 +803,17 @@
 		.extra1		= &zero,
 	},
 #endif
-	{ .ctl_name = 0 }
+#ifdef CONFIG_NUMA
+	{
+		.ctl_name	= VM_NODE_SWAP,
+		.procname	= "node_swap",
+		.data		= &sysctl_node_swap,
+		.maxlen		= sizeof(sysctl_node_swap),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec
+	},
+#endif
+	{ .ctl_name = 0 }
 };

 static ctl_table proc_table[] = {
Index: linux-2.6.9-rc4/include/linux/sysctl.h
===================================================================
--- linux-2.6.9-rc4.orig/include/linux/sysctl.h	2004-10-10 19:58:05.000000000 -0700
+++ linux-2.6.9-rc4/include/linux/sysctl.h	2004-10-11 12:54:51.000000000 -0700
@@ -167,6 +167,7 @@
 	VM_HUGETLB_GROUP=25,	/* permitted hugetlb group */
 	VM_VFS_CACHE_PRESSURE=26, /* dcache/icache reclaim pressure */
 	VM_LEGACY_VA_LAYOUT=27, /* legacy/compatibility virtual address space layout */
+	VM_NODE_SWAP=28,	/* Swap local node memory limit (in % *10) */
 };


Index: linux-2.6.9-rc4/mm/vmscan.c
===================================================================
--- linux-2.6.9-rc4.orig/mm/vmscan.c	2004-10-10 19:57:04.000000000 -0700
+++ linux-2.6.9-rc4/mm/vmscan.c	2004-10-11 12:54:51.000000000 -0700
@@ -1168,9 +1168,11 @@
  */
 void wakeup_kswapd(struct zone *zone)
 {
+	extern int sysctl_node_swap;
+
 	if (zone->present_pages == 0)
 		return;
-	if (zone->free_pages > zone->pages_low)
+	if (zone->free_pages > (zone->present_pages * sysctl_node_swap) >> 10 && zone->free_pages > zone->pages_low)
 		return;
 	if (!waitqueue_active(&zone->zone_pgdat->kswapd_wait))
 		return;
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: aart@kvack.org

* Re: NUMA: Patch for node based swapping
  2004-10-12 15:02 NUMA: Patch for node based swapping Christoph Lameter
@ 2004-10-12 15:16 ` Martin J. Bligh
  2004-10-12 15:38   ` Christoph Lameter
  2004-10-12 15:20 ` Jan-Benedict Glaw
  2004-10-12 15:27 ` Rik van Riel
  2 siblings, 1 reply; 12+ messages in thread
From: Martin J. Bligh @ 2004-10-12 15:16 UTC (permalink / raw)
  To: Christoph Lameter, linux-kernel; +Cc: nickpiggin, linux-mm

> In a NUMA system, individual nodes may run out of memory. This can happen
> even when processes only read from files, since the cached pages fill up
> the node's memory.
> 
> However, as long as the system as a whole has enough memory available,
> kswapd is not run at all. A process allocating memory on a node that has
> none left will therefore get memory from other nodes, which is inefficient.
> It would be better if kswapd threw out some pages (for example cached pages
> from files that have only been read once) to reclaim memory on the node.
> 
> The following patch checks the free memory in a zone after each allocation.
> If it falls below a certain minimum, kswapd is started for that zone alone.

I agree it's a problem, but you really don't want to kick pages out
to disk when we have free memory - the solution, I think, is to migrate
the least-recently-used pages to the other node, not all the way to
disk. The page relocation code from the proposed defrag work may help
(if they fix it not to go via swap ;-)). I'll try to find some time to
look at it again.

M.

PS: it might be possible to add a mechanism to ask kswapd to reclaim some
cache pages without doing swapout, but I fear messing with the delicate
balance of the universe - cache vs. user.

* Re: NUMA: Patch for node based swapping
  2004-10-12 15:02 NUMA: Patch for node based swapping Christoph Lameter
  2004-10-12 15:16 ` Martin J. Bligh
@ 2004-10-12 15:20 ` Jan-Benedict Glaw
  2004-10-12 15:27 ` Rik van Riel
  2 siblings, 0 replies; 12+ messages in thread
From: Jan-Benedict Glaw @ 2004-10-12 15:20 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel, nickpiggin, linux-mm


On Tue, 2004-10-12 08:02:40 -0700, Christoph Lameter <clameter@sgi.com>
wrote in message <Pine.LNX.4.58.0410120751010.11558@schroedinger.engr.sgi.com>:
> --- linux-2.6.9-rc4.orig/mm/page_alloc.c	2004-10-10 19:57:03.000000000 -0700
> +++ linux-2.6.9-rc4/mm/page_alloc.c	2004-10-11 12:54:51.000000000 -0700
> @@ -483,6 +486,13 @@
>  	p = &z->pageset[cpu];
>  	if (pg == orig) {
>  		z->pageset[cpu].numa_hit++;
> +		/*
> +		 * If zone allocation leaves less than a (sysctl_node_swap * 10) %
> +		 * of the zone free then invoke kswapd.
> +		 * (to make it efficient we do (pages * sysctl_node_swap) / 1024))
> +		 */
> +		if (z->free_pages < (z->present_pages * sysctl_node_swap) >> 10)
> +			wakeup_kswapd(z);
>  	} else {
>  		p->numa_miss++;
>  		zonelist->zones[0]->pageset[cpu].numa_foreign++;

Shouldn't the comment read "less than (sysctl_node_swap / 10) %",
because the value in sysctl_node_swap is actually percent*10, so you
need the reverse action here?!

MfG, JBG

-- 
Jan-Benedict Glaw       jbglaw@lug-owl.de    . +49-172-7608481             _ O _
"Eine Freie Meinung in  einem Freien Kopf    | Gegen Zensur | Gegen Krieg  _ _ O
 fuer einen Freien Staat voll Freier Bürger" | im Internet! |   im Irak!   O O O
ret = do_actions((curr | FREE_SPEECH) & ~(NEW_COPYRIGHT_LAW | DRM | TCPA));


* Re: NUMA: Patch for node based swapping
  2004-10-12 15:02 NUMA: Patch for node based swapping Christoph Lameter
  2004-10-12 15:16 ` Martin J. Bligh
  2004-10-12 15:20 ` Jan-Benedict Glaw
@ 2004-10-12 15:27 ` Rik van Riel
  2004-10-12 15:39   ` Christoph Lameter
  2004-10-12 19:33   ` NUMA: Patch for node based swapping Anton Blanchard
  2 siblings, 2 replies; 12+ messages in thread
From: Rik van Riel @ 2004-10-12 15:27 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel, nickpiggin, linux-mm

On Tue, 12 Oct 2004, Christoph Lameter wrote:

> The minimum may be controlled through /proc/sys/vm/node_swap.
> By default node_swap is set to 100, which means that kswapd will be run on
> a zone if less than 10% of its pages are free after an allocation.

That sounds like an extraordinarily bad idea for e.g. AMD64
systems, which have a very low NUMA factor.

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan


* Re: NUMA: Patch for node based swapping
  2004-10-12 15:16 ` Martin J. Bligh
@ 2004-10-12 15:38   ` Christoph Lameter
  0 siblings, 0 replies; 12+ messages in thread
From: Christoph Lameter @ 2004-10-12 15:38 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: linux-kernel, nickpiggin, linux-mm

On Tue, 12 Oct 2004, Martin J. Bligh wrote:

> PS: it might be possible to add a mechanism to ask kswapd to reclaim some
> cache pages without doing swapout, but I fear messing with the delicate
> balance of the universe - cache vs. user.

That is also my concern. I think the patch is useful to address the
immediate issue.


* Re: NUMA: Patch for node based swapping
  2004-10-12 15:27 ` Rik van Riel
@ 2004-10-12 15:39   ` Christoph Lameter
  2004-10-12 15:52     ` Rik van Riel
  2004-10-12 19:33   ` NUMA: Patch for node based swapping Anton Blanchard
  1 sibling, 1 reply; 12+ messages in thread
From: Christoph Lameter @ 2004-10-12 15:39 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, nickpiggin, linux-mm

On Tue, 12 Oct 2004, Rik van Riel wrote:
> On Tue, 12 Oct 2004, Christoph Lameter wrote:
> > The minimum may be controlled through /proc/sys/vm/node_swap.
> > By default node_swap is set to 100, which means that kswapd will be run on
> > a zone if less than 10% of its pages are free after an allocation.
> That sounds like an extraordinarily bad idea for eg. AMD64
> systems, which have a very low numa factor.

Any other suggestions?


* Re: NUMA: Patch for node based swapping
  2004-10-12 15:39   ` Christoph Lameter
@ 2004-10-12 15:52     ` Rik van Riel
  2004-10-12 20:20       ` Christoph Lameter
  0 siblings, 1 reply; 12+ messages in thread
From: Rik van Riel @ 2004-10-12 15:52 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel, nickpiggin, linux-mm

On Tue, 12 Oct 2004, Christoph Lameter wrote:

> Any other suggestions?

Since this is meant as a stop gap patch, waiting for a real
solution, and is only relevant for big (and rare) systems,
it would be an idea to at least leave it off by default.

I think it would be safe to assume that a $100k system has
a system administrator looking after it, while a $5k AMD64
whitebox might not have somebody watching its performance.

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan


* Re: NUMA: Patch for node based swapping
  2004-10-12 15:27 ` Rik van Riel
  2004-10-12 15:39   ` Christoph Lameter
@ 2004-10-12 19:33   ` Anton Blanchard
  1 sibling, 0 replies; 12+ messages in thread
From: Anton Blanchard @ 2004-10-12 19:33 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Christoph Lameter, linux-kernel, nickpiggin, linux-mm

 
> That sounds like an extraordinarily bad idea for eg. AMD64
> systems, which have a very low numa factor.

Same with ppc64.

Anton

* Re: NUMA: Patch for node based swapping
  2004-10-12 15:52     ` Rik van Riel
@ 2004-10-12 20:20       ` Christoph Lameter
  2004-10-13 10:59         ` Nick Piggin
  0 siblings, 1 reply; 12+ messages in thread
From: Christoph Lameter @ 2004-10-12 20:20 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, nickpiggin, linux-mm

On Tue, 12 Oct 2004, Rik van Riel wrote:

> On Tue, 12 Oct 2004, Christoph Lameter wrote:
>
> > Any other suggestions?
>
> Since this is meant as a stop gap patch, waiting for a real
> solution, and is only relevant for big (and rare) systems,
> it would be an idea to at least leave it off by default.
>
> I think it would be safe to assume that a $100k system has
> a system administrator looking after it, while a $5k AMD64
> whitebox might not have somebody watching its performance.

Ok. Will do that then. Should I submit the patch to Andrew?

* Re: NUMA: Patch for node based swapping
  2004-10-12 20:20       ` Christoph Lameter
@ 2004-10-13 10:59         ` Nick Piggin
  2004-10-13 15:14           ` NUMA: Patch for node based swapping V2 Christoph Lameter
  0 siblings, 1 reply; 12+ messages in thread
From: Nick Piggin @ 2004-10-13 10:59 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Rik van Riel, linux-kernel, linux-mm

Christoph Lameter wrote:
> On Tue, 12 Oct 2004, Rik van Riel wrote:
> 
> 
>>On Tue, 12 Oct 2004, Christoph Lameter wrote:
>>
>>
>>>Any other suggestions?
>>
>>Since this is meant as a stop gap patch, waiting for a real
>>solution, and is only relevant for big (and rare) systems,
>>it would be an idea to at least leave it off by default.
>>
>>I think it would be safe to assume that a $100k system has
>>a system administrator looking after it, while a $5k AMD64
>>whitebox might not have somebody watching its performance.
> 
> 
> Ok. Will do that then. Should I submit the patch to Andrew?
> 

I can't see the harm in sending it after 2.6.9 if it defaults
to off (maybe also make it CONFIG_NUMA).

OTOH, if it is going to be painful to remove later on, then
maybe leave it local to your tree.

It's true that I have something a bit more sophisticated in
the pipe, but it is going to be an uphill battle to get it
and everything it depends on merged - so don't count on it for
2.6.10 :P


* NUMA: Patch for node based swapping V2
  2004-10-13 10:59         ` Nick Piggin
@ 2004-10-13 15:14           ` Christoph Lameter
  0 siblings, 0 replies; 12+ messages in thread
From: Christoph Lameter @ 2004-10-13 15:14 UTC (permalink / raw)
  To: akpm; +Cc: nickpiggin, Rik van Riel, linux-kernel, linux-mm

This was discussed yesterday on linux-mm.

Changelog:
	* NUMA: Add the ability to invoke kswapd on a node if local memory falls below
	  a certain threshold. A node may fill up its memory by simply copying a file,
	  which fills the node's memory with cached pages. The node's memory is
	  currently only reclaimed if all nodes in the system fall below a certain
	  threshold. Until then, the processes on the node will only be allocated
	  off-node memory. Invoking kswapd on a node fixes this situation until
	  a better solution can be found.
	* The threshold may be set in /proc/sys/vm/node_swap in percent * 10. The
	  threshold is set to zero by default, which means that node swapping is off.

Index: linux-2.6.9-rc4/mm/page_alloc.c
===================================================================
--- linux-2.6.9-rc4.orig/mm/page_alloc.c	2004-10-10 19:57:03.000000000 -0700
+++ linux-2.6.9-rc4/mm/page_alloc.c	2004-10-13 07:58:57.000000000 -0700
@@ -41,6 +41,19 @@
 long nr_swap_pages;
 int numnodes = 1;
 int sysctl_lower_zone_protection = 0;
+#ifdef CONFIG_NUMA
+/*
+ * sysctl_node_swap is a percentage of the pages present
+ * in a zone, multiplied by 10. If the free pages in a
+ * zone drop below this limit then kswapd is invoked for
+ * that zone alone, reclaiming local memory. Local memory
+ * may be filled up by simply reading a file; if none is
+ * available, off-node memory is allocated to the process,
+ * which makes all memory accesses less efficient than
+ * they could be.
+ */
+int sysctl_node_swap = 0;
+#endif

 EXPORT_SYMBOL(totalram_pages);
 EXPORT_SYMBOL(nr_swap_pages);
@@ -483,6 +496,14 @@
 	p = &z->pageset[cpu];
 	if (pg == orig) {
 		z->pageset[cpu].numa_hit++;
+		/*
+		 * If zone allocation has left less than
+		 * (sysctl_node_swap / 10) %  of the zone free invoke kswapd.
+		 * (the page limit is obtained through (pages*limit)/1024 to
+		 * make the calculation more efficient)
+		 */
+		if (z->free_pages < (z->present_pages * sysctl_node_swap) >> 10)
+			wakeup_kswapd(z);
 	} else {
 		p->numa_miss++;
 		zonelist->zones[0]->pageset[cpu].numa_foreign++;
Index: linux-2.6.9-rc4/kernel/sysctl.c
===================================================================
--- linux-2.6.9-rc4.orig/kernel/sysctl.c	2004-10-10 19:57:03.000000000 -0700
+++ linux-2.6.9-rc4/kernel/sysctl.c	2004-10-11 12:54:51.000000000 -0700
@@ -65,6 +65,9 @@
 extern int min_free_kbytes;
 extern int printk_ratelimit_jiffies;
 extern int printk_ratelimit_burst;
+#ifdef CONFIG_NUMA
+extern int sysctl_node_swap;
+#endif

 #if defined(CONFIG_X86_LOCAL_APIC) && defined(__i386__)
 int unknown_nmi_panic;
@@ -800,7 +803,17 @@
 		.extra1		= &zero,
 	},
 #endif
-	{ .ctl_name = 0 }
+#ifdef CONFIG_NUMA
+	{
+		.ctl_name	= VM_NODE_SWAP,
+		.procname	= "node_swap",
+		.data		= &sysctl_node_swap,
+		.maxlen		= sizeof(sysctl_node_swap),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec
+	},
+#endif
+	{ .ctl_name = 0 }
 };

 static ctl_table proc_table[] = {
Index: linux-2.6.9-rc4/include/linux/sysctl.h
===================================================================
--- linux-2.6.9-rc4.orig/include/linux/sysctl.h	2004-10-10 19:58:05.000000000 -0700
+++ linux-2.6.9-rc4/include/linux/sysctl.h	2004-10-11 12:54:51.000000000 -0700
@@ -167,6 +167,7 @@
 	VM_HUGETLB_GROUP=25,	/* permitted hugetlb group */
 	VM_VFS_CACHE_PRESSURE=26, /* dcache/icache reclaim pressure */
 	VM_LEGACY_VA_LAYOUT=27, /* legacy/compatibility virtual address space layout */
+	VM_NODE_SWAP=28,	/* Swap local node memory limit (in % *10) */
 };


Index: linux-2.6.9-rc4/mm/vmscan.c
===================================================================
--- linux-2.6.9-rc4.orig/mm/vmscan.c	2004-10-10 19:57:04.000000000 -0700
+++ linux-2.6.9-rc4/mm/vmscan.c	2004-10-11 12:54:51.000000000 -0700
@@ -1168,9 +1168,11 @@
  */
 void wakeup_kswapd(struct zone *zone)
 {
+	extern int sysctl_node_swap;
+
 	if (zone->present_pages == 0)
 		return;
-	if (zone->free_pages > zone->pages_low)
+	if (zone->free_pages > (zone->present_pages * sysctl_node_swap) >> 10 && zone->free_pages > zone->pages_low)
 		return;
 	if (!waitqueue_active(&zone->zone_pgdat->kswapd_wait))
 		return;

* Re: NUMA: Patch for node based swapping
       [not found]   ` <m3llebn20a.fsf@averell.firstfloor.org>
@ 2004-10-12 21:38     ` Ray Bryant
  0 siblings, 0 replies; 12+ messages in thread
From: Ray Bryant @ 2004-10-12 21:38 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Rik van Riel, linux-kernel, clameter, linux-mm

This patch is a bad idea and should not be merged into the mainline.

(1)  On bids for large SGI machines, we often see the requirement that
 > 90% of main memory be allocatable to user programs.  If, as suggested,
one were to set /proc/sys/vm/node_swap to 10% (a value of 100), then any
allocation (e.g. allocation of a page cache page) will kick off kswapd once
the customer has allocated > 90% of storage.  The result is that kswapd
will be more or less constantly running on every node in the system.  Since
that same 90% requirement is often used to size the amount of memory
purchased to run the customer's primary application, we have a recipe for
providing poor performance for that principal application.  As a result we
would likely end up disabling this feature on those large SGI machines,
were it to end up in one of our kernels.

Setting the node_swap limit to less than 10% would keep this from happening,
of course, but in that case the improvement gained is marginal and likely
not worth the effort.

(2)  In HPC applications, it is not sufficient to get "mostly" local storage.
Quite often such applications "settle in" on a set of nodes and sit there and
compute for an extremely long time.  Any imbalance in execution times between
the threads of such an application (e.g. due to one thread having one or more
pages located on a remote node) slows down the entire application (a parallel
application often runs only as quickly as its slowest thread).

The application people running benchmarks for our systems insist that 100%
of the storage they request as local actually be backed by local storage.
Getting 98% of that figure is not acceptable.  Because this patch kicks off
kswapd asynchronously from the storage request, the current page being
allocated can still end up being allocated off node.  If one tries to solve
this problem by setting the threshold higher (say at 20% of main memory), then
when the benchmark allocates 90% of memory we end up back in the situation
described above where any storage allocation will cause kswapd to run.
(Remember that even in an idle system, Linux is constantly writing stuff
out to disk -- so there are always allocations going on by way of
__alloc_pages().)  Even then there is no guarantee that kswapd will be able
to free up storage quickly enough to keep ahead of allocations.  (The real
problem here, of course, is that clean page cache pages can fill up the node
and cause off-node allocations to occur; we would like to free those instead.)

I have patches that I am currently working on to do the latter instead of
the approach of this patch, and once we get those working I'd prefer to
see those included in the mainline instead of this solution.
-- 
Best Regards,
Ray
-----------------------------------------------
                   Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
raybry@sgi.com             raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
            so I installed Linux.
-----------------------------------------------

