linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* NUMA node swapping V3
@ 2004-10-28 15:23 Christoph Lameter
  2004-10-28 15:40 ` Martin J. Bligh
  2004-10-28 16:20 ` Mika Penttilä
  0 siblings, 2 replies; 5+ messages in thread
From: Christoph Lameter @ 2004-10-28 15:23 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm

Changes from V2: better documentation, fix missing #ifdef

In a NUMA systems single nodes may run out of memory. This may occur even
by only reading from files which will clutter node memory with cached
pages from the file.

However, as long as the system as a whole does have enough memory
available, kswapd is not run at all. This means that a process allocating
memory and running on a node that has no memory left, will get memory
allocated from other nodes which is inefficient to handle. It would be
better if kswapd would throw out some pages (maybe some of the cached
pages from files that have only once been read) to reclaim memory in the
node.

The following patch checks the memory usage after each allocation in a
zone. If the allocation in a zone falls below a certain minimum, kswapd is
started for that zone alone.

The minimum may be controlled through /proc/sys/vm/node_swap which is set
to zero by default and thus is off.

If it is set for example to 100 then kswapd will be run on
a zone/node if less than 10% of pages are available after an allocation.

Changelog:
	* NUMA: Add ability to invoke kswapd on a node if local memory falls below a
	  certain threshold. A node may fill up its memory by simply copying a file
	  which will fill up the nodes memory with cached pages. The nodes memory will
	  currently only be reclaimed if all nodes in the system fall below a certain
	  threshhold. Until that time the processes on the node will only be allocated
	  off node memory. Invoking kswapd on a node fixes this situation until
	  a better solution can be found.
	* Threshold may be set in /proc/sys/vm/node_swap in percent * 10. The threshold
	  is set to zero by default which means that node swapping is off.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.9/mm/page_alloc.c
===================================================================
--- linux-2.6.9.orig/mm/page_alloc.c	2004-10-25 13:09:18.000000000 -0700
+++ linux-2.6.9/mm/page_alloc.c	2004-10-25 13:18:21.000000000 -0700
@@ -43,6 +43,19 @@
 long nr_swap_pages;
 int numnodes = 1;
 int sysctl_lower_zone_protection = 0;
+#ifdef CONFIG_NUMA
+/*
+ * sysctl_node_swap is a percentage of the pages available
+ * in a zone multiplied by 10. If the available pages
+ * in a zone drop below this limit then kswapd is invoked
+ * for this zone alone. This results in the reclaiming
+ * of local memory. Local memory may be filled up by simply reading
+ * a file. If local memory is not available then off node memory
+ * will be allocated to a process which makes all memory access
+ * less efficient then they could be.
+ */
+int sysctl_node_swap = 0;
+#endif

 EXPORT_SYMBOL(totalram_pages);
 EXPORT_SYMBOL(nr_swap_pages);
@@ -486,6 +499,14 @@
 	p = &z->pageset[cpu];
 	if (pg == orig) {
 		z->pageset[cpu].numa_hit++;
+		/*
+		 * If zone allocation has left less than
+		 * (sysctl_node_swap / 10) %  of the zone free invoke kswapd.
+		 * (the page limit is obtained through (pages*limit)/1024 to
+		 * make the calculation more efficient)
+		 */
+		if (z->free_pages < (z->present_pages * sysctl_node_swap) << 10)
+			wakeup_kswapd(z);
 	} else {
 		p->numa_miss++;
 		zonelist->zones[0]->pageset[cpu].numa_foreign++;
Index: linux-2.6.9/kernel/sysctl.c
===================================================================
--- linux-2.6.9.orig/kernel/sysctl.c	2004-10-25 13:09:18.000000000 -0700
+++ linux-2.6.9/kernel/sysctl.c	2004-10-25 13:10:14.000000000 -0700
@@ -67,6 +67,9 @@
 extern int printk_ratelimit_jiffies;
 extern int printk_ratelimit_burst;
 extern int pid_max_min, pid_max_max;
+#ifdef CONFIG_NUMA
+extern int sysctl_node_swap;
+#endif

 #if defined(CONFIG_X86_LOCAL_APIC) && defined(__i386__)
 int unknown_nmi_panic;
@@ -816,7 +819,17 @@
 		.strategy	= &sysctl_jiffies,
 	},
 #endif
-	{ .ctl_name = 0 }
+#ifdef CONFIG_NUMA
+	{
+		.ctl_name	= VM_NODE_SWAP,
+		.procname	= "node_swap",
+		.data		= &sysctl_node_swap,
+		.maxlen		= sizeof(sysctl_node_swap),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec
+	},
+#endif
+		{ .ctl_name = 0 }
 };

 static ctl_table proc_table[] = {
Index: linux-2.6.9/include/linux/sysctl.h
===================================================================
--- linux-2.6.9.orig/include/linux/sysctl.h	2004-10-25 13:09:18.000000000 -0700
+++ linux-2.6.9/include/linux/sysctl.h	2004-10-25 13:12:53.000000000 -0700
@@ -168,6 +168,7 @@
 	VM_VFS_CACHE_PRESSURE=26, /* dcache/icache reclaim pressure */
 	VM_LEGACY_VA_LAYOUT=27, /* legacy/compatibility virtual address space layout */
 	VM_SWAP_TOKEN_TIMEOUT=28, /* default time for token time out */
+	VM_NODE_SWAP=29,	/* Swap local node memory limit (in % *10) */
 };


Index: linux-2.6.9/mm/vmscan.c
===================================================================
--- linux-2.6.9.orig/mm/vmscan.c	2004-10-18 14:53:21.000000000 -0700
+++ linux-2.6.9/mm/vmscan.c	2004-10-25 13:21:17.000000000 -0700
@@ -1171,9 +1171,17 @@
  */
 void wakeup_kswapd(struct zone *zone)
 {
+#ifdef CONFIG_NUMA
+	extern int sysctl_node_swap;
+#endif
+
 	if (zone->present_pages == 0)
 		return;
+#ifdef CONFIG_NUMA
+	if (zone->free_pages > (zone->present_pages * sysctl_node_swap) << 10 && zone->free_pages > zone->pages_low)
+#else
 	if (zone->free_pages > zone->pages_low)
+#endif
 		return;
 	if (!waitqueue_active(&zone->zone_pgdat->kswapd_wait))
 		return;
Index: linux-2.6.9/Documentation/vm/numa_node_swap
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.9/Documentation/vm/numa_node_swap	2004-10-25 13:17:07.000000000 -0700
@@ -0,0 +1,24 @@
+In a NUMA systems single nodes may run out of memory. This may occur even
+by only reading from files which will clutter node memory with cached
+pages from the file.
+
+However, as long as the system as a whole does have enough memory
+available, kswapd is not run at all. This means that a process allocating
+memory and running on a node that has no memory left, will get memory
+allocated from other nodes which is inefficient to handle. It would be
+better if kswapd would throw out some pages (maybe some of the cached
+pages from files that have only once been read) to reclaim memory in the
+node.
+
+The following patch checks the memory usage after each allocation in a
+zone. If the allocation in a zone falls below a certain minimum, kswapd is
+started for that zone alone.
+
+The minimum may be controlled through /proc/sys/vm/node_swap which is set
+to zero by default and thus is off.
+
+If it is set for example to 100 then kswapd will be run on
+a zone/node if less than 10% of pages are available after an allocation.
+
+October 25, 2004, Christoph Lameter <christoph@lameter.com>
+

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: NUMA node swapping V3
  2004-10-28 15:23 NUMA node swapping V3 Christoph Lameter
@ 2004-10-28 15:40 ` Martin J. Bligh
  2004-10-28 15:49   ` Christoph Lameter
  2004-10-28 16:20 ` Mika Penttilä
  1 sibling, 1 reply; 5+ messages in thread
From: Martin J. Bligh @ 2004-10-28 15:40 UTC (permalink / raw)
  To: Christoph Lameter, akpm; +Cc: linux-kernel, linux-mm

> Changes from V2: better documentation, fix missing #ifdef
> 
> In a NUMA systems single nodes may run out of memory. This may occur even
> by only reading from files which will clutter node memory with cached
> pages from the file.
> 
> However, as long as the system as a whole does have enough memory
> available, kswapd is not run at all. This means that a process allocating
> memory and running on a node that has no memory left, will get memory
> allocated from other nodes which is inefficient to handle. It would be
> better if kswapd would throw out some pages (maybe some of the cached
> pages from files that have only once been read) to reclaim memory in the
> node.
> 
> The following patch checks the memory usage after each allocation in a
> zone. If the allocation in a zone falls below a certain minimum, kswapd is
> started for that zone alone.
> 
> The minimum may be controlled through /proc/sys/vm/node_swap which is set
> to zero by default and thus is off.
> 
> If it is set for example to 100 then kswapd will be run on
> a zone/node if less than 10% of pages are available after an allocation.

I thought even the SGI people were saying this wouldn't actually help you,
due to some workload issues?

M.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: NUMA node swapping V3
  2004-10-28 15:40 ` Martin J. Bligh
@ 2004-10-28 15:49   ` Christoph Lameter
  0 siblings, 0 replies; 5+ messages in thread
From: Christoph Lameter @ 2004-10-28 15:49 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: akpm, linux-kernel, linux-mm

On Thu, 28 Oct 2004, Martin J. Bligh wrote:

> I thought even the SGI people were saying this wouldn't actually help you,
> due to some workload issues?

Our tests show that this does indeed address the issue. There may still be
some off node allocation while kswapd is starting up which causes some
objections but avoiding these would mean significant modifications to
__alloc_pages. This is a fix until a better solution can be found which I
would estimate to be 3-6 months down the road given the difficulties
getting vm changes into the kernel.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: NUMA node swapping V3
  2004-10-28 15:23 NUMA node swapping V3 Christoph Lameter
  2004-10-28 15:40 ` Martin J. Bligh
@ 2004-10-28 16:20 ` Mika Penttilä
  2004-10-28 20:45   ` Christoph Lameter
  1 sibling, 1 reply; 5+ messages in thread
From: Mika Penttilä @ 2004-10-28 16:20 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: akpm, linux-kernel, linux-mm

> 	if (pg == orig) {
> 		z->pageset[cpu].numa_hit++;
>+		/*
>+		 * If zone allocation has left less than
>+		 * (sysctl_node_swap / 10) %  of the zone free invoke kswapd.
>+		 * (the page limit is obtained through (pages*limit)/1024 to
>+		 * make the calculation more efficient)
>+		 */
>+		if (z->free_pages < (z->present_pages * sysctl_node_swap) << 10)
>+			wakeup_kswapd(z);
> 	} else {
> 		p->numa_miss++;
> 		zonelist->zones[0]->pageset[cpu].numa_foreign++;
>Index: linux-2.6.9/kernel/sysctl.c
>===================================================================
>  
>

I think you mean >> 10 though.

--Mika


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: NUMA node swapping V3
  2004-10-28 16:20 ` Mika Penttilä
@ 2004-10-28 20:45   ` Christoph Lameter
  0 siblings, 0 replies; 5+ messages in thread
From: Christoph Lameter @ 2004-10-28 20:45 UTC (permalink / raw)
  To: Mika Penttilä; +Cc: akpm, linux-kernel, linux-mm

On Thu, 28 Oct 2004, Mika Penttila wrote:

> >+		if (z->free_pages < (z->present_pages * sysctl_node_swap) << 10)
> >+			wakeup_kswapd(z);
> I think you mean >> 10 though.

Thanks for spotting that. This means we always were swapping before.

I did some additional test with various values of node_swap to see how it
does:

(test by filling node with pagecache pages through copy operation, then
 large anonymous alloc with touching of all pages)

setting|foreign| effectiveness
--------------------------------------
0	39136	0%
1	39138	0%
2	1460	96.3%
3	2429	93.8%
4	1171	97%
5	2868	92.7%
10	1995	95%
50	2242	94.3%
100	2348	94%
300	2768	93%

Seems that there are some sweet spots. Surprisingly these come early with
the swap ceiling even set to 0.2%. Higher values increase the number of
foreign pages?!?

Fixed patch:

Index: linux-2.6.9/mm/page_alloc.c
===================================================================
--- linux-2.6.9.orig/mm/page_alloc.c	2004-10-25 13:09:18.000000000 -0700
+++ linux-2.6.9/mm/page_alloc.c	2004-10-28 10:12:19.000000000 -0700
@@ -43,6 +43,19 @@
 long nr_swap_pages;
 int numnodes = 1;
 int sysctl_lower_zone_protection = 0;
+#ifdef CONFIG_NUMA
+/*
+ * sysctl_node_swap is a percentage of the pages available
+ * in a zone multiplied by 10. If the available pages
+ * in a zone drop below this limit then kswapd is invoked
+ * for this zone alone. This results in the reclaiming
+ * of local memory. Local memory may be filled up by simply reading
+ * a file. If local memory is not available then off node memory
+ * will be allocated to a process which makes all memory access
+ * less efficient then they could be.
+ */
+int sysctl_node_swap = 0;
+#endif

 EXPORT_SYMBOL(totalram_pages);
 EXPORT_SYMBOL(nr_swap_pages);
@@ -486,6 +499,14 @@
 	p = &z->pageset[cpu];
 	if (pg == orig) {
 		z->pageset[cpu].numa_hit++;
+		/*
+		 * If zone allocation has left less than
+		 * (sysctl_node_swap / 10) %  of the zone free invoke kswapd.
+		 * (the page limit is obtained through (pages*limit)/1024 to
+		 * make the calculation more efficient)
+		 */
+		if (z->free_pages < (z->present_pages * sysctl_node_swap) >> 10)
+			wakeup_kswapd(z);
 	} else {
 		p->numa_miss++;
 		zonelist->zones[0]->pageset[cpu].numa_foreign++;
Index: linux-2.6.9/kernel/sysctl.c
===================================================================
--- linux-2.6.9.orig/kernel/sysctl.c	2004-10-25 13:09:18.000000000 -0700
+++ linux-2.6.9/kernel/sysctl.c	2004-10-28 10:12:19.000000000 -0700
@@ -67,6 +67,9 @@
 extern int printk_ratelimit_jiffies;
 extern int printk_ratelimit_burst;
 extern int pid_max_min, pid_max_max;
+#ifdef CONFIG_NUMA
+extern int sysctl_node_swap;
+#endif

 #if defined(CONFIG_X86_LOCAL_APIC) && defined(__i386__)
 int unknown_nmi_panic;
@@ -816,7 +819,17 @@
 		.strategy	= &sysctl_jiffies,
 	},
 #endif
-	{ .ctl_name = 0 }
+#ifdef CONFIG_NUMA
+	{
+		.ctl_name	= VM_NODE_SWAP,
+		.procname	= "node_swap",
+		.data		= &sysctl_node_swap,
+		.maxlen		= sizeof(sysctl_node_swap),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec
+	},
+#endif
+		{ .ctl_name = 0 }
 };

 static ctl_table proc_table[] = {
Index: linux-2.6.9/include/linux/sysctl.h
===================================================================
--- linux-2.6.9.orig/include/linux/sysctl.h	2004-10-25 13:09:18.000000000 -0700
+++ linux-2.6.9/include/linux/sysctl.h	2004-10-28 10:12:19.000000000 -0700
@@ -168,6 +168,7 @@
 	VM_VFS_CACHE_PRESSURE=26, /* dcache/icache reclaim pressure */
 	VM_LEGACY_VA_LAYOUT=27, /* legacy/compatibility virtual address space layout */
 	VM_SWAP_TOKEN_TIMEOUT=28, /* default time for token time out */
+	VM_NODE_SWAP=29,	/* Swap local node memory limit (in % *10) */
 };


Index: linux-2.6.9/mm/vmscan.c
===================================================================
--- linux-2.6.9.orig/mm/vmscan.c	2004-10-18 14:53:21.000000000 -0700
+++ linux-2.6.9/mm/vmscan.c	2004-10-28 10:12:19.000000000 -0700
@@ -1171,9 +1171,17 @@
  */
 void wakeup_kswapd(struct zone *zone)
 {
+#ifdef CONFIG_NUMA
+	extern int sysctl_node_swap;
+#endif
+
 	if (zone->present_pages == 0)
 		return;
+#ifdef CONFIG_NUMA
+	if (zone->free_pages > (zone->present_pages * sysctl_node_swap) << 10 && zone->free_pages > zone->pages_low)
+#else
 	if (zone->free_pages > zone->pages_low)
+#endif
 		return;
 	if (!waitqueue_active(&zone->zone_pgdat->kswapd_wait))
 		return;
Index: linux-2.6.9/Documentation/vm/numa_node_swap
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.9/Documentation/vm/numa_node_swap	2004-10-28 10:12:19.000000000 -0700
@@ -0,0 +1,24 @@
+In a NUMA systems single nodes may run out of memory. This may occur even
+by only reading from files which will clutter node memory with cached
+pages from the file.
+
+However, as long as the system as a whole does have enough memory
+available, kswapd is not run at all. This means that a process allocating
+memory and running on a node that has no memory left, will get memory
+allocated from other nodes which is inefficient to handle. It would be
+better if kswapd would throw out some pages (maybe some of the cached
+pages from files that have only once been read) to reclaim memory in the
+node.
+
+The following patch checks the memory usage after each allocation in a
+zone. If the allocation in a zone falls below a certain minimum, kswapd is
+started for that zone alone.
+
+The minimum may be controlled through /proc/sys/vm/node_swap which is set
+to zero by default and thus is off.
+
+If it is set for example to 100 then kswapd will be run on
+a zone/node if less than 10% of pages are available after an allocation.
+
+October 25, 2004, Christoph Lameter <christoph@lameter.com>
+
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2004-10-28 20:45 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-10-28 15:23 NUMA node swapping V3 Christoph Lameter
2004-10-28 15:40 ` Martin J. Bligh
2004-10-28 15:49   ` Christoph Lameter
2004-10-28 16:20 ` Mika Penttilä
2004-10-28 20:45   ` Christoph Lameter

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox