linux-mm.kvack.org archive mirror
* [Request-For-Test] [PATCH] change zonelist order v6 [0/3] Introduction
@ 2007-05-10  7:16 KAMEZAWA Hiroyuki
  2007-05-10  7:23 ` [Request-For-Test] [PATCH] change zonelist order v6 [1/3] zonelist order selection logic KAMEZAWA Hiroyuki
                   ` (3 more replies)
  0 siblings, 4 replies; 6+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-05-10  7:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linux-MM, Lee.Schermerhorn, apw, Christoph Lameter, AKPM,
	Andi Kleen, jbarnes, kamezawa.hiroyu

This is version 6 of the zonelist-order-fix patch, against 2.6.21-mm2.

It works as expected in my ia64/NUMA environment, and I found no problems
on x86/non-NUMA. (This patch changes nothing for non-NUMA.)

There are many types of NUMA systems, and this patch affects the memory
allocation logic of *all* of them. Please test.

ChangeLog V5 -> V6
- some cleanups and compile fixes (no logic change)
- merged documentation fix from Lee Schermerhorn.
- simplified kernel-parameters.txt
- adjusted to 2.6.21-mm2.

ChangeLog V4 -> V5
- separated 'doc' patch and rewrote it.
- more clean ups.
- sysctl/boot option params are simplified.

ChangeLog V2 -> V4
- automatic configuration is added.
- automatic configuration is now default.
- relaxed_zone_order is renamed to numa_zonelist_order;
  you can specify the value "default", "zone", or "numa"
- clean-up from Lee Schermerhorn
- patch is separated into "base" and "autoconfiguration algorithm"

Changelog from V1 -> V2
- sysctl name is changed to be relaxed_zone_order
- NORMAL->NORMAL->....->DMA->DMA->DMA order (new ordering) is now default.
  NORMAL->DMA->NORMAL->DMA order (old ordering) is optional.
- added boot option to set relaxed_zone_order. ia64 is supported now.
- Added documentation

As in the previous post, thanks to Lee Schermerhorn for his great help.

[patch set]
[1/3] ---- add zonelist selection logic.
[2/3] ---- add automatic configuration of zonelist order
[3/3] ---- add documentation.

Any comments are welcome.

[Description]
This patch modifies the zonelist order on NUMA. It offers two zonelist
orders.
(TypeA) zones are ordered by node locality, then by zone type
(TypeB) zones are ordered by zone type, then by node locality

(TypeA) is called "Node Order"; (TypeB) is called "Zone Order".
The default zonelist order is determined automatically by the kernel.


Assume a 2-node NUMA system where Node(0) has ZONE_DMA/ZONE_NORMAL and
Node(1) has ZONE_NORMAL.
In this case, the zonelist for GFP_KERNEL on Node(0) will be

In "Node Order",  Node(0)NORMAL -> Node(0)DMA -> Node(1)NORMAL
In "Zone Order",  Node(0)NORMAL -> Node(1)NORMAL -> Node(0)DMA

"Node Order" guarantees better locality, but "Zone Order" places
ZONE_DMA at the tail of the zonelist. This makes the zonelist robust against
OOM on ZONE_DMA, which tends to be small.
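
The two orderings above can be reproduced with a small userspace sketch.
This is not the kernel code: the names (zone_ref, populated,
build_node_order, build_zone_order, print_zonelist) are made up for
illustration, the topology is hard-coded to the 2-node example, and the
list of nodes sorted by distance is assumed to be given by the caller.

```c
#include <stdio.h>

#define NR_NODES      2
#define NR_ZONE_TYPES 2			/* 0 = DMA, 1 = NORMAL */
#define MAX_ZONEREFS  (NR_NODES * NR_ZONE_TYPES)

/* One zonelist entry: which node, which zone type (illustrative type). */
struct zone_ref {
	int node;
	int zone_type;
};

/* populated[node][zone_type] is nonzero if that zone has memory. */
static const int populated[NR_NODES][NR_ZONE_TYPES] = {
	{ 1, 1 },	/* Node(0): DMA and NORMAL */
	{ 0, 1 },	/* Node(1): NORMAL only */
};

/* "Node Order": walk nodes by distance, and within a node go from the
 * highest allowed zone type down to the lowest. */
int build_node_order(const int *nodes_by_distance, int highest_zone,
		     struct zone_ref *out)
{
	int n = 0;

	for (int i = 0; i < NR_NODES; i++) {
		int node = nodes_by_distance[i];

		for (int zt = highest_zone; zt >= 0; zt--)
			if (populated[node][zt])
				out[n++] = (struct zone_ref){ node, zt };
	}
	return n;
}

/* "Zone Order": walk zone types from highest to lowest, and within a
 * zone type walk nodes by distance. */
int build_zone_order(const int *nodes_by_distance, int highest_zone,
		     struct zone_ref *out)
{
	int n = 0;

	for (int zt = highest_zone; zt >= 0; zt--) {
		for (int i = 0; i < NR_NODES; i++) {
			int node = nodes_by_distance[i];

			if (populated[node][zt])
				out[n++] = (struct zone_ref){ node, zt };
		}
	}
	return n;
}

/* Print a zonelist in the "Node(0)NORMAL -> Node(0)DMA" notation. */
void print_zonelist(const char *label, const struct zone_ref *zl, int n)
{
	static const char *const zone_name[NR_ZONE_TYPES] = { "DMA", "NORMAL" };

	printf("%s:", label);
	for (int i = 0; i < n; i++)
		printf("%s Node(%d)%s", i ? " ->" : "",
		       zl[i].node, zone_name[zl[i].zone_type]);
	printf("\n");
}
```

Calling both builders with nodes_by_distance = {0, 1} and highest_zone = 1
(NORMAL, as for GFP_KERNEL) and a zone_ref array of MAX_ZONEREFS entries
gives the two Node(0) lists shown above; only the nesting of the two loops
differs.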

"Which is better ?" 
It depends on a system's environment and memory usage, I think.

[Case Study]
On my (and other) ia64 NUMA boxes, only Node(0) has ZONE_DMA, and it is 2GB.
Assume a machine with the following configuration.

Node 0:   12GB of memory   10GB NORMAL 2GB DMA
Node 1:   12GB of memory   12GB NORMAL
Node 2:   12GB of memory   12GB NORMAL

Start a process on Node(0) that uses 12GB of memory. Memory usage will
then be
Node 0:   0/12 GB of memory is available. NORMAL: empty  DMA: empty
Node 1:  12/12 GB of memory is available. NORMAL: 12G
Node 2:  12/12 GB of memory is available. NORMAL: 12G

The interesting point is that ZONE_DMA is exhausted while ZONE_NORMAL
memory is still available on the other nodes.
This is the current kernel's behavior. It can cause OOM very easily if the
system has a device which uses GFP_DMA.

This patch fixes this kind of situation as follows (by using "Zone Order"):
Node 0:   2/12 GB of memory is available. NORMAL: empty  DMA: 2G
Node 1:  10/12 GB of memory is available. NORMAL: 10G
Node 2:  12/12 GB of memory is available. NORMAL: 12G

A user can say "Good bye, OOM-Killer", but 2GB of memory is allocated from
off-node memory. It's a trade-off.
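
The numbers in this case study can be checked with a toy model. The sketch
below is only an illustration under simple assumptions: memory is counted in
whole GB, an allocation simply drains zones in zonelist order, and the names
sim_zone and alloc_from_zonelist are hypothetical, not kernel symbols.

```c
#include <stddef.h>

/* A zone reduced to a label and its free memory in GB (toy model). */
struct sim_zone {
	const char *name;
	int free_gb;
};

/*
 * Take request_gb from the zones in zonelist order, draining each zone
 * before moving to the next one. Returns the amount that could not be
 * satisfied (0 on success).
 */
int alloc_from_zonelist(struct sim_zone **zonelist, size_t nr_zones,
			int request_gb)
{
	for (size_t i = 0; i < nr_zones && request_gb > 0; i++) {
		struct sim_zone *z = zonelist[i];
		int take = z->free_gb < request_gb ? z->free_gb : request_gb;

		z->free_gb -= take;
		request_gb -= take;
	}
	return request_gb;
}
```

With Node(0)'s list in Node Order (Node0 NORMAL, Node0 DMA, Node1 NORMAL,
Node2 NORMAL), a 12GB request empties both Node(0) zones. With the list in
Zone Order (Node0 NORMAL, Node1 NORMAL, Node2 NORMAL, Node0 DMA), the same
request leaves the 2GB of DMA untouched and takes 2GB from Node(1).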

-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* [Request-For-Test] [PATCH] change zonelist order v6 [1/3] zonelist order selection logic
  2007-05-10  7:16 [Request-For-Test] [PATCH] change zonelist order v6 [0/3] Introduction KAMEZAWA Hiroyuki
@ 2007-05-10  7:23 ` KAMEZAWA Hiroyuki
  2007-05-10  7:24 ` [Request-For-Test] [PATCH] change zonelist order v6 [2/3] auto configuration KAMEZAWA Hiroyuki
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 6+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-05-10  7:23 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, linux-mm, Lee.Schermerhorn, apw, clameter, akpm,
	ak, jbarnes

Make zonelist creation policy selectable from sysctl/boot option v6.

This patch makes the order of NUMA's zonelists (of pgdat) selectable.
Available orders are Default (automatic) / Node-based / Zone-based.

[Default Order]
The kernel selects Node-based or Zone-based order automatically.

[Node-based Order]
This policy treats the locality of memory as the most important parameter.
The zonelist is ordered by each zone's locality. This means lower zones
(e.g. ZONE_DMA) can be used before higher zones (e.g. ZONE_NORMAL) are
exhausted.
IOW, ZONE_DMA will be in the middle of the zonelist.
The current 2.6.21 kernel uses this order.

Pros.
 * A user can expect local memory as much as possible.
Cons.
 * Lower zones will be exhausted before higher zones. This may cause OOM_KILL.

Maybe suitable if ZONE_DMA is relatively big, you never see OOM_KILL
because of ZONE_DMA exhaustion, and you need the best locality.

(example)
Assume a 2-node NUMA system: node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL.

*node(0)'s memory allocation order:

 node(0)'s NORMAL -> node(0)'s DMA -> node(1)'s NORMAL.

*node(1)'s memory allocation order:
 
 node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA.

[Zone-based order]
This policy treats the zone type as the most important parameter.
The zonelist is ordered by zone type. This means lower zones are
never used before higher zones are exhausted.
IOW, ZONE_DMA will always be at the tail of the zonelist.

Pros.
 * OOM_KILL (because of a lower zone) occurs only if all zones are exhausted.
Cons.
 * Memory locality may not be the best.

(example)
Assume a 2-node NUMA system: node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL.

*node(0)'s memory allocation order:

 node(0)'s NORMAL -> node(1)'s NORMAL -> node(0)'s DMA.

*node(1)'s memory allocation order:

 node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA.

The boot option "numa_zonelist_order=" and proc/sysctl are supported.

command:
%echo N > /proc/sys/vm/numa_zonelist_order

Will rebuild zonelist in Node-based order.

command:
%echo Z > /proc/sys/vm/numa_zonelist_order

Will rebuild zonelist in Zone-based order.
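
As the parser in the patch below shows, only the first character of the
value is inspected, so "N", "n", "node" and "Node" all select Node-based
order. A userspace sketch of that matching rule follows; the enum and the
function name parse_zonelist_order are illustrative stand-ins, not the
kernel's symbols, and the NULL check is an addition for the sketch.

```c
#include <stddef.h>

/* Illustrative mirror of the three order values used by the patch. */
enum zonelist_order {
	ORDER_DEFAULT = 0,
	ORDER_NODE    = 1,
	ORDER_ZONE    = 2,
};

/*
 * Map a user-supplied string to an order by looking at its first
 * character only. Returns -1 for NULL, empty, or unknown input.
 */
int parse_zonelist_order(const char *s)
{
	if (s == NULL)
		return -1;

	switch (*s) {
	case 'd': case 'D':
		return ORDER_DEFAULT;
	case 'n': case 'N':
		return ORDER_NODE;
	case 'z': case 'Z':
		return ORDER_ZONE;
	default:
		return -1;
	}
}
```

In the patch itself an invalid value is rejected with -EINVAL, and the
sysctl handler restores the previous string instead of rebuilding the
zonelists.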

Thanks to Lee Schermerhorn, who gave me much help and code.

Signed-Off-By: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 include/linux/mmzone.h |    5 +
 kernel/sysctl.c        |   11 ++
 mm/page_alloc.c        |  213 +++++++++++++++++++++++++++++++++++++++++++++----
 3 files changed, 213 insertions(+), 16 deletions(-)

Index: linux-2.6.21-mm2/kernel/sysctl.c
===================================================================
--- linux-2.6.21-mm2.orig/kernel/sysctl.c
+++ linux-2.6.21-mm2/kernel/sysctl.c
@@ -886,6 +886,17 @@ static ctl_table vm_table[] = {
 		.proc_handler	= &proc_dointvec_jiffies,
 		.strategy	= &sysctl_jiffies,
 	},
+#ifdef CONFIG_NUMA
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "numa_zonelist_order",
+		.data		= &numa_zonelist_order,
+		.maxlen		= NUMA_ZONELIST_ORDER_LEN,
+		.mode		= 0644,
+		.proc_handler	= &numa_zonelist_order_handler,
+		.strategy	= &sysctl_string,
+	},
+#endif
 #endif
 #if defined(CONFIG_X86_32) || \
    (defined(CONFIG_SUPERH) && defined(CONFIG_VSYSCALL))
Index: linux-2.6.21-mm2/mm/page_alloc.c
===================================================================
--- linux-2.6.21-mm2.orig/mm/page_alloc.c
+++ linux-2.6.21-mm2/mm/page_alloc.c
@@ -1994,9 +1994,102 @@ static int __meminit build_zonelists_nod
 	return nr_zones;
 }
 
+
+/*
+ *  zonelist_order:
+ *  0 = automatic detection of better ordering.
+ *  1 = order by ([node] distance, -zonetype)
+ *  2 = order by (-zonetype, [node] distance)
+ *
+ *  If not NUMA, ZONELIST_ORDER_ZONE and ZONELIST_ORDER_NODE will create
+ *  the same zonelist. So only NUMA can configure this param.
+ */
+#define ZONELIST_ORDER_DEFAULT  0
+#define ZONELIST_ORDER_NODE     1
+#define ZONELIST_ORDER_ZONE     2
+
+/* zonelist order in the kernel.
+ * set_zonelist_order() will set this to NODE or ZONE.
+ */
+static int current_zonelist_order = ZONELIST_ORDER_DEFAULT;
+static char zonelist_order_name[3][8] = {"Default", "Node", "Zone"};
+
+
 #ifdef CONFIG_NUMA
+/* The value user specified ....changed by config */
+static int user_zonelist_order = ZONELIST_ORDER_DEFAULT;
+/* string for sysctl */
+#define NUMA_ZONELIST_ORDER_LEN	16
+char numa_zonelist_order[16] = "default";
+
+/*
+ * interface for configure zonelist ordering.
+ * command line option "numa_zonelist_order"
+ *	= "[dD]efault	- default, automatic configuration.
+ *	= "[nN]ode 	- order by node locality, then by zone within node
+ *	= "[zZ]one      - order by zone, then by locality within zone
+ */
+
+static int __parse_numa_zonelist_order(char *s)
+{
+	if (*s == 'd' || *s == 'D') {
+		user_zonelist_order = ZONELIST_ORDER_DEFAULT;
+	} else if (*s == 'n' || *s == 'N') {
+		user_zonelist_order = ZONELIST_ORDER_NODE;
+	} else if (*s == 'z' || *s == 'Z') {
+		user_zonelist_order = ZONELIST_ORDER_ZONE;
+	} else {
+		printk(KERN_WARNING
+			"Ignoring invalid numa_zonelist_order value:  "
+			"%s\n", s);
+		return -EINVAL;
+	}
+	return 0;
+}
+
+static __init int setup_numa_zonelist_order(char *s)
+{
+	if (s)
+		return __parse_numa_zonelist_order(s);
+	return 0;
+}
+early_param("numa_zonelist_order", setup_numa_zonelist_order);
+
+/*
+ * sysctl handler for numa_zonelist_order
+ */
+int numa_zonelist_order_handler(ctl_table *table, int write,
+		struct file *file, void __user *buffer, size_t *length,
+		loff_t *ppos)
+{
+	char saved_string[NUMA_ZONELIST_ORDER_LEN];
+	int ret;
+
+	if (write)
+		strncpy(saved_string, (char*)table->data,
+			NUMA_ZONELIST_ORDER_LEN);
+	ret = proc_dostring(table, write, file, buffer, length, ppos);
+	if (ret)
+		return ret;
+	if (write) {
+		int oldval = user_zonelist_order;
+		if (__parse_numa_zonelist_order((char*)table->data)) {
+			/*
+			 * bogus value.  restore saved string
+			 */
+			strncpy((char*)table->data, saved_string,
+				NUMA_ZONELIST_ORDER_LEN);
+			user_zonelist_order = oldval;
+		} else if (oldval != user_zonelist_order)
+			build_all_zonelists();
+	}
+	return 0;
+}
+
+
 #define MAX_NODE_LOAD (num_online_nodes())
-static int __meminitdata node_load[MAX_NUMNODES];
+static int node_load[MAX_NUMNODES];
+
 /**
  * find_next_best_node - find the next node that should appear in a given node's fallback list
  * @node: node whose fallback list we're appending
@@ -2011,7 +2104,7 @@ static int __meminitdata node_load[MAX_N
  * on them otherwise.
  * It returns -1 if no node is found.
  */
-static int __meminit find_next_best_node(int node, nodemask_t *used_node_mask)
+static int find_next_best_node(int node, nodemask_t *used_node_mask)
 {
 	int n, val;
 	int min_val = INT_MAX;
@@ -2057,13 +2150,83 @@ static int __meminit find_next_best_node
 	return best_node;
 }
 
-static void __meminit build_zonelists(pg_data_t *pgdat)
+
+/*
+ * Build zonelists ordered by node and zones within node.
+ * This results in maximum locality--normal zone overflows into local
+ * DMA zone, if any--but risks exhausting DMA zone.
+ */
+static void build_zonelists_in_node_order(pg_data_t *pgdat, int node)
 {
-	int j, node, local_node;
 	enum zone_type i;
-	int prev_node, load;
+	int j;
+	struct zonelist *zonelist;
+
+	for (i = 0; i < MAX_NR_ZONES; i++) {
+		zonelist = pgdat->node_zonelists + i;
+		for (j = 0; zonelist->zones[j] != NULL; j++);
+
+ 		j = build_zonelists_node(NODE_DATA(node), zonelist, j, i);
+		zonelist->zones[j] = NULL;
+	}
+}
+
+/*
+ * Build zonelists ordered by zone and nodes within zones.
+ * This results in conserving DMA zone[s] until all Normal memory is
+ * exhausted, but results in overflowing to remote node while memory
+ * may still exist in local DMA zone.
+ */
+static int node_order[MAX_NUMNODES];
+
+static void build_zonelists_in_zone_order(pg_data_t *pgdat, int nr_nodes)
+{
+	enum zone_type i;
+	int pos, j, node;
+	int zone_type;		/* needs to be signed */
+	struct zone *z;
 	struct zonelist *zonelist;
+
+	for (i = 0; i < MAX_NR_ZONES; i++) {
+		zonelist = pgdat->node_zonelists + i;
+		pos = 0;
+		for (zone_type = i; zone_type >= 0; zone_type--) {
+			for (j = 0; j < nr_nodes; j++) {
+				node = node_order[j];
+				z = &NODE_DATA(node)->node_zones[zone_type];
+				if (populated_zone(z))
+					zonelist->zones[pos++] = z;
+			}
+		}
+		zonelist->zones[pos] = NULL;
+	}
+}
+
+static int default_zonelist_order(void)
+{
+	/* dummy, just select node order. */
+	return ZONELIST_ORDER_NODE;
+}
+
+static void set_zonelist_order(void)
+{
+	/* dummy, just select node order. */
+	if (user_zonelist_order == ZONELIST_ORDER_DEFAULT)
+		current_zonelist_order = default_zonelist_order();
+	else
+		current_zonelist_order = user_zonelist_order;
+}
+
+
+
+static void build_zonelists(pg_data_t *pgdat)
+{
+	int j, node, load;
+	enum zone_type i;
 	nodemask_t used_mask;
+	int local_node, prev_node;
+	struct zonelist *zonelist;
+	int order = current_zonelist_order;
 
 	/* initialize zonelists */
 	for (i = 0; i < MAX_NR_ZONES; i++) {
@@ -2076,6 +2239,11 @@ static void __meminit build_zonelists(pg
 	load = num_online_nodes();
 	prev_node = local_node;
 	nodes_clear(used_mask);
+
+	memset(node_load, 0, sizeof(node_load));
+	memset(node_order, 0, sizeof(node_order));
+	j = 0;
+
 	while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
 		int distance = node_distance(local_node, node);
 
@@ -2091,18 +2259,20 @@ static void __meminit build_zonelists(pg
 		 * So adding penalty to the first node in same
 		 * distance group to make it round-robin.
 		 */
-
 		if (distance != node_distance(local_node, prev_node))
-			node_load[node] += load;
+			node_load[node] = load;
+
 		prev_node = node;
 		load--;
-		for (i = 0; i < MAX_NR_ZONES; i++) {
-			zonelist = pgdat->node_zonelists + i;
-			for (j = 0; zonelist->zones[j] != NULL; j++);
+		if (order == ZONELIST_ORDER_NODE)
+			build_zonelists_in_node_order(pgdat, node);
+		else
+			node_order[j++] = node;	/* remember order */
+	}
 
-	 		j = build_zonelists_node(NODE_DATA(node), zonelist, j, i);
-			zonelist->zones[j] = NULL;
-		}
+	if (order == ZONELIST_ORDER_ZONE) {
+		/* calculate node order -- i.e., DMA last! */
+		build_zonelists_in_zone_order(pgdat, j);
 	}
 }
 
@@ -2124,8 +2294,14 @@ static void __meminit build_zonelist_cac
 	}
 }
 
+
 #else	/* CONFIG_NUMA */
 
+static void set_zonelist_order(void)
+{
+	current_zonelist_order = ZONELIST_ORDER_ZONE;
+}
+
 static void __meminit build_zonelists(pg_data_t *pgdat)
 {
 	int node, local_node;
@@ -2173,7 +2349,7 @@ static void __meminit build_zonelist_cac
 #endif	/* CONFIG_NUMA */
 
 /* return values int ....just for stop_machine_run() */
-static int __meminit __build_all_zonelists(void *dummy)
+static int __build_all_zonelists(void *dummy)
 {
 	int nid;
 
@@ -2184,8 +2360,10 @@ static int __meminit __build_all_zonelis
 	return 0;
 }
 
-void __meminit build_all_zonelists(void)
+void build_all_zonelists(void)
 {
+	set_zonelist_order();
+
 	if (system_state == SYSTEM_BOOTING) {
 		__build_all_zonelists(NULL);
 		cpuset_init_current_mems_allowed();
@@ -2209,8 +2387,10 @@ void __meminit build_all_zonelists(void)
 	else
 		page_group_by_mobility_disabled = 0;
 
-	printk("Built %i zonelists, mobility grouping %s.  Total pages: %ld\n",
+	printk("Built %i zonelists in %s order, mobility grouping %s.  "
+	       "Total pages: %ld\n",
 			num_online_nodes(),
+			zonelist_order_name[current_zonelist_order],
 			page_group_by_mobility_disabled ? "off" : "on",
 			vm_total_pages);
 }
Index: linux-2.6.21-mm2/include/linux/mmzone.h
===================================================================
--- linux-2.6.21-mm2.orig/include/linux/mmzone.h
+++ linux-2.6.21-mm2/include/linux/mmzone.h
@@ -610,6 +610,11 @@ int sysctl_min_unmapped_ratio_sysctl_han
 int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
 			struct file *, void __user *, size_t *, loff_t *);
 
+extern int numa_zonelist_order_handler(struct ctl_table *, int,
+			struct file *, void __user *, size_t *, loff_t *);
+extern char numa_zonelist_order[];
+#define NUMA_ZONELIST_ORDER_LEN 16	/* string buffer size */
+
 #include <linux/topology.h>
 /* Returns the number of the current Node. */
 #ifndef numa_node_id


* [Request-For-Test] [PATCH] change zonelist order v6 [2/3] auto configuration
  2007-05-10  7:16 [Request-For-Test] [PATCH] change zonelist order v6 [0/3] Introduction KAMEZAWA Hiroyuki
  2007-05-10  7:23 ` [Request-For-Test] [PATCH] change zonelist order v6 [1/3] zonelist order selection logic KAMEZAWA Hiroyuki
@ 2007-05-10  7:24 ` KAMEZAWA Hiroyuki
  2007-05-10  7:26 ` [Request-For-Test] [PATCH] change zonelist order v6 [3/3] documentaion KAMEZAWA Hiroyuki
  2007-05-10  8:36 ` [Request-For-Test] [PATCH] change zonelist order v6 [0/3] Introduction Andrew Morton
  3 siblings, 0 replies; 6+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-05-10  7:24 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, linux-mm, Lee.Schermerhorn, apw, clameter, akpm,
	ak, jbarnes

Add auto zone ordering configuration.

This function will select ZONELIST_ORDER_NODE when

There is no ZONE_DMA/ZONE_DMA32 at all, or only ZONE_DMA or ZONE_DMA32 exists.
|| size of (ZONE_DMA/DMA32) > (System Total Memory)/2
|| Assume Node(A)
	Node(A) is big enough &&
	Node(A)'s ZONE_DMA/DMA32 occupies 60% of Node(A)'s memory.
	(In this case, ZONELIST_ORDER_ZONE may not offer enough locality...)

Otherwise, ZONELIST_ORDER_ZONE is selected.

Maybe there is no best way to configure zone order. I wrote this based on
my experience and discussion on the list.

Anyway, a user can specify the zone order from the boot option/sysctl.
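
The selection rule can be tried out in userspace with the sketch below. It
follows the structure of default_zonelist_order() in the diff, but the types
and names (node_mem, order_choice, choose_zonelist_order) are hypothetical.
The per-node threshold is a parameter here because the description above
says 60% while the code in this patch compares against 70/100.

```c
#include <stddef.h>

enum order_choice {
	CHOOSE_NODE_ORDER,
	CHOOSE_ZONE_ORDER,
};

/* Per-node summary: memory below ZONE_NORMAL, and all memory (in pages). */
struct node_mem {
	unsigned long low_pages;
	unsigned long total_pages;
};

enum order_choice choose_zonelist_order(const struct node_mem *nodes,
					size_t nr_nodes,
					unsigned int node_low_percent)
{
	unsigned long low = 0, total = 0, average;

	for (size_t i = 0; i < nr_nodes; i++) {
		low += nodes[i].low_pages;
		total += nodes[i].total_pages;
	}

	/* No DMA area at all, or DMA/DMA32 is more than half of memory. */
	if (low == 0 || low > total / 2)
		return CHOOSE_NODE_ORDER;

	/* Same divisor as the patch: number of nodes plus one. */
	average = total / (nr_nodes + 1);

	for (size_t i = 0; i < nr_nodes; i++) {
		const struct node_mem *n = &nodes[i];

		if (n->low_pages &&
		    n->total_pages > average &&	/* ignore small nodes */
		    n->low_pages > n->total_pages * node_low_percent / 100)
			return CHOOSE_NODE_ORDER;
	}
	return CHOOSE_ZONE_ORDER;
}
```

For the ia64 layout from the introduction (one node with 2 of 12 units in
DMA, two nodes with none), this returns CHOOSE_ZONE_ORDER with either
threshold, since no rule for Node order applies.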

Signed-Off-By: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 mm/page_alloc.c |   51 +++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 49 insertions(+), 2 deletions(-)

Index: linux-2.6.21-mm2/mm/page_alloc.c
===================================================================
--- linux-2.6.21-mm2.orig/mm/page_alloc.c
+++ linux-2.6.21-mm2/mm/page_alloc.c
@@ -2204,8 +2204,55 @@ static void build_zonelists_in_zone_orde
 
 static int default_zonelist_order(void)
 {
-	/* dummy, just select node order. */
-	return ZONELIST_ORDER_NODE;
+	int nid, zone_type;
+	unsigned long low_kmem_size,total_size;
+	struct zone *z;
+	int average_size;
+	/*
+	 * ZONE_DMA and ZONE_DMA32 can be a very small area in the system.
+	 * If they are really small and used heavily, the system can fall
+	 * into OOM very easily.
+	 * This function detects the ZONE_DMA/DMA32 size and configures zone order.
+	 */
+	/* Is there ZONE_NORMAL ? (ex. ppc has only DMA zone..) */
+	low_kmem_size = 0;
+	total_size = 0;
+	for_each_online_node(nid) {
+		for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++) {
+			z = &NODE_DATA(nid)->node_zones[zone_type];
+			if (populated_zone(z)) {
+				if (zone_type < ZONE_NORMAL)
+					low_kmem_size += z->present_pages;
+				total_size += z->present_pages;
+			}
+		}
+	}
+	if (!low_kmem_size ||  /* there are no DMA area. */
+	    low_kmem_size > total_size/2) /* DMA/DMA32 is big. */
+		return ZONELIST_ORDER_NODE;
+	/*
+	 * look into each node's config.
+	 * If there is a node whose DMA/DMA32 memory is very big area on
+	 * local memory, NODE_ORDER may be suitable.
+	 */
+	average_size = total_size / (num_online_nodes() + 1);
+	for_each_online_node(nid) {
+		low_kmem_size = 0;
+		total_size = 0;
+		for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++) {
+			z = &NODE_DATA(nid)->node_zones[zone_type];
+			if (populated_zone(z)) {
+				if (zone_type < ZONE_NORMAL)
+					low_kmem_size += z->present_pages;
+				total_size += z->present_pages;
+			}
+		}
+		if (low_kmem_size &&
+		    total_size > average_size && /* ignore small node */
+		    low_kmem_size > total_size * 70/100)
+			return ZONELIST_ORDER_NODE;
+	}
+	return ZONELIST_ORDER_ZONE;
 }
 
 static void set_zonelist_order(void)


* [Request-For-Test] [PATCH] change zonelist order v6 [3/3] documentaion
  2007-05-10  7:16 [Request-For-Test] [PATCH] change zonelist order v6 [0/3] Introduction KAMEZAWA Hiroyuki
  2007-05-10  7:23 ` [Request-For-Test] [PATCH] change zonelist order v6 [1/3] zonelist order selection logic KAMEZAWA Hiroyuki
  2007-05-10  7:24 ` [Request-For-Test] [PATCH] change zonelist order v6 [2/3] auto configuration KAMEZAWA Hiroyuki
@ 2007-05-10  7:26 ` KAMEZAWA Hiroyuki
  2007-05-10  8:36 ` [Request-For-Test] [PATCH] change zonelist order v6 [0/3] Introduction Andrew Morton
  3 siblings, 0 replies; 6+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-05-10  7:26 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, linux-mm, Lee.Schermerhorn, apw, clameter, akpm,
	ak, jbarnes

Documentation for numa_zonelist_order.


Signed-Off-By: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 Documentation/kernel-parameters.txt |    5 +++
 Documentation/sysctl/vm.txt         |   46 ++++++++++++++++++++++++++++++++++++
 2 files changed, 51 insertions(+)

Index: linux-2.6.21-mm2/Documentation/sysctl/vm.txt
===================================================================
--- linux-2.6.21-mm2.orig/Documentation/sysctl/vm.txt
+++ linux-2.6.21-mm2/Documentation/sysctl/vm.txt
@@ -33,6 +33,7 @@ Currently, these files are in /proc/sys/
 - panic_on_oom
 - swap_prefetch
 - stat_interval
+- numa_zonelist_order
 
 ==============================================================
 
@@ -248,3 +249,48 @@ determines the frequency of these consol
 
 The default value is 1 second.
 
+==============================================================
+
+numa_zonelist_order
+
+This sysctl is only for NUMA.
+'where the memory is allocated from' is controlled by zonelists.
+(This documentation ignores ZONE_HIGHMEM/ZONE_DMA32 for simple explanation.
+ you may be able to read ZONE_DMA as ZONE_DMA32...)
+
+In the non-NUMA case, a zonelist for GFP_KERNEL is ordered as follows.
+ZONE_NORMAL -> ZONE_DMA
+This means that a memory allocation request for GFP_KERNEL will
+get memory from ZONE_DMA only when ZONE_NORMAL is not available.
+
+In the NUMA case, you can think of the following 2 types of order.
+Assume a 2-node NUMA system; below is the zonelist for Node(0)'s GFP_KERNEL.
+
+(A) Node(0) ZONE_NORMAL -> Node(0) ZONE_DMA -> Node(1) ZONE_NORMAL
+(B) Node(0) ZONE_NORMAL -> Node(1) ZONE_NORMAL -> Node(0) ZONE_DMA.
+
+Type(A) offers the best locality for processes on Node(0), but ZONE_DMA
+will be used before ZONE_NORMAL exhaustion. This increases the possibility of
+out-of-memory (OOM) in ZONE_DMA because ZONE_DMA tends to be small.
+
+Type(B) cannot offer the best locality but is more robust against OOM of
+the DMA zone.
+
+Type(A) is called "Node" order. Type(B) is called "Zone" order.
+
+"Node order" orders the zonelists by node, then by zone within each node.
+Specify "[Nn]ode" for node order.
+
+"Zone order" orders the zonelists by zone type, then by node within each
+zone.  Specify "[Zz]one" for zone order.
+
+Specify "[Dd]efault" to request automatic configuration.  Autoconfiguration
+will select "node" order in the following cases:
+(1) if the DMA zone does not exist or
+(2) if the DMA zone comprises greater than 50% of the available memory or
+(3) if any node's DMA zone comprises greater than 60% of its local memory and
+    the amount of local memory is big enough.
+
+Otherwise, "zone" order will be selected. Default order is recommended unless
+this is causing problems for your system/application.
+
Index: linux-2.6.21-mm2/Documentation/kernel-parameters.txt
===================================================================
--- linux-2.6.21-mm2.orig/Documentation/kernel-parameters.txt
+++ linux-2.6.21-mm2/Documentation/kernel-parameters.txt
@@ -1231,6 +1231,11 @@ and is between 256 and 4096 characters. 
 
 	nowb		[ARM]
 
+	numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA.
+			one of ['zone', 'node', 'default'] can be specified
+			This can be set from sysctl after boot.
+			See Documentation/sysctl/vm.txt for details.
+
 	nr_uarts=	[SERIAL] maximum number of UARTs to be registered.
 
 	opl3=		[HW,OSS]


* Re: [Request-For-Test] [PATCH] change zonelist order v6 [0/3] Introduction
  2007-05-10  7:16 [Request-For-Test] [PATCH] change zonelist order v6 [0/3] Introduction KAMEZAWA Hiroyuki
                   ` (2 preceding siblings ...)
  2007-05-10  7:26 ` [Request-For-Test] [PATCH] change zonelist order v6 [3/3] documentaion KAMEZAWA Hiroyuki
@ 2007-05-10  8:36 ` Andrew Morton
  2007-05-10  9:05   ` KAMEZAWA Hiroyuki
  3 siblings, 1 reply; 6+ messages in thread
From: Andrew Morton @ 2007-05-10  8:36 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, Linux-MM, Lee.Schermerhorn, apw, Christoph Lameter,
	Andi Kleen, jbarnes

On Thu, 10 May 2007 16:16:11 +0900 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> This is zonelist-order-fix patch version 6. against 2.6.21-mm2.

This is new:

WARNING: mm/built-in.o - Section mismatch: reference to .init.text: from .text between '__build_all_zonelists' (at offset 0x3d13) and 'build_all_zonelists'
WARNING: mm/built-in.o - Section mismatch: reference to .init.text: from .text between '__build_all_zonelists' (at offset 0x3d2c) and 'build_all_zonelists'
WARNING: mm/built-in.o - Section mismatch: reference to .init.text: from .text between '__build_all_zonelists' (at offset 0x3d4b) and 'build_all_zonelists'

Using http://userweb.kernel.org/~akpm/config-sony.txt

Maybe it wasn't your patch which did this, I didn't check.


* Re: [Request-For-Test] [PATCH] change zonelist order v6 [0/3] Introduction
  2007-05-10  8:36 ` [Request-For-Test] [PATCH] change zonelist order v6 [0/3] Introduction Andrew Morton
@ 2007-05-10  9:05   ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 6+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-05-10  9:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Lee.Schermerhorn, apw, clameter, ak, jbarnes

On Thu, 10 May 2007 01:36:19 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Thu, 10 May 2007 16:16:11 +0900 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > This is zonelist-order-fix patch version 6. against 2.6.21-mm2.
> 
> This is new:
> 
> WARNING: mm/built-in.o - Section mismatch: reference to .init.text: from .text between '__build_all_zonelists' (at offset 0x3d13) and 'build_all_zonelists'
> WARNING: mm/built-in.o - Section mismatch: reference to .init.text: from .text between '__build_all_zonelists' (at offset 0x3d2c) and 'build_all_zonelists'
> WARNING: mm/built-in.o - Section mismatch: reference to .init.text: from .text between '__build_all_zonelists' (at offset 0x3d4b) and 'build_all_zonelists'
> 
> Using http://userweb.kernel.org/~akpm/config-sony.txt
> 
> Maybe it wasn't your patch which did this, I didn't check.
> 
Ah, thank you. Here is the fix. I turned off memory hotplug, and this is
the patch. Because precise control of these __meminit annotations would
need some #ifdefs, I removed them all.

-Kame
==
Fixes section mismatch.

Signed-Off-By: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Index: linux-2.6.21-mm2/mm/page_alloc.c
===================================================================
--- linux-2.6.21-mm2.orig/mm/page_alloc.c
+++ linux-2.6.21-mm2/mm/page_alloc.c
@@ -1974,7 +1974,7 @@ void show_free_areas(void)
  *
  * Add all populated zones of a node to the zonelist.
  */
-static int __meminit build_zonelists_node(pg_data_t *pgdat,
+static int build_zonelists_node(pg_data_t *pgdat,
 			struct zonelist *zonelist, int nr_zones, enum zone_type zone_type)
 {
 	struct zone *zone;
@@ -2324,7 +2324,7 @@ static void build_zonelists(pg_data_t *p
 }
 
 /* Construct the zonelist performance cache - see further mmzone.h */
-static void __meminit build_zonelist_cache(pg_data_t *pgdat)
+static void build_zonelist_cache(pg_data_t *pgdat)
 {
 	int i;
 
@@ -2349,7 +2349,7 @@ static void set_zonelist_order(void)
 	current_zonelist_order = ZONELIST_ORDER_ZONE;
 }
 
-static void __meminit build_zonelists(pg_data_t *pgdat)
+static void build_zonelists(pg_data_t *pgdat)
 {
 	int node, local_node;
 	enum zone_type i,j;
@@ -2385,7 +2385,7 @@ static void __meminit build_zonelists(pg
 }
 
 /* non-NUMA variant of zonelist performance cache - just NULL zlcache_ptr */
-static void __meminit build_zonelist_cache(pg_data_t *pgdat)
+static void build_zonelist_cache(pg_data_t *pgdat)
 {
 	int i;
 

