linux-mm.kvack.org archive mirror
* [PATCH] change zonelist order v5 [0/3]
@ 2007-05-08 11:14 KAMEZAWA Hiroyuki
  2007-05-08 11:16 ` [PATCH] change zonelist order v5 [1/3] implements zonelist order selection KAMEZAWA Hiroyuki
                   ` (4 more replies)
  0 siblings, 5 replies; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-05-08 11:14 UTC (permalink / raw)
  To: LKML
  Cc: Linux-MM, Lee.Schermerhorn, Christoph Lameter, AKPM, Andi Kleen,
	jbarnes, kamezawa.hiroyu

Hi, this is version 5 of the zonelist-order-fix patch,
against 2.6.21-mm1. It works well in my ia64/NUMA environment.


ChangeLog V4 -> V5
- separated 'doc' patch and rewrote it.
- more clean ups.
- sysctl/boot option params are simplified.

ChangeLog V2 -> V4
- automatic configuration is added.
- automatic configuration is now default.
- relaxed_zone_order is renamed to numa_zonelist_order.
  You can specify the value "default", "zone", or "numa".
- clean-up from Lee Schermerhorn
- patch is separated into "base" and "autoconfiguration algorithm"

Changelog from V1 -> V2
- sysctl name is changed to be relaxed_zone_order
- NORMAL->NORMAL->....->DMA->DMA->DMA order (new ordering) is now default.
  NORMAL->DMA->NORMAL->DMA order (old ordering) is optional.
- added a boot option to set relaxed_zone_order. ia64 is supported now.
- Added documentation

Thanks to Lee Schermerhorn for his great help. Please ack or
give your sign-off if O.K.

[patch set]
[1/3] ---- add zonelist selection logic.
[2/3] ---- add automatic configuration of zonelist order
[3/3] ---- add documentation.

Any comments are welcome.

[Description]
This patch modifies the zonelist order on NUMA. It offers two zonelist
orders.
(TypeA) zones are ordered by node locality, then zone type
(TypeB) zones are ordered by zone type, then node locality

(TypeA) is called "Node Order"; (TypeB) is called "Zone Order".
The default zonelist order is determined by the kernel automatically.


Assume a 2-node NUMA system where Node(0) has ZONE_DMA/ZONE_NORMAL and Node(1) has ZONE_NORMAL.
In this case, the zonelist for GFP_KERNEL on Node(0) will be

In "Node Order",  Node(0)NORMAL -> Node(0)DMA -> Node(1)NORMAL
In "Zone Order",  Node(0)NORMAL -> Node(1)NORMAL -> Node(0) DMA

"Node Order" will guarantee "better locality" but  "Zone Order" places
ZONE_DMA at the tail of the zonelist. This makes the zonelist robust against OOM on ZONE_DMA, which tends to be small.
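
The two orders above can be read as two sort keys over the same set of
zones. The following is a minimal userspace sketch, not kernel code: the
toy_* names are hypothetical, and the distances 10 (local) and 20 (remote)
are assumed values chosen only for illustration.

```c
#include <stdio.h>
#include <stdlib.h>

/* Toy zone types; a higher value means a "higher" zone. */
enum toy_zone_type { TOY_DMA = 0, TOY_NORMAL = 1 };

struct toy_zone {
	int node;			/* owning node id */
	enum toy_zone_type type;
	int distance;			/* assumed distance from the allocating node */
};

/* "Node Order": sort by (distance, -zonetype). */
static int cmp_node_order(const void *a, const void *b)
{
	const struct toy_zone *x = a, *y = b;

	if (x->distance != y->distance)
		return x->distance - y->distance;
	return (int)y->type - (int)x->type;
}

/* "Zone Order": sort by (-zonetype, distance). */
static int cmp_zone_order(const void *a, const void *b)
{
	const struct toy_zone *x = a, *y = b;

	if (x->type != y->type)
		return (int)y->type - (int)x->type;
	return x->distance - y->distance;
}

/* Sort zones in place; zone_order != 0 selects "Zone Order". */
void toy_sort(struct toy_zone *z, size_t n, int zone_order)
{
	qsort(z, n, sizeof(*z), zone_order ? cmp_zone_order : cmp_node_order);
}

/* Print the list in the "Node(n)TYPE -> ..." notation used above. */
void toy_print(const struct toy_zone *z, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		printf("%sNode(%d)%s", i ? " -> " : "", z[i].node,
		       z[i].type == TOY_NORMAL ? "NORMAL" : "DMA");
	printf("\n");
}
```

Sorting Node(0)'s view of { Node(0)DMA, Node(0)NORMAL, Node(1)NORMAL } with
each comparator should reproduce the two lists shown above.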

"Which is better ?" 
It depends on a system's environment and memory usage, I think.

[Case Study]
On my (and other) ia64 NUMA boxes, only Node(0) has ZONE_DMA, and it is 2GB.
Assume a machine with the following configuration.

Node 0:   12GB of memory   10GB NORMAL 2GB DMA
Node 1:   12GB of memory   12GB NORMAL
Node 2:   12GB of memory   12GB NORMAL

Start a process which uses 12GB of memory on Node(0), then memory usage
will be
Node 0:   0/12 GB of memory is available, NORMAL: empty DMA: empty
Node 1:  12/12 GB of memory is available. NORMAL: 12G
Node 2:  12/12 GB of memory is available. NORMAL: 12G

The interesting point is that ZONE_DMA is exhausted before the ZONE_NORMAL
of other nodes is used. This is the current kernel's behavior. It can cause
OOM very easily if the system has a device which uses GFP_DMA.

This patch fixes this kind of situation as follows (by using "Zone Order").
Node 0:   2/12 GB of memory is available, NORMAL: empty DMA: 2G
Node 1:  10/12 GB of memory is available. NORMAL: 10G
Node 2:  12/12 GB of memory is available. NORMAL  12G

A user can say "Good bye OOM-Killer", but 2GB of memory is allocated from
off-node memory. It's a trade-off.
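
The case study can be replayed with a small userspace model that walks a
zonelist and takes memory from the first zone that still has some. This is
only a sketch under simplifying assumptions: sizes are whole GB, there are
no watermarks or reclaim, and the sim_* names are hypothetical.

```c
#include <stddef.h>

struct sim_zone {
	const char *name;
	int free_gb;		/* free memory in this zone, in GB */
};

/*
 * Take 'want_gb' from the zones in list order, falling through to the
 * next zone when one is empty. Returns the amount that could not be
 * satisfied (0 when the request was fully served).
 */
int sim_alloc(struct sim_zone **zonelist, size_t n, int want_gb)
{
	size_t i;

	for (i = 0; i < n && want_gb > 0; i++) {
		int take = zonelist[i]->free_gb < want_gb ?
			   zonelist[i]->free_gb : want_gb;

		zonelist[i]->free_gb -= take;
		want_gb -= take;
	}
	return want_gb;
}
```

With the list Node(0)NORMAL, Node(0)DMA, Node(1)NORMAL, Node(2)NORMAL a 12GB
request empties Node(0) DMA. With the list Node(0)NORMAL, Node(1)NORMAL,
Node(2)NORMAL, Node(0)DMA the same request leaves the 2GB of DMA untouched
and takes 2GB from Node(1), assuming Node(1) is nearer to Node(0) than
Node(2) is.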

-Kame







--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* [PATCH] change zonelist order v5 [1/3] implements zonelist order selection
  2007-05-08 11:14 [PATCH] change zonelist order v5 [0/3] KAMEZAWA Hiroyuki
@ 2007-05-08 11:16 ` KAMEZAWA Hiroyuki
  2007-05-08 17:06   ` Lee Schermerhorn
  2007-05-08 11:18 ` [PATCH] change zonelist order v5 [2/3] automatic configuration KAMEZAWA Hiroyuki
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-05-08 11:16 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, linux-mm, Lee.Schermerhorn, clameter, akpm, ak, jbarnes

Make zonelist creation policy selectable from sysctl/boot option v5.

This patch makes NUMA's zonelist (of pgdat) order selectable.
Available orders are Default (automatic) / Node-based / Zone-based.

[Default Order]
The kernel selects Node-based or Zone-based order automatically.

[Node-based Order]
This policy treats the locality of memory as the most important parameter.
Zonelist order is created by each zone's locality. This means lower zones
(ex. ZONE_DMA) can be used before higher zones (ex. ZONE_NORMAL) are exhausted.
IOW, ZONE_DMA will be in the middle of the zonelist.
The current 2.6.21 kernel uses this.

Pros.
 * A user can expect local memory as much as possible.
Cons.
 * A lower zone will be exhausted before a higher zone. This may cause OOM_KILL.

Maybe suitable if ZONE_DMA is relatively big, you never see OOM_KILL
because of ZONE_DMA exhaustion, and you need the best locality.

(example)
assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL.

*node(0)'s memory allocation order:

 node(0)'s NORMAL -> node(0)'s DMA -> node(1)'s NORMAL.

*node(1)'s memory allocation order:
 
 node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA.

[Zone-based order]
This policy treats the zone type as the most important parameter.
Zonelist order is created by zone-type order. This means a lower zone
is never used before the higher zones are exhausted.
IOW, ZONE_DMA will always be at the tail of the zonelist.

Pros.
 * OOM_KILL (because of a lower zone) occurs only if all zones are exhausted.
Cons.
 * memory locality may not be best.

(example)
assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL.

*node(0)'s memory allocation order:

 node(0)'s NORMAL -> node(1)'s NORMAL -> node(0)'s DMA.

*node(1)'s memory allocation order:

 node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA.
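
The zone-based lists above come from a nested loop: zone types from highest
to lowest on the outside, nodes in nearest-first order on the inside. Below
is a userspace sketch of that loop shape. It follows the structure of
build_zonelists_in_zone_order() in this patch, but the toy_* types, the
'populated' table and the two-zone layout are assumptions made for
illustration only.

```c
#include <stddef.h>

#define TOY_NR_ZONES 2		/* 0 = DMA, 1 = NORMAL */

struct toy_zoneref {
	int node;
	int zone_type;
};

/*
 * populated[node][zone_type] is nonzero when that zone has memory.
 * node_order[] lists nodes nearest-first for the allocating node.
 * Returns the number of entries written to out[].
 */
size_t toy_build_zone_order(int populated[][TOY_NR_ZONES],
			    const int *node_order, size_t nr_nodes,
			    int highest_zone, struct toy_zoneref *out)
{
	size_t pos = 0, j;
	int zt;		/* signed, so the countdown can stop below zero */

	for (zt = highest_zone; zt >= 0; zt--) {
		for (j = 0; j < nr_nodes; j++) {
			int node = node_order[j];

			if (populated[node][zt]) {
				out[pos].node = node;
				out[pos].zone_type = zt;
				pos++;
			}
		}
	}
	return pos;
}
```

For node(1), with node_order = { 1, 0 }, the result is node(1) NORMAL,
node(0) NORMAL, node(0) DMA, as in the example above.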

The boot option "numa_zonelist_order=" and proc/sysctl are supported.

command:
%echo N > /proc/sys/vm/numa_zonelist_order

Will rebuild zonelist in Node-based order.

command:
%echo Z > /proc/sys/vm/numa_zonelist_order

Will rebuild zonelist in Zone-based order.
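
Only the first character of the string is examined, so "N", "node" and
"Node" all select Node-based order. A userspace sketch of that rule follows.
It mirrors __parse_numa_zonelist_order() in this patch, but parse_order()
and the ORDER_* names are hypothetical stand-ins, and it returns -1 instead
of -EINVAL.

```c
#include <stdio.h>

enum { ORDER_DEFAULT = 0, ORDER_NODE = 1, ORDER_ZONE = 2 };

/*
 * Return the selected order, or -1 for an unrecognised string.
 * Only the first character is checked.
 */
int parse_order(const char *s)
{
	switch (*s) {
	case 'd': case 'D':
		return ORDER_DEFAULT;
	case 'n': case 'N':
		return ORDER_NODE;
	case 'z': case 'Z':
		return ORDER_ZONE;
	default:
		fprintf(stderr,
			"Ignoring invalid numa_zonelist_order value: %s\n", s);
		return -1;
	}
}
```

An unrecognised value leaves the current setting alone, which is why the
sysctl handler in the patch saves and restores the previous string.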

Tested on ia64 2-node NUMA. Works well.

Thanks to Lee Schermerhorn, who gave me much help and code.

Signed-Off-By: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

---
 include/linux/mmzone.h |    5 +
 kernel/sysctl.c        |    9 ++
 mm/page_alloc.c        |  217 ++++++++++++++++++++++++++++++++++++++++++++-----
 3 files changed, 209 insertions(+), 22 deletions(-)

Index: linux-2.6.21-mm1/kernel/sysctl.c
===================================================================
--- linux-2.6.21-mm1.orig/kernel/sysctl.c
+++ linux-2.6.21-mm1/kernel/sysctl.c
@@ -891,6 +891,15 @@ static ctl_table vm_table[] = {
 		.proc_handler	= &proc_dointvec_jiffies,
 		.strategy	= &sysctl_jiffies,
 	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "numa_zonelist_order",
+		.data		= &numa_zonelist_order,
+		.maxlen		= NUMA_ZONELIST_ORDER_LEN,
+		.mode		= 0644,
+		.proc_handler	= &numa_zonelist_order_handler,
+		.strategy	= &sysctl_string,
+	},
 #endif
 #if defined(CONFIG_X86_32) || \
    (defined(CONFIG_SUPERH) && defined(CONFIG_VSYSCALL))
Index: linux-2.6.21-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.21-mm1.orig/mm/page_alloc.c
+++ linux-2.6.21-mm1/mm/page_alloc.c
@@ -2023,7 +2023,8 @@ void show_free_areas(void)
  * Add all populated zones of a node to the zonelist.
  */
 static int __meminit build_zonelists_node(pg_data_t *pgdat,
-			struct zonelist *zonelist, int nr_zones, enum zone_type zone_type)
+			struct zonelist *zonelist, int nr_zones,
+			enum zone_type zone_type)
 {
 	struct zone *zone;
 
@@ -2042,9 +2043,97 @@ static int __meminit build_zonelists_nod
 	return nr_zones;
 }
 
+
+/*
+ *  zonelist_order:
+ *  0 = automatic detection of better ordering.
+ *  1 = order by ([node] distance, -zonetype)
+ *  2 = order by (-zonetype, [node] distance)
+ *
+ *  If not NUMA, ZONELIST_ORDER_ZONE and ZONELIST_ORDER_NODE will create
+ *  the same zonelist. So only NUMA can configure this param.
+ */
+#define ZONELIST_ORDER_DEFAULT  0
+#define ZONELIST_ORDER_NODE     1
+#define ZONELIST_ORDER_ZONE     2
+
+static int zonelist_order = ZONELIST_ORDER_DEFAULT;
+static char zonelist_order_name[3][8] = {"Default", "Node", "Zone"};
+
+
 #ifdef CONFIG_NUMA
+/* string for sysctl */
+#define NUMA_ZONELIST_ORDER_LEN	16
+char numa_zonelist_order[16] = "default";
+
+/*
+ * interface for configuring zonelist ordering.
+ * command line option "numa_zonelist_order"
+ *	= "[dD]efault	- default, automatic configuration.
+ *	= "[nN]ode 	- order by node locality, then by zone within node
+ *	= "[zZ]one      - order by zone, then by locality within zone
+ */
+
+static int __parse_numa_zonelist_order(char *s)
+{
+	if (*s == 'd' || *s == 'D') {
+		zonelist_order = ZONELIST_ORDER_DEFAULT;
+	} else if (*s == 'n' || *s == 'N') {
+		zonelist_order = ZONELIST_ORDER_NODE;
+	} else if (*s == 'z' || *s == 'Z') {
+		zonelist_order = ZONELIST_ORDER_ZONE;
+	} else {
+		printk(KERN_WARNING
+			"Ignoring invalid numa_zonelist_order value:  "
+			"%s\n", s);
+		return -EINVAL;
+	}
+	return 0;
+}
+
+static __init int setup_numa_zonelist_order(char *s)
+{
+	if (s)
+		return __parse_numa_zonelist_order(s);
+	return 0;
+}
+early_param("numa_zonelist_order", setup_numa_zonelist_order);
+
+/*
+ * sysctl handler for numa_zonelist_order
+ */
+int numa_zonelist_order_handler(ctl_table *table, int write,
+		struct file *file, void __user *buffer, size_t *length,
+		loff_t *ppos)
+{
+	char saved_string[NUMA_ZONELIST_ORDER_LEN];
+	int ret;
+
+	if (write)
+		strncpy(saved_string, (char*)table->data,
+			NUMA_ZONELIST_ORDER_LEN);
+	ret = proc_dostring(table, write, file, buffer, length, ppos);
+	if (ret)
+		return ret;
+	if (write) {
+		int oldval = zonelist_order;
+		if (__parse_numa_zonelist_order((char*)table->data)) {
+			/*
+			 * bogus value.  restore saved string
+			 */
+			strncpy((char*)table->data, saved_string,
+				NUMA_ZONELIST_ORDER_LEN);
+			zonelist_order = oldval;
+		} else if (oldval != zonelist_order)
+			build_all_zonelists();
+	}
+	return 0;
+}
+
+
 #define MAX_NODE_LOAD (num_online_nodes())
-static int __meminitdata node_load[MAX_NUMNODES];
+static int node_load[MAX_NUMNODES];
+
 /**
  * find_next_best_node - find the next node that should appear in a given node's fallback list
  * @node: node whose fallback list we're appending
@@ -2059,7 +2148,7 @@ static int __meminitdata node_load[MAX_N
  * on them otherwise.
  * It returns -1 if no node is found.
  */
-static int __meminit find_next_best_node(int node, nodemask_t *used_node_mask)
+static int find_next_best_node(int node, nodemask_t *used_node_mask)
 {
 	int n, val;
 	int min_val = INT_MAX;
@@ -2105,13 +2194,73 @@ static int __meminit find_next_best_node
 	return best_node;
 }
 
-static void __meminit build_zonelists(pg_data_t *pgdat)
+
+/*
+ * Build zonelists ordered by node and zones within node.
+ * This results in maximum locality--normal zone overflows into local
+ * DMA zone, if any--but risks exhausting DMA zone.
+ */
+static void build_zonelists_in_node_order(pg_data_t *pgdat, int node)
 {
-	int j, node, local_node;
 	enum zone_type i;
-	int prev_node, load;
+	int j;
 	struct zonelist *zonelist;
+
+	for (i = 0; i < MAX_NR_ZONES; i++) {
+		zonelist = pgdat->node_zonelists + i;
+		for (j = 0; zonelist->zones[j] != NULL; j++);
+
+ 		j = build_zonelists_node(NODE_DATA(node), zonelist, j, i);
+		zonelist->zones[j] = NULL;
+	}
+}
+
+/*
+ * Build zonelists ordered by zone and nodes within zones.
+ * This results in conserving DMA zone[s] until all Normal memory is
+ * exhausted, but results in overflowing to remote node while memory
+ * may still exist in local DMA zone.
+ */
+static int node_order[MAX_NUMNODES];
+
+static void build_zonelists_in_zone_order(pg_data_t *pgdat, int nr_nodes)
+{
+	enum zone_type i;
+	int pos, j, node;
+	int zone_type;		/* needs to be signed */
+	struct zone *z;
+	struct zonelist *zonelist;
+
+	for (i = 0; i < MAX_NR_ZONES; i++) {
+		zonelist = pgdat->node_zonelists + i;
+		pos = 0;
+		for (zone_type = i; zone_type >= 0; zone_type--) {
+			for (j = 0; j < nr_nodes; j++) {
+				node = node_order[j];
+				z = &NODE_DATA(node)->node_zones[zone_type];
+				if (populated_zone(z))
+					zonelist->zones[pos++] = z;
+			}
+		}
+		zonelist->zones[pos] = NULL;
+	}
+}
+
+static int default_zonelist_order(void)
+{
+	/* dummy, just select node order. */
+	return ZONELIST_ORDER_NODE;
+}
+
+
+
+static void build_zonelists(pg_data_t *pgdat, int ordering)
+{
+	int j, node, load;
+	enum zone_type i;
 	nodemask_t used_mask;
+	int local_node, prev_node;
+	struct zonelist *zonelist;
 
 	/* initialize zonelists */
 	for (i = 0; i < MAX_NR_ZONES; i++) {
@@ -2124,6 +2273,10 @@ static void __meminit build_zonelists(pg
 	load = num_online_nodes();
 	prev_node = local_node;
 	nodes_clear(used_mask);
+
+	memset(node_order, 0, sizeof(node_order));
+	j = 0;
+
 	while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
 		int distance = node_distance(local_node, node);
 
@@ -2139,18 +2292,20 @@ static void __meminit build_zonelists(pg
 		 * So adding penalty to the first node in same
 		 * distance group to make it round-robin.
 		 */
-
 		if (distance != node_distance(local_node, prev_node))
-			node_load[node] += load;
+			node_load[node] = load;
+
 		prev_node = node;
 		load--;
-		for (i = 0; i < MAX_NR_ZONES; i++) {
-			zonelist = pgdat->node_zonelists + i;
-			for (j = 0; zonelist->zones[j] != NULL; j++);
+		if (ordering == ZONELIST_ORDER_NODE)
+			build_zonelists_in_node_order(pgdat, node);
+		else
+			node_order[j++] = node;	/* remember order */
+	}
 
-	 		j = build_zonelists_node(NODE_DATA(node), zonelist, j, i);
-			zonelist->zones[j] = NULL;
-		}
+	if (ordering == ZONELIST_ORDER_ZONE) {
+		/* calculate node order -- i.e., DMA last! */
+		build_zonelists_in_zone_order(pgdat, j);
 	}
 }
 
@@ -2172,9 +2327,18 @@ static void __meminit build_zonelist_cac
 	}
 }
 
+
 #else	/* CONFIG_NUMA */
 
-static void __meminit build_zonelists(pg_data_t *pgdat)
+static int default_zonelist_order(void)
+{
+	return ZONELIST_ORDER_ZONE;
+}
+
+/*
+ * order is ignored.
+ */
+static void __meminit build_zonelists(pg_data_t *pgdat, int order)
 {
 	int node, local_node;
 	enum zone_type i,j;
@@ -2221,26 +2385,33 @@ static void __meminit build_zonelist_cac
 #endif	/* CONFIG_NUMA */
 
 /* return values int ....just for stop_machine_run() */
-static int __meminit __build_all_zonelists(void *dummy)
+static int __build_all_zonelists(void *dummy)
 {
 	int nid;
-
+	int order = *(int *)dummy;
 	for_each_online_node(nid) {
-		build_zonelists(NODE_DATA(nid));
+		build_zonelists(NODE_DATA(nid), order);
 		build_zonelist_cache(NODE_DATA(nid));
 	}
 	return 0;
 }
 
-void __meminit build_all_zonelists(void)
+void build_all_zonelists(void)
 {
+	int order;
+	if (zonelist_order == ZONELIST_ORDER_DEFAULT)
+		order = default_zonelist_order();
+	else
+		order = zonelist_order;
+
 	if (system_state == SYSTEM_BOOTING) {
-		__build_all_zonelists(NULL);
+		__build_all_zonelists(&order);
 		cpuset_init_current_mems_allowed();
 	} else {
+		memset(node_load, 0, sizeof(node_load));
 		/* we have to stop all cpus to guaranntee there is no user
 		   of zonelist */
-		stop_machine_run(__build_all_zonelists, NULL, NR_CPUS);
+		stop_machine_run(__build_all_zonelists, &order, NR_CPUS);
 		/* cpuset refresh routine should be here */
 	}
 	vm_total_pages = nr_free_pagecache_pages();
@@ -2257,8 +2428,10 @@ void __meminit build_all_zonelists(void)
 	else
 		page_group_by_mobility_disabled = 0;
 
-	printk("Built %i zonelists, mobility grouping %s.  Total pages: %ld\n",
+	printk("Built %i zonelists in %s order, mobility grouping %s.  "
+	       "Total pages: %ld\n",
 			num_online_nodes(),
+			zonelist_order_name[order],
 			page_group_by_mobility_disabled ? "off" : "on",
 			vm_total_pages);
 }
Index: linux-2.6.21-mm1/include/linux/mmzone.h
===================================================================
--- linux-2.6.21-mm1.orig/include/linux/mmzone.h
+++ linux-2.6.21-mm1/include/linux/mmzone.h
@@ -610,6 +610,11 @@ int sysctl_min_unmapped_ratio_sysctl_han
 int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
 			struct file *, void __user *, size_t *, loff_t *);
 
+extern int numa_zonelist_order_handler(struct ctl_table *, int,
+			struct file *, void __user *, size_t *, loff_t *);
+extern char numa_zonelist_order[];
+#define NUMA_ZONELIST_ORDER_LEN 16	/* string buffer size */
+
 #include <linux/topology.h>
 /* Returns the number of the current Node. */
 #ifndef numa_node_id


* [PATCH] change zonelist order v5 [2/3] automatic configuration
  2007-05-08 11:14 [PATCH] change zonelist order v5 [0/3] KAMEZAWA Hiroyuki
  2007-05-08 11:16 ` [PATCH] change zonelist order v5 [1/3] implements zonelist order selection KAMEZAWA Hiroyuki
@ 2007-05-08 11:18 ` KAMEZAWA Hiroyuki
  2007-05-08 17:07   ` Lee Schermerhorn
  2007-05-08 11:19 ` [PATCH] change zonelist order v5 [3/3] documentation KAMEZAWA Hiroyuki
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-05-08 11:18 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, linux-mm, Lee.Schermerhorn, clameter, akpm, ak, jbarnes

Add auto zone ordering configuration.

This function will select ZONELIST_ORDER_NODE when

There are only ZONE_DMA or ZONE_DMA32.
|| size of (ZONE_DMA/DMA32) > (System Total Memory)/2
|| Assume Node(A)
	Node (A) is big enough &&
	Node (A)'s ZONE_DMA/DMA32 occupies 60% of Node(A)'s memory.
	(In this case, ZONELIST_ORDER_ZONE may not offer enough locality...)

otherwise, ZONELIST_ORDER_ZONE is selected.

Maybe there is no best and simple way to configure zone order. I wrote this based on
my experience and discussion on the list.

Anyway, a user can specify zone order from boot option/sysctl.
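
The selection rule can be tried out in userspace with a toy model of the
per-node sizes. This is a sketch only: struct toy_node, pick_order() and the
AUTO_ORDER_* names are hypothetical. It uses the 70/100 threshold and the
"total / (nodes + 1)" average that appear in the code of this patch, while
the description above says 60%.

```c
#include <stddef.h>

enum { AUTO_ORDER_NODE = 1, AUTO_ORDER_ZONE = 2 };

struct toy_node {
	unsigned long low_pages;	/* ZONE_DMA/DMA32 pages on this node */
	unsigned long total_pages;	/* all pages on this node */
};

int pick_order(const struct toy_node *nodes, size_t nr_nodes)
{
	unsigned long low = 0, total = 0, average;
	size_t i;

	for (i = 0; i < nr_nodes; i++) {
		low += nodes[i].low_pages;
		total += nodes[i].total_pages;
	}
	/* No DMA area at all, or DMA/DMA32 is more than half of memory. */
	if (!low || low > total / 2)
		return AUTO_ORDER_NODE;

	/* A big node that is mostly DMA/DMA32 also prefers node order. */
	average = total / (nr_nodes + 1);
	for (i = 0; i < nr_nodes; i++) {
		if (nodes[i].low_pages &&
		    nodes[i].total_pages > average &&	/* ignore small node */
		    nodes[i].low_pages > nodes[i].total_pages * 70 / 100)
			return AUTO_ORDER_NODE;
	}
	return AUTO_ORDER_ZONE;
}
```

For the three-node machine in the [0/3] case study (2 of 12 units of DMA on
Node 0, none elsewhere) this model picks zone order.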

Signed-Off-By: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

---
 mm/page_alloc.c |   51 +++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 49 insertions(+), 2 deletions(-)

Index: linux-2.6.21-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.21-mm1.orig/mm/page_alloc.c
+++ linux-2.6.21-mm1/mm/page_alloc.c
@@ -2248,8 +2248,55 @@ static void build_zonelists_in_zone_orde
 
 static int default_zonelist_order(void)
 {
-	/* dummy, just select node order. */
-	return ZONELIST_ORDER_NODE;
+	int nid, zone_type;
+	unsigned long low_kmem_size,total_size;
+	struct zone *z;
+	int average_size;
+	/*
+	 * ZONE_DMA and ZONE_DMA32 can be a very small area in the system.
+	 * If they are really small and used heavily, the system can fall
+	 * into OOM very easily.
+	 * This function detects ZONE_DMA/DMA32 size and configures zone order.
+	 */
+	/* Is there ZONE_NORMAL ? (ex. ppc has only DMA zone..) */
+	low_kmem_size = 0;
+	total_size = 0;
+	for_each_online_node(nid) {
+		for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++) {
+			z = &NODE_DATA(nid)->node_zones[zone_type];
+			if (populated_zone(z)) {
+				if (zone_type < ZONE_NORMAL)
+					low_kmem_size += z->present_pages;
+				total_size += z->present_pages;
+			}
+		}
+	}
+	if (!low_kmem_size ||  /* there are no DMA area. */
+	    low_kmem_size > total_size/2) /* DMA/DMA32 is big. */
+		return ZONELIST_ORDER_NODE;
+	/*
+	 * look into each node's config.
+	 * If there is a node whose DMA/DMA32 memory is very big area on
+	 * local memory, NODE_ORDER may be suitable.
+	 */
+	average_size = total_size / (num_online_nodes() + 1);
+	for_each_online_node(nid) {
+		low_kmem_size = 0;
+		total_size = 0;
+		for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++) {
+			z = &NODE_DATA(nid)->node_zones[zone_type];
+			if (populated_zone(z)) {
+				if (zone_type < ZONE_NORMAL)
+					low_kmem_size += z->present_pages;
+				total_size += z->present_pages;
+			}
+		}
+		if (low_kmem_size &&
+		    total_size > average_size && /* ignore small node */
+		    low_kmem_size > total_size * 70/100)
+			return ZONELIST_ORDER_NODE;
+	}
+	return ZONELIST_ORDER_ZONE;
 }
 
 


* [PATCH] change zonelist order v5 [3/3] documentation
  2007-05-08 11:14 [PATCH] change zonelist order v5 [0/3] KAMEZAWA Hiroyuki
  2007-05-08 11:16 ` [PATCH] change zonelist order v5 [1/3] implements zonelist order selection KAMEZAWA Hiroyuki
  2007-05-08 11:18 ` [PATCH] change zonelist order v5 [2/3] automatic configuration KAMEZAWA Hiroyuki
@ 2007-05-08 11:19 ` KAMEZAWA Hiroyuki
  2007-05-08 17:08   ` Lee Schermerhorn
  2007-05-08 12:04 ` [PATCH] change zonelist order v5 [4/3] compile fix KAMEZAWA Hiroyuki
  2007-05-08 16:14 ` [PATCH] change zonelist order v5 [0/3] Christoph Lameter
  4 siblings, 1 reply; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-05-08 11:19 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, linux-mm, Lee.Schermerhorn, clameter, akpm, ak, jbarnes

Patch for documentation.

Signed-Off-By: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


---
 Documentation/kernel-parameters.txt |   10 +++++++
 Documentation/sysctl/vm.txt         |   48 ++++++++++++++++++++++++++++++++++++
 2 files changed, 58 insertions(+)

Index: linux-2.6.21-mm1/Documentation/kernel-parameters.txt
===================================================================
--- linux-2.6.21-mm1.orig/Documentation/kernel-parameters.txt
+++ linux-2.6.21-mm1/Documentation/kernel-parameters.txt
@@ -1233,6 +1233,16 @@ and is between 256 and 4096 characters. 
 
 	nr_uarts=	[SERIAL] maximum number of UARTs to be registered.
 
+	numa_zonelist_order= [KNL,BOOT]
+			Select zonelist order for NUMA. The zonelist is used for
+			deciding where the kernel allocates memory from.
+			Default is automatic configuration. If "node" is
+			specified, the zonelist is ordered by locality. This can
+			offer the best locality but the possibility of OOM may
+			increase.  If "zone" is specified, the zonelist is
+			ordered by zone_type.
+			See Documentation/sysctl/vm.txt numa_zonelist_order.
+			
 	opl3=		[HW,OSS]
 			Format: <io>
 
Index: linux-2.6.21-mm1/Documentation/sysctl/vm.txt
===================================================================
--- linux-2.6.21-mm1.orig/Documentation/sysctl/vm.txt
+++ linux-2.6.21-mm1/Documentation/sysctl/vm.txt
@@ -35,6 +35,7 @@ Currently, these files are in /proc/sys/
 - stat_interval
 - readahead_ratio
 - readahead_hit_rate
+- numa_zonelist_order
 
 ==============================================================
 
@@ -293,3 +294,49 @@ Possible values can be:
 The larger value, the more capabilities, with more possible overheads.
 
 The default value is 1.
+
+==============================================================
+
+numa_zonelist_order
+
+This sysctl is only for NUMA.
+Where memory is allocated from is controlled by the zonelist.
+(This documentation ignores ZONE_HIGHMEM/ZONE_DMA32 to keep the explanation
+ simple. You may be able to read ZONE_DMA as ZONE_DMA32...)
+
+In the non-NUMA case, a zonelist for GFP_KERNEL is ordered as follows.
+ZONE_NORMAL -> ZONE_DMA
+This means that a memory allocation request for GFP_KERNEL will
+get memory from ZONE_DMA only when ZONE_NORMAL is not available.
+
+In the NUMA case, you can think of the following 2 types of order.
+Assume a 2-node NUMA system; below is the zonelist of Node(0)'s GFP_KERNEL.
+
+(A) Node(0) ZONE_NORMAL -> Node(0) ZONE_DMA -> Node(1) ZONE_NORMAL
+(B) Node(0) ZONE_NORMAL -> Node(1) ZONE_NORMAL -> Node(0) ZONE_DMA.
+
+Type(A) offers the best locality for processes on Node(0), but ZONE_DMA
+will be used before ZONE_NORMAL exhaustion. This increases the possibility of
+out-of-memory (OOM) in ZONE_DMA because ZONE_DMA tends to be small.
+
+Type(B) cannot offer the best locality but is very robust against DMA zone OOM.
+
+Type(A) is called "Node" order. Type(B) is "Zone" order.
+
+"Node order" orders the zonelists by node, then by zone within each node.
+This will offer the best locality but increases the possibility of OOM.
+Specify "[Nn]ode" for node order.
+
+"Zone order" preserves the DMA zone as long as possible but
+results in off-node allocation [for node 0] earlier.
+Specify "[Zz]one" for zone order.
+
+Specify "[Dd]efault" to request automatic configuration.  Autoconfiguration
+will select "node" order in the following cases.
+(1) if the DMA zone does not exist or
+(2) if the DMA zone comprises greater than 50% of the available memory or
+(3) if a node's DMA zone comprises greater than 60% of its local memory and
+    the amount of local memory is big enough.
+
+Otherwise, "zone" order will be selected. Default order is recommended
+unless this is causing problems for your system/application.


* Re: [PATCH] change zonelist order v5 [4/3] compile fix.....
  2007-05-08 11:14 [PATCH] change zonelist order v5 [0/3] KAMEZAWA Hiroyuki
                   ` (2 preceding siblings ...)
  2007-05-08 11:19 ` [PATCH] change zonelist order v5 [3/3] documentation KAMEZAWA Hiroyuki
@ 2007-05-08 12:04 ` KAMEZAWA Hiroyuki
  2007-05-08 16:14 ` [PATCH] change zonelist order v5 [0/3] Christoph Lameter
  4 siblings, 0 replies; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-05-08 12:04 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, linux-mm, Lee.Schermerhorn, clameter, akpm, ak, jbarnes

I'm very sorry for missing this fix for non-NUMA arch...
I'll repost the whole set if necessary....
-Kame

Compile-fix...

Signed-Off-By: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Index: linux-2.6.21-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.21-mm1.orig/mm/page_alloc.c
+++ linux-2.6.21-mm1/mm/page_alloc.c
@@ -2321,6 +2321,7 @@ static void build_zonelists(pg_data_t *p
 	prev_node = local_node;
 	nodes_clear(used_mask);
 
+	memset(node_load, 0, sizeof(node_load));
 	memset(node_order, 0, sizeof(node_order));
 	j = 0;
 
@@ -2455,7 +2456,6 @@ void build_all_zonelists(void)
 		__build_all_zonelists(&order);
 		cpuset_init_current_mems_allowed();
 	} else {
-		memset(node_load, 0, sizeof(node_load));
 		/* we have to stop all cpus to guaranntee there is no user
 		   of zonelist */
 		stop_machine_run(__build_all_zonelists, &order, NR_CPUS);


* Re: [PATCH] change zonelist order v5 [0/3]
  2007-05-08 11:14 [PATCH] change zonelist order v5 [0/3] KAMEZAWA Hiroyuki
                   ` (3 preceding siblings ...)
  2007-05-08 12:04 ` [PATCH] change zonelist order v5 [4/3] compile fix KAMEZAWA Hiroyuki
@ 2007-05-08 16:14 ` Christoph Lameter
  4 siblings, 0 replies; 22+ messages in thread
From: Christoph Lameter @ 2007-05-08 16:14 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: LKML, Linux-MM, Lee.Schermerhorn, AKPM, Andi Kleen, jbarnes

Good explanation. Thanks.


* Re: [PATCH] change zonelist order v5 [1/3] implements zonelist order selection
  2007-05-08 11:16 ` [PATCH] change zonelist order v5 [1/3] implements zonelist order selection KAMEZAWA Hiroyuki
@ 2007-05-08 17:06   ` Lee Schermerhorn
  2007-05-08 17:22     ` Christoph Lameter
  0 siblings, 1 reply; 22+ messages in thread
From: Lee Schermerhorn @ 2007-05-08 17:06 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-kernel, linux-mm, clameter, akpm, ak, jbarnes

On Tue, 2007-05-08 at 20:16 +0900, KAMEZAWA Hiroyuki wrote:
> Make zonelist creation policy selectable from sysctl/boot option v5.
> 
> This patch makes NUMA's zonelist (of pgdat) order selectable.
> Available order are Default(automatic)/ Node-based / Zone-based.
> 
> [Default Order]
> The kernel selects Node-based or Zone-based order automatically.
> 
> [Node-based Order]
> This policy treats the locality of memory as the most important parameter.
> Zonelist order is created by each zone's locality. This means lower zones
> (ex. ZONE_DMA) can be used before higher zone (ex. ZONE_NORMAL) exhausion.
> IOW. ZONE_DMA will be in the middle of zonelist.
> current 2.6.21 kernel uses this.
> 
> Pros.
>  * A user can expect local memory as much as possible.
> Cons.
>  * lower zone will be exhansted before higher zone. This may cause OOM_KILL.
> 
> Maybe suitable if ZONE_DMA is relatively big and you never see OOM_KILL
> because of ZONE_DMA exhaution and you need the best locality.
> 
> (example)
> assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL.
> 
> *node(0)'s memory allocation order:
> 
>  node(0)'s NORMAL -> node(0)'s DMA -> node(1)'s NORMAL.
> 
> *node(1)'s memory allocation order:
>  
>  node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA.
> 
> [Zone-based order]
> This policy treats the zone type as the most important parameter.
> Zonelist order is created by zone-type order. This means lower zone 
> never be used bofere higher zone exhaustion.
> IOW. ZONE_DMA will be always at the tail of zonelist.
> 
> Pros.
>  * OOM_KILL(bacause of lower zone) occurs only if the whole zones are exhausted.
> Cons.
>  * memory locality may not be best.
> 
> (example)
> assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL.
> 
> *node(0)'s memory allocation order:
> 
>  node(0)'s NORMAL -> node(1)'s NORMAL -> node(0)'s DMA.
> 
> *node(1)'s memory allocation order:
> 
>  node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA.
> 
> bootoption "numa_zonelist_order=" and proc/sysctl is supporetd.
> 
> command:
> %echo N > /proc/sys/vm/numa_zonelist_order
> 
> Will rebuild zonelist in Node-based order.
> 
> command:
> %echo Z > /proc/sys/vm/numa_zonelist_order
> 
> Will rebuild zonelist in Zone-based order.
> 
> Tested on ia64 2-Node NUMA. works well.
> 
> Thanks to Lee Schermerhorn, he gives me much help and codes.
> 
> Signed-Off-By: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Tested OK on my platform.
Acked-by:   Lee Schermerhorn <lee.schermerhorn@hp.com>

> 
> ---
>  include/linux/mmzone.h |    5 +
>  kernel/sysctl.c        |    9 ++
>  mm/page_alloc.c        |  217 ++++++++++++++++++++++++++++++++++++++++++++-----
>  3 files changed, 209 insertions(+), 22 deletions(-)
> 
> Index: linux-2.6.21-mm1/kernel/sysctl.c
> ===================================================================
> --- linux-2.6.21-mm1.orig/kernel/sysctl.c
> +++ linux-2.6.21-mm1/kernel/sysctl.c
> @@ -891,6 +891,15 @@ static ctl_table vm_table[] = {
>  		.proc_handler	= &proc_dointvec_jiffies,
>  		.strategy	= &sysctl_jiffies,
>  	},
> +	{
> +		.ctl_name	= CTL_UNNUMBERED,
> +		.procname	= "numa_zonelist_order",
> +		.data		= &numa_zonelist_order,
> +		.maxlen		= NUMA_ZONELIST_ORDER_LEN,
> +		.mode		= 0644,
> +		.proc_handler	= &numa_zonelist_order_handler,
> +		.strategy	= &sysctl_string,
> +	},
>  #endif
>  #if defined(CONFIG_X86_32) || \
>     (defined(CONFIG_SUPERH) && defined(CONFIG_VSYSCALL))
> Index: linux-2.6.21-mm1/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.21-mm1.orig/mm/page_alloc.c
> +++ linux-2.6.21-mm1/mm/page_alloc.c
> @@ -2023,7 +2023,8 @@ void show_free_areas(void)
>   * Add all populated zones of a node to the zonelist.
>   */
>  static int __meminit build_zonelists_node(pg_data_t *pgdat,
> -			struct zonelist *zonelist, int nr_zones, enum zone_type zone_type)
> +			struct zonelist *zonelist, int nr_zones,
> +			enum zone_type zone_type)
>  {
>  	struct zone *zone;
>  
> @@ -2042,9 +2043,97 @@ static int __meminit build_zonelists_nod
>  	return nr_zones;
>  }
>  
> +
> +/*
> + *  zonelist_order:
> + *  0 = automatic detection of better ordering.
> + *  1 = order by ([node] distance, -zonetype)
> + *  2 = order by (-zonetype, [node] distance)
> + *
> + *  If not NUMA, ZONELIST_ORDER_ZONE and ZONELIST_ORDER_NODE will create
> + *  the same zonelist. So only NUMA can configure this param.
> + */
> +#define ZONELIST_ORDER_DEFAULT  0
> +#define ZONELIST_ORDER_NODE     1
> +#define ZONELIST_ORDER_ZONE     2
> +
> +static int zonelist_order = ZONELIST_ORDER_DEFAULT;
> +static char zonelist_order_name[3][8] = {"Default", "Node", "Zone"};
> +
> +
>  #ifdef CONFIG_NUMA
> +/* string for sysctl */
> +#define NUMA_ZONELIST_ORDER_LEN	16
> +char numa_zonelist_order[16] = "default";
> +
> +/*
> + * Interface for configuring zonelist ordering.
> + * command line option "numa_zonelist_order"
> + *	= "[dD]efault"	- default, automatic configuration.
> + *	= "[nN]ode"	- order by node locality, then by zone within node
> + *	= "[zZ]one"	- order by zone, then by locality within zone
> + */
> +
> +static int __parse_numa_zonelist_order(char *s)
> +{
> +	if (*s == 'd' || *s == 'D') {
> +		zonelist_order = ZONELIST_ORDER_DEFAULT;
> +	} else if (*s == 'n' || *s == 'N') {
> +		zonelist_order = ZONELIST_ORDER_NODE;
> +	} else if (*s == 'z' || *s == 'Z') {
> +		zonelist_order = ZONELIST_ORDER_ZONE;
> +	} else {
> +		printk(KERN_WARNING
> +			"Ignoring invalid numa_zonelist_order value:  "
> +			"%s\n", s);
> +		return -EINVAL;
> +	}
> +	return 0;
> +}
> +
> +static __init int setup_numa_zonelist_order(char *s)
> +{
> +	if (s)
> +		return __parse_numa_zonelist_order(s);
> +	return 0;
> +}
> +early_param("numa_zonelist_order", setup_numa_zonelist_order);
> +
> +/*
> + * sysctl handler for numa_zonelist_order
> + */
> +int numa_zonelist_order_handler(ctl_table *table, int write,
> +		struct file *file, void __user *buffer, size_t *length,
> +		loff_t *ppos)
> +{
> +	char saved_string[NUMA_ZONELIST_ORDER_LEN];
> +	int ret;
> +
> +	if (write)
> +		strncpy(saved_string, (char*)table->data,
> +			NUMA_ZONELIST_ORDER_LEN);
> +	ret = proc_dostring(table, write, file, buffer, length, ppos);
> +	if (ret)
> +		return ret;
> +	if (write) {
> +		int oldval = zonelist_order;
> +		if (__parse_numa_zonelist_order((char*)table->data)) {
> +			/*
> +			 * bogus value.  restore saved string
> +			 */
> +			strncpy((char*)table->data, saved_string,
> +				NUMA_ZONELIST_ORDER_LEN);
> +			zonelist_order = oldval;
> +		} else if (oldval != zonelist_order)
> +			build_all_zonelists();
> +	}
> +	return 0;
> +}
> +
> +
>  #define MAX_NODE_LOAD (num_online_nodes())
> -static int __meminitdata node_load[MAX_NUMNODES];
> +static int node_load[MAX_NUMNODES];
> +
>  /**
>   * find_next_best_node - find the next node that should appear in a given node's fallback list
>   * @node: node whose fallback list we're appending
> @@ -2059,7 +2148,7 @@ static int __meminitdata node_load[MAX_N
>   * on them otherwise.
>   * It returns -1 if no node is found.
>   */
> -static int __meminit find_next_best_node(int node, nodemask_t *used_node_mask)
> +static int find_next_best_node(int node, nodemask_t *used_node_mask)
>  {
>  	int n, val;
>  	int min_val = INT_MAX;
> @@ -2105,13 +2194,73 @@ static int __meminit find_next_best_node
>  	return best_node;
>  }
>  
> -static void __meminit build_zonelists(pg_data_t *pgdat)
> +
> +/*
> + * Build zonelists ordered by node and zones within node.
> + * This results in maximum locality--normal zone overflows into local
> + * DMA zone, if any--but risks exhausting DMA zone.
> + */
> +static void build_zonelists_in_node_order(pg_data_t *pgdat, int node)
>  {
> -	int j, node, local_node;
>  	enum zone_type i;
> -	int prev_node, load;
> +	int j;
>  	struct zonelist *zonelist;
> +
> +	for (i = 0; i < MAX_NR_ZONES; i++) {
> +		zonelist = pgdat->node_zonelists + i;
> +		for (j = 0; zonelist->zones[j] != NULL; j++);
> +
> +		j = build_zonelists_node(NODE_DATA(node), zonelist, j, i);
> +		zonelist->zones[j] = NULL;
> +	}
> +}
> +
> +/*
> + * Build zonelists ordered by zone and nodes within zones.
> + * This results in conserving DMA zone[s] until all Normal memory is
> + * exhausted, but results in overflowing to remote node while memory
> + * may still exist in local DMA zone.
> + */
> +static int node_order[MAX_NUMNODES];
> +
> +static void build_zonelists_in_zone_order(pg_data_t *pgdat, int nr_nodes)
> +{
> +	enum zone_type i;
> +	int pos, j, node;
> +	int zone_type;		/* needs to be signed */
> +	struct zone *z;
> +	struct zonelist *zonelist;
> +
> +	for (i = 0; i < MAX_NR_ZONES; i++) {
> +		zonelist = pgdat->node_zonelists + i;
> +		pos = 0;
> +		for (zone_type = i; zone_type >= 0; zone_type--) {
> +			for (j = 0; j < nr_nodes; j++) {
> +				node = node_order[j];
> +				z = &NODE_DATA(node)->node_zones[zone_type];
> +				if (populated_zone(z))
> +					zonelist->zones[pos++] = z;
> +			}
> +		}
> +		zonelist->zones[pos] = NULL;
> +	}
> +}
> +
> +static int default_zonelist_order(void)
> +{
> +	/* dummy, just select node order. */
> +	return ZONELIST_ORDER_NODE;
> +}
> +
> +
> +
> +static void build_zonelists(pg_data_t *pgdat, int ordering)
> +{
> +	int j, node, load;
> +	enum zone_type i;
>  	nodemask_t used_mask;
> +	int local_node, prev_node;
> +	struct zonelist *zonelist;
>  
>  	/* initialize zonelists */
>  	for (i = 0; i < MAX_NR_ZONES; i++) {
> @@ -2124,6 +2273,10 @@ static void __meminit build_zonelists(pg
>  	load = num_online_nodes();
>  	prev_node = local_node;
>  	nodes_clear(used_mask);
> +
> +	memset(node_order, 0, sizeof(node_order));
> +	j = 0;
> +
>  	while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
>  		int distance = node_distance(local_node, node);
>  
> @@ -2139,18 +2292,20 @@ static void __meminit build_zonelists(pg
>  		 * So adding penalty to the first node in same
>  		 * distance group to make it round-robin.
>  		 */
> -
>  		if (distance != node_distance(local_node, prev_node))
> -			node_load[node] += load;
> +			node_load[node] = load;
> +
>  		prev_node = node;
>  		load--;
> -		for (i = 0; i < MAX_NR_ZONES; i++) {
> -			zonelist = pgdat->node_zonelists + i;
> -			for (j = 0; zonelist->zones[j] != NULL; j++);
> +		if (ordering == ZONELIST_ORDER_NODE)
> +			build_zonelists_in_node_order(pgdat, node);
> +		else
> +			node_order[j++] = node;	/* remember order */
> +	}
>  
> -	 		j = build_zonelists_node(NODE_DATA(node), zonelist, j, i);
> -			zonelist->zones[j] = NULL;
> -		}
> +	if (ordering == ZONELIST_ORDER_ZONE) {
> +		/* calculate node order -- i.e., DMA last! */
> +		build_zonelists_in_zone_order(pgdat, j);
>  	}
>  }
>  
> @@ -2172,9 +2327,18 @@ static void __meminit build_zonelist_cac
>  	}
>  }
>  
> +
>  #else	/* CONFIG_NUMA */
>  
> -static void __meminit build_zonelists(pg_data_t *pgdat)
> +static int default_zonelist_order(void)
> +{
> +	return ZONELIST_ORDER_ZONE;
> +}
> +
> +/*
> + * order is ignored.
> + */
> +static void __meminit build_zonelists(pg_data_t *pgdat, int order)
>  {
>  	int node, local_node;
>  	enum zone_type i,j;
> @@ -2221,26 +2385,33 @@ static void __meminit build_zonelist_cac
>  #endif	/* CONFIG_NUMA */
>  
>  /* return values int ....just for stop_machine_run() */
> -static int __meminit __build_all_zonelists(void *dummy)
> +static int __build_all_zonelists(void *dummy)
>  {
>  	int nid;
> -
> +	int order = *(int *)dummy;
>  	for_each_online_node(nid) {
> -		build_zonelists(NODE_DATA(nid));
> +		build_zonelists(NODE_DATA(nid), order);
>  		build_zonelist_cache(NODE_DATA(nid));
>  	}
>  	return 0;
>  }
>  
> -void __meminit build_all_zonelists(void)
> +void build_all_zonelists(void)
>  {
> +	int order;
> +	if (zonelist_order == ZONELIST_ORDER_DEFAULT)
> +		order = default_zonelist_order();
> +	else
> +		order = zonelist_order;
> +
>  	if (system_state == SYSTEM_BOOTING) {
> -		__build_all_zonelists(NULL);
> +		__build_all_zonelists(&order);
>  		cpuset_init_current_mems_allowed();
>  	} else {
> +		memset(node_load, 0, sizeof(node_load));
>  		/* we have to stop all cpus to guaranntee there is no user
>  		   of zonelist */
> -		stop_machine_run(__build_all_zonelists, NULL, NR_CPUS);
> +		stop_machine_run(__build_all_zonelists, &order, NR_CPUS);
>  		/* cpuset refresh routine should be here */
>  	}
>  	vm_total_pages = nr_free_pagecache_pages();
> @@ -2257,8 +2428,10 @@ void __meminit build_all_zonelists(void)
>  	else
>  		page_group_by_mobility_disabled = 0;
>  
> -	printk("Built %i zonelists, mobility grouping %s.  Total pages: %ld\n",
> +	printk("Built %i zonelists in %s order, mobility grouping %s.  "
> +	       "Total pages: %ld\n",
>  			num_online_nodes(),
> +			zonelist_order_name[order],
>  			page_group_by_mobility_disabled ? "off" : "on",
>  			vm_total_pages);
>  }
> Index: linux-2.6.21-mm1/include/linux/mmzone.h
> ===================================================================
> --- linux-2.6.21-mm1.orig/include/linux/mmzone.h
> +++ linux-2.6.21-mm1/include/linux/mmzone.h
> @@ -610,6 +610,11 @@ int sysctl_min_unmapped_ratio_sysctl_han
>  int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
>  			struct file *, void __user *, size_t *, loff_t *);
>  
> +extern int numa_zonelist_order_handler(struct ctl_table *, int,
> +			struct file *, void __user *, size_t *, loff_t *);
> +extern char numa_zonelist_order[];
> +#define NUMA_ZONELIST_ORDER_LEN 16	/* string buffer size */
> +
>  #include <linux/topology.h>
>  /* Returns the number of the current Node. */
>  #ifndef numa_node_id
> 
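
The parsing and sysctl-write path in the patch above can be modeled in
user space as follows. This is a sketch under stated assumptions, not
the kernel handler: the toy_* names and struct toy_setting are
invented, and it validates the input before committing instead of using
the save-and-restore approach of numa_zonelist_order_handler(). It also
returns -1 for a rejected value, whereas the quoted handler returns 0
after restoring the old string. As in the patch, only the first
character of the input is examined.

```c
#include <assert.h>
#include <stdio.h>

#define TOY_ORDER_LEN 16

enum { TOY_ORDER_DEFAULT, TOY_ORDER_NODE, TOY_ORDER_ZONE };

/* Map the first character of s to an order; return -1 if unknown. */
static int toy_parse_order(const char *s, int *order)
{
	assert(s != NULL && order != NULL);
	switch (*s) {
	case 'd': case 'D':
		*order = TOY_ORDER_DEFAULT;
		return 0;
	case 'n': case 'N':
		*order = TOY_ORDER_NODE;
		return 0;
	case 'z': case 'Z':
		*order = TOY_ORDER_ZONE;
		return 0;
	default:
		return -1;
	}
}

struct toy_setting {
	char name[TOY_ORDER_LEN];	/* string shown to the user */
	int order;			/* parsed value */
	int rebuilds;			/* stands in for build_all_zonelists() */
};

/* Apply a write: reject bogus input, rebuild only if the order changed. */
static int toy_write_order(struct toy_setting *st, const char *input)
{
	int new_order;

	if (toy_parse_order(input, &new_order))
		return -1;	/* bogus value: name and order stay untouched */

	snprintf(st->name, sizeof(st->name), "%s", input);
	if (new_order != st->order) {
		st->order = new_order;
		st->rebuilds++;
	}
	return 0;
}
```

Writing "zone" and then "Z" counts one rebuild, not two, which mirrors
the oldval != zonelist_order check in the patch.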

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] change zonelist order v5 [2/3] automatic configuration
  2007-05-08 11:18 ` [PATCH] change zonelist order v5 [2/3] automatic configuration KAMEZAWA Hiroyuki
@ 2007-05-08 17:07   ` Lee Schermerhorn
  0 siblings, 0 replies; 22+ messages in thread
From: Lee Schermerhorn @ 2007-05-08 17:07 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-kernel, linux-mm, clameter, akpm, ak, jbarnes

On Tue, 2007-05-08 at 20:18 +0900, KAMEZAWA Hiroyuki wrote:
> Add auto zone ordering configuration.
> 
> This function will select ZONELIST_ORDER_NODE when
> 
> there is only ZONE_DMA or ZONE_DMA32
> || size of (ZONE_DMA/DMA32) > (System Total Memory)/2
> || Assume Node(A)
> 	Node(A) is big enough &&
> 	Node(A)'s ZONE_DMA/DMA32 occupies 60% of Node(A)'s memory.
> 	(In this case, ZONELIST_ORDER_ZONE may not offer enough locality...)
> 
> Otherwise, ZONELIST_ORDER_ZONE is selected.
> 
> There may be no single best and simple way to configure zone order. I wrote this
> based on my experience and the discussion on the list.
> 
> Anyway, a user can specify the zone order via boot option/sysctl.
> 
> Signed-Off-By: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 

Acked-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
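
The selection rule described above can be sketched in user space as
follows. This is a toy model with assumed names (struct toy_node,
toy_default_order) and it takes per-node page counts as plain numbers
rather than walking real zones. Note that the description says 60%
while the code in the patch below tests total_size * 70/100, so the
per-node threshold is left as a parameter here. The average is divided
by (number of nodes + 1), as the patch does.

```c
#include <assert.h>
#include <stddef.h>

enum { TOY_ORDER_NODE = 1, TOY_ORDER_ZONE = 2 };

struct toy_node {
	unsigned long low_pages;	/* pages in zones below ZONE_NORMAL */
	unsigned long total_pages;	/* all pages on this node */
};

/* Pick an order from per-node sizes; node_pct is the per-node threshold. */
static int toy_default_order(const struct toy_node *nodes, size_t nr_nodes,
			     unsigned long node_pct)
{
	unsigned long low = 0, total = 0, average;
	size_t i;

	assert(nodes != NULL && nr_nodes > 0);

	for (i = 0; i < nr_nodes; i++) {
		low += nodes[i].low_pages;
		total += nodes[i].total_pages;
	}

	/* No DMA area at all, or DMA/DMA32 is more than half of memory. */
	if (!low || low > total / 2)
		return TOY_ORDER_NODE;

	/* A big node that is mostly DMA/DMA32 also selects node order. */
	average = total / (nr_nodes + 1);
	for (i = 0; i < nr_nodes; i++) {
		const struct toy_node *n = &nodes[i];

		if (n->low_pages &&
		    n->total_pages > average &&	/* ignore small nodes */
		    n->low_pages > n->total_pages * node_pct / 100)
			return TOY_ORDER_NODE;
	}
	return TOY_ORDER_ZONE;
}
```

With two nodes where every page is DMA/DMA32, the first test already
selects node order. With a small DMA zone on node 0 and none on node 1,
none of the tests fire and zone order is selected.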

> ---
>  mm/page_alloc.c |   51 +++++++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 49 insertions(+), 2 deletions(-)
> 
> Index: linux-2.6.21-mm1/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.21-mm1.orig/mm/page_alloc.c
> +++ linux-2.6.21-mm1/mm/page_alloc.c
> @@ -2248,8 +2248,55 @@ static void build_zonelists_in_zone_orde
>  
>  static int default_zonelist_order(void)
>  {
> -	/* dummy, just select node order. */
> -	return ZONELIST_ORDER_NODE;
> +	int nid, zone_type;
> +	unsigned long low_kmem_size,total_size;
> +	struct zone *z;
> +	int average_size;
> +	/*
> +	 * ZONE_DMA and ZONE_DMA32 can be very small areas in the system.
> +	 * If they are really small and used heavily, the system can fall
> +	 * into OOM very easily.
> +	 * This function detects ZONE_DMA/DMA32 size and configures zone order.
> +	 */
> +	/* Is there ZONE_NORMAL ? (ex. ppc has only DMA zone..) */
> +	low_kmem_size = 0;
> +	total_size = 0;
> +	for_each_online_node(nid) {
> +		for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++) {
> +			z = &NODE_DATA(nid)->node_zones[zone_type];
> +			if (populated_zone(z)) {
> +				if (zone_type < ZONE_NORMAL)
> +					low_kmem_size += z->present_pages;
> +				total_size += z->present_pages;
> +			}
> +		}
> +	}
> +	if (!low_kmem_size ||  /* there is no DMA area. */
> +	    low_kmem_size > total_size/2) /* DMA/DMA32 is big. */
> +		return ZONELIST_ORDER_NODE;
> +	/*
> +	 * look into each node's config.
> +	 * If there is a node whose DMA/DMA32 memory is a very big part of
> +	 * its local memory, NODE_ORDER may be suitable.
> +	 */
> +	average_size = total_size / (num_online_nodes() + 1);
> +	for_each_online_node(nid) {
> +		low_kmem_size = 0;
> +		total_size = 0;
> +		for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++) {
> +			z = &NODE_DATA(nid)->node_zones[zone_type];
> +			if (populated_zone(z)) {
> +				if (zone_type < ZONE_NORMAL)
> +					low_kmem_size += z->present_pages;
> +				total_size += z->present_pages;
> +			}
> +		}
> +		if (low_kmem_size &&
> +		    total_size > average_size && /* ignore small node */
> +		    low_kmem_size > total_size * 70/100)
> +			return ZONELIST_ORDER_NODE;
> +	}
> +	return ZONELIST_ORDER_ZONE;
>  }
>  
> 
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] change zonelist order v5 [3/3] documentation
  2007-05-08 11:19 ` [PATCH] change zonelist order v5 [3/3] documentation KAMEZAWA Hiroyuki
@ 2007-05-08 17:08   ` Lee Schermerhorn
  2007-05-09  0:23     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 22+ messages in thread
From: Lee Schermerhorn @ 2007-05-08 17:08 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-kernel, linux-mm, clameter, akpm, ak, jbarnes

On Tue, 2007-05-08 at 20:19 +0900, KAMEZAWA Hiroyuki wrote:
> Patch for documentation.
> 
> Signed-Off-By: KAMEZAWA hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 

Will send followup patch with minor editorial changes.
Acked-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

> 
> ---
>  Documentation/kernel-parameters.txt |   10 +++++++
>  Documentation/sysctl/vm.txt         |   48 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 58 insertions(+)
> 
> Index: linux-2.6.21-mm1/Documentation/kernel-parameters.txt
> ===================================================================
> --- linux-2.6.21-mm1.orig/Documentation/kernel-parameters.txt
> +++ linux-2.6.21-mm1/Documentation/kernel-parameters.txt
> @@ -1233,6 +1233,16 @@ and is between 256 and 4096 characters. 
>  
>  	nr_uarts=	[SERIAL] maximum number of UARTs to be registered.
>  
> +	numa_zonelist_order= [KNL,BOOT]
> +			Select zonelist order for NUMA. The zonelist is used for
> +			deciding where the kernel allocates memory from.
> +			Default is automatic configuration. If "node" is
> +			specified, the zonelist is ordered by locality. This can
> +			offer the best locality, but the possibility of OOM may
> +			increase.  If "zone" is specified, the zonelist is
> +			ordered by zone_type.
> +			See Documentation/sysctl/vm.txt numa_zonelist_order.
> +
>  	opl3=		[HW,OSS]
>  			Format: <io>
>  
> Index: linux-2.6.21-mm1/Documentation/sysctl/vm.txt
> ===================================================================
> --- linux-2.6.21-mm1.orig/Documentation/sysctl/vm.txt
> +++ linux-2.6.21-mm1/Documentation/sysctl/vm.txt
> @@ -35,6 +35,7 @@ Currently, these files are in /proc/sys/
>  - stat_interval
>  - readahead_ratio
>  - readahead_hit_rate
> +- numa_zonelist_order
>  
>  ==============================================================
>  
> @@ -293,3 +294,49 @@ Possible values can be:
>  The larger value, the more capabilities, with more possible overheads.
>  
>  The default value is 1.
> +
> +==============================================================
> +
> +numa_zonelist_order
> +
> +This sysctl is only for NUMA.
> +Where memory is allocated from is controlled by the zonelist.
> +(This documentation ignores ZONE_HIGHMEM/ZONE_DMA32 for simplicity;
> + you may be able to read ZONE_DMA as ZONE_DMA32...)
> +
> +In the non-NUMA case, a zonelist for GFP_KERNEL is ordered as follows.
> +ZONE_NORMAL -> ZONE_DMA
> +This means that a memory allocation request for GFP_KERNEL will
> +get memory from ZONE_DMA only when ZONE_NORMAL is not available.
> +
> +In the NUMA case, you can think of the following 2 types of order.
> +Assume 2 node NUMA; below are two possible zonelists for Node(0)'s GFP_KERNEL.
> +
> +(A) Node(0) ZONE_NORMAL -> Node(0) ZONE_DMA -> Node(1) ZONE_NORMAL
> +(B) Node(0) ZONE_NORMAL -> Node(1) ZONE_NORMAL -> Node(0) ZONE_DMA.
> +
> +Type(A) offers the best locality for processes on Node(0), but ZONE_DMA
> +will be used before ZONE_NORMAL is exhausted. This increases the possibility
> +of out-of-memory (OOM) in ZONE_DMA because ZONE_DMA tends to be small.
> +
> +Type(B) cannot offer the best locality but is very robust against OOM of the DMA zone.
> +
> +Type(A) is called "Node" order. Type(B) is called "Zone" order.
> +
> +"Node order" orders the zonelists by node, then by zone within each node.
> +This will offer the best locality but increases the possibility of OOM.
> +Specify "[Nn]ode" for node order.
> +
> +"Zone order" preserves the DMA zone as long as possible but
> +results in off-node allocation [for node 0] earlier.
> +Specify "[Zz]one" for zone order.
> +
> +Specify "[Dd]efault" to request automatic configuration.  Autoconfiguration
> +will select "node" order in the following cases:
> +(1) if the DMA zone does not exist or
> +(2) if the DMA zone comprises greater than 50% of the available memory or
> +(3) if a node's DMA zone comprises greater than 60% of its local memory and
> +    the amount of local memory is big enough.
> +
> +Otherwise, "zone" order will be selected. Default order is recommended
> +unless this is causing problems for your system/application.
> 
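
For programs that want to switch the order without a shell, the write
to the sysctl file can be done from C. This is a minimal sketch: the
function name toy_set_zonelist_order is invented, and the path is
passed in by the caller so the helper can be tried against an ordinary
file. On a kernel carrying this patch set the path would be
/proc/sys/vm/numa_zonelist_order, and the write would need root.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write value (for example "node", "zone" or "default") to the file at
 * path. Returns 0 on success and -1 if the file cannot be opened or
 * the write fails.
 */
static int toy_set_zonelist_order(const char *path, const char *value)
{
	FILE *f;
	int ok;

	assert(path != NULL && value != NULL);

	f = fopen(path, "w");
	if (!f)
		return -1;

	ok = fputs(value, f) >= 0;
	/* fclose() flushes, so a failed flush is reported here. */
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

A call such as toy_set_zonelist_order("/proc/sys/vm/numa_zonelist_order",
"zone") corresponds to the "echo Z" command shown in the patch
description, since only the first character is parsed.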


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] change zonelist order v5 [1/3] implements zonelist order selection
  2007-05-08 17:06   ` Lee Schermerhorn
@ 2007-05-08 17:22     ` Christoph Lameter
  2007-05-08 17:33       ` Lee Schermerhorn
  0 siblings, 1 reply; 22+ messages in thread
From: Christoph Lameter @ 2007-05-08 17:22 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: KAMEZAWA Hiroyuki, linux-kernel, linux-mm, akpm, ak, jbarnes

On Tue, 8 May 2007, Lee Schermerhorn wrote:

> > Signed-Off-By: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Tested OK on my platform.
> Acked-by:   Lee Schermerhorn <lee.schermerhorn@hp.com>

So far testing is IA64 only?


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] change zonelist order v5 [1/3] implements zonelist order selection
  2007-05-08 17:22     ` Christoph Lameter
@ 2007-05-08 17:33       ` Lee Schermerhorn
  2007-05-08 18:05         ` Christoph Lameter
  0 siblings, 1 reply; 22+ messages in thread
From: Lee Schermerhorn @ 2007-05-08 17:33 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: KAMEZAWA Hiroyuki, linux-kernel, linux-mm, akpm, ak, jbarnes

On Tue, 2007-05-08 at 10:22 -0700, Christoph Lameter wrote:
> On Tue, 8 May 2007, Lee Schermerhorn wrote:
> 
> > > Signed-Off-By: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> > Tested OK on my platform.
> > Acked-by:   Lee Schermerhorn <lee.schermerhorn@hp.com>
> 
> So far testing is IA64 only?

Yes, so far.  I will test on an Opteron platform this pm.  
Assume that no news is good news.

Lee


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] change zonelist order v5 [1/3] implements zonelist order selection
  2007-05-08 17:33       ` Lee Schermerhorn
@ 2007-05-08 18:05         ` Christoph Lameter
  2007-05-08 20:37           ` Lee Schermerhorn
  0 siblings, 1 reply; 22+ messages in thread
From: Christoph Lameter @ 2007-05-08 18:05 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: KAMEZAWA Hiroyuki, linux-kernel, linux-mm, akpm, ak, jbarnes

On Tue, 8 May 2007, Lee Schermerhorn wrote:

> > So far testing is IA64 only?
> Yes, so far.  I will test on an Opteron platform this pm.  
> Assume that no news is good news.

A better assumption: no news -> no testing. You probably need a 
configuration with a couple of nodes. Maybe something less symmetric than 
Kame? I.e. have 4GB nodes and then DMA32 takes out a sizeable chunk of it?


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] change zonelist order v5 [1/3] implements zonelist order selection
  2007-05-08 18:05         ` Christoph Lameter
@ 2007-05-08 20:37           ` Lee Schermerhorn
  2007-05-09  0:29             ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 22+ messages in thread
From: Lee Schermerhorn @ 2007-05-08 20:37 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: KAMEZAWA Hiroyuki, linux-kernel, linux-mm, akpm, ak, jbarnes

On Tue, 2007-05-08 at 11:05 -0700, Christoph Lameter wrote:
> On Tue, 8 May 2007, Lee Schermerhorn wrote:
> 
> > > So far testing is IA64 only?
> > Yes, so far.  I will test on an Opteron platform this pm.  
> > Assume that no news is good news.
> 
> A better assumption: no news -> no testing. 

Before you asked, yes.  I meant after the last message, if you didn't
hear from me, everything worked fine.  And it does, sort of...


> You probably need a 
> configuration with a couple of nodes. Maybe something less symmetric than 
> Kame? I.e. have 4GB nodes and then DMA32 takes out a sizeable chunk of it?
> 

I tested on a 2 socket, 4GB Opteron blade.  All memory is either DMA32
or DMA.  I added some ad hoc instrumentation to the build_zonelist_*
functions to see what's happening.  I have verified that the patches
appear to build the zonelists correctly:

default -> node order, because "low_kmem" [DMA+DMA32] > total_mem/2.
Zone lists:
DMA:  DMA-0
DMA32: DMA32-0, DMA-0, DMA32-1
Normal:  same as DMA32 [no normal memory]
Movable:  same as DMA32 & Normal

explicit zone order also builds as expected:
DMA:  DMA-0
DMA32:  DMA32-1, DMA32-0, DMA-0
and same for normal and movable

However, a curious thing happens:  in either order, allocations seem to
overflow to the remote DMA32 before dipping into the DMA!!!?  I'm using
memtoy to create a large [3+GB] anon segment and locking it down.

I need to check a non-patched kernel to see if it behaves the same way,
and examine the code to see why...  For one thing, the kernel seems to
do a bit better at reclaiming memory before overflowing.  Eventually, it
will dip into DMA and finally get killed--OOM.

I'll be off-line most of the rest of the week, so I probably won't get
to investigate much further nor test on a larger socket count/memory
system until next week.  

Lee




^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] change zonelist order v5 [3/3] documentation
  2007-05-08 17:08   ` Lee Schermerhorn
@ 2007-05-09  0:23     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-05-09  0:23 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: linux-kernel, linux-mm, clameter, akpm, ak, jbarnes

On Tue, 08 May 2007 13:08:55 -0400
Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:

> On Tue, 2007-05-08 at 20:19 +0900, KAMEZAWA Hiroyuki wrote:
> > Patch for documentation.
> > 
> > Signed-Off-By: KAMEZAWA hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> 
> Will send followup patch with minor editorial changes.
> Acked-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
> 
Thank you. It's helpful.

-kame


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] change zonelist order v5 [1/3] implements zonelist order selection
  2007-05-08 20:37           ` Lee Schermerhorn
@ 2007-05-09  0:29             ` KAMEZAWA Hiroyuki
  2007-05-09  0:58               ` Andrew Morton
  0 siblings, 1 reply; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-05-09  0:29 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: clameter, linux-kernel, linux-mm, akpm, ak, jbarnes

On Tue, 08 May 2007 16:37:06 -0400
Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:

> > You probably need a 
> > configuration with a couple of nodes. Maybe something less symmetric than 
> > Kame? I.e. have 4GB nodes and then DMA32 takes out a sizeable chunk of it?
> > 
> 
> I tested on a 2 socket, 4GB Opteron blade.  All memory is either DMA32
> or DMA.  I added some ad hoc instrumentation to the build_zonelist_*
> functions to see what's happening.  I have verified that the patches
> appear to build the zonelists correctly:
> 
Thank you. Good news.

-Kame


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] change zonelist order v5 [1/3] implements zonelist order selection
  2007-05-09  0:29             ` KAMEZAWA Hiroyuki
@ 2007-05-09  0:58               ` Andrew Morton
  2007-05-09  1:07                 ` Christoph Lameter
                                   ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Andrew Morton @ 2007-05-09  0:58 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Lee Schermerhorn, clameter, linux-kernel, linux-mm, ak, jbarnes

On Wed, 9 May 2007 09:29:12 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Tue, 08 May 2007 16:37:06 -0400
> Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> 
> > > You probably need a 
> > > configuration with a couple of nodes. Maybe something less symmetric than 
> > > Kame? I.e. have 4GB nodes and then DMA32 takes out a sizeable chunk of it?
> > > 
> > 
> > I tested on a 2 socket, 4GB Opteron blade.  All memory is either DMA32
> > or DMA.  I added some ad hoc instrumentation to the build_zonelist_*
> > functions to see what's happening.  I have verified that the patches
> > appear to build the zonelists correctly:
> > 
> Thank you. Good news.
> 

I'm still cowering in fear of these patches, btw.

Please keep testing and sending them ;)


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] change zonelist order v5 [1/3] implements zonelist order selection
  2007-05-09  0:58               ` Andrew Morton
@ 2007-05-09  1:07                 ` Christoph Lameter
  2007-05-09  1:20                 ` KAMEZAWA Hiroyuki
  2007-05-09  4:12                 ` KAMEZAWA Hiroyuki
  2 siblings, 0 replies; 22+ messages in thread
From: Christoph Lameter @ 2007-05-09  1:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Lee Schermerhorn, linux-kernel, linux-mm, ak, jbarnes

On Tue, 8 May 2007, Andrew Morton wrote:

> I'm still cowering in fear of these patches, btw.
> 
> Please keep testing and sending them ;)

I hope you finally get a feel for the evil nature of ZONE_DMAxx. I 
think our x86_64 platform will have node 0 cordoned off for DMA if any 
DMA32 or DMA devices are on the system.



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] change zonelist order v5 [1/3] implements zonelist order selection
  2007-05-09  0:58               ` Andrew Morton
  2007-05-09  1:07                 ` Christoph Lameter
@ 2007-05-09  1:20                 ` KAMEZAWA Hiroyuki
  2007-05-09 13:55                   ` Lee Schermerhorn
  2007-05-09  4:12                 ` KAMEZAWA Hiroyuki
  2 siblings, 1 reply; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-05-09  1:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Lee.Schermerhorn, clameter, linux-kernel, linux-mm, ak, jbarnes

On Tue, 8 May 2007 17:58:55 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Wed, 9 May 2007 09:29:12 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > On Tue, 08 May 2007 16:37:06 -0400
> > Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> > 
> > > > You probably need a 
> > > > configuration with a couple of nodes. Maybe something less symmetric than 
> > > > Kame? I.e. have 4GB nodes and then DMA32 takes out a sizeable chunk of it?
> > > > 
> > > 
> > > I tested on a 2 socket, 4GB Opteron blade.  All memory is either DMA32
> > > or DMA.  I added some ad hoc instrumentation to the build_zonelist_*
> > > functions to see what's happening.  I have verified that the patches
> > > appear to build the zonelists correctly:
> > > 
> > Thank you. Good news.
> > 
> 
> I'm still cowering in fear of these patches, btw.
> 
Hmm, do the patches look unclear?

> Please keep testing and sending them ;)
> 
Okay, but it seems I need other testers...

I wonder whether I should drop the sysctl from this patch and support only the
boot option in the next version.

-Kame


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] change zonelist order v5 [1/3] implements zonelist order selection
  2007-05-09  0:58               ` Andrew Morton
  2007-05-09  1:07                 ` Christoph Lameter
  2007-05-09  1:20                 ` KAMEZAWA Hiroyuki
@ 2007-05-09  4:12                 ` KAMEZAWA Hiroyuki
  2007-05-09  8:53                   ` Andy Whitcroft
  2 siblings, 1 reply; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-05-09  4:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Lee.Schermerhorn, clameter, linux-kernel, linux-mm, ak, jbarnes

On Tue, 8 May 2007 17:58:55 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> 
> I'm still cowering in fear of these patches, btw.
> 
> Please keep testing and sending them ;)
> 
I'll repost "Request-For-Test" version "6" against next -mm and
add x86 as my test target at least. (I don't have other hardware.)


-Kame


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] change zonelist order v5 [1/3] implements zonelist order selection
  2007-05-09  4:12                 ` KAMEZAWA Hiroyuki
@ 2007-05-09  8:53                   ` Andy Whitcroft
  2007-05-09  9:04                     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 22+ messages in thread
From: Andy Whitcroft @ 2007-05-09  8:53 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Lee.Schermerhorn, clameter, linux-kernel,
	linux-mm, ak, jbarnes

KAMEZAWA Hiroyuki wrote:
> On Tue, 8 May 2007 17:58:55 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
> 
>> I'm still cowering in fear of these patches, btw.
>>
>> Please keep testing and sending them ;)
>>
> I'll repost "Request-For-Test" version "6" against next -mm and
> add x86 as my test target at least. (I don't have other hardware.)

Copy me on the email and I'll shove the patches through TKO.

-apw


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] change zonelist order v5 [1/3] implements zonelist order selection
  2007-05-09  8:53                   ` Andy Whitcroft
@ 2007-05-09  9:04                     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-05-09  9:04 UTC (permalink / raw)
  To: Andy Whitcroft
  Cc: akpm, Lee.Schermerhorn, clameter, linux-kernel, linux-mm, ak, jbarnes

On Wed, 09 May 2007 09:53:34 +0100
Andy Whitcroft <apw@shadowen.org> wrote:

> KAMEZAWA Hiroyuki wrote:
> > On Tue, 8 May 2007 17:58:55 -0700
> > Andrew Morton <akpm@linux-foundation.org> wrote:
> > 
> >> I'm still cowering in fear of these patches, btw.
> >>
> >> Please keep testing and sending them ;)
> >>
> > I'll repost "Request-For-Test" version "6" against next -mm and
> > add x86 as my test target at least. (I don't have other hardware.)
> 
> Copy me on the email and I'll shove the patches through TKO.
> 
Oh, thank you! I think I will be able to post v6 tomorrow.

Regards,
-Kame


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] change zonelist order v5 [1/3] implements zonelist order selection
  2007-05-09  1:20                 ` KAMEZAWA Hiroyuki
@ 2007-05-09 13:55                   ` Lee Schermerhorn
  0 siblings, 0 replies; 22+ messages in thread
From: Lee Schermerhorn @ 2007-05-09 13:55 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, clameter, linux-kernel, linux-mm, ak, jbarnes

On Wed, 2007-05-09 at 10:20 +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 8 May 2007 17:58:55 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > On Wed, 9 May 2007 09:29:12 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > 
> > > On Tue, 08 May 2007 16:37:06 -0400
> > > Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> > > 
> > > > > You probably need a 
> > > > > configuration with a couple of nodes. Maybe something less symmetric than
> > > > > Kame? I.e. have 4GB nodes and then DMA32 takes out a sizeable chunk of it?
> > > > > 
> > > > 
> > > > I tested on a 2 socket, 4GB Opteron blade.  All memory is either DMA32
> > > > or DMA.  I added some ad hoc instrumentation to the build_zonelist_*
> > > > functions to see what's happening.  I have verified that the patches
> > > > appear to build the zonelists correctly:
> > > > 
> > > Thank you. good news.
> > > 
> > 
> > I'm still cowering in fear of these patches, btw.
> > 
> Hmm, do the patches look unclear?
> 
> > Please keep testing and sending them ;)
> > 
> Okay, but it seems I need other testers...
> 
> I wonder whether I should drop the sysctl from this patch and support only the
> boot option in the next version.

I think the system still needs to be able to rebuild the zonelists at
run-time in response to memory hotplug [someday, maybe?].  And for now,
the sysctl is very useful for testing.  And, it does avoid a
reboot--quite expensive, timewise, on large platforms--should one find
that the default order is not appropriate.

Lee


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2007-05-09 13:55 UTC | newest]

Thread overview: 22+ messages
2007-05-08 11:14 [PATCH] change zonelist order v5 [0/3] KAMEZAWA Hiroyuki
2007-05-08 11:16 ` [PATCH] change zonelist order v5 [1/3] implements zonelist order selection KAMEZAWA Hiroyuki
2007-05-08 17:06   ` Lee Schermerhorn
2007-05-08 17:22     ` Christoph Lameter
2007-05-08 17:33       ` Lee Schermerhorn
2007-05-08 18:05         ` Christoph Lameter
2007-05-08 20:37           ` Lee Schermerhorn
2007-05-09  0:29             ` KAMEZAWA Hiroyuki
2007-05-09  0:58               ` Andrew Morton
2007-05-09  1:07                 ` Christoph Lameter
2007-05-09  1:20                 ` KAMEZAWA Hiroyuki
2007-05-09 13:55                   ` Lee Schermerhorn
2007-05-09  4:12                 ` KAMEZAWA Hiroyuki
2007-05-09  8:53                   ` Andy Whitcroft
2007-05-09  9:04                     ` KAMEZAWA Hiroyuki
2007-05-08 11:18 ` [PATCH] change zonelist order v5 [2/3] automatic configuration KAMEZAWA Hiroyuki
2007-05-08 17:07   ` Lee Schermerhorn
2007-05-08 11:19 ` [PATCH] change zonelist order v5 [3/3] documentation KAMEZAWA Hiroyuki
2007-05-08 17:08   ` Lee Schermerhorn
2007-05-09  0:23     ` KAMEZAWA Hiroyuki
2007-05-08 12:04 ` [PATCH] change zonelist order v5 [4/3] compile fix KAMEZAWA Hiroyuki
2007-05-08 16:14 ` [PATCH] change zonelist order v5 [0/3] Christoph Lameter
