linux-mm.kvack.org archive mirror
* [RFC 00/13] RFC memoryless node handling fixes
@ 2007-06-14  7:50 clameter
  2007-06-14  7:50 ` [RFC 01/13] NUMA: introduce node_memory_map clameter
                   ` (13 more replies)
  0 siblings, 14 replies; 20+ messages in thread
From: clameter @ 2007-06-14  7:50 UTC (permalink / raw)
  To: Nishanth Aravamudan; +Cc: Lee Schermerhorn, linux-mm

This has now become a longer series since I have seen a couple of things in
various places where we do not take into account memoryless nodes.

I changed the GFP_THISNODE fix to generate a new set of zonelists. GFP_THISNODE
will then simply use a zonelist that only has the zones of the node.

I have only tested this by booting on an IA64 simulator. Please review. I do
not have a real system with a memoryless node.


* [RFC 01/13] NUMA: introduce node_memory_map
  2007-06-14  7:50 [RFC 00/13] RFC memoryless node handling fixes clameter
@ 2007-06-14  7:50 ` clameter
  2007-06-14  7:50 ` [RFC 02/13] Fix MPOL_INTERLEAVE behavior for memoryless nodes clameter
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 20+ messages in thread
From: clameter @ 2007-06-14  7:50 UTC (permalink / raw)
  To: Nishanth Aravamudan; +Cc: Lee Schermerhorn, linux-mm

[-- Attachment #1: node_memory_map --]
[-- Type: text/plain, Size: 4417 bytes --]

It is necessary to know if nodes have memory since we have recently
begun to add support for memoryless nodes. For that purpose we introduce
a new bitmap called

node_memory_map

A node has its bit in node_memory_map set if it has memory. If a node
has memory then it has at least one zone defined in its pgdat structure
that is located in the pgdat itself.

The node_memory_map can then be used in various places to ensure that we
do the right thing when we encounter a memoryless node.
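
As an illustration of the intended usage, a minimal sketch of a
hypothetical caller (struct foo and foo_array are made up and not part
of this patch):

	int nid;
	struct foo *f;

	for_each_memory_node(nid) {
		/* Only nodes that have memory get a control structure;
		 * memoryless nodes are never visited at all. */
		f = kmalloc_node(sizeof(*f), GFP_KERNEL, nid);
		if (!f)
			return -ENOMEM;
		foo_array[nid] = f;
	}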

Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.22-rc4-mm2/include/linux/nodemask.h
===================================================================
--- linux-2.6.22-rc4-mm2.orig/include/linux/nodemask.h	2007-06-12 12:32:38.000000000 -0700
+++ linux-2.6.22-rc4-mm2/include/linux/nodemask.h	2007-06-13 23:09:30.000000000 -0700
@@ -64,12 +64,16 @@
  *
  * int node_online(node)		Is some node online?
  * int node_possible(node)		Is some node possible?
+ * int node_memory(node)		Does a node have memory?
  *
  * int any_online_node(mask)		First online node in mask
  *
  * node_set_online(node)		set bit 'node' in node_online_map
  * node_set_offline(node)		clear bit 'node' in node_online_map
  *
+ * node_set_has_memory(node)		set bit 'node' in node_memory_map
+ * node_set_no_memory(node)		clear bit 'node' in node_memory_map
+ *
  * for_each_node(node)			for-loop node over node_possible_map
  * for_each_online_node(node)		for-loop node over node_online_map
  *
@@ -344,12 +348,14 @@ static inline void __nodes_remap(nodemas
 
 extern nodemask_t node_online_map;
 extern nodemask_t node_possible_map;
+extern nodemask_t node_memory_map;
 
 #if MAX_NUMNODES > 1
 #define num_online_nodes()	nodes_weight(node_online_map)
 #define num_possible_nodes()	nodes_weight(node_possible_map)
 #define node_online(node)	node_isset((node), node_online_map)
 #define node_possible(node)	node_isset((node), node_possible_map)
+#define node_memory(node)	node_isset((node), node_memory_map)
 #define first_online_node	first_node(node_online_map)
 #define next_online_node(nid)	next_node((nid), node_online_map)
 extern int nr_node_ids;
@@ -358,6 +364,8 @@ extern int nr_node_ids;
 #define num_possible_nodes()	1
 #define node_online(node)	((node) == 0)
 #define node_possible(node)	((node) == 0)
+#define node_memory(node)	((node) == 0)
+#define node_populated(node)	((node) == 0)
 #define first_online_node	0
 #define next_online_node(nid)	(MAX_NUMNODES)
 #define nr_node_ids		1
@@ -375,7 +383,11 @@ extern int nr_node_ids;
 #define node_set_online(node)	   set_bit((node), node_online_map.bits)
 #define node_set_offline(node)	   clear_bit((node), node_online_map.bits)
 
+#define node_set_has_memory(node)  set_bit((node), node_memory_map.bits)
+#define node_set_no_memory(node)   clear_bit((node), node_memory_map.bits)
+
 #define for_each_node(node)	   for_each_node_mask((node), node_possible_map)
 #define for_each_online_node(node) for_each_node_mask((node), node_online_map)
+#define for_each_memory_node(node) for_each_node_mask((node), node_memory_map)
 
 #endif /* __LINUX_NODEMASK_H */
Index: linux-2.6.22-rc4-mm2/mm/page_alloc.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/page_alloc.c	2007-06-12 12:32:38.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/page_alloc.c	2007-06-13 23:09:58.000000000 -0700
@@ -54,6 +54,9 @@ nodemask_t node_online_map __read_mostly
 EXPORT_SYMBOL(node_online_map);
 nodemask_t node_possible_map __read_mostly = NODE_MASK_ALL;
 EXPORT_SYMBOL(node_possible_map);
+nodemask_t node_memory_map __read_mostly = NODE_MASK_NONE;
+EXPORT_SYMBOL(node_memory_map);
+
 unsigned long totalram_pages __read_mostly;
 unsigned long totalreserve_pages __read_mostly;
 long nr_swap_pages;
@@ -2299,6 +2302,9 @@ static void build_zonelists(pg_data_t *p
 		/* calculate node order -- i.e., DMA last! */
 		build_zonelists_in_zone_order(pgdat, j);
 	}
+
+	if (pgdat->node_present_pages)
+		node_set_has_memory(local_node);
 }
 
 /* Construct the zonelist performance cache - see further mmzone.h */


* [RFC 02/13] Fix MPOL_INTERLEAVE behavior for memoryless nodes
  2007-06-14  7:50 [RFC 00/13] RFC memoryless node handling fixes clameter
  2007-06-14  7:50 ` [RFC 01/13] NUMA: introduce node_memory_map clameter
@ 2007-06-14  7:50 ` clameter
  2007-06-14  7:50 ` [RFC 03/13] OOM: use the node_memory_map instead of constructing one on the fly clameter
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 20+ messages in thread
From: clameter @ 2007-06-14  7:50 UTC (permalink / raw)
  To: Nishanth Aravamudan; +Cc: Lee Schermerhorn, linux-mm

[-- Attachment #1: fix_interleave --]
[-- Type: text/plain, Size: 1353 bytes --]

MPOL_INTERLEAVE currently simply loops over all nodes. Allocations on
memoryless nodes will be redirected to nodes with memory. This results in
an imbalance because the nodes neighboring a memoryless node will get
significantly more interleave hits than the rest of the nodes on the system.

We can avoid this imbalance by clearing the nodes in the interleave node
set that have no memory.
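
To illustrate the imbalance, a small userspace model (plain C; the
4-node topology is made up and the model assumes that an allocation on
a memoryless node simply falls back to the next node):

	#include <stdio.h>

	#define NODES 4

	int main(void)
	{
		int has_memory[NODES] = { 1, 1, 0, 1 };	/* node 2 is memoryless */
		int hits[NODES] = { 0 };
		int i, n;

		for (i = 0; i < 12000; i++) {
			n = i % NODES;			/* naive interleave */
			while (!has_memory[n])
				n = (n + 1) % NODES;	/* fallback */
			hits[n]++;
		}
		for (n = 0; n < NODES; n++)
			printf("node %d: %d hits\n", n, hits[n]);
		return 0;
	}

This prints 3000/3000/0/6000: node 3 receives twice its fair share.
Interleaving over node_memory_map instead gives each memory node an
even 4000 hits.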

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>

Index: linux-2.6.22-rc4-mm2/mm/mempolicy.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/mempolicy.c	2007-06-13 23:06:14.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/mempolicy.c	2007-06-14 00:49:43.000000000 -0700
@@ -185,7 +185,8 @@ static struct mempolicy *mpol_new(int mo
 	switch (mode) {
 	case MPOL_INTERLEAVE:
 		policy->v.nodes = *nodes;
-		if (nodes_weight(*nodes) == 0) {
+		nodes_and(policy->v.nodes, policy->v.nodes, node_memory_map);
+		if (nodes_weight(policy->v.nodes) == 0) {
 			kmem_cache_free(policy_cache, policy);
 			return ERR_PTR(-EINVAL);
 		}


* [RFC 03/13] OOM: use the node_memory_map instead of constructing one on the fly
  2007-06-14  7:50 [RFC 00/13] RFC memoryless node handling fixes clameter
  2007-06-14  7:50 ` [RFC 01/13] NUMA: introduce node_memory_map clameter
  2007-06-14  7:50 ` [RFC 02/13] Fix MPOL_INTERLEAVE behavior for memoryless nodes clameter
@ 2007-06-14  7:50 ` clameter
  2007-06-14  7:50 ` [RFC 04/13] Memoryless Nodes: No need for kswapd clameter
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 20+ messages in thread
From: clameter @ 2007-06-14  7:50 UTC (permalink / raw)
  To: Nishanth Aravamudan; +Cc: Lee Schermerhorn, linux-mm

[-- Attachment #1: nodeless_oom_kill --]
[-- Type: text/plain, Size: 1102 bytes --]

constrained_alloc() builds its own memory map for nodes with memory.
We have that available in node_memory_map now. So simplify the code.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.22-rc4-mm2/mm/oom_kill.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/oom_kill.c	2007-06-13 23:11:32.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/oom_kill.c	2007-06-13 23:12:39.000000000 -0700
@@ -176,14 +176,7 @@ static inline int constrained_alloc(stru
 {
 #ifdef CONFIG_NUMA
 	struct zone **z;
-	nodemask_t nodes;
-	int node;
-
-	nodes_clear(nodes);
-	/* node has memory ? */
-	for_each_online_node(node)
-		if (NODE_DATA(node)->node_present_pages)
-			node_set(node, nodes);
+	nodemask_t nodes = node_memory_map;
 
 	for (z = zonelist->zones; *z; z++)
 		if (cpuset_zone_allowed_softwall(*z, gfp_mask))


* [RFC 04/13] Memoryless Nodes: No need for kswapd
  2007-06-14  7:50 [RFC 00/13] RFC memoryless node handling fixes clameter
                   ` (2 preceding siblings ...)
  2007-06-14  7:50 ` [RFC 03/13] OOM: use the node_memory_map instead of constructing one on the fly clameter
@ 2007-06-14  7:50 ` clameter
  2007-06-14  7:50 ` [RFC 05/13] Memoryless Node: Slab support clameter
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 20+ messages in thread
From: clameter @ 2007-06-14  7:50 UTC (permalink / raw)
  To: Nishanth Aravamudan; +Cc: Lee Schermerhorn, linux-mm

[-- Attachment #1: nodeless_no_kswapd --]
[-- Type: text/plain, Size: 867 bytes --]

A node without memory does not need a kswapd. So use the memory map instead
of the online map to start kswapd.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.22-rc4-mm2/mm/vmscan.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/vmscan.c	2007-06-13 23:15:05.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/vmscan.c	2007-06-13 23:16:30.000000000 -0700
@@ -1716,7 +1716,7 @@ static int __init kswapd_init(void)
 	int nid;
 
 	swap_setup();
-	for_each_online_node(nid)
+	for_each_memory_node(nid)
  		kswapd_run(nid);
 	hotcpu_notifier(cpu_callback, 0);
 	return 0;


* [RFC 05/13] Memoryless Node: Slab support
  2007-06-14  7:50 [RFC 00/13] RFC memoryless node handling fixes clameter
                   ` (3 preceding siblings ...)
  2007-06-14  7:50 ` [RFC 04/13] Memoryless Nodes: No need for kswapd clameter
@ 2007-06-14  7:50 ` clameter
  2007-06-14  7:50 ` [RFC 06/13] Memoryless nodes: SLUB support clameter
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 20+ messages in thread
From: clameter @ 2007-06-14  7:50 UTC (permalink / raw)
  To: Nishanth Aravamudan; +Cc: Lee Schermerhorn, linux-mm

[-- Attachment #1: nodeless_slab --]
[-- Type: text/plain, Size: 2095 bytes --]

Slab should not allocate control structures for nodes without memory. This
may work right now but it is unreliable since not all allocations can fall
back due to the use of GFP_THISNODE.

Switching a few for_each_online_node's to for_each_memory_node will allow us to
only allocate for nodes that actually have memory.
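
The failure mode can be seen in a rough sketch (hypothetical, not taken
verbatim from slab.c): the per-node page allocations boil down to

	/* GFP_THISNODE forbids fallback: on a memoryless node this
	 * returns NULL instead of memory from a neighboring node. */
	page = alloc_pages_node(nid, flags | GFP_THISNODE, cachep->gfporder);
	if (!page)
		return NULL;

so merely visiting a memoryless node means the allocation cannot
succeed.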

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.22-rc4-mm2/mm/slab.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/slab.c	2007-06-13 23:16:51.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/slab.c	2007-06-13 23:20:29.000000000 -0700
@@ -1562,7 +1562,7 @@ void __init kmem_cache_init(void)
 		/* Replace the static kmem_list3 structures for the boot cpu */
 		init_list(&cache_cache, &initkmem_list3[CACHE_CACHE], node);
 
-		for_each_online_node(nid) {
+		for_each_memory_node(nid) {
 			init_list(malloc_sizes[INDEX_AC].cs_cachep,
 				  &initkmem_list3[SIZE_AC + nid], nid);
 
@@ -1940,7 +1940,7 @@ static void __init set_up_list3s(struct 
 {
 	int node;
 
-	for_each_online_node(node) {
+	for_each_memory_node(node) {
 		cachep->nodelists[node] = &initkmem_list3[index + node];
 		cachep->nodelists[node]->next_reap = jiffies +
 		    REAPTIMEOUT_LIST3 +
@@ -2071,7 +2071,7 @@ static int __init_refok setup_cpu_cache(
 			g_cpucache_up = PARTIAL_L3;
 		} else {
 			int node;
-			for_each_online_node(node) {
+			for_each_memory_node(node) {
 				cachep->nodelists[node] =
 				    kmalloc_node(sizeof(struct kmem_list3),
 						GFP_KERNEL, node);
@@ -3828,7 +3828,7 @@ static int alloc_kmemlist(struct kmem_ca
 	struct array_cache *new_shared;
 	struct array_cache **new_alien = NULL;
 
-	for_each_online_node(node) {
+	for_each_memory_node(node) {
 
                 if (use_alien_caches) {
                         new_alien = alloc_alien_cache(node, cachep->limit);


* [RFC 06/13] Memoryless nodes: SLUB support
  2007-06-14  7:50 [RFC 00/13] RFC memoryless node handling fixes clameter
                   ` (4 preceding siblings ...)
  2007-06-14  7:50 ` [RFC 05/13] Memoryless Node: Slab support clameter
@ 2007-06-14  7:50 ` clameter
  2007-06-14  7:50 ` [RFC 07/13] Uncached allocator: Handle memoryless nodes clameter
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 20+ messages in thread
From: clameter @ 2007-06-14  7:50 UTC (permalink / raw)
  To: Nishanth Aravamudan; +Cc: Lee Schermerhorn, linux-mm

[-- Attachment #1: nodeless_slub --]
[-- Type: text/plain, Size: 2795 bytes --]

Simply switch all for_each_online_node instances to for_each_memory_node.
That way SLUB only operates on nodes with memory. Any allocation attempt on
a memoryless node will fail, whereupon SLUB will fetch memory from a nearby
node (depending on how memory policies and cpusets describe the fallback).

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.22-rc4-mm2/mm/slub.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/slub.c	2007-06-13 23:23:35.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/slub.c	2007-06-13 23:23:59.000000000 -0700
@@ -1887,7 +1887,7 @@ static void free_kmem_cache_nodes(struct
 {
 	int node;
 
-	for_each_online_node(node) {
+	for_each_memory_node(node) {
 		struct kmem_cache_node *n = s->node[node];
 		if (n && n != &s->local_node)
 			kmem_cache_free(kmalloc_caches, n);
@@ -1905,7 +1905,7 @@ static int init_kmem_cache_nodes(struct 
 	else
 		local_node = 0;
 
-	for_each_online_node(node) {
+	for_each_memory_node(node) {
 		struct kmem_cache_node *n;
 
 		if (local_node == node)
@@ -2159,7 +2159,7 @@ static int kmem_cache_close(struct kmem_
 	flush_all(s);
 
 	/* Attempt to free all objects */
-	for_each_online_node(node) {
+	for_each_memory_node(node) {
 		struct kmem_cache_node *n = get_node(s, node);
 
 		n->nr_partial -= free_list(s, n, &n->partial);
@@ -2406,7 +2406,7 @@ int kmem_cache_shrink(struct kmem_cache 
 		return -ENOMEM;
 
 	flush_all(s);
-	for_each_online_node(node) {
+	for_each_memory_node(node) {
 		n = get_node(s, node);
 
 		if (!n->nr_partial)
@@ -2842,7 +2842,7 @@ static unsigned long validate_slab_cache
 	unsigned long count = 0;
 
 	flush_all(s);
-	for_each_online_node(node) {
+	for_each_memory_node(node) {
 		struct kmem_cache_node *n = get_node(s, node);
 
 		count += validate_slab_node(s, n);
@@ -3064,7 +3064,7 @@ static int list_locations(struct kmem_ca
 	/* Push back cpu slabs */
 	flush_all(s);
 
-	for_each_online_node(node) {
+	for_each_memory_node(node) {
 		struct kmem_cache_node *n = get_node(s, node);
 		unsigned long flags;
 		struct page *page;
@@ -3189,7 +3189,7 @@ static unsigned long slab_objects(struct
 		}
 	}
 
-	for_each_online_node(node) {
+	for_each_memory_node(node) {
 		struct kmem_cache_node *n = get_node(s, node);
 
 		if (flags & SO_PARTIAL) {
@@ -3217,7 +3217,7 @@ static unsigned long slab_objects(struct
 
 	x = sprintf(buf, "%lu", total);
 #ifdef CONFIG_NUMA
-	for_each_online_node(node)
+	for_each_memory_node(node)
 		if (nodes[node])
 			x += sprintf(buf + x, " N%d=%lu",
 					node, nodes[node]);


* [RFC 07/13] Uncached allocator: Handle memoryless nodes
  2007-06-14  7:50 [RFC 00/13] RFC memoryless node handling fixes clameter
                   ` (5 preceding siblings ...)
  2007-06-14  7:50 ` [RFC 06/13] Memoryless nodes: SLUB support clameter
@ 2007-06-14  7:50 ` clameter
  2007-06-14  7:50 ` [RFC 08/13] Memoryless node: Allow profiling data to fall back to other nodes clameter
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 20+ messages in thread
From: clameter @ 2007-06-14  7:50 UTC (permalink / raw)
  To: Nishanth Aravamudan; +Cc: Lee Schermerhorn, linux-mm

[-- Attachment #1: nodeless_mspec --]
[-- Type: text/plain, Size: 1776 bytes --]

The node_online checks in the uncached allocator exist to make sure that
memory is available on those nodes. Thus switch all of these checks to use
the node_memory and for_each_memory_node functions.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.22-rc4-mm2/arch/ia64/kernel/uncached.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/arch/ia64/kernel/uncached.c	2007-06-13 23:29:58.000000000 -0700
+++ linux-2.6.22-rc4-mm2/arch/ia64/kernel/uncached.c	2007-06-13 23:32:35.000000000 -0700
@@ -196,7 +196,7 @@ unsigned long uncached_alloc_page(int st
 	nid = starting_nid;
 
 	do {
-		if (!node_online(nid))
+		if (!node_memory(nid))
 			continue;
 		uc_pool = &uncached_pools[nid];
 		if (uc_pool->pool == NULL)
@@ -268,7 +268,7 @@ static int __init uncached_init(void)
 {
 	int nid;
 
-	for_each_online_node(nid) {
+	for_each_memory_node(nid) {
 		uncached_pools[nid].pool = gen_pool_create(PAGE_SHIFT, nid);
 		mutex_init(&uncached_pools[nid].add_chunk_mutex);
 	}
Index: linux-2.6.22-rc4-mm2/drivers/char/mspec.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/drivers/char/mspec.c	2007-06-13 23:28:15.000000000 -0700
+++ linux-2.6.22-rc4-mm2/drivers/char/mspec.c	2007-06-13 23:29:35.000000000 -0700
@@ -353,7 +353,7 @@ mspec_init(void)
 		is_sn2 = 1;
 		if (is_shub2()) {
 			ret = -ENOMEM;
-			for_each_online_node(nid) {
+			for_each_memory_node(nid) {
 				int actual_nid;
 				int nasid;
 				unsigned long phys;


* [RFC 08/13] Memoryless node: Allow profiling data to fall back to other nodes
  2007-06-14  7:50 [RFC 00/13] RFC memoryless node handling fixes clameter
                   ` (6 preceding siblings ...)
  2007-06-14  7:50 ` [RFC 07/13] Uncached allocator: Handle memoryless nodes clameter
@ 2007-06-14  7:50 ` clameter
  2007-06-14  7:50 ` [RFC 09/13] Memoryless nodes: Update memory policy and page migration clameter
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 20+ messages in thread
From: clameter @ 2007-06-14  7:50 UTC (permalink / raw)
  To: Nishanth Aravamudan; +Cc: Lee Schermerhorn, linux-mm

[-- Attachment #1: nodeless_profile --]
[-- Type: text/plain, Size: 1334 bytes --]

Processors on memoryless nodes must be able to fall back to remote nodes
in order to get a profiling buffer. This may lead to excessive NUMA traffic
but I think we should allow this rather than failing.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.22-rc4-mm2/kernel/profile.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/kernel/profile.c	2007-06-13 23:36:42.000000000 -0700
+++ linux-2.6.22-rc4-mm2/kernel/profile.c	2007-06-13 23:36:55.000000000 -0700
@@ -346,7 +346,7 @@ static int __devinit profile_cpu_callbac
 		per_cpu(cpu_profile_flip, cpu) = 0;
 		if (!per_cpu(cpu_profile_hits, cpu)[1]) {
 			page = alloc_pages_node(node,
-					GFP_KERNEL | __GFP_ZERO | GFP_THISNODE,
+					GFP_KERNEL | __GFP_ZERO,
 					0);
 			if (!page)
 				return NOTIFY_BAD;
@@ -354,7 +354,7 @@ static int __devinit profile_cpu_callbac
 		}
 		if (!per_cpu(cpu_profile_hits, cpu)[0]) {
 			page = alloc_pages_node(node,
-					GFP_KERNEL | __GFP_ZERO | GFP_THISNODE,
+					GFP_KERNEL | __GFP_ZERO,
 					0);
 			if (!page)
 				goto out_free;


* [RFC 09/13] Memoryless nodes: Update memory policy and page migration
  2007-06-14  7:50 [RFC 00/13] RFC memoryless node handling fixes clameter
                   ` (7 preceding siblings ...)
  2007-06-14  7:50 ` [RFC 08/13] Memoryless node: Allow profiling data to fall back to other nodes clameter
@ 2007-06-14  7:50 ` clameter
  2007-06-14  7:50 ` [RFC 10/13] Memoryless nodes: Fix GFP_THISNODE behavior clameter
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 20+ messages in thread
From: clameter @ 2007-06-14  7:50 UTC (permalink / raw)
  To: Nishanth Aravamudan; +Cc: Lee Schermerhorn, linux-mm

[-- Attachment #1: nodeless_migrate --]
[-- Type: text/plain, Size: 4060 bytes --]

Online nodes may now have no memory. The checks and initialization must
therefore be changed to no longer use the online functions.

This will correctly initialize the interleave on bootup to only target
nodes with memory and will make sys_move_pages return an error when a page
is to be moved to a memoryless node. Similarly we will get an error if
MPOL_BIND or MPOL_INTERLEAVE is used on a memoryless node.

These are somewhat new semantics. So far one could specify memoryless nodes
and we would maybe do the right thing and just ignore the node (or we'd do
something strange, as with MPOL_INTERLEAVE). If we want to allow the
specification of memoryless nodes via memory policies then we need to keep
checking for online nodes.
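
The new semantics can be observed from userspace roughly as follows (a
sketch assuming libnuma's <numaif.h> and a machine where node 2 is
online but memoryless):

	#include <numaif.h>
	#include <errno.h>
	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		unsigned long mask = 1UL << 2;	/* only the memoryless node */

		/* The nodemask is now checked against node_memory_map,
		 * so this fails with EINVAL instead of being accepted. */
		if (set_mempolicy(MPOL_BIND, &mask, 8 * sizeof(mask)) < 0)
			printf("MPOL_BIND: %s\n", strerror(errno));
		return 0;
	}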

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.22-rc4-mm2/mm/migrate.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/migrate.c	2007-06-13 23:40:38.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/migrate.c	2007-06-13 23:41:26.000000000 -0700
@@ -963,7 +963,7 @@ asmlinkage long sys_move_pages(pid_t pid
 				goto out;
 
 			err = -ENODEV;
-			if (!node_online(node))
+			if (!node_memory(node))
 				goto out;
 
 			err = -EACCES;
Index: linux-2.6.22-rc4-mm2/mm/mempolicy.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/mempolicy.c	2007-06-13 23:42:50.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/mempolicy.c	2007-06-13 23:44:57.000000000 -0700
@@ -130,7 +130,7 @@ static int mpol_check_policy(int mode, n
 			return -EINVAL;
 		break;
 	}
-	return nodes_subset(*nodes, node_online_map) ? 0 : -EINVAL;
+	return nodes_subset(*nodes, node_memory_map) ? 0 : -EINVAL;
 }
 
 /* Generate a custom zonelist for the BIND policy. */
@@ -495,9 +495,9 @@ static void get_zonemask(struct mempolic
 		*nodes = p->v.nodes;
 		break;
 	case MPOL_PREFERRED:
-		/* or use current node instead of online map? */
+		/* or use current node instead of memory_map? */
 		if (p->v.preferred_node < 0)
-			*nodes = node_online_map;
+			*nodes = node_memory_map;
 		else
 			node_set(p->v.preferred_node, *nodes);
 		break;
@@ -1606,7 +1606,7 @@ int mpol_parse_options(char *value, int 
 		*nodelist++ = '\0';
 		if (nodelist_parse(nodelist, *policy_nodes))
 			goto out;
-		if (!nodes_subset(*policy_nodes, node_online_map))
+		if (!nodes_subset(*policy_nodes, node_memory_map))
 			goto out;
 	}
 	if (!strcmp(value, "default")) {
@@ -1631,9 +1631,9 @@ int mpol_parse_options(char *value, int 
 			err = 0;
 	} else if (!strcmp(value, "interleave")) {
 		*policy = MPOL_INTERLEAVE;
-		/* Default to nodes online if no nodelist */
+		/* Default to nodes memory map if no nodelist */
 		if (!nodelist)
-			*policy_nodes = node_online_map;
+			*policy_nodes = node_memory_map;
 		err = 0;
 	}
 out:
@@ -1674,14 +1674,14 @@ void __init numa_policy_init(void)
 
 	/*
 	 * Use the specified nodemask for init, or fall back to
-	 * node_online_map.
+	 * node_memory_map.
 	 */
 	if (policy_sysinit == MPOL_DEFAULT)
 		nmask = NULL;
 	else if (!nodes_empty(nmask_sysinit))
 		nmask = &nmask_sysinit;
 	else
-		nmask = &node_online_map;
+		nmask = &node_memory_map;
 
 	if (do_set_mempolicy(policy_sysinit, nmask))
 		printk("numa_policy_init: setting init policy failed\n");
@@ -1945,7 +1945,7 @@ int show_numa_map(struct seq_file *m, vo
 		seq_printf(m, " huge");
 	} else {
 		check_pgd_range(vma, vma->vm_start, vma->vm_end,
-				&node_online_map, MPOL_MF_STATS, md);
+				&node_memory_map, MPOL_MF_STATS, md);
 	}
 
 	if (!md->pages)
@@ -1972,7 +1972,7 @@ int show_numa_map(struct seq_file *m, vo
 	if (md->writeback)
 		seq_printf(m," writeback=%lu", md->writeback);
 
-	for_each_online_node(n)
+	for_each_memory_node(n)
 		if (md->node[n])
 			seq_printf(m, " N%d=%lu", n, md->node[n]);
 out:


* [RFC 10/13] Memoryless nodes: Fix GFP_THISNODE behavior
  2007-06-14  7:50 [RFC 00/13] RFC memoryless node handling fixes clameter
                   ` (8 preceding siblings ...)
  2007-06-14  7:50 ` [RFC 09/13] Memoryless nodes: Update memory policy and page migration clameter
@ 2007-06-14  7:50 ` clameter
  2007-06-14 16:07   ` Nishanth Aravamudan
  2007-06-14  7:50 ` [RFC 11/13] SLUB: Ensure that the # object per slabs stays low enough clameter
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 20+ messages in thread
From: clameter @ 2007-06-14  7:50 UTC (permalink / raw)
  To: Nishanth Aravamudan; +Cc: Lee Schermerhorn, linux-mm

[-- Attachment #1: memless_thisnode_fix --]
[-- Type: text/plain, Size: 4812 bytes --]

GFP_THISNODE checks that the zone selected is within the pgdat (node) of the
first zone of a nodelist. That only works if the node has memory. A
memoryless node will have its first node on another pgdat (node).

GFP_THISNODE currently will return simply memory on the first pgdat.
Thus it is returning memory on other nodes. GFP_THISNODE should fail
if there is no local memory on a node.


Add a new set of zonelists for each node that only contain the nodes
that belong to the zones itself so that no fallback is possible.

Then modify gfp_zone() to pick up the right zonelist index based on the
presence of __GFP_THISNODE.

Then we can drop the existing GFP_THISNODE code from the hot path.
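
The index arithmetic can be modeled in a few lines of plain C (zone
names and flag bits are simplified here; the real values come from
mmzone.h and gfp.h):

	#include <stdio.h>

	enum { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM, ZONE_MOVABLE, MAX_NR_ZONES };

	#define __GFP_DMA	0x1u
	#define __GFP_THISNODE	0x2u		/* illustrative bit values */

	static int gfp_zone(unsigned int flags)
	{
		/* __GFP_THISNODE selects the second, no-fallback set of
		 * zonelists stored after the first MAX_NR_ZONES entries. */
		int offset = (flags & __GFP_THISNODE) ? MAX_NR_ZONES : 0;

		if (flags & __GFP_DMA)
			return offset + ZONE_DMA;
		return offset + ZONE_NORMAL;
	}

	int main(void)
	{
		printf("%d\n", gfp_zone(0));			/* 1: fallback zonelist */
		printf("%d\n", gfp_zone(__GFP_THISNODE));	/* 5: thisnode zonelist */
		return 0;
	}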

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.22-rc4-mm2/include/linux/gfp.h
===================================================================
--- linux-2.6.22-rc4-mm2.orig/include/linux/gfp.h	2007-06-14 00:22:42.000000000 -0700
+++ linux-2.6.22-rc4-mm2/include/linux/gfp.h	2007-06-14 00:24:17.000000000 -0700
@@ -116,22 +116,28 @@ static inline int allocflags_to_migratet
 
 static inline enum zone_type gfp_zone(gfp_t flags)
 {
+	int offset = 0;
+
+#ifdef CONFIG_NUMA
+	if (flags & __GFP_THISNODE)
+		offset = MAX_NR_ZONES;
+#endif
 #ifdef CONFIG_ZONE_DMA
 	if (flags & __GFP_DMA)
-		return ZONE_DMA;
+		return offset + ZONE_DMA;
 #endif
 #ifdef CONFIG_ZONE_DMA32
 	if (flags & __GFP_DMA32)
-		return ZONE_DMA32;
+		return offset + ZONE_DMA32;
 #endif
 	if ((flags & (__GFP_HIGHMEM | __GFP_MOVABLE)) ==
 			(__GFP_HIGHMEM | __GFP_MOVABLE))
-		return ZONE_MOVABLE;
+		return offset + ZONE_MOVABLE;
 #ifdef CONFIG_HIGHMEM
 	if (flags & __GFP_HIGHMEM)
-		return ZONE_HIGHMEM;
+		return offset + ZONE_HIGHMEM;
 #endif
-	return ZONE_NORMAL;
+	return offset + ZONE_NORMAL;
 }
 
 static inline gfp_t set_migrateflags(gfp_t gfp, gfp_t migrate_flags)
Index: linux-2.6.22-rc4-mm2/mm/page_alloc.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/page_alloc.c	2007-06-14 00:25:29.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/page_alloc.c	2007-06-14 00:36:44.000000000 -0700
@@ -1433,9 +1433,6 @@ zonelist_scan:
 			!zlc_zone_worth_trying(zonelist, z, allowednodes))
 				continue;
 		zone = *z;
-		if (unlikely(NUMA_BUILD && (gfp_mask & __GFP_THISNODE) &&
-			zone->zone_pgdat != zonelist->zones[0]->zone_pgdat))
-				break;
 		if ((alloc_flags & ALLOC_CPUSET) &&
 			!cpuset_zone_allowed_softwall(zone, gfp_mask))
 				goto try_next_zone;
@@ -1556,7 +1553,10 @@ restart:
 	z = zonelist->zones;  /* the list of zones suitable for gfp_mask */
 
 	if (unlikely(*z == NULL)) {
-		/* Should this ever happen?? */
+		/*
+		 * Happens if we have an empty zonelist as a result of
+		 * GFP_THISNODE being used on a memoryless node
+		 */
 		return NULL;
 	}
 
@@ -2154,6 +2154,22 @@ static void build_zonelists_in_node_orde
 }
 
 /*
+ * Build gfp_thisnode zonelists
+ */
+static void build_thisnode_zonelists(pg_data_t *pgdat)
+{
+	enum zone_type i;
+	int j;
+	struct zonelist *zonelist;
+
+	for (i = 0; i < MAX_NR_ZONES; i++) {
+		zonelist = pgdat->node_zonelists + MAX_NR_ZONES + i;
+ 		j = build_zonelists_node(pgdat, zonelist, 0, i);
+		zonelist->zones[j] = NULL;
+	}
+}
+
+/*
  * Build zonelists ordered by zone and nodes within zones.
  * This results in conserving DMA zone[s] until all Normal memory is
  * exhausted, but results in overflowing to remote node while memory
@@ -2257,7 +2273,7 @@ static void build_zonelists(pg_data_t *p
 	int order = current_zonelist_order;
 
 	/* initialize zonelists */
-	for (i = 0; i < MAX_NR_ZONES; i++) {
+	for (i = 0; i < 2 * MAX_NR_ZONES; i++) {
 		zonelist = pgdat->node_zonelists + i;
 		zonelist->zones[0] = NULL;
 	}
@@ -2303,6 +2319,8 @@ static void build_zonelists(pg_data_t *p
 		build_zonelists_in_zone_order(pgdat, j);
 	}
 
+	build_thisnode_zonelists(pgdat);
+
 	if (pgdat->node_present_pages)
 		node_set_has_memory(local_node);
 }
Index: linux-2.6.22-rc4-mm2/include/linux/mmzone.h
===================================================================
--- linux-2.6.22-rc4-mm2.orig/include/linux/mmzone.h	2007-06-14 00:24:28.000000000 -0700
+++ linux-2.6.22-rc4-mm2/include/linux/mmzone.h	2007-06-14 00:25:25.000000000 -0700
@@ -469,7 +469,11 @@ extern struct page *mem_map;
 struct bootmem_data;
 typedef struct pglist_data {
 	struct zone node_zones[MAX_NR_ZONES];
+#ifdef CONFIG_NUMA
+	struct zonelist node_zonelists[2 * MAX_NR_ZONES];
+#else
 	struct zonelist node_zonelists[MAX_NR_ZONES];
+#endif
 	int nr_zones;
 #ifdef CONFIG_FLAT_NODE_MEM_MAP
 	struct page *node_mem_map;


* [RFC 11/13] SLUB: Ensure that the # object per slabs stays low enough.
  2007-06-14  7:50 [RFC 00/13] RFC memoryless node handling fixes clameter
                   ` (9 preceding siblings ...)
  2007-06-14  7:50 ` [RFC 10/13] Memoryless nodes: Fix GFP_THISNODE behavior clameter
@ 2007-06-14  7:50 ` clameter
  2007-06-14  7:50 ` [RFC 12/13] SLUB: minimum alignment fixes clameter
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 20+ messages in thread
From: clameter @ 2007-06-14  7:50 UTC (permalink / raw)
  To: Nishanth Aravamudan; +Cc: Lee Schermerhorn, linux-mm

[-- Attachment #1: slub_oversize --]
[-- Type: text/plain, Size: 2663 bytes --]

Currently SLUB has no provision to deal with too high page orders
that may be specified on the kernel boot line. If an order higher
than 6 (on a 4k platform) is generated then we will BUG() because
slabs get more than 65535 objects.

Add some logic that decreases the order for slabs that have too many
objects. This allows booting with slab sizes up to MAX_ORDER.

For example

	slub_min_order=10

will boot with a default slab size of 4M and reduce the slab order for
small object sizes if the number of objects would become too big. Large
slab sizes like that allow objects of the same slab cache to be
concentrated under as few TLB entries as possible and thus reduce TLB
pressure.
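
A quick userspace model of the order reduction (PAGE_SIZE and the
object limit hardcoded for a 4k platform):

	#include <stdio.h>

	#define PAGE_SIZE		4096UL
	#define MAX_OBJECTS_PER_SLAB	65535UL

	static int reduced_order(int slub_min_order, unsigned long size)
	{
		int min_order = slub_min_order;

		/* Same loop as the patch: shrink the order until the
		 * object count fits into the 16-bit page->inuse field. */
		while (min_order > 0 &&
		       (PAGE_SIZE << min_order) >= MAX_OBJECTS_PER_SLAB * size)
			min_order--;
		return min_order;
	}

	int main(void)
	{
		/* With slub_min_order=10, 8-byte objects drop to order 6
		 * (32768 objects); 256-byte objects keep order 10. */
		printf("size 8:   order %d\n", reduced_order(10, 8));
		printf("size 256: order %d\n", reduced_order(10, 256));
		return 0;
	}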

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 mm/slub.c |   21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

Index: vps/mm/slub.c
===================================================================
--- vps.orig/mm/slub.c	2007-06-12 15:58:35.000000000 -0700
+++ vps/mm/slub.c	2007-06-12 16:04:01.000000000 -0700
@@ -212,6 +212,11 @@ static inline void ClearSlabDebug(struct
 #define ARCH_SLAB_MINALIGN __alignof__(unsigned long long)
 #endif
 
+/*
+ * The page->inuse field is 16 bit thus we have this limitation
+ */
+#define MAX_OBJECTS_PER_SLAB 65535
+
 /* Internal SLUB flags */
 #define __OBJECT_POISON 0x80000000	/* Poison object */
 
@@ -1751,8 +1756,17 @@ static inline int slab_order(int size, i
 {
 	int order;
 	int rem;
+	int min_order = slub_min_order;
 
-	for (order = max(slub_min_order,
+	/*
+	 * If we would create too many object per slab then reduce
+	 * the slab order even if it goes below slub_min_order.
+	 */
+	while (min_order > 0 &&
+		(PAGE_SIZE << min_order) >= MAX_OBJECTS_PER_SLAB * size)
+			min_order--;
+
+	for (order = max(min_order,
 				fls(min_objects * size - 1) - PAGE_SHIFT);
 			order <= max_order; order++) {
 
@@ -1766,6 +1780,9 @@ static inline int slab_order(int size, i
 		if (rem <= slab_size / fract_leftover)
 			break;
 
+		/* If the next size is too high then exit now */
+		if (slab_size * 2 >= MAX_OBJECTS_PER_SLAB * size)
+			break;
 	}
 
 	return order;
@@ -2048,7 +2065,7 @@ static int calculate_sizes(struct kmem_c
 	 * The page->inuse field is only 16 bit wide! So we cannot have
 	 * more than 64k objects per slab.
 	 */
-	if (!s->objects || s->objects > 65535)
+	if (!s->objects || s->objects > MAX_OBJECTS_PER_SLAB)
 		return 0;
 	return 1;
 


* [RFC 12/13] SLUB: minimum alignment fixes
  2007-06-14  7:50 [RFC 00/13] RFC memoryless node handling fixes clameter
                   ` (10 preceding siblings ...)
  2007-06-14  7:50 ` [RFC 11/13] SLUB: Ensure that the # object per slabs stays low enough clameter
@ 2007-06-14  7:50 ` clameter
  2007-06-14  7:56   ` Christoph Lameter
  2007-06-14  7:50 ` [RFC 13/13] I finally found a way to get rid of the nasty list of comparisons in slub_def.h. ilog2 seems to work right for constants clameter
  2007-06-14 14:24 ` [RFC 00/13] RFC memoryless node handling fixes Nishanth Aravamudan
  13 siblings, 1 reply; 20+ messages in thread
From: clameter @ 2007-06-14  7:50 UTC (permalink / raw)
  To: Nishanth Aravamudan; +Cc: Lee Schermerhorn, linux-mm

[-- Attachment #1: slub_min_align --]
[-- Type: text/plain, Size: 3991 bytes --]

If ARCH_KMALLOC_MINALIGN is set to a value greater than 8 (SLUB's smallest
kmalloc cache) then SLUB may generate duplicate slabs in sysfs (yes again).

No arch sets ARCH_KMALLOC_MINALIGN larger than 8, except mips, which for
some reason wants a 128 byte alignment.

This patch increases the size of the smallest cache if ARCH_KMALLOC_MINALIGN
is greater than 8. In that case more and more of the smallest caches are
disabled.

If we do that then the count of the active general caches that is displayed
on boot is not correct anymore since we may skip elements of the kmalloc
array. So count them separately.
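
A quick userspace model of which general caches survive (the constants
are simplified; KMALLOC_SHIFT_HIGH is shortened to 12 here):

	#include <stdio.h>

	static int ilog2(unsigned long n)
	{
		int i = -1;

		while (n) {
			n >>= 1;
			i++;
		}
		return i;
	}

	int main(void)
	{
		unsigned long min_size = 128;	/* ARCH_KMALLOC_MINALIGN on mips */
		int shift_low = ilog2(min_size);
		int i, caches = 0;

		if (min_size <= 64)		/* kmalloc-96 */
			caches++;
		if (min_size <= 128)		/* kmalloc-192 */
			caches++;
		for (i = shift_low; i <= 12; i++)	/* power-of-two caches */
			caches++;

		printf("KMALLOC_SHIFT_LOW=%d, %d general caches\n",
		       shift_low, caches);
		return 0;
	}

With 128 byte minimum alignment this reports KMALLOC_SHIFT_LOW=7 and 7
caches: kmalloc-96 is skipped while kmalloc-192 and kmalloc-128 upward
remain.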

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/slub_def.h |   13 +++++++++++--
 mm/slub.c                |   20 +++++++++++++++-----
 2 files changed, 26 insertions(+), 7 deletions(-)

Index: vps/include/linux/slub_def.h
===================================================================
--- vps.orig/include/linux/slub_def.h	2007-06-12 15:58:30.000000000 -0700
+++ vps/include/linux/slub_def.h	2007-06-12 16:00:43.000000000 -0700
@@ -28,7 +28,7 @@ struct kmem_cache {
 	int size;		/* The size of an object including meta data */
 	int objsize;		/* The size of an object without meta data */
 	int offset;		/* Free pointer offset. */
-	unsigned int order;
+	int order;
 
 	/*
 	 * Avoid an extra cache line for UP, SMP and for the node local to
@@ -56,7 +56,13 @@ struct kmem_cache {
 /*
  * Kmalloc subsystem.
  */
-#define KMALLOC_SHIFT_LOW 3
+#if defined(ARCH_KMALLOC_MINALIGN) && ARCH_KMALLOC_MINALIGN > 8
+#define KMALLOC_MIN_SIZE ARCH_KMALLOC_MINALIGN
+#else
+#define KMALLOC_MIN_SIZE 8
+#endif
+
+#define KMALLOC_SHIFT_LOW ilog2(KMALLOC_MIN_SIZE)
 
 /*
  * We keep the general caches in an array of slab caches that are used for
@@ -76,6 +82,9 @@ static inline int kmalloc_index(size_t s
 	if (size > KMALLOC_MAX_SIZE)
 		return -1;
 
+	if (size <= KMALLOC_MIN_SIZE)
+		return KMALLOC_SHIFT_LOW;
+
 	if (size > 64 && size <= 96)
 		return 1;
 	if (size > 128 && size <= 192)
Index: vps/mm/slub.c
===================================================================
--- vps.orig/mm/slub.c	2007-06-12 15:58:37.000000000 -0700
+++ vps/mm/slub.c	2007-06-12 16:03:00.000000000 -0700
@@ -2521,6 +2521,7 @@ EXPORT_SYMBOL(krealloc);
 void __init kmem_cache_init(void)
 {
 	int i;
+	int caches = 0;
 
 	if (!page_group_by_mobility_disabled && !user_override) {
 		/*
@@ -2540,20 +2541,29 @@ void __init kmem_cache_init(void)
 	create_kmalloc_cache(&kmalloc_caches[0], "kmem_cache_node",
 		sizeof(struct kmem_cache_node), GFP_KERNEL);
 	kmalloc_caches[0].refcount = -1;
+	caches++;
 #endif
 
 	/* Able to allocate the per node structures */
 	slab_state = PARTIAL;
 
 	/* Caches that are not of the two-to-the-power-of size */
-	create_kmalloc_cache(&kmalloc_caches[1],
+	if (KMALLOC_MIN_SIZE <= 64) {
+		create_kmalloc_cache(&kmalloc_caches[1],
 				"kmalloc-96", 96, GFP_KERNEL);
-	create_kmalloc_cache(&kmalloc_caches[2],
+		caches++;
+	}
+	if (KMALLOC_MIN_SIZE <= 128) {
+		create_kmalloc_cache(&kmalloc_caches[2],
 				"kmalloc-192", 192, GFP_KERNEL);
+		caches++;
+	}
 
-	for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++)
+	for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++) {
 		create_kmalloc_cache(&kmalloc_caches[i],
 			"kmalloc", 1 << i, GFP_KERNEL);
+		caches++;
+	}
 
 	slab_state = UP;
 
@@ -2570,8 +2580,8 @@ void __init kmem_cache_init(void)
 				nr_cpu_ids * sizeof(struct page *);
 
 	printk(KERN_INFO "SLUB: Genslabs=%d, HWalign=%d, Order=%d-%d, MinObjects=%d,"
-		" Processors=%d, Nodes=%d\n",
-		KMALLOC_SHIFT_HIGH, cache_line_size(),
+		" CPUs=%d, Nodes=%d\n",
+		caches, cache_line_size(),
 		slub_min_order, slub_max_order, slub_min_objects,
 		nr_cpu_ids, nr_node_ids);
 }


* [RFC 13/13] I finally found a way to get rid of the nasty list of comparisons in slub_def.h. ilog2 seems to work right for constants.
  2007-06-14  7:50 [RFC 00/13] RFC memoryless node handling fixes clameter
                   ` (11 preceding siblings ...)
  2007-06-14  7:50 ` [RFC 12/13] SLUB: minimum alignment fixes clameter
@ 2007-06-14  7:50 ` clameter
  2007-06-14  7:57   ` Christoph Lameter
  2007-06-14 14:24 ` [RFC 00/13] RFC memoryless node handling fixes Nishanth Aravamudan
  13 siblings, 1 reply; 20+ messages in thread
From: clameter @ 2007-06-14  7:50 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Lee Schermerhorn, linux-mm, Pekka Enberg, Andrew Morton

[-- Attachment #1: slub_ilog2 --]
[-- Type: text/plain, Size: 3664 bytes --]

Also update comments.

Drop the generation of an unresolved symbol for the case that the size is
too big. A simple BUG_ON suffices now that we can allocate slab objects up
to MAX_ORDER in size.
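
The identity being exploited is that ilog2(size - 1) + 1 equals
ceil(log2(size)). A quick userspace check (note that the 96 and 192
byte caches are special-cased before this formula applies):

	#include <stdio.h>

	static int ilog2(unsigned long n)
	{
		int i = -1;

		while (n) {
			n >>= 1;
			i++;
		}
		return i;
	}

	int main(void)
	{
		unsigned long sizes[] = { 16, 200, 4096, 5000 };
		unsigned int i;

		for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
			int index = ilog2(sizes[i] - 1) + 1;
			printf("size %4lu -> index %2d (kmalloc-%lu)\n",
			       sizes[i], index, 1UL << index);
		}
		return 0;
	}

which prints indices 4, 8, 12 and 13, i.e. kmalloc-16, kmalloc-256,
kmalloc-4096 and kmalloc-8192.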

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/slub_def.h |   56 +++++++++--------------------------------------
 1 file changed, 11 insertions(+), 45 deletions(-)

Index: vps/include/linux/slub_def.h
===================================================================
--- vps.orig/include/linux/slub_def.h	2007-06-12 16:09:56.000000000 -0700
+++ vps/include/linux/slub_def.h	2007-06-12 16:32:44.000000000 -0700
@@ -10,6 +10,7 @@
 #include <linux/gfp.h>
 #include <linux/workqueue.h>
 #include <linux/kobject.h>
+#include <linux/log2.h>
 
 struct kmem_cache_node {
 	spinlock_t list_lock;	/* Protect partial list and nr_partial */
@@ -71,8 +72,9 @@ struct kmem_cache {
 extern struct kmem_cache kmalloc_caches[KMALLOC_SHIFT_HIGH + 1];
 
 /*
- * Sorry that the following has to be that ugly but some versions of GCC
- * have trouble with constant propagation and loops.
+ * Determine the kmalloc array index given the object size.
+ *
+ * Return -1 if the object size is not supported.
  */
 static inline int kmalloc_index(size_t size)
 {
@@ -85,42 +87,15 @@ static inline int kmalloc_index(size_t s
 	if (size <= KMALLOC_MIN_SIZE)
 		return KMALLOC_SHIFT_LOW;
 
+	/*
+	 * We map the non power of two slabs to the unused
+	 * log2 values in the kmalloc array.
+	 */
 	if (size > 64 && size <= 96)
 		return 1;
 	if (size > 128 && size <= 192)
 		return 2;
-	if (size <=          8) return 3;
-	if (size <=         16) return 4;
-	if (size <=         32) return 5;
-	if (size <=         64) return 6;
-	if (size <=        128) return 7;
-	if (size <=        256) return 8;
-	if (size <=        512) return 9;
-	if (size <=       1024) return 10;
-	if (size <=   2 * 1024) return 11;
-	if (size <=   4 * 1024) return 12;
-	if (size <=   8 * 1024) return 13;
-	if (size <=  16 * 1024) return 14;
-	if (size <=  32 * 1024) return 15;
-	if (size <=  64 * 1024) return 16;
-	if (size <= 128 * 1024) return 17;
-	if (size <= 256 * 1024) return 18;
-	if (size <=  512 * 1024) return 19;
-	if (size <= 1024 * 1024) return 20;
-	if (size <=  2 * 1024 * 1024) return 21;
-	if (size <=  4 * 1024 * 1024) return 22;
-	if (size <=  8 * 1024 * 1024) return 23;
-	if (size <= 16 * 1024 * 1024) return 24;
-	if (size <= 32 * 1024 * 1024) return 25;
-	return -1;
-
-/*
- * What we really wanted to do and cannot do because of compiler issues is:
- *	int i;
- *	for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++)
- *		if (size <= (1 << i))
- *			return i;
- */
+	return ilog2(size - 1) + 1;
 }
 
 /*
@@ -137,18 +112,9 @@ static inline struct kmem_cache *kmalloc
 		return NULL;
 
 	/*
-	 * This function only gets expanded if __builtin_constant_p(size), so
-	 * testing it here shouldn't be needed.  But some versions of gcc need
-	 * help.
+	 * If this triggers then the amount of memory requested was too large.
 	 */
-	if (__builtin_constant_p(size) && index < 0) {
-		/*
-		 * Generate a link failure. Would be great if we could
-		 * do something to stop the compile here.
-		 */
-		extern void __kmalloc_size_too_large(void);
-		__kmalloc_size_too_large();
-	}
+	BUG_ON(index < 0);
 	return &kmalloc_caches[index];
 }
 


* Re: [RFC 12/13] SLUB: minimum alignment fixes
  2007-06-14  7:50 ` [RFC 12/13] SLUB: minimum alignment fixes clameter
@ 2007-06-14  7:56   ` Christoph Lameter
  0 siblings, 0 replies; 20+ messages in thread
From: Christoph Lameter @ 2007-06-14  7:56 UTC (permalink / raw)
  To: Nishanth Aravamudan; +Cc: Lee Schermerhorn, linux-mm

Please disregard; this slipped in at the end of a quilt mail.



* Re: [RFC 13/13] I finally found a way to get rid of the nasty list of comparisons in slub_def.h. ilog2 seems to work right for constants.
  2007-06-14  7:50 ` [RFC 13/13] I finally found a way to get rid of the nasty list of comparisons in slub_def.h. ilog2 seems to work right for constants clameter
@ 2007-06-14  7:57   ` Christoph Lameter
  0 siblings, 0 replies; 20+ messages in thread
From: Christoph Lameter @ 2007-06-14  7:57 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Lee Schermerhorn, linux-mm, Pekka Enberg, Andrew Morton

Please disregard. Accidentally sent out again.




* Re: [RFC 00/13] RFC memoryless node handling fixes
  2007-06-14  7:50 [RFC 00/13] RFC memoryless node handling fixes clameter
                   ` (12 preceding siblings ...)
  2007-06-14  7:50 ` [RFC 13/13] I finally found a way to get rid of the nasty list of comparisions in slub_def.h. ilog2 seems to work right for constants clameter
@ 2007-06-14 14:24 ` Nishanth Aravamudan
  13 siblings, 0 replies; 20+ messages in thread
From: Nishanth Aravamudan @ 2007-06-14 14:24 UTC (permalink / raw)
  To: clameter; +Cc: Lee Schermerhorn, linux-mm

On 14.06.2007 [00:50:26 -0700], clameter@sgi.com wrote:
> This has now become a longer series since I have seen a couple of
> things in various places where we do not take into account memoryless
> nodes.
> 
> I changed the GFP_THISNODE fix to generate a new set of zonelists.
> GFP_THISNODE will then simply use a zonelist that only has the zones
> of the node.
> 
> I have only tested this by booting on an IA64 simulator. Please review.
> I do not have a real system with a memoryless node.

I do :) -- will stack your patches on rc4-mm2 and rebase my patches on
top to test.

Thanks,
Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center


* Re: [RFC 10/13] Memoryless nodes: Fix GFP_THISNODE behavior
  2007-06-14  7:50 ` [RFC 10/13] Memoryless nodes: Fix GFP_THISNODE behavior clameter
@ 2007-06-14 16:07   ` Nishanth Aravamudan
  2007-06-14 16:13     ` Christoph Lameter
  2007-06-18 16:47     ` Nishanth Aravamudan
  0 siblings, 2 replies; 20+ messages in thread
From: Nishanth Aravamudan @ 2007-06-14 16:07 UTC (permalink / raw)
  To: clameter; +Cc: Lee Schermerhorn, linux-mm

On 14.06.2007 [00:50:36 -0700], clameter@sgi.com wrote:
> GFP_THISNODE checks that the zone selected is within the pgdat (node) of the
> first zone of a nodelist. That only works if the node has memory. A
> memoryless node will have its first node on another pgdat (node).
> 
> GFP_THISNODE currently will return simply memory on the first pgdat.
> Thus it is returning memory on other nodes. GFP_THISNODE should fail
> if there is no local memory on a node.
> 
> 
> Add a new set of zonelists for each node that only contain the nodes
> that belong to the zones itself so that no fallback is possible.

Should be

Add a new set of zonelists for each node that only contain the zones
that belong to the node itself so that no fallback is possible?

This is the last patch in the stack I should base my patches on,
correct (I believe 11-13 were mis-sends)?

Will test everything and send out Acks later today, hopefully.

Thanks,
Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center


* Re: [RFC 10/13] Memoryless nodes: Fix GFP_THISNODE behavior
  2007-06-14 16:07   ` Nishanth Aravamudan
@ 2007-06-14 16:13     ` Christoph Lameter
  2007-06-18 16:47     ` Nishanth Aravamudan
  1 sibling, 0 replies; 20+ messages in thread
From: Christoph Lameter @ 2007-06-14 16:13 UTC (permalink / raw)
  To: Nishanth Aravamudan; +Cc: Lee Schermerhorn, linux-mm

On Thu, 14 Jun 2007, Nishanth Aravamudan wrote:

> > Add a new set of zonelists for each node that only contain the nodes
> > that belong to the zones itself so that no fallback is possible.
> 
> Should be
> 
> Add a new set of zonelists for each node that only contain the zones
> that belong to the node itself so that no fallback is possible?

Right.


> This is the last patch in the stack I should base my patches on,
> correct (I believe 11-13 were mis-sends)?

Right.


* Re: [RFC 10/13] Memoryless nodes: Fix GFP_THISNODE behavior
  2007-06-14 16:07   ` Nishanth Aravamudan
  2007-06-14 16:13     ` Christoph Lameter
@ 2007-06-18 16:47     ` Nishanth Aravamudan
  1 sibling, 0 replies; 20+ messages in thread
From: Nishanth Aravamudan @ 2007-06-18 16:47 UTC (permalink / raw)
  To: clameter; +Cc: Lee Schermerhorn, linux-mm

On 14.06.2007 [09:07:04 -0700], Nishanth Aravamudan wrote:
> On 14.06.2007 [00:50:36 -0700], clameter@sgi.com wrote:
> > GFP_THISNODE checks that the zone selected is within the pgdat (node) of the
> > first zone of a nodelist. That only works if the node has memory. A
> > memoryless node will have its first node on another pgdat (node).
> > 
> > GFP_THISNODE currently will return simply memory on the first pgdat.
> > Thus it is returning memory on other nodes. GFP_THISNODE should fail
> > if there is no local memory on a node.
> > 
> > 
> > Add a new set of zonelists for each node that only contain the nodes
> > that belong to the zones itself so that no fallback is possible.
> 
> Should be
> 
> Add a new set of zonelists for each node that only contain the zones
> that belong to the node itself so that no fallback is possible?
> 
> This is the last patch in the stack I should base my patches on,
> correct (I believe 11-13 were mis-sends)?
> 
> Will test everything and send out Acks later today, hopefully.

Tested on a 4-node ppc64 w/ 2 memoryless nodes and a 4-node x86_64 w/
no memoryless nodes, with my patches applied on top (will send out the
latest versions again).

All get

Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>

Thanks for doing this work, Christoph!

-Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center

