* [PATCH] numa: mempolicy: dynamic interleave map for system init.
From: Paul Mundt <lethal@linux-sh.org> @ 2007-06-07  1:17 UTC
To: Andrew Morton; +Cc: linux-mm, ak, clameter, hugh, lee.schermerhorn

This is an alternative approach to the MPOL_INTERLEAVE across online
nodes as the system init policy. Andi suggested it might be worthwhile
trying to do this dynamically rather than as a command line option, so
that's what this tries to do.

With this, the online nodes are sized and packed into an interleave map
if they're large enough for interleave to be worthwhile. I arbitrarily
chose 16MB as the node size to enable interleaving, but perhaps someone
has a better figure in mind?

In the case where all of the nodes are smaller than that, the largest
node is selected and placed into the map by itself (if they're all the
same size, the first online node gets used).

If people prefer this approach, the previous patch adding mpolinit can be
dropped.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>

--

 mm/mempolicy.c |   31 ++++++++++++++++++++++++++++---
 1 file changed, 28 insertions(+), 3 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d76e8eb..a67c8f1 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1597,6 +1597,10 @@ void mpol_free_shared_policy(struct shared_policy *p)
 /* assumes fs == KERNEL_DS */
 void __init numa_policy_init(void)
 {
+        nodemask_t interleave_nodes;
+        unsigned long largest = 0;
+        int nid, prefer = 0;
+
         policy_cache = kmem_cache_create("numa_policy",
                                          sizeof(struct mempolicy),
                                          0, SLAB_PANIC, NULL, NULL);
@@ -1605,10 +1609,31 @@ void __init numa_policy_init(void)
                                      sizeof(struct sp_node),
                                      0, SLAB_PANIC, NULL, NULL);
 
-        /* Set interleaving policy for system init. This way not all
-           the data structures allocated at system boot end up in node zero. */
+        /*
+         * Set interleaving policy for system init. Interleaving is only
+         * enabled across suitably sized nodes (default is >= 16MB), or
+         * fall back to the largest node if they're all smaller.
+         */
+        nodes_clear(interleave_nodes);
+        for_each_online_node(nid) {
+                unsigned long total_pages = node_present_pages(nid);
+
+                /* Preserve the largest node */
+                if (largest < total_pages) {
+                        largest = total_pages;
+                        prefer = nid;
+                }
+
+                /* Interleave this node? */
+                if ((total_pages << PAGE_SHIFT) >= (16 << 20))
+                        node_set(nid, interleave_nodes);
+        }
+
+        /* All too small, use the largest */
+        if (unlikely(nodes_empty(interleave_nodes)))
+                node_set(prefer, interleave_nodes);
 
-        if (do_set_mempolicy(MPOL_INTERLEAVE, &node_online_map))
+        if (do_set_mempolicy(MPOL_INTERLEAVE, &interleave_nodes))
                 printk("numa_policy_init: interleaving failed\n");
 }
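The 16MB cutoff in the patch above is a hard-coded literal. If a tunable were
preferred over debating the number, one possible (untested) sketch is to take
the threshold from the command line; the "mpol_interleave_min=" name and the
interleave_min variable are invented here and are not part of the patch:

static unsigned long interleave_min __initdata = 16 << 20;

static int __init setup_interleave_min(char *str)
{
        /* e.g. mpol_interleave_min=4M */
        interleave_min = memparse(str, &str);
        return 1;
}
__setup("mpol_interleave_min=", setup_interleave_min);

The size test in the loop would then compare (total_pages << PAGE_SHIFT)
against interleave_min instead of the (16 << 20) literal.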
* Re: [PATCH] numa: mempolicy: dynamic interleave map for system init.
From: Andrew Morton @ 2007-06-08  1:01 UTC
To: Paul Mundt; +Cc: linux-mm, ak, clameter, hugh, lee.schermerhorn

On Thu, 7 Jun 2007 10:17:01 +0900 Paul Mundt <lethal@linux-sh.org> wrote:

> This is an alternative approach to the MPOL_INTERLEAVE across online
> nodes as the system init policy. Andi suggested it might be worthwhile
> trying to do this dynamically rather than as a command line option, so
> that's what this tries to do.
>
> With this, the online nodes are sized and packed into an interleave map
> if they're large enough for interleave to be worthwhile. I arbitrarily
> chose 16MB as the node size to enable interleaving, but perhaps someone
> has a better figure in mind?
>
> In the case where all of the nodes are smaller than that, the largest
> node is selected and placed into the map by itself (if they're all the
> same size, the first online node gets used).
>
> If people prefer this approach, the previous patch adding mpolinit can be
> dropped.
>
> Signed-off-by: Paul Mundt <lethal@linux-sh.org>

Well I took silence as assent.

None of the above text is suitable for a changelog. Please send a
changelog for this patch, thanks.
* Re: [PATCH] numa: mempolicy: dynamic interleave map for system init.
From: Christoph Lameter @ 2007-06-08  2:47 UTC
To: Andrew Morton; +Cc: Paul Mundt, linux-mm, ak, hugh, lee.schermerhorn

On Thu, 7 Jun 2007, Andrew Morton wrote:

> Well I took silence as assent.

Well, grudgingly. How far are we willing to go to support these asymmetric
setups? The NUMA code was initially designed for mostly symmetric systems
with roughly the same amount of memory on each node. The farther we move
away from that, the more special casing we will have to add to deal with
these imbalances.

With memoryless nodes we already have one issue that will ripple through
the kernel, likely requiring numerous modifications and special casing.
Then we now have the ZONE_DMA issues reordering the zonelists. Now we will
support systems with 1MB size nodes? Will we need to modify the slab
allocators to only allocate on special processors?
* Re: [PATCH] numa: mempolicy: dynamic interleave map for system init.
From: Andrew Morton @ 2007-06-08  3:01 UTC
To: Christoph Lameter; +Cc: Paul Mundt, linux-mm, ak, hugh, lee.schermerhorn

On Thu, 7 Jun 2007 19:47:09 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:

> On Thu, 7 Jun 2007, Andrew Morton wrote:
>
> > Well I took silence as assent.
>
> Well, grudgingly. How far are we willing to go to support these asymmetric
> setups? The NUMA code was initially designed for mostly symmetric systems
> with roughly the same amount of memory on each node. The farther we move
> away from that, the more special casing we will have to add to deal with
> these imbalances.
>
> With memoryless nodes we already have one issue that will ripple through
> the kernel, likely requiring numerous modifications and special casing.
> Then we now have the ZONE_DMA issues reordering the zonelists. Now we will
> support systems with 1MB size nodes? Will we need to modify the slab
> allocators to only allocate on special processors?
>

Failing to support memoryless nodes was a bug, and we should continue to
take bugfixes for that.

Dunno about the rest - it depends upon how real-world are the problems
which people hit, and upon how messy the fixes look.
* Re: [PATCH] numa: mempolicy: dynamic interleave map for system init.
From: Christoph Lameter @ 2007-06-08  3:11 UTC
To: Andrew Morton; +Cc: Paul Mundt, linux-mm, ak, hugh, lee.schermerhorn

On Thu, 7 Jun 2007, Andrew Morton wrote:

> Failing to support memoryless nodes was a bug, and we should continue to
> take bugfixes for that.

We intentionally did support memoryless nodes on the arch level but not in
the core in order to avoid these issues.
* Re: [PATCH] numa: mempolicy: dynamic interleave map for system init.
From: Paul Mundt @ 2007-06-08  3:25 UTC
To: Christoph Lameter
Cc: Andrew Morton, linux-mm, ak, hugh, lee.schermerhorn, mpm

On Thu, Jun 07, 2007 at 07:47:09PM -0700, Christoph Lameter wrote:
> On Thu, 7 Jun 2007, Andrew Morton wrote:
>
> > Well I took silence as assent.
>
> Well, grudgingly. How far are we willing to go to support these asymmetric
> setups? The NUMA code was initially designed for mostly symmetric systems
> with roughly the same amount of memory on each node. The farther we move
> away from that, the more special casing we will have to add to deal with
> these imbalances.
>
Well, this doesn't all have to be dynamic either. I opted for the
mpolinit= approach first so we wouldn't make the accounting for the
common case heavier, but certainly having it dynamic is less hassle. The
asymmetric case will likely be the common case for embedded, but it's
obviously possible to try to work that into SLOB or something similar,
if making SLUB or SLAB lighter weight and more tunable for these cases
ends up being a real barrier.

On the other hand, as we start having machines with multiple gigs of RAM
stashed in node 0 (with many smaller memories in other nodes), SLOB isn't
going to be a long-term option either.

The pgdat is already special cased for things like flatmem and memory
hotplug; throwing something similar to scheduler domains into the pgdat
for node behavioural hints might be the least intrusive approach (and
could be ifdefed out for symmetric nodes).

> With memoryless nodes we already have one issue that will ripple through
> the kernel, likely requiring numerous modifications and special casing.
> Then we now have the ZONE_DMA issues reordering the zonelists. Now we will
> support systems with 1MB size nodes? Will we need to modify the slab
> allocators to only allocate on special processors?
>
Unfortunately CONFIG_NUMA deals with all of the problems that embedded
systems with multiple memories have (albeit perhaps somewhat
heavy-handedly), so extending this seems to be a far more productive
approach than reinventing things. If we have to do this through a special
allocator for the asymmetric node case, so be it, but I don't expect the
problem to go away.

Even with just the mempolicy changes for dynamic interleave, a 128k or
512k node is already usable (despite slab and slub both chewing through a
good chunk of it).
* Re: [PATCH] numa: mempolicy: dynamic interleave map for system init.
From: Christoph Lameter @ 2007-06-08  3:49 UTC
To: Paul Mundt; +Cc: Andrew Morton, linux-mm, ak, hugh, lee.schermerhorn, mpm

On Fri, 8 Jun 2007, Paul Mundt wrote:

> obviously possible to try to work that into SLOB or something similar,
> if making SLUB or SLAB lighter weight and more tunable for these cases
> ends up being a real barrier.

It's obviously possible, and as far as I can tell the architecture you have
there requires it to operate. But the question is how much special casing
we will have to add to the core VM.

We would likely have to add a

	slub_nodes=

parameter that allows the specification of a nodelist that is allowed for
the slab allocator. Then modify SLUB to use its own nodemap instead of
the node online map. Modify get_partial_node() to not try a node not in the
nodemap and go to get_any_partial() immediately. In addition to checking
cpuset_zone_allowed we would need to check the slab node list.

Hmm.... That would also help to create isolated nodes that have no memory
on them.

See what evil things you drive me to...

Could you try this patch (untested)? Set the allowed nodes on boot with

	slub_nodes=0

if you have only node 0 for SLUB.

---
 mm/slub.c |   69 ++++++++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 63 insertions(+), 6 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2007-06-07 20:32:30.000000000 -0700
+++ linux-2.6/mm/slub.c	2007-06-07 20:48:19.000000000 -0700
@@ -270,6 +270,20 @@ static inline struct kmem_cache_node *ge
 #endif
 }
 
+#ifdef CONFIG_NUMA
+static nodemask_t slub_nodes = NODE_MASK_ALL;
+
+static inline int forbidden_node(int node)
+{
+        return !node_isset(node, slub_nodes);
+}
+#else
+static inline int forbidden_node(int node)
+{
+        return 0;
+}
+#endif
+
 static inline int check_valid_pointer(struct kmem_cache *s,
                                 struct page *page, const void *object)
 {
@@ -1242,8 +1256,12 @@ static struct page *get_any_partial(stru
                         ->node_zonelists[gfp_zone(flags)];
         for (z = zonelist->zones; *z; z++) {
                 struct kmem_cache_node *n;
+                int node = zone_to_nid(*z);
 
-                n = get_node(s, zone_to_nid(*z));
+                if (forbidden_node(node))
+                        continue;
+
+                n = get_node(s, node);
 
                 if (n && cpuset_zone_allowed_hardwall(*z, flags) &&
                                 n->nr_partial > MIN_PARTIAL) {
@@ -1261,10 +1279,12 @@ static struct page *get_any_partial(stru
  */
 static struct page *get_partial(struct kmem_cache *s, gfp_t flags, int node)
 {
-        struct page *page;
+        struct page *page = NULL;
         int searchnode = (node == -1) ? numa_node_id() : node;
 
-        page = get_partial_node(get_node(s, searchnode));
+        if (!forbidden_node(node))
+                page = get_partial_node(get_node(s, searchnode));
+
         if (page || (flags & __GFP_THISNODE))
                 return page;
 
@@ -1819,7 +1839,11 @@ static void free_kmem_cache_nodes(struct
         int node;
 
         for_each_online_node(node) {
-                struct kmem_cache_node *n = s->node[node];
+                struct kmem_cache_node *n;
+
+                if (forbidden_node(node))
+                        continue;
+                n = s->node[node];
                 if (n && n != &s->local_node)
                         kmem_cache_free(kmalloc_caches, n);
                 s->node[node] = NULL;
@@ -1839,6 +1863,9 @@ static int init_kmem_cache_nodes(struct
         for_each_online_node(node) {
                 struct kmem_cache_node *n;
 
+                if (forbidden_node(node))
+                        continue;
+
                 if (local_node == node)
                         n = &s->local_node;
                 else {
@@ -2092,7 +2119,12 @@ static int kmem_cache_close(struct kmem_
 
         /* Attempt to free all objects */
         for_each_online_node(node) {
-                struct kmem_cache_node *n = get_node(s, node);
+                struct kmem_cache_node *n;
+
+                if (forbidden_node(node))
+                        continue;
+
+                n = get_node(s, node);
 
                 n->nr_partial -= free_list(s, n, &n->partial);
                 if (atomic_long_read(&n->nr_slabs))
@@ -2167,6 +2199,17 @@ static int __init setup_slub_nomerge(cha
 
 __setup("slub_nomerge", setup_slub_nomerge);
 
+#ifdef CONFIG_NUMA
+static int __init setup_slub_nodes(char *str)
+{
+        if (*str == '=')
+                nodelist_parse(str + 1, slub_nodes);
+        return 1;
+}
+
+__setup("slub_nodes", setup_slub_nodes);
+#endif
+
 static struct kmem_cache *create_kmalloc_cache(struct kmem_cache *s,
                 const char *name, int size, gfp_t gfp_flags)
 {
@@ -2329,6 +2372,9 @@ int kmem_cache_shrink(struct kmem_cache
         flush_all(s);
         for_each_online_node(node) {
+                if (forbidden_node(node))
+                        continue;
+
                 n = get_node(s, node);
 
                 if (!n->nr_partial)
@@ -2755,7 +2801,12 @@ static unsigned long validate_slab_cache
         flush_all(s);
         for_each_online_node(node) {
-                struct kmem_cache_node *n = get_node(s, node);
+                struct kmem_cache_node *n;
+
+                if (forbidden_node(node))
+                        continue;
+
+                n = get_node(s, node);
 
                 count += validate_slab_node(s, n);
         }
@@ -2981,6 +3032,9 @@ static int list_locations(struct kmem_ca
                 unsigned long flags;
                 struct page *page;
 
+                if (forbidden_node(node))
+                        continue;
+
                 if (!atomic_read(&n->nr_slabs))
                         continue;
 
@@ -3104,6 +3158,9 @@ static unsigned long slab_objects(struct
         for_each_online_node(node) {
                 struct kmem_cache_node *n = get_node(s, node);
 
+                if (forbidden_node(node))
+                        continue;
+
                 if (flags & SO_PARTIAL) {
                         if (flags & SO_OBJECTS)
                                 x = count_partial(n);
* Re: [PATCH] numa: mempolicy: dynamic interleave map for system init.
From: Paul Mundt @ 2007-06-08  4:13 UTC
To: Christoph Lameter
Cc: Andrew Morton, linux-mm, ak, hugh, lee.schermerhorn, mpm

On Thu, Jun 07, 2007 at 08:49:53PM -0700, Christoph Lameter wrote:
> On Fri, 8 Jun 2007, Paul Mundt wrote:
>
> > obviously possible to try to work that into SLOB or something similar,
> > if making SLUB or SLAB lighter weight and more tunable for these cases
> > ends up being a real barrier.
>
> Its obviously possible and as far as I can tell the architecture you have
> there requires it to operate. But the question is how much special casing
> we will have to add to the core VM.
>
> We would likely have to add a
>
> 	slub_nodes=
>
> parameter that allows the specification of a nodelist that is allowed for
> the slab allocator. Then modify SLUB to use its own nodemap instead of
> the node online map. Modify get_partial_node() to not try a node not in the
> nodemap and go to get_any_partial() immediately. In addition to checking
> cpuset_zone_allowed we would need to check the slab node list.
>
> Hmm.... That would also help to create isolated nodes that have no memory
> on them.
>
> See what evil things you drive me to...
>
> Could you try this patch (untested)? Set the allowed nodes on boot with
>
> 	slub_nodes=0
>
> if you have only node 0 for SLUB.
>
Yes, that works better (note that node 1 interleave is disabled in both
cases):

With patch:

/ # cat /sys/devices/system/node/node1/meminfo

Node 1 MemTotal:          128 kB
Node 1 MemFree:            72 kB
Node 1 MemUsed:            56 kB
Node 1 Active:              0 kB
Node 1 Inactive:            0 kB
Node 1 Dirty:               0 kB
Node 1 Writeback:           0 kB
Node 1 FilePages:           0 kB
Node 1 Mapped:              0 kB
Node 1 AnonPages:           0 kB
Node 1 PageTables:          0 kB
Node 1 NFS_Unstable:        0 kB
Node 1 Bounce:              0 kB
Node 1 Slab:                0 kB
Node 1 SReclaimable:        0 kB
Node 1 SUnreclaim:          0 kB
Node 1 HugePages_Total:     0
Node 1 HugePages_Free:      0

[  117.216293] Node 0 Normal free:55900kB min:1016kB low:1268kB high:1524kB active:692kB inactive:536kB present:65024kB pages_scanned:0 all_unreclaimable? no
[  117.230029] lowmem_reserve[]: 0
[  117.233140] Node 1 Normal free:72kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:128kB pages_scanned:0 all_unreclaimable? no
[  117.245322] lowmem_reserve[]: 0
[  117.248434] Node 0 Normal: 1*4kB 5*8kB 3*16kB 0*32kB 0*64kB 0*128kB 0*256kB 1*512kB 0*1024kB 1*2048kB 13*4096kB = 55900kB
[  117.259320] Node 1 Normal: 2*4kB 0*8kB 0*16kB 0*32kB 1*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 72kB

Without:

/ # cat /sys/devices/system/node/node1/meminfo

Node 1 MemTotal:          128 kB
Node 1 MemFree:            64 kB
Node 1 MemUsed:            64 kB
Node 1 Active:              0 kB
Node 1 Inactive:            0 kB
Node 1 Dirty:               0 kB
Node 1 Writeback:           0 kB
Node 1 FilePages:           0 kB
Node 1 Mapped:              0 kB
Node 1 AnonPages:           0 kB
Node 1 PageTables:          0 kB
Node 1 NFS_Unstable:        0 kB
Node 1 Bounce:              0 kB
Node 1 Slab:                8 kB
Node 1 SReclaimable:        0 kB
Node 1 SUnreclaim:          8 kB
Node 1 HugePages_Total:     0
Node 1 HugePages_Free:      0

[   87.000717] Node 0 Normal free:55912kB min:1016kB low:1268kB high:1524kB active:668kB inactive:556kB present:65024kB pages_scanned:0 all_unreclaimable? no
[   87.014453] lowmem_reserve[]: 0
[   87.017565] Node 1 Normal free:64kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:128kB pages_scanned:0 all_unreclaimable? no
[   87.029746] lowmem_reserve[]: 0
[   87.032858] Node 0 Normal: 0*4kB 9*8kB 2*16kB 0*32kB 0*64kB 0*128kB 0*256kB 1*512kB 0*1024kB 1*2048kB 13*4096kB = 55912kB
[   87.043744] Node 1 Normal: 0*4kB 0*8kB 0*16kB 0*32kB 1*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 64kB

So at least that gets back the couple of slab pages!
* Re: [PATCH] numa: mempolicy: dynamic interleave map for system init.
From: Christoph Lameter @ 2007-06-08  4:27 UTC
To: Paul Mundt; +Cc: Andrew Morton, linux-mm, ak, hugh, lee.schermerhorn, mpm

On Fri, 8 Jun 2007, Paul Mundt wrote:

> Node 1 SUnreclaim:          8 kB

> So at least that gets back the couple of slab pages!

Hmmmm.. is that worth it?

The patch is not right, btw. There is still the case that new_slab() can
acquire a page on the wrong node, and since we are not set up to allow
that node in SLUB we will crash.

This now gets a bit ugly. In order to avoid that situation we check
first if the node is allowed. If not then we simply ask for an alloc on
the first node.

But that may still make the page allocator fall back. If that happens then
we redo the allocation with GFP_THISNODE to force an allocation on the
first node or fail.

I think we could do better by constructing a custom zonelist, but that will
be even more special casing.

---
 mm/slub.c |   63 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 58 insertions(+), 5 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2007-06-07 21:01:32.000000000 -0700
+++ linux-2.6/mm/slub.c	2007-06-07 21:23:04.000000000 -0700
@@ -215,6 +215,10 @@ static inline void ClearSlabDebug(struct
 
 static int kmem_size = sizeof(struct kmem_cache);
 
+#ifdef CONFIG_NUMA
+static nodemask_t slub_nodes = NODE_MASK_ALL;
+#endif
+
 #ifdef CONFIG_SMP
 static struct notifier_block slab_notifier;
 #endif
@@ -1023,6 +1027,11 @@ static struct page *new_slab(struct kmem
         if (flags & __GFP_WAIT)
                 local_irq_enable();
 
+        /* Hack: Just get the first node if the node is not allowed */
+        if (slab_state >= UP && !get_node(s, node))
+                node = first_node(slub_nodes);
+
+redo:
         page = allocate_slab(s, flags & GFP_LEVEL_MASK, node);
         if (!page)
                 goto out;
@@ -1030,6 +1039,27 @@ static struct page *new_slab(struct kmem
         n = get_node(s, page_to_nid(page));
         if (n)
                 atomic_long_inc(&n->nr_slabs);
+#ifdef CONFIG_NUMA
+        else {
+                if (slab_state >= UP) {
+                        /*
+                         * The baaad page allocator gave us a page on a
+                         * node that we should not use. Force a page on
+                         * a legit node or fail.
+                         */
+                        __free_pages(page, s->order);
+                        flags |= GFP_THISNODE;
+
+                        mod_zone_page_state(page_zone(page),
+                                (s->flags & SLAB_RECLAIM_ACCOUNT) ?
+                                NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
+                                - (1 << s->order));
+
+                        goto redo;
+                }
+        }
+#endif
+
         page->offset = s->offset / sizeof(void *);
         page->slab = s;
         page->flags |= 1 << PG_slab;
@@ -1261,10 +1291,13 @@ static struct page *get_any_partial(stru
  */
 static struct page *get_partial(struct kmem_cache *s, gfp_t flags, int node)
 {
-        struct page *page;
+        struct page *page = NULL;
         int searchnode = (node == -1) ? numa_node_id() : node;
+        struct kmem_cache_node *n = get_node(s, searchnode);
+
+        if (n)
+                page = get_partial_node(n);
 
-        page = get_partial_node(get_node(s, searchnode));
         if (page || (flags & __GFP_THISNODE))
                 return page;
 
@@ -1820,12 +1853,22 @@ static void free_kmem_cache_nodes(struct
 
         for_each_online_node(node) {
                 struct kmem_cache_node *n = s->node[node];
+
                 if (n && n != &s->local_node)
                         kmem_cache_free(kmalloc_caches, n);
                 s->node[node] = NULL;
         }
 }
 
+static int __init setup_slub_nodes(char *str)
+{
+        if (*str == '=')
+                nodelist_parse(str + 1, slub_nodes);
+        return 1;
+}
+
+__setup("slub_nodes", setup_slub_nodes);
+
 static int init_kmem_cache_nodes(struct kmem_cache *s, gfp_t gfpflags)
 {
         int node;
@@ -1839,6 +1882,9 @@ static int init_kmem_cache_nodes(struct
         for_each_online_node(node) {
                 struct kmem_cache_node *n;
 
+                if (!node_isset(node, slub_nodes))
+                        continue;
+
                 if (local_node == node)
                         n = &s->local_node;
                 else {
@@ -2094,6 +2140,9 @@ static int kmem_cache_close(struct kmem_
         for_each_online_node(node) {
                 struct kmem_cache_node *n = get_node(s, node);
 
+                if (!n)
+                        continue;
+
                 n->nr_partial -= free_list(s, n, &n->partial);
                 if (atomic_long_read(&n->nr_slabs))
                         return 1;
@@ -2331,7 +2380,7 @@ int kmem_cache_shrink(struct kmem_cache
         for_each_online_node(node) {
                 n = get_node(s, node);
 
-                if (!n->nr_partial)
+                if (!n || !n->nr_partial)
                         continue;
 
                 for (i = 0; i < s->objects; i++)
@@ -2757,7 +2806,8 @@ static unsigned long validate_slab_cache
         for_each_online_node(node) {
                 struct kmem_cache_node *n = get_node(s, node);
 
-                count += validate_slab_node(s, n);
+                if (n)
+                        count += validate_slab_node(s, n);
         }
         return count;
 }
@@ -2981,7 +3031,7 @@ static int list_locations(struct kmem_ca
                 unsigned long flags;
                 struct page *page;
 
-                if (!atomic_read(&n->nr_slabs))
+                if (!n || !atomic_read(&n->nr_slabs))
                         continue;
 
                 spin_lock_irqsave(&n->list_lock, flags);
@@ -3104,6 +3154,9 @@ static unsigned long slab_objects(struct
         for_each_online_node(node) {
                 struct kmem_cache_node *n = get_node(s, node);
 
+                if (!n)
+                        continue;
+
                 if (flags & SO_PARTIAL) {
                         if (flags & SO_OBJECTS)
                                 x = count_partial(n);
* Re: [PATCH] numa: mempolicy: dynamic interleave map for system init.
From: Paul Mundt @ 2007-06-08  6:05 UTC
To: Christoph Lameter
Cc: Andrew Morton, linux-mm, ak, hugh, lee.schermerhorn, mpm

On Thu, Jun 07, 2007 at 09:27:01PM -0700, Christoph Lameter wrote:
> On Fri, 8 Jun 2007, Paul Mundt wrote:
>
> > Node 1 SUnreclaim:          8 kB
>
> > So at least that gets back the couple of slab pages!
>
> Hmmmm.. is that worth it? The patch is not right, btw. There is still the
> case that new_slab() can acquire a page on the wrong node, and since we are
> not set up to allow that node in SLUB we will crash.
>
Well, every page we can get back is a win in this situation, since we're
talking about individual pages being used by applications. The other 56k
is a bit more problematic, but that's something I'd like to narrow down
as well. I don't mind giving up a chunk of the node as long as the
majority of it is usable for applications, but certainly every page we
can get back helps.

> This now gets a bit ugly. In order to avoid that situation we check
> first if the node is allowed. If not then we simply ask for an alloc on
> the first node.
>
> But that may still make the page allocator fall back. If that happens then
> we redo the allocation with GFP_THISNODE to force an allocation on the
> first node or fail.
>
This patch works fine for the few cases I've tried, at least.

> I think we could do better by constructing a custom zonelist, but that will
> be even more special casing.
>
I don't know if a custom zonelist is worth the trouble. For the common
asymmetric case, you could at least infer that ZONE_NORMAL is the only
thing populated per node (well, small nodes other than node 0). If you
mean just creating the zonelist from the range of allowable SLUB nodes,
that could work.
* Re: [PATCH] numa: mempolicy: dynamic interleave map for system init.
From: Christoph Lameter @ 2007-06-08  6:09 UTC
To: Paul Mundt; +Cc: Andrew Morton, linux-mm, ak, hugh, lee.schermerhorn, mpm

On Fri, 8 Jun 2007, Paul Mundt wrote:

> > I think we could do better by constructing a custom zonelist, but that will
> > be even more special casing.
> >
> I don't know if a custom zonelist is worth the trouble. For the common
> asymmetric case, you could at least infer that ZONE_NORMAL is the only
> thing populated per node (well, small nodes other than node 0). If you
> mean just creating the zonelist from the range of allowable SLUB nodes,
> that could work.

Well, that is quite difficult because of the other constraints on the alloc.
The allocation must consider the cpuset context and the memory policies of
the task (which may need special casing already there for interleave).
Maybe we can determine a zonelist from those restrictions, then kick out
the zones belonging to the illegal nodes from that zonelist, and pass that
to __alloc_pages() to perform the alloc.

Looks like we are heading for a new alloc function:

	alloc_pages_node_not_nodes(order, gfpmask, node, forbidden-nodes)

But maybe the hack of just going to node 0 on a problem is enough???
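For illustration only -- not from any patch in this thread -- a rough sketch
of what such a helper could look like if it simply walks the preferred node's
zonelist and skips the forbidden nodes, rather than building a custom
zonelist. The function name follows the suggestion above, and the cpuset and
mempolicy constraints being discussed are deliberately ignored:

static struct page *alloc_pages_node_not_nodes(gfp_t gfp, unsigned int order,
                                               int node, nodemask_t *forbidden)
{
        struct zonelist *zl = &NODE_DATA(node)->node_zonelists[gfp_zone(gfp)];
        struct zone **z;

        /* Walk the usual fallback order, but skip forbidden nodes. */
        for (z = zl->zones; *z; z++) {
                int nid = zone_to_nid(*z);
                struct page *page;

                if (node_isset(nid, *forbidden))
                        continue;

                page = alloc_pages_node(nid, gfp | __GFP_THISNODE, order);
                if (page)
                        return page;
        }

        return NULL;
}

Retrying node by node with __GFP_THISNODE is heavier than a single
__alloc_pages() call over a filtered zonelist, but it avoids having to
construct that zonelist in the first place.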
* Re: [PATCH] numa: mempolicy: dynamic interleave map for system init.
From: Paul Mundt @ 2007-06-08  6:27 UTC
To: Christoph Lameter
Cc: Andrew Morton, linux-mm, ak, hugh, lee.schermerhorn, mpm

On Thu, Jun 07, 2007 at 11:09:48PM -0700, Christoph Lameter wrote:
> On Fri, 8 Jun 2007, Paul Mundt wrote:
>
> > > I think we could do better by constructing a custom zonelist, but that will
> > > be even more special casing.
> > >
> > I don't know if a custom zonelist is worth the trouble. For the common
> > asymmetric case, you could at least infer that ZONE_NORMAL is the only
> > thing populated per node (well, small nodes other than node 0). If you
> > mean just creating the zonelist from the range of allowable SLUB nodes,
> > that could work.
>
> Well, that is quite difficult because of the other constraints on the alloc.
> The allocation must consider the cpuset context and the memory policies of
> the task (which may need special casing already there for interleave).
> Maybe we can determine a zonelist from those restrictions, then kick out
> the zones belonging to the illegal nodes from that zonelist, and pass that
> to __alloc_pages() to perform the alloc.
>
> Looks like we are heading for a new alloc function:
>
> 	alloc_pages_node_not_nodes(order, gfpmask, node, forbidden-nodes)
>
> But maybe the hack of just going to node 0 on a problem is enough???

That depends on the policy; in the MPOL_BIND case we certainly don't want
to bleed out to node 0. For the general case, falling back on node 0 in
the event of a problem seems to be a reasonable compromise. In the longer
term, alloc_pages_not_nodes() may be worthwhile for the cases where
symmetric and asymmetric nodes are both present, without wanting to put
all of the pressure on node 0.

This is largely why I was leaning towards flags in the pgdat, suggesting
what the node is willing to put up with. It would be fairly trivial to
construct a map of allowable SLUB nodes and potentials for the zonelist
out of that. This still doesn't solve the problem of cpuset constraints,
though.

Incidentally, the interleave map created for mempol sysinit is something
that could also be picked up by SLUB for the allowable node map (at least
as a starting point, excluding cpuset constraints).
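Purely as an illustration of the "node behavioural hints" idea above -- the
node_hints field and these helpers do not exist anywhere; they only show the
shape such an interface might take:

/* Hypothetical per-node hint flags, stored in the pgdat */
#define NODE_HINT_SLAB          0x1     /* node may back slab/metadata allocations */
#define NODE_HINT_INTERLEAVE    0x2     /* node is large enough to interleave over */

static inline int node_allows_slab(int nid)
{
        /* node_hints is an invented pgdat field, for illustration only */
        return NODE_DATA(nid)->node_hints & NODE_HINT_SLAB;
}

static inline int node_allows_interleave(int nid)
{
        return NODE_DATA(nid)->node_hints & NODE_HINT_INTERLEAVE;
}

SLUB's forbidden_node() test and the dynamic interleave map from the original
patch could then both be driven from the same per-node hints, rather than
from separate boot parameters.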
* Re: [PATCH] numa: mempolicy: dynamic interleave map for system init.
From: Christoph Lameter @ 2007-06-08  6:43 UTC
To: Paul Mundt; +Cc: Andrew Morton, linux-mm, ak, hugh, lee.schermerhorn, mpm

On Fri, 8 Jun 2007, Paul Mundt wrote:

> Incidentally, the interleave map created for mempol sysinit is something
> that could also be picked up by SLUB for the allowable node map (at least
> as a starting point, excluding cpuset constraints).

SLUB already uses that map on bootup through the page allocator, so for
boot you can actually restrict SLUB without any additional patches. The
problem is later, when the policy is set to MPOL_DEFAULT.

The key problem is that the node restrictions add an additional constraint
to the ones that SLUB already obeys.
* Re: [PATCH] numa: mempolicy: dynamic interleave map for system init.
From: Matt Mackall @ 2007-06-08 14:50 UTC
To: Paul Mundt, Christoph Lameter, Andrew Morton, linux-mm, ak, hugh, lee.schermerhorn, Nick Piggin

On Fri, Jun 08, 2007 at 12:25:05PM +0900, Paul Mundt wrote:
> On Thu, Jun 07, 2007 at 07:47:09PM -0700, Christoph Lameter wrote:
> > On Thu, 7 Jun 2007, Andrew Morton wrote:
> >
> > > Well I took silence as assent.
> >
> > Well, grudgingly. How far are we willing to go to support these asymmetric
> > setups? The NUMA code was initially designed for mostly symmetric systems
> > with roughly the same amount of memory on each node. The farther we move
> > away from that, the more special casing we will have to add to deal with
> > these imbalances.
> >
> Well, this doesn't all have to be dynamic either. I opted for the
> mpolinit= approach first so we wouldn't make the accounting for the
> common case heavier, but certainly having it dynamic is less hassle. The
> asymmetric case will likely be the common case for embedded, but it's
> obviously possible to try to work that into SLOB or something similar,
> if making SLUB or SLAB lighter weight and more tunable for these cases
> ends up being a real barrier.
>
> On the other hand, as we start having machines with multiple gigs of RAM
> that are stashed in node 0 (with many smaller memories in other nodes),
> SLOB isn't going to be a long-term option either.

SLOB in -mm should scale to this size reasonably well now, and Nick
and I have another tweak planned that should make it quite fast here.

SLOB's big scalability problem at this point is number of CPUs.
Throwing some fine-grained locking at it or the like may be able to
help with that too.

Why would you even want to bother making it scale that large? For
starters, it's less affected by things like dcache fragmentation. The
majority of pages pinned by long-lived dcache entries will still be
available to other allocations.

Haven't given any thought to NUMA yet though..

--
Mathematics is the supreme nostalgia of our time.
* Re: [PATCH] numa: mempolicy: dynamic interleave map for system init.
From: Nick Piggin @ 2007-06-12  2:36 UTC
To: Matt Mackall
Cc: Paul Mundt, Christoph Lameter, Andrew Morton, linux-mm, ak, hugh, lee.schermerhorn

Matt Mackall wrote:
> On Fri, Jun 08, 2007 at 12:25:05PM +0900, Paul Mundt wrote:
>
>>Well, this doesn't all have to be dynamic either. I opted for the
>>mpolinit= approach first so we wouldn't make the accounting for the
>>common case heavier, but certainly having it dynamic is less hassle. The
>>asymmetric case will likely be the common case for embedded, but it's
>>obviously possible to try to work that into SLOB or something similar,
>>if making SLUB or SLAB lighter weight and more tunable for these cases
>>ends up being a real barrier.
>>
>>On the other hand, as we start having machines with multiple gigs of RAM
>>that are stashed in node 0 (with many smaller memories in other nodes),
>>SLOB isn't going to be a long-term option either.
>
> SLOB in -mm should scale to this size reasonably well now, and Nick
> and I have another tweak planned that should make it quite fast here.

Indeed. The existing code in -mm should hopefully get merged next cycle,
so if you have ever wanted to use SLOB but had performance problems,
please reevaluate and report if you still hit problems. Even on small
SMPs, it might be a reasonable choice, although it won't be able to match
the other allocators for performance.

Again, if you have problems with SMP scalability of SLOB, then please let
us know too, because as Matt said there are a few things we could do
(such as multiple freelists) which may improve performance quite a bit
without hurting complexity or memory usage much.

--
SUSE Labs, Novell Inc.
* Re: [PATCH] numa: mempolicy: dynamic interleave map for system init.
From: Paul Mundt @ 2007-06-12  9:43 UTC
To: Matt Mackall
Cc: Christoph Lameter, Andrew Morton, linux-mm, ak, hugh, lee.schermerhorn, Nick Piggin

On Fri, Jun 08, 2007 at 09:50:11AM -0500, Matt Mackall wrote:
> SLOB's big scalability problem at this point is number of CPUs.
> Throwing some fine-grained locking at it or the like may be able to
> help with that too.
>
> Why would you even want to bother making it scale that large? For
> starters, it's less affected by things like dcache fragmentation. The
> majority of pages pinned by long-lived dcache entries will still be
> available to other allocations.
>
> Haven't given any thought to NUMA yet though..
>
This is what I've hacked together and tested with my small nodes. It's
not terribly intelligent, and it pushes off most of the logic to the page
allocator. Obviously it's not terribly scalable, and I haven't tested it
with page migration, either. Still, it works for me with my simple tmpfs
+ mpol policy tests.

Tested on a UP + SPARSEMEM (static, not extreme) + NUMA (2 nodes) + SLOB
configuration.

Flame away!

Signed-off-by: Paul Mundt <lethal@linux-sh.org>

--

 include/linux/slab.h |    7 ++++
 mm/slob.c            |   80 ++++++++++++++++++++++++++++++++++++++++++--------
 2 files changed, 73 insertions(+), 14 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index a015236..efc87c1 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -200,6 +200,13 @@ static inline void *__kmalloc_node(size_t size, gfp_t flags, int node)
 {
         return __kmalloc(size, flags);
 }
+#elif defined(CONFIG_SLOB)
+extern void *__kmalloc_node(size_t size, gfp_t flags, int node);
+
+static inline void *kmalloc_node(size_t size, gfp_t flags, int node)
+{
+        return __kmalloc_node(size, flags, node);
+}
 #endif /* !CONFIG_NUMA */
 
 /*
diff --git a/mm/slob.c b/mm/slob.c
index 71976c5..48af24c 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -74,7 +74,7 @@ static void slob_free(void *b, int size);
 
 static void slob_timer_cbk(void);
 
-static void *slob_alloc(size_t size, gfp_t gfp, int align)
+static void *slob_alloc(size_t size, gfp_t gfp, int align, int node)
 {
         slob_t *prev, *cur, *aligned = 0;
         int delta = 0, units = SLOB_UNITS(size);
@@ -111,12 +111,19 @@ static void *slob_alloc(size_t size, gfp_t gfp, int align)
                         return cur;
                 }
                 if (cur == slobfree) {
+                        void *pages;
+
                         spin_unlock_irqrestore(&slob_lock, flags);
 
                         if (size == PAGE_SIZE) /* trying to shrink arena? */
                                 return 0;
 
-                        cur = (slob_t *)__get_free_page(gfp);
+                        if (node == -1)
+                                pages = alloc_pages(gfp, 0);
+                        else
+                                pages = alloc_pages_node(node, gfp, 0);
+
+                        cur = page_address(pages);
                         if (!cur)
                                 return 0;
 
@@ -161,23 +168,29 @@ static void slob_free(void *block, int size)
         spin_unlock_irqrestore(&slob_lock, flags);
 }
 
-void *__kmalloc(size_t size, gfp_t gfp)
+static void *__kmalloc_alloc(size_t size, gfp_t gfp, int node)
 {
         slob_t *m;
         bigblock_t *bb;
         unsigned long flags;
+        void *page;
 
         if (size < PAGE_SIZE - SLOB_UNIT) {
-                m = slob_alloc(size + SLOB_UNIT, gfp, 0);
+                m = slob_alloc(size + SLOB_UNIT, gfp, 0, node);
                 return m ? (void *)(m + 1) : 0;
         }
 
-        bb = slob_alloc(sizeof(bigblock_t), gfp, 0);
+        bb = slob_alloc(sizeof(bigblock_t), gfp, 0, node);
         if (!bb)
                 return 0;
 
         bb->order = get_order(size);
-        bb->pages = (void *)__get_free_pages(gfp, bb->order);
+        if (node == -1)
+                page = alloc_pages(gfp, bb->order);
+        else
+                page = alloc_pages_node(node, gfp, bb->order);
+
+        bb->pages = page_address(page);
 
         if (bb->pages) {
                 spin_lock_irqsave(&block_lock, flags);
@@ -190,8 +203,21 @@ void *__kmalloc(size_t size, gfp_t gfp)
         slob_free(bb, sizeof(bigblock_t));
         return 0;
 }
+
+void *__kmalloc(size_t size, gfp_t gfp)
+{
+        return __kmalloc_alloc(size, gfp, -1);
+}
 EXPORT_SYMBOL(__kmalloc);
 
+#ifdef CONFIG_NUMA
+void *__kmalloc_node(size_t size, gfp_t gfp, int node)
+{
+        return __kmalloc_alloc(size, gfp, node);
+}
+EXPORT_SYMBOL(__kmalloc_node);
+#endif
+
 /**
  * krealloc - reallocate memory. The contents will remain unchanged.
  *
@@ -289,7 +315,7 @@ struct kmem_cache *kmem_cache_create(const char *name, size_t size,
 {
         struct kmem_cache *c;
 
-        c = slob_alloc(sizeof(struct kmem_cache), flags, 0);
+        c = slob_alloc(sizeof(struct kmem_cache), flags, 0, -1);
 
         if (c) {
                 c->name = name;
@@ -317,22 +343,44 @@ void kmem_cache_destroy(struct kmem_cache *c)
 }
 EXPORT_SYMBOL(kmem_cache_destroy);
 
-void *kmem_cache_alloc(struct kmem_cache *c, gfp_t flags)
+static void *__kmem_cache_alloc(struct kmem_cache *c, gfp_t flags, int node)
 {
         void *b;
 
         if (c->size < PAGE_SIZE)
-                b = slob_alloc(c->size, flags, c->align);
-        else
-                b = (void *)__get_free_pages(flags, get_order(c->size));
+                b = slob_alloc(c->size, flags, c->align, node);
+        else {
+                void *pages;
+
+                if (node == -1)
+                        pages = alloc_pages(flags, get_order(c->size));
+                else
+                        pages = alloc_pages_node(node, flags,
+                                                 get_order(c->size));
+
+                b = page_address(pages);
+        }
 
         if (c->ctor)
                 c->ctor(b, c, 0);
 
         return b;
 }
+
+void *kmem_cache_alloc(struct kmem_cache *c, gfp_t flags)
+{
+        return __kmem_cache_alloc(c, flags, -1);
+}
 EXPORT_SYMBOL(kmem_cache_alloc);
 
+#ifdef CONFIG_NUMA
+void *kmem_cache_alloc_node(struct kmem_cache *c, gfp_t flags, int node)
+{
+        return __kmem_cache_alloc(c, flags, node);
+}
+EXPORT_SYMBOL(kmem_cache_alloc_node);
+#endif
+
 void *kmem_cache_zalloc(struct kmem_cache *c, gfp_t flags)
 {
         void *ret = kmem_cache_alloc(c, flags);
@@ -406,10 +454,14 @@ void __init kmem_cache_init(void)
 
 static void slob_timer_cbk(void)
 {
-        void *p = slob_alloc(PAGE_SIZE, 0, PAGE_SIZE-1);
+        int node;
+
+        for_each_online_node(node) {
+                void *p = slob_alloc(PAGE_SIZE, 0, PAGE_SIZE-1, node);
 
-        if (p)
-                free_page((unsigned long)p);
+                if (p)
+                        free_page((unsigned long)p);
+        }
 
         mod_timer(&slob_timer, jiffies + HZ);
 }
* Re: [PATCH] numa: mempolicy: dynamic interleave map for system init.
From: Matt Mackall @ 2007-06-12 15:32 UTC
To: Paul Mundt, Christoph Lameter, Andrew Morton, linux-mm, ak, hugh, lee.schermerhorn, Nick Piggin

On Tue, Jun 12, 2007 at 06:43:59PM +0900, Paul Mundt wrote:
> On Fri, Jun 08, 2007 at 09:50:11AM -0500, Matt Mackall wrote:
> > SLOB's big scalability problem at this point is number of CPUs.
> > Throwing some fine-grained locking at it or the like may be able to
> > help with that too.
> >
> > Why would you even want to bother making it scale that large? For
> > starters, it's less affected by things like dcache fragmentation. The
> > majority of pages pinned by long-lived dcache entries will still be
> > available to other allocations.
> >
> > Haven't given any thought to NUMA yet though..
> >
> This is what I've hacked together and tested with my small nodes. It's
> not terribly intelligent, and it pushes off most of the logic to the page
> allocator. Obviously it's not terribly scalable, and I haven't tested it
> with page migration, either. Still, it works for me with my simple tmpfs
> + mpol policy tests.
>
> Tested on a UP + SPARSEMEM (static, not extreme) + NUMA (2 nodes) + SLOB
> configuration.
>
> Flame away!

For starters, it's not against the current SLOB, which no longer has
the bigblock list.

> -void *__kmalloc(size_t size, gfp_t gfp)
> +static void *__kmalloc_alloc(size_t size, gfp_t gfp, int node)

That's a ridiculous name. So, uh.. more underbars!

Though really, I think you can just name it __kmalloc_node?

> +        if (node == -1)
> +                pages = alloc_pages(flags, get_order(c->size));
> +        else
> +                pages = alloc_pages_node(node, flags,
> +                                         get_order(c->size));

This fragment appears a few times. Looks like it ought to get its own
function. And that function can reduce to a trivial inline in the
!NUMA case.

> +void *kmem_cache_alloc_node(struct kmem_cache *c, gfp_t flags, int node)
> +{
> +        return __kmem_cache_alloc(c, flags, node);
> +}

If we make the underlying functions all take a node, this stuff all
gets simpler.

> static void slob_timer_cbk(void)

This is gone in the latest SLOB too.

--
Mathematics is the supreme nostalgia of our time.
* Re: [PATCH] numa: mempolicy: dynamic interleave map for system init.
From: Nick Piggin @ 2007-06-13  2:10 UTC
To: Matt Mackall
Cc: Paul Mundt, Christoph Lameter, Andrew Morton, linux-mm, ak, hugh, lee.schermerhorn

Matt Mackall wrote:
> On Tue, Jun 12, 2007 at 06:43:59PM +0900, Paul Mundt wrote:
>
>>On Fri, Jun 08, 2007 at 09:50:11AM -0500, Matt Mackall wrote:
>>
>>>SLOB's big scalability problem at this point is number of CPUs.
>>>Throwing some fine-grained locking at it or the like may be able to
>>>help with that too.
>>>
>>>Why would you even want to bother making it scale that large? For
>>>starters, it's less affected by things like dcache fragmentation. The
>>>majority of pages pinned by long-lived dcache entries will still be
>>>available to other allocations.
>>>
>>>Haven't given any thought to NUMA yet though..
>>>
>>
>>This is what I've hacked together and tested with my small nodes. It's
>>not terribly intelligent, and it pushes off most of the logic to the page
>>allocator. Obviously it's not terribly scalable, and I haven't tested it
>>with page migration, either. Still, it works for me with my simple tmpfs
>>+ mpol policy tests.
>>
>>Tested on a UP + SPARSEMEM (static, not extreme) + NUMA (2 nodes) + SLOB
>>configuration.
>>
>>Flame away!
>
> For starters, it's not against the current SLOB, which no longer has
> the bigblock list.
>
>>-void *__kmalloc(size_t size, gfp_t gfp)
>>+static void *__kmalloc_alloc(size_t size, gfp_t gfp, int node)
>
> That's a ridiculous name. So, uh.. more underbars!
>
> Though really, I think you can just name it __kmalloc_node?
>
>>+        if (node == -1)
>>+                pages = alloc_pages(flags, get_order(c->size));
>>+        else
>>+                pages = alloc_pages_node(node, flags,
>>+                                         get_order(c->size));
>
> This fragment appears a few times. Looks like it ought to get its own
> function. And that function can reduce to a trivial inline in the
> !NUMA case.

BTW, what I would like to see tried initially -- which may give reasonable
scalability and NUMAness -- is perhaps per-cpu or per-node free page
lists. However, these lists would not be exclusively per-cpu, because that
would result in worse memory consumption (we should always try to put
memory consumption above all else with SLOB).

So each list would have its own lock and can be accessed by any CPU, but
each CPU would default to its own list first (or in the case of a
kmalloc_node, it could default to some other list).

Then we'd probably like to introduce a *little* bit of slack, so that we
will allocate a new page on our local list even if there is a small amount
of memory free on another list. I think this might be enough to get a
reasonable number of list-local allocations without blowing out the memory
usage much. The slack ratio could be configurable, so at one extreme we
could always allocate from our local lists for best NUMA placement, I guess.

I haven't given it a great deal of thought, so this strategy might go
horribly wrong in some cases... but I have a feeling something reasonably
simple like that might go a long way to improving locking scalability and
NUMAness.

--
SUSE Labs, Novell Inc.
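To make that concrete, a hypothetical sketch of the per-node list and the
list-picking decision -- slob_nodelist, free_units and slob_slack are all
invented names, and nothing like this exists in SLOB today:

struct slob_nodelist {
        spinlock_t lock;
        slob_t *freelist;
        unsigned long free_units;       /* SLOB units currently free on this list */
};

static struct slob_nodelist slob_lists[MAX_NUMNODES];
static int slob_slack = 8;              /* remote list must have this many spare units */

/* Decide which node's list should service an allocation of 'units'. */
static int slob_pick_list(int local, int units)
{
        int nid;

        /* Prefer the local list if it plausibly has room. */
        if (slob_lists[local].free_units >= units)
                return local;

        /* Steal from another list only if it has units to spare... */
        for_each_online_node(nid)
                if (nid != local &&
                    slob_lists[nid].free_units >= units + slob_slack)
                        return nid;

        /* ...otherwise grow the local list with a fresh page. */
        return local;
}

Each list would still take its own lock around the actual freelist walk, so
any CPU can fall back to any list, which preserves the memory-consumption
property described above.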
* Re: [PATCH] numa: mempolicy: dynamic interleave map for system init.
From: Matt Mackall @ 2007-06-13  3:12 UTC
To: Nick Piggin
Cc: Paul Mundt, Christoph Lameter, Andrew Morton, linux-mm, ak, hugh, lee.schermerhorn

On Wed, Jun 13, 2007 at 12:10:21PM +1000, Nick Piggin wrote:
> Matt Mackall wrote:
> >On Tue, Jun 12, 2007 at 06:43:59PM +0900, Paul Mundt wrote:
> >
> >>This is what I've hacked together and tested with my small nodes. It's
> >>not terribly intelligent, and it pushes off most of the logic to the page
> >>allocator. Obviously it's not terribly scalable, and I haven't tested it
> >>with page migration, either. Still, it works for me with my simple tmpfs
> >>+ mpol policy tests.
> >>
> >>Tested on a UP + SPARSEMEM (static, not extreme) + NUMA (2 nodes) + SLOB
> >>configuration.
> >>
> >>Flame away!
> >
> >For starters, it's not against the current SLOB, which no longer has
> >the bigblock list.
> >
> >>-void *__kmalloc(size_t size, gfp_t gfp)
> >>+static void *__kmalloc_alloc(size_t size, gfp_t gfp, int node)
> >
> >That's a ridiculous name. So, uh.. more underbars!
> >
> >Though really, I think you can just name it __kmalloc_node?
> >
> >>+        if (node == -1)
> >>+                pages = alloc_pages(flags, get_order(c->size));
> >>+        else
> >>+                pages = alloc_pages_node(node, flags,
> >>+                                         get_order(c->size));
> >
> >This fragment appears a few times. Looks like it ought to get its own
> >function. And that function can reduce to a trivial inline in the
> >!NUMA case.
>
> BTW. what I would like to see tried initially -- which may give reasonable
> scalability and NUMAness -- is perhaps a percpu or per-node free pages
> lists. However these lists would not be exclusively per-cpu, because that
> would result in worse memory consumption (we should always try to put
> memory consumption above all else with SLOB).
>
> So each list would have its own lock and can be accessed by any CPU, but
> they would default to their own list first (or in the case of a
> kmalloc_node, they could default to some other list).
>
> Then we'd probably like to introduce a *little* bit of slack, so that we
> will allocate a new page on our local list even if there is a small amount
> of memory free on another list. I think this might be enough to get a
> reasonable number of list-local allocations without blowing out the memory
> usage much. The slack ratio could be configurable so at one extreme we
> could always allocate from our local lists for best NUMA placement I guess.
>
> I haven't given it a great deal of thought, so this strategy might go
> horribly wrong in some cases... but I have a feeling something reasonably
> simple like that might go a long way to improving locking scalability and
> NUMAness.

It's an interesting problem.

There's a fair amount more we can do to get performance up on SMP which
should probably happen before we think too much about NUMA.

--
Mathematics is the supreme nostalgia of our time.
* Re: [PATCH] numa: mempolicy: dynamic interleave map for system init.
From: Paul Mundt @ 2007-06-13  2:53 UTC
To: Matt Mackall
Cc: Christoph Lameter, Andrew Morton, linux-mm, ak, hugh, lee.schermerhorn, Nick Piggin

On Tue, Jun 12, 2007 at 10:32:34AM -0500, Matt Mackall wrote:
> On Tue, Jun 12, 2007 at 06:43:59PM +0900, Paul Mundt wrote:
> > On Fri, Jun 08, 2007 at 09:50:11AM -0500, Matt Mackall wrote:
> > > Haven't given any thought to NUMA yet though..
> > >
> > This is what I've hacked together and tested with my small nodes. It's
> > not terribly intelligent, and it pushes off most of the logic to the page
> > allocator. Obviously it's not terribly scalable, and I haven't tested it
> > with page migration, either. Still, it works for me with my simple tmpfs
> > + mpol policy tests.
> >
> > Tested on a UP + SPARSEMEM (static, not extreme) + NUMA (2 nodes) + SLOB
> > configuration.
> >
> > Flame away!
>
> For starters, it's not against the current SLOB, which no longer has
> the bigblock list.
>
Sorry about that, seems I used the wrong tree.

> > -void *__kmalloc(size_t size, gfp_t gfp)
> > +static void *__kmalloc_alloc(size_t size, gfp_t gfp, int node)
>
> That's a ridiculous name. So, uh.. more underbars!
>
Agreed, though I couldn't think of a better one.

> Though really, I think you can just name it __kmalloc_node?
>
No, kmalloc_node and __kmalloc_node are both required by CONFIG_NUMA,
otherwise that would have been the logical choice.

> > +        if (node == -1)
> > +                pages = alloc_pages(flags, get_order(c->size));
> > +        else
> > +                pages = alloc_pages_node(node, flags,
> > +                                         get_order(c->size));
>
> This fragment appears a few times. Looks like it ought to get its own
> function. And that function can reduce to a trivial inline in the
> !NUMA case.
>
Ok.

> > +void *kmem_cache_alloc_node(struct kmem_cache *c, gfp_t flags, int node)
> > +{
> > +        return __kmem_cache_alloc(c, flags, node);
> > +}
>
> If we make the underlying functions all take a node, this stuff all
> gets simpler.
>
Could you elaborate on that? We only require the node specifier in the
allocation path, and this simply hands it down in accordance with the
existing APIs. After allocation time the node id is encoded in the page
flags, so we can easily figure out which node a page is tied to.

I'll post the updated patch separately.
* Re: [PATCH] numa: mempolicy: dynamic interleave map for system init.
From: Matt Mackall @ 2007-06-13  3:16 UTC
To: Paul Mundt, Christoph Lameter, Andrew Morton, linux-mm, ak, hugh, lee.schermerhorn, Nick Piggin

On Wed, Jun 13, 2007 at 11:53:37AM +0900, Paul Mundt wrote:
> On Tue, Jun 12, 2007 at 10:32:34AM -0500, Matt Mackall wrote:
> > On Tue, Jun 12, 2007 at 06:43:59PM +0900, Paul Mundt wrote:
> > > On Fri, Jun 08, 2007 at 09:50:11AM -0500, Matt Mackall wrote:
> > > > Haven't given any thought to NUMA yet though..
> > > >
> > > This is what I've hacked together and tested with my small nodes. It's
> > > not terribly intelligent, and it pushes off most of the logic to the page
> > > allocator. Obviously it's not terribly scalable, and I haven't tested it
> > > with page migration, either. Still, it works for me with my simple tmpfs
> > > + mpol policy tests.
> > >
> > > Tested on a UP + SPARSEMEM (static, not extreme) + NUMA (2 nodes) + SLOB
> > > configuration.
> > >
> > > Flame away!
> >
> > For starters, it's not against the current SLOB, which no longer has
> > the bigblock list.
> >
> Sorry about that, seems I used the wrong tree.
>
> > > -void *__kmalloc(size_t size, gfp_t gfp)
> > > +static void *__kmalloc_alloc(size_t size, gfp_t gfp, int node)
> >
> > That's a ridiculous name. So, uh.. more underbars!
> >
> Agreed, though I couldn't think of a better one.
>
> > Though really, I think you can just name it __kmalloc_node?
> >
> No, kmalloc_node and __kmalloc_node are both required by CONFIG_NUMA,
> otherwise that would have been the logical choice.

What I'm suggesting is: _always_ have __kmalloc_node and have __kmalloc
be a trivial inline that calls it. Together with cleaning up the
following piece, it may compile down to what we currently have on UP/SMP:

> > > +        if (node == -1)
> > > +                pages = alloc_pages(flags, get_order(c->size));
> > > +        else
> > > +                pages = alloc_pages_node(node, flags,
> > > +                                         get_order(c->size));
> >
> > This fragment appears a few times. Looks like it ought to get its own
> > function. And that function can reduce to a trivial inline in the
> > !NUMA case.
> >
> Ok.
>
> > > +void *kmem_cache_alloc_node(struct kmem_cache *c, gfp_t flags, int node)
> > > +{
> > > +        return __kmem_cache_alloc(c, flags, node);
> > > +}
> >
> > If we make the underlying functions all take a node, this stuff all
> > gets simpler.
> >
> Could you elaborate on that?

See above. Just make the non-node versions wrappers around the node
versions everywhere.

--
Mathematics is the supreme nostalgia of our time.
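In other words, roughly this shape, using the names from Paul's patch (the
slob_new_pages() helper name is made up here):

/* Collapses to a plain alloc_pages() on !NUMA builds. */
static inline struct page *slob_new_pages(gfp_t gfp, int order, int node)
{
#ifdef CONFIG_NUMA
        if (node != -1)
                return alloc_pages_node(node, gfp, order);
#endif
        return alloc_pages(gfp, order);
}

/* The non-node entry points become one-line wrappers... */
void *__kmalloc(size_t size, gfp_t gfp)
{
        return __kmalloc_node(size, gfp, -1);
}

void *kmem_cache_alloc(struct kmem_cache *c, gfp_t flags)
{
        return kmem_cache_alloc_node(c, flags, -1);
}

/* ...and only the *_node() variants do the real work. */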