6. alloc_task_struct_node() calls kmem_cache_alloc_node(..., GFP_KERNEL, node).
7. alloc_thread_info_node() either calls kmem_cache_alloc_node(..., GFP_KERNEL, node) or alloc_kmem_pages_node(node, GFP_KERNEL, ...), depending on THREAD_SIZE relative to PAGE_SIZE.
8a. alloc_kmem_pages_node() -> alloc_pages_node() -> __alloc_pages() with a zonelist built by node_zonelist(). This should lead to proper fallback.
8b. kmem_cache_alloc_node() calls slab_alloc_node().
9. For a memoryless node, we will trigger the following:
	if (unlikely(!object || !node_match(page, node))) {
		object = __slab_alloc(s, gfpflags, node, addr, c);
		stat(s, ALLOC_SLOWPATH);
	}
10. __slab_alloc() in turn will:
	if (unlikely(!node_match(page, node))) {
		stat(s, ALLOC_NODE_MISMATCH);
		deactivate_slab(s, page, c->freelist);
		c->page = NULL;
		c->freelist = NULL;
		goto new_slab;
	}
deactivating the slab. Thus, every kthread created with a node specification ends up as a single object on its own slab, and we see an explosion in slab consumption, all of it unreclaimable (a standalone sketch of the effect follows below).
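To make the effect concrete, here is a small standalone sketch (userspace C, not kernel code; struct page and struct cpu_slab here are toy stand-ins, and node_match() is a simplified version of the kernel's check) of why a request for a memoryless node misses the per-CPU slab on every allocation:

	#include <stdio.h>
	#include <stdbool.h>

	#define NUMA_NO_NODE (-1)

	struct page { int nid; };			/* node the slab page came from */
	struct cpu_slab { struct page *page; };		/* stand-in for kmem_cache_cpu */

	/* simplified node check: NUMA_NO_NODE matches anything */
	static bool node_match(const struct page *page, int node)
	{
		return node == NUMA_NO_NODE || page->nid == node;
	}

	int main(void)
	{
		struct page slab_on_node0 = { .nid = 0 };	/* memory exists only on node 0 */
		struct cpu_slab c = { .page = &slab_on_node0 };
		int memoryless_node = 1;	/* the CPU's home node has no memory */
		int deactivations = 0;
		int i;

		for (i = 0; i < 8; i++) {	/* eight node-specific allocations */
			if (!node_match(c.page, memoryless_node)) {
				/* slowpath: the current slab is deactivated ... */
				deactivations++;
				/* ... and the replacement slab is on node 0 again */
				c.page = &slab_on_node0;
			}
		}
		printf("%d allocations -> %d slab deactivations\n", i, deactivations);
		return 0;
	}

Every allocation misses node_match() because no slab page can ever belong to node 1, so each one deactivates the current slab and starts a new one: the one-object-per-slab blow-up described above.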
Anton originally proposed not deactivating slabs when we *know* the allocation will be remote (i.e., when the requested node is memoryless). Joonsoo and Christoph disagreed and proposed alternative solutions, but none of them was agreed upon at the time.
> If the memory is allocated from the most preferable node for the @memoryless_node,
> why we need to bother and use cpu_to_mem() in the caller site?
The reason is that the node passed in is a hint to the MM subsystem about which node we want memory to come from. Actually, I take that back: semantically, I think there are two ways to interpret the node parameter:
1) The NUMA node we want memory from
2) The NUMA node we expect memory from
The path through the MM above conflates the two: the caller specified an impossible request (memory from a node that has none), which the MM subsystem technically knows, but it does not use that knowledge at this point.
We could, and should, change the core MM to do better in the presence of memoryless nodes, but this change is far less invasive and does the right thing. Semantically, I think the workqueue's pool->node is meant to be the node from which we want memory allocated, i.e. the node with memory closest to the CPU.
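To make the distinction concrete, here is a small standalone sketch (userspace C, not the kernel implementation; the two-node topology, distance table, and the *_sketch helper names are invented for illustration) of how cpu_to_node() and cpu_to_mem() can differ when one node is memoryless:

	#include <stdio.h>
	#include <limits.h>
	#include <stdbool.h>

	#define NR_NODES 2

	/* invented topology: node 0 has memory, node 1 does not */
	static const bool node_has_memory[NR_NODES] = { true, false };
	static const int node_distance[NR_NODES][NR_NODES] = {
		{ 10, 20 },
		{ 20, 10 },
	};

	/* like cpu_to_node(): the CPU's own NUMA node (cpus 0-3 on node 0, 4-7 on node 1) */
	static int cpu_to_node_sketch(int cpu)
	{
		return cpu < 4 ? 0 : 1;
	}

	/* like cpu_to_mem(): the nearest node that actually has memory */
	static int cpu_to_mem_sketch(int cpu)
	{
		int home = cpu_to_node_sketch(cpu);
		int best = -1, best_dist = INT_MAX, n;

		for (n = 0; n < NR_NODES; n++) {
			if (!node_has_memory[n])
				continue;
			if (node_distance[home][n] < best_dist) {
				best_dist = node_distance[home][n];
				best = n;
			}
		}
		return best;
	}

	int main(void)
	{
		int cpu;

		for (cpu = 0; cpu < 8; cpu += 4)
			printf("cpu %d: cpu_to_node() = %d, cpu_to_mem() = %d\n",
			       cpu, cpu_to_node_sketch(cpu), cpu_to_mem_sketch(cpu));
		return 0;
	}

For a CPU on the memoryless node 1, cpu_to_node() returns 1 (a node the page allocator can never satisfy locally) while cpu_to_mem() returns 0, the nearest node with memory, which is the value we want cached in pool->node.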
Thanks,
Nish
> If not, why the memory allocation subsystem refuses to find a preferable node
> for @memoryless_node in this case? Does it intend on some purpose or
> it can't find in some cases?
>
> Thanks,
> Lai
>
> Added CC to Tejun (workqueue maintainer).
>
> On 07/18/2014 07:09 AM, Nishanth Aravamudan wrote:
> > In the presence of memoryless nodes, the workqueue code incorrectly uses
> > cpu_to_node() to determine what node to prefer memory allocations come
> > from. cpu_to_mem() should be used instead, which will use the nearest
> > NUMA node with memory.
> >
> > Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
> >
> > diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> > index 35974ac..0bba022 100644
> > --- a/kernel/workqueue.c
> > +++ b/kernel/workqueue.c
> > @@ -3547,7 +3547,12 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
> >  	for_each_node(node) {
> >  		if (cpumask_subset(pool->attrs->cpumask,
> >  				   wq_numa_possible_cpumask[node])) {
> > -			pool->node = node;
> > +			/*
> > +			 * We could use local_memory_node(node) here,
> > +			 * but it is expensive and the following caches
> > +			 * the same value.
> > +			 */
> > +			pool->node = cpu_to_mem(cpumask_first(pool->attrs->cpumask));
> >  			break;
> >  		}
> >  	}
> > @@ -4921,7 +4926,7 @@ static int __init init_workqueues(void)
> >  			pool->cpu = cpu;
> >  			cpumask_copy(pool->attrs->cpumask, cpumask_of(cpu));
> >  			pool->attrs->nice = std_nice[i++];
> > -			pool->node = cpu_to_node(cpu);
> > +			pool->node = cpu_to_mem(cpu);
> >  
> >  			/* alloc pool ID */
> >  			mutex_lock(&wq_pool_mutex);
> >