linux-mm.kvack.org archive mirror
From: Nish Aravamudan <nish.aravamudan@gmail.com>
To: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	Joonsoo Kim <iamjoonsoo.kim@lge.com>,
	David Rientjes <rientjes@google.com>,
	Wanpeng Li <liwanp@linux.vnet.ibm.com>,
	Jiang Liu <jiang.liu@linux.intel.com>,
	Tony Luck <tony.luck@intel.com>,
	Fenghua Yu <fenghua.yu@intel.com>,
	linux-ia64@vger.kernel.org,
	Linux Memory Management List <linux-mm@kvack.org>,
	linuxppc-dev@lists.ozlabs.org,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Tejun Heo <tj@kernel.org>
Subject: Re: [RFC 1/2] workqueue: use the nearest NUMA node, not the local one
Date: Fri, 18 Jul 2014 10:33:41 -0700	[thread overview]
Message-ID: <CAOhV88OCqvfo_0yjA3b7uKiuXE6bVwH7WQLj00BES7JzbMimkg@mail.gmail.com> (raw)
In-Reply-To: <53C8D6A8.3040400@cn.fujitsu.com>


[ Apologies for replying from a different address; we have a service outage
at work. ]

On Fri, Jul 18, 2014 at 1:11 AM, Lai Jiangshan <laijs@cn.fujitsu.com> wrote:
>
> Hi,

Thank you for your response!

> I'm curious about what will happen when
> alloc_pages_node(memoryless_node) is called.

alloc_pages_node() is only involved in one of the possible paths (perhaps
on x86, where THREAD_SIZE > PAGE_SIZE?). On powerpc, though, that's not
the case.

Details:

1. pool->node is used in the invocation of kthread_create_on_node() in
create_worker().
2. kthread_create_on_node() sets up a struct kthread_create_info with
create->node = node and wakes up kthreadd.
3. kthreadd calls create_kthread, which sets current->pref_node_fork =
create->node.
4. dup_task_struct() calls node = tsk_fork_get_node() before invoking
alloc_task_struct_node(node) and alloc_thread_info_node(node).
5. tsk_fork_get_node() returns current->pref_node_fork for kthreadd.
6. alloc_task_struct_node() calls kmem_cache_alloc_node(,GFP_KERNEL, node).
7. alloc_thread_info_node() either calls kmem_cache_alloc_node(,GFP_KERNEL,
node) or alloc_kmem_pages_node(node,GFP_KERNEL,), depending on THREAD_SIZE
relative to PAGE_SIZE.
8a. alloc_kmem_pages_node() -> alloc_pages_node() -> __alloc_pages() with a
zonelist built by node_zonelist(). This should lead to proper fallback.
8b. kmem_cache_alloc_node() calls slab_alloc_node()
9. For a memoryless node, we will trigger the following:

        if (unlikely(!object || !node_match(page, node))) {
                object = __slab_alloc(s, gfpflags, node, addr, c);
                stat(s, ALLOC_SLOWPATH);
        }

10. __slab_alloc() in turn will:

        if (unlikely(!node_match(page, node))) {
                stat(s, ALLOC_NODE_MISMATCH);
                deactivate_slab(s, page, c->freelist);
                c->page = NULL;
                c->freelist = NULL;
                goto new_slab;
        }

deactivating the slab. Thus, every kthread created with a node
specification leads to a slab holding a single object. We see an explosion
in slab consumption, all of it unreclaimable.
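To make the blow-up concrete, here is a minimal userspace sketch of that
behavior (not the real SLUB code; the node layout, the fallback node, and
the objects-per-slab count are all invented for illustration):

```c
#include <assert.h>
#include <stdbool.h>

#define NR_NODES      4
#define FALLBACK_NODE 2   /* hypothetical nearest node with memory */

/* Hypothetical topology: node 1 is memoryless. */
static const bool node_has_memory[NR_NODES] = { true, false, true, true };

/* Returns how many slab pages `allocs` allocations for `node` consume. */
static int simulate(int node, int allocs, int objs_per_slab)
{
    int cur_node = -1;  /* node of the current per-cpu slab, -1 = none */
    int used = 0;       /* objects handed out from that slab */
    int slabs = 0;

    for (int i = 0; i < allocs; i++) {
        if (cur_node != node || used == objs_per_slab) {
            /* Mirrors __slab_alloc(): node mismatch -> deactivate_slab(),
             * then a fresh slab is taken from a node that has memory.
             * A memoryless node's slab always comes from elsewhere, so
             * the next request mismatches again. */
            cur_node = node_has_memory[node] ? node : FALLBACK_NODE;
            used = 0;
            slabs++;
        }
        used++;
    }
    return slabs;
}
```

With 8 objects per slab, 16 allocations for a node with memory consume 2
slabs; the same 16 allocations aimed at the memoryless node consume 16,
i.e., one slab per object.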

Anton originally proposed not deactivating slabs when we *know* the
allocation will be remote (i.e., from a memoryless node). Joonsoo and
Christoph disagreed with this and proposed alternative solutions, which
weren't agreed upon at the time.

> If the memory is allocated from the most preferable node for the
> @memoryless_node, why do we need to bother and use cpu_to_mem() at the
> caller site?

The reason is that the node passed is a hint to the MM subsystem about
which node we want memory to come from. Well, I take that back: I think
semantically there are two ways to interpret the node parameter:

1) The NUMA node we want memory from
2) The NUMA node we expect memory from

The path through the MM above conflates the two: the caller specified an
impossible request (which the MM subsystem technically knows, but does not
use at this point) for memory from a node that has none.

We could change the core MM to do better in the presence of memoryless
nodes, and we should, but this change is far less invasive and does the
right thing. Semantically, I think the workqueue's pool->node is meant to
be the node from which we want memory allocated, i.e., the node with
memory closest to the CPU.
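To illustrate the distinction, here is a hypothetical userspace model (not
the kernel's implementation; the 4-node topology, which CPUs sit on which
node, and the nearest-first fallback tables are all invented). The local
node is what cpu_to_node() gives you even when it is memoryless; the
nearest node that actually has memory is what cpu_to_mem() gives you:

```c
#include <assert.h>
#include <stdbool.h>

#define NR_NODES 4

/* Hypothetical topology: CPUs 2 and 3 sit on memoryless node 1. */
static const int  cpu_node[]           = { 0, 0, 1, 1 };
static const bool has_memory[NR_NODES] = { true, false, true, true };

/* Nearest-first node order, as a zonelist would encode it. */
static const int nearest[NR_NODES][NR_NODES] = {
    { 0, 1, 2, 3 },
    { 1, 2, 0, 3 },
    { 2, 1, 3, 0 },
    { 3, 2, 1, 0 },
};

/* The CPU's local node, whether or not it has memory. */
static int cpu_to_node_model(int cpu)
{
    return cpu_node[cpu];
}

/* The nearest node that actually has memory to allocate from. */
static int cpu_to_mem_model(int cpu)
{
    int node = cpu_node[cpu];

    for (int i = 0; i < NR_NODES; i++)
        if (has_memory[nearest[node][i]])
            return nearest[node][i];
    return node;  /* unreachable while any node has memory */
}
```

For CPU 2 in this made-up topology, cpu_to_node_model() returns the
memoryless node 1 (an impossible request), while cpu_to_mem_model()
returns node 2, the node the allocation will actually be satisfied from.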

Thanks,
Nish

> If not, why does the memory allocation subsystem refuse to find a
> preferable node for @memoryless_node in this case? Is that intentional,
> or can it not find one in some cases?
>
> Thanks,
> Lai
>
> Added CC to Tejun (workqueue maintainer).
>
> On 07/18/2014 07:09 AM, Nishanth Aravamudan wrote:
> > In the presence of memoryless nodes, the workqueue code incorrectly uses
> > cpu_to_node() to determine what node to prefer memory allocations come
> > from. cpu_to_mem() should be used instead, which will use the nearest
> > NUMA node with memory.
> >
> > Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
> >
> > diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> > index 35974ac..0bba022 100644
> > --- a/kernel/workqueue.c
> > +++ b/kernel/workqueue.c
> > @@ -3547,7 +3547,12 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
> >               for_each_node(node) {
> >                       if (cpumask_subset(pool->attrs->cpumask,
> >                                          wq_numa_possible_cpumask[node])) {
> > -                             pool->node = node;
> > +                             /*
> > +                              * We could use local_memory_node(node) here,
> > +                              * but it is expensive and the following caches
> > +                              * the same value.
> > +                              */
> > +                             pool->node = cpu_to_mem(cpumask_first(pool->attrs->cpumask));
> >                               break;
> >                       }
> >               }
> > @@ -4921,7 +4926,7 @@ static int __init init_workqueues(void)
> >                       pool->cpu = cpu;
> >                       cpumask_copy(pool->attrs->cpumask, cpumask_of(cpu));
> >                       pool->attrs->nice = std_nice[i++];
> > -                     pool->node = cpu_to_node(cpu);
> > +                     pool->node = cpu_to_mem(cpu);
> >
> >                       /* alloc pool ID */
> >                       mutex_lock(&wq_pool_mutex);
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> >
>



Thread overview: 13+ messages
2014-07-17 23:09 [RFC 0/2] Memoryless nodes and kworker Nishanth Aravamudan
2014-07-17 23:09 ` [RFC 1/2] workqueue: use the nearest NUMA node, not the local one Nishanth Aravamudan
2014-07-17 23:15   ` [RFC 2/2] powerpc: reorder per-cpu NUMA information's initialization Nishanth Aravamudan
2014-07-18  8:11   ` [RFC 1/2] workqueue: use the nearest NUMA node, not the local one Lai Jiangshan
2014-07-18 17:33     ` Nish Aravamudan [this message]
2014-07-18 11:20 ` [RFC 0/2] Memoryless nodes and kworker Tejun Heo
2014-07-18 17:42   ` Nish Aravamudan
2014-07-18 18:00     ` Tejun Heo
2014-07-18 18:01       ` Tejun Heo
2014-07-18 18:12       ` Nish Aravamudan
2014-07-18 18:19         ` Tejun Heo
2014-07-18 18:47           ` Nish Aravamudan
2014-07-18 18:58             ` Tejun Heo
