From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-vc0-f179.google.com (mail-vc0-f179.google.com [209.85.220.179])
	by kanga.kvack.org (Postfix) with ESMTP id A6ED46B0037
	for ; Fri, 18 Jul 2014 14:12:02 -0400 (EDT)
Received: by mail-vc0-f179.google.com with SMTP id hq11so6419849vcb.38
	for ; Fri, 18 Jul 2014 11:12:02 -0700 (PDT)
Received: from mail-vc0-x22a.google.com (mail-vc0-x22a.google.com [2607:f8b0:400c:c03::22a])
	by mx.google.com with ESMTPS id z8si6636562ven.51.2014.07.18.11.12.02
	for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128);
	Fri, 18 Jul 2014 11:12:02 -0700 (PDT)
Received: by mail-vc0-f170.google.com with SMTP id lf12so8103008vcb.1
	for ; Fri, 18 Jul 2014 11:12:02 -0700 (PDT)
MIME-Version: 1.0
In-Reply-To: <20140718180008.GC13012@htj.dyndns.org>
References: <20140717230923.GA32660@linux.vnet.ibm.com>
	<20140718112039.GA8383@htj.dyndns.org>
	<20140718180008.GC13012@htj.dyndns.org>
Date: Fri, 18 Jul 2014 11:12:01 -0700
Message-ID:
Subject: Re: [RFC 0/2] Memoryless nodes and kworker
From: Nish Aravamudan
Content-Type: multipart/alternative; boundary=001a11c3bc7aa94e1004fe7bb104
Sender: owner-linux-mm@kvack.org
List-ID:
To: Tejun Heo
Cc: Nishanth Aravamudan, Benjamin Herrenschmidt, Joonsoo Kim,
	David Rientjes, Wanpeng Li, Jiang Liu, Tony Luck, Fenghua Yu,
	linux-ia64@vger.kernel.org, Linux Memory Management List,
	linuxppc-dev@lists.ozlabs.org,
	"linux-kernel@vger.kernel.org"

--001a11c3bc7aa94e1004fe7bb104
Content-Type: text/plain; charset=UTF-8

Hi Tejun,

[I found the other thread where you made these points, thank you for
expressing them so clearly again!]

On Fri, Jul 18, 2014 at 11:00 AM, Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Fri, Jul 18, 2014 at 10:42:29AM -0700, Nish Aravamudan wrote:
> > So, to be clear, this is not *necessarily* about memoryless nodes. It's
> > about the semantics intended. The workqueue code currently calls
> > cpu_to_node() in a few places, and passes that node into the core MM as a
> > hint about where the memory should come from. However, when memoryless
> > nodes are present, that hint is guaranteed to be wrong, as it's the nearest
> > NUMA node to the CPU (which happens to be the one its on), not the nearest
> > NUMA node with memory. The hint is correctly specified as cpu_to_mem(),
>
> It's telling the allocator the node the CPU is on. Choosing and
> falling back the actual allocation is the allocator's job.

Ok, I agree with you then, if that's all the semantics are supposed to
be.

But looking at the comment for kthread_create_on_node:

 * If thread is going to be bound on a particular cpu, give its node
 * in @node, to get NUMA affinity for kthread stack, or else give -1.

so the API interprets it as a suggestion for the affinity itself, *not*
the node the kthread should be on. Piddly, yes, but actually I have
another thought altogether, and in reviewing Jiang's patches this seems
like the right approach:

Why aren't these callers using kthread_create_on_cpu()? That API was
already changed to use cpu_to_mem() [so one change, rather than changes
all over the kernel source]. We could change it back to cpu_to_node()
and push the knowledge about the fallback down.

> > which does the right thing in the presence or absence of memoryless nodes.
> > And I think encapsulates the hint's semantics correctly -- please give me
> > memory from where I expect it, which is the closest NUMA node.
>
> I don't think it does. It loses information at too high a layer.
> Workqueue here doesn't care how memory subsystem is structured, it's
> just telling the allocator where it's at and expecting it to do the
> right thing. Please consider the following scenario.
>
>         A - B - C - D - E
>
> Let's say C is a memory-less node. If we map from C to either B or D
> from individual users and that node can't serve that memory request,
> the allocator would fall back to A or E respectively when the right
> thing to do would be falling back to D or B respectively, right?

Yes, this is a good point. But honestly, we're not really even at the
point of talking about fallback here. At least in my testing, going
off-node at all causes SLUB-configured slabs to deactivate, which then
leads to an explosion in unreclaimable slab.

> This isn't a huge issue but it shows that this is the wrong layer to
> deal with this issue. Let the allocators express where they are.
> Choosing and falling back belong to the memory allocator. That's the
> only place which has all the information that's necessary and those
> details must be contained there. Please don't leak it to memory
> allocator users.

Ok, I will continue to work at that level of abstraction.

Thanks,
Nish

--001a11c3bc7aa94e1004fe7bb104
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
--001a11c3bc7aa94e1004fe7bb104--

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
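
The A - B - C - D - E scenario from the thread can be sketched in plain user-space C. This is only an illustration, not kernel code and not taken from any patch in the thread: every name below (nearest_node(), caller_maps_then_allocates(), allocator_decides(), the scenario arrays) is hypothetical, distance is assumed to be the hop count along the line, and ties are assumed to resolve to the lower-numbered node. It compares a caller that translates its own node to the nearest node with memory before calling the allocator against a caller that simply reports the node it is on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical linear topology: A - B - C - D - E */
enum { NODE_A, NODE_B, NODE_C, NODE_D, NODE_E, NR_NODES };

/* Assumption: distance is the hop count between two nodes on the line. */
static int node_distance(int a, int b)
{
	return abs(a - b);
}

/* Nearest node to `origin` with ok[n] set; ties go to the lower id. */
static int nearest_node(int origin, const bool ok[NR_NODES])
{
	int best = -1;

	assert(origin >= 0 && origin < NR_NODES);
	for (int n = 0; n < NR_NODES; n++) {
		if (!ok[n])
			continue;
		if (best < 0 ||
		    node_distance(origin, n) < node_distance(origin, best))
			best = n;
	}
	return best;
}

/* Nearest node to `hint` that has memory and can serve the request. */
static int allocate_near(int hint, const bool has_memory[NR_NODES],
			 const bool can_serve[NR_NODES])
{
	bool usable[NR_NODES];

	for (int n = 0; n < NR_NODES; n++)
		usable[n] = has_memory[n] && can_serve[n];
	return nearest_node(hint, usable);
}

/* Caller maps its node to the nearest node with memory, then allocates.
 * The allocator only sees the mapped node, so it falls back from there. */
static int caller_maps_then_allocates(int cpu_node,
				      const bool has_memory[NR_NODES],
				      const bool can_serve[NR_NODES])
{
	int hint = nearest_node(cpu_node, has_memory);

	return allocate_near(hint, has_memory, can_serve);
}

/* Caller reports where it really is; the allocator does all the choosing. */
static int allocator_decides(int cpu_node,
			     const bool has_memory[NR_NODES],
			     const bool can_serve[NR_NODES])
{
	return allocate_near(cpu_node, has_memory, can_serve);
}

/* Scenario: C is memoryless, and B cannot serve this request. */
static const bool scenario_has_memory[NR_NODES] = { true, true, false, true, true };
static const bool scenario_can_serve[NR_NODES]  = { true, false, true, true, true };

static int scenario_caller_maps(void)
{
	return caller_maps_then_allocates(NODE_C, scenario_has_memory,
					  scenario_can_serve);
}

static int scenario_allocator_decides(void)
{
	return allocator_decides(NODE_C, scenario_has_memory,
				 scenario_can_serve);
}
```

Under these assumptions, scenario_caller_maps() ends up on A (two hops from C), because the allocator is told the request came from B and falls back from there, while scenario_allocator_decides() ends up on D (one hop from C). That is the information loss Tejun describes when the mapping is done by allocator users rather than by the allocator.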