From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qg0-f51.google.com (mail-qg0-f51.google.com [209.85.192.51]) by kanga.kvack.org (Postfix) with ESMTP id 350F36B0253 for ; Tue, 4 Aug 2015 04:25:07 -0400 (EDT) Received: by qgab18 with SMTP id b18so1414270qga.2 for ; Tue, 04 Aug 2015 01:25:07 -0700 (PDT) Received: from unicom145.biz-email.net (unicom145.biz-email.net. [210.51.26.145]) by mx.google.com with ESMTPS id 8si482951qku.96.2015.08.04.01.25.04 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Tue, 04 Aug 2015 01:25:06 -0700 (PDT) Date: Tue, 4 Aug 2015 16:26:39 +0800 From: "gongzhaogang@inspur.com" Subject: Re: Re: [PATCH 1/5] x86, gfp: Cache best near node for memory allocation. References: <1436261425-29881-1-git-send-email-tangchen@cn.fujitsu.com>, <1436261425-29881-2-git-send-email-tangchen@cn.fujitsu.com>, <20150715214802.GL15934@mtj.duckdns.org>, <55C03332.2030808@cn.fujitsu.com> MIME-Version: 1.0 Message-ID: <201508041626380745999@inspur.com> Content-Type: multipart/alternative; boundary="----=_001_NextPart081587038428_=----" Sender: owner-linux-mm@kvack.org List-ID: To: Tang Chen , "tj@kernel.org" Cc: "mingo@redhat.com" , "akpm@linux-foundation.org" , "rjw@rjwysocki.net" , "hpa@zytor.com" , "laijs@cn.fujitsu.com" , "yasu.isimatu@gmail.com" , "isimatu.yasuaki@jp.fujitsu.com" , "kamezawa.hiroyu@jp.fujitsu.com" , "izumi.taku@jp.fujitsu.com" , "qiaonuohan@cn.fujitsu.com" , "x86@kernel.org" , "linux-acpi@vger.kernel.org" , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" ------=_001_NextPart081587038428_=---- Content-Type: text/plain; charset="ISO-8859-1" Content-Transfer-Encoding: base64 U29ycnksSSBhbSBuZXcuDQo+QnV0LA0KPjEpIGluIGNwdV91cCgpLCBpdCB3aWxsIHRyeSB0byBv bmxpbmUgYSBub2RlLCBhbmQgaXQgZG9lc24ndCBjaGVjayBpZg0KPnRoZSBub2RlIGhhcyBtZW1v cnkuDQo+MikgaW4gdHJ5X29mZmxpbmVfbm9kZSgpLCBpdCBvZmZsaW5lcyBDUFVzIGZpcnN0LCBh bmQgdGhlbiB0aGUgbWVtb3J5Lg0KIA0KPlRoaXMgYmVoYXZpb3IgbG9va3MgYSBsaXR0bGUgd2ly ZWQsIG9yIGxldCdzIHNheSBpdCBpcyBhbWJpZ3VvdXMuIEl0DQo+c2VlbXMgdGhhdCBhIE5VTUEg bm9kZQ0KPmNvbnNpc3RzIG9mIENQVXMgYW5kIG1lbW9yeS4gU28gaWYgdGhlIENQVXMgYXJlIG9u bGluZSwgdGhlIG5vZGUgc2hvdWxkDQo+YmUgb25saW5lLg0KSSBzdWdnZXN0ZWQgeW91IHRvIHRy eSB0aGUgcGF0Y2ggb2ZmZXJlZCBieSBMaXUgSmlhbmcuDQoNCmh0dHBzOi8vbGttbC5vcmcvbGtt bC8yMDE0LzkvMTEvMTA4NyANCg0KSSBoYXZlIHRyaWVkICxJdCBpcyBPSy4NCg0KPlVuZm9ydHVu YXRlbHksIHNpbmNlIEkgZG9uJ3QgaGF2ZSBhIG1hY2hpbmUgYSB3aXRoIG1lbW9yeS1sZXNzIG5v ZGUsIEkNCj5jYW5ub3QgcmVwcm9kdWNlDQo+dGhlIHByb2JsZW0gcmlnaHQgbm93Lg0KDQpJZiAg bm90IGh1cnJpZWQgICwgSSBjYW4gdGVzdCB5b3VyIHBhdGNoZXMgaW4gb3VyIGVudmlyb25tZW50 IG9uIHdlZWtlbmRzLg0KDQoNCg0KZ29uZ3poYW9nYW5nQGluc3B1ci5jb20NCiANCkZyb206IFRh bmcgQ2hlbg0KRGF0ZTogMjAxNS0wOC0wNCAxMTozNg0KVG86IFRlanVuIEhlbw0KQ0M6IG1pbmdv QHJlZGhhdC5jb207IGFrcG1AbGludXgtZm91bmRhdGlvbi5vcmc7IHJqd0Byand5c29ja2kubmV0 OyBocGFAenl0b3IuY29tOyBsYWlqc0Bjbi5mdWppdHN1LmNvbTsgeWFzdS5pc2ltYXR1QGdtYWls LmNvbTsgaXNpbWF0dS55YXN1YWtpQGpwLmZ1aml0c3UuY29tOyBrYW1lemF3YS5oaXJveXVAanAu ZnVqaXRzdS5jb207IGl6dW1pLnRha3VAanAuZnVqaXRzdS5jb207IGdvbmd6aGFvZ2FuZ0BpbnNw dXIuY29tOyBxaWFvbnVvaGFuQGNuLmZ1aml0c3UuY29tOyB4ODZAa2VybmVsLm9yZzsgbGludXgt YWNwaUB2Z2VyLmtlcm5lbC5vcmc7IGxpbnV4LWtlcm5lbEB2Z2VyLmtlcm5lbC5vcmc7IGxpbnV4 LW1tQGt2YWNrLm9yZzsgdGFuZ2NoZW5AY24uZnVqaXRzdS5jb20NClN1YmplY3Q6IFJlOiBbUEFU Q0ggMS81XSB4ODYsIGdmcDogQ2FjaGUgYmVzdCBuZWFyIG5vZGUgZm9yIG1lbW9yeSBhbGxvY2F0 aW9uLg0KSGkgVEosDQogDQpTb3JyeSBmb3IgdGhlIGxhdGUgcmVwbHkuDQogDQpPbiAwNy8xNi8y MDE1IDA1OjQ4IEFNLCBUZWp1biBIZW8gd3JvdGU6DQo+IC4uLi4uLg0KPiBzbyBpbiBpbml0aWFs aXphdGlvbiBwaGFyc2UgbWFrZXMgbm8gc2Vuc2UgYW55IG1vcmUuIFRoZSBiZXN0IG5lYXIgb25s aW5lDQo+IG5vZGUgZm9yIGVhY2ggY3B1IHNob3VsZCBiZSBjYWNoZWQgc29tZXdoZXJlLg0KPiBJ J20gbm90IHJlYWxseSBmb2xsb3dpbmcuICBJcyB0aGlzIGJlY2F1c2UgdGhlIG5vdyBvZmZsaW5l IG5vZGUgY2FuDQo+IGxhdGVyIGNvbWUgb25saW5lIGFuZCB3ZSdkIGhhdmUgdG8gYnJlYWsgdGhl IGNvbnN0YW50IG1hcHBpbmcNCj4gaW52YXJpYW50IGlmIHdlIHVwZGF0ZSB0aGUgbWFwcGluZyBs YXRlcj8gIElmIHNvLCBpdCdkIGJlIG5pY2UgdG8NCj4gc3BlbGwgdGhhdCBvdXQuDQogDQpZZXMu IFdpbGwgZG9jdW1lbnQgdGhpcyBpbiB0aGUgbmV4dCB2ZXJzaW9uLg0KIA0KPj4gLi4uLi4uDQo+ PiAgIA0KPj4gK2ludCBnZXRfbmVhcl9vbmxpbmVfbm9kZShpbnQgbm9kZSkNCj4+ICt7DQo+PiAr IHJldHVybiBwZXJfY3B1KHg4Nl9jcHVfdG9fbmVhcl9vbmxpbmVfbm9kZSwNCj4+ICsgICAgICAg IGNwdW1hc2tfZmlyc3QoJm5vZGVfdG9fY3B1aWRfbWFza19tYXBbbm9kZV0pKTsNCj4+ICt9DQo+ PiArRVhQT1JUX1NZTUJPTChnZXRfbmVhcl9vbmxpbmVfbm9kZSk7DQo+IFVtbS4uLiB0aGlzIGZ1 bmN0aW9uIGlzIHNpdHRpbmcgb24gYSBmYWlybHkgaG90IHBhdGggYW5kIHNjYW5uaW5nIGENCj4g Y3B1bWFzayBlYWNoIHRpbWUuICBXaHkgbm90IGp1c3QgYnVpbGQgYSBudW1hIG5vZGUgLT4gbnVt YSBub2RlIGFycmF5Pw0KIA0KSW5kZWVkLiBXaWxsIGF2b2lkIHRvIHNjYW4gYSBjcHVtYXNrLg0K IA0KPiAuLi4uLi4NCj4NCj4+ICAgDQo+PiAgIHN0YXRpYyBpbmxpbmUgc3RydWN0IHBhZ2UgKmFs bG9jX3BhZ2VzX2V4YWN0X25vZGUoaW50IG5pZCwgZ2ZwX3QgZ2ZwX21hc2ssDQo+PiAgIHVuc2ln bmVkIGludCBvcmRlcikNCj4+ICAgew0KPj4gLSBWTV9CVUdfT04obmlkIDwgMCB8fCBuaWQgPj0g TUFYX05VTU5PREVTIHx8ICFub2RlX29ubGluZShuaWQpKTsNCj4+ICsgVk1fQlVHX09OKG5pZCA8 IDAgfHwgbmlkID49IE1BWF9OVU1OT0RFUyk7DQo+PiArDQo+PiArI2lmIElTX0VOQUJMRUQoQ09O RklHX1g4NikgJiYgSVNfRU5BQkxFRChDT05GSUdfTlVNQSkNCj4+ICsgaWYgKCFub2RlX29ubGlu ZShuaWQpKQ0KPj4gKyBuaWQgPSBnZXRfbmVhcl9vbmxpbmVfbm9kZShuaWQpOw0KPj4gKyNlbmRp Zg0KPj4gICANCj4+ICAgcmV0dXJuIF9fYWxsb2NfcGFnZXMoZ2ZwX21hc2ssIG9yZGVyLCBub2Rl X3pvbmVsaXN0KG5pZCwgZ2ZwX21hc2spKTsNCj4+ICAgfQ0KPiBEaXR0by4gIEFsc28sIHdoYXQn cyB0aGUgc3luY2hyb25pemF0aW9uIHJ1bGVzIGZvciBOVU1BIG5vZGUNCj4gb24vb2ZmbGluaW5n LiAgSWYgeW91IGVuZCB1cCB1cGRhdGluZyB0aGUgbWFwcGluZyBsYXRlciwgaG93IHdvdWxkDQo+ IHRoYXQgYmUgc3luY2hyb25pemVkIGFnYWluc3QgdGhlIGFib3ZlIHVzYWdlcz8NCiANCkkgdGhp bmsgdGhlIG5lYXIgb25saW5lIG5vZGUgbWFwIHNob3VsZCBiZSB1cGRhdGVkIHdoZW4gbm9kZSBv bmxpbmUvb2ZmbGluZQ0KaGFwcGVucy4gQnV0IGFib3V0IHRoaXMsIEkgdGhpbmsgdGhlIGN1cnJl bnQgbnVtYSBjb2RlIGhhcyBhIGxpdHRsZSBwcm9ibGVtLg0KIA0KQXMgeW91IGtub3csIGZpcm13 YXJlIGluZm8gYmluZHMgYSBzZXQgb2YgQ1BVcyBhbmQgbWVtb3J5IHRvIGEgbm9kZS4gQnV0DQph dCBib290IHRpbWUsIGlmIHRoZSBub2RlIGhhcyBubyBtZW1vcnkgKGEgbWVtb3J5LWxlc3Mgbm9k ZSkgLCBpdCB3b24ndCANCmJlIG9ubGluZS4NCkJ1dCB0aGUgQ1BVcyBvbiB0aGF0IG5vZGUgaXMg YXZhaWxhYmxlLCBhbmQgYm91bmQgdG8gdGhlIG5lYXIgb25saW5lIG5vZGUuDQooSGVyZSwgSSBt ZWFuIG51bWFfc2V0X25vZGUoY3B1LCBub2RlKS4pDQogDQpXaHkgZG9lcyB0aGUga2VybmVsIGRv IHRoaXMgPyBJIHRoaW5rIGl0IGlzIHVzZWQgdG8gZW5zdXJlIHRoYXQgd2UgY2FuIA0KYWxsb2Nh dGUgbWVtb3J5DQpzdWNjZXNzZnVsbHkgYnkgY2FsbGluZyBmdW5jdGlvbnMgbGlrZSBhbGxvY19w YWdlc19ub2RlKCkgYW5kIA0KYWxsb2NfcGFnZXNfZXhhY3Rfbm9kZSgpLg0KQnkgdGhlc2UgdHdv IGZ1Y3Rpb25zLCBhbnkgQ1BVIHNob3VsZCBiZSBib3VuZCB0byBhIG5vZGUgd2hvIGhhcyBtZW1v cnkgDQpzbyB0aGF0DQptZW1vcnkgYWxsb2NhdGlvbiBjYW4gYmUgc3VjY2Vzc2Z1bC4NCiANClRo YXQgbWVhbnMsIGZvciBhIG1lbW9yeS1sZXNzIG5vZGUgYXQgYm9vdCB0aW1lLCBDUFVzIG9uIHRo ZSBub2RlIGlzIA0Kb25saW5lLA0KYnV0IHRoZSBub2RlIGlzIG5vdCBvbmxpbmUuDQogDQpUaGF0 IGFsc28gbWVhbnMsICJ0aGUgbm9kZSBpcyBvbmxpbmUiIGVxdWFscyB0byAidGhlIG5vZGUgaGFz IG1lbW9yeSIuIA0KQWN0dWFsbHksIHRoZXJlDQphcmUgYSBsb3Qgb2YgY29kZSBpbiB0aGUga2Vy bmVsIGlzIHVzaW5nIHRoaXMgcnVsZS4NCiANCiANCkJ1dCwNCjEpIGluIGNwdV91cCgpLCBpdCB3 aWxsIHRyeSB0byBvbmxpbmUgYSBub2RlLCBhbmQgaXQgZG9lc24ndCBjaGVjayBpZiANCnRoZSBu b2RlIGhhcyBtZW1vcnkuDQoyKSBpbiB0cnlfb2ZmbGluZV9ub2RlKCksIGl0IG9mZmxpbmVzIENQ VXMgZmlyc3QsIGFuZCB0aGVuIHRoZSBtZW1vcnkuDQogDQpUaGlzIGJlaGF2aW9yIGxvb2tzIGEg bGl0dGxlIHdpcmVkLCBvciBsZXQncyBzYXkgaXQgaXMgYW1iaWd1b3VzLiBJdCANCnNlZW1zIHRo YXQgYSBOVU1BIG5vZGUNCmNvbnNpc3RzIG9mIENQVXMgYW5kIG1lbW9yeS4gU28gaWYgdGhlIENQ VXMgYXJlIG9ubGluZSwgdGhlIG5vZGUgc2hvdWxkIA0KYmUgb25saW5lLg0KIA0KQW5kIGFsc28s DQpUaGUgbWFpbiBwdXJwb3NlIG9mIHRoaXMgcGF0Y2gtc2V0IGlzIHRvIG1ha2UgdGhlIGNwdWlk IDwtPiBub2RlaWQgDQptYXBwaW5nIHBlcnNpc3RlbnQuDQpBZnRlciB0aGlzIHBhdGNoLXNldCwg YWxsb2NfcGFnZXNfbm9kZSgpIGFuZCBhbGxvY19wYWdlc19leGFjdF9ub2RlKCkgDQp3b24ndCBk ZXBlbmQgb24NCmNwdWlkIDwtPiBub2RlaWQgbWFwcGluZyBhbnkgbW9yZS4gU28gdGhlIG5vZGUg c2hvdWxkIGJlIG9ubGluZSBpZiB0aGUgDQpDUFVzIG9uIGl0IGFyZQ0Kb25saW5lLiBPdGhlcndp c2UsIHdlIGNhbm5vdCBzZXR1cCBpbnRlcmZhY2VzIG9mIENQVXMgdW5kZXIgL3N5cy4NCiANCiAN ClVuZm9ydHVuYXRlbHksIHNpbmNlIEkgZG9uJ3QgaGF2ZSBhIG1hY2hpbmUgYSB3aXRoIG1lbW9y eS1sZXNzIG5vZGUsIEkgDQpjYW5ub3QgcmVwcm9kdWNlDQp0aGUgcHJvYmxlbSByaWdodCBub3cu DQogDQpIb3cgZG8geW91IHRoaW5rIHRoZSBub2RlIG9ubGluZSBiZWhhdmlvciBzaG91bGQgYmUg Y2hhbmdlZCA/DQogDQpUaGFua3MuDQogDQogDQogDQogDQogDQogDQogDQogDQogDQogDQogDQog DQogDQogDQogDQogDQogDQogDQogDQogDQogDQogDQogDQogDQogDQogDQogDQogDQogDQogDQog DQogDQogDQogDQogDQogDQogDQo= ------=_001_NextPart081587038428_=---- Content-Type: text/html; charset="ISO-8859-1" Content-Transfer-Encoding: quoted-printable =0A
Sorry,I am new.
>But,
>1) in cpu_up(), it will try to online a node, and it doesn't check if=
>the node has memory.
>2) in try_offline_node()= , it offlines CPUs first, and then the memory.
 
= >This behavior looks a little wired, or let's say it is ambiguous. It
>seems that a NUMA node
>consists of CPUs and mem= ory. So if the CPUs are online, the node should
>be online.
=0A
I suggested you to try the patch offered by Liu Jiang.

https://lkml.org/lkml/2014/9/11/108= 7 

I have tried ,It is OK.
=
>Unfortunately, since I don't have a machine a with= memory-less node, I
>cannot reproduce
>the prob= lem right now.

If  not hurried  ,= I can test your patches in our environment on weekends.


=0A
gongzhaogang@inspur.com
= =0A
 
From: <= a href=3D"mailto:tangchen@cn.fujitsu.com">Tang Chen
Date:=  2015-08-04 11:36
Hi TJ,
=0A
&n= bsp;
=0A
Sorry for the late reply.
=0A
 
=0AOn 07/16/2015 05:48 AM, Tejun Heo wrote:
=0A
> ......
= =0A
> so in initialization pharse makes no sense any more. The best= near online
=0A
> node for each cpu should be cached somewher= e.
=0A
> I'm not really following.  Is this because the n= ow offline node can
=0A
> later come online and we'd have to b= reak the constant mapping
=0A
> invariant if we update the map= ping later?  If so, it'd be nice to
=0A
> spell that out.=
=0A
 
=0A
Yes. Will document this in the next vers= ion.
=0A
 
=0A
>> ......
=0A
>>= ;  
=0A
>> +int get_near_online_node(int node)=0A
>> +{
=0A
>> + return per_cpu(x86_cpu_to_= near_online_node,
=0A
>> +      &= nbsp; cpumask_first(&node_to_cpuid_mask_map[node]));
=0A
>= > +}
=0A
>> +EXPORT_SYMBOL(get_near_online_node);
= =0A
> Umm... this function is sitting on a fairly hot path and scan= ning a
=0A
> cpumask each time.  Why not just build a num= a node -> numa node array?
=0A
 
=0A
Indeed. Wil= l avoid to scan a cpumask.
=0A
 
=0A
> ......=0A
>
=0A
>>  
=0A
>>&n= bsp;  static inline struct page *alloc_pages_exact_node(int nid, gfp_= t gfp_mask,
=0A
>>   unsigned int order)=0A
>>   {
=0A
>> - VM_BUG_ON(nid &l= t; 0 || nid >=3D MAX_NUMNODES || !node_online(nid));
=0A
>&= gt; + VM_BUG_ON(nid < 0 || nid >=3D MAX_NUMNODES);
=0A
>= > +
=0A
>> +#if IS_ENABLED(CONFIG_X86) && IS_ENA= BLED(CONFIG_NUMA)
=0A
>> + if (!node_online(nid))
=0A<= div>>> + nid =3D get_near_online_node(nid);
=0A
>> += #endif
=0A
>>  
=0A
>>  = ; return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));=0A
>>   }
=0A
> Ditto.  Also, wha= t's the synchronization rules for NUMA node
=0A
> on/offlining= .  If you end up updating the mapping later, how would
=0A
&= gt; that be synchronized against the above usages?
=0A
 =0A
I think the near online node map should be updated when node onl= ine/offline
=0A
happens. But about this, I think the current numa= code has a little problem.
=0A
 
=0A
As you know, = firmware info binds a set of CPUs and memory to a node. But
=0A
a= t boot time, if the node has no memory (a memory-less node) , it won't =0A
be online.
=0A
But the CPUs on that node is available,= and bound to the near online node.
=0A
(Here, I mean numa_set_no= de(cpu, node).)
=0A
 
=0A
Why does the kernel do th= is ? I think it is used to ensure that we can
=0A
allocate memor= y
=0A
successfully by calling functions like alloc_pages_node() a= nd
=0A
alloc_pages_exact_node().
=0A
By these two fucti= ons, any CPU should be bound to a node who has memory
=0A
so tha= t
=0A
memory allocation can be successful.
=0A
 =0A
That means, for a memory-less node at boot time, CPUs on the no= de is
=0A
online,
=0A
but the node is not online.
= =0A
 
=0A
That also means, "the node is online" equals t= o "the node has memory".
=0A
Actually, there
=0A
are a = lot of code in the kernel is using this rule.
=0A
 
=0A=
 
=0A
But,
=0A
1) in cpu_up(), it will try to = online a node, and it doesn't check if
=0A
the node has memory.<= /div>=0A
2) in try_offline_node(), it offlines CPUs first, and then th= e memory.
=0A
 
=0A
This behavior looks a little wi= red, or let's say it is ambiguous. It
=0A
seems that a NUMA node=
=0A
consists of CPUs and memory. So if the CPUs are online, the = node should
=0A
be online.
=0A
 
=0A
And = also,
=0A
The main purpose of this patch-set is to make the cpuid= <-> nodeid
=0A
mapping persistent.
=0A
After thi= s patch-set, alloc_pages_node() and alloc_pages_exact_node()
=0Awon't depend on
=0A
cpuid <-> nodeid mapping any more. So = the node should be online if the
=0A
CPUs on it are
=0Aonline. Otherwise, we cannot setup interfaces of CPUs under /sys.
= =0A
 
=0A
 
=0A
Unfortunately, since I don= 't have a machine a with memory-less node, I
=0A
cannot reproduc= e
=0A
the problem right now.
=0A
 
=0A
How= do you think the node online behavior should be changed ?
=0A
&n= bsp;
=0A
Thanks.
=0A
 
=0A
 
=0A=
 
=0A
 
=0A
 
=0A
 =0A
 
=0A
 
=0A
 
=0A
&nb= sp;
=0A
 
=0A
 
=0A
 
=0A 
=0A
 
=0A
 
=0A
 =0A
 
=0A
 
=0A
 
=0A
 = ;
=0A
 
=0A
 
=0A
 
=0A 
=0A
 
=0A
 
=0A
 
= =0A
 
=0A
 
=0A
 
=0A
 =
=0A
 
=0A
 
=0A
 
=0A
=  
=0A
 
=0A
=0A ------=_001_NextPart081587038428_=------ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org