Subject: Re: [PATCH] mm: don't rely on system state to detect hot-plug operations
To: Michal Hocko
Cc: akpm@linux-foundation.org, David Hildenbrand, Oscar Salvador, rafael@kernel.org, nathanl@linux.ibm.com, cheloha@linux.ibm.com, stable@vger.kernel.org, Greg Kroah-Hartman, linux-mm@kvack.org, LKML
References: <5cbd92e1-c00a-4253-0119-c872bfa0f2bc@redhat.com> <20200908170835.85440-1-ldufour@linux.ibm.com> <20200909074011.GD7348@dhcp22.suse.cz>
From: Laurent Dufour
Message-ID: <9faac1ce-c02d-7dbc-f79a-4aaaa5a73d28@linux.ibm.com>
Date: Wed, 9 Sep 2020 09:48:59 +0200
In-Reply-To: <20200909074011.GD7348@dhcp22.suse.cz>

On 09/09/2020 at 09:40, Michal Hocko wrote:
> [reposting because the malformed cc list confused my email client]
>
> On Tue 08-09-20 19:08:35, Laurent Dufour wrote:
>> In register_mem_sect_under_node() the system_state’s value is checked to
>> detect whether the operation the call is made during boot time or during an
>> hot-plug operation. Unfortunately, that check is wrong on some
>> architecture, and may lead to sections being registered under multiple
>> nodes if node's memory ranges are interleaved.
>
> Why is this check arch specific?

I was wrong, the check is not arch specific.

>> This can be seen on PowerPC LPAR after multiple memory hot-plug and
>> hot-unplug operations are done. At the next reboot the node's memory ranges
>> can be interleaved
>
> What is the exact memory layout?

For instance:
[    0.000000] Early memory node ranges
[    0.000000]   node   1: [mem 0x0000000000000000-0x000000011fffffff]
[    0.000000]   node   2: [mem 0x0000000120000000-0x000000014fffffff]
[    0.000000]   node   1: [mem 0x0000000150000000-0x00000001ffffffff]
[    0.000000]   node   0: [mem 0x0000000200000000-0x000000048fffffff]
[    0.000000]   node   2: [mem 0x0000000490000000-0x00000007ffffffff]

>
>> and since the call to link_mem_sections() is made in
>> topology_init() while the system is in the SYSTEM_SCHEDULING state, the
>> node's id is not checked, and the sections registered multiple times.
>
> So a single memory section/memblock belongs to two numa nodes?

If the node id is not checked in register_mem_sect_under_node(), yes, that is
the case.

>
>> In
>> that case, the system is able to boot but later hot-plug operation may lead
>> to this panic because the node's links are correctly broken:
>
> Correctly broken? Could you provide more details on the inconsistency
> please?

laurent@ltczep3-lp4:~$ ls -l /sys/devices/system/memory/memory21
total 0
lrwxrwxrwx 1 root root     0 Aug 24 05:27 node1 -> ../../node/node1
lrwxrwxrwx 1 root root     0 Aug 24 05:27 node2 -> ../../node/node2
-rw-r--r-- 1 root root 65536 Aug 24 05:27 online
-r--r--r-- 1 root root 65536 Aug 24 05:27 phys_device
-r--r--r-- 1 root root 65536 Aug 24 05:27 phys_index
drwxr-xr-x 2 root root     0 Aug 24 05:27 power
-r--r--r-- 1 root root 65536 Aug 24 05:27 removable
-rw-r--r-- 1 root root 65536 Aug 24 05:27 state
lrwxrwxrwx 1 root root     0 Aug 24 05:25 subsystem -> ../../../../bus/memory
-rw-r--r-- 1 root root 65536 Aug 24 05:25 uevent
-r--r--r-- 1 root root 65536 Aug 24 05:27 valid_zones

>
> Which physical memory range you are trying to add here and what is the
> node affinity?

None is added; the root cause of the issue happens at boot time.

>
>> ------------[ cut here ]------------
>> kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
>> Oops: Exception in kernel mode, sig: 5 [#1]
>> LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
>> Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
>> CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
>> NIP:  c000000000403f34 LR: c000000000403f2c CTR: 0000000000000000
>> REGS: c0000004876e3660 TRAP: 0700   Not tainted  (5.9.0-rc1+)
>> MSR:  800000000282b033  CR: 24000448  XER: 20040000
>> CFAR: c000000000846d20 IRQMASK: 0
>> GPR00: c000000000403f2c c0000004876e38f0 c0000000012f6f00 ffffffffffffffef
>> GPR04: 0000000000000227 c0000004805ae680 0000000000000000 00000004886f0000
>> GPR08: 0000000000000226 0000000000000003 0000000000000002 fffffffffffffffd
>> GPR12: 0000000088000484 c00000001ec96280 0000000000000000 0000000000000000
>> GPR16: 0000000000000000 0000000000000000 0000000000000004 0000000000000003
>> GPR20: c00000047814ffe0 c0000007ffff7c08 0000000000000010 c0000000013332c8
>> GPR24: 0000000000000000 c0000000011f6cc0 0000000000000000 0000000000000000
>> GPR28: ffffffffffffffef 0000000000000001 0000000150000000 0000000010000000
>> NIP [c000000000403f34] add_memory_resource+0x244/0x340
>> LR [c000000000403f2c] add_memory_resource+0x23c/0x340
>> Call Trace:
>> [c0000004876e38f0] [c000000000403f2c] add_memory_resource+0x23c/0x340 (unreliable)
>> [c0000004876e39c0] [c00000000040408c] __add_memory+0x5c/0xf0
>> [c0000004876e39f0] [c0000000000e2b94] dlpar_add_lmb+0x1b4/0x500
>> [c0000004876e3ad0] [c0000000000e3888] dlpar_memory+0x1f8/0xb80
>> [c0000004876e3b60] [c0000000000dc0d0] handle_dlpar_errorlog+0xc0/0x190
>> [c0000004876e3bd0] [c0000000000dc398] dlpar_store+0x198/0x4a0
>> [c0000004876e3c90] [c00000000072e630] kobj_attr_store+0x30/0x50
>> [c0000004876e3cb0] [c00000000051f954] sysfs_kf_write+0x64/0x90
>> [c0000004876e3cd0] [c00000000051ee40] kernfs_fop_write+0x1b0/0x290
>> [c0000004876e3d20] [c000000000438dd8] vfs_write+0xe8/0x290
>> [c0000004876e3d70] [c0000000004391ac] ksys_write+0xdc/0x130
>> [c0000004876e3dc0] [c000000000034e40] system_call_exception+0x160/0x270
>> [c0000004876e3e20] [c00000000000d740] system_call_common+0xf0/0x27c
>> Instruction dump:
>> 48442e35 60000000 0b030000 3cbe0001 7fa3eb78 7bc48402 38a5fffe 7ca5fa14
>> 78a58402 48442db1 60000000 7c7c1b78 <0b030000> 7f23cb78 4bda371d 60000000
>> ---[ end trace 562fd6c109cd0fb2 ]---
>
> The BUG_ON on failure is absolutely horrendous. There must be a better
> way to handle a failure like that. The failure means that
> sysfs_create_link_nowarn has failed. Please describe why that is the
> case.
>
>> This patch addresses the root cause by not relying on the system_state
>> value to detect whether the call is due to a hot-plug operation or not. An
>> additional parameter is added to link_mem_sections() to tell the context of
>> the call and this parameter is propagated to register_mem_sect_under_node()
>> through the walk_memory_blocks()'s call.
>
> This looks like a hack to me and it deserves a better explanation. The
> existing code is a hack on its own and it is inconsistent with other
> boot time detection. We are using (system_state < SYSTEM_RUNNING) at other
> places IIRC. Would it help to use the same here as well? Maybe we want to
> wrap that inside a helper (early_memory_init()) and use it at all
> places.

I agree, checking the system_state value looks like a hack.
I'll follow David's proposal and introduce an enum stating whether the node
id check has to be done or not.

The wrapper option seems good to me too, but it doesn't highlight why the
early processing differs from the hot-plug one. Using an enum that explicitly
says whether the node id check is done seems better to me.

>> Fixes: 4fbce633910e ("mm/memory_hotplug.c: make register_mem_sect_under_node() a callback of walk_memory_range()")
>> Signed-off-by: Laurent Dufour
>> Cc: stable@vger.kernel.org
>> Cc: Greg Kroah-Hartman
>> Cc: "Rafael J. Wysocki"
>> Cc: Andrew Morton
>> ---
>>  drivers/base/node.c  | 20 +++++++++++++++-----
>>  include/linux/node.h |  6 +++---
>>  mm/memory_hotplug.c  |  3 ++-
>>  3 files changed, 20 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>> index 508b80f6329b..27f828eeb531 100644
>> --- a/drivers/base/node.c
>> +++ b/drivers/base/node.c
>> @@ -762,14 +762,19 @@ static int __ref get_nid_for_pfn(unsigned long pfn)
>>  }
>>
>>  /* register memory section under specified node if it spans that node */
>> +struct rmsun_args {
>> +	int nid;
>> +	bool hotadd;
>> +};
>>  static int register_mem_sect_under_node(struct memory_block *mem_blk,
>> -					 void *arg)
>> +					void *args)
>>  {
>>  	unsigned long memory_block_pfns = memory_block_size_bytes() / PAGE_SIZE;
>>  	unsigned long start_pfn = section_nr_to_pfn(mem_blk->start_section_nr);
>>  	unsigned long end_pfn = start_pfn + memory_block_pfns - 1;
>> -	int ret, nid = *(int *)arg;
>> +	int ret, nid = ((struct rmsun_args *)args)->nid;
>>  	unsigned long pfn;
>> +	bool hotadd = ((struct rmsun_args *)args)->hotadd;
>>
>>  	for (pfn = start_pfn; pfn <= end_pfn; pfn++) {
>>  		int page_nid;
>> @@ -789,7 +794,7 @@ static int register_mem_sect_under_node(struct memory_block *mem_blk,
>>  		 * case, during hotplug we know that all pages in the memory
>>  		 * block belong to the same node.
>>  		 */
>> -		if (system_state == SYSTEM_BOOTING) {
>> +		if (!hotadd) {
>>  			page_nid = get_nid_for_pfn(pfn);
>>  			if (page_nid < 0)
>>  				continue;
>> @@ -832,10 +837,15 @@ void unregister_memory_block_under_nodes(struct memory_block *mem_blk)
>>  			  kobject_name(&node_devices[mem_blk->nid]->dev.kobj));
>>  }
>>
>> -int link_mem_sections(int nid, unsigned long start_pfn, unsigned long end_pfn)
>> +int link_mem_sections(int nid, unsigned long start_pfn, unsigned long end_pfn,
>> +		      bool hotadd)
>>  {
>> +	struct rmsun_args args;
>> +
>> +	args.nid = nid;
>> +	args.hotadd = hotadd;
>>  	return walk_memory_blocks(PFN_PHYS(start_pfn),
>> -				  PFN_PHYS(end_pfn - start_pfn), (void *)&nid,
>> +				  PFN_PHYS(end_pfn - start_pfn), (void *)&args,
>>  				  register_mem_sect_under_node);
>>  }
>>
>> diff --git a/include/linux/node.h b/include/linux/node.h
>> index 4866f32a02d8..6df9a4548650 100644
>> --- a/include/linux/node.h
>> +++ b/include/linux/node.h
>> @@ -100,10 +100,10 @@ typedef void (*node_registration_func_t)(struct node *);
>>
>>  #if defined(CONFIG_MEMORY_HOTPLUG_SPARSE) && defined(CONFIG_NUMA)
>>  extern int link_mem_sections(int nid, unsigned long start_pfn,
>> -			     unsigned long end_pfn);
>> +			     unsigned long end_pfn, bool hotadd);
>>  #else
>>  static inline int link_mem_sections(int nid, unsigned long start_pfn,
>> -				    unsigned long end_pfn)
>> +				    unsigned long end_pfn, bool hotadd)
>>  {
>>  	return 0;
>>  }
>> @@ -128,7 +128,7 @@ static inline int register_one_node(int nid)
>>  		if (error)
>>  			return error;
>>  		/* link memory sections under this node */
>> -		error = link_mem_sections(nid, start_pfn, end_pfn);
>> +		error = link_mem_sections(nid, start_pfn, end_pfn, false);
>>  	}
>>
>>  	return error;
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index e9d5ab5d3ca0..28028db8364a 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -1080,7 +1080,8 @@ int __ref add_memory_resource(int nid, struct resource *res)
>>  	}
>>
>>  	/* link memory sections under this node.*/
>> -	ret = link_mem_sections(nid, PFN_DOWN(start), PFN_UP(start + size - 1));
>> +	ret = link_mem_sections(nid, PFN_DOWN(start), PFN_UP(start + size - 1),
>> +				true);
>>  	BUG_ON(ret);
>>
>>  	/* create new memmap entry */
>> --
>> 2.28.0