Message-ID: <504a2d9f-0792-474b-bf64-44b58b9731db@huawei.com>
Date: Thu, 2 Nov 2023 10:55:17 +0800
Subject: Re: [PATCH] mm: cma: report correct node id
From: Kefeng Wang
To: Nathan Chancellor
CC: Andrew Morton, Christoph Hellwig, Palmer Dabbelt, Conor Dooley, Atish Patra
References: <20231019013253.2792048-1-wangkefeng.wang@huawei.com>
 <20231025163703.GA2440148@dev-arch.thelio-3990X>
 <47437c2b-5946-41c6-ad1b-cc03329eb230@huawei.com>
 <20231101172923.GB1368360@dev-arch.thelio-3990X>
In-Reply-To: <20231101172923.GB1368360@dev-arch.thelio-3990X>

On 2023/11/2 1:29, Nathan Chancellor wrote:
> Hi Kefeng,
>
> On Mon, Oct 30, 2023 at 05:34:32PM +0800, Kefeng Wang wrote:
>> On 2023/10/26 0:37, Nathan Chancellor wrote:
>>> On Thu, Oct 19, 2023 at 09:32:53AM +0800, Kefeng Wang wrote:
>>>> Use early_pfn_to_nid() to get correct node id from base instead of
>>>> the default NUMA_NO_NODE in
>>>> cma_declare_contiguous_nid().
>>>>
>>>> Signed-off-by: Kefeng Wang
>>>> ---
>>>>  mm/cma.c | 3 +++
>>>>  1 file changed, 3 insertions(+)
>>>>
>>>> diff --git a/mm/cma.c b/mm/cma.c
>>>> index 2b2494fd6b59..97c27e5fe1a2 100644
>>>> --- a/mm/cma.c
>>>> +++ b/mm/cma.c
>>>> @@ -375,6 +375,9 @@ int __init cma_declare_contiguous_nid(phys_addr_t base,
>>>>          if (ret)
>>>>                  goto free_mem;
>>>> +        if (nid == NUMA_NO_NODE)
>>>> +                nid = early_pfn_to_nid(PHYS_PFN(base));
>>>> +
>>>>          pr_info("Reserved %ld MiB at %pa on node %d\n", (unsigned long)size / SZ_1M,
>>>>                  &base, nid);
>>>>          return 0;
>>>> --
>>>> 2.27.0
>>>>
>>>
>>> I bisected a RISC-V boot failure in QEMU to this change in -next. It
>>> happens with OpenSUSE's RISC-V configuration [1], which I was able to
>>> narrow down to the follow configurations on top of defconfig:
>>>
>>
>> I think the root cause is the bad node info of memory address, meanwhile,
>> the riscv's cma reserve is before numa init, see the following log,
>>
>> [    0.000000] cma: Reserved 16 MiB at 0x000000009f000000 on node 4
>> [    0.000000] NUMA: Faking a node at [mem 0x0000000080000000-0x000000009fffffff]
>> [    0.000000] NUMA: NODE_DATA [mem 0x9eff2780-0x9eff3fff]
>> [    0.000000] NUMA: NODE_DATA(0) on node 4 // should be node 0
>> [    0.000000] [ff1c000002000000-ff1c000002000fff] potential offnode page_structs
>>
>> additional, early_pfn_to_nid will cache the recent lookups of pfn-to-nid,
>> which
>> led to the next early_pfn_to_nid get the cache nid, not the new nid(changed
>> by numa init),
>>
>> setup_arch
>>   paging_init
>>     dma_contiguous_reserve
>>       cma_declare_contiguous_nid  // 9f000000 node 4
>>         early_pfn_to_nid          // 1. lookup memblk, pfn=9f000, nid=4 cached
>>   misc_mem_init
>>     arch_numa_init
>>       numa_init
>>         dummy_numa_init
>>           numa_add_memblk         // 2. setup new nid of memblk
>>         numa_register_nodes
>>           setup_node_data
>>             early_pfn_to_nid      // 3. *but still use cached pfn,nid*
>> mm_core_init
>>   mem_init
>>     memblock_free_all
>>       __free_pages_core           // 4. check page and find bad page
>>
>> Firstly, 9f000000 on nid=4 should be fixed in firmware(I don't know where
>> store this infomation), secondly, if we want to fix it or avoid
>
> I believe the firmware for QEMU is just OpenSBI but that is about all I
> know, I am not a RISC-V developer.
>
> I've explicitly added some RISC-V folks, the start of the thread is
> available at
> https://lore.kernel.org/20231025163703.GA2440148@dev-arch.thelio-3990X/.

Thanks. Physical address 0x9f000000 being reported on nid=4 is strange
and could be fixed in firmware, but in any case I think the following
change is needed. Once early_pfn_to_nid() has been called, it caches the
pfn-to-nid mapping. If that mapping is later updated, e.g. by
dummy_numa_init() or any other numa_add_memblk() caller, the cached
pfn-to-nid must be reset; otherwise later early_pfn_to_nid() calls will
return the stale NUMA node id.

>
> Cheers,
> Nathan
>
>> similar issue happened in other scene,a reset function to cleanup the
>> cached pfn-nid should be added, I try following diff, it should work.
>>
>> diff --git a/drivers/base/arch_numa.c b/drivers/base/arch_numa.c
>> index eaa31e567d1e..24100e45971c 100644
>> --- a/drivers/base/arch_numa.c
>> +++ b/drivers/base/arch_numa.c
>> @@ -210,6 +210,7 @@ int __init numa_add_memblk(int nid, u64 start, u64 end)
>>          }
>>
>>          node_set(nid, numa_nodes_parsed);
>> +        early_pfn_reset_nid();
>>          return ret;
>>   }
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 418d26608ece..f20a8da22b35 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -3173,9 +3173,11 @@ static inline int early_pfn_to_nid(unsigned long pfn)
>>   {
>>          return 0;
>>   }
>> +static inline void early_pfn_reset_nid(void) {}
>>   #else
>>   /* please see mm/page_alloc.c */
>>   extern int __meminit early_pfn_to_nid(unsigned long pfn);
>> +extern void __meminit early_pfn_reset_nid(void);
>>   #endif
>>
>>   extern void set_dma_reserve(unsigned long new_dma_reserve);
>> diff --git a/mm/mm_init.c b/mm/mm_init.c
>> index 077bfe393b5e..fb7751b233c4 100644
>> --- a/mm/mm_init.c
>> +++ b/mm/mm_init.c
>> @@ -586,6 +586,7 @@ struct mminit_pfnnid_cache {
>>   };
>>
>>   static struct mminit_pfnnid_cache early_pfnnid_cache __meminitdata;
>> +static DEFINE_SPINLOCK(early_pfn_lock);
>>
>>   /*
>>    * Required by SPARSEMEM. Given a PFN, return what node the PFN is on.
>> @@ -611,7 +612,6 @@ static int __meminit __early_pfn_to_nid(unsigned long pfn,
>>
>>   int __meminit early_pfn_to_nid(unsigned long pfn)
>>   {
>> -        static DEFINE_SPINLOCK(early_pfn_lock);
>>          int nid;
>>
>>          spin_lock(&early_pfn_lock);
>> @@ -623,6 +623,15 @@ int __meminit early_pfn_to_nid(unsigned long pfn)
>>          return nid;
>>   }
>>
>> +void __meminit early_pfn_reset_nid(void)
>> +{
>> +        spin_lock(&early_pfn_lock);
>> +        early_pfnnid_cache.last_start = 0;
>> +        early_pfnnid_cache.last_end = 0;
>> +        early_pfnnid_cache.last_nid = 0;
>> +        spin_unlock(&early_pfn_lock);
>> +}
>> +
>>   int hashdist = HASHDIST_DEFAULT;
>>
>>   static int __init set_hashdist(char *str)
>>
>>
>>
>>>
>>>
>>>
>>> Without CONFIG_ACPI_SPCR_TABLE=y, there is a visible crash.
>>>
>>> [    0.000000] Linux version 6.6.0-rc7-next-20231025 (nathan@dev-fedora.c3-large-arm64) (riscv64-linux-gcc (GCC) 13.2.0, GNU ld (GNU Binutils) 2.41) #1 SMP Wed Oct 25 16:14:59 UTC 2023
>>> ...
>>> [    0.000000] mem auto-init: stack:all(zero), heap alloc:off, heap free:off
>>> [    0.000000] page:ff1c000002200000 is uninitialized and poisoned
>>> [    0.000000] page dumped because: VM_BUG_ON_PAGE(1 && PageCompound(page))
>>> [    0.000000] ------------[ cut here ]------------
>>> [    0.000000] kernel BUG at include/linux/page-flags.h:493!
>>> [    0.000000] Kernel BUG [#1]
>>> [    0.000000] Modules linked in:
>>> [    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 6.6.0-rc7-next-20231025 #1
>>> [    0.000000] Hardware name: riscv-virtio,qemu (DT)
>>> [    0.000000] epc : __free_pages_core+0x78/0x126
>>> [    0.000000]  ra : __free_pages_core+0x78/0x126
>>> [    0.000000] epc : ffffffff8018dd8e ra : ffffffff8018dd8e sp : ffffffff81403d40
>>> [    0.000000]  gp : ffffffff815013a0 tp : ffffffff8140db00 t0 : 6d75642065676170
>>> [    0.000000]  t1 : 0000000000000070 t2 : 706d756420656761 s0 : ffffffff81403d50
>>> [    0.000000]  s1 : 0000000000000004 a0 : 000000000000003c a1 : ffffffff814866a8
>>> [    0.000000]  a2 : 0000000000000000 a3 : 0000000000000001 a4 : 0000000000000000
>>> [    0.000000]  a5 : 0000000000000000 a6 : 0000000000000008 a7 : 0000000000000038
>>> [    0.000000]  s2 : 0000000000088000 s3 : ff1c000002200000 s4 : 0000000000000009
>>> [    0.000000]  s5 : 00000000ffffffff s6 : 0000000000081800 s7 : 0000000000088200
>>> [    0.000000]  s8 : 00000000000001c0 s9 : 0040000000000000 s10: ffffffff81500bdd
>>> [    0.000000]  s11: ffffffff81500bdc t3 : ffffffff81515aa7 t4 : ffffffff81515aa7
>>> [    0.000000]  t5 : ffffffff81515aa8 t6 : ffffffff81403b58
>>> [    0.000000] status: 0000000200000100 badaddr: 0000000000000000 cause: 0000000000000003
>>> [    0.000000] [] __free_pages_core+0x78/0x126
>>> [    0.000000] [] memblock_free_pages+0x52/0x62
>>> [    0.000000] [] memblock_free_all+0x1fc/0x27e
>>> [    0.000000] [] mem_init+0x34/0x22c
>>> [    0.000000] [] mm_core_init+0x116/0x2d0
>>> [    0.000000] [] start_kernel+0x3c6/0x742
>>> [    0.000000] Code: 0405 8399 8b85 d7f1 9597 00e2 8593 2ae5 90ef e5dd (9002) 6597
>>> [    0.000000] ---[ end trace 0000000000000000 ]---
>>> [    0.000000] Kernel panic - not syncing: Fatal exception in interrupt
>>>
>>> The rootfs is available at [2] if necessary. If there is any more
>>> information I can provide or patches I can test, I am more than happy to
>>> do so.
>>>
>>> [1]: https://github.com/openSUSE/kernel-source/raw/master/config/riscv64/default
>>> [2]: https://github.com/ClangBuiltLinux/boot-utils/releases
>>>
>>> Cheers,
>>> Nathan
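
To make the stale-cache sequence described in this mail concrete, here is a small userspace C model. It is only a sketch, not kernel code: the model_* names, the single hard-coded region, and the use_reset switch are made up for illustration, and the lookup is a plain array scan rather than the real memblock search. Only the cache fields (last_start, last_end, last_nid) and the reset logic follow the quoted diff.

```c
#include <stddef.h>

/* Hypothetical stand-in for a memblock region: pfn range [start, end) on a node. */
struct model_region {
	unsigned long start, end;
	int nid;
};

/* One region for 0x80000000-0x9fffffff, i.e. pfns 0x80000..0x9ffff. */
static struct model_region model_regions[] = {
	{ 0x80000UL, 0xa0000UL, 4 },
};

/* Same shape as the last-lookup cache in the quoted diff. */
static struct {
	unsigned long last_start, last_end;
	int last_nid;
} model_cache;

/* Modelled on early_pfn_reset_nid() from the quoted diff (no locking here). */
static void model_early_pfn_reset_nid(void)
{
	model_cache.last_start = 0;
	model_cache.last_end = 0;
	model_cache.last_nid = 0;
}

/* Simplified early_pfn_to_nid(): serve from the cache first, then scan. */
static int model_early_pfn_to_nid(unsigned long pfn)
{
	size_t i;

	if (model_cache.last_start <= pfn && pfn < model_cache.last_end)
		return model_cache.last_nid;	/* may be stale */

	for (i = 0; i < sizeof(model_regions) / sizeof(model_regions[0]); i++) {
		if (model_regions[i].start <= pfn && pfn < model_regions[i].end) {
			model_cache.last_start = model_regions[i].start;
			model_cache.last_end = model_regions[i].end;
			model_cache.last_nid = model_regions[i].nid;
			return model_regions[i].nid;
		}
	}
	return -1;	/* pfn not covered by any region */
}

/* Simplified numa_add_memblk(): retag the region, optionally drop the cache. */
static void model_numa_add_memblk(int nid, int use_reset)
{
	model_regions[0].nid = nid;
	if (use_reset)
		model_early_pfn_reset_nid();
}

/* Replays steps 1-3 of the call chain; returns the nid seen in step 3. */
int model_demo(int use_reset)
{
	model_regions[0].nid = 4;		/* bogus early node id */
	model_early_pfn_reset_nid();

	(void)model_early_pfn_to_nid(0x9f000UL);	/* 1. caches nid=4 */
	model_numa_add_memblk(0, use_reset);		/* 2. region is now node 0 */
	return model_early_pfn_to_nid(0x9f000UL);	/* 3. lookup after update */
}
```

In this model, model_demo(0) returns 4 (the stale cached node), while model_demo(1) returns 0 because the cache is cleared when the region is retagged.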