Subject: Re: [PATCH] mm/hugetlb: Fix hugepage allocation for interleaved memory nodes
From: Muchun Song
Date: Tue, 8 Oct 2024 15:59:22 +0800
To: "Ritesh Harjani (IBM)"
Cc: linux-mm@kvack.org, Donet Tom, Gang Li, Daniel Jordan, David Rientjes
Message-Id: <73DCC4CB-DB4F-4E66-B208-A515A6A4DE96@linux.dev>
In-Reply-To: <87bjzvycoy.fsf@gmail.com>
References: <7e0ca1e8acd7dd5c1fe7cbb252de4eb55a8e851b.1727984881.git.ritesh.list@gmail.com> <87bjzvycoy.fsf@gmail.com>

> On Oct 8, 2024, at 02:45, Ritesh Harjani (IBM) wrote:
>
> "Ritesh Harjani (IBM)" writes:
>
>> gather_bootmem_prealloc() function assumes the start nid as
>> 0 and size as num_node_state(N_MEMORY). Since memory attached numa nodes
>> can be interleaved in any fashion, hence ensure current code checks for all
>> online numa nodes as part of gather_bootmem_prealloc_parallel().
>> Let's still make max_threads as N_MEMORY so that we can possibly have
>> a uniform distribution of online nodes among these parallel threads.
>>
>> e.g. qemu cmdline
>> ========================
>> numa_cmd="-numa node,nodeid=1,memdev=mem1,cpus=2-3 -numa node,nodeid=0,cpus=0-1 -numa dist,src=0,dst=1,val=20"
>> mem_cmd="-object memory-backend-ram,id=mem1,size=16G"
>>
>
> I think this patch still might not work for below numa config. Because
> in this we have an offline node-0, node-1 with only cpus and node-2 with
> cpus and memory.
>
> numa_cmd="-numa node,nodeid=2,memdev=mem1,cpus=2-3 -numa node,nodeid=1,cpus=0-1 -numa node,nodeid=0"
> mem_cmd="-object memory-backend-ram,id=mem1,size=32G"
>
> Maybe N_POSSIBLE will help instead of N_MEMORY in below patch, but let
> me give some thought to this before posting v2.

How about setting .size with nr_node_ids?

Muchun,
Thanks.

>
> -ritesh
>
>
>> w/o this patch for cmdline (default_hugepagesz=1GB hugepagesz=1GB hugepages=2):
>> ==========================
>> ~ # cat /proc/meminfo |grep -i huge
>> AnonHugePages: 0 kB
>> ShmemHugePages: 0 kB
>> FileHugePages: 0 kB
>> HugePages_Total: 0
>> HugePages_Free: 0
>> HugePages_Rsvd: 0
>> HugePages_Surp: 0
>> Hugepagesize: 1048576 kB
>> Hugetlb: 0 kB
>>
>> with this patch for cmdline (default_hugepagesz=1GB hugepagesz=1GB hugepages=2):
>> ===========================
>> ~ # cat /proc/meminfo |grep -i huge
>> AnonHugePages: 0 kB
>> ShmemHugePages: 0 kB
>> FileHugePages: 0 kB
>> HugePages_Total: 2
>> HugePages_Free: 2
>> HugePages_Rsvd: 0
>> HugePages_Surp: 0
>> Hugepagesize: 1048576 kB
>> Hugetlb: 2097152 kB
>>
>> Fixes: b78b27d02930 ("hugetlb: parallelize 1G hugetlb initialization")
>> Signed-off-by: Ritesh Harjani (IBM)
>> Cc: Donet Tom
>> Cc: Gang Li
>> Cc: Daniel Jordan
>> Cc: Muchun Song
>> Cc: David Rientjes
>> Cc: linux-mm@kvack.org
>> ---
>>
>> ==== Additional data ====
>>
>> w/o this patch:
>> ================
>> ~ # dmesg |grep -Ei "numa|node|huge"
>> [ 0.000000][ T0] numa: Partition configured for 2 NUMA nodes.
>> [ 0.000000][ T0] memory[0x0] [0x0000000000000000-0x00000003ffffffff], 0x0000000400000000 bytes on node 1 flags: 0x0
>> [ 0.000000][ T0] numa: NODE_DATA [mem 0x3dde50800-0x3dde57fff]
>> [ 0.000000][ T0] numa: NODE_DATA(0) on node 1
>> [ 0.000000][ T0] numa: NODE_DATA [mem 0x3dde49000-0x3dde507ff]
>> [ 0.000000][ T0] Movable zone start for each node
>> [ 0.000000][ T0] Early memory node ranges
>> [ 0.000000][ T0] node 1: [mem 0x0000000000000000-0x00000003ffffffff]
>> [ 0.000000][ T0] Initmem setup node 0 as memoryless
>> [ 0.000000][ T0] Initmem setup node 1 [mem 0x0000000000000000-0x00000003ffffffff]
>> [ 0.000000][ T0] Kernel command line: root=/dev/vda1 console=ttyS0 nokaslr slub_max_order=0 norandmaps memblock=debug noreboot default_hugepagesz=1GB hugepagesz=1GB hugepages=2
>> [ 0.000000][ T0] memblock_alloc_try_nid_raw: 1073741824 bytes align=0x40000000 nid=1 from=0x0000000000000000 max_addr=0x0000000000000000 __alloc_bootmem_huge_page+0x1ac/0x2c8
>> [ 0.000000][ T0] memblock_alloc_try_nid_raw: 1073741824 bytes align=0x40000000 nid=1 from=0x0000000000000000 max_addr=0x0000000000000000 __alloc_bootmem_huge_page+0x1ac/0x2c8
>> [ 0.000000][ T0] Inode-cache hash table entries: 1048576 (order: 7, 8388608 bytes, linear)
>> [ 0.000000][ T0] Fallback order for Node 0: 1
>> [ 0.000000][ T0] Fallback order for Node 1: 1
>> [ 0.000000][ T0] SLUB: HWalign=128, Order=0-0, MinObjects=0, CPUs=4, Nodes=2
>> [ 0.044978][ T0] mempolicy: Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl
>> [ 0.209159][ T1] Timer migration: 2 hierarchy levels; 8 children per group; 1 crossnode level
>> [ 0.414281][ T1] smp: Brought up 2 nodes, 4 CPUs
>> [ 0.415268][ T1] numa: Node 0 CPUs: 0-1
>> [ 0.416030][ T1] numa: Node 1 CPUs: 2-3
>> [ 13.644459][ T41] node 1 deferred pages initialised in 12040ms
>> [ 14.241701][ T1] HugeTLB: registered 1.00 GiB page size, pre-allocated 0 pages
>> [ 14.242781][ T1] HugeTLB: 0 KiB vmemmap can be freed for a 1.00 GiB page
>> [ 14.243806][ T1] HugeTLB: registered 2.00 MiB page size, pre-allocated 0 pages
>> [ 14.244753][ T1] HugeTLB: 0 KiB vmemmap can be freed for a 2.00 MiB page
>> [ 16.490452][ T1] pci_bus 0000:00: Unknown NUMA node; performance will be reduced
>> [ 27.804266][ T1] Demotion targets for Node 1: null
>>
>> with this patch:
>> =================
>> ~ # dmesg |grep -Ei "numa|node|huge"
>> [ 0.000000][ T0] numa: Partition configured for 2 NUMA nodes.
>> [ 0.000000][ T0] memory[0x0] [0x0000000000000000-0x00000003ffffffff], 0x0000000400000000 bytes on node 1 flags: 0x0
>> [ 0.000000][ T0] numa: NODE_DATA [mem 0x3dde50800-0x3dde57fff]
>> [ 0.000000][ T0] numa: NODE_DATA(0) on node 1
>> [ 0.000000][ T0] numa: NODE_DATA [mem 0x3dde49000-0x3dde507ff]
>> [ 0.000000][ T0] Movable zone start for each node
>> [ 0.000000][ T0] Early memory node ranges
>> [ 0.000000][ T0] node 1: [mem 0x0000000000000000-0x00000003ffffffff]
>> [ 0.000000][ T0] Initmem setup node 0 as memoryless
>> [ 0.000000][ T0] Initmem setup node 1 [mem 0x0000000000000000-0x00000003ffffffff]
>> [ 0.000000][ T0] Kernel command line: root=/dev/vda1 console=ttyS0 nokaslr slub_max_order=0 norandmaps memblock=debug noreboot default_hugepagesz=1GB hugepagesz=1GB hugepages=2
>> [ 0.000000][ T0] memblock_alloc_try_nid_raw: 1073741824 bytes align=0x40000000 nid=1 from=0x0000000000000000 max_addr=0x0000000000000000 __alloc_bootmem_huge_page+0x1ac/0x2c8
>> [ 0.000000][ T0] memblock_alloc_try_nid_raw: 1073741824 bytes align=0x40000000 nid=1 from=0x0000000000000000 max_addr=0x0000000000000000 __alloc_bootmem_huge_page+0x1ac/0x2c8
>> [ 0.000000][ T0] Inode-cache hash table entries: 1048576 (order: 7, 8388608 bytes, linear)
>> [ 0.000000][ T0] Fallback order for Node 0: 1
>> [ 0.000000][ T0] Fallback order for Node 1: 1
>> [ 0.000000][ T0] SLUB: HWalign=128, Order=0-0, MinObjects=0, CPUs=4, Nodes=2
>> [ 0.048825][ T0] mempolicy: Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl
>> [ 0.204211][ T1] Timer migration: 2 hierarchy levels; 8 children per group; 1 crossnode level
>> [ 0.378821][ T1] smp: Brought up 2 nodes, 4 CPUs
>> [ 0.379642][ T1] numa: Node 0 CPUs: 0-1
>> [ 0.380302][ T1] numa: Node 1 CPUs: 2-3
>> [ 11.577527][ T41] node 1 deferred pages initialised in 10250ms
>> [ 12.557856][ T1] HugeTLB: registered 1.00 GiB page size, pre-allocated 2 pages
>> [ 12.574197][ T1] HugeTLB: 0 KiB vmemmap can be freed for a 1.00 GiB page
>> [ 12.576339][ T1] HugeTLB: registered 2.00 MiB page size, pre-allocated 0 pages
>> [ 12.577262][ T1] HugeTLB: 0 KiB vmemmap can be freed for a 2.00 MiB page
>> [ 15.102445][ T1] pci_bus 0000:00: Unknown NUMA node; performance will be reduced
>> [ 26.173888][ T1] Demotion targets for Node 1: null
>>
>> mm/hugetlb.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 9a3a6e2dee97..60f45314c151 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -3443,7 +3443,7 @@ static void __init gather_bootmem_prealloc(void)
>>  		.thread_fn = gather_bootmem_prealloc_parallel,
>>  		.fn_arg = NULL,
>>  		.start = 0,
>> -		.size = num_node_state(N_MEMORY),
>> +		.size = num_node_state(N_ONLINE),
>>  		.align = 1,
>>  		.min_chunk = 1,
>>  		.max_threads = num_node_state(N_MEMORY),
>> --
>> 2.39.5
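
For reference, the node-range question in this thread can be modelled in userspace. The following is only a sketch and not kernel code: struct node_mask, MAX_NODES, count_nodes(), node_id_limit() and memory_nodes_visited() are hypothetical names invented for illustration. count_nodes() stands in for num_node_state() and node_id_limit() for nr_node_ids (assumed here to be the highest possible node id plus one). It shows how many memory nodes a scan over node ids [start, start + size) would reach for a given .size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_NODES 8 /* hypothetical cap for this sketch, not a kernel constant */

/* Hypothetical stand-in for a node state mask such as N_MEMORY or N_ONLINE. */
struct node_mask {
	bool set[MAX_NODES];
};

/* Number of nodes set in the mask (plays the role of num_node_state()). */
static size_t count_nodes(const struct node_mask *mask)
{
	size_t n = 0;

	for (size_t nid = 0; nid < MAX_NODES; nid++)
		if (mask->set[nid])
			n++;
	return n;
}

/* Highest set node id plus one (plays the role of nr_node_ids). */
static size_t node_id_limit(const struct node_mask *possible)
{
	size_t limit = 0;

	for (size_t nid = 0; nid < MAX_NODES; nid++)
		if (possible->set[nid])
			limit = nid + 1;
	return limit;
}

/* How many memory nodes a scan over node ids [start, start + size) reaches. */
static size_t memory_nodes_visited(const struct node_mask *memory,
				   size_t start, size_t size)
{
	size_t visited = 0;

	assert(start <= MAX_NODES && size <= MAX_NODES - start);
	for (size_t nid = start; nid < start + size; nid++)
		if (memory->set[nid])
			visited++;
	return visited;
}
```

With the second config above (node 0 offline, node 1 CPU-only, node 2 with memory), a size taken from either the memory-node count (1) or the online-node count (2) stops before node id 2, so memory_nodes_visited() returns 0. A size of node_id_limit() over the possible nodes (3) reaches node 2 and returns 1.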