* [PATCH] mm/hugetlb: Fix hugepage allocation for interleaved memory nodes
@ 2024-10-03 20:00 Ritesh Harjani (IBM)
2024-10-07 18:45 ` Ritesh Harjani
0 siblings, 1 reply; 4+ messages in thread
From: Ritesh Harjani (IBM) @ 2024-10-03 20:00 UTC (permalink / raw)
To: linux-mm
Cc: Ritesh Harjani (IBM),
Donet Tom, Gang Li, Daniel Jordan, Muchun Song, David Rientjes
gather_bootmem_prealloc() assumes a start nid of 0 and a size of
num_node_state(N_MEMORY). Since NUMA nodes with memory attached can be
interleaved in any fashion, ensure gather_bootmem_prealloc_parallel()
checks all online NUMA nodes. Keep max_threads at N_MEMORY so that the
online nodes can still be distributed roughly uniformly among the
parallel threads.
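To make the failure mode concrete, below is a simplified sketch of the node
walk (not the exact kernel code, though the per-node helper name follows the
kernel's gather_bootmem_prealloc_node()). With the interleaved config shown
next, num_node_state(N_MEMORY) is 1, so only nid 0 is ever visited even
though the bootmem huge pages sit on nid 1:

	/*
	 * Simplified sketch: padata hands gather_bootmem_prealloc_parallel()
	 * chunks of the node-id range [start, start + size).  With start = 0
	 * and size = num_node_state(N_MEMORY) == 1 for the config below,
	 * the walk ends before it ever reaches nid 1.
	 */
	for (nid = start; nid < start + size; nid++)
		gather_bootmem_prealloc_node(nid);	/* never sees nid 1 */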
e.g. qemu cmdline
========================
numa_cmd="-numa node,nodeid=1,memdev=mem1,cpus=2-3 -numa node,nodeid=0,cpus=0-1 -numa dist,src=0,dst=1,val=20"
mem_cmd="-object memory-backend-ram,id=mem1,size=16G"
w/o this patch for cmdline (default_hugepagesz=1GB hugepagesz=1GB hugepages=2):
==========================
~ # cat /proc/meminfo |grep -i huge
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
FileHugePages: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 1048576 kB
Hugetlb: 0 kB
with this patch for cmdline (default_hugepagesz=1GB hugepagesz=1GB hugepages=2):
===========================
~ # cat /proc/meminfo |grep -i huge
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
FileHugePages: 0 kB
HugePages_Total: 2
HugePages_Free: 2
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 1048576 kB
Hugetlb: 2097152 kB
Fixes: b78b27d02930 ("hugetlb: parallelize 1G hugetlb initialization")
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Cc: Donet Tom <donettom@linux.ibm.com>
Cc: Gang Li <gang.li@linux.dev>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: David Rientjes <rientjes@google.com>
Cc: linux-mm@kvack.org
---
==== Additional data ====
w/o this patch:
================
~ # dmesg |grep -Ei "numa|node|huge"
[ 0.000000][ T0] numa: Partition configured for 2 NUMA nodes.
[ 0.000000][ T0] memory[0x0] [0x0000000000000000-0x00000003ffffffff], 0x0000000400000000 bytes on node 1 flags: 0x0
[ 0.000000][ T0] numa: NODE_DATA [mem 0x3dde50800-0x3dde57fff]
[ 0.000000][ T0] numa: NODE_DATA(0) on node 1
[ 0.000000][ T0] numa: NODE_DATA [mem 0x3dde49000-0x3dde507ff]
[ 0.000000][ T0] Movable zone start for each node
[ 0.000000][ T0] Early memory node ranges
[ 0.000000][ T0] node 1: [mem 0x0000000000000000-0x00000003ffffffff]
[ 0.000000][ T0] Initmem setup node 0 as memoryless
[ 0.000000][ T0] Initmem setup node 1 [mem 0x0000000000000000-0x00000003ffffffff]
[ 0.000000][ T0] Kernel command line: root=/dev/vda1 console=ttyS0 nokaslr slub_max_order=0 norandmaps memblock=debug noreboot default_hugepagesz=1GB hugepagesz=1GB hugepages=2
[ 0.000000][ T0] memblock_alloc_try_nid_raw: 1073741824 bytes align=0x40000000 nid=1 from=0x0000000000000000 max_addr=0x0000000000000000 __alloc_bootmem_huge_page+0x1ac/0x2c8
[ 0.000000][ T0] memblock_alloc_try_nid_raw: 1073741824 bytes align=0x40000000 nid=1 from=0x0000000000000000 max_addr=0x0000000000000000 __alloc_bootmem_huge_page+0x1ac/0x2c8
[ 0.000000][ T0] Inode-cache hash table entries: 1048576 (order: 7, 8388608 bytes, linear)
[ 0.000000][ T0] Fallback order for Node 0: 1
[ 0.000000][ T0] Fallback order for Node 1: 1
[ 0.000000][ T0] SLUB: HWalign=128, Order=0-0, MinObjects=0, CPUs=4, Nodes=2
[ 0.044978][ T0] mempolicy: Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl
[ 0.209159][ T1] Timer migration: 2 hierarchy levels; 8 children per group; 1 crossnode level
[ 0.414281][ T1] smp: Brought up 2 nodes, 4 CPUs
[ 0.415268][ T1] numa: Node 0 CPUs: 0-1
[ 0.416030][ T1] numa: Node 1 CPUs: 2-3
[ 13.644459][ T41] node 1 deferred pages initialised in 12040ms
[ 14.241701][ T1] HugeTLB: registered 1.00 GiB page size, pre-allocated 0 pages
[ 14.242781][ T1] HugeTLB: 0 KiB vmemmap can be freed for a 1.00 GiB page
[ 14.243806][ T1] HugeTLB: registered 2.00 MiB page size, pre-allocated 0 pages
[ 14.244753][ T1] HugeTLB: 0 KiB vmemmap can be freed for a 2.00 MiB page
[ 16.490452][ T1] pci_bus 0000:00: Unknown NUMA node; performance will be reduced
[ 27.804266][ T1] Demotion targets for Node 1: null
with this patch:
=================
~ # dmesg |grep -Ei "numa|node|huge"
[ 0.000000][ T0] numa: Partition configured for 2 NUMA nodes.
[ 0.000000][ T0] memory[0x0] [0x0000000000000000-0x00000003ffffffff], 0x0000000400000000 bytes on node 1 flags: 0x0
[ 0.000000][ T0] numa: NODE_DATA [mem 0x3dde50800-0x3dde57fff]
[ 0.000000][ T0] numa: NODE_DATA(0) on node 1
[ 0.000000][ T0] numa: NODE_DATA [mem 0x3dde49000-0x3dde507ff]
[ 0.000000][ T0] Movable zone start for each node
[ 0.000000][ T0] Early memory node ranges
[ 0.000000][ T0] node 1: [mem 0x0000000000000000-0x00000003ffffffff]
[ 0.000000][ T0] Initmem setup node 0 as memoryless
[ 0.000000][ T0] Initmem setup node 1 [mem 0x0000000000000000-0x00000003ffffffff]
[ 0.000000][ T0] Kernel command line: root=/dev/vda1 console=ttyS0 nokaslr slub_max_order=0 norandmaps memblock=debug noreboot default_hugepagesz=1GB hugepagesz=1GB hugepages=2
[ 0.000000][ T0] memblock_alloc_try_nid_raw: 1073741824 bytes align=0x40000000 nid=1 from=0x0000000000000000 max_addr=0x0000000000000000 __alloc_bootmem_huge_page+0x1ac/0x2c8
[ 0.000000][ T0] memblock_alloc_try_nid_raw: 1073741824 bytes align=0x40000000 nid=1 from=0x0000000000000000 max_addr=0x0000000000000000 __alloc_bootmem_huge_page+0x1ac/0x2c8
[ 0.000000][ T0] Inode-cache hash table entries: 1048576 (order: 7, 8388608 bytes, linear)
[ 0.000000][ T0] Fallback order for Node 0: 1
[ 0.000000][ T0] Fallback order for Node 1: 1
[ 0.000000][ T0] SLUB: HWalign=128, Order=0-0, MinObjects=0, CPUs=4, Nodes=2
[ 0.048825][ T0] mempolicy: Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl
[ 0.204211][ T1] Timer migration: 2 hierarchy levels; 8 children per group; 1 crossnode level
[ 0.378821][ T1] smp: Brought up 2 nodes, 4 CPUs
[ 0.379642][ T1] numa: Node 0 CPUs: 0-1
[ 0.380302][ T1] numa: Node 1 CPUs: 2-3
[ 11.577527][ T41] node 1 deferred pages initialised in 10250ms
[ 12.557856][ T1] HugeTLB: registered 1.00 GiB page size, pre-allocated 2 pages
[ 12.574197][ T1] HugeTLB: 0 KiB vmemmap can be freed for a 1.00 GiB page
[ 12.576339][ T1] HugeTLB: registered 2.00 MiB page size, pre-allocated 0 pages
[ 12.577262][ T1] HugeTLB: 0 KiB vmemmap can be freed for a 2.00 MiB page
[ 15.102445][ T1] pci_bus 0000:00: Unknown NUMA node; performance will be reduced
[ 26.173888][ T1] Demotion targets for Node 1: null
mm/hugetlb.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9a3a6e2dee97..60f45314c151 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3443,7 +3443,7 @@ static void __init gather_bootmem_prealloc(void)
.thread_fn = gather_bootmem_prealloc_parallel,
.fn_arg = NULL,
.start = 0,
- .size = num_node_state(N_MEMORY),
+ .size = num_node_state(N_ONLINE),
.align = 1,
.min_chunk = 1,
.max_threads = num_node_state(N_MEMORY),
--
2.39.5
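For readers unfamiliar with the helper: padata_do_multithreaded() splits the
index range [start, start + size) into chunks of at least min_chunk and runs
thread_fn on them with up to max_threads workers. A sketch of the call site
as it would look with this patch applied (only the fields visible in the diff
are shown; any other struct padata_mt_job fields are elided):

	static void __init gather_bootmem_prealloc(void)
	{
		struct padata_mt_job job = {
			.thread_fn	= gather_bootmem_prealloc_parallel,
			.fn_arg		= NULL,
			.start		= 0,				/* first node id walked */
			.size		= num_node_state(N_ONLINE),	/* node-id range walked */
			.align		= 1,
			.min_chunk	= 1,				/* >= one node per chunk */
			.max_threads	= num_node_state(N_MEMORY),	/* parallelism bound */
		};

		padata_do_multithreaded(&job);
	}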
* Re: [PATCH] mm/hugetlb: Fix hugepage allocation for interleaved memory nodes
  2024-10-03 20:00 [PATCH] mm/hugetlb: Fix hugepage allocation for interleaved memory nodes Ritesh Harjani (IBM)
@ 2024-10-07 18:45 ` Ritesh Harjani
  2024-10-08  7:59   ` Muchun Song
  0 siblings, 1 reply; 4+ messages in thread
From: Ritesh Harjani @ 2024-10-07 18:45 UTC (permalink / raw)
  To: linux-mm; +Cc: Donet Tom, Gang Li, Daniel Jordan, Muchun Song, David Rientjes

"Ritesh Harjani (IBM)" <ritesh.list@gmail.com> writes:

> gather_bootmem_prealloc() assumes a start nid of 0 and a size of
> num_node_state(N_MEMORY). Since NUMA nodes with memory attached can be
> interleaved in any fashion, ensure gather_bootmem_prealloc_parallel()
> checks all online NUMA nodes. Keep max_threads at N_MEMORY so that the
> online nodes can still be distributed roughly uniformly among the
> parallel threads.
>
> e.g. qemu cmdline
> ========================
> numa_cmd="-numa node,nodeid=1,memdev=mem1,cpus=2-3 -numa node,nodeid=0,cpus=0-1 -numa dist,src=0,dst=1,val=20"
> mem_cmd="-object memory-backend-ram,id=mem1,size=16G"
>

I think this patch still might not work for the numa config below, because
here node-0 is offline, node-1 has only CPUs, and node-2 has both CPUs and
memory:

numa_cmd="-numa node,nodeid=2,memdev=mem1,cpus=2-3 -numa node,nodeid=1,cpus=0-1 -numa node,nodeid=0"
mem_cmd="-object memory-backend-ram,id=mem1,size=32G"

Maybe N_POSSIBLE will help instead of N_MEMORY in the patch below, but let
me give this some thought before posting v2.

-ritesh

[... rest of quoted patch, /proc/meminfo output, and boot logs snipped ...]
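For reference, the node states discussed here nest: every node with memory is
online, and every online node is possible, so as counts N_MEMORY <= N_ONLINE
<= N_POSSIBLE. A comment-only sketch of the counts this sparse config would
produce (node states assumed from the description above, not measured):

	/*
	 * Assumed for the config above: node 0 offline, node 1 online with
	 * CPUs only, node 2 online with CPUs and memory.
	 *
	 *   num_node_state(N_MEMORY)   == 1  ->  walks [0, 1): misses nid 2
	 *   num_node_state(N_ONLINE)   == 2  ->  walks [0, 2): misses nid 2
	 *   num_node_state(N_POSSIBLE) == 3  ->  walks [0, 3): covers nid 2,
	 *       but only because the possible map happens to be dense here.
	 */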
* Re: [PATCH] mm/hugetlb: Fix hugepage allocation for interleaved memory nodes
  2024-10-07 18:45 ` Ritesh Harjani
@ 2024-10-08  7:59   ` Muchun Song
  2025-01-10  9:37     ` Ritesh Harjani
  0 siblings, 1 reply; 4+ messages in thread
From: Muchun Song @ 2024-10-08 7:59 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: linux-mm, Donet Tom, Gang Li, Daniel Jordan, David Rientjes

> On Oct 8, 2024, at 02:45, Ritesh Harjani (IBM) <ritesh.list@gmail.com> wrote:
>
> I think this patch still might not work for the numa config below, because
> here node-0 is offline, node-1 has only CPUs, and node-2 has both CPUs and
> memory:
>
> numa_cmd="-numa node,nodeid=2,memdev=mem1,cpus=2-3 -numa node,nodeid=1,cpus=0-1 -numa node,nodeid=0"
> mem_cmd="-object memory-backend-ram,id=mem1,size=32G"
>
> Maybe N_POSSIBLE will help instead of N_MEMORY in the patch below, but let
> me give this some thought before posting v2.

How about setting .size with nr_node_ids?

Muchun,
Thanks.

[... rest of quote snipped ...]
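The distinction behind this suggestion, paraphrased from the kernel's
nodemask definitions: num_node_state() is a node count, while nr_node_ids is
an upper bound on node ids, and a [0, size) walk needs the bound, not the
count:

	/* Paraphrased from include/linux/nodemask.h: */
	static inline int num_node_state(enum node_states state)
	{
		return nodes_weight(node_states[state]);	/* a count */
	}

	/*
	 * nr_node_ids is the highest possible node id + 1 -- a range bound.
	 * Setting .size = nr_node_ids makes the [0, size) walk cover every
	 * valid node id even when populated nodes are sparse or interleaved.
	 */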
* Re: [PATCH] mm/hugetlb: Fix hugepage allocation for interleaved memory nodes
  2024-10-08  7:59   ` Muchun Song
@ 2025-01-10  9:37     ` Ritesh Harjani
  0 siblings, 0 replies; 4+ messages in thread
From: Ritesh Harjani @ 2025-01-10 9:37 UTC (permalink / raw)
  To: Muchun Song; +Cc: linux-mm, Donet Tom, Gang Li, Daniel Jordan, David Rientjes

Muchun Song <muchun.song@linux.dev> writes:

>> On Oct 8, 2024, at 02:45, Ritesh Harjani (IBM) <ritesh.list@gmail.com> wrote:
>>
>> Maybe N_POSSIBLE will help instead of N_MEMORY in the patch below, but let
>> me give this some thought before posting v2.
>
> How about setting .size with nr_node_ids?

Yes, I agree. We could do .size = nr_node_ids. Let me send a patch with
the above fix and your Suggested-by.

Sorry about the delay; I got pulled into other things.

-ritesh

[... rest of quote snipped ...]
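Spelling out the direction agreed above, a hypothetical v2 hunk (a sketch
only, not the patch that was actually posted):

	--- a/mm/hugetlb.c
	+++ b/mm/hugetlb.c
	@@ static void __init gather_bootmem_prealloc(void)
	 		.fn_arg = NULL,
	 		.start = 0,
	-		.size = num_node_state(N_MEMORY),
	+		.size = nr_node_ids,
	 		.align = 1,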
Thread overview: 4+ messages
2024-10-03 20:00 [PATCH] mm/hugetlb: Fix hugepage allocation for interleaved memory nodes Ritesh Harjani (IBM)
2024-10-07 18:45 ` Ritesh Harjani
2024-10-08  7:59   ` Muchun Song
2025-01-10  9:37     ` Ritesh Harjani