From: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
To: Muchun Song
Cc: linux-mm@kvack.org, Donet Tom, Gang Li, Daniel Jordan, David Rientjes
Subject: Re: [PATCH] mm/hugetlb: Fix hugepage allocation for interleaved memory nodes
In-Reply-To: <73DCC4CB-DB4F-4E66-B208-A515A6A4DE96@linux.dev>
Date: Fri, 10 Jan 2025 15:07:09 +0530
Message-ID: <87tta7ui0q.fsf@gmail.com>
References: <7e0ca1e8acd7dd5c1fe7cbb252de4eb55a8e851b.1727984881.git.ritesh.list@gmail.com> <87bjzvycoy.fsf@gmail.com> <73DCC4CB-DB4F-4E66-B208-A515A6A4DE96@linux.dev>
Muchun Song writes:

>> On Oct 8, 2024, at 02:45, Ritesh Harjani (IBM) wrote:
>>
>> "Ritesh Harjani (IBM)" writes:
>>
>>> gather_bootmem_prealloc() function assumes the start nid as
>>> 0 and size as num_node_state(N_MEMORY).
>>> Since memory attached numa nodes
>>> can be interleaved in any fashion, hence ensure current code checks for all
>>> online numa nodes as part of gather_bootmem_prealloc_parallel().
>>> Let's still make max_threads as N_MEMORY so that we can possibly have
>>> a uniform distribution of online nodes among these parallel threads.
>>>
>>> e.g. qemu cmdline
>>> ========================
>>> numa_cmd="-numa node,nodeid=1,memdev=mem1,cpus=2-3 -numa node,nodeid=0,cpus=0-1 -numa dist,src=0,dst=1,val=20"
>>> mem_cmd="-object memory-backend-ram,id=mem1,size=16G"
>>>
>>
>> I think this patch still might not work for below numa config. Because
>> in this we have an offline node-0, node-1 with only cpus and node-2 with
>> cpus and memory.
>>
>> numa_cmd="-numa node,nodeid=2,memdev=mem1,cpus=2-3 -numa node,nodeid=1,cpus=0-1 -numa node,nodeid=0"
>> mem_cmd="-object memory-backend-ram,id=mem1,size=32G"
>>
>> Maybe N_POSSIBLE will help instead of N_MEMORY in below patch, but let
>> me give some thought to this before posting v2.
>
> How about setting .size with nr_node_ids?
>

Yes, I agree. We could do .size = nr_node_ids. Let me send a patch with
the above fix and your suggested-by.

Sorry about the delay. Got pulled into other things.

-ritesh

> Muchun,
> Thanks.
>
>>
>> -ritesh
>>
>>
>>> w/o this patch for cmdline (default_hugepagesz=1GB hugepagesz=1GB hugepages=2):
>>> ==========================
>>> ~ # cat /proc/meminfo |grep -i huge
>>> AnonHugePages: 0 kB
>>> ShmemHugePages: 0 kB
>>> FileHugePages: 0 kB
>>> HugePages_Total: 0
>>> HugePages_Free: 0
>>> HugePages_Rsvd: 0
>>> HugePages_Surp: 0
>>> Hugepagesize: 1048576 kB
>>> Hugetlb: 0 kB
>>>
>>> with this patch for cmdline (default_hugepagesz=1GB hugepagesz=1GB hugepages=2):
>>> ===========================
>>> ~ # cat /proc/meminfo |grep -i huge
>>> AnonHugePages: 0 kB
>>> ShmemHugePages: 0 kB
>>> FileHugePages: 0 kB
>>> HugePages_Total: 2
>>> HugePages_Free: 2
>>> HugePages_Rsvd: 0
>>> HugePages_Surp: 0
>>> Hugepagesize: 1048576 kB
>>> Hugetlb: 2097152 kB
>>>
>>> Fixes: b78b27d02930 ("hugetlb: parallelize 1G hugetlb initialization")
>>> Signed-off-by: Ritesh Harjani (IBM)
>>> Cc: Donet Tom
>>> Cc: Gang Li
>>> Cc: Daniel Jordan
>>> Cc: Muchun Song
>>> Cc: David Rientjes
>>> Cc: linux-mm@kvack.org
>>> ---
>>>
>>> ==== Additional data ====
>>>
>>> w/o this patch:
>>> ================
>>> ~ # dmesg |grep -Ei "numa|node|huge"
>>> [    0.000000][    T0] numa: Partition configured for 2 NUMA nodes.
>>> [    0.000000][    T0] memory[0x0] [0x0000000000000000-0x00000003ffffffff], 0x0000000400000000 bytes on node 1 flags: 0x0
>>> [    0.000000][    T0] numa: NODE_DATA [mem 0x3dde50800-0x3dde57fff]
>>> [    0.000000][    T0] numa: NODE_DATA(0) on node 1
>>> [    0.000000][    T0] numa: NODE_DATA [mem 0x3dde49000-0x3dde507ff]
>>> [    0.000000][    T0] Movable zone start for each node
>>> [    0.000000][    T0] Early memory node ranges
>>> [    0.000000][    T0]   node   1: [mem 0x0000000000000000-0x00000003ffffffff]
>>> [    0.000000][    T0] Initmem setup node 0 as memoryless
>>> [    0.000000][    T0] Initmem setup node 1 [mem 0x0000000000000000-0x00000003ffffffff]
>>> [    0.000000][    T0] Kernel command line: root=/dev/vda1 console=ttyS0 nokaslr slub_max_order=0 norandmaps memblock=debug noreboot default_hugepagesz=1GB hugepagesz=1GB hugepages=2
>>> [    0.000000][    T0] memblock_alloc_try_nid_raw: 1073741824 bytes align=0x40000000 nid=1 from=0x0000000000000000 max_addr=0x0000000000000000 __alloc_bootmem_huge_page+0x1ac/0x2c8
>>> [    0.000000][    T0] memblock_alloc_try_nid_raw: 1073741824 bytes align=0x40000000 nid=1 from=0x0000000000000000 max_addr=0x0000000000000000 __alloc_bootmem_huge_page+0x1ac/0x2c8
>>> [    0.000000][    T0] Inode-cache hash table entries: 1048576 (order: 7, 8388608 bytes, linear)
>>> [    0.000000][    T0] Fallback order for Node 0: 1
>>> [    0.000000][    T0] Fallback order for Node 1: 1
>>> [    0.000000][    T0] SLUB: HWalign=128, Order=0-0, MinObjects=0, CPUs=4, Nodes=2
>>> [    0.044978][    T0] mempolicy: Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl
>>> [    0.209159][    T1] Timer migration: 2 hierarchy levels; 8 children per group; 1 crossnode level
>>> [    0.414281][    T1] smp: Brought up 2 nodes, 4 CPUs
>>> [    0.415268][    T1] numa: Node 0 CPUs: 0-1
>>> [    0.416030][    T1] numa: Node 1 CPUs: 2-3
>>> [   13.644459][   T41] node 1 deferred pages initialised in 12040ms
>>> [   14.241701][    T1] HugeTLB: registered 1.00 GiB page size, pre-allocated 0 pages
>>> [   14.242781][    T1] HugeTLB: 0 KiB vmemmap can be freed for a 1.00 GiB page
>>> [   14.243806][    T1] HugeTLB: registered 2.00 MiB page size, pre-allocated 0 pages
>>> [   14.244753][    T1] HugeTLB: 0 KiB vmemmap can be freed for a 2.00 MiB page
>>> [   16.490452][    T1] pci_bus 0000:00: Unknown NUMA node; performance will be reduced
>>> [   27.804266][    T1] Demotion targets for Node 1: null
>>>
>>> with this patch:
>>> =================
>>> ~ # dmesg |grep -Ei "numa|node|huge"
>>> [    0.000000][    T0] numa: Partition configured for 2 NUMA nodes.
>>> [    0.000000][    T0] memory[0x0] [0x0000000000000000-0x00000003ffffffff], 0x0000000400000000 bytes on node 1 flags: 0x0
>>> [    0.000000][    T0] numa: NODE_DATA [mem 0x3dde50800-0x3dde57fff]
>>> [    0.000000][    T0] numa: NODE_DATA(0) on node 1
>>> [    0.000000][    T0] numa: NODE_DATA [mem 0x3dde49000-0x3dde507ff]
>>> [    0.000000][    T0] Movable zone start for each node
>>> [    0.000000][    T0] Early memory node ranges
>>> [    0.000000][    T0]   node   1: [mem 0x0000000000000000-0x00000003ffffffff]
>>> [    0.000000][    T0] Initmem setup node 0 as memoryless
>>> [    0.000000][    T0] Initmem setup node 1 [mem 0x0000000000000000-0x00000003ffffffff]
>>> [    0.000000][    T0] Kernel command line: root=/dev/vda1 console=ttyS0 nokaslr slub_max_order=0 norandmaps memblock=debug noreboot default_hugepagesz=1GB hugepagesz=1GB hugepages=2
>>> [    0.000000][    T0] memblock_alloc_try_nid_raw: 1073741824 bytes align=0x40000000 nid=1 from=0x0000000000000000 max_addr=0x0000000000000000 __alloc_bootmem_huge_page+0x1ac/0x2c8
>>> [    0.000000][    T0] memblock_alloc_try_nid_raw: 1073741824 bytes align=0x40000000 nid=1 from=0x0000000000000000 max_addr=0x0000000000000000 __alloc_bootmem_huge_page+0x1ac/0x2c8
>>> [    0.000000][    T0] Inode-cache hash table entries: 1048576 (order: 7, 8388608 bytes, linear)
>>> [    0.000000][    T0] Fallback order for Node 0: 1
>>> [    0.000000][    T0] Fallback order for Node 1: 1
>>> [    0.000000][    T0] SLUB: HWalign=128, Order=0-0, MinObjects=0, CPUs=4, Nodes=2
>>> [    0.048825][    T0] mempolicy: Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl
>>> [    0.204211][    T1] Timer migration: 2 hierarchy levels; 8 children per group; 1 crossnode level
>>> [    0.378821][    T1] smp: Brought up 2 nodes, 4 CPUs
>>> [    0.379642][    T1] numa: Node 0 CPUs: 0-1
>>> [    0.380302][    T1] numa: Node 1 CPUs: 2-3
>>> [   11.577527][   T41] node 1 deferred pages initialised in 10250ms
>>> [   12.557856][    T1] HugeTLB: registered 1.00 GiB page size, pre-allocated 2 pages
>>> [   12.574197][    T1] HugeTLB: 0 KiB vmemmap can be freed for a 1.00 GiB page
>>> [   12.576339][    T1] HugeTLB: registered 2.00 MiB page size, pre-allocated 0 pages
>>> [   12.577262][    T1] HugeTLB: 0 KiB vmemmap can be freed for a 2.00 MiB page
>>> [   15.102445][    T1] pci_bus 0000:00: Unknown NUMA node; performance will be reduced
>>> [   26.173888][    T1] Demotion targets for Node 1: null
>>>
>>> mm/hugetlb.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>> index 9a3a6e2dee97..60f45314c151 100644
>>> --- a/mm/hugetlb.c
>>> +++ b/mm/hugetlb.c
>>> @@ -3443,7 +3443,7 @@ static void __init gather_bootmem_prealloc(void)
>>>  		.thread_fn	= gather_bootmem_prealloc_parallel,
>>>  		.fn_arg		= NULL,
>>>  		.start		= 0,
>>> -		.size		= num_node_state(N_MEMORY),
>>> +		.size		= num_node_state(N_ONLINE),
>>>  		.align		= 1,
>>>  		.min_chunk	= 1,
>>>  		.max_threads	= num_node_state(N_MEMORY),
>>> --
>>> 2.39.5