From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
To: linux-mm@kvack.org
Cc: Donet Tom, Gang Li, Daniel Jordan, Muchun Song, David Rientjes
Subject: Re: [PATCH] mm/hugetlb: Fix hugepage allocation for interleaved memory nodes
In-Reply-To: <7e0ca1e8acd7dd5c1fe7cbb252de4eb55a8e851b.1727984881.git.ritesh.list@gmail.com>
Date: Tue, 08 Oct 2024 00:15:17 +0530
Message-ID: <87bjzvycoy.fsf@gmail.com>
References: <7e0ca1e8acd7dd5c1fe7cbb252de4eb55a8e851b.1727984881.git.ritesh.list@gmail.com>

"Ritesh Harjani (IBM)" writes:

> gather_bootmem_prealloc() function assumes the start nid as 0 and size
> as num_node_state(N_MEMORY).
> Since memory-attached NUMA nodes can be interleaved in any fashion,
> ensure the current code checks all online NUMA nodes as part of
> gather_bootmem_prealloc_parallel(). Let's still keep max_threads as
> N_MEMORY so that we can possibly have a uniform distribution of online
> nodes among these parallel threads.
>
> e.g. qemu cmdline
> ========================
> numa_cmd="-numa node,nodeid=1,memdev=mem1,cpus=2-3 -numa node,nodeid=0,cpus=0-1 -numa dist,src=0,dst=1,val=20"
> mem_cmd="-object memory-backend-ram,id=mem1,size=16G"
>

I think this patch still might not work for the below NUMA config,
because there we have an offline node-0, node-1 with only CPUs, and
node-2 with both CPUs and memory:

numa_cmd="-numa node,nodeid=2,memdev=mem1,cpus=2-3 -numa node,nodeid=1,cpus=0-1 -numa node,nodeid=0"
mem_cmd="-object memory-backend-ram,id=mem1,size=32G"

Maybe N_POSSIBLE will help instead of N_MEMORY in the below patch, but
let me give this some thought before posting v2.

-ritesh

> w/o this patch for cmdline (default_hugepagesz=1GB hugepagesz=1GB hugepages=2):
> ==========================
> ~ # cat /proc/meminfo |grep -i huge
> AnonHugePages:         0 kB
> ShmemHugePages:        0 kB
> FileHugePages:         0 kB
> HugePages_Total:       0
> HugePages_Free:        0
> HugePages_Rsvd:        0
> HugePages_Surp:        0
> Hugepagesize:    1048576 kB
> Hugetlb:               0 kB
>
> with this patch for cmdline (default_hugepagesz=1GB hugepagesz=1GB hugepages=2):
> ===========================
> ~ # cat /proc/meminfo |grep -i huge
> AnonHugePages:         0 kB
> ShmemHugePages:        0 kB
> FileHugePages:         0 kB
> HugePages_Total:       2
> HugePages_Free:        2
> HugePages_Rsvd:        0
> HugePages_Surp:        0
> Hugepagesize:    1048576 kB
> Hugetlb:         2097152 kB
>
> Fixes: b78b27d02930 ("hugetlb: parallelize 1G hugetlb initialization")
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> Cc: Donet Tom
> Cc: Gang Li
> Cc: Daniel Jordan
> Cc: Muchun Song
> Cc: David Rientjes
> Cc: linux-mm@kvack.org
> ---
>
> ==== Additional data ====
>
> w/o this patch:
> ================
> ~ # dmesg |grep -Ei
"numa|node|huge" > [ 0.000000][ T0] numa: Partition configured for 2 NUMA nodes. > [ 0.000000][ T0] memory[0x0] [0x0000000000000000-0x00000003ffffffff], 0x0000000400000000 bytes on node 1 flags: 0x0 > [ 0.000000][ T0] numa: NODE_DATA [mem 0x3dde50800-0x3dde57fff] > [ 0.000000][ T0] numa: NODE_DATA(0) on node 1 > [ 0.000000][ T0] numa: NODE_DATA [mem 0x3dde49000-0x3dde507ff] > [ 0.000000][ T0] Movable zone start for each node > [ 0.000000][ T0] Early memory node ranges > [ 0.000000][ T0] node 1: [mem 0x0000000000000000-0x00000003ffffffff] > [ 0.000000][ T0] Initmem setup node 0 as memoryless > [ 0.000000][ T0] Initmem setup node 1 [mem 0x0000000000000000-0x00000003ffffffff] > [ 0.000000][ T0] Kernel command line: root=/dev/vda1 console=ttyS0 nokaslr slub_max_order=0 norandmaps memblock=debug noreboot default_hugepagesz=1GB hugepagesz=1GB hugepages=2 > [ 0.000000][ T0] memblock_alloc_try_nid_raw: 1073741824 bytes align=0x40000000 nid=1 from=0x0000000000000000 max_addr=0x0000000000000000 __alloc_bootmem_huge_page+0x1ac/0x2c8 > [ 0.000000][ T0] memblock_alloc_try_nid_raw: 1073741824 bytes align=0x40000000 nid=1 from=0x0000000000000000 max_addr=0x0000000000000000 __alloc_bootmem_huge_page+0x1ac/0x2c8 > [ 0.000000][ T0] Inode-cache hash table entries: 1048576 (order: 7, 8388608 bytes, linear) > [ 0.000000][ T0] Fallback order for Node 0: 1 > [ 0.000000][ T0] Fallback order for Node 1: 1 > [ 0.000000][ T0] SLUB: HWalign=128, Order=0-0, MinObjects=0, CPUs=4, Nodes=2 > [ 0.044978][ T0] mempolicy: Enabling automatic NUMA balancing. 
Configure with numa_balancing= or the kernel.numa_balancing sysctl
> [ 0.209159][ T1] Timer migration: 2 hierarchy levels; 8 children per group; 1 crossnode level
> [ 0.414281][ T1] smp: Brought up 2 nodes, 4 CPUs
> [ 0.415268][ T1] numa: Node 0 CPUs: 0-1
> [ 0.416030][ T1] numa: Node 1 CPUs: 2-3
> [ 13.644459][ T41] node 1 deferred pages initialised in 12040ms
> [ 14.241701][ T1] HugeTLB: registered 1.00 GiB page size, pre-allocated 0 pages
> [ 14.242781][ T1] HugeTLB: 0 KiB vmemmap can be freed for a 1.00 GiB page
> [ 14.243806][ T1] HugeTLB: registered 2.00 MiB page size, pre-allocated 0 pages
> [ 14.244753][ T1] HugeTLB: 0 KiB vmemmap can be freed for a 2.00 MiB page
> [ 16.490452][ T1] pci_bus 0000:00: Unknown NUMA node; performance will be reduced
> [ 27.804266][ T1] Demotion targets for Node 1: null
>
> with this patch:
> =================
> ~ # dmesg |grep -Ei "numa|node|huge"
> [ 0.000000][ T0] numa: Partition configured for 2 NUMA nodes.
> [ 0.000000][ T0] memory[0x0] [0x0000000000000000-0x00000003ffffffff], 0x0000000400000000 bytes on node 1 flags: 0x0
> [ 0.000000][ T0] numa: NODE_DATA [mem 0x3dde50800-0x3dde57fff]
> [ 0.000000][ T0] numa: NODE_DATA(0) on node 1
> [ 0.000000][ T0] numa: NODE_DATA [mem 0x3dde49000-0x3dde507ff]
> [ 0.000000][ T0] Movable zone start for each node
> [ 0.000000][ T0] Early memory node ranges
> [ 0.000000][ T0] node 1: [mem 0x0000000000000000-0x00000003ffffffff]
> [ 0.000000][ T0] Initmem setup node 0 as memoryless
> [ 0.000000][ T0] Initmem setup node 1 [mem 0x0000000000000000-0x00000003ffffffff]
> [ 0.000000][ T0] Kernel command line: root=/dev/vda1 console=ttyS0 nokaslr slub_max_order=0 norandmaps memblock=debug noreboot default_hugepagesz=1GB hugepagesz=1GB hugepages=2
> [ 0.000000][ T0] memblock_alloc_try_nid_raw: 1073741824 bytes align=0x40000000 nid=1 from=0x0000000000000000 max_addr=0x0000000000000000 __alloc_bootmem_huge_page+0x1ac/0x2c8
> [ 0.000000][ T0] memblock_alloc_try_nid_raw: 1073741824 bytes align=0x40000000
nid=1 from=0x0000000000000000 max_addr=0x0000000000000000 __alloc_bootmem_huge_page+0x1ac/0x2c8
> [ 0.000000][ T0] Inode-cache hash table entries: 1048576 (order: 7, 8388608 bytes, linear)
> [ 0.000000][ T0] Fallback order for Node 0: 1
> [ 0.000000][ T0] Fallback order for Node 1: 1
> [ 0.000000][ T0] SLUB: HWalign=128, Order=0-0, MinObjects=0, CPUs=4, Nodes=2
> [ 0.048825][ T0] mempolicy: Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl
> [ 0.204211][ T1] Timer migration: 2 hierarchy levels; 8 children per group; 1 crossnode level
> [ 0.378821][ T1] smp: Brought up 2 nodes, 4 CPUs
> [ 0.379642][ T1] numa: Node 0 CPUs: 0-1
> [ 0.380302][ T1] numa: Node 1 CPUs: 2-3
> [ 11.577527][ T41] node 1 deferred pages initialised in 10250ms
> [ 12.557856][ T1] HugeTLB: registered 1.00 GiB page size, pre-allocated 2 pages
> [ 12.574197][ T1] HugeTLB: 0 KiB vmemmap can be freed for a 1.00 GiB page
> [ 12.576339][ T1] HugeTLB: registered 2.00 MiB page size, pre-allocated 0 pages
> [ 12.577262][ T1] HugeTLB: 0 KiB vmemmap can be freed for a 2.00 MiB page
> [ 15.102445][ T1] pci_bus 0000:00: Unknown NUMA node; performance will be reduced
> [ 26.173888][ T1] Demotion targets for Node 1: null
>
>  mm/hugetlb.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 9a3a6e2dee97..60f45314c151 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3443,7 +3443,7 @@ static void __init gather_bootmem_prealloc(void)
>  		.thread_fn	= gather_bootmem_prealloc_parallel,
>  		.fn_arg		= NULL,
>  		.start		= 0,
> -		.size		= num_node_state(N_MEMORY),
> +		.size		= num_node_state(N_ONLINE),
>  		.align		= 1,
>  		.min_chunk	= 1,
>  		.max_threads	= num_node_state(N_MEMORY),
> --
> 2.39.5