From: Donet Tom
To: David Hildenbrand, Andrew Morton, Mike Rapoport, Oscar Salvador, Zi Yan
Cc: Ritesh Harjani, rafael@kernel.org, Danilo Krummrich, Greg Kroah-Hartman,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org, Jonathan Cameron,
    Alison Schofield, Yury Norov, Dave Jiang, Donet Tom
Subject: [PATCH v4 1/4] driver/base: Optimize memory block registration to reduce boot time
Date: Fri, 16 May 2025 03:19:51 -0500
X-Mailer: git-send-email 2.43.5
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

During node device initialization, `memory blocks` are registered under
each NUMA node. The `memory blocks` to be registered are identified using
the node’s start and end PFNs, which are obtained from the node's pg_data.
However, not all PFNs within this range necessarily belong to the same
node; some may belong to other nodes.
Additionally, due to the discontiguous nature of physical memory,
certain sections within a `memory block` may be absent.

As a result, `memory blocks` that fall between a node’s start and end
PFNs may span across multiple nodes, and some sections within those
blocks may be missing. `Memory blocks` have a fixed size, which is
architecture dependent.

Due to these considerations, the memory block registration is currently
performed as follows:

for_each_online_node(nid):
    start_pfn = pgdat->node_start_pfn;
    end_pfn = pgdat->node_start_pfn + node_spanned_pages;
    for_each_memory_block_between(PFN_PHYS(start_pfn), PFN_PHYS(end_pfn))
        mem_blk = memory_block_id(pfn_to_section_nr(pfn));
        pfn_mb_start = section_nr_to_pfn(mem_blk->start_section_nr)
        pfn_mb_end = pfn_mb_start + memory_block_pfns - 1
        for (pfn = pfn_mb_start; pfn <= pfn_mb_end; pfn++):
            if (get_nid_for_pfn(pfn) != nid):
                continue;
            else
                do_register_memory_block_under_node(nid, mem_blk,
                                                    MEMINIT_EARLY);

Here, we derive the start and end PFNs from the node's pg_data, then
determine the memory blocks that may belong to the node. For each
`memory block` in this range, we inspect all PFNs it contains and check
their associated NUMA node ID. If a PFN within the block matches the
current node, the memory block is registered under that node.

If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, get_nid_for_pfn() performs
a binary search in the `memblock regions` to determine the NUMA node ID
for a given PFN. If it is not enabled, the node ID is retrieved directly
from the struct page.

On large systems, this process can become time-consuming, especially since
we iterate over each `memory block` and all PFNs within it until a match is
found. When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, the additional
overhead of the binary search increases the execution time significantly,
potentially leading to soft lockups during boot.

In this patch, we iterate over the `memblock regions` to identify the
`memory blocks` that belong to the current NUMA node.
`memblock regions` are contiguous memory ranges, each associated with a
single NUMA node, and they do not span across multiple nodes.

for_each_online_node(nid):
    for_each_memory_region(r): // r => region
        if (r->nid != nid):
            continue;
        else
            for_each_memory_block_between(r->base, r->base + r->size - 1):
                do_register_memory_block_under_node(nid, mem_blk, MEMINIT_EARLY);

We iterate over all `memblock regions` and identify those that belong to
the current NUMA node. For each `memblock region` associated with the
current node, we calculate the start and end `memory blocks` based on the
region's start and end PFNs. We then register all `memory blocks` within
that range under the current node.

Test Results on my system with 32TB RAM
=======================================
1. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled.

Without this patch
------------------
         Startup finished in 1min 16.528s (kernel)

With this patch
---------------
         Startup finished in 17.236s (kernel) - 78% Improvement

2. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT disabled.

Without this patch
------------------
         Startup finished in 28.320s (kernel)

With this patch
---------------
         Startup finished in 15.621s (kernel) - 46% Improvement

Acked-by: David Hildenbrand
Acked-by: Zi Yan
Signed-off-by: Donet Tom
---

v3 -> v4

Addressed Mike's comment by making node_dev_init() call __register_one_node().

v3 - https://lore.kernel.org/all/b49ed289096643ff5b5fbedcf1d1c1be42845a74.1746250339.git.donettom@linux.ibm.com/
v2 - https://lore.kernel.org/all/fbe1e0c7d91bf3fa9a64ff5d84b53ded1d0d5ac7.1745852397.git.donettom@linux.ibm.com/
v1 - https://lore.kernel.org/all/50142a29010463f436dc5c4feb540e5de3bb09df.1744175097.git.donettom@linux.ibm.com/
---
 drivers/base/memory.c  |  4 ++--
 drivers/base/node.c    | 41 ++++++++++++++++++++++++++++++++++++++++-
 include/linux/memory.h |  2 ++
 include/linux/node.h   |  3 +++
 4 files changed, 47 insertions(+), 3 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 19469e7f88c2..7f1d266ae593 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -60,7 +60,7 @@ static inline unsigned long pfn_to_block_id(unsigned long pfn)
 	return memory_block_id(pfn_to_section_nr(pfn));
 }
 
-static inline unsigned long phys_to_block_id(unsigned long phys)
+unsigned long phys_to_block_id(unsigned long phys)
 {
 	return pfn_to_block_id(PFN_DOWN(phys));
 }
@@ -632,7 +632,7 @@ int __weak arch_get_memory_phys_device(unsigned long start_pfn)
  *
  * Called under device_hotplug_lock.
  */
-static struct memory_block *find_memory_block_by_id(unsigned long block_id)
+struct memory_block *find_memory_block_by_id(unsigned long block_id)
 {
 	struct memory_block *mem;
 
diff --git a/drivers/base/node.c b/drivers/base/node.c
index cd13ef287011..f8cafd8c8fb1 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -20,6 +20,7 @@
 #include
 #include
 #include
+#include
 
 static const struct bus_type node_subsys = {
 	.name = "node",
@@ -850,6 +851,43 @@ void unregister_memory_block_under_nodes(struct memory_block *mem_blk)
 			  kobject_name(&node_devices[mem_blk->nid]->dev.kobj));
 }
 
+/*
+ * register_memory_blocks_under_node_early : Register the memory
+ *		  blocks under the current node.
+ * @nid : Current node under registration
+ *
+ * This function iterates over all memblock regions and identifies the regions
+ * that belong to the current node. For each region which belongs to current
+ * node, it calculates the start and end memory blocks based on the region's
+ * start and end PFNs. It then registers all memory blocks within that range
+ * under the current node.
+ */
+static void register_memory_blocks_under_node_early(int nid)
+{
+	struct memblock_region *r;
+
+	for_each_mem_region(r) {
+		if (r->nid != nid)
+			continue;
+
+		const unsigned long start_block_id = phys_to_block_id(r->base);
+		const unsigned long end_block_id = phys_to_block_id(r->base + r->size - 1);
+		unsigned long block_id;
+
+		for (block_id = start_block_id; block_id <= end_block_id; block_id++) {
+			struct memory_block *mem;
+
+			mem = find_memory_block_by_id(block_id);
+			if (!mem)
+				continue;
+
+			do_register_memory_block_under_node(nid, mem, MEMINIT_EARLY);
+			put_device(&mem->dev);
+		}
+
+	}
+}
+
 void register_memory_blocks_under_node(int nid, unsigned long start_pfn,
 				       unsigned long end_pfn,
 				       enum meminit_context context)
@@ -974,8 +1012,9 @@ void __init node_dev_init(void)
 	 * to applicable memory block devices and already created cpu devices.
 	 */
 	for_each_online_node(i) {
-		ret = register_one_node(i);
+		ret = __register_one_node(i);
 		if (ret)
 			panic("%s() failed to add node: %d\n", __func__, ret);
+		register_memory_blocks_under_node_early(i);
 	}
 }
diff --git a/include/linux/memory.h b/include/linux/memory.h
index 12daa6ec7d09..cb8579226536 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -171,6 +171,8 @@ struct memory_group *memory_group_find_by_id(int mgid);
 typedef int (*walk_memory_groups_func_t)(struct memory_group *, void *);
 int walk_dynamic_memory_groups(int nid, walk_memory_groups_func_t func,
 			       struct memory_group *excluded, void *arg);
+unsigned long phys_to_block_id(unsigned long phys);
+struct memory_block *find_memory_block_by_id(unsigned long block_id);
 #define hotplug_memory_notifier(fn, pri) ({		\
 	static __meminitdata struct notifier_block fn##_mem_nb =\
 		{ .notifier_call = fn, .priority = pri };\
diff --git a/include/linux/node.h b/include/linux/node.h
index 2b7517892230..806e62638cbe 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -120,6 +120,9 @@ static inline void register_memory_blocks_under_node(int nid, unsigned long star
 						     enum meminit_context context)
 {
 }
+static inline void register_memory_blocks_under_node_early(int nid)
+{
+}
 #endif
 
 extern void unregister_node(struct node *node);
-- 
2.43.5
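
For readers who want to try the block-range arithmetic outside the kernel,
here is a minimal user-space sketch of the region-based walk described in
the commit message. It is not the kernel code. struct region,
MODEL_BLOCK_SIZE, model_phys_to_block_id() and model_register_blocks() are
hypothetical names made up for this illustration, the 256 MiB block size is
an assumption (the real size is architecture dependent), and the seen[]
array only stands in for "this block is already registered under the node".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a memblock region: one contiguous range, one node. */
struct region {
	uint64_t base;	/* physical start address */
	uint64_t size;	/* length in bytes, must be non-zero */
	int nid;	/* owning NUMA node */
};

/* Assumed block size for this model only; the real value is arch dependent. */
#define MODEL_BLOCK_SIZE (256ULL << 20)

/* Map a physical address to the ID of the fixed-size block containing it. */
static uint64_t model_phys_to_block_id(uint64_t phys)
{
	return phys / MODEL_BLOCK_SIZE;
}

/*
 * Walk all regions, skip those on other nodes, and mark every block that a
 * matching region touches. Returns the number of blocks newly marked.
 * No per-PFN scan is needed because a region never spans multiple nodes.
 */
static size_t model_register_blocks(const struct region *regions,
				    size_t nr_regions, int nid,
				    unsigned char *seen, size_t nr_blocks)
{
	size_t registered = 0;

	for (size_t i = 0; i < nr_regions; i++) {
		const struct region *r = &regions[i];

		if (r->nid != nid)
			continue;

		assert(r->size > 0);

		/* Inclusive block range covered by [base, base + size - 1]. */
		uint64_t start = model_phys_to_block_id(r->base);
		uint64_t end = model_phys_to_block_id(r->base + r->size - 1);

		for (uint64_t id = start; id <= end && id < nr_blocks; id++) {
			if (!seen[id]) {
				seen[id] = 1;
				registered++;
			}
		}
	}
	return registered;
}
```

With three regions where the third block is split between two nodes, the
same block ID ends up marked for both nodes, which matches the observation
above that a `memory block` may span multiple nodes.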