From: Donet Tom
To: Mike Rapoport, David Hildenbrand, Oscar Salvador, Greg Kroah-Hartman, Andrew Morton, rafael@kernel.org, Danilo Krummrich
Cc: Ritesh Harjani, Jonathan Cameron, Alison Schofield, Yury Norov, Dave Jiang, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Donet Tom
Subject: [PATCH v2 1/2] driver/base: Optimize memory block registration to reduce boot time
Date: Mon, 28 Apr 2025 22:33:46 +0530
X-Mailer: git-send-email 2.48.1

During node device initialization, `memory blocks` are registered under
each NUMA node. The `memory blocks` to be registered are identified
using the node's start and end PFNs, which are obtained from the node's
pg_data.

However, not all PFNs within this range necessarily belong to the same
node; some may belong to other nodes. Additionally, due to the
discontiguous nature of physical memory, certain sections within a
`memory block` may be absent.
As a result, `memory blocks` that fall between a node's start and end
PFNs may span multiple nodes, and some sections within those blocks may
be missing. `Memory blocks` have a fixed size, which is architecture
dependent.

Due to these considerations, the memory block registration is currently
performed as follows:

for_each_online_node(nid):
    start_pfn = pgdat->node_start_pfn;
    end_pfn = pgdat->node_start_pfn + node_spanned_pages;
    for_each_memory_block_between(PFN_PHYS(start_pfn), PFN_PHYS(end_pfn))
        mem_blk = memory_block_id(pfn_to_section_nr(pfn));
        pfn_mb_start = section_nr_to_pfn(mem_blk->start_section_nr);
        pfn_mb_end = pfn_mb_start + memory_block_pfns - 1;
        for (pfn = pfn_mb_start; pfn <= pfn_mb_end; pfn++):
            if (get_nid_for_pfn(pfn) != nid):
                continue;
            else
                do_register_memory_block_under_node(nid, mem_blk,
                                                    MEMINIT_EARLY);

Here, we derive the start and end PFNs from the node's pg_data, then
determine the memory blocks that may belong to the node. For each
`memory block` in this range, we inspect all PFNs it contains and check
their associated NUMA node ID. If a PFN within the block matches the
current node, the memory block is registered under that node.

If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, get_nid_for_pfn()
performs a binary search in the `memblock regions` to determine the NUMA
node ID for a given PFN. If it is not enabled, the node ID is read
directly from the struct page.

On large systems, this process can become time-consuming, because we
iterate over each `memory block` and over the PFNs within it until a
match is found. When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, the
additional overhead of the binary search increases the execution time
significantly, potentially leading to soft lockups during boot.

In this patch, we iterate over the `memblock regions` to identify the
`memory blocks` that belong to the current NUMA node.
`memblock regions` are contiguous memory ranges, each associated with a
single NUMA node, and they do not span multiple nodes.

for_each_online_node(nid):
    for_each_memory_region(r): // r => region
        if (r->nid != nid):
            continue;
        else
            for_each_memory_block_between(r->base, r->base + r->size - 1):
                do_register_memory_block_under_node(nid, mem_blk,
                                                    MEMINIT_EARLY);

We iterate over all `memblock regions` and identify those that belong to
the current NUMA node. For each `memblock region` associated with the
current node, we calculate the start and end `memory blocks` based on
the region's start and end PFNs. We then register all `memory blocks`
within that range under the current node.

Test results on my system with 32TB RAM
=======================================
1. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled.

Without this patch
------------------
Startup finished in 1min 16.528s (kernel)

With this patch
---------------
Startup finished in 17.236s (kernel) - 78% Improvement

2. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT disabled.
Without this patch
------------------
Startup finished in 28.320s (kernel)

With this patch
---------------
Startup finished in 15.621s (kernel) - 46% Improvement

Signed-off-by: Donet Tom
---
v1->v2

Reworked the implementation according to suggestions from Mike Rapoport[1]

[1] - https://lore.kernel.org/all/Z_j2Gv9n4DOj6LSs@kernel.org/

v1 - https://lore.kernel.org/all/50142a29010463f436dc5c4feb540e5de3bb09df.1744175097.git.donettom@linux.ibm.com/
---
 drivers/base/memory.c  |  4 ++--
 drivers/base/node.c    | 39 +++++++++++++++++++++++++++++++++++++++
 include/linux/memory.h |  2 ++
 include/linux/node.h   | 11 +++++------
 4 files changed, 48 insertions(+), 8 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 19469e7f88c2..7f1d266ae593 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -60,7 +60,7 @@ static inline unsigned long pfn_to_block_id(unsigned long pfn)
 	return memory_block_id(pfn_to_section_nr(pfn));
 }
 
-static inline unsigned long phys_to_block_id(unsigned long phys)
+unsigned long phys_to_block_id(unsigned long phys)
 {
 	return pfn_to_block_id(PFN_DOWN(phys));
 }
@@ -632,7 +632,7 @@ int __weak arch_get_memory_phys_device(unsigned long start_pfn)
  *
  * Called under device_hotplug_lock.
  */
-static struct memory_block *find_memory_block_by_id(unsigned long block_id)
+struct memory_block *find_memory_block_by_id(unsigned long block_id)
 {
 	struct memory_block *mem;
 
diff --git a/drivers/base/node.c b/drivers/base/node.c
index cd13ef287011..4869333d366d 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -20,6 +20,7 @@
 #include
 #include
 #include
+#include
 
 static const struct bus_type node_subsys = {
 	.name = "node",
@@ -850,6 +851,44 @@ void unregister_memory_block_under_nodes(struct memory_block *mem_blk)
 			  kobject_name(&node_devices[mem_blk->nid]->dev.kobj));
 }
 
+/*
+ * register_memory_blocks_under_node_early : Register the memory
+ *		  blocks under the current node.
+ * @nid : Current node under registration
+ *
+ * This function iterates over all memblock regions and identifies the regions
+ * that belong to the current node. For each region which belongs to current
+ * node, it calculates the start and end memory blocks based on the region's
+ * start and end PFNs. It then registers all memory blocks within that range
+ * under the current node.
+ *
+ */
+void register_memory_blocks_under_node_early(int nid)
+{
+	struct memblock_region *r;
+	unsigned long start_block_id;
+	unsigned long end_block_id;
+	struct memory_block *mem;
+	unsigned long block_id;
+
+	for_each_mem_region(r) {
+		if (r->nid == nid) {
+			start_block_id = phys_to_block_id(r->base);
+			end_block_id = phys_to_block_id(r->base + r->size - 1);
+
+			for (block_id = start_block_id; block_id <= end_block_id; block_id++) {
+				mem = find_memory_block_by_id(block_id);
+				if (!mem)
+					continue;
+
+				do_register_memory_block_under_node(nid, mem, MEMINIT_EARLY);
+				put_device(&mem->dev);
+			}
+
+		}
+	}
+}
+
 void register_memory_blocks_under_node(int nid, unsigned long start_pfn,
 				       unsigned long end_pfn,
 				       enum meminit_context context)
diff --git a/include/linux/memory.h b/include/linux/memory.h
index 12daa6ec7d09..cb8579226536 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -171,6 +171,8 @@ struct memory_group *memory_group_find_by_id(int mgid);
 typedef int (*walk_memory_groups_func_t)(struct memory_group *, void *);
 int walk_dynamic_memory_groups(int nid, walk_memory_groups_func_t func,
 			       struct memory_group *excluded, void *arg);
+unsigned long phys_to_block_id(unsigned long phys);
+struct memory_block *find_memory_block_by_id(unsigned long block_id);
 #define hotplug_memory_notifier(fn, pri) ({		\
 	static __meminitdata struct notifier_block fn##_mem_nb =\
 		{ .notifier_call = fn, .priority = pri };\
diff --git a/include/linux/node.h b/include/linux/node.h
index 2b7517892230..c5a8a7f0aac7 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -114,12 +114,16 @@ extern struct node *node_devices[];
 void register_memory_blocks_under_node(int nid, unsigned long start_pfn,
 				       unsigned long end_pfn,
 				       enum meminit_context context);
+void register_memory_blocks_under_node_early(int nid);
 #else
 static inline void register_memory_blocks_under_node(int nid, unsigned long start_pfn,
 						     unsigned long end_pfn,
 						     enum meminit_context context)
 {
 }
+static inline void register_memory_blocks_under_node_early(int nid)
+{
+}
 #endif
 
 extern void unregister_node(struct node *node);
@@ -134,15 +138,10 @@ static inline int register_one_node(int nid)
 	int error = 0;
 
 	if (node_online(nid)) {
-		struct pglist_data *pgdat = NODE_DATA(nid);
-		unsigned long start_pfn = pgdat->node_start_pfn;
-		unsigned long end_pfn = start_pfn + pgdat->node_spanned_pages;
-
 		error = __register_one_node(nid);
 		if (error)
 			return error;
-		register_memory_blocks_under_node(nid, start_pfn, end_pfn,
-						  MEMINIT_EARLY);
+		register_memory_blocks_under_node_early(nid);
 	}
 
 	return error;
-- 
2.48.1
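Illustrative note (not part of the patch): the block-range arithmetic used
by register_memory_blocks_under_node_early() can be modelled in userspace.
The sketch below is only a rough model under stated assumptions. Every
sketch_* name, the 4 KiB page size and the 128 MiB memory block size are
hypothetical stand-ins; in the kernel the block size is architecture
dependent and phys_to_block_id() goes through section numbers rather than
a plain division.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration only; the real ones depend on the arch. */
#define SKETCH_PAGE_SHIFT  12               /* assumed 4 KiB pages */
#define SKETCH_BLOCK_BYTES (128ULL << 20)   /* assumed 128 MiB memory blocks */

/* Hypothetical model of the base/size/nid fields of a memblock region. */
struct sketch_region {
	uint64_t base;
	uint64_t size;
	int nid;
};

/* Model of phys_to_block_id(): index of the fixed-size block holding phys. */
static uint64_t sketch_phys_to_block_id(uint64_t phys)
{
	return phys / SKETCH_BLOCK_BYTES;
}

/*
 * Count the block lookups a region-based walk performs for one node:
 * for every region owned by nid, visit the blocks from the one holding
 * the first byte to the one holding the last byte, inclusive.
 */
static uint64_t sketch_blocks_for_node(const struct sketch_region *regions,
				       int nr_regions, int nid)
{
	uint64_t visited = 0;

	assert(nr_regions >= 0);
	for (int i = 0; i < nr_regions; i++) {
		const struct sketch_region *r = &regions[i];

		if (r->nid != nid || r->size == 0)
			continue;

		uint64_t first = sketch_phys_to_block_id(r->base);
		uint64_t last = sketch_phys_to_block_id(r->base + r->size - 1);

		visited += last - first + 1;
	}
	return visited;
}

/* PFNs a per-PFN scan may have to check inside one block that never matches. */
static uint64_t sketch_pfns_per_block(void)
{
	return SKETCH_BLOCK_BYTES >> SKETCH_PAGE_SHIFT;
}
```

With these assumed sizes, a node owning a 256 MiB region and a separate
64 MiB region needs three block lookups in the region-based walk, whereas
a per-PFN scan may check up to sketch_pfns_per_block() PFNs for each block
that does not belong to the node.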