Message-ID: <1f750ad6-b6b1-41de-9cdd-9abe64c14eae@linux.ibm.com>
Date: Tue, 29 Apr 2025 19:38:59 +0530
Subject: Re: [PATCH v2 1/2] driver/base: Optimize memory block registration to reduce boot time
To: David Hildenbrand, Mike Rapoport, Oscar Salvador, Greg Kroah-Hartman, Andrew Morton, rafael@kernel.org, Danilo Krummrich
Cc: Ritesh Harjani, Jonathan Cameron, Alison Schofield, Yury Norov, Dave Jiang, linux-mm@kvack.org, linux-kernel@vger.kernel.org
From: Donet Tom

On 4/29/25 2:49 AM, David Hildenbrand wrote:
> On 28.04.25 19:03, Donet Tom wrote:
>> During node device initialization, `memory blocks` are registered under
>> each NUMA node.
>> The `memory blocks` to be registered are identified using
>> the node’s start and end PFNs, which are obtained from the node's
>> pg_data.
>>
>> However, not all PFNs within this range necessarily belong to the same
>> node; some may belong to other nodes. Additionally, due to the
>> discontiguous nature of physical memory, certain sections within a
>> `memory block` may be absent.
>>
>> As a result, `memory blocks` that fall between a node’s start and end
>> PFNs may span across multiple nodes, and some sections within those
>> blocks may be missing. `Memory blocks` have a fixed size, which is
>> architecture dependent.
>>
>> Due to these considerations, the memory block registration is currently
>> performed as follows:
>>
>> for_each_online_node(nid):
>>      start_pfn = pgdat->node_start_pfn;
>>      end_pfn = pgdat->node_start_pfn + node_spanned_pages;
>>      for_each_memory_block_between(PFN_PHYS(start_pfn), PFN_PHYS(end_pfn))
>>          mem_blk = memory_block_id(pfn_to_section_nr(pfn));
>>          pfn_mb_start=section_nr_to_pfn(mem_blk->start_section_nr)
>>          pfn_mb_end = pfn_start + memory_block_pfns - 1
>>          for (pfn = pfn_mb_start; pfn < pfn_mb_end; pfn++):
>>              if (get_nid_for_pfn(pfn) != nid):
>>                  continue;
>>              else
>>                  do_register_memory_block_under_node(nid, mem_blk, MEMINIT_EARLY);
>>
>> Here, we derive the start and end PFNs from the node's pg_data, then
>> determine the memory blocks that may belong to the node. For each
>> `memory block` in this range, we inspect all PFNs it contains and check
>> their associated NUMA node ID. If a PFN within the block matches the
>> current node, the memory block is registered under that node.
>>
>> If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, get_nid_for_pfn()
>> performs a binary search in the `memblock regions` to determine the
>> NUMA node ID for a given PFN.
>> If it is not enabled, the node ID is retrieved directly
>> from the struct page.
>>
>> On large systems, this process can become time-consuming, especially
>> since we iterate over each `memory block` and all PFNs within it until
>> a match is found. When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, the
>> additional overhead of the binary search increases the execution time
>> significantly, potentially leading to soft lockups during boot.
>>
>> In this patch, we iterate over `memblock regions` to identify the
>> `memory blocks` that belong to the current NUMA node. `memblock regions`
>> are contiguous memory ranges, each associated with a single NUMA node,
>> and they do not span across multiple nodes.
>>
>> for_each_online_node(nid):
>>    for_each_memory_region(r): // r => region
>>      if (r->nid != nid):
>>        continue;
>>      else
>>        for_each_memory_block_between(r->base, r->base + r->size - 1):
>>          do_register_memory_block_under_node(nid, mem_blk, MEMINIT_EARLY);
>>
>> We iterate over all `memblock regions` and identify those that belong to
>> the current NUMA node. For each `memblock region` associated with the
>> current node, we calculate the start and end `memory blocks` based on
>> the region's start and end PFNs. We then register all `memory blocks`
>> within that range under the current node.
>
> Yes, makes sense.
>
>>
>> Test Results on My system with 32TB RAM
>> =======================================
>> 1. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled.
>>
>> Without this patch
>> ------------------
>> Startup finished in 1min 16.528s (kernel)
>>
>> With this patch
>> ---------------
>> Startup finished in 17.236s (kernel) - 78% Improvement
>
> Wow!
>
>>
>> 2. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT disabled.
>>
>> Without this patch
>> ------------------
>> Startup finished in 28.320s (kernel)
>>
>> With this patch
>> ---------------
>> Startup finished in 15.621s (kernel) - 46% Improvement
>>
>
> Also very nice!
>
>> Signed-off-by: Donet Tom
>> ---
>>
>> v1->v2
>>
>> Reworked the implementation according to suggestions from
>> Mike Rapoport[1]
>>
>> [1] - https://lore.kernel.org/all/Z_j2Gv9n4DOj6LSs@kernel.org/
>>
>> v1 - https://lore.kernel.org/all/50142a29010463f436dc5c4feb540e5de3bb09df.1744175097.git.donettom@linux.ibm.com/
>> ---
>>   drivers/base/memory.c  |  4 ++--
>>   drivers/base/node.c    | 39 +++++++++++++++++++++++++++++++++++++++
>>   include/linux/memory.h |  2 ++
>>   include/linux/node.h   | 11 +++++------
>>   4 files changed, 48 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
>> index 19469e7f88c2..7f1d266ae593 100644
>> --- a/drivers/base/memory.c
>> +++ b/drivers/base/memory.c
>> @@ -60,7 +60,7 @@ static inline unsigned long pfn_to_block_id(unsigned long pfn)
>>       return memory_block_id(pfn_to_section_nr(pfn));
>>   }
>>
>> -static inline unsigned long phys_to_block_id(unsigned long phys)
>> +unsigned long phys_to_block_id(unsigned long phys)
>>   {
>>       return pfn_to_block_id(PFN_DOWN(phys));
>>   }
>> @@ -632,7 +632,7 @@ int __weak arch_get_memory_phys_device(unsigned long start_pfn)
>>    *
>>    * Called under device_hotplug_lock.
>>    */
>> -static struct memory_block *find_memory_block_by_id(unsigned long block_id)
>> +struct memory_block *find_memory_block_by_id(unsigned long block_id)
>>   {
>>       struct memory_block *mem;
>>
>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>> index cd13ef287011..4869333d366d 100644
>> --- a/drivers/base/node.c
>> +++ b/drivers/base/node.c
>> @@ -20,6 +20,7 @@
>>   #include
>>   #include
>>   #include
>> +#include
>>
>>   static const struct bus_type node_subsys = {
>>       .name = "node",
>> @@ -850,6 +851,44 @@ void unregister_memory_block_under_nodes(struct memory_block *mem_blk)
>>               kobject_name(&node_devices[mem_blk->nid]->dev.kobj));
>>   }
>>
>> +/*
>> + * register_memory_blocks_under_node_early : Register the memory
>> + *          blocks under the current node.
>> + * @nid : Current node under registration
>> + *
>> + * This function iterates over all memblock regions and identifies the regions
>> + * that belong to the current node. For each region which belongs to current
>> + * node, it calculates the start and end memory blocks based on the region's
>> + * start and end PFNs. It then registers all memory blocks within that range
>> + * under the current node.
>> + *
>> + */
>> +void register_memory_blocks_under_node_early(int nid)
>> +{
>> +    struct memblock_region *r;
>
> You almost achieved a reverse x-mas tree :)
>
>> +    unsigned long start_block_id;
>> +    unsigned long end_block_id;
>> +    struct memory_block *mem;
>> +    unsigned long block_id;
>> +
>> +    for_each_mem_region(r) {
>> +        if (r->nid == nid) {
>
> To reduce indentation
>
> if (r->nid != nid)
>     continue;

ok.
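
For illustration only, here is a minimal userspace sketch of the loop
shape with the early `continue`. It is not the kernel code: `struct
region`, `BLOCK_SIZE`, `register_blocks_for_node()` and the `registered[]`
array are hypothetical stand-ins, and the 256 MiB block size is an
assumption (the real size is architecture dependent).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct memblock_region; not the kernel type. */
struct region {
	uint64_t base;	/* physical start address */
	uint64_t size;	/* length in bytes, assumed non-zero */
	int nid;	/* NUMA node that owns this range */
};

/* Assumed fixed block size for this sketch only. */
#define BLOCK_SIZE (256ULL << 20)

/* Map a physical address to the id of the block that contains it. */
static unsigned long phys_to_block_id(uint64_t phys)
{
	return (unsigned long)(phys / BLOCK_SIZE);
}

/*
 * Mark every block touched by a region of node @nid in @registered
 * (one byte per block id). Returns the number of blocks newly marked.
 */
static size_t register_blocks_for_node(const struct region *regions,
				       size_t nr_regions, int nid,
				       unsigned char *registered,
				       size_t nr_blocks)
{
	size_t count = 0;

	for (size_t i = 0; i < nr_regions; i++) {
		const struct region *r = &regions[i];

		/* Early continue keeps the body one level shallower. */
		if (r->nid != nid)
			continue;

		assert(r->size > 0);

		/* Both ids are inclusive: the end uses the last byte. */
		const unsigned long start_block_id = phys_to_block_id(r->base);
		const unsigned long end_block_id =
			phys_to_block_id(r->base + r->size - 1);

		for (unsigned long id = start_block_id; id <= end_block_id; id++) {
			/* Stands in for the "block not found" case. */
			if (id >= nr_blocks)
				continue;

			if (!registered[id]) {
				registered[id] = 1;
				count++;
			}
		}
	}

	return count;
}
```

The point of the sketch is only the control flow: no per-PFN node lookup
is needed, because each region already carries its node id, and the block
range is derived from the region's first and last byte.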
>
>> +            start_block_id = phys_to_block_id(r->base);
>> +            end_block_id = phys_to_block_id(r->base + r->size - 1);
>
> Probably you could make them const in the for loop
>
>     const unsigned long start_block_id = phys_to_block_id(r->base);
>     const unsigned long end_block_id = phys_to_block_id(r->base + r->size - 1);

ok. I will add this change.

>
> Okay, so end is inclusive as well.

yes

>
>> +
>> +            for (block_id = start_block_id; block_id <= end_block_id; block_id++) {
>> +                mem = find_memory_block_by_id(block_id);
>> +                if (!mem)
>> +                    continue;
>> +
>> +                do_register_memory_block_under_node(nid, mem, MEMINIT_EARLY);
>