From: Pasha Tatashin <pasha.tatashin@soleen.com>
Date: Mon, 22 Dec 2025 10:55:34 -0500
Subject: Re: [PATCH] kho: add support for deferred struct page init
To: Pratyush Yadav
Cc: Mike Rapoport, Evangelos Petrongonas, Alexander Graf, Andrew Morton,
 Jason Miu, linux-kernel@vger.kernel.org, kexec@lists.infradead.org,
 linux-mm@kvack.org, nh-open-source@amazon.com
In-Reply-To: <86jyyecyzh.fsf@kernel.org>
References: <20251216084913.86342-1-epetron@amazon.de> <861pkpkffh.fsf@kernel.org> <86jyyecyzh.fsf@kernel.org>

> > NUMA node boundaries are SECTION_SIZE aligned. Since SECTION_SIZE is
> > larger than MAX_PAGE_ORDER it is mathematically impossible for a
> > single chunk to span multiple nodes.
>
> For folios, yes. The whole folio should only be in a single node. But we
> also have kho_preserve_pages() (formerly kho_preserve_phys()) which can
> be used to preserve an arbitrary size of memory and _that_ doesn't have
> to be in the same section. And if the memory is properly aligned, then
> it will end up being just one higher-order preservation in KHO.

To restore both pages and folios we use kho_restore_page(), which has
the following check:

/*
 * deserialize_bitmap() only sets the magic on the head page. This magic
 * check also implicitly makes sure phys is order-aligned since for
 * non-order-aligned phys addresses, magic will never be set.
 */
if (WARN_ON_ONCE(info.magic != KHO_PAGE_MAGIC ||
		 info.order > MAX_PAGE_ORDER))
	return NULL;

My understanding is that the head page order can never exceed
MAX_PAGE_ORDER, which is why I am saying it will be smaller than
SECTION_SIZE. With HugeTLB the order can be more than MAX_PAGE_ORDER,
but in that case it still has to be within a single NID, since a huge
page cannot be split across multiple nodes.

> >> > This approach seems to give us the best of both worlds: It avoids the
> >> > memblock dependency during restoration. It keeps the serial work in
> >> > deserialize_bitmap() to a minimum (O(1) per region). It allows the
> >> > heavy lifting of tail page initialization to be done later in the boot
> >> > process, potentially in parallel, as you suggested.
> >>
> >> Here's another idea I have been thinking about, but never dug deep
> >> enough to figure out if it actually works.
> >>
> >> __init_page_from_nid() loops through all the zones for the node to find
> >> the zone id for the page. We can flip it the other way round and loop
> >> through all zones (on all nodes) to find the zone that spans the PFN.
> >> Once we find the zone, we can directly call __init_single_page() on it.
> >> If a contiguous chunk of preserved memory lands in one zone, we can
> >> batch the init to save some time.
> >>
> >> Something like the below (completely untested):
> >>
> >> static void kho_init_page(struct page *page)
> >> {
> >> 	unsigned long pfn = page_to_pfn(page);
> >> 	struct zone *zone;
> >>
> >> 	for_each_zone(zone) {
> >> 		if (zone_spans_pfn(zone, pfn))
> >> 			break;
> >> 	}
> >>
> >> 	__init_single_page(page, pfn, zone_idx(zone), zone_to_nid(zone));
> >> }
> >>
> >> It doesn't do the batching I mentioned, but I think it at least gets
> >> the point across. And I think even this simple version would be a good
> >> first step.
> >>
> >> This lets us initialize the page from kho_restore_folio() without
> >> having to rely on memblock being alive, and saves us from doing work
> >> during early boot. We should only have a handful of zones and nodes in
> >> practice, so I think it should perform fairly well too.
> >>
> >> We would of course need to see how it performs in practice. If it
> >> works, I think it would be cleaner and simpler than splitting the
> >> initialization into two separate parts.
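
Just to make the batching you mention concrete, I imagine it would look
roughly like the sketch below: do the zone lookup once per run and then
walk PFNs until either the run or the zone ends, so the lookup cost is
amortized over a contiguous preservation. Completely untested, and
kho_init_pages() is only a placeholder name:

static void kho_init_pages(struct page *page, unsigned long nr_pages)
{
	unsigned long pfn = page_to_pfn(page);
	unsigned long end_pfn = pfn + nr_pages;

	while (pfn < end_pfn) {
		unsigned long zone_end;
		struct zone *zone;

		/* Assumes every preserved PFN is spanned by some zone. */
		for_each_zone(zone) {
			if (zone_spans_pfn(zone, pfn))
				break;
		}

		/* Initialize everything up to the zone (or run) boundary. */
		zone_end = min(end_pfn, zone_end_pfn(zone));
		for (; pfn < zone_end; pfn++)
			__init_single_page(pfn_to_page(pfn), pfn,
					   zone_idx(zone), zone_to_nid(zone));
	}
}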

> > I think your idea is clever and would work. However, consider the
> > cache efficiency: in deserialize_bitmap(), we must write to the head
> > struct page anyway to preserve the order. Since we are already
> > bringing that 64-byte cacheline in and dirtying it, and since memblock
> > is available and fast at this stage, it makes sense to fully
> > initialize the head page right then.
>
> You will also bring in the cache line and dirty it during
> kho_restore_folio() since you need to write the page refcounts. So I
> don't think the cache efficiency makes any difference between either
> approach.
>
> > If we do that, we get the nid for "free" (cache-wise) and we avoid the
> > overhead of iterating zones during the restore phase. We can then
> > simply inherit the nid from the head page when initializing the tail
> > pages later.
>
> To get the nid, you would need to call early_pfn_to_nid(). This takes a
> spinlock and searches through all memblock memory regions. I don't think
> it is too expensive, but it isn't free either. And all this would be
> done serially. With the zone search, you at least have some room for
> concurrency.
>
> I think either approach only makes a difference when we have a large
> number of low-order preservations. If we have a handful of high-order
> preservations, I suppose the overhead of the nid search would be
> negligible.

We should be targeting a situation where the vast majority of the
preserved memory is HugeTLB, but I am still worried about the efficiency
of lower-order preservations for IOMMU page tables, etc.

> Long term, I think we should hook this into page_alloc_init_late() so
> that all the KHO pages also get initialized along with all the other
> pages. This will result in better integration of KHO with the rest of
> MM init, and also give more consistent page restore performance.

But we keep KHO memory as reserved memory, and hooking it up into
page_alloc_init_late() would make it very different, since the memory
initialized there is part of the buddy allocator...

> Jason's radix tree patches will make that a bit easier to do I think.
> The zone search will scale better I reckon.

It could. Perhaps early in boot we should reserve the radix tree and use
it as a source of truth for look-ups later in boot?
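
To make the two-step idea above a bit more concrete, the tail-page side
could be as simple as reading the zone and nid back out of the already
initialized head page. Again an untested sketch, and
kho_init_tail_pages() is just a made-up name:

static void kho_init_tail_pages(struct page *head, unsigned int order)
{
	unsigned long pfn = page_to_pfn(head);
	int nid = page_to_nid(head);
	enum zone_type zid = page_zonenum(head);
	unsigned long i;

	/* Assumes the head page was fully set up in deserialize_bitmap(). */
	for (i = 1; i < (1UL << order); i++)
		__init_single_page(pfn_to_page(pfn + i), pfn + i, zid, nid);
}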