From: Pankaj Raghav
To: Suren Baghdasaryan, Ryan Roberts, Mike Rapoport, Michal Hocko,
	Thomas Gleixner, Nico Pache, Dev Jain, Baolin Wang, Borislav Petkov,
	Ingo Molnar, "H. Peter Anvin", Vlastimil Babka, Zi Yan, Dave Hansen,
	David Hildenbrand, Lorenzo Stoakes, Andrew Morton,
	"Liam R. Howlett", Jens Axboe
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, willy@infradead.org,
	x86@kernel.org, linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	"Darrick J. Wong", mcgrof@kernel.org, gost.dev@samsung.com,
	kernel@pankajraghav.com, hch@lst.de, Pankaj Raghav
Subject: [PATCH 3/5] mm: add static PMD zero page
Date: Thu, 12 Jun 2025 12:50:58 +0200
Message-ID: <20250612105100.59144-4-p.raghav@samsung.com>
In-Reply-To: <20250612105100.59144-1-p.raghav@samsung.com>
References: <20250612105100.59144-1-p.raghav@samsung.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

There are many places in the kernel where we need to zero out larger
chunks, but the maximum segment we can zero out at a time with ZERO_PAGE
is limited to PAGE_SIZE. This is especially annoying in block devices
and filesystems, where we attach multiple ZERO_PAGEs to the bio in
different bvecs. With multipage bvec support in the block layer, it is
much more efficient to send out a larger zero page as part of a single
bvec. This concern was raised during the review of adding LBS support
to XFS[1][2].

Usually the huge_zero_folio is allocated on demand, and it is deallocated
by the shrinker once there are no users of it left. Add a config option
STATIC_PMD_ZERO_PAGE that always allocates the huge_zero_folio in .bss,
so it is never freed. This makes it possible to use the huge_zero_folio
without having to pass any mm struct and without calling put_folio in the
destructor.

As STATIC_PMD_ZERO_PAGE does not depend on THP, declare huge_zero_folio
and huge_zero_pfn outside of the THP ifdef. It can only be enabled on
x86_64, but it is an optional config. We could expand it to more
architectures in the future.

[1] https://lore.kernel.org/linux-xfs/20231027051847.GA7885@lst.de/
[2] https://lore.kernel.org/linux-xfs/ZitIK5OnR7ZNY0IG@infradead.org/
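For illustration only (not part of this patch), here is a rough caller-side
sketch of the bvec overhead described above. fill_bio_with_zeroes() is a
hypothetical helper loosely modelled on the loop in __blkdev_issue_zero_pages();
bio_add_page(), ZERO_PAGE(), PMD_SIZE and SECTOR_SHIFT are existing kernel
interfaces. With ZERO_PAGE(0) every PAGE_SIZE chunk needs its own bvec, while a
PMD-sized zero page lets a single bvec cover up to PMD_SIZE:

#include <linux/bio.h>
#include <linux/mm.h>

/*
 * Hypothetical helper: queue 'nr_sects' sectors worth of zeroes into a bio
 * using 'zero_page', adding at most 'max_len' bytes per bvec. Passing
 * ZERO_PAGE(0) with max_len == PAGE_SIZE models today's behaviour; a
 * PMD-sized zero page with max_len == PMD_SIZE needs far fewer bvecs.
 */
static sector_t fill_bio_with_zeroes(struct bio *bio, sector_t nr_sects,
				     struct page *zero_page,
				     unsigned int max_len)
{
	while (nr_sects) {
		unsigned int len = min_t(sector_t, max_len,
					 nr_sects << SECTOR_SHIFT);

		/* bio_add_page() returns the number of bytes it accepted. */
		if (bio_add_page(bio, zero_page, len, 0) != len)
			break;
		nr_sects -= len >> SECTOR_SHIFT;
	}
	return nr_sects;	/* sectors that did not fit into this bio */
}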
Suggested-by: David Hildenbrand
Signed-off-by: Pankaj Raghav
---
Questions:
- Can we call __split_huge_zero_page_pmd() on static PMD page?

 arch/x86/Kconfig               |  1 +
 arch/x86/include/asm/pgtable.h |  8 ++++++++
 arch/x86/kernel/head_64.S      |  8 ++++++++
 include/linux/mm.h             | 16 +++++++++++++++-
 mm/Kconfig                     | 13 +++++++++++++
 mm/huge_memory.c               | 24 ++++++++++++++++++++----
 mm/memory.c                    | 19 +++++++++++++++++++
 7 files changed, 84 insertions(+), 5 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 340e5468980e..c3a9d136ec0a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -153,6 +153,7 @@ config X86
 	select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP	if X86_64
 	select ARCH_WANT_HUGETLB_VMEMMAP_PREINIT	if X86_64
 	select ARCH_WANTS_THP_SWAP		if X86_64
+	select ARCH_HAS_STATIC_PMD_ZERO_PAGE	if X86_64
 	select ARCH_HAS_PARANOID_L1D_FLUSH
 	select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
 	select BUILDTIME_TABLE_SORT

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 774430c3abff..7013a7d26da5 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -47,6 +47,14 @@ void ptdump_walk_user_pgd_level_checkwx(void);
 #define debug_checkwx_user()	do { } while (0)
 #endif
 
+#ifdef CONFIG_STATIC_PMD_ZERO_PAGE
+/*
+ * PMD_ZERO_PAGE is a global shared PMD page that is always zero.
+ */
+extern unsigned long empty_pmd_zero_page[(PMD_SIZE) / sizeof(unsigned long)]
+	__visible;
+#endif
+
 /*
  * ZERO_PAGE is a global shared page that is always zero: used
  * for zero-mapped memory areas etc..

diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 3e9b3a3bd039..86aaa53fd619 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -714,6 +714,14 @@ EXPORT_SYMBOL(phys_base)
 #include "../xen/xen-head.S"
 
 	__PAGE_ALIGNED_BSS
+
+#ifdef CONFIG_STATIC_PMD_ZERO_PAGE
+SYM_DATA_START_PAGE_ALIGNED(empty_pmd_zero_page)
+	.skip PMD_SIZE
+SYM_DATA_END(empty_pmd_zero_page)
+EXPORT_SYMBOL(empty_pmd_zero_page)
+#endif
+
 SYM_DATA_START_PAGE_ALIGNED(empty_zero_page)
 	.skip PAGE_SIZE
 SYM_DATA_END(empty_zero_page)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c8fbeaacf896..b20d60d68b3c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4018,10 +4018,10 @@ static inline bool vma_is_special_huge(const struct vm_area_struct *vma)
 
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 extern struct folio *huge_zero_folio;
 extern unsigned long huge_zero_pfn;
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline bool is_huge_zero_folio(const struct folio *folio)
 {
 	return READ_ONCE(huge_zero_folio) == folio;
@@ -4032,9 +4032,23 @@ static inline bool is_huge_zero_pmd(pmd_t pmd)
 	return pmd_present(pmd) && READ_ONCE(huge_zero_pfn) == pmd_pfn(pmd);
 }
 
+#ifdef CONFIG_STATIC_PMD_ZERO_PAGE
+static inline struct folio *mm_get_huge_zero_folio(struct mm_struct *mm)
+{
+	return READ_ONCE(huge_zero_folio);
+}
+
+static inline void mm_put_huge_zero_folio(struct mm_struct *mm)
+{
+	return;
+}
+
+#else
 struct folio *mm_get_huge_zero_folio(struct mm_struct *mm);
 void mm_put_huge_zero_folio(struct mm_struct *mm);
+#endif /* CONFIG_STATIC_PMD_ZERO_PAGE */
+
 #else
 static inline bool is_huge_zero_folio(const struct folio *folio)
 {

diff --git a/mm/Kconfig b/mm/Kconfig
index 781be3240e21..fd1c51995029 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -826,6 +826,19 @@ config ARCH_WANTS_THP_SWAP
 config MM_ID
 	def_bool n
 
+config ARCH_HAS_STATIC_PMD_ZERO_PAGE
+	def_bool n
+
+config STATIC_PMD_ZERO_PAGE
+	bool "Allocate a PMD page for zeroing"
+	depends on ARCH_HAS_STATIC_PMD_ZERO_PAGE
+	help
+	  Typically huge_zero_folio, which is a PMD page of zeroes, is
+	  allocated on demand and deallocated when not in use. This option
+	  will allocate a PMD-sized zero page in .bss, and huge_zero_folio
+	  will use it instead of allocating it dynamically.
+	  Not suitable for memory constrained systems.
+
 menuconfig TRANSPARENT_HUGEPAGE
 	bool "Transparent Hugepage Support"
 	depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE && !PREEMPT_RT

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 101b67ab2eb6..c12ca7134e88 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -75,9 +75,6 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
 					 struct shrink_control *sc);
 static bool split_underused_thp = true;
 
-static atomic_t huge_zero_refcount;
-struct folio *huge_zero_folio __read_mostly;
-unsigned long huge_zero_pfn __read_mostly = ~0UL;
 unsigned long huge_anon_orders_always __read_mostly;
 unsigned long huge_anon_orders_madvise __read_mostly;
 unsigned long huge_anon_orders_inherit __read_mostly;
@@ -208,6 +205,23 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
 	return orders;
 }
 
+#ifdef CONFIG_STATIC_PMD_ZERO_PAGE
+static int huge_zero_page_shrinker_init(void)
+{
+	return 0;
+}
+
+static void huge_zero_page_shrinker_exit(void)
+{
+	return;
+}
+#else
+
+static struct shrinker *huge_zero_page_shrinker;
+static atomic_t huge_zero_refcount;
+struct folio *huge_zero_folio __read_mostly;
+unsigned long huge_zero_pfn __read_mostly = ~0UL;
+
 static bool get_huge_zero_page(void)
 {
 	struct folio *zero_folio;
@@ -288,7 +302,6 @@ static unsigned long shrink_huge_zero_page_scan(struct shrinker *shrink,
 	return 0;
 }
 
-static struct shrinker *huge_zero_page_shrinker;
 static int huge_zero_page_shrinker_init(void)
 {
 	huge_zero_page_shrinker = shrinker_alloc(0, "thp-zero");
@@ -307,6 +320,7 @@ static void huge_zero_page_shrinker_exit(void)
 	return;
 }
+#endif
 
 #ifdef CONFIG_SYSFS
 static ssize_t enabled_show(struct kobject *kobj,
@@ -2843,6 +2857,8 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
 	pte_t *pte;
 	int i;
 
+	// FIXME: can this be called with static zero page?
+	VM_BUG_ON(IS_ENABLED(CONFIG_STATIC_PMD_ZERO_PAGE));
 	/*
 	 * Leave pmd empty until pte is filled note that it is fine to delay
 	 * notification until mmu_notifier_invalidate_range_end() as we are

diff --git a/mm/memory.c b/mm/memory.c
index 8eba595056fe..77721f5ae043 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -159,6 +159,25 @@ static int __init init_zero_pfn(void)
 }
 early_initcall(init_zero_pfn);
 
+#ifdef CONFIG_STATIC_PMD_ZERO_PAGE
+struct folio *huge_zero_folio __read_mostly;
+unsigned long huge_zero_pfn __read_mostly = ~0UL;
+
+static int __init init_pmd_zero_pfn(void)
+{
+	huge_zero_folio = virt_to_folio(empty_pmd_zero_page);
+	huge_zero_pfn = page_to_pfn(virt_to_page(empty_pmd_zero_page));
+
+	__folio_set_head(huge_zero_folio);
+	prep_compound_head((struct page *)huge_zero_folio, PMD_ORDER);
+	/* Ensure zero folio won't have large_rmappable flag set. */
+	folio_clear_large_rmappable(huge_zero_folio);
+
+	return 0;
+}
+early_initcall(init_pmd_zero_pfn);
+#endif
+
 void mm_trace_rss_stat(struct mm_struct *mm, int member)
 {
 	trace_rss_stat(mm, member);
-- 
2.49.0
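As a closing illustration of the simplified lifetime contract described in the
commit message (editorial sketch, not part of the patch): under
CONFIG_STATIC_PMD_ZERO_PAGE the folio lives in .bss for the whole uptime, so a
caller needs neither an mm_struct nor a matching put in its destructor.
add_zero_bvec() below is hypothetical; mm_get_huge_zero_folio(), folio_page()
and bio_add_page() are existing interfaces, with the first becoming a plain
READ_ONCE() in this configuration:

#include <linux/bio.h>
#include <linux/mm.h>

/*
 * Hypothetical caller, valid only with CONFIG_STATIC_PMD_ZERO_PAGE:
 * mm_get_huge_zero_folio() ignores its mm argument and simply returns
 * the static folio, and mm_put_huge_zero_folio() is a no-op, so there
 * is nothing to drop when the bio completes.
 */
static int add_zero_bvec(struct bio *bio, unsigned int len)
{
	struct folio *zero_folio = mm_get_huge_zero_folio(NULL);

	if (WARN_ON_ONCE(len > PMD_SIZE))
		return -EINVAL;
	/* One bvec covers the whole range, up to PMD_SIZE bytes. */
	if (bio_add_page(bio, folio_page(zero_folio, 0), len, 0) != len)
		return -ENOMEM;
	return 0;
}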