From mboxrd@z Thu Jan 1 00:00:00 1970
From: Kevin Brodsky <kevin.brodsky@arm.com>
To: linux-hardening@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, Kevin Brodsky, Andrew Morton,
	Andy Lutomirski, Catalin Marinas, Dave Hansen, David Hildenbrand,
	Ira Weiny, Jann Horn, Jeff Xu, Joey Gouly, Kees Cook, Linus Walleij,
	Lorenzo Stoakes, Marc Zyngier, Mark Brown, Matthew Wilcox,
	Maxwell Bland, "Mike Rapoport (IBM)", Peter Zijlstra, Pierre Langlois,
	Quentin Perret, Rick Edgecombe, Ryan Roberts, Thomas
	Gleixner, Vlastimil Babka, Will Deacon, Yang Shi, Yeoreum Yun,
	linux-arm-kernel@lists.infradead.org, linux-mm@kvack.org, x86@kernel.org
Subject: [PATCH v6 17/30] mm: kpkeys: Add shrinker for block pgtable allocator
Date: Fri, 27 Feb 2026 17:55:05 +0000
Message-ID: <20260227175518.3728055-18-kevin.brodsky@arm.com>
X-Mailer: git-send-email 2.51.2
In-Reply-To: <20260227175518.3728055-1-kevin.brodsky@arm.com>
References: <20260227175518.3728055-1-kevin.brodsky@arm.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

The newly introduced kpkeys block allocator does not return freed page
table pages (PTPs) to the buddy allocator, but instead caches them, so
that:

1. Future allocation requests can be quickly serviced, without calling
   alloc_pages() or set_memory_pkey().

2. Blocks are not needlessly split, since releasing a single page to
   the buddy allocator requires resetting the pkey for just that page,
   splitting the PMD into PTEs.

We cannot, however, let this cache grow indefinitely. This patch
introduces a shrinker that allows reclaiming those cached pages.

Like the rest of the allocator, the primary objective is to minimise
the splitting of blocks. Each shrinker pass (call to pba_shrink_scan())
attempts to release all free pages within a given block, or cached
pages that do not lie inside a block managed by the allocator.

In order to choose which block to shrink, we need to know how many free
pages a block contains. The approach taken here is to store that value
in the head page of each managed block and update it whenever a page is
allocated or freed. Other pages in the block are also marked with a
flag. This simplifies the shrinker pass, and the associated overhead
should be minimal.

We then scan all cached pages and find the "emptiest" block, i.e. the
one containing the most free pages. If a block is completely empty, we
release it right away, as that can be done without splitting.
Otherwise, we pay the price of splitting the block if we consider that
it has enough free pages (and there are not enough non-block free
pages).
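To make the bookkeeping concrete, here is a small standalone model of
the idea (not kernel code: the MODEL_* names and model_* helpers are
purely illustrative and do not appear in the patch). Every page of a
block is flagged as belonging to the block, and only the head page
carries the count of free pages, so the shrinker can rank blocks
without walking them:

    /*
     * Illustrative, standalone model of the per-block bookkeeping:
     * plain structs stand in for struct page, and only the head page
     * of a block carries the free-page counter.
     */
    #include <stdbool.h>
    #include <stdio.h>

    #define MODEL_BLOCK_NR_PAGES 512u	/* 2 MiB block of 4 KiB pages */

    struct model_page {
    	bool in_block;			/* page belongs to a managed block */
    	unsigned int block_nr_free;	/* used on the head page only */
    };

    static struct model_page block[MODEL_BLOCK_NR_PAGES];

    /* Whole block enters the cache: flag every page, count kept in the head. */
    static void model_mark_block_cached(void)
    {
    	for (unsigned int i = 0; i < MODEL_BLOCK_NR_PAGES; i++)
    		block[i].in_block = true;
    	block[0].block_nr_free = MODEL_BLOCK_NR_PAGES;
    }

    /* Allocating or freeing a page only touches the head page's counter. */
    static void model_alloc_page(void) { block[0].block_nr_free--; }
    static void model_free_page(void)  { block[0].block_nr_free++; }

    int main(void)
    {
    	model_mark_block_cached();
    	model_alloc_page();
    	model_alloc_page();
    	model_free_page();
    	/* The shrinker can read how "empty" the block is in O(1). */
    	printf("free pages in block: %u/%u\n",
    	       block[0].block_nr_free, MODEL_BLOCK_NR_PAGES);
    	return 0;
    }

In the patch itself this data lives in page->mapping (see the comment
added below), since page->private overlaps with struct ptdesc::ptl.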
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
Much of this patch is up for debate, with various thresholds that would
deserve to be tuned. Tracking blocks seems like a good idea to reduce
fragmentation, but it's unclear how much that helps in a real-life
scenario. Feedback welcome!
---
 mm/kpkeys_hardened_pgtables.c | 287 ++++++++++++++++++++++++++++++++++
 1 file changed, 287 insertions(+)

diff --git a/mm/kpkeys_hardened_pgtables.c b/mm/kpkeys_hardened_pgtables.c
index 223a0bb02df0..dcc5e6da7c85 100644
--- a/mm/kpkeys_hardened_pgtables.c
+++ b/mm/kpkeys_hardened_pgtables.c
@@ -143,6 +143,7 @@ void __init kpkeys_hardened_pgtables_init_late(void)
 
 #define PBA_NR_RESERVED_PAGES 4
 #define BLOCK_ORDER PMD_ORDER
+#define BLOCK_NR_PAGES (1ul << (BLOCK_ORDER))
 
 /*
  * Refilling the cache is done by attempting allocation in decreasing orders
@@ -226,6 +227,68 @@ static void __ref register_early_region(struct page *head_page,
 	pba_early_region.order = order;
 }
 
+/*
+ * Private per-page allocator data. It needs to be preserved when a page table
+ * page is allocated, so we cannot use page->private, which overlaps with
+ * struct ptdesc::ptl. page->mapping is unused in struct ptdesc so we store it
+ * there instead.
+ */
+struct pba_page_data {
+	bool in_block;
+	u32 block_nr_free;	/* Only used for the head page of a block */
+};
+
+static struct pba_page_data *page_pba_data(struct page *page)
+{
+	BUILD_BUG_ON(sizeof(struct pba_page_data) > sizeof(page->mapping));
+
+	return (struct pba_page_data *)&page->mapping;
+}
+
+static void mark_block_cached(struct page *head_page, struct page *cached_pages,
+			      unsigned int nr_cached_pages)
+{
+	page_pba_data(head_page)->in_block = true;
+	page_pba_data(head_page)->block_nr_free = nr_cached_pages;
+
+	for (unsigned int i = 0; i < nr_cached_pages; i++)
+		page_pba_data(&cached_pages[i])->in_block = true;
+}
+
+static void mark_block_noncached(struct page *head_page)
+{
+	for (unsigned int i = 0; i < BLOCK_NR_PAGES; i++)
+		head_page[i].mapping = NULL;
+}
+
+static struct page *block_head_page(struct page *page)
+{
+	unsigned long page_pfn;
+
+	if (!page_pba_data(page)->in_block)
+		return NULL;
+
+	page_pfn = page_to_pfn(page);
+
+	return pfn_to_page(ALIGN_DOWN(page_pfn, BLOCK_NR_PAGES));
+}
+
+static void inc_block_nr_free(struct page *page)
+{
+	struct page *head_page = block_head_page(page);
+
+	if (head_page)
+		page_pba_data(head_page)->block_nr_free++;
+}
+
+static void dec_block_nr_free(struct page *page)
+{
+	struct page *head_page = block_head_page(page);
+
+	if (head_page)
+		page_pba_data(head_page)->block_nr_free--;
+}
+
 static void cached_list_add_pages(struct page *page, unsigned int nr_pages)
 {
 	struct pkeys_block_allocator *pba = &pkeys_block_allocator;
@@ -248,6 +311,7 @@ static void __refill_pages_add_to_cache(struct page *page, unsigned int order,
 					bool alloc_one)
 {
 	struct pkeys_block_allocator *pba = &pkeys_block_allocator;
+	struct page *head_page = page;
 	unsigned int nr_pages = 1 << order;
 
 	if (alloc_one) {
@@ -255,6 +319,9 @@ static void __refill_pages_add_to_cache(struct page *page, unsigned int order,
 		nr_pages--;
 	}
 
+	if (order == BLOCK_ORDER)
+		mark_block_cached(head_page, page, nr_pages);
+
 	guard(spinlock_bh)(&pba->lock);
 
 	cached_list_add_pages(page, nr_pages);
@@ -309,6 +376,56 @@ static struct page *refill_pages_and_alloc_one(void)
 	return __refill_pages(true);
 }
 
+static unsigned long release_page_list(struct list_head *page_list)
+{
+	struct pkeys_block_allocator *pba = &pkeys_block_allocator;
+	unsigned long nr_freed = 0;
+	struct page *page, *tmp;
+
+	/* _safe is required because __free_page() overwrites page->lru */
+	list_for_each_entry_safe(page, tmp, page_list, lru) {
+		int ret = 0;
+
+		ret = set_pkey_default(page, 1);
+
+		if (ret) {
+			guard(spinlock_bh)(&pba->lock);
+			cached_list_add_pages(page, 1);
+			break;
+		}
+
+		__free_page(page);
+		nr_freed++;
+	}
+
+	return nr_freed;
+}
+
+static unsigned long release_whole_block(struct list_head *page_list,
+					 struct page *block_head)
+{
+	struct pkeys_block_allocator *pba = &pkeys_block_allocator;
+	unsigned long nr_freed = 0;
+	struct page *page, *tmp;
+	int ret;
+
+	/* Reset the pkey for the full block to avoid splitting the linear map */
+	ret = set_pkey_default(block_head, BLOCK_NR_PAGES);
+
+	if (ret) {
+		guard(spinlock_bh)(&pba->lock);
+		cached_list_add_pages(block_head, BLOCK_NR_PAGES);
+		return 0;
+	}
+
+	list_for_each_entry_safe(page, tmp, page_list, lru) {
+		__free_page(page);
+		nr_freed++;
+	}
+
+	return nr_freed;
+}
+
 static bool cached_page_available(gfp_t gfp)
 {
 	struct pkeys_block_allocator *pba = &pkeys_block_allocator;
@@ -337,6 +454,7 @@ static struct page *get_cached_page(gfp_t gfp)
 		return NULL;
 
 	cached_list_del_page(page);
+	dec_block_nr_free(page);
 
 	return page;
 }
@@ -409,6 +527,7 @@ static void pba_pgtable_free(struct page *page)
 	guard(spinlock_bh)(&pba->lock);
 
 	cached_list_add_pages(page, 1);
+	inc_block_nr_free(page);
 }
 
 static int pba_prepare_direct_map_split(void)
@@ -464,3 +583,171 @@ static void __init pba_init_late(void)
 	set_pkey_pgtable(pba_early_region.head_page,
 			 1 << pba_early_region.order);
 }
+
+/* Shrinker */
+
+/* Keep some pages around to avoid shrinking causing a refill right away */
+#define PBA_UNSHRINKABLE_PAGES 16
+/* Don't shrink a block that is almost full to avoid excessive splitting */
+#define PBA_SHRINK_BLOCK_MIN_PAGES (BLOCK_NR_PAGES / 8)
+
+static unsigned long count_shrinkable_pages(void)
+{
+	struct pkeys_block_allocator *pba = &pkeys_block_allocator;
+	unsigned long nr_cached = READ_ONCE(pba->nr_cached);
+
+	return nr_cached > PBA_UNSHRINKABLE_PAGES ?
+			nr_cached - PBA_UNSHRINKABLE_PAGES : 0;
+}
+
+static unsigned long pba_shrink_count(struct shrinker *shrink,
+				      struct shrink_control *sc)
+{
+
+	return count_shrinkable_pages() ?: SHRINK_EMPTY;
+}
+
+static bool block_worth_shrinking(unsigned long nr_pages_target_block,
+				  unsigned long nr_pages_nonblock,
+				  struct shrink_control *sc)
+{
+	/*
+	 * Avoid partially shrinking a block (which means splitting it) if
+	 * we can reclaim enough/more non-block pages instead, or if we would
+	 * reclaim only few pages (below PBA_SHRINK_BLOCK_MIN_PAGES)
+	 */
+	return nr_pages_nonblock < nr_pages_target_block &&
+	       nr_pages_nonblock < sc->nr_to_scan &&
+	       nr_pages_target_block >= PBA_SHRINK_BLOCK_MIN_PAGES;
+}
+
+static unsigned long pba_shrink_scan(struct shrinker *shrink,
+				     struct shrink_control *sc)
+{
+	struct pkeys_block_allocator *pba = &pkeys_block_allocator;
+	LIST_HEAD(pages_to_free);
+	struct page *page, *tmp;
+	unsigned long nr_pages_nonblock = 0, nr_pages_target_block = 0;
+	unsigned long nr_pages_uncached = 0, nr_freed = 0;
+	unsigned long nr_pages_shrinkable;
+	struct page *target_block = NULL;
+
+	sc->nr_scanned = 0;
+
+	pr_debug("%s: nr_to_scan = %lu, nr_cached = %lu\n",
+		 __func__, sc->nr_to_scan, pba->nr_cached);
+
+	spin_lock_bh(&pba->lock);
+	nr_pages_shrinkable = count_shrinkable_pages();
+
+	/*
+	 * Count pages that don't belong to any block, and find the block
+	 * with the highest number of free pages
+	 */
+	list_for_each_entry(page, &pba->cached_list, lru) {
+		struct page *block = block_head_page(page);
+		unsigned long block_nr_free;
+
+		if (!block) {
+			nr_pages_nonblock++;
+			continue;
+		}
+
+		block_nr_free = page_pba_data(block)->block_nr_free;
+
+		if (block_nr_free > nr_pages_target_block) {
+			target_block = block;
+			nr_pages_target_block = block_nr_free;
+		}
+
+		/* We will free this block, so no need to continue scanning */
+		if (nr_pages_target_block == BLOCK_NR_PAGES)
+			break;
+	}
+
+	if (nr_pages_target_block == BLOCK_NR_PAGES) {
+		/*
+		 * If a whole block is empty, take the opportunity to free it
+		 * completely (regardless of the requested nr_to_scan) to avoid
+		 * splitting the linear map. If nr_pages_shrinkable is too low,
+		 * we bail out as we would have to split the block to shrink it
+		 * partially (and there is nothing else we can shrink).
+		 */
+		if (nr_pages_shrinkable < BLOCK_NR_PAGES) {
+			spin_unlock_bh(&pba->lock);
+			pr_debug("%s: cannot free empty block, bailing out\n",
+				 __func__);
+			goto out;
+		}
+
+		sc->nr_to_scan = BLOCK_NR_PAGES;
+	} else if (block_worth_shrinking(nr_pages_target_block,
+					 nr_pages_nonblock, sc)) {
+		/* Shrink block (partially) */
+		sc->nr_to_scan = min(sc->nr_to_scan, nr_pages_target_block);
+	} else {
+		/* Free non-block pages */
+		sc->nr_to_scan = min(sc->nr_to_scan, nr_pages_nonblock);
+		target_block = NULL;
+	}
+
+	list_for_each_entry_safe(page, tmp, &pba->cached_list, lru) {
+		struct page *block = block_head_page(page);
+
+		if (!(nr_pages_uncached < sc->nr_to_scan &&
+		      nr_pages_uncached < nr_pages_shrinkable))
+			break;
+
+		if (block == target_block) {
+			list_move(&page->lru, &pages_to_free);
+			nr_pages_uncached++;
+		}
+	}
+
+	pba->nr_cached -= nr_pages_uncached;
+	sc->nr_scanned = nr_pages_uncached;
+
+	if (target_block)
+		mark_block_noncached(target_block);
+	spin_unlock_bh(&pba->lock);
+
+	if (target_block)
+		pr_debug("%s: freeing block (pfn = %lx, %lu/%lu free pages)\n",
+			 __func__, page_to_pfn(target_block),
+			 nr_pages_target_block, BLOCK_NR_PAGES);
+	else
+		pr_debug("%s: freeing non-block (%lu free pages)\n",
+			 __func__, nr_pages_nonblock);
+
+	if (nr_pages_target_block == BLOCK_NR_PAGES) {
+		VM_WARN_ON(nr_pages_uncached != BLOCK_NR_PAGES);
+		nr_freed = release_whole_block(&pages_to_free, target_block);
+	} else {
+		nr_freed = release_page_list(&pages_to_free);
+	}
+
+	pr_debug("%s: freed %lu pages, nr_cached = %lu\n", __func__,
+		 nr_freed, pba->nr_cached);
+out:
+	return nr_freed ?: SHRINK_STOP;
+}
+
+static int __init pba_init_shrinker(void)
+{
+	struct shrinker *shrinker;
+
+	if (!pba_enabled())
+		return 0;
+
+	shrinker = shrinker_alloc(0, "kpkeys-pgtable-block");
+	if (!shrinker)
+		return -ENOMEM;
+
+	shrinker->count_objects = pba_shrink_count;
+	shrinker->scan_objects = pba_shrink_scan;
+	shrinker->seeks = 0;
+	shrinker->batch = BLOCK_NR_PAGES;
+	shrinker_register(shrinker);
+	return 0;
+}
+late_initcall(pba_init_shrinker);
-- 
2.51.2