From: Kevin Brodsky <kevin.brodsky@arm.com>
To: linux-hardening@vger.kernel.org
Cc: linux-kernel@vger.kernel.org,
Kevin Brodsky <kevin.brodsky@arm.com>,
Andrew Morton <akpm@linux-foundation.org>,
Andy Lutomirski <luto@kernel.org>,
Catalin Marinas <catalin.marinas@arm.com>,
Dave Hansen <dave.hansen@linux.intel.com>,
David Hildenbrand <david@redhat.com>,
Ira Weiny <ira.weiny@intel.com>, Jann Horn <jannh@google.com>,
Jeff Xu <jeffxu@chromium.org>, Joey Gouly <joey.gouly@arm.com>,
Kees Cook <kees@kernel.org>,
Linus Walleij <linus.walleij@linaro.org>,
Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
Marc Zyngier <maz@kernel.org>, Mark Brown <broonie@kernel.org>,
Matthew Wilcox <willy@infradead.org>,
Maxwell Bland <mbland@motorola.com>,
"Mike Rapoport (IBM)" <rppt@kernel.org>,
Peter Zijlstra <peterz@infradead.org>,
Pierre Langlois <pierre.langlois@arm.com>,
Quentin Perret <qperret@google.com>,
Rick Edgecombe <rick.p.edgecombe@intel.com>,
Ryan Roberts <ryan.roberts@arm.com>,
Thomas Gleixner <tglx@linutronix.de>,
Vlastimil Babka <vbabka@suse.cz>, Will Deacon <will@kernel.org>,
Yang Shi <yang@os.amperecomputing.com>,
Yeoreum Yun <yeoreum.yun@arm.com>,
linux-arm-kernel@lists.infradead.org, linux-mm@kvack.org,
x86@kernel.org
Subject: [PATCH v6 17/30] mm: kpkeys: Add shrinker for block pgtable allocator
Date: Fri, 27 Feb 2026 17:55:05 +0000
Message-ID: <20260227175518.3728055-18-kevin.brodsky@arm.com>
In-Reply-To: <20260227175518.3728055-1-kevin.brodsky@arm.com>

The newly introduced kpkeys block allocator does not return freed
page table pages (PTPs) to the buddy allocator, but instead caches
them, so that:

1. Future allocation requests can be serviced quickly, without
   calling alloc_pages() or set_memory_pkey().
2. Blocks are not needlessly split: releasing a single page to the
   buddy allocator requires resetting the pkey for just that page,
   which splits the PMD into PTEs.
We cannot, however, let this cache grow indefinitely. This patch
introduces a shrinker that allows those cached pages to be reclaimed.

As in the rest of the allocator, the primary objective is to minimise
the splitting of blocks. Each shrinker pass (call to
pba_shrink_scan()) attempts to release either all free pages within a
given block, or cached pages that do not lie inside a block managed
by the allocator.
In order to choose which block to shrink, we need to know how many
free pages each block contains. The approach taken here is to store
that count in the head page of each managed block and to update it
whenever a page is allocated or freed. The other pages in the block
are also marked with a flag. This simplifies the shrinker pass, and
the associated overhead should be minimal.
Each pass then scans all cached pages and finds the "emptiest" block,
i.e. the one containing the most free pages. If a block is completely
empty, we release it right away, as that can be done without
splitting. Otherwise, we pay the price of splitting the block only if
we consider that it has enough free pages (and there are not enough
non-block free pages).
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
Much of this patch is up for debate, with various thresholds that
deserve tuning. Tracking blocks seems like a good way to reduce
fragmentation, but it is unclear how much it helps in real-life
scenarios. Feedback welcome!
---
mm/kpkeys_hardened_pgtables.c | 287 ++++++++++++++++++++++++++++++++++
1 file changed, 287 insertions(+)
diff --git a/mm/kpkeys_hardened_pgtables.c b/mm/kpkeys_hardened_pgtables.c
index 223a0bb02df0..dcc5e6da7c85 100644
--- a/mm/kpkeys_hardened_pgtables.c
+++ b/mm/kpkeys_hardened_pgtables.c
@@ -143,6 +143,7 @@ void __init kpkeys_hardened_pgtables_init_late(void)
#define PBA_NR_RESERVED_PAGES 4
#define BLOCK_ORDER PMD_ORDER
+#define BLOCK_NR_PAGES (1ul << (BLOCK_ORDER))
/*
* Refilling the cache is done by attempting allocation in decreasing orders
@@ -226,6 +227,68 @@ static void __ref register_early_region(struct page *head_page,
pba_early_region.order = order;
}
+/*
+ * Private per-page allocator data. It needs to be preserved when a page table
+ * page is allocated, so we cannot use page->private, which overlaps with
+ * struct ptdesc::ptl. page->mapping is unused in struct ptdesc so we store it
+ * there instead.
+ */
+struct pba_page_data {
+ bool in_block;
+ u32 block_nr_free; /* Only used for the head page of a block */
+};
+
+static struct pba_page_data *page_pba_data(struct page *page)
+{
+ BUILD_BUG_ON(sizeof(struct pba_page_data) > sizeof(page->mapping));
+
+ return (struct pba_page_data *)&page->mapping;
+}
+
+static void mark_block_cached(struct page *head_page, struct page *cached_pages,
+ unsigned int nr_cached_pages)
+{
+ page_pba_data(head_page)->in_block = true;
+ page_pba_data(head_page)->block_nr_free = nr_cached_pages;
+
+ for (unsigned int i = 0; i < nr_cached_pages; i++)
+ page_pba_data(&cached_pages[i])->in_block = true;
+}
+
+static void mark_block_noncached(struct page *head_page)
+{
+ for (unsigned int i = 0; i < BLOCK_NR_PAGES; i++)
+ head_page[i].mapping = NULL;
+}
+
+static struct page *block_head_page(struct page *page)
+{
+ unsigned long page_pfn;
+
+ if (!page_pba_data(page)->in_block)
+ return NULL;
+
+ page_pfn = page_to_pfn(page);
+
+ return pfn_to_page(ALIGN_DOWN(page_pfn, BLOCK_NR_PAGES));
+}
+
+static void inc_block_nr_free(struct page *page)
+{
+ struct page *head_page = block_head_page(page);
+
+ if (head_page)
+ page_pba_data(head_page)->block_nr_free++;
+}
+
+static void dec_block_nr_free(struct page *page)
+{
+ struct page *head_page = block_head_page(page);
+
+ if (head_page)
+ page_pba_data(head_page)->block_nr_free--;
+}
+
static void cached_list_add_pages(struct page *page, unsigned int nr_pages)
{
struct pkeys_block_allocator *pba = &pkeys_block_allocator;
@@ -248,6 +311,7 @@ static void __refill_pages_add_to_cache(struct page *page, unsigned int order,
bool alloc_one)
{
struct pkeys_block_allocator *pba = &pkeys_block_allocator;
+ struct page *head_page = page;
unsigned int nr_pages = 1 << order;
if (alloc_one) {
@@ -255,6 +319,9 @@ static void __refill_pages_add_to_cache(struct page *page, unsigned int order,
nr_pages--;
}
+ if (order == BLOCK_ORDER)
+ mark_block_cached(head_page, page, nr_pages);
+
guard(spinlock_bh)(&pba->lock);
cached_list_add_pages(page, nr_pages);
@@ -309,6 +376,56 @@ static struct page *refill_pages_and_alloc_one(void)
return __refill_pages(true);
}
+static unsigned long release_page_list(struct list_head *page_list)
+{
+ struct pkeys_block_allocator *pba = &pkeys_block_allocator;
+ unsigned long nr_freed = 0;
+ struct page *page, *tmp;
+
+ /* _safe is required because __free_page() overwrites page->lru */
+ list_for_each_entry_safe(page, tmp, page_list, lru) {
+ int ret = 0;
+
+ ret = set_pkey_default(page, 1);
+
+ if (ret) {
+ guard(spinlock_bh)(&pba->lock);
+ cached_list_add_pages(page, 1);
+ break;
+ }
+
+ __free_page(page);
+ nr_freed++;
+ }
+
+ return nr_freed;
+}
+
+static unsigned long release_whole_block(struct list_head *page_list,
+ struct page *block_head)
+{
+ struct pkeys_block_allocator *pba = &pkeys_block_allocator;
+ unsigned long nr_freed = 0;
+ struct page *page, *tmp;
+ int ret;
+
+ /* Reset the pkey for the full block to avoid splitting the linear map */
+ ret = set_pkey_default(block_head, BLOCK_NR_PAGES);
+
+ if (ret) {
+ guard(spinlock_bh)(&pba->lock);
+ cached_list_add_pages(block_head, BLOCK_NR_PAGES);
+ return 0;
+ }
+
+ list_for_each_entry_safe(page, tmp, page_list, lru) {
+ __free_page(page);
+ nr_freed++;
+ }
+
+ return nr_freed;
+}
+
static bool cached_page_available(gfp_t gfp)
{
struct pkeys_block_allocator *pba = &pkeys_block_allocator;
@@ -337,6 +454,7 @@ static struct page *get_cached_page(gfp_t gfp)
return NULL;
cached_list_del_page(page);
+ dec_block_nr_free(page);
return page;
}
@@ -409,6 +527,7 @@ static void pba_pgtable_free(struct page *page)
guard(spinlock_bh)(&pba->lock);
cached_list_add_pages(page, 1);
+ inc_block_nr_free(page);
}
static int pba_prepare_direct_map_split(void)
@@ -464,3 +583,171 @@ static void __init pba_init_late(void)
set_pkey_pgtable(pba_early_region.head_page,
1 << pba_early_region.order);
}
+
+/* Shrinker */
+
+/* Keep some pages around to avoid shrinking causing a refill right away */
+#define PBA_UNSHRINKABLE_PAGES 16
+/* Don't shrink a block that is almost full to avoid excessive splitting */
+#define PBA_SHRINK_BLOCK_MIN_PAGES (BLOCK_NR_PAGES / 8)
+
+static unsigned long count_shrinkable_pages(void)
+{
+ struct pkeys_block_allocator *pba = &pkeys_block_allocator;
+ unsigned long nr_cached = READ_ONCE(pba->nr_cached);
+
+ return nr_cached > PBA_UNSHRINKABLE_PAGES ?
+ nr_cached - PBA_UNSHRINKABLE_PAGES : 0;
+}
+
+static unsigned long pba_shrink_count(struct shrinker *shrink,
+ struct shrink_control *sc)
+{
+
+ return count_shrinkable_pages() ?: SHRINK_EMPTY;
+}
+
+static bool block_worth_shrinking(unsigned long nr_pages_target_block,
+ unsigned long nr_pages_nonblock,
+ struct shrink_control *sc)
+{
+ /*
+ * Avoid partially shrinking a block (which means splitting it) if
+ * we can reclaim enough/more non-block pages instead, or if we would
+ * reclaim only few pages (below PBA_SHRINK_BLOCK_MIN_PAGES)
+ */
+ return nr_pages_nonblock < nr_pages_target_block &&
+ nr_pages_nonblock < sc->nr_to_scan &&
+ nr_pages_target_block >= PBA_SHRINK_BLOCK_MIN_PAGES;
+}
+
+static unsigned long pba_shrink_scan(struct shrinker *shrink,
+ struct shrink_control *sc)
+{
+ struct pkeys_block_allocator *pba = &pkeys_block_allocator;
+ LIST_HEAD(pages_to_free);
+ struct page *page, *tmp;
+ unsigned long nr_pages_nonblock = 0, nr_pages_target_block = 0;
+ unsigned long nr_pages_uncached = 0, nr_freed = 0;
+ unsigned long nr_pages_shrinkable;
+ struct page *target_block = NULL;
+
+ sc->nr_scanned = 0;
+
+ pr_debug("%s: nr_to_scan = %lu, nr_cached = %lu\n",
+ __func__, sc->nr_to_scan, pba->nr_cached);
+
+ spin_lock_bh(&pba->lock);
+ nr_pages_shrinkable = count_shrinkable_pages();
+
+ /*
+ * Count pages that don't belong to any block, and find the block
+ * with the highest number of free pages
+ */
+ list_for_each_entry(page, &pba->cached_list, lru) {
+ struct page *block = block_head_page(page);
+ unsigned long block_nr_free;
+
+ if (!block) {
+ nr_pages_nonblock++;
+ continue;
+ }
+
+ block_nr_free = page_pba_data(block)->block_nr_free;
+
+ if (block_nr_free > nr_pages_target_block) {
+ target_block = block;
+ nr_pages_target_block = block_nr_free;
+ }
+
+ /* We will free this block, so no need to continue scanning */
+ if (nr_pages_target_block == BLOCK_NR_PAGES)
+ break;
+ }
+
+ if (nr_pages_target_block == BLOCK_NR_PAGES) {
+ /*
+ * If a whole block is empty, take the opportunity to free it
+ * completely (regardless of the requested nr_to_scan) to avoid
+ * splitting the linear map. If nr_pages_shrinkable is too low,
+ * we bail out as we would have to split the block to shrink it
+ * partially (and there is nothing else we can shrink).
+ */
+ if (nr_pages_shrinkable < BLOCK_NR_PAGES) {
+ spin_unlock_bh(&pba->lock);
+ pr_debug("%s: cannot free empty block, bailing out\n",
+ __func__);
+ goto out;
+ }
+
+ sc->nr_to_scan = BLOCK_NR_PAGES;
+ } else if (block_worth_shrinking(nr_pages_target_block,
+ nr_pages_nonblock, sc)) {
+ /* Shrink block (partially) */
+ sc->nr_to_scan = min(sc->nr_to_scan, nr_pages_target_block);
+ } else {
+ /* Free non-block pages */
+ sc->nr_to_scan = min(sc->nr_to_scan, nr_pages_nonblock);
+ target_block = NULL;
+ }
+
+ list_for_each_entry_safe(page, tmp, &pba->cached_list, lru) {
+ struct page *block = block_head_page(page);
+
+ if (!(nr_pages_uncached < sc->nr_to_scan &&
+ nr_pages_uncached < nr_pages_shrinkable))
+ break;
+
+ if (block == target_block) {
+ list_move(&page->lru, &pages_to_free);
+ nr_pages_uncached++;
+ }
+ }
+
+ pba->nr_cached -= nr_pages_uncached;
+ sc->nr_scanned = nr_pages_uncached;
+
+ if (target_block)
+ mark_block_noncached(target_block);
+ spin_unlock_bh(&pba->lock);
+
+ if (target_block)
+ pr_debug("%s: freeing block (pfn = %lx, %lu/%lu free pages)\n",
+ __func__, page_to_pfn(target_block),
+ nr_pages_target_block, BLOCK_NR_PAGES);
+ else
+ pr_debug("%s: freeing non-block (%lu free pages)\n",
+ __func__, nr_pages_nonblock);
+
+ if (nr_pages_target_block == BLOCK_NR_PAGES) {
+ VM_WARN_ON(nr_pages_uncached != BLOCK_NR_PAGES);
+ nr_freed = release_whole_block(&pages_to_free, target_block);
+ } else {
+ nr_freed = release_page_list(&pages_to_free);
+ }
+
+ pr_debug("%s: freed %lu pages, nr_cached = %lu\n", __func__,
+ nr_freed, pba->nr_cached);
+out:
+ return nr_freed ?: SHRINK_STOP;
+}
+
+static int __init pba_init_shrinker(void)
+{
+ struct shrinker *shrinker;
+
+ if (!pba_enabled())
+ return 0;
+
+ shrinker = shrinker_alloc(0, "kpkeys-pgtable-block");
+ if (!shrinker)
+ return -ENOMEM;
+
+ shrinker->count_objects = pba_shrink_count;
+ shrinker->scan_objects = pba_shrink_scan;
+ shrinker->seeks = 0;
+ shrinker->batch = BLOCK_NR_PAGES;
+ shrinker_register(shrinker);
+ return 0;
+}
+late_initcall(pba_init_shrinker);
--
2.51.2