From: Brendan Jackman <jackmanb@google.com>
To: Borislav Petkov <bp@alien8.de>,
Dave Hansen <dave.hansen@linux.intel.com>,
Peter Zijlstra <peterz@infradead.org>,
Andrew Morton <akpm@linux-foundation.org>,
David Hildenbrand <david@kernel.org>,
Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
Vlastimil Babka <vbabka@kernel.org>, Wei Xu <weixugc@google.com>,
Johannes Weiner <hannes@cmpxchg.org>, Zi Yan <ziy@nvidia.com>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, x86@kernel.org,
rppt@kernel.org, Sumit Garg <sumit.garg@oss.qualcomm.com>,
derkling@google.com, reijiw@google.com,
Will Deacon <will@kernel.org>,
rientjes@google.com, "Kalyazin, Nikita" <kalyazin@amazon.co.uk>,
patrick.roy@linux.dev, "Itazuri, Takahiro" <itazur@amazon.co.uk>,
Andy Lutomirski <luto@kernel.org>,
David Kaplan <david.kaplan@amd.com>,
Thomas Gleixner <tglx@kernel.org>,
Brendan Jackman <jackmanb@google.com>,
Yosry Ahmed <yosry.ahmed@linux.dev>
Subject: [PATCH RFC 17/19] mm/page_alloc: implement __GFP_UNMAPPED allocations
Date: Wed, 25 Feb 2026 16:34:42 +0000
Message-ID: <20260225-page_alloc-unmapped-v1-17-e8808a03cd66@google.com>
In-Reply-To: <20260225-page_alloc-unmapped-v1-0-e8808a03cd66@google.com>

Currently, __GFP_UNMAPPED allocations always fail: although the free
lists exist to hold unmapped pages, there is no way to actually create
an unmapped page block. This commit adds one, along with the logic to
map a block back into the direct map when that's needed.
Doing this at pageblock granularity ensures that the pageblock flags can
be used to infer which freetype a page belongs to. It also batches TLB
flushes nicely and avoids creating unnecessary TLB fragmentation in the
physmap.
There are some functional requirements for flipping a block:
- Unmapping requires a TLB shootdown, meaning IRQs must be enabled.
- Because the main use case for this feature is protecting against CPU
  exploits, a block needs to be zeroed when it is mapped, to ensure no
  residual data is available to attackers. Zeroing a block with a
  spinlock held seems undesirable.
- Updating the pagetables might require allocating a pagetable to break
down a huge page. This would deadlock if the zone lock was held.
This makes allocations that need to change sensitivity _somewhat_
similar to those that need to fall back to a different migratetype. But
the locking requirements mean this can't just be squashed into the
existing "fallback" allocator logic; instead, a new allocator path just
for this purpose is needed.
The new path is assumed to be much cheaper than the really heavyweight
stuff like compaction and reclaim. But at present it is treated as less
desirable than the mobility-related "fallback" and "stealing" logic.
This might turn out to need revision (in particular, maybe it's a
problem that __rmqueue_steal(), which causes fragmentation, happens
before __rmqueue_direct_map()), but that should be treated as a subsequent
optimisation project.
This currently forbids __GFP_ZERO; that is just to keep the patch from
getting too large. The next patch will remove the restriction.
Signed-off-by: Brendan Jackman <jackmanb@google.com>
---
include/linux/gfp.h | 11 +++-
mm/Kconfig | 4 +-
mm/page_alloc.c | 163 ++++++++++++++++++++++++++++++++++++++++++++++++----
3 files changed, 164 insertions(+), 14 deletions(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index f189bee7a974c..8abc9f4b1e7e6 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -20,6 +20,7 @@ struct mempolicy;
static inline freetype_t gfp_freetype(const gfp_t gfp_flags)
{
int migratetype;
+ unsigned int ft_flags = 0;
VM_WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
BUILD_BUG_ON((1UL << GFP_MOVABLE_SHIFT) != ___GFP_MOVABLE);
@@ -36,7 +37,15 @@ static inline freetype_t gfp_freetype(const gfp_t gfp_flags)
>> GFP_MOVABLE_SHIFT;
}
- return migrate_to_freetype(migratetype, 0);
+#ifdef CONFIG_PAGE_ALLOC_UNMAPPED
+ if (gfp_flags & __GFP_UNMAPPED) {
+ if (WARN_ON_ONCE(migratetype != MIGRATE_UNMOVABLE))
+ migratetype = MIGRATE_UNMOVABLE;
+ ft_flags |= FREETYPE_UNMAPPED;
+ }
+#endif
+
+ return migrate_to_freetype(migratetype, ft_flags);
}
#undef GFP_MOVABLE_MASK
#undef GFP_MOVABLE_SHIFT
diff --git a/mm/Kconfig b/mm/Kconfig
index ccf1cda90cf4a..3200ea8836432 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1502,8 +1502,8 @@ config MERMAP_KUNIT_TEST
If unsure, say N.
-endmenu
-
config PAGE_ALLOC_UNMAPPED
bool "Support allocating pages that aren't in the direct map" if COMPILE_TEST
default COMPILE_TEST
+
+endmenu
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5576bd6a26b7b..f7754080dd25b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -34,6 +34,7 @@
#include <linux/pagevec.h>
#include <linux/memory_hotplug.h>
#include <linux/nodemask.h>
+#include <linux/set_memory.h>
#include <linux/vmstat.h>
#include <linux/fault-inject.h>
#include <linux/compaction.h>
@@ -1037,6 +1038,26 @@ static void change_pageblock_range(struct page *pageblock_page,
}
}
+/*
+ * Can pages of these two freetypes be combined into a single higher-order free
+ * page?
+ */
+static inline bool can_merge_freetypes(freetype_t a, freetype_t b)
+{
+ if (freetypes_equal(a, b))
+ return true;
+
+ if (!migratetype_is_mergeable(free_to_migratetype(a)) ||
+ !migratetype_is_mergeable(free_to_migratetype(b)))
+ return false;
+
+ /*
+ * Mustn't "just" merge pages with different freetype flags; changing
+ * those requires updating pagetables.
+ */
+ return freetype_flags(a) == freetype_flags(b);
+}
+
/*
* Freeing function for a buddy system allocator.
*
@@ -1105,9 +1126,7 @@ static inline void __free_one_page(struct page *page,
buddy_ft = get_pfnblock_freetype(buddy, buddy_pfn);
buddy_mt = free_to_migratetype(buddy_ft);
- if (migratetype != buddy_mt &&
- (!migratetype_is_mergeable(migratetype) ||
- !migratetype_is_mergeable(buddy_mt)))
+ if (!can_merge_freetypes(freetype, buddy_ft))
goto done_merging;
}
@@ -1124,7 +1143,9 @@ static inline void __free_one_page(struct page *page,
/*
* Match buddy type. This ensures that an
* expand() down the line puts the sub-blocks
- * on the right freelists.
+ * on the right freelists. Freetype flags are
+ * already set correctly because of
+ * can_merge_freetypes().
*/
change_pageblock_range(buddy, order, migratetype);
}
@@ -3361,6 +3382,117 @@ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z,
#endif
}
+#ifdef CONFIG_PAGE_ALLOC_UNMAPPED
+/* Try to allocate a page by mapping/unmapping a block from the direct map. */
+static inline struct page *
+__rmqueue_direct_map(struct zone *zone, unsigned int request_order,
+ unsigned int alloc_flags, freetype_t freetype)
+{
+ unsigned int ft_flags_other = freetype_flags(freetype) ^ FREETYPE_UNMAPPED;
+ freetype_t ft_other = migrate_to_freetype(free_to_migratetype(freetype),
+ ft_flags_other);
+ bool want_mapped = !(freetype_flags(freetype) & FREETYPE_UNMAPPED);
+ enum rmqueue_mode rmqm = RMQUEUE_NORMAL;
+ unsigned long irq_flags;
+ int nr_pageblocks;
+ struct page *page;
+ int alloc_order;
+ int err;
+
+ if (freetype_idx(ft_other) < 0)
+ return NULL;
+
+ /*
+ * Might need a TLB shootdown. Even if IRQs are on this isn't
+ * safe if the caller holds a lock (in case the other CPUs need that
+ * lock to handle the shootdown IPI).
+ */
+ if (alloc_flags & ALLOC_NOBLOCK)
+ return NULL;
+
+ if (!can_set_direct_map())
+ return NULL;
+
+ lockdep_assert(!irqs_disabled() || unlikely(early_boot_irqs_disabled));
+
+ /*
+ * Need to [un]map a whole pageblock (otherwise it might require
+ * allocating pagetables). First allocate it.
+ */
+ alloc_order = max(request_order, pageblock_order);
+ nr_pageblocks = 1 << (alloc_order - pageblock_order);
+ spin_lock_irqsave(&zone->lock, irq_flags);
+ page = __rmqueue(zone, alloc_order, ft_other, alloc_flags, &rmqm);
+ spin_unlock_irqrestore(&zone->lock, irq_flags);
+ if (!page)
+ return NULL;
+
+ /*
+ * Now that IRQs are on it's safe to do a TLB shootdown, and now that we
+ * released the zone lock it's possible to allocate a pagetable if
+ * needed to split up a huge page.
+ *
+ * Note that modifying the direct map may need to allocate pagetables.
+ * What about unbounded recursion? Here are the assumptions that make it
+ * safe:
+ *
+ * - The direct map starts out fully mapped at boot. (This is not really
+ * an "assumption" as it's in direct control of page_alloc.c.)
+ *
+ * - Once pages in the direct map are broken down, they are not
+ * re-aggregated into larger pages again.
+ *
+ * - Pagetables are never allocated with __GFP_UNMAPPED.
+ *
+ * Under these assumptions, a pagetable might need to be allocated while
+ * _unmapping_ stuff from the direct map during a __GFP_UNMAPPED
+ * allocation. But, the allocation of that pagetable never requires
+ * allocating a further pagetable.
+ */
+ err = set_direct_map_valid_noflush(page,
+ nr_pageblocks << pageblock_order, want_mapped);
+ if (err == -ENOMEM || WARN_ONCE(err, "err=%d\n", err)) {
+ __free_one_page(page, page_to_pfn(page), zone,
+ alloc_order, freetype, FPI_SKIP_REPORT_NOTIFY);
+ return NULL;
+ }
+
+ if (!want_mapped) {
+ unsigned long start = (unsigned long)page_address(page);
+ unsigned long end = start + (nr_pageblocks << (pageblock_order + PAGE_SHIFT));
+
+ flush_tlb_kernel_range(start, end);
+ }
+
+ for (int i = 0; i < nr_pageblocks; i++) {
+ struct page *block_page = page + (pageblock_nr_pages * i);
+
+ set_pageblock_freetype_flags(block_page, freetype_flags(freetype));
+ }
+
+ if (request_order >= alloc_order)
+ return page;
+
+ /* Free any remaining pages in the block. */
+ spin_lock_irqsave(&zone->lock, irq_flags);
+ for (unsigned int i = request_order; i < alloc_order; i++) {
+ struct page *page_to_free = page + (1 << i);
+
+ __free_one_page(page_to_free, page_to_pfn(page_to_free), zone,
+ i, freetype, FPI_SKIP_REPORT_NOTIFY);
+ }
+ spin_unlock_irqrestore(&zone->lock, irq_flags);
+
+ return page;
+}
+#else /* CONFIG_PAGE_ALLOC_UNMAPPED */
+static inline struct page *__rmqueue_direct_map(struct zone *zone, unsigned int request_order,
+ unsigned int alloc_flags, freetype_t freetype)
+{
+ return NULL;
+}
+#endif /* CONFIG_PAGE_ALLOC_UNMAPPED */
+
static __always_inline
struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
unsigned int order, unsigned int alloc_flags,
@@ -3394,13 +3526,15 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
*/
if (!page && (alloc_flags & (ALLOC_OOM|ALLOC_HARDER)))
page = __rmqueue_smallest(zone, order, ft_high);
-
- if (!page) {
- spin_unlock_irqrestore(&zone->lock, flags);
- return NULL;
- }
}
spin_unlock_irqrestore(&zone->lock, flags);
+
+ /* Try changing the direct map, now that we've released the zone lock. */
+ if (!page)
+ page = __rmqueue_direct_map(zone, order, alloc_flags, freetype);
+ if (!page)
+ return NULL;
+
} while (check_new_pages(page, order));
__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
@@ -3625,6 +3759,8 @@ static void reserve_highatomic_pageblock(struct page *page, int order,
static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
bool force)
{
+ freetype_t ft_high = freetype_with_migrate(ac->freetype,
+ MIGRATE_HIGHATOMIC);
struct zonelist *zonelist = ac->zonelist;
unsigned long flags;
struct zoneref *z;
@@ -3633,6 +3769,9 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
int order;
int ret;
+ if (freetype_idx(ft_high) < 0)
+ return false;
+
for_each_zone_zonelist_nodemask(zone, z, zonelist, ac->highest_zoneidx,
ac->nodemask) {
/*
@@ -3646,8 +3785,6 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
spin_lock_irqsave(&zone->lock, flags);
for (order = 0; order < NR_PAGE_ORDERS; order++) {
struct free_area *area = &(zone->free_area[order]);
- freetype_t ft_high = freetype_with_migrate(ac->freetype,
- MIGRATE_HIGHATOMIC);
unsigned long size;
page = get_page_from_free_area(area, ft_high);
@@ -5147,6 +5284,10 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
ac->nodemask = nodemask;
ac->freetype = gfp_freetype(gfp_mask);
+ /* Not implemented yet. */
+ if (freetype_flags(ac->freetype) & FREETYPE_UNMAPPED && gfp_mask & __GFP_ZERO)
+ return false;
+
if (cpusets_enabled()) {
*alloc_gfp |= __GFP_HARDWALL;
/*
--
2.51.2