From: Brendan Jackman <jackmanb@google.com>
To: Borislav Petkov <bp@alien8.de>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	 Peter Zijlstra <peterz@infradead.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	 David Hildenbrand <david@kernel.org>,
	Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	 Vlastimil Babka <vbabka@kernel.org>, Wei Xu <weixugc@google.com>,
	 Johannes Weiner <hannes@cmpxchg.org>, Zi Yan <ziy@nvidia.com>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, x86@kernel.org,
	 rppt@kernel.org, Sumit Garg <sumit.garg@oss.qualcomm.com>,
	derkling@google.com,  reijiw@google.com,
	Will Deacon <will@kernel.org>,
	rientjes@google.com,  "Kalyazin, Nikita" <kalyazin@amazon.co.uk>,
	patrick.roy@linux.dev,  "Itazuri, Takahiro" <itazur@amazon.co.uk>,
	Andy Lutomirski <luto@kernel.org>,
	 David Kaplan <david.kaplan@amd.com>,
	Thomas Gleixner <tglx@kernel.org>,
	 Brendan Jackman <jackmanb@google.com>,
	Yosry Ahmed <yosry.ahmed@linux.dev>
Subject: [PATCH RFC 00/19] mm: Add __GFP_UNMAPPED
Date: Wed, 25 Feb 2026 16:34:25 +0000
Message-ID: <20260225-page_alloc-unmapped-v1-0-e8808a03cd66@google.com>

.:: What? Why?

This series adds support for efficiently allocating pages that are not
present in the direct map. This is instrumental to two different
immediate goals:

1. This supports the effort to remove guest_memfd memory from the direct
   map [0]. One of the challenges faced in that effort has been
   efficiently eliminating TLB entries; this series offers a solution to
   that problem.

2. Address Space Isolation (ASI) [1] also needs an efficient way to
   allocate pages that are missing from the direct map. Although for ASI
   the needs are slightly different (in that case, the pages need only
   be removed from ASI's special pagetables), the most interesting mm
   challenges are basically the same.

   So, __GFP_UNMAPPED serves as a Trojan horse to get the page allocator
   into a state where adding ASI's features "Should Be Easy".

   This series _also_ serves as a Trojan horse for the "mermap" (details
   below) which is also a key building block for making ASI efficient.

Longer term, there are a wide range of security techniques unlocked by
being able to efficiently remove pages from the kernel's address space.

There may also be non-security usecases for this feature. For example,
at LPC Sumit Garg presented an issue with memory-firewalled client
devices that could be remediated by __GFP_UNMAPPED [2].

.:: Design

The key design elements introduced here are just repurposed from
previous attempts to directly introduce ASI's needs to the page
allocator [3]. The only real difference is that now these support
totally unmapping stuff from the direct map, instead of only unmapping
it from ASI's special pagetables.

.:::: Design: Introducing "freetypes"

The biggest challenge for efficiently getting stuff out of the direct
map is TLB flushing. Pushing this problem into the page allocator turns
out to enable amortising that flush cost into almost nothing. The core
idea is to have pools of already-unmapped pages. We'd like those pages
to be physically contiguous so they don't unduly fragment the pagetables
around them, and we'd like to be able to efficiently look up these
already-unmapped pages during allocation. The page allocator already has
a deeply-ingrained mechanism for physically grouping pages by a certain
attribute, and then indexing free pages by that attribute:
migratetypes.

So basically, this series extends the concepts of migratetypes in the
allocator so that as well as just representing mobility, they can
represent other properties of the page too. (Actually, migratetypes are
already sort of overloaded, but the main extension is to be able to
represent _orthogonal_ properties). In order to avoid further
overloading the concept of a migratetype, this extension is done by
adding a new concept on top of migratetype: the _freetype_. A freetype
is basically just a migratetype plus some flags, and it replaces
migratetypes wherever the latter is currently used to index free
pages.
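
As a rough illustration of "a migratetype plus some flags", here is a
userspace sketch. It is not code from the series: the names below
(make_freetype(), FT_FLAG_UNMAPPED, the bit layout, the reduced set of
migratetypes) are assumptions made for the example; the real
definitions are in the patch that introduces freetype_t. The struct
wrapper shows one way a typedef can make the compiler reject a bare
migratetype where a freetype is expected.

```c
#include <assert.h>

/* Hypothetical stand-ins, not the kernel's definitions. */
enum migratetype { MT_UNMOVABLE, MT_MOVABLE, MT_RECLAIMABLE, MT_COUNT };

#define FT_FLAG_UNMAPPED (1u << 0)	/* pages absent from the direct map */
#define FT_NR_FLAG_BITS  1u
#define FT_NR_TYPES      (MT_COUNT << FT_NR_FLAG_BITS)

/*
 * Wrapping the value in a struct means that passing a plain int
 * migratetype where a freetype is expected fails to compile.
 */
typedef struct { unsigned int v; } freetype_t;

static inline freetype_t make_freetype(enum migratetype mt, unsigned int flags)
{
	assert(mt < MT_COUNT && flags < (1u << FT_NR_FLAG_BITS));
	return (freetype_t){ .v = ((unsigned int)mt << FT_NR_FLAG_BITS) | flags };
}

static inline enum migratetype freetype_migratetype(freetype_t ft)
{
	return (enum migratetype)(ft.v >> FT_NR_FLAG_BITS);
}

static inline unsigned int freetype_flags(freetype_t ft)
{
	return ft.v & ((1u << FT_NR_FLAG_BITS) - 1);
}

/* One free list per (migratetype, flags) pair; this is that list's index. */
static inline unsigned int freetype_index(freetype_t ft)
{
	return ft.v;
}
```

In this model, make_freetype(MT_MOVABLE, 0) and
make_freetype(MT_MOVABLE, FT_FLAG_UNMAPPED) report the same migratetype
but index different free lists, which is the "orthogonal property"
idea described above.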

The first freetype flag is then added, which marks the pages it indexes
as being absent from the direct map. This is then used to implement the
new __GFP_UNMAPPED flag, which allocates pages from pageblocks that have
the new flag, or unmaps pages if no existing ones are already available.
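
To make the amortisation argument concrete, here is a toy userspace
model, not the allocator code from the series. It assumes that
unmapping happens a whole pageblock at a time, that a pageblock is 512
pages, and that each conversion costs exactly one TLB flush; all names
(zone_model, alloc_unmapped_page(), ...) are hypothetical and freeing
is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGES_PER_BLOCK 512	/* assumed: 2 MiB pageblock of 4 KiB pages */
#define NR_BLOCKS       8

struct pageblock {
	bool unmapped;		/* models the new freetype flag on the block */
	size_t nr_free;		/* free pages remaining in this block */
};

struct zone_model {
	struct pageblock blocks[NR_BLOCKS];
	unsigned long tlb_flushes;	/* flushes paid for so far */
};

static void zone_init(struct zone_model *z)
{
	for (size_t i = 0; i < NR_BLOCKS; i++)
		z->blocks[i] = (struct pageblock){ .unmapped = false,
						   .nr_free = PAGES_PER_BLOCK };
	z->tlb_flushes = 0;
}

/* Returns the index of the block the page came from, or -1 on failure. */
static int alloc_unmapped_page(struct zone_model *z)
{
	/* Fast path: take a page from a block that is already unmapped. */
	for (size_t i = 0; i < NR_BLOCKS; i++) {
		struct pageblock *b = &z->blocks[i];

		if (b->unmapped && b->nr_free > 0) {
			b->nr_free--;
			return (int)i;
		}
	}

	/* Slow path: unmap a fully free block, paying for one flush. */
	for (size_t i = 0; i < NR_BLOCKS; i++) {
		struct pageblock *b = &z->blocks[i];

		if (!b->unmapped && b->nr_free == PAGES_PER_BLOCK) {
			b->unmapped = true;
			z->tlb_flushes++;
			b->nr_free--;
			return (int)i;
		}
	}

	return -1;
}
```

Under these assumptions, 1024 consecutive allocations convert two
pageblocks and so pay for two flushes in total, rather than one flush
per page.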

.:::: Design: Introducing the "mermap"

Sharp readers might by now be asking how __GFP_UNMAPPED interacts with
__GFP_ZERO. If pages aren't in the direct map, how can the page
allocator zero them? The solution is the "mermap", short for "epheMERal
mapping". The mermap provides an efficient way to temporarily map pages
into the local address space, and the allocator uses these mappings to
zero pages.
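
The map/zero/unmap sequence can be pictured with the toy userspace
model below. This is only a sketch of the idea: it does not reflect
the real mermap API (see the mermap patch for that), and every name in
it (page_model, mermap_get_model(), ...) is hypothetical. The
"mapping" is just a pointer held in a slot for the duration of the
access.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096

struct page_model {
	unsigned char data[MODEL_PAGE_SIZE];
	int in_direct_map;	/* 0 for pages from __GFP_UNMAPPED */
};

/* Models one ephemeral mapping slot in the local address space. */
struct mermap_model {
	struct page_model *mapped;
};

/* Temporarily "map" the page; fails if the slot is already in use. */
static void *mermap_get_model(struct mermap_model *m, struct page_model *p)
{
	if (m->mapped)
		return NULL;
	m->mapped = p;
	return p->data;
}

/* Tear the temporary mapping down again. */
static void mermap_put_model(struct mermap_model *m)
{
	m->mapped = NULL;
}

/* What the allocator has to do for __GFP_UNMAPPED|__GFP_ZERO. */
static int zero_unmapped_page(struct mermap_model *m, struct page_model *p)
{
	void *va;

	assert(!p->in_direct_map);	/* no permanent mapping to write through */

	va = mermap_get_model(m, p);
	if (!va)
		return -1;
	memset(va, 0, MODEL_PAGE_SIZE);
	mermap_put_model(m);
	return 0;
}
```

The point of the model is only the ordering: the page stays out of the
direct map throughout, and is reachable solely through the short-lived
local mapping while it is being zeroed.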

Using the mermap securely requires some knowledge about the usage of the
pages. One slightly awkward part of this design is that the page
allocator's usage of the mermap then "leaks" out so that callers who
allocate with __GFP_UNMAPPED|__GFP_ZERO need to be aware of the mermap's
security implications. For the guest_memfd unmapping usecase, that means
when guest_memfd.c makes these special allocations, it is only safe
because the pages will belong to the current process. In other words,
the use of the mermap potentially allows that process to leak the pages
via CPU sidechannels (unless more holistic/expensive mitigations are
enabled).

Since this cover letter is already too long I won't describe most
details of the mermap here; please see the patch that introduces it.

But one key detail is that it requires a kernel-space but mm-local
virtual address region. So... this series adds that too (for x86). This
is called the mm-local region and is implemented by "just" extending and
generalising the LDT remap area.

.:: Outline of the patchset

- Patches  1 ->  2 introduce the mm-local region for x86

- Patches  3 ->  5 introduce the mermap

- Patches  6 -> 14 introduce freetypes

  - Patch 8 in particular is the big annoying switch-over which changes
    a whole bunch of code from "migratetype" to "freetype". In order to
    try and have the compiler help out with catching bugs, this is done
    with an annoying typedef. I'm sorry that this patch is so annoying,
    but I think if we do want to extend the allocator along these lines
    then a typedef + big annoying patch is probably the safest way.

- Patches 15 -> 19 introduce __GFP_UNMAPPED

.:: Why [RFC]?

I really wanted to stop sending RFC and start sending PATCHes but
getting this series out has taken months longer than I expected, so it's
time to get something on the list. The known issues here are:

1. __GFP_UNMAPPED isn't useful yet until guest_memfd unmapping support
   [0] gets merged.

2. Apparently while implementing the mm-local region, I totally forgot
   that KPTI existed on 32-bit systems. I expect the 0-day bot to fire a
   failure on that patch.

There is also one really nasty hack in mermap.c, namely
set_unmapped_pte(). This is basically a symptom of the problem I
propose to discuss at LSF/MM/BPF [3], i.e. the fact that there are
lots of pagetable libraries yet none of them are flexible enough to do
anything new (in this case the "new thing" is pre-allocating pagetables
then subsequently populating them in a separate context). Whether this
particular hack should block merging the mermap is not clear to me; I'd
be interested to hear opinions.

.:: Performance

In [4] is a branch containing: 

1. This series.

2. All the key kernel patches from the Firecracker team's "secret-free"
   effort, which includes guest_memfd unmapping ([0]).

3. Some prototype patches to switch guest_memfd over from an ad-hoc
   unmapping logic to use of __GFP_UNMAPPED (plus direct use of the
   mermap to implement write()).

I benchmarked this using Firecracker's own performance tests [4], which
measure the time required to populate the VM guest's memory. This
population happens via write() so it exercises the mermap. I ran this on
a Sapphire Rapids machine [5]. The baseline (the unlabelled first row in
the table below) is just the secret-free patches on their own.
"gfp_unmapped" is the branch described above. "skip-flush" provides a
reference against an implementation that just skips flushing the TLB
when unmapping guest_memfd pages, which serves as an upper bound on
performance.

metric: populate_latency (ms)   |  test: firecracker-perf-tests-wrapped
+---------------+---------+----------+----------+------------------------+----------+--------+
| nixos_variant | samples |     mean |      min | histogram              |      max | Δμ     |
+---------------+---------+----------+----------+------------------------+----------+--------+
|               |      30 |    1.04s |    1.02s |                     █  |    1.10s |        |
| gfp_unmapped  |      30 | 313.02ms | 299.48ms |       █                | 343.25ms | -70.0% |
| skip-flush    |      30 | 325.80ms | 307.91ms |       █                | 333.30ms | -68.8% |
+---------------+---------+----------+----------+------------------------+----------+--------+

Conclusion: it's close to the best-case performance for this particular
workload. (Note that in the sample above the gfp_unmapped mean is
actually lower than the skip-flush mean; that's noise, not a consistent
observation.)

[0] [PATCH v10 00/15] Direct Map Removal Support for guest_memfd
    https://lore.kernel.org/all/20260126164445.11867-1-kalyazin@amazon.com/

[1] https://linuxasi.dev/

[2] https://lpc.events/event/19/contributions/2095/

[3] https://lore.kernel.org/all/20260219175113.618562-1-jackmanb@google.com/

[4] https://github.com/bjackman/kernel-benchmarks-nix/blob/fd56c93344760927b71161368230a15741a5869f/packages/benchmarks/firecracker-perf-tests/firecracker-perf-tests.sh

[5] https://github.com/bjackman/aethelred/blob/eb0dd0e99ee08fa0534733113e93b89499affe91

Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: x86@kernel.org
Cc: rppt@kernel.org
Cc: Sumit Garg <sumit.garg@oss.qualcomm.com>
To: Borislav Petkov <bp@alien8.de>
To: Dave Hansen <dave.hansen@linux.intel.com>
To: Peter Zijlstra <peterz@infradead.org>
To: Andrew Morton <akpm@linux-foundation.org>
To: David Hildenbrand <david@kernel.org>
To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
To: Vlastimil Babka <vbabka@kernel.org>
To: Mike Rapoport <rppt@kernel.org>
To: Wei Xu <weixugc@google.com>
To: Johannes Weiner <hannes@cmpxchg.org>
To: Zi Yan <ziy@nvidia.com>
Cc: yosryahmed@google.com
Cc: derkling@google.com
Cc: reijiw@google.com
Cc: Will Deacon <will@kernel.org>
Cc: rientjes@google.com
Cc: "Kalyazin, Nikita" <kalyazin@amazon.co.uk>
Cc: patrick.roy@linux.dev
Cc: "Itazuri, Takahiro" <itazur@amazon.co.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: David Kaplan <david.kaplan@amd.com>
Cc: Thomas Gleixner <tglx@kernel.org>

Signed-off-by: Brendan Jackman <jackmanb@google.com>
---
Brendan Jackman (19):
      x86/mm: split out preallocate_sub_pgd()
      x86/mm: Generalize LDT remap into "mm-local region"
      x86/tlb: Expose some flush function declarations to modules
      x86/mm: introduce the mermap
      mm: KUnit tests for the mermap
      mm: introduce for_each_free_list()
      mm/page_alloc: don't overload migratetype in find_suitable_fallback()
      mm: introduce freetype_t
      mm: move migratetype definitions to freetype.h
      mm: add definitions for allocating unmapped pages
      mm: rejig pageblock mask definitions
      mm: encode freetype flags in pageblock flags
      mm/page_alloc: remove ifdefs from pindex helpers
      mm/page_alloc: separate pcplists by freetype flags
      mm/page_alloc: rename ALLOC_NON_BLOCK back to _HARDER
      mm/page_alloc: introduce ALLOC_NOBLOCK
      mm/page_alloc: implement __GFP_UNMAPPED allocations
      mm/page_alloc: implement __GFP_UNMAPPED|__GFP_ZERO allocations
      mm: Minimal KUnit tests for some new page_alloc logic

 Documentation/arch/x86/x86_64/mm.rst    |   4 +-
 arch/x86/Kconfig                        |   3 +
 arch/x86/include/asm/mermap.h           |  23 +
 arch/x86/include/asm/mmu_context.h      |  71 ++-
 arch/x86/include/asm/pgalloc.h          |  33 ++
 arch/x86/include/asm/pgtable_64_types.h |  19 +-
 arch/x86/include/asm/pgtable_types.h    |   2 +
 arch/x86/include/asm/tlbflush.h         |  43 +-
 arch/x86/kernel/ldt.c                   | 137 ++----
 arch/x86/mm/init_64.c                   |  44 +-
 arch/x86/mm/pgtable.c                   |   3 +
 include/linux/freetype.h                | 147 ++++++
 include/linux/gfp.h                     |  25 +-
 include/linux/gfp_types.h               |  26 ++
 include/linux/mermap.h                  |  63 +++
 include/linux/mermap_types.h            |  43 ++
 include/linux/mm.h                      |  13 +
 include/linux/mm_types.h                |   6 +
 include/linux/mmzone.h                  |  84 ++--
 include/linux/pageblock-flags.h         |  16 +-
 include/trace/events/mmflags.h          |   9 +-
 kernel/fork.c                           |   6 +
 kernel/panic.c                          |   2 +
 kernel/power/snapshot.c                 |   8 +-
 mm/Kconfig                              |  41 ++
 mm/Makefile                             |   3 +
 mm/compaction.c                         |  36 +-
 mm/init-mm.c                            |   3 +
 mm/internal.h                           |  43 +-
 mm/mermap.c                             | 323 +++++++++++++
 mm/mm_init.c                            |  11 +-
 mm/page_alloc.c                         | 782 +++++++++++++++++++++++---------
 mm/page_isolation.c                     |   2 +-
 mm/page_owner.c                         |   7 +-
 mm/page_reporting.c                     |   4 +-
 mm/pgalloc-track.h                      |   6 +
 mm/show_mem.c                           |   4 +-
 mm/tests/mermap_kunit.c                 | 231 ++++++++++
 mm/tests/page_alloc_kunit.c             | 250 ++++++++++
 39 files changed, 2099 insertions(+), 477 deletions(-)
---
base-commit: 44982d352c33767cd8d19f8044e7e1161a587ff7
change-id: 20260112-page_alloc-unmapped-944fe5d7b55c

Best regards,
-- 
Brendan Jackman <jackmanb@google.com>



