* [PATCH RFC 01/11] x86/mm: Bare minimum ASI API for page_alloc integration
2025-03-13 18:11 [PATCH RFC 00/11] mm: ASI integration for the page allocator Brendan Jackman
@ 2025-03-13 18:11 ` Brendan Jackman
2025-03-13 18:11 ` [PATCH RFC 02/11] x86/mm: Factor out phys_pgd_init() Brendan Jackman
` (10 subsequent siblings)
11 siblings, 0 replies; 18+ messages in thread
From: Brendan Jackman @ 2025-03-13 18:11 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Andrew Morton, David Rientjes, Vlastimil Babka,
David Hildenbrand
Cc: linux-kernel, linux-mm, Mike Rapoport, Junaid Shahid,
Reiji Watanabe, Patrick Bellasi, Brendan Jackman, Yosry Ahmed
This commit provides a minimal framework for presenting an ASI
integration into the page allocator, without getting distracted by
irrelevant details. There's no need to review it actively; just refer
back to it as needed when reading the later patches.
In a real [PATCH] series this should be several separate commits.
Aside from missing the actual core address-space switching and security
logic, this is missing runtime-disablement of ASI. If you enable it in
Kconfig, ASI's mm logic gets run unconditionally. That isn't what we
want in the real implementation (certainly not in the initial version,
anyway).
- Add CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION. Attempt to follow the
proposal by Mike Rapoport here:
https://lore.kernel.org/linux-mm/Z8K2B3WJoICVbDj3@kernel.org/
In this RFC there's only a small amount of x86-specific logic, so
perhaps it could be implemented without any arch/ dependency. But
this is absolutely not true of the full ASI implementation, so that's
already reflected in the Kconfig structure here.
- Introduce struct asi, which is an "ASI domain", i.e. an address space.
For now this is nothing but a wrapper for a PGD.
- Introduce the "global nonsensitive" ASI domain. This contains all the
mappings that do not need to be protected from any attacker.
Maintaining these mappings is the subject of this RFC.
Signed-off-by: Brendan Jackman <jackmanb@google.com>
---
arch/Kconfig | 14 ++++++++++++++
arch/x86/Kconfig | 1 +
arch/x86/include/asm/asi.h | 28 ++++++++++++++++++++++++++++
arch/x86/mm/Makefile | 1 +
arch/x86/mm/asi.c | 8 ++++++++
arch/x86/mm/init.c | 3 ++-
include/linux/asi.h | 18 ++++++++++++++++++
7 files changed, 72 insertions(+), 1 deletion(-)
diff --git a/arch/Kconfig b/arch/Kconfig
index b8a4ff36558228240080a5677f702d37f4f8d547..871ad0987c8740205ceec675a6b7304c644f28e1 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -17,6 +17,20 @@ config CPU_MITIGATIONS
def_bool y
endif
+config ARCH_HAS_MITIGATION_ADDRESS_SPACE_ISOLATION
+ bool
+
+config MITIGATION_ADDRESS_SPACE_ISOLATION
+ bool "Allow code to run with a reduced kernel address space"
+ default n
+ depends on ARCH_HAS_MITIGATION_ADDRESS_SPACE_ISOLATION && !PARAVIRT
+ help
+ This feature provides the ability to run some kernel code
+ with a reduced kernel address space. This can be used to
+ mitigate some speculative execution attacks.
+
+ !PARAVIRT dependency is a temporary hack while ASI has custom
+ pagetable manipulation code.
#
# Selected by architectures that need custom DMA operations for e.g. legacy
# IOMMUs not handled by dma-iommu. Drivers must never select this symbol.
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0e27ebd7e36a9e3d69ad3e77c8db5dcf11ae3016..19ceecf5978bbe62e0742072c192c8ee952082dc 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -36,6 +36,7 @@ config X86_64
select ARCH_HAS_ELFCORE_COMPAT
select ZONE_DMA32
select EXECMEM if DYNAMIC_FTRACE
+ select ARCH_HAS_MITIGATION_ADDRESS_SPACE_ISOLATION
config FORCE_DYNAMIC_FTRACE
def_bool y
diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h
new file mode 100644
index 0000000000000000000000000000000000000000..b8f604df6a36508acbc10710f821d5f95e8cdceb
--- /dev/null
+++ b/arch/x86/include/asm/asi.h
@@ -0,0 +1,28 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_ASI_H
+#define _ASM_X86_ASI_H
+
+#include <asm/pgtable_types.h>
+
+#ifdef CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION
+
+extern struct asi __asi_global_nonsensitive;
+#define ASI_GLOBAL_NONSENSITIVE (&__asi_global_nonsensitive)
+
+/*
+ * An ASI domain (struct asi) represents a restricted address space. The
+ * unrestricted address space (and user address space under PTI) are not
+ * represented as a domain.
+ */
+struct asi {
+ pgd_t *pgd;
+};
+
+static __always_inline pgd_t *asi_pgd(struct asi *asi)
+{
+ return asi ? asi->pgd : NULL;
+}
+
+#endif /* CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION */
+
+#endif /* _ASM_X86_ASI_H */
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 690fbf48e8538b62a176ce838820e363575b7897..89ade7363798cc20d5e5643526eba7378174baa0 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -61,6 +61,7 @@ obj-$(CONFIG_ACPI_NUMA) += srat.o
obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) += pkeys.o
obj-$(CONFIG_RANDOMIZE_MEMORY) += kaslr.o
obj-$(CONFIG_MITIGATION_PAGE_TABLE_ISOLATION) += pti.o
+obj-$(CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION) += asi.o
obj-$(CONFIG_X86_MEM_ENCRYPT) += mem_encrypt.o
obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_amd.o
diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c
new file mode 100644
index 0000000000000000000000000000000000000000..e5a981a7b3192655cd981633514fbf945b92c9b6
--- /dev/null
+++ b/arch/x86/mm/asi.c
@@ -0,0 +1,8 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <asm/asi.h>
+
+static __aligned(PAGE_SIZE) pgd_t asi_global_nonsensitive_pgd[PTRS_PER_PGD];
+
+struct asi __asi_global_nonsensitive = {
+ .pgd = asi_global_nonsensitive_pgd,
+};
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 62aa4d66a032d59191e79d34fc0cdaa4f32f88db..44d3dc574881dd23bb48f9af3f6191be309405ef 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -250,7 +250,8 @@ static void __init probe_page_size_mask(void)
/* By the default is everything supported: */
__default_kernel_pte_mask = __supported_pte_mask;
/* Except when with PTI where the kernel is mostly non-Global: */
- if (cpu_feature_enabled(X86_FEATURE_PTI))
+ if (cpu_feature_enabled(X86_FEATURE_PTI) ||
+ IS_ENABLED(CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION))
__default_kernel_pte_mask &= ~_PAGE_GLOBAL;
/* Enable 1 GB linear kernel mappings if available: */
diff --git a/include/linux/asi.h b/include/linux/asi.h
new file mode 100644
index 0000000000000000000000000000000000000000..2d3049d5fe423e139dcce8f3d68cdffcc0ec0bfe
--- /dev/null
+++ b/include/linux/asi.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _INCLUDE_ASI_H
+#define _INCLUDE_ASI_H
+
+#include <asm/pgtable_types.h>
+
+#ifdef CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION
+#include <asm/asi.h>
+#else
+
+#define ASI_GLOBAL_NONSENSITIVE NULL
+
+struct asi {};
+
+static inline pgd_t *asi_pgd(struct asi *asi) { return NULL; }
+
+#endif /* CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION */
+#endif /* _INCLUDE_ASI_H */
--
2.49.0.rc1.451.g8f38331e32-goog
* [PATCH RFC 02/11] x86/mm: Factor out phys_pgd_init()
2025-03-13 18:11 [PATCH RFC 00/11] mm: ASI integration for the page allocator Brendan Jackman
2025-03-13 18:11 ` [PATCH RFC 01/11] x86/mm: Bare minimum ASI API for page_alloc integration Brendan Jackman
@ 2025-03-13 18:11 ` Brendan Jackman
2025-03-13 22:14 ` Yosry Ahmed
2025-03-13 18:11 ` [PATCH RFC 03/11] x86/mm: Add lookup_pgtable_in_pgd() Brendan Jackman
` (9 subsequent siblings)
11 siblings, 1 reply; 18+ messages in thread
From: Brendan Jackman @ 2025-03-13 18:11 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Andrew Morton, David Rientjes, Vlastimil Babka,
David Hildenbrand
Cc: linux-kernel, linux-mm, Mike Rapoport, Junaid Shahid,
Reiji Watanabe, Patrick Bellasi, Brendan Jackman, Yosry Ahmed
__kernel_physical_mapping_init() will soon need to work on multiple
PGDs, so factor out something similar to phys_p4d_init() and friends,
which takes the base of the PGD as an argument.
Signed-off-by: Brendan Jackman <jackmanb@google.com>
---
arch/x86/mm/init_64.c | 33 +++++++++++++++++++++++----------
1 file changed, 23 insertions(+), 10 deletions(-)
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 01ea7c6df3036bd185cdb3f54ddf244b79cbce8c..8f75274fddd96b8285aff48493ebad93e30daebe 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -731,21 +731,20 @@ phys_p4d_init(p4d_t *p4d_page, unsigned long paddr, unsigned long paddr_end,
}
static unsigned long __meminit
-__kernel_physical_mapping_init(unsigned long paddr_start,
- unsigned long paddr_end,
- unsigned long page_size_mask,
- pgprot_t prot, bool init)
+phys_pgd_init(pgd_t *pgd_page, unsigned long paddr_start, unsigned long paddr_end,
+ unsigned long page_size_mask, pgprot_t prot, bool init, bool *pgd_changed)
{
- bool pgd_changed = false;
unsigned long vaddr, vaddr_start, vaddr_end, vaddr_next, paddr_last;
+ *pgd_changed = false;
+
paddr_last = paddr_end;
vaddr = (unsigned long)__va(paddr_start);
vaddr_end = (unsigned long)__va(paddr_end);
vaddr_start = vaddr;
for (; vaddr < vaddr_end; vaddr = vaddr_next) {
- pgd_t *pgd = pgd_offset_k(vaddr);
+ pgd_t *pgd = pgd_offset_pgd(pgd_page, vaddr);
p4d_t *p4d;
vaddr_next = (vaddr & PGDIR_MASK) + PGDIR_SIZE;
@@ -771,15 +770,29 @@ __kernel_physical_mapping_init(unsigned long paddr_start,
(pud_t *) p4d, init);
spin_unlock(&init_mm.page_table_lock);
- pgd_changed = true;
+ *pgd_changed = true;
}
- if (pgd_changed)
- sync_global_pgds(vaddr_start, vaddr_end - 1);
-
return paddr_last;
}
+static unsigned long __meminit
+__kernel_physical_mapping_init(unsigned long paddr_start,
+ unsigned long paddr_end,
+ unsigned long page_size_mask,
+ pgprot_t prot, bool init)
+{
+ bool pgd_changed;
+ unsigned long paddr_last;
+
+ paddr_last = phys_pgd_init(init_mm.pgd, paddr_start, paddr_end, page_size_mask,
+ prot, init, &pgd_changed);
+ if (pgd_changed)
+ sync_global_pgds((unsigned long)__va(paddr_start),
+ (unsigned long)__va(paddr_end) - 1);
+
+ return paddr_last;
+}
/*
* Create page table mapping for the physical memory for specific physical
--
2.49.0.rc1.451.g8f38331e32-goog
* Re: [PATCH RFC 02/11] x86/mm: Factor out phys_pgd_init()
2025-03-13 18:11 ` [PATCH RFC 02/11] x86/mm: Factor out phys_pgd_init() Brendan Jackman
@ 2025-03-13 22:14 ` Yosry Ahmed
2025-03-17 16:24 ` Brendan Jackman
0 siblings, 1 reply; 18+ messages in thread
From: Yosry Ahmed @ 2025-03-13 22:14 UTC (permalink / raw)
To: Brendan Jackman
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Andrew Morton, David Rientjes, Vlastimil Babka,
David Hildenbrand, linux-kernel, linux-mm, Mike Rapoport,
Junaid Shahid, Reiji Watanabe, Patrick Bellasi
On Thu, Mar 13, 2025 at 06:11:21PM +0000, Brendan Jackman wrote:
> __kernel_physical_mapping_init() will soon need to work on multiple
> PGDs, so factor out something similar to phys_p4d_init() and friends,
> which takes the base of the PGD as an argument.
>
> Signed-off-by: Brendan Jackman <jackmanb@google.com>
> ---
> arch/x86/mm/init_64.c | 33 +++++++++++++++++++++++----------
> 1 file changed, 23 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 01ea7c6df3036bd185cdb3f54ddf244b79cbce8c..8f75274fddd96b8285aff48493ebad93e30daebe 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -731,21 +731,20 @@ phys_p4d_init(p4d_t *p4d_page, unsigned long paddr, unsigned long paddr_end,
> }
>
> static unsigned long __meminit
> -__kernel_physical_mapping_init(unsigned long paddr_start,
> - unsigned long paddr_end,
> - unsigned long page_size_mask,
> - pgprot_t prot, bool init)
> +phys_pgd_init(pgd_t *pgd_page, unsigned long paddr_start, unsigned long paddr_end,
> + unsigned long page_size_mask, pgprot_t prot, bool init, bool *pgd_changed)
> {
> - bool pgd_changed = false;
> unsigned long vaddr, vaddr_start, vaddr_end, vaddr_next, paddr_last;
>
> + *pgd_changed = false;
> +
> paddr_last = paddr_end;
> vaddr = (unsigned long)__va(paddr_start);
> vaddr_end = (unsigned long)__va(paddr_end);
> vaddr_start = vaddr;
>
> for (; vaddr < vaddr_end; vaddr = vaddr_next) {
> - pgd_t *pgd = pgd_offset_k(vaddr);
> + pgd_t *pgd = pgd_offset_pgd(pgd_page, vaddr);
> p4d_t *p4d;
>
> vaddr_next = (vaddr & PGDIR_MASK) + PGDIR_SIZE;
> @@ -771,15 +770,29 @@ __kernel_physical_mapping_init(unsigned long paddr_start,
> (pud_t *) p4d, init);
>
> spin_unlock(&init_mm.page_table_lock);
> - pgd_changed = true;
> + *pgd_changed = true;
> }
>
> - if (pgd_changed)
> - sync_global_pgds(vaddr_start, vaddr_end - 1);
> -
> return paddr_last;
> }
>
> +static unsigned long __meminit
> +__kernel_physical_mapping_init(unsigned long paddr_start,
> + unsigned long paddr_end,
> + unsigned long page_size_mask,
> + pgprot_t prot, bool init)
> +{
> + bool pgd_changed;
> + unsigned long paddr_last;
> +
> + paddr_last = phys_pgd_init(init_mm.pgd, paddr_start, paddr_end, page_size_mask,
> + prot, init, &pgd_changed);
> + if (pgd_changed)
> + sync_global_pgds((unsigned long)__va(paddr_start),
> + (unsigned long)__va(paddr_end) - 1);
This patch keeps the sync_global_pgds() in
__kernel_physical_mapping_init(), then a following patch adds it back in
phys_pgd_init() (but still leaves it here).
Should we just leave sync_global_pgds() in phys_pgd_init() and eliminate
the pgd_changed argument?
> +
> + return paddr_last;
> +}
>
> /*
> * Create page table mapping for the physical memory for specific physical
>
> --
> 2.49.0.rc1.451.g8f38331e32-goog
>
* Re: [PATCH RFC 02/11] x86/mm: Factor out phys_pgd_init()
2025-03-13 22:14 ` Yosry Ahmed
@ 2025-03-17 16:24 ` Brendan Jackman
0 siblings, 0 replies; 18+ messages in thread
From: Brendan Jackman @ 2025-03-17 16:24 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Andrew Morton, David Rientjes, Vlastimil Babka,
David Hildenbrand, linux-kernel, linux-mm, Mike Rapoport,
Junaid Shahid, Reiji Watanabe, Patrick Bellasi
On Thu, Mar 13, 2025 at 10:14:33PM +0000, Yosry Ahmed wrote:
> > @@ -771,15 +770,29 @@ __kernel_physical_mapping_init(unsigned long paddr_start,
> > (pud_t *) p4d, init);
> >
> > spin_unlock(&init_mm.page_table_lock);
> > - pgd_changed = true;
> > + *pgd_changed = true;
> > }
> >
> > - if (pgd_changed)
> > - sync_global_pgds(vaddr_start, vaddr_end - 1);
> > -
> > return paddr_last;
> > }
> >
> > +static unsigned long __meminit
> > +__kernel_physical_mapping_init(unsigned long paddr_start,
> > + unsigned long paddr_end,
> > + unsigned long page_size_mask,
> > + pgprot_t prot, bool init)
> > +{
> > + bool pgd_changed;
> > + unsigned long paddr_last;
> > +
> > + paddr_last = phys_pgd_init(init_mm.pgd, paddr_start, paddr_end, page_size_mask,
> > + prot, init, &pgd_changed);
> > + if (pgd_changed)
> > + sync_global_pgds((unsigned long)__va(paddr_start),
> > + (unsigned long)__va(paddr_end) - 1);
>
> This patch keeps the sync_global_pgds() in
> __kernel_physical_mapping_init(), then a following patch adds it back in
> phys_pgd_init() (but still leaves it here).
>
> Should we just leave sync_global_pgds() in phys_pgd_init() and eliminate
> the pgd_changed argument?
Oops, thanks. IIUC we only need the sync_global_pgds() call in
__kernel_physical_mapping_init(). We don't want to call it a second
time just because we mirrored changes into the ASI PGD.
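I.e. keep phys_pgd_init() free of any sync and only sync the
unrestricted PGD from the caller. Untested sketch, using the names from
patches 2 and 4 (pgd_asi / prot_np), just to illustrate:

	/* Tail of __kernel_physical_mapping_init(), roughly: */
	paddr_last = phys_pgd_init(init_mm.pgd, paddr_start, paddr_end,
				   page_size_mask, prot, init, &pgd_changed);

	/* Mirror into the ASI PGD; no need to know whether it changed. */
	if (pgd_asi)
		phys_pgd_init(pgd_asi, paddr_start, paddr_end, 1 << PG_LEVEL_2M,
			      prot_np, init, NULL);

	/* Only changes to the unrestricted PGD need syncing to other PGDs. */
	if (pgd_changed)
		sync_global_pgds((unsigned long)__va(paddr_start),
				 (unsigned long)__va(paddr_end) - 1);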
* [PATCH RFC 03/11] x86/mm: Add lookup_pgtable_in_pgd()
2025-03-13 18:11 [PATCH RFC 00/11] mm: ASI integration for the page allocator Brendan Jackman
2025-03-13 18:11 ` [PATCH RFC 01/11] x86/mm: Bare minimum ASI API for page_alloc integration Brendan Jackman
2025-03-13 18:11 ` [PATCH RFC 02/11] x86/mm: Factor out phys_pgd_init() Brendan Jackman
@ 2025-03-13 18:11 ` Brendan Jackman
2025-03-13 22:09 ` Yosry Ahmed
2025-03-13 18:11 ` [PATCH RFC 04/11] x86/mm/asi: Sync physmap into ASI_GLOBAL_NONSENSITIVE Brendan Jackman
` (8 subsequent siblings)
11 siblings, 1 reply; 18+ messages in thread
From: Brendan Jackman @ 2025-03-13 18:11 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Andrew Morton, David Rientjes, Vlastimil Babka,
David Hildenbrand
Cc: linux-kernel, linux-mm, Mike Rapoport, Junaid Shahid,
Reiji Watanabe, Patrick Bellasi, Brendan Jackman, Yosry Ahmed
This is the same thing as lookup_address_in_pgd(), but it returns the
pagetable unconditionally instead of returning NULL when the pagetable
is none. This will be used for looking up and modifying pages that are
*_none() in order to map memory into the ASI restricted address space.
For a [PATCH], if this logic is needed, the surrounding code should
probably first be somewhat refactored. It now looks pretty repetitive,
and it's confusing that lookup_address_in_pgd() returns NULL when
pmd_none() but not when pte_none(). For now here's something that
works.
Signed-off-by: Brendan Jackman <jackmanb@google.com>
---
arch/x86/include/asm/pgtable_types.h | 2 ++
arch/x86/mm/pat/set_memory.c | 34 ++++++++++++++++++++++++++++++++++
2 files changed, 36 insertions(+)
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 4b804531b03c3ce5cc48f0a75cb75d58b985777a..e09b509e525534f31c986d705e07b25dd9c04cb7 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -572,6 +572,8 @@ extern pte_t *lookup_address_in_pgd(pgd_t *pgd, unsigned long address,
unsigned int *level);
pte_t *lookup_address_in_pgd_attr(pgd_t *pgd, unsigned long address,
unsigned int *level, bool *nx, bool *rw);
+extern pte_t *lookup_pgtable_in_pgd(pgd_t *pgd, unsigned long address,
+ unsigned int *level);
extern pmd_t *lookup_pmd_address(unsigned long address);
extern phys_addr_t slow_virt_to_phys(void *__address);
extern int __init kernel_map_pages_in_pgd(pgd_t *pgd, u64 pfn,
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index ef4514d64c0524e5854fa106e3f37ff1e1ba10a2..d066bf2c9e93e126757bd32a7a666db89b2488b6 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -658,6 +658,40 @@ static inline pgprot_t verify_rwx(pgprot_t old, pgprot_t new, unsigned long star
return new;
}
+/*
+ * Lookup the page table entry for a virtual address in a specific pgd. Return
+ * the pointer to the entry, without implying that any mapping actually exists
+ * (the returned value may be zero).
+ */
+pte_t *lookup_pgtable_in_pgd(pgd_t *pgd, unsigned long address, unsigned int *level)
+{
+ p4d_t *p4d;
+ pud_t *pud;
+ pmd_t *pmd;
+
+ *level = PG_LEVEL_256T;
+ if (pgd_none(*pgd))
+ return (pte_t *)pgd;
+
+ *level = PG_LEVEL_512G;
+ p4d = p4d_offset(pgd, address);
+ if (p4d_none(*p4d) || p4d_leaf(*p4d) || !p4d_present(*p4d))
+ return (pte_t *)p4d;
+
+ *level = PG_LEVEL_1G;
+ pud = pud_offset(p4d, address);
+ if (pud_none(*pud) || pud_leaf(*pud) || !pud_present(*pud))
+ return (pte_t *)pud;
+
+ *level = PG_LEVEL_2M;
+ pmd = pmd_offset(pud, address);
+ if (pmd_none(*pmd) || pmd_leaf(*pmd) || !pmd_present(*pmd))
+ return (pte_t *)pmd;
+
+ *level = PG_LEVEL_4K;
+ return pte_offset_kernel(pmd, address);
+}
+
/*
* Lookup the page table entry for a virtual address in a specific pgd.
* Return a pointer to the entry (or NULL if the entry does not exist),
--
2.49.0.rc1.451.g8f38331e32-goog
* Re: [PATCH RFC 03/11] x86/mm: Add lookup_pgtable_in_pgd()
2025-03-13 18:11 ` [PATCH RFC 03/11] x86/mm: Add lookup_pgtable_in_pgd() Brendan Jackman
@ 2025-03-13 22:09 ` Yosry Ahmed
2025-03-14 9:12 ` Brendan Jackman
0 siblings, 1 reply; 18+ messages in thread
From: Yosry Ahmed @ 2025-03-13 22:09 UTC (permalink / raw)
To: Brendan Jackman
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Andrew Morton, David Rientjes, Vlastimil Babka,
David Hildenbrand, linux-kernel, linux-mm, Mike Rapoport,
Junaid Shahid, Reiji Watanabe, Patrick Bellasi
On Thu, Mar 13, 2025 at 06:11:22PM +0000, Brendan Jackman wrote:
> This is the same thing as lookup_address_in_pgd(), but it returns the
> pagetable unconditionally instead of returning NULL when the pagetable
> is none. This will be used for looking up and modifying pages that are
> *_none() in order to map memory into the ASI restricted address space.
>
> For a [PATCH], if this logic is needed, the surrounding code should
> probably first be somewhat refactored. It now looks pretty repetitive,
> and it's confusing that lookup_address_in_pgd() returns NULL when
> pmd_none() but note when pte_none(). For now here's something that
> works.
My first instinct reading this is that lookup_address_in_pgd() should be
calling lookup_pgtable_in_pgd(), but I didn't look too closely.
>
> Signed-off-by: Brendan Jackman <jackmanb@google.com>
> ---
> arch/x86/include/asm/pgtable_types.h | 2 ++
> arch/x86/mm/pat/set_memory.c | 34 ++++++++++++++++++++++++++++++++++
> 2 files changed, 36 insertions(+)
>
> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> index 4b804531b03c3ce5cc48f0a75cb75d58b985777a..e09b509e525534f31c986d705e07b25dd9c04cb7 100644
> --- a/arch/x86/include/asm/pgtable_types.h
> +++ b/arch/x86/include/asm/pgtable_types.h
> @@ -572,6 +572,8 @@ extern pte_t *lookup_address_in_pgd(pgd_t *pgd, unsigned long address,
> unsigned int *level);
> pte_t *lookup_address_in_pgd_attr(pgd_t *pgd, unsigned long address,
> unsigned int *level, bool *nx, bool *rw);
> +extern pte_t *lookup_pgtable_in_pgd(pgd_t *pgd, unsigned long address,
> + unsigned int *level);
> extern pmd_t *lookup_pmd_address(unsigned long address);
> extern phys_addr_t slow_virt_to_phys(void *__address);
> extern int __init kernel_map_pages_in_pgd(pgd_t *pgd, u64 pfn,
> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index ef4514d64c0524e5854fa106e3f37ff1e1ba10a2..d066bf2c9e93e126757bd32a7a666db89b2488b6 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -658,6 +658,40 @@ static inline pgprot_t verify_rwx(pgprot_t old, pgprot_t new, unsigned long star
> return new;
> }
>
> +/*
> + * Lookup the page table entry for a virtual address in a specific pgd. Return
> + * the pointer to the entry, without implying that any mapping actually exists
> + * (the returned value may be zero).
> + */
> +pte_t *lookup_pgtable_in_pgd(pgd_t *pgd, unsigned long address, unsigned int *level)
> +{
> + p4d_t *p4d;
> + pud_t *pud;
> + pmd_t *pmd;
> +
> + *level = PG_LEVEL_256T;
> + if (pgd_none(*pgd))
> + return (pte_t *)pgd;
> +
> + *level = PG_LEVEL_512G;
> + p4d = p4d_offset(pgd, address);
> + if (p4d_none(*p4d) || p4d_leaf(*p4d) || !p4d_present(*p4d))
> + return (pte_t *)p4d;
> +
> + *level = PG_LEVEL_1G;
> + pud = pud_offset(p4d, address);
> + if (pud_none(*pud) || pud_leaf(*pud) || !pud_present(*pud))
> + return (pte_t *)pud;
> +
> + *level = PG_LEVEL_2M;
> + pmd = pmd_offset(pud, address);
> + if (pmd_none(*pmd) || pmd_leaf(*pmd) || !pmd_present(*pmd))
> + return (pte_t *)pmd;
> +
> + *level = PG_LEVEL_4K;
> + return pte_offset_kernel(pmd, address);
> +}
> +
> /*
> * Lookup the page table entry for a virtual address in a specific pgd.
> * Return a pointer to the entry (or NULL if the entry does not exist),
>
> --
> 2.49.0.rc1.451.g8f38331e32-goog
>
* Re: [PATCH RFC 03/11] x86/mm: Add lookup_pgtable_in_pgd()
2025-03-13 22:09 ` Yosry Ahmed
@ 2025-03-14 9:12 ` Brendan Jackman
2025-03-14 17:56 ` Yosry Ahmed
0 siblings, 1 reply; 18+ messages in thread
From: Brendan Jackman @ 2025-03-14 9:12 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Andrew Morton, David Rientjes, Vlastimil Babka,
David Hildenbrand, linux-kernel, linux-mm, Mike Rapoport,
Junaid Shahid, Reiji Watanabe, Patrick Bellasi
On Thu, Mar 13, 2025 at 10:09:21PM +0000, Yosry Ahmed wrote:
> On Thu, Mar 13, 2025 at 06:11:22PM +0000, Brendan Jackman wrote:
> > This is the same thing as lookup_address_in_pgd(), but it returns the
> > pagetable unconditionally instead of returning NULL when the pagetable
> > is none. This will be used for looking up and modifying pages that are
> > *_none() in order to map memory into the ASI restricted address space.
> >
> > For a [PATCH], if this logic is needed, the surrounding code should
> > probably first be somewhat refactored. It now looks pretty repetitive,
> > and it's confusing that lookup_address_in_pgd() returns NULL when
> > pmd_none() but note when pte_none(). For now here's something that
> > works.
>
> My first instinct reading this is that lookup_address_in_pgd() should be
> calling lookup_pgtable_in_pgd(), but I didn't look too closely.
Yeah. That outer function would get a "generic" PTE pointer instead of
a strongly-typed p4d_t/pud_t/etc. So we either need to encode
assumptions that all the page tables have the same structure at
different levels for the bits we care about, or we need to have a
switch(*level) and then be careful about pgtable_l5_enabled(). I
think the former is fine but it needs a bit of care and attention to
ensure we don't miss anything and avoid creating
confusion/antipatterns in the code.
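Roughly what the former could look like, as a totally untested sketch
(note it leans on pte_none() giving the right answer for entries at
every level, which is exactly the assumption that needs checking):

	pte_t *lookup_address_in_pgd(pgd_t *pgd, unsigned long address,
				     unsigned int *level)
	{
		pte_t *entry = lookup_pgtable_in_pgd(pgd, address, level);

		/*
		 * Keep the current behaviour: an empty entry above PTE level
		 * means "no mapping", but an empty PTE is still returned.
		 */
		if (*level != PG_LEVEL_4K && pte_none(*entry))
			return NULL;
		return entry;
	}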
And perhaps more importantly, lookup_address_in_pgd_attr() sets *nx and
*rw based on the level above the entry it returns. E.g. when it
returns a pte_t* it sets *nx based on pmd_flags(). I haven't looked
into why this is.
So yeah overall it needs a bit of research and most likely needs a
couple of prep patches. Hopefully it's possible to do it in a way that
leaves the existing code in a clearer state.
Anyway, I was originally planning not to have asi_map()/asi_unmap() in
asi.c at all, and instead just kinda make set_memory.c natively aware
of ASI somehow. At that point I think this code is probably gonna look
a bit different. That's something I ran out of time for and had to
drop from the scope of this RFC. It's definitely not ideal in this
series that e.g. page_alloc.c, asi.c, and set_memory.c are all
implicitly coupled to one another (i.e. they are all colluding to
ensure asi_[un]map() never has to allocate). Maybe I should've called
this out as a TODO on the cover letter actually.
* Re: [PATCH RFC 03/11] x86/mm: Add lookup_pgtable_in_pgd()
2025-03-14 9:12 ` Brendan Jackman
@ 2025-03-14 17:56 ` Yosry Ahmed
0 siblings, 0 replies; 18+ messages in thread
From: Yosry Ahmed @ 2025-03-14 17:56 UTC (permalink / raw)
To: Brendan Jackman
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Andrew Morton, David Rientjes, Vlastimil Babka,
David Hildenbrand, linux-kernel, linux-mm, Mike Rapoport,
Junaid Shahid, Reiji Watanabe, Patrick Bellasi
On Fri, Mar 14, 2025 at 09:12:04AM +0000, Brendan Jackman wrote:
> On Thu, Mar 13, 2025 at 10:09:21PM +0000, Yosry Ahmed wrote:
> > On Thu, Mar 13, 2025 at 06:11:22PM +0000, Brendan Jackman wrote:
> > > This is the same thing as lookup_address_in_pgd(), but it returns the
> > > pagetable unconditionally instead of returning NULL when the pagetable
> > > is none. This will be used for looking up and modifying pages that are
> > > *_none() in order to map memory into the ASI restricted address space.
> > >
> > > For a [PATCH], if this logic is needed, the surrounding code should
> > > probably first be somewhat refactored. It now looks pretty repetitive,
> > > and it's confusing that lookup_address_in_pgd() returns NULL when
> > > pmd_none() but note when pte_none(). For now here's something that
> > > works.
> >
> > My first instinct reading this is that lookup_address_in_pgd() should be
> > calling lookup_pgtable_in_pgd(), but I didn't look too closely.
>
> Yeah. That outer function would get a "generic" PTE pointer isntead of
> a strongly-typed p4d_t/pud_t/etc. So we either need to encode
> assumptions that all the page tables have the same structure at
> different levels for the bits we care about, or we need to have a
> switch(*level) and then be careful about pgtable_l5_enabled(). I
> think the former is fine but it needs a bit of care and attention to
> ensure we don't miss anything and avoid creating
> confusion/antipatterns in the code.
Hmm another option is to have a common helper that takes in a lot of
parameters to control the exact behavior (e.g. do we check for 'none'?),
and have both lookup_pgtable_in_pgd() and lookup_address_in_pgd() call
it with different parameters.
This could avoid the need for a generic pointer, for example.
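Something like the below, maybe? Just a sketch (the helper name and the
single "want_none_entry" parameter are invented here, and the _attr()
variant would presumably need extra knobs on top):

	static pte_t *__lookup_in_pgd(pgd_t *pgd, unsigned long address,
				      unsigned int *level, bool want_none_entry)
	{
		p4d_t *p4d;
		pud_t *pud;
		pmd_t *pmd;

		*level = PG_LEVEL_256T;
		if (pgd_none(*pgd))
			return want_none_entry ? (pte_t *)pgd : NULL;

		*level = PG_LEVEL_512G;
		p4d = p4d_offset(pgd, address);
		if (p4d_none(*p4d))
			return want_none_entry ? (pte_t *)p4d : NULL;
		if (p4d_leaf(*p4d) || !p4d_present(*p4d))
			return (pte_t *)p4d;

		*level = PG_LEVEL_1G;
		pud = pud_offset(p4d, address);
		if (pud_none(*pud))
			return want_none_entry ? (pte_t *)pud : NULL;
		if (pud_leaf(*pud) || !pud_present(*pud))
			return (pte_t *)pud;

		*level = PG_LEVEL_2M;
		pmd = pmd_offset(pud, address);
		if (pmd_none(*pmd))
			return want_none_entry ? (pte_t *)pmd : NULL;
		if (pmd_leaf(*pmd) || !pmd_present(*pmd))
			return (pte_t *)pmd;

		*level = PG_LEVEL_4K;
		return pte_offset_kernel(pmd, address);
	}

Then lookup_pgtable_in_pgd() passes true and lookup_address_in_pgd()
passes false, and neither caller has to treat the higher-level entries
generically.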
>
> And perhaps more importantly, lookup_adress_in_pgd_attr() sets *nx and
> *rw based on the level above the entry it returns. E.g. when it
> returns a pte_t* it sets *nx pased on pmd_flags(). I haven't looked
> into why this is.
>
> So yeah overall it needs a bit of research and most likely needs a
> couple of prep patches. Hopefully it's possible to do it in a way that
> leaves the existing code in a clearer state.
Agreed.
>
> Anyway, I was originally planning not to have asi_map()/asi_unmap() in
> asi.c at all, and instead just kinda make set_memory.c natively aware
> of ASI somehow. At that point I think this code is probably gonna look
> a bit different. That's something I ran out of time for and had to
> drop from the scope of this RFC. It's definitely not ideal in this
> series that e.g. page_alloc.c, asi.c, and set_memory.c are all
> implicitly coupled to one another (i.e. they are all colluding to
> ensure asi_[un]map() never has to allocate). Maybe I should've called
> this out as a TODO on the cover letter actually.
Looking forward to seeing how this would look like :)
* [PATCH RFC 04/11] x86/mm/asi: Sync physmap into ASI_GLOBAL_NONSENSITIVE
2025-03-13 18:11 [PATCH RFC 00/11] mm: ASI integration for the page allocator Brendan Jackman
` (2 preceding siblings ...)
2025-03-13 18:11 ` [PATCH RFC 03/11] x86/mm: Add lookup_pgtable_in_pgd() Brendan Jackman
@ 2025-03-13 18:11 ` Brendan Jackman
2025-03-13 18:11 ` [PATCH RFC HACKS 05/11] Add asi_map() and asi_unmap() Brendan Jackman
` (7 subsequent siblings)
11 siblings, 0 replies; 18+ messages in thread
From: Brendan Jackman @ 2025-03-13 18:11 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Andrew Morton, David Rientjes, Vlastimil Babka,
David Hildenbrand
Cc: linux-kernel, linux-mm, Mike Rapoport, Junaid Shahid,
Reiji Watanabe, Patrick Bellasi, Brendan Jackman, Yosry Ahmed
Mirror the physmap into the ASI pagetables, but with a maximum
granularity that's guaranteed to allow changing pageblock sensitivity
without having to allocate pagetables.
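As a concrete example (assuming 4K base pages and the usual x86-64
pageblock_order of 9): a pageblock is 2^9 * 4KiB = 2MiB, i.e.
PAGE_SHIFT + pageblock_order = 12 + 9 = 21 = page_level_shift(PG_LEVEL_2M).
Capping the ASI physmap at 2M mappings therefore means no mapping ever
spans a pageblock boundary, so flipping a single pageblock's sensitivity
never has to split a huge mapping and hence never has to allocate
pagetables. The VM_BUG_ON below checks that this relationship holds.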
Signed-off-by: Brendan Jackman <jackmanb@google.com>
---
arch/x86/mm/init_64.c | 30 ++++++++++++++++++++++++++++--
1 file changed, 28 insertions(+), 2 deletions(-)
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 8f75274fddd96b8285aff48493ebad93e30daebe..4ca6bb419b9643b0e72cb5b6da6d905f2b2be84b 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -7,6 +7,7 @@
* Copyright (C) 2002,2003 Andi Kleen <ak@suse.de>
*/
+#include <linux/asi.h>
#include <linux/signal.h>
#include <linux/sched.h>
#include <linux/kernel.h>
@@ -736,7 +737,8 @@ phys_pgd_init(pgd_t *pgd_page, unsigned long paddr_start, unsigned long paddr_en
{
unsigned long vaddr, vaddr_start, vaddr_end, vaddr_next, paddr_last;
- *pgd_changed = false;
+ if (pgd_changed)
+ *pgd_changed = false;
paddr_last = paddr_end;
vaddr = (unsigned long)__va(paddr_start);
@@ -770,9 +772,13 @@ phys_pgd_init(pgd_t *pgd_page, unsigned long paddr_start, unsigned long paddr_en
(pud_t *) p4d, init);
spin_unlock(&init_mm.page_table_lock);
- *pgd_changed = true;
+ if (pgd_changed)
+ *pgd_changed = true;
}
+ if (pgd_changed)
+ sync_global_pgds(vaddr_start, vaddr_end - 1);
+
return paddr_last;
}
@@ -784,9 +790,29 @@ __kernel_physical_mapping_init(unsigned long paddr_start,
{
bool pgd_changed;
unsigned long paddr_last;
+ pgd_t *pgd_asi = asi_pgd(ASI_GLOBAL_NONSENSITIVE);
paddr_last = phys_pgd_init(init_mm.pgd, paddr_start, paddr_end, page_size_mask,
prot, init, &pgd_changed);
+
+ /*
+ * Set up ASI's global-nonsensitive physmap. This needs to be mapped at max
+ * 2M size so that regions can be mapped and unmapped at pageblock
+ * granularity without requiring allocations.
+ */
+ if (pgd_asi) {
+ /*
+ * Since most memory is expected to end up sensitive, start with
+ * everything unmapped in this pagetable. The page allocator
+ * assumes that's the case.
+ */
+ pgprot_t prot_np = __pgprot(pgprot_val(prot) & ~_PAGE_PRESENT);
+
+ VM_BUG_ON((PAGE_SHIFT + pageblock_order) < page_level_shift(PG_LEVEL_2M));
+ phys_pgd_init(pgd_asi, paddr_start, paddr_end, 1 << PG_LEVEL_2M,
+ prot_np, init, NULL);
+ }
+
if (pgd_changed)
sync_global_pgds((unsigned long)__va(paddr_start),
(unsigned long)__va(paddr_end) - 1);
--
2.49.0.rc1.451.g8f38331e32-goog
* [PATCH RFC HACKS 05/11] Add asi_map() and asi_unmap()
2025-03-13 18:11 [PATCH RFC 00/11] mm: ASI integration for the page allocator Brendan Jackman
` (3 preceding siblings ...)
2025-03-13 18:11 ` [PATCH RFC 04/11] x86/mm/asi: Sync physmap into ASI_GLOBAL_NONSENSITIVE Brendan Jackman
@ 2025-03-13 18:11 ` Brendan Jackman
2025-03-13 18:11 ` [PATCH RFC 06/11] mm/page_alloc: Add __GFP_SENSITIVE and always set it Brendan Jackman
` (6 subsequent siblings)
11 siblings, 0 replies; 18+ messages in thread
From: Brendan Jackman @ 2025-03-13 18:11 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Andrew Morton, David Rientjes, Vlastimil Babka,
David Hildenbrand
Cc: linux-kernel, linux-mm, Mike Rapoport, Junaid Shahid,
Reiji Watanabe, Patrick Bellasi, Brendan Jackman, Yosry Ahmed
Do not worry too much about the implementation of this. I am trying to
implement this more neatly using the x86 PAT logic but haven't managed to
get it working in time for the RFC. To enable testing & review, I'm very
hastily throwing something together that basically works, based on a
simplified version of what was used for the latest RFC [0].
[0] https://lore.kernel.org/linux-mm/20250110-asi-rfc-v2-v2-0-8419288bc805@google.com/
Signed-off-by: Brendan Jackman <jackmanb@google.com>
---
arch/x86/include/asm/asi.h | 3 ++
arch/x86/mm/asi.c | 77 ++++++++++++++++++++++++++++++++++++++++++++++
include/linux/asi.h | 2 ++
include/linux/vmalloc.h | 4 +++
mm/vmalloc.c | 32 +++++++++++--------
5 files changed, 105 insertions(+), 13 deletions(-)
diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h
index b8f604df6a36508acbc10710f821d5f95e8cdceb..cf8be544de8b108190b765e3eb337089866207a2 100644
--- a/arch/x86/include/asm/asi.h
+++ b/arch/x86/include/asm/asi.h
@@ -23,6 +23,9 @@ static __always_inline pgd_t *asi_pgd(struct asi *asi)
return asi ? asi->pgd : NULL;
}
+void asi_map(struct page *page, int numpages);
+void asi_unmap(struct page *page, int numpages);
+
#endif /* CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION */
#endif /* _ASM_X86_ASI_H */
diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c
index e5a981a7b3192655cd981633514fbf945b92c9b6..570233224789631352891f47ac2f0453a7adc06e 100644
--- a/arch/x86/mm/asi.c
+++ b/arch/x86/mm/asi.c
@@ -1,8 +1,85 @@
// SPDX-License-Identifier: GPL-2.0
+#include <linux/bug.h>
+#include <linux/gfp.h>
+#include <linux/mm.h>
+#include <linux/mm_types.h>
+#include <linux/pgtable.h>
+#include <linux/set_memory.h>
+#include <linux/vmalloc.h>
+
#include <asm/asi.h>
+#include <asm/traps.h>
static __aligned(PAGE_SIZE) pgd_t asi_global_nonsensitive_pgd[PTRS_PER_PGD];
struct asi __asi_global_nonsensitive = {
.pgd = asi_global_nonsensitive_pgd,
};
+
+/*
+ * Map the given pages into the ASI nonsensitive physmap. The source of the
+ * mapping is the regular unrestricted page tables. Only supports mapping at
+ * pageblock granularity. Does no synchronization.
+ */
+void asi_map(struct page *page, int numpages)
+{
+ unsigned long virt;
+ unsigned long start = (size_t)(page_to_virt(page));
+ unsigned long end = start + PAGE_SIZE * numpages;
+ unsigned long page_size;
+
+ VM_BUG_ON(!IS_ALIGNED(page_to_pfn(page), pageblock_nr_pages));
+ VM_BUG_ON(!IS_ALIGNED(numpages, pageblock_nr_pages));
+
+ for (virt = start; virt < end; virt = ALIGN(virt + 1, page_size)) {
+ pte_t *pte, *pte_asi;
+ int level, level_asi;
+ pgd_t *pgd = pgd_offset_pgd(asi_global_nonsensitive_pgd, virt);
+
+ pte_asi = lookup_pgtable_in_pgd(pgd, virt, &level_asi);
+ page_size = page_level_size(level_asi);
+
+ pte = lookup_address(virt, &level);
+ if (!pte || pte_none(*pte))
+ continue;
+
+ /*
+ * Physmap should already be setup by PAT code, with no pages
+ * smaller than 2M. This function should only be called at
+ * pageblock granularity. Thus it should never be required to
+ * break up pages here.
+ */
+ if (WARN_ON_ONCE(!pte_asi) ||
+ WARN_ON_ONCE(ALIGN_DOWN(virt, page_size) < virt) ||
+ ALIGN(virt, page_size) > end)
+ continue;
+
+ /*
+ * Existing mappings should already match the structure of the
+ * unrestricted physmap.
+ */
+ if (WARN_ON_ONCE(level != level_asi))
+ continue;
+
+ set_pte(pte_asi, *pte);
+ }
+}
+
+/*
+ * Unmap pages previously mapped via asi_map().
+ *
+ * Interrupts must be enabled as this does a TLB shootdown.
+ */
+void asi_unmap(struct page *page, int numpages)
+{
+ size_t start = (size_t)page_to_virt(page);
+ size_t end = start + (PAGE_SIZE * numpages);
+ pgtbl_mod_mask mask = 0;
+
+ VM_BUG_ON(!IS_ALIGNED(page_to_pfn(page), pageblock_nr_pages));
+ VM_BUG_ON(!IS_ALIGNED(numpages, pageblock_nr_pages));
+
+ vunmap_pgd_range(asi_pgd(ASI_GLOBAL_NONSENSITIVE), start, end, &mask);
+
+ flush_tlb_kernel_range(start, end - 1);
+}
diff --git a/include/linux/asi.h b/include/linux/asi.h
index 2d3049d5fe423e139dcce8f3d68cdffcc0ec0bfe..ee9811f04a417556cf2e930644eaf05f3c9bfee3 100644
--- a/include/linux/asi.h
+++ b/include/linux/asi.h
@@ -13,6 +13,8 @@
struct asi {};
static inline pgd_t *asi_pgd(struct asi *asi) { return NULL; }
+static inline void asi_map(struct page *page, int numpages) { }
+static inline void asi_unmap(struct page *page, int numpages) { }
#endif /* CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION */
#endif /* _INCLUDE_ASI_H */
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 31e9ffd936e39334ddaff910222d4751c18da5e7..c498ba127b4a511b5a6f10afa2aae535509fc153 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -11,6 +11,7 @@
#include <asm/page.h> /* pgprot_t */
#include <linux/rbtree.h>
#include <linux/overflow.h>
+#include <linux/pgtable.h>
#include <asm/vmalloc.h>
@@ -324,4 +325,7 @@ bool vmalloc_dump_obj(void *object);
static inline bool vmalloc_dump_obj(void *object) { return false; }
#endif
+void vunmap_pgd_range(pgd_t *pgd_table, unsigned long addr, unsigned long end,
+ pgtbl_mod_mask *mask);
+
#endif /* _LINUX_VMALLOC_H */
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 61981ee1c9d2f769d4a06ab542fc84334c1b0cbd..ffeb823398809388c0599f51929a7f3506ed035f 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -427,6 +427,24 @@ static void vunmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
} while (p4d++, addr = next, addr != end);
}
+void vunmap_pgd_range(pgd_t *pgd_table, unsigned long addr, unsigned long end,
+ pgtbl_mod_mask *mask)
+{
+ unsigned long next;
+ pgd_t *pgd = pgd_offset_pgd(pgd_table, addr);
+
+ BUG_ON(addr >= end);
+
+ do {
+ next = pgd_addr_end(addr, end);
+ if (pgd_bad(*pgd))
+ *mask |= PGTBL_PGD_MODIFIED;
+ if (pgd_none_or_clear_bad(pgd))
+ continue;
+ vunmap_p4d_range(pgd, addr, next, mask);
+ } while (pgd++, addr = next, addr != end);
+}
+
/*
* vunmap_range_noflush is similar to vunmap_range, but does not
* flush caches or TLBs.
@@ -441,21 +459,9 @@ static void vunmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
*/
void __vunmap_range_noflush(unsigned long start, unsigned long end)
{
- unsigned long next;
- pgd_t *pgd;
- unsigned long addr = start;
pgtbl_mod_mask mask = 0;
- BUG_ON(addr >= end);
- pgd = pgd_offset_k(addr);
- do {
- next = pgd_addr_end(addr, end);
- if (pgd_bad(*pgd))
- mask |= PGTBL_PGD_MODIFIED;
- if (pgd_none_or_clear_bad(pgd))
- continue;
- vunmap_p4d_range(pgd, addr, next, &mask);
- } while (pgd++, addr = next, addr != end);
+ vunmap_pgd_range(init_mm.pgd, start, end, &mask);
if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
arch_sync_kernel_mappings(start, end);
--
2.49.0.rc1.451.g8f38331e32-goog
* [PATCH RFC 06/11] mm/page_alloc: Add __GFP_SENSITIVE and always set it
2025-03-13 18:11 [PATCH RFC 00/11] mm: ASI integration for the page allocator Brendan Jackman
` (4 preceding siblings ...)
2025-03-13 18:11 ` [PATCH RFC HACKS 05/11] Add asi_map() and asi_unmap() Brendan Jackman
@ 2025-03-13 18:11 ` Brendan Jackman
2025-03-13 18:11 ` [PATCH RFC HACKS 07/11] mm/slub: Set __GFP_SENSITIVE for reclaimable slabs Brendan Jackman
` (5 subsequent siblings)
11 siblings, 0 replies; 18+ messages in thread
From: Brendan Jackman @ 2025-03-13 18:11 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Andrew Morton, David Rientjes, Vlastimil Babka,
David Hildenbrand
Cc: linux-kernel, linux-mm, Mike Rapoport, Junaid Shahid,
Reiji Watanabe, Patrick Bellasi, Brendan Jackman, Yosry Ahmed
__GFP_SENSITIVE represents that a page should not be mapped into the
ASI restricted address space.
This is added as a GFP flag instead of via some contextual hint, because
its presence is not ultimately expected to correspond to any such
existing context. If necessary, it should be possible to instead achieve
this optionality with something like __alloc_pages_sensitive(), but
this would be much more invasive to the overall kernel.
On startup, all pages are sensitive. Since there is currently no way to
create nonsensitive pages, temporarily set the flag unconditionally at
the top of the allocator.
__GFP_SENSITIVE is also added to GFP_USER since that's the most
important data that ASI needs to protect.
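As a purely illustrative example of what callers look like with this
flag (once the temporary hack below is removed):

	struct page *page;

	/* No __GFP_SENSITIVE: mapped into the ASI restricted address space. */
	page = alloc_pages(GFP_KERNEL, 0);

	/* Sensitive: never visible while running in the restricted address space. */
	page = alloc_pages(GFP_KERNEL | __GFP_SENSITIVE, 0);
	page = alloc_pages(GFP_USER, 0);	/* GFP_USER now implies __GFP_SENSITIVE */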
Signed-off-by: Brendan Jackman <jackmanb@google.com>
---
include/linux/gfp_types.h | 15 ++++++++++++++-
mm/page_alloc.c | 7 +++++++
2 files changed, 21 insertions(+), 1 deletion(-)
diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
index 65db9349f9053c701e24bdcf1dfe6afbf1278a2d..5147dbd53eafdccc32cfd506569b04d5c082d1b2 100644
--- a/include/linux/gfp_types.h
+++ b/include/linux/gfp_types.h
@@ -58,6 +58,7 @@ enum {
#ifdef CONFIG_SLAB_OBJ_EXT
___GFP_NO_OBJ_EXT_BIT,
#endif
+ ___GFP_SENSITIVE_BIT,
___GFP_LAST_BIT
};
@@ -103,6 +104,11 @@ enum {
#else
#define ___GFP_NO_OBJ_EXT 0
#endif
+#ifdef CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION
+#define ___GFP_SENSITIVE BIT(___GFP_SENSITIVE_BIT)
+#else
+#define ___GFP_SENSITIVE 0
+#endif
/*
* Physical address zone modifiers (see linux/mmzone.h - low four bits)
@@ -299,6 +305,12 @@ enum {
/* Disable lockdep for GFP context tracking */
#define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
+/*
+ * Allocate sensitive memory, i.e. do not map it into ASI's restricted address
+ * space.
+ */
+#define __GFP_SENSITIVE ((__force gfp_t)___GFP_SENSITIVE)
+
/* Room for N __GFP_FOO bits */
#define __GFP_BITS_SHIFT ___GFP_LAST_BIT
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
@@ -380,7 +392,8 @@ enum {
#define GFP_NOWAIT (__GFP_KSWAPD_RECLAIM | __GFP_NOWARN)
#define GFP_NOIO (__GFP_RECLAIM)
#define GFP_NOFS (__GFP_RECLAIM | __GFP_IO)
-#define GFP_USER (__GFP_RECLAIM | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
+#define GFP_USER (__GFP_RECLAIM | __GFP_IO | __GFP_FS | \
+ __GFP_HARDWALL | __GFP_SENSITIVE)
#define GFP_DMA __GFP_DMA
#define GFP_DMA32 __GFP_DMA32
#define GFP_HIGHUSER (GFP_USER | __GFP_HIGHMEM)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c9a1936c3070836a3633ebd32d423cd1a27b37a3..aa54ef95233a052b5e79b4994b2879a72ff9acfd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4686,6 +4686,13 @@ struct page *__alloc_frozen_pages_noprof(gfp_t gfp, unsigned int order,
gfp_t alloc_gfp; /* The gfp_t that was actually used for allocation */
struct alloc_context ac = { };
+ /*
+ * Temporary hack: Allocation of nonsensitive pages is not possible yet,
+ * allocate everything sensitive. The restricted address space is never
+ * actually entered yet so this is fine.
+ */
+ gfp |= __GFP_SENSITIVE;
+
/*
* There are several places where we assume that the order value is sane
* so bail out early if the request is out of bound.
--
2.49.0.rc1.451.g8f38331e32-goog
* [PATCH RFC HACKS 07/11] mm/slub: Set __GFP_SENSITIVE for reclaimable slabs
2025-03-13 18:11 [PATCH RFC 00/11] mm: ASI integration for the page allocator Brendan Jackman
` (5 preceding siblings ...)
2025-03-13 18:11 ` [PATCH RFC 06/11] mm/page_alloc: Add __GFP_SENSITIVE and always set it Brendan Jackman
@ 2025-03-13 18:11 ` Brendan Jackman
2025-03-13 18:11 ` [PATCH RFC HACKS 08/11] mm/page_alloc: Simplify gfp_migratetype() Brendan Jackman
` (4 subsequent siblings)
11 siblings, 0 replies; 18+ messages in thread
From: Brendan Jackman @ 2025-03-13 18:11 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Andrew Morton, David Rientjes, Vlastimil Babka,
David Hildenbrand
Cc: linux-kernel, linux-mm, Mike Rapoport, Junaid Shahid,
Reiji Watanabe, Patrick Bellasi, Brendan Jackman, Yosry Ahmed
It's not currently possible to allocate reclaimable, nonsensitive pages.
For the moment, just add __GFP_SENSITIVE.
This will need to be fixed before this can be a [PATCH].
Signed-off-by: Brendan Jackman <jackmanb@google.com>
---
mm/slub.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/mm/slub.c b/mm/slub.c
index 1f50129dcfb3cd1fc76ac9398fa7718cedb42385..132e894e96df20f2e2d69d0b602b4719cdc072f5 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5559,7 +5559,11 @@ static int calculate_sizes(struct kmem_cache_args *args, struct kmem_cache *s)
s->allocflags |= GFP_DMA32;
if (s->flags & SLAB_RECLAIM_ACCOUNT)
- s->allocflags |= __GFP_RECLAIMABLE;
+ /*
+ * TODO: Cannot currently allocate reclaimable, nonsensitive
+ * pages. For the moment, just add __GFP_SENSITIVE.
+ */
+ s->allocflags |= __GFP_RECLAIMABLE | __GFP_SENSITIVE;
/*
* Determine the number of objects per slab
--
2.49.0.rc1.451.g8f38331e32-goog
* [PATCH RFC HACKS 08/11] mm/page_alloc: Simplify gfp_migratetype()
2025-03-13 18:11 [PATCH RFC 00/11] mm: ASI integration for the page allocator Brendan Jackman
` (6 preceding siblings ...)
2025-03-13 18:11 ` [PATCH RFC HACKS 07/11] mm/slub: Set __GFP_SENSITIVE for reclaimable slabs Brendan Jackman
@ 2025-03-13 18:11 ` Brendan Jackman
2025-03-13 18:11 ` [PATCH RFC 09/11] mm/page_alloc: Split MIGRATE_UNMOVABLE by sensitivity Brendan Jackman
` (3 subsequent siblings)
11 siblings, 0 replies; 18+ messages in thread
From: Brendan Jackman @ 2025-03-13 18:11 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Andrew Morton, David Rientjes, Vlastimil Babka,
David Hildenbrand
Cc: linux-kernel, linux-mm, Mike Rapoport, Junaid Shahid,
Reiji Watanabe, Patrick Bellasi, Brendan Jackman, Yosry Ahmed
This currently uses optimised bit-hacks to avoid conditional branches
etc. For the purposes of the RFC, let's not get bogged down in those
details - temporarily just drop the bit hacking.
Signed-off-by: Brendan Jackman <jackmanb@google.com>
---
include/linux/gfp.h | 15 +++++----------
1 file changed, 5 insertions(+), 10 deletions(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 6bb1a5a7a4ae3392c1cd39cb79271e05512adbeb..23289aa54b6c38a71a908e5a6e034828a75a3b66 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -14,25 +14,20 @@ struct mempolicy;
/* Convert GFP flags to their corresponding migrate type */
#define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
-#define GFP_MOVABLE_SHIFT 3
static inline int gfp_migratetype(const gfp_t gfp_flags)
{
VM_WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
- BUILD_BUG_ON((1UL << GFP_MOVABLE_SHIFT) != ___GFP_MOVABLE);
- BUILD_BUG_ON((___GFP_MOVABLE >> GFP_MOVABLE_SHIFT) != MIGRATE_MOVABLE);
- BUILD_BUG_ON((___GFP_RECLAIMABLE >> GFP_MOVABLE_SHIFT) != MIGRATE_RECLAIMABLE);
- BUILD_BUG_ON(((___GFP_MOVABLE | ___GFP_RECLAIMABLE) >>
- GFP_MOVABLE_SHIFT) != MIGRATE_HIGHATOMIC);
if (unlikely(page_group_by_mobility_disabled))
return MIGRATE_UNMOVABLE;
- /* Group based on mobility */
- return (__force unsigned long)(gfp_flags & GFP_MOVABLE_MASK) >> GFP_MOVABLE_SHIFT;
+ switch (gfp_flags & GFP_MOVABLE_MASK) {
+ case __GFP_RECLAIMABLE: return MIGRATE_RECLAIMABLE;
+ case __GFP_MOVABLE: return MIGRATE_MOVABLE;
+ default: return MIGRATE_UNMOVABLE;
+ }
}
-#undef GFP_MOVABLE_MASK
-#undef GFP_MOVABLE_SHIFT
static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
{
--
2.49.0.rc1.451.g8f38331e32-goog
* [PATCH RFC 09/11] mm/page_alloc: Split MIGRATE_UNMOVABLE by sensitivity
2025-03-13 18:11 [PATCH RFC 00/11] mm: ASI integration for the page allocator Brendan Jackman
` (7 preceding siblings ...)
2025-03-13 18:11 ` [PATCH RFC HACKS 08/11] mm/page_alloc: Simplify gfp_migratetype() Brendan Jackman
@ 2025-03-13 18:11 ` Brendan Jackman
2025-03-13 18:11 ` [PATCH RFC 10/11] mm/page_alloc: Add support for nonsensitive allocations Brendan Jackman
` (2 subsequent siblings)
11 siblings, 0 replies; 18+ messages in thread
From: Brendan Jackman @ 2025-03-13 18:11 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Andrew Morton, David Rientjes, Vlastimil Babka,
David Hildenbrand
Cc: linux-kernel, linux-mm, Mike Rapoport, Junaid Shahid,
Reiji Watanabe, Patrick Bellasi, Brendan Jackman, Yosry Ahmed
When ASI is compiled in, create two separate unmovable migratetypes.
MIGRATE_UNMOVABLE_NONSENSITIVE represents blocks that are mapped into
ASI's restricted address space.
MIGRATE_UNMOVABLE becomes MIGRATE_UNMOVABLE_SENSITIVE. All other
migratetypes retain their original meaning and gain the additional
implication that the pageblock is not ASI-mapped.
In future extensions it's likely that more migratetypes will need to
support different sensitivities; if and when that happens a more
invasive change will be needed but for now this should allow developing
the necessary allocator logic for flipping sensitivities by modifying
the ASI page tables.
For builds with ASI disabled, the two new migratetypes are aliases for
one another. Some code needs to be aware of this aliasing (for example,
the 'fallbacks' array needs an ifdef for the entries that would
otherwise alias) while other code doesn't (for example,
set_pageblock_migratetype() just works regardless).
Since there is now a migratetype below MIGRATE_PCPTYPES with no
fallbacks, the per-migratetype 'fallbacks' lists are no longer all the same
length, so terminate them with -1 instead of relying on a fixed size. On
non-ASI builds, the new 'if (fallback_mt < 0)' in
find_suitable_fallback() is provably always false and can be eliminated
by the compiler. Clang 20 seems to be smart enough to do this
regardless, but add a 'const' qualifier to the arrays to try and
increase confidence anyway.
Signed-off-by: Brendan Jackman <jackmanb@google.com>
---
include/linux/gfp.h | 21 +++++++++++++++++----
include/linux/mmzone.h | 19 ++++++++++++++++++-
mm/memory_hotplug.c | 2 +-
mm/page_alloc.c | 22 +++++++++++++++-------
mm/show_mem.c | 13 +++++++------
5 files changed, 58 insertions(+), 19 deletions(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 23289aa54b6c38a71a908e5a6e034828a75a3b66..0253dcfb66cbaa82414528a266271d719bc09d6a 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -20,13 +20,26 @@ static inline int gfp_migratetype(const gfp_t gfp_flags)
VM_WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
if (unlikely(page_group_by_mobility_disabled))
- return MIGRATE_UNMOVABLE;
+ goto unmovable;
+
+ /* Only unmovable/unreclaimable pages can be nonsensitive right now. */
+ VM_WARN_ONCE((gfp_flags & GFP_MOVABLE_MASK) && !(gfp_flags & __GFP_SENSITIVE),
+ "%pGg", &gfp_flags);
switch (gfp_flags & GFP_MOVABLE_MASK) {
- case __GFP_RECLAIMABLE: return MIGRATE_RECLAIMABLE;
- case __GFP_MOVABLE: return MIGRATE_MOVABLE;
- default: return MIGRATE_UNMOVABLE;
+ case __GFP_RECLAIMABLE:
+ return MIGRATE_RECLAIMABLE;
+ case __GFP_MOVABLE:
+ return MIGRATE_MOVABLE;
+ default:
+ break;
}
+
+unmovable:
+ if (gfp_flags & __GFP_SENSITIVE)
+ return MIGRATE_UNMOVABLE_SENSITIVE;
+ else
+ return MIGRATE_UNMOVABLE_NONSENSITIVE;
}
static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9540b41894da6d67751506614e90bed142e127b4..b8215b921b71c0b1cbda2d262664b1ee2a0cb750 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -46,7 +46,19 @@
#define PAGE_ALLOC_COSTLY_ORDER 3
enum migratetype {
- MIGRATE_UNMOVABLE,
+ /*
+ * All movable pages are sensitive for ASI. Unmovable pages might be
+ * either; the migratetype reflects whether they are mapped into the
+ * global-nonsensitive address space.
+ *
+ * TODO: what about HIGHATOMIC/RECLAIMABLE?
+ */
+ MIGRATE_UNMOVABLE_SENSITIVE,
+#ifdef CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION
+ MIGRATE_UNMOVABLE_NONSENSITIVE,
+#else
+ MIGRATE_UNMOVABLE_NONSENSITIVE = MIGRATE_UNMOVABLE_SENSITIVE,
+#endif
MIGRATE_MOVABLE,
MIGRATE_RECLAIMABLE,
MIGRATE_PCPTYPES, /* the number of types on the pcp lists */
@@ -89,6 +101,11 @@ static inline bool is_migrate_movable(int mt)
return is_migrate_cma(mt) || mt == MIGRATE_MOVABLE;
}
+static inline bool is_migrate_unmovable(int mt)
+{
+ return mt == MIGRATE_UNMOVABLE_SENSITIVE || mt == MIGRATE_UNMOVABLE_NONSENSITIVE;
+}
+
/*
* Check whether a migratetype can be merged with another migratetype.
*
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 16cf9e17077e359b98a69dc4bca48f4575b9a28c..28092c51c7a49546f5b7961aaf1500c72de88337 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1127,7 +1127,7 @@ int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
if (mhp_off_inaccessible)
page_init_poison(pfn_to_page(pfn), sizeof(struct page) * nr_pages);
- move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_UNMOVABLE);
+ move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_UNMOVABLE_SENSITIVE);
for (i = 0; i < nr_pages; i++) {
struct page *page = pfn_to_page(pfn + i);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index aa54ef95233a052b5e79b4994b2879a72ff9acfd..ae711025da15823e97cc8b63e0277fc41b5ca0f8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -418,8 +418,9 @@ void set_pfnblock_flags_mask(struct page *page, unsigned long flags,
void set_pageblock_migratetype(struct page *page, int migratetype)
{
if (unlikely(page_group_by_mobility_disabled &&
- migratetype < MIGRATE_PCPTYPES))
- migratetype = MIGRATE_UNMOVABLE;
+ migratetype < MIGRATE_PCPTYPES &&
+ migratetype != MIGRATE_UNMOVABLE_NONSENSITIVE))
+ migratetype = MIGRATE_UNMOVABLE_SENSITIVE;
set_pfnblock_flags_mask(page, (unsigned long)migratetype,
page_to_pfn(page), MIGRATETYPE_MASK);
@@ -1610,10 +1611,14 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
*
* The other migratetypes do not have fallbacks.
*/
-static int fallbacks[MIGRATE_PCPTYPES][MIGRATE_PCPTYPES - 1] = {
- [MIGRATE_UNMOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE },
- [MIGRATE_MOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE },
- [MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE },
+static const int fallbacks[MIGRATE_PCPTYPES][MIGRATE_PCPTYPES - 1] = {
+ [MIGRATE_UNMOVABLE_SENSITIVE] = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE },
+#ifdef CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION
+ /* TODO: Cannot fallback from nonsensitive */
+ [MIGRATE_UNMOVABLE_NONSENSITIVE] = { -1 },
+#endif
+ [MIGRATE_MOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE_SENSITIVE },
+ [MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE_SENSITIVE, MIGRATE_MOVABLE },
};
#ifdef CONFIG_CMA
@@ -1893,7 +1898,7 @@ static bool should_try_claim_block(unsigned int order, int start_mt)
* allocation size. Later movable allocations can always steal from this
* block, which is less problematic.
*/
- if (start_mt == MIGRATE_RECLAIMABLE || start_mt == MIGRATE_UNMOVABLE)
+ if (start_mt == MIGRATE_RECLAIMABLE || is_migrate_unmovable(start_mt))
return true;
if (page_group_by_mobility_disabled)
@@ -1929,6 +1934,9 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
*claim_block = false;
for (i = 0; i < MIGRATE_PCPTYPES - 1 ; i++) {
fallback_mt = fallbacks[migratetype][i];
+ if (fallback_mt < 0)
+ return fallback_mt;
+
if (free_area_empty(area, fallback_mt))
continue;
diff --git a/mm/show_mem.c b/mm/show_mem.c
index 43afb56abbd3e1f436452ef551dd8149545ccb13..59af6fb286354156cc2aa258fdd78c85797248a6 100644
--- a/mm/show_mem.c
+++ b/mm/show_mem.c
@@ -141,15 +141,16 @@ static bool show_mem_node_skip(unsigned int flags, int nid, nodemask_t *nodemask
static void show_migration_types(unsigned char type)
{
static const char types[MIGRATE_TYPES] = {
- [MIGRATE_UNMOVABLE] = 'U',
- [MIGRATE_MOVABLE] = 'M',
- [MIGRATE_RECLAIMABLE] = 'E',
- [MIGRATE_HIGHATOMIC] = 'H',
+ [MIGRATE_UNMOVABLE_SENSITIVE] = 'S',
+ [MIGRATE_UNMOVABLE_NONSENSITIVE] = 'N',
+ [MIGRATE_MOVABLE] = 'M',
+ [MIGRATE_RECLAIMABLE] = 'E',
+ [MIGRATE_HIGHATOMIC] = 'H',
#ifdef CONFIG_CMA
- [MIGRATE_CMA] = 'C',
+ [MIGRATE_CMA] = 'C',
#endif
#ifdef CONFIG_MEMORY_ISOLATION
- [MIGRATE_ISOLATE] = 'I',
+ [MIGRATE_ISOLATE] = 'I',
#endif
};
char tmp[MIGRATE_TYPES + 1];
--
2.49.0.rc1.451.g8f38331e32-goog
* [PATCH RFC 10/11] mm/page_alloc: Add support for nonsensitive allocations
2025-03-13 18:11 [PATCH RFC 00/11] mm: ASI integration for the page allocator Brendan Jackman
` (8 preceding siblings ...)
2025-03-13 18:11 ` [PATCH RFC 09/11] mm/page_alloc: Split MIGRATE_UNMOVABLE by sensitivity Brendan Jackman
@ 2025-03-13 18:11 ` Brendan Jackman
2025-03-13 18:11 ` [PATCH RFC 11/11] mm/page_alloc: Add support for ASI-unmapping pages Brendan Jackman
2025-06-10 17:04 ` [PATCH RFC 00/11] mm: ASI integration for the page allocator Brendan Jackman
11 siblings, 0 replies; 18+ messages in thread
From: Brendan Jackman @ 2025-03-13 18:11 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Andrew Morton, David Rientjes, Vlastimil Babka,
David Hildenbrand
Cc: linux-kernel, linux-mm, Mike Rapoport, Junaid Shahid,
Reiji Watanabe, Patrick Bellasi, Brendan Jackman, Yosry Ahmed
Creating nonsensitive pages from sensitive ones is pretty easy: just
call asi_map(). Since this doesn't require any allocation, and the
necessary synchronization is already implied by the zone lock, just do
it immediately in the place where the pageblock's migratetype is
updated.
Implement a migratetype fallback mechanism to let this happen,
restricted to allocations that cover a whole pageblock (sensitivity can
only change for an entire block).
Now that it's possible to create nonsensitive pages, the allocator
supports non-__GFP_SENSITIVE allocations, so remove the temporary hack
setting this flag unconditionally in the entry point.
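To illustrate (this snippet is not part of the patch, and assumes the
__GFP_SENSITIVE semantics introduced earlier in this series), callers
can now choose per-allocation whether a page is allowed to be mapped
into the restricted address space:

	/* Must never be visible in the restricted address space. */
	struct page *secret = alloc_pages(GFP_KERNEL | __GFP_SENSITIVE, 0);

	/* May be asi_map()ed as globally nonsensitive. */
	struct page *scratch = alloc_pages(GFP_KERNEL, 0);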
Signed-off-by: Brendan Jackman <jackmanb@google.com>
---
arch/x86/include/asm/asi.h | 5 ++++
mm/page_alloc.c | 67 +++++++++++++++++++++++++++++++++++-----------
2 files changed, 57 insertions(+), 15 deletions(-)
diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h
index cf8be544de8b108190b765e3eb337089866207a2..4208b60852ace25fc9716c276bd86bf920ab3340 100644
--- a/arch/x86/include/asm/asi.h
+++ b/arch/x86/include/asm/asi.h
@@ -2,7 +2,12 @@
#ifndef _ASM_X86_ASI_H
#define _ASM_X86_ASI_H
+#include <linux/align.h>
+#include <linux/mm.h>
+#include <linux/mmdebug.h>
+
#include <asm/pgtable_types.h>
+#include <asm/set_memory.h>
#ifdef CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ae711025da15823e97cc8b63e0277fc41b5ca0f8..0d8bbad8675c99282f308c4a4122d5d9c4b14dae 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -15,6 +15,7 @@
* (lots of bits borrowed from Ingo Molnar & Andrew Morton)
*/
+#include <linux/asi.h>
#include <linux/stddef.h>
#include <linux/mm.h>
#include <linux/highmem.h>
@@ -422,6 +423,18 @@ void set_pageblock_migratetype(struct page *page, int migratetype)
migratetype != MIGRATE_UNMOVABLE_NONSENSITIVE))
migratetype = MIGRATE_UNMOVABLE_SENSITIVE;
+ /*
+ * TODO: doing this here makes this code pretty confusing, since in the
+ * asi_unmap() case it's up to the caller. Need to find a way to express
+ * this with more symmetry.
+ *
+ * TODO: Also need to ensure the pages have been zeroed since they were
+ * last in userspace. Need to figure out how to do that without too much
+ * cost.
+ */
+ if (migratetype == MIGRATE_UNMOVABLE_NONSENSITIVE)
+ asi_map(page, pageblock_nr_pages);
+
set_pfnblock_flags_mask(page, (unsigned long)migratetype,
page_to_pfn(page), MIGRATETYPE_MASK);
}
@@ -1606,19 +1619,35 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
/*
- * This array describes the order lists are fallen back to when
- * the free lists for the desirable migrate type are depleted
+ * This array describes the order lists are fallen back to when the free lists
+ * for the desirable migrate type are depleted. When ASI is enabled, different
+ * migratetypes have different numbers of fallbacks available; in that case the
+ * arrays are terminated early by -1.
*
* The other migratetypes do not have fallbacks.
+ *
+ * Note there are no fallbacks from sensitive to nonsensitive migratetypes.
*/
-static const int fallbacks[MIGRATE_PCPTYPES][MIGRATE_PCPTYPES - 1] = {
- [MIGRATE_UNMOVABLE_SENSITIVE] = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE },
#ifdef CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION
- /* TODO: Cannot fallback from nonsensitive */
- [MIGRATE_UNMOVABLE_NONSENSITIVE] = { -1 },
+#define TERMINATE_FALLBACK -1
+#else
+#define TERMINATE_FALLBACK
#endif
- [MIGRATE_MOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE_SENSITIVE },
- [MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE_SENSITIVE, MIGRATE_MOVABLE },
+static const int fallbacks[MIGRATE_PCPTYPES][MIGRATE_PCPTYPES - 1] = {
+ [MIGRATE_UNMOVABLE_SENSITIVE] = {
+ MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE, TERMINATE_FALLBACK
+ },
+#ifdef CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION
+ [MIGRATE_UNMOVABLE_NONSENSITIVE] = {
+ MIGRATE_UNMOVABLE_SENSITIVE, MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE
+ },
+#endif
+ [MIGRATE_MOVABLE] = {
+ MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE_SENSITIVE, TERMINATE_FALLBACK
+ },
+ [MIGRATE_RECLAIMABLE] = {
+ MIGRATE_UNMOVABLE_SENSITIVE, MIGRATE_MOVABLE, TERMINATE_FALLBACK
+ },
};
#ifdef CONFIG_CMA
@@ -1931,6 +1960,21 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
if (area->nr_free == 0)
return -1;
+ /*
+ * It's not possible to change the sensitivity of individual pages,
+ * only an entire pageblock. Thus, we can only fall back to a migratetype
+ * of a different sensitivity when the entire block is free.
+ *
+ * This cross-sensitivity fallback occurs exactly when the start
+ * migratetype is MIGRATE_UNMOVABLE_NONSENSITIVE. This is because it's
+ * the only nonsensitive migratetype (so if it's the start type, a
+ * fallback will always differ in sensitivity) and it doesn't appear in
+ * the 'fallbacks' arrays (i.e. we never fall back to it, so if it isn't
+ * the start type, the fallback will never differ in sensitivity).
+ */
+ if (migratetype == MIGRATE_UNMOVABLE_NONSENSITIVE && order < pageblock_order)
+ return -1;
+
*claim_block = false;
for (i = 0; i < MIGRATE_PCPTYPES - 1 ; i++) {
fallback_mt = fallbacks[migratetype][i];
@@ -4694,13 +4738,6 @@ struct page *__alloc_frozen_pages_noprof(gfp_t gfp, unsigned int order,
gfp_t alloc_gfp; /* The gfp_t that was actually used for allocation */
struct alloc_context ac = { };
- /*
- * Temporary hack: Allocation of nonsensitive pages is not possible yet,
- * allocate everything sensitive. The restricted address space is never
- * actually entered yet so this is fine.
- */
- gfp |= __GFP_SENSITIVE;
-
/*
* There are several places where we assume that the order value is sane
* so bail out early if the request is out of bound.
--
2.49.0.rc1.451.g8f38331e32-goog
* [PATCH RFC 11/11] mm/page_alloc: Add support for ASI-unmapping pages
2025-03-13 18:11 [PATCH RFC 00/11] mm: ASI integration for the page allocator Brendan Jackman
` (9 preceding siblings ...)
2025-03-13 18:11 ` [PATCH RFC 10/11] mm/page_alloc: Add support for nonsensitive allocations Brendan Jackman
@ 2025-03-13 18:11 ` Brendan Jackman
2025-06-10 17:04 ` [PATCH RFC 00/11] mm: ASI integration for the page allocator Brendan Jackman
11 siblings, 0 replies; 18+ messages in thread
From: Brendan Jackman @ 2025-03-13 18:11 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Andrew Morton, David Rientjes, Vlastimil Babka,
David Hildenbrand
Cc: linux-kernel, linux-mm, Mike Rapoport, Junaid Shahid,
Reiji Watanabe, Patrick Bellasi, Brendan Jackman, Yosry Ahmed
While calling asi_map() is pretty easy, to unmap pages we need to ensure
a TLB shootdown is complete before we allow them to be allocated.
Therefore, treat this as a special case of buddy allocation. Allocate an
entire block, release the zone lock and enable interrupts, then do the
unmap and TLB shootdown. Once that's complete, return any unwanted pages
within the block to the freelists.
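In outline, the new __rmqueue_asi_unmap() helper below does roughly the
following (a simplified sketch; NULL checks, the CONFIG-off stub and the
case where the request already covers the whole block are elided):

	/* Take at least a whole pageblock off the nonsensitive freelists. */
	alloc_order = max(request_order, pageblock_order);
	spin_lock_irq(&zone->lock);
	page = __rmqueue_smallest(zone, alloc_order, MIGRATE_UNMOVABLE_NONSENSITIVE);
	spin_unlock_irq(&zone->lock);

	/* With IRQs on, unmap the block and wait for the TLB shootdown. */
	asi_unmap(page, 1 << alloc_order);
	change_pageblock_range(page, alloc_order, migratetype);

	/* Give the unwanted tail of the block back to the buddy freelists. */
	spin_lock_irq(&zone->lock);
	for (i = request_order; i < alloc_order; i++)
		__free_one_page(page + (1 << i), page_to_pfn(page + (1 << i)),
				zone, i, migratetype, FPI_SKIP_REPORT_NOTIFY);
	spin_unlock_irq(&zone->lock);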
Signed-off-by: Brendan Jackman <jackmanb@google.com>
---
mm/internal.h | 5 ++++
mm/page_alloc.c | 88 ++++++++++++++++++++++++++++++++++++++++++++++++++++-----
2 files changed, 86 insertions(+), 7 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index b82ab345fb994b7c4971e550556e24bb68f315f6..7904be86fa2c7fded62c100d84fe572c15407ccf 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1189,6 +1189,11 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
#endif
#define ALLOC_HIGHATOMIC 0x200 /* Allows access to MIGRATE_HIGHATOMIC */
#define ALLOC_KSWAPD 0x800 /* allow waking of kswapd, __GFP_KSWAPD_RECLAIM set */
+#ifdef CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION
+#define ALLOC_ASI_UNMAP 0x1000 /* allow asi_unmap(), requiring TLB shootdown. */
+#else
+#define ALLOC_ASI_UNMAP 0x0
+#endif
/* Flags that allow allocations below the min watermark. */
#define ALLOC_RESERVES (ALLOC_NON_BLOCK|ALLOC_MIN_RESERVE|ALLOC_HIGHATOMIC|ALLOC_OOM)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0d8bbad8675c99282f308c4a4122d5d9c4b14dae..9ac883d7a71387d291bc04bad675e2545dd7ba0f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1627,6 +1627,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
* The other migratetypes do not have fallbacks.
*
* Note there are no fallbacks from sensitive to nonsensitive migratetypes.
+ * That's instead handled as a totally special case in __rmqueue_asi_unmap().
*/
#ifdef CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION
#define TERMINATE_FALLBACK -1
@@ -2790,7 +2791,77 @@ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z,
#endif
}
-static __always_inline
+#ifdef CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION
+/*
+ * Allocate a page by converting some nonsensitive memory to sensitive via an
+ * ASI unmap. This can't be done via __rmqueue_fallback() because the unmap
+ * requires a TLB flush, which can only be done with interrupts on.
+ */
+static inline
+struct page *__rmqueue_asi_unmap(struct zone *zone, unsigned int request_order,
+ unsigned int alloc_flags, int migratetype)
+{
+ struct page *page;
+ int alloc_order;
+ int i;
+
+ lockdep_assert_irqs_enabled();
+
+ if (!(alloc_flags & ALLOC_ASI_UNMAP))
+ return NULL;
+
+ VM_WARN_ON_ONCE(migratetype == MIGRATE_UNMOVABLE_NONSENSITIVE);
+
+ /*
+ * Need to unmap a whole pageblock (otherwise it might require
+ * allocating pagetables). Can't do that with the zone lock held, but we
+ * also can't flip the block's migratetype until the flush is complete,
+ * otherwise any concurrent sensitive allocations could momentarily leak
+ * data into the restricted address space. As a simple workaround,
+ * "allocate" at least the whole block, unmap it (with IRQs enabled),
+ * then free any remainder of the block again.
+ *
+ * An alternative to this could be to synchronize an intermediate state
+ * on the pageblock (since this code can't be called in an IRQ,
+ * this shouldn't be too bad - it's likely OK to just busy-wait until
+ * any concurrent TLB flush completes).
+ */
+
+ alloc_order = max(request_order, pageblock_order);
+ spin_lock_irq(&zone->lock);
+ page = __rmqueue_smallest(zone, alloc_order, MIGRATE_UNMOVABLE_NONSENSITIVE);
+ spin_unlock_irq(&zone->lock);
+ if (!page)
+ return NULL;
+
+ asi_unmap(page, 1 << alloc_order);
+
+ change_pageblock_range(page, alloc_order, migratetype);
+
+ if (request_order >= alloc_order)
+ return page;
+
+ spin_lock_irq(&zone->lock);
+ for (i = request_order; i < alloc_order; i++) {
+ struct page *page_to_free = page + (1 << i);
+
+ __free_one_page(page_to_free, page_to_pfn(page_to_free), zone, i,
+ migratetype, FPI_SKIP_REPORT_NOTIFY);
+ }
+ spin_unlock_irq(&zone->lock);
+
+ return page;
+}
+#else
+static inline
+struct page *__rmqueue_asi_unmap(struct zone *zone, unsigned int order,
+ unsigned int alloc_flags, int migratetype)
+{
+ return NULL;
+}
+#endif
+
+static noinline
struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
unsigned int order, unsigned int alloc_flags,
int migratetype)
@@ -2814,13 +2885,14 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
*/
if (!page && (alloc_flags & (ALLOC_OOM|ALLOC_NON_BLOCK)))
page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
-
- if (!page) {
- spin_unlock_irqrestore(&zone->lock, flags);
- return NULL;
- }
}
spin_unlock_irqrestore(&zone->lock, flags);
+
+ if (!page)
+ page = __rmqueue_asi_unmap(zone, order, alloc_flags, migratetype);
+
+ if (!page)
+ return NULL;
} while (check_new_pages(page, order));
__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
@@ -3356,6 +3428,8 @@ static inline unsigned int gfp_to_alloc_flags_cma(gfp_t gfp_mask,
if (gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
alloc_flags |= ALLOC_CMA;
#endif
+ if (gfp_mask & __GFP_SENSITIVE && gfpflags_allow_blocking(gfp_mask))
+ alloc_flags |= ALLOC_ASI_UNMAP;
return alloc_flags;
}
@@ -4382,7 +4456,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
if (reserve_flags)
alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, reserve_flags) |
- (alloc_flags & ALLOC_KSWAPD);
+ (alloc_flags & (ALLOC_KSWAPD | ALLOC_ASI_UNMAP));
/*
* Reset the nodemask and zonelist iterators if memory policies can be
--
2.49.0.rc1.451.g8f38331e32-goog
* Re: [PATCH RFC 00/11] mm: ASI integration for the page allocator
2025-03-13 18:11 [PATCH RFC 00/11] mm: ASI integration for the page allocator Brendan Jackman
` (10 preceding siblings ...)
2025-03-13 18:11 ` [PATCH RFC 11/11] mm/page_alloc: Add support for ASI-unmapping pages Brendan Jackman
@ 2025-06-10 17:04 ` Brendan Jackman
11 siblings, 0 replies; 18+ messages in thread
From: Brendan Jackman @ 2025-06-10 17:04 UTC (permalink / raw)
To: Brendan Jackman, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, Andrew Morton, David Rientjes, Vlastimil Babka,
David Hildenbrand
Cc: linux-kernel, linux-mm, Mike Rapoport, Junaid Shahid,
Reiji Watanabe, Patrick Bellasi, Yosry Ahmed
On Thu Mar 13, 2025 at 6:11 PM UTC, Brendan Jackman wrote:
> .:: Patchset overview
Hey all, I have been down the pagetable mines lately trying to figure
out a solution to the page cache issue (the 70% FIO degradation [0]).
I've got a prototype based on the idea I discussed at LSF/MM/BPF
that's slowly coming together. My hope is that as soon as I can
convincingly claim with a straight face that I know how to solve that
problem, I can transition from <post an RFC every N months then
disappear> mode into being a bit more visible with development
iterations...
[0] https://lore.kernel.org/linux-mm/20250129144320.2675822-1-jackmanb@google.com/
In the meantime, I am still provisionally planning to make the topic
of this RFC the first [PATCH] series for ASI. Obviously before I can
seriously ask Andrew to merge I'll also need to establish some
consensus on the x86 side, but in the meantime I think we're getting
close enough to start discussing the mm code.
So.. does anyone have a bit of time to look over this and see if the
implementation makes sense? Is the basic idea on the right lines?
Also, if there's anything I can do to make that easier (is it worth
rebasing?), let me know.
I guess I should also note my aspirational plan for the next few
months; it goes...
1. Get a convincing PoC working that improves the FIO degradation.
2. Gather it into a fairly messy but at least surveyable branch and push
that to Github or whatever.
3. Show that to x86 folks and hopefully (!!) get some maintainers to
give a nod like "yep we want ASI and we're more or less sold that
the developers know how to make it performant".
4. Turn this [RFC] into a [PATCH]. So start by trying to merge the stuff
that manages the restricted address space, leaving the logic of actually
_using_ it for a later series.
5. [Maybe this can be partially parallelised with 4] start a new [PATCH]
series that starts adding in the x86 stuff to actually switch address
spaces etc. Basically this means respinning the patches that Boris
has reviewed in [1]. Since we already have the page_alloc stuff, it
should be possible to start testing this code end-to-end quickly.
[1] https://lore.kernel.org/linux-mm/20250110-asi-rfc-v2-v2-0-8419288bc805@google.com/
Anyone have any thoughts on that overall strategy?
Cheers,
Brendan