* [PATCH v5 0/8] Fix stale IOTLB entries for kernel address space
@ 2025-09-19 5:39 Lu Baolu
2025-09-19 5:39 ` [PATCH v5 1/8] mm: Add a ptdesc flag to mark kernel page tables Lu Baolu
` (8 more replies)
0 siblings, 9 replies; 29+ messages in thread
From: Lu Baolu @ 2025-09-19 5:39 UTC (permalink / raw)
To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Alistair Popple,
Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
Andy Lutomirski, Yi Lai
Cc: iommu, security, x86, linux-mm, linux-kernel, Lu Baolu
This series proposes a fix for a security vulnerability related to IOMMU Shared
Virtual Addressing (SVA). In an SVA context, an IOMMU can cache kernel
page table entries. When a kernel page table page is freed and
reallocated for another purpose, the IOMMU might still hold stale,
incorrect entries. This can be exploited to cause a use-after-free or
write-after-free condition, potentially leading to privilege escalation
or data corruption.
This solution introduces a deferred freeing mechanism for kernel page
table pages, which provides a safe window to notify the IOMMU to
invalidate its caches before the page is reused.
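In outline, the freeing path after this series looks like the below
(condensed from patches 4-6; the names match the actual patches, but
this is a sketch rather than the exact code):

	static inline void pagetable_free(struct ptdesc *pt)
	{
		if (ptdesc_test_kernel(pt)) {
			ptdesc_clear_kernel(pt);
			/*
			 * Defer: queue the page to a workqueue. The
			 * worker invalidates the IOMMU caches for the
			 * kernel address space, then frees the pages.
			 */
			pagetable_free_kernel(pt);
		} else {
			/* Userspace page tables are freed immediately. */
			__pagetable_free(pt);
		}
	}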
Change log:
v5:
- Renamed pagetable_free_async() to pagetable_free_kernel() to avoid
confusion.
- Removed list_del() when the list is on the stack, as it will be freed
when the function returns.
- Discussed a corner case related to memory unplug of memory that was
present as reserved memory at boot. Given that it's extremely rare
and cannot be triggered by unprivileged users, we decided to focus
our efforts on the common vfree() case and noted that corner case in
the commit message.
- Some cleanups.
v4:
- https://lore.kernel.org/linux-iommu/20250905055103.3821518-1-baolu.lu@linux.intel.com/
- Introduce a mechanism to defer the freeing of page-table pages for
KVA mappings. Call iommu_sva_invalidate_kva_range() in the deferred
work thread before freeing the pages.
v3:
- https://lore.kernel.org/linux-iommu/20250806052505.3113108-1-baolu.lu@linux.intel.com/
- iommu_sva_mms is an unbound list; iterating it in an atomic context
could introduce significant latency issues. Schedule it in a kernel
thread and replace the spinlock with a mutex.
- Replace the static key with a normal bool; it can be brought back if
data shows the benefit.
- Invalidate KVA range in the flush_tlb_all() paths.
- All previous reviewed-bys are preserved. Please let me know if there
are any objections.
v2:
- https://lore.kernel.org/linux-iommu/20250709062800.651521-1-baolu.lu@linux.intel.com/
- Remove EXPORT_SYMBOL_GPL(iommu_sva_invalidate_kva_range);
- Replace the mutex with a spinlock to make the interface usable in the
critical regions.
v1: https://lore.kernel.org/linux-iommu/20250704133056.4023816-1-baolu.lu@linux.intel.com/
Dave Hansen (6):
mm: Add a ptdesc flag to mark kernel page tables
mm: Actually mark kernel page table pages
x86/mm: Use 'ptdesc' when freeing PMD pages
mm: Introduce pure page table freeing function
mm: Introduce deferred freeing for kernel page tables
mm: Hook up Kconfig options for async page table freeing
Lu Baolu (2):
x86/mm: Use pagetable_free()
iommu/sva: Invalidate stale IOTLB entries for kernel address space
arch/x86/Kconfig | 1 +
arch/x86/mm/init_64.c | 2 +-
arch/x86/mm/pat/set_memory.c | 2 +-
arch/x86/mm/pgtable.c | 12 ++++-----
drivers/iommu/iommu-sva.c | 29 +++++++++++++++++++++-
include/asm-generic/pgalloc.h | 18 ++++++++++++++
include/linux/iommu.h | 4 +++
include/linux/mm.h | 24 +++++++++++++++---
include/linux/page-flags.h | 46 +++++++++++++++++++++++++++++++++++
mm/Kconfig | 3 +++
mm/pgtable-generic.c | 39 +++++++++++++++++++++++++++++
11 files changed, 168 insertions(+), 12 deletions(-)
--
2.43.0
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH v5 1/8] mm: Add a ptdesc flag to mark kernel page tables
2025-09-19 5:39 [PATCH v5 0/8] Fix stale IOTLB entries for kernel address space Lu Baolu
@ 2025-09-19 5:39 ` Lu Baolu
2025-10-08 19:56 ` Matthew Wilcox
2025-09-19 5:40 ` [PATCH v5 2/8] mm: Actually mark kernel page table pages Lu Baolu
` (7 subsequent siblings)
8 siblings, 1 reply; 29+ messages in thread
From: Lu Baolu @ 2025-09-19 5:39 UTC (permalink / raw)
To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Alistair Popple,
Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
Andy Lutomirski, Yi Lai
Cc: iommu, security, x86, linux-mm, linux-kernel, Dave Hansen, Lu Baolu
From: Dave Hansen <dave.hansen@linux.intel.com>
The page tables used to map the kernel and userspace often have very
different handling rules. There are frequently *_kernel() variants of
functions just for kernel page tables. That's not great and has led
to code duplication.
Instead of having completely separate call paths, allow a 'ptdesc' to
be marked as being for kernel mappings. Introduce helpers to set and
clear this status.
Note: this uses the PG_referenced bit. Page flags are a great fit for
this since it is truly a single bit of information. Use PG_referenced
itself because it's a fairly benign flag (as opposed to things like
PG_locked). It's also (according to Willy) unlikely to go away any time
soon.
PG_referenced is not in PAGE_FLAGS_CHECK_AT_FREE. It does not need to
be cleared before freeing the page, and pages coming out of the
allocator should have it cleared. Regardless, introduce an API to
clear it anyway. Having symmetry in the API makes it easier to change
the underlying implementation later, like if there was a need to move
to a PAGE_FLAGS_CHECK_AT_FREE bit.
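For context, the next patch in this series uses the set side roughly
like this (a condensed sketch of patch 2):

	if (mm == &init_mm)
		ptdesc_set_kernel(ptdesc);

with pagetable_free() testing and clearing the flag before the page is
returned to the allocator.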
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
---
include/linux/page-flags.h | 46 ++++++++++++++++++++++++++++++++++++++
1 file changed, 46 insertions(+)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 8d3fa3a91ce4..1d82fb6fffe5 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -1244,6 +1244,52 @@ static inline int folio_has_private(const struct folio *folio)
return !!(folio->flags & PAGE_FLAGS_PRIVATE);
}
+/**
+ * ptdesc_set_kernel - Mark a ptdesc used to map the kernel
+ * @ptdesc: The ptdesc to be marked
+ *
+ * Kernel page tables often need special handling. Set a flag so that
+ * the handling code knows this ptdesc will not be used for userspace.
+ */
+static inline void ptdesc_set_kernel(struct ptdesc *ptdesc)
+{
+ struct folio *folio = ptdesc_folio(ptdesc);
+
+ folio_set_referenced(folio);
+}
+
+/**
+ * ptdesc_clear_kernel - Mark a ptdesc as no longer used to map the kernel
+ * @ptdesc: The ptdesc to be unmarked
+ *
+ * Use when the ptdesc is no longer used to map the kernel and no longer
+ * needs special handling.
+ */
+static inline void ptdesc_clear_kernel(struct ptdesc *ptdesc)
+{
+ struct folio *folio = ptdesc_folio(ptdesc);
+
+ /*
+ * Note: the 'PG_referenced' bit does not strictly need to be
+ * cleared before freeing the page. But this is nice for
+ * symmetry.
+ */
+ folio_clear_referenced(folio);
+}
+
+/**
+ * ptdesc_test_kernel - Check if a ptdesc is used to map the kernel
+ * @ptdesc: The ptdesc being tested
+ *
+ * Call to tell if the ptdesc is used to map the kernel.
+ */
+static inline bool ptdesc_test_kernel(struct ptdesc *ptdesc)
+{
+ struct folio *folio = ptdesc_folio(ptdesc);
+
+ return folio_test_referenced(folio);
+}
+
#undef PF_ANY
#undef PF_HEAD
#undef PF_NO_TAIL
--
2.43.0
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH v5 2/8] mm: Actually mark kernel page table pages
2025-09-19 5:39 [PATCH v5 0/8] Fix stale IOTLB entries for kernel address space Lu Baolu
2025-09-19 5:39 ` [PATCH v5 1/8] mm: Add a ptdesc flag to mark kernel page tables Lu Baolu
@ 2025-09-19 5:40 ` Lu Baolu
2025-10-09 19:19 ` David Hildenbrand
2025-10-13 7:17 ` Mike Rapoport
2025-09-19 5:40 ` [PATCH v5 3/8] x86/mm: Use 'ptdesc' when freeing PMD pages Lu Baolu
` (6 subsequent siblings)
8 siblings, 2 replies; 29+ messages in thread
From: Lu Baolu @ 2025-09-19 5:40 UTC (permalink / raw)
To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Alistair Popple,
Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
Andy Lutomirski, Yi Lai
Cc: iommu, security, x86, linux-mm, linux-kernel, Dave Hansen, Lu Baolu
From: Dave Hansen <dave.hansen@linux.intel.com>
Now that the API is in place, mark kernel page table pages just
after they are allocated. Unmark them just before they are freed.
Note: Unconditionally clearing the 'kernel' marking (via
ptdesc_clear_kernel()) would be functionally identical to what
is here. But having the if() makes it logically clear that this
function can be used for kernel and non-kernel page tables.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
---
include/asm-generic/pgalloc.h | 18 ++++++++++++++++++
include/linux/mm.h | 3 +++
2 files changed, 21 insertions(+)
diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h
index 3c8ec3bfea44..b9d2a7c79b93 100644
--- a/include/asm-generic/pgalloc.h
+++ b/include/asm-generic/pgalloc.h
@@ -28,6 +28,8 @@ static inline pte_t *__pte_alloc_one_kernel_noprof(struct mm_struct *mm)
return NULL;
}
+ ptdesc_set_kernel(ptdesc);
+
return ptdesc_address(ptdesc);
}
#define __pte_alloc_one_kernel(...) alloc_hooks(__pte_alloc_one_kernel_noprof(__VA_ARGS__))
@@ -146,6 +148,10 @@ static inline pmd_t *pmd_alloc_one_noprof(struct mm_struct *mm, unsigned long ad
pagetable_free(ptdesc);
return NULL;
}
+
+ if (mm == &init_mm)
+ ptdesc_set_kernel(ptdesc);
+
return ptdesc_address(ptdesc);
}
#define pmd_alloc_one(...) alloc_hooks(pmd_alloc_one_noprof(__VA_ARGS__))
@@ -179,6 +185,10 @@ static inline pud_t *__pud_alloc_one_noprof(struct mm_struct *mm, unsigned long
return NULL;
pagetable_pud_ctor(ptdesc);
+
+ if (mm == &init_mm)
+ ptdesc_set_kernel(ptdesc);
+
return ptdesc_address(ptdesc);
}
#define __pud_alloc_one(...) alloc_hooks(__pud_alloc_one_noprof(__VA_ARGS__))
@@ -233,6 +243,10 @@ static inline p4d_t *__p4d_alloc_one_noprof(struct mm_struct *mm, unsigned long
return NULL;
pagetable_p4d_ctor(ptdesc);
+
+ if (mm == &init_mm)
+ ptdesc_set_kernel(ptdesc);
+
return ptdesc_address(ptdesc);
}
#define __p4d_alloc_one(...) alloc_hooks(__p4d_alloc_one_noprof(__VA_ARGS__))
@@ -277,6 +291,10 @@ static inline pgd_t *__pgd_alloc_noprof(struct mm_struct *mm, unsigned int order
return NULL;
pagetable_pgd_ctor(ptdesc);
+
+ if (mm == &init_mm)
+ ptdesc_set_kernel(ptdesc);
+
return ptdesc_address(ptdesc);
}
#define __pgd_alloc(...) alloc_hooks(__pgd_alloc_noprof(__VA_ARGS__))
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1ae97a0b8ec7..f3db3a5ebefe 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2895,6 +2895,9 @@ static inline void pagetable_free(struct ptdesc *pt)
{
struct page *page = ptdesc_page(pt);
+ if (ptdesc_test_kernel(pt))
+ ptdesc_clear_kernel(pt);
+
__free_pages(page, compound_order(page));
}
--
2.43.0
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH v5 3/8] x86/mm: Use 'ptdesc' when freeing PMD pages
2025-09-19 5:39 [PATCH v5 0/8] Fix stale IOTLB entries for kernel address space Lu Baolu
2025-09-19 5:39 ` [PATCH v5 1/8] mm: Add a ptdesc flag to mark kernel page tables Lu Baolu
2025-09-19 5:40 ` [PATCH v5 2/8] mm: Actually mark kernel page table pages Lu Baolu
@ 2025-09-19 5:40 ` Lu Baolu
2025-10-09 19:25 ` David Hildenbrand
2025-09-19 5:40 ` [PATCH v5 4/8] mm: Introduce pure page table freeing function Lu Baolu
` (5 subsequent siblings)
8 siblings, 1 reply; 29+ messages in thread
From: Lu Baolu @ 2025-09-19 5:40 UTC (permalink / raw)
To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Alistair Popple,
Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
Andy Lutomirski, Yi Lai
Cc: iommu, security, x86, linux-mm, linux-kernel, Dave Hansen, Lu Baolu
From: Dave Hansen <dave.hansen@linux.intel.com>
There are a billion ways to refer to a physical memory address.
One of the x86 PMD freeing code locations chooses to use a 'pte_t *' to
point to a PMD page and then call a PTE-specific freeing function for
it. That's a bit wonky.
Just use a 'struct ptdesc *' instead. Its entire purpose is to refer
to page table pages. It also means being able to remove an explicit
cast.
Right now, pte_free_kernel() is a one-liner that calls
pagetable_dtor_free(). Effectively, all this patch does is
remove one superfluous __pa(__va(paddr)) conversion and then
call pagetable_dtor_free() directly instead of through a helper.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
---
arch/x86/mm/pgtable.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index ddf248c3ee7d..2e5ecfdce73c 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -729,7 +729,7 @@ int pmd_clear_huge(pmd_t *pmd)
int pud_free_pmd_page(pud_t *pud, unsigned long addr)
{
pmd_t *pmd, *pmd_sv;
- pte_t *pte;
+ struct ptdesc *pt;
int i;
pmd = pud_pgtable(*pud);
@@ -750,8 +750,8 @@ int pud_free_pmd_page(pud_t *pud, unsigned long addr)
for (i = 0; i < PTRS_PER_PMD; i++) {
if (!pmd_none(pmd_sv[i])) {
- pte = (pte_t *)pmd_page_vaddr(pmd_sv[i]);
- pte_free_kernel(&init_mm, pte);
+ pt = page_ptdesc(pmd_page(pmd_sv[i]));
+ pagetable_dtor_free(pt);
}
}
@@ -772,15 +772,15 @@ int pud_free_pmd_page(pud_t *pud, unsigned long addr)
*/
int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
{
- pte_t *pte;
+ struct ptdesc *pt;
- pte = (pte_t *)pmd_page_vaddr(*pmd);
+ pt = page_ptdesc(pmd_page(*pmd));
pmd_clear(pmd);
/* INVLPG to clear all paging-structure caches */
flush_tlb_kernel_range(addr, addr + PAGE_SIZE-1);
- pte_free_kernel(&init_mm, pte);
+ pagetable_dtor_free(pt);
return 1;
}
--
2.43.0
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH v5 4/8] mm: Introduce pure page table freeing function
2025-09-19 5:39 [PATCH v5 0/8] Fix stale IOTLB entries for kernel address space Lu Baolu
` (2 preceding siblings ...)
2025-09-19 5:40 ` [PATCH v5 3/8] x86/mm: Use 'ptdesc' when freeing PMD pages Lu Baolu
@ 2025-09-19 5:40 ` Lu Baolu
2025-10-09 19:26 ` David Hildenbrand
2025-10-13 7:24 ` Mike Rapoport
2025-09-19 5:40 ` [PATCH v5 5/8] x86/mm: Use pagetable_free() Lu Baolu
` (4 subsequent siblings)
8 siblings, 2 replies; 29+ messages in thread
From: Lu Baolu @ 2025-09-19 5:40 UTC (permalink / raw)
To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Alistair Popple,
Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
Andy Lutomirski, Yi Lai
Cc: iommu, security, x86, linux-mm, linux-kernel, Dave Hansen, Lu Baolu
From: Dave Hansen <dave.hansen@linux.intel.com>
The pages used for ptdescs are currently freed back to the allocator
in a single location. They will shortly be freed from a second
location.
Create a simple helper that just frees them back to the allocator.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
---
include/linux/mm.h | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f3db3a5ebefe..668d519edc0f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2884,6 +2884,13 @@ static inline struct ptdesc *pagetable_alloc_noprof(gfp_t gfp, unsigned int orde
}
#define pagetable_alloc(...) alloc_hooks(pagetable_alloc_noprof(__VA_ARGS__))
+static inline void __pagetable_free(struct ptdesc *pt)
+{
+ struct page *page = ptdesc_page(pt);
+
+ __free_pages(page, compound_order(page));
+}
+
/**
* pagetable_free - Free pagetables
* @pt: The page table descriptor
@@ -2893,12 +2900,10 @@ static inline struct ptdesc *pagetable_alloc_noprof(gfp_t gfp, unsigned int orde
*/
static inline void pagetable_free(struct ptdesc *pt)
{
- struct page *page = ptdesc_page(pt);
-
if (ptdesc_test_kernel(pt))
ptdesc_clear_kernel(pt);
- __free_pages(page, compound_order(page));
+ __pagetable_free(pt);
}
#if defined(CONFIG_SPLIT_PTE_PTLOCKS)
--
2.43.0
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH v5 5/8] x86/mm: Use pagetable_free()
2025-09-19 5:39 [PATCH v5 0/8] Fix stale IOTLB entries for kernel address space Lu Baolu
` (3 preceding siblings ...)
2025-09-19 5:40 ` [PATCH v5 4/8] mm: Introduce pure page table freeing function Lu Baolu
@ 2025-09-19 5:40 ` Lu Baolu
2025-09-24 12:40 ` Jason Gunthorpe
` (2 more replies)
2025-09-19 5:40 ` [PATCH v5 6/8] mm: Introduce deferred freeing for kernel page tables Lu Baolu
` (3 subsequent siblings)
8 siblings, 3 replies; 29+ messages in thread
From: Lu Baolu @ 2025-09-19 5:40 UTC (permalink / raw)
To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Alistair Popple,
Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
Andy Lutomirski, Yi Lai
Cc: iommu, security, x86, linux-mm, linux-kernel, Lu Baolu
The kernel's memory management subsystem provides a dedicated interface,
pagetable_free(), for freeing page table pages. Update two call sites to
use pagetable_free() instead of the lower-level __free_page() or
free_pages(). This improves code consistency and clarity, and ensures the
correct freeing mechanism is used.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
arch/x86/mm/init_64.c | 2 +-
arch/x86/mm/pat/set_memory.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index b9426fce5f3e..3d9a5e4ccaa4 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1031,7 +1031,7 @@ static void __meminit free_pagetable(struct page *page, int order)
free_reserved_pages(page, nr_pages);
#endif
} else {
- free_pages((unsigned long)page_address(page), order);
+ pagetable_free(page_ptdesc(page));
}
}
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 8834c76f91c9..8b78a8855024 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -438,7 +438,7 @@ static void cpa_collapse_large_pages(struct cpa_data *cpa)
list_for_each_entry_safe(ptdesc, tmp, &pgtables, pt_list) {
list_del(&ptdesc->pt_list);
- __free_page(ptdesc_page(ptdesc));
+ pagetable_free(ptdesc);
}
}
--
2.43.0
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH v5 6/8] mm: Introduce deferred freeing for kernel page tables
2025-09-19 5:39 [PATCH v5 0/8] Fix stale IOTLB entries for kernel address space Lu Baolu
` (4 preceding siblings ...)
2025-09-19 5:40 ` [PATCH v5 5/8] x86/mm: Use pagetable_free() Lu Baolu
@ 2025-09-19 5:40 ` Lu Baolu
2025-10-09 19:28 ` David Hildenbrand
2025-10-10 15:47 ` David Hildenbrand
2025-09-19 5:40 ` [PATCH v5 7/8] mm: Hook up Kconfig options for async page table freeing Lu Baolu
` (2 subsequent siblings)
8 siblings, 2 replies; 29+ messages in thread
From: Lu Baolu @ 2025-09-19 5:40 UTC (permalink / raw)
To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Alistair Popple,
Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
Andy Lutomirski, Yi Lai
Cc: iommu, security, x86, linux-mm, linux-kernel, Dave Hansen, Lu Baolu
From: Dave Hansen <dave.hansen@linux.intel.com>
This introduces a conditional asynchronous mechanism, enabled by
CONFIG_ASYNC_PGTABLE_FREE. When enabled, this mechanism defers the freeing
of pages that are used as page tables for kernel address mappings. These
pages are now queued to a work struct instead of being freed immediately.
This deferred freeing provides a safe context for a future patch to add
an IOMMU-specific callback, which might be expensive on large-scale
systems. This ensures the necessary IOMMU cache invalidation is
performed outside of any critical, non-sleepable path, before the page
is finally returned to the page allocator.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
---
include/linux/mm.h | 16 +++++++++++++---
mm/pgtable-generic.c | 37 +++++++++++++++++++++++++++++++++++++
2 files changed, 50 insertions(+), 3 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 668d519edc0f..2d7b4af40442 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2891,6 +2891,14 @@ static inline void __pagetable_free(struct ptdesc *pt)
__free_pages(page, compound_order(page));
}
+#ifdef CONFIG_ASYNC_PGTABLE_FREE
+void pagetable_free_kernel(struct ptdesc *pt);
+#else
+static inline void pagetable_free_kernel(struct ptdesc *pt)
+{
+ __pagetable_free(pt);
+}
+#endif
/**
* pagetable_free - Free pagetables
* @pt: The page table descriptor
@@ -2900,10 +2908,12 @@ static inline void __pagetable_free(struct ptdesc *pt)
*/
static inline void pagetable_free(struct ptdesc *pt)
{
- if (ptdesc_test_kernel(pt))
+ if (ptdesc_test_kernel(pt)) {
ptdesc_clear_kernel(pt);
-
- __pagetable_free(pt);
+ pagetable_free_kernel(pt);
+ } else {
+ __pagetable_free(pt);
+ }
}
#if defined(CONFIG_SPLIT_PTE_PTLOCKS)
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 567e2d084071..0279399d4910 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -406,3 +406,40 @@ pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
pte_unmap_unlock(pte, ptl);
goto again;
}
+
+#ifdef CONFIG_ASYNC_PGTABLE_FREE
+static void kernel_pgtable_work_func(struct work_struct *work);
+
+static struct {
+ struct list_head list;
+ /* protect above ptdesc lists */
+ spinlock_t lock;
+ struct work_struct work;
+} kernel_pgtable_work = {
+ .list = LIST_HEAD_INIT(kernel_pgtable_work.list),
+ .lock = __SPIN_LOCK_UNLOCKED(kernel_pgtable_work.lock),
+ .work = __WORK_INITIALIZER(kernel_pgtable_work.work, kernel_pgtable_work_func),
+};
+
+static void kernel_pgtable_work_func(struct work_struct *work)
+{
+ struct ptdesc *pt, *next;
+ LIST_HEAD(page_list);
+
+ spin_lock(&kernel_pgtable_work.lock);
+ list_splice_tail_init(&kernel_pgtable_work.list, &page_list);
+ spin_unlock(&kernel_pgtable_work.lock);
+
+ list_for_each_entry_safe(pt, next, &page_list, pt_list)
+ __pagetable_free(pt);
+}
+
+void pagetable_free_kernel(struct ptdesc *pt)
+{
+ spin_lock(&kernel_pgtable_work.lock);
+ list_add(&pt->pt_list, &kernel_pgtable_work.list);
+ spin_unlock(&kernel_pgtable_work.lock);
+
+ schedule_work(&kernel_pgtable_work.work);
+}
+#endif
--
2.43.0
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH v5 7/8] mm: Hook up Kconfig options for async page table freeing
2025-09-19 5:39 [PATCH v5 0/8] Fix stale IOTLB entries for kernel address space Lu Baolu
` (5 preceding siblings ...)
2025-09-19 5:40 ` [PATCH v5 6/8] mm: Introduce deferred freeing for kernel page tables Lu Baolu
@ 2025-09-19 5:40 ` Lu Baolu
2025-09-19 5:40 ` [PATCH v5 8/8] iommu/sva: Invalidate stale IOTLB entries for kernel address space Lu Baolu
2025-09-25 20:24 ` [PATCH v5 0/8] Fix " Dave Hansen
8 siblings, 0 replies; 29+ messages in thread
From: Lu Baolu @ 2025-09-19 5:40 UTC (permalink / raw)
To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Alistair Popple,
Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
Andy Lutomirski, Yi Lai
Cc: iommu, security, x86, linux-mm, linux-kernel, Dave Hansen, Lu Baolu
From: Dave Hansen <dave.hansen@linux.intel.com>
The CONFIG_ASYNC_PGTABLE_FREE option controls whether an architecture
requires asynchronous page table freeing. On x86, this is selected if
IOMMU_SVA is enabled, because both Intel and AMD IOMMU architectures
could potentially cache kernel page table entries in their paging
structure cache, regardless of the permissions.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
---
arch/x86/Kconfig | 1 +
mm/Kconfig | 3 +++
2 files changed, 4 insertions(+)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 52c8910ba2ef..247caac65e22 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -281,6 +281,7 @@ config X86
select HAVE_PCI
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
+ select ASYNC_PGTABLE_FREE if IOMMU_SVA
select MMU_GATHER_RCU_TABLE_FREE
select MMU_GATHER_MERGE_VMAS
select HAVE_POSIX_CPU_TIMERS_TASK_WORK
diff --git a/mm/Kconfig b/mm/Kconfig
index e443fe8cd6cf..1576409cec03 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -920,6 +920,9 @@ config PAGE_MAPCOUNT
config PGTABLE_HAS_HUGE_LEAVES
def_bool TRANSPARENT_HUGEPAGE || HUGETLB_PAGE
+config ASYNC_PGTABLE_FREE
+ def_bool n
+
# TODO: Allow to be enabled without THP
config ARCH_SUPPORTS_HUGE_PFNMAP
def_bool n
--
2.43.0
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH v5 8/8] iommu/sva: Invalidate stale IOTLB entries for kernel address space
2025-09-19 5:39 [PATCH v5 0/8] Fix stale IOTLB entries for kernel address space Lu Baolu
` (6 preceding siblings ...)
2025-09-19 5:40 ` [PATCH v5 7/8] mm: Hook up Kconfig options for async page table freeing Lu Baolu
@ 2025-09-19 5:40 ` Lu Baolu
2025-09-25 20:24 ` [PATCH v5 0/8] Fix " Dave Hansen
8 siblings, 0 replies; 29+ messages in thread
From: Lu Baolu @ 2025-09-19 5:40 UTC (permalink / raw)
To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Alistair Popple,
Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
Andy Lutomirski, Yi Lai
Cc: iommu, security, x86, linux-mm, linux-kernel, Lu Baolu, stable
In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU hardware
shares and walks the CPU's page tables. The x86 architecture maps the
kernel's virtual address space into the upper portion of every process's
page table. Consequently, in an SVA context, the IOMMU hardware can walk
and cache kernel page table entries.
The Linux kernel currently lacks a notification mechanism for kernel page
table changes, specifically when page table pages are freed and reused.
The IOMMU driver is only notified of changes to user virtual address
mappings. This can cause the IOMMU's internal caches to retain stale
entries for kernel VA.
A Use-After-Free (UAF) and Write-After-Free (WAF) condition arises when
kernel page table pages are freed and later reallocated. The IOMMU could
misinterpret the new data as valid page table entries. The IOMMU might
then walk into attacker-controlled memory, leading to arbitrary physical
memory DMA access or privilege escalation. This is also a Write-After-Free
issue, as the IOMMU will potentially continue to write Accessed and Dirty
bits to the freed memory while attempting to walk the stale page tables.
Currently, SVA contexts are unprivileged and cannot access kernel
mappings. However, the IOMMU will still walk kernel-only page tables
all the way down to the leaf entries, where it realizes the mapping
is for the kernel and errors out. This means the IOMMU still caches
these intermediate page table entries, making the described vulnerability
a real concern.
To mitigate this, a new IOMMU interface is introduced to flush IOTLB
entries for the kernel address space. This interface is invoked from the
x86 architecture code that manages combined user and kernel page tables,
specifically before any kernel page table page is freed and reused.
This addresses the main issue with vfree(), which is a common occurrence
and can be triggered by unprivileged users. While this resolves the
primary problem, it doesn't address an extremely rare case related to
memory unplug of memory that was present as reserved memory at boot,
which cannot be triggered by unprivileged users. The discussion can be
found at the link below.
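In condensed form, the mm-side hook is a single invalidation covering
the whole batch of queued page tables (see the mm/pgtable-generic.c
hunk below):

	/* One global flush before freeing the whole batch: */
	iommu_sva_invalidate_kva_range(PAGE_OFFSET, TLB_FLUSH_ALL);
	list_for_each_entry_safe(pt, next, &page_list, pt_list)
		__pagetable_free(pt);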
Fixes: 26b25a2b98e4 ("iommu: Bind process address spaces to devices")
Cc: stable@vger.kernel.org
Suggested-by: Jann Horn <jannh@google.com>
Co-developed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/linux-iommu/04983c62-3b1d-40d4-93ae-34ca04b827e5@intel.com/
---
drivers/iommu/iommu-sva.c | 29 ++++++++++++++++++++++++++++-
include/linux/iommu.h | 4 ++++
mm/pgtable-generic.c | 2 ++
3 files changed, 34 insertions(+), 1 deletion(-)
diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c
index 1a51cfd82808..d236aef80a8d 100644
--- a/drivers/iommu/iommu-sva.c
+++ b/drivers/iommu/iommu-sva.c
@@ -10,6 +10,8 @@
#include "iommu-priv.h"
static DEFINE_MUTEX(iommu_sva_lock);
+static bool iommu_sva_present;
+static LIST_HEAD(iommu_sva_mms);
static struct iommu_domain *iommu_sva_domain_alloc(struct device *dev,
struct mm_struct *mm);
@@ -42,6 +44,7 @@ static struct iommu_mm_data *iommu_alloc_mm_data(struct mm_struct *mm, struct de
return ERR_PTR(-ENOSPC);
}
iommu_mm->pasid = pasid;
+ iommu_mm->mm = mm;
INIT_LIST_HEAD(&iommu_mm->sva_domains);
/*
* Make sure the write to mm->iommu_mm is not reordered in front of
@@ -132,8 +135,13 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev, struct mm_struct *mm
if (ret)
goto out_free_domain;
domain->users = 1;
- list_add(&domain->next, &mm->iommu_mm->sva_domains);
+ if (list_empty(&iommu_mm->sva_domains)) {
+ if (list_empty(&iommu_sva_mms))
+ iommu_sva_present = true;
+ list_add(&iommu_mm->mm_list_elm, &iommu_sva_mms);
+ }
+ list_add(&domain->next, &iommu_mm->sva_domains);
out:
refcount_set(&handle->users, 1);
mutex_unlock(&iommu_sva_lock);
@@ -175,6 +183,13 @@ void iommu_sva_unbind_device(struct iommu_sva *handle)
list_del(&domain->next);
iommu_domain_free(domain);
}
+
+ if (list_empty(&iommu_mm->sva_domains)) {
+ list_del(&iommu_mm->mm_list_elm);
+ if (list_empty(&iommu_sva_mms))
+ iommu_sva_present = false;
+ }
+
mutex_unlock(&iommu_sva_lock);
kfree(handle);
}
@@ -312,3 +327,15 @@ static struct iommu_domain *iommu_sva_domain_alloc(struct device *dev,
return domain;
}
+
+void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end)
+{
+ struct iommu_mm_data *iommu_mm;
+
+ guard(mutex)(&iommu_sva_lock);
+ if (!iommu_sva_present)
+ return;
+
+ list_for_each_entry(iommu_mm, &iommu_sva_mms, mm_list_elm)
+ mmu_notifier_arch_invalidate_secondary_tlbs(iommu_mm->mm, start, end);
+}
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index c30d12e16473..66e4abb2df0d 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -1134,7 +1134,9 @@ struct iommu_sva {
struct iommu_mm_data {
u32 pasid;
+ struct mm_struct *mm;
struct list_head sva_domains;
+ struct list_head mm_list_elm;
};
int iommu_fwspec_init(struct device *dev, struct fwnode_handle *iommu_fwnode);
@@ -1615,6 +1617,7 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev,
struct mm_struct *mm);
void iommu_sva_unbind_device(struct iommu_sva *handle);
u32 iommu_sva_get_pasid(struct iommu_sva *handle);
+void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end);
#else
static inline struct iommu_sva *
iommu_sva_bind_device(struct device *dev, struct mm_struct *mm)
@@ -1639,6 +1642,7 @@ static inline u32 mm_get_enqcmd_pasid(struct mm_struct *mm)
}
static inline void mm_pasid_drop(struct mm_struct *mm) {}
+static inline void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end) {}
#endif /* CONFIG_IOMMU_SVA */
#ifdef CONFIG_IOMMU_IOPF
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 0279399d4910..2717dc9afff0 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -13,6 +13,7 @@
#include <linux/swap.h>
#include <linux/swapops.h>
#include <linux/mm_inline.h>
+#include <linux/iommu.h>
#include <asm/pgalloc.h>
#include <asm/tlb.h>
@@ -430,6 +431,7 @@ static void kernel_pgtable_work_func(struct work_struct *work)
list_splice_tail_init(&kernel_pgtable_work.list, &page_list);
spin_unlock(&kernel_pgtable_work.lock);
+ iommu_sva_invalidate_kva_range(PAGE_OFFSET, TLB_FLUSH_ALL);
list_for_each_entry_safe(pt, next, &page_list, pt_list)
__pagetable_free(pt);
}
--
2.43.0
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v5 5/8] x86/mm: Use pagetable_free()
2025-09-19 5:40 ` [PATCH v5 5/8] x86/mm: Use pagetable_free() Lu Baolu
@ 2025-09-24 12:40 ` Jason Gunthorpe
2025-10-09 19:26 ` David Hildenbrand
2025-10-13 7:28 ` Mike Rapoport
2 siblings, 0 replies; 29+ messages in thread
From: Jason Gunthorpe @ 2025-09-24 12:40 UTC (permalink / raw)
To: Lu Baolu
Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn,
Vasant Hegde, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Alistair Popple, Peter Zijlstra, Uladzislau Rezki,
Jean-Philippe Brucker, Andy Lutomirski, Yi Lai, iommu, security,
x86, linux-mm, linux-kernel
On Fri, Sep 19, 2025 at 01:40:03PM +0800, Lu Baolu wrote:
> The kernel's memory management subsystem provides a dedicated interface,
> pagetable_free(), for freeing page table pages. Update two call sites to
> use pagetable_free() instead of the lower-level __free_page() or
> free_pages(). This improves code consistency and clarity, and ensures the
> correct freeing mechanism is used.
>
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> ---
> arch/x86/mm/init_64.c | 2 +-
> arch/x86/mm/pat/set_memory.c | 2 +-
> 2 files changed, 2 insertions(+), 2 deletions(-)
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Jason
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v5 0/8] Fix stale IOTLB entries for kernel address space
2025-09-19 5:39 [PATCH v5 0/8] Fix stale IOTLB entries for kernel address space Lu Baolu
` (7 preceding siblings ...)
2025-09-19 5:40 ` [PATCH v5 8/8] iommu/sva: Invalidate stale IOTLB entries for kernel address space Lu Baolu
@ 2025-09-25 20:24 ` Dave Hansen
2025-10-08 19:42 ` Dave Hansen
8 siblings, 1 reply; 29+ messages in thread
From: Dave Hansen @ 2025-09-25 20:24 UTC (permalink / raw)
To: Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Alistair Popple, Peter Zijlstra,
Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai
Cc: iommu, security, x86, linux-mm, linux-kernel
On 9/18/25 22:39, Lu Baolu wrote:
> This solution introduces a deferred freeing mechanism for kernel page
> table pages, which provides a safe window to notify the IOMMU to
> invalidate its caches before the page is reused.
I think all the activity has died down and everyone seems happy enough
with how this looks. Right?
So is this something we should prod Andrew to take through the mm tree,
or is it x86-specific enough it should go through tip?
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v5 0/8] Fix stale IOTLB entries for kernel address space
2025-09-25 20:24 ` [PATCH v5 0/8] Fix " Dave Hansen
@ 2025-10-08 19:42 ` Dave Hansen
2025-10-09 19:16 ` David Hildenbrand
2025-10-14 13:21 ` Baolu Lu
0 siblings, 2 replies; 29+ messages in thread
From: Dave Hansen @ 2025-10-08 19:42 UTC (permalink / raw)
To: Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Alistair Popple, Peter Zijlstra,
Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Andrew Morton, Vlastimil Babka, Mike Rapoport, Michal Hocko,
Matthew Wilcox
Cc: iommu, security, x86, linux-mm, linux-kernel
I wondered why no mm folks were commenting on this. linux-mm@ was cc'd,
but the _people_ on cc seem to have been almost all IOMMU and x86 folks.
So I added a few mm folks...
On 9/25/25 13:24, Dave Hansen wrote:
> On 9/18/25 22:39, Lu Baolu wrote:
>> This solution introduces a deferred freeing mechanism for kernel page
>> table pages, which provides a safe window to notify the IOMMU to
>> invalidate its caches before the page is reused.
>
> I think all the activity has died down and everyone seems happy enough
> with how this looks. Right?
>
> So is this something we should prod Andrew to take through the mm tree,
> or is it x86-specific enough it should go through tip?
Hi Folks! We've got a bug fix here that has an impact on x86, mm, and IOMMU
code. I know I've talked with a few of you about this along the way, but
it's really thin on mm reviews, probably because mm folks haven't been
cc'd. Any eyeballs on it would be appreciated!
It seems like it should _probably_ go through the mm tree, although I'm
happy to send it through tip if folks disagree.
Diffstat for reference:
arch/x86/Kconfig | 1 +
arch/x86/mm/init_64.c | 2 +-
arch/x86/mm/pat/set_memory.c | 2 +-
arch/x86/mm/pgtable.c | 12 ++++-----
drivers/iommu/iommu-sva.c | 29 +++++++++++++++++++++-
include/asm-generic/pgalloc.h | 18 ++++++++++++++
include/linux/iommu.h | 4 +++
include/linux/mm.h | 24 +++++++++++++++---
include/linux/page-flags.h | 46 +++++++++++++++++++++++++++++++++++
mm/Kconfig | 3 +++
mm/pgtable-generic.c | 39 +++++++++++++++++++++++++++++
11 files changed, 168 insertions(+), 12 deletions(-)
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v5 1/8] mm: Add a ptdesc flag to mark kernel page tables
2025-09-19 5:39 ` [PATCH v5 1/8] mm: Add a ptdesc flag to mark kernel page tables Lu Baolu
@ 2025-10-08 19:56 ` Matthew Wilcox
2025-10-11 6:24 ` Baolu Lu
0 siblings, 1 reply; 29+ messages in thread
From: Matthew Wilcox @ 2025-10-08 19:56 UTC (permalink / raw)
To: Lu Baolu
Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Alistair Popple,
Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
Andy Lutomirski, Yi Lai, iommu, security, x86, linux-mm,
linux-kernel, Dave Hansen
On Fri, Sep 19, 2025 at 01:39:59PM +0800, Lu Baolu wrote:
> +static inline void ptdesc_set_kernel(struct ptdesc *ptdesc)
> +{
> + struct folio *folio = ptdesc_folio(ptdesc);
> +
> + folio_set_referenced(folio);
> +}
So this was the right way to do this at the time. However, if you look
at commit 522abd92279a this should now be ...
enum pt_flags {
PT_reserved = PG_reserved,
+ PT_kernel = PG_referenced,
/* High bits are used for zone/node/section */
};
[...]
+static inline void ptdesc_set_kernel(struct ptdesc *ptdesc)
+{
+ set_bit(PT_kernel, &ptdesc->pt_flags.f);
+}
(etc)
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v5 0/8] Fix stale IOTLB entries for kernel address space
2025-10-08 19:42 ` Dave Hansen
@ 2025-10-09 19:16 ` David Hildenbrand
2025-10-14 13:21 ` Baolu Lu
1 sibling, 0 replies; 29+ messages in thread
From: David Hildenbrand @ 2025-10-09 19:16 UTC (permalink / raw)
To: Dave Hansen, Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy,
Kevin Tian, Jason Gunthorpe, Jann Horn, Vasant Hegde,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Alistair Popple,
Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
Andy Lutomirski, Yi Lai, Lorenzo Stoakes, Liam R. Howlett,
Andrew Morton, Vlastimil Babka, Mike Rapoport, Michal Hocko,
Matthew Wilcox
Cc: iommu, security, x86, linux-mm, linux-kernel
On 08.10.25 21:42, Dave Hansen wrote:
> I wondered why no mm folks were commenting on this. linux-mm@ was cc'd,
> but the _people_ on cc seem to have been almost all IOMMU and x86 folks.
> So I added a few mm folks...
Thanks. Lately I find myself scanning linux-mm only randomly. So if it's
not in my inbox, likely I won't realize easily that there is something
that needs our attention.
Will take a look.
>
> On 9/25/25 13:24, Dave Hansen wrote:
>> On 9/18/25 22:39, Lu Baolu wrote:
>>> This solution introduces a deferred freeing mechanism for kernel page
>>> table pages, which provides a safe window to notify the IOMMU to
>>> invalidate its caches before the page is reused.
>>
>> I think all the activity has died down and everyone seems happy enough
>> with how this looks. Right?
>>
>> So is this something we should prod Andrew to take through the mm tree,
>> or is it x86-specific enough it should go through tip?
>
> Hi Folks! We've got a bug fix here that has impact on x86, mm, and IOMMU
> code. I know I've talked with a few of you about this along the way, but
> it's really thin on mm reviews, probably because mm folks haven't been
> cc'd. Any eyeballs on it would be appreciated!
>
> It seems like it should _probably_ go through the mm tree, although I'm
> happy to send it through tip if folks disagree.
>
> Diffstat for reference:
>
> arch/x86/Kconfig | 1 +
> arch/x86/mm/init_64.c | 2 +-
> arch/x86/mm/pat/set_memory.c | 2 +-
> arch/x86/mm/pgtable.c | 12 ++++-----
> drivers/iommu/iommu-sva.c | 29 +++++++++++++++++++++-
> include/asm-generic/pgalloc.h | 18 ++++++++++++++
> include/linux/iommu.h | 4 +++
> include/linux/mm.h | 24 +++++++++++++++---
> include/linux/page-flags.h | 46 +++++++++++++++++++++++++++++++++++
> mm/Kconfig | 3 +++
> mm/pgtable-generic.c | 39 +++++++++++++++++++++++++++++
> 11 files changed, 168 insertions(+), 12 deletions(-)
>
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v5 2/8] mm: Actually mark kernel page table pages
2025-09-19 5:40 ` [PATCH v5 2/8] mm: Actually mark kernel page table pages Lu Baolu
@ 2025-10-09 19:19 ` David Hildenbrand
2025-10-13 7:17 ` Mike Rapoport
1 sibling, 0 replies; 29+ messages in thread
From: David Hildenbrand @ 2025-10-09 19:19 UTC (permalink / raw)
To: Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Alistair Popple,
Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
Andy Lutomirski, Yi Lai
Cc: iommu, security, x86, linux-mm, linux-kernel, Dave Hansen
On 19.09.25 07:40, Lu Baolu wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> Now that the API is in place, mark kernel page table pages just
> after they are allocated. Unmark them just before they are freed.
>
> Note: Unconditionally clearing the 'kernel' marking (via
> ptdesc_clear_kernel()) would be functionally identical to what
> is here. But having the if() makes it logically clear that this
> function can be used for kernel and non-kernel page tables.
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> ---
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v5 3/8] x86/mm: Use 'ptdesc' when freeing PMD pages
2025-09-19 5:40 ` [PATCH v5 3/8] x86/mm: Use 'ptdesc' when freeing PMD pages Lu Baolu
@ 2025-10-09 19:25 ` David Hildenbrand
2025-10-09 19:31 ` Dave Hansen
0 siblings, 1 reply; 29+ messages in thread
From: David Hildenbrand @ 2025-10-09 19:25 UTC (permalink / raw)
To: Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Alistair Popple,
Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
Andy Lutomirski, Yi Lai
Cc: iommu, security, x86, linux-mm, linux-kernel, Dave Hansen
On 19.09.25 07:40, Lu Baolu wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> There are a billion ways to refer to a physical memory address.
> One of the x86 PMD freeing code location chooses to use a 'pte_t *' to
> point to a PMD page and then call a PTE-specific freeing function for
> it. That's a bit wonky.
>
> Just use a 'struct ptdesc *' instead. Its entire purpose is to refer
> to page table pages. It also means being able to remove an explicit
> cast.
>
> Right now, pte_free_kernel() is a one-liner that calls
> pagetable_dtor_free(). Effectively, all this patch does is
> remove one superfluous __pa(__va(paddr)) conversion and then
> call pagetable_dtor_free() directly instead of through a helper.
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> ---
> arch/x86/mm/pgtable.c | 12 ++++++------
> 1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
> index ddf248c3ee7d..2e5ecfdce73c 100644
> --- a/arch/x86/mm/pgtable.c
> +++ b/arch/x86/mm/pgtable.c
> @@ -729,7 +729,7 @@ int pmd_clear_huge(pmd_t *pmd)
> int pud_free_pmd_page(pud_t *pud, unsigned long addr)
> {
> pmd_t *pmd, *pmd_sv;
> - pte_t *pte;
> + struct ptdesc *pt;
> int i;
>
> pmd = pud_pgtable(*pud);
> @@ -750,8 +750,8 @@ int pud_free_pmd_page(pud_t *pud, unsigned long addr)
>
> for (i = 0; i < PTRS_PER_PMD; i++) {
> if (!pmd_none(pmd_sv[i])) {
> - pte = (pte_t *)pmd_page_vaddr(pmd_sv[i]);
> - pte_free_kernel(&init_mm, pte);
> + pt = page_ptdesc(pmd_page(pmd_sv[i]));
> + pagetable_dtor_free(pt);
There is pmd_ptdesc() which does
page_ptdesc(pmd_pgtable_page(pmd));
It's buried in a
#if defined(CONFIG_SPLIT_PMD_PTLOCKS)
Can't we just make that always available so we can use it here?
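Something like this, i.e. hoisting the helper out of the #if (a sketch;
this assumes pmd_pgtable_page() is itself available in both
configurations):

	static inline struct ptdesc *pmd_ptdesc(pmd_t *pmd)
	{
		return page_ptdesc(pmd_pgtable_page(pmd));
	}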
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v5 4/8] mm: Introduce pure page table freeing function
2025-09-19 5:40 ` [PATCH v5 4/8] mm: Introduce pure page table freeing function Lu Baolu
@ 2025-10-09 19:26 ` David Hildenbrand
2025-10-13 7:24 ` Mike Rapoport
1 sibling, 0 replies; 29+ messages in thread
From: David Hildenbrand @ 2025-10-09 19:26 UTC (permalink / raw)
To: Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Alistair Popple,
Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
Andy Lutomirski, Yi Lai
Cc: iommu, security, x86, linux-mm, linux-kernel, Dave Hansen
On 19.09.25 07:40, Lu Baolu wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> The pages used for ptdescs are currently freed back to the allocator
> in a single location. They will shortly be freed from a second
> location.
>
> Create a simple helper that just frees them back to the allocator.
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> ---
> include/linux/mm.h | 11 ++++++++---
> 1 file changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index f3db3a5ebefe..668d519edc0f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2884,6 +2884,13 @@ static inline struct ptdesc *pagetable_alloc_noprof(gfp_t gfp, unsigned int orde
> }
> #define pagetable_alloc(...) alloc_hooks(pagetable_alloc_noprof(__VA_ARGS__))
>
> +static inline void __pagetable_free(struct ptdesc *pt)
> +{
> + struct page *page = ptdesc_page(pt);
> +
> + __free_pages(page, compound_order(page));
> +}
> +
> /**
> * pagetable_free - Free pagetables
> * @pt: The page table descriptor
> @@ -2893,12 +2900,10 @@ static inline struct ptdesc *pagetable_alloc_noprof(gfp_t gfp, unsigned int orde
> */
> static inline void pagetable_free(struct ptdesc *pt)
> {
> - struct page *page = ptdesc_page(pt);
> -
> if (ptdesc_test_kernel(pt))
> ptdesc_clear_kernel(pt);
>
> - __free_pages(page, compound_order(page));
> + __pagetable_free(pt);
> }
>
> #if defined(CONFIG_SPLIT_PTE_PTLOCKS)
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v5 5/8] x86/mm: Use pagetable_free()
2025-09-19 5:40 ` [PATCH v5 5/8] x86/mm: Use pagetable_free() Lu Baolu
2025-09-24 12:40 ` Jason Gunthorpe
@ 2025-10-09 19:26 ` David Hildenbrand
2025-10-13 7:28 ` Mike Rapoport
2 siblings, 0 replies; 29+ messages in thread
From: David Hildenbrand @ 2025-10-09 19:26 UTC (permalink / raw)
To: Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Alistair Popple,
Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
Andy Lutomirski, Yi Lai
Cc: iommu, security, x86, linux-mm, linux-kernel
On 19.09.25 07:40, Lu Baolu wrote:
> The kernel's memory management subsystem provides a dedicated interface,
> pagetable_free(), for freeing page table pages. Update two call sites to
> use pagetable_free() instead of the lower-level __free_page() or
> free_pages(). This improves code consistency and clarity, and ensures the
> correct freeing mechanism is used.
>
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> ---
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v5 6/8] mm: Introduce deferred freeing for kernel page tables
2025-09-19 5:40 ` [PATCH v5 6/8] mm: Introduce deferred freeing for kernel page tables Lu Baolu
@ 2025-10-09 19:28 ` David Hildenbrand
2025-10-09 19:32 ` Dave Hansen
2025-10-10 15:47 ` David Hildenbrand
1 sibling, 1 reply; 29+ messages in thread
From: David Hildenbrand @ 2025-10-09 19:28 UTC (permalink / raw)
To: Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Alistair Popple,
Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
Andy Lutomirski, Yi Lai
Cc: iommu, security, x86, linux-mm, linux-kernel, Dave Hansen
On 19.09.25 07:40, Lu Baolu wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> This introduces a conditional asynchronous mechanism, enabled by
> CONFIG_ASYNC_PGTABLE_FREE. When enabled, this mechanism defers the freeing
> of pages that are used as page tables for kernel address mappings. These
> pages are now queued to a work struct instead of being freed immediately.
>
> This deferred freeing provides a safe context for a future patch to add
> an IOMMU-specific callback, which might be expensive on large-scale
> systems. This ensures the necessary IOMMU cache invalidation is
> performed outside of any critical, non-sleepable path, before the page
> is finally returned to the page allocator.
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> ---
Can we please squash #7 in here and make sure to call the config knob
something that indicates that it is for *kernel* page tables only?
ASYNC_KERNEL_PGTABLE_FREE
or sth like that.
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v5 3/8] x86/mm: Use 'ptdesc' when freeing PMD pages
2025-10-09 19:25 ` David Hildenbrand
@ 2025-10-09 19:31 ` Dave Hansen
2025-10-11 6:26 ` Baolu Lu
0 siblings, 1 reply; 29+ messages in thread
From: Dave Hansen @ 2025-10-09 19:31 UTC (permalink / raw)
To: David Hildenbrand, Lu Baolu, Joerg Roedel, Will Deacon,
Robin Murphy, Kevin Tian, Jason Gunthorpe, Jann Horn,
Vasant Hegde, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Alistair Popple, Peter Zijlstra, Uladzislau Rezki,
Jean-Philippe Brucker, Andy Lutomirski, Yi Lai
Cc: iommu, security, x86, linux-mm, linux-kernel, Dave Hansen
On 10/9/25 12:25, David Hildenbrand wrote:
>>
>> @@ -750,8 +750,8 @@ int pud_free_pmd_page(pud_t *pud, unsigned long addr)
>> for (i = 0; i < PTRS_PER_PMD; i++) {
>> if (!pmd_none(pmd_sv[i])) {
>> - pte = (pte_t *)pmd_page_vaddr(pmd_sv[i]);
>> - pte_free_kernel(&init_mm, pte);
>> + pt = page_ptdesc(pmd_page(pmd_sv[i]));
>> + pagetable_dtor_free(pt);
>
> There is pmd_ptdesc() which does
>
> page_ptdesc(pmd_pgtable_page(pmd));
>
> It's buried in a
>
> #if defined(CONFIG_SPLIT_PMD_PTLOCKS)
>
> Can't we just make that always available so we can use it here?
Yes, that looks like a good idea. I never noticed pmd_ptdesc() when I
was writing this for sure.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v5 6/8] mm: Introduce deferred freeing for kernel page tables
2025-10-09 19:28 ` David Hildenbrand
@ 2025-10-09 19:32 ` Dave Hansen
0 siblings, 0 replies; 29+ messages in thread
From: Dave Hansen @ 2025-10-09 19:32 UTC (permalink / raw)
To: David Hildenbrand, Lu Baolu, Joerg Roedel, Will Deacon,
Robin Murphy, Kevin Tian, Jason Gunthorpe, Jann Horn,
Vasant Hegde, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Alistair Popple, Peter Zijlstra, Uladzislau Rezki,
Jean-Philippe Brucker, Andy Lutomirski, Yi Lai
Cc: iommu, security, x86, linux-mm, linux-kernel, Dave Hansen
On 10/9/25 12:28, David Hildenbrand wrote:
...
> Can we please squash #7 in here and make sure to call the config knob
> something that indicates that it is for *kernel* page tables only?
>
> ASYNC_KERNEL_PGTABLE_FREE
I'm fine with both of those. That name ^ looks fine to me too.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v5 6/8] mm: Introduce deferred freeing for kernel page tables
2025-09-19 5:40 ` [PATCH v5 6/8] mm: Introduce deferred freeing for kernel page tables Lu Baolu
2025-10-09 19:28 ` David Hildenbrand
@ 2025-10-10 15:47 ` David Hildenbrand
2025-10-11 6:30 ` Baolu Lu
1 sibling, 1 reply; 29+ messages in thread
From: David Hildenbrand @ 2025-10-10 15:47 UTC (permalink / raw)
To: Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Alistair Popple,
Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
Andy Lutomirski, Yi Lai
Cc: iommu, security, x86, linux-mm, linux-kernel, Dave Hansen
On 19.09.25 07:40, Lu Baolu wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> This introduces a conditional asynchronous mechanism, enabled by
> CONFIG_ASYNC_PGTABLE_FREE. When enabled, this mechanism defers the freeing
> of pages that are used as page tables for kernel address mappings. These
> pages are now queued to a work struct instead of being freed immediately.
>
Okay, I now looked at patch #8 and I think the whole reason for this
patch is "batch-free page tables to minimize the impact of an expensive
cross-page table operation" which is a single TLB flush.
> This deferred freeing provides a safe context for a future patch to add
So I would clarify here instead something like
"This deferred freeing allows for batch-freeing of page tables,
providing a safe context for performing a single expensive operation
(TLB flush) for a batch of kernel page tables instead of performing that
expensive operation for each page table."
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v5 1/8] mm: Add a ptdesc flag to mark kernel page tables
2025-10-08 19:56 ` Matthew Wilcox
@ 2025-10-11 6:24 ` Baolu Lu
0 siblings, 0 replies; 29+ messages in thread
From: Baolu Lu @ 2025-10-11 6:24 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Alistair Popple,
Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
Andy Lutomirski, Yi Lai, iommu, security, x86, linux-mm,
linux-kernel, Dave Hansen
On 10/9/25 03:56, Matthew Wilcox wrote:
> On Fri, Sep 19, 2025 at 01:39:59PM +0800, Lu Baolu wrote:
>> +static inline void ptdesc_set_kernel(struct ptdesc *ptdesc)
>> +{
>> + struct folio *folio = ptdesc_folio(ptdesc);
>> +
>> + folio_set_referenced(folio);
>> +}
> So this was the right way to do this at the time. However, if you look
> at commit 522abd92279a this should now be ...
>
> enum pt_flags {
> PT_reserved = PG_reserved,
> + PT_kernel = PG_referenced,
> /* High bits are used for zone/node/section */
> };
> [...]
>
> +static inline void ptdesc_set_kernel(struct ptdesc *ptdesc)
> +{
> + set_bit(PT_kernel, &pt->pt_flags.f);
> +}
Thank you for the review comment. I've updated the patch as follows:
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a3f97c551ad8..5abd427b6202 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2940,6 +2940,7 @@ static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long a
#endif /* CONFIG_MMU */
enum pt_flags {
+ PT_kernel = PG_referenced,
PT_reserved = PG_reserved,
/* High bits are used for zone/node/section */
};
@@ -2965,6 +2966,46 @@ static inline bool pagetable_is_reserved(struct ptdesc *pt)
return test_bit(PT_reserved, &pt->pt_flags.f);
}
+/**
+ * ptdesc_set_kernel - Mark a ptdesc used to map the kernel
+ * @ptdesc: The ptdesc to be marked
+ *
+ * Kernel page tables often need special handling. Set a flag so that
+ * the handling code knows this ptdesc will not be used for userspace.
+ */
+static inline void ptdesc_set_kernel(struct ptdesc *ptdesc)
+{
+ set_bit(PT_kernel, &ptdesc->pt_flags.f);
+}
+
+/**
+ * ptdesc_clear_kernel - Mark a ptdesc as no longer used to map the kernel
+ * @ptdesc: The ptdesc to be unmarked
+ *
+ * Use when the ptdesc is no longer used to map the kernel and no longer
+ * needs special handling.
+ */
+static inline void ptdesc_clear_kernel(struct ptdesc *ptdesc)
+{
+ /*
+ * Note: the 'PG_referenced' bit does not strictly need to be
+ * cleared before freeing the page. But this is nice for
+ * symmetry.
+ */
+ clear_bit(PT_kernel, &ptdesc->pt_flags.f);
+}
+
+/**
+ * ptdesc_test_kernel - Check if a ptdesc is used to map the kernel
+ * @ptdesc: The ptdesc being tested
+ *
+ * Call to tell if the ptdesc is used to map the kernel.
+ */
+static inline bool ptdesc_test_kernel(struct ptdesc *ptdesc)
+{
+ return test_bit(PT_kernel, &ptdesc->pt_flags.f);
+}
+
/**
* pagetable_alloc - Allocate pagetables
* @gfp: GFP flags
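For illustration, an allocation path would mark the ptdesc right after a
successful allocation, roughly like this (a hypothetical caller; the real
wiring into the allocation helpers is done in patch 2/8):

static pte_t *example_pte_alloc_kernel(void)
{
	struct ptdesc *ptdesc = pagetable_alloc(GFP_PGTABLE_KERNEL, 0);

	if (!ptdesc)
		return NULL;

	/* This page table maps the kernel, not userspace */
	ptdesc_set_kernel(ptdesc);
	return ptdesc_address(ptdesc);
}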
Thanks,
baolu
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v5 3/8] x86/mm: Use 'ptdesc' when freeing PMD pages
2025-10-09 19:31 ` Dave Hansen
@ 2025-10-11 6:26 ` Baolu Lu
0 siblings, 0 replies; 29+ messages in thread
From: Baolu Lu @ 2025-10-11 6:26 UTC (permalink / raw)
To: Dave Hansen, David Hildenbrand, Joerg Roedel, Will Deacon,
Robin Murphy, Kevin Tian, Jason Gunthorpe, Jann Horn,
Vasant Hegde, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Alistair Popple, Peter Zijlstra, Uladzislau Rezki,
Jean-Philippe Brucker, Andy Lutomirski, Yi Lai
Cc: iommu, security, x86, linux-mm, linux-kernel, Dave Hansen
On 10/10/25 03:31, Dave Hansen wrote:
> On 10/9/25 12:25, David Hildenbrand wrote:
>>>
>>> @@ -750,8 +750,8 @@ int pud_free_pmd_page(pud_t *pud, unsigned long addr)
>>> for (i = 0; i < PTRS_PER_PMD; i++) {
>>> if (!pmd_none(pmd_sv[i])) {
>>> - pte = (pte_t *)pmd_page_vaddr(pmd_sv[i]);
>>> - pte_free_kernel(&init_mm, pte);
>>> + pt = page_ptdesc(pmd_page(pmd_sv[i]));
>>> + pagetable_dtor_free(pt);
>>
>> There is pmd_ptdesc() which does
>>
>> page_ptdesc(pmd_pgtable_page(pmd));
>>
>> It's buried in a
>>
>> #if defined(CONFIG_SPLIT_PMD_PTLOCKS)
>>
>> Can't we just make that always available so we can use it here?
>
> Yes, that looks like a good idea. I never noticed pmd_ptdesc() when I
> was writing this for sure.
I've updated the patch like this:
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6a0bb7fc3148..a0850dc6878e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3203,8 +3203,6 @@ pte_t *pte_offset_map_rw_nolock(struct mm_struct *mm, pmd_t *pmd,
((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd))? \
NULL: pte_offset_kernel(pmd, address))
-#if defined(CONFIG_SPLIT_PMD_PTLOCKS)
-
static inline struct page *pmd_pgtable_page(pmd_t *pmd)
{
unsigned long mask = ~(PTRS_PER_PMD * sizeof(pmd_t) - 1);
@@ -3216,6 +3214,8 @@ static inline struct ptdesc *pmd_ptdesc(pmd_t *pmd)
return page_ptdesc(pmd_pgtable_page(pmd));
}
+#if defined(CONFIG_SPLIT_PMD_PTLOCKS)
+
static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
{
return ptlock_ptr(pmd_ptdesc(pmd));
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index ddf248c3ee7d..c830ccbc2fd8 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -729,7 +729,7 @@ int pmd_clear_huge(pmd_t *pmd)
int pud_free_pmd_page(pud_t *pud, unsigned long addr)
{
pmd_t *pmd, *pmd_sv;
- pte_t *pte;
+ struct ptdesc *pt;
int i;
pmd = pud_pgtable(*pud);
@@ -750,8 +750,8 @@ int pud_free_pmd_page(pud_t *pud, unsigned long addr)
for (i = 0; i < PTRS_PER_PMD; i++) {
if (!pmd_none(pmd_sv[i])) {
- pte = (pte_t *)pmd_page_vaddr(pmd_sv[i]);
- pte_free_kernel(&init_mm, pte);
+ pt = pmd_ptdesc(&pmd_sv[i]);
+ pagetable_dtor_free(pt);
}
}
@@ -772,15 +772,15 @@ int pud_free_pmd_page(pud_t *pud, unsigned long addr)
*/
int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
{
- pte_t *pte;
+ struct ptdesc *pt;
- pte = (pte_t *)pmd_page_vaddr(*pmd);
+ pt = pmd_ptdesc(pmd);
pmd_clear(pmd);
/* INVLPG to clear all paging-structure caches */
flush_tlb_kernel_range(addr, addr + PAGE_SIZE-1);
- pte_free_kernel(&init_mm, pte);
+ pagetable_dtor_free(pt);
return 1;
}
Thanks,
baolu
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v5 6/8] mm: Introduce deferred freeing for kernel page tables
2025-10-10 15:47 ` David Hildenbrand
@ 2025-10-11 6:30 ` Baolu Lu
0 siblings, 0 replies; 29+ messages in thread
From: Baolu Lu @ 2025-10-11 6:30 UTC (permalink / raw)
To: David Hildenbrand, Joerg Roedel, Will Deacon, Robin Murphy,
Kevin Tian, Jason Gunthorpe, Jann Horn, Vasant Hegde,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Alistair Popple, Peter Zijlstra, Uladzislau Rezki,
Jean-Philippe Brucker, Andy Lutomirski, Yi Lai
Cc: iommu, security, x86, linux-mm, linux-kernel, Dave Hansen
On 10/10/25 23:47, David Hildenbrand wrote:
> On 19.09.25 07:40, Lu Baolu wrote:
>> From: Dave Hansen <dave.hansen@linux.intel.com>
>>
>> This introduces a conditional asynchronous mechanism, enabled by
>> CONFIG_ASYNC_PGTABLE_FREE. When enabled, this mechanism defers the
>> freeing
>> of pages that are used as page tables for kernel address mappings. These
>> pages are now queued to a work struct instead of being freed immediately.
>>
>
> Okay, I've now looked at patch #8, and I think the whole reason for this
> patch is to "batch-free page tables to minimize the impact of an expensive
> cross-page-table operation", which here is a single TLB flush.
>
>> This deferred freeing provides a safe context for a future patch to add
>
> So I would clarify here instead with something like
>
> "This deferred freeing allows for batch-freeing of page tables,
> providing a safe context for performing a single expensive operation
> (TLB flush) for a batch of kernel page tables instead of performing that
> expensive operation for each page table."
The commit message has been updated, and CONFIG_ASYNC_PGTABLE_FREE has
been replaced with CONFIG_ASYNC_KERNEL_PGTABLE_FREE. Thank you for the
comments.
Thanks,
baolu
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v5 2/8] mm: Actually mark kernel page table pages
2025-09-19 5:40 ` [PATCH v5 2/8] mm: Actually mark kernel page table pages Lu Baolu
2025-10-09 19:19 ` David Hildenbrand
@ 2025-10-13 7:17 ` Mike Rapoport
1 sibling, 0 replies; 29+ messages in thread
From: Mike Rapoport @ 2025-10-13 7:17 UTC (permalink / raw)
To: Lu Baolu
Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Alistair Popple,
Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
Andy Lutomirski, Yi Lai, iommu, security, x86, linux-mm,
linux-kernel, Dave Hansen
On Fri, Sep 19, 2025 at 01:40:00PM +0800, Lu Baolu wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> Now that the API is in place, mark kernel page table pages just
> after they are allocated. Unmark them just before they are freed.
>
> Note: Unconditionally clearing the 'kernel' marking (via
> ptdesc_clear_kernel()) would be functionally identical to what
> is here. But having the if() makes it logically clear that this
> function can be used for kernel and non-kernel page tables.
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
> include/asm-generic/pgalloc.h | 18 ++++++++++++++++++
> include/linux/mm.h | 3 +++
> 2 files changed, 21 insertions(+)
>
> diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h
> index 3c8ec3bfea44..b9d2a7c79b93 100644
> --- a/include/asm-generic/pgalloc.h
> +++ b/include/asm-generic/pgalloc.h
> @@ -28,6 +28,8 @@ static inline pte_t *__pte_alloc_one_kernel_noprof(struct mm_struct *mm)
> return NULL;
> }
>
> + ptdesc_set_kernel(ptdesc);
> +
> return ptdesc_address(ptdesc);
> }
> #define __pte_alloc_one_kernel(...) alloc_hooks(__pte_alloc_one_kernel_noprof(__VA_ARGS__))
> @@ -146,6 +148,10 @@ static inline pmd_t *pmd_alloc_one_noprof(struct mm_struct *mm, unsigned long ad
> pagetable_free(ptdesc);
> return NULL;
> }
> +
> + if (mm == &init_mm)
> + ptdesc_set_kernel(ptdesc);
> +
> return ptdesc_address(ptdesc);
> }
> #define pmd_alloc_one(...) alloc_hooks(pmd_alloc_one_noprof(__VA_ARGS__))
> @@ -179,6 +185,10 @@ static inline pud_t *__pud_alloc_one_noprof(struct mm_struct *mm, unsigned long
> return NULL;
>
> pagetable_pud_ctor(ptdesc);
> +
> + if (mm == &init_mm)
> + ptdesc_set_kernel(ptdesc);
> +
> return ptdesc_address(ptdesc);
> }
> #define __pud_alloc_one(...) alloc_hooks(__pud_alloc_one_noprof(__VA_ARGS__))
> @@ -233,6 +243,10 @@ static inline p4d_t *__p4d_alloc_one_noprof(struct mm_struct *mm, unsigned long
> return NULL;
>
> pagetable_p4d_ctor(ptdesc);
> +
> + if (mm == &init_mm)
> + ptdesc_set_kernel(ptdesc);
> +
> return ptdesc_address(ptdesc);
> }
> #define __p4d_alloc_one(...) alloc_hooks(__p4d_alloc_one_noprof(__VA_ARGS__))
> @@ -277,6 +291,10 @@ static inline pgd_t *__pgd_alloc_noprof(struct mm_struct *mm, unsigned int order
> return NULL;
>
> pagetable_pgd_ctor(ptdesc);
> +
> + if (mm == &init_mm)
> + ptdesc_set_kernel(ptdesc);
> +
> return ptdesc_address(ptdesc);
> }
> #define __pgd_alloc(...) alloc_hooks(__pgd_alloc_noprof(__VA_ARGS__))
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 1ae97a0b8ec7..f3db3a5ebefe 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2895,6 +2895,9 @@ static inline void pagetable_free(struct ptdesc *pt)
> {
> struct page *page = ptdesc_page(pt);
>
> + if (ptdesc_test_kernel(pt))
> + ptdesc_clear_kernel(pt);
> +
> __free_pages(page, compound_order(page));
> }
>
> --
> 2.43.0
>
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v5 4/8] mm: Introduce pure page table freeing function
2025-09-19 5:40 ` [PATCH v5 4/8] mm: Introduce pure page table freeing function Lu Baolu
2025-10-09 19:26 ` David Hildenbrand
@ 2025-10-13 7:24 ` Mike Rapoport
1 sibling, 0 replies; 29+ messages in thread
From: Mike Rapoport @ 2025-10-13 7:24 UTC (permalink / raw)
To: Lu Baolu
Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Alistair Popple,
Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
Andy Lutomirski, Yi Lai, iommu, security, x86, linux-mm,
linux-kernel, Dave Hansen
On Fri, Sep 19, 2025 at 01:40:02PM +0800, Lu Baolu wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> The pages used for ptdescs are currently freed back to the allocator
> in a single location. They will shortly be freed from a second
> location.
>
> Create a simple helper that just frees them back to the allocator.
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
> include/linux/mm.h | 11 ++++++++---
> 1 file changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index f3db3a5ebefe..668d519edc0f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2884,6 +2884,13 @@ static inline struct ptdesc *pagetable_alloc_noprof(gfp_t gfp, unsigned int orde
> }
> #define pagetable_alloc(...) alloc_hooks(pagetable_alloc_noprof(__VA_ARGS__))
>
> +static inline void __pagetable_free(struct ptdesc *pt)
> +{
> + struct page *page = ptdesc_page(pt);
> +
> + __free_pages(page, compound_order(page));
> +}
> +
> /**
> * pagetable_free - Free pagetables
> * @pt: The page table descriptor
> @@ -2893,12 +2900,10 @@ static inline struct ptdesc *pagetable_alloc_noprof(gfp_t gfp, unsigned int orde
> */
> static inline void pagetable_free(struct ptdesc *pt)
> {
> - struct page *page = ptdesc_page(pt);
> -
> if (ptdesc_test_kernel(pt))
> ptdesc_clear_kernel(pt);
>
> - __free_pages(page, compound_order(page));
> + __pagetable_free(pt);
> }
>
> #if defined(CONFIG_SPLIT_PTE_PTLOCKS)
> --
> 2.43.0
>
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v5 5/8] x86/mm: Use pagetable_free()
2025-09-19 5:40 ` [PATCH v5 5/8] x86/mm: Use pagetable_free() Lu Baolu
2025-09-24 12:40 ` Jason Gunthorpe
2025-10-09 19:26 ` David Hildenbrand
@ 2025-10-13 7:28 ` Mike Rapoport
2 siblings, 0 replies; 29+ messages in thread
From: Mike Rapoport @ 2025-10-13 7:28 UTC (permalink / raw)
To: Lu Baolu
Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Alistair Popple,
Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
Andy Lutomirski, Yi Lai, iommu, security, x86, linux-mm,
linux-kernel
On Fri, Sep 19, 2025 at 01:40:03PM +0800, Lu Baolu wrote:
> The kernel's memory management subsystem provides a dedicated interface,
> pagetable_free(), for freeing page table pages. Update two call sites to
> use pagetable_free() instead of the lower-level __free_page() or
> free_pages(). This improves code consistency and clarity, and ensures the
> correct freeing mechanism is used.
>
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
> arch/x86/mm/init_64.c | 2 +-
> arch/x86/mm/pat/set_memory.c | 2 +-
> 2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index b9426fce5f3e..3d9a5e4ccaa4 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -1031,7 +1031,7 @@ static void __meminit free_pagetable(struct page *page, int order)
> free_reserved_pages(page, nr_pages);
> #endif
> } else {
> - free_pages((unsigned long)page_address(page), order);
> + pagetable_free(page_ptdesc(page));
> }
> }
>
> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index 8834c76f91c9..8b78a8855024 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -438,7 +438,7 @@ static void cpa_collapse_large_pages(struct cpa_data *cpa)
>
> list_for_each_entry_safe(ptdesc, tmp, &pgtables, pt_list) {
> list_del(&ptdesc->pt_list);
> - __free_page(ptdesc_page(ptdesc));
> + pagetable_free(ptdesc);
> }
> }
>
> --
> 2.43.0
>
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v5 0/8] Fix stale IOTLB entries for kernel address space
2025-10-08 19:42 ` Dave Hansen
2025-10-09 19:16 ` David Hildenbrand
@ 2025-10-14 13:21 ` Baolu Lu
1 sibling, 0 replies; 29+ messages in thread
From: Baolu Lu @ 2025-10-14 13:21 UTC (permalink / raw)
To: Dave Hansen, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Alistair Popple, Peter Zijlstra,
Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Andrew Morton, Vlastimil Babka, Mike Rapoport, Michal Hocko,
Matthew Wilcox
Cc: baolu.lu, iommu, security, x86, linux-mm, linux-kernel
On 10/9/2025 3:42 AM, Dave Hansen wrote:
> I wondered why no mm folks were commenting on this. linux-mm@ was cc'd,
> but the _people_ on cc seem to have been almost all IOMMU and x86 folks,
> so I added a few mm folks...
>
> On 9/25/25 13:24, Dave Hansen wrote:
>> On 9/18/25 22:39, Lu Baolu wrote:
>>> This solution introduces a deferred freeing mechanism for kernel page
>>> table pages, which provides a safe window to notify the IOMMU to
>>> invalidate its caches before the page is reused.
>> I think all the activity has died down and everyone seems happy enough
>> with how this looks. Right?
>>
>> So is this something we should prod Andrew to take through the mm tree,
>> or is it x86-specific enough it should go through tip?
> Hi Folks! We've got a bug fix here that has impact on x86, mm, and IOMMU
> code. I know I've talked with a few of you about this along the way, but
> it's really thin on mm reviews, probably because mm folks haven't been
> cc'd. Any eyeballs on it would be appreciated!
>
> It seems like it should _probably_ go through the mm tree, although I'm
> happy to send it through tip if folks disagree.
Thank you all for the review comments. I have updated this series with a
new version and posted it here, with the mm folks cc'ed:
https://lore.kernel.org/linux-iommu/20251014130437.1090448-1-baolu.lu@linux.intel.com/
Thanks,
baolu
^ permalink raw reply [flat|nested] 29+ messages in thread
end of thread, other threads:[~2025-10-14 13:21 UTC | newest]
Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-09-19 5:39 [PATCH v5 0/8] Fix stale IOTLB entries for kernel address space Lu Baolu
2025-09-19 5:39 ` [PATCH v5 1/8] mm: Add a ptdesc flag to mark kernel page tables Lu Baolu
2025-10-08 19:56 ` Matthew Wilcox
2025-10-11 6:24 ` Baolu Lu
2025-09-19 5:40 ` [PATCH v5 2/8] mm: Actually mark kernel page table pages Lu Baolu
2025-10-09 19:19 ` David Hildenbrand
2025-10-13 7:17 ` Mike Rapoport
2025-09-19 5:40 ` [PATCH v5 3/8] x86/mm: Use 'ptdesc' when freeing PMD pages Lu Baolu
2025-10-09 19:25 ` David Hildenbrand
2025-10-09 19:31 ` Dave Hansen
2025-10-11 6:26 ` Baolu Lu
2025-09-19 5:40 ` [PATCH v5 4/8] mm: Introduce pure page table freeing function Lu Baolu
2025-10-09 19:26 ` David Hildenbrand
2025-10-13 7:24 ` Mike Rapoport
2025-09-19 5:40 ` [PATCH v5 5/8] x86/mm: Use pagetable_free() Lu Baolu
2025-09-24 12:40 ` Jason Gunthorpe
2025-10-09 19:26 ` David Hildenbrand
2025-10-13 7:28 ` Mike Rapoport
2025-09-19 5:40 ` [PATCH v5 6/8] mm: Introduce deferred freeing for kernel page tables Lu Baolu
2025-10-09 19:28 ` David Hildenbrand
2025-10-09 19:32 ` Dave Hansen
2025-10-10 15:47 ` David Hildenbrand
2025-10-11 6:30 ` Baolu Lu
2025-09-19 5:40 ` [PATCH v5 7/8] mm: Hook up Kconfig options for async page table freeing Lu Baolu
2025-09-19 5:40 ` [PATCH v5 8/8] iommu/sva: Invalidate stale IOTLB entries for kernel address space Lu Baolu
2025-09-25 20:24 ` [PATCH v5 0/8] Fix " Dave Hansen
2025-10-08 19:42 ` Dave Hansen
2025-10-09 19:16 ` David Hildenbrand
2025-10-14 13:21 ` Baolu Lu