[PATCH v7 0/8] Fix stale IOTLB entries for kernel address space

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v7 0/8] Fix stale IOTLB entries for kernel address space
@ 2025-10-22  8:26 Lu Baolu
  2025-10-22  8:26 ` [PATCH v7 1/8] iommu: Disable SVA when CONFIG_X86 is set Lu Baolu
                   ` (8 more replies)
  0 siblings, 9 replies; 20+ messages in thread
From: Lu Baolu @ 2025-10-22  8:26 UTC (permalink / raw)
  To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
	Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, Alistair Popple,
	Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
	Andy Lutomirski, Yi Lai, David Hildenbrand, Lorenzo Stoakes,
	Liam R . Howlett, Andrew Morton, Vlastimil Babka, Mike Rapoport,
	Michal Hocko, Matthew Wilcox, Vinicius Costa Gomes
  Cc: iommu, security, x86, linux-mm, linux-kernel, Lu Baolu

This proposes a fix for a security vulnerability related to IOMMU Shared
Virtual Addressing (SVA). In an SVA context, an IOMMU can cache kernel
page table entries. When a kernel page table page is freed and
reallocated for another purpose, the IOMMU might still hold stale,
incorrect entries. This can be exploited to cause a use-after-free or
write-after-free condition, potentially leading to privilege escalation
or data corruption.

This solution introduces a deferred freeing mechanism for kernel page
table pages, which provides a safe window to notify the IOMMU to
invalidate its caches before the page is reused.

Change log:
v7:
 - The use of pmd_ptdesc() introduced a bug reported at
   https://lore.kernel.org/linux-iommu/68eeb99e.050a0220.91a22.0220.GAE@google.com/.
   Fix this by replacing it with page_ptdesc().
 - Discussed the approach of backporting and reached a consensus that we
   need an extra patch to disable SVA for x86 arch and re-enable it after
   the kernel page table free callback is done.
 - Use "const struct ptdesc *ptdesc" as the parameter for
   ptdesc_test_kernel().
 - Move "select ASYNC_KERNEL_PGTABLE_FREE" to the last patch.

v6:
 - https://lore.kernel.org/linux-iommu/20251014130437.1090448-1-baolu.lu@linux.intel.com/
 - Follow commit 522abd92279a to set/clear/test a flag of struct
   ptdesc.
 - User pmd_ptdesc() helper.
 - Squash previous PATCH 6 and 7.
 - Rename CONFIG_ASYNC_PGTABLE_FREE to CONFIG_ASYNC_KERNEL_PGTABLE_FREE.
 - Refine commit message.
 - Rebase on top of v6.18-rc1.

v5:
 - https://lore.kernel.org/linux-iommu/20250919054007.472493-1-baolu.lu@linux.intel.com/
 - Renamed pagetable_free_async() to pagetable_free_kernel() to avoid
   confusion.
 - Removed list_del() when the list is on the stack, as it will be freed
   when the function returns.
 - Discussed a corner case related to memory unplug of memory that was
   present as reserved memory at boot. Given that it's extremely rare
   and cannot be triggered by unprivileged users. We decided to focus
   our efforts on the common vfree() case and noted that corner case in
   the commit message.
 - Some cleanups.

v4:
 - https://lore.kernel.org/linux-iommu/20250905055103.3821518-1-baolu.lu@linux.intel.com/
 - Introduce a mechanism to defer the freeing of page-table pages for
   KVA mappings. Call iommu_sva_invalidate_kva_range() in the deferred
   work thread before freeing the pages.

v3:
 - https://lore.kernel.org/linux-iommu/20250806052505.3113108-1-baolu.lu@linux.intel.com/
 - iommu_sva_mms is an unbound list; iterating it in an atomic context
   could introduce significant latency issues. Schedule it in a kernel
   thread and replace the spinlock with a mutex.
 - Replace the static key with a normal bool; it can be brought back if
   data shows the benefit.
 - Invalidate KVA range in the flush_tlb_all() paths.
 - All previous reviewed-bys are preserved. Please let me know if there
   are any objections.

v2:
 - https://lore.kernel.org/linux-iommu/20250709062800.651521-1-baolu.lu@linux.intel.com/
 - Remove EXPORT_SYMBOL_GPL(iommu_sva_invalidate_kva_range);
 - Replace the mutex with a spinlock to make the interface usable in the
   critical regions.

v1: https://lore.kernel.org/linux-iommu/20250704133056.4023816-1-baolu.lu@linux.intel.com/

Dave Hansen (5):
  mm: Add a ptdesc flag to mark kernel page tables
  mm: Actually mark kernel page table pages
  x86/mm: Use 'ptdesc' when freeing PMD pages
  mm: Introduce pure page table freeing function
  mm: Introduce deferred freeing for kernel page tables

Lu Baolu (3):
  iommu: Disable SVA when CONFIG_X86 is set
  x86/mm: Use pagetable_free()
  iommu/sva: Invalidate stale IOTLB entries for kernel address space

 arch/x86/Kconfig              |  1 +
 mm/Kconfig                    |  3 ++
 include/asm-generic/pgalloc.h | 18 ++++++++++
 include/linux/iommu.h         |  4 +++
 include/linux/mm.h            | 65 +++++++++++++++++++++++++++++++++--
 arch/x86/mm/init_64.c         |  2 +-
 arch/x86/mm/pat/set_memory.c  |  2 +-
 arch/x86/mm/pgtable.c         | 12 +++----
 drivers/iommu/iommu-sva.c     | 29 +++++++++++++++-
 mm/pgtable-generic.c          | 39 +++++++++++++++++++++
 10 files changed, 163 insertions(+), 12 deletions(-)

-- 
2.43.0



^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v7 1/8] iommu: Disable SVA when CONFIG_X86 is set
  2025-10-22  8:26 [PATCH v7 0/8] Fix stale IOTLB entries for kernel address space Lu Baolu
@ 2025-10-22  8:26 ` Lu Baolu
  2025-10-22 19:50   ` Jason Gunthorpe
  2025-10-22  8:26 ` [PATCH v7 2/8] mm: Add a ptdesc flag to mark kernel page tables Lu Baolu
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 20+ messages in thread
From: Lu Baolu @ 2025-10-22  8:26 UTC (permalink / raw)
  To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
	Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, Alistair Popple,
	Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
	Andy Lutomirski, Yi Lai, David Hildenbrand, Lorenzo Stoakes,
	Liam R . Howlett, Andrew Morton, Vlastimil Babka, Mike Rapoport,
	Michal Hocko, Matthew Wilcox, Vinicius Costa Gomes
  Cc: iommu, security, x86, linux-mm, linux-kernel, Lu Baolu, stable

In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU hardware
shares and walks the CPU's page tables. The x86 architecture maps the
kernel's virtual address space into the upper portion of every process's
page table. Consequently, in an SVA context, the IOMMU hardware can walk
and cache kernel page table entries.

The Linux kernel currently lacks a notification mechanism for kernel page
table changes, specifically when page table pages are freed and reused.
The IOMMU driver is only notified of changes to user virtual address
mappings. This can cause the IOMMU's internal caches to retain stale
entries for kernel VA.

Use-After-Free (UAF) and Write-After-Free (WAF) conditions arise when
kernel page table pages are freed and later reallocated. The IOMMU could
misinterpret the new data as valid page table entries. The IOMMU might
then walk into attacker-controlled memory, leading to arbitrary physical
memory DMA access or privilege escalation. This is also a Write-After-Free
issue, as the IOMMU will potentially continue to write Accessed and Dirty
bits to the freed memory while attempting to walk the stale page tables.

Currently, SVA contexts are unprivileged and cannot access kernel
mappings. However, the IOMMU will still walk kernel-only page tables
all the way down to the leaf entries, where it realizes the mapping
is for the kernel and errors out. This means the IOMMU still caches
these intermediate page table entries, making the described vulnerability
a real concern.

Disable SVA on x86 architecture until the IOMMU can receive notification
to flush the paging cache before freeing the CPU kernel page table pages.

Fixes: 26b25a2b98e4 ("iommu: Bind process address spaces to devices")
Cc: stable@vger.kernel.org
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/iommu-sva.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c
index 1a51cfd82808..a0442faad952 100644
--- a/drivers/iommu/iommu-sva.c
+++ b/drivers/iommu/iommu-sva.c
@@ -77,6 +77,9 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev, struct mm_struct *mm
 	if (!group)
 		return ERR_PTR(-ENODEV);

+	if (IS_ENABLED(CONFIG_X86))
+		return ERR_PTR(-EOPNOTSUPP);
+
 	mutex_lock(&iommu_sva_lock);

 	/* Allocate mm->pasid if necessary. */
-- 
2.43.0

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v7 1/8] iommu: Disable SVA when CONFIG_X86 is set
  2025-10-22  8:26 ` [PATCH v7 1/8] iommu: Disable SVA when CONFIG_X86 is set Lu Baolu
@ 2025-10-22 19:50   ` Jason Gunthorpe
  0 siblings, 0 replies; 20+ messages in thread
From: Jason Gunthorpe @ 2025-10-22 19:50 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn,
	Vasant Hegde, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, Alistair Popple, Peter Zijlstra, Uladzislau Rezki,
	Jean-Philippe Brucker, Andy Lutomirski, Yi Lai,
	David Hildenbrand, Lorenzo Stoakes, Liam R . Howlett,
	Andrew Morton, Vlastimil Babka, Mike Rapoport, Michal Hocko,
	Matthew Wilcox, Vinicius Costa Gomes, iommu, security, x86,
	linux-mm, linux-kernel, stable

On Wed, Oct 22, 2025 at 04:26:27PM +0800, Lu Baolu wrote:
> In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU hardware
> shares and walks the CPU's page tables. The x86 architecture maps the
> kernel's virtual address space into the upper portion of every process's
> page table. Consequently, in an SVA context, the IOMMU hardware can walk
> and cache kernel page table entries.
> 
> The Linux kernel currently lacks a notification mechanism for kernel page
> table changes, specifically when page table pages are freed and reused.
> The IOMMU driver is only notified of changes to user virtual address
> mappings. This can cause the IOMMU's internal caches to retain stale
> entries for kernel VA.
> 
> Use-After-Free (UAF) and Write-After-Free (WAF) conditions arise when
> kernel page table pages are freed and later reallocated. The IOMMU could
> misinterpret the new data as valid page table entries. The IOMMU might
> then walk into attacker-controlled memory, leading to arbitrary physical
> memory DMA access or privilege escalation. This is also a Write-After-Free
> issue, as the IOMMU will potentially continue to write Accessed and Dirty
> bits to the freed memory while attempting to walk the stale page tables.
> 
> Currently, SVA contexts are unprivileged and cannot access kernel
> mappings. However, the IOMMU will still walk kernel-only page tables
> all the way down to the leaf entries, where it realizes the mapping
> is for the kernel and errors out. This means the IOMMU still caches
> these intermediate page table entries, making the described vulnerability
> a real concern.
> 
> Disable SVA on x86 architecture until the IOMMU can receive notification
> to flush the paging cache before freeing the CPU kernel page table pages.
> 
> Fixes: 26b25a2b98e4 ("iommu: Bind process address spaces to devices")
> Cc: stable@vger.kernel.org
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> ---
>  drivers/iommu/iommu-sva.c | 3 +++
>  1 file changed, 3 insertions(+)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v7 2/8] mm: Add a ptdesc flag to mark kernel page tables
  2025-10-22  8:26 [PATCH v7 0/8] Fix stale IOTLB entries for kernel address space Lu Baolu
  2025-10-22  8:26 ` [PATCH v7 1/8] iommu: Disable SVA when CONFIG_X86 is set Lu Baolu
@ 2025-10-22  8:26 ` Lu Baolu
  2025-10-22 18:31   ` David Hildenbrand
  2025-10-23  7:07   ` Mike Rapoport
  2025-10-22  8:26 ` [PATCH v7 3/8] mm: Actually mark kernel page table pages Lu Baolu
                   ` (6 subsequent siblings)
  8 siblings, 2 replies; 20+ messages in thread
From: Lu Baolu @ 2025-10-22  8:26 UTC (permalink / raw)
  To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
	Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, Alistair Popple,
	Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
	Andy Lutomirski, Yi Lai, David Hildenbrand, Lorenzo Stoakes,
	Liam R . Howlett, Andrew Morton, Vlastimil Babka, Mike Rapoport,
	Michal Hocko, Matthew Wilcox, Vinicius Costa Gomes
  Cc: iommu, security, x86, linux-mm, linux-kernel, Dave Hansen, Lu Baolu

From: Dave Hansen <dave.hansen@linux.intel.com>

The page tables used to map the kernel and userspace often have very
different handling rules. There are frequently *_kernel() variants of
functions just for kernel page tables. That's not great and has lead
to code duplication.

Instead of having completely separate call paths, allow a 'ptdesc' to
be marked as being for kernel mappings. Introduce helpers to set and
clear this status.

Note: this uses the PG_referenced bit. Page flags are a great fit for
this since it is truly a single bit of information.  Use PG_referenced
itself because it's a fairly benign flag (as opposed to things like
PG_lock). It's also (according to Willy) unlikely to go away any time
soon.

PG_referenced is not in PAGE_FLAGS_CHECK_AT_FREE. It does not need to
be cleared before freeing the page, and pages coming out of the
allocator should have it cleared. Regardless, introduce an API to
clear it anyway. Having symmetry in the API makes it easier to change
the underlying implementation later, like if there was a need to move
to a PAGE_FLAGS_CHECK_AT_FREE bit.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
 include/linux/mm.h | 41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index d16b33bacc32..354d7925bf77 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2940,6 +2940,7 @@ static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long a
 #endif /* CONFIG_MMU */
 
 enum pt_flags {
+	PT_kernel = PG_referenced,
 	PT_reserved = PG_reserved,
 	/* High bits are used for zone/node/section */
 };
@@ -2965,6 +2966,46 @@ static inline bool pagetable_is_reserved(struct ptdesc *pt)
 	return test_bit(PT_reserved, &pt->pt_flags.f);
 }
 
+/**
+ * ptdesc_set_kernel - Mark a ptdesc used to map the kernel
+ * @ptdesc: The ptdesc to be marked
+ *
+ * Kernel page tables often need special handling. Set a flag so that
+ * the handling code knows this ptdesc will not be used for userspace.
+ */
+static inline void ptdesc_set_kernel(struct ptdesc *ptdesc)
+{
+	set_bit(PT_kernel, &ptdesc->pt_flags.f);
+}
+
+/**
+ * ptdesc_clear_kernel - Mark a ptdesc as no longer used to map the kernel
+ * @ptdesc: The ptdesc to be unmarked
+ *
+ * Use when the ptdesc is no longer used to map the kernel and no longer
+ * needs special handling.
+ */
+static inline void ptdesc_clear_kernel(struct ptdesc *ptdesc)
+{
+	/*
+	 * Note: the 'PG_referenced' bit does not strictly need to be
+	 * cleared before freeing the page. But this is nice for
+	 * symmetry.
+	 */
+	clear_bit(PT_kernel, &ptdesc->pt_flags.f);
+}
+
+/**
+ * ptdesc_test_kernel - Check if a ptdesc is used to map the kernel
+ * @ptdesc: The ptdesc being tested
+ *
+ * Call to tell if the ptdesc used to map the kernel.
+ */
+static inline bool ptdesc_test_kernel(const struct ptdesc *ptdesc)
+{
+	return test_bit(PT_kernel, &ptdesc->pt_flags.f);
+}
+
 /**
  * pagetable_alloc - Allocate pagetables
  * @gfp:    GFP flags
-- 
2.43.0



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v7 2/8] mm: Add a ptdesc flag to mark kernel page tables
  2025-10-22  8:26 ` [PATCH v7 2/8] mm: Add a ptdesc flag to mark kernel page tables Lu Baolu
@ 2025-10-22 18:31   ` David Hildenbrand
  2025-10-23  7:07   ` Mike Rapoport
  1 sibling, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2025-10-22 18:31 UTC (permalink / raw)
  To: Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
	Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, Alistair Popple,
	Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
	Andy Lutomirski, Yi Lai, Lorenzo Stoakes, Liam R . Howlett,
	Andrew Morton, Vlastimil Babka, Mike Rapoport, Michal Hocko,
	Matthew Wilcox, Vinicius Costa Gomes
  Cc: iommu, security, x86, linux-mm, linux-kernel, Dave Hansen

On 22.10.25 10:26, Lu Baolu wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> The page tables used to map the kernel and userspace often have very
> different handling rules. There are frequently *_kernel() variants of
> functions just for kernel page tables. That's not great and has lead
> to code duplication.
> 
> Instead of having completely separate call paths, allow a 'ptdesc' to
> be marked as being for kernel mappings. Introduce helpers to set and
> clear this status.
> 
> Note: this uses the PG_referenced bit. Page flags are a great fit for
> this since it is truly a single bit of information.  Use PG_referenced
> itself because it's a fairly benign flag (as opposed to things like
> PG_lock). It's also (according to Willy) unlikely to go away any time
> soon.
> 
> PG_referenced is not in PAGE_FLAGS_CHECK_AT_FREE. It does not need to
> be cleared before freeing the page, and pages coming out of the
> allocator should have it cleared. Regardless, introduce an API to
> clear it anyway. Having symmetry in the API makes it easier to change
> the underlying implementation later, like if there was a need to move
> to a PAGE_FLAGS_CHECK_AT_FREE bit.
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>

Just a process thing: if you modified patches such that you are 
considered a co-author, there should probably be a

	Co-developed-by: Lu Baolu <baolu.lu@linux.intel.com>

above your SOB.

See "When to use Acked-by:, Cc:, and Co-developed-by:" in 
Documentation/process/submitting-patches.rst

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v7 2/8] mm: Add a ptdesc flag to mark kernel page tables
  2025-10-22  8:26 ` [PATCH v7 2/8] mm: Add a ptdesc flag to mark kernel page tables Lu Baolu
  2025-10-22 18:31   ` David Hildenbrand
@ 2025-10-23  7:07   ` Mike Rapoport
  1 sibling, 0 replies; 20+ messages in thread
From: Mike Rapoport @ 2025-10-23  7:07 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
	Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, Alistair Popple,
	Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
	Andy Lutomirski, Yi Lai, David Hildenbrand, Lorenzo Stoakes,
	Liam R . Howlett, Andrew Morton, Vlastimil Babka, Michal Hocko,
	Matthew Wilcox, Vinicius Costa Gomes, iommu, security, x86,
	linux-mm, linux-kernel, Dave Hansen

On Wed, Oct 22, 2025 at 04:26:28PM +0800, Lu Baolu wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> The page tables used to map the kernel and userspace often have very
> different handling rules. There are frequently *_kernel() variants of
> functions just for kernel page tables. That's not great and has lead
> to code duplication.
> 
> Instead of having completely separate call paths, allow a 'ptdesc' to
> be marked as being for kernel mappings. Introduce helpers to set and
> clear this status.
> 
> Note: this uses the PG_referenced bit. Page flags are a great fit for
> this since it is truly a single bit of information.  Use PG_referenced
> itself because it's a fairly benign flag (as opposed to things like
> PG_lock). It's also (according to Willy) unlikely to go away any time
> soon.
> 
> PG_referenced is not in PAGE_FLAGS_CHECK_AT_FREE. It does not need to
> be cleared before freeing the page, and pages coming out of the
> allocator should have it cleared. Regardless, introduce an API to
> clear it anyway. Having symmetry in the API makes it easier to change
> the underlying implementation later, like if there was a need to move
> to a PAGE_FLAGS_CHECK_AT_FREE bit.
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> Acked-by: David Hildenbrand <david@redhat.com>

Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

> ---
>  include/linux/mm.h | 41 +++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 41 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index d16b33bacc32..354d7925bf77 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2940,6 +2940,7 @@ static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long a
>  #endif /* CONFIG_MMU */
>  
>  enum pt_flags {
> +	PT_kernel = PG_referenced,
>  	PT_reserved = PG_reserved,
>  	/* High bits are used for zone/node/section */
>  };
> @@ -2965,6 +2966,46 @@ static inline bool pagetable_is_reserved(struct ptdesc *pt)
>  	return test_bit(PT_reserved, &pt->pt_flags.f);
>  }
>  
> +/**
> + * ptdesc_set_kernel - Mark a ptdesc used to map the kernel
> + * @ptdesc: The ptdesc to be marked
> + *
> + * Kernel page tables often need special handling. Set a flag so that
> + * the handling code knows this ptdesc will not be used for userspace.
> + */
> +static inline void ptdesc_set_kernel(struct ptdesc *ptdesc)
> +{
> +	set_bit(PT_kernel, &ptdesc->pt_flags.f);
> +}
> +
> +/**
> + * ptdesc_clear_kernel - Mark a ptdesc as no longer used to map the kernel
> + * @ptdesc: The ptdesc to be unmarked
> + *
> + * Use when the ptdesc is no longer used to map the kernel and no longer
> + * needs special handling.
> + */
> +static inline void ptdesc_clear_kernel(struct ptdesc *ptdesc)
> +{
> +	/*
> +	 * Note: the 'PG_referenced' bit does not strictly need to be
> +	 * cleared before freeing the page. But this is nice for
> +	 * symmetry.
> +	 */
> +	clear_bit(PT_kernel, &ptdesc->pt_flags.f);
> +}
> +
> +/**
> + * ptdesc_test_kernel - Check if a ptdesc is used to map the kernel
> + * @ptdesc: The ptdesc being tested
> + *
> + * Call to tell if the ptdesc used to map the kernel.
> + */
> +static inline bool ptdesc_test_kernel(const struct ptdesc *ptdesc)
> +{
> +	return test_bit(PT_kernel, &ptdesc->pt_flags.f);
> +}
> +
>  /**
>   * pagetable_alloc - Allocate pagetables
>   * @gfp:    GFP flags
> -- 
> 2.43.0
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v7 3/8] mm: Actually mark kernel page table pages
  2025-10-22  8:26 [PATCH v7 0/8] Fix stale IOTLB entries for kernel address space Lu Baolu
  2025-10-22  8:26 ` [PATCH v7 1/8] iommu: Disable SVA when CONFIG_X86 is set Lu Baolu
  2025-10-22  8:26 ` [PATCH v7 2/8] mm: Add a ptdesc flag to mark kernel page tables Lu Baolu
@ 2025-10-22  8:26 ` Lu Baolu
  2025-10-22  8:26 ` [PATCH v7 4/8] x86/mm: Use 'ptdesc' when freeing PMD pages Lu Baolu
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 20+ messages in thread
From: Lu Baolu @ 2025-10-22  8:26 UTC (permalink / raw)
  To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
	Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, Alistair Popple,
	Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
	Andy Lutomirski, Yi Lai, David Hildenbrand, Lorenzo Stoakes,
	Liam R . Howlett, Andrew Morton, Vlastimil Babka, Mike Rapoport,
	Michal Hocko, Matthew Wilcox, Vinicius Costa Gomes
  Cc: iommu, security, x86, linux-mm, linux-kernel, Dave Hansen, Lu Baolu

From: Dave Hansen <dave.hansen@linux.intel.com>

Now that the API is in place, mark kernel page table pages just
after they are allocated. Unmark them just before they are freed.

Note: Unconditionally clearing the 'kernel' marking (via
ptdesc_clear_kernel()) would be functionally identical to what
is here. But having the if() makes it logically clear that this
function can be used for kernel and non-kernel page tables.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 include/asm-generic/pgalloc.h | 18 ++++++++++++++++++
 include/linux/mm.h            |  3 +++
 2 files changed, 21 insertions(+)

diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h
index 3c8ec3bfea44..b9d2a7c79b93 100644
--- a/include/asm-generic/pgalloc.h
+++ b/include/asm-generic/pgalloc.h
@@ -28,6 +28,8 @@ static inline pte_t *__pte_alloc_one_kernel_noprof(struct mm_struct *mm)
 		return NULL;
 	}
 
+	ptdesc_set_kernel(ptdesc);
+
 	return ptdesc_address(ptdesc);
 }
 #define __pte_alloc_one_kernel(...)	alloc_hooks(__pte_alloc_one_kernel_noprof(__VA_ARGS__))
@@ -146,6 +148,10 @@ static inline pmd_t *pmd_alloc_one_noprof(struct mm_struct *mm, unsigned long ad
 		pagetable_free(ptdesc);
 		return NULL;
 	}
+
+	if (mm == &init_mm)
+		ptdesc_set_kernel(ptdesc);
+
 	return ptdesc_address(ptdesc);
 }
 #define pmd_alloc_one(...)	alloc_hooks(pmd_alloc_one_noprof(__VA_ARGS__))
@@ -179,6 +185,10 @@ static inline pud_t *__pud_alloc_one_noprof(struct mm_struct *mm, unsigned long
 		return NULL;
 
 	pagetable_pud_ctor(ptdesc);
+
+	if (mm == &init_mm)
+		ptdesc_set_kernel(ptdesc);
+
 	return ptdesc_address(ptdesc);
 }
 #define __pud_alloc_one(...)	alloc_hooks(__pud_alloc_one_noprof(__VA_ARGS__))
@@ -233,6 +243,10 @@ static inline p4d_t *__p4d_alloc_one_noprof(struct mm_struct *mm, unsigned long
 		return NULL;
 
 	pagetable_p4d_ctor(ptdesc);
+
+	if (mm == &init_mm)
+		ptdesc_set_kernel(ptdesc);
+
 	return ptdesc_address(ptdesc);
 }
 #define __p4d_alloc_one(...)	alloc_hooks(__p4d_alloc_one_noprof(__VA_ARGS__))
@@ -277,6 +291,10 @@ static inline pgd_t *__pgd_alloc_noprof(struct mm_struct *mm, unsigned int order
 		return NULL;
 
 	pagetable_pgd_ctor(ptdesc);
+
+	if (mm == &init_mm)
+		ptdesc_set_kernel(ptdesc);
+
 	return ptdesc_address(ptdesc);
 }
 #define __pgd_alloc(...)	alloc_hooks(__pgd_alloc_noprof(__VA_ARGS__))
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 354d7925bf77..cca5946a9771 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3035,6 +3035,9 @@ static inline void pagetable_free(struct ptdesc *pt)
 {
 	struct page *page = ptdesc_page(pt);
 
+	if (ptdesc_test_kernel(pt))
+		ptdesc_clear_kernel(pt);
+
 	__free_pages(page, compound_order(page));
 }
 
-- 
2.43.0



^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v7 4/8] x86/mm: Use 'ptdesc' when freeing PMD pages
  2025-10-22  8:26 [PATCH v7 0/8] Fix stale IOTLB entries for kernel address space Lu Baolu
                   ` (2 preceding siblings ...)
  2025-10-22  8:26 ` [PATCH v7 3/8] mm: Actually mark kernel page table pages Lu Baolu
@ 2025-10-22  8:26 ` Lu Baolu
  2025-10-22 18:31   ` David Hildenbrand
  2025-10-22  8:26 ` [PATCH v7 5/8] mm: Introduce pure page table freeing function Lu Baolu
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 20+ messages in thread
From: Lu Baolu @ 2025-10-22  8:26 UTC (permalink / raw)
  To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
	Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, Alistair Popple,
	Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
	Andy Lutomirski, Yi Lai, David Hildenbrand, Lorenzo Stoakes,
	Liam R . Howlett, Andrew Morton, Vlastimil Babka, Mike Rapoport,
	Michal Hocko, Matthew Wilcox, Vinicius Costa Gomes
  Cc: iommu, security, x86, linux-mm, linux-kernel, Dave Hansen, Lu Baolu

From: Dave Hansen <dave.hansen@linux.intel.com>

There are a billion ways to refer to a physical memory address.
One of the x86 PMD freeing code location chooses to use a 'pte_t *' to
point to a PMD page and then call a PTE-specific freeing function for
it.  That's a bit wonky.

Just use a 'struct ptdesc *' instead. Its entire purpose is to refer
to page table pages. It also means being able to remove an explicit
cast.

Right now, pte_free_kernel() is a one-liner that calls
pagetable_dtor_free(). Effectively, all this patch does is
remove one superfluous __pa(__va(paddr)) conversion and then
call pagetable_dtor_free() directly instead of through a helper.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
---
 arch/x86/mm/pgtable.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index ddf248c3ee7d..2e5ecfdce73c 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -729,7 +729,7 @@ int pmd_clear_huge(pmd_t *pmd)
 int pud_free_pmd_page(pud_t *pud, unsigned long addr)
 {
 	pmd_t *pmd, *pmd_sv;
-	pte_t *pte;
+	struct ptdesc *pt;
 	int i;
 
 	pmd = pud_pgtable(*pud);
@@ -750,8 +750,8 @@ int pud_free_pmd_page(pud_t *pud, unsigned long addr)
 
 	for (i = 0; i < PTRS_PER_PMD; i++) {
 		if (!pmd_none(pmd_sv[i])) {
-			pte = (pte_t *)pmd_page_vaddr(pmd_sv[i]);
-			pte_free_kernel(&init_mm, pte);
+			pt = page_ptdesc(pmd_page(pmd_sv[i]));
+			pagetable_dtor_free(pt);
 		}
 	}
 
@@ -772,15 +772,15 @@ int pud_free_pmd_page(pud_t *pud, unsigned long addr)
  */
 int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
 {
-	pte_t *pte;
+	struct ptdesc *pt;
 
-	pte = (pte_t *)pmd_page_vaddr(*pmd);
+	pt = page_ptdesc(pmd_page(*pmd));
 	pmd_clear(pmd);
 
 	/* INVLPG to clear all paging-structure caches */
 	flush_tlb_kernel_range(addr, addr + PAGE_SIZE-1);
 
-	pte_free_kernel(&init_mm, pte);
+	pagetable_dtor_free(pt);
 
 	return 1;
 }
-- 
2.43.0



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v7 4/8] x86/mm: Use 'ptdesc' when freeing PMD pages
  2025-10-22  8:26 ` [PATCH v7 4/8] x86/mm: Use 'ptdesc' when freeing PMD pages Lu Baolu
@ 2025-10-22 18:31   ` David Hildenbrand
  0 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2025-10-22 18:31 UTC (permalink / raw)
  To: Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
	Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, Alistair Popple,
	Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
	Andy Lutomirski, Yi Lai, Lorenzo Stoakes, Liam R . Howlett,
	Andrew Morton, Vlastimil Babka, Mike Rapoport, Michal Hocko,
	Matthew Wilcox, Vinicius Costa Gomes
  Cc: iommu, security, x86, linux-mm, linux-kernel, Dave Hansen

On 22.10.25 10:26, Lu Baolu wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> There are a billion ways to refer to a physical memory address.
> One of the x86 PMD freeing code location chooses to use a 'pte_t *' to
> point to a PMD page and then call a PTE-specific freeing function for
> it.  That's a bit wonky.
> 
> Just use a 'struct ptdesc *' instead. Its entire purpose is to refer
> to page table pages. It also means being able to remove an explicit
> cast.
> 
> Right now, pte_free_kernel() is a one-liner that calls
> pagetable_dtor_free(). Effectively, all this patch does is
> remove one superfluous __pa(__va(paddr)) conversion and then
> call pagetable_dtor_free() directly instead of through a helper.
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> ---

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v7 5/8] mm: Introduce pure page table freeing function
  2025-10-22  8:26 [PATCH v7 0/8] Fix stale IOTLB entries for kernel address space Lu Baolu
                   ` (3 preceding siblings ...)
  2025-10-22  8:26 ` [PATCH v7 4/8] x86/mm: Use 'ptdesc' when freeing PMD pages Lu Baolu
@ 2025-10-22  8:26 ` Lu Baolu
  2025-10-22  8:26 ` [PATCH v7 6/8] x86/mm: Use pagetable_free() Lu Baolu
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 20+ messages in thread
From: Lu Baolu @ 2025-10-22  8:26 UTC (permalink / raw)
  To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
	Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, Alistair Popple,
	Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
	Andy Lutomirski, Yi Lai, David Hildenbrand, Lorenzo Stoakes,
	Liam R . Howlett, Andrew Morton, Vlastimil Babka, Mike Rapoport,
	Michal Hocko, Matthew Wilcox, Vinicius Costa Gomes
  Cc: iommu, security, x86, linux-mm, linux-kernel, Dave Hansen, Lu Baolu

From: Dave Hansen <dave.hansen@linux.intel.com>

The pages used for ptdescs are currently freed back to the allocator
in a single location. They will shortly be freed from a second
location.

Create a simple helper that just frees them back to the allocator.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 include/linux/mm.h | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index cca5946a9771..52ae551d0eb4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3024,6 +3024,13 @@ static inline struct ptdesc *pagetable_alloc_noprof(gfp_t gfp, unsigned int orde
 }
 #define pagetable_alloc(...)	alloc_hooks(pagetable_alloc_noprof(__VA_ARGS__))
 
+static inline void __pagetable_free(struct ptdesc *pt)
+{
+	struct page *page = ptdesc_page(pt);
+
+	__free_pages(page, compound_order(page));
+}
+
 /**
  * pagetable_free - Free pagetables
  * @pt:	The page table descriptor
@@ -3033,12 +3040,10 @@ static inline struct ptdesc *pagetable_alloc_noprof(gfp_t gfp, unsigned int orde
  */
 static inline void pagetable_free(struct ptdesc *pt)
 {
-	struct page *page = ptdesc_page(pt);
-
 	if (ptdesc_test_kernel(pt))
 		ptdesc_clear_kernel(pt);
 
-	__free_pages(page, compound_order(page));
+	__pagetable_free(pt);
 }
 
 #if defined(CONFIG_SPLIT_PTE_PTLOCKS)
-- 
2.43.0



^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v7 6/8] x86/mm: Use pagetable_free()
  2025-10-22  8:26 [PATCH v7 0/8] Fix stale IOTLB entries for kernel address space Lu Baolu
                   ` (4 preceding siblings ...)
  2025-10-22  8:26 ` [PATCH v7 5/8] mm: Introduce pure page table freeing function Lu Baolu
@ 2025-10-22  8:26 ` Lu Baolu
  2025-11-18  2:14   ` Vishal Moola (Oracle)
  2025-10-22  8:26 ` [PATCH v7 7/8] mm: Introduce deferred freeing for kernel page tables Lu Baolu
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 20+ messages in thread
From: Lu Baolu @ 2025-10-22  8:26 UTC (permalink / raw)
  To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
	Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, Alistair Popple,
	Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
	Andy Lutomirski, Yi Lai, David Hildenbrand, Lorenzo Stoakes,
	Liam R . Howlett, Andrew Morton, Vlastimil Babka, Mike Rapoport,
	Michal Hocko, Matthew Wilcox, Vinicius Costa Gomes
  Cc: iommu, security, x86, linux-mm, linux-kernel, Lu Baolu

The kernel's memory management subsystem provides a dedicated interface,
pagetable_free(), for freeing page table pages. Updates two call sites to
use pagetable_free() instead of the lower-level __free_page() or
free_pages(). This improves code consistency and clarity, and ensures the
correct freeing mechanism is used.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 arch/x86/mm/init_64.c        | 2 +-
 arch/x86/mm/pat/set_memory.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 0e4270e20fad..3d9a5e4ccaa4 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1031,7 +1031,7 @@ static void __meminit free_pagetable(struct page *page, int order)
 		free_reserved_pages(page, nr_pages);
 #endif
 	} else {
-		__free_pages(page, order);
+		pagetable_free(page_ptdesc(page));
 	}
 }
 
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 970981893c9b..fffb6ef1997d 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -429,7 +429,7 @@ static void cpa_collapse_large_pages(struct cpa_data *cpa)
 
 	list_for_each_entry_safe(ptdesc, tmp, &pgtables, pt_list) {
 		list_del(&ptdesc->pt_list);
-		__free_page(ptdesc_page(ptdesc));
+		pagetable_free(ptdesc);
 	}
 }
 
-- 
2.43.0



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v7 6/8] x86/mm: Use pagetable_free()
  2025-10-22  8:26 ` [PATCH v7 6/8] x86/mm: Use pagetable_free() Lu Baolu
@ 2025-11-18  2:14   ` Vishal Moola (Oracle)
  2025-11-20 10:35     ` Mike Rapoport
  0 siblings, 1 reply; 20+ messages in thread
From: Vishal Moola (Oracle) @ 2025-11-18  2:14 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
	Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, Alistair Popple,
	Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
	Andy Lutomirski, Yi Lai, David Hildenbrand, Lorenzo Stoakes,
	Liam R . Howlett, Andrew Morton, Vlastimil Babka, Mike Rapoport,
	Michal Hocko, Matthew Wilcox, Vinicius Costa Gomes, iommu,
	security, x86, linux-mm, linux-kernel

On Wed, Oct 22, 2025 at 04:26:32PM +0800, Lu Baolu wrote:
> The kernel's memory management subsystem provides a dedicated interface,
> pagetable_free(), for freeing page table pages. Updates two call sites to
> use pagetable_free() instead of the lower-level __free_page() or
> free_pages(). This improves code consistency and clarity, and ensures the
> correct freeing mechanism is used.

In doing these ptdesc calls here, we're running into issues with the
concurrent work around ptdescs: Allocating frozen page tables[1] and
separately allocating ptdesc[2].

What we're seeing is attempts to cast a page that has still been
allocated by the regular page allocator to a ptdesc - which won't work
anymore.

My hunch is we want alot of the code in pat/set_memory.c to be using ptdescs
aka page table descriptors. At least all the allocations/frees for now.
Does that seem right? I'm not really familiar with this code though...

[1] https://lore.kernel.org/linux-mm/202511172257.ffd96dab-lkp@intel.com/T/#mf68f9c13f4b188eac08ae261c0172afe81a75827
[2] https://lore.kernel.org/linux-mm/20251020001652.2116669-1-willy@infradead.org/T/#md72f66473e017d6f3ce277405ad115e71898f418

> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Acked-by: David Hildenbrand <david@redhat.com>
> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
>  arch/x86/mm/init_64.c        | 2 +-
>  arch/x86/mm/pat/set_memory.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 0e4270e20fad..3d9a5e4ccaa4 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -1031,7 +1031,7 @@ static void __meminit free_pagetable(struct page *page, int order)
>  		free_reserved_pages(page, nr_pages);
>  #endif
>  	} else {
> -		__free_pages(page, order);
> +		pagetable_free(page_ptdesc(page));
>  	}
>  }
>  
> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index 970981893c9b..fffb6ef1997d 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -429,7 +429,7 @@ static void cpa_collapse_large_pages(struct cpa_data *cpa)
>  
>  	list_for_each_entry_safe(ptdesc, tmp, &pgtables, pt_list) {
>  		list_del(&ptdesc->pt_list);
> -		__free_page(ptdesc_page(ptdesc));
> +		pagetable_free(ptdesc);
>  	}
>  }
>  
> -- 
> 2.43.0
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v7 6/8] x86/mm: Use pagetable_free()
  2025-11-18  2:14   ` Vishal Moola (Oracle)
@ 2025-11-20 10:35     ` Mike Rapoport
  0 siblings, 0 replies; 20+ messages in thread
From: Mike Rapoport @ 2025-11-20 10:35 UTC (permalink / raw)
  To: Vishal Moola (Oracle)
  Cc: Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
	Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, Alistair Popple,
	Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
	Andy Lutomirski, Yi Lai, David Hildenbrand, Lorenzo Stoakes,
	Liam R . Howlett, Andrew Morton, Vlastimil Babka, Michal Hocko,
	Matthew Wilcox, Vinicius Costa Gomes, iommu, security, x86,
	linux-mm, linux-kernel

On Mon, Nov 17, 2025 at 06:14:22PM -0800, Vishal Moola (Oracle) wrote:
> On Wed, Oct 22, 2025 at 04:26:32PM +0800, Lu Baolu wrote:
> > The kernel's memory management subsystem provides a dedicated interface,
> > pagetable_free(), for freeing page table pages. Updates two call sites to
> > use pagetable_free() instead of the lower-level __free_page() or
> > free_pages(). This improves code consistency and clarity, and ensures the
> > correct freeing mechanism is used.
> 
> In doing these ptdesc calls here, we're running into issues with the
> concurrent work around ptdescs: Allocating frozen page tables[1] and
> separately allocating ptdesc[2].
> 
> What we're seeing is attempts to cast a page that has still been
> allocated by the regular page allocator to a ptdesc - which won't work
> anymore.
> 
> My hunch is we want alot of the code in pat/set_memory.c to be using ptdescs
> aka page table descriptors. At least all the allocations/frees for now.
> Does that seem right? I'm not really familiar with this code though...

Yeah, that sounds about right. Allocations in x86::set_memory should use
pXd_alloc_one_kernel()
 
> [1] https://lore.kernel.org/linux-mm/202511172257.ffd96dab-lkp@intel.com/T/#mf68f9c13f4b188eac08ae261c0172afe81a75827
> [2] https://lore.kernel.org/linux-mm/20251020001652.2116669-1-willy@infradead.org/T/#md72f66473e017d6f3ce277405ad115e71898f418
> 
> > Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> > Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> > Acked-by: David Hildenbrand <david@redhat.com>
> > Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > ---
> >  arch/x86/mm/init_64.c        | 2 +-
> >  arch/x86/mm/pat/set_memory.c | 2 +-
> >  2 files changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> > index 0e4270e20fad..3d9a5e4ccaa4 100644
> > --- a/arch/x86/mm/init_64.c
> > +++ b/arch/x86/mm/init_64.c
> > @@ -1031,7 +1031,7 @@ static void __meminit free_pagetable(struct page *page, int order)
> >  		free_reserved_pages(page, nr_pages);
> >  #endif
> >  	} else {
> > -		__free_pages(page, order);
> > +		pagetable_free(page_ptdesc(page));
> >  	}
> >  }
> >  
> > diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> > index 970981893c9b..fffb6ef1997d 100644
> > --- a/arch/x86/mm/pat/set_memory.c
> > +++ b/arch/x86/mm/pat/set_memory.c
> > @@ -429,7 +429,7 @@ static void cpa_collapse_large_pages(struct cpa_data *cpa)
> >  
> >  	list_for_each_entry_safe(ptdesc, tmp, &pgtables, pt_list) {
> >  		list_del(&ptdesc->pt_list);
> > -		__free_page(ptdesc_page(ptdesc));
> > +		pagetable_free(ptdesc);
> >  	}
> >  }
> >  
> > -- 
> > 2.43.0
> > 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v7 7/8] mm: Introduce deferred freeing for kernel page tables
  2025-10-22  8:26 [PATCH v7 0/8] Fix stale IOTLB entries for kernel address space Lu Baolu
                   ` (5 preceding siblings ...)
  2025-10-22  8:26 ` [PATCH v7 6/8] x86/mm: Use pagetable_free() Lu Baolu
@ 2025-10-22  8:26 ` Lu Baolu
  2025-10-22 18:34   ` David Hildenbrand
  2025-10-23  7:10   ` Mike Rapoport
  2025-10-22  8:26 ` [PATCH v7 8/8] iommu/sva: Invalidate stale IOTLB entries for kernel address space Lu Baolu
  2025-10-22 19:01 ` [PATCH v7 0/8] Fix " Andrew Morton
  8 siblings, 2 replies; 20+ messages in thread
From: Lu Baolu @ 2025-10-22  8:26 UTC (permalink / raw)
  To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
	Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, Alistair Popple,
	Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
	Andy Lutomirski, Yi Lai, David Hildenbrand, Lorenzo Stoakes,
	Liam R . Howlett, Andrew Morton, Vlastimil Babka, Mike Rapoport,
	Michal Hocko, Matthew Wilcox, Vinicius Costa Gomes
  Cc: iommu, security, x86, linux-mm, linux-kernel, Dave Hansen, Lu Baolu

From: Dave Hansen <dave.hansen@linux.intel.com>

This introduces a conditional asynchronous mechanism, enabled by
CONFIG_ASYNC_KERNEL_PGTABLE_FREE. When enabled, this mechanism defers the
freeing of pages that are used as page tables for kernel address mappings.
These pages are now queued to a work struct instead of being freed
immediately.

This deferred freeing allows for batch-freeing of page tables, providing
a safe context for performing a single expensive operation (TLB flush)
for a batch of kernel page tables instead of performing that expensive
operation for each page table.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
---
 mm/Kconfig           |  3 +++
 include/linux/mm.h   | 16 +++++++++++++---
 mm/pgtable-generic.c | 37 +++++++++++++++++++++++++++++++++++++
 3 files changed, 53 insertions(+), 3 deletions(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index 0e26f4fc8717..a83df9934acd 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -908,6 +908,9 @@ config PAGE_MAPCOUNT
 config PGTABLE_HAS_HUGE_LEAVES
 	def_bool TRANSPARENT_HUGEPAGE || HUGETLB_PAGE
 
+config ASYNC_KERNEL_PGTABLE_FREE
+	def_bool n
+
 # TODO: Allow to be enabled without THP
 config ARCH_SUPPORTS_HUGE_PFNMAP
 	def_bool n
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 52ae551d0eb4..d521abd33164 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3031,6 +3031,14 @@ static inline void __pagetable_free(struct ptdesc *pt)
 	__free_pages(page, compound_order(page));
 }
 
+#ifdef CONFIG_ASYNC_KERNEL_PGTABLE_FREE
+void pagetable_free_kernel(struct ptdesc *pt);
+#else
+static inline void pagetable_free_kernel(struct ptdesc *pt)
+{
+	__pagetable_free(pt);
+}
+#endif
 /**
  * pagetable_free - Free pagetables
  * @pt:	The page table descriptor
@@ -3040,10 +3048,12 @@ static inline void __pagetable_free(struct ptdesc *pt)
  */
 static inline void pagetable_free(struct ptdesc *pt)
 {
-	if (ptdesc_test_kernel(pt))
+	if (ptdesc_test_kernel(pt)) {
 		ptdesc_clear_kernel(pt);
-
-	__pagetable_free(pt);
+		pagetable_free_kernel(pt);
+	} else {
+		__pagetable_free(pt);
+	}
 }
 
 #if defined(CONFIG_SPLIT_PTE_PTLOCKS)
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 567e2d084071..1c7caa8ef164 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -406,3 +406,40 @@ pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
 	pte_unmap_unlock(pte, ptl);
 	goto again;
 }
+
+#ifdef CONFIG_ASYNC_KERNEL_PGTABLE_FREE
+static void kernel_pgtable_work_func(struct work_struct *work);
+
+static struct {
+	struct list_head list;
+	/* protect above ptdesc lists */
+	spinlock_t lock;
+	struct work_struct work;
+} kernel_pgtable_work = {
+	.list = LIST_HEAD_INIT(kernel_pgtable_work.list),
+	.lock = __SPIN_LOCK_UNLOCKED(kernel_pgtable_work.lock),
+	.work = __WORK_INITIALIZER(kernel_pgtable_work.work, kernel_pgtable_work_func),
+};
+
+static void kernel_pgtable_work_func(struct work_struct *work)
+{
+	struct ptdesc *pt, *next;
+	LIST_HEAD(page_list);
+
+	spin_lock(&kernel_pgtable_work.lock);
+	list_splice_tail_init(&kernel_pgtable_work.list, &page_list);
+	spin_unlock(&kernel_pgtable_work.lock);
+
+	list_for_each_entry_safe(pt, next, &page_list, pt_list)
+		__pagetable_free(pt);
+}
+
+void pagetable_free_kernel(struct ptdesc *pt)
+{
+	spin_lock(&kernel_pgtable_work.lock);
+	list_add(&pt->pt_list, &kernel_pgtable_work.list);
+	spin_unlock(&kernel_pgtable_work.lock);
+
+	schedule_work(&kernel_pgtable_work.work);
+}
+#endif
-- 
2.43.0



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v7 7/8] mm: Introduce deferred freeing for kernel page tables
  2025-10-22  8:26 ` [PATCH v7 7/8] mm: Introduce deferred freeing for kernel page tables Lu Baolu
@ 2025-10-22 18:34   ` David Hildenbrand
  2025-10-22 19:12     ` Dave Hansen
  2025-10-22 19:52     ` Jason Gunthorpe
  2025-10-23  7:10   ` Mike Rapoport
  1 sibling, 2 replies; 20+ messages in thread
From: David Hildenbrand @ 2025-10-22 18:34 UTC (permalink / raw)
  To: Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
	Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, Alistair Popple,
	Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
	Andy Lutomirski, Yi Lai, Lorenzo Stoakes, Liam R . Howlett,
	Andrew Morton, Vlastimil Babka, Mike Rapoport, Michal Hocko,
	Matthew Wilcox, Vinicius Costa Gomes
  Cc: iommu, security, x86, linux-mm, linux-kernel, Dave Hansen

On 22.10.25 10:26, Lu Baolu wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> This introduces a conditional asynchronous mechanism, enabled by
> CONFIG_ASYNC_KERNEL_PGTABLE_FREE. When enabled, this mechanism defers the
> freeing of pages that are used as page tables for kernel address mappings.
> These pages are now queued to a work struct instead of being freed
> immediately.
> 
> This deferred freeing allows for batch-freeing of page tables, providing
> a safe context for performing a single expensive operation (TLB flush)
> for a batch of kernel page tables instead of performing that expensive
> operation for each page table.
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> ---
>   mm/Kconfig           |  3 +++
>   include/linux/mm.h   | 16 +++++++++++++---
>   mm/pgtable-generic.c | 37 +++++++++++++++++++++++++++++++++++++
>   3 files changed, 53 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 0e26f4fc8717..a83df9934acd 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -908,6 +908,9 @@ config PAGE_MAPCOUNT
>   config PGTABLE_HAS_HUGE_LEAVES
>   	def_bool TRANSPARENT_HUGEPAGE || HUGETLB_PAGE
>   
> +config ASYNC_KERNEL_PGTABLE_FREE
> +	def_bool n
> +
>   # TODO: Allow to be enabled without THP
>   config ARCH_SUPPORTS_HUGE_PFNMAP
>   	def_bool n
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 52ae551d0eb4..d521abd33164 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3031,6 +3031,14 @@ static inline void __pagetable_free(struct ptdesc *pt)
>   	__free_pages(page, compound_order(page));
>   }
>   
> +#ifdef CONFIG_ASYNC_KERNEL_PGTABLE_FREE
> +void pagetable_free_kernel(struct ptdesc *pt);
> +#else
> +static inline void pagetable_free_kernel(struct ptdesc *pt)
> +{
> +	__pagetable_free(pt);
> +}
> +#endif
>   /**
>    * pagetable_free - Free pagetables
>    * @pt:	The page table descriptor
> @@ -3040,10 +3048,12 @@ static inline void __pagetable_free(struct ptdesc *pt)
>    */
>   static inline void pagetable_free(struct ptdesc *pt)
>   {
> -	if (ptdesc_test_kernel(pt))
> +	if (ptdesc_test_kernel(pt)) {
>   		ptdesc_clear_kernel(pt);
> -
> -	__pagetable_free(pt);
> +		pagetable_free_kernel(pt);
> +	} else {
> +		__pagetable_free(pt);
> +	}
>   }
>   
>   #if defined(CONFIG_SPLIT_PTE_PTLOCKS)
> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> index 567e2d084071..1c7caa8ef164 100644
> --- a/mm/pgtable-generic.c
> +++ b/mm/pgtable-generic.c
> @@ -406,3 +406,40 @@ pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
>   	pte_unmap_unlock(pte, ptl);
>   	goto again;
>   }
> +
> +#ifdef CONFIG_ASYNC_KERNEL_PGTABLE_FREE
> +static void kernel_pgtable_work_func(struct work_struct *work);
> +
> +static struct {
> +	struct list_head list;
> +	/* protect above ptdesc lists */
> +	spinlock_t lock;
> +	struct work_struct work;
> +} kernel_pgtable_work = {
> +	.list = LIST_HEAD_INIT(kernel_pgtable_work.list),
> +	.lock = __SPIN_LOCK_UNLOCKED(kernel_pgtable_work.lock),
> +	.work = __WORK_INITIALIZER(kernel_pgtable_work.work, kernel_pgtable_work_func),
> +};
> +
> +static void kernel_pgtable_work_func(struct work_struct *work)
> +{
> +	struct ptdesc *pt, *next;
> +	LIST_HEAD(page_list);
> +
> +	spin_lock(&kernel_pgtable_work.lock);
> +	list_splice_tail_init(&kernel_pgtable_work.list, &page_list);
> +	spin_unlock(&kernel_pgtable_work.lock);
> +
> +	list_for_each_entry_safe(pt, next, &page_list, pt_list)
> +		__pagetable_free(pt);
> +}
> +
> +void pagetable_free_kernel(struct ptdesc *pt)
> +{
> +	spin_lock(&kernel_pgtable_work.lock);
> +	list_add(&pt->pt_list, &kernel_pgtable_work.list);
> +	spin_unlock(&kernel_pgtable_work.lock);
> +
> +	schedule_work(&kernel_pgtable_work.work);
> +}
> +#endif

Acked-by: David Hildenbrand <david@redhat.com>

I was briefly wondering whether the pages can get stuck in there 
sufficiently long that we would want to wire up the shrinker to say 
"OOM, hold your horses, we can still free something here".

But I'd assume the workqueue will get scheduled in a reasonable 
timeframe either so this is not a concern?

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v7 7/8] mm: Introduce deferred freeing for kernel page tables
  2025-10-22 18:34   ` David Hildenbrand
@ 2025-10-22 19:12     ` Dave Hansen
  2025-10-22 19:52     ` Jason Gunthorpe
  1 sibling, 0 replies; 20+ messages in thread
From: Dave Hansen @ 2025-10-22 19:12 UTC (permalink / raw)
  To: David Hildenbrand, Lu Baolu, Joerg Roedel, Will Deacon,
	Robin Murphy, Kevin Tian, Jason Gunthorpe, Jann Horn,
	Vasant Hegde, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Alistair Popple, Peter Zijlstra, Uladzislau Rezki,
	Jean-Philippe Brucker, Andy Lutomirski, Yi Lai, Lorenzo Stoakes,
	Liam R . Howlett, Andrew Morton, Vlastimil Babka, Mike Rapoport,
	Michal Hocko, Matthew Wilcox, Vinicius Costa Gomes
  Cc: iommu, security, x86, linux-mm, linux-kernel, Dave Hansen

On 10/22/25 11:34, David Hildenbrand wrote:
...
> I was briefly wondering whether the pages can get stuck in there
> sufficiently long that we would want to wire up the shrinker to say
> "OOM, hold your horses, we can still free something here".
> 
> But I'd assume the workqueue will get scheduled in a reasonable
> timeframe either so this is not a concern?

First, I can't fathom there will ever be more than a couple of pages in
there.

If there's an OOM going on, there's probably no shortage of idle time
leading up to and during the OOM as threads plow into mutexes and wait
for I/O. That's when the work will get handled even more quickly than
normal.

I suspect it'll work itself out naturally. It wouldn't be hard to toss a
counter in there for the list length and dump it at OOM, or pr_info() if
it's got more than a few pages on it.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v7 7/8] mm: Introduce deferred freeing for kernel page tables
  2025-10-22 18:34   ` David Hildenbrand
  2025-10-22 19:12     ` Dave Hansen
@ 2025-10-22 19:52     ` Jason Gunthorpe
  1 sibling, 0 replies; 20+ messages in thread
From: Jason Gunthorpe @ 2025-10-22 19:52 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
	Jann Horn, Vasant Hegde, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, Alistair Popple, Peter Zijlstra,
	Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai,
	Lorenzo Stoakes, Liam R . Howlett, Andrew Morton,
	Vlastimil Babka, Mike Rapoport, Michal Hocko, Matthew Wilcox,
	Vinicius Costa Gomes, iommu, security, x86, linux-mm,
	linux-kernel, Dave Hansen

On Wed, Oct 22, 2025 at 08:34:53PM +0200, David Hildenbrand wrote:
> On 22.10.25 10:26, Lu Baolu wrote:
> > From: Dave Hansen <dave.hansen@linux.intel.com>
> > 
> > This introduces a conditional asynchronous mechanism, enabled by
> > CONFIG_ASYNC_KERNEL_PGTABLE_FREE. When enabled, this mechanism defers the
> > freeing of pages that are used as page tables for kernel address mappings.
> > These pages are now queued to a work struct instead of being freed
> > immediately.
> > 
> > This deferred freeing allows for batch-freeing of page tables, providing
> > a safe context for performing a single expensive operation (TLB flush)
> > for a batch of kernel page tables instead of performing that expensive
> > operation for each page table.
> > 
> > Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> > Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> > Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> > Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> > ---
> >   mm/Kconfig           |  3 +++
> >   include/linux/mm.h   | 16 +++++++++++++---
> >   mm/pgtable-generic.c | 37 +++++++++++++++++++++++++++++++++++++
> >   3 files changed, 53 insertions(+), 3 deletions(-)
> > 
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 0e26f4fc8717..a83df9934acd 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -908,6 +908,9 @@ config PAGE_MAPCOUNT
> >   config PGTABLE_HAS_HUGE_LEAVES
> >   	def_bool TRANSPARENT_HUGEPAGE || HUGETLB_PAGE
> > +config ASYNC_KERNEL_PGTABLE_FREE
> > +	def_bool n
> > +
> >   # TODO: Allow to be enabled without THP
> >   config ARCH_SUPPORTS_HUGE_PFNMAP
> >   	def_bool n
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 52ae551d0eb4..d521abd33164 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -3031,6 +3031,14 @@ static inline void __pagetable_free(struct ptdesc *pt)
> >   	__free_pages(page, compound_order(page));
> >   }
> > +#ifdef CONFIG_ASYNC_KERNEL_PGTABLE_FREE
> > +void pagetable_free_kernel(struct ptdesc *pt);
> > +#else
> > +static inline void pagetable_free_kernel(struct ptdesc *pt)
> > +{
> > +	__pagetable_free(pt);
> > +}
> > +#endif
> >   /**
> >    * pagetable_free - Free pagetables
> >    * @pt:	The page table descriptor
> > @@ -3040,10 +3048,12 @@ static inline void __pagetable_free(struct ptdesc *pt)
> >    */
> >   static inline void pagetable_free(struct ptdesc *pt)
> >   {
> > -	if (ptdesc_test_kernel(pt))
> > +	if (ptdesc_test_kernel(pt)) {
> >   		ptdesc_clear_kernel(pt);
> > -
> > -	__pagetable_free(pt);
> > +		pagetable_free_kernel(pt);
> > +	} else {
> > +		__pagetable_free(pt);
> > +	}
> >   }
> >   #if defined(CONFIG_SPLIT_PTE_PTLOCKS)
> > diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> > index 567e2d084071..1c7caa8ef164 100644
> > --- a/mm/pgtable-generic.c
> > +++ b/mm/pgtable-generic.c
> > @@ -406,3 +406,40 @@ pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
> >   	pte_unmap_unlock(pte, ptl);
> >   	goto again;
> >   }
> > +
> > +#ifdef CONFIG_ASYNC_KERNEL_PGTABLE_FREE
> > +static void kernel_pgtable_work_func(struct work_struct *work);
> > +
> > +static struct {
> > +	struct list_head list;
> > +	/* protect above ptdesc lists */
> > +	spinlock_t lock;
> > +	struct work_struct work;
> > +} kernel_pgtable_work = {
> > +	.list = LIST_HEAD_INIT(kernel_pgtable_work.list),
> > +	.lock = __SPIN_LOCK_UNLOCKED(kernel_pgtable_work.lock),
> > +	.work = __WORK_INITIALIZER(kernel_pgtable_work.work, kernel_pgtable_work_func),
> > +};
> > +
> > +static void kernel_pgtable_work_func(struct work_struct *work)
> > +{
> > +	struct ptdesc *pt, *next;
> > +	LIST_HEAD(page_list);
> > +
> > +	spin_lock(&kernel_pgtable_work.lock);
> > +	list_splice_tail_init(&kernel_pgtable_work.list, &page_list);
> > +	spin_unlock(&kernel_pgtable_work.lock);
> > +
> > +	list_for_each_entry_safe(pt, next, &page_list, pt_list)
> > +		__pagetable_free(pt);
> > +}
> > +
> > +void pagetable_free_kernel(struct ptdesc *pt)
> > +{
> > +	spin_lock(&kernel_pgtable_work.lock);
> > +	list_add(&pt->pt_list, &kernel_pgtable_work.list);
> > +	spin_unlock(&kernel_pgtable_work.lock);
> > +
> > +	schedule_work(&kernel_pgtable_work.work);
> > +}
> > +#endif
> 
> Acked-by: David Hildenbrand <david@redhat.com>
> 
> I was briefly wondering whether the pages can get stuck in there
> sufficiently long that we would want to wire up the shrinker to say "OOM,
> hold your horses, we can still free something here".
> 
> But I'd assume the workqueue will get scheduled in a reasonable timeframe
> either so this is not a concern?

Maybe it should have this set then:

``WQ_MEM_RECLAIM``
  All wq which might be used in the memory reclaim paths **MUST**
  have this flag set.  The wq is guaranteed to have at least one
  execution context regardless of memory pressure.

So it can't get locked up and will eventually run and free.

Jason


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v7 7/8] mm: Introduce deferred freeing for kernel page tables
  2025-10-22  8:26 ` [PATCH v7 7/8] mm: Introduce deferred freeing for kernel page tables Lu Baolu
  2025-10-22 18:34   ` David Hildenbrand
@ 2025-10-23  7:10   ` Mike Rapoport
  1 sibling, 0 replies; 20+ messages in thread
From: Mike Rapoport @ 2025-10-23  7:10 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
	Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, Alistair Popple,
	Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
	Andy Lutomirski, Yi Lai, David Hildenbrand, Lorenzo Stoakes,
	Liam R . Howlett, Andrew Morton, Vlastimil Babka, Michal Hocko,
	Matthew Wilcox, Vinicius Costa Gomes, iommu, security, x86,
	linux-mm, linux-kernel, Dave Hansen

On Wed, Oct 22, 2025 at 04:26:33PM +0800, Lu Baolu wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> This introduces a conditional asynchronous mechanism, enabled by
> CONFIG_ASYNC_KERNEL_PGTABLE_FREE. When enabled, this mechanism defers the
> freeing of pages that are used as page tables for kernel address mappings.
> These pages are now queued to a work struct instead of being freed
> immediately.
> 
> This deferred freeing allows for batch-freeing of page tables, providing
> a safe context for performing a single expensive operation (TLB flush)
> for a batch of kernel page tables instead of performing that expensive
> operation for each page table.
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>

Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

> ---
>  mm/Kconfig           |  3 +++
>  include/linux/mm.h   | 16 +++++++++++++---
>  mm/pgtable-generic.c | 37 +++++++++++++++++++++++++++++++++++++
>  3 files changed, 53 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 0e26f4fc8717..a83df9934acd 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -908,6 +908,9 @@ config PAGE_MAPCOUNT
>  config PGTABLE_HAS_HUGE_LEAVES
>  	def_bool TRANSPARENT_HUGEPAGE || HUGETLB_PAGE
>  
> +config ASYNC_KERNEL_PGTABLE_FREE
> +	def_bool n
> +
>  # TODO: Allow to be enabled without THP
>  config ARCH_SUPPORTS_HUGE_PFNMAP
>  	def_bool n
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 52ae551d0eb4..d521abd33164 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3031,6 +3031,14 @@ static inline void __pagetable_free(struct ptdesc *pt)
>  	__free_pages(page, compound_order(page));
>  }
>  
> +#ifdef CONFIG_ASYNC_KERNEL_PGTABLE_FREE
> +void pagetable_free_kernel(struct ptdesc *pt);
> +#else
> +static inline void pagetable_free_kernel(struct ptdesc *pt)
> +{
> +	__pagetable_free(pt);
> +}
> +#endif
>  /**
>   * pagetable_free - Free pagetables
>   * @pt:	The page table descriptor
> @@ -3040,10 +3048,12 @@ static inline void __pagetable_free(struct ptdesc *pt)
>   */
>  static inline void pagetable_free(struct ptdesc *pt)
>  {
> -	if (ptdesc_test_kernel(pt))
> +	if (ptdesc_test_kernel(pt)) {
>  		ptdesc_clear_kernel(pt);
> -
> -	__pagetable_free(pt);
> +		pagetable_free_kernel(pt);
> +	} else {
> +		__pagetable_free(pt);
> +	}
>  }
>  
>  #if defined(CONFIG_SPLIT_PTE_PTLOCKS)
> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> index 567e2d084071..1c7caa8ef164 100644
> --- a/mm/pgtable-generic.c
> +++ b/mm/pgtable-generic.c
> @@ -406,3 +406,40 @@ pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
>  	pte_unmap_unlock(pte, ptl);
>  	goto again;
>  }
> +
> +#ifdef CONFIG_ASYNC_KERNEL_PGTABLE_FREE
> +static void kernel_pgtable_work_func(struct work_struct *work);
> +
> +static struct {
> +	struct list_head list;
> +	/* protect above ptdesc lists */
> +	spinlock_t lock;
> +	struct work_struct work;
> +} kernel_pgtable_work = {
> +	.list = LIST_HEAD_INIT(kernel_pgtable_work.list),
> +	.lock = __SPIN_LOCK_UNLOCKED(kernel_pgtable_work.lock),
> +	.work = __WORK_INITIALIZER(kernel_pgtable_work.work, kernel_pgtable_work_func),
> +};
> +
> +static void kernel_pgtable_work_func(struct work_struct *work)
> +{
> +	struct ptdesc *pt, *next;
> +	LIST_HEAD(page_list);
> +
> +	spin_lock(&kernel_pgtable_work.lock);
> +	list_splice_tail_init(&kernel_pgtable_work.list, &page_list);
> +	spin_unlock(&kernel_pgtable_work.lock);
> +
> +	list_for_each_entry_safe(pt, next, &page_list, pt_list)
> +		__pagetable_free(pt);
> +}
> +
> +void pagetable_free_kernel(struct ptdesc *pt)
> +{
> +	spin_lock(&kernel_pgtable_work.lock);
> +	list_add(&pt->pt_list, &kernel_pgtable_work.list);
> +	spin_unlock(&kernel_pgtable_work.lock);
> +
> +	schedule_work(&kernel_pgtable_work.work);
> +}
> +#endif
> -- 
> 2.43.0
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v7 8/8] iommu/sva: Invalidate stale IOTLB entries for kernel address space
  2025-10-22  8:26 [PATCH v7 0/8] Fix stale IOTLB entries for kernel address space Lu Baolu
                   ` (6 preceding siblings ...)
  2025-10-22  8:26 ` [PATCH v7 7/8] mm: Introduce deferred freeing for kernel page tables Lu Baolu
@ 2025-10-22  8:26 ` Lu Baolu
  2025-10-22 19:01 ` [PATCH v7 0/8] Fix " Andrew Morton
  8 siblings, 0 replies; 20+ messages in thread
From: Lu Baolu @ 2025-10-22  8:26 UTC (permalink / raw)
  To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
	Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, Alistair Popple,
	Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
	Andy Lutomirski, Yi Lai, David Hildenbrand, Lorenzo Stoakes,
	Liam R . Howlett, Andrew Morton, Vlastimil Babka, Mike Rapoport,
	Michal Hocko, Matthew Wilcox, Vinicius Costa Gomes
  Cc: iommu, security, x86, linux-mm, linux-kernel, Lu Baolu

Introduce a new IOMMU interface to flush IOTLB paging cache entries for
the CPU kernel address space. This interface is invoked from the
x86 architecture code that manages combined user and kernel page tables,
specifically before any kernel page table page is freed and reused.

This addresses the main issue with vfree() which is a common occurrence
and can be triggered by unprivileged users. While this resolves the
primary problem, it doesn't address some extremely rare case related to
memory unplug of memory that was present as reserved memory at boot,
which cannot be triggered by unprivileged users. The discussion can be
found at the link below.

Enable SVA on x86 architecture since the IOMMU can now receive
notification to flush the paging cache before freeing the CPU kernel
page table pages.

Suggested-by: Jann Horn <jannh@google.com>
Co-developed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/linux-iommu/04983c62-3b1d-40d4-93ae-34ca04b827e5@intel.com/
---
 arch/x86/Kconfig          |  1 +
 include/linux/iommu.h     |  4 ++++
 drivers/iommu/iommu-sva.c | 32 ++++++++++++++++++++++++++++----
 mm/pgtable-generic.c      |  2 ++
 4 files changed, 35 insertions(+), 4 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index fa3b616af03a..a3700766a8c0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -279,6 +279,7 @@ config X86
 	select HAVE_PCI
 	select HAVE_PERF_REGS
 	select HAVE_PERF_USER_STACK_DUMP
+	select ASYNC_KERNEL_PGTABLE_FREE	if IOMMU_SVA
 	select MMU_GATHER_RCU_TABLE_FREE
 	select MMU_GATHER_MERGE_VMAS
 	select HAVE_POSIX_CPU_TIMERS_TASK_WORK
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index c30d12e16473..66e4abb2df0d 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -1134,7 +1134,9 @@ struct iommu_sva {
 
 struct iommu_mm_data {
 	u32			pasid;
+	struct mm_struct	*mm;
 	struct list_head	sva_domains;
+	struct list_head	mm_list_elm;
 };
 
 int iommu_fwspec_init(struct device *dev, struct fwnode_handle *iommu_fwnode);
@@ -1615,6 +1617,7 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev,
 					struct mm_struct *mm);
 void iommu_sva_unbind_device(struct iommu_sva *handle);
 u32 iommu_sva_get_pasid(struct iommu_sva *handle);
+void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end);
 #else
 static inline struct iommu_sva *
 iommu_sva_bind_device(struct device *dev, struct mm_struct *mm)
@@ -1639,6 +1642,7 @@ static inline u32 mm_get_enqcmd_pasid(struct mm_struct *mm)
 }
 
 static inline void mm_pasid_drop(struct mm_struct *mm) {}
+static inline void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end) {}
 #endif /* CONFIG_IOMMU_SVA */
 
 #ifdef CONFIG_IOMMU_IOPF
diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c
index a0442faad952..d236aef80a8d 100644
--- a/drivers/iommu/iommu-sva.c
+++ b/drivers/iommu/iommu-sva.c
@@ -10,6 +10,8 @@
 #include "iommu-priv.h"
 
 static DEFINE_MUTEX(iommu_sva_lock);
+static bool iommu_sva_present;
+static LIST_HEAD(iommu_sva_mms);
 static struct iommu_domain *iommu_sva_domain_alloc(struct device *dev,
 						   struct mm_struct *mm);
 
@@ -42,6 +44,7 @@ static struct iommu_mm_data *iommu_alloc_mm_data(struct mm_struct *mm, struct de
 		return ERR_PTR(-ENOSPC);
 	}
 	iommu_mm->pasid = pasid;
+	iommu_mm->mm = mm;
 	INIT_LIST_HEAD(&iommu_mm->sva_domains);
 	/*
 	 * Make sure the write to mm->iommu_mm is not reordered in front of
@@ -77,9 +80,6 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev, struct mm_struct *mm
 	if (!group)
 		return ERR_PTR(-ENODEV);
 
-	if (IS_ENABLED(CONFIG_X86))
-		return ERR_PTR(-EOPNOTSUPP);
-
 	mutex_lock(&iommu_sva_lock);
 
 	/* Allocate mm->pasid if necessary. */
@@ -135,8 +135,13 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev, struct mm_struct *mm
 	if (ret)
 		goto out_free_domain;
 	domain->users = 1;
-	list_add(&domain->next, &mm->iommu_mm->sva_domains);
 
+	if (list_empty(&iommu_mm->sva_domains)) {
+		if (list_empty(&iommu_sva_mms))
+			iommu_sva_present = true;
+		list_add(&iommu_mm->mm_list_elm, &iommu_sva_mms);
+	}
+	list_add(&domain->next, &iommu_mm->sva_domains);
 out:
 	refcount_set(&handle->users, 1);
 	mutex_unlock(&iommu_sva_lock);
@@ -178,6 +183,13 @@ void iommu_sva_unbind_device(struct iommu_sva *handle)
 		list_del(&domain->next);
 		iommu_domain_free(domain);
 	}
+
+	if (list_empty(&iommu_mm->sva_domains)) {
+		list_del(&iommu_mm->mm_list_elm);
+		if (list_empty(&iommu_sva_mms))
+			iommu_sva_present = false;
+	}
+
 	mutex_unlock(&iommu_sva_lock);
 	kfree(handle);
 }
@@ -315,3 +327,15 @@ static struct iommu_domain *iommu_sva_domain_alloc(struct device *dev,
 
 	return domain;
 }
+
+void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end)
+{
+	struct iommu_mm_data *iommu_mm;
+
+	guard(mutex)(&iommu_sva_lock);
+	if (!iommu_sva_present)
+		return;
+
+	list_for_each_entry(iommu_mm, &iommu_sva_mms, mm_list_elm)
+		mmu_notifier_arch_invalidate_secondary_tlbs(iommu_mm->mm, start, end);
+}
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 1c7caa8ef164..8c22be79b734 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -13,6 +13,7 @@
 #include <linux/swap.h>
 #include <linux/swapops.h>
 #include <linux/mm_inline.h>
+#include <linux/iommu.h>
 #include <asm/pgalloc.h>
 #include <asm/tlb.h>
 
@@ -430,6 +431,7 @@ static void kernel_pgtable_work_func(struct work_struct *work)
 	list_splice_tail_init(&kernel_pgtable_work.list, &page_list);
 	spin_unlock(&kernel_pgtable_work.lock);
 
+	iommu_sva_invalidate_kva_range(PAGE_OFFSET, TLB_FLUSH_ALL);
 	list_for_each_entry_safe(pt, next, &page_list, pt_list)
 		__pagetable_free(pt);
 }
-- 
2.43.0



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v7 0/8] Fix stale IOTLB entries for kernel address space
  2025-10-22  8:26 [PATCH v7 0/8] Fix stale IOTLB entries for kernel address space Lu Baolu
                   ` (7 preceding siblings ...)
  2025-10-22  8:26 ` [PATCH v7 8/8] iommu/sva: Invalidate stale IOTLB entries for kernel address space Lu Baolu
@ 2025-10-22 19:01 ` Andrew Morton
  8 siblings, 0 replies; 20+ messages in thread
From: Andrew Morton @ 2025-10-22 19:01 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
	Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, Alistair Popple,
	Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
	Andy Lutomirski, Yi Lai, David Hildenbrand, Lorenzo Stoakes,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Michal Hocko,
	Matthew Wilcox, Vinicius Costa Gomes, iommu, security, x86,
	linux-mm, linux-kernel

On Wed, 22 Oct 2025 16:26:26 +0800 Lu Baolu <baolu.lu@linux.intel.com> wrote:

> This proposes a fix for a security vulnerability related to IOMMU Shared
> Virtual Addressing (SVA). In an SVA context, an IOMMU can cache kernel
> page table entries. When a kernel page table page is freed and
> reallocated for another purpose, the IOMMU might still hold stale,
> incorrect entries. This can be exploited to cause a use-after-free or
> write-after-free condition, potentially leading to privilege escalation
> or data corruption.
> 
> This solution introduces a deferred freeing mechanism for kernel page
> table pages, which provides a safe window to notify the IOMMU to
> invalidate its caches before the page is reused.

Thanks, I'll add this to mm.git for some testing.  I'll suppress the
usual email flood when doing this.

The x86 maintainers may choose to merge this series in which case I
shall drop the mm.git copy.

As presented and merged, the [1/8] (which has cc:stable) won't hit
mainline until the next merge window.  So it won't be offered to
-stable maintainers until that time.  If you believe [1/8] should be
mainlined in the 6.18-rcX timeframe then please let me know and I'll
extract that patch from the series and shall stage it separately,


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2025-11-20 10:35 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-10-22  8:26 [PATCH v7 0/8] Fix stale IOTLB entries for kernel address space Lu Baolu
2025-10-22  8:26 ` [PATCH v7 1/8] iommu: Disable SVA when CONFIG_X86 is set Lu Baolu
2025-10-22 19:50   ` Jason Gunthorpe
2025-10-22  8:26 ` [PATCH v7 2/8] mm: Add a ptdesc flag to mark kernel page tables Lu Baolu
2025-10-22 18:31   ` David Hildenbrand
2025-10-23  7:07   ` Mike Rapoport
2025-10-22  8:26 ` [PATCH v7 3/8] mm: Actually mark kernel page table pages Lu Baolu
2025-10-22  8:26 ` [PATCH v7 4/8] x86/mm: Use 'ptdesc' when freeing PMD pages Lu Baolu
2025-10-22 18:31   ` David Hildenbrand
2025-10-22  8:26 ` [PATCH v7 5/8] mm: Introduce pure page table freeing function Lu Baolu
2025-10-22  8:26 ` [PATCH v7 6/8] x86/mm: Use pagetable_free() Lu Baolu
2025-11-18  2:14   ` Vishal Moola (Oracle)
2025-11-20 10:35     ` Mike Rapoport
2025-10-22  8:26 ` [PATCH v7 7/8] mm: Introduce deferred freeing for kernel page tables Lu Baolu
2025-10-22 18:34   ` David Hildenbrand
2025-10-22 19:12     ` Dave Hansen
2025-10-22 19:52     ` Jason Gunthorpe
2025-10-23  7:10   ` Mike Rapoport
2025-10-22  8:26 ` [PATCH v7 8/8] iommu/sva: Invalidate stale IOTLB entries for kernel address space Lu Baolu
2025-10-22 19:01 ` [PATCH v7 0/8] Fix " Andrew Morton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox