linux-mm.kvack.org archive mirror
* [PATCH v1 0/2] mm: move pte table reclaim code to memory.c
@ 2026-01-19 22:07 David Hildenbrand (Red Hat)
  2026-01-19 22:07 ` [PATCH v1 1/2] " David Hildenbrand (Red Hat)
  2026-01-19 22:07 ` [PATCH v1 2/2] mm/memory: handle non-split locks correctly in zap_empty_pte_table() David Hildenbrand (Red Hat)
  0 siblings, 2 replies; 7+ messages in thread
From: David Hildenbrand (Red Hat) @ 2026-01-19 22:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand (Red Hat),
	Andrew Morton, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Qi Zheng

Some cleanups for the PT table reclaim code, triggered by a false-positive
warning we might soon start to see once we unlock pt-reclaim on
architectures besides x86-64.

Cross-compiled on plenty of architectures; tested on x86-64 with
a simple test case that allocates plenty of page tables in a sparse
memory area and checks that they get reclaimed.
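
For reference, a rough userspace sketch of such a test (hypothetical, not
the exact test case used here): fault in one page per PMD range of a large
anonymous mapping to allocate the PTE tables, then zap the whole range with
MADV_DONTNEED and watch PageTables in /proc/meminfo drop:

#include <sys/mman.h>

int main(void)
{
	const unsigned long pmd_size = 2UL << 20; /* 2 MiB covered per PTE table */
	const unsigned long size = 512 * pmd_size;
	char *area;

	area = mmap(NULL, size, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (area == MAP_FAILED)
		return 1;
	/* One write per PMD range allocates one PTE table each. */
	for (unsigned long off = 0; off < size; off += pmd_size)
		area[off] = 1;
	/*
	 * Zap everything: the PTE tables are now empty and, with
	 * CONFIG_PT_RECLAIM, should get reclaimed right away (compare
	 * PageTables in /proc/meminfo before/after).
	 */
	return madvise(area, size, MADV_DONTNEED) ? 1 : 0;
}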

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Qi Zheng <qi.zheng@linux.dev>

David Hildenbrand (Red Hat) (2):
  mm: move pte table reclaim code to memory.c
  mm/memory: handle non-split locks correctly in zap_empty_pte_table()

 MAINTAINERS     |  1 -
 mm/Makefile     |  1 -
 mm/internal.h   | 18 -------------
 mm/memory.c     | 70 ++++++++++++++++++++++++++++++++++++++++++-----
 mm/pt_reclaim.c | 72 -------------------------------------------------
 5 files changed, 64 insertions(+), 98 deletions(-)
 delete mode 100644 mm/pt_reclaim.c


base-commit: ac1303686c1e823c9c88b20c5f8587629ad94a11
-- 
2.52.0




* [PATCH v1 1/2] mm: move pte table reclaim code to memory.c
  2026-01-19 22:07 [PATCH v1 0/2] mm: move pte table reclaim code to memory.c David Hildenbrand (Red Hat)
@ 2026-01-19 22:07 ` David Hildenbrand (Red Hat)
  2026-01-20  3:30   ` Qi Zheng
  2026-01-20 11:19   ` Kiryl Shutsemau
  2026-01-19 22:07 ` [PATCH v1 2/2] mm/memory: handle non-split locks correctly in zap_empty_pte_table() David Hildenbrand (Red Hat)
  1 sibling, 2 replies; 7+ messages in thread
From: David Hildenbrand (Red Hat) @ 2026-01-19 22:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand (Red Hat),
	Andrew Morton, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Qi Zheng

The pte-table reclaim code is only called from memory.c, while zapping
pages, and it had better stay that way in the long run. If we ever
have to call it from other files, we should expose proper high-level
helpers for zapping if the existing helpers are not good enough.

So, let's move the code over (it's not a lot) and clean it up a
bit by:
- Renaming the functions.
- Dropping the "Check if it is empty PTE page" comment, which is now
  self-explaining given the function name.
- Making zap_pte_table_if_empty() return whether zapping worked so the
  caller can free it.
- Adding a comment in pte_table_reclaim_possible().
- Inlining free_pte() in the last remaining user.
- In zap_empty_pte_table(), switching from pmdp_get_lockless() to
  pmdp_get(), as we are holding the PMD PT lock.

By moving the code over, compilers can also easily figure out when
zap_empty_pte_table() does not initialize the pmdval variable, avoiding
false-positive warnings about the variable possibly being used
uninitialized.
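
(For illustration, a reduced sketch of the warning pattern rather than the
actual code: with try_get_and_clear_pmd() living in a separate translation
unit, the compiler cannot correlate its return value with pmdval having
been written, so -Wmaybe-uninitialized can fire on the later use even
though that path never happens:)

	pmd_t pmdval;			/* written only by the helper */
	bool direct_reclaim = true;

	/* the zap loop clears direct_reclaim unless everything was zapped */
	...
	if (can_reclaim_pt && direct_reclaim && addr == end)
		direct_reclaim = try_get_and_clear_pmd(mm, pmd, &pmdval);
	...
	if (can_reclaim_pt && direct_reclaim)
		free_pte(mm, start, tlb, pmdval);	/* bogus warning here */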

Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>
---
 MAINTAINERS     |  1 -
 mm/Makefile     |  1 -
 mm/internal.h   | 18 -------------
 mm/memory.c     | 68 +++++++++++++++++++++++++++++++++++++++++-----
 mm/pt_reclaim.c | 72 -------------------------------------------------
 5 files changed, 62 insertions(+), 98 deletions(-)
 delete mode 100644 mm/pt_reclaim.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 11720728d92f2..28e8e28bca3e5 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16692,7 +16692,6 @@ R:	Shakeel Butt <shakeel.butt@linux.dev>
 R:	Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
 L:	linux-mm@kvack.org
 S:	Maintained
-F:	mm/pt_reclaim.c
 F:	mm/vmscan.c
 F:	mm/workingset.c
 
diff --git a/mm/Makefile b/mm/Makefile
index 0d85b10dbdde4..53ca5d4b1929b 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -146,5 +146,4 @@ obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
 obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
 obj-$(CONFIG_EXECMEM) += execmem.o
 obj-$(CONFIG_TMPFS_QUOTA) += shmem_quota.o
-obj-$(CONFIG_PT_RECLAIM) += pt_reclaim.o
 obj-$(CONFIG_LAZY_MMU_MODE_KUNIT_TEST) += tests/lazy_mmu_mode_kunit.o
diff --git a/mm/internal.h b/mm/internal.h
index 9508dbaf47cd4..ef71a1d9991f2 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1745,24 +1745,6 @@ int walk_page_range_debug(struct mm_struct *mm, unsigned long start,
 			  unsigned long end, const struct mm_walk_ops *ops,
 			  pgd_t *pgd, void *private);
 
-/* pt_reclaim.c */
-bool try_get_and_clear_pmd(struct mm_struct *mm, pmd_t *pmd, pmd_t *pmdval);
-void free_pte(struct mm_struct *mm, unsigned long addr, struct mmu_gather *tlb,
-	      pmd_t pmdval);
-void try_to_free_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
-		     struct mmu_gather *tlb);
-
-#ifdef CONFIG_PT_RECLAIM
-bool reclaim_pt_is_enabled(unsigned long start, unsigned long end,
-			   struct zap_details *details);
-#else
-static inline bool reclaim_pt_is_enabled(unsigned long start, unsigned long end,
-					 struct zap_details *details)
-{
-	return false;
-}
-#endif /* CONFIG_PT_RECLAIM */
-
 void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm);
 int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm);
 
diff --git a/mm/memory.c b/mm/memory.c
index f2e9e05388743..c3055b2577c27 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1824,11 +1824,68 @@ static inline int do_zap_pte_range(struct mmu_gather *tlb,
 	return nr;
 }
 
+static bool pte_table_reclaim_possible(unsigned long start, unsigned long end,
+		struct zap_details *details)
+{
+	if (!IS_ENABLED(CONFIG_PT_RECLAIM))
+		return false;
+	/* Only zap if we are allowed to and cover the full page table. */
+	return details && details->reclaim_pt && (end - start >= PMD_SIZE);
+}
+
+static bool zap_empty_pte_table(struct mm_struct *mm, pmd_t *pmd, pmd_t *pmdval)
+{
+	spinlock_t *pml = pmd_lockptr(mm, pmd);
+
+	if (!spin_trylock(pml))
+		return false;
+
+	*pmdval = pmdp_get(pmd);
+	pmd_clear(pmd);
+	spin_unlock(pml);
+	return true;
+}
+
+static bool zap_pte_table_if_empty(struct mm_struct *mm, pmd_t *pmd,
+		unsigned long addr, pmd_t *pmdval)
+{
+	spinlock_t *pml, *ptl = NULL;
+	pte_t *start_pte, *pte;
+	int i;
+
+	pml = pmd_lock(mm, pmd);
+	start_pte = pte_offset_map_rw_nolock(mm, pmd, addr, pmdval, &ptl);
+	if (!start_pte)
+		goto out_ptl;
+	if (ptl != pml)
+		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+
+	for (i = 0, pte = start_pte; i < PTRS_PER_PTE; i++, pte++) {
+		if (!pte_none(ptep_get(pte)))
+			goto out_ptl;
+	}
+	pte_unmap(start_pte);
+
+	pmd_clear(pmd);
+
+	if (ptl != pml)
+		spin_unlock(ptl);
+	spin_unlock(pml);
+	return true;
+out_ptl:
+	if (start_pte)
+		pte_unmap_unlock(start_pte, ptl);
+	if (ptl != pml)
+		spin_unlock(pml);
+	return false;
+}
+
 static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				struct vm_area_struct *vma, pmd_t *pmd,
 				unsigned long addr, unsigned long end,
 				struct zap_details *details)
 {
+	bool can_reclaim_pt = pte_table_reclaim_possible(addr, end, details);
 	bool force_flush = false, force_break = false;
 	struct mm_struct *mm = tlb->mm;
 	int rss[NR_MM_COUNTERS];
@@ -1837,7 +1894,6 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	pte_t *pte;
 	pmd_t pmdval;
 	unsigned long start = addr;
-	bool can_reclaim_pt = reclaim_pt_is_enabled(start, end, details);
 	bool direct_reclaim = true;
 	int nr;
 
@@ -1878,7 +1934,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	 * from being repopulated by another thread.
 	 */
 	if (can_reclaim_pt && direct_reclaim && addr == end)
-		direct_reclaim = try_get_and_clear_pmd(mm, pmd, &pmdval);
+		direct_reclaim = zap_empty_pte_table(mm, pmd, &pmdval);
 
 	add_mm_rss_vec(mm, rss);
 	lazy_mmu_mode_disable();
@@ -1907,10 +1963,10 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	}
 
 	if (can_reclaim_pt) {
-		if (direct_reclaim)
-			free_pte(mm, start, tlb, pmdval);
-		else
-			try_to_free_pte(mm, pmd, start, tlb);
+		if (direct_reclaim || zap_pte_table_if_empty(mm, pmd, start, &pmdval)) {
+			pte_free_tlb(tlb, pmd_pgtable(pmdval), addr);
+			mm_dec_nr_ptes(mm);
+		}
 	}
 
 	return addr;
diff --git a/mm/pt_reclaim.c b/mm/pt_reclaim.c
deleted file mode 100644
index 46771cfff8239..0000000000000
--- a/mm/pt_reclaim.c
+++ /dev/null
@@ -1,72 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-#include <linux/hugetlb.h>
-#include <linux/pgalloc.h>
-
-#include <asm/tlb.h>
-
-#include "internal.h"
-
-bool reclaim_pt_is_enabled(unsigned long start, unsigned long end,
-			   struct zap_details *details)
-{
-	return details && details->reclaim_pt && (end - start >= PMD_SIZE);
-}
-
-bool try_get_and_clear_pmd(struct mm_struct *mm, pmd_t *pmd, pmd_t *pmdval)
-{
-	spinlock_t *pml = pmd_lockptr(mm, pmd);
-
-	if (!spin_trylock(pml))
-		return false;
-
-	*pmdval = pmdp_get_lockless(pmd);
-	pmd_clear(pmd);
-	spin_unlock(pml);
-
-	return true;
-}
-
-void free_pte(struct mm_struct *mm, unsigned long addr, struct mmu_gather *tlb,
-	      pmd_t pmdval)
-{
-	pte_free_tlb(tlb, pmd_pgtable(pmdval), addr);
-	mm_dec_nr_ptes(mm);
-}
-
-void try_to_free_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
-		     struct mmu_gather *tlb)
-{
-	pmd_t pmdval;
-	spinlock_t *pml, *ptl = NULL;
-	pte_t *start_pte, *pte;
-	int i;
-
-	pml = pmd_lock(mm, pmd);
-	start_pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl);
-	if (!start_pte)
-		goto out_ptl;
-	if (ptl != pml)
-		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
-
-	/* Check if it is empty PTE page */
-	for (i = 0, pte = start_pte; i < PTRS_PER_PTE; i++, pte++) {
-		if (!pte_none(ptep_get(pte)))
-			goto out_ptl;
-	}
-	pte_unmap(start_pte);
-
-	pmd_clear(pmd);
-
-	if (ptl != pml)
-		spin_unlock(ptl);
-	spin_unlock(pml);
-
-	free_pte(mm, addr, tlb, pmdval);
-
-	return;
-out_ptl:
-	if (start_pte)
-		pte_unmap_unlock(start_pte, ptl);
-	if (ptl != pml)
-		spin_unlock(pml);
-}
-- 
2.52.0




* [PATCH v1 2/2] mm/memory: handle non-split locks correctly in zap_empty_pte_table()
  2026-01-19 22:07 [PATCH v1 0/2] mm: move pte table reclaim code to memory.c David Hildenbrand (Red Hat)
  2026-01-19 22:07 ` [PATCH v1 1/2] " David Hildenbrand (Red Hat)
@ 2026-01-19 22:07 ` David Hildenbrand (Red Hat)
  2026-01-20  3:32   ` Qi Zheng
  1 sibling, 1 reply; 7+ messages in thread
From: David Hildenbrand (Red Hat) @ 2026-01-19 22:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand (Red Hat),
	Andrew Morton, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Qi Zheng

While we handle pte_lockptr() == pmd_lockptr() correctly in
zap_pte_table_if_empty(), we don't handle it in zap_empty_pte_table(),
making the spin_trylock() always fail and forcing us onto the slow path.

So let's handle the scenario where pte_lockptr() == pmd_lockptr()
better, which can only happen if CONFIG_SPLIT_PTE_PTLOCKS is not set.

This is only relevant once we unlock CONFIG_PT_RECLAIM on architectures
that are not x86-64.
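
(Spelling out the failure mode; a reduced sketch under the assumption of
CONFIG_SPLIT_PTE_PTLOCKS=n, where the PTE table lock and the PMD lock are
both mm->page_table_lock:)

	spinlock_t *ptl;	/* PTE table lock, here &mm->page_table_lock */
	spinlock_t *pml = pmd_lockptr(mm, pmd);	/* also &mm->page_table_lock */

	spin_lock(ptl);		/* taken by zap_pte_range() while zapping */
	...
	if (!spin_trylock(pml))	/* trylock on a lock we already hold: fails */
		return false;	/* so we always fall back to the slow path */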

Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>
---
 mm/memory.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index c3055b2577c27..3852075ea62d4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1833,16 +1833,18 @@ static bool pte_table_reclaim_possible(unsigned long start, unsigned long end,
 	return details && details->reclaim_pt && (end - start >= PMD_SIZE);
 }
 
-static bool zap_empty_pte_table(struct mm_struct *mm, pmd_t *pmd, pmd_t *pmdval)
+static bool zap_empty_pte_table(struct mm_struct *mm, pmd_t *pmd,
+		spinlock_t *ptl, pmd_t *pmdval)
 {
 	spinlock_t *pml = pmd_lockptr(mm, pmd);
 
-	if (!spin_trylock(pml))
+	if (ptl != pml && !spin_trylock(pml))
 		return false;
 
 	*pmdval = pmdp_get(pmd);
 	pmd_clear(pmd);
-	spin_unlock(pml);
+	if (ptl != pml)
+		spin_unlock(pml);
 	return true;
 }
 
@@ -1934,7 +1936,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	 * from being repopulated by another thread.
 	 */
 	if (can_reclaim_pt && direct_reclaim && addr == end)
-		direct_reclaim = zap_empty_pte_table(mm, pmd, &pmdval);
+		direct_reclaim = zap_empty_pte_table(mm, pmd, ptl, &pmdval);
 
 	add_mm_rss_vec(mm, rss);
 	lazy_mmu_mode_disable();
-- 
2.52.0




* Re: [PATCH v1 1/2] mm: move pte table reclaim code to memory.c
  2026-01-19 22:07 ` [PATCH v1 1/2] " David Hildenbrand (Red Hat)
@ 2026-01-20  3:30   ` Qi Zheng
  2026-01-20 11:19   ` Kiryl Shutsemau
  1 sibling, 0 replies; 7+ messages in thread
From: Qi Zheng @ 2026-01-20  3:30 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat), linux-kernel
  Cc: linux-mm, Andrew Morton, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko



On 1/20/26 6:07 AM, David Hildenbrand (Red Hat) wrote:
> The pte-table reclaim code is only called from memory.c, while zapping
> pages, and it had better stay that way in the long run. If we ever
> have to call it from other files, we should expose proper high-level
> helpers for zapping if the existing helpers are not good enough.
> 
> So, let's move the code over (it's not a lot) and clean it up a
> bit by:
> - Renaming the functions.
> - Dropping the "Check if it is empty PTE page" comment, which is now
>    self-explaining given the function name.
> - Making zap_pte_table_if_empty() return whether zapping worked so the
>    caller can free it.
> - Adding a comment in pte_table_reclaim_possible().
> - Inlining free_pte() in the last remaining user.
> - In zap_empty_pte_table(), switching from pmdp_get_lockless() to
>    pmdp_get(), as we are holding the PMD PT lock.
> 
> By moving the code over, compilers can also easily figure out when
> zap_empty_pte_table() does not initialize the pmdval variable, avoiding
> false-positive warnings about the variable possibly being used
> uninitialized.
> 
> Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>
> ---
>   MAINTAINERS     |  1 -
>   mm/Makefile     |  1 -
>   mm/internal.h   | 18 -------------
>   mm/memory.c     | 68 +++++++++++++++++++++++++++++++++++++++++-----
>   mm/pt_reclaim.c | 72 -------------------------------------------------
>   5 files changed, 62 insertions(+), 98 deletions(-)
>   delete mode 100644 mm/pt_reclaim.c

Reviewed-by: Qi Zheng <zhengqi.arch@bytedance.com>

Thanks!





* Re: [PATCH v1 2/2] mm/memory: handle non-split locks correctly in zap_empty_pte_table()
  2026-01-19 22:07 ` [PATCH v1 2/2] mm/memory: handle non-split locks correctly in zap_empty_pte_table() David Hildenbrand (Red Hat)
@ 2026-01-20  3:32   ` Qi Zheng
  0 siblings, 0 replies; 7+ messages in thread
From: Qi Zheng @ 2026-01-20  3:32 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat), linux-kernel
  Cc: linux-mm, Andrew Morton, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko



On 1/20/26 6:07 AM, David Hildenbrand (Red Hat) wrote:
> While we handle pte_lockptr() == pmd_lockptr() correctly in
> zap_pte_table_if_empty(), we don't handle it in zap_empty_pte_table(),
> making the spin_trylock() always fail and forcing us onto the slow path.
> 
> So let's handle the scenario where pte_lockptr() == pmd_lockptr()
> better, which can only happen if CONFIG_SPLIT_PTE_PTLOCKS is not set.
> 
> This is only relevant once we unlock CONFIG_PT_RECLAIM on architectures
> that are not x86-64.
> 
> Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>
> ---
>   mm/memory.c | 10 ++++++----
>   1 file changed, 6 insertions(+), 4 deletions(-)

Reviewed-by: Qi Zheng <zhengqi.arch@bytedance.com>

Thanks!





* Re: [PATCH v1 1/2] mm: move pte table reclaim code to memory.c
  2026-01-19 22:07 ` [PATCH v1 1/2] " David Hildenbrand (Red Hat)
  2026-01-20  3:30   ` Qi Zheng
@ 2026-01-20 11:19   ` Kiryl Shutsemau
  2026-01-21 12:08     ` David Hildenbrand (Red Hat)
  1 sibling, 1 reply; 7+ messages in thread
From: Kiryl Shutsemau @ 2026-01-20 11:19 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat)
  Cc: linux-kernel, linux-mm, Andrew Morton, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Qi Zheng

On Mon, Jan 19, 2026 at 11:07:07PM +0100, David Hildenbrand (Red Hat) wrote:
> The pte-table reclaim code is only called from memory.c, while zapping
> pages, and it had better stay that way in the long run. If we ever
> have to call it from other files, we should expose proper high-level
> helpers for zapping if the existing helpers are not good enough.
> 
> So, let's move the code over (it's not a lot) and clean it up a
> bit by:
> - Renaming the functions.
> - Dropping the "Check if it is empty PTE page" comment, which is now
>   self-explaining given the function name.
> - Making zap_pte_table_if_empty() return whether zapping worked so the
>   caller can free it.
> - Adding a comment in pte_table_reclaim_possible().
> - Inlining free_pte() in the last remaining user.
> - In zap_empty_pte_table(), switching from pmdp_get_lockless() to
>   pmdp_get(), as we are holding the PMD PT lock.
> 
> By moving the code over, compilers can also easily figure out when
> zap_empty_pte_table() does not initialize the pmdval variable, avoiding
> false-positive warnings about the variable possibly being used
> uninitialized.

mm/memory.c is a kitchen sink as it is.

I think you are missing an opportunity here to introduce mm/zap.c and
move all the zap code there.

It can be done for code from both mm/memory.c and mm/huge_memory.c.
The line between THP and non-THP code gets more and more blurry over
time.

The same can be done for the copy and fault code. I think it is going
to be more maintainable this way.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



* Re: [PATCH v1 1/2] mm: move pte table reclaim code to memory.c
  2026-01-20 11:19   ` Kiryl Shutsemau
@ 2026-01-21 12:08     ` David Hildenbrand (Red Hat)
  0 siblings, 0 replies; 7+ messages in thread
From: David Hildenbrand (Red Hat) @ 2026-01-21 12:08 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: linux-kernel, linux-mm, Andrew Morton, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Qi Zheng

On 1/20/26 12:19, Kiryl Shutsemau wrote:
> On Mon, Jan 19, 2026 at 11:07:07PM +0100, David Hildenbrand (Red Hat) wrote:
>> The pte-table reclaim code is only called from memory.c, while zapping
>> pages, and it had better stay that way in the long run. If we ever
>> have to call it from other files, we should expose proper high-level
>> helpers for zapping if the existing helpers are not good enough.
>>
>> So, let's move the code over (it's not a lot) and clean it up a
>> bit by:
>> - Renaming the functions.
>> - Dropping the "Check if it is empty PTE page" comment, which is now
>>    self-explaining given the function name.
>> - Making zap_pte_table_if_empty() return whether zapping worked so the
>>    caller can free it.
>> - Adding a comment in pte_table_reclaim_possible().
>> - Inlining free_pte() in the last remaining user.
>> - In zap_empty_pte_table(), switching from pmdp_get_lockless() to
>>    pmdp_get(), as we are holding the PMD PT lock.
>>
>> By moving the code over, compilers can also easily figure out when
>> zap_empty_pte_table() does not initialize the pmdval variable, avoiding
>> false-positive warnings about the variable possibly being used
>> uninitialized.
> 
> mm/memory.c is a kitchen sink as it is.
> 
> I think you are missing an opportunity here to introduce mm/zap.c and
> move all the zap code there.
> 
> It can be done for code from both mm/memory.c and mm/huge_memory.c.
> The line between THP and non-THP code gets more and more blurry over
> time.
> 
> The same can be done for the copy and fault code. I think it is going
> to be more maintainable this way.

While I agree that memory.c contains too much stuff, I don't think zap.c 
is the right abstraction either. Actually, I think basic page table 
handling is well kept in memory.c, or could be moved somewhere else 
along with other stuff (fork() handling).

So I won't do any of that as part of this patch.

-- 
Cheers

David



