linux-mm.kvack.org archive mirror
* [PATCH V1 6.1.y 0/2] Fix bad pmd due to race between change_prot_numa() and THP migration
@ 2025-11-11  7:10 Harry Yoo
  2025-11-11  7:11 ` [PATCH V1 6.1.y 1/2] mm/mprotect: use long for page accountings and retval Harry Yoo
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Harry Yoo @ 2025-11-11  7:10 UTC (permalink / raw)
  To: stable
  Cc: Liam.Howlett, akpm, baohua, baolin.wang, david, dev.jain, hughd,
	jane.chu, jannh, kas, lance.yang, linux-mm, lorenzo.stoakes,
	npache, pfalcato, ryan.roberts, vbabka, ziy, Harry Yoo

# TL;DR

previous discussion: https://lore.kernel.org/linux-mm/b41ea29e-6b48-4f64-859c-73be095453ae@redhat.com/

A "bad pmd" error occurs due to a race condition between
change_prot_numa() and THP migration. The mainline kernel does not have
this bug, as commit 670ddd8cdcbd fixes the race condition. The 6.1.y,
5.15.y, 5.10.y, and 5.4.y stable trees are affected by this bug.

Fixing this in -stable kernels is tricky because pte_offset_map_lock()
has different semantics in pre-6.5 and post-6.5 kernels. I am trying to
backport the same mechanism we have in the mainline kernel.
Since the code looks a bit different due to the different semantics of
pte_offset_map_lock(), it'd be best to get this reviewed by MM folks.

# Testing

I verified that the bug described below is no longer reproduced
(on a downstream kernel) after applying this patch series. It used to
trigger within a few days of intensive NUMA balancing testing, but the
kernel survived two weeks with this series applied.

# Bug Description

It was reported that a bad pmd is seen when automatic NUMA
balancing is marking page table entries as prot_numa:
    
  [2437548.196018] mm/pgtable-generic.c:50: bad pmd 00000000af22fc02(dffffffe71fbfe02)
  [2437548.235022] Call Trace:
  [2437548.238234]  <TASK>
  [2437548.241060]  dump_stack_lvl+0x46/0x61
  [2437548.245689]  panic+0x106/0x2e5
  [2437548.249497]  pmd_clear_bad+0x3c/0x3c
  [2437548.253967]  change_pmd_range.isra.0+0x34d/0x3a7
  [2437548.259537]  change_p4d_range+0x156/0x20e
  [2437548.264392]  change_protection_range+0x116/0x1a9
  [2437548.269976]  change_prot_numa+0x15/0x37
  [2437548.274774]  task_numa_work+0x1b8/0x302
  [2437548.279512]  task_work_run+0x62/0x95
  [2437548.283882]  exit_to_user_mode_loop+0x1a4/0x1a9
  [2437548.289277]  exit_to_user_mode_prepare+0xf4/0xfc
  [2437548.294751]  ? sysvec_apic_timer_interrupt+0x34/0x81
  [2437548.300677]  irqentry_exit_to_user_mode+0x5/0x25
  [2437548.306153]  asm_sysvec_apic_timer_interrupt+0x16/0x1b

This is due to a race condition between change_prot_numa() and
THP migration because the kernel doesn't check is_swap_pmd() and
pmd_trans_huge() atomically:

change_prot_numa()                      THP migration
======================================================================
- change_pmd_range()
-> is_swap_pmd() returns false,
meaning it's not a PMD migration
entry.
				  - do_huge_pmd_numa_page()
				  -> migrate_misplaced_page() sets
				     migration entries for the THP.
- change_pmd_range()
-> pmd_none_or_clear_bad_unless_trans_huge()
-> pmd_none() and pmd_trans_huge() return false
- pmd_none_or_clear_bad_unless_trans_huge()
-> pmd_bad() returns true for the migration entry!

The upstream commit 670ddd8cdcbd ("mm/mprotect: delete
pmd_none_or_clear_bad_unless_trans_huge()") closes this race condition
by checking is_swap_pmd() and pmd_trans_huge() atomically.
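
For illustration, here is a condensed before/after sketch of the check in
change_pmd_range() (fragments adapted from the patch 2/2 diff below; they
are not compilable on their own, and the CONFIG_TRANSPARENT_HUGEPAGE guard
around barrier() is omitted):

  /*
   * Before the fix: *pmd is dereferenced separately for each check, so
   * THP migration can install a migration entry between the
   * is_swap_pmd() check and the pmd_bad() check inside
   * pmd_none_or_clear_bad_unless_trans_huge(), which then reports and
   * clears a "bad pmd".
   */
  if (!is_swap_pmd(*pmd) && !pmd_devmap(*pmd) &&
       pmd_none_or_clear_bad_unless_trans_huge(pmd))
          goto next;

  /*
   * After the fix: all checks operate on one snapshot taken with
   * pmd_read_atomic(), so they cannot disagree about what the pmd is.
   */
  _pmd = pmd_read_atomic(pmd);
  barrier();
  if (!is_swap_pmd(_pmd) && !pmd_devmap(_pmd) && !pmd_trans_huge(_pmd)) {
          if (pmd_none(_pmd))
                  goto next;
          if (pmd_bad(_pmd)) {
                  pmd_clear_bad(pmd);
                  goto next;
          }
  }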

# Backporting note

commit a79390f5d6a7 ("mm/mprotect: use long for page accountings and retval")
is backported so that change_pte_range() can return an error code
(a negative value).

Unlike in the mainline kernel, pte_offset_map_lock() does not check
whether the pmd entry is a migration entry or a huge page; it acquires
the PTL unconditionally instead of returning failure. Therefore, it is
necessary to keep the !is_swap_pmd() && !pmd_trans_huge() &&
!pmd_devmap() checks in change_pmd_range() before acquiring the PTL.

After acquiring the lock, open-code the semantics of the mainline
pte_offset_map_lock(): change_pte_range() fails if the pmd value has
changed. This requires adding a pmd_old parameter (the pmd_t value read
before calling the function) to change_pte_range().
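
Condensed from the patch 2/2 hunks below, the backported re-check and
retry look roughly like this (fragments only, not compilable standalone):

  /* In change_pte_range(): re-validate the pmd after taking the PTL. */
  pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
  pmd_val = pmd_read_atomic(pmd);
  /* See pmd_none_or_trans_huge_or_clear_bad() for info on the barrier. */
  barrier();
  if (!pmd_same(pmd_old, pmd_val)) {
          pte_unmap_unlock(pte, ptl);
          return -EAGAIN;
  }

  /* In change_pmd_range(): on -EAGAIN, re-read the pmd and retry. */
  ret = change_pte_range(tlb, vma, pmd, _pmd, addr, next,
                         newprot, cp_flags);
  if (ret < 0)
          goto again;
  pages += ret;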

Hugh Dickins (1):
  mm/mprotect: delete pmd_none_or_clear_bad_unless_trans_huge()

Peter Xu (1):
  mm/mprotect: use long for page accountings and retval

 include/linux/hugetlb.h |   4 +-
 include/linux/mm.h      |   2 +-
 mm/hugetlb.c            |   4 +-
 mm/mempolicy.c          |   2 +-
 mm/mprotect.c           | 125 ++++++++++++++++++----------------------
 5 files changed, 61 insertions(+), 76 deletions(-)

-- 
2.43.0



^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH V1 6.1.y 1/2] mm/mprotect: use long for page accountings and retval
  2025-11-11  7:10 [PATCH V1 6.1.y 0/2] Fix bad pmd due to race between change_prot_numa() and THP migration Harry Yoo
@ 2025-11-11  7:11 ` Harry Yoo
  2025-11-11  7:11 ` [PATCH V1 6.1.y 2/2] mm/mprotect: delete pmd_none_or_clear_bad_unless_trans_huge() Harry Yoo
  2025-11-14 11:06 ` [PATCH V1 6.1.y 0/2] Fix bad pmd due to race between change_prot_numa() and THP migration David Hildenbrand (Red Hat)
  2 siblings, 0 replies; 6+ messages in thread
From: Harry Yoo @ 2025-11-11  7:11 UTC (permalink / raw)
  To: stable
  Cc: Liam.Howlett, akpm, baohua, baolin.wang, david, dev.jain, hughd,
	jane.chu, jannh, kas, lance.yang, linux-mm, lorenzo.stoakes,
	npache, pfalcato, ryan.roberts, vbabka, ziy, Peter Xu,
	Mike Kravetz, James Houghton, Andrea Arcangeli, Axel Rasmussen,
	Muchun Song, Nadav Amit, Harry Yoo

From: Peter Xu <peterx@redhat.com>

commit a79390f5d6a78647fd70856bd42b22d994de0ba2 upstream.

Switch to use type "long" for page accountings and retval across the whole
procedure of change_protection().

The change should have shrunk the possible maximum page number to half
of what it was before (ULONG_MAX / 2), but it shouldn't overflow on any
system either, because the maximum possible number of pages touched by
change protection is ULONG_MAX / PAGE_SIZE.

Two reasons to switch from "unsigned long" to "long":

  1. It suits count_vm_numa_events() better, whose 2nd parameter takes
     a long type.

  2. It paves the way for returning negative (error) values in the future.

Currently the only caller that consumes this retval is change_prot_numa(),
where the unsigned long was converted to an int.  While at it, touch up
the numa code to also take a long, so it'll avoid any possible overflow
during the int-size conversion.

Link: https://lkml.kernel.org/r/20230104225207.1066932-3-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: James Houghton <jthoughton@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[ Adjust context ]
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
---
 include/linux/hugetlb.h |  4 ++--
 include/linux/mm.h      |  2 +-
 mm/hugetlb.c            |  4 ++--
 mm/mempolicy.c          |  2 +-
 mm/mprotect.c           | 26 +++++++++++++-------------
 5 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 26f2947c399d0..1ddc2b1f96d58 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -233,7 +233,7 @@ void hugetlb_vma_lock_release(struct kref *kref);
 
 int pmd_huge(pmd_t pmd);
 int pud_huge(pud_t pud);
-unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
+long hugetlb_change_protection(struct vm_area_struct *vma,
 		unsigned long address, unsigned long end, pgprot_t newprot,
 		unsigned long cp_flags);
 
@@ -447,7 +447,7 @@ static inline void move_hugetlb_state(struct page *oldpage,
 {
 }
 
-static inline unsigned long hugetlb_change_protection(
+static inline long hugetlb_change_protection(
 			struct vm_area_struct *vma, unsigned long address,
 			unsigned long end, pgprot_t newprot,
 			unsigned long cp_flags)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 44381ffaf34b8..f679f9007c823 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2148,7 +2148,7 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
 #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
 					    MM_CP_UFFD_WP_RESOLVE)
 
-extern unsigned long change_protection(struct mmu_gather *tlb,
+extern long change_protection(struct mmu_gather *tlb,
 			      struct vm_area_struct *vma, unsigned long start,
 			      unsigned long end, pgprot_t newprot,
 			      unsigned long cp_flags);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 77c1ac7a05910..e7bac08071dea 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6668,7 +6668,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	return i ? i : err;
 }
 
-unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
+long hugetlb_change_protection(struct vm_area_struct *vma,
 		unsigned long address, unsigned long end,
 		pgprot_t newprot, unsigned long cp_flags)
 {
@@ -6677,7 +6677,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	pte_t *ptep;
 	pte_t pte;
 	struct hstate *h = hstate_vma(vma);
-	unsigned long pages = 0, psize = huge_page_size(h);
+	long pages = 0, psize = huge_page_size(h);
 	bool shared_pmd = false;
 	struct mmu_notifier_range range;
 	unsigned long last_addr_mask;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 399d8cb488138..97106305ce21e 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -628,7 +628,7 @@ unsigned long change_prot_numa(struct vm_area_struct *vma,
 			unsigned long addr, unsigned long end)
 {
 	struct mmu_gather tlb;
-	int nr_updated;
+	long nr_updated;
 
 	tlb_gather_mmu(&tlb, vma->vm_mm);
 
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 668bfaa6ed2ae..8216f4018ee75 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -72,13 +72,13 @@ static inline bool can_change_pte_writable(struct vm_area_struct *vma,
 	return true;
 }
 
-static unsigned long change_pte_range(struct mmu_gather *tlb,
+static long change_pte_range(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
 		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
 {
 	pte_t *pte, oldpte;
 	spinlock_t *ptl;
-	unsigned long pages = 0;
+	long pages = 0;
 	int target_node = NUMA_NO_NODE;
 	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
 	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
@@ -346,13 +346,13 @@ uffd_wp_protect_file(struct vm_area_struct *vma, unsigned long cp_flags)
 		}							\
 	} while (0)
 
-static inline unsigned long change_pmd_range(struct mmu_gather *tlb,
+static inline long change_pmd_range(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, pud_t *pud, unsigned long addr,
 		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
 {
 	pmd_t *pmd;
 	unsigned long next;
-	unsigned long pages = 0;
+	long pages = 0;
 	unsigned long nr_huge_updates = 0;
 	struct mmu_notifier_range range;
 
@@ -360,7 +360,7 @@ static inline unsigned long change_pmd_range(struct mmu_gather *tlb,
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		unsigned long this_pages;
+		long this_pages;
 
 		next = pmd_addr_end(addr, end);
 
@@ -430,13 +430,13 @@ static inline unsigned long change_pmd_range(struct mmu_gather *tlb,
 	return pages;
 }
 
-static inline unsigned long change_pud_range(struct mmu_gather *tlb,
+static inline long change_pud_range(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, p4d_t *p4d, unsigned long addr,
 		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
 {
 	pud_t *pud;
 	unsigned long next;
-	unsigned long pages = 0;
+	long pages = 0;
 
 	pud = pud_offset(p4d, addr);
 	do {
@@ -451,13 +451,13 @@ static inline unsigned long change_pud_range(struct mmu_gather *tlb,
 	return pages;
 }
 
-static inline unsigned long change_p4d_range(struct mmu_gather *tlb,
+static inline long change_p4d_range(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, pgd_t *pgd, unsigned long addr,
 		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
 {
 	p4d_t *p4d;
 	unsigned long next;
-	unsigned long pages = 0;
+	long pages = 0;
 
 	p4d = p4d_offset(pgd, addr);
 	do {
@@ -472,14 +472,14 @@ static inline unsigned long change_p4d_range(struct mmu_gather *tlb,
 	return pages;
 }
 
-static unsigned long change_protection_range(struct mmu_gather *tlb,
+static long change_protection_range(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, unsigned long addr,
 		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pgd_t *pgd;
 	unsigned long next;
-	unsigned long pages = 0;
+	long pages = 0;
 
 	BUG_ON(addr >= end);
 	pgd = pgd_offset(mm, addr);
@@ -498,12 +498,12 @@ static unsigned long change_protection_range(struct mmu_gather *tlb,
 	return pages;
 }
 
-unsigned long change_protection(struct mmu_gather *tlb,
+long change_protection(struct mmu_gather *tlb,
 		       struct vm_area_struct *vma, unsigned long start,
 		       unsigned long end, pgprot_t newprot,
 		       unsigned long cp_flags)
 {
-	unsigned long pages;
+	long pages;
 
 	BUG_ON((cp_flags & MM_CP_UFFD_WP_ALL) == MM_CP_UFFD_WP_ALL);
 
-- 
2.43.0



^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH V1 6.1.y 2/2] mm/mprotect: delete pmd_none_or_clear_bad_unless_trans_huge()
  2025-11-11  7:10 [PATCH V1 6.1.y 0/2] Fix bad pmd due to race between change_prot_numa() and THP migration Harry Yoo
  2025-11-11  7:11 ` [PATCH V1 6.1.y 1/2] mm/mprotect: use long for page accountings and retval Harry Yoo
@ 2025-11-11  7:11 ` Harry Yoo
  2025-11-17 14:10   ` Sasha Levin
  2025-11-14 11:06 ` [PATCH V1 6.1.y 0/2] Fix bad pmd due to race between change_prot_numa() and THP migration David Hildenbrand (Red Hat)
  2 siblings, 1 reply; 6+ messages in thread
From: Harry Yoo @ 2025-11-11  7:11 UTC (permalink / raw)
  To: stable
  Cc: Liam.Howlett, akpm, baohua, baolin.wang, david, dev.jain, hughd,
	jane.chu, jannh, kas, lance.yang, linux-mm, lorenzo.stoakes,
	npache, pfalcato, ryan.roberts, vbabka, ziy, Alistair Popple,
	Anshuman Khandual, Axel Rasmussen, Christophe Leroy,
	Christoph Hellwig, Huang, Ying, Ira Weiny, Jason Gunthorpe,
	Kirill A . Shutemov, Lorenzo Stoakes, Matthew Wilcox, Mel Gorman,
	Miaohe Lin, Mike Kravetz, Mike Rapoport, Minchan Kim,
	Naoya Horiguchi, Pavel Tatashin, Peter Xu, Peter Zijlstra,
	Qi Zheng, Ralph Campbell, SeongJae Park, Song Liu, Steven Price,
	Suren Baghdasaryan, Thomas Hellström, Will Deacon, Yang Shi,
	Yu Zhao, Zack Rusin

From: Hugh Dickins <hughd@google.com>

commit 670ddd8cdcbd1d07a4571266ae3517f821728c3a upstream.

change_pmd_range() had special pmd_none_or_clear_bad_unless_trans_huge(),
required to avoid "bad" choices when setting automatic NUMA hinting under
mmap_read_lock(); but most of that is already covered in pte_offset_map()
now.  change_pmd_range() just wants a pmd_none() check before wasting time
on MMU notifiers, then checks on the read-once _pmd value to work out
what's needed for huge cases.  If change_pte_range() returns -EAGAIN to
retry when pte_offset_map_lock() fails, nothing more special is needed.

Link: https://lkml.kernel.org/r/725a42a9-91e9-c868-925-e3a5fd40bb4f@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zack Rusin <zackr@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[ Background: It was reported that a bad pmd is seen when automatic NUMA
  balancing is marking page table entries as prot_numa:

      [2437548.196018] mm/pgtable-generic.c:50: bad pmd 00000000af22fc02(dffffffe71fbfe02)
      [2437548.235022] Call Trace:
      [2437548.238234]  <TASK>
      [2437548.241060]  dump_stack_lvl+0x46/0x61
      [2437548.245689]  panic+0x106/0x2e5
      [2437548.249497]  pmd_clear_bad+0x3c/0x3c
      [2437548.253967]  change_pmd_range.isra.0+0x34d/0x3a7
      [2437548.259537]  change_p4d_range+0x156/0x20e
      [2437548.264392]  change_protection_range+0x116/0x1a9
      [2437548.269976]  change_prot_numa+0x15/0x37
      [2437548.274774]  task_numa_work+0x1b8/0x302
      [2437548.279512]  task_work_run+0x62/0x95
      [2437548.283882]  exit_to_user_mode_loop+0x1a4/0x1a9
      [2437548.289277]  exit_to_user_mode_prepare+0xf4/0xfc
      [2437548.294751]  ? sysvec_apic_timer_interrupt+0x34/0x81
      [2437548.300677]  irqentry_exit_to_user_mode+0x5/0x25
      [2437548.306153]  asm_sysvec_apic_timer_interrupt+0x16/0x1b

    This is due to a race condition between change_prot_numa() and
    THP migration because the kernel doesn't check is_swap_pmd() and
    pmd_trans_huge() atomically:

    change_prot_numa()                      THP migration
    ======================================================================
    - change_pmd_range()
    -> is_swap_pmd() returns false,
    meaning it's not a PMD migration
    entry.
                                      - do_huge_pmd_numa_page()
                                      -> migrate_misplaced_page() sets
                                         migration entries for the THP.
    - change_pmd_range()
    -> pmd_none_or_clear_bad_unless_trans_huge()
    -> pmd_none() and pmd_trans_huge() return false
    - pmd_none_or_clear_bad_unless_trans_huge()
    -> pmd_bad() returns true for the migration entry!

  The upstream commit 670ddd8cdcbd ("mm/mprotect: delete
  pmd_none_or_clear_bad_unless_trans_huge()") closes this race condition
  by checking is_swap_pmd() and pmd_trans_huge() atomically.

  Backporting note:
    Unlike in the mainline kernel, pte_offset_map_lock() does not check
    whether the pmd entry is a migration entry or a huge page; it acquires
    the PTL unconditionally instead of returning failure. Therefore, it is
    necessary to keep the !is_swap_pmd() && !pmd_trans_huge() &&
    !pmd_devmap() check before acquiring the PTL.

    After acquiring the lock, open-code the semantics of the mainline
    pte_offset_map_lock(): change_pte_range() fails if the pmd value has
    changed. This requires adding a pmd_old parameter (the pmd_t value
    read before calling the function) to change_pte_range(). ]
---
 mm/mprotect.c | 101 +++++++++++++++++++++-----------------------------
 1 file changed, 43 insertions(+), 58 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 8216f4018ee75..9381179ff8a95 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -73,10 +73,12 @@ static inline bool can_change_pte_writable(struct vm_area_struct *vma,
 }
 
 static long change_pte_range(struct mmu_gather *tlb,
-		struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
-		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
+		struct vm_area_struct *vma, pmd_t *pmd, pmd_t pmd_old,
+		unsigned long addr, unsigned long end, pgprot_t newprot,
+		unsigned long cp_flags)
 {
 	pte_t *pte, oldpte;
+	pmd_t pmd_val;
 	spinlock_t *ptl;
 	long pages = 0;
 	int target_node = NUMA_NO_NODE;
@@ -86,21 +88,15 @@ static long change_pte_range(struct mmu_gather *tlb,
 
 	tlb_change_page_size(tlb, PAGE_SIZE);
 
-	/*
-	 * Can be called with only the mmap_lock for reading by
-	 * prot_numa so we must check the pmd isn't constantly
-	 * changing from under us from pmd_none to pmd_trans_huge
-	 * and/or the other way around.
-	 */
-	if (pmd_trans_unstable(pmd))
-		return 0;
-
-	/*
-	 * The pmd points to a regular pte so the pmd can't change
-	 * from under us even if the mmap_lock is only hold for
-	 * reading.
-	 */
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	/* Make sure pmd didn't change after acquiring ptl */
+	pmd_val = pmd_read_atomic(pmd);
+	/* See pmd_none_or_trans_huge_or_clear_bad for info on barrier */
+	barrier();
+	if (!pmd_same(pmd_old, pmd_val)) {
+		pte_unmap_unlock(pte, ptl);
+		return -EAGAIN;
+	}
 
 	/* Get target node for single threaded private VMAs */
 	if (prot_numa && !(vma->vm_flags & VM_SHARED) &&
@@ -288,31 +284,6 @@ static long change_pte_range(struct mmu_gather *tlb,
 	return pages;
 }
 
-/*
- * Used when setting automatic NUMA hinting protection where it is
- * critical that a numa hinting PMD is not confused with a bad PMD.
- */
-static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
-{
-	pmd_t pmdval = pmd_read_atomic(pmd);
-
-	/* See pmd_none_or_trans_huge_or_clear_bad for info on barrier */
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	barrier();
-#endif
-
-	if (pmd_none(pmdval))
-		return 1;
-	if (pmd_trans_huge(pmdval))
-		return 0;
-	if (unlikely(pmd_bad(pmdval))) {
-		pmd_clear_bad(pmd);
-		return 1;
-	}
-
-	return 0;
-}
-
 /* Return true if we're uffd wr-protecting file-backed memory, or false */
 static inline bool
 uffd_wp_protect_file(struct vm_area_struct *vma, unsigned long cp_flags)
@@ -360,22 +331,34 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		long this_pages;
-
+		long ret;
+		pmd_t _pmd;
+again:
 		next = pmd_addr_end(addr, end);
+		_pmd = pmd_read_atomic(pmd);
+		/* See pmd_none_or_trans_huge_or_clear_bad for info on barrier */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+		barrier();
+#endif
 
 		change_pmd_prepare(vma, pmd, cp_flags);
 		/*
 		 * Automatic NUMA balancing walks the tables with mmap_lock
 		 * held for read. It's possible a parallel update to occur
-		 * between pmd_trans_huge() and a pmd_none_or_clear_bad()
-		 * check leading to a false positive and clearing.
-		 * Hence, it's necessary to atomically read the PMD value
-		 * for all the checks.
+		 * between pmd_trans_huge(), is_swap_pmd(), and
+		 * a pmd_none_or_clear_bad() check leading to a false positive
+		 * and clearing. Hence, it's necessary to atomically read
+		 * the PMD value for all the checks.
 		 */
-		if (!is_swap_pmd(*pmd) && !pmd_devmap(*pmd) &&
-		     pmd_none_or_clear_bad_unless_trans_huge(pmd))
-			goto next;
+		if (!is_swap_pmd(_pmd) && !pmd_devmap(_pmd) && !pmd_trans_huge(_pmd)) {
+			if (pmd_none(_pmd))
+				goto next;
+
+			if (pmd_bad(_pmd)) {
+				pmd_clear_bad(pmd);
+				goto next;
+			}
+		}
 
 		/* invoke the mmu notifier if the pmd is populated */
 		if (!range.start) {
@@ -385,7 +368,7 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 			mmu_notifier_invalidate_range_start(&range);
 		}
 
-		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
+		if (is_swap_pmd(_pmd) || pmd_trans_huge(_pmd) || pmd_devmap(_pmd)) {
 			if ((next - addr != HPAGE_PMD_SIZE) ||
 			    uffd_wp_protect_file(vma, cp_flags)) {
 				__split_huge_pmd(vma, pmd, addr, false, NULL);
@@ -400,11 +383,11 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 				 * change_huge_pmd() does not defer TLB flushes,
 				 * so no need to propagate the tlb argument.
 				 */
-				int nr_ptes = change_huge_pmd(tlb, vma, pmd,
-						addr, newprot, cp_flags);
+				ret = change_huge_pmd(tlb, vma, pmd,
+						      addr, newprot, cp_flags);
 
-				if (nr_ptes) {
-					if (nr_ptes == HPAGE_PMD_NR) {
+				if (ret) {
+					if (ret == HPAGE_PMD_NR) {
 						pages += HPAGE_PMD_NR;
 						nr_huge_updates++;
 					}
@@ -415,9 +398,11 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 			}
 			/* fall through, the trans huge pmd just split */
 		}
-		this_pages = change_pte_range(tlb, vma, pmd, addr, next,
-					      newprot, cp_flags);
-		pages += this_pages;
+		ret = change_pte_range(tlb, vma, pmd, _pmd, addr, next,
+				       newprot, cp_flags);
+		if (ret < 0)
+			goto again;
+		pages += ret;
 next:
 		cond_resched();
 	} while (pmd++, addr = next, addr != end);
-- 
2.43.0



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH V1 6.1.y 0/2] Fix bad pmd due to race between change_prot_numa() and THP migration
  2025-11-11  7:10 [PATCH V1 6.1.y 0/2] Fix bad pmd due to race between change_prot_numa() and THP migration Harry Yoo
  2025-11-11  7:11 ` [PATCH V1 6.1.y 1/2] mm/mprotect: use long for page accountings and retval Harry Yoo
  2025-11-11  7:11 ` [PATCH V1 6.1.y 2/2] mm/mprotect: delete pmd_none_or_clear_bad_unless_trans_huge() Harry Yoo
@ 2025-11-14 11:06 ` David Hildenbrand (Red Hat)
  2025-11-20  0:38   ` Harry Yoo
  2 siblings, 1 reply; 6+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-11-14 11:06 UTC (permalink / raw)
  To: Harry Yoo, stable
  Cc: Liam.Howlett, akpm, baohua, baolin.wang, dev.jain, hughd,
	jane.chu, jannh, kas, lance.yang, linux-mm, lorenzo.stoakes,
	npache, pfalcato, ryan.roberts, vbabka, ziy

On 11.11.25 08:10, Harry Yoo wrote:
> # TL;DR
> 
> previous discussion: https://lore.kernel.org/linux-mm/b41ea29e-6b48-4f64-859c-73be095453ae@redhat.com/
> 
> A "bad pmd" error occurs due to race condition between
> change_prot_numa() and THP migration. The mainline kernel does not have
> this bug as commit 670ddd8cdc fixes the race condition. 6.1.y, 5.15.y,
> 5.10.y, 5.4.y are affected by this bug.
> 
> Fixing this in -stable kernels is tricky because pte_offset_map_lock()
> has different semantics in pre-6.5 and post-6.5 kernels. I am trying to
> backport the same mechanism we have in the mainline kernel.
> Since the code looks a bit different due to the different semantics of
> pte_offset_map_lock(), it'd be best to get this reviewed by MM folks.
> 
> # Testing
> 
> I verified that the bug described below is not reproduced anymore
> (on a downstream kernel) after applying this patch series. It used to
> trigger in few days of intensive numa balancing testing, but it survived
> 2 weeks with this applied.
> 
> # Bug Description
> 
> It was reported that a bad pmd is seen when automatic NUMA
> balancing is marking page table entries as prot_numa:
>      
>    [2437548.196018] mm/pgtable-generic.c:50: bad pmd 00000000af22fc02(dffffffe71fbfe02)
>    [2437548.235022] Call Trace:
>    [2437548.238234]  <TASK>
>    [2437548.241060]  dump_stack_lvl+0x46/0x61
>    [2437548.245689]  panic+0x106/0x2e5
>    [2437548.249497]  pmd_clear_bad+0x3c/0x3c
>    [2437548.253967]  change_pmd_range.isra.0+0x34d/0x3a7
>    [2437548.259537]  change_p4d_range+0x156/0x20e
>    [2437548.264392]  change_protection_range+0x116/0x1a9
>    [2437548.269976]  change_prot_numa+0x15/0x37
>    [2437548.274774]  task_numa_work+0x1b8/0x302
>    [2437548.279512]  task_work_run+0x62/0x95
>    [2437548.283882]  exit_to_user_mode_loop+0x1a4/0x1a9
>    [2437548.289277]  exit_to_user_mode_prepare+0xf4/0xfc
>    [2437548.294751]  ? sysvec_apic_timer_interrupt+0x34/0x81
>    [2437548.300677]  irqentry_exit_to_user_mode+0x5/0x25
>    [2437548.306153]  asm_sysvec_apic_timer_interrupt+0x16/0x1b
> 
> This is due to a race condition between change_prot_numa() and
> THP migration because the kernel doesn't check is_swap_pmd() and
> pmd_trans_huge() atomically:
> 
> change_prot_numa()                      THP migration
> ======================================================================
> - change_pmd_range()
> -> is_swap_pmd() returns false,
> meaning it's not a PMD migration
> entry.
> 				  - do_huge_pmd_numa_page()
> 				  -> migrate_misplaced_page() sets
> 				     migration entries for the THP.
> - change_pmd_range()
> -> pmd_none_or_clear_bad_unless_trans_huge()
> -> pmd_none() and pmd_trans_huge() returns false
> - pmd_none_or_clear_bad_unless_trans_huge()
> -> pmd_bad() returns true for the migration entry!
> 
> The upstream commit 670ddd8cdcbd ("mm/mprotect: delete
> pmd_none_or_clear_bad_unless_trans_huge()") closes this race condition
> by checking is_swap_pmd() and pmd_trans_huge() atomically.
> 
> # Backporting note
> 
> commit a79390f5d6a7 ("mm/mprotect: use long for page accountings and retval")
> is backported to return an error code (negative value) in
> change_pte_range().
> 
> Unlike the mainline, pte_offset_map_lock() does not check if the pmd
> entry is a migration entry or a hugepage; acquires PTL unconditionally
> instead of returning failure. Therefore, it is necessary to keep the
> !is_swap_pmd() && !pmd_trans_huge() && !pmd_devmap() checks in
> change_pmd_range() before acquiring the PTL.
> 
> After acquiring the lock, open-code the semantics of
> pte_offset_map_lock() in the mainline kernel; change_pte_range() fails
> if the pmd value has changed. This requires adding pmd_old parameter
> (pmd_t value that is read before calling the function) to
> change_pte_range().

Looks reasonable to me, so I assume the backporting diff makes sense.

Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>

-- 
Cheers

David


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH V1 6.1.y 2/2] mm/mprotect: delete pmd_none_or_clear_bad_unless_trans_huge()
  2025-11-11  7:11 ` [PATCH V1 6.1.y 2/2] mm/mprotect: delete pmd_none_or_clear_bad_unless_trans_huge() Harry Yoo
@ 2025-11-17 14:10   ` Sasha Levin
  0 siblings, 0 replies; 6+ messages in thread
From: Sasha Levin @ 2025-11-17 14:10 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Liam.Howlett, akpm, baohua, baolin.wang, david, dev.jain, hughd,
	jane.chu, jannh, kas, lance.yang, linux-mm, lorenzo.stoakes,
	npache, pfalcato, ryan.roberts, vbabka, ziy, Alistair Popple,
	Anshuman Khandual, Axel Rasmussen, Christophe Leroy,
	Christoph Hellwig, Ying, Ira Weiny, Jason Gunthorpe,
	Kirill A . Shutemov, Lorenzo Stoakes, Matthew Wilcox, Mel Gorman,
	Miaohe Lin, Mike Kravetz, Mike Rapoport, Minchan Kim,
	Naoya Horiguchi, Pavel Tatashin, Peter Xu, Peter Zijlstra,
	Qi Zheng, Ralph Campbell, SeongJae Park, Song Liu, Steven Price,
	Suren Baghdasaryan, Thomas Hellström, Will Deacon, Yang Shi,
	Yu Zhao, Zack Rusin

Subject: mm/mprotect: delete pmd_none_or_clear_bad_unless_trans_huge()
Queue: 6.1

Thanks for the backport!


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH V1 6.1.y 0/2] Fix bad pmd due to race between change_prot_numa() and THP migration
  2025-11-14 11:06 ` [PATCH V1 6.1.y 0/2] Fix bad pmd due to race between change_prot_numa() and THP migration David Hildenbrand (Red Hat)
@ 2025-11-20  0:38   ` Harry Yoo
  0 siblings, 0 replies; 6+ messages in thread
From: Harry Yoo @ 2025-11-20  0:38 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat)
  Cc: stable, Liam.Howlett, akpm, baohua, baolin.wang, dev.jain, hughd,
	jane.chu, jannh, kas, lance.yang, linux-mm, lorenzo.stoakes,
	npache, pfalcato, ryan.roberts, vbabka, ziy

On Fri, Nov 14, 2025 at 12:06:57PM +0100, David Hildenbrand (Red Hat) wrote:
> On 11.11.25 08:10, Harry Yoo wrote:
> > # TL;DR
> > 
> > previous discussion: https://lore.kernel.org/linux-mm/b41ea29e-6b48-4f64-859c-73be095453ae@redhat.com/
> > 
> > A "bad pmd" error occurs due to race condition between
> > change_prot_numa() and THP migration. The mainline kernel does not have
> > this bug as commit 670ddd8cdc fixes the race condition. 6.1.y, 5.15.y,
> > 5.10.y, 5.4.y are affected by this bug.
> > 
> > Fixing this in -stable kernels is tricky because pte_offset_map_lock()
> > has different semantics in pre-6.5 and post-6.5 kernels. I am trying to
> > backport the same mechanism we have in the mainline kernel.
> > Since the code looks a bit different due to the different semantics of
> > pte_offset_map_lock(), it'd be best to get this reviewed by MM folks.
> > 
> > # Testing
> > 
> > I verified that the bug described below is not reproduced anymore
> > (on a downstream kernel) after applying this patch series. It used to
> > trigger in few days of intensive numa balancing testing, but it survived
> > 2 weeks with this applied.
> > 
> > # Bug Description
> > 
> > It was reported that a bad pmd is seen when automatic NUMA
> > balancing is marking page table entries as prot_numa:
> >    [2437548.196018] mm/pgtable-generic.c:50: bad pmd 00000000af22fc02(dffffffe71fbfe02)
> >    [2437548.235022] Call Trace:
> >    [2437548.238234]  <TASK>
> >    [2437548.241060]  dump_stack_lvl+0x46/0x61
> >    [2437548.245689]  panic+0x106/0x2e5
> >    [2437548.249497]  pmd_clear_bad+0x3c/0x3c
> >    [2437548.253967]  change_pmd_range.isra.0+0x34d/0x3a7
> >    [2437548.259537]  change_p4d_range+0x156/0x20e
> >    [2437548.264392]  change_protection_range+0x116/0x1a9
> >    [2437548.269976]  change_prot_numa+0x15/0x37
> >    [2437548.274774]  task_numa_work+0x1b8/0x302
> >    [2437548.279512]  task_work_run+0x62/0x95
> >    [2437548.283882]  exit_to_user_mode_loop+0x1a4/0x1a9
> >    [2437548.289277]  exit_to_user_mode_prepare+0xf4/0xfc
> >    [2437548.294751]  ? sysvec_apic_timer_interrupt+0x34/0x81
> >    [2437548.300677]  irqentry_exit_to_user_mode+0x5/0x25
> >    [2437548.306153]  asm_sysvec_apic_timer_interrupt+0x16/0x1b
> > 
> > This is due to a race condition between change_prot_numa() and
> > THP migration because the kernel doesn't check is_swap_pmd() and
> > pmd_trans_huge() atomically:
> > 
> > change_prot_numa()                      THP migration
> > ======================================================================
> > - change_pmd_range()
> > -> is_swap_pmd() returns false,
> > meaning it's not a PMD migration
> > entry.
> > 				  - do_huge_pmd_numa_page()
> > 				  -> migrate_misplaced_page() sets
> > 				     migration entries for the THP.
> > - change_pmd_range()
> > -> pmd_none_or_clear_bad_unless_trans_huge()
> > -> pmd_none() and pmd_trans_huge() returns false
> > - pmd_none_or_clear_bad_unless_trans_huge()
> > -> pmd_bad() returns true for the migration entry!
> > 
> > The upstream commit 670ddd8cdcbd ("mm/mprotect: delete
> > pmd_none_or_clear_bad_unless_trans_huge()") closes this race condition
> > by checking is_swap_pmd() and pmd_trans_huge() atomically.
> > 
> > # Backporting note
> > 
> > commit a79390f5d6a7 ("mm/mprotect: use long for page accountings and retval")
> > is backported to return an error code (negative value) in
> > change_pte_range().
> > 
> > Unlike the mainline, pte_offset_map_lock() does not check if the pmd
> > entry is a migration entry or a hugepage; acquires PTL unconditionally
> > instead of returning failure. Therefore, it is necessary to keep the
> > !is_swap_pmd() && !pmd_trans_huge() && !pmd_devmap() checks in
> > change_pmd_range() before acquiring the PTL.
> > 
> > After acquiring the lock, open-code the semantics of
> > pte_offset_map_lock() in the mainline kernel; change_pte_range() fails
> > if the pmd value has changed. This requires adding pmd_old parameter
> > (pmd_t value that is read before calling the function) to
> > change_pte_range().
> 
> Looks reasonable to me, so I assume the backporting diff makes sense.
> 
> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>

Hi David, just wanted to say that I really appreciate your
acknowledgement. Thanks!

While this bug doesn't appear to cause severe damage to the system
(only a "bad pmd" error printed to the console due to the race), that
wasn't really clear before closer investigation, and I think it is worth
backporting to save others' time.

I'll send fixes for v5.15, v5.10, and v5.4 that do the same thing soon.

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2025-11-20  0:38 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-11-11  7:10 [PATCH V1 6.1.y 0/2] Fix bad pmd due to race between change_prot_numa() and THP migration Harry Yoo
2025-11-11  7:11 ` [PATCH V1 6.1.y 1/2] mm/mprotect: use long for page accountings and retval Harry Yoo
2025-11-11  7:11 ` [PATCH V1 6.1.y 2/2] mm/mprotect: delete pmd_none_or_clear_bad_unless_trans_huge() Harry Yoo
2025-11-17 14:10   ` Sasha Levin
2025-11-14 11:06 ` [PATCH V1 6.1.y 0/2] Fix bad pmd due to race between change_prot_numa() and THP migration David Hildenbrand (Red Hat)
2025-11-20  0:38   ` Harry Yoo
