[DISCUSSION] Fixing bad pmd due to a race condition between change_prot_numa() and THP migration in pre-6.5 kernels.

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [DISCUSSION] Fixing bad pmd due to a race condition between change_prot_numa() and THP migration in pre-6.5 kernels.
@ 2025-09-21 23:27 Harry Yoo
  2025-09-23  9:00 ` David Hildenbrand
  0 siblings, 1 reply; 10+ messages in thread
From: Harry Yoo @ 2025-09-21 23:27 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Zi Yan,
	Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Kiryl Shutsemau, Hugh Dickins,
	Vlastimil Babka, Jann Horn, Pedro Falcato, Jane Chu
  Cc: linux-mm, stable, Harry Yoo

Hi. This is supposed to be a patch, but I think it's worth discussing
how it should be backported to -stable, so I've labeled it as [DISCUSSION].

The bug described below was unintentionally fixed in v6.5 and not
backported to -stable. So technically I would need to use "Option 3" [A],
but since the original patch [B] did not intend to fix a bug (and it's also
part of a larger patch series), it looks quite different from the patch below,
and I'm not sure what the backport should look like.

I think there are probably two options:

1. Provide the description of the original patch along with a very long,
   detailed explanation of why the patch deviates from the upstream version, or

2. Post the patch below with a clarification that it was fixed upstream
   by commit 670ddd8cdcbd1.

Any thoughts?

[A] https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html#option-3
[B] https://lkml.kernel.org/r/725a42a9-91e9-c868-925-e3a5fd40bb4f@google.com
    (Upstream commit 670ddd8cdcbd1)

In any case, no matter how we backport this, it needs some review and
feedback would be appreciated. The patch applies to v6.1 and v5.15, and
v5.10 but not v5.4.

From cf45867ab8e48b42160b7253390db7bdecef1455 Mon Sep 17 00:00:00 2001
From: Harry Yoo <harry.yoo@oracle.com>
Date: Thu, 11 Sep 2025 20:05:40 +0900
Subject: [PATCH] mm, numa: fix bad pmd by atomically checking is_swap_pmd() in
 change_prot_numa()

It was observed that a bad pmd is seen when automatic NUMA balancing
is marking page table entries as prot_numa:

[2437548.196018] mm/pgtable-generic.c:50: bad pmd 00000000af22fc02(dffffffe71fbfe02)

With some kernel modification, the call stack was dumped:

[2437548.235022] Call Trace:
[2437548.238234]  <TASK>
[2437548.241060]  dump_stack_lvl+0x46/0x61
[2437548.245689]  panic+0x106/0x2e5
[2437548.249497]  pmd_clear_bad+0x3c/0x3c
[2437548.253967]  change_pmd_range.isra.0+0x34d/0x3a7
[2437548.259537]  change_p4d_range+0x156/0x20e
[2437548.264392]  change_protection_range+0x116/0x1a9
[2437548.269976]  change_prot_numa+0x15/0x37
[2437548.274774]  task_numa_work+0x1b8/0x302
[2437548.279512]  task_work_run+0x62/0x95
[2437548.283882]  exit_to_user_mode_loop+0x1a4/0x1a9
[2437548.289277]  exit_to_user_mode_prepare+0xf4/0xfc
[2437548.294751]  ? sysvec_apic_timer_interrupt+0x34/0x81
[2437548.300677]  irqentry_exit_to_user_mode+0x5/0x25
[2437548.306153]  asm_sysvec_apic_timer_interrupt+0x16/0x1b

This is due to a race condition between change_prot_numa() and
THP migration because the kernel doesn't check is_swap_pmd() and
pmd_trans_huge() atomically:

change_prot_numa()                      THP migration
======================================================================
- change_pmd_range()
-> is_swap_pmd() returns false,
   meaning it's not a PMD migration
   entry.

				    - do_huge_pmd_numa_page()
				    -> migrate_misplaced_page() sets
				       migration entries for the THP.

- change_pmd_range()
-> pmd_none_or_clear_bad_unless_trans_huge()
-> pmd_none() and pmd_trans_huge() returns false

- pmd_none_or_clear_bad_unless_trans_huge()
-> pmd_bad() returns true for the migration entry!

For the race condition described above to occur:

1) AutoNUMA must be unmapping a range of pages, with at least part of the
range already unmapped by AutoNUMA.

2) While AutoNUMA is in the process of unmapping, a NUMA hinting fault
occurs within that range, specifically when we are about to unmap
the PMD entry, between the is_swap_pmd() and pmd_trans_huge() checks.

So this is a really rare race condition and it's observed that it takes
usually a few days of autonuma-intensive testing to trigger.

A bit of history on a similar race condition in the past:

In fact, a similar race condition caused by not checking pmd_trans_huge()
atomically was reported [1] in 2017. However, instead of the patch [1],
another patch series [3] fixed the problem [2] by not clearing the pmd
entry but invaliding it instead (so that pmd_trans_huge() would still
return true).

Despite patch series [3], the bad pmd error continued to be reported
in mainline. As a result, [1] was resurrected [4] and it landed mainline
in 2020 in a hope that it would resolve the issue. However, now it turns
out that [3] was not sufficient.

Fix this race condition by checking is_swap_pmd() and pmd_trans_huge()
atomically. With that, the kernel should see either
pmd_trans_huge() == true, or is_swap_pmd() == true when another task is
migrating the page concurrently.

This bug was introduced when THP migration support was added. More
specifically, by commit 84c3fc4e9c56 ("mm: thp: check pmd migration entry
in common path")).

It is unintentionally fixed since v6.5 by commit 670ddd8cdcbd1
("mm/mprotect: delete pmd_none_or_clear_bad_unless_trans_huge()") while
removing pmd_none_or_clear_bad_unless_trans_huge() function. But it's not
backported to -stable because it was fixed unintentionally.

Link: https://lore.kernel.org/linux-mm/20170410094825.2yfo5zehn7pchg6a@techsingularity.net [1]
Link: https://lore.kernel.org/linux-mm/8A6309F4-DB76-48FA-BE7F-BF9536A4C4E5@cs.rutgers.edu [2]
Link: https://lore.kernel.org/linux-mm/20170302151034.27829-1-kirill.shutemov@linux.intel.com [3]
Link: https://lore.kernel.org/linux-mm/20200216191800.22423-1-aquini@redhat.com [4]
Fixes: 84c3fc4e9c56 ("mm: thp: check pmd migration entry in common path")
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
---
 mm/mprotect.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 668bfaa6ed2a..c0e796c0f9b0 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -303,7 +303,7 @@ static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)

 	if (pmd_none(pmdval))
 		return 1;
-	if (pmd_trans_huge(pmdval))
+	if (is_swap_pmd(pmdval) || pmd_trans_huge(pmdval))
 		return 0;
 	if (unlikely(pmd_bad(pmdval))) {
 		pmd_clear_bad(pmd);
@@ -373,7 +373,7 @@ static inline unsigned long change_pmd_range(struct mmu_gather *tlb,
 		 * Hence, it's necessary to atomically read the PMD value
 		 * for all the checks.
 		 */
-		if (!is_swap_pmd(*pmd) && !pmd_devmap(*pmd) &&
+		if (!pmd_devmap(*pmd) &&
 		     pmd_none_or_clear_bad_unless_trans_huge(pmd))
 			goto next;

-- 
2.43.0

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [DISCUSSION] Fixing bad pmd due to a race condition between change_prot_numa() and THP migration in pre-6.5 kernels.
  2025-09-21 23:27 [DISCUSSION] Fixing bad pmd due to a race condition between change_prot_numa() and THP migration in pre-6.5 kernels Harry Yoo
@ 2025-09-23  9:00 ` David Hildenbrand
  2025-09-23 11:46   ` Harry Yoo
  0 siblings, 1 reply; 10+ messages in thread
From: David Hildenbrand @ 2025-09-23  9:00 UTC (permalink / raw)
  To: Harry Yoo, Andrew Morton, Lorenzo Stoakes, Zi Yan, Baolin Wang,
	Liam R . Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Lance Yang, Kiryl Shutsemau, Hugh Dickins, Vlastimil Babka,
	Jann Horn, Pedro Falcato, Jane Chu
  Cc: linux-mm, stable

On 22.09.25 01:27, Harry Yoo wrote:
> Hi. This is supposed to be a patch, but I think it's worth discussing
> how it should be backported to -stable, so I've labeled it as [DISCUSSION].
> 
> The bug described below was unintentionally fixed in v6.5 and not
> backported to -stable. So technically I would need to use "Option 3" [A],

What is option 3? Just to clarify: it's fine to do a backport of a 
commit even though it was not tagged as a fix.

> but since the original patch [B] did not intend to fix a bug (and it's also
> part of a larger patch series), it looks quite different from the patch below,
> and I'm not sure what the backport should look like.
> 
> I think there are probably two options:
> 
> 1. Provide the description of the original patch along with a very long,
>     detailed explanation of why the patch deviates from the upstream version, or
> 
> 2. Post the patch below with a clarification that it was fixed upstream
>     by commit 670ddd8cdcbd1.
> 
> Any thoughts?
> 
> [A] https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html#option-3
> [B] https://lkml.kernel.org/r/725a42a9-91e9-c868-925-e3a5fd40bb4f@google.com
>      (Upstream commit 670ddd8cdcbd1)
> 
> In any case, no matter how we backport this, it needs some review and
> feedback would be appreciated. The patch applies to v6.1 and v5.15, and
> v5.10 but not v5.4.
> 
>  From cf45867ab8e48b42160b7253390db7bdecef1455 Mon Sep 17 00:00:00 2001
> From: Harry Yoo <harry.yoo@oracle.com>
> Date: Thu, 11 Sep 2025 20:05:40 +0900
> Subject: [PATCH] mm, numa: fix bad pmd by atomically checking is_swap_pmd() in
>   change_prot_numa()
> 
> It was observed that a bad pmd is seen when automatic NUMA balancing
> is marking page table entries as prot_numa:
> 
> [2437548.196018] mm/pgtable-generic.c:50: bad pmd 00000000af22fc02(dffffffe71fbfe02)
> 
> With some kernel modification, the call stack was dumped:
> 
> [2437548.235022] Call Trace:
> [2437548.238234]  <TASK>
> [2437548.241060]  dump_stack_lvl+0x46/0x61
> [2437548.245689]  panic+0x106/0x2e5
> [2437548.249497]  pmd_clear_bad+0x3c/0x3c
> [2437548.253967]  change_pmd_range.isra.0+0x34d/0x3a7
> [2437548.259537]  change_p4d_range+0x156/0x20e
> [2437548.264392]  change_protection_range+0x116/0x1a9
> [2437548.269976]  change_prot_numa+0x15/0x37
> [2437548.274774]  task_numa_work+0x1b8/0x302
> [2437548.279512]  task_work_run+0x62/0x95
> [2437548.283882]  exit_to_user_mode_loop+0x1a4/0x1a9
> [2437548.289277]  exit_to_user_mode_prepare+0xf4/0xfc
> [2437548.294751]  ? sysvec_apic_timer_interrupt+0x34/0x81
> [2437548.300677]  irqentry_exit_to_user_mode+0x5/0x25
> [2437548.306153]  asm_sysvec_apic_timer_interrupt+0x16/0x1b
> 
> This is due to a race condition between change_prot_numa() and
> THP migration because the kernel doesn't check is_swap_pmd() and
> pmd_trans_huge() atomically:
> 
> change_prot_numa()                      THP migration
> ======================================================================
> - change_pmd_range()
> -> is_swap_pmd() returns false,
>     meaning it's not a PMD migration
>     entry.
> 
> 				    - do_huge_pmd_numa_page()
> 				    -> migrate_misplaced_page() sets
> 				       migration entries for the THP.
> 
> - change_pmd_range()
> -> pmd_none_or_clear_bad_unless_trans_huge()
> -> pmd_none() and pmd_trans_huge() returns false
> 
> - pmd_none_or_clear_bad_unless_trans_huge()
> -> pmd_bad() returns true for the migration entry!
> 
> For the race condition described above to occur:
> 
> 1) AutoNUMA must be unmapping a range of pages, with at least part of the
> range already unmapped by AutoNUMA.
> 
> 2) While AutoNUMA is in the process of unmapping, a NUMA hinting fault
> occurs within that range, specifically when we are about to unmap
> the PMD entry, between the is_swap_pmd() and pmd_trans_huge() checks.
> 
> So this is a really rare race condition and it's observed that it takes
> usually a few days of autonuma-intensive testing to trigger.
> 
> A bit of history on a similar race condition in the past:
> 
> In fact, a similar race condition caused by not checking pmd_trans_huge()
> atomically was reported [1] in 2017. However, instead of the patch [1],
> another patch series [3] fixed the problem [2] by not clearing the pmd
> entry but invaliding it instead (so that pmd_trans_huge() would still
> return true).
> 
> Despite patch series [3], the bad pmd error continued to be reported
> in mainline. As a result, [1] was resurrected [4] and it landed mainline
> in 2020 in a hope that it would resolve the issue. However, now it turns
> out that [3] was not sufficient.
> 
> Fix this race condition by checking is_swap_pmd() and pmd_trans_huge()
> atomically. With that, the kernel should see either
> pmd_trans_huge() == true, or is_swap_pmd() == true when another task is
> migrating the page concurrently.
> 
> This bug was introduced when THP migration support was added. More
> specifically, by commit 84c3fc4e9c56 ("mm: thp: check pmd migration entry
> in common path")).
> 
> It is unintentionally fixed since v6.5 by commit 670ddd8cdcbd1
> ("mm/mprotect: delete pmd_none_or_clear_bad_unless_trans_huge()") while
> removing pmd_none_or_clear_bad_unless_trans_huge() function. But it's not
> backported to -stable because it was fixed unintentionally.
> 
> Link: https://lore.kernel.org/linux-mm/20170410094825.2yfo5zehn7pchg6a@techsingularity.net [1]
> Link: https://lore.kernel.org/linux-mm/8A6309F4-DB76-48FA-BE7F-BF9536A4C4E5@cs.rutgers.edu [2]
> Link: https://lore.kernel.org/linux-mm/20170302151034.27829-1-kirill.shutemov@linux.intel.com [3]
> Link: https://lore.kernel.org/linux-mm/20200216191800.22423-1-aquini@redhat.com [4]
> Fixes: 84c3fc4e9c56 ("mm: thp: check pmd migration entry in common path")
> Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
> ---
>   mm/mprotect.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 668bfaa6ed2a..c0e796c0f9b0 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -303,7 +303,7 @@ static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)

This is like the worst function, ever :D

>   
>   	if (pmd_none(pmdval))
>   		return 1;
> -	if (pmd_trans_huge(pmdval))
> +	if (is_swap_pmd(pmdval) || pmd_trans_huge(pmdval))
>   		return 0;
>   	if (unlikely(pmd_bad(pmdval))) {
>   		pmd_clear_bad(pmd);
> @@ -373,7 +373,7 @@ static inline unsigned long change_pmd_range(struct mmu_gather *tlb,
>   		 * Hence, it's necessary to atomically read the PMD value
>   		 * for all the checks.
>   		 */
> -		if (!is_swap_pmd(*pmd) && !pmd_devmap(*pmd) &&
> +		if (!pmd_devmap(*pmd) &&
>   		     pmd_none_or_clear_bad_unless_trans_huge(pmd))
>   			goto next;
>   

This is all because we are trying to be smart and walking page tables 
without the page table lock held. This is just absolutely nasty.

What about the following check:

if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {

Couldn't we have a similar race there when we are concurrently migrating?

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [DISCUSSION] Fixing bad pmd due to a race condition between change_prot_numa() and THP migration in pre-6.5 kernels.
  2025-09-23  9:00 ` David Hildenbrand
@ 2025-09-23 11:46   ` Harry Yoo
  2025-09-23 14:09     ` David Hildenbrand
  0 siblings, 1 reply; 10+ messages in thread
From: Harry Yoo @ 2025-09-23 11:46 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Lorenzo Stoakes, Zi Yan, Baolin Wang,
	Liam R . Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Lance Yang, Kiryl Shutsemau, Hugh Dickins, Vlastimil Babka,
	Jann Horn, Pedro Falcato, Jane Chu, linux-mm, stable

On Tue, Sep 23, 2025 at 11:00:57AM +0200, David Hildenbrand wrote:
> On 22.09.25 01:27, Harry Yoo wrote:
> > Hi. This is supposed to be a patch, but I think it's worth discussing
> > how it should be backported to -stable, so I've labeled it as [DISCUSSION].
> > 
> > The bug described below was unintentionally fixed in v6.5 and not
> > backported to -stable. So technically I would need to use "Option 3" [A],
> 
> What is option 3?

Citing Option 3 from [A]:
> Send the patch, after verifying that it follows the above rules, to
> stable@vger.kernel.org and mention the kernel versions you wish it to be
> applied to. When doing so, you must note the upstream commit ID in the
> changelog of your submission with a separate line above the commit text,
>
> like this:
> commit <sha1> upstream.
>
> Or alternatively:
> [ Upstream commit <sha1> ]
>
> If the submitted patch deviates from the original upstream patch
> (for example because it had to be adjusted for the older API),
> this must be very clearly documented and justified in the patch description.

> Just to clarify: it's fine to do a backport of a commit
> even though it was not tagged as a fix.

Thanks for looking into it, David!

Ok, I was worried that the original patch's description will confuse
people because 1) we don't allow pte_map_offset_lock() to fail in older
kernels, which the original patch relies on, and 2) the patch does not
mention the race condition (because it fixed the race 'accidentaly' :D).

I'll backport the original patch but make it clear that:

1. while the original patch did not mention the race condition,
   the patch fixes a it, and add link to this discussion.

2. we can't remove 1) pmd_trans_unstable() check in change_pte_range(),
   and 2) "bad" pmd check in change_pmd_range() because we don't allow
   pte_offset_map_lock() to fail().

3. pmd_read_atomic() is used instead of pmdp_get_lockless() beucase it
   does not exist in older kernels.

> > but since the original patch [B] did not intend to fix a bug (and it's also
> > part of a larger patch series), it looks quite different from the patch below,
> > and I'm not sure what the backport should look like.
> > 
> > I think there are probably two options:
> > 
> > 1. Provide the description of the original patch along with a very long,
> >     detailed explanation of why the patch deviates from the upstream version, or
> > 
> > 2. Post the patch below with a clarification that it was fixed upstream
> >     by commit 670ddd8cdcbd1.
> > 
> > Any thoughts?
> > 
> > [A] https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html#option-3
> > [B] https://lkml.kernel.org/r/725a42a9-91e9-c868-925-e3a5fd40bb4f@google.com
> >      (Upstream commit 670ddd8cdcbd1)
> > 
> > In any case, no matter how we backport this, it needs some review and
> > feedback would be appreciated. The patch applies to v6.1 and v5.15, and
> > v5.10 but not v5.4.
> > 
> >  From cf45867ab8e48b42160b7253390db7bdecef1455 Mon Sep 17 00:00:00 2001
> > From: Harry Yoo <harry.yoo@oracle.com>
> > Date: Thu, 11 Sep 2025 20:05:40 +0900
> > Subject: [PATCH] mm, numa: fix bad pmd by atomically checking is_swap_pmd() in
> >   change_prot_numa()
> > 
> > It was observed that a bad pmd is seen when automatic NUMA balancing
> > is marking page table entries as prot_numa:
> > 
> > [2437548.196018] mm/pgtable-generic.c:50: bad pmd 00000000af22fc02(dffffffe71fbfe02)
> > 
> > With some kernel modification, the call stack was dumped:
> > 
> > [2437548.235022] Call Trace:
> > [2437548.238234]  <TASK>
> > [2437548.241060]  dump_stack_lvl+0x46/0x61
> > [2437548.245689]  panic+0x106/0x2e5
> > [2437548.249497]  pmd_clear_bad+0x3c/0x3c
> > [2437548.253967]  change_pmd_range.isra.0+0x34d/0x3a7
> > [2437548.259537]  change_p4d_range+0x156/0x20e
> > [2437548.264392]  change_protection_range+0x116/0x1a9
> > [2437548.269976]  change_prot_numa+0x15/0x37
> > [2437548.274774]  task_numa_work+0x1b8/0x302
> > [2437548.279512]  task_work_run+0x62/0x95
> > [2437548.283882]  exit_to_user_mode_loop+0x1a4/0x1a9
> > [2437548.289277]  exit_to_user_mode_prepare+0xf4/0xfc
> > [2437548.294751]  ? sysvec_apic_timer_interrupt+0x34/0x81
> > [2437548.300677]  irqentry_exit_to_user_mode+0x5/0x25
> > [2437548.306153]  asm_sysvec_apic_timer_interrupt+0x16/0x1b
> > 
> > This is due to a race condition between change_prot_numa() and
> > THP migration because the kernel doesn't check is_swap_pmd() and
> > pmd_trans_huge() atomically:
> > 
> > change_prot_numa()                      THP migration
> > ======================================================================
> > - change_pmd_range()
> > -> is_swap_pmd() returns false,
> >     meaning it's not a PMD migration
> >     entry.
> > 
> > 				    - do_huge_pmd_numa_page()
> > 				    -> migrate_misplaced_page() sets
> > 				       migration entries for the THP.
> > 
> > - change_pmd_range()
> > -> pmd_none_or_clear_bad_unless_trans_huge()
> > -> pmd_none() and pmd_trans_huge() returns false
> > 
> > - pmd_none_or_clear_bad_unless_trans_huge()
> > -> pmd_bad() returns true for the migration entry!
> > 
> > For the race condition described above to occur:
> > 
> > 1) AutoNUMA must be unmapping a range of pages, with at least part of the
> > range already unmapped by AutoNUMA.
> > 
> > 2) While AutoNUMA is in the process of unmapping, a NUMA hinting fault
> > occurs within that range, specifically when we are about to unmap
> > the PMD entry, between the is_swap_pmd() and pmd_trans_huge() checks.
> > 
> > So this is a really rare race condition and it's observed that it takes
> > usually a few days of autonuma-intensive testing to trigger.
> > 
> > A bit of history on a similar race condition in the past:
> > 
> > In fact, a similar race condition caused by not checking pmd_trans_huge()
> > atomically was reported [1] in 2017. However, instead of the patch [1],
> > another patch series [3] fixed the problem [2] by not clearing the pmd
> > entry but invaliding it instead (so that pmd_trans_huge() would still
> > return true).
> > 
> > Despite patch series [3], the bad pmd error continued to be reported
> > in mainline. As a result, [1] was resurrected [4] and it landed mainline
> > in 2020 in a hope that it would resolve the issue. However, now it turns
> > out that [3] was not sufficient.
> > 
> > Fix this race condition by checking is_swap_pmd() and pmd_trans_huge()
> > atomically. With that, the kernel should see either
> > pmd_trans_huge() == true, or is_swap_pmd() == true when another task is
> > migrating the page concurrently.
> > 
> > This bug was introduced when THP migration support was added. More
> > specifically, by commit 84c3fc4e9c56 ("mm: thp: check pmd migration entry
> > in common path")).
> > 
> > It is unintentionally fixed since v6.5 by commit 670ddd8cdcbd1
> > ("mm/mprotect: delete pmd_none_or_clear_bad_unless_trans_huge()") while
> > removing pmd_none_or_clear_bad_unless_trans_huge() function. But it's not
> > backported to -stable because it was fixed unintentionally.
> > 
> > Link: https://lore.kernel.org/linux-mm/20170410094825.2yfo5zehn7pchg6a@techsingularity.net [1]
> > Link: https://lore.kernel.org/linux-mm/8A6309F4-DB76-48FA-BE7F-BF9536A4C4E5@cs.rutgers.edu [2]
> > Link: https://lore.kernel.org/linux-mm/20170302151034.27829-1-kirill.shutemov@linux.intel.com [3]
> > Link: https://lore.kernel.org/linux-mm/20200216191800.22423-1-aquini@redhat.com [4]
> > Fixes: 84c3fc4e9c56 ("mm: thp: check pmd migration entry in common path")
> > Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
> > ---
> >   mm/mprotect.c | 4 ++--
> >   1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > index 668bfaa6ed2a..c0e796c0f9b0 100644
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -303,7 +303,7 @@ static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
> 
> This is like the worst function, ever :D

Glad it's removed recently :)

> >   	if (pmd_none(pmdval))
> >   		return 1;
> > -	if (pmd_trans_huge(pmdval))
> > +	if (is_swap_pmd(pmdval) || pmd_trans_huge(pmdval))
> >   		return 0;
> >   	if (unlikely(pmd_bad(pmdval))) {
> >   		pmd_clear_bad(pmd);
> > @@ -373,7 +373,7 @@ static inline unsigned long change_pmd_range(struct mmu_gather *tlb,
> >   		 * Hence, it's necessary to atomically read the PMD value
> >   		 * for all the checks.
> >   		 */
> > -		if (!is_swap_pmd(*pmd) && !pmd_devmap(*pmd) &&
> > +		if (!pmd_devmap(*pmd) &&
> >   		     pmd_none_or_clear_bad_unless_trans_huge(pmd))
> >   			goto next;
> 
> This is all because we are trying to be smart and walking page tables
> without the page table lock held. This is just absolutely nasty.

commit 175ad4f1e7a2 ("mm: mprotect: use pmd_trans_unstable instead of
taking the pmd_lock") did this :(

> What about the following check:
> 
> if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
> 
> Couldn't we have a similar race there when we are concurrently migrating?

An excellent point! I agree that there could be a similar race,
but with something other than "bad pmd" error.

It'd be more robust to do something like:

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 668bfaa6ed2a..6feca04f9833 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -288,31 +288,6 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
 	return pages;
 }
 
-/*
- * Used when setting automatic NUMA hinting protection where it is
- * critical that a numa hinting PMD is not confused with a bad PMD.
- */
-static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
-{
-	pmd_t pmdval = pmd_read_atomic(pmd);
-
-	/* See pmd_none_or_trans_huge_or_clear_bad for info on barrier */
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	barrier();
-#endif
-
-	if (pmd_none(pmdval))
-		return 1;
-	if (pmd_trans_huge(pmdval))
-		return 0;
-	if (unlikely(pmd_bad(pmdval))) {
-		pmd_clear_bad(pmd);
-		return 1;
-	}
-
-	return 0;
-}
-
 /* Return true if we're uffd wr-protecting file-backed memory, or false */
 static inline bool
 uffd_wp_protect_file(struct vm_area_struct *vma, unsigned long cp_flags)
@@ -361,6 +336,7 @@ static inline unsigned long change_pmd_range(struct mmu_gather *tlb,
 	pmd = pmd_offset(pud, addr);
 	do {
 		unsigned long this_pages;
+		pmd_t _pmd;
 
 		next = pmd_addr_end(addr, end);
 
@@ -373,9 +349,20 @@ static inline unsigned long change_pmd_range(struct mmu_gather *tlb,
 		 * Hence, it's necessary to atomically read the PMD value
 		 * for all the checks.
 		 */
-		if (!is_swap_pmd(*pmd) && !pmd_devmap(*pmd) &&
-		     pmd_none_or_clear_bad_unless_trans_huge(pmd))
-			goto next;
+		_pmd = pmd_read_atomic(pmd);
+		/* See pmd_none_or_trans_huge_or_clear_bad for info on barrier */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+		barrier();
+#endif
+		if (!is_swap_pmd(_pmd) && !pmd_devmap(_pmd) &&
+				!pmd_trans_huge(_pmd)) {
+			if (pmd_none(_pmd))
+				goto next;
+			if (unlikely(pmd_bad(_pmd))) {
+				pmd_clear_bad(pmd);
+				goto next;
+			}
+		}
 
 		/* invoke the mmu notifier if the pmd is populated */
 		if (!range.start) {
@@ -385,7 +372,7 @@ static inline unsigned long change_pmd_range(struct mmu_gather *tlb,
 			mmu_notifier_invalidate_range_start(&range);
 		}
 
-		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
+		if (is_swap_pmd(_pmd) || pmd_trans_huge(_pmd) || pmd_devmap(_pmd)) {
 			if ((next - addr != HPAGE_PMD_SIZE) ||
 			    uffd_wp_protect_file(vma, cp_flags)) {
 				__split_huge_pmd(vma, pmd, addr, false, NULL);

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [DISCUSSION] Fixing bad pmd due to a race condition between change_prot_numa() and THP migration in pre-6.5 kernels.
  2025-09-23 11:46   ` Harry Yoo
@ 2025-09-23 14:09     ` David Hildenbrand
  2025-09-24 11:54       ` Harry Yoo
  0 siblings, 1 reply; 10+ messages in thread
From: David Hildenbrand @ 2025-09-23 14:09 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Andrew Morton, Lorenzo Stoakes, Zi Yan, Baolin Wang,
	Liam R . Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Lance Yang, Kiryl Shutsemau, Hugh Dickins, Vlastimil Babka,
	Jann Horn, Pedro Falcato, Jane Chu, linux-mm, stable

On 23.09.25 13:46, Harry Yoo wrote:
> On Tue, Sep 23, 2025 at 11:00:57AM +0200, David Hildenbrand wrote:
>> On 22.09.25 01:27, Harry Yoo wrote:
>>> Hi. This is supposed to be a patch, but I think it's worth discussing
>>> how it should be backported to -stable, so I've labeled it as [DISCUSSION].
>>>
>>> The bug described below was unintentionally fixed in v6.5 and not
>>> backported to -stable. So technically I would need to use "Option 3" [A],
>>
>> What is option 3?
> 
> Citing Option 3 from [A]:

Ah, I see.

>> Send the patch, after verifying that it follows the above rules, to
>> stable@vger.kernel.org and mention the kernel versions you wish it to be
>> applied to. When doing so, you must note the upstream commit ID in the
>> changelog of your submission with a separate line above the commit text,
>>
>> like this:
>> commit <sha1> upstream.
>>
>> Or alternatively:
>> [ Upstream commit <sha1> ]
>>
>> If the submitted patch deviates from the original upstream patch
>> (for example because it had to be adjusted for the older API),
>> this must be very clearly documented and justified in the patch description.
> 
>> Just to clarify: it's fine to do a backport of a commit
>> even though it was not tagged as a fix.
> 
> Thanks for looking into it, David!
> 
> Ok, I was worried that the original patch's description will confuse
> people because 1) we don't allow pte_map_offset_lock() to fail in older
> kernels, which the original patch relies on, and 2) the patch does not
> mention the race condition (because it fixed the race 'accidentaly' :D).
> 
> I'll backport the original patch but make it clear that:
> 
> 1. while the original patch did not mention the race condition,
>     the patch fixes a it, and add link to this discussion.
> 
> 2. we can't remove 1) pmd_trans_unstable() check in change_pte_range(),
>     and 2) "bad" pmd check in change_pmd_range() because we don't allow
>     pte_offset_map_lock() to fail().
> 
> 3. pmd_read_atomic() is used instead of pmdp_get_lockless() beucase it
>     does not exist in older kernels.

Right, and backporting all the prerequisites might not be 
feasible/desirable.

[...]

>>>    			goto next;
>>
>> This is all because we are trying to be smart and walking page tables
>> without the page table lock held. This is just absolutely nasty.
> 
> commit 175ad4f1e7a2 ("mm: mprotect: use pmd_trans_unstable instead of
> taking the pmd_lock") did this :(

Right. I can understand why we would not want to grab the lock when 
there is a leaf page table. But everything else is just asking for 
trouble (as we saw :) ).

> 
>> What about the following check:
>>
>> if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
>>
>> Couldn't we have a similar race there when we are concurrently migrating?
> 
> An excellent point! I agree that there could be a similar race,
> but with something other than "bad pmd" error.

Right, instead we'd go into change_pte_range() where we check 
pmd_trans_unstable().

After that, I'm not sure ... maybe we'll just retry or we'll 
accidentally try treating it as a PTE table.

Looks like 
pmd_trans_unstable()->pud_none_or_trans_huge_or_dev_or_clear_bad() would 
return "0"
in case we hit migration entry? :/

> 
> It'd be more robust to do something like:

That's also what I had in mind. But all this lockless stuff makes me a 
bit nervous :)

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [DISCUSSION] Fixing bad pmd due to a race condition between change_prot_numa() and THP migration in pre-6.5 kernels.
  2025-09-23 14:09     ` David Hildenbrand
@ 2025-09-24 11:54       ` Harry Yoo
  2025-09-24 15:52         ` David Hildenbrand
  0 siblings, 1 reply; 10+ messages in thread
From: Harry Yoo @ 2025-09-24 11:54 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Lorenzo Stoakes, Zi Yan, Baolin Wang,
	Liam R . Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Lance Yang, Kiryl Shutsemau, Hugh Dickins, Vlastimil Babka,
	Jann Horn, Pedro Falcato, Jane Chu, linux-mm, stable

On Tue, Sep 23, 2025 at 04:09:06PM +0200, David Hildenbrand wrote:
> On 23.09.25 13:46, Harry Yoo wrote:
> > On Tue, Sep 23, 2025 at 11:00:57AM +0200, David Hildenbrand wrote:
> > > On 22.09.25 01:27, Harry Yoo wrote:
> > > This is all because we are trying to be smart and walking page tables
> > > without the page table lock held. This is just absolutely nasty.
> > 
> > commit 175ad4f1e7a2 ("mm: mprotect: use pmd_trans_unstable instead of
> > taking the pmd_lock") did this :(
> 
> Right. I can understand why we would not want to grab the lock when there is
> a leaf page table. But everything else is just asking for trouble (as we saw
> :) ).
> > > What about the following check:
> > > 
> > > if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
> > > 
> > > Couldn't we have a similar race there when we are concurrently migrating?
> > 
> > An excellent point! I agree that there could be a similar race,
> > but with something other than "bad pmd" error.
> 
> Right, instead we'd go into change_pte_range() where we check
> pmd_trans_unstable().

Uh, my brain hurts again... :)

In case is_swap_pmd() or pmd_trans_huge() returned true, but another
kernel thread splits THP after we checked it, __split_huge_pmd() or
change_huge_pmd() will just return without actually splitting or changing
pmd entry, if it turns out that evaluating
(is_swap_pmd() || pmd_trans_huge() || pmd_devmap()) as true
was false positive due to race condition, because they both double check
after acquiring pmd lock:

1) __split_huge_pmd() checks if it's either pmd_trans_huge(), pmd_devmap()
or is_pmd_migration_entry() under pmd lock.

2) change_huge_pmd() checks if it's either is_swap_pmd(),
pmd_trans_huge(), or pmd_devmap() under pmd lock.

And if either function simply returns because it was not a THP,
pmd migration entry, or pmd devmap, khugepaged cannot colleapse
huge page because we're holding mmap_lock in read mode.

And then we call change_pte_range() and that's safe.

> After that, I'm not sure ... maybe we'll just retry

Or as you mentioned, if we are misled into thinking it is not a THP,
PMD devmap, or swap PMD due to race condition, we'd end up going into
change_pte_range().

> or we'll accidentally try treating it as a PTE table.

But then pmd_trans_unstable() check should prevent us from treating
it as PTE table (and we're still holding mmap_lock here).
In such case we don't retry but skip it instead.

> Looks like
> pmd_trans_unstable()->pud_none_or_trans_huge_or_dev_or_clear_bad() would

I think you mean
pmd_trans_unstable()->pmd_none_or_trans_huge_or_clear_bad()?

> return "0"
> in case we hit migration entry? :/

pmd_none_or_trans_huge_or_clear_bad() open-coded is_swap_pmd(), as it
eventually checks !pmd_none() && !pmd_present() case.

> > It'd be more robust to do something like:
> 
> That's also what I had in mind. But all this lockless stuff makes me a bit
> nervous :)

Yeah the code is not very straightforward... :/

But technically the diff that I pasted here should be enough to fix
this... or do you have any alternative approach in mind?

-- 
Cheers,
Harry / Hyeonggon

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [DISCUSSION] Fixing bad pmd due to a race condition between change_prot_numa() and THP migration in pre-6.5 kernels.
  2025-09-24 11:54       ` Harry Yoo
@ 2025-09-24 15:52         ` David Hildenbrand
  2025-10-02 14:07           ` Harry Yoo
  0 siblings, 1 reply; 10+ messages in thread
From: David Hildenbrand @ 2025-09-24 15:52 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Andrew Morton, Lorenzo Stoakes, Zi Yan, Baolin Wang,
	Liam R . Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Lance Yang, Kiryl Shutsemau, Hugh Dickins, Vlastimil Babka,
	Jann Horn, Pedro Falcato, Jane Chu, linux-mm, stable

On 24.09.25 13:54, Harry Yoo wrote:
> On Tue, Sep 23, 2025 at 04:09:06PM +0200, David Hildenbrand wrote:
>> On 23.09.25 13:46, Harry Yoo wrote:
>>> On Tue, Sep 23, 2025 at 11:00:57AM +0200, David Hildenbrand wrote:
>>>> On 22.09.25 01:27, Harry Yoo wrote:
>>>> This is all because we are trying to be smart and walking page tables
>>>> without the page table lock held. This is just absolutely nasty.
>>>
>>> commit 175ad4f1e7a2 ("mm: mprotect: use pmd_trans_unstable instead of
>>> taking the pmd_lock") did this :(
>>
>> Right. I can understand why we would not want to grab the lock when there is
>> a leaf page table. But everything else is just asking for trouble (as we saw
>> :) ).
>>>> What about the following check:
>>>>
>>>> if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
>>>>
>>>> Couldn't we have a similar race there when we are concurrently migrating?
>>>
>>> An excellent point! I agree that there could be a similar race,
>>> but with something other than "bad pmd" error.
>>
>> Right, instead we'd go into change_pte_range() where we check
>> pmd_trans_unstable().
> 
> Uh, my brain hurts again... :)

Yeah, that's why I gave up faster than you trying to understand these 
horrible, horrible functions in the stable kernel :)

> 
> In case is_swap_pmd() or pmd_trans_huge() returned true, but another
> kernel thread splits THP after we checked it, __split_huge_pmd() or
> change_huge_pmd() will just return without actually splitting or changing
> pmd entry, if it turns out that evaluating
> (is_swap_pmd() || pmd_trans_huge() || pmd_devmap()) as true
> was false positive due to race condition, because they both double check
> after acquiring pmd lock:
> 
> 1) __split_huge_pmd() checks if it's either pmd_trans_huge(), pmd_devmap()
> or is_pmd_migration_entry() under pmd lock.
> 
> 2) change_huge_pmd() checks if it's either is_swap_pmd(),
> pmd_trans_huge(), or pmd_devmap() under pmd lock.
> 
> And if either function simply returns because it was not a THP,
> pmd migration entry, or pmd devmap, khugepaged cannot colleapse
> huge page because we're holding mmap_lock in read mode.
> 
> And then we call change_pte_range() and that's safe.
> 
>> After that, I'm not sure ... maybe we'll just retry
> 
> Or as you mentioned, if we are misled into thinking it is not a THP,
> PMD devmap, or swap PMD due to race condition, we'd end up going into
> change_pte_range().
> 
>> or we'll accidentally try treating it as a PTE table.
> 
> But then pmd_trans_unstable() check should prevent us from treating
> it as PTE table (and we're still holding mmap_lock here).
> In such case we don't retry but skip it instead.
> 
>> Looks like
>> pmd_trans_unstable()->pud_none_or_trans_huge_or_dev_or_clear_bad() would
> 
> I think you mean
> pmd_trans_unstable()->pmd_none_or_trans_huge_or_clear_bad()?

Yes!

> 
>> return "0"
>> in case we hit migration entry? :/
> 
> pmd_none_or_trans_huge_or_clear_bad() open-coded is_swap_pmd(), as it
> eventually checks !pmd_none() && !pmd_present() case.

Ah, right, I missed the pmd_present() while skimming over this extremely 
horrible function.

So pmd_trans_unstable()->pmd_none_or_trans_huge_or_clear_bad() would 
return "1" and make us retry.

> 
>>> It'd be more robust to do something like:
>>
>> That's also what I had in mind. But all this lockless stuff makes me a bit
>> nervous :)
> 
> Yeah the code is not very straightforward... :/
> 
> But technically the diff that I pasted here should be enough to fix
> this... or do you have any alternative approach in mind?

Hopefully, I'm not convinced this code is not buggy, but at least 
regarding concurrent migration it should be fine with that.

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [DISCUSSION] Fixing bad pmd due to a race condition between change_prot_numa() and THP migration in pre-6.5 kernels.
  2025-09-24 15:52         ` David Hildenbrand
@ 2025-10-02 14:07           ` Harry Yoo
  2025-10-06  8:18             ` David Hildenbrand
  0 siblings, 1 reply; 10+ messages in thread
From: Harry Yoo @ 2025-10-02 14:07 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Lorenzo Stoakes, Zi Yan, Baolin Wang,
	Liam R . Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Lance Yang, Kiryl Shutsemau, Hugh Dickins, Vlastimil Babka,
	Jann Horn, Pedro Falcato, Jane Chu, linux-mm, stable

On Wed, Sep 24, 2025 at 05:52:14PM +0200, David Hildenbrand wrote:
> On 24.09.25 13:54, Harry Yoo wrote:
> > On Tue, Sep 23, 2025 at 04:09:06PM +0200, David Hildenbrand wrote:
> > > On 23.09.25 13:46, Harry Yoo wrote:
> > > > On Tue, Sep 23, 2025 at 11:00:57AM +0200, David Hildenbrand wrote:
> > > > > On 22.09.25 01:27, Harry Yoo wrote:
> > In case is_swap_pmd() or pmd_trans_huge() returned true, but another
> > kernel thread splits THP after we checked it, __split_huge_pmd() or
> > change_huge_pmd() will just return without actually splitting or changing
> > pmd entry, if it turns out that evaluating
> > (is_swap_pmd() || pmd_trans_huge() || pmd_devmap()) as true
> > was false positive due to race condition, because they both double check
> > after acquiring pmd lock:
> > 
> > 1) __split_huge_pmd() checks if it's either pmd_trans_huge(), pmd_devmap()
> > or is_pmd_migration_entry() under pmd lock.
> > 
> > 2) change_huge_pmd() checks if it's either is_swap_pmd(),
> > pmd_trans_huge(), or pmd_devmap() under pmd lock.
> > 
> > And if either function simply returns because it was not a THP,
> > pmd migration entry, or pmd devmap, khugepaged cannot colleapse
> > huge page because we're holding mmap_lock in read mode.
> > 
> > And then we call change_pte_range() and that's safe.
> > 
> > > After that, I'm not sure ... maybe we'll just retry
> > 
> > Or as you mentioned, if we are misled into thinking it is not a THP,
> > PMD devmap, or swap PMD due to race condition, we'd end up going into
> > change_pte_range().
> > 
> > > or we'll accidentally try treating it as a PTE table.
> > 
> > But then pmd_trans_unstable() check should prevent us from treating
> > it as PTE table (and we're still holding mmap_lock here).
> > In such case we don't retry but skip it instead.
> > 
> > > Looks like
> > > pmd_trans_unstable()->pud_none_or_trans_huge_or_dev_or_clear_bad() would
> > 
> > I think you mean
> > pmd_trans_unstable()->pmd_none_or_trans_huge_or_clear_bad()?
> 
> Yes!
> 
> > 
> > > return "0"
> > > in case we hit migration entry? :/
> > 
> > pmd_none_or_trans_huge_or_clear_bad() open-coded is_swap_pmd(), as it
> > eventually checks !pmd_none() && !pmd_present() case.
 

Apologies for the late reply.

> Ah, right, I missed the pmd_present() while skimming over this extremely
> horrible function.
> 
> So pmd_trans_unstable()->pmd_none_or_trans_huge_or_clear_bad() would return
> "1" and make us retry.

We don't retry in pre-6.5 kernels because retrying is a new behavior
after commit 670ddd8cdcbd1.

> > > > It'd be more robust to do something like:
> > > 
> > > That's also what I had in mind. But all this lockless stuff makes me a bit
> > > nervous :)
> > 
> > Yeah the code is not very straightforward... :/
> > 
> > But technically the diff that I pasted here should be enough to fix
> > this... or do you have any alternative approach in mind?
> 
> Hopefully, I'm not convinced this code is not buggy, but at least regarding
> concurrent migration it should be fine with that.

I've been thinking about this...

Actually, it'll make more sense to open-code what pte_map_offset_lock()
does in the mainline:

1. do not remove the "bad pte" checks, because pte_offset_map() in pre-6.5
   kernels doesn't do the check for us unlike the mainline.
2. check is_swap_pmd(), pmd_trans_huge(), pmd_devmap() without ptl, but
   atomically.
3. after acquiring ptl in change_pte_range(), check if pmd has changed
   since step 1 and 2. if yes, retry (like mainline). if no, we're all good.

What do you think?

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [DISCUSSION] Fixing bad pmd due to a race condition between change_prot_numa() and THP migration in pre-6.5 kernels.
  2025-10-02 14:07           ` Harry Yoo
@ 2025-10-06  8:18             ` David Hildenbrand
  2025-10-20 13:25               ` Harry Yoo
  0 siblings, 1 reply; 10+ messages in thread
From: David Hildenbrand @ 2025-10-06  8:18 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Andrew Morton, Lorenzo Stoakes, Zi Yan, Baolin Wang,
	Liam R . Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Lance Yang, Kiryl Shutsemau, Hugh Dickins, Vlastimil Babka,
	Jann Horn, Pedro Falcato, Jane Chu, linux-mm, stable

On 02.10.25 16:07, Harry Yoo wrote:
> On Wed, Sep 24, 2025 at 05:52:14PM +0200, David Hildenbrand wrote:
>> On 24.09.25 13:54, Harry Yoo wrote:
>>> On Tue, Sep 23, 2025 at 04:09:06PM +0200, David Hildenbrand wrote:
>>>> On 23.09.25 13:46, Harry Yoo wrote:
>>>>> On Tue, Sep 23, 2025 at 11:00:57AM +0200, David Hildenbrand wrote:
>>>>>> On 22.09.25 01:27, Harry Yoo wrote:
>>> In case is_swap_pmd() or pmd_trans_huge() returned true, but another
>>> kernel thread splits THP after we checked it, __split_huge_pmd() or
>>> change_huge_pmd() will just return without actually splitting or changing
>>> pmd entry, if it turns out that evaluating
>>> (is_swap_pmd() || pmd_trans_huge() || pmd_devmap()) as true
>>> was false positive due to race condition, because they both double check
>>> after acquiring pmd lock:
>>>
>>> 1) __split_huge_pmd() checks if it's either pmd_trans_huge(), pmd_devmap()
>>> or is_pmd_migration_entry() under pmd lock.
>>>
>>> 2) change_huge_pmd() checks if it's either is_swap_pmd(),
>>> pmd_trans_huge(), or pmd_devmap() under pmd lock.
>>>
>>> And if either function simply returns because it was not a THP,
>>> pmd migration entry, or pmd devmap, khugepaged cannot colleapse
>>> huge page because we're holding mmap_lock in read mode.
>>>
>>> And then we call change_pte_range() and that's safe.
>>>
>>>> After that, I'm not sure ... maybe we'll just retry
>>>
>>> Or as you mentioned, if we are misled into thinking it is not a THP,
>>> PMD devmap, or swap PMD due to race condition, we'd end up going into
>>> change_pte_range().
>>>
>>>> or we'll accidentally try treating it as a PTE table.
>>>
>>> But then pmd_trans_unstable() check should prevent us from treating
>>> it as PTE table (and we're still holding mmap_lock here).
>>> In such case we don't retry but skip it instead.
>>>
>>>> Looks like
>>>> pmd_trans_unstable()->pud_none_or_trans_huge_or_dev_or_clear_bad() would
>>>
>>> I think you mean
>>> pmd_trans_unstable()->pmd_none_or_trans_huge_or_clear_bad()?
>>
>> Yes!
>>
>>>
>>>> return "0"
>>>> in case we hit migration entry? :/
>>>
>>> pmd_none_or_trans_huge_or_clear_bad() open-coded is_swap_pmd(), as it
>>> eventually checks !pmd_none() && !pmd_present() case.
>   
> 
> Apologies for the late reply.
> 
>> Ah, right, I missed the pmd_present() while skimming over this extremely
>> horrible function.
>>
>> So pmd_trans_unstable()->pmd_none_or_trans_huge_or_clear_bad() would return
>> "1" and make us retry.
> 
> We don't retry in pre-6.5 kernels because retrying is a new behavior
> after commit 670ddd8cdcbd1.

:/

>>>>> It'd be more robust to do something like:
>>>>
>>>> That's also what I had in mind. But all this lockless stuff makes me a bit
>>>> nervous :)
>>>
>>> Yeah the code is not very straightforward... :/
>>>
>>> But technically the diff that I pasted here should be enough to fix
>>> this... or do you have any alternative approach in mind?
>>
>> Hopefully, I'm not convinced this code is not buggy, but at least regarding
>> concurrent migration it should be fine with that.
> 
> I've been thinking about this...
> 
> Actually, it'll make more sense to open-code what pte_map_offset_lock()
> does in the mainline:
> 
> 1. do not remove the "bad pte" checks, because pte_offset_map() in pre-6.5
>     kernels doesn't do the check for us unlike the mainline.
> 2. check is_swap_pmd(), pmd_trans_huge(), pmd_devmap() without ptl, but
>     atomically.
> 3. after acquiring ptl in change_pte_range(), check if pmd has changed
>     since step 1 and 2. if yes, retry (like mainline). if no, we're all good.
> 
> What do you think?

Only for -stable, right? Does not sound too wrong for me, but I would 
have to take a closer look at the end result!

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [DISCUSSION] Fixing bad pmd due to a race condition between change_prot_numa() and THP migration in pre-6.5 kernels.
  2025-10-06  8:18             ` David Hildenbrand
@ 2025-10-20 13:25               ` Harry Yoo
  2025-10-20 14:07                 ` David Hildenbrand
  0 siblings, 1 reply; 10+ messages in thread
From: Harry Yoo @ 2025-10-20 13:25 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Lorenzo Stoakes, Zi Yan, Baolin Wang,
	Liam R . Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Lance Yang, Kiryl Shutsemau, Hugh Dickins, Vlastimil Babka,
	Jann Horn, Pedro Falcato, Jane Chu, linux-mm, stable

On Mon, Oct 06, 2025 at 10:18:30AM +0200, David Hildenbrand wrote:
> On 02.10.25 16:07, Harry Yoo wrote:
> > On Wed, Sep 24, 2025 at 05:52:14PM +0200, David Hildenbrand wrote:
> > > On 24.09.25 13:54, Harry Yoo wrote:
> > > > On Tue, Sep 23, 2025 at 04:09:06PM +0200, David Hildenbrand wrote:
> > > > > On 23.09.25 13:46, Harry Yoo wrote:
> > > > > > On Tue, Sep 23, 2025 at 11:00:57AM +0200, David Hildenbrand wrote:
> > > > > > > On 22.09.25 01:27, Harry Yoo wrote:
> > > > In case is_swap_pmd() or pmd_trans_huge() returned true, but another
> > > > kernel thread splits THP after we checked it, __split_huge_pmd() or
> > > > change_huge_pmd() will just return without actually splitting or changing
> > > > pmd entry, if it turns out that evaluating
> > > > (is_swap_pmd() || pmd_trans_huge() || pmd_devmap()) as true
> > > > was false positive due to race condition, because they both double check
> > > > after acquiring pmd lock:
> > > > 
> > > > 1) __split_huge_pmd() checks if it's either pmd_trans_huge(), pmd_devmap()
> > > > or is_pmd_migration_entry() under pmd lock.
> > > > 
> > > > 2) change_huge_pmd() checks if it's either is_swap_pmd(),
> > > > pmd_trans_huge(), or pmd_devmap() under pmd lock.
> > > > 
> > > > And if either function simply returns because it was not a THP,
> > > > pmd migration entry, or pmd devmap, khugepaged cannot colleapse
> > > > huge page because we're holding mmap_lock in read mode.
> > > > 
> > > > And then we call change_pte_range() and that's safe.
> > > > 
> > > > > After that, I'm not sure ... maybe we'll just retry
> > > > 
> > > > Or as you mentioned, if we are misled into thinking it is not a THP,
> > > > PMD devmap, or swap PMD due to race condition, we'd end up going into
> > > > change_pte_range().
> > > > 
> > > > > or we'll accidentally try treating it as a PTE table.
> > > > 
> > > > But then pmd_trans_unstable() check should prevent us from treating
> > > > it as PTE table (and we're still holding mmap_lock here).
> > > > In such case we don't retry but skip it instead.
> > > > 
> > > > > Looks like
> > > > > pmd_trans_unstable()->pud_none_or_trans_huge_or_dev_or_clear_bad() would
> > > > 
> > > > I think you mean
> > > > pmd_trans_unstable()->pmd_none_or_trans_huge_or_clear_bad()?
> > > 
> > > Yes!
> > > 
> > > > 
> > > > > return "0"
> > > > > in case we hit migration entry? :/
> > > > 
> > > > pmd_none_or_trans_huge_or_clear_bad() open-coded is_swap_pmd(), as it
> > > > eventually checks !pmd_none() && !pmd_present() case.
> > 
> > Apologies for the late reply.
> > 
> > > Ah, right, I missed the pmd_present() while skimming over this extremely
> > > horrible function.
> > > 
> > > So pmd_trans_unstable()->pmd_none_or_trans_huge_or_clear_bad() would return
> > > "1" and make us retry.
> > 
> > We don't retry in pre-6.5 kernels because retrying is a new behavior
> > after commit 670ddd8cdcbd1.
> 
> :/
> 
> > > > > > It'd be more robust to do something like:
> > > > > 
> > > > > That's also what I had in mind. But all this lockless stuff makes me a bit
> > > > > nervous :)
> > > > 
> > > > Yeah the code is not very straightforward... :/
> > > > 
> > > > But technically the diff that I pasted here should be enough to fix
> > > > this... or do you have any alternative approach in mind?
> > > 
> > > Hopefully, I'm not convinced this code is not buggy, but at least regarding
> > > concurrent migration it should be fine with that.
> > 
> > I've been thinking about this...
> > 
> > Actually, it'll make more sense to open-code what pte_map_offset_lock()
> > does in the mainline:
> > 
> > 1. do not remove the "bad pte" checks, because pte_offset_map() in pre-6.5
> >     kernels doesn't do the check for us unlike the mainline.
> > 2. check is_swap_pmd(), pmd_trans_huge(), pmd_devmap() without ptl, but
> >     atomically.
> > 3. after acquiring ptl in change_pte_range(), check if pmd has changed
> >     since step 1 and 2. if yes, retry (like mainline). if no, we're all good.
> > 
> > What do you think?

Apologies for late reply...

> Only for -stable, right?

Right!

> Does not sound too wrong for me, but I would have
> to take a closer look at the end result!

FYI I'll test these two patches and see if they survive for a week,
and then submit them to -stable. Thanks.

The first patch is also going to be backported since change_pte_range()
cannot return negative values without it.

It's based on v5.15, but I'll have to backport it from v6.1 to, uh,
**checks note**, v5.4 since the race was introduced after commit 84c3fc4e9c56
("mm: thp: check pmd migration entry in common path").

------8<------
From 3cad29977e81cebbb0c9e9f4a7ac58d9353af619 Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Wed, 4 Jan 2023 17:52:06 -0500
Subject: [PATCH 1/2] mm/mprotect: use long for page accountings and retval

commit a79390f5d6a78647fd70856bd42b22d994de0ba2 upstream.

Switch to use type "long" for page accountings and retval across the whole
procedure of change_protection().

The change should have shrinked the possible maximum page number to be
half comparing to previous (ULONG_MAX / 2), but it shouldn't overflow on
any system either because the maximum possible pages touched by change
protection should be ULONG_MAX / PAGE_SIZE.

Two reasons to switch from "unsigned long" to "long":

  1. It suites better on count_vm_numa_events(), whose 2nd parameter takes
     a long type.

  2. It paves way for returning negative (error) values in the future.

Currently the only caller that consumes this retval is change_prot_numa(),
where the unsigned long was converted to an int.  Since at it, touching up
the numa code to also take a long, so it'll avoid any possible overflow
too during the int-size convertion.

Link: https://lkml.kernel.org/r/20230104225207.1066932-3-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: James Houghton <jthoughton@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[ Adjust context ]
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
---
 include/linux/hugetlb.h |  4 ++--
 include/linux/mm.h      |  2 +-
 mm/hugetlb.c            |  4 ++--
 mm/mempolicy.c          |  2 +-
 mm/mprotect.c           | 26 +++++++++++++-------------
 5 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 7733abdaf039..772ce73fc5ed 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -213,7 +213,7 @@ struct page *follow_huge_pgd(struct mm_struct *mm, unsigned long address,

 int pmd_huge(pmd_t pmd);
 int pud_huge(pud_t pud);
-unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
+long hugetlb_change_protection(struct vm_area_struct *vma,
 		unsigned long address, unsigned long end, pgprot_t newprot);

 bool is_hugetlb_entry_migration(pte_t pte);
@@ -398,7 +398,7 @@ static inline void move_hugetlb_state(struct page *oldpage,
 {
 }

-static inline unsigned long hugetlb_change_protection(
+static inline long hugetlb_change_protection(
 			struct vm_area_struct *vma, unsigned long address,
 			unsigned long end, pgprot_t newprot)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3a076a98733d..537575ea46e2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1905,7 +1905,7 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
 #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
 					    MM_CP_UFFD_WP_RESOLVE)

-extern unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
+extern long change_protection(struct vm_area_struct *vma, unsigned long start,
 			      unsigned long end, pgprot_t newprot,
 			      unsigned long cp_flags);
 extern int mprotect_fixup(struct vm_area_struct *vma,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 116636bddc46..3f4615af577c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6339,7 +6339,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	return i ? i : err;
 }

-unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
+long hugetlb_change_protection(struct vm_area_struct *vma,
 		unsigned long address, unsigned long end, pgprot_t newprot)
 {
 	struct mm_struct *mm = vma->vm_mm;
@@ -6347,7 +6347,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	pte_t *ptep;
 	pte_t pte;
 	struct hstate *h = hstate_vma(vma);
-	unsigned long pages = 0;
+	long pages = 0;
 	bool shared_pmd = false;
 	struct mmu_notifier_range range;

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d5224a8ec9c0..5b73c56b37ed 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -630,7 +630,7 @@ static int queue_pages_hugetlb(pte_t *pte, unsigned long hmask,
 unsigned long change_prot_numa(struct vm_area_struct *vma,
 			unsigned long addr, unsigned long end)
 {
-	int nr_updated;
+	long nr_updated;

 	nr_updated = change_protection(vma, addr, end, PAGE_NONE, MM_CP_PROT_NUMA);
 	if (nr_updated)
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 29c246c37636..22024da41597 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -35,13 +35,13 @@

 #include "internal.h"

-static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
+static long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
 		unsigned long cp_flags)
 {
 	pte_t *pte, oldpte;
 	spinlock_t *ptl;
-	unsigned long pages = 0;
+	long pages = 0;
 	int target_node = NUMA_NO_NODE;
 	bool dirty_accountable = cp_flags & MM_CP_DIRTY_ACCT;
 	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
@@ -219,13 +219,13 @@ static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
 	return 0;
 }

-static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
+static inline long change_pmd_range(struct vm_area_struct *vma,
 		pud_t *pud, unsigned long addr, unsigned long end,
 		pgprot_t newprot, unsigned long cp_flags)
 {
 	pmd_t *pmd;
 	unsigned long next;
-	unsigned long pages = 0;
+	long pages = 0;
 	unsigned long nr_huge_updates = 0;
 	struct mmu_notifier_range range;

@@ -233,7 +233,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,

 	pmd = pmd_offset(pud, addr);
 	do {
-		unsigned long this_pages;
+		long this_pages;

 		next = pmd_addr_end(addr, end);

@@ -291,13 +291,13 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	return pages;
 }

-static inline unsigned long change_pud_range(struct vm_area_struct *vma,
+static inline long change_pud_range(struct vm_area_struct *vma,
 		p4d_t *p4d, unsigned long addr, unsigned long end,
 		pgprot_t newprot, unsigned long cp_flags)
 {
 	pud_t *pud;
 	unsigned long next;
-	unsigned long pages = 0;
+	long pages = 0;

 	pud = pud_offset(p4d, addr);
 	do {
@@ -311,13 +311,13 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
 	return pages;
 }

-static inline unsigned long change_p4d_range(struct vm_area_struct *vma,
+static inline long change_p4d_range(struct vm_area_struct *vma,
 		pgd_t *pgd, unsigned long addr, unsigned long end,
 		pgprot_t newprot, unsigned long cp_flags)
 {
 	p4d_t *p4d;
 	unsigned long next;
-	unsigned long pages = 0;
+	long pages = 0;

 	p4d = p4d_offset(pgd, addr);
 	do {
@@ -331,7 +331,7 @@ static inline unsigned long change_p4d_range(struct vm_area_struct *vma,
 	return pages;
 }

-static unsigned long change_protection_range(struct vm_area_struct *vma,
+static long change_protection_range(struct vm_area_struct *vma,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
 		unsigned long cp_flags)
 {
@@ -339,7 +339,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 	pgd_t *pgd;
 	unsigned long next;
 	unsigned long start = addr;
-	unsigned long pages = 0;
+	long pages = 0;

 	BUG_ON(addr >= end);
 	pgd = pgd_offset(mm, addr);
@@ -361,11 +361,11 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 	return pages;
 }

-unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
+long change_protection(struct vm_area_struct *vma, unsigned long start,
 		       unsigned long end, pgprot_t newprot,
 		       unsigned long cp_flags)
 {
-	unsigned long pages;
+	long pages;

 	BUG_ON((cp_flags & MM_CP_UFFD_WP_ALL) == MM_CP_UFFD_WP_ALL);

--
2.43.0

------8<------
From 0db554652a1fe67841e4660ae8e87759d0404db4 Mon Sep 17 00:00:00 2001
From: Hugh Dickins <hughd@google.com>
Date: Thu, 8 Jun 2023 18:30:48 -0700
Subject: [PATCH 2/2] mm/mprotect: delete
 pmd_none_or_clear_bad_unless_trans_huge()
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

commit 670ddd8cdcbd1d07a4571266ae3517f821728c3a upstream.

change_pmd_range() had special pmd_none_or_clear_bad_unless_trans_huge(),
required to avoid "bad" choices when setting automatic NUMA hinting under
mmap_read_lock(); but most of that is already covered in pte_offset_map()
now.  change_pmd_range() just wants a pmd_none() check before wasting time
on MMU notifiers, then checks on the read-once _pmd value to work out
what's needed for huge cases.  If change_pte_range() returns -EAGAIN to
retry if pte_offset_map_lock() fails, nothing more special is needed.

Link: https://lkml.kernel.org/r/725a42a9-91e9-c868-925-e3a5fd40bb4f@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zack Rusin <zackr@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[ Background:

    It was reported that a bad pmd is seen when automatic NUMA balancing
    is marking page table entries as prot_numa:

      [2437548.196018] mm/pgtable-generic.c:50: bad pmd 00000000af22fc02(dffffffe71fbfe02)
      [2437548.235022] Call Trace:
      [2437548.238234]  <TASK>
      [2437548.241060]  dump_stack_lvl+0x46/0x61
      [2437548.245689]  panic+0x106/0x2e5
      [2437548.249497]  pmd_clear_bad+0x3c/0x3c
      [2437548.253967]  change_pmd_range.isra.0+0x34d/0x3a7
      [2437548.259537]  change_p4d_range+0x156/0x20e
      [2437548.264392]  change_protection_range+0x116/0x1a9
      [2437548.269976]  change_prot_numa+0x15/0x37
      [2437548.274774]  task_numa_work+0x1b8/0x302
      [2437548.279512]  task_work_run+0x62/0x95
      [2437548.283882]  exit_to_user_mode_loop+0x1a4/0x1a9
      [2437548.289277]  exit_to_user_mode_prepare+0xf4/0xfc
      [2437548.294751]  ? sysvec_apic_timer_interrupt+0x34/0x81
      [2437548.300677]  irqentry_exit_to_user_mode+0x5/0x25
      [2437548.306153]  asm_sysvec_apic_timer_interrupt+0x16/0x1b

    This is due to a race condition between change_prot_numa() and
    THP migration because the kernel doesn't check is_swap_pmd() and
    pmd_trans_huge() atomically:

    change_prot_numa()                      THP migration
    ======================================================================
    - change_pmd_range()
    -> is_swap_pmd() returns false,
    meaning it's not a PMD migration
    entry.
                                      - do_huge_pmd_numa_page()
                                      -> migrate_misplaced_page() sets
                                         migration entries for the THP.
    - change_pmd_range()
    -> pmd_none_or_clear_bad_unless_trans_huge()
    -> pmd_none() and pmd_trans_huge() returns false
    - pmd_none_or_clear_bad_unless_trans_huge()
    -> pmd_bad() returns true for the migration entry!

  The upstream commit 670ddd8cdcbd ("mm/mprotect: delete
  pmd_none_or_clear_bad_unless_trans_huge()") closes this race condition
  by checking is_swap_pmd() and pmd_trans_huge() atomically.

  Backporting note:
    Unlike mainline, pte_offset_map_lock() does not check if the pmd
    entry is a migration entry or a hugepage; acquires PTL unconditionally
    instead of returning failure. Therefore, it is necessary to keep the
    !is_swap_pmd() && !pmd_trans_huge() && !pmd_devmap() check before
    acquiring the PTL.

    After acquiring it, open-code the mainline semantics of
    pte_offset_map_lock() so that change_pte_range() fails if the pmd value
    has changed (under the PTL). This requires adding one more parameter
    (for passing pmd value that is read before calling the function) to
    change_pte_range(). ]
---
 mm/mprotect.c | 97 ++++++++++++++++++++++-----------------------------
 1 file changed, 41 insertions(+), 56 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 22024da41597..5367d03a071d 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -36,10 +36,11 @@
 #include "internal.h"

 static long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
-		unsigned long addr, unsigned long end, pgprot_t newprot,
-		unsigned long cp_flags)
+		pmd_t pmd_old, unsigned long addr, unsigned long end,
+		pgprot_t newprot, unsigned long cp_flags)
 {
 	pte_t *pte, oldpte;
+	pmd_t pmd_val;
 	spinlock_t *ptl;
 	long pages = 0;
 	int target_node = NUMA_NO_NODE;
@@ -48,21 +49,16 @@ static long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
 	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;

-	/*
-	 * Can be called with only the mmap_lock for reading by
-	 * prot_numa so we must check the pmd isn't constantly
-	 * changing from under us from pmd_none to pmd_trans_huge
-	 * and/or the other way around.
-	 */
-	if (pmd_trans_unstable(pmd))
-		return 0;

-	/*
-	 * The pmd points to a regular pte so the pmd can't change
-	 * from under us even if the mmap_lock is only hold for
-	 * reading.
-	 */
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	/* Make sure pmd didn't change after acquiring ptl */
+	pmd_val = pmd_read_atomic(pmd);
+	/* See pmd_none_or_trans_huge_or_clear_bad for info on barrier */
+	barrier();
+	if (!pmd_same(pmd_old, pmd_val)) {
+		pte_unmap_unlock(pte, ptl);
+		return -EAGAIN;
+	}

 	/* Get target node for single threaded private VMAs */
 	if (prot_numa && !(vma->vm_flags & VM_SHARED) &&
@@ -194,31 +190,6 @@ static long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 	return pages;
 }

-/*
- * Used when setting automatic NUMA hinting protection where it is
- * critical that a numa hinting PMD is not confused with a bad PMD.
- */
-static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
-{
-	pmd_t pmdval = pmd_read_atomic(pmd);
-
-	/* See pmd_none_or_trans_huge_or_clear_bad for info on barrier */
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	barrier();
-#endif
-
-	if (pmd_none(pmdval))
-		return 1;
-	if (pmd_trans_huge(pmdval))
-		return 0;
-	if (unlikely(pmd_bad(pmdval))) {
-		pmd_clear_bad(pmd);
-		return 1;
-	}
-
-	return 0;
-}
-
 static inline long change_pmd_range(struct vm_area_struct *vma,
 		pud_t *pud, unsigned long addr, unsigned long end,
 		pgprot_t newprot, unsigned long cp_flags)
@@ -233,21 +204,33 @@ static inline long change_pmd_range(struct vm_area_struct *vma,

 	pmd = pmd_offset(pud, addr);
 	do {
-		long this_pages;
-
+		long ret;
+		pmd_t _pmd;
+again:
 		next = pmd_addr_end(addr, end);
+	        _pmd = pmd_read_atomic(pmd);
+		/* See pmd_none_or_trans_huge_or_clear_bad for info on barrier */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+		barrier();
+#endif

 		/*
 		 * Automatic NUMA balancing walks the tables with mmap_lock
 		 * held for read. It's possible a parallel update to occur
-		 * between pmd_trans_huge() and a pmd_none_or_clear_bad()
-		 * check leading to a false positive and clearing.
-		 * Hence, it's necessary to atomically read the PMD value
-		 * for all the checks.
+		 * between pmd_trans_huge(), is_swap_pmd(), and
+		 * a pmd_none_or_clear_bad() check leading to a false positive
+		 * and clearing. Hence, it's necessary to atomically read
+		 * the PMD value for all the checks.
 		 */
-		if (!is_swap_pmd(*pmd) && !pmd_devmap(*pmd) &&
-		     pmd_none_or_clear_bad_unless_trans_huge(pmd))
-			goto next;
+		if (!is_swap_pmd(_pmd) && !pmd_devmap(_pmd) && !pmd_trans_huge(_pmd)) {
+			if (pmd_none(_pmd))
+				goto next;
+
+			if (pmd_bad(_pmd)) {
+				pmd_clear_bad(pmd);
+				goto next;
+			}
+		}

 		/* invoke the mmu notifier if the pmd is populated */
 		if (!range.start) {
@@ -257,15 +240,15 @@ static inline long change_pmd_range(struct vm_area_struct *vma,
 			mmu_notifier_invalidate_range_start(&range);
 		}

-		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
+		if (is_swap_pmd(_pmd) || pmd_trans_huge(_pmd) || pmd_devmap(_pmd)) {
 			if (next - addr != HPAGE_PMD_SIZE) {
 				__split_huge_pmd(vma, pmd, addr, false, NULL);
 			} else {
-				int nr_ptes = change_huge_pmd(vma, pmd, addr,
+				ret = change_huge_pmd(vma, pmd, addr,
 							      newprot, cp_flags);

-				if (nr_ptes) {
-					if (nr_ptes == HPAGE_PMD_NR) {
+				if (ret) {
+					if (ret == HPAGE_PMD_NR) {
 						pages += HPAGE_PMD_NR;
 						nr_huge_updates++;
 					}
@@ -276,9 +259,11 @@ static inline long change_pmd_range(struct vm_area_struct *vma,
 			}
 			/* fall through, the trans huge pmd just split */
 		}
-		this_pages = change_pte_range(vma, pmd, addr, next, newprot,
-					      cp_flags);
-		pages += this_pages;
+		ret = change_pte_range(vma, pmd, _pmd, addr, next,
+					      newprot, cp_flags);
+		if (ret < 0)
+			goto again;
+		pages += ret;
 next:
 		cond_resched();
 	} while (pmd++, addr = next, addr != end);
--
2.43.0



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [DISCUSSION] Fixing bad pmd due to a race condition between change_prot_numa() and THP migration in pre-6.5 kernels.
  2025-10-20 13:25               ` Harry Yoo
@ 2025-10-20 14:07                 ` David Hildenbrand
  0 siblings, 0 replies; 10+ messages in thread
From: David Hildenbrand @ 2025-10-20 14:07 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Andrew Morton, Lorenzo Stoakes, Zi Yan, Baolin Wang,
	Liam R . Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Lance Yang, Kiryl Shutsemau, Hugh Dickins, Vlastimil Babka,
	Jann Horn, Pedro Falcato, Jane Chu, linux-mm, stable


>>
>>>>>>> It'd be more robust to do something like:
>>>>>>
>>>>>> That's also what I had in mind. But all this lockless stuff makes me a bit
>>>>>> nervous :)
>>>>>
>>>>> Yeah the code is not very straightforward... :/
>>>>>
>>>>> But technically the diff that I pasted here should be enough to fix
>>>>> this... or do you have any alternative approach in mind?
>>>>
>>>> Hopefully, I'm not convinced this code is not buggy, but at least regarding
>>>> concurrent migration it should be fine with that.
>>>
>>> I've been thinking about this...
>>>
>>> Actually, it'll make more sense to open-code what pte_map_offset_lock()
>>> does in the mainline:
>>>
>>> 1. do not remove the "bad pte" checks, because pte_offset_map() in pre-6.5
>>>      kernels doesn't do the check for us unlike the mainline.
>>> 2. check is_swap_pmd(), pmd_trans_huge(), pmd_devmap() without ptl, but
>>>      atomically.
>>> 3. after acquiring ptl in change_pte_range(), check if pmd has changed
>>>      since step 1 and 2. if yes, retry (like mainline). if no, we're all good.
>>>
>>> What do you think?
> 
> Apologies for late reply...
> 
>> Only for -stable, right?
> 
> Right!
> 
>> Does not sound too wrong for me, but I would have
>> to take a closer look at the end result!
> 
> FYI I'll test these two patches and see if they survive for a week,
> and then submit them to -stable. Thanks.
> 
> The first patch is also going to be backported since change_pte_range()
> cannot return negative values without it.

That sound reasonable to me!

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2025-10-20 14:07 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-09-21 23:27 [DISCUSSION] Fixing bad pmd due to a race condition between change_prot_numa() and THP migration in pre-6.5 kernels Harry Yoo
2025-09-23  9:00 ` David Hildenbrand
2025-09-23 11:46   ` Harry Yoo
2025-09-23 14:09     ` David Hildenbrand
2025-09-24 11:54       ` Harry Yoo
2025-09-24 15:52         ` David Hildenbrand
2025-10-02 14:07           ` Harry Yoo
2025-10-06  8:18             ` David Hildenbrand
2025-10-20 13:25               ` Harry Yoo
2025-10-20 14:07                 ` David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox