linux-mm.kvack.org archive mirror
* [Patch v3] mm/huge_memory: fix early failure try_to_migrate() when split huge pmd for shared thp
@ 2026-02-05  3:31 Wei Yang
  2026-02-09 17:08 ` Lorenzo Stoakes
  0 siblings, 1 reply; 5+ messages in thread
From: Wei Yang @ 2026-02-05  3:31 UTC (permalink / raw)
  To: akpm, david, lorenzo.stoakes, riel, Liam.Howlett, vbabka,
	harry.yoo, jannh, gavinguo, baolin.wang, ziy
  Cc: linux-mm, Wei Yang, Lance Yang, stable

Commit 60fbb14396d5 ("mm/huge_memory: adjust try_to_migrate_one() and
split_huge_pmd_locked()") made try_to_migrate_one() return false
unconditionally after split_huge_pmd_locked(), which makes try_to_migrate()
fail early for a shared thp. This leads to unexpected folio split failures.

One way to reproduce:

    Create an anonymous thp range and fork 512 children, so we have a
    thp shared mapped in 513 processes. Then trigger folio split with
    /sys/kernel/debug/split_huge_pages debugfs to split the thp folio to
    order 0.

Without the above commit, we can successfully split the folio to order 0.
With the above commit, the folio is still a large folio.

The reason is that the above commit returns false unconditionally after
splitting the pmd, so the rmap walk stops at the first process and
try_to_migrate() is cut short.

On memory pressure or memory failure, we try to reclaim unused memory or
contain hwpoisoned memory after a folio split. If we fail to split the
folio, we leave more memory unusable than expected.

The tricky thing about the above reproduction method is that the current
debugfs interface uses split_huge_pages_pid(), which iterates over the
whole pmd range and attempts a folio split at each base page address. This
means it tries 512 times, each time splitting one pmd so the thp goes from
pmd mapped to pte mapped in that process. If fewer than 512 processes share
the mapping, the folio is still split successfully in the end. But in the
real world, we usually try only once.

This patch fixes this by restarting page_vma_mapped_walk() after
split_huge_pmd_locked(). We cannot simply return "true" to fix the
problem, as that would affect another case:
split_huge_pmd_locked()->folio_try_share_anon_rmap_pmd() can fail and
leave the folio mapped through PTEs; we would return "true" from
try_to_migrate_one() in that case as well. While that is mostly
harmless, we could end up walking the rmap again, wasting some cycles.

Fixes: 60fbb14396d5 ("mm/huge_memory: adjust try_to_migrate_one() and split_huge_pmd_locked()")
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Tested-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Gavin Guo <gavinguo@igalia.com>
Acked-by: David Hildenbrand (arm) <david@kernel.org>
Cc: Gavin Guo <gavinguo@igalia.com>
Cc: "David Hildenbrand (Red Hat)" <david@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: <stable@vger.kernel.org>

---
v3:
  * gather RB
  * adjust the commit log and comment per David
  * add userspace-visible runtime effect in change log
v2:
  * restart page_vma_mapped_walk() after split_huge_pmd_locked()
---
 mm/rmap.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 618df3385c8b..1041a64b8e6b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2446,11 +2446,17 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 			__maybe_unused pmd_t pmdval;
 
 			if (flags & TTU_SPLIT_HUGE_PMD) {
+				/*
+				 * split_huge_pmd_locked() might leave the
+				 * folio mapped through PTEs. Retry the walk
+				 * so we can detect this scenario and properly
+				 * abort the walk.
+				 */
 				split_huge_pmd_locked(vma, pvmw.address,
 						      pvmw.pmd, true);
-				ret = false;
-				page_vma_mapped_walk_done(&pvmw);
-				break;
+				flags &= ~TTU_SPLIT_HUGE_PMD;
+				page_vma_mapped_walk_restart(&pvmw);
+				continue;
 			}
 #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
 			pmdval = pmdp_get(pvmw.pmd);
-- 
2.34.1



^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Patch v3] mm/huge_memory: fix early failure try_to_migrate() when split huge pmd for shared thp
  2026-02-05  3:31 [Patch v3] mm/huge_memory: fix early failure try_to_migrate() when split huge pmd for shared thp Wei Yang
@ 2026-02-09 17:08 ` Lorenzo Stoakes
  2026-02-10  3:23   ` Wei Yang
  0 siblings, 1 reply; 5+ messages in thread
From: Lorenzo Stoakes @ 2026-02-09 17:08 UTC (permalink / raw)
  To: Wei Yang
  Cc: akpm, david, riel, Liam.Howlett, vbabka, harry.yoo, jannh,
	gavinguo, baolin.wang, ziy, linux-mm, Lance Yang, stable

On Thu, Feb 05, 2026 at 03:31:13AM +0000, Wei Yang wrote:
> Commit 60fbb14396d5 ("mm/huge_memory: adjust try_to_migrate_one() and
> split_huge_pmd_locked()") return false unconditionally after
> split_huge_pmd_locked() which may fail early during try_to_migrate() for
> shared thp. This will lead to unexpected folio split failure.

I think this could be put more clearly. 'When splitting a PMD THP migration
entry in try_to_migrate_one() in a rmap walk invoked by try_to_migrate() when
TTU_SPLIT_HUGE_PMD is specified.' or something like that.

>
> One way to reproduce:
>
>     Create an anonymous thp range and fork 512 children, so we have a
>     thp shared mapped in 513 processes. Then trigger folio split with
>     /sys/kernel/debug/split_huge_pages debugfs to split the thp folio to
>     order 0.

I think you should explain the issue before the repro. This is just confusing
things. Mention the repro _afterwards_.

>
> Without the above commit, we can successfully split to order 0.
> With the above commit, the folio is still a large folio.
>
> The reason is the above commit return false after split pmd

This sentence doesn't really make sense. Returns false where? And under what
circumstances?

I'm having to look through 60fbb14396d5 to understand this which isn't a good
sign.

'This patch adjusted try_to_migrate_one() to, when a PMD-mapped THP migration
entry is found, and TTU_SPLIT_HUGE_PMD is specified (for example, via
unmap_folio()), exit the walk and return false unconditionally'.

> unconditionally in the first process and break try_to_migrate().
>
> On memory pressure or failure, we would try to reclaim unused memory or
> limit bad memory after folio split. If failed to split it, we will leave

Limit bad memory? What does that mean? Also should be If '_we_' or '_it_' or
something like that.

> some more memory unusable than expected.

'We will leave some more memory unusable than expected' is super unclear.

You mean we will fail to migrate THP entries at the PTE level?

Can we say this instead please?

>
> The tricky thing in above reproduce method is current debugfs interface
> leverage function split_huge_pages_pid(), which will iterate the whole
> pmd range and do folio split on each base page address. This means it
> will try 512 times, and each time split one pmd from pmd mapped to pte
> mapped thp. If there are less than 512 shared mapped process,
> the folio is still split successfully at last. But in real world, we
> usually try it for once.

This whole sentence could be dropped, I think; I don't think it adds anything.

And you're really confusing the issue by dwelling on this I think.

You need to restart the walk in this case in order for the PTEs to be correctly
handled right?

Can you explain why we can't just essentially revert 60fbb14396d5? Or at least
the bit that did this change?

Also, is unmap_folio() the only caller with TTU_SPLIT_HUGE_PMD, as the comment
that was deleted by 60fbb14396d5 implied? Or are there others? If it is, please
mention it in the commit msg.


>
> This patch fixes this by restart page_vma_mapped_walk() after
> split_huge_pmd_locked(). We cannot simply return "true" to fix the
> problem, as that would affect another case:

I mean how would it fix the problem to incorrectly have it return true when the
walk had not in fact completed?

I'm not sure why you're dwelling on this idea in the commit msg?

> split_huge_pmd_locked()->folio_try_share_anon_rmap_pmd() can failed and
> leave the folio mapped through PTEs; we would return "true" from
> try_to_migrate_one() in that case as well. While that is mostly
> harmless, we could end up walking the rmap, wasting some cycles.

I mean I think we can just drop this whole paragraph no?

You might think I'm being picky about the commit msg here, but as is I find it
pretty much incomprehensible and that's not helpful if we have to go back and
read this in future.

>
> Fixes: 60fbb14396d5 ("mm/huge_memory: adjust try_to_migrate_one() and split_huge_pmd_locked()")
> Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Reviewed-by: Zi Yan <ziy@nvidia.com>
> Tested-by: Lance Yang <lance.yang@linux.dev>
> Reviewed-by: Lance Yang <lance.yang@linux.dev>
> Reviewed-by: Gavin Guo <gavinguo@igalia.com>
> Acked-by: David Hildenbrand (arm) <david@kernel.org>
> Cc: Gavin Guo <gavinguo@igalia.com>
> Cc: "David Hildenbrand (Red Hat)" <david@kernel.org>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Lance Yang <lance.yang@linux.dev>
> Cc: <stable@vger.kernel.org>
>
> ---
> v3:
>   * gather RB
>   * adjust the commit log and comment per David

Clearly not enough :)

>   * add userspace-visible runtime effect in change log

Which one was that?

> v2:
>   * restart page_vma_mapped_walk() after split_huge_pmd_locked()
> ---
>  mm/rmap.c | 12 +++++++++---
>  1 file changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 618df3385c8b..1041a64b8e6b 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -2446,11 +2446,17 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>  			__maybe_unused pmd_t pmdval;
>
>  			if (flags & TTU_SPLIT_HUGE_PMD) {
> +				/*
> +				 * split_huge_pmd_locked() might leave the
> +				 * folio mapped through PTEs. Retry the walk
> +				 * so we can detect this scenario and properly
> +				 * abort the walk.
> +				 */

This comment is a lot clearer than the commit msg :)

>  				split_huge_pmd_locked(vma, pvmw.address,
>  						      pvmw.pmd, true);
> -				ret = false;
> -				page_vma_mapped_walk_done(&pvmw);
> -				break;
> +				flags &= ~TTU_SPLIT_HUGE_PMD;
> +				page_vma_mapped_walk_restart(&pvmw);
> +				continue;

This logic does look reasonable.

>  			}
>  #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
>  			pmdval = pmdp_get(pvmw.pmd);
> --
> 2.34.1
>

Cheers, Lorenzo



* Re: [Patch v3] mm/huge_memory: fix early failure try_to_migrate() when split huge pmd for shared thp
  2026-02-09 17:08 ` Lorenzo Stoakes
@ 2026-02-10  3:23   ` Wei Yang
  2026-02-13 13:20     ` Wei Yang
  0 siblings, 1 reply; 5+ messages in thread
From: Wei Yang @ 2026-02-10  3:23 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Wei Yang, akpm, david, riel, Liam.Howlett, vbabka, harry.yoo,
	jannh, gavinguo, baolin.wang, ziy, linux-mm, Lance Yang, stable

On Mon, Feb 09, 2026 at 05:08:16PM +0000, Lorenzo Stoakes wrote:
>On Thu, Feb 05, 2026 at 03:31:13AM +0000, Wei Yang wrote:
>> Commit 60fbb14396d5 ("mm/huge_memory: adjust try_to_migrate_one() and
>> split_huge_pmd_locked()") return false unconditionally after
>> split_huge_pmd_locked() which may fail early during try_to_migrate() for
>> shared thp. This will lead to unexpected folio split failure.
>
>I think this could be put more clearly. 'When splitting a PMD THP migration
>entry in try_to_migrate_one() in a rmap walk invoked by try_to_migrate() when

split_huge_pmd_locked() could split a PMD THP migration entry, but here we
expect a PMD THP normal entry.

>TTU_SPLIT_HUGE_PMD is specified.' or something like that.
>
>>
>> One way to reproduce:
>>
>>     Create an anonymous thp range and fork 512 children, so we have a
>>     thp shared mapped in 513 processes. Then trigger folio split with
>>     /sys/kernel/debug/split_huge_pages debugfs to split the thp folio to
>>     order 0.
>
>I think you should explain the issue before the repro. This is just confusing
>things. Mention the repro _afterwards_.
>

OK, will move afterwards.

>>
>> Without the above commit, we can successfully split to order 0.
>> With the above commit, the folio is still a large folio.
>>
>> The reason is the above commit return false after split pmd
>
>This sentence doesn't really make sense. Returns false where? And under what
>circumstances?
>
>I'm having to look through 60fbb14396d5 to understand this which isn't a good
>sign.
>
>'This patch adjusted try_to_migrate_one() to, when a PMD-mapped THP migration

I am afraid the original intention of commit 60fbb14396d5 is not just about
migration entries.

>entry is found, and TTU_SPLIT_HUGE_PMD is specified (for example, via
>unmap_folio()), exit the walk and return false unconditionally'.
>
>> unconditionally in the first process and break try_to_migrate().
>>
>> On memory pressure or failure, we would try to reclaim unused memory or
>> limit bad memory after folio split. If failed to split it, we will leave
>
>Limit bad memory? What does that mean? Also should be If '_we_' or '_it_' or
>something like that.
>

What I mean is that in memory_failure() we use try_to_split_thp_page(), and
the PG_has_hwpoisoned bit is only set on the after-split folio that contains
@split_at.

>> some more memory unusable than expected.
>
>'We will leave some more memory unusable than expected' is super unclear.
>
>You mean we will fail to migrate THP entries at the PTE level?
>

No. 

Hmm... I would like to clarify before continuing.

This fix is not for the migration case. It fixes folio split for a shared-
mapped PMD THP. The current folio split code leverages migration entries
when splitting an anonymous folio, so the action here is not to actually
migrate it.

I am a little lost here.

>Can we say this instead please?
>
>>
>> The tricky thing in above reproduce method is current debugfs interface
>> leverage function split_huge_pages_pid(), which will iterate the whole
>> pmd range and do folio split on each base page address. This means it
>> will try 512 times, and each time split one pmd from pmd mapped to pte
>> mapped thp. If there are less than 512 shared mapped process,
>> the folio is still split successfully at last. But in real world, we
>> usually try it for once.
>
>This whole sentence could be dropped I think I don't think it adds anything.
>
>And you're really confusing the issue by dwelling on this I think.
>
>You need to restart the walk in this case in order for the PTEs to be correctly
>handled right?
>
>Can you explain why we can't just essentially revert 60fbb14396d5? Or at least
>the bit that did this change?
>
>Also is unmap_folio() the only caller with TTU_SPLIT_HUGE_PMD as the comment
>that was deleted by 60fbb14396d5 implied? Or are there others? If it is, please
>mention the commit msg.
>
>
>>
>> This patch fixes this by restart page_vma_mapped_walk() after
>> split_huge_pmd_locked(). We cannot simply return "true" to fix the
>> problem, as that would affect another case:
>
>I mean how would it fix the problem to incorrectly have it return true when the
>walk had not in fact completed?
>
>I'm not sure why you're dwelling on this idea in the commit msg?
>
>> split_huge_pmd_locked()->folio_try_share_anon_rmap_pmd() can failed and
>> leave the folio mapped through PTEs; we would return "true" from
>> try_to_migrate_one() in that case as well. While that is mostly
>> harmless, we could end up walking the rmap, wasting some cycles.
>
>I mean I think we can just drop this whole paragraph no?
>
>You might think I'm being picky about the commit msg here, but as is I find it
>pretty much incomprehensible and that's not helpful if we have to go back and
>read this in future.
>

Never mind.

A clearer and more comprehensive change log is helpful for everyone. English
is not my native language, so your suggestions help a lot.

>>
>> Fixes: 60fbb14396d5 ("mm/huge_memory: adjust try_to_migrate_one() and split_huge_pmd_locked()")
>> Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
>> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Reviewed-by: Zi Yan <ziy@nvidia.com>
>> Tested-by: Lance Yang <lance.yang@linux.dev>
>> Reviewed-by: Lance Yang <lance.yang@linux.dev>
>> Reviewed-by: Gavin Guo <gavinguo@igalia.com>
>> Acked-by: David Hildenbrand (arm) <david@kernel.org>
>> Cc: Gavin Guo <gavinguo@igalia.com>
>> Cc: "David Hildenbrand (Red Hat)" <david@kernel.org>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Cc: Lance Yang <lance.yang@linux.dev>
>> Cc: <stable@vger.kernel.org>
>>
>> ---
>> v3:
>>   * gather RB
>>   * adjust the commit log and comment per David
>
>Clearly not enough :)
>
>>   * add userspace-visible runtime effect in change log
>
>Which one was that?
>
>> v2:
>>   * restart page_vma_mapped_walk() after split_huge_pmd_locked()
>> ---
>>  mm/rmap.c | 12 +++++++++---
>>  1 file changed, 9 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 618df3385c8b..1041a64b8e6b 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -2446,11 +2446,17 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>>  			__maybe_unused pmd_t pmdval;
>>
>>  			if (flags & TTU_SPLIT_HUGE_PMD) {
>> +				/*
>> +				 * split_huge_pmd_locked() might leave the
>> +				 * folio mapped through PTEs. Retry the walk
>> +				 * so we can detect this scenario and properly
>> +				 * abort the walk.
>> +				 */
>
>This comment is a lot clearer than the commit msg :)
>
>>  				split_huge_pmd_locked(vma, pvmw.address,
>>  						      pvmw.pmd, true);
>> -				ret = false;
>> -				page_vma_mapped_walk_done(&pvmw);
>> -				break;
>> +				flags &= ~TTU_SPLIT_HUGE_PMD;
>> +				page_vma_mapped_walk_restart(&pvmw);
>> +				continue;
>
>This logic does lok reasonable.
>
>>  			}
>>  #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
>>  			pmdval = pmdp_get(pvmw.pmd);
>> --
>> 2.34.1
>>
>
>Cheers, Lorenzo

-- 
Wei Yang
Help you, Help me



* Re: [Patch v3] mm/huge_memory: fix early failure try_to_migrate() when split huge pmd for shared thp
  2026-02-10  3:23   ` Wei Yang
@ 2026-02-13 13:20     ` Wei Yang
  2026-02-22  0:50       ` Wei Yang
  0 siblings, 1 reply; 5+ messages in thread
From: Wei Yang @ 2026-02-13 13:20 UTC (permalink / raw)
  To: Wei Yang
  Cc: Lorenzo Stoakes, akpm, david, riel, Liam.Howlett, vbabka,
	harry.yoo, jannh, gavinguo, baolin.wang, ziy, linux-mm,
	Lance Yang, stable

On Tue, Feb 10, 2026 at 03:23:04AM +0000, Wei Yang wrote:
>On Mon, Feb 09, 2026 at 05:08:16PM +0000, Lorenzo Stoakes wrote:
>>On Thu, Feb 05, 2026 at 03:31:13AM +0000, Wei Yang wrote:
>>> Commit 60fbb14396d5 ("mm/huge_memory: adjust try_to_migrate_one() and
>>> split_huge_pmd_locked()") return false unconditionally after
>>> split_huge_pmd_locked() which may fail early during try_to_migrate() for
>>> shared thp. This will lead to unexpected folio split failure.
>>
>>I think this could be put more clearly. 'When splitting a PMD THP migration
>>entry in try_to_migrate_one() in a rmap walk invoked by try_to_migrate() when
>
>split_huge_pmd_locked() could split a PMD THP migration entry, but here we
>expect a PMD THP normal entry.
>
>>TTU_SPLIT_HUGE_PMD is specified.' or something like that.
>>
>>>
>>> One way to reproduce:
>>>
>>>     Create an anonymous thp range and fork 512 children, so we have a
>>>     thp shared mapped in 513 processes. Then trigger folio split with
>>>     /sys/kernel/debug/split_huge_pages debugfs to split the thp folio to
>>>     order 0.
>>
>>I think you should explain the issue before the repro. This is just confusing
>>things. Mention the repro _afterwards_.
>>
>
>OK, will move afterwards.
>
>>>
>>> Without the above commit, we can successfully split to order 0.
>>> With the above commit, the folio is still a large folio.
>>>
>>> The reason is the above commit return false after split pmd
>>
>>This sentence doesn't really make sense. Returns false where? And under what
>>circumstances?
>>
>>I'm having to look through 60fbb14396d5 to understand this which isn't a good
>>sign.
>>
>>'This patch adjusted try_to_migrate_one() to, when a PMD-mapped THP migration
>
>I am afraid the original intention of commit 60fbb14396d5 is not just for
>migration entry.
>
>>entry is found, and TTU_SPLIT_HUGE_PMD is specified (for example, via
>>unmap_folio()), exit the walk and return false unconditionally'.
>>
>>> unconditionally in the first process and break try_to_migrate().
>>>
>>> On memory pressure or failure, we would try to reclaim unused memory or
>>> limit bad memory after folio split. If failed to split it, we will leave
>>
>>Limit bad memory? What does that mean? Also should be If '_we_' or '_it_' or
>>something like that.
>>
>
>What I want to mean is in memory_failure() we use try_to_split_thp_page() and
>the PG_has_hwpoisoned bit is only set in the after-split folio contains
>@split_at.
>
>>> some more memory unusable than expected.
>>
>>'We will leave some more memory unusable than expected' is super unclear.
>>
>>You mean we will fail to migrate THP entries at the PTE level?
>>
>
>No. 
>
>Hmm... I would like to clarify before continue.
>
>This fix is not to fix migration case. This is to fix folio split for a shared
>mapped PMD THP. Current folio split leverage migration entry during split
>anonymous folio. So the action here is not to migrate it.
>
>I am a little lost here.
>
>>Can we say this instead please?
>>

Hi, Lorenzo

I am not sure I understand you correctly. If not, please let me know.

>>>
>>> The tricky thing in above reproduce method is current debugfs interface
>>> leverage function split_huge_pages_pid(), which will iterate the whole
>>> pmd range and do folio split on each base page address. This means it
>>> will try 512 times, and each time split one pmd from pmd mapped to pte
>>> mapped thp. If there are less than 512 shared mapped process,
>>> the folio is still split successfully at last. But in real world, we
>>> usually try it for once.
>>
>>This whole sentence could be dropped I think I don't think it adds anything.
>>
>>And you're really confusing the issue by dwelling on this I think.
>>

It is intended to explain why the reproduction method forks 512 children. If
it is not helpful, I will drop it.

>>You need to restart the walk in this case in order for the PTEs to be correctly
>>handled right?
>>
>>Can you explain why we can't just essentially revert 60fbb14396d5? Or at least
>>the bit that did this change?

Commit 60fbb14396d5 removed some duplicated checks already covered by
page_vma_mapped_walk(), so just reverting it may not be good?

You mean a sentence like the above is preferred in the commit msg?

>>
>>Also is unmap_folio() the only caller with TTU_SPLIT_HUGE_PMD as the comment
>>that was deleted by 60fbb14396d5 implied? Or are there others? If it is, please
>>mention the commit msg.
>>

Currently there are two core users of TTU_SPLIT_HUGE_PMD:

  * try_to_unmap_one()
  * try_to_migrate_one()

And another two indirect users via try_to_unmap():

  * try_folio_split_or_unmap()
  * shrink_folio_list()

try_to_unmap_one() doesn't fail early, so only try_to_migrate_one() is
affected.

So you prefer a description like the above to be added to the commit msg?

>>
>>>
>>> This patch fixes this by restart page_vma_mapped_walk() after
>>> split_huge_pmd_locked(). We cannot simply return "true" to fix the
>>> problem, as that would affect another case:
>>
>>I mean how would it fix the problem to incorrectly have it return true when the
>>walk had not in fact completed?
>>
>>I'm not sure why you're dwelling on this idea in the commit msg?
>>
>>> split_huge_pmd_locked()->folio_try_share_anon_rmap_pmd() can failed and
>>> leave the folio mapped through PTEs; we would return "true" from
>>> try_to_migrate_one() in that case as well. While that is mostly
>>> harmless, we could end up walking the rmap, wasting some cycles.
>>
>>I mean I think we can just drop this whole paragraph no?
>>

I had an original explanation in [1], which was not clear.
Then David proposed this version in [2], which looked good to me, so I took
it in v3.

If this is not necessary, I am ok to drop it.

[1]: http://lkml.kernel.org/r/20260204004219.6524-1-richard.weiyang@gmail.com
[2]: http://lkml.kernel.org/r/df86ccfd-68a5-416e-81cc-02858e395b70@kernel.org

>>You might think I'm being picky about the commit msg here, but as is I find it
>>pretty much incomprehensible and that's not helpful if we have to go back and
>>read this in future.
>>
>
>Never mind.
>
>A clearer and comprehensive change log is helpful for all. And my English is
>not native language, so your suggestion helps a lot.
>
>>>
>>> Fixes: 60fbb14396d5 ("mm/huge_memory: adjust try_to_migrate_one() and split_huge_pmd_locked()")
>>> Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
>>> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> Reviewed-by: Zi Yan <ziy@nvidia.com>
>>> Tested-by: Lance Yang <lance.yang@linux.dev>
>>> Reviewed-by: Lance Yang <lance.yang@linux.dev>
>>> Reviewed-by: Gavin Guo <gavinguo@igalia.com>
>>> Acked-by: David Hildenbrand (arm) <david@kernel.org>
>>> Cc: Gavin Guo <gavinguo@igalia.com>
>>> Cc: "David Hildenbrand (Red Hat)" <david@kernel.org>
>>> Cc: Zi Yan <ziy@nvidia.com>
>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> Cc: Lance Yang <lance.yang@linux.dev>
>>> Cc: <stable@vger.kernel.org>
>>>
>>> ---
>>> v3:
>>>   * gather RB
>>>   * adjust the commit log and comment per David
>>
>>Clearly not enough :)
>>
>>>   * add userspace-visible runtime effect in change log
>>
>>Which one was that?
>>
>>> v2:
>>>   * restart page_vma_mapped_walk() after split_huge_pmd_locked()
>>> ---
>>>  mm/rmap.c | 12 +++++++++---
>>>  1 file changed, 9 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index 618df3385c8b..1041a64b8e6b 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -2446,11 +2446,17 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>>>  			__maybe_unused pmd_t pmdval;
>>>
>>>  			if (flags & TTU_SPLIT_HUGE_PMD) {
>>> +				/*
>>> +				 * split_huge_pmd_locked() might leave the
>>> +				 * folio mapped through PTEs. Retry the walk
>>> +				 * so we can detect this scenario and properly
>>> +				 * abort the walk.
>>> +				 */
>>
>>This comment is a lot clearer than the commit msg :)
>>
>>>  				split_huge_pmd_locked(vma, pvmw.address,
>>>  						      pvmw.pmd, true);
>>> -				ret = false;
>>> -				page_vma_mapped_walk_done(&pvmw);
>>> -				break;
>>> +				flags &= ~TTU_SPLIT_HUGE_PMD;
>>> +				page_vma_mapped_walk_restart(&pvmw);
>>> +				continue;
>>
>>This logic does lok reasonable.
>>
>>>  			}
>>>  #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
>>>  			pmdval = pmdp_get(pvmw.pmd);
>>> --
>>> 2.34.1
>>>
>>
>>Cheers, Lorenzo
>
>-- 
>Wei Yang
>Help you, Help me

-- 
Wei Yang
Help you, Help me



* Re: [Patch v3] mm/huge_memory: fix early failure try_to_migrate() when split huge pmd for shared thp
  2026-02-13 13:20     ` Wei Yang
@ 2026-02-22  0:50       ` Wei Yang
  0 siblings, 0 replies; 5+ messages in thread
From: Wei Yang @ 2026-02-22  0:50 UTC (permalink / raw)
  To: Wei Yang
  Cc: Lorenzo Stoakes, akpm, david, riel, Liam.Howlett, vbabka,
	harry.yoo, jannh, gavinguo, baolin.wang, ziy, linux-mm,
	Lance Yang, stable

On Fri, Feb 13, 2026 at 01:20:27PM +0000, Wei Yang wrote:
>On Tue, Feb 10, 2026 at 03:23:04AM +0000, Wei Yang wrote:
>>On Mon, Feb 09, 2026 at 05:08:16PM +0000, Lorenzo Stoakes wrote:
>>>On Thu, Feb 05, 2026 at 03:31:13AM +0000, Wei Yang wrote:
>>>> Commit 60fbb14396d5 ("mm/huge_memory: adjust try_to_migrate_one() and
>>>> split_huge_pmd_locked()") return false unconditionally after
>>>> split_huge_pmd_locked() which may fail early during try_to_migrate() for
>>>> shared thp. This will lead to unexpected folio split failure.
>>>
>>>I think this could be put more clearly. 'When splitting a PMD THP migration
>>>entry in try_to_migrate_one() in a rmap walk invoked by try_to_migrate() when
>>
>>split_huge_pmd_locked() could split a PMD THP migration entry, but here we
>>expect a PMD THP normal entry.
>>
>>>TTU_SPLIT_HUGE_PMD is specified.' or something like that.
>>>
>>>>
>>>> One way to reproduce:
>>>>
>>>>     Create an anonymous thp range and fork 512 children, so we have a
>>>>     thp shared mapped in 513 processes. Then trigger folio split with
>>>>     /sys/kernel/debug/split_huge_pages debugfs to split the thp folio to
>>>>     order 0.
>>>
>>>I think you should explain the issue before the repro. This is just confusing
>>>things. Mention the repro _afterwards_.
>>>
>>
>>OK, will move afterwards.
>>
>>>>
>>>> Without the above commit, we can successfully split to order 0.
>>>> With the above commit, the folio is still a large folio.
>>>>
>>>> The reason is the above commit return false after split pmd
>>>
>>>This sentence doesn't really make sense. Returns false where? And under what
>>>circumstances?
>>>
>>>I'm having to look through 60fbb14396d5 to understand this which isn't a good
>>>sign.
>>>
>>>'This patch adjusted try_to_migrate_one() to, when a PMD-mapped THP migration
>>
>>I am afraid the original intention of commit 60fbb14396d5 is not just for
>>migration entry.
>>
>>>entry is found, and TTU_SPLIT_HUGE_PMD is specified (for example, via
>>>unmap_folio()), exit the walk and return false unconditionally'.
>>>
>>>> unconditionally in the first process and break try_to_migrate().
>>>>
>>>> On memory pressure or failure, we would try to reclaim unused memory or
>>>> limit bad memory after folio split. If failed to split it, we will leave
>>>
>>>Limit bad memory? What does that mean? Also should be If '_we_' or '_it_' or
>>>something like that.
>>>
>>
>>What I want to mean is in memory_failure() we use try_to_split_thp_page() and
>>the PG_has_hwpoisoned bit is only set in the after-split folio contains
>>@split_at.
>>
>>>> some more memory unusable than expected.
>>>
>>>'We will leave some more memory unusable than expected' is super unclear.
>>>
>>>You mean we will fail to migrate THP entries at the PTE level?
>>>
>>
>>No. 
>>
>>Hmm... I would like to clarify before continue.
>>
>>This fix is not to fix migration case. This is to fix folio split for a shared
>>mapped PMD THP. Current folio split leverage migration entry during split
>>anonymous folio. So the action here is not to migrate it.
>>
>>I am a little lost here.
>>
>>>Can we say this instead please?
>>>
>
>Hi, Lorenzo
>
>I am not sure understand you correctly. If not, please let me know.
>
>>>>
>>>> The tricky thing in above reproduce method is current debugfs interface
>>>> leverage function split_huge_pages_pid(), which will iterate the whole
>>>> pmd range and do folio split on each base page address. This means it
>>>> will try 512 times, and each time split one pmd from pmd mapped to pte
>>>> mapped thp. If there are less than 512 shared mapped process,
>>>> the folio is still split successfully at last. But in real world, we
>>>> usually try it for once.
>>>
>>>This whole sentence could be dropped I think I don't think it adds anything.
>>>
>>>And you're really confusing the issue by dwelling on this I think.
>>>
>
>It is intended to explain why the reproduce method should fork 512 child. In
>case it is not helpful, I will drop it.
>
>>>You need to restart the walk in this case in order for the PTEs to be correctly
>>>handled right?
>>>
>>>Can you explain why we can't just essentially revert 60fbb14396d5? Or at least
>>>the bit that did this change?
>
>Commit 60fbb14396d5 removed some duplicated check covered by
>page_vma_mapped_walk(), so just reverting it may not good?
>
>You mean a sentence like above is preferred in commit msg?
>
>>>
>>>Also is unmap_folio() the only caller with TTU_SPLIT_HUGE_PMD as the comment
>>>that was deleted by 60fbb14396d5 implied? Or are there others? If it is, please
>>>mention the commit msg.
>>>
>
>Currently there are two core users of TTU_SPLIT_HUGE_PMD:
>
>  * try_to_unmap_one()
>  * try_to_migrate_one()
>
>And another two indirect users via try_to_unmap():
>
>  * try_folio_split_or_unmap()
>  * shrink_folio_list()
>
>try_to_unmap_one() doesn't fail early, so only try_to_migrate_one() is
>affected.
>
>So you would prefer a description like the above to be added to the commit msg?
>
>>>
>>>>
>>>> This patch fixes this by restarting page_vma_mapped_walk() after
>>>> split_huge_pmd_locked(). We cannot simply return "true" to fix the
>>>> problem, as that would affect another case:
>>>
>>>I mean how would it fix the problem to incorrectly have it return true when the
>>>walk had not in fact completed?
>>>
>>>I'm not sure why you're dwelling on this idea in the commit msg?
>>>
>>>> split_huge_pmd_locked()->folio_try_share_anon_rmap_pmd() can fail and
>>>> leave the folio mapped through PTEs; we would return "true" from
>>>> try_to_migrate_one() in that case as well. While that is mostly
>>>> harmless, we could end up walking the rmap, wasting some cycles.
>>>
>>>I mean I think we can just drop this whole paragraph no?
>>>
>
>I had an original explanation in [1], which was not clear.
>Then David proposed this version in [2], which looks good to me. So I took it
>in v3.
>
>If this is not necessary, I am ok to drop it.
>
>[1]: http://lkml.kernel.org/r/20260204004219.6524-1-richard.weiyang@gmail.com
>[2]: http://lkml.kernel.org/r/df86ccfd-68a5-416e-81cc-02858e395b70@kernel.org
>

Hi, Lorenzo

I am not certain how you would prefer the commit msg to read; would you mind
taking a look at my questions when you have time? Then I could prepare the
next version.

Thanks a lot.
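For what it's worth, the difference between the old early return and the new
restart can be modeled in a tiny user-space sketch. This is toy code, not
kernel code: toy_vma, migrate_one_old() and migrate_one_new() are made-up
stand-ins for the real pvmw walk and helpers, kept only to show the control
flow of "split once, clear TTU_SPLIT_HUGE_PMD, restart the walk":

```c
#include <stdbool.h>

#define TTU_SPLIT_HUGE_PMD 0x1u

enum map_state { PMD_MAPPED, PTE_MAPPED };

/* Toy stand-in for one VMA's mapping of the folio. */
struct toy_vma {
	enum map_state state;
	int ptes_migrated;
};

/*
 * Old behavior: split the PMD, then unconditionally abort the walk.
 * The PTE mappings created by the split are never replaced by
 * migration entries, so the caller sees an early failure.
 */
static bool migrate_one_old(struct toy_vma *vma, unsigned int flags)
{
	if (vma->state == PMD_MAPPED && (flags & TTU_SPLIT_HUGE_PMD)) {
		vma->state = PTE_MAPPED;	/* split_huge_pmd_locked() */
		return false;			/* early bail-out */
	}
	if (vma->state == PTE_MAPPED)
		vma->ptes_migrated = 1;		/* install migration entries */
	return true;
}

/*
 * Fixed behavior: split the PMD, clear the flag so we split at most
 * once, and restart the walk so the now-PTE-mapped folio is handled.
 */
static bool migrate_one_new(struct toy_vma *vma, unsigned int flags)
{
restart:
	if (vma->state == PMD_MAPPED && (flags & TTU_SPLIT_HUGE_PMD)) {
		vma->state = PTE_MAPPED;	/* split_huge_pmd_locked() */
		flags &= ~TTU_SPLIT_HUGE_PMD;
		goto restart;			/* page_vma_mapped_walk_restart() */
	}
	if (vma->state == PTE_MAPPED)
		vma->ptes_migrated = 1;
	return true;
}
```

With the old flow a pmd-mapped VMA aborts the walk with its PTEs untouched;
with the restart the same VMA ends up pte-mapped with migration entries
installed in one pass, which is the behavior the patch restores.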

>>>You might think I'm being picky about the commit msg here, but as is I find it
>>>pretty much incomprehensible and that's not helpful if we have to go back and
>>>read this in future.
>>>
>>
>>Never mind.
>>
>>A clearer and more comprehensive change log is helpful for everyone. English
>>is not my native language, so your suggestions help a lot.
>>
>>>>
>>>> Fixes: 60fbb14396d5 ("mm/huge_memory: adjust try_to_migrate_one() and split_huge_pmd_locked()")
>>>> Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
>>>> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>> Reviewed-by: Zi Yan <ziy@nvidia.com>
>>>> Tested-by: Lance Yang <lance.yang@linux.dev>
>>>> Reviewed-by: Lance Yang <lance.yang@linux.dev>
>>>> Reviewed-by: Gavin Guo <gavinguo@igalia.com>
>>>> Acked-by: David Hildenbrand (arm) <david@kernel.org>
>>>> Cc: Gavin Guo <gavinguo@igalia.com>
>>>> Cc: "David Hildenbrand (Red Hat)" <david@kernel.org>
>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>> Cc: Lance Yang <lance.yang@linux.dev>
>>>> Cc: <stable@vger.kernel.org>
>>>>
>>>> ---
>>>> v3:
>>>>   * gather RB
>>>>   * adjust the commit log and comment per David
>>>
>>>Clearly not enough :)
>>>
>>>>   * add userspace-visible runtime effect in change log
>>>
>>>Which one was that?
>>>
>>>> v2:
>>>>   * restart page_vma_mapped_walk() after split_huge_pmd_locked()
>>>> ---
>>>>  mm/rmap.c | 12 +++++++++---
>>>>  1 file changed, 9 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>> index 618df3385c8b..1041a64b8e6b 100644
>>>> --- a/mm/rmap.c
>>>> +++ b/mm/rmap.c
>>>> @@ -2446,11 +2446,17 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>>>>  			__maybe_unused pmd_t pmdval;
>>>>
>>>>  			if (flags & TTU_SPLIT_HUGE_PMD) {
>>>> +				/*
>>>> +				 * split_huge_pmd_locked() might leave the
>>>> +				 * folio mapped through PTEs. Retry the walk
>>>> +				 * so we can detect this scenario and properly
>>>> +				 * abort the walk.
>>>> +				 */
>>>
>>>This comment is a lot clearer than the commit msg :)
>>>
>>>>  				split_huge_pmd_locked(vma, pvmw.address,
>>>>  						      pvmw.pmd, true);
>>>> -				ret = false;
>>>> -				page_vma_mapped_walk_done(&pvmw);
>>>> -				break;
>>>> +				flags &= ~TTU_SPLIT_HUGE_PMD;
>>>> +				page_vma_mapped_walk_restart(&pvmw);
>>>> +				continue;
>>>
>>>This logic does look reasonable.
>>>
>>>>  			}
>>>>  #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
>>>>  			pmdval = pmdp_get(pvmw.pmd);
>>>> --
>>>> 2.34.1
>>>>
>>>
>>>Cheers, Lorenzo
>>
>>-- 
>>Wei Yang
>>Help you, Help me
>
>-- 
>Wei Yang
>Help you, Help me

-- 
Wei Yang
Help you, Help me


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2026-02-22  0:50 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-02-05  3:31 [Patch v3] mm/huge_memory: fix early failure try_to_migrate() when split huge pmd for shared thp Wei Yang
2026-02-09 17:08 ` Lorenzo Stoakes
2026-02-10  3:23   ` Wei Yang
2026-02-13 13:20     ` Wei Yang
2026-02-22  0:50       ` Wei Yang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox