* [PATCH] mm/huge_memory: fix early failure try_to_migrate() when split huge pmd for shared thp
@ 2026-01-30 23:00 Wei Yang
2026-01-31 2:44 ` Zi Yan
0 siblings, 1 reply; 13+ messages in thread
From: Wei Yang @ 2026-01-30 23:00 UTC (permalink / raw)
To: akpm, david, lorenzo.stoakes, riel, Liam.Howlett, vbabka,
harry.yoo, jannh, gavinguo, baolin.wang, ziy
Cc: linux-mm, Wei Yang, stable
Commit 60fbb14396d5 ("mm/huge_memory: adjust try_to_migrate_one() and
split_huge_pmd_locked()") returns false unconditionally after
split_huge_pmd_locked(), which makes try_to_migrate() fail early for a
shared thp. This leads to unexpected folio split failures.
One way to reproduce:
Create an anonymous thp range and fork 512 children, so we have a
thp share-mapped in 513 processes. Then trigger a folio split via
the /sys/kernel/debug/split_huge_pages debugfs interface to split
the thp folio to order 0.
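
A rough userspace sketch of these steps (illustrative only, not a
tested reproducer; alignment handling is simplified and error checking
is omitted):

  #define _GNU_SOURCE
  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <unistd.h>

  #define THP_SIZE	(2UL << 20)
  #define NR_CHILDREN	512

  int main(void)
  {
  	char cmd[80], *map, *buf;
  	FILE *f;
  	int i;

  	/* Over-allocate so a thp-aligned 2M range can be picked. */
  	map = mmap(NULL, 2 * THP_SIZE, PROT_READ | PROT_WRITE,
  		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  	buf = (char *)(((unsigned long)map + THP_SIZE - 1) &
  		       ~(THP_SIZE - 1));
  	madvise(buf, THP_SIZE, MADV_HUGEPAGE);
  	memset(buf, 1, THP_SIZE);	/* fault in the thp */

  	/* Share the pmd-mapped thp with 512 children. */
  	for (i = 0; i < NR_CHILDREN; i++)
  		if (fork() == 0)
  			pause();

  	/* One split attempt: "<pid>,<vaddr_start>,<vaddr_end>,<new_order>" */
  	snprintf(cmd, sizeof(cmd), "%d,0x%lx,0x%lx,0", getpid(),
  		 (unsigned long)buf, (unsigned long)buf + THP_SIZE);
  	f = fopen("/sys/kernel/debug/split_huge_pages", "w");
  	fwrite(cmd, strlen(cmd), 1, f);
  	fclose(f);

  	/* Check AnonHugePages in /proc/<pid>/smaps for the result. */
  	return 0;
  }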
Without the above commit, we can successfully split to order 0.
With the above commit, the folio remains a large folio.
The reason is that the above commit unconditionally returns false
after splitting the pmd in the first process, breaking out of
try_to_migrate().
The tricky thing about the above reproducer is that the current
debugfs interface is implemented by split_huge_pages_pid(), which
iterates over the whole pmd range and attempts a folio split at each
base page address. This means it tries 512 times, each time splitting
one pmd from a pmd-mapped to a pte-mapped thp. If fewer than 512
processes share the mapping, the folio is still split successfully in
the end. But in the real world, we usually try only once.
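
Roughly, that loop looks like the sketch below (simplified;
folio_lookup() is a hypothetical stand-in for the real per-address
folio lookup, and most validity checks are elided):

  for (addr = vaddr_start; addr < vaddr_end; addr += PAGE_SIZE) {
  	struct folio *folio = folio_lookup(mm, addr);

  	if (!folio || !folio_test_large(folio))
  		continue;
  	/*
  	 * With the unconditional "ret = false", each attempt only
  	 * splits the pmd in one more process before failing, so it
  	 * takes one attempt per sharing process until the folio is
  	 * fully pte-mapped and the split can finally succeed.
  	 */
  	split_folio_to_order(folio, new_order);
  	folio_put(folio);
  }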
Fix this by removing the unconditional false return after
split_huge_pmd_locked(). Later, we may introduce a genuine early
failure for the case where split_huge_pmd_locked() does fail.
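
For illustration only, and not part of this patch: assuming
split_huge_pmd_locked() were taught to return whether it managed to
install migration entries (it currently returns void, so the bool
return below is hypothetical), the early failure could follow the
set_pmd_migration_entry() pattern:

  if (flags & TTU_SPLIT_HUGE_PMD) {
  	/*
  	 * Hypothetical: false would mean PageAnonExclusive()
  	 * could not be cleared, so the pmd was split without
  	 * installing migration entries.
  	 */
  	if (!split_huge_pmd_locked(vma, pvmw.address,
  				   pvmw.pmd, true)) {
  		ret = false;
  		page_vma_mapped_walk_done(&pvmw);
  		break;
  	}
  	continue;
  }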
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Fixes: 60fbb14396d5 ("mm/huge_memory: adjust try_to_migrate_one() and split_huge_pmd_locked()")
Cc: Gavin Guo <gavinguo@igalia.com>
Cc: "David Hildenbrand (Red Hat)" <david@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: <stable@vger.kernel.org>
---
mm/rmap.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/mm/rmap.c b/mm/rmap.c
index 618df3385c8b..eed971568d65 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2448,7 +2448,6 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
if (flags & TTU_SPLIT_HUGE_PMD) {
split_huge_pmd_locked(vma, pvmw.address,
pvmw.pmd, true);
- ret = false;
page_vma_mapped_walk_done(&pvmw);
break;
}
--
2.34.1
^ permalink raw reply [flat|nested] 13+ messages in thread* Re: [PATCH] mm/huge_memory: fix early failure try_to_migrate() when split huge pmd for shared thp 2026-01-30 23:00 [PATCH] mm/huge_memory: fix early failure try_to_migrate() when split huge pmd for shared thp Wei Yang @ 2026-01-31 2:44 ` Zi Yan 2026-02-01 2:09 ` Wei Yang 0 siblings, 1 reply; 13+ messages in thread From: Zi Yan @ 2026-01-31 2:44 UTC (permalink / raw) To: Wei Yang Cc: akpm, david, lorenzo.stoakes, riel, Liam.Howlett, vbabka, harry.yoo, jannh, gavinguo, baolin.wang, linux-mm, stable On 30 Jan 2026, at 18:00, Wei Yang wrote: > Commit 60fbb14396d5 ("mm/huge_memory: adjust try_to_migrate_one() and > split_huge_pmd_locked()") return false unconditionally after > split_huge_pmd_locked() which may fail early during try_to_migrate() for > shared thp. This will lead to unexpected folio split failure. > > One way to reproduce: > > Create an anonymous thp range and fork 512 children, so we have a > thp shared mapped in 513 processes. Then trigger folio split with > /sys/kernel/debug/split_huge_pages debugfs to split the thp folio to > order 0. > > Without the above commit, we can successfully split to order 0. > With the above commit, the folio is still a large folio. > > The reason is the above commit return false after split pmd > unconditionally in the first process and break try_to_migrate(). The reasoning looks good to me. > > The tricky thing in above reproduce method is current debugfs interface > leverage function split_huge_pages_pid(), which will iterate the whole > pmd range and do folio split on each base page address. This means it > will try 512 times, and each time split one pmd from pmd mapped to pte > mapped thp. If there are less than 512 shared mapped process, > the folio is still split successfully at last. But in real world, we > usually try it for once. > > This patch fixes this by removing the unconditional false return after > split_huge_pmd_locked(). Later, we may introduce a true fail early if > split_huge_pmd_locked() does fail. > > Signed-off-by: Wei Yang <richard.weiyang@gmail.com> > Fixes: 60fbb14396d5 ("mm/huge_memory: adjust try_to_migrate_one() and split_huge_pmd_locked()") > Cc: Gavin Guo <gavinguo@igalia.com> > Cc: "David Hildenbrand (Red Hat)" <david@kernel.org> > Cc: Zi Yan <ziy@nvidia.com> > Cc: Baolin Wang <baolin.wang@linux.alibaba.com> > Cc: <stable@vger.kernel.org> > --- > mm/rmap.c | 1 - > 1 file changed, 1 deletion(-) > > diff --git a/mm/rmap.c b/mm/rmap.c > index 618df3385c8b..eed971568d65 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -2448,7 +2448,6 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, > if (flags & TTU_SPLIT_HUGE_PMD) { > split_huge_pmd_locked(vma, pvmw.address, > pvmw.pmd, true); > - ret = false; > page_vma_mapped_walk_done(&pvmw); > break; > } How about the patch below? It matches the pattern of set_pmd_migration_entry() below. Basically, continue if the operation is successful, break otherwise. diff --git a/mm/rmap.c b/mm/rmap.c index 618df3385c8b..83cc9d98533e 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -2448,9 +2448,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, if (flags & TTU_SPLIT_HUGE_PMD) { split_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, true); - ret = false; - page_vma_mapped_walk_done(&pvmw); - break; + continue; } #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION pmdval = pmdp_get(pvmw.pmd); -- Best Regards, Yan, Zi ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] mm/huge_memory: fix early failure try_to_migrate() when split huge pmd for shared thp 2026-01-31 2:44 ` Zi Yan @ 2026-02-01 2:09 ` Wei Yang 2026-02-01 3:39 ` Zi Yan 0 siblings, 1 reply; 13+ messages in thread From: Wei Yang @ 2026-02-01 2:09 UTC (permalink / raw) To: Zi Yan Cc: Wei Yang, akpm, david, lorenzo.stoakes, riel, Liam.Howlett, vbabka, harry.yoo, jannh, gavinguo, baolin.wang, linux-mm, stable On Fri, Jan 30, 2026 at 09:44:10PM -0500, Zi Yan wrote: >On 30 Jan 2026, at 18:00, Wei Yang wrote: > >> Commit 60fbb14396d5 ("mm/huge_memory: adjust try_to_migrate_one() and >> split_huge_pmd_locked()") return false unconditionally after >> split_huge_pmd_locked() which may fail early during try_to_migrate() for >> shared thp. This will lead to unexpected folio split failure. >> >> One way to reproduce: >> >> Create an anonymous thp range and fork 512 children, so we have a >> thp shared mapped in 513 processes. Then trigger folio split with >> /sys/kernel/debug/split_huge_pages debugfs to split the thp folio to >> order 0. >> >> Without the above commit, we can successfully split to order 0. >> With the above commit, the folio is still a large folio. >> >> The reason is the above commit return false after split pmd >> unconditionally in the first process and break try_to_migrate(). > >The reasoning looks good to me. > >> >> The tricky thing in above reproduce method is current debugfs interface >> leverage function split_huge_pages_pid(), which will iterate the whole >> pmd range and do folio split on each base page address. This means it >> will try 512 times, and each time split one pmd from pmd mapped to pte >> mapped thp. If there are less than 512 shared mapped process, >> the folio is still split successfully at last. But in real world, we >> usually try it for once. >> >> This patch fixes this by removing the unconditional false return after >> split_huge_pmd_locked(). Later, we may introduce a true fail early if >> split_huge_pmd_locked() does fail. >> >> Signed-off-by: Wei Yang <richard.weiyang@gmail.com> >> Fixes: 60fbb14396d5 ("mm/huge_memory: adjust try_to_migrate_one() and split_huge_pmd_locked()") >> Cc: Gavin Guo <gavinguo@igalia.com> >> Cc: "David Hildenbrand (Red Hat)" <david@kernel.org> >> Cc: Zi Yan <ziy@nvidia.com> >> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> >> Cc: <stable@vger.kernel.org> >> --- >> mm/rmap.c | 1 - >> 1 file changed, 1 deletion(-) >> >> diff --git a/mm/rmap.c b/mm/rmap.c >> index 618df3385c8b..eed971568d65 100644 >> --- a/mm/rmap.c >> +++ b/mm/rmap.c >> @@ -2448,7 +2448,6 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, >> if (flags & TTU_SPLIT_HUGE_PMD) { >> split_huge_pmd_locked(vma, pvmw.address, >> pvmw.pmd, true); >> - ret = false; >> page_vma_mapped_walk_done(&pvmw); >> break; >> } > >How about the patch below? It matches the pattern of set_pmd_migration_entry() below. >Basically, continue if the operation is successful, break otherwise. 
>
>diff --git a/mm/rmap.c b/mm/rmap.c
>index 618df3385c8b..83cc9d98533e 100644
>--- a/mm/rmap.c
>+++ b/mm/rmap.c
>@@ -2448,9 +2448,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
> 		if (flags & TTU_SPLIT_HUGE_PMD) {
> 			split_huge_pmd_locked(vma, pvmw.address,
> 					      pvmw.pmd, true);
>-			ret = false;
>-			page_vma_mapped_walk_done(&pvmw);
>-			break;
>+			continue;
> 		}

Per my understanding, if @freeze is true, split_huge_pmd_locked() may
"fail" as the comment says:

 * Without "freeze", we'll simply split the PMD, propagating the
 * PageAnonExclusive() flag for each PTE by setting it for
 * each subpage -- no need to (temporarily) clear.
 *
 * With "freeze" we want to replace mapped pages by
 * migration entries right away. This is only possible if we
 * managed to clear PageAnonExclusive() -- see
 * set_pmd_migration_entry().
 *
 * In case we cannot clear PageAnonExclusive(), split the PMD
 * only and let try_to_migrate_one() fail later.

However, we currently don't return the status of split_huge_pmd_locked()
to indicate whether it replaced the PMD with migration entries
successfully, so we are not sure the operation succeeded.

Another difference from set_pmd_migration_entry() is that
split_huge_pmd_locked() would change the page table from PMD mapped to
PTE mapped. page_vma_mapped_walk() can handle that now for (pvmw->pmd &&
!pvmw->pte), but I am not sure this is what we expect. For example, in
try_to_unmap_one(), we use page_vma_mapped_walk_restart() after the pmd
is split.

So I prefer to just remove the "ret = false" as the fix. Not sure
whether this sounds reasonable to you.

I am thinking of two things after this fix:

* add a similar test in selftests
* let split_huge_pmd_locked() return a value to indicate that freeze
  degraded to !freeze, and fail early in try_to_migrate() like the thp
  migration branch

Looking forward to your opinion on whether it is worth doing.

> #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> 		pmdval = pmdp_get(pvmw.pmd);
>
>
>
>-- 
>Best Regards,
>Yan, Zi

-- 
Wei Yang
Help you, Help me

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [PATCH] mm/huge_memory: fix early failure try_to_migrate() when split huge pmd for shared thp 2026-02-01 2:09 ` Wei Yang @ 2026-02-01 3:39 ` Zi Yan 2026-02-01 13:04 ` Gavin Guo 2026-02-02 23:57 ` Wei Yang 0 siblings, 2 replies; 13+ messages in thread From: Zi Yan @ 2026-02-01 3:39 UTC (permalink / raw) To: Wei Yang, david Cc: akpm, lorenzo.stoakes, riel, Liam.Howlett, vbabka, harry.yoo, jannh, gavinguo, baolin.wang, linux-mm, stable On 31 Jan 2026, at 21:09, Wei Yang wrote: > On Fri, Jan 30, 2026 at 09:44:10PM -0500, Zi Yan wrote: >> On 30 Jan 2026, at 18:00, Wei Yang wrote: >> >>> Commit 60fbb14396d5 ("mm/huge_memory: adjust try_to_migrate_one() and >>> split_huge_pmd_locked()") return false unconditionally after >>> split_huge_pmd_locked() which may fail early during try_to_migrate() for >>> shared thp. This will lead to unexpected folio split failure. >>> >>> One way to reproduce: >>> >>> Create an anonymous thp range and fork 512 children, so we have a >>> thp shared mapped in 513 processes. Then trigger folio split with >>> /sys/kernel/debug/split_huge_pages debugfs to split the thp folio to >>> order 0. >>> >>> Without the above commit, we can successfully split to order 0. >>> With the above commit, the folio is still a large folio. >>> >>> The reason is the above commit return false after split pmd >>> unconditionally in the first process and break try_to_migrate(). >> >> The reasoning looks good to me. >> >>> >>> The tricky thing in above reproduce method is current debugfs interface >>> leverage function split_huge_pages_pid(), which will iterate the whole >>> pmd range and do folio split on each base page address. This means it >>> will try 512 times, and each time split one pmd from pmd mapped to pte >>> mapped thp. If there are less than 512 shared mapped process, >>> the folio is still split successfully at last. But in real world, we >>> usually try it for once. >>> >>> This patch fixes this by removing the unconditional false return after >>> split_huge_pmd_locked(). Later, we may introduce a true fail early if >>> split_huge_pmd_locked() does fail. >>> >>> Signed-off-by: Wei Yang <richard.weiyang@gmail.com> >>> Fixes: 60fbb14396d5 ("mm/huge_memory: adjust try_to_migrate_one() and split_huge_pmd_locked()") >>> Cc: Gavin Guo <gavinguo@igalia.com> >>> Cc: "David Hildenbrand (Red Hat)" <david@kernel.org> >>> Cc: Zi Yan <ziy@nvidia.com> >>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> >>> Cc: <stable@vger.kernel.org> >>> --- >>> mm/rmap.c | 1 - >>> 1 file changed, 1 deletion(-) >>> >>> diff --git a/mm/rmap.c b/mm/rmap.c >>> index 618df3385c8b..eed971568d65 100644 >>> --- a/mm/rmap.c >>> +++ b/mm/rmap.c >>> @@ -2448,7 +2448,6 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, >>> if (flags & TTU_SPLIT_HUGE_PMD) { >>> split_huge_pmd_locked(vma, pvmw.address, >>> pvmw.pmd, true); >>> - ret = false; >>> page_vma_mapped_walk_done(&pvmw); >>> break; >>> } >> >> How about the patch below? It matches the pattern of set_pmd_migration_entry() below. >> Basically, continue if the operation is successful, break otherwise. 
>> >> diff --git a/mm/rmap.c b/mm/rmap.c >> index 618df3385c8b..83cc9d98533e 100644 >> --- a/mm/rmap.c >> +++ b/mm/rmap.c >> @@ -2448,9 +2448,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, >> if (flags & TTU_SPLIT_HUGE_PMD) { >> split_huge_pmd_locked(vma, pvmw.address, >> pvmw.pmd, true); >> - ret = false; >> - page_vma_mapped_walk_done(&pvmw); >> - break; >> + continue; >> } > > Per my understanding if @freeze is trur, split_huge_pmd_locked() may "fail" as > the comment says: > > * Without "freeze", we'll simply split the PMD, propagating the > * PageAnonExclusive() flag for each PTE by setting it for > * each subpage -- no need to (temporarily) clear. > * > * With "freeze" we want to replace mapped pages by > * migration entries right away. This is only possible if we > * managed to clear PageAnonExclusive() -- see > * set_pmd_migration_entry(). > * > * In case we cannot clear PageAnonExclusive(), split the PMD > * only and let try_to_migrate_one() fail later. > > While currently we don't return the status of split_huge_pmd_locked() to > indicate whether it does replaced PMD with migration entries successfully. So > we are not sure this operation succeed. This is the right reasoning. This means to properly handle it, split_huge_pmd_locked() needs to return whether it inserts migration entries or not when freeze is true. > > Another difference from set_pmd_migration_entry() is split_huge_pmd_locked() > would change the page table from PMD mapped to PTE mapped. > page_vma_mapped_walk() can handle it now for (pvmw->pmd && !pvmw->pte), but I > am not sure this is what we expected. For example, in try_to_unmap_one(), we > use page_vma_mapped_walk_restart() after pmd splitted. > > So I prefer just remove the "ret = false" for a fix. Not sure this is > reasonable to you. > > I am thinking two things after this fix: > > * add one similar test in selftests > * let split_huge_pmd_locked() return value to indicate freeze is degrade to > !freeze, and fail early on try_to_migrate() like the thp migration branch > > Look forward your opinion on whether it worth to do it. This is not the right fix, neither was mine above. Because before commit 60fbb14396d5, the code handles PAE properly. If PAE is cleared, PMD is split into PTEs and each PTE becomes a migration entry, page_vma_mapped_walk(&pvmw) returns false, and try_to_migrate_one() returns true. If PAE is not cleared, PMD is split into PTEs and each PTE is not a migration entry, inside while (page_vma_mapped_walk(&pvmw)), PAE will be attempted to get cleared again and it will fail again, leading to try_to_migrate_one() returns false. After commit 60fbb14396d5, no matter PAE is cleared or not, try_to_migrate_one() always returns false. It causes folio split failures for shared PMD THPs. Now with your fix (and mine above), no matter PAE is cleared or not, try_to_migrate_one() always returns true. It just flips the code to a different issue. So the proper fix is to let split_huge_pmd_locked() returns whether it inserts migration entries or not and do the same pattern as THP migration code path. Hi David, In terms of unmap_folio(), which is the only user of split_huge_pmd_locked(..., freeze=true), there is no folio_mapped() check afterwards. That might be causing an issue, when the folio is pinned between the refcount check and unmap_folio(), unmap_folio() fails, but folio split code proceeds. That means the folio is still accessible via PTEs and later remove_migration_pte() will try to remove non migration PTEs. 
It needs to be fixed separately, right? > >> #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION >> pmdval = pmdp_get(pvmw.pmd); >> >> >> >> -- >> Best Regards, >> Yan, Zi > > -- > Wei Yang > Help you, Help me -- Best Regards, Yan, Zi ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] mm/huge_memory: fix early failure try_to_migrate() when split huge pmd for shared thp 2026-02-01 3:39 ` Zi Yan @ 2026-02-01 13:04 ` Gavin Guo 2026-02-01 14:20 ` Zi Yan 2026-02-02 23:57 ` Wei Yang 1 sibling, 1 reply; 13+ messages in thread From: Gavin Guo @ 2026-02-01 13:04 UTC (permalink / raw) To: Zi Yan, Wei Yang, david Cc: akpm, lorenzo.stoakes, riel, Liam.Howlett, vbabka, harry.yoo, jannh, baolin.wang, linux-mm, stable, Gavin Shan On 2/1/26 11:39, Zi Yan wrote: > On 31 Jan 2026, at 21:09, Wei Yang wrote: > >> On Fri, Jan 30, 2026 at 09:44:10PM -0500, Zi Yan wrote: >>> On 30 Jan 2026, at 18:00, Wei Yang wrote: >>> >>>> Commit 60fbb14396d5 ("mm/huge_memory: adjust try_to_migrate_one() and >>>> split_huge_pmd_locked()") return false unconditionally after >>>> split_huge_pmd_locked() which may fail early during try_to_migrate() for >>>> shared thp. This will lead to unexpected folio split failure. >>>> >>>> One way to reproduce: >>>> >>>> Create an anonymous thp range and fork 512 children, so we have a >>>> thp shared mapped in 513 processes. Then trigger folio split with >>>> /sys/kernel/debug/split_huge_pages debugfs to split the thp folio to >>>> order 0. >>>> >>>> Without the above commit, we can successfully split to order 0. >>>> With the above commit, the folio is still a large folio. >>>> >>>> The reason is the above commit return false after split pmd >>>> unconditionally in the first process and break try_to_migrate(). >>> >>> The reasoning looks good to me. >>> >>>> >>>> The tricky thing in above reproduce method is current debugfs interface >>>> leverage function split_huge_pages_pid(), which will iterate the whole >>>> pmd range and do folio split on each base page address. This means it >>>> will try 512 times, and each time split one pmd from pmd mapped to pte >>>> mapped thp. If there are less than 512 shared mapped process, >>>> the folio is still split successfully at last. But in real world, we >>>> usually try it for once. >>>> >>>> This patch fixes this by removing the unconditional false return after >>>> split_huge_pmd_locked(). Later, we may introduce a true fail early if >>>> split_huge_pmd_locked() does fail. >>>> >>>> Signed-off-by: Wei Yang <richard.weiyang@gmail.com> >>>> Fixes: 60fbb14396d5 ("mm/huge_memory: adjust try_to_migrate_one() and split_huge_pmd_locked()") >>>> Cc: Gavin Guo <gavinguo@igalia.com> >>>> Cc: "David Hildenbrand (Red Hat)" <david@kernel.org> >>>> Cc: Zi Yan <ziy@nvidia.com> >>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> >>>> Cc: <stable@vger.kernel.org> >>>> --- >>>> mm/rmap.c | 1 - >>>> 1 file changed, 1 deletion(-) >>>> >>>> diff --git a/mm/rmap.c b/mm/rmap.c >>>> index 618df3385c8b..eed971568d65 100644 >>>> --- a/mm/rmap.c >>>> +++ b/mm/rmap.c >>>> @@ -2448,7 +2448,6 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, >>>> if (flags & TTU_SPLIT_HUGE_PMD) { >>>> split_huge_pmd_locked(vma, pvmw.address, >>>> pvmw.pmd, true); >>>> - ret = false; >>>> page_vma_mapped_walk_done(&pvmw); >>>> break; >>>> } >>> >>> How about the patch below? It matches the pattern of set_pmd_migration_entry() below. >>> Basically, continue if the operation is successful, break otherwise. 
>>> >>> diff --git a/mm/rmap.c b/mm/rmap.c >>> index 618df3385c8b..83cc9d98533e 100644 >>> --- a/mm/rmap.c >>> +++ b/mm/rmap.c >>> @@ -2448,9 +2448,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, >>> if (flags & TTU_SPLIT_HUGE_PMD) { >>> split_huge_pmd_locked(vma, pvmw.address, >>> pvmw.pmd, true); >>> - ret = false; >>> - page_vma_mapped_walk_done(&pvmw); >>> - break; >>> + continue; >>> } >> >> Per my understanding if @freeze is trur, split_huge_pmd_locked() may "fail" as >> the comment says: >> >> * Without "freeze", we'll simply split the PMD, propagating the >> * PageAnonExclusive() flag for each PTE by setting it for >> * each subpage -- no need to (temporarily) clear. >> * >> * With "freeze" we want to replace mapped pages by >> * migration entries right away. This is only possible if we >> * managed to clear PageAnonExclusive() -- see >> * set_pmd_migration_entry(). >> * >> * In case we cannot clear PageAnonExclusive(), split the PMD >> * only and let try_to_migrate_one() fail later. >> >> While currently we don't return the status of split_huge_pmd_locked() to >> indicate whether it does replaced PMD with migration entries successfully. So >> we are not sure this operation succeed. > > This is the right reasoning. This means to properly handle it, split_huge_pmd_locked() > needs to return whether it inserts migration entries or not when freeze is true. > >> >> Another difference from set_pmd_migration_entry() is split_huge_pmd_locked() >> would change the page table from PMD mapped to PTE mapped. >> page_vma_mapped_walk() can handle it now for (pvmw->pmd && !pvmw->pte), but I >> am not sure this is what we expected. For example, in try_to_unmap_one(), we >> use page_vma_mapped_walk_restart() after pmd splitted. >> >> So I prefer just remove the "ret = false" for a fix. Not sure this is >> reasonable to you. >> >> I am thinking two things after this fix: >> >> * add one similar test in selftests >> * let split_huge_pmd_locked() return value to indicate freeze is degrade to >> !freeze, and fail early on try_to_migrate() like the thp migration branch >> >> Look forward your opinion on whether it worth to do it. > > This is not the right fix, neither was mine above. Because before commit 60fbb14396d5, > the code handles PAE properly. If PAE is cleared, PMD is split into PTEs and each > PTE becomes a migration entry, page_vma_mapped_walk(&pvmw) returns false, > and try_to_migrate_one() returns true. If PAE is not cleared, PMD is split into PTEs > and each PTE is not a migration entry, inside while (page_vma_mapped_walk(&pvmw)), > PAE will be attempted to get cleared again and it will fail again, leading to > try_to_migrate_one() returns false. After commit 60fbb14396d5, no matter PAE is > cleared or not, try_to_migrate_one() always returns false. It causes folio split > failures for shared PMD THPs. > > Now with your fix (and mine above), no matter PAE is cleared or not, try_to_migrate_one() > always returns true. It just flips the code to a different issue. So the proper fix > is to let split_huge_pmd_locked() returns whether it inserts migration entries or not > and do the same pattern as THP migration code path. How about aligning with the try_to_unmap_one()? 
The behavior would be the same as before applying commit 60fbb14396d5:

diff --git a/mm/rmap.c b/mm/rmap.c
index 7b9879ef442d..0c96f0883013 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2333,9 +2333,9 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 		if (flags & TTU_SPLIT_HUGE_PMD) {
 			split_huge_pmd_locked(vma, pvmw.address,
 					      pvmw.pmd, true);
-			ret = false;
-			page_vma_mapped_walk_done(&pvmw);
-			break;
+			flags &= ~TTU_SPLIT_HUGE_PMD;
+			page_vma_mapped_walk_restart(&pvmw);
+			continue;
 		}
 #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
 		pmdval = pmdp_get(pvmw.pmd);

>
>
> Hi David,
>
> In terms of unmap_folio(), which is the only user of split_huge_pmd_locked(..., freeze=true),
> there is no folio_mapped() check afterwards. That might be causing an issue,
> when the folio is pinned between the refcount check and unmap_folio(), unmap_folio()
> fails, but folio split code proceeds. That means the folio is still accessible
> via PTEs and later remove_migration_pte() will try to remove non migration PTEs.
> It needs to be fixed separately, right?
>
>
>>
>>> #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
>>> 		pmdval = pmdp_get(pvmw.pmd);
>>>
>>>
>>>
>>> -- 
>>> Best Regards,
>>> Yan, Zi
>>
>> -- 
>> Wei Yang
>> Help you, Help me
>
>
> -- 
> Best Regards,
> Yan, Zi

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [PATCH] mm/huge_memory: fix early failure try_to_migrate() when split huge pmd for shared thp 2026-02-01 13:04 ` Gavin Guo @ 2026-02-01 14:20 ` Zi Yan 2026-02-03 0:00 ` Wei Yang 2026-02-03 13:20 ` Lance Yang 0 siblings, 2 replies; 13+ messages in thread From: Zi Yan @ 2026-02-01 14:20 UTC (permalink / raw) To: Gavin Guo Cc: Wei Yang, david, akpm, lorenzo.stoakes, riel, Liam.Howlett, vbabka, harry.yoo, jannh, baolin.wang, linux-mm, stable, Gavin Shan On 1 Feb 2026, at 8:04, Gavin Guo wrote: > On 2/1/26 11:39, Zi Yan wrote: >> On 31 Jan 2026, at 21:09, Wei Yang wrote: >> >>> On Fri, Jan 30, 2026 at 09:44:10PM -0500, Zi Yan wrote: >>>> On 30 Jan 2026, at 18:00, Wei Yang wrote: >>>> >>>>> Commit 60fbb14396d5 ("mm/huge_memory: adjust try_to_migrate_one() and >>>>> split_huge_pmd_locked()") return false unconditionally after >>>>> split_huge_pmd_locked() which may fail early during try_to_migrate() for >>>>> shared thp. This will lead to unexpected folio split failure. >>>>> >>>>> One way to reproduce: >>>>> >>>>> Create an anonymous thp range and fork 512 children, so we have a >>>>> thp shared mapped in 513 processes. Then trigger folio split with >>>>> /sys/kernel/debug/split_huge_pages debugfs to split the thp folio to >>>>> order 0. >>>>> >>>>> Without the above commit, we can successfully split to order 0. >>>>> With the above commit, the folio is still a large folio. >>>>> >>>>> The reason is the above commit return false after split pmd >>>>> unconditionally in the first process and break try_to_migrate(). >>>> >>>> The reasoning looks good to me. >>>> >>>>> >>>>> The tricky thing in above reproduce method is current debugfs interface >>>>> leverage function split_huge_pages_pid(), which will iterate the whole >>>>> pmd range and do folio split on each base page address. This means it >>>>> will try 512 times, and each time split one pmd from pmd mapped to pte >>>>> mapped thp. If there are less than 512 shared mapped process, >>>>> the folio is still split successfully at last. But in real world, we >>>>> usually try it for once. >>>>> >>>>> This patch fixes this by removing the unconditional false return after >>>>> split_huge_pmd_locked(). Later, we may introduce a true fail early if >>>>> split_huge_pmd_locked() does fail. >>>>> >>>>> Signed-off-by: Wei Yang <richard.weiyang@gmail.com> >>>>> Fixes: 60fbb14396d5 ("mm/huge_memory: adjust try_to_migrate_one() and split_huge_pmd_locked()") >>>>> Cc: Gavin Guo <gavinguo@igalia.com> >>>>> Cc: "David Hildenbrand (Red Hat)" <david@kernel.org> >>>>> Cc: Zi Yan <ziy@nvidia.com> >>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> >>>>> Cc: <stable@vger.kernel.org> >>>>> --- >>>>> mm/rmap.c | 1 - >>>>> 1 file changed, 1 deletion(-) >>>>> >>>>> diff --git a/mm/rmap.c b/mm/rmap.c >>>>> index 618df3385c8b..eed971568d65 100644 >>>>> --- a/mm/rmap.c >>>>> +++ b/mm/rmap.c >>>>> @@ -2448,7 +2448,6 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, >>>>> if (flags & TTU_SPLIT_HUGE_PMD) { >>>>> split_huge_pmd_locked(vma, pvmw.address, >>>>> pvmw.pmd, true); >>>>> - ret = false; >>>>> page_vma_mapped_walk_done(&pvmw); >>>>> break; >>>>> } >>>> >>>> How about the patch below? It matches the pattern of set_pmd_migration_entry() below. >>>> Basically, continue if the operation is successful, break otherwise. 
>>>> >>>> diff --git a/mm/rmap.c b/mm/rmap.c >>>> index 618df3385c8b..83cc9d98533e 100644 >>>> --- a/mm/rmap.c >>>> +++ b/mm/rmap.c >>>> @@ -2448,9 +2448,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, >>>> if (flags & TTU_SPLIT_HUGE_PMD) { >>>> split_huge_pmd_locked(vma, pvmw.address, >>>> pvmw.pmd, true); >>>> - ret = false; >>>> - page_vma_mapped_walk_done(&pvmw); >>>> - break; >>>> + continue; >>>> } >>> >>> Per my understanding if @freeze is trur, split_huge_pmd_locked() may "fail" as >>> the comment says: >>> >>> * Without "freeze", we'll simply split the PMD, propagating the >>> * PageAnonExclusive() flag for each PTE by setting it for >>> * each subpage -- no need to (temporarily) clear. >>> * >>> * With "freeze" we want to replace mapped pages by >>> * migration entries right away. This is only possible if we >>> * managed to clear PageAnonExclusive() -- see >>> * set_pmd_migration_entry(). >>> * >>> * In case we cannot clear PageAnonExclusive(), split the PMD >>> * only and let try_to_migrate_one() fail later. >>> >>> While currently we don't return the status of split_huge_pmd_locked() to >>> indicate whether it does replaced PMD with migration entries successfully. So >>> we are not sure this operation succeed. >> >> This is the right reasoning. This means to properly handle it, split_huge_pmd_locked() >> needs to return whether it inserts migration entries or not when freeze is true. >> >>> >>> Another difference from set_pmd_migration_entry() is split_huge_pmd_locked() >>> would change the page table from PMD mapped to PTE mapped. >>> page_vma_mapped_walk() can handle it now for (pvmw->pmd && !pvmw->pte), but I >>> am not sure this is what we expected. For example, in try_to_unmap_one(), we >>> use page_vma_mapped_walk_restart() after pmd splitted. >>> >>> So I prefer just remove the "ret = false" for a fix. Not sure this is >>> reasonable to you. >>> >>> I am thinking two things after this fix: >>> >>> * add one similar test in selftests >>> * let split_huge_pmd_locked() return value to indicate freeze is degrade to >>> !freeze, and fail early on try_to_migrate() like the thp migration branch >>> >>> Look forward your opinion on whether it worth to do it. >> >> This is not the right fix, neither was mine above. Because before commit 60fbb14396d5, >> the code handles PAE properly. If PAE is cleared, PMD is split into PTEs and each >> PTE becomes a migration entry, page_vma_mapped_walk(&pvmw) returns false, >> and try_to_migrate_one() returns true. If PAE is not cleared, PMD is split into PTEs >> and each PTE is not a migration entry, inside while (page_vma_mapped_walk(&pvmw)), >> PAE will be attempted to get cleared again and it will fail again, leading to >> try_to_migrate_one() returns false. After commit 60fbb14396d5, no matter PAE is >> cleared or not, try_to_migrate_one() always returns false. It causes folio split >> failures for shared PMD THPs. >> >> Now with your fix (and mine above), no matter PAE is cleared or not, try_to_migrate_one() >> always returns true. It just flips the code to a different issue. So the proper fix >> is to let split_huge_pmd_locked() returns whether it inserts migration entries or not >> and do the same pattern as THP migration code path. > > How about aligning with the try_to_unmap_one()? 
The behavior would be the same before applying the commit 60fbb14396d5: > > diff --git a/mm/rmap.c b/mm/rmap.c > index 7b9879ef442d..0c96f0883013 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -2333,9 +2333,9 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, > if (flags & TTU_SPLIT_HUGE_PMD) { > split_huge_pmd_locked(vma, pvmw.address, > pvmw.pmd, true); > - ret = false; > - page_vma_mapped_walk_done(&pvmw); > - break; > + flags &= ~TTU_SPLIT_HUGE_PMD; > + page_vma_mapped_walk_restart(&pvmw); > + continue; > } > #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION > pmdval = pmdp_get(pvmw.pmd); Yes, it works and definitely needs a comment like "After split_huge_pmd_locked(), restart the walk to detect PageAnonExclusive handling failure in __split_huge_pmd_locked()". The change is good for backporting, but an additional patch to fix it properly by adding a return value to split_huge_pmd_locked() is also necessary. > > >> >> >> Hi David, >> >> In terms of unmap_folio(), which is the only user of split_huge_pmd_locked(..., freeze=true), >> there is no folio_mapped() check afterwards. That might be causing an issue, >> when the folio is pinned between the refcount check and unmap_folio(), unmap_folio() >> fails, but folio split code proceeds. That means the folio is still accessible >> via PTEs and later remove_migration_pte() will try to remove non migration PTEs. >> It needs to be fixed separately, right? >> >> >>> >>>> #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION >>>> pmdval = pmdp_get(pvmw.pmd); >>>> >>>> >>>> >>>> -- >>>> Best Regards, >>>> Yan, Zi >>> >>> -- >>> Wei Yang >>> Help you, Help me >> >> >> -- >> Best Regards, >> Yan, Zi -- Best Regards, Yan, Zi ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] mm/huge_memory: fix early failure try_to_migrate() when split huge pmd for shared thp 2026-02-01 14:20 ` Zi Yan @ 2026-02-03 0:00 ` Wei Yang 2026-02-03 0:07 ` Zi Yan 2026-02-03 13:20 ` Lance Yang 1 sibling, 1 reply; 13+ messages in thread From: Wei Yang @ 2026-02-03 0:00 UTC (permalink / raw) To: Zi Yan Cc: Gavin Guo, Wei Yang, david, akpm, lorenzo.stoakes, riel, Liam.Howlett, vbabka, harry.yoo, jannh, baolin.wang, linux-mm, stable, Gavin Shan On Sun, Feb 01, 2026 at 09:20:35AM -0500, Zi Yan wrote: >On 1 Feb 2026, at 8:04, Gavin Guo wrote: > >> On 2/1/26 11:39, Zi Yan wrote: >>> On 31 Jan 2026, at 21:09, Wei Yang wrote: >>> >>>> On Fri, Jan 30, 2026 at 09:44:10PM -0500, Zi Yan wrote: >>>>> On 30 Jan 2026, at 18:00, Wei Yang wrote: >>>>> >>>>>> Commit 60fbb14396d5 ("mm/huge_memory: adjust try_to_migrate_one() and >>>>>> split_huge_pmd_locked()") return false unconditionally after >>>>>> split_huge_pmd_locked() which may fail early during try_to_migrate() for >>>>>> shared thp. This will lead to unexpected folio split failure. >>>>>> >>>>>> One way to reproduce: >>>>>> >>>>>> Create an anonymous thp range and fork 512 children, so we have a >>>>>> thp shared mapped in 513 processes. Then trigger folio split with >>>>>> /sys/kernel/debug/split_huge_pages debugfs to split the thp folio to >>>>>> order 0. >>>>>> >>>>>> Without the above commit, we can successfully split to order 0. >>>>>> With the above commit, the folio is still a large folio. >>>>>> >>>>>> The reason is the above commit return false after split pmd >>>>>> unconditionally in the first process and break try_to_migrate(). >>>>> >>>>> The reasoning looks good to me. >>>>> >>>>>> >>>>>> The tricky thing in above reproduce method is current debugfs interface >>>>>> leverage function split_huge_pages_pid(), which will iterate the whole >>>>>> pmd range and do folio split on each base page address. This means it >>>>>> will try 512 times, and each time split one pmd from pmd mapped to pte >>>>>> mapped thp. If there are less than 512 shared mapped process, >>>>>> the folio is still split successfully at last. But in real world, we >>>>>> usually try it for once. >>>>>> >>>>>> This patch fixes this by removing the unconditional false return after >>>>>> split_huge_pmd_locked(). Later, we may introduce a true fail early if >>>>>> split_huge_pmd_locked() does fail. >>>>>> >>>>>> Signed-off-by: Wei Yang <richard.weiyang@gmail.com> >>>>>> Fixes: 60fbb14396d5 ("mm/huge_memory: adjust try_to_migrate_one() and split_huge_pmd_locked()") >>>>>> Cc: Gavin Guo <gavinguo@igalia.com> >>>>>> Cc: "David Hildenbrand (Red Hat)" <david@kernel.org> >>>>>> Cc: Zi Yan <ziy@nvidia.com> >>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> >>>>>> Cc: <stable@vger.kernel.org> >>>>>> --- >>>>>> mm/rmap.c | 1 - >>>>>> 1 file changed, 1 deletion(-) >>>>>> >>>>>> diff --git a/mm/rmap.c b/mm/rmap.c >>>>>> index 618df3385c8b..eed971568d65 100644 >>>>>> --- a/mm/rmap.c >>>>>> +++ b/mm/rmap.c >>>>>> @@ -2448,7 +2448,6 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, >>>>>> if (flags & TTU_SPLIT_HUGE_PMD) { >>>>>> split_huge_pmd_locked(vma, pvmw.address, >>>>>> pvmw.pmd, true); >>>>>> - ret = false; >>>>>> page_vma_mapped_walk_done(&pvmw); >>>>>> break; >>>>>> } >>>>> >>>>> How about the patch below? It matches the pattern of set_pmd_migration_entry() below. >>>>> Basically, continue if the operation is successful, break otherwise. 
>>>>> >>>>> diff --git a/mm/rmap.c b/mm/rmap.c >>>>> index 618df3385c8b..83cc9d98533e 100644 >>>>> --- a/mm/rmap.c >>>>> +++ b/mm/rmap.c >>>>> @@ -2448,9 +2448,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, >>>>> if (flags & TTU_SPLIT_HUGE_PMD) { >>>>> split_huge_pmd_locked(vma, pvmw.address, >>>>> pvmw.pmd, true); >>>>> - ret = false; >>>>> - page_vma_mapped_walk_done(&pvmw); >>>>> - break; >>>>> + continue; >>>>> } >>>> >>>> Per my understanding if @freeze is trur, split_huge_pmd_locked() may "fail" as >>>> the comment says: >>>> >>>> * Without "freeze", we'll simply split the PMD, propagating the >>>> * PageAnonExclusive() flag for each PTE by setting it for >>>> * each subpage -- no need to (temporarily) clear. >>>> * >>>> * With "freeze" we want to replace mapped pages by >>>> * migration entries right away. This is only possible if we >>>> * managed to clear PageAnonExclusive() -- see >>>> * set_pmd_migration_entry(). >>>> * >>>> * In case we cannot clear PageAnonExclusive(), split the PMD >>>> * only and let try_to_migrate_one() fail later. >>>> >>>> While currently we don't return the status of split_huge_pmd_locked() to >>>> indicate whether it does replaced PMD with migration entries successfully. So >>>> we are not sure this operation succeed. >>> >>> This is the right reasoning. This means to properly handle it, split_huge_pmd_locked() >>> needs to return whether it inserts migration entries or not when freeze is true. >>> >>>> >>>> Another difference from set_pmd_migration_entry() is split_huge_pmd_locked() >>>> would change the page table from PMD mapped to PTE mapped. >>>> page_vma_mapped_walk() can handle it now for (pvmw->pmd && !pvmw->pte), but I >>>> am not sure this is what we expected. For example, in try_to_unmap_one(), we >>>> use page_vma_mapped_walk_restart() after pmd splitted. >>>> >>>> So I prefer just remove the "ret = false" for a fix. Not sure this is >>>> reasonable to you. >>>> >>>> I am thinking two things after this fix: >>>> >>>> * add one similar test in selftests >>>> * let split_huge_pmd_locked() return value to indicate freeze is degrade to >>>> !freeze, and fail early on try_to_migrate() like the thp migration branch >>>> >>>> Look forward your opinion on whether it worth to do it. >>> >>> This is not the right fix, neither was mine above. Because before commit 60fbb14396d5, >>> the code handles PAE properly. If PAE is cleared, PMD is split into PTEs and each >>> PTE becomes a migration entry, page_vma_mapped_walk(&pvmw) returns false, >>> and try_to_migrate_one() returns true. If PAE is not cleared, PMD is split into PTEs >>> and each PTE is not a migration entry, inside while (page_vma_mapped_walk(&pvmw)), >>> PAE will be attempted to get cleared again and it will fail again, leading to >>> try_to_migrate_one() returns false. After commit 60fbb14396d5, no matter PAE is >>> cleared or not, try_to_migrate_one() always returns false. It causes folio split >>> failures for shared PMD THPs. >>> >>> Now with your fix (and mine above), no matter PAE is cleared or not, try_to_migrate_one() >>> always returns true. It just flips the code to a different issue. So the proper fix >>> is to let split_huge_pmd_locked() returns whether it inserts migration entries or not >>> and do the same pattern as THP migration code path. >> >> How about aligning with the try_to_unmap_one()? 
The behavior would be the same as before applying commit 60fbb14396d5:
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 7b9879ef442d..0c96f0883013 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -2333,9 +2333,9 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>>  		if (flags & TTU_SPLIT_HUGE_PMD) {
>>  			split_huge_pmd_locked(vma, pvmw.address,
>>  					      pvmw.pmd, true);
>> -			ret = false;
>> -			page_vma_mapped_walk_done(&pvmw);
>> -			break;
>> +			flags &= ~TTU_SPLIT_HUGE_PMD;
>> +			page_vma_mapped_walk_restart(&pvmw);
>> +			continue;
>>  		}
>>  #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
>>  		pmdval = pmdp_get(pvmw.pmd);
>
>Yes, it works and definitely needs a comment like "After split_huge_pmd_locked(), restart
>the walk to detect PageAnonExclusive handling failure in __split_huge_pmd_locked()".
>The change is good for backporting, but an additional patch to fix it properly by adding
>a return value to split_huge_pmd_locked() is also necessary.
>

If my understanding is correct, this approach is good for backporting.

And yes, we could further improve it by returning a value to indicate
whether split_huge_pmd_locked() did split to migration entries.

Thanks to both of you for your thoughtful input.

-- 
Wei Yang
Help you, Help me

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [PATCH] mm/huge_memory: fix early failure try_to_migrate() when split huge pmd for shared thp 2026-02-03 0:00 ` Wei Yang @ 2026-02-03 0:07 ` Zi Yan 2026-02-03 13:04 ` Wei Yang 0 siblings, 1 reply; 13+ messages in thread From: Zi Yan @ 2026-02-03 0:07 UTC (permalink / raw) To: Wei Yang Cc: Gavin Guo, david, akpm, lorenzo.stoakes, riel, Liam.Howlett, vbabka, harry.yoo, jannh, baolin.wang, linux-mm, stable, Gavin Shan On 2 Feb 2026, at 19:00, Wei Yang wrote: > On Sun, Feb 01, 2026 at 09:20:35AM -0500, Zi Yan wrote: >> On 1 Feb 2026, at 8:04, Gavin Guo wrote: >> >>> On 2/1/26 11:39, Zi Yan wrote: >>>> On 31 Jan 2026, at 21:09, Wei Yang wrote: >>>> >>>>> On Fri, Jan 30, 2026 at 09:44:10PM -0500, Zi Yan wrote: >>>>>> On 30 Jan 2026, at 18:00, Wei Yang wrote: >>>>>> >>>>>>> Commit 60fbb14396d5 ("mm/huge_memory: adjust try_to_migrate_one() and >>>>>>> split_huge_pmd_locked()") return false unconditionally after >>>>>>> split_huge_pmd_locked() which may fail early during try_to_migrate() for >>>>>>> shared thp. This will lead to unexpected folio split failure. >>>>>>> >>>>>>> One way to reproduce: >>>>>>> >>>>>>> Create an anonymous thp range and fork 512 children, so we have a >>>>>>> thp shared mapped in 513 processes. Then trigger folio split with >>>>>>> /sys/kernel/debug/split_huge_pages debugfs to split the thp folio to >>>>>>> order 0. >>>>>>> >>>>>>> Without the above commit, we can successfully split to order 0. >>>>>>> With the above commit, the folio is still a large folio. >>>>>>> >>>>>>> The reason is the above commit return false after split pmd >>>>>>> unconditionally in the first process and break try_to_migrate(). >>>>>> >>>>>> The reasoning looks good to me. >>>>>> >>>>>>> >>>>>>> The tricky thing in above reproduce method is current debugfs interface >>>>>>> leverage function split_huge_pages_pid(), which will iterate the whole >>>>>>> pmd range and do folio split on each base page address. This means it >>>>>>> will try 512 times, and each time split one pmd from pmd mapped to pte >>>>>>> mapped thp. If there are less than 512 shared mapped process, >>>>>>> the folio is still split successfully at last. But in real world, we >>>>>>> usually try it for once. >>>>>>> >>>>>>> This patch fixes this by removing the unconditional false return after >>>>>>> split_huge_pmd_locked(). Later, we may introduce a true fail early if >>>>>>> split_huge_pmd_locked() does fail. >>>>>>> >>>>>>> Signed-off-by: Wei Yang <richard.weiyang@gmail.com> >>>>>>> Fixes: 60fbb14396d5 ("mm/huge_memory: adjust try_to_migrate_one() and split_huge_pmd_locked()") >>>>>>> Cc: Gavin Guo <gavinguo@igalia.com> >>>>>>> Cc: "David Hildenbrand (Red Hat)" <david@kernel.org> >>>>>>> Cc: Zi Yan <ziy@nvidia.com> >>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> >>>>>>> Cc: <stable@vger.kernel.org> >>>>>>> --- >>>>>>> mm/rmap.c | 1 - >>>>>>> 1 file changed, 1 deletion(-) >>>>>>> >>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c >>>>>>> index 618df3385c8b..eed971568d65 100644 >>>>>>> --- a/mm/rmap.c >>>>>>> +++ b/mm/rmap.c >>>>>>> @@ -2448,7 +2448,6 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, >>>>>>> if (flags & TTU_SPLIT_HUGE_PMD) { >>>>>>> split_huge_pmd_locked(vma, pvmw.address, >>>>>>> pvmw.pmd, true); >>>>>>> - ret = false; >>>>>>> page_vma_mapped_walk_done(&pvmw); >>>>>>> break; >>>>>>> } >>>>>> >>>>>> How about the patch below? It matches the pattern of set_pmd_migration_entry() below. >>>>>> Basically, continue if the operation is successful, break otherwise. 
>>>>>> >>>>>> diff --git a/mm/rmap.c b/mm/rmap.c >>>>>> index 618df3385c8b..83cc9d98533e 100644 >>>>>> --- a/mm/rmap.c >>>>>> +++ b/mm/rmap.c >>>>>> @@ -2448,9 +2448,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, >>>>>> if (flags & TTU_SPLIT_HUGE_PMD) { >>>>>> split_huge_pmd_locked(vma, pvmw.address, >>>>>> pvmw.pmd, true); >>>>>> - ret = false; >>>>>> - page_vma_mapped_walk_done(&pvmw); >>>>>> - break; >>>>>> + continue; >>>>>> } >>>>> >>>>> Per my understanding if @freeze is trur, split_huge_pmd_locked() may "fail" as >>>>> the comment says: >>>>> >>>>> * Without "freeze", we'll simply split the PMD, propagating the >>>>> * PageAnonExclusive() flag for each PTE by setting it for >>>>> * each subpage -- no need to (temporarily) clear. >>>>> * >>>>> * With "freeze" we want to replace mapped pages by >>>>> * migration entries right away. This is only possible if we >>>>> * managed to clear PageAnonExclusive() -- see >>>>> * set_pmd_migration_entry(). >>>>> * >>>>> * In case we cannot clear PageAnonExclusive(), split the PMD >>>>> * only and let try_to_migrate_one() fail later. >>>>> >>>>> While currently we don't return the status of split_huge_pmd_locked() to >>>>> indicate whether it does replaced PMD with migration entries successfully. So >>>>> we are not sure this operation succeed. >>>> >>>> This is the right reasoning. This means to properly handle it, split_huge_pmd_locked() >>>> needs to return whether it inserts migration entries or not when freeze is true. >>>> >>>>> >>>>> Another difference from set_pmd_migration_entry() is split_huge_pmd_locked() >>>>> would change the page table from PMD mapped to PTE mapped. >>>>> page_vma_mapped_walk() can handle it now for (pvmw->pmd && !pvmw->pte), but I >>>>> am not sure this is what we expected. For example, in try_to_unmap_one(), we >>>>> use page_vma_mapped_walk_restart() after pmd splitted. >>>>> >>>>> So I prefer just remove the "ret = false" for a fix. Not sure this is >>>>> reasonable to you. >>>>> >>>>> I am thinking two things after this fix: >>>>> >>>>> * add one similar test in selftests >>>>> * let split_huge_pmd_locked() return value to indicate freeze is degrade to >>>>> !freeze, and fail early on try_to_migrate() like the thp migration branch >>>>> >>>>> Look forward your opinion on whether it worth to do it. >>>> >>>> This is not the right fix, neither was mine above. Because before commit 60fbb14396d5, >>>> the code handles PAE properly. If PAE is cleared, PMD is split into PTEs and each >>>> PTE becomes a migration entry, page_vma_mapped_walk(&pvmw) returns false, >>>> and try_to_migrate_one() returns true. If PAE is not cleared, PMD is split into PTEs >>>> and each PTE is not a migration entry, inside while (page_vma_mapped_walk(&pvmw)), >>>> PAE will be attempted to get cleared again and it will fail again, leading to >>>> try_to_migrate_one() returns false. After commit 60fbb14396d5, no matter PAE is >>>> cleared or not, try_to_migrate_one() always returns false. It causes folio split >>>> failures for shared PMD THPs. >>>> >>>> Now with your fix (and mine above), no matter PAE is cleared or not, try_to_migrate_one() >>>> always returns true. It just flips the code to a different issue. So the proper fix >>>> is to let split_huge_pmd_locked() returns whether it inserts migration entries or not >>>> and do the same pattern as THP migration code path. >>> >>> How about aligning with the try_to_unmap_one()? 
The behavior would be the same before applying the commit 60fbb14396d5: >>> >>> diff --git a/mm/rmap.c b/mm/rmap.c >>> index 7b9879ef442d..0c96f0883013 100644 >>> --- a/mm/rmap.c >>> +++ b/mm/rmap.c >>> @@ -2333,9 +2333,9 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, >>> if (flags & TTU_SPLIT_HUGE_PMD) { >>> split_huge_pmd_locked(vma, pvmw.address, >>> pvmw.pmd, true); >>> - ret = false; >>> - page_vma_mapped_walk_done(&pvmw); >>> - break; >>> + flags &= ~TTU_SPLIT_HUGE_PMD; >>> + page_vma_mapped_walk_restart(&pvmw); >>> + continue; >>> } >>> #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION >>> pmdval = pmdp_get(pvmw.pmd); >> >> Yes, it works and definitely needs a comment like "After split_huge_pmd_locked(), restart >> the walk to detect PageAnonExclusive handling failure in __split_huge_pmd_locked()". >> The change is good for backporting, but an additional patch to fix it properly by adding >> a return value to split_huge_pmd_locked() is also necessary. >> > > If my understanding is correct, this approach is good for backporting. > > And yes, we could further improve it by return a value to indicate whether > split_huge_pmd_locked() do split to migration entry. > > Thanks both for your thoughtful inputs. Are you going to send two patches in a series, one is the above fix with a comment and the other changes split_huge_pmd_locked() to return a value? Best Regards, Yan, Zi ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] mm/huge_memory: fix early failure try_to_migrate() when split huge pmd for shared thp 2026-02-03 0:07 ` Zi Yan @ 2026-02-03 13:04 ` Wei Yang 2026-02-03 13:07 ` Zi Yan 0 siblings, 1 reply; 13+ messages in thread From: Wei Yang @ 2026-02-03 13:04 UTC (permalink / raw) To: Zi Yan Cc: Wei Yang, Gavin Guo, david, akpm, lorenzo.stoakes, riel, Liam.Howlett, vbabka, harry.yoo, jannh, baolin.wang, linux-mm, stable, Gavin Shan On Mon, Feb 02, 2026 at 07:07:12PM -0500, Zi Yan wrote: >On 2 Feb 2026, at 19:00, Wei Yang wrote: > >> On Sun, Feb 01, 2026 at 09:20:35AM -0500, Zi Yan wrote: >>> On 1 Feb 2026, at 8:04, Gavin Guo wrote: >>> >>>> On 2/1/26 11:39, Zi Yan wrote: >>>>> On 31 Jan 2026, at 21:09, Wei Yang wrote: >>>>> >>>>>> On Fri, Jan 30, 2026 at 09:44:10PM -0500, Zi Yan wrote: >>>>>>> On 30 Jan 2026, at 18:00, Wei Yang wrote: >>>>>>> >>>>>>>> Commit 60fbb14396d5 ("mm/huge_memory: adjust try_to_migrate_one() and >>>>>>>> split_huge_pmd_locked()") return false unconditionally after >>>>>>>> split_huge_pmd_locked() which may fail early during try_to_migrate() for >>>>>>>> shared thp. This will lead to unexpected folio split failure. >>>>>>>> >>>>>>>> One way to reproduce: >>>>>>>> >>>>>>>> Create an anonymous thp range and fork 512 children, so we have a >>>>>>>> thp shared mapped in 513 processes. Then trigger folio split with >>>>>>>> /sys/kernel/debug/split_huge_pages debugfs to split the thp folio to >>>>>>>> order 0. >>>>>>>> >>>>>>>> Without the above commit, we can successfully split to order 0. >>>>>>>> With the above commit, the folio is still a large folio. >>>>>>>> >>>>>>>> The reason is the above commit return false after split pmd >>>>>>>> unconditionally in the first process and break try_to_migrate(). >>>>>>> >>>>>>> The reasoning looks good to me. >>>>>>> >>>>>>>> >>>>>>>> The tricky thing in above reproduce method is current debugfs interface >>>>>>>> leverage function split_huge_pages_pid(), which will iterate the whole >>>>>>>> pmd range and do folio split on each base page address. This means it >>>>>>>> will try 512 times, and each time split one pmd from pmd mapped to pte >>>>>>>> mapped thp. If there are less than 512 shared mapped process, >>>>>>>> the folio is still split successfully at last. But in real world, we >>>>>>>> usually try it for once. >>>>>>>> >>>>>>>> This patch fixes this by removing the unconditional false return after >>>>>>>> split_huge_pmd_locked(). Later, we may introduce a true fail early if >>>>>>>> split_huge_pmd_locked() does fail. >>>>>>>> >>>>>>>> Signed-off-by: Wei Yang <richard.weiyang@gmail.com> >>>>>>>> Fixes: 60fbb14396d5 ("mm/huge_memory: adjust try_to_migrate_one() and split_huge_pmd_locked()") >>>>>>>> Cc: Gavin Guo <gavinguo@igalia.com> >>>>>>>> Cc: "David Hildenbrand (Red Hat)" <david@kernel.org> >>>>>>>> Cc: Zi Yan <ziy@nvidia.com> >>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> >>>>>>>> Cc: <stable@vger.kernel.org> >>>>>>>> --- >>>>>>>> mm/rmap.c | 1 - >>>>>>>> 1 file changed, 1 deletion(-) >>>>>>>> >>>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c >>>>>>>> index 618df3385c8b..eed971568d65 100644 >>>>>>>> --- a/mm/rmap.c >>>>>>>> +++ b/mm/rmap.c >>>>>>>> @@ -2448,7 +2448,6 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, >>>>>>>> if (flags & TTU_SPLIT_HUGE_PMD) { >>>>>>>> split_huge_pmd_locked(vma, pvmw.address, >>>>>>>> pvmw.pmd, true); >>>>>>>> - ret = false; >>>>>>>> page_vma_mapped_walk_done(&pvmw); >>>>>>>> break; >>>>>>>> } >>>>>>> >>>>>>> How about the patch below? 
It matches the pattern of set_pmd_migration_entry() below. >>>>>>> Basically, continue if the operation is successful, break otherwise. >>>>>>> >>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c >>>>>>> index 618df3385c8b..83cc9d98533e 100644 >>>>>>> --- a/mm/rmap.c >>>>>>> +++ b/mm/rmap.c >>>>>>> @@ -2448,9 +2448,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, >>>>>>> if (flags & TTU_SPLIT_HUGE_PMD) { >>>>>>> split_huge_pmd_locked(vma, pvmw.address, >>>>>>> pvmw.pmd, true); >>>>>>> - ret = false; >>>>>>> - page_vma_mapped_walk_done(&pvmw); >>>>>>> - break; >>>>>>> + continue; >>>>>>> } >>>>>> >>>>>> Per my understanding if @freeze is trur, split_huge_pmd_locked() may "fail" as >>>>>> the comment says: >>>>>> >>>>>> * Without "freeze", we'll simply split the PMD, propagating the >>>>>> * PageAnonExclusive() flag for each PTE by setting it for >>>>>> * each subpage -- no need to (temporarily) clear. >>>>>> * >>>>>> * With "freeze" we want to replace mapped pages by >>>>>> * migration entries right away. This is only possible if we >>>>>> * managed to clear PageAnonExclusive() -- see >>>>>> * set_pmd_migration_entry(). >>>>>> * >>>>>> * In case we cannot clear PageAnonExclusive(), split the PMD >>>>>> * only and let try_to_migrate_one() fail later. >>>>>> >>>>>> While currently we don't return the status of split_huge_pmd_locked() to >>>>>> indicate whether it does replaced PMD with migration entries successfully. So >>>>>> we are not sure this operation succeed. >>>>> >>>>> This is the right reasoning. This means to properly handle it, split_huge_pmd_locked() >>>>> needs to return whether it inserts migration entries or not when freeze is true. >>>>> >>>>>> >>>>>> Another difference from set_pmd_migration_entry() is split_huge_pmd_locked() >>>>>> would change the page table from PMD mapped to PTE mapped. >>>>>> page_vma_mapped_walk() can handle it now for (pvmw->pmd && !pvmw->pte), but I >>>>>> am not sure this is what we expected. For example, in try_to_unmap_one(), we >>>>>> use page_vma_mapped_walk_restart() after pmd splitted. >>>>>> >>>>>> So I prefer just remove the "ret = false" for a fix. Not sure this is >>>>>> reasonable to you. >>>>>> >>>>>> I am thinking two things after this fix: >>>>>> >>>>>> * add one similar test in selftests >>>>>> * let split_huge_pmd_locked() return value to indicate freeze is degrade to >>>>>> !freeze, and fail early on try_to_migrate() like the thp migration branch >>>>>> >>>>>> Look forward your opinion on whether it worth to do it. >>>>> >>>>> This is not the right fix, neither was mine above. Because before commit 60fbb14396d5, >>>>> the code handles PAE properly. If PAE is cleared, PMD is split into PTEs and each >>>>> PTE becomes a migration entry, page_vma_mapped_walk(&pvmw) returns false, >>>>> and try_to_migrate_one() returns true. If PAE is not cleared, PMD is split into PTEs >>>>> and each PTE is not a migration entry, inside while (page_vma_mapped_walk(&pvmw)), >>>>> PAE will be attempted to get cleared again and it will fail again, leading to >>>>> try_to_migrate_one() returns false. After commit 60fbb14396d5, no matter PAE is >>>>> cleared or not, try_to_migrate_one() always returns false. It causes folio split >>>>> failures for shared PMD THPs. >>>>> >>>>> Now with your fix (and mine above), no matter PAE is cleared or not, try_to_migrate_one() >>>>> always returns true. It just flips the code to a different issue. 
So the proper fix >>>>> is to let split_huge_pmd_locked() returns whether it inserts migration entries or not >>>>> and do the same pattern as THP migration code path. >>>> >>>> How about aligning with the try_to_unmap_one()? The behavior would be the same before applying the commit 60fbb14396d5: >>>> >>>> diff --git a/mm/rmap.c b/mm/rmap.c >>>> index 7b9879ef442d..0c96f0883013 100644 >>>> --- a/mm/rmap.c >>>> +++ b/mm/rmap.c >>>> @@ -2333,9 +2333,9 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, >>>> if (flags & TTU_SPLIT_HUGE_PMD) { >>>> split_huge_pmd_locked(vma, pvmw.address, >>>> pvmw.pmd, true); >>>> - ret = false; >>>> - page_vma_mapped_walk_done(&pvmw); >>>> - break; >>>> + flags &= ~TTU_SPLIT_HUGE_PMD; >>>> + page_vma_mapped_walk_restart(&pvmw); >>>> + continue; >>>> } >>>> #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION >>>> pmdval = pmdp_get(pvmw.pmd); >>> >>> Yes, it works and definitely needs a comment like "After split_huge_pmd_locked(), restart >>> the walk to detect PageAnonExclusive handling failure in __split_huge_pmd_locked()". >>> The change is good for backporting, but an additional patch to fix it properly by adding >>> a return value to split_huge_pmd_locked() is also necessary. >>> >> >> If my understanding is correct, this approach is good for backporting. >> >> And yes, we could further improve it by return a value to indicate whether >> split_huge_pmd_locked() do split to migration entry. >> >> Thanks both for your thoughtful inputs. > >Are you going to send two patches in a series, one is the above fix with a comment >and the other changes split_huge_pmd_locked() to return a value? > Hmm... as the above fix is supposed to be cc stable and backported, I think separating them is the correct process. And for the return value of split_huge_pmd_locked(), I will take another look at all the call sites. Are you ok with this? Well, do you think we need to wait for David's comment? If not, I will prepare the v2 fix with the above change. >Best Regards, >Yan, Zi -- Wei Yang Help you, Help me ^ permalink raw reply [flat|nested] 13+ messages in thread
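As a concrete illustration of the reproduce steps quoted above, a minimal userspace sketch could look like the program below. This is only a sketch, not the selftest discussed in the thread: it assumes a 2MB PMD size, that the touched range actually gets backed by one THP, root privileges with debugfs mounted, and the /sys/kernel/debug/split_huge_pages write format "<pid>,<vaddr_start>,<vaddr_end>[,<new_order>]" documented in Documentation/admin-guide/mm/transhuge.rst.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define THP_SIZE	(2UL << 20)	/* assumed PMD size */

int main(void)
{
	char *map, *buf;
	FILE *f;
	int i;

	/* Over-map so a 2MB-aligned start address can be picked. */
	map = mmap(NULL, 2 * THP_SIZE, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (map == MAP_FAILED)
		return 1;
	buf = (char *)(((unsigned long)map + THP_SIZE - 1) & ~(THP_SIZE - 1));

	madvise(buf, THP_SIZE, MADV_HUGEPAGE);
	memset(buf, 1, THP_SIZE);	/* fault the range in, hopefully as one THP */

	/* Fork 512 children so the THP is shared by 513 processes. */
	for (i = 0; i < 512; i++) {
		if (fork() == 0) {
			pause();	/* children just keep the mapping alive */
			_exit(0);
		}
	}

	/* Ask debugfs to split the mapped range to order 0. */
	f = fopen("/sys/kernel/debug/split_huge_pages", "w");
	if (!f)
		return 1;
	fprintf(f, "%d,0x%lx,0x%lx", (int)getpid(), (unsigned long)buf,
		(unsigned long)buf + THP_SIZE);
	fclose(f);
	return 0;	/* children are left paused; kill them by hand afterwards */
}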
* Re: [PATCH] mm/huge_memory: fix early failure try_to_migrate() when split huge pmd for shared thp 2026-02-03 13:04 ` Wei Yang @ 2026-02-03 13:07 ` Zi Yan 0 siblings, 0 replies; 13+ messages in thread From: Zi Yan @ 2026-02-03 13:07 UTC (permalink / raw) To: Wei Yang Cc: Gavin Guo, david, akpm, lorenzo.stoakes, riel, Liam.Howlett, vbabka, harry.yoo, jannh, baolin.wang, linux-mm, stable, Gavin Shan On 3 Feb 2026, at 8:04, Wei Yang wrote: > On Mon, Feb 02, 2026 at 07:07:12PM -0500, Zi Yan wrote: >> On 2 Feb 2026, at 19:00, Wei Yang wrote: >> >>> On Sun, Feb 01, 2026 at 09:20:35AM -0500, Zi Yan wrote: >>>> On 1 Feb 2026, at 8:04, Gavin Guo wrote: >>>> >>>>> On 2/1/26 11:39, Zi Yan wrote: >>>>>> On 31 Jan 2026, at 21:09, Wei Yang wrote: >>>>>> >>>>>>> On Fri, Jan 30, 2026 at 09:44:10PM -0500, Zi Yan wrote: >>>>>>>> On 30 Jan 2026, at 18:00, Wei Yang wrote: >>>>>>>> >>>>>>>>> Commit 60fbb14396d5 ("mm/huge_memory: adjust try_to_migrate_one() and >>>>>>>>> split_huge_pmd_locked()") return false unconditionally after >>>>>>>>> split_huge_pmd_locked() which may fail early during try_to_migrate() for >>>>>>>>> shared thp. This will lead to unexpected folio split failure. >>>>>>>>> >>>>>>>>> One way to reproduce: >>>>>>>>> >>>>>>>>> Create an anonymous thp range and fork 512 children, so we have a >>>>>>>>> thp shared mapped in 513 processes. Then trigger folio split with >>>>>>>>> /sys/kernel/debug/split_huge_pages debugfs to split the thp folio to >>>>>>>>> order 0. >>>>>>>>> >>>>>>>>> Without the above commit, we can successfully split to order 0. >>>>>>>>> With the above commit, the folio is still a large folio. >>>>>>>>> >>>>>>>>> The reason is the above commit return false after split pmd >>>>>>>>> unconditionally in the first process and break try_to_migrate(). >>>>>>>> >>>>>>>> The reasoning looks good to me. >>>>>>>> >>>>>>>>> >>>>>>>>> The tricky thing in above reproduce method is current debugfs interface >>>>>>>>> leverage function split_huge_pages_pid(), which will iterate the whole >>>>>>>>> pmd range and do folio split on each base page address. This means it >>>>>>>>> will try 512 times, and each time split one pmd from pmd mapped to pte >>>>>>>>> mapped thp. If there are less than 512 shared mapped process, >>>>>>>>> the folio is still split successfully at last. But in real world, we >>>>>>>>> usually try it for once. >>>>>>>>> >>>>>>>>> This patch fixes this by removing the unconditional false return after >>>>>>>>> split_huge_pmd_locked(). Later, we may introduce a true fail early if >>>>>>>>> split_huge_pmd_locked() does fail. 
>>>>>>>>> >>>>>>>>> Signed-off-by: Wei Yang <richard.weiyang@gmail.com> >>>>>>>>> Fixes: 60fbb14396d5 ("mm/huge_memory: adjust try_to_migrate_one() and split_huge_pmd_locked()") >>>>>>>>> Cc: Gavin Guo <gavinguo@igalia.com> >>>>>>>>> Cc: "David Hildenbrand (Red Hat)" <david@kernel.org> >>>>>>>>> Cc: Zi Yan <ziy@nvidia.com> >>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> >>>>>>>>> Cc: <stable@vger.kernel.org> >>>>>>>>> --- >>>>>>>>> mm/rmap.c | 1 - >>>>>>>>> 1 file changed, 1 deletion(-) >>>>>>>>> >>>>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c >>>>>>>>> index 618df3385c8b..eed971568d65 100644 >>>>>>>>> --- a/mm/rmap.c >>>>>>>>> +++ b/mm/rmap.c >>>>>>>>> @@ -2448,7 +2448,6 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, >>>>>>>>> if (flags & TTU_SPLIT_HUGE_PMD) { >>>>>>>>> split_huge_pmd_locked(vma, pvmw.address, >>>>>>>>> pvmw.pmd, true); >>>>>>>>> - ret = false; >>>>>>>>> page_vma_mapped_walk_done(&pvmw); >>>>>>>>> break; >>>>>>>>> } >>>>>>>> >>>>>>>> How about the patch below? It matches the pattern of set_pmd_migration_entry() below. >>>>>>>> Basically, continue if the operation is successful, break otherwise. >>>>>>>> >>>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c >>>>>>>> index 618df3385c8b..83cc9d98533e 100644 >>>>>>>> --- a/mm/rmap.c >>>>>>>> +++ b/mm/rmap.c >>>>>>>> @@ -2448,9 +2448,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, >>>>>>>> if (flags & TTU_SPLIT_HUGE_PMD) { >>>>>>>> split_huge_pmd_locked(vma, pvmw.address, >>>>>>>> pvmw.pmd, true); >>>>>>>> - ret = false; >>>>>>>> - page_vma_mapped_walk_done(&pvmw); >>>>>>>> - break; >>>>>>>> + continue; >>>>>>>> } >>>>>>> >>>>>>> Per my understanding if @freeze is trur, split_huge_pmd_locked() may "fail" as >>>>>>> the comment says: >>>>>>> >>>>>>> * Without "freeze", we'll simply split the PMD, propagating the >>>>>>> * PageAnonExclusive() flag for each PTE by setting it for >>>>>>> * each subpage -- no need to (temporarily) clear. >>>>>>> * >>>>>>> * With "freeze" we want to replace mapped pages by >>>>>>> * migration entries right away. This is only possible if we >>>>>>> * managed to clear PageAnonExclusive() -- see >>>>>>> * set_pmd_migration_entry(). >>>>>>> * >>>>>>> * In case we cannot clear PageAnonExclusive(), split the PMD >>>>>>> * only and let try_to_migrate_one() fail later. >>>>>>> >>>>>>> While currently we don't return the status of split_huge_pmd_locked() to >>>>>>> indicate whether it does replaced PMD with migration entries successfully. So >>>>>>> we are not sure this operation succeed. >>>>>> >>>>>> This is the right reasoning. This means to properly handle it, split_huge_pmd_locked() >>>>>> needs to return whether it inserts migration entries or not when freeze is true. >>>>>> >>>>>>> >>>>>>> Another difference from set_pmd_migration_entry() is split_huge_pmd_locked() >>>>>>> would change the page table from PMD mapped to PTE mapped. >>>>>>> page_vma_mapped_walk() can handle it now for (pvmw->pmd && !pvmw->pte), but I >>>>>>> am not sure this is what we expected. For example, in try_to_unmap_one(), we >>>>>>> use page_vma_mapped_walk_restart() after pmd splitted. >>>>>>> >>>>>>> So I prefer just remove the "ret = false" for a fix. Not sure this is >>>>>>> reasonable to you. 
>>>>>>> >>>>>>> I am thinking two things after this fix: >>>>>>> >>>>>>> * add one similar test in selftests >>>>>>> * let split_huge_pmd_locked() return value to indicate freeze is degrade to >>>>>>> !freeze, and fail early on try_to_migrate() like the thp migration branch >>>>>>> >>>>>>> Look forward your opinion on whether it worth to do it. >>>>>> >>>>>> This is not the right fix, neither was mine above. Because before commit 60fbb14396d5, >>>>>> the code handles PAE properly. If PAE is cleared, PMD is split into PTEs and each >>>>>> PTE becomes a migration entry, page_vma_mapped_walk(&pvmw) returns false, >>>>>> and try_to_migrate_one() returns true. If PAE is not cleared, PMD is split into PTEs >>>>>> and each PTE is not a migration entry, inside while (page_vma_mapped_walk(&pvmw)), >>>>>> PAE will be attempted to get cleared again and it will fail again, leading to >>>>>> try_to_migrate_one() returns false. After commit 60fbb14396d5, no matter PAE is >>>>>> cleared or not, try_to_migrate_one() always returns false. It causes folio split >>>>>> failures for shared PMD THPs. >>>>>> >>>>>> Now with your fix (and mine above), no matter PAE is cleared or not, try_to_migrate_one() >>>>>> always returns true. It just flips the code to a different issue. So the proper fix >>>>>> is to let split_huge_pmd_locked() returns whether it inserts migration entries or not >>>>>> and do the same pattern as THP migration code path. >>>>> >>>>> How about aligning with the try_to_unmap_one()? The behavior would be the same before applying the commit 60fbb14396d5: >>>>> >>>>> diff --git a/mm/rmap.c b/mm/rmap.c >>>>> index 7b9879ef442d..0c96f0883013 100644 >>>>> --- a/mm/rmap.c >>>>> +++ b/mm/rmap.c >>>>> @@ -2333,9 +2333,9 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, >>>>> if (flags & TTU_SPLIT_HUGE_PMD) { >>>>> split_huge_pmd_locked(vma, pvmw.address, >>>>> pvmw.pmd, true); >>>>> - ret = false; >>>>> - page_vma_mapped_walk_done(&pvmw); >>>>> - break; >>>>> + flags &= ~TTU_SPLIT_HUGE_PMD; >>>>> + page_vma_mapped_walk_restart(&pvmw); >>>>> + continue; >>>>> } >>>>> #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION >>>>> pmdval = pmdp_get(pvmw.pmd); >>>> >>>> Yes, it works and definitely needs a comment like "After split_huge_pmd_locked(), restart >>>> the walk to detect PageAnonExclusive handling failure in __split_huge_pmd_locked()". >>>> The change is good for backporting, but an additional patch to fix it properly by adding >>>> a return value to split_huge_pmd_locked() is also necessary. >>>> >>> >>> If my understanding is correct, this approach is good for backporting. >>> >>> And yes, we could further improve it by return a value to indicate whether >>> split_huge_pmd_locked() do split to migration entry. >>> >>> Thanks both for your thoughtful inputs. >> >> Are you going to send two patches in a series, one is the above fix with a comment >> and the other changes split_huge_pmd_locked() to return a value? >> > > Hmm... as the above fix is supposed to be cc stable and backported, I think > separating them is the correct process. And for the return value of Right. Andrew prefers it this way. You can send the fix first and send the refactor patch after the fix is picked up by Andrew. > split_huge_pmd_locked(), I will take another look at all the call sites. Are > you ok with this? > > Well, do you think we need to wait for David's comment? If not, I will prepare > the v2 fix with the above change. You answered my question. There should be no issue with the existing code. Thanks.
Best Regards, Yan, Zi ^ permalink raw reply [flat|nested] 13+ messages in thread
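To make the direction above concrete, the follow-up refactor might end up shaped like the sketch below. This is a sketch only: in current mainline split_huge_pmd_locked() returns void, so both the return value and its assumed meaning (true only when, with freeze == true, PageAnonExclusive was cleared and migration entries were installed) are hypothetical here, and every other call site would still need the audit mentioned above.

	if (flags & TTU_SPLIT_HUGE_PMD) {
		if (!split_huge_pmd_locked(vma, pvmw.address,
					   pvmw.pmd, true)) {
			/*
			 * PageAnonExclusive could not be cleared; the PMD
			 * was split to present PTEs only. Fail early, like
			 * the THP migration path below does.
			 */
			ret = false;
			page_vma_mapped_walk_done(&pvmw);
			break;
		}
		/* The whole range is migration entries now; keep going. */
		continue;
	}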
* Re: [PATCH] mm/huge_memory: fix early failure try_to_migrate() when split huge pmd for shared thp 2026-02-01 14:20 ` Zi Yan 2026-02-03 0:00 ` Wei Yang @ 2026-02-03 13:20 ` Lance Yang 1 sibling, 0 replies; 13+ messages in thread From: Lance Yang @ 2026-02-03 13:20 UTC (permalink / raw) To: ziy Cc: Liam.Howlett, akpm, baolin.wang, david, gavinguo, gshan, harry.yoo, jannh, linux-mm, lorenzo.stoakes, richard.weiyang, riel, stable, vbabka, Lance Yang On Sun, Feb 01, 2026 at 09:20:35AM -0500, Zi Yan wrote: >On 1 Feb 2026, at 8:04, Gavin Guo wrote: > >> On 2/1/26 11:39, Zi Yan wrote: >>> On 31 Jan 2026, at 21:09, Wei Yang wrote: >>> >>>> On Fri, Jan 30, 2026 at 09:44:10PM -0500, Zi Yan wrote: >>>>> On 30 Jan 2026, at 18:00, Wei Yang wrote: >>>>> >>>>>> Commit 60fbb14396d5 ("mm/huge_memory: adjust try_to_migrate_one() and >>>>>> split_huge_pmd_locked()") return false unconditionally after >>>>>> split_huge_pmd_locked() which may fail early during try_to_migrate() for >>>>>> shared thp. This will lead to unexpected folio split failure. >>>>>> >>>>>> One way to reproduce: >>>>>> >>>>>> Create an anonymous thp range and fork 512 children, so we have a >>>>>> thp shared mapped in 513 processes. Then trigger folio split with >>>>>> /sys/kernel/debug/split_huge_pages debugfs to split the thp folio to >>>>>> order 0. >>>>>> >>>>>> Without the above commit, we can successfully split to order 0. >>>>>> With the above commit, the folio is still a large folio. >>>>>> >>>>>> The reason is the above commit return false after split pmd >>>>>> unconditionally in the first process and break try_to_migrate(). >>>>> >>>>> The reasoning looks good to me. >>>>> >>>>>> >>>>>> The tricky thing in above reproduce method is current debugfs interface >>>>>> leverage function split_huge_pages_pid(), which will iterate the whole >>>>>> pmd range and do folio split on each base page address. This means it >>>>>> will try 512 times, and each time split one pmd from pmd mapped to pte >>>>>> mapped thp. If there are less than 512 shared mapped process, >>>>>> the folio is still split successfully at last. But in real world, we >>>>>> usually try it for once. >>>>>> >>>>>> This patch fixes this by removing the unconditional false return after >>>>>> split_huge_pmd_locked(). Later, we may introduce a true fail early if >>>>>> split_huge_pmd_locked() does fail. >>>>>> >>>>>> Signed-off-by: Wei Yang <richard.weiyang@gmail.com> >>>>>> Fixes: 60fbb14396d5 ("mm/huge_memory: adjust try_to_migrate_one() and split_huge_pmd_locked()") >>>>>> Cc: Gavin Guo <gavinguo@igalia.com> >>>>>> Cc: "David Hildenbrand (Red Hat)" <david@kernel.org> >>>>>> Cc: Zi Yan <ziy@nvidia.com> >>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> >>>>>> Cc: <stable@vger.kernel.org> >>>>>> --- >>>>>> mm/rmap.c | 1 - >>>>>> 1 file changed, 1 deletion(-) >>>>>> >>>>>> diff --git a/mm/rmap.c b/mm/rmap.c >>>>>> index 618df3385c8b..eed971568d65 100644 >>>>>> --- a/mm/rmap.c >>>>>> +++ b/mm/rmap.c >>>>>> @@ -2448,7 +2448,6 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, >>>>>> if (flags & TTU_SPLIT_HUGE_PMD) { >>>>>> split_huge_pmd_locked(vma, pvmw.address, >>>>>> pvmw.pmd, true); >>>>>> - ret = false; >>>>>> page_vma_mapped_walk_done(&pvmw); >>>>>> break; >>>>>> } >>>>> >>>>> How about the patch below? It matches the pattern of set_pmd_migration_entry() below. >>>>> Basically, continue if the operation is successful, break otherwise. 
>>>>> >>>>> diff --git a/mm/rmap.c b/mm/rmap.c >>>>> index 618df3385c8b..83cc9d98533e 100644 >>>>> --- a/mm/rmap.c >>>>> +++ b/mm/rmap.c >>>>> @@ -2448,9 +2448,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, >>>>> if (flags & TTU_SPLIT_HUGE_PMD) { >>>>> split_huge_pmd_locked(vma, pvmw.address, >>>>> pvmw.pmd, true); >>>>> - ret = false; >>>>> - page_vma_mapped_walk_done(&pvmw); >>>>> - break; >>>>> + continue; >>>>> } >>>> >>>> Per my understanding if @freeze is trur, split_huge_pmd_locked() may "fail" as >>>> the comment says: >>>> >>>> * Without "freeze", we'll simply split the PMD, propagating the >>>> * PageAnonExclusive() flag for each PTE by setting it for >>>> * each subpage -- no need to (temporarily) clear. >>>> * >>>> * With "freeze" we want to replace mapped pages by >>>> * migration entries right away. This is only possible if we >>>> * managed to clear PageAnonExclusive() -- see >>>> * set_pmd_migration_entry(). >>>> * >>>> * In case we cannot clear PageAnonExclusive(), split the PMD >>>> * only and let try_to_migrate_one() fail later. >>>> >>>> While currently we don't return the status of split_huge_pmd_locked() to >>>> indicate whether it does replaced PMD with migration entries successfully. So >>>> we are not sure this operation succeed. >>> >>> This is the right reasoning. This means to properly handle it, split_huge_pmd_locked() >>> needs to return whether it inserts migration entries or not when freeze is true. >>> >>>> >>>> Another difference from set_pmd_migration_entry() is split_huge_pmd_locked() >>>> would change the page table from PMD mapped to PTE mapped. >>>> page_vma_mapped_walk() can handle it now for (pvmw->pmd && !pvmw->pte), but I >>>> am not sure this is what we expected. For example, in try_to_unmap_one(), we >>>> use page_vma_mapped_walk_restart() after pmd splitted. >>>> >>>> So I prefer just remove the "ret = false" for a fix. Not sure this is >>>> reasonable to you. >>>> >>>> I am thinking two things after this fix: >>>> >>>> * add one similar test in selftests >>>> * let split_huge_pmd_locked() return value to indicate freeze is degrade to >>>> !freeze, and fail early on try_to_migrate() like the thp migration branch >>>> >>>> Look forward your opinion on whether it worth to do it. >>> >>> This is not the right fix, neither was mine above. Because before commit 60fbb14396d5, >>> the code handles PAE properly. If PAE is cleared, PMD is split into PTEs and each >>> PTE becomes a migration entry, page_vma_mapped_walk(&pvmw) returns false, >>> and try_to_migrate_one() returns true. If PAE is not cleared, PMD is split into PTEs >>> and each PTE is not a migration entry, inside while (page_vma_mapped_walk(&pvmw)), >>> PAE will be attempted to get cleared again and it will fail again, leading to >>> try_to_migrate_one() returns false. After commit 60fbb14396d5, no matter PAE is >>> cleared or not, try_to_migrate_one() always returns false. It causes folio split >>> failures for shared PMD THPs. >>> >>> Now with your fix (and mine above), no matter PAE is cleared or not, try_to_migrate_one() >>> always returns true. It just flips the code to a different issue. So the proper fix >>> is to let split_huge_pmd_locked() returns whether it inserts migration entries or not >>> and do the same pattern as THP migration code path. >> >> How about aligning with the try_to_unmap_one()? 
The behavior would be the same before applying the commit 60fbb14396d5: >> >> diff --git a/mm/rmap.c b/mm/rmap.c >> index 7b9879ef442d..0c96f0883013 100644 >> --- a/mm/rmap.c >> +++ b/mm/rmap.c >> @@ -2333,9 +2333,9 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, >> if (flags & TTU_SPLIT_HUGE_PMD) { >> split_huge_pmd_locked(vma, pvmw.address, >> pvmw.pmd, true); >> - ret = false; >> - page_vma_mapped_walk_done(&pvmw); >> - break; >> + flags &= ~TTU_SPLIT_HUGE_PMD; >> + page_vma_mapped_walk_restart(&pvmw); >> + continue; >> } >> #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION >> pmdval = pmdp_get(pvmw.pmd); > >Yes, it works and definitely needs a comment like "After split_huge_pmd_locked(), restart >the walk to detect PageAnonExclusive handling failure in __split_huge_pmd_locked()". >The change is good for backporting, but an additional patch to fix it properly by adding >a return value to split_huge_pmd_locked() is also necessary. Right. IIUC, after split_huge_pmd_locked() we always have 512 PTEs: either migration entries, or present PTEs with PageAnonExclusive still set. And try_to_migrate_one() doesn't use PVMW_MIGRATION. So when we restart the walk, either the PTEs are migration entries, in which case map_pte()/check_pte() won't match, we hit not_found() and leave the loop with ret still true; or the PTEs are still present (with PageAnonExclusive still set), in which case we drop into the normal PTE path, call folio_try_share_anon_rmap_pte() again, and set ret = false when it fails. Cheers, Lance ^ permalink raw reply [flat|nested] 13+ messages in thread
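Putting Gavin's hunk together with the comment Zi Yan asked for, the backportable v2 fix would presumably be shaped like this (a sketch assembled from the thread; the final wording may differ):

	if (flags & TTU_SPLIT_HUGE_PMD) {
		split_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, true);
		/*
		 * After split_huge_pmd_locked(), restart the walk to detect
		 * a PageAnonExclusive handling failure in
		 * __split_huge_pmd_locked(): still-present PTEs fail
		 * folio_try_share_anon_rmap_pte() below and set ret = false,
		 * while migration entries simply end the walk with ret true.
		 */
		flags &= ~TTU_SPLIT_HUGE_PMD;
		page_vma_mapped_walk_restart(&pvmw);
		continue;
	}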
* Re: [PATCH] mm/huge_memory: fix early failure try_to_migrate() when split huge pmd for shared thp 2026-02-01 3:39 ` Zi Yan 2026-02-01 13:04 ` Gavin Guo @ 2026-02-02 23:57 ` Wei Yang 2026-02-03 0:05 ` Zi Yan 1 sibling, 1 reply; 13+ messages in thread From: Wei Yang @ 2026-02-02 23:57 UTC (permalink / raw) To: Zi Yan Cc: Wei Yang, david, akpm, lorenzo.stoakes, riel, Liam.Howlett, vbabka, harry.yoo, jannh, gavinguo, baolin.wang, linux-mm, stable On Sat, Jan 31, 2026 at 10:39:40PM -0500, Zi Yan wrote: >On 31 Jan 2026, at 21:09, Wei Yang wrote: > >> On Fri, Jan 30, 2026 at 09:44:10PM -0500, Zi Yan wrote: >>> On 30 Jan 2026, at 18:00, Wei Yang wrote: >>> >>>> Commit 60fbb14396d5 ("mm/huge_memory: adjust try_to_migrate_one() and >>>> split_huge_pmd_locked()") return false unconditionally after >>>> split_huge_pmd_locked() which may fail early during try_to_migrate() for >>>> shared thp. This will lead to unexpected folio split failure. >>>> >>>> One way to reproduce: >>>> >>>> Create an anonymous thp range and fork 512 children, so we have a >>>> thp shared mapped in 513 processes. Then trigger folio split with >>>> /sys/kernel/debug/split_huge_pages debugfs to split the thp folio to >>>> order 0. >>>> >>>> Without the above commit, we can successfully split to order 0. >>>> With the above commit, the folio is still a large folio. >>>> >>>> The reason is the above commit return false after split pmd >>>> unconditionally in the first process and break try_to_migrate(). >>> >>> The reasoning looks good to me. >>> >>>> >>>> The tricky thing in above reproduce method is current debugfs interface >>>> leverage function split_huge_pages_pid(), which will iterate the whole >>>> pmd range and do folio split on each base page address. This means it >>>> will try 512 times, and each time split one pmd from pmd mapped to pte >>>> mapped thp. If there are less than 512 shared mapped process, >>>> the folio is still split successfully at last. But in real world, we >>>> usually try it for once. >>>> >>>> This patch fixes this by removing the unconditional false return after >>>> split_huge_pmd_locked(). Later, we may introduce a true fail early if >>>> split_huge_pmd_locked() does fail. >>>> >>>> Signed-off-by: Wei Yang <richard.weiyang@gmail.com> >>>> Fixes: 60fbb14396d5 ("mm/huge_memory: adjust try_to_migrate_one() and split_huge_pmd_locked()") >>>> Cc: Gavin Guo <gavinguo@igalia.com> >>>> Cc: "David Hildenbrand (Red Hat)" <david@kernel.org> >>>> Cc: Zi Yan <ziy@nvidia.com> >>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> >>>> Cc: <stable@vger.kernel.org> >>>> --- >>>> mm/rmap.c | 1 - >>>> 1 file changed, 1 deletion(-) >>>> >>>> diff --git a/mm/rmap.c b/mm/rmap.c >>>> index 618df3385c8b..eed971568d65 100644 >>>> --- a/mm/rmap.c >>>> +++ b/mm/rmap.c >>>> @@ -2448,7 +2448,6 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, >>>> if (flags & TTU_SPLIT_HUGE_PMD) { >>>> split_huge_pmd_locked(vma, pvmw.address, >>>> pvmw.pmd, true); >>>> - ret = false; >>>> page_vma_mapped_walk_done(&pvmw); >>>> break; >>>> } >>> >>> How about the patch below? It matches the pattern of set_pmd_migration_entry() below. >>> Basically, continue if the operation is successful, break otherwise. 
>>> >>> diff --git a/mm/rmap.c b/mm/rmap.c >>> index 618df3385c8b..83cc9d98533e 100644 >>> --- a/mm/rmap.c >>> +++ b/mm/rmap.c >>> @@ -2448,9 +2448,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, >>> if (flags & TTU_SPLIT_HUGE_PMD) { >>> split_huge_pmd_locked(vma, pvmw.address, >>> pvmw.pmd, true); >>> - ret = false; >>> - page_vma_mapped_walk_done(&pvmw); >>> - break; >>> + continue; >>> } >> >> Per my understanding if @freeze is trur, split_huge_pmd_locked() may "fail" as >> the comment says: >> >> * Without "freeze", we'll simply split the PMD, propagating the >> * PageAnonExclusive() flag for each PTE by setting it for >> * each subpage -- no need to (temporarily) clear. >> * >> * With "freeze" we want to replace mapped pages by >> * migration entries right away. This is only possible if we >> * managed to clear PageAnonExclusive() -- see >> * set_pmd_migration_entry(). >> * >> * In case we cannot clear PageAnonExclusive(), split the PMD >> * only and let try_to_migrate_one() fail later. >> >> While currently we don't return the status of split_huge_pmd_locked() to >> indicate whether it does replaced PMD with migration entries successfully. So >> we are not sure this operation succeed. > >This is the right reasoning. This means to properly handle it, split_huge_pmd_locked() >needs to return whether it inserts migration entries or not when freeze is true. > >> >> Another difference from set_pmd_migration_entry() is split_huge_pmd_locked() >> would change the page table from PMD mapped to PTE mapped. >> page_vma_mapped_walk() can handle it now for (pvmw->pmd && !pvmw->pte), but I >> am not sure this is what we expected. For example, in try_to_unmap_one(), we >> use page_vma_mapped_walk_restart() after pmd splitted. >> >> So I prefer just remove the "ret = false" for a fix. Not sure this is >> reasonable to you. >> >> I am thinking two things after this fix: >> >> * add one similar test in selftests >> * let split_huge_pmd_locked() return value to indicate freeze is degrade to >> !freeze, and fail early on try_to_migrate() like the thp migration branch >> >> Look forward your opinion on whether it worth to do it. > >This is not the right fix, neither was mine above. Because before commit 60fbb14396d5, >the code handles PAE properly. If PAE is cleared, PMD is split into PTEs and each >PTE becomes a migration entry, page_vma_mapped_walk(&pvmw) returns false, >and try_to_migrate_one() returns true. If PAE is not cleared, PMD is split into PTEs >and each PTE is not a migration entry, inside while (page_vma_mapped_walk(&pvmw)), >PAE will be attempted to get cleared again and it will fail again, leading to >try_to_migrate_one() returns false. After commit 60fbb14396d5, no matter PAE is >cleared or not, try_to_migrate_one() always returns false. It causes folio split >failures for shared PMD THPs. > >Now with your fix (and mine above), no matter PAE is cleared or not, try_to_migrate_one() >always returns true. It just flips the code to a different issue. So the proper fix >is to let split_huge_pmd_locked() returns whether it inserts migration entries or not >and do the same pattern as THP migration code path. > You are right. BTW, I thought PAE stands for Physical Address Extension and confused a while :-( > >Hi David, > >In terms of unmap_folio(), which is the only user of split_huge_pmd_locked(..., freeze=true), >there is no folio_mapped() check afterwards. 
That might be causing an issue, >when the folio is pinned between the refcount check and unmap_folio(), unmap_folio() >fails, but folio split code proceeds. That means the folio is still accessible >via PTEs and later remove_migration_pte() will try to remove non migration PTEs. >It needs to be fixed separately, right? > The current __folio_split() logic looks like this: if (folio_expected_ref_count(folio) != folio_ref_count(folio) - 1) { --- (1) ret = -EAGAIN; goto out_unlock; } unmap_folio(folio); --- (2) ret = __folio_freeze_and_split_unmapped() if (folio_ref_freeze(folio, folio_cache_ref_count(folio) + 1)) { --- (3) } else { return -EAGAIN; } You mean after (1) and (2), we don't check folio_mapped() and continue splitting? Hmm... before continuing the split, we try to freeze the folio with the expected refcount at (3). This makes sure there is no extra refcount except the pagecache or swapcache reference. You mean this is not enough? Not sure I follow you correctly. > >> >>> #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION >>> pmdval = pmdp_get(pvmw.pmd); >>> >>> >>> >>> -- >>> Best Regards, >>> Yan, Zi >> >> -- >> Wei Yang >> Help you, Help me > > >-- >Best Regards, >Yan, Zi -- Wei Yang Help you, Help me ^ permalink raw reply [flat|nested] 13+ messages in thread
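For context on why step (3) above is a sufficient guard, folio_ref_freeze() boils down to a compare-and-swap on the folio refcount; a simplified rendering of the real helper (see include/linux/page_ref.h) is:

	/*
	 * Simplified: succeed, leaving the refcount frozen at 0, only if
	 * the refcount is exactly @count at this instant. Any concurrent
	 * extra reference (for example a pin taken after the earlier
	 * check) makes the freeze fail, so the split bails out with
	 * -EAGAIN instead of proceeding.
	 */
	static inline bool folio_ref_freeze(struct folio *folio, int count)
	{
		return atomic_cmpxchg(&folio->page._refcount, count, 0) == count;
	}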
* Re: [PATCH] mm/huge_memory: fix early failure try_to_migrate() when split huge pmd for shared thp 2026-02-02 23:57 ` Wei Yang @ 2026-02-03 0:05 ` Zi Yan 0 siblings, 0 replies; 13+ messages in thread From: Zi Yan @ 2026-02-03 0:05 UTC (permalink / raw) To: Wei Yang Cc: david, akpm, lorenzo.stoakes, riel, Liam.Howlett, vbabka, harry.yoo, jannh, gavinguo, baolin.wang, linux-mm, stable On 2 Feb 2026, at 18:57, Wei Yang wrote: > On Sat, Jan 31, 2026 at 10:39:40PM -0500, Zi Yan wrote: >> On 31 Jan 2026, at 21:09, Wei Yang wrote: >> >>> On Fri, Jan 30, 2026 at 09:44:10PM -0500, Zi Yan wrote: >>>> On 30 Jan 2026, at 18:00, Wei Yang wrote: >>>> >>>>> Commit 60fbb14396d5 ("mm/huge_memory: adjust try_to_migrate_one() and >>>>> split_huge_pmd_locked()") return false unconditionally after >>>>> split_huge_pmd_locked() which may fail early during try_to_migrate() for >>>>> shared thp. This will lead to unexpected folio split failure. >>>>> >>>>> One way to reproduce: >>>>> >>>>> Create an anonymous thp range and fork 512 children, so we have a >>>>> thp shared mapped in 513 processes. Then trigger folio split with >>>>> /sys/kernel/debug/split_huge_pages debugfs to split the thp folio to >>>>> order 0. >>>>> >>>>> Without the above commit, we can successfully split to order 0. >>>>> With the above commit, the folio is still a large folio. >>>>> >>>>> The reason is the above commit return false after split pmd >>>>> unconditionally in the first process and break try_to_migrate(). >>>> >>>> The reasoning looks good to me. >>>> >>>>> >>>>> The tricky thing in above reproduce method is current debugfs interface >>>>> leverage function split_huge_pages_pid(), which will iterate the whole >>>>> pmd range and do folio split on each base page address. This means it >>>>> will try 512 times, and each time split one pmd from pmd mapped to pte >>>>> mapped thp. If there are less than 512 shared mapped process, >>>>> the folio is still split successfully at last. But in real world, we >>>>> usually try it for once. >>>>> >>>>> This patch fixes this by removing the unconditional false return after >>>>> split_huge_pmd_locked(). Later, we may introduce a true fail early if >>>>> split_huge_pmd_locked() does fail. >>>>> >>>>> Signed-off-by: Wei Yang <richard.weiyang@gmail.com> >>>>> Fixes: 60fbb14396d5 ("mm/huge_memory: adjust try_to_migrate_one() and split_huge_pmd_locked()") >>>>> Cc: Gavin Guo <gavinguo@igalia.com> >>>>> Cc: "David Hildenbrand (Red Hat)" <david@kernel.org> >>>>> Cc: Zi Yan <ziy@nvidia.com> >>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> >>>>> Cc: <stable@vger.kernel.org> >>>>> --- >>>>> mm/rmap.c | 1 - >>>>> 1 file changed, 1 deletion(-) >>>>> >>>>> diff --git a/mm/rmap.c b/mm/rmap.c >>>>> index 618df3385c8b..eed971568d65 100644 >>>>> --- a/mm/rmap.c >>>>> +++ b/mm/rmap.c >>>>> @@ -2448,7 +2448,6 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, >>>>> if (flags & TTU_SPLIT_HUGE_PMD) { >>>>> split_huge_pmd_locked(vma, pvmw.address, >>>>> pvmw.pmd, true); >>>>> - ret = false; >>>>> page_vma_mapped_walk_done(&pvmw); >>>>> break; >>>>> } >>>> >>>> How about the patch below? It matches the pattern of set_pmd_migration_entry() below. >>>> Basically, continue if the operation is successful, break otherwise. 
>>>> >>>> diff --git a/mm/rmap.c b/mm/rmap.c >>>> index 618df3385c8b..83cc9d98533e 100644 >>>> --- a/mm/rmap.c >>>> +++ b/mm/rmap.c >>>> @@ -2448,9 +2448,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, >>>> if (flags & TTU_SPLIT_HUGE_PMD) { >>>> split_huge_pmd_locked(vma, pvmw.address, >>>> pvmw.pmd, true); >>>> - ret = false; >>>> - page_vma_mapped_walk_done(&pvmw); >>>> - break; >>>> + continue; >>>> } >>> >>> Per my understanding if @freeze is trur, split_huge_pmd_locked() may "fail" as >>> the comment says: >>> >>> * Without "freeze", we'll simply split the PMD, propagating the >>> * PageAnonExclusive() flag for each PTE by setting it for >>> * each subpage -- no need to (temporarily) clear. >>> * >>> * With "freeze" we want to replace mapped pages by >>> * migration entries right away. This is only possible if we >>> * managed to clear PageAnonExclusive() -- see >>> * set_pmd_migration_entry(). >>> * >>> * In case we cannot clear PageAnonExclusive(), split the PMD >>> * only and let try_to_migrate_one() fail later. >>> >>> While currently we don't return the status of split_huge_pmd_locked() to >>> indicate whether it does replaced PMD with migration entries successfully. So >>> we are not sure this operation succeed. >> >> This is the right reasoning. This means to properly handle it, split_huge_pmd_locked() >> needs to return whether it inserts migration entries or not when freeze is true. >> >>> >>> Another difference from set_pmd_migration_entry() is split_huge_pmd_locked() >>> would change the page table from PMD mapped to PTE mapped. >>> page_vma_mapped_walk() can handle it now for (pvmw->pmd && !pvmw->pte), but I >>> am not sure this is what we expected. For example, in try_to_unmap_one(), we >>> use page_vma_mapped_walk_restart() after pmd splitted. >>> >>> So I prefer just remove the "ret = false" for a fix. Not sure this is >>> reasonable to you. >>> >>> I am thinking two things after this fix: >>> >>> * add one similar test in selftests >>> * let split_huge_pmd_locked() return value to indicate freeze is degrade to >>> !freeze, and fail early on try_to_migrate() like the thp migration branch >>> >>> Look forward your opinion on whether it worth to do it. >> >> This is not the right fix, neither was mine above. Because before commit 60fbb14396d5, >> the code handles PAE properly. If PAE is cleared, PMD is split into PTEs and each >> PTE becomes a migration entry, page_vma_mapped_walk(&pvmw) returns false, >> and try_to_migrate_one() returns true. If PAE is not cleared, PMD is split into PTEs >> and each PTE is not a migration entry, inside while (page_vma_mapped_walk(&pvmw)), >> PAE will be attempted to get cleared again and it will fail again, leading to >> try_to_migrate_one() returns false. After commit 60fbb14396d5, no matter PAE is >> cleared or not, try_to_migrate_one() always returns false. It causes folio split >> failures for shared PMD THPs. >> >> Now with your fix (and mine above), no matter PAE is cleared or not, try_to_migrate_one() >> always returns true. It just flips the code to a different issue. So the proper fix >> is to let split_huge_pmd_locked() returns whether it inserts migration entries or not >> and do the same pattern as THP migration code path. >> > > You are right. 
> > BTW, I thought PAE stands for Physical Address Extension and confused a while :-( > >> >> Hi David, >> >> In terms of unmap_folio(), which is the only user of split_huge_pmd_locked(..., freeze=true), >> there is no folio_mapped() check afterwards. That might be causing an issue, >> when the folio is pinned between the refcount check and unmap_folio(), unmap_folio() >> fails, but folio split code proceeds. That means the folio is still accessible >> via PTEs and later remove_migration_pte() will try to remove non migration PTEs. >> It needs to be fixed separately, right? >> > > The current __folio_split() logic looks like this: > > if (folio_expected_ref_count(folio) != folio_ref_count(folio) - 1) { --- (1) > ret = -EAGAIN; > goto out_unlock; > } > > unmap_folio(folio); --- (2) > > ret = __folio_freeze_and_split_unmapped() > if (folio_ref_freeze(folio, folio_cache_ref_count(folio) + 1)) { --- (3) > } else { > return -EAGAIN; > } > > You mean after (1) and (2), we don't check folio_mapped() and continue > splitting? Hmm... before continuing the split, we try to freeze the folio with the > expected refcount at (3). This makes sure there is no extra refcount except the > pagecache or swapcache reference. > > You mean this is not enough? Not sure I follow you correctly. Missed that. It works. Thanks for pointing this out. > >> >>> >>>> #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION >>>> pmdval = pmdp_get(pvmw.pmd); >>>> >>>> >>>> >>>> -- >>>> Best Regards, >>>> Yan, Zi >>> >>> -- >>> Wei Yang >>> Help you, Help me >> >> >> -- >> Best Regards, >> Yan, Zi > > -- > Wei Yang > Help you, Help me Best Regards, Yan, Zi ^ permalink raw reply [flat|nested] 13+ messages in thread
Thread overview: 13+ messages 2026-01-30 23:00 [PATCH] mm/huge_memory: fix early failure try_to_migrate() when split huge pmd for shared thp Wei Yang 2026-01-31 2:44 ` Zi Yan 2026-02-01 2:09 ` Wei Yang 2026-02-01 3:39 ` Zi Yan 2026-02-01 13:04 ` Gavin Guo 2026-02-01 14:20 ` Zi Yan 2026-02-03 0:00 ` Wei Yang 2026-02-03 0:07 ` Zi Yan 2026-02-03 13:04 ` Wei Yang 2026-02-03 13:07 ` Zi Yan 2026-02-03 13:20 ` Lance Yang 2026-02-02 23:57 ` Wei Yang 2026-02-03 0:05 ` Zi Yan