From: Zi Yan <ziy@nvidia.com>
To: Wei Yang <richard.weiyang@gmail.com>
Cc: Gavin Guo <gavinguo@igalia.com>,
	david@kernel.org, akpm@linux-foundation.org,
	lorenzo.stoakes@oracle.com, riel@surriel.com,
	Liam.Howlett@oracle.com, vbabka@suse.cz, harry.yoo@oracle.com,
	jannh@google.com, baolin.wang@linux.alibaba.com,
	linux-mm@kvack.org, stable@vger.kernel.org,
	Gavin Shan <gshan@redhat.com>
Subject: Re: [PATCH] mm/huge_memory: fix early failure try_to_migrate() when split huge pmd for shared thp
Date: Tue, 03 Feb 2026 08:07:49 -0500
Message-ID: <3172E2A7-EA3E-4F68-8067-96D72AE5D2C7@nvidia.com>
In-Reply-To: <20260203130427.n2td43cb275ybi7j@master>

On 3 Feb 2026, at 8:04, Wei Yang wrote:

> On Mon, Feb 02, 2026 at 07:07:12PM -0500, Zi Yan wrote:
>> On 2 Feb 2026, at 19:00, Wei Yang wrote:
>>
>>> On Sun, Feb 01, 2026 at 09:20:35AM -0500, Zi Yan wrote:
>>>> On 1 Feb 2026, at 8:04, Gavin Guo wrote:
>>>>
>>>>> On 2/1/26 11:39, Zi Yan wrote:
>>>>>> On 31 Jan 2026, at 21:09, Wei Yang wrote:
>>>>>>
>>>>>>> On Fri, Jan 30, 2026 at 09:44:10PM -0500, Zi Yan wrote:
>>>>>>>> On 30 Jan 2026, at 18:00, Wei Yang wrote:
>>>>>>>>
>>>>>>>>> Commit 60fbb14396d5 ("mm/huge_memory: adjust try_to_migrate_one() and
>>>>>>>>> split_huge_pmd_locked()") returns false unconditionally after
>>>>>>>>> split_huge_pmd_locked(), which makes try_to_migrate() fail early for
>>>>>>>>> shared THPs. This leads to unexpected folio split failures.
>>>>>>>>>
>>>>>>>>> One way to reproduce:
>>>>>>>>>
>>>>>>>>>      Create an anonymous thp range and fork 512 children, so we have a
>>>>>>>>>      thp shared mapped in 513 processes. Then trigger folio split with
>>>>>>>>>      /sys/kernel/debug/split_huge_pages debugfs to split the thp folio to
>>>>>>>>>      order 0.
>>>>>>>>>
>>>>>>>>> Without the above commit, we can successfully split to order 0.
>>>>>>>>> With the above commit, the folio is still a large folio.
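>>>>>>>>>
>>>>>>>>> A minimal reproducer sketch (untested; assumes a 2MB PMD size and the
>>>>>>>>> "<pid>,<vaddr_start>,<vaddr_end>" input format of the debugfs file):
>>>>>>>>>
>>>>>>>>>   #include <signal.h>
>>>>>>>>>   #include <stdio.h>
>>>>>>>>>   #include <stdlib.h>
>>>>>>>>>   #include <string.h>
>>>>>>>>>   #include <sys/mman.h>
>>>>>>>>>   #include <unistd.h>
>>>>>>>>>
>>>>>>>>>   #define THP_SIZE (2UL << 20)	/* assumed PMD size */
>>>>>>>>>
>>>>>>>>>   int main(void)
>>>>>>>>>   {
>>>>>>>>>   	char cmd[128];
>>>>>>>>>   	/* Over-allocate so the buffer can be PMD-aligned. */
>>>>>>>>>   	char *map = mmap(NULL, 2 * THP_SIZE, PROT_READ | PROT_WRITE,
>>>>>>>>>   			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>>>>>>>>>   	char *buf = (char *)(((unsigned long)map + THP_SIZE - 1) &
>>>>>>>>>   			     ~(THP_SIZE - 1));
>>>>>>>>>
>>>>>>>>>   	madvise(buf, THP_SIZE, MADV_HUGEPAGE);
>>>>>>>>>   	memset(buf, 1, THP_SIZE);	/* fault the range in, hopefully as a THP */
>>>>>>>>>
>>>>>>>>>   	for (int i = 0; i < 512; i++)
>>>>>>>>>   		if (fork() == 0)
>>>>>>>>>   			pause();	/* child keeps the THP shared mapped */
>>>>>>>>>
>>>>>>>>>   	/* Needs root; asks for a split of the range to order 0. */
>>>>>>>>>   	snprintf(cmd, sizeof(cmd),
>>>>>>>>>   		 "echo %d,0x%lx,0x%lx > /sys/kernel/debug/split_huge_pages",
>>>>>>>>>   		 getpid(), (unsigned long)buf, (unsigned long)buf + THP_SIZE);
>>>>>>>>>   	system(cmd);
>>>>>>>>>
>>>>>>>>>   	kill(0, SIGKILL);	/* tear down the children */
>>>>>>>>>   	return 0;
>>>>>>>>>   }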
>>>>>>>>>
>>>>>>>>> The reason is that the above commit returns false unconditionally
>>>>>>>>> after the PMD split in the first process, which breaks try_to_migrate()
>>>>>>>>> early.
>>>>>>>>
>>>>>>>> The reasoning looks good to me.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> The tricky thing in the above reproduction method is that the current
>>>>>>>>> debugfs interface goes through split_huge_pages_pid(), which iterates
>>>>>>>>> over the whole PMD range and attempts a folio split at each base page
>>>>>>>>> address. This means it tries 512 times, each time splitting one PMD
>>>>>>>>> from a PMD-mapped to a PTE-mapped THP. If fewer than 512 processes
>>>>>>>>> share the mapping, the folio is still split successfully in the end.
>>>>>>>>> But in the real world, we usually try only once.
>>>>>>>>>
>>>>>>>>> This patch fixes the issue by removing the unconditional false return
>>>>>>>>> after split_huge_pmd_locked(). Later, we may introduce a genuine early
>>>>>>>>> failure for the case where split_huge_pmd_locked() does fail.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
>>>>>>>>> Fixes: 60fbb14396d5 ("mm/huge_memory: adjust try_to_migrate_one() and split_huge_pmd_locked()")
>>>>>>>>> Cc: Gavin Guo <gavinguo@igalia.com>
>>>>>>>>> Cc: "David Hildenbrand (Red Hat)" <david@kernel.org>
>>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>> Cc: <stable@vger.kernel.org>
>>>>>>>>> ---
>>>>>>>>>   mm/rmap.c | 1 -
>>>>>>>>>   1 file changed, 1 deletion(-)
>>>>>>>>>
>>>>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>>>>>> index 618df3385c8b..eed971568d65 100644
>>>>>>>>> --- a/mm/rmap.c
>>>>>>>>> +++ b/mm/rmap.c
>>>>>>>>> @@ -2448,7 +2448,6 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>>>>>>>>>   			if (flags & TTU_SPLIT_HUGE_PMD) {
>>>>>>>>>   				split_huge_pmd_locked(vma, pvmw.address,
>>>>>>>>>   						      pvmw.pmd, true);
>>>>>>>>> -				ret = false;
>>>>>>>>>   				page_vma_mapped_walk_done(&pvmw);
>>>>>>>>>   				break;
>>>>>>>>>   			}
>>>>>>>>
>>>>>>>> How about the patch below? It matches the pattern of set_pmd_migration_entry() below.
>>>>>>>> Basically, continue if the operation is successful, break otherwise.
>>>>>>>>
>>>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>>>>> index 618df3385c8b..83cc9d98533e 100644
>>>>>>>> --- a/mm/rmap.c
>>>>>>>> +++ b/mm/rmap.c
>>>>>>>> @@ -2448,9 +2448,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>>>>>>>> 			if (flags & TTU_SPLIT_HUGE_PMD) {
>>>>>>>> 				split_huge_pmd_locked(vma, pvmw.address,
>>>>>>>> 						      pvmw.pmd, true);
>>>>>>>> -				ret = false;
>>>>>>>> -				page_vma_mapped_walk_done(&pvmw);
>>>>>>>> -				break;
>>>>>>>> +				continue;
>>>>>>>> 			}
>>>>>>>
>>>>>>> Per my understanding, if @freeze is true, split_huge_pmd_locked() may "fail",
>>>>>>> as the comment says:
>>>>>>>
>>>>>>> 		 * Without "freeze", we'll simply split the PMD, propagating the
>>>>>>> 		 * PageAnonExclusive() flag for each PTE by setting it for
>>>>>>> 		 * each subpage -- no need to (temporarily) clear.
>>>>>>> 		 *
>>>>>>> 		 * With "freeze" we want to replace mapped pages by
>>>>>>> 		 * migration entries right away. This is only possible if we
>>>>>>> 		 * managed to clear PageAnonExclusive() -- see
>>>>>>> 		 * set_pmd_migration_entry().
>>>>>>> 		 *
>>>>>>> 		 * In case we cannot clear PageAnonExclusive(), split the PMD
>>>>>>> 		 * only and let try_to_migrate_one() fail later.
>>>>>>>
>>>>>>> But currently split_huge_pmd_locked() doesn't return a status to indicate
>>>>>>> whether it successfully replaced the PMD with migration entries, so we
>>>>>>> cannot tell whether the operation succeeded.
>>>>>>
>>>>>> This is the right reasoning. This means to properly handle it, split_huge_pmd_locked()
>>>>>> needs to return whether it inserts migration entries or not when freeze is true.
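>>>>>>
>>>>>> Something along these lines, sketch only (parameter names guessed from the
>>>>>> call site above, exact declaration to be worked out):
>>>>>>
>>>>>> -void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
>>>>>> -			   pmd_t *pmd, bool freeze);
>>>>>> +/*
>>>>>> + * Returns true if the PMD was handled as requested, i.e. with freeze,
>>>>>> + * migration entries were actually installed.
>>>>>> + */
>>>>>> +bool split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
>>>>>> +			   pmd_t *pmd, bool freeze);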
>>>>>>
>>>>>>>
>>>>>>> Another difference from set_pmd_migration_entry() is that
>>>>>>> split_huge_pmd_locked() changes the page table from PMD mapped to PTE
>>>>>>> mapped. page_vma_mapped_walk() can handle that now for (pvmw->pmd &&
>>>>>>> !pvmw->pte), but I am not sure this is what we expect. For example, in
>>>>>>> try_to_unmap_one(), we use page_vma_mapped_walk_restart() after the PMD
>>>>>>> is split.
>>>>>>>
>>>>>>> So I prefer to just remove the "ret = false" as the fix. Not sure
>>>>>>> whether this sounds reasonable to you.
>>>>>>>
>>>>>>> I am thinking of two things after this fix:
>>>>>>>
>>>>>>>    * add a similar test to selftests
>>>>>>>    * let split_huge_pmd_locked() return a value to indicate that freeze was
>>>>>>>      degraded to !freeze, and fail early in try_to_migrate() like the THP
>>>>>>>      migration branch
>>>>>>>
>>>>>>> Looking forward to your opinion on whether it is worth doing.
>>>>>>
>>>>>> This is not the right fix, and neither was mine above, because before commit
>>>>>> 60fbb14396d5 the code handled PAE (PageAnonExclusive) properly. If PAE is
>>>>>> cleared, the PMD is split into PTEs and each PTE becomes a migration entry,
>>>>>> page_vma_mapped_walk(&pvmw) returns false, and try_to_migrate_one() returns
>>>>>> true. If PAE is not cleared, the PMD is split into PTEs and no PTE is a
>>>>>> migration entry; inside while (page_vma_mapped_walk(&pvmw)), clearing PAE
>>>>>> will be attempted again and will fail again, leading to try_to_migrate_one()
>>>>>> returning false. After commit 60fbb14396d5, no matter whether PAE is cleared
>>>>>> or not, try_to_migrate_one() always returns false. That causes folio split
>>>>>> failures for shared PMD THPs.
>>>>>>
>>>>>> Now with your fix (and mine above), no matter whether PAE is cleared or not,
>>>>>> try_to_migrate_one() always returns true. It just flips the code to a
>>>>>> different issue. So the proper fix is to let split_huge_pmd_locked() return
>>>>>> whether it inserted migration entries or not and follow the same pattern as
>>>>>> the THP migration code path.
>>>>>
>>>>> How about aligning with try_to_unmap_one()? The behavior would be the same as before commit 60fbb14396d5 was applied:
>>>>>
>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>> index 7b9879ef442d..0c96f0883013 100644
>>>>> --- a/mm/rmap.c
>>>>> +++ b/mm/rmap.c
>>>>> @@ -2333,9 +2333,9 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>>>>>                         if (flags & TTU_SPLIT_HUGE_PMD) {
>>>>>                                 split_huge_pmd_locked(vma, pvmw.address,
>>>>>                                                       pvmw.pmd, true);
>>>>> -                               ret = false;
>>>>> -                               page_vma_mapped_walk_done(&pvmw);
>>>>> -                               break;
>>>>> +                               flags &= ~TTU_SPLIT_HUGE_PMD;
>>>>> +                               page_vma_mapped_walk_restart(&pvmw);
>>>>> +                               continue;
>>>>>                         }
>>>>>  #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
>>>>>                         pmdval = pmdp_get(pvmw.pmd);
>>>>
>>>> Yes, it works and definitely needs a comment like "After split_huge_pmd_locked(), restart
>>>> the walk to detect PageAnonExclusive handling failure in __split_huge_pmd_locked()".
>>>> The change is good for backporting, but an additional patch to fix it properly by adding
>>>> a return value to split_huge_pmd_locked() is also necessary.
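>>>>
>>>> I.e., for the backport fix, something like:
>>>>
>>>> 			if (flags & TTU_SPLIT_HUGE_PMD) {
>>>> 				split_huge_pmd_locked(vma, pvmw.address,
>>>> 						      pvmw.pmd, true);
>>>> 				/*
>>>> 				 * After split_huge_pmd_locked(), restart the
>>>> 				 * walk to detect PageAnonExclusive handling
>>>> 				 * failure in __split_huge_pmd_locked().
>>>> 				 */
>>>> 				flags &= ~TTU_SPLIT_HUGE_PMD;
>>>> 				page_vma_mapped_walk_restart(&pvmw);
>>>> 				continue;
>>>> 			}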
>>>>
>>>
>>> If my understanding is correct, this approach is good for backporting.
>>>
>>> And yes, we could further improve it by returning a value to indicate whether
>>> split_huge_pmd_locked() did split to migration entries.
>>>
>>> Thanks to you both for your thoughtful input.
>>
>> Are you going to send two patches as a series, one being the above fix with a
>> comment and the other changing split_huge_pmd_locked() to return a value?
>>
>
> Hmm... as the above fix is supposed to be cc'ed to stable and backported, I
> think separating them is the correct process. And for the return value of

Right. Andrew prefers it that way. You can send the fix first and send the
refactoring patch after the fix is picked up by Andrew.

> split_huge_pmd_locked(), I will take another look at all the call sites. Are
> you OK with this?
>
> Well, do you think we need to wait for David's comment? If not, I will prepare
> the v2 fix with the above change.

You answered my question. There should be no issue with the existing code.

Thanks.


Best Regards,
Yan, Zi

