linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Wei Yang <richard.weiyang@gmail.com>
To: Zi Yan <ziy@nvidia.com>
Cc: Wei Yang <richard.weiyang@gmail.com>,
	david@kernel.org, akpm@linux-foundation.org,
	lorenzo.stoakes@oracle.com, riel@surriel.com,
	Liam.Howlett@oracle.com, vbabka@suse.cz, harry.yoo@oracle.com,
	jannh@google.com, gavinguo@igalia.com,
	baolin.wang@linux.alibaba.com, linux-mm@kvack.org,
	stable@vger.kernel.org
Subject: Re: [PATCH] mm/huge_memory: fix early failure try_to_migrate() when split huge pmd for shared thp
Date: Mon, 2 Feb 2026 23:57:14 +0000	[thread overview]
Message-ID: <20260202235714.5wvxveurjfdka5pl@master> (raw)
In-Reply-To: <C620202F-685A-4B9E-B51B-078EBE5BF0C4@nvidia.com>

On Sat, Jan 31, 2026 at 10:39:40PM -0500, Zi Yan wrote:
>On 31 Jan 2026, at 21:09, Wei Yang wrote:
>
>> On Fri, Jan 30, 2026 at 09:44:10PM -0500, Zi Yan wrote:
>>> On 30 Jan 2026, at 18:00, Wei Yang wrote:
>>>
>>>> Commit 60fbb14396d5 ("mm/huge_memory: adjust try_to_migrate_one() and
>>>> split_huge_pmd_locked()") return false unconditionally after
>>>> split_huge_pmd_locked() which may fail early during try_to_migrate() for
>>>> shared thp. This will lead to unexpected folio split failure.
>>>>
>>>> One way to reproduce:
>>>>
>>>>     Create an anonymous thp range and fork 512 children, so we have a
>>>>     thp shared mapped in 513 processes. Then trigger folio split with
>>>>     /sys/kernel/debug/split_huge_pages debugfs to split the thp folio to
>>>>     order 0.
>>>>
>>>> Without the above commit, we can successfully split to order 0.
>>>> With the above commit, the folio is still a large folio.
>>>>
>>>> The reason is the above commit return false after split pmd
>>>> unconditionally in the first process and break try_to_migrate().
>>>
>>> The reasoning looks good to me.
>>>
>>>>
>>>> The tricky thing in above reproduce method is current debugfs interface
>>>> leverage function split_huge_pages_pid(), which will iterate the whole
>>>> pmd range and do folio split on each base page address. This means it
>>>> will try 512 times, and each time split one pmd from pmd mapped to pte
>>>> mapped thp. If there are less than 512 shared mapped process,
>>>> the folio is still split successfully at last. But in real world, we
>>>> usually try it for once.
>>>>
>>>> This patch fixes this by removing the unconditional false return after
>>>> split_huge_pmd_locked(). Later, we may introduce a true fail early if
>>>> split_huge_pmd_locked() does fail.
>>>>
>>>> Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
>>>> Fixes: 60fbb14396d5 ("mm/huge_memory: adjust try_to_migrate_one() and split_huge_pmd_locked()")
>>>> Cc: Gavin Guo <gavinguo@igalia.com>
>>>> Cc: "David Hildenbrand (Red Hat)" <david@kernel.org>
>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>> Cc: <stable@vger.kernel.org>
>>>> ---
>>>>  mm/rmap.c | 1 -
>>>>  1 file changed, 1 deletion(-)
>>>>
>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>> index 618df3385c8b..eed971568d65 100644
>>>> --- a/mm/rmap.c
>>>> +++ b/mm/rmap.c
>>>> @@ -2448,7 +2448,6 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>>>>  			if (flags & TTU_SPLIT_HUGE_PMD) {
>>>>  				split_huge_pmd_locked(vma, pvmw.address,
>>>>  						      pvmw.pmd, true);
>>>> -				ret = false;
>>>>  				page_vma_mapped_walk_done(&pvmw);
>>>>  				break;
>>>>  			}
>>>
>>> How about the patch below? It matches the pattern of set_pmd_migration_entry() below.
>>> Basically, continue if the operation is successful, break otherwise.
>>>
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index 618df3385c8b..83cc9d98533e 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -2448,9 +2448,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>>> 			if (flags & TTU_SPLIT_HUGE_PMD) {
>>> 				split_huge_pmd_locked(vma, pvmw.address,
>>> 						      pvmw.pmd, true);
>>> -				ret = false;
>>> -				page_vma_mapped_walk_done(&pvmw);
>>> -				break;
>>> +				continue;
>>> 			}
>>
>> Per my understanding if @freeze is trur, split_huge_pmd_locked() may "fail" as
>> the comment says:
>>
>> 		 * Without "freeze", we'll simply split the PMD, propagating the
>> 		 * PageAnonExclusive() flag for each PTE by setting it for
>> 		 * each subpage -- no need to (temporarily) clear.
>> 		 *
>> 		 * With "freeze" we want to replace mapped pages by
>> 		 * migration entries right away. This is only possible if we
>> 		 * managed to clear PageAnonExclusive() -- see
>> 		 * set_pmd_migration_entry().
>> 		 *
>> 		 * In case we cannot clear PageAnonExclusive(), split the PMD
>> 		 * only and let try_to_migrate_one() fail later.
>>
>> While currently we don't return the status of split_huge_pmd_locked() to
>> indicate whether it does replaced PMD with migration entries successfully. So
>> we are not sure this operation succeed.
>
>This is the right reasoning. This means to properly handle it, split_huge_pmd_locked()
>needs to return whether it inserts migration entries or not when freeze is true.
>
>>
>> Another difference from set_pmd_migration_entry() is split_huge_pmd_locked()
>> would change the page table from PMD mapped to PTE mapped.
>> page_vma_mapped_walk() can handle it now for (pvmw->pmd && !pvmw->pte), but I
>> am not sure this is what we expected. For example, in try_to_unmap_one(), we
>> use page_vma_mapped_walk_restart() after pmd splitted.
>>
>> So I prefer just remove the "ret = false" for a fix. Not sure this is
>> reasonable to you.
>>
>> I am thinking two things after this fix:
>>
>>   * add one similar test in selftests
>>   * let split_huge_pmd_locked() return value to indicate freeze is degrade to
>>     !freeze, and fail early on try_to_migrate() like the thp migration branch
>>
>> Look forward your opinion on whether it worth to do it.
>
>This is not the right fix, neither was mine above. Because before commit 60fbb14396d5,
>the code handles PAE properly. If PAE is cleared, PMD is split into PTEs and each
>PTE becomes a migration entry, page_vma_mapped_walk(&pvmw) returns false,
>and try_to_migrate_one() returns true. If PAE is not cleared, PMD is split into PTEs
>and each PTE is not a migration entry, inside while (page_vma_mapped_walk(&pvmw)),
>PAE will be attempted to get cleared again and it will fail again, leading to
>try_to_migrate_one() returns false. After commit 60fbb14396d5, no matter PAE is
>cleared or not, try_to_migrate_one() always returns false. It causes folio split
>failures for shared PMD THPs.
>
>Now with your fix (and mine above), no matter PAE is cleared or not, try_to_migrate_one()
>always returns true. It just flips the code to a different issue. So the proper fix
>is to let split_huge_pmd_locked() returns whether it inserts migration entries or not
>and do the same pattern as THP migration code path.
>

You are right.

BTW, I thought PAE stands for Physical Address Extension and confused a while :-(

>
>Hi David,
>
>In terms of unmap_folio(), which is the only user of split_huge_pmd_locked(..., freeze=true),
>there is no folio_mapped() check afterwards. That might be causing an issue,
>when the folio is pinned between the refcount check and unmap_folio(), unmap_folio()
>fails, but folio split code proceeds. That means the folio is still accessible
>via PTEs and later remove_migration_pte() will try to remove non migration PTEs.
>It needs to be fixed separately, right?
>

Current __folio_split() logic is like below:

    if (folio_expected_ref_count(folio) != folio_ref_count(folio) - 1) {     --- (1)
    	ret = -EAGAIN;
	goto out_unlock;
    }

    unmap_folio(folio);                                                      --- (2)

    ret = __folio_freeze_and_split_unmapped()
        if (folio_ref_freeze(folio, folio_cache_ref_count(folio) + 1)) {     --- (3)
	} else {
	    return -EAGAIN;
	}

You mean after (1) and (2), we don't check folio_mapped() and continue
spliting? Hmm... before continue split we tried to freeze folio with expected
refcount at (3). This makes sure there is not extra refcount except in
pagecache or swapcache.

You mean this is not enough? Not sure I follow you correctly.

>
>>
>>> #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
>>> 			pmdval = pmdp_get(pvmw.pmd);
>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Yan, Zi
>>
>> -- 
>> Wei Yang
>> Help you, Help me
>
>
>--
>Best Regards,
>Yan, Zi

-- 
Wei Yang
Help you, Help me


  parent reply	other threads:[~2026-02-02 23:57 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-01-30 23:00 Wei Yang
2026-01-31  2:44 ` Zi Yan
2026-02-01  2:09   ` Wei Yang
2026-02-01  3:39     ` Zi Yan
2026-02-01 13:04       ` Gavin Guo
2026-02-01 14:20         ` Zi Yan
2026-02-03  0:00           ` Wei Yang
2026-02-03  0:07             ` Zi Yan
2026-02-03 13:04               ` Wei Yang
2026-02-03 13:07                 ` Zi Yan
2026-02-03 13:20           ` Lance Yang
2026-02-02 23:57       ` Wei Yang [this message]
2026-02-03  0:05         ` Zi Yan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260202235714.5wvxveurjfdka5pl@master \
    --to=richard.weiyang@gmail.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=david@kernel.org \
    --cc=gavinguo@igalia.com \
    --cc=harry.yoo@oracle.com \
    --cc=jannh@google.com \
    --cc=linux-mm@kvack.org \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=riel@surriel.com \
    --cc=stable@vger.kernel.org \
    --cc=vbabka@suse.cz \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox