From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Mon, 2 Feb 2026 23:57:14 +0000
From: Wei Yang <richard.weiyang@gmail.com>
To: Zi Yan
Cc: Wei Yang, david@kernel.org, akpm@linux-foundation.org, lorenzo.stoakes@oracle.com, riel@surriel.com, Liam.Howlett@oracle.com, vbabka@suse.cz, harry.yoo@oracle.com, jannh@google.com, gavinguo@igalia.com, baolin.wang@linux.alibaba.com, linux-mm@kvack.org, stable@vger.kernel.org
Subject: Re: [PATCH] mm/huge_memory: fix early failure try_to_migrate() when split huge pmd for shared thp
Message-ID: <20260202235714.5wvxveurjfdka5pl@master>
Reply-To: Wei Yang
References: <20260130230058.11471-1-richard.weiyang@gmail.com> <178ADAB8-50AB-452F-B25F-6E145DEAA44C@nvidia.com> <20260201020950.p6aygkkiy4hxbi5r@master>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
On Sat, Jan 31, 2026 at 10:39:40PM -0500, Zi Yan wrote:
>On 31 Jan 2026, at 21:09, Wei Yang wrote:
>
>> On Fri, Jan 30, 2026 at 09:44:10PM -0500, Zi Yan wrote:
>>> On 30 Jan 2026, at 18:00, Wei Yang wrote:
>>>
>>>> Commit 60fbb14396d5 ("mm/huge_memory: adjust try_to_migrate_one() and
>>>> split_huge_pmd_locked()") returns false unconditionally after
>>>> split_huge_pmd_locked(), which may make try_to_migrate() fail early
>>>> for a shared THP. This leads to unexpected folio split failures.
>>>>
>>>> One way to reproduce:
>>>>
>>>> Create an anonymous THP range and fork 512 children, so we have a
>>>> THP shared mapped in 513 processes. Then trigger a folio split via
>>>> the /sys/kernel/debug/split_huge_pages debugfs interface to split
>>>> the THP folio to order 0.
>>>>
>>>> Without the above commit, we can successfully split to order 0.
>>>> With the above commit, the folio is still a large folio.
>>>>
>>>> The reason is that the above commit returns false unconditionally
>>>> after the pmd split in the first process, which breaks out of
>>>> try_to_migrate().
>>>
>>> The reasoning looks good to me.
>>>
>>>>
>>>> The tricky thing in the above reproduction method is that the
>>>> current debugfs interface uses split_huge_pages_pid(), which
>>>> iterates the whole pmd range and attempts a folio split at each base
>>>> page address. This means it tries 512 times, each time splitting one
>>>> pmd from a pmd-mapped to a pte-mapped THP. If fewer than 512
>>>> processes share the mapping, the folio is still split successfully
>>>> in the end. But in the real world, we usually try only once.
>>>>
>>>> This patch fixes this by removing the unconditional false return
>>>> after split_huge_pmd_locked(). Later, we may introduce a true early
>>>> failure if split_huge_pmd_locked() does fail.
>>>>
>>>> Signed-off-by: Wei Yang
>>>> Fixes: 60fbb14396d5 ("mm/huge_memory: adjust try_to_migrate_one() and split_huge_pmd_locked()")
>>>> Cc: Gavin Guo
>>>> Cc: "David Hildenbrand (Red Hat)"
>>>> Cc: Zi Yan
>>>> Cc: Baolin Wang
>>>> Cc:
>>>> ---
>>>>  mm/rmap.c | 1 -
>>>>  1 file changed, 1 deletion(-)
>>>>
>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>> index 618df3385c8b..eed971568d65 100644
>>>> --- a/mm/rmap.c
>>>> +++ b/mm/rmap.c
>>>> @@ -2448,7 +2448,6 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>>>>  		if (flags & TTU_SPLIT_HUGE_PMD) {
>>>>  			split_huge_pmd_locked(vma, pvmw.address,
>>>>  					      pvmw.pmd, true);
>>>> -			ret = false;
>>>>  			page_vma_mapped_walk_done(&pvmw);
>>>>  			break;
>>>>  		}
>>>
>>> How about the patch below? It matches the pattern of set_pmd_migration_entry() below.
>>> Basically, continue if the operation is successful, break otherwise.
>>>
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index 618df3385c8b..83cc9d98533e 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -2448,9 +2448,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>>>  		if (flags & TTU_SPLIT_HUGE_PMD) {
>>>  			split_huge_pmd_locked(vma, pvmw.address,
>>>  					      pvmw.pmd, true);
>>> -			ret = false;
>>> -			page_vma_mapped_walk_done(&pvmw);
>>> -			break;
>>> +			continue;
>>>  		}
>>
>> Per my understanding, if @freeze is true, split_huge_pmd_locked() may
>> "fail", as the comment says:
>>
>>  * Without "freeze", we'll simply split the PMD, propagating the
>>  * PageAnonExclusive() flag for each PTE by setting it for
>>  * each subpage -- no need to (temporarily) clear.
>>  *
>>  * With "freeze" we want to replace mapped pages by
>>  * migration entries right away. This is only possible if we
>>  * managed to clear PageAnonExclusive() -- see
>>  * set_pmd_migration_entry().
>>  *
>>  * In case we cannot clear PageAnonExclusive(), split the PMD
>>  * only and let try_to_migrate_one() fail later.
>>
>> But currently we don't return the status of split_huge_pmd_locked() to
>> indicate whether it actually replaced the PMD with migration entries,
>> so we cannot tell whether the operation succeeded.
>
>This is the right reasoning. It means that to handle this properly,
>split_huge_pmd_locked() needs to return whether it inserted migration
>entries or not when freeze is true.
>
>>
>> Another difference from set_pmd_migration_entry() is that
>> split_huge_pmd_locked() changes the page table from PMD mapped to PTE
>> mapped. page_vma_mapped_walk() can handle that now for
>> (pvmw->pmd && !pvmw->pte), but I am not sure this is what we expect.
>> For example, in try_to_unmap_one(), we use
>> page_vma_mapped_walk_restart() after the pmd is split.
>>
>> So I prefer to just remove the "ret = false" as a fix. Not sure whether
>> this sounds reasonable to you.
>>
>> I am thinking about two things after this fix:
>>
>> * add a similar test in selftests
>> * let split_huge_pmd_locked() return a value indicating that freeze was
>>   degraded to !freeze, and fail early in try_to_migrate() like the thp
>>   migration branch
>>
>> Looking forward to your opinion on whether this is worth doing.
>
>This is not the right fix, and neither was mine above. Before commit
>60fbb14396d5, the code handled PAE properly. If PAE is cleared, the PMD
>is split into PTEs and each PTE becomes a migration entry,
>page_vma_mapped_walk(&pvmw) returns false, and try_to_migrate_one()
>returns true. If PAE is not cleared, the PMD is split into PTEs and no
>PTE is a migration entry; inside while (page_vma_mapped_walk(&pvmw)),
>clearing PAE will be attempted again and will fail again, so
>try_to_migrate_one() returns false. After commit 60fbb14396d5, whether
>PAE is cleared or not, try_to_migrate_one() always returns false. That
>causes folio split failures for shared PMD THPs.
>
>Now with your fix (and mine above), whether PAE is cleared or not,
>try_to_migrate_one() always returns true. It just flips the code to a
>different issue. So the proper fix is to let split_huge_pmd_locked()
>return whether it inserted migration entries, and to follow the same
>pattern as the THP migration code path.
>

You are right.

BTW, I thought PAE stood for Physical Address Extension and was confused
for a while :-(

>
>Hi David,
>
>In terms of unmap_folio(), which is the only user of
>split_huge_pmd_locked(..., freeze=true), there is no folio_mapped()
>check afterwards. That might be causing an issue: if the folio is pinned
>between the refcount check and unmap_folio(), unmap_folio() fails, but
>the folio split code proceeds. That means the folio is still accessible
>via PTEs, and later remove_migration_pte() will try to remove
>non-migration PTEs. It needs to be fixed separately, right?
>

The current __folio_split() logic is like below:

	if (folio_expected_ref_count(folio) != folio_ref_count(folio) - 1) {	--- (1)
		ret = -EAGAIN;
		goto out_unlock;
	}

	unmap_folio(folio);							--- (2)

	ret = __folio_freeze_and_split_unmapped()
		if (folio_ref_freeze(folio, folio_cache_ref_count(folio) + 1)) {	--- (3)
		} else {
			return -EAGAIN;
		}

You mean after (1) and (2), we don't check folio_mapped() and continue
splitting?

Hmm... before continuing the split, we try to freeze the folio with the
expected refcount at (3). This makes sure there is no extra refcount
except in the pagecache or swapcache. You mean this is not enough?

Not sure I follow you correctly.

>
>>
>>> #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
>>> 	pmdval = pmdp_get(pvmw.pmd);
>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Yan, Zi
>>
>> --
>> Wei Yang
>> Help you, Help me
>
>
>--
>Best Regards,
>Yan, Zi

--
Wei Yang
Help you, Help me