Date: Wed, 11 Jun 2025 15:02:54 +0800
Subject: Re: [PATCH v2 2/2] mm: shmem: disallow hugepages if the system-wide shmem THP sysfs settings are disabled
From: Baolin Wang <baolin.wang@linux.alibaba.com>
To: Lorenzo Stoakes
Cc: akpm@linux-foundation.org, hughd@google.com, david@redhat.com, Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com, dev.jain@arm.com, ziy@nvidia.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
In-Reply-To: <86bf2dcd-4be9-4fd9-98cc-da55aea52be0@lucifer.local>
References: <39d7617a6142c6091f233357171c5793e0992d36.1749109709.git.baolin.wang@linux.alibaba.com> <5578735d-a06c-4a50-9ca1-c141092f2b3a@linux.alibaba.com> <86bf2dcd-4be9-4fd9-98cc-da55aea52be0@lucifer.local>
On 2025/6/9 23:33, Lorenzo Stoakes wrote:
> OK overall the logic looks good now I realise the mistake I made is that
> the thp_vma_allowable_orders() check is for the vma_is_anonymous() case
> only.
>
> On Mon, Jun 09, 2025 at 02:31:46PM +0800, Baolin Wang wrote:
>>
>>
>> On 2025/6/7 20:14, Lorenzo Stoakes wrote:
>>> On Thu, Jun 05, 2025 at 04:00:59PM +0800, Baolin Wang wrote:
>>>> The MADV_COLLAPSE will ignore the system-wide shmem THP sysfs settings, which
>>>> means that even though we have disabled the shmem THP configuration, MADV_COLLAPSE
>>>> will still attempt to collapse into a shmem THP. This violates the rule we have
>>>> agreed upon: never means never.
>>>
>>> Ugh that we have separate shmem logic. And split between huge_memory.c and
>>> shmem.c too :)) Again, not your fault, just a general moan about existing
>>> stuff :P
>>>
>>>> Another rule for madvise, referring to David's suggestion: "allowing for collapsing
>>>> in a VM without VM_HUGEPAGE in the "madvise" mode would be fine".
>>>
>>> Hm I'm not sure if this is enforced is it? I may have missed something here
>>> however.
>>>
>>>> Then the current strategy is:
>>>>
>>>> For shmem, if none of always, madvise, within_size, and inherit have enabled
>>>> PMD-sized THP, then MADV_COLLAPSE will be prohibited from collapsing PMD-sized THP.
>>>
>>> Again, is this just MADV_COLLAPSE? Surely this is a general change?
>>>
>>> We should be clear that we are not explicitly limiting ourselves to
>>> MADV_COLLAPSE here.
>>>
>>> You should clearly indicate that the MADV_COLLAPSE case DOESN'T set
>>> TVA_ENFORCE_SYSFS and that's the key difference here.
>>
>> Sure.
>>
>
> Thanks!
>
>>>> For tmpfs, if the mount option is set with the 'huge=never' parameter, then
>>>> MADV_COLLAPSE will be prohibited from collapsing PMD-sized THP.
>>>>
>>>> Acked-by: Zi Yan
>>>> Signed-off-by: Baolin Wang
>>>> ---
>>>>  mm/huge_memory.c | 2 +-
>>>>  mm/shmem.c       | 6 +++---
>>>>  2 files changed, 4 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index d3e66136e41a..a8cfa37cae72 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -166,7 +166,7 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>>>>  	 * own flags.
>>>>  	 */
>>>>  	if (!in_pf && shmem_file(vma->vm_file))
>>>> -		return shmem_allowable_huge_orders(file_inode(vma->vm_file),
>>>> +		return orders & shmem_allowable_huge_orders(file_inode(vma->vm_file),
>>>>  						   vma, vma->vm_pgoff, 0,
>>>>  						   !enforce_sysfs);
>>>
>>> Did you mean to do &&?
>>
>> No. It might be worth having a separate fix patch here, because the original
>> logic is incorrect and needs to perform an '&' operation with 'orders'.
>>
>> This change should be a general change.
>
> Ah yeah, I did think that _perhaps_ it could be. I think it would make
> sense to separate out into another patch albeit very small, just so we can
> separate your further changes from the fix for this.

OK.

>>> Also, is this achieving what you want to achieve? Is it necessary? The
>>> changes in patch 1/2 enforce the global settings and before this code in
>>> __thp_vma_allowable_orders():
>>>
>>> unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>>> 					 unsigned long vm_flags,
>>> 					 unsigned long tva_flags,
>>> 					 unsigned long orders)
>>> {
>>> 	... (no early exits) ...
>>>
>>> 	orders &= supported_orders;
>>> 	if (!orders)
>>> 		return 0;
>>>
>>> 	...
>>> }
>>>
>>> So if orders == 0 due to the changes in thp_vma_allowable_orders(), which
>>> is the only caller of __thp_vma_allowable_orders() then we _always_ exit
>>> early here before we get to this shmem_allowable_huge_orders() code.
>>
>> Not for this case. Since shmem already supports mTHP, this is to check
>> whether the 'orders' are enabled in the shmem mTHP configuration.
>> For example, shmem mTHP might only enable 64K mTHP, which obviously does not
>> allow PMD-sized THP to collapse.
>
> Yeah sorry I get it now thp_vma_allowable_orders() does a
> vma_is_anonymous() predicate. Doh! :P
>
> God what a mess this is (not your fault, pre-existing obviously :P)
>
> Yeah le sigh.
>
>>
>>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>>> index 4b42419ce6b2..9af45d4e27e6 100644
>>>> --- a/mm/shmem.c
>>>> +++ b/mm/shmem.c
>>>> @@ -625,7 +625,7 @@ static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index
>>>>  		return 0;
>>>>  	if (shmem_huge == SHMEM_HUGE_DENY)
>>>>  		return 0;
>>>> -	if (shmem_huge_force || shmem_huge == SHMEM_HUGE_FORCE)
>>>> +	if (shmem_huge == SHMEM_HUGE_FORCE)
>>>>  		return maybe_pmd_order;
>
> OK I get it now, this means the !check sysfs doesn't just get
> actioned... :)
>
>>>>
>>>>  	/*
>>>> @@ -660,7 +660,7 @@ static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index
>>>>
>>>>  		fallthrough;
>>>>  	case SHMEM_HUGE_ADVISE:
>>>> -		if (vm_flags & VM_HUGEPAGE)
>>>> +		if (shmem_huge_force || (vm_flags & VM_HUGEPAGE))
>>>>  			return maybe_pmd_order;
>>>>  		fallthrough;
>>>>  	default:
>>>> @@ -1790,7 +1790,7 @@ unsigned long shmem_allowable_huge_orders(struct inode *inode,
>>>>  	/* Allow mTHP that will be fully within i_size. */
>>>>  	mask |= shmem_get_orders_within_size(inode, within_size_orders, index, 0);
>>>>
>>>> -	if (vm_flags & VM_HUGEPAGE)
>>>> +	if (shmem_huge_force || (vm_flags & VM_HUGEPAGE))
>>>>  		mask |= READ_ONCE(huge_shmem_orders_madvise);
>>>
>>> I'm also not sure these are necessary:
>>>
>>> The only path that can set shmem_huge_force is __thp_vma_allowable_orders()
>>> -> shmem_allowable_huge_orders() -> shmem_huge_global_enabled() and then
>>> only if !(tva_flags & TVA_ENFORCE_SYSFS) and as stated above we already
>>> cover off this case by early exiting __thp_vma_allowable_orders() if orders
>>> == 0 as established in patch 1/2.
>>
>> Not true.
>> Shmem has its own separate mTHP sysfs setting, which is different
>> from the mTHP sysfs setting for anonymous pages mentioned earlier. These
>> checks are in the shmem file. You can check more for shmem mTHP in
>> Documentation/admin-guide/mm/transhuge.rst :)
>
> Ah yeah the issue is if (vma_is_anonymous())... doh!
>
> The stack trace is correct though, this is the only place we do it:
>
> ~~
>
> static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index,
> 					      loff_t write_end, bool shmem_huge_force,
> 					      struct vm_area_struct *vma,
> 					      unsigned long vm_flags)
>
> Is called from shmem_getattr():
>
> 	if (shmem_huge_global_enabled(inode, 0, 0, false, NULL, 0))
> 		stat->blksize = HPAGE_PMD_SIZE;
>
> Note that shmem_huge_force == false here
>
> And shmem_allowable_huge_orders():
>
> unsigned long shmem_allowable_huge_orders(struct inode *inode,
> 					  struct vm_area_struct *vma, pgoff_t index,
> 					  loff_t write_end, bool shmem_huge_force)
>
> 	global_orders = shmem_huge_global_enabled(inode, index, write_end,
> 						  shmem_huge_force, vma, vm_flags);
>
> Which forwards 'shmem_huge_force'.
>
> In shmem_get_folio_gfp():
>
> 	orders = shmem_allowable_huge_orders(inode, vma, index, write_end, false);
>
> Note that shmem_huge_force == false.
>
> In __thp_vma_allowable_orders():
>
> unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
> 					 unsigned long vm_flags,
> 					 unsigned long tva_flags,
> 					 unsigned long orders)
> {
> 	...
> 	bool enforce_sysfs = tva_flags & TVA_ENFORCE_SYSFS;
>
> 	...
>
> 	if (!in_pf && shmem_file(vma->vm_file))
> 		return shmem_allowable_huge_orders(file_inode(vma->vm_file),
> 						   vma, vma->vm_pgoff, 0,
> 						   !enforce_sysfs);
>
> So we set shmem_huge_force only if TVA_ENFORCE_SYSFS is not set in tva_flags
> passed to __thp_vma_allowable_orders()
>
> The only caller of __thp_vma_allowable_orders() is thp_vma_allowable_orders().

Right :)

>
> But yeah we do need to do shmem things... sorry my mistake.

No worries. I appreciate your useful comments.