From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id A71C8C5B552 for ; Mon, 9 Jun 2025 06:18:15 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 3DF436B007B; Mon, 9 Jun 2025 02:18:15 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 390B96B0088; Mon, 9 Jun 2025 02:18:15 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 2CE0A6B0089; Mon, 9 Jun 2025 02:18:15 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0011.hostedemail.com [216.40.44.11]) by kanga.kvack.org (Postfix) with ESMTP id 0ECAB6B007B for ; Mon, 9 Jun 2025 02:18:15 -0400 (EDT) Received: from smtpin05.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay01.hostedemail.com (Postfix) with ESMTP id B0E791D7522 for ; Mon, 9 Jun 2025 06:18:14 +0000 (UTC) X-FDA: 83534857308.05.EF44A8E Received: from out30-124.freemail.mail.aliyun.com (out30-124.freemail.mail.aliyun.com [115.124.30.124]) by imf25.hostedemail.com (Postfix) with ESMTP id CCF70A000A for ; Mon, 9 Jun 2025 06:18:11 +0000 (UTC) Authentication-Results: imf25.hostedemail.com; dkim=pass header.d=linux.alibaba.com header.s=default header.b=FWmLWk74; dmarc=pass (policy=none) header.from=linux.alibaba.com; spf=pass (imf25.hostedemail.com: domain of baolin.wang@linux.alibaba.com designates 115.124.30.124 as permitted sender) smtp.mailfrom=baolin.wang@linux.alibaba.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1749449893; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=zQzeG0vNP0zhacH+0iF/2q6/uDdbQXYDFVGJBkyJQos=; b=wjo2XE/PhRMVg5Qh7FP5mRtBwWE8oO9yMSQjHHeYWFtrfE/XZHK3u7y3zRUD3HQwMpMSdF zfgSVu5G/bBlH3Fve8zgwdw9CmXRPlV7BYdK4BDenGkS0Arp+gqCVpj/fA5/uRGrfZQM0D qQu2cVmaVWJUG3MTRcUv4FHoj4iIFkM= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1749449893; a=rsa-sha256; cv=none; b=QNhXJCnQDYEZBRXCwPU0mOwn6fVEdSQ6hlsC+bmfM3KKGuRUgkst//PH4cx2hvSsPAMSAd oUDULitVEsAnUHL6TPajK6qnjhP3mSsX7NcK32fPw5mpkgD3+HRR0ZP0+3bp3iY9sHKhab 8pIcj17TMpX+2k49wFADuL4leJpNKjY= ARC-Authentication-Results: i=1; imf25.hostedemail.com; dkim=pass header.d=linux.alibaba.com header.s=default header.b=FWmLWk74; dmarc=pass (policy=none) header.from=linux.alibaba.com; spf=pass (imf25.hostedemail.com: domain of baolin.wang@linux.alibaba.com designates 115.124.30.124 as permitted sender) smtp.mailfrom=baolin.wang@linux.alibaba.com DKIM-Signature:v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.alibaba.com; s=default; t=1749449888; h=Message-ID:Date:MIME-Version:Subject:To:From:Content-Type; bh=zQzeG0vNP0zhacH+0iF/2q6/uDdbQXYDFVGJBkyJQos=; b=FWmLWk74BAd/6PLzPc99JA7kSHD7Nk+b8RAzeLd6uR396xjQ6tJ6BsSKJV/H5RVSvDp3ON8uZ08FozTMY1oUJ4XMaKjS9I2aOCwDss/3csuGwraYVEJrqxpsbYkOa9Pj03e+90ZQWrAP3lFazdzrqxy4/3+SYzxl5f2fFHqFNWM= Received: from 30.74.144.144(mailfrom:baolin.wang@linux.alibaba.com fp:SMTPD_---0WdLPzMy_1749449887 cluster:ay36) by smtp.aliyun-inc.com; Mon, 09 Jun 2025 14:18:08 +0800 Message-ID: <43872045-518a-4690-acf2-a0eb4362ca1c@linux.alibaba.com> Date: Mon, 9 Jun 2025 14:18:07 +0800 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH v2 1/2] mm: huge_memory: disallow hugepages if the system-wide THP sysfs settings are disabled To: Lorenzo Stoakes Cc: akpm@linux-foundation.org, hughd@google.com, david@redhat.com, Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com, dev.jain@arm.com, ziy@nvidia.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org References: <8eefb0809c598fadaa4a022634fba5689a4f3257.1749109709.git.baolin.wang@linux.alibaba.com> <998a069c-9be5-4a10-888c-ba8269eaa333@lucifer.local> <381ca19f-4b62-41b8-9883-f233b50d6521@lucifer.local> From: Baolin Wang In-Reply-To: <381ca19f-4b62-41b8-9883-f233b50d6521@lucifer.local> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Rspamd-Server: rspam10 X-Rspamd-Queue-Id: CCF70A000A X-Stat-Signature: nyr67opoi5s4jnk5bzrsepjzpofz8chy X-Rspam-User: X-HE-Tag: 1749449891-978484 X-HE-Meta: U2FsdGVkX194HR4Rm3cnwXrcxb8nF40ZQjKm7h4ppU+FFhKRt76whfDZeR/qU+ckJRmpIOX8utJ6b3b91/Do5lMgHbpLgw7t8ykad49MB+TjyvJaP82bPkpIf5MQhWebSKa/1lTGTrW/Z3IZPuApsNe2emY2QO9Znh36XmBgZs7S/M51gG75uym6YwBzkd+jfXLcHXEYogdaJbhjHgLFI8QYcTpwJ/oDWPs4taiaxOcq95m6rI0xoylFTWjru1gb9wWjK8kY+/9BDCPCFaC+RhX6oNCypSXL32wr6HhPTxwgSs8nTL/QWS0JhFlmT54r8cja549yttYTBwSAIxeiKu8AuQdjHaiZr/wlGlLY5lN2JM5IAJRPdMU1rTBoNylnuqp89J10K0/ZLLHuoxHl185z97a/nM8FEfhWM4paprv+LCAeyz5QrrvMSjD6wbpWSELb8GOFmkdOiC2cV1uQwWRrUokv5rCG6h7V95IJh5evTt3AYDGKunw2FfGPyk/mkyrcplE43RXrQhn0kpp7LAsyk8ZgnBPbOA7Stkp0/kJpKpL0H4PEFtx6oj2T5deSIDV8CznEcPVc96wPCTIQVOMUI2hK9MsfMH7zD1fk5TSf2NcHoFuFh3g8LqSMGj9wViD/Vw/DUINiD85BXkMff8PPhh3UtuqxPO2bSf3W4qKxvRlqFBz3vzD615anViaS5JPCb65Q71KI9kCg4afJB8EIkfUcRoIz9TZBq/FCSln+A9GiGOT9wpwElwoX5H3zfcm278/Po/AjBG4h2yrIS+CC5AlZo7s4xcgGSo+TslDWAVDLnnpSw/3gyG+JrpCJiRTWB7SM8F7EirsNc7OdahJetgjKtjhc2AZqlT9onF4+Cw1RGNymTZo2eJeLXP2wtWfMdpAsTlxTj6baHxBodHVgpwngUy53Yn9GnUEVTFqxI1l/mI6CE3dZbqxTW7LKqnoS6OcOCW724kEzO96 Mks6mz3t dBKBJvkSbVxeof9IFhspCpr4t05+fqe3HLWsUcz876CjLotug/zYj9WnSros+Gc9n01YtX+GxYJyyRcvi3dRcaU7X+9AwJqocUXzwDDBuPzRGG9r/+++JXG9GBGafBr89EycUbKw1H8IwA1hY6yBtmFb+i20nsj85x5W+HokDC0f7aHt4yktkLbBEpRToRXTA6pwrA9XoQzg7f2aYmzO2SKzlMZEghVbDa0saRrk5eMeWicaWRdEbGgtksmV2SlRR86giS/JWtikddnxWFgvqYZX+m0/F6UOSK5zKNU2ifihumlE+YbhH0gb33NxrKWGwYK/G53aCPFyXoNmxkCZBJmjDV5FHrLKwiTxLKFMktsuTdcbuVYyp3ZFJScgCyQJw2Yb7iglz4v2xg38= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On 2025/6/7 20:21, Lorenzo Stoakes wrote: > A couple follow up points that occurred to me: > > On Sat, Jun 07, 2025 at 12:55:19PM +0100, Lorenzo Stoakes wrote: >> Not related to your patch at all, but man this whole thing (thp allowed orders) >> needs significant improvement, it seems always perversely complicated for a >> relatively simple operation. >> >> Overall I LOVE what you're doing here, but I feel we can clarify things a >> little while we're at it to make it clear exactly what we're doing. >> >> This is a very important change so forgive my fiddling about here but I'm >> hoping we can take the opportunity to make things a little simpler! >> >> On Thu, Jun 05, 2025 at 04:00:58PM +0800, Baolin Wang wrote: >>> The MADV_COLLAPSE will ignore the system-wide Anon THP sysfs settings, which >>> means that even though we have disabled the Anon THP configuration, MADV_COLLAPSE >>> will still attempt to collapse into a Anon THP. This violates the rule we have >>> agreed upon: never means never. >>> >>> Another rule for madvise, referring to David's suggestion: “allowing for collapsing >>> in a VM without VM_HUGEPAGE in the "madvise" mode would be fine". >> >> I'm generally not sure it's worth talking only about MADV_COLLAPSE here when >> you're changing what THP is permitted across the board, I may have missed some >> discussion and forgive me if so, but what is special about MADV_COLLAPSE's use >> of thp_vma_allowable_orders() that makes it ignore 'never's moreso than other >> users? > > I'd mention that MADV_COLLAPSE is special because of not specifying > TVA_ENFORCE_SYSFS but you are making this change across the board for all > callers who do not specify this. Currently, besides MADV_COLLAPSE not setting TVA_ENFORCE_SYSFS, there is only one other instance where TVA_ENFORCE_SYSFS is not set, which is in the collapse_pte_mapped_thp() function, but I believe this is reasonable from the comments: /* * If we are here, we've succeeded in replacing all the native pages * in the page cache with a single hugepage. If a mm were to fault-in * this memory (mapped by a suitably aligned VMA), we'd get the hugepage * and map it by a PMD, regardless of sysfs THP settings. As such, let's * analogously elide sysfs THP settings here. */ if (!thp_vma_allowable_order(vma, vma->vm_flags, 0, PMD_ORDER)) return SCAN_VMA_CHECK; > > I'd also CLEARLY mention that you handle David's request re: madvise by > restricting yourself to checking only for NEVER and retaining the existing logic > of not enforcing sysfs settings when TVA_ENFORCE_SYSFS, which includes not > checking the VMA for VM_HUGEPAGE if the madvise mode is enabled. Sure. > > (i.e. addressing David's request). > > [snip] > >> I feel this is compressing a lot of logic in a way that took me several >> readings to understand (hey I might not be the smartest cookie in the jar, >> but we need to account for all levels of kernel developer ;) >> >> I feel like we can make things a lot clearer here by separating out with a >> helper function (means we can drop some indentation too), and also take >> advantage of the fact that, if orders == 0, __thp_vma_allowable_orders() >> exits with 0 early so no need for us to do so ourselves: >> >> /* Strictly mask requested anonymous orders according to sysfs settings. */ >> static inline unsigned long __thp_mask_anon_orders(unsigned long vm_flags, >> unsigned long tva_flags, unsigned long orders) >> { >> unsigned long always = READ_ONCE(huge_anon_orders_always); >> unsigned long madvise = READ_ONCE(huge_anon_orders_madvise); >> unsigned long inherit = READ_ONCE(huge_anon_orders_inherit);; >> bool inherit_enabled = hugepage_global_enabled(); >> bool has_madvise = vm_flags & VM_HUGEPAGE; >> unsigned long mask = always | madvise; >> >> mask = always | madvise; >> if (inherit_enabled) >> mask |= inherit; >> >> /* All set to/inherit NEVER - never means never globally, abort. */ >> if (!(mask & orders)) >> return 0; >> >> /* Otherwise, we only enforce sysfs settings if asked. */ > > Perhaps worth adding a comment here noting that, if the user sets a sysfs mode > of madvise and if TVA_ENFORCE_SYSFS is not set, we don't bother checking whether > the VMA has VM_HUGEPAGE set. Sure, will do. Thanks for reviewing.