From: Baolin Wang
Date: Tue, 24 Jun 2025 09:44:14 +0800
Subject: Re: [PATCH v3 1/2] mm: huge_memory: disallow hugepages if the system-wide THP sysfs settings are disabled
To: Barry Song <21cnbao@gmail.com>
Cc: akpm@linux-foundation.org, hughd@google.com, david@redhat.com, ziy@nvidia.com, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com, dev.jain@arm.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org

On 2025/6/23 19:08, Barry Song wrote:
> On Mon, Jun 23, 2025 at 8:28 PM Baolin Wang
> wrote:
>>
>> When invoking thp_vma_allowable_orders(), the TVA_ENFORCE_SYSFS flag is not
>> specified, we will ignore the THP sysfs settings. Whilst it makes sense for the
>> callers who do not specify this flag, it creates a odd and surprising situation
>> where a sysadmin specifying 'never' for all THP sizes still observing THP pages
>> being allocated and used on the system.
>>
>> The motivating case for this is MADV_COLLAPSE. The MADV_COLLAPSE will ignore
>> the system-wide Anon THP sysfs settings, which means that even though we have
>> disabled the Anon THP configuration, MADV_COLLAPSE will still attempt to collapse
>> into a Anon THP. This violates the rule we have agreed upon: never means never.
>>
>
> Should we update the man page for madv_collapse ?
> https://man7.org/linux/man-pages/man2/madvise.2.html
>
> MADV_COLLAPSE is independent of any sysfs (see sysfs(5))
> setting under /sys/kernel/mm/transparent_hugepage, both in
> terms of determining THP eligibility, and allocation
> semantics. See Linux kernel source file
> Documentation/admin-guide/mm/transhuge.rst for more
> information. MADV_COLLAPSE also ignores huge= tmpfs mount
> when operating on tmpfs files. Allocation for the new
> hugepage may enter direct reclaim and/or compaction,
> regardless of VMA flags (though VM_NOHUGEPAGE is still
> respected).
>
> So this effectively changes the uABI, right?

Good point. Will update the man page.

>> Currently, besides MADV_COLLAPSE not setting TVA_ENFORCE_SYSFS, there is only
>> one other instance where TVA_ENFORCE_SYSFS is not set, which is in the
>> collapse_pte_mapped_thp() function, but I believe this is reasonable from its
>> comments:
>>
>> "
>> /*
>>  * If we are here, we've succeeded in replacing all the native pages
>>  * in the page cache with a single hugepage. If a mm were to fault-in
>>  * this memory (mapped by a suitably aligned VMA), we'd get the hugepage
>>  * and map it by a PMD, regardless of sysfs THP settings. As such, let's
>>  * analogously elide sysfs THP settings here.
>>  */
>> if (!thp_vma_allowable_order(vma, vma->vm_flags, 0, PMD_ORDER))
>> "
>>
>> Another rule for madvise, referring to David's suggestion: “allowing for
>> collapsing in a VM without VM_HUGEPAGE in the "madvise" mode would be fine".
>>
>> To address this issue, the current strategy should be:
>>
>> If no hugepage modes are enabled for the desired orders, nor can we enable them
>> by inheriting from a 'global' enabled setting - then it must be the case that
>> all desired orders either specify or inherit 'NEVER' - and we must abort.
>>
>> Meanwhile, we should fix the khugepaged selftest for MADV_COLLAPSE by enabling
>> THP.
>
> It’s a bit odd that the old test case expects collapsing to succeed
> even when we’ve set it to ‘never’.
> Setting it to ‘always’ doesn’t seem to test anything as a counterpart.
>
> I assume the goal is to test that setting it to ‘never’ prevents collapsing?

The original logic will prevent khugepaged by setting THP_NEVER,
allowing only madvise_collapse() to perform THP collapse. And this is
the logic this patchset tries to fix, which is to also prevent
madvise_collapse() from performing THP collapse when system-wide THP
sysfs settings are disabled. Therefore, it should be changed to
THP_ALWAYS here to allow madvise_collapse() to perform THP collapse.

Of course, the current logic cannot completely disable khugepaged, but I
haven't found a better way to modify it. As David suggested, changing to
MADVISE mode would cause some test cases to fail because some tests
previously set MADV_NOHUGEPAGE, and now there is no other way to clear
the MADV_NOHUGEPAGE flag except for setting MADV_HUGEPAGE. As a result,
khugepaged cannot be completely disabled either. So I think we should
introduce a new method to clear MADV_NOHUGEPAGE flag without setting
MADV_HUGEPAGE in the future.
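
To make the "never means never" rule quoted above a bit more concrete, here is a rough userspace model of the check. It is only a sketch under stated assumptions, not the kernel code and not the patch itself: struct thp_settings, enum global_mode, MODEL_PMD_ORDER and orders_not_never() are hypothetical names invented for this illustration, and order 9 as the PMD order is an assumption (4K base pages, 2M PMD).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the top-level 'enabled' sysfs setting. */
enum global_mode { GLOBAL_NEVER, GLOBAL_MADVISE, GLOBAL_ALWAYS };

/*
 * Hypothetical snapshot of the per-size anon THP sysfs state.
 * Bit N set in a mask means order N uses that mode; an order that is
 * set in none of the three masks is treated as 'never'.
 */
struct thp_settings {
	enum global_mode global;
	unsigned long orders_always;
	unsigned long orders_madvise;
	unsigned long orders_inherit;
};

/* Assumption: PMD order is 9 (4K base pages, 2M PMD). */
#define MODEL_PMD_ORDER 9UL

/*
 * Return the subset of 'orders' that is not forced to 'never'.
 * 'always' and 'madvise' both count as enabled here, regardless of
 * VM_HUGEPAGE, following the rule that collapsing in "madvise" mode
 * without VM_HUGEPAGE is fine. 'inherit' only counts when the global
 * setting itself is not 'never'. A result of 0 means the caller
 * must abort.
 */
static unsigned long orders_not_never(const struct thp_settings *s,
				      unsigned long orders)
{
	unsigned long mask;

	assert(s != NULL);

	mask = s->orders_always | s->orders_madvise;
	if (s->global != GLOBAL_NEVER)
		mask |= s->orders_inherit;

	return orders & mask;
}
```

In this model, with every size left at 'never', orders_not_never(&s, 1UL << MODEL_PMD_ORDER) returns 0, so a caller such as MADV_COLLAPSE would bail out. Setting the PMD size to 'madvise' makes it return the PMD bit even for a VMA without VM_HUGEPAGE.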