From: Baolin Wang <baolin.wang@linux.alibaba.com>
Date: Fri, 30 May 2025 17:52:03 +0800
Subject: Re: [PATCH 0/2] fix MADV_COLLAPSE issue if THP settings are disabled
To: David Hildenbrand, Ryan Roberts, akpm@linux-foundation.org, hughd@google.com
Cc: lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, npache@redhat.com, dev.jain@arm.com, ziy@nvidia.com, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Message-ID: <19faab84-dd8e-4dfd-bf91-80bcb4a34fe8@linux.alibaba.com>
In-Reply-To: <6caefe0b-c909-4692-a006-7f8b9c0299a6@redhat.com>
References: <05d60e72-3113-41f0-b81f-225397f06c81@arm.com> <9b1bac6c-fd9f-4dc1-8c94-c4da0cbb9e7f@arm.com> <6caefe0b-c909-4692-a006-7f8b9c0299a6@redhat.com>

On 2025/5/30 17:16, David Hildenbrand wrote:
> On 30.05.25 11:10, David Hildenbrand wrote:
>> On 30.05.25 10:59, Ryan Roberts wrote:
>>> On 30/05/2025 09:44, David Hildenbrand wrote:
>>>> On 30.05.25 10:04, Ryan Roberts wrote:
>>>>> On 29/05/2025 09:23, Baolin Wang wrote:
>>>>>> As we discussed in the previous thread [1], MADV_COLLAPSE will ignore
>>>>>> the system-wide anon/shmem THP sysfs settings, which means that even
>>>>>> though we have disabled the anon/shmem THP configuration, MADV_COLLAPSE
>>>>>> will still attempt to collapse into an anon/shmem THP. This violates the
>>>>>> rule we have agreed upon: never means never. This patch set will address
>>>>>> this issue.
>>>>>
>>>>> This is a drive-by comment from me without having the previous context,
>>>>> but...
>>>>>
>>>>> Surely MADV_COLLAPSE *should* ignore the THP sysfs settings? It's a
>>>>> deliberate, user-initiated, synchronous request to use huge pages for a
>>>>> range of memory. There is nothing *transparent* about it; it just happens
>>>>> to be implemented using the same logic that THP uses.
>>>>>
>>>>> I always thought this was a deliberate design decision.
>>>>
>>>> If the admin said "never", then why should a user be able to override
>>>> that?
>>>
>>> Well, my interpretation would be that the admin is saying never
>>> *transparently* give anyone any hugepages; on balance it does more harm
>>> than good for my workloads. The toggle is called
>>> transparent_hugepage/enabled, after all.
>>
>> I'd say it's "enabling transparent huge pages" not "transparently
>> enabling huge pages". After all, these things are ... transparent huge
>> pages.
>>
>> But yeah, it's confusing.
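For what it's worth, the admin's choice in that toggle is easy to check from user space. /sys/kernel/mm/transparent_hugepage/enabled lists every mode and puts the active one in square brackets, for example "always madvise [never]". Below is a minimal user-space sketch that assumes that bracketed format. The enum and the helper names are hypothetical and do not belong to any existing API:

```c
#include <stdio.h>
#include <string.h>

/* Modes reported by /sys/kernel/mm/transparent_hugepage/enabled. */
enum thp_mode { THP_ALWAYS, THP_MADVISE, THP_NEVER, THP_UNKNOWN };

/* Hypothetical helper: return the bracketed (active) mode in buf. */
enum thp_mode parse_thp_mode(const char *buf)
{
	static const struct {
		const char *name;
		enum thp_mode mode;
	} modes[] = {
		{ "always", THP_ALWAYS },
		{ "madvise", THP_MADVISE },
		{ "never", THP_NEVER },
	};
	const char *lb = strchr(buf, '[');
	const char *rb = lb ? strchr(lb, ']') : NULL;

	if (!rb)
		return THP_UNKNOWN;

	/* Compare the text between the brackets against each known mode. */
	size_t len = (size_t)(rb - lb - 1);
	for (size_t i = 0; i < sizeof(modes) / sizeof(modes[0]); i++) {
		if (strlen(modes[i].name) == len &&
		    strncmp(lb + 1, modes[i].name, len) == 0)
			return modes[i].mode;
	}
	return THP_UNKNOWN;
}

/* Hypothetical helper: THP_UNKNOWN if the file cannot be read
 * (for example on a kernel built without THP support). */
enum thp_mode read_thp_mode(void)
{
	char buf[128] = "";
	FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/enabled", "r");

	if (!f)
		return THP_UNKNOWN;
	if (!fgets(buf, sizeof(buf), f))
		buf[0] = '\0';
	fclose(f);
	return parse_thp_mode(buf);
}
```

The parsing is kept separate from the file read, so it can be exercised without a THP-enabled kernel.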
>>
>>>
>>> Whereas MADV_COLLAPSE is deliberately applied to a specific region at an
>>> opportune moment in time, presumably because the user knows that the
>>> region *will* benefit and because that point in the execution is not
>>> sensitive to latency.
>>
>> Not sure if MADV_HUGEPAGE is really *that* different.
>>
>>>
>>> I see them as logically separate.
>>>
>>>>
>>>> The design decision I recall is that if VM_NOHUGEPAGE is set, we'll
>>>> ignore that, because that was set by the app itself (MADV_NOHUGEPAGE).

IIUC, MADV_COLLAPSE does not ignore the VM_NOHUGEPAGE setting: if
VM_NOHUGEPAGE is set, MADV_COLLAPSE will not be allowed to collapse a THP.
See: __thp_vma_allowable_orders() ---> vma_thp_disabled()

>>> Hmm, ok. My instinct would have been the opposite; MADV_NOHUGEPAGE means
>>> "I don't want the risk of latency spikes and memory bloat that THP can
>>> cause", not "ignore my explicit requests to MADV_COLLAPSE".
>>>
>>> But if that decision was already taken and that's the current behavior,
>>> then I agree we have an inconsistency with respect to the sysfs control.
>>>
>>> Perhaps we should be guided by real-world usage - AIUI there is a cloud
>>> that disables THP at system level today (Google?).
>>
>> The use case I am aware of is disabling it for debugging purposes.
>> It saved us quite some headache in the past at customer sites for
>> troubleshooting + workarounds ...
>>
>>
>> Let's take a look at the man page:
>>
>> MADV_COLLAPSE is independent of any sysfs (see sysfs(5)) setting
>> under /sys/kernel/mm/transparent_hugepage, both in terms of determining
>> THP eligibility, and allocation semantics.
>>
>> I recall we discussed that it should ignore the
>> max_ptes_none/swap/shared.
>>
>> But "any" setting would include "enable" ...
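The VM_NOHUGEPAGE point above can be checked from user space with something like the following. It is a rough sketch, not a selftest from the tree, and it rests on these assumptions:

- The PMD size is 2 MiB, as on x86-64 with 4 KiB pages.
- On older libc headers, MADV_COLLAPSE is defined as its UAPI value, 25.
- try_collapse_nohugepage() is a made-up helper.

Given the vma_thp_disabled() check, I would expect the MADV_COLLAPSE call to fail on a kernel that supports it, whatever the sysfs setting is. I have not pinned down the exact errno here.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* value from the Linux UAPI headers */
#endif

/* Assumption: 2 MiB PMD-sized THPs (x86-64 with 4 KiB base pages). */
#define ASSUMED_PMD_SIZE (2UL << 20)

/*
 * Hypothetical helper: mark a PMD-aligned anonymous range
 * MADV_NOHUGEPAGE (sets VM_NOHUGEPAGE), fault it in, then request
 * MADV_COLLAPSE. Returns 0 if both madvise() calls succeed, otherwise
 * the errno of the first failing call.
 */
int try_collapse_nohugepage(void)
{
	size_t len = 2 * ASSUMED_PMD_SIZE;
	char *map = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (map == MAP_FAILED)
		return errno;

	/* Round up to a PMD boundary; the window still fits in the map. */
	char *win = (char *)(((uintptr_t)map + ASSUMED_PMD_SIZE - 1) &
			     ~(uintptr_t)(ASSUMED_PMD_SIZE - 1));
	assert(win + ASSUMED_PMD_SIZE <= map + len);

	const char *step = "MADV_NOHUGEPAGE";
	int err = 0;

	if (madvise(win, ASSUMED_PMD_SIZE, MADV_NOHUGEPAGE) != 0) {
		err = errno;
	} else {
		memset(win, 1, ASSUMED_PMD_SIZE);	/* fault in base pages */
		step = "MADV_COLLAPSE";
		if (madvise(win, ASSUMED_PMD_SIZE, MADV_COLLAPSE) != 0)
			err = errno;
	}
	printf("%s: %s\n", step, err ? strerror(err) : "succeeded");
	munmap(map, len);
	return err;
}
```

On a kernel without THP support, the MADV_NOHUGEPAGE step itself is expected to fail. That is why the helper reports which step failed.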
>
> It kind of contradicts the linked
> Documentation/admin-guide/mm/transhuge.rst, where we have this
> *beautiful* comment
>
> "Transparent Hugepage Support for anonymous memory can be entirely
> disabled (mostly for debugging purposes)".
>
> I mean, "entirely" is also pretty clear to me.

Yes, I agree. We have encountered issues caused by THP in our Alibaba fleet,
and the quickest way to stop the bleeding was to disable THP. In such a case,
we do not expect MADV_COLLAPSE to still collapse a THP.
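To make the two semantics being debated concrete, here is a simplified, hypothetical model. It is not the kernel code, and it is not literally what the patches implement. It looks only at the anon "enabled" mode and VM_NOHUGEPAGE. It ignores prctl(PR_SET_THP_DISABLE), the per-size mTHP controls and the shmem settings:

```c
#include <stdbool.h>

/* Anon THP modes from /sys/kernel/mm/transparent_hugepage/enabled. */
enum thp_sysfs_mode { THP_SYSFS_ALWAYS, THP_SYSFS_MADVISE, THP_SYSFS_NEVER };

/*
 * Hypothetical model of whether MADV_COLLAPSE may collapse a range.
 *   mode:           the admin's system-wide anon THP setting.
 *   vma_nohugepage: the VMA has VM_NOHUGEPAGE (set by MADV_NOHUGEPAGE).
 *   honour_sysfs:   false = man-page semantics (sysfs is ignored),
 *                   true  = "never means never".
 */
bool collapse_allowed(enum thp_sysfs_mode mode, bool vma_nohugepage,
		      bool honour_sysfs)
{
	/* The app opted out: already refused today via vma_thp_disabled(). */
	if (vma_nohugepage)
		return false;

	/* Proposed: the admin's "never" also overrides the app's request. */
	if (honour_sysfs && mode == THP_SYSFS_NEVER)
		return false;

	return true;
}
```

With honour_sysfs set to false, the sysfs mode has no effect, which matches the man page wording. Setting it to true gives the "never means never" behaviour this series argues for.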