Message-ID: <0f3b8f9f-a105-4cf9-a0df-bcca66f88c9b@linux.alibaba.com>
Date: Wed, 25 Jun 2025 14:55:19 +0800
Subject: Re: [PATCH v4 0/2] fix MADV_COLLAPSE issue if THP settings are disabled
From: Baolin Wang
To: Dev Jain, Hugh Dickins
Cc: akpm@linux-foundation.org, david@redhat.com, ziy@nvidia.com,
 lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, npache@redhat.com,
 ryan.roberts@arm.com, baohua@kernel.org, zokeefe@google.com,
 shy828301@gmail.com, usamaarif642@gmail.com, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org
References: <75c02dbf-4189-958d-515e-fa80bb2187fc@google.com>
 <88901329-08a0-49b1-b2be-c00d157cb901@arm.com>
In-Reply-To: <88901329-08a0-49b1-b2be-c00d157cb901@arm.com>

On 2025/6/25 14:49, Dev Jain wrote:
>
> On 25/06/25 11:56 am, Baolin Wang wrote:
>>
>>
>> On 2025/6/25 13:53, Hugh Dickins wrote:
>>> On Wed, 25 Jun 2025, Baolin Wang wrote:
>>>
>>>> When invoking thp_vma_allowable_orders(), if the TVA_ENFORCE_SYSFS
>>>> flag is not specified, we will ignore the THP sysfs settings. Whilst
>>>> it makes sense for the callers who do not specify this flag, it
>>>> creates an odd and surprising situation where a sysadmin specifying
>>>> 'never' for all THP sizes still observes THP pages being allocated
>>>> and used on the system. MADV_COLLAPSE is an example of such a case:
>>>> it does not set TVA_ENFORCE_SYSFS when calling
>>>> thp_vma_allowable_orders().
>>>>
>>>> As we discussed in the previous thread [1], MADV_COLLAPSE will ignore
>>>> the system-wide anon/shmem THP sysfs settings, which means that even
>>>> though we have disabled the anon/shmem THP configuration,
>>>> MADV_COLLAPSE will still attempt to collapse into an anon/shmem THP.
>>>> This violates the rule we have agreed upon: never means never.
>>>>
>>>> For example, system administrators who disabled THP everywhere must
>>>> indeed very much not want THP to be used for whatever reason - having
>>>> individual programs being able to quietly override this is very
>>>> surprising and likely to cause headaches for those who desire this
>>>> not to happen on their systems.
>>>>
>>>> This patch set will address the MADV_COLLAPSE issue.
>>>>
>>>> Test
>>>> ====
>>>> 1. Tested the mm selftests and found no regressions.
>>>> 2. With toggling different Anon mTHP settings, the allocation and
>>>>    madvise collapse for anonymous pages work well.
>>>> 3. With toggling different shmem mTHP settings, the allocation and
>>>>    madvise collapse for shmem work well.
>>>> 4. Tested the large order allocation for tmpfs, and works as expected.
>>>>
>>>> [1] https://lore.kernel.org/all/1f00fdc3-a3a3-464b-8565-4c1b23d34f8d@linux.alibaba.com/
>>>>
>>>> Changes from v3:
>>>>   - Collect reviewed tags. Thanks.
>>>>   - Update the commit message, per David.
>>>>
>>>> Changes from v2:
>>>>   - Update the commit message and cover letter, per Lorenzo. Thanks.
>>>>   - Simplify the logic in thp_vma_allowable_orders(), per Lorenzo
>>>>     and David. Thanks.
>>>>
>>>> Changes from v1:
>>>>   - Update the commit message, per Zi.
>>>>   - Add Zi's reviewed tag. Thanks.
>>>>   - Update the shmem logic.
>>>>
>>>> Baolin Wang (2):
>>>>    mm: huge_memory: disallow hugepages if the system-wide THP sysfs
>>>>      settings are disabled
>>>>    mm: shmem: disallow hugepages if the system-wide shmem THP sysfs
>>>>      settings are disabled
>>>>
>>>>   include/linux/huge_mm.h                 | 51 ++++++++++++++++++-------
>>>>   mm/shmem.c                              |  6 +--
>>>>   tools/testing/selftests/mm/khugepaged.c |  8 +---
>>>>   3 files changed, 43 insertions(+), 22 deletions(-)
>>>>
>>>> --
>>>> 2.43.5
>>>
>>> Sorry for chiming in so late, after so much effort: but I beg you,
>>> please drop these.
>>
>> Thanks Hugh for your input. (Yes, we put a lot of effort into
>> discussion and testing :( ).
>>
>>> I did not want to get into a fight, and had been hoping a voice of
>>> reason would come from others, before I got around to responding.
>>>
>>> And indeed Ryan understood correctly at the start; and he, Usama
>>> and Barry, perhaps others I've missed, have raised appropriate
>>> concerns but not prevailed.
>>>
>>> If we're sloganeering, I much prefer "never break userspace" to
>>> "never means never", attractive though that over-simplification is.
>>
>> Yes, agreed. We should not break userspace; however, I doubt whether
>> this can really break userspace.
>> We can set '/sys/kernel/mm/transparent_hugepage/enabled' to 'madvise'
>> to allow MADV_COLLAPSE. Additionally, I really doubt that when the
>> system-wide THP settings are set to 'never', userspace would still
>> expect to collapse into THP using MADV_COLLAPSE.
>
> After this patch, will a user still be able to use MADV_COLLAPSE and
> ensure no interference from khugepaged?

I think so, because khugepaged will still check VM_HUGEPAGE if we set
'/sys/kernel/mm/transparent_hugepage/enabled' to 'madvise'.

>> Moreover, what makes this issue particularly frustrating is that when
>> we introduce mTHP collapse[1], MADV_COLLAPSE complicates matters
>> further. That is, when the system only enables 64K mTHP, MADV_COLLAPSE
>> still allows collapsing into PMD-sized THP. This really breaks the
>> user's settings.
>
> This issue will still be there without this patch right?

No, this issue will be fixed. After this patch, MADV_COLLAPSE can no
longer collapse into PMD-sized THP if the system only enables 64K mTHP.

>> [1] https://lore.kernel.org/all/20250515032226.128900-1-npache@redhat.com/
>>
>>> Seldom has a feature been so thoroughly documented as MADV_COLLAPSE,
>>> in its 6.1 commits and in the "man 2 madvise" page: which are
>>> explicit about MADV_COLLAPSE providing a way to get THPs where the
>>> sysfs setting governing automatic behaviour does not insert them.
>>>
>>> We would all prefer a less messy world of THP tunables.  I certainly
>>> find plenty to dislike there too; and wish that a less assertive name
>>> than "never" had been chosen originally for the default off position.
>>>
>>> But please don't break the accepted and documented behaviour of
>>> MADV_COLLAPSE now.
>>>
>>> If you want to exclude all possibility of THPs, then please use the
>>> prctl(PR_SET_THP_DISABLE); or shmem_enabled=deny (I think it was me
>>> who insisted that be respected by MADV_COLLAPSE back then).
>>
>> Yes, that will prevent MADV_COLLAPSE.
>>
>>> Add a "deny" option to /sys/kernel/mm/transparent_hugepage/enabled
>>> if you like.  (But in these days of filesystem large folios, adding
>>> new protections against them seems a few years late.)
>>>
>>> If Andrew decides that these patches should go in, then I'll have to
>>> scrutinize them more carefully than I've done so far: but currently
>>> I'm hoping to avoid that.
>>>
>>> Hugh
>>
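
For anyone who wants to check which THP settings are actually in effect
before testing MADV_COLLAPSE, here is a minimal userspace sketch. It is
not part of the patch set. It assumes the usual sysfs layout (a
top-level 'enabled' file plus per-size 'hugepages-<size>kB/enabled'
files on kernels with mTHP support) and the bracketed "active option"
format of those files. The helper names selected_option(),
read_thp_settings() and thp_fully_disabled() are made up for
illustration, and shmem_enabled is deliberately not covered.

```python
import re
from pathlib import Path

# Assumed sysfs location of the THP controls.
THP_ROOT = Path("/sys/kernel/mm/transparent_hugepage")


def selected_option(text):
    """Return the bracketed (active) option, e.g. 'madvise' from
    'always [madvise] never', or None if nothing is bracketed."""
    match = re.search(r"\[([^\]]+)\]", text)
    return match.group(1) if match else None


def read_thp_settings(root=THP_ROOT):
    """Collect the active 'enabled' value of the global knob and of
    every per-size knob that exists on this system."""
    settings = {}
    if not root.is_dir():
        return settings  # THP not built in, or not a Linux system
    top = root / "enabled"
    if top.is_file():
        settings["global"] = selected_option(top.read_text())
    for entry in sorted(root.glob("hugepages-*kB/enabled")):
        settings[entry.parent.name] = selected_option(entry.read_text())
    return settings


def thp_fully_disabled(settings):
    """Rough userspace approximation: True only if every knob found
    resolves to 'never', where 'inherit' follows the global knob."""
    if not settings:
        return False
    top = settings.get("global")
    for value in settings.values():
        effective = top if value == "inherit" else value
        if effective != "never":
            return False
    return True


if __name__ == "__main__":
    current = read_thp_settings()
    for name, value in current.items():
        print(f"{name}: {value}")
    print("all anon THP sizes disabled:", thp_fully_disabled(current))
```

Running it as a script prints one line per knob it finds, followed by
the combined result. It only reports the configuration; whether
MADV_COLLAPSE honours that configuration is exactly what this thread is
debating.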