Message-ID: <13b68fa0-8755-43d8-8504-d181c2d46134@gmail.com>
Date: Sun, 11 May 2025 15:08:05 +0100
Subject: Re: [PATCH 0/1] prctl: allow overriding system THP policy to always
From: Usama Arif
To: David Hildenbrand, Zi Yan
Cc: Johannes Weiner, Yafang Shao, Andrew Morton, linux-mm@kvack.org,
 shakeel.butt@linux.dev, riel@surriel.com, baolin.wang@linux.alibaba.com,
 lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, npache@redhat.com,
 ryan.roberts@arm.com, linux-kernel@vger.kernel.org, kernel-team@meta.com
In-Reply-To: <913bdc9b-a3c2-401c-99d0-18799850db9e@redhat.com>

On 11/05/2025 09:15, David Hildenbrand wrote:
> On 10.05.25 01:34, Zi Yan wrote:
>> On 9 May 2025, at 18:42, David Hildenbrand wrote:
>>
>>>>>>> - madvise
>>>>>>>     The sysadmin gently encourages the use of THP, but it is only
>>>>>>> enabled when explicitly requested by the application.
>>>>
>>>> And this "user mode" or "manual mode", where applications self-manage
>>>> which parts of userspace they want to enroll.
>>>>
>>>> Both madvise() and unprivileged prctl() should work here as well,
>>>> IMO. There is no policy or security difference between them, it's just
>>>> about granularity and usability.
>>>>
>>>>>>> - never
>>>>>>>      The sysadmin discourages the use of THP, and "its use is only permitted
>>>>>>> with explicit approval" .
>>>>
>>>> This one I don't quite agree with, and IMO conflicts with what David
>>>> is saying as well.
>>>
>>> Yeah ... "never" does not mean "sometimes" in my reality :)
>>>
>>>>
>>>>>> "never" so far means "no thps, no exceptions". We've had serious THP
>>>>>> issues in the past, where our workaround until we sorted out the issue
>>>>>> for affected customers was to force-disable THPs on that system during boot.
>>>>>
>>>>> Right, that reflects the current behavior. What we aim to enhance is
>>>>> by adding the requirement that "its use is only permitted with
>>>>> explicit approval."
>>>>
>>>> I think you're conflating a safety issue with a security issue.
>>>>
>>>> David is saying there can be cases where the kernel is broken, and
>>>> "never" is a production escape hatch to disable the feature until a
>>>> kernel upgrade for the fix is possible. In such a case, it doesn't
>>>> make sense to override this decision based on any sort of workload
>>>> policy, privileged or not.
>>>>
>>>> The way I understand you is that you want enrollment (and/or
>>>> self-management) only for blessed applications. Because you don't
>>>> generally trust workloads in the wild enough to switch the global
>>>> default away from "never", given the semantics of always/madvise.
>>>
>>> Assuming "never" means "never" and "always" means "always" ( crazy, right? :) ), could be make use of "madvise" mode, which essentially means "VM_HUGEPAGE" takes control?
>>>
>>> We'd need
>>>
>>> a) A way to enable THP for a process. Changing the default/vma settings to VM_HUGEPAGE as discussed using a prctl could work.
>>>
>>> b) A way to ignore VM_HUGEPAGE for processes. Maybe the existing prctl to force-disable THPs could work?
>>
>> This means process level control overrides VMA level control, which
>> overrides global control, right?
>>
>> Intuitively, it should be that VMA level control overrides process level
>> control, which overrides global control, namely finer granularly control
>> precedes coarse one. But some apps might not use VMA level control
>> (e.g., madvise) carefully, we want to override that. Maybe ignoring VMA
>> level control is what we want?
>
> Let's take a step back:
>
> Current behavior is
>
> 1) If anybody (global / process / VM) says "never" (never/PR_SET_THP_DISABLE/VM_NOHUGEPAGE), the behavior is "never".

Just to add to the current behavior for completeness: if the global system
setting is set to never, but the per-size hugepage setting (hugepages-2048kB
below) is set to madvise, we do still get a THP, i.e. if I have:

[root@vm4 vmuser]# cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]
[root@vm4 vmuser]# cat /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
always inherit [madvise] never

and then MADV_HUGEPAGE some region, it still gives me a THP.

>
> 2) In "madvise" mode, only "VM_HUGEPAGE" gets THP unless PR_SET_THP_DISABLE is set. So per-process overrides per-VMA.
>
> 3) In "always" mode, everything gets THP unless per-VMA (VM_NOHUGEPAGE) or per-process (PR_SET_THP_DISABLE) disables it.
>
>
> Interestingly, PR_SET_THP_DISABLE used to mimic exactly what I proposed for the other direction (change default of VM_HUGEPAGE), except that it wouldn't modify already existing mappings. Worth looking at 1860033237d4. Not sure if that commit was the right call, but it's the semantics we have today.
>
> That commit notes:
>
> "It should be noted, that the new implementation makes PR_SET_THP_DISABLE master override to any per-VMA setting, which was not the case previously."
>
>
> Whatever we do, we have to be careful to not create more mess or inconsistency.
>
> Especially, if anybody sets VM_NOHUGEPAGE or PR_SET_THP_DISABLE, we must not use THPs, ever.
>

I thought I would also summarize the real-world usecases that we want to solve:

1) global system policy=madvise, process wants "always" policy for itself:
We can have different types of workloads stacked on the same host: some of them
benefit from always having THPs, others will incur a regression (either it's a
performance regression or they are completely memory bound and even a very slight
increase in memory will cause them to OOM). So we want to selectively have "always"
set for just those workloads (processes). (This is how workloads are deployed in
our (Meta's) fleet at this moment.)

2) global system policy=always, process wants "madvise" policy for itself:
Same reasoning as 1, just that the host has a different default policy, and we
don't want the workloads (processes) that regress with "always" to get THPs
everywhere. (We hope this is us (Meta) in the future: if a majority of workloads
show that they benefit from always, we flip the default host setting to "always",
and workloads that regress can opt out and be "madvise". New services developed
will then be tested with always by default. Always is also the default defconfig
option upstream, so I would imagine this is faced by others as well.)

3) global system policy=never, process wants "madvise" policy for itself:
This is what Yafang mentioned in [1]. Sysadmins don't want to switch the global
policy to madvise, but are willing to allow certain processes to use madvise.
But David mentioned in [2] that never means no thps, no exceptions, and that the
only way to solve some issues in the past has been to disable THPs completely.

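To make the gap behind usecases 1 and 2 concrete, here is a rough model of the
current behavior as summarized in this thread, including the per-size sysfs case
from my note above. It is only an illustrative sketch, not kernel code: the names
selected() and thp_allowed() are made up for this mail, and it ignores khugepaged,
shmem/file THPs, alignment and everything else the real checks look at.

```python
import re


def selected(sysfs_text):
    """Return the active (bracketed) value of a THP 'enabled' sysfs file."""
    match = re.search(r"\[(\w+)\]", sysfs_text)
    if match is None:
        raise ValueError("no bracketed value in %r" % sysfs_text)
    return match.group(1)


def thp_allowed(global_mode, size_mode, thp_disabled=False, vma_flag=None):
    """Simplified model: may a fault in this VMA get a THP of this size?

    global_mode  : "always" | "madvise" | "never"
    size_mode    : "always" | "inherit" | "madvise" | "never" (per-size file)
    thp_disabled : True if the process set PR_SET_THP_DISABLE
    vma_flag     : "hugepage" (VM_HUGEPAGE), "nohugepage" (VM_NOHUGEPAGE) or None
    """
    # A per-process or per-VMA "no" always wins.
    if thp_disabled or vma_flag == "nohugepage":
        return False
    # The per-size setting only follows the global one when set to "inherit".
    mode = global_mode if size_mode == "inherit" else size_mode
    if mode == "always":
        return True
    if mode == "madvise":
        return vma_flag == "hugepage"
    return False
```

With this model, thp_allowed("never", "madvise", vma_flag="hugepage") is True
(the case from my sysfs output above), while thp_allowed("madvise", "inherit")
is False for every VMA that was not madvised, and nothing a process can set
today turns that into True for itself, which is usecase 1.
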
Please feel free to add to the above list. I thought it would be good to list them
out so that the solution can be derived with them in mind.

In terms of doing this with prctl, I was able to make prototypes for the 2 approaches
that have been discussed:

a) have prctl change how existing and new VMAs have VM_HUGEPAGE set for the process,
including after fork+exec, as proposed by David. This prototype is available at [3].
This will solve problem 1 discussed above, but I don't think this approach can be used
to solve problems 2 and 3? There isn't a way for a process to change the VMA settings
so that, after prctl, all future allocations are on a madvise basis and not on the
global policy (even if always). IOW we will need some change in thp_vma_allowable_orders
to have it done on a process-level basis.

b) have prctl override global policy *only* for hugepages that "inherit" global and
only if global is "madvise" or "always". This prototype is available at [4]. The way
I did this will achieve usecases 1 and 2, but not 3 (it can very easily be modified to
get 3, but I didn't do it as there may still be a discussion on what should be allowed
when global=never?). I do prefer this method as I think it might be simpler overall
and achieves both usecases (1 and 2).

[1] https://lore.kernel.org/all/CALOAHbBAVELx-fwyoQUH_ypFvT_Zd5ZLjSkAPXxShgCua8ifpA@mail.gmail.com/
[2] https://lore.kernel.org/all/41e60fa0-2943-4b3f-ba92-9f02838c881b@redhat.com/
[3] https://github.com/uarif1/linux/commit/209373cdeda93a55a699e2eee29f88f4e64ac8a5
[4] https://github.com/uarif1/linux/commit/e85c8edbcb4165c00026f0058b71e85f77da23f4

Thanks,
Usama
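
P.S. As a footnote to approach (b): the override rule described there can be
sketched as below. This only models the one sentence in this mail, it is not the
code in the prototype at [4]; effective_mode() and process_override are names
invented for illustration.

```python
# Global modes that a per-process override may replace, per the description of (b).
OVERRIDABLE = ("always", "madvise")


def effective_mode(global_mode, size_mode, process_override=None):
    """Resolve the policy for one hugepage size under approach (b).

    global_mode      : "always" | "madvise" | "never"
    size_mode        : "always" | "inherit" | "madvise" | "never" (per-size file)
    process_override : hypothetical per-process policy set via prctl, or None
    """
    # Sizes with an explicit per-size setting are never touched by the override.
    if size_mode != "inherit":
        return size_mode
    # The override applies only if the global policy is "madvise" or "always".
    if process_override in OVERRIDABLE and global_mode in OVERRIDABLE:
        return process_override
    return global_mode
```

Under these assumptions, effective_mode("madvise", "inherit", "always") gives
"always" (usecase 1) and effective_mode("always", "inherit", "madvise") gives
"madvise" (usecase 2), while effective_mode("never", "inherit", "madvise") stays
"never", so usecase 3 is deliberately not covered.
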