Date: Fri, 9 May 2025 01:13:28 -0400
From: Johannes Weiner
To: Yafang Shao
Cc: Usama Arif, Zi Yan, Andrew Morton, david@redhat.com,
	linux-mm@kvack.org, shakeel.butt@linux.dev, riel@surriel.com,
	baolin.wang@linux.alibaba.com, lorenzo.stoakes@oracle.com,
	Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com,
	linux-kernel@vger.kernel.org, kernel-team@meta.com
Subject: Re: [PATCH 0/1] prctl: allow overriding system THP policy to always
Message-ID: <20250509051328.GF323143@cmpxchg.org>
References: <20250507141132.2773275-1-usamaarif642@gmail.com>
	<293530AA-1AB7-4FA0-AF40-3A8464DC0198@nvidia.com>
	<96eccc48-b632-40b7-9797-1b0780ea59cd@gmail.com>
	<8E3EC5A4-4387-4839-926F-3655188C20F4@nvidia.com>
	<279d29ad-cbd6-4a0e-b904-0a19326334d1@gmail.com>

On Fri, May 09, 2025 at 10:15:08AM +0800, Yafang Shao wrote:
> On Fri, May 9, 2025 at 12:04 AM Usama Arif wrote:
> >
> >
> >
> > On 08/05/2025 06:41, Yafang Shao wrote:
> > > On Thu, May 8, 2025 at 12:09 AM Usama Arif wrote:
> > >>
> > >>
> > >>
> > >> On 07/05/2025 16:57, Zi Yan wrote:
> > >>> On 7 May 2025, at 11:12, Usama Arif wrote:
> > >>>
> > >>>> On 07/05/2025 15:57, Zi Yan wrote:
> > >>>>> +Yafang, who is also looking at changing THP config at cgroup/container level.
> > >
> > > Thanks
> > >
> > >>>>>
> > >>>>> On 7 May 2025, at 10:00, Usama Arif wrote:
> > >>>>>
> > >>>>>> Allowing override of global THP policy per process allows workloads
> > >>>>>> that have shown to benefit from hugepages to do so, without regressing
> > >>>>>> workloads that wouldn't benefit. This will allow such types of
> > >>>>>> workloads to be run/stacked on the same machine.
> > >>>>>>
> > >>>>>> It also helps in rolling out hugepages in hyperscaler configurations
> > >>>>>> for workloads that benefit from them, where a single THP policy is
> > >>>>>> likely to be used across the entire fleet, and prctl will help override it.
> > >>>>>>
> > >>>>>> An advantage of doing it via prctl vs creating a cgroup specific
> > >>>>>> option (like /sys/fs/cgroup/test/memory.transparent_hugepage.enabled) is
> > >>>>>> that this will work even when there are no cgroups present, and my
> > >>>>>> understanding is there is a strong preference of cgroups controls being
> > >>>>>> hierarchical which usually means them having a numerical value.
> > >>>>>
> > >>>>> Hi Usama,
> > >>>>>
> > >>>>> Do you mind giving an example on how to change THP policy for a set of
> > >>>>> processes running in a container (under a cgroup)?
> > >>>>
> > >>>> Hi Zi,
> > >>>>
> > >>>> In our case, we create the processes in the cgroup via systemd. The way we will enable THP=always
> > >>>> for processes in a cgroup is in the same way we enable KSM for the cgroup.
> > >>>> The change in systemd would be very similar to the line in [1], where we would set prctl PR_SET_THP_ALWAYS
> > >>>> in exec-invoke.
> > >>>> This is at the start of the process, but you would already know at the start of the process
> > >>>> whether you want THP=always for it or not.
> > >>>>
> > >>>> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045
> > >>>
> > >>> You also need to add a new systemd.directives, e.g., MemoryTHP, to
> > >>> pass the THP enablement or disablement info from a systemd config file.
> > >>> And if you find those processes do not benefit from using THPs,
> > >>> you can just change the new "MemoryTHP" config and restart the processes.
> > >>>
> > >>> Am I getting it? Thanks.
> > >>>
> > >>
> > >> Yes, thats right. They would exactly the same as what we (Meta) do
> > >> for KSM. So have MemoryTHP similar to MemroryKSM [1] and if MemoryTHP is set,
> > >> the ExecContext->memory_thp would be set similar to memory_ksm [2], and when
> > >> that is set, the prctl will be called at exec_invoke of the process [3].
> > >>
> > >> The systemd changes should be quite simple to do.
> > >>
> > >> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/man/systemd.exec.xml#L1978
> > >> [2] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/dbus-execute.c#L2151
> > >> [3] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045
> > >
> > > This solution carries a risk: since prctl() does not require any
> > > capabilities, the task itself could call it and override your memory
> > > policy. While we could enforce CAP_SYS_RESOURCE to restrict this, that
> > > capability is typically enabled by default in containers, leaving them
> > > still vulnerable.
> > >
> > > This approach might work for Kubernetes/container environments, but it
> > > would require substantial code changes to implement securely.
> > >
> >
> > You can already change the memory policy with prctl, for e.g. PR_SET_THP_DISABLE
> > already exists and the someone could use this to slow the process down. So the
> > approach this patch takes shouldn't be anymore of a security fix then what is already
> > exposed by the kernel. I think as you mentioned, if prctl is an issue CAP_SYS_RESOURCE
> > should be used to restrict this.
>
> I believe we should at least require CAP_SYS_RESOURCE to enable THP,
> since it overrides global system settings. Alternatively,
> CAP_SYS_ADMIN might be even more appropriate, though I'm not entirely
> certain.

Hm, could you verbalize a concrete security concern?

I've never really looked at the global settings as a hard policy, more
as picking a default for the workloads in the system. It's usually
`madvise' or `always', and MADV_HUGEPAGE and MADV_NOHUGEPAGE have long
existed to give applications the ability to refine the global choice.

The prctl should probably respect `never' for consistency, but beyond
that I don't really see the concern, or how this would allow something
that isn't already possible.

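For illustration only, here is a minimal userspace sketch of the
existing refinement mechanism referred to above: reading which global
mode is selected and applying a per-range MADV_HUGEPAGE or
MADV_NOHUGEPAGE hint. The helper names thp_selected_mode() and
map_with_thp_hint() are hypothetical and not part of any patch in this
thread. The only assumed input format is the bracketed selection used
by /sys/kernel/mm/transparent_hugepage/enabled.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for MAP_ANONYMOUS and MADV_*HUGEPAGE */
#endif
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: extract the bracketed token from a line in the
 * format of /sys/kernel/mm/transparent_hugepage/enabled, for example
 * "always [madvise] never". Returns 0 on success, or -1 if there is
 * no bracketed token or it does not fit into out.
 */
int thp_selected_mode(const char *line, char *out, size_t outlen)
{
	const char *open = strchr(line, '[');
	const char *close = open ? strchr(open, ']') : NULL;
	size_t len;

	if (!close)
		return -1;
	len = (size_t)(close - open - 1);
	if (len + 1 > outlen)
		return -1;
	memcpy(out, open + 1, len);
	out[len] = '\0';
	return 0;
}

/*
 * Hypothetical helper: map an anonymous region and apply a per-range
 * THP hint. want_huge != 0 requests MADV_HUGEPAGE, otherwise
 * MADV_NOHUGEPAGE. The madvise() result is stored in *advice_rc (if
 * non-NULL) instead of being treated as fatal, because the call fails
 * on kernels built without THP support.
 * Returns the mapping, or NULL if mmap() itself failed.
 */
void *map_with_thp_hint(size_t len, int want_huge, int *advice_rc)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int rc;

	if (p == MAP_FAILED)
		return NULL;
	rc = madvise(p, len, want_huge ? MADV_HUGEPAGE : MADV_NOHUGEPAGE);
	if (advice_rc)
		*advice_rc = rc;
	return p;
}
```

A caller would read the sysfs file, pass the line to
thp_selected_mode(), and request MADV_HUGEPAGE when the result is
`madvise'. Neither call needs any capability, which is the point made
above about applications already being able to refine the global
choice.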
> > In terms of security vulnerability of prctl, I feel like there are a lot of others
> > that can be a much much bigger issue? I just had a look and you can change the
> > seccomp, reset PAC keys(!) even speculation control(!!), so I dont think the security
> > argument would be valid.
>
> I was surprised to discover that none of these operations require any
> capabilities to execute.

seccomp enabling is a one-way street, PR_SPEC_FORCE_DISABLE is as
well. You can reset PAC keys, but presumably, unless you also switch
to a new execution context with entirely new PAC/AUT pairs, this would
just crash the application on the next AUT?
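
As a rough sketch of the one-way behaviour of PR_SPEC_FORCE_DISABLE,
the helpers below query and lock the speculation control state of the
calling task. The helper names are hypothetical, and the choice of
PR_SPEC_STORE_BYPASS is an assumption made for the example; the
discussion above does not name a specific speculation feature. Both
prctl() calls can fail on systems where the feature is not
controllable per task, so callers have to check the return values.

```c
#include <sys/prctl.h>

/*
 * Hypothetical helper: returns nonzero if a PR_GET_SPECULATION_CTRL
 * result reports the force-disabled state, i.e. one the task can no
 * longer switch back with PR_SPEC_ENABLE. Negative values are error
 * returns and are never treated as locked.
 */
int spec_state_is_locked(int state)
{
	return state >= 0 && (state & PR_SPEC_FORCE_DISABLE) != 0;
}

/*
 * Hypothetical helper: query the store-bypass speculation state of
 * the calling task. Returns a PR_SPEC_* bitmask, or -1 with errno set
 * if the kernel or sandbox does not support the query.
 */
int query_ssb_state(void)
{
	return prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS, 0, 0, 0);
}

/*
 * Hypothetical helper: request the one-way disable. No capability is
 * required. Returns 0 on success, or -1 with errno set if per-task
 * control is not available.
 */
int lock_ssb_mitigation(void)
{
	return prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS,
		     PR_SPEC_FORCE_DISABLE, 0, 0);
}
```

After a successful lock_ssb_mitigation(), a later attempt to set
PR_SPEC_ENABLE is documented to fail, so an unprivileged task can only
tighten this setting for itself, never loosen it.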