From: Yafang Shao
Date: Fri, 9 May 2025 17:24:04 +0800
Subject: Re: [PATCH 0/1] prctl: allow overriding system THP policy to always
To: Johannes Weiner
Cc: Usama Arif, Zi Yan, Andrew Morton, david@redhat.com, linux-mm@kvack.org,
    shakeel.butt@linux.dev, riel@surriel.com, baolin.wang@linux.alibaba.com,
    lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, npache@redhat.com,
    ryan.roberts@arm.com, linux-kernel@vger.kernel.org, kernel-team@meta.com
In-Reply-To: <20250509051328.GF323143@cmpxchg.org>
References: <20250507141132.2773275-1-usamaarif642@gmail.com>
    <293530AA-1AB7-4FA0-AF40-3A8464DC0198@nvidia.com>
    <96eccc48-b632-40b7-9797-1b0780ea59cd@gmail.com>
    <8E3EC5A4-4387-4839-926F-3655188C20F4@nvidia.com>
    <279d29ad-cbd6-4a0e-b904-0a19326334d1@gmail.com>
    <20250509051328.GF323143@cmpxchg.org>

On Fri, May 9, 2025 at 1:13 PM Johannes Weiner wrote:
>
> On Fri, May 09, 2025 at 10:15:08AM +0800, Yafang Shao wrote:
> > On Fri, May 9, 2025 at 12:04 AM Usama Arif wrote:
> > >
> > > On 08/05/2025 06:41, Yafang Shao wrote:
> > > > On Thu, May 8, 2025 at 12:09 AM Usama Arif wrote:
> > > >>
> > > >> On 07/05/2025 16:57, Zi Yan wrote:
> > > >>> On 7 May 2025, at 11:12, Usama Arif wrote:
> > > >>>
> > > >>>> On 07/05/2025 15:57, Zi Yan wrote:
> > > >>>>> +Yafang, who is also looking at changing THP config at cgroup/container level.
> > > >
> > > > Thanks
> > > >
> > > >>>>>
> > > >>>>> On 7 May 2025, at 10:00, Usama Arif wrote:
> > > >>>>>
> > > >>>>>> Allowing override of global THP policy per process allows workloads
> > > >>>>>> that have shown to benefit from hugepages to do so, without regressing
> > > >>>>>> workloads that wouldn't benefit. This will allow such types of
> > > >>>>>> workloads to be run/stacked on the same machine.
> > > >>>>>>
> > > >>>>>> It also helps in rolling out hugepages in hyperscaler configurations
> > > >>>>>> for workloads that benefit from them, where a single THP policy is
> > > >>>>>> likely to be used across the entire fleet, and prctl will help override it.
> > > >>>>>>
> > > >>>>>> An advantage of doing it via prctl vs creating a cgroup specific
> > > >>>>>> option (like /sys/fs/cgroup/test/memory.transparent_hugepage.enabled) is
> > > >>>>>> that this will work even when there are no cgroups present, and my
> > > >>>>>> understanding is there is a strong preference of cgroups controls being
> > > >>>>>> hierarchical which usually means them having a numerical value.
> > > >>>>>
> > > >>>>> Hi Usama,
> > > >>>>>
> > > >>>>> Do you mind giving an example on how to change THP policy for a set of
> > > >>>>> processes running in a container (under a cgroup)?
> > > >>>>
> > > >>>> Hi Zi,
> > > >>>>
> > > >>>> In our case, we create the processes in the cgroup via systemd. The way we will enable THP=always
> > > >>>> for processes in a cgroup is in the same way we enable KSM for the cgroup.
> > > >>>> The change in systemd would be very similar to the line in [1], where we would set prctl PR_SET_THP_ALWAYS
> > > >>>> in exec-invoke.
> > > >>>> This is at the start of the process, but you would already know at the start of the process
> > > >>>> whether you want THP=always for it or not.
> > > >>>>
> > > >>>> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045
> > > >>>
> > > >>> You also need to add a new systemd.directives, e.g., MemoryTHP, to
> > > >>> pass the THP enablement or disablement info from a systemd config file.
> > > >>> And if you find those processes do not benefit from using THPs,
> > > >>> you can just change the new "MemoryTHP" config and restart the processes.
> > > >>>
> > > >>> Am I getting it? Thanks.
> > > >>>
> > > >>
> > > >> Yes, that's right. They would be exactly the same as what we (Meta) do
> > > >> for KSM. So have MemoryTHP similar to MemoryKSM [1] and if MemoryTHP is set,
> > > >> the ExecContext->memory_thp would be set similar to memory_ksm [2], and when
> > > >> that is set, the prctl will be called at exec_invoke of the process [3].
> > > >>
> > > >> The systemd changes should be quite simple to do.
> > > >>
> > > >> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/man/systemd.exec.xml#L1978
> > > >> [2] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/dbus-execute.c#L2151
> > > >> [3] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045
> > > >
> > > > This solution carries a risk: since prctl() does not require any
> > > > capabilities, the task itself could call it and override your memory
> > > > policy. While we could enforce CAP_SYS_RESOURCE to restrict this, that
> > > > capability is typically enabled by default in containers, leaving them
> > > > still vulnerable.
> > > >
> > > > This approach might work for Kubernetes/container environments, but it
> > > > would require substantial code changes to implement securely.
> > > >
> > >
> > > You can already change the memory policy with prctl, e.g. PR_SET_THP_DISABLE
> > > already exists and someone could use this to slow the process down. So the
> > > approach this patch takes shouldn't be any more of a security issue than what is already
> > > exposed by the kernel. I think as you mentioned, if prctl is an issue CAP_SYS_RESOURCE
> > > should be used to restrict this.
> >
> > I believe we should at least require CAP_SYS_RESOURCE to enable THP,
> > since it overrides global system settings. Alternatively,
> > CAP_SYS_ADMIN might be even more appropriate, though I'm not entirely
> > certain.
>
> Hm, could you verbalize a concrete security concern?
>
> I've never really looked at the global settings as a hard policy, more
> as picking a default for the workloads in the system. It's usually
> `madvise' or `always', and MADV_HUGEPAGE and MADV_NOHUGEPAGE have long
> existed to give applications the ability to refine the global choice.
>
> The prctl should probably respect `never' for consistency, but beyond
> that I don't really see the concern, or how this would allow something
> that isn't already possible.

I would interpret the always, madvise, and never options as follows:

- always
  The sysadmin strongly recommends using THP. If a user does not want to
  use it, they must explicitly disable it.
- madvise
  The sysadmin gently encourages the use of THP, but it is only enabled
  when explicitly requested by the application.
- never
  The sysadmin discourages the use of THP, and "its use is only
  permitted with explicit approval".

>
> > > In terms of security vulnerability of prctl, I feel like there are a lot of others
> > > that can be a much, much bigger issue? I just had a look and you can change the
> > > seccomp, reset PAC keys(!) even speculation control(!!), so I don't think the security
> > > argument would be valid.
> >
> > I was surprised to discover that none of these operations require any
> > capabilities to execute.
>
> seccomp enabling is a one-way street, PR_SPEC_FORCE_DISABLE is as
> well. You can reset PAC keys, but presumably, unless you also switch
> to a new execution context with entirely new PAC/AUT pairs, this would
> just crash the application on the next AUT?

It appears so; thank you for the clarification.

--
Regards
Yafang
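
For reference, the way the global mode and the per-process and per-mapping
hints discussed in this thread combine can be sketched as a small toy model.
This is only an illustration under stated assumptions, not kernel code: the
function and parameter names are made up for this sketch, it ignores
khugepaged, shmem and per-size settings, and prctl_thp_always stands for the
PR_SET_THP_ALWAYS behaviour proposed in this series, which is not an existing
interface. That the proposed override respects `never' follows Johannes's
suggestion above; that MADV_NOHUGEPAGE and PR_SET_THP_DISABLE still win over
it is purely an assumption of the sketch.

```python
# Toy model only: NOT kernel code. All names here are hypothetical.

def active_mode(sysfs_text):
    """Return the bracketed word from text such as 'always [madvise] never',
    the format used by /sys/kernel/mm/transparent_hugepage/enabled."""
    for word in sysfs_text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    raise ValueError(f"no bracketed mode in {sysfs_text!r}")


def thp_eligible(mode, *, madv_hugepage=False, madv_nohugepage=False,
                 prctl_thp_disable=False, prctl_thp_always=False):
    """Decide whether a mapping may be backed by THP in this simplified model.

    mode              -- global setting: 'always', 'madvise' or 'never'
    madv_hugepage     -- mapping was marked with MADV_HUGEPAGE
    madv_nohugepage   -- mapping was marked with MADV_NOHUGEPAGE
    prctl_thp_disable -- process used PR_SET_THP_DISABLE
    prctl_thp_always  -- process used the PROPOSED per-process override
    """
    if mode not in ("always", "madvise", "never"):
        raise ValueError(f"unknown mode {mode!r}")
    # 'never' and the explicit opt-outs take precedence. For the proposed
    # override this precedence is an assumption of the sketch.
    if mode == "never" or prctl_thp_disable or madv_nohugepage:
        return False
    # Global 'always', or the proposed per-process 'always' override.
    if mode == "always" or prctl_thp_always:
        return True
    # mode == 'madvise': only mappings that asked for it.
    return madv_hugepage
```

On a Linux machine the current global mode could be fed in with
active_mode(open("/sys/kernel/mm/transparent_hugepage/enabled").read()).
In this model the proposed override only changes the outcome when the global
mode is `madvise' and the mapping has no hint of its own.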