Date: Fri, 9 May 2025 12:46:54 -0400
From: Johannes Weiner
To: Yafang Shao
Cc: David Hildenbrand, Usama Arif, Zi Yan, Andrew Morton, linux-mm@kvack.org, shakeel.butt@linux.dev, riel@surriel.com, baolin.wang@linux.alibaba.com, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com, linux-kernel@vger.kernel.org, kernel-team@meta.com
Subject: Re: [PATCH 0/1] prctl: allow overriding system THP policy to always
Message-ID: <20250509164654.GA608090@cmpxchg.org>
References: <96eccc48-b632-40b7-9797-1b0780ea59cd@gmail.com> <8E3EC5A4-4387-4839-926F-3655188C20F4@nvidia.com> <279d29ad-cbd6-4a0e-b904-0a19326334d1@gmail.com> <20250509051328.GF323143@cmpxchg.org> <41e60fa0-2943-4b3f-ba92-9f02838c881b@redhat.com>

On Fri, May 09, 2025 at 05:43:10PM +0800, Yafang Shao wrote:
> On Fri, May 9, 2025 at 5:31 PM David Hildenbrand wrote:
> >
> > On 09.05.25 11:24, Yafang Shao wrote:
> > > On Fri, May 9, 2025 at 1:13 PM Johannes Weiner wrote:
> > >>
> > >> On Fri, May 09, 2025 at 10:15:08AM +0800, Yafang Shao wrote:
> > >>> On Fri, May 9, 2025 at 12:04 AM Usama Arif wrote:
> > >>>>
> > >>>>
> > >>>>
> > >>>> On 08/05/2025 06:41, Yafang Shao wrote:
> > >>>>> On Thu, May 8, 2025 at 12:09 AM Usama Arif wrote:
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> On 07/05/2025 16:57, Zi Yan wrote:
> > >>>>>>> On 7 May 2025, at 11:12, Usama Arif wrote:
> > >>>>>>>
> > >>>>>>>> On 07/05/2025 15:57, Zi Yan wrote:
> > >>>>>>>>> +Yafang, who is also looking at changing THP config at cgroup/container level.
> > >>>>>
> > >>>>> Thanks
> > >>>>>
> > >>>>>>>>>
> > >>>>>>>>> On 7 May 2025, at 10:00, Usama Arif wrote:
> > >>>>>>>>>
> > >>>>>>>>>> Allowing override of global THP policy per process allows workloads
> > >>>>>>>>>> that have shown to benefit from hugepages to do so, without regressing
> > >>>>>>>>>> workloads that wouldn't benefit.
> > >>>>>>>>>> This will allow such types of
> > >>>>>>>>>> workloads to be run/stacked on the same machine.
> > >>>>>>>>>>
> > >>>>>>>>>> It also helps in rolling out hugepages in hyperscaler configurations
> > >>>>>>>>>> for workloads that benefit from them, where a single THP policy is
> > >>>>>>>>>> likely to be used across the entire fleet, and prctl will help override it.
> > >>>>>>>>>>
> > >>>>>>>>>> An advantage of doing it via prctl vs creating a cgroup specific
> > >>>>>>>>>> option (like /sys/fs/cgroup/test/memory.transparent_hugepage.enabled) is
> > >>>>>>>>>> that this will work even when there are no cgroups present, and my
> > >>>>>>>>>> understanding is there is a strong preference of cgroups controls being
> > >>>>>>>>>> hierarchical which usually means them having a numerical value.
> > >>>>>>>>>
> > >>>>>>>>> Hi Usama,
> > >>>>>>>>>
> > >>>>>>>>> Do you mind giving an example on how to change THP policy for a set of
> > >>>>>>>>> processes running in a container (under a cgroup)?
> > >>>>>>>>
> > >>>>>>>> Hi Zi,
> > >>>>>>>>
> > >>>>>>>> In our case, we create the processes in the cgroup via systemd. The way we will enable THP=always
> > >>>>>>>> for processes in a cgroup is in the same way we enable KSM for the cgroup.
> > >>>>>>>> The change in systemd would be very similar to the line in [1], where we would set prctl PR_SET_THP_ALWAYS
> > >>>>>>>> in exec-invoke.
> > >>>>>>>> This is at the start of the process, but you would already know at the start of the process
> > >>>>>>>> whether you want THP=always for it or not.
> > >>>>>>>>
> > >>>>>>>> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045
> > >>>>>>>
> > >>>>>>> You also need to add a new systemd.directives, e.g., MemoryTHP, to
> > >>>>>>> pass the THP enablement or disablement info from a systemd config file.
> > >>>>>>> And if you find those processes do not benefit from using THPs,
> > >>>>>>> you can just change the new "MemoryTHP" config and restart the processes.
> > >>>>>>>
> > >>>>>>> Am I getting it? Thanks.
> > >>>>>>>
> > >>>>>>
> > >>>>>> Yes, that's right. They would be exactly the same as what we (Meta) do
> > >>>>>> for KSM. So have MemoryTHP similar to MemoryKSM [1] and if MemoryTHP is set,
> > >>>>>> the ExecContext->memory_thp would be set similar to memory_ksm [2], and when
> > >>>>>> that is set, the prctl will be called at exec_invoke of the process [3].
> > >>>>>>
> > >>>>>> The systemd changes should be quite simple to do.
> > >>>>>>
> > >>>>>> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/man/systemd.exec.xml#L1978
> > >>>>>> [2] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/dbus-execute.c#L2151
> > >>>>>> [3] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045
> > >>>>>
> > >>>>> This solution carries a risk: since prctl() does not require any
> > >>>>> capabilities, the task itself could call it and override your memory
> > >>>>> policy. While we could enforce CAP_SYS_RESOURCE to restrict this, that
> > >>>>> capability is typically enabled by default in containers, leaving them
> > >>>>> still vulnerable.
> > >>>>>
> > >>>>> This approach might work for Kubernetes/container environments, but it
> > >>>>> would require substantial code changes to implement securely.
> > >>>>>
> > >>>>
> > >>>> You can already change the memory policy with prctl, e.g. PR_SET_THP_DISABLE
> > >>>> already exists and someone could use this to slow the process down. So the
> > >>>> approach this patch takes shouldn't be any more of a security risk than what is already
> > >>>> exposed by the kernel. I think as you mentioned, if prctl is an issue CAP_SYS_RESOURCE
> > >>>> should be used to restrict this.
> > >>>
> > >>> I believe we should at least require CAP_SYS_RESOURCE to enable THP,
> > >>> since it overrides global system settings. Alternatively,
> > >>> CAP_SYS_ADMIN might be even more appropriate, though I'm not entirely
> > >>> certain.
> > >>
> > >> Hm, could you verbalize a concrete security concern?
> > >>
> > >> I've never really looked at the global settings as a hard policy, more
> > >> as picking a default for the workloads in the system. It's usually
> > >> `madvise' or `always', and MADV_HUGEPAGE and MADV_NOHUGEPAGE have long
> > >> existed to give applications the ability to refine the global choice.
> > >>
> > >> The prctl should probably respect `never' for consistency, but beyond
> > >> that I don't really see the concern, or how this would allow something
> > >> that isn't already possible.
> > >
> > > I would interpret the always, madvise, and never options as follows:
> > > - always
> > > The sysadmin strongly recommends using THP. If a user does not
> > > want to use it, they must explicitly disable it.

I would call this "kernel mode" or "auto mode", where userspace should
*generally* not have to worry about huge pages, but with an option for
declaring the odd exceptional case.

Both madvise() and unprivileged prctl() currently work, and IMO should
continue to work, for declaring exceptions.

> > > - madvise
> > > The sysadmin gently encourages the use of THP, but it is only
> > > enabled when explicitly requested by the application.

And this "user mode" or "manual mode", where applications self-manage
which parts of userspace they want to enroll.

Both madvise() and unprivileged prctl() should work here as well,
IMO. There is no policy or security difference between them, it's just
about granularity and usability.

> > > - never
> > > The sysadmin discourages the use of THP, and "its use is only permitted
> > > with explicit approval".

This one I don't quite agree with, and IMO conflicts with what David
is saying as well.
> > "never" so far means "no thps, no exceptions". We've had serious THP
> > issues in the past, where our workaround until we sorted out the issue
> > for affected customers was to force-disable THPs on that system during boot.
>
> Right, that reflects the current behavior. What we aim to enhance is
> by adding the requirement that "its use is only permitted with
> explicit approval."

I think you're conflating a safety issue with a security issue.

David is saying there can be cases where the kernel is broken, and
"never" is a production escape hatch to disable the feature until a
kernel upgrade for the fix is possible. In such a case, it doesn't
make sense to override this decision based on any sort of workload
policy, privileged or not.

The way I understand you is that you want enrollment (and/or
self-management) only for blessed applications, because you don't
generally trust workloads in the wild enough to switch the global
default away from "never", given the semantics of always/madvise.

To me this sounds like you'd need a different mode, call it "blessed",
with a privileged interface to control which applications are allowed
to madvise/prctl-enable.
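
To make the always/madvise/never semantics discussed above concrete, here
is a minimal C sketch. It is a simplified model for illustration, not kernel
code: the names thp_mode, thp_hint, thp_parse_mode() and thp_eligible() are
hypothetical, and the model ignores per-size mTHP controls, khugepaged,
shmem settings, and the proposed PR_SET_THP_ALWAYS. It only assumes the
bracketed format of /sys/kernel/mm/transparent_hugepage/enabled and the
existing MADV_HUGEPAGE, MADV_NOHUGEPAGE and PR_SET_THP_DISABLE knobs.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Global default, as selected in /sys/kernel/mm/transparent_hugepage/enabled. */
enum thp_mode { THP_NEVER, THP_MADVISE, THP_ALWAYS };

/* Per-range hint an application may have given through madvise(). */
enum thp_hint { HINT_NONE, HINT_HUGEPAGE, HINT_NOHUGEPAGE };

/*
 * Parse a line such as "always [madvise] never". The active mode is the
 * word inside the brackets. Returns false if no known mode is bracketed.
 */
bool thp_parse_mode(const char *line, enum thp_mode *mode)
{
	const char *lb = strchr(line, '[');
	const char *rb = lb ? strchr(lb, ']') : NULL;
	size_t len;

	if (!rb)
		return false;

	lb++;	/* skip the '[' itself */
	len = (size_t)(rb - lb);

	if (len == 6 && !strncmp(lb, "always", len))
		*mode = THP_ALWAYS;
	else if (len == 7 && !strncmp(lb, "madvise", len))
		*mode = THP_MADVISE;
	else if (len == 5 && !strncmp(lb, "never", len))
		*mode = THP_NEVER;
	else
		return false;

	return true;
}

/*
 * Simplified eligibility model (an assumption of this sketch):
 *  - "never" wins over everything, no exceptions;
 *  - an opt-out (MADV_NOHUGEPAGE on the range, or PR_SET_THP_DISABLE on
 *    the process) is honored in every mode;
 *  - "always" enrolls everything else;
 *  - "madvise" enrolls only ranges marked with MADV_HUGEPAGE.
 */
bool thp_eligible(enum thp_mode mode, enum thp_hint hint, bool proc_disabled)
{
	if (mode == THP_NEVER || proc_disabled || hint == HINT_NOHUGEPAGE)
		return false;
	if (mode == THP_ALWAYS)
		return true;
	return hint == HINT_HUGEPAGE;
}
```

In this model the unprivileged knobs only ever refine the global default.
None of them can turn THP on under "never", which is the escape-hatch
property described above.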