From: Yafang Shao
Date: Sun, 11 May 2025 10:08:05 +0800
Subject: Re: [PATCH 0/1] prctl: allow overriding system THP policy to always
To: Johannes Weiner
Cc: David Hildenbrand, Usama Arif, Zi Yan, Andrew Morton, linux-mm@kvack.org,
    shakeel.butt@linux.dev, riel@surriel.com, baolin.wang@linux.alibaba.com,
    lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, npache@redhat.com,
    ryan.roberts@arm.com, linux-kernel@vger.kernel.org, kernel-team@meta.com
In-Reply-To: <20250509164654.GA608090@cmpxchg.org>

On Sat, May 10, 2025 at 12:47 AM Johannes Weiner wrote:
>
> On Fri, May 09, 2025 at 05:43:10PM +0800, Yafang Shao wrote:
> > On Fri, May 9, 2025 at 5:31 PM David Hildenbrand wrote:
> > >
> > > On 09.05.25 11:24, Yafang Shao wrote:
> > > > On Fri, May 9, 2025 at 1:13 PM Johannes Weiner wrote:
> > > >>
> > > >> On Fri, May 09, 2025 at 10:15:08AM +0800, Yafang Shao wrote:
> > > >>> On Fri, May 9, 2025 at 12:04 AM Usama Arif wrote:
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>> On 08/05/2025 06:41, Yafang Shao wrote:
> > > >>>>> On Thu, May 8, 2025 at 12:09 AM Usama Arif wrote:
> > > >>>>>>
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> On 07/05/2025 16:57, Zi Yan wrote:
> > > >>>>>>> On 7 May 2025, at 11:12, Usama Arif wrote:
> > > >>>>>>>
> > > >>>>>>>> On 07/05/2025 15:57, Zi Yan wrote:
> > > >>>>>>>>> +Yafang, who is also looking at changing THP config at cgroup/container level.
> > > >>>>>
> > > >>>>> Thanks
> > > >>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> On 7 May 2025, at 10:00, Usama Arif wrote:
> > > >>>>>>>>>
> > > >>>>>>>>>> Allowing override of global THP policy per process allows workloads
> > > >>>>>>>>>> that have shown to benefit from hugepages to do so, without regressing
> > > >>>>>>>>>> workloads that wouldn't benefit. This will allow such types of
> > > >>>>>>>>>> workloads to be run/stacked on the same machine.
> > > >>>>>>>>>>
> > > >>>>>>>>>> It also helps in rolling out hugepages in hyperscaler configurations
> > > >>>>>>>>>> for workloads that benefit from them, where a single THP policy is
> > > >>>>>>>>>> likely to be used across the entire fleet, and prctl will help override it.
> > > >>>>>>>>>>
> > > >>>>>>>>>> An advantage of doing it via prctl vs creating a cgroup specific
> > > >>>>>>>>>> option (like /sys/fs/cgroup/test/memory.transparent_hugepage.enabled) is
> > > >>>>>>>>>> that this will work even when there are no cgroups present, and my
> > > >>>>>>>>>> understanding is there is a strong preference of cgroups controls being
> > > >>>>>>>>>> hierarchical which usually means them having a numerical value.
> > > >>>>>>>>>
> > > >>>>>>>>> Hi Usama,
> > > >>>>>>>>>
> > > >>>>>>>>> Do you mind giving an example on how to change THP policy for a set of
> > > >>>>>>>>> processes running in a container (under a cgroup)?
> > > >>>>>>>>
> > > >>>>>>>> Hi Zi,
> > > >>>>>>>>
> > > >>>>>>>> In our case, we create the processes in the cgroup via systemd. The way we will enable THP=always
> > > >>>>>>>> for processes in a cgroup is in the same way we enable KSM for the cgroup.
> > > >>>>>>>> The change in systemd would be very similar to the line in [1], where we would set prctl PR_SET_THP_ALWAYS
> > > >>>>>>>> in exec-invoke.
> > > >>>>>>>> This is at the start of the process, but you would already know at the start of the process
> > > >>>>>>>> whether you want THP=always for it or not.
> > > >>>>>>>>
> > > >>>>>>>> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045
> > > >>>>>>>
> > > >>>>>>> You also need to add a new systemd.directives, e.g., MemoryTHP, to
> > > >>>>>>> pass the THP enablement or disablement info from a systemd config file.
> > > >>>>>>> And if you find those processes do not benefit from using THPs,
> > > >>>>>>> you can just change the new "MemoryTHP" config and restart the processes.
> > > >>>>>>>
> > > >>>>>>> Am I getting it? Thanks.
> > > >>>>>>>
> > > >>>>>>
> > > >>>>>> Yes, thats right. They would exactly the same as what we (Meta) do
> > > >>>>>> for KSM. So have MemoryTHP similar to MemroryKSM [1] and if MemoryTHP is set,
> > > >>>>>> the ExecContext->memory_thp would be set similar to memory_ksm [2], and when
> > > >>>>>> that is set, the prctl will be called at exec_invoke of the process [3].
> > > >>>>>>
> > > >>>>>> The systemd changes should be quite simple to do.
> > > >>>>>>
> > > >>>>>> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/man/systemd.exec.xml#L1978
> > > >>>>>> [2] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/dbus-execute.c#L2151
> > > >>>>>> [3] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045
> > > >>>>>
> > > >>>>> This solution carries a risk: since prctl() does not require any
> > > >>>>> capabilities, the task itself could call it and override your memory
> > > >>>>> policy. While we could enforce CAP_SYS_RESOURCE to restrict this, that
> > > >>>>> capability is typically enabled by default in containers, leaving them
> > > >>>>> still vulnerable.
> > > >>>>>
> > > >>>>> This approach might work for Kubernetes/container environments, but it
> > > >>>>> would require substantial code changes to implement securely.
> > > >>>>>
> > > >>>>
> > > >>>> You can already change the memory policy with prctl, for e.g. PR_SET_THP_DISABLE
> > > >>>> already exists and the someone could use this to slow the process down. So the
> > > >>>> approach this patch takes shouldn't be anymore of a security fix then what is already
> > > >>>> exposed by the kernel. I think as you mentioned, if prctl is an issue CAP_SYS_RESOURCE
> > > >>>> should be used to restrict this.
> > > >>>
> > > >>> I believe we should at least require CAP_SYS_RESOURCE to enable THP,
> > > >>> since it overrides global system settings. Alternatively,
> > > >>> CAP_SYS_ADMIN might be even more appropriate, though I'm not entirely
> > > >>> certain.
> > > >>
> > > >> Hm, could you verbalize a concrete security concern?
> > > >>
> > > >> I've never really looked at the global settings as a hard policy, more
> > > >> as picking a default for the workloads in the system.
> > > >> It's usually `madvise' or `always', and MADV_HUGEPAGE and
> > > >> MADV_NOHUGEPAGE have long existed to give applications the ability
> > > >> to refine the global choice.
> > > >>
> > > >> The prctl should probably respect `never' for consistency, but beyond
> > > >> that I don't really see the concern, or how this would allow something
> > > >> that isn't already possible.
> > > >
> > > > I would interpret the always, madvise, and never options as follows:
> > > > - always
> > > >   The sysadmin strongly recommends using THP. If a user does not
> > > >   want to use it, they must explicitly disable it.
>
> I would call this "kernel mode" or "auto mode", where userspace should
> *generally* not have to worry about huge pages, but with an option for
> declaring the odd exceptional case.
>
> Both madvise() and unprivileged prctl() currently work, and IMO should
> continue to work, for declaring exceptions.
>
> > > > - madvise
> > > >   The sysadmin gently encourages the use of THP, but it is only
> > > >   enabled when explicitly requested by the application.
>
> And this "user mode" or "manual mode", where applications self-manage
> which parts of userspace they want to enroll.
>
> Both madvise() and unprivileged prctl() should work here as well,
> IMO. There is no policy or security difference between them, it's just
> about granularity and usability.
>
> > > > - never
> > > >   The sysadmin discourages the use of THP, and "its use is only permitted
> > > >   with explicit approval" .
>
> This one I don't quite agree with, and IMO conflicts with what David
> is saying as well.
>
> > > "never" so far means "no thps, no exceptions". We've had serious THP
> > > issues in the past, where our workaround until we sorted out the issue
> > > for affected customers was to force-disable THPs on that system during boot.
> >
> > Right, that reflects the current behavior.
> > What we aim to enhance is by adding the requirement that "its use
> > is only permitted with explicit approval."
>
> I think you're conflating a safety issue with a security issue.

I appreciate the correction. English isn't my first language, so I
occasionally don't use words as precisely as I'd like.

>
> David is saying there can be cases where the kernel is broken, and
> "never" is a production escape hatch to disable the feature until a
> kernel upgrade for the fix is possible. In such a case, it doesn't
> make sense to override this decision based on any sort of workload
> policy, privileged or not.
>
> The way I understand you is that you want enrollment (and/or
> self-management) only for blessed applications.

Right.

> Because you don't
> generally trust workloads in the wild enough to switch the global
> default away from "never", given the semantics of always/madvise.

Historically, we have always set the global policy to 'never'. Because
of past incidents, our sysadmins have been hesitant to switch it to
'madvise'. However, we've now discovered that AI services can gain
significant performance benefits from THP. As a solution, we propose
enabling THP exclusively for AI services while keeping the global
setting at 'never'.

>
> To me this sounds like you'd need a different mode, call it "blessed";
> with a privileged interface to control which applications are allowed
> to madvise/prctl-enable.

This appears to be a viable solution.

-- 
Regards
Yafang
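
P.S. To check that we mean the same thing by the modes discussed above,
here is a rough userspace model of the semantics. It is only a sketch,
not kernel code. The names Mode, parse_enabled and thp_eligible are made
up for illustration, and "blessed" is a hypothetical mode that exists
only as a proposal in this thread. The bracketed format is the one
/sys/kernel/mm/transparent_hugepage/enabled uses today, for example
"always [madvise] never".

```python
from enum import Enum


class Mode(Enum):
    ALWAYS = "always"
    MADVISE = "madvise"
    NEVER = "never"
    BLESSED = "blessed"  # hypothetical: only proposed in this thread


def parse_enabled(text):
    """Return the active mode from text such as 'always [madvise] never'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return Mode(token[1:-1])
    raise ValueError("no bracketed mode in %r" % text)


def thp_eligible(mode, opted_in=False, opted_out=False, blessed=False):
    """Model of the semantics discussed in this thread, not kernel code.

    opted_in:  the app asked for THP (MADV_HUGEPAGE or the proposed prctl)
    opted_out: the app declined THP (MADV_NOHUGEPAGE or PR_SET_THP_DISABLE)
    blessed:   a privileged interface approved this app (hypothetical)
    """
    if mode is Mode.NEVER or opted_out:
        return False  # "never" stays a hard escape hatch
    if mode is Mode.ALWAYS:
        return True  # "kernel mode": on unless the app opts out
    if mode is Mode.MADVISE:
        return opted_in  # "user mode": on only when the app asks
    return blessed and opted_in  # BLESSED: a request plus prior approval
```

On a live system the first function could be fed the contents of
/sys/kernel/mm/transparent_hugepage/enabled. The point of the last
branch is that an unprivileged request alone is not enough in the
proposed mode, while "never" keeps its current meaning.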