From: Yafang Shao
Date: Fri, 9 May 2025 10:15:08 +0800
Subject: Re: [PATCH 0/1] prctl: allow overriding system THP policy to always
To: Usama Arif
Cc: Zi Yan, Andrew Morton, david@redhat.com, linux-mm@kvack.org,
 hannes@cmpxchg.org, shakeel.butt@linux.dev, riel@surriel.com,
 baolin.wang@linux.alibaba.com, lorenzo.stoakes@oracle.com,
 Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com,
 linux-kernel@vger.kernel.org, kernel-team@meta.com

On Fri, May 9, 2025 at 12:04 AM Usama Arif wrote:
>
>
>
> On 08/05/2025 06:41, Yafang Shao wrote:
> > On Thu, May 8, 2025 at 12:09 AM Usama Arif wrote:
> >>
> >>
> >>
> >> On 07/05/2025 16:57, Zi Yan wrote:
> >>> On 7 May 2025, at 11:12, Usama Arif wrote:
> >>>
> >>>> On 07/05/2025 15:57, Zi Yan wrote:
> >>>>> +Yafang, who is also looking at changing THP config at cgroup/container level.
> >
> > Thanks
> >
> >>>>>
> >>>>> On 7 May 2025, at 10:00, Usama Arif wrote:
> >>>>>
> >>>>>> Allowing override of global THP policy per process allows workloads
> >>>>>> that have shown to benefit from hugepages to do so, without regressing
> >>>>>> workloads that wouldn't benefit. This will allow such types of
> >>>>>> workloads to be run/stacked on the same machine.
> >>>>>>
> >>>>>> It also helps in rolling out hugepages in hyperscaler configurations
> >>>>>> for workloads that benefit from them, where a single THP policy is
> >>>>>> likely to be used across the entire fleet, and prctl will help override it.
> >>>>>>
> >>>>>> An advantage of doing it via prctl vs creating a cgroup specific
> >>>>>> option (like /sys/fs/cgroup/test/memory.transparent_hugepage.enabled) is
> >>>>>> that this will work even when there are no cgroups present, and my
> >>>>>> understanding is there is a strong preference of cgroups controls being
> >>>>>> hierarchical which usually means them having a numerical value.
> >>>>>
> >>>>> Hi Usama,
> >>>>>
> >>>>> Do you mind giving an example on how to change THP policy for a set of
> >>>>> processes running in a container (under a cgroup)?
> >>>>
> >>>> Hi Zi,
> >>>>
> >>>> In our case, we create the processes in the cgroup via systemd. The way we will enable THP=always
> >>>> for processes in a cgroup is in the same way we enable KSM for the cgroup.
> >>>> The change in systemd would be very similar to the line in [1], where we would set prctl PR_SET_THP_ALWAYS
> >>>> in exec-invoke.
> >>>> This is at the start of the process, but you would already know at the start of the process
> >>>> whether you want THP=always for it or not.
> >>>>
> >>>> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045
> >>>
> >>> You also need to add a new systemd.directives, e.g., MemoryTHP, to
> >>> pass the THP enablement or disablement info from a systemd config file.
> >>> And if you find those processes do not benefit from using THPs,
> >>> you can just change the new "MemoryTHP" config and restart the processes.
> >>>
> >>> Am I getting it? Thanks.
> >>>
> >>
> >> Yes, that's right. They would be exactly the same as what we (Meta) do
> >> for KSM. So have MemoryTHP similar to MemoryKSM [1] and if MemoryTHP is set,
> >> the ExecContext->memory_thp would be set similar to memory_ksm [2], and when
> >> that is set, the prctl will be called at exec_invoke of the process [3].
> >>
> >> The systemd changes should be quite simple to do.
> >>
> >> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/man/systemd.exec.xml#L1978
> >> [2] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/dbus-execute.c#L2151
> >> [3] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045
> >
> > This solution carries a risk: since prctl() does not require any
> > capabilities, the task itself could call it and override your memory
> > policy. While we could enforce CAP_SYS_RESOURCE to restrict this, that
> > capability is typically enabled by default in containers, leaving them
> > still vulnerable.
> >
> > This approach might work for Kubernetes/container environments, but it
> > would require substantial code changes to implement securely.
> >
>
> You can already change the memory policy with prctl, e.g. PR_SET_THP_DISABLE
> already exists and someone could use this to slow the process down. So the
> approach this patch takes shouldn't be any more of a security issue than what is already
> exposed by the kernel. I think as you mentioned, if prctl is an issue, CAP_SYS_RESOURCE
> should be used to restrict this.

I believe we should at least require CAP_SYS_RESOURCE to enable THP,
since it overrides global system settings. Alternatively,
CAP_SYS_ADMIN might be even more appropriate, though I'm not entirely
certain.

>
> In terms of security vulnerability of prctl, I feel like there are a lot of others
> that can be a much, much bigger issue? I just had a look and you can change the
> seccomp, reset PAC keys(!), even speculation control(!!), so I don't think the security
> argument would be valid.

I was surprised to discover that none of these operations require any
capabilities to execute.

-- 
Regards
Yafang
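
For reference, the point made in the thread that PR_SET_THP_DISABLE can
already be used by any task without a capability can be sketched roughly
as below. This is a minimal illustration, not code from the patch: it
uses only the existing prctl(2) operations PR_SET_THP_DISABLE and
PR_GET_THP_DISABLE, and the helper names thp_disable_set() and
thp_disable_get() are made up for this example. The proposed
PR_SET_THP_ALWAYS is deliberately not used, since it is only a proposal
in this thread.

```c
#include <errno.h>
#include <sys/prctl.h>

/*
 * Hypothetical helper: set (nonzero) or clear (0) the calling process's
 * "THP disable" flag. No capability is requested or checked here; the
 * kernel decides whether the call is allowed.
 * Returns 0 on success, or a negative errno value on failure.
 */
int thp_disable_set(int disable)
{
    /* arg3..arg5 are passed as 0 */
    if (prctl(PR_SET_THP_DISABLE, disable ? 1UL : 0UL, 0UL, 0UL, 0UL) == -1)
        return -errno;
    return 0;
}

/*
 * Hypothetical helper: query the current flag.
 * Returns 0 if the flag is clear, a positive value if it is set,
 * or a negative errno value on failure.
 */
int thp_disable_get(void)
{
    int ret = prctl(PR_GET_THP_DISABLE, 0UL, 0UL, 0UL, 0UL);

    return ret == -1 ? -errno : ret;
}
```

A caller would invoke thp_disable_set(1) early (for example just before
exec, as systemd does for its KSM setting) and could use
thp_disable_get() to confirm the result. Whether a capability check
belongs on the "enable" side is exactly the open question above.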