From: Usama Arif
To: Yafang Shao
Cc: Zi Yan, Andrew Morton, david@redhat.com, linux-mm@kvack.org,
 hannes@cmpxchg.org, shakeel.butt@linux.dev, riel@surriel.com,
 baolin.wang@linux.alibaba.com, lorenzo.stoakes@oracle.com,
 Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com,
 linux-kernel@vger.kernel.org, kernel-team@meta.com
Subject: Re: [PATCH 0/1] prctl: allow overriding system THP policy to always
Date: Thu, 8 May 2025 17:04:37 +0100
References: <20250507141132.2773275-1-usamaarif642@gmail.com>
 <293530AA-1AB7-4FA0-AF40-3A8464DC0198@nvidia.com>
 <96eccc48-b632-40b7-9797-1b0780ea59cd@gmail.com>
 <8E3EC5A4-4387-4839-926F-3655188C20F4@nvidia.com>
 <279d29ad-cbd6-4a0e-b904-0a19326334d1@gmail.com>

On 08/05/2025 06:41, Yafang Shao wrote:
> On Thu, May 8, 2025 at 12:09 AM Usama Arif wrote:
>>
>>
>>
>> On 07/05/2025 16:57, Zi Yan wrote:
>>> On 7 May 2025, at 11:12, Usama Arif wrote:
>>>
>>>> On 07/05/2025 15:57, Zi Yan wrote:
>>>>> +Yafang, who is also looking at changing THP config at cgroup/container level.
>
> Thanks
>
>>>>>
>>>>> On 7 May 2025, at 10:00, Usama Arif wrote:
>>>>>
>>>>>> Allowing override of global THP policy per process allows workloads
>>>>>> that have shown to benefit from hugepages to do so, without regressing
>>>>>> workloads that wouldn't benefit. This will allow such types of
>>>>>> workloads to be run/stacked on the same machine.
>>>>>>
>>>>>> It also helps in rolling out hugepages in hyperscaler configurations
>>>>>> for workloads that benefit from them, where a single THP policy is
>>>>>> likely to be used across the entire fleet, and prctl will help override it.
>>>>>>
>>>>>> An advantage of doing it via prctl vs creating a cgroup specific
>>>>>> option (like /sys/fs/cgroup/test/memory.transparent_hugepage.enabled) is
>>>>>> that this will work even when there are no cgroups present, and my
>>>>>> understanding is there is a strong preference of cgroups controls being
>>>>>> hierarchical which usually means them having a numerical value.
>>>>>
>>>>> Hi Usama,
>>>>>
>>>>> Do you mind giving an example on how to change THP policy for a set of
>>>>> processes running in a container (under a cgroup)?
>>>>
>>>> Hi Zi,
>>>>
>>>> In our case, we create the processes in the cgroup via systemd. The way we will enable THP=always
>>>> for processes in a cgroup is in the same way we enable KSM for the cgroup.
>>>> The change in systemd would be very similar to the line in [1], where we would set prctl PR_SET_THP_ALWAYS
>>>> in exec-invoke.
>>>> This is at the start of the process, but you would already know at the start of the process
>>>> whether you want THP=always for it or not.
>>>>
>>>> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045
>>>
>>> You also need to add a new systemd.directives, e.g., MemoryTHP, to
>>> pass the THP enablement or disablement info from a systemd config file.
>>> And if you find those processes do not benefit from using THPs,
>>> you can just change the new "MemoryTHP" config and restart the processes.
>>>
>>> Am I getting it? Thanks.
>>>
>>
>> Yes, that's right. It would be exactly the same as what we (Meta) do
>> for KSM. So have MemoryTHP similar to MemoryKSM [1] and if MemoryTHP is set,
>> the ExecContext->memory_thp would be set similar to memory_ksm [2], and when
>> that is set, the prctl will be called at exec_invoke of the process [3].
>>
>> The systemd changes should be quite simple to do.
>>
>> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/man/systemd.exec.xml#L1978
>> [2] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/dbus-execute.c#L2151
>> [3] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045
>
> This solution carries a risk: since prctl() does not require any
> capabilities, the task itself could call it and override your memory
> policy. While we could enforce CAP_SYS_RESOURCE to restrict this, that
> capability is typically enabled by default in containers, leaving them
> still vulnerable.
>
> This approach might work for Kubernetes/container environments, but it
> would require substantial code changes to implement securely.
>

You can already change the memory policy with prctl: e.g. PR_SET_THP_DISABLE
already exists, and someone could use it to slow the process down. So the
approach this patch takes shouldn't be any more of a security issue than what
is already exposed by the kernel. As you mentioned, if prctl is an issue,
CAP_SYS_RESOURCE should be used to restrict this.

In terms of security vulnerability of prctl, I feel like there are a lot of
others that can be a much, much bigger issue? I just had a look and you can
change the seccomp mode, reset PAC keys(!), even change speculation
control(!!), so I don't think the security argument is valid.
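
To illustrate the point about PR_SET_THP_DISABLE, here is a rough userspace
sketch of what any task can already do today without holding a capability.
It only uses the existing PR_SET_THP_DISABLE / PR_GET_THP_DISABLE options
(Linux 3.15+). The helper names are made up for this example, and the
proposed PR_SET_THP_ALWAYS is deliberately left out since it is not in
mainline headers.

```c
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: read the caller's "THP disable" flag.
 * Returns 1 if set, 0 if clear, -1 on error.
 */
static int thp_disabled(void)
{
	return prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0);
}

/*
 * Hypothetical helper: set (on != 0) or clear (on == 0) the flag.
 * Returns 0 on success, -1 on error.
 */
static int set_thp_disabled(int on)
{
	return prctl(PR_SET_THP_DISABLE, on ? 1 : 0, 0, 0, 0);
}

/*
 * Hypothetical helper: fork() and report the flag as the child sees it.
 * Returns 0 or 1, or -1 if the fork, wait or prctl failed.
 */
static int thp_disabled_in_child(void)
{
	int status, code;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		int v = thp_disabled();

		/* Exit code 0/1 carries the flag, 2 signals an error. */
		_exit(v < 0 ? 2 : v);
	}
	if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
		return -1;
	code = WEXITSTATUS(status);
	return code > 1 ? -1 : code;
}
```

Calling set_thp_disabled(1) followed by thp_disabled_in_child() should
report 1, since the flag is inherited across fork() (and, per prctl(2),
preserved across execve()). That inheritance is the same property the
systemd exec-invoke approach relies on.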