Message-ID: <4a8b70b1-7ba0-4d60-a3a0-04ac896a672d@gmail.com>
Date: Mon, 21 Jul 2025 18:27:38 +0100
From: Usama Arif
Subject: Re: [PATCH POC] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE
To: David Hildenbrand, linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
 linux-doc@vger.kernel.org, Jonathan Corbet, Andrew Morton,
 Lorenzo Stoakes, Zi Yan, Baolin Wang, "Liam R. Howlett", Nico Pache,
 Ryan Roberts, Dev Jain, Barry Song, Vlastimil Babka, Mike Rapoport,
 Suren Baghdasaryan, Michal Hocko, SeongJae Park, Jann Horn,
 Yafang Shao, Matthew Wilcox
References: <20250721090942.274650-1-david@redhat.com>
In-Reply-To: <20250721090942.274650-1-david@redhat.com>

On 21/07/2025 10:09, David Hildenbrand wrote:
> People want to make use of more THPs, for example, moving from
> THP=never to THP=madvise, or from THP=madvise to THP=always.
>
> While this is great news for every THP desperately waiting to get
> allocated out there, apparently there are some workloads that require a
> bit of care during that transition: once problems are detected, these
> workloads should be started with the old behavior, without making all
> other workloads on the system go back to the old behavior as well.
>
> In essence, the following scenarios are imaginable:
>
> (1) Switch from THP=none to THP=madvise or THP=always, but keep the old
>     behavior (no THP) for selected workloads.
>
> (2) Stay at THP=none, but have "madvise" or "always" behavior for
>     selected workloads.
>
> (3) Switch from THP=madvise to THP=always, but keep the old behavior
>     (THP only when advised) for selected workloads.
>
> (4) Stay at THP=madvise, but have "always" behavior for selected
>     workloads.
>
> In essence, (2) can be emulated through (1), by setting THP!=none while
> disabling THPs for all processes that don't want THPs. It requires
> configuring all workloads, but that is a user-space problem to sort out.
>
> (4) can be emulated through (3) in a similar way.
>
> Back when (1) was relevant in the past, as people started enabling THPs,
> we added PR_SET_THP_DISABLE, so relevant workloads that were not ready
> yet (e.g., Redis) were able to just disable THPs completely. Redis
> still implements the option to use this interface to disable THPs
> completely.
>
> With PR_SET_THP_DISABLE, we added a way to force-disable THPs for a
> workload -- a process, including fork+exec'ed process hierarchy.
> That essentially made us support (1): simply disable THPs for all workloads
> that are not ready for THPs yet, while still enabling THPs system-wide.
>
> The quest for handling (3) and (4) started, but current approaches
> (completely new prctl, options to set other policies per process,
> alternatives to prctl -- mctrl, cgroup handling) don't look particularly
> promising. Likely, the future will use bpf or something similar to
> implement better policies, in particular to also make better decisions
> about THP sizes to use, but this will certainly take a while as that work
> just started.
>
> Long story short: a simple enable/disable is not really suitable for the
> future, so we're not willing to add completely new toggles.
>
> While we could emulate (3)+(4) through (1)+(2) by simply disabling THPs
> completely for these processes, this scares many THPs in our system
> because they could no longer get allocated where they used to be
> allocated: regions flagged as VM_HUGEPAGE. Apparently, that poses a
> problem for relevant workloads, because "no THPs" is certainly worse
> than "THPs only when advised".
>
> Could we simply relax PR_SET_THP_DISABLE, to "disable THPs unless
> explicitly advised by the app through MADV_HUGEPAGE"? *maybe*, but this
> would change the documented semantics quite a bit, and reduce its
> versatility for debugging purposes, so I am not 100% sure that is what we
> want -- although it would certainly be much easier.
>
> So instead, as an easy way forward for (3) and (4), add an option to
> make PR_SET_THP_DISABLE disable *fewer* THPs for a process.
>
> In essence, this patch:
>
> (A) Adds PR_THP_DISABLE_EXCEPT_ADVISED, to be used as a flag in arg3
>     of prctl(PR_SET_THP_DISABLE) when disabling THPs (arg2 != 0).
>
>     So far, arg3 was not allowed to be set (-EINVAL). Now it holds
>     flags.
>
> (B) Makes prctl(PR_GET_THP_DISABLE) return 3 if
>     PR_THP_DISABLE_EXCEPT_ADVISED was set while disabling.
>
>     So far, it would return 1 if THPs were disabled completely. Now
>     it essentially returns the set flags as well.
>
> (C) Renames MMF_DISABLE_THP to MMF_DISABLE_THP_COMPLETELY, to express
>     the semantics clearly.
>
>     Fortunately, there are only two instances outside of prctl() code.
>
> (D) Adds MMF_DISABLE_THP_EXCEPT_ADVISED to express "no THP except for VMAs
>     with VM_HUGEPAGE" -- essentially "thp=madvise" behavior.
>
>     Fortunately, we only have to extend vma_thp_disabled().
>
> (E) Indicates "THP_enabled: 0" in /proc/pid/status only if THPs are
>     disabled completely.
>
>     Only indicating that THPs are disabled when they are really disabled
>     completely, not only partially.
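
To make the encoding in (A) and (B) concrete, here is a minimal user-space
sketch. It assumes the flag value used in this PoC (1 << 1), and
thp_disable_mode() is a hypothetical helper that is not part of the patch: it
only decodes the values that prctl(PR_GET_THP_DISABLE) is described as
returning above (0, 1 or 3).

```c
/* Assumption: flag value as proposed in the PoC, not from a released header. */
#ifndef PR_THP_DISABLE_EXCEPT_ADVISED
#define PR_THP_DISABLE_EXCEPT_ADVISED (1 << 1)
#endif

/*
 * Hypothetical helper: map the return value of prctl(PR_GET_THP_DISABLE)
 * to a description. Bit 0 means "THPs disabled"; the EXCEPT_ADVISED bit
 * means regions with VM_HUGEPAGE are still allowed to use THPs.
 */
static const char *thp_disable_mode(int ret)
{
	if (ret < 0)
		return "error";
	if (ret == 0)
		return "not disabled";
	if (ret & PR_THP_DISABLE_EXCEPT_ADVISED)
		return "disabled except advised";
	return "disabled completely";
}
```

A caller would pass it the result of prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0).
On a kernel without this patch only 0 and 1 can come back, so the
"disabled except advised" case is never reported there.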
>
> The documented semantics in the man page for PR_SET_THP_DISABLE
> "is inherited by a child created via fork(2) and is preserved across
> execve(2)" are maintained. This behavior, for example, allows for
> disabling THPs for a workload through the launching process (e.g.,
> systemd where we fork() a helper process to then exec()).
>
> There is currently no way to prevent a process from issuing
> PR_SET_THP_DISABLE itself to re-enable THPs. We could add a "seal" option
> to PR_SET_THP_DISABLE through another flag if ever required. The known
> users (such as Redis) really use PR_SET_THP_DISABLE to disable THPs, so
> that is not added for now.
>
> Cc: Jonathan Corbet
> Cc: Andrew Morton
> Cc: Lorenzo Stoakes
> Cc: Zi Yan
> Cc: Baolin Wang
> Cc: "Liam R. Howlett"
> Cc: Nico Pache
> Cc: Ryan Roberts
> Cc: Dev Jain
> Cc: Barry Song
> Cc: Vlastimil Babka
> Cc: Mike Rapoport
> Cc: Suren Baghdasaryan
> Cc: Michal Hocko
> Cc: Usama Arif
> Cc: SeongJae Park
> Cc: Jann Horn
> Cc: Liam R. Howlett
> Cc: Yafang Shao
> Cc: Matthew Wilcox
> Signed-off-by: David Hildenbrand
>
> ---
>
> At first, I thought of "why not simply relax PR_SET_THP_DISABLE", but I
> think there might be real use cases where we want to disable any THPs --
> in particular also around debugging THP-related problems, and
> "THP=never" not meaning ... "never" anymore. PR_SET_THP_DISABLE will
> also block MADV_COLLAPSE, which can be very helpful. Of course, I thought
> of having a system-wide config to change PR_SET_THP_DISABLE behavior, but
> I just don't like the semantics.
>
> "prctl: allow overriding system THP policy to always"[1] proposed
> "overriding policies to always", which is just the wrong way around: we
> should not add mechanisms to "enable more" when we already have an
> interface/mechanism to "disable" them (PR_SET_THP_DISABLE). It all gets
> weird otherwise.
>
> "[PATCH 0/6] prctl: introduce PR_SET/GET_THP_POLICY"[2] proposed
> setting the default of the VM_HUGEPAGE, which is similarly the wrong way
> around I think now.
>
> The proposals by Lorenzo to extend process_madvise()[3] and mctrl()[4]
> similarly were around the "default for VM_HUGEPAGE" idea, but after the
> discussion, I think we should better leave VM_HUGEPAGE untouched.
>
> Happy to hear naming suggestions for "PR_THP_DISABLE_EXCEPT_ADVISED" where
> we essentially want to say "leave advised regions alone" -- "keep THP
> enabled for advised regions".
>
> The only thing I really dislike about this is using another MMF_* flag,
> but well, no way around it -- and seems like we could easily support
> more than 32 if we want to, or storing this thp information elsewhere.
>
> I think this here (modifying an existing toggle) is the only prctl()
> extension that we might be willing to accept. In general, I agree, like
> most others, that prctl() is a very bad interface for that -- but
> PR_SET_THP_DISABLE is already there and is getting used.
>
> Long-term, I think the answer will be something based on bpf[5]. Maybe
> in that context, there could still be value in easily disabling THPs for
> selected workloads (esp. debugging purposes).
>
> Jann raised valid concerns[6] about new flags that are persistent across
> exec. As this here is a relaxation to existing PR_SET_THP_DISABLE I
> consider it having a similar security risk as our existing
> PR_SET_THP_DISABLE, but the devil is in the details.
>
> This is *completely* untested and might be utterly broken. It merely
> serves as a PoC of what I think could be done. If this ever goes upstream,
> we need some kselftests for it, and extensive tests.
>
> [1] https://lore.kernel.org/r/20250507141132.2773275-1-usamaarif642@gmail.com
> [2] https://lkml.kernel.org/r/20250515133519.2779639-2-usamaarif642@gmail.com
> [3] https://lore.kernel.org/r/cover.1747686021.git.lorenzo.stoakes@oracle.com
> [4] https://lkml.kernel.org/r/85778a76-7dc8-4ea8-8827-acb45f74ee05@lucifer.local
> [5] https://lkml.kernel.org/r/20250608073516.22415-1-laoar.shao@gmail.com
> [6] https://lore.kernel.org/r/CAG48ez3-7EnBVEjpdoW7z5K0hX41nLQN5Wb65Vg-1p8DdXRnjg@mail.gmail.com
>
> ---
>  Documentation/filesystems/proc.rst |  5 +--
>  fs/proc/array.c                    |  2 +-
>  include/linux/huge_mm.h            | 20 ++++++++---
>  include/linux/mm_types.h           | 13 +++----
>  include/uapi/linux/prctl.h         |  7 ++++
>  kernel/sys.c                       | 58 +++++++++++++++++++++++-------
>  mm/khugepaged.c                    |  2 +-
>  7 files changed, 78 insertions(+), 29 deletions(-)

Thanks for the patch David!

As discussed in the other thread, with the below diff

diff --git a/kernel/sys.c b/kernel/sys.c
index 2a34b2f70890..3912f5b6a02d 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2447,7 +2447,7 @@ static int prctl_set_thp_disable(unsigned long thp_disable, unsigned long flags,
 		return -EINVAL;
 
 	/* Flags are only allowed when disabling. */
-	if (!thp_disable || (flags & ~PR_THP_DISABLE_EXCEPT_ADVISED))
+	if ((!thp_disable && flags) || (flags & ~PR_THP_DISABLE_EXCEPT_ADVISED))
 		return -EINVAL;
 	if (mmap_write_lock_killable(current->mm))
 		return -EINTR;

I tested with the below selftest, and it works. It hopefully covers the
majority of the cases, including fork and re-enabling THPs.
Let me know if it looks ok and please feel free to add this in the
next revision you send.

Once the above diff is included, please feel free to add

Acked-by: Usama Arif
Tested-by: Usama Arif

Thanks!

From ee9004e7d34511a79726ee1314aec0503e6351d4 Mon Sep 17 00:00:00 2001
From: Usama Arif
Date: Thu, 15 May 2025 14:33:33 +0100
Subject: [PATCH] selftests: prctl: introduce tests for PR_THP_DISABLE_EXCEPT_ADVISED

The test is limited to 2M PMD THPs.
It does not modify the system settings in order to not disturb other
processes running in the system.
It checks if the PMD size is 2M, if the 2M policy is set to inherit
and if the system global THP policy is set to "always", so that
the change in behaviour due to PR_THP_DISABLE_EXCEPT_ADVISED can
be seen.

This tests if:
- the process can successfully set the policy
- the policy is carried over to the new process with fork
- no hugepage is obtained when the process doesn't MADV_HUGEPAGE
- a hugepage is obtained when the process does MADV_HUGEPAGE
- the process can successfully re-enable THPs (reset to the system policy)
- a hugepage is obtained after the policy reset

Signed-off-by: Usama Arif
---
 tools/testing/selftests/prctl/Makefile      |   2 +-
 tools/testing/selftests/prctl/thp_disable.c | 207 ++++++++++++++++++++
 2 files changed, 208 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/prctl/thp_disable.c

diff --git a/tools/testing/selftests/prctl/Makefile b/tools/testing/selftests/prctl/Makefile
index 01dc90fbb509..a3cf76585c48 100644
--- a/tools/testing/selftests/prctl/Makefile
+++ b/tools/testing/selftests/prctl/Makefile
@@ -5,7 +5,7 @@ ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/x86/ -e s/x86_64/x86/)
 
 ifeq ($(ARCH),x86)
 TEST_PROGS := disable-tsc-ctxt-sw-stress-test disable-tsc-on-off-stress-test \
-		disable-tsc-test set-anon-vma-name-test set-process-name
+		disable-tsc-test set-anon-vma-name-test set-process-name thp_disable
 all: $(TEST_PROGS)
 
 include ../lib.mk
diff --git a/tools/testing/selftests/prctl/thp_disable.c b/tools/testing/selftests/prctl/thp_disable.c
new file mode 100644
index 000000000000..e524723b3313
--- /dev/null
+++ b/tools/testing/selftests/prctl/thp_disable.c
@@ -0,0 +1,207 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * This test covers the PR_GET/SET_THP_DISABLE functionality of prctl calls
+ * for PR_THP_DISABLE_EXCEPT_ADVISED
+ */
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <sys/mman.h>
+#include <sys/prctl.h>
+#include <sys/wait.h>
+
+#ifndef PR_THP_DISABLE_EXCEPT_ADVISED
+#define PR_THP_DISABLE_EXCEPT_ADVISED (1 << 1)
+#endif
+
+#define CONTENT_SIZE 256
+#define BUF_SIZE (12 * 2 * 1024 * 1024) // 12 x 2MB pages
+
+enum system_policy {
+	SYSTEM_POLICY_ALWAYS,
+	SYSTEM_POLICY_MADVISE,
+	SYSTEM_POLICY_NEVER,
+};
+
+int system_thp_policy;
+
+/* check if the sysfs file contains the expected substring */
+static int check_file_content(const char *file_path, const char *expected_substring)
+{
+	FILE *file = fopen(file_path, "r");
+	char buffer[CONTENT_SIZE];
+
+	if (!file) {
+		perror("Failed to open file");
+		return -1;
+	}
+	if (fgets(buffer, CONTENT_SIZE, file) == NULL) {
+		perror("Failed to read file");
+		fclose(file);
+		return -1;
+	}
+	fclose(file);
+	// Remove newline character from the buffer
+	buffer[strcspn(buffer, "\n")] = '\0';
+	if (strstr(buffer, expected_substring))
+		return 0;
+	else
+		return 1;
+}
+
+/*
+ * The test is designed for 2M hugepages only.
+ * Check if hugepage size is 2M, if 2M size inherits from global
+ * setting, and if the global setting is always.
+ */
+static int sysfs_check(void)
+{
+	int res = 0;
+
+	res = check_file_content("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size", "2097152");
+	if (res) {
+		printf("hpage_pmd_size is not set to 2MB. Skipping test.\n");
+		return -1;
+	}
+	res = check_file_content("/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled",
+				  "[inherit]");
+	if (res) {
+		printf("hugepages-2048kB does not inherit global setting. Skipping test.\n");
+		return -1;
+	}
+
+	res = check_file_content("/sys/kernel/mm/transparent_hugepage/enabled", "[always]");
+	if (!res) {
+		system_thp_policy = SYSTEM_POLICY_ALWAYS;
+		return 0;
+	}
+	printf("Global THP policy not set to always. Skipping test.\n");
+	return -1;
+}
+
+static int check_smaps_for_huge(void)
+{
+	FILE *file = fopen("/proc/self/smaps", "r");
+	int is_anonhuge = 0;
+	char line[256];
+
+	if (!file) {
+		perror("fopen");
+		return -1;
+	}
+
+	while (fgets(line, sizeof(line), file)) {
+		if (strstr(line, "AnonHugePages:") && strstr(line, "24576 kB")) {
+			is_anonhuge = 1;
+			break;
+		}
+	}
+	fclose(file);
+	return is_anonhuge;
+}
+
+static int test_mmap_thp(int madvise_buffer)
+{
+	int is_anonhuge;
+
+	char *buffer = (char *)mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
+				    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	if (buffer == MAP_FAILED) {
+		perror("mmap");
+		return -1;
+	}
+	if (madvise_buffer)
+		madvise(buffer, BUF_SIZE, MADV_HUGEPAGE);
+
+	// set memory to ensure it's allocated
+	memset(buffer, 0, BUF_SIZE);
+	is_anonhuge = check_smaps_for_huge();
+	munmap(buffer, BUF_SIZE);
+	return is_anonhuge;
+}
+
+/* Global policy is always, process is changed to "madvise only" */
+static int test_global_always_process_madvise(void)
+{
+	int is_anonhuge = 0, res = 0, status = 0;
+	pid_t pid;
+
+	if (prctl(PR_SET_THP_DISABLE, 1, PR_THP_DISABLE_EXCEPT_ADVISED, 0, 0) != 0) {
+		perror("prctl failed to set policy to madvise");
+		return -1;
+	}
+
+	/* Make sure prctl changes are carried across fork */
+	pid = fork();
+	if (pid < 0) {
+		perror("fork");
+		exit(EXIT_FAILURE);
+	}
+
+	res = prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0);
+	if (res != 3) {
+		printf("prctl PR_GET_THP_DISABLE returned %d pid %d\n", res, pid);
+		goto err_out;
+	}
+
+	/* global = always, process = madvise, we shouldn't get HPs without madvise */
+	is_anonhuge = test_mmap_thp(0);
+	if (is_anonhuge) {
+		printf(
+		"PR_THP_DISABLE_EXCEPT_ADVISED set but still got hugepages without MADV_HUGEPAGE\n");
+		goto err_out;
+	}
+
+	is_anonhuge = test_mmap_thp(1);
+	if (!is_anonhuge) {
+		printf(
+		"PR_THP_DISABLE_EXCEPT_ADVISED set but didn't get hugepages with MADV_HUGEPAGE\n");
+		goto err_out;
+	}
+
+	/* Reset to system policy */
+	if (prctl(PR_SET_THP_DISABLE, 0, 0, 0, 0) != 0) {
+		perror("prctl failed to set policy to system");
+		goto err_out;
+	}
+
+	is_anonhuge = test_mmap_thp(0);
+	if (!is_anonhuge) {
+		printf("global policy is always but we still didn't get hugepages\n");
+		goto err_out;
+	}
+
+	is_anonhuge = test_mmap_thp(1);
+	if (!is_anonhuge) {
+		printf("global policy is always but we still didn't get hugepages\n");
+		goto err_out;
+	}
+	printf("PASS\n");
+
+	if (pid == 0) {
+		exit(EXIT_SUCCESS);
+	} else {
+		wait(&status);
+		if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
+			return 0;
+		else
+			return -1;
+	}
+
+err_out:
+	if (pid == 0)
+		exit(EXIT_FAILURE);
+	else
+		return -1;
+}
+
+int main(void)
+{
+	if (sysfs_check())
+		return 0;
+
+	if (system_thp_policy == SYSTEM_POLICY_ALWAYS)
+		return test_global_always_process_madvise();
+	return 0;
+}
--
2.47.1
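
For reference, the effect of the kernel/sys.c fixup earlier in this mail on
argument validation can be modelled in user space. This is only a sketch under
assumptions: args_rejected_poc() and args_rejected_fixed() are hypothetical
names that copy the two conditions from that diff, and the flag value is the
one the selftest falls back to.

```c
#include <stdbool.h>

/* Assumption: flag value as proposed in the PoC. */
#ifndef PR_THP_DISABLE_EXCEPT_ADVISED
#define PR_THP_DISABLE_EXCEPT_ADVISED (1 << 1)
#endif

/* Condition as posted in the PoC: any call with thp_disable == 0 is rejected. */
static bool args_rejected_poc(unsigned long thp_disable, unsigned long flags)
{
	return !thp_disable || (flags & ~PR_THP_DISABLE_EXCEPT_ADVISED);
}

/*
 * Condition with the fixup: re-enabling (thp_disable == 0) is rejected only
 * when flags are passed along; unknown flag bits are always rejected.
 */
static bool args_rejected_fixed(unsigned long thp_disable, unsigned long flags)
{
	return (!thp_disable && flags) || (flags & ~PR_THP_DISABLE_EXCEPT_ADVISED);
}
```

With the PoC condition, (thp_disable = 0, flags = 0) is rejected, so THPs
could not be re-enabled. With the fixed condition that call passes, and only a
re-enable call that carries flags (or any call with unknown flag bits) gets
-EINVAL.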