Date: Thu, 22 May 2025 15:10:45 +0300
From: Mike Rapoport
To: Usama Arif
Cc: Andrew Morton, david@redhat.com, linux-mm@kvack.org,
 hannes@cmpxchg.org, shakeel.butt@linux.dev, riel@surriel.com,
 ziy@nvidia.com, laoar.shao@gmail.com, baolin.wang@linux.alibaba.com,
 lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, npache@redhat.com,
 ryan.roberts@arm.com, vbabka@suse.cz, jannh@google.com, Arnd Bergmann,
 linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
 kernel-team@meta.com, linux-api@vger.kernel.org
Subject: Re: [PATCH v3 0/7] prctl: introduce PR_SET/GET_THP_POLICY
References: <20250519223307.3601786-1-usamaarif642@gmail.com>
In-Reply-To: <20250519223307.3601786-1-usamaarif642@gmail.com>

(cc'ing linux-api)

On Mon, May 19, 2025 at 11:29:52PM +0100, Usama Arif wrote:
> This series allows changing the THP policy of a process, according to the
> value set in arg2, all of which will be inherited during fork+exec:
> - PR_DEFAULT_MADV_HUGEPAGE: This will set VM_HUGEPAGE and clear VM_NOHUGEPAGE
>   for the default VMA flags. It will also iterate through every VMA in the
>   process and call hugepage_madvise on it, with MADV_HUGEPAGE policy.
>   This effectively allows setting MADV_HUGEPAGE on the entire process.
>   In an environment where different types of workloads are run on the
>   same machine, this will allow workloads that benefit from always having
>   hugepages to do so, without regressing those that don't.
> - PR_DEFAULT_MADV_NOHUGEPAGE: This will set VM_NOHUGEPAGE and clear VM_HUGEPAGE
>   for the default VMA flags. It will also iterate through every VMA in the
>   process and call hugepage_madvise on it, with MADV_NOHUGEPAGE policy.
>   This effectively allows setting MADV_NOHUGEPAGE on the entire process.
>   In an environment where different types of workloads are run on the
>   same machine, this will allow workloads that benefit from having
>   hugepages on an madvise basis only to do so, without regressing those
>   that benefit from having hugepages always.
> - PR_THP_POLICY_SYSTEM: This will reset (clear) both VM_HUGEPAGE and
>   VM_NOHUGEPAGE in the default VMA flags of the process.
>
> In hyperscalers, we have a single THP policy for the entire fleet.
> We have different types of workloads (e.g. AI/compute/databases/etc)
> running on a single server.
> Some of these workloads will benefit from always getting THP at fault
> (or collapsed by khugepaged), some of them will benefit by only getting
> them at madvise.
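
For readers following along, the def_flags handling described for the
three arg2 values can be sketched in plain C. This is a userspace model
only, not code from the series: apply_thp_policy() and the enum
thp_policy names are hypothetical, and the VM_HUGEPAGE/VM_NOHUGEPAGE bit
values are illustrative and should be checked against the tree you
build. The per-VMA hugepage_madvise walk and the actual prctl() plumbing
are left out.

```c
#include <assert.h>

/* Illustrative bit values (assumption); check include/linux/mm.h in your tree. */
#define VM_HUGEPAGE   0x20000000UL
#define VM_NOHUGEPAGE 0x40000000UL

/*
 * Hypothetical stand-ins for the three arg2 values named in the cover
 * letter. The numeric values are placeholders, not the proposed uapi ones.
 */
enum thp_policy {
	THP_POLICY_SYSTEM,
	THP_DEFAULT_MADV_HUGEPAGE,
	THP_DEFAULT_MADV_NOHUGEPAGE,
};

/* Return the new default VMA flags after applying a policy. */
static unsigned long apply_thp_policy(unsigned long def_flags,
				      enum thp_policy policy)
{
	switch (policy) {
	case THP_DEFAULT_MADV_HUGEPAGE:
		/* Set VM_HUGEPAGE, clear VM_NOHUGEPAGE. */
		def_flags |= VM_HUGEPAGE;
		def_flags &= ~VM_NOHUGEPAGE;
		break;
	case THP_DEFAULT_MADV_NOHUGEPAGE:
		/* Set VM_NOHUGEPAGE, clear VM_HUGEPAGE. */
		def_flags |= VM_NOHUGEPAGE;
		def_flags &= ~VM_HUGEPAGE;
		break;
	case THP_POLICY_SYSTEM:
		/* Clear both bits so the system-wide setting applies. */
		def_flags &= ~(VM_HUGEPAGE | VM_NOHUGEPAGE);
		break;
	}

	/* The two bits are mutually exclusive after any policy change. */
	assert(!((def_flags & VM_HUGEPAGE) && (def_flags & VM_NOHUGEPAGE)));
	return def_flags;
}
```

In this model, applying THP_DEFAULT_MADV_HUGEPAGE to flags that already
carry VM_NOHUGEPAGE leaves only VM_HUGEPAGE set, and THP_POLICY_SYSTEM
clears both bits while leaving unrelated bits alone.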
>
> This series is useful for 2 use cases:
> 1) global system policy = madvise, while we want some workloads to get THPs
>    at fault and by khugepaged :- some processes (e.g. AI workloads) benefit
>    from getting THPs at fault (and collapsed by khugepaged). Other workloads
>    like databases will incur a regression (either a performance regression or
>    they are completely memory bound and even a very slight increase in memory
>    will cause them to OOM). So what these patches will do is allow setting
>    prctl(PR_DEFAULT_MADV_HUGEPAGE) on the AI workloads. (This is how
>    workloads are deployed in our (Meta's/Facebook) fleet at this moment.)
>
> 2) global system policy = always, while we want some workloads to get THPs
>    only on madvise basis :- Same reason as 1). What these patches
>    will do is allow setting prctl(PR_DEFAULT_MADV_NOHUGEPAGE) on the database
>    workloads. (We hope this is us (Meta) in the near future: if a majority of
>    workloads show that they benefit from always, we flip the default host
>    setting to "always" across the fleet and workloads that regress can opt out
>    and be "madvise". New services developed will then be tested with always by
>    default. "always" is also the default defconfig option upstream, so I would
>    imagine this is faced by others as well.)
>
> v2->v3: (Thanks Lorenzo for all the below feedback!)
> v2: https://lore.kernel.org/all/20250515133519.2779639-1-usamaarif642@gmail.com/
> - no more flags2.
> - no more MMF2_...
> - renamed policy to PR_DEFAULT_MADV_(NO)HUGEPAGE
> - mmap_write_lock_killable acquired in PR_GET_THP_POLICY
> - mmap_write lock fixed in PR_SET_THP_POLICY
> - mmap assert check in process_default_madv_hugepage
> - check if hugepage_global_enabled is enabled in the call and account for s390
> - set mm->def_flags VM_HUGEPAGE and VM_NOHUGEPAGE according to the policy in
>   the way done by madvise(). I believe VM merge will not be broken in
>   this way.
> - process_default_madv_hugepage function that does for_each_vma and calls
>   hugepage_madvise.
>
> v1->v2:
> - change from modifying the THP decision making for the process, to modifying
>   VMA flags only. This prevents further complicating the logic used to
>   determine THP order (Thanks David!)
> - change from using a prctl per policy change to just using PR_SET_THP_POLICY
>   and arg2 to set the policy. (Zi Yan)
> - Introduce PR_THP_POLICY_DEFAULT_NOHUGE and PR_THP_POLICY_DEFAULT_SYSTEM
> - Add selftests and documentation.
>
> Usama Arif (7):
>   mm: khugepaged: extract vm flag setting outside of hugepage_madvise
>   prctl: introduce PR_DEFAULT_MADV_HUGEPAGE for the process
>   prctl: introduce PR_DEFAULT_MADV_NOHUGEPAGE for the process
>   prctl: introduce PR_THP_POLICY_SYSTEM for the process
>   selftests: prctl: introduce tests for PR_DEFAULT_MADV_NOHUGEPAGE
>   selftests: prctl: introduce tests for PR_THP_POLICY_DEFAULT_HUGE
>   docs: transhuge: document process level THP controls
>
>  Documentation/admin-guide/mm/transhuge.rst  |  42 +++
>  include/linux/huge_mm.h                     |   2 +
>  include/linux/mm.h                          |   2 +-
>  include/linux/mm_types.h                    |   4 +-
>  include/uapi/linux/prctl.h                  |   6 +
>  kernel/sys.c                                |  53 ++++
>  mm/huge_memory.c                            |  13 +
>  mm/khugepaged.c                             |  26 +-
>  tools/include/uapi/linux/prctl.h            |   6 +
>  .../trace/beauty/include/uapi/linux/prctl.h |   6 +
>  tools/testing/selftests/prctl/Makefile      |   2 +-
>  tools/testing/selftests/prctl/thp_policy.c  | 286 ++++++++++++++++++
>  12 files changed, 436 insertions(+), 12 deletions(-)
>  create mode 100644 tools/testing/selftests/prctl/thp_policy.c
>
> --
> 2.47.1
>
>

-- 
Sincerely yours,
Mike.