Date: Thu, 29 May 2025 18:21:55 +0100
Subject: Re: [DISCUSSION] proposed mctl() API
From: Usama Arif
To: Lorenzo Stoakes, Andrew Morton, Shakeel Butt, "Liam R . Howlett", David Hildenbrand, Vlastimil Babka, Jann Horn, Arnd Bergmann, Christian Brauner, SeongJae Park, Mike Rapoport, Johannes Weiner, Barry Song <21cnbao@gmail.com>, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-kernel@vger.kernel.org, linux-api@vger.kernel.org, Pedro Falcato
References: <85778a76-7dc8-4ea8-8827-acb45f74ee05@lucifer.local>
In-Reply-To: <85778a76-7dc8-4ea8-8827-acb45f74ee05@lucifer.local>

On 29/05/2025 15:43, Lorenzo Stoakes wrote:
> ## INTRODUCTION
>
> After discussions in various threads (Usama's series adding a new prctl()
> in [0], and a proposal to adapt process_madvise() to do the same -
> conception in [1] and RFC in [2]), it seems fairly clear that it would make
> sense to explore a dedicated API to explicitly allow for actions which
> affect the virtual address space as a whole.
>
> Also, Barry is implementing a feature (currently under RFC) which could
> additionally make use of this API (see [3]).
>
> [0]: https://lore.kernel.org/all/20250515133519.2779639-1-usamaarif642@gmail.com/
> [1]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/
> [2]: https://lore.kernel.org/all/cover.1747686021.git.lorenzo.stoakes@oracle.com/
> [3]: https://lore.kernel.org/all/20250514070820.51793-1-21cnbao@gmail.com/
>
> While madvise() and process_madvise() are useful for altering the
> attributes of VMAs within a virtual address space, it isn't the right fit
> for something that affects the whole address space.
>
> Additionally, a requirement of Usama's proposal (see [0]) is that we have
> the ability to propagate the change in behaviour across fork/exec. This
> further suggests the need for a dedicated interface, as this really sits
> outside the ordinary behaviour of [process_]madvise().
>
> prctl() is too broad and encourages mm code to migrate to kernel/sys.c
> where it is at risk of bit-rotting. It can make it harder/impossible to
> isolate mm logic for testing and logic there might be missed in changes
> moving forward.
>
> It also, like so many kernel interfaces, has 'grown roots out of its pot'
> so to speak - while it started off as an ostensible 'process' manipulation
> interface, prctl() operations manipulate a myriad of task, virtual
> address space and even specific VMA attributes.
>
> At this stage it really is a 'catch-all' for things we simply couldn't fit
> elsewhere.
>
> Therefore, as suggested by the rather excellent Liam Howlett, I propose an
> mm-specific interface that _explicitly_ manipulates attributes of the
> virtual address space as a whole.
>
> I think something that mimics the simplicity of [process_]madvise() makes
> sense - have a limited set of actions that can be taken, and treat them as
> a simple action - a user requests you do XXX to the virtual address space
> (that is, across the mm_struct), and you do it.
>

Hi Lorenzo,

Thanks for writing the proposal, this is awesome!

Whatever the community agrees on, whether it's this or prctl, I'm happy to
move forward with either, as both should accomplish the proposed use case.

I will just add some points here in defence of prctl. This is just for
discussion, and if the community disagrees, I'm completely happy to move
forward with a new syscall as well.

When it comes to having mm code in kernel/sys.c, we could do something
like the below, which would actually clean it up:

diff --git a/kernel/sys.c b/kernel/sys.c
index 3a2df1bd9f64..bfadc339e2c5 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2467,6 +2467,12 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 
 	error = 0;
 	switch (option) {
+	case PR_SET_MM:
+	case PR_GET_THP_DISABLE:
+	case PR_SET_THP_DISABLE:
+	case PR_NEW_MM_THING:
+		error = some_function_in_mm_folder(); // in mm/mctl.c ?
+		break;
 	case PR_SET_PDEATHSIG:
 		if (!valid_signal(arg2)) {
 			error = -EINVAL;

When it comes to prctl becoming a catch-all, with the above cleanup we can
be a lot more careful about what gets added to the mm side of prctl.

The advantage of this is that it avoids adding another syscall.

My personal view (which can be wrong :)) is that a new syscall should be
for something major, and I feel that PR_DEFAULT_MADV_HUGEPAGE and
PR_DEFAULT_MADV_NOHUGEPAGE might be small enough to fit in prctl? But I
completely understand your point of view as well!

> ## INTERFACE
>
> The proposed interface is simply:
>
> 	int mctl(int pidfd, int action, unsigned int flags);
>
> Since PIDFD_SELF is now available, it is easy to invoke this for the
> current process, while also adding the flexibility of being able to apply
> actions to other processes also.
>
> The function will return 0 on success, -1 on failure, with errno set to the
> error that arose, standard stuff.
>
> The behaviour will be tailored to each action taken.
>
> To begin with, I propose a single flag:
>
> - MCTL_SET_DEFAULT_EXEC - Persists this behaviour across fork/exec.
>
> This again will be tailored - only certain actions will be allowed to set
> this flag, and we will of course assert appropriate capabilities, etc. upon
> its use.
>

Sounds good to me. Just to note here: the solution will be used in systemd
in exec_invoke, similar to how KSM is handled with prctl in [1], so for the
THP solution we would need MCTL_SET_DEFAULT_EXEC, as the setting needs to
be inherited across fork+exec.

[1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5046

> All actions would, impact every VMA (if adjustments to VMAs are required).
>
> ## SECURITY
>
> Of course, security will be of utmost concern (Jann's input is important
> here :)
>
> We can vary security requirements depending on the action taken.
>
> For an initial version I suggest we simply limit operations which:
>
> - Operate on a remote process
> - Use the MCTL_SET_DEFAULT_EXEC flag
>
> To those tasks which possess the CAP_SYS_ADMIN capability.
>
> This may be too restrictive - be good to get some feedback on this.
>
> I know Jann raised concerns around privileged execution and perhaps it'd be
> useful to see whether this would make more sense for the SET_DEFAULT_EXEC
> case or not.
>
> Usama - would requiring CAP_SYS_ADMIN be egregious to your use case?
>

My knowledge of security is limited, so please bear with me, but I didn't
actually understand the security issue and the need for CAP_SYS_ADMIN for
doing VM_(NO)HUGEPAGE.

A process can already madvise its own VMAs, and this is just doing that
for the entire process. And VM_INIT_DEF_MASK is already set to
VM_NOHUGEPAGE, so it will be inherited from the parent. Just adding
VM_HUGEPAGE shouldn't be an issue? Inheriting MMF_VM_HUGEPAGE will mean
that khugepaged would enter for that process as well, which again doesn't
seem like a security issue to me.

> ## IMPLEMENTATION
>
> I think that sensibly we'd need to add some new files here, mm/mctl.c,
> include/linux/mctl.h (the latter of providing the MCTL_xxx actions and
> flags).
>
> We could find ways to share code between mm files where appropriate to
> avoid too much duplication.
>
> I suggest that the best way forward, if we were minded to examine how this
> would look in practice, would be for me to implement an RFC that adds the
> interface, and a simple MCTL_SET_NOHUGEPAGE, MCTL_CLEAR_NOHUGEPAGE
> implementation as a proof of concept.
>
> If we wanted to then go ahead with a non-RFC version, this could then form
> a foundation upon which Usama and Barry could implement their features,
> with Usama then able to add MCTL_[SET/CLEAR]_HUGEPAGE and Barry
> MCTL_[SET/CLEAR]_FADE_ON_DEATH.
>
> Obviously I don't mean to presume to suggest how we might proceed here -
> only suggesting this might be a good way of moving forward and getting
> things done as quickly as possible while allowing you guys to move forward
> with your features.
>
> Let me know if this makes sense, alternatively I could try to find a
> relatively benign action to implement as part of the base work, or we could
> simply collaborate to do it all in one series with multiple authors?
>
> ## RFC
>
> The above is all only in effect 'putting ideas out there' so this is
> entirely an RFC in spirit and intent - let me know if this makes sense in
> whole or part :)
>
> Thanks!
>
> Lorenzo

Again, thanks for the proposal! Happy to move forward with this or prctl.
Just adding my 2 cents in this email.

Thanks,
Usama
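
For illustration only, the capability rule floated in the SECURITY section
(remote targets and MCTL_SET_DEFAULT_EXEC both gated on CAP_SYS_ADMIN)
could be sketched as a small userspace model like the one below. This is
not code from the thread or from any kernel tree: the flag value, struct
mctl_request, mctl_check_permission() and the -EINVAL result for unknown
flags are all assumptions made up for this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical value: the proposal names the flag but assigns no number. */
#define MCTL_SET_DEFAULT_EXEC	(1u << 0)
#define MCTL_VALID_FLAGS	MCTL_SET_DEFAULT_EXEC

/* Hypothetical description of one mctl() request, for illustration only. */
struct mctl_request {
	bool remote;		/* pidfd refers to a process other than the caller */
	unsigned int flags;	/* MCTL_* flags passed by the caller */
	bool has_sys_admin;	/* caller holds CAP_SYS_ADMIN */
};

/*
 * Returns 0 if the request passes the policy floated in the proposal,
 * -EINVAL for unknown flags (an assumption of this sketch), and -EPERM
 * if privilege is required but missing.
 */
int mctl_check_permission(const struct mctl_request *req)
{
	assert(req != NULL);

	if (req->flags & ~MCTL_VALID_FLAGS)
		return -EINVAL;

	/* Remote targets and persistence across fork/exec are the gated cases. */
	if ((req->remote || (req->flags & MCTL_SET_DEFAULT_EXEC)) &&
	    !req->has_sys_admin)
		return -EPERM;

	return 0;
}
```

Under this model, an unprivileged task changing only its own address space
without MCTL_SET_DEFAULT_EXEC is allowed, while the systemd-style use case
described above (persisting the setting across fork+exec) would need
CAP_SYS_ADMIN, which is the restriction being questioned in the reply.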