From: Yafang Shao <laoar.shao@gmail.com>
Date: Tue, 27 May 2025 17:43:24 +0800
Subject: Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
To: David Hildenbrand
Cc: akpm@linux-foundation.org, ziy@nvidia.com, baolin.wang@linux.alibaba.com,
	lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, npache@redhat.com,
	ryan.roberts@arm.com, dev.jain@arm.com, hannes@cmpxchg.org,
	usamaarif642@gmail.com, gutierrez.asier@huawei-partners.com,
	willy@infradead.org, ast@kernel.org, daniel@iogearbox.net,
	andrii@kernel.org, bpf@vger.kernel.org, linux-mm@kvack.org
In-Reply-To: <5d48d0c3-89a3-44da-bc1a-9a4601f146a4@redhat.com>
References: <20250520060504.20251-1-laoar.shao@gmail.com>
	<7d8a9a5c-e0ef-4e36-9e1d-1ef8e853aed4@redhat.com>
	<5f0aadb1-28a8-4be0-bad9-16b738840e57@redhat.com>
	<5d48d0c3-89a3-44da-bc1a-9a4601f146a4@redhat.com>
On Tue, May 27, 2025 at 5:27 PM David Hildenbrand wrote:
>
> On 27.05.25 10:40, Yafang Shao wrote:
> > On Tue, May 27, 2025 at 4:30 PM David Hildenbrand wrote:
> >>
> >>>> I don't think we want to add such a mechanism (new mode) where the
> >>>> primary configuration mechanism is through bpf.
> >>>>
> >>>> Maybe bpf could be used as an alternative, but we should look into a
> >>>> reasonable alternative first, like the discussed mctrl()/.../ raised in
> >>>> the process_madvise() series.
> >>>>
> >>>> No "bpf" mode in disguise, please :)
> >>>
> >>> This goal can be readily achieved using a BPF program. In any case, it
> >>> is a feasible solution.
> >>
> >> No BPF-only solution.
> >>
> >>>
> >>>>
> >>>>> We could define
> >>>>> the API as follows:
> >>>>>
> >>>>> struct bpf_thp_ops {
> >>>>>         /**
> >>>>>          * @task_thp_mode: Get the THP mode for a specific task
> >>>>>          *
> >>>>>          * Return:
> >>>>>          * - TASK_THP_ALWAYS:  "always" mode
> >>>>>          * - TASK_THP_MADVISE: "madvise" mode
> >>>>>          * - TASK_THP_NEVER:   "never" mode
> >>>>>          * Future modes can also be added.
> >>>>>          */
> >>>>>         int (*task_thp_mode)(struct task_struct *p);
> >>>>> };
> >>>>>
> >>>>> For observability, we could add a "THP mode" field to
> >>>>> /proc/[pid]/status. For example:
> >>>>>
> >>>>>   $ grep "THP mode" /proc/123/status
> >>>>>   always
> >>>>>   $ grep "THP mode" /proc/456/status
> >>>>>   madvise
> >>>>>   $ grep "THP mode" /proc/789/status
> >>>>>   never
> >>>>>
> >>>>> The THP mode for each task would be determined by the attached BPF
> >>>>> program based on the task's attributes. We would place the BPF hook in
> >>>>> appropriate kernel functions. Note that this setting wouldn't be
> >>>>> inherited during fork/exec - the BPF program would make the decision
> >>>>> dynamically for each task.
> >>>>
> >>>> What would be the mode (default) when the bpf program would not be active?
> >>>>
> >>>>> This approach also enables runtime adjustments to THP modes based on
> >>>>> system-wide conditions, such as memory fragmentation or other
> >>>>> performance overheads. The BPF program could adapt policies
> >>>>> dynamically, optimizing THP behavior in response to changing
> >>>>> workloads.
> >>>>
> >>>> I am not sure that is the proper way to handle these scenarios: I never
> >>>> heard that people would be adjusting the system-wide policy dynamically
> >>>> in that way either.
> >>>>
> >>>> Whatever we do, we have to make sure that what we add won't
> >>>> over-complicate things in the future. Having tooling dynamically adjust
> >>>> the THP policy of processes that coarsely sounds ... very wrong long-term.
> >>>
> >>> This is just an example demonstrating how BPF can be used to adjust
> >>> its flexibility. Notably, all these policies can be implemented
> >>> without modifying the kernel.
> >>
> >> See below on "policy".
> >>
> >>>
> >>>>
> >>>>> As Liam pointed out in another thread, naming is challenging here -
> >>>>> "process" might not be the most accurate term for this context.
> >>>>
> >>>> No, it's not even a per-process thing. It is per MM, and a MM might be
> >>>> used by multiple processes ...
> >>>
> >>> I consistently use 'thread' for the latter case.
> >>
> >> You can use CLONE_VM without CLONE_THREAD ...
> >
> > If I understand correctly, this can only occur for shared THP but not
> > anonymous THP. For instance, if either process allocates an anonymous
> > THP, it would trigger the creation of a new MM. Please correct me if
> > I'm mistaken.
>
> What clone(CLONE_VM) will do is essentially create a new process, that
> shares the MM with the original process. Similar to a thread, just that
> the new process will show up in /proc/ as ... a new process, not as a
> thread under /proc/$pid/tasks of the original process.
>
> Both processes will operate on the shared MM struct as if they were
> ordinary threads. No Copy-on-Write involved.
>
> One example use case I've been involved in is async teardown in QEMU [1].
>
> [1] https://kvm-forum.qemu.org/2022/ibm_async_destroy.pdf

I understand what you mean, but what I'm really confused about is how
this relates to allocating anonymous THP. If either one allocates anon
THP, it will definitely create a new MM, right?

> >>> Additionally, this
> >>> can be implemented per-MM without kernel code modifications.
> >>> With a well-designed API, users can even implement custom THP
> >>> policies - all without altering kernel code.
> >>
> >> You can switch between modes, that's all you can do. I wouldn't really
> >> call that "custom policy" as it is extremely limited.
> >>
> >> And that's exactly my point: it's basic switching between modes ... a
> >> reasonable policy in the future will make placement decisions and not
> >> just state "always/never/madvise".
> >
> > Could you please elaborate further on 'make placement decisions'? As
> > previously mentioned, we (including the broader community) really need
> > the user input to determine whether THP allocation is appropriate in a
> > given case.
>
> The glorious future where we make smarter decisions where to actually
> place THPs even in the "always" mode.
>
> E.g., just because we enable "always" for a process does not mean that
> we really want a THP everywhere; quite the opposite.

So 'always' simply means "the system doesn't guarantee THP allocation
will succeed"?

If that's the case, we should revisit RFC v1 [0], where we proposed
rejecting THP allocations in certain scenarios for specific tasks.

[0] https://lwn.net/Articles/1019290/

> Treat the "always"/"madvise"/"never" as a rough mode, not a future-proof
> policy that we would want to fine-tune dynamically ... that would be
> very limiting.

--
Regards
Yafang