From: Yafang Shao <laoar.shao@gmail.com>
Date: Sun, 25 May 2025 11:01:13 +0800
Subject: Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
To: akpm@linux-foundation.org, david@redhat.com, ziy@nvidia.com,
	baolin.wang@linux.alibaba.com, lorenzo.stoakes@oracle.com,
	Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com,
	dev.jain@arm.com, hannes@cmpxchg.org, usamaarif642@gmail.com,
	gutierrez.asier@huawei-partners.com, willy@infradead.org,
	ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org
Cc: bpf@vger.kernel.org, linux-mm@kvack.org
In-Reply-To: <20250520060504.20251-1-laoar.shao@gmail.com>
References: <20250520060504.20251-1-laoar.shao@gmail.com>
On Tue, May 20, 2025 at 2:05 PM Yafang Shao
wrote:
>
> Background
> ----------
>
> At my current employer, PDD, we have consistently configured THP to "never"
> on our production servers due to past incidents caused by its behavior:
>
> - Increased memory consumption
>   THP significantly raises overall memory usage.
>
> - Latency spikes
>   Random latency spikes occur due to more frequent memory compaction
>   activity triggered by THP.
>
> These issues have made sysadmins hesitant to switch to "madvise" or
> "always" modes.
>
> New Motivation
> --------------
>
> We have now identified that certain AI workloads achieve substantial
> performance gains with THP enabled. However, we've also verified that
> some workloads see little to no benefit, or are even negatively
> impacted, by THP.
>
> In our Kubernetes environment, we deploy mixed workloads on a single
> server to maximize resource utilization. Our goal is to selectively
> enable THP for services that benefit from it while keeping it disabled
> for others. This approach allows us to incrementally enable THP for
> additional services and assess how to make it more viable in production.
>
> Proposed Solution
> -----------------
>
> For this use case, Johannes suggested introducing a dedicated mode [0].
> In this new mode, we could implement BPF-based THP adjustment for
> fine-grained control over tasks or cgroups. If no BPF program is
> attached, THP remains in "never" mode. This solution elegantly meets our
> needs while avoiding the complexity of managing BPF alongside other THP
> modes.
>
> A selftest example demonstrates how to enable THP for the current task
> while keeping it disabled for others.
>
> Alternative Proposals
> ---------------------
>
> - Gutierrez's cgroup-based approach [1]
>   - Proposed adding a new cgroup file to control THP policy.
>   - However, as Johannes noted, cgroups are designed for hierarchical
>     resource allocation, not arbitrary policy settings [2].
>
> - Usama's per-task THP proposal based on prctl() [3]
>   - Enabling THP per task via prctl().
>   - As David pointed out, neither madvise() nor prctl() works in "never"
>     mode [4], making this solution insufficient for our needs.
>
> Conclusion
> ----------
>
> Introducing a new "bpf" mode for BPF-based per-task THP adjustments is
> the most effective solution for our requirements. This approach
> represents a small but meaningful step toward making THP truly usable,
> and manageable, in production environments.
>
> This is currently a PoC implementation. Feedback of any kind is welcome.
>
> Link: https://lore.kernel.org/linux-mm/20250509164654.GA608090@cmpxchg.org/ [0]
> Link: https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com/ [1]
> Link: https://lore.kernel.org/linux-mm/20250430175954.GD2020@cmpxchg.org/ [2]
> Link: https://lore.kernel.org/linux-mm/20250519223307.3601786-1-usamaarif642@gmail.com/ [3]
> Link: https://lore.kernel.org/linux-mm/41e60fa0-2943-4b3f-ba92-9f02838c881b@redhat.com/ [4]
>
> RFC v1->v2:
> The main changes are as follows:
> - Use struct_ops instead of fmod_ret (Alexei)
> - Introduce a new THP mode (Johannes)
> - Introduce new helpers for BPF hook (Zi)
> - Refine the commit log
>
> RFC v1: https://lwn.net/Articles/1019290/
>
> Yafang Shao (5):
>   mm: thp: Add a new mode "bpf"
>   mm: thp: Add hook for BPF based THP adjustment
>   mm: thp: add struct ops for BPF based THP adjustment
>   bpf: Add get_current_comm to bpf_base_func_proto
>   selftests/bpf: Add selftest for THP adjustment
>
>  include/linux/huge_mm.h                       |  15 +-
>  kernel/bpf/cgroup.c                           |   2 -
>  kernel/bpf/helpers.c                          |   2 +
>  mm/Makefile                                   |   3 +
>  mm/bpf_thp.c                                  | 120 ++++++++++++
>  mm/huge_memory.c                              |  65 ++++++-
>  mm/khugepaged.c                               |   3 +
>  tools/testing/selftests/bpf/config            |   1 +
>  .../selftests/bpf/prog_tests/thp_adjust.c     | 175 ++++++++++++++++++
>  .../selftests/bpf/progs/test_thp_adjust.c     |  39 ++++
>  10 files changed, 414 insertions(+), 11 deletions(-)
>  create mode 100644 mm/bpf_thp.c
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c
>
> --
> 2.43.5
>

Hi all,

Let's summarize the current state of the discussion and identify how to
move forward.

- Global-Only Control is Not Viable

We all seem to agree that a global-only control for THP is unwise. In
practice, some workloads benefit from THP while others do not, so a
one-size-fits-all approach doesn't work.

- Should We Use "Always" or "Madvise"?

I suspect no one would choose 'always' in its current state. ;)

Both Lorenzo and David propose relying on the madvise mode. However,
since madvise is an unprivileged userspace mechanism, any user can
freely adjust their own THP policy. This makes fine-grained control by
administrators impossible without breaking userspace compatibility, an
undesirable tradeoff.

Given these limitations, the community should consider introducing a new
"admin" mode for privileged THP policy management.

- Can the Kernel Automatically Manage THP Without User Input?

In practice, users define their own success metrics, such as latency
(RT), queries per second (QPS), or throughput, to evaluate a feature's
usefulness. If a feature fails to improve these metrics, it provides no
practical value.

Currently, the kernel lacks visibility into user-defined metrics, making
fully automated optimization impossible (at least without user input).
More importantly, automatic management offers no benefit if it doesn't
align with user needs.

Exception: for kernel-enforced changes (e.g., the page-to-folio
transition), users must adapt regardless. But THP tuning requires
flexibility; forcing automation without measurable gains is
counterproductive.

(Please correct me if I've overlooked anything.)

-- 
Regards
Yafang