From: Yafang Shao <laoar.shao@gmail.com>
Date: Sun, 25 May 2025 11:01:13 +0800
Subject: Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
To: akpm@linux-foundation.org, david@redhat.com, ziy@nvidia.com,
	baolin.wang@linux.alibaba.com, lorenzo.stoakes@oracle.com,
	Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com,
	dev.jain@arm.com, hannes@cmpxchg.org, usamaarif642@gmail.com,
	gutierrez.asier@huawei-partners.com, willy@infradead.org,
	ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org
Cc: bpf@vger.kernel.org, linux-mm@kvack.org
In-Reply-To: <20250520060504.20251-1-laoar.shao@gmail.com>
References: <20250520060504.20251-1-laoar.shao@gmail.com>
On Tue, May 20, 2025 at 2:05 PM Yafang Shao
wrote:
>
> Background
> ----------
>
> At my current employer, PDD, we have consistently configured THP to "never"
> on our production servers due to past incidents caused by its behavior:
>
> - Increased memory consumption
>   THP significantly raises overall memory usage.
>
> - Latency spikes
>   Random latency spikes occur due to more frequent memory compaction
>   activity triggered by THP.
>
> These issues have made sysadmins hesitant to switch to "madvise" or
> "always" modes.
>
> New Motivation
> --------------
>
> We have now identified that certain AI workloads achieve substantial
> performance gains with THP enabled. However, we've also verified that
> some workloads see little to no benefit, or are even negatively
> impacted, by THP.
>
> In our Kubernetes environment, we deploy mixed workloads on a single
> server to maximize resource utilization. Our goal is to selectively
> enable THP for services that benefit from it while keeping it disabled
> for others. This approach allows us to incrementally enable THP for
> additional services and assess how to make it more viable in production.
>
> Proposed Solution
> -----------------
>
> For this use case, Johannes suggested introducing a dedicated mode [0].
> In this new mode, we could implement BPF-based THP adjustment for
> fine-grained control over tasks or cgroups. If no BPF program is
> attached, THP remains in "never" mode. This solution elegantly meets our
> needs while avoiding the complexity of managing BPF alongside other THP
> modes.
>
> A selftest example demonstrates how to enable THP for the current task
> while keeping it disabled for others.
>
> Alternative Proposals
> ---------------------
>
> - Gutierrez's cgroup-based approach [1]
>   - Proposed adding a new cgroup file to control THP policy.
>   - However, as Johannes noted, cgroups are designed for hierarchical
>     resource allocation, not arbitrary policy settings [2].
>
> - Usama's per-task THP proposal based on prctl() [3]
>   - Enabling THP per task via prctl().
>   - As David pointed out, neither madvise() nor prctl() works in "never"
>     mode [4], making this solution insufficient for our needs.
>
> Conclusion
> ----------
>
> Introducing a new "bpf" mode for BPF-based per-task THP adjustments is
> the most effective solution for our requirements. This approach
> represents a small but meaningful step toward making THP truly usable,
> and manageable, in production environments.
>
> This is currently a PoC implementation. Feedback of any kind is welcome.
>
> Link: https://lore.kernel.org/linux-mm/20250509164654.GA608090@cmpxchg.org/ [0]
> Link: https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com/ [1]
> Link: https://lore.kernel.org/linux-mm/20250430175954.GD2020@cmpxchg.org/ [2]
> Link: https://lore.kernel.org/linux-mm/20250519223307.3601786-1-usamaarif642@gmail.com/ [3]
> Link: https://lore.kernel.org/linux-mm/41e60fa0-2943-4b3f-ba92-9f02838c881b@redhat.com/ [4]
>
> RFC v1->v2:
> The main changes are as follows:
> - Use struct_ops instead of fmod_ret (Alexei)
> - Introduce a new THP mode (Johannes)
> - Introduce new helpers for BPF hook (Zi)
> - Refine the commit log
>
> RFC v1: https://lwn.net/Articles/1019290/
>
> Yafang Shao (5):
>   mm: thp: Add a new mode "bpf"
>   mm: thp: Add hook for BPF based THP adjustment
>   mm: thp: add struct ops for BPF based THP adjustment
>   bpf: Add get_current_comm to bpf_base_func_proto
>   selftests/bpf: Add selftest for THP adjustment
>
>  include/linux/huge_mm.h                       |  15 +-
>  kernel/bpf/cgroup.c                           |   2 -
>  kernel/bpf/helpers.c                          |   2 +
>  mm/Makefile                                   |   3 +
>  mm/bpf_thp.c                                  | 120 ++++++++++++
>  mm/huge_memory.c                              |  65 ++++++-
>  mm/khugepaged.c                               |   3 +
>  tools/testing/selftests/bpf/config            |   1 +
>  .../selftests/bpf/prog_tests/thp_adjust.c     | 175 ++++++++++++++++++
>  .../selftests/bpf/progs/test_thp_adjust.c     |  39 ++++
>  10 files changed, 414 insertions(+), 11 deletions(-)
>  create mode 100644 mm/bpf_thp.c
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c
>
> --
> 2.43.5
>

Hi all,

Let's summarize the current state of the discussion and identify how to
move forward.

- Global-Only Control is Not Viable

We all seem to agree that a global-only control for THP is unwise. In
practice, some workloads benefit from THP while others do not, so a
one-size-fits-all approach doesn't work.

- Should We Use "Always" or "Madvise"?

I suspect no one would choose 'always' in its current state. ;)

Both Lorenzo and David propose relying on the madvise mode. However,
since madvise is an unprivileged userspace mechanism, any user can
freely adjust their own THP policy. This makes fine-grained control by
administrators impossible without breaking userspace compatibility, an
undesirable tradeoff.

Given these limitations, the community should consider introducing a new
"admin" mode for privileged THP policy management.

- Can the Kernel Automatically Manage THP Without User Input?

In practice, users define their own success metrics, such as latency
(RT), queries per second (QPS), or throughput, to evaluate a feature's
usefulness. If a feature fails to improve these metrics, it provides no
practical value.

Currently, the kernel lacks visibility into user-defined metrics, making
fully automated optimization impossible (at least without user input).
More importantly, automatic management offers no benefit if it doesn't
align with user needs.

Exception: for kernel-enforced changes (e.g., the page-to-folio
transition), users must adapt regardless. But THP tuning requires
flexibility; forcing automation without measurable gains is
counterproductive.

(Please correct me if I've overlooked anything.)

-- 
Regards
Yafang