Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment

From: Zi Yan <ziy@nvidia.com>
To: Yafang Shao <laoar.shao@gmail.com>
Cc: akpm@linux-foundation.org, david@redhat.com,
	baolin.wang@linux.alibaba.com, lorenzo.stoakes@oracle.com,
	Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com,
	dev.jain@arm.com, hannes@cmpxchg.org, usamaarif642@gmail.com,
	gutierrez.asier@huawei-partners.com, willy@infradead.org,
	ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org,
	bpf@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
Date: Mon, 26 May 2025 10:32:18 -0400
Message-ID: <7570019E-1FF1-47E0-82CD-D28378EBD8B6@nvidia.com>
In-Reply-To: <CALOAHbDPF+Mxqwh+5ScQFCyEdiz1ghNbgxJKAqmBRDeAZfe3sA@mail.gmail.com>

On 24 May 2025, at 23:01, Yafang Shao wrote:

> On Tue, May 20, 2025 at 2:05 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>>
>> Background
>> ----------
>>
>> At my current employer, PDD, we have consistently configured THP to "never"
>> on our production servers due to past incidents caused by its behavior:
>>
>> - Increased memory consumption
>>   THP significantly raises overall memory usage.
>>
>> - Latency spikes
>>   Random latency spikes occur due to more frequent memory compaction
>>   activity triggered by THP.
>>
>> These issues have made sysadmins hesitant to switch to "madvise" or
>> "always" modes.
>>
>> New Motivation
>> --------------
>>
>> We have now identified that certain AI workloads achieve substantial
>> performance gains with THP enabled. However, we’ve also verified that some
>> workloads see little to no benefit—or are even negatively impacted—by THP.
>>
>> In our Kubernetes environment, we deploy mixed workloads on a single server
>> to maximize resource utilization. Our goal is to selectively enable THP for
>> services that benefit from it while keeping it disabled for others. This
>> approach allows us to incrementally enable THP for additional services and
>> assess how to make it more viable in production.
>>
>> Proposed Solution
>> -----------------
>>
>> For this use case, Johannes suggested introducing a dedicated mode [0]. In
>> this new mode, we could implement BPF-based THP adjustment for fine-grained
>> control over tasks or cgroups. If no BPF program is attached, THP remains
>> in "never" mode. This solution elegantly meets our needs while avoiding the
>> complexity of managing BPF alongside other THP modes.
>>
>> A selftest example demonstrates how to enable THP for the current task
>> while keeping it disabled for others.
>>
>> Alternative Proposals
>> ---------------------
>>
>> - Gutierrez’s cgroup-based approach [1]
>>   - Proposed adding a new cgroup file to control THP policy.
>>   - However, as Johannes noted, cgroups are designed for hierarchical
>>     resource allocation, not arbitrary policy settings [2].
>>
>> - Usama’s per-task THP proposal based on prctl() [3]:
>>   - Enabling THP per task via prctl().
>>   - As David pointed out, neither madvise() nor prctl() works in "never"
>>     mode [4], making this solution insufficient for our needs.
>>
>> Conclusion
>> ----------
>>
>> Introducing a new "bpf" mode for BPF-based per-task THP adjustments is the
>> most effective solution for our requirements. This approach represents a
>> small but meaningful step toward making THP truly usable—and manageable—in
>> production environments.
>>
>> This is currently a PoC implementation. Feedback of any kind is welcome.
>>
>> Link: https://lore.kernel.org/linux-mm/20250509164654.GA608090@cmpxchg.org/ [0]
>> Link: https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com/ [1]
>> Link: https://lore.kernel.org/linux-mm/20250430175954.GD2020@cmpxchg.org/ [2]
>> Link: https://lore.kernel.org/linux-mm/20250519223307.3601786-1-usamaarif642@gmail.com/ [3]
>> Link: https://lore.kernel.org/linux-mm/41e60fa0-2943-4b3f-ba92-9f02838c881b@redhat.com/ [4]
>>
>> RFC v1->v2:
>> The main changes are as follows,
>> - Use struct_ops instead of fmod_ret (Alexei)
>> - Introduce a new THP mode (Johannes)
>> - Introduce new helpers for BPF hook (Zi)
>> - Refine the commit log
>>
>> RFC v1: https://lwn.net/Articles/1019290/
>>
>> Yafang Shao (5):
>>   mm: thp: Add a new mode "bpf"
>>   mm: thp: Add hook for BPF based THP adjustment
>>   mm: thp: add struct ops for BPF based THP adjustment
>>   bpf: Add get_current_comm to bpf_base_func_proto
>>   selftests/bpf: Add selftest for THP adjustment
>>
>>  include/linux/huge_mm.h                       |  15 +-
>>  kernel/bpf/cgroup.c                           |   2 -
>>  kernel/bpf/helpers.c                          |   2 +
>>  mm/Makefile                                   |   3 +
>>  mm/bpf_thp.c                                  | 120 ++++++++++++
>>  mm/huge_memory.c                              |  65 ++++++-
>>  mm/khugepaged.c                               |   3 +
>>  tools/testing/selftests/bpf/config            |   1 +
>>  .../selftests/bpf/prog_tests/thp_adjust.c     | 175 ++++++++++++++++++
>>  .../selftests/bpf/progs/test_thp_adjust.c     |  39 ++++
>>  10 files changed, 414 insertions(+), 11 deletions(-)
>>  create mode 100644 mm/bpf_thp.c
>>  create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
>>  create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c
>>
>> --
>> 2.43.5
>>
>
> Hi all,
>
> Let’s summarize the current state of the discussion and identify how
> to move forward.
>
> - Global-Only Control is Not Viable
> We all seem to agree that a global-only control for THP is unwise. In
> practice, some workloads benefit from THP while others do not, so a
> one-size-fits-all approach doesn’t work.
>
> - Should We Use "Always" or "Madvise"?
> I suspect no one would choose 'always' in its current state. ;)
> Both Lorenzo and David propose relying on the madvise mode. However,
> since madvise is an unprivileged userspace mechanism, any user can
> freely adjust their THP policy. This makes fine-grained control
> impossible without breaking userspace compatibility—an undesirable
> tradeoff.
> Given these limitations, the community should consider introducing a
> new "admin" mode for privileged THP policy management.
>

I agree with the above two points.

> - Can the Kernel Automatically Manage THP Without User Input?
> In practice, users define their own success metrics—such as latency
> (RT), queries per second (QPS), or throughput—to evaluate a feature’s
> usefulness. If a feature fails to improve these metrics, it provides
> no practical value.
> Currently, the kernel lacks visibility into user-defined metrics,
> making fully automated optimization impossible (at least without user
> input). More importantly, automatic management offers no benefit if it
> doesn’t align with user needs.

Yes, the kernel is basically guessing what userspace wants from hints
like MADV_HUGEPAGE/MADV_NOHUGEPAGE. But the kernel has a global view
of memory fragmentation, which userspace cannot easily get. I wonder
whether userspace tuning might benefit one set of applications while
hurting others or overall performance. Right now, THP tuning is
binary: an application either wants THPs or it does not. We might
need a way of ranking THP requests from userspace so that the kernel
can prioritize them. (I am not sure we can do this by adding another
user input parameter, like THP_nice, since apparently everyone will
set THP_nice to -100 to put themselves at the top of the list.)
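
For reference, the yes/no hint mentioned above looks roughly like this
from userspace. This is only a minimal sketch: map_with_thp_hint() is a
hypothetical helper name, not something from the patch series, and the
hint is treated as best-effort because, as noted earlier in the thread,
madvise() has no effect when THP is set to "never".

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

/*
 * Hypothetical helper (illustration only): map an anonymous region and
 * tell the kernel whether we would like it backed by THPs.
 * Returns the mapping, or NULL if mmap() fails.
 */
static void *map_with_thp_hint(size_t len, int want_thp)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return NULL;

	/*
	 * The hint is a single bit: MADV_HUGEPAGE or MADV_NOHUGEPAGE.
	 * There is no way to express "how much" the range wants THPs.
	 * A failure here (e.g. a kernel built without THP) is reported
	 * but not treated as fatal; the mapping is still usable.
	 */
	if (madvise(p, len, want_thp ? MADV_HUGEPAGE : MADV_NOHUGEPAGE) != 0)
		perror("madvise");

	return p;
}
```

A caller would use the returned region like any other anonymous
mapping and release it with munmap(); whether huge pages actually back
it is still up to the kernel and the system-wide THP mode.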

> Exception: For kernel-enforced changes (e.g., the page-to-folio
> transition), users must adapt regardless. But THP tuning requires
> flexibility—forcing automation without measurable gains is
> counterproductive.
> (Please correct me if I’ve overlooked anything.)
>
> -- 
> Regards
> Yafang


--
Best Regards,
Yan, Zi
