linux-mm.kvack.org archive mirror
From: Yafang Shao <laoar.shao@gmail.com>
To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: David Hildenbrand <david@redhat.com>,
	akpm@linux-foundation.org, ziy@nvidia.com,
	 baolin.wang@linux.alibaba.com, Liam.Howlett@oracle.com,
	npache@redhat.com,  ryan.roberts@arm.com, dev.jain@arm.com,
	hannes@cmpxchg.org,  usamaarif642@gmail.com,
	gutierrez.asier@huawei-partners.com,  willy@infradead.org,
	ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org,
	 bpf@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
Date: Wed, 21 May 2025 11:52:51 +0800
Message-ID: <CALOAHbCFTSKr4yvGKhjK9tA0peBNusFpJ=NoT4tnCzEe2p-oEw@mail.gmail.com>
In-Reply-To: <849decad-ab38-4a1a-8532-f518a108d8c6@lucifer.local>

On Tue, May 20, 2025 at 9:45 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Tue, May 20, 2025 at 08:06:21PM +0800, Yafang Shao wrote:
> > On Tue, May 20, 2025 at 5:49 PM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > On Tue, May 20, 2025 at 11:43:11AM +0200, David Hildenbrand wrote:
> > > > > Conclusion
> > > > > ----------
> > > > >
> > > > > Introducing a new "bpf" mode for BPF-based per-task THP adjustments is the
> > > > > most effective solution for our requirements. This approach represents a
> > > > > small but meaningful step toward making THP truly usable—and manageable—in
> > > > > production environments.
> > > > A new "bpf" mode sounds way too special.
> > > >
> > > > We currently have:
> > > >
> > > > never -> never
> > > > madvise -> MADV_HUGEPAGE, except PR_SET_THP_DISABLE
> > > > always -> always, except PR_SET_THP_DISABLE and MADV_NOHUGEPAGE
> > > >
> > > > Whatever new mode we add, it should honor PR_SET_THP_DISABLE +
> > > > MADV_NOHUGEPAGE.
> > > >
> > > > So, if we want another way to enable things, it would live between "never"
> > > > and "madvise".
> > > >
> > > > I'm wondering how we could make that generic: likely we want this new
> > > > mechanism to *not* be triggerable by the process itself (madvise).
> > > >
> > > > I am not convinced bpf is the answer here ...
> > >
> > > Agreed.
> > >
> > > I am also very concerned with us inserting BPF bits here - are we not then
> > > ensuring that we cannot in any way move towards a future where we
> > > 'automagically' determine what to do?
> > >
> > > I don't know what is claimed about BPF, but it strikes me that we're
> > > establishing a permanent uABI (uAPI?) if we do that and essentially
> > > promising that THP will continue to operate in a fashion similar to how it
> > > does now.
> > >
> > > While BPF is a wonderful technology, I think we have to be very very careful
> > > about inserting it in places that consist of -implementation details- that
> > > we in mm already are planning to move away from.
> > >
> > > It's one thing adding BPF in the oomk (simple interface, unlikely to
> > > change, doesn't really constrain us) or the scheduler (again the hooks are
> > > by nature reasonably stable), it's quite another sticking it in the heart
> > > of a part of mm that is undergoing _constant_ change, partly as evidenced
> > > by the sheer number of series related to THP that are currently on-list.
> > >
> > > So while BPF may be the best solution for your needs _right now_, we need
> > > to be concerned with how things affect the kernel in the future.
> > >
> > > I think we really do have to tread very carefully here.
> >
> > I totally agree with you that the key point here is how to define the
> > API. As I replied to David, I believe we have two fundamental
> > principles to adjust the THP policies:
> > 1. Selective Benefit: Some tasks benefit from THP, while others do not.
> > 2. Conditional Safety: THP allocation is safe under certain conditions
> > but not others.
> >
> > Therefore, I believe we can define these APIs based on the established
> > principles - everything else constitutes implementation details, even
> > if core MM internals need to change.
>
> But if we're looking to make the concept of THP go away, we really need to
> go further than this.
>
> The second we have 'bpf program that figures out whether THP should be
> used' we are permanently tied to the idea of THP on/off being a thing.
>
> I mean any future stuff that makes THP more automagic will probably involve
> having new modes for the legacy THP
> /sys/kernel/mm/transparent_hugepage/enabled and
> /sys/kernel/mm/transparent_hugepage/hugepages-xxkB/enabled
>
> But if people are super reliant on this stuff it's potentially really
> limiting.
>
> I think you said in another post here that you were toying with the notion
> of exposing somehow the madvise() interface and having that be the 'stable
> API' of sorts?

Yes, I have a BPF program that hooks into madvise() to selectively
enforce THP policies—allowing it for certain tasks while blocking it
for others. However, this violates the semantic guarantee of
madvise(). For instance, if a user sees THP configured in madvise
mode, they’d expect madvise() to reliably enable it. But with this BPF
logic, such calls might silently fail, creating inconsistency. This is
why we propose introducing a dedicated BPF-controlled mode, or
alternatively extending the semantics of the existing "never" mode.
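
A minimal sketch of that inconsistency, written as a plain Python model
rather than kernel or BPF code. The names Mode, Vma, thp_allowed and the
bpf_allows parameter are invented for illustration only; the rules are
meant to follow the never/madvise/always table David quoted above, and
the real kernel checks are more involved.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    """Global setting, as in /sys/kernel/mm/transparent_hugepage/enabled."""
    NEVER = "never"
    MADVISE = "madvise"
    ALWAYS = "always"


@dataclass
class Vma:
    """Hypothetical stand-in for the per-region madvise() state."""
    madv_hugepage: bool = False    # region was marked with MADV_HUGEPAGE
    madv_nohugepage: bool = False  # region was marked with MADV_NOHUGEPAGE


def thp_allowed(mode, vma, thp_disabled=False, bpf_allows=None):
    """Return True if this model would permit THP for the region.

    thp_disabled models PR_SET_THP_DISABLE on the process.
    bpf_allows models a hypothetical per-task BPF verdict (None = no hook).
    """
    # "never" wins outright; PR_SET_THP_DISABLE and MADV_NOHUGEPAGE are
    # honored in every other mode.
    if mode is Mode.NEVER or thp_disabled or vma.madv_nohugepage:
        return False
    # In "madvise" mode only regions marked MADV_HUGEPAGE qualify.
    if mode is Mode.MADVISE and not vma.madv_hugepage:
        return False
    # Assumed hook: a BPF veto applies even though madvise() succeeded.
    if bpf_allows is not None and not bpf_allows:
        return False
    return True


vma = Vma(madv_hugepage=True)
print(thp_allowed(Mode.MADVISE, vma))                    # True
print(thp_allowed(Mode.MADVISE, vma, bpf_allows=False))  # False
```

The second call is the case described above: the mode reads "madvise" and
the region carries MADV_HUGEPAGE, yet the task still does not get THP.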

-- 
Regards
Yafang

