Re: [PATCH v9 mm-new 03/11] mm: thp: add support for BPF based THP order selection

From: Yafang Shao <laoar.shao@gmail.com>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	David Hildenbrand <david@redhat.com>,
	ziy@nvidia.com,  baolin.wang@linux.alibaba.com,
	Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	 Liam Howlett <Liam.Howlett@oracle.com>,
	npache@redhat.com, ryan.roberts@arm.com,  dev.jain@arm.com,
	Johannes Weiner <hannes@cmpxchg.org>,
	usamaarif642@gmail.com,  gutierrez.asier@huawei-partners.com,
	Matthew Wilcox <willy@infradead.org>,
	 Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	 Andrii Nakryiko <andrii@kernel.org>,
	Amery Hung <ameryhung@gmail.com>,
	 David Rientjes <rientjes@google.com>,
	Jonathan Corbet <corbet@lwn.net>,
	21cnbao@gmail.com,  Shakeel Butt <shakeel.butt@linux.dev>,
	Tejun Heo <tj@kernel.org>,
	lance.yang@linux.dev,  Randy Dunlap <rdunlap@infradead.org>,
	bpf <bpf@vger.kernel.org>,  linux-mm <linux-mm@kvack.org>,
	"open list:DOCUMENTATION" <linux-doc@vger.kernel.org>,
	 LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v9 mm-new 03/11] mm: thp: add support for BPF based THP order selection
Date: Tue, 7 Oct 2025 16:47:07 +0800
Message-ID: <CALOAHbATDURsi265PGQ7022vC9QsKUxxyiDUL9wLKGgVpaxJUw@mail.gmail.com>
In-Reply-To: <CAADnVQJtrJZOCWZKH498GBA8M0mYVztApk54mOEejs8Wr3nSiw@mail.gmail.com>

On Fri, Oct 3, 2025 at 10:18 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Sep 29, 2025 at 10:59 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > +unsigned long bpf_hook_thp_get_orders(struct vm_area_struct *vma,
> > +                                     enum tva_type type,
> > +                                     unsigned long orders)
> > +{
> > +       thp_order_fn_t *bpf_hook_thp_get_order;
> > +       int bpf_order;
> > +
> > +       /* No BPF program is attached */
> > +       if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
> > +                     &transparent_hugepage_flags))
> > +               return orders;
> > +
> > +       rcu_read_lock();
> > +       bpf_hook_thp_get_order = rcu_dereference(bpf_thp.thp_get_order);
> > +       if (WARN_ON_ONCE(!bpf_hook_thp_get_order))
> > +               goto out;
> > +
> > +       bpf_order = bpf_hook_thp_get_order(vma, type, orders);
> > +       orders &= BIT(bpf_order);
> > +
> > +out:
> > +       rcu_read_unlock();
> > +       return orders;
> > +}
>

Hello Alexei,

My apologies for the slow reply. I'm on a family vacation and am
checking email intermittently.

> I thought I explained it earlier.

I recall your earlier suggestion for a cgroup-based approach for
BPF-THP. However, as I mentioned, I believe cgroups might not be the
best fit[0]. My understanding was that we had agreed to move away from
that model. Could we realign on this?

[0].  https://lwn.net/ml/all/CALOAHbBvwT+6f_4gBHzPc9n_SukhAs_sa5yX=AjHYsWic1MRuw@mail.gmail.com/

> Nack to a single global prog approach.

The design of BPF-THP as a global program is a direct consequence of
its purpose: to extend the existing global
/sys/kernel/mm/transparent_hugepage/ interface. This architectural
consistency simplifies both understanding and maintenance.

Crucially, this global nature does not limit policy control. The
program is designed with the flexibility to enforce policies at
multiple levels (globally, per-cgroup, or per-task), enabling all of
our target use cases through a unified mechanism.

>
> The logic must accommodate multiple programs per-container
> or any other way from the beginning.
> If cgroup based scoping doesn't fit use per process tree scoping.

During the initial design of BPF-THP, I evaluated whether a global
program or a per-process program would be more suitable. A
per-process design would require embedding a struct_ops into
task_struct, which seemed like over-engineering to me. We can
efficiently implement both cgroup-tree-scoped and process-tree-scoped
THP policies using existing BPF kfuncs, such as:

  SCOPING            BPF kfuncs
  cgroup tree    ->  bpf_task_under_cgroup()
  process tree   ->  bpf_task_is_ancestors()

With these kfuncs, there is no need to attach individual BPF-THP
programs to every process or cgroup tree. I have not identified a
valid use case that requires embedding a struct_ops in task_struct
and that cannot be achieved more simply with these kfuncs. If such use
cases exist, please detail them. Consequently, I proceeded with a
global struct_ops implementation.
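
To illustrate the scoping argument, here is a minimal userspace sketch
in plain C, not BPF code. The types struct cgroup_node, struct task and
struct thp_policy, and the helpers task_under_cgroup(),
task_in_process_tree() and thp_get_order(), are hypothetical stand-ins
invented for this illustration. They only mimic the kind of ancestry
checks that the kfuncs listed above would perform inside a real
BPF-THP program.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; a real program would see kernel objects. */
struct cgroup_node {
	const struct cgroup_node *parent;	/* NULL for the root cgroup */
};

struct task {
	const struct cgroup_node *cgrp;		/* cgroup the task belongs to */
	const struct task *parent;		/* NULL for the first process */
};

/* Walk up the cgroup hierarchy looking for @ancestor. */
static bool task_under_cgroup(const struct task *t,
			      const struct cgroup_node *ancestor)
{
	for (const struct cgroup_node *c = t->cgrp; c; c = c->parent)
		if (c == ancestor)
			return true;
	return false;
}

/* Walk up the process tree; @root itself counts as part of its tree. */
static bool task_in_process_tree(const struct task *t,
				 const struct task *root)
{
	for (const struct task *p = t; p; p = p->parent)
		if (p == root)
			return true;
	return false;
}

/* One global policy: a cgroup-scoped rule, a tree-scoped rule, a default. */
struct thp_policy {
	const struct cgroup_node *scoped_cgroup;	/* may be NULL */
	int cgroup_order;
	const struct task *scoped_root;			/* may be NULL */
	int tree_order;
	int default_order;
};

/* Single entry point: the first matching scope decides the order. */
static int thp_get_order(const struct thp_policy *pol, const struct task *t)
{
	assert(pol && t);
	if (pol->scoped_cgroup && task_under_cgroup(t, pol->scoped_cgroup))
		return pol->cgroup_order;
	if (pol->scoped_root && task_in_process_tree(t, pol->scoped_root))
		return pol->tree_order;
	return pol->default_order;
}
```

In this sketch a task under the scoped cgroup gets cgroup_order, a task
in the scoped process tree gets tree_order, and everything else falls
back to default_order, all from one entry point.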

The desire to attach multiple BPF-THP programs simultaneously does not
appear to be a valid use case. Furthermore, our production experience
has shown that multiple attachments often introduce conflicts. This is
precisely why system administrators prefer to manage BPF programs with
a single manager: to avoid undefined behavior from competing programs.

Focusing specifically on BPF-THP, the semantics of the program make
multiple attachments unsuitable. A BPF-THP program's outcome is its
return value (a suggested THP order), not the side effects of its
execution. In other words, it is functionally a variant of fmod_ret.

If we allow multiple attachments and they return different values, how
do we resolve the conflict?

If one program returns order-9 and another returns order-1, which
value should be chosen? None of 1, 9, or a combination (1 & 9) is
appropriate. The only logical solution is to reject subsequent
attachments and explicitly notify the user of the conflict. Our goal
should be to prevent conflicts from the outset, rather than forcing
developers to create another userspace manager to handle them.
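
As a worked illustration of the conflict, the following userspace
sketch mirrors the "orders &= BIT(bpf_order)" step from the quoted hook.
ORDER_BIT(), apply_order() and apply_two() are hypothetical names used
only here, and chaining two programs this way is an assumption made for
the sake of the example; the patch itself attaches a single program.

```c
#include <assert.h>
#include <limits.h>

/* Userspace replacement for the kernel's BIT() macro. */
#define ORDER_BIT(order) (1UL << (order))

/* Keep only the suggested order from the mask of allowed orders. */
static unsigned long apply_order(unsigned long orders, int order)
{
	assert(order >= 0 &&
	       order < (int)(sizeof(unsigned long) * CHAR_BIT));
	return orders & ORDER_BIT(order);
}

/* Hypothetical chaining: each program narrows the mask in turn. */
static unsigned long apply_two(unsigned long orders, int first, int second)
{
	return apply_order(apply_order(orders, first), second);
}
```

With orders 1, 4 and 9 allowed, a single program suggesting order-9
leaves exactly ORDER_BIT(9). Chaining a second program that suggests
order-1 leaves an empty mask, so under this assumption neither
program's suggestion would be honored.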

A single global program is a natural and logical extension of the
existing global /sys/kernel/mm/transparent_hugepage/ interface. It is
a good fit for BPF-THP and avoids unnecessary complexity.

Please provide a detailed clarification if I have misunderstood your position.

-- 
Regards
Yafang
