Re: [RFC PATCH v4 1/4] mm: thp: add support for BPF based THP order selection

From: Yafang Shao <laoar.shao@gmail.com>
To: Zi Yan <ziy@nvidia.com>
Cc: akpm@linux-foundation.org, david@redhat.com,
	baolin.wang@linux.alibaba.com,  lorenzo.stoakes@oracle.com,
	Liam.Howlett@oracle.com, npache@redhat.com,
	 ryan.roberts@arm.com, dev.jain@arm.com, hannes@cmpxchg.org,
	 usamaarif642@gmail.com, gutierrez.asier@huawei-partners.com,
	 willy@infradead.org, ast@kernel.org, daniel@iogearbox.net,
	andrii@kernel.org,  ameryhung@gmail.com, bpf@vger.kernel.org,
	linux-mm@kvack.org
Subject: Re: [RFC PATCH v4 1/4] mm: thp: add support for BPF based THP order selection
Date: Wed, 30 Jul 2025 10:36:43 +0800
Message-ID: <CALOAHbDKHqnyz0w0fKtdCgA3ScQ2qXG7QwZUDRGQjjTb1UNTRw@mail.gmail.com>
In-Reply-To: <F204238B-5B11-41DC-AF9B-4D2AC11ADF5E@nvidia.com>

On Tue, Jul 29, 2025 at 11:32 PM Zi Yan <ziy@nvidia.com> wrote:
>
> On 29 Jul 2025, at 5:18, Yafang Shao wrote:
>
> > This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic
> > THP tuning. It includes a hook get_suggested_order() [0], allowing BPF
> > programs to influence THP order selection based on factors such as:
> > - Workload identity
> >   For example, workloads running in specific containers or cgroups.
> > - Allocation context
> >   Whether the allocation occurs during a page fault, khugepaged, or other
> >   paths.
> > - System memory pressure
> >   (May require new BPF helpers to accurately assess memory pressure.)
> >
> > Key Details:
> > - Only one BPF program can be attached at a time, but it can be updated
> >   dynamically to adjust the policy.
> > - Supports automatic mTHP order selection and per-workload THP policies.
> > - Only functional when THP is set to madvise or always.
> >
> > Experimental Status:
> > - Requires CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION to enable. [1]
> > - This feature is unstable and may evolve in future kernel versions.
> >
> > Link: https://lwn.net/ml/all/9bc57721-5287-416c-aa30-46932d605f63@redhat.com/ [0]
> > Link: https://lwn.net/ml/all/dda67ea5-2943-497c-a8e5-d81f0733047d@lucifer.local/ [1]
> >
> > Suggested-by: David Hildenbrand <david@redhat.com>
> > Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > ---
> >  include/linux/huge_mm.h    |  13 +++
> >  include/linux/khugepaged.h |  12 ++-
> >  mm/Kconfig                 |  12 +++
> >  mm/Makefile                |   1 +
> >  mm/bpf_thp.c               | 172 +++++++++++++++++++++++++++++++++++++
> >  mm/huge_memory.c           |   9 ++
> >  mm/khugepaged.c            |  18 +++-
> >  mm/memory.c                |  14 ++-
> >  8 files changed, 244 insertions(+), 7 deletions(-)
> >  create mode 100644 mm/bpf_thp.c
> >
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index 2f190c90192d..5a1527b3b6f0 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -6,6 +6,8 @@
> >
> >  #include <linux/fs.h> /* only for vma_is_dax() */
> >  #include <linux/kobject.h>
> > +#include <linux/pgtable.h>
> > +#include <linux/mm.h>
> >
> >  vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
> >  int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > @@ -54,6 +56,7 @@ enum transparent_hugepage_flag {
> >       TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
> >       TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
> >       TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
> > +     TRANSPARENT_HUGEPAGE_BPF_ATTACHED,      /* BPF prog is attached */
> >  };
> >
> >  struct kobject;
> > @@ -190,6 +193,16 @@ static inline bool hugepage_global_always(void)
> >                       (1<<TRANSPARENT_HUGEPAGE_FLAG);
> >  }
> >
> > +#ifdef CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION
> > +int get_suggested_order(struct mm_struct *mm, unsigned long tva_flags, int order);
> > +#else
> > +static inline int
> > +get_suggested_order(struct mm_struct *mm, unsigned long tva_flags, int order)
> > +{
> > +     return order;
> > +}
> > +#endif
> > +
> >  static inline int highest_order(unsigned long orders)
> >  {
> >       return fls_long(orders) - 1;
> > diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> > index b8d69cfbb58b..e0242968a020 100644
> > --- a/include/linux/khugepaged.h
> > +++ b/include/linux/khugepaged.h
> > @@ -2,6 +2,8 @@
> >  #ifndef _LINUX_KHUGEPAGED_H
> >  #define _LINUX_KHUGEPAGED_H
> >
> > +#include <linux/huge_mm.h>
> > +
> >  extern unsigned int khugepaged_max_ptes_none __read_mostly;
> >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >  extern struct attribute_group khugepaged_attr_group;
> > @@ -20,7 +22,15 @@ extern int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> >
> >  static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
> >  {
> > -     if (test_bit(MMF_VM_HUGEPAGE, &oldmm->flags))
> > +     /*
> > +      * THP allocation policy can be dynamically modified via BPF. If a
> > +      * long-lived task was previously allowed to allocate THP but is no
> > +      * longer permitted under the new policy, we must ensure its forked
> > +      * child processes also inherit this restriction.
>
> The comment would probably read better as:
>
> THP allocation policy can be dynamically modified via BPF. Even if a task
> was allowed to allocate THPs, BPF can decide whether its forked child
> can allocate THPs.
>
> The MMF_VM_HUGEPAGE flag will be cleared by khugepaged.
>
> The code here only changes the forked child’s mm flag. It has nothing to
> do with the parent’s THP policy.

Thanks for the improvement. I will change it.

>
> > +      * The MMF_VM_HUGEPAGE flag will be cleared by khugepaged.
> > +      */
> > +     if (test_bit(MMF_VM_HUGEPAGE, &oldmm->flags) &&
> > +         get_suggested_order(mm, 0, PMD_ORDER) == PMD_ORDER)
>
> Will it work for mTHPs? Nico is adding mTHP support for khugepaged[1].
> What if a BPF program wants khugepaged to work on some mTHP orders?
>
> Maybe get_suggested_order() should accept a bitmask of all allowed
> orders and return a bitmask as well. khugepaged would then be skipped
> only if the returned bitmask is 0.
>
> [1] https://lore.kernel.org/linux-mm/20250714003207.113275-1-npache@redhat.com/

Thanks for the information.
It seems extending this to use a bitmask would better accommodate
future changes.
I’ll give it some thought.
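
For discussion only, the bitmask idea above could be modelled in plain
userspace C roughly as follows. This is a sketch under assumptions, not
the interface the next version will use: get_suggested_orders(),
thp_order_policy_t, should_enter_khugepaged() and the two example
policies are hypothetical names, PMD_ORDER is assumed to be 9 (x86-64
with 4 KiB base pages), and masking the result with the input mask (so a
policy can only narrow the set) is a design choice of the sketch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumption: PMD_ORDER is 9, as on x86-64 with 4 KiB base pages. */
#define PMD_ORDER 9

/* Hypothetical policy callback: receives the allowed orders as a bitmask
 * and returns the orders it wants to keep. */
typedef unsigned long (*thp_order_policy_t)(unsigned long allowed_orders);

/* Hypothetical bitmask variant of get_suggested_order(). */
static unsigned long get_suggested_orders(thp_order_policy_t policy,
					  unsigned long allowed_orders)
{
	if (!policy)
		return allowed_orders;	/* nothing attached: keep the input */

	/* Sketch-level choice: a policy may narrow the set, never widen it. */
	return policy(allowed_orders) & allowed_orders;
}

/* khugepaged is skipped only when no order survives the policy. */
static bool should_enter_khugepaged(thp_order_policy_t policy,
				    unsigned long allowed_orders)
{
	return get_suggested_orders(policy, allowed_orders) != 0;
}

/* Example policy: reject every order. */
static unsigned long policy_deny_all(unsigned long allowed)
{
	(void)allowed;
	return 0;
}

/* Example policy: keep mTHP orders but drop the PMD order. */
static unsigned long policy_mthp_only(unsigned long allowed)
{
	return allowed & ~(1UL << PMD_ORDER);
}
```

With this shape, a policy that returns only some mTHP orders would still
let khugepaged run, while a policy that returns 0 would keep the mm out
of khugepaged entirely.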

>
> >               __khugepaged_enter(mm);
> >  }
> >
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 781be3240e21..5d05a537ecde 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -908,6 +908,18 @@ config NO_PAGE_MAPCOUNT
> >
> >         EXPERIMENTAL because the impact of some changes is still unclear.
> >
> > +config EXPERIMENTAL_BPF_ORDER_SELECTION
> > +     bool "BPF-based THP order selection (EXPERIMENTAL)"
> > +     depends on TRANSPARENT_HUGEPAGE && BPF_SYSCALL
> > +
> > +     help
> > +       Enable dynamic THP order selection using BPF programs. This
> > +       experimental feature allows custom BPF logic to determine optimal
> > +       transparent hugepage allocation sizes at runtime.
> > +
> > +       Warning: This feature is unstable and may change in future kernel
> > +       versions.
> > +
> >  endif # TRANSPARENT_HUGEPAGE
> >
> >  # simple helper to make the code a bit easier to read
> > diff --git a/mm/Makefile b/mm/Makefile
> > index 1a7a11d4933d..562525e6a28a 100644
> > --- a/mm/Makefile
> > +++ b/mm/Makefile
> > @@ -99,6 +99,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
> >  obj-$(CONFIG_NUMA) += memory-tiers.o
> >  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
> >  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
> > +obj-$(CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION) += bpf_thp.o
> >  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
> >  obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
> >  obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
> > diff --git a/mm/bpf_thp.c b/mm/bpf_thp.c
> > new file mode 100644
> > index 000000000000..10b486dd8bc4
> > --- /dev/null
> > +++ b/mm/bpf_thp.c
> > @@ -0,0 +1,172 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +#include <linux/bpf.h>
> > +#include <linux/btf.h>
> > +#include <linux/huge_mm.h>
> > +#include <linux/khugepaged.h>
> > +
> > +struct bpf_thp_ops {
> > +     /**
> > +      * @get_suggested_order: Get the suggested highest THP order for allocation
> > +      * @mm: mm_struct associated with the THP allocation
> > +      * @tva_flags: TVA flags for current context
> > +      *             %TVA_IN_PF: Set when in page fault context
> > +      *             Other flags: Reserved for future use
> > +      * @order: The highest order being considered for this THP allocation.
> > +      *         %PUD_ORDER for PUD-mapped allocations
>
> As I mentioned in the cover letter, PMD_ORDER is the highest order
> mm currently supports. I wonder whether @order would be better as a
> bitmask of orders, to better support mTHP.

I’ll look into it.
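
As a rough illustration of how an order bitmask relates to a single
"highest order", here is a userspace sketch. highest_order() mirrors the
helper in include/linux/huge_mm.h shown in the diff above; fls_long_model()
is a stand-in for the kernel's fls_long() built on the GCC/Clang builtin
__builtin_clzl(); clamp_to_pmd_order() is a hypothetical helper; and
PMD_ORDER = 9 is an assumption (x86-64 with 4 KiB base pages).

```c
#include <limits.h>

/* Assumption: PMD_ORDER is 9, as on x86-64 with 4 KiB base pages. */
#define PMD_ORDER 9

/* Userspace stand-in for fls_long(): 1-based index of the highest set
 * bit, or 0 when no bit is set. Relies on a GCC/Clang builtin. */
static int fls_long_model(unsigned long x)
{
	if (!x)
		return 0;
	return (int)(sizeof(unsigned long) * CHAR_BIT) - __builtin_clzl(x);
}

/* Same logic as highest_order() in include/linux/huge_mm.h. */
static int highest_order(unsigned long orders)
{
	return fls_long_model(orders) - 1;
}

/* Hypothetical helper: drop every order above PMD_ORDER from a mask. */
static unsigned long clamp_to_pmd_order(unsigned long orders)
{
	return orders & ((1UL << (PMD_ORDER + 1)) - 1);
}
```

With a bitmask, the single-order case is just a mask with one bit set,
and an empty mask (highest_order() returning -1) naturally expresses
"no THP at all".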

Regards
Yafang
