linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Usama Arif <usamaarif642@gmail.com>
To: Yafang Shao <laoar.shao@gmail.com>
Cc: akpm@linux-foundation.org, david@redhat.com, ziy@nvidia.com,
	baolin.wang@linux.alibaba.com, lorenzo.stoakes@oracle.com,
	Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com,
	dev.jain@arm.com, hannes@cmpxchg.org,
	gutierrez.asier@huawei-partners.com, willy@infradead.org,
	ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org,
	ameryhung@gmail.com, rientjes@google.com, bpf@vger.kernel.org,
	linux-mm@kvack.org
Subject: Re: [RFC PATCH v5 mm-new 1/5] mm: thp: add support for BPF based THP order selection
Date: Tue, 19 Aug 2025 11:11:05 +0100	[thread overview]
Message-ID: <c855bb8b-4395-4235-8786-3a271252c455@gmail.com> (raw)
In-Reply-To: <CALOAHbAzQmqxcJ7HU+gsdX4+iEj0U=E5mS8nXF8OYW9QxuXSLA@mail.gmail.com>



On 19/08/2025 04:08, Yafang Shao wrote:
>> Hi Yafang,
>>
>> From the coverletter, one of the potential usecases you are trying to solve for is if global policy
>> is "never", but the workload want THPs (either always or on madvise basis). But over here,
>> MMF_VM_HUGEPAGE will never be set so in that case mm_flags_test(MMF_VM_HUGEPAGE, oldmm) will
>> always evaluate to false and the get_sugested_order call doesnt matter?
> 
> See the replyment in another thread.
> 
>>
>>
>>
>>>               __khugepaged_enter(mm);
>>>  }
>>>
>>> diff --git a/mm/Kconfig b/mm/Kconfig
>>> index 4108bcd96784..d10089e3f181 100644
>>> --- a/mm/Kconfig
>>> +++ b/mm/Kconfig
>>> @@ -924,6 +924,18 @@ config NO_PAGE_MAPCOUNT
>>>
>>>         EXPERIMENTAL because the impact of some changes is still unclear.
>>>
>>> +config EXPERIMENTAL_BPF_ORDER_SELECTION
>>> +     bool "BPF-based THP order selection (EXPERIMENTAL)"
>>> +     depends on TRANSPARENT_HUGEPAGE && BPF_SYSCALL
>>> +
>>> +     help
>>> +       Enable dynamic THP order selection using BPF programs. This
>>> +       experimental feature allows custom BPF logic to determine optimal
>>> +       transparent hugepage allocation sizes at runtime.
>>> +
>>> +       Warning: This feature is unstable and may change in future kernel
>>> +       versions.
>>> +
>>
>>
>> I know there was a discussion on this earlier, but my opinion is that putting all of this
>> as experiment with warnings is not great. No one will be able to deploy this in production
>> if its going to be removed, and I believe thats where the real usage is.
> 
> See the replyment in another thread.
> 
>>
>>>  endif # TRANSPARENT_HUGEPAGE
>>>
>>>  # simple helper to make the code a bit easier to read
>>> diff --git a/mm/Makefile b/mm/Makefile
>>> index ef54aa615d9d..cb55d1509be1 100644
>>> --- a/mm/Makefile
>>> +++ b/mm/Makefile
>>> @@ -99,6 +99,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
>>>  obj-$(CONFIG_NUMA) += memory-tiers.o
>>>  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
>>>  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
>>> +obj-$(CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION) += bpf_thp.o
>>>  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
>>>  obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
>>>  obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
>>> diff --git a/mm/bpf_thp.c b/mm/bpf_thp.c
>>> new file mode 100644
>>> index 000000000000..2b03539452d1
>>> --- /dev/null
>>> +++ b/mm/bpf_thp.c
>>> @@ -0,0 +1,186 @@
>>> +// SPDX-License-Identifier: GPL-2.0
>>> +
>>> +#include <linux/bpf.h>
>>> +#include <linux/btf.h>
>>> +#include <linux/huge_mm.h>
>>> +#include <linux/khugepaged.h>
>>> +
>>> +struct bpf_thp_ops {
>>> +     /**
>>> +      * @get_suggested_order: Get the suggested THP orders for allocation
>>> +      * @mm: mm_struct associated with the THP allocation
>>> +      * @vma__nullable: vm_area_struct associated with the THP allocation (may be NULL)
>>> +      *                 When NULL, the decision should be based on @mm (i.e., when
>>> +      *                 triggered from an mm-scope hook rather than a VMA-specific
>>> +      *                 context).
>>> +      *                 Must belong to @mm (guaranteed by the caller).
>>> +      * @vma_flags: use these vm_flags instead of @vma->vm_flags (0 if @vma is NULL)
>>> +      * @tva_flags: TVA flags for current @vma (-1 if @vma is NULL)
>>> +      * @orders: Bitmask of requested THP orders for this allocation
>>> +      *          - PMD-mapped allocation if PMD_ORDER is set
>>> +      *          - mTHP allocation otherwise
>>> +      *
>>> +      * Rerurn: Bitmask of suggested THP orders for allocation. The highest
>>> +      *         suggested order will not exceed the highest requested order
>>> +      *         in @orders.
>>
>> If we want to make this generic enough so that it doesnt change, should we allow suggested order to
>> exceed highest requested order?
> 
> The maximum requested order is determined by the callsite. For example:
> - PMD-mapped THP uses PMD_ORDER
> - mTHP uses (PMD_ORDER - 1)
> 
> We must respect this upper bound to avoid undefined behavior.

Ack, makes sense.
> 
>>
>>> +      */
>>> +     int (*get_suggested_order)(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
>>> +                                u64 vma_flags, enum tva_type tva_flags, int orders) __rcu;
>>> +};
>>> +
>>> +static struct bpf_thp_ops bpf_thp;
>>> +static DEFINE_SPINLOCK(thp_ops_lock);
>>> +
>>> +int get_suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
>>> +                     u64 vma_flags, enum tva_type tva_flags, int orders)
>>> +{
>>> +     int (*bpf_suggested_order)(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
>>> +                                u64 vma_flags, enum tva_type tva_flags, int orders);
>>> +     int suggested_orders = orders;
>>> +
>>> +     /* No BPF program is attached */
>>> +     if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
>>> +                   &transparent_hugepage_flags))
>>> +             return suggested_orders;
>>> +
>>> +     rcu_read_lock();
>>> +     bpf_suggested_order = rcu_dereference(bpf_thp.get_suggested_order);
>>> +     if (!bpf_suggested_order)
>>> +             goto out;
>>
>>
>> My rcu API knowledge is not the best, but maybe we could do:
>>
>> if (!rcu_access_pointer(bpf_thp.get_suggested_order))
>>         return suggested_orders;
>>
> 
> There might be a race here.  The current rcu_access_pointer() check
> occurs outside the RCU read-side critical section, meaning the
> protected pointer could be freed between the check and use.
> Therefore, we must perform the NULL check within the RCU read critical
> section when dereferencing the pointer:
> 

Ack, makes sense.



  reply	other threads:[~2025-08-19 10:11 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-08-18  5:55 [RFC PATCH v5 mm-new 0/5] mm, bpf: " Yafang Shao
2025-08-18  5:55 ` [RFC PATCH v5 mm-new 1/5] mm: thp: add support for " Yafang Shao
2025-08-18 13:17   ` Usama Arif
2025-08-19  3:08     ` Yafang Shao
2025-08-19 10:11       ` Usama Arif [this message]
2025-08-19 11:10     ` Gutierrez Asier
2025-08-19 11:43       ` Yafang Shao
2025-08-18  5:55 ` [RFC PATCH v5 mm-new 2/5] mm: thp: add a new kfunc bpf_mm_get_mem_cgroup() Yafang Shao
2025-08-18  5:55 ` [RFC PATCH v5 mm-new 3/5] mm: thp: add a new kfunc bpf_mm_get_task() Yafang Shao
2025-08-18  5:55 ` [RFC PATCH v5 mm-new 4/5] bpf: mark vma->vm_mm as trusted Yafang Shao
2025-08-18  5:55 ` [RFC PATCH v5 mm-new 5/5] selftest/bpf: add selftest for BPF based THP order seletection Yafang Shao
2025-08-18 14:00   ` Usama Arif
2025-08-19  3:09     ` Yafang Shao
2025-08-18 14:35 ` [RFC PATCH v5 mm-new 0/5] mm, bpf: BPF based THP order selection Usama Arif
2025-08-19  2:41   ` Yafang Shao
2025-08-19 10:44     ` Usama Arif
2025-08-19 11:33       ` Yafang Shao

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=c855bb8b-4395-4235-8786-3a271252c455@gmail.com \
    --to=usamaarif642@gmail.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=ameryhung@gmail.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=david@redhat.com \
    --cc=dev.jain@arm.com \
    --cc=gutierrez.asier@huawei-partners.com \
    --cc=hannes@cmpxchg.org \
    --cc=laoar.shao@gmail.com \
    --cc=linux-mm@kvack.org \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=npache@redhat.com \
    --cc=rientjes@google.com \
    --cc=ryan.roberts@arm.com \
    --cc=willy@infradead.org \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox