From: Usama Arif <usamaarif642@gmail.com>
To: Yafang Shao <laoar.shao@gmail.com>
Cc: akpm@linux-foundation.org, david@redhat.com, ziy@nvidia.com,
baolin.wang@linux.alibaba.com, lorenzo.stoakes@oracle.com,
Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com,
dev.jain@arm.com, hannes@cmpxchg.org,
gutierrez.asier@huawei-partners.com, willy@infradead.org,
ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org,
ameryhung@gmail.com, rientjes@google.com, bpf@vger.kernel.org,
linux-mm@kvack.org
Subject: Re: [RFC PATCH v5 mm-new 1/5] mm: thp: add support for BPF based THP order selection
Date: Tue, 19 Aug 2025 11:11:05 +0100 [thread overview]
Message-ID: <c855bb8b-4395-4235-8786-3a271252c455@gmail.com> (raw)
In-Reply-To: <CALOAHbAzQmqxcJ7HU+gsdX4+iEj0U=E5mS8nXF8OYW9QxuXSLA@mail.gmail.com>
On 19/08/2025 04:08, Yafang Shao wrote:
>> Hi Yafang,
>>
>> From the coverletter, one of the potential usecases you are trying to solve for is if global policy
>> is "never", but the workload want THPs (either always or on madvise basis). But over here,
>> MMF_VM_HUGEPAGE will never be set so in that case mm_flags_test(MMF_VM_HUGEPAGE, oldmm) will
>> always evaluate to false and the get_sugested_order call doesnt matter?
>
> See the replyment in another thread.
>
>>
>>
>>
>>> __khugepaged_enter(mm);
>>> }
>>>
>>> diff --git a/mm/Kconfig b/mm/Kconfig
>>> index 4108bcd96784..d10089e3f181 100644
>>> --- a/mm/Kconfig
>>> +++ b/mm/Kconfig
>>> @@ -924,6 +924,18 @@ config NO_PAGE_MAPCOUNT
>>>
>>> EXPERIMENTAL because the impact of some changes is still unclear.
>>>
>>> +config EXPERIMENTAL_BPF_ORDER_SELECTION
>>> + bool "BPF-based THP order selection (EXPERIMENTAL)"
>>> + depends on TRANSPARENT_HUGEPAGE && BPF_SYSCALL
>>> +
>>> + help
>>> + Enable dynamic THP order selection using BPF programs. This
>>> + experimental feature allows custom BPF logic to determine optimal
>>> + transparent hugepage allocation sizes at runtime.
>>> +
>>> + Warning: This feature is unstable and may change in future kernel
>>> + versions.
>>> +
>>
>>
>> I know there was a discussion on this earlier, but my opinion is that putting all of this
>> as experiment with warnings is not great. No one will be able to deploy this in production
>> if its going to be removed, and I believe thats where the real usage is.
>
> See the replyment in another thread.
>
>>
>>> endif # TRANSPARENT_HUGEPAGE
>>>
>>> # simple helper to make the code a bit easier to read
>>> diff --git a/mm/Makefile b/mm/Makefile
>>> index ef54aa615d9d..cb55d1509be1 100644
>>> --- a/mm/Makefile
>>> +++ b/mm/Makefile
>>> @@ -99,6 +99,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
>>> obj-$(CONFIG_NUMA) += memory-tiers.o
>>> obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
>>> obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
>>> +obj-$(CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION) += bpf_thp.o
>>> obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
>>> obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
>>> obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
>>> diff --git a/mm/bpf_thp.c b/mm/bpf_thp.c
>>> new file mode 100644
>>> index 000000000000..2b03539452d1
>>> --- /dev/null
>>> +++ b/mm/bpf_thp.c
>>> @@ -0,0 +1,186 @@
>>> +// SPDX-License-Identifier: GPL-2.0
>>> +
>>> +#include <linux/bpf.h>
>>> +#include <linux/btf.h>
>>> +#include <linux/huge_mm.h>
>>> +#include <linux/khugepaged.h>
>>> +
>>> +struct bpf_thp_ops {
>>> + /**
>>> + * @get_suggested_order: Get the suggested THP orders for allocation
>>> + * @mm: mm_struct associated with the THP allocation
>>> + * @vma__nullable: vm_area_struct associated with the THP allocation (may be NULL)
>>> + * When NULL, the decision should be based on @mm (i.e., when
>>> + * triggered from an mm-scope hook rather than a VMA-specific
>>> + * context).
>>> + * Must belong to @mm (guaranteed by the caller).
>>> + * @vma_flags: use these vm_flags instead of @vma->vm_flags (0 if @vma is NULL)
>>> + * @tva_flags: TVA flags for current @vma (-1 if @vma is NULL)
>>> + * @orders: Bitmask of requested THP orders for this allocation
>>> + * - PMD-mapped allocation if PMD_ORDER is set
>>> + * - mTHP allocation otherwise
>>> + *
>>> + * Rerurn: Bitmask of suggested THP orders for allocation. The highest
>>> + * suggested order will not exceed the highest requested order
>>> + * in @orders.
>>
>> If we want to make this generic enough so that it doesnt change, should we allow suggested order to
>> exceed highest requested order?
>
> The maximum requested order is determined by the callsite. For example:
> - PMD-mapped THP uses PMD_ORDER
> - mTHP uses (PMD_ORDER - 1)
>
> We must respect this upper bound to avoid undefined behavior.
Ack, makes sense.
>
>>
>>> + */
>>> + int (*get_suggested_order)(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
>>> + u64 vma_flags, enum tva_type tva_flags, int orders) __rcu;
>>> +};
>>> +
>>> +static struct bpf_thp_ops bpf_thp;
>>> +static DEFINE_SPINLOCK(thp_ops_lock);
>>> +
>>> +int get_suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
>>> + u64 vma_flags, enum tva_type tva_flags, int orders)
>>> +{
>>> + int (*bpf_suggested_order)(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
>>> + u64 vma_flags, enum tva_type tva_flags, int orders);
>>> + int suggested_orders = orders;
>>> +
>>> + /* No BPF program is attached */
>>> + if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
>>> + &transparent_hugepage_flags))
>>> + return suggested_orders;
>>> +
>>> + rcu_read_lock();
>>> + bpf_suggested_order = rcu_dereference(bpf_thp.get_suggested_order);
>>> + if (!bpf_suggested_order)
>>> + goto out;
>>
>>
>> My rcu API knowledge is not the best, but maybe we could do:
>>
>> if (!rcu_access_pointer(bpf_thp.get_suggested_order))
>> return suggested_orders;
>>
>
> There might be a race here. The current rcu_access_pointer() check
> occurs outside the RCU read-side critical section, meaning the
> protected pointer could be freed between the check and use.
> Therefore, we must perform the NULL check within the RCU read critical
> section when dereferencing the pointer:
>
Ack, makes sense.
next prev parent reply other threads:[~2025-08-19 10:11 UTC|newest]
Thread overview: 17+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-08-18 5:55 [RFC PATCH v5 mm-new 0/5] mm, bpf: " Yafang Shao
2025-08-18 5:55 ` [RFC PATCH v5 mm-new 1/5] mm: thp: add support for " Yafang Shao
2025-08-18 13:17 ` Usama Arif
2025-08-19 3:08 ` Yafang Shao
2025-08-19 10:11 ` Usama Arif [this message]
2025-08-19 11:10 ` Gutierrez Asier
2025-08-19 11:43 ` Yafang Shao
2025-08-18 5:55 ` [RFC PATCH v5 mm-new 2/5] mm: thp: add a new kfunc bpf_mm_get_mem_cgroup() Yafang Shao
2025-08-18 5:55 ` [RFC PATCH v5 mm-new 3/5] mm: thp: add a new kfunc bpf_mm_get_task() Yafang Shao
2025-08-18 5:55 ` [RFC PATCH v5 mm-new 4/5] bpf: mark vma->vm_mm as trusted Yafang Shao
2025-08-18 5:55 ` [RFC PATCH v5 mm-new 5/5] selftest/bpf: add selftest for BPF based THP order seletection Yafang Shao
2025-08-18 14:00 ` Usama Arif
2025-08-19 3:09 ` Yafang Shao
2025-08-18 14:35 ` [RFC PATCH v5 mm-new 0/5] mm, bpf: BPF based THP order selection Usama Arif
2025-08-19 2:41 ` Yafang Shao
2025-08-19 10:44 ` Usama Arif
2025-08-19 11:33 ` Yafang Shao
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=c855bb8b-4395-4235-8786-3a271252c455@gmail.com \
--to=usamaarif642@gmail.com \
--cc=Liam.Howlett@oracle.com \
--cc=akpm@linux-foundation.org \
--cc=ameryhung@gmail.com \
--cc=andrii@kernel.org \
--cc=ast@kernel.org \
--cc=baolin.wang@linux.alibaba.com \
--cc=bpf@vger.kernel.org \
--cc=daniel@iogearbox.net \
--cc=david@redhat.com \
--cc=dev.jain@arm.com \
--cc=gutierrez.asier@huawei-partners.com \
--cc=hannes@cmpxchg.org \
--cc=laoar.shao@gmail.com \
--cc=linux-mm@kvack.org \
--cc=lorenzo.stoakes@oracle.com \
--cc=npache@redhat.com \
--cc=rientjes@google.com \
--cc=ryan.roberts@arm.com \
--cc=willy@infradead.org \
--cc=ziy@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox