* [PATCH v12 mm-new 00/10] mm, bpf: BPF-MM, BPF-THP
@ 2025-10-26 10:01 Yafang Shao
2025-10-26 10:01 ` [PATCH v12 mm-new 01/10] mm: thp: remove vm_flags parameter from khugepaged_enter_vma() Yafang Shao
` (9 more replies)
0 siblings, 10 replies; 29+ messages in thread
From: Yafang Shao @ 2025-10-26 10:01 UTC (permalink / raw)
To: akpm, ast, daniel, andrii, david, lorenzo.stoakes
Cc: martin.lau, eddyz87, song, yonghong.song, john.fastabend,
kpsingh, sdf, haoluo, jolsa, ziy, Liam.Howlett, npache,
ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
willy, ameryhung, rientjes, corbet, 21cnbao, shakeel.butt, tj,
lance.yang, rdunlap, clm, bpf, linux-mm, Yafang Shao
History
=======
RFC v1: fmod_ret based BPF-THP hook
https://lore.kernel.org/linux-mm/20250429024139.34365-1-laoar.shao@gmail.com/
RFC v2: struct_ops based BPF-THP hook
https://lore.kernel.org/linux-mm/20250520060504.20251-1-laoar.shao@gmail.com/
RFC v4: Get THP order with interface get_suggested_order()
https://lore.kernel.org/linux-mm/20250729091807.84310-1-laoar.shao@gmail.com/
v4->v9: Simplify the interface to:
unsigned long
bpf_hook_thp_get_orders(struct vm_area_struct *vma, enum tva_type type,
unsigned long orders);
https://lore.kernel.org/linux-mm/20250930055826.9810-1-laoar.shao@gmail.com/
v9->RFC v10: Scope BPF-THP to individual processes
v10->v11: Remove the RFC tag
v11->v12: Fix issues reported by AI
The Design
==========
Scoping BPF-THP to cgroup is rejected
-------------------------------------
As explained by Gutierrez:
1. It breaks the cgroup hierarchy when 2 siblings have different THP policies
2. Cgroup was designed for resource management not for grouping processes and
tune those processes
3. We set a precedent for other people adding new flags to cgroup and
potentially polluting cgroups. We may end up with cgroups having tens of
different flags, making sysadmin's job more complex
The related links are:
https://lore.kernel.org/linux-mm/1940d681-94a6-48fb-b889-cd8f0b91b330@huawei-partners.com/
https://lore.kernel.org/linux-mm/20241030150851.GB706616@cmpxchg.org/
So we has to scope it to process.
Scoping BPF-THP to process
--------------------------
To eliminate potential conflicts among competing BPF-THP instances, we
enforce that each process is exclusively managed by a single BPF-THP. This
approach has received agreement from David. For context, see:
https://lore.kernel.org/linux-mm/3577f7fd-429a-49c5-973b-38174a67be15@redhat.com/
When registering a BPF-THP, we specify the PID of a target task. The
BPF-THP is then installed in the task's `mm_struct`
struct mm_struct {
struct bpf_thp_ops __rcu *thp_thp;
};
Inheritance Behavior:
- Existing child processes are unaffected
- Newly forked children inherit the BPF-THP from their parent
- The BPF-THP persists across exec
A new linked list tracks all tasks managed by each BPF-THP instance:
- Newly managed tasks are added to the list
- Exiting tasks are automatically removed from the list
- During BPF-THP unregistration (e.g., when the BPF link is removed), all
managed tasks have their bpf_thp pointer set to NULL
- BPF-THP instances can be dynamically updated, with all tracked tasks
automatically migrating to the new version.
This design simplifies BPF-THP management in production environments by
providing clear lifecycle management and preventing conflicts between
multiple BPF-THP instances.
Global Mode
-----------
The per-process BPF-THP mode is unsuitable for managing shared resources
such as shmem THP and file-backed THP. This aligns with known cgroup
limitations for similar scenarios:
https://lore.kernel.org/linux-mm/YwNold0GMOappUxc@slm.duckdns.org/
Introduce a global BPF-THP mode to address this gap. When registered:
- All existing per-process instances are disabled
- New per-process registrations are blocked
- Existing per-process instances remain registered (no forced unregistration)
The global mode takes precedence over per-process instances. Updates are
type-isolated: global instances can only be updated by new global
instances, and per-process instances by new per-process instances.
BPF CI
------
Several dependency patches are currently in mm-new but haven't been merged
into bpf-next. To enable BPF CI testing, I had to make minor changes to
patches #1 and #2 and trigger the BPF CI manually. For details, see:
https://github.com/kernel-patches/bpf/pull/10097
An error occurred during the test, but it was unrelated to this series.
Yafang Shao (10):
mm: thp: remove vm_flags parameter from khugepaged_enter_vma()
mm: thp: remove vm_flags parameter from thp_vma_allowable_order()
mm: thp: add support for BPF based THP order selection
mm: thp: decouple THP allocation between swap and page fault paths
mm: thp: enable THP allocation exclusively through khugepaged
mm: bpf-thp: add support for global mode
Documentation: add BPF THP
selftests/bpf: add a simple BPF based THP policy
selftests/bpf: add test case to update THP policy
selftests/bpf: add test case for BPF-THP inheritance across fork
Documentation/admin-guide/mm/transhuge.rst | 113 +++++
MAINTAINERS | 3 +
fs/exec.c | 1 +
fs/proc/task_mmu.c | 3 +-
include/linux/huge_mm.h | 58 ++-
include/linux/khugepaged.h | 10 +-
include/linux/mm_types.h | 17 +
kernel/fork.c | 1 +
mm/Kconfig | 24 +
mm/Makefile | 1 +
mm/huge_memory.c | 7 +-
mm/huge_memory_bpf.c | 423 ++++++++++++++++++
mm/khugepaged.c | 43 +-
mm/madvise.c | 7 +
mm/memory.c | 22 +-
mm/mmap.c | 1 +
mm/shmem.c | 2 +-
mm/vma.c | 6 +-
tools/testing/selftests/bpf/config | 3 +
.../selftests/bpf/prog_tests/thp_adjust.c | 357 +++++++++++++++
.../selftests/bpf/progs/test_thp_adjust.c | 53 +++
21 files changed, 1101 insertions(+), 54 deletions(-)
create mode 100644 mm/huge_memory_bpf.c
create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c
--
2.47.3
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH v12 mm-new 01/10] mm: thp: remove vm_flags parameter from khugepaged_enter_vma()
2025-10-26 10:01 [PATCH v12 mm-new 00/10] mm, bpf: BPF-MM, BPF-THP Yafang Shao
@ 2025-10-26 10:01 ` Yafang Shao
2025-10-26 10:01 ` [PATCH v12 mm-new 02/10] mm: thp: remove vm_flags parameter from thp_vma_allowable_order() Yafang Shao
` (8 subsequent siblings)
9 siblings, 0 replies; 29+ messages in thread
From: Yafang Shao @ 2025-10-26 10:01 UTC (permalink / raw)
To: akpm, ast, daniel, andrii, david, lorenzo.stoakes
Cc: martin.lau, eddyz87, song, yonghong.song, john.fastabend,
kpsingh, sdf, haoluo, jolsa, ziy, Liam.Howlett, npache,
ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
willy, ameryhung, rientjes, corbet, 21cnbao, shakeel.butt, tj,
lance.yang, rdunlap, clm, bpf, linux-mm, Yafang Shao, Yang Shi
The khugepaged_enter_vma() function requires handling in two specific
scenarios:
1. New VMA creation
When a new VMA is created (for anon vma, it is deferred to pagefault), if
vma->vm_mm is not present in khugepaged_mm_slot, it must be added. In
this case, khugepaged_enter_vma() is called after vma->vm_flags have been
set, allowing direct use of the VMA's flags.
2. VMA flag modification
When vma->vm_flags are modified (particularly when VM_HUGEPAGE is set),
the system must recheck whether to add vma->vm_mm to khugepaged_mm_slot.
Currently, khugepaged_enter_vma() is called before the flag update, so
the call must be relocated to occur after vma->vm_flags have been set.
In the VMA merging path, khugepaged_enter_vma() is also called. For this
case, since VMA merging only occurs when the vm_flags of both VMAs are
identical (excluding special flags like VM_SOFTDIRTY), we can safely use
target->vm_flags instead. (It is worth noting that khugepaged_enter_vma()
can be removed from the VMA merging path because the VMA has already been
added in the two aforementioned cases. We will address this cleanup in a
separate patch.)
After this change, we can further remove vm_flags parameter from
thp_vma_allowable_order(). That will be handled in a followup patch.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Usama Arif <usamaarif642@gmail.com>
---
include/linux/khugepaged.h | 10 ++++++----
mm/huge_memory.c | 2 +-
mm/khugepaged.c | 27 ++++++++++++++-------------
mm/madvise.c | 7 +++++++
mm/vma.c | 6 +++---
5 files changed, 31 insertions(+), 21 deletions(-)
diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index 179ce716e769..b8291a9740b4 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -15,8 +15,8 @@ extern void khugepaged_destroy(void);
extern int start_stop_khugepaged(void);
extern void __khugepaged_enter(struct mm_struct *mm);
extern void __khugepaged_exit(struct mm_struct *mm);
-extern void khugepaged_enter_vma(struct vm_area_struct *vma,
- vm_flags_t vm_flags);
+extern void khugepaged_enter_vma(struct vm_area_struct *vma);
+extern void khugepaged_enter_mm(struct mm_struct *mm);
extern void khugepaged_min_free_kbytes_update(void);
extern bool current_is_khugepaged(void);
extern int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
@@ -40,8 +40,10 @@ static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm
static inline void khugepaged_exit(struct mm_struct *mm)
{
}
-static inline void khugepaged_enter_vma(struct vm_area_struct *vma,
- vm_flags_t vm_flags)
+static inline void khugepaged_enter_vma(struct vm_area_struct *vma)
+{
+}
+static inline void khugepaged_enter_mm(struct mm_struct *mm)
{
}
static inline int collapse_pte_mapped_thp(struct mm_struct *mm,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7a0eedf5e3c8..bcbc1674f3d3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1476,7 +1476,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
ret = vmf_anon_prepare(vmf);
if (ret)
return ret;
- khugepaged_enter_vma(vma, vma->vm_flags);
+ khugepaged_enter_vma(vma);
if (!(vmf->flags & FAULT_FLAG_WRITE) &&
!mm_forbids_zeropage(vma->vm_mm) &&
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 8ed9f8e2d376..d517659d905f 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -367,12 +367,6 @@ int hugepage_madvise(struct vm_area_struct *vma,
#endif
*vm_flags &= ~VM_NOHUGEPAGE;
*vm_flags |= VM_HUGEPAGE;
- /*
- * If the vma become good for khugepaged to scan,
- * register it here without waiting a page fault that
- * may not happen any time soon.
- */
- khugepaged_enter_vma(vma, *vm_flags);
break;
case MADV_NOHUGEPAGE:
*vm_flags &= ~VM_HUGEPAGE;
@@ -514,14 +508,21 @@ static unsigned long collapse_allowable_orders(struct vm_area_struct *vma,
return thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
}
-void khugepaged_enter_vma(struct vm_area_struct *vma,
- vm_flags_t vm_flags)
+void khugepaged_enter_mm(struct mm_struct *mm)
{
- if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
- hugepage_enabled()) {
- if (collapse_allowable_orders(vma, vm_flags, true))
- __khugepaged_enter(vma->vm_mm);
- }
+ if (mm_flags_test(MMF_VM_HUGEPAGE, mm))
+ return;
+ if (!hugepage_enabled())
+ return;
+
+ __khugepaged_enter(mm);
+}
+
+void khugepaged_enter_vma(struct vm_area_struct *vma)
+{
+ if (!collapse_allowable_orders(vma, vma->vm_flags, true))
+ return;
+ khugepaged_enter_mm(vma->vm_mm);
}
void __khugepaged_exit(struct mm_struct *mm)
diff --git a/mm/madvise.c b/mm/madvise.c
index fb1c86e630b6..067d4c6d5c46 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1425,6 +1425,13 @@ static int madvise_vma_behavior(struct madvise_behavior *madv_behavior)
VM_WARN_ON_ONCE(madv_behavior->lock_mode != MADVISE_MMAP_WRITE_LOCK);
error = madvise_update_vma(new_flags, madv_behavior);
+ /*
+ * If the vma become good for khugepaged to scan,
+ * register it here without waiting a page fault that
+ * may not happen any time soon.
+ */
+ if (!error && new_flags & VM_HUGEPAGE)
+ khugepaged_enter_mm(madv_behavior->vma->vm_mm);
out:
/*
* madvise() returns EAGAIN if kernel resources, such as
diff --git a/mm/vma.c b/mm/vma.c
index 919d1fc63a52..519963e6f174 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -975,7 +975,7 @@ static __must_check struct vm_area_struct *vma_merge_existing_range(
if (err || commit_merge(vmg))
goto abort;
- khugepaged_enter_vma(vmg->target, vmg->vm_flags);
+ khugepaged_enter_vma(vmg->target);
vmg->state = VMA_MERGE_SUCCESS;
return vmg->target;
@@ -1095,7 +1095,7 @@ struct vm_area_struct *vma_merge_new_range(struct vma_merge_struct *vmg)
* following VMA if we have VMAs on both sides.
*/
if (vmg->target && !vma_expand(vmg)) {
- khugepaged_enter_vma(vmg->target, vmg->vm_flags);
+ khugepaged_enter_vma(vmg->target);
vmg->state = VMA_MERGE_SUCCESS;
return vmg->target;
}
@@ -2506,7 +2506,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
* call covers the non-merge case.
*/
if (!vma_is_anonymous(vma))
- khugepaged_enter_vma(vma, map->vm_flags);
+ khugepaged_enter_vma(vma);
*vmap = vma;
return 0;
--
2.47.3
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH v12 mm-new 02/10] mm: thp: remove vm_flags parameter from thp_vma_allowable_order()
2025-10-26 10:01 [PATCH v12 mm-new 00/10] mm, bpf: BPF-MM, BPF-THP Yafang Shao
2025-10-26 10:01 ` [PATCH v12 mm-new 01/10] mm: thp: remove vm_flags parameter from khugepaged_enter_vma() Yafang Shao
@ 2025-10-26 10:01 ` Yafang Shao
2025-10-26 10:01 ` [PATCH v12 mm-new 03/10] mm: thp: add support for BPF based THP order selection Yafang Shao
` (7 subsequent siblings)
9 siblings, 0 replies; 29+ messages in thread
From: Yafang Shao @ 2025-10-26 10:01 UTC (permalink / raw)
To: akpm, ast, daniel, andrii, david, lorenzo.stoakes
Cc: martin.lau, eddyz87, song, yonghong.song, john.fastabend,
kpsingh, sdf, haoluo, jolsa, ziy, Liam.Howlett, npache,
ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
willy, ameryhung, rientjes, corbet, 21cnbao, shakeel.butt, tj,
lance.yang, rdunlap, clm, bpf, linux-mm, Yafang Shao
Because all calls to thp_vma_allowable_order() pass vma->vm_flags as the
vma_flags argument, we can remove the parameter and have the function
access vma->vm_flags directly.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Usama Arif <usamaarif642@gmail.com>
---
fs/proc/task_mmu.c | 3 +--
include/linux/huge_mm.h | 16 ++++++++--------
mm/huge_memory.c | 4 ++--
mm/khugepaged.c | 18 +++++++++---------
mm/memory.c | 11 +++++------
mm/shmem.c | 2 +-
6 files changed, 26 insertions(+), 28 deletions(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index fc35a0543f01..e713d1905750 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1369,8 +1369,7 @@ static int show_smap(struct seq_file *m, void *v)
__show_smap(m, &mss, false);
seq_printf(m, "THPeligible: %8u\n",
- !!thp_vma_allowable_orders(vma, vma->vm_flags, TVA_SMAPS,
- THP_ORDERS_ALL));
+ !!thp_vma_allowable_orders(vma, TVA_SMAPS, THP_ORDERS_ALL));
if (arch_pkeys_enabled())
seq_printf(m, "ProtectionKey: %8u\n", vma_pkey(vma));
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 4b2773235041..f73c72d58620 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -101,8 +101,8 @@ enum tva_type {
TVA_FORCED_COLLAPSE, /* Forced collapse (e.g. MADV_COLLAPSE). */
};
-#define thp_vma_allowable_order(vma, vm_flags, type, order) \
- (!!thp_vma_allowable_orders(vma, vm_flags, type, BIT(order)))
+#define thp_vma_allowable_order(vma, type, order) \
+ (!!thp_vma_allowable_orders(vma, type, BIT(order)))
#define split_folio(f) split_folio_to_list(f, NULL)
@@ -271,14 +271,12 @@ static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
}
unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
- vm_flags_t vm_flags,
enum tva_type type,
unsigned long orders);
/**
* thp_vma_allowable_orders - determine hugepage orders that are allowed for vma
* @vma: the vm area to check
- * @vm_flags: use these vm_flags instead of vma->vm_flags
* @type: TVA type
* @orders: bitfield of all orders to consider
*
@@ -292,10 +290,11 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
*/
static inline
unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
- vm_flags_t vm_flags,
enum tva_type type,
unsigned long orders)
{
+ vm_flags_t vm_flags = vma->vm_flags;
+
/*
* Optimization to check if required orders are enabled early. Only
* forced collapse ignores sysfs configs.
@@ -314,7 +313,7 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
return 0;
}
- return __thp_vma_allowable_orders(vma, vm_flags, type, orders);
+ return __thp_vma_allowable_orders(vma, type, orders);
}
struct thpsize {
@@ -334,8 +333,10 @@ struct thpsize {
* through madvise or prctl.
*/
static inline bool vma_thp_disabled(struct vm_area_struct *vma,
- vm_flags_t vm_flags, bool forced_collapse)
+ bool forced_collapse)
{
+ vm_flags_t vm_flags = vma->vm_flags;
+
/* Are THPs disabled for this VMA? */
if (vm_flags & VM_NOHUGEPAGE)
return true;
@@ -564,7 +565,6 @@ static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
}
static inline unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
- vm_flags_t vm_flags,
enum tva_type type,
unsigned long orders)
{
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bcbc1674f3d3..db9a2a24d58c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -98,7 +98,6 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
}
unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
- vm_flags_t vm_flags,
enum tva_type type,
unsigned long orders)
{
@@ -106,6 +105,7 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
const bool in_pf = type == TVA_PAGEFAULT;
const bool forced_collapse = type == TVA_FORCED_COLLAPSE;
unsigned long supported_orders;
+ vm_flags_t vm_flags = vma->vm_flags;
/* Check the intersection of requested and supported orders. */
if (vma_is_anonymous(vma))
@@ -122,7 +122,7 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
if (!vma->vm_mm) /* vdso */
return 0;
- if (thp_disabled_by_hw() || vma_thp_disabled(vma, vm_flags, forced_collapse))
+ if (thp_disabled_by_hw() || vma_thp_disabled(vma, forced_collapse))
return 0;
/* khugepaged doesn't collapse DAX vma, but page fault is fine. */
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index d517659d905f..d70e1d4be3f2 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -499,13 +499,13 @@ static unsigned int collapse_max_ptes_none(unsigned int order, bool full_scan)
/* Check what orders are allowed based on the vma and collapse type */
static unsigned long collapse_allowable_orders(struct vm_area_struct *vma,
- vm_flags_t vm_flags, bool is_khugepaged)
+ bool is_khugepaged)
{
- enum tva_type tva_flags = is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
+ enum tva_type tva_type = is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
unsigned long orders = is_khugepaged && vma_is_anonymous(vma) ?
THP_ORDERS_ALL_ANON : BIT(HPAGE_PMD_ORDER);
- return thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
+ return thp_vma_allowable_orders(vma, tva_type, orders);
}
void khugepaged_enter_mm(struct mm_struct *mm)
@@ -520,7 +520,7 @@ void khugepaged_enter_mm(struct mm_struct *mm)
void khugepaged_enter_vma(struct vm_area_struct *vma)
{
- if (!collapse_allowable_orders(vma, vma->vm_flags, true))
+ if (!collapse_allowable_orders(vma, TVA_KHUGEPAGED))
return;
khugepaged_enter_mm(vma->vm_mm);
}
@@ -992,7 +992,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
/* Always check the PMD order to ensure its not shared by another VMA */
if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
return SCAN_ADDRESS_RANGE;
- if (!thp_vma_allowable_orders(vma, vma->vm_flags, type, BIT(order)))
+ if (!thp_vma_allowable_orders(vma, type, BIT(order)))
return SCAN_VMA_CHECK;
/*
* Anon VMA expected, the address may be unmapped then
@@ -1508,7 +1508,7 @@ static int collapse_scan_pmd(struct mm_struct *mm,
memset(cc->node_load, 0, sizeof(cc->node_load));
nodes_clear(cc->alloc_nmask);
- enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, cc->is_khugepaged);
+ enabled_orders = collapse_allowable_orders(vma, cc->is_khugepaged);
/*
* If PMD is the only enabled order, enforce max_ptes_none, otherwise
@@ -1777,7 +1777,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
* and map it by a PMD, regardless of sysfs THP settings. As such, let's
* analogously elide sysfs THP settings here and force collapse.
*/
- if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER))
+ if (!thp_vma_allowable_order(vma, TVA_FORCED_COLLAPSE, PMD_ORDER))
return SCAN_VMA_CHECK;
/* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */
@@ -2719,7 +2719,7 @@ static unsigned int collapse_scan_mm_slot(unsigned int pages, int *result,
progress++;
break;
}
- if (!collapse_allowable_orders(vma, vma->vm_flags, true)) {
+ if (!collapse_allowable_orders(vma, true)) {
skip:
progress++;
continue;
@@ -3025,7 +3025,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
BUG_ON(vma->vm_start > start);
BUG_ON(vma->vm_end < end);
- if (!collapse_allowable_orders(vma, vma->vm_flags, false))
+ if (!collapse_allowable_orders(vma, false))
return -EINVAL;
cc = kmalloc(sizeof(*cc), GFP_KERNEL);
diff --git a/mm/memory.c b/mm/memory.c
index 618534b4963c..7b52068372d8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4558,7 +4558,7 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
* Get a list of all the (large) orders below PMD_ORDER that are enabled
* and suitable for swapping THP.
*/
- orders = thp_vma_allowable_orders(vma, vma->vm_flags, TVA_PAGEFAULT,
+ orders = thp_vma_allowable_orders(vma, TVA_PAGEFAULT,
BIT(PMD_ORDER) - 1);
orders = thp_vma_suitable_orders(vma, vmf->address, orders);
orders = thp_swap_suitable_orders(swp_offset(entry),
@@ -5107,7 +5107,7 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
* for this vma. Then filter out the orders that can't be allocated over
* the faulting address and still be fully contained in the vma.
*/
- orders = thp_vma_allowable_orders(vma, vma->vm_flags, TVA_PAGEFAULT,
+ orders = thp_vma_allowable_orders(vma, TVA_PAGEFAULT,
BIT(PMD_ORDER) - 1);
orders = thp_vma_suitable_orders(vma, vmf->address, orders);
@@ -5379,7 +5379,7 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct folio *folio, struct page *pa
* PMD mappings if THPs are disabled. As we already have a THP,
* behave as if we are forcing a collapse.
*/
- if (thp_disabled_by_hw() || vma_thp_disabled(vma, vma->vm_flags,
+ if (thp_disabled_by_hw() || vma_thp_disabled(vma,
/* forced_collapse=*/ true))
return ret;
@@ -6289,7 +6289,6 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
.gfp_mask = __get_fault_gfp_mask(vma),
};
struct mm_struct *mm = vma->vm_mm;
- vm_flags_t vm_flags = vma->vm_flags;
pgd_t *pgd;
p4d_t *p4d;
vm_fault_t ret;
@@ -6304,7 +6303,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
return VM_FAULT_OOM;
retry_pud:
if (pud_none(*vmf.pud) &&
- thp_vma_allowable_order(vma, vm_flags, TVA_PAGEFAULT, PUD_ORDER)) {
+ thp_vma_allowable_order(vma, TVA_PAGEFAULT, PUD_ORDER)) {
ret = create_huge_pud(&vmf);
if (!(ret & VM_FAULT_FALLBACK))
return ret;
@@ -6338,7 +6337,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
goto retry_pud;
if (pmd_none(*vmf.pmd) &&
- thp_vma_allowable_order(vma, vm_flags, TVA_PAGEFAULT, PMD_ORDER)) {
+ thp_vma_allowable_order(vma, TVA_PAGEFAULT, PMD_ORDER)) {
ret = create_huge_pmd(&vmf);
if (!(ret & VM_FAULT_FALLBACK))
return ret;
diff --git a/mm/shmem.c b/mm/shmem.c
index 6580f3cd24bb..5882c37fa04e 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1809,7 +1809,7 @@ unsigned long shmem_allowable_huge_orders(struct inode *inode,
vm_flags_t vm_flags = vma ? vma->vm_flags : 0;
unsigned int global_orders;
- if (thp_disabled_by_hw() || (vma && vma_thp_disabled(vma, vm_flags, shmem_huge_force)))
+ if (thp_disabled_by_hw() || (vma && vma_thp_disabled(vma, shmem_huge_force)))
return 0;
global_orders = shmem_huge_global_enabled(inode, index, write_end,
--
2.47.3
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH v12 mm-new 03/10] mm: thp: add support for BPF based THP order selection
2025-10-26 10:01 [PATCH v12 mm-new 00/10] mm, bpf: BPF-MM, BPF-THP Yafang Shao
2025-10-26 10:01 ` [PATCH v12 mm-new 01/10] mm: thp: remove vm_flags parameter from khugepaged_enter_vma() Yafang Shao
2025-10-26 10:01 ` [PATCH v12 mm-new 02/10] mm: thp: remove vm_flags parameter from thp_vma_allowable_order() Yafang Shao
@ 2025-10-26 10:01 ` Yafang Shao
2025-10-26 10:01 ` [PATCH v12 mm-new 04/10] mm: thp: decouple THP allocation between swap and page fault paths Yafang Shao
` (6 subsequent siblings)
9 siblings, 0 replies; 29+ messages in thread
From: Yafang Shao @ 2025-10-26 10:01 UTC (permalink / raw)
To: akpm, ast, daniel, andrii, david, lorenzo.stoakes
Cc: martin.lau, eddyz87, song, yonghong.song, john.fastabend,
kpsingh, sdf, haoluo, jolsa, ziy, Liam.Howlett, npache,
ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
willy, ameryhung, rientjes, corbet, 21cnbao, shakeel.butt, tj,
lance.yang, rdunlap, clm, bpf, linux-mm, Yafang Shao
The Motivation
==============
This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic
THP tuning. It includes a hook bpf_hook_thp_get_order(), allowing BPF
programs to influence THP order selection based on factors such as:
- Workload identity
For example, workloads running in specific containers or cgroups.
- Allocation context
Whether the allocation occurs during a page fault, khugepaged, swap or
other paths.
- VMA's memory advice settings
MADV_HUGEPAGE or MADV_NOHUGEPAGE
- Memory pressure
PSI system data or associated cgroup PSI metrics
The BPF-THP Interface
=====================
The kernel API of this new BPF hook is as follows,
/**
* thp_get_order: Get the suggested THP order from a BPF program for allocation
* @vma: vm_area_struct associated with the THP allocation
* @type: TVA type for current @vma
* @orders: Bitmask of available THP orders for this allocation
*
* Return: The suggested THP order for allocation from the BPF program.
* Returns a negative value to preserve the original available @orders,
* which is useful in specific cases—for example, when only a particular
* @type is handled and others are ignored.
*/
int thp_get_order(struct vm_area_struct *vma,
enum tva_type type,
unsigned long orders);
This functionality is only active when system-wide THP is configured to
madvise or always mode. It remains disabled in never mode. Additionally,
if THP is explicitly disabled for a specific task via prctl(), this BPF
functionality will also be unavailable for that task.
The Design of Per Process BPF-THP
=================================
As suggested by Alexei, we need to scoping the BPF-THP [0].
Scoping BPF-THP to cgroup is not acceptible
-------------------------------------------
As explained by Gutierrez: [1]
1. It breaks the cgroup hierarchy when 2 siblings have different THP policies
2. Cgroup was designed for resource management not for grouping processes and
une those processes
3. We set a precedent for other people adding new flags to cgroup and
potentially polluting cgroups. We may end up with cgroups having tens of
different flags, making sysadmin's job more complex
Scoping BPF-THP to process
--------------------------
To eliminate potential conflicts among competing BPF-THP instances, we
enforce that each process is exclusively managed by a single BPF-THP. This
approach has received agreement from David [2].
When registering a BPF-THP, we specify the PID of a target task. The
BPF-THP is then installed in the task's `mm_struct`
struct mm_struct {
struct bpf_thp_ops __rcu *thp_thp;
};
Inheritance Behavior:
- Existing child processes are unaffected
- Newly forked children inherit the BPF-THP from their parent
- The BPF-THP persists across execve() calls
A new linked list tracks all tasks managed by each BPF-THP instance:
- Newly managed tasks are added to the list
- Exiting tasks are automatically removed from the list
- During BPF-THP unregistration (e.g., when the BPF link is removed), all
managed tasks have their bpf_thp pointer set to NULL
- BPF-THP instances can be dynamically updated, with all tracked tasks
automatically migrating to the new version.
This design simplifies BPF-THP management in production environments by
providing clear lifecycle management and preventing conflicts between
multiple BPF-THP instances.
WARNING
=======
This feature requires CONFIG_BPF_THP (EXPERIMENTAL) to be enabled. Note
that this capability is currently unstable and may undergo significant
changes—including potential removal—in future kernel versions.
Link: https://lore.kernel.org/linux-mm/CAADnVQJtrJZOCWZKH498GBA8M0mYVztApk54mOEejs8Wr3nSiw@mail.gmail.com/ [0]
Link: https://lore.kernel.org/linux-mm/1940d681-94a6-48fb-b889-cd8f0b91b330@huawei-partners.com/ [1]
Link: https://lore.kernel.org/linux-mm/3577f7fd-429a-49c5-973b-38174a67be15@redhat.com/ [2]
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
MAINTAINERS | 1 +
fs/exec.c | 1 +
include/linux/huge_mm.h | 39 +++++
include/linux/mm_types.h | 17 +++
kernel/fork.c | 1 +
mm/Kconfig | 22 +++
mm/Makefile | 1 +
mm/huge_memory_bpf.c | 316 +++++++++++++++++++++++++++++++++++++++
mm/mmap.c | 1 +
9 files changed, 399 insertions(+)
create mode 100644 mm/huge_memory_bpf.c
diff --git a/MAINTAINERS b/MAINTAINERS
index c1a1732df7b1..e8eeb7c89431 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16521,6 +16521,7 @@ F: include/linux/huge_mm.h
F: include/linux/khugepaged.h
F: include/trace/events/huge_memory.h
F: mm/huge_memory.c
+F: mm/huge_memory_bpf.c
F: mm/khugepaged.c
F: mm/mm_slot.h
F: tools/testing/selftests/mm/khugepaged.c
diff --git a/fs/exec.c b/fs/exec.c
index 6b70c6726d31..41d7703368e9 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -890,6 +890,7 @@ static int exec_mmap(struct mm_struct *mm)
activate_mm(active_mm, mm);
if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
local_irq_enable();
+ bpf_thp_retain_mm(mm, old_mm);
lru_gen_add_mm(mm);
task_unlock(tsk);
lru_gen_use_mm(mm);
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index f73c72d58620..49050455f793 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -274,6 +274,40 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
enum tva_type type,
unsigned long orders);
+#ifdef CONFIG_BPF_THP
+
+unsigned long
+bpf_hook_thp_get_orders(struct vm_area_struct *vma, enum tva_type type,
+ unsigned long orders);
+void bpf_thp_exit_mm(struct mm_struct *mm);
+void bpf_thp_retain_mm(struct mm_struct *mm, struct mm_struct *old_mm);
+void bpf_thp_fork(struct mm_struct *mm, struct mm_struct *old_mm);
+
+#else
+
+static inline unsigned long
+bpf_hook_thp_get_orders(struct vm_area_struct *vma, enum tva_type type,
+ unsigned long orders)
+{
+ return orders;
+}
+
+static inline void bpf_thp_exit_mm(struct mm_struct *mm)
+{
+}
+
+static inline void
+bpf_thp_retain_mm(struct mm_struct *mm, struct mm_struct *old_mm)
+{
+}
+
+static inline void
+bpf_thp_fork(struct mm_struct *mm, struct mm_struct *old_mm)
+{
+}
+
+#endif
+
/**
* thp_vma_allowable_orders - determine hugepage orders that are allowed for vma
* @vma: the vm area to check
@@ -295,6 +329,11 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
{
vm_flags_t vm_flags = vma->vm_flags;
+ /* The BPF-specified order overrides which order is selected. */
+ orders &= bpf_hook_thp_get_orders(vma, type, orders);
+ if (!orders)
+ return 0;
+
/*
* Optimization to check if required orders are enabled early. Only
* forced collapse ignores sysfs configs.
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5021047485a9..e0c89ca9f6f7 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -983,6 +983,19 @@ struct mm_cid {
};
#endif
+#ifdef CONFIG_BPF_THP
+struct bpf_thp_ops;
+#endif
+
+#ifdef CONFIG_BPF_MM
+struct bpf_mm_ops {
+#ifdef CONFIG_BPF_THP
+ struct bpf_thp_ops __rcu *bpf_thp;
+ struct list_head bpf_thp_list;
+#endif
+};
+#endif
+
/*
* Opaque type representing current mm_struct flag state. Must be accessed via
* mm_flags_xxx() helper functions.
@@ -1280,6 +1293,10 @@ struct mm_struct {
#ifdef CONFIG_MM_ID
mm_id_t mm_id;
#endif /* CONFIG_MM_ID */
+
+#ifdef CONFIG_BPF_MM
+ struct bpf_mm_ops bpf_mm;
+#endif
} __randomize_layout;
/*
diff --git a/kernel/fork.c b/kernel/fork.c
index 3da0f08615a9..dc24f3d012df 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1130,6 +1130,7 @@ static inline void __mmput(struct mm_struct *mm)
exit_aio(mm);
ksm_exit(mm);
khugepaged_exit(mm); /* must run before exit_mmap */
+ bpf_thp_exit_mm(mm);
exit_mmap(mm);
mm_put_huge_zero_folio(mm);
set_mm_exe_file(mm, NULL);
diff --git a/mm/Kconfig b/mm/Kconfig
index a5a90b169435..12a2fbdc0909 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1457,6 +1457,28 @@ config PT_RECLAIM
config FIND_NORMAL_PAGE
def_bool n
+menuconfig BPF_MM
+ bool "BPF-based Memory Management (EXPERIMENTAL)"
+ depends on BPF_SYSCALL
+
+ help
+ Enable BPF-based Memory Management Policy. This feature is currently
+ experimental.
+
+ WARNING: This feature is unstable and may change in future kernel
+
+if BPF_MM
+config BPF_THP
+ bool "BPF-based THP Policy (EXPERIMENTAL)"
+ depends on TRANSPARENT_HUGEPAGE && BPF_MM
+
+ help
+ Enable dynamic THP policy adjustment using BPF programs. This feature
+ is currently experimental.
+
+ WARNING: This feature is unstable and may change in future kernel
+endif # BPF_MM
+
source "mm/damon/Kconfig"
endmenu
diff --git a/mm/Makefile b/mm/Makefile
index 21abb3353550..4efca1c8a919 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -99,6 +99,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_NUMA) += memory-tiers.o
obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
+obj-$(CONFIG_BPF_THP) += huge_memory_bpf.o
obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
diff --git a/mm/huge_memory_bpf.c b/mm/huge_memory_bpf.c
new file mode 100644
index 000000000000..f69c5851ea61
--- /dev/null
+++ b/mm/huge_memory_bpf.c
@@ -0,0 +1,316 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * BPF-based THP policy management
+ *
+ * Author: Yafang Shao <laoar.shao@gmail.com>
+ */
+
+#include <linux/bpf.h>
+#include <linux/btf.h>
+#include <linux/huge_mm.h>
+#include <linux/khugepaged.h>
+
+/**
+ * @thp_order_fn_t: Get the suggested THP order from a BPF program for allocation
+ * @vma: vm_area_struct associated with the THP allocation
+ * @type: TVA type for current @vma
+ * @orders: Bitmask of available THP orders for this allocation
+ *
+ * Return: The suggested THP order for allocation from the BPF program.
+ * Returns a negative value to preserve the original available @orders,
+ * which is useful in specific cases—for example, when only a particular
+ * @type is handled and others are ignored.
+ */
+typedef int thp_order_fn_t(struct vm_area_struct *vma,
+ enum tva_type type,
+ unsigned long orders);
+
+struct bpf_thp_ops {
+ pid_t pid; /* The pid to attach */
+ thp_order_fn_t *thp_get_order;
+
+ /* private */
+ /* The list of mm_struct objects managed by this BPF-THP instance. */
+ struct list_head mm_list;
+};
+
+static DEFINE_SPINLOCK(thp_ops_lock);
+
+unsigned long bpf_hook_thp_get_orders(struct vm_area_struct *vma,
+ enum tva_type type,
+ unsigned long orders)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ struct bpf_thp_ops *bpf_thp;
+ int bpf_order;
+
+ if (!mm)
+ return orders;
+
+ rcu_read_lock();
+ bpf_thp = rcu_dereference(mm->bpf_mm.bpf_thp);
+ if (!bpf_thp || !bpf_thp->thp_get_order)
+ goto out;
+
+ bpf_order = bpf_thp->thp_get_order(vma, type, orders);
+ if (bpf_order < 0)
+ goto out;
+ orders &= BIT(bpf_order);
+
+out:
+ rcu_read_unlock();
+ return orders;
+}
+
+void bpf_thp_exit_mm(struct mm_struct *mm)
+{
+ if (!rcu_access_pointer(mm->bpf_mm.bpf_thp))
+ return;
+
+ spin_lock(&thp_ops_lock);
+ if (!rcu_access_pointer(mm->bpf_mm.bpf_thp)) {
+ spin_unlock(&thp_ops_lock);
+ return;
+ }
+ list_del(&mm->bpf_mm.bpf_thp_list);
+ RCU_INIT_POINTER(mm->bpf_mm.bpf_thp, NULL);
+ spin_unlock(&thp_ops_lock);
+
+}
+
+void bpf_thp_retain_mm(struct mm_struct *mm, struct mm_struct *old_mm)
+{
+ struct bpf_thp_ops *bpf_thp;
+
+ if (!old_mm || !rcu_access_pointer(old_mm->bpf_mm.bpf_thp))
+ return;
+
+ spin_lock(&thp_ops_lock);
+ bpf_thp = rcu_dereference_protected(old_mm->bpf_mm.bpf_thp,
+ lockdep_is_held(&thp_ops_lock));
+ if (!bpf_thp) {
+ spin_unlock(&thp_ops_lock);
+ return;
+ }
+
+ /* The new mm_struct is under initialization. */
+ RCU_INIT_POINTER(mm->bpf_mm.bpf_thp, bpf_thp);
+
+ /* The old mm_struct is being destroyed. */
+ RCU_INIT_POINTER(old_mm->bpf_mm.bpf_thp, NULL);
+ list_replace(&old_mm->bpf_mm.bpf_thp_list, &mm->bpf_mm.bpf_thp_list);
+ spin_unlock(&thp_ops_lock);
+}
+
+void bpf_thp_fork(struct mm_struct *mm, struct mm_struct *old_mm)
+{
+ struct bpf_thp_ops *bpf_thp;
+
+ if (!rcu_access_pointer(old_mm->bpf_mm.bpf_thp))
+ return;
+
+ spin_lock(&thp_ops_lock);
+ bpf_thp = rcu_dereference_protected(old_mm->bpf_mm.bpf_thp,
+ lockdep_is_held(&thp_ops_lock));
+ if (!bpf_thp) {
+ spin_unlock(&thp_ops_lock);
+ return;
+ }
+
+ /* The new mm_struct is under initialization. */
+ RCU_INIT_POINTER(mm->bpf_mm.bpf_thp, bpf_thp);
+
+ list_add_tail(&mm->bpf_mm.bpf_thp_list, &bpf_thp->mm_list);
+ spin_unlock(&thp_ops_lock);
+}
+
+static bool bpf_thp_ops_is_valid_access(int off, int size,
+ enum bpf_access_type type,
+ const struct bpf_prog *prog,
+ struct bpf_insn_access_aux *info)
+{
+ return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
+}
+
+static const struct bpf_func_proto *
+bpf_thp_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+ return bpf_base_func_proto(func_id, prog);
+}
+
+static const struct bpf_verifier_ops thp_bpf_verifier_ops = {
+ .get_func_proto = bpf_thp_get_func_proto,
+ .is_valid_access = bpf_thp_ops_is_valid_access,
+};
+
+static int bpf_thp_init(struct btf *btf)
+{
+ return 0;
+}
+
+static int bpf_thp_check_member(const struct btf_type *t,
+ const struct btf_member *member,
+ const struct bpf_prog *prog)
+{
+ /* The call site operates under RCU protection. */
+ if (prog->sleepable)
+ return -EINVAL;
+ return 0;
+}
+
+static int bpf_thp_init_member(const struct btf_type *t,
+ const struct btf_member *member,
+ void *kdata, const void *udata)
+{
+ const struct bpf_thp_ops *ubpf_thp;
+ struct bpf_thp_ops *kbpf_thp;
+ u32 moff;
+
+ ubpf_thp = (const struct bpf_thp_ops *)udata;
+ kbpf_thp = (struct bpf_thp_ops *)kdata;
+
+ moff = __btf_member_bit_offset(t, member) / 8;
+ switch (moff) {
+ case offsetof(struct bpf_thp_ops, pid):
+ /* bpf_struct_ops only handles func ptrs and zero-ed members.
+ * Return 1 to bypass the default handler.
+ */
+ kbpf_thp->pid = ubpf_thp->pid;
+ return 1;
+ }
+ return 0;
+}
+
+static int bpf_thp_reg(void *kdata, struct bpf_link *link)
+{
+ struct bpf_thp_ops *bpf_thp = kdata;
+ struct list_head *mm_list;
+ struct task_struct *p;
+ struct mm_struct *mm;
+ int err = 0;
+ pid_t pid;
+
+ pid = bpf_thp->pid;
+ p = find_get_task_by_vpid(pid);
+ if (!p)
+ return -ESRCH;
+
+ if (p->flags & PF_EXITING) {
+ put_task_struct(p);
+ return -ESRCH;
+ }
+
+ mm = get_task_mm(p);
+ put_task_struct(p);
+ if (!mm)
+ return -EINVAL;
+
+ /* To prevent conflicts, use this lock when multiple BPF-THP instances
+ * might register this task simultaneously.
+ */
+ spin_lock(&thp_ops_lock);
+ /* Each process is exclusively managed by a single BPF-THP. */
+ if (rcu_access_pointer(mm->bpf_mm.bpf_thp)) {
+ err = -EBUSY;
+ goto out;
+ }
+ rcu_assign_pointer(mm->bpf_mm.bpf_thp, bpf_thp);
+
+ mm_list = &bpf_thp->mm_list;
+ INIT_LIST_HEAD(mm_list);
+ list_add_tail(&mm->bpf_mm.bpf_thp_list, mm_list);
+
+out:
+ spin_unlock(&thp_ops_lock);
+ mmput(mm);
+ return err;
+}
+
+static void bpf_thp_unreg(void *kdata, struct bpf_link *link)
+{
+ struct bpf_thp_ops *bpf_thp = kdata;
+ struct bpf_mm_ops *bpf_mm;
+ struct list_head *pos, *n;
+
+ spin_lock(&thp_ops_lock);
+ list_for_each_safe(pos, n, &bpf_thp->mm_list) {
+ bpf_mm = list_entry(pos, struct bpf_mm_ops, bpf_thp_list);
+ WARN_ON_ONCE(!bpf_mm);
+ rcu_replace_pointer(bpf_mm->bpf_thp, NULL, lockdep_is_held(&thp_ops_lock));
+ list_del(pos);
+ }
+ spin_unlock(&thp_ops_lock);
+
+ synchronize_rcu();
+}
+
+static int bpf_thp_update(void *kdata, void *old_kdata, struct bpf_link *link)
+{
+ struct bpf_thp_ops *old_bpf_thp = old_kdata;
+ struct bpf_thp_ops *bpf_thp = kdata;
+ struct bpf_mm_ops *bpf_mm;
+ struct list_head *pos, *n;
+
+ INIT_LIST_HEAD(&bpf_thp->mm_list);
+
+ /* Could be optimized to a per-instance lock if this lock becomes a bottleneck. */
+ spin_lock(&thp_ops_lock);
+ list_for_each_safe(pos, n, &old_bpf_thp->mm_list) {
+ bpf_mm = list_entry(pos, struct bpf_mm_ops, bpf_thp_list);
+ WARN_ON_ONCE(!bpf_mm);
+ rcu_replace_pointer(bpf_mm->bpf_thp, bpf_thp, lockdep_is_held(&thp_ops_lock));
+ list_del(pos);
+ list_add_tail(&bpf_mm->bpf_thp_list, &bpf_thp->mm_list);
+ }
+ spin_unlock(&thp_ops_lock);
+
+ synchronize_rcu();
+ return 0;
+}
+
+static int bpf_thp_validate(void *kdata)
+{
+ struct bpf_thp_ops *ops = kdata;
+
+ if (!ops->thp_get_order) {
+ pr_err("bpf_thp: required ops isn't implemented\n");
+ return -EINVAL;
+ }
+ return 0;
+}
+
+static int bpf_thp_get_order(struct vm_area_struct *vma,
+ enum tva_type type,
+ unsigned long orders)
+{
+ return -1;
+}
+
+static struct bpf_thp_ops __bpf_thp_ops = {
+ .thp_get_order = (thp_order_fn_t *)bpf_thp_get_order,
+};
+
+static struct bpf_struct_ops bpf_bpf_thp_ops = {
+ .verifier_ops = &thp_bpf_verifier_ops,
+ .init = bpf_thp_init,
+ .check_member = bpf_thp_check_member,
+ .init_member = bpf_thp_init_member,
+ .reg = bpf_thp_reg,
+ .unreg = bpf_thp_unreg,
+ .update = bpf_thp_update,
+ .validate = bpf_thp_validate,
+ .cfi_stubs = &__bpf_thp_ops,
+ .owner = THIS_MODULE,
+ .name = "bpf_thp_ops",
+};
+
+static int __init bpf_thp_ops_init(void)
+{
+ int err;
+
+ err = register_bpf_struct_ops(&bpf_bpf_thp_ops, bpf_thp_ops);
+ if (err)
+ pr_err("bpf_thp: Failed to register struct_ops (%d)\n", err);
+ return err;
+}
+late_initcall(bpf_thp_ops_init);
diff --git a/mm/mmap.c b/mm/mmap.c
index 644f02071a41..cf811e6678e3 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1841,6 +1841,7 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
vma_iter_free(&vmi);
if (!retval) {
mt_set_in_rcu(vmi.mas.tree);
+ bpf_thp_fork(mm, oldmm);
ksm_fork(mm, oldmm);
khugepaged_fork(mm, oldmm);
} else {
--
2.47.3
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH v12 mm-new 04/10] mm: thp: decouple THP allocation between swap and page fault paths
2025-10-26 10:01 [PATCH v12 mm-new 00/10] mm, bpf: BPF-MM, BPF-THP Yafang Shao
` (2 preceding siblings ...)
2025-10-26 10:01 ` [PATCH v12 mm-new 03/10] mm: thp: add support for BPF based THP order selection Yafang Shao
@ 2025-10-26 10:01 ` Yafang Shao
2025-10-27 4:07 ` Barry Song
2025-10-26 10:01 ` [PATCH v12 mm-new 05/10] mm: thp: enable THP allocation exclusively through khugepaged Yafang Shao
` (5 subsequent siblings)
9 siblings, 1 reply; 29+ messages in thread
From: Yafang Shao @ 2025-10-26 10:01 UTC (permalink / raw)
To: akpm, ast, daniel, andrii, david, lorenzo.stoakes
Cc: martin.lau, eddyz87, song, yonghong.song, john.fastabend,
kpsingh, sdf, haoluo, jolsa, ziy, Liam.Howlett, npache,
ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
willy, ameryhung, rientjes, corbet, 21cnbao, shakeel.butt, tj,
lance.yang, rdunlap, clm, bpf, linux-mm, Yafang Shao
The new BPF capability enables finer-grained THP policy decisions by
introducing separate handling for swap faults versus normal page faults.
As highlighted by Barry:
We’ve observed that swapping in large folios can lead to more
swap thrashing for some workloads- e.g. kernel build. Consequently,
some workloads might prefer swapping in smaller folios than those
allocated by alloc_anon_folio().
While prtcl() could potentially be extended to leverage this new policy,
doing so would require modifications to the uAPI.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Usama Arif <usamaarif642@gmail.com>
Cc: Barry Song <21cnbao@gmail.com>
---
include/linux/huge_mm.h | 3 ++-
mm/huge_memory.c | 2 +-
mm/memory.c | 2 +-
3 files changed, 4 insertions(+), 3 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 49050455f793..7867411b2a21 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -96,9 +96,10 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
enum tva_type {
TVA_SMAPS, /* Exposing "THPeligible:" in smaps. */
- TVA_PAGEFAULT, /* Serving a page fault. */
+ TVA_PAGEFAULT, /* Serving a non-swap page fault. */
TVA_KHUGEPAGED, /* Khugepaged collapse. */
TVA_FORCED_COLLAPSE, /* Forced collapse (e.g. MADV_COLLAPSE). */
+ TVA_SWAP_PAGEFAULT, /* serving a swap page fault. */
};
#define thp_vma_allowable_order(vma, type, order) \
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index db9a2a24d58c..0bfbb672a559 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -102,7 +102,7 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
unsigned long orders)
{
const bool smaps = type == TVA_SMAPS;
- const bool in_pf = type == TVA_PAGEFAULT;
+ const bool in_pf = (type == TVA_PAGEFAULT || type == TVA_SWAP_PAGEFAULT);
const bool forced_collapse = type == TVA_FORCED_COLLAPSE;
unsigned long supported_orders;
vm_flags_t vm_flags = vma->vm_flags;
diff --git a/mm/memory.c b/mm/memory.c
index 7b52068372d8..c6a766b271ef 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4558,7 +4558,7 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
* Get a list of all the (large) orders below PMD_ORDER that are enabled
* and suitable for swapping THP.
*/
- orders = thp_vma_allowable_orders(vma, TVA_PAGEFAULT,
+ orders = thp_vma_allowable_orders(vma, TVA_SWAP_PAGEFAULT,
BIT(PMD_ORDER) - 1);
orders = thp_vma_suitable_orders(vma, vmf->address, orders);
orders = thp_swap_suitable_orders(swp_offset(entry),
--
2.47.3
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH v12 mm-new 05/10] mm: thp: enable THP allocation exclusively through khugepaged
2025-10-26 10:01 [PATCH v12 mm-new 00/10] mm, bpf: BPF-MM, BPF-THP Yafang Shao
` (3 preceding siblings ...)
2025-10-26 10:01 ` [PATCH v12 mm-new 04/10] mm: thp: decouple THP allocation between swap and page fault paths Yafang Shao
@ 2025-10-26 10:01 ` Yafang Shao
2025-10-26 10:01 ` [PATCH v12 mm-new 06/10] mm: bpf-thp: add support for global mode Yafang Shao
` (4 subsequent siblings)
9 siblings, 0 replies; 29+ messages in thread
From: Yafang Shao @ 2025-10-26 10:01 UTC (permalink / raw)
To: akpm, ast, daniel, andrii, david, lorenzo.stoakes
Cc: martin.lau, eddyz87, song, yonghong.song, john.fastabend,
kpsingh, sdf, haoluo, jolsa, ziy, Liam.Howlett, npache,
ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
willy, ameryhung, rientjes, corbet, 21cnbao, shakeel.butt, tj,
lance.yang, rdunlap, clm, bpf, linux-mm, Yafang Shao
khugepaged_enter_vma() ultimately invokes any attached BPF function with
the TVA_KHUGEPAGED flag set when determining whether or not to enable
khugepaged THP for a freshly faulted in VMA.
Currently, on fault, we invoke this in do_huge_pmd_anonymous_page(), as
invoked by create_huge_pmd() and only when we have already checked to
see if an allowable TVA_PAGEFAULT order is specified.
Since we might want to disallow THP on fault-in but allow it via
khugepaged, we move things around so we always attempt to enter
khugepaged upon fault.
This change is safe because:
- khugepaged operates at the MM level rather than per-VMA. The THP
allocation might fail during page faults due to transient conditions
(e.g., memory pressure), it is safe to add this MM to khugepaged for
subsequent defragmentation.
- If __thp_vma_allowable_orders(TVA_PAGEFAULT) returns 0, then
__thp_vma_allowable_orders(TVA_KHUGEPAGED) will also return 0.
While we could also extend prctl() to utilize this new policy, such a
change would require a uAPI modification to PR_SET_THP_DISABLE.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Lance Yang <lance.yang@linux.dev>
Cc: Usama Arif <usamaarif642@gmail.com>
---
mm/huge_memory.c | 1 -
mm/memory.c | 13 ++++++++-----
2 files changed, 8 insertions(+), 6 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0bfbb672a559..b675c9041c0f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1476,7 +1476,6 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
ret = vmf_anon_prepare(vmf);
if (ret)
return ret;
- khugepaged_enter_vma(vma);
if (!(vmf->flags & FAULT_FLAG_WRITE) &&
!mm_forbids_zeropage(vma->vm_mm) &&
diff --git a/mm/memory.c b/mm/memory.c
index c6a766b271ef..3e2857b30f3b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6336,11 +6336,14 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
if (pud_trans_unstable(vmf.pud))
goto retry_pud;
- if (pmd_none(*vmf.pmd) &&
- thp_vma_allowable_order(vma, TVA_PAGEFAULT, PMD_ORDER)) {
- ret = create_huge_pmd(&vmf);
- if (!(ret & VM_FAULT_FALLBACK))
- return ret;
+ if (pmd_none(*vmf.pmd)) {
+ if (vma_is_anonymous(vma))
+ khugepaged_enter_vma(vma);
+ if (thp_vma_allowable_order(vma, TVA_PAGEFAULT, PMD_ORDER)) {
+ ret = create_huge_pmd(&vmf);
+ if (!(ret & VM_FAULT_FALLBACK))
+ return ret;
+ }
} else {
vmf.orig_pmd = pmdp_get_lockless(vmf.pmd);
--
2.47.3
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH v12 mm-new 06/10] mm: bpf-thp: add support for global mode
2025-10-26 10:01 [PATCH v12 mm-new 00/10] mm, bpf: BPF-MM, BPF-THP Yafang Shao
` (4 preceding siblings ...)
2025-10-26 10:01 ` [PATCH v12 mm-new 05/10] mm: thp: enable THP allocation exclusively through khugepaged Yafang Shao
@ 2025-10-26 10:01 ` Yafang Shao
2025-10-29 1:32 ` Alexei Starovoitov
2025-10-26 10:01 ` [PATCH v12 mm-new 07/10] Documentation: add BPF THP Yafang Shao
` (3 subsequent siblings)
9 siblings, 1 reply; 29+ messages in thread
From: Yafang Shao @ 2025-10-26 10:01 UTC (permalink / raw)
To: akpm, ast, daniel, andrii, david, lorenzo.stoakes
Cc: martin.lau, eddyz87, song, yonghong.song, john.fastabend,
kpsingh, sdf, haoluo, jolsa, ziy, Liam.Howlett, npache,
ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
willy, ameryhung, rientjes, corbet, 21cnbao, shakeel.butt, tj,
lance.yang, rdunlap, clm, bpf, linux-mm, Yafang Shao
The per-process BPF-THP mode is unsuitable for managing shared resources
such as shmem THP and file-backed THP. This aligns with known cgroup
limitations for similar scenarios [0].
Introduce a global BPF-THP mode to address this gap. When registered:
- All existing per-process instances are disabled
- New per-process registrations are blocked
- Existing per-process instances remain registered (no forced unregistration)
The global mode takes precedence over per-process instances. Updates are
type-isolated: global instances can only be updated by new global
instances, and per-process instances by new per-process instances.
Link: https://lore.kernel.org/linux-mm/YwNold0GMOappUxc@slm.duckdns.org/ [0]
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
mm/huge_memory_bpf.c | 111 ++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 109 insertions(+), 2 deletions(-)
diff --git a/mm/huge_memory_bpf.c b/mm/huge_memory_bpf.c
index f69c5851ea61..f8383c2a299f 100644
--- a/mm/huge_memory_bpf.c
+++ b/mm/huge_memory_bpf.c
@@ -35,6 +35,30 @@ struct bpf_thp_ops {
};
static DEFINE_SPINLOCK(thp_ops_lock);
+static struct bpf_thp_ops __rcu *bpf_thp_global; /* global mode */
+
+static unsigned long
+bpf_hook_thp_get_orders_global(struct vm_area_struct *vma,
+ enum tva_type type,
+ unsigned long orders)
+{
+ static struct bpf_thp_ops *bpf_thp;
+ int bpf_order;
+
+ rcu_read_lock();
+ bpf_thp = rcu_dereference(bpf_thp_global);
+ if (!bpf_thp || !bpf_thp->thp_get_order)
+ goto out;
+
+ bpf_order = bpf_thp->thp_get_order(vma, type, orders);
+ if (bpf_order < 0)
+ goto out;
+ orders &= BIT(bpf_order);
+
+out:
+ rcu_read_unlock();
+ return orders;
+}
unsigned long bpf_hook_thp_get_orders(struct vm_area_struct *vma,
enum tva_type type,
@@ -47,6 +71,10 @@ unsigned long bpf_hook_thp_get_orders(struct vm_area_struct *vma,
if (!mm)
return orders;
+ /* Global BPF-THP takes precedence over per-process BPF-THP. */
+ if (rcu_access_pointer(bpf_thp_global))
+ return bpf_hook_thp_get_orders_global(vma, type, orders);
+
rcu_read_lock();
bpf_thp = rcu_dereference(mm->bpf_mm.bpf_thp);
if (!bpf_thp || !bpf_thp->thp_get_order)
@@ -181,6 +209,23 @@ static int bpf_thp_init_member(const struct btf_type *t,
return 0;
}
+static int bpf_thp_reg_gloabl(void *kdata, struct bpf_link *link)
+{
+ struct bpf_thp_ops *ops = kdata;
+
+ /* Protect the global pointer bpf_thp_global from concurrent writes. */
+ spin_lock(&thp_ops_lock);
+ /* Only one instance is allowed. */
+ if (rcu_access_pointer(bpf_thp_global)) {
+ spin_unlock(&thp_ops_lock);
+ return -EBUSY;
+ }
+
+ rcu_assign_pointer(bpf_thp_global, ops);
+ spin_unlock(&thp_ops_lock);
+ return 0;
+}
+
static int bpf_thp_reg(void *kdata, struct bpf_link *link)
{
struct bpf_thp_ops *bpf_thp = kdata;
@@ -191,6 +236,11 @@ static int bpf_thp_reg(void *kdata, struct bpf_link *link)
pid_t pid;
pid = bpf_thp->pid;
+
+ /* Fallback to global mode if pid is not set. */
+ if (!pid)
+ return bpf_thp_reg_gloabl(kdata, link);
+
p = find_get_task_by_vpid(pid);
if (!p)
return -ESRCH;
@@ -209,8 +259,10 @@ static int bpf_thp_reg(void *kdata, struct bpf_link *link)
* might register this task simultaneously.
*/
spin_lock(&thp_ops_lock);
- /* Each process is exclusively managed by a single BPF-THP. */
- if (rcu_access_pointer(mm->bpf_mm.bpf_thp)) {
+ /* Each process is exclusively managed by a single BPF-THP.
+ * Global mode disables per-process instances.
+ */
+ if (rcu_access_pointer(mm->bpf_mm.bpf_thp) || rcu_access_pointer(bpf_thp_global)) {
err = -EBUSY;
goto out;
}
@@ -226,12 +278,33 @@ static int bpf_thp_reg(void *kdata, struct bpf_link *link)
return err;
}
+static void bpf_thp_unreg_global(void *kdata, struct bpf_link *link)
+{
+ struct bpf_thp_ops *bpf_thp;
+
+ spin_lock(&thp_ops_lock);
+ if (!rcu_access_pointer(bpf_thp_global)) {
+ spin_unlock(&thp_ops_lock);
+ return;
+ }
+
+ bpf_thp = rcu_replace_pointer(bpf_thp_global, NULL,
+ lockdep_is_held(&thp_ops_lock));
+ WARN_ON_ONCE(!bpf_thp);
+ spin_unlock(&thp_ops_lock);
+
+ synchronize_rcu();
+}
+
static void bpf_thp_unreg(void *kdata, struct bpf_link *link)
{
struct bpf_thp_ops *bpf_thp = kdata;
struct bpf_mm_ops *bpf_mm;
struct list_head *pos, *n;
+ if (!bpf_thp->pid)
+ return bpf_thp_unreg_global(kdata, link);
+
spin_lock(&thp_ops_lock);
list_for_each_safe(pos, n, &bpf_thp->mm_list) {
bpf_mm = list_entry(pos, struct bpf_mm_ops, bpf_thp_list);
@@ -244,6 +317,31 @@ static void bpf_thp_unreg(void *kdata, struct bpf_link *link)
synchronize_rcu();
}
+static int bpf_thp_update_global(void *kdata, void *old_kdata, struct bpf_link *link)
+{
+ struct bpf_thp_ops *old_bpf_thp = old_kdata;
+ struct bpf_thp_ops *bpf_thp = kdata;
+ struct bpf_thp_ops *old_global;
+
+ if (!old_bpf_thp || !bpf_thp)
+ return -EINVAL;
+
+ spin_lock(&thp_ops_lock);
+ /* BPF-THP global instance has already been removed. */
+ if (!rcu_access_pointer(bpf_thp_global)) {
+ spin_unlock(&thp_ops_lock);
+ return -ENOENT;
+ }
+
+ old_global = rcu_replace_pointer(bpf_thp_global, bpf_thp,
+ lockdep_is_held(&thp_ops_lock));
+ WARN_ON_ONCE(!old_global);
+ spin_unlock(&thp_ops_lock);
+
+ synchronize_rcu();
+ return 0;
+}
+
static int bpf_thp_update(void *kdata, void *old_kdata, struct bpf_link *link)
{
struct bpf_thp_ops *old_bpf_thp = old_kdata;
@@ -251,6 +349,15 @@ static int bpf_thp_update(void *kdata, void *old_kdata, struct bpf_link *link)
struct bpf_mm_ops *bpf_mm;
struct list_head *pos, *n;
+ /* Updates are confined to instances of the same scope:
+ * global to global, process-local to process-local.
+ */
+ if (!!old_bpf_thp->pid != !!bpf_thp->pid)
+ return -EINVAL;
+
+ if (!old_bpf_thp->pid)
+ return bpf_thp_update_global(kdata, old_kdata, link);
+
INIT_LIST_HEAD(&bpf_thp->mm_list);
/* Could be optimized to a per-instance lock if this lock becomes a bottleneck. */
--
2.47.3
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH v12 mm-new 07/10] Documentation: add BPF THP
2025-10-26 10:01 [PATCH v12 mm-new 00/10] mm, bpf: BPF-MM, BPF-THP Yafang Shao
` (5 preceding siblings ...)
2025-10-26 10:01 ` [PATCH v12 mm-new 06/10] mm: bpf-thp: add support for global mode Yafang Shao
@ 2025-10-26 10:01 ` Yafang Shao
2025-10-26 10:01 ` [PATCH v12 mm-new 08/10] selftests/bpf: add a simple BPF based THP policy Yafang Shao
` (2 subsequent siblings)
9 siblings, 0 replies; 29+ messages in thread
From: Yafang Shao @ 2025-10-26 10:01 UTC (permalink / raw)
To: akpm, ast, daniel, andrii, david, lorenzo.stoakes
Cc: martin.lau, eddyz87, song, yonghong.song, john.fastabend,
kpsingh, sdf, haoluo, jolsa, ziy, Liam.Howlett, npache,
ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
willy, ameryhung, rientjes, corbet, 21cnbao, shakeel.butt, tj,
lance.yang, rdunlap, clm, bpf, linux-mm, Yafang Shao
Add the documentation.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
Documentation/admin-guide/mm/transhuge.rst | 113 +++++++++++++++++++++
mm/Kconfig | 2 +
2 files changed, 115 insertions(+)
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 2569a92fd96c..a85ebcf7e07c 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -776,3 +776,116 @@ support enabled just fine as always. No difference can be noted in
hugetlbfs other than there will be less overall fragmentation. All
usual features belonging to hugetlbfs are preserved and
unaffected. libhugetlbfs will also work fine as usual.
+
+BPF THP
+=======
+
+:Author: Yafang Shao <laoar.shao@gmail.com>
+:Date: October 2025
+
+Overview
+--------
+
+When the system is configured with "always" or "madvise" THP mode, a BPF program
+can be used to adjust THP allocation policies dynamically. This enables
+fine-grained control over THP decisions based on various factors including
+workload identity, allocation context, and system memory pressure.
+
+Program Interface
+-----------------
+
+This feature implements a struct_ops BPF program with the following interface::
+
+ struct bpf_thp_ops {
+ pid_t pid;
+ thp_order_fn_t *thp_get_order;
+ };
+
+Callback Functions
+------------------
+
+thp_get_order()
+~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ int thp_get_order(struct vm_area_struct *vma,
+ enum tva_type type,
+ unsigned long orders);
+
+Parameters
+^^^^^^^^^^
+
+``vma``
+ ``vm_area_struct`` associated with the THP allocation.
+
+``type``
+ TVA type for the current ``vma``.
+
+``orders``
+ Bitmask of available THP orders for this allocation.
+
+Return value
+^^^^^^^^^^^^
+
+- The suggested THP order for allocation from the BPF program
+- Must be a valid, available order from the provided ``orders`` bitmask
+
+Operation Modes
+---------------
+
+Per Process Mode
+~~~~~~~~~~~~~~~~
+
+When registering a BPF-THP with a specific PID, the program is installed in the
+target task's ``mm_struct``::
+
+ struct mm_struct {
+ struct bpf_thp_ops __rcu *bpf_thp;
+ };
+
+Inheritance Behavior
+^^^^^^^^^^^^^^^^^^^^
+
+- Existing child processes are unaffected
+- Newly forked children inherit the BPF-THP from their parent
+- The BPF-THP persists across execve() calls
+
+Management Rules
+^^^^^^^^^^^^^^^^
+
+- When a BPF-THP instance is unregistered, all managed tasks' ``bpf_thp``
+ pointers are reset to ``NULL``
+- When a BPF-THP instance is updated, all managed tasks' ``bpf_thp`` pointers
+ are automatically updated to the new version
+- Each process can be managed by only one BPF-THP instance at a time
+
+Global Mode
+~~~~~~~~~~~
+
+If no PID is specified during registration, the BPF-THP operates in global mode.
+In this mode, all tasks in the system are managed by the global instance.
+
+Global Mode Precedence
+^^^^^^^^^^^^^^^^^^^^^^
+
+- The global instance takes precedence over all per-process instances
+- All existing per-process instances are disabled when a global instance is
+ registered
+- New per-process registrations are blocked while a global instance is active
+- Existing per-process instances remain registered (no forced unregistration)
+
+Instance Management
+^^^^^^^^^^^^^^^^^^^
+
+- Updates are type-isolated: global instances can only be updated by new global
+ instances, and per-process instances by new per-process instances
+- Only one global BPF-THP can be registered at a time
+- Global instances can be updated dynamically without requiring task restarts
+
+Implementation Notes
+--------------------
+
+- This is currently an experimental feature
+- ``CONFIG_BPF_THP`` must be enabled to use this functionality
+- The feature depends on proper THP configuration ("always" or "madvise" mode)
diff --git a/mm/Kconfig b/mm/Kconfig
index 12a2fbdc0909..c374a0f4acc4 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1476,6 +1476,8 @@ config BPF_THP
Enable dynamic THP policy adjustment using BPF programs. This feature
is currently experimental.
+ See Documentation/admin-guide/mm/transhuge.rst for more information.
+
WARNING: This feature is unstable and may change in future kernel
endif # BPF_MM
--
2.47.3
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH v12 mm-new 08/10] selftests/bpf: add a simple BPF based THP policy
2025-10-26 10:01 [PATCH v12 mm-new 00/10] mm, bpf: BPF-MM, BPF-THP Yafang Shao
` (6 preceding siblings ...)
2025-10-26 10:01 ` [PATCH v12 mm-new 07/10] Documentation: add BPF THP Yafang Shao
@ 2025-10-26 10:01 ` Yafang Shao
2025-10-26 10:01 ` [PATCH v12 mm-new 09/10] selftests/bpf: add test case to update " Yafang Shao
2025-10-26 10:01 ` [PATCH v12 mm-new 10/10] selftests/bpf: add test case for BPF-THP inheritance across fork Yafang Shao
9 siblings, 0 replies; 29+ messages in thread
From: Yafang Shao @ 2025-10-26 10:01 UTC (permalink / raw)
To: akpm, ast, daniel, andrii, david, lorenzo.stoakes
Cc: martin.lau, eddyz87, song, yonghong.song, john.fastabend,
kpsingh, sdf, haoluo, jolsa, ziy, Liam.Howlett, npache,
ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
willy, ameryhung, rientjes, corbet, 21cnbao, shakeel.butt, tj,
lance.yang, rdunlap, clm, bpf, linux-mm, Yafang Shao
This test case implements a basic THP policy that sets THPeligible to 0 for
a specific task. I selected THPeligible for verification because its
straightforward nature makes it ideal for validating the BPF THP policy
functionality.
Below configs must be enabled for this test:
CONFIG_BPF_MM=y
CONFIG_BPF_THP=y
CONFIG_TRANSPARENT_HUGEPAGE=y
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
MAINTAINERS | 2 +
tools/testing/selftests/bpf/config | 3 +
.../selftests/bpf/prog_tests/thp_adjust.c | 245 ++++++++++++++++++
.../selftests/bpf/progs/test_thp_adjust.c | 24 ++
4 files changed, 274 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c
diff --git a/MAINTAINERS b/MAINTAINERS
index e8eeb7c89431..295cbda88580 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16524,6 +16524,8 @@ F: mm/huge_memory.c
F: mm/huge_memory_bpf.c
F: mm/khugepaged.c
F: mm/mm_slot.h
+F: tools/testing/selftests/bpf/prog_tests/thp_adjust.c
+F: tools/testing/selftests/bpf/progs/test_thp_adjust*
F: tools/testing/selftests/mm/khugepaged.c
F: tools/testing/selftests/mm/split_huge_page_test.c
F: tools/testing/selftests/mm/transhuge-stress.c
diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
index 70b28c1e653e..8e57c449173b 100644
--- a/tools/testing/selftests/bpf/config
+++ b/tools/testing/selftests/bpf/config
@@ -7,8 +7,10 @@ CONFIG_BPF_JIT=y
CONFIG_BPF_KPROBE_OVERRIDE=y
CONFIG_BPF_LIRC_MODE2=y
CONFIG_BPF_LSM=y
+CONFIG_BPF_MM=y
CONFIG_BPF_STREAM_PARSER=y
CONFIG_BPF_SYSCALL=y
+CONFIG_BPF_THP=y
# CONFIG_BPF_UNPRIV_DEFAULT_OFF is not set
CONFIG_CGROUP_BPF=y
CONFIG_CRYPTO_HMAC=y
@@ -115,6 +117,7 @@ CONFIG_SECURITY=y
CONFIG_SECURITYFS=y
CONFIG_SYN_COOKIES=y
CONFIG_TEST_BPF=m
+CONFIG_TRANSPARENT_HUGEPAGE=y
CONFIG_UDMABUF=y
CONFIG_USERFAULTFD=y
CONFIG_VSOCKETS=y
diff --git a/tools/testing/selftests/bpf/prog_tests/thp_adjust.c b/tools/testing/selftests/bpf/prog_tests/thp_adjust.c
new file mode 100644
index 000000000000..2b23e2d08092
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/thp_adjust.c
@@ -0,0 +1,245 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <sys/mman.h>
+#include <test_progs.h>
+#include "test_thp_adjust.skel.h"
+
+#define LEN (16 * 1024 * 1024) /* 16MB */
+#define THP_ENABLED_FILE "/sys/kernel/mm/transparent_hugepage/enabled"
+#define PMD_SIZE_FILE "/sys/kernel/mm/transparent_hugepage/hpage_pmd_size"
+
+static struct test_thp_adjust *skel;
+static char old_mode[32];
+static long pagesize;
+
+static int thp_mode_save(void)
+{
+ const char *start, *end;
+ char buf[128];
+ int fd, err;
+ size_t len;
+
+ fd = open(THP_ENABLED_FILE, O_RDONLY);
+ if (fd == -1)
+ return -1;
+
+ err = read(fd, buf, sizeof(buf) - 1);
+ if (err == -1)
+ goto close;
+
+ start = strchr(buf, '[');
+ end = start ? strchr(start, ']') : NULL;
+ if (!start || !end || end <= start) {
+ err = -1;
+ goto close;
+ }
+
+ len = end - start - 1;
+ if (len >= sizeof(old_mode))
+ len = sizeof(old_mode) - 1;
+ strncpy(old_mode, start + 1, len);
+ old_mode[len] = '\0';
+
+close:
+ close(fd);
+ return err;
+}
+
+static int thp_mode_set(const char *desired_mode)
+{
+ int fd, err;
+
+ fd = open(THP_ENABLED_FILE, O_RDWR);
+ if (fd == -1)
+ return -1;
+
+ err = write(fd, desired_mode, strlen(desired_mode));
+ close(fd);
+ return err;
+}
+
+static int thp_mode_reset(void)
+{
+ int fd, err;
+
+ fd = open(THP_ENABLED_FILE, O_WRONLY);
+ if (fd == -1)
+ return -1;
+
+ err = write(fd, old_mode, strlen(old_mode));
+ close(fd);
+ return err;
+}
+
+static char *thp_alloc(void)
+{
+ char *addr;
+ int err, i;
+
+ addr = mmap(NULL, LEN, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
+ if (addr == MAP_FAILED)
+ return NULL;
+
+ err = madvise(addr, LEN, MADV_HUGEPAGE);
+ if (err == -1)
+ goto unmap;
+
+ /* Accessing a single byte within a page is sufficient to trigger a page fault. */
+ for (i = 0; i < LEN; i += pagesize)
+ addr[i] = 1;
+ return addr;
+
+unmap:
+ munmap(addr, LEN);
+ return NULL;
+}
+
+static void thp_free(char *ptr)
+{
+ munmap(ptr, LEN);
+}
+
+static int get_pmd_order(void)
+{
+ ssize_t bytes_read, size;
+ int fd, order, ret = -1;
+ char buf[64], *endptr;
+
+ fd = open(PMD_SIZE_FILE, O_RDONLY);
+ if (fd < 0)
+ return -1;
+
+ bytes_read = read(fd, buf, sizeof(buf) - 1);
+ if (bytes_read <= 0)
+ goto close_fd;
+
+ /* Remove potential newline character */
+ if (buf[bytes_read - 1] == '\n')
+ buf[bytes_read - 1] = '\0';
+
+ size = strtoul(buf, &endptr, 10);
+ if (endptr == buf || *endptr != '\0')
+ goto close_fd;
+ if (size % pagesize != 0)
+ goto close_fd;
+ ret = size / pagesize;
+ if ((ret & (ret - 1)) == 0) {
+ order = 0;
+ while (ret > 1) {
+ ret >>= 1;
+ order++;
+ }
+ ret = order;
+ }
+
+close_fd:
+ close(fd);
+ return ret;
+}
+
+static int get_thp_eligible(pid_t pid, unsigned long addr)
+{
+ int this_vma = 0, eligible = -1;
+ unsigned long start, end;
+ char smaps_path[64];
+ FILE *smaps_file;
+ char line[4096];
+
+ snprintf(smaps_path, sizeof(smaps_path), "/proc/%d/smaps", pid);
+ smaps_file = fopen(smaps_path, "r");
+ if (!smaps_file)
+ return -1;
+
+ while (fgets(line, sizeof(line), smaps_file)) {
+ if (sscanf(line, "%lx-%lx", &start, &end) == 2) {
+ /* addr is monotonic */
+ if (addr < start)
+ break;
+ this_vma = (addr >= start && addr < end) ? 1 : 0;
+ continue;
+ }
+
+ if (!this_vma)
+ continue;
+
+ if (strstr(line, "THPeligible:")) {
+ sscanf(line, "THPeligible: %d", &eligible);
+ break;
+ }
+ }
+
+ fclose(smaps_file);
+ return eligible;
+}
+
+static void subtest_thp_eligible(void)
+{
+ struct bpf_link *ops_link;
+ int elighble;
+ char *ptr;
+
+ ops_link = bpf_map__attach_struct_ops(skel->maps.thp_eligible_ops);
+ if (!ASSERT_OK_PTR(ops_link, "attach struct_ops"))
+ return;
+
+ ptr = thp_alloc();
+ if (!ASSERT_OK_PTR(ptr, "THP alloc"))
+ goto detach;
+
+ elighble = get_thp_eligible(getpid(), (unsigned long)ptr);
+ ASSERT_EQ(elighble, 0, "THPeligible");
+
+ thp_free(ptr);
+detach:
+ bpf_link__destroy(ops_link);
+}
+
+static int thp_adjust_setup(void)
+{
+ int err = -1, pmd_order;
+
+ pagesize = sysconf(_SC_PAGESIZE);
+ pmd_order = get_pmd_order();
+ if (!ASSERT_NEQ(pmd_order, -1, "get_pmd_order"))
+ return -1;
+
+ if (!ASSERT_NEQ(thp_mode_save(), -1, "THP mode save"))
+ return -1;
+ if (!ASSERT_GE(thp_mode_set("madvise"), 0, "THP mode set"))
+ return -1;
+
+ skel = test_thp_adjust__open();
+ if (!ASSERT_OK_PTR(skel, "open"))
+ goto thp_reset;
+
+ skel->bss->pmd_order = pmd_order;
+ skel->struct_ops.thp_eligible_ops->pid = getpid();
+
+ err = test_thp_adjust__load(skel);
+ if (!ASSERT_OK(err, "load"))
+ goto destroy;
+ return 0;
+
+destroy:
+ test_thp_adjust__destroy(skel);
+thp_reset:
+ ASSERT_GE(thp_mode_reset(), 0, "THP mode reset");
+ return err;
+}
+
+static void thp_adjust_destroy(void)
+{
+ test_thp_adjust__destroy(skel);
+ ASSERT_GE(thp_mode_reset(), 0, "THP mode reset");
+}
+
+void test_thp_adjust(void)
+{
+ if (thp_adjust_setup() == -1)
+ return;
+
+ if (test__start_subtest("thp_eligible"))
+ subtest_thp_eligible();
+
+ thp_adjust_destroy();
+}
diff --git a/tools/testing/selftests/bpf/progs/test_thp_adjust.c b/tools/testing/selftests/bpf/progs/test_thp_adjust.c
new file mode 100644
index 000000000000..b180a7f9b923
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_thp_adjust.c
@@ -0,0 +1,24 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+char _license[] SEC("license") = "GPL";
+
+int pmd_order;
+
+SEC("struct_ops/thp_get_order")
+int BPF_PROG(thp_not_eligible, struct vm_area_struct *vma, enum tva_type type,
+ unsigned long orders)
+{
+ /* THPeligible in /proc/pid/smaps is 0 */
+ if (type == TVA_SMAPS)
+ return 0;
+ return pmd_order;
+}
+
+SEC(".struct_ops.link")
+struct bpf_thp_ops thp_eligible_ops = {
+ .thp_get_order = (void *)thp_not_eligible,
+};
--
2.47.3
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH v12 mm-new 09/10] selftests/bpf: add test case to update THP policy
2025-10-26 10:01 [PATCH v12 mm-new 00/10] mm, bpf: BPF-MM, BPF-THP Yafang Shao
` (7 preceding siblings ...)
2025-10-26 10:01 ` [PATCH v12 mm-new 08/10] selftests/bpf: add a simple BPF based THP policy Yafang Shao
@ 2025-10-26 10:01 ` Yafang Shao
2025-10-26 10:01 ` [PATCH v12 mm-new 10/10] selftests/bpf: add test case for BPF-THP inheritance across fork Yafang Shao
9 siblings, 0 replies; 29+ messages in thread
From: Yafang Shao @ 2025-10-26 10:01 UTC (permalink / raw)
To: akpm, ast, daniel, andrii, david, lorenzo.stoakes
Cc: martin.lau, eddyz87, song, yonghong.song, john.fastabend,
kpsingh, sdf, haoluo, jolsa, ziy, Liam.Howlett, npache,
ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
willy, ameryhung, rientjes, corbet, 21cnbao, shakeel.butt, tj,
lance.yang, rdunlap, clm, bpf, linux-mm, Yafang Shao
This test case exercises the BPF THP update mechanism by modifying an
existing policy. The behavior confirms that:
- EBUSY error occurs when attempting to install a BPF program on a process
that already has an active BPF program
- Updates to currently running programs are successfully processed
- Local prog can't be updated by a global prog
- Global prog can't be updated by a local prog
- Global prog can be attached even if there's a local prog
- Local prog can't be attached if there's a global prog
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
.../selftests/bpf/prog_tests/thp_adjust.c | 79 +++++++++++++++++++
.../selftests/bpf/progs/test_thp_adjust.c | 29 +++++++
2 files changed, 108 insertions(+)
diff --git a/tools/testing/selftests/bpf/prog_tests/thp_adjust.c b/tools/testing/selftests/bpf/prog_tests/thp_adjust.c
index 2b23e2d08092..0d570cee9006 100644
--- a/tools/testing/selftests/bpf/prog_tests/thp_adjust.c
+++ b/tools/testing/selftests/bpf/prog_tests/thp_adjust.c
@@ -194,6 +194,79 @@ static void subtest_thp_eligible(void)
bpf_link__destroy(ops_link);
}
+static void subtest_thp_policy_update(void)
+{
+ struct bpf_link *old_link, *new_link;
+ int elighble, err, pid;
+ char *ptr;
+
+ pid = getpid();
+ ptr = thp_alloc();
+
+ old_link = bpf_map__attach_struct_ops(skel->maps.thp_eligible_ops);
+ if (!ASSERT_OK_PTR(old_link, "attach_old_link"))
+ goto free;
+
+ elighble = get_thp_eligible(pid, (unsigned long)ptr);
+ ASSERT_EQ(elighble, 0, "THPeligible");
+
+ /* Attach multi BPF-THP to a single process is rejected. */
+ new_link = bpf_map__attach_struct_ops(skel->maps.thp_eligible_ops2);
+ if (!ASSERT_NULL(new_link, "attach_new_link"))
+ goto destory_old;
+ ASSERT_EQ(errno, EBUSY, "attach_new_link");
+
+ elighble = get_thp_eligible(pid, (unsigned long)ptr);
+ ASSERT_EQ(elighble, 0, "THPeligible");
+
+ err = bpf_link__update_map(old_link, skel->maps.thp_eligible_ops2);
+ ASSERT_EQ(err, 0, "update_old_link");
+
+ elighble = get_thp_eligible(pid, (unsigned long)ptr);
+ ASSERT_EQ(elighble, 1, "THPeligible");
+
+ /* Per process prog can't be update by a global prog */
+ err = bpf_link__update_map(old_link, skel->maps.swap_ops);
+ ASSERT_EQ(err, -EINVAL, "update_old_link");
+
+destory_old:
+ bpf_link__destroy(old_link);
+free:
+ thp_free(ptr);
+}
+
+static void subtest_thp_global_policy(void)
+{
+ struct bpf_link *local_link, *global_link;
+ int err;
+
+ local_link = bpf_map__attach_struct_ops(skel->maps.thp_eligible_ops);
+ if (!ASSERT_OK_PTR(local_link, "attach_local_link"))
+ return;
+
+ /* global prog can be attached even if there is a local prog */
+ global_link = bpf_map__attach_struct_ops(skel->maps.swap_ops);
+ if (!ASSERT_OK_PTR(global_link, "attach_global_link")) {
+ bpf_link__destroy(local_link);
+ return;
+ }
+
+ bpf_link__destroy(local_link);
+
+ /* local prog can't be attaached if there is a global prog */
+ local_link = bpf_map__attach_struct_ops(skel->maps.thp_eligible_ops);
+ if (!ASSERT_NULL(local_link, "attach_new_link"))
+ goto destory_global;
+ ASSERT_EQ(errno, EBUSY, "attach_new_link");
+
+ /* global prog can't be updated by a local prog */
+ err = bpf_link__update_map(global_link, skel->maps.thp_eligible_ops);
+ ASSERT_EQ(err, -EINVAL, "update_old_link");
+
+destory_global:
+ bpf_link__destroy(global_link);
+}
+
static int thp_adjust_setup(void)
{
int err = -1, pmd_order;
@@ -214,6 +287,8 @@ static int thp_adjust_setup(void)
skel->bss->pmd_order = pmd_order;
skel->struct_ops.thp_eligible_ops->pid = getpid();
+ skel->struct_ops.thp_eligible_ops2->pid = getpid();
+ /* swap_ops is a global prog since its pid is not set. */
err = test_thp_adjust__load(skel);
if (!ASSERT_OK(err, "load"))
@@ -240,6 +315,10 @@ void test_thp_adjust(void)
if (test__start_subtest("thp_eligible"))
subtest_thp_eligible();
+ if (test__start_subtest("policy_update"))
+ subtest_thp_policy_update();
+ if (test__start_subtest("global_policy"))
+ subtest_thp_global_policy();
thp_adjust_destroy();
}
diff --git a/tools/testing/selftests/bpf/progs/test_thp_adjust.c b/tools/testing/selftests/bpf/progs/test_thp_adjust.c
index b180a7f9b923..44648326819a 100644
--- a/tools/testing/selftests/bpf/progs/test_thp_adjust.c
+++ b/tools/testing/selftests/bpf/progs/test_thp_adjust.c
@@ -22,3 +22,32 @@ SEC(".struct_ops.link")
struct bpf_thp_ops thp_eligible_ops = {
.thp_get_order = (void *)thp_not_eligible,
};
+
+SEC("struct_ops/thp_get_order")
+int BPF_PROG(thp_eligible, struct vm_area_struct *vma, enum tva_type type,
+ unsigned long orders)
+{
+ /* THPeligible in /proc/pid/smaps is 1 */
+ if (type == TVA_SMAPS)
+ return pmd_order;
+ return pmd_order;
+}
+
+SEC(".struct_ops.link")
+struct bpf_thp_ops thp_eligible_ops2 = {
+ .thp_get_order = (void *)thp_eligible,
+};
+
+SEC("struct_ops/thp_get_order")
+int BPF_PROG(alloc_not_in_swap, struct vm_area_struct *vma, enum tva_type type,
+ unsigned long orders)
+{
+ if (type == TVA_SWAP_PAGEFAULT)
+ return 0;
+ return -1;
+}
+
+SEC(".struct_ops.link")
+struct bpf_thp_ops swap_ops = {
+ .thp_get_order = (void *)alloc_not_in_swap,
+};
--
2.47.3
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH v12 mm-new 10/10] selftests/bpf: add test case for BPF-THP inheritance across fork
2025-10-26 10:01 [PATCH v12 mm-new 00/10] mm, bpf: BPF-MM, BPF-THP Yafang Shao
` (8 preceding siblings ...)
2025-10-26 10:01 ` [PATCH v12 mm-new 09/10] selftests/bpf: add test case to update " Yafang Shao
@ 2025-10-26 10:01 ` Yafang Shao
9 siblings, 0 replies; 29+ messages in thread
From: Yafang Shao @ 2025-10-26 10:01 UTC (permalink / raw)
To: akpm, ast, daniel, andrii, david, lorenzo.stoakes
Cc: martin.lau, eddyz87, song, yonghong.song, john.fastabend,
kpsingh, sdf, haoluo, jolsa, ziy, Liam.Howlett, npache,
ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
willy, ameryhung, rientjes, corbet, 21cnbao, shakeel.butt, tj,
lance.yang, rdunlap, clm, bpf, linux-mm, Yafang Shao
Verify that child processes correctly inherit BPF-THP policy from their
parent during fork() operations.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
.../selftests/bpf/prog_tests/thp_adjust.c | 33 +++++++++++++++++++
1 file changed, 33 insertions(+)
diff --git a/tools/testing/selftests/bpf/prog_tests/thp_adjust.c b/tools/testing/selftests/bpf/prog_tests/thp_adjust.c
index 0d570cee9006..f585e60882e8 100644
--- a/tools/testing/selftests/bpf/prog_tests/thp_adjust.c
+++ b/tools/testing/selftests/bpf/prog_tests/thp_adjust.c
@@ -267,6 +267,37 @@ static void subtest_thp_global_policy(void)
bpf_link__destroy(global_link);
}
+static void subtest_thp_fork(void)
+{
+ int elighble, child, pid, status;
+ struct bpf_link *ops_link;
+ char *ptr;
+
+ ops_link = bpf_map__attach_struct_ops(skel->maps.thp_eligible_ops);
+ if (!ASSERT_OK_PTR(ops_link, "attach struct_ops"))
+ return;
+
+ child = fork();
+ if (!ASSERT_GE(child, 0, "fork"))
+ goto destroy;
+
+ if (child == 0) {
+ ptr = thp_alloc();
+ elighble = get_thp_eligible(getpid(), (unsigned long)ptr);
+ ASSERT_EQ(elighble, 0, "THPeligible");
+ thp_free(ptr);
+
+ exit(EXIT_SUCCESS);
+ }
+
+ pid = waitpid(child, &status, 0);
+ ASSERT_EQ(pid, child, "waitpid");
+
+destroy:
+ bpf_link__destroy(ops_link);
+
+}
+
static int thp_adjust_setup(void)
{
int err = -1, pmd_order;
@@ -319,6 +350,8 @@ void test_thp_adjust(void)
subtest_thp_policy_update();
if (test__start_subtest("global_policy"))
subtest_thp_global_policy();
+ if (test__start_subtest("thp_fork"))
+ subtest_thp_fork();
thp_adjust_destroy();
}
--
2.47.3
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v12 mm-new 04/10] mm: thp: decouple THP allocation between swap and page fault paths
2025-10-26 10:01 ` [PATCH v12 mm-new 04/10] mm: thp: decouple THP allocation between swap and page fault paths Yafang Shao
@ 2025-10-27 4:07 ` Barry Song
0 siblings, 0 replies; 29+ messages in thread
From: Barry Song @ 2025-10-27 4:07 UTC (permalink / raw)
To: Yafang Shao
Cc: akpm, ast, daniel, andrii, david, lorenzo.stoakes, martin.lau,
eddyz87, song, yonghong.song, john.fastabend, kpsingh, sdf,
haoluo, jolsa, ziy, Liam.Howlett, npache, ryan.roberts, dev.jain,
hannes, usamaarif642, gutierrez.asier, willy, ameryhung,
rientjes, corbet, shakeel.butt, tj, lance.yang, rdunlap, clm,
bpf, linux-mm
On Sun, Oct 26, 2025 at 6:02 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> The new BPF capability enables finer-grained THP policy decisions by
> introducing separate handling for swap faults versus normal page faults.
>
> As highlighted by Barry:
>
> We’ve observed that swapping in large folios can lead to more
> swap thrashing for some workloads- e.g. kernel build. Consequently,
> some workloads might prefer swapping in smaller folios than those
> allocated by alloc_anon_folio().
>
> While prtcl() could potentially be extended to leverage this new policy,
> doing so would require modifications to the uAPI.
>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Acked-by: Usama Arif <usamaarif642@gmail.com>
> Cc: Barry Song <21cnbao@gmail.com>
Thanks for addressing this.
Acked-by: Barry Song <baohua@kernel.org>
Thanks
Barry
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v12 mm-new 06/10] mm: bpf-thp: add support for global mode
2025-10-26 10:01 ` [PATCH v12 mm-new 06/10] mm: bpf-thp: add support for global mode Yafang Shao
@ 2025-10-29 1:32 ` Alexei Starovoitov
2025-10-29 2:13 ` Yafang Shao
2025-11-26 15:13 ` Rik van Riel
0 siblings, 2 replies; 29+ messages in thread
From: Alexei Starovoitov @ 2025-10-29 1:32 UTC (permalink / raw)
To: Yafang Shao
Cc: Andrew Morton, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, David Hildenbrand, Lorenzo Stoakes,
Martin KaFai Lau, Eduard, Song Liu, Yonghong Song,
John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Zi Yan, Liam Howlett, npache, ryan.roberts, dev.jain,
Johannes Weiner, usamaarif642, gutierrez.asier, Matthew Wilcox,
Amery Hung, David Rientjes, Jonathan Corbet, Barry Song,
Shakeel Butt, Tejun Heo, lance.yang, Randy Dunlap, Chris Mason,
bpf, linux-mm
On Sun, Oct 26, 2025 at 3:03 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> The per-process BPF-THP mode is unsuitable for managing shared resources
> such as shmem THP and file-backed THP. This aligns with known cgroup
> limitations for similar scenarios [0].
>
> Introduce a global BPF-THP mode to address this gap. When registered:
> - All existing per-process instances are disabled
> - New per-process registrations are blocked
> - Existing per-process instances remain registered (no forced unregistration)
>
> The global mode takes precedence over per-process instances. Updates are
> type-isolated: global instances can only be updated by new global
> instances, and per-process instances by new per-process instances.
...
> spin_lock(&thp_ops_lock);
> - /* Each process is exclusively managed by a single BPF-THP. */
> - if (rcu_access_pointer(mm->bpf_mm.bpf_thp)) {
> + /* Each process is exclusively managed by a single BPF-THP.
> + * Global mode disables per-process instances.
> + */
> + if (rcu_access_pointer(mm->bpf_mm.bpf_thp) || rcu_access_pointer(bpf_thp_global)) {
> err = -EBUSY;
> goto out;
> }
You didn't address the issue and instead doubled down
on this broken global approach.
This bait-and-switch patchset is frankly disingenuous.
'lets code up some per-mm hack, since people will hate it anyway,
and I'm not going to use it either, and add this global mode
as a fake "fallback"...'
The way the previous thread evolved and this followup hack
I don't see a genuine desire to find a solution.
Just relentless push for global mode.
Nacked-by: Alexei Starovoitov <ast@kernel.org>
Please carry it in all future patches.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v12 mm-new 06/10] mm: bpf-thp: add support for global mode
2025-10-29 1:32 ` Alexei Starovoitov
@ 2025-10-29 2:13 ` Yafang Shao
2025-10-30 0:57 ` Alexei Starovoitov
2025-11-26 15:13 ` Rik van Riel
1 sibling, 1 reply; 29+ messages in thread
From: Yafang Shao @ 2025-10-29 2:13 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Andrew Morton, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, David Hildenbrand, Lorenzo Stoakes,
Martin KaFai Lau, Eduard, Song Liu, Yonghong Song,
John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Zi Yan, Liam Howlett, npache, ryan.roberts, dev.jain,
Johannes Weiner, usamaarif642, gutierrez.asier, Matthew Wilcox,
Amery Hung, David Rientjes, Jonathan Corbet, Barry Song,
Shakeel Butt, Tejun Heo, lance.yang, Randy Dunlap, Chris Mason,
bpf, linux-mm
On Wed, Oct 29, 2025 at 9:33 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Sun, Oct 26, 2025 at 3:03 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > The per-process BPF-THP mode is unsuitable for managing shared resources
> > such as shmem THP and file-backed THP. This aligns with known cgroup
> > limitations for similar scenarios [0].
> >
> > Introduce a global BPF-THP mode to address this gap. When registered:
> > - All existing per-process instances are disabled
> > - New per-process registrations are blocked
> > - Existing per-process instances remain registered (no forced unregistration)
> >
> > The global mode takes precedence over per-process instances. Updates are
> > type-isolated: global instances can only be updated by new global
> > instances, and per-process instances by new per-process instances.
>
> ...
>
> > spin_lock(&thp_ops_lock);
> > - /* Each process is exclusively managed by a single BPF-THP. */
> > - if (rcu_access_pointer(mm->bpf_mm.bpf_thp)) {
> > + /* Each process is exclusively managed by a single BPF-THP.
> > + * Global mode disables per-process instances.
> > + */
> > + if (rcu_access_pointer(mm->bpf_mm.bpf_thp) || rcu_access_pointer(bpf_thp_global)) {
> > err = -EBUSY;
> > goto out;
> > }
>
> You didn't address the issue and instead doubled down
> on this broken global approach.
>
> This bait-and-switch patchset is frankly disingenuous.
> 'lets code up some per-mm hack, since people will hate it anyway,
> and I'm not going to use it either, and add this global mode
> as a fake "fallback"...'
>
> The way the previous thread evolved and this followup hack
> I don't see a genuine desire to find a solution.
> Just relentless push for global mode.
>
> Nacked-by: Alexei Starovoitov <ast@kernel.org>
>
> Please carry it in all future patches.
To move forward, I'm happy to set the global mode aside for now and
potentially drop it in the next version. I'd really like to hear your
perspective on the per-process mode. Does this implementation meet
your needs?
--
Regards
Yafang
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v12 mm-new 06/10] mm: bpf-thp: add support for global mode
2025-10-29 2:13 ` Yafang Shao
@ 2025-10-30 0:57 ` Alexei Starovoitov
2025-10-30 2:40 ` Yafang Shao
2025-11-27 11:48 ` David Hildenbrand (Red Hat)
0 siblings, 2 replies; 29+ messages in thread
From: Alexei Starovoitov @ 2025-10-30 0:57 UTC (permalink / raw)
To: Yafang Shao
Cc: Andrew Morton, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, David Hildenbrand, Lorenzo Stoakes,
Martin KaFai Lau, Eduard, Song Liu, Yonghong Song,
John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Zi Yan, Liam Howlett, npache, ryan.roberts, dev.jain,
Johannes Weiner, usamaarif642, gutierrez.asier, Matthew Wilcox,
Amery Hung, David Rientjes, Jonathan Corbet, Barry Song,
Shakeel Butt, Tejun Heo, lance.yang, Randy Dunlap, Chris Mason,
bpf, linux-mm
On Tue, Oct 28, 2025 at 7:14 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Wed, Oct 29, 2025 at 9:33 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Sun, Oct 26, 2025 at 3:03 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > The per-process BPF-THP mode is unsuitable for managing shared resources
> > > such as shmem THP and file-backed THP. This aligns with known cgroup
> > > limitations for similar scenarios [0].
> > >
> > > Introduce a global BPF-THP mode to address this gap. When registered:
> > > - All existing per-process instances are disabled
> > > - New per-process registrations are blocked
> > > - Existing per-process instances remain registered (no forced unregistration)
> > >
> > > The global mode takes precedence over per-process instances. Updates are
> > > type-isolated: global instances can only be updated by new global
> > > instances, and per-process instances by new per-process instances.
> >
> > ...
> >
> > > spin_lock(&thp_ops_lock);
> > > - /* Each process is exclusively managed by a single BPF-THP. */
> > > - if (rcu_access_pointer(mm->bpf_mm.bpf_thp)) {
> > > + /* Each process is exclusively managed by a single BPF-THP.
> > > + * Global mode disables per-process instances.
> > > + */
> > > + if (rcu_access_pointer(mm->bpf_mm.bpf_thp) || rcu_access_pointer(bpf_thp_global)) {
> > > err = -EBUSY;
> > > goto out;
> > > }
> >
> > You didn't address the issue and instead doubled down
> > on this broken global approach.
> >
> > This bait-and-switch patchset is frankly disingenuous.
> > 'lets code up some per-mm hack, since people will hate it anyway,
> > and I'm not going to use it either, and add this global mode
> > as a fake "fallback"...'
> >
> > The way the previous thread evolved and this followup hack
> > I don't see a genuine desire to find a solution.
> > Just relentless push for global mode.
> >
> > Nacked-by: Alexei Starovoitov <ast@kernel.org>
> >
> > Please carry it in all future patches.
>
> To move forward, I'm happy to set the global mode aside for now and
> potentially drop it in the next version. I'd really like to hear your
> perspective on the per-process mode. Does this implementation meet
> your needs?
Attaching st_ops to task_struct or to mm_struct is a can of worms.
With cgroup-bpf we went through painful bugs with lifetime
of cgroup vs bpf, dying cgroups, wq deadlock, etc. All these
problems are behind us. With st_ops in mm_struct it will be more
painful. I'd rather not go that route.
And revist cgroup instead, since you were way too quick
to accept the pushback because all you wanted is global mode.
The main reason for pushback was:
"
Cgroup was designed for resource management not for grouping processes and
tune those processes
"
which was true when cgroup-v2 was designed, but that ship sailed
years ago when we introduced cgroup-bpf.
None of the progs are doing resource management and lots of infrastructure,
container management, and open source projects use cgroup-bpf
as a grouping of processes. bpf progs attached to cgroup/hook tuple
only care about processes within that cgroup. No resource management.
See __cgroup_bpf_check_dev_permission or __cgroup_bpf_run_filter_sysctl
and others.
The path is current->cgroup->bpf_progs and progs do exactly
what cgroup wasn't designed to do. They tune a set of processes.
You should do the same.
Also I really don't see a compelling use case for bpf in THP.
Your selftest is beyond primitive:
+int pmd_order;
+
+SEC("struct_ops/thp_get_order")
+int BPF_PROG(thp_not_eligible, struct vm_area_struct *vma, enum tva_type type,
+ unsigned long orders)
+{
+ /* THPeligible in /proc/pid/smaps is 0 */
+ if (type == TVA_SMAPS)
+ return 0;
+ return pmd_order;
+}
hard code this thing. Don't bother with bpf.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v12 mm-new 06/10] mm: bpf-thp: add support for global mode
2025-10-30 0:57 ` Alexei Starovoitov
@ 2025-10-30 2:40 ` Yafang Shao
2025-11-27 11:48 ` David Hildenbrand (Red Hat)
1 sibling, 0 replies; 29+ messages in thread
From: Yafang Shao @ 2025-10-30 2:40 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Andrew Morton, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, David Hildenbrand, Lorenzo Stoakes,
Martin KaFai Lau, Eduard, Song Liu, Yonghong Song,
John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Zi Yan, Liam Howlett, npache, ryan.roberts, dev.jain,
Johannes Weiner, usamaarif642, gutierrez.asier, Matthew Wilcox,
Amery Hung, David Rientjes, Jonathan Corbet, Barry Song,
Shakeel Butt, Tejun Heo, lance.yang, Randy Dunlap, Chris Mason,
bpf, linux-mm
On Thu, Oct 30, 2025 at 8:57 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Oct 28, 2025 at 7:14 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Wed, Oct 29, 2025 at 9:33 AM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Sun, Oct 26, 2025 at 3:03 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > The per-process BPF-THP mode is unsuitable for managing shared resources
> > > > such as shmem THP and file-backed THP. This aligns with known cgroup
> > > > limitations for similar scenarios [0].
> > > >
> > > > Introduce a global BPF-THP mode to address this gap. When registered:
> > > > - All existing per-process instances are disabled
> > > > - New per-process registrations are blocked
> > > > - Existing per-process instances remain registered (no forced unregistration)
> > > >
> > > > The global mode takes precedence over per-process instances. Updates are
> > > > type-isolated: global instances can only be updated by new global
> > > > instances, and per-process instances by new per-process instances.
> > >
> > > ...
> > >
> > > > spin_lock(&thp_ops_lock);
> > > > - /* Each process is exclusively managed by a single BPF-THP. */
> > > > - if (rcu_access_pointer(mm->bpf_mm.bpf_thp)) {
> > > > + /* Each process is exclusively managed by a single BPF-THP.
> > > > + * Global mode disables per-process instances.
> > > > + */
> > > > + if (rcu_access_pointer(mm->bpf_mm.bpf_thp) || rcu_access_pointer(bpf_thp_global)) {
> > > > err = -EBUSY;
> > > > goto out;
> > > > }
> > >
> > > You didn't address the issue and instead doubled down
> > > on this broken global approach.
> > >
> > > This bait-and-switch patchset is frankly disingenuous.
> > > 'lets code up some per-mm hack, since people will hate it anyway,
> > > and I'm not going to use it either, and add this global mode
> > > as a fake "fallback"...'
> > >
> > > The way the previous thread evolved and this followup hack
> > > I don't see a genuine desire to find a solution.
> > > Just relentless push for global mode.
> > >
> > > Nacked-by: Alexei Starovoitov <ast@kernel.org>
> > >
> > > Please carry it in all future patches.
> >
> > To move forward, I'm happy to set the global mode aside for now and
> > potentially drop it in the next version. I'd really like to hear your
> > perspective on the per-process mode. Does this implementation meet
> > your needs?
>
> Attaching st_ops to task_struct or to mm_struct is a can of worms.
The feedback suggests there may not have been an opportunity to review
patch #3 in detail yet. I would appreciate it if you could take a look
at the specific changes in that patch, as it addresses the core of the
implementation.
> With cgroup-bpf we went through painful bugs with lifetime
> of cgroup vs bpf, dying cgroups, wq deadlock, etc. All these
> problems are behind us.
The attachment-based design of cgroup-bpf creates significant
operational challenges. It lacks visibility, making it difficult to
identify which cgroups have active attachments, and requires explicit
author knowledge for clean detachment.
> With st_ops in mm_struct it will be more
> painful.
To save your time, I've pasted the relevant portion of patch #3 below:
When registering a BPF-THP, we specify the PID of a target task. The
BPF-THP is then installed in the task's `mm_struct`
struct mm_struct {
struct bpf_thp_ops __rcu *thp_thp;
};
Inheritance Behavior:
- Existing child processes are unaffected
- Newly forked children inherit the BPF-THP from their parent
- The BPF-THP persists across execve() calls
A new linked list tracks all tasks managed by each BPF-THP instance:
- Newly managed tasks are added to the list
- Exiting tasks are automatically removed from the list
- During BPF-THP unregistration (e.g., when the BPF link is removed), all
managed tasks have their bpf_thp pointer set to NULL
- BPF-THP instances can be dynamically updated, with all tracked tasks
automatically migrating to the new version.
This design simplifies BPF-THP management in production environments by
providing clear lifecycle management and preventing conflicts between
multiple BPF-THP instances.
To clarify, this design has no lifecycle issues. It provides clear
traceability: you can always identify who loaded the program and which
processes it's attached to. Moreover, removing either the loader or
the pinned bpf_link will completely remove the program and all its
associated state.
> I'd rather not go that route.
I'm glad we can talk about this directly—it saves us both a lot of guesswork.
>
> And revist cgroup instead, since you were way too quick
> to accept the pushback because all you wanted is global mode.
>
> The main reason for pushback was:
> "
> Cgroup was designed for resource management not for grouping processes and
> tune those processes
> "
>
> which was true when cgroup-v2 was designed, but that ship sailed
> years ago when we introduced cgroup-bpf.
> None of the progs are doing resource management and lots of infrastructure,
> container management, and open source projects use cgroup-bpf
> as a grouping of processes. bpf progs attached to cgroup/hook tuple
> only care about processes within that cgroup. No resource management.
> See __cgroup_bpf_check_dev_permission or __cgroup_bpf_run_filter_sysctl
> and others.
> The path is current->cgroup->bpf_progs and progs do exactly
> what cgroup wasn't designed to do. They tune a set of processes.
>
> You should do the same.
I'm fully supportive of a cgroup-based approach, as it simplifies
integration by requiring only a kubelet plugin instead of
modifications to containerd.
However, my primary concern is the potential for maintainer pushback,
given the historical precedent. For instance, a similar discussion in
the NUMA-balancing context saw cgroup maintainers insisting on a
process-based method (see link below):
https://lore.kernel.org/lkml/ldyynnd3ngxnu3bie7ezuavewshgfepro5kjids6cuxy4imzy5@nt5id7nj5kt7/
To proactively address this, what alternative plan would you recommend
if we encounter such resistance? It's unclear what a viable path
forward would be if we are committed to a cgroup-based approach but it
is ultimately rejected by the maintainers.
(Adding Michal to CC for visibility)
>
> Also I really don't see a compelling use case for bpf in THP.
I'd recommend familiarizing yourself with the THP implementation. This
would be beneficial for our discussion on the specific changes.
> Your selftest is beyond primitive:
> +int pmd_order;
> +
> +SEC("struct_ops/thp_get_order")
> +int BPF_PROG(thp_not_eligible, struct vm_area_struct *vma, enum tva_type type,
> + unsigned long orders)
> +{
> + /* THPeligible in /proc/pid/smaps is 0 */
> + if (type == TVA_SMAPS)
> + return 0;
> + return pmd_order;
> +}
> hard code this thing. Don't bother with bpf.
A prior implementation that combined these components existed in an
earlier version:
https://lore.kernel.org/linux-mm/20250729091807.84310-5-laoar.shao@gmail.com/
However, based on your previous guidance that fexit and struct_ops
should not be mixed, the current approach was adopted.
In summary, I'm happy to proceed with a cgroup-based implementation. I
would appreciate your support in addressing any concerns the cgroup
maintainers might have.
--
Regards
Yafang
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v12 mm-new 06/10] mm: bpf-thp: add support for global mode
2025-10-29 1:32 ` Alexei Starovoitov
2025-10-29 2:13 ` Yafang Shao
@ 2025-11-26 15:13 ` Rik van Riel
2025-11-27 2:35 ` Yafang Shao
1 sibling, 1 reply; 29+ messages in thread
From: Rik van Riel @ 2025-11-26 15:13 UTC (permalink / raw)
To: Alexei Starovoitov, Yafang Shao
Cc: Andrew Morton, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, David Hildenbrand, Lorenzo Stoakes,
Martin KaFai Lau, Eduard, Song Liu, Yonghong Song,
John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Zi Yan, Liam Howlett, npache, ryan.roberts, dev.jain,
Johannes Weiner, usamaarif642, gutierrez.asier, Matthew Wilcox,
Amery Hung, David Rientjes, Jonathan Corbet, Barry Song,
Shakeel Butt, Tejun Heo, lance.yang, Randy Dunlap, Chris Mason,
bpf, linux-mm
On Tue, 2025-10-28 at 18:32 -0700, Alexei Starovoitov wrote:
> On Sun, Oct 26, 2025 at 3:03 AM Yafang Shao <laoar.shao@gmail.com>
> wrote:
> >
> > The per-process BPF-THP mode is unsuitable for managing shared
> > resources
> > such as shmem THP and file-backed THP. This aligns with known
> > cgroup
> > limitations for similar scenarios [0].
> >
> > Introduce a global BPF-THP mode to address this gap. When
> > registered:
> > - All existing per-process instances are disabled
> > - New per-process registrations are blocked
> > - Existing per-process instances remain registered (no forced
> > unregistration)
> >
> > The global mode takes precedence over per-process instances.
> > Updates are
> > type-isolated: global instances can only be updated by new global
> > instances, and per-process instances by new per-process instances.
>
> ...
>
> > spin_lock(&thp_ops_lock);
> > - /* Each process is exclusively managed by a single BPF-THP.
> > */
> > - if (rcu_access_pointer(mm->bpf_mm.bpf_thp)) {
> > + /* Each process is exclusively managed by a single BPF-THP.
> > + * Global mode disables per-process instances.
> > + */
> > + if (rcu_access_pointer(mm->bpf_mm.bpf_thp) ||
> > rcu_access_pointer(bpf_thp_global)) {
> > err = -EBUSY;
> > goto out;
> > }
>
> You didn't address the issue and instead doubled down
> on this broken global approach.
>
> This bait-and-switch patchset is frankly disingenuous.
> 'lets code up some per-mm hack, since people will hate it anyway,
> and I'm not going to use it either, and add this global mode
> as a fake "fallback"...'
Should things be the other way around, where
per-process BPF THP policy overrides global
policy?
I can definitely see a use for global policy,
but also a reason to override it for some
programs or containers.
--
All Rights Reversed.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v12 mm-new 06/10] mm: bpf-thp: add support for global mode
2025-11-26 15:13 ` Rik van Riel
@ 2025-11-27 2:35 ` Yafang Shao
0 siblings, 0 replies; 29+ messages in thread
From: Yafang Shao @ 2025-11-27 2:35 UTC (permalink / raw)
To: Rik van Riel, Alexei Starovoitov
Cc: Andrew Morton, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, David Hildenbrand, Lorenzo Stoakes,
Martin KaFai Lau, Eduard, Song Liu, Yonghong Song,
John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Zi Yan, Liam Howlett, npache, ryan.roberts, dev.jain,
Johannes Weiner, usamaarif642, gutierrez.asier, Matthew Wilcox,
Amery Hung, David Rientjes, Jonathan Corbet, Barry Song,
Shakeel Butt, Tejun Heo, lance.yang, Randy Dunlap, Chris Mason,
bpf, linux-mm
On Wed, Nov 26, 2025 at 11:13 PM Rik van Riel <riel@surriel.com> wrote:
>
> On Tue, 2025-10-28 at 18:32 -0700, Alexei Starovoitov wrote:
> > On Sun, Oct 26, 2025 at 3:03 AM Yafang Shao <laoar.shao@gmail.com>
> > wrote:
> > >
> > > The per-process BPF-THP mode is unsuitable for managing shared
> > > resources
> > > such as shmem THP and file-backed THP. This aligns with known
> > > cgroup
> > > limitations for similar scenarios [0].
> > >
> > > Introduce a global BPF-THP mode to address this gap. When
> > > registered:
> > > - All existing per-process instances are disabled
> > > - New per-process registrations are blocked
> > > - Existing per-process instances remain registered (no forced
> > > unregistration)
> > >
> > > The global mode takes precedence over per-process instances.
> > > Updates are
> > > type-isolated: global instances can only be updated by new global
> > > instances, and per-process instances by new per-process instances.
> >
> > ...
> >
> > > spin_lock(&thp_ops_lock);
> > > - /* Each process is exclusively managed by a single BPF-THP.
> > > */
> > > - if (rcu_access_pointer(mm->bpf_mm.bpf_thp)) {
> > > + /* Each process is exclusively managed by a single BPF-THP.
> > > + * Global mode disables per-process instances.
> > > + */
> > > + if (rcu_access_pointer(mm->bpf_mm.bpf_thp) ||
> > > rcu_access_pointer(bpf_thp_global)) {
> > > err = -EBUSY;
> > > goto out;
> > > }
> >
> > You didn't address the issue and instead doubled down
> > on this broken global approach.
> >
> > This bait-and-switch patchset is frankly disingenuous.
> > 'lets code up some per-mm hack, since people will hate it anyway,
> > and I'm not going to use it either, and add this global mode
> > as a fake "fallback"...'
>
> Should things be the other way around, where
> per-process BPF THP policy overrides global
> policy?
Makes sense
>
> I can definitely see a use for global policy,
> but also a reason to override it for some
> programs or containers.
We have deployed BPF-THP across nearly all of our fleets for over six
months and have enabled THP for dozens of our services.
Based on our practical experience, the global mode has proven highly
useful as it establishes a default policy for all services. When a
specific THP policy is required for a particular service, we implement
it using dedicated BPF maps—such as thp-always, thp-madvise,
thp-never, or other custom policy maps.
That said, I also find value in combining a default global policy with
the ability to override it for certain processes or containers.
The global mode and the per-process/cgroup mode are not mutually
exclusive; they can coexist.
Through our use of BPF-THP, we have found that the most reliable
approach is to allocate all THP at the initial stage. If a service
dynamically allocates a large number of THP during runtime, it can
easily trigger compaction stalls—even on the latest upstream kernel.
Therefore, monitoring compaction stalls and memory pressure is
essential to determine when a service should stop allocating
additional THP.
--
Regards
Yafang
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v12 mm-new 06/10] mm: bpf-thp: add support for global mode
2025-10-30 0:57 ` Alexei Starovoitov
2025-10-30 2:40 ` Yafang Shao
@ 2025-11-27 11:48 ` David Hildenbrand (Red Hat)
2025-11-28 2:53 ` Yafang Shao
1 sibling, 1 reply; 29+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-11-27 11:48 UTC (permalink / raw)
To: Alexei Starovoitov, Yafang Shao
Cc: Andrew Morton, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Lorenzo Stoakes, Martin KaFai Lau, Eduard,
Song Liu, Yonghong Song, John Fastabend, KP Singh,
Stanislav Fomichev, Hao Luo, Jiri Olsa, Zi Yan, Liam Howlett,
npache, ryan.roberts, dev.jain, Johannes Weiner, usamaarif642,
gutierrez.asier, Matthew Wilcox, Amery Hung, David Rientjes,
Jonathan Corbet, Barry Song, Shakeel Butt, Tejun Heo, lance.yang,
Randy Dunlap, Chris Mason, bpf, linux-mm
>> To move forward, I'm happy to set the global mode aside for now and
>> potentially drop it in the next version. I'd really like to hear your
>> perspective on the per-process mode. Does this implementation meet
>> your needs?
I haven't had the capacity to follow the evolution of this patch set
unfortunately, just to comment on some points from my perspective.
First, I agree that the global mode is not what we want, not even as a
fallback.
>
> Attaching st_ops to task_struct or to mm_struct is a can of worms.
> With cgroup-bpf we went through painful bugs with lifetime
> of cgroup vs bpf, dying cgroups, wq deadlock, etc. All these
> problems are behind us. With st_ops in mm_struct it will be more
> painful. I'd rather not go that route.
That's valuable information, thanks. I would have hoped that per-MM
policies would be easier.
Are there some pointers to explore regarding the "can of worms" you
mention when it comes to per-MM policies?
>
> And revist cgroup instead, since you were way too quick
> to accept the pushback because all you wanted is global mode.
>
> The main reason for pushback was:
> "
> Cgroup was designed for resource management not for grouping processes and
> tune those processes
> "
>
> which was true when cgroup-v2 was designed, but that ship sailed
> years ago when we introduced cgroup-bpf.
Also valuable information.
Personally I don't have a preference regarding per-mm or per-cgroup.
Whatever we can get working reliably. Sounds like cgroup-bpf has sorted
out most of the mess.
memcg/cgroup maintainers might disagree, but it's probably worth having
that discussion once again.
> None of the progs are doing resource management and lots of infrastructure,
> container management, and open source projects use cgroup-bpf
> as a grouping of processes. bpf progs attached to cgroup/hook tuple
> only care about processes within that cgroup. No resource management.
> See __cgroup_bpf_check_dev_permission or __cgroup_bpf_run_filter_sysctl
> and others.
> The path is current->cgroup->bpf_progs and progs do exactly
> what cgroup wasn't designed to do. They tune a set of processes.
>
> You should do the same.
>
> Also I really don't see a compelling use case for bpf in THP.
There is a lot more potential there to write fine-tuned policies that
thack VMA information into account.
The tests likely reflect what Yafang seems to focus on: IIUC primarily
enabling+disabling traditional THPs (e.g., 2M) on a per-process basis.
Some of what Yafang might want to achieve could maybe at this point be
maybe achieved through the prctl(PR_SET_THP_DISABLE) support, including
extensions we recently added [1].
Systemd support still seems to be in the works [2] for some of that.
[1] https://lwn.net/Articles/1032014/
[2] https://github.com/systemd/systemd/pull/39085
--
Cheers
David
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v12 mm-new 06/10] mm: bpf-thp: add support for global mode
2025-11-27 11:48 ` David Hildenbrand (Red Hat)
@ 2025-11-28 2:53 ` Yafang Shao
2025-11-28 7:57 ` Lorenzo Stoakes
2025-11-28 8:39 ` David Hildenbrand (Red Hat)
0 siblings, 2 replies; 29+ messages in thread
From: Yafang Shao @ 2025-11-28 2:53 UTC (permalink / raw)
To: David Hildenbrand (Red Hat)
Cc: Alexei Starovoitov, Andrew Morton, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Lorenzo Stoakes,
Martin KaFai Lau, Eduard, Song Liu, Yonghong Song,
John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Zi Yan, Liam Howlett, npache, ryan.roberts, dev.jain,
Johannes Weiner, usamaarif642, gutierrez.asier, Matthew Wilcox,
Amery Hung, David Rientjes, Jonathan Corbet, Barry Song,
Shakeel Butt, Tejun Heo, lance.yang, Randy Dunlap, Chris Mason,
bpf, linux-mm
On Thu, Nov 27, 2025 at 7:48 PM David Hildenbrand (Red Hat)
<david@kernel.org> wrote:
>
> >> To move forward, I'm happy to set the global mode aside for now and
> >> potentially drop it in the next version. I'd really like to hear your
> >> perspective on the per-process mode. Does this implementation meet
> >> your needs?
>
> I haven't had the capacity to follow the evolution of this patch set
> unfortunately, just to comment on some points from my perspective.
>
> First, I agree that the global mode is not what we want, not even as a
> fallback.
>
> >
> > Attaching st_ops to task_struct or to mm_struct is a can of worms.
> > With cgroup-bpf we went through painful bugs with lifetime
> > of cgroup vs bpf, dying cgroups, wq deadlock, etc. All these
> > problems are behind us. With st_ops in mm_struct it will be more
> > painful. I'd rather not go that route.
>
> That's valuable information, thanks. I would have hoped that per-MM
> policies would be easier.
The per-MM approach has a performance advantage over per-MEMCG
policies. This is because it accesses the policy hook directly via
vma->vm_mm->bpf_mm->policy_hook()
whereas the per-MEMCG method requires a more expensive lookup:
memcg = get_mem_cgroup_from_mm(vma->vm_mm);
memcg->bpf_memcg->policy_hook();
This lookup could be a concern in a critical path. However, this
performance issue in the per-MEMCG mode can be mitigated. For
instance, when a task is added to a new memcg, we can cache the hook
pointer:
task->mm->bpf_mm->policy_hook = memcg->bpf_memcg->policy_hook
Ultimately, we might still introduce a mm_struct:bpf_mm field to
provide an efficient interface.
>
> Are there some pointers to explore regarding the "can of worms" you
> mention when it comes to per-MM policies?
>
> >
> > And revist cgroup instead, since you were way too quick
> > to accept the pushback because all you wanted is global mode.
> >
> > The main reason for pushback was:
> > "
> > Cgroup was designed for resource management not for grouping processes and
> > tune those processes
> > "
> >
> > which was true when cgroup-v2 was designed, but that ship sailed
> > years ago when we introduced cgroup-bpf.
>
> Also valuable information.
>
> Personally I don't have a preference regarding per-mm or per-cgroup.
> Whatever we can get working reliably.
I am open to either approach, as long as it's acceptable to the maintainers.
> Sounds like cgroup-bpf has sorted
> out most of the mess.
No, the attach-based cgroup-bpf has proven to be ... a "can of worms"
in practice ...
(I welcome corrections from the BPF maintainers if my assessment is
inaccurate.)
While the struct-ops-based cgroup-bpf is still under discussion.
>
> memcg/cgroup maintainers might disagree, but it's probably worth having
> that discussion once again.
>
> > None of the progs are doing resource management and lots of infrastructure,
> > container management, and open source projects use cgroup-bpf
> > as a grouping of processes. bpf progs attached to cgroup/hook tuple
> > only care about processes within that cgroup. No resource management.
> > See __cgroup_bpf_check_dev_permission or __cgroup_bpf_run_filter_sysctl
> > and others.
> > The path is current->cgroup->bpf_progs and progs do exactly
> > what cgroup wasn't designed to do. They tune a set of processes.
> >
> > You should do the same.
> >
> > Also I really don't see a compelling use case for bpf in THP.
>
> There is a lot more potential there to write fine-tuned policies that
> thack VMA information into account.
>
> The tests likely reflect what Yafang seems to focus on: IIUC primarily
> enabling+disabling traditional THPs (e.g., 2M) on a per-process basis.
Right.
>
> Some of what Yafang might want to achieve could maybe at this point be
> maybe achieved through the prctl(PR_SET_THP_DISABLE) support, including
> extensions we recently added [1].
>
> Systemd support still seems to be in the works [2] for some of that.
>
>
> [1] https://lwn.net/Articles/1032014/
> [2] https://github.com/systemd/systemd/pull/39085
Thank you for sharing this.
However, BPF-THP is already deployed across our server fleet and both
our users and my boss are satisfied with it. As such, we are not
considering a switch. The current solution also offers us a valuable
opportunity to experiment with additional policies in production.
In summary, I am fine with either the per-MM or per-MEMCG method.
Furthermore, I don't believe this is an either-or decision; both can
be implemented to work together.
--
Regards
Yafang
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v12 mm-new 06/10] mm: bpf-thp: add support for global mode
2025-11-28 2:53 ` Yafang Shao
@ 2025-11-28 7:57 ` Lorenzo Stoakes
2025-11-28 8:18 ` Yafang Shao
2025-11-28 8:39 ` David Hildenbrand (Red Hat)
1 sibling, 1 reply; 29+ messages in thread
From: Lorenzo Stoakes @ 2025-11-28 7:57 UTC (permalink / raw)
To: Yafang Shao
Cc: David Hildenbrand (Red Hat),
Alexei Starovoitov, Andrew Morton, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Eduard,
Song Liu, Yonghong Song, John Fastabend, KP Singh,
Stanislav Fomichev, Hao Luo, Jiri Olsa, Zi Yan, Liam Howlett,
npache, ryan.roberts, dev.jain, Johannes Weiner, usamaarif642,
gutierrez.asier, Matthew Wilcox, Amery Hung, David Rientjes,
Jonathan Corbet, Barry Song, Shakeel Butt, Tejun Heo, lance.yang,
Randy Dunlap, Chris Mason, bpf, linux-mm
TL;DR - NAK this series as-is.
On Fri, Nov 28, 2025 at 10:53:53AM +0800, Yafang Shao wrote:
> Thank you for sharing this.
> However, BPF-THP is already deployed across our server fleet and both
> our users and my boss are satisfied with it. As such, we are not
> considering a switch. The current solution also offers us a valuable
> opportunity to experiment with additional policies in production.
Sorry Yafang, this isn't how upstream works.
I've not been paying attention to this series as I have been waiting for
you and Alexei to reach some kind of resolution before diving back in.
But your response here is _very_ concerning to me.
Of course you're welcome to deploy unmerged arbitrary patches to your
kernel (as long as you abide by the GPL naturally).
But we've made it _very_ clear that this is an - experimental - feature,
that might go away at any time, while we iterate and determine how useful
it might be to users in general.
Now it seems that exactly the thing I feared has already happened - people
ignoring the fact we are hiding this behind an, in effect,
CONFIG_EXPERIMENTAL_PLEASE_DO_NOT_RELY_ON_THIS flag.
This means that I am no longer confident this approach is going to work,
which inclines me to reject this proposal outright.
The bar is now a lot higher in my view, and now we're going to need
extensive and overwhelming evidence that whatever BPF hook we provide is
both future proof as to how we intend THP to develop and of use to more
than one user.
Again as David mentioned, you seem to be able to achieve what you want to
achieve via the extensions we added to PR_SET_THP_DISABLE.
That then reduces the number of users of this feature to 0 and again
inclines me to reject this approach entirely.
So for now it's a NAK.
>
> In summary, I am fine with either the per-MM or per-MEMCG method.
> Furthermore, I don't believe this is an either-or decision; both can
> be implemented to work together.
No, it is - the global approach is broken and we won't be having that.
>
>
> --
> Regards
> Yafang
Thanks, Lorenzo
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v12 mm-new 06/10] mm: bpf-thp: add support for global mode
2025-11-28 7:57 ` Lorenzo Stoakes
@ 2025-11-28 8:18 ` Yafang Shao
2025-11-28 8:31 ` Lorenzo Stoakes
0 siblings, 1 reply; 29+ messages in thread
From: Yafang Shao @ 2025-11-28 8:18 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: David Hildenbrand (Red Hat),
Alexei Starovoitov, Andrew Morton, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Eduard,
Song Liu, Yonghong Song, John Fastabend, KP Singh,
Stanislav Fomichev, Hao Luo, Jiri Olsa, Zi Yan, Liam Howlett,
npache, ryan.roberts, dev.jain, Johannes Weiner, usamaarif642,
gutierrez.asier, Matthew Wilcox, Amery Hung, David Rientjes,
Jonathan Corbet, Barry Song, Shakeel Butt, Tejun Heo, lance.yang,
Randy Dunlap, Chris Mason, bpf, linux-mm
On Fri, Nov 28, 2025 at 3:57 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> TL;DR - NAK this series as-is.
>
> On Fri, Nov 28, 2025 at 10:53:53AM +0800, Yafang Shao wrote:
> > Thank you for sharing this.
> > However, BPF-THP is already deployed across our server fleet and both
> > our users and my boss are satisfied with it. As such, we are not
> > considering a switch. The current solution also offers us a valuable
> > opportunity to experiment with additional policies in production.
>
> Sorry Yafang, this isn't how upstream works.
>
> I've not been paying attention to this series as I have been waiting for
> you and Alexei to reach some kind of resolution before diving back in.
>
> But your response here is _very_ concerning to me.
>
> Of course you're welcome to deploy unmerged arbitrary patches to your
> kernel (as long as you abide by the GPL naturally).
>
> But we've made it _very_ clear that this is an - experimental - feature,
> that might go away at any time, while we iterate and determine how useful
> it might be to users in general.
>
> Now it seems that exactly the thing I feared has already happened - people
> ignoring the fact we are hiding this behind an, in effect,
> CONFIG_EXPERIMENTAL_PLEASE_DO_NOT_RELY_ON_THIS flag.
Thank you for your concern. We have a dedicated kernel team that
maintains our runtime. Our standard practice for new kernel features
is to first validate them in our production environment. This ensures
that any feature we propose to upstream has been proven in a
real-world, large-scale use case.
>
> This means that I am no longer confident this approach is going to work,
> which inclines me to reject this proposal outright.
>
> The bar is now a lot higher in my view, and now we're going to need
> extensive and overwhelming evidence that whatever BPF hook we provide is
> both future proof as to how we intend THP to develop and of use to more
> than one user.
>
> Again as David mentioned, you seem to be able to achieve what you want to
> achieve via the extensions we added to PR_SET_THP_DISABLE.
We see no compelling reason to switch to PR_SET_THP_DISABLE. BPF-THP
has proven to be perfectly stable across our production fleet, and we
have the full capability to maintain it.
>
> That then reduces the number of users of this feature to 0 and again
> inclines me to reject this approach entirely.
I understand your concern. Our intention is simply to contribute a
feature that we have found valuable in production, in the hope that it
may benefit others as well. We of course respect the upstream process
and are fully prepared for the possibility that it may not be
accepted.
>
> So for now it's a NAK.
>
> >
> > In summary, I am fine with either the per-MM or per-MEMCG method.
> > Furthermore, I don't believe this is an either-or decision; both can
> > be implemented to work together.
>
> No, it is - the global approach is broken and we won't be having that.
Let me rephrase for clarity: I see the per-MM and per-MEMCG approaches
as compatible. They can be implemented together, potentially as a
hybrid approach.
--
Regards
Yafang
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v12 mm-new 06/10] mm: bpf-thp: add support for global mode
2025-11-28 8:18 ` Yafang Shao
@ 2025-11-28 8:31 ` Lorenzo Stoakes
2025-11-28 11:56 ` Yafang Shao
0 siblings, 1 reply; 29+ messages in thread
From: Lorenzo Stoakes @ 2025-11-28 8:31 UTC (permalink / raw)
To: Yafang Shao
Cc: David Hildenbrand (Red Hat),
Alexei Starovoitov, Andrew Morton, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Eduard,
Song Liu, Yonghong Song, John Fastabend, KP Singh,
Stanislav Fomichev, Hao Luo, Jiri Olsa, Zi Yan, Liam Howlett,
npache, ryan.roberts, dev.jain, Johannes Weiner, usamaarif642,
gutierrez.asier, Matthew Wilcox, Amery Hung, David Rientjes,
Jonathan Corbet, Barry Song, Shakeel Butt, Tejun Heo, lance.yang,
Randy Dunlap, Chris Mason, bpf, linux-mm
On Fri, Nov 28, 2025 at 04:18:10PM +0800, Yafang Shao wrote:
> On Fri, Nov 28, 2025 at 3:57 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > TL;DR - NAK this series as-is.
> >
> > On Fri, Nov 28, 2025 at 10:53:53AM +0800, Yafang Shao wrote:
> > > Thank you for sharing this.
> > > However, BPF-THP is already deployed across our server fleet and both
> > > our users and my boss are satisfied with it. As such, we are not
> > > considering a switch. The current solution also offers us a valuable
> > > opportunity to experiment with additional policies in production.
> >
> > Sorry Yafang, this isn't how upstream works.
> >
> > I've not been paying attention to this series as I have been waiting for
> > you and Alexei to reach some kind of resolution before diving back in.
> >
> > But your response here is _very_ concerning to me.
> >
> > Of course you're welcome to deploy unmerged arbitrary patches to your
> > kernel (as long as you abide by the GPL naturally).
> >
> > But we've made it _very_ clear that this is an - experimental - feature,
> > that might go away at any time, while we iterate and determine how useful
> > it might be to users in general.
> >
> > Now it seems that exactly the thing I feared has already happened - people
> > ignoring the fact we are hiding this behind an, in effect,
> > CONFIG_EXPERIMENTAL_PLEASE_DO_NOT_RELY_ON_THIS flag.
>
> Thank you for your concern. We have a dedicated kernel team that
> maintains our runtime. Our standard practice for new kernel features
> is to first validate them in our production environment. This ensures
> that any feature we propose to upstream has been proven in a
> real-world, large-scale use case.
This strictly contradicts the intent of the config flag. I seem to recall
asking to put 'experimental' in the flag name also to avoid people assuming
this is permanent or at least permanently implemented as-is. But this
iteration of the series doesn't...
I no longer believe this flag achieves the stated goal, which is to give us
latitude to make changes in the future based on internal changes to THP
(which so sorely needs them).
I fear we will end up with users depending on it should we ship any form of
BPF hook that we aren't 100% certain is 'future proof', so it raises the
bar for this work very substantially.
So I am really of a mind that we shouldn't be taking any such series at
this point in time.
>
> >
> > This means that I am no longer confident this approach is going to work,
> > which inclines me to reject this proposal outright.
> >
> > The bar is now a lot higher in my view, and now we're going to need
> > extensive and overwhelming evidence that whatever BPF hook we provide is
> > both future proof as to how we intend THP to develop and of use to more
> > than one user.
> >
> > Again as David mentioned, you seem to be able to achieve what you want to
> > achieve via the extensions we added to PR_SET_THP_DISABLE.
>
> We see no compelling reason to switch to PR_SET_THP_DISABLE. BPF-THP
> has proven to be perfectly stable across our production fleet, and we
> have the full capability to maintain it.
Again, this is entirely your prerogative, but it doesn't imply that other
users will need this feature themselves.
>
> >
> > That then reduces the number of users of this feature to 0 and again
> > inclines me to reject this approach entirely.
>
> I understand your concern. Our intention is simply to contribute a
> feature that we have found valuable in production, in the hope that it
> may benefit others as well. We of course respect the upstream process
> and are fully prepared for the possibility that it may not be
> accepted.
Right.
>
> >
> > So for now it's a NAK.
> >
> > >
> > > In summary, I am fine with either the per-MM or per-MEMCG method.
> > > Furthermore, I don't believe this is an either-or decision; both can
> > > be implemented to work together.
> >
> > No, it is - the global approach is broken and we won't be having that.
>
> Let me rephrase for clarity: I see the per-MM and per-MEMCG approaches
> as compatible. They can be implemented together, potentially as a
> hybrid approach.
OK sorry I think I misread this/misinterpreted you here - the objection was
to the global approach.
Yes sure perhaps we could.
I mean we end up back in the silly 'THPs are not a resource' argument the
cgroup people put forward when it comes to memcg + THP (I don't
agree...). But let's not open that can of worms again :)
>
> --
> Regards
> Yafang
>
Sorry to push back so harshly on this, but I do it out of concern for our
future ability to tame THP into something more sensible than the - frankly
- mess we have now.
I feel like we must defend against painting ourselves into any kind of
corner worse than we already have :)
Thanks, Lorenzo
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v12 mm-new 06/10] mm: bpf-thp: add support for global mode
2025-11-28 2:53 ` Yafang Shao
2025-11-28 7:57 ` Lorenzo Stoakes
@ 2025-11-28 8:39 ` David Hildenbrand (Red Hat)
2025-11-28 8:55 ` Lorenzo Stoakes
2025-11-30 13:06 ` Yafang Shao
1 sibling, 2 replies; 29+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-11-28 8:39 UTC (permalink / raw)
To: Yafang Shao
Cc: Alexei Starovoitov, Andrew Morton, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Lorenzo Stoakes,
Martin KaFai Lau, Eduard, Song Liu, Yonghong Song,
John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Zi Yan, Liam Howlett, npache, ryan.roberts, dev.jain,
Johannes Weiner, usamaarif642, gutierrez.asier, Matthew Wilcox,
Amery Hung, David Rientjes, Jonathan Corbet, Barry Song,
Shakeel Butt, Tejun Heo, lance.yang, Randy Dunlap, Chris Mason,
bpf, linux-mm
On 11/28/25 03:53, Yafang Shao wrote:
> On Thu, Nov 27, 2025 at 7:48 PM David Hildenbrand (Red Hat)
> <david@kernel.org> wrote:
Lorenzo commented on the upstream topic, let me mostly comment on the
other parts:
>>> Attaching st_ops to task_struct or to mm_struct is a can of worms.
>>> With cgroup-bpf we went through painful bugs with lifetime
>>> of cgroup vs bpf, dying cgroups, wq deadlock, etc. All these
>>> problems are behind us. With st_ops in mm_struct it will be more
>>> painful. I'd rather not go that route.
>>
>> That's valuable information, thanks. I would have hoped that per-MM
>> policies would be easier.
>
> The per-MM approach has a performance advantage over per-MEMCG
> policies. This is because it accesses the policy hook directly via
>
> vma->vm_mm->bpf_mm->policy_hook()
>
> whereas the per-MEMCG method requires a more expensive lookup:
>
> memcg = get_mem_cgroup_from_mm(vma->vm_mm);
> memcg->bpf_memcg->policy_hook();
> > This lookup could be a concern in a critical path. However, this
> performance issue in the per-MEMCG mode can be mitigated. For
> instance, when a task is added to a new memcg, we can cache the hook
> pointer:
>
> task->mm->bpf_mm->policy_hook = memcg->bpf_memcg->policy_hook
>
> Ultimately, we might still introduce a mm_struct:bpf_mm field to
> provide an efficient interface.
Right, caching is what I would have proposed. I would expect some
headakes with lifetime, but probably nothing unsolvable.
>> Sounds like cgroup-bpf has sorted
>> out most of the mess.
>
> No, the attach-based cgroup-bpf has proven to be ... a "can of worms"
> in practice ...
> (I welcome corrections from the BPF maintainers if my assessment is
> inaccurate.)
I don't know what's right or wrong here, as Alexei said the "mm_struct"
based one would be a can of worms and that the the cgroup-based one
apparently solved these issues ("All these problems are behind us."),
that's why I asked for some clarifications. :)
[...]
>>
>> Some of what Yafang might want to achieve could maybe at this point be
>> maybe achieved through the prctl(PR_SET_THP_DISABLE) support, including
>> extensions we recently added [1].
>>
>> Systemd support still seems to be in the works [2] for some of that.
>>
>>
>> [1] https://lwn.net/Articles/1032014/
>> [2] https://github.com/systemd/systemd/pull/39085
>
> Thank you for sharing this.
> However, BPF-THP is already deployed across our server fleet and both
> our users and my boss are satisfied with it. As such, we are not
> considering a switch. The current solution also offers us a valuable
> opportunity to experiment with additional policies in production.
Just to emphasize: we usually don't add two mechanisms to achieve the
very same end goal. There really must be something delivering more value
for us to accept something more complex. Focusing on solving a solved
problem is not good.
If some company went with a downstream-only approach they might be stuck
having to maintain that forever.
That's why other companies prefer upstream-first :)
Having that said, the original reason why I agreed that having bpf for
THP can be valuable is that I see a lot more value for rapid prototyping
and policies once you can actually control on a per-VMA basis (using vma
size, flags, anon-vma names etc) where specific folio orders could be
valuable, and where not. But also, possibly where we would want to waste
memory and where not.
As we are speaking I have a customer running into issues [1] with
virtio-balloon discarding pages in a VM and khugepaged undoing part of
that work in the hypervisor. The workaround of telling khugepaged to not
waste memory in all of the system really feels suboptimal when we know
that it's only the VM memory of such VMs (with balloon deflation
enabled) where we would not want to waste memory but still use THPs.
[1] https://issues.redhat.com/browse/RHEL-121177
--
Cheers
David
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v12 mm-new 06/10] mm: bpf-thp: add support for global mode
2025-11-28 8:39 ` David Hildenbrand (Red Hat)
@ 2025-11-28 8:55 ` Lorenzo Stoakes
2025-11-30 13:06 ` Yafang Shao
1 sibling, 0 replies; 29+ messages in thread
From: Lorenzo Stoakes @ 2025-11-28 8:55 UTC (permalink / raw)
To: David Hildenbrand (Red Hat)
Cc: Yafang Shao, Alexei Starovoitov, Andrew Morton,
Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Eduard, Song Liu, Yonghong Song,
John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Zi Yan, Liam Howlett, npache, ryan.roberts, dev.jain,
Johannes Weiner, usamaarif642, gutierrez.asier, Matthew Wilcox,
Amery Hung, David Rientjes, Jonathan Corbet, Barry Song,
Shakeel Butt, Tejun Heo, lance.yang, Randy Dunlap, Chris Mason,
bpf, linux-mm
On Fri, Nov 28, 2025 at 09:39:06AM +0100, David Hildenbrand (Red Hat) wrote:
> On 11/28/25 03:53, Yafang Shao wrote:
> > On Thu, Nov 27, 2025 at 7:48 PM David Hildenbrand (Red Hat)
> > <david@kernel.org> wrote:
>
> Lorenzo commented on the upstream topic, let me mostly comment on the other
> parts:
> > > > Attaching st_ops to task_struct or to mm_struct is a can of worms.
> > > > With cgroup-bpf we went through painful bugs with lifetime
> > > > of cgroup vs bpf, dying cgroups, wq deadlock, etc. All these
> > > > problems are behind us. With st_ops in mm_struct it will be more
> > > > painful. I'd rather not go that route.
> > >
> > > That's valuable information, thanks. I would have hoped that per-MM
> > > policies would be easier.
> >
> > The per-MM approach has a performance advantage over per-MEMCG
> > policies. This is because it accesses the policy hook directly via
> >
> > vma->vm_mm->bpf_mm->policy_hook()
> >
> > whereas the per-MEMCG method requires a more expensive lookup:
> >
> > memcg = get_mem_cgroup_from_mm(vma->vm_mm);
> > memcg->bpf_memcg->policy_hook();
> > > This lookup could be a concern in a critical path. However, this
> > performance issue in the per-MEMCG mode can be mitigated. For
> > instance, when a task is added to a new memcg, we can cache the hook
> > pointer:
> >
> > task->mm->bpf_mm->policy_hook = memcg->bpf_memcg->policy_hook
> >
> > Ultimately, we might still introduce a mm_struct:bpf_mm field to
> > provide an efficient interface.
>
> Right, caching is what I would have proposed. I would expect some headakes
> with lifetime, but probably nothing unsolvable.
>
>
> > > Sounds like cgroup-bpf has sorted
> > > out most of the mess.
> >
> > No, the attach-based cgroup-bpf has proven to be ... a "can of worms"
> > in practice ...
> > (I welcome corrections from the BPF maintainers if my assessment is
> > inaccurate.)
>
> I don't know what's right or wrong here, as Alexei said the "mm_struct"
> based one would be a can of worms and that the the cgroup-based one
> apparently solved these issues ("All these problems are behind us."), that's
> why I asked for some clarifications. :)
>
> [...]
>
> > >
> > > Some of what Yafang might want to achieve could maybe at this point be
> > > maybe achieved through the prctl(PR_SET_THP_DISABLE) support, including
> > > extensions we recently added [1].
> > >
> > > Systemd support still seems to be in the works [2] for some of that.
> > >
> > >
> > > [1] https://lwn.net/Articles/1032014/
> > > [2] https://github.com/systemd/systemd/pull/39085
> >
> > Thank you for sharing this.
> > However, BPF-THP is already deployed across our server fleet and both
> > our users and my boss are satisfied with it. As such, we are not
> > considering a switch. The current solution also offers us a valuable
> > opportunity to experiment with additional policies in production.
>
> Just to emphasize: we usually don't add two mechanisms to achieve the very
> same end goal. There really must be something delivering more value for us
> to accept something more complex. Focusing on solving a solved problem is
> not good.
Yes.
>
> If some company went with a downstream-only approach they might be stuck
> having to maintain that forever.
>
> That's why other companies prefer upstream-first :)
I think trying to do downstream-only is going to cause very big headaches if we
choose to substantially alter THP in future (and of course - we do intend to).
>
>
> Having that said, the original reason why I agreed that having bpf for THP
> can be valuable is that I see a lot more value for rapid prototyping and
> policies once you can actually control on a per-VMA basis (using vma size,
> flags, anon-vma names etc) where specific folio orders could be valuable,
> and where not. But also, possibly where we would want to waste memory and
> where not.
The same for me.
But given the actual author of the feature has already treated this as a
permanent and unchanging feature, I absolutely do not have confidence that we
can do this.
The situation I feared us running in to is that we'd release this even with
CONFIG_EXPERIMENTAL_DO_NOT_RELY etc. (note the flag is somehow now
CONFIG_BPF_THP which... isn't what I wanted) and people would STILL rely on it,
then when we try to change it loudly complain and make it difficult to remove.
I am now convinced that this is just going to happen no matter what we do.
So the 'rapid prototyping' approach is just not workable, at all in my view.
>
> As we are speaking I have a customer running into issues [1] with
> virtio-balloon discarding pages in a VM and khugepaged undoing part of that
> work in the hypervisor. The workaround of telling khugepaged to not waste
> memory in all of the system really feels suboptimal when we know that it's
> only the VM memory of such VMs (with balloon deflation enabled) where we
> would not want to waste memory but still use THPs.
>
> [1] https://issues.redhat.com/browse/RHEL-121177
Right, and that's very sad that we now lose the ability to do so, but rapid
prototyping isn't feasible - I think we're seeing that.
That doesn't mean we can't have BPF for THP. It just means we have to set the
bar CONSIDERABLY higher - whatever interface we provide _has_ to be
future-proofed to any future changes we make to THP in terms of making things
more 'automatic' - and has to provide sufficient power to be useful.
I wonder how easy it will be to figure out such an interface without
accidentally causing ourselves issues down the line.
THP is a special case like that - right now we have very broken interfaces (as
evidenced by users requesting things like the prctl extensions) - and we want to
be able to fix those in the future.
Of course we have to maintain uAPI compatibility, but even the discussion around
mTHP khugepaged and 'eagerness' points to a desire to change how existing
interfaces work - imagine if we had some BPF hook that then ended up needing to
introspect current max_pte_none for instance.
So perhaps the answer is that a BPF interface should come later when we have a
better idea of the future of THP?
The whole cgroup vs mm thing again raises old issues about isolation - the
cgroup people reject the idea of THP being a resource that can be managed by
cgroups - so by even allowing a per-memcg thing we're opening that can of worms.
Anyway overall this series as-is is not really upstreamable I don't think.
Maybe we can figure out a read-only introspection hook that makes the least
assumptions that can be provided at low-risk that'd help with issues such as the
one you mention at least in respect of informing what's going on?
That could form the basis of future work towards a hook that actually changes
things?
There's no need to rush.
>
> --
> Cheers
>
> David
Thanks, Lorenzo
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v12 mm-new 06/10] mm: bpf-thp: add support for global mode
2025-11-28 8:31 ` Lorenzo Stoakes
@ 2025-11-28 11:56 ` Yafang Shao
2025-11-28 12:18 ` Lorenzo Stoakes
0 siblings, 1 reply; 29+ messages in thread
From: Yafang Shao @ 2025-11-28 11:56 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: David Hildenbrand (Red Hat),
Alexei Starovoitov, Andrew Morton, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Eduard,
Song Liu, Yonghong Song, John Fastabend, KP Singh,
Stanislav Fomichev, Hao Luo, Jiri Olsa, Zi Yan, Liam Howlett,
npache, ryan.roberts, dev.jain, Johannes Weiner, usamaarif642,
gutierrez.asier, Matthew Wilcox, Amery Hung, David Rientjes,
Jonathan Corbet, Barry Song, Shakeel Butt, Tejun Heo, lance.yang,
Randy Dunlap, Chris Mason, bpf, linux-mm
On Fri, Nov 28, 2025 at 4:31 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Fri, Nov 28, 2025 at 04:18:10PM +0800, Yafang Shao wrote:
> > On Fri, Nov 28, 2025 at 3:57 PM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > TL;DR - NAK this series as-is.
> > >
> > > On Fri, Nov 28, 2025 at 10:53:53AM +0800, Yafang Shao wrote:
> > > > Thank you for sharing this.
> > > > However, BPF-THP is already deployed across our server fleet and both
> > > > our users and my boss are satisfied with it. As such, we are not
> > > > considering a switch. The current solution also offers us a valuable
> > > > opportunity to experiment with additional policies in production.
> > >
> > > Sorry Yafang, this isn't how upstream works.
> > >
> > > I've not been paying attention to this series as I have been waiting for
> > > you and Alexei to reach some kind of resolution before diving back in.
> > >
> > > But your response here is _very_ concerning to me.
> > >
> > > Of course you're welcome to deploy unmerged arbitrary patches to your
> > > kernel (as long as you abide by the GPL naturally).
> > >
> > > But we've made it _very_ clear that this is an - experimental - feature,
> > > that might go away at any time, while we iterate and determine how useful
> > > it might be to users in general.
> > >
> > > Now it seems that exactly the thing I feared has already happened - people
> > > ignoring the fact we are hiding this behind an, in effect,
> > > CONFIG_EXPERIMENTAL_PLEASE_DO_NOT_RELY_ON_THIS flag.
> >
> > Thank you for your concern. We have a dedicated kernel team that
> > maintains our runtime. Our standard practice for new kernel features
> > is to first validate them in our production environment. This ensures
> > that any feature we propose to upstream has been proven in a
> > real-world, large-scale use case.
>
> This strictly contradicts the intent of the config flag. I seem to recall
> asking to put 'experimental' in the flag name also to avoid people assuming
> this is permanent or at least permanently implemented as-is. But this
> iteration of the series doesn't...
Ah, I understand your point now.
The CONFIG_EXPERIMENTAL_PLEASE_DO_NOT_RELY_ON_THIS flag was changed in v9:
https://lore.kernel.org/linux-mm/20250930055826.9810-1-laoar.shao@gmail.com/
The change was suggested by Randy and Usama:
https://lwn.net/ml/all/a5015724-a799-4151-bcc4-000c2c5c7178@infradead.org/
At that time, you were on holiday, so you may have missed this update.
>
> I no longer believe this flag achieves the stated goal, which is to give us
> latitude to make changes in the future based on internal changes to THP
> (which so sorely needs them).
>
> I fear we will end up with users depending on it should we ship any form of
> BPF hook that we aren't 100% certain is 'future proof', so it raises the
> bar for this work very substantially.
>
> So I am really of a mind that we shouldn't be taking any such series at
> this point in time.
understood.
>
> >
> > >
> > > This means that I am no longer confident this approach is going to work,
> > > which inclines me to reject this proposal outright.
> > >
> > > The bar is now a lot higher in my view, and now we're going to need
> > > extensive and overwhelming evidence that whatever BPF hook we provide is
> > > both future proof as to how we intend THP to develop and of use to more
> > > than one user.
> > >
> > > Again as David mentioned, you seem to be able to achieve what you want to
> > > achieve via the extensions we added to PR_SET_THP_DISABLE.
> >
> > We see no compelling reason to switch to PR_SET_THP_DISABLE. BPF-THP
> > has proven to be perfectly stable across our production fleet, and we
> > have the full capability to maintain it.
>
> Again, this is entirely your prerogative, but it doesn't imply that other
> users will need this feature themselves.
Right, we’re not trying to force anyone else to use it.
We’re simply sharing our use case with upstream.
It’s up to the maintainers to decide whether to accept it.
>
> >
> > >
> > > That then reduces the number of users of this feature to 0 and again
> > > inclines me to reject this approach entirely.
> >
> > I understand your concern. Our intention is simply to contribute a
> > feature that we have found valuable in production, in the hope that it
> > may benefit others as well. We of course respect the upstream process
> > and are fully prepared for the possibility that it may not be
> > accepted.
>
> Right.
>
> >
> > >
> > > So for now it's a NAK.
> > >
> > > >
> > > > In summary, I am fine with either the per-MM or per-MEMCG method.
> > > > Furthermore, I don't believe this is an either-or decision; both can
> > > > be implemented to work together.
> > >
> > > No, it is - the global approach is broken and we won't be having that.
> >
> > Let me rephrase for clarity: I see the per-MM and per-MEMCG approaches
> > as compatible. They can be implemented together, potentially as a
> > hybrid approach.
>
> OK sorry I think I misread this/misinterpreted you here - the objection was
> to the global approach.
>
> Yes sure perhaps we could.
>
> I mean we end up back in the silly 'THPs are not a resource' argument the
> cgroup people put forward when it comes to memcg + THP (I don't
> agree...). But let's not open that can of worms again :)
>
> >
> > --
> > Regards
> > Yafang
> >
>
> Sorry to push back so harshly on this, but I do it out of concern for our
> future ability to tame THP into something more sensible than the - frankly
> - mess we have now.
>
> I feel like we must defend against painting ourselves into any kind of
> corner worse than we already have :)
Understood.
--
Regards
Yafang
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v12 mm-new 06/10] mm: bpf-thp: add support for global mode
2025-11-28 11:56 ` Yafang Shao
@ 2025-11-28 12:18 ` Lorenzo Stoakes
2025-11-28 12:51 ` Yafang Shao
0 siblings, 1 reply; 29+ messages in thread
From: Lorenzo Stoakes @ 2025-11-28 12:18 UTC (permalink / raw)
To: Yafang Shao
Cc: David Hildenbrand (Red Hat),
Alexei Starovoitov, Andrew Morton, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Eduard,
Song Liu, Yonghong Song, John Fastabend, KP Singh,
Stanislav Fomichev, Hao Luo, Jiri Olsa, Zi Yan, Liam Howlett,
npache, ryan.roberts, dev.jain, Johannes Weiner, usamaarif642,
gutierrez.asier, Matthew Wilcox, Amery Hung, David Rientjes,
Jonathan Corbet, Barry Song, Shakeel Butt, Tejun Heo, lance.yang,
Randy Dunlap, Chris Mason, bpf, linux-mm
On Fri, Nov 28, 2025 at 07:56:48PM +0800, Yafang Shao wrote:
> The CONFIG_EXPERIMENTAL_PLEASE_DO_NOT_RELY_ON_THIS flag was changed in v9:
>
> https://lore.kernel.org/linux-mm/20250930055826.9810-1-laoar.shao@gmail.com/
>
> The change was suggested by Randy and Usama:
>
> https://lwn.net/ml/all/a5015724-a799-4151-bcc4-000c2c5c7178@infradead.org/
>
> At that time, you were on holiday, so you may have missed this update.
>
It's moot because this series isn't upstreamable, but... :)
To risk sounding grumpy, in future do please make sure to check about changes
that contradict things maintainers _explicitly_ ask you to do.
You can always off-list mail if people take time to come back to review.
Thanks, Lorenzo
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v12 mm-new 06/10] mm: bpf-thp: add support for global mode
2025-11-28 12:18 ` Lorenzo Stoakes
@ 2025-11-28 12:51 ` Yafang Shao
0 siblings, 0 replies; 29+ messages in thread
From: Yafang Shao @ 2025-11-28 12:51 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: David Hildenbrand (Red Hat),
Alexei Starovoitov, Andrew Morton, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Eduard,
Song Liu, Yonghong Song, John Fastabend, KP Singh,
Stanislav Fomichev, Hao Luo, Jiri Olsa, Zi Yan, Liam Howlett,
npache, ryan.roberts, dev.jain, Johannes Weiner, usamaarif642,
gutierrez.asier, Matthew Wilcox, Amery Hung, David Rientjes,
Jonathan Corbet, Barry Song, Shakeel Butt, Tejun Heo, lance.yang,
Randy Dunlap, Chris Mason, bpf, linux-mm
On Fri, Nov 28, 2025 at 8:18 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Fri, Nov 28, 2025 at 07:56:48PM +0800, Yafang Shao wrote:
> > The CONFIG_EXPERIMENTAL_PLEASE_DO_NOT_RELY_ON_THIS flag was changed in v9:
> >
> > https://lore.kernel.org/linux-mm/20250930055826.9810-1-laoar.shao@gmail.com/
> >
> > The change was suggested by Randy and Usama:
> >
> > https://lwn.net/ml/all/a5015724-a799-4151-bcc4-000c2c5c7178@infradead.org/
> >
> > At that time, you were on holiday, so you may have missed this update.
> >
>
> It's moot because this series isn't upstreamable, but... :)
>
> To risk sounding grumpy, in future do please make sure to check about changes
> that contradict things maintainers _explicitly_ ask you to do.
>
> You can always off-list mail if people take time to come back to review.
Thanks for your suggestion :-)
--
Regards
Yafang
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v12 mm-new 06/10] mm: bpf-thp: add support for global mode
2025-11-28 8:39 ` David Hildenbrand (Red Hat)
2025-11-28 8:55 ` Lorenzo Stoakes
@ 2025-11-30 13:06 ` Yafang Shao
1 sibling, 0 replies; 29+ messages in thread
From: Yafang Shao @ 2025-11-30 13:06 UTC (permalink / raw)
To: David Hildenbrand (Red Hat)
Cc: Alexei Starovoitov, Andrew Morton, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Lorenzo Stoakes,
Martin KaFai Lau, Eduard, Song Liu, Yonghong Song,
John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Zi Yan, Liam Howlett, npache, ryan.roberts, dev.jain,
Johannes Weiner, usamaarif642, gutierrez.asier, Matthew Wilcox,
Amery Hung, David Rientjes, Jonathan Corbet, Barry Song,
Shakeel Butt, Tejun Heo, lance.yang, Randy Dunlap, Chris Mason,
bpf, linux-mm
On Fri, Nov 28, 2025 at 4:39 PM David Hildenbrand (Red Hat)
<david@kernel.org> wrote:
>
> On 11/28/25 03:53, Yafang Shao wrote:
> > On Thu, Nov 27, 2025 at 7:48 PM David Hildenbrand (Red Hat)
> > <david@kernel.org> wrote:
>
> Lorenzo commented on the upstream topic, let me mostly comment on the
> other parts:
> >>> Attaching st_ops to task_struct or to mm_struct is a can of worms.
> >>> With cgroup-bpf we went through painful bugs with lifetime
> >>> of cgroup vs bpf, dying cgroups, wq deadlock, etc. All these
> >>> problems are behind us. With st_ops in mm_struct it will be more
> >>> painful. I'd rather not go that route.
> >>
> >> That's valuable information, thanks. I would have hoped that per-MM
> >> policies would be easier.
> >
> > The per-MM approach has a performance advantage over per-MEMCG
> > policies. This is because it accesses the policy hook directly via
> >
> > vma->vm_mm->bpf_mm->policy_hook()
> >
> > whereas the per-MEMCG method requires a more expensive lookup:
> >
> > memcg = get_mem_cgroup_from_mm(vma->vm_mm);
> > memcg->bpf_memcg->policy_hook();
> > > This lookup could be a concern in a critical path. However, this
> > performance issue in the per-MEMCG mode can be mitigated. For
> > instance, when a task is added to a new memcg, we can cache the hook
> > pointer:
> >
> > task->mm->bpf_mm->policy_hook = memcg->bpf_memcg->policy_hook
> >
> > Ultimately, we might still introduce a mm_struct:bpf_mm field to
> > provide an efficient interface.
>
> Right, caching is what I would have proposed. I would expect some
> headakes with lifetime, but probably nothing unsolvable.
>
>
> >> Sounds like cgroup-bpf has sorted
> >> out most of the mess.
> >
> > No, the attach-based cgroup-bpf has proven to be ... a "can of worms"
> > in practice ...
> > (I welcome corrections from the BPF maintainers if my assessment is
> > inaccurate.)
>
> I don't know what's right or wrong here, as Alexei said the "mm_struct"
> based one would be a can of worms and that the the cgroup-based one
> apparently solved these issues ("All these problems are behind us."),
> that's why I asked for some clarifications. :)
>
> [...]
>
> >>
> >> Some of what Yafang might want to achieve could maybe at this point be
> >> maybe achieved through the prctl(PR_SET_THP_DISABLE) support, including
> >> extensions we recently added [1].
> >>
> >> Systemd support still seems to be in the works [2] for some of that.
> >>
> >>
> >> [1] https://lwn.net/Articles/1032014/
> >> [2] https://github.com/systemd/systemd/pull/39085
> >
> > Thank you for sharing this.
> > However, BPF-THP is already deployed across our server fleet and both
> > our users and my boss are satisfied with it. As such, we are not
> > considering a switch. The current solution also offers us a valuable
> > opportunity to experiment with additional policies in production.
>
> Just to emphasize: we usually don't add two mechanisms to achieve the
> very same end goal. There really must be something delivering more value
> for us to accept something more complex. Focusing on solving a solved
> problem is not good.
>
> If some company went with a downstream-only approach they might be stuck
> having to maintain that forever.
>
> That's why other companies prefer upstream-first :)
The upstream kernel process is often too slow for our users' needs and
frequently results in the rejection of our submissions.
Therefore, we maintain a set of local features that, despite being
rejected upstream, are critical for delivering user benefits.
>
>
> Having that said, the original reason why I agreed that having bpf for
> THP can be valuable is that I see a lot more value for rapid prototyping
> and policies once you can actually control on a per-VMA basis (using vma
> size, flags, anon-vma names etc) where specific folio orders could be
> valuable, and where not.
agreed.
> But also, possibly where we would want to waste
> memory and where not.
This is a challenge we have also encountered since enabling THP for
production services. We are continuing to develop our BPF-THP system
to make it more automated.
>
> As we are speaking I have a customer running into issues [1] with
> virtio-balloon discarding pages in a VM and khugepaged undoing part of
> that work in the hypervisor. The workaround of telling khugepaged to not
> waste memory in all of the system really feels suboptimal when we know
> that it's only the VM memory of such VMs (with balloon deflation
> enabled) where we would not want to waste memory but still use THPs.
>
> [1] https://issues.redhat.com/browse/RHEL-121177
This is an excellent analysis—thank you for sharing it.
I don't have a better solution than your current approach of setting
max_ptes_none to 0. However, I believe this situation serves as a
compelling example for why we should implement a per-process control
for `/sys/kernel/mm/transparent_hugepage/` parameters, such as
`khugepaged/max_ptes_none`. This direction also aligns perfectly with
our roadmap for evolving the BPF-THP system on our production servers.
--
Regards
Yafang
^ permalink raw reply [flat|nested] 29+ messages in thread
end of thread, other threads:[~2025-11-30 13:07 UTC | newest]
Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-10-26 10:01 [PATCH v12 mm-new 00/10] mm, bpf: BPF-MM, BPF-THP Yafang Shao
2025-10-26 10:01 ` [PATCH v12 mm-new 01/10] mm: thp: remove vm_flags parameter from khugepaged_enter_vma() Yafang Shao
2025-10-26 10:01 ` [PATCH v12 mm-new 02/10] mm: thp: remove vm_flags parameter from thp_vma_allowable_order() Yafang Shao
2025-10-26 10:01 ` [PATCH v12 mm-new 03/10] mm: thp: add support for BPF based THP order selection Yafang Shao
2025-10-26 10:01 ` [PATCH v12 mm-new 04/10] mm: thp: decouple THP allocation between swap and page fault paths Yafang Shao
2025-10-27 4:07 ` Barry Song
2025-10-26 10:01 ` [PATCH v12 mm-new 05/10] mm: thp: enable THP allocation exclusively through khugepaged Yafang Shao
2025-10-26 10:01 ` [PATCH v12 mm-new 06/10] mm: bpf-thp: add support for global mode Yafang Shao
2025-10-29 1:32 ` Alexei Starovoitov
2025-10-29 2:13 ` Yafang Shao
2025-10-30 0:57 ` Alexei Starovoitov
2025-10-30 2:40 ` Yafang Shao
2025-11-27 11:48 ` David Hildenbrand (Red Hat)
2025-11-28 2:53 ` Yafang Shao
2025-11-28 7:57 ` Lorenzo Stoakes
2025-11-28 8:18 ` Yafang Shao
2025-11-28 8:31 ` Lorenzo Stoakes
2025-11-28 11:56 ` Yafang Shao
2025-11-28 12:18 ` Lorenzo Stoakes
2025-11-28 12:51 ` Yafang Shao
2025-11-28 8:39 ` David Hildenbrand (Red Hat)
2025-11-28 8:55 ` Lorenzo Stoakes
2025-11-30 13:06 ` Yafang Shao
2025-11-26 15:13 ` Rik van Riel
2025-11-27 2:35 ` Yafang Shao
2025-10-26 10:01 ` [PATCH v12 mm-new 07/10] Documentation: add BPF THP Yafang Shao
2025-10-26 10:01 ` [PATCH v12 mm-new 08/10] selftests/bpf: add a simple BPF based THP policy Yafang Shao
2025-10-26 10:01 ` [PATCH v12 mm-new 09/10] selftests/bpf: add test case to update " Yafang Shao
2025-10-26 10:01 ` [PATCH v12 mm-new 10/10] selftests/bpf: add test case for BPF-THP inheritance across fork Yafang Shao
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox