linux-mm.kvack.org archive mirror
* [PATCH v9 mm-new 00/11] mm, bpf: BPF based THP order selection
@ 2025-09-30  5:58 Yafang Shao
  2025-09-30  5:58 ` [PATCH v9 mm-new 01/11] mm: thp: remove vm_flags parameter from khugepaged_enter_vma() Yafang Shao
                   ` (10 more replies)
  0 siblings, 11 replies; 37+ messages in thread
From: Yafang Shao @ 2025-09-30  5:58 UTC (permalink / raw)
  To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet, 21cnbao, shakeel.butt, tj, lance.yang, rdunlap
  Cc: bpf, linux-mm, linux-doc, linux-kernel, Yafang Shao

Background
==========

Our production servers consistently configure THP to "never" due to
historical incidents caused by its behavior. Key issues include:
- Increased Memory Consumption
  THP significantly raises overall memory usage, reducing available memory
  for workloads.

- Latency Spikes
  Random latency spikes occur due to frequent memory compaction triggered
  by THP.

- Lack of Fine-Grained Control
  THP tuning is globally configured, making it unsuitable for containerized
  environments. When multiple workloads share a host, enabling THP without
  per-workload control leads to unpredictable behavior.

Due to these issues, administrators avoid switching to the madvise or
always modes unless per-workload THP control is available.

To address this, we propose a BPF-based THP policy for flexible,
per-workload adjustment. Additionally, as David mentioned, this mechanism
can also serve as a policy prototyping tool (test policies via BPF before
upstreaming them).

Proposed Solution
=================

This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic
THP tuning. It includes a hook thp_get_order(), allowing BPF programs to
influence THP order selection based on factors such as:

- Workload identity
  For example, workloads running in specific containers or cgroups.
- Allocation context
  Whether the allocation occurs during a page fault, khugepaged, swap or
  other paths.
- VMA's memory advice settings
  MADV_HUGEPAGE or MADV_NOHUGEPAGE
- Memory pressure
  PSI system data or associated cgroup PSI metrics

The new interface for the BPF program is as follows:

/**
 * thp_order_fn_t: Get the suggested THP order from a BPF program for allocation
 * @vma: vm_area_struct associated with the THP allocation
 * @type: TVA type for current @vma
 * @orders: Bitmask of available THP orders for this allocation
 *
 * Return: The suggested THP order for allocation from the BPF program. Must be
 *         a valid, available order.
 */
typedef int thp_order_fn_t(struct vm_area_struct *vma,
			   enum tva_type type,
			   unsigned long orders);

Only a single BPF program can be attached at any given time, though it can
be dynamically updated to adjust the policy. The implementation supports
anonymous THP, shmem THP, and mTHP, with future extensions planned for
file-backed THP.

This functionality is only active when system-wide THP is configured to
madvise or always mode; it remains disabled in never mode. Additionally,
if THP is explicitly disabled for a specific task via prctl(), this BPF
functionality will also be unavailable for that task.

Rationale Behind the Non-Cgroup Design
--------------------------------------

cgroups are designed as nested hierarchies for partitioning resources. They
are a poor fit for enforcing arbitrary, non-hierarchical policies.

The THP policy is a quintessential example of such an arbitrary
setting. Even within a single cgroup, it is often necessary to enable
THP for performance-critical tasks while disabling it for others to
avoid latency spikes. Implementing this policy through a cgroup
interface that propagates hierarchically would eliminate the crucial
ability to configure it on a per-task basis.

While the bpf-thp mechanism has a global scope, this does not limit
its application to a single system-wide policy. In contrast to a
hierarchical cgroup-based setting, bpf-thp offers the flexibility to
set policies per-task, per-cgroup, or globally.

Fundamentally, it is a more powerful variant of prctl(), not a variant of a
cgroup interface file.

WARNING
-------

- This feature requires CONFIG_BPF_THP (marked EXPERIMENTAL) to
  be enabled.
- The interface may change
- Behavior may differ in future kernel versions
- We might remove it in the future

Selftests
=========

BPF CI 
------

Patch #8: Implements a basic BPF THP policy.
Patch #9: Provides tests for dynamic BPF program updates and replacement.
Patch #10: Includes negative tests for invalid BPF helper usage, verifying
           proper verification by the BPF verifier.

Currently, several dependency patches reside in mm-new but haven't been
merged into bpf-next. To enable BPF CI testing, these dependencies were
manually applied to bpf-next. All selftests in this series pass 
successfully [0].

Performance Evaluation
----------------------

Because this series modifies the page fault handler, its performance impact
was measured. The standard `perf bench mem memset` benchmark was employed
to assess page fault performance.

Testing was conducted on an AMD EPYC 7W83 64-Core Processor (single NUMA
node). Due to variance between individual test runs, a script executed
10,000 iterations to calculate meaningful averages. Three configurations
were compared:

- Baseline (without this patch series)
- With patch series but no BPF program attached
- With patch series and BPF program attached

The results across three configurations show negligible performance impact:

  Number of runs: 10,000
  Average throughput: 40-41 GB/sec

Production verification
-----------------------

We have successfully deployed a variant of this approach across numerous
Kubernetes production servers. The implementation enables THP for specific
workloads (such as applications utilizing ZGC [1]) while disabling it for
others. This selective deployment has operated flawlessly, with no
regression reports to date.

For ZGC-based applications, our verification demonstrates that shmem THP
delivers significant improvements:
- Reduced CPU utilization
- Lower average latencies

We are continuously extending its support to more workloads, such as
TCMalloc-based services. [2]

The deployment steps on our production servers are as follows:

1. Initial Setup:
   - Set THP mode to "never" (disabling THP by default).
   - Attach the BPF program and pin the BPF maps and links.
   - Pinning ensures persistence (like a kernel module), preventing
     disruption under system pressure.
   - A THP whitelist map tracks allowed cgroups (initially empty, so no
     THP allocations are permitted).

2. Enable THP Control:
   - Switch THP mode to "always" or "madvise" (BPF now governs actual
     allocations).

3. Dynamic Management:
   - To permit THP for a cgroup, add its ID to the whitelist map.
   - To revoke permission, remove the cgroup ID from the map.
   - The BPF program can be updated live; policy adjustments require no
     task interruption.

4. To roll back, disable THP and remove this BPF program.

**WARNING**
Be aware that the maintainers do not endorse this use case, as the BPF hook
interface is unstable and might be removed from the upstream kernel, unless
you have your own kernel team to maintain it ;-)

Tested By
---------

Version 7 of this patch series was tested by Lance. Thanks a lot!

  Tested-by: Lance Yang <lance.yang@linux.dev> (for v7)

Since the changes from v7 are minimal, I've retained the Tested-by tag
in the current version.

Future work
===========

Per-Task Defrag Policy
----------------------

In our production environment, applications handle memory allocation in two
ways: some pre-touch all memory at startup, while others allocate
dynamically.

For pre-touching applications, we prefer to allocate THP via direct reclaim
during their initial phase. For dynamic allocators, however, we prefer to
defer THP allocation to khugepaged to prevent latency spikes.

To support both strategies effectively, the defrag setting must be
configurable on a per-task basis.

File-backed THP Policy
----------------------

Based on our validation with production workloads, we observed mixed
results with XFS large folios (also known as file-backed THP):

- Performance Benefits
  Some workloads demonstrated significant improvements with XFS large
  folios enabled
- Performance Regression
  Some workloads experienced degradation when using XFS large folios

These results demonstrate that file-backed THP, like anonymous THP,
requires a granular, per-workload approach rather than a uniform policy.

We will extend the BPF-based order selection mechanism to support
file-backed THP allocation policies.

Hooking fork() with BPF for Task Configuration
----------------------------------------------

The current method for controlling a newly fork()-ed task involves calling
prctl() (e.g., with PR_SET_THP_DISABLE) to set flags in its mm->flags. This
requires explicit userspace modification.

A more efficient alternative is to implement a new BPF hook within the
fork() path. This hook would allow a BPF program to set the task's
mm->flags directly after mm initialization, leveraging BPF helpers for a
solution that is transparent to userspace. This is particularly valuable in
data center environments for fleet-wide management. 

Link: https://github.com/kernel-patches/bpf/pull/9893 [0] 
Link: https://wiki.openjdk.org/display/zgc/Main#Main-EnablingTransparentHugePagesOnLinux [1] 
Link: https://google.github.io/tcmalloc/tuning.html#system-level-optimizations [2]

Changes
=======

v8->v9:
- Rename CONFIG_BPF_THP_GET_ORDER_EXPERIMENTAL to CONFIG_BPF_THP for
  future extensions (Usama, Randy)
- Remove the first patch and send it separately (Usama)

v7->v8: https://lwn.net/Articles/1039689/
Key Changes:
From Lorenzo:
  - Remove the @vma_type parameter and get it from @vma instead
  - Rename the config to BPF_THP_GET_ORDER_EXPERIMENTAL for highlighting
  - Code improvement around the returned order
- Fix the building error reported by the kernel test robot in patch #1
  (Lance, Zi, Lorenzo)

v6->v7: https://lwn.net/Articles/1037490/
Key Changes Implemented Based on Feedback:
From Lorenzo:
  - Rename the hook from get_suggested_order() to bpf_hook_get_thp_order(). 
  - Rename bpf_thp.c to huge_memory_bpf.c
  - Focus the current patchset on THP order selection
  - Add the BPF hook into thp_vma_allowable_orders()
  - Make the hook VMA-based and remove the mm parameter
  - Modify the BPF program to return a single order
  - Stop passing vma_flags directly to BPF programs
  - Mark vma->vm_mm as trusted_or_null
  - Change the MAINTAINER file
From Andrii:
  - Mark mm->owner as rcu_or_null to avoid introducing new helpers
From Barry:
  - Decouple swap from the normal page fault path
From the kernel test robot:
  - Fix a sparse warning
Shakeel helped clarify the implementation.

RFC v5-> v6: https://lwn.net/Articles/1035116/
- Code improvement around the RCU usage (Usama)
- Add selftests for khugepaged fork (Usama)
- Add performance data for page fault (Usama)
- Remove the RFC tag

RFC v4->v5: https://lwn.net/Articles/1034265/
- Add support for vma (David)
- Add mTHP support in khugepaged (Zi)
- Use bitmask of all allowed orders instead (Zi)
- Retrieve the page size and PMD order rather than hardcoding them (Zi)

RFC v3->v4: https://lwn.net/Articles/1031829/
- Use a new interface get_suggested_order() (David)
- Mark it as experimental (David, Lorenzo)
- Code improvement in THP (Usama)
- Code improvement in BPF struct ops (Amery)

RFC v2->v3: https://lwn.net/Articles/1024545/
- Finer-grained tuning based on madvise or always mode (David, Lorenzo)
- Use BPF to write more advanced policy logic (David, Lorenzo)

RFC v1->v2: https://lwn.net/Articles/1021783/
The main changes are as follows:
- Use struct_ops instead of fmod_ret (Alexei)
- Introduce a new THP mode (Johannes)
- Introduce new helpers for BPF hook (Zi)
- Refine the commit log

RFC v1: https://lwn.net/Articles/1019290/

Yafang Shao (11):
  mm: thp: remove vm_flags parameter from khugepaged_enter_vma()
  mm: thp: remove vm_flags parameter from thp_vma_allowable_order()
  mm: thp: add support for BPF based THP order selection
  mm: thp: decouple THP allocation between swap and page fault paths
  mm: thp: enable THP allocation exclusively through khugepaged
  bpf: mark mm->owner as __safe_rcu_or_null
  bpf: mark vma->vm_mm as __safe_trusted_or_null
  selftests/bpf: add a simple BPF based THP policy
  selftests/bpf: add test case to update THP policy
  selftests/bpf: add test cases for invalid thp_adjust usage
  Documentation: add BPF-based THP policy management

 Documentation/admin-guide/mm/transhuge.rst    |  39 +++
 MAINTAINERS                                   |   3 +
 fs/proc/task_mmu.c                            |   3 +-
 include/linux/huge_mm.h                       |  42 ++-
 include/linux/khugepaged.h                    |  10 +-
 kernel/bpf/verifier.c                         |   8 +
 mm/Kconfig                                    |  11 +
 mm/Makefile                                   |   1 +
 mm/huge_memory.c                              |   7 +-
 mm/huge_memory_bpf.c                          | 204 +++++++++++++
 mm/khugepaged.c                               |  35 +--
 mm/madvise.c                                  |   7 +
 mm/memory.c                                   |  22 +-
 mm/shmem.c                                    |   2 +-
 mm/vma.c                                      |   6 +-
 tools/testing/selftests/bpf/config            |   3 +
 .../selftests/bpf/prog_tests/thp_adjust.c     | 287 ++++++++++++++++++
 tools/testing/selftests/bpf/progs/lsm.c       |   8 +-
 .../selftests/bpf/progs/test_thp_adjust.c     |  55 ++++
 .../bpf/progs/test_thp_adjust_sleepable.c     |  22 ++
 .../bpf/progs/test_thp_adjust_trusted_owner.c |  30 ++
 .../bpf/progs/test_thp_adjust_trusted_vma.c   |  27 ++
 22 files changed, 779 insertions(+), 53 deletions(-)
 create mode 100644 mm/huge_memory_bpf.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust_sleepable.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust_trusted_owner.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust_trusted_vma.c

-- 
2.47.3



^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v9 mm-new 01/11] mm: thp: remove vm_flags parameter from khugepaged_enter_vma()
  2025-09-30  5:58 [PATCH v9 mm-new 00/11] mm, bpf: BPF based THP order selection Yafang Shao
@ 2025-09-30  5:58 ` Yafang Shao
  2025-09-30  5:58 ` [PATCH v9 mm-new 02/11] mm: thp: remove vm_flags parameter from thp_vma_allowable_order() Yafang Shao
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 37+ messages in thread
From: Yafang Shao @ 2025-09-30  5:58 UTC (permalink / raw)
  To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet, 21cnbao, shakeel.butt, tj, lance.yang, rdunlap
  Cc: bpf, linux-mm, linux-doc, linux-kernel, Yafang Shao, Yang Shi

The khugepaged_enter_vma() function requires handling in two specific
scenarios:
1. New VMA creation
  When a new VMA is created (for anonymous VMAs, this is deferred until the
  first page fault), if vma->vm_mm is not present in khugepaged_mm_slot, it
  must be added. In this case, khugepaged_enter_vma() is called after
  vma->vm_flags have been set, allowing direct use of the VMA's flags.
2. VMA flag modification
  When vma->vm_flags are modified (particularly when VM_HUGEPAGE is set),
  the system must recheck whether to add vma->vm_mm to khugepaged_mm_slot.
  Currently, khugepaged_enter_vma() is called before the flag update, so
  the call must be relocated to occur after vma->vm_flags have been set.

In the VMA merging path, khugepaged_enter_vma() is also called. For this
case, since VMA merging only occurs when the vm_flags of both VMAs are
identical (excluding special flags like VM_SOFTDIRTY), we can safely use
target->vm_flags instead. (It is worth noting that khugepaged_enter_vma()
can be removed from the VMA merging path because the VMA has already been
added in the two aforementioned cases. We will address this cleanup in a
separate patch.)

After this change, we can further remove vm_flags parameter from
thp_vma_allowable_order(). That will be handled in a followup patch.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Usama Arif <usamaarif642@gmail.com>
---
 include/linux/khugepaged.h | 10 ++++++----
 mm/huge_memory.c           |  2 +-
 mm/khugepaged.c            | 27 ++++++++++++++-------------
 mm/madvise.c               |  7 +++++++
 mm/vma.c                   |  6 +++---
 5 files changed, 31 insertions(+), 21 deletions(-)

diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index eb1946a70cff..b30814d3d665 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -13,8 +13,8 @@ extern void khugepaged_destroy(void);
 extern int start_stop_khugepaged(void);
 extern void __khugepaged_enter(struct mm_struct *mm);
 extern void __khugepaged_exit(struct mm_struct *mm);
-extern void khugepaged_enter_vma(struct vm_area_struct *vma,
-				 vm_flags_t vm_flags);
+extern void khugepaged_enter_vma(struct vm_area_struct *vma);
+extern void khugepaged_enter_mm(struct mm_struct *mm);
 extern void khugepaged_min_free_kbytes_update(void);
 extern bool current_is_khugepaged(void);
 extern int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
@@ -38,8 +38,10 @@ static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm
 static inline void khugepaged_exit(struct mm_struct *mm)
 {
 }
-static inline void khugepaged_enter_vma(struct vm_area_struct *vma,
-					vm_flags_t vm_flags)
+static inline void khugepaged_enter_vma(struct vm_area_struct *vma)
+{
+}
+static inline void khugepaged_enter_mm(struct mm_struct *mm)
 {
 }
 static inline int collapse_pte_mapped_thp(struct mm_struct *mm,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1b81680b4225..ac6601f30e65 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1346,7 +1346,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 	ret = vmf_anon_prepare(vmf);
 	if (ret)
 		return ret;
-	khugepaged_enter_vma(vma, vma->vm_flags);
+	khugepaged_enter_vma(vma);
 
 	if (!(vmf->flags & FAULT_FLAG_WRITE) &&
 			!mm_forbids_zeropage(vma->vm_mm) &&
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 7ab2d1a42df3..5088eedafc35 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -353,12 +353,6 @@ int hugepage_madvise(struct vm_area_struct *vma,
 #endif
 		*vm_flags &= ~VM_NOHUGEPAGE;
 		*vm_flags |= VM_HUGEPAGE;
-		/*
-		 * If the vma become good for khugepaged to scan,
-		 * register it here without waiting a page fault that
-		 * may not happen any time soon.
-		 */
-		khugepaged_enter_vma(vma, *vm_flags);
 		break;
 	case MADV_NOHUGEPAGE:
 		*vm_flags &= ~VM_HUGEPAGE;
@@ -460,14 +454,21 @@ void __khugepaged_enter(struct mm_struct *mm)
 		wake_up_interruptible(&khugepaged_wait);
 }
 
-void khugepaged_enter_vma(struct vm_area_struct *vma,
-			  vm_flags_t vm_flags)
+void khugepaged_enter_mm(struct mm_struct *mm)
 {
-	if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
-	    hugepage_pmd_enabled()) {
-		if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER))
-			__khugepaged_enter(vma->vm_mm);
-	}
+	if (mm_flags_test(MMF_VM_HUGEPAGE, mm))
+		return;
+	if (!hugepage_pmd_enabled())
+		return;
+
+	__khugepaged_enter(mm);
+}
+
+void khugepaged_enter_vma(struct vm_area_struct *vma)
+{
+	if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER))
+		return;
+	khugepaged_enter_mm(vma->vm_mm);
 }
 
 void __khugepaged_exit(struct mm_struct *mm)
diff --git a/mm/madvise.c b/mm/madvise.c
index fb1c86e630b6..8de7c39305dd 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1425,6 +1425,13 @@ static int madvise_vma_behavior(struct madvise_behavior *madv_behavior)
 	VM_WARN_ON_ONCE(madv_behavior->lock_mode != MADVISE_MMAP_WRITE_LOCK);
 
 	error = madvise_update_vma(new_flags, madv_behavior);
+	/*
+	 * If the vma become good for khugepaged to scan,
+	 * register it here without waiting a page fault that
+	 * may not happen any time soon.
+	 */
+	if (!error && new_flags & VM_HUGEPAGE)
+		khugepaged_enter_mm(vma->vm_mm);
 out:
 	/*
 	 * madvise() returns EAGAIN if kernel resources, such as
diff --git a/mm/vma.c b/mm/vma.c
index a1ec405bda25..6a548b0d64cd 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -973,7 +973,7 @@ static __must_check struct vm_area_struct *vma_merge_existing_range(
 	if (err || commit_merge(vmg))
 		goto abort;
 
-	khugepaged_enter_vma(vmg->target, vmg->vm_flags);
+	khugepaged_enter_vma(vmg->target);
 	vmg->state = VMA_MERGE_SUCCESS;
 	return vmg->target;
 
@@ -1093,7 +1093,7 @@ struct vm_area_struct *vma_merge_new_range(struct vma_merge_struct *vmg)
 	 * following VMA if we have VMAs on both sides.
 	 */
 	if (vmg->target && !vma_expand(vmg)) {
-		khugepaged_enter_vma(vmg->target, vmg->vm_flags);
+		khugepaged_enter_vma(vmg->target);
 		vmg->state = VMA_MERGE_SUCCESS;
 		return vmg->target;
 	}
@@ -2520,7 +2520,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
 	 * call covers the non-merge case.
 	 */
 	if (!vma_is_anonymous(vma))
-		khugepaged_enter_vma(vma, map->vm_flags);
+		khugepaged_enter_vma(vma);
 	*vmap = vma;
 	return 0;
 
-- 
2.47.3




* [PATCH v9 mm-new 02/11] mm: thp: remove vm_flags parameter from thp_vma_allowable_order()
  2025-09-30  5:58 [PATCH v9 mm-new 00/11] mm, bpf: BPF based THP order selection Yafang Shao
  2025-09-30  5:58 ` [PATCH v9 mm-new 01/11] mm: thp: remove vm_flags parameter from khugepaged_enter_vma() Yafang Shao
@ 2025-09-30  5:58 ` Yafang Shao
  2025-09-30  5:58 ` [PATCH v9 mm-new 03/11] mm: thp: add support for BPF based THP order selection Yafang Shao
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 37+ messages in thread
From: Yafang Shao @ 2025-09-30  5:58 UTC (permalink / raw)
  To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet, 21cnbao, shakeel.butt, tj, lance.yang, rdunlap
  Cc: bpf, linux-mm, linux-doc, linux-kernel, Yafang Shao

Because all calls to thp_vma_allowable_order() pass vma->vm_flags as the
vma_flags argument, we can remove the parameter and have the function
access vma->vm_flags directly.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Usama Arif <usamaarif642@gmail.com>
---
 fs/proc/task_mmu.c      |  3 +--
 include/linux/huge_mm.h | 16 ++++++++--------
 mm/huge_memory.c        |  4 ++--
 mm/khugepaged.c         | 10 +++++-----
 mm/memory.c             | 11 +++++------
 mm/shmem.c              |  2 +-
 6 files changed, 22 insertions(+), 24 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index fc35a0543f01..e713d1905750 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1369,8 +1369,7 @@ static int show_smap(struct seq_file *m, void *v)
 	__show_smap(m, &mss, false);
 
 	seq_printf(m, "THPeligible:    %8u\n",
-		   !!thp_vma_allowable_orders(vma, vma->vm_flags, TVA_SMAPS,
-					      THP_ORDERS_ALL));
+		   !!thp_vma_allowable_orders(vma, TVA_SMAPS, THP_ORDERS_ALL));
 
 	if (arch_pkeys_enabled())
 		seq_printf(m, "ProtectionKey:  %8u\n", vma_pkey(vma));
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index f327d62fc985..a635dcbb2b99 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -101,8 +101,8 @@ enum tva_type {
 	TVA_FORCED_COLLAPSE,	/* Forced collapse (e.g. MADV_COLLAPSE). */
 };
 
-#define thp_vma_allowable_order(vma, vm_flags, type, order) \
-	(!!thp_vma_allowable_orders(vma, vm_flags, type, BIT(order)))
+#define thp_vma_allowable_order(vma, type, order) \
+	(!!thp_vma_allowable_orders(vma, type, BIT(order)))
 
 #define split_folio(f) split_folio_to_list(f, NULL)
 
@@ -266,14 +266,12 @@ static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
 }
 
 unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
-					 vm_flags_t vm_flags,
 					 enum tva_type type,
 					 unsigned long orders);
 
 /**
  * thp_vma_allowable_orders - determine hugepage orders that are allowed for vma
  * @vma:  the vm area to check
- * @vm_flags: use these vm_flags instead of vma->vm_flags
  * @type: TVA type
  * @orders: bitfield of all orders to consider
  *
@@ -287,10 +285,11 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
  */
 static inline
 unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
-				       vm_flags_t vm_flags,
 				       enum tva_type type,
 				       unsigned long orders)
 {
+	vm_flags_t vm_flags = vma->vm_flags;
+
 	/*
 	 * Optimization to check if required orders are enabled early. Only
 	 * forced collapse ignores sysfs configs.
@@ -309,7 +308,7 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
 			return 0;
 	}
 
-	return __thp_vma_allowable_orders(vma, vm_flags, type, orders);
+	return __thp_vma_allowable_orders(vma, type, orders);
 }
 
 struct thpsize {
@@ -329,8 +328,10 @@ struct thpsize {
  * through madvise or prctl.
  */
 static inline bool vma_thp_disabled(struct vm_area_struct *vma,
-		vm_flags_t vm_flags, bool forced_collapse)
+				    bool forced_collapse)
 {
+	vm_flags_t vm_flags = vma->vm_flags;
+
 	/* Are THPs disabled for this VMA? */
 	if (vm_flags & VM_NOHUGEPAGE)
 		return true;
@@ -560,7 +561,6 @@ static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
 }
 
 static inline unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
-					vm_flags_t vm_flags,
 					enum tva_type type,
 					unsigned long orders)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ac6601f30e65..1ac476fe6dc5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -98,7 +98,6 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
 }
 
 unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
-					 vm_flags_t vm_flags,
 					 enum tva_type type,
 					 unsigned long orders)
 {
@@ -106,6 +105,7 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
 	const bool in_pf = type == TVA_PAGEFAULT;
 	const bool forced_collapse = type == TVA_FORCED_COLLAPSE;
 	unsigned long supported_orders;
+	vm_flags_t vm_flags = vma->vm_flags;
 
 	/* Check the intersection of requested and supported orders. */
 	if (vma_is_anonymous(vma))
@@ -122,7 +122,7 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
 	if (!vma->vm_mm)		/* vdso */
 		return 0;
 
-	if (thp_disabled_by_hw() || vma_thp_disabled(vma, vm_flags, forced_collapse))
+	if (thp_disabled_by_hw() || vma_thp_disabled(vma, forced_collapse))
 		return 0;
 
 	/* khugepaged doesn't collapse DAX vma, but page fault is fine. */
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 5088eedafc35..b60f1856714a 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -466,7 +466,7 @@ void khugepaged_enter_mm(struct mm_struct *mm)
 
 void khugepaged_enter_vma(struct vm_area_struct *vma)
 {
-	if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER))
+	if (!thp_vma_allowable_order(vma, TVA_KHUGEPAGED, PMD_ORDER))
 		return;
 	khugepaged_enter_mm(vma->vm_mm);
 }
@@ -917,7 +917,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 
 	if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
 		return SCAN_ADDRESS_RANGE;
-	if (!thp_vma_allowable_order(vma, vma->vm_flags, type, PMD_ORDER))
+	if (!thp_vma_allowable_order(vma, type, PMD_ORDER))
 		return SCAN_VMA_CHECK;
 	/*
 	 * Anon VMA expected, the address may be unmapped then
@@ -1531,7 +1531,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	 * and map it by a PMD, regardless of sysfs THP settings. As such, let's
 	 * analogously elide sysfs THP settings here and force collapse.
 	 */
-	if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER))
+	if (!thp_vma_allowable_order(vma, TVA_FORCED_COLLAPSE, PMD_ORDER))
 		return SCAN_VMA_CHECK;
 
 	/* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */
@@ -2426,7 +2426,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 			progress++;
 			break;
 		}
-		if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
+		if (!thp_vma_allowable_order(vma, TVA_KHUGEPAGED, PMD_ORDER)) {
 skip:
 			progress++;
 			continue;
@@ -2757,7 +2757,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 	BUG_ON(vma->vm_start > start);
 	BUG_ON(vma->vm_end < end);
 
-	if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER))
+	if (!thp_vma_allowable_order(vma, TVA_FORCED_COLLAPSE, PMD_ORDER))
 		return -EINVAL;
 
 	cc = kmalloc(sizeof(*cc), GFP_KERNEL);
diff --git a/mm/memory.c b/mm/memory.c
index 7e32eb79ba99..cd04e4894725 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4558,7 +4558,7 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
 	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
 	 * and suitable for swapping THP.
 	 */
-	orders = thp_vma_allowable_orders(vma, vma->vm_flags, TVA_PAGEFAULT,
+	orders = thp_vma_allowable_orders(vma, TVA_PAGEFAULT,
 					  BIT(PMD_ORDER) - 1);
 	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
 	orders = thp_swap_suitable_orders(swp_offset(entry),
@@ -5107,7 +5107,7 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
 	 * for this vma. Then filter out the orders that can't be allocated over
 	 * the faulting address and still be fully contained in the vma.
 	 */
-	orders = thp_vma_allowable_orders(vma, vma->vm_flags, TVA_PAGEFAULT,
+	orders = thp_vma_allowable_orders(vma, TVA_PAGEFAULT,
 					  BIT(PMD_ORDER) - 1);
 	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
 
@@ -5379,7 +5379,7 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct folio *folio, struct page *pa
 	 * PMD mappings if THPs are disabled. As we already have a THP,
 	 * behave as if we are forcing a collapse.
 	 */
-	if (thp_disabled_by_hw() || vma_thp_disabled(vma, vma->vm_flags,
+	if (thp_disabled_by_hw() || vma_thp_disabled(vma,
 						     /* forced_collapse=*/ true))
 		return ret;
 
@@ -6280,7 +6280,6 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 		.gfp_mask = __get_fault_gfp_mask(vma),
 	};
 	struct mm_struct *mm = vma->vm_mm;
-	vm_flags_t vm_flags = vma->vm_flags;
 	pgd_t *pgd;
 	p4d_t *p4d;
 	vm_fault_t ret;
@@ -6295,7 +6294,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 		return VM_FAULT_OOM;
 retry_pud:
 	if (pud_none(*vmf.pud) &&
-	    thp_vma_allowable_order(vma, vm_flags, TVA_PAGEFAULT, PUD_ORDER)) {
+	    thp_vma_allowable_order(vma, TVA_PAGEFAULT, PUD_ORDER)) {
 		ret = create_huge_pud(&vmf);
 		if (!(ret & VM_FAULT_FALLBACK))
 			return ret;
@@ -6329,7 +6328,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 		goto retry_pud;
 
 	if (pmd_none(*vmf.pmd) &&
-	    thp_vma_allowable_order(vma, vm_flags, TVA_PAGEFAULT, PMD_ORDER)) {
+	    thp_vma_allowable_order(vma, TVA_PAGEFAULT, PMD_ORDER)) {
 		ret = create_huge_pmd(&vmf);
 		if (!(ret & VM_FAULT_FALLBACK))
 			return ret;
diff --git a/mm/shmem.c b/mm/shmem.c
index 4855eee22731..cc2c90656b66 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1780,7 +1780,7 @@ unsigned long shmem_allowable_huge_orders(struct inode *inode,
 	vm_flags_t vm_flags = vma ? vma->vm_flags : 0;
 	unsigned int global_orders;
 
-	if (thp_disabled_by_hw() || (vma && vma_thp_disabled(vma, vm_flags, shmem_huge_force)))
+	if (thp_disabled_by_hw() || (vma && vma_thp_disabled(vma, shmem_huge_force)))
 		return 0;
 
 	global_orders = shmem_huge_global_enabled(inode, index, write_end,
-- 
2.47.3



^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v9 mm-new 03/11] mm: thp: add support for BPF based THP order selection
  2025-09-30  5:58 [PATCH v9 mm-new 00/11] mm, bpf: BPF based THP order selection Yafang Shao
  2025-09-30  5:58 ` [PATCH v9 mm-new 01/11] mm: thp: remove vm_flags parameter from khugepaged_enter_vma() Yafang Shao
  2025-09-30  5:58 ` [PATCH v9 mm-new 02/11] mm: thp: remove vm_flags parameter from thp_vma_allowable_order() Yafang Shao
@ 2025-09-30  5:58 ` Yafang Shao
  2025-10-03  2:18   ` Alexei Starovoitov
  2025-09-30  5:58 ` [PATCH v9 mm-new 04/11] mm: thp: decouple THP allocation between swap and page fault paths Yafang Shao
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 37+ messages in thread
From: Yafang Shao @ 2025-09-30  5:58 UTC (permalink / raw)
  To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet, 21cnbao, shakeel.butt, tj, lance.yang, rdunlap
  Cc: bpf, linux-mm, linux-doc, linux-kernel, Yafang Shao, Alexei Starovoitov

This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic
THP tuning. It includes a hook bpf_hook_thp_get_order(), allowing BPF
programs to influence THP order selection based on factors such as:
- Workload identity
  For example, workloads running in specific containers or cgroups.
- Allocation context
  Whether the allocation occurs during a page fault, khugepaged, swap or
  other paths.
- VMA's memory advice settings
  MADV_HUGEPAGE or MADV_NOHUGEPAGE
- Memory pressure
  PSI system data or associated cgroup PSI metrics

The kernel API of this new BPF hook is as follows,

/**
 * thp_order_fn_t: Get the suggested THP order from a BPF program for allocation
 * @vma: vm_area_struct associated with the THP allocation
 * @type: TVA type for current @vma
 * @orders: Bitmask of available THP orders for this allocation
 *
 * Return: The suggested THP order for allocation from the BPF program. Must be
 *         a valid, available order.
 */
typedef int thp_order_fn_t(struct vm_area_struct *vma,
			   enum tva_type type,
			   unsigned long orders);

Only a single BPF program can be attached at any given time, though it can
be dynamically updated to adjust the policy. The implementation supports
anonymous THP, shmem THP, and mTHP, with future extensions planned for
file-backed THP.

This functionality is only active when system-wide THP is configured to
madvise or always mode. It remains disabled in never mode. Additionally,
if THP is explicitly disabled for a specific task via prctl(), this BPF
functionality will also be unavailable for that task.

This BPF hook enables the implementation of flexible THP allocation
policies at the system, per-cgroup, or per-task level.

This feature requires CONFIG_BPF_THP (EXPERIMENTAL) to be enabled. Note
that this capability is currently unstable and may undergo significant
changes—including potential removal—in future kernel versions.

Suggested-by: David Hildenbrand <david@redhat.com>
Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Usama Arif <usamaarif642@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
---
 MAINTAINERS             |   1 +
 include/linux/huge_mm.h |  23 +++++
 mm/Kconfig              |  11 +++
 mm/Makefile             |   1 +
 mm/huge_memory_bpf.c    | 204 ++++++++++++++++++++++++++++++++++++++++
 5 files changed, 240 insertions(+)
 create mode 100644 mm/huge_memory_bpf.c

diff --git a/MAINTAINERS b/MAINTAINERS
index ca8e3d18eedd..7be34b2a64fd 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16257,6 +16257,7 @@ F:	include/linux/huge_mm.h
 F:	include/linux/khugepaged.h
 F:	include/trace/events/huge_memory.h
 F:	mm/huge_memory.c
+F:	mm/huge_memory_bpf.c
 F:	mm/khugepaged.c
 F:	mm/mm_slot.h
 F:	tools/testing/selftests/mm/khugepaged.c
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a635dcbb2b99..02055cc93bfe 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -56,6 +56,7 @@ enum transparent_hugepage_flag {
 	TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
 	TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
 	TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
+	TRANSPARENT_HUGEPAGE_BPF_ATTACHED,      /* BPF prog is attached */
 };
 
 struct kobject;
@@ -269,6 +270,23 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
 					 enum tva_type type,
 					 unsigned long orders);
 
+#ifdef CONFIG_BPF_THP
+
+unsigned long
+bpf_hook_thp_get_orders(struct vm_area_struct *vma, enum tva_type type,
+			unsigned long orders);
+
+#else
+
+static inline unsigned long
+bpf_hook_thp_get_orders(struct vm_area_struct *vma, enum tva_type type,
+			unsigned long orders)
+{
+	return orders;
+}
+
+#endif
+
 /**
  * thp_vma_allowable_orders - determine hugepage orders that are allowed for vma
  * @vma:  the vm area to check
@@ -290,6 +308,11 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
 {
 	vm_flags_t vm_flags = vma->vm_flags;
 
+	/* A BPF program, if attached, may further restrict the allowed orders. */
+	orders &= bpf_hook_thp_get_orders(vma, type, orders);
+	if (!orders)
+		return 0;
+
 	/*
 	 * Optimization to check if required orders are enabled early. Only
 	 * forced collapse ignores sysfs configs.
diff --git a/mm/Kconfig b/mm/Kconfig
index bde9f842a4a8..ffbcc5febb10 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -895,6 +895,17 @@ config NO_PAGE_MAPCOUNT
 
 	  EXPERIMENTAL because the impact of some changes is still unclear.
 
+config BPF_THP
+	bool "BPF-based THP Policy (EXPERIMENTAL)"
+	depends on TRANSPARENT_HUGEPAGE && BPF_SYSCALL
+
+	help
+	  Enable dynamic THP policy adjustment using BPF programs. This feature
+	  is currently experimental.
+
+	  WARNING: This feature is unstable and may change in future kernel
+	  versions.
+
 endif # TRANSPARENT_HUGEPAGE
 
 # simple helper to make the code a bit easier to read
diff --git a/mm/Makefile b/mm/Makefile
index 21abb3353550..4efca1c8a919 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -99,6 +99,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_NUMA) += memory-tiers.o
 obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
+obj-$(CONFIG_BPF_THP) += huge_memory_bpf.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
 obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
 obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
diff --git a/mm/huge_memory_bpf.c b/mm/huge_memory_bpf.c
new file mode 100644
index 000000000000..47c124d588b2
--- /dev/null
+++ b/mm/huge_memory_bpf.c
@@ -0,0 +1,204 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * BPF-based THP policy management
+ *
+ * Author: Yafang Shao <laoar.shao@gmail.com>
+ */
+
+#include <linux/bpf.h>
+#include <linux/btf.h>
+#include <linux/huge_mm.h>
+#include <linux/khugepaged.h>
+
+/**
+ * thp_order_fn_t: Get the suggested THP order from a BPF program for allocation
+ * @vma: vm_area_struct associated with the THP allocation
+ * @type: TVA type for current @vma
+ * @orders: Bitmask of available THP orders for this allocation
+ *
+ * Return: The suggested THP order for allocation from the BPF program. Must be
+ *         a valid, available order.
+ */
+typedef int thp_order_fn_t(struct vm_area_struct *vma,
+			   enum tva_type type,
+			   unsigned long orders);
+
+struct bpf_thp_ops {
+	thp_order_fn_t __rcu *thp_get_order;
+};
+
+static struct bpf_thp_ops bpf_thp;
+static DEFINE_SPINLOCK(thp_ops_lock);
+
+unsigned long bpf_hook_thp_get_orders(struct vm_area_struct *vma,
+				      enum tva_type type,
+				      unsigned long orders)
+{
+	thp_order_fn_t *bpf_hook_thp_get_order;
+	int bpf_order;
+
+	/* No BPF program is attached */
+	if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
+		      &transparent_hugepage_flags))
+		return orders;
+
+	rcu_read_lock();
+	bpf_hook_thp_get_order = rcu_dereference(bpf_thp.thp_get_order);
+	if (WARN_ON_ONCE(!bpf_hook_thp_get_order))
+		goto out;
+
+	bpf_order = bpf_hook_thp_get_order(vma, type, orders);
+	orders &= BIT(bpf_order);
+
+out:
+	rcu_read_unlock();
+	return orders;
+}
+
+static bool bpf_thp_ops_is_valid_access(int off, int size,
+					enum bpf_access_type type,
+					const struct bpf_prog *prog,
+					struct bpf_insn_access_aux *info)
+{
+	return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
+}
+
+static const struct bpf_func_proto *
+bpf_thp_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+	return bpf_base_func_proto(func_id, prog);
+}
+
+static const struct bpf_verifier_ops thp_bpf_verifier_ops = {
+	.get_func_proto = bpf_thp_get_func_proto,
+	.is_valid_access = bpf_thp_ops_is_valid_access,
+};
+
+static int bpf_thp_init(struct btf *btf)
+{
+	return 0;
+}
+
+static int bpf_thp_check_member(const struct btf_type *t,
+				const struct btf_member *member,
+				const struct bpf_prog *prog)
+{
+	/* The call site operates under RCU protection. */
+	if (prog->sleepable)
+		return -EINVAL;
+	return 0;
+}
+
+static int bpf_thp_init_member(const struct btf_type *t,
+			       const struct btf_member *member,
+			       void *kdata, const void *udata)
+{
+	return 0;
+}
+
+static int bpf_thp_reg(void *kdata, struct bpf_link *link)
+{
+	struct bpf_thp_ops *ops = kdata;
+
+	spin_lock(&thp_ops_lock);
+	if (test_and_set_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
+			     &transparent_hugepage_flags)) {
+		spin_unlock(&thp_ops_lock);
+		return -EBUSY;
+	}
+	WARN_ON_ONCE(rcu_access_pointer(bpf_thp.thp_get_order));
+	rcu_assign_pointer(bpf_thp.thp_get_order, ops->thp_get_order);
+	spin_unlock(&thp_ops_lock);
+	return 0;
+}
+
+static void bpf_thp_unreg(void *kdata, struct bpf_link *link)
+{
+	thp_order_fn_t *old_fn;
+
+	spin_lock(&thp_ops_lock);
+	clear_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED, &transparent_hugepage_flags);
+	old_fn = rcu_replace_pointer(bpf_thp.thp_get_order, NULL,
+				     lockdep_is_held(&thp_ops_lock));
+	WARN_ON_ONCE(!old_fn);
+	spin_unlock(&thp_ops_lock);
+
+	synchronize_rcu();
+}
+
+static int bpf_thp_update(void *kdata, void *old_kdata, struct bpf_link *link)
+{
+	thp_order_fn_t *old_fn, *new_fn;
+	struct bpf_thp_ops *old = old_kdata;
+	struct bpf_thp_ops *ops = kdata;
+	int ret = 0;
+
+	if (!ops || !old)
+		return -EINVAL;
+
+	spin_lock(&thp_ops_lock);
+	/* The prog has already been removed. */
+	if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
+		      &transparent_hugepage_flags)) {
+		ret = -ENOENT;
+		goto out;
+	}
+
+	new_fn = rcu_dereference(ops->thp_get_order);
+	old_fn = rcu_replace_pointer(bpf_thp.thp_get_order, new_fn,
+				     lockdep_is_held(&thp_ops_lock));
+	WARN_ON_ONCE(!old_fn || !new_fn);
+
+out:
+	spin_unlock(&thp_ops_lock);
+	if (!ret)
+		synchronize_rcu();
+	return ret;
+}
+
+static int bpf_thp_validate(void *kdata)
+{
+	struct bpf_thp_ops *ops = kdata;
+
+	if (!ops->thp_get_order) {
+		pr_err("bpf_thp: required ops isn't implemented\n");
+		return -EINVAL;
+	}
+	return 0;
+}
+
+static int bpf_thp_get_order(struct vm_area_struct *vma,
+			     enum tva_type type,
+			     unsigned long orders)
+{
+	return -1;
+}
+
+static struct bpf_thp_ops __bpf_thp_ops = {
+	.thp_get_order = (thp_order_fn_t __rcu *)bpf_thp_get_order,
+};
+
+static struct bpf_struct_ops bpf_bpf_thp_ops = {
+	.verifier_ops = &thp_bpf_verifier_ops,
+	.init = bpf_thp_init,
+	.check_member = bpf_thp_check_member,
+	.init_member = bpf_thp_init_member,
+	.reg = bpf_thp_reg,
+	.unreg = bpf_thp_unreg,
+	.update = bpf_thp_update,
+	.validate = bpf_thp_validate,
+	.cfi_stubs = &__bpf_thp_ops,
+	.owner = THIS_MODULE,
+	.name = "bpf_thp_ops",
+};
+
+static int __init bpf_thp_ops_init(void)
+{
+	int err;
+
+	err = register_bpf_struct_ops(&bpf_bpf_thp_ops, bpf_thp_ops);
+	if (err)
+		pr_err("bpf_thp: Failed to register struct_ops (%d)\n", err);
+	return err;
+}
+late_initcall(bpf_thp_ops_init);
-- 
2.47.3




* [PATCH v9 mm-new 04/11] mm: thp: decouple THP allocation between swap and page fault paths
  2025-09-30  5:58 [PATCH v9 mm-new 00/11] mm, bpf: BPF based THP order selection Yafang Shao
                   ` (2 preceding siblings ...)
  2025-09-30  5:58 ` [PATCH v9 mm-new 03/11] mm: thp: add support for BPF based THP order selection Yafang Shao
@ 2025-09-30  5:58 ` Yafang Shao
  2025-09-30  5:58 ` [PATCH v9 mm-new 05/11] mm: thp: enable THP allocation exclusively through khugepaged Yafang Shao
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 37+ messages in thread
From: Yafang Shao @ 2025-09-30  5:58 UTC (permalink / raw)
  To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet, 21cnbao, shakeel.butt, tj, lance.yang, rdunlap
  Cc: bpf, linux-mm, linux-doc, linux-kernel, Yafang Shao

The new BPF capability enables finer-grained THP policy decisions by
introducing separate handling for swap faults versus normal page faults.

As highlighted by Barry:

  We’ve observed that swapping in large folios can lead to more
  swap thrashing for some workloads, e.g. kernel build. Consequently,
  some workloads might prefer swapping in smaller folios than those
  allocated by alloc_anon_folio().

While prctl() could potentially be extended to leverage this new policy,
doing so would require modifications to the uAPI.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Usama Arif <usamaarif642@gmail.com>
Cc: Barry Song <21cnbao@gmail.com>
---
 include/linux/huge_mm.h | 3 ++-
 mm/huge_memory.c        | 2 +-
 mm/memory.c             | 2 +-
 3 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 02055cc93bfe..9b9dfe646daa 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -97,9 +97,10 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
 
 enum tva_type {
 	TVA_SMAPS,		/* Exposing "THPeligible:" in smaps. */
-	TVA_PAGEFAULT,		/* Serving a page fault. */
+	TVA_PAGEFAULT,		/* Serving a non-swap page fault. */
 	TVA_KHUGEPAGED,		/* Khugepaged collapse. */
 	TVA_FORCED_COLLAPSE,	/* Forced collapse (e.g. MADV_COLLAPSE). */
+	TVA_SWAP_PAGEFAULT,	/* Serving a swap page fault. */
 };
 
 #define thp_vma_allowable_order(vma, type, order) \
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1ac476fe6dc5..08372dfcb41a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -102,7 +102,7 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
 					 unsigned long orders)
 {
 	const bool smaps = type == TVA_SMAPS;
-	const bool in_pf = type == TVA_PAGEFAULT;
+	const bool in_pf = (type == TVA_PAGEFAULT || type == TVA_SWAP_PAGEFAULT);
 	const bool forced_collapse = type == TVA_FORCED_COLLAPSE;
 	unsigned long supported_orders;
 	vm_flags_t vm_flags = vma->vm_flags;
diff --git a/mm/memory.c b/mm/memory.c
index cd04e4894725..58ea0f93f79e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4558,7 +4558,7 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
 	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
 	 * and suitable for swapping THP.
 	 */
-	orders = thp_vma_allowable_orders(vma, TVA_PAGEFAULT,
+	orders = thp_vma_allowable_orders(vma, TVA_SWAP_PAGEFAULT,
 					  BIT(PMD_ORDER) - 1);
 	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
 	orders = thp_swap_suitable_orders(swp_offset(entry),
-- 
2.47.3




* [PATCH v9 mm-new 05/11] mm: thp: enable THP allocation exclusively through khugepaged
  2025-09-30  5:58 [PATCH v9 mm-new 00/11] mm, bpf: BPF based THP order selection Yafang Shao
                   ` (3 preceding siblings ...)
  2025-09-30  5:58 ` [PATCH v9 mm-new 04/11] mm: thp: decouple THP allocation between swap and page fault paths Yafang Shao
@ 2025-09-30  5:58 ` Yafang Shao
  2025-09-30  5:58 ` [PATCH v9 mm-new 06/11] bpf: mark mm->owner as __safe_rcu_or_null Yafang Shao
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 37+ messages in thread
From: Yafang Shao @ 2025-09-30  5:58 UTC (permalink / raw)
  To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet, 21cnbao, shakeel.butt, tj, lance.yang, rdunlap
  Cc: bpf, linux-mm, linux-doc, linux-kernel, Yafang Shao

khugepaged_enter_vma() ultimately invokes any attached BPF function with
the TVA_KHUGEPAGED type when determining whether or not to enable
khugepaged THP for a freshly faulted-in VMA.

Currently, on fault, we invoke this in do_huge_pmd_anonymous_page(), as
invoked by create_huge_pmd() and only when we have already checked to
see if an allowable TVA_PAGEFAULT order is specified.

Since we might want to disallow THP on fault-in but allow it via
khugepaged, we move things around so we always attempt to enter
khugepaged upon fault.

This change is safe because:
- khugepaged operates at the MM level rather than per-VMA. Since the THP
  allocation might fail during page faults due to transient conditions
  (e.g., memory pressure), it is safe to add this MM to khugepaged for
  subsequent defragmentation.
- If __thp_vma_allowable_orders(TVA_PAGEFAULT) returns 0, then
  __thp_vma_allowable_orders(TVA_KHUGEPAGED) will also return 0.

While we could also extend prctl() to utilize this new policy, such a
change would require a uAPI modification to PR_SET_THP_DISABLE.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Lance Yang <lance.yang@linux.dev>
Cc: Usama Arif <usamaarif642@gmail.com>
---
 mm/huge_memory.c |  1 -
 mm/memory.c      | 13 ++++++++-----
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 08372dfcb41a..2b155a734c78 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1346,7 +1346,6 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 	ret = vmf_anon_prepare(vmf);
 	if (ret)
 		return ret;
-	khugepaged_enter_vma(vma);
 
 	if (!(vmf->flags & FAULT_FLAG_WRITE) &&
 			!mm_forbids_zeropage(vma->vm_mm) &&
diff --git a/mm/memory.c b/mm/memory.c
index 58ea0f93f79e..64f91191ffff 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6327,11 +6327,14 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 	if (pud_trans_unstable(vmf.pud))
 		goto retry_pud;
 
-	if (pmd_none(*vmf.pmd) &&
-	    thp_vma_allowable_order(vma, TVA_PAGEFAULT, PMD_ORDER)) {
-		ret = create_huge_pmd(&vmf);
-		if (!(ret & VM_FAULT_FALLBACK))
-			return ret;
+	if (pmd_none(*vmf.pmd)) {
+		if (vma_is_anonymous(vma))
+			khugepaged_enter_vma(vma);
+		if (thp_vma_allowable_order(vma, TVA_PAGEFAULT, PMD_ORDER)) {
+			ret = create_huge_pmd(&vmf);
+			if (!(ret & VM_FAULT_FALLBACK))
+				return ret;
+		}
 	} else {
 		vmf.orig_pmd = pmdp_get_lockless(vmf.pmd);
 
-- 
2.47.3




* [PATCH v9 mm-new 06/11] bpf: mark mm->owner as __safe_rcu_or_null
  2025-09-30  5:58 [PATCH v9 mm-new 00/11] mm, bpf: BPF based THP order selection Yafang Shao
                   ` (4 preceding siblings ...)
  2025-09-30  5:58 ` [PATCH v9 mm-new 05/11] mm: thp: enable THP allocation exclusively through khugepaged Yafang Shao
@ 2025-09-30  5:58 ` Yafang Shao
  2025-09-30  5:58 ` [PATCH v9 mm-new 07/11] bpf: mark vma->vm_mm as __safe_trusted_or_null Yafang Shao
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 37+ messages in thread
From: Yafang Shao @ 2025-09-30  5:58 UTC (permalink / raw)
  To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet, 21cnbao, shakeel.butt, tj, lance.yang, rdunlap
  Cc: bpf, linux-mm, linux-doc, linux-kernel, Yafang Shao

When CONFIG_MEMCG is enabled, we can access mm->owner under RCU. The
owner can be NULL. With this change, BPF helpers can safely access
mm->owner to retrieve the associated task from the mm. We can then make
policy decisions based on the task's attributes.

The typical use case is as follows,

  bpf_rcu_read_lock(); // rcu lock must be held for rcu trusted field
  @owner = @mm->owner; // mm_struct::owner is rcu trusted or null
  if (!@owner)
      goto out;

  /* Do something based on the task attribute */

out:
  bpf_rcu_read_unlock();

Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 kernel/bpf/verifier.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index c4f69a9e9af6..d400e18ee31e 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -7123,6 +7123,9 @@ BTF_TYPE_SAFE_RCU(struct cgroup_subsys_state) {
 /* RCU trusted: these fields are trusted in RCU CS and can be NULL */
 BTF_TYPE_SAFE_RCU_OR_NULL(struct mm_struct) {
 	struct file __rcu *exe_file;
+#ifdef CONFIG_MEMCG
+	struct task_struct __rcu *owner;
+#endif
 };
 
 /* skb->sk, req->sk are not RCU protected, but we mark them as such
-- 
2.47.3




* [PATCH v9 mm-new 07/11] bpf: mark vma->vm_mm as __safe_trusted_or_null
  2025-09-30  5:58 [PATCH v9 mm-new 00/11] mm, bpf: BPF based THP order selection Yafang Shao
                   ` (5 preceding siblings ...)
  2025-09-30  5:58 ` [PATCH v9 mm-new 06/11] bpf: mark mm->owner as __safe_rcu_or_null Yafang Shao
@ 2025-09-30  5:58 ` Yafang Shao
  2025-10-06 21:06   ` Andrii Nakryiko
  2025-09-30  5:58 ` [PATCH v9 mm-new 08/11] selftests/bpf: add a simple BPF based THP policy Yafang Shao
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 37+ messages in thread
From: Yafang Shao @ 2025-09-30  5:58 UTC (permalink / raw)
  To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet, 21cnbao, shakeel.butt, tj, lance.yang, rdunlap
  Cc: bpf, linux-mm, linux-doc, linux-kernel, Yafang Shao

The vma->vm_mm pointer might be NULL, and it can be accessed outside of
RCU. Thus, we can mark it as trusted_or_null. With this change, BPF
helpers can safely access vma->vm_mm to retrieve the associated
mm_struct from the VMA. We can then make policy decisions from the VMA.

The "trusted" annotation enables direct access to vma->vm_mm within kfuncs
marked with KF_TRUSTED_ARGS or KF_RCU, such as bpf_task_get_cgroup1() and
bpf_task_under_cgroup(). Conversely, "null" enforcement requires all
callsites using vma->vm_mm to perform NULL checks.

The lsm selftest must be modified because it directly accesses vma->vm_mm
without a NULL pointer check; otherwise it will break due to this
change.

For the VMA based THP policy, the use case is as follows,

  @mm = @vma->vm_mm; // vm_area_struct::vm_mm is trusted or null
  if (!@mm)
      return;
  bpf_rcu_read_lock(); // rcu lock must be held to dereference the owner
  @owner = @mm->owner; // mm_struct::owner is rcu trusted or null
  if (!@owner)
    goto out;
  @cgroup1 = bpf_task_get_cgroup1(@owner, MEMCG_HIERARCHY_ID);

  /* make the decision based on the @cgroup1 attribute */

  bpf_cgroup_release(@cgroup1); // release the associated cgroup
out:
  bpf_rcu_read_unlock();

PSI memory information can be obtained from the associated cgroup to inform
policy decisions. Since upstream PSI support is currently limited to cgroup
v2, the following example demonstrates a cgroup v2 implementation:

  @owner = @mm->owner;
  if (@owner) {
      // @ancestor_cgid is user-configured
      @ancestor = bpf_cgroup_from_id(@ancestor_cgid);
      if (bpf_task_under_cgroup(@owner, @ancestor)) {
          @psi_group = @ancestor->psi;

          /* Extract PSI metrics from @psi_group and
           * implement policy logic based on the values
           */

      }
  }

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
---
 kernel/bpf/verifier.c                   | 5 +++++
 tools/testing/selftests/bpf/progs/lsm.c | 8 +++++---
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index d400e18ee31e..b708b98f796c 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -7165,6 +7165,10 @@ BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct socket) {
 	struct sock *sk;
 };
 
+BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct vm_area_struct) {
+	struct mm_struct *vm_mm;
+};
+
 static bool type_is_rcu(struct bpf_verifier_env *env,
 			struct bpf_reg_state *reg,
 			const char *field_name, u32 btf_id)
@@ -7206,6 +7210,7 @@ static bool type_is_trusted_or_null(struct bpf_verifier_env *env,
 {
 	BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct socket));
 	BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct dentry));
+	BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct vm_area_struct));
 
 	return btf_nested_type_is_trusted(&env->log, reg, field_name, btf_id,
 					  "__safe_trusted_or_null");
diff --git a/tools/testing/selftests/bpf/progs/lsm.c b/tools/testing/selftests/bpf/progs/lsm.c
index 0c13b7409947..7de173daf27b 100644
--- a/tools/testing/selftests/bpf/progs/lsm.c
+++ b/tools/testing/selftests/bpf/progs/lsm.c
@@ -89,14 +89,16 @@ SEC("lsm/file_mprotect")
 int BPF_PROG(test_int_hook, struct vm_area_struct *vma,
 	     unsigned long reqprot, unsigned long prot, int ret)
 {
-	if (ret != 0)
+	struct mm_struct *mm = vma->vm_mm;
+
+	if (ret != 0 || !mm)
 		return ret;
 
 	__s32 pid = bpf_get_current_pid_tgid() >> 32;
 	int is_stack = 0;
 
-	is_stack = (vma->vm_start <= vma->vm_mm->start_stack &&
-		    vma->vm_end >= vma->vm_mm->start_stack);
+	is_stack = (vma->vm_start <= mm->start_stack &&
+		    vma->vm_end >= mm->start_stack);
 
 	if (is_stack && monitored_pid == pid) {
 		mprotect_count++;
-- 
2.47.3




* [PATCH v9 mm-new 08/11] selftests/bpf: add a simple BPF based THP policy
  2025-09-30  5:58 [PATCH v9 mm-new 00/11] mm, bpf: BPF based THP order selection Yafang Shao
                   ` (6 preceding siblings ...)
  2025-09-30  5:58 ` [PATCH v9 mm-new 07/11] bpf: mark vma->vm_mm as __safe_trusted_or_null Yafang Shao
@ 2025-09-30  5:58 ` Yafang Shao
  2025-09-30  5:58 ` [PATCH v9 mm-new 09/11] selftests/bpf: add test case to update " Yafang Shao
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 37+ messages in thread
From: Yafang Shao @ 2025-09-30  5:58 UTC (permalink / raw)
  To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet, 21cnbao, shakeel.butt, tj, lance.yang, rdunlap
  Cc: bpf, linux-mm, linux-doc, linux-kernel, Yafang Shao

This test case implements a basic THP policy that sets THPeligible to 1 for
a specific task and to 0 for all others. I selected THPeligible for
verification because its straightforward nature makes it ideal for
validating the BPF THP policy functionality.

The following configs must be enabled for this test:

  CONFIG_BPF_THP=y
  CONFIG_MEMCG=y
  CONFIG_TRANSPARENT_HUGEPAGE=y

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 MAINTAINERS                                   |   2 +
 tools/testing/selftests/bpf/config            |   3 +
 .../selftests/bpf/prog_tests/thp_adjust.c     | 257 ++++++++++++++++++
 .../selftests/bpf/progs/test_thp_adjust.c     |  41 +++
 4 files changed, 303 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 7be34b2a64fd..c1219bcd27c1 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16260,6 +16260,8 @@ F:	mm/huge_memory.c
 F:	mm/huge_memory_bpf.c
 F:	mm/khugepaged.c
 F:	mm/mm_slot.h
+F:	tools/testing/selftests/bpf/prog_tests/thp_adjust.c
+F:	tools/testing/selftests/bpf/progs/test_thp_adjust*
 F:	tools/testing/selftests/mm/khugepaged.c
 F:	tools/testing/selftests/mm/split_huge_page_test.c
 F:	tools/testing/selftests/mm/transhuge-stress.c
diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
index 8916ab814a3e..13711f773091 100644
--- a/tools/testing/selftests/bpf/config
+++ b/tools/testing/selftests/bpf/config
@@ -9,6 +9,7 @@ CONFIG_BPF_LIRC_MODE2=y
 CONFIG_BPF_LSM=y
 CONFIG_BPF_STREAM_PARSER=y
 CONFIG_BPF_SYSCALL=y
+CONFIG_BPF_THP=y
 # CONFIG_BPF_UNPRIV_DEFAULT_OFF is not set
 CONFIG_CGROUP_BPF=y
 CONFIG_CRYPTO_HMAC=y
@@ -51,6 +52,7 @@ CONFIG_IPV6_TUNNEL=y
 CONFIG_KEYS=y
 CONFIG_LIRC=y
 CONFIG_LWTUNNEL=y
+CONFIG_MEMCG=y
 CONFIG_MODULE_SIG=y
 CONFIG_MODULE_SRCVERSION_ALL=y
 CONFIG_MODULE_UNLOAD=y
@@ -114,6 +116,7 @@ CONFIG_SECURITY=y
 CONFIG_SECURITYFS=y
 CONFIG_SYN_COOKIES=y
 CONFIG_TEST_BPF=m
+CONFIG_TRANSPARENT_HUGEPAGE=y
 CONFIG_UDMABUF=y
 CONFIG_USERFAULTFD=y
 CONFIG_VSOCKETS=y
diff --git a/tools/testing/selftests/bpf/prog_tests/thp_adjust.c b/tools/testing/selftests/bpf/prog_tests/thp_adjust.c
new file mode 100644
index 000000000000..0a5a43416f2f
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/thp_adjust.c
@@ -0,0 +1,257 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <sys/mman.h>
+#include <test_progs.h>
+#include "test_thp_adjust.skel.h"
+
+#define LEN (16 * 1024 * 1024) /* 16MB */
+#define THP_ENABLED_FILE "/sys/kernel/mm/transparent_hugepage/enabled"
+#define PMD_SIZE_FILE "/sys/kernel/mm/transparent_hugepage/hpage_pmd_size"
+
+static struct test_thp_adjust *skel;
+static char old_mode[32];
+static long pagesize;
+
+static int thp_mode_save(void)
+{
+	const char *start, *end;
+	char buf[128];
+	int fd, err;
+	size_t len;
+
+	fd = open(THP_ENABLED_FILE, O_RDONLY);
+	if (fd == -1)
+		return -1;
+
+	err = read(fd, buf, sizeof(buf) - 1);
+	if (err == -1)
+		goto close;
+
+	start = strchr(buf, '[');
+	end = start ? strchr(start, ']') : NULL;
+	if (!start || !end || end <= start) {
+		err = -1;
+		goto close;
+	}
+
+	len = end - start - 1;
+	if (len >= sizeof(old_mode))
+		len = sizeof(old_mode) - 1;
+	strncpy(old_mode, start + 1, len);
+	old_mode[len] = '\0';
+
+close:
+	close(fd);
+	return err;
+}
+
+static int thp_mode_set(const char *desired_mode)
+{
+	int fd, err;
+
+	fd = open(THP_ENABLED_FILE, O_RDWR);
+	if (fd == -1)
+		return -1;
+
+	err = write(fd, desired_mode, strlen(desired_mode));
+	close(fd);
+	return err;
+}
+
+static int thp_mode_reset(void)
+{
+	int fd, err;
+
+	fd = open(THP_ENABLED_FILE, O_WRONLY);
+	if (fd == -1)
+		return -1;
+
+	err = write(fd, old_mode, strlen(old_mode));
+	close(fd);
+	return err;
+}
+
+static char *thp_alloc(void)
+{
+	char *addr;
+	int err, i;
+
+	addr = mmap(NULL, LEN, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
+	if (addr == MAP_FAILED)
+		return NULL;
+
+	err = madvise(addr, LEN, MADV_HUGEPAGE);
+	if (err == -1)
+		goto unmap;
+
+	/* Accessing a single byte within a page is sufficient to trigger a page fault. */
+	for (i = 0; i < LEN; i += pagesize)
+		addr[i] = 1;
+	return addr;
+
+unmap:
+	munmap(addr, LEN);
+	return NULL;
+}
+
+static void thp_free(char *ptr)
+{
+	munmap(ptr, LEN);
+}
+
+static int get_pmd_order(void)
+{
+	ssize_t bytes_read, size;
+	int fd, order, ret = -1;
+	char buf[64], *endptr;
+
+	fd = open(PMD_SIZE_FILE, O_RDONLY);
+	if (fd < 0)
+		return -1;
+
+	bytes_read = read(fd, buf, sizeof(buf) - 1);
+	if (bytes_read <= 0)
+		goto close_fd;
+
+	/* NUL-terminate, then strip a trailing newline if present */
+	buf[bytes_read] = '\0';
+	if (buf[bytes_read - 1] == '\n')
+		buf[bytes_read - 1] = '\0';
+
+	size = strtoul(buf, &endptr, 10);
+	if (endptr == buf || *endptr != '\0')
+		goto close_fd;
+	if (size % pagesize != 0)
+		goto close_fd;
+	ret = size / pagesize;
+	/* The page count must be a non-zero power of two */
+	if (ret == 0 || (ret & (ret - 1)) != 0) {
+		ret = -1;
+		goto close_fd;
+	}
+	order = 0;
+	while (ret > 1) {
+		ret >>= 1;
+		order++;
+	}
+	ret = order;
+
+close_fd:
+	close(fd);
+	return ret;
+}
+
+static int get_thp_eligible(pid_t pid, unsigned long addr)
+{
+	int this_vma = 0, eligible = -1;
+	unsigned long start, end;
+	char smaps_path[64];
+	FILE *smaps_file;
+	char line[4096];
+
+	snprintf(smaps_path, sizeof(smaps_path), "/proc/%d/smaps", pid);
+	smaps_file = fopen(smaps_path, "r");
+	if (!smaps_file)
+		return -1;
+
+	while (fgets(line, sizeof(line), smaps_file)) {
+		if (sscanf(line, "%lx-%lx", &start, &end) == 2) {
+			/* VMA ranges in smaps are sorted by ascending address */
+			if (addr < start)
+				break;
+			this_vma = (addr >= start && addr < end) ? 1 : 0;
+			continue;
+		}
+
+		if (!this_vma)
+			continue;
+
+		if (strstr(line, "THPeligible:")) {
+			sscanf(line, "THPeligible: %d", &eligible);
+			break;
+		}
+	}
+
+	fclose(smaps_file);
+	return eligible;
+}
+
+static void subtest_thp_eligible(void)
+{
+	struct bpf_link *ops_link;
+	int eligible;
+	pid_t pid;
+	char *ptr;
+
+	ops_link = bpf_map__attach_struct_ops(skel->maps.thp_eligible_ops);
+	if (!ASSERT_OK_PTR(ops_link, "attach struct_ops"))
+		return;
+
+	pid = getpid();
+	ptr = thp_alloc();
+	if (!ASSERT_OK_PTR(ptr, "THP alloc"))
+		goto detach;
+
+	skel->bss->pid_eligible = pid;
+	eligible = get_thp_eligible(pid, (unsigned long)ptr);
+	ASSERT_EQ(eligible, 1, "THPeligible");
+
+	skel->bss->pid_eligible = 0;
+	skel->bss->pid_not_eligible = pid;
+	eligible = get_thp_eligible(pid, (unsigned long)ptr);
+	ASSERT_EQ(eligible, 0, "THP not eligible");
+
+	skel->bss->pid_eligible = 0;
+	skel->bss->pid_not_eligible = 0;
+	eligible = get_thp_eligible(pid, (unsigned long)ptr);
+	ASSERT_EQ(eligible, 0, "THP not eligible");
+
+	thp_free(ptr);
+detach:
+	bpf_link__destroy(ops_link);
+}
+
+static int thp_adjust_setup(void)
+{
+	int err = -1, pmd_order;
+
+	pagesize = sysconf(_SC_PAGESIZE);
+	pmd_order = get_pmd_order();
+	if (!ASSERT_NEQ(pmd_order, -1, "get_pmd_order"))
+		return -1;
+
+	if (!ASSERT_NEQ(thp_mode_save(), -1, "THP mode save"))
+		return -1;
+	if (!ASSERT_GE(thp_mode_set("madvise"), 0, "THP mode set"))
+		return -1;
+
+	skel = test_thp_adjust__open();
+	if (!ASSERT_OK_PTR(skel, "open"))
+		goto thp_reset;
+
+	skel->bss->pmd_order = pmd_order;
+
+	err = test_thp_adjust__load(skel);
+	if (!ASSERT_OK(err, "load"))
+		goto destroy;
+	return 0;
+
+destroy:
+	test_thp_adjust__destroy(skel);
+thp_reset:
+	ASSERT_GE(thp_mode_reset(), 0, "THP mode reset");
+	return err;
+}
+
+static void thp_adjust_destroy(void)
+{
+	test_thp_adjust__destroy(skel);
+	ASSERT_GE(thp_mode_reset(), 0, "THP mode reset");
+}
+
+void test_thp_adjust(void)
+{
+	if (thp_adjust_setup() == -1)
+		return;
+
+	if (test__start_subtest("thp_eligible"))
+		subtest_thp_eligible();
+
+	thp_adjust_destroy();
+}
diff --git a/tools/testing/selftests/bpf/progs/test_thp_adjust.c b/tools/testing/selftests/bpf/progs/test_thp_adjust.c
new file mode 100644
index 000000000000..74ad70c837ba
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_thp_adjust.c
@@ -0,0 +1,41 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+char _license[] SEC("license") = "GPL";
+
+int pid_not_eligible, pid_eligible;
+int pmd_order;
+
+SEC("struct_ops/thp_get_order")
+int BPF_PROG(thp_eligible, struct vm_area_struct *vma, enum tva_type type,
+	     unsigned long orders)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	int suggested_order = 0;
+	struct task_struct *p;
+
+	if (type != TVA_SMAPS)
+		return 0;
+
+	if (!mm)
+		return 0;
+
+	/* This BPF hook is already under RCU */
+	p = mm->owner;
+	if (!p || (p->pid != pid_eligible && p->pid != pid_not_eligible))
+		return 0;
+
+	if (p->pid == pid_eligible)
+		suggested_order = pmd_order;
+	else
+		suggested_order = 30;	/* invalid order */
+	return suggested_order;
+}
+
+SEC(".struct_ops.link")
+struct bpf_thp_ops thp_eligible_ops = {
+	.thp_get_order = (void *)thp_eligible,
+};
-- 
2.47.3



^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v9 mm-new 09/11] selftests/bpf: add test case to update THP policy
  2025-09-30  5:58 [PATCH v9 mm-new 00/11] mm, bpf: BPF based THP order selection Yafang Shao
                   ` (7 preceding siblings ...)
  2025-09-30  5:58 ` [PATCH v9 mm-new 08/11] selftests/bpf: add a simple BPF based THP policy Yafang Shao
@ 2025-09-30  5:58 ` Yafang Shao
  2025-09-30  5:58 ` [PATCH v9 mm-new 10/11] selftests/bpf: add test cases for invalid thp_adjust usage Yafang Shao
  2025-09-30  5:58 ` [PATCH v9 mm-new 11/11] Documentation: add BPF-based THP policy management Yafang Shao
  10 siblings, 0 replies; 37+ messages in thread
From: Yafang Shao @ 2025-09-30  5:58 UTC (permalink / raw)
  To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet, 21cnbao, shakeel.butt, tj, lance.yang, rdunlap
  Cc: bpf, linux-mm, linux-doc, linux-kernel, Yafang Shao

This test case exercises the BPF THP update mechanism by modifying an
existing policy. It confirms that:
- attaching a new BPF program while another is active fails with EBUSY
- updating the currently attached program in place succeeds

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 .../selftests/bpf/prog_tests/thp_adjust.c     | 23 +++++++++++++++++++
 .../selftests/bpf/progs/test_thp_adjust.c     | 14 +++++++++++
 2 files changed, 37 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/thp_adjust.c b/tools/testing/selftests/bpf/prog_tests/thp_adjust.c
index 0a5a43416f2f..409ffe9e18f2 100644
--- a/tools/testing/selftests/bpf/prog_tests/thp_adjust.c
+++ b/tools/testing/selftests/bpf/prog_tests/thp_adjust.c
@@ -207,6 +207,27 @@ static void subtest_thp_eligible(void)
 	bpf_link__destroy(ops_link);
 }
 
+static void subtest_thp_policy_update(void)
+{
+	struct bpf_link *old_link, *new_link;
+	int err;
+
+	old_link = bpf_map__attach_struct_ops(skel->maps.swap_ops);
+	if (!ASSERT_OK_PTR(old_link, "attach_old_link"))
+		return;
+
+	new_link = bpf_map__attach_struct_ops(skel->maps.thp_eligible_ops);
+	if (!ASSERT_NULL(new_link, "attach_new_link"))
+		goto destroy_old;
+	ASSERT_EQ(errno, EBUSY, "attach_new_link");
+
+	err = bpf_link__update_map(old_link, skel->maps.thp_eligible_ops);
+	ASSERT_EQ(err, 0, "update_old_link");
+
+destroy_old:
+	bpf_link__destroy(old_link);
+}
+
 static int thp_adjust_setup(void)
 {
 	int err = -1, pmd_order;
@@ -252,6 +273,8 @@ void test_thp_adjust(void)
 
 	if (test__start_subtest("thp_eligible"))
 		subtest_thp_eligible();
+	if (test__start_subtest("policy_update"))
+		subtest_thp_policy_update();
 
 	thp_adjust_destroy();
 }
diff --git a/tools/testing/selftests/bpf/progs/test_thp_adjust.c b/tools/testing/selftests/bpf/progs/test_thp_adjust.c
index 74ad70c837ba..fc62f0c6f891 100644
--- a/tools/testing/selftests/bpf/progs/test_thp_adjust.c
+++ b/tools/testing/selftests/bpf/progs/test_thp_adjust.c
@@ -39,3 +39,17 @@ SEC(".struct_ops.link")
 struct bpf_thp_ops thp_eligible_ops = {
 	.thp_get_order = (void *)thp_eligible,
 };
+
+SEC("struct_ops/thp_get_order")
+int BPF_PROG(alloc_not_in_swap, struct vm_area_struct *vma, enum tva_type type,
+	     unsigned long orders)
+{
+	if (type == TVA_SWAP_PAGEFAULT)
+		return 0;
+	return -1;
+}
+
+SEC(".struct_ops.link")
+struct bpf_thp_ops swap_ops = {
+	.thp_get_order = (void *)alloc_not_in_swap,
+};
-- 
2.47.3



^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v9 mm-new 10/11] selftests/bpf: add test cases for invalid thp_adjust usage
  2025-09-30  5:58 [PATCH v9 mm-new 00/11] mm, bpf: BPF based THP order selection Yafang Shao
                   ` (8 preceding siblings ...)
  2025-09-30  5:58 ` [PATCH v9 mm-new 09/11] selftests/bpf: add test case to update " Yafang Shao
@ 2025-09-30  5:58 ` Yafang Shao
  2025-09-30  5:58 ` [PATCH v9 mm-new 11/11] Documentation: add BPF-based THP policy management Yafang Shao
  10 siblings, 0 replies; 37+ messages in thread
From: Yafang Shao @ 2025-09-30  5:58 UTC (permalink / raw)
  To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet, 21cnbao, shakeel.butt, tj, lance.yang, rdunlap
  Cc: bpf, linux-mm, linux-doc, linux-kernel, Yafang Shao

1. The trusted vma->vm_mm pointer can be null and must be checked before
   dereferencing.
2. The trusted mm->owner pointer can be null and must be checked before
   dereferencing.
3. Sleepable programs are prohibited because the call site operates under
   RCU protection.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 .../selftests/bpf/prog_tests/thp_adjust.c     |  7 +++++
 .../bpf/progs/test_thp_adjust_sleepable.c     | 22 ++++++++++++++
 .../bpf/progs/test_thp_adjust_trusted_owner.c | 30 +++++++++++++++++++
 .../bpf/progs/test_thp_adjust_trusted_vma.c   | 27 +++++++++++++++++
 4 files changed, 86 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust_sleepable.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust_trusted_owner.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust_trusted_vma.c

diff --git a/tools/testing/selftests/bpf/prog_tests/thp_adjust.c b/tools/testing/selftests/bpf/prog_tests/thp_adjust.c
index 409ffe9e18f2..90af0322f775 100644
--- a/tools/testing/selftests/bpf/prog_tests/thp_adjust.c
+++ b/tools/testing/selftests/bpf/prog_tests/thp_adjust.c
@@ -3,6 +3,9 @@
 #include <sys/mman.h>
 #include <test_progs.h>
 #include "test_thp_adjust.skel.h"
+#include "test_thp_adjust_sleepable.skel.h"
+#include "test_thp_adjust_trusted_vma.skel.h"
+#include "test_thp_adjust_trusted_owner.skel.h"
 
 #define LEN (16 * 1024 * 1024) /* 16MB */
 #define THP_ENABLED_FILE "/sys/kernel/mm/transparent_hugepage/enabled"
@@ -277,4 +280,8 @@ void test_thp_adjust(void)
 		subtest_thp_policy_update();
 
 	thp_adjust_destroy();
+
+	RUN_TESTS(test_thp_adjust_trusted_vma);
+	RUN_TESTS(test_thp_adjust_trusted_owner);
+	RUN_TESTS(test_thp_adjust_sleepable);
 }
diff --git a/tools/testing/selftests/bpf/progs/test_thp_adjust_sleepable.c b/tools/testing/selftests/bpf/progs/test_thp_adjust_sleepable.c
new file mode 100644
index 000000000000..e3d70f258d84
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_thp_adjust_sleepable.c
@@ -0,0 +1,22 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+#include "bpf_misc.h"
+
+char _license[] SEC("license") = "GPL";
+
+SEC("struct_ops.s/thp_get_order")
+__failure __msg("attach to unsupported member thp_get_order of struct bpf_thp_ops")
+int BPF_PROG(thp_sleepable, struct vm_area_struct *vma, enum tva_type type,
+	     unsigned long orders)
+{
+	return -1;
+}
+
+SEC(".struct_ops.link")
+struct bpf_thp_ops vma_ops = {
+	.thp_get_order = (void *)thp_sleepable,
+};
diff --git a/tools/testing/selftests/bpf/progs/test_thp_adjust_trusted_owner.c b/tools/testing/selftests/bpf/progs/test_thp_adjust_trusted_owner.c
new file mode 100644
index 000000000000..88bb09cb7cc2
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_thp_adjust_trusted_owner.c
@@ -0,0 +1,30 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+#include "bpf_misc.h"
+
+char _license[] SEC("license") = "GPL";
+
+SEC("struct_ops/thp_get_order")
+__failure __msg("R3 pointer arithmetic on rcu_ptr_or_null_ prohibited, null-check it first")
+int BPF_PROG(thp_trusted_owner, struct vm_area_struct *vma, enum tva_type tva_type,
+	     unsigned long orders)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct task_struct *p;
+
+	if (!mm)
+		return 0;
+
+	p = mm->owner;
+	bpf_printk("The task name is %s\n", p->comm);
+	return -1;
+}
+
+SEC(".struct_ops.link")
+struct bpf_thp_ops vma_ops = {
+	.thp_get_order = (void *)thp_trusted_owner,
+};
diff --git a/tools/testing/selftests/bpf/progs/test_thp_adjust_trusted_vma.c b/tools/testing/selftests/bpf/progs/test_thp_adjust_trusted_vma.c
new file mode 100644
index 000000000000..df7b0c160153
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_thp_adjust_trusted_vma.c
@@ -0,0 +1,27 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+#include "bpf_misc.h"
+
+char _license[] SEC("license") = "GPL";
+
+SEC("struct_ops/thp_get_order")
+__failure __msg("R1 invalid mem access 'trusted_ptr_or_null_'")
+int BPF_PROG(thp_trusted_vma, struct vm_area_struct *vma, enum tva_type tva_type,
+	     unsigned long orders)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct task_struct *p = mm->owner;
+
+	if (!p)
+		return 0;
+	return -1;
+}
+
+SEC(".struct_ops.link")
+struct bpf_thp_ops vma_ops = {
+	.thp_get_order = (void *)thp_trusted_vma,
+};
-- 
2.47.3



^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v9 mm-new 11/11] Documentation: add BPF-based THP policy management
  2025-09-30  5:58 [PATCH v9 mm-new 00/11] mm, bpf: BPF based THP order selection Yafang Shao
                   ` (9 preceding siblings ...)
  2025-09-30  5:58 ` [PATCH v9 mm-new 10/11] selftests/bpf: add test cases for invalid thp_adjust usage Yafang Shao
@ 2025-09-30  5:58 ` Yafang Shao
  10 siblings, 0 replies; 37+ messages in thread
From: Yafang Shao @ 2025-09-30  5:58 UTC (permalink / raw)
  To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet, 21cnbao, shakeel.butt, tj, lance.yang, rdunlap
  Cc: bpf, linux-mm, linux-doc, linux-kernel, Yafang Shao

Add admin-guide documentation for BPF-based THP policy management.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 Documentation/admin-guide/mm/transhuge.rst | 39 ++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 1654211cc6cf..f6991c674329 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -738,3 +738,42 @@ support enabled just fine as always. No difference can be noted in
 hugetlbfs other than there will be less overall fragmentation. All
 usual features belonging to hugetlbfs are preserved and
 unaffected. libhugetlbfs will also work fine as usual.
+
+BPF THP
+=======
+
+Overview
+--------
+
+When the system is configured with "always" or "madvise" THP mode, a BPF program
+can be used to adjust THP allocation policies dynamically. This enables
+fine-grained control over THP decisions based on various factors including
+workload identity, allocation context, and system memory pressure.
+
+Program Interface
+-----------------
+
+This feature implements a struct_ops BPF program with the following interface::
+
+  int thp_get_order(struct vm_area_struct *vma,
+                    enum tva_type type,
+                    unsigned long orders);
+
+Parameters::
+
+  @vma: vm_area_struct associated with the THP allocation
+  @type: TVA type for current @vma
+  @orders: Bitmask of available THP orders for this allocation
+
+Return value::
+
+  The THP order suggested by the BPF program. It must be one of the
+  orders set in @orders; any other value results in no THP order being
+  selected for this allocation.
+
+Implementation Notes
+--------------------
+
+This is currently an experimental feature. CONFIG_BPF_THP (EXPERIMENTAL) must be
+enabled to use it. Only one BPF program can be attached at a time, but the
+program can be updated dynamically to adjust policies without requiring affected
+tasks to be restarted.
-- 
2.47.3



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v9 mm-new 03/11] mm: thp: add support for BPF based THP order selection
  2025-09-30  5:58 ` [PATCH v9 mm-new 03/11] mm: thp: add support for BPF based THP order selection Yafang Shao
@ 2025-10-03  2:18   ` Alexei Starovoitov
  2025-10-07  8:47     ` Yafang Shao
  2025-10-08  8:08     ` David Hildenbrand
  0 siblings, 2 replies; 37+ messages in thread
From: Alexei Starovoitov @ 2025-10-03  2:18 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Andrew Morton, David Hildenbrand, ziy, baolin.wang,
	Lorenzo Stoakes, Liam Howlett, npache, ryan.roberts, dev.jain,
	Johannes Weiner, usamaarif642, gutierrez.asier, Matthew Wilcox,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Amery Hung,
	David Rientjes, Jonathan Corbet, 21cnbao, Shakeel Butt,
	Tejun Heo, lance.yang, Randy Dunlap, bpf, linux-mm,
	open list:DOCUMENTATION, LKML

On Mon, Sep 29, 2025 at 10:59 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> +unsigned long bpf_hook_thp_get_orders(struct vm_area_struct *vma,
> +                                     enum tva_type type,
> +                                     unsigned long orders)
> +{
> +       thp_order_fn_t *bpf_hook_thp_get_order;
> +       int bpf_order;
> +
> +       /* No BPF program is attached */
> +       if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
> +                     &transparent_hugepage_flags))
> +               return orders;
> +
> +       rcu_read_lock();
> +       bpf_hook_thp_get_order = rcu_dereference(bpf_thp.thp_get_order);
> +       if (WARN_ON_ONCE(!bpf_hook_thp_get_order))
> +               goto out;
> +
> +       bpf_order = bpf_hook_thp_get_order(vma, type, orders);
> +       orders &= BIT(bpf_order);
> +
> +out:
> +       rcu_read_unlock();
> +       return orders;
> +}

I thought I explained it earlier.
Nack to a single global prog approach.

The logic must accommodate multiple programs per-container
or any other way from the beginning.
If cgroup based scoping doesn't fit use per process tree scoping.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v9 mm-new 07/11] bpf: mark vma->vm_mm as __safe_trusted_or_null
  2025-09-30  5:58 ` [PATCH v9 mm-new 07/11] bpf: mark vma->vm_mm as __safe_trusted_or_null Yafang Shao
@ 2025-10-06 21:06   ` Andrii Nakryiko
  2025-10-07  9:05     ` Yafang Shao
  0 siblings, 1 reply; 37+ messages in thread
From: Andrii Nakryiko @ 2025-10-06 21:06 UTC (permalink / raw)
  To: Yafang Shao
  Cc: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet, 21cnbao, shakeel.butt, tj, lance.yang, rdunlap, bpf,
	linux-mm, linux-doc, linux-kernel, Mykyta Yatsenko

On Mon, Sep 29, 2025 at 11:00 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> The vma->vm_mm might be NULL and it can be accessed outside of RCU. Thus,
> we can mark it as trusted_or_null. With this change, BPF helpers can safely
> access vma->vm_mm to retrieve the associated mm_struct from the VMA.
> Then we can make policy decisions based on the VMA.
>
> The "trusted" annotation enables direct access to vma->vm_mm within kfuncs
> marked with KF_TRUSTED_ARGS or KF_RCU, such as bpf_task_get_cgroup1() and
> bpf_task_under_cgroup(). Conversely, "null" enforcement requires all
> callsites using vma->vm_mm to perform NULL checks.
>
> The lsm selftest must be modified because it directly accesses vma->vm_mm
> without a NULL pointer check; otherwise it will break due to this
> change.
>
> For the VMA based THP policy, the use case is as follows,
>
>   @mm = @vma->vm_mm; // vm_area_struct::vm_mm is trusted or null
>   if (!@mm)
>       return;
>   bpf_rcu_read_lock(); // rcu lock must be held to dereference the owner
>   @owner = @mm->owner; // mm_struct::owner is rcu trusted or null
>   if (!@owner)
>     goto out;
>   @cgroup1 = bpf_task_get_cgroup1(@owner, MEMCG_HIERARCHY_ID);
>
>   /* make the decision based on the @cgroup1 attribute */
>
>   bpf_cgroup_release(@cgroup1); // release the associated cgroup
> out:
>   bpf_rcu_read_unlock();
>
> PSI memory information can be obtained from the associated cgroup to inform
> policy decisions. Since upstream PSI support is currently limited to cgroup
> v2, the following example demonstrates cgroup v2 implementation:
>
>   @owner = @mm->owner;
>   if (@owner) {
>       // @ancestor_cgid is user-configured
>       @ancestor = bpf_cgroup_from_id(@ancestor_cgid);
>       if (bpf_task_under_cgroup(@owner, @ancestor)) {
>           @psi_group = @ancestor->psi;
>
>           /* Extract PSI metrics from @psi_group and
>            * implement policy logic based on the values
>            */
>
>       }
>   }
>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> ---
>  kernel/bpf/verifier.c                   | 5 +++++
>  tools/testing/selftests/bpf/progs/lsm.c | 8 +++++---
>  2 files changed, 10 insertions(+), 3 deletions(-)
>

Hey Yafang,

This looks like a generally useful change, so I think it would be best
if you can send it as a stand-alone patch to bpf-next to land it
sooner.

Also, am I imagining this, or did you have similar change for the
vm_file field as well? Any reasons to not mark vm_file as trusted as
well?

> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index d400e18ee31e..b708b98f796c 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -7165,6 +7165,10 @@ BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct socket) {
>         struct sock *sk;
>  };
>
> +BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct vm_area_struct) {
> +       struct mm_struct *vm_mm;
> +};
> +
>  static bool type_is_rcu(struct bpf_verifier_env *env,
>                         struct bpf_reg_state *reg,
>                         const char *field_name, u32 btf_id)
> @@ -7206,6 +7210,7 @@ static bool type_is_trusted_or_null(struct bpf_verifier_env *env,
>  {
>         BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct socket));
>         BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct dentry));
> +       BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct vm_area_struct));
>
>         return btf_nested_type_is_trusted(&env->log, reg, field_name, btf_id,
>                                           "__safe_trusted_or_null");
> diff --git a/tools/testing/selftests/bpf/progs/lsm.c b/tools/testing/selftests/bpf/progs/lsm.c
> index 0c13b7409947..7de173daf27b 100644
> --- a/tools/testing/selftests/bpf/progs/lsm.c
> +++ b/tools/testing/selftests/bpf/progs/lsm.c
> @@ -89,14 +89,16 @@ SEC("lsm/file_mprotect")
>  int BPF_PROG(test_int_hook, struct vm_area_struct *vma,
>              unsigned long reqprot, unsigned long prot, int ret)
>  {
> -       if (ret != 0)
> +       struct mm_struct *mm = vma->vm_mm;
> +
> +       if (ret != 0 || !mm)
>                 return ret;
>
>         __s32 pid = bpf_get_current_pid_tgid() >> 32;
>         int is_stack = 0;
>
> -       is_stack = (vma->vm_start <= vma->vm_mm->start_stack &&
> -                   vma->vm_end >= vma->vm_mm->start_stack);
> +       is_stack = (vma->vm_start <= mm->start_stack &&
> +                   vma->vm_end >= mm->start_stack);
>
>         if (is_stack && monitored_pid == pid) {
>                 mprotect_count++;
> --
> 2.47.3
>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v9 mm-new 03/11] mm: thp: add support for BPF based THP order selection
  2025-10-03  2:18   ` Alexei Starovoitov
@ 2025-10-07  8:47     ` Yafang Shao
  2025-10-08  3:25       ` Alexei Starovoitov
  2025-10-08  8:08     ` David Hildenbrand
  1 sibling, 1 reply; 37+ messages in thread
From: Yafang Shao @ 2025-10-07  8:47 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrew Morton, David Hildenbrand, ziy, baolin.wang,
	Lorenzo Stoakes, Liam Howlett, npache, ryan.roberts, dev.jain,
	Johannes Weiner, usamaarif642, gutierrez.asier, Matthew Wilcox,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Amery Hung,
	David Rientjes, Jonathan Corbet, 21cnbao, Shakeel Butt,
	Tejun Heo, lance.yang, Randy Dunlap, bpf, linux-mm,
	open list:DOCUMENTATION, LKML

On Fri, Oct 3, 2025 at 10:18 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Sep 29, 2025 at 10:59 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > +unsigned long bpf_hook_thp_get_orders(struct vm_area_struct *vma,
> > +                                     enum tva_type type,
> > +                                     unsigned long orders)
> > +{
> > +       thp_order_fn_t *bpf_hook_thp_get_order;
> > +       int bpf_order;
> > +
> > +       /* No BPF program is attached */
> > +       if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
> > +                     &transparent_hugepage_flags))
> > +               return orders;
> > +
> > +       rcu_read_lock();
> > +       bpf_hook_thp_get_order = rcu_dereference(bpf_thp.thp_get_order);
> > +       if (WARN_ON_ONCE(!bpf_hook_thp_get_order))
> > +               goto out;
> > +
> > +       bpf_order = bpf_hook_thp_get_order(vma, type, orders);
> > +       orders &= BIT(bpf_order);
> > +
> > +out:
> > +       rcu_read_unlock();
> > +       return orders;
> > +}
>

Hello Alexei,

My apologies for the slow reply. I'm on a family vacation and am
checking email intermittently.

> I thought I explained it earlier.

I recall your earlier suggestion for a cgroup-based approach for
BPF-THP. However, as I mentioned, I believe cgroups might not be the
best fit[0]. My understanding was that we had agreed to move away from
that model. Could we realign on this?

[0].  https://lwn.net/ml/all/CALOAHbBvwT+6f_4gBHzPc9n_SukhAs_sa5yX=AjHYsWic1MRuw@mail.gmail.com/

> Nack to a single global prog approach.

The design of BPF-THP as a global program is a direct consequence of
its purpose: to extend the existing global
/sys/kernel/mm/transparent_hugepage/ interface. This architectural
consistency simplifies both understanding and maintenance.

Crucially, this global nature does not limit policy control. The
program is designed with the flexibility to enforce policies at
multiple levels—globally, per-cgroup, or per-task—enabling all of our
target use cases through a unified mechanism.

>
> The logic must accommodate multiple programs per-container
> or any other way from the beginning.
> If cgroup based scoping doesn't fit use per process tree scoping.

During the initial design of BPF-THP, I evaluated whether a global
program or a per-process program would be more suitable. While a
per-process design would require embedding a struct_ops into
task_struct, this seemed like over-engineering to me. We can
efficiently implement both cgroup-tree-scoped and process-tree-scoped
THP policies using existing BPF helpers, such as:

  SCOPING          BPF kfuncs
  cgroup tree   -> bpf_task_under_cgroup()
  process tree  -> bpf_task_is_ancestors()

With these kfuncs, there is no need to attach individual BPF-THP
programs to every process or cgroup tree. I have not identified a
valid use case that necessitates embedding a struct_ops in task_struct
which can't be achieved more simply with these kfuncs. If such use
cases exist, please detail them. Consequently, I proceeded with a
global struct_ops implementation.
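
Using the same pseudocode convention as the example quoted earlier in
the thread, a cgroup-tree-scoped policy inside the single global hook
could look roughly like this (a sketch only; @ancestor_cgid and
@pmd_order are illustrative, user-configured values, not part of the
proposed interface):

  @order = 0;
  @mm = @vma->vm_mm;
  if (!@mm)
      return 0;
  bpf_rcu_read_lock();
  @owner = @mm->owner;
  if (@owner) {
      @ancestor = bpf_cgroup_from_id(@ancestor_cgid);
      if (@ancestor) {
          if (bpf_task_under_cgroup(@owner, @ancestor))
              @order = @pmd_order;  /* policy for this cgroup tree */
          bpf_cgroup_release(@ancestor);
      }
  }
  bpf_rcu_read_unlock();
  return @order;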

The desire to attach multiple BPF-THP programs simultaneously does not
appear to be a valid use case. Furthermore, our production experience
has shown that multiple attachments often introduce conflicts. This is
precisely why system administrators prefer to manage BPF programs with
a single manager—to avoid undefined behaviors from competing programs.

Focusing specifically on BPF-THP, the semantics of the program make
multiple attachments unsuitable. A BPF-THP program's outcome is its
return value (a suggested THP order), not the side effects of its
execution. In other words, it is functionally a variant of fmod_ret.

If we allow multiple attachments and they return different values, how
do we resolve the conflict?

If one program returns order-9 and another returns order-1, which
value should be chosen? Neither 1, 9, nor a combination (1 & 9) is
appropriate. The only logical solution is to reject subsequent
attachments and explicitly notify the user of the conflict. Our goal
should be to prevent conflicts from the outset, rather than forcing
developers to create another userspace manager to handle them.

A single global program is a natural and logical extension of the
existing global /sys/kernel/mm/transparent_hugepage/ interface. It is
a good fit for BPF-THP and avoids unnecessary complexity.

Please provide a detailed clarification if I have misunderstood your position.

-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v9 mm-new 07/11] bpf: mark vma->vm_mm as __safe_trusted_or_null
  2025-10-06 21:06   ` Andrii Nakryiko
@ 2025-10-07  9:05     ` Yafang Shao
  0 siblings, 0 replies; 37+ messages in thread
From: Yafang Shao @ 2025-10-07  9:05 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet, 21cnbao, shakeel.butt, tj, lance.yang, rdunlap, bpf,
	linux-mm, linux-doc, linux-kernel, Mykyta Yatsenko

On Tue, Oct 7, 2025 at 5:07 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Mon, Sep 29, 2025 at 11:00 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > The vma->vm_mm might be NULL and it can be accessed outside of RCU. Thus,
> > we can mark it as trusted_or_null. With this change, BPF helpers can safely
> > access vma->vm_mm to retrieve the associated mm_struct from the VMA.
> > Then we can make policy decisions based on the VMA.
> >
> > The "trusted" annotation enables direct access to vma->vm_mm within kfuncs
> > marked with KF_TRUSTED_ARGS or KF_RCU, such as bpf_task_get_cgroup1() and
> > bpf_task_under_cgroup(). Conversely, "null" enforcement requires all
> > callsites using vma->vm_mm to perform NULL checks.
> >
> > The lsm selftest must be modified because it directly accesses vma->vm_mm
> > without a NULL pointer check; otherwise it will break due to this
> > change.
> >
> > For the VMA based THP policy, the use case is as follows,
> >
> >   @mm = @vma->vm_mm; // vm_area_struct::vm_mm is trusted or null
> >   if (!@mm)
> >       return;
> >   bpf_rcu_read_lock(); // rcu lock must be held to dereference the owner
> >   @owner = @mm->owner; // mm_struct::owner is rcu trusted or null
> >   if (!@owner)
> >     goto out;
> >   @cgroup1 = bpf_task_get_cgroup1(@owner, MEMCG_HIERARCHY_ID);
> >
> >   /* make the decision based on the @cgroup1 attribute */
> >
> >   bpf_cgroup_release(@cgroup1); // release the associated cgroup
> > out:
> >   bpf_rcu_read_unlock();
> >
> > PSI memory information can be obtained from the associated cgroup to inform
> > policy decisions. Since upstream PSI support is currently limited to cgroup
> > v2, the following example demonstrates cgroup v2 implementation:
> >
> >   @owner = @mm->owner;
> >   if (@owner) {
> >       // @ancestor_cgid is user-configured
> >       @ancestor = bpf_cgroup_from_id(@ancestor_cgid);
> >       if (bpf_task_under_cgroup(@owner, @ancestor)) {
> >           @psi_group = @ancestor->psi;
> >
> >           /* Extract PSI metrics from @psi_group and
> >            * implement policy logic based on the values
> >            */
> >
> >       }
> >   }
> >
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> > ---
> >  kernel/bpf/verifier.c                   | 5 +++++
> >  tools/testing/selftests/bpf/progs/lsm.c | 8 +++++---
> >  2 files changed, 10 insertions(+), 3 deletions(-)
> >
>
> Hey Yafang,
>
> This looks like a generally useful change, so I think it would be best
> if you can send it as a stand-alone patch to bpf-next to land it
> sooner.

Sure. I will do it.

>
> Also, am I imagining this, or did you have a similar change for the
> vm_file field as well? Any reason not to mark vm_file as trusted as
> well?

Marking vm_file as trusted will directly support our follow-up work on
file-backed THP policies, where we need to apply different policies to
different files in production. I will include this change in the same
stand-alone patch. Thanks for the suggestion.

-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v9 mm-new 03/11] mm: thp: add support for BPF based THP order selection
  2025-10-07  8:47     ` Yafang Shao
@ 2025-10-08  3:25       ` Alexei Starovoitov
  2025-10-08  3:50         ` Yafang Shao
  0 siblings, 1 reply; 37+ messages in thread
From: Alexei Starovoitov @ 2025-10-08  3:25 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Andrew Morton, David Hildenbrand, ziy, baolin.wang,
	Lorenzo Stoakes, Liam Howlett, npache, ryan.roberts, dev.jain,
	Johannes Weiner, usamaarif642, gutierrez.asier, Matthew Wilcox,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Amery Hung,
	David Rientjes, Jonathan Corbet, 21cnbao, Shakeel Butt,
	Tejun Heo, lance.yang, Randy Dunlap, bpf, linux-mm,
	open list:DOCUMENTATION, LKML

On Tue, Oct 7, 2025 at 1:47 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> has shown that multiple attachments often introduce conflicts. This is
> precisely why system administrators prefer to manage BPF programs with
> a single manager—to avoid undefined behaviors from competing programs.

I don't believe this a single bit. bpf-thp didn't have any
production exposure. Everything that you said above is wishful thinking.
In actual production every programmable component needs to be
scoped in some way. One can argue that scheduling is a global
property too, yet sched-ext only works on a specific scheduling class.
All bpf program types are scoped except tracing, since kprobe/fentry
are global by definition, and even then multiple tracing programs
can be attached to the same kprobe.

> execution. In other words, it is functionally a variant of fmod_ret.

hid-bpf initially went with fmod_ret approach, deleted the whole thing
and redesigned it with _scoped_ struct-ops.

> If we allow multiple attachments and they return different values, how
> do we resolve the conflict?
>
> If one program returns order-9 and another returns order-1, which
> value should be chosen? Neither 1, 9, nor a combination (1 & 9) is
> appropriate.

No. If you cannot figure out how to stack multiple programs
it means that the api you picked is broken.

> A single global program is a natural and logical extension of the
> existing global /sys/kernel/mm/transparent_hugepage/ interface. It is
> a good fit for BPF-THP and avoids unnecessary complexity.

The Nack to single global prog is not negotiable.



* Re: [PATCH v9 mm-new 03/11] mm: thp: add support for BPF based THP order selection
  2025-10-08  3:25       ` Alexei Starovoitov
@ 2025-10-08  3:50         ` Yafang Shao
  2025-10-08  4:10           ` Alexei Starovoitov
  0 siblings, 1 reply; 37+ messages in thread
From: Yafang Shao @ 2025-10-08  3:50 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrew Morton, David Hildenbrand, ziy, baolin.wang,
	Lorenzo Stoakes, Liam Howlett, npache, ryan.roberts, dev.jain,
	Johannes Weiner, usamaarif642, gutierrez.asier, Matthew Wilcox,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Amery Hung,
	David Rientjes, Jonathan Corbet, 21cnbao, Shakeel Butt,
	Tejun Heo, lance.yang, Randy Dunlap, bpf, linux-mm,
	open list:DOCUMENTATION, LKML

On Wed, Oct 8, 2025 at 11:25 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Oct 7, 2025 at 1:47 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > has shown that multiple attachments often introduce conflicts. This is
> > precisely why system administrators prefer to manage BPF programs with
> > a single manager—to avoid undefined behaviors from competing programs.
>
> I don't believe this a single bit.

You should spend some time seeing how users are actually applying BPF
in practice. Some information for you :

https://github.com/bpfman/bpfman
https://github.com/DataDog/ebpf-manager
https://github.com/ccfos/huatuo

> bpf-thp didn't have any
> production exposure.
>  Everything that you said above is wishful thinking.

The statement above applies to other multi-attachable programs, not to bpf-thp.

> In actual production every programmable component needs to be
> scoped in some way. One can argue that scheduling is a global
> property too, yet sched-ext only works on a specific scheduling class.

I can also argue that bpf-thp only works on specific THP modes
(madvise and always) ;-)

> All bpf program types are scoped except tracing, since kprobe/fentry
> are global by definition, and even than multiple tracing programs
> can be attached to the same kprobe.
>
> > execution. In other words, it is functionally a variant of fmod_ret.
>
> hid-bpf initially went with fmod_ret approach, deleted the whole thing
> and redesigned it with _scoped_ struct-ops.

I see little value in embedding a bpf_thp_struct_ops into the
task_struct. The benefits don't appear to justify the added
complexity.

>
> > If we allow multiple attachments and they return different values, how
> > do we resolve the conflict?
> >
> > If one program returns order-9 and another returns order-1, which
> > value should be chosen? Neither 1, 9, nor a combination (1 & 9) is
> > appropriate.
>
> No. If you cannot figure out how to stack multiple programs
> it means that the api you picked is broken.
>
> > A single global program is a natural and logical extension of the
> > existing global /sys/kernel/mm/transparent_hugepage/ interface. It is
> > a good fit for BPF-THP and avoids unnecessary complexity.
>
> The Nack to single global prog is not negotiable.

We still lack a compelling technical reason for embedding
bpf_thp_struct_ops into task_struct. Can you clearly articulate the
problem that this specific design is solving?

-- 
Regards
Yafang



* Re: [PATCH v9 mm-new 03/11] mm: thp: add support for BPF based THP order selection
  2025-10-08  3:50         ` Yafang Shao
@ 2025-10-08  4:10           ` Alexei Starovoitov
  2025-10-08  4:25             ` Yafang Shao
  0 siblings, 1 reply; 37+ messages in thread
From: Alexei Starovoitov @ 2025-10-08  4:10 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Andrew Morton, David Hildenbrand, ziy, baolin.wang,
	Lorenzo Stoakes, Liam Howlett, npache, ryan.roberts, dev.jain,
	Johannes Weiner, usamaarif642, gutierrez.asier, Matthew Wilcox,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Amery Hung,
	David Rientjes, Jonathan Corbet, 21cnbao, Shakeel Butt,
	Tejun Heo, lance.yang, Randy Dunlap, bpf, linux-mm,
	open list:DOCUMENTATION, LKML

On Tue, Oct 7, 2025 at 8:51 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Wed, Oct 8, 2025 at 11:25 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Tue, Oct 7, 2025 at 1:47 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > has shown that multiple attachments often introduce conflicts. This is
> > > precisely why system administrators prefer to manage BPF programs with
> > > a single manager—to avoid undefined behaviors from competing programs.
> >
> > I don't believe this a single bit.
>
> You should spend some time seeing how users are actually applying BPF
> in practice. Some information for you :
>
> https://github.com/bpfman/bpfman
> https://github.com/DataDog/ebpf-manager
> https://github.com/ccfos/huatuo

By seeing the above you learned the wrong lesson.
These orchestrators and many others were created because
we made mistakes in the kernel by not scoping the progs enough.
XDP is a prime example. It allows one program per netdev.
This was a massive mistake which we're still trying to fix.

> > hid-bpf initially went with fmod_ret approach, deleted the whole thing
> > and redesigned it with _scoped_ struct-ops.
>
> I see little value in embedding a bpf_thp_struct_ops into the
> task_struct. The benefits don't appear to justify the added
> complexity.

huh? where did I say that struct-ops should be embedded in task_struct ?



* Re: [PATCH v9 mm-new 03/11] mm: thp: add support for BPF based THP order selection
  2025-10-08  4:10           ` Alexei Starovoitov
@ 2025-10-08  4:25             ` Yafang Shao
  2025-10-08  4:39               ` Alexei Starovoitov
  0 siblings, 1 reply; 37+ messages in thread
From: Yafang Shao @ 2025-10-08  4:25 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrew Morton, David Hildenbrand, ziy, baolin.wang,
	Lorenzo Stoakes, Liam Howlett, npache, ryan.roberts, dev.jain,
	Johannes Weiner, usamaarif642, gutierrez.asier, Matthew Wilcox,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Amery Hung,
	David Rientjes, Jonathan Corbet, 21cnbao, Shakeel Butt,
	Tejun Heo, lance.yang, Randy Dunlap, bpf, linux-mm,
	open list:DOCUMENTATION, LKML

On Wed, Oct 8, 2025 at 12:10 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Oct 7, 2025 at 8:51 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Wed, Oct 8, 2025 at 11:25 AM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Tue, Oct 7, 2025 at 1:47 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > has shown that multiple attachments often introduce conflicts. This is
> > > > precisely why system administrators prefer to manage BPF programs with
> > > > a single manager—to avoid undefined behaviors from competing programs.
> > >
> > > I don't believe this a single bit.
> >
> > You should spend some time seeing how users are actually applying BPF
> > in practice. Some information for you :
> >
> > https://github.com/bpfman/bpfman
> > https://github.com/DataDog/ebpf-manager
> > https://github.com/ccfos/huatuo
>
> By seeing the above you learned the wrong lesson.
> These orchestrators and many others were created because
> we made mistakes in the kernel by not scoping the progs enough.
> XDP is a prime example. It allows one program per netdev.
> This was a massive mistake which we're still trying to fix.

Since we don't use XDP in production, I can't comment on it. However,
for our multi-attachable cgroup BPF programs, a key issue arises: if a
program has permission to attach to one cgroup, it can attach to any
cgroup. While scoping enables attachment to individual cgroups, it
does not enforce isolation. This means we must still check for
conflicts between programs, which begs the question: what is the
functional purpose of this scoping mechanism?

>
> > > hid-bpf initially went with fmod_ret approach, deleted the whole thing
> > > and redesigned it with _scoped_ struct-ops.
> >
> > I see little value in embedding a bpf_thp_struct_ops into the
> > task_struct. The benefits don't appear to justify the added
> > complexity.
>
> huh? where did I say that struct-ops should be embedded in task_struct ?

Given that, what would you propose?
My position is that the only valid scope for bpf-thp is at the level
of specific THP modes like madvise and always. This patch correctly
implements that precise design.

--
Regards
Yafang



* Re: [PATCH v9 mm-new 03/11] mm: thp: add support for BPF based THP order selection
  2025-10-08  4:25             ` Yafang Shao
@ 2025-10-08  4:39               ` Alexei Starovoitov
  2025-10-08  6:02                 ` Yafang Shao
  0 siblings, 1 reply; 37+ messages in thread
From: Alexei Starovoitov @ 2025-10-08  4:39 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Andrew Morton, David Hildenbrand, ziy, baolin.wang,
	Lorenzo Stoakes, Liam Howlett, npache, ryan.roberts, dev.jain,
	Johannes Weiner, usamaarif642, gutierrez.asier, Matthew Wilcox,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Amery Hung,
	David Rientjes, Jonathan Corbet, 21cnbao, Shakeel Butt,
	Tejun Heo, lance.yang, Randy Dunlap, bpf, linux-mm,
	open list:DOCUMENTATION, LKML

On Tue, Oct 7, 2025 at 9:25 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Wed, Oct 8, 2025 at 12:10 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Tue, Oct 7, 2025 at 8:51 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > On Wed, Oct 8, 2025 at 11:25 AM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Tue, Oct 7, 2025 at 1:47 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > > has shown that multiple attachments often introduce conflicts. This is
> > > > > precisely why system administrators prefer to manage BPF programs with
> > > > > a single manager—to avoid undefined behaviors from competing programs.
> > > >
> > > > I don't believe this a single bit.
> > >
> > > You should spend some time seeing how users are actually applying BPF
> > > in practice. Some information for you :
> > >
> > > https://github.com/bpfman/bpfman
> > > https://github.com/DataDog/ebpf-manager
> > > https://github.com/ccfos/huatuo
> >
> > By seeing the above you learned the wrong lesson.
> > These orchestrators and many others were created because
> > we made mistakes in the kernel by not scoping the progs enough.
> > XDP is a prime example. It allows one program per netdev.
> > This was a massive mistake which we're still trying to fix.
>
> Since we don't use XDP in production, I can't comment on it. However,
> for our multi-attachable cgroup BPF programs, a key issue arises: if a
> program has permission to attach to one cgroup, it can attach to any
> cgroup. While scoping enables attachment to individual cgroups, it
> does not enforce isolation. This means we must still check for
> conflicts between programs, which begs the question: what is the
> functional purpose of this scoping mechanism?

cgroup mprog was added to remove the need for an orchestrator.

> My position is that the only valid scope for bpf-thp is at the level
> of specific THP modes like madvise and always. This patch correctly
> implements that precise design.

I'm done with this thread.

Nacked-by: Alexei Starovoitov <ast@kernel.org>



* Re: [PATCH v9 mm-new 03/11] mm: thp: add support for BPF based THP order selection
  2025-10-08  4:39               ` Alexei Starovoitov
@ 2025-10-08  6:02                 ` Yafang Shao
  0 siblings, 0 replies; 37+ messages in thread
From: Yafang Shao @ 2025-10-08  6:02 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrew Morton, David Hildenbrand, ziy, baolin.wang,
	Lorenzo Stoakes, Liam Howlett, npache, ryan.roberts, dev.jain,
	Johannes Weiner, usamaarif642, gutierrez.asier, Matthew Wilcox,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Amery Hung,
	David Rientjes, Jonathan Corbet, 21cnbao, Shakeel Butt,
	Tejun Heo, lance.yang, Randy Dunlap, bpf, linux-mm,
	open list:DOCUMENTATION, LKML

On Wed, Oct 8, 2025 at 12:39 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Oct 7, 2025 at 9:25 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Wed, Oct 8, 2025 at 12:10 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Tue, Oct 7, 2025 at 8:51 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > On Wed, Oct 8, 2025 at 11:25 AM Alexei Starovoitov
> > > > <alexei.starovoitov@gmail.com> wrote:
> > > > >
> > > > > On Tue, Oct 7, 2025 at 1:47 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > > > has shown that multiple attachments often introduce conflicts. This is
> > > > > > precisely why system administrators prefer to manage BPF programs with
> > > > > > a single manager—to avoid undefined behaviors from competing programs.
> > > > >
> > > > > I don't believe this a single bit.
> > > >
> > > > You should spend some time seeing how users are actually applying BPF
> > > > in practice. Some information for you :
> > > >
> > > > https://github.com/bpfman/bpfman
> > > > https://github.com/DataDog/ebpf-manager
> > > > https://github.com/ccfos/huatuo
> > >
> > > By seeing the above you learned the wrong lesson.
> > > These orchestrators and many others were created because
> > > we made mistakes in the kernel by not scoping the progs enough.
> > > XDP is a prime example. It allows one program per netdev.
> > > This was a massive mistake which we're still trying to fix.
> >
> > Since we don't use XDP in production, I can't comment on it. However,
> > for our multi-attachable cgroup BPF programs, a key issue arises: if a
> > program has permission to attach to one cgroup, it can attach to any
> > cgroup. While scoping enables attachment to individual cgroups, it
> > does not enforce isolation. This means we must still check for
> > conflicts between programs, which begs the question: what is the
> > functional purpose of this scoping mechanism?
>
> cgroup mprog was added to remove the need for an orchestrator.

However, this approach would still require a userspace manager to
coordinate the mprog attachments and prevent conflicts between
different programs, no?

>
> > My position is that the only valid scope for bpf-thp is at the level
> > of specific THP modes like madvise and always. This patch correctly
> > implements that precise design.
>
> I'm done with this thread.
>
> Nacked-by: Alexei Starovoitov <ast@kernel.org>

Given its experimental status, I believe any scoping mechanism would
be premature and over-engineered. Even integrating it into the
mm_struct introduces unnecessary complexity at this stage.

-- 
Regards
Yafang



* Re: [PATCH v9 mm-new 03/11] mm: thp: add support for BPF based THP order selection
  2025-10-03  2:18   ` Alexei Starovoitov
  2025-10-07  8:47     ` Yafang Shao
@ 2025-10-08  8:08     ` David Hildenbrand
  2025-10-08  8:18       ` Yafang Shao
  1 sibling, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2025-10-08  8:08 UTC (permalink / raw)
  To: Alexei Starovoitov, Yafang Shao
  Cc: Andrew Morton, ziy, baolin.wang, Lorenzo Stoakes, Liam Howlett,
	npache, ryan.roberts, dev.jain, Johannes Weiner, usamaarif642,
	gutierrez.asier, Matthew Wilcox, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Amery Hung, David Rientjes,
	Jonathan Corbet, 21cnbao, Shakeel Butt, Tejun Heo, lance.yang,
	Randy Dunlap, bpf, linux-mm, open list:DOCUMENTATION, LKML

On 03.10.25 04:18, Alexei Starovoitov wrote:
> On Mon, Sep 29, 2025 at 10:59 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>>
>> +unsigned long bpf_hook_thp_get_orders(struct vm_area_struct *vma,
>> +                                     enum tva_type type,
>> +                                     unsigned long orders)
>> +{
>> +       thp_order_fn_t *bpf_hook_thp_get_order;
>> +       int bpf_order;
>> +
>> +       /* No BPF program is attached */
>> +       if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
>> +                     &transparent_hugepage_flags))
>> +               return orders;
>> +
>> +       rcu_read_lock();
>> +       bpf_hook_thp_get_order = rcu_dereference(bpf_thp.thp_get_order);
>> +       if (WARN_ON_ONCE(!bpf_hook_thp_get_order))
>> +               goto out;
>> +
>> +       bpf_order = bpf_hook_thp_get_order(vma, type, orders);
>> +       orders &= BIT(bpf_order);
>> +
>> +out:
>> +       rcu_read_unlock();
>> +       return orders;
>> +}
> 
> I thought I explained it earlier.
> Nack to a single global prog approach.

I agree. We should have the option to either specify a policy globally, 
or more refined for cgroups/processes.

It's an interesting question if a program would ever want to ship its 
own policy: I can see use cases for that.

So I agree that we should make it more flexible right from the start.
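For reference, the narrowing semantics of the hook quoted above can be modeled in user space as follows (a sketch only; the BIT() intersection means the BPF program selects at most one order from the caller-supplied mask):

```c
#define BIT(n) (1UL << (n))

/* User-space model of the quoted bpf_hook_thp_get_orders() narrowing:
 * the BPF program returns one order, and the kernel intersects it with
 * the caller-supplied mask, so at most one order survives. */
static unsigned long select_orders(unsigned long orders, int bpf_order)
{
	return orders & BIT(bpf_order);
}
```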

-- 
Cheers

David / dhildenb




* Re: [PATCH v9 mm-new 03/11] mm: thp: add support for BPF based THP order selection
  2025-10-08  8:08     ` David Hildenbrand
@ 2025-10-08  8:18       ` Yafang Shao
  2025-10-08  8:28         ` David Hildenbrand
  0 siblings, 1 reply; 37+ messages in thread
From: Yafang Shao @ 2025-10-08  8:18 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alexei Starovoitov, Andrew Morton, ziy, baolin.wang,
	Lorenzo Stoakes, Liam Howlett, npache, ryan.roberts, dev.jain,
	Johannes Weiner, usamaarif642, gutierrez.asier, Matthew Wilcox,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Amery Hung,
	David Rientjes, Jonathan Corbet, 21cnbao, Shakeel Butt,
	Tejun Heo, lance.yang, Randy Dunlap, bpf, linux-mm,
	open list:DOCUMENTATION, LKML

On Wed, Oct 8, 2025 at 4:08 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 03.10.25 04:18, Alexei Starovoitov wrote:
> > On Mon, Sep 29, 2025 at 10:59 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >>
> >> +unsigned long bpf_hook_thp_get_orders(struct vm_area_struct *vma,
> >> +                                     enum tva_type type,
> >> +                                     unsigned long orders)
> >> +{
> >> +       thp_order_fn_t *bpf_hook_thp_get_order;
> >> +       int bpf_order;
> >> +
> >> +       /* No BPF program is attached */
> >> +       if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
> >> +                     &transparent_hugepage_flags))
> >> +               return orders;
> >> +
> >> +       rcu_read_lock();
> >> +       bpf_hook_thp_get_order = rcu_dereference(bpf_thp.thp_get_order);
> >> +       if (WARN_ON_ONCE(!bpf_hook_thp_get_order))
> >> +               goto out;
> >> +
> >> +       bpf_order = bpf_hook_thp_get_order(vma, type, orders);
> >> +       orders &= BIT(bpf_order);
> >> +
> >> +out:
> >> +       rcu_read_unlock();
> >> +       return orders;
> >> +}
> >
> > I thought I explained it earlier.
> > Nack to a single global prog approach.
>
> I agree. We should have the option to either specify a policy globally,
> or more refined for cgroups/processes.
>
> It's an interesting question if a program would ever want to ship its
> own policy: I can see use cases for that.
>
> So I agree that we should make it more flexible right from the start.

To achieve per-process granularity, the struct-ops must be embedded
within the mm_struct as follows:

+#ifdef CONFIG_BPF_MM
+struct bpf_mm_ops {
+#ifdef CONFIG_BPF_THP
+       struct bpf_thp_ops bpf_thp;
+#endif
+};
+#endif
+
 /*
  * Opaque type representing current mm_struct flag state. Must be accessed via
  * mm_flags_xxx() helper functions.
@@ -1268,6 +1281,10 @@ struct mm_struct {
 #ifdef CONFIG_MM_ID
                mm_id_t mm_id;
 #endif /* CONFIG_MM_ID */
+
+#ifdef CONFIG_BPF_MM
+               struct bpf_mm_ops bpf_mm;
+#endif
        } __randomize_layout;

We should be aware that this will involve extensive changes in mm/. If
we're aligned on this direction, I'll start working on the patches.
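As a minimal user-space sketch of the lifecycle handling such an embedded ops
pointer would force onto the fork/exit paths (all names are hypothetical;
nothing here is from the series):

```c
#include <stdlib.h>

/* Hypothetical model: if each mm carries its own ops pointer, the
 * fork and exit paths must manage a reference count on it. */
struct bpf_thp_ops { int refcnt; };
struct mm { struct bpf_thp_ops *thp_ops; };

static struct bpf_thp_ops *ops_get(struct bpf_thp_ops *ops)
{
	if (ops)
		ops->refcnt++;
	return ops;
}

static void ops_put(struct bpf_thp_ops *ops)
{
	if (ops && --ops->refcnt == 0)
		free(ops);
}

/* dup_mm(): the child inherits the parent's policy by taking a ref. */
static void mm_inherit(struct mm *child, const struct mm *parent)
{
	child->thp_ops = ops_get(parent->thp_ops);
}

/* exit_mm()/mmput(): drop the ref; the last user frees the ops. */
static void mm_release(struct mm *mm)
{
	ops_put(mm->thp_ops);
	mm->thp_ops = NULL;
}
```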

-- 
Regards
Yafang



* Re: [PATCH v9 mm-new 03/11] mm: thp: add support for BPF based THP order selection
  2025-10-08  8:18       ` Yafang Shao
@ 2025-10-08  8:28         ` David Hildenbrand
  2025-10-08  9:04           ` Yafang Shao
  0 siblings, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2025-10-08  8:28 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Alexei Starovoitov, Andrew Morton, ziy, baolin.wang,
	Lorenzo Stoakes, Liam Howlett, npache, ryan.roberts, dev.jain,
	Johannes Weiner, usamaarif642, gutierrez.asier, Matthew Wilcox,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Amery Hung,
	David Rientjes, Jonathan Corbet, 21cnbao, Shakeel Butt,
	Tejun Heo, lance.yang, Randy Dunlap, bpf, linux-mm,
	open list:DOCUMENTATION, LKML

On 08.10.25 10:18, Yafang Shao wrote:
> On Wed, Oct 8, 2025 at 4:08 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 03.10.25 04:18, Alexei Starovoitov wrote:
>>> On Mon, Sep 29, 2025 at 10:59 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>>>>
>>>> +unsigned long bpf_hook_thp_get_orders(struct vm_area_struct *vma,
>>>> +                                     enum tva_type type,
>>>> +                                     unsigned long orders)
>>>> +{
>>>> +       thp_order_fn_t *bpf_hook_thp_get_order;
>>>> +       int bpf_order;
>>>> +
>>>> +       /* No BPF program is attached */
>>>> +       if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
>>>> +                     &transparent_hugepage_flags))
>>>> +               return orders;
>>>> +
>>>> +       rcu_read_lock();
>>>> +       bpf_hook_thp_get_order = rcu_dereference(bpf_thp.thp_get_order);
>>>> +       if (WARN_ON_ONCE(!bpf_hook_thp_get_order))
>>>> +               goto out;
>>>> +
>>>> +       bpf_order = bpf_hook_thp_get_order(vma, type, orders);
>>>> +       orders &= BIT(bpf_order);
>>>> +
>>>> +out:
>>>> +       rcu_read_unlock();
>>>> +       return orders;
>>>> +}
>>>
>>> I thought I explained it earlier.
>>> Nack to a single global prog approach.
>>
>> I agree. We should have the option to either specify a policy globally,
>> or more refined for cgroups/processes.
>>
>> It's an interesting question if a program would ever want to ship its
>> own policy: I can see use cases for that.
>>
>> So I agree that we should make it more flexible right from the start.
> 
> To achieve per-process granularity, the struct-ops must be embedded
> within the mm_struct as follows:
> 
> +#ifdef CONFIG_BPF_MM
> +struct bpf_mm_ops {
> +#ifdef CONFIG_BPF_THP
> +       struct bpf_thp_ops bpf_thp;
> +#endif
> +};
> +#endif
> +
>   /*
>    * Opaque type representing current mm_struct flag state. Must be accessed via
>    * mm_flags_xxx() helper functions.
> @@ -1268,6 +1281,10 @@ struct mm_struct {
>   #ifdef CONFIG_MM_ID
>                  mm_id_t mm_id;
>   #endif /* CONFIG_MM_ID */
> +
> +#ifdef CONFIG_BPF_MM
> +               struct bpf_mm_ops bpf_mm;
> +#endif
>          } __randomize_layout;
> 
> We should be aware that this will involve extensive changes in mm/.

That's what we do on linux-mm :)

It would be great to use Alexei's feedback/experience to come up with 
something that is flexible for various use cases.

So I think this is likely the right direction.

It would be great to evaluate which scenarios we could unlock with this 
(global vs. per-process vs. per-cgroup) approach, and how 
extensive/involved the changes will be.

If we need a slot in the bi-weekly mm alignment session to brainstorm, 
we can ask Dave R. for one in the upcoming weeks.

-- 
Cheers

David / dhildenb




* Re: [PATCH v9 mm-new 03/11] mm: thp: add support for BPF based THP order selection
  2025-10-08  8:28         ` David Hildenbrand
@ 2025-10-08  9:04           ` Yafang Shao
  2025-10-08 11:27             ` Zi Yan
  0 siblings, 1 reply; 37+ messages in thread
From: Yafang Shao @ 2025-10-08  9:04 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alexei Starovoitov, Andrew Morton, ziy, baolin.wang,
	Lorenzo Stoakes, Liam Howlett, npache, ryan.roberts, dev.jain,
	Johannes Weiner, usamaarif642, gutierrez.asier, Matthew Wilcox,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Amery Hung,
	David Rientjes, Jonathan Corbet, 21cnbao, Shakeel Butt,
	Tejun Heo, lance.yang, Randy Dunlap, bpf, linux-mm,
	open list:DOCUMENTATION, LKML

On Wed, Oct 8, 2025 at 4:28 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 08.10.25 10:18, Yafang Shao wrote:
> > On Wed, Oct 8, 2025 at 4:08 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 03.10.25 04:18, Alexei Starovoitov wrote:
> >>> On Mon, Sep 29, 2025 at 10:59 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >>>>
> >>>> +unsigned long bpf_hook_thp_get_orders(struct vm_area_struct *vma,
> >>>> +                                     enum tva_type type,
> >>>> +                                     unsigned long orders)
> >>>> +{
> >>>> +       thp_order_fn_t *bpf_hook_thp_get_order;
> >>>> +       int bpf_order;
> >>>> +
> >>>> +       /* No BPF program is attached */
> >>>> +       if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
> >>>> +                     &transparent_hugepage_flags))
> >>>> +               return orders;
> >>>> +
> >>>> +       rcu_read_lock();
> >>>> +       bpf_hook_thp_get_order = rcu_dereference(bpf_thp.thp_get_order);
> >>>> +       if (WARN_ON_ONCE(!bpf_hook_thp_get_order))
> >>>> +               goto out;
> >>>> +
> >>>> +       bpf_order = bpf_hook_thp_get_order(vma, type, orders);
> >>>> +       orders &= BIT(bpf_order);
> >>>> +
> >>>> +out:
> >>>> +       rcu_read_unlock();
> >>>> +       return orders;
> >>>> +}
> >>>
> >>> I thought I explained it earlier.
> >>> Nack to a single global prog approach.
> >>
> >> I agree. We should have the option to either specify a policy globally,
> >> or more refined for cgroups/processes.
> >>
> >> It's an interesting question if a program would ever want to ship its
> >> own policy: I can see use cases for that.
> >>
> >> So I agree that we should make it more flexible right from the start.
> >
> > To achieve per-process granularity, the struct-ops must be embedded
> > within the mm_struct as follows:
> >
> > +#ifdef CONFIG_BPF_MM
> > +struct bpf_mm_ops {
> > +#ifdef CONFIG_BPF_THP
> > +       struct bpf_thp_ops bpf_thp;
> > +#endif
> > +};
> > +#endif
> > +
> >   /*
> >    * Opaque type representing current mm_struct flag state. Must be accessed via
> >    * mm_flags_xxx() helper functions.
> > @@ -1268,6 +1281,10 @@ struct mm_struct {
> >   #ifdef CONFIG_MM_ID
> >                  mm_id_t mm_id;
> >   #endif /* CONFIG_MM_ID */
> > +
> > +#ifdef CONFIG_BPF_MM
> > +               struct bpf_mm_ops bpf_mm;
> > +#endif
> >          } __randomize_layout;
> >
> > We should be aware that this will involve extensive changes in mm/.
>
> That's what we do on linux-mm :)
>
> It would be great to use Alexei's feedback/experience to come up with
> something that is flexible for various use cases.

I'm still not entirely convinced that allowing individual processes or
cgroups to run independent progs is a valid use case. However, since
we have a consensus that this is the right direction, I will proceed
with this approach.

>
> So I think this is likely the right direction.
>
> It would be great to evaluate which scenarios we could unlock with this
> (global vs. per-process vs. per-cgroup) approach, and how
> extensive/involved the changes will be.

1. Global Approach
   - Pros:
     Simple;
     can manage different THP policies for different cgroups or processes.
   - Cons:
     Does not allow individual processes to run their own BPF programs.

2. Per-Process Approach
   - Pros:
     Enables each process to run its own BPF program.
   - Cons:
     Introduces significant complexity, as it requires handling the
     BPF program's lifecycle (creation, destruction, inheritance) within
     every mm_struct.

3. Per-Cgroup Approach
   - Pros:
     Allows individual cgroups to run their own BPF programs.
     Less complex than the per-process model, as it can leverage the
     existing cgroup operations structure.
   - Cons:
     Creates a dependency on the cgroup subsystem.
     Might not be easy to control at the per-process level.

>
> If we need a slot in the bi-weekly mm alignment session to brainstorm,
> we can ask Dave R. for one in the upcoming weeks.

I will draft an RFC to outline the required changes in both the mm/
and bpf/ subsystems and solicit feedback.

--
Regards
Yafang



* Re: [PATCH v9 mm-new 03/11] mm: thp: add support for BPF based THP order selection
  2025-10-08  9:04           ` Yafang Shao
@ 2025-10-08 11:27             ` Zi Yan
  2025-10-08 12:06               ` Yafang Shao
  2025-10-08 12:07               ` David Hildenbrand
  0 siblings, 2 replies; 37+ messages in thread
From: Zi Yan @ 2025-10-08 11:27 UTC (permalink / raw)
  To: Yafang Shao, David Hildenbrand, Alexei Starovoitov, Johannes Weiner
  Cc: Andrew Morton, baolin.wang, Lorenzo Stoakes, Liam Howlett,
	npache, ryan.roberts, dev.jain, usamaarif642, gutierrez.asier,
	Matthew Wilcox, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Amery Hung, David Rientjes, Jonathan Corbet,
	21cnbao, Shakeel Butt, Tejun Heo, lance.yang, Randy Dunlap, bpf,
	linux-mm, open list:DOCUMENTATION, LKML

On 8 Oct 2025, at 5:04, Yafang Shao wrote:

> On Wed, Oct 8, 2025 at 4:28 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 08.10.25 10:18, Yafang Shao wrote:
>>> On Wed, Oct 8, 2025 at 4:08 PM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 03.10.25 04:18, Alexei Starovoitov wrote:
>>>>> On Mon, Sep 29, 2025 at 10:59 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>>>>>>
>>>>>> +unsigned long bpf_hook_thp_get_orders(struct vm_area_struct *vma,
>>>>>> +                                     enum tva_type type,
>>>>>> +                                     unsigned long orders)
>>>>>> +{
>>>>>> +       thp_order_fn_t *bpf_hook_thp_get_order;
>>>>>> +       int bpf_order;
>>>>>> +
>>>>>> +       /* No BPF program is attached */
>>>>>> +       if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
>>>>>> +                     &transparent_hugepage_flags))
>>>>>> +               return orders;
>>>>>> +
>>>>>> +       rcu_read_lock();
>>>>>> +       bpf_hook_thp_get_order = rcu_dereference(bpf_thp.thp_get_order);
>>>>>> +       if (WARN_ON_ONCE(!bpf_hook_thp_get_order))
>>>>>> +               goto out;
>>>>>> +
>>>>>> +       bpf_order = bpf_hook_thp_get_order(vma, type, orders);
>>>>>> +       orders &= BIT(bpf_order);
>>>>>> +
>>>>>> +out:
>>>>>> +       rcu_read_unlock();
>>>>>> +       return orders;
>>>>>> +}
>>>>>
>>>>> I thought I explained it earlier.
>>>>> Nack to a single global prog approach.
>>>>
>>>> I agree. We should have the option to either specify a policy globally,
>>>> or more refined for cgroups/processes.
>>>>
>>>> It's an interesting question if a program would ever want to ship its
>>>> own policy: I can see use cases for that.
>>>>
>>>> So I agree that we should make it more flexible right from the start.
>>>
>>> To achieve per-process granularity, the struct-ops must be embedded
>>> within the mm_struct as follows:
>>>
>>> +#ifdef CONFIG_BPF_MM
>>> +struct bpf_mm_ops {
>>> +#ifdef CONFIG_BPF_THP
>>> +       struct bpf_thp_ops bpf_thp;
>>> +#endif
>>> +};
>>> +#endif
>>> +
>>>   /*
>>>    * Opaque type representing current mm_struct flag state. Must be accessed via
>>>    * mm_flags_xxx() helper functions.
>>> @@ -1268,6 +1281,10 @@ struct mm_struct {
>>>   #ifdef CONFIG_MM_ID
>>>                  mm_id_t mm_id;
>>>   #endif /* CONFIG_MM_ID */
>>> +
>>> +#ifdef CONFIG_BPF_MM
>>> +               struct bpf_mm_ops bpf_mm;
>>> +#endif
>>>          } __randomize_layout;
>>>
>>> We should be aware that this will involve extensive changes in mm/.
>>
>> That's what we do on linux-mm :)
>>
>> It would be great to use Alexei's feedback/experience to come up with
>> something that is flexible for various use cases.
>
> I'm still not entirely convinced that allowing individual processes or
> cgroups to run independent progs is a valid use case. However, since
> we have a consensus that this is the right direction, I will proceed
> with this approach.
>
>>
>> So I think this is likely the right direction.
>>
>> It would be great to evaluate which scenarios we could unlock with this
>> (global vs. per-process vs. per-cgroup) approach, and how
>> extensive/involved the changes will be.
>
> 1. Global Approach
>    - Pros:
>      Simple;
>      Can manage different THP policies for different cgroups or processes.
>   - Cons:
>      Does not allow individual processes to run their own BPF programs.
>
> 2. Per-Process Approach
>     - Pros:
>       Enables each process to run its own BPF program.
>     - Cons:
>       Introduces significant complexity, as it requires handling the
> BPF program's lifecycle (creation, destruction, inheritance) within
> every mm_struct.
>
> 3. Per-Cgroup Approach
>     - Pros:
>        Allows individual cgroups to run their own BPF programs.
>        Less complex than the per-process model, as it can leverage the
> existing cgroup operations structure.
>     - Cons:
>        Creates a dependency on the cgroup subsystem.
>        might not be easy to control at the per-process level.

Another issue is how, and by whom, hierarchical cgroups should be handled,
where one cgroup is a parent of another. Should the BPF program deal with
that, or the mm code? I remember the hierarchy is the main reason THP
control at the cgroup level was rejected. If we do per-cgroup BPF control,
wouldn't we get the same rejection from the cgroup folks?


>
>>
>> If we need a slot in the bi-weekly mm alignment session to brainstorm,
>> we can ask Dave R. for one in the upcoming weeks.
>
> I will draft an RFC to outline the required changes in both the mm/
> and bpf/ subsystems and solicit feedback.
>
> --
> Regards
> Yafang


--
Best Regards,
Yan, Zi



* Re: [PATCH v9 mm-new 03/11] mm: thp: add support for BPF based THP order selection
  2025-10-08 11:27             ` Zi Yan
@ 2025-10-08 12:06               ` Yafang Shao
  2025-10-08 12:49                 ` Gutierrez Asier
  2025-10-08 12:07               ` David Hildenbrand
  1 sibling, 1 reply; 37+ messages in thread
From: Yafang Shao @ 2025-10-08 12:06 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand, Alexei Starovoitov, Johannes Weiner,
	Andrew Morton, baolin.wang, Lorenzo Stoakes, Liam Howlett,
	npache, ryan.roberts, dev.jain, usamaarif642, gutierrez.asier,
	Matthew Wilcox, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Amery Hung, David Rientjes, Jonathan Corbet,
	21cnbao, Shakeel Butt, Tejun Heo, lance.yang, Randy Dunlap, bpf,
	linux-mm, open list:DOCUMENTATION, LKML

On Wed, Oct 8, 2025 at 7:27 PM Zi Yan <ziy@nvidia.com> wrote:
>
> On 8 Oct 2025, at 5:04, Yafang Shao wrote:
>
> > On Wed, Oct 8, 2025 at 4:28 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 08.10.25 10:18, Yafang Shao wrote:
> >>> On Wed, Oct 8, 2025 at 4:08 PM David Hildenbrand <david@redhat.com> wrote:
> >>>>
> >>>> On 03.10.25 04:18, Alexei Starovoitov wrote:
> >>>>> On Mon, Sep 29, 2025 at 10:59 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >>>>>>
> >>>>>> +unsigned long bpf_hook_thp_get_orders(struct vm_area_struct *vma,
> >>>>>> +                                     enum tva_type type,
> >>>>>> +                                     unsigned long orders)
> >>>>>> +{
> >>>>>> +       thp_order_fn_t *bpf_hook_thp_get_order;
> >>>>>> +       int bpf_order;
> >>>>>> +
> >>>>>> +       /* No BPF program is attached */
> >>>>>> +       if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
> >>>>>> +                     &transparent_hugepage_flags))
> >>>>>> +               return orders;
> >>>>>> +
> >>>>>> +       rcu_read_lock();
> >>>>>> +       bpf_hook_thp_get_order = rcu_dereference(bpf_thp.thp_get_order);
> >>>>>> +       if (WARN_ON_ONCE(!bpf_hook_thp_get_order))
> >>>>>> +               goto out;
> >>>>>> +
> >>>>>> +       bpf_order = bpf_hook_thp_get_order(vma, type, orders);
> >>>>>> +       orders &= BIT(bpf_order);
> >>>>>> +
> >>>>>> +out:
> >>>>>> +       rcu_read_unlock();
> >>>>>> +       return orders;
> >>>>>> +}
> >>>>>
> >>>>> I thought I explained it earlier.
> >>>>> Nack to a single global prog approach.
> >>>>
> >>>> I agree. We should have the option to either specify a policy globally,
> >>>> or more refined for cgroups/processes.
> >>>>
> >>>> It's an interesting question if a program would ever want to ship its
> >>>> own policy: I can see use cases for that.
> >>>>
> >>>> So I agree that we should make it more flexible right from the start.
> >>>
> >>> To achieve per-process granularity, the struct-ops must be embedded
> >>> within the mm_struct as follows:
> >>>
> >>> +#ifdef CONFIG_BPF_MM
> >>> +struct bpf_mm_ops {
> >>> +#ifdef CONFIG_BPF_THP
> >>> +       struct bpf_thp_ops bpf_thp;
> >>> +#endif
> >>> +};
> >>> +#endif
> >>> +
> >>>   /*
> >>>    * Opaque type representing current mm_struct flag state. Must be accessed via
> >>>    * mm_flags_xxx() helper functions.
> >>> @@ -1268,6 +1281,10 @@ struct mm_struct {
> >>>   #ifdef CONFIG_MM_ID
> >>>                  mm_id_t mm_id;
> >>>   #endif /* CONFIG_MM_ID */
> >>> +
> >>> +#ifdef CONFIG_BPF_MM
> >>> +               struct bpf_mm_ops bpf_mm;
> >>> +#endif
> >>>          } __randomize_layout;
> >>>
> >>> We should be aware that this will involve extensive changes in mm/.
> >>
> >> That's what we do on linux-mm :)
> >>
> >> It would be great to use Alexei's feedback/experience to come up with
> >> something that is flexible for various use cases.
> >
> > I'm still not entirely convinced that allowing individual processes or
> > cgroups to run independent progs is a valid use case. However, since
> > we have a consensus that this is the right direction, I will proceed
> > with this approach.
> >
> >>
> >> So I think this is likely the right direction.
> >>
> >> It would be great to evaluate which scenarios we could unlock with this
> >> (global vs. per-process vs. per-cgroup) approach, and how
> >> extensive/involved the changes will be.
> >
> > 1. Global Approach
> >    - Pros:
> >      Simple;
> >      Can manage different THP policies for different cgroups or processes.
> >   - Cons:
> >      Does not allow individual processes to run their own BPF programs.
> >
> > 2. Per-Process Approach
> >     - Pros:
> >       Enables each process to run its own BPF program.
> >     - Cons:
> >       Introduces significant complexity, as it requires handling the
> > BPF program's lifecycle (creation, destruction, inheritance) within
> > every mm_struct.
> >
> > 3. Per-Cgroup Approach
> >     - Pros:
> >        Allows individual cgroups to run their own BPF programs.
> >        Less complex than the per-process model, as it can leverage the
> > existing cgroup operations structure.
> >     - Cons:
> >        Creates a dependency on the cgroup subsystem.
> >        might not be easy to control at the per-process level.
>
> Another issue is that how and who to deal with hierarchical cgroup, where one
> cgroup is a parent of another. Should bpf program to do that or mm code
> to do that?

The cgroup subsystem handles this propagation automatically. When a
BPF program is attached to a cgroup via cgroup_bpf_attach(), it's
automatically inherited by all descendant cgroups.

Note: struct-ops programs aren't supported by cgroup_bpf_attach(),
requiring us to build new attachment mechanisms for cgroup-based
struct-ops.

> I remember hierarchical cgroup is the main reason THP control
> at cgroup level is rejected. If we do per-cgroup bpf control, wouldn't we
> get the same rejection from cgroup folks?

Right, it was rejected by the cgroup maintainers [0].

[0]. https://lore.kernel.org/linux-mm/20241030150851.GB706616@cmpxchg.org/

-- 
Regards
Yafang



* Re: [PATCH v9 mm-new 03/11] mm: thp: add support for BPF based THP order selection
  2025-10-08 11:27             ` Zi Yan
  2025-10-08 12:06               ` Yafang Shao
@ 2025-10-08 12:07               ` David Hildenbrand
  2025-10-08 13:11                 ` Yafang Shao
  1 sibling, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2025-10-08 12:07 UTC (permalink / raw)
  To: Zi Yan, Yafang Shao, Alexei Starovoitov, Johannes Weiner
  Cc: Andrew Morton, baolin.wang, Lorenzo Stoakes, Liam Howlett,
	npache, ryan.roberts, dev.jain, usamaarif642, gutierrez.asier,
	Matthew Wilcox, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Amery Hung, David Rientjes, Jonathan Corbet,
	21cnbao, Shakeel Butt, Tejun Heo, lance.yang, Randy Dunlap, bpf,
	linux-mm, open list:DOCUMENTATION, LKML

On 08.10.25 13:27, Zi Yan wrote:
> On 8 Oct 2025, at 5:04, Yafang Shao wrote:
> 
>> On Wed, Oct 8, 2025 at 4:28 PM David Hildenbrand <david@redhat.com> wrote:
>>>
>>> On 08.10.25 10:18, Yafang Shao wrote:
>>>> On Wed, Oct 8, 2025 at 4:08 PM David Hildenbrand <david@redhat.com> wrote:
>>>>>
>>>>> On 03.10.25 04:18, Alexei Starovoitov wrote:
>>>>>> On Mon, Sep 29, 2025 at 10:59 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>>>>>>>
>>>>>>> +unsigned long bpf_hook_thp_get_orders(struct vm_area_struct *vma,
>>>>>>> +                                     enum tva_type type,
>>>>>>> +                                     unsigned long orders)
>>>>>>> +{
>>>>>>> +       thp_order_fn_t *bpf_hook_thp_get_order;
>>>>>>> +       int bpf_order;
>>>>>>> +
>>>>>>> +       /* No BPF program is attached */
>>>>>>> +       if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
>>>>>>> +                     &transparent_hugepage_flags))
>>>>>>> +               return orders;
>>>>>>> +
>>>>>>> +       rcu_read_lock();
>>>>>>> +       bpf_hook_thp_get_order = rcu_dereference(bpf_thp.thp_get_order);
>>>>>>> +       if (WARN_ON_ONCE(!bpf_hook_thp_get_order))
>>>>>>> +               goto out;
>>>>>>> +
>>>>>>> +       bpf_order = bpf_hook_thp_get_order(vma, type, orders);
>>>>>>> +       orders &= BIT(bpf_order);
>>>>>>> +
>>>>>>> +out:
>>>>>>> +       rcu_read_unlock();
>>>>>>> +       return orders;
>>>>>>> +}
>>>>>>
>>>>>> I thought I explained it earlier.
>>>>>> Nack to a single global prog approach.
>>>>>
>>>>> I agree. We should have the option to either specify a policy globally,
>>>>> or more refined for cgroups/processes.
>>>>>
>>>>> It's an interesting question if a program would ever want to ship its
>>>>> own policy: I can see use cases for that.
>>>>>
>>>>> So I agree that we should make it more flexible right from the start.
>>>>
>>>> To achieve per-process granularity, the struct-ops must be embedded
>>>> within the mm_struct as follows:
>>>>
>>>> +#ifdef CONFIG_BPF_MM
>>>> +struct bpf_mm_ops {
>>>> +#ifdef CONFIG_BPF_THP
>>>> +       struct bpf_thp_ops bpf_thp;
>>>> +#endif
>>>> +};
>>>> +#endif
>>>> +
>>>>    /*
>>>>     * Opaque type representing current mm_struct flag state. Must be accessed via
>>>>     * mm_flags_xxx() helper functions.
>>>> @@ -1268,6 +1281,10 @@ struct mm_struct {
>>>>    #ifdef CONFIG_MM_ID
>>>>                   mm_id_t mm_id;
>>>>    #endif /* CONFIG_MM_ID */
>>>> +
>>>> +#ifdef CONFIG_BPF_MM
>>>> +               struct bpf_mm_ops bpf_mm;
>>>> +#endif
>>>>           } __randomize_layout;
>>>>
>>>> We should be aware that this will involve extensive changes in mm/.
>>>
>>> That's what we do on linux-mm :)
>>>
>>> It would be great to use Alexei's feedback/experience to come up with
>>> something that is flexible for various use cases.
>>
>> I'm still not entirely convinced that allowing individual processes or
>> cgroups to run independent progs is a valid use case. However, since
>> we have a consensus that this is the right direction, I will proceed
>> with this approach.
>>
>>>
>>> So I think this is likely the right direction.
>>>
>>> It would be great to evaluate which scenarios we could unlock with this
>>> (global vs. per-process vs. per-cgroup) approach, and how
>>> extensive/involved the changes will be.
>>
>> 1. Global Approach
>>     - Pros:
>>       Simple;
>>       Can manage different THP policies for different cgroups or processes.
>>    - Cons:
>>       Does not allow individual processes to run their own BPF programs.
>>
>> 2. Per-Process Approach
>>      - Pros:
>>        Enables each process to run its own BPF program.
>>      - Cons:
>>        Introduces significant complexity, as it requires handling the
>> BPF program's lifecycle (creation, destruction, inheritance) within
>> every mm_struct.
>>
>> 3. Per-Cgroup Approach
>>      - Pros:
>>         Allows individual cgroups to run their own BPF programs.
>>         Less complex than the per-process model, as it can leverage the
>> existing cgroup operations structure.
>>      - Cons:
>>         Creates a dependency on the cgroup subsystem.
>>         might not be easy to control at the per-process level.
> 
> Another issue is that how and who to deal with hierarchical cgroup, where one
> cgroup is a parent of another. Should bpf program to do that or mm code
> to do that? I remember hierarchical cgroup is the main reason THP control
> at cgroup level is rejected. If we do per-cgroup bpf control, wouldn't we
> get the same rejection from cgroup folks?

Valid point.

I do wonder if that problem was already encountered elsewhere with bpf 
and if there is already a solution.

Focusing on processes instead of cgroups might be easier initially.

-- 
Cheers

David / dhildenb




* Re: [PATCH v9 mm-new 03/11] mm: thp: add support for BPF based THP order selection
  2025-10-08 12:06               ` Yafang Shao
@ 2025-10-08 12:49                 ` Gutierrez Asier
  0 siblings, 0 replies; 37+ messages in thread
From: Gutierrez Asier @ 2025-10-08 12:49 UTC (permalink / raw)
  To: Yafang Shao, Zi Yan
  Cc: David Hildenbrand, Alexei Starovoitov, Johannes Weiner,
	Andrew Morton, baolin.wang, Lorenzo Stoakes, Liam Howlett,
	npache, ryan.roberts, dev.jain, usamaarif642, Matthew Wilcox,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Amery Hung,
	David Rientjes, Jonathan Corbet, 21cnbao, Shakeel Butt,
	Tejun Heo, lance.yang, Randy Dunlap, bpf, linux-mm,
	open list:DOCUMENTATION, LKML

Hi,

On 10/8/2025 3:06 PM, Yafang Shao wrote:
> On Wed, Oct 8, 2025 at 7:27 PM Zi Yan <ziy@nvidia.com> wrote:
>>
>> On 8 Oct 2025, at 5:04, Yafang Shao wrote:
>>
>>> On Wed, Oct 8, 2025 at 4:28 PM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 08.10.25 10:18, Yafang Shao wrote:
>>>>> On Wed, Oct 8, 2025 at 4:08 PM David Hildenbrand <david@redhat.com> wrote:
>>>>>>
>>>>>> On 03.10.25 04:18, Alexei Starovoitov wrote:
>>>>>>> On Mon, Sep 29, 2025 at 10:59 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>>>>>>>>
>>>>>>>> +unsigned long bpf_hook_thp_get_orders(struct vm_area_struct *vma,
>>>>>>>> +                                     enum tva_type type,
>>>>>>>> +                                     unsigned long orders)
>>>>>>>> +{
>>>>>>>> +       thp_order_fn_t *bpf_hook_thp_get_order;
>>>>>>>> +       int bpf_order;
>>>>>>>> +
>>>>>>>> +       /* No BPF program is attached */
>>>>>>>> +       if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
>>>>>>>> +                     &transparent_hugepage_flags))
>>>>>>>> +               return orders;
>>>>>>>> +
>>>>>>>> +       rcu_read_lock();
>>>>>>>> +       bpf_hook_thp_get_order = rcu_dereference(bpf_thp.thp_get_order);
>>>>>>>> +       if (WARN_ON_ONCE(!bpf_hook_thp_get_order))
>>>>>>>> +               goto out;
>>>>>>>> +
>>>>>>>> +       bpf_order = bpf_hook_thp_get_order(vma, type, orders);
>>>>>>>> +       orders &= BIT(bpf_order);
>>>>>>>> +
>>>>>>>> +out:
>>>>>>>> +       rcu_read_unlock();
>>>>>>>> +       return orders;
>>>>>>>> +}
>>>>>>>
>>>>>>> I thought I explained it earlier.
>>>>>>> Nack to a single global prog approach.
>>>>>>
>>>>>> I agree. We should have the option to either specify a policy globally,
>>>>>> or more refined for cgroups/processes.
>>>>>>
>>>>>> It's an interesting question if a program would ever want to ship its
>>>>>> own policy: I can see use cases for that.
>>>>>>
>>>>>> So I agree that we should make it more flexible right from the start.
>>>>>
>>>>> To achieve per-process granularity, the struct-ops must be embedded
>>>>> within the mm_struct as follows:
>>>>>
>>>>> +#ifdef CONFIG_BPF_MM
>>>>> +struct bpf_mm_ops {
>>>>> +#ifdef CONFIG_BPF_THP
>>>>> +       struct bpf_thp_ops bpf_thp;
>>>>> +#endif
>>>>> +};
>>>>> +#endif
>>>>> +
>>>>>   /*
>>>>>    * Opaque type representing current mm_struct flag state. Must be accessed via
>>>>>    * mm_flags_xxx() helper functions.
>>>>> @@ -1268,6 +1281,10 @@ struct mm_struct {
>>>>>   #ifdef CONFIG_MM_ID
>>>>>                  mm_id_t mm_id;
>>>>>   #endif /* CONFIG_MM_ID */
>>>>> +
>>>>> +#ifdef CONFIG_BPF_MM
>>>>> +               struct bpf_mm_ops bpf_mm;
>>>>> +#endif
>>>>>          } __randomize_layout;
>>>>>
>>>>> We should be aware that this will involve extensive changes in mm/.
>>>>
>>>> That's what we do on linux-mm :)
>>>>
>>>> It would be great to use Alexei's feedback/experience to come up with
>>>> something that is flexible for various use cases.
>>>
>>> I'm still not entirely convinced that allowing individual processes or
>>> cgroups to run independent progs is a valid use case. However, since
>>> we have a consensus that this is the right direction, I will proceed
>>> with this approach.
>>>
>>>>
>>>> So I think this is likely the right direction.
>>>>
>>>> It would be great to evaluate which scenarios we could unlock with this
>>>> (global vs. per-process vs. per-cgroup) approach, and how
>>>> extensive/involved the changes will be.
>>>
>>> 1. Global Approach
>>>    - Pros:
>>>      Simple;
>>>      Can manage different THP policies for different cgroups or processes.
>>>   - Cons:
>>>      Does not allow individual processes to run their own BPF programs.
>>>
>>> 2. Per-Process Approach
>>>     - Pros:
>>>       Enables each process to run its own BPF program.
>>>     - Cons:
>>>       Introduces significant complexity, as it requires handling the
>>> BPF program's lifecycle (creation, destruction, inheritance) within
>>> every mm_struct.
>>>
>>> 3. Per-Cgroup Approach
>>>     - Pros:
>>>        Allows individual cgroups to run their own BPF programs.
>>>        Less complex than the per-process model, as it can leverage the
>>> existing cgroup operations structure.
>>>     - Cons:
>>>        Creates a dependency on the cgroup subsystem.
>>>        might not be easy to control at the per-process level.
>>
>> Another issue is that how and who to deal with hierarchical cgroup, where one
>> cgroup is a parent of another. Should bpf program to do that or mm code
>> to do that?
> 
> The cgroup subsystem handles this propagation automatically. When a
> BPF program is attached to a cgroup via cgroup_bpf_attach(), it's
> automatically inherited by all descendant cgroups.
> 
> Note: struct-ops programs aren't supported by cgroup_bpf_attach(),
> requiring us to build new attachment mechanisms for cgroup-based
> struct-ops.
> 
>> I remember hierarchical cgroup is the main reason THP control
>> at cgroup level is rejected. If we do per-cgroup bpf control, wouldn't we
>> get the same rejection from cgroup folks?
> 
> Right, it was rejected by the cgroup maintainers [0]
> 
> [0]. https://lore.kernel.org/linux-mm/20241030150851.GB706616@cmpxchg.org/
> 

Yes, the patch was rejected because:

1. It breaks the cgroup hierarchy when two siblings have different THP
   policies.
2. Cgroup was designed for resource management, not for grouping processes
   and tuning them.
3. It sets a precedent for other people to add new flags to cgroup,
   potentially polluting it. We may end up with cgroups having tens of
   different flags, making sysadmins' jobs more complex.

In the MM call I proposed a new mechanism based on limits, something like
hugetlbfs.

The main issue, still, is that sysadmins would need to set those limits up,
making their lives more complex.

I remember a few participants mentioned the idea of the kernel setting huge
page consumption automatically using some sort of heuristics. To be honest,
I haven't had the time to sit down and think about it. I would be glad to
cooperate and come up with a feasible solution.

-- 
Asier Gutierrez
Huawei




* Re: [PATCH v9 mm-new 03/11] mm: thp: add support for BPF based THP order selection
  2025-10-08 12:07               ` David Hildenbrand
@ 2025-10-08 13:11                 ` Yafang Shao
  2025-10-09  9:19                   ` David Hildenbrand
  0 siblings, 1 reply; 37+ messages in thread
From: Yafang Shao @ 2025-10-08 13:11 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Zi Yan, Alexei Starovoitov, Johannes Weiner, Andrew Morton,
	baolin.wang, Lorenzo Stoakes, Liam Howlett, npache, ryan.roberts,
	dev.jain, usamaarif642, gutierrez.asier, Matthew Wilcox,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Amery Hung,
	David Rientjes, Jonathan Corbet, 21cnbao, Shakeel Butt,
	Tejun Heo, lance.yang, Randy Dunlap, bpf, linux-mm,
	open list:DOCUMENTATION, LKML

On Wed, Oct 8, 2025 at 8:07 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 08.10.25 13:27, Zi Yan wrote:
> > On 8 Oct 2025, at 5:04, Yafang Shao wrote:
> >
> >> On Wed, Oct 8, 2025 at 4:28 PM David Hildenbrand <david@redhat.com> wrote:
> >>>
> >>> On 08.10.25 10:18, Yafang Shao wrote:
> >>>> On Wed, Oct 8, 2025 at 4:08 PM David Hildenbrand <david@redhat.com> wrote:
> >>>>>
> >>>>> On 03.10.25 04:18, Alexei Starovoitov wrote:
> >>>>>> On Mon, Sep 29, 2025 at 10:59 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >>>>>>>
> >>>>>>> +unsigned long bpf_hook_thp_get_orders(struct vm_area_struct *vma,
> >>>>>>> +                                     enum tva_type type,
> >>>>>>> +                                     unsigned long orders)
> >>>>>>> +{
> >>>>>>> +       thp_order_fn_t *bpf_hook_thp_get_order;
> >>>>>>> +       int bpf_order;
> >>>>>>> +
> >>>>>>> +       /* No BPF program is attached */
> >>>>>>> +       if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
> >>>>>>> +                     &transparent_hugepage_flags))
> >>>>>>> +               return orders;
> >>>>>>> +
> >>>>>>> +       rcu_read_lock();
> >>>>>>> +       bpf_hook_thp_get_order = rcu_dereference(bpf_thp.thp_get_order);
> >>>>>>> +       if (WARN_ON_ONCE(!bpf_hook_thp_get_order))
> >>>>>>> +               goto out;
> >>>>>>> +
> >>>>>>> +       bpf_order = bpf_hook_thp_get_order(vma, type, orders);
> >>>>>>> +       orders &= BIT(bpf_order);
> >>>>>>> +
> >>>>>>> +out:
> >>>>>>> +       rcu_read_unlock();
> >>>>>>> +       return orders;
> >>>>>>> +}
> >>>>>>
> >>>>>> I thought I explained it earlier.
> >>>>>> Nack to a single global prog approach.
> >>>>>
> >>>>> I agree. We should have the option to either specify a policy globally,
> >>>>> or more refined for cgroups/processes.
> >>>>>
> >>>>> It's an interesting question if a program would ever want to ship its
> >>>>> own policy: I can see use cases for that.
> >>>>>
> >>>>> So I agree that we should make it more flexible right from the start.
> >>>>
> >>>> To achieve per-process granularity, the struct-ops must be embedded
> >>>> within the mm_struct as follows:
> >>>>
> >>>> +#ifdef CONFIG_BPF_MM
> >>>> +struct bpf_mm_ops {
> >>>> +#ifdef CONFIG_BPF_THP
> >>>> +       struct bpf_thp_ops bpf_thp;
> >>>> +#endif
> >>>> +};
> >>>> +#endif
> >>>> +
> >>>>    /*
> >>>>     * Opaque type representing current mm_struct flag state. Must be accessed via
> >>>>     * mm_flags_xxx() helper functions.
> >>>> @@ -1268,6 +1281,10 @@ struct mm_struct {
> >>>>    #ifdef CONFIG_MM_ID
> >>>>                   mm_id_t mm_id;
> >>>>    #endif /* CONFIG_MM_ID */
> >>>> +
> >>>> +#ifdef CONFIG_BPF_MM
> >>>> +               struct bpf_mm_ops bpf_mm;
> >>>> +#endif
> >>>>           } __randomize_layout;
> >>>>
> >>>> We should be aware that this will involve extensive changes in mm/.
> >>>
> >>> That's what we do on linux-mm :)
> >>>
> >>> It would be great to use Alexei's feedback/experience to come up with
> >>> something that is flexible for various use cases.
> >>
> >> I'm still not entirely convinced that allowing individual processes or
> >> cgroups to run independent progs is a valid use case. However, since
> >> we have a consensus that this is the right direction, I will proceed
> >> with this approach.
> >>
> >>>
> >>> So I think this is likely the right direction.
> >>>
> >>> It would be great to evaluate which scenarios we could unlock with this
> >>> (global vs. per-process vs. per-cgroup) approach, and how
> >>> extensive/involved the changes will be.
> >>
> >> 1. Global Approach
> >>     - Pros:
> >>       Simple;
> >>       Can manage different THP policies for different cgroups or processes.
> >>    - Cons:
> >>       Does not allow individual processes to run their own BPF programs.
> >>
> >> 2. Per-Process Approach
> >>      - Pros:
> >>        Enables each process to run its own BPF program.
> >>      - Cons:
> >>        Introduces significant complexity, as it requires handling the
> >> BPF program's lifecycle (creation, destruction, inheritance) within
> >> every mm_struct.
> >>
> >> 3. Per-Cgroup Approach
> >>      - Pros:
> >>         Allows individual cgroups to run their own BPF programs.
> >>         Less complex than the per-process model, as it can leverage the
> >> existing cgroup operations structure.
> >>      - Cons:
> >>         Creates a dependency on the cgroup subsystem.
> >>         might not be easy to control at the per-process level.
> >
> > Another issue is that how and who to deal with hierarchical cgroup, where one
> > cgroup is a parent of another. Should bpf program to do that or mm code
> > to do that? I remember hierarchical cgroup is the main reason THP control
> > at cgroup level is rejected. If we do per-cgroup bpf control, wouldn't we
> > get the same rejection from cgroup folks?
>
> Valid point.
>
> I do wonder if that problem was already encountered elsewhere with bpf
> and if there is already a solution.

Our standard is to run only one instance of a BPF program type
system-wide to avoid conflicts. For example, we can't have both
systemd and a container runtime running bpf-thp simultaneously.

Perhaps Alexei can enlighten us, though we'd need to read between his
characteristically brief lines. ;-)

>
> Focusing on processes instead of cgroups might be easier initially.


-- 
Regards
Yafang



* Re: [PATCH v9 mm-new 03/11] mm: thp: add support for BPF based THP order selection
  2025-10-08 13:11                 ` Yafang Shao
@ 2025-10-09  9:19                   ` David Hildenbrand
  2025-10-09  9:59                     ` Yafang Shao
  0 siblings, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2025-10-09  9:19 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Zi Yan, Alexei Starovoitov, Johannes Weiner, Andrew Morton,
	baolin.wang, Lorenzo Stoakes, Liam Howlett, npache, ryan.roberts,
	dev.jain, usamaarif642, gutierrez.asier, Matthew Wilcox,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Amery Hung,
	David Rientjes, Jonathan Corbet, 21cnbao, Shakeel Butt,
	Tejun Heo, lance.yang, Randy Dunlap, bpf, linux-mm,
	open list:DOCUMENTATION, LKML

On 08.10.25 15:11, Yafang Shao wrote:
> On Wed, Oct 8, 2025 at 8:07 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 08.10.25 13:27, Zi Yan wrote:
>>> On 8 Oct 2025, at 5:04, Yafang Shao wrote:
>>>
>>>> On Wed, Oct 8, 2025 at 4:28 PM David Hildenbrand <david@redhat.com> wrote:
>>>>>
>>>>> On 08.10.25 10:18, Yafang Shao wrote:
>>>>>> On Wed, Oct 8, 2025 at 4:08 PM David Hildenbrand <david@redhat.com> wrote:
>>>>>>>
>>>>>>> On 03.10.25 04:18, Alexei Starovoitov wrote:
>>>>>>>> On Mon, Sep 29, 2025 at 10:59 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> +unsigned long bpf_hook_thp_get_orders(struct vm_area_struct *vma,
>>>>>>>>> +                                     enum tva_type type,
>>>>>>>>> +                                     unsigned long orders)
>>>>>>>>> +{
>>>>>>>>> +       thp_order_fn_t *bpf_hook_thp_get_order;
>>>>>>>>> +       int bpf_order;
>>>>>>>>> +
>>>>>>>>> +       /* No BPF program is attached */
>>>>>>>>> +       if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
>>>>>>>>> +                     &transparent_hugepage_flags))
>>>>>>>>> +               return orders;
>>>>>>>>> +
>>>>>>>>> +       rcu_read_lock();
>>>>>>>>> +       bpf_hook_thp_get_order = rcu_dereference(bpf_thp.thp_get_order);
>>>>>>>>> +       if (WARN_ON_ONCE(!bpf_hook_thp_get_order))
>>>>>>>>> +               goto out;
>>>>>>>>> +
>>>>>>>>> +       bpf_order = bpf_hook_thp_get_order(vma, type, orders);
>>>>>>>>> +       orders &= BIT(bpf_order);
>>>>>>>>> +
>>>>>>>>> +out:
>>>>>>>>> +       rcu_read_unlock();
>>>>>>>>> +       return orders;
>>>>>>>>> +}
>>>>>>>>
>>>>>>>> I thought I explained it earlier.
>>>>>>>> Nack to a single global prog approach.
>>>>>>>
>>>>>>> I agree. We should have the option to either specify a policy globally,
>>>>>>> or more refined for cgroups/processes.
>>>>>>>
>>>>>>> It's an interesting question if a program would ever want to ship its
>>>>>>> own policy: I can see use cases for that.
>>>>>>>
>>>>>>> So I agree that we should make it more flexible right from the start.
>>>>>>
>>>>>> To achieve per-process granularity, the struct-ops must be embedded
>>>>>> within the mm_struct as follows:
>>>>>>
>>>>>> +#ifdef CONFIG_BPF_MM
>>>>>> +struct bpf_mm_ops {
>>>>>> +#ifdef CONFIG_BPF_THP
>>>>>> +       struct bpf_thp_ops bpf_thp;
>>>>>> +#endif
>>>>>> +};
>>>>>> +#endif
>>>>>> +
>>>>>>     /*
>>>>>>      * Opaque type representing current mm_struct flag state. Must be accessed via
>>>>>>      * mm_flags_xxx() helper functions.
>>>>>> @@ -1268,6 +1281,10 @@ struct mm_struct {
>>>>>>     #ifdef CONFIG_MM_ID
>>>>>>                    mm_id_t mm_id;
>>>>>>     #endif /* CONFIG_MM_ID */
>>>>>> +
>>>>>> +#ifdef CONFIG_BPF_MM
>>>>>> +               struct bpf_mm_ops bpf_mm;
>>>>>> +#endif
>>>>>>            } __randomize_layout;
>>>>>>
>>>>>> We should be aware that this will involve extensive changes in mm/.
>>>>>
>>>>> That's what we do on linux-mm :)
>>>>>
>>>>> It would be great to use Alexei's feedback/experience to come up with
>>>>> something that is flexible for various use cases.
>>>>
>>>> I'm still not entirely convinced that allowing individual processes or
>>>> cgroups to run independent progs is a valid use case. However, since
>>>> we have a consensus that this is the right direction, I will proceed
>>>> with this approach.
>>>>
>>>>>
>>>>> So I think this is likely the right direction.
>>>>>
>>>>> It would be great to evaluate which scenarios we could unlock with this
>>>>> (global vs. per-process vs. per-cgroup) approach, and how
>>>>> extensive/involved the changes will be.
>>>>
>>>> 1. Global Approach
>>>>      - Pros:
>>>>        Simple;
>>>>        Can manage different THP policies for different cgroups or processes.
>>>>     - Cons:
>>>>        Does not allow individual processes to run their own BPF programs.
>>>>
>>>> 2. Per-Process Approach
>>>>       - Pros:
>>>>         Enables each process to run its own BPF program.
>>>>       - Cons:
>>>>         Introduces significant complexity, as it requires handling the
>>>> BPF program's lifecycle (creation, destruction, inheritance) within
>>>> every mm_struct.
>>>>
>>>> 3. Per-Cgroup Approach
>>>>       - Pros:
>>>>          Allows individual cgroups to run their own BPF programs.
>>>>          Less complex than the per-process model, as it can leverage the
>>>> existing cgroup operations structure.
>>>>       - Cons:
>>>>          Creates a dependency on the cgroup subsystem.
>>>>          might not be easy to control at the per-process level.
>>>
>>> Another issue is that how and who to deal with hierarchical cgroup, where one
>>> cgroup is a parent of another. Should bpf program to do that or mm code
>>> to do that? I remember hierarchical cgroup is the main reason THP control
>>> at cgroup level is rejected. If we do per-cgroup bpf control, wouldn't we
>>> get the same rejection from cgroup folks?
>>
>> Valid point.
>>
>> I do wonder if that problem was already encountered elsewhere with bpf
>> and if there is already a solution.
> 
> Our standard is to run only one instance of a BPF program type
> system-wide to avoid conflicts. For example, we can't have both
> systemd and a container runtime running bpf-thp simultaneously.

Right, it's a good question how to combine policies, or "who wins".

> 
> Perhaps Alexei can enlighten us, though we'd need to read between his
> characteristically brief lines. ;-)

There might be some insights to be had in the bpf OOM discussion at

https://lkml.kernel.org/r/CAEf4BzafXv-PstSAP6krers=S74ri1+zTB4Y2oT6f+33yznqsA@mail.gmail.com

I didn't completely read through that, but that discussion also seems to 
be about the interaction between cgroups and BPF programs.

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v9 mm-new 03/11] mm: thp: add support for BPF based THP order selection
  2025-10-09  9:19                   ` David Hildenbrand
@ 2025-10-09  9:59                     ` Yafang Shao
  2025-10-10  7:54                       ` David Hildenbrand
  0 siblings, 1 reply; 37+ messages in thread
From: Yafang Shao @ 2025-10-09  9:59 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Zi Yan, Alexei Starovoitov, Johannes Weiner, Andrew Morton,
	baolin.wang, Lorenzo Stoakes, Liam Howlett, npache, ryan.roberts,
	dev.jain, usamaarif642, gutierrez.asier, Matthew Wilcox,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Amery Hung,
	David Rientjes, Jonathan Corbet, 21cnbao, Shakeel Butt,
	Tejun Heo, lance.yang, Randy Dunlap, bpf, linux-mm,
	open list:DOCUMENTATION, LKML

On Thu, Oct 9, 2025 at 5:19 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 08.10.25 15:11, Yafang Shao wrote:
> > On Wed, Oct 8, 2025 at 8:07 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 08.10.25 13:27, Zi Yan wrote:
> >>> On 8 Oct 2025, at 5:04, Yafang Shao wrote:
> >>>
> >>>> On Wed, Oct 8, 2025 at 4:28 PM David Hildenbrand <david@redhat.com> wrote:
> >>>>>
> >>>>> On 08.10.25 10:18, Yafang Shao wrote:
> >>>>>> On Wed, Oct 8, 2025 at 4:08 PM David Hildenbrand <david@redhat.com> wrote:
> >>>>>>>
> >>>>>>> On 03.10.25 04:18, Alexei Starovoitov wrote:
> >>>>>>>> On Mon, Sep 29, 2025 at 10:59 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>> +unsigned long bpf_hook_thp_get_orders(struct vm_area_struct *vma,
> >>>>>>>>> +                                     enum tva_type type,
> >>>>>>>>> +                                     unsigned long orders)
> >>>>>>>>> +{
> >>>>>>>>> +       thp_order_fn_t *bpf_hook_thp_get_order;
> >>>>>>>>> +       int bpf_order;
> >>>>>>>>> +
> >>>>>>>>> +       /* No BPF program is attached */
> >>>>>>>>> +       if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
> >>>>>>>>> +                     &transparent_hugepage_flags))
> >>>>>>>>> +               return orders;
> >>>>>>>>> +
> >>>>>>>>> +       rcu_read_lock();
> >>>>>>>>> +       bpf_hook_thp_get_order = rcu_dereference(bpf_thp.thp_get_order);
> >>>>>>>>> +       if (WARN_ON_ONCE(!bpf_hook_thp_get_order))
> >>>>>>>>> +               goto out;
> >>>>>>>>> +
> >>>>>>>>> +       bpf_order = bpf_hook_thp_get_order(vma, type, orders);
> >>>>>>>>> +       orders &= BIT(bpf_order);
> >>>>>>>>> +
> >>>>>>>>> +out:
> >>>>>>>>> +       rcu_read_unlock();
> >>>>>>>>> +       return orders;
> >>>>>>>>> +}
> >>>>>>>>
> >>>>>>>> I thought I explained it earlier.
> >>>>>>>> Nack to a single global prog approach.
> >>>>>>>
> >>>>>>> I agree. We should have the option to either specify a policy globally,
> >>>>>>> or more refined for cgroups/processes.
> >>>>>>>
> >>>>>>> It's an interesting question if a program would ever want to ship its
> >>>>>>> own policy: I can see use cases for that.
> >>>>>>>
> >>>>>>> So I agree that we should make it more flexible right from the start.
> >>>>>>
> >>>>>> To achieve per-process granularity, the struct-ops must be embedded
> >>>>>> within the mm_struct as follows:
> >>>>>>
> >>>>>> +#ifdef CONFIG_BPF_MM
> >>>>>> +struct bpf_mm_ops {
> >>>>>> +#ifdef CONFIG_BPF_THP
> >>>>>> +       struct bpf_thp_ops bpf_thp;
> >>>>>> +#endif
> >>>>>> +};
> >>>>>> +#endif
> >>>>>> +
> >>>>>>     /*
> >>>>>>      * Opaque type representing current mm_struct flag state. Must be accessed via
> >>>>>>      * mm_flags_xxx() helper functions.
> >>>>>> @@ -1268,6 +1281,10 @@ struct mm_struct {
> >>>>>>     #ifdef CONFIG_MM_ID
> >>>>>>                    mm_id_t mm_id;
> >>>>>>     #endif /* CONFIG_MM_ID */
> >>>>>> +
> >>>>>> +#ifdef CONFIG_BPF_MM
> >>>>>> +               struct bpf_mm_ops bpf_mm;
> >>>>>> +#endif
> >>>>>>            } __randomize_layout;
> >>>>>>
> >>>>>> We should be aware that this will involve extensive changes in mm/.
> >>>>>
> >>>>> That's what we do on linux-mm :)
> >>>>>
> >>>>> It would be great to use Alexei's feedback/experience to come up with
> >>>>> something that is flexible for various use cases.
> >>>>
> >>>> I'm still not entirely convinced that allowing individual processes or
> >>>> cgroups to run independent progs is a valid use case. However, since
> >>>> we have a consensus that this is the right direction, I will proceed
> >>>> with this approach.
> >>>>
> >>>>>
> >>>>> So I think this is likely the right direction.
> >>>>>
> >>>>> It would be great to evaluate which scenarios we could unlock with this
> >>>>> (global vs. per-process vs. per-cgroup) approach, and how
> >>>>> extensive/involved the changes will be.
> >>>>
> >>>> 1. Global Approach
> >>>>      - Pros:
> >>>>        Simple;
> >>>>        Can manage different THP policies for different cgroups or processes.
> >>>>     - Cons:
> >>>>        Does not allow individual processes to run their own BPF programs.
> >>>>
> >>>> 2. Per-Process Approach
> >>>>       - Pros:
> >>>>         Enables each process to run its own BPF program.
> >>>>       - Cons:
> >>>>         Introduces significant complexity, as it requires handling the
> >>>> BPF program's lifecycle (creation, destruction, inheritance) within
> >>>> every mm_struct.
> >>>>
> >>>> 3. Per-Cgroup Approach
> >>>>       - Pros:
> >>>>          Allows individual cgroups to run their own BPF programs.
> >>>>          Less complex than the per-process model, as it can leverage the
> >>>> existing cgroup operations structure.
> >>>>       - Cons:
> >>>>          Creates a dependency on the cgroup subsystem.
> >>>>          might not be easy to control at the per-process level.
> >>>
> >>> Another issue is that how and who to deal with hierarchical cgroup, where one
> >>> cgroup is a parent of another. Should bpf program to do that or mm code
> >>> to do that? I remember hierarchical cgroup is the main reason THP control
> >>> at cgroup level is rejected. If we do per-cgroup bpf control, wouldn't we
> >>> get the same rejection from cgroup folks?
> >>
> >> Valid point.
> >>
> >> I do wonder if that problem was already encountered elsewhere with bpf
> >> and if there is already a solution.
> >
> > Our standard is to run only one instance of a BPF program type
> > system-wide to avoid conflicts. For example, we can't have both
> > systemd and a container runtime running bpf-thp simultaneously.
>
> Right, it's a good question how to combine policies, or "who wins".

From my perspective, the ideal approach is to have one BPF-THP
instance per mm_struct. This allows for separate managers in different
domains, such as systemd managing BPF-THP for system processes and
containerd for container processes, while ensuring that any single
process is managed by only one BPF-THP.

>
> >
> > Perhaps Alexei can enlighten us, though we'd need to read between his
> > characteristically brief lines. ;-)
>
> There might be some insights to be had in the bpf OOM discussion at
>
> https://lkml.kernel.org/r/CAEf4BzafXv-PstSAP6krers=S74ri1+zTB4Y2oT6f+33yznqsA@mail.gmail.com
>
> I didn't completely read through that, but that discussion also seems to
> be about the interaction between cgroups and BPF programs.

I have reviewed the discussions.

Given that OOM handling can be cgroup-specific, implementing a
cgroup-based BPF-OOM handler makes sense.

-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v9 mm-new 03/11] mm: thp: add support for BPF based THP order selection
  2025-10-09  9:59                     ` Yafang Shao
@ 2025-10-10  7:54                       ` David Hildenbrand
  2025-10-11  2:13                         ` Yafang Shao
  0 siblings, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2025-10-10  7:54 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Zi Yan, Alexei Starovoitov, Johannes Weiner, Andrew Morton,
	baolin.wang, Lorenzo Stoakes, Liam Howlett, npache, ryan.roberts,
	dev.jain, usamaarif642, gutierrez.asier, Matthew Wilcox,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Amery Hung,
	David Rientjes, Jonathan Corbet, 21cnbao, Shakeel Butt,
	Tejun Heo, lance.yang, Randy Dunlap, bpf, linux-mm,
	open list:DOCUMENTATION, LKML

On 09.10.25 11:59, Yafang Shao wrote:
> On Thu, Oct 9, 2025 at 5:19 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 08.10.25 15:11, Yafang Shao wrote:
>>> On Wed, Oct 8, 2025 at 8:07 PM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 08.10.25 13:27, Zi Yan wrote:
>>>>> On 8 Oct 2025, at 5:04, Yafang Shao wrote:
>>>>>
>>>>>> On Wed, Oct 8, 2025 at 4:28 PM David Hildenbrand <david@redhat.com> wrote:
>>>>>>>
>>>>>>> On 08.10.25 10:18, Yafang Shao wrote:
>>>>>>>> On Wed, Oct 8, 2025 at 4:08 PM David Hildenbrand <david@redhat.com> wrote:
>>>>>>>>>
>>>>>>>>> On 03.10.25 04:18, Alexei Starovoitov wrote:
>>>>>>>>>> On Mon, Sep 29, 2025 at 10:59 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> +unsigned long bpf_hook_thp_get_orders(struct vm_area_struct *vma,
>>>>>>>>>>> +                                     enum tva_type type,
>>>>>>>>>>> +                                     unsigned long orders)
>>>>>>>>>>> +{
>>>>>>>>>>> +       thp_order_fn_t *bpf_hook_thp_get_order;
>>>>>>>>>>> +       int bpf_order;
>>>>>>>>>>> +
>>>>>>>>>>> +       /* No BPF program is attached */
>>>>>>>>>>> +       if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
>>>>>>>>>>> +                     &transparent_hugepage_flags))
>>>>>>>>>>> +               return orders;
>>>>>>>>>>> +
>>>>>>>>>>> +       rcu_read_lock();
>>>>>>>>>>> +       bpf_hook_thp_get_order = rcu_dereference(bpf_thp.thp_get_order);
>>>>>>>>>>> +       if (WARN_ON_ONCE(!bpf_hook_thp_get_order))
>>>>>>>>>>> +               goto out;
>>>>>>>>>>> +
>>>>>>>>>>> +       bpf_order = bpf_hook_thp_get_order(vma, type, orders);
>>>>>>>>>>> +       orders &= BIT(bpf_order);
>>>>>>>>>>> +
>>>>>>>>>>> +out:
>>>>>>>>>>> +       rcu_read_unlock();
>>>>>>>>>>> +       return orders;
>>>>>>>>>>> +}
>>>>>>>>>>
>>>>>>>>>> I thought I explained it earlier.
>>>>>>>>>> Nack to a single global prog approach.
>>>>>>>>>
>>>>>>>>> I agree. We should have the option to either specify a policy globally,
>>>>>>>>> or more refined for cgroups/processes.
>>>>>>>>>
>>>>>>>>> It's an interesting question if a program would ever want to ship its
>>>>>>>>> own policy: I can see use cases for that.
>>>>>>>>>
>>>>>>>>> So I agree that we should make it more flexible right from the start.
>>>>>>>>
>>>>>>>> To achieve per-process granularity, the struct-ops must be embedded
>>>>>>>> within the mm_struct as follows:
>>>>>>>>
>>>>>>>> +#ifdef CONFIG_BPF_MM
>>>>>>>> +struct bpf_mm_ops {
>>>>>>>> +#ifdef CONFIG_BPF_THP
>>>>>>>> +       struct bpf_thp_ops bpf_thp;
>>>>>>>> +#endif
>>>>>>>> +};
>>>>>>>> +#endif
>>>>>>>> +
>>>>>>>>      /*
>>>>>>>>       * Opaque type representing current mm_struct flag state. Must be accessed via
>>>>>>>>       * mm_flags_xxx() helper functions.
>>>>>>>> @@ -1268,6 +1281,10 @@ struct mm_struct {
>>>>>>>>      #ifdef CONFIG_MM_ID
>>>>>>>>                     mm_id_t mm_id;
>>>>>>>>      #endif /* CONFIG_MM_ID */
>>>>>>>> +
>>>>>>>> +#ifdef CONFIG_BPF_MM
>>>>>>>> +               struct bpf_mm_ops bpf_mm;
>>>>>>>> +#endif
>>>>>>>>             } __randomize_layout;
>>>>>>>>
>>>>>>>> We should be aware that this will involve extensive changes in mm/.
>>>>>>>
>>>>>>> That's what we do on linux-mm :)
>>>>>>>
>>>>>>> It would be great to use Alexei's feedback/experience to come up with
>>>>>>> something that is flexible for various use cases.
>>>>>>
>>>>>> I'm still not entirely convinced that allowing individual processes or
>>>>>> cgroups to run independent progs is a valid use case. However, since
>>>>>> we have a consensus that this is the right direction, I will proceed
>>>>>> with this approach.
>>>>>>
>>>>>>>
>>>>>>> So I think this is likely the right direction.
>>>>>>>
>>>>>>> It would be great to evaluate which scenarios we could unlock with this
>>>>>>> (global vs. per-process vs. per-cgroup) approach, and how
>>>>>>> extensive/involved the changes will be.
>>>>>>
>>>>>> 1. Global Approach
>>>>>>       - Pros:
>>>>>>         Simple;
>>>>>>         Can manage different THP policies for different cgroups or processes.
>>>>>>      - Cons:
>>>>>>         Does not allow individual processes to run their own BPF programs.
>>>>>>
>>>>>> 2. Per-Process Approach
>>>>>>        - Pros:
>>>>>>          Enables each process to run its own BPF program.
>>>>>>        - Cons:
>>>>>>          Introduces significant complexity, as it requires handling the
>>>>>> BPF program's lifecycle (creation, destruction, inheritance) within
>>>>>> every mm_struct.
>>>>>>
>>>>>> 3. Per-Cgroup Approach
>>>>>>        - Pros:
>>>>>>           Allows individual cgroups to run their own BPF programs.
>>>>>>           Less complex than the per-process model, as it can leverage the
>>>>>> existing cgroup operations structure.
>>>>>>        - Cons:
>>>>>>           Creates a dependency on the cgroup subsystem.
>>>>>>           might not be easy to control at the per-process level.
>>>>>
>>>>> Another issue is that how and who to deal with hierarchical cgroup, where one
>>>>> cgroup is a parent of another. Should bpf program to do that or mm code
>>>>> to do that? I remember hierarchical cgroup is the main reason THP control
>>>>> at cgroup level is rejected. If we do per-cgroup bpf control, wouldn't we
>>>>> get the same rejection from cgroup folks?
>>>>
>>>> Valid point.
>>>>
>>>> I do wonder if that problem was already encountered elsewhere with bpf
>>>> and if there is already a solution.
>>>
>>> Our standard is to run only one instance of a BPF program type
>>> system-wide to avoid conflicts. For example, we can't have both
>>> systemd and a container runtime running bpf-thp simultaneously.
>>
>> Right, it's a good question how to combine policies, or "who wins".
> 
>  From my perspective, the ideal approach is to have one BPF-THP
> instance per mm_struct. This allows for separate managers in different
> domains, such as systemd managing BPF-THP for system processes and
> containerd for container processes, while ensuring that any single
> process is managed by only one BPF-THP.

I came to the same conclusion. At least it's a valid start.

Maybe we would later want a global fallback BPF-THP prog if none was 
enabled for a specific MM.

But I would expect to start with a per-MM way of doing it; that gives you 
way more flexibility in the long run.

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v9 mm-new 03/11] mm: thp: add support for BPF based THP order selection
  2025-10-10  7:54                       ` David Hildenbrand
@ 2025-10-11  2:13                         ` Yafang Shao
  2025-10-13 12:41                           ` David Hildenbrand
  0 siblings, 1 reply; 37+ messages in thread
From: Yafang Shao @ 2025-10-11  2:13 UTC (permalink / raw)
  To: David Hildenbrand, Tejun Heo, Michal Hocko, Roman Gushchin
  Cc: Zi Yan, Alexei Starovoitov, Johannes Weiner, Andrew Morton,
	baolin.wang, Lorenzo Stoakes, Liam Howlett, npache, ryan.roberts,
	dev.jain, usamaarif642, gutierrez.asier, Matthew Wilcox,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Amery Hung,
	David Rientjes, Jonathan Corbet, 21cnbao, Shakeel Butt,
	lance.yang, Randy Dunlap, bpf, linux-mm, open list:DOCUMENTATION,
	LKML

On Fri, Oct 10, 2025 at 3:54 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 09.10.25 11:59, Yafang Shao wrote:
> > On Thu, Oct 9, 2025 at 5:19 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 08.10.25 15:11, Yafang Shao wrote:
> >>> On Wed, Oct 8, 2025 at 8:07 PM David Hildenbrand <david@redhat.com> wrote:
> >>>>
> >>>> On 08.10.25 13:27, Zi Yan wrote:
> >>>>> On 8 Oct 2025, at 5:04, Yafang Shao wrote:
> >>>>>
> >>>>>> On Wed, Oct 8, 2025 at 4:28 PM David Hildenbrand <david@redhat.com> wrote:
> >>>>>>>
> >>>>>>> On 08.10.25 10:18, Yafang Shao wrote:
> >>>>>>>> On Wed, Oct 8, 2025 at 4:08 PM David Hildenbrand <david@redhat.com> wrote:
> >>>>>>>>>
> >>>>>>>>> On 03.10.25 04:18, Alexei Starovoitov wrote:
> >>>>>>>>>> On Mon, Sep 29, 2025 at 10:59 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> +unsigned long bpf_hook_thp_get_orders(struct vm_area_struct *vma,
> >>>>>>>>>>> +                                     enum tva_type type,
> >>>>>>>>>>> +                                     unsigned long orders)
> >>>>>>>>>>> +{
> >>>>>>>>>>> +       thp_order_fn_t *bpf_hook_thp_get_order;
> >>>>>>>>>>> +       int bpf_order;
> >>>>>>>>>>> +
> >>>>>>>>>>> +       /* No BPF program is attached */
> >>>>>>>>>>> +       if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
> >>>>>>>>>>> +                     &transparent_hugepage_flags))
> >>>>>>>>>>> +               return orders;
> >>>>>>>>>>> +
> >>>>>>>>>>> +       rcu_read_lock();
> >>>>>>>>>>> +       bpf_hook_thp_get_order = rcu_dereference(bpf_thp.thp_get_order);
> >>>>>>>>>>> +       if (WARN_ON_ONCE(!bpf_hook_thp_get_order))
> >>>>>>>>>>> +               goto out;
> >>>>>>>>>>> +
> >>>>>>>>>>> +       bpf_order = bpf_hook_thp_get_order(vma, type, orders);
> >>>>>>>>>>> +       orders &= BIT(bpf_order);
> >>>>>>>>>>> +
> >>>>>>>>>>> +out:
> >>>>>>>>>>> +       rcu_read_unlock();
> >>>>>>>>>>> +       return orders;
> >>>>>>>>>>> +}
> >>>>>>>>>>
> >>>>>>>>>> I thought I explained it earlier.
> >>>>>>>>>> Nack to a single global prog approach.
> >>>>>>>>>
> >>>>>>>>> I agree. We should have the option to either specify a policy globally,
> >>>>>>>>> or more refined for cgroups/processes.
> >>>>>>>>>
> >>>>>>>>> It's an interesting question if a program would ever want to ship its
> >>>>>>>>> own policy: I can see use cases for that.
> >>>>>>>>>
> >>>>>>>>> So I agree that we should make it more flexible right from the start.
> >>>>>>>>
> >>>>>>>> To achieve per-process granularity, the struct-ops must be embedded
> >>>>>>>> within the mm_struct as follows:
> >>>>>>>>
> >>>>>>>> +#ifdef CONFIG_BPF_MM
> >>>>>>>> +struct bpf_mm_ops {
> >>>>>>>> +#ifdef CONFIG_BPF_THP
> >>>>>>>> +       struct bpf_thp_ops bpf_thp;
> >>>>>>>> +#endif
> >>>>>>>> +};
> >>>>>>>> +#endif
> >>>>>>>> +
> >>>>>>>>      /*
> >>>>>>>>       * Opaque type representing current mm_struct flag state. Must be accessed via
> >>>>>>>>       * mm_flags_xxx() helper functions.
> >>>>>>>> @@ -1268,6 +1281,10 @@ struct mm_struct {
> >>>>>>>>      #ifdef CONFIG_MM_ID
> >>>>>>>>                     mm_id_t mm_id;
> >>>>>>>>      #endif /* CONFIG_MM_ID */
> >>>>>>>> +
> >>>>>>>> +#ifdef CONFIG_BPF_MM
> >>>>>>>> +               struct bpf_mm_ops bpf_mm;
> >>>>>>>> +#endif
> >>>>>>>>             } __randomize_layout;
> >>>>>>>>
> >>>>>>>> We should be aware that this will involve extensive changes in mm/.
> >>>>>>>
> >>>>>>> That's what we do on linux-mm :)
> >>>>>>>
> >>>>>>> It would be great to use Alexei's feedback/experience to come up with
> >>>>>>> something that is flexible for various use cases.
> >>>>>>
> >>>>>> I'm still not entirely convinced that allowing individual processes or
> >>>>>> cgroups to run independent progs is a valid use case. However, since
> >>>>>> we have a consensus that this is the right direction, I will proceed
> >>>>>> with this approach.
> >>>>>>
> >>>>>>>
> >>>>>>> So I think this is likely the right direction.
> >>>>>>>
> >>>>>>> It would be great to evaluate which scenarios we could unlock with this
> >>>>>>> (global vs. per-process vs. per-cgroup) approach, and how
> >>>>>>> extensive/involved the changes will be.
> >>>>>>
> >>>>>> 1. Global Approach
> >>>>>>       - Pros:
> >>>>>>         Simple;
> >>>>>>         Can manage different THP policies for different cgroups or processes.
> >>>>>>      - Cons:
> >>>>>>         Does not allow individual processes to run their own BPF programs.
> >>>>>>
> >>>>>> 2. Per-Process Approach
> >>>>>>        - Pros:
> >>>>>>          Enables each process to run its own BPF program.
> >>>>>>        - Cons:
> >>>>>>          Introduces significant complexity, as it requires handling the
> >>>>>> BPF program's lifecycle (creation, destruction, inheritance) within
> >>>>>> every mm_struct.
> >>>>>>
> >>>>>> 3. Per-Cgroup Approach
> >>>>>>        - Pros:
> >>>>>>           Allows individual cgroups to run their own BPF programs.
> >>>>>>           Less complex than the per-process model, as it can leverage the
> >>>>>> existing cgroup operations structure.
> >>>>>>        - Cons:
> >>>>>>           Creates a dependency on the cgroup subsystem.
> >>>>>>           might not be easy to control at the per-process level.
> >>>>>
> >>>>> Another issue is that how and who to deal with hierarchical cgroup, where one
> >>>>> cgroup is a parent of another. Should bpf program to do that or mm code
> >>>>> to do that? I remember hierarchical cgroup is the main reason THP control
> >>>>> at cgroup level is rejected. If we do per-cgroup bpf control, wouldn't we
> >>>>> get the same rejection from cgroup folks?
> >>>>
> >>>> Valid point.
> >>>>
> >>>> I do wonder if that problem was already encountered elsewhere with bpf
> >>>> and if there is already a solution.
> >>>
> >>> Our standard is to run only one instance of a BPF program type
> >>> system-wide to avoid conflicts. For example, we can't have both
> >>> systemd and a container runtime running bpf-thp simultaneously.
> >>
> >> Right, it's a good question how to combine policies, or "who wins".
> >
> >  From my perspective, the ideal approach is to have one BPF-THP
> > instance per mm_struct. This allows for separate managers in different
> > domains, such as systemd managing BPF-THP for system processes and
> > containerd for container processes, while ensuring that any single
> > process is managed by only one BPF-THP.
>
> I came to the same conclusion. At least it's a valid start.
>
> Maybe we would later want a global fallback BPF-THP prog if none was
> enabled for a specific MM.

Good idea. We can fall back to the global model when attaching to pid 1.

>
> But I would expect to start with a per MM way of doing it, it gives you
> way more flexibility in the long run.

Some THP, such as shmem and file-backed THP, are shareable across multiple
processes and cgroups. If we allow different BPF-THP policies to be
applied to these shared resources, it could lead to policy
inconsistencies. This would ultimately recreate a long-standing issue
in memcg, which still lacks a robust solution [0].

This suggests that applying SCOPED policies to SHAREABLE memory may be
fundamentally flawed ;-)

[0]. https://lore.kernel.org/linux-mm/YwNold0GMOappUxc@slm.duckdns.org/

(Added the maintainers from the old discussion to this thread.)

-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v9 mm-new 03/11] mm: thp: add support for BPF based THP order selection
  2025-10-11  2:13                         ` Yafang Shao
@ 2025-10-13 12:41                           ` David Hildenbrand
  2025-10-13 13:07                             ` Yafang Shao
  0 siblings, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2025-10-13 12:41 UTC (permalink / raw)
  To: Yafang Shao, Tejun Heo, Michal Hocko, Roman Gushchin
  Cc: Zi Yan, Alexei Starovoitov, Johannes Weiner, Andrew Morton,
	baolin.wang, Lorenzo Stoakes, Liam Howlett, npache, ryan.roberts,
	dev.jain, usamaarif642, gutierrez.asier, Matthew Wilcox,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Amery Hung,
	David Rientjes, Jonathan Corbet, 21cnbao, Shakeel Butt,
	lance.yang, Randy Dunlap, bpf, linux-mm, open list:DOCUMENTATION,
	LKML

>> I came to the same conclusion. At least it's a valid start.
>>
>> Maybe we would later want a global fallback BPF-THP prog if none was
>> enabled for a specific MM.
> 
> good idea. We can fallback to the global model when attaching pid 1.
> 
>>
>> But I would expect to start with a per MM way of doing it, it gives you
>> way more flexibility in the long run.
> 
> THP, such as shmem and file-backed THP, are shareable across multiple
> processes and cgroups. If we allow different BPF-THP policies to be
> applied to these shared resources, it could lead to policy
> inconsistencies.

Sure, but nothing new about that (e.g., VM_HUGEPAGE, VM_NOHUGEPAGE, 
PR_GET_THP_DISABLE).

I'd expect that we focus on anon THP as the first step either way.

Skimming over this series, anon memory seems to be the main focus.

> This would ultimately recreate a long-standing issue
> in memcg, which still lacks a robust solution for this problem [0].
> 
> This suggests that applying SCOPED policies to SHAREABLE memory may be
> fundamentally flawed ;-)

Yeah, shared memory is usually more tricky: see mempolicy handling for 
shmem. There, the policy is much rather glued to a file than to a process.

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v9 mm-new 03/11] mm: thp: add support for BPF based THP order selection
  2025-10-13 12:41                           ` David Hildenbrand
@ 2025-10-13 13:07                             ` Yafang Shao
  0 siblings, 0 replies; 37+ messages in thread
From: Yafang Shao @ 2025-10-13 13:07 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Tejun Heo, Michal Hocko, Roman Gushchin, Zi Yan,
	Alexei Starovoitov, Johannes Weiner, Andrew Morton, baolin.wang,
	Lorenzo Stoakes, Liam Howlett, npache, ryan.roberts, dev.jain,
	usamaarif642, gutierrez.asier, Matthew Wilcox,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Amery Hung,
	David Rientjes, Jonathan Corbet, 21cnbao, Shakeel Butt,
	lance.yang, Randy Dunlap, bpf, linux-mm, open list:DOCUMENTATION,
	LKML

On Mon, Oct 13, 2025 at 8:42 PM David Hildenbrand <david@redhat.com> wrote:
>
> >> I came to the same conclusion. At least it's a valid start.
> >>
> >> Maybe we would later want a global fallback BPF-THP prog if none was
> >> enabled for a specific MM.
> >
> > good idea. We can fallback to the global model when attaching pid 1.
> >
> >>
> >> But I would expect to start with a per MM way of doing it, it gives you
> >> way more flexibility in the long run.
> >
> > THP, such as shmem and file-backed THP, are shareable across multiple
> > processes and cgroups. If we allow different BPF-THP policies to be
> > applied to these shared resources, it could lead to policy
> > inconsistencies.
>
> Sure, but nothing new about that (e.g., VM_HUGEPAGE, VM_NOHUGEPAGE,
> PR_GET_THP_DISABLE).
>
> I'd expect that we focus on anon THP as the first step either way.
>
> Skimming over this series, anon memory seems to be the main focus.

Right, the series currently focuses on anon memory. In the next step it
will be extended to file-backed THP.

>
> > This would ultimately recreate a long-standing issue
> > in memcg, which still lacks a robust solution for this problem [0].
> >
> > This suggests that applying SCOPED policies to SHAREABLE memory may be
> > fundamentally flawed ;-)
>
> Yeah, shared memory is usually more tricky: see mempolicy handling for
> shmem. There, the policy is much rather glued to a file than to a process.

For shared THP we are planning to apply the THP policy based on vma->vm_file.

Consequently, the existing BPF-THP policies, which are scoped to a
process or cgroup, are incompatible with shared THP. This raises the
question of how to effectively scope policies for shared memory. While
one option is to key the policy to the file structure, this may not be
ideal as it could lead to considerable implementation and maintenance
challenges...

-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2025-10-13 13:08 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-09-30  5:58 [PATCH v9 mm-new 00/11] mm, bpf: BPF based THP order selection Yafang Shao
2025-09-30  5:58 ` [PATCH v9 mm-new 01/11] mm: thp: remove vm_flags parameter from khugepaged_enter_vma() Yafang Shao
2025-09-30  5:58 ` [PATCH v9 mm-new 02/11] mm: thp: remove vm_flags parameter from thp_vma_allowable_order() Yafang Shao
2025-09-30  5:58 ` [PATCH v9 mm-new 03/11] mm: thp: add support for BPF based THP order selection Yafang Shao
2025-10-03  2:18   ` Alexei Starovoitov
2025-10-07  8:47     ` Yafang Shao
2025-10-08  3:25       ` Alexei Starovoitov
2025-10-08  3:50         ` Yafang Shao
2025-10-08  4:10           ` Alexei Starovoitov
2025-10-08  4:25             ` Yafang Shao
2025-10-08  4:39               ` Alexei Starovoitov
2025-10-08  6:02                 ` Yafang Shao
2025-10-08  8:08     ` David Hildenbrand
2025-10-08  8:18       ` Yafang Shao
2025-10-08  8:28         ` David Hildenbrand
2025-10-08  9:04           ` Yafang Shao
2025-10-08 11:27             ` Zi Yan
2025-10-08 12:06               ` Yafang Shao
2025-10-08 12:49                 ` Gutierrez Asier
2025-10-08 12:07               ` David Hildenbrand
2025-10-08 13:11                 ` Yafang Shao
2025-10-09  9:19                   ` David Hildenbrand
2025-10-09  9:59                     ` Yafang Shao
2025-10-10  7:54                       ` David Hildenbrand
2025-10-11  2:13                         ` Yafang Shao
2025-10-13 12:41                           ` David Hildenbrand
2025-10-13 13:07                             ` Yafang Shao
2025-09-30  5:58 ` [PATCH v9 mm-new 04/11] mm: thp: decouple THP allocation between swap and page fault paths Yafang Shao
2025-09-30  5:58 ` [PATCH v9 mm-new 05/11] mm: thp: enable THP allocation exclusively through khugepaged Yafang Shao
2025-09-30  5:58 ` [PATCH v9 mm-new 06/11] bpf: mark mm->owner as __safe_rcu_or_null Yafang Shao
2025-09-30  5:58 ` [PATCH v9 mm-new 07/11] bpf: mark vma->vm_mm as __safe_trusted_or_null Yafang Shao
2025-10-06 21:06   ` Andrii Nakryiko
2025-10-07  9:05     ` Yafang Shao
2025-09-30  5:58 ` [PATCH v9 mm-new 08/11] selftests/bpf: add a simple BPF based THP policy Yafang Shao
2025-09-30  5:58 ` [PATCH v9 mm-new 09/11] selftests/bpf: add test case to update " Yafang Shao
2025-09-30  5:58 ` [PATCH v9 mm-new 10/11] selftests/bpf: add test cases for invalid thp_adjust usage Yafang Shao
2025-09-30  5:58 ` [PATCH v9 mm-new 11/11] Documentation: add BPF-based THP policy management Yafang Shao
