From: Yafang Shao <laoar.shao@gmail.com>
To: akpm@linux-foundation.org, david@redhat.com, ziy@nvidia.com,
baolin.wang@linux.alibaba.com, lorenzo.stoakes@oracle.com,
Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com,
dev.jain@arm.com, hannes@cmpxchg.org, usamaarif642@gmail.com,
gutierrez.asier@huawei-partners.com, willy@infradead.org,
ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org,
ameryhung@gmail.com, rientjes@google.com, corbet@lwn.net,
21cnbao@gmail.com, shakeel.butt@linux.dev, tj@kernel.org,
lance.yang@linux.dev, rdunlap@infradead.org
Cc: bpf@vger.kernel.org, linux-mm@kvack.org,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
Yafang Shao <laoar.shao@gmail.com>
Subject: [RFC PATCH v10 mm-new 0/9] mm, bpf: BPF-MM, BPF-THP
Date: Wed, 15 Oct 2025 22:17:07 +0800
Message-ID: <20251015141716.887-1-laoar.shao@gmail.com>
History
=======
RFC v1: fmod_ret based BPF-THP hook
https://lore.kernel.org/linux-mm/20250429024139.34365-1-laoar.shao@gmail.com/
RFC v2: struct_ops based BPF-THP hook
https://lore.kernel.org/linux-mm/20250520060504.20251-1-laoar.shao@gmail.com/
RFC v4: Get THP order with interface get_suggested_order()
https://lore.kernel.org/linux-mm/20250729091807.84310-1-laoar.shao@gmail.com/
v4->v9: Simplify the interface to:
  unsigned long
  bpf_hook_thp_get_orders(struct vm_area_struct *vma, enum tva_type type,
                          unsigned long orders);
https://lore.kernel.org/linux-mm/20250930055826.9810-1-laoar.shao@gmail.com/
v9->RFC v10: Scope BPF-THP to individual processes
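To illustrate the shape of the simplified interface, here is a minimal
userspace sketch. It is not the kernel code. It assumes that `orders` is a
bitmask in which bit N stands for THP order N and that the hook returns a
subset of the orders it was offered. The stand-in types, the
`model_thp_get_orders()` name, and the order-4 cap are hypothetical and
exist only for this example.

```c
#include <assert.h>

/* Stand-ins for kernel types (assumption: only their shapes matter here). */
struct vm_area_struct { int placeholder; };
enum tva_type { TVA_PAGEFAULT, TVA_KHUGEPAGED };  /* reduced for illustration */

/* Hypothetical policy knob: largest order allowed on the page fault path. */
#define MODEL_MAX_FAULT_ORDER 4

/*
 * Model of a policy with the same signature as the hook: page faults are
 * limited to small orders, every other caller keeps what it was offered.
 */
static unsigned long
model_thp_get_orders(struct vm_area_struct *vma, enum tva_type type,
                     unsigned long orders)
{
    unsigned long result = orders;

    (void)vma;  /* a real policy could inspect the VMA; this model does not */

    if (type == TVA_PAGEFAULT)
        result &= (1UL << (MODEL_MAX_FAULT_ORDER + 1)) - 1;

    /* The policy may only narrow the offered set, never widen it. */
    assert((result & ~orders) == 0);
    return result;
}
```

With orders 9, 4 and 2 offered, this model keeps orders 4 and 2 for a page
fault and returns the offered mask unchanged for khugepaged.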
The Design
==========
Scoping BPF-THP to cgroup is rejected
-------------------------------------
As explained by Gutierrez:
1. It breaks the cgroup hierarchy when two siblings have different THP
policies.
2. Cgroups were designed for resource management, not for grouping processes
and tuning them.
3. It sets a precedent for adding new flags to cgroups, potentially polluting
them. We may end up with cgroups carrying tens of different flags, making
the sysadmin's job more complex.
The related links are:
https://lore.kernel.org/linux-mm/1940d681-94a6-48fb-b889-cd8f0b91b330@huawei-partners.com/
https://lore.kernel.org/linux-mm/20241030150851.GB706616@cmpxchg.org/
So we have to scope it to individual processes.
Scoping BPF-THP to process
--------------------------
To eliminate potential conflicts among competing BPF-THP instances, we
enforce that each process is exclusively managed by a single BPF-THP. This
approach has received agreement from David. For context, see:
https://lore.kernel.org/linux-mm/3577f7fd-429a-49c5-973b-38174a67be15@redhat.com/
When registering a BPF-THP, we specify the PID of a target task. The
BPF-THP is then installed in the task's `mm_struct`:
  struct mm_struct {
      struct bpf_thp_ops __rcu *bpf_thp;
  };
Inheritance Behavior:
- Existing child processes are unaffected
- Newly forked children inherit the BPF-THP from their parent
- The BPF-THP persists across execve() calls
A new linked list tracks all tasks managed by each BPF-THP instance:
- Newly managed tasks are added to the list
- Exiting tasks are automatically removed from the list
- During BPF-THP unregistration (e.g., when the BPF link is removed), all
managed tasks have their bpf_thp pointer set to NULL
- BPF-THP instances can be dynamically updated, with all tracked tasks
automatically migrating to the new version.
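The lifecycle rules above can be sketched as a small userspace model. This
is only an illustration under stated assumptions, not the implementation in
mm/huge_memory_bpf.c: all `model_*` names are hypothetical, the list is a
plain singly linked list, and the RCU and locking a kernel implementation
needs are left out entirely.

```c
#include <assert.h>
#include <stddef.h>

struct model_thp_ops;

/* Models the per-process state: one policy pointer plus a list link. */
struct model_mm {
    struct model_thp_ops *bpf_thp;  /* NULL when no policy manages this mm */
    struct model_mm *next;          /* next mm managed by the same policy */
};

/* Models one BPF-THP instance and the list of processes it manages. */
struct model_thp_ops {
    struct model_mm *managed;
};

/* Attach a policy; fails if the process is already managed (exclusivity). */
static int model_attach(struct model_thp_ops *ops, struct model_mm *mm)
{
    if (mm->bpf_thp)
        return -1;
    mm->bpf_thp = ops;
    mm->next = ops->managed;
    ops->managed = mm;
    return 0;
}

/* A newly forked child inherits the parent's policy, if there is one. */
static void model_fork(const struct model_mm *parent, struct model_mm *child)
{
    child->bpf_thp = NULL;
    child->next = NULL;
    if (parent->bpf_thp)
        model_attach(parent->bpf_thp, child);
}

/* An exiting process is unlinked from its policy's list. */
static void model_exit(struct model_mm *mm)
{
    struct model_thp_ops *ops = mm->bpf_thp;
    struct model_mm **pp;

    if (!ops)
        return;
    for (pp = &ops->managed; *pp; pp = &(*pp)->next) {
        if (*pp == mm) {
            *pp = mm->next;
            break;
        }
    }
    mm->bpf_thp = NULL;
    mm->next = NULL;
}

/* Unregistration clears the pointer of every managed process. */
static void model_unregister(struct model_thp_ops *ops)
{
    while (ops->managed) {
        struct model_mm *mm = ops->managed;

        ops->managed = mm->next;
        mm->bpf_thp = NULL;
        mm->next = NULL;
    }
}

/* Update: every process tracked by old_ops migrates to new_ops. */
static void model_update(struct model_thp_ops *old_ops,
                         struct model_thp_ops *new_ops)
{
    struct model_mm *mm, *tail = NULL;

    for (mm = old_ops->managed; mm; mm = mm->next) {
        mm->bpf_thp = new_ops;
        tail = mm;
    }
    if (tail) {
        /* Splice the old list in front of the new policy's list. */
        tail->next = new_ops->managed;
        new_ops->managed = old_ops->managed;
        old_ops->managed = NULL;
    }
}
```

In this model a second `model_attach()` on an already managed process fails,
which is the conflict-avoidance rule described above, and `model_fork()`
touches only the new child, so children that existed before the attach stay
unaffected.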
This design simplifies BPF-THP management in production environments by
providing clear lifecycle management and preventing conflicts between
multiple BPF-THP instances.
Any feedback is welcome.
Future Work
===========
Introduce a global fallback mechanism to address the limitations that both
the process-based and cgroup-based approaches have in managing shared
resources:
https://lore.kernel.org/linux-mm/YwNold0GMOappUxc@slm.duckdns.org/
Yafang Shao (9):
mm: thp: remove vm_flags parameter from khugepaged_enter_vma()
mm: thp: remove vm_flags parameter from thp_vma_allowable_order()
mm: thp: add support for BPF based THP order selection
mm: thp: decouple THP allocation between swap and page fault paths
mm: thp: enable THP allocation exclusively through khugepaged
bpf: mark mm->owner as __safe_rcu_or_null
bpf: mark vma->vm_mm as __safe_trusted_or_null
selftests/bpf: add a simple BPF based THP policy
Documentation: add BPF-based THP policy management
Documentation/admin-guide/mm/transhuge.rst | 39 +++
MAINTAINERS | 3 +
fs/exec.c | 1 +
fs/proc/task_mmu.c | 3 +-
include/linux/huge_mm.h | 59 +++-
include/linux/khugepaged.h | 10 +-
include/linux/mm_types.h | 18 ++
kernel/bpf/verifier.c | 8 +
kernel/fork.c | 1 +
mm/Kconfig | 22 ++
mm/Makefile | 1 +
mm/huge_memory.c | 7 +-
mm/huge_memory_bpf.c | 306 ++++++++++++++++++
mm/khugepaged.c | 35 +-
mm/madvise.c | 7 +
mm/memory.c | 22 +-
mm/mmap.c | 1 +
mm/shmem.c | 2 +-
mm/vma.c | 6 +-
tools/testing/selftests/bpf/config | 3 +
.../selftests/bpf/prog_tests/thp_adjust.c | 245 ++++++++++++++
tools/testing/selftests/bpf/progs/lsm.c | 8 +-
.../selftests/bpf/progs/test_thp_adjust.c | 23 ++
23 files changed, 777 insertions(+), 53 deletions(-)
create mode 100644 mm/huge_memory_bpf.c
create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c
--
2.47.3