* [PATCH v12 mm-new 00/10] mm, bpf: BPF-MM, BPF-THP
@ 2025-10-26 10:01 Yafang Shao
From: Yafang Shao @ 2025-10-26 10:01 UTC (permalink / raw)
  To: akpm, ast, daniel, andrii, david, lorenzo.stoakes
  Cc: martin.lau, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, ziy, Liam.Howlett, npache,
	ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
	willy, ameryhung, rientjes, corbet, 21cnbao, shakeel.butt, tj,
	lance.yang, rdunlap, clm, bpf, linux-mm, Yafang Shao

History
=======

RFC v1: fmod_ret based BPF-THP hook
        https://lore.kernel.org/linux-mm/20250429024139.34365-1-laoar.shao@gmail.com/

RFC v2: struct_ops based BPF-THP hook
        https://lore.kernel.org/linux-mm/20250520060504.20251-1-laoar.shao@gmail.com/

RFC v4: Get THP order with interface get_suggested_order()
        https://lore.kernel.org/linux-mm/20250729091807.84310-1-laoar.shao@gmail.com/

v4->v9: Simplify the interface to:

        unsigned long
        bpf_hook_thp_get_orders(struct vm_area_struct *vma, enum tva_type type,
                                unsigned long orders);

        https://lore.kernel.org/linux-mm/20250930055826.9810-1-laoar.shao@gmail.com/

v9->RFC v10: Scope BPF-THP to individual processes
v10->v11:    Remove the RFC tag
v11->v12:    Fix issues reported by AI
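
As a rough illustration of the v9 interface listed above, the userspace
sketch below treats `orders` as a bitmask in which bit N stands for page
order N, and models a policy that may only narrow the set it is offered.
Everything prefixed with demo_ or DEMO_ is hypothetical: the reduced enum,
the PMD order of 9 (which assumes 4 KiB base pages) and the clamping done
by the caller are assumptions for illustration, not code from this series.

```c
#include <assert.h>

/* Hypothetical stand-in for enum tva_type; reduced to two cases. */
enum demo_tva_type { DEMO_TVA_PAGEFAULT, DEMO_TVA_KHUGEPAGED };

#define DEMO_PMD_ORDER	9UL		/* assumes 4 KiB base pages */
#define DEMO_BIT(order)	(1UL << (order))

/*
 * Toy policy: allow only PMD-sized THP, and only for khugepaged.
 * Bit N set in the return value means "order N is allowed".
 */
static inline unsigned long demo_get_orders(enum demo_tva_type type,
					    unsigned long orders)
{
	if (type != DEMO_TVA_KHUGEPAGED)
		return 0;
	return orders & DEMO_BIT(DEMO_PMD_ORDER);
}

/*
 * Caller side (assumed behaviour): the policy result is clamped to the
 * offered set, so a policy can never enable an order it was not offered.
 */
static inline unsigned long demo_apply_policy(enum demo_tva_type type,
					      unsigned long orders)
{
	unsigned long allowed = demo_get_orders(type, orders) & orders;

	assert((allowed & ~orders) == 0);	/* never wider than the offer */
	return allowed;
}
```

With orders 4 and 9 offered, this toy policy yields only order 9 for
khugepaged and nothing for the page fault path.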

The Design
==========

Scoping BPF-THP to cgroup is rejected
-------------------------------------

As explained by Gutierrez:

1. It breaks the cgroup hierarchy when two siblings have different THP
   policies
2. Cgroup was designed for resource management, not for grouping processes
   and tuning them
3. It would set a precedent for adding new flags to cgroup and potentially
   polluting it. We may end up with cgroups carrying tens of different
   flags, making the sysadmin's job more complex

The related links are: 

  https://lore.kernel.org/linux-mm/1940d681-94a6-48fb-b889-cd8f0b91b330@huawei-partners.com/
  https://lore.kernel.org/linux-mm/20241030150851.GB706616@cmpxchg.org/    

So we have to scope it to the process.

Scoping BPF-THP to process
--------------------------

To eliminate potential conflicts among competing BPF-THP instances, we
enforce that each process is exclusively managed by a single BPF-THP. This
approach has received agreement from David. For context, see:

  https://lore.kernel.org/linux-mm/3577f7fd-429a-49c5-973b-38174a67be15@redhat.com/

When registering a BPF-THP, we specify the PID of a target task. The
BPF-THP is then installed in the task's `mm_struct`:

  struct mm_struct {
      struct bpf_thp_ops __rcu *bpf_thp;
  };

Inheritance Behavior:

- Existing child processes are unaffected
- Newly forked children inherit the BPF-THP from their parent
- The BPF-THP persists across exec
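
The inheritance rules above can be modelled in a few lines of userspace C.
This is only a sketch: struct demo_mm and struct demo_thp_ops are
hypothetical stand-ins for `mm_struct` and `struct bpf_thp_ops`, and the
real fork and exec paths involve RCU and locking that are ignored here.

```c
#include <assert.h>
#include <stddef.h>

struct demo_thp_ops { int id; };	/* stand-in for struct bpf_thp_ops */
struct demo_mm {			/* stand-in for mm_struct */
	const struct demo_thp_ops *bpf_thp;
};

/* Registration targets a single process; other mms are left untouched. */
static inline void demo_attach(struct demo_mm *mm,
			       const struct demo_thp_ops *ops)
{
	assert(mm->bpf_thp == NULL);	/* one BPF-THP per process */
	mm->bpf_thp = ops;
}

/* fork(): the child's new mm starts with the parent's current policy. */
static inline struct demo_mm demo_fork(const struct demo_mm *parent)
{
	struct demo_mm child = { .bpf_thp = parent->bpf_thp };

	return child;
}

/* exec(): the replacement mm keeps the policy carried by the old one. */
static inline struct demo_mm demo_exec(const struct demo_mm *old)
{
	struct demo_mm fresh = { .bpf_thp = old->bpf_thp };

	return fresh;
}
```

A child forked before demo_attach() keeps a NULL pointer, while a child
forked afterwards sees the parent's policy and retains it across
demo_exec().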

A new linked list tracks all tasks managed by each BPF-THP instance:

- Newly managed tasks are added to the list
- Exiting tasks are automatically removed from the list
- During BPF-THP unregistration (e.g., when the BPF link is removed), all
  managed tasks have their bpf_thp pointer set to NULL
- BPF-THP instances can be dynamically updated, with all tracked tasks
  automatically migrating to the new version.

This design simplifies BPF-THP management in production environments by
providing clear lifecycle management and preventing conflicts between
multiple BPF-THP instances.
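
To make the lifecycle of the tracking list concrete, here is a minimal
userspace model of it. All names are hypothetical, the list is a plain
doubly linked list rather than the kernel's list primitives, and RCU and
locking are left out entirely, so this should be read as a sketch of the
bookkeeping rather than of the implementation in mm/huge_memory_bpf.c.

```c
#include <assert.h>
#include <stddef.h>

struct demo_policy;

/* Stand-in for mm_struct: one policy pointer plus list linkage. */
struct demo_mm {
	struct demo_policy *bpf_thp;
	struct demo_mm *prev, *next;
};

/* Stand-in for one BPF-THP instance and the mms it manages. */
struct demo_policy {
	struct demo_mm *head;
};

/* A newly managed task is linked into the instance's list. */
static inline int demo_track(struct demo_policy *p, struct demo_mm *mm)
{
	if (mm->bpf_thp)
		return -1;	/* already managed by an instance */
	mm->bpf_thp = p;
	mm->prev = NULL;
	mm->next = p->head;
	if (p->head)
		p->head->prev = mm;
	p->head = mm;
	return 0;
}

/* An exiting task unlinks itself from its instance. */
static inline void demo_untrack(struct demo_mm *mm)
{
	struct demo_policy *p = mm->bpf_thp;

	if (!p)
		return;
	if (mm->prev)
		mm->prev->next = mm->next;
	else
		p->head = mm->next;
	if (mm->next)
		mm->next->prev = mm->prev;
	mm->bpf_thp = NULL;
	mm->prev = mm->next = NULL;
}

/* Unregistration: every managed task gets its pointer set to NULL. */
static inline void demo_unregister(struct demo_policy *p)
{
	while (p->head)
		demo_untrack(p->head);
}

/* Update: all tracked tasks migrate to the replacement instance. */
static inline void demo_update(struct demo_policy *old,
			       struct demo_policy *repl)
{
	assert(old != repl);	/* otherwise the loop would never end */
	while (old->head) {
		struct demo_mm *mm = old->head;

		demo_untrack(mm);
		demo_track(repl, mm);
	}
}
```

The -1 return from demo_track() models the rule that a process is managed
by at most one BPF-THP; the error value itself is an assumption.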

Global Mode
-----------

The per-process BPF-THP mode is unsuitable for managing shared resources
such as shmem THP and file-backed THP. This aligns with known cgroup
limitations for similar scenarios:

  https://lore.kernel.org/linux-mm/YwNold0GMOappUxc@slm.duckdns.org/ 

Introduce a global BPF-THP mode to address this gap. When a global
instance is registered:

- All existing per-process instances are disabled
- New per-process registrations are blocked
- Existing per-process instances remain registered (no forced unregistration)

The global mode takes precedence over per-process instances. Updates are
type-isolated: global instances can only be updated by new global
instances, and per-process instances by new per-process instances.
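
The precedence and update rules can be sketched as a small userspace
model. The names, the -1 error value and the limit of one global instance
at a time are assumptions made for the sake of the example; only the
behaviour described above (global wins, new per-process registrations are
blocked, updates are type-isolated) comes from this cover letter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_ops { bool is_global; };

struct demo_state {
	const struct demo_ops *global;	/* assumed: at most one at a time */
};

struct demo_mm { const struct demo_ops *bpf_thp; };

/* Global wins; the per-process pointer stays set but is not consulted. */
static inline const struct demo_ops *
demo_effective(const struct demo_state *st, const struct demo_mm *mm)
{
	return st->global ? st->global : mm->bpf_thp;
}

/* New per-process registrations are refused while global mode is active. */
static inline int demo_register_process(const struct demo_state *st,
					struct demo_mm *mm,
					const struct demo_ops *ops)
{
	assert(!ops->is_global);
	if (st->global || mm->bpf_thp)
		return -1;
	mm->bpf_thp = ops;
	return 0;
}

static inline int demo_register_global(struct demo_state *st,
				       const struct demo_ops *ops)
{
	assert(ops->is_global);
	if (st->global)
		return -1;
	st->global = ops;
	return 0;
}

/* Type isolation: only a global instance may replace a global instance. */
static inline int demo_update_global(struct demo_state *st,
				     const struct demo_ops *repl)
{
	if (!st->global || !repl->is_global)
		return -1;
	st->global = repl;
	return 0;
}
```

After demo_register_global() succeeds, demo_effective() returns the global
instance for every mm, while an mm that already had a per-process
instance still holds its original pointer.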

BPF CI
------

Several dependency patches are currently in mm-new but haven't been merged
into bpf-next. To enable BPF CI testing, I had to make minor changes to
patches #1 and #2 and trigger the BPF CI manually. For details, see:

  https://github.com/kernel-patches/bpf/pull/10097

An error occurred during the test, but it was unrelated to this series.

Yafang Shao (10):
  mm: thp: remove vm_flags parameter from khugepaged_enter_vma()
  mm: thp: remove vm_flags parameter from thp_vma_allowable_order()
  mm: thp: add support for BPF based THP order selection
  mm: thp: decouple THP allocation between swap and page fault paths
  mm: thp: enable THP allocation exclusively through khugepaged
  mm: bpf-thp: add support for global mode
  Documentation: add BPF THP
  selftests/bpf: add a simple BPF based THP policy
  selftests/bpf: add test case to update THP policy
  selftests/bpf: add test case for BPF-THP inheritance across fork

 Documentation/admin-guide/mm/transhuge.rst    | 113 +++++
 MAINTAINERS                                   |   3 +
 fs/exec.c                                     |   1 +
 fs/proc/task_mmu.c                            |   3 +-
 include/linux/huge_mm.h                       |  58 ++-
 include/linux/khugepaged.h                    |  10 +-
 include/linux/mm_types.h                      |  17 +
 kernel/fork.c                                 |   1 +
 mm/Kconfig                                    |  24 +
 mm/Makefile                                   |   1 +
 mm/huge_memory.c                              |   7 +-
 mm/huge_memory_bpf.c                          | 423 ++++++++++++++++++
 mm/khugepaged.c                               |  43 +-
 mm/madvise.c                                  |   7 +
 mm/memory.c                                   |  22 +-
 mm/mmap.c                                     |   1 +
 mm/shmem.c                                    |   2 +-
 mm/vma.c                                      |   6 +-
 tools/testing/selftests/bpf/config            |   3 +
 .../selftests/bpf/prog_tests/thp_adjust.c     | 357 +++++++++++++++
 .../selftests/bpf/progs/test_thp_adjust.c     |  53 +++
 21 files changed, 1101 insertions(+), 54 deletions(-)
 create mode 100644 mm/huge_memory_bpf.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c

-- 
2.47.3


