From: Yafang Shao <laoar.shao@gmail.com>
To: "David Hildenbrand (Red Hat)" <david@kernel.org>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>,
Andrew Morton <akpm@linux-foundation.org>,
Alexei Starovoitov <ast@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
Andrii Nakryiko <andrii@kernel.org>,
Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
Martin KaFai Lau <martin.lau@linux.dev>,
Eduard <eddyz87@gmail.com>, Song Liu <song@kernel.org>,
Yonghong Song <yonghong.song@linux.dev>,
John Fastabend <john.fastabend@gmail.com>,
KP Singh <kpsingh@kernel.org>,
Stanislav Fomichev <sdf@fomichev.me>, Hao Luo <haoluo@google.com>,
Jiri Olsa <jolsa@kernel.org>, Zi Yan <ziy@nvidia.com>,
Liam Howlett <Liam.Howlett@oracle.com>,
npache@redhat.com, ryan.roberts@arm.com, dev.jain@arm.com,
Johannes Weiner <hannes@cmpxchg.org>,
usamaarif642@gmail.com, gutierrez.asier@huawei-partners.com,
Matthew Wilcox <willy@infradead.org>,
Amery Hung <ameryhung@gmail.com>,
David Rientjes <rientjes@google.com>,
Jonathan Corbet <corbet@lwn.net>, Barry Song <21cnbao@gmail.com>,
Shakeel Butt <shakeel.butt@linux.dev>, Tejun Heo <tj@kernel.org>,
lance.yang@linux.dev, Randy Dunlap <rdunlap@infradead.org>,
Chris Mason <clm@meta.com>, bpf <bpf@vger.kernel.org>,
linux-mm <linux-mm@kvack.org>
Subject: Re: [PATCH v12 mm-new 06/10] mm: bpf-thp: add support for global mode
Date: Fri, 28 Nov 2025 10:53:53 +0800 [thread overview]
Message-ID: <CALOAHbCR3Y=GCpX8S9CctONO=Emh4RvYAibHU=ZQyLP1s0MOVQ@mail.gmail.com> (raw)
In-Reply-To: <9f73a5bd-32a0-4d5f-8a3f-7bff8232e408@kernel.org>
On Thu, Nov 27, 2025 at 7:48 PM David Hildenbrand (Red Hat)
<david@kernel.org> wrote:
>
> >> To move forward, I'm happy to set the global mode aside for now and
> >> potentially drop it in the next version. I'd really like to hear your
> >> perspective on the per-process mode. Does this implementation meet
> >> your needs?
>
> I haven't had the capacity to follow the evolution of this patch set
> unfortunately, just to comment on some points from my perspective.
>
> First, I agree that the global mode is not what we want, not even as a
> fallback.
>
> >
> > Attaching st_ops to task_struct or to mm_struct is a can of worms.
> > With cgroup-bpf we went through painful bugs with lifetime
> > of cgroup vs bpf, dying cgroups, wq deadlock, etc. All these
> > problems are behind us. With st_ops in mm_struct it will be more
> > painful. I'd rather not go that route.
>
> That's valuable information, thanks. I would have hoped that per-MM
> policies would be easier.
The per-MM approach has a performance advantage over per-MEMCG
policies, because it reaches the policy hook directly via

  vma->vm_mm->bpf_mm->policy_hook()

whereas the per-MEMCG method requires a more expensive lookup:

  memcg = get_mem_cgroup_from_mm(vma->vm_mm);
  memcg->bpf_memcg->policy_hook();

That extra lookup could be a concern on a critical path. However, the
cost of the per-MEMCG mode can be mitigated. For instance, when a task
is attached to a new memcg, we can cache the hook pointer:

  task->mm->bpf_mm->policy_hook = memcg->bpf_memcg->policy_hook;

So ultimately we might still introduce an mm_struct::bpf_mm field to
provide an efficient interface.
>
> Are there some pointers to explore regarding the "can of worms" you
> mention when it comes to per-MM policies?
>
> >
> > And revist cgroup instead, since you were way too quick
> > to accept the pushback because all you wanted is global mode.
> >
> > The main reason for pushback was:
> > "
> > Cgroup was designed for resource management not for grouping processes and
> > tune those processes
> > "
> >
> > which was true when cgroup-v2 was designed, but that ship sailed
> > years ago when we introduced cgroup-bpf.
>
> Also valuable information.
>
> Personally I don't have a preference regarding per-mm or per-cgroup.
> Whatever we can get working reliably.
I am open to either approach, as long as it's acceptable to the maintainers.
> Sounds like cgroup-bpf has sorted
> out most of the mess.
No, the attach-based cgroup-bpf has proven to be a "can of worms" in
practice. (I welcome corrections from the BPF maintainers if my
assessment is inaccurate.) Meanwhile, the struct-ops-based cgroup-bpf
is still under discussion.
>
> memcg/cgroup maintainers might disagree, but it's probably worth having
> that discussion once again.
>
> > None of the progs are doing resource management and lots of infrastructure,
> > container management, and open source projects use cgroup-bpf
> > as a grouping of processes. bpf progs attached to cgroup/hook tuple
> > only care about processes within that cgroup. No resource management.
> > See __cgroup_bpf_check_dev_permission or __cgroup_bpf_run_filter_sysctl
> > and others.
> > The path is current->cgroup->bpf_progs and progs do exactly
> > what cgroup wasn't designed to do. They tune a set of processes.
> >
> > You should do the same.
> >
> > Also I really don't see a compelling use case for bpf in THP.
>
> There is a lot more potential there to write fine-tuned policies that
> take VMA information into account.
>
> The tests likely reflect what Yafang seems to focus on: IIUC primarily
> enabling+disabling traditional THPs (e.g., 2M) on a per-process basis.
Right.
>
> Some of what Yafang might want to achieve could maybe at this point
> be achieved through the prctl(PR_SET_THP_DISABLE) support, including
> extensions we recently added [1].
>
> Systemd support still seems to be in the works [2] for some of that.
>
>
> [1] https://lwn.net/Articles/1032014/
> [2] https://github.com/systemd/systemd/pull/39085
Thank you for sharing this.
However, BPF-THP is already deployed across our server fleet, and both
our users and my boss are satisfied with it. As such, we are not
considering a switch. The current deployment also gives us a valuable
opportunity to experiment with additional policies in production.
In summary, I am fine with either the per-MM or the per-MEMCG method.
Furthermore, I don't believe this is an either-or decision; the two
can be implemented to work together.
--
Regards
Yafang
Thread overview: 29+ messages
2025-10-26 10:01 [PATCH v12 mm-new 00/10] mm, bpf: BPF-MM, BPF-THP Yafang Shao
2025-10-26 10:01 ` [PATCH v12 mm-new 01/10] mm: thp: remove vm_flags parameter from khugepaged_enter_vma() Yafang Shao
2025-10-26 10:01 ` [PATCH v12 mm-new 02/10] mm: thp: remove vm_flags parameter from thp_vma_allowable_order() Yafang Shao
2025-10-26 10:01 ` [PATCH v12 mm-new 03/10] mm: thp: add support for BPF based THP order selection Yafang Shao
2025-10-26 10:01 ` [PATCH v12 mm-new 04/10] mm: thp: decouple THP allocation between swap and page fault paths Yafang Shao
2025-10-27 4:07 ` Barry Song
2025-10-26 10:01 ` [PATCH v12 mm-new 05/10] mm: thp: enable THP allocation exclusively through khugepaged Yafang Shao
2025-10-26 10:01 ` [PATCH v12 mm-new 06/10] mm: bpf-thp: add support for global mode Yafang Shao
2025-10-29 1:32 ` Alexei Starovoitov
2025-10-29 2:13 ` Yafang Shao
2025-10-30 0:57 ` Alexei Starovoitov
2025-10-30 2:40 ` Yafang Shao
2025-11-27 11:48 ` David Hildenbrand (Red Hat)
2025-11-28 2:53 ` Yafang Shao [this message]
2025-11-28 7:57 ` Lorenzo Stoakes
2025-11-28 8:18 ` Yafang Shao
2025-11-28 8:31 ` Lorenzo Stoakes
2025-11-28 11:56 ` Yafang Shao
2025-11-28 12:18 ` Lorenzo Stoakes
2025-11-28 12:51 ` Yafang Shao
2025-11-28 8:39 ` David Hildenbrand (Red Hat)
2025-11-28 8:55 ` Lorenzo Stoakes
2025-11-30 13:06 ` Yafang Shao
2025-11-26 15:13 ` Rik van Riel
2025-11-27 2:35 ` Yafang Shao
2025-10-26 10:01 ` [PATCH v12 mm-new 07/10] Documentation: add BPF THP Yafang Shao
2025-10-26 10:01 ` [PATCH v12 mm-new 08/10] selftests/bpf: add a simple BPF based THP policy Yafang Shao
2025-10-26 10:01 ` [PATCH v12 mm-new 09/10] selftests/bpf: add test case to update " Yafang Shao
2025-10-26 10:01 ` [PATCH v12 mm-new 10/10] selftests/bpf: add test case for BPF-THP inheritance across fork Yafang Shao