From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
To: "David Hildenbrand (Red Hat)" <david@kernel.org>
Cc: Yafang Shao <laoar.shao@gmail.com>,
Alexei Starovoitov <alexei.starovoitov@gmail.com>,
Andrew Morton <akpm@linux-foundation.org>,
Alexei Starovoitov <ast@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
Andrii Nakryiko <andrii@kernel.org>,
Martin KaFai Lau <martin.lau@linux.dev>,
Eduard <eddyz87@gmail.com>, Song Liu <song@kernel.org>,
Yonghong Song <yonghong.song@linux.dev>,
John Fastabend <john.fastabend@gmail.com>,
KP Singh <kpsingh@kernel.org>,
Stanislav Fomichev <sdf@fomichev.me>, Hao Luo <haoluo@google.com>,
Jiri Olsa <jolsa@kernel.org>, Zi Yan <ziy@nvidia.com>,
Liam Howlett <Liam.Howlett@oracle.com>,
npache@redhat.com, ryan.roberts@arm.com, dev.jain@arm.com,
Johannes Weiner <hannes@cmpxchg.org>,
usamaarif642@gmail.com, gutierrez.asier@huawei-partners.com,
Matthew Wilcox <willy@infradead.org>,
Amery Hung <ameryhung@gmail.com>,
David Rientjes <rientjes@google.com>,
Jonathan Corbet <corbet@lwn.net>, Barry Song <21cnbao@gmail.com>,
Shakeel Butt <shakeel.butt@linux.dev>, Tejun Heo <tj@kernel.org>,
lance.yang@linux.dev, Randy Dunlap <rdunlap@infradead.org>,
Chris Mason <clm@meta.com>, bpf <bpf@vger.kernel.org>,
linux-mm <linux-mm@kvack.org>
Subject: Re: [PATCH v12 mm-new 06/10] mm: bpf-thp: add support for global mode
Date: Fri, 28 Nov 2025 08:55:12 +0000
Message-ID: <f8b4bd7a-b2e6-4ac0-971a-75cfd03c824d@lucifer.local>
In-Reply-To: <e52bf30d-e63b-44ed-9808-ee3e612e0ba1@kernel.org>
On Fri, Nov 28, 2025 at 09:39:06AM +0100, David Hildenbrand (Red Hat) wrote:
> On 11/28/25 03:53, Yafang Shao wrote:
> > On Thu, Nov 27, 2025 at 7:48 PM David Hildenbrand (Red Hat)
> > <david@kernel.org> wrote:
>
> Lorenzo commented on the upstream topic, let me mostly comment on the other
> parts:
> > > > Attaching st_ops to task_struct or to mm_struct is a can of worms.
> > > > With cgroup-bpf we went through painful bugs with lifetime
> > > > of cgroup vs bpf, dying cgroups, wq deadlock, etc. All these
> > > > problems are behind us. With st_ops in mm_struct it will be more
> > > > painful. I'd rather not go that route.
> > >
> > > That's valuable information, thanks. I would have hoped that per-MM
> > > policies would be easier.
> >
> > The per-MM approach has a performance advantage over per-MEMCG
> > policies. This is because it accesses the policy hook directly via
> >
> > vma->vm_mm->bpf_mm->policy_hook()
> >
> > whereas the per-MEMCG method requires a more expensive lookup:
> >
> > memcg = get_mem_cgroup_from_mm(vma->vm_mm);
> > memcg->bpf_memcg->policy_hook();
> >
> > This lookup could be a concern in a critical path. However, this
> > performance issue in the per-MEMCG mode can be mitigated. For
> > instance, when a task is added to a new memcg, we can cache the hook
> > pointer:
> >
> > task->mm->bpf_mm->policy_hook = memcg->bpf_memcg->policy_hook
> >
> > Ultimately, we might still introduce a mm_struct:bpf_mm field to
> > provide an efficient interface.
>
> Right, caching is what I would have proposed. I would expect some headaches
> with lifetime, but probably nothing unsolvable.
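
To make the lookup-versus-cache trade-off discussed above concrete, here is a minimal userspace sketch in C. It is only a model under stated assumptions: every type and function name (toy_mm, toy_memcg, toy_attach_mm and so on) is hypothetical and stands in for the kernel structures named in the quoted pseudocode; none of it is taken from the patch series.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hook type: returns the folio order to try for a mapping. */
typedef int (*toy_policy_hook_t)(unsigned long vma_size);

/* Toy stand-ins for the structures named in the thread (assumed layout). */
struct toy_bpf_memcg { toy_policy_hook_t policy_hook; };
struct toy_memcg     { struct toy_bpf_memcg *bpf_memcg; };
struct toy_bpf_mm    { toy_policy_hook_t policy_hook; }; /* cached copy */
struct toy_mm {
	struct toy_memcg  *memcg;
	struct toy_bpf_mm *bpf_mm;
};

/* Counts how often the simulated, more expensive memcg lookup runs. */
static unsigned long toy_slow_lookups;

static struct toy_memcg *toy_get_memcg_from_mm(struct toy_mm *mm)
{
	toy_slow_lookups++;
	return mm->memcg;
}

/* Per-memcg path: resolve the memcg on every policy call. */
static int toy_policy_via_memcg(struct toy_mm *mm, unsigned long vma_size)
{
	struct toy_memcg *memcg = toy_get_memcg_from_mm(mm);

	return memcg->bpf_memcg->policy_hook(vma_size);
}

/* Attach time: copy the memcg's hook pointer into the mm once. */
static void toy_attach_mm(struct toy_mm *mm, struct toy_memcg *memcg)
{
	assert(mm && mm->bpf_mm && memcg && memcg->bpf_memcg);
	mm->memcg = memcg;
	mm->bpf_mm->policy_hook = memcg->bpf_memcg->policy_hook;
}

/* Fast path: call the cached hook without touching the memcg. */
static int toy_policy_cached(struct toy_mm *mm, unsigned long vma_size)
{
	return mm->bpf_mm->policy_hook(vma_size);
}

/* Example policy: order 9 for mappings of at least 2 MiB, else order 0. */
static int toy_example_policy(unsigned long vma_size)
{
	return vma_size >= (2UL << 20) ? 9 : 0;
}
```

The sketch deliberately leaves out the hard part: the cached pointer has to be refreshed or cleared whenever the memcg's program is replaced or detached, or when the task moves to another memcg. That is the kind of lifetime problem mentioned above.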
>
>
> > > Sounds like cgroup-bpf has sorted
> > > out most of the mess.
> >
> > No, the attach-based cgroup-bpf has proven to be ... a "can of worms"
> > in practice ...
> > (I welcome corrections from the BPF maintainers if my assessment is
> > inaccurate.)
>
> I don't know what's right or wrong here, as Alexei said the "mm_struct"
> based one would be a can of worms and that the cgroup-based one
> apparently solved these issues ("All these problems are behind us."), that's
> why I asked for some clarifications. :)
>
> [...]
>
> > >
> > > Some of what Yafang might want to achieve could maybe be achieved at
> > > this point through the prctl(PR_SET_THP_DISABLE) support, including
> > > extensions we recently added [1].
> > >
> > > Systemd support still seems to be in the works [2] for some of that.
> > >
> > >
> > > [1] https://lwn.net/Articles/1032014/
> > > [2] https://github.com/systemd/systemd/pull/39085
> >
> > Thank you for sharing this.
> > However, BPF-THP is already deployed across our server fleet and both
> > our users and my boss are satisfied with it. As such, we are not
> > considering a switch. The current solution also offers us a valuable
> > opportunity to experiment with additional policies in production.
>
> Just to emphasize: we usually don't add two mechanisms to achieve the very
> same end goal. There really must be something delivering more value for us
> to accept something more complex. Focusing on solving a solved problem is
> not good.
Yes.
>
> If some company went with a downstream-only approach they might be stuck
> having to maintain that forever.
>
> That's why other companies prefer upstream-first :)
I think trying to do downstream-only is going to cause very big headaches if we
choose to substantially alter THP in future (and of course - we do intend to).
>
>
> Having that said, the original reason why I agreed that having bpf for THP
> can be valuable is that I see a lot more value for rapid prototyping and
> policies once you can actually control on a per-VMA basis (using vma size,
> flags, anon-vma names etc) where specific folio orders could be valuable,
> and where not. But also, possibly where we would want to waste memory and
> where not.
The same for me.

But given the actual author of the feature has already treated this as a
permanent and unchanging feature, I absolutely do not have confidence that we
can do this.

The situation I feared us running into is that we'd release this even with
CONFIG_EXPERIMENTAL_DO_NOT_RELY etc. (note the flag is somehow now
CONFIG_BPF_THP which... isn't what I wanted) and people would STILL rely on it,
then, when we try to change it, loudly complain and make it difficult to remove.

I am now convinced that this is just going to happen no matter what we do.

So the 'rapid prototyping' approach is just not workable, at all in my view.

>
> As we are speaking I have a customer running into issues [1] with
> virtio-balloon discarding pages in a VM and khugepaged undoing part of that
> work in the hypervisor. The workaround of telling khugepaged to not waste
> memory in all of the system really feels suboptimal when we know that it's
> only the VM memory of such VMs (with balloon deflation enabled) where we
> would not want to waste memory but still use THPs.
>
> [1] https://issues.redhat.com/browse/RHEL-121177
Right, and it's very sad that we now lose the ability to do so, but rapid
prototyping isn't feasible - I think we're seeing that.

That doesn't mean we can't have BPF for THP. It just means we have to set the
bar CONSIDERABLY higher - whatever interface we provide _has_ to be
future-proofed against any future changes we make to THP in terms of making
things more 'automatic' - and has to provide sufficient power to be useful.

I wonder how easy it will be to figure out such an interface without
accidentally causing ourselves issues down the line.

THP is a special case like that - right now we have very broken interfaces (as
evidenced by users requesting things like the prctl extensions) - and we want to
be able to fix those in the future.

Of course we have to maintain uAPI compatibility, but even the discussion around
mTHP khugepaged and 'eagerness' points to a desire to change how existing
interfaces work - imagine if we had some BPF hook that then ended up needing to
introspect the current max_pte_none, for instance.

So perhaps the answer is that a BPF interface should come later when we have a
better idea of the future of THP?

The whole cgroup vs mm thing again raises old issues about isolation - the
cgroup people reject the idea of THP being a resource that can be managed by
cgroups - so by even allowing a per-memcg thing we're opening that can of worms.

Anyway, overall I don't think this series as-is is really upstreamable.

Maybe we can figure out a read-only introspection hook that makes the fewest
assumptions and can be provided at low risk, that'd help with issues such as the
one you mention, at least in respect of informing what's going on?

That could form the basis of future work towards a hook that actually changes
things?

There's no need to rush.

>
> --
> Cheers
>
> David
Thanks, Lorenzo