From: "David Hildenbrand (Red Hat)" <david@kernel.org>
To: Yafang Shao <laoar.shao@gmail.com>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>,
Andrew Morton <akpm@linux-foundation.org>,
Alexei Starovoitov <ast@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
Andrii Nakryiko <andrii@kernel.org>,
Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
Martin KaFai Lau <martin.lau@linux.dev>,
Eduard <eddyz87@gmail.com>, Song Liu <song@kernel.org>,
Yonghong Song <yonghong.song@linux.dev>,
John Fastabend <john.fastabend@gmail.com>,
KP Singh <kpsingh@kernel.org>,
Stanislav Fomichev <sdf@fomichev.me>, Hao Luo <haoluo@google.com>,
Jiri Olsa <jolsa@kernel.org>, Zi Yan <ziy@nvidia.com>,
Liam Howlett <Liam.Howlett@oracle.com>,
npache@redhat.com, ryan.roberts@arm.com, dev.jain@arm.com,
Johannes Weiner <hannes@cmpxchg.org>,
usamaarif642@gmail.com, gutierrez.asier@huawei-partners.com,
Matthew Wilcox <willy@infradead.org>,
Amery Hung <ameryhung@gmail.com>,
David Rientjes <rientjes@google.com>,
Jonathan Corbet <corbet@lwn.net>, Barry Song <21cnbao@gmail.com>,
Shakeel Butt <shakeel.butt@linux.dev>, Tejun Heo <tj@kernel.org>,
lance.yang@linux.dev, Randy Dunlap <rdunlap@infradead.org>,
Chris Mason <clm@meta.com>, bpf <bpf@vger.kernel.org>,
linux-mm <linux-mm@kvack.org>
Subject: Re: [PATCH v12 mm-new 06/10] mm: bpf-thp: add support for global mode
Date: Fri, 28 Nov 2025 09:39:06 +0100
Message-ID: <e52bf30d-e63b-44ed-9808-ee3e612e0ba1@kernel.org>
In-Reply-To: <CALOAHbCR3Y=GCpX8S9CctONO=Emh4RvYAibHU=ZQyLP1s0MOVQ@mail.gmail.com>
On 11/28/25 03:53, Yafang Shao wrote:
> On Thu, Nov 27, 2025 at 7:48 PM David Hildenbrand (Red Hat)
> <david@kernel.org> wrote:
Lorenzo commented on the upstream topic, let me mostly comment on the
other parts:
>>> Attaching st_ops to task_struct or to mm_struct is a can of worms.
>>> With cgroup-bpf we went through painful bugs with lifetime
>>> of cgroup vs bpf, dying cgroups, wq deadlock, etc. All these
>>> problems are behind us. With st_ops in mm_struct it will be more
>>> painful. I'd rather not go that route.
>>
>> That's valuable information, thanks. I would have hoped that per-MM
>> policies would be easier.
>
> The per-MM approach has a performance advantage over per-MEMCG
> policies. This is because it accesses the policy hook directly via
>
> vma->vm_mm->bpf_mm->policy_hook()
>
> whereas the per-MEMCG method requires a more expensive lookup:
>
> memcg = get_mem_cgroup_from_mm(vma->vm_mm);
> memcg->bpf_memcg->policy_hook();
>
> This lookup could be a concern in a critical path. However, this
> performance issue in the per-MEMCG mode can be mitigated. For
> instance, when a task is added to a new memcg, we can cache the hook
> pointer:
>
> task->mm->bpf_mm->policy_hook = memcg->bpf_memcg->policy_hook
>
> Ultimately, we might still introduce a mm_struct:bpf_mm field to
> provide an efficient interface.
Right, caching is what I would have proposed. I would expect some
headaches with lifetime, but probably nothing unsolvable.
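
To make the caching idea concrete, here is a minimal userspace sketch in
C. It is only a model under stated assumptions: the struct layouts, the
hook signature (a bitmask of orders in, a filtered bitmask out) and the
helpers lookup_hook_slow(), mm_attach_memcg(), mm_allowed_orders() and
policy_order0_only() are all hypothetical and are not taken from the
patch series or from the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; these are NOT the kernel's structures. */
typedef unsigned long (*policy_hook_t)(unsigned long orders);

struct bpf_memcg { policy_hook_t policy_hook; };
struct memcg     { struct bpf_memcg *bpf_memcg; };
struct bpf_mm    { policy_hook_t policy_hook; };	/* cached copy of the hook */
struct mm        { struct memcg *memcg; struct bpf_mm *bpf_mm; };

/* Example policy (made up): keep only bit 0, i.e. no large folios. */
unsigned long policy_order0_only(unsigned long orders)
{
	return orders & 1UL;
}

/* Slow path: resolve the memcg first, then its hook. */
policy_hook_t lookup_hook_slow(const struct mm *mm)
{
	/* Stands in for get_mem_cgroup_from_mm() in the quoted mail. */
	const struct memcg *memcg = mm->memcg;

	if (!memcg || !memcg->bpf_memcg)
		return NULL;
	return memcg->bpf_memcg->policy_hook;
}

/* Models "a task is added to a new memcg": refresh the per-mm cache once. */
void mm_attach_memcg(struct mm *mm, struct memcg *memcg)
{
	assert(mm && mm->bpf_mm);
	mm->memcg = memcg;
	mm->bpf_mm->policy_hook = lookup_hook_slow(mm);
}

/* Fast path: one pointer chase through the cache, no memcg lookup. */
unsigned long mm_allowed_orders(const struct mm *mm, unsigned long orders)
{
	policy_hook_t hook = mm->bpf_mm->policy_hook;

	/* No policy attached: leave the requested orders untouched. */
	return hook ? hook(orders) : orders;
}
```

The sketch deliberately leaves out the hard part: once a policy is
detached from (or replaced on) the memcg, every cached copy in a bpf_mm
has to be invalidated or refreshed, which is where the lifetime
questions come from.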
>> Sounds like cgroup-bpf has sorted
>> out most of the mess.
>
> No, the attach-based cgroup-bpf has proven to be ... a "can of worms"
> in practice ...
> (I welcome corrections from the BPF maintainers if my assessment is
> inaccurate.)
I don't know what's right or wrong here. Alexei said the "mm_struct"
based one would be a can of worms and that the cgroup-based one
apparently solved these issues ("All these problems are behind us."),
which is why I asked for some clarification. :)
[...]
>>
>> Some of what Yafang might want to achieve could maybe at this point be
>> achieved through the prctl(PR_SET_THP_DISABLE) support, including
>> extensions we recently added [1].
>>
>> Systemd support still seems to be in the works [2] for some of that.
>>
>>
>> [1] https://lwn.net/Articles/1032014/
>> [2] https://github.com/systemd/systemd/pull/39085
>
> Thank you for sharing this.
> However, BPF-THP is already deployed across our server fleet and both
> our users and my boss are satisfied with it. As such, we are not
> considering a switch. The current solution also offers us a valuable
> opportunity to experiment with additional policies in production.
Just to emphasize: we usually don't add two mechanisms to achieve the
very same end goal. There really must be something delivering more value
for us to accept something more complex. Focusing on solving a solved
problem is not good.
If some company went with a downstream-only approach they might be stuck
having to maintain that forever.
That's why other companies prefer upstream-first :)
Having said that, the original reason why I agreed that having bpf for
THP can be valuable is that I see a lot more value for rapid prototyping
and policies once you can actually control, on a per-VMA basis (using VMA
size, flags, anon-vma names etc.), where specific folio orders could be
valuable and where not. But also, possibly, where we would want to waste
memory and where not.
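
As a rough illustration of what such a per-VMA decision could look like,
here is a small userspace sketch in C. Everything in it is an
assumption made for the example: struct vma_info, the function
vma_policy_orders(), the "scratch:" name prefix and the two rules are
hypothetical, and PMD_ORDER 9 assumes 4 KiB base pages with 2 MiB PMD
mappings. The orders are modelled as a bitmask where bit n means
"order n is allowed".

```c
#include <stddef.h>
#include <string.h>

/* Assumption: 4 KiB base pages, so a PMD-sized folio is order 9 (2 MiB). */
#define PMD_ORDER	9
#define PMD_SIZE	((size_t)4096 << PMD_ORDER)

/* Hypothetical, simplified view of a VMA for the purpose of this sketch. */
struct vma_info {
	size_t size;		/* length of the mapping in bytes */
	const char *anon_name;	/* anon-vma name, NULL if unnamed */
};

/*
 * Filter the candidate orders for one VMA.
 * Bit n set in 'orders' means order-n folios are allowed.
 */
unsigned long vma_policy_orders(const struct vma_info *vma,
				unsigned long orders)
{
	/* Made-up rule 1: VMAs tagged "scratch:*" get no large folios. */
	if (vma->anon_name && strncmp(vma->anon_name, "scratch:", 8) == 0)
		return 0;

	/* Made-up rule 2: a VMA smaller than a PMD cannot hold a PMD folio. */
	if (vma->size < PMD_SIZE)
		orders &= ~(1UL << PMD_ORDER);

	return orders;
}
```

With this, a 64 KiB unnamed VMA offered orders 4 and 9 would keep only
order 4, while a 4 MiB unnamed VMA would keep both.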
As we speak, I have a customer running into issues [1] with
virtio-balloon discarding pages in a VM and khugepaged undoing part of
that work in the hypervisor. The workaround of telling khugepaged to not
waste memory anywhere in the system really feels suboptimal when we know
that it's only the VM memory of such VMs (with balloon deflation
enabled) where we would not want to waste memory but still use THPs.
[1] https://issues.redhat.com/browse/RHEL-121177
--
Cheers
David