linux-mm.kvack.org archive mirror
From: "David Hildenbrand (Red Hat)" <david@kernel.org>
To: Yafang Shao <laoar.shao@gmail.com>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Andrii Nakryiko <andrii@kernel.org>,
	Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	Martin KaFai Lau <martin.lau@linux.dev>,
	Eduard <eddyz87@gmail.com>, Song Liu <song@kernel.org>,
	Yonghong Song <yonghong.song@linux.dev>,
	John Fastabend <john.fastabend@gmail.com>,
	KP Singh <kpsingh@kernel.org>,
	Stanislav Fomichev <sdf@fomichev.me>, Hao Luo <haoluo@google.com>,
	Jiri Olsa <jolsa@kernel.org>, Zi Yan <ziy@nvidia.com>,
	Liam Howlett <Liam.Howlett@oracle.com>,
	npache@redhat.com, ryan.roberts@arm.com, dev.jain@arm.com,
	Johannes Weiner <hannes@cmpxchg.org>,
	usamaarif642@gmail.com, gutierrez.asier@huawei-partners.com,
	Matthew Wilcox <willy@infradead.org>,
	Amery Hung <ameryhung@gmail.com>,
	David Rientjes <rientjes@google.com>,
	Jonathan Corbet <corbet@lwn.net>, Barry Song <21cnbao@gmail.com>,
	Shakeel Butt <shakeel.butt@linux.dev>, Tejun Heo <tj@kernel.org>,
	lance.yang@linux.dev, Randy Dunlap <rdunlap@infradead.org>,
	Chris Mason <clm@meta.com>, bpf <bpf@vger.kernel.org>,
	linux-mm <linux-mm@kvack.org>
Subject: Re: [PATCH v12 mm-new 06/10] mm: bpf-thp: add support for global mode
Date: Fri, 28 Nov 2025 09:39:06 +0100
Message-ID: <e52bf30d-e63b-44ed-9808-ee3e612e0ba1@kernel.org>
In-Reply-To: <CALOAHbCR3Y=GCpX8S9CctONO=Emh4RvYAibHU=ZQyLP1s0MOVQ@mail.gmail.com>

On 11/28/25 03:53, Yafang Shao wrote:
> On Thu, Nov 27, 2025 at 7:48 PM David Hildenbrand (Red Hat)
> <david@kernel.org> wrote:

Lorenzo commented on the upstream topic; let me mostly comment on the
other parts:

>>> Attaching st_ops to task_struct or to mm_struct is a can of worms.
>>> With cgroup-bpf we went through painful bugs with lifetime
>>> of cgroup vs bpf, dying cgroups, wq deadlock, etc. All these
>>> problems are behind us. With st_ops in mm_struct it will be more
>>> painful. I'd rather not go that route.
>>
>> That's valuable information, thanks. I would have hoped that per-MM
>> policies would be easier.
> 
> The per-MM approach has a performance advantage over per-MEMCG
> policies. This is because it accesses the policy hook directly via
> 
>    vma->vm_mm->bpf_mm->policy_hook()
> 
> whereas the per-MEMCG method requires a more expensive lookup:
> 
>    memcg = get_mem_cgroup_from_mm(vma->vm_mm);
>    memcg->bpf_memcg->policy_hook();
>
> This lookup could be a concern in a critical path. However, this
> performance issue in the per-MEMCG mode can be mitigated. For
> instance, when a task is added to a new memcg, we can cache the hook
> pointer:
> 
>    task->mm->bpf_mm->policy_hook = memcg->bpf_memcg->policy_hook
> 
> Ultimately, we might still introduce a mm_struct:bpf_mm field to
> provide an efficient interface.

Right, caching is what I would have proposed. I would expect some
headaches with lifetime, but probably nothing unsolvable.


>> Sounds like cgroup-bpf has sorted
>> out most of the mess.
> 
> No, the attach-based cgroup-bpf has proven to be ... a "can of worms"
> in practice ...
>   (I welcome corrections from the BPF maintainers if my assessment is
> inaccurate.)

I don't know what's right or wrong here. Alexei said the "mm_struct"
based one would be a can of worms and that the cgroup-based one
apparently solved these issues ("All these problems are behind us."),
which is why I asked for some clarification. :)

[...]

>>
>> Some of what Yafang might want to achieve could maybe at this point be
>> achieved through the prctl(PR_SET_THP_DISABLE) support, including
>> extensions we recently added [1].
>>
>> Systemd support still seems to be in the works [2] for some of that.
>>
>>
>> [1] https://lwn.net/Articles/1032014/
>> [2] https://github.com/systemd/systemd/pull/39085
> 
> Thank you for sharing this.
> However, BPF-THP is already deployed across our server fleet and both
> our users and my boss are satisfied with it. As such, we are not
> considering a switch. The current solution also offers us a valuable
> opportunity to experiment with additional policies in production.

Just to emphasize: we usually don't add two mechanisms to achieve the 
very same end goal. There really must be something delivering more value 
for us to accept something more complex. Focusing on solving a solved 
problem is not good.

If some company went with a downstream-only approach they might be stuck 
having to maintain that forever.

That's why other companies prefer upstream-first :)


That said, the original reason why I agreed that having bpf for THP can
be valuable is that I see a lot more value for rapid prototyping and
policies once you can actually control, on a per-VMA basis (using vma
size, flags, anon-vma names, etc.), where specific folio orders could be
valuable and where not. But also, possibly, where we would want to waste
memory and where not.

As we speak, I have a customer running into issues [1] with
virtio-balloon discarding pages in a VM and khugepaged undoing part of
that work in the hypervisor. The workaround of telling khugepaged to not
waste memory system-wide really feels suboptimal when we know that it's
only the VM memory of such VMs (with balloon deflation enabled) where we
would not want to waste memory but still use THPs.

[1] https://issues.redhat.com/browse/RHEL-121177

-- 
Cheers

David


