From: Yafang Shao <laoar.shao@gmail.com>
To: David Hildenbrand <david@redhat.com>
Cc: akpm@linux-foundation.org, ziy@nvidia.com,
baolin.wang@linux.alibaba.com, lorenzo.stoakes@oracle.com,
Liam.Howlett@oracle.com, npache@redhat.com,
ryan.roberts@arm.com, dev.jain@arm.com, hannes@cmpxchg.org,
usamaarif642@gmail.com, gutierrez.asier@huawei-partners.com,
willy@infradead.org, ast@kernel.org, daniel@iogearbox.net,
andrii@kernel.org, bpf@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
Date: Tue, 27 May 2025 13:46:00 +0800
Message-ID: <CALOAHbBjueZhwrzp81FP-7C7ntEp5Uzaz26o2s=ZukVSmidEOA@mail.gmail.com>
In-Reply-To: <c983ffa8-cd14-47d4-9430-b96acedd989c@redhat.com>
On Mon, May 26, 2025 at 6:49 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 26.05.25 11:37, Yafang Shao wrote:
> > On Mon, May 26, 2025 at 4:14 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> >>> Hi all,
> >>>
> >>> Let’s summarize the current state of the discussion and identify how
> >>> to move forward.
> >>>
> >>> - Global-Only Control is Not Viable
> >>> We all seem to agree that a global-only control for THP is unwise. In
> >>> practice, some workloads benefit from THP while others do not, so a
> >>> one-size-fits-all approach doesn’t work.
> >>>
> >>> - Should We Use "Always" or "Madvise"?
> >>> I suspect no one would choose 'always' in its current state. ;)
> >>
> >> IIRC, RHEL9 has the default set to "always" for a long time.
> >
> > good to know.
> >
> >>
> >> I guess it really depends on how different the workloads are that you
> >> are running on the same machine.
> >
> > Correct. If we want to enable THP for specific workloads without
> > modifying the kernel, we must isolate them on dedicated servers.
> > However, this approach wastes resources and is not an acceptable
> > solution.
> >
> >>
> >>> Both Lorenzo and David propose relying on the madvise mode. However,
> >>> since madvise is an unprivileged userspace mechanism, any user can
> >>> freely adjust their THP policy. This makes fine-grained control
> >>> impossible without breaking userspace compatibility—an undesirable
> >>> tradeoff.
> >>
> >> If required, we could look into a "sealing" mechanism that would
> >> essentially lock modification attempts performed by the process (i.e.,
> >> MADV_HUGEPAGE).
> >
> > If we don’t introduce a new THP mode and instead rely solely on
> > madvise, the "sealing" mechanism could either violate the intended
> > semantics of madvise(), or simply break madvise() entirely, right?
>
> We would have to be a bit careful, yes.
>
> Errors from MADV_HUGEPAGE/MADV_NOHUGEPAGE are often ignored, because
> these options also fail with -EINVAL on kernels without THP support.
>
> Ignoring MADV_NOHUGEPAGE can be problematic with userfaultfd.
>
> What you likely really want to do is seal when you configured
> MADV_NOHUGEPAGE to be the default, and fail MADV_HUGEPAGE later.
>
> >>
> >> That could be added on top of the current proposals that are flying
> >> around, and could be done e.g., per-process.
> >
> > How about introducing a dedicated "process" mode? This would allow
> > each process to use different THP modes—some in "always," others in
> > "madvise," and the rest in "never." Future THP modes could also be
> > added to this framework.
>
> We have to be really careful about not creating even more mess with more
> modes.
>
> How would that design look in detail (how would we set it per
> process, etc.)?
I have a preliminary idea to implement this using BPF. We could define
the API as follows:
struct bpf_thp_ops {
        /**
         * @task_thp_mode: Get the THP mode for a specific task
         *
         * Return:
         * - TASK_THP_ALWAYS:  "always" mode
         * - TASK_THP_MADVISE: "madvise" mode
         * - TASK_THP_NEVER:   "never" mode
         * Future modes can also be added.
         */
        int (*task_thp_mode)(struct task_struct *p);
};
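
To make the intended usage a bit more concrete, here is a rough, untested
sketch of what a policy program implementing this struct_ops might look
like. It assumes the kernel registers bpf_thp_ops as an attachable
struct_ops and exposes the TASK_THP_* values to BPF programs; the "redis"
comm match is nothing more than an illustrative heuristic:

/* SPDX-License-Identifier: GPL-2.0 */
/* Sketch only: bpf_thp_ops and the TASK_THP_* constants come from the
 * proposed kernel changes; they do not exist upstream today.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

SEC("struct_ops/task_thp_mode")
int BPF_PROG(task_thp_mode, struct task_struct *p)
{
        char comm[16];

        /* Copy the task's comm and enable THP only for a known
         * THP-friendly workload; everything else stays in "madvise".
         */
        bpf_probe_read_kernel_str(comm, sizeof(comm), p->comm);
        if (!bpf_strncmp(comm, 5, "redis"))
                return TASK_THP_ALWAYS;
        return TASK_THP_MADVISE;
}

SEC(".struct_ops.link")
struct bpf_thp_ops thp_policy = {
        .task_thp_mode = (void *)task_thp_mode,
};

Userspace would then attach it with bpf_map__attach_struct_ops() (or
"bpftool struct_ops register"), the same way other struct_ops such as
TCP congestion control programs are loaded.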
For observability, we could add a "THP mode" field to
/proc/[pid]/status. For example:
$ grep "THP mode" /proc/123/status
always
$ grep "THP mode" /proc/456/status
madvise
$ grep "THP mode" /proc/789/status
never
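
On the kernel side, emitting that field could be as small as the
following sketch for fs/proc/array.c; get_task_thp_mode() is a made-up
helper standing in for whatever ends up querying the attached BPF
program:

static const char * const thp_mode_names[] = {
        [TASK_THP_ALWAYS]  = "always",
        [TASK_THP_MADVISE] = "madvise",
        [TASK_THP_NEVER]   = "never",
};

/* Sketch: called from proc_pid_status(); get_task_thp_mode() is
 * hypothetical and would return the mode chosen for this task.
 */
static void task_thp_mode_status(struct seq_file *m, struct task_struct *task)
{
        seq_printf(m, "THP mode:\t%s\n", thp_mode_names[get_task_thp_mode(task)]);
}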
The THP mode for each task would be determined by the attached BPF
program based on the task's attributes. We would place the BPF hook in
appropriate kernel functions. Note that this setting wouldn't be
inherited during fork/exec - the BPF program would make the decision
dynamically for each task.
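
The call sites could stay small: roughly, anywhere the fault or
khugepaged paths consult the global sysfs policy today, they could ask
the hook first. Both helper names below are placeholders, not existing
kernel functions:

/* Placeholder sketch: bpf_hook_task_thp_mode() would run the attached
 * struct_ops program (returning a negative value when none is attached),
 * and we would fall back to the existing global policy otherwise.
 */
static inline int task_thp_mode(struct task_struct *p)
{
        int mode = bpf_hook_task_thp_mode(p);

        if (mode >= 0)
                return mode;                 /* decided by the BPF program */
        return global_thp_default_mode();    /* current sysfs setting */
}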
This approach also enables runtime adjustments to THP modes based on
system-wide conditions, such as memory fragmentation or other
performance overheads. The BPF program could adapt policies
dynamically, optimizing THP behavior in response to changing
workloads.
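
As one (again purely illustrative) example, the program could consult an
array map that a userspace agent keeps updated with a 0-100 fragmentation
score sampled from /proc/buddyinfo or /sys/kernel/debug/extfrag/, and stop
handing out huge pages once fragmentation gets high; the map name and
threshold below are made up:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

/* Assumed map: a userspace agent periodically writes a fragmentation
 * score (0-100) into slot 0.
 */
struct {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, __u64);
} frag_level SEC(".maps");

SEC("struct_ops/task_thp_mode")
int BPF_PROG(task_thp_mode, struct task_struct *p)
{
        __u32 key = 0;
        __u64 *frag = bpf_map_lookup_elem(&frag_level, &key);

        /* Back off system-wide when memory is badly fragmented,
         * otherwise leave the decision to madvise().
         */
        if (frag && *frag > 80)
                return TASK_THP_NEVER;
        return TASK_THP_MADVISE;
}

SEC(".struct_ops.link")
struct bpf_thp_ops frag_aware_policy = {
        .task_thp_mode = (void *)task_thp_mode,
};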
As Liam pointed out in another thread, naming is challenging here -
"process" might not be the most accurate term for this context.
--
Regards
Yafang