From: Mateusz Guzik <mjguzik@gmail.com>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
bpf <bpf@vger.kernel.org>, Andrii Nakryiko <andrii@kernel.org>,
Kumar Kartikeya Dwivedi <memxor@gmail.com>,
Peter Zijlstra <peterz@infradead.org>,
Vlastimil Babka <vbabka@suse.cz>,
Sebastian Sewior <bigeasy@linutronix.de>,
Steven Rostedt <rostedt@goodmis.org>,
Hou Tao <houtao1@huawei.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Shakeel Butt <shakeel.butt@linux.dev>,
Michal Hocko <mhocko@suse.com>,
Matthew Wilcox <willy@infradead.org>,
Thomas Gleixner <tglx@linutronix.de>,
Jann Horn <jannh@google.com>, Tejun Heo <tj@kernel.org>,
linux-mm <linux-mm@kvack.org>, Kernel Team <kernel-team@fb.com>
Subject: Re: [PATCH bpf-next v9 2/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
Date: Tue, 11 Mar 2025 19:04:47 +0100
Message-ID: <rbfpuj6kmbwbzyd25tjhlrf4aytmhmegn5ez54rpb2mue3cxyk@ok46lkhvvfjt>
In-Reply-To: <CAADnVQJsYcMfn4XjAtgo9gHsiUs-BX-PEyi1oPHy5_gEuWKHFQ@mail.gmail.com>
On Tue, Mar 11, 2025 at 02:32:24PM +0100, Alexei Starovoitov wrote:
> On Tue, Mar 11, 2025 at 3:04 AM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > On Fri, 21 Feb 2025 18:44:23 -0800 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >
> > > Tracing BPF programs execute from tracepoints and kprobes where
> > > running context is unknown, but they need to request additional
> > > memory. The prior workarounds were using pre-allocated memory and
> > > BPF specific freelists to satisfy such allocation requests.
> >
> > The "prior workarounds" sound entirely appropriate. Because the
> > performance and maintainability of Linux's page allocator is about
> > 1,000,040 times more important than relieving BPF of having to carry a
> > "workaround".
>
> Please explain where performance and maintainability is affected?
>
I have some related questions below. Note I'm a bystander, not claiming
to have any (N)ACK power.
A small bit before that:
        if (!spin_trylock_irqsave(&zone->lock, flags)) {
                if (unlikely(alloc_flags & ALLOC_TRYLOCK))
                        return NULL;
                spin_lock_irqsave(&zone->lock, flags);
        }
This is going to perform worse under contention due to an extra access
to the lock. I presume it was done this way to avoid paying for another
branch, on the assumption that the trylock normally succeeds.
As it happens, I'm looking at parallel exec on the side, and while there
is currently bigger fish to fry, contention on zone->lock is very much a
factor. The majority of it comes from RCU freeing (in
free_pcppages_bulk()), but I also see several rmqueue() calls below it
in the profile. As those now trylock first, they are going to make it
more expensive for free_pcppages_bulk() to even get the lock.
So this *is* contended, but at the moment it is largely overshadowed by
bigger problems (which someone(tm) will hopefully sort out sooner rather
than later).
Should this land, I expect someone is going to hoist the trylock at some
point in the future. If this was my patch I would just do it now, but I
understand that may result in new people showing up and complaining.
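For illustration, my reading of "hoisting" here is to check the flag
first so the regular path does not touch the lock word twice when the
lock happens to be taken; something along these lines (completely
untested sketch, not a patch):

        if (likely(!(alloc_flags & ALLOC_TRYLOCK)))
                spin_lock_irqsave(&zone->lock, flags);
        else if (!spin_trylock_irqsave(&zone->lock, flags))
                return NULL;

This adds a (highly predictable) branch on alloc_flags to every call,
but it spares the regular path the failed trylock's extra access to the
lock under contention.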
> As far as motivation, if I recall correctly, you were present in
> the room when Vlastimil presented the next steps for SLUB at
> LSFMM back in May of last year.
> A link to memory refresher is in the commit log:
> https://lwn.net/Articles/974138/
>
> Back then he talked about a bunch of reasons including better
> maintainability of the kernel overall, but what stood out to me
> as the main reason to use SLUB for bpf, objpool, mempool,
> and networking needs is prevention of memory waste.
> All these wrappers of slub pin memory that should be shared.
> bpf, objpool, mempools should be good citizens of the kernel
> instead of stealing the memory. That's the core job of the
> kernel. To share resources. Memory is one such resource.
>
I suspect the worry is that the added behavior may complicate things
down the road (or even prevent some optimizations) -- there is now
another execution context to worry about.
I think it would help to outline why these programs are doing any memory
allocation from something like NMI context to begin with. Perhaps a very
small piece of memory could be carved out as a reserve just for that,
refilled as needed from process context?
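Roughly what I have in mind, as a hand-wavy sketch (all names are
invented and synchronization between the NMI consumer and the refill is
glossed over -- a real version would need a lockless scheme):

        /* Tiny per-CPU stash, consumed from any context, refilled from
         * process context. Purely illustrative. */
        struct page_reserve {
                struct page *pages[4];
                unsigned int nr;
        };
        static DEFINE_PER_CPU(struct page_reserve, page_reserve);

        /* Hot path (tracepoint/kprobe/NMI): only pops from the local
         * stash, never calls into the page allocator. */
        static struct page *reserve_pop(void)
        {
                struct page_reserve *r = this_cpu_ptr(&page_reserve);

                return r->nr ? r->pages[--r->nr] : NULL;
        }

        /* Process context, assumed to run pinned to the CPU it refills
         * (e.g. a per-CPU worker). */
        static void reserve_refill(void)
        {
                struct page_reserve *r = this_cpu_ptr(&page_reserve);

                while (r->nr < ARRAY_SIZE(r->pages)) {
                        struct page *page = alloc_page(GFP_KERNEL);

                        if (!page)
                                break;
                        r->pages[r->nr++] = page;
                }
        }

The point being that the hot path never needs the allocator proper, only
a handful of pages sitting in reserve per CPU.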
A general remark: supporting an arbitrary running context in core
primitives artificially limits what can be done to optimize them for
their most common users. IMO the sheaves patchset is a bit of an
admission of that (also see below). It may be that the page-getting
routine will get there as well.
If non-task-context memory allocations were kicked to the curb, or at
least heavily limited, then a small allocator dedicated to that purpose
would do the trick, and the two variants would likely be simpler than
one thing which supports everyone. This patchset is a step in the
opposite direction, but maybe there is a good reason for that.
To my understanding, ebpf can be used either to run "real" code that
does something, or to "merely" collect data. I presume the former case
already runs from a context where it can allocate memory without a
problem.
For the latter, I presume ebpf has conflicting goals:
1. Do not disrupt the workload under observation (CPU and RAM). To that
end, small memory usage limits are in place; otherwise a carelessly
written aggregation can OOM the box (say someone wants to know which
files get opened the most and aggregates on names, while a malicious
user keeps opening and unlinking autogenerated names, endlessly growing
the list if you let it).
2. Actually log stuff even if resources are scarce. To that end, I would
expect a small area to be pre-allocated and periodically drained.
All of this further puts a question mark, for me, on general allocation
from NMI context.
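To make point 2 concrete, the kind of thing I would expect (again, names
invented, not ebpf's actual machinery, and synchronizing the drain with
in-flight writers is glossed over):

        /* Purely illustrative: a fixed-size log area. The hot path only
         * writes into pre-allocated slots and drops events when full; a
         * periodic worker drains it. */
        #define LOG_SLOTS 128

        struct fixed_log {
                unsigned long entries[LOG_SLOTS];
                atomic_t head;          /* next slot to claim */
        };

        /* Hot path (tracepoint/kprobe/NMI): never allocates. */
        static bool log_event(struct fixed_log *log, unsigned long val)
        {
                int idx = atomic_fetch_inc(&log->head);

                if (idx >= LOG_SLOTS)
                        return false;   /* full: drop the event */
                log->entries[idx] = val;
                return true;
        }

        /* Process context, run periodically: consume and reset. */
        static void log_drain(struct fixed_log *log,
                              void (*consume)(unsigned long))
        {
                int i, n = min(atomic_read(&log->head), LOG_SLOTS);

                for (i = 0; i < n; i++)
                        consume(log->entries[i]);
                atomic_set(&log->head, 0);
        }

With something of this sort in place, running out of slots means dropped
events, not an allocation from NMI context.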
All that said, the cover letter:
> The main motivation is to make alloc page and slab reentrant and
> remove bpf_mem_alloc.
does not justify why ebpf performs allocations in a manner which
warrants any of this, which I suspect is what Andrew was asking about.
I never put any effort into ebpf -- it may be that all of the above
makes excellent sense. But then you need to make the case to the people
maintaining the code.