From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
To: Vlastimil Babka <vbabka@suse.cz>
Cc: bpf <bpf@vger.kernel.org>, Andrii Nakryiko <andrii@kernel.org>,
	 Kumar Kartikeya Dwivedi <memxor@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	 Peter Zijlstra <peterz@infradead.org>,
	Sebastian Sewior <bigeasy@linutronix.de>,
	 Steven Rostedt <rostedt@goodmis.org>,
	Hou Tao <houtao1@huawei.com>,
	 Johannes Weiner <hannes@cmpxchg.org>,
	Shakeel Butt <shakeel.butt@linux.dev>,
	 Michal Hocko <mhocko@suse.com>,
	Matthew Wilcox <willy@infradead.org>,
	 Thomas Gleixner <tglx@linutronix.de>,
	Jann Horn <jannh@google.com>, Tejun Heo <tj@kernel.org>,
	 linux-mm <linux-mm@kvack.org>, Kernel Team <kernel-team@fb.com>
Subject: Re: [PATCH bpf-next v8 6/6] bpf: Use try_alloc_pages() to allocate pages for bpf needs.
Date: Tue, 18 Feb 2025 18:38:24 -0800	[thread overview]
Message-ID: <CAADnVQ+tAWwfO5tv+aW0KUs-cz559vN8V6TCzhmtDMFxoEewRg@mail.gmail.com> (raw)
In-Reply-To: <fb983185-a577-405e-8fb4-b506d894cec5@suse.cz>

On Tue, Feb 18, 2025 at 7:36 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 2/13/25 04:35, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > Use try_alloc_pages() and free_pages_nolock() for BPF needs
> > when context doesn't allow using normal alloc_pages.
> > This is a prerequisite for further work.
> >
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > ---
> >  include/linux/bpf.h  |  2 +-
> >  kernel/bpf/arena.c   |  5 ++---
> >  kernel/bpf/syscall.c | 23 ++++++++++++++++++++---
> >  3 files changed, 23 insertions(+), 7 deletions(-)
> >
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index f3f50e29d639..e1838a341817 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -2348,7 +2348,7 @@ int  generic_map_delete_batch(struct bpf_map *map,
> >  struct bpf_map *bpf_map_get_curr_or_next(u32 *id);
> >  struct bpf_prog *bpf_prog_get_curr_or_next(u32 *id);
> >
> > -int bpf_map_alloc_pages(const struct bpf_map *map, gfp_t gfp, int nid,
> > +int bpf_map_alloc_pages(const struct bpf_map *map, int nid,
> >                       unsigned long nr_pages, struct page **page_array);
> >  #ifdef CONFIG_MEMCG
> >  void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size, gfp_t flags,
> > diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
> > index 0975d7f22544..8ecc62e6b1a2 100644
> > --- a/kernel/bpf/arena.c
> > +++ b/kernel/bpf/arena.c
> > @@ -287,7 +287,7 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf)
> >               return VM_FAULT_SIGSEGV;
> >
> >       /* Account into memcg of the process that created bpf_arena */
> > -     ret = bpf_map_alloc_pages(map, GFP_KERNEL | __GFP_ZERO, NUMA_NO_NODE, 1, &page);
> > +     ret = bpf_map_alloc_pages(map, NUMA_NO_NODE, 1, &page);
> >       if (ret) {
> >               range_tree_set(&arena->rt, vmf->pgoff, 1);
> >               return VM_FAULT_SIGSEGV;
> > @@ -465,8 +465,7 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
> >       if (ret)
> >               goto out_free_pages;
> >
> > -     ret = bpf_map_alloc_pages(&arena->map, GFP_KERNEL | __GFP_ZERO,
> > -                               node_id, page_cnt, pages);
> > +     ret = bpf_map_alloc_pages(&arena->map, node_id, page_cnt, pages);
> >       if (ret)
> >               goto out;
> >
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index c420edbfb7c8..a7af8d0185d0 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -569,7 +569,24 @@ static void bpf_map_release_memcg(struct bpf_map *map)
> >  }
> >  #endif
> >
> > -int bpf_map_alloc_pages(const struct bpf_map *map, gfp_t gfp, int nid,
> > +static bool can_alloc_pages(void)
> > +{
> > +     return preempt_count() == 0 && !irqs_disabled() &&
> > +             !IS_ENABLED(CONFIG_PREEMPT_RT);
> > +}
> > +
>
> I see this is new since v6 and wasn't yet discussed (or I missed it?)

It was in v1:
https://lore.kernel.org/bpf/20241116014854.55141-1-alexei.starovoitov@gmail.com/
See Peter's comments.
In this version I open-coded preemptible(), since that is more accurate,
and disabled the detection on PREEMPT_RT.
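
To make that concrete, here is roughly how the check is meant to gate
the allocation path. This is a sketch only; the helper name and the
exact gfp flags below are illustrative and not necessarily what the
final hunk in syscall.c does:

	/*
	 * Sketch: fall back to the opportunistic allocator when the
	 * context doesn't look safe for a normal GFP_KERNEL allocation.
	 */
	static struct page *bpf_alloc_one_page(int nid)
	{
		if (!can_alloc_pages())
			/* any-context path; may fail more often */
			return try_alloc_pages(nid, 0);

		return alloc_pages_node(nid, GFP_KERNEL | __GFP_ACCOUNT |
					__GFP_ZERO | __GFP_NOWARN, 0);
	}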

> I wonder how reliable these preempt/irq_disabled checks are for correctness
> purposes, e.g. we don't have CONFIG_PREEMPT_COUNT enabled always?

I believe the above doesn't produce false positives.
It's not exhaustive and might change as we learn more and tune it.
Hence I moved it to be bpf-specific, so we can iterate quickly
instead of having it in linux/gfp.h, also considering Sebastian's
comment that normal kernel code should know its calling context.

> As long
> as the callers of bpf_map_alloc_pages() know the context and pass gfp
> accordingly, can't we use i.e. gfpflags_allow_blocking() to determine if
> try_alloc_pages() should be used or not?

bpf infra has very coarse knowledge of the context.
There are only two categories: sleepable or not.
In sleepable context GFP_KERNEL is allowed, but that is a narrow
case and represents a tiny slice of use cases compared to
non-sleepable. try_alloc_pages() is for the latter.
netconsole has a similar problem/challenge: it doesn't know the
context it will be called from.
Currently it's just doing GFP_ATOMIC and praying.
This is something to fix eventually, once slab is taught about
gfpflags_allow_blocking.
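
For comparison, the gfp-driven dispatch would look roughly like the
sketch below, but it only helps if every caller actually knows its
context and passes gfp accordingly, which bpf call sites mostly don't.
This is hypothetical; the helper name is made up:

	static struct page *bpf_alloc_one_page_gfp(gfp_t gfp, int nid)
	{
		/* Only valid if the caller's gfp really reflects its context */
		if (!gfpflags_allow_blocking(gfp))
			return try_alloc_pages(nid, 0);

		return alloc_pages_node(nid, gfp | __GFP_ACCOUNT |
					__GFP_NOWARN, 0);
	}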


Thread overview: 18+ messages
2025-02-13  3:35 [PATCH bpf-next v8 0/6] bpf, mm: Introduce try_alloc_pages() Alexei Starovoitov
2025-02-13  3:35 ` [PATCH bpf-next v8 1/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation Alexei Starovoitov
2025-02-13  3:35 ` [PATCH bpf-next v8 2/6] mm, bpf: Introduce free_pages_nolock() Alexei Starovoitov
2025-02-13  3:35 ` [PATCH bpf-next v8 3/6] locking/local_lock: Introduce localtry_lock_t Alexei Starovoitov
2025-02-13 15:03   ` Vlastimil Babka
2025-02-13 15:23     ` Alexei Starovoitov
2025-02-13 15:28       ` Steven Rostedt
2025-02-14 12:15       ` Vlastimil Babka
2025-02-14 12:11   ` Vlastimil Babka
2025-02-14 18:32     ` Alexei Starovoitov
2025-02-14 18:48       ` Vlastimil Babka
2025-02-17 15:17         ` Sebastian Sewior
2025-02-18 15:17   ` Vlastimil Babka
2025-02-13  3:35 ` [PATCH bpf-next v8 4/6] memcg: Use trylock to access memcg stock_lock Alexei Starovoitov
2025-02-13  3:35 ` [PATCH bpf-next v8 5/6] mm, bpf: Use memcg in try_alloc_pages() Alexei Starovoitov
2025-02-13  3:35 ` [PATCH bpf-next v8 6/6] bpf: Use try_alloc_pages() to allocate pages for bpf needs Alexei Starovoitov
2025-02-18 15:36   ` Vlastimil Babka
2025-02-19  2:38     ` Alexei Starovoitov [this message]
