linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH bpf] bpf/helpers: Use __GFP_HIGH instead of GFP_ATOMIC in __bpf_async_init()
@ 2025-09-05 23:45 Peilin Ye
  2025-09-05 23:54 ` Peilin Ye
  2025-09-08 17:32 ` Shakeel Butt
  0 siblings, 2 replies; 4+ messages in thread
From: Peilin Ye @ 2025-09-05 23:45 UTC (permalink / raw)
  To: bpf, Alexei Starovoitov, Shakeel Butt
  Cc: Peilin Ye, Johannes Weiner, Tejun Heo, Roman Gushchin,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Kumar Kartikeya Dwivedi, Josh Don, Barret Rhoden, linux-mm

Currently, calling bpf_map_kmalloc_node() from __bpf_async_init() can
cause various locking issues; see the following stack trace (edited for
style) as one example:

...
 [10.011566]  do_raw_spin_lock.cold
 [10.011570]  try_to_wake_up             (5) double-acquiring the same
 [10.011575]  kick_pool                      rq_lock, causing a hardlockup
 [10.011579]  __queue_work
 [10.011582]  queue_work_on
 [10.011585]  kernfs_notify
 [10.011589]  cgroup_file_notify
 [10.011593]  try_charge_memcg           (4) memcg accounting raises an
 [10.011597]  obj_cgroup_charge_pages        MEMCG_MAX event
 [10.011599]  obj_cgroup_charge_account
 [10.011600]  __memcg_slab_post_alloc_hook
 [10.011603]  __kmalloc_node_noprof
...
 [10.011611]  bpf_map_kmalloc_node
 [10.011612]  __bpf_async_init
 [10.011615]  bpf_timer_init             (3) BPF calls bpf_timer_init()
 [10.011617]  bpf_prog_xxxxxxxxxxxxxxxx_fcg_runnable
 [10.011619]  bpf__sched_ext_ops_runnable
 [10.011620]  enqueue_task_scx           (2) BPF runs with rq_lock held
 [10.011622]  enqueue_task
 [10.011626]  ttwu_do_activate
 [10.011629]  sched_ttwu_pending         (1) grabs rq_lock
...

The above was reproduced on bpf-next (b338cf849ec8) by modifying
./tools/sched_ext/scx_flatcg.bpf.c to call bpf_timer_init() during
ops.runnable(), and hacking [1] the memcg accounting code a bit to make
a bpf_timer_init() call much more likely to raise an MEMCG_MAX event.

We have also run into other similar variants (both internally and on
bpf-next), including double-acquiring cgroup_file_kn_lock, the same
worker_pool::lock, etc.

As suggested by Shakeel, fix this by using __GFP_HIGH instead of
GFP_ATOMIC in __bpf_async_init(), so that if try_charge_memcg() raises
an MEMCG_MAX event, we call __memcg_memory_event() with
@allow_spinning=false and skip calling cgroup_file_notify(), in order to
avoid the locking issues described above.

Depends on mm patch "memcg: skip cgroup_file_notify if spinning is not
allowed".  Tested with vmtest.sh (llvm-18, x86-64):

 $ ./test_progs -a '*timer*' -a '*wq*'
...
 Summary: 7/12 PASSED, 0 SKIPPED, 0 FAILED

[1] Making bpf_timer_init() much more likely to raise an MEMCG_MAX event
(gist-only, for brevity):

kernel/bpf/helpers.c:__bpf_async_init():
 -        cb = bpf_map_kmalloc_node(map, size, GFP_ATOMIC, map->numa_node);
 +        cb = bpf_map_kmalloc_node(map, size, GFP_ATOMIC | __GFP_HACK,
 +                                  map->numa_node);

mm/memcontrol.c:try_charge_memcg():
          if (!do_memsw_account() ||
 -            page_counter_try_charge(&memcg->memsw, batch, &counter)) {
 -                if (page_counter_try_charge(&memcg->memory, batch, &counter))
 +            page_counter_try_charge_hack(&memcg->memsw, batch, &counter,
 +                                         gfp_mask & __GFP_HACK)) {
 +                if (page_counter_try_charge_hack(&memcg->memory, batch,
 +                                                 &counter,
 +                                                 gfp_mask & __GFP_HACK))
                          goto done_restock;

mm/page_counter.c:page_counter_try_charge():
 -bool page_counter_try_charge(struct page_counter *counter,
 -                             unsigned long nr_pages,
 -                             struct page_counter **fail)
 +bool page_counter_try_charge_hack(struct page_counter *counter,
 +                                  unsigned long nr_pages,
 +                                  struct page_counter **fail, bool hack)
 {
...
 -                if (new > c->max) {
 +                if (hack || new > c->max) {     // goto failed;
                          atomic_long_sub(nr_pages, &c->usage);

Fixes: b00628b1c7d5 ("bpf: Introduce bpf timers.")
Suggested-by: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Peilin Ye <yepeilin@google.com>
---
 kernel/bpf/helpers.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index b9b0c5fe33f6..508b13c24778 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1274,8 +1274,14 @@ static int __bpf_async_init(struct bpf_async_kern *async, struct bpf_map *map, u
 		goto out;
 	}
 
-	/* allocate hrtimer via map_kmalloc to use memcg accounting */
-	cb = bpf_map_kmalloc_node(map, size, GFP_ATOMIC, map->numa_node);
+	/* Allocate via bpf_map_kmalloc_node() for memcg accounting. Use
+	 * __GFP_HIGH instead of GFP_ATOMIC to avoid calling
+	 * cgroup_file_notify() if an MEMCG_MAX event is raised by
+	 * try_charge_memcg(). This prevents various locking issues, including
+	 * double-acquiring locks that may already be held here (e.g.,
+	 * cgroup_file_kn_lock, rq_lock).
+	 */
+	cb = bpf_map_kmalloc_node(map, size, __GFP_HIGH, map->numa_node);
 	if (!cb) {
 		ret = -ENOMEM;
 		goto out;
-- 
2.51.0.355.g5224444f11-goog


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH bpf] bpf/helpers: Use __GFP_HIGH instead of GFP_ATOMIC in __bpf_async_init()
  2025-09-05 23:45 [PATCH bpf] bpf/helpers: Use __GFP_HIGH instead of GFP_ATOMIC in __bpf_async_init() Peilin Ye
@ 2025-09-05 23:54 ` Peilin Ye
  2025-09-08 17:32 ` Shakeel Butt
  1 sibling, 0 replies; 4+ messages in thread
From: Peilin Ye @ 2025-09-05 23:54 UTC (permalink / raw)
  To: bpf, Alexei Starovoitov, Shakeel Butt
  Cc: Johannes Weiner, Tejun Heo, Roman Gushchin, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Kumar Kartikeya Dwivedi, Josh Don,
	Barret Rhoden, linux-mm

Hi all,

On Fri, Sep 05, 2025 at 11:45:46PM +0000, Peilin Ye wrote:
> Depends on mm patch "memcg: skip cgroup_file_notify if spinning is not
> allowed".

For context, currently the above mm patch is pending at:
https://lore.kernel.org/bpf/20250905201606.66198-1-shakeel.butt@linux.dev/

Thanks,
Peilin Ye



^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH bpf] bpf/helpers: Use __GFP_HIGH instead of GFP_ATOMIC in __bpf_async_init()
  2025-09-05 23:45 [PATCH bpf] bpf/helpers: Use __GFP_HIGH instead of GFP_ATOMIC in __bpf_async_init() Peilin Ye
  2025-09-05 23:54 ` Peilin Ye
@ 2025-09-08 17:32 ` Shakeel Butt
  2025-09-08 20:30   ` Peilin Ye
  1 sibling, 1 reply; 4+ messages in thread
From: Shakeel Butt @ 2025-09-08 17:32 UTC (permalink / raw)
  To: Peilin Ye
  Cc: bpf, Alexei Starovoitov, Johannes Weiner, Tejun Heo,
	Roman Gushchin, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Kumar Kartikeya Dwivedi, Josh Don, Barret Rhoden, linux-mm

On Fri, Sep 05, 2025 at 11:45:46PM +0000, Peilin Ye wrote:
> Currently, calling bpf_map_kmalloc_node() from __bpf_async_init() can
> cause various locking issues; see the following stack trace (edited for
> style) as one example:
> 
> ...
>  [10.011566]  do_raw_spin_lock.cold
>  [10.011570]  try_to_wake_up             (5) double-acquiring the same
>  [10.011575]  kick_pool                      rq_lock, causing a hardlockup
>  [10.011579]  __queue_work
>  [10.011582]  queue_work_on
>  [10.011585]  kernfs_notify
>  [10.011589]  cgroup_file_notify
>  [10.011593]  try_charge_memcg           (4) memcg accounting raises an
>  [10.011597]  obj_cgroup_charge_pages        MEMCG_MAX event
>  [10.011599]  obj_cgroup_charge_account
>  [10.011600]  __memcg_slab_post_alloc_hook
>  [10.011603]  __kmalloc_node_noprof
> ...
>  [10.011611]  bpf_map_kmalloc_node
>  [10.011612]  __bpf_async_init
>  [10.011615]  bpf_timer_init             (3) BPF calls bpf_timer_init()
>  [10.011617]  bpf_prog_xxxxxxxxxxxxxxxx_fcg_runnable
>  [10.011619]  bpf__sched_ext_ops_runnable
>  [10.011620]  enqueue_task_scx           (2) BPF runs with rq_lock held
>  [10.011622]  enqueue_task
>  [10.011626]  ttwu_do_activate
>  [10.011629]  sched_ttwu_pending         (1) grabs rq_lock
> ...
> 
> The above was reproduced on bpf-next (b338cf849ec8) by modifying
> ./tools/sched_ext/scx_flatcg.bpf.c to call bpf_timer_init() during
> ops.runnable(), and hacking [1] the memcg accounting code a bit to make
> a bpf_timer_init() call much more likely to raise an MEMCG_MAX event.
> 
> We have also run into other similar variants (both internally and on
> bpf-next), including double-acquiring cgroup_file_kn_lock, the same
> worker_pool::lock, etc.
> 
> As suggested by Shakeel, fix this by using __GFP_HIGH instead of
> GFP_ATOMIC in __bpf_async_init(), so that if try_charge_memcg() raises
> an MEMCG_MAX event, we call __memcg_memory_event() with
> @allow_spinning=false and skip calling cgroup_file_notify(), in order to
> avoid the locking issues described above.
> 
> Depends on mm patch "memcg: skip cgroup_file_notify if spinning is not
> allowed".  Tested with vmtest.sh (llvm-18, x86-64):
> 
>  $ ./test_progs -a '*timer*' -a '*wq*'
> ...
>  Summary: 7/12 PASSED, 0 SKIPPED, 0 FAILED
> 
> [1] Making bpf_timer_init() much more likely to raise an MEMCG_MAX event
> (gist-only, for brevity):
> 
> kernel/bpf/helpers.c:__bpf_async_init():
>  -        cb = bpf_map_kmalloc_node(map, size, GFP_ATOMIC, map->numa_node);
>  +        cb = bpf_map_kmalloc_node(map, size, GFP_ATOMIC | __GFP_HACK,
>  +                                  map->numa_node);
> 
> mm/memcontrol.c:try_charge_memcg():
>           if (!do_memsw_account() ||
>  -            page_counter_try_charge(&memcg->memsw, batch, &counter)) {
>  -                if (page_counter_try_charge(&memcg->memory, batch, &counter))
>  +            page_counter_try_charge_hack(&memcg->memsw, batch, &counter,
>  +                                         gfp_mask & __GFP_HACK)) {
>  +                if (page_counter_try_charge_hack(&memcg->memory, batch,
>  +                                                 &counter,
>  +                                                 gfp_mask & __GFP_HACK))
>                           goto done_restock;
> 
> mm/page_counter.c:page_counter_try_charge():
>  -bool page_counter_try_charge(struct page_counter *counter,
>  -                             unsigned long nr_pages,
>  -                             struct page_counter **fail)
>  +bool page_counter_try_charge_hack(struct page_counter *counter,
>  +                                  unsigned long nr_pages,
>  +                                  struct page_counter **fail, bool hack)
>  {
> ...
>  -                if (new > c->max) {
>  +                if (hack || new > c->max) {     // goto failed;
>                           atomic_long_sub(nr_pages, &c->usage);
> 
> Fixes: b00628b1c7d5 ("bpf: Introduce bpf timers.")
> Suggested-by: Shakeel Butt <shakeel.butt@linux.dev>
> Signed-off-by: Peilin Ye <yepeilin@google.com>
> ---
>  kernel/bpf/helpers.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index b9b0c5fe33f6..508b13c24778 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -1274,8 +1274,14 @@ static int __bpf_async_init(struct bpf_async_kern *async, struct bpf_map *map, u
>  		goto out;
>  	}
>  
> -	/* allocate hrtimer via map_kmalloc to use memcg accounting */
> -	cb = bpf_map_kmalloc_node(map, size, GFP_ATOMIC, map->numa_node);
> +	/* Allocate via bpf_map_kmalloc_node() for memcg accounting. Use
> +	 * __GFP_HIGH instead of GFP_ATOMIC to avoid calling
> +	 * cgroup_file_notify() if an MEMCG_MAX event is raised by
> +	 * try_charge_memcg(). This prevents various locking issues, including
> +	 * double-acquiring locks that may already be held here (e.g.,
> +	 * cgroup_file_kn_lock, rq_lock).

Too much unnecessary information in the comment. Just mention that we
want nolock allocations and for that we need to remove __GFP_RECLAIM
flags until nolock allocation interfaces are available.


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH bpf] bpf/helpers: Use __GFP_HIGH instead of GFP_ATOMIC in __bpf_async_init()
  2025-09-08 17:32 ` Shakeel Butt
@ 2025-09-08 20:30   ` Peilin Ye
  0 siblings, 0 replies; 4+ messages in thread
From: Peilin Ye @ 2025-09-08 20:30 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: bpf, Alexei Starovoitov, Johannes Weiner, Tejun Heo,
	Roman Gushchin, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Kumar Kartikeya Dwivedi, Josh Don, Barret Rhoden, linux-mm

On Mon, Sep 08, 2025 at 10:32:29AM -0700, Shakeel Butt wrote:
> On Fri, Sep 05, 2025 at 11:45:46PM +0000, Peilin Ye wrote:
> > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> > index b9b0c5fe33f6..508b13c24778 100644
> > --- a/kernel/bpf/helpers.c
> > +++ b/kernel/bpf/helpers.c
> > @@ -1274,8 +1274,14 @@ static int __bpf_async_init(struct bpf_async_kern *async, struct bpf_map *map, u
> >  		goto out;
> >  	}
> >  
> > -	/* allocate hrtimer via map_kmalloc to use memcg accounting */
> > -	cb = bpf_map_kmalloc_node(map, size, GFP_ATOMIC, map->numa_node);
> > +	/* Allocate via bpf_map_kmalloc_node() for memcg accounting. Use
> > +	 * __GFP_HIGH instead of GFP_ATOMIC to avoid calling
> > +	 * cgroup_file_notify() if an MEMCG_MAX event is raised by
> > +	 * try_charge_memcg(). This prevents various locking issues, including
> > +	 * double-acquiring locks that may already be held here (e.g.,
> > +	 * cgroup_file_kn_lock, rq_lock).
> 
> Too much unnecessary information in the comment. Just mention that we
> want nolock allocations and for that we need to remove __GFP_RECLAIM
> flags until nolock allocation interfaces are available.

Got it - I'll reword in v2.

Thanks,
Peilin Ye



^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2025-09-08 20:30 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-09-05 23:45 [PATCH bpf] bpf/helpers: Use __GFP_HIGH instead of GFP_ATOMIC in __bpf_async_init() Peilin Ye
2025-09-05 23:54 ` Peilin Ye
2025-09-08 17:32 ` Shakeel Butt
2025-09-08 20:30   ` Peilin Ye

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox