From: Shakeel Butt <shakeel.butt@linux.dev>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: bpf <bpf@vger.kernel.org>, linux-mm <linux-mm@kvack.org>,
	 Vlastimil Babka <vbabka@suse.cz>,
	Harry Yoo <harry.yoo@oracle.com>, Michal Hocko <mhocko@suse.com>,
	 Sebastian Sewior <bigeasy@linutronix.de>,
	Andrii Nakryiko <andrii@kernel.org>,
	 Kumar Kartikeya Dwivedi <memxor@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	 Peter Zijlstra <peterz@infradead.org>,
	Steven Rostedt <rostedt@goodmis.org>,
	 Johannes Weiner <hannes@cmpxchg.org>,
	Matthew Wilcox <willy@infradead.org>
Subject: Re: [PATCH 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
Date: Mon, 5 May 2025 18:24:56 -0700
Message-ID: <d25b6lxjjzi3zqbotlrapx57ukjl7frmyvg2lgx5omos3zqg4m@ukkod2jdmieb>
In-Reply-To: <CAADnVQ+OroM-auGvC7GPzaOUz90zHktF545BC7wRz5s_tW6z4w@mail.gmail.com>

On Mon, May 05, 2025 at 05:49:47PM -0700, Alexei Starovoitov wrote:
> On Mon, May 5, 2025 at 11:46 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Wed, Apr 30, 2025 at 08:27:18PM -0700, Alexei Starovoitov wrote:
> > > From: Alexei Starovoitov <ast@kernel.org>
> > >
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -595,7 +595,13 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
> > >       if (!val)
> > >               return;
> > >
> > > -     cgroup_rstat_updated(memcg->css.cgroup, cpu);
> > > +     /*
> > > +      * If called from NMI via kmalloc_nolock -> memcg_slab_post_alloc_hook
> > > +      * -> obj_cgroup_charge -> mod_memcg_state,
> > > +      * then delay the update.
> > > +      */
> > > +     if (!in_nmi())
> > > +             cgroup_rstat_updated(memcg->css.cgroup, cpu);
> >
> > I don't think we can just ignore cgroup_rstat_updated() for nmi as there
> > is a chance (though very small) that we will lose these stats updates.
> 
> I'm failing to understand why it's an issue.
> Not doing cgroup_rstat_updated() can only cause the updated_next link
> to stay NULL when it should be set,
> but that should be harmless, and no different from the racy check
> that the code is already doing:
> if (data_race(cgroup_rstat_cpu(cgrp, cpu)->updated_next))
>   return;
> 
> Imagine it was !NULL: the code would return,
> but then after preemption something clears it to NULL,
> and here we're skipping a set of updated_next.

cgroup_rstat_updated() places the given cgroup, whose stats were just
modified, on the per-cpu update tree that the read side later flushes
to get up-to-date stats. Not placing it on the update tree means the
read side will not flush the stats cached on that cpu. Someone else in
non-NMI context may happen to put that cgroup on that cpu's update
tree, but there is no guarantee of that.
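
To make the failure mode concrete, here is a toy userspace model of
the update-tree handshake (not the kernel code; all names are made
up). If the NMI path skips the marking step, the flusher never visits
the delta cached on that cpu:

#include <stdbool.h>
#include <stdio.h>

/* toy per-cpu state: a cached stat delta plus an "on the updated
 * tree" marker (stands in for updated_next != NULL) */
struct pcpu {
	long cached_delta;
	bool on_updated_tree;
};

static struct pcpu cpu0;
static long flushed_total;

/* models memcg_rstat_updated(): record the delta, mark the cpu dirty;
 * skip_mark models the proposed in_nmi() bail-out */
static void rstat_updated(long delta, bool skip_mark)
{
	cpu0.cached_delta += delta;
	if (!skip_mark)
		cpu0.on_updated_tree = true;
}

/* models the read-side flush: it only visits marked cpus */
static void rstat_flush(void)
{
	if (!cpu0.on_updated_tree)
		return;		/* never marked: the delta stays invisible */
	flushed_total += cpu0.cached_delta;
	cpu0.cached_delta = 0;
	cpu0.on_updated_tree = false;
}

int main(void)
{
	rstat_updated(42, true);	/* NMI path: update but never mark */
	rstat_flush();
	printf("flushed_total = %ld\n", flushed_total);	/* prints 0 */
	return 0;
}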

> 
> > In addition, memcg_rstat_updated() itself is not reentrant safe, along
> > with a couple of functions leading to it like __mod_memcg_lruvec_state().
> 
> Sure. __mod_memcg_lruvec_state() is not reentrant,
> but it's not an issue for kmalloc_nolock(), since objcg/memcg
> charge/uncharge from slub is not calling it (as far as I can tell).

Without this patch:

__memcg_slab_post_alloc_hook() -> obj_cgroup_charge_account() ->
consume_obj_stock() -> __account_obj_stock() ->
__mod_objcg_mlstate() -> __mod_memcg_lruvec_state()

With this patch:
__memcg_slab_post_alloc_hook() -> obj_cgroup_charge_atomic() ->
obj_cgroup_charge_pages() -> mod_memcg_state() -> __mod_memcg_state()

Besides __mod_memcg_state() not being reentrant safe, we will also be
missing the NR_SLAB_RECLAIMABLE_B and NR_SLAB_UNRECLAIMABLE_B updates
after the patch.
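
To illustrate the reentrancy part: these stat updates are plain
read-modify-writes on per-cpu counters, so an NMI landing between the
load and the store can lose an update. A standalone sketch of the bad
interleaving (toy code, not the kernel implementation):

#include <stdio.h>

static long counter;	/* stands in for a per-cpu vmstat counter */

int main(void)
{
	/* simulate __this_cpu_add(counter, 1) being interrupted by
	 * an NMI that also increments the same counter */
	long tmp = counter;	/* task context: load (tmp == 0) */
	counter = counter + 1;	/* "NMI" runs to completion: counter == 1 */
	counter = tmp + 1;	/* task store overwrites the NMI's update */
	printf("counter = %ld (two increments, one lost)\n", counter);
	return 0;
}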

> 
> >
> > >       statc = this_cpu_ptr(memcg->vmstats_percpu);
> > >       for (; statc; statc = statc->parent) {
> > >               /*
> > > @@ -2895,7 +2901,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
> > >       unsigned long flags;
> > >       bool ret = false;
> > >
> > > -     local_lock_irqsave(&memcg_stock.stock_lock, flags);
> > > +     local_lock_irqsave_check(&memcg_stock.stock_lock, flags);
> > >
> > >       stock = this_cpu_ptr(&memcg_stock);
> > >       if (objcg == READ_ONCE(stock->cached_objcg) && stock->nr_bytes >= nr_bytes) {
> > > @@ -2995,7 +3001,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
> > >       unsigned long flags;
> > >       unsigned int nr_pages = 0;
> > >
> > > -     local_lock_irqsave(&memcg_stock.stock_lock, flags);
> > > +     local_lock_irqsave_check(&memcg_stock.stock_lock, flags);
> > >
> > >       stock = this_cpu_ptr(&memcg_stock);
> > >       if (READ_ONCE(stock->cached_objcg) != objcg) { /* reset if necessary */
> > > @@ -3088,6 +3094,27 @@ static inline size_t obj_full_size(struct kmem_cache *s)
> > >       return s->size + sizeof(struct obj_cgroup *);
> > >  }
> > >
> > > +/*
> > > + * Try subtract from nr_charged_bytes without making it negative
> > > + */
> > > +static bool obj_cgroup_charge_atomic(struct obj_cgroup *objcg, gfp_t flags, size_t sz)
> > > +{
> > > +     size_t old = atomic_read(&objcg->nr_charged_bytes);
> > > +     u32 nr_pages = sz >> PAGE_SHIFT;
> > > +     u32 nr_bytes = sz & (PAGE_SIZE - 1);
> > > +
> > > +     if ((ssize_t)(old - sz) >= 0 &&
> > > +         atomic_cmpxchg(&objcg->nr_charged_bytes, old, old - sz) == old)
> > > +             return true;
> > > +
> > > +     nr_pages++;
> > > +     if (obj_cgroup_charge_pages(objcg, flags, nr_pages))
> > > +             return false;
> > > +
> > > +     atomic_add(PAGE_SIZE - nr_bytes, &objcg->nr_charged_bytes);
> > > +     return true;
> > > +}
> > > +
> > >  bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
> > >                                 gfp_t flags, size_t size, void **p)
> > >  {
> > > @@ -3128,6 +3155,21 @@ bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
> > >                       return false;
> > >       }
> > >
> > > +     if (!gfpflags_allow_spinning(flags)) {
> > > +             if (local_lock_is_locked(&memcg_stock.stock_lock)) {
> > > +                     /*
> > > +                      * Cannot use
> > > +                      * lockdep_assert_held(this_cpu_ptr(&memcg_stock.stock_lock));
> > > +                      * since lockdep might not have been informed yet
> > > +                      * of lock acquisition.
> > > +                      */
> > > +                     return obj_cgroup_charge_atomic(objcg, flags,
> > > +                                                     size * obj_full_size(s));
> >
> > We can not just ignore the stat updates here.
> >
> > > +             } else {
> > > +                     lockdep_assert_not_held(this_cpu_ptr(&memcg_stock.stock_lock));
> > > +             }
> > > +     }
> > > +
> > >       for (i = 0; i < size; i++) {
> > >               slab = virt_to_slab(p[i]);
> > >
> > > @@ -3162,8 +3204,12 @@ bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
> > >  void __memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab,
> > >                           void **p, int objects, struct slabobj_ext *obj_exts)
> > >  {
> > > +     bool lock_held = local_lock_is_locked(&memcg_stock.stock_lock);
> > >       size_t obj_size = obj_full_size(s);
> > >
> > > +     if (likely(!lock_held))
> > > +             lockdep_assert_not_held(this_cpu_ptr(&memcg_stock.stock_lock));
> > > +
> > >       for (int i = 0; i < objects; i++) {
> > >               struct obj_cgroup *objcg;
> > >               unsigned int off;
> > > @@ -3174,8 +3220,12 @@ void __memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab,
> > >                       continue;
> > >
> > >               obj_exts[off].objcg = NULL;
> > > -             refill_obj_stock(objcg, obj_size, true, -obj_size,
> > > -                              slab_pgdat(slab), cache_vmstat_idx(s));
> > > +             if (unlikely(lock_held)) {
> > > +                     atomic_add(obj_size, &objcg->nr_charged_bytes);
> >
> > objcg->nr_charged_bytes is stats ignorant and the relevant stats need to
> > be updated before putting stuff into it.
> 
> I'm not following.
> It's functionally equivalent to refill_obj_stock() without
> __account_obj_stock().
> And the stats are not ignored.
> The next __memcg_slab_free_hook() from a good context will update
> them. It's only a tiny delay in the update.
> I don't see why it's an issue.

For the slab object of size obj_size being freed here, we need to
update the NR_SLAB_RECLAIMABLE_B or NR_SLAB_UNRECLAIMABLE_B stat for
the corresponding objcg by obj_size. If we don't call
__account_obj_stock() here, we lose the context and information needed
to update these stats later.
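
A toy model of what the deferred path loses (made-up names, not
kernel code): the byte count survives in nr_charged_bytes, but the
(pgdat, vmstat index) pair needed for the stat update is gone by the
time anyone could replay it:

#include <stdio.h>

/* toy stand-ins for NR_SLAB_RECLAIMABLE_B / NR_SLAB_UNRECLAIMABLE_B */
enum { RECLAIMABLE_B, UNRECLAIMABLE_B, NR_STATS };
static long stats[NR_STATS];
static long nr_charged_bytes;	/* models objcg->nr_charged_bytes */

/* models refill_obj_stock(..., -obj_size, pgdat, idx): the stat
 * index is known only at free time, from the slab being freed */
static void free_obj_accounted(long obj_size, int vmstat_idx)
{
	nr_charged_bytes += obj_size;
	stats[vmstat_idx] -= obj_size;	/* the __account_obj_stock part */
}

/* models the proposed lock-held path: only the bytes are recorded */
static void free_obj_deferred(long obj_size)
{
	nr_charged_bytes += obj_size;
	/* vmstat_idx and pgdat are gone: nothing left to replay */
}

int main(void)
{
	free_obj_accounted(512, UNRECLAIMABLE_B);
	free_obj_deferred(512);		/* stat now overstates usage by 512 */
	printf("unreclaimable bytes: %ld (should be -1024)\n",
	       stats[UNRECLAIMABLE_B]);
	return 0;
}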

> 
> > > +             } else {
> > > +                     refill_obj_stock(objcg, obj_size, true, -obj_size,
> > > +                                      slab_pgdat(slab), cache_vmstat_idx(s));
> > > +             }
> > >               obj_cgroup_put(objcg);
> > >       }
> > >  }
> >
> > I am actually working on making this whole call chain (i.e.
> > kmalloc/kmem_cache_alloc to memcg [un]charging) reentrant/nmi safe.
> 
> Thank you for working on it!
> You mean this set:
> https://lore.kernel.org/all/20250429061211.1295443-1-shakeel.butt@linux.dev/
> ?
> it's making css_rstat_updated() re-entrant,
> which is a renamed/reworked version of memcg_rstat_updated().
> That's good, but not enough from the slub pov.
> It removes the need for the first hunk in this patch from mm/memcontrol.c
> + if (!in_nmi())
> +               cgroup_rstat_updated(...);
> 
> but hunks in __memcg_slab_post_alloc_hook() and __memcg_slab_free_hook()
> are still needed.
> And I think the obj_cgroup_charge_atomic() approach in this patch is correct.
> The delay in rstat update seems fine.
> Please help me understand what I'm missing.
> 

css_rstat_updated() is the new name of cgroup_rstat_updated(), and it
is only one piece of the puzzle. My plan is to make the memcg stats
reentrant, which would allow calling __account_obj_stock() (or
whatever its new name will be) in NMI context; consume_obj_stock() and
refill_obj_stock() would then work very similarly to consume_stock()
and refill_stock().

Please give me a couple of days and I can share the full RFC of the
memcg side.


