linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Harry Yoo <harry.yoo@oracle.com>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>,
	Andrew Morton <akpm@linux-foundation.org>,
	Christoph Lameter <cl@gentwo.org>,
	David Rientjes <rientjes@google.com>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Alexei Starovoitov <ast@kernel.org>,
	linux-mm <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v2] slab: fix slab accounting imbalance due to defer_deactivate_slab()
Date: Fri, 24 Oct 2025 11:03:29 +0900	[thread overview]
Message-ID: <aPrecUasNUbEkLlS@hyeyoo> (raw)
In-Reply-To: <CAADnVQK+3GLbq4GjOYO0Q6vhURPyNyy70bZKUUwRpLuK-R8NAA@mail.gmail.com>

On Thu, Oct 23, 2025 at 06:17:19PM -0700, Alexei Starovoitov wrote:
> On Thu, Oct 23, 2025 at 5:00 PM Harry Yoo <harry.yoo@oracle.com> wrote:
> >
> > On Thu, Oct 23, 2025 at 04:13:37PM -0700, Alexei Starovoitov wrote:
> > > On Thu, Oct 23, 2025 at 5:01 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> > > >
> > > > Since commit af92793e52c3 ("slab: Introduce kmalloc_nolock() and
> > > > kfree_nolock().") there's a possibility in alloc_single_from_new_slab()
> > > > that we discard the newly allocated slab if we can't spin and we fail to
> > > > trylock. As a result we don't perform inc_slabs_node() later in the
> > > > function. Instead we perform a deferred deactivate_slab() which can
> > > > either put the unacounted slab on partial list, or discard it
> > > > immediately while performing dec_slabs_node(). Either way will cause an
> > > > accounting imbalance.
> > > >
> > > > Fix this by not marking the slab as frozen, and using free_slab()
> > > > instead of deactivate_slab() for non-frozen slabs in
> > > > free_deferred_objects(). For CONFIG_SLUB_TINY, that's the only possible
> > > > case. By not using discard_slab() we avoid dec_slabs_node().
> > > >
> > > > Fixes: af92793e52c3 ("slab: Introduce kmalloc_nolock() and kfree_nolock().")
> > > > Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> > > > ---
> > > > Changes in v2:
> > > > - Fix the problem differently. Harry pointed out that we can't move
> > > >   inc_slabs_node() outside of list_lock protected regions as that would
> > > >   reintroduce issues fixed by commit c7323a5ad078
> > > > - Link to v1: https://patch.msgid.link/20251022-fix-slab-accounting-v1-1-27870ec363ce@suse.cz
> > > > ---
> > > >  mm/slub.c | 8 +++++---
> > > >  1 file changed, 5 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/mm/slub.c b/mm/slub.c
> > > > index 23d8f54e9486..87a1d2f9de0d 100644
> > > > --- a/mm/slub.c
> > > > +++ b/mm/slub.c
> > > > @@ -3422,7 +3422,6 @@ static void *alloc_single_from_new_slab(struct kmem_cache *s, struct slab *slab,
> > > >
> > > >         if (!allow_spin && !spin_trylock_irqsave(&n->list_lock, flags)) {
> > > >                 /* Unlucky, discard newly allocated slab */
> > > > -               slab->frozen = 1;
> > > >                 defer_deactivate_slab(slab, NULL);
> > > >                 return NULL;
> > > >         }
> > > > @@ -6471,9 +6470,12 @@ static void free_deferred_objects(struct irq_work *work)
> > > >                 struct slab *slab = container_of(pos, struct slab, llnode);
> > > >
> > > >  #ifdef CONFIG_SLUB_TINY
> > > > -               discard_slab(slab->slab_cache, slab);
> > > > +               free_slab(slab->slab_cache, slab);
> > > >  #else
> > > > -               deactivate_slab(slab->slab_cache, slab, slab->flush_freelist);
> > > > +               if (slab->frozen)
> > > > +                       deactivate_slab(slab->slab_cache, slab, slab->flush_freelist);
> > > > +               else
> > > > +                       free_slab(slab->slab_cache, slab);
> > >
> > > A bit odd to use 'frozen' flag as such a signal.
> > > I guess I'm worried that truly !frozen slab can come here
> > > via ___slab_alloc() -> retry_load_slab: -> defer_deactivate_slab().
> > > And things will be much worse than just accounting.
> >
> > But the cpu slab must have been frozen before it's attached to
> > c->slab?
> 
> Is it?
> the path is
> c = slub_get_cpu_ptr(s->cpu_slab);
> if (unlikely(c->slab)) {
>    struct slab *flush_slab = c->slab;
>    defer_deactivate_slab(flush_slab, ...);
> 
> I don't see why it would be frozen.

Oh god. I was going to say the cpu slab is always frozen. It has been
true for very long time, but it seems it's not true after commit 90b1e56641
("mm/slub: directly load freelist from cpu partial slab in the likely case").

So I think you're right that a non-frozen slab can go through
free_slab() in free_deferred_objects()...

But fixing this should be simple. Add something like
freeze_and_get_freelist() and call it when SLUB take a slab from
per-cpu partial slab list?

> > > Maybe add
> > >   inc_slabs_node(s, nid, slab->objects);
> > > right before
> > >   defer_deactivate_slab(slab, NULL);
> > >   return NULL;
> > >
> > > I don't quite get why c7323a5ad078 is doing everything under n->list_lock.
> > > It's been 3 years since.
> >
> > When n->nr_slabs is inconsistent, validate_slab_node() might report an
> > error (false positive) when someone wrote '1' to
> > /sys/kernel/slab/<cache name>/validate
> 
> Ok. I see it now. It's the actual number of elements in n->full
> list needs to match n->nr_slabs.
> 
> But then how it's not broken already?
> I see that
> alloc_single_from_new_slab()
> unconditionally does inc_slabs_node(), but

It increments n->nr_slabs. It doesn't matter which list it's going to be
added to, because it's total number of slabs in that node.

> slab itself is added either to n->full or n->partial lists.

and then n->nr_partial is also incremented if it's added to n->partial.

> And validate_slab_node() should be complaining already.

The debug routine checks if:
- the number of slabs in n->partial == n->nr_partial
- the number of slabs in n->full + n->partial == n->nr_slabs

under n->list_lock. So it's not broken?

> Anyway, I'm not arguing. Just trying to understand.
> If you think the fix is fine, then go ahead.

-- 
Cheers,
Harry / Hyeonggon


  reply	other threads:[~2025-10-24  2:03 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-23 12:01 Vlastimil Babka
2025-10-23 12:56 ` Harry Yoo
2025-10-23 23:13 ` Alexei Starovoitov
2025-10-24  0:00   ` Harry Yoo
2025-10-24  1:17     ` Alexei Starovoitov
2025-10-24  2:03       ` Harry Yoo [this message]
2025-10-24  8:55         ` Vlastimil Babka
2025-10-24  9:36           ` Harry Yoo
2025-10-24 18:08             ` Alexei Starovoitov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aPrecUasNUbEkLlS@hyeyoo \
    --to=harry.yoo@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=alexei.starovoitov@gmail.com \
    --cc=ast@kernel.org \
    --cc=cl@gentwo.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=rientjes@google.com \
    --cc=roman.gushchin@linux.dev \
    --cc=vbabka@suse.cz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox