linux-mm.kvack.org archive mirror
From: <hu.shengming@zte.com.cn>
To: <harry@kernel.org>
Cc: <vbabka@kernel.org>, <akpm@linux-foundation.org>,
	<hao.li@linux.dev>, <cl@gentwo.org>, <rientjes@google.com>,
	<roman.gushchin@linux.dev>, <linux-mm@kvack.org>,
	<linux-kernel@vger.kernel.org>, <zhang.run@zte.com.cn>,
	<xu.xin16@zte.com.cn>, <yang.tao172@zte.com.cn>,
	<yang.yang29@zte.com.cn>
Subject: Re: [PATCH v3] mm/slub: defer freelist construction until after bulk allocation from a new slab
Date: Tue, 7 Apr 2026 21:02:16 +0800 (CST)	[thread overview]
Message-ID: <20260407210216761qrDj8RR7pN-ycbvYmA69v@zte.com.cn> (raw)
In-Reply-To: <adSFyyFWlLy177rB@hyeyoo>

Harry wrote:

> Hi Shengming, thanks for v3!
> 
> Good to see it's getting improved over the revisions.
> Let me leave some comments inline.
> 

Hi Harry,

Thanks a lot for the detailed review.

> On Mon, Apr 06, 2026 at 09:50:18PM +0800, hu.shengming@zte.com.cn wrote:
> > From: Shengming Hu <hu.shengming@zte.com.cn>
> > 
> > refill_objects() can consume many objects from a fresh slab, and when it
> > takes all objects from the slab the freelist built during slab allocation
> > is discarded immediately.
> > 
> > Instead of special-casing the whole-slab bulk refill case, defer freelist
> > construction until after objects are emitted from the new slab.
> > allocate_slab() now allocates and initializes slab metadata only.
> > new_slab() preserves the existing behaviour by building the full freelist
> > on top, while refill_objects() allocates a raw slab and lets
> > alloc_from_new_slab() emit objects directly and build a freelist only for
> > the remaining objects, if any.
> > 
> > To keep CONFIG_SLAB_FREELIST_RANDOM=y/n on the same path, introduce a
> > small iterator abstraction for walking free objects in allocation order.
> > The iterator is used both for filling the sheaf and for building the
> > freelist of the remaining objects.
> > 
> > This removes the need for a separate whole-slab special case and avoids
> > temporary freelist construction when the slab is consumed entirely.
> > 
> > Also mark setup_object() inline. After this optimization, the compiler no
> > longer consistently inlines this helper in the hot path, which can hurt
> > performance. Explicitly marking it inline restores the expected code
> > generation.
> > 
> > This reduces per-object overhead in bulk allocation paths and improves
> > allocation throughput significantly. In slub_bulk_bench, the time per
> > object drops by about 41% to 70% with CONFIG_SLAB_FREELIST_RANDOM=n, and
> > by about 59% to 71% with CONFIG_SLAB_FREELIST_RANDOM=y.
> > 
> > Benchmark results (slub_bulk_bench):
> > Machine: qemu-system-x86 -m 1024M -smp 8 -enable-kvm -cpu host
> > Kernel: Linux 7.0.0-rc6-next-20260330
> > Config: x86_64_defconfig
> > Cpu: 0
> > Rounds: 20
> > Total: 256MB
> > 
> > - CONFIG_SLAB_FREELIST_RANDOM=n -
> > 
> > obj_size=16, batch=256:
> > before: 4.62 +- 0.01 ns/object
> > after: 2.72 +- 0.01 ns/object
> > delta: -41.1%
> > 
> > obj_size=32, batch=128:
> > before: 6.58 +- 0.02 ns/object
> > after: 3.30 +- 0.02 ns/object
> > delta: -49.8%
> > 
> > obj_size=64, batch=64:
> > before: 10.20 +- 0.03 ns/object
> > after: 4.22 +- 0.03 ns/object
> > delta: -58.7%
> > 
> > obj_size=128, batch=32:
> > before: 17.91 +- 0.04 ns/object
> > after: 5.73 +- 0.09 ns/object
> > delta: -68.0%
> > 
> > obj_size=256, batch=32:
> > before: 21.03 +- 0.12 ns/object
> > after: 6.22 +- 0.08 ns/object
> > delta: -70.4%
> > 
> > obj_size=512, batch=32:
> > before: 19.00 +- 0.21 ns/object
> > after: 6.45 +- 0.13 ns/object
> > delta: -66.0%
> > 
> > - CONFIG_SLAB_FREELIST_RANDOM=y -
> > 
> > obj_size=16, batch=256:
> > before: 8.37 +- 0.06 ns/object
> > after: 3.38 +- 0.05 ns/object
> > delta: -59.6%
> > 
> > obj_size=32, batch=128:
> > before: 11.00 +- 0.13 ns/object
> > after: 4.05 +- 0.01 ns/object
> > delta: -63.2%
> > 
> > obj_size=64, batch=64:
> > before: 15.30 +- 0.20 ns/object
> > after: 5.21 +- 0.03 ns/object
> > delta: -65.9%
> > 
> > obj_size=128, batch=32:
> > before: 21.55 +- 0.14 ns/object
> > after: 7.10 +- 0.02 ns/object
> > delta: -67.1%
> > 
> > obj_size=256, batch=32:
> > before: 26.27 +- 0.29 ns/object
> > after: 7.54 +- 0.05 ns/object
> > delta: -71.3%
> > 
> > obj_size=512, batch=32:
> > before: 26.69 +- 0.28 ns/object
> > after: 7.73 +- 0.09 ns/object
> > delta: -71.0%
> > 
> > Link: https://github.com/HSM6236/slub_bulk_test.git
> > Signed-off-by: Shengming Hu <hu.shengming@zte.com.cn>
> > ---
> > Changes in v2:
> > - Handle CONFIG_SLAB_FREELIST_RANDOM=y and add benchmark results.
> > - Update the QEMU benchmark setup to use -enable-kvm -cpu host so
> >   benchmark results better reflect native CPU performance.
> > - Link to v1: https://lore.kernel.org/all/20260328125538341lvTGRpS62UNdRiAAz2gH3@zte.com.cn/
> > 
> > Changes in v3:
> > - Refactor fresh-slab allocation to use a shared slab_obj_iter
> > - Defer freelist construction until after bulk allocation from a new slab
> > - Build a freelist only for leftover objects when the slab is left partial
> > - Add build_slab_freelist(), prepare_slab_alloc_flags() and
> >   next_slab_obj() helpers
> > - Remove obsolete freelist construction helpers now replaced by the
> >   iterator-based path, including next_freelist_entry() and shuffle_freelist()
> > - Link to v2: https://lore.kernel.org/all/202604011257259669oAdDsdnKx6twdafNZsF5@zte.com.cn/
> > 
> > ---
> >  mm/slab.h |  11 +++
> >  mm/slub.c | 256 +++++++++++++++++++++++++++++-------------------------
> >  2 files changed, 149 insertions(+), 118 deletions(-)
> > 
> > diff --git a/mm/slub.c b/mm/slub.c
> > index fb2c5c57bc4e..88537e577989 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -4344,14 +4245,130 @@ static __always_inline void maybe_wipe_obj_freeptr(struct kmem_cache *s,
> >              0, sizeof(void *));
> >  }
> >  
> > +/* Return the next free object in allocation order. */
> > +static inline void *next_slab_obj(struct kmem_cache *s,
> > +                  struct slab_obj_iter *iter)
> > +{
> > +#ifdef CONFIG_SLAB_FREELIST_RANDOM
> > +    if (iter->random) {
> > +        unsigned long idx;
> > +
> > +        /*
> > +         * If the target page allocation failed, the number of objects on the
> > +         * page might be smaller than the usual size defined by the cache.
> > +         */
> > +        do {
> > +            idx = s->random_seq[iter->pos];
> > +            iter->pos++;
> > +            if (iter->pos >= iter->freelist_count)
> > +                iter->pos = 0;
> > +        } while (unlikely(idx >= iter->page_limit));
> > +
> > +        return setup_object(s, (char *)iter->start + idx);
> > +    }
> > +#endif
> > +    void *obj = iter->cur;
> > +
> > +    iter->cur = (char *)iter->cur + s->size;
> > +    return setup_object(s, obj);
> > +}
> > +
> > +/* Initialize an iterator over free objects in allocation order. */
> > +static inline void init_slab_obj_iter(struct kmem_cache *s, struct slab *slab,
> > +                      struct slab_obj_iter *iter,
> > +                      bool allow_spin)
> > +{
> > +    iter->pos = 0;
> > +    iter->start = fixup_red_left(s, slab_address(slab));
> > +    iter->cur = iter->start;
> 
> It's confusing that iter->pos field is used only when randomization is
> enabled and iter->cur field is used only when randomization is disabled.
> 
> I think we could simply use iter->pos for both random and non-random cases
> (as I have shown in the skeleton before)?
> 

Right, I introduced cur only to keep the non-random iteration close to
the original form, but I agree that using pos for both cases is cleaner.

> > +#ifdef CONFIG_SLAB_FREELIST_RANDOM
> > +    iter->random = (slab->objects >= 2 && s->random_seq);
> > +    if (!iter->random)
> > +        return;
> > +
> > +    iter->freelist_count = oo_objects(s->oo);
> > +    iter->page_limit = slab->objects * s->size;
> > +
> > +    if (allow_spin) {
> > +        iter->pos = get_random_u32_below(iter->freelist_count);
> > +    } else {
> > +        struct rnd_state *state;
> > +
> > +        /*
> > +         * An interrupt or NMI handler might interrupt and change
> > +         * the state in the middle, but that's safe.
> > +         */
> > +        state = &get_cpu_var(slab_rnd_state);
> > +        iter->pos = prandom_u32_state(state) % iter->freelist_count;
> > +        put_cpu_var(slab_rnd_state);
> > +    }
> > +#endif
> > +}
> >  static unsigned int alloc_from_new_slab(struct kmem_cache *s, struct slab *slab,
> >          void **p, unsigned int count, bool allow_spin)
> 
> There is one problem with this change; ___slab_alloc() builds the
> freelist before calling alloc_from_new_slab(), while refill_objects()
> does not. For consistency, let's allocate a new slab without building
> freelist in ___slab_alloc() and build the freelist in
> alloc_single_from_new_slab() and alloc_from_new_slab()?
> 

Agreed. Also, new_slab() is currently used by both early_kmem_cache_node_alloc()
and ___slab_alloc(), so I'll rework the early allocation path as well to
keep the new-slab flow consistent.

> >  {
> >      unsigned int allocated = 0;
> >      struct kmem_cache_node *n;
> > +    struct slab_obj_iter iter;
> >      bool needs_add_partial;
> >      unsigned long flags;
> > -    void *object;
> > +    unsigned int target_inuse;
> >  
> >      /*
> >       * Are we going to put the slab on the partial list?
> > @@ -4359,6 +4376,9 @@ static unsigned int alloc_from_new_slab(struct kmem_cache *s, struct slab *slab,
> >       */
> >      needs_add_partial = (slab->objects > count);
> >  
> > +    /* Target inuse count after allocating from this new slab. */
> > +    target_inuse = needs_add_partial ? count : slab->objects;
> > +
> >      if (!allow_spin && needs_add_partial) {
> >  
> >          n = get_node(s, slab_nid(slab));
> 
> Now new slabs without freelist can be freed in this path.
> which is confusing but should be _technically_ fine, I think...
> 
> > @@ -4370,19 +4390,18 @@ static unsigned int alloc_from_new_slab(struct kmem_cache *s, struct slab *slab,
> >          }
> >      }
> >  
> > -    object = slab->freelist;
> > -    while (object && allocated < count) {
> > -        p[allocated] = object;
> > -        object = get_freepointer(s, object);
> > +    init_slab_obj_iter(s, slab, &iter, allow_spin);
> > +
> > +    while (allocated < target_inuse) {
> > +        p[allocated] = next_slab_obj(s, &iter);
> >          maybe_wipe_obj_freeptr(s, p[allocated]);
> 
> We don't have to wipe the free pointer as we didn't build the freelist?
> 

Right, maybe_wipe_obj_freeptr() is not needed for objects emitted directly
from a fresh slab, since no freelist pointer was ever written into them;
I'll remove it.

> > -        slab->inuse++;
> >          allocated++;
> >      }
> > -    slab->freelist = object;
> > +    slab->inuse = target_inuse;
> >  
> >      if (needs_add_partial) {
> > -
> > +        build_slab_freelist(s, slab, &iter);
> 
> When allow_spin is true, it's building the freelist while holding the
> spinlock, and that's not great.
> 
> Hmm, can we do better?
> 
> Perhaps just allocate object(s) from the slab and build the freelist
> with the objects left (if exists), but free the slab if allow_spin
> is false AND trylock fails, and accept the fact that the slab may not be
> fully free when it's freed due to trylock failure?
> 
> something like:
> 
> alloc_from_new_slab() {
>     needs_add_partial = (slab->objects > count);
>     target_inuse = needs_add_partial ? count : slab->objects;
> 
>     init_slab_obj_iter(s, slab, &iter, allow_spin);
>     while (allocated < target_inuse) {
>         p[allocated] = next_slab_obj(s, &iter);
>         allocated++;
>     }
>     slab->inuse = target_inuse;
> 
>     if (needs_add_partial) {
>         build_slab_freelist(s, slab, &iter);
>         n = get_node(s, slab_nid(slab))
>         if (allow_spin) {
>             spin_lock_irqsave(&n->list_lock, flags);
>         } else if (!spin_trylock_irqsave(&n->list_lock, flags)) {
>             /* 
>              * Unlucky, discard newly allocated slab.
>              * The slab is not fully free, but it's fine as
>              * objects are not allocated to users.
>              */
>             free_new_slab_nolock(s, slab);
>             return 0;
>         }
>         add_partial(n, slab, ADD_TO_HEAD);
>         spin_unlock_irqrestore(&n->list_lock, flags);
>     }
>     [...]
> }
> 
> And do something similar in alloc_single_from_new_slab() as well.
> 

Good point. I'll restructure the path so objects are emitted first, the leftover
freelist is built only if needed, and the slab is added to partial afterwards. For
the !allow_spin trylock failure case, I'll discard the new slab and return 0. I'll
do the same for the single-object path as well.

> >          if (allow_spin) {
> >              n = get_node(s, slab_nid(slab));
> >              spin_lock_irqsave(&n->list_lock, flags);
> > @@ -7244,16 +7265,15 @@ refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
> >  
> >  new_slab:
> >  
> > -    slab = new_slab(s, gfp, local_node);
> > +    slab_gfp = prepare_slab_alloc_flags(s, gfp);
> 
> Could we do `flags = prepare_slab_alloc_flags(s, flags);`
> within allocate_slab()? Having gfp and slab_gfp flags is distractive.
> The value of allow_spin should not change after
> prepare_slab_alloc_flags() anyway.
> 

Agreed. I'll move the prepare_slab_alloc_flags() handling into allocate_slab()
so the call sites stay simpler, and keep the iterator/freelist construction
local to alloc_single_from_new_slab() and alloc_from_new_slab().

I also have a question about the allow_spin semantics on the refill_objects()
path.

With init_slab_obj_iter() moved into alloc_from_new_slab(..., true),
allow_spin on this path is now unconditionally true.

Previously, shuffle_freelist() received allow_spin = gfpflags_allow_spinning(gfp),
so I wanted to check whether moving init_slab_obj_iter() into
alloc_from_new_slab() changes the intended semantics here.

My current understanding is that this is still fine, because refill_objects()
runs only when spinning is allowed. Is that correct?

> > +    allow_spin = gfpflags_allow_spinning(slab_gfp);
> > +
> > +    slab = allocate_slab(s, slab_gfp, local_node, allow_spin);
> >      if (!slab)
> >          goto out;
> >  
> >      stat(s, ALLOC_SLAB);
> >  
> > -    /*
> > -     * TODO: possible optimization - if we know we will consume the whole
> > -     * slab we might skip creating the freelist?
> > -     */
> >      refilled += alloc_from_new_slab(s, slab, p + refilled, max - refilled,
> >                      /* allow_spin = */ true);
> >  
> > -- 
> > 2.25.1
> 
> -- 
> Cheers,
> Harry / Hyeonggon

Thanks again. I will fold these changes into the next revision.
--
With Best Regards,
Shengming



Thread overview: 5+ messages
2026-04-06 13:50 hu.shengming
2026-04-07  4:19 ` Harry Yoo (Oracle)
2026-04-07 13:02   ` hu.shengming [this message]
2026-04-08  1:38     ` Harry Yoo (Oracle)
2026-04-08  4:21       ` hu.shengming
