From: <hu.shengming@zte.com.cn>
To: <harry@kernel.org>
Cc: <vbabka@kernel.org>, <akpm@linux-foundation.org>,
<hao.li@linux.dev>, <cl@gentwo.org>, <rientjes@google.com>,
<roman.gushchin@linux.dev>, <linux-mm@kvack.org>,
<linux-kernel@vger.kernel.org>, <zhang.run@zte.com.cn>,
<xu.xin16@zte.com.cn>, <yang.tao172@zte.com.cn>,
<yang.yang29@zte.com.cn>
Subject: Re: [PATCH v3] mm/slub: defer freelist construction until after bulk allocation from a new slab
Date: Tue, 7 Apr 2026 21:02:16 +0800 (CST)
Message-ID: <20260407210216761qrDj8RR7pN-ycbvYmA69v@zte.com.cn>
In-Reply-To: <adSFyyFWlLy177rB@hyeyoo>
Harry wrote:
> Hi Shengming, thanks for v3!
>
> Good to see it's getting improved over the revisions.
> Let me leave some comments inline.
>
Hi Harry,
Thanks a lot for the detailed review.
> On Mon, Apr 06, 2026 at 09:50:18PM +0800, hu.shengming@zte.com.cn wrote:
> > From: Shengming Hu <hu.shengming@zte.com.cn>
> >
> > refill_objects() can consume many objects from a fresh slab, and when it
> > takes all objects from the slab the freelist built during slab allocation
> > is discarded immediately.
> >
> > Instead of special-casing the whole-slab bulk refill case, defer freelist
> > construction until after objects are emitted from the new slab.
> > allocate_slab() now allocates and initializes slab metadata only.
> > new_slab() preserves the existing behaviour by building the full freelist
> > on top, while refill_objects() allocates a raw slab and lets
> > alloc_from_new_slab() emit objects directly and build a freelist only for
> > the remaining objects, if any.
> >
> > To keep CONFIG_SLAB_FREELIST_RANDOM=y/n on the same path, introduce a
> > small iterator abstraction for walking free objects in allocation order.
> > The iterator is used both for filling the sheaf and for building the
> > freelist of the remaining objects.
> >
> > This removes the need for a separate whole-slab special case and avoids
> > temporary freelist construction when the slab is consumed entirely.
> >
> > Also mark setup_object() inline. After this optimization, the compiler no
> > longer consistently inlines this helper in the hot path, which can hurt
> > performance. Explicitly marking it inline restores the expected code
> > generation.
> >
> > This reduces per-object overhead in bulk allocation paths and improves
> > allocation throughput significantly. In slub_bulk_bench, the time per
> > object drops by about 41% to 70% with CONFIG_SLAB_FREELIST_RANDOM=n, and
> > by about 59% to 71% with CONFIG_SLAB_FREELIST_RANDOM=y.
> >
> > Benchmark results (slub_bulk_bench):
> > Machine: qemu-system-x86 -m 1024M -smp 8 -enable-kvm -cpu host
> > Kernel: Linux 7.0.0-rc6-next-20260330
> > Config: x86_64_defconfig
> > Cpu: 0
> > Rounds: 20
> > Total: 256MB
> >
> > - CONFIG_SLAB_FREELIST_RANDOM=n -
> >
> > obj_size=16, batch=256:
> > before: 4.62 +- 0.01 ns/object
> > after: 2.72 +- 0.01 ns/object
> > delta: -41.1%
> >
> > obj_size=32, batch=128:
> > before: 6.58 +- 0.02 ns/object
> > after: 3.30 +- 0.02 ns/object
> > delta: -49.8%
> >
> > obj_size=64, batch=64:
> > before: 10.20 +- 0.03 ns/object
> > after: 4.22 +- 0.03 ns/object
> > delta: -58.7%
> >
> > obj_size=128, batch=32:
> > before: 17.91 +- 0.04 ns/object
> > after: 5.73 +- 0.09 ns/object
> > delta: -68.0%
> >
> > obj_size=256, batch=32:
> > before: 21.03 +- 0.12 ns/object
> > after: 6.22 +- 0.08 ns/object
> > delta: -70.4%
> >
> > obj_size=512, batch=32:
> > before: 19.00 +- 0.21 ns/object
> > after: 6.45 +- 0.13 ns/object
> > delta: -66.0%
> >
> > - CONFIG_SLAB_FREELIST_RANDOM=y -
> >
> > obj_size=16, batch=256:
> > before: 8.37 +- 0.06 ns/object
> > after: 3.38 +- 0.05 ns/object
> > delta: -59.6%
> >
> > obj_size=32, batch=128:
> > before: 11.00 +- 0.13 ns/object
> > after: 4.05 +- 0.01 ns/object
> > delta: -63.2%
> >
> > obj_size=64, batch=64:
> > before: 15.30 +- 0.20 ns/object
> > after: 5.21 +- 0.03 ns/object
> > delta: -65.9%
> >
> > obj_size=128, batch=32:
> > before: 21.55 +- 0.14 ns/object
> > after: 7.10 +- 0.02 ns/object
> > delta: -67.1%
> >
> > obj_size=256, batch=32:
> > before: 26.27 +- 0.29 ns/object
> > after: 7.54 +- 0.05 ns/object
> > delta: -71.3%
> >
> > obj_size=512, batch=32:
> > before: 26.69 +- 0.28 ns/object
> > after: 7.73 +- 0.09 ns/object
> > delta: -71.0%
> >
> > Link: https://github.com/HSM6236/slub_bulk_test.git
> > Signed-off-by: Shengming Hu <hu.shengming@zte.com.cn>
> > ---
> > Changes in v2:
> > - Handle CONFIG_SLAB_FREELIST_RANDOM=y and add benchmark results.
> > - Update the QEMU benchmark setup to use -enable-kvm -cpu host so benchmark results better reflect native CPU performance.
> > - Link to v1: https://lore.kernel.org/all/20260328125538341lvTGRpS62UNdRiAAz2gH3@zte.com.cn/
> >
> > Changes in v3:
> > - refactor fresh-slab allocation to use a shared slab_obj_iter
> > - defer freelist construction until after bulk allocation from a new slab
> > - build a freelist only for leftover objects when the slab is left partial
> > - add build_slab_freelist(), prepare_slab_alloc_flags() and next_slab_obj() helpers
> > - remove obsolete freelist construction helpers now replaced by the iterator-based path, including next_freelist_entry() and shuffle_freelist()
> > - Link to v2: https://lore.kernel.org/all/202604011257259669oAdDsdnKx6twdafNZsF5@zte.com.cn/
> >
> > ---
> > mm/slab.h | 11 +++
> > mm/slub.c | 256 +++++++++++++++++++++++++++++-------------------------
> > 2 files changed, 149 insertions(+), 118 deletions(-)
> >
> > diff --git a/mm/slub.c b/mm/slub.c
> > index fb2c5c57bc4e..88537e577989 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -4344,14 +4245,130 @@ static __always_inline void maybe_wipe_obj_freeptr(struct kmem_cache *s,
> > 0, sizeof(void *));
> > }
> >
> > +/* Return the next free object in allocation order. */
> > +static inline void *next_slab_obj(struct kmem_cache *s,
> > + struct slab_obj_iter *iter)
> > +{
> > +#ifdef CONFIG_SLAB_FREELIST_RANDOM
> > + if (iter->random) {
> > + unsigned long idx;
> > +
> > + /*
> > + * If the target page allocation failed, the number of objects on the
> > + * page might be smaller than the usual size defined by the cache.
> > + */
> > + do {
> > + idx = s->random_seq[iter->pos];
> > + iter->pos++;
> > + if (iter->pos >= iter->freelist_count)
> > + iter->pos = 0;
> > + } while (unlikely(idx >= iter->page_limit));
> > +
> > + return setup_object(s, (char *)iter->start + idx);
> > + }
> > +#endif
> > + void *obj = iter->cur;
> > +
> > + iter->cur = (char *)iter->cur + s->size;
> > + return setup_object(s, obj);
> > +}
> > +
> > +/* Initialize an iterator over free objects in allocation order. */
> > +static inline void init_slab_obj_iter(struct kmem_cache *s, struct slab *slab,
> > + struct slab_obj_iter *iter,
> > + bool allow_spin)
> > +{
> > + iter->pos = 0;
> > + iter->start = fixup_red_left(s, slab_address(slab));
> > + iter->cur = iter->start;
>
> It's confusing that iter->pos field is used only when randomization is
> enabled and iter->cur field is used only when randomization is disabled.
>
> I think we could simply use iter->pos for both random and non-random cases
> (as I have shown in the skeleton before)?
>
Right, I introduced cur only to keep the non-random iteration close to
the original form, but I agree that using pos for both cases is cleaner.
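To convince myself the wrap-around logic still works with a single pos field, I prototyped the pos-only form as a small userspace model. The names mirror the patch, but the types and values here are simplified stand-ins, not the kernel structures:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model: a "slab" is a char buffer with objects s->size bytes
 * apart. iter->pos indexes the next object in both modes: it indexes
 * random_seq when randomization is on, and the object itself otherwise. */
struct kmem_cache {
	size_t size;
	unsigned int *random_seq;   /* precomputed byte offsets, or NULL */
};

struct slab_obj_iter {
	char *start;
	unsigned int pos;           /* next position, both modes */
	unsigned int freelist_count;
	unsigned int page_limit;    /* in bytes: objects * s->size */
	bool random;
};

static void *next_slab_obj(struct kmem_cache *s, struct slab_obj_iter *iter)
{
	unsigned long idx;

	if (iter->random) {
		/* Skip offsets beyond this slab's actual object count. */
		do {
			idx = s->random_seq[iter->pos];
			if (++iter->pos >= iter->freelist_count)
				iter->pos = 0;
		} while (idx >= iter->page_limit);
		return iter->start + idx;
	}
	/* Sequential: derive the address from pos, no separate cursor. */
	return iter->start + (size_t)iter->pos++ * s->size;
}
```

Both branches advance only pos, so the cur field goes away entirely.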
> > +#ifdef CONFIG_SLAB_FREELIST_RANDOM
> > + iter->random = (slab->objects >= 2 && s->random_seq);
> > + if (!iter->random)
> > + return;
> > +
> > + iter->freelist_count = oo_objects(s->oo);
> > + iter->page_limit = slab->objects * s->size;
> > +
> > + if (allow_spin) {
> > + iter->pos = get_random_u32_below(iter->freelist_count);
> > + } else {
> > + struct rnd_state *state;
> > +
> > + /*
> > + * An interrupt or NMI handler might interrupt and change
> > + * the state in the middle, but that's safe.
> > + */
> > + state = &get_cpu_var(slab_rnd_state);
> > + iter->pos = prandom_u32_state(state) % iter->freelist_count;
> > + put_cpu_var(slab_rnd_state);
> > + }
> > +#endif
> > +}
> > static unsigned int alloc_from_new_slab(struct kmem_cache *s, struct slab *slab,
> > void **p, unsigned int count, bool allow_spin)
>
> There is one problem with this change; ___slab_alloc() builds the
> freelist before calling alloc_from_new_slab(), while refill_objects()
> does not. For consistency, let's allocate a new slab without building
> freelist in ___slab_alloc() and build the freelist in
> alloc_single_from_new_slab() and alloc_from_new_slab()?
>
Agreed. Also, new_slab() is currently used by both early_kmem_cache_node_alloc()
and ___slab_alloc(), so I'll rework the early allocation path as well to
keep the new-slab flow consistent.
> > {
> > unsigned int allocated = 0;
> > struct kmem_cache_node *n;
> > + struct slab_obj_iter iter;
> > bool needs_add_partial;
> > unsigned long flags;
> > - void *object;
> > + unsigned int target_inuse;
> >
> > /*
> > * Are we going to put the slab on the partial list?
> > @@ -4359,6 +4376,9 @@ static unsigned int alloc_from_new_slab(struct kmem_cache *s, struct slab *slab,
> > */
> > needs_add_partial = (slab->objects > count);
> >
> > + /* Target inuse count after allocating from this new slab. */
> > + target_inuse = needs_add_partial ? count : slab->objects;
> > +
> > if (!allow_spin && needs_add_partial) {
> >
> > n = get_node(s, slab_nid(slab));
>
> Now new slabs without freelist can be freed in this path.
> which is confusing but should be _technically_ fine, I think...
>
> > @@ -4370,19 +4390,18 @@ static unsigned int alloc_from_new_slab(struct kmem_cache *s, struct slab *slab,
> > }
> > }
> >
> > - object = slab->freelist;
> > - while (object && allocated < count) {
> > - p[allocated] = object;
> > - object = get_freepointer(s, object);
> > + init_slab_obj_iter(s, slab, &iter, allow_spin);
> > +
> > + while (allocated < target_inuse) {
> > + p[allocated] = next_slab_obj(s, &iter);
> > maybe_wipe_obj_freeptr(s, p[allocated]);
>
> We don't have to wipe the free pointer as we didn't build the freelist?
>
Right, maybe_wipe_obj_freeptr() is not needed for objects emitted directly
from a fresh slab; I'll remove it.
> > - slab->inuse++;
> > allocated++;
> > }
> > - slab->freelist = object;
> > + slab->inuse = target_inuse;
> >
> > if (needs_add_partial) {
> > -
> > + build_slab_freelist(s, slab, &iter);
>
> When allow_spin is true, it's building the freelist while holding the
> spinlock, and that's not great.
>
> Hmm, can we do better?
>
> Perhaps just allocate object(s) from the slab and build the freelist
> with the objects left (if exists), but free the slab if allow_spin
> is false AND trylock fails, and accept the fact that the slab may not be
> fully free when it's freed due to trylock failure?
>
> something like:
>
> alloc_from_new_slab() {
> needs_add_partial = (slab->objects > count);
> target_inuse = needs_add_partial ? count : slab->objects;
>
> init_slab_obj_iter(s, slab, &iter, allow_spin);
> while (allocated < target_inuse) {
> p[allocated] = next_slab_obj(s, &iter);
> allocated++;
> }
> slab->inuse = target_inuse;
>
> if (needs_add_partial) {
> build_slab_freelist(s, slab, &iter);
> n = get_node(s, slab_nid(slab))
> if (allow_spin) {
> spin_lock_irqsave(&n->list_lock, flags);
> } else if (!spin_trylock_irqsave(&n->list_lock, flags)) {
> /*
> * Unlucky, discard newly allocated slab.
> * The slab is not fully free, but it's fine as
> * objects are not allocated to users.
> */
> free_new_slab_nolock(s, slab);
> return 0;
> }
> add_partial(n, slab, ADD_TO_HEAD);
> spin_unlock_irqrestore(&n->list_lock, flags);
> }
> [...]
> }
>
> And do something similar in alloc_single_from_new_slab() as well.
>
Good point. I'll restructure the path so objects are emitted first, the leftover
freelist is built only if needed, and the slab is added to partial afterwards. For
the !allow_spin trylock failure case, I'll discard the new slab and return 0. I'll
do the same for the single-object path as well.
> > if (allow_spin) {
> > n = get_node(s, slab_nid(slab));
> > spin_lock_irqsave(&n->list_lock, flags);
> > @@ -7244,16 +7265,15 @@ refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
> >
> > new_slab:
> >
> > - slab = new_slab(s, gfp, local_node);
> > + slab_gfp = prepare_slab_alloc_flags(s, gfp);
>
> Could we do `flags = prepare_slab_alloc_flags(s, flags);`
> within allocate_slab()? Having gfp and slab_gfp flags is distractive.
> The value of allow_spin should not change after
> prepare_slab_alloc_flags() anyway.
>
Agreed. I'll move the prepare_slab_alloc_flags() handling into allocate_slab()
so the call sites stay simpler, and keep the iterator/freelist construction
local to alloc_single_from_new_slab() and alloc_from_new_slab().
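For the call-site simplification, a minimal sketch of what I have in mind. The flag bits and helper bodies are made up for illustration; the point is only that adjustment happens inside allocate_slab(), and that it never touches the bit allow_spin is derived from:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy gfp bits; names and values are hypothetical stand-ins. */
#define GFP_COMP    0x1u   /* stands in for __GFP_COMP */
#define GFP_NOWARN  0x2u   /* stands in for __GFP_NOWARN */
#define GFP_NOSPIN  0x4u   /* stands in for "spinning not allowed" */

static bool gfpflags_allow_spinning(unsigned int gfp)
{
	return !(gfp & GFP_NOSPIN);
}

/* Adjustment adds allocator-internal bits but leaves the spin-related
 * bit alone, so allow_spin is the same before and after. */
static unsigned int prepare_slab_alloc_flags(unsigned int gfp)
{
	return gfp | GFP_COMP | GFP_NOWARN;
}

static unsigned int observed_gfp;   /* what the page allocator would see */

static bool allocate_slab(unsigned int gfp, bool allow_spin)
{
	/* Adjusted internally: callers keep a single gfp variable. */
	gfp = prepare_slab_alloc_flags(gfp);
	observed_gfp = gfp;
	(void)allow_spin;
	return true;
}
```

Callers then compute allow_spin once from their own gfp and pass both down, with no second slab_gfp variable in sight.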
I also have a question about the allow_spin semantics in the refill_objects path.
Now that init_slab_obj_iter() has been moved into alloc_from_new_slab(..., true),
allow_spin on this path appears to be unconditionally set to true.
Previously, shuffle_freelist() received allow_spin = gfpflags_allow_spinning(gfp),
so I wanted to check whether moving init_slab_obj_iter() into alloc_from_new_slab()
changes the intended semantics here.
My current understanding is that this is still fine, because refill_objects()
is guaranteed to run only when spinning is allowed. Is that correct?
> > + allow_spin = gfpflags_allow_spinning(slab_gfp);
> > +
> > + slab = allocate_slab(s, slab_gfp, local_node, allow_spin);
> > if (!slab)
> > goto out;
> >
> > stat(s, ALLOC_SLAB);
> >
> > - /*
> > - * TODO: possible optimization - if we know we will consume the whole
> > - * slab we might skip creating the freelist?
> > - */
> > refilled += alloc_from_new_slab(s, slab, p + refilled, max - refilled,
> > /* allow_spin = */ true);
> >
> > --
> > 2.25.1
>
> --
> Cheers,
> Harry / Hyeonggon
Thanks again. I will fold these changes into the next revision.
--
With Best Regards,
Shengming
Thread overview: 5+ messages
2026-04-06 13:50 hu.shengming
2026-04-07 4:19 ` Harry Yoo (Oracle)
2026-04-07 13:02 ` hu.shengming [this message]
2026-04-08 1:38 ` Harry Yoo (Oracle)
2026-04-08 4:21 ` hu.shengming