From: Shengming Hu

Allocations from a fresh slab can consume all of its objects, and the
freelist built during slab allocation is discarded immediately as a
result. Instead of special-casing the whole-slab bulk refill case,
defer freelist construction until after objects are emitted from a
fresh slab.

new_slab() now only allocates the slab and initializes its metadata.
refill_objects() then obtains a fresh slab and lets
alloc_from_new_slab() emit objects directly, building a freelist only
for the objects left unallocated; the same change is applied to
alloc_single_from_new_slab().

To keep CONFIG_SLAB_FREELIST_RANDOM=y/n on the same path, introduce a
small iterator abstraction for walking free objects in allocation
order. The iterator is used both for filling the sheaf and for
building the freelist of the remaining objects.

Also mark setup_object() inline. After this optimization, the compiler
no longer consistently inlines this helper in the hot path, which can
hurt performance. Explicitly marking it inline restores the expected
code generation.

This reduces per-object overhead when allocating from a fresh slab.
The most direct benefit is in the paths that allocate objects first
and only build a freelist for the remainder afterwards: bulk
allocation from a new slab in refill_objects(), single-object
allocation from a new slab in ___slab_alloc(), and the corresponding
early-boot paths that now use the same deferred-freelist scheme.

Since refill_objects() is also used to refill sheaves, the
optimization is not limited to the small set of
kmem_cache_alloc_bulk()/kmem_cache_free_bulk() users; regular
allocation workloads may benefit as well when they refill from a
fresh slab.

In slub_bulk_bench, the time per object drops by about 32% to 71%
with CONFIG_SLAB_FREELIST_RANDOM=n, and by about 52% to 70% with
CONFIG_SLAB_FREELIST_RANDOM=y. This benchmark is intended to isolate
the cost removed by this change: each iteration allocates exactly
slab->objects from a fresh slab.
That makes it a near best-case scenario for deferred freelist
construction, because the old path still built a full freelist even
when no objects remained, while the new path avoids that work.
Realistic workloads may see smaller end-to-end gains depending on how
often allocations reach this fresh-slab refill path.

Benchmark results (slub_bulk_bench):

Machine: qemu-system-x86 -m 1024M -smp 8 -enable-kvm -cpu host
Kernel:  Linux 7.0.0-rc7-next-20260407
Config:  x86_64_defconfig
Cpu:     0
Rounds:  20
Total:   256MB

- CONFIG_SLAB_FREELIST_RANDOM=n -

obj_size=16, batch=256:
  before: 4.73 +- 0.05 ns/object
  after:  3.19 +- 0.03 ns/object
  delta:  -32.6%

obj_size=32, batch=128:
  before: 6.75 +- 0.06 ns/object
  after:  3.61 +- 0.04 ns/object
  delta:  -46.5%

obj_size=64, batch=64:
  before: 10.47 +- 0.07 ns/object
  after:  4.47 +- 0.06 ns/object
  delta:  -57.3%

obj_size=128, batch=32:
  before: 18.38 +- 0.08 ns/object
  after:  5.99 +- 0.11 ns/object
  delta:  -67.4%

obj_size=256, batch=32:
  before: 22.14 +- 0.21 ns/object
  after:  6.36 +- 0.12 ns/object
  delta:  -71.3%

obj_size=512, batch=32:
  before: 19.86 +- 0.10 ns/object
  after:  6.60 +- 0.18 ns/object
  delta:  -66.8%

- CONFIG_SLAB_FREELIST_RANDOM=y -

obj_size=16, batch=256:
  before: 8.88 +- 0.06 ns/object
  after:  4.18 +- 0.15 ns/object
  delta:  -52.9%

obj_size=32, batch=128:
  before: 11.45 +- 0.10 ns/object
  after:  4.72 +- 0.11 ns/object
  delta:  -58.8%

obj_size=64, batch=64:
  before: 15.67 +- 0.25 ns/object
  after:  5.63 +- 0.04 ns/object
  delta:  -64.1%

obj_size=128, batch=32:
  before: 22.26 +- 0.18 ns/object
  after:  7.62 +- 0.08 ns/object
  delta:  -65.8%

obj_size=256, batch=32:
  before: 27.26 +- 0.37 ns/object
  after:  8.12 +- 0.07 ns/object
  delta:  -70.2%

obj_size=512, batch=32:
  before: 28.08 +- 0.55 ns/object
  after:  8.26 +- 0.12 ns/object
  delta:  -70.6%

Link: https://github.com/HSM6236/slub_bulk_test.git
Signed-off-by: Shengming Hu
---
Changes in v2:
- Handle CONFIG_SLAB_FREELIST_RANDOM=y and add benchmark results.
- Update the QEMU benchmark setup to use -enable-kvm -cpu host so
  benchmark results better reflect native CPU performance.
- Link to v1: https://lore.kernel.org/all/20260328125538341lvTGRpS62UNdRiAAz2gH3@zte.com.cn/

Changes in v3:
- refactor fresh-slab allocation to use a shared slab_obj_iter
- defer freelist construction until after bulk allocation from a new slab
- build a freelist only for leftover objects when the slab is left partial
- add build_slab_freelist(), prepare_slab_alloc_flags() and
  next_slab_obj() helpers
- remove obsolete freelist construction helpers now replaced by the
  iterator-based path, including next_freelist_entry() and
  shuffle_freelist()
- Link to v2: https://lore.kernel.org/all/202604011257259669oAdDsdnKx6twdafNZsF5@zte.com.cn/

Changes in v4:
- remove slab_obj_iter::cur
- drop prepare_slab_alloc_flags() and restore the original flag
  handling in new_slab()
- new_slab() no longer builds the freelist. Instead, the freelist is
  constructed only for the objects left unallocated in
  alloc_single_from_new_slab(), alloc_from_new_slab(), and
  early_kmem_cache_node_alloc().
- remove maybe_wipe_obj_freeptr() when allocating objects directly
  without a freelist built
- Link to v3: https://lore.kernel.org/all/202604062150182836ygUiyPoKcxtHjgF7rWXe@zte.com.cn/

Changes in v5:
- Call build_slab_freelist() unconditionally, and remove the redundant
  "slab->freelist = NULL" initialization in allocate_slab().
- Check the return value of alloc_from_new_slab() to prevent a
  potential use-after-free bug.
- Refine the commit message with more precise test coverage
  descriptions.
- Link to v4: https://lore.kernel.org/all/2026040823281824773ybHpC3kgUhR9OE1rGTl@zte.com.cn/
---
 mm/slab.h |  10 ++
 mm/slub.c | 279 +++++++++++++++++++++++++++---------------------------
 2 files changed, 147 insertions(+), 142 deletions(-)

diff --git a/mm/slab.h b/mm/slab.h
index bf2f87acf5e3..ada3f9c3909f 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -91,6 +91,16 @@ struct slab {
 #endif
 };
 
+struct slab_obj_iter {
+	unsigned long pos;
+	void *start;
+#ifdef CONFIG_SLAB_FREELIST_RANDOM
+	unsigned long freelist_count;
+	unsigned long page_limit;
+	bool random;
+#endif
+};
+
 #define SLAB_MATCH(pg, sl)						\
 	static_assert(offsetof(struct page, pg) == offsetof(struct slab, sl))
 SLAB_MATCH(flags, flags);
diff --git a/mm/slub.c b/mm/slub.c
index 4927407c9699..9ff8af8c2f73 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2733,7 +2733,7 @@ bool slab_free_freelist_hook(struct kmem_cache *s, void **head, void **tail,
 	return *head != NULL;
 }
 
-static void *setup_object(struct kmem_cache *s, void *object)
+static inline void *setup_object(struct kmem_cache *s, void *object)
 {
 	setup_object_debug(s, object);
 	object = kasan_init_slab_obj(s, object);
@@ -3329,87 +3329,14 @@ static void __init init_freelist_randomization(void)
 	mutex_unlock(&slab_mutex);
 }
 
-/* Get the next entry on the pre-computed freelist randomized */
-static void *next_freelist_entry(struct kmem_cache *s,
-				 unsigned long *pos, void *start,
-				 unsigned long page_limit,
-				 unsigned long freelist_count)
-{
-	unsigned int idx;
-
-	/*
-	 * If the target page allocation failed, the number of objects on the
-	 * page might be smaller than the usual size defined by the cache.
-	 */
-	do {
-		idx = s->random_seq[*pos];
-		*pos += 1;
-		if (*pos >= freelist_count)
-			*pos = 0;
-	} while (unlikely(idx >= page_limit));
-
-	return (char *)start + idx;
-}
-
 static DEFINE_PER_CPU(struct rnd_state, slab_rnd_state);
 
-/* Shuffle the single linked freelist based on a random pre-computed sequence */
-static bool shuffle_freelist(struct kmem_cache *s, struct slab *slab,
-			     bool allow_spin)
-{
-	void *start;
-	void *cur;
-	void *next;
-	unsigned long idx, pos, page_limit, freelist_count;
-
-	if (slab->objects < 2 || !s->random_seq)
-		return false;
-
-	freelist_count = oo_objects(s->oo);
-	if (allow_spin) {
-		pos = get_random_u32_below(freelist_count);
-	} else {
-		struct rnd_state *state;
-
-		/*
-		 * An interrupt or NMI handler might interrupt and change
-		 * the state in the middle, but that's safe.
-		 */
-		state = &get_cpu_var(slab_rnd_state);
-		pos = prandom_u32_state(state) % freelist_count;
-		put_cpu_var(slab_rnd_state);
-	}
-
-	page_limit = slab->objects * s->size;
-	start = fixup_red_left(s, slab_address(slab));
-
-	/* First entry is used as the base of the freelist */
-	cur = next_freelist_entry(s, &pos, start, page_limit, freelist_count);
-	cur = setup_object(s, cur);
-	slab->freelist = cur;
-
-	for (idx = 1; idx < slab->objects; idx++) {
-		next = next_freelist_entry(s, &pos, start, page_limit,
-					   freelist_count);
-		next = setup_object(s, next);
-		set_freepointer(s, cur, next);
-		cur = next;
-	}
-	set_freepointer(s, cur, NULL);
-
-	return true;
-}
 #else
 static inline int init_cache_random_seq(struct kmem_cache *s)
 {
 	return 0;
 }
 static inline void init_freelist_randomization(void) { }
-static inline bool shuffle_freelist(struct kmem_cache *s, struct slab *slab,
-				    bool allow_spin)
-{
-	return false;
-}
 #endif /* CONFIG_SLAB_FREELIST_RANDOM */
 
 static __always_inline void account_slab(struct slab *slab, int order,
@@ -3438,15 +3365,14 @@ static __always_inline void unaccount_slab(struct slab *slab, int order,
 			  -(PAGE_SIZE << order));
 }
 
+/* Allocate and initialize a slab without building its freelist. */
 static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 {
 	bool allow_spin = gfpflags_allow_spinning(flags);
 	struct slab *slab;
 	struct kmem_cache_order_objects oo = s->oo;
 	gfp_t alloc_gfp;
-	void *start, *p, *next;
-	int idx;
-	bool shuffle;
+	void *start;
 
 	flags &= gfp_allowed_mask;
 
@@ -3497,21 +3423,6 @@ static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 	alloc_slab_obj_exts_early(s, slab);
 	account_slab(slab, oo_order(oo), s, flags);
 
-	shuffle = shuffle_freelist(s, slab, allow_spin);
-
-	if (!shuffle) {
-		start = fixup_red_left(s, start);
-		start = setup_object(s, start);
-		slab->freelist = start;
-		for (idx = 0, p = start; idx < slab->objects - 1; idx++) {
-			next = p + s->size;
-			next = setup_object(s, next);
-			set_freepointer(s, p, next);
-			p = next;
-		}
-		set_freepointer(s, p, NULL);
-	}
-
 	return slab;
 }
 
@@ -3665,30 +3576,109 @@ static void *alloc_single_from_partial(struct kmem_cache *s,
 	return object;
 }
 
+/* Return the next free object in allocation order. */
+static inline void *next_slab_obj(struct kmem_cache *s,
+				  struct slab_obj_iter *iter)
+{
+#ifdef CONFIG_SLAB_FREELIST_RANDOM
+	if (iter->random) {
+		unsigned long idx;
+
+		/*
+		 * If the target page allocation failed, the number of objects
+		 * on the page might be smaller than the usual size defined by
+		 * the cache.
+		 */
+		do {
+			idx = s->random_seq[iter->pos];
+			iter->pos++;
+			if (iter->pos >= iter->freelist_count)
+				iter->pos = 0;
+		} while (unlikely(idx >= iter->page_limit));
+
+		return setup_object(s, (char *)iter->start + idx);
+	}
+#endif
+	return setup_object(s, (char *)iter->start + iter->pos++ * s->size);
+}
+
+/* Build a freelist from the objects not yet allocated from a fresh slab. */
+static inline void build_slab_freelist(struct kmem_cache *s, struct slab *slab,
+				       struct slab_obj_iter *iter)
+{
+	unsigned int nr = slab->objects - slab->inuse;
+	unsigned int i;
+	void *cur, *next;
+
+	if (!nr) {
+		slab->freelist = NULL;
+		return;
+	}
+
+	cur = next_slab_obj(s, iter);
+	slab->freelist = cur;
+
+	for (i = 1; i < nr; i++) {
+		next = next_slab_obj(s, iter);
+		set_freepointer(s, cur, next);
+		cur = next;
+	}
+
+	set_freepointer(s, cur, NULL);
+}
+
+/* Initialize an iterator over free objects in allocation order. */
+static inline void init_slab_obj_iter(struct kmem_cache *s, struct slab *slab,
+				      struct slab_obj_iter *iter,
+				      bool allow_spin)
+{
+	iter->pos = 0;
+	iter->start = fixup_red_left(s, slab_address(slab));
+
+#ifdef CONFIG_SLAB_FREELIST_RANDOM
+	iter->random = (slab->objects >= 2 && s->random_seq);
+	if (!iter->random)
+		return;
+
+	iter->freelist_count = oo_objects(s->oo);
+	iter->page_limit = slab->objects * s->size;
+
+	if (allow_spin) {
+		iter->pos = get_random_u32_below(iter->freelist_count);
+	} else {
+		struct rnd_state *state;
+
+		/*
+		 * An interrupt or NMI handler might interrupt and change
+		 * the state in the middle, but that's safe.
+		 */
+		state = &get_cpu_var(slab_rnd_state);
+		iter->pos = prandom_u32_state(state) % iter->freelist_count;
+		put_cpu_var(slab_rnd_state);
+	}
+#endif
+}
+
 /*
  * Called only for kmem_cache_debug() caches to allocate from a freshly
  * allocated slab. Allocate a single object instead of whole freelist
  * and put the slab to the partial (or full) list.
 */
static void *alloc_single_from_new_slab(struct kmem_cache *s, struct slab *slab,
-					int orig_size, gfp_t gfpflags)
+					int orig_size, bool allow_spin)
 {
-	bool allow_spin = gfpflags_allow_spinning(gfpflags);
-	int nid = slab_nid(slab);
-	struct kmem_cache_node *n = get_node(s, nid);
+	struct kmem_cache_node *n;
+	struct slab_obj_iter iter;
+	bool needs_add_partial;
 	unsigned long flags;
 	void *object;
 
-	if (!allow_spin && !spin_trylock_irqsave(&n->list_lock, flags)) {
-		/* Unlucky, discard newly allocated slab. */
-		free_new_slab_nolock(s, slab);
-		return NULL;
-	}
-
-	object = slab->freelist;
-	slab->freelist = get_freepointer(s, object);
+	init_slab_obj_iter(s, slab, &iter, allow_spin);
+	object = next_slab_obj(s, &iter);
 	slab->inuse = 1;
 
+	needs_add_partial = (slab->objects > 1);
+	build_slab_freelist(s, slab, &iter);
+
 	if (!alloc_debug_processing(s, slab, object, orig_size)) {
 		/*
 		 * It's not really expected that this would fail on a
@@ -3696,22 +3686,30 @@ static void *alloc_single_from_new_slab(struct kmem_cache *s, struct slab *slab,
 		 * corruption in theory could cause that.
 		 * Leak memory of allocated slab.
 		 */
-		if (!allow_spin)
-			spin_unlock_irqrestore(&n->list_lock, flags);
 		return NULL;
 	}
 
-	if (allow_spin)
+	n = get_node(s, slab_nid(slab));
+	if (allow_spin) {
 		spin_lock_irqsave(&n->list_lock, flags);
+	} else if (!spin_trylock_irqsave(&n->list_lock, flags)) {
+		/*
+		 * Unlucky, discard newly allocated slab.
+		 * The slab is not fully free, but it's fine as
+		 * objects are not allocated to users.
+		 */
+		free_new_slab_nolock(s, slab);
+		return NULL;
+	}
 
-	if (slab->inuse == slab->objects)
-		add_full(s, n, slab);
-	else
+	if (needs_add_partial)
 		add_partial(n, slab, ADD_TO_HEAD);
+	else
+		add_full(s, n, slab);
 
-	inc_slabs_node(s, nid, slab->objects);
 	spin_unlock_irqrestore(&n->list_lock, flags);
+	inc_slabs_node(s, slab_nid(slab), slab->objects);
 
 	return object;
 }
@@ -4349,9 +4347,10 @@ static unsigned int alloc_from_new_slab(struct kmem_cache *s, struct slab *slab,
 {
 	unsigned int allocated = 0;
 	struct kmem_cache_node *n;
+	struct slab_obj_iter iter;
 	bool needs_add_partial;
 	unsigned long flags;
-	void *object;
+	unsigned int target_inuse;
 
 	/*
 	 * Are we going to put the slab on the partial list?
@@ -4359,33 +4358,30 @@ static unsigned int alloc_from_new_slab(struct kmem_cache *s, struct slab *slab,
 	 */
 	needs_add_partial = (slab->objects > count);
 
-	if (!allow_spin && needs_add_partial) {
+	/* Target inuse count after allocating from this new slab. */
+	target_inuse = needs_add_partial ? count : slab->objects;
 
-		n = get_node(s, slab_nid(slab));
-
-		if (!spin_trylock_irqsave(&n->list_lock, flags)) {
-			/* Unlucky, discard newly allocated slab */
-			free_new_slab_nolock(s, slab);
-			return 0;
-		}
-	}
-
-	object = slab->freelist;
-	while (object && allocated < count) {
-		p[allocated] = object;
-		object = get_freepointer(s, object);
-		maybe_wipe_obj_freeptr(s, p[allocated]);
+	init_slab_obj_iter(s, slab, &iter, allow_spin);
 
-		slab->inuse++;
+	while (allocated < target_inuse) {
+		p[allocated] = next_slab_obj(s, &iter);
 		allocated++;
 	}
-	slab->freelist = object;
+	slab->inuse = target_inuse;
+	build_slab_freelist(s, slab, &iter);
 
 	if (needs_add_partial) {
-
+		n = get_node(s, slab_nid(slab));
 		if (allow_spin) {
-			n = get_node(s, slab_nid(slab));
 			spin_lock_irqsave(&n->list_lock, flags);
+		} else if (!spin_trylock_irqsave(&n->list_lock, flags)) {
+			/*
+			 * Unlucky, discard newly allocated slab.
+			 * The slab is not fully free, but it's fine as
+			 * objects are not allocated to users.
+			 */
+			free_new_slab_nolock(s, slab);
+			return 0;
 		}
 		add_partial(n, slab, ADD_TO_HEAD);
 		spin_unlock_irqrestore(&n->list_lock, flags);
@@ -4456,15 +4452,13 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 	stat(s, ALLOC_SLAB);
 
 	if (IS_ENABLED(CONFIG_SLUB_TINY) || kmem_cache_debug(s)) {
-		object = alloc_single_from_new_slab(s, slab, orig_size, gfpflags);
+		object = alloc_single_from_new_slab(s, slab, orig_size, allow_spin);
 
 		if (likely(object))
 			goto success;
 	} else {
-		alloc_from_new_slab(s, slab, &object, 1, allow_spin);
-
-		/* we don't need to check SLAB_STORE_USER here */
-		if (likely(object))
+		if (alloc_from_new_slab(s, slab, &object, 1, allow_spin))
 			return object;
 	}
 
@@ -7251,10 +7245,6 @@ refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
 
 	stat(s, ALLOC_SLAB);
 
-	/*
-	 * TODO: possible optimization - if we know we will consume the whole
-	 * slab we might skip creating the freelist?
-	 */
 	refilled += alloc_from_new_slab(s, slab, p + refilled, max - refilled,
 					/* allow_spin = */ true);
 
@@ -7585,6 +7575,7 @@ static void early_kmem_cache_node_alloc(int node)
 {
 	struct slab *slab;
 	struct kmem_cache_node *n;
+	struct slab_obj_iter iter;
 
 	BUG_ON(kmem_cache_node->size < sizeof(struct kmem_cache_node));
 
@@ -7596,14 +7587,18 @@ static void early_kmem_cache_node_alloc(int node)
 		pr_err("SLUB: Allocating a useless per node structure in order to be able to continue\n");
 	}
 
-	n = slab->freelist;
+	init_slab_obj_iter(kmem_cache_node, slab, &iter, true);
+
+	n = next_slab_obj(kmem_cache_node, &iter);
 	BUG_ON(!n);
+
+	slab->inuse = 1;
+	build_slab_freelist(kmem_cache_node, slab, &iter);
+
 #ifdef CONFIG_SLUB_DEBUG
 	init_object(kmem_cache_node, n, SLUB_RED_ACTIVE);
 #endif
 	n = kasan_slab_alloc(kmem_cache_node, n, GFP_KERNEL, false);
-	slab->freelist = get_freepointer(kmem_cache_node, n);
-	slab->inuse = 1;
 	kmem_cache_node->per_node[node].node = n;
 	init_kmem_cache_node(n);
 	inc_slabs_node(kmem_cache_node, node, slab->objects);
-- 
2.25.1