* [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu allocator scalability problem
@ 2025-04-24  8:07 Harry Yoo
  2025-04-24  8:07 ` [RFC PATCH 1/7] mm/slab: refactor freelist shuffle Harry Yoo
                   ` (10 more replies)
  0 siblings, 11 replies; 27+ messages in thread
From: Harry Yoo @ 2025-04-24  8:07 UTC (permalink / raw)
  To: Vlastimil Babka, Christoph Lameter, David Rientjes,
	Andrew Morton, Dennis Zhou, Tejun Heo, Mateusz Guzik
  Cc: Jamal Hadi Salim, Cong Wang, Jiri Pirko, Vlad Buslov,
	Yevgeny Kliteynik, Jan Kara, Byungchul Park, linux-mm, netdev,
	linux-kernel, Harry Yoo

Overview
========

The slab destructor feature existed in the early days of the slab
allocator(s). It was removed in 2007 by commit c59def9f222d ("Slab
allocators: Drop support for destructors") due to the lack of serious use
cases at that time.

Eighteen years later, Mateusz Guzik proposed [1] re-introducing a slab
constructor/destructor pair to mitigate the global serialization point
(pcpu_alloc_mutex) that occurs when each slab object allocates and frees
percpu memory during its lifetime.

Consider mm_struct: it allocates two percpu regions (mm_cid and rss_stat),
so each allocate/free cycle requires two expensive acquire/release
operations on that mutex.

We can mitigate this contention by retaining the percpu regions after
the object is freed and releasing them only when the backing slab pages
are freed.

This maps naturally onto slab constructors and destructors: the constructor
allocates the percpu memory, and the destructor frees it when the slab
pages are reclaimed. This slightly alters the constructor's semantics,
as it can now fail.
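
As a rough sketch of the idea (the struct, field, and function names below
are made up for illustration and are not the API added by this series; see
patches 3 and 7 for the real code), a cache opting into the pair would look
roughly like this:

  struct foo {
          int __percpu *stats;    /* retained across free/alloc cycles */
  };

  /* Runs once per object when a new slab page is populated. */
  static int foo_ctor(void *obj)
  {
          struct foo *f = obj;

          f->stats = alloc_percpu(int);
          return f->stats ? 0 : -ENOMEM;  /* the ctor may now fail */
  }

  /* Runs only when the backing slab page itself is freed. */
  static void foo_dtor(void *obj)
  {
          struct foo *f = obj;

          free_percpu(f->stats);
  }

A freed object keeps its percpu region, so a later allocation that reuses
the same slot never enters the percpu allocator (and never touches
pcpu_alloc_mutex) on the hot path.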

This series is functional (although not compatible with MM debug
features yet), but still far from perfect. I’m actively refining it and
would appreciate early feedback before I improve it further. :)

This series is based on slab/for-next [2].

Performance Improvement
=======================

I measured the benefit of this series for two different users:
exec() and tc filter insertion/removal.

exec() throughput
-----------------

The performance of exec() is important when short-lived processes are
frequently created. For example: shell-heavy workloads and running many
test cases [3].

I measured exec() throughput with a microbenchmark:
  - a 33% exec() throughput gain on a 2-socket machine with 192 CPUs,
  - a 4.56% gain on a desktop with 24 hardware threads, and
  - even a 4% gain in single-threaded exec() throughput.

Further investigation showed that the gains come from avoiding the overhead
of acquiring/releasing pcpu_alloc_mutex and the contention on it.

See patch 7 for more detail on the experiment.

Traffic Filter Insertion and Removal
------------------------------------

Each tc filter allocates three percpu memory regions per tc_action object,
so frequent filter insertion and removal contends heavily on the same
mutex.

In the Linux kernel's tools/testing tc-filter benchmark (see patch 4 for
more detail), I observed a 26% reduction in system time and much less
contention on pcpu_alloc_mutex with this series.

Old mailing list threads [4] show that Mellanox (now NVIDIA) engineers
cared about the tc filter insertion rate; these changes may still benefit
the workloads they run today.

Feedback Needed from Percpu Allocator Folks
===========================================

As the percpu allocator is directly affected by this series, this work
will need support from the percpu allocator maintainers, and we need to
address their concerns.

They may well say, "This is a percpu memory allocator scalability issue,
and we need to make the percpu allocator itself scalable." I don't know.

What do you think?

Some lingering thoughts:
- Tackling the problem on the slab side is much simpler, because the slab
  allocator already caches objects per CPU. Re-creating similar logic
  inside the percpu allocator would be redundant.

  Also, since this is opt-in per slab cache, other percpu allocator
  users remain unaffected.

- If fragmentation is a concern, we could probably allocate larger percpu
  chunks and partition them for slab objects.

- If memory overhead becomes an issue, we could introduce a shrinker
  to free empty slabs (and thus release the underlying percpu memory chunks).

Patch Sequence
==============

Patch #1 refactors shuffle_freelist() to allow the slab constructor to
fail in the next patch.

Patch #2 allows the slab constructor to fail.

Patch #3 implements the slab destructor feature.

Patch #4 converts net/sched/act_api to use the slab ctor/dtor pair. 

Patches #5 and #6 implement APIs to charge and uncharge percpu memory and
percpu counters.

Patch #7 converts mm_struct to use the slab ctor/dtor pair. 

Known issues with MM debug features
===================================

The slab destructor feature is not yet compatible with KASAN, KMEMLEAK,
and DEBUG_OBJECTS.

KASAN reports an error when a percpu counter is inserted into the
percpu_counters linked list because the counter has not been allocated
yet.

DEBUG_OBJECTS and KMEMLEAK complain when a slab object is freed while its
associated percpu memory is still allocated.

I don't expect fixing these issues to be too difficult, but I need to
think a bit more about how to fix them.

[1] https://lore.kernel.org/linux-mm/CAGudoHFc+Km-3usiy4Wdm1JkM+YjCgD9A8dDKQ06pZP070f1ig@mail.gmail.com

[2] https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab.git/log/?h=slab/for-next

[3] https://lore.kernel.org/linux-mm/20230608111408.s2minsenlcjow7q3@quack3

[4] https://lore.kernel.org/netdev/vbfmunui7dm.fsf@mellanox.com

Harry Yoo (7):
  mm/slab: refactor freelist shuffle
  treewide, slab: allow slab constructor to return an error
  mm/slab: revive the destructor feature in slab allocator
  net/sched/act_api: use slab ctor/dtor to reduce contention on pcpu
    alloc
  mm/percpu: allow (un)charging objects without alloc/free
  lib/percpu_counter: allow (un)charging percpu counters without
    alloc/free
  kernel/fork: improve exec() throughput with slab ctor/dtor pair

 arch/powerpc/include/asm/svm.h            |   2 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c    |   3 +-
 arch/powerpc/mm/init-common.c             |   3 +-
 arch/powerpc/platforms/cell/spufs/inode.c |   3 +-
 arch/powerpc/platforms/pseries/setup.c    |   2 +-
 arch/powerpc/platforms/pseries/svm.c      |   4 +-
 arch/sh/mm/pgtable.c                      |   3 +-
 arch/sparc/mm/tsb.c                       |   8 +-
 block/bdev.c                              |   3 +-
 drivers/dax/super.c                       |   3 +-
 drivers/gpu/drm/i915/i915_request.c       |   3 +-
 drivers/misc/lkdtm/heap.c                 |  12 +--
 drivers/usb/mon/mon_text.c                |   5 +-
 fs/9p/v9fs.c                              |   3 +-
 fs/adfs/super.c                           |   3 +-
 fs/affs/super.c                           |   3 +-
 fs/afs/super.c                            |   5 +-
 fs/befs/linuxvfs.c                        |   3 +-
 fs/bfs/inode.c                            |   3 +-
 fs/btrfs/inode.c                          |   3 +-
 fs/ceph/super.c                           |   3 +-
 fs/coda/inode.c                           |   3 +-
 fs/debugfs/inode.c                        |   3 +-
 fs/dlm/lowcomms.c                         |   3 +-
 fs/ecryptfs/main.c                        |   5 +-
 fs/efs/super.c                            |   3 +-
 fs/erofs/super.c                          |   3 +-
 fs/exfat/cache.c                          |   3 +-
 fs/exfat/super.c                          |   3 +-
 fs/ext2/super.c                           |   3 +-
 fs/ext4/super.c                           |   3 +-
 fs/fat/cache.c                            |   3 +-
 fs/fat/inode.c                            |   3 +-
 fs/fuse/inode.c                           |   3 +-
 fs/gfs2/main.c                            |   9 +-
 fs/hfs/super.c                            |   3 +-
 fs/hfsplus/super.c                        |   3 +-
 fs/hpfs/super.c                           |   3 +-
 fs/hugetlbfs/inode.c                      |   3 +-
 fs/inode.c                                |   3 +-
 fs/isofs/inode.c                          |   3 +-
 fs/jffs2/super.c                          |   3 +-
 fs/jfs/super.c                            |   3 +-
 fs/minix/inode.c                          |   3 +-
 fs/nfs/inode.c                            |   3 +-
 fs/nfs/nfs42xattr.c                       |   3 +-
 fs/nilfs2/super.c                         |   6 +-
 fs/ntfs3/super.c                          |   3 +-
 fs/ocfs2/dlmfs/dlmfs.c                    |   3 +-
 fs/ocfs2/super.c                          |   3 +-
 fs/openpromfs/inode.c                     |   3 +-
 fs/orangefs/super.c                       |   3 +-
 fs/overlayfs/super.c                      |   3 +-
 fs/pidfs.c                                |   3 +-
 fs/proc/inode.c                           |   3 +-
 fs/qnx4/inode.c                           |   3 +-
 fs/qnx6/inode.c                           |   3 +-
 fs/romfs/super.c                          |   3 +-
 fs/smb/client/cifsfs.c                    |   3 +-
 fs/squashfs/super.c                       |   3 +-
 fs/tracefs/inode.c                        |   3 +-
 fs/ubifs/super.c                          |   3 +-
 fs/udf/super.c                            |   3 +-
 fs/ufs/super.c                            |   3 +-
 fs/userfaultfd.c                          |   3 +-
 fs/vboxsf/super.c                         |   3 +-
 fs/xfs/xfs_super.c                        |   3 +-
 include/linux/mm_types.h                  |  40 ++++++---
 include/linux/percpu.h                    |  10 +++
 include/linux/percpu_counter.h            |   2 +
 include/linux/slab.h                      |  21 +++--
 ipc/mqueue.c                              |   3 +-
 kernel/fork.c                             |  65 +++++++++-----
 kernel/rcu/refscale.c                     |   3 +-
 lib/percpu_counter.c                      |  25 ++++++
 lib/radix-tree.c                          |   3 +-
 lib/test_meminit.c                        |   3 +-
 mm/kfence/kfence_test.c                   |   5 +-
 mm/percpu.c                               |  79 ++++++++++------
 mm/rmap.c                                 |   3 +-
 mm/shmem.c                                |   3 +-
 mm/slab.h                                 |  11 +--
 mm/slab_common.c                          |  43 +++++----
 mm/slub.c                                 | 105 ++++++++++++++++------
 net/sched/act_api.c                       |  82 +++++++++++------
 net/socket.c                              |   3 +-
 net/sunrpc/rpc_pipe.c                     |   3 +-
 security/integrity/ima/ima_iint.c         |   3 +-
 88 files changed, 518 insertions(+), 226 deletions(-)

-- 
2.43.0




* [RFC PATCH 1/7] mm/slab: refactor freelist shuffle
  2025-04-24  8:07 [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu allocator scalability problem Harry Yoo
@ 2025-04-24  8:07 ` Harry Yoo
  2025-04-24  8:07 ` [RFC PATCH 2/7] treewide, slab: allow slab constructor to return an error Harry Yoo
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 27+ messages in thread
From: Harry Yoo @ 2025-04-24  8:07 UTC (permalink / raw)
  To: Vlastimil Babka, Christoph Lameter, David Rientjes,
	Andrew Morton, Dennis Zhou, Tejun Heo, Mateusz Guzik
  Cc: Jamal Hadi Salim, Cong Wang, Jiri Pirko, Vlad Buslov,
	Yevgeny Kliteynik, Jan Kara, Byungchul Park, linux-mm, netdev,
	linux-kernel, Harry Yoo

The shuffle_freelist() function returns false when 1) it cannot shuffle
the freelist due to the lack of a random sequence, or 2) there is only a
single object.

In a later patch, I'd like shuffle_freelist() to return an error when
setup_object() fails. But with the current code, it would be hard to tell
whether the caller should still initialize the objects or have
allocate_slab() return an error.

To address this, factor the shuffle eligibility checks out into a new
should_shuffle_freelist() function and call it before shuffle_freelist().
Change the return type of shuffle_freelist() to void for now.

Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
---
 mm/slub.c | 26 +++++++++++++++-----------
 1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index dac149df1be1..95a9f04b5904 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2534,17 +2534,21 @@ static void *next_freelist_entry(struct kmem_cache *s,
 	return (char *)start + idx;
 }
 
+static bool should_shuffle_freelist(struct kmem_cache *s, struct slab *slab)
+{
+	if (slab->objects < 2 || !s->random_seq)
+		return false;
+	return true;
+}
+
 /* Shuffle the single linked freelist based on a random pre-computed sequence */
-static bool shuffle_freelist(struct kmem_cache *s, struct slab *slab)
+static void shuffle_freelist(struct kmem_cache *s, struct slab *slab)
 {
 	void *start;
 	void *cur;
 	void *next;
 	unsigned long idx, pos, page_limit, freelist_count;
 
-	if (slab->objects < 2 || !s->random_seq)
-		return false;
-
 	freelist_count = oo_objects(s->oo);
 	pos = get_random_u32_below(freelist_count);
 
@@ -2564,8 +2568,6 @@ static bool shuffle_freelist(struct kmem_cache *s, struct slab *slab)
 		cur = next;
 	}
 	set_freepointer(s, cur, NULL);
-
-	return true;
 }
 #else
 static inline int init_cache_random_seq(struct kmem_cache *s)
@@ -2573,10 +2575,13 @@ static inline int init_cache_random_seq(struct kmem_cache *s)
 	return 0;
 }
 static inline void init_freelist_randomization(void) { }
-static inline bool shuffle_freelist(struct kmem_cache *s, struct slab *slab)
+static inline bool should_shuffle_freelist(struct kmem_cache *s,
+					   struct slab *slab)
 {
 	return false;
 }
+static inline void shuffle_freelist(struct kmem_cache *s, struct slab *slab)
+{ }
 #endif /* CONFIG_SLAB_FREELIST_RANDOM */
 
 static __always_inline void account_slab(struct slab *slab, int order,
@@ -2606,7 +2611,6 @@ static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 	gfp_t alloc_gfp;
 	void *start, *p, *next;
 	int idx;
-	bool shuffle;
 
 	flags &= gfp_allowed_mask;
 
@@ -2648,9 +2652,9 @@ static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 
 	setup_slab_debug(s, slab, start);
 
-	shuffle = shuffle_freelist(s, slab);
-
-	if (!shuffle) {
+	if (should_shuffle_freelist(s, slab)) {
+		shuffle_freelist(s, slab);
+	} else {
 		start = fixup_red_left(s, start);
 		start = setup_object(s, start);
 		slab->freelist = start;
-- 
2.43.0




* [RFC PATCH 2/7] treewide, slab: allow slab constructor to return an error
  2025-04-24  8:07 [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu allocator scalability problem Harry Yoo
  2025-04-24  8:07 ` [RFC PATCH 1/7] mm/slab: refactor freelist shuffle Harry Yoo
@ 2025-04-24  8:07 ` Harry Yoo
  2025-04-24  8:07 ` [RFC PATCH 3/7] mm/slab: revive the destructor feature in slab allocator Harry Yoo
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 27+ messages in thread
From: Harry Yoo @ 2025-04-24  8:07 UTC (permalink / raw)
  To: Vlastimil Babka, Christoph Lameter, David Rientjes,
	Andrew Morton, Dennis Zhou, Tejun Heo, Mateusz Guzik
  Cc: Jamal Hadi Salim, Cong Wang, Jiri Pirko, Vlad Buslov,
	Yevgeny Kliteynik, Jan Kara, Byungchul Park, linux-mm, netdev,
	linux-kernel, Harry Yoo

From the beginning of slab until now, the slab allocator has not allowed
constructors to fail. That made sense because operations like spinlock
initialization, list initialization, and memset() cannot fail.

However, as Mateusz Guzik explains [1], allocating and freeing percpu
memory over each slab object's lifetime suffers from the global
serialization point of the percpu allocator.

Allocating and freeing percpu memory in a ctor/dtor pair instead can
significantly reduce that contention. As a first step toward that, allow
constructors to fail and update all existing users of the constructor
feature to return zero (no error).

When a constructor fails, allocate_slab() returns an error.

[1] https://lore.kernel.org/linux-mm/CAGudoHFc+Km-3usiy4Wdm1JkM+YjCgD9A8dDKQ06pZP070f1ig@mail.gmail.com

Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
---
 arch/powerpc/include/asm/svm.h            |  2 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c    |  3 +-
 arch/powerpc/mm/init-common.c             |  3 +-
 arch/powerpc/platforms/cell/spufs/inode.c |  3 +-
 arch/powerpc/platforms/pseries/setup.c    |  2 +-
 arch/powerpc/platforms/pseries/svm.c      |  4 +-
 arch/sh/mm/pgtable.c                      |  3 +-
 arch/sparc/mm/tsb.c                       |  8 ++-
 block/bdev.c                              |  3 +-
 drivers/dax/super.c                       |  3 +-
 drivers/gpu/drm/i915/i915_request.c       |  3 +-
 drivers/misc/lkdtm/heap.c                 | 12 ++--
 drivers/usb/mon/mon_text.c                |  5 +-
 fs/9p/v9fs.c                              |  3 +-
 fs/adfs/super.c                           |  3 +-
 fs/affs/super.c                           |  3 +-
 fs/afs/super.c                            |  5 +-
 fs/befs/linuxvfs.c                        |  3 +-
 fs/bfs/inode.c                            |  3 +-
 fs/btrfs/inode.c                          |  3 +-
 fs/ceph/super.c                           |  3 +-
 fs/coda/inode.c                           |  3 +-
 fs/debugfs/inode.c                        |  3 +-
 fs/dlm/lowcomms.c                         |  3 +-
 fs/ecryptfs/main.c                        |  5 +-
 fs/efs/super.c                            |  3 +-
 fs/erofs/super.c                          |  3 +-
 fs/exfat/cache.c                          |  3 +-
 fs/exfat/super.c                          |  3 +-
 fs/ext2/super.c                           |  3 +-
 fs/ext4/super.c                           |  3 +-
 fs/fat/cache.c                            |  3 +-
 fs/fat/inode.c                            |  3 +-
 fs/fuse/inode.c                           |  3 +-
 fs/gfs2/main.c                            |  9 ++-
 fs/hfs/super.c                            |  3 +-
 fs/hfsplus/super.c                        |  3 +-
 fs/hpfs/super.c                           |  3 +-
 fs/hugetlbfs/inode.c                      |  3 +-
 fs/inode.c                                |  3 +-
 fs/isofs/inode.c                          |  3 +-
 fs/jffs2/super.c                          |  3 +-
 fs/jfs/super.c                            |  3 +-
 fs/minix/inode.c                          |  3 +-
 fs/nfs/inode.c                            |  3 +-
 fs/nfs/nfs42xattr.c                       |  3 +-
 fs/nilfs2/super.c                         |  6 +-
 fs/ntfs3/super.c                          |  3 +-
 fs/ocfs2/dlmfs/dlmfs.c                    |  3 +-
 fs/ocfs2/super.c                          |  3 +-
 fs/openpromfs/inode.c                     |  3 +-
 fs/orangefs/super.c                       |  3 +-
 fs/overlayfs/super.c                      |  3 +-
 fs/pidfs.c                                |  3 +-
 fs/proc/inode.c                           |  3 +-
 fs/qnx4/inode.c                           |  3 +-
 fs/qnx6/inode.c                           |  3 +-
 fs/romfs/super.c                          |  3 +-
 fs/smb/client/cifsfs.c                    |  3 +-
 fs/squashfs/super.c                       |  3 +-
 fs/tracefs/inode.c                        |  3 +-
 fs/ubifs/super.c                          |  3 +-
 fs/udf/super.c                            |  3 +-
 fs/ufs/super.c                            |  3 +-
 fs/userfaultfd.c                          |  3 +-
 fs/vboxsf/super.c                         |  3 +-
 fs/xfs/xfs_super.c                        |  3 +-
 include/linux/slab.h                      | 11 ++--
 ipc/mqueue.c                              |  3 +-
 kernel/fork.c                             |  3 +-
 kernel/rcu/refscale.c                     |  3 +-
 lib/radix-tree.c                          |  3 +-
 lib/test_meminit.c                        |  3 +-
 mm/kfence/kfence_test.c                   |  5 +-
 mm/rmap.c                                 |  3 +-
 mm/shmem.c                                |  3 +-
 mm/slab.h                                 |  6 +-
 mm/slab_common.c                          |  4 +-
 mm/slub.c                                 | 68 ++++++++++++++++-------
 net/socket.c                              |  3 +-
 net/sunrpc/rpc_pipe.c                     |  3 +-
 security/integrity/ima/ima_iint.c         |  3 +-
 82 files changed, 233 insertions(+), 120 deletions(-)

diff --git a/arch/powerpc/include/asm/svm.h b/arch/powerpc/include/asm/svm.h
index a02bd54b8948..0ebfbcd212cb 100644
--- a/arch/powerpc/include/asm/svm.h
+++ b/arch/powerpc/include/asm/svm.h
@@ -17,7 +17,7 @@ static inline bool is_secure_guest(void)
 	return mfmsr() & MSR_S;
 }
 
-void dtl_cache_ctor(void *addr);
+int dtl_cache_ctor(void *addr);
 #define get_dtl_cache_ctor()	(is_secure_guest() ? dtl_cache_ctor : NULL)
 
 #else /* CONFIG_PPC_SVM */
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index b3e6e73d6a08..9d6171d6db65 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -1230,9 +1230,10 @@ int kvmppc_init_vm_radix(struct kvm *kvm)
 	return 0;
 }
 
-static void pte_ctor(void *addr)
+static int pte_ctor(void *addr)
 {
 	memset(addr, 0, RADIX_PTE_TABLE_SIZE);
+	return 0;
 }
 
 static void pmd_ctor(void *addr)
diff --git a/arch/powerpc/mm/init-common.c b/arch/powerpc/mm/init-common.c
index 745097554bea..4eb2cc44fa86 100644
--- a/arch/powerpc/mm/init-common.c
+++ b/arch/powerpc/mm/init-common.c
@@ -72,9 +72,10 @@ void setup_kup(void)
 	setup_kuep(disable_kuep);
 }
 
-#define CTOR(shift) static void ctor_##shift(void *addr) \
+#define CTOR(shift) static int ctor_##shift(void *addr) \
 {							\
 	memset(addr, 0, sizeof(pgd_t) << (shift));	\
+	return 0;					\
 }
 
 CTOR(0); CTOR(1); CTOR(2); CTOR(3); CTOR(4); CTOR(5); CTOR(6); CTOR(7);
diff --git a/arch/powerpc/platforms/cell/spufs/inode.c b/arch/powerpc/platforms/cell/spufs/inode.c
index 9f9e4b871627..c654e95431fa 100644
--- a/arch/powerpc/platforms/cell/spufs/inode.c
+++ b/arch/powerpc/platforms/cell/spufs/inode.c
@@ -65,12 +65,13 @@ static void spufs_free_inode(struct inode *inode)
 	kmem_cache_free(spufs_inode_cache, SPUFS_I(inode));
 }
 
-static void
+static int
 spufs_init_once(void *p)
 {
 	struct spufs_inode_info *ei = p;
 
 	inode_init_once(&ei->vfs_inode);
+	return 0;
 }
 
 static struct inode *
diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
index b10a25325238..61f07df96f99 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -341,7 +341,7 @@ static inline int alloc_dispatch_logs(void)
 
 static int alloc_dispatch_log_kmem_cache(void)
 {
-	void (*ctor)(void *) = get_dtl_cache_ctor();
+	int (*ctor)(void *) = get_dtl_cache_ctor();
 
 	dtl_cache = kmem_cache_create_usercopy("dtl", DISPATCH_LOG_BYTES,
 						DISPATCH_LOG_BYTES, 0, 0, DISPATCH_LOG_BYTES, ctor);
diff --git a/arch/powerpc/platforms/pseries/svm.c b/arch/powerpc/platforms/pseries/svm.c
index 384c9dc1899a..06af254bdec7 100644
--- a/arch/powerpc/platforms/pseries/svm.c
+++ b/arch/powerpc/platforms/pseries/svm.c
@@ -81,7 +81,7 @@ static bool is_dtl_page_shared(struct page *page)
 	return false;
 }
 
-void dtl_cache_ctor(void *addr)
+int dtl_cache_ctor(void *addr)
 {
 	unsigned long pfn = PHYS_PFN(__pa(addr));
 	struct page *page = pfn_to_page(pfn);
@@ -92,4 +92,6 @@ void dtl_cache_ctor(void *addr)
 		WARN_ON(dtl_nr_pages >= NR_DTL_PAGE);
 		uv_share_page(pfn, 1);
 	}
+
+	return 0;
 }
diff --git a/arch/sh/mm/pgtable.c b/arch/sh/mm/pgtable.c
index 3a4085ea0161..ec3ea909b3bd 100644
--- a/arch/sh/mm/pgtable.c
+++ b/arch/sh/mm/pgtable.c
@@ -9,7 +9,7 @@ static struct kmem_cache *pgd_cachep;
 static struct kmem_cache *pmd_cachep;
 #endif
 
-static void pgd_ctor(void *x)
+static int pgd_ctor(void *x)
 {
 	pgd_t *pgd = x;
 
@@ -17,6 +17,7 @@ static void pgd_ctor(void *x)
 	memcpy(pgd + USER_PTRS_PER_PGD,
 	       swapper_pg_dir + USER_PTRS_PER_PGD,
 	       (PTRS_PER_PGD - USER_PTRS_PER_PGD) * sizeof(pgd_t));
+	return 0;
 }
 
 void pgtable_cache_init(void)
diff --git a/arch/sparc/mm/tsb.c b/arch/sparc/mm/tsb.c
index 5fe52a64c7e7..53b555ee9f7e 100644
--- a/arch/sparc/mm/tsb.c
+++ b/arch/sparc/mm/tsb.c
@@ -338,6 +338,12 @@ static const char *tsb_cache_names[8] = {
 	"tsb_1MB",
 };
 
+static inline int pgtable_ctor(void *objp)
+{
+	_clear_page(objp);
+	return 0;
+}
+
 void __init pgtable_cache_init(void)
 {
 	unsigned long i;
@@ -345,7 +351,7 @@ void __init pgtable_cache_init(void)
 	pgtable_cache = kmem_cache_create("pgtable_cache",
 					  PAGE_SIZE, PAGE_SIZE,
 					  0,
-					  _clear_page);
+					  pgtable_ctor);
 	if (!pgtable_cache) {
 		prom_printf("pgtable_cache_init(): Could not create!\n");
 		prom_halt();
diff --git a/block/bdev.c b/block/bdev.c
index 4844d1e27b6f..de4c231fbb04 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -356,11 +356,12 @@ static void bdev_free_inode(struct inode *inode)
 	kmem_cache_free(bdev_cachep, BDEV_I(inode));
 }
 
-static void init_once(void *data)
+static int init_once(void *data)
 {
 	struct bdev_inode *ei = data;
 
 	inode_init_once(&ei->vfs_inode);
+	return 0;
 }
 
 static void bdev_evict_inode(struct inode *inode)
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index e16d1d40d773..3b3bf03d10cf 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -532,13 +532,14 @@ void *dax_get_private(struct dax_device *dax_dev)
 }
 EXPORT_SYMBOL_GPL(dax_get_private);
 
-static void init_once(void *_dax_dev)
+static int init_once(void *_dax_dev)
 {
 	struct dax_device *dax_dev = _dax_dev;
 	struct inode *inode = &dax_dev->inode;
 
 	memset(dax_dev, 0, sizeof(*dax_dev));
 	inode_init_once(inode);
+	return 0;
 }
 
 static int dax_fs_init(void)
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index c3d27eadc0a7..70fbd2b34240 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -869,7 +869,7 @@ request_alloc_slow(struct intel_timeline *tl,
 	return kmem_cache_alloc(slab_requests, gfp);
 }
 
-static void __i915_request_ctor(void *arg)
+static int __i915_request_ctor(void *arg)
 {
 	struct i915_request *rq = arg;
 
@@ -882,6 +882,7 @@ static void __i915_request_ctor(void *arg)
 	rq->batch_res = NULL;
 
 	init_llist_head(&rq->execute_cb);
+	return 0;
 }
 
 #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
diff --git a/drivers/misc/lkdtm/heap.c b/drivers/misc/lkdtm/heap.c
index b1b316f99703..0a821503bb31 100644
--- a/drivers/misc/lkdtm/heap.c
+++ b/drivers/misc/lkdtm/heap.c
@@ -359,12 +359,12 @@ static void lkdtm_SLAB_FREE_PAGE(void)
  * We have constructors to keep the caches distinctly separated without
  * needing to boot with "slab_nomerge".
  */
-static void ctor_double_free(void *region)
-{ }
-static void ctor_a(void *region)
-{ }
-static void ctor_b(void *region)
-{ }
+static int ctor_double_free(void *region)
+{ return 0; }
+static int ctor_a(void *region)
+{ return 0; }
+static int ctor_b(void *region)
+{ return 0; }
 
 void __init lkdtm_heap_init(void)
 {
diff --git a/drivers/usb/mon/mon_text.c b/drivers/usb/mon/mon_text.c
index 68b9b2b41189..ef754be94b18 100644
--- a/drivers/usb/mon/mon_text.c
+++ b/drivers/usb/mon/mon_text.c
@@ -95,7 +95,7 @@ struct mon_reader_text {
 
 static struct dentry *mon_dir;		/* Usually /sys/kernel/debug/usbmon */
 
-static void mon_text_ctor(void *);
+static int mon_text_ctor(void *);
 
 struct mon_text_ptr {
 	int cnt, limit;
@@ -732,13 +732,14 @@ void mon_text_del(struct mon_bus *mbus)
 /*
  * Slab interface: constructor.
  */
-static void mon_text_ctor(void *mem)
+static int mon_text_ctor(void *mem)
 {
 	/*
 	 * Nothing to initialize. No, really!
 	 * So, we fill it with garbage to emulate a reused object.
 	 */
 	memset(mem, 0xe5, sizeof(struct mon_event_text));
+	return 0;
 }
 
 int __init mon_text_init(void)
diff --git a/fs/9p/v9fs.c b/fs/9p/v9fs.c
index 77e9c4387c1d..a11a39d369c5 100644
--- a/fs/9p/v9fs.c
+++ b/fs/9p/v9fs.c
@@ -620,12 +620,13 @@ static void v9fs_sysfs_cleanup(void)
 	kobject_put(v9fs_kobj);
 }
 
-static void v9fs_inode_init_once(void *foo)
+static int v9fs_inode_init_once(void *foo)
 {
 	struct v9fs_inode *v9inode = (struct v9fs_inode *)foo;
 
 	memset(&v9inode->qid, 0, sizeof(v9inode->qid));
 	inode_init_once(&v9inode->netfs.inode);
+	return 0;
 }
 
 /**
diff --git a/fs/adfs/super.c b/fs/adfs/super.c
index 017c48a80203..8b9df1dbfed5 100644
--- a/fs/adfs/super.c
+++ b/fs/adfs/super.c
@@ -212,11 +212,12 @@ static int adfs_drop_inode(struct inode *inode)
 	return !IS_ENABLED(CONFIG_ADFS_FS_RW) || IS_RDONLY(inode);
 }
 
-static void init_once(void *foo)
+static int init_once(void *foo)
 {
 	struct adfs_inode_info *ei = (struct adfs_inode_info *) foo;
 
 	inode_init_once(&ei->vfs_inode);
+	return 0;
 }
 
 static int __init init_inodecache(void)
diff --git a/fs/affs/super.c b/fs/affs/super.c
index 2fa40337776d..8cff0659d8f1 100644
--- a/fs/affs/super.c
+++ b/fs/affs/super.c
@@ -117,13 +117,14 @@ static void affs_free_inode(struct inode *inode)
 	kmem_cache_free(affs_inode_cachep, AFFS_I(inode));
 }
 
-static void init_once(void *foo)
+static int init_once(void *foo)
 {
 	struct affs_inode_info *ei = (struct affs_inode_info *) foo;
 
 	mutex_init(&ei->i_link_lock);
 	mutex_init(&ei->i_ext_lock);
 	inode_init_once(&ei->vfs_inode);
+	return 0;
 }
 
 static int __init init_inodecache(void)
diff --git a/fs/afs/super.c b/fs/afs/super.c
index 25b306db6992..e6a4473cf113 100644
--- a/fs/afs/super.c
+++ b/fs/afs/super.c
@@ -29,7 +29,7 @@
 #include <net/net_namespace.h>
 #include "internal.h"
 
-static void afs_i_init_once(void *foo);
+static int afs_i_init_once(void *foo);
 static void afs_kill_super(struct super_block *sb);
 static struct inode *afs_alloc_inode(struct super_block *sb);
 static void afs_destroy_inode(struct inode *inode);
@@ -646,7 +646,7 @@ static int afs_init_fs_context(struct fs_context *fc)
  * afs_alloc_inode() *must* reset anything that could incorrectly leak from one
  * inode to another.
  */
-static void afs_i_init_once(void *_vnode)
+static int afs_i_init_once(void *_vnode)
 {
 	struct afs_vnode *vnode = _vnode;
 
@@ -662,6 +662,7 @@ static void afs_i_init_once(void *_vnode)
 	INIT_DELAYED_WORK(&vnode->lock_work, afs_lock_work);
 	INIT_LIST_HEAD(&vnode->cb_mmap_link);
 	seqlock_init(&vnode->cb_lock);
+	return 0;
 }
 
 /*
diff --git a/fs/befs/linuxvfs.c b/fs/befs/linuxvfs.c
index 8f430ff8e445..bb94680b0ca8 100644
--- a/fs/befs/linuxvfs.c
+++ b/fs/befs/linuxvfs.c
@@ -287,11 +287,12 @@ static void befs_free_inode(struct inode *inode)
 	kmem_cache_free(befs_inode_cachep, BEFS_I(inode));
 }
 
-static void init_once(void *foo)
+static int init_once(void *foo)
 {
 	struct befs_inode_info *bi = (struct befs_inode_info *) foo;
 
 	inode_init_once(&bi->vfs_inode);
+	return 0;
 }
 
 static struct inode *befs_iget(struct super_block *sb, unsigned long ino)
diff --git a/fs/bfs/inode.c b/fs/bfs/inode.c
index db81570c9637..c8eaf9d36507 100644
--- a/fs/bfs/inode.c
+++ b/fs/bfs/inode.c
@@ -247,11 +247,12 @@ static void bfs_free_inode(struct inode *inode)
 	kmem_cache_free(bfs_inode_cachep, BFS_I(inode));
 }
 
-static void init_once(void *foo)
+static int init_once(void *foo)
 {
 	struct bfs_inode_info *bi = foo;
 
 	inode_init_once(&bi->vfs_inode);
+	return 0;
 }
 
 static int __init init_inodecache(void)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index cc67d1a2d611..dc17d60ec78d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7871,11 +7871,12 @@ int btrfs_drop_inode(struct inode *inode)
 		return generic_drop_inode(inode);
 }
 
-static void init_once(void *foo)
+static int init_once(void *foo)
 {
 	struct btrfs_inode *ei = foo;
 
 	inode_init_once(&ei->vfs_inode);
+	return 0;
 }
 
 void __cold btrfs_destroy_cachep(void)
diff --git a/fs/ceph/super.c b/fs/ceph/super.c
index f3951253e393..e627c5f975c4 100644
--- a/fs/ceph/super.c
+++ b/fs/ceph/super.c
@@ -926,10 +926,11 @@ struct kmem_cache *ceph_dir_file_cachep;
 struct kmem_cache *ceph_mds_request_cachep;
 mempool_t *ceph_wb_pagevec_pool;
 
-static void ceph_inode_init_once(void *foo)
+static int ceph_inode_init_once(void *foo)
 {
 	struct ceph_inode_info *ci = foo;
 	inode_init_once(&ci->netfs.inode);
+	return 0;
 }
 
 static int __init init_caches(void)
diff --git a/fs/coda/inode.c b/fs/coda/inode.c
index 6896fce122e1..4580ea6ae053 100644
--- a/fs/coda/inode.c
+++ b/fs/coda/inode.c
@@ -61,11 +61,12 @@ static void coda_free_inode(struct inode *inode)
 	kmem_cache_free(coda_inode_cachep, ITOC(inode));
 }
 
-static void init_once(void *foo)
+static int init_once(void *foo)
 {
 	struct coda_inode_info *ei = (struct coda_inode_info *) foo;
 
 	inode_init_once(&ei->vfs_inode);
+	return 0;
 }
 
 int __init coda_init_inodecache(void)
diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c
index 75715d8877ee..b96b74b624ec 100644
--- a/fs/debugfs/inode.c
+++ b/fs/debugfs/inode.c
@@ -210,10 +210,11 @@ static int debugfs_show_options(struct seq_file *m, struct dentry *root)
 
 static struct kmem_cache *debugfs_inode_cachep __ro_after_init;
 
-static void init_once(void *foo)
+static int init_once(void *foo)
 {
 	struct debugfs_inode_info *info = foo;
 	inode_init_once(&info->vfs_inode);
+	return 0;
 }
 
 static struct inode *debugfs_alloc_inode(struct super_block *sb)
diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
index 70abd4da17a6..6a4f7a68f34c 100644
--- a/fs/dlm/lowcomms.c
+++ b/fs/dlm/lowcomms.c
@@ -232,11 +232,12 @@ static void lowcomms_queue_rwork(struct connection *con)
 		queue_work(io_workqueue, &con->rwork);
 }
 
-static void writequeue_entry_ctor(void *data)
+static int writequeue_entry_ctor(void *data)
 {
 	struct writequeue_entry *entry = data;
 
 	INIT_LIST_HEAD(&entry->msgs);
+	return 0;
 }
 
 struct kmem_cache *dlm_lowcomms_writequeue_cache_create(void)
diff --git a/fs/ecryptfs/main.c b/fs/ecryptfs/main.c
index 8dd1d7189c3b..7afc008774e7 100644
--- a/fs/ecryptfs/main.c
+++ b/fs/ecryptfs/main.c
@@ -641,12 +641,13 @@ MODULE_ALIAS_FS("ecryptfs");
  *
  * Initializes the ecryptfs_inode_info_cache when it is created
  */
-static void
+static int
 inode_info_init_once(void *vptr)
 {
 	struct ecryptfs_inode_info *ei = (struct ecryptfs_inode_info *)vptr;
 
 	inode_init_once(&ei->vfs_inode);
+	return 0;
 }
 
 static struct ecryptfs_cache_info {
@@ -654,7 +655,7 @@ static struct ecryptfs_cache_info {
 	const char *name;
 	size_t size;
 	slab_flags_t flags;
-	void (*ctor)(void *obj);
+	int (*ctor)(void *obj);
 } ecryptfs_cache_infos[] = {
 	{
 		.cache = &ecryptfs_auth_tok_list_item_cache,
diff --git a/fs/efs/super.c b/fs/efs/super.c
index c59086b7eabf..09d4f5e0710b 100644
--- a/fs/efs/super.c
+++ b/fs/efs/super.c
@@ -76,11 +76,12 @@ static void efs_free_inode(struct inode *inode)
 	kmem_cache_free(efs_inode_cachep, INODE_INFO(inode));
 }
 
-static void init_once(void *foo)
+static int init_once(void *foo)
 {
 	struct efs_inode_info *ei = (struct efs_inode_info *) foo;
 
 	inode_init_once(&ei->vfs_inode);
+	return 0;
 }
 
 static int __init init_inodecache(void)
diff --git a/fs/erofs/super.c b/fs/erofs/super.c
index cadec6b1b554..09b84a549c64 100644
--- a/fs/erofs/super.c
+++ b/fs/erofs/super.c
@@ -56,11 +56,12 @@ static int erofs_superblock_csum_verify(struct super_block *sb, void *sbdata)
 	return -EBADMSG;
 }
 
-static void erofs_inode_init_once(void *ptr)
+static int erofs_inode_init_once(void *ptr)
 {
 	struct erofs_inode *vi = ptr;
 
 	inode_init_once(&vi->vfs_inode);
+	return 0;
 }
 
 static struct inode *erofs_alloc_inode(struct super_block *sb)
diff --git a/fs/exfat/cache.c b/fs/exfat/cache.c
index d5ce0ae660ba..e8b2fffc60b4 100644
--- a/fs/exfat/cache.c
+++ b/fs/exfat/cache.c
@@ -35,11 +35,12 @@ struct exfat_cache_id {
 
 static struct kmem_cache *exfat_cachep;
 
-static void exfat_cache_init_once(void *c)
+static int exfat_cache_init_once(void *c)
 {
 	struct exfat_cache *cache = (struct exfat_cache *)c;
 
 	INIT_LIST_HEAD(&cache->cache_list);
+	return 0;
 }
 
 int exfat_cache_init(void)
diff --git a/fs/exfat/super.c b/fs/exfat/super.c
index 8465033a6cf0..946b50e72aea 100644
--- a/fs/exfat/super.c
+++ b/fs/exfat/super.c
@@ -818,7 +818,7 @@ static struct file_system_type exfat_fs_type = {
 	.fs_flags		= FS_REQUIRES_DEV | FS_ALLOW_IDMAP,
 };
 
-static void exfat_inode_init_once(void *foo)
+static int exfat_inode_init_once(void *foo)
 {
 	struct exfat_inode_info *ei = (struct exfat_inode_info *)foo;
 
@@ -828,6 +828,7 @@ static void exfat_inode_init_once(void *foo)
 	INIT_LIST_HEAD(&ei->cache_lru);
 	INIT_HLIST_NODE(&ei->i_hash_fat);
 	inode_init_once(&ei->vfs_inode);
+	return 0;
 }
 
 static int __init init_exfat_fs(void)
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 28ff47ec4be6..7a4a6d6c069b 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -224,7 +224,7 @@ static void ext2_free_in_core_inode(struct inode *inode)
 	kmem_cache_free(ext2_inode_cachep, EXT2_I(inode));
 }
 
-static void init_once(void *foo)
+static int init_once(void *foo)
 {
 	struct ext2_inode_info *ei = (struct ext2_inode_info *) foo;
 
@@ -234,6 +234,7 @@ static void init_once(void *foo)
 #endif
 	mutex_init(&ei->truncate_mutex);
 	inode_init_once(&ei->vfs_inode);
+	return 0;
 }
 
 static int __init init_inodecache(void)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 181934499624..7616ab697cdf 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1465,7 +1465,7 @@ static void ext4_shutdown(struct super_block *sb)
        ext4_force_shutdown(sb, EXT4_GOING_FLAGS_NOLOGFLUSH);
 }
 
-static void init_once(void *foo)
+static int init_once(void *foo)
 {
 	struct ext4_inode_info *ei = foo;
 
@@ -1474,6 +1474,7 @@ static void init_once(void *foo)
 	init_rwsem(&ei->i_data_sem);
 	inode_init_once(&ei->vfs_inode);
 	ext4_fc_init_inode(&ei->vfs_inode);
+	return 0;
 }
 
 static int __init init_inodecache(void)
diff --git a/fs/fat/cache.c b/fs/fat/cache.c
index 2af424e200b3..9478c8ca35e8 100644
--- a/fs/fat/cache.c
+++ b/fs/fat/cache.c
@@ -36,11 +36,12 @@ static inline int fat_max_cache(struct inode *inode)
 
 static struct kmem_cache *fat_cache_cachep;
 
-static void init_once(void *foo)
+static int init_once(void *foo)
 {
 	struct fat_cache *cache = (struct fat_cache *)foo;
 
 	INIT_LIST_HEAD(&cache->cache_list);
+	return 0;
 }
 
 int __init fat_cache_init(void)
diff --git a/fs/fat/inode.c b/fs/fat/inode.c
index 3852bb66358c..0576d7cda07d 100644
--- a/fs/fat/inode.c
+++ b/fs/fat/inode.c
@@ -767,7 +767,7 @@ static void fat_free_inode(struct inode *inode)
 	kmem_cache_free(fat_inode_cachep, MSDOS_I(inode));
 }
 
-static void init_once(void *foo)
+static int init_once(void *foo)
 {
 	struct msdos_inode_info *ei = (struct msdos_inode_info *)foo;
 
@@ -778,6 +778,7 @@ static void init_once(void *foo)
 	INIT_HLIST_NODE(&ei->i_fat_hash);
 	INIT_HLIST_NODE(&ei->i_dir_hash);
 	inode_init_once(&ei->vfs_inode);
+	return 0;
 }
 
 static int __init fat_init_inodecache(void)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index fd48e8d37f2e..358f47f091c1 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -2131,11 +2131,12 @@ static inline void unregister_fuseblk(void)
 }
 #endif
 
-static void fuse_inode_init_once(void *foo)
+static int fuse_inode_init_once(void *foo)
 {
 	struct inode *inode = foo;
 
 	inode_init_once(inode);
+	return 0;
 }
 
 static int __init fuse_fs_init(void)
diff --git a/fs/gfs2/main.c b/fs/gfs2/main.c
index 0727f60ad028..0923a091d926 100644
--- a/fs/gfs2/main.c
+++ b/fs/gfs2/main.c
@@ -31,7 +31,7 @@
 
 struct workqueue_struct *gfs2_control_wq;
 
-static void gfs2_init_inode_once(void *foo)
+static int gfs2_init_inode_once(void *foo)
 {
 	struct gfs2_inode *ip = foo;
 
@@ -45,9 +45,10 @@ static void gfs2_init_inode_once(void *foo)
 	RB_CLEAR_NODE(&ip->i_res.rs_node);
 	ip->i_hash_cache = NULL;
 	gfs2_holder_mark_uninitialized(&ip->i_iopen_gh);
+	return 0;
 }
 
-static void gfs2_init_glock_once(void *foo)
+static int gfs2_init_glock_once(void *foo)
 {
 	struct gfs2_glock *gl = foo;
 
@@ -56,14 +57,16 @@ static void gfs2_init_glock_once(void *foo)
 	INIT_LIST_HEAD(&gl->gl_ail_list);
 	atomic_set(&gl->gl_ail_count, 0);
 	atomic_set(&gl->gl_revokes, 0);
+	return 0;
 }
 
-static void gfs2_init_gl_aspace_once(void *foo)
+static int gfs2_init_gl_aspace_once(void *foo)
 {
 	struct gfs2_glock_aspace *gla = foo;
 
 	gfs2_init_glock_once(&gla->glock);
 	address_space_init_once(&gla->mapping);
+	return 0;
 }
 
 /**
diff --git a/fs/hfs/super.c b/fs/hfs/super.c
index fe09c2093a93..ab8bd6c895c1 100644
--- a/fs/hfs/super.c
+++ b/fs/hfs/super.c
@@ -436,11 +436,12 @@ static struct file_system_type hfs_fs_type = {
 };
 MODULE_ALIAS_FS("hfs");
 
-static void hfs_init_once(void *p)
+static int hfs_init_once(void *p)
 {
 	struct hfs_inode_info *i = p;
 
 	inode_init_once(&i->vfs_inode);
+	return 0;
 }
 
 static int __init init_hfs_fs(void)
diff --git a/fs/hfsplus/super.c b/fs/hfsplus/super.c
index 948b8aaee33e..b390381e837f 100644
--- a/fs/hfsplus/super.c
+++ b/fs/hfsplus/super.c
@@ -678,11 +678,12 @@ static struct file_system_type hfsplus_fs_type = {
 };
 MODULE_ALIAS_FS("hfsplus");
 
-static void hfsplus_init_once(void *p)
+static int hfsplus_init_once(void *p)
 {
 	struct hfsplus_inode_info *i = p;
 
 	inode_init_once(&i->vfs_inode);
+	return 0;
 }
 
 static int __init init_hfsplus_fs(void)
diff --git a/fs/hpfs/super.c b/fs/hpfs/super.c
index 27567920abe4..1af5a9f5f931 100644
--- a/fs/hpfs/super.c
+++ b/fs/hpfs/super.c
@@ -244,11 +244,12 @@ static void hpfs_free_inode(struct inode *inode)
 	kmem_cache_free(hpfs_inode_cachep, hpfs_i(inode));
 }
 
-static void init_once(void *foo)
+static int init_once(void *foo)
 {
 	struct hpfs_inode_info *ei = (struct hpfs_inode_info *) foo;
 
 	inode_init_once(&ei->vfs_inode);
+	return 0;
 }
 
 static int init_inodecache(void)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index e4de5425838d..c0513bb7ed88 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -1225,11 +1225,12 @@ static const struct address_space_operations hugetlbfs_aops = {
 };
 
 
-static void init_once(void *foo)
+static int init_once(void *foo)
 {
 	struct hugetlbfs_inode_info *ei = foo;
 
 	inode_init_once(&ei->vfs_inode);
+	return 0;
 }
 
 static const struct file_operations hugetlbfs_file_operations = {
diff --git a/fs/inode.c b/fs/inode.c
index 99318b157a9a..d1b73cbda4f5 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -514,11 +514,12 @@ void inode_init_once(struct inode *inode)
 }
 EXPORT_SYMBOL(inode_init_once);
 
-static void init_once(void *foo)
+static int init_once(void *foo)
 {
 	struct inode *inode = (struct inode *) foo;
 
 	inode_init_once(inode);
+	return 0;
 }
 
 /*
diff --git a/fs/isofs/inode.c b/fs/isofs/inode.c
index 47038e660812..0251eefdb18f 100644
--- a/fs/isofs/inode.c
+++ b/fs/isofs/inode.c
@@ -82,11 +82,12 @@ static void isofs_free_inode(struct inode *inode)
 	kmem_cache_free(isofs_inode_cachep, ISOFS_I(inode));
 }
 
-static void init_once(void *foo)
+static int init_once(void *foo)
 {
 	struct iso_inode_info *ei = foo;
 
 	inode_init_once(&ei->vfs_inode);
+	return 0;
 }
 
 static int __init init_inodecache(void)
diff --git a/fs/jffs2/super.c b/fs/jffs2/super.c
index 4545f885c41e..7249f6ad5b0a 100644
--- a/fs/jffs2/super.c
+++ b/fs/jffs2/super.c
@@ -53,13 +53,14 @@ static void jffs2_free_inode(struct inode *inode)
 	kmem_cache_free(jffs2_inode_cachep, f);
 }
 
-static void jffs2_i_init_once(void *foo)
+static int jffs2_i_init_once(void *foo)
 {
 	struct jffs2_inode_info *f = foo;
 
 	mutex_init(&f->sem);
 	f->target = NULL;
 	inode_init_once(&f->vfs_inode);
+	return 0;
 }
 
 static const char *jffs2_compr_name(unsigned int compr)
diff --git a/fs/jfs/super.c b/fs/jfs/super.c
index 10368c188c5e..f9bae8adfc18 100644
--- a/fs/jfs/super.c
+++ b/fs/jfs/super.c
@@ -934,7 +934,7 @@ static struct file_system_type jfs_fs_type = {
 };
 MODULE_ALIAS_FS("jfs");
 
-static void init_once(void *foo)
+static int init_once(void *foo)
 {
 	struct jfs_inode_info *jfs_ip = (struct jfs_inode_info *) foo;
 
@@ -946,6 +946,7 @@ static void init_once(void *foo)
 	spin_lock_init(&jfs_ip->ag_lock);
 	jfs_ip->active_ag = -1;
 	inode_init_once(&jfs_ip->vfs_inode);
+	return 0;
 }
 
 static int __init init_jfs_fs(void)
diff --git a/fs/minix/inode.c b/fs/minix/inode.c
index f007e389d5d2..dc13b652754b 100644
--- a/fs/minix/inode.c
+++ b/fs/minix/inode.c
@@ -75,11 +75,12 @@ static void minix_free_in_core_inode(struct inode *inode)
 	kmem_cache_free(minix_inode_cachep, minix_i(inode));
 }
 
-static void init_once(void *foo)
+static int init_once(void *foo)
 {
 	struct minix_inode_info *ei = (struct minix_inode_info *) foo;
 
 	inode_init_once(&ei->vfs_inode);
+	return 0;
 }
 
 static int __init init_inodecache(void)
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 119e447758b9..274cdad1c13e 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -2456,7 +2456,7 @@ static inline void nfs4_init_once(struct nfs_inode *nfsi)
 #endif
 }
 
-static void init_once(void *foo)
+static int init_once(void *foo)
 {
 	struct nfs_inode *nfsi = foo;
 
@@ -2465,6 +2465,7 @@ static void init_once(void *foo)
 	INIT_LIST_HEAD(&nfsi->access_cache_entry_lru);
 	INIT_LIST_HEAD(&nfsi->access_cache_inode_lru);
 	nfs4_init_once(nfsi);
+	return 0;
 }
 
 static int __init nfs_init_inodecache(void)
diff --git a/fs/nfs/nfs42xattr.c b/fs/nfs/nfs42xattr.c
index 37d79400e5f4..22e5e9751656 100644
--- a/fs/nfs/nfs42xattr.c
+++ b/fs/nfs/nfs42xattr.c
@@ -960,7 +960,7 @@ nfs4_xattr_entry_count(struct shrinker *shrink, struct shrink_control *sc)
 }
 
 
-static void nfs4_xattr_cache_init_once(void *p)
+static int nfs4_xattr_cache_init_once(void *p)
 {
 	struct nfs4_xattr_cache *cache = p;
 
@@ -970,6 +970,7 @@ static void nfs4_xattr_cache_init_once(void *p)
 	cache->listxattr = NULL;
 	INIT_LIST_HEAD(&cache->lru);
 	INIT_LIST_HEAD(&cache->dispose);
+	return 0;
 }
 
 typedef unsigned long (*count_objects_cb)(struct shrinker *s,
diff --git a/fs/nilfs2/super.c b/fs/nilfs2/super.c
index badc2cbc895e..415381d9e5f9 100644
--- a/fs/nilfs2/super.c
+++ b/fs/nilfs2/super.c
@@ -1313,7 +1313,7 @@ struct file_system_type nilfs_fs_type = {
 };
 MODULE_ALIAS_FS("nilfs2");
 
-static void nilfs_inode_init_once(void *obj)
+static int nilfs_inode_init_once(void *obj)
 {
 	struct nilfs_inode_info *ii = obj;
 
@@ -1322,11 +1322,13 @@ static void nilfs_inode_init_once(void *obj)
 	init_rwsem(&ii->xattr_sem);
 #endif
 	inode_init_once(&ii->vfs_inode);
+	return 0;
 }
 
-static void nilfs_segbuf_init_once(void *obj)
+static int nilfs_segbuf_init_once(void *obj)
 {
 	memset(obj, 0, sizeof(struct nilfs_segment_buffer));
+	return 0;
 }
 
 static void nilfs_destroy_cachep(void)
diff --git a/fs/ntfs3/super.c b/fs/ntfs3/super.c
index 920a1ab47b63..8dcfe06ba996 100644
--- a/fs/ntfs3/super.c
+++ b/fs/ntfs3/super.c
@@ -628,11 +628,12 @@ static void ntfs_free_inode(struct inode *inode)
 	kmem_cache_free(ntfs_inode_cachep, ni);
 }
 
-static void init_once(void *foo)
+static int init_once(void *foo)
 {
 	struct ntfs_inode *ni = foo;
 
 	inode_init_once(&ni->vfs_inode);
+	return 0;
 }
 
 /*
diff --git a/fs/ocfs2/dlmfs/dlmfs.c b/fs/ocfs2/dlmfs/dlmfs.c
index 5130ec44e5e1..3c044f94970f 100644
--- a/fs/ocfs2/dlmfs/dlmfs.c
+++ b/fs/ocfs2/dlmfs/dlmfs.c
@@ -265,7 +265,7 @@ static ssize_t dlmfs_file_write(struct file *filp,
 	return count;
 }
 
-static void dlmfs_init_once(void *foo)
+static int dlmfs_init_once(void *foo)
 {
 	struct dlmfs_inode_private *ip =
 		(struct dlmfs_inode_private *) foo;
@@ -274,6 +274,7 @@ static void dlmfs_init_once(void *foo)
 	ip->ip_parent = NULL;
 
 	inode_init_once(&ip->ip_vfs_inode);
+	return 0;
 }
 
 static struct inode *dlmfs_alloc_inode(struct super_block *sb)
diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index 8bb5022f3082..b33185567db9 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -1614,7 +1614,7 @@ static int ocfs2_statfs(struct dentry *dentry, struct kstatfs *buf)
 	return status;
 }
 
-static void ocfs2_inode_init_once(void *data)
+static int ocfs2_inode_init_once(void *data)
 {
 	struct ocfs2_inode_info *oi = data;
 
@@ -1643,6 +1643,7 @@ static void ocfs2_inode_init_once(void *data)
 				  &ocfs2_inode_caching_ops);
 
 	inode_init_once(&oi->vfs_inode);
+	return 0;
 }
 
 static int ocfs2_initialize_mem_caches(void)
diff --git a/fs/openpromfs/inode.c b/fs/openpromfs/inode.c
index 26ecda0e4d19..5f40a4e0dd55 100644
--- a/fs/openpromfs/inode.c
+++ b/fs/openpromfs/inode.c
@@ -431,11 +431,12 @@ static struct file_system_type openprom_fs_type = {
 };
 MODULE_ALIAS_FS("openpromfs");
 
-static void op_inode_init_once(void *data)
+static int op_inode_init_once(void *data)
 {
 	struct op_inode_info *oi = (struct op_inode_info *) data;
 
 	inode_init_once(&oi->vfs_inode);
+	return 0;
 }
 
 static int __init init_openprom_fs(void)
diff --git a/fs/orangefs/super.c b/fs/orangefs/super.c
index eba3e357192e..9521452dea13 100644
--- a/fs/orangefs/super.c
+++ b/fs/orangefs/super.c
@@ -95,12 +95,13 @@ static int parse_mount_options(struct super_block *sb, char *options,
 	return -EINVAL;
 }
 
-static void orangefs_inode_cache_ctor(void *req)
+static int orangefs_inode_cache_ctor(void *req)
 {
 	struct orangefs_inode_s *orangefs_inode = req;
 
 	inode_init_once(&orangefs_inode->vfs_inode);
 	init_rwsem(&orangefs_inode->xattr_sem);
+	return 0;
 }
 
 static struct inode *orangefs_alloc_inode(struct super_block *sb)
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index b63474d1b064..d46d4ea30f66 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -1516,11 +1516,12 @@ struct file_system_type ovl_fs_type = {
 };
 MODULE_ALIAS_FS("overlay");
 
-static void ovl_inode_init_once(void *foo)
+static int ovl_inode_init_once(void *foo)
 {
 	struct ovl_inode *oi = foo;
 
 	inode_init_once(&oi->vfs_inode);
+	return 0;
 }
 
 static int __init ovl_init(void)
diff --git a/fs/pidfs.c b/fs/pidfs.c
index d64a4cbeb0da..efc9aac05446 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -896,11 +896,12 @@ struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags)
 	return pidfd_file;
 }
 
-static void pidfs_inode_init_once(void *data)
+static int pidfs_inode_init_once(void *data)
 {
 	struct pidfs_inode *pi = data;
 
 	inode_init_once(&pi->vfs_inode);
+	return 0;
 }
 
 void __init pidfs_init(void)
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index a3eb3b740f76..f1bf32900389 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -80,11 +80,12 @@ static void proc_free_inode(struct inode *inode)
 	kmem_cache_free(proc_inode_cachep, PROC_I(inode));
 }
 
-static void init_once(void *foo)
+static int init_once(void *foo)
 {
 	struct proc_inode *ei = (struct proc_inode *) foo;
 
 	inode_init_once(&ei->vfs_inode);
+	return 0;
 }
 
 void __init proc_init_kmemcache(void)
diff --git a/fs/qnx4/inode.c b/fs/qnx4/inode.c
index e399e2dd3a12..624911ffc8a8 100644
--- a/fs/qnx4/inode.c
+++ b/fs/qnx4/inode.c
@@ -366,11 +366,12 @@ static void qnx4_free_inode(struct inode *inode)
 	kmem_cache_free(qnx4_inode_cachep, qnx4_i(inode));
 }
 
-static void init_once(void *foo)
+static int init_once(void *foo)
 {
 	struct qnx4_inode_info *ei = (struct qnx4_inode_info *) foo;
 
 	inode_init_once(&ei->vfs_inode);
+	return 0;
 }
 
 static int init_inodecache(void)
diff --git a/fs/qnx6/inode.c b/fs/qnx6/inode.c
index 3310d1ad4d0e..196b4fafcc46 100644
--- a/fs/qnx6/inode.c
+++ b/fs/qnx6/inode.c
@@ -595,11 +595,12 @@ static void qnx6_free_inode(struct inode *inode)
 	kmem_cache_free(qnx6_inode_cachep, QNX6_I(inode));
 }
 
-static void init_once(void *foo)
+static int init_once(void *foo)
 {
 	struct qnx6_inode_info *ei = (struct qnx6_inode_info *) foo;
 
 	inode_init_once(&ei->vfs_inode);
+	return 0;
 }
 
 static int init_inodecache(void)
diff --git a/fs/romfs/super.c b/fs/romfs/super.c
index 0addcc849ff2..622a388dfc8f 100644
--- a/fs/romfs/super.c
+++ b/fs/romfs/super.c
@@ -601,11 +601,12 @@ MODULE_ALIAS_FS("romfs");
 /*
  * inode storage initialiser
  */
-static void romfs_i_init_once(void *_inode)
+static int romfs_i_init_once(void *_inode)
 {
 	struct romfs_inode_info *inode = _inode;
 
 	inode_init_once(&inode->vfs_inode);
+	return 0;
 }
 
 /*
diff --git a/fs/smb/client/cifsfs.c b/fs/smb/client/cifsfs.c
index a08c42363ffc..4e16806552ba 100644
--- a/fs/smb/client/cifsfs.c
+++ b/fs/smb/client/cifsfs.c
@@ -1641,13 +1641,14 @@ const struct file_operations cifs_dir_ops = {
 	.fsync = cifs_dir_fsync,
 };
 
-static void
+static int
 cifs_init_once(void *inode)
 {
 	struct cifsInodeInfo *cifsi = inode;
 
 	inode_init_once(&cifsi->netfs.inode);
 	init_rwsem(&cifsi->lock_sem);
+	return 0;
 }
 
 static int __init
diff --git a/fs/squashfs/super.c b/fs/squashfs/super.c
index 67c55fe32ce8..0764d52ecc2c 100644
--- a/fs/squashfs/super.c
+++ b/fs/squashfs/super.c
@@ -606,11 +606,12 @@ static void squashfs_put_super(struct super_block *sb)
 static struct kmem_cache *squashfs_inode_cachep;
 
 
-static void init_once(void *foo)
+static int init_once(void *foo)
 {
 	struct squashfs_inode_info *ei = foo;
 
 	inode_init_once(&ei->vfs_inode);
+	return 0;
 }
 
 
diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index cb1af30b49f5..87cc889e47d8 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -792,7 +792,7 @@ bool tracefs_initialized(void)
 	return tracefs_registered;
 }
 
-static void init_once(void *foo)
+static int init_once(void *foo)
 {
 	struct tracefs_inode *ti = (struct tracefs_inode *) foo;
 
@@ -801,6 +801,7 @@ static void init_once(void *foo)
 
 	/* Zero out the rest */
 	memset_after(ti, 0, vfs_inode);
+	return 0;
 }
 
 static int __init tracefs_init(void)
diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
index f3e3b2068608..3591397d7df7 100644
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -2376,10 +2376,11 @@ MODULE_ALIAS_FS("ubifs");
 /*
  * Inode slab cache constructor.
  */
-static void inode_slab_ctor(void *obj)
+static int inode_slab_ctor(void *obj)
 {
 	struct ubifs_inode *ui = obj;
 	inode_init_once(&ui->vfs_inode);
+	return 0;
 }
 
 static int __init ubifs_init(void)
diff --git a/fs/udf/super.c b/fs/udf/super.c
index 1c8a736b3309..8fdf6efc7953 100644
--- a/fs/udf/super.c
+++ b/fs/udf/super.c
@@ -175,12 +175,13 @@ static void udf_free_in_core_inode(struct inode *inode)
 	kmem_cache_free(udf_inode_cachep, UDF_I(inode));
 }
 
-static void init_once(void *foo)
+static int init_once(void *foo)
 {
 	struct udf_inode_info *ei = foo;
 
 	ei->i_data = NULL;
 	inode_init_once(&ei->vfs_inode);
+	return 0;
 }
 
 static int __init init_inodecache(void)
diff --git a/fs/ufs/super.c b/fs/ufs/super.c
index 762699c1bcf6..b188ebe93939 100644
--- a/fs/ufs/super.c
+++ b/fs/ufs/super.c
@@ -1444,11 +1444,12 @@ static void ufs_free_in_core_inode(struct inode *inode)
 	kmem_cache_free(ufs_inode_cachep, UFS_I(inode));
 }
 
-static void init_once(void *foo)
+static int init_once(void *foo)
 {
 	struct ufs_inode_info *ei = (struct ufs_inode_info *) foo;
 
 	inode_init_once(&ei->vfs_inode);
+	return 0;
 }
 
 static int __init init_inodecache(void)
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index d80f94346199..53d59a94983c 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -2073,7 +2073,7 @@ static const struct file_operations userfaultfd_fops = {
 	.llseek		= noop_llseek,
 };
 
-static void init_once_userfaultfd_ctx(void *mem)
+static int init_once_userfaultfd_ctx(void *mem)
 {
 	struct userfaultfd_ctx *ctx = (struct userfaultfd_ctx *) mem;
 
@@ -2082,6 +2082,7 @@ static void init_once_userfaultfd_ctx(void *mem)
 	init_waitqueue_head(&ctx->event_wqh);
 	init_waitqueue_head(&ctx->fd_wqh);
 	seqcount_spinlock_init(&ctx->refile_seq, &ctx->fault_pending_wqh.lock);
+	return 0;
 }
 
 static int new_userfaultfd(int flags)
diff --git a/fs/vboxsf/super.c b/fs/vboxsf/super.c
index 0bc96ab6580b..f4fc1cb03014 100644
--- a/fs/vboxsf/super.c
+++ b/fs/vboxsf/super.c
@@ -222,12 +222,13 @@ static int vboxsf_fill_super(struct super_block *sb, struct fs_context *fc)
 	return err;
 }
 
-static void vboxsf_inode_init_once(void *data)
+static int vboxsf_inode_init_once(void *data)
 {
 	struct vboxsf_inode *sf_i = data;
 
 	mutex_init(&sf_i->handle_list_mutex);
 	inode_init_once(&sf_i->vfs_inode);
+	return 0;
 }
 
 static struct inode *vboxsf_alloc_inode(struct super_block *sb)
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index b2dd0c0bf509..2eaab19c1fc5 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -725,7 +725,7 @@ xfs_fs_dirty_inode(
  * fields in the xfs inode that left in the initialise state
  * when freeing the inode.
  */
-STATIC void
+STATIC int
 xfs_fs_inode_init_once(
 	void			*inode)
 {
@@ -740,6 +740,7 @@ xfs_fs_inode_init_once(
 	atomic_set(&ip->i_pincount, 0);
 	spin_lock_init(&ip->i_flags_lock);
 	init_rwsem(&ip->i_lock);
+	return 0;
 }
 
 /*
diff --git a/include/linux/slab.h b/include/linux/slab.h
index d5a8ab98035c..1ef6d5384f0b 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -330,11 +330,12 @@ struct kmem_cache_args {
 	 * page. It is the cache user's responsibility to free object in the
 	 * same state as after calling the constructor, or deal appropriately
 	 * with any differences between a freshly constructed and a reallocated
-	 * object.
+	 * object. If ctor returns a nonzero value, indicating an error, slab
+	 * allocation fails.
 	 *
 	 * %NULL means no constructor.
 	 */
-	void (*ctor)(void *);
+	int (*ctor)(void *);
 };
 
 struct kmem_cache *__kmem_cache_create_args(const char *name,
@@ -343,7 +344,7 @@ struct kmem_cache *__kmem_cache_create_args(const char *name,
 					    slab_flags_t flags);
 static inline struct kmem_cache *
 __kmem_cache_create(const char *name, unsigned int size, unsigned int align,
-		    slab_flags_t flags, void (*ctor)(void *))
+		    slab_flags_t flags, int (*ctor)(void *))
 {
 	struct kmem_cache_args kmem_args = {
 		.align	= align,
@@ -375,7 +376,7 @@ static inline struct kmem_cache *
 kmem_cache_create_usercopy(const char *name, unsigned int size,
 			   unsigned int align, slab_flags_t flags,
 			   unsigned int useroffset, unsigned int usersize,
-			   void (*ctor)(void *))
+			   int (*ctor)(void *))
 {
 	struct kmem_cache_args kmem_args = {
 		.align		= align,
@@ -775,7 +776,7 @@ void kmem_cache_free(struct kmem_cache *s, void *objp);
 
 kmem_buckets *kmem_buckets_create(const char *name, slab_flags_t flags,
 				  unsigned int useroffset, unsigned int usersize,
-				  void (*ctor)(void *));
+				  int (*ctor)(void *));
 
 /*
  * Bulk allocation and freeing operations. These are accelerated in an
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 35b4f8659904..4e331e39f980 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -487,11 +487,12 @@ static struct vfsmount *mq_create_mount(struct ipc_namespace *ns)
 	return mnt;
 }
 
-static void init_once(void *foo)
+static int init_once(void *foo)
 {
 	struct mqueue_inode_info *p = foo;
 
 	inode_init_once(&p->vfs_inode);
+	return 0;
 }
 
 static struct inode *mqueue_alloc_inode(struct super_block *sb)
diff --git a/kernel/fork.c b/kernel/fork.c
index c4b26cd8998b..7966b0876dc3 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -3184,12 +3184,13 @@ void walk_process_tree(struct task_struct *top, proc_visitor visitor, void *data
 #define ARCH_MIN_MMSTRUCT_ALIGN 0
 #endif
 
-static void sighand_ctor(void *data)
+static int sighand_ctor(void *data)
 {
 	struct sighand_struct *sighand = data;
 
 	spin_lock_init(&sighand->siglock);
 	init_waitqueue_head(&sighand->signalfd_wqh);
+	return 0;
 }
 
 void __init mm_cache_init(void)
diff --git a/kernel/rcu/refscale.c b/kernel/rcu/refscale.c
index f11a7c2af778..8ed7e3d4ca30 100644
--- a/kernel/rcu/refscale.c
+++ b/kernel/rcu/refscale.c
@@ -792,7 +792,7 @@ static struct refscale_typesafe *typesafe_alloc_one(void)
 
 // Slab-allocator constructor for refscale_typesafe structures created
 // out of a new slab of system memory.
-static void refscale_typesafe_ctor(void *rtsp_in)
+static int refscale_typesafe_ctor(void *rtsp_in)
 {
 	struct refscale_typesafe *rtsp = rtsp_in;
 
@@ -801,6 +801,7 @@ static void refscale_typesafe_ctor(void *rtsp_in)
 	preempt_disable();
 	rtsp->a = torture_random(this_cpu_ptr(&refscale_rand));
 	preempt_enable();
+	return 0;
 }
 
 static const struct ref_scale_ops typesafe_ref_ops;
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 976b9bd02a1b..0e0f83ca9175 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -1566,13 +1566,14 @@ void idr_destroy(struct idr *idr)
 }
 EXPORT_SYMBOL(idr_destroy);
 
-static void
+static int
 radix_tree_node_ctor(void *arg)
 {
 	struct radix_tree_node *node = arg;
 
 	memset(node, 0, sizeof(*node));
 	INIT_LIST_HEAD(&node->private_list);
+	return 0;
 }
 
 static int radix_tree_cpu_dead(unsigned int cpu)
diff --git a/lib/test_meminit.c b/lib/test_meminit.c
index 6298f66c964b..4697a2aedee2 100644
--- a/lib/test_meminit.c
+++ b/lib/test_meminit.c
@@ -169,9 +169,10 @@ static int __init test_kvmalloc(int *total_failures)
 #define CTOR_BYTES (sizeof(unsigned int))
 #define CTOR_PATTERN (0x41414141)
 /* Initialize the first 4 bytes of the object. */
-static void test_ctor(void *obj)
+static int test_ctor(void *obj)
 {
 	*(unsigned int *)obj = CTOR_PATTERN;
+	return 0;
 }
 
 /*
diff --git a/mm/kfence/kfence_test.c b/mm/kfence/kfence_test.c
index 00034e37bc9f..451760e8d1f4 100644
--- a/mm/kfence/kfence_test.c
+++ b/mm/kfence/kfence_test.c
@@ -184,7 +184,7 @@ static bool report_matches(const struct expect_report *r)
 static struct kmem_cache *test_cache;
 
 static size_t setup_test_cache(struct kunit *test, size_t size, slab_flags_t flags,
-			       void (*ctor)(void *))
+			       int (*ctor)(void *))
 {
 	if (test->priv != TEST_PRIV_WANT_MEMCACHE)
 		return size;
@@ -539,10 +539,11 @@ static void test_shrink_memcache(struct kunit *test)
 	KUNIT_EXPECT_FALSE(test, report_available());
 }
 
-static void ctor_set_x(void *obj)
+static int ctor_set_x(void *obj)
 {
 	/* Every object has at least 8 bytes. */
 	memset(obj, 'x', 8);
+	return 0;
 }
 
 /* Ensure that SL*B does not modify KFENCE objects on bulk free. */
diff --git a/mm/rmap.c b/mm/rmap.c
index 67bb273dfb80..f2b45caa5acb 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -448,13 +448,14 @@ void unlink_anon_vmas(struct vm_area_struct *vma)
 	}
 }
 
-static void anon_vma_ctor(void *data)
+static int anon_vma_ctor(void *data)
 {
 	struct anon_vma *anon_vma = data;
 
 	init_rwsem(&anon_vma->rwsem);
 	atomic_set(&anon_vma->refcount, 0);
 	anon_vma->rb_root = RB_ROOT_CACHED;
+	return 0;
 }
 
 void __init anon_vma_init(void)
diff --git a/mm/shmem.c b/mm/shmem.c
index 99327c30507c..50ac15dfaff8 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -5165,10 +5165,11 @@ static void shmem_destroy_inode(struct inode *inode)
 		simple_offset_destroy(shmem_get_offset_ctx(inode));
 }
 
-static void shmem_init_inode(void *foo)
+static int shmem_init_inode(void *foo)
 {
 	struct shmem_inode_info *info = foo;
 	inode_init_once(&info->vfs_inode);
+	return 0;
 }
 
 static void __init shmem_init_inodecache(void)
diff --git a/mm/slab.h b/mm/slab.h
index 05a21dc796e0..30603907d936 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -278,7 +278,7 @@ struct kmem_cache {
 	struct kmem_cache_order_objects min;
 	gfp_t allocflags;		/* gfp flags to use on each alloc */
 	int refcount;			/* Refcount for slab cache destroy */
-	void (*ctor)(void *object);	/* Object constructor */
+	int (*ctor)(void *object);	/* Object constructor */
 	unsigned int inuse;		/* Offset to metadata */
 	unsigned int align;		/* Alignment */
 	unsigned int red_left_pad;	/* Left redzone padding size */
@@ -438,10 +438,10 @@ extern void create_boot_cache(struct kmem_cache *, const char *name,
 
 int slab_unmergeable(struct kmem_cache *s);
 struct kmem_cache *find_mergeable(unsigned size, unsigned align,
-		slab_flags_t flags, const char *name, void (*ctor)(void *));
+		slab_flags_t flags, const char *name, int (*ctor)(void *));
 struct kmem_cache *
 __kmem_cache_alias(const char *name, unsigned int size, unsigned int align,
-		   slab_flags_t flags, void (*ctor)(void *));
+		   slab_flags_t flags, int (*ctor)(void *));
 
 slab_flags_t kmem_cache_flags(slab_flags_t flags, const char *name);
 
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 5be257e03c7c..59938e44a8c2 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -173,7 +173,7 @@ int slab_unmergeable(struct kmem_cache *s)
 }
 
 struct kmem_cache *find_mergeable(unsigned int size, unsigned int align,
-		slab_flags_t flags, const char *name, void (*ctor)(void *))
+		slab_flags_t flags, const char *name, int (*ctor)(void *))
 {
 	struct kmem_cache *s;
 
@@ -382,7 +382,7 @@ static struct kmem_cache *kmem_buckets_cache __ro_after_init;
 kmem_buckets *kmem_buckets_create(const char *name, slab_flags_t flags,
 				  unsigned int useroffset,
 				  unsigned int usersize,
-				  void (*ctor)(void *))
+				  int (*ctor)(void *))
 {
 	unsigned long mask = 0;
 	unsigned int idx;
diff --git a/mm/slub.c b/mm/slub.c
index 95a9f04b5904..10b9c87792b7 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2439,7 +2439,8 @@ static void *setup_object(struct kmem_cache *s, void *object)
 	object = kasan_init_slab_obj(s, object);
 	if (unlikely(s->ctor)) {
 		kasan_unpoison_new_object(s, object);
-		s->ctor(object);
+		if (s->ctor(object))
+			return NULL;
 		kasan_poison_new_object(s, object);
 	}
 	return object;
@@ -2542,7 +2543,7 @@ static bool should_shuffle_freelist(struct kmem_cache *s, struct slab *slab)
 }
 
 /* Shuffle the single linked freelist based on a random pre-computed sequence */
-static void shuffle_freelist(struct kmem_cache *s, struct slab *slab)
+static bool shuffle_freelist(struct kmem_cache *s, struct slab *slab)
 {
 	void *start;
 	void *cur;
@@ -2559,15 +2560,29 @@ static void shuffle_freelist(struct kmem_cache *s, struct slab *slab)
 	cur = next_freelist_entry(s, &pos, start, page_limit, freelist_count);
 	cur = setup_object(s, cur);
 	slab->freelist = cur;
+	if (!cur) {
+		return false;
+	}
 
 	for (idx = 1; idx < slab->objects; idx++) {
 		next = next_freelist_entry(s, &pos, start, page_limit,
 			freelist_count);
 		next = setup_object(s, next);
+		if (!next) {
+			/*
+			 * This is necessary because later we need to call
+			 * the destructor for objects that the constructor
+			 * successfully initialized.
+			 */
+			set_freepointer(s, cur, NULL);
+			return false;
+		}
 		set_freepointer(s, cur, next);
 		cur = next;
 	}
 	set_freepointer(s, cur, NULL);
+
+	return true;
 }
 #else
 static inline int init_cache_random_seq(struct kmem_cache *s)
@@ -2580,8 +2595,10 @@ static inline bool should_shuffle_freelist(struct kmem_cache *s,
 {
 	return false;
 }
-static inline void shuffle_freelist(struct kmem_cache *s, struct slab *slab)
-{ }
+static inline bool shuffle_freelist(struct kmem_cache *s, struct slab *slab)
+{
+	return false;
+}
 #endif /* CONFIG_SLAB_FREELIST_RANDOM */
 
 static __always_inline void account_slab(struct slab *slab, int order,
@@ -2604,6 +2621,20 @@ static __always_inline void unaccount_slab(struct slab *slab, int order,
 			    -(PAGE_SIZE << order));
 }
 
+static void __free_slab(struct kmem_cache *s, struct slab *slab)
+{
+	struct folio *folio = slab_folio(slab);
+	int order = folio_order(folio);
+	int pages = 1 << order;
+
+	__slab_clear_pfmemalloc(slab);
+	folio->mapping = NULL;
+	__folio_clear_slab(folio);
+	mm_account_reclaimed_pages(pages);
+	unaccount_slab(slab, order, s);
+	free_frozen_pages(&folio->page, order);
+}
+
 static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 {
 	struct slab *slab;
@@ -2653,14 +2684,22 @@ static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 	setup_slab_debug(s, slab, start);
 
 	if (should_shuffle_freelist(s, slab)) {
-		shuffle_freelist(s, slab);
+		if (!shuffle_freelist(s, slab))
+			goto err;
 	} else {
 		start = fixup_red_left(s, start);
 		start = setup_object(s, start);
 		slab->freelist = start;
+		if (!start)
+			goto err;
 		for (idx = 0, p = start; idx < slab->objects - 1; idx++) {
 			next = p + s->size;
 			next = setup_object(s, next);
+			if (!next) {
+				/* See comment in shuffle_freelist() */
+				set_freepointer(s, p, NULL);
+				goto err;
+			}
 			set_freepointer(s, p, next);
 			p = next;
 		}
@@ -2668,6 +2707,9 @@ static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 	}
 
 	return slab;
+err:
+	__free_slab(s, slab);
+	return NULL;
 }
 
 static struct slab *new_slab(struct kmem_cache *s, gfp_t flags, int node)
@@ -2681,20 +2723,6 @@ static struct slab *new_slab(struct kmem_cache *s, gfp_t flags, int node)
 		flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
 }
 
-static void __free_slab(struct kmem_cache *s, struct slab *slab)
-{
-	struct folio *folio = slab_folio(slab);
-	int order = folio_order(folio);
-	int pages = 1 << order;
-
-	__slab_clear_pfmemalloc(slab);
-	folio->mapping = NULL;
-	__folio_clear_slab(folio);
-	mm_account_reclaimed_pages(pages);
-	unaccount_slab(slab, order, s);
-	free_frozen_pages(&folio->page, order);
-}
-
 static void rcu_free_slab(struct rcu_head *h)
 {
 	struct slab *slab = container_of(h, struct slab, rcu_head);
@@ -6377,7 +6405,7 @@ void __init kmem_cache_init_late(void)
 
 struct kmem_cache *
 __kmem_cache_alias(const char *name, unsigned int size, unsigned int align,
-		   slab_flags_t flags, void (*ctor)(void *))
+		   slab_flags_t flags, int (*ctor)(void *))
 {
 	struct kmem_cache *s;
 
diff --git a/net/socket.c b/net/socket.c
index 9a0e720f0859..d5843d9c6ab4 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -330,11 +330,12 @@ static void sock_free_inode(struct inode *inode)
 	kmem_cache_free(sock_inode_cachep, ei);
 }
 
-static void init_once(void *foo)
+static int init_once(void *foo)
 {
 	struct socket_alloc *ei = (struct socket_alloc *)foo;
 
 	inode_init_once(&ei->vfs_inode);
+	return 0;
 }
 
 static void init_inodecache(void)
diff --git a/net/sunrpc/rpc_pipe.c b/net/sunrpc/rpc_pipe.c
index eadc00410ebc..d90d5f9b2fd7 100644
--- a/net/sunrpc/rpc_pipe.c
+++ b/net/sunrpc/rpc_pipe.c
@@ -1467,7 +1467,7 @@ static struct file_system_type rpc_pipe_fs_type = {
 MODULE_ALIAS_FS("rpc_pipefs");
 MODULE_ALIAS("rpc_pipefs");
 
-static void
+static int
 init_once(void *foo)
 {
 	struct rpc_inode *rpci = (struct rpc_inode *) foo;
@@ -1476,6 +1476,7 @@ init_once(void *foo)
 	rpci->private = NULL;
 	rpci->pipe = NULL;
 	init_waitqueue_head(&rpci->waitq);
+	return 0;
 }
 
 int register_rpc_pipefs(void)
diff --git a/security/integrity/ima/ima_iint.c b/security/integrity/ima/ima_iint.c
index 00b249101f98..f2af2ee91481 100644
--- a/security/integrity/ima/ima_iint.c
+++ b/security/integrity/ima/ima_iint.c
@@ -123,11 +123,12 @@ void ima_inode_free_rcu(void *inode_security)
 		ima_iint_free(*iint_p);
 }
 
-static void ima_iint_init_once(void *foo)
+static int ima_iint_init_once(void *foo)
 {
 	struct ima_iint_cache *iint = (struct ima_iint_cache *)foo;
 
 	memset(iint, 0, sizeof(*iint));
+	return 0;
 }
 
 void __init ima_iintcache_init(void)
-- 
2.43.0



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [RFC PATCH 3/7] mm/slab: revive the destructor feature in slab allocator
  2025-04-24  8:07 [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu allocator scalability problem Harry Yoo
  2025-04-24  8:07 ` [RFC PATCH 1/7] mm/slab: refactor freelist shuffle Harry Yoo
  2025-04-24  8:07 ` [RFC PATCH 2/7] treewide, slab: allow slab constructor to return an error Harry Yoo
@ 2025-04-24  8:07 ` Harry Yoo
  2025-04-24  8:07 ` [RFC PATCH 4/7] net/sched/act_api: use slab ctor/dtor to reduce contention on pcpu alloc Harry Yoo
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 27+ messages in thread
From: Harry Yoo @ 2025-04-24  8:07 UTC (permalink / raw)
  To: Vlastimil Babka, Christoph Lameter, David Rientjes,
	Andrew Morton, Dennis Zhou, Tejun Heo, Mateusz Guzik
  Cc: Jamal Hadi Salim, Cong Wang, Jiri Pirko, Vlad Buslov,
	Yevgeny Kliteynik, Jan Kara, Byungchul Park, linux-mm, netdev,
	linux-kernel, Harry Yoo

Commit c59def9f222d ("Slab allocators: Drop support for destructors")
removed support for destructors, based on the belief that the feature had
few users and its usefulness was questionable.

Mateusz Guzik suggests [1] that using a constructor-destructor pair
can be beneficial when object initialization and de-initialization
require global serialization. For example, mm_struct allocates per-CPU
variables that rely on a global lock for serialization.

With the constructor-destructor pair, the serialization occurs only when
a slab is allocated or freed, rather than each time an individual object
is allocated or freed.

Introduce the destructor feature. A user enables it by setting the 'dtor'
field in kmem_cache_args. When a cache has a destructor, cache merging is
disabled and the destructor is called before a slab is freed. For
SLAB_TYPESAFE_BY_RCU caches, it is invoked after the RCU grace period.
Unlike the constructor, the destructor has no return value and must not
fail.
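
As a rough sketch of the intended usage (the "foo" cache, its percpu
stats and the callbacks below are made up for illustration; patch 4
converts net/sched/act_api along these lines):

static int foo_ctor(void *object)
{
	struct foo *f = object;

	/* runs once per object slot when a new slab is populated; may fail */
	f->stats = alloc_percpu(struct foo_stats);
	return f->stats ? 0 : -ENOMEM;
}

static void foo_dtor(void *object)
{
	struct foo *f = object;

	/* runs when the backing slab pages are freed; must not fail */
	free_percpu(f->stats);
}

struct kmem_cache_args args = {
	.ctor = foo_ctor,
	.dtor = foo_dtor,
};
foo_cachep = kmem_cache_create("foo", sizeof(struct foo), &args, 0);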

init_on_alloc, init_on_free, placing the free pointer within the object,
slab merging, __GFP_ZERO, and object poisoning do not work for caches
with a destructor, for the same reasons they do not work for caches with
a constructor.

Meanwhile, refactor __kmem_cache_alias() to remove the mergeability
checks from it; these checks are now performed before calling
__kmem_cache_alias().

[1] https://lore.kernel.org/linux-mm/CAGudoHFc+Km-3usiy4Wdm1JkM+YjCgD9A8dDKQ06pZP070f1ig@mail.gmail.com

Suggested-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
---
 include/linux/slab.h | 10 ++++++++++
 mm/slab.h            |  9 +++++----
 mm/slab_common.c     | 41 +++++++++++++++++++++++++++--------------
 mm/slub.c            | 29 ++++++++++++++++++++++++-----
 4 files changed, 66 insertions(+), 23 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 1ef6d5384f0b..12a8a6b07050 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -336,6 +336,16 @@ struct kmem_cache_args {
 	 * %NULL means no constructor.
 	 */
 	int (*ctor)(void *);
+	/**
+	 * @dtor: A destructor for the objects.
+	 *
+	 * The destructor is invoked for each object when a slab page is freed.
+	 * In case of &SLAB_TYPESAFE_BY_RCU caches, dtor is called after RCU
+	 * grace period.
+	 *
+	 * %NULL means no destructor.
+	 */
+	void (*dtor)(void *);
 };
 
 struct kmem_cache *__kmem_cache_create_args(const char *name,
diff --git a/mm/slab.h b/mm/slab.h
index 30603907d936..cffe960f2611 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -279,6 +279,7 @@ struct kmem_cache {
 	gfp_t allocflags;		/* gfp flags to use on each alloc */
 	int refcount;			/* Refcount for slab cache destroy */
 	int (*ctor)(void *object);	/* Object constructor */
+	void (*dtor)(void *object);	/* Object destructor */
 	unsigned int inuse;		/* Offset to metadata */
 	unsigned int align;		/* Alignment */
 	unsigned int red_left_pad;	/* Left redzone padding size */
@@ -438,10 +439,10 @@ extern void create_boot_cache(struct kmem_cache *, const char *name,
 
 int slab_unmergeable(struct kmem_cache *s);
 struct kmem_cache *find_mergeable(unsigned size, unsigned align,
-		slab_flags_t flags, const char *name, int (*ctor)(void *));
+		slab_flags_t flags, const char *name);
 struct kmem_cache *
 __kmem_cache_alias(const char *name, unsigned int size, unsigned int align,
-		   slab_flags_t flags, int (*ctor)(void *));
+		   slab_flags_t flags);
 
 slab_flags_t kmem_cache_flags(slab_flags_t flags, const char *name);
 
@@ -638,7 +639,7 @@ static inline bool slab_want_init_on_alloc(gfp_t flags, struct kmem_cache *c)
 {
 	if (static_branch_maybe(CONFIG_INIT_ON_ALLOC_DEFAULT_ON,
 				&init_on_alloc)) {
-		if (c->ctor)
+		if (c->ctor || c->dtor)
 			return false;
 		if (c->flags & (SLAB_TYPESAFE_BY_RCU | SLAB_POISON))
 			return flags & __GFP_ZERO;
@@ -651,7 +652,7 @@ static inline bool slab_want_init_on_free(struct kmem_cache *c)
 {
 	if (static_branch_maybe(CONFIG_INIT_ON_FREE_DEFAULT_ON,
 				&init_on_free))
-		return !(c->ctor ||
+		return !(c->ctor || c->dtor ||
 			 (c->flags & (SLAB_TYPESAFE_BY_RCU | SLAB_POISON)));
 	return false;
 }
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 59938e44a8c2..f2f1f7bb7170 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -155,7 +155,7 @@ int slab_unmergeable(struct kmem_cache *s)
 	if (slab_nomerge || (s->flags & SLAB_NEVER_MERGE))
 		return 1;
 
-	if (s->ctor)
+	if (s->ctor || s->dtor)
 		return 1;
 
 #ifdef CONFIG_HARDENED_USERCOPY
@@ -172,22 +172,36 @@ int slab_unmergeable(struct kmem_cache *s)
 	return 0;
 }
 
-struct kmem_cache *find_mergeable(unsigned int size, unsigned int align,
-		slab_flags_t flags, const char *name, int (*ctor)(void *))
+/**
+ * Can this cache that's going to be created be merged with others?
+ * We can't use struct kmem_cache here because it is not created yet.
+ */
+static bool is_mergeable(const char *name, slab_flags_t flags,
+			 struct kmem_cache_args *args)
 {
-	struct kmem_cache *s;
-
 	if (slab_nomerge)
-		return NULL;
-
-	if (ctor)
-		return NULL;
+		return false;
 
 	flags = kmem_cache_flags(flags, name);
-
 	if (flags & SLAB_NEVER_MERGE)
-		return NULL;
+		return false;
 
+	if (args->ctor || args->dtor)
+		return false;
+
+#ifdef CONFIG_HARDENED_USERCOPY
+	if (args->usersize)
+		return false;
+#endif
+	return true;
+}
+
+struct kmem_cache *find_mergeable(unsigned int size, unsigned int align,
+		slab_flags_t flags, const char *name)
+{
+	struct kmem_cache *s;
+
+	flags = kmem_cache_flags(flags, name);
 	size = ALIGN(size, sizeof(void *));
 	align = calculate_alignment(flags, align, size);
 	size = ALIGN(size, align);
@@ -321,9 +335,8 @@ struct kmem_cache *__kmem_cache_create_args(const char *name,
 		    object_size - args->usersize < args->useroffset))
 		args->usersize = args->useroffset = 0;
 
-	if (!args->usersize)
-		s = __kmem_cache_alias(name, object_size, args->align, flags,
-				       args->ctor);
+	if (is_mergeable(name, flags, args))
+		s = __kmem_cache_alias(name, object_size, args->align, flags);
 	if (s)
 		goto out_unlock;
 
diff --git a/mm/slub.c b/mm/slub.c
index 10b9c87792b7..b7e158da7ed3 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2626,6 +2626,15 @@ static void __free_slab(struct kmem_cache *s, struct slab *slab)
 	struct folio *folio = slab_folio(slab);
 	int order = folio_order(folio);
 	int pages = 1 << order;
+	void *p;
+
+	if (unlikely(s->dtor)) {
+		p = slab->freelist;
+		while (p != NULL) {
+			s->dtor(p);
+			p = get_freepointer(s, p);
+		}
+	}
 
 	__slab_clear_pfmemalloc(slab);
 	folio->mapping = NULL;
@@ -2717,7 +2726,7 @@ static struct slab *new_slab(struct kmem_cache *s, gfp_t flags, int node)
 	if (unlikely(flags & GFP_SLAB_BUG_MASK))
 		flags = kmalloc_fix_flags(flags);
 
-	WARN_ON_ONCE(s->ctor && (flags & __GFP_ZERO));
+	WARN_ON_ONCE((s->ctor || s->dtor) && (flags & __GFP_ZERO));
 
 	return allocate_slab(s,
 		flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
@@ -5735,7 +5744,7 @@ static int calculate_sizes(struct kmem_cache_args *args, struct kmem_cache *s)
 	 * then we should never poison the object itself.
 	 */
 	if ((flags & SLAB_POISON) && !(flags & SLAB_TYPESAFE_BY_RCU) &&
-			!s->ctor)
+			!s->ctor && !s->dtor)
 		s->flags |= __OBJECT_POISON;
 	else
 		s->flags &= ~__OBJECT_POISON;
@@ -5757,7 +5766,7 @@ static int calculate_sizes(struct kmem_cache_args *args, struct kmem_cache *s)
 	s->inuse = size;
 
 	if (((flags & SLAB_TYPESAFE_BY_RCU) && !args->use_freeptr_offset) ||
-	    (flags & SLAB_POISON) || s->ctor ||
+	    (flags & SLAB_POISON) || s->ctor || s->dtor ||
 	    ((flags & SLAB_RED_ZONE) &&
 	     (s->object_size < sizeof(void *) || slub_debug_orig_size(s)))) {
 		/*
@@ -6405,11 +6414,11 @@ void __init kmem_cache_init_late(void)
 
 struct kmem_cache *
 __kmem_cache_alias(const char *name, unsigned int size, unsigned int align,
-		   slab_flags_t flags, int (*ctor)(void *))
+		   slab_flags_t flags)
 {
 	struct kmem_cache *s;
 
-	s = find_mergeable(size, align, flags, name, ctor);
+	s = find_mergeable(size, align, flags, name);
 	if (s) {
 		if (sysfs_slab_alias(s, name))
 			pr_err("SLUB: Unable to add cache alias %s to sysfs\n",
@@ -6443,6 +6452,7 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
 #endif
 	s->align = args->align;
 	s->ctor = args->ctor;
+	s->dtor = args->dtor;
 #ifdef CONFIG_HARDENED_USERCOPY
 	s->useroffset = args->useroffset;
 	s->usersize = args->usersize;
@@ -7003,6 +7013,14 @@ static ssize_t ctor_show(struct kmem_cache *s, char *buf)
 }
 SLAB_ATTR_RO(ctor);
 
+static ssize_t dtor_show(struct kmem_cache *s, char *buf)
+{
+	if (!s->dtor)
+		return 0;
+	return sysfs_emit(buf, "%pS\n", s->dtor);
+}
+SLAB_ATTR_RO(dtor);
+
 static ssize_t aliases_show(struct kmem_cache *s, char *buf)
 {
 	return sysfs_emit(buf, "%d\n", s->refcount < 0 ? 0 : s->refcount - 1);
@@ -7356,6 +7374,7 @@ static struct attribute *slab_attrs[] = {
 	&partial_attr.attr,
 	&cpu_slabs_attr.attr,
 	&ctor_attr.attr,
+	&dtor_attr.attr,
 	&aliases_attr.attr,
 	&align_attr.attr,
 	&hwcache_align_attr.attr,
-- 
2.43.0



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [RFC PATCH 4/7] net/sched/act_api: use slab ctor/dtor to reduce contention on pcpu alloc
  2025-04-24  8:07 [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu allocator scalability problem Harry Yoo
                   ` (2 preceding siblings ...)
  2025-04-24  8:07 ` [RFC PATCH 3/7] mm/slab: revive the destructor feature in slab allocator Harry Yoo
@ 2025-04-24  8:07 ` Harry Yoo
  2025-04-24  8:07 ` [RFC PATCH 5/7] mm/percpu: allow (un)charging objects without alloc/free Harry Yoo
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 27+ messages in thread
From: Harry Yoo @ 2025-04-24  8:07 UTC (permalink / raw)
  To: Vlastimil Babka, Christoph Lameter, David Rientjes,
	Andrew Morton, Dennis Zhou, Tejun Heo, Mateusz Guzik
  Cc: Jamal Hadi Salim, Cong Wang, Jiri Pirko, Vlad Buslov,
	Yevgeny Kliteynik, Jan Kara, Byungchul Park, linux-mm, netdev,
	linux-kernel, Harry Yoo

A tc_action object allocates three percpu memory regions and stores their
pointers within the object. Over each object's lifetime, this requires
acquiring and releasing the global lock, pcpu_alloc_mutex, three times.

In workloads that frequently create and destroy TC filters, this leads
to severe contention on that global lock.

By using a slab constructor/destructor pair, the contention on
pcpu_alloc_mutex is shifted to the creation and destruction of slabs
(each of which contains multiple objects), which occur far less
frequently than allocating and freeing individual tc_action objects.

When tested with the following command, a 26% reduction in system time
was observed.

$ cd tools/testing/selftests/tc-testing
$ sudo python3 tdc.py -f ./tc-tests/filters/flower.json -d <NIC name>

Lock contention as measured with `perf lock record/report`:

Before:
                Name   acquired  contended     avg wait   total wait     max wait     min wait

                       15042346   15042346      3.82 us     57.40 s       4.00 ms       316 ns
    pcpu_alloc_mutex   10959650   10959650      7.10 us      1.30 m       3.76 ms       313 ns

After:
                Name   acquired  contended     avg wait   total wait     max wait     min wait

                       15488031   15488031      5.16 us      1.33 m  5124095.50 h        316 ns
    pcpu_alloc_mutex    7695276    7695276      3.39 us     26.07 s       4.03 ms       284 ns

The contention has moved from pcpu_alloc_mutex to other locks (which are not
symbolized and appear as blank in the output above).

Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
---
 net/sched/act_api.c | 82 +++++++++++++++++++++++++++++++--------------
 1 file changed, 56 insertions(+), 26 deletions(-)

diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 839790043256..60cde766135a 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -112,6 +112,8 @@ struct tcf_chain *tcf_action_set_ctrlact(struct tc_action *a, int action,
 }
 EXPORT_SYMBOL(tcf_action_set_ctrlact);
 
+static struct kmem_cache *tc_action_cache;
+
 /* XXX: For standalone actions, we don't need a RCU grace period either, because
  * actions are always connected to filters and filters are already destroyed in
  * RCU callbacks, so after a RCU grace period actions are already disconnected
@@ -121,15 +123,15 @@ static void free_tcf(struct tc_action *p)
 {
 	struct tcf_chain *chain = rcu_dereference_protected(p->goto_chain, 1);
 
-	free_percpu(p->cpu_bstats);
-	free_percpu(p->cpu_bstats_hw);
-	free_percpu(p->cpu_qstats);
 
 	tcf_set_action_cookie(&p->user_cookie, NULL);
 	if (chain)
 		tcf_chain_put_by_act(chain);
 
-	kfree(p);
+	if (p->cpu_bstats)
+		kmem_cache_free(tc_action_cache, p);
+	else
+		kfree(p);
 }
 
 static void offload_action_hw_count_set(struct tc_action *act,
@@ -778,27 +780,20 @@ int tcf_idr_create(struct tc_action_net *tn, u32 index, struct nlattr *est,
 		   struct tc_action **a, const struct tc_action_ops *ops,
 		   int bind, bool cpustats, u32 flags)
 {
-	struct tc_action *p = kzalloc(ops->size, GFP_KERNEL);
+	struct tc_action *p;
 	struct tcf_idrinfo *idrinfo = tn->idrinfo;
 	int err = -ENOMEM;
 
+	if (cpustats)
+		p = kmem_cache_alloc(tc_action_cache, GFP_KERNEL);
+	else
+		p = kzalloc(ops->size, GFP_KERNEL);
+
 	if (unlikely(!p))
 		return -ENOMEM;
 	refcount_set(&p->tcfa_refcnt, 1);
 	if (bind)
 		atomic_set(&p->tcfa_bindcnt, 1);
-
-	if (cpustats) {
-		p->cpu_bstats = netdev_alloc_pcpu_stats(struct gnet_stats_basic_sync);
-		if (!p->cpu_bstats)
-			goto err1;
-		p->cpu_bstats_hw = netdev_alloc_pcpu_stats(struct gnet_stats_basic_sync);
-		if (!p->cpu_bstats_hw)
-			goto err2;
-		p->cpu_qstats = alloc_percpu(struct gnet_stats_queue);
-		if (!p->cpu_qstats)
-			goto err3;
-	}
 	gnet_stats_basic_sync_init(&p->tcfa_bstats);
 	gnet_stats_basic_sync_init(&p->tcfa_bstats_hw);
 	spin_lock_init(&p->tcfa_lock);
@@ -812,7 +807,7 @@ int tcf_idr_create(struct tc_action_net *tn, u32 index, struct nlattr *est,
 					&p->tcfa_rate_est,
 					&p->tcfa_lock, false, est);
 		if (err)
-			goto err4;
+			goto err;
 	}
 
 	p->idrinfo = idrinfo;
@@ -820,14 +815,11 @@ int tcf_idr_create(struct tc_action_net *tn, u32 index, struct nlattr *est,
 	p->ops = ops;
 	*a = p;
 	return 0;
-err4:
-	free_percpu(p->cpu_qstats);
-err3:
-	free_percpu(p->cpu_bstats_hw);
-err2:
-	free_percpu(p->cpu_bstats);
-err1:
-	kfree(p);
+err:
+	if (cpustats)
+		kmem_cache_free(tc_action_cache, p);
+	else
+		kfree(p);
 	return err;
 }
 EXPORT_SYMBOL(tcf_idr_create);
@@ -2270,8 +2262,46 @@ static const struct rtnl_msg_handler tc_action_rtnl_msg_handlers[] __initconst =
 	 .dumpit = tc_dump_action},
 };
 
+static int tcf_action_ctor(void *object) {
+	struct tc_action *p = object;
+
+	p->cpu_bstats = netdev_alloc_pcpu_stats(struct gnet_stats_basic_sync);
+	if (!p->cpu_bstats)
+		goto err1;
+	p->cpu_bstats_hw = netdev_alloc_pcpu_stats(struct gnet_stats_basic_sync);
+	if (!p->cpu_bstats_hw)
+		goto err2;
+	p->cpu_qstats = alloc_percpu(struct gnet_stats_queue);
+	if (!p->cpu_qstats)
+		goto err3;
+	return 0;
+
+err3:
+	free_percpu(p->cpu_bstats_hw);
+err2:
+	free_percpu(p->cpu_bstats);
+err1:
+	return -ENOMEM;
+}
+
+static void tcf_action_dtor(void *object) {
+	struct tc_action *p = object;
+
+	free_percpu(p->cpu_bstats);
+	free_percpu(p->cpu_bstats_hw);
+	free_percpu(p->cpu_qstats);
+}
+
 static int __init tc_action_init(void)
 {
+	struct kmem_cache_args kmem_args = {
+		.ctor = tcf_action_ctor,
+		.dtor = tcf_action_dtor,
+	};
+
+	tc_action_cache = kmem_cache_create("tc_action",
+					    sizeof(struct tc_action),
+					    &kmem_args, SLAB_PANIC);
 	rtnl_register_many(tc_action_rtnl_msg_handlers);
 	return 0;
 }
-- 
2.43.0



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [RFC PATCH 5/7] mm/percpu: allow (un)charging objects without alloc/free
  2025-04-24  8:07 [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu allocator scalability problem Harry Yoo
                   ` (3 preceding siblings ...)
  2025-04-24  8:07 ` [RFC PATCH 4/7] net/sched/act_api: use slab ctor/dtor to reduce contention on pcpu alloc Harry Yoo
@ 2025-04-24  8:07 ` Harry Yoo
  2025-04-24  8:07 ` [RFC PATCH 6/7] lib/percpu_counter: allow (un)charging percpu counters " Harry Yoo
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 27+ messages in thread
From: Harry Yoo @ 2025-04-24  8:07 UTC (permalink / raw)
  To: Vlastimil Babka, Christoph Lameter, David Rientjes,
	Andrew Morton, Dennis Zhou, Tejun Heo, Mateusz Guzik
  Cc: Jamal Hadi Salim, Cong Wang, Jiri Pirko, Vlad Buslov,
	Yevgeny Kliteynik, Jan Kara, Byungchul Park, linux-mm, netdev,
	linux-kernel, Harry Yoo

With a slab ctor/dtor pair, slab objects can retain a pointer to percpu
memory that remains allocated until the slab destructor frees it.

In such cases, the percpu memory should still be charged and uncharged
when individual slab objects are allocated and freed. Allow explicit
(un)charging of percpu memory to keep memory accounting accurate for
users of the slab destructor.

Note that these APIs (un)charge memory only for memory cgroups. They do
not affect memory allocation profiling, which records percpu memory only
when it is actually allocated or freed.
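
A minimal sketch of the intended pairing, assuming a hypothetical object
"f" whose slab constructor already allocated a percpu region f->stats of
type struct foo_stats:

	/* after kmem_cache_alloc(): charge the retained percpu region */
	if (!pcpu_charge(f->stats, sizeof(struct foo_stats), GFP_KERNEL_ACCOUNT))
		goto fail_charge;

	/* before kmem_cache_free(): drop the charge, keep the memory so
	 * the slab destructor can free it later */
	pcpu_uncharge(f->stats, sizeof(struct foo_stats));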

Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
---
 include/linux/percpu.h | 10 ++++++
 mm/percpu.c            | 79 +++++++++++++++++++++++++++++-------------
 2 files changed, 64 insertions(+), 25 deletions(-)

diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index 52b5ea663b9f..2d13ef0885d6 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -140,6 +140,16 @@ extern void __init setup_per_cpu_areas(void);
 extern void __percpu *pcpu_alloc_noprof(size_t size, size_t align, bool reserved,
 				   gfp_t gfp) __alloc_size(1);
 
+#ifdef CONFIG_MEMCG
+extern bool pcpu_charge(void __percpu *__pdata, size_t size, gfp_t gfp);
+extern void pcpu_uncharge(void __percpu *__pdata, size_t size);
+#else
+static inline bool pcpu_charge(void __percpu *__pdata, size_t size, gfp_t gfp)
+{
+	return true;
+}
+static inline void pcpu_uncharge(void __percpu *__pdata, size_t size) { }
+#endif
 #define __alloc_percpu_gfp(_size, _align, _gfp)				\
 	alloc_hooks(pcpu_alloc_noprof(_size, _align, false, _gfp))
 #define __alloc_percpu(_size, _align)					\
diff --git a/mm/percpu.c b/mm/percpu.c
index b35494c8ede2..069d8e593164 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1606,6 +1606,32 @@ static struct pcpu_chunk *pcpu_chunk_addr_search(void *addr)
 	return pcpu_get_page_chunk(pcpu_addr_to_page(addr));
 }
 
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+static void pcpu_alloc_tag_alloc_hook(struct pcpu_chunk *chunk, int off,
+				      size_t size)
+{
+	if (mem_alloc_profiling_enabled() && likely(chunk->obj_exts)) {
+		alloc_tag_add(&chunk->obj_exts[off >> PCPU_MIN_ALLOC_SHIFT].tag,
+			      current->alloc_tag, size);
+	}
+}
+
+static void pcpu_alloc_tag_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
+{
+	if (mem_alloc_profiling_enabled() && likely(chunk->obj_exts))
+		alloc_tag_sub(&chunk->obj_exts[off >> PCPU_MIN_ALLOC_SHIFT].tag, size);
+}
+#else
+static void pcpu_alloc_tag_alloc_hook(struct pcpu_chunk *chunk, int off,
+				      size_t size)
+{
+}
+
+static void pcpu_alloc_tag_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
+{
+}
+#endif
+
 #ifdef CONFIG_MEMCG
 static bool pcpu_memcg_pre_alloc_hook(size_t size, gfp_t gfp,
 				      struct obj_cgroup **objcgp)
@@ -1667,7 +1693,35 @@ static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
 
 	obj_cgroup_put(objcg);
 }
+bool pcpu_charge(void *ptr, size_t size, gfp_t gfp)
+{
+	struct obj_cgroup *objcg = NULL;
+	void *addr;
+	struct pcpu_chunk *chunk;
+	int off;
+
+	addr = __pcpu_ptr_to_addr(ptr);
+	chunk = pcpu_chunk_addr_search(addr);
+	off = addr - chunk->base_addr;
+
+	if (!pcpu_memcg_pre_alloc_hook(size, gfp, &objcg))
+		return false;
+	pcpu_memcg_post_alloc_hook(objcg, chunk, off, size);
+	return true;
+}
+
+void pcpu_uncharge(void *ptr, size_t size)
+{
+	void *addr;
+	struct pcpu_chunk *chunk;
+	int off;
+
+	addr = __pcpu_ptr_to_addr(ptr);
+	chunk = pcpu_chunk_addr_search(addr);
+	off = addr - chunk->base_addr;
 
+	pcpu_memcg_free_hook(chunk, off, size);
+}
 #else /* CONFIG_MEMCG */
 static bool
 pcpu_memcg_pre_alloc_hook(size_t size, gfp_t gfp, struct obj_cgroup **objcgp)
@@ -1686,31 +1740,6 @@ static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
 }
 #endif /* CONFIG_MEMCG */
 
-#ifdef CONFIG_MEM_ALLOC_PROFILING
-static void pcpu_alloc_tag_alloc_hook(struct pcpu_chunk *chunk, int off,
-				      size_t size)
-{
-	if (mem_alloc_profiling_enabled() && likely(chunk->obj_exts)) {
-		alloc_tag_add(&chunk->obj_exts[off >> PCPU_MIN_ALLOC_SHIFT].tag,
-			      current->alloc_tag, size);
-	}
-}
-
-static void pcpu_alloc_tag_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
-{
-	if (mem_alloc_profiling_enabled() && likely(chunk->obj_exts))
-		alloc_tag_sub(&chunk->obj_exts[off >> PCPU_MIN_ALLOC_SHIFT].tag, size);
-}
-#else
-static void pcpu_alloc_tag_alloc_hook(struct pcpu_chunk *chunk, int off,
-				      size_t size)
-{
-}
-
-static void pcpu_alloc_tag_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
-{
-}
-#endif
 
 /**
  * pcpu_alloc - the percpu allocator
-- 
2.43.0



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [RFC PATCH 6/7] lib/percpu_counter: allow (un)charging percpu counters without alloc/free
  2025-04-24  8:07 [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu allocator scalability problem Harry Yoo
                   ` (4 preceding siblings ...)
  2025-04-24  8:07 ` [RFC PATCH 5/7] mm/percpu: allow (un)charging objects without alloc/free Harry Yoo
@ 2025-04-24  8:07 ` Harry Yoo
  2025-04-24  8:07 ` [RFC PATCH 7/7] kernel/fork: improve exec() throughput with slab ctor/dtor pair Harry Yoo
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 27+ messages in thread
From: Harry Yoo @ 2025-04-24  8:07 UTC (permalink / raw)
  To: Vlastimil Babka, Christoph Lameter, David Rientjes,
	Andrew Morton, Dennis Zhou, Tejun Heo, Mateusz Guzik
  Cc: Jamal Hadi Salim, Cong Wang, Jiri Pirko, Vlad Buslov,
	Yevgeny Kliteynik, Jan Kara, Byungchul Park, linux-mm, netdev,
	linux-kernel, Harry Yoo

Introduce percpu_counter_{charge,uncharge}_many() to allow explicit
charging and uncharging of percpu memory used by percpu_counter, without
requiring allocation and deallocation of the counters just to
(un)charge.

This is useful when percpu counters are preallocated and reused multiple
times, for example when a slab constructor allocates percpu counters
that are (un)charged many times over their lifetime and are only freed
by the destructor.
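
For reference, the resulting lifecycle for mm_struct's rss_stat,
condensed from the kernel/fork.c changes in patch 7 (error handling
omitted):

	/* slab constructor, once per object slot in a new slab */
	percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL, NR_MM_COUNTERS);

	/* mm_init(), on every object allocation */
	percpu_counter_charge_many(mm->rss_stat, GFP_KERNEL_ACCOUNT,
				   NR_MM_COUNTERS);

	/* __mmdrop(), on every object free */
	percpu_counter_uncharge_many(mm->rss_stat, NR_MM_COUNTERS);

	/* slab destructor, when the backing slab is freed */
	percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);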

Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
---
 include/linux/percpu_counter.h |  2 ++
 lib/percpu_counter.c           | 25 +++++++++++++++++++++++++
 2 files changed, 27 insertions(+)

diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
index 3a44dd1e33d2..6e6b0752b1e4 100644
--- a/include/linux/percpu_counter.h
+++ b/include/linux/percpu_counter.h
@@ -46,6 +46,8 @@ int __percpu_counter_init_many(struct percpu_counter *fbc, s64 amount,
 #define percpu_counter_init(fbc, value, gfp)				\
 	percpu_counter_init_many(fbc, value, gfp, 1)
 
+bool percpu_counter_charge_many(struct percpu_counter *fbc, gfp_t gfp, u32 nr_counters);
+void percpu_counter_uncharge_many(struct percpu_counter *fbc, u32 nr_counters);
 void percpu_counter_destroy_many(struct percpu_counter *fbc, u32 nr_counters);
 static inline void percpu_counter_destroy(struct percpu_counter *fbc)
 {
diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
index 2891f94a11c6..a9fe96787725 100644
--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -224,6 +224,31 @@ int __percpu_counter_init_many(struct percpu_counter *fbc, s64 amount,
 }
 EXPORT_SYMBOL(__percpu_counter_init_many);
 
+bool percpu_counter_charge_many(struct percpu_counter *fbc, gfp_t gfp, u32 nr_counters)
+{
+	s32 __percpu *counters;
+	size_t counter_size;
+	size_t charge_size;
+
+	counter_size = ALIGN(sizeof(*counters), __alignof__(*counters));
+	counters = fbc->counters;
+	charge_size = nr_counters * counter_size;
+
+	return pcpu_charge(counters, charge_size, gfp);
+}
+
+void percpu_counter_uncharge_many(struct percpu_counter *fbc, u32 nr_counters)
+{
+	s32 __percpu *counters;
+	size_t counter_size;
+	size_t uncharge_size;
+
+	counter_size = ALIGN(sizeof(*counters), __alignof__(*counters));
+	counters = fbc->counters;
+	uncharge_size = nr_counters * counter_size;
+	pcpu_uncharge(counters, uncharge_size);
+}
+
 void percpu_counter_destroy_many(struct percpu_counter *fbc, u32 nr_counters)
 {
 	unsigned long flags __maybe_unused;
-- 
2.43.0



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [RFC PATCH 7/7] kernel/fork: improve exec() throughput with slab ctor/dtor pair
  2025-04-24  8:07 [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu allocator scalability problem Harry Yoo
                   ` (5 preceding siblings ...)
  2025-04-24  8:07 ` [RFC PATCH 6/7] lib/percpu_counter: allow (un)charging percpu counters " Harry Yoo
@ 2025-04-24  8:07 ` Harry Yoo
  2025-04-24  9:29 ` [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu allocator scalability problem Mateusz Guzik
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 27+ messages in thread
From: Harry Yoo @ 2025-04-24  8:07 UTC (permalink / raw)
  To: Vlastimil Babka, Christoph Lameter, David Rientjes,
	Andrew Morton, Dennis Zhou, Tejun Heo, Mateusz Guzik
  Cc: Jamal Hadi Salim, Cong Wang, Jiri Pirko, Vlad Buslov,
	Yevgeny Kliteynik, Jan Kara, Byungchul Park, linux-mm, netdev,
	linux-kernel, Harry Yoo

When initializing a newly allocated mm_struct, mm_init() allocates
two chunks of percpu memory (pcpu_cid and rss_stat). Because the percpu
memory allocator uses a global mutex (pcpu_alloc_mutex), this becomes
a global serialization point for exec().

Use a slab ctor/dtor pair to allocate and free the percpu memory
(pcpu_cid, rss_stat) for mm_struct. mm_init_cid() is moved to mm_init(),
and the rss_stat percpu counters are charged in mm_init() and uncharged
in __mmdrop().

As the rss_stat and pcpu_cid fields must not be overwritten by the
memset() in mm_alloc() and the memcpy() in dup_mm(), move those fields
to the end of mm_struct. Any field defined after 'ctor_fields_offset'
is not overwritten by these helpers.

On the other hand, cpu_bitmap[] is not initialized by the constructor,
yet it must remain at the end of mm_struct. Since cpu_bitmap[] is always
initialized by mm_init(), skipping the memset() and memcpy() for this
field does not change its current behavior.

Note that check_mm() checks whether any rss counter is nonzero and
reports an error if one is found. In other words, under normal
conditions the counters are always zero when an mm_struct is freed.
It is therefore not necessary to reset the counters in mm_init() once
they have been initialized by the constructor.

To measure the performance impact, I ran the exec() benchmark [1], which
launches one process per CPU, each repeatedly invoking exec().

On a desktop with 12 cores (24 hardware threads), it raises
exec() throughput by an average of 4.56%. Even in a single-threaded run
I see roughly a 4% gain, showing that the cost of acquiring and releasing
pcpu_alloc_mutex is significant even when it is uncontended.

On a dual-socket server with 192 cores the mutex becomes a contention
hotspot; mitigating that contention boosts exec() throughput by 33%.

Link: http://apollo.backplane.com/DFlyMisc/doexec.c [1]
Link: https://lore.kernel.org/linux-mm/CAGudoHFc+Km-3usiy4Wdm1JkM+YjCgD9A8dDKQ06pZP070f1ig@mail.gmail.com
Suggested-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
---
 include/linux/mm_types.h | 40 +++++++++++++++++---------
 kernel/fork.c            | 62 +++++++++++++++++++++++++++-------------
 2 files changed, 69 insertions(+), 33 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 56d07edd01f9..3000ca47b8ba 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -946,14 +946,6 @@ struct mm_struct {
 		atomic_t mm_users;
 
 #ifdef CONFIG_SCHED_MM_CID
-		/**
-		 * @pcpu_cid: Per-cpu current cid.
-		 *
-		 * Keep track of the currently allocated mm_cid for each cpu.
-		 * The per-cpu mm_cid values are serialized by their respective
-		 * runqueue locks.
-		 */
-		struct mm_cid __percpu *pcpu_cid;
 		/*
 		 * @mm_cid_next_scan: Next mm_cid scan (in jiffies).
 		 *
@@ -982,6 +974,7 @@ struct mm_struct {
 		 * mm nr_cpus_allowed updates.
 		 */
 		raw_spinlock_t cpus_allowed_lock;
+		unsigned long _padding; /* for optimal offset of mmap_lock */
 #endif
 #ifdef CONFIG_MMU
 		atomic_long_t pgtables_bytes;	/* size of all page tables */
@@ -1059,8 +1052,6 @@ struct mm_struct {
 
 		unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
 
-		struct percpu_counter rss_stat[NR_MM_COUNTERS];
-
 		struct linux_binfmt *binfmt;
 
 		/* Architecture-specific MM context */
@@ -1169,6 +1160,30 @@ struct mm_struct {
 #endif /* CONFIG_MM_ID */
 	} __randomize_layout;
 
+	/*
+	 * The fields below are not initialized by memset() or copied
+	 * by memcpy(), in order to avoid overwriting values that are
+	 * initialized by the slab constructor.
+	 *
+	 * The last field, cpu_bitmap, is an exception. This field is not
+	 * initialized by the constructor and is always initialized by
+	 * the mm_init() function.
+	 */
+	union {
+		unsigned long ctor_fields_offset;
+		struct percpu_counter rss_stat[NR_MM_COUNTERS];
+	};
+#ifdef CONFIG_SCHED_MM_CID
+	/**
+	 * @pcpu_cid: Per-cpu current cid.
+	 *
+	 * Keep track of the currently allocated mm_cid for each cpu.
+	 * The per-cpu mm_cid values are serialized by their respective
+	 * runqueue locks.
+	 */
+	struct mm_cid __percpu *pcpu_cid;
+#endif
+
 	/*
 	 * The mm_cpumask needs to be at the end of mm_struct, because it
 	 * is dynamically sized based on nr_cpu_ids.
@@ -1348,12 +1363,11 @@ static inline void mm_init_cid(struct mm_struct *mm, struct task_struct *p)
 	cpumask_clear(mm_cidmask(mm));
 }
 
-static inline int mm_alloc_cid_noprof(struct mm_struct *mm, struct task_struct *p)
+static inline int mm_alloc_cid_noprof(struct mm_struct *mm)
 {
 	mm->pcpu_cid = alloc_percpu_noprof(struct mm_cid);
 	if (!mm->pcpu_cid)
 		return -ENOMEM;
-	mm_init_cid(mm, p);
 	return 0;
 }
 #define mm_alloc_cid(...)	alloc_hooks(mm_alloc_cid_noprof(__VA_ARGS__))
@@ -1383,7 +1397,7 @@ static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct cpumas
 }
 #else /* CONFIG_SCHED_MM_CID */
 static inline void mm_init_cid(struct mm_struct *mm, struct task_struct *p) { }
-static inline int mm_alloc_cid(struct mm_struct *mm, struct task_struct *p) { return 0; }
+static inline int mm_alloc_cid(struct mm_struct *mm) { return 0; }
 static inline void mm_destroy_cid(struct mm_struct *mm) { }
 
 static inline unsigned int mm_cid_size(void)
diff --git a/kernel/fork.c b/kernel/fork.c
index 7966b0876dc3..5940cf37379c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -943,8 +943,7 @@ void __mmdrop(struct mm_struct *mm)
 	check_mm(mm);
 	put_user_ns(mm->user_ns);
 	mm_pasid_drop(mm);
-	mm_destroy_cid(mm);
-	percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);
+	percpu_counter_uncharge_many(mm->rss_stat, NR_MM_COUNTERS);
 
 	free_mm(mm);
 }
@@ -1295,7 +1294,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	mm->map_count = 0;
 	mm->locked_vm = 0;
 	atomic64_set(&mm->pinned_vm, 0);
-	memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
 	spin_lock_init(&mm->page_table_lock);
 	spin_lock_init(&mm->arg_lock);
 	mm_init_cpumask(mm);
@@ -1328,21 +1326,17 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	if (init_new_context(p, mm))
 		goto fail_nocontext;
 
-	if (mm_alloc_cid(mm, p))
-		goto fail_cid;
-
-	if (percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL_ACCOUNT,
-				     NR_MM_COUNTERS))
-		goto fail_pcpu;
+	if (!percpu_counter_charge_many(mm->rss_stat, GFP_KERNEL_ACCOUNT,
+					NR_MM_COUNTERS))
+		goto failed_charge;
 
+	mm_init_cid(mm, p);
 	mm->user_ns = get_user_ns(user_ns);
 	lru_gen_init_mm(mm);
 	return mm;
 
-fail_pcpu:
+failed_charge:
 	mm_destroy_cid(mm);
-fail_cid:
-	destroy_context(mm);
 fail_nocontext:
 	mm_free_id(mm);
 fail_noid:
@@ -1363,7 +1357,7 @@ struct mm_struct *mm_alloc(void)
 	if (!mm)
 		return NULL;
 
-	memset(mm, 0, sizeof(*mm));
+	memset(mm, 0, offsetof(struct mm_struct, ctor_fields_offset));
 	return mm_init(mm, current, current_user_ns());
 }
 EXPORT_SYMBOL_IF_KUNIT(mm_alloc);
@@ -1725,7 +1719,7 @@ static struct mm_struct *dup_mm(struct task_struct *tsk,
 	if (!mm)
 		goto fail_nomem;
 
-	memcpy(mm, oldmm, sizeof(*mm));
+	memcpy(mm, oldmm, offsetof(struct mm_struct, ctor_fields_offset));
 
 	if (!mm_init(mm, tsk, mm->user_ns))
 		goto fail_nomem;
@@ -3193,9 +3187,40 @@ static int sighand_ctor(void *data)
 	return 0;
 }
 
+static int mm_struct_ctor(void *object)
+{
+	struct mm_struct *mm = object;
+
+	if (mm_alloc_cid(mm))
+		return -ENOMEM;
+
+	if (percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL,
+				     NR_MM_COUNTERS)) {
+		mm_destroy_cid(mm);
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static void mm_struct_dtor(void *object)
+{
+	struct mm_struct *mm = object;
+
+	mm_destroy_cid(mm);
+	percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);
+}
+
 void __init mm_cache_init(void)
 {
 	unsigned int mm_size;
+	struct kmem_cache_args kmem_args = {
+		.align = ARCH_MIN_MMSTRUCT_ALIGN,
+		.useroffset = offsetof(struct mm_struct, saved_auxv),
+		.usersize = sizeof_field(struct mm_struct, saved_auxv),
+		.ctor = mm_struct_ctor,
+		.dtor = mm_struct_dtor,
+	};
 
 	/*
 	 * The mm_cpumask is located at the end of mm_struct, and is
@@ -3204,12 +3229,9 @@ void __init mm_cache_init(void)
 	 */
 	mm_size = sizeof(struct mm_struct) + cpumask_size() + mm_cid_size();
 
-	mm_cachep = kmem_cache_create_usercopy("mm_struct",
-			mm_size, ARCH_MIN_MMSTRUCT_ALIGN,
-			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
-			offsetof(struct mm_struct, saved_auxv),
-			sizeof_field(struct mm_struct, saved_auxv),
-			NULL);
+	mm_cachep = kmem_cache_create("mm_struct", mm_size, &kmem_args,
+				      SLAB_HWCACHE_ALIGN|SLAB_PANIC|
+				      SLAB_ACCOUNT);
 }
 
 void __init proc_caches_init(void)
-- 
2.43.0



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu allocator scalability problem
  2025-04-24  8:07 [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu allocator scalability problem Harry Yoo
                   ` (6 preceding siblings ...)
  2025-04-24  8:07 ` [RFC PATCH 7/7] kernel/fork: improve exec() throughput with slab ctor/dtor pair Harry Yoo
@ 2025-04-24  9:29 ` Mateusz Guzik
  2025-04-24  9:58   ` Harry Yoo
  2025-04-24 11:28 ` Pedro Falcato
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 27+ messages in thread
From: Mateusz Guzik @ 2025-04-24  9:29 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Vlastimil Babka, Christoph Lameter, David Rientjes,
	Andrew Morton, Dennis Zhou, Tejun Heo, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, Vlad Buslov, Yevgeny Kliteynik, Jan Kara,
	Byungchul Park, linux-mm, netdev, linux-kernel

On Thu, Apr 24, 2025 at 10:08 AM Harry Yoo <harry.yoo@oracle.com> wrote:
>
> Overview
> ========
>
> The slab destructor feature existed in early days of slab allocator(s).
> It was removed by the commit c59def9f222d ("Slab allocators: Drop support
> for destructors") in 2007 due to lack of serious use cases at that time.
>
> Eighteen years later, Mateusz Guzik proposed [1] re-introducing a slab
> constructor/destructor pair to mitigate the global serialization point
> (pcpu_alloc_mutex) that occurs when each slab object allocates and frees
> percpu memory during its lifetime.
>
> Consider mm_struct: it allocates two percpu regions (mm_cid and rss_stat),
> so each allocate–free cycle requires two expensive acquire/release on
> that mutex.
>
> We can mitigate this contention by retaining the percpu regions after
> the object is freed and releasing them only when the backing slab pages
> are freed.
>
> How to do this with slab constructors and destructors: the constructor
> allocates percpu memory, and the destructor frees it when the slab pages
> are reclaimed; this slightly alters the constructor’s semantics,
> as it can now fail.
>
> This series is functional (although not compatible with MM debug
> features yet), but still far from perfect. I’m actively refining it and
> would appreciate early feedback before I improve it further. :)
>

Thanks for looking into this.

The dtor thing poses a potential problem where a dtor acquiring
arbitrary locks can result in a deadlock during memory reclaim.

So for this to be viable one needs to ensure that in the worst case
this only ever takes leaf-locks (i.e., locks which are last in any
dependency chain -- no locks are being taken if you hold one). The
series needs to demonstrate that the percpu path qualifies, or refactor
it until it does.


> This series is based on slab/for-next [2].
>
> Performance Improvement
> =======================
>
> I measured the benefit of this series for two different users:
> exec() and tc filter insertion/removal.
>
> exec() throughput
> -----------------
>
> The performance of exec() is important when short-lived processes are
> frequently created. For example: shell-heavy workloads and running many
> test cases [3].
>
> I measured exec() throughput with a microbenchmark:
>   - 33% of exec() throughput gain on 2-socket machine with 192 CPUs,
>   - 4.56% gain on a desktop with 24 hardware threads, and
>   - Even 4% gain on a single-threaded exec() throughput.
>
> Further investigation showed that this was due to the overhead of
> acquiring/releasing pcpu_alloc_mutex and its contention.
>
> See patch 7 for more detail on the experiment.
>
> Traffic Filter Insertion and Removal
> ------------------------------------
>
> Each tc filter allocates three percpu memory regions per tc_action object,
> so frequently inserting and removing filters contend heavily on the same
> mutex.
>
> In the Linux-kernel tools/testing tc-filter benchmark (see patch 4 for
> more detail), I observed a 26% reduction in system time and observed
> much less contention on pcpu_alloc_mutex with this series.
>
> I saw in old mailing list threads Mellanox (now NVIDIA) engineers cared
> about tc filter insertion rate; these changes may still benefit
> workloads they run today.
>
> Feedback Needed from Percpu Allocator Folks
> ===========================================
>
> As percpu allocator is directly affected by this series, this work
> will need support from the percpu allocator maintainers, and we need to
> address their concerns.
>
> They will probably say "This is a percpu memory allocator scalability
> issue and we need to make it scalable"? I don't know.
>
> What do you say?
>
> Some hanging thoughts:
> - Tackling the problem on the slab side is much simpler, because the slab
>   allocator already caches objects per CPU. Re-creating similar logic
>   inside the percpu allocator would be redundant.
>
>   Also, since this is opt-in per slab cache, other percpu allocator
>   users remain unaffected.
>
> - If fragmentation is a concern, we could probably allocate larger percpu
>   chunks and partition them for slab objects.
>
> - If memory overhead becomes an issue, we could introduce a shrinker
>   to free empty slabs (and thus releasing underlying percpu memory chunks).
>
> Patch Sequence
> ==============
>
> Patch #1 refactors freelist_shuffle() to allow the slab constructor to
> fail in the next patch.
>
> Patch #2 allows the slab constructor fail.
>
> Patch #3 implements the slab destructor feature.
>
> Patch #4 converts net/sched/act_api to use the slab ctor/dtor pair.
>
> Patch #5, #6 implements APIs to charge and uncharge percpu memory and
> percpu counter.
>
> Patch #7 converts mm_struct to use the slab ctor/dtor pair.
>
> Known issues with MM debug features
> ===================================
>
> The slab destructor feature is not yet compatible with KASAN, KMEMLEAK,
> and DEBUG_OBJECTS.
>
> KASAN reports an error when a percpu counter is inserted into the
> percpu_counters linked list because the counter has not been allocated
> yet.
>
> DEBUG_OBJECTS and KMEMLEAK complain when the slab object is freed, while
> the associated percpu memory is still resident in memory.
>
> I don't expect fixing these issues to be too difficult, but I need to
> think a little bit to fix it.
>
> [1] https://lore.kernel.org/linux-mm/CAGudoHFc+Km-3usiy4Wdm1JkM+YjCgD9A8dDKQ06pZP070f1ig@mail.gmail.com
>
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab.git/log/?h=slab/for-next
>
> [3] https://lore.kernel.org/linux-mm/20230608111408.s2minsenlcjow7q3@quack3
>
> [4] https://lore.kernel.org/netdev/vbfmunui7dm.fsf@mellanox.com
>
> Harry Yoo (7):
>   mm/slab: refactor freelist shuffle
>   treewide, slab: allow slab constructor to return an error
>   mm/slab: revive the destructor feature in slab allocator
>   net/sched/act_api: use slab ctor/dtor to reduce contention on pcpu
>     alloc
>   mm/percpu: allow (un)charging objects without alloc/free
>   lib/percpu_counter: allow (un)charging percpu counters without
>     alloc/free
>   kernel/fork: improve exec() throughput with slab ctor/dtor pair
>
>  arch/powerpc/include/asm/svm.h            |   2 +-
>  arch/powerpc/kvm/book3s_64_mmu_radix.c    |   3 +-
>  arch/powerpc/mm/init-common.c             |   3 +-
>  arch/powerpc/platforms/cell/spufs/inode.c |   3 +-
>  arch/powerpc/platforms/pseries/setup.c    |   2 +-
>  arch/powerpc/platforms/pseries/svm.c      |   4 +-
>  arch/sh/mm/pgtable.c                      |   3 +-
>  arch/sparc/mm/tsb.c                       |   8 +-
>  block/bdev.c                              |   3 +-
>  drivers/dax/super.c                       |   3 +-
>  drivers/gpu/drm/i915/i915_request.c       |   3 +-
>  drivers/misc/lkdtm/heap.c                 |  12 +--
>  drivers/usb/mon/mon_text.c                |   5 +-
>  fs/9p/v9fs.c                              |   3 +-
>  fs/adfs/super.c                           |   3 +-
>  fs/affs/super.c                           |   3 +-
>  fs/afs/super.c                            |   5 +-
>  fs/befs/linuxvfs.c                        |   3 +-
>  fs/bfs/inode.c                            |   3 +-
>  fs/btrfs/inode.c                          |   3 +-
>  fs/ceph/super.c                           |   3 +-
>  fs/coda/inode.c                           |   3 +-
>  fs/debugfs/inode.c                        |   3 +-
>  fs/dlm/lowcomms.c                         |   3 +-
>  fs/ecryptfs/main.c                        |   5 +-
>  fs/efs/super.c                            |   3 +-
>  fs/erofs/super.c                          |   3 +-
>  fs/exfat/cache.c                          |   3 +-
>  fs/exfat/super.c                          |   3 +-
>  fs/ext2/super.c                           |   3 +-
>  fs/ext4/super.c                           |   3 +-
>  fs/fat/cache.c                            |   3 +-
>  fs/fat/inode.c                            |   3 +-
>  fs/fuse/inode.c                           |   3 +-
>  fs/gfs2/main.c                            |   9 +-
>  fs/hfs/super.c                            |   3 +-
>  fs/hfsplus/super.c                        |   3 +-
>  fs/hpfs/super.c                           |   3 +-
>  fs/hugetlbfs/inode.c                      |   3 +-
>  fs/inode.c                                |   3 +-
>  fs/isofs/inode.c                          |   3 +-
>  fs/jffs2/super.c                          |   3 +-
>  fs/jfs/super.c                            |   3 +-
>  fs/minix/inode.c                          |   3 +-
>  fs/nfs/inode.c                            |   3 +-
>  fs/nfs/nfs42xattr.c                       |   3 +-
>  fs/nilfs2/super.c                         |   6 +-
>  fs/ntfs3/super.c                          |   3 +-
>  fs/ocfs2/dlmfs/dlmfs.c                    |   3 +-
>  fs/ocfs2/super.c                          |   3 +-
>  fs/openpromfs/inode.c                     |   3 +-
>  fs/orangefs/super.c                       |   3 +-
>  fs/overlayfs/super.c                      |   3 +-
>  fs/pidfs.c                                |   3 +-
>  fs/proc/inode.c                           |   3 +-
>  fs/qnx4/inode.c                           |   3 +-
>  fs/qnx6/inode.c                           |   3 +-
>  fs/romfs/super.c                          |   3 +-
>  fs/smb/client/cifsfs.c                    |   3 +-
>  fs/squashfs/super.c                       |   3 +-
>  fs/tracefs/inode.c                        |   3 +-
>  fs/ubifs/super.c                          |   3 +-
>  fs/udf/super.c                            |   3 +-
>  fs/ufs/super.c                            |   3 +-
>  fs/userfaultfd.c                          |   3 +-
>  fs/vboxsf/super.c                         |   3 +-
>  fs/xfs/xfs_super.c                        |   3 +-
>  include/linux/mm_types.h                  |  40 ++++++---
>  include/linux/percpu.h                    |  10 +++
>  include/linux/percpu_counter.h            |   2 +
>  include/linux/slab.h                      |  21 +++--
>  ipc/mqueue.c                              |   3 +-
>  kernel/fork.c                             |  65 +++++++++-----
>  kernel/rcu/refscale.c                     |   3 +-
>  lib/percpu_counter.c                      |  25 ++++++
>  lib/radix-tree.c                          |   3 +-
>  lib/test_meminit.c                        |   3 +-
>  mm/kfence/kfence_test.c                   |   5 +-
>  mm/percpu.c                               |  79 ++++++++++------
>  mm/rmap.c                                 |   3 +-
>  mm/shmem.c                                |   3 +-
>  mm/slab.h                                 |  11 +--
>  mm/slab_common.c                          |  43 +++++----
>  mm/slub.c                                 | 105 ++++++++++++++++------
>  net/sched/act_api.c                       |  82 +++++++++++------
>  net/socket.c                              |   3 +-
>  net/sunrpc/rpc_pipe.c                     |   3 +-
>  security/integrity/ima/ima_iint.c         |   3 +-
>  88 files changed, 518 insertions(+), 226 deletions(-)
>
> --
> 2.43.0
>


-- 
Mateusz Guzik <mjguzik gmail.com>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu allocator scalability problem
  2025-04-24  9:29 ` [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu allocator scalability problem Mateusz Guzik
@ 2025-04-24  9:58   ` Harry Yoo
  2025-04-24 15:00     ` Mateusz Guzik
  0 siblings, 1 reply; 27+ messages in thread
From: Harry Yoo @ 2025-04-24  9:58 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Vlastimil Babka, Christoph Lameter, David Rientjes,
	Andrew Morton, Dennis Zhou, Tejun Heo, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, Vlad Buslov, Yevgeny Kliteynik, Jan Kara,
	Byungchul Park, linux-mm, netdev, linux-kernel

On Thu, Apr 24, 2025 at 11:29:13AM +0200, Mateusz Guzik wrote:
> On Thu, Apr 24, 2025 at 10:08 AM Harry Yoo <harry.yoo@oracle.com> wrote:
> >
> > Overview
> > ========
> >
> > The slab destructor feature existed in early days of slab allocator(s).
> > It was removed by the commit c59def9f222d ("Slab allocators: Drop support
> > for destructors") in 2007 due to lack of serious use cases at that time.
> >
> > Eighteen years later, Mateusz Guzik proposed [1] re-introducing a slab
> > constructor/destructor pair to mitigate the global serialization point
> > (pcpu_alloc_mutex) that occurs when each slab object allocates and frees
> > percpu memory during its lifetime.
> >
> > Consider mm_struct: it allocates two percpu regions (mm_cid and rss_stat),
> > so each allocate–free cycle requires two expensive acquire/release on
> > that mutex.
> >
> > We can mitigate this contention by retaining the percpu regions after
> > the object is freed and releasing them only when the backing slab pages
> > are freed.
> >
> > How to do this with slab constructors and destructors: the constructor
> > allocates percpu memory, and the destructor frees it when the slab pages
> > are reclaimed; this slightly alters the constructor’s semantics,
> > as it can now fail.
> >
> > This series is functional (although not compatible with MM debug
> > features yet), but still far from perfect. I’m actively refining it and
> > would appreciate early feedback before I improve it further. :)
> >
> 
> Thanks for looking into this.

You're welcome. Thanks for the proposal.

> The dtor thing poses a potential problem where a dtor acquiring
> arbitrary locks can result in a deadlock during memory reclaim.

AFAICT, MM does not reclaim slab memory unless we register a shrinker
to reclaim it. Or am I missing something?

Hmm let's say it does anyway, then is this what you worry about?

someone requests percpu memory
-> percpu allocator takes a lock (e.g., pcpu_alloc_mutex)
-> allocates pages from buddy
-> buddy reclaims slab memory
-> slab destructor tries to take pcpu_alloc_mutex (deadlock!)

> So for this to be viable one needs to ensure that in the worst case
> this only ever takes leaf-locks (i.e., locks which are last in any
> dependency chain -- no locks are being taken if you hold one).

Oh, then you can't allocate memory while holding pcpu_lock or
pcpu_alloc_mutex?

> This
> needs to demonstrate the percpu thing qualifies or needs to refactor
> it to that extent.
>
> > [rest of the quoted cover letter and diffstat snipped; quoted in full earlier in the thread]
> 
> 
> -- 
> Mateusz Guzik <mjguzik gmail.com>

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu allocator scalability problem
  2025-04-24  8:07 [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu allocator scalability problem Harry Yoo
                   ` (7 preceding siblings ...)
  2025-04-24  9:29 ` [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu allocator scalability problem Mateusz Guzik
@ 2025-04-24 11:28 ` Pedro Falcato
  2025-04-24 15:20   ` Mateusz Guzik
  2025-04-25 10:12   ` Harry Yoo
  2025-04-24 15:50 ` Christoph Lameter (Ampere)
  2025-04-24 18:47 ` Tejun Heo
  10 siblings, 2 replies; 27+ messages in thread
From: Pedro Falcato @ 2025-04-24 11:28 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Vlastimil Babka, Christoph Lameter, David Rientjes,
	Andrew Morton, Dennis Zhou, Tejun Heo, Mateusz Guzik,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Vlad Buslov,
	Yevgeny Kliteynik, Jan Kara, Byungchul Park, linux-mm, netdev,
	linux-kernel

On Thu, Apr 24, 2025 at 05:07:48PM +0900, Harry Yoo wrote:
> Overview
> ========
> 
> The slab destructor feature existed in early days of slab allocator(s).
> It was removed by the commit c59def9f222d ("Slab allocators: Drop support
> for destructors") in 2007 due to lack of serious use cases at that time.
> 
> Eighteen years later, Mateusz Guzik proposed [1] re-introducing a slab
> constructor/destructor pair to mitigate the global serialization point
> (pcpu_alloc_mutex) that occurs when each slab object allocates and frees
> percpu memory during its lifetime.
> 
> Consider mm_struct: it allocates two percpu regions (mm_cid and rss_stat),
> so each allocate–free cycle requires two expensive acquire/release on
> that mutex.
> 
> We can mitigate this contention by retaining the percpu regions after
> the object is freed and releasing them only when the backing slab pages
> are freed.
> 
> How to do this with slab constructors and destructors: the constructor
> allocates percpu memory, and the destructor frees it when the slab pages
> are reclaimed; this slightly alters the constructor’s semantics,
> as it can now fail.
> 

I really really really really don't like this. We're opening a pandora's box
of locking issues for slab deadlocks and other subtle issues. IMO the best
solution there would be, what, failing dtors? which says a lot about the whole
situation...

Case in point:
What happens if you allocate a slab and start ->ctor()-ing objects, and then
one of the ctors fails? We need to free the slab, but not without ->dtor()-ing
everything back (AIUI this is not handled in this series, yet). Besides this
complication, if failing dtors were added into the mix, we'd be left with a
half-initialized slab(!!) in the middle of the cache waiting to get freed,
without being able to.
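
Just to make the rollback concrete, the new-slab path would need something
like this (a sketch only; slab_objects(), nth_obj() and the failing-ctor
signature are made up, not real SLUB internals):

static int slab_construct_objects(struct kmem_cache *s, struct slab *slab)
{
        unsigned int i, j;
        int err;

        for (i = 0; i < slab_objects(slab); i++) {
                err = s->ctor(nth_obj(s, slab, i));
                if (!err)
                        continue;
                /* Roll back the objects we already constructed... */
                for (j = 0; j < i; j++)
                        s->dtor(nth_obj(s, slab, j));
                /* ...so the whole slab can go back to the page allocator. */
                return err;
        }
        return 0;
}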

Then there are obviously other problems like: whatever you're calling must
not ever require the slab allocator (directly or indirectly) and must not
do direct reclaim (ever!), at the risk of a deadlock. The pcpu allocator
is a no-go (AIUI!) already because of such issues.

Then there's the separate (but adjacent, particularly as we're considering
this series due to performance improvements) issue that the ctor() and
dtor() interfaces are terrible, in the sense that they do not let you batch
in any way shape or form (requiring us to lock/unlock many times, allocate
many times, etc). If this is done for performance improvements, I would prefer
a superior ctor/dtor interface that takes something like a slab iterator and
lets you do these things.
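
Concretely, I'm thinking of something along these lines -- entirely
hypothetical, none of these hooks exist today:

/*
 * Hypothetical batched interface: the callback sees all objects of a
 * freshly allocated (or about-to-be-freed) slab at once, so a consumer
 * can take its locks and allocate backing resources once per slab
 * instead of once per object.
 */
struct slab_obj_iter;

void *slab_obj_iter_next(struct slab_obj_iter *it);     /* NULL when done */

struct kmem_cache_batch_ops {
        int  (*ctor_batch)(struct kmem_cache *s, struct slab_obj_iter *it);
        void (*dtor_batch)(struct kmem_cache *s, struct slab_obj_iter *it);
};

/* Example consumer: one trip through the expensive path per slab. */
static int foo_ctor_batch(struct kmem_cache *s, struct slab_obj_iter *it)
{
        void *obj;

        /* take the expensive lock / reserve backing memory once ... */
        while ((obj = slab_obj_iter_next(it))) {
                /* ... and hand a piece of it to each object here */
        }
        /* drop the lock */
        return 0;
}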

The ghost of 1992 Solaris still haunts us...

> This series is functional (although not compatible with MM debug
> features yet), but still far from perfect. I’m actively refining it and
> would appreciate early feedback before I improve it further. :)
> 
> This series is based on slab/for-next [2].
> 
> Performance Improvement
> =======================
> 
> I measured the benefit of this series for two different users:
> exec() and tc filter insertion/removal.
> 
> exec() throughput
> -----------------
> 
> The performance of exec() is important when short-lived processes are
> frequently created. For example: shell-heavy workloads and running many
> test cases [3].
> 
> I measured exec() throughput with a microbenchmark:
>   - 33% of exec() throughput gain on 2-socket machine with 192 CPUs,
>   - 4.56% gain on a desktop with 24 hardware threads, and
>   - Even 4% gain on a single-threaded exec() throughput.
> 
> Further investigation showed that this was due to the overhead of
> acquiring/releasing pcpu_alloc_mutex and its contention.
> 
> See patch 7 for more detail on the experiment.
> 
> Traffic Filter Insertion and Removal
> ------------------------------------
> 
> Each tc filter allocates three percpu memory regions per tc_action object,
> so frequently inserting and removing filters contend heavily on the same
> mutex.
> 
> In the Linux-kernel tools/testing tc-filter benchmark (see patch 4 for
> more detail), I observed a 26% reduction in system time and observed
> much less contention on pcpu_alloc_mutex with this series.
> 
> I saw in old mailing list threads Mellanox (now NVIDIA) engineers cared
> about tc filter insertion rate; these changes may still benefit
> workloads they run today.
> 

The performance improvements are obviously fantastic, but I do wonder
if things could be fixed by just fixing the underlying problems, instead
of papering over them with slab allocator magic and dubious object lifecycles.

In this case, the big issue is that the pcpu allocator does not scale well.

-- 
Pedro


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu allocator scalability problem
  2025-04-24  9:58   ` Harry Yoo
@ 2025-04-24 15:00     ` Mateusz Guzik
  0 siblings, 0 replies; 27+ messages in thread
From: Mateusz Guzik @ 2025-04-24 15:00 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Vlastimil Babka, Christoph Lameter, David Rientjes,
	Andrew Morton, Dennis Zhou, Tejun Heo, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, Vlad Buslov, Yevgeny Kliteynik, Jan Kara,
	Byungchul Park, linux-mm, netdev, linux-kernel

On Thu, Apr 24, 2025 at 11:58 AM Harry Yoo <harry.yoo@oracle.com> wrote:
>
> On Thu, Apr 24, 2025 at 11:29:13AM +0200, Mateusz Guzik wrote:
> > On Thu, Apr 24, 2025 at 10:08 AM Harry Yoo <harry.yoo@oracle.com> wrote:
> > >
> > > Overview
> > > ========
> > >
> > > The slab destructor feature existed in early days of slab allocator(s).
> > > It was removed by the commit c59def9f222d ("Slab allocators: Drop support
> > > for destructors") in 2007 due to lack of serious use cases at that time.
> > >
> > > Eighteen years later, Mateusz Guzik proposed [1] re-introducing a slab
> > > constructor/destructor pair to mitigate the global serialization point
> > > (pcpu_alloc_mutex) that occurs when each slab object allocates and frees
> > > percpu memory during its lifetime.
> > >
> > > Consider mm_struct: it allocates two percpu regions (mm_cid and rss_stat),
> > > so each allocate–free cycle requires two expensive acquire/release on
> > > that mutex.
> > >
> > > We can mitigate this contention by retaining the percpu regions after
> > > the object is freed and releasing them only when the backing slab pages
> > > are freed.
> > >
> > > How to do this with slab constructors and destructors: the constructor
> > > allocates percpu memory, and the destructor frees it when the slab pages
> > > are reclaimed; this slightly alters the constructor’s semantics,
> > > as it can now fail.
> > >
> > > This series is functional (although not compatible with MM debug
> > > features yet), but still far from perfect. I’m actively refining it and
> > > would appreciate early feedback before I improve it further. :)
> > >
> >
> > Thanks for looking into this.
>
> You're welcome. Thanks for the proposal.
>
> > The dtor thing poses a potential problem where a dtor acquiring
> > arbitrary locks can result in a deadlock during memory reclaim.
>
> AFAICT, MM does not reclaim slab memory unless we register a shrinker
> to reclaim it. Or am I missing something?
>
> Hmm let's say it does anyway, then is this what you worry about?
>
> someone requests percpu memory
> -> percpu allocator takes a lock (e.g., pcpu_alloc_mutex)
> -> allocates pages from buddy
> -> buddy reclaims slab memory
> -> slab destructor tries to take pcpu_alloc_mutex (deadlock!)
>
> > So for this to be viable one needs to ensure that in the worst case
> > this only ever takes leaf-locks (i.e., locks which are last in any
> > dependency chain -- no locks are being taken if you hold one).
>
> Oh, then you can't allocate memory while holding pcpu_lock or
> pcpu_alloc_mutex?
>

It should be perfectly fine to allocate memory with pcpu_alloc_mutex
held, as it is not an inherent part of reclaim.

The part used by the dtor would be the spinlock already present in the
percpu allocator.

The idea would be that the mutex-protected path preps everything as
needed, while synchronisation with freeing the pcpu areas would only
need the leaf spinlock.

So no issues there, provided some care is employed.
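
In other words, a split along these lines (sketch only; the helper names
are made up, but the release side is roughly what free_percpu() already
does today):

/* Allocation side: may sleep, may take pcpu_alloc_mutex, may call into
 * the page allocator -- fine, because it runs from the ctor, never from
 * reclaim. */
void __percpu *pcpu_prepare_area(size_t size, gfp_t gfp)
{
        void __percpu *p;

        mutex_lock(&pcpu_alloc_mutex);
        p = pcpu_do_alloc(size, gfp);           /* may create chunks, map pages */
        mutex_unlock(&pcpu_alloc_mutex);
        return p;
}

/* Release side, callable from the dtor during reclaim: only the leaf
 * spinlock; the area is just marked free in the chunk map and heavier
 * cleanup is deferred to the balance worker. */
void pcpu_release_area(void __percpu *p)
{
        unsigned long flags;

        spin_lock_irqsave(&pcpu_lock, flags);
        pcpu_mark_area_free(p);                 /* bitmap update only */
        pcpu_kick_balance_worker();             /* deferred cleanup */
        spin_unlock_irqrestore(&pcpu_lock, flags);
}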

> > This
> > needs to demonstrate the percpu thing qualifies or needs to refactor
> > it to that extent.
> >
> > > [rest of the quoted cover letter and diffstat snipped; quoted in full earlier in the thread]
> >
> >
> > --
> > Mateusz Guzik <mjguzik gmail.com>
>
> --
> Cheers,
> Harry / Hyeonggon



-- 
Mateusz Guzik <mjguzik gmail.com>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu allocator scalability problem
  2025-04-24 11:28 ` Pedro Falcato
@ 2025-04-24 15:20   ` Mateusz Guzik
  2025-04-24 16:11     ` Mateusz Guzik
  2025-04-25  7:40     ` Harry Yoo
  2025-04-25 10:12   ` Harry Yoo
  1 sibling, 2 replies; 27+ messages in thread
From: Mateusz Guzik @ 2025-04-24 15:20 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Harry Yoo, Vlastimil Babka, Christoph Lameter, David Rientjes,
	Andrew Morton, Dennis Zhou, Tejun Heo, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, Vlad Buslov, Yevgeny Kliteynik, Jan Kara,
	Byungchul Park, linux-mm, netdev, linux-kernel

On Thu, Apr 24, 2025 at 1:28 PM Pedro Falcato <pfalcato@suse.de> wrote:
> > How to do this with slab constructors and destructors: the constructor
> > allocates percpu memory, and the destructor frees it when the slab pages
> > are reclaimed; this slightly alters the constructor’s semantics,
> > as it can now fail.
> >
>
> I really really really really don't like this. We're opening a pandora's box
> of locking issues for slab deadlocks and other subtle issues. IMO the best
> solution there would be, what, failing dtors? which says a lot about the whole
> situation...
>

I noted the need to use leaf spinlocks in my IRC conversations with
Harry and later in this very thread; it is a bummer this bit did not
make it into the cover letter -- it might have avoided this
exchange.

I'm going to summarize this again here:

By API contract the dtor can only take a leaf spinlock, in this case one which:
1. disables irqs
2. is the last lock in the dependency chain, as in no locks are taken
while holding it

That way there is no possibility of a deadlock.
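
A conforming dtor for the percpu case would then be something like this
(a sketch; 'struct foo' and the field name stand in for the real consumer,
e.g. mm_struct). As far as I can tell, free_percpu() already only takes the
percpu allocator's internal irq-safe spinlock and defers the heavier work,
so it fits the contract -- provided that lock is kept a leaf lock:

struct foo {
        void __percpu *pcpu_state;
        /* ... */
};

static void foo_dtor(void *object)
{
        struct foo *foo = object;

        /* No sleeping, no non-leaf locks: free_percpu() only takes the
         * percpu allocator's internal spinlock with irqs disabled. */
        free_percpu(foo->pcpu_state);
}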

This poses the question of how to enforce it, and that bit is easy: for
example, one can add a leaf-spinlock notion to lockdep. Then lockdep will
complain immediately about misuse on the allocation side, even without
triggering reclaim. Further, if one felt so inclined, a test
module could walk the list of all slab caches and do a populate/reclaim
cycle on those with a ctor/dtor pair.

Then there is the matter of particular consumers being ready to do
what they need to on the dtor side only with the spinlock. Does not
sound like a fundamental problem.

> Case in point:
> What happens if you allocate a slab and start ->ctor()-ing objects, and then
> one of the ctors fails? We need to free the slab, but not without ->dtor()-ing
> everything back (AIUI this is not handled in this series, yet). Besides this
> complication, if failing dtors were added into the mix, we'd be left with a
> half-initialized slab(!!) in the middle of the cache waiting to get freed,
> without being able to.
>

Per my previous paragraph failing dtors would be a self-induced problem.

I can agree one has to roll things back if ctors don't work out, but I
don't think this poses a significant problem.

> Then there are obviously other problems like: whatever you're calling must
> not ever require the slab allocator (directly or indirectly) and must not
> do direct reclaim (ever!), at the risk of a deadlock. The pcpu allocator
> is a no-go (AIUI!) already because of such issues.
>

I don't see how that's true.

> Then there's the separate (but adjacent, particularly as we're considering
> this series due to performance improvements) issue that the ctor() and
> dtor() interfaces are terrible, in the sense that they do not let you batch
> in any way shape or form (requiring us to lock/unlock many times, allocate
> many times, etc). If this is done for performance improvements, I would prefer
> a superior ctor/dtor interface that takes something like a slab iterator and
> lets you do these things.
>

Batching this is also something I mentioned and indeed is a "nice to
have" change. Note however that the work you are suggesting to batch
currently happens on every alloc/free cycle, so doing it once per
creation of a given object instead is already a win.

-- 
Mateusz Guzik <mjguzik gmail.com>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu allocator scalability problem
  2025-04-24  8:07 [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu allocator scalability problem Harry Yoo
                   ` (8 preceding siblings ...)
  2025-04-24 11:28 ` Pedro Falcato
@ 2025-04-24 15:50 ` Christoph Lameter (Ampere)
  2025-04-24 16:03   ` Mateusz Guzik
  2025-04-24 18:47 ` Tejun Heo
  10 siblings, 1 reply; 27+ messages in thread
From: Christoph Lameter (Ampere) @ 2025-04-24 15:50 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Vlastimil Babka, David Rientjes, Andrew Morton, Dennis Zhou,
	Tejun Heo, Mateusz Guzik, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Vlad Buslov, Yevgeny Kliteynik, Jan Kara,
	Byungchul Park, linux-mm, netdev, linux-kernel


On Thu, 24 Apr 2025, Harry Yoo wrote:

> Consider mm_struct: it allocates two percpu regions (mm_cid and rss_stat),
> so each allocate–free cycle requires two expensive acquire/release on
> that mutex.

> We can mitigate this contention by retaining the percpu regions after
> the object is freed and releasing them only when the backing slab pages
> are freed.

Could you keep a cache of recently used per cpu regions so that you can
avoid frequent percpu allocation operation?

You could allocate larger percpu areas for a batch of them and
then assign as needed.
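
Something simple along these lines, as a sketch (made-up names, one size
and a global lock just to show the idea; a batched refill, or carving a
larger percpu area into slots, would go in the miss path):

/* Sketch only: a tiny cache of recently freed percpu regions of one
 * fixed size.  get/put only take a spinlock; the expensive percpu
 * allocation is hit only on a miss. */
#define PCPU_CACHED_SIZE        64      /* bytes; one cache per size in reality */
#define PCPU_CACHED_MAX         16

static void __percpu *pcpu_cached[PCPU_CACHED_MAX];
static unsigned int pcpu_cached_nr;
static DEFINE_SPINLOCK(pcpu_cache_lock);

void __percpu *pcpu_cache_get(gfp_t gfp)
{
        void __percpu *p = NULL;
        unsigned long flags;

        spin_lock_irqsave(&pcpu_cache_lock, flags);
        if (pcpu_cached_nr)
                p = pcpu_cached[--pcpu_cached_nr];
        spin_unlock_irqrestore(&pcpu_cache_lock, flags);

        return p ? p : __alloc_percpu_gfp(PCPU_CACHED_SIZE, __alignof__(u64), gfp);
}

void pcpu_cache_put(void __percpu *p)
{
        unsigned long flags;

        spin_lock_irqsave(&pcpu_cache_lock, flags);
        if (pcpu_cached_nr < PCPU_CACHED_MAX) {
                pcpu_cached[pcpu_cached_nr++] = p;
                p = NULL;
        }
        spin_unlock_irqrestore(&pcpu_cache_lock, flags);

        if (p)
                free_percpu(p);
}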

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu allocator scalability problem
  2025-04-24 15:50 ` Christoph Lameter (Ampere)
@ 2025-04-24 16:03   ` Mateusz Guzik
  2025-04-24 16:39     ` Christoph Lameter (Ampere)
  0 siblings, 1 reply; 27+ messages in thread
From: Mateusz Guzik @ 2025-04-24 16:03 UTC (permalink / raw)
  To: Christoph Lameter (Ampere)
  Cc: Harry Yoo, Vlastimil Babka, David Rientjes, Andrew Morton,
	Dennis Zhou, Tejun Heo, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	Vlad Buslov, Yevgeny Kliteynik, Jan Kara, Byungchul Park,
	linux-mm, netdev, linux-kernel

On Thu, Apr 24, 2025 at 5:50 PM Christoph Lameter (Ampere)
<cl@gentwo.org> wrote:
>
> On Thu, 24 Apr 2025, Harry Yoo wrote:
>
> > Consider mm_struct: it allocates two percpu regions (mm_cid and rss_stat),
> > so each allocate–free cycle requires two expensive acquire/release on
> > that mutex.
>
> > We can mitigate this contention by retaining the percpu regions after
> > the object is freed and releasing them only when the backing slab pages
> > are freed.
>
> Could you keep a cache of recently used per cpu regions so that you can
> avoid frequent percpu allocation operation?
>
> You could allocate larger percpu areas for a batch of them and
> then assign as needed.

I was considering a mechanism like that earlier, but the changes
needed to make it happen would result in worse state for the
alloc/free path.

RSS counters are embedded into mm with only the per-cpu areas being a
pointer. The machinery maintains a global list of all of their
instances, i.e. the pointers internal to mm_struct. That is to say
even if you deserialized allocation of percpu memory itself, you would
still globally serialize on adding/removing the counters to the global
list.
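
For reference, the bit I mean is roughly this -- simplified from memory
from lib/percpu_counter.c, so details may not match the current tree:

static LIST_HEAD(percpu_counters);              /* for cpu hotplug */
static DEFINE_SPINLOCK(percpu_counters_lock);

int percpu_counter_init_sketch(struct percpu_counter *fbc, s64 amount, gfp_t gfp)
{
        fbc->count = amount;
        /* pcpu_alloc_mutex taken inside the percpu allocator */
        fbc->counters = alloc_percpu_gfp(s32, gfp);
        if (!fbc->counters)
                return -ENOMEM;

        /* ...and a second global lock to put the counter on the list */
        spin_lock_irq(&percpu_counters_lock);
        list_add(&fbc->list, &percpu_counters);
        spin_unlock_irq(&percpu_counters_lock);
        return 0;
}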

But suppose this got reworked somehow and this bit ceases to be a problem.

Another spot where mm alloc/free globally serializes (at least on
x86_64) is pgd_alloc/free on the global pgd_lock.

Suppose you managed to decompose the lock into a finer granularity, to
the point where it does not pose a problem from contention standpoint.
Even then that's work which does not have to happen there.

General theme is there is a lot of expensive work happening when
dealing with mm lifecycle (*both* from single- and multi-threaded
standpoint) and preferably it would only be dealt with once per
object's existence.
-- 
Mateusz Guzik <mjguzik gmail.com>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu allocator scalability problem
  2025-04-24 15:20   ` Mateusz Guzik
@ 2025-04-24 16:11     ` Mateusz Guzik
  2025-04-25  7:40     ` Harry Yoo
  1 sibling, 0 replies; 27+ messages in thread
From: Mateusz Guzik @ 2025-04-24 16:11 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Harry Yoo, Vlastimil Babka, Christoph Lameter, David Rientjes,
	Andrew Morton, Dennis Zhou, Tejun Heo, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, Vlad Buslov, Yevgeny Kliteynik, Jan Kara,
	Byungchul Park, linux-mm, netdev, linux-kernel

On Thu, Apr 24, 2025 at 5:20 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On Thu, Apr 24, 2025 at 1:28 PM Pedro Falcato <pfalcato@suse.de> wrote:
> > > How to do this with slab constructors and destructors: the constructor
> > > allocates percpu memory, and the destructor frees it when the slab pages
> > > are reclaimed; this slightly alters the constructor’s semantics,
> > > as it can now fail.
> > >
> >
> > I really really really really don't like this. We're opening a pandora's box
> > of locking issues for slab deadlocks and other subtle issues. IMO the best
> > solution there would be, what, failing dtors? which says a lot about the whole
> > situation...
> >
>
> I noted the need to use leaf spinlocks in my IRC conversations with
> Harry and later in this very thread; it is a bummer this bit did not
> make it into the cover letter -- it might have avoided this
> exchange.
>
> I'm going to summarize this again here:
>
> By API contract the dtor can only take a leaf spinlock, in this case one which:
> 1. disables irqs
> 2. is the last lock in the dependency chain, as in no locks are taken
> while holding it
>
> That way there is no possibility of a deadlock.
>
> This poses the question of how to enforce it, and that bit is easy: for
> example, one can add a leaf-spinlock notion to lockdep. Then lockdep will
> complain immediately about misuse on the allocation side, even without
> triggering reclaim. Further, if one felt so inclined, a test
> module could walk the list of all slab caches and do a populate/reclaim
> cycle on those with a ctor/dtor pair.
>
> Then there is the matter of particular consumers being ready to do
> what they need to on the dtor side only with the spinlock. Does not
> sound like a fundamental problem.
>
> > Case in point:
> > What happens if you allocate a slab and start ->ctor()-ing objects, and then
> > one of the ctors fails? We need to free the slab, but not without ->dtor()-ing
> > everything back (AIUI this is not handled in this series, yet). Besides this
> > complication, if failing dtors were added into the mix, we'd be left with a
> > half-initialized slab(!!) in the middle of the cache waiting to get freed,
> > without being able to.
> >
>
> Per my previous paragraph failing dtors would be a self-induced problem.
>
> I can agree one has to roll things back if ctors don't work out, but I
> don't think this poses a significant problem.
>
> > Then there are obviously other problems like: whatever you're calling must
> > not ever require the slab allocator (directly or indirectly) and must not
> > do direct reclaim (ever!), at the risk of a deadlock. The pcpu allocator
> > is a no-go (AIUI!) already because of such issues.
> >
>
> I don't see how that's true.
>
> > Then there's the separate (but adjacent, particularly as we're considering
> > this series due to performance improvements) issue that the ctor() and
> > dtor() interfaces are terrible, in the sense that they do not let you batch
> > in any way shape or form (requiring us to lock/unlock many times, allocate
> > many times, etc). If this is done for performance improvements, I would prefer
> > a superior ctor/dtor interface that takes something like a slab iterator and
> > lets you do these things.
> >
>
> Batching this is also something I mentioned and indeed is a "nice to
> have" change. Note however that the work you are suggesting to batch
> currently happens on every alloc/free cycle, so doing it once per
> creation of a given object instead is already a win.
>

Whether the ctor/dtor thing lands or not, I would like to point out
the current state is quite nasty and something(tm) needs to be done.

The mm object is allocated from a per-cpu cache, only to have the
mandatory initialization globally serialize *several* times, including
twice on the same lock. This is so bad that performance would be
better if someone created a globally-locked cache with mms still
holding onto per-cpu memory et al.

Or to put it differently, existence of a per-cpu cache of mm objs is
defeated by the global locking endured for each alloc/free cycle (and
this goes beyond percpu memory allocs for counters).

So another idea would be to instead create a cache with *some*
granularity (say 4 or 8 cpu threads per instance). Note this should
reduce the total number of mms allocated (but unused) in the system.
If the mms sitting in such a cache were still populated like in this
patchset, perhaps the reduction in "wasted" objects would be
sufficient to ignore direct reclaim? If need be, they could instead be
reclaimed from a dedicated thread (whatever that is in Linux).
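
A rough sketch of what I mean (entirely hypothetical):

/* One small cache of populated mm objects per group of N cpus: the lock
 * is only shared by a handful of threads and the number of idle (but
 * fully constructed) objects in the system stays bounded. */
#define MM_CACHE_GROUP_CPUS     8
#define MM_CACHE_MAX            4

struct mm_obj_cache {
        spinlock_t      lock;
        unsigned int    nr;
        void            *objs[MM_CACHE_MAX];
};

static struct mm_obj_cache *mm_obj_caches;      /* one per cpu group */

static void *mm_obj_cache_get(void)
{
        struct mm_obj_cache *c;
        void *mm = NULL;

        c = &mm_obj_caches[raw_smp_processor_id() / MM_CACHE_GROUP_CPUS];
        spin_lock(&c->lock);
        if (c->nr)
                mm = c->objs[--c->nr];
        spin_unlock(&c->lock);

        /* NULL means: fall back to the slow, globally serialized path */
        return mm;
}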

-- 
Mateusz Guzik <mjguzik gmail.com>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu allocator scalability problem
  2025-04-24 16:03   ` Mateusz Guzik
@ 2025-04-24 16:39     ` Christoph Lameter (Ampere)
  2025-04-24 17:26       ` Mateusz Guzik
  0 siblings, 1 reply; 27+ messages in thread
From: Christoph Lameter (Ampere) @ 2025-04-24 16:39 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Harry Yoo, Vlastimil Babka, David Rientjes, Andrew Morton,
	Dennis Zhou, Tejun Heo, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	Vlad Buslov, Yevgeny Kliteynik, Jan Kara, Byungchul Park,
	linux-mm, netdev, linux-kernel

On Thu, 24 Apr 2025, Mateusz Guzik wrote:

> > You could allocate larger percpu areas for a batch of them and
> > then assign as needed.
>
> I was considering a mechanism like that earlier, but the changes
> needed to make it happen would result in worse state for the
> alloc/free path.
>
> RSS counters are embedded into mm with only the per-cpu areas being a
> pointer. The machinery maintains a global list of all of their
> > instances, i.e. the pointers internal to mm_struct. That is to say
> even if you deserialized allocation of percpu memory itself, you would
> still globally serialize on adding/removing the counters to the global
> list.
>
> But suppose this got reworked somehow and this bit ceases to be a problem.
>
> Another spot where mm alloc/free globally serializes (at least on
> x86_64) is pgd_alloc/free on the global pgd_lock.
>
> Suppose you managed to decompose the lock into a finer granularity, to
> the point where it does not pose a problem from contention standpoint.
> Even then that's work which does not have to happen there.
>
> General theme is there is a lot of expensive work happening when
> dealing with mm lifecycle (*both* from single- and multi-threaded
> standpoint) and preferably it would only be dealt with once per
> object's existence.

Maybe change the lifecycle? Allocate a batch of entries initially from
the slab allocator and use them for multiple mm_structs as the need
arises.

Do not free them to the slab allocator until you
have too many sitting around doing nothing?

You may also want to avoid counter updates with this scheme if you only
count the batches used. It will become a bit fuzzy, but you improve scalability.



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu allocator scalability problem
  2025-04-24 16:39     ` Christoph Lameter (Ampere)
@ 2025-04-24 17:26       ` Mateusz Guzik
  0 siblings, 0 replies; 27+ messages in thread
From: Mateusz Guzik @ 2025-04-24 17:26 UTC (permalink / raw)
  To: Christoph Lameter (Ampere)
  Cc: Harry Yoo, Vlastimil Babka, David Rientjes, Andrew Morton,
	Dennis Zhou, Tejun Heo, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	Vlad Buslov, Yevgeny Kliteynik, Jan Kara, Byungchul Park,
	linux-mm, netdev, linux-kernel

On Thu, Apr 24, 2025 at 6:39 PM Christoph Lameter (Ampere)
<cl@gentwo.org> wrote:
>
> On Thu, 24 Apr 2025, Mateusz Guzik wrote:
>
> > > You could allocate larger percpu areas for a batch of them and
> > > then assign as needed.
> >
> > I was considering a mechanism like that earlier, but the changes
> > needed to make it happen would result in worse state for the
> > alloc/free path.
> >
> > RSS counters are embedded into mm with only the per-cpu areas being a
> > pointer. The machinery maintains a global list of all of their
> > instances, i.e. the pointers internal to mm_struct. That is to say
> > even if you deserialized allocation of percpu memory itself, you would
> > still globally serialize on adding/removing the counters to the global
> > list.
> >
> > But suppose this got reworked somehow and this bit ceases to be a problem.
> >
> > Another spot where mm alloc/free globally serializes (at least on
> > x86_64) is pgd_alloc/free on the global pgd_lock.
> >
> > Suppose you managed to decompose the lock into a finer granularity, to
> > the point where it does not pose a problem from contention standpoint.
> > Even then that's work which does not have to happen there.
> >
> > General theme is there is a lot of expensive work happening when
> > dealing with mm lifecycle (*both* from single- and multi-threaded
> > standpoint) and preferably it would only be dealt with once per
> > object's existence.
>
> Maybe change the lifecycle? Allocate a batch of entries initially from
> the slab allocator and use them for multiple mm_structs as the need
> arises.
>
> Do not free them to the slab allocator until you
> have too many sitting around doing nothing?
>
> You may also want to avoid counter updates with this scheme if you only
> count the batches used. It will become a bit fuzzy, but you improve scalability.
>

If I get this right this proposal boils down to caching all the state,
but hiding the objects from reclaim?

If going this kind of route, perhaps it would be simpler to prevent
direct reclaim of mm objects and, if there is a memory shortage, let
a different thread take care of them instead?
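
E.g. keep a work item around and only ever shrink from there, never from
direct reclaim (sketch; kmem_cache_shrink() is real, the hook deciding
when to kick it is the hand-wavy part):

static struct kmem_cache *mm_cachep;    /* the cache holding populated mms */

static void mm_cache_shrink_workfn(struct work_struct *work)
{
        /* returns empty slabs (and, with a dtor, their percpu areas) */
        kmem_cache_shrink(mm_cachep);
}
static DECLARE_WORK(mm_cache_shrink_work, mm_cache_shrink_workfn);

/* called from wherever "memory is getting tight" is detected */
static void mm_cache_memory_pressure(void)
{
        schedule_work(&mm_cache_shrink_work);
}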
-- 
Mateusz Guzik <mjguzik gmail.com>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu allocator scalability problem
  2025-04-24  8:07 [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu allocator scalability problem Harry Yoo
                   ` (9 preceding siblings ...)
  2025-04-24 15:50 ` Christoph Lameter (Ampere)
@ 2025-04-24 18:47 ` Tejun Heo
  2025-04-25 10:10   ` Harry Yoo
  10 siblings, 1 reply; 27+ messages in thread
From: Tejun Heo @ 2025-04-24 18:47 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Vlastimil Babka, Christoph Lameter, David Rientjes,
	Andrew Morton, Dennis Zhou, Mateusz Guzik, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, Vlad Buslov, Yevgeny Kliteynik, Jan Kara,
	Byungchul Park, linux-mm, netdev, linux-kernel

On Thu, Apr 24, 2025 at 05:07:48PM +0900, Harry Yoo wrote:
...
> Eighteen years later, Mateusz Guzik proposed [1] re-introducing a slab
> constructor/destructor pair to mitigate the global serialization point
> (pcpu_alloc_mutex) that occurs when each slab object allocates and frees
> percpu memory during its lifetime.
> 
> Consider mm_struct: it allocates two percpu regions (mm_cid and rss_stat),
> so each allocate–free cycle requires two expensive acquire/release on
> that mutex.

When percpu allocator was first introduced, the use cases were a lot more
limited, so the single mutex and expensive alloc/free paths weren't a
problem. We keep using percpu memory for more and more things, and I always
thought we'd eventually need a more sophisticated allocator something with
object caching. I don't exactly know what that should look like but maybe a
simplified version of sl*b serving power of two sizes should do or maybe it
needs to be smaller and more adaptive. We'd need to collect some data to
decide which way to go.

Improving percpu allocator in general is obviously a heavier lift but that
may be a better long-term direction.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu allocator scalability problem
  2025-04-24 15:20   ` Mateusz Guzik
  2025-04-24 16:11     ` Mateusz Guzik
@ 2025-04-25  7:40     ` Harry Yoo
  1 sibling, 0 replies; 27+ messages in thread
From: Harry Yoo @ 2025-04-25  7:40 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Pedro Falcato, Vlastimil Babka, Christoph Lameter,
	David Rientjes, Andrew Morton, Dennis Zhou, Tejun Heo,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Vlad Buslov,
	Yevgeny Kliteynik, Jan Kara, Byungchul Park, linux-mm, netdev,
	linux-kernel

On Thu, Apr 24, 2025 at 05:20:59PM +0200, Mateusz Guzik wrote:
> On Thu, Apr 24, 2025 at 1:28 PM Pedro Falcato <pfalcato@suse.de> wrote:
> > > How to do this with slab constructors and destructors: the constructor
> > > allocates percpu memory, and the destructor frees it when the slab pages
> > > are reclaimed; this slightly alters the constructor’s semantics,
> > > as it can now fail.
> >
> > I really really really really don't like this. We're opening a pandora's box
> > of locking issues for slab deadlocks and other subtle issues. IMO the best
> > solution there would be, what, failing dtors? which says a lot about the whole
> > situation...
> >
> 
> I noted the need to use leaf spin locks in my IRC conversations with
> Harry and later in this very thread, it is a bummer this bit did not
> make into the cover letter -- hopefully it would have avoided this
> exchange.

My bad. Yes, I should have included it in the cover letter.
Will include it in future revisions of the series.

> I'm going to summarize this again here:
> 
> By API contract the dtor can only take a leaf spinlock, in this case one which:
> 1. disables irqs

Alright, as interrupt handlers can also introduce lock dependencies.

> 2. is the last lock in the dependency chain, as in no locks are taken
> while holding it

So if the destructor takes a lock, then no code path anywhere may
acquire any other lock while holding that lock.
 
> That way there is no possibility of a deadlock.
> 
> This poses a question on how to enforce it and this bit is easy: for
> example one can add leaf-spinlock notion to lockdep. Then a misuse on
> allocation side is going to complain immediately even without
> triggering reclaim. Further, if one would feel so inclined, a test
> module can walk the list of all slab caches and do a populate/reclaim
> cycle on those with the ctor/dtor pair.
> 
> Then there is the matter of particular consumers being ready to do
> what they need to on the dtor side only with the spinlock. Does not
> sound like a fundamental problem.
> 
> > Case in point:
> > What happens if you allocate a slab and start ->ctor()-ing objects, and then
> > one of the ctors fails? We need to free the ctor, but not without ->dtor()-ing
> > everything back (AIUI this is not handled in this series, yet).

It is handled in __free_slab().

> > Besides this
> > complication, if failing dtors were added into the mix, we'd be left with a
> > half-initialized slab(!!) in the middle of the cache waiting to get freed,
> > without being able to.
> >
> 
> Per my previous paragraph failing dtors would be a self-induced problem.
> 
> I can agree one has to roll things back if ctors don't work out, but I
> don't think this poses a significant problem.

Agreed. It'd be better to write destructors carefully to avoid
deadlocks than to allow them to fail.

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu allocator scalability problem
  2025-04-24 18:47 ` Tejun Heo
@ 2025-04-25 10:10   ` Harry Yoo
  2025-04-25 19:03     ` Tejun Heo
  0 siblings, 1 reply; 27+ messages in thread
From: Harry Yoo @ 2025-04-25 10:10 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Vlastimil Babka, Christoph Lameter, David Rientjes,
	Andrew Morton, Dennis Zhou, Mateusz Guzik, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, Vlad Buslov, Yevgeny Kliteynik, Jan Kara,
	Byungchul Park, linux-mm, netdev, linux-kernel

On Thu, Apr 24, 2025 at 08:47:05AM -1000, Tejun Heo wrote:
> On Thu, Apr 24, 2025 at 05:07:48PM +0900, Harry Yoo wrote:
> ...
> > Eighteen years later, Mateusz Guzik proposed [1] re-introducing a slab
> > constructor/destructor pair to mitigate the global serialization point
> > (pcpu_alloc_mutex) that occurs when each slab object allocates and frees
> > percpu memory during its lifetime.
> > 
> > Consider mm_struct: it allocates two percpu regions (mm_cid and rss_stat),
> > so each allocate–free cycle requires two expensive acquire/release on
> > that mutex.
> 
> When percpu allocator was first introduced, the use cases were a lot more
> limited, so the single mutex and expensive alloc/free paths weren't a
> problem. We keep using percpu memory for more and more things, and I always
> thought we'd eventually need a more sophisticated allocator something with
> object caching.

Yeah, when you first write an allocator, it's overkill to make it
too scalable. But over time, as with other allocators, more users show up
that require something more sophisticated.

> I don't exactly know what that should look like but maybe a
> simplified version of sl*b serving power of two sizes should do or maybe it
> needs to be smaller and more adaptive. We'd need to collect some data to
> decide which way to go.

I'm not sure what kind of data we need — maybe allocation size distributions,
or more profiling data on workloads that contend on percpu allocator's locks?

> Improving percpu allocator in general is obviously a heavier lift but that
> may be a better long-term direction.

Yeah, if that's doable. But until then I think it still makes sense to cache
percpu memory within slab objects, ...or probably even after improving the
percpu allocator? It's still churn that's incurred during each object's
lifetime regardless. (Need some data to see if it's justifiable.)

And, as Mateusz explained, the percpu allocator isn’t the only motivation
for the ctor/dtor pair. Other expensive serialization points like pgd_lock
and percpu_counters_lock also motivate this.

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu allocator scalability problem
  2025-04-24 11:28 ` Pedro Falcato
  2025-04-24 15:20   ` Mateusz Guzik
@ 2025-04-25 10:12   ` Harry Yoo
  2025-04-25 10:42     ` Pedro Falcato
  1 sibling, 1 reply; 27+ messages in thread
From: Harry Yoo @ 2025-04-25 10:12 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Vlastimil Babka, Christoph Lameter, David Rientjes,
	Andrew Morton, Dennis Zhou, Tejun Heo, Mateusz Guzik,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Vlad Buslov,
	Yevgeny Kliteynik, Jan Kara, Byungchul Park, linux-mm, netdev,
	linux-kernel

On Thu, Apr 24, 2025 at 12:28:37PM +0100, Pedro Falcato wrote:
> On Thu, Apr 24, 2025 at 05:07:48PM +0900, Harry Yoo wrote:
> > Overview
> > ========
> > 
> > The slab destructor feature existed in early days of slab allocator(s).
> > It was removed by the commit c59def9f222d ("Slab allocators: Drop support
> > for destructors") in 2007 due to lack of serious use cases at that time.
> > 
> > Eighteen years later, Mateusz Guzik proposed [1] re-introducing a slab
> > constructor/destructor pair to mitigate the global serialization point
> > (pcpu_alloc_mutex) that occurs when each slab object allocates and frees
> > percpu memory during its lifetime.
> > 
> > Consider mm_struct: it allocates two percpu regions (mm_cid and rss_stat),
> > so each allocate–free cycle requires two expensive acquire/release on
> > that mutex.
> > 
> > We can mitigate this contention by retaining the percpu regions after
> > the object is freed and releasing them only when the backing slab pages
> > are freed.
> > 
> > How to do this with slab constructors and destructors: the constructor
> > allocates percpu memory, and the destructor frees it when the slab pages
> > are reclaimed; this slightly alters the constructor’s semantics,
> > as it can now fail.
> > 
> 
> I really really really really don't like this. We're opening a pandora's box
> of locking issues for slab deadlocks and other subtle issues. IMO the best
> solution there would be, what, failing dtors? which says a lot about the whole
> situation...
> 
> Case in point:

<...snip...>

> Then there are obviously other problems like: whatever you're calling must
> not ever require the slab allocator (directly or indirectly) and must not
> do direct reclaim (ever!), at the risk of a deadlock. The pcpu allocator
> is a no-go (AIUI!) already because of such issues.

Could you please elaborate more on this?

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu allocator scalability problem
  2025-04-25 10:12   ` Harry Yoo
@ 2025-04-25 10:42     ` Pedro Falcato
  2025-04-28  1:18       ` Harry Yoo
  2025-04-30 19:49       ` Mateusz Guzik
  0 siblings, 2 replies; 27+ messages in thread
From: Pedro Falcato @ 2025-04-25 10:42 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Vlastimil Babka, Christoph Lameter, David Rientjes,
	Andrew Morton, Dennis Zhou, Tejun Heo, Mateusz Guzik,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Vlad Buslov,
	Yevgeny Kliteynik, Jan Kara, Byungchul Park, linux-mm, netdev,
	linux-kernel

On Fri, Apr 25, 2025 at 07:12:02PM +0900, Harry Yoo wrote:
> On Thu, Apr 24, 2025 at 12:28:37PM +0100, Pedro Falcato wrote:
> > On Thu, Apr 24, 2025 at 05:07:48PM +0900, Harry Yoo wrote:
> > > Overview
> > > ========
> > > 
> > > The slab destructor feature existed in early days of slab allocator(s).
> > > It was removed by the commit c59def9f222d ("Slab allocators: Drop support
> > > for destructors") in 2007 due to lack of serious use cases at that time.
> > > 
> > > Eighteen years later, Mateusz Guzik proposed [1] re-introducing a slab
> > > constructor/destructor pair to mitigate the global serialization point
> > > (pcpu_alloc_mutex) that occurs when each slab object allocates and frees
> > > percpu memory during its lifetime.
> > > 
> > > Consider mm_struct: it allocates two percpu regions (mm_cid and rss_stat),
> > > so each allocate–free cycle requires two expensive acquire/release on
> > > that mutex.
> > > 
> > > We can mitigate this contention by retaining the percpu regions after
> > > the object is freed and releasing them only when the backing slab pages
> > > are freed.
> > > 
> > > How to do this with slab constructors and destructors: the constructor
> > > allocates percpu memory, and the destructor frees it when the slab pages
> > > are reclaimed; this slightly alters the constructor’s semantics,
> > > as it can now fail.
> > > 
> > 
> > I really really really really don't like this. We're opening a pandora's box
> > of locking issues for slab deadlocks and other subtle issues. IMO the best
> > solution there would be, what, failing dtors? which says a lot about the whole
> > situation...
> > 
> > Case in point:
> 
> <...snip...>
> 
> > Then there are obviously other problems like: whatever you're calling must
> > not ever require the slab allocator (directly or indirectly) and must not
> > do direct reclaim (ever!), at the risk of a deadlock. The pcpu allocator
> > is a no-go (AIUI!) already because of such issues.
> 
> Could you please elaborate more on this?

Well, as discussed multiple times both on and off list, the pcpu allocator is
not a problem here because the freeing path takes a spinlock, not a mutex. But
obviously you can see the fun locking horror dependency chains we're creating
with this patchset. ->ctor() needs to be super careful calling things, avoiding
any sort of loop. ->dtor() needs to be super careful calling things, avoiding
_any_ sort of direct reclaim possibilities. You also now need to pass a gfp_t
to both ->ctor and ->dtor.

With regards to "leaf locks", I still don't really understand what you/Mateusz
mean or how that's even enforceable from the get-go.

So basically:
- ->ctor takes more args, can fail, can do fancier things (multiple allocations,
  lock holding, etc, can be hidden with a normal kmem_cache_alloc; certain
  caches become GFP_ATOMIC-incompatible)

- ->dtor *will* do fancy things like recursing back onto the slab allocator and
  grabbing locks

- a normal kmem_cache_free can suddenly attempt to grab !SLUB locks as it tries
  to dispose of slabs. It can also uncontrollably do $whatever.

- a normal kmem_cache_alloc can call vast swaths of code, uncontrollably, due to
  ->ctor. It can also set off direct reclaim, and thus run into all sorts of kmem_
  cache_free/slab disposal issues

- a normal, possibly-unrelated GFP_KERNEL allocation can also run into all of these
  issues by purely starting up shrinkers on direct reclaim as well.

- the whole original "Slab object caching allocator" idea from 1992 is extremely
  confusing and works super poorly with various debugging features (like, e.g,
  KASAN). IMO it should really be reserved (in a limited capacity!) for stuff like
  TYPESAFE_BY_RCU, that we *really* need.

These are basically my issues with the whole idea. I highly disagree that we should
open this pandora's box for problems in *other places*.

-- 
Pedro


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu allocator scalability problem
  2025-04-25 10:10   ` Harry Yoo
@ 2025-04-25 19:03     ` Tejun Heo
  0 siblings, 0 replies; 27+ messages in thread
From: Tejun Heo @ 2025-04-25 19:03 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Vlastimil Babka, Christoph Lameter, David Rientjes,
	Andrew Morton, Dennis Zhou, Mateusz Guzik, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, Vlad Buslov, Yevgeny Kliteynik, Jan Kara,
	Byungchul Park, linux-mm, netdev, linux-kernel

On Fri, Apr 25, 2025 at 07:10:03PM +0900, Harry Yoo wrote:
> > I don't exactly know what that should look like but maybe a
> > simplified version of sl*b serving power of two sizes should do or maybe it
> > needs to be smaller and more adaptive. We'd need to collect some data to
> > decide which way to go.
> 
> I'm not sure what kind of data we need — maybe allocation size distributions,
> or more profiling data on workloads that contend on percpu allocator's locks?

Oh yeah, mostly distributions of memory allocation sizes across different
systems and workloads.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu allocator scalability problem
  2025-04-25 10:42     ` Pedro Falcato
@ 2025-04-28  1:18       ` Harry Yoo
  2025-04-30 19:49       ` Mateusz Guzik
  1 sibling, 0 replies; 27+ messages in thread
From: Harry Yoo @ 2025-04-28  1:18 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Vlastimil Babka, Christoph Lameter, David Rientjes,
	Andrew Morton, Dennis Zhou, Tejun Heo, Mateusz Guzik,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Vlad Buslov,
	Yevgeny Kliteynik, Jan Kara, Byungchul Park, linux-mm, netdev,
	linux-kernel

On Fri, Apr 25, 2025 at 11:42:27AM +0100, Pedro Falcato wrote:
> On Fri, Apr 25, 2025 at 07:12:02PM +0900, Harry Yoo wrote:
> > On Thu, Apr 24, 2025 at 12:28:37PM +0100, Pedro Falcato wrote:
> > > On Thu, Apr 24, 2025 at 05:07:48PM +0900, Harry Yoo wrote:
> > > > Overview
> > > > ========
> > > > 
> > > > The slab destructor feature existed in early days of slab allocator(s).
> > > > It was removed by the commit c59def9f222d ("Slab allocators: Drop support
> > > > for destructors") in 2007 due to lack of serious use cases at that time.
> > > > 
> > > > Eighteen years later, Mateusz Guzik proposed [1] re-introducing a slab
> > > > constructor/destructor pair to mitigate the global serialization point
> > > > (pcpu_alloc_mutex) that occurs when each slab object allocates and frees
> > > > percpu memory during its lifetime.
> > > > 
> > > > Consider mm_struct: it allocates two percpu regions (mm_cid and rss_stat),
> > > > so each allocate–free cycle requires two expensive acquire/release on
> > > > that mutex.
> > > > 
> > > > We can mitigate this contention by retaining the percpu regions after
> > > > the object is freed and releasing them only when the backing slab pages
> > > > are freed.
> > > > 
> > > > How to do this with slab constructors and destructors: the constructor
> > > > allocates percpu memory, and the destructor frees it when the slab pages
> > > > are reclaimed; this slightly alters the constructor’s semantics,
> > > > as it can now fail.
> > > > 
> > > 
> > > I really really really really don't like this. We're opening a pandora's box
> > > of locking issues for slab deadlocks and other subtle issues. IMO the best
> > > solution there would be, what, failing dtors? which says a lot about the whole
> > > situation...
> > > 
> > > Case in point:
> > 
> > <...snip...>
> > 
> > > Then there are obviously other problems like: whatever you're calling must
> > > not ever require the slab allocator (directly or indirectly) and must not
> > > do direct reclaim (ever!), at the risk of a deadlock. The pcpu allocator
> > > is a no-go (AIUI!) already because of such issues.
> > 
> > Could you please elaborate more on this?
>
> Well, as discussed multiple times both on and off list, the pcpu allocator is
> not a problem here because the freeing path takes a spinlock, not a mutex.

Yes, and it seems to be a leaf spinlock (no code path in the kernel takes
any other locks while holding the lock).

> But obviously you can see the fun locking horror dependency chains
> we're creating with this patchset.
>
> ->ctor() needs to be super careful calling things, avoiding
> any sort of loop.

You mean recursion, which could exhaust the kernel stack?

It'd be fine as long as ->ctor does not allocate objects from
the same cache (and of course it should not).

> ->dtor() needs to be super careful calling things, avoiding
> _any_ sort of direct reclaim possibilities.

Why would a dtor _ever_ need to perform direct reclamation?

> You also now need to pass a gfp_t to both ->ctor and ->dtor.

Passing gfp_t to ->ctor agreed, but why to ->dtor?

Destructors should not allocate any memory, and thus there is no need
for 'get free page' flags.

> So basically:
> - ->ctor takes more args, can fail, can do fancier things (multiple allocations,
>   lock holding, etc, can be hidden with a normal kmem_cache_alloc;

Speaking of deadlocks involving ->ctor, of course, you shouldn't allocate
mm_struct while holding pcpu_lock, or allocate a pgd while holding pgd_lock.
I don't think avoiding deadlocks caused by ->ctor is that difficult, and
lockdep can detect them even if someone makes a mistake.

> - ->dtor *will* do fancy things like recursing back onto the slab allocator and
>   grabbing locks

AIUI it can't recurse back onto the slab allocator.
'Taking only leaf spinlocks' is a very restrictive rule.

For example, slab takes list_lock, and it is not a leaf spinlock because
in some paths slab takes other locks while holding it. Thus you can't
call kmem_cache_free() directly in a ->dtor.

'Doing fancy things and grabbing locks' in dtor is safe as long as
you 1) disable interrupts and 2) take only leaf spinlocks.
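
For example, a dtor along these lines conforms to the rule (just a
sketch; the dtor signature and the 'foo' cache are made up):

struct foo {
    int __percpu *stats;
};

static void foo_dtor(void *obj)
{
    struct foo *f = obj;

    /*
     * free_percpu() disables interrupts and takes pcpu_lock, which is
     * a leaf spinlock as noted above, so no new lock dependency is
     * introduced from this context.
     */
    free_percpu(f->stats);
}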
 
> - a normal kmem_cache_free can suddenly attempt to grab !SLUB locks as it tries
>   to dispose of slabs. It can also uncontrollably do $whatever.

'Can uncontrollably do $whatever' is not true.
It's safe as long as ->dtor only takes leaf spinlocks.

> - a normal kmem_cache_alloc can call vast swaths of code, uncontrollably, due to
>   ->ctor. It can also set off direct reclaim, and thus run into all sorts of kmem_
>   cache_free/slab disposal issues
> 
> - a normal, possibly-unrelated GFP_KERNEL allocation can also run into all of these
>   issues by purely starting up shrinkers on direct reclaim as well.

I don't see that is a problem (as long as ->dtor only takes leaf spinlocks).

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu allocator scalability problem
  2025-04-25 10:42     ` Pedro Falcato
  2025-04-28  1:18       ` Harry Yoo
@ 2025-04-30 19:49       ` Mateusz Guzik
  2025-05-12 11:00         ` Harry Yoo
  1 sibling, 1 reply; 27+ messages in thread
From: Mateusz Guzik @ 2025-04-30 19:49 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Harry Yoo, Vlastimil Babka, Christoph Lameter, David Rientjes,
	Andrew Morton, Dennis Zhou, Tejun Heo, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, Vlad Buslov, Yevgeny Kliteynik, Jan Kara,
	Byungchul Park, linux-mm, netdev, linux-kernel

On Fri, Apr 25, 2025 at 12:42 PM Pedro Falcato <pfalcato@suse.de> wrote:
> With regards to "leaf locks", I still don't really understand what you/Mateusz
> mean or how that's even enforceable from the get-go.
>
> So basically:
> - ->ctor takes more args, can fail, can do fancier things (multiple allocations,
>   lock holding, etc, can be hidden with a normal kmem_cache_alloc; certain
>   caches become GFP_ATOMIC-incompatible)
>
> - ->dtor *will* do fancy things like recursing back onto the slab allocator and
>   grabbing locks
>
> - a normal kmem_cache_free can suddenly attempt to grab !SLUB locks as it tries
>   to dispose of slabs. It can also uncontrollably do $whatever.
>
> - a normal kmem_cache_alloc can call vast swaths of code, uncontrollably, due to
>   ->ctor. It can also set off direct reclaim, and thus run into all sorts of kmem_
>   cache_free/slab disposal issues
>
> - a normal, possibly-unrelated GFP_KERNEL allocation can also run into all of these
>   issues by purely starting up shrinkers on direct reclaim as well.
>
> - the whole original "Slab object caching allocator" idea from 1992 is extremely
>   confusing and works super poorly with various debugging features (like, e.g,
>   KASAN). IMO it should really be reserved (in a limited capacity!) for stuff like
>   TYPESAFE_BY_RCU, that we *really* need.
>
> These are basically my issues with the whole idea. I highly disagree that we should
> open this pandora's box for problems in *other places*.

Apologies for the late reply.

Looks like your primary apprehension concerns the dtor; I'm going to
address it below.

But first a quick remark: the headline issue here is expensive
single-threaded work which keeps happening on every mm alloc/free and
which does not have to, and it extends beyond the percpu allocator
problems. Having a memory allocator which can handle it would be most
welcome.

Now to business:
I'm going to start with pointing out that dtors callable from any
place are not an *inherent* requirement of the idea. Given that
apparently sheaves don't do direct reclaim and that Christoph's idea
does not do it either, I think there is some support for objs with
unsafe dtors *not* being direct reclaimable (instead there can be a
dedicated workqueue or some other mechanism sorting them out). I did
not realize something like this would be considered fine. It is the
easiest way out and is perfectly fine with me.

However, suppose objs with dtors do need to be reclaimable the usual way.

I claim that writing dtors which are safe to use in that context is
not a significant challenge. Moreover, it is also possible to extend
lockdep to validate correct behavior. Finally, test code can trigger
ctor and dtor calls for all slabs to execute all this code at least
once with lockdep enabled. So while *honest* mistakes with locking are
very much possible, they will be trivially caught and I don't believe
the box which is being opened here belongs to Pandora.

So here is another attempt at explaining leaf spinlocks.

Suppose you have a global lock named "crapper". Further suppose the
lock is only taken with _irqsave *and* no locks are taken while
holding it.

Say this is the only consumer:
void crapperbump(void) {
    unsigned long flags;
    spin_lock_irqsave(&crapper, flags);
    mehvar++;
    spin_unlock_irqrestore(&crapper, flags);
}

Perhaps you can agree *anyone* can call here at any point and not risk
deadlocking.

That's an example of a leaf lock.

Aight, so how does one combat cases where the code turns into:
spin_lock_irqsave(&crapper, flags);
spin_lock_irqsave(&meh, flags2);

In this case "crapper" is no longer a leaf lock and in principle there
may be lock ordering involving "meh" which does deadlock.

Here is an example way out: when initializing the "crapper" lock, it
gets marked as a leaf lock so that lockdep can check for it. Then on a
lockdep-enabled kernel you get a splat as soon as the routine gets to
locking "meh". This sorts out the ctor side of things.

How does one validate dtor? lockdep can have a "only leaf-locks
allowed in this area" tunable around calls to dtors. Then should a
dtor have ideas of acquiring a lock which is not a leaf lock you are
once more going to get a splat.
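
To make that concrete, roughly (lockdep has no leaf-lock notion today,
so every identifier below is made up):

/* init side: mark the lock as a leaf so lockdep can police it */
static DEFINE_SPINLOCK(crapper);

static int __init crapper_setup(void)
{
    lockdep_set_leaf(&crapper);        /* hypothetical annotation */
    return 0;
}

/* slab side: only leaf locks may be taken while a dtor runs */
static void slab_call_dtor(struct kmem_cache *s, void *obj)
{
    lockdep_leaf_only_begin();         /* hypothetical lockdep section */
    s->dtor(obj);
    lockdep_leaf_only_end();
}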

And of course you can just force call all ctors and dtors on a debug
kernel (no need to trigger any memory pressure, just walk the list of
slabs with ctor + dtor pairs and call the stuff).

This of course would require some effort to implement, but there is no
rocket science degree required.
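
For illustration, such a debug walk could look roughly like this
(assuming it lives next to the slab code where slab_caches and
slab_mutex are visible, and an RFC-style ctor that returns 0 on
success; all of this is a sketch, not the series' code):

static int __init slab_ctor_dtor_selftest(void)
{
    struct kmem_cache *s;
    void *buf;

    mutex_lock(&slab_mutex);
    list_for_each_entry(s, &slab_caches, list) {
        if (!s->ctor || !s->dtor)
            continue;
        /* exercise the pair once on a scratch buffer, under lockdep */
        buf = kzalloc(s->object_size, GFP_KERNEL);
        if (!buf)
            continue;
        if (!s->ctor(buf, GFP_KERNEL))
            s->dtor(buf);
        kfree(buf);
    }
    mutex_unlock(&slab_mutex);
    return 0;
}
late_initcall(slab_ctor_dtor_selftest);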

However, given that direct reclaim for mm is apparently not a strict
requirement, my preferred way out is to simply not provide it.
-- 
Mateusz Guzik <mjguzik gmail.com>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu allocator scalability problem
  2025-04-30 19:49       ` Mateusz Guzik
@ 2025-05-12 11:00         ` Harry Yoo
  0 siblings, 0 replies; 27+ messages in thread
From: Harry Yoo @ 2025-05-12 11:00 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Pedro Falcato, Vlastimil Babka, Christoph Lameter,
	David Rientjes, Andrew Morton, Dennis Zhou, Tejun Heo,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Vlad Buslov,
	Yevgeny Kliteynik, Jan Kara, Byungchul Park, linux-mm, netdev,
	linux-kernel

On Wed, Apr 30, 2025 at 09:49:02PM +0200, Mateusz Guzik wrote:
> On Fri, Apr 25, 2025 at 12:42 PM Pedro Falcato <pfalcato@suse.de> wrote:
> > With regards to "leaf locks", I still don't really understand what you/Mateusz
> > mean or how that's even enforceable from the get-go.
> >
> > So basically:
> > - ->ctor takes more args, can fail, can do fancier things (multiple allocations,
> >   lock holding, etc, can be hidden with a normal kmem_cache_alloc; certain
> >   caches become GFP_ATOMIC-incompatible)
> >
> > - ->dtor *will* do fancy things like recursing back onto the slab allocator and
> >   grabbing locks
> >
> > - a normal kmem_cache_free can suddenly attempt to grab !SLUB locks as it tries
> >   to dispose of slabs. It can also uncontrollably do $whatever.
> >
> > - a normal kmem_cache_alloc can call vast swaths of code, uncontrollably, due to
> >   ->ctor. It can also set off direct reclaim, and thus run into all sorts of kmem_
> >   cache_free/slab disposal issues
> >
> > - a normal, possibly-unrelated GFP_KERNEL allocation can also run into all of these
> >   issues by purely starting up shrinkers on direct reclaim as well.
> >
> > - the whole original "Slab object caching allocator" idea from 1992 is extremely
> >   confusing and works super poorly with various debugging features (like, e.g,
> >   KASAN). IMO it should really be reserved (in a limited capacity!) for stuff like
> >   TYPESAFE_BY_RCU, that we *really* need.
> >
> > These are basically my issues with the whole idea. I highly disagree that we should
> > open this pandora's box for problems in *other places*.
> 
> Now to business:
> I'm going to start with pointing out that dtors callable from any
> place are not an *inherent* requirement of the idea. Given that
> apparently sheaves don't do direct reclaim and that Christoph's idea
> does not do it either, I think there is some support for objs with
> unsafe dtors *not* being direct reclaimable (instead there can be a
> dedicated workqueue or some other mechanism sorting them out). I did
> not realize something like this would be considered fine. It is the
> easiest way out and is perfectly fine with me.

My concern is that this prevents reclaimable slabs from using a
destructor in the future. Probably doesn't matter now, but it
doesn't seem future-proof.

> However, suppose objs with dtors do need to be reclaimable the usual way.
> 
> I claim that writing dtors which are safe to use in that context is
> not a significant challenge. Moreover, it is also possible to extend
> lockdep to validate correct behavior. Finally, test code can trigger
> ctor and dtor calls for all slabs to execute all this code at least
> once with lockdep enabled. So while *honest* mistakes with locking are
> very much possible, they will be trivially caught and I don't believe
> the box which is being opened here belongs to Pandora.
> 
> So here is another attempt at explaining leaf spinlocks.
> 
> Suppose you have a global lock named "crapper". Further suppose the
> lock is only taken with _irqsave *and* no locks are taken while
> holding it.
> 
> Say this is the only consumer:
> void crapperbump(void) {
>     unsigned long flags;
>     spin_lock_irqsave(&crapper, flags);
>     mehvar++;
>     spin_unlock_irqrestore(&crapper, flags);
> }
> 
> Perhaps you can agree *anyone* can call here at any point and not risk
> deadlocking.
> 
> That's an example of a leaf lock.
> 
> Aight, so how does one combat cases where the code turns into:
> spin_lock_irqsave(&crapper, flags);
> spin_lock_irqsave(&meh, flags2);
> 
> In this case "crapper" is no longer a leaf lock and in principle there
> may be lock ordering involving "meh" which does deadlock.
> 
> Here is an example way out: when initializing the "crapper" lock, it
> gets marked as a leaf lock so that lockdep can check for it. Then on a
> lockdep-enabled kernel you get a splat as soon as the routine gets to
> locking "meh". This sorts out the ctor side of things.
> 
> How does one validate dtor? lockdep can have a "only leaf-locks
> allowed in this area" tunable around calls to dtors. Then should a
> dtor have ideas of acquiring a lock which is not a leaf lock you are
> once more going to get a splat.

Similar to the concern above, this would limit how users can use the
destructor feature (say the percpu allocator violated the rule, although
it doesn't appear to; then we would have to modify the percpu allocator
to conform to it).

So I think the most future-proof approach (as Vlastimil suggested
off-list) would be to reclaim slab memory and call dtors in a separate
context (e.g., dedicated workqueues as you mentioned)?
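
Something roughly like this is what I have in mind (just a sketch;
struct slab has no dtor_work field today and slab_dtor_wq would be a
new dedicated workqueue):

static struct workqueue_struct *slab_dtor_wq;

static void slab_dtor_workfn(struct work_struct *work)
{
    struct slab *slab = container_of(work, struct slab, dtor_work);

    /*
     * __free_slab() runs the dtors and frees the backing pages, but
     * now from a context with no reclaim locks held, so dtors are not
     * limited to leaf spinlocks.
     */
    __free_slab(slab->slab_cache, slab);
}

/* reclaim path: defer instead of calling dtors directly */
static void defer_slab_free(struct slab *slab)
{
    INIT_WORK(&slab->dtor_work, slab_dtor_workfn);
    queue_work(slab_dtor_wq, &slab->dtor_work);
}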

But before doing this I need to check whether it's okay for the current
task to account those pages as reclaimed when we know they will be
reclaimed but have not been yet.

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2025-05-12 11:00 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-04-24  8:07 [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu allocator scalability problem Harry Yoo
2025-04-24  8:07 ` [RFC PATCH 1/7] mm/slab: refactor freelist shuffle Harry Yoo
2025-04-24  8:07 ` [RFC PATCH 2/7] treewide, slab: allow slab constructor to return an error Harry Yoo
2025-04-24  8:07 ` [RFC PATCH 3/7] mm/slab: revive the destructor feature in slab allocator Harry Yoo
2025-04-24  8:07 ` [RFC PATCH 4/7] net/sched/act_api: use slab ctor/dtor to reduce contention on pcpu alloc Harry Yoo
2025-04-24  8:07 ` [RFC PATCH 5/7] mm/percpu: allow (un)charging objects without alloc/free Harry Yoo
2025-04-24  8:07 ` [RFC PATCH 6/7] lib/percpu_counter: allow (un)charging percpu counters " Harry Yoo
2025-04-24  8:07 ` [RFC PATCH 7/7] kernel/fork: improve exec() throughput with slab ctor/dtor pair Harry Yoo
2025-04-24  9:29 ` [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu allocator scalability problem Mateusz Guzik
2025-04-24  9:58   ` Harry Yoo
2025-04-24 15:00     ` Mateusz Guzik
2025-04-24 11:28 ` Pedro Falcato
2025-04-24 15:20   ` Mateusz Guzik
2025-04-24 16:11     ` Mateusz Guzik
2025-04-25  7:40     ` Harry Yoo
2025-04-25 10:12   ` Harry Yoo
2025-04-25 10:42     ` Pedro Falcato
2025-04-28  1:18       ` Harry Yoo
2025-04-30 19:49       ` Mateusz Guzik
2025-05-12 11:00         ` Harry Yoo
2025-04-24 15:50 ` Christoph Lameter (Ampere)
2025-04-24 16:03   ` Mateusz Guzik
2025-04-24 16:39     ` Christoph Lameter (Ampere)
2025-04-24 17:26       ` Mateusz Guzik
2025-04-24 18:47 ` Tejun Heo
2025-04-25 10:10   ` Harry Yoo
2025-04-25 19:03     ` Tejun Heo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox