linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Mateusz Guzik <mjguzik@gmail.com>
To: Harry Yoo <harry.yoo@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>,
	Christoph Lameter <cl@gentwo.org>,
	David Rientjes <rientjes@google.com>,
	 Andrew Morton <akpm@linux-foundation.org>,
	Dennis Zhou <dennis@kernel.org>, Tejun Heo <tj@kernel.org>,
	 Jamal Hadi Salim <jhs@mojatatu.com>,
	Cong Wang <xiyou.wangcong@gmail.com>,
	 Jiri Pirko <jiri@resnulli.us>, Vlad Buslov <vladbu@nvidia.com>,
	 Yevgeny Kliteynik <kliteyn@nvidia.com>, Jan Kara <jack@suse.cz>,
	Byungchul Park <byungchul@sk.com>,
	 linux-mm@kvack.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu allocator scalability problem
Date: Thu, 24 Apr 2025 11:29:13 +0200	[thread overview]
Message-ID: <CAGudoHGkNn032RVuJdwYqpzfQgAB8pv=hEzdR1APsFOOSQnq1Q@mail.gmail.com> (raw)
In-Reply-To: <20250424080755.272925-1-harry.yoo@oracle.com>

On Thu, Apr 24, 2025 at 10:08 AM Harry Yoo <harry.yoo@oracle.com> wrote:
>
> Overview
> ========
>
> The slab destructor feature existed in early days of slab allocator(s).
> It was removed by the commit c59def9f222d ("Slab allocators: Drop support
> for destructors") in 2007 due to lack of serious use cases at that time.
>
> Eighteen years later, Mateusz Guzik proposed [1] re-introducing a slab
> constructor/destructor pair to mitigate the global serialization point
> (pcpu_alloc_mutex) that occurs when each slab object allocates and frees
> percpu memory during its lifetime.
>
> Consider mm_struct: it allocates two percpu regions (mm_cid and rss_stat),
> so each allocate–free cycle requires two expensive acquire/release on
> that mutex.
>
> We can mitigate this contention by retaining the percpu regions after
> the object is freed and releasing them only when the backing slab pages
> are freed.
>
> How to do this with slab constructors and destructors: the constructor
> allocates percpu memory, and the destructor frees it when the slab pages
> are reclaimed; this slightly alters the constructor’s semantics,
> as it can now fail.
>
> This series is functional (although not compatible with MM debug
> features yet), but still far from perfect. I’m actively refining it and
> would appreciate early feedback before I improve it further. :)
>

Thanks for looking into this.

The dtor thing poses a potential problem where a dtor acquiring
arbitrary locks can result in a deadlock during memory reclaim.

So for this to be viable one needs to ensure that in the worst case
this only ever takes leaf-locks (i.e., locks which are last in any
dependency chain -- no locks are being taken if you hold one). This
needs to demonstrate the percpu thing qualifies or needs to refactor
it to that extent.


> This series is based on slab/for-next [2].
>
> Performance Improvement
> =======================
>
> I measured the benefit of this series for two different users:
> exec() and tc filter insertion/removal.
>
> exec() throughput
> -----------------
>
> The performance of exec() is important when short-lived processes are
> frequently created. For example: shell-heavy workloads and running many
> test cases [3].
>
> I measured exec() throughput with a microbenchmark:
>   - 33% of exec() throughput gain on 2-socket machine with 192 CPUs,
>   - 4.56% gain on a desktop with 24 hardware threads, and
>   - Even 4% gain on a single-threaded exec() throughput.
>
> Further investigation showed that this was due to the overhead of
> acquiring/releasing pcpu_alloc_mutex and its contention.
>
> See patch 7 for more detail on the experiment.
>
> Traffic Filter Insertion and Removal
> ------------------------------------
>
> Each tc filter allocates three percpu memory regions per tc_action object,
> so frequently inserting and removing filters contend heavily on the same
> mutex.
>
> In the Linux-kernel tools/testing tc-filter benchmark (see patch 4 for
> more detail), I observed a 26% reduction in system time and observed
> much less contention on pcpu_alloc_mutex with this series.
>
> I saw in old mailing list threads Mellanox (now NVIDIA) engineers cared
> about tc filter insertion rate; these changes may still benefit
> workloads they run today.
>
> Feedback Needed from Percpu Allocator Folks
> ===========================================
>
> As percpu allocator is directly affected by this series, this work
> will need support from the percpu allocator maintainers, and we need to
> address their concerns.
>
> They will probably say "This is a percpu memory allocator scalability
> issue and we need to make it scalable"? I don't know.
>
> What do you say?
>
> Some hanging thoughts:
> - Tackling the problem on the slab side is much simpler, because the slab
>   allocator already caches objects per CPU. Re-creating similar logic
>   inside the percpu allocator would be redundant.
>
>   Also, since this is opt-in per slab cache, other percpu allocator
>   users remain unaffected.
>
> - If fragmentation is a concern, we could probably allocate larger percpu
>   chunks and partition them for slab objects.
>
> - If memory overhead becomes an issue, we could introduce a shrinker
>   to free empty slabs (and thus releasing underlying percpu memory chunks).
>
> Patch Sequence
> ==============
>
> Patch #1 refactors freelist_shuffle() to allow the slab constructor to
> fail in the next patch.
>
> Patch #2 allows the slab constructor fail.
>
> Patch #3 implements the slab destructor feature.
>
> Patch #4 converts net/sched/act_api to use the slab ctor/dtor pair.
>
> Patch #5, #6 implements APIs to charge and uncharge percpu memory and
> percpu counter.
>
> Patch #7 converts mm_struct to use the slab ctor/dtor pair.
>
> Known issues with MM debug features
> ===================================
>
> The slab destructor feature is not yet compatible with KASAN, KMEMLEAK,
> and DEBUG_OBJECTS.
>
> KASAN reports an error when a percpu counter is inserted into the
> percpu_counters linked list because the counter has not been allocated
> yet.
>
> DEBUG_OBJECTS and KMEMLEAK complain when the slab object is freed, while
> the associated percpu memory is still resident in memory.
>
> I don't expect fixing these issues to be too difficult, but I need to
> think a little bit to fix it.
>
> [1] https://lore.kernel.org/linux-mm/CAGudoHFc+Km-3usiy4Wdm1JkM+YjCgD9A8dDKQ06pZP070f1ig@mail.gmail.com
>
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab.git/log/?h=slab/for-next
>
> [3] https://lore.kernel.org/linux-mm/20230608111408.s2minsenlcjow7q3@quack3
>
> [4] https://lore.kernel.org/netdev/vbfmunui7dm.fsf@mellanox.com
>
> Harry Yoo (7):
>   mm/slab: refactor freelist shuffle
>   treewide, slab: allow slab constructor to return an error
>   mm/slab: revive the destructor feature in slab allocator
>   net/sched/act_api: use slab ctor/dtor to reduce contention on pcpu
>     alloc
>   mm/percpu: allow (un)charging objects without alloc/free
>   lib/percpu_counter: allow (un)charging percpu counters without
>     alloc/free
>   kernel/fork: improve exec() throughput with slab ctor/dtor pair
>
>  arch/powerpc/include/asm/svm.h            |   2 +-
>  arch/powerpc/kvm/book3s_64_mmu_radix.c    |   3 +-
>  arch/powerpc/mm/init-common.c             |   3 +-
>  arch/powerpc/platforms/cell/spufs/inode.c |   3 +-
>  arch/powerpc/platforms/pseries/setup.c    |   2 +-
>  arch/powerpc/platforms/pseries/svm.c      |   4 +-
>  arch/sh/mm/pgtable.c                      |   3 +-
>  arch/sparc/mm/tsb.c                       |   8 +-
>  block/bdev.c                              |   3 +-
>  drivers/dax/super.c                       |   3 +-
>  drivers/gpu/drm/i915/i915_request.c       |   3 +-
>  drivers/misc/lkdtm/heap.c                 |  12 +--
>  drivers/usb/mon/mon_text.c                |   5 +-
>  fs/9p/v9fs.c                              |   3 +-
>  fs/adfs/super.c                           |   3 +-
>  fs/affs/super.c                           |   3 +-
>  fs/afs/super.c                            |   5 +-
>  fs/befs/linuxvfs.c                        |   3 +-
>  fs/bfs/inode.c                            |   3 +-
>  fs/btrfs/inode.c                          |   3 +-
>  fs/ceph/super.c                           |   3 +-
>  fs/coda/inode.c                           |   3 +-
>  fs/debugfs/inode.c                        |   3 +-
>  fs/dlm/lowcomms.c                         |   3 +-
>  fs/ecryptfs/main.c                        |   5 +-
>  fs/efs/super.c                            |   3 +-
>  fs/erofs/super.c                          |   3 +-
>  fs/exfat/cache.c                          |   3 +-
>  fs/exfat/super.c                          |   3 +-
>  fs/ext2/super.c                           |   3 +-
>  fs/ext4/super.c                           |   3 +-
>  fs/fat/cache.c                            |   3 +-
>  fs/fat/inode.c                            |   3 +-
>  fs/fuse/inode.c                           |   3 +-
>  fs/gfs2/main.c                            |   9 +-
>  fs/hfs/super.c                            |   3 +-
>  fs/hfsplus/super.c                        |   3 +-
>  fs/hpfs/super.c                           |   3 +-
>  fs/hugetlbfs/inode.c                      |   3 +-
>  fs/inode.c                                |   3 +-
>  fs/isofs/inode.c                          |   3 +-
>  fs/jffs2/super.c                          |   3 +-
>  fs/jfs/super.c                            |   3 +-
>  fs/minix/inode.c                          |   3 +-
>  fs/nfs/inode.c                            |   3 +-
>  fs/nfs/nfs42xattr.c                       |   3 +-
>  fs/nilfs2/super.c                         |   6 +-
>  fs/ntfs3/super.c                          |   3 +-
>  fs/ocfs2/dlmfs/dlmfs.c                    |   3 +-
>  fs/ocfs2/super.c                          |   3 +-
>  fs/openpromfs/inode.c                     |   3 +-
>  fs/orangefs/super.c                       |   3 +-
>  fs/overlayfs/super.c                      |   3 +-
>  fs/pidfs.c                                |   3 +-
>  fs/proc/inode.c                           |   3 +-
>  fs/qnx4/inode.c                           |   3 +-
>  fs/qnx6/inode.c                           |   3 +-
>  fs/romfs/super.c                          |   3 +-
>  fs/smb/client/cifsfs.c                    |   3 +-
>  fs/squashfs/super.c                       |   3 +-
>  fs/tracefs/inode.c                        |   3 +-
>  fs/ubifs/super.c                          |   3 +-
>  fs/udf/super.c                            |   3 +-
>  fs/ufs/super.c                            |   3 +-
>  fs/userfaultfd.c                          |   3 +-
>  fs/vboxsf/super.c                         |   3 +-
>  fs/xfs/xfs_super.c                        |   3 +-
>  include/linux/mm_types.h                  |  40 ++++++---
>  include/linux/percpu.h                    |  10 +++
>  include/linux/percpu_counter.h            |   2 +
>  include/linux/slab.h                      |  21 +++--
>  ipc/mqueue.c                              |   3 +-
>  kernel/fork.c                             |  65 +++++++++-----
>  kernel/rcu/refscale.c                     |   3 +-
>  lib/percpu_counter.c                      |  25 ++++++
>  lib/radix-tree.c                          |   3 +-
>  lib/test_meminit.c                        |   3 +-
>  mm/kfence/kfence_test.c                   |   5 +-
>  mm/percpu.c                               |  79 ++++++++++------
>  mm/rmap.c                                 |   3 +-
>  mm/shmem.c                                |   3 +-
>  mm/slab.h                                 |  11 +--
>  mm/slab_common.c                          |  43 +++++----
>  mm/slub.c                                 | 105 ++++++++++++++++------
>  net/sched/act_api.c                       |  82 +++++++++++------
>  net/socket.c                              |   3 +-
>  net/sunrpc/rpc_pipe.c                     |   3 +-
>  security/integrity/ima/ima_iint.c         |   3 +-
>  88 files changed, 518 insertions(+), 226 deletions(-)
>
> --
> 2.43.0
>


-- 
Mateusz Guzik <mjguzik gmail.com>


  parent reply	other threads:[~2025-04-24  9:29 UTC|newest]

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-24  8:07 Harry Yoo
2025-04-24  8:07 ` [RFC PATCH 1/7] mm/slab: refactor freelist shuffle Harry Yoo
2025-04-24  8:07 ` [RFC PATCH 2/7] treewide, slab: allow slab constructor to return an error Harry Yoo
2025-04-24  8:07 ` [RFC PATCH 3/7] mm/slab: revive the destructor feature in slab allocator Harry Yoo
2025-04-24  8:07 ` [RFC PATCH 4/7] net/sched/act_api: use slab ctor/dtor to reduce contention on pcpu alloc Harry Yoo
2025-04-24  8:07 ` [RFC PATCH 5/7] mm/percpu: allow (un)charging objects without alloc/free Harry Yoo
2025-04-24  8:07 ` [RFC PATCH 6/7] lib/percpu_counter: allow (un)charging percpu counters " Harry Yoo
2025-04-24  8:07 ` [RFC PATCH 7/7] kernel/fork: improve exec() throughput with slab ctor/dtor pair Harry Yoo
2025-04-24  9:29 ` Mateusz Guzik [this message]
2025-04-24  9:58   ` [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu allocator scalability problem Harry Yoo
2025-04-24 15:00     ` Mateusz Guzik
2025-04-24 11:28 ` Pedro Falcato
2025-04-24 15:20   ` Mateusz Guzik
2025-04-24 16:11     ` Mateusz Guzik
2025-04-25  7:40     ` Harry Yoo
2025-04-25 10:12   ` Harry Yoo
2025-04-25 10:42     ` Pedro Falcato
2025-04-28  1:18       ` Harry Yoo
2025-04-30 19:49       ` Mateusz Guzik
2025-05-12 11:00         ` Harry Yoo
2025-04-24 15:50 ` Christoph Lameter (Ampere)
2025-04-24 16:03   ` Mateusz Guzik
2025-04-24 16:39     ` Christoph Lameter (Ampere)
2025-04-24 17:26       ` Mateusz Guzik
2025-04-24 18:47 ` Tejun Heo
2025-04-25 10:10   ` Harry Yoo
2025-04-25 19:03     ` Tejun Heo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAGudoHGkNn032RVuJdwYqpzfQgAB8pv=hEzdR1APsFOOSQnq1Q@mail.gmail.com' \
    --to=mjguzik@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=byungchul@sk.com \
    --cc=cl@gentwo.org \
    --cc=dennis@kernel.org \
    --cc=harry.yoo@oracle.com \
    --cc=jack@suse.cz \
    --cc=jhs@mojatatu.com \
    --cc=jiri@resnulli.us \
    --cc=kliteyn@nvidia.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=netdev@vger.kernel.org \
    --cc=rientjes@google.com \
    --cc=tj@kernel.org \
    --cc=vbabka@suse.cz \
    --cc=vladbu@nvidia.com \
    --cc=xiyou.wangcong@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox