Date: Tue, 27 Aug 2024 15:42:02 +0200
From: Christian Brauner <brauner@kernel.org>
To: Mike Rapoport
Cc: Vlastimil Babka, Jens Axboe, "Paul E. McKenney", Roman Gushchin,
	Linus Torvalds, Jann Horn, linux-mm@kvack.org
Subject: Re: [PATCH] [RFC] mm: add kmem_cache_create_rcu()
Message-ID: <20240827-perlen-mildern-cf8156a8b150@brauner>
References: <20240826-zweihundert-luftkammer-9b2b47f4338e@brauner>
 <20240826-okkupieren-nachdenken-d88ac627e9bc@brauner>

On Tue, Aug 27, 2024 at 01:42:30PM GMT, Mike Rapoport wrote:
> On Mon, Aug 26, 2024 at 06:04:13PM +0200, Christian Brauner wrote:
> > When a kmem cache is created with SLAB_TYPESAFE_BY_RCU the free pointer
> > must be located outside of the object because we don't know what part of
> > the memory can safely be overwritten as it may be needed to prevent
> > object recycling.
> >
> > That has the consequence that SLAB_TYPESAFE_BY_RCU may end up adding a
> > new cacheline. This is the case for e.g. struct file. After having it
> > shrunk down by 40 bytes and having it fit in three cachelines we still
> > have SLAB_TYPESAFE_BY_RCU adding a fourth cacheline because it needs to
> > accommodate the free pointer and is hardware cacheline aligned.
> >
> > I tried to find ways to rectify this as struct file is pretty much
> > everywhere and having it use less memory is a good thing. So here's a
> > proposal that might be totally the wrong API and broken, but I thought
> > I'd give it a try.
> >
> > Signed-off-by: Christian Brauner
> > ---
> >  fs/file_table.c      |  7 ++--
> >  include/linux/fs.h   |  1 +
> >  include/linux/slab.h |  4 +++
> >  mm/slab.h            |  1 +
> >  mm/slab_common.c     | 76 +++++++++++++++++++++++++++++++++++++-------
> >  mm/slub.c            | 22 +++++++++----
> >  6 files changed, 91 insertions(+), 20 deletions(-)
> >
> > diff --git a/fs/file_table.c b/fs/file_table.c
> > index 694199a1a966..a69b8a71eacb 100644
> > --- a/fs/file_table.c
> > +++ b/fs/file_table.c
> > @@ -514,9 +514,10 @@ EXPORT_SYMBOL(__fput_sync);
> >
> >  void __init files_init(void)
> >  {
> > -	filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
> > -				SLAB_TYPESAFE_BY_RCU | SLAB_HWCACHE_ALIGN |
> > -				SLAB_PANIC | SLAB_ACCOUNT, NULL);
> > +	filp_cachep = kmem_cache_create_rcu("filp", sizeof(struct file),
> > +				offsetof(struct file, __f_slab_free_ptr),
> > +				SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT,
> > +				NULL);
> >  	percpu_counter_init(&nr_files, 0, GFP_KERNEL);
> >  }
> >
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 61097a9cf317..de509f5d1446 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -1057,6 +1057,7 @@ struct file {
> >  		struct callback_head	f_task_work;
> >  		struct llist_node	f_llist;
> >  		struct file_ra_state	f_ra;
> > +		void			*__f_slab_free_ptr;
> >  	};
> >  	/* --- cacheline 3 boundary (192 bytes) --- */
> >  } __randomize_layout
> > diff --git a/include/linux/slab.h b/include/linux/slab.h
> > index eb2bf4629157..fc3c3cc9f689 100644
> > --- a/include/linux/slab.h
> > +++ b/include/linux/slab.h
> > @@ -242,6 +242,10 @@ struct kmem_cache *kmem_cache_create_usercopy(const char *name,
> >  			slab_flags_t flags,
> >  			unsigned int useroffset, unsigned int usersize,
> >  			void (*ctor)(void *));
> > +struct kmem_cache *kmem_cache_create_rcu(const char *name, unsigned int size,
> > +					 unsigned int offset,
>
> Just 'offset' is too vague, no?
> Maybe freeptr_offset?

Yep, switched to that.

> > +					 slab_flags_t flags,
> > +					 void (*ctor)(void *));
> >  void kmem_cache_destroy(struct kmem_cache *s);
> >  int kmem_cache_shrink(struct kmem_cache *s);
> >
> > diff --git a/mm/slab.h b/mm/slab.h
> > index dcdb56b8e7f5..122ca41fea34 100644
> > --- a/mm/slab.h
> > +++ b/mm/slab.h
> > @@ -261,6 +261,7 @@ struct kmem_cache {
> >  	unsigned int object_size;	/* Object size without metadata */
> >  	struct reciprocal_value reciprocal_size;
> >  	unsigned int offset;		/* Free pointer offset */
> > +	bool dedicated_offset;		/* Specific free pointer requested */
>
> has_freeptr_offset?

Your comments made me think that this should just be rcu_freeptr_offset
and avoid that boolean completely.
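To make the intended usage pattern concrete, here's a minimal, untested
sketch against the API in this patch ("my_obj", "my_cache" and
"free_node" are made-up names; the union mirrors how struct file above
reuses a member that is only live on free paths):

	struct my_obj {
		refcount_t ref;
		union {
			/* Only touched once the object is being freed... */
			struct llist_node free_node;
			/* ...so its slot can double as the slab free pointer. */
			void *__slab_free_ptr;
		};
	};

	static struct kmem_cache *my_cache;

	static int __init my_obj_cache_init(void)
	{
		/* kmem_cache_create_rcu() ORs in SLAB_TYPESAFE_BY_RCU itself. */
		my_cache = kmem_cache_create_rcu("my_obj", sizeof(struct my_obj),
						 offsetof(struct my_obj, __slab_free_ptr),
						 SLAB_ACCOUNT, NULL);
		if (!my_cache)
			return -ENOMEM;
		return 0;
	}

That way the cache stores its free pointer inside the object instead of
appending one after it, so the object doesn't grow past its last
cacheline.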
> >  #ifdef CONFIG_SLUB_CPU_PARTIAL
> >  	/* Number of per cpu partial objects to keep around */
> >  	unsigned int cpu_partial;
> > diff --git a/mm/slab_common.c b/mm/slab_common.c
> > index 40b582a014b8..b6ca63859b3a 100644
> > --- a/mm/slab_common.c
> > +++ b/mm/slab_common.c
> > @@ -202,10 +202,10 @@ struct kmem_cache *find_mergeable(unsigned int size, unsigned int align,
> >  }
> >
> >  static struct kmem_cache *create_cache(const char *name,
> > -			unsigned int object_size, unsigned int align,
> > -			slab_flags_t flags, unsigned int useroffset,
> > -			unsigned int usersize, void (*ctor)(void *),
> > -			struct kmem_cache *root_cache)
> > +			unsigned int object_size, unsigned int offset,
> > +			unsigned int align, slab_flags_t flags,
> > +			unsigned int useroffset, unsigned int usersize,
> > +			void (*ctor)(void *), struct kmem_cache *root_cache)
> >  {
> >  	struct kmem_cache *s;
> >  	int err;
> >
> > @@ -213,6 +213,10 @@ static struct kmem_cache *create_cache(const char *name,
> >  	if (WARN_ON(useroffset + usersize > object_size))
> >  		useroffset = usersize = 0;
> >
> > +	if (WARN_ON(offset >= object_size ||
> > +		    (offset && !(flags & SLAB_TYPESAFE_BY_RCU))))
> > +		offset = 0;
> > +
> >  	err = -ENOMEM;
> >  	s = kmem_cache_zalloc(kmem_cache, GFP_KERNEL);
> >  	if (!s)
> > @@ -226,6 +230,10 @@ static struct kmem_cache *create_cache(const char *name,
> >  	s->useroffset = useroffset;
> >  	s->usersize = usersize;
> >  #endif
> > +	if (offset > 0) {
> > +		s->offset = offset;
> > +		s->dedicated_offset = true;
> > +	}
> >
> >  	err = __kmem_cache_create(s, flags);
> >  	if (err)
> > @@ -269,10 +277,10 @@ static struct kmem_cache *create_cache(const char *name,
> >   *
> >   * Return: a pointer to the cache on success, NULL on failure.
> >   */
> > -struct kmem_cache *
> > -kmem_cache_create_usercopy(const char *name,
> > -			   unsigned int size, unsigned int align,
> > -			   slab_flags_t flags,
> > +static struct kmem_cache *
> > +do_kmem_cache_create_usercopy(const char *name,
> > +			      unsigned int size, unsigned int offset,
> > +			      unsigned int align, slab_flags_t flags,
> >  			   unsigned int useroffset, unsigned int usersize,
> >  			   void (*ctor)(void *))
> >  {
> > @@ -332,7 +340,7 @@ kmem_cache_create_usercopy(const char *name,
> >  		goto out_unlock;
> >  	}
> >
> > -	s = create_cache(cache_name, size,
> > +	s = create_cache(cache_name, size, offset,
> >  			 calculate_alignment(flags, align, size),
> >  			 flags, useroffset, usersize, ctor, NULL);
> >  	if (IS_ERR(s)) {
> > @@ -356,6 +364,16 @@ kmem_cache_create_usercopy(const char *name,
> >  	}
> >  	return s;
> >  }
> > +
> > +struct kmem_cache *
> > +kmem_cache_create_usercopy(const char *name, unsigned int size,
> > +			   unsigned int align, slab_flags_t flags,
> > +			   unsigned int useroffset, unsigned int usersize,
> > +			   void (*ctor)(void *))
> > +{
> > +	return do_kmem_cache_create_usercopy(name, size, 0, align, flags,
> > +					     useroffset, usersize, ctor);
> > +}
> >  EXPORT_SYMBOL(kmem_cache_create_usercopy);
> >
> >  /**
> > @@ -387,11 +405,47 @@ struct kmem_cache *
> >  kmem_cache_create(const char *name, unsigned int size, unsigned int align,
> >  		slab_flags_t flags, void (*ctor)(void *))
> >  {
> > -	return kmem_cache_create_usercopy(name, size, align, flags, 0, 0,
> > -					  ctor);
> > +	return do_kmem_cache_create_usercopy(name, size, 0, align, flags, 0, 0,
> > +					     ctor);
> >  }
> >  EXPORT_SYMBOL(kmem_cache_create);
> >
> > +/**
> > + * kmem_cache_create_rcu - Create a SLAB_TYPESAFE_BY_RCU cache.
> > + * @name: A string which is used in /proc/slabinfo to identify this cache.
> > + * @size: The size of objects to be created in this cache.
> > + * @offset: The offset into the memory to the free pointer
> > + * @flags: SLAB flags
> > + * @ctor: A constructor for the objects.
> > + *
> > + * Cannot be called within an interrupt, but can be interrupted.
> > + * The @ctor is run when new pages are allocated by the cache.
> > + *
> > + * The flags are
> > + *
> > + * %SLAB_POISON - Poison the slab with a known test pattern (a5a5a5a5)
> > + * to catch references to uninitialised memory.
> > + *
> > + * %SLAB_RED_ZONE - Insert `Red` zones around the allocated memory to check
> > + * for buffer overruns.
> > + *
> > + * %SLAB_HWCACHE_ALIGN - Align the objects in this cache to a hardware
> > + * cacheline. This can be beneficial if you're counting cycles as closely
> > + * as davem.
> > + *
> > + * Return: a pointer to the cache on success, NULL on failure.
> > + */
> > +struct kmem_cache *kmem_cache_create_rcu(const char *name, unsigned int size,
> > +					 unsigned int offset,
> > +					 slab_flags_t flags,
> > +					 void (*ctor)(void *))
> > +{
> > +	return do_kmem_cache_create_usercopy(name, size, offset, 0,
> > +					     flags | SLAB_TYPESAFE_BY_RCU, 0, 0,
> > +					     ctor);
> > +}
> > +EXPORT_SYMBOL(kmem_cache_create_rcu);
> > +
> >  static struct kmem_cache *kmem_buckets_cache __ro_after_init;
> >
> >  /**
> > diff --git a/mm/slub.c b/mm/slub.c
> > index c9d8a2497fd6..34eac3f9a46e 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -3926,7 +3926,7 @@ static __always_inline void maybe_wipe_obj_freeptr(struct kmem_cache *s,
> >  						   void *obj)
> >  {
> >  	if (unlikely(slab_want_init_on_free(s)) && obj &&
> > -	    !freeptr_outside_object(s))
> > +	    !freeptr_outside_object(s) && !s->dedicated_offset)
> >  		memset((void *)((char *)kasan_reset_tag(obj) + s->offset),
> >  		       0, sizeof(void *));
> >  }
> > @@ -5153,6 +5153,7 @@ static int calculate_sizes(struct kmem_cache *s)
> >  	slab_flags_t flags = s->flags;
> >  	unsigned int size = s->object_size;
> >  	unsigned int order;
> > +	bool must_use_freeptr_offset;
> >
> >  	/*
> >  	 * Round up object size to the next word boundary. We can only
> > @@ -5189,9 +5190,12 @@ static int calculate_sizes(struct kmem_cache *s)
> >  	 */
> >  	s->inuse = size;
> >
> > -	if ((flags & (SLAB_TYPESAFE_BY_RCU | SLAB_POISON)) || s->ctor ||
> > -	    ((flags & SLAB_RED_ZONE) &&
> > -	     (s->object_size < sizeof(void *) || slub_debug_orig_size(s)))) {
> > +	must_use_freeptr_offset =
> > +		(flags & SLAB_POISON) || s->ctor ||
> > +		((flags & SLAB_RED_ZONE) &&
> > +		 (s->object_size < sizeof(void *) || slub_debug_orig_size(s)));
> > +
> > +	if ((flags & SLAB_TYPESAFE_BY_RCU) || must_use_freeptr_offset) {
> >  		/*
> >  		 * Relocate free pointer after the object if it is not
> >  		 * permitted to overwrite the first word of the object on
> > @@ -5208,8 +5212,13 @@ static int calculate_sizes(struct kmem_cache *s)
> >  		 * freeptr_outside_object() function. If that is no
> >  		 * longer true, the function needs to be modified.
> >  		 */
> > -		s->offset = size;
> > -		size += sizeof(void *);
> > +		if (!(flags & SLAB_TYPESAFE_BY_RCU) || must_use_freeptr_offset) {
> > +			s->offset = size;
> > +			size += sizeof(void *);
> > +			s->dedicated_offset = false;
> > +		} else {
> > +			s->dedicated_offset = true;
>
> Hmm, this seems to set s->dedicated_offset for any SLAB_TYPESAFE_BY_RCU
> cache, even those that weren't created with kmem_cache_create_rcu().
>
> Shouldn't we have
>
> 	must_use_freeptr_offset =
> 		((flags & SLAB_TYPESAFE_BY_RCU) && !s->dedicated_offset) ||
> 		(flags & SLAB_POISON) || s->ctor ||
> 		((flags & SLAB_RED_ZONE) &&
> 		 (s->object_size < sizeof(void *) || slub_debug_orig_size(s)));
>
> 	if (must_use_freeptr_offset) {
> 		...
> 	}

Yep, that's better. Will send a fixed version.
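Concretely, folding your suggestion in, the branch in calculate_sizes()
would end up roughly like this (untested sketch; the body is just the
existing relocation code from the patch above):

	must_use_freeptr_offset =
		/* RCU caches without a caller-provided offset still need one. */
		((flags & SLAB_TYPESAFE_BY_RCU) && !s->dedicated_offset) ||
		(flags & SLAB_POISON) || s->ctor ||
		((flags & SLAB_RED_ZONE) &&
		 (s->object_size < sizeof(void *) || slub_debug_orig_size(s)));

	if (must_use_freeptr_offset) {
		/* Relocate the free pointer after the object. */
		s->offset = size;
		size += sizeof(void *);
		s->dedicated_offset = false;
	}

That also keeps s->dedicated_offset from being set for RCU caches that
didn't go through kmem_cache_create_rcu().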