* [PATCH slab v5 0/6] slab: Re-entrant kmalloc_nolock()
@ 2025-09-09 1:00 Alexei Starovoitov
2025-09-09 1:00 ` [PATCH slab v5 1/6] locking/local_lock: Introduce local_lock_is_locked() Alexei Starovoitov
` (6 more replies)
0 siblings, 7 replies; 45+ messages in thread
From: Alexei Starovoitov @ 2025-09-09 1:00 UTC (permalink / raw)
To: bpf, linux-mm
Cc: vbabka, harry.yoo, shakeel.butt, mhocko, bigeasy, andrii, memxor,
akpm, peterz, rostedt, hannes
From: Alexei Starovoitov <ast@kernel.org>
Overview:
This patch set introduces kmalloc_nolock(), the next logical step
towards any-context allocation, which is necessary to remove
bpf_mem_alloc and get rid of the preallocation requirement in the BPF
infrastructure.
In production, BPF maps have grown to gigabytes in size and
preallocation wastes memory. Allocating from any context addresses
this issue for BPF and for other subsystems that are forced to
preallocate today.
This long task started with the introduction of alloc_pages_nolock(),
then memcg and objcg were converted to operate from any context,
including NMI. This set completes the task with kmalloc_nolock(),
which builds on top of alloc_pages_nolock() and the memcg changes.
After that the BPF subsystem will gradually adopt it everywhere.
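For illustration, a minimal sketch of the intended usage, assuming the
kmalloc_node()-style (size, gfp flags, node) interface introduced in
patch 6/6; struct my_elem and the helper names are hypothetical:

struct my_elem { int node; };	/* hypothetical example struct */

/*
 * Opportunistic allocation that is safe in any context (NMI, hard IRQ,
 * while holding other locks). NULL means give up or defer the work,
 * not "retry until it succeeds".
 */
static struct my_elem *grab_elem(int node)
{
	struct my_elem *e;

	e = kmalloc_nolock(sizeof(*e), 0, node);
	if (!e)
		return NULL;
	e->node = node;
	return e;
}

/*
 * Objects obtained from kmalloc_nolock() are freed with kfree_nolock().
 * kfree_nolock() must not be used on objects from plain kmalloc().
 */
static void drop_elem(struct my_elem *e)
{
	kfree_nolock(e);
}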
The patch set is on top of slab/for-next, which already has the
pre-patch "locking/local_lock: Expose dep_map in local_trylock_t." applied.
I think the patch set should be routed via vbabka/slab.git.
v4->v5:
- New patch "Reuse first bit for OBJEXTS_ALLOC_FAIL" to free up a bit
and use it to mark a slabobj_ext vector allocated with kmalloc_nolock(),
so that the vector can later be freed with kfree_nolock()
- Call kasan_slab_free() directly from kfree_nolock() instead of deferring to
do_slab_free() to avoid double poisoning
- Addressed other minor issues spotted by Harry
v4:
https://lore.kernel.org/all/20250718021646.73353-1-alexei.starovoitov@gmail.com/
v3->v4:
- Converted local_lock_cpu_slab() to macro
- Reordered patches 5 and 6
- Emphasized that kfree_nolock() shouldn't be used on kmalloc()-ed objects
- Addressed other comments and improved commit logs
- Fixed build issues reported by bots
v3:
https://lore.kernel.org/bpf/20250716022950.69330-1-alexei.starovoitov@gmail.com/
v2->v3:
- Adopted Sebastian's local_lock_cpu_slab(), but dropped gfpflags
to avoid an extra branch for performance reasons,
and added local_unlock_cpu_slab() for symmetry (see the sketch after
this list).
- Dropped the local_lock_lockdep_start/end() pair and switched to a
per-kmem_cache lockdep class on PREEMPT_RT to silence a false positive
when the same cpu/task acquires two local_lock-s.
- Refactored defer_free per Sebastian's suggestion
- Fixed a slab leak when a slab needs to be deactivated via irq_work
and llist, as Vlastimil proposed. Including defer_free_barrier().
- Use kmem_cache->offset for the llist_node pointer when linking objects
instead of offset zero, since the whole object could be in use for
slabs with ctors and in other cases.
- Fixed "cnt = 1; goto redo;" issue.
- Fixed slab leak in alloc_single_from_new_slab().
- Retested with slab_debug, RT, !RT, lockdep, kasan, slab_tiny
- Added acks to patches 1-4 that should be good to go.
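For readers unfamiliar with that helper pair: a rough sketch of what it
could look like, assuming it simply wraps the per-CPU slab lock with
local_lock_irqsave()/local_unlock_irqrestore(); the macros in the series
itself may differ (for example in how they handle PREEMPT_RT):

/* Sketch only, not the series' implementation */
#define local_lock_cpu_slab(s, flags)					\
	local_lock_irqsave(&(s)->cpu_slab->lock, flags)

#define local_unlock_cpu_slab(s, flags)					\
	local_unlock_irqrestore(&(s)->cpu_slab->lock, flags)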
v2:
https://lore.kernel.org/bpf/20250709015303.8107-1-alexei.starovoitov@gmail.com/
v1->v2:
Added more comments for this non-trivial logic and addressed earlier comments.
In particular:
- Introduce alloc_frozen_pages_nolock() to avoid refcnt race
- alloc_pages_nolock() defaults to __GFP_COMP
- Support SLUB_TINY
- Added more variants to the stress tester and discovered that
kfree_nolock() can OOM, because the deferred per-slab llist won't be
serviced if kfree_nolock() stays unlucky long enough. Scrapped the
previous approach and switched to a global per-cpu llist with immediate
irq_work_queue() to process all object sizes (see the sketch after
this list).
- Reentrant kmalloc cannot deactivate_slab(). In v1 the node hint was
downgraded to NUMA_NO_NODE before calling slab_alloc(). Realized it's
not good enough: there are odd cases that can trigger deactivation.
Rewrote this part.
- Struggled with SLAB_NO_CMPXCHG. Thankfully Harry had a great suggestion:
https://lore.kernel.org/bpf/aFvfr1KiNrLofavW@hyeyoo/
which was adopted. So slab_debug works now.
- In v1 I had to s/local_lock_irqsave/local_lock_irqsave_check/ in a bunch
of places in mm/slub.c to avoid lockdep false positives.
Came up with a much cleaner approach to silence invalid lockdep reports
without sacrificing lockdep coverage. See local_lock_lockdep_start/end().
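The deferred-free shape described in the llist bullet above, as a
standalone sketch; the struct and function names here are illustrative,
not the ones used in mm/slub.c:

#include <linux/irq_work.h>
#include <linux/llist.h>
#include <linux/percpu.h>

/*
 * One deferred-free list per CPU. Objects are linked through a
 * llist_node stored inside the object itself (the series uses
 * kmem_cache->offset for that, see the v2->v3 notes above).
 */
struct defer_free {
	struct llist_head	objects;
	struct irq_work		work;
};
static DEFINE_PER_CPU(struct defer_free, deferred_free);

static void free_deferred(struct irq_work *work)
{
	struct defer_free *df = container_of(work, struct defer_free, work);
	struct llist_node *pos, *t;

	/* grab the whole per-cpu list at once and walk it */
	llist_for_each_safe(pos, t, llist_del_all(&df->objects)) {
		/* map the llist_node back to its object and free it here */
	}
}

/*
 * init_irq_work(&per_cpu(deferred_free, cpu).work, free_deferred)
 * has to run for every CPU during boot.
 */

static void defer_free(struct llist_node *llnode)
{
	struct defer_free *df = this_cpu_ptr(&deferred_free);

	llist_add(llnode, &df->objects);
	/* immediate kick; a no-op if this CPU's work is already pending */
	irq_work_queue(&df->work);
}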
v1:
https://lore.kernel.org/bpf/20250501032718.65476-1-alexei.starovoitov@gmail.com/
Alexei Starovoitov (6):
locking/local_lock: Introduce local_lock_is_locked().
mm: Allow GFP_ACCOUNT to be used in alloc_pages_nolock().
mm: Introduce alloc_frozen_pages_nolock()
slab: Make slub local_(try)lock more precise for LOCKDEP
slab: Reuse first bit for OBJEXTS_ALLOC_FAIL
slab: Introduce kmalloc_nolock() and kfree_nolock().
include/linux/gfp.h | 2 +-
include/linux/kasan.h | 13 +-
include/linux/local_lock.h | 2 +
include/linux/local_lock_internal.h | 7 +
include/linux/memcontrol.h | 12 +-
include/linux/rtmutex.h | 10 +
include/linux/slab.h | 4 +
kernel/bpf/stream.c | 2 +-
kernel/bpf/syscall.c | 2 +-
kernel/locking/rtmutex_common.h | 9 -
mm/Kconfig | 1 +
mm/internal.h | 4 +
mm/kasan/common.c | 5 +-
mm/page_alloc.c | 55 ++--
mm/slab.h | 7 +
mm/slab_common.c | 3 +
mm/slub.c | 495 +++++++++++++++++++++++++---
17 files changed, 541 insertions(+), 92 deletions(-)
--
2.47.3
^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH slab v5 1/6] locking/local_lock: Introduce local_lock_is_locked().
  2025-09-09  1:00 [PATCH slab v5 0/6] slab: Re-entrant kmalloc_nolock() Alexei Starovoitov
@ 2025-09-09  1:00 ` Alexei Starovoitov
  2025-09-09  1:00 ` [PATCH slab v5 2/6] mm: Allow GFP_ACCOUNT to be used in alloc_pages_nolock() Alexei Starovoitov
  ` (5 subsequent siblings)
  6 siblings, 0 replies; 45+ messages in thread
From: Alexei Starovoitov @ 2025-09-09  1:00 UTC (permalink / raw)
  To: bpf, linux-mm
  Cc: vbabka, harry.yoo, shakeel.butt, mhocko, bigeasy, andrii, memxor,
	akpm, peterz, rostedt, hannes

From: Alexei Starovoitov <ast@kernel.org>

Introduce local_lock_is_locked() that returns true when given local_lock
is locked by current cpu (in !PREEMPT_RT) or by current task (in
PREEMPT_RT). The goal is to detect a deadlock by the caller.

Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/local_lock.h          |  2 ++
 include/linux/local_lock_internal.h |  7 +++++++
 include/linux/rtmutex.h             | 10 ++++++++++
 kernel/locking/rtmutex_common.h     |  9 ---------
 4 files changed, 19 insertions(+), 9 deletions(-)

diff --git a/include/linux/local_lock.h b/include/linux/local_lock.h
index 2ba846419524..0d91d060e3e9 100644
--- a/include/linux/local_lock.h
+++ b/include/linux/local_lock.h
@@ -66,6 +66,8 @@
  */
 #define local_trylock(lock)	__local_trylock(this_cpu_ptr(lock))
 
+#define local_lock_is_locked(lock)	__local_lock_is_locked(lock)
+
 /**
  * local_trylock_irqsave - Try to acquire a per CPU local lock, save and disable
  *			   interrupts if acquired
diff --git a/include/linux/local_lock_internal.h b/include/linux/local_lock_internal.h
index 949de37700db..a4dc479157b5 100644
--- a/include/linux/local_lock_internal.h
+++ b/include/linux/local_lock_internal.h
@@ -165,6 +165,9 @@ do {								\
 		!!tl;						\
 	})
 
+/* preemption or migration must be disabled before calling __local_lock_is_locked */
+#define __local_lock_is_locked(lock)	READ_ONCE(this_cpu_ptr(lock)->acquired)
+
 #define __local_lock_release(lock)					\
 	do {								\
 		local_trylock_t *tl;					\
@@ -285,4 +288,8 @@ do {								\
 		__local_trylock(lock);				\
 	})
 
+/* migration must be disabled before calling __local_lock_is_locked */
+#define __local_lock_is_locked(__lock)				\
+	(rt_mutex_owner(&this_cpu_ptr(__lock)->lock) == current)
+
 #endif /* CONFIG_PREEMPT_RT */
diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index fa9f1021541e..ede4c6bf6f22 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -44,6 +44,16 @@ static inline bool rt_mutex_base_is_locked(struct rt_mutex_base *lock)
 	return READ_ONCE(lock->owner) != NULL;
 }
 
+#ifdef CONFIG_RT_MUTEXES
+#define RT_MUTEX_HAS_WAITERS	1UL
+
+static inline struct task_struct *rt_mutex_owner(struct rt_mutex_base *lock)
+{
+	unsigned long owner = (unsigned long) READ_ONCE(lock->owner);
+
+	return (struct task_struct *) (owner & ~RT_MUTEX_HAS_WAITERS);
+}
+#endif
 extern void rt_mutex_base_init(struct rt_mutex_base *rtb);
 
 /**
diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h
index 78dd3d8c6554..cf6ddd1b23a2 100644
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -153,15 +153,6 @@ static inline struct rt_mutex_waiter *task_top_pi_waiter(struct task_struct *p)
 			pi_tree.entry);
 }
 
-#define RT_MUTEX_HAS_WAITERS	1UL
-
-static inline struct task_struct *rt_mutex_owner(struct rt_mutex_base *lock)
-{
-	unsigned long owner = (unsigned long) READ_ONCE(lock->owner);
-
-	return (struct task_struct *) (owner & ~RT_MUTEX_HAS_WAITERS);
-}
-
 /*
  * Constants for rt mutex functions which have a selectable deadlock
  * detection.
-- 
2.47.3

^ permalink raw reply	[flat|nested] 45+ messages in thread
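A sketch of how a caller such as the later kmalloc_nolock() patch can
use this helper to detect re-entrancy before trying to take the per-CPU
slab lock again; 's' stands for a struct kmem_cache here, and the exact
call site lives in patch 6/6:

	/*
	 * Preemption/migration is already disabled at this point, as the
	 * comment on __local_lock_is_locked() requires.
	 */
	if (local_lock_is_locked(&s->cpu_slab->lock)) {
		/*
		 * The current cpu (!PREEMPT_RT) or task (PREEMPT_RT)
		 * already holds the lock: taking it again would deadlock,
		 * so bail out or defer the work instead.
		 */
		return NULL;
	}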
* [PATCH slab v5 2/6] mm: Allow GFP_ACCOUNT to be used in alloc_pages_nolock(). 2025-09-09 1:00 [PATCH slab v5 0/6] slab: Re-entrant kmalloc_nolock() Alexei Starovoitov 2025-09-09 1:00 ` [PATCH slab v5 1/6] locking/local_lock: Introduce local_lock_is_locked() Alexei Starovoitov @ 2025-09-09 1:00 ` Alexei Starovoitov 2025-09-12 17:11 ` Shakeel Butt ` (2 more replies) 2025-09-09 1:00 ` [PATCH slab v5 3/6] mm: Introduce alloc_frozen_pages_nolock() Alexei Starovoitov ` (4 subsequent siblings) 6 siblings, 3 replies; 45+ messages in thread From: Alexei Starovoitov @ 2025-09-09 1:00 UTC (permalink / raw) To: bpf, linux-mm Cc: vbabka, harry.yoo, shakeel.butt, mhocko, bigeasy, andrii, memxor, akpm, peterz, rostedt, hannes From: Alexei Starovoitov <ast@kernel.org> Change alloc_pages_nolock() to default to __GFP_COMP when allocating pages, since upcoming reentrant alloc_slab_page() needs __GFP_COMP. Also allow __GFP_ACCOUNT flag to be specified, since most of BPF infra needs __GFP_ACCOUNT except BPF streams. Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Alexei Starovoitov <ast@kernel.org> --- include/linux/gfp.h | 2 +- kernel/bpf/stream.c | 2 +- kernel/bpf/syscall.c | 2 +- mm/page_alloc.c | 10 ++++++---- 4 files changed, 9 insertions(+), 7 deletions(-) diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 5ebf26fcdcfa..0ceb4e09306c 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -354,7 +354,7 @@ static inline struct page *alloc_page_vma_noprof(gfp_t gfp, } #define alloc_page_vma(...) alloc_hooks(alloc_page_vma_noprof(__VA_ARGS__)) -struct page *alloc_pages_nolock_noprof(int nid, unsigned int order); +struct page *alloc_pages_nolock_noprof(gfp_t gfp_flags, int nid, unsigned int order); #define alloc_pages_nolock(...) alloc_hooks(alloc_pages_nolock_noprof(__VA_ARGS__)) extern unsigned long get_free_pages_noprof(gfp_t gfp_mask, unsigned int order); diff --git a/kernel/bpf/stream.c b/kernel/bpf/stream.c index ab592db4a4bf..eb6c5a21c2ef 100644 --- a/kernel/bpf/stream.c +++ b/kernel/bpf/stream.c @@ -83,7 +83,7 @@ static struct bpf_stream_page *bpf_stream_page_replace(void) struct bpf_stream_page *stream_page, *old_stream_page; struct page *page; - page = alloc_pages_nolock(NUMA_NO_NODE, 0); + page = alloc_pages_nolock(/* Don't account */ 0, NUMA_NO_NODE, 0); if (!page) return NULL; stream_page = page_address(page); diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 0fbfa8532c39..dbf86f8014de 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -581,7 +581,7 @@ static bool can_alloc_pages(void) static struct page *__bpf_alloc_page(int nid) { if (!can_alloc_pages()) - return alloc_pages_nolock(nid, 0); + return alloc_pages_nolock(__GFP_ACCOUNT, nid, 0); return alloc_pages_node(nid, GFP_KERNEL | __GFP_ZERO | __GFP_ACCOUNT diff --git a/mm/page_alloc.c b/mm/page_alloc.c index d1d037f97c5f..30ccff0283fd 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -7480,6 +7480,7 @@ static bool __free_unaccepted(struct page *page) /** * alloc_pages_nolock - opportunistic reentrant allocation from any context + * @gfp_flags: GFP flags. Only __GFP_ACCOUNT allowed. * @nid: node to allocate from * @order: allocation order size * @@ -7493,7 +7494,7 @@ static bool __free_unaccepted(struct page *page) * Return: allocated page or NULL on failure. NULL does not mean EBUSY or EAGAIN. * It means ENOMEM. There is no reason to call it again and expect !NULL. 
*/ -struct page *alloc_pages_nolock_noprof(int nid, unsigned int order) +struct page *alloc_pages_nolock_noprof(gfp_t gfp_flags, int nid, unsigned int order) { /* * Do not specify __GFP_DIRECT_RECLAIM, since direct claim is not allowed. @@ -7515,12 +7516,13 @@ struct page *alloc_pages_nolock_noprof(int nid, unsigned int order) * specify it here to highlight that alloc_pages_nolock() * doesn't want to deplete reserves. */ - gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC - | __GFP_ACCOUNT; + gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC | __GFP_COMP + | gfp_flags; unsigned int alloc_flags = ALLOC_TRYLOCK; struct alloc_context ac = { }; struct page *page; + VM_WARN_ON_ONCE(gfp_flags & ~__GFP_ACCOUNT); /* * In PREEMPT_RT spin_trylock() will call raw_spin_lock() which is * unsafe in NMI. If spin_trylock() is called from hard IRQ the current @@ -7558,7 +7560,7 @@ struct page *alloc_pages_nolock_noprof(int nid, unsigned int order) if (page) set_page_refcounted(page); - if (memcg_kmem_online() && page && + if (memcg_kmem_online() && page && (gfp_flags & __GFP_ACCOUNT) && unlikely(__memcg_kmem_charge_page(page, alloc_gfp, order) != 0)) { free_pages_nolock(page, order); page = NULL; -- 2.47.3 ^ permalink raw reply [flat|nested] 45+ messages in thread
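The two alternative call forms from the hunks above, restated as a short
usage sketch (return-value semantics per the kernel-doc: NULL means
ENOMEM, so there is no point in retrying):

	/* memcg-accounted allocation, as most of the BPF infra needs */
	page = alloc_pages_nolock(__GFP_ACCOUNT, nid, 0);

	/* unaccounted allocation, as in BPF streams */
	page = alloc_pages_nolock(/* Don't account */ 0, NUMA_NO_NODE, 0);

	if (!page)
		return NULL;	/* ENOMEM; a retry will not help */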
* Re: [PATCH slab v5 2/6] mm: Allow GFP_ACCOUNT to be used in alloc_pages_nolock(). 2025-09-09 1:00 ` [PATCH slab v5 2/6] mm: Allow GFP_ACCOUNT to be used in alloc_pages_nolock() Alexei Starovoitov @ 2025-09-12 17:11 ` Shakeel Butt 2025-09-12 17:15 ` Matthew Wilcox 2025-09-12 17:47 ` Shakeel Butt 2025-09-15 5:25 ` Harry Yoo 2 siblings, 1 reply; 45+ messages in thread From: Shakeel Butt @ 2025-09-12 17:11 UTC (permalink / raw) To: Alexei Starovoitov Cc: bpf, linux-mm, vbabka, harry.yoo, mhocko, bigeasy, andrii, memxor, akpm, peterz, rostedt, hannes On Mon, Sep 08, 2025 at 06:00:03PM -0700, Alexei Starovoitov wrote: [...] > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index d1d037f97c5f..30ccff0283fd 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -7480,6 +7480,7 @@ static bool __free_unaccepted(struct page *page) > > /** > * alloc_pages_nolock - opportunistic reentrant allocation from any context > + * @gfp_flags: GFP flags. Only __GFP_ACCOUNT allowed. If only __GFP_ACCOUNT is allowed then why not use a 'bool account' in the parameter and add __GFP_ACCOUNT if account is true? > * @nid: node to allocate from > * @order: allocation order size > * > @@ -7493,7 +7494,7 @@ static bool __free_unaccepted(struct page *page) > * Return: allocated page or NULL on failure. NULL does not mean EBUSY or EAGAIN. > * It means ENOMEM. There is no reason to call it again and expect !NULL. > */ ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH slab v5 2/6] mm: Allow GFP_ACCOUNT to be used in alloc_pages_nolock(). 2025-09-12 17:11 ` Shakeel Butt @ 2025-09-12 17:15 ` Matthew Wilcox 2025-09-12 17:34 ` Alexei Starovoitov 0 siblings, 1 reply; 45+ messages in thread From: Matthew Wilcox @ 2025-09-12 17:15 UTC (permalink / raw) To: Shakeel Butt Cc: Alexei Starovoitov, bpf, linux-mm, vbabka, harry.yoo, mhocko, bigeasy, andrii, memxor, akpm, peterz, rostedt, hannes On Fri, Sep 12, 2025 at 10:11:26AM -0700, Shakeel Butt wrote: > On Mon, Sep 08, 2025 at 06:00:03PM -0700, Alexei Starovoitov wrote: > [...] > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > index d1d037f97c5f..30ccff0283fd 100644 > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -7480,6 +7480,7 @@ static bool __free_unaccepted(struct page *page) > > > > /** > > * alloc_pages_nolock - opportunistic reentrant allocation from any context > > + * @gfp_flags: GFP flags. Only __GFP_ACCOUNT allowed. > > If only __GFP_ACCOUNT is allowed then why not use a 'bool account' in the > parameter and add __GFP_ACCOUNT if account is true? It's clearer in the callers to call alloc_pages_nolock(__GFP_ACCOUNT) than it is to call alloc_pages_nolock(true). I can immediately tell what the first one does. I have no idea what the polarity of 'true' might be (does it mean accounted or unaccounted?) Is it rlated to accounting, GFP_COMP, highmem, whether it's OK to access atomic reserves ... or literally anything else that you might want to select when allocating memory. This use of unadorned booleans is an antipattern. Nobody should be advocating for such things. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH slab v5 2/6] mm: Allow GFP_ACCOUNT to be used in alloc_pages_nolock(). 2025-09-12 17:15 ` Matthew Wilcox @ 2025-09-12 17:34 ` Alexei Starovoitov 2025-09-12 17:46 ` Shakeel Butt 0 siblings, 1 reply; 45+ messages in thread From: Alexei Starovoitov @ 2025-09-12 17:34 UTC (permalink / raw) To: Matthew Wilcox Cc: Shakeel Butt, bpf, linux-mm, Vlastimil Babka, Harry Yoo, Michal Hocko, Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner On Fri, Sep 12, 2025 at 10:15 AM Matthew Wilcox <willy@infradead.org> wrote: > > On Fri, Sep 12, 2025 at 10:11:26AM -0700, Shakeel Butt wrote: > > On Mon, Sep 08, 2025 at 06:00:03PM -0700, Alexei Starovoitov wrote: > > [...] > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > > index d1d037f97c5f..30ccff0283fd 100644 > > > --- a/mm/page_alloc.c > > > +++ b/mm/page_alloc.c > > > @@ -7480,6 +7480,7 @@ static bool __free_unaccepted(struct page *page) > > > > > > /** > > > * alloc_pages_nolock - opportunistic reentrant allocation from any context > > > + * @gfp_flags: GFP flags. Only __GFP_ACCOUNT allowed. > > > > If only __GFP_ACCOUNT is allowed then why not use a 'bool account' in the > > parameter and add __GFP_ACCOUNT if account is true? > > It's clearer in the callers to call alloc_pages_nolock(__GFP_ACCOUNT) > than it is to call alloc_pages_nolock(true). > > I can immediately tell what the first one does. I have no idea what > the polarity of 'true' might be (does it mean accounted or unaccounted?) > Is it rlated to accounting, GFP_COMP, highmem, whether it's OK to access > atomic reserves ... or literally anything else that you might want to > select when allocating memory. > > This use of unadorned booleans is an antipattern. Nobody should be > advocating for such things. +1. We strongly discourage bool in arguments in any function. It makes callsites unreadable. We learned it the hard way though :( Some of the verifier code became a mess like: err = check_load_mem(env, insn, true, false, false, "atomic_load"); it's on our todo to clean this up. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH slab v5 2/6] mm: Allow GFP_ACCOUNT to be used in alloc_pages_nolock(). 2025-09-12 17:34 ` Alexei Starovoitov @ 2025-09-12 17:46 ` Shakeel Butt 0 siblings, 0 replies; 45+ messages in thread From: Shakeel Butt @ 2025-09-12 17:46 UTC (permalink / raw) To: Alexei Starovoitov Cc: Matthew Wilcox, bpf, linux-mm, Vlastimil Babka, Harry Yoo, Michal Hocko, Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner On Fri, Sep 12, 2025 at 10:34:15AM -0700, Alexei Starovoitov wrote: > On Fri, Sep 12, 2025 at 10:15 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > On Fri, Sep 12, 2025 at 10:11:26AM -0700, Shakeel Butt wrote: > > > On Mon, Sep 08, 2025 at 06:00:03PM -0700, Alexei Starovoitov wrote: > > > [...] > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > > > index d1d037f97c5f..30ccff0283fd 100644 > > > > --- a/mm/page_alloc.c > > > > +++ b/mm/page_alloc.c > > > > @@ -7480,6 +7480,7 @@ static bool __free_unaccepted(struct page *page) > > > > > > > > /** > > > > * alloc_pages_nolock - opportunistic reentrant allocation from any context > > > > + * @gfp_flags: GFP flags. Only __GFP_ACCOUNT allowed. > > > > > > If only __GFP_ACCOUNT is allowed then why not use a 'bool account' in the > > > parameter and add __GFP_ACCOUNT if account is true? > > > > It's clearer in the callers to call alloc_pages_nolock(__GFP_ACCOUNT) > > than it is to call alloc_pages_nolock(true). > > > > I can immediately tell what the first one does. I have no idea what > > the polarity of 'true' might be (does it mean accounted or unaccounted?) > > Is it rlated to accounting, GFP_COMP, highmem, whether it's OK to access > > atomic reserves ... or literally anything else that you might want to > > select when allocating memory. > > > > This use of unadorned booleans is an antipattern. Nobody should be > > advocating for such things. > > +1. > We strongly discourage bool in arguments in any function. > It makes callsites unreadable. > > We learned it the hard way though :( > Some of the verifier code became a mess like: > err = check_load_mem(env, insn, true, false, false, "atomic_load"); > > it's on our todo to clean this up. Sounds good. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH slab v5 2/6] mm: Allow GFP_ACCOUNT to be used in alloc_pages_nolock(). 2025-09-09 1:00 ` [PATCH slab v5 2/6] mm: Allow GFP_ACCOUNT to be used in alloc_pages_nolock() Alexei Starovoitov 2025-09-12 17:11 ` Shakeel Butt @ 2025-09-12 17:47 ` Shakeel Butt 2025-09-15 5:25 ` Harry Yoo 2 siblings, 0 replies; 45+ messages in thread From: Shakeel Butt @ 2025-09-12 17:47 UTC (permalink / raw) To: Alexei Starovoitov Cc: bpf, linux-mm, vbabka, harry.yoo, mhocko, bigeasy, andrii, memxor, akpm, peterz, rostedt, hannes On Mon, Sep 08, 2025 at 06:00:03PM -0700, Alexei Starovoitov wrote: > From: Alexei Starovoitov <ast@kernel.org> > > Change alloc_pages_nolock() to default to __GFP_COMP when allocating > pages, since upcoming reentrant alloc_slab_page() needs __GFP_COMP. > Also allow __GFP_ACCOUNT flag to be specified, > since most of BPF infra needs __GFP_ACCOUNT except BPF streams. > > Reviewed-by: Vlastimil Babka <vbabka@suse.cz> > Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH slab v5 2/6] mm: Allow GFP_ACCOUNT to be used in alloc_pages_nolock(). 2025-09-09 1:00 ` [PATCH slab v5 2/6] mm: Allow GFP_ACCOUNT to be used in alloc_pages_nolock() Alexei Starovoitov 2025-09-12 17:11 ` Shakeel Butt 2025-09-12 17:47 ` Shakeel Butt @ 2025-09-15 5:25 ` Harry Yoo 2 siblings, 0 replies; 45+ messages in thread From: Harry Yoo @ 2025-09-15 5:25 UTC (permalink / raw) To: Alexei Starovoitov Cc: bpf, linux-mm, vbabka, shakeel.butt, mhocko, bigeasy, andrii, memxor, akpm, peterz, rostedt, hannes On Mon, Sep 08, 2025 at 06:00:03PM -0700, Alexei Starovoitov wrote: > From: Alexei Starovoitov <ast@kernel.org> > > Change alloc_pages_nolock() to default to __GFP_COMP when allocating > pages, since upcoming reentrant alloc_slab_page() needs __GFP_COMP. > Also allow __GFP_ACCOUNT flag to be specified, > since most of BPF infra needs __GFP_ACCOUNT except BPF streams. > > Reviewed-by: Vlastimil Babka <vbabka@suse.cz> > Signed-off-by: Alexei Starovoitov <ast@kernel.org> > --- Looks good to me, Reviewed-by: Harry Yoo <harry.yoo@oracle.com> -- Cheers, Harry / Hyeonggon ^ permalink raw reply [flat|nested] 45+ messages in thread
* [PATCH slab v5 3/6] mm: Introduce alloc_frozen_pages_nolock() 2025-09-09 1:00 [PATCH slab v5 0/6] slab: Re-entrant kmalloc_nolock() Alexei Starovoitov 2025-09-09 1:00 ` [PATCH slab v5 1/6] locking/local_lock: Introduce local_lock_is_locked() Alexei Starovoitov 2025-09-09 1:00 ` [PATCH slab v5 2/6] mm: Allow GFP_ACCOUNT to be used in alloc_pages_nolock() Alexei Starovoitov @ 2025-09-09 1:00 ` Alexei Starovoitov 2025-09-12 17:15 ` Shakeel Butt 2025-09-15 5:17 ` Harry Yoo 2025-09-09 1:00 ` [PATCH slab v5 4/6] slab: Make slub local_(try)lock more precise for LOCKDEP Alexei Starovoitov ` (3 subsequent siblings) 6 siblings, 2 replies; 45+ messages in thread From: Alexei Starovoitov @ 2025-09-09 1:00 UTC (permalink / raw) To: bpf, linux-mm Cc: vbabka, harry.yoo, shakeel.butt, mhocko, bigeasy, andrii, memxor, akpm, peterz, rostedt, hannes From: Alexei Starovoitov <ast@kernel.org> Split alloc_pages_nolock() and introduce alloc_frozen_pages_nolock() to be used by alloc_slab_page(). Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Alexei Starovoitov <ast@kernel.org> --- mm/internal.h | 4 ++++ mm/page_alloc.c | 49 ++++++++++++++++++++++++++++--------------------- 2 files changed, 32 insertions(+), 21 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index 45b725c3dc03..9904421cabc1 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -842,6 +842,10 @@ static inline struct page *alloc_frozen_pages_noprof(gfp_t gfp, unsigned int ord #define alloc_frozen_pages(...) \ alloc_hooks(alloc_frozen_pages_noprof(__VA_ARGS__)) +struct page *alloc_frozen_pages_nolock_noprof(gfp_t gfp_flags, int nid, unsigned int order); +#define alloc_frozen_pages_nolock(...) \ + alloc_hooks(alloc_frozen_pages_nolock_noprof(__VA_ARGS__)) + extern void zone_pcp_reset(struct zone *zone); extern void zone_pcp_disable(struct zone *zone); extern void zone_pcp_enable(struct zone *zone); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 30ccff0283fd..5a40e2b7d148 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -7478,23 +7478,7 @@ static bool __free_unaccepted(struct page *page) #endif /* CONFIG_UNACCEPTED_MEMORY */ -/** - * alloc_pages_nolock - opportunistic reentrant allocation from any context - * @gfp_flags: GFP flags. Only __GFP_ACCOUNT allowed. - * @nid: node to allocate from - * @order: allocation order size - * - * Allocates pages of a given order from the given node. This is safe to - * call from any context (from atomic, NMI, and also reentrant - * allocator -> tracepoint -> alloc_pages_nolock_noprof). - * Allocation is best effort and to be expected to fail easily so nobody should - * rely on the success. Failures are not reported via warn_alloc(). - * See always fail conditions below. - * - * Return: allocated page or NULL on failure. NULL does not mean EBUSY or EAGAIN. - * It means ENOMEM. There is no reason to call it again and expect !NULL. - */ -struct page *alloc_pages_nolock_noprof(gfp_t gfp_flags, int nid, unsigned int order) +struct page *alloc_frozen_pages_nolock_noprof(gfp_t gfp_flags, int nid, unsigned int order) { /* * Do not specify __GFP_DIRECT_RECLAIM, since direct claim is not allowed. @@ -7557,15 +7541,38 @@ struct page *alloc_pages_nolock_noprof(gfp_t gfp_flags, int nid, unsigned int or /* Unlike regular alloc_pages() there is no __alloc_pages_slowpath(). 
*/ - if (page) - set_page_refcounted(page); - if (memcg_kmem_online() && page && (gfp_flags & __GFP_ACCOUNT) && unlikely(__memcg_kmem_charge_page(page, alloc_gfp, order) != 0)) { - free_pages_nolock(page, order); + __free_frozen_pages(page, order, FPI_TRYLOCK); page = NULL; } trace_mm_page_alloc(page, order, alloc_gfp, ac.migratetype); kmsan_alloc_page(page, order, alloc_gfp); return page; } +/** + * alloc_pages_nolock - opportunistic reentrant allocation from any context + * @gfp_flags: GFP flags. Only __GFP_ACCOUNT allowed. + * @nid: node to allocate from + * @order: allocation order size + * + * Allocates pages of a given order from the given node. This is safe to + * call from any context (from atomic, NMI, and also reentrant + * allocator -> tracepoint -> alloc_pages_nolock_noprof). + * Allocation is best effort and to be expected to fail easily so nobody should + * rely on the success. Failures are not reported via warn_alloc(). + * See always fail conditions below. + * + * Return: allocated page or NULL on failure. NULL does not mean EBUSY or EAGAIN. + * It means ENOMEM. There is no reason to call it again and expect !NULL. + */ +struct page *alloc_pages_nolock_noprof(gfp_t gfp_flags, int nid, unsigned int order) +{ + struct page *page; + + page = alloc_frozen_pages_nolock_noprof(gfp_flags, nid, order); + if (page) + set_page_refcounted(page); + return page; +} +EXPORT_SYMBOL_GPL(alloc_pages_nolock_noprof); -- 2.47.3 ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH slab v5 3/6] mm: Introduce alloc_frozen_pages_nolock() 2025-09-09 1:00 ` [PATCH slab v5 3/6] mm: Introduce alloc_frozen_pages_nolock() Alexei Starovoitov @ 2025-09-12 17:15 ` Shakeel Butt 2025-09-15 5:17 ` Harry Yoo 1 sibling, 0 replies; 45+ messages in thread From: Shakeel Butt @ 2025-09-12 17:15 UTC (permalink / raw) To: Alexei Starovoitov Cc: bpf, linux-mm, vbabka, harry.yoo, mhocko, bigeasy, andrii, memxor, akpm, peterz, rostedt, hannes On Mon, Sep 08, 2025 at 06:00:04PM -0700, Alexei Starovoitov wrote: > From: Alexei Starovoitov <ast@kernel.org> > > Split alloc_pages_nolock() and introduce alloc_frozen_pages_nolock() > to be used by alloc_slab_page(). > > Reviewed-by: Vlastimil Babka <vbabka@suse.cz> > Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH slab v5 3/6] mm: Introduce alloc_frozen_pages_nolock() 2025-09-09 1:00 ` [PATCH slab v5 3/6] mm: Introduce alloc_frozen_pages_nolock() Alexei Starovoitov 2025-09-12 17:15 ` Shakeel Butt @ 2025-09-15 5:17 ` Harry Yoo 1 sibling, 0 replies; 45+ messages in thread From: Harry Yoo @ 2025-09-15 5:17 UTC (permalink / raw) To: Alexei Starovoitov Cc: bpf, linux-mm, vbabka, shakeel.butt, mhocko, bigeasy, andrii, memxor, akpm, peterz, rostedt, hannes On Mon, Sep 08, 2025 at 06:00:04PM -0700, Alexei Starovoitov wrote: > From: Alexei Starovoitov <ast@kernel.org> > > Split alloc_pages_nolock() and introduce alloc_frozen_pages_nolock() > to be used by alloc_slab_page(). > > Reviewed-by: Vlastimil Babka <vbabka@suse.cz> > Signed-off-by: Alexei Starovoitov <ast@kernel.org> > --- Looks good to me, Reviewed-by: Harry Yoo <harry.yoo@oracle.com> -- Cheers, Harry / Hyeonggon ^ permalink raw reply [flat|nested] 45+ messages in thread
* [PATCH slab v5 4/6] slab: Make slub local_(try)lock more precise for LOCKDEP 2025-09-09 1:00 [PATCH slab v5 0/6] slab: Re-entrant kmalloc_nolock() Alexei Starovoitov ` (2 preceding siblings ...) 2025-09-09 1:00 ` [PATCH slab v5 3/6] mm: Introduce alloc_frozen_pages_nolock() Alexei Starovoitov @ 2025-09-09 1:00 ` Alexei Starovoitov 2025-09-09 1:00 ` [PATCH slab v5 5/6] slab: Reuse first bit for OBJEXTS_ALLOC_FAIL Alexei Starovoitov ` (2 subsequent siblings) 6 siblings, 0 replies; 45+ messages in thread From: Alexei Starovoitov @ 2025-09-09 1:00 UTC (permalink / raw) To: bpf, linux-mm Cc: vbabka, harry.yoo, shakeel.butt, mhocko, bigeasy, andrii, memxor, akpm, peterz, rostedt, hannes From: Alexei Starovoitov <ast@kernel.org> kmalloc_nolock() can be called from any context the ___slab_alloc() can acquire local_trylock_t (which is rt_spin_lock in PREEMPT_RT) and attempt to acquire a different local_trylock_t while in the same task context. The calling sequence might look like: kmalloc() -> tracepoint -> bpf -> kmalloc_nolock() or more precisely: __lock_acquire+0x12ad/0x2590 lock_acquire+0x133/0x2d0 rt_spin_lock+0x6f/0x250 ___slab_alloc+0xb7/0xec0 kmalloc_nolock_noprof+0x15a/0x430 my_debug_callback+0x20e/0x390 [testmod] ___slab_alloc+0x256/0xec0 __kmalloc_cache_noprof+0xd6/0x3b0 Make LOCKDEP understand that local_trylock_t-s protect different kmem_caches. In order to do that add lock_class_key for each kmem_cache and use that key in local_trylock_t. This stack trace is possible on both PREEMPT_RT and !PREEMPT_RT, but teach lockdep about it only for PREEMPT_RT, since in !PREEMPT_RT the ___slab_alloc() code is using local_trylock_irqsave() when lockdep is on. Note, this patch applies this logic to local_lock_t while the next one converts it to local_trylock_t. Both are mapped to rt_spin_lock in PREEMPT_RT. Signed-off-by: Alexei Starovoitov <ast@kernel.org> --- mm/slab.h | 1 + mm/slub.c | 20 ++++++++++++++++++++ 2 files changed, 21 insertions(+) diff --git a/mm/slab.h b/mm/slab.h index f1866f2d9b21..5a6f824a282d 100644 --- a/mm/slab.h +++ b/mm/slab.h @@ -234,6 +234,7 @@ struct kmem_cache_order_objects { struct kmem_cache { #ifndef CONFIG_SLUB_TINY struct kmem_cache_cpu __percpu *cpu_slab; + struct lock_class_key lock_key; #endif struct slub_percpu_sheaves __percpu *cpu_sheaves; /* Used for retrieving partial slabs, etc. */ diff --git a/mm/slub.c b/mm/slub.c index bf399ba65a4b..212161dc0f29 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -3591,12 +3591,29 @@ static inline void note_cmpxchg_failure(const char *n, static void init_kmem_cache_cpus(struct kmem_cache *s) { +#ifdef CONFIG_PREEMPT_RT + /* + * Register lockdep key for non-boot kmem caches to avoid + * WARN_ON_ONCE(static_obj(key))) in lockdep_register_key() + */ + bool finegrain_lockdep = !init_section_contains(s, 1); +#else + /* + * Don't bother with different lockdep classes for each + * kmem_cache, since we only use local_trylock_irqsave(). 
+ */ + bool finegrain_lockdep = false; +#endif int cpu; struct kmem_cache_cpu *c; + if (finegrain_lockdep) + lockdep_register_key(&s->lock_key); for_each_possible_cpu(cpu) { c = per_cpu_ptr(s->cpu_slab, cpu); local_lock_init(&c->lock); + if (finegrain_lockdep) + lockdep_set_class(&c->lock, &s->lock_key); c->tid = init_tid(cpu); } } @@ -7142,6 +7159,9 @@ void __kmem_cache_release(struct kmem_cache *s) if (s->cpu_sheaves) pcs_destroy(s); #ifndef CONFIG_SLUB_TINY +#ifdef CONFIG_PREEMPT_RT + lockdep_unregister_key(&s->lock_key); +#endif free_percpu(s->cpu_slab); #endif free_kmem_cache_nodes(s); -- 2.47.3 ^ permalink raw reply [flat|nested] 45+ messages in thread
* [PATCH slab v5 5/6] slab: Reuse first bit for OBJEXTS_ALLOC_FAIL 2025-09-09 1:00 [PATCH slab v5 0/6] slab: Re-entrant kmalloc_nolock() Alexei Starovoitov ` (3 preceding siblings ...) 2025-09-09 1:00 ` [PATCH slab v5 4/6] slab: Make slub local_(try)lock more precise for LOCKDEP Alexei Starovoitov @ 2025-09-09 1:00 ` Alexei Starovoitov 2025-09-12 19:27 ` Shakeel Butt ` (2 more replies) 2025-09-09 1:00 ` [PATCH slab v5 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock() Alexei Starovoitov 2025-09-12 9:33 ` [PATCH slab v5 0/6] slab: Re-entrant kmalloc_nolock() Vlastimil Babka 6 siblings, 3 replies; 45+ messages in thread From: Alexei Starovoitov @ 2025-09-09 1:00 UTC (permalink / raw) To: bpf, linux-mm Cc: vbabka, harry.yoo, shakeel.butt, mhocko, bigeasy, andrii, memxor, akpm, peterz, rostedt, hannes From: Alexei Starovoitov <ast@kernel.org> Since the combination of valid upper bits in slab->obj_exts with OBJEXTS_ALLOC_FAIL bit can never happen, use OBJEXTS_ALLOC_FAIL == (1ull << 0) as a magic sentinel instead of (1ull << 2) to free up bit 2. Signed-off-by: Alexei Starovoitov <ast@kernel.org> --- include/linux/memcontrol.h | 10 ++++++++-- mm/slub.c | 2 +- 2 files changed, 9 insertions(+), 3 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 785173aa0739..d254c0b96d0d 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -341,17 +341,23 @@ enum page_memcg_data_flags { __NR_MEMCG_DATA_FLAGS = (1UL << 2), }; +#define __OBJEXTS_ALLOC_FAIL MEMCG_DATA_OBJEXTS #define __FIRST_OBJEXT_FLAG __NR_MEMCG_DATA_FLAGS #else /* CONFIG_MEMCG */ +#define __OBJEXTS_ALLOC_FAIL (1UL << 0) #define __FIRST_OBJEXT_FLAG (1UL << 0) #endif /* CONFIG_MEMCG */ enum objext_flags { - /* slabobj_ext vector failed to allocate */ - OBJEXTS_ALLOC_FAIL = __FIRST_OBJEXT_FLAG, + /* + * Use bit 0 with zero other bits to signal that slabobj_ext vector + * failed to allocate. The same bit 0 with valid upper bits means + * MEMCG_DATA_OBJEXTS. + */ + OBJEXTS_ALLOC_FAIL = __OBJEXTS_ALLOC_FAIL, /* the next bit after the last actual flag */ __NR_OBJEXTS_FLAGS = (__FIRST_OBJEXT_FLAG << 1), }; diff --git a/mm/slub.c b/mm/slub.c index 212161dc0f29..61841ba72120 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -2051,7 +2051,7 @@ static inline void handle_failed_objexts_alloc(unsigned long obj_exts, * objects with no tag reference. Mark all references in this * vector as empty to avoid warnings later on. */ - if (obj_exts & OBJEXTS_ALLOC_FAIL) { + if (obj_exts == OBJEXTS_ALLOC_FAIL) { unsigned int i; for (i = 0; i < objects; i++) -- 2.47.3 ^ permalink raw reply [flat|nested] 45+ messages in thread
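A sketch of why the aliasing is safe and how a reader of slab->obj_exts
distinguishes the two cases after this patch, mirroring the
handle_failed_objexts_alloc() hunk above and assuming the existing
OBJEXTS_FLAGS_MASK definition:

	unsigned long obj_exts = READ_ONCE(slab->obj_exts);

	if (obj_exts == OBJEXTS_ALLOC_FAIL) {
		/*
		 * Exact value 1: the slabobj_ext vector failed to allocate.
		 * A real vector always contributes non-zero upper bits,
		 * since the vector pointer itself is never NULL, so this
		 * value cannot collide with "pointer | MEMCG_DATA_OBJEXTS".
		 */
	} else if (obj_exts) {
		/* a valid vector pointer with low flag bits set */
		struct slabobj_ext *vec =
			(struct slabobj_ext *)(obj_exts & ~OBJEXTS_FLAGS_MASK);
		/* use vec... */
	}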
* Re: [PATCH slab v5 5/6] slab: Reuse first bit for OBJEXTS_ALLOC_FAIL 2025-09-09 1:00 ` [PATCH slab v5 5/6] slab: Reuse first bit for OBJEXTS_ALLOC_FAIL Alexei Starovoitov @ 2025-09-12 19:27 ` Shakeel Butt 2025-09-12 21:03 ` Suren Baghdasaryan 2025-09-13 1:16 ` Shakeel Butt 2025-09-15 6:14 ` Harry Yoo 2 siblings, 1 reply; 45+ messages in thread From: Shakeel Butt @ 2025-09-12 19:27 UTC (permalink / raw) To: Alexei Starovoitov Cc: bpf, linux-mm, vbabka, harry.yoo, mhocko, bigeasy, andrii, memxor, akpm, peterz, rostedt, hannes, surenb, roman.gushchin +Suren, Roman On Mon, Sep 08, 2025 at 06:00:06PM -0700, Alexei Starovoitov wrote: > From: Alexei Starovoitov <ast@kernel.org> > > Since the combination of valid upper bits in slab->obj_exts with > OBJEXTS_ALLOC_FAIL bit can never happen, > use OBJEXTS_ALLOC_FAIL == (1ull << 0) as a magic sentinel > instead of (1ull << 2) to free up bit 2. > > Signed-off-by: Alexei Starovoitov <ast@kernel.org> Are we low on bits that we need to do this or is this good to have optimization but not required? I do have some questions on the state of slab->obj_exts even before this patch for Suren, Roman, Vlastimil and others: Suppose we newly allocate struct slab for a SLAB_ACCOUNT cache and tried to allocate obj_exts for it which failed. The kernel will set OBJEXTS_ALLOC_FAIL in slab->obj_exts (Note that this can only be set for new slab allocation and only for SLAB_ACCOUNT caches i.e. vec allocation failure for memory profiling does not set this flag). Now in the post alloc hook, either through memory profiling or through memcg charging, we will try again to allocate the vec and before that we will call slab_obj_exts() on the slab which has: unsigned long obj_exts = READ_ONCE(slab->obj_exts); VM_BUG_ON_PAGE(obj_exts && !(obj_exts & MEMCG_DATA_OBJEXTS), slab_page(slab)); It seems like the above VM_BUG_ON_PAGE() will trigger because obj_exts will have OBJEXTS_ALLOC_FAIL but it should not, right? Or am I missing something? After the following patch we will aliasing be MEMCG_DATA_OBJEXTS and OBJEXTS_ALLOC_FAIL and will avoid this trigger though which also seems unintended. Next question: OBJEXTS_ALLOC_FAIL is for memory profiling and we never set it when memcg is disabled and memory profiling is enabled or even with both memcg and memory profiling are enabled but cache does not have SLAB_ACCOUNT. This seems unintentional as well, right? Also I think slab_obj_exts() needs to handle OBJEXTS_ALLOC_FAIL explicitly. > --- > include/linux/memcontrol.h | 10 ++++++++-- > mm/slub.c | 2 +- > 2 files changed, 9 insertions(+), 3 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 785173aa0739..d254c0b96d0d 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -341,17 +341,23 @@ enum page_memcg_data_flags { > __NR_MEMCG_DATA_FLAGS = (1UL << 2), > }; > > +#define __OBJEXTS_ALLOC_FAIL MEMCG_DATA_OBJEXTS > #define __FIRST_OBJEXT_FLAG __NR_MEMCG_DATA_FLAGS > > #else /* CONFIG_MEMCG */ > > +#define __OBJEXTS_ALLOC_FAIL (1UL << 0) > #define __FIRST_OBJEXT_FLAG (1UL << 0) > > #endif /* CONFIG_MEMCG */ > > enum objext_flags { > - /* slabobj_ext vector failed to allocate */ > - OBJEXTS_ALLOC_FAIL = __FIRST_OBJEXT_FLAG, > + /* > + * Use bit 0 with zero other bits to signal that slabobj_ext vector > + * failed to allocate. The same bit 0 with valid upper bits means > + * MEMCG_DATA_OBJEXTS. 
> + */ > + OBJEXTS_ALLOC_FAIL = __OBJEXTS_ALLOC_FAIL, > /* the next bit after the last actual flag */ > __NR_OBJEXTS_FLAGS = (__FIRST_OBJEXT_FLAG << 1), > }; > diff --git a/mm/slub.c b/mm/slub.c > index 212161dc0f29..61841ba72120 100644 > --- a/mm/slub.c > +++ b/mm/slub.c > @@ -2051,7 +2051,7 @@ static inline void handle_failed_objexts_alloc(unsigned long obj_exts, > * objects with no tag reference. Mark all references in this > * vector as empty to avoid warnings later on. > */ > - if (obj_exts & OBJEXTS_ALLOC_FAIL) { > + if (obj_exts == OBJEXTS_ALLOC_FAIL) { > unsigned int i; > > for (i = 0; i < objects; i++) > -- > 2.47.3 > ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH slab v5 5/6] slab: Reuse first bit for OBJEXTS_ALLOC_FAIL 2025-09-12 19:27 ` Shakeel Butt @ 2025-09-12 21:03 ` Suren Baghdasaryan 2025-09-12 21:11 ` Shakeel Butt 2025-09-12 21:24 ` Alexei Starovoitov 0 siblings, 2 replies; 45+ messages in thread From: Suren Baghdasaryan @ 2025-09-12 21:03 UTC (permalink / raw) To: Shakeel Butt Cc: Alexei Starovoitov, bpf, linux-mm, vbabka, harry.yoo, mhocko, bigeasy, andrii, memxor, akpm, peterz, rostedt, hannes, roman.gushchin On Fri, Sep 12, 2025 at 12:27 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > +Suren, Roman > > On Mon, Sep 08, 2025 at 06:00:06PM -0700, Alexei Starovoitov wrote: > > From: Alexei Starovoitov <ast@kernel.org> > > > > Since the combination of valid upper bits in slab->obj_exts with > > OBJEXTS_ALLOC_FAIL bit can never happen, > > use OBJEXTS_ALLOC_FAIL == (1ull << 0) as a magic sentinel > > instead of (1ull << 2) to free up bit 2. > > > > Signed-off-by: Alexei Starovoitov <ast@kernel.org> > > Are we low on bits that we need to do this or is this good to have > optimization but not required? That's a good question. After this change MEMCG_DATA_OBJEXTS and OBJEXTS_ALLOC_FAIL will have the same value and they are used with the same field (page->memcg_data and slab->obj_exts are aliases). Even if page_memcg_data_flags can never be used for slab pages I think overlapping these bits is not a good idea and creates additional risks. Unless there is a good reason to do this I would advise against it. > > I do have some questions on the state of slab->obj_exts even before this > patch for Suren, Roman, Vlastimil and others: > > Suppose we newly allocate struct slab for a SLAB_ACCOUNT cache and tried > to allocate obj_exts for it which failed. The kernel will set > OBJEXTS_ALLOC_FAIL in slab->obj_exts (Note that this can only be set for > new slab allocation and only for SLAB_ACCOUNT caches i.e. vec allocation > failure for memory profiling does not set this flag). > > Now in the post alloc hook, either through memory profiling or through > memcg charging, we will try again to allocate the vec and before that we > will call slab_obj_exts() on the slab which has: > > unsigned long obj_exts = READ_ONCE(slab->obj_exts); > > VM_BUG_ON_PAGE(obj_exts && !(obj_exts & MEMCG_DATA_OBJEXTS), slab_page(slab)); > > It seems like the above VM_BUG_ON_PAGE() will trigger because obj_exts > will have OBJEXTS_ALLOC_FAIL but it should not, right? Or am I missing > something? After the following patch we will aliasing be MEMCG_DATA_OBJEXTS > and OBJEXTS_ALLOC_FAIL and will avoid this trigger though which also > seems unintended. You are correct. Current VM_BUG_ON_PAGE() will trigger if OBJEXTS_ALLOC_FAIL is set and that is wrong. When alloc_slab_obj_exts() fails to allocate the vector it does mark_failed_objexts_alloc() and exits without setting MEMCG_DATA_OBJEXTS (which it would have done if the allocation succeeded). So, any further calls to slab_obj_exts() will generate a warning because MEMCG_DATA_OBJEXTS is not set. I believe the proper fix would not be to set MEMCG_DATA_OBJEXTS along with OBJEXTS_ALLOC_FAIL because the pointer does not point to a valid vector but to modify the warning to: VM_BUG_ON_PAGE(obj_exts && !(obj_exts & (MEMCG_DATA_OBJEXTS | OBJEXTS_ALLOC_FAIL)), slab_page(slab)); IOW, we expect the obj_ext to be either NULL or have either MEMCG_DATA_OBJEXTS or OBJEXTS_ALLOC_FAIL set. 
> > Next question: OBJEXTS_ALLOC_FAIL is for memory profiling and we never > set it when memcg is disabled and memory profiling is enabled or even > with both memcg and memory profiling are enabled but cache does not have > SLAB_ACCOUNT. This seems unintentional as well, right? I'm not sure why you think OBJEXTS_ALLOC_FAIL is not set by memory profiling (independent of CONFIG_MEMCG state). __alloc_tagging_slab_alloc_hook()->prepare_slab_obj_exts_hook()->alloc_slab_obj_exts() will attempt to allocate the vector and set OBJEXTS_ALLOC_FAIL if that fails. > > Also I think slab_obj_exts() needs to handle OBJEXTS_ALLOC_FAIL explicitly. Agree, so is my proposal to update the warning sounds right to you? > > > > --- > > include/linux/memcontrol.h | 10 ++++++++-- > > mm/slub.c | 2 +- > > 2 files changed, 9 insertions(+), 3 deletions(-) > > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > > index 785173aa0739..d254c0b96d0d 100644 > > --- a/include/linux/memcontrol.h > > +++ b/include/linux/memcontrol.h > > @@ -341,17 +341,23 @@ enum page_memcg_data_flags { > > __NR_MEMCG_DATA_FLAGS = (1UL << 2), > > }; > > > > +#define __OBJEXTS_ALLOC_FAIL MEMCG_DATA_OBJEXTS > > #define __FIRST_OBJEXT_FLAG __NR_MEMCG_DATA_FLAGS > > > > #else /* CONFIG_MEMCG */ > > > > +#define __OBJEXTS_ALLOC_FAIL (1UL << 0) > > #define __FIRST_OBJEXT_FLAG (1UL << 0) > > > > #endif /* CONFIG_MEMCG */ > > > > enum objext_flags { > > - /* slabobj_ext vector failed to allocate */ > > - OBJEXTS_ALLOC_FAIL = __FIRST_OBJEXT_FLAG, > > + /* > > + * Use bit 0 with zero other bits to signal that slabobj_ext vector > > + * failed to allocate. The same bit 0 with valid upper bits means > > + * MEMCG_DATA_OBJEXTS. > > + */ > > + OBJEXTS_ALLOC_FAIL = __OBJEXTS_ALLOC_FAIL, > > /* the next bit after the last actual flag */ > > __NR_OBJEXTS_FLAGS = (__FIRST_OBJEXT_FLAG << 1), > > }; > > diff --git a/mm/slub.c b/mm/slub.c > > index 212161dc0f29..61841ba72120 100644 > > --- a/mm/slub.c > > +++ b/mm/slub.c > > @@ -2051,7 +2051,7 @@ static inline void handle_failed_objexts_alloc(unsigned long obj_exts, > > * objects with no tag reference. Mark all references in this > > * vector as empty to avoid warnings later on. > > */ > > - if (obj_exts & OBJEXTS_ALLOC_FAIL) { > > + if (obj_exts == OBJEXTS_ALLOC_FAIL) { > > unsigned int i; > > > > for (i = 0; i < objects; i++) > > -- > > 2.47.3 > > ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH slab v5 5/6] slab: Reuse first bit for OBJEXTS_ALLOC_FAIL 2025-09-12 21:03 ` Suren Baghdasaryan @ 2025-09-12 21:11 ` Shakeel Butt 2025-09-12 21:26 ` Suren Baghdasaryan 2025-09-12 21:24 ` Alexei Starovoitov 1 sibling, 1 reply; 45+ messages in thread From: Shakeel Butt @ 2025-09-12 21:11 UTC (permalink / raw) To: Suren Baghdasaryan Cc: Alexei Starovoitov, bpf, linux-mm, vbabka, harry.yoo, mhocko, bigeasy, andrii, memxor, akpm, peterz, rostedt, hannes, roman.gushchin On Fri, Sep 12, 2025 at 02:03:03PM -0700, Suren Baghdasaryan wrote: [...] > > I do have some questions on the state of slab->obj_exts even before this > > patch for Suren, Roman, Vlastimil and others: > > > > Suppose we newly allocate struct slab for a SLAB_ACCOUNT cache and tried > > to allocate obj_exts for it which failed. The kernel will set > > OBJEXTS_ALLOC_FAIL in slab->obj_exts (Note that this can only be set for > > new slab allocation and only for SLAB_ACCOUNT caches i.e. vec allocation > > failure for memory profiling does not set this flag). > > > > Now in the post alloc hook, either through memory profiling or through > > memcg charging, we will try again to allocate the vec and before that we > > will call slab_obj_exts() on the slab which has: > > > > unsigned long obj_exts = READ_ONCE(slab->obj_exts); > > > > VM_BUG_ON_PAGE(obj_exts && !(obj_exts & MEMCG_DATA_OBJEXTS), slab_page(slab)); > > > > It seems like the above VM_BUG_ON_PAGE() will trigger because obj_exts > > will have OBJEXTS_ALLOC_FAIL but it should not, right? Or am I missing > > something? After the following patch we will aliasing be MEMCG_DATA_OBJEXTS > > and OBJEXTS_ALLOC_FAIL and will avoid this trigger though which also > > seems unintended. > > You are correct. Current VM_BUG_ON_PAGE() will trigger if > OBJEXTS_ALLOC_FAIL is set and that is wrong. When > alloc_slab_obj_exts() fails to allocate the vector it does > mark_failed_objexts_alloc() and exits without setting > MEMCG_DATA_OBJEXTS (which it would have done if the allocation > succeeded). So, any further calls to slab_obj_exts() will generate a > warning because MEMCG_DATA_OBJEXTS is not set. I believe the proper > fix would not be to set MEMCG_DATA_OBJEXTS along with > OBJEXTS_ALLOC_FAIL because the pointer does not point to a valid > vector but to modify the warning to: > > VM_BUG_ON_PAGE(obj_exts && !(obj_exts & (MEMCG_DATA_OBJEXTS | > OBJEXTS_ALLOC_FAIL)), slab_page(slab)); > > IOW, we expect the obj_ext to be either NULL or have either > MEMCG_DATA_OBJEXTS or OBJEXTS_ALLOC_FAIL set. > > > > Next question: OBJEXTS_ALLOC_FAIL is for memory profiling and we never > > set it when memcg is disabled and memory profiling is enabled or even > > with both memcg and memory profiling are enabled but cache does not have > > SLAB_ACCOUNT. This seems unintentional as well, right? > > I'm not sure why you think OBJEXTS_ALLOC_FAIL is not set by memory > profiling (independent of CONFIG_MEMCG state). > __alloc_tagging_slab_alloc_hook()->prepare_slab_obj_exts_hook()->alloc_slab_obj_exts() > will attempt to allocate the vector and set OBJEXTS_ALLOC_FAIL if that > fails. > prepare_slab_obj_exts_hook() calls alloc_slab_obj_exts() with new_slab as false and alloc_slab_obj_exts() will only call mark_failed_objexts_alloc() if new_slab is true. > > > > Also I think slab_obj_exts() needs to handle OBJEXTS_ALLOC_FAIL explicitly. > > Agree, so is my proposal to update the warning sounds right to you? Yes that seems right to me. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH slab v5 5/6] slab: Reuse first bit for OBJEXTS_ALLOC_FAIL 2025-09-12 21:11 ` Shakeel Butt @ 2025-09-12 21:26 ` Suren Baghdasaryan 0 siblings, 0 replies; 45+ messages in thread From: Suren Baghdasaryan @ 2025-09-12 21:26 UTC (permalink / raw) To: Shakeel Butt Cc: Alexei Starovoitov, bpf, linux-mm, vbabka, harry.yoo, mhocko, bigeasy, andrii, memxor, akpm, peterz, rostedt, hannes, roman.gushchin On Fri, Sep 12, 2025 at 2:12 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > On Fri, Sep 12, 2025 at 02:03:03PM -0700, Suren Baghdasaryan wrote: > [...] > > > I do have some questions on the state of slab->obj_exts even before this > > > patch for Suren, Roman, Vlastimil and others: > > > > > > Suppose we newly allocate struct slab for a SLAB_ACCOUNT cache and tried > > > to allocate obj_exts for it which failed. The kernel will set > > > OBJEXTS_ALLOC_FAIL in slab->obj_exts (Note that this can only be set for > > > new slab allocation and only for SLAB_ACCOUNT caches i.e. vec allocation > > > failure for memory profiling does not set this flag). > > > > > > Now in the post alloc hook, either through memory profiling or through > > > memcg charging, we will try again to allocate the vec and before that we > > > will call slab_obj_exts() on the slab which has: > > > > > > unsigned long obj_exts = READ_ONCE(slab->obj_exts); > > > > > > VM_BUG_ON_PAGE(obj_exts && !(obj_exts & MEMCG_DATA_OBJEXTS), slab_page(slab)); > > > > > > It seems like the above VM_BUG_ON_PAGE() will trigger because obj_exts > > > will have OBJEXTS_ALLOC_FAIL but it should not, right? Or am I missing > > > something? After the following patch we will aliasing be MEMCG_DATA_OBJEXTS > > > and OBJEXTS_ALLOC_FAIL and will avoid this trigger though which also > > > seems unintended. > > > > You are correct. Current VM_BUG_ON_PAGE() will trigger if > > OBJEXTS_ALLOC_FAIL is set and that is wrong. When > > alloc_slab_obj_exts() fails to allocate the vector it does > > mark_failed_objexts_alloc() and exits without setting > > MEMCG_DATA_OBJEXTS (which it would have done if the allocation > > succeeded). So, any further calls to slab_obj_exts() will generate a > > warning because MEMCG_DATA_OBJEXTS is not set. I believe the proper > > fix would not be to set MEMCG_DATA_OBJEXTS along with > > OBJEXTS_ALLOC_FAIL because the pointer does not point to a valid > > vector but to modify the warning to: > > > > VM_BUG_ON_PAGE(obj_exts && !(obj_exts & (MEMCG_DATA_OBJEXTS | > > OBJEXTS_ALLOC_FAIL)), slab_page(slab)); > > > > IOW, we expect the obj_ext to be either NULL or have either > > MEMCG_DATA_OBJEXTS or OBJEXTS_ALLOC_FAIL set. > > > > > > Next question: OBJEXTS_ALLOC_FAIL is for memory profiling and we never > > > set it when memcg is disabled and memory profiling is enabled or even > > > with both memcg and memory profiling are enabled but cache does not have > > > SLAB_ACCOUNT. This seems unintentional as well, right? > > > > I'm not sure why you think OBJEXTS_ALLOC_FAIL is not set by memory > > profiling (independent of CONFIG_MEMCG state). > > __alloc_tagging_slab_alloc_hook()->prepare_slab_obj_exts_hook()->alloc_slab_obj_exts() > > will attempt to allocate the vector and set OBJEXTS_ALLOC_FAIL if that > > fails. > > > > prepare_slab_obj_exts_hook() calls alloc_slab_obj_exts() with new_slab > as false and alloc_slab_obj_exts() will only call > mark_failed_objexts_alloc() if new_slab is true. Sorry, I missed that detail. I think mark_failed_objexts_alloc() should be called there unconditionally if vector allocation fails. 
> > > > > > > Also I think slab_obj_exts() needs to handle OBJEXTS_ALLOC_FAIL explicitly. > > > > Agree, so is my proposal to update the warning sounds right to you? > > Yes that seems right to me. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH slab v5 5/6] slab: Reuse first bit for OBJEXTS_ALLOC_FAIL 2025-09-12 21:03 ` Suren Baghdasaryan 2025-09-12 21:11 ` Shakeel Butt @ 2025-09-12 21:24 ` Alexei Starovoitov 2025-09-12 21:29 ` Shakeel Butt 1 sibling, 1 reply; 45+ messages in thread From: Alexei Starovoitov @ 2025-09-12 21:24 UTC (permalink / raw) To: Suren Baghdasaryan Cc: Shakeel Butt, bpf, linux-mm, Vlastimil Babka, Harry Yoo, Michal Hocko, Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner, Roman Gushchin On Fri, Sep 12, 2025 at 2:03 PM Suren Baghdasaryan <surenb@google.com> wrote: > > On Fri, Sep 12, 2025 at 12:27 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > +Suren, Roman > > > > On Mon, Sep 08, 2025 at 06:00:06PM -0700, Alexei Starovoitov wrote: > > > From: Alexei Starovoitov <ast@kernel.org> > > > > > > Since the combination of valid upper bits in slab->obj_exts with > > > OBJEXTS_ALLOC_FAIL bit can never happen, > > > use OBJEXTS_ALLOC_FAIL == (1ull << 0) as a magic sentinel > > > instead of (1ull << 2) to free up bit 2. > > > > > > Signed-off-by: Alexei Starovoitov <ast@kernel.org> > > > > Are we low on bits that we need to do this or is this good to have > > optimization but not required? > > That's a good question. After this change MEMCG_DATA_OBJEXTS and > OBJEXTS_ALLOC_FAIL will have the same value and they are used with the > same field (page->memcg_data and slab->obj_exts are aliases). Even if > page_memcg_data_flags can never be used for slab pages I think > overlapping these bits is not a good idea and creates additional > risks. Unless there is a good reason to do this I would advise against > it. Completely disagree. You both missed the long discussion during v4. The other alternative was to increase alignment and waste memory. Saving the bit is obviously cleaner. The next patch is using the saved bit. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH slab v5 5/6] slab: Reuse first bit for OBJEXTS_ALLOC_FAIL 2025-09-12 21:24 ` Alexei Starovoitov @ 2025-09-12 21:29 ` Shakeel Butt 2025-09-12 21:31 ` Alexei Starovoitov 0 siblings, 1 reply; 45+ messages in thread From: Shakeel Butt @ 2025-09-12 21:29 UTC (permalink / raw) To: Alexei Starovoitov Cc: Suren Baghdasaryan, bpf, linux-mm, Vlastimil Babka, Harry Yoo, Michal Hocko, Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner, Roman Gushchin On Fri, Sep 12, 2025 at 02:24:26PM -0700, Alexei Starovoitov wrote: > On Fri, Sep 12, 2025 at 2:03 PM Suren Baghdasaryan <surenb@google.com> wrote: > > > > On Fri, Sep 12, 2025 at 12:27 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > > > +Suren, Roman > > > > > > On Mon, Sep 08, 2025 at 06:00:06PM -0700, Alexei Starovoitov wrote: > > > > From: Alexei Starovoitov <ast@kernel.org> > > > > > > > > Since the combination of valid upper bits in slab->obj_exts with > > > > OBJEXTS_ALLOC_FAIL bit can never happen, > > > > use OBJEXTS_ALLOC_FAIL == (1ull << 0) as a magic sentinel > > > > instead of (1ull << 2) to free up bit 2. > > > > > > > > Signed-off-by: Alexei Starovoitov <ast@kernel.org> > > > > > > Are we low on bits that we need to do this or is this good to have > > > optimization but not required? > > > > That's a good question. After this change MEMCG_DATA_OBJEXTS and > > OBJEXTS_ALLOC_FAIL will have the same value and they are used with the > > same field (page->memcg_data and slab->obj_exts are aliases). Even if > > page_memcg_data_flags can never be used for slab pages I think > > overlapping these bits is not a good idea and creates additional > > risks. Unless there is a good reason to do this I would advise against > > it. > > Completely disagree. You both missed the long discussion > during v4. The other alternative was to increase alignment > and waste memory. Saving the bit is obviously cleaner. > The next patch is using the saved bit. I will check out that discussion and it would be good to summarize that in the commit message. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH slab v5 5/6] slab: Reuse first bit for OBJEXTS_ALLOC_FAIL 2025-09-12 21:29 ` Shakeel Butt @ 2025-09-12 21:31 ` Alexei Starovoitov 2025-09-12 21:44 ` Shakeel Butt 0 siblings, 1 reply; 45+ messages in thread From: Alexei Starovoitov @ 2025-09-12 21:31 UTC (permalink / raw) To: Shakeel Butt Cc: Suren Baghdasaryan, bpf, linux-mm, Vlastimil Babka, Harry Yoo, Michal Hocko, Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner, Roman Gushchin On Fri, Sep 12, 2025 at 2:29 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > On Fri, Sep 12, 2025 at 02:24:26PM -0700, Alexei Starovoitov wrote: > > On Fri, Sep 12, 2025 at 2:03 PM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > On Fri, Sep 12, 2025 at 12:27 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > > > > > +Suren, Roman > > > > > > > > On Mon, Sep 08, 2025 at 06:00:06PM -0700, Alexei Starovoitov wrote: > > > > > From: Alexei Starovoitov <ast@kernel.org> > > > > > > > > > > Since the combination of valid upper bits in slab->obj_exts with > > > > > OBJEXTS_ALLOC_FAIL bit can never happen, > > > > > use OBJEXTS_ALLOC_FAIL == (1ull << 0) as a magic sentinel > > > > > instead of (1ull << 2) to free up bit 2. > > > > > > > > > > Signed-off-by: Alexei Starovoitov <ast@kernel.org> > > > > > > > > Are we low on bits that we need to do this or is this good to have > > > > optimization but not required? > > > > > > That's a good question. After this change MEMCG_DATA_OBJEXTS and > > > OBJEXTS_ALLOC_FAIL will have the same value and they are used with the > > > same field (page->memcg_data and slab->obj_exts are aliases). Even if > > > page_memcg_data_flags can never be used for slab pages I think > > > overlapping these bits is not a good idea and creates additional > > > risks. Unless there is a good reason to do this I would advise against > > > it. > > > > Completely disagree. You both missed the long discussion > > during v4. The other alternative was to increase alignment > > and waste memory. Saving the bit is obviously cleaner. > > The next patch is using the saved bit. > > I will check out that discussion and it would be good to summarize that > in the commit message. Disagree. It's not a job of a small commit to summarize all options that were discussed on the list. That's what the cover letter is for and there are links to all previous threads. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH slab v5 5/6] slab: Reuse first bit for OBJEXTS_ALLOC_FAIL 2025-09-12 21:31 ` Alexei Starovoitov @ 2025-09-12 21:44 ` Shakeel Butt 2025-09-12 21:59 ` Alexei Starovoitov 0 siblings, 1 reply; 45+ messages in thread From: Shakeel Butt @ 2025-09-12 21:44 UTC (permalink / raw) To: Alexei Starovoitov Cc: Suren Baghdasaryan, bpf, linux-mm, Vlastimil Babka, Harry Yoo, Michal Hocko, Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner, Roman Gushchin On Fri, Sep 12, 2025 at 02:31:47PM -0700, Alexei Starovoitov wrote: > On Fri, Sep 12, 2025 at 2:29 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > On Fri, Sep 12, 2025 at 02:24:26PM -0700, Alexei Starovoitov wrote: > > > On Fri, Sep 12, 2025 at 2:03 PM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > > > On Fri, Sep 12, 2025 at 12:27 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > > > > > > > +Suren, Roman > > > > > > > > > > On Mon, Sep 08, 2025 at 06:00:06PM -0700, Alexei Starovoitov wrote: > > > > > > From: Alexei Starovoitov <ast@kernel.org> > > > > > > > > > > > > Since the combination of valid upper bits in slab->obj_exts with > > > > > > OBJEXTS_ALLOC_FAIL bit can never happen, > > > > > > use OBJEXTS_ALLOC_FAIL == (1ull << 0) as a magic sentinel > > > > > > instead of (1ull << 2) to free up bit 2. > > > > > > > > > > > > Signed-off-by: Alexei Starovoitov <ast@kernel.org> > > > > > > > > > > Are we low on bits that we need to do this or is this good to have > > > > > optimization but not required? > > > > > > > > That's a good question. After this change MEMCG_DATA_OBJEXTS and > > > > OBJEXTS_ALLOC_FAIL will have the same value and they are used with the > > > > same field (page->memcg_data and slab->obj_exts are aliases). Even if > > > > page_memcg_data_flags can never be used for slab pages I think > > > > overlapping these bits is not a good idea and creates additional > > > > risks. Unless there is a good reason to do this I would advise against > > > > it. > > > > > > Completely disagree. You both missed the long discussion > > > during v4. The other alternative was to increase alignment > > > and waste memory. Saving the bit is obviously cleaner. > > > The next patch is using the saved bit. > > > > I will check out that discussion and it would be good to summarize that > > in the commit message. > > Disgaree. It's not a job of a small commit to summarize all options > that were discussed on the list. That's what the cover letter is for > and there there are links to all previous threads. Currently the commit message is only telling what the patch is doing and is missing the 'why' part and I think adding the 'why' part would make it better for future readers i.e. less effort to find why this is being done this way. (Anyways this is just a nit from me) ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH slab v5 5/6] slab: Reuse first bit for OBJEXTS_ALLOC_FAIL 2025-09-12 21:44 ` Shakeel Butt @ 2025-09-12 21:59 ` Alexei Starovoitov 2025-09-13 0:01 ` Shakeel Butt 0 siblings, 1 reply; 45+ messages in thread From: Alexei Starovoitov @ 2025-09-12 21:59 UTC (permalink / raw) To: Shakeel Butt Cc: Suren Baghdasaryan, bpf, linux-mm, Vlastimil Babka, Harry Yoo, Michal Hocko, Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner, Roman Gushchin On Fri, Sep 12, 2025 at 2:44 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > On Fri, Sep 12, 2025 at 02:31:47PM -0700, Alexei Starovoitov wrote: > > On Fri, Sep 12, 2025 at 2:29 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > > > On Fri, Sep 12, 2025 at 02:24:26PM -0700, Alexei Starovoitov wrote: > > > > On Fri, Sep 12, 2025 at 2:03 PM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > > > > > On Fri, Sep 12, 2025 at 12:27 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > > > > > > > > > +Suren, Roman > > > > > > > > > > > > On Mon, Sep 08, 2025 at 06:00:06PM -0700, Alexei Starovoitov wrote: > > > > > > > From: Alexei Starovoitov <ast@kernel.org> > > > > > > > > > > > > > > Since the combination of valid upper bits in slab->obj_exts with > > > > > > > OBJEXTS_ALLOC_FAIL bit can never happen, > > > > > > > use OBJEXTS_ALLOC_FAIL == (1ull << 0) as a magic sentinel > > > > > > > instead of (1ull << 2) to free up bit 2. > > > > > > > > > > > > > > Signed-off-by: Alexei Starovoitov <ast@kernel.org> > > > > > > > > > > > > Are we low on bits that we need to do this or is this good to have > > > > > > optimization but not required? > > > > > > > > > > That's a good question. After this change MEMCG_DATA_OBJEXTS and > > > > > OBJEXTS_ALLOC_FAIL will have the same value and they are used with the > > > > > same field (page->memcg_data and slab->obj_exts are aliases). Even if > > > > > page_memcg_data_flags can never be used for slab pages I think > > > > > overlapping these bits is not a good idea and creates additional > > > > > risks. Unless there is a good reason to do this I would advise against > > > > > it. > > > > > > > > Completely disagree. You both missed the long discussion > > > > during v4. The other alternative was to increase alignment > > > > and waste memory. Saving the bit is obviously cleaner. > > > > The next patch is using the saved bit. > > > > > > I will check out that discussion and it would be good to summarize that > > > in the commit message. > > > > Disgaree. It's not a job of a small commit to summarize all options > > that were discussed on the list. That's what the cover letter is for > > and there there are links to all previous threads. > > Currently the commit message is only telling what the patch is doing and > is missing the 'why' part and I think adding the 'why' part would make it > better for future readers i.e. less effort to find why this is being > done this way. (Anyways this is just a nit from me) I think 'why' here is obvious. Free the bit to use it later. From time to time people add a sentence like "this bit will be used in the next patch", but I never do this and sometimes remove it from other people's commits, since "in the next patch" is plenty ambiguous and not helpful. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH slab v5 5/6] slab: Reuse first bit for OBJEXTS_ALLOC_FAIL 2025-09-12 21:59 ` Alexei Starovoitov @ 2025-09-13 0:01 ` Shakeel Butt 2025-09-13 0:07 ` Alexei Starovoitov 0 siblings, 1 reply; 45+ messages in thread From: Shakeel Butt @ 2025-09-13 0:01 UTC (permalink / raw) To: Alexei Starovoitov Cc: Suren Baghdasaryan, bpf, linux-mm, Vlastimil Babka, Harry Yoo, Michal Hocko, Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner, Roman Gushchin On Fri, Sep 12, 2025 at 02:59:08PM -0700, Alexei Starovoitov wrote: > On Fri, Sep 12, 2025 at 2:44 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > On Fri, Sep 12, 2025 at 02:31:47PM -0700, Alexei Starovoitov wrote: > > > On Fri, Sep 12, 2025 at 2:29 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > > > > > On Fri, Sep 12, 2025 at 02:24:26PM -0700, Alexei Starovoitov wrote: > > > > > On Fri, Sep 12, 2025 at 2:03 PM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > > > > > > > On Fri, Sep 12, 2025 at 12:27 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > > > > > > > > > > > +Suren, Roman > > > > > > > > > > > > > > On Mon, Sep 08, 2025 at 06:00:06PM -0700, Alexei Starovoitov wrote: > > > > > > > > From: Alexei Starovoitov <ast@kernel.org> > > > > > > > > > > > > > > > > Since the combination of valid upper bits in slab->obj_exts with > > > > > > > > OBJEXTS_ALLOC_FAIL bit can never happen, > > > > > > > > use OBJEXTS_ALLOC_FAIL == (1ull << 0) as a magic sentinel > > > > > > > > instead of (1ull << 2) to free up bit 2. > > > > > > > > > > > > > > > > Signed-off-by: Alexei Starovoitov <ast@kernel.org> > > > > > > > > > > > > > > Are we low on bits that we need to do this or is this good to have > > > > > > > optimization but not required? > > > > > > > > > > > > That's a good question. After this change MEMCG_DATA_OBJEXTS and > > > > > > OBJEXTS_ALLOC_FAIL will have the same value and they are used with the > > > > > > same field (page->memcg_data and slab->obj_exts are aliases). Even if > > > > > > page_memcg_data_flags can never be used for slab pages I think > > > > > > overlapping these bits is not a good idea and creates additional > > > > > > risks. Unless there is a good reason to do this I would advise against > > > > > > it. > > > > > > > > > > Completely disagree. You both missed the long discussion > > > > > during v4. The other alternative was to increase alignment > > > > > and waste memory. Saving the bit is obviously cleaner. > > > > > The next patch is using the saved bit. > > > > > > > > I will check out that discussion and it would be good to summarize that > > > > in the commit message. > > > > > > Disgaree. It's not a job of a small commit to summarize all options > > > that were discussed on the list. That's what the cover letter is for > > > and there there are links to all previous threads. > > > > Currently the commit message is only telling what the patch is doing and > > is missing the 'why' part and I think adding the 'why' part would make it > > better for future readers i.e. less effort to find why this is being > > done this way. (Anyways this is just a nit from me) > > I think 'why' here is obvious. Free the bit to use it later. > From time to time people add a sentence like > "this bit will be used in the next patch", > but I never do this and sometimes remove it from other people's > commits, since "in the next patch" is plenty ambiguous and not helpful. 
Yes, the part about the freed bit being used in later patch was clear. The part about if we really need it was not obvious and if I understand the discussion at [1] (relevant text below), it was not required but good to have. ``` > I was going to say "add a new flag to enum objext_flags", > but all lower 3 bits of slab->obj_exts pointer are already in use? oh... > > Maybe need a magic trick to add one more flag, > like always align the size with 16? > > In practice that should not lead to increase in memory consumption > anyway because most of the kmalloc-* sizes are already at least > 16 bytes aligned. Yes. That's an option, but I think we can do better. OBJEXTS_ALLOC_FAIL doesn't need to consume the bit. ``` Anyways no objection from me but Harry had a followup request [2]: ``` This will work, but it would be helpful to add a comment clarifying that when bit 0 is set with valid upper bits, it indicates MEMCG_DATA_OBJEXTS, but when the upper bits are all zero, it indicates OBJEXTS_ALLOC_FAIL. When someone looks at the code without checking history it might not be obvious at first glance. ``` I think the above requested comment would be really useful. Suren is fixing the condition of VM_BUG_ON_PAGE() in slab_obj_exts(). With this patch, I think, that condition will need to be changed again. [1] https://lore.kernel.org/all/CAADnVQLrTJ7hu0Au-XzBu9=GUKHeobnvULsjZtYO3JHHd75MTA@mail.gmail.com/ [2] https://lore.kernel.org/all/aJtZrgcylnWgfR9r@hyeyoo/ ^ permalink raw reply [flat|nested] 45+ messages in thread
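The comment Harry asked for boils down to a disambiguation rule on the upper (pointer) bits. The helpers below are illustration only, with invented names, an assumed three low flag bits, and CONFIG_MEMCG=y (where the vector pointer is stored with bit 0 set); the series itself adds a comment, not new code.

```
/* Illustration only; not part of the series. */
#define OBJEXTS_SENTINEL	(1UL << 0)	/* MEMCG_DATA_OBJEXTS == OBJEXTS_ALLOC_FAIL */
#define OBJEXTS_FLAG_BITS	(0x7UL)		/* assumed: three low bits reserved for flags */

static inline bool obj_exts_alloc_failed(unsigned long obj_exts)
{
	/* bit 0 set with all pointer bits zero: the vector allocation failed */
	return obj_exts == OBJEXTS_SENTINEL;
}

static inline bool obj_exts_has_vector(unsigned long obj_exts)
{
	/* bit 0 set together with real pointer bits: a slabobj_ext vector */
	return (obj_exts & OBJEXTS_SENTINEL) && (obj_exts & ~OBJEXTS_FLAG_BITS);
}
```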
* Re: [PATCH slab v5 5/6] slab: Reuse first bit for OBJEXTS_ALLOC_FAIL 2025-09-13 0:01 ` Shakeel Butt @ 2025-09-13 0:07 ` Alexei Starovoitov 2025-09-13 0:33 ` Shakeel Butt 0 siblings, 1 reply; 45+ messages in thread From: Alexei Starovoitov @ 2025-09-13 0:07 UTC (permalink / raw) To: Shakeel Butt Cc: Suren Baghdasaryan, bpf, linux-mm, Vlastimil Babka, Harry Yoo, Michal Hocko, Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner, Roman Gushchin On Fri, Sep 12, 2025 at 5:02 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > On Fri, Sep 12, 2025 at 02:59:08PM -0700, Alexei Starovoitov wrote: > > On Fri, Sep 12, 2025 at 2:44 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > > > On Fri, Sep 12, 2025 at 02:31:47PM -0700, Alexei Starovoitov wrote: > > > > On Fri, Sep 12, 2025 at 2:29 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > > > > > > > On Fri, Sep 12, 2025 at 02:24:26PM -0700, Alexei Starovoitov wrote: > > > > > > On Fri, Sep 12, 2025 at 2:03 PM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > > > > > > > > > On Fri, Sep 12, 2025 at 12:27 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > > > > > > > > > > > > > +Suren, Roman > > > > > > > > > > > > > > > > On Mon, Sep 08, 2025 at 06:00:06PM -0700, Alexei Starovoitov wrote: > > > > > > > > > From: Alexei Starovoitov <ast@kernel.org> > > > > > > > > > > > > > > > > > > Since the combination of valid upper bits in slab->obj_exts with > > > > > > > > > OBJEXTS_ALLOC_FAIL bit can never happen, > > > > > > > > > use OBJEXTS_ALLOC_FAIL == (1ull << 0) as a magic sentinel > > > > > > > > > instead of (1ull << 2) to free up bit 2. > > > > > > > > > > > > > > > > > > Signed-off-by: Alexei Starovoitov <ast@kernel.org> > > > > > > > > > > > > > > > > Are we low on bits that we need to do this or is this good to have > > > > > > > > optimization but not required? > > > > > > > > > > > > > > That's a good question. After this change MEMCG_DATA_OBJEXTS and > > > > > > > OBJEXTS_ALLOC_FAIL will have the same value and they are used with the > > > > > > > same field (page->memcg_data and slab->obj_exts are aliases). Even if > > > > > > > page_memcg_data_flags can never be used for slab pages I think > > > > > > > overlapping these bits is not a good idea and creates additional > > > > > > > risks. Unless there is a good reason to do this I would advise against > > > > > > > it. > > > > > > > > > > > > Completely disagree. You both missed the long discussion > > > > > > during v4. The other alternative was to increase alignment > > > > > > and waste memory. Saving the bit is obviously cleaner. > > > > > > The next patch is using the saved bit. > > > > > > > > > > I will check out that discussion and it would be good to summarize that > > > > > in the commit message. > > > > > > > > Disgaree. It's not a job of a small commit to summarize all options > > > > that were discussed on the list. That's what the cover letter is for > > > > and there there are links to all previous threads. > > > > > > Currently the commit message is only telling what the patch is doing and > > > is missing the 'why' part and I think adding the 'why' part would make it > > > better for future readers i.e. less effort to find why this is being > > > done this way. (Anyways this is just a nit from me) > > > > I think 'why' here is obvious. Free the bit to use it later. 
> > From time to time people add a sentence like > > "this bit will be used in the next patch", > > but I never do this and sometimes remove it from other people's > > commits, since "in the next patch" is plenty ambiguous and not helpful. > > Yes, the part about the freed bit being used in later patch was clear. > The part about if we really need it was not obvious and if I understand > the discussion at [1] (relevant text below), it was not required but > good to have. > ``` > > I was going to say "add a new flag to enum objext_flags", > > but all lower 3 bits of slab->obj_exts pointer are already in use? oh... > > > > Maybe need a magic trick to add one more flag, > > like always align the size with 16? > > > > In practice that should not lead to increase in memory consumption > > anyway because most of the kmalloc-* sizes are already at least > > 16 bytes aligned. > > Yes. That's an option, but I think we can do better. > OBJEXTS_ALLOC_FAIL doesn't need to consume the bit. > ``` > > Anyways no objection from me but Harry had a followup request [2]: > ``` > This will work, but it would be helpful to add a comment clarifying that > when bit 0 is set with valid upper bits, it indicates > MEMCG_DATA_OBJEXTS, but when the upper bits are all zero, it indicates > OBJEXTS_ALLOC_FAIL. > > When someone looks at the code without checking history it might not > be obvious at first glance. > ``` > > I think the above requested comment would be really useful. ... and that comment was added. pretty much verbatim copy paste of the above. Don't you see it in the patch? > Suren is > fixing the condition of VM_BUG_ON_PAGE() in slab_obj_exts(). With this > patch, I think, that condition will need to be changed again. That's orthogonal and I'm not convinced it's correct. slab_obj_exts() is doing the right thing. afaict. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH slab v5 5/6] slab: Reuse first bit for OBJEXTS_ALLOC_FAIL 2025-09-13 0:07 ` Alexei Starovoitov @ 2025-09-13 0:33 ` Shakeel Butt 2025-09-13 0:36 ` Suren Baghdasaryan 0 siblings, 1 reply; 45+ messages in thread From: Shakeel Butt @ 2025-09-13 0:33 UTC (permalink / raw) To: Alexei Starovoitov Cc: Suren Baghdasaryan, bpf, linux-mm, Vlastimil Babka, Harry Yoo, Michal Hocko, Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner, Roman Gushchin On Fri, Sep 12, 2025 at 05:07:59PM -0700, Alexei Starovoitov wrote: > On Fri, Sep 12, 2025 at 5:02 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > On Fri, Sep 12, 2025 at 02:59:08PM -0700, Alexei Starovoitov wrote: > > > On Fri, Sep 12, 2025 at 2:44 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > > > > > On Fri, Sep 12, 2025 at 02:31:47PM -0700, Alexei Starovoitov wrote: > > > > > On Fri, Sep 12, 2025 at 2:29 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > > > > > > > > > On Fri, Sep 12, 2025 at 02:24:26PM -0700, Alexei Starovoitov wrote: > > > > > > > On Fri, Sep 12, 2025 at 2:03 PM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > > > > > > > > > > > On Fri, Sep 12, 2025 at 12:27 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > > > > > > > > > > > > > > > +Suren, Roman > > > > > > > > > > > > > > > > > > On Mon, Sep 08, 2025 at 06:00:06PM -0700, Alexei Starovoitov wrote: > > > > > > > > > > From: Alexei Starovoitov <ast@kernel.org> > > > > > > > > > > > > > > > > > > > > Since the combination of valid upper bits in slab->obj_exts with > > > > > > > > > > OBJEXTS_ALLOC_FAIL bit can never happen, > > > > > > > > > > use OBJEXTS_ALLOC_FAIL == (1ull << 0) as a magic sentinel > > > > > > > > > > instead of (1ull << 2) to free up bit 2. > > > > > > > > > > > > > > > > > > > > Signed-off-by: Alexei Starovoitov <ast@kernel.org> > > > > > > > > > > > > > > > > > > Are we low on bits that we need to do this or is this good to have > > > > > > > > > optimization but not required? > > > > > > > > > > > > > > > > That's a good question. After this change MEMCG_DATA_OBJEXTS and > > > > > > > > OBJEXTS_ALLOC_FAIL will have the same value and they are used with the > > > > > > > > same field (page->memcg_data and slab->obj_exts are aliases). Even if > > > > > > > > page_memcg_data_flags can never be used for slab pages I think > > > > > > > > overlapping these bits is not a good idea and creates additional > > > > > > > > risks. Unless there is a good reason to do this I would advise against > > > > > > > > it. > > > > > > > > > > > > > > Completely disagree. You both missed the long discussion > > > > > > > during v4. The other alternative was to increase alignment > > > > > > > and waste memory. Saving the bit is obviously cleaner. > > > > > > > The next patch is using the saved bit. > > > > > > > > > > > > I will check out that discussion and it would be good to summarize that > > > > > > in the commit message. > > > > > > > > > > Disgaree. It's not a job of a small commit to summarize all options > > > > > that were discussed on the list. That's what the cover letter is for > > > > > and there there are links to all previous threads. > > > > > > > > Currently the commit message is only telling what the patch is doing and > > > > is missing the 'why' part and I think adding the 'why' part would make it > > > > better for future readers i.e. less effort to find why this is being > > > > done this way. 
(Anyways this is just a nit from me) > > > > > > I think 'why' here is obvious. Free the bit to use it later. > > > From time to time people add a sentence like > > > "this bit will be used in the next patch", > > > but I never do this and sometimes remove it from other people's > > > commits, since "in the next patch" is plenty ambiguous and not helpful. > > > > Yes, the part about the freed bit being used in later patch was clear. > > The part about if we really need it was not obvious and if I understand > > the discussion at [1] (relevant text below), it was not required but > > good to have. > > ``` > > > I was going to say "add a new flag to enum objext_flags", > > > but all lower 3 bits of slab->obj_exts pointer are already in use? oh... > > > > > > Maybe need a magic trick to add one more flag, > > > like always align the size with 16? > > > > > > In practice that should not lead to increase in memory consumption > > > anyway because most of the kmalloc-* sizes are already at least > > > 16 bytes aligned. > > > > Yes. That's an option, but I think we can do better. > > OBJEXTS_ALLOC_FAIL doesn't need to consume the bit. > > ``` > > > > Anyways no objection from me but Harry had a followup request [2]: > > ``` > > This will work, but it would be helpful to add a comment clarifying that > > when bit 0 is set with valid upper bits, it indicates > > MEMCG_DATA_OBJEXTS, but when the upper bits are all zero, it indicates > > OBJEXTS_ALLOC_FAIL. > > > > When someone looks at the code without checking history it might not > > be obvious at first glance. > > ``` > > > > I think the above requested comment would be really useful. > > ... and that comment was added. pretty much verbatim copy paste > of the above. Don't you see it in the patch? Haha it seems I am blind, yup it is there. > > > Suren is > > fixing the condition of VM_BUG_ON_PAGE() in slab_obj_exts(). With this > > patch, I think, that condition will need to be changed again. > > That's orthogonal and I'm not convinced it's correct. > slab_obj_exts() is doing the right thing. afaict. Currently we have VM_BUG_ON_PAGE(obj_exts && !(obj_exts & MEMCG_DATA_OBJEXTS)) but it should be (before your patch) something like: VM_BUG_ON_PAGE(obj_exts && !(obj_exts & (MEMCG_DATA_OBJEXTS | OBJEXTS_ALLOC_FAIL))) After your patch, hmmm, the previous one would be right again and the newer one will be the same as the previous due to aliasing. This patch doesn't need to touch that VM_BUG. Older kernels will need to move to the second condition though. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH slab v5 5/6] slab: Reuse first bit for OBJEXTS_ALLOC_FAIL 2025-09-13 0:33 ` Shakeel Butt @ 2025-09-13 0:36 ` Suren Baghdasaryan 2025-09-13 1:12 ` Alexei Starovoitov 0 siblings, 1 reply; 45+ messages in thread From: Suren Baghdasaryan @ 2025-09-13 0:36 UTC (permalink / raw) To: Shakeel Butt Cc: Alexei Starovoitov, bpf, linux-mm, Vlastimil Babka, Harry Yoo, Michal Hocko, Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner, Roman Gushchin On Fri, Sep 12, 2025 at 5:33 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > On Fri, Sep 12, 2025 at 05:07:59PM -0700, Alexei Starovoitov wrote: > > On Fri, Sep 12, 2025 at 5:02 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > > > On Fri, Sep 12, 2025 at 02:59:08PM -0700, Alexei Starovoitov wrote: > > > > On Fri, Sep 12, 2025 at 2:44 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > > > > > > > On Fri, Sep 12, 2025 at 02:31:47PM -0700, Alexei Starovoitov wrote: > > > > > > On Fri, Sep 12, 2025 at 2:29 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > > > > > > > > > > > On Fri, Sep 12, 2025 at 02:24:26PM -0700, Alexei Starovoitov wrote: > > > > > > > > On Fri, Sep 12, 2025 at 2:03 PM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > > > > > > > > > > > > > On Fri, Sep 12, 2025 at 12:27 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > > > > > > > > > > > > > > > > > +Suren, Roman > > > > > > > > > > > > > > > > > > > > On Mon, Sep 08, 2025 at 06:00:06PM -0700, Alexei Starovoitov wrote: > > > > > > > > > > > From: Alexei Starovoitov <ast@kernel.org> > > > > > > > > > > > > > > > > > > > > > > Since the combination of valid upper bits in slab->obj_exts with > > > > > > > > > > > OBJEXTS_ALLOC_FAIL bit can never happen, > > > > > > > > > > > use OBJEXTS_ALLOC_FAIL == (1ull << 0) as a magic sentinel > > > > > > > > > > > instead of (1ull << 2) to free up bit 2. > > > > > > > > > > > > > > > > > > > > > > Signed-off-by: Alexei Starovoitov <ast@kernel.org> > > > > > > > > > > > > > > > > > > > > Are we low on bits that we need to do this or is this good to have > > > > > > > > > > optimization but not required? > > > > > > > > > > > > > > > > > > That's a good question. After this change MEMCG_DATA_OBJEXTS and > > > > > > > > > OBJEXTS_ALLOC_FAIL will have the same value and they are used with the > > > > > > > > > same field (page->memcg_data and slab->obj_exts are aliases). Even if > > > > > > > > > page_memcg_data_flags can never be used for slab pages I think > > > > > > > > > overlapping these bits is not a good idea and creates additional > > > > > > > > > risks. Unless there is a good reason to do this I would advise against > > > > > > > > > it. > > > > > > > > > > > > > > > > Completely disagree. You both missed the long discussion > > > > > > > > during v4. The other alternative was to increase alignment > > > > > > > > and waste memory. Saving the bit is obviously cleaner. > > > > > > > > The next patch is using the saved bit. > > > > > > > > > > > > > > I will check out that discussion and it would be good to summarize that > > > > > > > in the commit message. > > > > > > > > > > > > Disgaree. It's not a job of a small commit to summarize all options > > > > > > that were discussed on the list. That's what the cover letter is for > > > > > > and there there are links to all previous threads. 
> > > > > > > > > > Currently the commit message is only telling what the patch is doing and > > > > > is missing the 'why' part and I think adding the 'why' part would make it > > > > > better for future readers i.e. less effort to find why this is being > > > > > done this way. (Anyways this is just a nit from me) > > > > > > > > I think 'why' here is obvious. Free the bit to use it later. > > > > From time to time people add a sentence like > > > > "this bit will be used in the next patch", > > > > but I never do this and sometimes remove it from other people's > > > > commits, since "in the next patch" is plenty ambiguous and not helpful. > > > > > > Yes, the part about the freed bit being used in later patch was clear. > > > The part about if we really need it was not obvious and if I understand > > > the discussion at [1] (relevant text below), it was not required but > > > good to have. > > > ``` > > > > I was going to say "add a new flag to enum objext_flags", > > > > but all lower 3 bits of slab->obj_exts pointer are already in use? oh... > > > > > > > > Maybe need a magic trick to add one more flag, > > > > like always align the size with 16? > > > > > > > > In practice that should not lead to increase in memory consumption > > > > anyway because most of the kmalloc-* sizes are already at least > > > > 16 bytes aligned. > > > > > > Yes. That's an option, but I think we can do better. > > > OBJEXTS_ALLOC_FAIL doesn't need to consume the bit. > > > ``` > > > > > > Anyways no objection from me but Harry had a followup request [2]: > > > ``` > > > This will work, but it would be helpful to add a comment clarifying that > > > when bit 0 is set with valid upper bits, it indicates > > > MEMCG_DATA_OBJEXTS, but when the upper bits are all zero, it indicates > > > OBJEXTS_ALLOC_FAIL. > > > > > > When someone looks at the code without checking history it might not > > > be obvious at first glance. > > > ``` > > > > > > I think the above requested comment would be really useful. > > > > ... and that comment was added. pretty much verbatim copy paste > > of the above. Don't you see it in the patch? > > Haha it seems I am blind, yup it is there. > > > > > > Suren is > > > fixing the condition of VM_BUG_ON_PAGE() in slab_obj_exts(). With this > > > patch, I think, that condition will need to be changed again. > > > > That's orthogonal and I'm not convinced it's correct. > > slab_obj_exts() is doing the right thing. afaict. > > Currently we have > > VM_BUG_ON_PAGE(obj_exts && !(obj_exts & MEMCG_DATA_OBJEXTS)) > > but it should be (before your patch) something like: > > VM_BUG_ON_PAGE(obj_exts && !(obj_exts & (MEMCG_DATA_OBJEXTS | OBJEXTS_ALLOC_FAIL))) > > After your patch, hmmm, the previous one would be right again and the > newer one will be the same as the previous due to aliasing. This patch > doesn't need to touch that VM_BUG. Older kernels will need to move to > the second condition though. Correct. Currently slab_obj_exts() will issue a warning when (obj_exts == OBJEXTS_ALLOC_FAIL), which is a perfectly valid state indicating that previous allocation of the vector failed due to memory exhaustion. Changing that warning to: VM_BUG_ON_PAGE(obj_exts && !(obj_exts & (MEMCG_DATA_OBJEXTS | OBJEXTS_ALLOC_FAIL))) will correctly avoid this warning and after your change will still work. (MEMCG_DATA_OBJEXTS | OBJEXTS_ALLOC_FAIL) when (MEMCG_DATA_OBJEXTS == OBJEXTS_ALLOC_FAIL) is technically unnecessary but is good for documenting the conditions we are checking. 
^ permalink raw reply [flat|nested] 45+ messages in thread
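Putting Shakeel's and Suren's fragments together, the check in slab_obj_exts() changes roughly as follows; the second argument shown here assumes the usual slab_page(slab) that the thread's shorthand leaves out.

```
/*
 * Current check: trips on the legitimate "allocation failed" sentinel,
 * which is a valid obj_exts value.
 */
VM_BUG_ON_PAGE(obj_exts && !(obj_exts & MEMCG_DATA_OBJEXTS),
	       slab_page(slab));

/*
 * Proposed check: also accept OBJEXTS_ALLOC_FAIL.  Once patch 5/6 makes
 * the two flags alias, the extra name is documentation rather than a
 * functional change.
 */
VM_BUG_ON_PAGE(obj_exts && !(obj_exts & (MEMCG_DATA_OBJEXTS |
					 OBJEXTS_ALLOC_FAIL)),
	       slab_page(slab));
```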
* Re: [PATCH slab v5 5/6] slab: Reuse first bit for OBJEXTS_ALLOC_FAIL 2025-09-13 0:36 ` Suren Baghdasaryan @ 2025-09-13 1:12 ` Alexei Starovoitov 2025-09-15 7:51 ` Vlastimil Babka 0 siblings, 1 reply; 45+ messages in thread From: Alexei Starovoitov @ 2025-09-13 1:12 UTC (permalink / raw) To: Suren Baghdasaryan Cc: Shakeel Butt, bpf, linux-mm, Vlastimil Babka, Harry Yoo, Michal Hocko, Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner, Roman Gushchin On Fri, Sep 12, 2025 at 5:36 PM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > Suren is > > > > fixing the condition of VM_BUG_ON_PAGE() in slab_obj_exts(). With this > > > > patch, I think, that condition will need to be changed again. > > > > > > That's orthogonal and I'm not convinced it's correct. > > > slab_obj_exts() is doing the right thing. afaict. > > > > Currently we have > > > > VM_BUG_ON_PAGE(obj_exts && !(obj_exts & MEMCG_DATA_OBJEXTS)) > > > > but it should be (before your patch) something like: > > > > VM_BUG_ON_PAGE(obj_exts && !(obj_exts & (MEMCG_DATA_OBJEXTS | OBJEXTS_ALLOC_FAIL))) > > > > After your patch, hmmm, the previous one would be right again and the > > newer one will be the same as the previous due to aliasing. This patch > > doesn't need to touch that VM_BUG. Older kernels will need to move to > > the second condition though. > > Correct. Currently slab_obj_exts() will issue a warning when (obj_exts > == OBJEXTS_ALLOC_FAIL), which is a perfectly valid state indicating > that previous allocation of the vector failed due to memory > exhaustion. Changing that warning to: > > VM_BUG_ON_PAGE(obj_exts && !(obj_exts & (MEMCG_DATA_OBJEXTS | > OBJEXTS_ALLOC_FAIL))) > > will correctly avoid this warning and after your change will still > work. (MEMCG_DATA_OBJEXTS | OBJEXTS_ALLOC_FAIL) when > (MEMCG_DATA_OBJEXTS == OBJEXTS_ALLOC_FAIL) is technically unnecessary > but is good for documenting the conditions we are checking. I see what you mean. I feel the comment in slab_obj_exts() that explains all that would be better long term than decipher from C code. Both are fine, I guess. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH slab v5 5/6] slab: Reuse first bit for OBJEXTS_ALLOC_FAIL 2025-09-13 1:12 ` Alexei Starovoitov @ 2025-09-15 7:51 ` Vlastimil Babka 2025-09-15 15:06 ` Suren Baghdasaryan 0 siblings, 1 reply; 45+ messages in thread From: Vlastimil Babka @ 2025-09-15 7:51 UTC (permalink / raw) To: Alexei Starovoitov, Suren Baghdasaryan Cc: Shakeel Butt, bpf, linux-mm, Harry Yoo, Michal Hocko, Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner, Roman Gushchin On 9/13/25 03:12, Alexei Starovoitov wrote: > On Fri, Sep 12, 2025 at 5:36 PM Suren Baghdasaryan <surenb@google.com> wrote: >> >> > > > Suren is >> > > > fixing the condition of VM_BUG_ON_PAGE() in slab_obj_exts(). With this >> > > > patch, I think, that condition will need to be changed again. >> > > >> > > That's orthogonal and I'm not convinced it's correct. >> > > slab_obj_exts() is doing the right thing. afaict. >> > >> > Currently we have >> > >> > VM_BUG_ON_PAGE(obj_exts && !(obj_exts & MEMCG_DATA_OBJEXTS)) >> > >> > but it should be (before your patch) something like: >> > >> > VM_BUG_ON_PAGE(obj_exts && !(obj_exts & (MEMCG_DATA_OBJEXTS | OBJEXTS_ALLOC_FAIL))) >> > >> > After your patch, hmmm, the previous one would be right again and the >> > newer one will be the same as the previous due to aliasing. This patch >> > doesn't need to touch that VM_BUG. Older kernels will need to move to >> > the second condition though. >> >> Correct. Currently slab_obj_exts() will issue a warning when (obj_exts >> == OBJEXTS_ALLOC_FAIL), which is a perfectly valid state indicating >> that previous allocation of the vector failed due to memory >> exhaustion. Changing that warning to: >> >> VM_BUG_ON_PAGE(obj_exts && !(obj_exts & (MEMCG_DATA_OBJEXTS | >> OBJEXTS_ALLOC_FAIL))) >> >> will correctly avoid this warning and after your change will still >> work. (MEMCG_DATA_OBJEXTS | OBJEXTS_ALLOC_FAIL) when >> (MEMCG_DATA_OBJEXTS == OBJEXTS_ALLOC_FAIL) is technically unnecessary >> but is good for documenting the conditions we are checking. > > I see what you mean. I feel the comment in slab_obj_exts() > that explains all that would be better long term than decipher > from C code. Both are fine, I guess. I guess perhaps both, having "(MEMCG_DATA_OBJEXTS | OBJEXTS_ALLOC_FAIL)" in the code (to discover where OBJEXTS_ALLOC_FAIL is important to consider, in case the flag layout changes again), and a comment explaining what's going on. Shakeel or Suren, will you send the fix, including Fixes: ? I can put it in ahead of this series with cc stable in slab/for-next and it shouldn't affect the series. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH slab v5 5/6] slab: Reuse first bit for OBJEXTS_ALLOC_FAIL 2025-09-15 7:51 ` Vlastimil Babka @ 2025-09-15 15:06 ` Suren Baghdasaryan 2025-09-15 15:11 ` Vlastimil Babka 0 siblings, 1 reply; 45+ messages in thread From: Suren Baghdasaryan @ 2025-09-15 15:06 UTC (permalink / raw) To: Vlastimil Babka Cc: Alexei Starovoitov, Shakeel Butt, bpf, linux-mm, Harry Yoo, Michal Hocko, Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner, Roman Gushchin On Mon, Sep 15, 2025 at 12:51 AM Vlastimil Babka <vbabka@suse.cz> wrote: > > On 9/13/25 03:12, Alexei Starovoitov wrote: > > On Fri, Sep 12, 2025 at 5:36 PM Suren Baghdasaryan <surenb@google.com> wrote: > >> > >> > > > Suren is > >> > > > fixing the condition of VM_BUG_ON_PAGE() in slab_obj_exts(). With this > >> > > > patch, I think, that condition will need to be changed again. > >> > > > >> > > That's orthogonal and I'm not convinced it's correct. > >> > > slab_obj_exts() is doing the right thing. afaict. > >> > > >> > Currently we have > >> > > >> > VM_BUG_ON_PAGE(obj_exts && !(obj_exts & MEMCG_DATA_OBJEXTS)) > >> > > >> > but it should be (before your patch) something like: > >> > > >> > VM_BUG_ON_PAGE(obj_exts && !(obj_exts & (MEMCG_DATA_OBJEXTS | OBJEXTS_ALLOC_FAIL))) > >> > > >> > After your patch, hmmm, the previous one would be right again and the > >> > newer one will be the same as the previous due to aliasing. This patch > >> > doesn't need to touch that VM_BUG. Older kernels will need to move to > >> > the second condition though. > >> > >> Correct. Currently slab_obj_exts() will issue a warning when (obj_exts > >> == OBJEXTS_ALLOC_FAIL), which is a perfectly valid state indicating > >> that previous allocation of the vector failed due to memory > >> exhaustion. Changing that warning to: > >> > >> VM_BUG_ON_PAGE(obj_exts && !(obj_exts & (MEMCG_DATA_OBJEXTS | > >> OBJEXTS_ALLOC_FAIL))) > >> > >> will correctly avoid this warning and after your change will still > >> work. (MEMCG_DATA_OBJEXTS | OBJEXTS_ALLOC_FAIL) when > >> (MEMCG_DATA_OBJEXTS == OBJEXTS_ALLOC_FAIL) is technically unnecessary > >> but is good for documenting the conditions we are checking. > > > > I see what you mean. I feel the comment in slab_obj_exts() > > that explains all that would be better long term than decipher > > from C code. Both are fine, I guess. > > I guess perhaps both, having "(MEMCG_DATA_OBJEXTS | OBJEXTS_ALLOC_FAIL)" in > the code (to discover where OBJEXTS_ALLOC_FAIL is important to consider, in > case the flag layout changes again), and a comment explaining what's going on. Yes, exactly what I had in mind. > > Shakeel or Suren, will you sent the fix, including Fixes: ? I can put in > ahead of this series with cc stable in slab/for-next and it shouldn't affect > the series. I will post it today. I was planning to include it as a respin of the fixup patchset [1] but if you prefer it separately I can do that too. Please let me know your preference. Another fixup patch I'll be adding is the removal of the `if (new_slab)` condition for doing mark_failed_objexts_alloc() inside alloc_slab_obj_exts() [2]. [1] https://lore.kernel.org/all/20250909233409.1013367-1-surenb@google.com/ [2] https://elixir.bootlin.com/linux/v6.16.5/source/mm/slub.c#L1996 ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH slab v5 5/6] slab: Reuse first bit for OBJEXTS_ALLOC_FAIL 2025-09-15 15:06 ` Suren Baghdasaryan @ 2025-09-15 15:11 ` Vlastimil Babka 2025-09-15 15:25 ` Suren Baghdasaryan 0 siblings, 1 reply; 45+ messages in thread From: Vlastimil Babka @ 2025-09-15 15:11 UTC (permalink / raw) To: Suren Baghdasaryan Cc: Alexei Starovoitov, Shakeel Butt, bpf, linux-mm, Harry Yoo, Michal Hocko, Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner, Roman Gushchin On 9/15/25 17:06, Suren Baghdasaryan wrote: > On Mon, Sep 15, 2025 at 12:51 AM Vlastimil Babka <vbabka@suse.cz> wrote: >> >> >> Shakeel or Suren, will you sent the fix, including Fixes: ? I can put in >> ahead of this series with cc stable in slab/for-next and it shouldn't affect >> the series. > > I will post it today. I was planning to include it as a resping of the > fixup patchset [1] but if you prefer it separately I can do that too. > Please let me know your preference. I think it will be better for patches touching slab (only) to be separate, to avoid conflict potential. [1] seems mm tree material > Another fixup patch I'll be adding is the removal of the `if > (new_slab)` condition for doing mark_failed_objexts_alloc() inside > alloc_slab_obj_exts() [2]. Ack, then it's 2 patches for slab :) Thanks, Vlastimil > [1] https://lore.kernel.org/all/20250909233409.1013367-1-surenb@google.com/ > [2] https://elixir.bootlin.com/linux/v6.16.5/source/mm/slub.c#L1996 ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH slab v5 5/6] slab: Reuse first bit for OBJEXTS_ALLOC_FAIL 2025-09-15 15:11 ` Vlastimil Babka @ 2025-09-15 15:25 ` Suren Baghdasaryan 2025-09-15 20:10 ` Suren Baghdasaryan 0 siblings, 1 reply; 45+ messages in thread From: Suren Baghdasaryan @ 2025-09-15 15:25 UTC (permalink / raw) To: Vlastimil Babka Cc: Alexei Starovoitov, Shakeel Butt, bpf, linux-mm, Harry Yoo, Michal Hocko, Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner, Roman Gushchin On Mon, Sep 15, 2025 at 8:11 AM Vlastimil Babka <vbabka@suse.cz> wrote: > > On 9/15/25 17:06, Suren Baghdasaryan wrote: > > On Mon, Sep 15, 2025 at 12:51 AM Vlastimil Babka <vbabka@suse.cz> wrote: > >> > >> > >> Shakeel or Suren, will you sent the fix, including Fixes: ? I can put in > >> ahead of this series with cc stable in slab/for-next and it shouldn't affect > >> the series. > > > > I will post it today. I was planning to include it as a resping of the > > fixup patchset [1] but if you prefer it separately I can do that too. > > Please let me know your preference. > > I think it will be better for patches touching slab (only) to be separate, > to avoid conflict potential. [1] seems mm tree material > > > Another fixup patch I'll be adding is the removal of the `if > > (new_slab)` condition for doing mark_failed_objexts_alloc() inside > > alloc_slab_obj_exts() [2]. > > Ack, then it's 2 patches for slab :) Ack. Will post after our meeting today. > > Thanks, > Vlastimil > > > [1] https://lore.kernel.org/all/20250909233409.1013367-1-surenb@google.com/ > > [2] https://elixir.bootlin.com/linux/v6.16.5/source/mm/slub.c#L1996 > ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH slab v5 5/6] slab: Reuse first bit for OBJEXTS_ALLOC_FAIL 2025-09-15 15:25 ` Suren Baghdasaryan @ 2025-09-15 20:10 ` Suren Baghdasaryan 0 siblings, 0 replies; 45+ messages in thread From: Suren Baghdasaryan @ 2025-09-15 20:10 UTC (permalink / raw) To: Vlastimil Babka Cc: Alexei Starovoitov, Shakeel Butt, bpf, linux-mm, Harry Yoo, Michal Hocko, Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner, Roman Gushchin On Mon, Sep 15, 2025 at 8:25 AM Suren Baghdasaryan <surenb@google.com> wrote: > > On Mon, Sep 15, 2025 at 8:11 AM Vlastimil Babka <vbabka@suse.cz> wrote: > > > > On 9/15/25 17:06, Suren Baghdasaryan wrote: > > > On Mon, Sep 15, 2025 at 12:51 AM Vlastimil Babka <vbabka@suse.cz> wrote: > > >> > > >> > > >> Shakeel or Suren, will you sent the fix, including Fixes: ? I can put in > > >> ahead of this series with cc stable in slab/for-next and it shouldn't affect > > >> the series. > > > > > > I will post it today. I was planning to include it as a resping of the > > > fixup patchset [1] but if you prefer it separately I can do that too. > > > Please let me know your preference. > > > > I think it will be better for patches touching slab (only) to be separate, > > to avoid conflict potential. [1] seems mm tree material > > > > > Another fixup patch I'll be adding is the removal of the `if > > > (new_slab)` condition for doing mark_failed_objexts_alloc() inside > > > alloc_slab_obj_exts() [2]. > > > > Ack, then it's 2 patches for slab :) > > Ack. Will post after our meeting today. Posted at https://lore.kernel.org/all/20250915200918.3855580-1-surenb@google.com/ > > > > > Thanks, > > Vlastimil > > > > > [1] https://lore.kernel.org/all/20250909233409.1013367-1-surenb@google.com/ > > > [2] https://elixir.bootlin.com/linux/v6.16.5/source/mm/slub.c#L1996 > > ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH slab v5 5/6] slab: Reuse first bit for OBJEXTS_ALLOC_FAIL 2025-09-09 1:00 ` [PATCH slab v5 5/6] slab: Reuse first bit for OBJEXTS_ALLOC_FAIL Alexei Starovoitov 2025-09-12 19:27 ` Shakeel Butt @ 2025-09-13 1:16 ` Shakeel Butt 2025-09-15 6:14 ` Harry Yoo 2 siblings, 0 replies; 45+ messages in thread From: Shakeel Butt @ 2025-09-13 1:16 UTC (permalink / raw) To: Alexei Starovoitov Cc: bpf, linux-mm, vbabka, harry.yoo, mhocko, bigeasy, andrii, memxor, akpm, peterz, rostedt, hannes On Mon, Sep 08, 2025 at 06:00:06PM -0700, Alexei Starovoitov wrote: > From: Alexei Starovoitov <ast@kernel.org> > > Since the combination of valid upper bits in slab->obj_exts with > OBJEXTS_ALLOC_FAIL bit can never happen, > use OBJEXTS_ALLOC_FAIL == (1ull << 0) as a magic sentinel > instead of (1ull << 2) to free up bit 2. > > Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH slab v5 5/6] slab: Reuse first bit for OBJEXTS_ALLOC_FAIL 2025-09-09 1:00 ` [PATCH slab v5 5/6] slab: Reuse first bit for OBJEXTS_ALLOC_FAIL Alexei Starovoitov 2025-09-12 19:27 ` Shakeel Butt 2025-09-13 1:16 ` Shakeel Butt @ 2025-09-15 6:14 ` Harry Yoo 2 siblings, 0 replies; 45+ messages in thread From: Harry Yoo @ 2025-09-15 6:14 UTC (permalink / raw) To: Alexei Starovoitov Cc: bpf, linux-mm, vbabka, shakeel.butt, mhocko, bigeasy, andrii, memxor, akpm, peterz, rostedt, hannes On Mon, Sep 08, 2025 at 06:00:06PM -0700, Alexei Starovoitov wrote: > From: Alexei Starovoitov <ast@kernel.org> > > Since the combination of valid upper bits in slab->obj_exts with > OBJEXTS_ALLOC_FAIL bit can never happen, > use OBJEXTS_ALLOC_FAIL == (1ull << 0) as a magic sentinel > instead of (1ull << 2) to free up bit 2. > > Signed-off-by: Alexei Starovoitov <ast@kernel.org> > --- Looks good to me, Reviewed-by: Harry Yoo <harry.yoo@oracle.com> > include/linux/memcontrol.h | 10 ++++++++-- > mm/slub.c | 2 +- > 2 files changed, 9 insertions(+), 3 deletions(-) -- Cheers, Harry / Hyeonggon ^ permalink raw reply [flat|nested] 45+ messages in thread
* [PATCH slab v5 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock(). 2025-09-09 1:00 [PATCH slab v5 0/6] slab: Re-entrant kmalloc_nolock() Alexei Starovoitov ` (4 preceding siblings ...) 2025-09-09 1:00 ` [PATCH slab v5 5/6] slab: Reuse first bit for OBJEXTS_ALLOC_FAIL Alexei Starovoitov @ 2025-09-09 1:00 ` Alexei Starovoitov 2025-09-15 12:52 ` Harry Yoo 2025-09-24 0:40 ` Harry Yoo 2025-09-12 9:33 ` [PATCH slab v5 0/6] slab: Re-entrant kmalloc_nolock() Vlastimil Babka 6 siblings, 2 replies; 45+ messages in thread From: Alexei Starovoitov @ 2025-09-09 1:00 UTC (permalink / raw) To: bpf, linux-mm Cc: vbabka, harry.yoo, shakeel.butt, mhocko, bigeasy, andrii, memxor, akpm, peterz, rostedt, hannes From: Alexei Starovoitov <ast@kernel.org> kmalloc_nolock() relies on ability of local_trylock_t to detect the situation when per-cpu kmem_cache is locked. In !PREEMPT_RT local_(try)lock_irqsave(&s->cpu_slab->lock, flags) disables IRQs and marks s->cpu_slab->lock as acquired. local_lock_is_locked(&s->cpu_slab->lock) returns true when slab is in the middle of manipulating per-cpu cache of that specific kmem_cache. kmalloc_nolock() can be called from any context and can re-enter into ___slab_alloc(): kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> NMI -> bpf -> kmalloc_nolock() -> ___slab_alloc(cache_B) or kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> tracepoint/kprobe -> bpf -> kmalloc_nolock() -> ___slab_alloc(cache_B) Hence the caller of ___slab_alloc() checks if &s->cpu_slab->lock can be acquired without a deadlock before invoking the function. If that specific per-cpu kmem_cache is busy the kmalloc_nolock() retries in a different kmalloc bucket. The second attempt will likely succeed, since this cpu locked different kmem_cache. Similarly, in PREEMPT_RT local_lock_is_locked() returns true when per-cpu rt_spin_lock is locked by current _task_. In this case re-entrance into the same kmalloc bucket is unsafe, and kmalloc_nolock() tries a different bucket that is most likely is not locked by the current task. Though it may be locked by a different task it's safe to rt_spin_lock() and sleep on it. Similar to alloc_pages_nolock() the kmalloc_nolock() returns NULL immediately if called from hard irq or NMI in PREEMPT_RT. kfree_nolock() defers freeing to irq_work when local_lock_is_locked() and (in_nmi() or in PREEMPT_RT). SLUB_TINY config doesn't use local_lock_is_locked() and relies on spin_trylock_irqsave(&n->list_lock) to allocate, while kfree_nolock() always defers to irq_work. Note, kfree_nolock() must be called _only_ for objects allocated with kmalloc_nolock(). Debug checks (like kmemleak and kfence) were skipped on allocation, hence obj = kmalloc(); kfree_nolock(obj); will miss kmemleak/kfence book keeping and will cause false positives. large_kmalloc is not supported by either kmalloc_nolock() or kfree_nolock(). 
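Before the diff, a deliberately simplified sketch of the bucket-retry rule described above; this is not the code added by the patch (that follows below), and bucket_for_size()/try_alloc_from() are invented placeholders.

```
/*
 * Deliberately simplified sketch of the rule above; NOT the patch's
 * implementation (see the diff below).  bucket_for_size() and
 * try_alloc_from() are invented placeholders and error handling is
 * omitted.
 */
static void *kmalloc_nolock_sketch(size_t size, gfp_t flags, int node)
{
	struct kmem_cache *s = bucket_for_size(size);		/* hypothetical */

	/*
	 * If this CPU (or, on PREEMPT_RT, this task) is already inside
	 * ___slab_alloc() for this bucket, taking its cpu_slab lock again
	 * could deadlock, so move on to another size bucket, which this
	 * context almost certainly does not hold.
	 */
	if (local_lock_is_locked(&s->cpu_slab->lock))
		s = bucket_for_size(size << 1);			/* hypothetical */

	if (local_lock_is_locked(&s->cpu_slab->lock))
		return NULL;		/* rather fail than risk a deadlock */

	return try_alloc_from(s, flags, node);			/* hypothetical */
}
```

On PREEMPT_RT the same check is per task rather than per CPU, and the real code additionally bails out from hard IRQ/NMI context, as the commit message notes.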
Signed-off-by: Alexei Starovoitov <ast@kernel.org> --- include/linux/kasan.h | 13 +- include/linux/memcontrol.h | 2 + include/linux/slab.h | 4 + mm/Kconfig | 1 + mm/kasan/common.c | 5 +- mm/slab.h | 6 + mm/slab_common.c | 3 + mm/slub.c | 473 +++++++++++++++++++++++++++++++++---- 8 files changed, 453 insertions(+), 54 deletions(-) diff --git a/include/linux/kasan.h b/include/linux/kasan.h index 890011071f2b..acdc8cb0152e 100644 --- a/include/linux/kasan.h +++ b/include/linux/kasan.h @@ -200,7 +200,7 @@ static __always_inline bool kasan_slab_pre_free(struct kmem_cache *s, } bool __kasan_slab_free(struct kmem_cache *s, void *object, bool init, - bool still_accessible); + bool still_accessible, bool no_quarantine); /** * kasan_slab_free - Poison, initialize, and quarantine a slab object. * @object: Object to be freed. @@ -226,11 +226,13 @@ bool __kasan_slab_free(struct kmem_cache *s, void *object, bool init, * @Return true if KASAN took ownership of the object; false otherwise. */ static __always_inline bool kasan_slab_free(struct kmem_cache *s, - void *object, bool init, - bool still_accessible) + void *object, bool init, + bool still_accessible, + bool no_quarantine) { if (kasan_enabled()) - return __kasan_slab_free(s, object, init, still_accessible); + return __kasan_slab_free(s, object, init, still_accessible, + no_quarantine); return false; } @@ -427,7 +429,8 @@ static inline bool kasan_slab_pre_free(struct kmem_cache *s, void *object) } static inline bool kasan_slab_free(struct kmem_cache *s, void *object, - bool init, bool still_accessible) + bool init, bool still_accessible, + bool no_quarantine) { return false; } diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index d254c0b96d0d..82563236f35c 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -358,6 +358,8 @@ enum objext_flags { * MEMCG_DATA_OBJEXTS. */ OBJEXTS_ALLOC_FAIL = __OBJEXTS_ALLOC_FAIL, + /* slabobj_ext vector allocated with kmalloc_nolock() */ + OBJEXTS_NOSPIN_ALLOC = __FIRST_OBJEXT_FLAG, /* the next bit after the last actual flag */ __NR_OBJEXTS_FLAGS = (__FIRST_OBJEXT_FLAG << 1), }; diff --git a/include/linux/slab.h b/include/linux/slab.h index 680193356ac7..561597dd2164 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -501,6 +501,7 @@ void * __must_check krealloc_noprof(const void *objp, size_t new_size, #define krealloc(...) alloc_hooks(krealloc_noprof(__VA_ARGS__)) void kfree(const void *objp); +void kfree_nolock(const void *objp); void kfree_sensitive(const void *objp); size_t __ksize(const void *objp); @@ -957,6 +958,9 @@ static __always_inline __alloc_size(1) void *kmalloc_noprof(size_t size, gfp_t f } #define kmalloc(...) alloc_hooks(kmalloc_noprof(__VA_ARGS__)) +void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node); +#define kmalloc_nolock(...) 
alloc_hooks(kmalloc_nolock_noprof(__VA_ARGS__)) + #define kmem_buckets_alloc(_b, _size, _flags) \ alloc_hooks(__kmalloc_node_noprof(PASS_BUCKET_PARAMS(_size, _b), _flags, NUMA_NO_NODE)) diff --git a/mm/Kconfig b/mm/Kconfig index e443fe8cd6cf..202e044f2b4d 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -194,6 +194,7 @@ menu "Slab allocator options" config SLUB def_bool y + select IRQ_WORK config KVFREE_RCU_BATCHED def_bool y diff --git a/mm/kasan/common.c b/mm/kasan/common.c index 9142964ab9c9..3264900b942f 100644 --- a/mm/kasan/common.c +++ b/mm/kasan/common.c @@ -252,7 +252,7 @@ bool __kasan_slab_pre_free(struct kmem_cache *cache, void *object, } bool __kasan_slab_free(struct kmem_cache *cache, void *object, bool init, - bool still_accessible) + bool still_accessible, bool no_quarantine) { if (!kasan_arch_is_ready() || is_kfence_address(object)) return false; @@ -274,6 +274,9 @@ bool __kasan_slab_free(struct kmem_cache *cache, void *object, bool init, poison_slab_object(cache, object, init); + if (no_quarantine) + return false; + /* * If the object is put into quarantine, do not let slab put the object * onto the freelist for now. The object's metadata is kept until the diff --git a/mm/slab.h b/mm/slab.h index 5a6f824a282d..19bc15db6a72 100644 --- a/mm/slab.h +++ b/mm/slab.h @@ -57,6 +57,10 @@ struct slab { struct { union { struct list_head slab_list; + struct { /* For deferred deactivate_slab() */ + struct llist_node llnode; + void *flush_freelist; + }; #ifdef CONFIG_SLUB_CPU_PARTIAL struct { struct slab *next; @@ -661,6 +665,8 @@ void __kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct slab *slab) void __check_heap_object(const void *ptr, unsigned long n, const struct slab *slab, bool to_user); +void defer_free_barrier(void); + static inline bool slub_debug_orig_size(struct kmem_cache *s) { return (kmem_cache_debug_flags(s, SLAB_STORE_USER) && diff --git a/mm/slab_common.c b/mm/slab_common.c index 08f5baee1309..77eefe660027 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -510,6 +510,9 @@ void kmem_cache_destroy(struct kmem_cache *s) rcu_barrier(); } + /* Wait for deferred work from kmalloc/kfree_nolock() */ + defer_free_barrier(); + cpus_read_lock(); mutex_lock(&slab_mutex); diff --git a/mm/slub.c b/mm/slub.c index 61841ba72120..9fdf74955227 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -44,6 +44,7 @@ #include <kunit/test.h> #include <kunit/test-bug.h> #include <linux/sort.h> +#include <linux/irq_work.h> #include <linux/debugfs.h> #include <trace/events/kmem.h> @@ -426,7 +427,7 @@ struct kmem_cache_cpu { #ifdef CONFIG_SLUB_CPU_PARTIAL struct slab *partial; /* Partially allocated slabs */ #endif - local_lock_t lock; /* Protects the fields above */ + local_trylock_t lock; /* Protects the fields above */ #ifdef CONFIG_SLUB_STATS unsigned int stat[NR_SLUB_STAT_ITEMS]; #endif @@ -2084,6 +2085,7 @@ static inline void init_slab_obj_exts(struct slab *slab) int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s, gfp_t gfp, bool new_slab) { + bool allow_spin = gfpflags_allow_spinning(gfp); unsigned int objects = objs_per_slab(s, slab); unsigned long new_exts; unsigned long old_exts; @@ -2092,8 +2094,14 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s, gfp &= ~OBJCGS_CLEAR_MASK; /* Prevent recursive extension vector allocation */ gfp |= __GFP_NO_OBJ_EXT; - vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp, - slab_nid(slab)); + if (unlikely(!allow_spin)) { + size_t sz = objects * sizeof(struct slabobj_ext); + + vec = kmalloc_nolock(sz, __GFP_ZERO, 
slab_nid(slab)); + } else { + vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp, + slab_nid(slab)); + } if (!vec) { /* Mark vectors which failed to allocate */ if (new_slab) @@ -2103,6 +2111,8 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s, } new_exts = (unsigned long)vec; + if (unlikely(!allow_spin)) + new_exts |= OBJEXTS_NOSPIN_ALLOC; #ifdef CONFIG_MEMCG new_exts |= MEMCG_DATA_OBJEXTS; #endif @@ -2123,7 +2133,10 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s, * objcg vector should be reused. */ mark_objexts_empty(vec); - kfree(vec); + if (unlikely(!allow_spin)) + kfree_nolock(vec); + else + kfree(vec); return 0; } @@ -2147,7 +2160,10 @@ static inline void free_slab_obj_exts(struct slab *slab) * the extension for obj_exts is expected to be NULL. */ mark_objexts_empty(obj_exts); - kfree(obj_exts); + if (unlikely(READ_ONCE(slab->obj_exts) & OBJEXTS_NOSPIN_ALLOC)) + kfree_nolock(obj_exts); + else + kfree(obj_exts); slab->obj_exts = 0; } @@ -2481,7 +2497,7 @@ bool slab_free_hook(struct kmem_cache *s, void *x, bool init, } /* KASAN might put x into memory quarantine, delaying its reuse. */ - return !kasan_slab_free(s, x, init, still_accessible); + return !kasan_slab_free(s, x, init, still_accessible, false); } static __fastpath_inline @@ -2986,13 +3002,17 @@ static void barn_shrink(struct kmem_cache *s, struct node_barn *barn) * Slab allocation and freeing */ static inline struct slab *alloc_slab_page(gfp_t flags, int node, - struct kmem_cache_order_objects oo) + struct kmem_cache_order_objects oo, + bool allow_spin) { struct folio *folio; struct slab *slab; unsigned int order = oo_order(oo); - if (node == NUMA_NO_NODE) + if (unlikely(!allow_spin)) + folio = (struct folio *)alloc_frozen_pages_nolock(0/* __GFP_COMP is implied */, + node, order); + else if (node == NUMA_NO_NODE) folio = (struct folio *)alloc_frozen_pages(flags, order); else folio = (struct folio *)__alloc_frozen_pages(flags, order, node, NULL); @@ -3142,6 +3162,7 @@ static __always_inline void unaccount_slab(struct slab *slab, int order, static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node) { + bool allow_spin = gfpflags_allow_spinning(flags); struct slab *slab; struct kmem_cache_order_objects oo = s->oo; gfp_t alloc_gfp; @@ -3161,7 +3182,11 @@ static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node) if ((alloc_gfp & __GFP_DIRECT_RECLAIM) && oo_order(oo) > oo_order(s->min)) alloc_gfp = (alloc_gfp | __GFP_NOMEMALLOC) & ~__GFP_RECLAIM; - slab = alloc_slab_page(alloc_gfp, node, oo); + /* + * __GFP_RECLAIM could be cleared on the first allocation attempt, + * so pass allow_spin flag directly. + */ + slab = alloc_slab_page(alloc_gfp, node, oo, allow_spin); if (unlikely(!slab)) { oo = s->min; alloc_gfp = flags; @@ -3169,7 +3194,7 @@ static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node) * Allocation may have failed due to fragmentation. * Try a lower order alloc if possible */ - slab = alloc_slab_page(alloc_gfp, node, oo); + slab = alloc_slab_page(alloc_gfp, node, oo, allow_spin); if (unlikely(!slab)) return NULL; stat(s, ORDER_FALLBACK); @@ -3338,33 +3363,47 @@ static void *alloc_single_from_partial(struct kmem_cache *s, return object; } +static void defer_deactivate_slab(struct slab *slab, void *flush_freelist); + /* * Called only for kmem_cache_debug() caches to allocate from a freshly * allocated slab. Allocate a single object instead of whole freelist * and put the slab to the partial (or full) list. 
*/ -static void *alloc_single_from_new_slab(struct kmem_cache *s, - struct slab *slab, int orig_size) +static void *alloc_single_from_new_slab(struct kmem_cache *s, struct slab *slab, + int orig_size, gfp_t gfpflags) { + bool allow_spin = gfpflags_allow_spinning(gfpflags); int nid = slab_nid(slab); struct kmem_cache_node *n = get_node(s, nid); unsigned long flags; void *object; + if (!allow_spin && !spin_trylock_irqsave(&n->list_lock, flags)) { + /* Unlucky, discard newly allocated slab */ + slab->frozen = 1; + defer_deactivate_slab(slab, NULL); + return NULL; + } object = slab->freelist; slab->freelist = get_freepointer(s, object); slab->inuse = 1; - if (!alloc_debug_processing(s, slab, object, orig_size)) + if (!alloc_debug_processing(s, slab, object, orig_size)) { /* * It's not really expected that this would fail on a * freshly allocated slab, but a concurrent memory * corruption in theory could cause that. + * Leak memory of allocated slab. */ + if (!allow_spin) + spin_unlock_irqrestore(&n->list_lock, flags); return NULL; + } - spin_lock_irqsave(&n->list_lock, flags); + if (allow_spin) + spin_lock_irqsave(&n->list_lock, flags); if (slab->inuse == slab->objects) add_full(s, n, slab); @@ -3405,7 +3444,10 @@ static struct slab *get_partial_node(struct kmem_cache *s, if (!n || !n->nr_partial) return NULL; - spin_lock_irqsave(&n->list_lock, flags); + if (gfpflags_allow_spinning(pc->flags)) + spin_lock_irqsave(&n->list_lock, flags); + else if (!spin_trylock_irqsave(&n->list_lock, flags)) + return NULL; list_for_each_entry_safe(slab, slab2, &n->partial, slab_list) { if (!pfmemalloc_match(slab, pc->flags)) continue; @@ -3611,7 +3653,7 @@ static void init_kmem_cache_cpus(struct kmem_cache *s) lockdep_register_key(&s->lock_key); for_each_possible_cpu(cpu) { c = per_cpu_ptr(s->cpu_slab, cpu); - local_lock_init(&c->lock); + local_trylock_init(&c->lock); if (finegrain_lockdep) lockdep_set_class(&c->lock, &s->lock_key); c->tid = init_tid(cpu); @@ -3704,6 +3746,44 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab, } } +/* + * ___slab_alloc()'s caller is supposed to check if kmem_cache::kmem_cache_cpu::lock + * can be acquired without a deadlock before invoking the function. + * + * Without LOCKDEP we trust the code to be correct. kmalloc_nolock() is + * using local_lock_is_locked() properly before calling local_lock_cpu_slab(), + * and kmalloc() is not used in an unsupported context. + * + * With LOCKDEP, on PREEMPT_RT lockdep does its checking in local_lock_irqsave(). + * On !PREEMPT_RT we use trylock to avoid false positives in NMI, but + * lockdep_assert() will catch a bug in case: + * #1 + * kmalloc() -> ___slab_alloc() -> irqsave -> NMI -> bpf -> kmalloc_nolock() + * or + * #2 + * kmalloc() -> ___slab_alloc() -> irqsave -> tracepoint/kprobe -> bpf -> kmalloc_nolock() + * + * On PREEMPT_RT an invocation is not possible from IRQ-off or preempt + * disabled context. The lock will always be acquired and if needed it + * block and sleep until the lock is available. + * #1 is possible in !PREEMPT_RT only. 
+ * #2 is possible in both with a twist that irqsave is replaced with rt_spinlock: + * kmalloc() -> ___slab_alloc() -> rt_spin_lock(kmem_cache_A) -> + * tracepoint/kprobe -> bpf -> kmalloc_nolock() -> rt_spin_lock(kmem_cache_B) + * + * local_lock_is_locked() prevents the case kmem_cache_A == kmem_cache_B + */ +#if defined(CONFIG_PREEMPT_RT) || !defined(CONFIG_LOCKDEP) +#define local_lock_cpu_slab(s, flags) \ + local_lock_irqsave(&(s)->cpu_slab->lock, flags) +#else +#define local_lock_cpu_slab(s, flags) \ + lockdep_assert(local_trylock_irqsave(&(s)->cpu_slab->lock, flags)) +#endif + +#define local_unlock_cpu_slab(s, flags) \ + local_unlock_irqrestore(&(s)->cpu_slab->lock, flags) + #ifdef CONFIG_SLUB_CPU_PARTIAL static void __put_partials(struct kmem_cache *s, struct slab *partial_slab) { @@ -3788,7 +3868,7 @@ static void put_cpu_partial(struct kmem_cache *s, struct slab *slab, int drain) unsigned long flags; int slabs = 0; - local_lock_irqsave(&s->cpu_slab->lock, flags); + local_lock_cpu_slab(s, flags); oldslab = this_cpu_read(s->cpu_slab->partial); @@ -3813,7 +3893,7 @@ static void put_cpu_partial(struct kmem_cache *s, struct slab *slab, int drain) this_cpu_write(s->cpu_slab->partial, slab); - local_unlock_irqrestore(&s->cpu_slab->lock, flags); + local_unlock_cpu_slab(s, flags); if (slab_to_put) { __put_partials(s, slab_to_put); @@ -4262,6 +4342,7 @@ static inline void *freeze_slab(struct kmem_cache *s, struct slab *slab) static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, unsigned long addr, struct kmem_cache_cpu *c, unsigned int orig_size) { + bool allow_spin = gfpflags_allow_spinning(gfpflags); void *freelist; struct slab *slab; unsigned long flags; @@ -4287,9 +4368,13 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, if (unlikely(!node_match(slab, node))) { /* * same as above but node_match() being false already - * implies node != NUMA_NO_NODE + * implies node != NUMA_NO_NODE. + * Reentrant slub cannot take locks necessary to + * deactivate_slab, hence ignore node preference. + * kmalloc_nolock() doesn't allow __GFP_THISNODE. 
*/ - if (!node_isset(node, slab_nodes)) { + if (!node_isset(node, slab_nodes) || + !allow_spin) { node = NUMA_NO_NODE; } else { stat(s, ALLOC_NODE_MISMATCH); @@ -4302,13 +4387,14 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, * PFMEMALLOC but right now, we are losing the pfmemalloc * information when the page leaves the per-cpu allocator */ - if (unlikely(!pfmemalloc_match(slab, gfpflags))) + if (unlikely(!pfmemalloc_match(slab, gfpflags) && allow_spin)) goto deactivate_slab; /* must check again c->slab in case we got preempted and it changed */ - local_lock_irqsave(&s->cpu_slab->lock, flags); + local_lock_cpu_slab(s, flags); + if (unlikely(slab != c->slab)) { - local_unlock_irqrestore(&s->cpu_slab->lock, flags); + local_unlock_cpu_slab(s, flags); goto reread_slab; } freelist = c->freelist; @@ -4320,7 +4406,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, if (!freelist) { c->slab = NULL; c->tid = next_tid(c->tid); - local_unlock_irqrestore(&s->cpu_slab->lock, flags); + local_unlock_cpu_slab(s, flags); stat(s, DEACTIVATE_BYPASS); goto new_slab; } @@ -4339,34 +4425,34 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, VM_BUG_ON(!c->slab->frozen); c->freelist = get_freepointer(s, freelist); c->tid = next_tid(c->tid); - local_unlock_irqrestore(&s->cpu_slab->lock, flags); + local_unlock_cpu_slab(s, flags); return freelist; deactivate_slab: - local_lock_irqsave(&s->cpu_slab->lock, flags); + local_lock_cpu_slab(s, flags); if (slab != c->slab) { - local_unlock_irqrestore(&s->cpu_slab->lock, flags); + local_unlock_cpu_slab(s, flags); goto reread_slab; } freelist = c->freelist; c->slab = NULL; c->freelist = NULL; c->tid = next_tid(c->tid); - local_unlock_irqrestore(&s->cpu_slab->lock, flags); + local_unlock_cpu_slab(s, flags); deactivate_slab(s, slab, freelist); new_slab: #ifdef CONFIG_SLUB_CPU_PARTIAL while (slub_percpu_partial(c)) { - local_lock_irqsave(&s->cpu_slab->lock, flags); + local_lock_cpu_slab(s, flags); if (unlikely(c->slab)) { - local_unlock_irqrestore(&s->cpu_slab->lock, flags); + local_unlock_cpu_slab(s, flags); goto reread_slab; } if (unlikely(!slub_percpu_partial(c))) { - local_unlock_irqrestore(&s->cpu_slab->lock, flags); + local_unlock_cpu_slab(s, flags); /* we were preempted and partial list got empty */ goto new_objects; } @@ -4374,8 +4460,14 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, slab = slub_percpu_partial(c); slub_set_percpu_partial(c, slab); + /* + * Reentrant slub cannot take locks necessary for + * __put_partials(), hence ignore node preference. + * kmalloc_nolock() doesn't allow __GFP_THISNODE. 
+ */ if (likely(node_match(slab, node) && - pfmemalloc_match(slab, gfpflags))) { + pfmemalloc_match(slab, gfpflags)) || + !allow_spin) { c->slab = slab; freelist = get_freelist(s, slab); VM_BUG_ON(!freelist); @@ -4383,7 +4475,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, goto load_freelist; } - local_unlock_irqrestore(&s->cpu_slab->lock, flags); + local_unlock_cpu_slab(s, flags); slab->next = NULL; __put_partials(s, slab); @@ -4405,8 +4497,13 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, * allocating new page from other nodes */ if (unlikely(node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) - && try_thisnode)) - pc.flags = GFP_NOWAIT | __GFP_THISNODE; + && try_thisnode)) { + if (unlikely(!allow_spin)) + /* Do not upgrade gfp to NOWAIT from more restrictive mode */ + pc.flags = gfpflags | __GFP_THISNODE; + else + pc.flags = GFP_NOWAIT | __GFP_THISNODE; + } pc.orig_size = orig_size; slab = get_partial(s, node, &pc); @@ -4445,7 +4542,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, stat(s, ALLOC_SLAB); if (kmem_cache_debug(s)) { - freelist = alloc_single_from_new_slab(s, slab, orig_size); + freelist = alloc_single_from_new_slab(s, slab, orig_size, gfpflags); if (unlikely(!freelist)) goto new_objects; @@ -4467,7 +4564,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, inc_slabs_node(s, slab_nid(slab), slab->objects); - if (unlikely(!pfmemalloc_match(slab, gfpflags))) { + if (unlikely(!pfmemalloc_match(slab, gfpflags) && allow_spin)) { /* * For !pfmemalloc_match() case we don't load freelist so that * we don't make further mismatched allocations easier. @@ -4478,7 +4575,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, retry_load_slab: - local_lock_irqsave(&s->cpu_slab->lock, flags); + local_lock_cpu_slab(s, flags); if (unlikely(c->slab)) { void *flush_freelist = c->freelist; struct slab *flush_slab = c->slab; @@ -4487,9 +4584,14 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, c->freelist = NULL; c->tid = next_tid(c->tid); - local_unlock_irqrestore(&s->cpu_slab->lock, flags); + local_unlock_cpu_slab(s, flags); - deactivate_slab(s, flush_slab, flush_freelist); + if (unlikely(!allow_spin)) { + /* Reentrant slub cannot take locks, defer */ + defer_deactivate_slab(flush_slab, flush_freelist); + } else { + deactivate_slab(s, flush_slab, flush_freelist); + } stat(s, CPUSLAB_FLUSH); @@ -4518,8 +4620,19 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, */ c = slub_get_cpu_ptr(s->cpu_slab); #endif - + if (unlikely(!gfpflags_allow_spinning(gfpflags))) { + if (local_lock_is_locked(&s->cpu_slab->lock)) { + /* + * EBUSY is an internal signal to kmalloc_nolock() to + * retry a different bucket. It's not propagated + * to the caller. 
+ */ + p = ERR_PTR(-EBUSY); + goto out; + } + } p = ___slab_alloc(s, gfpflags, node, addr, c, orig_size); +out: #ifdef CONFIG_PREEMPT_COUNT slub_put_cpu_ptr(s->cpu_slab); #endif @@ -4643,7 +4756,7 @@ static void *__slab_alloc_node(struct kmem_cache *s, return NULL; } - object = alloc_single_from_new_slab(s, slab, orig_size); + object = alloc_single_from_new_slab(s, slab, orig_size, gfpflags); return object; } @@ -4722,8 +4835,9 @@ bool slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru, if (p[i] && init && (!kasan_init || !kasan_has_integrated_init())) memset(p[i], 0, zero_size); - kmemleak_alloc_recursive(p[i], s->object_size, 1, - s->flags, init_flags); + if (gfpflags_allow_spinning(flags)) + kmemleak_alloc_recursive(p[i], s->object_size, 1, + s->flags, init_flags); kmsan_slab_alloc(s, p[i], init_flags); alloc_tagging_slab_alloc_hook(s, p[i], flags); } @@ -5390,6 +5504,94 @@ void *__kmalloc_noprof(size_t size, gfp_t flags) } EXPORT_SYMBOL(__kmalloc_noprof); +/** + * kmalloc_nolock - Allocate an object of given size from any context. + * @size: size to allocate + * @gfp_flags: GFP flags. Only __GFP_ACCOUNT, __GFP_ZERO allowed. + * @node: node number of the target node. + * + * Return: pointer to the new object or NULL in case of error. + * NULL does not mean EBUSY or EAGAIN. It means ENOMEM. + * There is no reason to call it again and expect !NULL. + */ +void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node) +{ + gfp_t alloc_gfp = __GFP_NOWARN | __GFP_NOMEMALLOC | gfp_flags; + struct kmem_cache *s; + bool can_retry = true; + void *ret = ERR_PTR(-EBUSY); + + VM_WARN_ON_ONCE(gfp_flags & ~(__GFP_ACCOUNT | __GFP_ZERO)); + + if (unlikely(!size)) + return ZERO_SIZE_PTR; + + if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq())) + /* kmalloc_nolock() in PREEMPT_RT is not supported from irq */ + return NULL; +retry: + if (unlikely(size > KMALLOC_MAX_CACHE_SIZE)) + return NULL; + s = kmalloc_slab(size, NULL, alloc_gfp, _RET_IP_); + + if (!(s->flags & __CMPXCHG_DOUBLE) && !kmem_cache_debug(s)) + /* + * kmalloc_nolock() is not supported on architectures that + * don't implement cmpxchg16b, but debug caches don't use + * per-cpu slab and per-cpu partial slabs. They rely on + * kmem_cache_node->list_lock, so kmalloc_nolock() can + * attempt to allocate from debug caches by + * spin_trylock_irqsave(&n->list_lock, ...) + */ + return NULL; + + /* + * Do not call slab_alloc_node(), since trylock mode isn't + * compatible with slab_pre_alloc_hook/should_failslab and + * kfence_alloc. Hence call __slab_alloc_node() (at most twice) + * and slab_post_alloc_hook() directly. + * + * In !PREEMPT_RT ___slab_alloc() manipulates (freelist,tid) pair + * in irq saved region. It assumes that the same cpu will not + * __update_cpu_freelist_fast() into the same (freelist,tid) pair. + * Therefore use in_nmi() to check whether particular bucket is in + * irq protected section. + * + * If in_nmi() && local_lock_is_locked(s->cpu_slab) then it means that + * this cpu was interrupted somewhere inside ___slab_alloc() after + * it did local_lock_irqsave(&s->cpu_slab->lock, flags). + * In this case fast path with __update_cpu_freelist_fast() is not safe. 
+ */ +#ifndef CONFIG_SLUB_TINY + if (!in_nmi() || !local_lock_is_locked(&s->cpu_slab->lock)) +#endif + ret = __slab_alloc_node(s, alloc_gfp, node, _RET_IP_, size); + + if (PTR_ERR(ret) == -EBUSY) { + if (can_retry) { + /* pick the next kmalloc bucket */ + size = s->object_size + 1; + /* + * Another alternative is to + * if (memcg) alloc_gfp &= ~__GFP_ACCOUNT; + * else if (!memcg) alloc_gfp |= __GFP_ACCOUNT; + * to retry from bucket of the same size. + */ + can_retry = false; + goto retry; + } + ret = NULL; + } + + maybe_wipe_obj_freeptr(s, ret); + slab_post_alloc_hook(s, NULL, alloc_gfp, 1, &ret, + slab_want_init_on_alloc(alloc_gfp, s), size); + + ret = kasan_kmalloc(s, ret, size, alloc_gfp); + return ret; +} +EXPORT_SYMBOL_GPL(kmalloc_nolock_noprof); + void *__kmalloc_node_track_caller_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node, unsigned long caller) { @@ -6039,6 +6241,93 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p) } } +struct defer_free { + struct llist_head objects; + struct llist_head slabs; + struct irq_work work; +}; + +static void free_deferred_objects(struct irq_work *work); + +static DEFINE_PER_CPU(struct defer_free, defer_free_objects) = { + .objects = LLIST_HEAD_INIT(objects), + .slabs = LLIST_HEAD_INIT(slabs), + .work = IRQ_WORK_INIT(free_deferred_objects), +}; + +/* + * In PREEMPT_RT irq_work runs in per-cpu kthread, so it's safe + * to take sleeping spin_locks from __slab_free() and deactivate_slab(). + * In !PREEMPT_RT irq_work will run after local_unlock_irqrestore(). + */ +static void free_deferred_objects(struct irq_work *work) +{ + struct defer_free *df = container_of(work, struct defer_free, work); + struct llist_head *objs = &df->objects; + struct llist_head *slabs = &df->slabs; + struct llist_node *llnode, *pos, *t; + + if (llist_empty(objs) && llist_empty(slabs)) + return; + + llnode = llist_del_all(objs); + llist_for_each_safe(pos, t, llnode) { + struct kmem_cache *s; + struct slab *slab; + void *x = pos; + + slab = virt_to_slab(x); + s = slab->slab_cache; + + /* + * We used freepointer in 'x' to link 'x' into df->objects. + * Clear it to NULL to avoid false positive detection + * of "Freepointer corruption". 
+ */ + *(void **)x = NULL; + + /* Point 'x' back to the beginning of allocated object */ + x -= s->offset; + __slab_free(s, slab, x, x, 1, _THIS_IP_); + } + + llnode = llist_del_all(slabs); + llist_for_each_safe(pos, t, llnode) { + struct slab *slab = container_of(pos, struct slab, llnode); + +#ifdef CONFIG_SLUB_TINY + discard_slab(slab->slab_cache, slab); +#else + deactivate_slab(slab->slab_cache, slab, slab->flush_freelist); +#endif + } +} + +static void defer_free(struct kmem_cache *s, void *head) +{ + struct defer_free *df = this_cpu_ptr(&defer_free_objects); + + if (llist_add(head + s->offset, &df->objects)) + irq_work_queue(&df->work); +} + +static void defer_deactivate_slab(struct slab *slab, void *flush_freelist) +{ + struct defer_free *df = this_cpu_ptr(&defer_free_objects); + + slab->flush_freelist = flush_freelist; + if (llist_add(&slab->llnode, &df->slabs)) + irq_work_queue(&df->work); +} + +void defer_free_barrier(void) +{ + int cpu; + + for_each_possible_cpu(cpu) + irq_work_sync(&per_cpu_ptr(&defer_free_objects, cpu)->work); +} + #ifndef CONFIG_SLUB_TINY /* * Fastpath with forced inlining to produce a kfree and kmem_cache_free that @@ -6059,6 +6348,8 @@ static __always_inline void do_slab_free(struct kmem_cache *s, struct slab *slab, void *head, void *tail, int cnt, unsigned long addr) { + /* cnt == 0 signals that it's called from kfree_nolock() */ + bool allow_spin = cnt; struct kmem_cache_cpu *c; unsigned long tid; void **freelist; @@ -6077,10 +6368,29 @@ static __always_inline void do_slab_free(struct kmem_cache *s, barrier(); if (unlikely(slab != c->slab)) { - __slab_free(s, slab, head, tail, cnt, addr); + if (unlikely(!allow_spin)) { + /* + * __slab_free() can locklessly cmpxchg16 into a slab, + * but then it might need to take spin_lock or local_lock + * in put_cpu_partial() for further processing. + * Avoid the complexity and simply add to a deferred list. + */ + defer_free(s, head); + } else { + __slab_free(s, slab, head, tail, cnt, addr); + } return; } + if (unlikely(!allow_spin)) { + if ((in_nmi() || !USE_LOCKLESS_FAST_PATH()) && + local_lock_is_locked(&s->cpu_slab->lock)) { + defer_free(s, head); + return; + } + cnt = 1; /* restore cnt. kfree_nolock() frees one object at a time */ + } + if (USE_LOCKLESS_FAST_PATH()) { freelist = READ_ONCE(c->freelist); @@ -6091,11 +6401,13 @@ static __always_inline void do_slab_free(struct kmem_cache *s, goto redo; } } else { + __maybe_unused unsigned long flags = 0; + /* Update the free list under the local lock */ - local_lock(&s->cpu_slab->lock); + local_lock_cpu_slab(s, flags); c = this_cpu_ptr(s->cpu_slab); if (unlikely(slab != c->slab)) { - local_unlock(&s->cpu_slab->lock); + local_unlock_cpu_slab(s, flags); goto redo; } tid = c->tid; @@ -6105,7 +6417,7 @@ static __always_inline void do_slab_free(struct kmem_cache *s, c->freelist = head; c->tid = next_tid(tid); - local_unlock(&s->cpu_slab->lock); + local_unlock_cpu_slab(s, flags); } stat_add(s, FREE_FASTPATH, cnt); } @@ -6337,6 +6649,71 @@ void kfree(const void *object) } EXPORT_SYMBOL(kfree); +/* + * Can be called while holding raw_spinlock_t or from IRQ and NMI, + * but ONLY for objects allocated by kmalloc_nolock(). + * Debug checks (like kmemleak and kfence) were skipped on allocation, + * hence + * obj = kmalloc(); kfree_nolock(obj); + * will miss kmemleak/kfence book keeping and will cause false positives. + * large_kmalloc is not supported either. 
+ */ +void kfree_nolock(const void *object) +{ + struct folio *folio; + struct slab *slab; + struct kmem_cache *s; + void *x = (void *)object; + + if (unlikely(ZERO_OR_NULL_PTR(object))) + return; + + folio = virt_to_folio(object); + if (unlikely(!folio_test_slab(folio))) { + WARN_ONCE(1, "large_kmalloc is not supported by kfree_nolock()"); + return; + } + + slab = folio_slab(folio); + s = slab->slab_cache; + + memcg_slab_free_hook(s, slab, &x, 1); + alloc_tagging_slab_free_hook(s, slab, &x, 1); + /* + * Unlike slab_free() do NOT call the following: + * kmemleak_free_recursive(x, s->flags); + * debug_check_no_locks_freed(x, s->object_size); + * debug_check_no_obj_freed(x, s->object_size); + * __kcsan_check_access(x, s->object_size, ..); + * kfence_free(x); + * since they take spinlocks or not safe from any context. + */ + kmsan_slab_free(s, x); + /* + * If KASAN finds a kernel bug it will do kasan_report_invalid_free() + * which will call raw_spin_lock_irqsave() which is technically + * unsafe from NMI, but take chance and report kernel bug. + * The sequence of + * kasan_report_invalid_free() -> raw_spin_lock_irqsave() -> NMI + * -> kfree_nolock() -> kasan_report_invalid_free() on the same CPU + * is double buggy and deserves to deadlock. + */ + if (kasan_slab_pre_free(s, x)) + return; + /* + * memcg, kasan_slab_pre_free are done for 'x'. + * The only thing left is kasan_poison without quarantine, + * since kasan quarantine takes locks and not supported from NMI. + */ + kasan_slab_free(s, x, false, false, /* skip quarantine */true); +#ifndef CONFIG_SLUB_TINY + do_slab_free(s, slab, x, x, 0, _RET_IP_); +#else + defer_free(s, x); +#endif +} +EXPORT_SYMBOL_GPL(kfree_nolock); + static __always_inline __realloc_size(2) void * __do_krealloc(const void *p, size_t new_size, gfp_t flags) { -- 2.47.3 ^ permalink raw reply [flat|nested] 45+ messages in thread
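To make the new API concrete, here is a minimal caller sketch (not part of the patch; the struct and function names are hypothetical) showing how kmalloc_nolock()/kfree_nolock() are meant to be used from a context where spinning is not allowed:

#include <linux/types.h>
#include <linux/slab.h>
#include <linux/sched/clock.h>
#include <linux/smp.h>

/* Hypothetical record type, for illustration only. */
struct nolock_rec {
	u64 ts;
	u32 cpu;
};

/*
 * Can be called from NMI, hard IRQ, or under a raw_spinlock_t.
 * On PREEMPT_RT it simply returns NULL from NMI/hard IRQ context.
 */
static struct nolock_rec *rec_alloc_any_context(void)
{
	/* Only __GFP_ACCOUNT and __GFP_ZERO are allowed in gfp_flags. */
	struct nolock_rec *rec = kmalloc_nolock(sizeof(*rec), __GFP_ZERO,
						NUMA_NO_NODE);

	/* NULL means ENOMEM; retrying immediately will not help. */
	if (!rec)
		return NULL;

	rec->ts = local_clock();
	rec->cpu = raw_smp_processor_id();
	return rec;
}

static void rec_free_any_context(struct nolock_rec *rec)
{
	/*
	 * kfree_nolock() only for objects obtained from kmalloc_nolock();
	 * a plain kmalloc()-ed object would miss kmemleak/kfence bookkeeping.
	 */
	kfree_nolock(rec);
}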
* Re: [PATCH slab v5 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock(). 2025-09-09 1:00 ` [PATCH slab v5 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock() Alexei Starovoitov @ 2025-09-15 12:52 ` Harry Yoo 2025-09-15 14:39 ` Vlastimil Babka 2025-09-16 1:00 ` Alexei Starovoitov 2025-09-24 0:40 ` Harry Yoo 1 sibling, 2 replies; 45+ messages in thread From: Harry Yoo @ 2025-09-15 12:52 UTC (permalink / raw) To: Alexei Starovoitov Cc: bpf, linux-mm, vbabka, shakeel.butt, mhocko, bigeasy, andrii, memxor, akpm, peterz, rostedt, hannes On Mon, Sep 08, 2025 at 06:00:07PM -0700, Alexei Starovoitov wrote: > From: Alexei Starovoitov <ast@kernel.org> > > kmalloc_nolock() relies on ability of local_trylock_t to detect > the situation when per-cpu kmem_cache is locked. > > In !PREEMPT_RT local_(try)lock_irqsave(&s->cpu_slab->lock, flags) > disables IRQs and marks s->cpu_slab->lock as acquired. > local_lock_is_locked(&s->cpu_slab->lock) returns true when > slab is in the middle of manipulating per-cpu cache > of that specific kmem_cache. > > kmalloc_nolock() can be called from any context and can re-enter > into ___slab_alloc(): > kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> NMI -> bpf -> > kmalloc_nolock() -> ___slab_alloc(cache_B) > or > kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> tracepoint/kprobe -> bpf -> > kmalloc_nolock() -> ___slab_alloc(cache_B) > > Hence the caller of ___slab_alloc() checks if &s->cpu_slab->lock > can be acquired without a deadlock before invoking the function. > If that specific per-cpu kmem_cache is busy the kmalloc_nolock() > retries in a different kmalloc bucket. The second attempt will > likely succeed, since this cpu locked different kmem_cache. > > Similarly, in PREEMPT_RT local_lock_is_locked() returns true when > per-cpu rt_spin_lock is locked by current _task_. In this case > re-entrance into the same kmalloc bucket is unsafe, and > kmalloc_nolock() tries a different bucket that is most likely is > not locked by the current task. Though it may be locked by a > different task it's safe to rt_spin_lock() and sleep on it. > > Similar to alloc_pages_nolock() the kmalloc_nolock() returns NULL > immediately if called from hard irq or NMI in PREEMPT_RT. > > kfree_nolock() defers freeing to irq_work when local_lock_is_locked() > and (in_nmi() or in PREEMPT_RT). > > SLUB_TINY config doesn't use local_lock_is_locked() and relies on > spin_trylock_irqsave(&n->list_lock) to allocate, > while kfree_nolock() always defers to irq_work. > > Note, kfree_nolock() must be called _only_ for objects allocated > with kmalloc_nolock(). Debug checks (like kmemleak and kfence) > were skipped on allocation, hence obj = kmalloc(); kfree_nolock(obj); > will miss kmemleak/kfence book keeping and will cause false positives. > large_kmalloc is not supported by either kmalloc_nolock() > or kfree_nolock(). > > Signed-off-by: Alexei Starovoitov <ast@kernel.org> > --- > include/linux/kasan.h | 13 +- > include/linux/memcontrol.h | 2 + > include/linux/slab.h | 4 + > mm/Kconfig | 1 + > mm/kasan/common.c | 5 +- > mm/slab.h | 6 + > mm/slab_common.c | 3 + > mm/slub.c | 473 +++++++++++++++++++++++++++++++++---- > 8 files changed, 453 insertions(+), 54 deletions(-) > @@ -3704,6 +3746,44 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab, > } > } > > +/* > + * ___slab_alloc()'s caller is supposed to check if kmem_cache::kmem_cache_cpu::lock > + * can be acquired without a deadlock before invoking the function. 
> + * > + * Without LOCKDEP we trust the code to be correct. kmalloc_nolock() is > + * using local_lock_is_locked() properly before calling local_lock_cpu_slab(), > + * and kmalloc() is not used in an unsupported context. > + * > + * With LOCKDEP, on PREEMPT_RT lockdep does its checking in local_lock_irqsave(). > + * On !PREEMPT_RT we use trylock to avoid false positives in NMI, but > + * lockdep_assert() will catch a bug in case: > + * #1 > + * kmalloc() -> ___slab_alloc() -> irqsave -> NMI -> bpf -> kmalloc_nolock() > + * or > + * #2 > + * kmalloc() -> ___slab_alloc() -> irqsave -> tracepoint/kprobe -> bpf -> kmalloc_nolock() > + * > + * On PREEMPT_RT an invocation is not possible from IRQ-off or preempt > + * disabled context. The lock will always be acquired and if needed it > + * block and sleep until the lock is available. > + * #1 is possible in !PREEMPT_RT only. > + * #2 is possible in both with a twist that irqsave is replaced with rt_spinlock: > + * kmalloc() -> ___slab_alloc() -> rt_spin_lock(kmem_cache_A) -> > + * tracepoint/kprobe -> bpf -> kmalloc_nolock() -> rt_spin_lock(kmem_cache_B) > + * > + * local_lock_is_locked() prevents the case kmem_cache_A == kmem_cache_B > + */ > +#if defined(CONFIG_PREEMPT_RT) || !defined(CONFIG_LOCKDEP) > +#define local_lock_cpu_slab(s, flags) \ > + local_lock_irqsave(&(s)->cpu_slab->lock, flags) > +#else > +#define local_lock_cpu_slab(s, flags) \ > + lockdep_assert(local_trylock_irqsave(&(s)->cpu_slab->lock, flags)) > +#endif > + > +#define local_unlock_cpu_slab(s, flags) \ > + local_unlock_irqrestore(&(s)->cpu_slab->lock, flags) nit: Do we still need this trick with patch "slab: Make slub local_(try)lock more precise for LOCKDEP"? > #ifdef CONFIG_SLUB_CPU_PARTIAL > static void __put_partials(struct kmem_cache *s, struct slab *partial_slab) > { > @@ -4262,6 +4342,7 @@ static inline void *freeze_slab(struct kmem_cache *s, struct slab *slab) > static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, > unsigned long addr, struct kmem_cache_cpu *c, unsigned int orig_size) > { > + bool allow_spin = gfpflags_allow_spinning(gfpflags); > void *freelist; > struct slab *slab; > unsigned long flags; > @@ -4287,9 +4368,13 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, > if (unlikely(!node_match(slab, node))) { > /* > * same as above but node_match() being false already > - * implies node != NUMA_NO_NODE > + * implies node != NUMA_NO_NODE. > + * Reentrant slub cannot take locks necessary to > + * deactivate_slab, hence ignore node preference. nit: the comment is obsolte? Per previous discussion there were two points. Maybe something like this? /* * We don't strictly honor pfmemalloc and NUMA preferences when * !allow_spin because: * * 1. Most kmalloc() users allocate objects on the local node, * so kmalloc_nolock() tries not to interfere with them by * deactivating the cpu slab. * * 2. Deactivating due to NUMA or pfmemalloc mismatch may cause * unnecessary slab allocations even when n->partial list is not empty. */ ...or if you don't feel like it's not worth documenting, just removing the misleading comment is fine. > + * kmalloc_nolock() doesn't allow __GFP_THISNODE. 
> */ > - if (!node_isset(node, slab_nodes)) { > + if (!node_isset(node, slab_nodes) || > + !allow_spin) { > node = NUMA_NO_NODE; > } else { > stat(s, ALLOC_NODE_MISMATCH); > @@ -4374,8 +4460,14 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, > slab = slub_percpu_partial(c); > slub_set_percpu_partial(c, slab); > > + /* > + * Reentrant slub cannot take locks necessary for > + * __put_partials(), hence ignore node preference. nit: same here. > + * kmalloc_nolock() doesn't allow __GFP_THISNODE. > + */ > if (likely(node_match(slab, node) && > - pfmemalloc_match(slab, gfpflags))) { > + pfmemalloc_match(slab, gfpflags)) || > + !allow_spin) { > c->slab = slab; > freelist = get_freelist(s, slab); > VM_BUG_ON(!freelist); > @@ -5390,6 +5504,94 @@ void *__kmalloc_noprof(size_t size, gfp_t flags) > } > EXPORT_SYMBOL(__kmalloc_noprof); > > +/** > + * kmalloc_nolock - Allocate an object of given size from any context. > + * @size: size to allocate > + * @gfp_flags: GFP flags. Only __GFP_ACCOUNT, __GFP_ZERO allowed. > + * @node: node number of the target node. > + * > + * Return: pointer to the new object or NULL in case of error. > + * NULL does not mean EBUSY or EAGAIN. It means ENOMEM. > + * There is no reason to call it again and expect !NULL. > + */ > +void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node) > +{ > + gfp_t alloc_gfp = __GFP_NOWARN | __GFP_NOMEMALLOC | gfp_flags; > + struct kmem_cache *s; > + bool can_retry = true; > + void *ret = ERR_PTR(-EBUSY); > + > + VM_WARN_ON_ONCE(gfp_flags & ~(__GFP_ACCOUNT | __GFP_ZERO)); > + > + if (unlikely(!size)) > + return ZERO_SIZE_PTR; > + > + if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq())) > + /* kmalloc_nolock() in PREEMPT_RT is not supported from irq */ > + return NULL; > +retry: > + if (unlikely(size > KMALLOC_MAX_CACHE_SIZE)) > + return NULL; > + s = kmalloc_slab(size, NULL, alloc_gfp, _RET_IP_); > + > + if (!(s->flags & __CMPXCHG_DOUBLE) && !kmem_cache_debug(s)) > + /* > + * kmalloc_nolock() is not supported on architectures that > + * don't implement cmpxchg16b, but debug caches don't use > + * per-cpu slab and per-cpu partial slabs. They rely on > + * kmem_cache_node->list_lock, so kmalloc_nolock() can > + * attempt to allocate from debug caches by > + * spin_trylock_irqsave(&n->list_lock, ...) > + */ > + return NULL; > + > + /* > + * Do not call slab_alloc_node(), since trylock mode isn't > + * compatible with slab_pre_alloc_hook/should_failslab and > + * kfence_alloc. Hence call __slab_alloc_node() (at most twice) > + * and slab_post_alloc_hook() directly. > + * > + * In !PREEMPT_RT ___slab_alloc() manipulates (freelist,tid) pair > + * in irq saved region. It assumes that the same cpu will not > + * __update_cpu_freelist_fast() into the same (freelist,tid) pair. > + * Therefore use in_nmi() to check whether particular bucket is in > + * irq protected section. > + * > + * If in_nmi() && local_lock_is_locked(s->cpu_slab) then it means that > + * this cpu was interrupted somewhere inside ___slab_alloc() after > + * it did local_lock_irqsave(&s->cpu_slab->lock, flags). > + * In this case fast path with __update_cpu_freelist_fast() is not safe. 
> + */ > +#ifndef CONFIG_SLUB_TINY > + if (!in_nmi() || !local_lock_is_locked(&s->cpu_slab->lock)) > +#endif On !PREEMPT_RT, how does the kernel know that it should not use the lockless fastpath in kmalloc_nolock() in the following path: kmalloc() -> ___slab_alloc() -> irqsave -> tracepoint/kprobe -> bpf -> kmalloc_nolock() For the same reason as in NMIs (as slowpath doesn't expect that). Maybe check if interrupts are disabled instead of in_nmi()? Otherwise looks good to me. -- Cheers, Harry / Hyeonggon > + ret = __slab_alloc_node(s, alloc_gfp, node, _RET_IP_, size); > + > + if (PTR_ERR(ret) == -EBUSY) { > + if (can_retry) { > + /* pick the next kmalloc bucket */ > + size = s->object_size + 1; > + /* > + * Another alternative is to > + * if (memcg) alloc_gfp &= ~__GFP_ACCOUNT; > + * else if (!memcg) alloc_gfp |= __GFP_ACCOUNT; > + * to retry from bucket of the same size. > + */ > + can_retry = false; > + goto retry; > + } > + ret = NULL; > + } > + > + maybe_wipe_obj_freeptr(s, ret); > + slab_post_alloc_hook(s, NULL, alloc_gfp, 1, &ret, > + slab_want_init_on_alloc(alloc_gfp, s), size); > + > + ret = kasan_kmalloc(s, ret, size, alloc_gfp); > + return ret; > +} > +EXPORT_SYMBOL_GPL(kmalloc_nolock_noprof); ^ permalink raw reply [flat|nested] 45+ messages in thread
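To make the last review point above concrete, the suggestion amounts to a guard along these lines — a hypothetical variant for illustration, not what the patch does (the follow-up replies explain why a bpf-side denylist was preferred instead):

	/*
	 * Hypothetical variant of the guard in kmalloc_nolock_noprof():
	 * gate the fastpath on irqs_disabled() instead of in_nmi(), so that
	 * kmalloc() -> ___slab_alloc() -> irqsave -> kprobe -> bpf ->
	 * kmalloc_nolock() is treated the same way as the NMI case.
	 */
#ifndef CONFIG_SLUB_TINY
	if (!irqs_disabled() || !local_lock_is_locked(&s->cpu_slab->lock))
#endif
		ret = __slab_alloc_node(s, alloc_gfp, node, _RET_IP_, size);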
* Re: [PATCH slab v5 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock(). 2025-09-15 12:52 ` Harry Yoo @ 2025-09-15 14:39 ` Vlastimil Babka 2025-09-16 0:56 ` Alexei Starovoitov 2025-09-16 1:00 ` Alexei Starovoitov 1 sibling, 1 reply; 45+ messages in thread From: Vlastimil Babka @ 2025-09-15 14:39 UTC (permalink / raw) To: Harry Yoo, Alexei Starovoitov Cc: bpf, linux-mm, shakeel.butt, mhocko, bigeasy, andrii, memxor, akpm, peterz, rostedt, hannes On 9/15/25 14:52, Harry Yoo wrote: > On Mon, Sep 08, 2025 at 06:00:07PM -0700, Alexei Starovoitov wrote: >> From: Alexei Starovoitov <ast@kernel.org> >> >> kmalloc_nolock() relies on ability of local_trylock_t to detect >> the situation when per-cpu kmem_cache is locked. >> >> In !PREEMPT_RT local_(try)lock_irqsave(&s->cpu_slab->lock, flags) >> disables IRQs and marks s->cpu_slab->lock as acquired. >> local_lock_is_locked(&s->cpu_slab->lock) returns true when >> slab is in the middle of manipulating per-cpu cache >> of that specific kmem_cache. >> >> kmalloc_nolock() can be called from any context and can re-enter >> into ___slab_alloc(): >> kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> NMI -> bpf -> >> kmalloc_nolock() -> ___slab_alloc(cache_B) >> or >> kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> tracepoint/kprobe -> bpf -> >> kmalloc_nolock() -> ___slab_alloc(cache_B) >> >> Hence the caller of ___slab_alloc() checks if &s->cpu_slab->lock >> can be acquired without a deadlock before invoking the function. >> If that specific per-cpu kmem_cache is busy the kmalloc_nolock() >> retries in a different kmalloc bucket. The second attempt will >> likely succeed, since this cpu locked different kmem_cache. >> >> Similarly, in PREEMPT_RT local_lock_is_locked() returns true when >> per-cpu rt_spin_lock is locked by current _task_. In this case >> re-entrance into the same kmalloc bucket is unsafe, and >> kmalloc_nolock() tries a different bucket that is most likely is >> not locked by the current task. Though it may be locked by a >> different task it's safe to rt_spin_lock() and sleep on it. >> >> Similar to alloc_pages_nolock() the kmalloc_nolock() returns NULL >> immediately if called from hard irq or NMI in PREEMPT_RT. >> >> kfree_nolock() defers freeing to irq_work when local_lock_is_locked() >> and (in_nmi() or in PREEMPT_RT). >> >> SLUB_TINY config doesn't use local_lock_is_locked() and relies on >> spin_trylock_irqsave(&n->list_lock) to allocate, >> while kfree_nolock() always defers to irq_work. >> >> Note, kfree_nolock() must be called _only_ for objects allocated >> with kmalloc_nolock(). Debug checks (like kmemleak and kfence) >> were skipped on allocation, hence obj = kmalloc(); kfree_nolock(obj); >> will miss kmemleak/kfence book keeping and will cause false positives. >> large_kmalloc is not supported by either kmalloc_nolock() >> or kfree_nolock(). >> >> Signed-off-by: Alexei Starovoitov <ast@kernel.org> >> --- >> include/linux/kasan.h | 13 +- >> include/linux/memcontrol.h | 2 + >> include/linux/slab.h | 4 + >> mm/Kconfig | 1 + >> mm/kasan/common.c | 5 +- >> mm/slab.h | 6 + >> mm/slab_common.c | 3 + >> mm/slub.c | 473 +++++++++++++++++++++++++++++++++---- >> 8 files changed, 453 insertions(+), 54 deletions(-) >> @@ -3704,6 +3746,44 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab, >> } >> } >> >> +/* >> + * ___slab_alloc()'s caller is supposed to check if kmem_cache::kmem_cache_cpu::lock >> + * can be acquired without a deadlock before invoking the function. 
>> + * >> + * Without LOCKDEP we trust the code to be correct. kmalloc_nolock() is >> + * using local_lock_is_locked() properly before calling local_lock_cpu_slab(), >> + * and kmalloc() is not used in an unsupported context. >> + * >> + * With LOCKDEP, on PREEMPT_RT lockdep does its checking in local_lock_irqsave(). >> + * On !PREEMPT_RT we use trylock to avoid false positives in NMI, but >> + * lockdep_assert() will catch a bug in case: >> + * #1 >> + * kmalloc() -> ___slab_alloc() -> irqsave -> NMI -> bpf -> kmalloc_nolock() >> + * or >> + * #2 >> + * kmalloc() -> ___slab_alloc() -> irqsave -> tracepoint/kprobe -> bpf -> kmalloc_nolock() >> + * >> + * On PREEMPT_RT an invocation is not possible from IRQ-off or preempt >> + * disabled context. The lock will always be acquired and if needed it >> + * block and sleep until the lock is available. >> + * #1 is possible in !PREEMPT_RT only. >> + * #2 is possible in both with a twist that irqsave is replaced with rt_spinlock: >> + * kmalloc() -> ___slab_alloc() -> rt_spin_lock(kmem_cache_A) -> >> + * tracepoint/kprobe -> bpf -> kmalloc_nolock() -> rt_spin_lock(kmem_cache_B) >> + * >> + * local_lock_is_locked() prevents the case kmem_cache_A == kmem_cache_B >> + */ >> +#if defined(CONFIG_PREEMPT_RT) || !defined(CONFIG_LOCKDEP) >> +#define local_lock_cpu_slab(s, flags) \ >> + local_lock_irqsave(&(s)->cpu_slab->lock, flags) >> +#else >> +#define local_lock_cpu_slab(s, flags) \ >> + lockdep_assert(local_trylock_irqsave(&(s)->cpu_slab->lock, flags)) >> +#endif >> + >> +#define local_unlock_cpu_slab(s, flags) \ >> + local_unlock_irqrestore(&(s)->cpu_slab->lock, flags) > > nit: Do we still need this trick with patch "slab: Make slub local_(try)lock > more precise for LOCKDEP"? I think we only make it more precise on PREEMPT_RT because on !PREEMPT_RT we can avoid it using this trick. It's probably better for lockdep's overhead to avoid the class-per-cache when we can. Perhaps we can even improve by having a special class only for kmalloc caches? With kmalloc_nolock we shouldn't ever recurse from one non-kmalloc cache to another non-kmalloc cache? >> >> +/** >> + * kmalloc_nolock - Allocate an object of given size from any context. >> + * @size: size to allocate >> + * @gfp_flags: GFP flags. Only __GFP_ACCOUNT, __GFP_ZERO allowed. >> + * @node: node number of the target node. >> + * >> + * Return: pointer to the new object or NULL in case of error. >> + * NULL does not mean EBUSY or EAGAIN. It means ENOMEM. >> + * There is no reason to call it again and expect !NULL. >> + */ >> +void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node) >> +{ >> + gfp_t alloc_gfp = __GFP_NOWARN | __GFP_NOMEMALLOC | gfp_flags; >> + struct kmem_cache *s; >> + bool can_retry = true; >> + void *ret = ERR_PTR(-EBUSY); >> + >> + VM_WARN_ON_ONCE(gfp_flags & ~(__GFP_ACCOUNT | __GFP_ZERO)); >> + >> + if (unlikely(!size)) >> + return ZERO_SIZE_PTR; >> + >> + if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq())) >> + /* kmalloc_nolock() in PREEMPT_RT is not supported from irq */ >> + return NULL; >> +retry: >> + if (unlikely(size > KMALLOC_MAX_CACHE_SIZE)) >> + return NULL; >> + s = kmalloc_slab(size, NULL, alloc_gfp, _RET_IP_); >> + >> + if (!(s->flags & __CMPXCHG_DOUBLE) && !kmem_cache_debug(s)) >> + /* >> + * kmalloc_nolock() is not supported on architectures that >> + * don't implement cmpxchg16b, but debug caches don't use >> + * per-cpu slab and per-cpu partial slabs. 
They rely on >> + * kmem_cache_node->list_lock, so kmalloc_nolock() can >> + * attempt to allocate from debug caches by >> + * spin_trylock_irqsave(&n->list_lock, ...) >> + */ >> + return NULL; >> + >> + /* >> + * Do not call slab_alloc_node(), since trylock mode isn't >> + * compatible with slab_pre_alloc_hook/should_failslab and >> + * kfence_alloc. Hence call __slab_alloc_node() (at most twice) >> + * and slab_post_alloc_hook() directly. >> + * >> + * In !PREEMPT_RT ___slab_alloc() manipulates (freelist,tid) pair >> + * in irq saved region. It assumes that the same cpu will not >> + * __update_cpu_freelist_fast() into the same (freelist,tid) pair. >> + * Therefore use in_nmi() to check whether particular bucket is in >> + * irq protected section. >> + * >> + * If in_nmi() && local_lock_is_locked(s->cpu_slab) then it means that >> + * this cpu was interrupted somewhere inside ___slab_alloc() after >> + * it did local_lock_irqsave(&s->cpu_slab->lock, flags). >> + * In this case fast path with __update_cpu_freelist_fast() is not safe. >> + */ >> +#ifndef CONFIG_SLUB_TINY >> + if (!in_nmi() || !local_lock_is_locked(&s->cpu_slab->lock)) >> +#endif > > On !PREEMPT_RT, how does the kernel know that it should not use > the lockless fastpath in kmalloc_nolock() in the following path: > > kmalloc() -> ___slab_alloc() -> irqsave -> tracepoint/kprobe -> bpf -> kmalloc_nolock() > > For the same reason as in NMIs (as slowpath doesn't expect that). Hmm... seems a good point, unless I'm missing something. > Maybe check if interrupts are disabled instead of in_nmi()? Why not just check for local_lock_is_locked(&s->cpu_slab->lock) then and just remove the "!in_nmi() ||" part? There shouldn't be false positives? > Otherwise looks good to me. > ^ permalink raw reply [flat|nested] 45+ messages in thread
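For context on why simply dropping the "!in_nmi() ||" part is not sufficient: outside NMI (or another preemption-disabled region) the task may migrate between inspecting the per-cpu lock and allocating, so the lock it looked at may belong to a different CPU by then. The sketch below shows the CPU-pinned variant that the next reply describes as ugly — illustrative only, not merged:

	/* Hypothetical replacement for the guarded call in kmalloc_nolock_noprof() */
	slub_get_cpu_ptr(s->cpu_slab);	/* pins the CPU on CONFIG_PREEMPT_COUNT kernels */
	if (!local_lock_is_locked(&s->cpu_slab->lock))
		ret = __slab_alloc_node(s, alloc_gfp, node, _RET_IP_, size);
	slub_put_cpu_ptr(s->cpu_slab);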
* Re: [PATCH slab v5 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock(). 2025-09-15 14:39 ` Vlastimil Babka @ 2025-09-16 0:56 ` Alexei Starovoitov 2025-09-16 9:55 ` Vlastimil Babka 0 siblings, 1 reply; 45+ messages in thread From: Alexei Starovoitov @ 2025-09-16 0:56 UTC (permalink / raw) To: Vlastimil Babka Cc: Harry Yoo, bpf, linux-mm, Shakeel Butt, Michal Hocko, Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner On Mon, Sep 15, 2025 at 7:39 AM Vlastimil Babka <vbabka@suse.cz> wrote: > > On 9/15/25 14:52, Harry Yoo wrote: > > On Mon, Sep 08, 2025 at 06:00:07PM -0700, Alexei Starovoitov wrote: > >> From: Alexei Starovoitov <ast@kernel.org> > >> > >> kmalloc_nolock() relies on ability of local_trylock_t to detect > >> the situation when per-cpu kmem_cache is locked. > >> > >> In !PREEMPT_RT local_(try)lock_irqsave(&s->cpu_slab->lock, flags) > >> disables IRQs and marks s->cpu_slab->lock as acquired. > >> local_lock_is_locked(&s->cpu_slab->lock) returns true when > >> slab is in the middle of manipulating per-cpu cache > >> of that specific kmem_cache. > >> > >> kmalloc_nolock() can be called from any context and can re-enter > >> into ___slab_alloc(): > >> kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> NMI -> bpf -> > >> kmalloc_nolock() -> ___slab_alloc(cache_B) > >> or > >> kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> tracepoint/kprobe -> bpf -> > >> kmalloc_nolock() -> ___slab_alloc(cache_B) > >> > >> Hence the caller of ___slab_alloc() checks if &s->cpu_slab->lock > >> can be acquired without a deadlock before invoking the function. > >> If that specific per-cpu kmem_cache is busy the kmalloc_nolock() > >> retries in a different kmalloc bucket. The second attempt will > >> likely succeed, since this cpu locked different kmem_cache. > >> > >> Similarly, in PREEMPT_RT local_lock_is_locked() returns true when > >> per-cpu rt_spin_lock is locked by current _task_. In this case > >> re-entrance into the same kmalloc bucket is unsafe, and > >> kmalloc_nolock() tries a different bucket that is most likely is > >> not locked by the current task. Though it may be locked by a > >> different task it's safe to rt_spin_lock() and sleep on it. > >> > >> Similar to alloc_pages_nolock() the kmalloc_nolock() returns NULL > >> immediately if called from hard irq or NMI in PREEMPT_RT. > >> > >> kfree_nolock() defers freeing to irq_work when local_lock_is_locked() > >> and (in_nmi() or in PREEMPT_RT). > >> > >> SLUB_TINY config doesn't use local_lock_is_locked() and relies on > >> spin_trylock_irqsave(&n->list_lock) to allocate, > >> while kfree_nolock() always defers to irq_work. > >> > >> Note, kfree_nolock() must be called _only_ for objects allocated > >> with kmalloc_nolock(). Debug checks (like kmemleak and kfence) > >> were skipped on allocation, hence obj = kmalloc(); kfree_nolock(obj); > >> will miss kmemleak/kfence book keeping and will cause false positives. > >> large_kmalloc is not supported by either kmalloc_nolock() > >> or kfree_nolock(). 
> >> > >> Signed-off-by: Alexei Starovoitov <ast@kernel.org> > >> --- > >> include/linux/kasan.h | 13 +- > >> include/linux/memcontrol.h | 2 + > >> include/linux/slab.h | 4 + > >> mm/Kconfig | 1 + > >> mm/kasan/common.c | 5 +- > >> mm/slab.h | 6 + > >> mm/slab_common.c | 3 + > >> mm/slub.c | 473 +++++++++++++++++++++++++++++++++---- > >> 8 files changed, 453 insertions(+), 54 deletions(-) > >> @@ -3704,6 +3746,44 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab, > >> } > >> } > >> > >> +/* > >> + * ___slab_alloc()'s caller is supposed to check if kmem_cache::kmem_cache_cpu::lock > >> + * can be acquired without a deadlock before invoking the function. > >> + * > >> + * Without LOCKDEP we trust the code to be correct. kmalloc_nolock() is > >> + * using local_lock_is_locked() properly before calling local_lock_cpu_slab(), > >> + * and kmalloc() is not used in an unsupported context. > >> + * > >> + * With LOCKDEP, on PREEMPT_RT lockdep does its checking in local_lock_irqsave(). > >> + * On !PREEMPT_RT we use trylock to avoid false positives in NMI, but > >> + * lockdep_assert() will catch a bug in case: > >> + * #1 > >> + * kmalloc() -> ___slab_alloc() -> irqsave -> NMI -> bpf -> kmalloc_nolock() > >> + * or > >> + * #2 > >> + * kmalloc() -> ___slab_alloc() -> irqsave -> tracepoint/kprobe -> bpf -> kmalloc_nolock() > >> + * > >> + * On PREEMPT_RT an invocation is not possible from IRQ-off or preempt > >> + * disabled context. The lock will always be acquired and if needed it > >> + * block and sleep until the lock is available. > >> + * #1 is possible in !PREEMPT_RT only. > >> + * #2 is possible in both with a twist that irqsave is replaced with rt_spinlock: > >> + * kmalloc() -> ___slab_alloc() -> rt_spin_lock(kmem_cache_A) -> > >> + * tracepoint/kprobe -> bpf -> kmalloc_nolock() -> rt_spin_lock(kmem_cache_B) > >> + * > >> + * local_lock_is_locked() prevents the case kmem_cache_A == kmem_cache_B > >> + */ > >> +#if defined(CONFIG_PREEMPT_RT) || !defined(CONFIG_LOCKDEP) > >> +#define local_lock_cpu_slab(s, flags) \ > >> + local_lock_irqsave(&(s)->cpu_slab->lock, flags) > >> +#else > >> +#define local_lock_cpu_slab(s, flags) \ > >> + lockdep_assert(local_trylock_irqsave(&(s)->cpu_slab->lock, flags)) > >> +#endif > >> + > >> +#define local_unlock_cpu_slab(s, flags) \ > >> + local_unlock_irqrestore(&(s)->cpu_slab->lock, flags) > > > > nit: Do we still need this trick with patch "slab: Make slub local_(try)lock > > more precise for LOCKDEP"? > > I think we only make it more precise on PREEMPT_RT because on !PREEMPT_RT we > can avoid it using this trick. It's probably better for lockdep's overhead > to avoid the class-per-cache when we can. yes. > Perhaps we can even improve by having a special class only for kmalloc > caches? With kmalloc_nolock we shouldn't ever recurse from one non-kmalloc > cache to another non-kmalloc cache? Probably correct. The current algorithm of kmalloc_nolock() (pick a different bucket) works only for kmalloc caches, so other caches won't see _nolock() version any time soon... but caches are mergeable, so other kmem_cache_create()-d cache might get merged with kmalloc ? Still shouldn't be an issue. I guess we can fine tune "bool finegrain_lockdep" in that patch to make it false for non-kmalloc caches, but I don't know how to do it. Some flag struct kmem_cache ? I can do a follow up. > >> > >> +/** > >> + * kmalloc_nolock - Allocate an object of given size from any context. > >> + * @size: size to allocate > >> + * @gfp_flags: GFP flags. 
Only __GFP_ACCOUNT, __GFP_ZERO allowed. > >> + * @node: node number of the target node. > >> + * > >> + * Return: pointer to the new object or NULL in case of error. > >> + * NULL does not mean EBUSY or EAGAIN. It means ENOMEM. > >> + * There is no reason to call it again and expect !NULL. > >> + */ > >> +void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node) > >> +{ > >> + gfp_t alloc_gfp = __GFP_NOWARN | __GFP_NOMEMALLOC | gfp_flags; > >> + struct kmem_cache *s; > >> + bool can_retry = true; > >> + void *ret = ERR_PTR(-EBUSY); > >> + > >> + VM_WARN_ON_ONCE(gfp_flags & ~(__GFP_ACCOUNT | __GFP_ZERO)); > >> + > >> + if (unlikely(!size)) > >> + return ZERO_SIZE_PTR; > >> + > >> + if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq())) > >> + /* kmalloc_nolock() in PREEMPT_RT is not supported from irq */ > >> + return NULL; > >> +retry: > >> + if (unlikely(size > KMALLOC_MAX_CACHE_SIZE)) > >> + return NULL; > >> + s = kmalloc_slab(size, NULL, alloc_gfp, _RET_IP_); > >> + > >> + if (!(s->flags & __CMPXCHG_DOUBLE) && !kmem_cache_debug(s)) > >> + /* > >> + * kmalloc_nolock() is not supported on architectures that > >> + * don't implement cmpxchg16b, but debug caches don't use > >> + * per-cpu slab and per-cpu partial slabs. They rely on > >> + * kmem_cache_node->list_lock, so kmalloc_nolock() can > >> + * attempt to allocate from debug caches by > >> + * spin_trylock_irqsave(&n->list_lock, ...) > >> + */ > >> + return NULL; > >> + > >> + /* > >> + * Do not call slab_alloc_node(), since trylock mode isn't > >> + * compatible with slab_pre_alloc_hook/should_failslab and > >> + * kfence_alloc. Hence call __slab_alloc_node() (at most twice) > >> + * and slab_post_alloc_hook() directly. > >> + * > >> + * In !PREEMPT_RT ___slab_alloc() manipulates (freelist,tid) pair > >> + * in irq saved region. It assumes that the same cpu will not > >> + * __update_cpu_freelist_fast() into the same (freelist,tid) pair. > >> + * Therefore use in_nmi() to check whether particular bucket is in > >> + * irq protected section. > >> + * > >> + * If in_nmi() && local_lock_is_locked(s->cpu_slab) then it means that > >> + * this cpu was interrupted somewhere inside ___slab_alloc() after > >> + * it did local_lock_irqsave(&s->cpu_slab->lock, flags). > >> + * In this case fast path with __update_cpu_freelist_fast() is not safe. > >> + */ > >> +#ifndef CONFIG_SLUB_TINY > >> + if (!in_nmi() || !local_lock_is_locked(&s->cpu_slab->lock)) > >> +#endif > > > > On !PREEMPT_RT, how does the kernel know that it should not use > > the lockless fastpath in kmalloc_nolock() in the following path: > > > > kmalloc() -> ___slab_alloc() -> irqsave -> tracepoint/kprobe -> bpf -> kmalloc_nolock() > > > > For the same reason as in NMIs (as slowpath doesn't expect that). > > Hmm... seems a good point, unless I'm missing something. Good point indeed. tracepoints are not an issue, since there are no tracepoints in the middle of freelist operations, but kprobe in the middle of ___slab_alloc() is indeed problematic. > > > Maybe check if interrupts are disabled instead of in_nmi()? but calling if (irqs_disabled()) isn't fast (list time I benchmarked it) and unnecessarily restrictive. I think it's better to add 'notrace' to ___slab_alloc or I can denylist that function on bpf side to disallow attaching. > > Why not just check for local_lock_is_locked(&s->cpu_slab->lock) then and > just remove the "!in_nmi() ||" part? There shouldn't be false positives? That wouldn't be correct. 
Remember you asked why the access to &s->cpu_slab->lock is stable? in_nmi() guarantees that the task won't migrate. Adding a slub_get_cpu_ptr()/slub_put_cpu_ptr() wrap around local_lock_is_locked() _and_ the subsequent call to __slab_alloc_node() would fix it, but it's ugly. Potentially we could do if (!allow_spin && local_lock_is_locked()) right before calling __update_cpu_freelist_fast(), but that's even uglier, since it would affect the fast path for everyone. So I prefer to leave this bit as-is. I'll add filtering of ___slab_alloc() on the bpf side. We already have a precedent: the btf_id_deny set. That would be a one-line patch that I can do in the bpf tree. Good to disallow poking into ___slab_alloc() anyway. ^ permalink raw reply [flat|nested] 45+ messages in thread
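For reference, the one-liner mentioned above would amount to extending the existing btf_id_deny set in kernel/bpf/verifier.c roughly as follows (sketch only; the real set already contains other entries and the actual follow-up patch may differ):

/* kernel/bpf/verifier.c (fragment, sketch) */
BTF_SET_START(btf_id_deny)
/* ... existing entries, e.g. migrate_disable()/migrate_enable() ... */
BTF_ID(func, ___slab_alloc)	/* keep bpf progs from attaching inside the slub slowpath */
BTF_SET_END(btf_id_deny)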
* Re: [PATCH slab v5 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock(). 2025-09-16 0:56 ` Alexei Starovoitov @ 2025-09-16 9:55 ` Vlastimil Babka 0 siblings, 0 replies; 45+ messages in thread From: Vlastimil Babka @ 2025-09-16 9:55 UTC (permalink / raw) To: Alexei Starovoitov Cc: Harry Yoo, bpf, linux-mm, Shakeel Butt, Michal Hocko, Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner On 9/16/25 02:56, Alexei Starovoitov wrote: > On Mon, Sep 15, 2025 at 7:39 AM Vlastimil Babka <vbabka@suse.cz> wrote: >> >> On 9/15/25 14:52, Harry Yoo wrote: >> > On Mon, Sep 08, 2025 at 06:00:07PM -0700, Alexei Starovoitov wrote: >> >> From: Alexei Starovoitov <ast@kernel.org> >> >> >> >> kmalloc_nolock() relies on ability of local_trylock_t to detect >> >> the situation when per-cpu kmem_cache is locked. >> >> >> >> In !PREEMPT_RT local_(try)lock_irqsave(&s->cpu_slab->lock, flags) >> >> disables IRQs and marks s->cpu_slab->lock as acquired. >> >> local_lock_is_locked(&s->cpu_slab->lock) returns true when >> >> slab is in the middle of manipulating per-cpu cache >> >> of that specific kmem_cache. >> >> >> >> kmalloc_nolock() can be called from any context and can re-enter >> >> into ___slab_alloc(): >> >> kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> NMI -> bpf -> >> >> kmalloc_nolock() -> ___slab_alloc(cache_B) >> >> or >> >> kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> tracepoint/kprobe -> bpf -> >> >> kmalloc_nolock() -> ___slab_alloc(cache_B) >> >> >> >> Hence the caller of ___slab_alloc() checks if &s->cpu_slab->lock >> >> can be acquired without a deadlock before invoking the function. >> >> If that specific per-cpu kmem_cache is busy the kmalloc_nolock() >> >> retries in a different kmalloc bucket. The second attempt will >> >> likely succeed, since this cpu locked different kmem_cache. >> >> >> >> Similarly, in PREEMPT_RT local_lock_is_locked() returns true when >> >> per-cpu rt_spin_lock is locked by current _task_. In this case >> >> re-entrance into the same kmalloc bucket is unsafe, and >> >> kmalloc_nolock() tries a different bucket that is most likely is >> >> not locked by the current task. Though it may be locked by a >> >> different task it's safe to rt_spin_lock() and sleep on it. >> >> >> >> Similar to alloc_pages_nolock() the kmalloc_nolock() returns NULL >> >> immediately if called from hard irq or NMI in PREEMPT_RT. >> >> >> >> kfree_nolock() defers freeing to irq_work when local_lock_is_locked() >> >> and (in_nmi() or in PREEMPT_RT). >> >> >> >> SLUB_TINY config doesn't use local_lock_is_locked() and relies on >> >> spin_trylock_irqsave(&n->list_lock) to allocate, >> >> while kfree_nolock() always defers to irq_work. >> >> >> >> Note, kfree_nolock() must be called _only_ for objects allocated >> >> with kmalloc_nolock(). Debug checks (like kmemleak and kfence) >> >> were skipped on allocation, hence obj = kmalloc(); kfree_nolock(obj); >> >> will miss kmemleak/kfence book keeping and will cause false positives. >> >> large_kmalloc is not supported by either kmalloc_nolock() >> >> or kfree_nolock(). 
>> >> >> >> Signed-off-by: Alexei Starovoitov <ast@kernel.org> >> >> --- >> >> include/linux/kasan.h | 13 +- >> >> include/linux/memcontrol.h | 2 + >> >> include/linux/slab.h | 4 + >> >> mm/Kconfig | 1 + >> >> mm/kasan/common.c | 5 +- >> >> mm/slab.h | 6 + >> >> mm/slab_common.c | 3 + >> >> mm/slub.c | 473 +++++++++++++++++++++++++++++++++---- >> >> 8 files changed, 453 insertions(+), 54 deletions(-) >> >> @@ -3704,6 +3746,44 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab, >> >> } >> >> } >> >> >> >> +/* >> >> + * ___slab_alloc()'s caller is supposed to check if kmem_cache::kmem_cache_cpu::lock >> >> + * can be acquired without a deadlock before invoking the function. >> >> + * >> >> + * Without LOCKDEP we trust the code to be correct. kmalloc_nolock() is >> >> + * using local_lock_is_locked() properly before calling local_lock_cpu_slab(), >> >> + * and kmalloc() is not used in an unsupported context. >> >> + * >> >> + * With LOCKDEP, on PREEMPT_RT lockdep does its checking in local_lock_irqsave(). >> >> + * On !PREEMPT_RT we use trylock to avoid false positives in NMI, but >> >> + * lockdep_assert() will catch a bug in case: >> >> + * #1 >> >> + * kmalloc() -> ___slab_alloc() -> irqsave -> NMI -> bpf -> kmalloc_nolock() >> >> + * or >> >> + * #2 >> >> + * kmalloc() -> ___slab_alloc() -> irqsave -> tracepoint/kprobe -> bpf -> kmalloc_nolock() >> >> + * >> >> + * On PREEMPT_RT an invocation is not possible from IRQ-off or preempt >> >> + * disabled context. The lock will always be acquired and if needed it >> >> + * block and sleep until the lock is available. >> >> + * #1 is possible in !PREEMPT_RT only. >> >> + * #2 is possible in both with a twist that irqsave is replaced with rt_spinlock: >> >> + * kmalloc() -> ___slab_alloc() -> rt_spin_lock(kmem_cache_A) -> >> >> + * tracepoint/kprobe -> bpf -> kmalloc_nolock() -> rt_spin_lock(kmem_cache_B) >> >> + * >> >> + * local_lock_is_locked() prevents the case kmem_cache_A == kmem_cache_B >> >> + */ >> >> +#if defined(CONFIG_PREEMPT_RT) || !defined(CONFIG_LOCKDEP) >> >> +#define local_lock_cpu_slab(s, flags) \ >> >> + local_lock_irqsave(&(s)->cpu_slab->lock, flags) >> >> +#else >> >> +#define local_lock_cpu_slab(s, flags) \ >> >> + lockdep_assert(local_trylock_irqsave(&(s)->cpu_slab->lock, flags)) >> >> +#endif >> >> + >> >> +#define local_unlock_cpu_slab(s, flags) \ >> >> + local_unlock_irqrestore(&(s)->cpu_slab->lock, flags) >> > >> > nit: Do we still need this trick with patch "slab: Make slub local_(try)lock >> > more precise for LOCKDEP"? >> >> I think we only make it more precise on PREEMPT_RT because on !PREEMPT_RT we >> can avoid it using this trick. It's probably better for lockdep's overhead >> to avoid the class-per-cache when we can. > > yes. > >> Perhaps we can even improve by having a special class only for kmalloc >> caches? With kmalloc_nolock we shouldn't ever recurse from one non-kmalloc >> cache to another non-kmalloc cache? > > Probably correct. > The current algorithm of kmalloc_nolock() (pick a different bucket) > works only for kmalloc caches, so other caches won't see _nolock() > version any time soon... > but caches are mergeable, so other kmem_cache_create()-d cache > might get merged with kmalloc ? Still shouldn't be an issue. I think we mostly don't merge them, maybe in some configs it's still possible. We should probably just do it unconditionally as they are special enough. 
But it still shouldn't be an issue for this case, as we wouldn't ever
merge such caches that would need a different lockdep class, AFAICS.

> I guess we can fine tune "bool finegrain_lockdep" in that patch
> to make it false for non-kmalloc caches, but I don't know how to do it.
> Some flag struct kmem_cache ? I can do a follow up.

Yeah, there's now SLAB_KMALLOC and is_kmalloc_cache(). We can do a
followup; it shouldn't be urgent.

>>
>> Hmm... seems a good point, unless I'm missing something.
>
> Good point indeed.
> tracepoints are not an issue, since there are no tracepoints
> in the middle of freelist operations,
> but kprobe in the middle of ___slab_alloc() is indeed problematic.
>
>>
>> > Maybe check if interrupts are disabled instead of in_nmi()?
>
> but calling if (irqs_disabled()) isn't fast (last time I benchmarked it)
> and unnecessarily restrictive.
>
> I think it's better to add 'notrace' to ___slab_alloc
> or I can denylist that function on bpf side to disallow attaching.
>
>>
>> Why not just check for local_lock_is_locked(&s->cpu_slab->lock) then and
>> just remove the "!in_nmi() ||" part? There shouldn't be false positives?
>
> That wouldn't be correct. Remember you asked why
> access &s->cpu_slab->lock is stable? in_nmi() guarantees that
> the task won't migrate.

Ah right.

> Adding slub_put_cpu_ptr() wrap around local_lock_is_locked() _and_
> subsequent call to __slab_alloc_node() will fix it,
> but it's ugly.
> Potentially can do
> if (!allow_spin && local_lock_is_locked())
> right before calling __update_cpu_freelist_fast()
> but it's even uglier, since it will affect the fast path for everyone.

Agreed.

> So I prefer to leave this bit as-is.
> I'll add filtering of ___slab_alloc() on bpf side.
> We already have a precedent: btf_id_deny set.
> That would be one line patch that I can do in bpf tree.
> Good to disallow poking into ___slab_alloc() anyway.

Yeah, it sounds like the easiest fix and won't limit anything.

^ permalink raw reply	[flat|nested] 45+ messages in thread
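[Since the re-entrance rule discussed in this subthread -- "if this CPU already holds the per-cpu lock of that kmalloc bucket, retry the next bucket" -- is the core of the series, here is a tiny self-contained C model of just that control flow. Everything in it (bucket_size, bucket_locked, toy_kmalloc_nolock) is invented for illustration; it mimics only the retry logic, not SLUB itself.]

	/* Toy, single-threaded model of the bucket-retry idea. */
	#include <stdbool.h>
	#include <stdio.h>
	#include <stdlib.h>

	#define NUM_BUCKETS 4			/* pretend kmalloc sizes: 8, 16, 32, 64 */
	static const size_t bucket_size[NUM_BUCKETS] = { 8, 16, 32, 64 };
	static bool bucket_locked[NUM_BUCKETS];	/* stand-in for local_lock_is_locked() per bucket */

	static void *try_bucket(int i)
	{
		/* Models taking the per-cpu slab lock and allocating from bucket i. */
		bucket_locked[i] = true;
		void *obj = malloc(bucket_size[i]);
		bucket_locked[i] = false;
		return obj;
	}

	static void *toy_kmalloc_nolock(size_t size)
	{
		for (int i = 0; i < NUM_BUCKETS; i++) {
			if (bucket_size[i] < size)
				continue;	/* bucket too small */
			if (bucket_locked[i])
				continue;	/* "lock already held here": retry a bigger bucket */
			return try_bucket(i);
		}
		return NULL;			/* every usable bucket was busy */
	}

	int main(void)
	{
		/* Pretend an NMI hit while the 16-byte bucket was locked. */
		bucket_locked[1] = true;
		void *p = toy_kmalloc_nolock(12);
		if (p)
			printf("allocated %p from a larger, unlocked bucket\n", p);
		else
			printf("all usable buckets were busy\n");
		free(p);
		return 0;
	}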
* Re: [PATCH slab v5 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
  2025-09-15 12:52   ` Harry Yoo
  2025-09-15 14:39     ` Vlastimil Babka
@ 2025-09-16  1:00     ` Alexei Starovoitov
  1 sibling, 0 replies; 45+ messages in thread
From: Alexei Starovoitov @ 2025-09-16  1:00 UTC (permalink / raw)
  To: Harry Yoo
  Cc: bpf, linux-mm, Vlastimil Babka, Shakeel Butt, Michal Hocko,
	Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner

On Mon, Sep 15, 2025 at 5:53 AM Harry Yoo <harry.yoo@oracle.com> wrote:
>
> >	if (unlikely(!node_match(slab, node))) {
> >		/*
> >		 * same as above but node_match() being false already
> > -		 * implies node != NUMA_NO_NODE
> > +		 * implies node != NUMA_NO_NODE.
> > +		 * Reentrant slub cannot take locks necessary to
> > +		 * deactivate_slab, hence ignore node preference.
>
> nit: the comment is obsolete?

Ohh. Sorry. I clearly remember fixing this comment per your feedback.
Not sure how I lost this hunk.

> Per previous discussion there were two points.
> Maybe something like this?
>
> /*
>  * We don't strictly honor pfmemalloc and NUMA preferences when
>  * !allow_spin because:
>  *
>  * 1. Most kmalloc() users allocate objects on the local node,
>  *    so kmalloc_nolock() tries not to interfere with them by
>  *    deactivating the cpu slab.
>  *
>  * 2. Deactivating due to NUMA or pfmemalloc mismatch may cause
>  *    unnecessary slab allocations even when n->partial list is not empty.
>  */
>
> ...or if you feel like it's not worth documenting,
> just removing the misleading comment is fine.

The above reads great to me.
Will send a follow-up patch in a minute, to fold in or keep separate.

^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: [PATCH slab v5 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock(). 2025-09-09 1:00 ` [PATCH slab v5 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock() Alexei Starovoitov 2025-09-15 12:52 ` Harry Yoo @ 2025-09-24 0:40 ` Harry Yoo 2025-09-24 7:43 ` Alexei Starovoitov 1 sibling, 1 reply; 45+ messages in thread From: Harry Yoo @ 2025-09-24 0:40 UTC (permalink / raw) To: Alexei Starovoitov Cc: bpf, linux-mm, vbabka, shakeel.butt, mhocko, bigeasy, andrii, memxor, akpm, peterz, rostedt, hannes On Mon, Sep 08, 2025 at 06:00:07PM -0700, Alexei Starovoitov wrote: > From: Alexei Starovoitov <ast@kernel.org> > > kmalloc_nolock() relies on ability of local_trylock_t to detect > the situation when per-cpu kmem_cache is locked. > > In !PREEMPT_RT local_(try)lock_irqsave(&s->cpu_slab->lock, flags) > disables IRQs and marks s->cpu_slab->lock as acquired. > local_lock_is_locked(&s->cpu_slab->lock) returns true when > slab is in the middle of manipulating per-cpu cache > of that specific kmem_cache. > > kmalloc_nolock() can be called from any context and can re-enter > into ___slab_alloc(): > kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> NMI -> bpf -> > kmalloc_nolock() -> ___slab_alloc(cache_B) > or > kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> tracepoint/kprobe -> bpf -> > kmalloc_nolock() -> ___slab_alloc(cache_B) > > Hence the caller of ___slab_alloc() checks if &s->cpu_slab->lock > can be acquired without a deadlock before invoking the function. > If that specific per-cpu kmem_cache is busy the kmalloc_nolock() > retries in a different kmalloc bucket. The second attempt will > likely succeed, since this cpu locked different kmem_cache. > > Similarly, in PREEMPT_RT local_lock_is_locked() returns true when > per-cpu rt_spin_lock is locked by current _task_. In this case > re-entrance into the same kmalloc bucket is unsafe, and > kmalloc_nolock() tries a different bucket that is most likely is > not locked by the current task. Though it may be locked by a > different task it's safe to rt_spin_lock() and sleep on it. > > Similar to alloc_pages_nolock() the kmalloc_nolock() returns NULL > immediately if called from hard irq or NMI in PREEMPT_RT. > > kfree_nolock() defers freeing to irq_work when local_lock_is_locked() > and (in_nmi() or in PREEMPT_RT). > > SLUB_TINY config doesn't use local_lock_is_locked() and relies on > spin_trylock_irqsave(&n->list_lock) to allocate, > while kfree_nolock() always defers to irq_work. > > Note, kfree_nolock() must be called _only_ for objects allocated > with kmalloc_nolock(). Debug checks (like kmemleak and kfence) > were skipped on allocation, hence obj = kmalloc(); kfree_nolock(obj); > will miss kmemleak/kfence book keeping and will cause false positives. > large_kmalloc is not supported by either kmalloc_nolock() > or kfree_nolock(). > > Signed-off-by: Alexei Starovoitov <ast@kernel.org> > --- On the up-to-date version [1] of this patch, I tried my best to find flaws in the code, but came up empty this time. Reviewed-by: Harry Yoo <harry.yoo@oracle.com> [1] https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab.git/commit/?h=slab/for-6.18/kmalloc_nolock&id=b374424ce98fc9e03270ca1c4abb3aa82c939b5c -- Cheers, Harry / Hyeonggon ^ permalink raw reply [flat|nested] 45+ messages in thread
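[For readers coming to the series from the bpf side, a minimal usage sketch of the new API follows. It assumes the signatures introduced by this patch, kmalloc_nolock(size, gfp_flags, node) and kfree_nolock(object); the struct and helper names are made up, and the only point being illustrated is the pairing rule from the commit message above: kfree_nolock() may be used solely on objects that came from kmalloc_nolock().]

	#include <linux/slab.h>
	#include <linux/types.h>
	#include <linux/numa.h>

	struct trace_evt {		/* hypothetical payload */
		u64 ts;
		u32 cpu;
	};

	static struct trace_evt *evt_alloc_any_context(void)
	{
		/*
		 * Safe to call from NMI/IRQ/tracing contexts. May return NULL,
		 * e.g. always in hard IRQ/NMI on PREEMPT_RT, or when the chosen
		 * kmalloc buckets are busy on this CPU. GFP flags are left at 0
		 * here since the series restricts which flags are meaningful.
		 */
		return kmalloc_nolock(sizeof(struct trace_evt), 0, NUMA_NO_NODE);
	}

	static void evt_free_any_context(struct trace_evt *evt)
	{
		/*
		 * Only valid because evt came from kmalloc_nolock();
		 * a kmalloc()-ed object must go through kfree() instead.
		 */
		kfree_nolock(evt);
	}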
* Re: [PATCH slab v5 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock(). 2025-09-24 0:40 ` Harry Yoo @ 2025-09-24 7:43 ` Alexei Starovoitov 2025-09-24 11:07 ` Harry Yoo 0 siblings, 1 reply; 45+ messages in thread From: Alexei Starovoitov @ 2025-09-24 7:43 UTC (permalink / raw) To: Harry Yoo Cc: bpf, linux-mm, Vlastimil Babka, Shakeel Butt, Michal Hocko, Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner On Wed, Sep 24, 2025 at 2:41 AM Harry Yoo <harry.yoo@oracle.com> wrote: > > On Mon, Sep 08, 2025 at 06:00:07PM -0700, Alexei Starovoitov wrote: > > From: Alexei Starovoitov <ast@kernel.org> > > > > kmalloc_nolock() relies on ability of local_trylock_t to detect > > the situation when per-cpu kmem_cache is locked. > > > > In !PREEMPT_RT local_(try)lock_irqsave(&s->cpu_slab->lock, flags) > > disables IRQs and marks s->cpu_slab->lock as acquired. > > local_lock_is_locked(&s->cpu_slab->lock) returns true when > > slab is in the middle of manipulating per-cpu cache > > of that specific kmem_cache. > > > > kmalloc_nolock() can be called from any context and can re-enter > > into ___slab_alloc(): > > kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> NMI -> bpf -> > > kmalloc_nolock() -> ___slab_alloc(cache_B) > > or > > kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> tracepoint/kprobe -> bpf -> > > kmalloc_nolock() -> ___slab_alloc(cache_B) > > > > Hence the caller of ___slab_alloc() checks if &s->cpu_slab->lock > > can be acquired without a deadlock before invoking the function. > > If that specific per-cpu kmem_cache is busy the kmalloc_nolock() > > retries in a different kmalloc bucket. The second attempt will > > likely succeed, since this cpu locked different kmem_cache. > > > > Similarly, in PREEMPT_RT local_lock_is_locked() returns true when > > per-cpu rt_spin_lock is locked by current _task_. In this case > > re-entrance into the same kmalloc bucket is unsafe, and > > kmalloc_nolock() tries a different bucket that is most likely is > > not locked by the current task. Though it may be locked by a > > different task it's safe to rt_spin_lock() and sleep on it. > > > > Similar to alloc_pages_nolock() the kmalloc_nolock() returns NULL > > immediately if called from hard irq or NMI in PREEMPT_RT. > > > > kfree_nolock() defers freeing to irq_work when local_lock_is_locked() > > and (in_nmi() or in PREEMPT_RT). > > > > SLUB_TINY config doesn't use local_lock_is_locked() and relies on > > spin_trylock_irqsave(&n->list_lock) to allocate, > > while kfree_nolock() always defers to irq_work. > > > > Note, kfree_nolock() must be called _only_ for objects allocated > > with kmalloc_nolock(). Debug checks (like kmemleak and kfence) > > were skipped on allocation, hence obj = kmalloc(); kfree_nolock(obj); > > will miss kmemleak/kfence book keeping and will cause false positives. > > large_kmalloc is not supported by either kmalloc_nolock() > > or kfree_nolock(). > > > > Signed-off-by: Alexei Starovoitov <ast@kernel.org> > > --- > > On the up-to-date version [1] of this patch, > I tried my best to find flaws in the code, but came up empty this time. Here's hoping :) Much appreciate all the feedback and reviews during this long journey (v1 was back in April). ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH slab v5 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock(). 2025-09-24 7:43 ` Alexei Starovoitov @ 2025-09-24 11:07 ` Harry Yoo 0 siblings, 0 replies; 45+ messages in thread From: Harry Yoo @ 2025-09-24 11:07 UTC (permalink / raw) To: Alexei Starovoitov Cc: bpf, linux-mm, Vlastimil Babka, Shakeel Butt, Michal Hocko, Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner On Wed, Sep 24, 2025 at 09:43:45AM +0200, Alexei Starovoitov wrote: > On Wed, Sep 24, 2025 at 2:41 AM Harry Yoo <harry.yoo@oracle.com> wrote: > > > > On Mon, Sep 08, 2025 at 06:00:07PM -0700, Alexei Starovoitov wrote: > > > From: Alexei Starovoitov <ast@kernel.org> > > > > > > kmalloc_nolock() relies on ability of local_trylock_t to detect > > > the situation when per-cpu kmem_cache is locked. > > > > > > In !PREEMPT_RT local_(try)lock_irqsave(&s->cpu_slab->lock, flags) > > > disables IRQs and marks s->cpu_slab->lock as acquired. > > > local_lock_is_locked(&s->cpu_slab->lock) returns true when > > > slab is in the middle of manipulating per-cpu cache > > > of that specific kmem_cache. > > > > > > kmalloc_nolock() can be called from any context and can re-enter > > > into ___slab_alloc(): > > > kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> NMI -> bpf -> > > > kmalloc_nolock() -> ___slab_alloc(cache_B) > > > or > > > kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> tracepoint/kprobe -> bpf -> > > > kmalloc_nolock() -> ___slab_alloc(cache_B) > > > > > > Hence the caller of ___slab_alloc() checks if &s->cpu_slab->lock > > > can be acquired without a deadlock before invoking the function. > > > If that specific per-cpu kmem_cache is busy the kmalloc_nolock() > > > retries in a different kmalloc bucket. The second attempt will > > > likely succeed, since this cpu locked different kmem_cache. > > > > > > Similarly, in PREEMPT_RT local_lock_is_locked() returns true when > > > per-cpu rt_spin_lock is locked by current _task_. In this case > > > re-entrance into the same kmalloc bucket is unsafe, and > > > kmalloc_nolock() tries a different bucket that is most likely is > > > not locked by the current task. Though it may be locked by a > > > different task it's safe to rt_spin_lock() and sleep on it. > > > > > > Similar to alloc_pages_nolock() the kmalloc_nolock() returns NULL > > > immediately if called from hard irq or NMI in PREEMPT_RT. > > > > > > kfree_nolock() defers freeing to irq_work when local_lock_is_locked() > > > and (in_nmi() or in PREEMPT_RT). > > > > > > SLUB_TINY config doesn't use local_lock_is_locked() and relies on > > > spin_trylock_irqsave(&n->list_lock) to allocate, > > > while kfree_nolock() always defers to irq_work. > > > > > > Note, kfree_nolock() must be called _only_ for objects allocated > > > with kmalloc_nolock(). Debug checks (like kmemleak and kfence) > > > were skipped on allocation, hence obj = kmalloc(); kfree_nolock(obj); > > > will miss kmemleak/kfence book keeping and will cause false positives. > > > large_kmalloc is not supported by either kmalloc_nolock() > > > or kfree_nolock(). > > > > > > Signed-off-by: Alexei Starovoitov <ast@kernel.org> > > > --- > > > > On the up-to-date version [1] of this patch, > > I tried my best to find flaws in the code, but came up empty this time. > > Here's hoping :) > Much appreciate all the feedback and reviews during > this long journey (v1 was back in April). Thanks for pushing this forward and incorporating feedback so far. Cheers! 
And hopefully nothing serious comes up 🙏

--
Cheers,
Harry / Hyeonggon

^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: [PATCH slab v5 0/6] slab: Re-entrant kmalloc_nolock() 2025-09-09 1:00 [PATCH slab v5 0/6] slab: Re-entrant kmalloc_nolock() Alexei Starovoitov ` (5 preceding siblings ...) 2025-09-09 1:00 ` [PATCH slab v5 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock() Alexei Starovoitov @ 2025-09-12 9:33 ` Vlastimil Babka 6 siblings, 0 replies; 45+ messages in thread From: Vlastimil Babka @ 2025-09-12 9:33 UTC (permalink / raw) To: Alexei Starovoitov, bpf, linux-mm Cc: harry.yoo, shakeel.butt, mhocko, bigeasy, andrii, memxor, akpm, peterz, rostedt, hannes On 9/9/25 03:00, Alexei Starovoitov wrote: > From: Alexei Starovoitov <ast@kernel.org> > > Overview: > > This patch set introduces kmalloc_nolock() which is the next logical > step towards any context allocation necessary to remove bpf_mem_alloc > and get rid of preallocation requirement in BPF infrastructure. > In production BPF maps grew to gigabytes in size. Preallocation wastes > memory. Alloc from any context addresses this issue for BPF and > other subsystems that are forced to preallocate too. > This long task started with introduction of alloc_pages_nolock(), > then memcg and objcg were converted to operate from any context > including NMI, this set completes the task with kmalloc_nolock() > that builds on top of alloc_pages_nolock() and memcg changes. > After that BPF subsystem will gradually adopt it everywhere. > > The patch set is on top of slab/for-next that already has > pre-patch "locking/local_lock: Expose dep_map in local_trylock_t." applied. > I think the patch set should be routed via vbabka/slab.git. Thanks, added to slab/for-next. There were no conflicts with mm-unstable when tried locally. > v4->v5: > - New patch "Reuse first bit for OBJEXTS_ALLOC_FAIL" to free up a bit > and use it to mark slabobj_ext vector allocated with kmalloc_nolock(), > so that freeing of the vector can be done with kfree_nolock() > - Call kasan_slab_free() directly from kfree_nolock() instead of deferring to > do_slab_free() to avoid double poisoning > - Addressed other minor issues spotted by Harry > > v4: > https://lore.kernel.org/all/20250718021646.73353-1-alexei.starovoitov@gmail.com/ > > v3->v4: > - Converted local_lock_cpu_slab() to macro > - Reordered patches 5 and 6 > - Emphasized that kfree_nolock() shouldn't be used on kmalloc()-ed objects > - Addressed other comments and improved commit logs > - Fixed build issues reported by bots > > v3: > https://lore.kernel.org/bpf/20250716022950.69330-1-alexei.starovoitov@gmail.com/ > > v2->v3: > - Adopted Sebastian's local_lock_cpu_slab(), but dropped gfpflags > to avoid extra branch for performance reasons, > and added local_unlock_cpu_slab() for symmetry. > - Dropped local_lock_lockdep_start/end() pair and switched to > per kmem_cache lockdep class on PREEMPT_RT to silence false positive > when the same cpu/task acquires two local_lock-s. > - Refactorred defer_free per Sebastian's suggestion > - Fixed slab leak when it needs to be deactivated via irq_work and llist > as Vlastimil proposed. Including defer_free_barrier(). > - Use kmem_cache->offset for llist_node pointer when linking objects > instead of zero offset, since whole object could be used for slabs > with ctors and other cases. > - Fixed "cnt = 1; goto redo;" issue. > - Fixed slab leak in alloc_single_from_new_slab(). > - Retested with slab_debug, RT, !RT, lockdep, kasan, slab_tiny > - Added acks to patches 1-4 that should be good to go. 
> > v2: > https://lore.kernel.org/bpf/20250709015303.8107-1-alexei.starovoitov@gmail.com/ > > v1->v2: > Added more comments for this non-trivial logic and addressed earlier comments. > In particular: > - Introduce alloc_frozen_pages_nolock() to avoid refcnt race > - alloc_pages_nolock() defaults to GFP_COMP > - Support SLUB_TINY > - Added more variants to stress tester to discover that kfree_nolock() can > OOM, because deferred per-slab llist won't be serviced if kfree_nolock() > gets unlucky long enough. Scraped previous approach and switched to > global per-cpu llist with immediate irq_work_queue() to process all > object sizes. > - Reentrant kmalloc cannot deactivate_slab(). In v1 the node hint was > downgraded to NUMA_NO_NODE before calling slab_alloc(). Realized it's not > good enough. There are odd cases that can trigger deactivate. Rewrote > this part. > - Struggled with SLAB_NO_CMPXCHG. Thankfully Harry had a great suggestion: > https://lore.kernel.org/bpf/aFvfr1KiNrLofavW@hyeyoo/ > which was adopted. So slab_debug works now. > - In v1 I had to s/local_lock_irqsave/local_lock_irqsave_check/ in a bunch > of places in mm/slub.c to avoid lockdep false positives. > Came up with much cleaner approach to silence invalid lockdep reports > without sacrificing lockdep coverage. See local_lock_lockdep_start/end(). > > v1: > https://lore.kernel.org/bpf/20250501032718.65476-1-alexei.starovoitov@gmail.com/ > > Alexei Starovoitov (6): > locking/local_lock: Introduce local_lock_is_locked(). > mm: Allow GFP_ACCOUNT to be used in alloc_pages_nolock(). > mm: Introduce alloc_frozen_pages_nolock() > slab: Make slub local_(try)lock more precise for LOCKDEP > slab: Reuse first bit for OBJEXTS_ALLOC_FAIL > slab: Introduce kmalloc_nolock() and kfree_nolock(). > > include/linux/gfp.h | 2 +- > include/linux/kasan.h | 13 +- > include/linux/local_lock.h | 2 + > include/linux/local_lock_internal.h | 7 + > include/linux/memcontrol.h | 12 +- > include/linux/rtmutex.h | 10 + > include/linux/slab.h | 4 + > kernel/bpf/stream.c | 2 +- > kernel/bpf/syscall.c | 2 +- > kernel/locking/rtmutex_common.h | 9 - > mm/Kconfig | 1 + > mm/internal.h | 4 + > mm/kasan/common.c | 5 +- > mm/page_alloc.c | 55 ++-- > mm/slab.h | 7 + > mm/slab_common.c | 3 + > mm/slub.c | 495 +++++++++++++++++++++++++--- > 17 files changed, 541 insertions(+), 92 deletions(-) > ^ permalink raw reply [flat|nested] 45+ messages in thread
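[The deferred freeing mentioned both in the commit message (kfree_nolock() defers to irq_work when the per-cpu lock is busy) and in the "global per-cpu llist with immediate irq_work_queue()" note in the quoted changelog is easiest to picture with a sketch. The shape below is a guess at that mechanism; the names (defer_free, free_deferred, deferred_free) and details are assumptions, not the code that actually landed in mm/slub.c.]

	#include <linux/container_of.h>
	#include <linux/irq_work.h>
	#include <linux/llist.h>
	#include <linux/percpu.h>
	#include <linux/slab.h>

	struct defer_free {
		struct llist_head objects;
		struct irq_work work;
	};

	static void free_deferred(struct irq_work *work)
	{
		struct defer_free *df = container_of(work, struct defer_free, work);
		struct llist_node *llnode, *pos, *t;

		llnode = llist_del_all(&df->objects);
		llist_for_each_safe(pos, t, llnode)
			kfree(pos);	/* sketch: the real code frees into the owning kmem_cache */
	}

	static DEFINE_PER_CPU(struct defer_free, deferred_free) = {
		.objects = LLIST_HEAD_INIT(objects),
		.work = IRQ_WORK_INIT(free_deferred),
	};

	/* Called when kfree_nolock() cannot safely take the cpu_slab lock right now. */
	static void defer_free(void *object)
	{
		struct defer_free *df = this_cpu_ptr(&deferred_free);

		/* Reuses space inside the already-freed object as the llist link. */
		llist_add(object, &df->objects);
		irq_work_queue(&df->work);	/* no-op if the work is already pending */
	}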