From: Vlastimil Babka <vbabka@suse.cz>
To: Phil Auld <pauld@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	"Liam R. Howlett" <Liam.Howlett@oracle.com>,
	Christoph Lameter <cl@gentwo.org>
Subject: Re: Boot fails with 59faa4da7cd4 and 3accabda4da1
Date: Sat, 11 Oct 2025 00:22:39 +0200	[thread overview]
Message-ID: <b63f1f40-a8f5-4b54-b025-d8d1daf78c9b@suse.cz> (raw)
In-Reply-To: <20251010184259.GB436967@pauld.westford.csb>

On 10/10/25 20:42, Phil Auld wrote:
> On Fri, Oct 10, 2025 at 08:27:30PM +0200 Vlastimil Babka wrote:
>> On 10/10/25 20:19, Linus Torvalds wrote:
>> > On Fri, 10 Oct 2025 at 08:11, Phil Auld <pauld@redhat.com> wrote:
>> >>
>> >> After several days of failed boots I've gotten it down to these two
>> >> commits.
>> >>
>> >> 59faa4da7cd4 maple_tree: use percpu sheaves for maple_node_cache
>> >> 3accabda4da1 mm, vma: use percpu sheaves for vm_area_struct cache
>> >>
>> >> The first is such an early failure it's silent. With just 3acca I
>> >> get:
>> >>
>> >> [    9.341152] BUG: kernel NULL pointer dereference, address: 0000000000000040
>> >> [    9.348115] #PF: supervisor read access in kernel mode
>> >> [    9.353264] #PF: error_code(0x0000) - not-present page
>> >> [    9.358413] PGD 0 P4D 0
>> >> [    9.360959] Oops: Oops: 0000 [#1] SMP NOPTI
>> >> [    9.365154] CPU: 21 UID: 0 PID: 818 Comm: kworker/u398:0 Not tainted 6.17.0-rc3.slab+ #5 PREEMPT(voluntary)
>> >> [    9.374982] Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.26.0 07/30/2025
>> >> [    9.382641] RIP: 0010:__pcs_replace_empty_main+0x44/0x1d0
>> >> [    9.388048] Code: ec 08 48 8b 46 10 48 8b 76 08 48 85 c0 74 0b 8b 48 18 85 c9 0f 85 e5 00 00 00 65 48 63 05 e4 ee 50 02 49 8b 84 c6 e0 00 00 00 <4c> 8b 68 40 4c 89 ef e8 b0 81 ff ff 48 89 c5 48 85 c0 74 1d 48 89
>> > 
>> > That decodes to
>> > 
>> >    0:           mov    0x10(%rsi),%rax
>> >    4:           mov    0x8(%rsi),%rsi
>> >    8:           test   %rax,%rax
>> >    b:           je     0x18
>> >    d:           mov    0x18(%rax),%ecx
>> >   10:           test   %ecx,%ecx
>> >   12:           jne    0xfd
>> >   18:           movslq %gs:0x250eee4(%rip),%rax
>> >   20:           mov    0xe0(%r14,%rax,8),%rax
>> >   28:*          mov    0x40(%rax),%r13          <-- trapping instruction
>> >   2c:           mov    %r13,%rdi
>> >   2f:           call   0xffffffffffff81e4
>> >   34:           mov    %rax,%rbp
>> >   37:           test   %rax,%rax
>> >   3a:           je     0x59
>> > 
>> > which is the code around that barn_replace_empty_sheaf() call.
>> > 
>> > In particular, the trapping instruction is from get_barn(), it's the "->barn" in
>> > 
>> >         return get_node(s, numa_mem_id())->barn;
>> > 
>> > so it looks like 'get_node()' is returning NULL here:
>> > 
>> >         return s->node[node];
>> > 
>> > That 0x250eee4(%rip) is from "get_node()" becoming
>> > 
>> >   18:           movslq  %gs:numa_node(%rip), %rax  # node
>> >   20:           mov    0xe0(%r14,%rax,8),%rax # ->node[node]
>> > 
>> > and then that ->barn dereference is the trapping
>> > instruction that tries to read node->barn:
>> > 
>> >   28:*          mov    0x40(%rax),%r13   # node->barn
>> > 
>> > but I did *not* look into why s->node[node] would be NULL.
>> > 
>> > Over to you Vlastimil,
>> 
>> Thanks, yeah, will look ASAP. I suspect "nodes with zero memory" is
>> something that might not be handled well in general on x86. I know powerpc
>> used to do these kinds of setups first and has some special handling, so
>> numa_mem_id() would give you the closest node with memory there, and I
>> suspect that's not happening here. CPU 21 is on node 6, so it's one of
>> those without memory. I'll see if I can simulate this with QEMU and figure
>> out the most sensible fix.
>>
> 
> Thanks for taking a look.  I thought the NPS4 thing might be playing a role.

From what I quickly found, I understand that NPS4 is supposed to create
extra NUMA nodes per socket (4 instead of 1) and interleave the memory
between them. So it seems weird to me that it would assign everything to
one node and leave the other three memoryless?
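
In case anyone wants to reproduce this without the hardware, I think a
layout with memoryless nodes can be faked with something like the following
QEMU invocation; it's a sketch from memory, so double-check the flags.
Node 1 gets CPUs but no memdev, which should leave it memoryless:

  qemu-system-x86_64 -machine q35 -smp 8 -m 4G \
    -object memory-backend-ram,id=m0,size=4G \
    -numa node,nodeid=0,cpus=0-3,memdev=m0 \
    -numa node,nodeid=1,cpus=4-7 \
    -kernel bzImage -append "console=ttyS0" -nographic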

> I'm happy to take any test/fix code you have for a spin on this system. 
 
Thanks. Here's a candidate fix in case you can test it; I'll finalize it
tomorrow. Slab performance won't be optimal on CPUs of those memoryless
nodes, which is why I'd like to figure out whether it's a BIOS bug or not.
If memoryless nodes are really intended, we should look into initializing
things so that numa_mem_id() works as expected and points to the nearest
populated node.
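
For reference, the mechanism I mean looks roughly like this, paraphrased
from include/linux/topology.h (the details differ between kernel versions
and architectures, so take it as a sketch rather than the exact code):

  #ifdef CONFIG_HAVE_MEMORYLESS_NODES
  /* arches like powerpc call set_cpu_numa_mem() at boot so that
   * _numa_mem_ points to the nearest node that actually has memory */
  static inline int numa_mem_id(void)
  {
          return __this_cpu_read(_numa_mem_);
  }
  #else
  /* fallback: the CPU's own node, even if it has no memory */
  static inline int numa_mem_id(void)
  {
          return numa_node_id();
  }
  #endif

With the fallback variant, a CPU on a memoryless node feeds that node's id
into get_node(), and since slab allocates s->node[] only for nodes that
have memory, the result is the NULL pointer Linus spotted above.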

----8<----
From 097c6251882bf5537162d17b6726575288ba9715 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <vbabka@suse.cz>
Date: Sat, 11 Oct 2025 00:13:20 +0200
Subject: [PATCH] slab: fix NULL pointer when trying to access barn

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/slub.c | 60 +++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 47 insertions(+), 13 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 135c408e0515..bd3c2821e6c3 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -507,7 +507,12 @@ static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
 /* Get the barn of the current cpu's memory node */
 static inline struct node_barn *get_barn(struct kmem_cache *s)
 {
-	return get_node(s, numa_mem_id())->barn;
+	struct kmem_cache_node *n = get_node(s, numa_mem_id());
+
+	if (!n)
+		return NULL;
+
+	return n->barn;
 }
 
 /*
@@ -4982,6 +4987,10 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
 	}
 
 	barn = get_barn(s);
+	if (!barn) {
+		local_unlock(&s->cpu_sheaves->lock);
+		return NULL;
+	}
 
 	full = barn_replace_empty_sheaf(barn, pcs->main);
 
@@ -5153,13 +5162,20 @@ unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
 	if (unlikely(pcs->main->size == 0)) {
 
 		struct slab_sheaf *full;
+		struct node_barn *barn;
 
 		if (pcs->spare && pcs->spare->size > 0) {
 			swap(pcs->main, pcs->spare);
 			goto do_alloc;
 		}
 
-		full = barn_replace_empty_sheaf(get_barn(s), pcs->main);
+		barn = get_barn(s);
+		if (!barn) {
+			local_unlock(&s->cpu_sheaves->lock);
+			return allocated;
+		}
+
+		full = barn_replace_empty_sheaf(barn, pcs->main);
 
 		if (full) {
 			stat(s, BARN_GET);
@@ -5314,6 +5330,7 @@ kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size)
 {
 	struct slub_percpu_sheaves *pcs;
 	struct slab_sheaf *sheaf = NULL;
+	struct node_barn *barn;
 
 	if (unlikely(size > s->sheaf_capacity)) {
 
@@ -5355,8 +5372,11 @@ kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size)
 		pcs->spare = NULL;
 		stat(s, SHEAF_PREFILL_FAST);
 	} else {
+		barn = get_barn(s);
+
 		stat(s, SHEAF_PREFILL_SLOW);
-		sheaf = barn_get_full_or_empty_sheaf(get_barn(s));
+		if (barn)
+			sheaf = barn_get_full_or_empty_sheaf(barn);
 		if (sheaf && sheaf->size)
 			stat(s, BARN_GET);
 		else
@@ -5426,7 +5446,7 @@ void kmem_cache_return_sheaf(struct kmem_cache *s, gfp_t gfp,
 	 * If the barn has too many full sheaves or we fail to refill the sheaf,
 	 * simply flush and free it.
 	 */
-	if (data_race(barn->nr_full) >= MAX_FULL_SHEAVES ||
+	if (!barn || data_race(barn->nr_full) >= MAX_FULL_SHEAVES ||
 	    refill_sheaf(s, sheaf, gfp)) {
 		sheaf_flush_unused(s, sheaf);
 		free_empty_sheaf(s, sheaf);
@@ -5943,10 +5963,9 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
  * put the full sheaf there.
  */
 static void __pcs_install_empty_sheaf(struct kmem_cache *s,
-		struct slub_percpu_sheaves *pcs, struct slab_sheaf *empty)
+		struct slub_percpu_sheaves *pcs, struct slab_sheaf *empty,
+		struct node_barn *barn)
 {
-	struct node_barn *barn;
-
 	lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
 
 	/* This is what we expect to find if nobody interrupted us. */
@@ -5956,8 +5975,6 @@ static void __pcs_install_empty_sheaf(struct kmem_cache *s,
 		return;
 	}
 
-	barn = get_barn(s);
-
 	/*
 	 * Unlikely because if the main sheaf had space, we would have just
 	 * freed to it. Get rid of our empty sheaf.
@@ -6002,6 +6019,11 @@ __pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs)
 	lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
 
 	barn = get_barn(s);
+	if (!barn) {
+		local_unlock(&s->cpu_sheaves->lock);
+		return NULL;
+	}
+
 	put_fail = false;
 
 	if (!pcs->spare) {
@@ -6084,7 +6106,7 @@ __pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs)
 	}
 
 	pcs = this_cpu_ptr(s->cpu_sheaves);
-	__pcs_install_empty_sheaf(s, pcs, empty);
+	__pcs_install_empty_sheaf(s, pcs, empty, barn);
 
 	return pcs;
 }
@@ -6121,8 +6143,9 @@ bool free_to_pcs(struct kmem_cache *s, void *object)
 
 static void rcu_free_sheaf(struct rcu_head *head)
 {
+	struct kmem_cache_node *n;
 	struct slab_sheaf *sheaf;
-	struct node_barn *barn;
+	struct node_barn *barn = NULL;
 	struct kmem_cache *s;
 
 	sheaf = container_of(head, struct slab_sheaf, rcu_head);
@@ -6139,7 +6162,11 @@ static void rcu_free_sheaf(struct rcu_head *head)
 	 */
 	__rcu_free_sheaf_prepare(s, sheaf);
 
-	barn = get_node(s, sheaf->node)->barn;
+	n = get_node(s, sheaf->node);
+	if (!n)
+		goto flush;
+
+	barn = n->barn;
 
 	/* due to slab_free_hook() */
 	if (unlikely(sheaf->size == 0))
@@ -6157,11 +6184,12 @@ static void rcu_free_sheaf(struct rcu_head *head)
 		return;
 	}
 
+flush:
 	stat(s, BARN_PUT_FAIL);
 	sheaf_flush_unused(s, sheaf);
 
 empty:
-	if (data_race(barn->nr_empty) < MAX_EMPTY_SHEAVES) {
+	if (barn && data_race(barn->nr_empty) < MAX_EMPTY_SHEAVES) {
 		barn_put_empty_sheaf(barn, sheaf);
 		return;
 	}
@@ -6191,6 +6219,10 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
 		}
 
 		barn = get_barn(s);
+		if (!barn) {
+			local_unlock(&s->cpu_sheaves->lock);
+			goto fail;
+		}
 
 		empty = barn_get_empty_sheaf(barn);
 
@@ -6304,6 +6336,8 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
 		goto do_free;
 
 	barn = get_barn(s);
+	if (!barn)
+		goto no_empty;
 
 	if (!pcs->spare) {
 		empty = barn_get_empty_sheaf(barn);
-- 
2.51.0

Thread overview: 7+ messages
2025-10-10 15:11 Phil Auld
2025-10-10 18:19 ` Linus Torvalds
2025-10-10 18:27   ` Vlastimil Babka
2025-10-10 18:42     ` Phil Auld
2025-10-10 22:22       ` Vlastimil Babka [this message]
2025-10-11  0:29         ` Phil Auld
2025-10-13 13:09           ` Phil Auld
