Boot fails with 59faa4da7cd4 and 3accabda4da1

linux-mm.kvack.org archive mirror

* Boot fails with 59faa4da7cd4 and 3accabda4da1
@ 2025-10-10 15:11 Phil Auld
  2025-10-10 18:19 ` Linus Torvalds
  0 siblings, 1 reply; 7+ messages in thread
From: Phil Auld @ 2025-10-10 15:11 UTC
  To: Vlastimil Babka
  Cc: Andrew Morton, Linus Torvalds, linux-mm, linux-kernel,
	Liam R. Howlett, pauld

Hi,

After several days of failed boots I've gotten it down to these two
commits.

59faa4da7cd4 maple_tree: use percpu sheaves for maple_node_cache
3accabda4da1 mm, vma: use percpu sheaves for vm_area_struct cache

The first fails so early in boot that it is silent. With only 3accabda4da1
applied I get:

[    9.341152] BUG: kernel NULL pointer dereference, address: 0000000000000040
[    9.348115] #PF: supervisor read access in kernel mode
[    9.353264] #PF: error_code(0x0000) - not-present page
[    9.358413] PGD 0 P4D 0
[    9.360959] Oops: Oops: 0000 [#1] SMP NOPTI
[    9.365154] CPU: 21 UID: 0 PID: 818 Comm: kworker/u398:0 Not tainted 6.17.0-rc3.slab+ #5 PREEMPT(voluntary)
[    9.374982] Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.26.0 07/30/2025
[    9.382641] RIP: 0010:__pcs_replace_empty_main+0x44/0x1d0
[    9.388048] Code: ec 08 48 8b 46 10 48 8b 76 08 48 85 c0 74 0b 8b 48 18 85 c9 0f 85 e5 00 00 00 65 48 63 05 e4 ee 50 02 49 8b 84 c6 e0 00 00 00 <4c> 8b 68 40 4c 89 ef e8 b0 81 ff ff 48 89 c5 48 85 c0 74 1d 48 89
[    9.406794] RSP: 0018:ffffd2d10950bdb0 EFLAGS: 00010246
[    9.412022] RAX: 0000000000000000 RBX: ffff8a775dab74b0 RCX: 00000000ffffffff
[    9.419156] RDX: 0000000000000cc0 RSI: ffff8a6800804000 RDI: ffff8a680004e300
[    9.426297] RBP: ffffd2d10950be40 R08: 0000000000000060 R09: ffffffffb9367388
[    9.433427] R10: 00000000000149e8 R11: ffff8a6f87a38000 R12: 0000000000000cc0
[    9.440562] R13: 0000000000000cc0 R14: ffff8a680004e300 R15: 00000000000000c0
[    9.447696] FS:  0000000000000000(0000) GS:ffff8a77a3541000(0000) knlGS:0000000000000000
[    9.455788] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    9.461535] CR2: 0000000000000040 CR3: 0000000e1aa24000 CR4: 00000000003506f0
[    9.468665] Call Trace:
[    9.471123]  <TASK>
[    9.473233]  ? srso_return_thunk+0x5/0x5f
[    9.477258]  ? vm_area_alloc+0x1e/0x60
[    9.481020]  kmem_cache_alloc_noprof+0x4ec/0x5b0
[    9.485647]  vm_area_alloc+0x1e/0x60
[    9.489235]  create_init_stack_vma+0x26/0x210
[    9.493605]  alloc_bprm+0x139/0x200
[    9.497104]  kernel_execve+0x4a/0x140
[    9.500779]  call_usermodehelper_exec_async+0xd0/0x190
[    9.505923]  ? __pfx_call_usermodehelper_exec_async+0x10/0x10
[    9.511670]  ret_from_fork+0xf0/0x110
[    9.515346]  ? __pfx_call_usermodehelper_exec_async+0x10/0x10
[    9.521095]  ret_from_fork_asm+0x1a/0x30
[    9.525035]  </TASK>
[    9.527225] Modules linked in:
[    9.530290] CR2: 0000000000000040
[    9.533617] ---[ end trace 0000000000000000 ]---
[    9.538245] RIP: 0010:__pcs_replace_empty_main+0x44/0x1d0
[    9.543653] Code: ec 08 48 8b 46 10 48 8b 76 08 48 85 c0 74 0b 8b 48 18 85 c9 0f 85 e5 00 00 00 65 48 63 05 e4 ee 50 02 49 8b 84 c6 e0 00 00 00 <4c> 8b 68 40 4c 89 ef e8 b0 81 ff ff 48 89 c5 48 85 c0 74 1d 48 89
[    9.562405] RSP: 0018:ffffd2d10950bdb0 EFLAGS: 00010246
[    9.567634] RAX: 0000000000000000 RBX: ffff8a775dab74b0 RCX: 00000000ffffffff
[    9.574774] RDX: 0000000000000cc0 RSI: ffff8a6800804000 RDI: ffff8a680004e300
[    9.581908] RBP: ffffd2d10950be40 R08: 0000000000000060 R09: ffffffffb9367388
[    9.589048] R10: 00000000000149e8 R11: ffff8a6f87a38000 R12: 0000000000000cc0
[    9.596178] R13: 0000000000000cc0 R14: ffff8a680004e300 R15: 00000000000000c0
[    9.603313] FS:  0000000000000000(0000) GS:ffff8a77a3541000(0000) knlGS:0000000000000000
[    9.611399] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    9.617145] CR2: 0000000000000040 CR3: 0000000e1aa24000 CR4: 00000000003506f0
[    9.624278] Kernel panic - not syncing: Fatal exception
[    9.631463] Kernel Offset: 0x36a00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[    9.642244] ---[ end Kernel panic - not syncing: Fatal exception ]---


Reverting both produces a working kernel.

I have not looked into what the percpu sheaves code actually does, but
this is an AMD EPYC 7401 system with 8 NUMA nodes, configured such that
only 2 of them (nodes 1 and 5) have memory.

# numactl --hardware
available: 8 nodes (0-7)
node 0 cpus: 0 8 16 24 32 40 48 56 64 72 80 88
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 2 10 18 26 34 42 50 58 66 74 82 90
node 1 size: 31584 MB
node 1 free: 30397 MB
node 2 cpus: 4 12 20 28 36 44 52 60 68 76 84 92
node 2 size: 0 MB
node 2 free: 0 MB
node 3 cpus: 6 14 22 30 38 46 54 62 70 78 86 94
node 3 size: 0 MB
node 3 free: 0 MB
node 4 cpus: 1 9 17 25 33 41 49 57 65 73 81 89
node 4 size: 0 MB
node 4 free: 0 MB
node 5 cpus: 3 11 19 27 35 43 51 59 67 75 83 91
node 5 size: 32214 MB
node 5 free: 31625 MB
node 6 cpus: 5 13 21 29 37 45 53 61 69 77 85 93
node 6 size: 0 MB
node 6 free: 0 MB
node 7 cpus: 7 15 23 31 39 47 55 63 71 79 87 95
node 7 size: 0 MB
node 7 free: 0 MB
node distances:
node     0    1    2    3    4    5    6    7 
   0:   10   16   16   16   28   28   22   28 
   1:   16   10   16   16   28   28   28   22 
   2:   16   16   10   16   22   28   28   28 
   3:   16   16   16   10   28   22   28   28 
   4:   28   28   22   28   10   16   16   16 
   5:   28   28   28   22   16   10   16   16 
   6:   22   28   28   28   16   16   10   16 
   7:   28   22   28   28   16   16   16   10 


Let me know if I can provide any more information.


Thanks,
Phil



* Re: Boot fails with 59faa4da7cd4 and 3accabda4da1
  2025-10-10 15:11 Boot fails with 59faa4da7cd4 and 3accabda4da1 Phil Auld
@ 2025-10-10 18:19 ` Linus Torvalds
  2025-10-10 18:27   ` Vlastimil Babka
  0 siblings, 1 reply; 7+ messages in thread
From: Linus Torvalds @ 2025-10-10 18:19 UTC
  To: Phil Auld
  Cc: Vlastimil Babka, Andrew Morton, linux-mm, linux-kernel, Liam R. Howlett

On Fri, 10 Oct 2025 at 08:11, Phil Auld <pauld@redhat.com> wrote:
>
> After several days of failed boots I've gotten it down to these two
> commits.
>
> 59faa4da7cd4 maple_tree: use percpu sheaves for maple_node_cache
> 3accabda4da1 mm, vma: use percpu sheaves for vm_area_struct cache
>
> The first is such an early failure it's silent. With just 3acca I
> get :
>
> [    9.341152] BUG: kernel NULL pointer dereference, address: 0000000000000040
> [    9.348115] #PF: supervisor read access in kernel mode
> [    9.353264] #PF: error_code(0x0000) - not-present page
> [    9.358413] PGD 0 P4D 0
> [    9.360959] Oops: Oops: 0000 [#1] SMP NOPTI
> [    9.365154] CPU: 21 UID: 0 PID: 818 Comm: kworker/u398:0 Not tainted 6.17.0-rc3.slab+ #5 PREEMPT(voluntary)
> [    9.374982] Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.26.0 07/30/2025
> [    9.382641] RIP: 0010:__pcs_replace_empty_main+0x44/0x1d0
> [    9.388048] Code: ec 08 48 8b 46 10 48 8b 76 08 48 85 c0 74 0b 8b 48 18 85 c9 0f 85 e5 00 00 00 65 48 63 05 e4 ee 50 02 49 8b 84 c6 e0 00 00 00 <4c> 8b 68 40 4c 89 ef e8 b0 81 ff ff 48 89 c5 48 85 c0 74 1d 48 89

That decodes to

   0:           mov    0x10(%rsi),%rax
   4:           mov    0x8(%rsi),%rsi
   8:           test   %rax,%rax
   b:           je     0x18
   d:           mov    0x18(%rax),%ecx
  10:           test   %ecx,%ecx
  12:           jne    0xfd
  18:           movslq %gs:0x250eee4(%rip),%rax
  20:           mov    0xe0(%r14,%rax,8),%rax
  28:*          mov    0x40(%rax),%r13          <-- trapping instruction
  2c:           mov    %r13,%rdi
  2f:           call   0xffffffffffff81e4
  34:           mov    %rax,%rbp
  37:           test   %rax,%rax
  3a:           je     0x59

which is the code around that barn_replace_empty_sheaf() call.

In particular, the trapping instruction comes from get_barn(): it is the "->barn" in

        return get_node(s, numa_mem_id())->barn;

so it looks like 'get_node()' is returning NULL here:

        return s->node[node];

That 0x250eee4(%rip) is the per-CPU "numa_node" variable, so the
get_node(s, numa_mem_id()) call becomes these two instructions:

  18:           movslq  %gs:numa_node(%rip), %rax  # node
  20:           mov    0xe0(%r14,%rax,8),%rax # ->node[node]

and the ->barn dereference that follows is the trapping instruction,
which tries to read node->barn:

  28:*          mov    0x40(%rax),%r13   # node->barn

but I did *not* look into why s->node[node] would be NULL.
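
The decode above can be reproduced in a small userspace model. This is
only a sketch: the struct layouts are hypothetical, with padding chosen
so that the node[] array sits at offset 0xe0 and "barn" at offset 0x40,
matching the displacements in the disassembly. It is not the real
mm/slub.c layout, and barn_read_address() is an illustrative helper that
does not exist in the kernel.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_NODES 8

struct node_barn {
	unsigned int nr_full;
};

/* Model only: pad so that 'barn' lands at offset 0x40, as in the oops. */
struct kmem_cache_node {
	char pad[0x40];
	struct node_barn *barn;
};

/* Model only: pad so that 'node[]' lands at offset 0xe0, as in the oops. */
struct kmem_cache {
	char pad[0xe0];
	struct kmem_cache_node *node[MAX_NODES];
};

/* Same shape as the lookup quoted above: no NULL check. */
static struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
{
	return s->node[node];
}

/*
 * Hypothetical helper: the address that get_node(s, node)->barn would
 * read, computed without performing the read.
 */
static uintptr_t barn_read_address(struct kmem_cache *s, int node)
{
	return (uintptr_t)get_node(s, node) +
	       offsetof(struct kmem_cache_node, barn);
}
```

With only nodes 1 and 5 populated in s->node[], barn_read_address(&s, 6)
evaluates to 0x40 on common platforms, which is the faulting address
reported in CR2.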

Over to you Vlastimil,

            Linus



* Re: Boot fails with 59faa4da7cd4 and 3accabda4da1
  2025-10-10 18:19 ` Linus Torvalds
@ 2025-10-10 18:27   ` Vlastimil Babka
  2025-10-10 18:42     ` Phil Auld
  0 siblings, 1 reply; 7+ messages in thread
From: Vlastimil Babka @ 2025-10-10 18:27 UTC
  To: Linus Torvalds, Phil Auld
  Cc: Andrew Morton, linux-mm, linux-kernel, Liam R. Howlett

On 10/10/25 20:19, Linus Torvalds wrote:
> On Fri, 10 Oct 2025 at 08:11, Phil Auld <pauld@redhat.com> wrote:
>>
>> After several days of failed boots I've gotten it down to these two
>> commits.
>>
>> 59faa4da7cd4 maple_tree: use percpu sheaves for maple_node_cache
>> 3accabda4da1 mm, vma: use percpu sheaves for vm_area_struct cache
>>
>> The first is such an early failure it's silent. With just 3acca I
>> get :
>>
>> [    9.341152] BUG: kernel NULL pointer dereference, address: 0000000000000040
>> [    9.348115] #PF: supervisor read access in kernel mode
>> [    9.353264] #PF: error_code(0x0000) - not-present page
>> [    9.358413] PGD 0 P4D 0
>> [    9.360959] Oops: Oops: 0000 [#1] SMP NOPTI
>> [    9.365154] CPU: 21 UID: 0 PID: 818 Comm: kworker/u398:0 Not tainted 6.17.0-rc3.slab+ #5 PREEMPT(voluntary)
>> [    9.374982] Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.26.0 07/30/2025
>> [    9.382641] RIP: 0010:__pcs_replace_empty_main+0x44/0x1d0
>> [    9.388048] Code: ec 08 48 8b 46 10 48 8b 76 08 48 85 c0 74 0b 8b 48 18 85 c9 0f 85 e5 00 00 00 65 48 63 05 e4 ee 50 02 49 8b 84 c6 e0 00 00 00 <4c> 8b 68 40 4c 89 ef e8 b0 81 ff ff 48 89 c5 48 85 c0 74 1d 48 89
> 
> That decodes to
> 
>    0:           mov    0x10(%rsi),%rax
>    4:           mov    0x8(%rsi),%rsi
>    8:           test   %rax,%rax
>    b:           je     0x18
>    d:           mov    0x18(%rax),%ecx
>   10:           test   %ecx,%ecx
>   12:           jne    0xfd
>   18:           movslq %gs:0x250eee4(%rip),%rax
>   20:           mov    0xe0(%r14,%rax,8),%rax
>   28:*          mov    0x40(%rax),%r13          <-- trapping instruction
>   2c:           mov    %r13,%rdi
>   2f:           call   0xffffffffffff81e4
>   34:           mov    %rax,%rbp
>   37:           test   %rax,%rax
>   3a:           je     0x59
> 
> which is the code around that barn_replace_empty_sheaf() call.
> 
> In particular, the trapping instruction is from get_barn(), it's the "->barn" in
> 
>         return get_node(s, numa_mem_id())->barn;
> 
> so it looks like 'get_node()' is returning NULL here:
> 
>         return s->node[node];
> 
> That 0x250eee4(%rip) is from "get_node()" becoming
> 
>   18:           movslq  %gs:numa_node(%rip), %rax  # node
>   20:           mov    0xe0(%r14,%rax,8),%rax # ->node[node]
> 
> instruction, and then that ->barn dereference is the trapping
> instruction that tries to read node->barn:
> 
>   28:*          mov    0x40(%rax),%r13   # node->barn
> 
> but I did *not* look into why s->node[node] would be NULL.
> 
> Over to you Vlastimil,

Thanks, yeah, will look ASAP. I suspect that "nodes with zero memory" is
something that might not be handled well in general on x86. I know powerpc
was the first to have this kind of setup and has some special handling,
so numa_mem_id() would give you the closest node with memory there, and I
suspect that's not happening here. CPU 21 is on node 6, so it's one of those
without memory. I'll see if I can simulate this with QEMU and work out the
most sensible fix.
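
To illustrate what "closest node with memory" would mean on the reported
machine, here is a minimal userspace sketch. The sizes and distances are
copied from the numactl output earlier in the thread.
nearest_memory_node() is a hypothetical helper written for this example;
it is not how the kernel derives numa_mem_id().

```c
#define NR_NODES 8

/* Per-node memory size in MB, from the numactl output in this thread. */
static const unsigned int node_size_mb[NR_NODES] = {
	0, 31584, 0, 0, 0, 32214, 0, 0
};

/* Node distance table, from the numactl output in this thread. */
static const unsigned char node_distance[NR_NODES][NR_NODES] = {
	{10, 16, 16, 16, 28, 28, 22, 28},
	{16, 10, 16, 16, 28, 28, 28, 22},
	{16, 16, 10, 16, 22, 28, 28, 28},
	{16, 16, 16, 10, 28, 22, 28, 28},
	{28, 28, 22, 28, 10, 16, 16, 16},
	{28, 28, 28, 22, 16, 10, 16, 16},
	{22, 28, 28, 28, 16, 16, 10, 16},
	{28, 22, 28, 28, 16, 16, 16, 10},
};

/*
 * Hypothetical helper: return the populated node with the smallest
 * distance from 'node', or -1 if no node has memory. On a tie the
 * lowest-numbered candidate wins.
 */
static int nearest_memory_node(int node)
{
	int best = -1;

	for (int cand = 0; cand < NR_NODES; cand++) {
		if (node_size_mb[cand] == 0)
			continue;
		if (best < 0 ||
		    node_distance[node][cand] < node_distance[node][best])
			best = cand;
	}

	return best;
}
```

For node 6, where CPU 21 lives, the populated candidates are node 1 at
distance 28 and node 5 at distance 16, so the sketch picks node 5.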

>             Linus




* Re: Boot fails with 59faa4da7cd4 and 3accabda4da1
  2025-10-10 18:27   ` Vlastimil Babka
@ 2025-10-10 18:42     ` Phil Auld
  2025-10-10 22:22       ` Vlastimil Babka
  0 siblings, 1 reply; 7+ messages in thread
From: Phil Auld @ 2025-10-10 18:42 UTC
  To: Vlastimil Babka
  Cc: Linus Torvalds, Andrew Morton, linux-mm, linux-kernel, Liam R. Howlett

On Fri, Oct 10, 2025 at 08:27:30PM +0200 Vlastimil Babka wrote:
> On 10/10/25 20:19, Linus Torvalds wrote:
> > On Fri, 10 Oct 2025 at 08:11, Phil Auld <pauld@redhat.com> wrote:
> >>
> >> After several days of failed boots I've gotten it down to these two
> >> commits.
> >>
> >> 59faa4da7cd4 maple_tree: use percpu sheaves for maple_node_cache
> >> 3accabda4da1 mm, vma: use percpu sheaves for vm_area_struct cache
> >>
> >> The first is such an early failure it's silent. With just 3acca I
> >> get :
> >>
> >> [    9.341152] BUG: kernel NULL pointer dereference, address: 0000000000000040
> >> [    9.348115] #PF: supervisor read access in kernel mode
> >> [    9.353264] #PF: error_code(0x0000) - not-present page
> >> [    9.358413] PGD 0 P4D 0
> >> [    9.360959] Oops: Oops: 0000 [#1] SMP NOPTI
> >> [    9.365154] CPU: 21 UID: 0 PID: 818 Comm: kworker/u398:0 Not tainted 6.17.0-rc3.slab+ #5 PREEMPT(voluntary)
> >> [    9.374982] Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.26.0 07/30/2025
> >> [    9.382641] RIP: 0010:__pcs_replace_empty_main+0x44/0x1d0
> >> [    9.388048] Code: ec 08 48 8b 46 10 48 8b 76 08 48 85 c0 74 0b 8b 48 18 85 c9 0f 85 e5 00 00 00 65 48 63 05 e4 ee 50 02 49 8b 84 c6 e0 00 00 00 <4c> 8b 68 40 4c 89 ef e8 b0 81 ff ff 48 89 c5 48 85 c0 74 1d 48 89
> > 
> > That decodes to
> > 
> >    0:           mov    0x10(%rsi),%rax
> >    4:           mov    0x8(%rsi),%rsi
> >    8:           test   %rax,%rax
> >    b:           je     0x18
> >    d:           mov    0x18(%rax),%ecx
> >   10:           test   %ecx,%ecx
> >   12:           jne    0xfd
> >   18:           movslq %gs:0x250eee4(%rip),%rax
> >   20:           mov    0xe0(%r14,%rax,8),%rax
> >   28:*          mov    0x40(%rax),%r13          <-- trapping instruction
> >   2c:           mov    %r13,%rdi
> >   2f:           call   0xffffffffffff81e4
> >   34:           mov    %rax,%rbp
> >   37:           test   %rax,%rax
> >   3a:           je     0x59
> > 
> > which is the code around that barn_replace_empty_sheaf() call.
> > 
> > In particular, the trapping instruction is from get_barn(), it's the "->barn" in
> > 
> >         return get_node(s, numa_mem_id())->barn;
> > 
> > so it looks like 'get_node()' is returning NULL here:
> > 
> >         return s->node[node];
> > 
> > That 0x250eee4(%rip) is from "get_node()" becoming
> > 
> >   18:           movslq  %gs:numa_node(%rip), %rax  # node
> >   20:           mov    0xe0(%r14,%rax,8),%rax # ->node[node]
> > 
> > instruction, and then that ->barn dereference is the trapping
> > instruction that tries to read node->barn:
> > 
> >   28:*          mov    0x40(%rax),%r13   # node->barn
> > 
> > but I did *not* look into why s->node[node] would be NULL.
> > 
> > Over to you Vlastimil,
> 
> Thanks, yeah will look ASAP. I suspect the "nodes with zero memory" is
> something that might not be handled well in general on x86. I know powerpc
> used to do these kind of setups first and they have some special handling,
> so numa_mem_id() would give you the closest node with memory in there and I
> suspect it's not happening here. CPU 21 is node 6 so it's one of those
> without memory. I'll see if I can simulate this with QEMU and what's the
> most sensible fix
>

Thanks for taking a look.  I thought the NPS4 thing might be playing a role.

I'm happy to take any test/fix code you have for a spin on this system. 

Cheers,
Phil


> >             Linus
> 


* Re: Boot fails with 59faa4da7cd4 and 3accabda4da1
  2025-10-10 18:42     ` Phil Auld
@ 2025-10-10 22:22       ` Vlastimil Babka
  2025-10-11  0:29         ` Phil Auld
  0 siblings, 1 reply; 7+ messages in thread
From: Vlastimil Babka @ 2025-10-10 22:22 UTC
  To: Phil Auld
  Cc: Linus Torvalds, Andrew Morton, linux-mm, linux-kernel,
	Liam R. Howlett, Christoph Lameter

On 10/10/25 20:42, Phil Auld wrote:
> On Fri, Oct 10, 2025 at 08:27:30PM +0200 Vlastimil Babka wrote:
>> On 10/10/25 20:19, Linus Torvalds wrote:
>> > On Fri, 10 Oct 2025 at 08:11, Phil Auld <pauld@redhat.com> wrote:
>> >>
>> >> After several days of failed boots I've gotten it down to these two
>> >> commits.
>> >>
>> >> 59faa4da7cd4 maple_tree: use percpu sheaves for maple_node_cache
>> >> 3accabda4da1 mm, vma: use percpu sheaves for vm_area_struct cache
>> >>
>> >> The first is such an early failure it's silent. With just 3acca I
>> >> get :
>> >>
>> >> [    9.341152] BUG: kernel NULL pointer dereference, address: 0000000000000040
>> >> [    9.348115] #PF: supervisor read access in kernel mode
>> >> [    9.353264] #PF: error_code(0x0000) - not-present page
>> >> [    9.358413] PGD 0 P4D 0
>> >> [    9.360959] Oops: Oops: 0000 [#1] SMP NOPTI
>> >> [    9.365154] CPU: 21 UID: 0 PID: 818 Comm: kworker/u398:0 Not tainted 6.17.0-rc3.slab+ #5 PREEMPT(voluntary)
>> >> [    9.374982] Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.26.0 07/30/2025
>> >> [    9.382641] RIP: 0010:__pcs_replace_empty_main+0x44/0x1d0
>> >> [    9.388048] Code: ec 08 48 8b 46 10 48 8b 76 08 48 85 c0 74 0b 8b 48 18 85 c9 0f 85 e5 00 00 00 65 48 63 05 e4 ee 50 02 49 8b 84 c6 e0 00 00 00 <4c> 8b 68 40 4c 89 ef e8 b0 81 ff ff 48 89 c5 48 85 c0 74 1d 48 89
>> > 
>> > That decodes to
>> > 
>> >    0:           mov    0x10(%rsi),%rax
>> >    4:           mov    0x8(%rsi),%rsi
>> >    8:           test   %rax,%rax
>> >    b:           je     0x18
>> >    d:           mov    0x18(%rax),%ecx
>> >   10:           test   %ecx,%ecx
>> >   12:           jne    0xfd
>> >   18:           movslq %gs:0x250eee4(%rip),%rax
>> >   20:           mov    0xe0(%r14,%rax,8),%rax
>> >   28:*          mov    0x40(%rax),%r13          <-- trapping instruction
>> >   2c:           mov    %r13,%rdi
>> >   2f:           call   0xffffffffffff81e4
>> >   34:           mov    %rax,%rbp
>> >   37:           test   %rax,%rax
>> >   3a:           je     0x59
>> > 
>> > which is the code around that barn_replace_empty_sheaf() call.
>> > 
>> > In particular, the trapping instruction is from get_barn(), it's the "->barn" in
>> > 
>> >         return get_node(s, numa_mem_id())->barn;
>> > 
>> > so it looks like 'get_node()' is returning NULL here:
>> > 
>> >         return s->node[node];
>> > 
>> > That 0x250eee4(%rip) is from "get_node()" becoming
>> > 
>> >   18:           movslq  %gs:numa_node(%rip), %rax  # node
>> >   20:           mov    0xe0(%r14,%rax,8),%rax # ->node[node]
>> > 
>> > instruction, and then that ->barn dereference is the trapping
>> > instruction that tries to read node->barn:
>> > 
>> >   28:*          mov    0x40(%rax),%r13   # node->barn
>> > 
>> > but I did *not* look into why s->node[node] would be NULL.
>> > 
>> > Over to you Vlastimil,
>> 
>> Thanks, yeah will look ASAP. I suspect the "nodes with zero memory" is
>> something that might not be handled well in general on x86. I know powerpc
>> used to do these kind of setups first and they have some special handling,
>> so numa_mem_id() would give you the closest node with memory in there and I
>> suspect it's not happening here. CPU 21 is node 6 so it's one of those
>> without memory. I'll see if I can simulate this with QEMU and what's the
>> most sensible fix
>>
> 
> Thanks for taking a look.  I thought the NPS4 thing might be playing a role.

From what I quickly found, I understood that NPS4 is supposed to create extra
NUMA nodes per socket (4 instead of 1) and interleave the memory between
them. So it seems weird to me that it would assign everything to one node and
leave the 3 others memoryless?

> I'm happy to take any test/fix code you have for a spin on this system. 
 
Thanks. Here's a candidate fix in case you can test it. I'll finalize it
tomorrow. Slab performance won't be optimal on CPUs on those memoryless
nodes, which is why I'd like to figure out whether it's a BIOS bug. If
memoryless nodes are really intended, we should look into initializing things
so that numa_mem_id() works as expected and points to the nearest populated
node.
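
The guard pattern used by the candidate fix can be modelled in userspace
as follows. This is a simplified sketch under stated assumptions: the
structs are stripped-down stand-ins, get_barn_checked() takes the node as
a parameter because the model cannot call numa_mem_id(), and
refill_main_sheaf() with its result enum is hypothetical, standing in for
the sheaf refill paths that bail out when there is no barn.

```c
#include <stddef.h>

#define MAX_NODES 8

struct node_barn {
	unsigned int nr_full;	/* full sheaves available in this barn */
};

struct kmem_cache_node {
	struct node_barn *barn;
};

struct kmem_cache {
	struct kmem_cache_node *node[MAX_NODES];
};

/*
 * NULL-tolerant lookup with the same shape as the patched get_barn().
 * 'mem_node' stands in for numa_mem_id() in this model.
 */
static struct node_barn *get_barn_checked(struct kmem_cache *s, int mem_node)
{
	struct kmem_cache_node *n = s->node[mem_node];

	if (!n)
		return NULL;

	return n->barn;
}

/* Hypothetical outcome of trying to refill an empty main sheaf. */
enum refill_result { REFILL_FROM_BARN, REFILL_FALLBACK };

static enum refill_result refill_main_sheaf(struct kmem_cache *s, int mem_node)
{
	struct node_barn *barn = get_barn_checked(s, mem_node);

	/* No barn on a memoryless node: give up on sheaves, do not crash. */
	if (!barn)
		return REFILL_FALLBACK;

	/* Barn exists but has no full sheaf to hand out. */
	if (barn->nr_full == 0)
		return REFILL_FALLBACK;

	barn->nr_full--;
	return REFILL_FROM_BARN;
}
```

In this model a CPU on a memoryless node always takes the fallback
branch, which matches the remark above that slab performance on those
CPUs will not be optimal.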

----8<----
From 097c6251882bf5537162d17b6726575288ba9715 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <vbabka@suse.cz>
Date: Sat, 11 Oct 2025 00:13:20 +0200
Subject: [PATCH] slab: fix NULL pointer when trying to access barn

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/slub.c | 60 +++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 47 insertions(+), 13 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 135c408e0515..bd3c2821e6c3 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -507,7 +507,12 @@ static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
 /* Get the barn of the current cpu's memory node */
 static inline struct node_barn *get_barn(struct kmem_cache *s)
 {
-	return get_node(s, numa_mem_id())->barn;
+	struct kmem_cache_node *n = get_node(s, numa_mem_id());
+
+	if (!n)
+		return NULL;
+
+	return n->barn;
 }
 
 /*
@@ -4982,6 +4987,10 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
 	}
 
 	barn = get_barn(s);
+	if (!barn) {
+		local_unlock(&s->cpu_sheaves->lock);
+		return NULL;
+	}
 
 	full = barn_replace_empty_sheaf(barn, pcs->main);
 
@@ -5153,13 +5162,20 @@ unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
 	if (unlikely(pcs->main->size == 0)) {
 
 		struct slab_sheaf *full;
+		struct node_barn *barn;
 
 		if (pcs->spare && pcs->spare->size > 0) {
 			swap(pcs->main, pcs->spare);
 			goto do_alloc;
 		}
 
-		full = barn_replace_empty_sheaf(get_barn(s), pcs->main);
+		barn = get_barn(s);
+		if (!barn) {
+			local_unlock(&s->cpu_sheaves->lock);
+			return allocated;
+		}
+
+		full = barn_replace_empty_sheaf(barn, pcs->main);
 
 		if (full) {
 			stat(s, BARN_GET);
@@ -5314,6 +5330,7 @@ kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size)
 {
 	struct slub_percpu_sheaves *pcs;
 	struct slab_sheaf *sheaf = NULL;
+	struct node_barn *barn;
 
 	if (unlikely(size > s->sheaf_capacity)) {
 
@@ -5355,8 +5372,11 @@ kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size)
 		pcs->spare = NULL;
 		stat(s, SHEAF_PREFILL_FAST);
 	} else {
+		barn = get_barn(s);
+
 		stat(s, SHEAF_PREFILL_SLOW);
-		sheaf = barn_get_full_or_empty_sheaf(get_barn(s));
+		if (barn)
+			sheaf = barn_get_full_or_empty_sheaf(barn);
 		if (sheaf && sheaf->size)
 			stat(s, BARN_GET);
 		else
@@ -5426,7 +5446,7 @@ void kmem_cache_return_sheaf(struct kmem_cache *s, gfp_t gfp,
 	 * If the barn has too many full sheaves or we fail to refill the sheaf,
 	 * simply flush and free it.
 	 */
-	if (data_race(barn->nr_full) >= MAX_FULL_SHEAVES ||
+	if (!barn || data_race(barn->nr_full) >= MAX_FULL_SHEAVES ||
 	    refill_sheaf(s, sheaf, gfp)) {
 		sheaf_flush_unused(s, sheaf);
 		free_empty_sheaf(s, sheaf);
@@ -5943,10 +5963,9 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
  * put the full sheaf there.
  */
 static void __pcs_install_empty_sheaf(struct kmem_cache *s,
-		struct slub_percpu_sheaves *pcs, struct slab_sheaf *empty)
+		struct slub_percpu_sheaves *pcs, struct slab_sheaf *empty,
+		struct node_barn *barn)
 {
-	struct node_barn *barn;
-
 	lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
 
 	/* This is what we expect to find if nobody interrupted us. */
@@ -5956,8 +5975,6 @@ static void __pcs_install_empty_sheaf(struct kmem_cache *s,
 		return;
 	}
 
-	barn = get_barn(s);
-
 	/*
 	 * Unlikely because if the main sheaf had space, we would have just
 	 * freed to it. Get rid of our empty sheaf.
@@ -6002,6 +6019,11 @@ __pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs)
 	lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
 
 	barn = get_barn(s);
+	if (!barn) {
+		local_unlock(&s->cpu_sheaves->lock);
+		return NULL;
+	}
+
 	put_fail = false;
 
 	if (!pcs->spare) {
@@ -6084,7 +6106,7 @@ __pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs)
 	}
 
 	pcs = this_cpu_ptr(s->cpu_sheaves);
-	__pcs_install_empty_sheaf(s, pcs, empty);
+	__pcs_install_empty_sheaf(s, pcs, empty, barn);
 
 	return pcs;
 }
@@ -6121,8 +6143,9 @@ bool free_to_pcs(struct kmem_cache *s, void *object)
 
 static void rcu_free_sheaf(struct rcu_head *head)
 {
+	struct kmem_cache_node *n;
 	struct slab_sheaf *sheaf;
-	struct node_barn *barn;
+	struct node_barn *barn = NULL;
 	struct kmem_cache *s;
 
 	sheaf = container_of(head, struct slab_sheaf, rcu_head);
@@ -6139,7 +6162,11 @@ static void rcu_free_sheaf(struct rcu_head *head)
 	 */
 	__rcu_free_sheaf_prepare(s, sheaf);
 
-	barn = get_node(s, sheaf->node)->barn;
+	n = get_node(s, sheaf->node);
+	if (!n)
+		goto flush;
+
+	barn = n->barn;
 
 	/* due to slab_free_hook() */
 	if (unlikely(sheaf->size == 0))
@@ -6157,11 +6184,12 @@ static void rcu_free_sheaf(struct rcu_head *head)
 		return;
 	}
 
+flush:
 	stat(s, BARN_PUT_FAIL);
 	sheaf_flush_unused(s, sheaf);
 
 empty:
-	if (data_race(barn->nr_empty) < MAX_EMPTY_SHEAVES) {
+	if (barn && data_race(barn->nr_empty) < MAX_EMPTY_SHEAVES) {
 		barn_put_empty_sheaf(barn, sheaf);
 		return;
 	}
@@ -6191,6 +6219,10 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
 		}
 
 		barn = get_barn(s);
+		if (!barn) {
+			local_unlock(&s->cpu_sheaves->lock);
+			goto fail;
+		}
 
 		empty = barn_get_empty_sheaf(barn);
 
@@ -6304,6 +6336,8 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
 		goto do_free;
 
 	barn = get_barn(s);
+	if (!barn)
+		goto no_empty;
 
 	if (!pcs->spare) {
 		empty = barn_get_empty_sheaf(barn);
-- 
2.51.0





* Re: Boot fails with 59faa4da7cd4 and 3accabda4da1
  2025-10-10 22:22       ` Vlastimil Babka
@ 2025-10-11  0:29         ` Phil Auld
  2025-10-13 13:09           ` Phil Auld
  0 siblings, 1 reply; 7+ messages in thread
From: Phil Auld @ 2025-10-11  0:29 UTC
  To: Vlastimil Babka
  Cc: Linus Torvalds, Andrew Morton, linux-mm, linux-kernel,
	Liam R. Howlett, Christoph Lameter

Hi Vlastimil,

On Sat, Oct 11, 2025 at 12:22:39AM +0200 Vlastimil Babka wrote:
> On 10/10/25 20:42, Phil Auld wrote:
> > On Fri, Oct 10, 2025 at 08:27:30PM +0200 Vlastimil Babka wrote:
> >> On 10/10/25 20:19, Linus Torvalds wrote:
> >> > On Fri, 10 Oct 2025 at 08:11, Phil Auld <pauld@redhat.com> wrote:
> >> >>
> >> >> After several days of failed boots I've gotten it down to these two
> >> >> commits.
> >> >>
> >> >> 59faa4da7cd4 maple_tree: use percpu sheaves for maple_node_cache
> >> >> 3accabda4da1 mm, vma: use percpu sheaves for vm_area_struct cache
> >> >>
> >> >> The first is such an early failure it's silent. With just 3acca I
> >> >> get :
> >> >>
> >> >> [    9.341152] BUG: kernel NULL pointer dereference, address: 0000000000000040
> >> >> [    9.348115] #PF: supervisor read access in kernel mode
> >> >> [    9.353264] #PF: error_code(0x0000) - not-present page
> >> >> [    9.358413] PGD 0 P4D 0
> >> >> [    9.360959] Oops: Oops: 0000 [#1] SMP NOPTI
> >> >> [    9.365154] CPU: 21 UID: 0 PID: 818 Comm: kworker/u398:0 Not tainted 6.17.0-rc3.slab+ #5 PREEMPT(voluntary)
> >> >> [    9.374982] Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.26.0 07/30/2025
> >> >> [    9.382641] RIP: 0010:__pcs_replace_empty_main+0x44/0x1d0
> >> >> [    9.388048] Code: ec 08 48 8b 46 10 48 8b 76 08 48 85 c0 74 0b 8b 48 18 85 c9 0f 85 e5 00 00 00 65 48 63 05 e4 ee 50 02 49 8b 84 c6 e0 00 00 00 <4c> 8b 68 40 4c 89 ef e8 b0 81 ff ff 48 89 c5 48 85 c0 74 1d 48 89
> >> > 
> >> > That decodes to
> >> > 
> >> >    0:           mov    0x10(%rsi),%rax
> >> >    4:           mov    0x8(%rsi),%rsi
> >> >    8:           test   %rax,%rax
> >> >    b:           je     0x18
> >> >    d:           mov    0x18(%rax),%ecx
> >> >   10:           test   %ecx,%ecx
> >> >   12:           jne    0xfd
> >> >   18:           movslq %gs:0x250eee4(%rip),%rax
> >> >   20:           mov    0xe0(%r14,%rax,8),%rax
> >> >   28:*          mov    0x40(%rax),%r13          <-- trapping instruction
> >> >   2c:           mov    %r13,%rdi
> >> >   2f:           call   0xffffffffffff81e4
> >> >   34:           mov    %rax,%rbp
> >> >   37:           test   %rax,%rax
> >> >   3a:           je     0x59
> >> > 
> >> > which is the code around that barn_replace_empty_sheaf() call.
> >> > 
> >> > In particular, the trapping instruction is from get_barn(), it's the "->barn" in
> >> > 
> >> >         return get_node(s, numa_mem_id())->barn;
> >> > 
> >> > so it looks like 'get_node()' is returning NULL here:
> >> > 
> >> >         return s->node[node];
> >> > 
> >> > That 0x250eee4(%rip) is from "get_node()" becoming
> >> > 
> >> >   18:           movslq  %gs:numa_node(%rip), %rax  # node
> >> >   20:           mov    0xe0(%r14,%rax,8),%rax # ->node[node]
> >> > 
> >> > instruction, and then that ->barn dereference is the trapping
> >> > instruction that tries to read node->barn:
> >> > 
> >> >   28:*          mov    0x40(%rax),%r13   # node->barn
> >> > 
> >> > but I did *not* look into why s->node[node] would be NULL.
> >> > 
> >> > Over to you Vlastimil,
> >> 
> >> Thanks, yeah will look ASAP. I suspect the "nodes with zero memory" is
> >> something that might not be handled well in general on x86. I know powerpc
> >> used to do these kind of setups first and they have some special handling,
> >> so numa_mem_id() would give you the closest node with memory in there and I
> >> suspect it's not happening here. CPU 21 is node 6 so it's one of those
> >> without memory. I'll see if I can simulate this with QEMU and what's the
> >> most sensible fix
> >>
> > 
> > Thanks for taking a look.  I thought the NPS4 thing might be playing a role.
> 
> From what I quickly found I understood that NPS4 is supposed to create extra
> numa nodes per socket (4 instead of 1) and interleave the memory between
> them. So it seems weird to me it would assign everything to one node and
> leave 3 others memoryless?
>

That I don't know. Someone from AMD might be able to help there. This system
has had its BIOS and other bits updated just a couple of months ago but
this numa layout has been there since I've been using the system (several
years now).

> > I'm happy to take any test/fix code you have for a spin on this system. 
>  
> Thanks. Here's a candidate fix in case you can test. I'll finalize it
> tomorrow. The slab performance won't be optimal on cpus on those memoryless
> nodes, that's why I'd like to figure out if it's a BIOS bug or not. If
> memoryless nodes are really intended we should look into initializing things
> so that numa_mem_id() works as expected and points to nearest populated
> node.

The patch below does the trick. It boots, and I ran a suite of stress-ng
tests as a sanity check. Any performance it's getting now is better than it
was when it wouldn't boot :)

Tested-by: Phil Auld <pauld@redhat.com>


Cheers,
Phil

> 
> ----8<----
> From 097c6251882bf5537162d17b6726575288ba9715 Mon Sep 17 00:00:00 2001
> From: Vlastimil Babka <vbabka@suse.cz>
> Date: Sat, 11 Oct 2025 00:13:20 +0200
> Subject: [PATCH] slab: fix NULL pointer when trying to access barn
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  mm/slub.c | 60 +++++++++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 47 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 135c408e0515..bd3c2821e6c3 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -507,7 +507,12 @@ static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
>  /* Get the barn of the current cpu's memory node */
>  static inline struct node_barn *get_barn(struct kmem_cache *s)
>  {
> -	return get_node(s, numa_mem_id())->barn;
> +	struct kmem_cache_node *n = get_node(s, numa_mem_id());
> +
> +	if (!n)
> +		return NULL;
> +
> +	return n->barn;
>  }
>  
>  /*
> @@ -4982,6 +4987,10 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
>  	}
>  
>  	barn = get_barn(s);
> +	if (!barn) {
> +		local_unlock(&s->cpu_sheaves->lock);
> +		return NULL;
> +	}
>  
>  	full = barn_replace_empty_sheaf(barn, pcs->main);
>  
> @@ -5153,13 +5162,20 @@ unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
>  	if (unlikely(pcs->main->size == 0)) {
>  
>  		struct slab_sheaf *full;
> +		struct node_barn *barn;
>  
>  		if (pcs->spare && pcs->spare->size > 0) {
>  			swap(pcs->main, pcs->spare);
>  			goto do_alloc;
>  		}
>  
> -		full = barn_replace_empty_sheaf(get_barn(s), pcs->main);
> +		barn = get_barn(s);
> +		if (!barn) {
> +			local_unlock(&s->cpu_sheaves->lock);
> +			return allocated;
> +		}
> +
> +		full = barn_replace_empty_sheaf(barn, pcs->main);
>  
>  		if (full) {
>  			stat(s, BARN_GET);
> @@ -5314,6 +5330,7 @@ kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size)
>  {
>  	struct slub_percpu_sheaves *pcs;
>  	struct slab_sheaf *sheaf = NULL;
> +	struct node_barn *barn;
>  
>  	if (unlikely(size > s->sheaf_capacity)) {
>  
> @@ -5355,8 +5372,11 @@ kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size)
>  		pcs->spare = NULL;
>  		stat(s, SHEAF_PREFILL_FAST);
>  	} else {
> +		barn = get_barn(s);
> +
>  		stat(s, SHEAF_PREFILL_SLOW);
> -		sheaf = barn_get_full_or_empty_sheaf(get_barn(s));
> +		if (barn)
> +			sheaf = barn_get_full_or_empty_sheaf(barn);
>  		if (sheaf && sheaf->size)
>  			stat(s, BARN_GET);
>  		else
> @@ -5426,7 +5446,7 @@ void kmem_cache_return_sheaf(struct kmem_cache *s, gfp_t gfp,
>  	 * If the barn has too many full sheaves or we fail to refill the sheaf,
>  	 * simply flush and free it.
>  	 */
> -	if (data_race(barn->nr_full) >= MAX_FULL_SHEAVES ||
> +	if (!barn || data_race(barn->nr_full) >= MAX_FULL_SHEAVES ||
>  	    refill_sheaf(s, sheaf, gfp)) {
>  		sheaf_flush_unused(s, sheaf);
>  		free_empty_sheaf(s, sheaf);
> @@ -5943,10 +5963,9 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
>   * put the full sheaf there.
>   */
>  static void __pcs_install_empty_sheaf(struct kmem_cache *s,
> -		struct slub_percpu_sheaves *pcs, struct slab_sheaf *empty)
> +		struct slub_percpu_sheaves *pcs, struct slab_sheaf *empty,
> +		struct node_barn *barn)
>  {
> -	struct node_barn *barn;
> -
>  	lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
>  
>  	/* This is what we expect to find if nobody interrupted us. */
> @@ -5956,8 +5975,6 @@ static void __pcs_install_empty_sheaf(struct kmem_cache *s,
>  		return;
>  	}
>  
> -	barn = get_barn(s);
> -
>  	/*
>  	 * Unlikely because if the main sheaf had space, we would have just
>  	 * freed to it. Get rid of our empty sheaf.
> @@ -6002,6 +6019,11 @@ __pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs)
>  	lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
>  
>  	barn = get_barn(s);
> +	if (!barn) {
> +		local_unlock(&s->cpu_sheaves->lock);
> +		return NULL;
> +	}
> +
>  	put_fail = false;
>  
>  	if (!pcs->spare) {
> @@ -6084,7 +6106,7 @@ __pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs)
>  	}
>  
>  	pcs = this_cpu_ptr(s->cpu_sheaves);
> -	__pcs_install_empty_sheaf(s, pcs, empty);
> +	__pcs_install_empty_sheaf(s, pcs, empty, barn);
>  
>  	return pcs;
>  }
> @@ -6121,8 +6143,9 @@ bool free_to_pcs(struct kmem_cache *s, void *object)
>  
>  static void rcu_free_sheaf(struct rcu_head *head)
>  {
> +	struct kmem_cache_node *n;
>  	struct slab_sheaf *sheaf;
> -	struct node_barn *barn;
> +	struct node_barn *barn = NULL;
>  	struct kmem_cache *s;
>  
>  	sheaf = container_of(head, struct slab_sheaf, rcu_head);
> @@ -6139,7 +6162,11 @@ static void rcu_free_sheaf(struct rcu_head *head)
>  	 */
>  	__rcu_free_sheaf_prepare(s, sheaf);
>  
> -	barn = get_node(s, sheaf->node)->barn;
> +	n = get_node(s, sheaf->node);
> +	if (!n)
> +		goto flush;
> +
> +	barn = n->barn;
>  
>  	/* due to slab_free_hook() */
>  	if (unlikely(sheaf->size == 0))
> @@ -6157,11 +6184,12 @@ static void rcu_free_sheaf(struct rcu_head *head)
>  		return;
>  	}
>  
> +flush:
>  	stat(s, BARN_PUT_FAIL);
>  	sheaf_flush_unused(s, sheaf);
>  
>  empty:
> -	if (data_race(barn->nr_empty) < MAX_EMPTY_SHEAVES) {
> +	if (barn && data_race(barn->nr_empty) < MAX_EMPTY_SHEAVES) {
>  		barn_put_empty_sheaf(barn, sheaf);
>  		return;
>  	}
> @@ -6191,6 +6219,10 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
>  		}
>  
>  		barn = get_barn(s);
> +		if (!barn) {
> +			local_unlock(&s->cpu_sheaves->lock);
> +			goto fail;
> +		}
>  
>  		empty = barn_get_empty_sheaf(barn);
>  
> @@ -6304,6 +6336,8 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
>  		goto do_free;
>  
>  	barn = get_barn(s);
> +	if (!barn)
> +		goto no_empty;
>  
>  	if (!pcs->spare) {
>  		empty = barn_get_empty_sheaf(barn);
> -- 
> 2.51.0
> 
> 
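
For anyone skimming the patch above, the core idea can be modelled in userspace roughly as follows. This is only a sketch under stated assumptions: the structs are cut-down stand-ins for the ones in mm/slub.c, the mem_node parameter stands in for numa_mem_id(), and try_sheaf_refill() is a hypothetical caller that is not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 8	/* arbitrary size for this sketch */

/* Cut-down stand-ins for the kernel structures (assumption). */
struct node_barn { int nr_full; int nr_empty; };
struct kmem_cache_node { struct node_barn *barn; };
struct kmem_cache { struct kmem_cache_node *node[MAX_NODES]; };

/* A memoryless node has no kmem_cache_node, so its slot stays NULL. */
static struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
{
	assert(node >= 0 && node < MAX_NODES);
	return s->node[node];
}

/*
 * Guarded lookup in the spirit of the patch: return NULL instead of
 * dereferencing a missing node. mem_node stands in for numa_mem_id().
 */
static struct node_barn *get_barn(struct kmem_cache *s, int mem_node)
{
	struct kmem_cache_node *n = get_node(s, mem_node);

	if (!n)
		return NULL;

	return n->barn;
}

/*
 * Hypothetical caller: returns 1 if a full sheaf could be taken from
 * the barn, 0 if the caller has to fall back to the regular slow path.
 */
static int try_sheaf_refill(struct kmem_cache *s, int mem_node)
{
	struct node_barn *barn = get_barn(s, mem_node);

	if (!barn)
		return 0;

	return barn->nr_full > 0;
}
```

With only node 0 populated, a lookup for node 6 (the node CPU 21 sits on in the report) takes the fallback branch instead of faulting.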


* Re: Boot fails with 59faa4da7cd4 and 3accabda4da1
  2025-10-11  0:29         ` Phil Auld
@ 2025-10-13 13:09           ` Phil Auld
  0 siblings, 0 replies; 7+ messages in thread
From: Phil Auld @ 2025-10-13 13:09 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Linus Torvalds, Andrew Morton, linux-mm, linux-kernel,
	Liam R. Howlett, Christoph Lameter

Hi,

On Fri, Oct 10, 2025 at 08:29:07PM -0400 Phil Auld wrote:
> Hi Vlastimil,
> 
> On Sat, Oct 11, 2025 at 12:22:39AM +0200 Vlastimil Babka wrote:
> > On 10/10/25 20:42, Phil Auld wrote:
> > > On Fri, Oct 10, 2025 at 08:27:30PM +0200 Vlastimil Babka wrote:
> > >> On 10/10/25 20:19, Linus Torvalds wrote:
> > >> > On Fri, 10 Oct 2025 at 08:11, Phil Auld <pauld@redhat.com> wrote:
> > >> >>
> > >> >> After several days of failed boots I've gotten it down to these two
> > >> >> commits.
> > >> >>
> > >> >> 59faa4da7cd4 maple_tree: use percpu sheaves for maple_node_cache
> > >> >> 3accabda4da1 mm, vma: use percpu sheaves for vm_area_struct cache
> > >> >>
> > >> >> The first is such an early failure it's silent. With just 3acca I
> > >> >> get :
> > >> >>
> > >> >> [    9.341152] BUG: kernel NULL pointer dereference, address: 0000000000000040
> > >> >> [    9.348115] #PF: supervisor read access in kernel mode
> > >> >> [    9.353264] #PF: error_code(0x0000) - not-present page
> > >> >> [    9.358413] PGD 0 P4D 0
> > >> >> [    9.360959] Oops: Oops: 0000 [#1] SMP NOPTI
> > >> >> [    9.365154] CPU: 21 UID: 0 PID: 818 Comm: kworker/u398:0 Not tainted 6.17.0-rc3.slab+ #5 PREEMPT(voluntary)
> > >> >> [    9.374982] Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.26.0 07/30/2025
> > >> >> [    9.382641] RIP: 0010:__pcs_replace_empty_main+0x44/0x1d0
> > >> >> [    9.388048] Code: ec 08 48 8b 46 10 48 8b 76 08 48 85 c0 74 0b 8b 48 18 85 c9 0f 85 e5 00 00 00 65 48 63 05 e4 ee 50 02 49 8b 84 c6 e0 00 00 00 <4c> 8b 68 40 4c 89 ef e8 b0 81 ff ff 48 89 c5 48 85 c0 74 1d 48 89
> > >> > 
> > >> > That decodes to
> > >> > 
> > >> >    0:           mov    0x10(%rsi),%rax
> > >> >    4:           mov    0x8(%rsi),%rsi
> > >> >    8:           test   %rax,%rax
> > >> >    b:           je     0x18
> > >> >    d:           mov    0x18(%rax),%ecx
> > >> >   10:           test   %ecx,%ecx
> > >> >   12:           jne    0xfd
> > >> >   18:           movslq %gs:0x250eee4(%rip),%rax
> > >> >   20:           mov    0xe0(%r14,%rax,8),%rax
> > >> >   28:*          mov    0x40(%rax),%r13          <-- trapping instruction
> > >> >   2c:           mov    %r13,%rdi
> > >> >   2f:           call   0xffffffffffff81e4
> > >> >   34:           mov    %rax,%rbp
> > >> >   37:           test   %rax,%rax
> > >> >   3a:           je     0x59
> > >> > 
> > >> > which is the code around that barn_replace_empty_sheaf() call.
> > >> > 
> > >> > In particular, the trapping instruction is from get_barn(), it's the "->barn" in
> > >> > 
> > >> >         return get_node(s, numa_mem_id())->barn;
> > >> > 
> > >> > so it looks like 'get_node()' is returning NULL here:
> > >> > 
> > >> >         return s->node[node];
> > >> > 
> > >> > That 0x250eee4(%rip) is from "get_node()" becoming
> > >> > 
> > >> >   18:           movslq  %gs:numa_node(%rip), %rax  # node
> > >> >   20:           mov    0xe0(%r14,%rax,8),%rax # ->node[node]
> > >> > 
> > >> > instruction, and then that ->barn dereference is the trapping
> > >> > instruction that tries to read node->barn:
> > >> > 
> > >> >   28:*          mov    0x40(%rax),%r13   # node->barn
> > >> > 
> > >> > but I did *not* look into why s->node[node] would be NULL.
> > >> > 
> > >> > Over to you Vlastimil,
> > >> 
> > >> Thanks, yeah will look ASAP. I suspect the "nodes with zero memory" is
> > >> something that might not be handled well in general on x86. I know powerpc
> > >> used to do these kind of setups first and they have some special handling,
> > >> so numa_mem_id() would give you the closest node with memory in there and I
> > >> suspect it's not happening here. CPU 21 is node 6 so it's one of those
> > >> without memory. I'll see if I can simulate this with QEMU and what's the
> > >> most sensible fix
> > >>
> > > 
> > > Thanks for taking a look.  I thought the NPS4 thing might be playing a role.
> > 
> > From what I quickly found I understood that NPS4 is supposed to create extra
> > numa nodes per socket (4 instead of 1) and interleave the memory between
> > them. So it seems weird to me it would assign everything to one node and
> > leave 3 others memoryless?
> >
> 
> That I don't know. Someone from AMD might be able to help there. This system
> has had its BIOS and other bits updated just a couple of months ago but
> this numa layout has been there since I've been using the system (several
> years now).
> 

Just to follow up here.  I think the issue is just that this machine
is somewhat underprovisioned in the memory department.  It's got 32
slots, with only 4 actually populated. I suspect if it was fully populated
there'd be memory in every node.
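
On the longer-term idea quoted above, having numa_mem_id() point at the nearest populated node, the selection itself could look something like the sketch below. It is a userspace illustration only: nearest_populated_node(), the has_memory array and the distance table are all hypothetical and do not correspond to existing kernel interfaces.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define NR_NODES 4	/* arbitrary size for this sketch */

/*
 * Return the node with memory that is closest to @node according to
 * the distance table, or -1 if no node has memory at all. A populated
 * node picks itself as long as its self-distance is the smallest entry
 * in its row, which is the usual convention for such tables.
 */
static int nearest_populated_node(int node, const bool has_memory[NR_NODES],
				  int distance[NR_NODES][NR_NODES])
{
	int best = -1;
	int best_dist = INT_MAX;

	assert(node >= 0 && node < NR_NODES);

	for (int i = 0; i < NR_NODES; i++) {
		if (!has_memory[i])
			continue;
		if (distance[node][i] < best_dist) {
			best_dist = distance[node][i];
			best = i;
		}
	}

	return best;
}
```

The result would be computed once per node at init time and cached, so the allocation fast path never has to walk the table.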

Thanks for the fix and for getting it into -rc1.


Cheers,
Phil




Thread overview: 7+ messages
2025-10-10 15:11 Boot fails with 59faa4da7cd4 and 3accabda4da1 Phil Auld
2025-10-10 18:19 ` Linus Torvalds
2025-10-10 18:27   ` Vlastimil Babka
2025-10-10 18:42     ` Phil Auld
2025-10-10 22:22       ` Vlastimil Babka
2025-10-11  0:29         ` Phil Auld
2025-10-13 13:09           ` Phil Auld
