PROBLEM: kernel crashes when running xfsdump since ~6.4

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* PROBLEM: kernel crashes when running xfsdump since ~6.4
@ 2024-06-20  6:19 Nick Bowler
  2024-06-20  6:37 ` Hailong Liu
  2024-06-20 18:02 ` Nick Bowler
  0 siblings, 2 replies; 39+ messages in thread
From: Nick Bowler @ 2024-06-20  6:19 UTC (permalink / raw)
  To: linux-kernel, Linux regressions mailing list, linux-mm,
	sparclinux, Uladzislau Rezki (Sony),
	Andrew Morton

Hi,

After upgrading my sparc to 6.9.5 I noticed that attempting to run
xfsdump instantly (within a couple seconds) and reliably crashes the
kernel.  The same problem is also observed on 6.10-rc4.

This is a regression introduced around 6.4 timeframe.  6.3 appears
to work fine and xfsdump goes about its business dumping stuff.

Bisection implicates the following:

  062eacf57ad91b5c272f89dc964fd6dd9715ea7d is the first bad commit
  commit 062eacf57ad91b5c272f89dc964fd6dd9715ea7d
  Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
  Date:   Thu Mar 30 21:06:38 2023 +0200
  
      mm: vmalloc: remove a global vmap_blocks xarray

This reverts pretty easily on top of v6.10-rc4, as long as I first
revert fa1c77c13ca5 ("mm: vmalloc: rename addr_to_vb_xarray() function")
as this just causes conflicts.  Then there is one easily-corrected build
failure (adjust the one remaining &vbq->vmap_blocks back to &vmap_blocks).

If I do all of that then the kernel is not crashing anymore.

A splat like this one is output on the console when the crash occurs (varies a bit):

  spitfire_data_access_exception: SFSR[000000000080100d] SFAR[0000000000c51ba0], going.
                \|/ ____ \|/
                "@'/ .. \`@"
                /_| \__/ |_\
                   \__U_/
  xfsdump(2028): Dax [#1]
  CPU: 0 PID: 2028 Comm: xfsdump Not tainted 6.9.5 #199
  TSTATE: 0000000811001607 TPC: 0000000000974fc4 TNPC: 0000000000974fc8 Y: 00000000    Not tainted
  TPC: <queued_spin_lock_slowpath+0x1d0/0x2cc>
  g0: 0000000000aa9110 g1: 0000000000c51ba0 g2: 444b000000000000 g3: 0000000000c560c0
  g4: fffff800a71a1f00 g5: fffff800bebb6000 g6: fffff800ac0ec000 g7: 0000000000040000
  o0: 0000000000000002 o1: 00000000000007d8 o2: fffff800a4131420 o3: ffffffff0000ffff
  o4: 00000000900a2001 o5: 0000000000c4f5a0 sp: fffff800ac0eeac1 ret_pc: 0000000000040000
  RPC: <0x40000>
  l0: fffff800a40098c0 l1: 0000000100800000 l2: 0000000000000000 l3: 0000000000000103
  l4: fffff800a40081b0 l5: 0000000000aeec00 l6: fffff800a40080a0 l7: 0000000101000000
  i0: 0000000000c4f5a0 i1: 00000000900a2001 i2: 0000000000000000 i3: fffff800bf807b80
  i4: 0000000000000000 i5: fffff800bf807b80 i6: fffff800ac0eeb71 i7: 0000000000503438
  I7: <vm_map_ram+0x210/0x724>
  Call Trace:
  [<0000000000503438>] vm_map_ram+0x210/0x724
  [<00000000006661f8>] _xfs_buf_map_pages+0x58/0xa0
  [<0000000000667058>] xfs_buf_get_map+0x668/0x7a4
  [<00000000006673e0>] xfs_buf_read_map+0x20/0x160
  [<0000000000667548>] xfs_buf_readahead_map+0x28/0x38
  [<000000000067a4f8>] xfs_iwalk_ichunk_ra.isra.0+0xa8/0xc4
  [<000000000067a8f0>] xfs_iwalk_ag+0x1c0/0x260
  [<000000000067ab08>] xfs_iwalk+0xdc/0x130
  [<0000000000679fc8>] xfs_bulkstat+0x10c/0x140
  [<0000000000695528>] xfs_compat_ioc_fsbulkstat+0x1a4/0x1e8
  [<000000000069572c>] xfs_file_compat_ioctl+0x8c/0x1f4
  [<0000000000534ab0>] compat_sys_ioctl+0x9c/0xfc
  [<0000000000406214>] linux_sparc_syscall32+0x34/0x60
  Disabling lock debugging due to kernel taint
  Caller[0000000000503438]: vm_map_ram+0x210/0x724
  Caller[00000000006661f8]: _xfs_buf_map_pages+0x58/0xa0
  Caller[0000000000667058]: xfs_buf_get_map+0x668/0x7a4
  Caller[00000000006673e0]: xfs_buf_read_map+0x20/0x160
  Caller[0000000000667548]: xfs_buf_readahead_map+0x28/0x38
  Caller[000000000067a4f8]: xfs_iwalk_ichunk_ra.isra.0+0xa8/0xc4
  Caller[000000000067a8f0]: xfs_iwalk_ag+0x1c0/0x260
  Caller[000000000067ab08]: xfs_iwalk+0xdc/0x130
  Caller[0000000000679fc8]: xfs_bulkstat+0x10c/0x140
  Caller[0000000000695528]: xfs_compat_ioc_fsbulkstat+0x1a4/0x1e8
  Caller[000000000069572c]: xfs_file_compat_ioctl+0x8c/0x1f4
  Caller[0000000000534ab0]: compat_sys_ioctl+0x9c/0xfc
  Caller[0000000000406214]: linux_sparc_syscall32+0x34/0x60
  Caller[00000000f789ccdc]: 0xf789ccdc
  Instruction DUMP:
   8610e0c0
   8400c002
   c458a0f8
  <f6704002>
   c206e008
   80a06000
   12400012
   01000000
   81408000

Let me know if you need any more info!

Thanks,
  Nick


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: PROBLEM: kernel crashes when running xfsdump since ~6.4
  2024-06-20  6:19 PROBLEM: kernel crashes when running xfsdump since ~6.4 Nick Bowler
@ 2024-06-20  6:37 ` Hailong Liu
  2024-06-20 14:36   ` Nick Bowler
  2024-06-20 18:02 ` Nick Bowler
  1 sibling, 1 reply; 39+ messages in thread
From: Hailong Liu @ 2024-06-20  6:37 UTC (permalink / raw)
  To: Nick Bowler
  Cc: linux-kernel, Linux regressions mailing list, linux-mm,
	sparclinux, Uladzislau Rezki (Sony),
	Andrew Morton

On Thu, 20. Jun 02:19, Nick Bowler wrote:
> Hi,
>
> After upgrading my sparc to 6.9.5 I noticed that attempting to run
> xfsdump instantly (within a couple seconds) and reliably crashes the
> kernel.  The same problem is also observed on 6.10-rc4.
>
> This is a regression introduced around 6.4 timeframe.  6.3 appears
> to work fine and xfsdump goes about its business dumping stuff.
>
> Bisection implicates the following:
>
>   062eacf57ad91b5c272f89dc964fd6dd9715ea7d is the first bad commit
>   commit 062eacf57ad91b5c272f89dc964fd6dd9715ea7d
>   Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
>   Date:   Thu Mar 30 21:06:38 2023 +0200
>
>       mm: vmalloc: remove a global vmap_blocks xarray
>
> This reverts pretty easily on top of v6.10-rc4, as long as I first
> revert fa1c77c13ca5 ("mm: vmalloc: rename addr_to_vb_xarray() function")
> as this just causes conflicts.  Then there is one easily-corrected build
> failure (adjust the one remaining &vbq->vmap_blocks back to &vmap_blocks).
>
> If I do all of that then the kernel is not crashing anymore.
>
> A splat like this one is output on the console when the crash occurs (varies a bit):
>
>   spitfire_data_access_exception: SFSR[000000000080100d] SFAR[0000000000c51ba0], going.
>                 \|/ ____ \|/
>                 "@'/ .. \`@"
>                 /_| \__/ |_\
>                    \__U_/
>   xfsdump(2028): Dax [#1]
>   CPU: 0 PID: 2028 Comm: xfsdump Not tainted 6.9.5 #199
>   TSTATE: 0000000811001607 TPC: 0000000000974fc4 TNPC: 0000000000974fc8 Y: 00000000    Not tainted
>   TPC: <queued_spin_lock_slowpath+0x1d0/0x2cc>
>   g0: 0000000000aa9110 g1: 0000000000c51ba0 g2: 444b000000000000 g3: 0000000000c560c0
>   g4: fffff800a71a1f00 g5: fffff800bebb6000 g6: fffff800ac0ec000 g7: 0000000000040000
>   o0: 0000000000000002 o1: 00000000000007d8 o2: fffff800a4131420 o3: ffffffff0000ffff
>   o4: 00000000900a2001 o5: 0000000000c4f5a0 sp: fffff800ac0eeac1 ret_pc: 0000000000040000
>   RPC: <0x40000>
>   l0: fffff800a40098c0 l1: 0000000100800000 l2: 0000000000000000 l3: 0000000000000103
>   l4: fffff800a40081b0 l5: 0000000000aeec00 l6: fffff800a40080a0 l7: 0000000101000000
>   i0: 0000000000c4f5a0 i1: 00000000900a2001 i2: 0000000000000000 i3: fffff800bf807b80
>   i4: 0000000000000000 i5: fffff800bf807b80 i6: fffff800ac0eeb71 i7: 0000000000503438
>   I7: <vm_map_ram+0x210/0x724>
>   Call Trace:
>   [<0000000000503438>] vm_map_ram+0x210/0x724
>   [<00000000006661f8>] _xfs_buf_map_pages+0x58/0xa0
>   [<0000000000667058>] xfs_buf_get_map+0x668/0x7a4
>   [<00000000006673e0>] xfs_buf_read_map+0x20/0x160
>   [<0000000000667548>] xfs_buf_readahead_map+0x28/0x38
>   [<000000000067a4f8>] xfs_iwalk_ichunk_ra.isra.0+0xa8/0xc4
>   [<000000000067a8f0>] xfs_iwalk_ag+0x1c0/0x260
>   [<000000000067ab08>] xfs_iwalk+0xdc/0x130
>   [<0000000000679fc8>] xfs_bulkstat+0x10c/0x140
>   [<0000000000695528>] xfs_compat_ioc_fsbulkstat+0x1a4/0x1e8
>   [<000000000069572c>] xfs_file_compat_ioctl+0x8c/0x1f4
>   [<0000000000534ab0>] compat_sys_ioctl+0x9c/0xfc
>   [<0000000000406214>] linux_sparc_syscall32+0x34/0x60
>   Disabling lock debugging due to kernel taint
>   Caller[0000000000503438]: vm_map_ram+0x210/0x724
>   Caller[00000000006661f8]: _xfs_buf_map_pages+0x58/0xa0
>   Caller[0000000000667058]: xfs_buf_get_map+0x668/0x7a4
>   Caller[00000000006673e0]: xfs_buf_read_map+0x20/0x160
>   Caller[0000000000667548]: xfs_buf_readahead_map+0x28/0x38
>   Caller[000000000067a4f8]: xfs_iwalk_ichunk_ra.isra.0+0xa8/0xc4
>   Caller[000000000067a8f0]: xfs_iwalk_ag+0x1c0/0x260
>   Caller[000000000067ab08]: xfs_iwalk+0xdc/0x130
>   Caller[0000000000679fc8]: xfs_bulkstat+0x10c/0x140
>   Caller[0000000000695528]: xfs_compat_ioc_fsbulkstat+0x1a4/0x1e8
>   Caller[000000000069572c]: xfs_file_compat_ioctl+0x8c/0x1f4
>   Caller[0000000000534ab0]: compat_sys_ioctl+0x9c/0xfc
>   Caller[0000000000406214]: linux_sparc_syscall32+0x34/0x60
>   Caller[00000000f789ccdc]: 0xf789ccdc
>   Instruction DUMP:
>    8610e0c0
>    8400c002
>    c458a0f8
>   <f6704002>
>    c206e008
>    80a06000
>    12400012
>    01000000
>    81408000
>
> Let me know if you need any more info!
>
> Thanks,
>   Nick
>
I guess you can patch this
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-hotfixes-unstable&id=00468d41c20cac748c2e4bfcf003283d554673f5

--
help you, help me,
Hailong.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: PROBLEM: kernel crashes when running xfsdump since ~6.4
  2024-06-20  6:37 ` Hailong Liu
@ 2024-06-20 14:36   ` Nick Bowler
  0 siblings, 0 replies; 39+ messages in thread
From: Nick Bowler @ 2024-06-20 14:36 UTC (permalink / raw)
  To: Hailong Liu
  Cc: linux-kernel, Linux regressions mailing list, linux-mm,
	sparclinux, Uladzislau Rezki (Sony),
	Andrew Morton

On 2024-06-20 02:37, Hailong Liu wrote:
> On Thu, 20. Jun 02:19, Nick Bowler wrote:
>> After upgrading my sparc to 6.9.5 I noticed that attempting to run
>> xfsdump instantly (within a couple seconds) and reliably crashes the
>> kernel.  The same problem is also observed on 6.10-rc4.
[...]
>>   062eacf57ad91b5c272f89dc964fd6dd9715ea7d is the first bad commit
>>   commit 062eacf57ad91b5c272f89dc964fd6dd9715ea7d
>>   Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
>>   Date:   Thu Mar 30 21:06:38 2023 +0200
>>
>>       mm: vmalloc: remove a global vmap_blocks xarray
[...]
>>   spitfire_data_access_exception: SFSR[000000000080100d] SFAR[0000000000c51ba0], going.
>>                 \|/ ____ \|/
>>                 "@'/ .. \`@"
>>                 /_| \__/ |_\
>>                    \__U_/
>>   xfsdump(2028): Dax [#1]
>>   CPU: 0 PID: 2028 Comm: xfsdump Not tainted 6.9.5 #199
>>   TSTATE: 0000000811001607 TPC: 0000000000974fc4 TNPC: 0000000000974fc8 Y: 00000000    Not tainted
>>   TPC: <queued_spin_lock_slowpath+0x1d0/0x2cc>
[...]
> I guess you can patch this
> https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-hotfixes-unstable&id=00468d41c20cac748c2e4bfcf003283d554673f5

I tried with that patch applied on top of 6.10-rc4, and the crash still
occurs.  There is no obvious change in behaviour.

Thanks,
  Nick


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: PROBLEM: kernel crashes when running xfsdump since ~6.4
  2024-06-20  6:19 PROBLEM: kernel crashes when running xfsdump since ~6.4 Nick Bowler
  2024-06-20  6:37 ` Hailong Liu
@ 2024-06-20 18:02 ` Nick Bowler
  2024-06-21  3:30   ` Hailong Liu
  2024-06-24 12:20   ` Uladzislau Rezki
  1 sibling, 2 replies; 39+ messages in thread
From: Nick Bowler @ 2024-06-20 18:02 UTC (permalink / raw)
  To: linux-kernel, Linux regressions mailing list, linux-mm,
	sparclinux, Uladzislau Rezki (Sony),
	Andrew Morton

On 2024-06-20 02:19, Nick Bowler wrote:
> After upgrading my sparc to 6.9.5 I noticed that attempting to run
> xfsdump instantly (within a couple seconds) and reliably crashes the
> kernel.  The same problem is also observed on 6.10-rc4.
[...]
>   062eacf57ad91b5c272f89dc964fd6dd9715ea7d is the first bad commit
>   commit 062eacf57ad91b5c272f89dc964fd6dd9715ea7d
>   Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
>   Date:   Thu Mar 30 21:06:38 2023 +0200
>   
>       mm: vmalloc: remove a global vmap_blocks xarray

I think I might see what is happening here.

On this machine, there are two CPUs numbered 0 and 2 (there is no CPU1).

The per-cpu variables in mm/vmalloc.c are initialized like this, in
vmalloc_init

  for_each_possible_cpu(i) {
    /* ... */
    vbq = &per_cpu(vmap_block_queue, i);
    /* initialize stuff in vbq */
  }

This loops over the set bits of cpu_possible_mask, bits 0 and 2 are set,
so it initializes stuff with i=0 and i=2, skipping i=1 (I added prints to
confirm this).

Then, in vm_map_ram, with the problematic change it calls the new
function addr_to_vb_xa, which does this:

  int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
  return &per_cpu(vmap_block_queue, index).vmap_blocks;

The num_possible_cpus() function counts the number of set bits in
cpu_possible_mask, so it returns 2.  Thus, index is either 0 or 1, which
does not correspond to what was initialized (0 or 2).  The crash occurs
when the computed index is 1 in this function.  In this case, the
returned value appears to be garbage (I added prints to confirm this).

If I change addr_to_vb_xa function to this:

  int index = ((addr / VMAP_BLOCK_SIZE) & 1) << 1; /* 0 or 2 */
  return &per_cpu(vmap_block_queue, index).vmap_blocks;

xfsdump is working again.

Cheers,
  Nick

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: PROBLEM: kernel crashes when running xfsdump since ~6.4
  2024-06-20 18:02 ` Nick Bowler
@ 2024-06-21  3:30   ` Hailong Liu
  2024-06-21  7:07     ` Baoquan He
  2024-06-24 12:20   ` Uladzislau Rezki
  1 sibling, 1 reply; 39+ messages in thread
From: Hailong Liu @ 2024-06-21  3:30 UTC (permalink / raw)
  To: Nick Bowler
  Cc: linux-kernel, Linux regressions mailing list, linux-mm,
	sparclinux, Baoquan He, Uladzislau Rezki (Sony),
	Andrew Morton

On Thu, 20. Jun 14:02, Nick Bowler wrote:
> On 2024-06-20 02:19, Nick Bowler wrote:
> > After upgrading my sparc to 6.9.5 I noticed that attempting to run
> > xfsdump instantly (within a couple seconds) and reliably crashes the
> > kernel.  The same problem is also observed on 6.10-rc4.
> [...]
> >   062eacf57ad91b5c272f89dc964fd6dd9715ea7d is the first bad commit
> >   commit 062eacf57ad91b5c272f89dc964fd6dd9715ea7d
> >   Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
> >   Date:   Thu Mar 30 21:06:38 2023 +0200
> >
> >       mm: vmalloc: remove a global vmap_blocks xarray
>
> I think I might see what is happening here.
>
> On this machine, there are two CPUs numbered 0 and 2 (there is no CPU1).
>
+Baoquan

Ahh, I thought you are right. addr_to_vb_xa assume that the CPU numbers are
contiguous. I don't have knowledge about CPU at all.
Technically change the implement addr_to_vb_xa() to
return &per_cpu(vmap_block_queue, raw_smp_processor_id()).vmap_blocks;
would also work, but it violate the load balance. Wating for
experts reply.

> The per-cpu variables in mm/vmalloc.c are initialized like this, in
> vmalloc_init
>
>   for_each_possible_cpu(i) {
>     /* ... */
>     vbq = &per_cpu(vmap_block_queue, i);
>     /* initialize stuff in vbq */
>   }
>
> This loops over the set bits of cpu_possible_mask, bits 0 and 2 are set,
> so it initializes stuff with i=0 and i=2, skipping i=1 (I added prints to
> confirm this).
>
> Then, in vm_map_ram, with the problematic change it calls the new
> function addr_to_vb_xa, which does this:
>
>   int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
>   return &per_cpu(vmap_block_queue, index).vmap_blocks;
>
> The num_possible_cpus() function counts the number of set bits in
> cpu_possible_mask, so it returns 2.  Thus, index is either 0 or 1, which
> does not correspond to what was initialized (0 or 2).  The crash occurs
> when the computed index is 1 in this function.  In this case, the
> returned value appears to be garbage (I added prints to confirm this).
>
> If I change addr_to_vb_xa function to this:
>
>   int index = ((addr / VMAP_BLOCK_SIZE) & 1) << 1; /* 0 or 2 */
>   return &per_cpu(vmap_block_queue, index).vmap_blocks;
>
> xfsdump is working again.
>
> Cheers,
>   Nick
>
>

--
help you, help me,
Hailong.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: PROBLEM: kernel crashes when running xfsdump since ~6.4
  2024-06-21  3:30   ` Hailong Liu
@ 2024-06-21  7:07     ` Baoquan He
  2024-06-21  9:44       ` Uladzislau Rezki
  0 siblings, 1 reply; 39+ messages in thread
From: Baoquan He @ 2024-06-21  7:07 UTC (permalink / raw)
  To: Hailong Liu, Nick Bowler
  Cc: linux-kernel, Linux regressions mailing list, linux-mm,
	sparclinux, Uladzislau Rezki (Sony),
	Andrew Morton

On 06/21/24 at 11:30am, Hailong Liu wrote:
> On Thu, 20. Jun 14:02, Nick Bowler wrote:
> > On 2024-06-20 02:19, Nick Bowler wrote:
> > > After upgrading my sparc to 6.9.5 I noticed that attempting to run
> > > xfsdump instantly (within a couple seconds) and reliably crashes the
> > > kernel.  The same problem is also observed on 6.10-rc4.
> > [...]
> > >   062eacf57ad91b5c272f89dc964fd6dd9715ea7d is the first bad commit
> > >   commit 062eacf57ad91b5c272f89dc964fd6dd9715ea7d
> > >   Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > >   Date:   Thu Mar 30 21:06:38 2023 +0200
> > >
> > >       mm: vmalloc: remove a global vmap_blocks xarray
> >
> > I think I might see what is happening here.
> >
> > On this machine, there are two CPUs numbered 0 and 2 (there is no CPU1).
> >
> +Baoquan

Thanks for adding me, Hailong.

> 
> Ahh, I thought you are right. addr_to_vb_xa assume that the CPU numbers are
> contiguous. I don't have knowledge about CPU at all.
> Technically change the implement addr_to_vb_xa() to
> return &per_cpu(vmap_block_queue, raw_smp_processor_id()).vmap_blocks;
> would also work, but it violate the load balance. Wating for
> experts reply.

Yeah, I think so as you explained.

> 
> > The per-cpu variables in mm/vmalloc.c are initialized like this, in
> > vmalloc_init
> >
> >   for_each_possible_cpu(i) {
> >     /* ... */
> >     vbq = &per_cpu(vmap_block_queue, i);
> >     /* initialize stuff in vbq */
> >   }
> >
> > This loops over the set bits of cpu_possible_mask, bits 0 and 2 are set,
> > so it initializes stuff with i=0 and i=2, skipping i=1 (I added prints to
> > confirm this).
> >
> > Then, in vm_map_ram, with the problematic change it calls the new
> > function addr_to_vb_xa, which does this:
> >
> >   int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> >   return &per_cpu(vmap_block_queue, index).vmap_blocks;
> >
> > The num_possible_cpus() function counts the number of set bits in
> > cpu_possible_mask, so it returns 2.  Thus, index is either 0 or 1, which
> > does not correspond to what was initialized (0 or 2).  The crash occurs
> > when the computed index is 1 in this function.  In this case, the
> > returned value appears to be garbage (I added prints to confirm this).

This is a great catch. 

> >
> > If I change addr_to_vb_xa function to this:
> >
> >   int index = ((addr / VMAP_BLOCK_SIZE) & 1) << 1; /* 0 or 2 */
> >   return &per_cpu(vmap_block_queue, index).vmap_blocks;

Yeah, while above change is not generic, e.g if it's CPU0 and CPU3.
I think we should take the max possible CPU number as the hush bucket
size. The vb->va is also got from global free_vmap_area, so no need to
worry about the waste.

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index be2dd281ea76..18e87cafbaf2 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2542,7 +2542,7 @@ static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
 static struct xarray *
 addr_to_vb_xa(unsigned long addr)
 {
-	int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
+	int index = (addr / VMAP_BLOCK_SIZE) % nr_cpu_ids;
 
 	return &per_cpu(vmap_block_queue, index).vmap_blocks;
 }



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: PROBLEM: kernel crashes when running xfsdump since ~6.4
  2024-06-21  7:07     ` Baoquan He
@ 2024-06-21  9:44       ` Uladzislau Rezki
  2024-06-21 10:45         ` Hailong Liu
                           ` (3 more replies)
  0 siblings, 4 replies; 39+ messages in thread
From: Uladzislau Rezki @ 2024-06-21  9:44 UTC (permalink / raw)
  To: Baoquan He, Nick Bowler
  Cc: Hailong Liu, Nick Bowler, linux-kernel,
	Linux regressions mailing list, linux-mm, sparclinux,
	Uladzislau Rezki (Sony),
	Andrew Morton

On Fri, Jun 21, 2024 at 03:07:16PM +0800, Baoquan He wrote:
> On 06/21/24 at 11:30am, Hailong Liu wrote:
> > On Thu, 20. Jun 14:02, Nick Bowler wrote:
> > > On 2024-06-20 02:19, Nick Bowler wrote:
> > > > After upgrading my sparc to 6.9.5 I noticed that attempting to run
> > > > xfsdump instantly (within a couple seconds) and reliably crashes the
> > > > kernel.  The same problem is also observed on 6.10-rc4.
> > > [...]
> > > >   062eacf57ad91b5c272f89dc964fd6dd9715ea7d is the first bad commit
> > > >   commit 062eacf57ad91b5c272f89dc964fd6dd9715ea7d
> > > >   Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > >   Date:   Thu Mar 30 21:06:38 2023 +0200
> > > >
> > > >       mm: vmalloc: remove a global vmap_blocks xarray
> > >
> > > I think I might see what is happening here.
> > >
> > > On this machine, there are two CPUs numbered 0 and 2 (there is no CPU1).
> > >
> > +Baoquan
> 
> Thanks for adding me, Hailong.
> 
> > 
> > Ahh, I thought you are right. addr_to_vb_xa assume that the CPU numbers are
> > contiguous. I don't have knowledge about CPU at all.
> > Technically change the implement addr_to_vb_xa() to
> > return &per_cpu(vmap_block_queue, raw_smp_processor_id()).vmap_blocks;
> > would also work, but it violate the load balance. Wating for
> > experts reply.
> 
> Yeah, I think so as you explained.
> 
> > 
> > > The per-cpu variables in mm/vmalloc.c are initialized like this, in
> > > vmalloc_init
> > >
> > >   for_each_possible_cpu(i) {
> > >     /* ... */
> > >     vbq = &per_cpu(vmap_block_queue, i);
> > >     /* initialize stuff in vbq */
> > >   }
> > >
> > > This loops over the set bits of cpu_possible_mask, bits 0 and 2 are set,
> > > so it initializes stuff with i=0 and i=2, skipping i=1 (I added prints to
> > > confirm this).
> > >
> > > Then, in vm_map_ram, with the problematic change it calls the new
> > > function addr_to_vb_xa, which does this:
> > >
> > >   int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> > >   return &per_cpu(vmap_block_queue, index).vmap_blocks;
> > >
> > > The num_possible_cpus() function counts the number of set bits in
> > > cpu_possible_mask, so it returns 2.  Thus, index is either 0 or 1, which
> > > does not correspond to what was initialized (0 or 2).  The crash occurs
> > > when the computed index is 1 in this function.  In this case, the
> > > returned value appears to be garbage (I added prints to confirm this).
> 
> This is a great catch. 
> 
Indeed :)

> > >
> > > If I change addr_to_vb_xa function to this:
> > >
> > >   int index = ((addr / VMAP_BLOCK_SIZE) & 1) << 1; /* 0 or 2 */
> > >   return &per_cpu(vmap_block_queue, index).vmap_blocks;
> 
> Yeah, while above change is not generic, e.g if it's CPU0 and CPU3.
> I think we should take the max possible CPU number as the hush bucket
> size. The vb->va is also got from global free_vmap_area, so no need to
> worry about the waste.
>
Agree.

> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index be2dd281ea76..18e87cafbaf2 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2542,7 +2542,7 @@ static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
>  static struct xarray *
>  addr_to_vb_xa(unsigned long addr)
>  {
> -	int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> +	int index = (addr / VMAP_BLOCK_SIZE) % nr_cpu_ids;
>  
>  	return &per_cpu(vmap_block_queue, index).vmap_blocks;
>  }
> 
The problem i see is about not-initializing of the:
<snip>
	for_each_possible_cpu(i) {
		struct vmap_block_queue *vbq;
		struct vfree_deferred *p;

		vbq = &per_cpu(vmap_block_queue, i);
		spin_lock_init(&vbq->lock);
		INIT_LIST_HEAD(&vbq->free);
		p = &per_cpu(vfree_deferred, i);
		init_llist_head(&p->list);
		INIT_WORK(&p->wq, delayed_vfree_work);
		xa_init(&vbq->vmap_blocks);
	}
<snip>

correctly or fully. It is my bad i did not think that CPUs in a possible mask
can be non sequential :-/

nr_cpu_ids - is not the max possible CPU. For example, in Nick case,
when he has two CPUs, num_possible_cpus() and nr_cpu_ids are the same.

Or i missed something in your patch, Baoquan?

--
Uladzislau Rezki


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: PROBLEM: kernel crashes when running xfsdump since ~6.4
  2024-06-21  9:44       ` Uladzislau Rezki
@ 2024-06-21 10:45         ` Hailong Liu
  2024-06-21 11:15         ` Hailong Liu
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 39+ messages in thread
From: Hailong Liu @ 2024-06-21 10:45 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Baoquan He, Nick Bowler, linux-kernel,
	Linux regressions mailing list, linux-mm, sparclinux,
	Andrew Morton

On Fri, 21. Jun 11:44, Uladzislau Rezki wrote:
> On Fri, Jun 21, 2024 at 03:07:16PM +0800, Baoquan He wrote:
> > On 06/21/24 at 11:30am, Hailong Liu wrote:
> > > On Thu, 20. Jun 14:02, Nick Bowler wrote:
> > > > On 2024-06-20 02:19, Nick Bowler wrote:
> > > > > After upgrading my sparc to 6.9.5 I noticed that attempting to run
> > > > > xfsdump instantly (within a couple seconds) and reliably crashes the
> > > > > kernel.  The same problem is also observed on 6.10-rc4.
> > > > [...]
> > > > >   062eacf57ad91b5c272f89dc964fd6dd9715ea7d is the first bad commit
> > > > >   commit 062eacf57ad91b5c272f89dc964fd6dd9715ea7d
> > > > >   Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > > >   Date:   Thu Mar 30 21:06:38 2023 +0200
> > > > >
> > > > >       mm: vmalloc: remove a global vmap_blocks xarray
> > > >
> > > > I think I might see what is happening here.
> > > >
> > > > On this machine, there are two CPUs numbered 0 and 2 (there is no CPU1).
> > > >
> > > +Baoquan
> >
> > Thanks for adding me, Hailong.
> >
> > >
> > > Ahh, I thought you are right. addr_to_vb_xa assume that the CPU numbers are
> > > contiguous. I don't have knowledge about CPU at all.
> > > Technically change the implement addr_to_vb_xa() to
> > > return &per_cpu(vmap_block_queue, raw_smp_processor_id()).vmap_blocks;
> > > would also work, but it violate the load balance. Wating for
> > > experts reply.
> >
> > Yeah, I think so as you explained.
> >
> > >
> > > > The per-cpu variables in mm/vmalloc.c are initialized like this, in
> > > > vmalloc_init
> > > >
> > > >   for_each_possible_cpu(i) {
> > > >     /* ... */
> > > >     vbq = &per_cpu(vmap_block_queue, i);
> > > >     /* initialize stuff in vbq */
> > > >   }
> > > >
> > > > This loops over the set bits of cpu_possible_mask, bits 0 and 2 are set,
> > > > so it initializes stuff with i=0 and i=2, skipping i=1 (I added prints to
> > > > confirm this).
> > > >
> > > > Then, in vm_map_ram, with the problematic change it calls the new
> > > > function addr_to_vb_xa, which does this:
> > > >
> > > >   int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> > > >   return &per_cpu(vmap_block_queue, index).vmap_blocks;
> > > >
> > > > The num_possible_cpus() function counts the number of set bits in
> > > > cpu_possible_mask, so it returns 2.  Thus, index is either 0 or 1, which
> > > > does not correspond to what was initialized (0 or 2).  The crash occurs
> > > > when the computed index is 1 in this function.  In this case, the
> > > > returned value appears to be garbage (I added prints to confirm this).
> >
> > This is a great catch.
> >
> Indeed :)
>
> > > >
> > > > If I change addr_to_vb_xa function to this:
> > > >
> > > >   int index = ((addr / VMAP_BLOCK_SIZE) & 1) << 1; /* 0 or 2 */
> > > >   return &per_cpu(vmap_block_queue, index).vmap_blocks;
> >
> > Yeah, while above change is not generic, e.g if it's CPU0 and CPU3.
> > I think we should take the max possible CPU number as the hush bucket
> > size. The vb->va is also got from global free_vmap_area, so no need to
> > worry about the waste.
> >
> Agree.
>
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index be2dd281ea76..18e87cafbaf2 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -2542,7 +2542,7 @@ static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
> >  static struct xarray *
> >  addr_to_vb_xa(unsigned long addr)
> >  {
> > -	int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> > +	int index = (addr / VMAP_BLOCK_SIZE) % nr_cpu_ids;
> >
> >  	return &per_cpu(vmap_block_queue, index).vmap_blocks;
> >  }
> >
> The problem i see is about not-initializing of the:
> <snip>
> 	for_each_possible_cpu(i) {
> 		struct vmap_block_queue *vbq;
> 		struct vfree_deferred *p;
>
> 		vbq = &per_cpu(vmap_block_queue, i);
> 		spin_lock_init(&vbq->lock);
> 		INIT_LIST_HEAD(&vbq->free);
> 		p = &per_cpu(vfree_deferred, i);
> 		init_llist_head(&p->list);
> 		INIT_WORK(&p->wq, delayed_vfree_work);
> 		xa_init(&vbq->vmap_blocks);
> 	}
> <snip>
>

Sorry to trouble you. do you mean as follows:
@@ -4480,7 +4480,7 @@ void __init vmalloc_init(void)
         */
        vmap_area_cachep = KMEM_CACHE(vmap_area, SLAB_PANIC);

-       for_each_possible_cpu(i) {
+       for (i = 0; i < nr_cpu_ids; i++) {
                struct vmap_block_queue *vbq;
                struct vfree_deferred *p;

That’s exactly what I was thinking for the first solution as
well. However, Do we really need this for the impossible CPU's variable
initialization? How about using the index to find the CPU?
Just like the following, but this is not the optimal solution, and it
also wastes a lot of CPU time :).

@@ -1994,8 +1994,12 @@ static struct xarray *
 addr_to_vb_xa(unsigned long addr)
 {
        int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
+       int cpu;

-       return &per_cpu(vmap_block_queue, index).vmap_blocks;
+       for_each_possible_cpu(cpu)
+               if (!index--)
+                       break;
+       return &per_cpu(vmap_block_queue, cpu).vmap_blocks;

BTW, Using raw_smp_processor_id in this manner is incorrect; that’s my mistake.

> correctly or fully. It is my bad i did not think that CPUs in a possible mask
> can be non sequential :-/
>
> nr_cpu_ids - is not the max possible CPU. For example, in Nick case,
> when he has two CPUs, num_possible_cpus() and nr_cpu_ids are the same.
>
> Or i missed something in your patch, Baoquan?
>
> --
> Uladzislau Rezki

--
help you, help me,
Hailong.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: PROBLEM: kernel crashes when running xfsdump since ~6.4
  2024-06-21  9:44       ` Uladzislau Rezki
  2024-06-21 10:45         ` Hailong Liu
@ 2024-06-21 11:15         ` Hailong Liu
  2024-06-24 12:18           ` Uladzislau Rezki
  2024-06-21 13:42         ` Michael Kelley
  2024-06-21 14:02         ` Baoquan He
  3 siblings, 1 reply; 39+ messages in thread
From: Hailong Liu @ 2024-06-21 11:15 UTC (permalink / raw)
  To: Uladzislau Rezki, Baoquan He
  Cc: Nick Bowler, linux-kernel, Linux regressions mailing list,
	linux-mm, sparclinux, Andrew Morton

On Fri, 21. Jun 11:44, Uladzislau Rezki wrote:
> On Fri, Jun 21, 2024 at 03:07:16PM +0800, Baoquan He wrote:
> > On 06/21/24 at 11:30am, Hailong Liu wrote:
> > > On Thu, 20. Jun 14:02, Nick Bowler wrote:
> > > > On 2024-06-20 02:19, Nick Bowler wrote:
> > > > > After upgrading my sparc to 6.9.5 I noticed that attempting to run
> > > > > xfsdump instantly (within a couple seconds) and reliably crashes the
> > > > > kernel.  The same problem is also observed on 6.10-rc4.
> > > > [...]
> > > > >   062eacf57ad91b5c272f89dc964fd6dd9715ea7d is the first bad commit
> > > > >   commit 062eacf57ad91b5c272f89dc964fd6dd9715ea7d
> > > > >   Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > > >   Date:   Thu Mar 30 21:06:38 2023 +0200
> > > > >
> > > > >       mm: vmalloc: remove a global vmap_blocks xarray
> > > >
> > > > I think I might see what is happening here.
> > > >
> > > > On this machine, there are two CPUs numbered 0 and 2 (there is no CPU1).
> > > >
> > > +Baoquan
> >
> > Thanks for adding me, Hailong.
> >
> > >
> > > Ahh, I thought you are right. addr_to_vb_xa assume that the CPU numbers are
> > > contiguous. I don't have knowledge about CPU at all.
> > > Technically change the implement addr_to_vb_xa() to
> > > return &per_cpu(vmap_block_queue, raw_smp_processor_id()).vmap_blocks;
> > > would also work, but it violate the load balance. Wating for
> > > experts reply.
> >
> > Yeah, I think so as you explained.
> >
> > >
> > > > The per-cpu variables in mm/vmalloc.c are initialized like this, in
> > > > vmalloc_init
> > > >
> > > >   for_each_possible_cpu(i) {
> > > >     /* ... */
> > > >     vbq = &per_cpu(vmap_block_queue, i);
> > > >     /* initialize stuff in vbq */
> > > >   }
> > > >
> > > > This loops over the set bits of cpu_possible_mask, bits 0 and 2 are set,
> > > > so it initializes stuff with i=0 and i=2, skipping i=1 (I added prints to
> > > > confirm this).
> > > >
> > > > Then, in vm_map_ram, with the problematic change it calls the new
> > > > function addr_to_vb_xa, which does this:
> > > >
> > > >   int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> > > >   return &per_cpu(vmap_block_queue, index).vmap_blocks;
> > > >
> > > > The num_possible_cpus() function counts the number of set bits in
> > > > cpu_possible_mask, so it returns 2.  Thus, index is either 0 or 1, which
> > > > does not correspond to what was initialized (0 or 2).  The crash occurs
> > > > when the computed index is 1 in this function.  In this case, the
> > > > returned value appears to be garbage (I added prints to confirm this).
> >
> > This is a great catch.
> >
> Indeed :)
>
> > > >
> > > > If I change addr_to_vb_xa function to this:
> > > >
> > > >   int index = ((addr / VMAP_BLOCK_SIZE) & 1) << 1; /* 0 or 2 */
> > > >   return &per_cpu(vmap_block_queue, index).vmap_blocks;
> >
> > Yeah, while above change is not generic, e.g if it's CPU0 and CPU3.
> > I think we should take the max possible CPU number as the hush bucket
> > size. The vb->va is also got from global free_vmap_area, so no need to
> > worry about the waste.
> >
> Agree.
>
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index be2dd281ea76..18e87cafbaf2 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -2542,7 +2542,7 @@ static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
> >  static struct xarray *
> >  addr_to_vb_xa(unsigned long addr)
> >  {
> > -	int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> > +	int index = (addr / VMAP_BLOCK_SIZE) % nr_cpu_ids;
> >
> >  	return &per_cpu(vmap_block_queue, index).vmap_blocks;
> >  }
> >
> The problem i see is about not-initializing of the:
> <snip>
> 	for_each_possible_cpu(i) {
> 		struct vmap_block_queue *vbq;
> 		struct vfree_deferred *p;
>
> 		vbq = &per_cpu(vmap_block_queue, i);
> 		spin_lock_init(&vbq->lock);
> 		INIT_LIST_HEAD(&vbq->free);
> 		p = &per_cpu(vfree_deferred, i);
> 		init_llist_head(&p->list);
> 		INIT_WORK(&p->wq, delayed_vfree_work);
> 		xa_init(&vbq->vmap_blocks);
> 	}
> <snip>
>
> correctly or fully. It is my bad i did not think that CPUs in a possible mask
> can be non sequential :-/
>
> nr_cpu_ids - is not the max possible CPU. For example, in Nick case,
> when he has two CPUs, num_possible_cpus() and nr_cpu_ids are the same.
>
> Or i missed something in your patch, Baoquan?
>
> --
> Uladzislau Rezki

IMO, I thought we can fix this by following.
It doesn't initialize unused variables and utilize the percpu xarray. If I said
anything wrong, please do let me know. I can learn a lot from you all :).


diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 11fe5ea208aa..f9f981674b2d 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -4480,17 +4480,21 @@ void __init vmalloc_init(void)
         */
        vmap_area_cachep = KMEM_CACHE(vmap_area, SLAB_PANIC);

-       for_each_possible_cpu(i) {
+       for (i = 0; i < nr_cpu_ids; i++) {
                struct vmap_block_queue *vbq;
                struct vfree_deferred *p;

                vbq = &per_cpu(vmap_block_queue, i);
+               xa_init(&vbq->vmap_blocks);
+
+               if (!cpu_possible(i))
+                       continue;
+
                spin_lock_init(&vbq->lock);
                INIT_LIST_HEAD(&vbq->free);
                p = &per_cpu(vfree_deferred, i);
                init_llist_head(&p->list);
                INIT_WORK(&p->wq, delayed_vfree_work);
-               xa_init(&vbq->vmap_blocks);
        }

--
help you, help me,
Hailong.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: PROBLEM: kernel crashes when running xfsdump since ~6.4
  2024-06-21 11:15         ` Hailong Liu
@ 2024-06-24 12:18           ` Uladzislau Rezki
  2024-06-25  9:26             ` Hailong Liu
  0 siblings, 1 reply; 39+ messages in thread
From: Uladzislau Rezki @ 2024-06-24 12:18 UTC (permalink / raw)
  To: Hailong Liu
  Cc: Uladzislau Rezki, Baoquan He, Nick Bowler, linux-kernel,
	Linux regressions mailing list, linux-mm, sparclinux,
	Andrew Morton

> 
> IMO, I thought we can fix this by following.
> It doesn't initialize unused variables and utilize the percpu xarray. If I said
> anything wrong, please do let me know. I can learn a lot from you all :).
> 
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 11fe5ea208aa..f9f981674b2d 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -4480,17 +4480,21 @@ void __init vmalloc_init(void)
>          */
>         vmap_area_cachep = KMEM_CACHE(vmap_area, SLAB_PANIC);
> 
> -       for_each_possible_cpu(i) {
> +       for (i = 0; i < nr_cpu_ids; i++) {
>                 struct vmap_block_queue *vbq;
>                 struct vfree_deferred *p;
> 
>                 vbq = &per_cpu(vmap_block_queue, i);
> +               xa_init(&vbq->vmap_blocks);
> +
> +               if (!cpu_possible(i))
Why do you need such check?

Thanks!

--
Uladzislau Rezki


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: PROBLEM: kernel crashes when running xfsdump since ~6.4
  2024-06-24 12:18           ` Uladzislau Rezki
@ 2024-06-25  9:26             ` Hailong Liu
  2024-06-25  9:55               ` Uladzislau Rezki
  0 siblings, 1 reply; 39+ messages in thread
From: Hailong Liu @ 2024-06-25  9:26 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Baoquan He, Nick Bowler, linux-kernel,
	Linux regressions mailing list, linux-mm, sparclinux,
	Andrew Morton

On Mon, 24. Jun 14:18, Uladzislau Rezki wrote:
> >
> > IMO, I thought we can fix this by following.
> > It doesn't initialize unused variables and utilize the percpu xarray. If I said
> > anything wrong, please do let me know. I can learn a lot from you all :).
> >
> >
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 11fe5ea208aa..f9f981674b2d 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -4480,17 +4480,21 @@ void __init vmalloc_init(void)
> >          */
> >         vmap_area_cachep = KMEM_CACHE(vmap_area, SLAB_PANIC);
> >
> > -       for_each_possible_cpu(i) {
> > +       for (i = 0; i < nr_cpu_ids; i++) {
> >                 struct vmap_block_queue *vbq;
> >                 struct vfree_deferred *p;
> >
> >                 vbq = &per_cpu(vmap_block_queue, i);
> > +               xa_init(&vbq->vmap_blocks);
> > +
> > +               if (!cpu_possible(i))
> Why do you need such check?
IIUC, take this issue as example, cpumask is b101 and nr_cpu_id is
3, if i = 1, There is no need to initialize unused variables here;
initializing the xarray for the hash index is sufficient.
>
> Thanks!
>
> --
> Uladzislau Rezki

--
help you, help me,
Hailong.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: PROBLEM: kernel crashes when running xfsdump since ~6.4
  2024-06-25  9:26             ` Hailong Liu
@ 2024-06-25  9:55               ` Uladzislau Rezki
  0 siblings, 0 replies; 39+ messages in thread
From: Uladzislau Rezki @ 2024-06-25  9:55 UTC (permalink / raw)
  To: Hailong Liu, Baoquan He
  Cc: Uladzislau Rezki, Baoquan He, Nick Bowler, linux-kernel,
	Linux regressions mailing list, linux-mm, sparclinux,
	Andrew Morton

On Tue, Jun 25, 2024 at 05:26:01PM +0800, Hailong Liu wrote:
> On Mon, 24. Jun 14:18, Uladzislau Rezki wrote:
> > >
> > > IMO, I thought we can fix this by following.
> > > It doesn't initialize unused variables and utilize the percpu xarray. If I said
> > > anything wrong, please do let me know. I can learn a lot from you all :).
> > >
> > >
> > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > index 11fe5ea208aa..f9f981674b2d 100644
> > > --- a/mm/vmalloc.c
> > > +++ b/mm/vmalloc.c
> > > @@ -4480,17 +4480,21 @@ void __init vmalloc_init(void)
> > >          */
> > >         vmap_area_cachep = KMEM_CACHE(vmap_area, SLAB_PANIC);
> > >
> > > -       for_each_possible_cpu(i) {
> > > +       for (i = 0; i < nr_cpu_ids; i++) {
> > >                 struct vmap_block_queue *vbq;
> > >                 struct vfree_deferred *p;
> > >
> > >                 vbq = &per_cpu(vmap_block_queue, i);
> > > +               xa_init(&vbq->vmap_blocks);
> > > +
> > > +               if (!cpu_possible(i))
> > Why do you need such check?
> IIUC, take this issue as example, cpumask is b101 and nr_cpu_id is
> 3, if i = 1, There is no need to initialize unused variables here;
> initializing the xarray for the hash index is sufficient.
>
But. It does not make much sense to skip initialization or keep it "half"
initialized. At least we can guarantee that all data structures are properly
setup.

One concern is what Baoquan raised about per-cpu variables. In your
scenario, b101, accessing to second per-cpu variable is not allowed.

So, we need properly check that concern.

--
Uladzislau Rezki


^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: PROBLEM: kernel crashes when running xfsdump since ~6.4
  2024-06-21  9:44       ` Uladzislau Rezki
  2024-06-21 10:45         ` Hailong Liu
  2024-06-21 11:15         ` Hailong Liu
@ 2024-06-21 13:42         ` Michael Kelley
  2024-06-24 12:17           ` Uladzislau Rezki
  2024-06-21 14:02         ` Baoquan He
  3 siblings, 1 reply; 39+ messages in thread
From: Michael Kelley @ 2024-06-21 13:42 UTC (permalink / raw)
  To: Uladzislau Rezki, Baoquan He, Nick Bowler
  Cc: Hailong Liu, Nick Bowler, linux-kernel,
	Linux regressions mailing list, linux-mm, sparclinux,
	Andrew Morton

From: Uladzislau Rezki <urezki@gmail.com> Sent: Friday, June 21, 2024 2:44 AM
> 
> On Fri, Jun 21, 2024 at 03:07:16PM +0800, Baoquan He wrote:
> > On 06/21/24 at 11:30am, Hailong Liu wrote:
> > > On Thu, 20. Jun 14:02, Nick Bowler wrote:
> > > > On 2024-06-20 02:19, Nick Bowler wrote:

[snip]

> > > > The per-cpu variables in mm/vmalloc.c are initialized like this, in
> > > > vmalloc_init
> > > >
> > > >   for_each_possible_cpu(i) {
> > > >     /* ... */
> > > >     vbq = &per_cpu(vmap_block_queue, i);
> > > >     /* initialize stuff in vbq */
> > > >   }
> > > >
> > > > This loops over the set bits of cpu_possible_mask, bits 0 and 2 are set,
> > > > so it initializes stuff with i=0 and i=2, skipping i=1 (I added prints to
> > > > confirm this).
> > > >
> > > > Then, in vm_map_ram, with the problematic change it calls the new
> > > > function addr_to_vb_xa, which does this:
> > > >
> > > >   int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> > > >   return &per_cpu(vmap_block_queue, index).vmap_blocks;
> > > >
> > > > The num_possible_cpus() function counts the number of set bits in
> > > > cpu_possible_mask, so it returns 2.  Thus, index is either 0 or 1, which
> > > > does not correspond to what was initialized (0 or 2).  The crash occurs
> > > > when the computed index is 1 in this function.  In this case, the
> > > > returned value appears to be garbage (I added prints to confirm this).
> >
> > This is a great catch.
> >
> Indeed :)
> 

+1

More broadly, throughout kernel code there are a number of places
that incorrectly assume the cpu_possible_mask has no gaps in it. The
typical case does kcalloc() or kmalloc_array() with num_possible_cpus()
as the first argument, then indexes into the allocated array with a CPU
number from smp_processor_id() or a variant. These places should be
using nr_cpu_ids instead of num_possible_cpus().

I'm usually working on the code for Linux guests on Hyper-V, and
there are six occurrences in that code. While they probably don't
have immediate practical impact because I don't think the ACPI MADT
in a such a VM would have a gap in the processor enumeration,
I'm planning to do fixes in the interest of general correctness.

Michael


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: PROBLEM: kernel crashes when running xfsdump since ~6.4
  2024-06-21 13:42         ` Michael Kelley
@ 2024-06-24 12:17           ` Uladzislau Rezki
  0 siblings, 0 replies; 39+ messages in thread
From: Uladzislau Rezki @ 2024-06-24 12:17 UTC (permalink / raw)
  To: Michael Kelley
  Cc: Uladzislau Rezki, Baoquan He, Nick Bowler, Hailong Liu,
	linux-kernel, Linux regressions mailing list, linux-mm,
	sparclinux, Andrew Morton

On Fri, Jun 21, 2024 at 01:42:28PM +0000, Michael Kelley wrote:
> From: Uladzislau Rezki <urezki@gmail.com> Sent: Friday, June 21, 2024 2:44 AM
> > 
> > On Fri, Jun 21, 2024 at 03:07:16PM +0800, Baoquan He wrote:
> > > On 06/21/24 at 11:30am, Hailong Liu wrote:
> > > > On Thu, 20. Jun 14:02, Nick Bowler wrote:
> > > > > On 2024-06-20 02:19, Nick Bowler wrote:
> 
> [snip]
> 
> > > > > The per-cpu variables in mm/vmalloc.c are initialized like this, in
> > > > > vmalloc_init
> > > > >
> > > > >   for_each_possible_cpu(i) {
> > > > >     /* ... */
> > > > >     vbq = &per_cpu(vmap_block_queue, i);
> > > > >     /* initialize stuff in vbq */
> > > > >   }
> > > > >
> > > > > This loops over the set bits of cpu_possible_mask, bits 0 and 2 are set,
> > > > > so it initializes stuff with i=0 and i=2, skipping i=1 (I added prints to
> > > > > confirm this).
> > > > >
> > > > > Then, in vm_map_ram, with the problematic change it calls the new
> > > > > function addr_to_vb_xa, which does this:
> > > > >
> > > > >   int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> > > > >   return &per_cpu(vmap_block_queue, index).vmap_blocks;
> > > > >
> > > > > The num_possible_cpus() function counts the number of set bits in
> > > > > cpu_possible_mask, so it returns 2.  Thus, index is either 0 or 1, which
> > > > > does not correspond to what was initialized (0 or 2).  The crash occurs
> > > > > when the computed index is 1 in this function.  In this case, the
> > > > > returned value appears to be garbage (I added prints to confirm this).
> > >
> > > This is a great catch.
> > >
> > Indeed :)
> > 
> 
> +1
> 
> More broadly, throughout kernel code there are a number of places
> that incorrectly assume the cpu_possible_mask has no gaps in it. The
> typical case does kcalloc() or kmalloc_array() with num_possible_cpus()
> as the first argument, then indexes into the allocated array with a CPU
> number from smp_processor_id() or a variant. These places should be
> using nr_cpu_ids instead of num_possible_cpus().
> 
> I'm usually working on the code for Linux guests on Hyper-V, and
> there are six occurrences in that code. While they probably don't
> have immediate practical impact because I don't think the ACPI MADT
> in a such a VM would have a gap in the processor enumeration,
> I'm planning to do fixes in the interest of general correctness.
> 
Thank you for valuable information!

--
Uladzislau Rezki


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: PROBLEM: kernel crashes when running xfsdump since ~6.4
  2024-06-21  9:44       ` Uladzislau Rezki
                           ` (2 preceding siblings ...)
  2024-06-21 13:42         ` Michael Kelley
@ 2024-06-21 14:02         ` Baoquan He
  2024-06-24 12:16           ` Uladzislau Rezki
  3 siblings, 1 reply; 39+ messages in thread
From: Baoquan He @ 2024-06-21 14:02 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Nick Bowler, Hailong Liu, linux-kernel,
	Linux regressions mailing list, linux-mm, sparclinux,
	Andrew Morton

On 06/21/24 at 11:44am, Uladzislau Rezki wrote:
> On Fri, Jun 21, 2024 at 03:07:16PM +0800, Baoquan He wrote:
> > On 06/21/24 at 11:30am, Hailong Liu wrote:
> > > On Thu, 20. Jun 14:02, Nick Bowler wrote:
> > > > On 2024-06-20 02:19, Nick Bowler wrote:
......
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index be2dd281ea76..18e87cafbaf2 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -2542,7 +2542,7 @@ static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
> >  static struct xarray *
> >  addr_to_vb_xa(unsigned long addr)
> >  {
> > -	int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> > +	int index = (addr / VMAP_BLOCK_SIZE) % nr_cpu_ids;
> >  
> >  	return &per_cpu(vmap_block_queue, index).vmap_blocks;
> >  }
> > 
> The problem i see is about not-initializing of the:
> <snip>
> 	for_each_possible_cpu(i) {
> 		struct vmap_block_queue *vbq;
> 		struct vfree_deferred *p;
> 
> 		vbq = &per_cpu(vmap_block_queue, i);
> 		spin_lock_init(&vbq->lock);
> 		INIT_LIST_HEAD(&vbq->free);
> 		p = &per_cpu(vfree_deferred, i);
> 		init_llist_head(&p->list);
> 		INIT_WORK(&p->wq, delayed_vfree_work);
> 		xa_init(&vbq->vmap_blocks);
> 	}
> <snip>
> 
> correctly or fully. It is my bad i did not think that CPUs in a possible mask
> can be non sequential :-/
> 
> nr_cpu_ids - is not the max possible CPU. For example, in Nick case,
> when he has two CPUs, num_possible_cpus() and nr_cpu_ids are the same.

I checked the generic version of setup_nr_cpu_ids(), from codes, they
are different with my understanding.

kernel/smp.c
void __init setup_nr_cpu_ids(void)
{
        set_nr_cpu_ids(find_last_bit(cpumask_bits(cpu_possible_mask), NR_CPUS) + 1);
}

include/linux/cpumask.h:
#define num_possible_cpus()     cpumask_weight(cpu_possible_mask)



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: PROBLEM: kernel crashes when running xfsdump since ~6.4
  2024-06-21 14:02         ` Baoquan He
@ 2024-06-24 12:16           ` Uladzislau Rezki
  2024-06-25  3:30             ` Baoquan He
  0 siblings, 1 reply; 39+ messages in thread
From: Uladzislau Rezki @ 2024-06-24 12:16 UTC (permalink / raw)
  To: Baoquan He
  Cc: Uladzislau Rezki, Nick Bowler, Hailong Liu, linux-kernel,
	Linux regressions mailing list, linux-mm, sparclinux,
	Andrew Morton

On Fri, Jun 21, 2024 at 10:02:50PM +0800, Baoquan He wrote:
> On 06/21/24 at 11:44am, Uladzislau Rezki wrote:
> > On Fri, Jun 21, 2024 at 03:07:16PM +0800, Baoquan He wrote:
> > > On 06/21/24 at 11:30am, Hailong Liu wrote:
> > > > On Thu, 20. Jun 14:02, Nick Bowler wrote:
> > > > > On 2024-06-20 02:19, Nick Bowler wrote:
> ......
> > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > index be2dd281ea76..18e87cafbaf2 100644
> > > --- a/mm/vmalloc.c
> > > +++ b/mm/vmalloc.c
> > > @@ -2542,7 +2542,7 @@ static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
> > >  static struct xarray *
> > >  addr_to_vb_xa(unsigned long addr)
> > >  {
> > > -	int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> > > +	int index = (addr / VMAP_BLOCK_SIZE) % nr_cpu_ids;
> > >  
> > >  	return &per_cpu(vmap_block_queue, index).vmap_blocks;
> > >  }
> > > 
> > The problem i see is about not-initializing of the:
> > <snip>
> > 	for_each_possible_cpu(i) {
> > 		struct vmap_block_queue *vbq;
> > 		struct vfree_deferred *p;
> > 
> > 		vbq = &per_cpu(vmap_block_queue, i);
> > 		spin_lock_init(&vbq->lock);
> > 		INIT_LIST_HEAD(&vbq->free);
> > 		p = &per_cpu(vfree_deferred, i);
> > 		init_llist_head(&p->list);
> > 		INIT_WORK(&p->wq, delayed_vfree_work);
> > 		xa_init(&vbq->vmap_blocks);
> > 	}
> > <snip>
> > 
> > correctly or fully. It is my bad i did not think that CPUs in a possible mask
> > can be non sequential :-/
> > 
> > nr_cpu_ids - is not the max possible CPU. For example, in Nick case,
> > when he has two CPUs, num_possible_cpus() and nr_cpu_ids are the same.
> 
> I checked the generic version of setup_nr_cpu_ids(), from codes, they
> are different with my understanding.
> 
> kernel/smp.c
> void __init setup_nr_cpu_ids(void)
> {
>         set_nr_cpu_ids(find_last_bit(cpumask_bits(cpu_possible_mask), NR_CPUS) + 1);
> }
> 
I see that it is not a weak function, so it is generic, thus the
behavior can not be overwritten, which is great. This does what we
need.

Thank you for checking this you are right!

Then it is just a matter of proper initialization of the hash:

<snip>
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 5d3aa2dc88a8..1733946f7a12 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -5087,7 +5087,13 @@ void __init vmalloc_init(void)
         */
        vmap_area_cachep = KMEM_CACHE(vmap_area, SLAB_PANIC);
 
-       for_each_possible_cpu(i) {
+       /*
+        * We use "nr_cpu_ids" here because some architectures
+        * may have "gaps" in cpu-possible-mask. It is OK for
+        * per-cpu approaches but is not OK for cases where it
+        * can be used as hashes also.
+        */
+       for (i = 0; i < nr_cpu_ids; i++) {
                struct vmap_block_queue *vbq;
                struct vfree_deferred *p;
<snip>

--
Uladzislau Rezki


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: PROBLEM: kernel crashes when running xfsdump since ~6.4
  2024-06-24 12:16           ` Uladzislau Rezki
@ 2024-06-25  3:30             ` Baoquan He
  2024-06-25 10:32               ` Uladzislau Rezki
                                 ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Baoquan He @ 2024-06-25  3:30 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Nick Bowler, Hailong Liu, linux-kernel,
	Linux regressions mailing list, linux-mm, sparclinux,
	Andrew Morton

On 06/24/24 at 02:16pm, Uladzislau Rezki wrote:
> On Fri, Jun 21, 2024 at 10:02:50PM +0800, Baoquan He wrote:
> > On 06/21/24 at 11:44am, Uladzislau Rezki wrote:
> > > On Fri, Jun 21, 2024 at 03:07:16PM +0800, Baoquan He wrote:
> > > > On 06/21/24 at 11:30am, Hailong Liu wrote:
> > > > > On Thu, 20. Jun 14:02, Nick Bowler wrote:
> > > > > > On 2024-06-20 02:19, Nick Bowler wrote:
> > ......
> > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > index be2dd281ea76..18e87cafbaf2 100644
> > > > --- a/mm/vmalloc.c
> > > > +++ b/mm/vmalloc.c
> > > > @@ -2542,7 +2542,7 @@ static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
> > > >  static struct xarray *
> > > >  addr_to_vb_xa(unsigned long addr)
> > > >  {
> > > > -	int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> > > > +	int index = (addr / VMAP_BLOCK_SIZE) % nr_cpu_ids;
> > > >  
> > > >  	return &per_cpu(vmap_block_queue, index).vmap_blocks;
> > > >  }
> > > > 
> > > The problem i see is about not-initializing of the:
> > > <snip>
> > > 	for_each_possible_cpu(i) {
> > > 		struct vmap_block_queue *vbq;
> > > 		struct vfree_deferred *p;
> > > 
> > > 		vbq = &per_cpu(vmap_block_queue, i);
> > > 		spin_lock_init(&vbq->lock);
> > > 		INIT_LIST_HEAD(&vbq->free);
> > > 		p = &per_cpu(vfree_deferred, i);
> > > 		init_llist_head(&p->list);
> > > 		INIT_WORK(&p->wq, delayed_vfree_work);
> > > 		xa_init(&vbq->vmap_blocks);
> > > 	}
> > > <snip>
> > > 
> > > correctly or fully. It is my bad i did not think that CPUs in a possible mask
> > > can be non sequential :-/
> > > 
> > > nr_cpu_ids - is not the max possible CPU. For example, in Nick case,
> > > when he has two CPUs, num_possible_cpus() and nr_cpu_ids are the same.
> > 
> > I checked the generic version of setup_nr_cpu_ids(), from codes, they
> > are different with my understanding.
> > 
> > kernel/smp.c
> > void __init setup_nr_cpu_ids(void)
> > {
> >         set_nr_cpu_ids(find_last_bit(cpumask_bits(cpu_possible_mask), NR_CPUS) + 1);
> > }
> > 
> I see that it is not a weak function, so it is generic, thus the
> behavior can not be overwritten, which is great. This does what we
> need.
> 
> Thank you for checking this you are right!

Thanks for confirming this.

> 
> Then it is just a matter of proper initialization of the hash:
> 
> <snip>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 5d3aa2dc88a8..1733946f7a12 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -5087,7 +5087,13 @@ void __init vmalloc_init(void)
>          */
>         vmap_area_cachep = KMEM_CACHE(vmap_area, SLAB_PANIC);
>  
> -       for_each_possible_cpu(i) {
> +       /*
> +        * We use "nr_cpu_ids" here because some architectures
> +        * may have "gaps" in cpu-possible-mask. It is OK for
> +        * per-cpu approaches but is not OK for cases where it
> +        * can be used as hashes also.
> +        */
> +       for (i = 0; i < nr_cpu_ids; i++) {

I was wrong about earlier comments. Percpu variables are only available
on possible CPUs. For those nonexistent possible CPUs of static percpu
variable vmap_block_queue, there isn't memory allocated and mapped for
them. So accessing into them will cause problem.

In Nick's case, there are only CPU0, CPU2. If you access
&per_cpu(vmap_block_queue, 1), problem occurs. So I think we may need to
change to take other way for vbq. E.g:
1) Storing the vb in the nearest neighbouring vbq on possible CPU as
   below draft patch;
2) create an normal array to store vbq of size nr_cpu_ids, then we can
   store/fetch each vbq on non-possible CPU?

The way 1) is simpler, the existing code can be adapted a little just as
below.

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 633363997dec..59a8951cc6c0 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2542,7 +2542,10 @@ static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
 static struct xarray *
 addr_to_vb_xa(unsigned long addr)
 {
-	int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
+	int index = (addr / VMAP_BLOCK_SIZE) % nr_cpu_ids;
+
+	if (!cpu_possible(idex))
+		index = cpumask_next(index, cpu_possible_mask);
 
 	return &per_cpu(vmap_block_queue, index).vmap_blocks;
 }
@@ -2556,9 +2559,15 @@ addr_to_vb_xa(unsigned long addr)
 
 static unsigned long addr_to_vb_idx(unsigned long addr)
 {
+	int id = (addr / VMAP_BLOCK_SIZE) % nr_cpu_ids;
+	int id_dest = id; 
+
+	if (!cpu_possible(id))
+		id_dest = cpumask_next(id, cpu_possible_mask);
+
 	addr -= VMALLOC_START & ~(VMAP_BLOCK_SIZE-1);
 	addr /= VMAP_BLOCK_SIZE;
-	return addr;
+	return addr + (id_dest - id);
 }



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: PROBLEM: kernel crashes when running xfsdump since ~6.4
  2024-06-25  3:30             ` Baoquan He
@ 2024-06-25 10:32               ` Uladzislau Rezki
  2024-06-25 11:40                 ` Baoquan He
  2024-06-25 11:19               ` Hailong Liu
  2024-06-25 12:41               ` Uladzislau Rezki
  2 siblings, 1 reply; 39+ messages in thread
From: Uladzislau Rezki @ 2024-06-25 10:32 UTC (permalink / raw)
  To: Baoquan He
  Cc: Uladzislau Rezki, Nick Bowler, Hailong Liu, linux-kernel,
	Linux regressions mailing list, linux-mm, sparclinux,
	Andrew Morton

On Tue, Jun 25, 2024 at 11:30:33AM +0800, Baoquan He wrote:
> On 06/24/24 at 02:16pm, Uladzislau Rezki wrote:
> > On Fri, Jun 21, 2024 at 10:02:50PM +0800, Baoquan He wrote:
> > > On 06/21/24 at 11:44am, Uladzislau Rezki wrote:
> > > > On Fri, Jun 21, 2024 at 03:07:16PM +0800, Baoquan He wrote:
> > > > > On 06/21/24 at 11:30am, Hailong Liu wrote:
> > > > > > On Thu, 20. Jun 14:02, Nick Bowler wrote:
> > > > > > > On 2024-06-20 02:19, Nick Bowler wrote:
> > > ......
> > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > > index be2dd281ea76..18e87cafbaf2 100644
> > > > > --- a/mm/vmalloc.c
> > > > > +++ b/mm/vmalloc.c
> > > > > @@ -2542,7 +2542,7 @@ static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
> > > > >  static struct xarray *
> > > > >  addr_to_vb_xa(unsigned long addr)
> > > > >  {
> > > > > -	int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> > > > > +	int index = (addr / VMAP_BLOCK_SIZE) % nr_cpu_ids;
> > > > >  
> > > > >  	return &per_cpu(vmap_block_queue, index).vmap_blocks;
> > > > >  }
> > > > > 
> > > > The problem i see is about not-initializing of the:
> > > > <snip>
> > > > 	for_each_possible_cpu(i) {
> > > > 		struct vmap_block_queue *vbq;
> > > > 		struct vfree_deferred *p;
> > > > 
> > > > 		vbq = &per_cpu(vmap_block_queue, i);
> > > > 		spin_lock_init(&vbq->lock);
> > > > 		INIT_LIST_HEAD(&vbq->free);
> > > > 		p = &per_cpu(vfree_deferred, i);
> > > > 		init_llist_head(&p->list);
> > > > 		INIT_WORK(&p->wq, delayed_vfree_work);
> > > > 		xa_init(&vbq->vmap_blocks);
> > > > 	}
> > > > <snip>
> > > > 
> > > > correctly or fully. It is my bad i did not think that CPUs in a possible mask
> > > > can be non sequential :-/
> > > > 
> > > > nr_cpu_ids - is not the max possible CPU. For example, in Nick case,
> > > > when he has two CPUs, num_possible_cpus() and nr_cpu_ids are the same.
> > > 
> > > I checked the generic version of setup_nr_cpu_ids(), from codes, they
> > > are different with my understanding.
> > > 
> > > kernel/smp.c
> > > void __init setup_nr_cpu_ids(void)
> > > {
> > >         set_nr_cpu_ids(find_last_bit(cpumask_bits(cpu_possible_mask), NR_CPUS) + 1);
> > > }
> > > 
> > I see that it is not a weak function, so it is generic, thus the
> > behavior can not be overwritten, which is great. This does what we
> > need.
> > 
> > Thank you for checking this you are right!
> 
> Thanks for confirming this.
> 
> > 
> > Then it is just a matter of proper initialization of the hash:
> > 
> > <snip>
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 5d3aa2dc88a8..1733946f7a12 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -5087,7 +5087,13 @@ void __init vmalloc_init(void)
> >          */
> >         vmap_area_cachep = KMEM_CACHE(vmap_area, SLAB_PANIC);
> >  
> > -       for_each_possible_cpu(i) {
> > +       /*
> > +        * We use "nr_cpu_ids" here because some architectures
> > +        * may have "gaps" in cpu-possible-mask. It is OK for
> > +        * per-cpu approaches but is not OK for cases where it
> > +        * can be used as hashes also.
> > +        */
> > +       for (i = 0; i < nr_cpu_ids; i++) {
> 
> I was wrong about earlier comments. Percpu variables are only available
> on possible CPUs. For those nonexistent possible CPUs of static percpu
> variable vmap_block_queue, there isn't memory allocated and mapped for
> them. So accessing into them will cause problem.
> 
> In Nick's case, there are only CPU0, CPU2. If you access
> &per_cpu(vmap_block_queue, 1), problem occurs. So I think we may need to
> change to take other way for vbq. E.g:
> 1) Storing the vb in the nearest neighbouring vbq on possible CPU as
>    below draft patch;
> 2) create an normal array to store vbq of size nr_cpu_ids, then we can
>    store/fetch each vbq on non-possible CPU?
> 
A correct way, i think, is to create a normal array. A quick fix can be
to stick to a next possible CPU.

> The way 1) is simpler, the existing code can be adapted a little just as
> below.
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 633363997dec..59a8951cc6c0 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2542,7 +2542,10 @@ static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
>  static struct xarray *
>  addr_to_vb_xa(unsigned long addr)
>  {
> -	int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> +	int index = (addr / VMAP_BLOCK_SIZE) % nr_cpu_ids;
> +
> +	if (!cpu_possible(idex))
> +		index = cpumask_next(index, cpu_possible_mask);
>
cpumask_next() can return nr_cpu_ids if no next bits set.

Thanks!

--
Uladzislau Rezki


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: PROBLEM: kernel crashes when running xfsdump since ~6.4
  2024-06-25 10:32               ` Uladzislau Rezki
@ 2024-06-25 11:40                 ` Baoquan He
  2024-06-25 12:40                   ` Uladzislau Rezki
  0 siblings, 1 reply; 39+ messages in thread
From: Baoquan He @ 2024-06-25 11:40 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Nick Bowler, Hailong Liu, linux-kernel,
	Linux regressions mailing list, linux-mm, sparclinux,
	Andrew Morton

On 06/25/24 at 12:32pm, Uladzislau Rezki wrote:
> On Tue, Jun 25, 2024 at 11:30:33AM +0800, Baoquan He wrote:
> > On 06/24/24 at 02:16pm, Uladzislau Rezki wrote:
> > > On Fri, Jun 21, 2024 at 10:02:50PM +0800, Baoquan He wrote:
> > > > On 06/21/24 at 11:44am, Uladzislau Rezki wrote:
> > > > > On Fri, Jun 21, 2024 at 03:07:16PM +0800, Baoquan He wrote:
> > > > > > On 06/21/24 at 11:30am, Hailong Liu wrote:
> > > > > > > On Thu, 20. Jun 14:02, Nick Bowler wrote:
> > > > > > > > On 2024-06-20 02:19, Nick Bowler wrote:
> > > > ......
> > > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > > > index be2dd281ea76..18e87cafbaf2 100644
> > > > > > --- a/mm/vmalloc.c
> > > > > > +++ b/mm/vmalloc.c
> > > > > > @@ -2542,7 +2542,7 @@ static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
> > > > > >  static struct xarray *
> > > > > >  addr_to_vb_xa(unsigned long addr)
> > > > > >  {
> > > > > > -	int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> > > > > > +	int index = (addr / VMAP_BLOCK_SIZE) % nr_cpu_ids;
> > > > > >  
> > > > > >  	return &per_cpu(vmap_block_queue, index).vmap_blocks;
> > > > > >  }
> > > > > > 
> > > > > The problem i see is about not-initializing of the:
> > > > > <snip>
> > > > > 	for_each_possible_cpu(i) {
> > > > > 		struct vmap_block_queue *vbq;
> > > > > 		struct vfree_deferred *p;
> > > > > 
> > > > > 		vbq = &per_cpu(vmap_block_queue, i);
> > > > > 		spin_lock_init(&vbq->lock);
> > > > > 		INIT_LIST_HEAD(&vbq->free);
> > > > > 		p = &per_cpu(vfree_deferred, i);
> > > > > 		init_llist_head(&p->list);
> > > > > 		INIT_WORK(&p->wq, delayed_vfree_work);
> > > > > 		xa_init(&vbq->vmap_blocks);
> > > > > 	}
> > > > > <snip>
> > > > > 
> > > > > correctly or fully. It is my bad i did not think that CPUs in a possible mask
> > > > > can be non sequential :-/
> > > > > 
> > > > > nr_cpu_ids - is not the max possible CPU. For example, in Nick case,
> > > > > when he has two CPUs, num_possible_cpus() and nr_cpu_ids are the same.
> > > > 
> > > > I checked the generic version of setup_nr_cpu_ids(), from codes, they
> > > > are different with my understanding.
> > > > 
> > > > kernel/smp.c
> > > > void __init setup_nr_cpu_ids(void)
> > > > {
> > > >         set_nr_cpu_ids(find_last_bit(cpumask_bits(cpu_possible_mask), NR_CPUS) + 1);
> > > > }
> > > > 
> > > I see that it is not a weak function, so it is generic, thus the
> > > behavior can not be overwritten, which is great. This does what we
> > > need.
> > > 
> > > Thank you for checking this you are right!
> > 
> > Thanks for confirming this.
> > 
> > > 
> > > Then it is just a matter of proper initialization of the hash:
> > > 
> > > <snip>
> > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > index 5d3aa2dc88a8..1733946f7a12 100644
> > > --- a/mm/vmalloc.c
> > > +++ b/mm/vmalloc.c
> > > @@ -5087,7 +5087,13 @@ void __init vmalloc_init(void)
> > >          */
> > >         vmap_area_cachep = KMEM_CACHE(vmap_area, SLAB_PANIC);
> > >  
> > > -       for_each_possible_cpu(i) {
> > > +       /*
> > > +        * We use "nr_cpu_ids" here because some architectures
> > > +        * may have "gaps" in cpu-possible-mask. It is OK for
> > > +        * per-cpu approaches but is not OK for cases where it
> > > +        * can be used as hashes also.
> > > +        */
> > > +       for (i = 0; i < nr_cpu_ids; i++) {
> > 
> > I was wrong about earlier comments. Percpu variables are only available
> > on possible CPUs. For those nonexistent possible CPUs of static percpu
> > variable vmap_block_queue, there isn't memory allocated and mapped for
> > them. So accessing into them will cause problem.
> > 
> > In Nick's case, there are only CPU0, CPU2. If you access
> > &per_cpu(vmap_block_queue, 1), problem occurs. So I think we may need to
> > change to take other way for vbq. E.g:
> > 1) Storing the vb in the nearest neighbouring vbq on possible CPU as
> >    below draft patch;
> > 2) create an normal array to store vbq of size nr_cpu_ids, then we can
> >    store/fetch each vbq on non-possible CPU?
> > 
> A correct way, i think, is to create a normal array. A quick fix can be
> to stick to a next possible CPU.
> 
> > The way 1) is simpler, the existing code can be adapted a little just as
> > below.
> > 
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 633363997dec..59a8951cc6c0 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -2542,7 +2542,10 @@ static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
> >  static struct xarray *
> >  addr_to_vb_xa(unsigned long addr)
> >  {
> > -	int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> > +	int index = (addr / VMAP_BLOCK_SIZE) % nr_cpu_ids;
> > +
> > +	if (!cpu_possible(idex))
> > +		index = cpumask_next(index, cpu_possible_mask);
> >
> cpumask_next() can return nr_cpu_ids if no next bits set.

It won't. nr_cpu_ids is the largest index + 1, the hashed index will
be:  0 =<  index  <= (nr_cpu_ids - 1) e.g cpu_possible_mask is
b10001111, the nr_cpu_ids is 8, the largest bit is cpu7.
cpu_possible(index) will check that. So the largest bit of cpumask_next()
returns is (nr_cpu_ids - 1).



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: PROBLEM: kernel crashes when running xfsdump since ~6.4
  2024-06-25 11:40                 ` Baoquan He
@ 2024-06-25 12:40                   ` Uladzislau Rezki
  2024-06-25 13:02                     ` Baoquan He
  0 siblings, 1 reply; 39+ messages in thread
From: Uladzislau Rezki @ 2024-06-25 12:40 UTC (permalink / raw)
  To: Baoquan He
  Cc: Uladzislau Rezki, Nick Bowler, Hailong Liu, linux-kernel,
	Linux regressions mailing list, linux-mm, sparclinux,
	Andrew Morton

On Tue, Jun 25, 2024 at 07:40:21PM +0800, Baoquan He wrote:
> On 06/25/24 at 12:32pm, Uladzislau Rezki wrote:
> > On Tue, Jun 25, 2024 at 11:30:33AM +0800, Baoquan He wrote:
> > > On 06/24/24 at 02:16pm, Uladzislau Rezki wrote:
> > > > On Fri, Jun 21, 2024 at 10:02:50PM +0800, Baoquan He wrote:
> > > > > On 06/21/24 at 11:44am, Uladzislau Rezki wrote:
> > > > > > On Fri, Jun 21, 2024 at 03:07:16PM +0800, Baoquan He wrote:
> > > > > > > On 06/21/24 at 11:30am, Hailong Liu wrote:
> > > > > > > > On Thu, 20. Jun 14:02, Nick Bowler wrote:
> > > > > > > > > On 2024-06-20 02:19, Nick Bowler wrote:
> > > > > ......
> > > > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > > > > index be2dd281ea76..18e87cafbaf2 100644
> > > > > > > --- a/mm/vmalloc.c
> > > > > > > +++ b/mm/vmalloc.c
> > > > > > > @@ -2542,7 +2542,7 @@ static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
> > > > > > >  static struct xarray *
> > > > > > >  addr_to_vb_xa(unsigned long addr)
> > > > > > >  {
> > > > > > > -	int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> > > > > > > +	int index = (addr / VMAP_BLOCK_SIZE) % nr_cpu_ids;
> > > > > > >  
> > > > > > >  	return &per_cpu(vmap_block_queue, index).vmap_blocks;
> > > > > > >  }
> > > > > > > 
> > > > > > The problem i see is about not-initializing of the:
> > > > > > <snip>
> > > > > > 	for_each_possible_cpu(i) {
> > > > > > 		struct vmap_block_queue *vbq;
> > > > > > 		struct vfree_deferred *p;
> > > > > > 
> > > > > > 		vbq = &per_cpu(vmap_block_queue, i);
> > > > > > 		spin_lock_init(&vbq->lock);
> > > > > > 		INIT_LIST_HEAD(&vbq->free);
> > > > > > 		p = &per_cpu(vfree_deferred, i);
> > > > > > 		init_llist_head(&p->list);
> > > > > > 		INIT_WORK(&p->wq, delayed_vfree_work);
> > > > > > 		xa_init(&vbq->vmap_blocks);
> > > > > > 	}
> > > > > > <snip>
> > > > > > 
> > > > > > correctly or fully. It is my bad i did not think that CPUs in a possible mask
> > > > > > can be non sequential :-/
> > > > > > 
> > > > > > nr_cpu_ids - is not the max possible CPU. For example, in Nick case,
> > > > > > when he has two CPUs, num_possible_cpus() and nr_cpu_ids are the same.
> > > > > 
> > > > > I checked the generic version of setup_nr_cpu_ids(), from codes, they
> > > > > are different with my understanding.
> > > > > 
> > > > > kernel/smp.c
> > > > > void __init setup_nr_cpu_ids(void)
> > > > > {
> > > > >         set_nr_cpu_ids(find_last_bit(cpumask_bits(cpu_possible_mask), NR_CPUS) + 1);
> > > > > }
> > > > > 
> > > > I see that it is not a weak function, so it is generic, thus the
> > > > behavior can not be overwritten, which is great. This does what we
> > > > need.
> > > > 
> > > > Thank you for checking this you are right!
> > > 
> > > Thanks for confirming this.
> > > 
> > > > 
> > > > Then it is just a matter of proper initialization of the hash:
> > > > 
> > > > <snip>
> > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > index 5d3aa2dc88a8..1733946f7a12 100644
> > > > --- a/mm/vmalloc.c
> > > > +++ b/mm/vmalloc.c
> > > > @@ -5087,7 +5087,13 @@ void __init vmalloc_init(void)
> > > >          */
> > > >         vmap_area_cachep = KMEM_CACHE(vmap_area, SLAB_PANIC);
> > > >  
> > > > -       for_each_possible_cpu(i) {
> > > > +       /*
> > > > +        * We use "nr_cpu_ids" here because some architectures
> > > > +        * may have "gaps" in cpu-possible-mask. It is OK for
> > > > +        * per-cpu approaches but is not OK for cases where it
> > > > +        * can be used as hashes also.
> > > > +        */
> > > > +       for (i = 0; i < nr_cpu_ids; i++) {
> > > 
> > > I was wrong about earlier comments. Percpu variables are only available
> > > on possible CPUs. For those nonexistent possible CPUs of static percpu
> > > variable vmap_block_queue, there isn't memory allocated and mapped for
> > > them. So accessing into them will cause problem.
> > > 
> > > In Nick's case, there are only CPU0, CPU2. If you access
> > > &per_cpu(vmap_block_queue, 1), problem occurs. So I think we may need to
> > > change to take other way for vbq. E.g:
> > > 1) Storing the vb in the nearest neighbouring vbq on possible CPU as
> > >    below draft patch;
> > > 2) create an normal array to store vbq of size nr_cpu_ids, then we can
> > >    store/fetch each vbq on non-possible CPU?
> > > 
> > A correct way, i think, is to create a normal array. A quick fix can be
> > to stick to a next possible CPU.
> > 
> > > The way 1) is simpler, the existing code can be adapted a little just as
> > > below.
> > > 
> > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > index 633363997dec..59a8951cc6c0 100644
> > > --- a/mm/vmalloc.c
> > > +++ b/mm/vmalloc.c
> > > @@ -2542,7 +2542,10 @@ static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
> > >  static struct xarray *
> > >  addr_to_vb_xa(unsigned long addr)
> > >  {
> > > -	int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> > > +	int index = (addr / VMAP_BLOCK_SIZE) % nr_cpu_ids;
> > > +
> > > +	if (!cpu_possible(idex))
> > > +		index = cpumask_next(index, cpu_possible_mask);
> > >
> > cpumask_next() can return nr_cpu_ids if no next bits set.
> 
> It won't. nr_cpu_ids is the largest index + 1, the hashed index will
> be:  0 =<  index  <= (nr_cpu_ids - 1) e.g cpu_possible_mask is
> b10001111, the nr_cpu_ids is 8, the largest bit is cpu7.
> cpu_possible(index) will check that. So the largest bit of cpumask_next()
> returns is (nr_cpu_ids - 1).
> 
/**
 * cpumask_next - get the next cpu in a cpumask
 * @n: the cpu prior to the place to search (i.e. return will be > @n)
 * @srcp: the cpumask pointer
 *
 * Return: >= nr_cpu_ids if no further cpus set.
 */
static inline
unsigned int cpumask_next(int n, const struct cpumask *srcp)
{
	/* -1 is a legal arg here. */
	if (n != -1)
		cpumask_check(n);
	return find_next_bit(cpumask_bits(srcp), small_cpumask_bits, n + 1);
}

--
Uladzislau Rezki


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: PROBLEM: kernel crashes when running xfsdump since ~6.4
  2024-06-25 12:40                   ` Uladzislau Rezki
@ 2024-06-25 13:02                     ` Baoquan He
  2024-06-25 15:33                       ` Uladzislau Rezki
  0 siblings, 1 reply; 39+ messages in thread
From: Baoquan He @ 2024-06-25 13:02 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Nick Bowler, Hailong Liu, linux-kernel,
	Linux regressions mailing list, linux-mm, sparclinux,
	Andrew Morton

On 06/25/24 at 02:40pm, Uladzislau Rezki wrote:
> On Tue, Jun 25, 2024 at 07:40:21PM +0800, Baoquan He wrote:
> > On 06/25/24 at 12:32pm, Uladzislau Rezki wrote:
> > > On Tue, Jun 25, 2024 at 11:30:33AM +0800, Baoquan He wrote:
> > > > On 06/24/24 at 02:16pm, Uladzislau Rezki wrote:
> > > > > On Fri, Jun 21, 2024 at 10:02:50PM +0800, Baoquan He wrote:
> > > > > > On 06/21/24 at 11:44am, Uladzislau Rezki wrote:
> > > > > > > On Fri, Jun 21, 2024 at 03:07:16PM +0800, Baoquan He wrote:
> > > > > > > > On 06/21/24 at 11:30am, Hailong Liu wrote:
> > > > > > > > > On Thu, 20. Jun 14:02, Nick Bowler wrote:
> > > > > > > > > > On 2024-06-20 02:19, Nick Bowler wrote:
> > > > > > ......
> > > > > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > > > > > index be2dd281ea76..18e87cafbaf2 100644
> > > > > > > > --- a/mm/vmalloc.c
> > > > > > > > +++ b/mm/vmalloc.c
> > > > > > > > @@ -2542,7 +2542,7 @@ static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
> > > > > > > >  static struct xarray *
> > > > > > > >  addr_to_vb_xa(unsigned long addr)
> > > > > > > >  {
> > > > > > > > -	int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> > > > > > > > +	int index = (addr / VMAP_BLOCK_SIZE) % nr_cpu_ids;
> > > > > > > >  
> > > > > > > >  	return &per_cpu(vmap_block_queue, index).vmap_blocks;
> > > > > > > >  }
> > > > > > > > 
> > > > > > > The problem i see is about not-initializing of the:
> > > > > > > <snip>
> > > > > > > 	for_each_possible_cpu(i) {
> > > > > > > 		struct vmap_block_queue *vbq;
> > > > > > > 		struct vfree_deferred *p;
> > > > > > > 
> > > > > > > 		vbq = &per_cpu(vmap_block_queue, i);
> > > > > > > 		spin_lock_init(&vbq->lock);
> > > > > > > 		INIT_LIST_HEAD(&vbq->free);
> > > > > > > 		p = &per_cpu(vfree_deferred, i);
> > > > > > > 		init_llist_head(&p->list);
> > > > > > > 		INIT_WORK(&p->wq, delayed_vfree_work);
> > > > > > > 		xa_init(&vbq->vmap_blocks);
> > > > > > > 	}
> > > > > > > <snip>
> > > > > > > 
> > > > > > > correctly or fully. It is my bad i did not think that CPUs in a possible mask
> > > > > > > can be non sequential :-/
> > > > > > > 
> > > > > > > nr_cpu_ids - is not the max possible CPU. For example, in Nick case,
> > > > > > > when he has two CPUs, num_possible_cpus() and nr_cpu_ids are the same.
> > > > > > 
> > > > > > I checked the generic version of setup_nr_cpu_ids(), from codes, they
> > > > > > are different with my understanding.
> > > > > > 
> > > > > > kernel/smp.c
> > > > > > void __init setup_nr_cpu_ids(void)
> > > > > > {
> > > > > >         set_nr_cpu_ids(find_last_bit(cpumask_bits(cpu_possible_mask), NR_CPUS) + 1);
> > > > > > }
> > > > > > 
> > > > > I see that it is not a weak function, so it is generic, thus the
> > > > > behavior can not be overwritten, which is great. This does what we
> > > > > need.
> > > > > 
> > > > > Thank you for checking this you are right!
> > > > 
> > > > Thanks for confirming this.
> > > > 
> > > > > 
> > > > > Then it is just a matter of proper initialization of the hash:
> > > > > 
> > > > > <snip>
> > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > > index 5d3aa2dc88a8..1733946f7a12 100644
> > > > > --- a/mm/vmalloc.c
> > > > > +++ b/mm/vmalloc.c
> > > > > @@ -5087,7 +5087,13 @@ void __init vmalloc_init(void)
> > > > >          */
> > > > >         vmap_area_cachep = KMEM_CACHE(vmap_area, SLAB_PANIC);
> > > > >  
> > > > > -       for_each_possible_cpu(i) {
> > > > > +       /*
> > > > > +        * We use "nr_cpu_ids" here because some architectures
> > > > > +        * may have "gaps" in cpu-possible-mask. It is OK for
> > > > > +        * per-cpu approaches but is not OK for cases where it
> > > > > +        * can be used as hashes also.
> > > > > +        */
> > > > > +       for (i = 0; i < nr_cpu_ids; i++) {
> > > > 
> > > > I was wrong about earlier comments. Percpu variables are only available
> > > > on possible CPUs. For those nonexistent possible CPUs of static percpu
> > > > variable vmap_block_queue, there isn't memory allocated and mapped for
> > > > them. So accessing into them will cause problem.
> > > > 
> > > > In Nick's case, there are only CPU0, CPU2. If you access
> > > > &per_cpu(vmap_block_queue, 1), problem occurs. So I think we may need to
> > > > change to take other way for vbq. E.g:
> > > > 1) Storing the vb in the nearest neighbouring vbq on possible CPU as
> > > >    below draft patch;
> > > > 2) create an normal array to store vbq of size nr_cpu_ids, then we can
> > > >    store/fetch each vbq on non-possible CPU?
> > > > 
> > > A correct way, i think, is to create a normal array. A quick fix can be
> > > to stick to a next possible CPU.
> > > 
> > > > The way 1) is simpler, the existing code can be adapted a little just as
> > > > below.
> > > > 
> > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > index 633363997dec..59a8951cc6c0 100644
> > > > --- a/mm/vmalloc.c
> > > > +++ b/mm/vmalloc.c
> > > > @@ -2542,7 +2542,10 @@ static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
> > > >  static struct xarray *
> > > >  addr_to_vb_xa(unsigned long addr)
> > > >  {
> > > > -	int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> > > > +	int index = (addr / VMAP_BLOCK_SIZE) % nr_cpu_ids;
> > > > +
> > > > +	if (!cpu_possible(idex))
> > > > +		index = cpumask_next(index, cpu_possible_mask);
> > > >
> > > cpumask_next() can return nr_cpu_ids if no next bits set.
> > 
> > It won't. nr_cpu_ids is the largest index + 1, the hashed index will
> > be:  0 =<  index  <= (nr_cpu_ids - 1) e.g cpu_possible_mask is
> > b10001111, the nr_cpu_ids is 8, the largest bit is cpu7.
> > cpu_possible(index) will check that. So the largest bit of cpumask_next()
> > returns is (nr_cpu_ids - 1).
> > 
> /**
>  * cpumask_next - get the next cpu in a cpumask
>  * @n: the cpu prior to the place to search (i.e. return will be > @n)
>  * @srcp: the cpumask pointer
>  *
>  * Return: >= nr_cpu_ids if no further cpus set.

Ah, I got what you mean. In the vbq case, it may not have chance to get
a return number as nr_cpu_ids. Becuase the hashed index limits the
range to [0, nr_cpu_ids-1], and cpu_possible(index) will guarantee it
won't be the highest cpu number [nr_cpu_ids-1] since CPU[nr_cpu_ids-1] must
be possible CPU.

Do I miss some corner cases?

>  */
> static inline
> unsigned int cpumask_next(int n, const struct cpumask *srcp)
> {
> 	/* -1 is a legal arg here. */
> 	if (n != -1)
> 		cpumask_check(n);
> 	return find_next_bit(cpumask_bits(srcp), small_cpumask_bits, n + 1);
> }
> 
> --
> Uladzislau Rezki
> 



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: PROBLEM: kernel crashes when running xfsdump since ~6.4
  2024-06-25 13:02                     ` Baoquan He
@ 2024-06-25 15:33                       ` Uladzislau Rezki
  2024-06-25 15:49                         ` Baoquan He
  0 siblings, 1 reply; 39+ messages in thread
From: Uladzislau Rezki @ 2024-06-25 15:33 UTC (permalink / raw)
  To: Baoquan He, Hailong Liu
  Cc: Uladzislau Rezki, Nick Bowler, Hailong Liu, linux-kernel,
	Linux regressions mailing list, linux-mm, sparclinux,
	Andrew Morton

On Tue, Jun 25, 2024 at 09:02:43PM +0800, Baoquan He wrote:
> On 06/25/24 at 02:40pm, Uladzislau Rezki wrote:
> > On Tue, Jun 25, 2024 at 07:40:21PM +0800, Baoquan He wrote:
> > > On 06/25/24 at 12:32pm, Uladzislau Rezki wrote:
> > > > On Tue, Jun 25, 2024 at 11:30:33AM +0800, Baoquan He wrote:
> > > > > On 06/24/24 at 02:16pm, Uladzislau Rezki wrote:
> > > > > > On Fri, Jun 21, 2024 at 10:02:50PM +0800, Baoquan He wrote:
> > > > > > > On 06/21/24 at 11:44am, Uladzislau Rezki wrote:
> > > > > > > > On Fri, Jun 21, 2024 at 03:07:16PM +0800, Baoquan He wrote:
> > > > > > > > > On 06/21/24 at 11:30am, Hailong Liu wrote:
> > > > > > > > > > On Thu, 20. Jun 14:02, Nick Bowler wrote:
> > > > > > > > > > > On 2024-06-20 02:19, Nick Bowler wrote:
> > > > > > > ......
> > > > > > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > > > > > > index be2dd281ea76..18e87cafbaf2 100644
> > > > > > > > > --- a/mm/vmalloc.c
> > > > > > > > > +++ b/mm/vmalloc.c
> > > > > > > > > @@ -2542,7 +2542,7 @@ static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
> > > > > > > > >  static struct xarray *
> > > > > > > > >  addr_to_vb_xa(unsigned long addr)
> > > > > > > > >  {
> > > > > > > > > -	int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> > > > > > > > > +	int index = (addr / VMAP_BLOCK_SIZE) % nr_cpu_ids;
> > > > > > > > >  
> > > > > > > > >  	return &per_cpu(vmap_block_queue, index).vmap_blocks;
> > > > > > > > >  }
> > > > > > > > > 
> > > > > > > > The problem i see is about not-initializing of the:
> > > > > > > > <snip>
> > > > > > > > 	for_each_possible_cpu(i) {
> > > > > > > > 		struct vmap_block_queue *vbq;
> > > > > > > > 		struct vfree_deferred *p;
> > > > > > > > 
> > > > > > > > 		vbq = &per_cpu(vmap_block_queue, i);
> > > > > > > > 		spin_lock_init(&vbq->lock);
> > > > > > > > 		INIT_LIST_HEAD(&vbq->free);
> > > > > > > > 		p = &per_cpu(vfree_deferred, i);
> > > > > > > > 		init_llist_head(&p->list);
> > > > > > > > 		INIT_WORK(&p->wq, delayed_vfree_work);
> > > > > > > > 		xa_init(&vbq->vmap_blocks);
> > > > > > > > 	}
> > > > > > > > <snip>
> > > > > > > > 
> > > > > > > > correctly or fully. It is my bad i did not think that CPUs in a possible mask
> > > > > > > > can be non sequential :-/
> > > > > > > > 
> > > > > > > > nr_cpu_ids - is not the max possible CPU. For example, in Nick case,
> > > > > > > > when he has two CPUs, num_possible_cpus() and nr_cpu_ids are the same.
> > > > > > > 
> > > > > > > I checked the generic version of setup_nr_cpu_ids(), from codes, they
> > > > > > > are different with my understanding.
> > > > > > > 
> > > > > > > kernel/smp.c
> > > > > > > void __init setup_nr_cpu_ids(void)
> > > > > > > {
> > > > > > >         set_nr_cpu_ids(find_last_bit(cpumask_bits(cpu_possible_mask), NR_CPUS) + 1);
> > > > > > > }
> > > > > > > 
> > > > > > I see that it is not a weak function, so it is generic, thus the
> > > > > > behavior can not be overwritten, which is great. This does what we
> > > > > > need.
> > > > > > 
> > > > > > Thank you for checking this you are right!
> > > > > 
> > > > > Thanks for confirming this.
> > > > > 
> > > > > > 
> > > > > > Then it is just a matter of proper initialization of the hash:
> > > > > > 
> > > > > > <snip>
> > > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > > > index 5d3aa2dc88a8..1733946f7a12 100644
> > > > > > --- a/mm/vmalloc.c
> > > > > > +++ b/mm/vmalloc.c
> > > > > > @@ -5087,7 +5087,13 @@ void __init vmalloc_init(void)
> > > > > >          */
> > > > > >         vmap_area_cachep = KMEM_CACHE(vmap_area, SLAB_PANIC);
> > > > > >  
> > > > > > -       for_each_possible_cpu(i) {
> > > > > > +       /*
> > > > > > +        * We use "nr_cpu_ids" here because some architectures
> > > > > > +        * may have "gaps" in cpu-possible-mask. It is OK for
> > > > > > +        * per-cpu approaches but is not OK for cases where it
> > > > > > +        * can be used as hashes also.
> > > > > > +        */
> > > > > > +       for (i = 0; i < nr_cpu_ids; i++) {
> > > > > 
> > > > > I was wrong about earlier comments. Percpu variables are only available
> > > > > on possible CPUs. For those nonexistent possible CPUs of static percpu
> > > > > variable vmap_block_queue, there isn't memory allocated and mapped for
> > > > > them. So accessing into them will cause problem.
> > > > > 
> > > > > In Nick's case, there are only CPU0, CPU2. If you access
> > > > > &per_cpu(vmap_block_queue, 1), problem occurs. So I think we may need to
> > > > > change to take other way for vbq. E.g:
> > > > > 1) Storing the vb in the nearest neighbouring vbq on possible CPU as
> > > > >    below draft patch;
> > > > > 2) create an normal array to store vbq of size nr_cpu_ids, then we can
> > > > >    store/fetch each vbq on non-possible CPU?
> > > > > 
> > > > A correct way, i think, is to create a normal array. A quick fix can be
> > > > to stick to a next possible CPU.
> > > > 
> > > > > The way 1) is simpler, the existing code can be adapted a little just as
> > > > > below.
> > > > > 
> > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > > index 633363997dec..59a8951cc6c0 100644
> > > > > --- a/mm/vmalloc.c
> > > > > +++ b/mm/vmalloc.c
> > > > > @@ -2542,7 +2542,10 @@ static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
> > > > >  static struct xarray *
> > > > >  addr_to_vb_xa(unsigned long addr)
> > > > >  {
> > > > > -	int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> > > > > +	int index = (addr / VMAP_BLOCK_SIZE) % nr_cpu_ids;
> > > > > +
> > > > > +	if (!cpu_possible(idex))
> > > > > +		index = cpumask_next(index, cpu_possible_mask);
> > > > >
> > > > cpumask_next() can return nr_cpu_ids if no next bits set.
> > > 
> > > It won't. nr_cpu_ids is the largest index + 1, the hashed index will
> > > be:  0 =<  index  <= (nr_cpu_ids - 1) e.g cpu_possible_mask is
> > > b10001111, the nr_cpu_ids is 8, the largest bit is cpu7.
> > > cpu_possible(index) will check that. So the largest bit of cpumask_next()
> > > returns is (nr_cpu_ids - 1).
> > > 
> > /**
> >  * cpumask_next - get the next cpu in a cpumask
> >  * @n: the cpu prior to the place to search (i.e. return will be > @n)
> >  * @srcp: the cpumask pointer
> >  *
> >  * Return: >= nr_cpu_ids if no further cpus set.
> 
> Ah, I got what you mean. In the vbq case, it may not have chance to get
> a return number as nr_cpu_ids. Becuase the hashed index limits the
> range to [0, nr_cpu_ids-1], and cpu_possible(index) will guarantee it
> won't be the highest cpu number [nr_cpu_ids-1] since CPU[nr_cpu_ids-1] must
> be possible CPU.
> 
> Do I miss some corner cases?
> 
Right. We guarantee that a highest CPU is available by doing: % nr_cpu_ids.
So we do not need to use *next_wrap() variant. You do not miss anything :)

Hailong Liu has proposed more simpler version:

<snip>
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 11fe5ea208aa..e1e63ffb9c57 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1994,8 +1994,9 @@ static struct xarray *
 addr_to_vb_xa(unsigned long addr)
 {
        int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
+       int cpu = cpumask_nth(index, cpu_possible_mask);

-       return &per_cpu(vmap_block_queue, index).vmap_blocks;
+       return &per_cpu(vmap_block_queue, cpu).vmap_blocks;
<snip>

which just takes a next CPU if an index is not set in the cpu_possible_mask.

The only thing that can be updated in the patch is to replace num_possible_cpu()
by the nr_cpu_ids.

Any thoughts? I think we need to fix it by a minor change so it is
easier to back-port on stable kernels.

--
Uladzislau Rezki


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: PROBLEM: kernel crashes when running xfsdump since ~6.4
  2024-06-25 15:33                       ` Uladzislau Rezki
@ 2024-06-25 15:49                         ` Baoquan He
  2024-06-25 16:49                           ` Uladzislau Rezki
  0 siblings, 1 reply; 39+ messages in thread
From: Baoquan He @ 2024-06-25 15:49 UTC (permalink / raw)
  To: Uladzislau Rezki, Hailong Liu
  Cc: Nick Bowler, linux-kernel, Linux regressions mailing list,
	linux-mm, sparclinux, Andrew Morton

On 06/25/24 at 05:33pm, Uladzislau Rezki wrote:
> On Tue, Jun 25, 2024 at 09:02:43PM +0800, Baoquan He wrote:
> > On 06/25/24 at 02:40pm, Uladzislau Rezki wrote:
> > > On Tue, Jun 25, 2024 at 07:40:21PM +0800, Baoquan He wrote:
> > > > On 06/25/24 at 12:32pm, Uladzislau Rezki wrote:
> > > > > On Tue, Jun 25, 2024 at 11:30:33AM +0800, Baoquan He wrote:
> > > > > > On 06/24/24 at 02:16pm, Uladzislau Rezki wrote:
> > > > > > > On Fri, Jun 21, 2024 at 10:02:50PM +0800, Baoquan He wrote:
> > > > > > > > On 06/21/24 at 11:44am, Uladzislau Rezki wrote:
> > > > > > > > > On Fri, Jun 21, 2024 at 03:07:16PM +0800, Baoquan He wrote:
> > > > > > > > > > On 06/21/24 at 11:30am, Hailong Liu wrote:
> > > > > > > > > > > On Thu, 20. Jun 14:02, Nick Bowler wrote:
> > > > > > > > > > > > On 2024-06-20 02:19, Nick Bowler wrote:
> > > > > > > > ......
> > > > > > > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > > > > > > > index be2dd281ea76..18e87cafbaf2 100644
> > > > > > > > > > --- a/mm/vmalloc.c
> > > > > > > > > > +++ b/mm/vmalloc.c
> > > > > > > > > > @@ -2542,7 +2542,7 @@ static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
> > > > > > > > > >  static struct xarray *
> > > > > > > > > >  addr_to_vb_xa(unsigned long addr)
> > > > > > > > > >  {
> > > > > > > > > > -	int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> > > > > > > > > > +	int index = (addr / VMAP_BLOCK_SIZE) % nr_cpu_ids;
> > > > > > > > > >  
> > > > > > > > > >  	return &per_cpu(vmap_block_queue, index).vmap_blocks;
> > > > > > > > > >  }
> > > > > > > > > > 
> > > > > > > > > The problem i see is about not-initializing of the:
> > > > > > > > > <snip>
> > > > > > > > > 	for_each_possible_cpu(i) {
> > > > > > > > > 		struct vmap_block_queue *vbq;
> > > > > > > > > 		struct vfree_deferred *p;
> > > > > > > > > 
> > > > > > > > > 		vbq = &per_cpu(vmap_block_queue, i);
> > > > > > > > > 		spin_lock_init(&vbq->lock);
> > > > > > > > > 		INIT_LIST_HEAD(&vbq->free);
> > > > > > > > > 		p = &per_cpu(vfree_deferred, i);
> > > > > > > > > 		init_llist_head(&p->list);
> > > > > > > > > 		INIT_WORK(&p->wq, delayed_vfree_work);
> > > > > > > > > 		xa_init(&vbq->vmap_blocks);
> > > > > > > > > 	}
> > > > > > > > > <snip>
> > > > > > > > > 
> > > > > > > > > correctly or fully. It is my bad i did not think that CPUs in a possible mask
> > > > > > > > > can be non sequential :-/
> > > > > > > > > 
> > > > > > > > > nr_cpu_ids - is not the max possible CPU. For example, in Nick case,
> > > > > > > > > when he has two CPUs, num_possible_cpus() and nr_cpu_ids are the same.
> > > > > > > > 
> > > > > > > > I checked the generic version of setup_nr_cpu_ids(), from codes, they
> > > > > > > > are different with my understanding.
> > > > > > > > 
> > > > > > > > kernel/smp.c
> > > > > > > > void __init setup_nr_cpu_ids(void)
> > > > > > > > {
> > > > > > > >         set_nr_cpu_ids(find_last_bit(cpumask_bits(cpu_possible_mask), NR_CPUS) + 1);
> > > > > > > > }
> > > > > > > > 
> > > > > > > I see that it is not a weak function, so it is generic, thus the
> > > > > > > behavior can not be overwritten, which is great. This does what we
> > > > > > > need.
> > > > > > > 
> > > > > > > Thank you for checking this you are right!
> > > > > > 
> > > > > > Thanks for confirming this.
> > > > > > 
> > > > > > > 
> > > > > > > Then it is just a matter of proper initialization of the hash:
> > > > > > > 
> > > > > > > <snip>
> > > > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > > > > index 5d3aa2dc88a8..1733946f7a12 100644
> > > > > > > --- a/mm/vmalloc.c
> > > > > > > +++ b/mm/vmalloc.c
> > > > > > > @@ -5087,7 +5087,13 @@ void __init vmalloc_init(void)
> > > > > > >          */
> > > > > > >         vmap_area_cachep = KMEM_CACHE(vmap_area, SLAB_PANIC);
> > > > > > >  
> > > > > > > -       for_each_possible_cpu(i) {
> > > > > > > +       /*
> > > > > > > +        * We use "nr_cpu_ids" here because some architectures
> > > > > > > +        * may have "gaps" in cpu-possible-mask. It is OK for
> > > > > > > +        * per-cpu approaches but is not OK for cases where it
> > > > > > > +        * can be used as hashes also.
> > > > > > > +        */
> > > > > > > +       for (i = 0; i < nr_cpu_ids; i++) {
> > > > > > 
> > > > > > I was wrong about earlier comments. Percpu variables are only available
> > > > > > on possible CPUs. For those nonexistent possible CPUs of static percpu
> > > > > > variable vmap_block_queue, there isn't memory allocated and mapped for
> > > > > > them. So accessing into them will cause problem.
> > > > > > 
> > > > > > In Nick's case, there are only CPU0, CPU2. If you access
> > > > > > &per_cpu(vmap_block_queue, 1), problem occurs. So I think we may need to
> > > > > > change to take other way for vbq. E.g:
> > > > > > 1) Storing the vb in the nearest neighbouring vbq on possible CPU as
> > > > > >    below draft patch;
> > > > > > 2) create an normal array to store vbq of size nr_cpu_ids, then we can
> > > > > >    store/fetch each vbq on non-possible CPU?
> > > > > > 
> > > > > A correct way, i think, is to create a normal array. A quick fix can be
> > > > > to stick to a next possible CPU.
> > > > > 
> > > > > > The way 1) is simpler, the existing code can be adapted a little just as
> > > > > > below.
> > > > > > 
> > > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > > > index 633363997dec..59a8951cc6c0 100644
> > > > > > --- a/mm/vmalloc.c
> > > > > > +++ b/mm/vmalloc.c
> > > > > > @@ -2542,7 +2542,10 @@ static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
> > > > > >  static struct xarray *
> > > > > >  addr_to_vb_xa(unsigned long addr)
> > > > > >  {
> > > > > > -	int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> > > > > > +	int index = (addr / VMAP_BLOCK_SIZE) % nr_cpu_ids;
> > > > > > +
> > > > > > +	if (!cpu_possible(idex))
> > > > > > +		index = cpumask_next(index, cpu_possible_mask);
> > > > > >
> > > > > cpumask_next() can return nr_cpu_ids if no next bits set.
> > > > 
> > > > It won't. nr_cpu_ids is the largest index + 1, the hashed index will
> > > > be:  0 =<  index  <= (nr_cpu_ids - 1) e.g cpu_possible_mask is
> > > > b10001111, the nr_cpu_ids is 8, the largest bit is cpu7.
> > > > cpu_possible(index) will check that. So the largest bit of cpumask_next()
> > > > returns is (nr_cpu_ids - 1).
> > > > 
> > > /**
> > >  * cpumask_next - get the next cpu in a cpumask
> > >  * @n: the cpu prior to the place to search (i.e. return will be > @n)
> > >  * @srcp: the cpumask pointer
> > >  *
> > >  * Return: >= nr_cpu_ids if no further cpus set.
> > 
> > Ah, I got what you mean. In the vbq case, it may not have chance to get
> > a return number as nr_cpu_ids. Becuase the hashed index limits the
> > range to [0, nr_cpu_ids-1], and cpu_possible(index) will guarantee it
> > won't be the highest cpu number [nr_cpu_ids-1] since CPU[nr_cpu_ids-1] must
> > be possible CPU.
> > 
> > Do I miss some corner cases?
> > 
> Right. We guarantee that a highest CPU is available by doing: % nr_cpu_ids.
> So we do not need to use *next_wrap() variant. You do not miss anything :)
> 
> Hailong Liu has proposed more simpler version:
> 
> <snip>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 11fe5ea208aa..e1e63ffb9c57 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -1994,8 +1994,9 @@ static struct xarray *
>  addr_to_vb_xa(unsigned long addr)
>  {
>         int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> +       int cpu = cpumask_nth(index, cpu_possible_mask);
> 
> -       return &per_cpu(vmap_block_queue, index).vmap_blocks;
> +       return &per_cpu(vmap_block_queue, cpu).vmap_blocks;
> <snip>
> 
> which just takes a next CPU if an index is not set in the cpu_possible_mask.
> 
> The only thing that can be updated in the patch is to replace num_possible_cpu()
> by the nr_cpu_ids.
> 
> Any thoughts? I think we need to fix it by a minor change so it is
> easier to back-port on stable kernels.

Yeah, sounds good since the regresson commit is merged in v6.3.
Please feel free to post this and the hash array patch separately for
formal reviewing.

By the way, when I am replying this mail, I check the cpumask_nth()
again. I doubt it may take more checking then cpu_possible(), given most
of systems don't have gaps in cpu_possible_mask. I could be dizzy at
this moment.

static inline unsigned int cpumask_nth(unsigned int cpu, const struct cpumask *srcp)
{
        return find_nth_bit(cpumask_bits(srcp), small_cpumask_bits, cpumask_check(cpu));
}



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: PROBLEM: kernel crashes when running xfsdump since ~6.4
  2024-06-25 15:49                         ` Baoquan He
@ 2024-06-25 16:49                           ` Uladzislau Rezki
  2024-06-25 20:05                             ` Uladzislau Rezki
  0 siblings, 1 reply; 39+ messages in thread
From: Uladzislau Rezki @ 2024-06-25 16:49 UTC (permalink / raw)
  To: Nick Bowler
  Cc: Uladzislau Rezki, Hailong Liu, Nick Bowler, linux-kernel,
	Linux regressions mailing list, linux-mm, sparclinux,
	Andrew Morton

> > > > /**
> > > >  * cpumask_next - get the next cpu in a cpumask
> > > >  * @n: the cpu prior to the place to search (i.e. return will be > @n)
> > > >  * @srcp: the cpumask pointer
> > > >  *
> > > >  * Return: >= nr_cpu_ids if no further cpus set.
> > > 
> > > Ah, I got what you mean. In the vbq case, it may not have chance to get
> > > a return number as nr_cpu_ids. Becuase the hashed index limits the
> > > range to [0, nr_cpu_ids-1], and cpu_possible(index) will guarantee it
> > > won't be the highest cpu number [nr_cpu_ids-1] since CPU[nr_cpu_ids-1] must
> > > be possible CPU.
> > > 
> > > Do I miss some corner cases?
> > > 
> > Right. We guarantee that a highest CPU is available by doing: % nr_cpu_ids.
> > So we do not need to use *next_wrap() variant. You do not miss anything :)
> > 
> > Hailong Liu has proposed more simpler version:
> > 
> > <snip>
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 11fe5ea208aa..e1e63ffb9c57 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -1994,8 +1994,9 @@ static struct xarray *
> >  addr_to_vb_xa(unsigned long addr)
> >  {
> >         int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> > +       int cpu = cpumask_nth(index, cpu_possible_mask);
> > 
> > -       return &per_cpu(vmap_block_queue, index).vmap_blocks;
> > +       return &per_cpu(vmap_block_queue, cpu).vmap_blocks;
> > <snip>
> > 
> > which just takes a next CPU if an index is not set in the cpu_possible_mask.
> > 
> > The only thing that can be updated in the patch is to replace num_possible_cpu()
> > by the nr_cpu_ids.
> > 
> > Any thoughts? I think we need to fix it by a minor change so it is
> > easier to back-port on stable kernels.
> 
> Yeah, sounds good since the regresson commit is merged in v6.3.
> Please feel free to post this and the hash array patch separately for
> formal reviewing.
> 
Agreed! The patch about hash array i will post later.

> By the way, when I am replying this mail, I check the cpumask_nth()
> again. I doubt it may take more checking then cpu_possible(), given most
> of systems don't have gaps in cpu_possible_mask. I could be dizzy at
> this moment.
> 
> static inline unsigned int cpumask_nth(unsigned int cpu, const struct cpumask *srcp)
> {
>         return find_nth_bit(cpumask_bits(srcp), small_cpumask_bits, cpumask_check(cpu));
> }
> 
Yep, i do not think it is a big problem based on your noted fact.

Nick, could you please test your machine with below change?

<snip>
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 45e1506d58c3..5458fd2290cf 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2542,9 +2542,10 @@ static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
 static struct xarray *
 addr_to_vb_xa(unsigned long addr)
 {
-       int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
+       int index = (addr / VMAP_BLOCK_SIZE) % nr_cpu_ids;
+       int cpu = cpumask_nth(index, cpu_possible_mask);

-       return &per_cpu(vmap_block_queue, index).vmap_blocks;
+       return &per_cpu(vmap_block_queue, cpu).vmap_blocks;
 }

 /*
<snip>

Thank you in advance!

--
Uladzislau Rezki


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: PROBLEM: kernel crashes when running xfsdump since ~6.4
  2024-06-25 16:49                           ` Uladzislau Rezki
@ 2024-06-25 20:05                             ` Uladzislau Rezki
  2024-06-26  0:38                               ` Baoquan He
  2024-06-26  5:12                               ` Hailong Liu
  0 siblings, 2 replies; 39+ messages in thread
From: Uladzislau Rezki @ 2024-06-25 20:05 UTC (permalink / raw)
  To: Baoquan He
  Cc: Nick Bowler, Hailong Liu, linux-kernel,
	Linux regressions mailing list, linux-mm, sparclinux,
	Andrew Morton

> > > > > /**
> > > > >  * cpumask_next - get the next cpu in a cpumask
> > > > >  * @n: the cpu prior to the place to search (i.e. return will be > @n)
> > > > >  * @srcp: the cpumask pointer
> > > > >  *
> > > > >  * Return: >= nr_cpu_ids if no further cpus set.
> > > > 
> > > > Ah, I got what you mean. In the vbq case, it may not have chance to get
> > > > a return number as nr_cpu_ids. Becuase the hashed index limits the
> > > > range to [0, nr_cpu_ids-1], and cpu_possible(index) will guarantee it
> > > > won't be the highest cpu number [nr_cpu_ids-1] since CPU[nr_cpu_ids-1] must
> > > > be possible CPU.
> > > > 
> > > > Do I miss some corner cases?
> > > > 
> > > Right. We guarantee that a highest CPU is available by doing: % nr_cpu_ids.
> > > So we do not need to use *next_wrap() variant. You do not miss anything :)
> > > 
> > > Hailong Liu has proposed more simpler version:
> > > 
> > > <snip>
> > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > index 11fe5ea208aa..e1e63ffb9c57 100644
> > > --- a/mm/vmalloc.c
> > > +++ b/mm/vmalloc.c
> > > @@ -1994,8 +1994,9 @@ static struct xarray *
> > >  addr_to_vb_xa(unsigned long addr)
> > >  {
> > >         int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> > > +       int cpu = cpumask_nth(index, cpu_possible_mask);
> > > 
> > > -       return &per_cpu(vmap_block_queue, index).vmap_blocks;
> > > +       return &per_cpu(vmap_block_queue, cpu).vmap_blocks;
> > > <snip>
> > > 
> > > which just takes a next CPU if an index is not set in the cpu_possible_mask.
> > > 
> > > The only thing that can be updated in the patch is to replace num_possible_cpu()
> > > by the nr_cpu_ids.
> > > 
> > > Any thoughts? I think we need to fix it by a minor change so it is
> > > easier to back-port on stable kernels.
> > 
> > Yeah, sounds good since the regresson commit is merged in v6.3.
> > Please feel free to post this and the hash array patch separately for
> > formal reviewing.
> > 
> Agreed! The patch about hash array i will post later.
> 
> > By the way, when I am replying this mail, I check the cpumask_nth()
> > again. I doubt it may take more checking then cpu_possible(), given most
> > of systems don't have gaps in cpu_possible_mask. I could be dizzy at
> > this moment.
> > 
> > static inline unsigned int cpumask_nth(unsigned int cpu, const struct cpumask *srcp)
> > {
> >         return find_nth_bit(cpumask_bits(srcp), small_cpumask_bits, cpumask_check(cpu));
> > }
> > 
> Yep, i do not think it is a big problem based on your noted fact.
> 
Checked. There is a difference:

1. Default

<snip>
...
+   15.95%     6.05%  [kernel]        [k] __vmap_pages_range_noflush
+   15.91%     1.74%  [kernel]        [k] addr_to_vb_xa <---------------
+   15.13%    12.05%  [kernel]        [k] vunmap_p4d_range
+   14.17%    13.38%  [kernel]        [k] __find_nth_bit <--------------
+   10.62%     0.00%  [kernel]        [k] ret_from_fork_asm
+   10.62%     0.00%  [kernel]        [k] ret_from_fork
+   10.62%     0.00%  [kernel]        [k] kthread
...
<snip>

2. Check if cpu_possible() and then fallback to cpumask_nth() if not

<snip>
...
+    6.84%     0.29%  [kernel]          [k] alloc_vmap_area
+    6.80%     6.70%  [kernel]          [k] native_queued_spin_lock_slowpath
+    4.24%     0.09%  [kernel]          [k] free_vmap_block
+    2.41%     2.38%  [kernel]          [k] addr_to_vb_xa <-----------
+    1.94%     1.91%  [kernel]          [k] xas_start
...
<snip>

It is _worth_ to check if an index is in possible mask:

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 45e1506d58c3..af20f78c2cbf 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2542,7 +2542,10 @@ static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
 static struct xarray *
 addr_to_vb_xa(unsigned long addr)
 {
-       int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
+       int index = (addr / VMAP_BLOCK_SIZE) % nr_cpu_ids;
+
+       if (!cpu_possible(index))
+               index = cpumask_nth(index, cpu_possible_mask);

        return &per_cpu(vmap_block_queue, index).vmap_blocks;
 }

cpumask_nth() is not cheap. My measurements are based on a synthetic
tight test and it detects a difference. In a real workloads it should
not be visible. Having gaps is not a common case plus a "slow path"
will be mitigated by the hit against possible mask.

--
Uladzislau Rezki


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: PROBLEM: kernel crashes when running xfsdump since ~6.4
  2024-06-25 20:05                             ` Uladzislau Rezki
@ 2024-06-26  0:38                               ` Baoquan He
  2024-06-26  5:12                               ` Hailong Liu
  1 sibling, 0 replies; 39+ messages in thread
From: Baoquan He @ 2024-06-26  0:38 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Nick Bowler, Hailong Liu, linux-kernel,
	Linux regressions mailing list, linux-mm, sparclinux,
	Andrew Morton

On 06/25/24 at 10:05pm, Uladzislau Rezki wrote:
> > > > > > /**
> > > > > >  * cpumask_next - get the next cpu in a cpumask
> > > > > >  * @n: the cpu prior to the place to search (i.e. return will be > @n)
> > > > > >  * @srcp: the cpumask pointer
> > > > > >  *
> > > > > >  * Return: >= nr_cpu_ids if no further cpus set.
> > > > > 
> > > > > Ah, I got what you mean. In the vbq case, it may not have chance to get
> > > > > a return number as nr_cpu_ids. Becuase the hashed index limits the
> > > > > range to [0, nr_cpu_ids-1], and cpu_possible(index) will guarantee it
> > > > > won't be the highest cpu number [nr_cpu_ids-1] since CPU[nr_cpu_ids-1] must
> > > > > be possible CPU.
> > > > > 
> > > > > Do I miss some corner cases?
> > > > > 
> > > > Right. We guarantee that a highest CPU is available by doing: % nr_cpu_ids.
> > > > So we do not need to use *next_wrap() variant. You do not miss anything :)
> > > > 
> > > > Hailong Liu has proposed more simpler version:
> > > > 
> > > > <snip>
> > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > index 11fe5ea208aa..e1e63ffb9c57 100644
> > > > --- a/mm/vmalloc.c
> > > > +++ b/mm/vmalloc.c
> > > > @@ -1994,8 +1994,9 @@ static struct xarray *
> > > >  addr_to_vb_xa(unsigned long addr)
> > > >  {
> > > >         int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> > > > +       int cpu = cpumask_nth(index, cpu_possible_mask);
> > > > 
> > > > -       return &per_cpu(vmap_block_queue, index).vmap_blocks;
> > > > +       return &per_cpu(vmap_block_queue, cpu).vmap_blocks;
> > > > <snip>
> > > > 
> > > > which just takes a next CPU if an index is not set in the cpu_possible_mask.
> > > > 
> > > > The only thing that can be updated in the patch is to replace num_possible_cpu()
> > > > by the nr_cpu_ids.
> > > > 
> > > > Any thoughts? I think we need to fix it by a minor change so it is
> > > > easier to back-port on stable kernels.
> > > 
> > > Yeah, sounds good since the regresson commit is merged in v6.3.
> > > Please feel free to post this and the hash array patch separately for
> > > formal reviewing.
> > > 
> > Agreed! The patch about hash array i will post later.
> > 
> > > By the way, when I am replying this mail, I check the cpumask_nth()
> > > again. I doubt it may take more checking then cpu_possible(), given most
> > > of systems don't have gaps in cpu_possible_mask. I could be dizzy at
> > > this moment.
> > > 
> > > static inline unsigned int cpumask_nth(unsigned int cpu, const struct cpumask *srcp)
> > > {
> > >         return find_nth_bit(cpumask_bits(srcp), small_cpumask_bits, cpumask_check(cpu));
> > > }
> > > 
> > Yep, i do not think it is a big problem based on your noted fact.
> > 
> Checked. There is a difference:
> 
> 1. Default
> 
> <snip>
> ...
> +   15.95%     6.05%  [kernel]        [k] __vmap_pages_range_noflush
> +   15.91%     1.74%  [kernel]        [k] addr_to_vb_xa <---------------
> +   15.13%    12.05%  [kernel]        [k] vunmap_p4d_range
> +   14.17%    13.38%  [kernel]        [k] __find_nth_bit <--------------
> +   10.62%     0.00%  [kernel]        [k] ret_from_fork_asm
> +   10.62%     0.00%  [kernel]        [k] ret_from_fork
> +   10.62%     0.00%  [kernel]        [k] kthread
> ...
> <snip>
> 
> 2. Check if cpu_possible() and then fallback to cpumask_nth() if not
> 
> <snip>
> ...
> +    6.84%     0.29%  [kernel]          [k] alloc_vmap_area
> +    6.80%     6.70%  [kernel]          [k] native_queued_spin_lock_slowpath
> +    4.24%     0.09%  [kernel]          [k] free_vmap_block
> +    2.41%     2.38%  [kernel]          [k] addr_to_vb_xa <-----------
> +    1.94%     1.91%  [kernel]          [k] xas_start
> ...
> <snip>
> 
> It is _worth_ to check if an index is in possible mask:
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 45e1506d58c3..af20f78c2cbf 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2542,7 +2542,10 @@ static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
>  static struct xarray *
>  addr_to_vb_xa(unsigned long addr)
>  {
> -       int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> +       int index = (addr / VMAP_BLOCK_SIZE) % nr_cpu_ids;
> +
> +       if (!cpu_possible(index))
> +               index = cpumask_nth(index, cpu_possible_mask);
> 
>         return &per_cpu(vmap_block_queue, index).vmap_blocks;
>  }
> 
> cpumask_nth() is not cheap. My measurements are based on a synthetic
> tight test and it detects a difference. In a real workloads it should
> not be visible. Having gaps is not a common case plus a "slow path"
> will be mitigated by the hit against possible mask.

Ah, this is consistent with my understanding from the code, thanks
for confirming by testing.



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: PROBLEM: kernel crashes when running xfsdump since ~6.4
  2024-06-25 20:05                             ` Uladzislau Rezki
  2024-06-26  0:38                               ` Baoquan He
@ 2024-06-26  5:12                               ` Hailong Liu
  2024-06-26  9:15                                 ` Uladzislau Rezki
  1 sibling, 1 reply; 39+ messages in thread
From: Hailong Liu @ 2024-06-26  5:12 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Baoquan He, Nick Bowler, linux-kernel,
	Linux regressions mailing list, linux-mm, sparclinux,
	Andrew Morton

On Tue, 25. Jun 22:05, Uladzislau Rezki wrote:
> > > > > > /**
> > > > > >  * cpumask_next - get the next cpu in a cpumask
> > > > > >  * @n: the cpu prior to the place to search (i.e. return will be > @n)
> > > > > >  * @srcp: the cpumask pointer
> > > > > >  *
> > > > > >  * Return: >= nr_cpu_ids if no further cpus set.
> > > > >
> > > > > Ah, I got what you mean. In the vbq case, it may not have chance to get
> > > > > a return number as nr_cpu_ids. Becuase the hashed index limits the
> > > > > range to [0, nr_cpu_ids-1], and cpu_possible(index) will guarantee it
> > > > > won't be the highest cpu number [nr_cpu_ids-1] since CPU[nr_cpu_ids-1] must
> > > > > be possible CPU.
> > > > >
> > > > > Do I miss some corner cases?
> > > > >
> > > > Right. We guarantee that a highest CPU is available by doing: % nr_cpu_ids.
> > > > So we do not need to use *next_wrap() variant. You do not miss anything :)
> > > >
> > > > Hailong Liu has proposed more simpler version:
> > > >
> > > > <snip>
> > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > index 11fe5ea208aa..e1e63ffb9c57 100644
> > > > --- a/mm/vmalloc.c
> > > > +++ b/mm/vmalloc.c
> > > > @@ -1994,8 +1994,9 @@ static struct xarray *
> > > >  addr_to_vb_xa(unsigned long addr)
> > > >  {
> > > >         int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> > > > +       int cpu = cpumask_nth(index, cpu_possible_mask);
> > > >
> > > > -       return &per_cpu(vmap_block_queue, index).vmap_blocks;
> > > > +       return &per_cpu(vmap_block_queue, cpu).vmap_blocks;
> > > > <snip>
> > > >
> > > > which just takes a next CPU if an index is not set in the cpu_possible_mask.
> > > >
> > > > The only thing that can be updated in the patch is to replace num_possible_cpu()
> > > > by the nr_cpu_ids.
> > > >
> > > > Any thoughts? I think we need to fix it by a minor change so it is
> > > > easier to back-port on stable kernels.
> > >
> > > Yeah, sounds good since the regresson commit is merged in v6.3.
> > > Please feel free to post this and the hash array patch separately for
> > > formal reviewing.
> > >
> > Agreed! The patch about hash array i will post later.
> >
> > > By the way, when I am replying this mail, I check the cpumask_nth()
> > > again. I doubt it may take more checking then cpu_possible(), given most
> > > of systems don't have gaps in cpu_possible_mask. I could be dizzy at
> > > this moment.
> > >
> > > static inline unsigned int cpumask_nth(unsigned int cpu, const struct cpumask *srcp)
> > > {
> > >         return find_nth_bit(cpumask_bits(srcp), small_cpumask_bits, cpumask_check(cpu));
> > > }
> > >
> > Yep, i do not think it is a big problem based on your noted fact.
> >
> Checked. There is a difference:
>
> 1. Default
>
> <snip>
> ...
> +   15.95%     6.05%  [kernel]        [k] __vmap_pages_range_noflush
> +   15.91%     1.74%  [kernel]        [k] addr_to_vb_xa <---------------
> +   15.13%    12.05%  [kernel]        [k] vunmap_p4d_range
> +   14.17%    13.38%  [kernel]        [k] __find_nth_bit <--------------
> +   10.62%     0.00%  [kernel]        [k] ret_from_fork_asm
> +   10.62%     0.00%  [kernel]        [k] ret_from_fork
> +   10.62%     0.00%  [kernel]        [k] kthread
> ...
> <snip>
>
> 2. Check if cpu_possible() and then fallback to cpumask_nth() if not
>
> <snip>
> ...
> +    6.84%     0.29%  [kernel]          [k] alloc_vmap_area
> +    6.80%     6.70%  [kernel]          [k] native_queued_spin_lock_slowpath
> +    4.24%     0.09%  [kernel]          [k] free_vmap_block
> +    2.41%     2.38%  [kernel]          [k] addr_to_vb_xa <-----------
> +    1.94%     1.91%  [kernel]          [k] xas_start
> ...
> <snip>
>
> It is _worth_ to check if an index is in possible mask:
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 45e1506d58c3..af20f78c2cbf 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2542,7 +2542,10 @@ static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
>  static struct xarray *
>  addr_to_vb_xa(unsigned long addr)
>  {
> -       int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> +       int index = (addr / VMAP_BLOCK_SIZE) % nr_cpu_ids;
IIUC, use nr_cpu_ids here maybe incorrect.

take b101 as example, nr_cpu_ids is 3. if index is 2 cpumask_nth(2, cpu_possible_mask);
might return 64.
/**
 * cpumask_nth_and - get the first cpu in 2 cpumasks
 * @srcp1: the cpumask pointer
 * @srcp2: the cpumask pointer
 * @cpu: the N'th cpu to find, starting from 0 <--- N'th cpu
 *
 * Returns >= nr_cpu_ids if such cpu doesn't exist. <-----
 */
static inline
unsigned int cpumask_nth_and(unsigned int cpu, const struct cpumask *srcp1,
							const struct cpumask *srcp2)
{
	return find_nth_and_bit(cpumask_bits(srcp1), cpumask_bits(srcp2),
				nr_cpumask_bits, cpumask_check(cpu));
}

I use num_possible_cpus() and cpumask_nth() here to distribute the addresses
evenly across different CPUs.

if we use cpumask_next(index) or use cpumask_nth(index, cpu_possible_mask)
becomes as follows:
  CPU_0  CPU_2  CPU_2
    |      |      |
    V      V      V
0     10     20     30     40     50     60
|------|------|------|------|------|------|..

> +
> +       if (!cpu_possible(index))
> +               index = cpumask_nth(index, cpu_possible_mask);
>
>         return &per_cpu(vmap_block_queue, index).vmap_blocks;
>  }
>
> cpumask_nth() is not cheap. My measurements are based on a synthetic
> tight test and it detects a difference. In a real workloads it should
> not be visible. Having gaps is not a common case plus a "slow path"
> will be mitigated by the hit against possible mask.

If cpumask_nth() is not cheap or have performance regression. Perhaps we
can use the solution suggested by Haoquan. I’ve drafted as follows:

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 11fe5ea208aa..355dbfdf51f1 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -92,6 +92,7 @@ struct vfree_deferred {
        struct work_struct wq;
 };
 static DEFINE_PER_CPU(struct vfree_deferred, vfree_deferred);
+static unsigned int *table_non_seq_cpu;

 /*** Page table manipulation functions ***/
 static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
@@ -1995,6 +1996,10 @@ addr_to_vb_xa(unsigned long addr)
 {
        int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();

+       /* recalculate the cpuid if cpumask is not full. */
+       if (table_non_seq_cpu)
+               index = table_non_seq_cpu[index];
+
        return &per_cpu(vmap_block_queue, index).vmap_blocks;
 }

@@ -4473,17 +4478,25 @@ void __init vmalloc_init(void)
 {
        struct vmap_area *va;
        struct vm_struct *tmp;
-       int i;
+       int i, inx = 0;

        /*
         * Create the cache for vmap_area objects.
         */
        vmap_area_cachep = KMEM_CACHE(vmap_area, SLAB_PANIC);

+       if (!cpumask_full(cpu_possible_mask)) {
+               table_non_seq_cpu = kzalloc(num_possible_cpus() * sizeof(unsigned int),
+                                           GFP_NOWAIT);
+               BUG_ON(!table_non_seq_cpu);
+       }
+
        for_each_possible_cpu(i) {
                struct vmap_block_queue *vbq;
                struct vfree_deferred *p;

+               if (table_non_seq_cpu)
+                       table_non_seq_cpu[inx++] = i;
                vbq = &per_cpu(vmap_block_queue, i);
                spin_lock_init(&vbq->lock);
                INIT_LIST_HEAD(&vbq->free);
>
> --
> Uladzislau Rezki

--
help you, help me,
Hailong.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: PROBLEM: kernel crashes when running xfsdump since ~6.4
  2024-06-26  5:12                               ` Hailong Liu
@ 2024-06-26  9:15                                 ` Uladzislau Rezki
  2024-06-26 10:03                                   ` Hailong Liu
  0 siblings, 1 reply; 39+ messages in thread
From: Uladzislau Rezki @ 2024-06-26  9:15 UTC (permalink / raw)
  To: Hailong Liu
  Cc: Uladzislau Rezki, Baoquan He, Nick Bowler, linux-kernel,
	Linux regressions mailing list, linux-mm, sparclinux,
	Andrew Morton

On Wed, Jun 26, 2024 at 01:12:06PM +0800, Hailong Liu wrote:
> On Tue, 25. Jun 22:05, Uladzislau Rezki wrote:
> > > > > > > /**
> > > > > > >  * cpumask_next - get the next cpu in a cpumask
> > > > > > >  * @n: the cpu prior to the place to search (i.e. return will be > @n)
> > > > > > >  * @srcp: the cpumask pointer
> > > > > > >  *
> > > > > > >  * Return: >= nr_cpu_ids if no further cpus set.
> > > > > >
> > > > > > Ah, I got what you mean. In the vbq case, it may not have chance to get
> > > > > > a return number as nr_cpu_ids. Becuase the hashed index limits the
> > > > > > range to [0, nr_cpu_ids-1], and cpu_possible(index) will guarantee it
> > > > > > won't be the highest cpu number [nr_cpu_ids-1] since CPU[nr_cpu_ids-1] must
> > > > > > be possible CPU.
> > > > > >
> > > > > > Do I miss some corner cases?
> > > > > >
> > > > > Right. We guarantee that a highest CPU is available by doing: % nr_cpu_ids.
> > > > > So we do not need to use *next_wrap() variant. You do not miss anything :)
> > > > >
> > > > > Hailong Liu has proposed more simpler version:
> > > > >
> > > > > <snip>
> > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > > index 11fe5ea208aa..e1e63ffb9c57 100644
> > > > > --- a/mm/vmalloc.c
> > > > > +++ b/mm/vmalloc.c
> > > > > @@ -1994,8 +1994,9 @@ static struct xarray *
> > > > >  addr_to_vb_xa(unsigned long addr)
> > > > >  {
> > > > >         int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> > > > > +       int cpu = cpumask_nth(index, cpu_possible_mask);
> > > > >
> > > > > -       return &per_cpu(vmap_block_queue, index).vmap_blocks;
> > > > > +       return &per_cpu(vmap_block_queue, cpu).vmap_blocks;
> > > > > <snip>
> > > > >
> > > > > which just takes a next CPU if an index is not set in the cpu_possible_mask.
> > > > >
> > > > > The only thing that can be updated in the patch is to replace num_possible_cpu()
> > > > > by the nr_cpu_ids.
> > > > >
> > > > > Any thoughts? I think we need to fix it by a minor change so it is
> > > > > easier to back-port on stable kernels.
> > > >
> > > > Yeah, sounds good since the regresson commit is merged in v6.3.
> > > > Please feel free to post this and the hash array patch separately for
> > > > formal reviewing.
> > > >
> > > Agreed! The patch about hash array i will post later.
> > >
> > > > By the way, when I am replying this mail, I check the cpumask_nth()
> > > > again. I doubt it may take more checking then cpu_possible(), given most
> > > > of systems don't have gaps in cpu_possible_mask. I could be dizzy at
> > > > this moment.
> > > >
> > > > static inline unsigned int cpumask_nth(unsigned int cpu, const struct cpumask *srcp)
> > > > {
> > > >         return find_nth_bit(cpumask_bits(srcp), small_cpumask_bits, cpumask_check(cpu));
> > > > }
> > > >
> > > Yep, i do not think it is a big problem based on your noted fact.
> > >
> > Checked. There is a difference:
> >
> > 1. Default
> >
> > <snip>
> > ...
> > +   15.95%     6.05%  [kernel]        [k] __vmap_pages_range_noflush
> > +   15.91%     1.74%  [kernel]        [k] addr_to_vb_xa <---------------
> > +   15.13%    12.05%  [kernel]        [k] vunmap_p4d_range
> > +   14.17%    13.38%  [kernel]        [k] __find_nth_bit <--------------
> > +   10.62%     0.00%  [kernel]        [k] ret_from_fork_asm
> > +   10.62%     0.00%  [kernel]        [k] ret_from_fork
> > +   10.62%     0.00%  [kernel]        [k] kthread
> > ...
> > <snip>
> >
> > 2. Check if cpu_possible() and then fallback to cpumask_nth() if not
> >
> > <snip>
> > ...
> > +    6.84%     0.29%  [kernel]          [k] alloc_vmap_area
> > +    6.80%     6.70%  [kernel]          [k] native_queued_spin_lock_slowpath
> > +    4.24%     0.09%  [kernel]          [k] free_vmap_block
> > +    2.41%     2.38%  [kernel]          [k] addr_to_vb_xa <-----------
> > +    1.94%     1.91%  [kernel]          [k] xas_start
> > ...
> > <snip>
> >
> > It is _worth_ to check if an index is in possible mask:
> >
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 45e1506d58c3..af20f78c2cbf 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -2542,7 +2542,10 @@ static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
> >  static struct xarray *
> >  addr_to_vb_xa(unsigned long addr)
> >  {
> > -       int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> > +       int index = (addr / VMAP_BLOCK_SIZE) % nr_cpu_ids;
> IIUC, use nr_cpu_ids here maybe incorrect.
> 
> take b101 as example, nr_cpu_ids is 3. if index is 2 cpumask_nth(2, cpu_possible_mask);
> might return 64.
>
But then a CPU2 becomes possible? Cutting by % nr_cpu_ids generates values < nr_cpu_ids.
So, last CPU is always possible and we never do cpumask_nth() on a last possible CPU.

What i miss here?

--
Uladzislau Rezki


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: PROBLEM: kernel crashes when running xfsdump since ~6.4
  2024-06-26  9:15                                 ` Uladzislau Rezki
@ 2024-06-26 10:03                                   ` Hailong Liu
  2024-06-26 10:51                                     ` Baoquan He
  2024-06-26 10:51                                     ` Uladzislau Rezki
  0 siblings, 2 replies; 39+ messages in thread
From: Hailong Liu @ 2024-06-26 10:03 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Baoquan He, Nick Bowler, linux-kernel,
	Linux regressions mailing list, linux-mm, sparclinux,
	Andrew Morton

On Wed, 26. Jun 11:15, Uladzislau Rezki wrote:
> On Wed, Jun 26, 2024 at 01:12:06PM +0800, Hailong Liu wrote:
> > On Tue, 25. Jun 22:05, Uladzislau Rezki wrote:
> > > > > > > > /**
> > > > > > > >  * cpumask_next - get the next cpu in a cpumask
> > > > > > > >  * @n: the cpu prior to the place to search (i.e. return will be > @n)
> > > > > > > >  * @srcp: the cpumask pointer
> > > > > > > >  *
> > > > > > > >  * Return: >= nr_cpu_ids if no further cpus set.
> > > > > > >
> > > > > > > Ah, I got what you mean. In the vbq case, it may not have chance to get
> > > > > > > a return number as nr_cpu_ids. Becuase the hashed index limits the
> > > > > > > range to [0, nr_cpu_ids-1], and cpu_possible(index) will guarantee it
> > > > > > > won't be the highest cpu number [nr_cpu_ids-1] since CPU[nr_cpu_ids-1] must
> > > > > > > be possible CPU.
> > > > > > >
> > > > > > > Do I miss some corner cases?
> > > > > > >
> > > > > > Right. We guarantee that a highest CPU is available by doing: % nr_cpu_ids.
> > > > > > So we do not need to use *next_wrap() variant. You do not miss anything :)
> > > > > >
> > > > > > Hailong Liu has proposed more simpler version:
> > > > > >
> > > > > > <snip>
> > > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > > > index 11fe5ea208aa..e1e63ffb9c57 100644
> > > > > > --- a/mm/vmalloc.c
> > > > > > +++ b/mm/vmalloc.c
> > > > > > @@ -1994,8 +1994,9 @@ static struct xarray *
> > > > > >  addr_to_vb_xa(unsigned long addr)
> > > > > >  {
> > > > > >         int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> > > > > > +       int cpu = cpumask_nth(index, cpu_possible_mask);
> > > > > >
> > > > > > -       return &per_cpu(vmap_block_queue, index).vmap_blocks;
> > > > > > +       return &per_cpu(vmap_block_queue, cpu).vmap_blocks;
> > > > > > <snip>
> > > > > >
> > > > > > which just takes a next CPU if an index is not set in the cpu_possible_mask.
> > > > > >
> > > > > > The only thing that can be updated in the patch is to replace num_possible_cpu()
> > > > > > by the nr_cpu_ids.
> > > > > >
> > > > > > Any thoughts? I think we need to fix it by a minor change so it is
> > > > > > easier to back-port on stable kernels.
> > > > >
> > > > > Yeah, sounds good since the regresson commit is merged in v6.3.
> > > > > Please feel free to post this and the hash array patch separately for
> > > > > formal reviewing.
> > > > >
> > > > Agreed! The patch about hash array i will post later.
> > > >
> > > > > By the way, when I am replying this mail, I check the cpumask_nth()
> > > > > again. I doubt it may take more checking then cpu_possible(), given most
> > > > > of systems don't have gaps in cpu_possible_mask. I could be dizzy at
> > > > > this moment.
> > > > >
> > > > > static inline unsigned int cpumask_nth(unsigned int cpu, const struct cpumask *srcp)
> > > > > {
> > > > >         return find_nth_bit(cpumask_bits(srcp), small_cpumask_bits, cpumask_check(cpu));
> > > > > }
> > > > >
> > > > Yep, i do not think it is a big problem based on your noted fact.
> > > >
> > > Checked. There is a difference:
> > >
> > > 1. Default
> > >
> > > <snip>
> > > ...
> > > +   15.95%     6.05%  [kernel]        [k] __vmap_pages_range_noflush
> > > +   15.91%     1.74%  [kernel]        [k] addr_to_vb_xa <---------------
> > > +   15.13%    12.05%  [kernel]        [k] vunmap_p4d_range
> > > +   14.17%    13.38%  [kernel]        [k] __find_nth_bit <--------------
> > > +   10.62%     0.00%  [kernel]        [k] ret_from_fork_asm
> > > +   10.62%     0.00%  [kernel]        [k] ret_from_fork
> > > +   10.62%     0.00%  [kernel]        [k] kthread
> > > ...
> > > <snip>
> > >
> > > 2. Check if cpu_possible() and then fallback to cpumask_nth() if not
> > >
> > > <snip>
> > > ...
> > > +    6.84%     0.29%  [kernel]          [k] alloc_vmap_area
> > > +    6.80%     6.70%  [kernel]          [k] native_queued_spin_lock_slowpath
> > > +    4.24%     0.09%  [kernel]          [k] free_vmap_block
> > > +    2.41%     2.38%  [kernel]          [k] addr_to_vb_xa <-----------
> > > +    1.94%     1.91%  [kernel]          [k] xas_start
> > > ...
> > > <snip>
> > >
> > > It is _worth_ to check if an index is in possible mask:
> > >
> > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > index 45e1506d58c3..af20f78c2cbf 100644
> > > --- a/mm/vmalloc.c
> > > +++ b/mm/vmalloc.c
> > > @@ -2542,7 +2542,10 @@ static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
> > >  static struct xarray *
> > >  addr_to_vb_xa(unsigned long addr)
> > >  {
> > > -       int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> > > +       int index = (addr / VMAP_BLOCK_SIZE) % nr_cpu_ids;
> > IIUC, use nr_cpu_ids here maybe incorrect.
> >
> > take b101 as example, nr_cpu_ids is 3. if index is 2 cpumask_nth(2, cpu_possible_mask);
> > might return 64.
> >
> But then a CPU2 becomes possible? Cutting by % nr_cpu_ids generates values < nr_cpu_ids.
> So, last CPU is always possible and we never do cpumask_nth() on a last possible CPU.
>
> What i miss here?
>
Sorry, I forget to reply to all :), I write a demo to test as follows:

static int cpumask_init(void)
{
       struct cpumask mask;
       unsigned int cpu_id;
       cpumask_clear(&mask);

       cpumask_set_cpu(1, &mask);
       cpumask_set_cpu(3, &mask);
       cpumask_set_cpu(5, &mask);

       cpu_id = find_last_bit(cpumask_bits(&mask), NR_CPUS) + 1;
       pr_info("cpu_id:%d\n", cpu_id);

       for (; i < nr_cpu_ids; i++) {
               pr_info("%d: cpu_%d\n", i, cpumask_nth(i, &mask));
       }

       return 0;
}

[    1.337020][    T1] cpu_id:6
[    1.337338][    T1] 0: cpu_1
[    1.337558][    T1] 1: cpu_3
[    1.337751][    T1] 2: cpu_5
[    1.337960][    T1] 3: cpu_64
[    1.338183][    T1] 4: cpu_64
[    1.338387][    T1] 5: cpu_64
[    1.338594][    T1] 6: cpu_64

In summary, the nr_cpu_ids = last_bit + 1, and cpumask_nth() return the nth cpu_id.

--
help you, help me,
Hailong.

> --
> Uladzislau Rezki

--
help you, help me,
Hailong.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: PROBLEM: kernel crashes when running xfsdump since ~6.4
  2024-06-26 10:03                                   ` Hailong Liu
@ 2024-06-26 10:51                                     ` Baoquan He
  2024-06-26 10:53                                       ` Uladzislau Rezki
  2024-06-26 11:30                                       ` Hailong Liu
  2024-06-26 10:51                                     ` Uladzislau Rezki
  1 sibling, 2 replies; 39+ messages in thread
From: Baoquan He @ 2024-06-26 10:51 UTC (permalink / raw)
  To: Hailong Liu
  Cc: Uladzislau Rezki, Nick Bowler, linux-kernel,
	Linux regressions mailing list, linux-mm, sparclinux,
	Andrew Morton

On 06/26/24 at 06:03pm, Hailong Liu wrote:
> On Wed, 26. Jun 11:15, Uladzislau Rezki wrote:
> > On Wed, Jun 26, 2024 at 01:12:06PM +0800, Hailong Liu wrote:
> > > On Tue, 25. Jun 22:05, Uladzislau Rezki wrote:
> > > > > > > > > /**
> > > > > > > > >  * cpumask_next - get the next cpu in a cpumask
> > > > > > > > >  * @n: the cpu prior to the place to search (i.e. return will be > @n)
> > > > > > > > >  * @srcp: the cpumask pointer
> > > > > > > > >  *
> > > > > > > > >  * Return: >= nr_cpu_ids if no further cpus set.
> > > > > > > >
> > > > > > > > Ah, I got what you mean. In the vbq case, it may not have chance to get
> > > > > > > > a return number as nr_cpu_ids. Becuase the hashed index limits the
> > > > > > > > range to [0, nr_cpu_ids-1], and cpu_possible(index) will guarantee it
> > > > > > > > won't be the highest cpu number [nr_cpu_ids-1] since CPU[nr_cpu_ids-1] must
> > > > > > > > be possible CPU.
> > > > > > > >
> > > > > > > > Do I miss some corner cases?
> > > > > > > >
> > > > > > > Right. We guarantee that a highest CPU is available by doing: % nr_cpu_ids.
> > > > > > > So we do not need to use *next_wrap() variant. You do not miss anything :)
> > > > > > >
> > > > > > > Hailong Liu has proposed more simpler version:
> > > > > > >
> > > > > > > <snip>
> > > > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > > > > index 11fe5ea208aa..e1e63ffb9c57 100644
> > > > > > > --- a/mm/vmalloc.c
> > > > > > > +++ b/mm/vmalloc.c
> > > > > > > @@ -1994,8 +1994,9 @@ static struct xarray *
> > > > > > >  addr_to_vb_xa(unsigned long addr)
> > > > > > >  {
> > > > > > >         int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> > > > > > > +       int cpu = cpumask_nth(index, cpu_possible_mask);
> > > > > > >
> > > > > > > -       return &per_cpu(vmap_block_queue, index).vmap_blocks;
> > > > > > > +       return &per_cpu(vmap_block_queue, cpu).vmap_blocks;
> > > > > > > <snip>
> > > > > > >
> > > > > > > which just takes a next CPU if an index is not set in the cpu_possible_mask.
> > > > > > >
> > > > > > > The only thing that can be updated in the patch is to replace num_possible_cpu()
> > > > > > > by the nr_cpu_ids.
> > > > > > >
> > > > > > > Any thoughts? I think we need to fix it by a minor change so it is
> > > > > > > easier to back-port on stable kernels.
> > > > > >
> > > > > > Yeah, sounds good since the regresson commit is merged in v6.3.
> > > > > > Please feel free to post this and the hash array patch separately for
> > > > > > formal reviewing.
> > > > > >
> > > > > Agreed! The patch about hash array i will post later.
> > > > >
> > > > > > By the way, when I am replying this mail, I check the cpumask_nth()
> > > > > > again. I doubt it may take more checking then cpu_possible(), given most
> > > > > > of systems don't have gaps in cpu_possible_mask. I could be dizzy at
> > > > > > this moment.
> > > > > >
> > > > > > static inline unsigned int cpumask_nth(unsigned int cpu, const struct cpumask *srcp)
> > > > > > {
> > > > > >         return find_nth_bit(cpumask_bits(srcp), small_cpumask_bits, cpumask_check(cpu));
> > > > > > }
> > > > > >
> > > > > Yep, i do not think it is a big problem based on your noted fact.
> > > > >
> > > > Checked. There is a difference:
> > > >
> > > > 1. Default
> > > >
> > > > <snip>
> > > > ...
> > > > +   15.95%     6.05%  [kernel]        [k] __vmap_pages_range_noflush
> > > > +   15.91%     1.74%  [kernel]        [k] addr_to_vb_xa <---------------
> > > > +   15.13%    12.05%  [kernel]        [k] vunmap_p4d_range
> > > > +   14.17%    13.38%  [kernel]        [k] __find_nth_bit <--------------
> > > > +   10.62%     0.00%  [kernel]        [k] ret_from_fork_asm
> > > > +   10.62%     0.00%  [kernel]        [k] ret_from_fork
> > > > +   10.62%     0.00%  [kernel]        [k] kthread
> > > > ...
> > > > <snip>
> > > >
> > > > 2. Check if cpu_possible() and then fallback to cpumask_nth() if not
> > > >
> > > > <snip>
> > > > ...
> > > > +    6.84%     0.29%  [kernel]          [k] alloc_vmap_area
> > > > +    6.80%     6.70%  [kernel]          [k] native_queued_spin_lock_slowpath
> > > > +    4.24%     0.09%  [kernel]          [k] free_vmap_block
> > > > +    2.41%     2.38%  [kernel]          [k] addr_to_vb_xa <-----------
> > > > +    1.94%     1.91%  [kernel]          [k] xas_start
> > > > ...
> > > > <snip>
> > > >
> > > > It is _worth_ to check if an index is in possible mask:
> > > >
> > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > index 45e1506d58c3..af20f78c2cbf 100644
> > > > --- a/mm/vmalloc.c
> > > > +++ b/mm/vmalloc.c
> > > > @@ -2542,7 +2542,10 @@ static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
> > > >  static struct xarray *
> > > >  addr_to_vb_xa(unsigned long addr)
> > > >  {
> > > > -       int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> > > > +       int index = (addr / VMAP_BLOCK_SIZE) % nr_cpu_ids;
> > > IIUC, use nr_cpu_ids here maybe incorrect.
> > >
> > > take b101 as example, nr_cpu_ids is 3. if index is 2 cpumask_nth(2, cpu_possible_mask);
> > > might return 64.
> > >
> > But then a CPU2 becomes possible? Cutting by % nr_cpu_ids generates values < nr_cpu_ids.
> > So, last CPU is always possible and we never do cpumask_nth() on a last possible CPU.
> >
> > What i miss here?
> >
> Sorry, I forget to reply to all :), I write a demo to test as follows:
> 
> static int cpumask_init(void)
> {
>        struct cpumask mask;
>        unsigned int cpu_id;
>        cpumask_clear(&mask);
> 
>        cpumask_set_cpu(1, &mask);
>        cpumask_set_cpu(3, &mask);
>        cpumask_set_cpu(5, &mask);
> 
>        cpu_id = find_last_bit(cpumask_bits(&mask), NR_CPUS) + 1;
>        pr_info("cpu_id:%d\n", cpu_id);
> 
>        for (; i < nr_cpu_ids; i++) {
>                pr_info("%d: cpu_%d\n", i, cpumask_nth(i, &mask));
>        }
> 
>        return 0;
> }
> 
> [    1.337020][    T1] cpu_id:6
> [    1.337338][    T1] 0: cpu_1
> [    1.337558][    T1] 1: cpu_3
> [    1.337751][    T1] 2: cpu_5
> [    1.337960][    T1] 3: cpu_64
> [    1.338183][    T1] 4: cpu_64
> [    1.338387][    T1] 5: cpu_64
> [    1.338594][    T1] 6: cpu_64
> 
> In summary, the nr_cpu_ids = last_bit + 1, and cpumask_nth() return the nth cpu_id.

I think just using below change for a quick fix is enough. It doesn't
have the issue cpumask_nth() has and very simple. For most of systems,
it only adds an extra cpu_possible(idex) checking.

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 633363997dec..59a8951cc6c0 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2542,7 +2542,10 @@ static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
 static struct xarray *
 addr_to_vb_xa(unsigned long addr)
 {
-	int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
+	int index = (addr / VMAP_BLOCK_SIZE) % nr_cpu_ids;
+
+	if (!cpu_possible(idex))
+		index = cpumask_next(index, cpu_possible_mask);
 
 	return &per_cpu(vmap_block_queue, index).vmap_blocks;
 }



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: PROBLEM: kernel crashes when running xfsdump since ~6.4
  2024-06-26 10:51                                     ` Baoquan He
@ 2024-06-26 10:53                                       ` Uladzislau Rezki
  2024-06-26 11:30                                       ` Hailong Liu
  1 sibling, 0 replies; 39+ messages in thread
From: Uladzislau Rezki @ 2024-06-26 10:53 UTC (permalink / raw)
  To: Baoquan He
  Cc: Hailong Liu, Uladzislau Rezki, Nick Bowler, linux-kernel,
	Linux regressions mailing list, linux-mm, sparclinux,
	Andrew Morton

On Wed, Jun 26, 2024 at 06:51:32PM +0800, Baoquan He wrote:
> On 06/26/24 at 06:03pm, Hailong Liu wrote:
> > On Wed, 26. Jun 11:15, Uladzislau Rezki wrote:
> > > On Wed, Jun 26, 2024 at 01:12:06PM +0800, Hailong Liu wrote:
> > > > On Tue, 25. Jun 22:05, Uladzislau Rezki wrote:
> > > > > > > > > > /**
> > > > > > > > > >  * cpumask_next - get the next cpu in a cpumask
> > > > > > > > > >  * @n: the cpu prior to the place to search (i.e. return will be > @n)
> > > > > > > > > >  * @srcp: the cpumask pointer
> > > > > > > > > >  *
> > > > > > > > > >  * Return: >= nr_cpu_ids if no further cpus set.
> > > > > > > > >
> > > > > > > > > Ah, I got what you mean. In the vbq case, it may not have chance to get
> > > > > > > > > a return number as nr_cpu_ids. Becuase the hashed index limits the
> > > > > > > > > range to [0, nr_cpu_ids-1], and cpu_possible(index) will guarantee it
> > > > > > > > > won't be the highest cpu number [nr_cpu_ids-1] since CPU[nr_cpu_ids-1] must
> > > > > > > > > be possible CPU.
> > > > > > > > >
> > > > > > > > > Do I miss some corner cases?
> > > > > > > > >
> > > > > > > > Right. We guarantee that a highest CPU is available by doing: % nr_cpu_ids.
> > > > > > > > So we do not need to use *next_wrap() variant. You do not miss anything :)
> > > > > > > >
> > > > > > > > Hailong Liu has proposed more simpler version:
> > > > > > > >
> > > > > > > > <snip>
> > > > > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > > > > > index 11fe5ea208aa..e1e63ffb9c57 100644
> > > > > > > > --- a/mm/vmalloc.c
> > > > > > > > +++ b/mm/vmalloc.c
> > > > > > > > @@ -1994,8 +1994,9 @@ static struct xarray *
> > > > > > > >  addr_to_vb_xa(unsigned long addr)
> > > > > > > >  {
> > > > > > > >         int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> > > > > > > > +       int cpu = cpumask_nth(index, cpu_possible_mask);
> > > > > > > >
> > > > > > > > -       return &per_cpu(vmap_block_queue, index).vmap_blocks;
> > > > > > > > +       return &per_cpu(vmap_block_queue, cpu).vmap_blocks;
> > > > > > > > <snip>
> > > > > > > >
> > > > > > > > which just takes a next CPU if an index is not set in the cpu_possible_mask.
> > > > > > > >
> > > > > > > > The only thing that can be updated in the patch is to replace num_possible_cpu()
> > > > > > > > by the nr_cpu_ids.
> > > > > > > >
> > > > > > > > Any thoughts? I think we need to fix it by a minor change so it is
> > > > > > > > easier to back-port on stable kernels.
> > > > > > >
> > > > > > > Yeah, sounds good since the regresson commit is merged in v6.3.
> > > > > > > Please feel free to post this and the hash array patch separately for
> > > > > > > formal reviewing.
> > > > > > >
> > > > > > Agreed! The patch about hash array i will post later.
> > > > > >
> > > > > > > By the way, when I am replying this mail, I check the cpumask_nth()
> > > > > > > again. I doubt it may take more checking then cpu_possible(), given most
> > > > > > > of systems don't have gaps in cpu_possible_mask. I could be dizzy at
> > > > > > > this moment.
> > > > > > >
> > > > > > > static inline unsigned int cpumask_nth(unsigned int cpu, const struct cpumask *srcp)
> > > > > > > {
> > > > > > >         return find_nth_bit(cpumask_bits(srcp), small_cpumask_bits, cpumask_check(cpu));
> > > > > > > }
> > > > > > >
> > > > > > Yep, i do not think it is a big problem based on your noted fact.
> > > > > >
> > > > > Checked. There is a difference:
> > > > >
> > > > > 1. Default
> > > > >
> > > > > <snip>
> > > > > ...
> > > > > +   15.95%     6.05%  [kernel]        [k] __vmap_pages_range_noflush
> > > > > +   15.91%     1.74%  [kernel]        [k] addr_to_vb_xa <---------------
> > > > > +   15.13%    12.05%  [kernel]        [k] vunmap_p4d_range
> > > > > +   14.17%    13.38%  [kernel]        [k] __find_nth_bit <--------------
> > > > > +   10.62%     0.00%  [kernel]        [k] ret_from_fork_asm
> > > > > +   10.62%     0.00%  [kernel]        [k] ret_from_fork
> > > > > +   10.62%     0.00%  [kernel]        [k] kthread
> > > > > ...
> > > > > <snip>
> > > > >
> > > > > 2. Check if cpu_possible() and then fallback to cpumask_nth() if not
> > > > >
> > > > > <snip>
> > > > > ...
> > > > > +    6.84%     0.29%  [kernel]          [k] alloc_vmap_area
> > > > > +    6.80%     6.70%  [kernel]          [k] native_queued_spin_lock_slowpath
> > > > > +    4.24%     0.09%  [kernel]          [k] free_vmap_block
> > > > > +    2.41%     2.38%  [kernel]          [k] addr_to_vb_xa <-----------
> > > > > +    1.94%     1.91%  [kernel]          [k] xas_start
> > > > > ...
> > > > > <snip>
> > > > >
> > > > > It is _worth_ to check if an index is in possible mask:
> > > > >
> > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > > index 45e1506d58c3..af20f78c2cbf 100644
> > > > > --- a/mm/vmalloc.c
> > > > > +++ b/mm/vmalloc.c
> > > > > @@ -2542,7 +2542,10 @@ static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
> > > > >  static struct xarray *
> > > > >  addr_to_vb_xa(unsigned long addr)
> > > > >  {
> > > > > -       int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> > > > > +       int index = (addr / VMAP_BLOCK_SIZE) % nr_cpu_ids;
> > > > IIUC, use nr_cpu_ids here maybe incorrect.
> > > >
> > > > take b101 as example, nr_cpu_ids is 3. if index is 2 cpumask_nth(2, cpu_possible_mask);
> > > > might return 64.
> > > >
> > > But then a CPU2 becomes possible? Cutting by % nr_cpu_ids generates values < nr_cpu_ids.
> > > So, last CPU is always possible and we never do cpumask_nth() on a last possible CPU.
> > >
> > > What i miss here?
> > >
> > Sorry, I forget to reply to all :), I write a demo to test as follows:
> > 
> > static int cpumask_init(void)
> > {
> >        struct cpumask mask;
> >        unsigned int cpu_id;
> >        cpumask_clear(&mask);
> > 
> >        cpumask_set_cpu(1, &mask);
> >        cpumask_set_cpu(3, &mask);
> >        cpumask_set_cpu(5, &mask);
> > 
> >        cpu_id = find_last_bit(cpumask_bits(&mask), NR_CPUS) + 1;
> >        pr_info("cpu_id:%d\n", cpu_id);
> > 
> >        for (; i < nr_cpu_ids; i++) {
> >                pr_info("%d: cpu_%d\n", i, cpumask_nth(i, &mask));
> >        }
> > 
> >        return 0;
> > }
> > 
> > [    1.337020][    T1] cpu_id:6
> > [    1.337338][    T1] 0: cpu_1
> > [    1.337558][    T1] 1: cpu_3
> > [    1.337751][    T1] 2: cpu_5
> > [    1.337960][    T1] 3: cpu_64
> > [    1.338183][    T1] 4: cpu_64
> > [    1.338387][    T1] 5: cpu_64
> > [    1.338594][    T1] 6: cpu_64
> > 
> > In summary, the nr_cpu_ids = last_bit + 1, and cpumask_nth() return the nth cpu_id.
> 
> I think just using below change for a quick fix is enough. It doesn't
> have the issue cpumask_nth() has and very simple. For most of systems,
> it only adds an extra cpu_possible(idex) checking.
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 633363997dec..59a8951cc6c0 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2542,7 +2542,10 @@ static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
>  static struct xarray *
>  addr_to_vb_xa(unsigned long addr)
>  {
> -	int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> +	int index = (addr / VMAP_BLOCK_SIZE) % nr_cpu_ids;
> +
> +	if (!cpu_possible(idex))
> +		index = cpumask_next(index, cpu_possible_mask);
>  
>  	return &per_cpu(vmap_block_queue, index).vmap_blocks;
>  }
> 
Just submitted the same :) I will send out the patch.

--
Uladzislau Rezki


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: PROBLEM: kernel crashes when running xfsdump since ~6.4
  2024-06-26 10:51                                     ` Baoquan He
  2024-06-26 10:53                                       ` Uladzislau Rezki
@ 2024-06-26 11:30                                       ` Hailong Liu
  2024-06-26 11:45                                         ` Uladzislau Rezki
  1 sibling, 1 reply; 39+ messages in thread
From: Hailong Liu @ 2024-06-26 11:30 UTC (permalink / raw)
  To: Baoquan He, Uladzislau Rezki
  Cc: Uladzislau Rezki, Nick Bowler, linux-kernel,
	Linux regressions mailing list, linux-mm, sparclinux,
	Andrew Morton

On Wed, 26. Jun 18:51, Baoquan He wrote:
> On 06/26/24 at 06:03pm, Hailong Liu wrote:
> > On Wed, 26. Jun 11:15, Uladzislau Rezki wrote:
> > > On Wed, Jun 26, 2024 at 01:12:06PM +0800, Hailong Liu wrote:
> > > > On Tue, 25. Jun 22:05, Uladzislau Rezki wrote:
> > > > > > > > > > /**
> > > > > > > > > >  * cpumask_next - get the next cpu in a cpumask
> > > > > > > > > >  * @n: the cpu prior to the place to search (i.e. return will be > @n)
> > > > > > > > > >  * @srcp: the cpumask pointer
> > > > > > > > > >  *
> > > > > > > > > >  * Return: >= nr_cpu_ids if no further cpus set.
> > > > > > > > >
> > > > > > > > > Ah, I got what you mean. In the vbq case, it may not have chance to get
> > > > > > > > > a return number as nr_cpu_ids. Becuase the hashed index limits the
> > > > > > > > > range to [0, nr_cpu_ids-1], and cpu_possible(index) will guarantee it
> > > > > > > > > won't be the highest cpu number [nr_cpu_ids-1] since CPU[nr_cpu_ids-1] must
> > > > > > > > > be possible CPU.
> > > > > > > > >
> > > > > > > > > Do I miss some corner cases?
> > > > > > > > >
> > > > > > > > Right. We guarantee that a highest CPU is available by doing: % nr_cpu_ids.
> > > > > > > > So we do not need to use *next_wrap() variant. You do not miss anything :)
> > > > > > > >
> > > > > > > > Hailong Liu has proposed more simpler version:
> > > > > > > >
> > > > > > > > <snip>
> > > > > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > > > > > index 11fe5ea208aa..e1e63ffb9c57 100644
> > > > > > > > --- a/mm/vmalloc.c
> > > > > > > > +++ b/mm/vmalloc.c
> > > > > > > > @@ -1994,8 +1994,9 @@ static struct xarray *
> > > > > > > >  addr_to_vb_xa(unsigned long addr)
> > > > > > > >  {
> > > > > > > >         int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> > > > > > > > +       int cpu = cpumask_nth(index, cpu_possible_mask);
> > > > > > > >
> > > > > > > > -       return &per_cpu(vmap_block_queue, index).vmap_blocks;
> > > > > > > > +       return &per_cpu(vmap_block_queue, cpu).vmap_blocks;
> > > > > > > > <snip>
> > > > > > > >
> > > > > > > > which just takes a next CPU if an index is not set in the cpu_possible_mask.
> > > > > > > >
> > > > > > > > The only thing that can be updated in the patch is to replace num_possible_cpu()
> > > > > > > > by the nr_cpu_ids.
> > > > > > > >
> > > > > > > > Any thoughts? I think we need to fix it by a minor change so it is
> > > > > > > > easier to back-port on stable kernels.
> > > > > > >
> > > > > > > Yeah, sounds good since the regresson commit is merged in v6.3.
> > > > > > > Please feel free to post this and the hash array patch separately for
> > > > > > > formal reviewing.
> > > > > > >
> > > > > > Agreed! The patch about hash array i will post later.
G> > > > > >
> > > > > > > By the way, when I am replying this mail, I check the cpumask_nth()
> > > > > > > again. I doubt it may take more checking then cpu_possible(), given most
> > > > > > > of systems don't have gaps in cpu_possible_mask. I could be dizzy at
> > > > > > > this moment.
> > > > > > >
> > > > > > > static inline unsigned int cpumask_nth(unsigned int cpu, const struct cpumask *srcp)
> > > > > > > {
> > > > > > >         return find_nth_bit(cpumask_bits(srcp), small_cpumask_bits, cpumask_check(cpu));
> > > > > > > }
> > > > > > >
> > > > > > Yep, i do not think it is a big problem based on your noted fact.
> > > > > >
> > > > > Checked. There is a difference:
> > > > >
> > > > > 1. Default
> > > > >
> > > > > <snip>
> > > > > ...
> > > > > +   15.95%     6.05%  [kernel]        [k] __vmap_pages_range_noflush
> > > > > +   15.91%     1.74%  [kernel]        [k] addr_to_vb_xa <---------------
> > > > > +   15.13%    12.05%  [kernel]        [k] vunmap_p4d_range
> > > > > +   14.17%    13.38%  [kernel]        [k] __find_nth_bit <--------------
> > > > > +   10.62%     0.00%  [kernel]        [k] ret_from_fork_asm
> > > > > +   10.62%     0.00%  [kernel]        [k] ret_from_fork
> > > > > +   10.62%     0.00%  [kernel]        [k] kthread
> > > > > ...
> > > > > <snip>
> > > > >
> > > > > 2. Check if cpu_possible() and then fallback to cpumask_nth() if not
> > > > >
> > > > > <snip>
> > > > > ...
> > > > > +    6.84%     0.29%  [kernel]          [k] alloc_vmap_area
> > > > > +    6.80%     6.70%  [kernel]          [k] native_queued_spin_lock_slowpath
> > > > > +    4.24%     0.09%  [kernel]          [k] free_vmap_block
> > > > > +    2.41%     2.38%  [kernel]          [k] addr_to_vb_xa <-----------
> > > > > +    1.94%     1.91%  [kernel]          [k] xas_start
> > > > > ...
> > > > > <snip>
> > > > >
> > > > > It is _worth_ to check if an index is in possible mask:
> > > > >
> > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > > index 45e1506d58c3..af20f78c2cbf 100644
> > > > > --- a/mm/vmalloc.c
> > > > > +++ b/mm/vmalloc.c
> > > > > @@ -2542,7 +2542,10 @@ static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
> > > > >  static struct xarray *
> > > > >  addr_to_vb_xa(unsigned long addr)
> > > > >  {
> > > > > -       int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> > > > > +       int index = (addr / VMAP_BLOCK_SIZE) % nr_cpu_ids;
> > > > IIUC, use nr_cpu_ids here maybe incorrect.
> > > >
> > > > take b101 as example, nr_cpu_ids is 3. if index is 2 cpumask_nth(2, cpu_possible_mask);
> > > > might return 64.
> > > >
> > > But then a CPU2 becomes possible? Cutting by % nr_cpu_ids generates values < nr_cpu_ids.
> > > So, last CPU is always possible and we never do cpumask_nth() on a last possible CPU.
> > >
> > > What i miss here?
> > >
> > Sorry, I forget to reply to all :), I write a demo to test as follows:
> >
> > static int cpumask_init(void)
> > {
> >        struct cpumask mask;
> >        unsigned int cpu_id;
> >        cpumask_clear(&mask);
> >
> >        cpumask_set_cpu(1, &mask);
> >        cpumask_set_cpu(3, &mask);
> >        cpumask_set_cpu(5, &mask);
> >
> >        cpu_id = find_last_bit(cpumask_bits(&mask), NR_CPUS) + 1;
> >        pr_info("cpu_id:%d\n", cpu_id);
> >
> >        for (; i < nr_cpu_ids; i++) {
> >                pr_info("%d: cpu_%d\n", i, cpumask_nth(i, &mask));
> >        }
> >
> >        return 0;
> > }
> >
> > [    1.337020][    T1] cpu_id:6
> > [    1.337338][    T1] 0: cpu_1
> > [    1.337558][    T1] 1: cpu_3
> > [    1.337751][    T1] 2: cpu_5
> > [    1.337960][    T1] 3: cpu_64
> > [    1.338183][    T1] 4: cpu_64
> > [    1.338387][    T1] 5: cpu_64
> > [    1.338594][    T1] 6: cpu_64
> >
> > In summary, the nr_cpu_ids = last_bit + 1, and cpumask_nth() return the nth cpu_id.
>
> I think just using below change for a quick fix is enough. It doesn't
> have the issue cpumask_nth() has and very simple. For most of systems,
> it only adds an extra cpu_possible(idex) checking.
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 633363997dec..59a8951cc6c0 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2542,7 +2542,10 @@ static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
>  static struct xarray *
>  addr_to_vb_xa(unsigned long addr)
>  {
> -	int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> +	int index = (addr / VMAP_BLOCK_SIZE) % nr_cpu_ids;
> +
> +	if (!cpu_possible(idex))
> +		index = cpumask_next(index, cpu_possible_mask);
>
>  	return &per_cpu(vmap_block_queue, index).vmap_blocks;
>  }
>
Agreed! This is a very simple solution.

If cpumask is b1000001, addresses being distributed across different
CPUs could theoretically lead to such a situation, but it has not been
encountered in practice. I’m just pointing out the possibility here.

  CPU_0  CPU_6  CPU_6  CPU_6  CPU_6  CPU_6
    |      |      |      |      |      |
    V      V      V      V      V      V
0     10     20     30     40     50     60
|------|------|------|------|------|------|..

Thanks again for your reply, I learned a lot.

--
help you, help me,
Hailong.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: PROBLEM: kernel crashes when running xfsdump since ~6.4
  2024-06-26 11:30                                       ` Hailong Liu
@ 2024-06-26 11:45                                         ` Uladzislau Rezki
  0 siblings, 0 replies; 39+ messages in thread
From: Uladzislau Rezki @ 2024-06-26 11:45 UTC (permalink / raw)
  To: Hailong Liu
  Cc: Baoquan He, Uladzislau Rezki, Nick Bowler, linux-kernel,
	Linux regressions mailing list, linux-mm, sparclinux,
	Andrew Morton

On Wed, Jun 26, 2024 at 07:30:39PM +0800, Hailong Liu wrote:
> On Wed, 26. Jun 18:51, Baoquan He wrote:
> > On 06/26/24 at 06:03pm, Hailong Liu wrote:
> > > On Wed, 26. Jun 11:15, Uladzislau Rezki wrote:
> > > > On Wed, Jun 26, 2024 at 01:12:06PM +0800, Hailong Liu wrote:
> > > > > On Tue, 25. Jun 22:05, Uladzislau Rezki wrote:
> > > > > > > > > > > /**
> > > > > > > > > > >  * cpumask_next - get the next cpu in a cpumask
> > > > > > > > > > >  * @n: the cpu prior to the place to search (i.e. return will be > @n)
> > > > > > > > > > >  * @srcp: the cpumask pointer
> > > > > > > > > > >  *
> > > > > > > > > > >  * Return: >= nr_cpu_ids if no further cpus set.
> > > > > > > > > >
> > > > > > > > > > Ah, I got what you mean. In the vbq case, it may not have chance to get
> > > > > > > > > > a return number as nr_cpu_ids. Becuase the hashed index limits the
> > > > > > > > > > range to [0, nr_cpu_ids-1], and cpu_possible(index) will guarantee it
> > > > > > > > > > won't be the highest cpu number [nr_cpu_ids-1] since CPU[nr_cpu_ids-1] must
> > > > > > > > > > be possible CPU.
> > > > > > > > > >
> > > > > > > > > > Do I miss some corner cases?
> > > > > > > > > >
> > > > > > > > > Right. We guarantee that a highest CPU is available by doing: % nr_cpu_ids.
> > > > > > > > > So we do not need to use *next_wrap() variant. You do not miss anything :)
> > > > > > > > >
> > > > > > > > > Hailong Liu has proposed more simpler version:
> > > > > > > > >
> > > > > > > > > <snip>
> > > > > > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > > > > > > index 11fe5ea208aa..e1e63ffb9c57 100644
> > > > > > > > > --- a/mm/vmalloc.c
> > > > > > > > > +++ b/mm/vmalloc.c
> > > > > > > > > @@ -1994,8 +1994,9 @@ static struct xarray *
> > > > > > > > >  addr_to_vb_xa(unsigned long addr)
> > > > > > > > >  {
> > > > > > > > >         int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> > > > > > > > > +       int cpu = cpumask_nth(index, cpu_possible_mask);
> > > > > > > > >
> > > > > > > > > -       return &per_cpu(vmap_block_queue, index).vmap_blocks;
> > > > > > > > > +       return &per_cpu(vmap_block_queue, cpu).vmap_blocks;
> > > > > > > > > <snip>
> > > > > > > > >
> > > > > > > > > which just takes a next CPU if an index is not set in the cpu_possible_mask.
> > > > > > > > >
> > > > > > > > > The only thing that can be updated in the patch is to replace num_possible_cpu()
> > > > > > > > > by the nr_cpu_ids.
> > > > > > > > >
> > > > > > > > > Any thoughts? I think we need to fix it by a minor change so it is
> > > > > > > > > easier to back-port on stable kernels.
> > > > > > > >
> > > > > > > > Yeah, sounds good since the regresson commit is merged in v6.3.
> > > > > > > > Please feel free to post this and the hash array patch separately for
> > > > > > > > formal reviewing.
> > > > > > > >
> > > > > > > Agreed! The patch about hash array i will post later.
> G> > > > > >
> > > > > > > > By the way, when I am replying this mail, I check the cpumask_nth()
> > > > > > > > again. I doubt it may take more checking then cpu_possible(), given most
> > > > > > > > of systems don't have gaps in cpu_possible_mask. I could be dizzy at
> > > > > > > > this moment.
> > > > > > > >
> > > > > > > > static inline unsigned int cpumask_nth(unsigned int cpu, const struct cpumask *srcp)
> > > > > > > > {
> > > > > > > >         return find_nth_bit(cpumask_bits(srcp), small_cpumask_bits, cpumask_check(cpu));
> > > > > > > > }
> > > > > > > >
> > > > > > > Yep, i do not think it is a big problem based on your noted fact.
> > > > > > >
> > > > > > Checked. There is a difference:
> > > > > >
> > > > > > 1. Default
> > > > > >
> > > > > > <snip>
> > > > > > ...
> > > > > > +   15.95%     6.05%  [kernel]        [k] __vmap_pages_range_noflush
> > > > > > +   15.91%     1.74%  [kernel]        [k] addr_to_vb_xa <---------------
> > > > > > +   15.13%    12.05%  [kernel]        [k] vunmap_p4d_range
> > > > > > +   14.17%    13.38%  [kernel]        [k] __find_nth_bit <--------------
> > > > > > +   10.62%     0.00%  [kernel]        [k] ret_from_fork_asm
> > > > > > +   10.62%     0.00%  [kernel]        [k] ret_from_fork
> > > > > > +   10.62%     0.00%  [kernel]        [k] kthread
> > > > > > ...
> > > > > > <snip>
> > > > > >
> > > > > > 2. Check if cpu_possible() and then fallback to cpumask_nth() if not
> > > > > >
> > > > > > <snip>
> > > > > > ...
> > > > > > +    6.84%     0.29%  [kernel]          [k] alloc_vmap_area
> > > > > > +    6.80%     6.70%  [kernel]          [k] native_queued_spin_lock_slowpath
> > > > > > +    4.24%     0.09%  [kernel]          [k] free_vmap_block
> > > > > > +    2.41%     2.38%  [kernel]          [k] addr_to_vb_xa <-----------
> > > > > > +    1.94%     1.91%  [kernel]          [k] xas_start
> > > > > > ...
> > > > > > <snip>
> > > > > >
> > > > > > It is _worth_ to check if an index is in possible mask:
> > > > > >
> > > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > > > index 45e1506d58c3..af20f78c2cbf 100644
> > > > > > --- a/mm/vmalloc.c
> > > > > > +++ b/mm/vmalloc.c
> > > > > > @@ -2542,7 +2542,10 @@ static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
> > > > > >  static struct xarray *
> > > > > >  addr_to_vb_xa(unsigned long addr)
> > > > > >  {
> > > > > > -       int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> > > > > > +       int index = (addr / VMAP_BLOCK_SIZE) % nr_cpu_ids;
> > > > > IIUC, use nr_cpu_ids here maybe incorrect.
> > > > >
> > > > > take b101 as example, nr_cpu_ids is 3. if index is 2 cpumask_nth(2, cpu_possible_mask);
> > > > > might return 64.
> > > > >
> > > > But then a CPU2 becomes possible? Cutting by % nr_cpu_ids generates values < nr_cpu_ids.
> > > > So, last CPU is always possible and we never do cpumask_nth() on a last possible CPU.
> > > >
> > > > What i miss here?
> > > >
> > > Sorry, I forget to reply to all :), I write a demo to test as follows:
> > >
> > > static int cpumask_init(void)
> > > {
> > >        struct cpumask mask;
> > >        unsigned int cpu_id;
> > >        cpumask_clear(&mask);
> > >
> > >        cpumask_set_cpu(1, &mask);
> > >        cpumask_set_cpu(3, &mask);
> > >        cpumask_set_cpu(5, &mask);
> > >
> > >        cpu_id = find_last_bit(cpumask_bits(&mask), NR_CPUS) + 1;
> > >        pr_info("cpu_id:%d\n", cpu_id);
> > >
> > >        for (; i < nr_cpu_ids; i++) {
> > >                pr_info("%d: cpu_%d\n", i, cpumask_nth(i, &mask));
> > >        }
> > >
> > >        return 0;
> > > }
> > >
> > > [    1.337020][    T1] cpu_id:6
> > > [    1.337338][    T1] 0: cpu_1
> > > [    1.337558][    T1] 1: cpu_3
> > > [    1.337751][    T1] 2: cpu_5
> > > [    1.337960][    T1] 3: cpu_64
> > > [    1.338183][    T1] 4: cpu_64
> > > [    1.338387][    T1] 5: cpu_64
> > > [    1.338594][    T1] 6: cpu_64
> > >
> > > In summary, the nr_cpu_ids = last_bit + 1, and cpumask_nth() return the nth cpu_id.
> >
> > I think just using below change for a quick fix is enough. It doesn't
> > have the issue cpumask_nth() has and very simple. For most of systems,
> > it only adds an extra cpu_possible(idex) checking.
> >
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 633363997dec..59a8951cc6c0 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -2542,7 +2542,10 @@ static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
> >  static struct xarray *
> >  addr_to_vb_xa(unsigned long addr)
> >  {
> > -	int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> > +	int index = (addr / VMAP_BLOCK_SIZE) % nr_cpu_ids;
> > +
> > +	if (!cpu_possible(idex))
> > +		index = cpumask_next(index, cpu_possible_mask);
> >
> >  	return &per_cpu(vmap_block_queue, index).vmap_blocks;
> >  }
> >
> Agreed! This is a very simple solution.
> 
> If cpumask is b1000001, addresses being distributed across different
> CPUs could theoretically lead to such a situation, but it has not been
> encountered in practice. I’m just pointing out the possibility here.
> 
>   CPU_0  CPU_6  CPU_6  CPU_6  CPU_6  CPU_6
>     |      |      |      |      |      |
>     V      V      V      V      V      V
> 0     10     20     30     40     50     60
> |------|------|------|------|------|------|..
> 
Right!

>
> Thanks again for your reply, I learned a lot.
> 
Thank you for helping :)

--
Uladzislau Rezki


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: PROBLEM: kernel crashes when running xfsdump since ~6.4
  2024-06-26 10:03                                   ` Hailong Liu
  2024-06-26 10:51                                     ` Baoquan He
@ 2024-06-26 10:51                                     ` Uladzislau Rezki
  2024-06-26 13:34                                       ` Nick Bowler
  1 sibling, 1 reply; 39+ messages in thread
From: Uladzislau Rezki @ 2024-06-26 10:51 UTC (permalink / raw)
  To: Hailong Liu
  Cc: Uladzislau Rezki, Baoquan He, Nick Bowler, linux-kernel,
	Linux regressions mailing list, linux-mm, sparclinux,
	Andrew Morton

On Wed, Jun 26, 2024 at 06:03:42PM +0800, Hailong Liu wrote:
> On Wed, 26. Jun 11:15, Uladzislau Rezki wrote:
> > On Wed, Jun 26, 2024 at 01:12:06PM +0800, Hailong Liu wrote:
> > > On Tue, 25. Jun 22:05, Uladzislau Rezki wrote:
> > > > > > > > > /**
> > > > > > > > >  * cpumask_next - get the next cpu in a cpumask
> > > > > > > > >  * @n: the cpu prior to the place to search (i.e. return will be > @n)
> > > > > > > > >  * @srcp: the cpumask pointer
> > > > > > > > >  *
> > > > > > > > >  * Return: >= nr_cpu_ids if no further cpus set.
> > > > > > > >
> > > > > > > > Ah, I got what you mean. In the vbq case, it may not have chance to get
> > > > > > > > a return number as nr_cpu_ids. Becuase the hashed index limits the
> > > > > > > > range to [0, nr_cpu_ids-1], and cpu_possible(index) will guarantee it
> > > > > > > > won't be the highest cpu number [nr_cpu_ids-1] since CPU[nr_cpu_ids-1] must
> > > > > > > > be possible CPU.
> > > > > > > >
> > > > > > > > Do I miss some corner cases?
> > > > > > > >
> > > > > > > Right. We guarantee that a highest CPU is available by doing: % nr_cpu_ids.
> > > > > > > So we do not need to use *next_wrap() variant. You do not miss anything :)
> > > > > > >
> > > > > > > Hailong Liu has proposed more simpler version:
> > > > > > >
> > > > > > > <snip>
> > > > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > > > > index 11fe5ea208aa..e1e63ffb9c57 100644
> > > > > > > --- a/mm/vmalloc.c
> > > > > > > +++ b/mm/vmalloc.c
> > > > > > > @@ -1994,8 +1994,9 @@ static struct xarray *
> > > > > > >  addr_to_vb_xa(unsigned long addr)
> > > > > > >  {
> > > > > > >         int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> > > > > > > +       int cpu = cpumask_nth(index, cpu_possible_mask);
> > > > > > >
> > > > > > > -       return &per_cpu(vmap_block_queue, index).vmap_blocks;
> > > > > > > +       return &per_cpu(vmap_block_queue, cpu).vmap_blocks;
> > > > > > > <snip>
> > > > > > >
> > > > > > > which just takes a next CPU if an index is not set in the cpu_possible_mask.
> > > > > > >
> > > > > > > The only thing that can be updated in the patch is to replace num_possible_cpu()
> > > > > > > by the nr_cpu_ids.
> > > > > > >
> > > > > > > Any thoughts? I think we need to fix it by a minor change so it is
> > > > > > > easier to back-port on stable kernels.
> > > > > >
> > > > > > Yeah, sounds good since the regresson commit is merged in v6.3.
> > > > > > Please feel free to post this and the hash array patch separately for
> > > > > > formal reviewing.
> > > > > >
> > > > > Agreed! The patch about hash array i will post later.
> > > > >
> > > > > > By the way, when I am replying this mail, I check the cpumask_nth()
> > > > > > again. I doubt it may take more checking then cpu_possible(), given most
> > > > > > of systems don't have gaps in cpu_possible_mask. I could be dizzy at
> > > > > > this moment.
> > > > > >
> > > > > > static inline unsigned int cpumask_nth(unsigned int cpu, const struct cpumask *srcp)
> > > > > > {
> > > > > >         return find_nth_bit(cpumask_bits(srcp), small_cpumask_bits, cpumask_check(cpu));
> > > > > > }
> > > > > >
> > > > > Yep, i do not think it is a big problem based on your noted fact.
> > > > >
> > > > Checked. There is a difference:
> > > >
> > > > 1. Default
> > > >
> > > > <snip>
> > > > ...
> > > > +   15.95%     6.05%  [kernel]        [k] __vmap_pages_range_noflush
> > > > +   15.91%     1.74%  [kernel]        [k] addr_to_vb_xa <---------------
> > > > +   15.13%    12.05%  [kernel]        [k] vunmap_p4d_range
> > > > +   14.17%    13.38%  [kernel]        [k] __find_nth_bit <--------------
> > > > +   10.62%     0.00%  [kernel]        [k] ret_from_fork_asm
> > > > +   10.62%     0.00%  [kernel]        [k] ret_from_fork
> > > > +   10.62%     0.00%  [kernel]        [k] kthread
> > > > ...
> > > > <snip>
> > > >
> > > > 2. Check if cpu_possible() and then fallback to cpumask_nth() if not
> > > >
> > > > <snip>
> > > > ...
> > > > +    6.84%     0.29%  [kernel]          [k] alloc_vmap_area
> > > > +    6.80%     6.70%  [kernel]          [k] native_queued_spin_lock_slowpath
> > > > +    4.24%     0.09%  [kernel]          [k] free_vmap_block
> > > > +    2.41%     2.38%  [kernel]          [k] addr_to_vb_xa <-----------
> > > > +    1.94%     1.91%  [kernel]          [k] xas_start
> > > > ...
> > > > <snip>
> > > >
> > > > It is _worth_ to check if an index is in possible mask:
> > > >
> > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > index 45e1506d58c3..af20f78c2cbf 100644
> > > > --- a/mm/vmalloc.c
> > > > +++ b/mm/vmalloc.c
> > > > @@ -2542,7 +2542,10 @@ static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
> > > >  static struct xarray *
> > > >  addr_to_vb_xa(unsigned long addr)
> > > >  {
> > > > -       int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> > > > +       int index = (addr / VMAP_BLOCK_SIZE) % nr_cpu_ids;
> > > IIUC, use nr_cpu_ids here maybe incorrect.
> > >
> > > take b101 as example, nr_cpu_ids is 3. if index is 2 cpumask_nth(2, cpu_possible_mask);
> > > might return 64.
> > >
> > But then a CPU2 becomes possible? Cutting by % nr_cpu_ids generates values < nr_cpu_ids.
> > So, last CPU is always possible and we never do cpumask_nth() on a last possible CPU.
> >
> > What i miss here?
> >
> Sorry, I forget to reply to all :), I write a demo to test as follows:
> 
> static int cpumask_init(void)
> {
>        struct cpumask mask;
>        unsigned int cpu_id;
>        cpumask_clear(&mask);
> 
>        cpumask_set_cpu(1, &mask);
>        cpumask_set_cpu(3, &mask);
>        cpumask_set_cpu(5, &mask);
> 
>        cpu_id = find_last_bit(cpumask_bits(&mask), NR_CPUS) + 1;
>        pr_info("cpu_id:%d\n", cpu_id);
> 
>        for (; i < nr_cpu_ids; i++) {
>                pr_info("%d: cpu_%d\n", i, cpumask_nth(i, &mask));
>        }
> 
>        return 0;
> }
> 
> [    1.337020][    T1] cpu_id:6
> [    1.337338][    T1] 0: cpu_1
> [    1.337558][    T1] 1: cpu_3
> [    1.337751][    T1] 2: cpu_5
> [    1.337960][    T1] 3: cpu_64
> [    1.338183][    T1] 4: cpu_64
> [    1.338387][    T1] 5: cpu_64
> [    1.338594][    T1] 6: cpu_64
> 
> In summary, the nr_cpu_ids = last_bit + 1, and cpumask_nth() return the nth cpu_id.
> 
OK, i misread the cpumask_nth(). We should go with *_next() variant instead.

Thank you for pointing this. Below is updated version with extra comment:

<snip>
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 45e1506d58c3..03b82fb8ecd3 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2542,7 +2542,15 @@ static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
 static struct xarray *
 addr_to_vb_xa(unsigned long addr)
 {
-       int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
+       int index = (addr / VMAP_BLOCK_SIZE) % nr_cpu_ids;
+
+       /*
+        * Please note, nr_cpu_ids points on a highest set
+        * possible bit, i.e. we never invoke cpumask_next()
+        * if an index points on it which is nr_cpu_ids - 1.
+        */
+       if (!cpu_possible(index))
+               index = cpumask_next(index, cpu_possible_mask);
 
        return &per_cpu(vmap_block_queue, index).vmap_blocks;
 }
<snip>

Thanks!

--
Uladzislau Rezki


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: PROBLEM: kernel crashes when running xfsdump since ~6.4
  2024-06-26 10:51                                     ` Uladzislau Rezki
@ 2024-06-26 13:34                                       ` Nick Bowler
  2024-06-26 13:38                                         ` Uladzislau Rezki
  0 siblings, 1 reply; 39+ messages in thread
From: Nick Bowler @ 2024-06-26 13:34 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Baoquan He, linux-kernel, Linux regressions mailing list,
	linux-mm, sparclinux, Andrew Morton, Hailong Liu

On 2024-06-26 06:51, Uladzislau Rezki wrote:
[...]
> Thank you for pointing this. Below is updated version with extra comment:

I checked that I can still reproduce the problem on 6.10-rc5 and with
this patch applied on top, xfsdump doesn't crash anymore.

> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 45e1506d58c3..03b82fb8ecd3 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2542,7 +2542,15 @@ static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
>  static struct xarray *
>  addr_to_vb_xa(unsigned long addr)
>  {
> -	int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> +	int index = (addr / VMAP_BLOCK_SIZE) % nr_cpu_ids;
> +
> +	/*
> +	 * Please note, nr_cpu_ids points on a highest set
> +	 * possible bit, i.e. we never invoke cpumask_next()
> +	 * if an index points on it which is nr_cpu_ids - 1.
> +	 */
> +	if (!cpu_possible(index))
> +		index = cpumask_next(index, cpu_possible_mask);
>  
>  	return &per_cpu(vmap_block_queue, index).vmap_blocks;
>  }

Thanks,
  Nick


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: PROBLEM: kernel crashes when running xfsdump since ~6.4
  2024-06-26 13:34                                       ` Nick Bowler
@ 2024-06-26 13:38                                         ` Uladzislau Rezki
  0 siblings, 0 replies; 39+ messages in thread
From: Uladzislau Rezki @ 2024-06-26 13:38 UTC (permalink / raw)
  To: Nick Bowler
  Cc: Uladzislau Rezki, Baoquan He, linux-kernel,
	Linux regressions mailing list, linux-mm, sparclinux,
	Andrew Morton, Hailong Liu

> On 2024-06-26 06:51, Uladzislau Rezki wrote:
> [...]
> > Thank you for pointing this. Below is updated version with extra comment:
> 
> I checked that I can still reproduce the problem on 6.10-rc5 and with
> this patch applied on top, xfsdump doesn't crash anymore.
> 
Thank you for helping and testing. I appreciate it!

--
Uladzislau Rezki


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: PROBLEM: kernel crashes when running xfsdump since ~6.4
  2024-06-25  3:30             ` Baoquan He
  2024-06-25 10:32               ` Uladzislau Rezki
@ 2024-06-25 11:19               ` Hailong Liu
  2024-06-25 12:41               ` Uladzislau Rezki
  2 siblings, 0 replies; 39+ messages in thread
From: Hailong Liu @ 2024-06-25 11:19 UTC (permalink / raw)
  To: Baoquan He, Uladzislau Rezki
  Cc: Nick Bowler, linux-kernel, Linux regressions mailing list,
	linux-mm, sparclinux, Andrew Morton

On Tue, 25. Jun 11:30, Baoquan He wrote:
> On 06/24/24 at 02:16pm, Uladzislau Rezki wrote:
> > On Fri, Jun 21, 2024 at 10:02:50PM +0800, Baoquan He wrote:
> > > On 06/21/24 at 11:44am, Uladzislau Rezki wrote:
> > > > On Fri, Jun 21, 2024 at 03:07:16PM +0800, Baoquan He wrote:
> > > > > On 06/21/24 at 11:30am, Hailong Liu wrote:
> > > > > > On Thu, 20. Jun 14:02, Nick Bowler wrote:
> > > > > > > On 2024-06-20 02:19, Nick Bowler wrote:
> > > ......
> > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > > index be2dd281ea76..18e87cafbaf2 100644
> > > > > --- a/mm/vmalloc.c
> > > > > +++ b/mm/vmalloc.c
> > > > > @@ -2542,7 +2542,7 @@ static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
> > > > >  static struct xarray *
> > > > >  addr_to_vb_xa(unsigned long addr)
> > > > >  {
> > > > > -	int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> > > > > +	int index = (addr / VMAP_BLOCK_SIZE) % nr_cpu_ids;
> > > > >
> > > > >  	return &per_cpu(vmap_block_queue, index).vmap_blocks;
> > > > >  }
> > > > >
> > > > The problem i see is about not-initializing of the:
> > > > <snip>
> > > > 	for_each_possible_cpu(i) {
> > > > 		struct vmap_block_queue *vbq;
> > > > 		struct vfree_deferred *p;
> > > >
> > > > 		vbq = &per_cpu(vmap_block_queue, i);
> > > > 		spin_lock_init(&vbq->lock);
> > > > 		INIT_LIST_HEAD(&vbq->free);
> > > > 		p = &per_cpu(vfree_deferred, i);
> > > > 		init_llist_head(&p->list);
> > > > 		INIT_WORK(&p->wq, delayed_vfree_work);
> > > > 		xa_init(&vbq->vmap_blocks);
> > > > 	}
> > > > <snip>
> > > >
> > > > correctly or fully. It is my bad i did not think that CPUs in a possible mask
> > > > can be non sequential :-/
> > > >
> > > > nr_cpu_ids - is not the max possible CPU. For example, in Nick case,
> > > > when he has two CPUs, num_possible_cpus() and nr_cpu_ids are the same.
> > >
> > > I checked the generic version of setup_nr_cpu_ids(), from codes, they
> > > are different with my understanding.
> > >
> > > kernel/smp.c
> > > void __init setup_nr_cpu_ids(void)
> > > {
> > >         set_nr_cpu_ids(find_last_bit(cpumask_bits(cpu_possible_mask), NR_CPUS) + 1);
> > > }
> > >
> > I see that it is not a weak function, so it is generic, thus the
> > behavior can not be overwritten, which is great. This does what we
> > need.
> >
> > Thank you for checking this you are right!
>
> Thanks for confirming this.
>
> >
> > Then it is just a matter of proper initialization of the hash:
> >
> > <snip>
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 5d3aa2dc88a8..1733946f7a12 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -5087,7 +5087,13 @@ void __init vmalloc_init(void)
> >          */
> >         vmap_area_cachep = KMEM_CACHE(vmap_area, SLAB_PANIC);
> >
> > -       for_each_possible_cpu(i) {
> > +       /*
> > +        * We use "nr_cpu_ids" here because some architectures
> > +        * may have "gaps" in cpu-possible-mask. It is OK for
> > +        * per-cpu approaches but is not OK for cases where it
> > +        * can be used as hashes also.
> > +        */
> > +       for (i = 0; i < nr_cpu_ids; i++) {
>
> I was wrong about earlier comments. Percpu variables are only available
> on possible CPUs. For those nonexistent possible CPUs of static percpu
> variable vmap_block_queue, there isn't memory allocated and mapped for
> them. So accessing into them will cause problem.
Thansk for pointing out. Indeed, I have very limited knowledge about
CPU-related :)
>
> In Nick's case, there are only CPU0, CPU2. If you access
> &per_cpu(vmap_block_queue, 1), problem occurs. So I think we may need to
> change to take other way for vbq. E.g:
> 1) Storing the vb in the nearest neighbouring vbq on possible CPU as
>    below draft patch;
> 2) create an normal array to store vbq of size nr_cpu_ids, then we can
>    store/fetch each vbq on non-possible CPU?
>
> The way 1) is simpler, the existing code can be adapted a little just as
> below.
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 633363997dec..59a8951cc6c0 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2542,7 +2542,10 @@ static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
>  static struct xarray *
>  addr_to_vb_xa(unsigned long addr)
>  {
> -	int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> +	int index = (addr / VMAP_BLOCK_SIZE) % nr_cpu_ids;
> +
> +	if (!cpu_possible(idex))
> +		index = cpumask_next(index, cpu_possible_mask);

If cpumask is b1001, more addr would located in cpu3.
>
>  	return &per_cpu(vmap_block_queue, index).vmap_blocks;
>  }
> @@ -2556,9 +2559,15 @@ addr_to_vb_xa(unsigned long addr)
>
>  static unsigned long addr_to_vb_idx(unsigned long addr)
>  {
> +	int id = (addr / VMAP_BLOCK_SIZE) % nr_cpu_ids;
> +	int id_dest = id;
> +
> +	if (!cpu_possible(id))
> +		id_dest = cpumask_next(id, cpu_possible_mask);
> +
>  	addr -= VMALLOC_START & ~(VMAP_BLOCK_SIZE-1);
>  	addr /= VMAP_BLOCK_SIZE;
> -	return addr;
> +	return addr + (id_dest - id);
>  }

How about using cpumask_nth to get cpu id ? It looks simpler and more
straightforward, it may waste some cputimes :)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 11fe5ea208aa..e1e63ffb9c57 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1994,8 +1994,9 @@ static struct xarray *
 addr_to_vb_xa(unsigned long addr)
 {
        int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
+       int cpu = cpumask_nth(index, cpu_possible_mask);

-       return &per_cpu(vmap_block_queue, index).vmap_blocks;
+       return &per_cpu(vmap_block_queue, cpu).vmap_blocks;
 }

>

--
help you, help me,
Hailong.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: PROBLEM: kernel crashes when running xfsdump since ~6.4
  2024-06-25  3:30             ` Baoquan He
  2024-06-25 10:32               ` Uladzislau Rezki
  2024-06-25 11:19               ` Hailong Liu
@ 2024-06-25 12:41               ` Uladzislau Rezki
  2 siblings, 0 replies; 39+ messages in thread
From: Uladzislau Rezki @ 2024-06-25 12:41 UTC (permalink / raw)
  To: Baoquan He
  Cc: Uladzislau Rezki, Nick Bowler, Hailong Liu, linux-kernel,
	Linux regressions mailing list, linux-mm, sparclinux,
	Andrew Morton

On Tue, Jun 25, 2024 at 11:30:33AM +0800, Baoquan He wrote:
> On 06/24/24 at 02:16pm, Uladzislau Rezki wrote:
> > On Fri, Jun 21, 2024 at 10:02:50PM +0800, Baoquan He wrote:
> > > On 06/21/24 at 11:44am, Uladzislau Rezki wrote:
> > > > On Fri, Jun 21, 2024 at 03:07:16PM +0800, Baoquan He wrote:
> > > > > On 06/21/24 at 11:30am, Hailong Liu wrote:
> > > > > > On Thu, 20. Jun 14:02, Nick Bowler wrote:
> > > > > > > On 2024-06-20 02:19, Nick Bowler wrote:
> > > ......
> > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > > index be2dd281ea76..18e87cafbaf2 100644
> > > > > --- a/mm/vmalloc.c
> > > > > +++ b/mm/vmalloc.c
> > > > > @@ -2542,7 +2542,7 @@ static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
> > > > >  static struct xarray *
> > > > >  addr_to_vb_xa(unsigned long addr)
> > > > >  {
> > > > > -	int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> > > > > +	int index = (addr / VMAP_BLOCK_SIZE) % nr_cpu_ids;
> > > > >  
> > > > >  	return &per_cpu(vmap_block_queue, index).vmap_blocks;
> > > > >  }
> > > > > 
> > > > The problem i see is about not-initializing of the:
> > > > <snip>
> > > > 	for_each_possible_cpu(i) {
> > > > 		struct vmap_block_queue *vbq;
> > > > 		struct vfree_deferred *p;
> > > > 
> > > > 		vbq = &per_cpu(vmap_block_queue, i);
> > > > 		spin_lock_init(&vbq->lock);
> > > > 		INIT_LIST_HEAD(&vbq->free);
> > > > 		p = &per_cpu(vfree_deferred, i);
> > > > 		init_llist_head(&p->list);
> > > > 		INIT_WORK(&p->wq, delayed_vfree_work);
> > > > 		xa_init(&vbq->vmap_blocks);
> > > > 	}
> > > > <snip>
> > > > 
> > > > correctly or fully. It is my bad i did not think that CPUs in a possible mask
> > > > can be non sequential :-/
> > > > 
> > > > nr_cpu_ids - is not the max possible CPU. For example, in Nick case,
> > > > when he has two CPUs, num_possible_cpus() and nr_cpu_ids are the same.
> > > 
> > > I checked the generic version of setup_nr_cpu_ids(), from codes, they
> > > are different with my understanding.
> > > 
> > > kernel/smp.c
> > > void __init setup_nr_cpu_ids(void)
> > > {
> > >         set_nr_cpu_ids(find_last_bit(cpumask_bits(cpu_possible_mask), NR_CPUS) + 1);
> > > }
> > > 
> > I see that it is not a weak function, so it is generic, thus the
> > behavior can not be overwritten, which is great. This does what we
> > need.
> > 
> > Thank you for checking this you are right!
> 
> Thanks for confirming this.
> 
> > 
> > Then it is just a matter of proper initialization of the hash:
> > 
> > <snip>
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 5d3aa2dc88a8..1733946f7a12 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -5087,7 +5087,13 @@ void __init vmalloc_init(void)
> >          */
> >         vmap_area_cachep = KMEM_CACHE(vmap_area, SLAB_PANIC);
> >  
> > -       for_each_possible_cpu(i) {
> > +       /*
> > +        * We use "nr_cpu_ids" here because some architectures
> > +        * may have "gaps" in cpu-possible-mask. It is OK for
> > +        * per-cpu approaches but is not OK for cases where it
> > +        * can be used as hashes also.
> > +        */
> > +       for (i = 0; i < nr_cpu_ids; i++) {
> 
> I was wrong about earlier comments. Percpu variables are only available
> on possible CPUs. For those nonexistent possible CPUs of static percpu
> variable vmap_block_queue, there isn't memory allocated and mapped for
> them. So accessing into them will cause problem.
> 
> In Nick's case, there are only CPU0, CPU2. If you access
> &per_cpu(vmap_block_queue, 1), problem occurs. So I think we may need to
> change to take other way for vbq. E.g:
> 1) Storing the vb in the nearest neighbouring vbq on possible CPU as
>    below draft patch;
> 2) create an normal array to store vbq of size nr_cpu_ids, then we can
>    store/fetch each vbq on non-possible CPU?
> 
See below how the patch look like if we switch to hash array:

<snip>
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 45e1506d58c3..a8bcd9ceec2d 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2501,7 +2501,8 @@ struct vmap_block {
 };
 
 /* Queue of free and dirty vmap blocks, for allocation and flushing purposes */
-static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
+static struct vmap_block_queue *vmap_block_queue;
+static bool vmap_block_queue_initialized;
 
 /*
  * In order to fast access to any "vmap_block" associated with a
@@ -2542,9 +2543,9 @@ static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
 static struct xarray *
 addr_to_vb_xa(unsigned long addr)
 {
-	int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
+	int index = (addr / VMAP_BLOCK_SIZE) % nr_cpu_ids;
 
-	return &per_cpu(vmap_block_queue, index).vmap_blocks;
+	return &vmap_block_queue[index].vmap_blocks;
 }
 
 /*
@@ -2626,7 +2627,7 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)
 		return ERR_PTR(err);
 	}
 
-	vbq = raw_cpu_ptr(&vmap_block_queue);
+	vbq = &vmap_block_queue[raw_smp_processor_id()];
 	spin_lock(&vbq->lock);
 	list_add_tail_rcu(&vb->free_list, &vbq->free);
 	spin_unlock(&vbq->lock);
@@ -2657,6 +2658,9 @@ static bool purge_fragmented_block(struct vmap_block *vb,
 		struct vmap_block_queue *vbq, struct list_head *purge_list,
 		bool force_purge)
 {
+	if (!vmap_block_queue_initialized)
+		return false;
+
 	if (vb->free + vb->dirty != VMAP_BBMAP_BITS ||
 	    vb->dirty == VMAP_BBMAP_BITS)
 		return false;
@@ -2692,7 +2696,12 @@ static void purge_fragmented_blocks(int cpu)
 {
 	LIST_HEAD(purge);
 	struct vmap_block *vb;
-	struct vmap_block_queue *vbq = &per_cpu(vmap_block_queue, cpu);
+	struct vmap_block_queue *vbq;
+
+	if (!vmap_block_queue_initialized)
+		return;
+
+	vbq = &vmap_block_queue[cpu];
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(vb, &vbq->free, free_list) {
@@ -2715,7 +2724,7 @@ static void purge_fragmented_blocks_allcpus(void)
 {
 	int cpu;
 
-	for_each_possible_cpu(cpu)
+	for (cpu = 0; cpu < nr_cpu_ids; cpu++)
 		purge_fragmented_blocks(cpu);
 }
 
@@ -2739,7 +2748,7 @@ static void *vb_alloc(unsigned long size, gfp_t gfp_mask)
 	order = get_order(size);
 
 	rcu_read_lock();
-	vbq = raw_cpu_ptr(&vmap_block_queue);
+	vbq = &vmap_block_queue[raw_smp_processor_id()];
 	list_for_each_entry_rcu(vb, &vbq->free, free_list) {
 		unsigned long pages_off;
 
@@ -2822,13 +2831,13 @@ static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush)
 	LIST_HEAD(purge_list);
 	int cpu;
 
-	if (unlikely(!vmap_initialized))
+	if (unlikely(!vmap_initialized || !vmap_block_queue_initialized))
 		return;
 
 	mutex_lock(&vmap_purge_lock);
 
-	for_each_possible_cpu(cpu) {
-		struct vmap_block_queue *vbq = &per_cpu(vmap_block_queue, cpu);
+	for (cpu = 0; cpu < nr_cpu_ids; cpu++) {
+		struct vmap_block_queue *vbq = &vmap_block_queue[cpu];
 		struct vmap_block *vb;
 		unsigned long idx;
 
@@ -2910,7 +2919,7 @@ void vm_unmap_ram(const void *mem, unsigned int count)
 
 	kasan_poison_vmalloc(mem, size);
 
-	if (likely(count <= VMAP_MAX_ALLOC)) {
+	if (likely(count <= VMAP_MAX_ALLOC) && vmap_block_queue_initialized) {
 		debug_check_no_locks_freed(mem, size);
 		vb_free(addr, size);
 		return;
@@ -2946,7 +2955,7 @@ void *vm_map_ram(struct page **pages, unsigned int count, int node)
 	unsigned long addr;
 	void *mem;
 
-	if (likely(count <= VMAP_MAX_ALLOC)) {
+	if (likely(count <= VMAP_MAX_ALLOC && vmap_block_queue_initialized)) {
 		mem = vb_alloc(size, GFP_KERNEL);
 		if (IS_ERR(mem))
 			return NULL;
@@ -5087,17 +5096,28 @@ void __init vmalloc_init(void)
 	 */
 	vmap_area_cachep = KMEM_CACHE(vmap_area, SLAB_PANIC);
 
+	vmap_block_queue = kmalloc_array(
+		nr_cpu_ids, sizeof(struct vmap_block_queue), GFP_NOWAIT);
+
+	if (vmap_block_queue) {
+		for (i = 0; i < nr_cpu_ids; i++) {
+			struct vmap_block_queue *vbq =
+				&vmap_block_queue[i];
+
+			spin_lock_init(&vbq->lock);
+			INIT_LIST_HEAD(&vbq->free);
+			xa_init(&vbq->vmap_blocks);
+		}
+	} else {
+		pr_err("Failed to allocate vmap_block_queue array, use fallback path!\n");
+	}
+
 	for_each_possible_cpu(i) {
-		struct vmap_block_queue *vbq;
-		struct vfree_deferred *p;
+		struct vfree_deferred *p =
+			&per_cpu(vfree_deferred, i);
 
-		vbq = &per_cpu(vmap_block_queue, i);
-		spin_lock_init(&vbq->lock);
-		INIT_LIST_HEAD(&vbq->free);
-		p = &per_cpu(vfree_deferred, i);
 		init_llist_head(&p->list);
 		INIT_WORK(&p->wq, delayed_vfree_work);
-		xa_init(&vbq->vmap_blocks);
 	}
 
 	/*
@@ -5125,6 +5145,9 @@ void __init vmalloc_init(void)
 	vmap_init_free_space();
 	vmap_initialized = true;
 
+	if (vmap_block_queue)
+		vmap_block_queue_initialized = true;
+
 	vmap_node_shrinker = shrinker_alloc(0, "vmap-node");
 	if (!vmap_node_shrinker) {
 		pr_err("Failed to allocate vmap-node shrinker!\n");
<snip>

Any thoughts?

--
Uladzislau Rezki



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: PROBLEM: kernel crashes when running xfsdump since ~6.4
  2024-06-20 18:02 ` Nick Bowler
  2024-06-21  3:30   ` Hailong Liu
@ 2024-06-24 12:20   ` Uladzislau Rezki
  1 sibling, 0 replies; 39+ messages in thread
From: Uladzislau Rezki @ 2024-06-24 12:20 UTC (permalink / raw)
  To: Nick Bowler
  Cc: linux-kernel, Linux regressions mailing list, linux-mm,
	sparclinux, Uladzislau Rezki (Sony),
	Andrew Morton

> On 2024-06-20 02:19, Nick Bowler wrote:
> > After upgrading my sparc to 6.9.5 I noticed that attempting to run
> > xfsdump instantly (within a couple seconds) and reliably crashes the
> > kernel.  The same problem is also observed on 6.10-rc4.
> [...]
> >   062eacf57ad91b5c272f89dc964fd6dd9715ea7d is the first bad commit
> >   commit 062eacf57ad91b5c272f89dc964fd6dd9715ea7d
> >   Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
> >   Date:   Thu Mar 30 21:06:38 2023 +0200
> >   
> >       mm: vmalloc: remove a global vmap_blocks xarray
> 
> I think I might see what is happening here.
> 
> On this machine, there are two CPUs numbered 0 and 2 (there is no CPU1).
> 
> The per-cpu variables in mm/vmalloc.c are initialized like this, in
> vmalloc_init
> 
>   for_each_possible_cpu(i) {
>     /* ... */
>     vbq = &per_cpu(vmap_block_queue, i);
>     /* initialize stuff in vbq */
>   }
> 
> This loops over the set bits of cpu_possible_mask, bits 0 and 2 are set,
> so it initializes stuff with i=0 and i=2, skipping i=1 (I added prints to
> confirm this).
> 
> Then, in vm_map_ram, with the problematic change it calls the new
> function addr_to_vb_xa, which does this:
> 
>   int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
>   return &per_cpu(vmap_block_queue, index).vmap_blocks;
> 
> The num_possible_cpus() function counts the number of set bits in
> cpu_possible_mask, so it returns 2.  Thus, index is either 0 or 1, which
> does not correspond to what was initialized (0 or 2).  The crash occurs
> when the computed index is 1 in this function.  In this case, the
> returned value appears to be garbage (I added prints to confirm this).
> 
> If I change addr_to_vb_xa function to this:
> 
>   int index = ((addr / VMAP_BLOCK_SIZE) & 1) << 1; /* 0 or 2 */
>   return &per_cpu(vmap_block_queue, index).vmap_blocks;
> 
> xfsdump is working again.
> 
Could you please test below?

<snip>
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 5d3aa2dc88a8..1733946f7a12 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -5087,7 +5087,13 @@ void __init vmalloc_init(void)
         */
        vmap_area_cachep = KMEM_CACHE(vmap_area, SLAB_PANIC);

-       for_each_possible_cpu(i) {
+       /*
+        * We use "nr_cpu_ids" here because some architectures
+        * may have "gaps" in cpu-possible-mask. It is OK for
+        * per-cpu approaches but is not OK for cases where it
+        * can be used as hashes also.
+        */
+       for (i = 0; i < nr_cpu_ids; i++) {
                struct vmap_block_queue *vbq;
                struct vfree_deferred *p;
<snip>

Thank you in advance and i really appreciate for finding this
issue!

--
Uladzislau Rezki


^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2024-06-26 13:38 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-06-20  6:19 PROBLEM: kernel crashes when running xfsdump since ~6.4 Nick Bowler
2024-06-20  6:37 ` Hailong Liu
2024-06-20 14:36   ` Nick Bowler
2024-06-20 18:02 ` Nick Bowler
2024-06-21  3:30   ` Hailong Liu
2024-06-21  7:07     ` Baoquan He
2024-06-21  9:44       ` Uladzislau Rezki
2024-06-21 10:45         ` Hailong Liu
2024-06-21 11:15         ` Hailong Liu
2024-06-24 12:18           ` Uladzislau Rezki
2024-06-25  9:26             ` Hailong Liu
2024-06-25  9:55               ` Uladzislau Rezki
2024-06-21 13:42         ` Michael Kelley
2024-06-24 12:17           ` Uladzislau Rezki
2024-06-21 14:02         ` Baoquan He
2024-06-24 12:16           ` Uladzislau Rezki
2024-06-25  3:30             ` Baoquan He
2024-06-25 10:32               ` Uladzislau Rezki
2024-06-25 11:40                 ` Baoquan He
2024-06-25 12:40                   ` Uladzislau Rezki
2024-06-25 13:02                     ` Baoquan He
2024-06-25 15:33                       ` Uladzislau Rezki
2024-06-25 15:49                         ` Baoquan He
2024-06-25 16:49                           ` Uladzislau Rezki
2024-06-25 20:05                             ` Uladzislau Rezki
2024-06-26  0:38                               ` Baoquan He
2024-06-26  5:12                               ` Hailong Liu
2024-06-26  9:15                                 ` Uladzislau Rezki
2024-06-26 10:03                                   ` Hailong Liu
2024-06-26 10:51                                     ` Baoquan He
2024-06-26 10:53                                       ` Uladzislau Rezki
2024-06-26 11:30                                       ` Hailong Liu
2024-06-26 11:45                                         ` Uladzislau Rezki
2024-06-26 10:51                                     ` Uladzislau Rezki
2024-06-26 13:34                                       ` Nick Bowler
2024-06-26 13:38                                         ` Uladzislau Rezki
2024-06-25 11:19               ` Hailong Liu
2024-06-25 12:41               ` Uladzislau Rezki
2024-06-24 12:20   ` Uladzislau Rezki

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox