* [PATCH 1/1] mm: vmalloc: Optimize vmap_lazy_nr arithmetic when purging each vmap_area
@ 2024-08-29 13:06 Adrian Huang
2024-08-29 19:00 ` Uladzislau Rezki
2024-08-31 0:33 ` Andrew Morton
0 siblings, 2 replies; 8+ messages in thread
From: Adrian Huang @ 2024-08-29 13:06 UTC (permalink / raw)
To: Andrew Morton, Uladzislau Rezki, Christoph Hellwig
Cc: linux-mm, linux-kernel, Adrian Huang
From: Adrian Huang <ahuang12@lenovo.com>
When running a vmalloc stress test on a 448-core system, the average
latency of purge_vmap_node() is observed to be about 2 seconds, as
measured with the eBPF/bcc 'funclatency.py' tool [1].
# /your-git-repo/bcc/tools/funclatency.py -u purge_vmap_node & pid1=$! && sleep 8 && modprobe test_vmalloc nr_threads=$(nproc) run_test_mask=0x7; kill -SIGINT $pid1
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 29 | |
4 -> 7 : 19 | |
8 -> 15 : 56 | |
16 -> 31 : 483 |**** |
32 -> 63 : 1548 |************ |
64 -> 127 : 2634 |********************* |
128 -> 255 : 2535 |********************* |
256 -> 511 : 1776 |************** |
512 -> 1023 : 1015 |******** |
1024 -> 2047 : 573 |**** |
2048 -> 4095 : 488 |**** |
4096 -> 8191 : 1091 |********* |
8192 -> 16383 : 3078 |************************* |
16384 -> 32767 : 4821 |****************************************|
32768 -> 65535 : 3318 |*************************** |
65536 -> 131071 : 1718 |************** |
131072 -> 262143 : 2220 |****************** |
262144 -> 524287 : 1147 |********* |
524288 -> 1048575 : 1179 |********* |
1048576 -> 2097151 : 822 |****** |
2097152 -> 4194303 : 906 |******* |
4194304 -> 8388607 : 2148 |***************** |
8388608 -> 16777215 : 4497 |************************************* |
16777216 -> 33554431 : 289 |** |
avg = 2041714 usecs, total: 78381401772 usecs, count: 38390
The worst case falls into the 16-33 second bucket, so a soft lockup is triggered [2].
[Root Cause]
1) Each purge_list is very long. The following shows the number of
vmap_area structs purged per vmap_node:
crash> p vmap_nodes
vmap_nodes = $27 = (struct vmap_node *) 0xff2de5a900100000
crash> vmap_node 0xff2de5a900100000 128 | grep nr_purged
nr_purged = 663070
...
nr_purged = 821670
nr_purged = 692214
nr_purged = 726808
...
2) atomic_long_sub() uses the 'lock' prefix to guarantee atomicity when
purging each vmap_area. However, the iteration covers more than
600,000 vmap_area structs (see 'nr_purged' above).
Here is objdump output:
$ objdump -D vmlinux
ffffffff813e8c80 <purge_vmap_node>:
...
ffffffff813e8d70: f0 48 29 2d 68 0c bb lock sub %rbp,0x2bb0c68(%rip)
...
Quote from "Instruction tables" pdf file [3]:
Instructions with a LOCK prefix have a long latency that depends on
cache organization and possibly RAM speed. If there are multiple
processors or cores or direct memory access (DMA) devices, then all
locked instructions will lock a cache line for exclusive access,
which may involve RAM access. A LOCK prefix typically costs more
than a hundred clock cycles, even on single-processor systems.
That's why the latency of purge_vmap_node() increases dramatically
on a many-core system: one core is busy purging each vmap_area on
the *long* purge_list and executing atomic_long_sub() for each
vmap_area, while other cores free vmalloc allocations and execute
atomic_long_add_return() in free_vmap_area_noflush().
[Solution]
Use a local variable to accumulate the total number of purged pages, and
execute a single atomic_long_sub() after the traversal of the purge_list
is done. The experiment results show a latency improvement of about 99%.
[Experiment Result]
1) System Configuration: Three servers (with HT enabled) were tested.
* 72-core server: 3rd Gen Intel Xeon Scalable Processor*1
* 192-core server: 5th Gen Intel Xeon Scalable Processor*2
* 448-core server: AMD Zen 4 Processor*2
2) Kernel Config
* CONFIG_KASAN is disabled
3) The data in the "w/o patch" and "w/ patch" columns
* Unit: microseconds (us)
* Each value is the average of three measurements
System w/o patch (us) w/ patch (us) Improvement (%)
--------------- -------------- ------------- -------------
72-core server 2194 14 99.36%
192-core server 143799 1139 99.21%
448-core server 1992122 6883 99.65%
[1] https://github.com/iovisor/bcc/blob/master/tools/funclatency.py
[2] https://gist.github.com/AdrianHuang/37c15f67b45407b83c2d32f918656c12
[3] https://www.agner.org/optimize/instruction_tables.pdf
Signed-off-by: Adrian Huang <ahuang12@lenovo.com>
---
mm/vmalloc.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 3f9b6bd707d2..607697c81e60 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2210,6 +2210,7 @@ static void purge_vmap_node(struct work_struct *work)
{
struct vmap_node *vn = container_of(work,
struct vmap_node, purge_work);
+ unsigned long nr_purged_pages = 0;
struct vmap_area *va, *n_va;
LIST_HEAD(local_list);
@@ -2224,7 +2225,7 @@ static void purge_vmap_node(struct work_struct *work)
list_del_init(&va->list);
- atomic_long_sub(nr, &vmap_lazy_nr);
+ nr_purged_pages += nr;
vn->nr_purged++;
if (is_vn_id_valid(vn_id) && !vn->skip_populate)
@@ -2235,6 +2236,8 @@ static void purge_vmap_node(struct work_struct *work)
list_add(&va->list, &local_list);
}
+ atomic_long_sub(nr_purged_pages, &vmap_lazy_nr);
+
reclaim_list_global(&local_list);
}
--
2.34.1
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH 1/1] mm: vmalloc: Optimize vmap_lazy_nr arithmetic when purging each vmap_area
2024-08-29 13:06 [PATCH 1/1] mm: vmalloc: Optimize vmap_lazy_nr arithmetic when purging each vmap_area Adrian Huang
@ 2024-08-29 19:00 ` Uladzislau Rezki
2024-08-30 16:26 ` Uladzislau Rezki
2024-09-02 12:00 ` Adrian Huang
2024-08-31 0:33 ` Andrew Morton
1 sibling, 2 replies; 8+ messages in thread
From: Uladzislau Rezki @ 2024-08-29 19:00 UTC (permalink / raw)
To: Adrian Huang
Cc: Andrew Morton, Uladzislau Rezki, Christoph Hellwig, linux-mm,
linux-kernel, Adrian Huang
On Thu, Aug 29, 2024 at 09:06:33PM +0800, Adrian Huang wrote:
> From: Adrian Huang <ahuang12@lenovo.com>
>
> When running a vmalloc stress test on a 448-core system, the average
> latency of purge_vmap_node() is observed to be about 2 seconds, as
> measured with the eBPF/bcc 'funclatency.py' tool [1].
>
> # /your-git-repo/bcc/tools/funclatency.py -u purge_vmap_node & pid1=$! && sleep 8 && modprobe test_vmalloc nr_threads=$(nproc) run_test_mask=0x7; kill -SIGINT $pid1
>
> usecs : count distribution
> 0 -> 1 : 0 | |
> 2 -> 3 : 29 | |
> 4 -> 7 : 19 | |
> 8 -> 15 : 56 | |
> 16 -> 31 : 483 |**** |
> 32 -> 63 : 1548 |************ |
> 64 -> 127 : 2634 |********************* |
> 128 -> 255 : 2535 |********************* |
> 256 -> 511 : 1776 |************** |
> 512 -> 1023 : 1015 |******** |
> 1024 -> 2047 : 573 |**** |
> 2048 -> 4095 : 488 |**** |
> 4096 -> 8191 : 1091 |********* |
> 8192 -> 16383 : 3078 |************************* |
> 16384 -> 32767 : 4821 |****************************************|
> 32768 -> 65535 : 3318 |*************************** |
> 65536 -> 131071 : 1718 |************** |
> 131072 -> 262143 : 2220 |****************** |
> 262144 -> 524287 : 1147 |********* |
> 524288 -> 1048575 : 1179 |********* |
> 1048576 -> 2097151 : 822 |****** |
> 2097152 -> 4194303 : 906 |******* |
> 4194304 -> 8388607 : 2148 |***************** |
> 8388608 -> 16777215 : 4497 |************************************* |
> 16777216 -> 33554431 : 289 |** |
>
> avg = 2041714 usecs, total: 78381401772 usecs, count: 38390
>
> The worst case falls into the 16-33 second bucket, so a soft lockup is triggered [2].
>
> [Root Cause]
> 1) Each purge_list is very long. The following shows the number of
> vmap_area structs purged per vmap_node:
>
> crash> p vmap_nodes
> vmap_nodes = $27 = (struct vmap_node *) 0xff2de5a900100000
> crash> vmap_node 0xff2de5a900100000 128 | grep nr_purged
> nr_purged = 663070
> ...
> nr_purged = 821670
> nr_purged = 692214
> nr_purged = 726808
> ...
>
> 2) atomic_long_sub() uses the 'lock' prefix to guarantee atomicity when
> purging each vmap_area. However, the iteration covers more than
> 600,000 vmap_area structs (see 'nr_purged' above).
>
> Here is objdump output:
>
> $ objdump -D vmlinux
> ffffffff813e8c80 <purge_vmap_node>:
> ...
> ffffffff813e8d70: f0 48 29 2d 68 0c bb lock sub %rbp,0x2bb0c68(%rip)
> ...
>
> Quote from "Instruction tables" pdf file [3]:
> Instructions with a LOCK prefix have a long latency that depends on
> cache organization and possibly RAM speed. If there are multiple
> processors or cores or direct memory access (DMA) devices, then all
> locked instructions will lock a cache line for exclusive access,
> which may involve RAM access. A LOCK prefix typically costs more
> than a hundred clock cycles, even on single-processor systems.
>
> That's why the latency of purge_vmap_node() increases dramatically
> on a many-core system: one core is busy purging each vmap_area on
> the *long* purge_list and executing atomic_long_sub() for each
> vmap_area, while other cores free vmalloc allocations and execute
> atomic_long_add_return() in free_vmap_area_noflush().
>
> [Solution]
> Use a local variable to accumulate the total number of purged pages, and
> execute a single atomic_long_sub() after the traversal of the purge_list
> is done. The experiment results show a latency improvement of about 99%.
>
> [Experiment Result]
> 1) System Configuration: Three servers (with HT enabled) were tested.
> * 72-core server: 3rd Gen Intel Xeon Scalable Processor*1
> * 192-core server: 5th Gen Intel Xeon Scalable Processor*2
> * 448-core server: AMD Zen 4 Processor*2
>
> 2) Kernel Config
> * CONFIG_KASAN is disabled
>
> 3) The data in the "w/o patch" and "w/ patch" columns
> * Unit: microseconds (us)
> * Each value is the average of three measurements
>
> System w/o patch (us) w/ patch (us) Improvement (%)
> --------------- -------------- ------------- -------------
> 72-core server 2194 14 99.36%
> 192-core server 143799 1139 99.21%
> 448-core server 1992122 6883 99.65%
>
> [1] https://github.com/iovisor/bcc/blob/master/tools/funclatency.py
> [2] https://gist.github.com/AdrianHuang/37c15f67b45407b83c2d32f918656c12
> [3] https://www.agner.org/optimize/instruction_tables.pdf
>
> Signed-off-by: Adrian Huang <ahuang12@lenovo.com>
> ---
> mm/vmalloc.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 3f9b6bd707d2..607697c81e60 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2210,6 +2210,7 @@ static void purge_vmap_node(struct work_struct *work)
> {
> struct vmap_node *vn = container_of(work,
> struct vmap_node, purge_work);
> + unsigned long nr_purged_pages = 0;
> struct vmap_area *va, *n_va;
> LIST_HEAD(local_list);
>
> @@ -2224,7 +2225,7 @@ static void purge_vmap_node(struct work_struct *work)
>
> list_del_init(&va->list);
>
> - atomic_long_sub(nr, &vmap_lazy_nr);
> + nr_purged_pages += nr;
> vn->nr_purged++;
>
> if (is_vn_id_valid(vn_id) && !vn->skip_populate)
> @@ -2235,6 +2236,8 @@ static void purge_vmap_node(struct work_struct *work)
> list_add(&va->list, &local_list);
> }
>
> + atomic_long_sub(nr_purged_pages, &vmap_lazy_nr);
> +
> reclaim_list_global(&local_list);
> }
>
> --
> 2.34.1
>
I see the point and it looks good to me.
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Thank you for improving this. There is one more spot which I detected
earlier:
<snip>
static void free_vmap_area_noflush(struct vmap_area *va)
{
unsigned long nr_lazy_max = lazy_max_pages();
unsigned long va_start = va->va_start;
unsigned int vn_id = decode_vn_id(va->flags);
struct vmap_node *vn;
unsigned long nr_lazy;
if (WARN_ON_ONCE(!list_empty(&va->list)))
return;
nr_lazy = atomic_long_add_return((va->va_end - va->va_start) >>
PAGE_SHIFT, &vmap_lazy_nr);
...
<snip>
atomic_long_add_return() might also introduce high contention. We can
optimize it by splitting it into lighter atomic operations. Can you check
this on your 448-core system?
Thanks!
--
Uladzislau Rezki
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH 1/1] mm: vmalloc: Optimize vmap_lazy_nr arithmetic when purging each vmap_area
2024-08-29 19:00 ` Uladzislau Rezki
@ 2024-08-30 16:26 ` Uladzislau Rezki
2024-08-31 7:03 ` Christophe JAILLET
2024-09-02 12:00 ` Adrian Huang
1 sibling, 1 reply; 8+ messages in thread
From: Uladzislau Rezki @ 2024-08-30 16:26 UTC (permalink / raw)
To: Adrian Huang
Cc: Adrian Huang, Andrew Morton, Christoph Hellwig, linux-mm,
linux-kernel, Adrian Huang
On Thu, Aug 29, 2024 at 09:00:16PM +0200, Uladzislau Rezki wrote:
> On Thu, Aug 29, 2024 at 09:06:33PM +0800, Adrian Huang wrote:
> > From: Adrian Huang <ahuang12@lenovo.com>
> >
> > When running a vmalloc stress test on a 448-core system, the average
> > latency of purge_vmap_node() is observed to be about 2 seconds, as
> > measured with the eBPF/bcc 'funclatency.py' tool [1].
> >
> > # /your-git-repo/bcc/tools/funclatency.py -u purge_vmap_node & pid1=$! && sleep 8 && modprobe test_vmalloc nr_threads=$(nproc) run_test_mask=0x7; kill -SIGINT $pid1
> >
> > usecs : count distribution
> > 0 -> 1 : 0 | |
> > 2 -> 3 : 29 | |
> > 4 -> 7 : 19 | |
> > 8 -> 15 : 56 | |
> > 16 -> 31 : 483 |**** |
> > 32 -> 63 : 1548 |************ |
> > 64 -> 127 : 2634 |********************* |
> > 128 -> 255 : 2535 |********************* |
> > 256 -> 511 : 1776 |************** |
> > 512 -> 1023 : 1015 |******** |
> > 1024 -> 2047 : 573 |**** |
> > 2048 -> 4095 : 488 |**** |
> > 4096 -> 8191 : 1091 |********* |
> > 8192 -> 16383 : 3078 |************************* |
> > 16384 -> 32767 : 4821 |****************************************|
> > 32768 -> 65535 : 3318 |*************************** |
> > 65536 -> 131071 : 1718 |************** |
> > 131072 -> 262143 : 2220 |****************** |
> > 262144 -> 524287 : 1147 |********* |
> > 524288 -> 1048575 : 1179 |********* |
> > 1048576 -> 2097151 : 822 |****** |
> > 2097152 -> 4194303 : 906 |******* |
> > 4194304 -> 8388607 : 2148 |***************** |
> > 8388608 -> 16777215 : 4497 |************************************* |
> > 16777216 -> 33554431 : 289 |** |
> >
> > avg = 2041714 usecs, total: 78381401772 usecs, count: 38390
> >
> > The worst case falls into the 16-33 second bucket, so a soft lockup is triggered [2].
> >
> > [Root Cause]
> > 1) Each purge_list is very long. The following shows the number of
> > vmap_area structs purged per vmap_node:
> >
> > crash> p vmap_nodes
> > vmap_nodes = $27 = (struct vmap_node *) 0xff2de5a900100000
> > crash> vmap_node 0xff2de5a900100000 128 | grep nr_purged
> > nr_purged = 663070
> > ...
> > nr_purged = 821670
> > nr_purged = 692214
> > nr_purged = 726808
> > ...
> >
> > 2) atomic_long_sub() uses the 'lock' prefix to guarantee atomicity when
> > purging each vmap_area. However, the iteration covers more than
> > 600,000 vmap_area structs (see 'nr_purged' above).
> >
> > Here is objdump output:
> >
> > $ objdump -D vmlinux
> > ffffffff813e8c80 <purge_vmap_node>:
> > ...
> > ffffffff813e8d70: f0 48 29 2d 68 0c bb lock sub %rbp,0x2bb0c68(%rip)
> > ...
> >
> > Quote from "Instruction tables" pdf file [3]:
> > Instructions with a LOCK prefix have a long latency that depends on
> > cache organization and possibly RAM speed. If there are multiple
> > processors or cores or direct memory access (DMA) devices, then all
> > locked instructions will lock a cache line for exclusive access,
> > which may involve RAM access. A LOCK prefix typically costs more
> > than a hundred clock cycles, even on single-processor systems.
> >
> > That's why the latency of purge_vmap_node() increases dramatically
> > on a many-core system: one core is busy purging each vmap_area on
> > the *long* purge_list and executing atomic_long_sub() for each
> > vmap_area, while other cores free vmalloc allocations and execute
> > atomic_long_add_return() in free_vmap_area_noflush().
> >
> > [Solution]
> > Use a local variable to accumulate the total number of purged pages, and
> > execute a single atomic_long_sub() after the traversal of the purge_list
> > is done. The experiment results show a latency improvement of about 99%.
> >
> > [Experiment Result]
> > 1) System Configuration: Three servers (with HT enabled) were tested.
> > * 72-core server: 3rd Gen Intel Xeon Scalable Processor*1
> > * 192-core server: 5th Gen Intel Xeon Scalable Processor*2
> > * 448-core server: AMD Zen 4 Processor*2
> >
> > 2) Kernel Config
> > * CONFIG_KASAN is disabled
> >
> > 3) The data in the "w/o patch" and "w/ patch" columns
> > * Unit: microseconds (us)
> > * Each value is the average of three measurements
> >
> > System w/o patch (us) w/ patch (us) Improvement (%)
> > --------------- -------------- ------------- -------------
> > 72-core server 2194 14 99.36%
> > 192-core server 143799 1139 99.21%
> > 448-core server 1992122 6883 99.65%
> >
> > [1] https://github.com/iovisor/bcc/blob/master/tools/funclatency.py
> > [2] https://gist.github.com/AdrianHuang/37c15f67b45407b83c2d32f918656c12
> > [3] https://www.agner.org/optimize/instruction_tables.pdf
> >
> > Signed-off-by: Adrian Huang <ahuang12@lenovo.com>
> > ---
> > mm/vmalloc.c | 5 ++++-
> > 1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 3f9b6bd707d2..607697c81e60 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -2210,6 +2210,7 @@ static void purge_vmap_node(struct work_struct *work)
> > {
> > struct vmap_node *vn = container_of(work,
> > struct vmap_node, purge_work);
> > + unsigned long nr_purged_pages = 0;
> > struct vmap_area *va, *n_va;
> > LIST_HEAD(local_list);
> >
> > @@ -2224,7 +2225,7 @@ static void purge_vmap_node(struct work_struct *work)
> >
> > list_del_init(&va->list);
> >
> > - atomic_long_sub(nr, &vmap_lazy_nr);
> > + nr_purged_pages += nr;
> > vn->nr_purged++;
> >
> > if (is_vn_id_valid(vn_id) && !vn->skip_populate)
> > @@ -2235,6 +2236,8 @@ static void purge_vmap_node(struct work_struct *work)
> > list_add(&va->list, &local_list);
> > }
> >
> > + atomic_long_sub(nr_purged_pages, &vmap_lazy_nr);
> > +
> > reclaim_list_global(&local_list);
> > }
> >
> > --
> > 2.34.1
> >
> I see the point and it looks good to me.
>
> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
>
> Thank you for improving this. There is one more spot which I detected
> earlier:
>
> <snip>
> static void free_vmap_area_noflush(struct vmap_area *va)
> {
> unsigned long nr_lazy_max = lazy_max_pages();
> unsigned long va_start = va->va_start;
> unsigned int vn_id = decode_vn_id(va->flags);
> struct vmap_node *vn;
> unsigned long nr_lazy;
>
> if (WARN_ON_ONCE(!list_empty(&va->list)))
> return;
>
> nr_lazy = atomic_long_add_return((va->va_end - va->va_start) >>
> PAGE_SHIFT, &vmap_lazy_nr);
>
> ...
> <snip>
>
> atomic_long_add_return() might also introduce high contention. We can
> optimize it by splitting it into lighter atomic operations. Can you check
> this on your 448-core system?
>
I have checked free_vmap_area_noflush() on my hardware, a 64-core
system:
<perf cycles>
...
+ 7.84% 5.18% [kernel] [k] free_vmap_area_noflush
+ 6.16% 1.61% [kernel] [k] free_unref_page
+ 5.57% 1.51% [kernel] [k] find_unlink_vmap_area
...
<perf cycles>
<perf cycles annotate>
..
│ arch_atomic64_add_return():
23352402 │ mov %r12,%rdx
│ lock xadd %rdx,vmap_lazy_nr
│ is_vn_id_valid():
52364447314 │ mov nr_vmap_nodes,%ecx <----- the hottest spot, consuming most of the CPU cycles (99%)
│ arch_atomic64_add_return():
45547180 │ add %rdx,%r12
│ is_vn_id_valid():
...
<perf cycles annotate>
At least in my case, on my HW, I do not see atomic_long_add_return() at
the top in terms of CPU cycles. The following one is the hottest instead:
static bool
is_vn_id_valid(unsigned int node_id)
{
if (node_id < nr_vmap_nodes)
return true;
return false;
}
access to "nr_vmap_nodes" which is read-only and globally defined:
static __read_mostly unsigned int nr_vmap_nodes = 1;
Any thoughts?
--
Uladzislau Rezki
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH 1/1] mm: vmalloc: Optimize vmap_lazy_nr arithmetic when purging each vmap_area
2024-08-29 13:06 [PATCH 1/1] mm: vmalloc: Optimize vmap_lazy_nr arithmetic when purging each vmap_area Adrian Huang
2024-08-29 19:00 ` Uladzislau Rezki
@ 2024-08-31 0:33 ` Andrew Morton
1 sibling, 0 replies; 8+ messages in thread
From: Andrew Morton @ 2024-08-31 0:33 UTC (permalink / raw)
To: Adrian Huang
Cc: Uladzislau Rezki, Christoph Hellwig, linux-mm, linux-kernel,
Adrian Huang
On Thu, 29 Aug 2024 21:06:33 +0800 Adrian Huang <adrianhuang0701@gmail.com> wrote:
> From: Adrian Huang <ahuang12@lenovo.com>
>
> When running a vmalloc stress test on a 448-core system, the average
> latency of purge_vmap_node() is observed to be about 2 seconds, as
> measured with the eBPF/bcc 'funclatency.py' tool [1].
>
> ...
>
> 3) The data in the "w/o patch" and "w/ patch" columns
> * Unit: microseconds (us)
> * Each value is the average of three measurements
>
> System w/o patch (us) w/ patch (us) Improvement (%)
> --------------- -------------- ------------- -------------
> 72-core server 2194 14 99.36%
> 192-core server 143799 1139 99.21%
> 448-core server 1992122 6883 99.65%
>
Holy cow. I'll add cc:stable to this. Partly because it fixes a
softlockup, partly because holy cow.
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH 1/1] mm: vmalloc: Optimize vmap_lazy_nr arithmetic when purging each vmap_area
2024-08-30 16:26 ` Uladzislau Rezki
@ 2024-08-31 7:03 ` Christophe JAILLET
2024-09-02 17:03 ` Uladzislau Rezki
0 siblings, 1 reply; 8+ messages in thread
From: Christophe JAILLET @ 2024-08-31 7:03 UTC (permalink / raw)
To: Uladzislau Rezki, Adrian Huang
Cc: Andrew Morton, Christoph Hellwig, linux-mm, linux-kernel, Adrian Huang
On 30/08/2024 at 18:26, Uladzislau Rezki wrote:
> At least in my case, on my HW, I do not see atomic_long_add_return() at
> the top in terms of CPU cycles. The following one is the hottest instead:
>
> static bool
> is_vn_id_valid(unsigned int node_id)
> {
> if (node_id < nr_vmap_nodes)
> return true;
>
> return false;
> }
>
> access to "nr_vmap_nodes" which is read-only and globally defined:
>
> static __read_mostly unsigned int nr_vmap_nodes = 1;
>
> Any thoughts?
>
> --
> Uladzislau Rezki
>
Hi,
unrelated to your use case, but something that could easily save a few
cycles on some systems, IMHO.
Maybe:
#if NR_CPUS > 1
static __read_mostly unsigned int nr_vmap_nodes = 1;
static __read_mostly unsigned int vmap_zone_size = 1;
#else
#define nr_vmap_nodes 1
#define vmap_zone_size 1
#endif
So that the compiler can do a better job: some loops can be optimized
away, and there is no need to access memory to get these values.
Not sure if such a use case exists or is of any interest.
This is valid because of [1] and the #ifdef around the
num_possible_cpus() declaration [2, 3].
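For instance (a hypothetical loop, shaped like the per-node iterations
in mm/vmalloc.c, shown only to illustrate the effect), with the #define
form the compiler sees a constant bound:

	for (i = 0; i < nr_vmap_nodes; i++)
		vn = &vmap_nodes[i];

and can collapse the loop into straight-line code for a single node,
with no memory load of nr_vmap_nodes at all.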
Just my 2c.
CJ
[1]: https://elixir.bootlin.com/linux/v6.11-rc5/source/mm/vmalloc.c#L5026
[2]:
https://elixir.bootlin.com/linux/v6.11-rc5/source/include/linux/cpumask.h#L1083
[3]:
https://elixir.bootlin.com/linux/v6.11-rc5/source/include/linux/cpumask.h#L1136
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH 1/1] mm: vmalloc: Optimize vmap_lazy_nr arithmetic when purging each vmap_area
2024-08-29 19:00 ` Uladzislau Rezki
2024-08-30 16:26 ` Uladzislau Rezki
@ 2024-09-02 12:00 ` Adrian Huang
2024-09-02 17:00 ` Uladzislau Rezki
1 sibling, 1 reply; 8+ messages in thread
From: Adrian Huang @ 2024-09-02 12:00 UTC (permalink / raw)
To: urezki; +Cc: adrianhuang0701, ahuang12, akpm, hch, linux-kernel, linux-mm
On Fri, Aug 30, 2024 at 3:00 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> atomic_long_add_return() might also introduce high contention. We can
> optimize it by splitting it into lighter atomic operations. Can you check
> this on your 448-core system?
Interestingly, the following result shows the average latency of
free_vmap_area_noflush() is just 26 usecs (the worst case is 16-32 ms).
/home/git-repo/bcc/tools/funclatency.py -u free_vmap_area_noflush & pid1=$! && sleep 8 && modprobe test_vmalloc nr_threads=$(nproc) run_test_mask=0x7; kill -SIGINT $pid1
usecs : count distribution
0 -> 1 : 18166 | |
2 -> 3 : 41929818 |** |
4 -> 7 : 181203439 |*********** |
8 -> 15 : 464242836 |***************************** |
16 -> 31 : 620077545 |****************************************|
32 -> 63 : 442133041 |**************************** |
64 -> 127 : 111432597 |******* |
128 -> 255 : 3441649 | |
256 -> 511 : 302655 | |
512 -> 1023 : 738 | |
1024 -> 2047 : 73 | |
2048 -> 4095 : 0 | |
4096 -> 8191 : 0 | |
8192 -> 16383 : 0 | |
16384 -> 32767 : 196 | |
avg = 26 usecs, total: 49415657269 usecs, count: 1864782753
free_vmap_area_noflush() executes a lock-prefixed instruction only once,
so the worst case might be just about a hundred clock cycles.
The problem with purge_vmap_node() is that some cores are busy purging
each vmap_area of the *long* purge_list and executing atomic_long_sub()
for each vmap_area, while other cores free vmalloc allocations and execute
atomic_long_add_return() in free_vmap_area_noflush(). The following crash
log shows that 22 cores are busy purging vmap_area structs [1]:
crash> bt -a | grep "purge_vmap_node+291" | wc -l
22
So, the latency of purge_vmap_node() increases dramatically because it
executes the lock prefix over 600,000 times. The issue is easier to
reproduce when more cores execute purge_vmap_node() simultaneously.
[1] https://gist.github.com/AdrianHuang/c0030dd7755e673ed00cb197b76ce0a7
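As a rough cross-check (assuming the "more than a hundred clock cycles"
per LOCK-prefixed instruction quoted from [3], and a 2-3 GHz clock):

  600,000 locked ops x ~100 cycles = ~60,000,000 cycles, i.e. ~20-30 ms

so the measured multi-second latencies are about two orders of magnitude
above the uncontended estimate, which is consistent with heavy cache-line
contention from the concurrent atomic_long_add_return() callers.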
I tested the following patch with the lighter atomics. However, the
average did not improve (though the worst case did):
usecs : count distribution
0 -> 1 : 7146 | |
2 -> 3 : 31734187 |** |
4 -> 7 : 161408609 |*********** |
8 -> 15 : 461411377 |********************************* |
16 -> 31 : 557005293 |****************************************|
32 -> 63 : 435518485 |******************************* |
64 -> 127 : 175033097 |************ |
128 -> 255 : 42265379 |*** |
256 -> 511 : 399112 | |
512 -> 1023 : 734 | |
1024 -> 2047 : 72 | |
avg = 32 usecs, total: 59952713176 usecs, count: 1864783491
[free_vmap_area_noflush() assembly instructions w/o the test patch]
ffffffff813e6e90 <free_vmap_area_noflush>:
...
ffffffff813e6ed4: 4c 8b 65 08 mov 0x8(%rbp),%r12
ffffffff813e6ed8: 4c 2b 65 00 sub 0x0(%rbp),%r12
ffffffff813e6edc: 49 c1 ec 0c shr $0xc,%r12
ffffffff813e6ee0: 4c 89 e2 mov %r12,%rdx
ffffffff813e6ee3: f0 48 0f c1 15 f4 2a lock xadd %rdx,0x2bb2af4(%rip) # ffffffff83f999e0 <vmap_lazy_nr>
ffffffff813e6eea: bb 02
ffffffff813e6eec: 8b 0d 0e b1 a5 01 mov 0x1a5b10e(%rip),%ecx # ffffffff82e42000 <nr_vmap_nodes>
ffffffff813e6ef2: 49 01 d4 add %rdx,%r12
ffffffff813e6ef5: 39 c8 cmp %ecx,%eax
...
[free_vmap_area_noflush() assembly instructions w/ the test patch: no lock prefix]
ffffffff813e6e90 <free_vmap_area_noflush>:
...
ffffffff813e6edb: 48 8b 5d 08 mov 0x8(%rbp),%rbx
ffffffff813e6edf: 48 2b 5d 00 sub 0x0(%rbp),%rbx
ffffffff813e6ee3: 8b 0d d7 b0 a5 01 mov 0x1a5b0d7(%rip),%ecx # ffffffff82e41fc0 <nr_vmap_nodes>
ffffffff813e6ee9: 48 c1 eb 0c shr $0xc,%rbx
ffffffff813e6eed: 4c 8b 25 b4 f1 92 01 mov 0x192f1b4(%rip),%r12 # ffffffff82d160a8 <vmap_nodes>
ffffffff813e6ef4: 48 01 d3 add %rdx,%rbx
ffffffff813e6ef7: 48 89 1d e2 2a bb 02 mov %rbx,0x2bb2ae2(%rip) # ffffffff83f999e0 <vmap_lazy_nr>
ffffffff813e6efe: 39 c8 cmp %ecx,%eax
...
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 3f9b6bd707d2..3927c541440b 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2357,8 +2357,9 @@ static void free_vmap_area_noflush(struct vmap_area *va)
if (WARN_ON_ONCE(!list_empty(&va->list)))
return;
- nr_lazy = atomic_long_add_return((va->va_end - va->va_start) >>
- PAGE_SHIFT, &vmap_lazy_nr);
+ nr_lazy = atomic_long_read(&vmap_lazy_nr);
+ nr_lazy += (va->va_end - va->va_start) >> PAGE_SHIFT;
+ atomic_long_set(&vmap_lazy_nr, nr_lazy);
/*
* If it was request by a certain node we would like to
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH 1/1] mm: vmalloc: Optimize vmap_lazy_nr arithmetic when purging each vmap_area
2024-09-02 12:00 ` Adrian Huang
@ 2024-09-02 17:00 ` Uladzislau Rezki
0 siblings, 0 replies; 8+ messages in thread
From: Uladzislau Rezki @ 2024-09-02 17:00 UTC (permalink / raw)
To: Adrian Huang; +Cc: urezki, ahuang12, akpm, hch, linux-kernel, linux-mm
On Mon, Sep 02, 2024 at 08:00:46PM +0800, Adrian Huang wrote:
> On Fri, Aug 30, 2024 at 3:00 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > atomic_long_add_return() might also introduce high contention. We can
> > optimize it by splitting it into lighter atomic operations. Can you check
> > this on your 448-core system?
>
> Interestingly, the following result shows the average latency of
> free_vmap_area_noflush() is just 26 usecs (the worst case is 16-32 ms).
>
> /home/git-repo/bcc/tools/funclatency.py -u free_vmap_area_noflush & pid1=$! && sleep 8 && modprobe test_vmalloc nr_threads=$(nproc) run_test_mask=0x7; kill -SIGINT $pid1
>
> usecs : count distribution
> 0 -> 1 : 18166 | |
> 2 -> 3 : 41929818 |** |
> 4 -> 7 : 181203439 |*********** |
> 8 -> 15 : 464242836 |***************************** |
> 16 -> 31 : 620077545 |****************************************|
> 32 -> 63 : 442133041 |**************************** |
> 64 -> 127 : 111432597 |******* |
> 128 -> 255 : 3441649 | |
> 256 -> 511 : 302655 | |
> 512 -> 1023 : 738 | |
> 1024 -> 2047 : 73 | |
> 2048 -> 4095 : 0 | |
> 4096 -> 8191 : 0 | |
> 8192 -> 16383 : 0 | |
> 16384 -> 32767 : 196 | |
>
> avg = 26 usecs, total: 49415657269 usecs, count: 1864782753
>
>
> free_vmap_area_noflush() executes a lock-prefixed instruction only once,
> so the worst case might be just about a hundred clock cycles.
>
> The problem with purge_vmap_node() is that some cores are busy purging
> each vmap_area of the *long* purge_list and executing atomic_long_sub()
> for each vmap_area, while other cores free vmalloc allocations and execute
> atomic_long_add_return() in free_vmap_area_noflush(). The following crash
> log shows that 22 cores are busy purging vmap_area structs [1]:
>
> crash> bt -a | grep "purge_vmap_node+291" | wc -l
> 22
>
> So, the latency of purge_vmap_node() increases dramatically because it
> executes the lock prefix over 600,000 times. The issue is easier to
> reproduce when more cores execute purge_vmap_node() simultaneously.
>
Right. This is clear to me. Under heavy stressing in a tight loop we
invoke atomic_long_sub() once per freed VA. With 448 cores and one
stress job per CPU, we end up with a high-contention spot, since access
to the atomic requires a cache-line lock.
>
>
> I tested the following patch with the lighter atomics. However, the
> average did not improve (though the worst case did):
>
> usecs : count distribution
> 0 -> 1 : 7146 | |
> 2 -> 3 : 31734187 |** |
> 4 -> 7 : 161408609 |*********** |
> 8 -> 15 : 461411377 |********************************* |
> 16 -> 31 : 557005293 |****************************************|
> 32 -> 63 : 435518485 |******************************* |
> 64 -> 127 : 175033097 |************ |
> 128 -> 255 : 42265379 |*** |
> 256 -> 511 : 399112 | |
> 512 -> 1023 : 734 | |
> 1024 -> 2047 : 72 | |
>
> avg = 32 usecs, total: 59952713176 usecs, count: 1864783491
>
Thank you for checking this! So there is no difference. As for the worst
case, it might be a measurement error. The problem is that the measured
time includes context switches, because a context which triggers the
free_vmap_area_noflush() function can easily be preempted.
--
Uladzislau Rezki
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH 1/1] mm: vmalloc: Optimize vmap_lazy_nr arithmetic when purging each vmap_area
2024-08-31 7:03 ` Christophe JAILLET
@ 2024-09-02 17:03 ` Uladzislau Rezki
0 siblings, 0 replies; 8+ messages in thread
From: Uladzislau Rezki @ 2024-09-02 17:03 UTC (permalink / raw)
To: Christophe JAILLET
Cc: Uladzislau Rezki, Adrian Huang, Andrew Morton, Christoph Hellwig,
linux-mm, linux-kernel, Adrian Huang
Hello!
>
> Hi,
>
> unrelated to your use case, but something that could easily save a few
> cycles on some systems, IMHO.
>
> Maybe:
>
> #if NR_CPUS > 1
> static __read_mostly unsigned int nr_vmap_nodes = 1;
> static __read_mostly unsigned int vmap_zone_size = 1;
> #else
> #define nr_vmap_nodes 1
> #define vmap_zone_size 1
> #endif
>
> So that the compiler can do a better job: some loops can be optimized
> away, and there is no need to access memory to get these values.
>
> Not sure if such a use case exists or is of any interest.
>
> This is valid because of [1] and the #ifdef around the
> num_possible_cpus() declaration [2, 3].
>
>
> Just my 2c.
>
Thank you, I see your point.
--
Uladzislau Rezki
^ permalink raw reply [flat|nested] 8+ messages in thread
Thread overview: 8+ messages
-- links below jump to the message on this page --
2024-08-29 13:06 [PATCH 1/1] mm: vmalloc: Optimize vmap_lazy_nr arithmetic when purging each vmap_area Adrian Huang
2024-08-29 19:00 ` Uladzislau Rezki
2024-08-30 16:26 ` Uladzislau Rezki
2024-08-31 7:03 ` Christophe JAILLET
2024-09-02 17:03 ` Uladzislau Rezki
2024-09-02 12:00 ` Adrian Huang
2024-09-02 17:00 ` Uladzislau Rezki
2024-08-31 0:33 ` Andrew Morton