linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Adrian Huang <adrianhuang0701@gmail.com>
To: urezki@gmail.com
Cc: adrianhuang0701@gmail.com, ahuang12@lenovo.com,
	akpm@linux-foundation.org, hch@infradead.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH 1/1] mm: vmalloc: Optimize vmap_lazy_nr arithmetic when purging each vmap_area
Date: Mon,  2 Sep 2024 20:00:46 +0800	[thread overview]
Message-ID: <20240902120046.26478-1-ahuang12@lenovo.com> (raw)
In-Reply-To: <ZtDFQHGHMq6TfbKA@pc636>

On Fri, Aug 30, 2024 at 3:00 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> atomic_long_add_return() might also introduce a high contention. We can
> optimize by splitting into more light atomics. Can you check it on your
> 448-cores system?

Interestingly, the following result shows the latency of
free_vmap_area_noflush() is just 26 usecs (The worst case is 16ms-32ms).

/home/git-repo/bcc/tools/funclatency.py -u free_vmap_area_noflush & pid1=$! && sleep 8 && modprobe test_vmalloc nr_threads=$(nproc) run_test_mask=0x7; kill -SIGINT $pid1

         usecs       : count     distribution
         0 -> 1      : 18166     |                                        |
         2 -> 3      : 41929818  |**                                      |
         4 -> 7      : 181203439 |***********                             |
         8 -> 15     : 464242836 |*****************************           |
        16 -> 31     : 620077545 |****************************************|
        32 -> 63     : 442133041 |****************************            |
        64 -> 127    : 111432597 |*******                                 |
       128 -> 255    : 3441649   |                                        |
       256 -> 511    : 302655    |                                        |
       512 -> 1023   : 738       |                                        |
      1024 -> 2047   : 73        |                                        |
      2048 -> 4095   : 0         |                                        |
      4096 -> 8191   : 0         |                                        |
      8192 -> 16383  : 0         |                                        |
     16384 -> 32767  : 196       |                                        |

   avg = 26 usecs, total: 49415657269 usecs, count: 1864782753


free_vmap_area_noflush() just executes the lock prefix one time, so the
wrost case might be just about a hundred clock cycles.

The problem of purge_vmap_node() is that some cores are busy on purging
each vmap_area of the *long* purge_list and executing atomic_long_sub()
for each vmap_area, while other cores free vmalloc allocations and execute
atomic_long_add_return() in free_vmap_area_noflush(). The following crash
log shows the 22 cores are busy on purging vmap_area structs [1]:

  crash> bt -a | grep "purge_vmap_node+291" | wc -l
  22

So, the latency of purge_vmap_node() dramatically increases becase it
excutes the lock prefix over 600,0000 times. The issue can be easier
to reproduce if more cores execute purge_vmap_node() simultaneously.

[1] https://gist.github.com/AdrianHuang/c0030dd7755e673ed00cb197b76ce0a7


Tested the following patch with the light atomics. However, nothing improved 
(But, the worst case is improved):

         usecs        : count     distribution
         0 -> 1       : 7146      |                                        |
         2 -> 3       : 31734187  |**                                      |
         4 -> 7       : 161408609 |***********                             |
         8 -> 15      : 461411377 |*********************************       |
        16 -> 31      : 557005293 |****************************************|
        32 -> 63      : 435518485 |*******************************         |
        64 -> 127     : 175033097 |************                            |
       128 -> 255     : 42265379  |***                                     |
       256 -> 511     : 399112    |                                        |
       512 -> 1023    : 734       |                                        |
      1024 -> 2047    : 72        |                                        |

avg = 32 usecs, total: 59952713176 usecs, count: 1864783491

[free_vmap_area_noflush() assembly instructions wo/ the test patch]
ffffffff813e6e90 <free_vmap_area_noflush>:
...
ffffffff813e6ed4:       4c 8b 65 08             mov    0x8(%rbp),%r12
ffffffff813e6ed8:       4c 2b 65 00             sub    0x0(%rbp),%r12
ffffffff813e6edc:       49 c1 ec 0c             shr    $0xc,%r12
ffffffff813e6ee0:       4c 89 e2                mov    %r12,%rdx
ffffffff813e6ee3:       f0 48 0f c1 15 f4 2a    lock xadd %rdx,0x2bb2af4(%rip)        # ffffffff83f999e0 <vmap_lazy_nr>
ffffffff813e6eea:       bb 02
ffffffff813e6eec:       8b 0d 0e b1 a5 01       mov    0x1a5b10e(%rip),%ecx        # ffffffff82e42000 <nr_vmap_nodes>
ffffffff813e6ef2:       49 01 d4                add    %rdx,%r12
ffffffff813e6ef5:       39 c8                   cmp    %ecx,%eax
...

[free_vmap_area_noflush() assembly instructions w/ the test patch: no lock prefix]
ffffffff813e6e90 <free_vmap_area_noflush>:
...
ffffffff813e6edb:       48 8b 5d 08             mov    0x8(%rbp),%rbx
ffffffff813e6edf:       48 2b 5d 00             sub    0x0(%rbp),%rbx
ffffffff813e6ee3:       8b 0d d7 b0 a5 01       mov    0x1a5b0d7(%rip),%ecx        # ffffffff82e41fc0 <nr_vmap_nodes>
ffffffff813e6ee9:       48 c1 eb 0c             shr    $0xc,%rbx
ffffffff813e6eed:       4c 8b 25 b4 f1 92 01    mov    0x192f1b4(%rip),%r12        # ffffffff82d160a8 <vmap_nodes>
ffffffff813e6ef4:       48 01 d3                add    %rdx,%rbx
ffffffff813e6ef7:       48 89 1d e2 2a bb 02    mov    %rbx,0x2bb2ae2(%rip)        # ffffffff83f999e0 <vmap_lazy_nr>
ffffffff813e6efe:       39 c8                   cmp    %ecx,%eax
...


diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 3f9b6bd707d2..3927c541440b 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2357,8 +2357,9 @@ static void free_vmap_area_noflush(struct vmap_area *va)
 	if (WARN_ON_ONCE(!list_empty(&va->list)))
 		return;
 
-	nr_lazy = atomic_long_add_return((va->va_end - va->va_start) >>
-				PAGE_SHIFT, &vmap_lazy_nr);
+	nr_lazy = atomic_long_read(&vmap_lazy_nr);
+	nr_lazy += (va->va_end - va->va_start) >> PAGE_SHIFT;
+	atomic_long_set(&vmap_lazy_nr, nr_lazy);
 
 	/*
 	 * If it was request by a certain node we would like to


  parent reply	other threads:[~2024-09-02 12:01 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-08-29 13:06 Adrian Huang
2024-08-29 19:00 ` Uladzislau Rezki
2024-08-30 16:26   ` Uladzislau Rezki
2024-08-31  7:03     ` Christophe JAILLET
2024-09-02 17:03       ` Uladzislau Rezki
2024-09-02 12:00   ` Adrian Huang [this message]
2024-09-02 17:00     ` Uladzislau Rezki
2024-08-31  0:33 ` Andrew Morton

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240902120046.26478-1-ahuang12@lenovo.com \
    --to=adrianhuang0701@gmail.com \
    --cc=ahuang12@lenovo.com \
    --cc=akpm@linux-foundation.org \
    --cc=hch@infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=urezki@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox