From: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
To: linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>
Cc: LKML <linux-kernel@vger.kernel.org>, Baoquan He <bhe@redhat.com>,
	Lorenzo Stoakes <lstoakes@gmail.com>,
	Christoph Hellwig <hch@infradead.org>,
	Matthew Wilcox <willy@infradead.org>,
	"Liam R . Howlett" <Liam.Howlett@oracle.com>,
	Dave Chinner <david@fromorbit.com>,
	"Paul E . McKenney" <paulmck@kernel.org>,
	Joel Fernandes <joel@joelfernandes.org>,
	Uladzislau Rezki <urezki@gmail.com>,
	Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
Subject: [PATCH v3 00/11] Mitigate a vmap lock contention v3
Date: Tue,  2 Jan 2024 19:46:22 +0100
Message-ID: <20240102184633.748113-1-urezki@gmail.com>

This is v3. It is based on 6.7.0-rc8.

1. Motivation

- Offload the global vmap locks, making the code scale with the number of CPUs;
- If possible and there is an agreement, we can remove the "per-cpu kva allocator"
  to make the vmap code simpler;
- There were complaints from XFS folks that vmalloc might be contended
  on their workloads.

2. Design (high-level overview)

We introduce an effective vmap node logic. A node behaves as an independent
entity and serves an allocation request directly (if possible) from its pool.
That way it bypasses the global vmap space that is protected by its own lock.
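
To give a rough picture, a per-node layout can be sketched as below. This is
only an illustration; the structure and field names are illustrative, not
necessarily the exact ones used in the series:

struct vmap_pool {
	struct list_head head;	/* cached, size-segregated VAs */
	unsigned long len;	/* number of cached VAs */
};

struct vmap_node {
	struct vmap_pool pool[256];	/* one pool per size, up to 256 pages */
	spinlock_t pool_lock;

	struct rb_root busy_root;	/* in-use VAs owned by this node */
	spinlock_t busy_lock;

	struct list_head lazy_head;	/* freed VAs pending a drain */
	spinlock_t lazy_lock;
};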

Access to the pools is serialized per CPU. The number of nodes is equal to
the number of CPUs in the system. Please note that the upper bound is
128 nodes.
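
Expressed in code, the node count would be derived roughly as follows (a
sketch, not the exact expression used in the patch):

nr_vmap_nodes = min_t(unsigned int, num_possible_cpus(), 128);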

Pools are size-segregated and populated based on system demand. The maximum
allocation request that can be cached in segregated storage is 256 pages. The
lazy drain path first decays a pool by 25% and then repopulates it with
freshly freed VAs for reuse, instead of returning them to the global space.
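
A condensed sketch of that decay step follows; the helper and field names are
illustrative and follow the per-node layout sketched above, not the exact
code in the series:

static void decay_pool(struct vmap_node *vn, struct vmap_pool *vp)
{
	struct vmap_area *va, *tmp;
	unsigned long nr_to_drop;
	LIST_HEAD(to_free);

	spin_lock(&vn->pool_lock);
	nr_to_drop = vp->len >> 2;	/* decay the pool by 25% */

	list_for_each_entry_safe(va, tmp, &vp->head, list) {
		if (!nr_to_drop--)
			break;

		list_move(&va->list, &to_free);
		vp->len--;
	}
	spin_unlock(&vn->pool_lock);

	/*
	 * The detached VAs on "to_free" are then released back
	 * to the free vmap space outside of the pool lock.
	 */
}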

When a VA is obtained (alloc path), it is stored in a per-node structure. The
va->va_start address is converted into the node where the VA should be placed
and reside. Doing so balances VAs across the nodes; as a result, access
becomes scalable. The addr_to_node() function performs the address-to-node
conversion.

The vmap space is divided into segments of a fixed size of 16 pages, so that
any address can be associated with a segment number. The number of nodes is
equal to num_possible_cpus() but not greater than 128; the numbering starts
from 0. See below how an address is converted:

static inline unsigned int
addr_to_node_id(unsigned long addr)
{
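	/* zone_size is the segment size (16 pages); nr_nodes is capped at 128 */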
	return (addr / zone_size) % nr_nodes;
}
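
For example, assuming 4K pages (zone_size = 16 * 4096 = 0x10000) and
nr_nodes = 4, the address 0xffff800000000000 maps to node 0 while the next
segment, starting at 0xffff800000010000, maps to node 1. Consecutive 16-page
segments are thus spread round-robin across the nodes.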

On the free path, a VA can be easily found by converting its "va_start"
address to the node where it resides. It is moved from the "busy" data
structure to the "lazy" one. Later on, as noted earlier, the lazy kworker
decays each node's pool and repopulates it with the freshly freed VAs. Please
note that a VA is returned to the node that served the alloc request.
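
In condensed form, the free path looks roughly like this (a sketch; the
helper names are illustrative, following the per-node layout sketched above):

static void free_vmap_area_deferred(struct vmap_area *va)
{
	/* Find the owning node from the address. */
	struct vmap_node *vn = addr_to_node(va->va_start);

	/* Unlink from the node's "busy" structures. */
	spin_lock(&vn->busy_lock);
	unlink_va(va, &vn->busy_root);
	spin_unlock(&vn->busy_lock);

	/* Queue onto the node's "lazy" list for deferred draining. */
	spin_lock(&vn->lazy_lock);
	list_add_tail(&va->list, &vn->lazy_head);
	spin_unlock(&vn->lazy_lock);

	/* The drain kworker later decays and repopulates the pools. */
	schedule_work(&drain_vmap_work);
}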

3. Test on AMD Ryzen Threadripper 3970X 32-Core Processor

sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64

<default perf>
 94.41%     0.89%  [kernel]        [k] _raw_spin_lock
 93.35%    93.07%  [kernel]        [k] native_queued_spin_lock_slowpath
 76.13%     0.28%  [kernel]        [k] __vmalloc_node_range
 72.96%     0.81%  [kernel]        [k] alloc_vmap_area
 56.94%     0.00%  [kernel]        [k] __get_vm_area_node
 41.95%     0.00%  [kernel]        [k] vmalloc
 37.15%     0.01%  [test_vmalloc]  [k] full_fit_alloc_test
 35.17%     0.00%  [kernel]        [k] ret_from_fork_asm
 35.17%     0.00%  [kernel]        [k] ret_from_fork
 35.17%     0.00%  [kernel]        [k] kthread
 35.08%     0.00%  [test_vmalloc]  [k] test_func
 34.45%     0.00%  [test_vmalloc]  [k] fix_size_alloc_test
 28.09%     0.01%  [test_vmalloc]  [k] long_busy_list_alloc_test
 23.53%     0.25%  [kernel]        [k] vfree.part.0
 21.72%     0.00%  [kernel]        [k] remove_vm_area
 20.08%     0.21%  [kernel]        [k] find_unlink_vmap_area
  2.34%     0.61%  [kernel]        [k] free_vmap_area_noflush
<default perf>
   vs
<patch-series perf>
 82.32%     0.22%  [test_vmalloc]  [k] long_busy_list_alloc_test
 63.36%     0.02%  [kernel]        [k] vmalloc
 63.34%     2.64%  [kernel]        [k] __vmalloc_node_range
 30.42%     4.46%  [kernel]        [k] vfree.part.0
 28.98%     2.51%  [kernel]        [k] __alloc_pages_bulk
 27.28%     0.19%  [kernel]        [k] __get_vm_area_node
 26.13%     1.50%  [kernel]        [k] alloc_vmap_area
 21.72%    21.67%  [kernel]        [k] clear_page_rep
 19.51%     2.43%  [kernel]        [k] _raw_spin_lock
 16.61%    16.51%  [kernel]        [k] native_queued_spin_lock_slowpath
 13.40%     2.07%  [kernel]        [k] free_unref_page
 10.62%     0.01%  [kernel]        [k] remove_vm_area
  9.02%     8.73%  [kernel]        [k] insert_vmap_area
  8.94%     0.00%  [kernel]        [k] ret_from_fork_asm
  8.94%     0.00%  [kernel]        [k] ret_from_fork
  8.94%     0.00%  [kernel]        [k] kthread
  8.29%     0.00%  [test_vmalloc]  [k] test_func
  7.81%     0.05%  [test_vmalloc]  [k] full_fit_alloc_test
  5.30%     4.73%  [kernel]        [k] purge_vmap_node
  4.47%     2.65%  [kernel]        [k] free_vmap_area_noflush
<patch-series perf>

This confirms that native_queued_spin_lock_slowpath goes down to
16.51% from 93.07%.

The throughput is ~12x higher:

urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
Run the test with following parameters: run_test_mask=7 nr_threads=64
Done.
Check the kernel ring buffer to see the summary.

real    10m51.271s
user    0m0.013s
sys     0m0.187s
urezki@pc638:~$

urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
Run the test with following parameters: run_test_mask=7 nr_threads=64
Done.
Check the kernel ring buffer to see the summary.

real    0m51.301s
user    0m0.015s
sys     0m0.040s
urezki@pc638:~$
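
That is ~651 seconds on the default kernel vs. ~51 seconds with the series
applied, i.e. roughly a 12.7x reduction in wall-clock time.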

4. Changelog

v1: https://lore.kernel.org/linux-mm/ZIAqojPKjChJTssg@pc636/T/
v2: https://lore.kernel.org/lkml/20230829081142.3619-1-urezki@gmail.com/

Delta v2 -> v3:
  - addressed review comments from v2 feedback;
  - switched from the pre-fetch chunk logic to less complex size-based pools.

Baoquan He (1):
  mm/vmalloc: remove vmap_area_list

Uladzislau Rezki (Sony) (10):
  mm: vmalloc: Add va_alloc() helper
  mm: vmalloc: Rename adjust_va_to_fit_type() function
  mm: vmalloc: Move vmap_init_free_space() down in vmalloc.c
  mm: vmalloc: Remove global vmap_area_root rb-tree
  mm: vmalloc: Remove global purge_vmap_area_root rb-tree
  mm: vmalloc: Offload free_vmap_area_lock lock
  mm: vmalloc: Support multiple nodes in vread_iter
  mm: vmalloc: Support multiple nodes in vmallocinfo
  mm: vmalloc: Set nr_nodes based on CPUs in a system
  mm: vmalloc: Add a shrinker to drain vmap pools

 .../admin-guide/kdump/vmcoreinfo.rst          |    8 +-
 arch/arm64/kernel/crash_core.c                |    1 -
 arch/riscv/kernel/crash_core.c                |    1 -
 include/linux/vmalloc.h                       |    1 -
 kernel/crash_core.c                           |    4 +-
 kernel/kallsyms_selftest.c                    |    1 -
 mm/nommu.c                                    |    2 -
 mm/vmalloc.c                                  | 1049 ++++++++++++-----
 8 files changed, 786 insertions(+), 281 deletions(-)

-- 
2.39.2



Thread overview: 55+ messages
2024-01-02 18:46 Uladzislau Rezki (Sony) [this message]
2024-01-02 18:46 ` [PATCH v3 01/11] mm: vmalloc: Add va_alloc() helper Uladzislau Rezki (Sony)
2024-01-02 18:46 ` [PATCH v3 02/11] mm: vmalloc: Rename adjust_va_to_fit_type() function Uladzislau Rezki (Sony)
2024-01-02 18:46 ` [PATCH v3 03/11] mm: vmalloc: Move vmap_init_free_space() down in vmalloc.c Uladzislau Rezki (Sony)
2024-01-02 18:46 ` [PATCH v3 04/11] mm: vmalloc: Remove global vmap_area_root rb-tree Uladzislau Rezki (Sony)
2024-01-05  8:10   ` Wen Gu
2024-01-05 10:50     ` Uladzislau Rezki
2024-01-06  9:17       ` Wen Gu
2024-01-06 16:36         ` Uladzislau Rezki
2024-01-07  6:59           ` Hillf Danton
2024-01-08  7:45             ` Wen Gu
2024-01-08 18:37               ` Uladzislau Rezki
2024-01-16 23:25   ` Lorenzo Stoakes
2024-01-18 13:15     ` Uladzislau Rezki
2024-01-20 12:55       ` Lorenzo Stoakes
2024-01-22 17:44         ` Uladzislau Rezki
2024-01-02 18:46 ` [PATCH v3 05/11] mm/vmalloc: remove vmap_area_list Uladzislau Rezki (Sony)
2024-01-16 23:36   ` Lorenzo Stoakes
2024-01-02 18:46 ` [PATCH v3 06/11] mm: vmalloc: Remove global purge_vmap_area_root rb-tree Uladzislau Rezki (Sony)
2024-01-02 18:46 ` [PATCH v3 07/11] mm: vmalloc: Offload free_vmap_area_lock lock Uladzislau Rezki (Sony)
2024-01-03 11:08   ` Hillf Danton
2024-01-03 15:47     ` Uladzislau Rezki
2024-01-11  9:02   ` Dave Chinner
2024-01-11 15:54     ` Uladzislau Rezki
2024-01-11 20:37       ` Dave Chinner
2024-01-12 12:18         ` Uladzislau Rezki
2024-01-16 22:12           ` Dave Chinner
2024-01-18 18:15             ` Uladzislau Rezki
2024-02-08  0:25   ` Baoquan He
2024-02-08 13:57     ` Uladzislau Rezki
2024-02-28  9:48   ` Baoquan He
2024-02-28 10:39     ` Uladzislau Rezki
2024-02-28 12:26       ` Baoquan He
2024-03-22 18:21   ` Guenter Roeck
2024-03-22 19:03     ` Uladzislau Rezki
2024-03-22 20:53       ` Guenter Roeck
2024-01-02 18:46 ` [PATCH v3 08/11] mm: vmalloc: Support multiple nodes in vread_iter Uladzislau Rezki (Sony)
2024-01-02 18:46 ` [PATCH v3 09/11] mm: vmalloc: Support multiple nodes in vmallocinfo Uladzislau Rezki (Sony)
2024-01-02 18:46 ` [PATCH v3 10/11] mm: vmalloc: Set nr_nodes based on CPUs in a system Uladzislau Rezki (Sony)
2024-01-11  9:25   ` Dave Chinner
2024-01-15 19:09     ` Uladzislau Rezki
2024-01-16 22:06       ` Dave Chinner
2024-01-18 18:23         ` Uladzislau Rezki
2024-01-18 21:28           ` Dave Chinner
2024-01-19 10:32             ` Uladzislau Rezki
2024-01-02 18:46 ` [PATCH v3 11/11] mm: vmalloc: Add a shrinker to drain vmap pools Uladzislau Rezki (Sony)
2024-02-22  8:35 ` [PATCH v3 00/11] Mitigate a vmap lock contention v3 Uladzislau Rezki
2024-02-22 23:15   ` Pedro Falcato
2024-02-23  9:34     ` Uladzislau Rezki
2024-02-23 10:26       ` Baoquan He
2024-02-23 11:06         ` Uladzislau Rezki
2024-02-23 15:57           ` Baoquan He
2024-02-23 18:55             ` Uladzislau Rezki
2024-02-28  9:27               ` Baoquan He
2024-02-29 10:38                 ` Uladzislau Rezki
