From mboxrd@z Thu Jan 1 00:00:00 1970
From: Kairui Song <ryncsn@gmail.com>
Date: Thu, 24 Oct 2024 11:51:48 +0800
Subject: Re: [PATCH 00/13] mm, swap: rework of swap allocator locks
To: "Huang, Ying"
Cc: linux-mm@kvack.org, Andrew Morton, Chris Li, Barry Song, Ryan Roberts,
 Hugh Dickins, Yosry Ahmed, Tim Chen, Nhat Pham, linux-kernel@vger.kernel.org
In-Reply-To: <875xpi42wg.fsf@yhuang6-desk2.ccr.corp.intel.com>
References: <20241022192451.38138-1-ryncsn@gmail.com>
 <87ed474kvx.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <875xpi42wg.fsf@yhuang6-desk2.ccr.corp.intel.com>
Content-Type: text/plain; charset="UTF-8"

On Thu, Oct 24, 2024 at 11:08 AM Huang, Ying wrote:
>
> Kairui Song writes:
>
> > On Wed, Oct 23, 2024 at 10:27 AM Huang, Ying wrote:
> >>
> >> Hi, Kairui,
> >
> > Hi Ying,
> >
> >>
> >> Kairui Song writes:
> >>
> >> > From: Kairui Song
> >> >
> >> > This series greatly improves swap allocator performance by reworking
> >> > the locking design and simplifying a lot of code paths.
> >> >
> >> > This is a follow-up to the previous swap cluster allocator series:
> >> > https://lore.kernel.org/linux-mm/20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org/
> >> >
> >> > And this series is based on a follow-up fix of the swap cluster
> >> > allocator:
> >> > https://lore.kernel.org/linux-mm/20241022175512.10398-1-ryncsn@gmail.com/
> >> >
> >> > This is part of the new swap allocator work item discussed in
> >> > Chris's "Swap Abstraction" discussion at LSF/MM 2024, and the
> >> > "mTHP and swap allocator" discussion at LPC 2024.
> >> >
> >> > The previous series introduced a fully cluster-based allocation
> >> > algorithm; this series completely gets rid of the old allocation path
> >> > and makes the allocator avoid grabbing si->lock unless needed. This
> >> > brings a huge performance gain and removes the slot cache from the
> >> > freeing path.
> >>
> >> Great!
> >>
> >> > Currently, swap locking is mainly composed of two locks, the cluster
> >> > lock (ci->lock) and the device lock (si->lock). The device lock is
> >> > widely used to protect many things, making it the main bottleneck
> >> > for SWAP.
> >>
> >> Device lock can be confused with the other device lock for struct
> >> device.  Better to call it the swap device lock?
> >
> > Good idea, I'll use the term swap device lock then.
> >
> >>
> >> > The cluster lock is much more fine-grained, so it is best to use
> >> > ci->lock instead of si->lock as much as possible.
> >> >
> >> > `perf lock` indicates this issue clearly. Doing a Linux kernel build
> >> > using tmpfs and ZRAM with limited memory (make -j64 with a 1G memcg
> >> > and 4k pages), the result of "perf lock contention -ab sleep 3" is:
> >> >
> >> > contended   total wait   max wait   avg wait   type       caller
> >> >
> >> >     34948      53.63 s    7.11 ms    1.53 ms   spinlock   free_swap_and_cache_nr+0x350
> >> >     16569      40.05 s    6.45 ms    2.42 ms   spinlock   get_swap_pages+0x231
> >> >     11191      28.41 s    7.03 ms    2.54 ms   spinlock   swapcache_free_entries+0x59
> >> >      4147      22.78 s  122.66 ms    5.49 ms   spinlock   page_vma_mapped_walk+0x6f3
> >> >      4595       7.17 s    6.79 ms    1.56 ms   spinlock   swapcache_free_entries+0x59
> >> >    406027       2.74 s    2.59 ms    6.74 us   spinlock   list_lru_add+0x39
> >> > ...snip...
> >> >
> >> > The top 5 callers are all users of si->lock; their total wait time
> >> > sums to several minutes within the 3-second window.
> >>
> >> Can you show the results of `perf record -g`, `perf report -g` too?  I'm
> >> interested in checking how the hot spots shift too.
> >
> > Sure. I think the `perf lock` result is already good enough and cleaner.
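(For reference, a minimal sketch of the kind of setup behind the `perf lock`
numbers quoted above. The zram size, cgroup path, mount point and source
location are illustrative, not the exact test configuration; it assumes
cgroup v2 and the zram module are available:

    modprobe zram
    echo 10G > /sys/block/zram0/disksize        # zram-backed swap device
    mkswap /dev/zram0 && swapon /dev/zram0

    mkdir /sys/fs/cgroup/swaptest
    echo 1G > /sys/fs/cgroup/swaptest/memory.max
    echo $$ > /sys/fs/cgroup/swaptest/cgroup.procs

    mkdir -p /mnt/build && mount -t tmpfs tmpfs /mnt/build
    # (unpack a kernel source tree under /mnt/build, then:)
    cd /mnt/build/linux && make defconfig
    make -j64 &                                 # the build being measured
    perf lock contention -ab sleep 3            # sample contention for 3s
    wait

The allocator is only really stressed once the build pushes the cgroup past
memory.max and pages start going out to swap.)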
> > My test environment is mostly VM based, so the spinlock slow path may
> > get offloaded to the host and can't be seen by perf record. I collected
> > the following data after disabling paravirt spinlock:
> >
> > The time consumption and stack trace of a page fault before:
> >
> > - 78.45%  0.17%  cc1  [kernel.kallsyms]  [k] asm_exc_page_fault
> >   - 78.28% asm_exc_page_fault
> >     - 78.18% exc_page_fault
> >       - 78.17% do_user_addr_fault
> >         - 78.09% handle_mm_fault
> >           - 78.06% __handle_mm_fault
> >             - 69.69% do_swap_page
> >               - 55.87% alloc_swap_folio
> >                 - 55.60% mem_cgroup_swapin_charge_folio
> >                   - 55.48% charge_memcg
> >                     - 55.45% try_charge_memcg
> >                       - 55.36% try_to_free_mem_cgroup_pages
> >                         - do_try_to_free_pages
> >                           - 55.35% shrink_node
> >                             - 55.27% shrink_lruvec
> >                               - 55.13% try_to_shrink_lruvec
> >                                 - 54.79% evict_folios
> >                                   - 54.35% shrink_folio_list
> >                                     - 30.01% add_to_swap
> >                                       - 29.77% folio_alloc_swap
> >                                         - 29.50% get_swap_pages
> >                                             25.03% queued_spin_lock_slowpath
> >                                           - 2.71% alloc_swap_scan_cluster
> >                                               1.80% queued_spin_lock_slowpath
> >                                             + 0.89% __try_to_reclaim_swap
> >                                           - 1.74% swap_reclaim_full_clusters
> >                                               1.74% queued_spin_lock_slowpath
> >                                     - 10.88% try_to_unmap_flush_dirty
> >                                       - 10.87% arch_tlbbatch_flush
> >                                         - 10.85% on_each_cpu_cond_mask
> >                                             smp_call_function_many_cond
> >                                     + 7.45% pageout
> >                                     + 2.71% try_to_unmap_flush
> >                                     + 1.90% try_to_unmap
> >                                     + 0.78% folio_referenced
> >               - 9.41% cluster_swap_free_nr
> >                 - 9.39% free_swap_slot
> >                   - 9.35% swapcache_free_entries
> >                       8.40% queued_spin_lock_slowpath
> >                       0.93% swap_entry_range_free
> >               - 3.61% swap_read_folio_bdev_sync
> >                 - 3.55% submit_bio_wait
> >                   - 3.51% submit_bio_noacct_nocheck
> >                     + 3.46% __submit_bio
> >             + 7.71% do_pte_missing
> >             + 0.61% wp_page_copy
> >
> > The queued_spin_lock_slowpath above is si->lock, and there are
> > multiple users of it, so the total overhead is higher than shown.
> >
> > After:
> >
> > - 75.05%  0.43%  cc1  [kernel.kallsyms]  [k] asm_exc_page_fault
> >   - 74.62% asm_exc_page_fault
> >     - 74.36% exc_page_fault
> >       - 74.34% do_user_addr_fault
> >         - 74.10% handle_mm_fault
> >           - 73.96% __handle_mm_fault
> >             - 67.55% do_swap_page
> >               - 45.92% alloc_swap_folio
> >                 - 45.03% mem_cgroup_swapin_charge_folio
> >                   - 44.58% charge_memcg
> >                     - 44.44% try_charge_memcg
> >                       - 44.12% try_to_free_mem_cgroup_pages
> >                         - do_try_to_free_pages
> >                           - 44.10% shrink_node
> >                             - 43.86% shrink_lruvec
> >                               - 41.92% try_to_shrink_lruvec
> >                                 - 40.67% evict_folios
> >                                   - 37.12% shrink_folio_list
> >                                     - 20.88% pageout
> >                                       + 20.02% swap_writepage
> >                                       + 0.72% shmem_writepage
> >                                     - 4.08% add_to_swap
> >                                       - 2.48% folio_alloc_swap
> >                                         - 2.12% __mem_cgroup_try_charge_swap
> >                                           - 1.47% swap_cgroup_record
> >                                             + 1.32% _raw_spin_lock_irqsave
> >                                       - 1.56% add_to_swap_cache
> >                                         - 1.04% xas_store
> >                                           + 1.01% workingset_update_node
> >                                     + 3.97% try_to_unmap_flush_dirty
> >                                     + 3.51% folio_referenced
> >                                     + 2.24% __remove_mapping
> >                                     + 1.16% try_to_unmap
> >                                     + 0.52% try_to_unmap_flush
> >                                       2.50% queued_spin_lock_slowpath
> >                                       0.79% scan_folios
> >                                 + 1.20% try_to_inc_max_seq
> >                               + 1.92% lru_add_drain
> >                 + 0.73% vma_alloc_folio_noprof
> >               - 9.81% swap_read_folio_bdev_sync
> >                 - 9.61% submit_bio_wait
> >                   + 9.49% submit_bio_noacct_nocheck
> >               - 8.06% cluster_swap_free_nr
> >                 - 8.02% swap_entry_range_free
> >                   + 3.92% __mem_cgroup_uncharge_swap
> >                   + 2.90% zram_slot_free_notify
> >                     0.58% clear_shadow_from_swap_cache
> >               - 1.32% __folio_batch_add_and_move
> >                 - 1.30% folio_batch_move_lru
> >                   + 1.10% folio_lruvec_lock_irqsave
>
> Thanks for the data.
>
> It seems that the cycles shift from spinning to memory compression.
> That is expected.
>
> > spin_lock usage is much lower.
> >
> > I prefer the perf lock output as it shows the exact time and users of locks.
>
> perf cycles data is more complete.  You can find which part becomes the
> new hot spot.
>
> >>
> >> > Following the new allocator design, many operations don't need to touch
> >> > si->lock at all. We only need to take si->lock when doing operations
> >> > across multiple clusters (e.g. changing the cluster list); other
> >> > operations only need to take ci->lock. So ideally the allocator should
> >> > always take ci->lock first and then, if needed, take si->lock. But due
> >> > to historical reasons, ci->lock is used inside si->lock by design,
> >> > causing lock inversion if we simply try to acquire si->lock after
> >> > acquiring ci->lock.
> >> >
> >> > This series audited all si->lock usage, simplified legacy code, and
> >> > eliminated usage of si->lock as much as possible by introducing new
> >> > designs based on the new cluster allocator.
> >> >
> >> > The old HDD allocation code is removed; the cluster allocator is
> >> > adapted with small changes for HDD usage, and testing looks OK.
> >>
> >> I think that it's a good idea to remove HDD allocation specific code.
> >> Can you check the performance of swapping to HDD?  However, I understand
> >> that many people have no HDD in hand.
> >
> > It's not hard to make the cluster allocator work well with HDD in
> > theory, see the commit "mm, swap: use a global swap cluster for
> > non-rotation device".
> > The testing is not very reliable though; I found HDD swap performance
> > is very unstable because of the IO pattern of HDD, so it's just a best
> > effort try.
>
> Just to check whether the code change causes something too bad for HDD.
> No measurable difference is good news.
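(In case it helps anyone double-checking the HDD path: a spare rotational
partition is enough, something like the following, where the device names
are just placeholders:

    cat /sys/block/sdb/queue/rotational   # 1 => treated as rotational
    mkswap /dev/sdb1
    swapon /dev/sdb1

That rotational flag is how the kernel tells rotational and non-rotational
swap devices apart at swapon time.)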
>
> >> > And this also removed the slot cache for the freeing path. The
> >> > performance is better without it, and this enables other cleanups
> >> > and optimizations, as discussed before:
> >> > https://lore.kernel.org/all/CAMgjq7ACohT_uerSz8E_994ZZCv709Zor+43hdmesW_59W1BWw@mail.gmail.com/
> >> >
> >> > After this series, lock contention on si->lock is nearly unobservable
> >> > with `perf lock` in the same test above:
> >> >
> >> > contended   total wait   max wait   avg wait   type       caller
> >> > ... snip ...
> >> >        91    204.62 us    4.51 us    2.25 us   spinlock   cluster_move+0x2e
> >> > ... snip ...
> >> >        47    125.62 us    4.47 us    2.67 us   spinlock   cluster_move+0x2e
> >> > ... snip ...
> >> >        23     63.15 us    3.95 us    2.74 us   spinlock   cluster_move+0x2e
> >> > ... snip ...
> >> >        17     41.26 us    4.58 us    2.43 us   spinlock   cluster_isolate_lock+0x1d
> >> > ... snip ...
> >> >
> >> > cluster_move and cluster_isolate_lock are basically the only users
> >> > of si->lock now; the performance gain is huge with reduced LOC.
> >> >
> >> > Tests
> >> > ===
> >> >
> >> > Build kernel with defconfig on tmpfs with ZRAM as swap:
> >> > ---
> >> >
> >> > Running a test matrix which is scaled up progressively for an
> >> > intuitive result. The tests are run on top of tmpfs, using a memory
> >> > cgroup for memory limitation, on a 48c96t system.
> >> >
> >> > 12 test runs for each case. It can be seen clearly that as the
> >> > concurrent job number goes higher, the performance gain is higher;
> >> > the performance is better even with low concurrency.
> >> >
> >> > make -j           | System Time (seconds)    | Total Time (seconds)
> >> > (NR / Mem / ZRAM) | (Before / After / Delta) | (Before / After / Delta)
> >> > With 4k pages only:
> >> >  6 / 192M / 3G    |  5258 /  5235 /  -0.3%   | 1420 / 1414 /  -0.3%
> >> > 12 / 256M / 4G    |  5518 /  5337 /  -3.3%   |  758 /  742 /  -2.1%
> >> > 24 / 384M / 5G    |  7091 /  5766 / -18.7%   |  476 /  422 / -11.3%
> >> > 48 / 768M / 7G    | 11139 /  5831 / -47.7%   |  330 /  221 / -33.0%
> >> > 96 / 1.5G / 10G   | 21303 / 11353 / -46.7%   |  283 /  180 / -36.4%
> >> > With 64k mTHP:
> >> > 24 / 512M / 5G    |  5104 /  4641 / -18.7%   |  376 /  358 /  -4.8%
> >> > 48 / 1G / 7G      |  8693 /  4662 / -18.7%   |  257 /  176 / -31.5%
> >> > 96 / 2G / 10G     | 17056 / 10263 / -39.8%   |  234 /  169 / -27.8%
> >>
> >> How much is the swap in/out throughput before/after the change?
> >
> > This may not be very visible in typical throughput measurements:
> > - For example, doing the same test with brd will only show a ~20%
> >   performance improvement, still a big gain though. I think the si->lock
> >   spinlock wasting CPU cycles may affect CPU-sensitive things like ZRAM
> >   even more.
>
> 20% is good data.  You don't need to guess.  perf cycles profiling can
> show the hot spot.
>
> > - And simple benchmarks which just do multiple sequential swap in/outs
> >   in multiple threads hardly stress the allocator.
> >
> > I haven't found a good benchmark to simulate random parallel IOs on
> > SWAP yet; I can write one later.
>
> I have used the anon-w-rand test case of vm-scalability to simulate random
> parallel swap out.
>
> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-anon-w-rand
>
> > Closer-to-real-world benchmarks like a kernel build test, or
> > mysql/sysbench, all showed great improvement.
>
> Yes.  Real workload is good.  We can use micro-benchmarks to find out
> some performance limits, for example, max possible throughput.
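(A crude way to eyeball raw swap-out throughput during any of these runs,
in case it is useful, is to watch pswpout in /proc/vmstat over a fixed
window; this assumes 4k pages:

    a=$(awk '/pswpout/ {print $2}' /proc/vmstat); sleep 10
    b=$(awk '/pswpout/ {print $2}' /proc/vmstat)
    echo "$(( (b - a) * 4 / 1024 / 10 )) MiB/s swapped out"

It is only a rough probe, of course, not a substitute for a proper
micro-benchmark.)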
>
> >>
> >> When I worked on swap in/out performance before, the hot spots shifted
> >> from swap related code to the LRU lock and zone lock.  Things may change
> >> a lot now.
> >>
> >> If zram is used as the swap device, the hot spot may become
> >> compression/decompression after solving the swap lock contention.  To
> >> stress the swap subsystem further, we may use a ram disk as swap.
> >> Previously, we have used a simulated pmem device (backed by DRAM).  That
> >> can be set up as in,
> >>
> >> https://pmem.io/blog/2016/02/how-to-emulate-persistent-memory/
> >>
> >> After creating the raw block device /dev/pmem0, we can do
> >>
> >> $ mkswap /dev/pmem0
> >> $ swapon /dev/pmem0
> >>
> >> Can you use something similar if necessary?
> >
> > I used to test with brd, as described above,
>
> brd will allocate memory while running, pmem can avoid that.  perf
> profile is your friend to root cause possible issues.
>
> > I think using ZRAM with
> > tests simulating real workloads is more useful.
>
> Yes.  And, as I said before, micro-benchmarks have their own value.

Hi Ying,

Thank you very much for the suggestions. I didn't mean I'm against micro
benchmarks in any way; it's just that a lot of effort was spent on other
tests, so I skipped that part for V1.

As you mentioned vm-scalability, I think it is definitely a good idea to
include that test with pmem simulation.

There are still some bottlenecks in SWAP besides compression and page
fault / TLB, mostly the cgroup lock and list_lru locks. I have some ideas
to optimize these too; that could be a next step.

> > And I did include a Sequential SWAP test, the result is looking OK (no
> > regression, minor to no improvement).
>
> Good.  At least we have no regression here.
>
> --
> Best Regards,
> Huang, Ying
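(For completeness, the brd based setup mentioned above is just the
in-memory block device module plus the usual swap commands; the size below
is illustrative, rd_size is in KiB:

    modprobe brd rd_nr=1 rd_size=8388608   # one 8G ram disk at /dev/ram0
    mkswap /dev/ram0
    swapon /dev/ram0

Unlike a simulated pmem device, the pages backing /dev/ram0 are allocated
lazily as they are written, which is the caveat pointed out above.)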