From mboxrd@z Thu Jan 1 00:00:00 1970
From: Kairui Song <ryncsn@gmail.com>
Date: Thu, 24 Oct 2024 11:51:48 +0800
Subject: Re: [PATCH 00/13] mm, swap: rework of swap allocator locks
To: "Huang, Ying"
Cc: linux-mm@kvack.org, Andrew Morton, Chris Li, Barry Song, Ryan Roberts,
 Hugh Dickins, Yosry Ahmed, Tim Chen, Nhat Pham, linux-kernel@vger.kernel.org
In-Reply-To: <875xpi42wg.fsf@yhuang6-desk2.ccr.corp.intel.com>
References: <20241022192451.38138-1-ryncsn@gmail.com>
 <87ed474kvx.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <875xpi42wg.fsf@yhuang6-desk2.ccr.corp.intel.com>
Content-Type: text/plain; charset="UTF-8"

On Thu, Oct 24, 2024 at 11:08 AM Huang, Ying wrote:
>
> Kairui Song writes:
>
> > On Wed, Oct 23, 2024 at 10:27 AM Huang, Ying wrote:
> >>
> >> Hi, Kairui,
> >
> > Hi Ying,
> >
> >>
> >> Kairui Song writes:
> >>
> >> > From: Kairui Song
> >> >
> >> > This series greatly improves swap allocator performance by reworking
> >> > the locking design and simplifying a lot of code paths.
> >> >
> >> > This is a follow-up to the previous swap cluster allocator series:
> >> > https://lore.kernel.org/linux-mm/20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org/
> >> >
> >> > And this series is based on a follow-up fix of the swap cluster
> >> > allocator:
> >> > https://lore.kernel.org/linux-mm/20241022175512.10398-1-ryncsn@gmail.com/
> >> >
> >> > This is part of the new swap allocator work item discussed in
> >> > Chris's "Swap Abstraction" discussion at LSF/MM 2024, and the
> >> > "mTHP and swap allocator" discussion at LPC 2024.
> >> >
> >> > The previous series introduced a fully cluster-based allocation
> >> > algorithm; this series completely gets rid of the old allocation path
> >> > and makes the allocator avoid grabbing si->lock unless needed. This
> >> > brings a huge performance gain and removes the slot cache from the
> >> > freeing path.
> >>
> >> Great!
> >>
> >> > Currently, swap locking is mainly composed of two locks, the cluster
> >> > lock (ci->lock) and the device lock (si->lock). The device lock is
> >> > widely used to protect many things, making it the main bottleneck
> >> > for SWAP.
> >>
> >> Device lock can be confused with the other device lock for struct
> >> device.  Better to call it the swap device lock?
> >
> > Good idea, I'll use the term swap device lock then.
> >
> >>
> >> > The cluster lock is much more fine-grained, so it is best to use
> >> > ci->lock instead of si->lock as much as possible.
> >> >
> >> > `perf lock` indicates this issue clearly. Doing a Linux kernel build
> >> > using tmpfs and ZRAM with limited memory (make -j64 with a 1G memcg
> >> > and 4k pages), the result of "perf lock contention -ab sleep 3" is:
> >> >
> >> > contended   total wait   max wait   avg wait   type       caller
> >> >
> >> >     34948      53.63 s    7.11 ms    1.53 ms   spinlock   free_swap_and_cache_nr+0x350
> >> >     16569      40.05 s    6.45 ms    2.42 ms   spinlock   get_swap_pages+0x231
> >> >     11191      28.41 s    7.03 ms    2.54 ms   spinlock   swapcache_free_entries+0x59
> >> >      4147      22.78 s  122.66 ms    5.49 ms   spinlock   page_vma_mapped_walk+0x6f3
> >> >      4595       7.17 s    6.79 ms    1.56 ms   spinlock   swapcache_free_entries+0x59
> >> >    406027       2.74 s    2.59 ms    6.74 us   spinlock   list_lru_add+0x39
> >> > ...snip...
> >> >
> >> > The top 5 callers are all users of si->lock; their total wait time
> >> > sums to several minutes within the 3-second window.
> >>
> >> Can you show the results of `perf record -g`, `perf report -g` too?  I'm
> >> interested in checking how the hot spots shift too.
> >
> > Sure. I think the `perf lock` result is already good enough and cleaner.
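(For reference, a minimal sketch of the kind of setup behind the `perf lock`
numbers quoted above. The zram size, cgroup path, mount point and source
location are illustrative, not the exact test configuration; it assumes
cgroup v2 and the zram module are available:

    modprobe zram
    echo 10G > /sys/block/zram0/disksize        # zram-backed swap device
    mkswap /dev/zram0 && swapon /dev/zram0

    mkdir /sys/fs/cgroup/swaptest
    echo 1G > /sys/fs/cgroup/swaptest/memory.max
    echo $$ > /sys/fs/cgroup/swaptest/cgroup.procs

    mkdir -p /mnt/build && mount -t tmpfs tmpfs /mnt/build
    # (unpack a kernel source tree under /mnt/build, then:)
    cd /mnt/build/linux && make defconfig
    make -j64 &                                 # the build being measured
    perf lock contention -ab sleep 3            # sample contention for 3s
    wait

The allocator is only really stressed once the build pushes the cgroup past
memory.max and pages start going out to swap.)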
> > My test environment is mostly VM based, so the spinlock slow path may
> > get offloaded to the host and can't be seen by perf record. I collected
> > the following data after disabling paravirt spinlock:
> >
> > The time consumption and stack trace of a page fault before:
> >
> > - 78.45%  0.17%  cc1  [kernel.kallsyms]  [k] asm_exc_page_fault
> >   - 78.28% asm_exc_page_fault
> >     - 78.18% exc_page_fault
> >       - 78.17% do_user_addr_fault
> >         - 78.09% handle_mm_fault
> >           - 78.06% __handle_mm_fault
> >             - 69.69% do_swap_page
> >               - 55.87% alloc_swap_folio
> >                 - 55.60% mem_cgroup_swapin_charge_folio
> >                   - 55.48% charge_memcg
> >                     - 55.45% try_charge_memcg
> >                       - 55.36% try_to_free_mem_cgroup_pages
> >                         - do_try_to_free_pages
> >                           - 55.35% shrink_node
> >                             - 55.27% shrink_lruvec
> >                               - 55.13% try_to_shrink_lruvec
> >                                 - 54.79% evict_folios
> >                                   - 54.35% shrink_folio_list
> >                                     - 30.01% add_to_swap
> >                                       - 29.77% folio_alloc_swap
> >                                         - 29.50% get_swap_pages
> >                                             25.03% queued_spin_lock_slowpath
> >                                           - 2.71% alloc_swap_scan_cluster
> >                                               1.80% queued_spin_lock_slowpath
> >                                             + 0.89% __try_to_reclaim_swap
> >                                           - 1.74% swap_reclaim_full_clusters
> >                                               1.74% queued_spin_lock_slowpath
> >                                     - 10.88% try_to_unmap_flush_dirty
> >                                       - 10.87% arch_tlbbatch_flush
> >                                         - 10.85% on_each_cpu_cond_mask
> >                                             smp_call_function_many_cond
> >                                     + 7.45% pageout
> >                                     + 2.71% try_to_unmap_flush
> >                                     + 1.90% try_to_unmap
> >                                     + 0.78% folio_referenced
> >               - 9.41% cluster_swap_free_nr
> >                 - 9.39% free_swap_slot
> >                   - 9.35% swapcache_free_entries
> >                       8.40% queued_spin_lock_slowpath
> >                       0.93% swap_entry_range_free
> >               - 3.61% swap_read_folio_bdev_sync
> >                 - 3.55% submit_bio_wait
> >                   - 3.51% submit_bio_noacct_nocheck
> >                     + 3.46% __submit_bio
> >             + 7.71% do_pte_missing
> >             + 0.61% wp_page_copy
> >
> > The queued_spin_lock_slowpath above is si->lock, and there are
> > multiple users of it, so the total overhead is higher than shown.
> >
> > After:
> >
> > - 75.05%  0.43%  cc1  [kernel.kallsyms]  [k] asm_exc_page_fault
> >   - 74.62% asm_exc_page_fault
> >     - 74.36% exc_page_fault
> >       - 74.34% do_user_addr_fault
> >         - 74.10% handle_mm_fault
> >           - 73.96% __handle_mm_fault
> >             - 67.55% do_swap_page
> >               - 45.92% alloc_swap_folio
> >                 - 45.03% mem_cgroup_swapin_charge_folio
> >                   - 44.58% charge_memcg
> >                     - 44.44% try_charge_memcg
> >                       - 44.12% try_to_free_mem_cgroup_pages
> >                         - do_try_to_free_pages
> >                           - 44.10% shrink_node
> >                             - 43.86% shrink_lruvec
> >                               - 41.92% try_to_shrink_lruvec
> >                                 - 40.67% evict_folios
> >                                   - 37.12% shrink_folio_list
> >                                     - 20.88% pageout
> >                                       + 20.02% swap_writepage
> >                                       + 0.72% shmem_writepage
> >                                     - 4.08% add_to_swap
> >                                       - 2.48% folio_alloc_swap
> >                                         - 2.12% __mem_cgroup_try_charge_swap
> >                                           - 1.47% swap_cgroup_record
> >                                             + 1.32% _raw_spin_lock_irqsave
> >                                       - 1.56% add_to_swap_cache
> >                                         - 1.04% xas_store
> >                                           + 1.01% workingset_update_node
> >                                     + 3.97% try_to_unmap_flush_dirty
> >                                     + 3.51% folio_referenced
> >                                     + 2.24% __remove_mapping
> >                                     + 1.16% try_to_unmap
> >                                     + 0.52% try_to_unmap_flush
> >                                       2.50% queued_spin_lock_slowpath
> >                                       0.79% scan_folios
> >                                 + 1.20% try_to_inc_max_seq
> >                               + 1.92% lru_add_drain
> >                 + 0.73% vma_alloc_folio_noprof
> >               - 9.81% swap_read_folio_bdev_sync
> >                 - 9.61% submit_bio_wait
> >                   + 9.49% submit_bio_noacct_nocheck
> >               - 8.06% cluster_swap_free_nr
> >                 - 8.02% swap_entry_range_free
> >                   + 3.92% __mem_cgroup_uncharge_swap
> >                   + 2.90% zram_slot_free_notify
> >                     0.58% clear_shadow_from_swap_cache
> >               - 1.32% __folio_batch_add_and_move
> >                 - 1.30% folio_batch_move_lru
> >                   + 1.10% folio_lruvec_lock_irqsave
>
> Thanks for the data.
>
> It seems that the cycles shift from spinning to memory compression.
> That is expected.
>
> > spin_lock usage is much lower.
> >
> > I prefer the perf lock output as it shows the exact time and users of locks.
>
> perf cycles data is more complete.  You can find which part becomes the
> new hot spot.
>
> >>
> >> > Following the new allocator design, many operations don't need to touch
> >> > si->lock at all. We only need to take si->lock when doing operations
> >> > across multiple clusters (e.g. changing the cluster list); other
> >> > operations only need to take ci->lock. So ideally the allocator should
> >> > always take ci->lock first and then, if needed, take si->lock. But due
> >> > to historical reasons, ci->lock is used inside si->lock by design,
> >> > causing lock inversion if we simply try to acquire si->lock after
> >> > acquiring ci->lock.
> >> >
> >> > This series audited all si->lock usage, simplified legacy code, and
> >> > eliminated usage of si->lock as much as possible by introducing new
> >> > designs based on the new cluster allocator.
> >> >
> >> > The old HDD allocation code is removed; the cluster allocator is
> >> > adapted with small changes for HDD usage, and testing looks OK.
> >>
> >> I think that it's a good idea to remove HDD allocation specific code.
> >> Can you check the performance of swapping to HDD?  However, I understand
> >> that many people have no HDD in hand.
> >
> > It's not hard to make the cluster allocator work well with HDD in
> > theory, see the commit "mm, swap: use a global swap cluster for
> > non-rotation device".
> > The testing is not very reliable though; I found HDD swap performance
> > is very unstable because of the IO pattern of HDD, so it's just a best
> > effort try.
>
> Just to check whether the code change causes something too bad for HDD.
> No measurable difference is good news.
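(In case it helps anyone double-checking the HDD path: a spare rotational
partition is enough, something like the following, where the device names
are just placeholders:

    cat /sys/block/sdb/queue/rotational   # 1 => treated as rotational
    mkswap /dev/sdb1
    swapon /dev/sdb1

That rotational flag is how the kernel tells rotational and non-rotational
swap devices apart at swapon time.)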
>
> >> > And this also removed the slot cache for the freeing path. The
> >> > performance is better without it, and this enables other cleanups
> >> > and optimizations, as discussed before:
> >> > https://lore.kernel.org/all/CAMgjq7ACohT_uerSz8E_994ZZCv709Zor+43hdmesW_59W1BWw@mail.gmail.com/
> >> >
> >> > After this series, lock contention on si->lock is nearly unobservable
> >> > with `perf lock` in the same test above:
> >> >
> >> > contended   total wait   max wait   avg wait   type       caller
> >> > ... snip ...
> >> >        91    204.62 us    4.51 us    2.25 us   spinlock   cluster_move+0x2e
> >> > ... snip ...
> >> >        47    125.62 us    4.47 us    2.67 us   spinlock   cluster_move+0x2e
> >> > ... snip ...
> >> >        23     63.15 us    3.95 us    2.74 us   spinlock   cluster_move+0x2e
> >> > ... snip ...
> >> >        17     41.26 us    4.58 us    2.43 us   spinlock   cluster_isolate_lock+0x1d
> >> > ... snip ...
> >> >
> >> > cluster_move and cluster_isolate_lock are basically the only users
> >> > of si->lock now; the performance gain is huge with reduced LOC.
> >> >
> >> > Tests
> >> > ===
> >> >
> >> > Build kernel with defconfig on tmpfs with ZRAM as swap:
> >> > ---
> >> >
> >> > Running a test matrix which is scaled up progressively for an
> >> > intuitive result. The tests are run on top of tmpfs, using a memory
> >> > cgroup for memory limitation, on a 48c96t system.
> >> >
> >> > 12 test runs for each case. It can be seen clearly that as the
> >> > concurrent job number goes higher, the performance gain is higher;
> >> > the performance is better even with low concurrency.
> >> >
> >> > make -j           | System Time (seconds)    | Total Time (seconds)
> >> > (NR / Mem / ZRAM) | (Before / After / Delta) | (Before / After / Delta)
> >> > With 4k pages only:
> >> >  6 / 192M / 3G    |  5258 /  5235 /  -0.3%   | 1420 / 1414 /  -0.3%
> >> > 12 / 256M / 4G    |  5518 /  5337 /  -3.3%   |  758 /  742 /  -2.1%
> >> > 24 / 384M / 5G    |  7091 /  5766 / -18.7%   |  476 /  422 / -11.3%
> >> > 48 / 768M / 7G    | 11139 /  5831 / -47.7%   |  330 /  221 / -33.0%
> >> > 96 / 1.5G / 10G   | 21303 / 11353 / -46.7%   |  283 /  180 / -36.4%
> >> > With 64k mTHP:
> >> > 24 / 512M / 5G    |  5104 /  4641 / -18.7%   |  376 /  358 /  -4.8%
> >> > 48 / 1G / 7G      |  8693 /  4662 / -18.7%   |  257 /  176 / -31.5%
> >> > 96 / 2G / 10G     | 17056 / 10263 / -39.8%   |  234 /  169 / -27.8%
> >>
> >> How much is the swap in/out throughput before/after the change?
> >
> > This may not be very visible in typical throughput measurements:
> > - For example, doing the same test with brd will only show a ~20%
> >   performance improvement, still a big gain though. I think the si->lock
> >   spinlock wasting CPU cycles may affect CPU-sensitive things like ZRAM
> >   even more.
>
> 20% is good data.  You don't need to guess.  perf cycles profiling can
> show the hot spot.
>
> > - And simple benchmarks which just do multiple sequential swap in/outs
> >   in multiple threads hardly stress the allocator.
> >
> > I haven't found a good benchmark to simulate random parallel IOs on
> > SWAP yet; I can write one later.
>
> I have used the anon-w-rand test case of vm-scalability to simulate random
> parallel swap out.
>
> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-anon-w-rand
>
> > Closer-to-real-world benchmarks like a kernel build test, or
> > mysql/sysbench, all showed great improvement.
>
> Yes.  Real workload is good.  We can use micro-benchmarks to find out
> some performance limits, for example, max possible throughput.
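(A crude way to eyeball raw swap-out throughput during any of these runs,
in case it is useful, is to watch pswpout in /proc/vmstat over a fixed
window; this assumes 4k pages:

    a=$(awk '/pswpout/ {print $2}' /proc/vmstat); sleep 10
    b=$(awk '/pswpout/ {print $2}' /proc/vmstat)
    echo "$(( (b - a) * 4 / 1024 / 10 )) MiB/s swapped out"

It is only a rough probe, of course, not a substitute for a proper
micro-benchmark.)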
>
> >>
> >> When I worked on swap in/out performance before, the hot spots shifted
> >> from swap related code to the LRU lock and zone lock.  Things may change
> >> a lot now.
> >>
> >> If zram is used as the swap device, the hot spot may become
> >> compression/decompression after solving the swap lock contention.  To
> >> stress the swap subsystem further, we may use a ram disk as swap.
> >> Previously, we have used a simulated pmem device (backed by DRAM).  That
> >> can be set up as in,
> >>
> >> https://pmem.io/blog/2016/02/how-to-emulate-persistent-memory/
> >>
> >> After creating the raw block device /dev/pmem0, we can do
> >>
> >> $ mkswap /dev/pmem0
> >> $ swapon /dev/pmem0
> >>
> >> Can you use something similar if necessary?
> >
> > I used to test with brd, as described above,
>
> brd will allocate memory while running, pmem can avoid that.  perf
> profile is your friend to root cause possible issues.
>
> > I think using ZRAM with
> > tests simulating real workloads is more useful.
>
> Yes.  And, as I said before, micro-benchmarks have their own value.

Hi Ying,

Thank you very much for the suggestions. I didn't mean I'm against micro
benchmarks in any way; it's just that a lot of effort was spent on other
tests, so I skipped that part for V1.

As you mentioned vm-scalability, I think it is definitely a good idea to
include that test with pmem simulation.

There are still some bottlenecks in SWAP besides compression and page
fault / TLB, mostly the cgroup lock and list_lru locks. I have some ideas
to optimize these too; that could be a next step.

> > And I did include a Sequential SWAP test, the result is looking OK (no
> > regression, minor to no improvement).
>
> Good.  At least we have no regression here.
>
> --
> Best Regards,
> Huang, Ying
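(For completeness, the brd based setup mentioned above is just the
in-memory block device module plus the usual swap commands; the size below
is illustrative, rd_size is in KiB:

    modprobe brd rd_nr=1 rd_size=8388608   # one 8G ram disk at /dev/ram0
    mkswap /dev/ram0
    swapon /dev/ram0

Unlike a simulated pmem device, the pages backing /dev/ram0 are allocated
lazily as they are written, which is the caveat pointed out above.)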