From: "Huang, Ying" <ying.huang@intel.com>
To: Kairui Song <ryncsn@gmail.com>
Cc: linux-mm@kvack.org, Andrew Morton, Chris Li, Barry Song, Ryan Roberts,
 Hugh Dickins, Yosry Ahmed, Tim Chen, Nhat Pham, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 00/13] mm, swap: rework of swap allocator locks
In-Reply-To: (Kairui Song's message of "Thu, 24 Oct 2024 02:01:43 +0800")
References: <20241022192451.38138-1-ryncsn@gmail.com> <87ed474kvx.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Thu, 24 Oct 2024 11:04:47 +0800
Message-ID: <875xpi42wg.fsf@yhuang6-desk2.ccr.corp.intel.com>

Kairui Song <ryncsn@gmail.com> writes:

> On Wed, Oct 23, 2024 at 10:27 AM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Hi, Kairui,
>
> Hi Ying,
>
>>
>> Kairui Song <ryncsn@gmail.com> writes:
>>
>> > From: Kairui Song
>> >
>> > This series greatly improves swap allocator performance by reworking
>> > the locking design and simplifying a lot of code paths.
>> >
>> > This is a follow-up of the previous swap cluster allocator series:
>> > https://lore.kernel.org/linux-mm/20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org/
>> >
>> > And this series is based on a follow-up fix of the swap cluster
>> > allocator:
>> > https://lore.kernel.org/linux-mm/20241022175512.10398-1-ryncsn@gmail.com/
>> >
>> > This is part of the new swap allocator work item discussed in
>> > Chris's "Swap Abstraction" discussion at LSF/MM 2024, and the
>> > "mTHP and swap allocator" discussion at LPC 2024.
>> >
>> > The previous series introduced a fully cluster-based allocation
>> > algorithm; this series completely removes the old allocation path and
>> > makes the allocator avoid grabbing si->lock unless needed. This brings
>> > a huge performance gain and removes the slot cache on the freeing path.
>>
>> Great!
>>
>> > Currently, swap locking mainly consists of two locks, the cluster lock
>> > (ci->lock) and the device lock (si->lock). The device lock is widely
>> > used to protect many things, making it the main bottleneck for swap.
>>
>> "Device lock" can be confused with the device lock for struct device.
>> Better to call it the swap device lock?
>
> Good idea, I'll use the term swap device lock then.
>
>>
>> > The cluster lock is much more fine-grained, so it is best to use
>> > ci->lock instead of si->lock as much as possible.
>> >
>> > `perf lock` indicates this issue clearly. Doing a Linux kernel build
>> > using tmpfs and ZRAM with limited memory (make -j64 with a 1G memcg
>> > and 4k pages), the result of "perf lock contention -ab sleep 3" is:
>> >
>> >  contended  total wait   max wait   avg wait   type      caller
>> >
>> >      34948    53.63 s    7.11 ms    1.53 ms    spinlock  free_swap_and_cache_nr+0x350
>> >      16569    40.05 s    6.45 ms    2.42 ms    spinlock  get_swap_pages+0x231
>> >      11191    28.41 s    7.03 ms    2.54 ms    spinlock  swapcache_free_entries+0x59
>> >       4147    22.78 s  122.66 ms    5.49 ms    spinlock  page_vma_mapped_walk+0x6f3
>> >       4595     7.17 s    6.79 ms    1.56 ms    spinlock  swapcache_free_entries+0x59
>> >     406027     2.74 s    2.59 ms    6.74 us    spinlock  list_lru_add+0x39
>> >  ...snip...
>> >
>> > The top 5 callers are all users of si->lock; their total wait time
>> > sums to several minutes within the 3-second time window.
>>
>> Can you show the results of `perf record -g`, `perf report -g` too? I
>> am interested in checking the hot spot shifts as well.
>
> Sure. I think the `perf lock` result is already good enough and cleaner.
> My test environment is mostly VM based, so the spinlock slow path may
> get offloaded to the host and can't be seen by perf record. I collected
> the following data after disabling paravirt spinlocks:
>
> The time consumption and stack trace of a page fault before:
>
> - 78.45% 0.17% cc1 [kernel.kallsyms] [k] asm_exc_page_fault
>   - 78.28% asm_exc_page_fault
>     - 78.18% exc_page_fault
>       - 78.17% do_user_addr_fault
>         - 78.09% handle_mm_fault
>           - 78.06% __handle_mm_fault
>             - 69.69% do_swap_page
>               - 55.87% alloc_swap_folio
>                 - 55.60% mem_cgroup_swapin_charge_folio
>                   - 55.48% charge_memcg
>                     - 55.45% try_charge_memcg
>                       - 55.36% try_to_free_mem_cgroup_pages
>                         - do_try_to_free_pages
>                           - 55.35% shrink_node
>                             - 55.27% shrink_lruvec
>                               - 55.13% try_to_shrink_lruvec
>                                 - 54.79% evict_folios
>                                   - 54.35% shrink_folio_list
>                                     - 30.01% add_to_swap
>                                       - 29.77% folio_alloc_swap
>                                         - 29.50% get_swap_pages
>                                             25.03% queued_spin_lock_slowpath
>                                           - 2.71% alloc_swap_scan_cluster
>                                               1.80% queued_spin_lock_slowpath
>                                             + 0.89% __try_to_reclaim_swap
>                                           - 1.74% swap_reclaim_full_clusters
>                                               1.74% queued_spin_lock_slowpath
>                                     - 10.88% try_to_unmap_flush_dirty
>                                       - 10.87% arch_tlbbatch_flush
>                                         - 10.85% on_each_cpu_cond_mask
>                                             smp_call_function_many_cond
>                                     + 7.45% pageout
>                                     + 2.71% try_to_unmap_flush
>                                     + 1.90% try_to_unmap
>                                     + 0.78% folio_referenced
>               - 9.41% cluster_swap_free_nr
>                 - 9.39% free_swap_slot
>                   - 9.35% swapcache_free_entries
>                       8.40% queued_spin_lock_slowpath
>                       0.93% swap_entry_range_free
>               - 3.61% swap_read_folio_bdev_sync
>                 - 3.55% submit_bio_wait
>                   - 3.51% submit_bio_noacct_nocheck
>                     + 3.46% __submit_bio
>             + 7.71% do_pte_missing
>             + 0.61% wp_page_copy
>
> The queued_spin_lock_slowpath above is on si->lock, and there are
> multiple users of it, so the total overhead is higher than shown.
>
> After:
>
> - 75.05% 0.43% cc1 [kernel.kallsyms] [k] asm_exc_page_fault
>   - 74.62% asm_exc_page_fault
>     - 74.36% exc_page_fault
>       - 74.34% do_user_addr_fault
>         - 74.10% handle_mm_fault
>           - 73.96% __handle_mm_fault
>             - 67.55% do_swap_page
>               - 45.92% alloc_swap_folio
>                 - 45.03% mem_cgroup_swapin_charge_folio
>                   - 44.58% charge_memcg
>                     - 44.44% try_charge_memcg
>                       - 44.12% try_to_free_mem_cgroup_pages
>                         - do_try_to_free_pages
>                           - 44.10% shrink_node
>                             - 43.86% shrink_lruvec
>                               - 41.92% try_to_shrink_lruvec
>                                 - 40.67% evict_folios
>                                   - 37.12% shrink_folio_list
>                                     - 20.88% pageout
>                                       + 20.02% swap_writepage
>                                       + 0.72% shmem_writepage
>                                     - 4.08% add_to_swap
>                                       - 2.48% folio_alloc_swap
>                                         - 2.12% __mem_cgroup_try_charge_swap
>                                           - 1.47% swap_cgroup_record
>                                             + 1.32% _raw_spin_lock_irqsave
>                                       - 1.56% add_to_swap_cache
>                                         - 1.04% xas_store
>                                           + 1.01% workingset_update_node
>                                     + 3.97% try_to_unmap_flush_dirty
>                                     + 3.51% folio_referenced
>                                     + 2.24% __remove_mapping
>                                     + 1.16% try_to_unmap
>                                     + 0.52% try_to_unmap_flush
>                                     2.50% queued_spin_lock_slowpath
>                                     0.79% scan_folios
>                                 + 1.20% try_to_inc_max_seq
>                               + 1.92% lru_add_drain
>                 + 0.73% vma_alloc_folio_noprof
>               - 9.81% swap_read_folio_bdev_sync
>                 - 9.61% submit_bio_wait
>                   + 9.49% submit_bio_noacct_nocheck
>               - 8.06% cluster_swap_free_nr
>                 - 8.02% swap_entry_range_free
>                   + 3.92% __mem_cgroup_uncharge_swap
>                   + 2.90% zram_slot_free_notify
>                     0.58% clear_shadow_from_swap_cache
>               - 1.32% __folio_batch_add_and_move
>                 - 1.30% folio_batch_move_lru
>                   + 1.10% folio_lruvec_lock_irqsave

Thanks for the data. It seems that the cycles shift from spinning to
memory compression. That is expected.

> spin_lock usage is much lower.
>
> I prefer the perf lock output as it shows the exact time and users of
> locks.

perf cycles data is more complete. You can find which part becomes the
new hot spot.

>>
>> > Following the new allocator design, many operations don't need to
>> > touch si->lock at all. We only need to take si->lock when doing
>> > operations across multiple clusters (e.g. changing the cluster list);
>> > other operations only need to take ci->lock. So ideally the allocator
>> > should always take ci->lock first and then, if needed, take si->lock.
>> > But due to historical reasons, ci->lock is used inside si->lock by
>> > design, causing lock inversion if we simply try to acquire si->lock
>> > after acquiring ci->lock.
>> >
>> > This series audits all si->lock usage, simplifies legacy code, and
>> > eliminates usage of si->lock as much as possible by introducing new
>> > designs based on the new cluster allocator.
>> >
>> > The old HDD allocation code is removed, the cluster allocator is
>> > adapted with small changes for HDD usage, and testing looks OK.
>>
>> I think that it's a good idea to remove the HDD-specific allocation
>> code. Can you check the performance of swapping to HDD? However, I
>> understand that many people have no HDD at hand.
>
> It's not hard to make the cluster allocator work well with HDDs in
> theory, see the commit "mm, swap: use a global swap cluster for
> non-rotation device".
> The testing is not very reliable though; I found HDD swap performance
> is very unstable because of the IO pattern of HDDs, so it's just a
> best-effort try.

Just to check whether the code change causes something too bad for HDD.
No measurable difference is good news.

>> > And this also removes the slot cache on the freeing path. The
>> > performance is better without it, and this enables other cleanups and
>> > optimizations as discussed before:
>> > https://lore.kernel.org/all/CAMgjq7ACohT_uerSz8E_994ZZCv709Zor+43hdmesW_59W1BWw@mail.gmail.com/
>> >
>> > After this series, lock contention on si->lock is nearly unobservable
>> > with `perf lock` in the same test as above:
>> >
>> >  contended  total wait   max wait   avg wait   type      caller
>> >  ... snip ...
>> >         91  204.62 us    4.51 us    2.25 us    spinlock  cluster_move+0x2e
>> >  ... snip ...
>> >         47  125.62 us    4.47 us    2.67 us    spinlock  cluster_move+0x2e
>> >  ... snip ...
>> >         23   63.15 us    3.95 us    2.74 us    spinlock  cluster_move+0x2e
>> >  ... snip ...
>> >         17   41.26 us    4.58 us    2.43 us    spinlock  cluster_isolate_lock+0x1d
>> >  ... snip ...
>> >
>> > cluster_move and cluster_isolate_lock are basically the only users of
>> > si->lock now; the performance gain is huge, with reduced LOC.
>> >
>> > Tests
>> > ===
>> >
>> > Build kernel with defconfig on tmpfs with ZRAM as swap:
>> > ---
>> >
>> > Running a test matrix which is scaled up progressively for an
>> > intuitive result. The tests are run on top of tmpfs, using a memory
>> > cgroup for memory limitation, on a 48c96t system.
>> >
>> > 12 test runs for each case. It can be seen clearly that the
>> > performance gain grows as the concurrent job count goes higher, and
>> > performance is better even with low concurrency.
>> >
>> > make -j           | System Time (seconds)    | Total Time (seconds)
>> > (NR / Mem / ZRAM) | (Before / After / Delta) | (Before / After / Delta)
>> > With 4k pages only:
>> > 6 / 192M / 3G     | 5258 / 5235 / -0.3%      | 1420 / 1414 / -0.3%
>> > 12 / 256M / 4G    | 5518 / 5337 / -3.3%      | 758 / 742 / -2.1%
>> > 24 / 384M / 5G    | 7091 / 5766 / -18.7%     | 476 / 422 / -11.3%
>> > 48 / 768M / 7G    | 11139 / 5831 / -47.7%    | 330 / 221 / -33.0%
>> > 96 / 1.5G / 10G   | 21303 / 11353 / -46.7%   | 283 / 180 / -36.4%
>> > With 64k mTHP:
>> > 24 / 512M / 5G    | 5104 / 4641 / -9.1%      | 376 / 358 / -4.8%
>> > 48 / 1G / 7G      | 8693 / 4662 / -46.4%     | 257 / 176 / -31.5%
>> > 96 / 2G / 10G     | 17056 / 10263 / -39.8%   | 234 / 169 / -27.8%
>>
>> How much is the swap in/out throughput before/after the change?
>
> This may not be too beneficial for typical throughput measurement:
> - For example, doing the same test with brd will only show a ~20%
>   performance improvement, still a big gain though. I think the
>   si->lock spinlock wasting CPU cycles may affect CPU-sensitive things
>   like ZRAM even more.

20% is good data. You don't need to guess. perf cycles profiling can
show the hot spot.

> - And simple benchmarks which just do multiple sequential swap
>   in/outs in multiple threads hardly stress the allocator.
>
> I haven't found a good benchmark to simulate random parallel IOs on
> swap yet; I can write one later.

I have used the anon-w-rand test case of vm-scalability to simulate
random parallel swap out.

https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-anon-w-rand

> A more close-to-real-world benchmark like the kernel build test, or
> mysql/sysbench, all showed great improvement.

Yes. Real workloads are good. We can use micro-benchmarks to find out
some performance limits, for example, the max possible throughput.

>>
>> When I worked on swap in/out performance before, the hot spot shifted
>> from swap-related code to the LRU lock and zone lock. Things may
>> change a lot now.
>>
>> If zram is used as the swap device, the hot spot may become
>> compression/decompression after solving the swap lock contention. To
>> stress the swap subsystem further, we may use a ram disk as swap.
>> Previously, we have used a simulated pmem device (backed by DRAM).
>> That can be set up as described in:
>>
>> https://pmem.io/blog/2016/02/how-to-emulate-persistent-memory/
>>
>> After creating the raw block device /dev/pmem0, we can do:
>>
>> $ mkswap /dev/pmem0
>> $ swapon /dev/pmem0
>>
>> Can you use something similar if necessary?
>
> I used to test with brd, as described above,

brd will allocate memory during running; pmem can avoid that. A perf
profile is your friend to root-cause possible issues.

> I think using ZRAM with
> tests simulating real workloads is more useful.

Yes. And, as I said before, micro-benchmarks have their own value.

> And I did include a Sequential SWAP test; the result looks OK (no
> regression, minor-to-no improvement).

Good. At least we have no regression here.

> I can try the pmem setup later; I guess the result will be similar to
> the brd test.
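
If it helps, a brd-backed swap setup is a minimal sketch along these
lines (the 16G size and device name are assumptions, adjust for your
machine; rd_size is in KiB):

$ modprobe brd rd_nr=1 rd_size=16777216   # one 16 GiB ram disk
$ mkswap /dev/ram0                        # use the ram disk as swap
$ swapon /dev/ram0

One caveat: brd allocates its backing pages lazily at run time, so a
warm-up round before measuring may help; pmem avoids that.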
>
>
>>
>> > With a more aggressive setup, it shows clearly that both the
>> > performance and fragmentation are better:
>> >
>> > time make -j96 / 768M memcg, 4K pages, 10G ZRAM, on Intel 8255C * 2:
>> > (avg of 4 test runs)
>> > Before:
>> > Sys time: 73578.30, Real time: 864.05
>> >
>> > time make -j96 / 1G memcg, 4K pages, 10G ZRAM:
>> > After: (-54.7% sys time, -49.3% real time)
>> > Sys time: 33314.76, Real time: 437.67
>> >
>> > time make -j96 / 1152M memcg, 64K mTHP, 10G ZRAM, on Intel 8255C * 2:
>> > (avg of 4 test runs)
>> > Before:
>> > Sys time: 74044.85, Real time: 846.51
>> > hugepages-64kB/stats/swpout: 1735216
>> > hugepages-64kB/stats/swpout_fallback: 430333
>> > After: (-51.4% sys time, -47.7% real time, -63.2% mTHP failure)
>> > Sys time: 35958.87, Real time: 442.69
>> > hugepages-64kB/stats/swpout: 1866267
>> > hugepages-64kB/stats/swpout_fallback: 158330
>> >
>> > There is an up to 54.7% improvement for the kernel build test, and a
>> > lower fragmentation rate. The performance improvement should be even
>> > larger for micro benchmarks.
>>
>> Very good result!
>>
>> > Build kernel with tinyconfig on tmpfs with HDD as swap:
>> > ---
>> >
>> > This test is similar to the above, but the HDD test is very noisy and
>> > slow and the deviation is huge, so tinyconfig is used instead, taking
>> > the median result of 3 test runs, which looks OK:
>> >
>> > Before this series:
>> > 114.44user 29.11system 39:42.90elapsed 6%CPU
>> > 2901232inputs+0outputs (238877major+4227640minor)pagefaults
>> >
>> > After this series:
>> > 113.90user 23.81system 38:11.77elapsed 6%CPU
>> > 2548728inputs+0outputs (235471major+4238110minor)pagefaults
>> >
>> > Single thread SWAP:
>> > ---
>> >
>> > Sequential swap should also be slightly faster, as a lot of
>> > unnecessary parts were removed. Tested using a micro benchmark
>> > swapping out/in 4G of zero-filled memory with ZRAM, 10 test runs:
>> >
>> > Swapout Before (avg. 3359304):
>> > 3353796 3358551 3371305 3356043 3367524 3355303 3355924 3354513 3360776
>> >
>> > Swapin Before (avg. 1928698):
>> > 1920283 1927183 1934105 1921373 1926562 1938261 1927726 1928636 1934155
>> >
>> > Swapout After (avg. 3347511, -0.4%):
>> > 3337863 3347948 3355235 3339081 3333134 3353006 3354917 3346055 3360359
>> >
>> > Swapin After (avg. 1922290, -0.3%):
>> > 1919101 1925743 1916810 1917007 1923930 1935152 1917403 1923549 1921913
>> >
>> > Worth noting that the patch "mm, swap: use a global swap cluster for
>> > non-rotation device" introduced minor overhead for certain tests (see
>> > the test results in its commit message), but the gain from later
>> > commits covered that; it can be further improved later.
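
As a side note, for anyone reproducing the memcg-limited build tests
above, one test cell could be driven roughly like this (a sketch
assuming cgroup v2 and a zram swap device; the cgroup name, sizes, and
job count are illustrative):

$ modprobe zram num_devices=1
$ echo 10G > /sys/block/zram0/disksize            # match the 10G ZRAM above
$ mkswap /dev/zram0
$ swapon /dev/zram0
$ mkdir /sys/fs/cgroup/build-test
$ echo 1G > /sys/fs/cgroup/build-test/memory.max  # the memcg limit under test
$ echo $$ > /sys/fs/cgroup/build-test/cgroup.procs
$ time make -j96                                  # inside the tmpfs kernel tree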
>> >
>> > Suggested-by: Chris Li
>> > Signed-off-by: Kairui Song
>> >
>> > Kairui Song (13):
>> >   mm, swap: minor clean up for swap entry allocation
>> >   mm, swap: fold swap_info_get_cont in the only caller
>> >   mm, swap: remove old allocation path for HDD
>> >   mm, swap: use cluster lock for HDD
>> >   mm, swap: clean up device availability check
>> >   mm, swap: clean up plist removal and adding
>> >   mm, swap: hold a reference of si during scan and clean up flags
>> >   mm, swap: use an enum to define all cluster flags and wrap flags
>> >     changes
>> >   mm, swap: reduce contention on device lock
>> >   mm, swap: simplify percpu cluster updating
>> >   mm, swap: introduce a helper for retrieving cluster from offset
>> >   mm, swap: use a global swap cluster for non-rotation device
>> >   mm, swap_slots: remove slot cache for freeing path
>> >
>> >  fs/btrfs/inode.c           |    1 -
>> >  fs/iomap/swapfile.c        |    1 -
>> >  include/linux/swap.h       |   36 +-
>> >  include/linux/swap_slots.h |    3 -
>> >  mm/page_io.c               |    1 -
>> >  mm/swap_slots.c            |   78 +--
>> >  mm/swapfile.c              | 1198 ++++++++++++++++-------------------
>> >  7 files changed, 558 insertions(+), 760 deletions(-)

--
Best Regards,
Huang, Ying