From: "Huang, Ying"
To: Kairui Song
Cc: linux-mm@kvack.org, Kairui Song, Chris Li, Minchan Kim, Barry Song,
    Ryan Roberts, Yu Zhao, SeongJae Park, David Hildenbrand, Yosry Ahmed,
    Johannes Weiner, Matthew Wilcox, Nhat Pham, Chengming Zhou,
    Andrew Morton, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 00/10] mm/swap: always use swap cache for synchronization
In-Reply-To: <20240326185032.72159-1-ryncsn@gmail.com> (Kairui Song's message of "Wed, 27 Mar 2024 02:50:22 +0800")
References: <20240326185032.72159-1-ryncsn@gmail.com>
Date: Wed, 27 Mar 2024 10:52:26 +0800
Message-ID: <878r24o07p.fsf@yhuang6-desk2.ccr.corp.intel.com>

Hi, Kairui,

Kairui Song writes:

> From: Kairui Song
>
> A month ago a bug was fixed for SWP_SYNCHRONOUS_IO swapin (swap cache
> bypass swapin):
> https://lore.kernel.org/linux-mm/20240219082040.7495-1-ryncsn@gmail.com/
>
> Because we have to spin on the swap map on race, and the swap map is too
> small to contain more usable info, an ugly
> schedule_timeout_uninterruptible(1) was added. It's not the first time a
> hackish workaround was added for cache bypass swapin, and it won't be the
> last. I did many experiments locally to see if the swap cache bypass path
> can be dropped while keeping the performance comparable. And it seems
> doable.
>

In general, I think that it's a good idea to unify cache bypass swapin
and normal swapin.  But I haven't dived into the implementation yet.

> This series does the following things:
> 1. Remove swap cache bypass completely.
> 2. Apply multiple optimizations after that. These optimizations are
>    either impossible or very difficult to do without dropping the cache
>    bypass swapin path.
> 3. Use swap cache as a synchronization layer, and unify some code
>    with page cache (filemap).
>
> As a result, we have:
> 1. Comparable performance; some tests are even faster.
> 2. Multi-index support for swap cache.
> 3. Many hackish workarounds removed, and the long-tail issue above
>    is gone.
>
> Sending this as an RFC to collect discussion, suggestions, or rejection
> early. This seems to need to be split into multiple series, but the
> performance is not good until the last patch, so I think starting by
> separating them may make this approach not very convincing. And there
> are still some (maybe further) TODO items and optimization space
> if we are OK with this approach.
>
> This is based on another series of mine, for reusing filemap code for
> swapcache:
> [PATCH v2 0/4] mm/filemap: optimize folio adding and splitting
> https://lore.kernel.org/linux-mm/20240325171405.99971-1-ryncsn@gmail.com/
>
> Patch 1/10 introduces a helper from the filemap side to be used later.
> Patches 2/10 and 3/10 are cleanups that prepare for removing the swap
> cache bypass swapin path.
> Patch 4/10 removes the swap cache bypass swapin path, and the
> performance drops heavily (-28%).
> Patch 5/10 applies the first optimization after the removal: since all
> folios go through the swap cache now, there is no need for explicit
> shadow clearing any more.
> Patch 6/10 applies another optimization after cleaning up the shadow
> clearing routines. Now the swapcache is much like the page cache, so
> just reuse the page cache code and we will have multi-index support.
> Shadow memory usage dropped a lot.
> Patch 7/10 just renames __read_swap_cache_async. It will be refactored
> and become a key part of this series, and the naming is very confusing
> to me.
> Patch 8/10 makes the swap cache a synchronization layer and introduces
> two helpers for adding folios to the swap cache; the caller will either
> succeed or get a folio to wait on.
> Patch 9/10 applies another optimization. With the above two helpers,
> swapcache lookup can be optimized to avoid false lookups, which
> helped improve the performance.
> Patch 10/10 applies a major optimization for SWP_SYNCHRONOUS_IO devices.
> After this commit, performance for simple swapin/swapout is basically
> the same as before.
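
The "either succeed or get a folio to wait on" contract described for
patch 8/10 can be pictured with a small userspace model.  This is only a
rough sketch under stated assumptions: `ToySwapCache`, `add_or_get`,
`Folio` and `swapin` are hypothetical names made up for illustration, not
the helpers the series introduces, and a `threading.Lock` plus a
`threading.Event` merely stand in for the xarray lock and the folio lock.

```python
import threading


class Folio:
    """Toy stand-in for a folio; `ready` models the folio lock being released."""

    def __init__(self, entry):
        self.entry = entry
        self.ready = threading.Event()
        self.data = None


class ToySwapCache:
    """Hypothetical model: one slot per swap entry, guarded by a single lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._slots = {}

    def add_or_get(self, entry):
        """Return (folio, True) if we inserted a new folio and now own the
        swapin, or (existing_folio, False) if another caller got there first."""
        with self._lock:
            existing = self._slots.get(entry)
            if existing is not None:
                return existing, False
            folio = Folio(entry)
            self._slots[entry] = folio
            return folio, True


def swapin(cache, entry, read_backing_store):
    """Swap in `entry`; exactly one caller per entry performs the read."""
    folio, owner = cache.add_or_get(entry)
    if owner:
        folio.data = read_backing_store(entry)  # only the winner does the I/O
        folio.ready.set()                       # wake up any waiters
    else:
        folio.ready.wait()                      # sleep on the folio, no spinning
    return folio.data
```

In this model a racing caller never has to poll a counter with a timeout:
it either owns the swapin or has a concrete object to sleep on, which is
the property the cover letter relies on to drop the
schedule_timeout_uninterruptible(1) workaround.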
>
> Test 1, sequential swapin/out of 30G zero page on ZRAM:
>
>                  Before (us)    After (us)
> Swapout:         33619409       33886008
> Swapin:          32393771       32465441 (- 0.2%)
> Swapout (THP):   7817909        6899938  (+11.8%)
> Swapin (THP):    32452387       33193479 (- 2.2%)

If my understanding is correct, we don't have swapin (THP) support yet.
Right?

> And after swapping out 30G with THP, the radix node usage dropped by a
> lot:
>
> Before: radix_tree_node 73728K
> After:  radix_tree_node 7056K (-94%)

Good!

> Test 2:
> Mysql (16g buffer pool, 32G ZRAM SWAP, 4G memcg, Zswap disabled, THP never)
>   sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-user=root \
>   --mysql-password=1234 --mysql-db=sb --tables=36 --table-size=2000000 \
>   --threads=48 --time=300 --report-interval=10 run
>
>   Before: transactions: 4849.25 per sec
>   After:  transactions: 4849.40 per sec
>
> Test 3:
> Mysql (16g buffer pool, NVME SWAP, 4G memcg, Zswap enabled, THP never)
>   echo never > /sys/kernel/mm/transparent_hugepage/enabled
>   echo 100 > /sys/module/zswap/parameters/max_pool_percent
>   echo 1 > /sys/module/zswap/parameters/enabled
>   echo y > /sys/module/zswap/parameters/shrinker_enabled
>
>   sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-user=root \
>   --mysql-password=1234 --mysql-db=sb --tables=36 --table-size=2000000 \
>   --threads=48 --time=600 --report-interval=10 run
>
>   Before: transactions: 1662.90 per sec
>   After:  transactions: 1726.52 per sec

3.8% improvement.  Good!

> Test 4:
> Mysql (16g buffer pool, NVME SWAP, 4G memcg, Zswap enabled, THP always)
>   echo always > /sys/kernel/mm/transparent_hugepage/enabled
>   echo 100 > /sys/module/zswap/parameters/max_pool_percent
>   echo 1 > /sys/module/zswap/parameters/enabled
>   echo y > /sys/module/zswap/parameters/shrinker_enabled
>
>   sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-user=root \
>   --mysql-password=1234 --mysql-db=sb --tables=36 --table-size=2000000 \
>   --threads=48 --time=600 --report-interval=10 run
>
>   Before: transactions: 2860.90 per sec.
>   After:  transactions: 2802.55 per sec.
>
> Test 5:
> Memtier / memcached (16G brd SWAP, 8G memcg, THP never):
>
>   memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 -t 16 -B binary &
>
>   memtier_benchmark -S /tmp/memcached.socket \
>   -P memcache_binary -n allkeys --key-minimum=1 \
>   --key-maximum=24000000 --key-pattern=P:P -c 1 -t 16 \
>   --ratio 1:0 --pipeline 8 -d 1000
>
>   Before: 106730.31 Ops/sec
>   After:  106360.11 Ops/sec
>
> Test 6:
> Memtier / memcached (16G brd SWAP, 8G memcg, THP always):
>
>   memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 -t 16 -B binary &
>
>   memtier_benchmark -S /tmp/memcached.socket \
>   -P memcache_binary -n allkeys --key-minimum=1 \
>   --key-maximum=24000000 --key-pattern=P:P -c 1 -t 16 \
>   --ratio 1:0 --pipeline 8 -d 1000
>
>   Before: 83193.11 Ops/sec
>   After:  82504.89 Ops/sec
>
> These tests were run under heavy memory stress, and the performance
> seems basically the same as before, very slightly better/worse for
> certain cases. The benefits of multi-index are basically erased by
> fragmentation, and workingset nodes usage is slightly lower.
>
> Some (maybe further) TODO items if we are OK with this approach:
>
> - I see a slight performance regression for THP tests,
>   could identify a clear hotspot with perf, my guess is the
>   contention on the xa_lock is an issue (we have an xa_lock for
>   every 64M of swap cache space), and THP handling needs to take the
>   lock longer than usual. Splitting the xa_lock to be more
>   fine-grained seems a good solution. We have
>   SWAP_ADDRESS_SPACE_SHIFT = 14, which is not an optimal value.
>   Considering XA_CHUNK_SHIFT is 6, we will have three layers of XArray
>   just for 2 extra bits. 12 should be better, to always make use of
>   the whole XA chunk and have two layers at most. But duplicated
>   address_space structs also waste more memory and cachelines.
>   I see an observable performance drop (~3%) after changing
>   SWAP_ADDRESS_SPACE_SHIFT to 12.
>   It might be a good idea to
>   decouple the swap cache xarray from address_space (there are
>   too many users of swapcache, so it shouldn't get too dirty).
>
> - Actually, after Patch 4/10 the performance is much better for
>   tests limited by memory cgroup, until 10/10 applies the direct swap
>   cache freeing logic for SWP_SYNCHRONOUS_IO swapin. Because if the swap
>   device is not nearly full, swapin doesn't clear up the swapcache, so
>   repeated swapout doesn't need to re-allocate a swap entry, which makes
>   things faster. This may indicate that lazy freeing of swap cache could
>   benefit certain workloads and may be worth looking into later.
>
> - Now SWP_SYNCHRONOUS_IO swapin will bypass readahead and force drop
>   swap cache after swapin is done, which can be cleaned up and optimized
>   further after this patch. Device type will only determine the
>   readahead logic, and the swap cache drop check can be based purely on
>   swap count.
>
> - The recent mTHP swapin/swapout series should have no fundamental
>   conflict with this.
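
The arithmetic behind the SWAP_ADDRESS_SPACE_SHIFT remark above can be
checked with a few lines.  This is just a back-of-the-envelope sketch: the
function names are made up for illustration, 4 KiB pages are an assumption
(it matches the "64M per xa_lock" figure quoted above), and the 32 GiB
device size is simply borrowed from Test 2.

```python
XA_CHUNK_SHIFT = 6   # from the mail: 64 slots per XArray node
PAGE_SHIFT = 12      # assumption: 4 KiB pages


def xarray_levels(index_bits, chunk_shift=XA_CHUNK_SHIFT):
    """Maximum number of node levels needed to index `index_bits` bits."""
    return -(-index_bits // chunk_shift)  # ceiling division


def span_mib(space_shift, page_shift=PAGE_SHIFT):
    """Swap space covered by one address_space (and one xa_lock), in MiB."""
    return (1 << (space_shift + page_shift)) >> 20


def address_spaces(swap_gib, space_shift, page_shift=PAGE_SHIFT):
    """Number of address_space structs needed for `swap_gib` GiB of swap."""
    pages = (swap_gib << 30) >> page_shift
    return -(-pages // (1 << space_shift))


for shift in (14, 12):
    print(shift, xarray_levels(shift), span_mib(shift), address_spaces(32, shift))
```

With a shift of 14 this gives three levels and 64 MiB per lock; with 12 it
gives two levels and 16 MiB per lock, at the cost of four times as many
address_space structs (2048 instead of 512 for 32 GiB of swap), which is
the memory/cacheline trade-off mentioned in the TODO item.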
>
> Kairui Song (10):
>   mm/filemap: split filemap storing logic into a standalone helper
>   mm/swap: move no readahead swapin code to a stand-alone helper
>   mm/swap: convert swapin_readahead to return a folio
>   mm/swap: remove cache bypass swapin
>   mm/swap: clean shadow only in unmap path
>   mm/swap: switch to use multi index entries
>   mm/swap: rename __read_swap_cache_async to swap_cache_alloc_or_get
>   mm/swap: use swap cache as a synchronization layer
>   mm/swap: delay the swap cache look up for swapin
>   mm/swap: optimize synchronous swapin
>
>  include/linux/swapops.h |   5 +-
>  mm/filemap.c            | 161 +++++++++-----
>  mm/huge_memory.c        |  78 +++----
>  mm/internal.h           |   2 +
>  mm/memory.c             | 133 ++++-------
>  mm/shmem.c              |  44 ++--
>  mm/swap.h               |  71 ++++--
>  mm/swap_state.c         | 478 +++++++++++++++++++++-------------------
>  mm/swapfile.c           |  64 +++---
>  mm/vmscan.c             |   8 +-
>  mm/workingset.c         |   2 +-
>  mm/zswap.c              |   4 +-
>  12 files changed, 540 insertions(+), 510 deletions(-)

--
Best Regards,
Huang, Ying