From: "Huang, Ying" <ying.huang@intel.com>
To: Kanchana P Sridhar
Cc: , , , , , , <21cnbao@gmail.com>, , , ,
Subject: Re: [PATCH v2 0/4] mm: ZSWAP swap-out of mTHP folios
In-Reply-To: <20240816054805.5201-1-kanchana.p.sridhar@intel.com> (Kanchana P. Sridhar's message of "Thu, 15 Aug 2024 22:48:01 -0700")
References: <20240816054805.5201-1-kanchana.p.sridhar@intel.com>
Date: Fri, 16 Aug 2024 17:02:32 +0800
Message-ID: <87ttfkj0wn.fsf@yhuang6-desk2.ccr.corp.intel.com>

Kanchana P Sridhar writes:

> Hi All,
>
> This patch-series enables zswap_store() to accept and store mTHP
> folios. The most significant contribution in this series is from the
> earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
> migrated to v6.11-rc3 in patch 2/4 of this series.
>
> [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
>      https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
>
> Additionally, there is an attempt to modularize some of the functionality
> in zswap_store(), to make it more amenable to supporting any-order
> mTHPs.
>
> For instance, the determination of whether a folio is same-filled is
> based on mapping an index into the folio to derive the page.
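
As a side note on the index-based check just described, here is a minimal
illustrative sketch. The helper name folio_page_same_filled() and its
exact logic are assumptions for illustration, not the patch's code;
kmap_local_folio() and kunmap_local() are existing kernel helpers
(linux/highmem.h):

/*
 * Illustrative sketch only: check whether the page at @index within
 * @folio is filled with a single repeating word, so that it could be
 * stored as a same-filled zswap entry instead of being compressed.
 */
static bool folio_page_same_filled(struct folio *folio, long index,
                                   unsigned long *value)
{
        unsigned long *data = kmap_local_folio(folio, index * PAGE_SIZE);
        unsigned long val = data[0];
        bool same = true;
        unsigned int pos;

        for (pos = 1; pos < PAGE_SIZE / sizeof(*data); pos++) {
                if (data[pos] != val) {
                        same = false;
                        break;
                }
        }
        kunmap_local(data);

        if (same)
                *value = val;
        return same;
}
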
> Likewise, there is a function "zswap_store_entry" added to store a
> zswap_entry in the xarray.
>
> For accounting purposes, the patch-series adds per-order mTHP sysfs
> "zswpout" counters that get incremented upon successful zswap_store of
> an mTHP folio:
>
> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
>
> This patch-series is a precursor to ZSWAP compress batching of mTHP
> swap-out and decompress batching of swap-ins based on swapin_readahead(),
> using Intel IAA hardware acceleration, which we would like to submit in
> subsequent RFC patch-series, with performance improvement data.
>
> Thanks to Ying Huang for pre-posting review feedback and suggestions!
>
> Changes since RFC v1:
> =====================
>
> 1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
>    Thanks Barry!
> 2) Addressed some of the code review comments that Nhat Pham provided in
>    Ryan's initial RFC [1]:
>    - Added a comment about the cgroup zswap limit checks occurring once
>      per folio at the beginning of zswap_store().
>      Nhat, Ryan, please do let me know if the comments convey the summary
>      from the RFC discussion. Thanks!
>    - Posted data on running the cgroup suite's zswap kselftest.
> 3) Rebased to v6.11-rc3.
> 4) Gathered performance data with usemem and the rebased patch-series.
>
> Performance Testing:
> ====================
> Testing of this patch-series was done with the v6.11-rc3 mainline, without
> and with this patch-series, on an Intel Sapphire Rapids server,
> dual-socket, 56 cores per socket, 4 IAA devices per socket.
>
> The system has 503 GiB RAM, 176 GiB swap/ZSWAP with ZRAM as the backing
> swap device. Core frequency was fixed at 2500 MHz.

I don't think that this is a reasonable test configuration; there is no
benefit to using ZSWAP with ZRAM as its backing device. We should use a
normal SSD as the backing swap device.

> The vm-scalability "usemem" test was run in a cgroup whose memory.high
> was fixed at 40G. Following a similar methodology as in Ryan Roberts'
> "Swap-out mTHP without splitting" series [2], 70 usemem processes were
> run, each allocating and writing 1G of memory:
>
> usemem --init-time -w -O -n 70 1g
>
> Other kernel configuration parameters:
>
> ZSWAP Compressor  : LZ4, DEFLATE-IAA
> ZSWAP Allocator   : ZSMALLOC
> ZRAM Compressor   : LZO-RLE
> SWAP page-cluster : 2
>
> In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
> IAA "compression verification" is enabled. Hence each IAA compression
> will be decompressed internally by the "iaa_crypto" driver, the CRCs
> returned by the hardware will be compared, and errors reported in case
> of mismatches. Thus "deflate-iaa" helps ensure better data integrity as
> compared to the software compressors.
>
> Throughput reported by usemem and perf sys time for running the test
> are as follows:
>
> 64KB mTHP:
> ==========
>  ------------------------------------------------------------------
> |                    |                   |            |            |
> |Kernel              | mTHP SWAP-OUT     | Throughput | Improvement|
> |                    |                   | KB/s       |            |
> |--------------------|-------------------|------------|------------|
> |v6.11-rc3 mainline  | ZRAM lzo-rle      |  118,928   | Baseline   |
> |zswap-mTHP-Store    | ZSWAP lz4         |   82,665   | -30%       |

Because the test configuration isn't reasonable, the performance drop it
shows isn't meaningful either. We should compare zswap+SSD without mTHP
zswap against zswap+SSD with mTHP zswap. I think that there should be a
performance improvement in that comparison.

> |zswap-mTHP-Store    | ZSWAP deflate-iaa |  176,210   | 48%        |
> |------------------------------------------------------------------|
> |                    |                   |            |            |
> |Kernel              | mTHP SWAP-OUT     | Sys time   | Improvement|
> |                    |                   | sec        |            |
> |--------------------|-------------------|------------|------------|
> |v6.11-rc3 mainline  | ZRAM lzo-rle      | 1,032.20   | Baseline   |
> |zswap-mTHP-Store    | ZSWAP lz4         | 1,854.51   | -80%       |
> |zswap-mTHP-Store    | ZSWAP deflate-iaa |   582.71   | 44%        |
>  ------------------------------------------------------------------
>
>  -----------------------------------------------------------------------
> | VMSTATS, mTHP ZSWAP stats,   | v6.11-rc3 | zswap-mTHP | zswap-mTHP  |
> | mTHP ZRAM stats:             | mainline  | Store      | Store       |
> |                              |           | lz4        | deflate-iaa |
> |-----------------------------------------------------------------------|
> | pswpin                       | 16        | 0          | 0           |
> | pswpout                      | 7,770,720 | 0          | 0           |
> | zswpin                       | 547       | 695        | 579         |
> | zswpout                      | 1,394     | 15,462,778 | 7,284,554   |
> |-----------------------------------------------------------------------|
> | thp_swpout                   | 0         | 0          | 0           |
> | thp_swpout_fallback          | 0         | 0          | 0           |
> | pgmajfault                   | 3,786     | 3,541      | 3,367       |
> |-----------------------------------------------------------------------|
> | hugepages-64kB/stats/zswpout |           | 966,328    | 455,196     |
> |-----------------------------------------------------------------------|
> | hugepages-64kB/stats/swpout  | 485,670   | 0          | 0           |
>  -----------------------------------------------------------------------
>
>
> 2MB PMD-THP/2048K mTHP:
> =======================
>  ------------------------------------------------------------------
> |                    |                   |            |            |
> |Kernel              | mTHP SWAP-OUT     | Throughput | Improvement|
> |                    |                   | KB/s       |            |
> |--------------------|-------------------|------------|------------|
> |v6.11-rc3 mainline  | ZRAM lzo-rle      |  177,340   | Baseline   |
> |zswap-mTHP-Store    | ZSWAP lz4         |   84,030   | -53%       |
> |zswap-mTHP-Store    | ZSWAP deflate-iaa |  185,691   | 5%         |
> |------------------------------------------------------------------|
> |                    |                   |            |            |
> |Kernel              | mTHP SWAP-OUT     | Sys time   | Improvement|
> |                    |                   | sec        |            |
> |--------------------|-------------------|------------|------------|
> |v6.11-rc3 mainline  | ZRAM lzo-rle      |   876.29   | Baseline   |
> |zswap-mTHP-Store    | ZSWAP lz4         | 1,740.55   | -99%       |
> |zswap-mTHP-Store    | ZSWAP deflate-iaa |   650.33   | 26%        |
>  ------------------------------------------------------------------
>
>  -------------------------------------------------------------------------
> | VMSTATS, mTHP ZSWAP stats,     | v6.11-rc3 | zswap-mTHP | zswap-mTHP  |
> | mTHP ZRAM stats:               | mainline  | Store      | Store       |
> |                                |           | lz4        | deflate-iaa |
> |-------------------------------------------------------------------------|
> | pswpin                         | 0         | 0          | 0           |
> | pswpout                        | 8,628,224 | 0          | 0           |
> | zswpin                         | 678       | 22,733     | 1,641       |
> | zswpout                        | 1,481     | 14,828,597 | 9,404,937   |
> |-------------------------------------------------------------------------|
> | thp_swpout                     | 16,852    | 0          | 0           |
> | thp_swpout_fallback            | 0         | 0          | 0           |
> | pgmajfault                     | 3,467     | 25,550     | 4,800       |
> |-------------------------------------------------------------------------|
> | hugepages-2048kB/stats/zswpout |           | 28,924     | 18,366      |
> |-------------------------------------------------------------------------|
> | hugepages-2048kB/stats/swpout  | 16,852    | 0          | 0           |
>  -------------------------------------------------------------------------
>
> As expected, in the "Before" experiment, there are relatively fewer
> swapouts because ZRAM utilization is not accounted in the cgroup.
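
To make the per-order counters shown above concrete, here is a minimal
sketch of how a successful zswap store of a large folio could be counted
in the swap-out path. It assumes the MTHP_STAT_ZSWPOUT stat item that
patches 3/4 and 4/4 introduce; count_mthp_stat(), folio_order() and
zswap_store() are existing interfaces, and the exact hook placement is
this sketch's assumption, not necessarily the series' final code:

/*
 * Sketch only, not the patch itself: in the swap-out path, after
 * zswap_store() accepts the (possibly large) folio, credit the
 * per-order counter.  The value then appears under
 * /sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/stats/zswpout.
 * MTHP_STAT_ZSWPOUT is assumed to be added to enum mthp_stat_item.
 */
if (zswap_store(folio)) {
        count_mthp_stat(folio_order(folio), MTHP_STAT_ZSWPOUT);
        folio_unlock(folio);
        return 0;
}
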
>
> With the introduction of zswap_store mTHP, the "After" data reflects the
> higher swapout activity, and consequent throughput/sys time degradation
> when LZ4 is used as the zswap compressor. However, we observe considerable
> throughput and sys time improvement in the "After" data when DEFLATE-IAA
> is the zswap compressor. This observation holds for 64K mTHP and 2MB THP
> experiments. IAA's higher compression ratio and better compression latency
> lead to fewer swap-outs and major page-faults, which result in better
> throughput and sys time.
>
> Our goal is to improve ZSWAP mTHP store performance using batching. With
> Intel IAA compress/decompress batching used in ZSWAP (to be submitted as
> an additional RFC series), we are able to demonstrate significant
> performance improvements and memory savings with IAA as compared to
> software compressors.
>
> cgroup zswap kselftest:
> =======================
>
> "Before":
> =========
> Test run with v6.11-rc3 and no code changes:
> mTHP 64K set to 'always'
> zswap compressor set to 'lz4'
> page-cluster = 3
>
> zswap shrinker_enabled = N:
> ---------------------------
> ok 1 test_zswap_usage
> ok 2 test_swapin_nozswap
> # at least 24MB should be brought back from zswap
> not ok 3 test_zswapin
> # zswpwb_after is 0 while wb is enabled
> not ok 4 test_zswap_writeback_enabled
> # Failed to reclaim all of the requested memory
> not ok 5 test_zswap_writeback_disabled
> ok 6 # SKIP test_no_kmem_bypass
> ok 7 test_no_invasive_cgroup_shrink
>
> zswap shrinker_enabled = Y:
> ---------------------------
> ok 1 test_zswap_usage
> ok 2 test_swapin_nozswap
> # at least 24MB should be brought back from zswap
> not ok 3 test_zswapin
> # zswpwb_after is 0 while wb is enabled
> not ok 4 test_zswap_writeback_enabled
> # Failed to reclaim all of the requested memory
> not ok 5 test_zswap_writeback_disabled
> ok 6 # SKIP test_no_kmem_bypass
> not ok 7 test_no_invasive_cgroup_shrink
>
> "After":
> ========
> Test run with this patch-series and v6.11-rc3:
> mTHP 64K set to 'always'
> zswap compressor set to 'deflate-iaa'
> page-cluster = 3
>
> zswap shrinker_enabled = N:
> ---------------------------
> ok 1 test_zswap_usage
> ok 2 test_swapin_nozswap
> ok 3 test_zswapin
> ok 4 test_zswap_writeback_enabled
> ok 5 test_zswap_writeback_disabled
> ok 6 # SKIP test_no_kmem_bypass
> ok 7 test_no_invasive_cgroup_shrink
>
> zswap shrinker_enabled = Y:
> ---------------------------
> ok 1 test_zswap_usage
> ok 2 test_swapin_nozswap
> # at least 24MB should be brought back from zswap
> not ok 3 test_zswapin
> ok 4 test_zswap_writeback_enabled
> ok 5 test_zswap_writeback_disabled
> ok 6 # SKIP test_no_kmem_bypass
> not ok 7 test_no_invasive_cgroup_shrink
>
> I haven't taken an in-depth look into the cgroup zswap tests, but it
> looks like the results with the patch-series are no worse than without,
> and in some cases better (I am not exactly sure why; this needs more
> analysis).
>
> I would greatly appreciate your code review comments and suggestions!
>
> Thanks,
> Kanchana
>
> [2] https://lore.kernel.org/linux-mm/20240408183946.2991168-1-ryan.roberts@arm.com/
>
>
> Kanchana P Sridhar (4):
>   mm: zswap: zswap_is_folio_same_filled() takes an index in the folio.
>   mm: zswap: zswap_store() extended to handle mTHP folios.
>   mm: Add MTHP_STAT_ZSWPOUT to sysfs per-order mthp stats.
>   mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP stats.
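
For readers skimming the series, a very rough sketch of the shape the
mTHP store path in patch 2/4 could take, based only on the cover letter's
description above. All helper names here (zswap_check_limits(),
zswap_store_page(), zswap_delete_stored_entries()) are placeholders, not
the series' actual functions; folio_nr_pages() is an existing helper:

bool zswap_store(struct folio *folio)
{
        long nr_pages = folio_nr_pages(folio);
        long index;

        /*
         * Placeholder: cgroup/zswap limit checks done once for the
         * whole folio, as the cover letter describes.
         */
        if (!zswap_check_limits(folio))
                return false;

        /*
         * Placeholder per-page helper: compress the page at @index,
         * allocate its zswap_entry, and insert it into the xarray
         * (e.g. via a zswap_store_entry()-style function).
         */
        for (index = 0; index < nr_pages; index++) {
                if (!zswap_store_page(folio, index))
                        goto cleanup;
        }
        return true;

cleanup:
        /* Drop any entries already stored for this folio. */
        zswap_delete_stored_entries(folio, index);
        return false;
}
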
>
>  include/linux/huge_mm.h |   1 +
>  mm/huge_memory.c        |   2 +
>  mm/page_io.c            |   7 ++
>  mm/zswap.c              | 238 +++++++++++++++++++++++++++++-----------
>  4 files changed, 184 insertions(+), 64 deletions(-)

--
Best Regards,
Huang, Ying