From: Yosry Ahmed <yosryahmed@google.com>
To: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org,  nphamcs@gmail.com, ryan.roberts@arm.com,
	ying.huang@intel.com,  21cnbao@gmail.com,
	akpm@linux-foundation.org, nanhai.zou@intel.com,
	 wajdi.k.feghali@intel.com, vinodh.gopal@intel.com
Subject: Re: [PATCH v5 0/3] mm: ZSWAP swap-out of mTHP folios
Date: Wed, 28 Aug 2024 15:37:28 -0700
Message-ID: <CAJD7tkZv3YnhxFo-rvHNB2_mro1+UuKOP69yXV8KmaeEz5F1mA@mail.gmail.com>
In-Reply-To: <20240828093516.30228-1-kanchana.p.sridhar@intel.com>

On Wed, Aug 28, 2024 at 2:35 AM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> Hi All,
>
> This patch-series enables zswap_store() to accept and store mTHP
> folios. The most significant contribution in this series is from the
> earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
> migrated to v6.11-rc3 in patch 2/3 of this series.
>
> [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
>      https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
>
> Additionally, there is an attempt to modularize some of the functionality
> in zswap_store(), to make it more amenable to supporting mTHPs of any
> order. For instance, zswap_store_entry() stores a zswap_entry in the
> xarray; likewise, zswap_delete_stored_offsets() can be used to delete all
> offsets corresponding to a higher-order folio stored in zswap.
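>
> As a rough illustration (a hedged sketch, not the actual patch; the
> helper signatures below are assumed from the description above), the
> store path for a large folio could be structured like:
>
>     /* Store each subpage of a (possibly large) folio at consecutive
>      * swap offsets, unwinding on failure. */
>     static bool zswap_store_folio_sketch(struct folio *folio,
>                                          swp_entry_t swp)
>     {
>             long i, nr = folio_nr_pages(folio);
>
>             for (i = 0; i < nr; i++) {
>                     /* Compress subpage i and publish its zswap_entry
>                      * in the xarray via the new helper. */
>                     if (!zswap_store_entry(swp, i))
>                             goto unwind;
>             }
>             return true;
>     unwind:
>             /* Delete any offsets already stored for this folio. */
>             zswap_delete_stored_offsets(swp, nr);
>             return false;
>     }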
>
> For accounting purposes, the patch-series adds per-order mTHP sysfs
> "zswpout" counters that get incremented upon a successful zswap_store()
> of an mTHP folio:
>
> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
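>
> For reference, the counter update in the swap-out path might look
> roughly like this (a hedged sketch; the MTHP_STAT_ZSWPOUT item name
> is an assumption, following the existing count_mthp_stat() convention):
>
>     /* After a successful zswap store of the whole folio: */
>     if (zswap_store(folio))
>             count_mthp_stat(folio_order(folio), MTHP_STAT_ZSWPOUT);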
>
> This patch-series is a precursor to ZSWAP compress batching of mTHP
> swap-out and decompress batching of swap-ins based on swapin_readahead(),
> using Intel IAA hardware acceleration. We would like to submit that work
> in a subsequent RFC patch-series, along with performance improvement data.
>
> Thanks to Ying Huang for pre-posting review feedback and suggestions!
>
> Changes since v4:
> =================
> 1) Published before/after data with zstd, as suggested by Nhat (Thanks
>    Nhat for the data reviews!).
> 2) Rebased to mm-unstable from 8/27/2024,
>    commit b659edec079c90012cf8d05624e312d1062b8b87.
> 3) Incorporated the change in memcontrol.h that defines obj_cgroup_get() if
>    CONFIG_MEMCG is not defined (see the sketch after this list), to resolve
>    build errors reported by the kernel robot; as per Nhat's and Michal's
>    suggestion not to require a separate patch to fix the build errors
>    (thanks both!).
> 4) Deleted all same-filled folio processing for mTHP in zswap_store(), as
>    suggested by Yosry (Thanks Yosry!).
> 5) Squashed into a single commit the commits that define the new mTHP
>    zswpout stat counters and invoke count_mthp_stat() after successful
>    zswap_store() calls. Thanks Yosry for this suggestion!
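>
> For reference, the !CONFIG_MEMCG stub mentioned in item 3 is presumably
> a no-op along these lines (hedged sketch, mirroring the other stubs in
> memcontrol.h):
>
>     static inline void obj_cgroup_get(struct obj_cgroup *objcg)
>     {
>     }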
>
> Changes since v3:
> =================
> 1) Rebased to mm-unstable commit 8c0b4f7b65fd1ca7af01267f491e815a40d77444.
>    Thanks to Barry for suggesting aligning with Ryan Roberts' latest
>    changes to count_mthp_stat() so that it's always defined, even when THP
>    is disabled. Barry, I have also made one other change in page_io.c
>    where count_mthp_stat() is called by count_swpout_vm_event(). I would
>    appreciate it if you can review this. Thanks!
>    Hopefully this resolves the kernel robot build errors.
>
> Changes since v2:
> =================
> 1) Gathered usemem data using SSD as the backing swap device for zswap,
>    as suggested by Ying Huang. Ying, I would appreciate it if you can
>    review the latest data. Thanks!
> 2) Generated the base commit info in the patches to attempt to address
>    the kernel test robot build errors.
> 3) No code changes to the individual patches themselves.
>
> Changes since RFC v1:
> =====================
>
> 1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
>    Thanks Barry!
> 2) Addressed some of the code review comments that Nhat Pham provided in
>    Ryan's initial RFC [1]:
>    - Added a comment about the cgroup zswap limit checks occurring once per
>      folio at the beginning of zswap_store().
>      Nhat, Ryan, please do let me know if the comments convey the summary
>      from the RFC discussion. Thanks!
>    - Posted data on running the cgroup suite's zswap kselftest.
> 3) Rebased to v6.11-rc3.
> 4) Gathered performance data with usemem and the rebased patch-series.
>
> Performance Testing:
> ====================
> This patch-series was tested against the v6.11-rc3 mainline, without and
> with these patches, on an Intel Sapphire Rapids server: dual-socket, 56
> cores per socket, 4 IAA devices per socket.
>
> The system has 503 GiB RAM, with 176 GiB of ZRAM (35% of available RAM) as
> the backing swap device for ZSWAP. zstd is configured as the ZRAM
> compressor. Core frequency was fixed at 2500 MHz.
>
> The vm-scalability "usemem" test was run in a cgroup whose memory.high
> was fixed at 40G. There is no swap limit set for the cgroup. Following a
> similar methodology as in Ryan Roberts' "Swap-out mTHP without splitting"
> series [2], 70 usemem processes were run, each allocating and writing 1G of
> memory:
>
>     usemem --init-time -w -O -n 70 1g
>
> The vm/sysfs mTHP stats included with the performance data provide details
> on the swapout activity to ZSWAP/swap.
>
> Other kernel configuration parameters:
>
>     ZSWAP Compressors : zstd, deflate-iaa
>     ZSWAP Allocator   : zsmalloc
>     SWAP page-cluster : 2
>
> In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
> IAA "compression verification" is enabled. Hence, each IAA compression
> is decompressed internally by the "iaa_crypto" driver, the CRCs returned
> by the hardware are compared, and errors are reported in case of a
> mismatch. Thus, "deflate-iaa" helps ensure better data integrity
> compared to the software compressors.
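>
> Conceptually, the verification amounts to the following (purely
> illustrative C; iaa_compress()/iaa_decompress() are placeholder names,
> not the actual iaa_crypto API, which compares the hardware-returned
> CRCs rather than the buffers):
>
>     /* Hypothetical compress-then-verify flow for one page. */
>     clen = iaa_compress(src, PAGE_SIZE, dst);
>     dlen = iaa_decompress(dst, clen, scratch);
>     if (dlen != PAGE_SIZE || memcmp(scratch, src, PAGE_SIZE))
>             return -EIO;    /* report the mismatch */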
>
> Throughput is derived by averaging the 70 individual processes' throughputs
> reported by usemem. sys time is measured with perf. All data points are
> averaged across 3 runs.
>
>  64KB mTHP (cgroup memory.high set to 40G):
>  ==========================================
>
>  ------------------------------------------------------------------------------
>                      v6.11-rc3 mainline              zswap-mTHP      Change wrt
>                                Baseline                                Baseline
>  ------------------------------------------------------------------------------
>  ZSWAP compressor       zstd   deflate-        zstd    deflate-   zstd deflate-
>                                     iaa                     iaa             iaa
>  ------------------------------------------------------------------------------
>  Throughput (KB/s)   161,496    156,343     140,363     151,938   -13%      -3%
>  sys time (sec)       771.68     802.08      954.85      735.47   -24%       8%
>  memcg_high          111,223    110,889     138,651     133,884
>  memcg_swap_high           0          0           0           0
>  memcg_swap_fail           0          0           0           0
>  pswpin                   16         16           0           0
>  pswpout           7,471,472  7,527,963           0           0
>  zswpin                  635        605         624         639
>  zswpout               1,509      1,478   9,453,761   9,385,910
>  thp_swpout                0          0           0           0
>  thp_swpout_fallback       0          0           0           0
>  pgmajfault            3,616      3,430       4,633       3,611
>  ZSWPOUT-64kB            n/a        n/a     590,768     586,521
>  SWPOUT-64kB         466,967    470,498           0           0
>  ------------------------------------------------------------------------------
>
>  2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
>  =======================================================
>
>  ------------------------------------------------------------------------------
>                       v6.11-rc3 mainline              zswap-mTHP     Change wrt
>                                 Baseline                               Baseline
>  ------------------------------------------------------------------------------
>  ZSWAP compressor       zstd    deflate-        zstd    deflate-  zstd deflate-
>                                      iaa                     iaa            iaa
>  ------------------------------------------------------------------------------
>  Throughput (KB/s)    192,164    194,643     165,005     174,536  -14%     -10%
>  sys time (sec)        823.55     830.42      801.72      676.65    3%      19%
>  memcg_high            16,054     15,936      14,951      16,096
>  memcg_swap_high            0          0           0           0
>  memcg_swap_fail            0          0           0           0
>  pswpin                     0          0           0           0
>  pswpout            8,629,248  8,628,907           0           0
>  zswpin                   560        645       5,333         781
>  zswpout                1,416      1,503   8,546,895   9,355,760
>  thp_swpout            16,854     16,853           0           0
>  thp_swpout_fallback        0          0           0           0
>  pgmajfault             3,341      3,574       8,139       3,582
>  ZSWPOUT-2048kB           n/a        n/a      16,684      18,270
>  SWPOUT-2048kB         16,854     16,853           0           0
>  ------------------------------------------------------------------------------
>
> In the "Before" scenario, when zswap does not store mTHP, only allocations
> count towards the cgroup memory limit. However, in the "After" scenario,
> with the introduction of zswap_store() mTHP, both, allocations as well as
> the zswap compressed pool usage from all 70 processes are counted towards
> the memory limit. As a result, we see higher swapout activity in the
> "After" data. Hence, more time is spent doing reclaim as the zswap cgroup
> charge leads to more frequent memory.high breaches.
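>
> For context, zswap charges each compressed object to the folio's memcg,
> roughly as in this hedged sketch of the existing charging step in
> zswap_store(); with mTHP stores, every compressed subpage therefore
> adds to the same memory.high accounting:
>
>     /* Charge the compressed size to the owning obj_cgroup. */
>     if (objcg)
>             obj_cgroup_charge_zswap(objcg, entry->length);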
>
> This causes degradation in throughput and sys time with zswap mTHP, more so
> with zstd than with deflate-iaa. Compression latency could play a part in
> this: when more swapout activity is happening, a slower compressor would
> cause allocations to stall for any/all of the 70 processes.
>
> In my opinion, even though the test setup does not provide an accurate
> way to make a direct before/after comparison (because zswap usage is
> charged to the cgroup and hence counts towards memory.high), it still
> seems reasonable for zswap_store() to support (m)THP, so that further
> performance improvements can be implemented.

Are you saying that in the "Before" data we end up skipping zswap
completely because of using mTHPs?

Does it make more sense to turn CONFIG_THP_SWAP off in the "Before" data
to force the mTHPs to be split and for the data to be stored in zswap?
This would be a fairer Before/After comparison where the memory
goes to zswap in both cases, but "Before" has to split because of
zswap's lack of support for mTHP. I assume most setups relying on
zswap will be turning CONFIG_THP_SWAP off today anyway, but maybe not.
Nhat, is this something you can share?

